Abstract-This paper discusses parallelization of elliptic curve cryptography hardware accelerators using elliptic curves over binary fields $GF(2^m)$. Elliptic curve point multiplication, the operation used in every elliptic curve cryptosystem, is hierarchical in nature, and parallelism can be exploited on different hierarchy levels, as shown in many publications. However, a comprehensive analysis of the effects of parallelization has not been previously presented. This paper provides tools for evaluating the use of parallelism and shows where it should be used in order to maximize efficiency. Special attention is given to a family of curves called Koblitz curves because they offer very efficient point multiplication. A new method in which the latency of point multiplication is reduced with parallel field arithmetic processors is introduced. It is shown to outperform the previously presented multiple-field-multiplier techniques in the cases of Koblitz curves and generic curves with fixed base points. A highly efficient general elliptic curve cryptography processor architecture is presented and analyzed. Based on this architecture and the analysis of the effects of parallelization, a few designs are implemented on an Altera Stratix II field-programmable gate array (FPGA).
I. INTRODUCTION
T HE USE OF elliptic curves in public-key cryptography was independently proposed by Koblitz [1] and Miller [2] in 1985 and, since then, an enormous amount of work has been done on elliptic curve cryptography (ECC). The attractiveness of using elliptic curves arises from the fact that similar level of security can be achieved with considerably shorter keys than in methods based on the difficulties of solving discrete logarithms over integers or integer factorizations.
Public-key cryptography is computationally intensive, and hardware acceleration is frequently required in practical applications. Thus, many publications have considered hardware acceleration of ECC. Some application-specific integrated circuit (ASIC) implementations have been published, such as [3]-[6], but the majority of designs, including [7]-[23], have been implemented on field-programmable gate arrays (FPGAs). A comprehensive survey of hardware acceleration of ECC is given in [24].
The research on hardware acceleration has concentrated on efficient implementation of elliptic curve point multiplication, the fundamental operation of all elliptic curve cryptosystems. Point multiplication is computed with point operations which, in turn, are computed using finite field arithmetic. The sequential nature of point multiplication makes efficient use of parallelization challenging. However, although point multiplication itself is hard to parallelize, parallelism can be used efficiently on the lower hierarchy levels, namely, in point operations [9], [11], [16] and in field arithmetic [14], [15], [25], [26]. Many published articles use parallel computing both in point operations, e.g., with multiple field multipliers, and in field arithmetic operations, e.g., with digit-serial multipliers, without any analysis of their efficiency. This paper provides tools for evaluating the use of parallelism and points out where parallelism should be used in order to maximize efficiency.
Koblitz curves [27] are a family of curves on which point multiplication is considerably faster than on generic curves. Thus, Koblitz curves are included in many standards, e.g., [28], [29]. Despite their efficiency, only a few publications on hardware implementation have considered Koblitz curves. To the authors' knowledge, they have been discussed only in [12], [17], [19], [20]. Koblitz curves were shown to be fast and easy to implement in software in [30]. It is shown in this paper that point multiplication on Koblitz curves can be computed very efficiently also in hardware. In addition to faster point multiplication, Koblitz curves also provide interesting possibilities for further use of parallelism compared to generic curves, as will be shown in this paper.
The main contributions of this work include the following (in order of appearance):
• a highly efficient general ECC processor architecture is described for FPGAs (see Section IV);
• an analysis of existing parallelization techniques is presented (see Section V);
• a fair comparison between existing techniques is given, which is possible because the different techniques are evaluated on the same architecture (see Section V);
• a method for reducing latency by using parallel processors is presented and analyzed (see Section VI);
• very efficient high-speed FPGA-based implementations are described (see Section VII).
The emphasis of this work is on studying the effects of parallelization on performance, area, and their tradeoff in high-speed accelerators. Aspects such as side-channel attacks are not considered in order to keep the work focused.
The remainder of this paper is organized as follows. Section II presents preliminaries of ECC. Parallelization of point multiplication is discussed and previous work on the subject is reviewed in Section III. Section IV introduces the processor architecture that is used in the analysis. Parallelization techniques are described and their effects on the architecture are discussed in Section V. A new method for reducing latency with parallel field arithmetic processors is suggested and analyzed in Section VI. Finally, results on an Altera Stratix II FPGA are presented in Section VII and conclusions are drawn in Section VIII.
II. ELLIPTIC CURVE CRYPTOGRAPHY (ECC)
This paper considers elliptic curves defined over finite binary fields $GF(2^m)$. The curves are so-called ordinary curves (see [31], for example), i.e., they are defined as

$E: y^2 + xy = x^3 + ax^2 + b$   (1)

with $a, b \in GF(2^m)$ so that $b \neq 0$. Let $E(GF(2^m))$ denote the set of all points on $E$. A pair $(x, y)$, where $x, y \in GF(2^m)$, is a point in $E(GF(2^m))$ if it satisfies (1). The point at infinity, denoted as $\mathcal{O}$, is also a point in $E(GF(2^m))$. A binary field $GF(2^m)$ with polynomial basis (PB) is constructed by representing field elements as polynomials of degree at most $m - 1$. Addition is performed as an addition of polynomials modulo 2, i.e., a bitwise exclusive-or (XOR), and multiplication is performed modulo an irreducible polynomial. In normal basis (NB), the elements are represented with a basis of the form $\{\beta, \beta^2, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$ with the property that $\beta^{2^m} = \beta$. Obviously, squaring is then a cyclic shift of the bit vector and thus very inexpensive, but multiplications cost more than in PB. Inversion of $a \in GF(2^m)$, i.e., computing $a^{-1}$ such that $a a^{-1} = 1$, is the most expensive field operation regardless of the basis; see, e.g., [32]. A method based on Fermat's Little Theorem suggested by Itoh and Tsujii in [33] is used in this paper.
Every elliptic curve cryptosystem requires computation of an operation called elliptic curve point multiplication. Given a point $P \in E(GF(2^m))$, called the base point, and an integer $k$, point multiplication is defined as

$Q = kP = \underbrace{P + P + \cdots + P}_{k \text{ times}}$   (2)

where $Q$ is called the result point. Point addition, $P_1 + P_2$ with $P_1 \neq \pm P_2$, and point doubling, $2P_1$, are the basic operations used in computing (2).
The binary method (or double-and-add method) is probably the most common way to compute (2). In the binary method, $k$ is represented with the binary expansion $k = \sum_{i=0}^{\ell-1} k_i 2^i$, where $k_i \in \{0, 1\}$, and a point doubling is performed for each bit $k_i$ and a point addition whenever $k_i = 1$. Thus, because approximately half of the bits are ones on average, $\ell$ point doublings and $\ell/2$ point additions are required on average.
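A minimal sketch of the binary method, with `point_add`, `point_double`, and the identity element passed in as stand-ins for the curve arithmetic (the names are illustrative, not from the original):

```python
def binary_point_multiplication(k, P, point_add, point_double, infinity):
    """Double-and-add sketch: computes Q = kP for k >= 1.

    point_add/point_double are placeholders for the curve-level
    operations; 'infinity' is the point at infinity O acting as the
    neutral element (point_double(O) is assumed to return O).
    """
    Q = infinity
    for bit in bin(k)[2:]:          # scan bits of k, most significant first
        Q = point_double(Q)         # one doubling per bit
        if bit == "1":
            Q = point_add(Q, P)     # one addition per nonzero bit
    return Q
```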
Traditionally, points are represented by using two coordinates as $(x, y)$. This representation is henceforth referred to as affine coordinates, or $\mathcal{A}$ for short. If points are represented in $\mathcal{A}$, point addition and point doubling both require one inversion in $GF(2^m)$. Because inversions are expensive, it is commonly preferred to represent points by using three coordinates as $(X, Y, Z)$ because then the number of inversions in the computation of (2) can be reduced to one. The coordinate systems considered in this paper are projective and López-Dahab coordinates, $\mathcal{P}$ and $\mathcal{LD}$ for short. A point $(X, Y, Z)$ in $\mathcal{P}$ and $\mathcal{LD}$ represents the affine point $(X/Z, Y/Z)$ and $(X/Z, Y/Z^2)$, respectively. The $\mathcal{A} \to \mathcal{P}$ and $\mathcal{A} \to \mathcal{LD}$ mappings are simply $(x, y) \mapsto (x, y, 1)$ and do not require any field operations.
The costs of addition, squaring, multiplication, and inversion in $GF(2^m)$ are denoted as $A$, $S$, $M$, and $I$, respectively. Point operation costs vary depending on the coordinate system; they are summarized in Table I. The number of point operations can be further reduced with windowing methods, but then certain points need to be precomputed (see [32], for example).
An efficient method for computing (2) was presented in [34] by López and Dahab. The method is called the Montgomery point multiplication, and it computes (2) with only the $x$-coordinate (i.e., $X$ and $Z$); the $y$-coordinate is recovered in the end. Both a point doubling and a point addition are computed for every bit $k_i$, but they are very efficient to compute because only the $x$-coordinate is considered. The combined recovery of the $y$-coordinate and mapping to $\mathcal{A}$ requires certain additional field operations, as shown in Table I.
Curves for which $a \in \{0, 1\}$ and $b = 1$ in (1) are called Koblitz curves [27]. They have special attractiveness among elliptic curves because point doublings can be replaced by the efficiently computable Frobenius endomorphism [27]. Let $E_a$ be a Koblitz curve. The Frobenius map $\phi$ is defined by

$\phi(x, y) = (x^2, y^2)$.   (3)

Frobenius maps cost only squarings, i.e., $2S$ or $3S$ depending on the coordinate system. Notice that squaring is cheap; actually, squaring in $GF(2^m)$ with NB is only a cyclic shift of the bit vector. Thus, the cost of (2) is only the point additions¹ with the binary method. Before the fast Frobenius maps can be utilized, the integer $k$ needs to be converted into a $\tau$-adic expansion $k = \sum_i t_i \tau^i$, where $t_i \in \{0, \pm 1\}$. Algorithms for converting integers into $\tau$-adic non-adjacent form ($\tau$NAF) were presented in [38].
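The following sketch generates a $\tau$NAF with the standard digit-by-digit division by $\tau$ (cf. the algorithms in [38]); it relies only on the relation $\tau^2 = \mu\tau - 2$ with $\mu = (-1)^{1-a}$:

```python
def tnaf(k, a):
    """tau-adic NAF of integer k on the Koblitz curve E_a (a in {0, 1}).

    Standard generation: repeatedly divide r0 + r1*tau by tau, using
    tau^2 = mu*tau - 2 with mu = (-1)^(1 - a).  Returns digits t_i in
    {-1, 0, 1}, least significant first.
    """
    mu = 1 if a == 1 else -1
    r0, r1 = k, 0
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 % 2 == 1:                    # r0 odd: emit a nonzero digit
            u = 2 - ((r0 - 2 * r1) % 4)    # u in {-1, +1}
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # divide (r0 + r1*tau) by tau; r0 is even here, so this is exact
        r0, r1 = r1 + mu * (r0 // 2), -(r0 // 2)
    return digits
```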
III. PARALLELISM IN POINT MULTIPLICATION
As presented in [9], for example, point multiplication decomposes into three hierarchical levels, as shown in Fig. 1(a). Similar hierarchical levels can be found in the ways in which parallelism is used inside an accelerator. The hierarchy of parallelism is depicted in Fig. 1(b). This hierarchy is studied next from the bottom to the top.

Fig. 1. Hierarchical levels of (a) computation of elliptic curve point multiplication [9] and (b) parallelization of hardware implementations. Decisions made in (a) define which techniques can be used on the hierarchical levels of (b). For example, the selection of the basis, i.e., PB or NB, on the lowest level of (a) defines which multiplier architectures can be used on the lowest level of (b), etc.

¹One multiplication is saved if $a \in \{0, 1\}$ and one addition is saved if $a = 0$.
(Table I footnote: recovery of the $y$-coordinate and the $\mathcal{P} \mapsto \mathcal{A}$ mapping.)
A. Parallelism in Field Arithmetic Blocks
Parallelism in field arithmetic blocks, mostly in multipliers, has been studied in numerous publications. Multiplication is the operation with the most crucial effect on the performance and area of an accelerator. Work on parallelization of field multipliers includes, e.g., [14], [15], [25], [26] for PB and [39] for NB. A bit-serial multiplier computes one bit of the output per clock cycle with a single processing block, resulting in a latency of $m$ cycles. In bit-parallel multipliers, all bits of the output are computed in one cycle. A digit-serial multiplier is a tradeoff where $d$ bits of the output are computed in parallel, thus resulting in a latency of $\lceil m/d \rceil$ cycles.
B. Parallelism in Field Arithmetic Processors
Parallelism in point operations is also an efficient way to reduce the latency of point multiplication, as shown, e.g., in [9], [11], [16], [22], and [40]. Certain field operations can be computed in parallel depending on the coordinate system. Parallelism that can be utilized in the Montgomery point multiplication [34] and in point addition in mixed coordinates [37] is considered next. Only multiplications are examined, as adding parallel adders or squarers has only a negligible effect on performance. Let $\mu$ denote the number of parallel field multipliers.
The Montgomery point multiplication [34] is used for generic curves because it is the most efficient method that does not involve precomputations [31]. Point addition and doubling together require six multiplications, and parallel multipliers can be utilized efficiently, as shown in [11], [22]. With $\mu = 2$, the critical path reduces to three multiplications and, with $\mu = 4$, to only two multiplications [11], [22]. Three or more than four multipliers do not give any further improvements.
For Koblitz curves, point additions are computed with the mixed coordinate point addition presented in [37]. The formulae for computing $(X_3, Y_3, Z_3) = (X_1, Y_1, Z_1) + (x_2, y_2)$, i.e., $\mathcal{LD} + \mathcal{A} \to \mathcal{LD}$, are as follows [37]:

$A = y_2 Z_1^2 + Y_1$, $B = x_2 Z_1 + X_1$, $C = Z_1 B$, $D = B^2 (C + a Z_1^2)$, $Z_3 = C^2$, $E = A C$, $X_3 = A^2 + D + E$, $F = X_3 + x_2 Z_3$, $G = (x_2 + y_2) Z_3^2$, $Y_3 = (E + Z_3) F + G$.   (4)
The data dependency graph of (4) is presented in Fig. 2, which shows that (4) requires eight multiplications, but the critical path can be reduced to five or four multiplications with $\mu = 2$ or $\mu = 3$, respectively. Data dependencies prevent further reductions with more than three multipliers.
C. Parallel Processors
Parallel processors can be used for increasing the throughput of an accelerator simply by computing several point multiplications in parallel. Optimizations in parallel processor cases were recently studied by the authors in [17]. Using parallel processors for reducing computation latency is hard because of the sequential nature of point multiplication and, at least to the authors' knowledge, the method discussed in Section VI is the first such method presented in the literature.
IV. ARCHITECTURE OF THE ACCELERATOR
This section presents an accelerator for elliptic curve point multiplication, which is used for studying the effects of parallelization. The accelerator comprises a field arithmetic processor (FAP), FAP control logic, and interface logic. A converter is also required for Koblitz curves. The architecture itself is generic, but the implementations are optimized for Altera Stratix II FPGAs [41].
The Montgomery point multiplication is used for generic curves, and the binary method, where point additions are computed in mixed coordinates ($\mathcal{LD} + \mathcal{A} \to \mathcal{LD}$), is used for Koblitz curves. These methods were selected because they are the fastest methods that do not require precomputations [31].
For simplicity, the discussion in the remainder of this paper is restricted to the smallest field size specified in the NIST recommended elliptic curves for federal government use [28], $GF(2^{163})$; namely, the curves NIST B-163 and NIST K-163 are used. This does not sacrifice the generality of the architecture, the methods, or the analysis. Again, to keep the discussion clear and simple, only NB is considered, similarly as in [11], for example. The methods and analysis tools are valid for PB, too, but the results would, of course, be different.
A. Field Arithmetic Processor
The FAP consists of an adder, a squarer, multiplier(s), storage RAM, and an instruction decoder. A block diagram is presented in Fig. 3.
1) Adder and Squarer:
The adder computes an $m$-bit bitwise XOR in one clock cycle. The squarer is a shifter which can compute successive squarings $a^{2^e}$, where $a \in GF(2^m)$ and $e \geq 1$. The computation requires one clock cycle regardless of $e$.
2) Multiplier:
Multiplication is critical for the overall performance. Multiplication in NB is computed with a Massey-Omura multiplier [39]. One bit of the result $c = ab$, where $a, b, c \in GF(2^m)$, is computed from $a$ and $b$ by using a logic function called the F-function. Formulae for constructing the F-function are publicly available in the appendices of [28], for example. The F-function is field specific, and the same function is used for all output bits as follows: $c_i = F(a \lll i, b \lll i)$, where $\lll i$ denotes cyclic left shift by $i$ bits. Hence, a bit-serial implementation of the Massey-Omura multiplier requires three $m$-bit shift registers and one F-function block. A bit-parallel multiplier requires $m$ F-function blocks and an $m$-bit register for storing the result [28], [39].
In practice, the bit-serial multiplier requiring at least $m$ clock cycles is too slow, and the bit-parallel multiplier requires too much area. A good tradeoff is a digit-serial multiplier, where $d$ bits are computed in parallel with $d$ F-function blocks. The F-function blocks can be pipelined in order to increase the maximum clock frequency. The latency of a digit-serial multiplier is

$L_{\mathrm{mult}} = \lceil m/d \rceil + p$   (5)

where $p$ is the number of pipeline stages, i.e., $p \geq 1$; the value of $p$ is fixed in this paper. One clock cycle is also required for loading the operands into the shift registers.
An FAP can include several multipliers; the number of multipliers per FAP is denoted by $\mu$.
3) Others: The storage RAM is used for storing elements of $GF(2^m)$. It is implemented as a dual-port RAM by using embedded memory, e.g., M4K blocks in Stratix II [41]. The storage RAM stores up to 256 elements. When Stratix II is used, a depth of 256 is a logical choice because, while in true dual-port mode, the widest mode that an M4K block can be configured to is 256 × 18 bits. The width of the storage RAM was selected to be 163 bits in order to minimize writing and reading delays. In memory-constrained environments, narrower bus widths could be used in order to reduce memory requirements at the expense of longer delays. The storage RAM requires $\lceil 163/18 \rceil = 10$ M4Ks, resulting in a storage capacity of 256 × 163 bits. This much storage space is rarely needed, but it can be used, for example, for storing precomputed points. Furthermore, selecting a smaller depth would not reduce the number of M4Ks. Both writing and reading require one clock cycle. However, the dual-port RAM can be configured into the read-during-write mode [41], which saves certain clock cycles; see Section IV-B.
The instruction decoder simply decodes instructions to signals controlling the FAP blocks.
B. Control Logic
The FAP control logic consists of a finite-state machine (FSM) and a ROM containing instruction sequences.
The instruction sequences are carefully hand-optimized in order to minimize latencies of point operations. As mentioned in Section IV-A3, the read-during-write mode can be used for reducing latencies. Operations are ordered so that the result of the previous operation is used as the operand of the next operation whenever possible. One clock cycle is saved every time this can be used, because the operands of the next operation can be read simultaneously with the writing of the result of the previous operation.
Inversions are computed with successive multiplications and squarings as suggested by Itoh and Tsujii in [33]. An Itoh-Tsujii inversion has the constant cost of

$I = (\lfloor \log_2(m - 1) \rfloor + H(m - 1) - 1)M + (m - 1)S$   (6)

where $H(\cdot)$ denotes the Hamming weight, which results in $I = 9M + 162S$ when $m = 163$ [33]. Although the number of squarings is high, the successive squaring feature of the squarer (see Section IV-A1) ensures that the cost remains reasonable.
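A runnable sketch of the Itoh-Tsujii chain for $m = 163$. The paper's implementation works in NB, where the squarings are mere cyclic shifts; here, self-contained polynomial-basis arithmetic with the NIST B-163/K-163 reduction polynomial $x^{163} + x^7 + x^6 + x^3 + 1$ is used instead so that the example runs as-is:

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # x^163+x^7+x^6+x^3+1

def gf_mul(a, b):
    """Carry-less multiplication modulo F (polynomial basis)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= F
    return r

def gf_sqr(a):
    return gf_mul(a, a)

def itoh_tsujii_inverse(a):
    """a^-1 = (a^(2^(m-1) - 1))^2 via the Itoh-Tsujii addition chain.

    Scans the bits of m-1 from the most significant downwards while
    maintaining t = a^(2^k - 1); uses floor(log2(m-1)) + H(m-1) - 1
    multiplications and m-1 squarings (9 M + 162 S for m = 163).
    """
    t, k = a, 1                        # invariant: t = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:         # bits of m-1 after the leading one
        u = t
        for _ in range(k):             # u = t^(2^k)
            u = gf_sqr(u)
        t = gf_mul(u, t)               # t = a^(2^(2k) - 1)
        k *= 2
        if bit == "1":
            t = gf_mul(gf_sqr(t), a)   # t = a^(2^(k+1) - 1)
            k += 1
    return gf_sqr(t)                   # a^(2^m - 2) = a^(-1)

a = (1 << 150) | 0b1011                # arbitrary nonzero element
assert gf_mul(a, itoh_tsujii_inverse(a)) == 1
```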
As mentioned in Sections III and IV-A2, the latencies of point operations can be reduced with parallel multipliers, i.e., $\mu > 1$. Multiplications in the Itoh-Tsujii inversion cannot be computed in parallel, but the Montgomery point addition and doubling as well as point addition in mixed coordinates benefit from parallel multipliers, as shown in Section III. Instruction sequences were optimized separately for each $\mu$. Table II lists the latencies of the instruction sequences used in this paper and presents the latencies of computing (2) with different setups.
C. Top-Level and Clocking
When (2) is computed on a Koblitz curve, a converter is required for converting $k$ into a $\tau$-adic expansion. The converter presented by the authors in [42] is used in the designs; its average conversion latency is reported in [42]. Because the converter has a lower maximum clock frequency than the rest of the circuitry, it is separated into its own clock domain, as shown in Fig. 4. The accelerator has an interface clock; the FAP and its control logic operate with their own clock, and the converter operates with a third clock. The clock domains are separated by first-in, first-out (FIFO) buffers implemented in embedded memory, i.e., in M512 and M4K blocks in Stratix II.

Table III presents the areas of the different blocks in the architecture, which are used in Section V for analyzing the effects of parallelization. The areas are averages of the values received from synthesis for a Stratix II FPGA because the exact values varied slightly after place&route. The areas are given as the number of occupied adaptive logic modules (ALMs). It is assumed that the total area depends linearly on the number of blocks. Hence, an estimate for the area of an implementation on Stratix II is given by

$A \approx n_{\mathrm{FAP}} A_{\mathrm{FAP}} + n_{\mathrm{conv}} A_{\mathrm{conv}} + n_{\mathrm{mult}} A_{\mathrm{mult}} + n_F A_F$   (7)

where $n_{\mathrm{FAP}}$, $n_{\mathrm{conv}}$, $n_{\mathrm{mult}}$, and $n_F$ are the numbers of parallel FAPs, converters, field multipliers, and F-function blocks, respectively, and the areas $A_{\mathrm{FAP}}$, $A_{\mathrm{conv}}$, $A_{\mathrm{mult}}$, and $A_F$ are as in Table III.
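A one-line sketch of the linear area model (7); the per-block ALM counts come from Table III and are therefore passed in as parameters here (the example values are placeholders, not the table's values):

```python
def estimate_area_alms(n_fap, n_conv, n_mult, n_f,
                       a_fap, a_conv, a_mult, a_f):
    """Linear area estimate (7) in ALMs.

    n_*: numbers of FAPs, converters, field multipliers, and F-function
    blocks; a_*: per-block areas from Table III (not reproduced here).
    """
    return n_fap * a_fap + n_conv * a_conv + n_mult * a_mult + n_f * a_f

# Placeholder example: one FAP, one converter, one digit-serial
# multiplier with 16 F-function blocks.
area = estimate_area_alms(1, 1, 1, 16,
                          a_fap=700, a_conv=800, a_mult=200, a_f=100)
```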
V. PARALLELISM IN ELLIPTIC CURVE ACCELERATORS
This section discusses the effects of parallelization in different parts of the ECC processor presented in Section IV. It is assumed in the following analysis that the complexity of the design does not affect the quality of the place&route results, i.e., the area or timing. Thus, the area of an accelerator is assumed to be given by (7). The clock frequency is assumed to be constant for all implementations because the same F-function block determines the critical path; see Section IV-A2. These assumptions are necessary in order to provide an analytic approach to parallelization. However, in reality, the more difficult the place&route becomes, the more area it usually consumes in order to meet the given timing constraints, and the more probable it becomes that these constraints are not met at all. Hence, the estimates given in the analysis are probably too optimistic for the most complex designs. The inaccuracies caused by the assumptions are analyzed in Section VII.
First, metrics for evaluating designs are defined. Let $\mathbf{p}$ denote the parameters which define the degrees of parallelism used in the accelerator, i.e., $\mathbf{p} = (n_{\mathrm{FAP}}, n_{\mathrm{conv}}, n_{\mathrm{mult}}, d)$. Performance is rated by three metrics, namely, latency, point multiplication time, and throughput. Latency is the average number of clock cycles required to compute a point multiplication. Point multiplication time, hereafter referred to as pm-time, is the average time in seconds required by a point multiplication.
Throughput is the maximum number of point multiplications computed in a given time frame on average. Throughput is measured in operations per second (ops).
Let $L(\mathbf{p})$, $t(\mathbf{p})$, and $\Theta(\mathbf{p})$ denote latency, pm-time, and throughput with parallelism parameters $\mathbf{p}$, respectively. Parallelism may also have an impact on the maximum clock frequency and, therefore, the frequency $f(\mathbf{p})$ must be considered as well. Latency, pm-time, and throughput are related through the following formulae:

$t(\mathbf{p}) = \frac{L(\mathbf{p})}{f(\mathbf{p})}$   (8)

$\Theta(\mathbf{p}) = \frac{u}{t(\mathbf{p})} = \frac{u f(\mathbf{p})}{L(\mathbf{p})}$   (9)

where $u$ is the maximum number of point multiplications that can be computed in parallel. Because in this section each FAP computes a single point multiplication at a time, $u = n_{\mathrm{FAP}}$. Let $A(\mathbf{p})$ denote the area of the accelerator with parallelism parameters $\mathbf{p}$; in the following analysis, $A(\mathbf{p})$ is given by (7). Important evaluation metrics are the latency-area, time-area, and throughput-area ratios defined as

$R_L(\mathbf{p}) = \frac{1}{L(\mathbf{p}) A(\mathbf{p})}, \quad R_t(\mathbf{p}) = \frac{1}{t(\mathbf{p}) A(\mathbf{p})}, \quad R_\Theta(\mathbf{p}) = \frac{\Theta(\mathbf{p})}{A(\mathbf{p})}$.   (10)-(12)
The higher the ratios are, the better the implementation can be considered by that metric. Notice that, if the accelerator computes only one point multiplication at a time, $R_t(\mathbf{p}) = R_\Theta(\mathbf{p})$. However, if an accelerator is capable of computing several point multiplications simultaneously, i.e., $u > 1$, then $R_\Theta(\mathbf{p}) = u R_t(\mathbf{p})$. Two designs with parallelism parameters $\mathbf{p}_1$ and $\mathbf{p}_2$ can be compared with speedup ratios as follows:

$S_L = \frac{L(\mathbf{p}_2)}{L(\mathbf{p}_1)}, \quad S_t = \frac{t(\mathbf{p}_2)}{t(\mathbf{p}_1)}, \quad S_\Theta = \frac{\Theta(\mathbf{p}_1)}{\Theta(\mathbf{p}_2)}$   (13)-(15)

for latencies, pm-times, and throughputs, respectively. All ratios describe how much faster $\mathbf{p}_1$ is compared to $\mathbf{p}_2$.
A. Parallelism in Field Multipliers
This section studies parallelization of the digit-serial Massey-Omura multiplier; see Section IV-A2. The free parallelism parameter is the number of F-function blocks, i.e., the digit size $d$.
Because the F-function blocks are the same for all bits of the result, the critical path of the multiplier is constant regardless of $d$. Thus, it is assumed that the maximum clock frequency does not depend on $d$, and a normalized frequency can be used in the analysis. It suffices to consider only the latency and area of the multiplier. The area of the multiplier consists of the area of the $d$ F-function blocks and the constant area of the shift registers. Thus, latency and area are given by

$L_{\mathrm{mult}}(d) = \lceil m/d \rceil + p$   (16)

$A_{\mathrm{mult}}(d) = d A_F + A_{\mathrm{sr}}$   (17)

where $A_F$ and $A_{\mathrm{sr}}$ are as given in Table III. Because of the round-up in (16), only certain values of $d$ are feasible: $\lceil m/d \rceil$ takes the same value for several consecutive $d$, so the smallest such $d$ should always be used, i.e., $d$ should be chosen from the set $\{\lceil m/i \rceil : 1 \leq i \leq m\}$.
B. Parallelism in FAPs
This section studies parallelism in FAPs. Multipliers dominate the performance and area cost. Because parallel adders and squarers do not give any major performance benefits, the analysis is restricted to the number of multipliers, $\mu$, and it is assumed that there is only one adder and one squarer. The area of an FAP, obtained from (7), is given by

$A_{\mathrm{FAP}}(\mu, d) = \mu (d A_F + A_{\mathrm{sr}}) + A_0$   (18)

where $A_0$ includes the adder, squarer, and control logic. Two questions are studied: first, which setup gives the best latency-area ratio and, second, how to determine whether one should use one fast multiplier or multiple slower ones. The first question is relevant when throughput is being maximized with parallel FAPs, because then one should use the FAPs that give the best area efficiency. The second question is important when one targets a certain latency and wants to achieve it with minimal area.

Fig. 6(a) and (b) plot the latency-area ratios $R_L(\mu, d)$ for generic and Koblitz curves, respectively. The best $R_L$ is obtained with the same setup for both generic and Koblitz curves. The similarity is not surprising considering that the ratio of multiplications to other operations is almost the same in both cases (see Table II), and the same digit size also receives a high ratio in the analysis of Section V-A (see Fig. 5). For Koblitz curves, this setup has a significantly higher $R_L$ than the alternatives, whereas for generic curves two setups receive a high $R_L$ before the ratio reduces considerably.

Fig. 6. Latency-area ratios $R_L(\mu, d)$ of the FAPs. In (a), the combined point addition and doubling [11] of the Montgomery point multiplication [34] is used and, in (b), the mixed coordinate point addition algorithm [37] is used.

Fig. 7. Latency-area plots of the FAPs. In (a), $L(\mu, d)$ is the latency of the combined point addition and doubling [34] and, in (b), it is the latency of the mixed coordinate point addition [37].

When an implementation targets low latency, the smaller number of multiplications on the critical path offered by parallel multipliers seems attractive. However, it is not obvious which is the more efficient solution: several slow parallel multipliers or one fast multiplier with a large $d$. This question is studied in Fig. 7(a) and (b). As shown in Fig. 7(a), parallel multipliers offer a large benefit on generic curves: with loose latency constraints, one multiplier should be used; if lower latency is needed, one should select two multipliers; and if even they are too slow, one should switch to four multipliers. Fig. 7(b) shows that, for Koblitz curves, one should use one multiplier up to the point where the largest feasible digit size is reached. If even lower latency is needed, one should use either two or three parallel multipliers. However, it will be shown in Section VI that even lower latency can be achieved with smaller area by using parallel FAPs with $\mu = 1$ and, therefore, one multiplier is the only feasible solution for Koblitz curves.
C. Parallel FAPs
Let $n_{\mathrm{FAP}}$ be the number of parallel processors, implemented as presented in Section IV, each of which computes a different point multiplication independently of the others, i.e., $u = n_{\mathrm{FAP}}$. One FAP has latency $L$ and throughput $f/L$. When parallel FAPs compute different point multiplications simultaneously, the average latency remains the same, but the throughput increases linearly, i.e., $\Theta = n_{\mathrm{FAP}} f / L$. In order to maximize the throughput-area ratio of a multi-FAP design, one should replicate FAPs with the maximum $R_t$, i.e., based on the analysis of Section V-B, one should use FAPs with one multiplier and the digit size giving the best latency-area ratio. When (2) is computed on Koblitz curves, a converter is required, as discussed in Section IV-C, and it must be considered in throughput calculations. Instead of attaching a converter to each FAP, it is preferable to let one converter serve several FAPs because the conversion time is much shorter than the point multiplication time. However, it should be guaranteed that the converter(s) do not become a bottleneck, i.e., the number of converters, $n_{\mathrm{conv}}$, must satisfy

$n_{\mathrm{conv}} \frac{f_{\mathrm{conv}}}{L_{\mathrm{conv}}} \geq n_{\mathrm{FAP}} \frac{f_{\mathrm{FAP}}}{L_{\mathrm{FAP}}}$   (19)

where $L_{\mathrm{FAP}}$, $f_{\mathrm{FAP}}$, $L_{\mathrm{conv}}$, and $f_{\mathrm{conv}}$ denote the latencies and clock frequencies of the FAP(s) and the converter(s), respectively.
VI. REDUCING LATENCY WITH PARALLEL PROCESSORS
This section presents how the latency of point multiplication can be reduced with parallel FAPs. In other words, several parallel FAPs compute a single point multiplication, i.e., $u = 1$ but $n_{\mathrm{FAP}} > 1$. It is assumed that the FAPs can exchange data with each other.
In [43], Okeya et al. presented a method for reducing the memory requirements of windowing methods on Koblitz curves by exploiting the inexpensiveness of the Frobenius maps. The same feature can be exploited for reducing the computation latency, as will be shown in this section. The method is not restricted to Koblitz curves, but the base point needs to be fixed before the method operates efficiently on a generic curve because of the precomputations involved. The new method can be combined with other techniques, such as parallel multipliers or windowing methods.
Obviously, (2) can be expressed in the following manner by using the binary expansion of $k$:

$kP = \sum_{i=0}^{\ell-1} k_i 2^i P.$   (20)

Assume that $n_{\mathrm{FAP}}$ parallel FAPs are available. Then (20) can be split into words as

$kP = \sum_{j=0}^{n_{\mathrm{FAP}}-1} \left( \sum_{i=b_j}^{b_{j+1}-1} k_i 2^{i-b_j} \right) 2^{b_j} P = \sum_{j=0}^{n_{\mathrm{FAP}}-1} k^{(j)} P_j$   (21), (22)

where $0 = b_0 < b_1 < \cdots < b_{n_{\mathrm{FAP}}} = \ell$ and $P_j = 2^{b_j} P$. The number of point doublings in each FAP has now been reduced to approximately $\ell / n_{\mathrm{FAP}}$. In order to minimize the number of Montgomery point additions and doublings on the critical path, one should choose the split points $b_j$ which minimize the length of the longest word $k^{(j)}$. The problem, however, is that one would require a priori information about $k$ in order to precompute the base points $P_j$. As this information is not available in practice, one must split $k$ into words by using fixed values, i.e., the $b_j$ and, hence, the $P_j$ are fixed. This method is used in Section VI-A1.
On Koblitz curves, zeros in the $\tau$-adic expansion lose their significance because Frobenius maps are almost free. Thus, it is assumed that the complexity is defined solely by the number of nonzeros in the expansion, and one should find the split which minimizes the maximum number of nonzeros processed in any FAP. The precomputations are also almost free and can be computed on-the-fly. Thus, base points do not need to be fixed.
An algorithm for computing (2) by using parallel FAPs is shown in Fig. 8. Derivation of the words $k^{(j)}$ from $k$ is referred to as splitting. It is performed with one of the two splitting algorithms discussed in Section VI-A. Both splitting algorithms return the words $k^{(j)}$ and the exponents for computing the base points $P_j = 2^{b_j} P$ or $P_j = \tau^{b_j} P$. The parallel computations can be performed independently of each other by using any point multiplication method, e.g., windowing methods can be used. Combining the parallel results, i.e., $Q = \sum_j Q_j$, requires $n_{\mathrm{FAP}} - 1$ point additions, but the critical path consists of only $\lceil \log_2 n_{\mathrm{FAP}} \rceil$ point additions because parallelism can be utilized.
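A sketch of the overall scheme of Fig. 8 with `point_mul` and `point_add` as stand-ins for the FAP operations: the words are processed independently, and the partial results are combined with a logarithmic-depth addition tree:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_point_multiplication(words, base_points, point_mul, point_add):
    """Each "FAP" computes Q_j = k_j * P_j independently; the partial
    results are then combined pairwise, giving ceil(log2 n) point
    additions on the critical path.  point_mul(word, base) and
    point_add(a, b) are placeholders for the accelerator's operations.
    """
    with ThreadPoolExecutor(max_workers=len(words)) as pool:
        partials = list(pool.map(point_mul, words, base_points))
    while len(partials) > 1:               # combine with a binary tree
        combined = [point_add(a, b)
                    for a, b in zip(partials[0::2], partials[1::2])]
        if len(partials) % 2:
            combined.append(partials[-1])  # odd one out passes through
        partials = combined
    return partials[0]
```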
A. Splitting Algorithms
In order to achieve the best possible performance with parallel FAPs, the computational load must be divided among the FAPs as evenly as possible. The problem, however, is that the computational cost depends on $k$. Moreover, the way in which $k$ determines the computational cost depends on various parameters, such as the curve, the coordinate system, etc.

Because of the aforementioned reasons, finding a splitting algorithm resulting in the optimal splitting every time proved to be a difficult task. Thus, two different splitting algorithms are suggested, both having advantages and disadvantages. The splitting algorithms considered in the following are called the fixed window and cyclic splitting algorithms.
1) Fixed Window:
The integer $k$ is split into $n_{\mathrm{FAP}}$ words by using predefined windows with a size of $W$. A logical choice for $W$ is $\lceil \ell / n_{\mathrm{FAP}} \rceil$. Now, $k$ is split so that $k^{(0)}$ consists of the $W$ least significant bits (LSBs) of $k$, $k^{(1)}$ contains the next $W$ bits, etc. The base points can be precomputed because the window sizes are fixed, and they are given by

$P_j = 2^{jW} P$ or $P_j = \tau^{jW} P$   (23)

for generic and Koblitz curves, respectively. The longest precomputation requires $(n_{\mathrm{FAP}} - 1)W$ point doublings or Frobenius maps.
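A sketch of the fixed window splitting, assuming the expansion of $k$ is given as a list of digits, least significant first:

```python
from math import ceil

def fixed_window_split(k_digits, n_fap):
    """Fixed window splitting: cut the digit string of k into n_fap
    consecutive words of width W = ceil(len(k)/n_fap).  Word j must be
    applied to the precomputed base point P_j = 2^(j*W) P, or
    tau^(j*W) P on Koblitz curves; see (23).
    """
    w = ceil(len(k_digits) / n_fap)
    words = [k_digits[j * w:(j + 1) * w] for j in range(n_fap)]
    exponents = [j * w for j in range(n_fap)]  # doubling/Frobenius counts
    return words, exponents
```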
2) Cyclic: Starting either from the LSB or the MSB of the expansion, the nonzero digits of $k$ are distributed cyclically so that the first nonzero is processed in the first FAP, the second nonzero in the second FAP, etc.; the $(n_{\mathrm{FAP}} + 1)$th nonzero is again processed in the first FAP, the $(n_{\mathrm{FAP}} + 2)$th in the second FAP, etc. Each digit of $k$ results in either a zero or a nonzero digit in a word and, therefore, the length of the longest $k^{(j)}$ is $\ell$. The base points are simply given by

$P_j = P$ for all $j$   (24)

and there are no precomputations. The cyclic splitting algorithm always results in the minimum number of nonzeros in any FAP, i.e., $\lceil h / n_{\mathrm{FAP}} \rceil$, where $h$ is the number of nonzeros in the expansion of $k$.
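A sketch of the cyclic splitting under the same digit-list convention; note that it also works for $\tau$-adic digits in $\{-1, 0, 1\}$:

```python
def cyclic_split(k_digits, n_fap):
    """Cyclic splitting: nonzero digits of k are dealt to the FAPs in
    round-robin order; zeros stay zero everywhere.  Every word uses the
    same base point P, and the maximum number of nonzeros per FAP is
    minimal, i.e., ceil(h / n_fap) for h nonzeros in k.
    """
    words = [[0] * len(k_digits) for _ in range(n_fap)]
    nonzeros = 0
    for i, digit in enumerate(k_digits):
        if digit != 0:
            words[nonzeros % n_fap][i] = digit
            nonzeros += 1
    return words   # the words sum digit-wise back to k
```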
B. Examples and Comparison of the Splitting Algorithms
Splitting examples are given in Table IV. The performance of the splitting algorithms was tested by selecting 10 000 random 163-bit integers and evaluating them for both Koblitz and generic curves. When Koblitz curves were considered, the 163-bit integer was first converted to $\tau$NAF; Frobenius maps were ignored. The Montgomery point multiplication was used for generic curves, and the computational cost of the precomputations required in the fixed window algorithm was neglected because a fixed base point was assumed. Both one and zero bits have the same cost in the Montgomery point multiplication and, therefore, only the length of the longest word has significance. The results are presented in Fig. 9, which depicts speedups versus the one-FAP case. Fig. 9 shows that the cyclic splitting results in the best speedups for Koblitz curves. On the other hand, the fixed window splitting algorithm is, expectedly, the only one performing well for generic curves. Notice that, although the speedups versus the one-FAP case are smaller for Koblitz curves than for generic curves, the actual point multiplication is still considerably faster on Koblitz curves. Furthermore, Koblitz curves do not require any precomputations, or they are very cheap.
C. Multiple FAPs Versus Multiple Multipliers
This section presents comparisons of implementations having multiple FAPs, i.e., $n_{\mathrm{FAP}} > 1$ with $\mu = 1$, and implementations having a single FAP with multiple multipliers, i.e., $n_{\mathrm{FAP}} = 1$ and $\mu > 1$. As presented in Section V-B, $\mu = 1$ optimizes the latency-area ratio of an FAP for both generic and Koblitz curves. In that case, (18) gives an area of 3084 ALMs for the FAP, of which the multiplier occupies 2354 ALMs (76.3%). This percentage is in line with other designs reported in the literature; see, e.g., [9], [13], and [14]. Let $A_1$ denote the area of an FAP with $\mu = 1$. Because the area of a multiplier is 0.763$A_1$, an FAP with $\mu$ multipliers occupies approximately $(1 + 0.763(\mu - 1))A_1$. On Koblitz curves, the combining $Q = \sum_j Q_j$ is computed with point additions in $\mathcal{A}$, each requiring 11 multiplications (including the Itoh-Tsujii inversion). The mapping to $\mathcal{A}$ also requires 11 multiplications (including the Itoh-Tsujii inversion), but the critical path reduces to 10 multiplications if $\mu = 2$. Based on the previously mentioned facts, estimates of area, speedup, and speedup-per-area ratio were derived, as presented in Table V.
The fixed window splitting was selected for generic curves because it was the only one of the two algorithms that offers significant speedups; see Section VI-B. The critical path consists of 6, 3, or 2 multiplications when $\mu = 1$, $\mu = 2$, or $\mu = 4$, respectively; see Section III-B. The mapping and the recovery of the $y$-coordinate have a critical path of 19, 15, or 13 multiplications (including the Itoh-Tsujii inversion) when $\mu = 1$, $\mu = 2$, or $\mu = 4$, respectively. Combining is performed similarly as for Koblitz curves. The precomputations require multiplications on generic curves because one needs to compute the base points $P_j = 2^{jW} P$. Because point doubling in $\mathcal{A}$ is expensive, it is faster to perform the doublings in $\mathcal{LD}$ and then map the result point to $\mathcal{A}$. Thus, the longest precomputation requires a number of multiplications proportional to $(n_{\mathrm{FAP}} - 1)W$ (including the Itoh-Tsujii inversion). Estimates for area, speedup, and speedup-per-area ratio are presented in Table VI.
Tables V and VI show that the multiple-FAP method can be efficiently used for speeding up computation on Koblitz curves and, if the base point is fixed, also on generic curves. The method allows speedups beyond the limitations of the multiple multiplier methods and, moreover, even outperforms the multiple multiplier methods in achieved speedup per area on Koblitz curves.
VII. IMPLEMENTATIONS
Several designs with different parameters were implemented on an FPGA in order to investigate the validity of the analysis and methods presented in Sections V and VI. The designs were written in VHDL and synthesized for an Altera Stratix II EP2S180F1020C3 FPGA, henceforth referred to as S180C3, by using Altera Quartus II 6.0 SP1 design software. The functionality of the designs was verified with ModelSim SE 6.1b. The Stratix II S180C3 has 71 760 ALMs, 930 M512s, and 768 M4Ks [41]. A modular design style was used in VHDL, and the field multipliers were generated with automated design tools written specifically for this purpose. Hence, implementing multiple designs could be done with a moderate amount of work, but all designs required some hand optimization. For the FAPs in the parallel FAP implementations presented in Sections V-C and VI, one multiplier and the digit size offering the best latency-area ratio were selected, based on the analysis in Section V-B. The performance of the method presented in Section VI was demonstrated only on Koblitz curves because the base point would need to be fixed on generic curves. The cyclic splitting was used because it performs better than the fixed window method, as shown in Section VI-B.
A. Results
The results are shown in Tables VII and VIII for generic and Koblitz curves, respectively. The parameters of the designs are given on the left. The number of converters is always one in Table VIII. The number of point multiplications that can be computed simultaneously is denoted by $u$ in Table VIII. The results obtained from Quartus II are given in the middle so that the area of the design is given in the ALMs column, followed by the maximum clock frequencies for the converter and for the FAP. The multiplication latency is given by (5), and the average point multiplication times are given on the right. The results presented in Tables VII and VIII were obtained by synthesizing each design once. Different constraints were used for generic and Koblitz curves; however, the same constraints were used for all designs with the same curve. Fig. 10 plots the pm-times of the designs presented in Tables VII and VIII as functions of area. Fig. 10 shows that one multiplier is always the best choice for Koblitz curves, as estimated in Section V-B. For generic curves, however, multiple multipliers are feasible in practice, too. Actually, multiple multipliers perform even better than expected because, when $d$ grows, the clock frequency decreases, thus resulting in slower performance. This favors the use of multiple multipliers, because multiple multipliers with a small $d$ operate at a higher clock frequency than one multiplier with a large $d$.
B. Discussion and Comparisons
The superiority of Koblitz curves is obvious in Fig. 10. Although $k$ needs to be converted to a $\tau$-adic expansion, they are clearly faster and more area-efficient than generic curves. Pm-time on Koblitz curves can be reduced to approximately 40 μs with minor additional area. Further reductions in pm-time result in a considerable increase in area with traditional parallelization methods. However, the new method presented in Section VI offers faster pm-times with smaller area (circled points in Fig. 10).
The synthesis results vary slightly from run to run, which is one reason for the variation of the maximum clock frequencies in Tables VII and VIII. On the other hand, when the size of the design grows, the maximum clock frequencies start to decrease dramatically because the place&route becomes harder. Hence, the assumptions that the area grows linearly and that the clock frequencies are constant are not valid for large designs, as was conjectured in Section V. However, the assumptions hold well for smaller designs.
The size of the multiplier, i.e., $d$, has a considerably larger effect on the clock frequencies than $\mu$ or $n_{\mathrm{FAP}}$, which is not surprising considering that the critical path is in the multiplier. The differences between estimated and actual areas are investigated in Fig. 11, which shows that the estimates hold well if $\mu = 1$. However, when several multipliers are used, i.e., $\mu > 1$, the area estimates are too optimistic with large $d$. Again, this was expected because the place&route becomes hard when the size of an FAP grows.
A large number of FPGA-based implementations have been published in the literature. Fair comparison of these implementations is difficult, if not impossible, because of the variety of different FPGAs, elliptic curves, fields, coordinate systems, etc. Arguably, the largest problem for fair comparison is the variety of FPGAs, because it is hard to map area requirements and timings between different FPGA architectures without synthesizing the design for all of them. A valuable effort for evaluating designs on different families of Xilinx FPGAs was made in [11], where estimates of the effect of the FPGA families were given by synthesizing the designs for different families. However, as the VHDL describing the architecture of Section IV was written specifically for Stratix II FPGAs, this approach could not be used here. Table IX summarizes the FPGA implementations presented in the literature. When a publication presents many implementations, Table IX presents the one which is the most comparable with the designs presented in this article. The implementations presented in this paper are clearly among the fastest. However, as mentioned before, it is impossible to say which portion of the differences is caused by the different implementation platforms.
FPGA implementations of Koblitz curves have been presented in [12], [17], [19], and [20]. The fastest implementation in this paper computes point multiplication in 25.81 μs, including the conversion, and it outperforms all previous implementations. The designs presented in [19] and [20] do not include a converter; the significant difference in their pm-times is caused by different FPGAs and design architectures. The fast performance presented in [12] was achieved by representing $k$ with a double-base expansion. The implementation in [17] computes a multiple point multiplication and targets maximum throughput, which makes it incomparable with the other implementations.
VIII. CONCLUSION
Parallelization of high-speed ECC accelerators was studied. A generic accelerator architecture was presented in Section IV and it was used in studying the effects of parallelization. The analysis concerned both generic and Koblitz curves.
Analytic tools were provided for estimating the efficiency of different parallelism parameters. The accuracy of the tools was studied by implementing several designs on a Stratix II S180C3. These implementations are among the fastest published in the literature. The tools were shown to provide accurate estimates, although the accuracy decreases when designs become large because the place&route is harder. Nevertheless, the tools provide valuable information on how and where parallelism should be used in ECC implementations.
When parallel multipliers in an FAP are used for reducing latency, the optimal setup depends on the curve. Only one multiplier should be used for Koblitz curves, but multiple multipliers offer considerable improvements for generic curves. For them, the optimal setup depends on various aspects, such as available area and pm-time constraints, as discussed in Section V-B.
Koblitz curves were shown to offer considerably faster point multiplication than generic curves with an equal amount of area, even when the converter was included. Furthermore, the new method utilizing parallel FAPs presented in Section VI can be used efficiently for Koblitz curves. If base points are fixed or changed infrequently, the method is useful also on generic curves, but precomputations prevent its use if base point flexibility is essential. The method can be combined with existing techniques, such as windowing methods. The implementations of the method have very high latency-area efficiencies, which proves the usability of the new method.
