Abstract. Elliptic curve signature schemes offer shorter signatures compared to other methods and a family of curves called Koblitz curves can be used for reducing the cost of signing and verification. This paper presents an FPGA implementation designed specifically for rapid verification of self-certified identity based signatures using Koblitz curves. Verification requires computation of three elliptic curve point multiplications which are computed efficiently with 3-term multiple point multiplication and joint sparse form. Certain improvements to precomputations associated with multiple point multiplications are introduced. It is shown that, when using parallel processors, it is possible to gain considerable increases in the number of operations per second by allowing slightly longer computation times for single operations. It is demonstrated that up to 166,000 verifications per second can be computed using a single Altera Stratix II FPGA.
Introduction
Research on hardware realization of cryptographic algorithms has been intensive during the past few years. Implementation of elliptic curve cryptosystems by using field programmable gate arrays (FPGAs) has been one of the most active areas in the field, and numerous designs have been described in the literature. This paper extends the research on the subject by describing a very efficient implementation designed specifically for one of the most computationally demanding tasks of modern cryptosystems; namely, signature verification.
Elliptic curve cryptography [1, 2] is a branch of public-key cryptography which has recently been a subject of much interest because a high level of cryptographic security is achievable with shorter key lengths than with other existing methods. The implementation computes elliptic curve operations involved in verification of self-certified identity based signatures based on the Nyberg-Rueppel signature scheme [3]. The implementation uses one of the standardized Koblitz curves listed in [4], henceforth referred to as the NIST curve K-163, because computations are much faster on Koblitz curves [5]. Further improvements in performance are achieved by computing all operations required in signature verification simultaneously by using multiple point multiplication techniques. Performance is increased further by introducing certain improvements to precomputations.
Signature verification is a basic operation in many cryptosystems. Applications, such as the Packet Level Authentication (PLA) scheme [6, 7] where computational requirements for signature verifications are very high, directly benefit from the results presented in this paper.
The contributions of the paper include the following:
- Unified point addition and subtraction formulae are presented which can be used in speeding up precomputations in various methods including multiple point multiplications and combings.
- A new algorithm for 3-term joint sparse form precomputations is presented resulting in a major speed up compared to existing methods.
- To the authors' knowledge, this is the first publication where the tradeoff between computation time and the number of operations per second (ops) is explored when using parallel processing in elliptic curve operations. It is shown that allowing slightly longer latencies can result in considerable increases in ops.
- A highly efficient implementation which utilizes parallel processing is presented for an Altera Stratix II FPGA. The implementation is capable of performing up to 166,000 verifications per second which exceeds all previously presented implementations.
- It is shown that schemes, such as PLA [6, 7], could be feasible if the implementation presented in this paper is used for accelerating verifications.
The remainder of the paper is organized as follows. Sec. 2 presents the preliminaries of elliptic curve cryptography and self-certified identity based signatures. Algorithms that are used in the implementation are introduced and derived in Sec. 3. The implementation is presented and the results are analyzed in Secs. 4 and 5, respectively. Conclusions are drawn in Sec. 6 and the paper ends with suggestions of possible directions for future research.
Preliminaries

Packet Level Authentication
Packet Level Authentication (PLA) is a scheme where the authenticity of packets in IP (Internet Protocol) traffic is verified by signing and verifying them with cryptographic signatures. The authenticity of packets is verified from node to node instead of from point to point as in other schemes. This helps in preventing many threats including denial-of-service (DoS) attacks but, as a downside, PLA increases the length of the packet header and, most importantly, is computationally very demanding. Thus, hardware acceleration is essential [6, 7]. PLA is one of the possible applications for the implementation presented in this paper, as mentioned above, and the rationales behind many design decisions originate from the requirements of PLA.
The use of signatures based on elliptic curves instead of other techniques such as RSA or ElGamal is practically mandatory because the length of signatures must be kept to a minimum in order to minimize the overhead caused by PLA [6]. Koblitz curves were chosen in order to maximize the speed of the implementation because operations are notably faster on Koblitz curves than on general curves [5]. Self-certified identity based signatures were selected because they result in shorter signatures and reduced computational complexity [8].
Preliminaries of elliptic curve cryptography and self-certified identity based signatures are presented next in Secs. 2.2 and 2.3, respectively.
Elliptic Curve Cryptography
Every elliptic curve cryptosystem is based on an operation called elliptic curve point multiplication, and it is defined as Q = kP where Q and P are points on an elliptic curve and k is an integer.
Koblitz curves [5] are a family of elliptic curves of the form

$$E_K : y^2 + xy = x^3 + ax^2 + 1$$

where $a \in \{0, 1\}$ and $x$ and $y$ are elements of the finite field $\mathbb{F}_{2^m}$. Elliptic curve point multiplication is computed with successive point additions and point doublings with the binary method so that, when $k = \sum_{i=0}^{\ell-1} \kappa_i 2^i$, point doublings are performed for all $\kappa_i$ and point additions when $\kappa_i = 1$. On Koblitz curves, however, point doublings are replaced by computationally cheap Frobenius maps which results in a significant improvement in performance. Before this feature can be utilized, $k$ needs to be converted into a $\tau$-adic representation. Algorithms for finding the $\tau$-adic non-adjacent form ($\tau$NAF) were presented in [9]. When $k$ is represented in $\tau$NAF, it has the form $k = \sum_{i=0}^{\ell-1} \kappa_i \tau^i$ where $\tau = \left((-1)^{1-a} + \sqrt{-7}\right)/2$ and $\kappa_i \in \{0, \pm 1\}$ so that $\kappa_i \kappa_{i+1} = 0$ for all $i$. The average number of non-zero terms in $k$ is $\ell/3$ [9]. Because point additions are required when $\kappa_i \neq 0$ and $\ell \approx m$, point multiplication on $E_K$ requires on average $m/3$ point additions and $m$ Frobenius maps.
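The digit-by-digit conversion into a $\tau$-adic non-adjacent form can be illustrated with a minimal Python sketch. It follows the well-known division-by-$\tau$ recurrence and is not taken from the implementation described here; in particular it omits the modular reduction that keeps the expansion length close to $m$, and the names `tau_naf`, `evaluate` and `mu` (standing for $(-1)^{1-a}$) are chosen only for this example.

```python
def tau_naf(k, mu=1):
    """Return tau-adic NAF digits of the integer k, least significant first."""
    r0, r1 = k, 0                       # the element r0 + r1*tau of Z[tau]
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 % 2:                      # odd: choose u = +/-1 so the next digit is 0
            u = 2 - ((r0 - 2 * r1) % 4)
            r0 -= u
        else:
            u = 0
        digits.append(u)
        half = r0 // 2                  # r0 is even at this point
        r0, r1 = r1 + mu * half, -half  # exact division by tau
    return digits


def evaluate(digits, mu=1):
    """Horner evaluation of sum(u_i * tau^i) using tau^2 = mu*tau - 2."""
    a, b = 0, 0                         # running value a + b*tau
    for u in reversed(digits):
        a, b = -2 * b + u, a + mu * b   # multiply by tau, then add the digit
    return a, b


digits = tau_naf(2)
print(digits, evaluate(digits))
```

Evaluating the digits back in $\mathbb{Z}[\tau]$ should return the pair `(k, 0)`, which gives a simple self-check of the expansion.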
A sum of integer multiples of two points, i.e. $k_1 P_1 + k_2 P_2$, can be accelerated with Shamir's trick [10] where the integers are represented as a matrix having $k_1$ and $k_2$ as rows. First, $P_1 + P_2$ is precomputed. Point multiplication is carried out with the binary method so that one adds the point $P_1$ if the column is $(1, 0)^T$, $P_2$ if it is $(0, 1)^T$, and $P_1 + P_2$ if it is $(1, 1)^T$. If the column is $(0, 0)^T$, only point doubling or Frobenius map is performed. When $k_1$ and $k_2$ are in NAF, also the point $P_1 - P_2$ is precomputed. Two integers can be represented in joint sparse form (JSF) [11] in order to maximize the number of zero columns. JSF was generalized for $n$ integers in [12]. JSF can be used also for Koblitz curves as an algorithm for finding $\tau$-adic JSF ($\tau$JSF) for two integers was presented in [13]. A generalization for $n$ integers was recently proposed in [14] and it is henceforth referred to as 3-term $\tau$JSF because its average number of non-zero columns is equivalent to the 3-term JSF [14]. A 3-term $\tau$JSF has a probability of 0.5897 for a non-zero column [14] which yields a Hamming weight, i.e. the number of non-zero columns, $H(k) = 0.5897m$ on average. This paper considers the following 3-term multiple point multiplication:

$$Q = k_1 P_1 + k_2 P_2 + k_3 P_3 \qquad (1)$$
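The column-wise processing behind Shamir's trick can be sketched as follows. This is only an illustration with unsigned binary expansions; it does not use JSF or Frobenius maps, and the group is passed in as a generic `add` function so that the example can be run with integers modulo a small number standing in for curve points. The name `multi_mul` is hypothetical.

```python
from itertools import product


def multi_mul(ks, Ps, add, zero):
    """Compute sum(k_i * P_i) with one shared double-and-add pass."""
    # Precompute the sum of every subset of the points, indexed by a column of bits.
    table = {}
    for col in product((0, 1), repeat=len(Ps)):
        acc = zero
        for bit, P in zip(col, Ps):
            if bit:
                acc = add(acc, P)
        table[col] = acc

    acc = zero
    for i in reversed(range(max(k.bit_length() for k in ks))):
        acc = add(acc, acc)                    # one doubling per column
        col = tuple((k >> i) & 1 for k in ks)
        if any(col):
            acc = add(acc, table[col])         # at most one addition per column
    return acc


# Toy group: integers modulo 1009 under addition (a stand-in for curve points).
n = 1009
result = multi_mul((13, 200, 77), (5, 11, 123), lambda a, b: (a + b) % n, 0)
print(result)
```

The number of additions equals the number of non-zero columns, which is why sparse joint representations such as JSF reduce the cost.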
Self-Certified Identity Based Signatures
In the following, a self-certified identity based signature scheme [3] 
where u is an integer selected at random from the interval [1, r−1] and compress compresses a point (x, y) to (x, b(y)) which requires only m + 1 bits. hash is a hash function.
Alice generates a signature (c, d) for a message M by calculating
where v is a random integer such that v ∈ [1, r − 1] and [vG] x is the x-coordinate of vG. Bob verifies the signature on the message M by first extracting Alice's public key W A from (r A , b A ) which are public by computing
where decompress is the inverse operation of compress. Thus, (3) requires one point multiplication. After extraction, the validity of the signature is verified by checking
which requires two point multiplications. Verification and extraction can be simplified into the following 3-term multiple-point multiplication as shown in [8] :
which obviously has the form of (1).
As signings, i.e. computations of (2), are computationally cheaper than verifications and they can be accelerated further with methods such as fixed-base windowing (see [16], for example), the performance of the scheme is bounded by verifications. The 3-term multiple point multiplication dominates the computational requirements of verification because decompression and subtraction are fast to compute. The hash can be computed simultaneously with point multiplication, and many fast and compact hash modules have been presented in the literature. Thus, the remainder of the paper focuses on accelerating (1).
Algorithms
This section introduces the algorithms which are used in computing (1) . Point multiplications are computed using known algorithms which are reviewed in Sec. 3.1 but new algorithms are derived for precomputations in Sec. 3.2.
Elliptic Curve Point Multiplication
When two points on $E_K$ are represented in affine coordinates, A for short, as $(x_1, y_1)$ and $(x_2, y_2)$, a point addition $(x_3, y_3) = (x_1, y_1) + (x_2, y_2)$ is given with the following formulae:

$$\lambda = \frac{y_1 + y_2}{x_1 + x_2} \qquad (5a)$$
$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a \qquad (5b)$$
$$y_3 = \lambda(x_1 + x_3) + x_3 + y_1 \qquad (5c)$$

They have the cost $I + 2M + S + 8A$ where $I$, $M$, $S$ and $A$ denote the costs of inversion, multiplication, squaring and addition in $\mathbb{F}_{2^m}$, respectively. A negation of the point $(x_1, y_1)$ is given by $(x_1, x_1 + y_1)$ and it has the cost of $A$ [16].

Because inversions are expensive, it is commonly preferred to represent points with three coordinates as $(X, Y, Z)$ because then the number of inversions in point multiplication can be reduced to one. A coordinate system called López-Dahab coordinates [17], or LD for short, is used in this paper and a point $(X, Y, Z)$ in LD represents the point $(X/Z, Y/Z^2)$ in A [17]. When points are represented in LD, point addition $P_3 = P_1 + P_2$ can be computed as presented in [18] so that $P_1$ is in LD and $P_2$ in A. This is referred to as the mixed coordinate point addition and it has the cost of only $8M + 5S + 8A$ on the NIST curve K-163. The Frobenius map is $(X^2, Y^2, Z^2)$ in LD and it is obviously cheap to compute. The A → LD mapping is performed at the beginning simply as $(x, y, 1)$ but the LD → A mapping requires $I + 2M + S$. However, as shown in (4), the y-coordinate is not needed in verification and, hence, the cost reduces to only $I + M$.
Finite fields $\mathbb{F}_{2^m}$ are typically represented with polynomial basis or normal basis. In polynomial basis, the field is constructed by using an irreducible polynomial with a degree of $m$. In normal basis, the set $\{\alpha, \alpha^2, \ldots, \alpha^{2^{m-1}}\}$, where $\alpha^{2^i}$ are linearly independent, is used as a basis and an element is represented as $a = \sum_{i=0}^{m-1} a_i \alpha^{2^i}$ where $a_i \in \{0, 1\}$. Multiplication is considered more efficient in polynomial basis. However, in normal basis squaring is simply a rotation of the bit vector and Frobenius maps are thus very cheap to compute. For this reason normal basis was chosen. Addition is computed with a simple bitwise exclusive-or (XOR). Inversion is computed with Itoh-Tsujii inversion [19] requiring exactly $(\lfloor \log_2(m-1) \rfloor + H(m-1) - 1)M + (m-1)S$ where $H(m-1)$ is the Hamming weight of $m-1$ [19]. As $m = 163$, the cost is $I = 9M + 162S$. Because squarings are cheap, multiplications dominate in $I$.
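The operation count of Itoh-Tsujii inversion can be checked with a short Python sketch that walks through the addition chain on the exponent without performing any field arithmetic. The function name `itoh_tsujii_cost` and the particular chain (driven by the binary expansion of $m - 1$) are assumptions made for this example.

```python
def itoh_tsujii_cost(m):
    """Count multiplications and squarings for one inversion in F_{2^m}."""
    mults = squarings = 0
    e = 1                               # the running value is a^(2^e - 1)
    for bit in bin(m - 1)[3:]:          # bits of m-1 after the leading one
        squarings += e                  # raise to the power 2^e ...
        mults += 1                      # ... and multiply: exponent 2^(2e) - 1
        e *= 2
        if bit == "1":
            squarings += 1              # square once and multiply by a
            mults += 1
            e += 1
    return mults, squarings + 1         # final squaring gives a^(2^m - 2)


print(itoh_tsujii_cost(163))
```

For $m = 163$ the count agrees with the cost $I = 9M + 162S$ stated above.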
To summarize, the implementation computes (1) on the NIST K-163 (normal basis) with the binary method using a 3-term τ JSF. Point additions are computed in mixed coordinates and, in the end, the x-coordinate is mapped to A by computing X/Z, where the inversion is computed with an Itoh-Tsujii inversion.
Precomputation
When (1) is computed with multiple point multiplication techniques, certain points need to be precomputed. These precomputations cannot be computed offline similarly as, e.g., in fixed-base windowing methods because points P i are not fixed. Thus, precomputations are on the critical path and it is essential to compute them as fast as possible. In order to be able to use fast mixed coordinate point additions, precomputed points should be in A. The first step in improving precomputations is to utilize the fact that the same inversion is computed in both P 1 + P 2 and P 1 − P 2 computations. The same fact has been previously used at least in [20] but it is shown in the following that it is also possible to save some additions.
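Theorem 1 (Unified point addition and subtraction). Given two points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ on an elliptic curve $E$, $(x_3^{+}, y_3^{+}) = P_1 + P_2$ and $(x_3^{-}, y_3^{-}) = P_1 - P_2$ can be computed with the following formulae:

Proof. (6c) and (6d) are simply the point addition formulae (5b) and (5c), i.e. $(x_3^{+}, y_3^{+}) = P_1 + P_2$, and it remains to show that $(x_3^{-}, y_3^{-}) = P_1 - P_2$. Terms of the form $2\lambda\lambda$ vanish because the field has characteristic two, i.e. $2\lambda\lambda = 0$. As $-P_2 = (x_2, x_2 + y_2)$, (5a) yields
Now substituting (8) into (6f) and (7) shows that $(x_3^{-}, y_3^{-}) = P_1 - P_2$.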
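The cost of computing (6a)-(6f) is only $I + 4M + 2S + 14A$. Computing the sum and the difference separately with (5a)-(5c) costs $2(I + 2M + S + 8A)$ plus one $A$ for negating $P_2$, so Theorem 1 saves $I + 3A$. This is significant because inversion dominates in the cost of point addition.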
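The idea of sharing one inversion between $P_1 + P_2$ and $P_1 - P_2$ can be tried out with the following Python sketch. It is one possible arrangement that matches the stated cost of $I + 4M + 2S + 14A$ and is not claimed to reproduce (6a)-(6f) term by term. The toy field $\mathbb{F}_{2^7}$ with the polynomial $x^7 + x + 1$, the polynomial-basis arithmetic, and all function names are assumptions made only to keep the example small and runnable.

```python
M, POLY = 7, 0b10000011   # toy field F_{2^7} with x^7 + x + 1 (illustrative choice)
A_COEF = 1                # curve y^2 + xy = x^3 + a*x^2 + 1 with a = 1


def mul(a, b):
    """Carry-less multiplication with reduction modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= POLY
        b >>= 1
    return r


def inv(a):
    """Inverse as a^(2^M - 2), by plain repeated multiplication."""
    r = 1
    for _ in range(2**M - 2):
        r = mul(r, a)
    return r


def add_points(P, Q):
    """Affine addition of two points with different x-coordinates."""
    (x1, y1), (x2, y2) = P, Q
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    x3 = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A_COEF
    return x3, mul(lam, x1 ^ x3) ^ x3 ^ y1


def add_sub(P, Q):
    """Return (P + Q, P - Q) with a single shared inversion."""
    (x1, y1), (x2, y2) = P, Q
    theta = inv(x1 ^ x2)                 # the only inversion
    lam = mul(y1 ^ y2, theta)            # slope for P + Q
    xp = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A_COEF
    yp = mul(lam, x1 ^ xp) ^ xp ^ y1
    t = mul(x2, theta)                   # difference between the two slopes
    xm = xp ^ mul(t, t) ^ t              # the cross term 2*lam*t vanishes
    ym = mul(lam ^ t, x1 ^ xm) ^ xm ^ y1
    return (xp, yp), (xm, ym)
```

The subtraction slope is $\lambda + x_2/(x_1 + x_2)$ because $-P_2 = (x_2, x_2 + y_2)$, so only one extra multiplication by the shared inverse is needed to obtain it.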
Precomputations in 3-term (τ)JSF require 10 point additions or subtractions because the points presented in Table 1 need to be available. Obviously, the pairs (R 4 , R 5 ), (R 6 , R 7 ), (R 8 , R 9 ), (R 10 , R 11 ) and (R 12 , R 13 ) can be computed using (6). Thus, the precomputations require only 5 unified point additions and subtractions. It should be noted that the use of unified point additions and subtractions is not restricted to JSF precomputations because similar pairs can be found, e.g., in precomputations involved in combings when integers are in NAF.
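Computational cost of precomputations can be reduced even further by using Montgomery's trick (see [16], for example) for computing the five inversions. Montgomery's trick is based on the observation that $1/\theta_1 = \theta_2 (1/\theta_1\theta_2)$ and $1/\theta_2 = \theta_1 (1/\theta_1\theta_2)$ and it operates as follows. Let $\theta_1, \theta_2, \ldots, \theta_n$ be the elements to be inverted. First, set $\gamma_1 = \theta_1$ and, for $i = 2, \ldots, n$, compute $\gamma_i = \gamma_{i-1}\theta_i$. Then compute $\gamma_n^{-1}$ with a single inversion and, for $i = n, \ldots, 2$, compute $\theta_i^{-1} = \gamma_i^{-1}\gamma_{i-1}$ and $\gamma_{i-1}^{-1} = \gamma_i^{-1}\theta_i$; finally, $\theta_1^{-1} = \gamma_1^{-1}$. Montgomery's trick inverts $n$ elements with the cost of $3(n-1)M + I$ [16].

However, Montgomery's trick is not directly applicable in 3-term JSF precomputations because it requires that all $\theta_i$ are known in advance. Let $R_i = (\hat{x}_i, \hat{y}_i)$ as defined in Table 1. The following inverses are needed in computing the points $R_4, \ldots, R_{13}$: $\theta_1^{-1} = (\hat{x}_2 + \hat{x}_1)^{-1}$, $\theta_2^{-1} = (\hat{x}_3 + \hat{x}_1)^{-1}$, $\theta_3^{-1} = (\hat{x}_3 + \hat{x}_2)^{-1}$, $\theta_4^{-1} = (\hat{x}_8 + \hat{x}_1)^{-1}$ and $\theta_5^{-1} = (\hat{x}_9 + \hat{x}_1)^{-1}$, in which only $\hat{x}_1 = x_1$, $\hat{x}_2 = x_2$ and $\hat{x}_3 = x_3$ are known beforehand. In order to be able to use Montgomery's trick, $\hat{x}_8$ and $\hat{x}_9$ need to be expressed by using $x_1$, $y_1$, $x_2$, $y_2$, $x_3$ and $y_3$.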
Because R 8 = (x 8 ,ŷ 8 ) = P 3 + P 2 , it follows directly from (5a) and (5b) that
.
Let θ 4 denote the denominator of (9). Similarly as above,
and, again, let θ 5 denote the denominator of (10). Now, Montgomery's trick can be used for computing inverses for the elements
2 as shown in (9) and (10). Finally, R i can be computed with (6b)-(6f), i.e. by skipping the inversion of (6a). An algorithm is presented in Alg. 1.
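As a worked illustration of this substitution, the expansion below assumes the standard affine addition formulae on $E_K$, i.e. the slope $\lambda = (y_3 + y_2)/(x_3 + x_2)$ and $\hat{x}_8 = \lambda^2 + \lambda + x_3 + x_2 + a$ for $R_8 = P_3 + P_2$. It is a sketch of how the inverse can be written with input coordinates only, and its layout need not coincide with (9) and (10).

```latex
\begin{align*}
\hat{x}_8 + x_1 &= \lambda^2 + \lambda + x_1 + x_2 + x_3 + a \\
 &= \frac{(y_3+y_2)^2 + (y_3+y_2)(x_3+x_2) + (x_3+x_2)^2\,(x_1+x_2+x_3+a)}{(x_3+x_2)^2}, \\
(\hat{x}_8 + x_1)^{-1} &= \frac{(x_3+x_2)^2}{(y_3+y_2)^2 + (y_3+y_2)(x_3+x_2) + (x_3+x_2)^2\,(x_1+x_2+x_3+a)} .
\end{align*}
```

For $R_9 = P_3 - P_2$ the same expansion applies with $y_3 + y_2$ replaced by $y_3 + y_2 + x_2$, because $-P_2 = (x_2, x_2 + y_2)$. In both cases the element to be inverted depends only on the coordinates of $P_1$, $P_2$ and $P_3$, so it is available before any point addition has been carried out.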
Algorithm 1 Precomputation in 3-term (τ)JSF
Input: P1 = (x1, y1), P2 = (x2, y2), P3 = (x3, y3)
Output: Precomputed points Ri as described in Table 1
Compute θ1, ..., θ5 and invert them with Montgomery's trick
R1 ← P1; R2 ← P2; R3 ← P3
R4,5 ← R2 ± R1; R6,7 ← R3 ± R1; R8,9 ← R3 ± R2
R10,11 ← R8 ± R1; R12,13 ← R9 ± R1

Table 2 lists the costs of 3-term (τ)JSF precomputations with the three techniques considered above, i.e. with 10 point additions (naïve) or 5 unified point additions and subtractions without (unified) or with (unified + Montgomery) Montgomery's trick. The methods presented above reduce the number of multiplications required in precomputations by 58 % in the case of $\mathbb{F}_{2^{163}}$ and Itoh-Tsujii inversion.
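Montgomery's trick itself is easy to express in code. The sketch below is generic in the field: the caller supplies `mul` and `inv`, and the usage example runs it over integers modulo the prime 1009 purely as a stand-in for $\mathbb{F}_{2^{163}}$. The name `batch_inverse` is hypothetical.

```python
def batch_inverse(thetas, mul, inv):
    """Invert all elements with one inversion and 3(n-1) multiplications."""
    n = len(thetas)
    gammas = [thetas[0]]
    for t in thetas[1:]:
        gammas.append(mul(gammas[-1], t))    # gamma_i = gamma_{i-1} * theta_i
    g_inv = inv(gammas[-1])                  # the single inversion
    out = [None] * n
    for i in range(n - 1, 0, -1):
        out[i] = mul(g_inv, gammas[i - 1])   # theta_i^-1
        g_inv = mul(g_inv, thetas[i])        # gamma_{i-1}^-1
    out[0] = g_inv
    return out


# Stand-in field for the example: integers modulo a small prime.
p = 1009
inverses = batch_inverse(
    [3, 10, 57, 400, 999],
    mul=lambda a, b: a * b % p,
    inv=lambda a: pow(a, -1, p),             # modular inverse, Python 3.8+
)
print(inverses)
```

With five elements, as in the precomputation above, the routine uses 12 multiplications and one inversion.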
Implementation
This section presents the design in detail. The design is implemented on an Altera Stratix II EP2S180 DSP development board, professional edition [21] , which includes an Altera Stratix II EP2S180F1020C3 FPGA [22] .
The goal of the implementation is to maximize the number of operations per second (ops) rather than to minimize the computation time of a single operation. The implementation is designed to be modular so that it can be easily parallelized in order to increase ops. It consists of two main modules, namely, converters for finding 3-term τ JSF for integers and field arithmetic processors (FAPs) with control logic for computing point multiplications. These modules are considered in Secs. 4.1 and 4.2, respectively. There are certain parameters which define the performance and area requirements of an implementation. It is not obvious how these design parameters should be chosen and, thus, parameter space exploration is performed in Sec. 4.3 in order to find optimal parameters.
It should be noted that, while side-channel attacks are a serious threat for many security applications in FPGAs [23], they are insignificant in this case because all information processed in verification is public.
τ NAF and 3-term τ JSF Conversions
As mentioned in Sec. 2.2, integer k needs to be converted into a τ -adic expansion before point multiplication. Conversions to τ NAF are performed as presented by the authors in [24] . Because three conversions are required in 3-term multiple point multiplication, there are basically two alternatives: either required conversions are computed with one τ NAF converter resulting in a critical path of three conversions or with three τ NAF converters and a critical path of one conversion. The latter alternative was chosen mainly for two reasons:
1. Latency is shorter, and
2. no storage for converted values is needed before τ JSF conversions.
Once the integers are converted into τ NAF, a 3-term τ JSF is built up as presented in [14]. The algorithm of [14] was implemented so that the four most recent signed bits from the τ NAF converters, which output their results serially, are stored into three shift registers, each of which contains 4 signed bits. The values of the shift registers are input into a circuit that determines whether the values of all three registers are reducible or not. If they are reducible and there are no all-zero columns, then the values of the registers are updated with reduced values. In the 3-term case the value 1001 of a shift register is replaced by 0011, 1001 by 0011, 1010 by 0110, and 1010 by 0110.
Point Multiplication
Point multiplication is computed with an architecture comprising an FAP and logic controlling it.
Field arithmetic processor. The FAP consists of adder, squarer, multiplier, storage RAM and instruction decoder.
The adder computes a bitwise XOR of two m-bit operands, and it has a latency of one clock cycle, i.e. A = 1. The squarer supports computation of multiple successive squarings, i.e. $x^{2^d}$.

Field multiplication is critical for the overall performance. Multiplication in normal basis is performed with a multiplier which is a digit-serial implementation of the Massey-Omura multiplier [25]. In a bit-serial Massey-Omura multiplier, one bit of the output is calculated in one clock cycle and, hence, m cycles are required in total. One bit $z_i$ of the result $z = x \times y$ where $x, y, z \in \mathbb{F}_{2^m}$ is computed from $x$ and $y$ by using an F-function. The F-function is field specific, and the same F is used for all output bits $z_i$ as follows: $z_i = F(x \lll i, y \lll i)$, where $\lll i$ denotes cyclical left shift by $i$ bits. Hence, a bit-serial implementation of the Massey-Omura multiplier requires three m-bit shift registers and one F-block. A bit-parallel implementation, where all bits $z_i$ are computed in parallel, requires m F-blocks and an m-bit register for storing the result [4, 25].

In practice, the bit-serial implementation requiring at least m+1 clock cycles is too slow and the bit-parallel implementation requires too much area. A good tradeoff is a digit-serial multiplier, where v bits are computed in parallel with v F-blocks. The F-block forms the critical path of an FAP and determines the maximum clock frequency. Thus, the maximum clock frequency can be increased by pipelining the F-blocks. As one clock cycle is required in loading the operands into the shift registers and each pipeline stage increases latency by one clock cycle, the latency becomes

$$M = \lceil m/v \rceil + c + 1 \qquad (11)$$

where c is the number of pipeline stages inside the F-blocks, i.e. c ≥ 0. In this paper, c = 1. It follows directly from (11) that, when m = 163, the number of F-blocks, v, should be chosen from the following set of integers: $\mathcal{F} = \{1, 2, \ldots, 15, 17, 19, 21, 24, 28, 33, 41, 55, 82, 163\}$. All other values only increase area without decreasing latency.

The storage RAM is used for storing elements of $\mathbb{F}_{2^m}$. Stratix II devices include M512, M4K and M-RAM memory blocks and they contain 576, 4,608, and 589,824 bits of RAM, respectively [22]. Using embedded memory blocks is advantageous because more logic resources are saved for the actual computation. The storage RAM is implemented with M4Ks as a dual-port RAM and it is capable of storing W elements. A logical choice is W = 256 because, while in true dual-port mode, the widest mode that an M4K block can be configured to is 256 × 18 bits [22]. Thus, the storage RAM requires $\lceil 163/18 \rceil = 10$ M4Ks resulting in a storage capacity of 256 × 163 bits. This much storage space is rarely needed but it can be used, for example, for storing precomputed points. Moreover, selecting a smaller depth than 256 would not reduce the number of required M4Ks. Both writing to and reading from the storage RAM require one clock cycle. However, the dual-port RAM can be configured into the read-during-write mode [22] which saves certain clock cycles as will be discussed in the following.
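The dependence of multiplier latency on the digit size can be explored with a few lines of Python. The sketch assumes the latency model described above, namely $\lceil m/v \rceil$ computation cycles plus $c$ pipeline stages plus one cycle for loading the operands; the function names are hypothetical.

```python
def mult_latency(v, m=163, c=1):
    """Assumed model: ceil(m/v) compute cycles + c pipeline stages + 1 load cycle."""
    return -(-m // v) + c + 1            # -(-m // v) is an integer ceiling


def useful_digit_sizes(m=163):
    """Digit sizes v for which one more F-block actually lowers the latency."""
    ceil_div = lambda a, b: -(-a // b)
    return [v for v in range(1, m + 1)
            if v == 1 or ceil_div(m, v) < ceil_div(m, v - 1)]


print(mult_latency(11))
print(useful_digit_sizes())
```

Any digit size outside the returned list has the same latency as the next smaller size in the list, so it would only add area.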
Control logic. The logic controlling the FAP consists of finite state machine (FSM) and ROM containing instruction sequences.
The instruction sequences are carefully hand-optimized and certain tricks are used in order to minimize latencies of point operations. The read-during-write mode can be used for reducing latencies. In order to maximize the advantages in this case, operations are ordered so that the result of the previous operation is used as an operand for the next operation whenever possible. This saves one clock cycle because the operands of the next operation can be read simultaneously while the result of the previous operation is being written.
Latency of computing $k_1 P_1 + k_2 P_2 + k_3 P_3$ with a 3-term τ JSF is given by (12) in clock cycles as the sum of three contributions: precomputations, point additions and Frobenius maps, and the final X/Z computation and interfacing. Here, H(k) is the number of non-zero columns in the 3-term τ JSF.

Fig. 1 presents an example operation schedule of an implementation with one converter and two FAPs. The implementation computes five point multiplications in the example so that when the first integers and points arrive (data #1), it immediately starts computing a 3-term τ JSF for the integers in the converter and precomputed points in the first FAP. Because a precomputation requires more time than a conversion, the computation time only consists of precomputation time and point multiplication time if there are resources available immediately at the arrival of data. This is the case for data #1, #2 and #5. However, when data #4 arrives, conversion can be started instantaneously but precomputation can be started only when the second FAP becomes available. The situation is even worse for data #3 because, when it arrives, there are no converters or FAPs available, and thus an even longer delay occurs.
Parameter Exploration
Free parameters in the design are the number of F-blocks, v, and the number of parallel FAPs, p, of which only v ∈ F determines the latency of a single point multiplication. Field multiplication determines point multiplication latency together with H(k) as shown in (12). It is assumed in the following analysis that JSFs have an average number of non-zero columns, and such JSFs are henceforth referred to as average JSFs. Thus, latency depends only on the latency of field multiplication. The critical path determining the maximum clock frequency does not depend on v and, thus, it is assumed that all FAPs operate at the same clock frequency; see Sec. 4.2. Based on the results obtained from Quartus II, it is assumed that the clock frequency is 160 MHz. Part of the device resources are reserved for the converters, interfacing, etc.; Stratix II S180 includes 71,760 ALMs in total [22].

The maximum ops is obtained when v = 11. In that case, an FAP can compute a 3-term multiple point multiplication in 117 µs and 19 FAPs fit into Stratix II S180 resulting in the maximum throughput of 162,000 ops. Fig. 2 leads to the conclusion that tolerating slightly longer computation latencies can lead to major increases in ops. Furthermore, it can be seen that v < 11 should never be selected because higher ops can be achieved with shorter computation time. However, all v ≥ 11 with v ∈ F are justified. If v > 11 is selected resulting in shorter computation time, then one must tolerate fewer ops. The design implemented in this paper uses the setup p = 19 and v = 11 in order to maximize ops.
The number of converters must be selected so that they do not become a bottleneck. If only a few converters are implemented, the average end-to-end computation time grows because data needs to wait for free converters longer; see Sec. 4.2. However, if many converters are implemented, the area constraint for FAPs needs to be lowered resulting in a decrease in performance.
Results
The design presented in Sec. 4 was written in VHDL and synthesized for the Stratix II FPGA by using Quartus II 6.0 SP1. Simulations were performed with ModelSim SE 6.1b. The design comprising 4 τ JSF converters, 19 FAPs (v = 11) and FIFO buffers separating blocks requires in total 67,467 ALMs, which is 94 % of the device resources, and 240 M512 (26 %) and 305 M4K (40 %) memory blocks. The converters and the FAPs are separated into different clock domains and they have the maximum clock frequencies of 82.38 MHz and 167.50 MHz, respectively.
A phase-locked loop (PLL) in Stratix II was used for creating 82 MHz and 164 MHz clocks for the converters and FAPs. The converters compute a τ JSF for three integers on average in 499 clock cycles which equals 6.9 µs. The average latency of 18,733 clock cycles for a 3-term multiple point multiplication including precomputations is given by (12) which equals 114.2 µs. This is also the minimum time in which the implementation computes a 3-term multiple point multiplication with an average JSF because conversions and precomputations are computed in parallel. Theoretically, the implementation is capable of performing up to 166,000 verifications per second.
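The relation between the reported latency and the theoretical throughput can be reproduced with simple arithmetic. The sketch below assumes that all FAPs are kept continuously busy and uses only the figures quoted above (19 FAPs, a 164 MHz clock and 18,733 clock cycles per operation); the function name `throughput` is hypothetical.

```python
def throughput(p, f_clk_hz, latency_cycles):
    """Return (latency in microseconds, operations per second) for p parallel FAPs."""
    latency_us = latency_cycles / f_clk_hz * 1e6
    ops = p * f_clk_hz / latency_cycles
    return latency_us, ops


latency_us, ops = throughput(p=19, f_clk_hz=164e6, latency_cycles=18_733)
print(f"{latency_us:.1f} us per operation, {ops:,.0f} operations per second")
```

The same function applied to the 160 MHz assumption of the parameter exploration gives the slightly lower estimate used there.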
To the authors' knowledge, the fastest published FPGA implementation for the NIST curve K-163 was presented by Dimitrov et al. in [26] where a 1-term point multiplication requires 35.75 µs on Xilinx Virtex-II which would result in approximately 9,300 verifications per second. This was achieved by representing k with multiple-base expansions [26] . The implementation was optimized for low latency but, naturally, it could be parallelized in order to increase ops and rough estimates are given next. Because the FAP used in [26] requires 6,494 slices, a parallel implementation outperforming 166,000 verifications per second would need 18 FAPs resulting in approximately 117,000 slices without converters. Thus, the implementation would be too large to fit any FPGA available at the moment. However, the FAP with v = 24 [26] is probably larger than the one optimizing ops. Thus, the idea presented in this paper could be used, most probably resulting in more ops with fewer resources.
The results have shown that using parallel FAPs and 3-term τ JSF enables considerable performance increases and the implementation presented here outperforms all previously published implementations if ops are considered.
Conclusions
This paper presented an efficient implementation designed specifically for rapid verification of self-certified identity based signatures. It was shown that it is possible to compute up to 166,000 verifications per second with a single Altera Stratix II FPGA. The results have significance in many cryptosystems whose performance is bounded by demanding signature verifications. One example is PLA where packets are verified by using cryptographic signatures.
The high performance was achieved by using parallel processors which were carefully optimized. Instead of concentrating on minimizing the computation time of a single processor, the objective was shifted to maximizing the number of verifications per second computed by the parallel processors. It was concluded that major increases in ops can be achieved by tolerating slightly longer computation times, i.e. by using multiple smaller processors instead of only a few large processors. The idea can be easily generalized to other elliptic curve cryptosystems and implementation platforms.
Future work. Because field multiplication dominates in the performance and area requirements, it is of interest to optimize the multiplier architecture. One possibility is to use polynomial basis instead of normal basis. Polynomial bases are commonly preferred in implementing elliptic curve cryptosystems in hardware and they could offer some performance improvements. Another option is to use a more efficient architecture for normal basis multiplication.
A counterpart implementation which produces self-certified signatures will be designed. As mentioned, high performance is easier to achieve in signing because fewer point multiplications are needed and it is possible to use such methods as fixed-base windowing. Thus, performance should not be a problem. However, countermeasures against side-channel attacks are needed in signing acceleration in order to ensure confidentiality of private keys.
Although point multiplications are the most expensive operations in signing and verification, also other operations, such as hash functions, are needed and they will be included into the implementations in the future.
