Abstract. This paper presents implementation results of a reconfigurable elliptic curve processor defined over prime fields GF(p). We use this processor to compare a new algorithm for point addition and point doubling operations on twisted Edwards curves against a current standard algorithm, namely Double-and-Add. Versions of both algorithms that are secure against power analysis are also examined and compared. The algorithms are implemented on an FPGA, and the speed, area and power performance of each is evaluated for various modes of circuit operation using parallel processing. To the authors' knowledge, this work introduces the first documented FPGA implementation for computations on twisted Edwards curves over fields GF(p).
Introduction
Elliptic Curve Cryptography (ECC) was established as a form of public key cryptosystem in 1985 by Miller [1] and Koblitz [2]. Its advantage over other public key cryptosystems, such as RSA [3], is that it provides an equivalent level of security using shorter cryptographic keys. Increasing the key size increases the overall security of the cryptosystem [4]. In practice, however, hardware resources such as processing power and system memory are limiting factors: a larger key reduces speed and increases circuit area. Therefore, a balance needs to be struck between the complexity of the mathematical computation and the resources available to implement it.
Another element to take into account when designing a public key cryptosystem is the physical security of the system. Implementations can leak sensitive information during the execution of a computation [5], which may reveal secret information no matter how mathematically secure the system may be. Side-channel analysis consists of monitoring such leaked information, for example the power consumption [6], and using it to deduce, or partially deduce, the secret key [7].
In this paper we present implementation results of a reconfigurable elliptic curve processor for a generalisation of the Edwards curve [8], the twisted Edwards curve, recently proposed by Bernstein et al. [9]. We examine both the implementation efficiency and the implementation security of twisted Edwards curves and compare them against standard curves and methods in use today. We first examine, in projective coordinates, the explicit formulas for point addition and point doubling used with the widely used Double-and-Add method [10] and compare them against the standard twisted Edwards formulas. We then examine the strongly unified formula, which is resistant to simple power analysis (SPA) [7], a form of side-channel analysis, and compare it to its equivalent, the Double-and-Add-Always method [11].
Elliptic Curves
We consider an elliptic curve over the field GF(p) for some prime p, given by the affine Weierstrass equation

$$y^2 = x^3 + ax + b.$$

In Jacobian projective coordinates, this curve is given by the equation

$$Y^2 = X^3 + aXZ^4 + bZ^6,$$

where the Jacobian projective point $(X_1 : Y_1 : Z_1)$ corresponds to the affine point $(X_1/Z_1^2, Y_1/Z_1^3)$ if $Z_1 \neq 0$, and to O, the point at infinity, if $Z_1 = 0$. We consider Jacobian projective coordinates as they currently provide the fastest implementation of ECC in hardware.
The basic operations of ECC are point scalar computations of the form

$$Q = kP.$$
The Elliptic Curve Discrete Logarithm Problem (ECDLP) is the problem of retrieving k given P and Q, where P is a point on the curve and k is an integer [12]. The assumed difficulty of this problem is the basis of security for elliptic curve public key schemes. Point scalar multiplication (PM) can be performed using algorithms such as the Double-and-Add method, as shown in Algorithm 1. This method requires m − 1 point doublings (PD) and w − 1 point additions (PA), where m is the length and w is the Hamming weight of the binary expansion of k. Each PA and PD consists of finite field additions, subtractions, multiplications and inversions. By representing each point on the curve in projective (X, Y, Z) rather than affine (x, y) coordinates, each PA and PD can be performed without the need for inversions, albeit at the cost of extra multiplications. This improves efficiency, since an inversion is significantly more expensive than a multiplication [13].
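The operation counts quoted above can be checked with a small sketch. Algorithm 1 is not reproduced in this text, so the left-to-right variant below is an assumption, and the function name `double_and_add` and the stand-in group (integers modulo a hypothetical `N` under addition) are illustrative only; a real implementation would pass curve PA and PD routines instead.

```python
def double_and_add(k, point, add, double):
    """Left-to-right binary scalar multiplication.

    Returns (k * point, number of doublings, number of additions).
    `add` and `double` are the group operations (PA and PD).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    bits = bin(k)[2:]          # binary expansion, most significant bit first
    result = point             # the leading bit is always 1
    n_pd = n_pa = 0
    for bit in bits[1:]:
        result = double(result)            # PD on every remaining bit
        n_pd += 1
        if bit == "1":
            result = add(result, point)    # PA only when the key bit is 1
            n_pa += 1
    return result, n_pd, n_pa


# Stand-in group (assumption for illustration): integers mod N under addition.
N = 1009
int_add = lambda a, b: (a + b) % N
int_double = lambda a: (2 * a) % N

# k = 45 = 0b101101 has length m = 6 and Hamming weight w = 4.
q, pd_count, pa_count = double_and_add(45, 7, int_add, int_double)
```

For k = 45 the sketch performs m − 1 = 5 doublings and w − 1 = 3 additions, and the data-dependent branch on each key bit is what simple power analysis exploits.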
The equations governing PA and PD in projective coordinates using the Double-and-Add method, Algorithm 1, on a Weierstrass curve, are given in Algorithms 3 and 4 respectively. Each PA requires 16 multiplications and 7 additions/subtractions, with 10 multiplications and 4 additions/subtractions required for a PD.
Simple Power Analysis Resistance
Simple Power Analysis (SPA) uses side-channel analysis to monitor and measure the power consumed during a single execution of a cryptographic operation. Each PA and PD operation produces a different power trace when executed, because of the different number of multiplications and additions involved in each. As the execution of a point addition in Double-and-Add depends directly on the secret key bit $k_i$, it is possible to retrieve the secret key by monitoring the power consumption of a single scalar multiplication.
The first successful power analysis attack against an FPGA was carried out by Örs et al. [14], who attacked an elliptic curve processor and retrieved the secret key.
The Double-and-Add-Always method, Algorithm 2, is a simple approach to solving the problem of SPA susceptibility. It performs dummy point additions, so that every key bit triggers a point doubling and a point addition regardless of whether $k_i = 0$ or $k_i = 1$. This leads to an inefficient design, as unnecessary operations are performed, but it does prevent the recognition of individual bits.
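A minimal sketch of the idea follows. Algorithm 2 is not reproduced in this text, so the left-to-right structure, the name `double_and_add_always` and the stand-in group (integers modulo a hypothetical `N`) are assumptions. The sketch only models the sequence of point operations; the Python conditional that selects the result is not itself a constant-time construct.

```python
def double_and_add_always(k, point, add, double):
    """Scalar multiplication with a PD and a PA for every key bit.

    Returns (k * point, number of doublings, number of additions).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    bits = bin(k)[2:]
    result = point
    n_pd = n_pa = 0
    for bit in bits[1:]:
        doubled = double(result)
        n_pd += 1
        summed = add(doubled, point)     # executed even when the bit is 0 (dummy PA)
        n_pa += 1
        result = summed if bit == "1" else doubled
    return result, n_pd, n_pa


# Stand-in group (assumption for illustration): integers mod N under addition.
N = 1009
int_add = lambda a, b: (a + b) % N
int_double = lambda a: (2 * a) % N

# Two 6-bit keys with different Hamming weights give the same operation counts.
low_weight = double_and_add_always(0b100001, 7, int_add, int_double)
high_weight = double_and_add_always(0b111111, 7, int_add, int_double)
```

Both keys cost five doublings and five additions, so the operation sequence no longer depends on the key bits, at the price of the dummy additions.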
Edwards Curves
In [9], Bernstein et al. introduced the twisted Edwards curves

$$ax^2 + y^2 = 1 + dx^2y^2,$$

where a, d ∈ GF(p) are distinct and non-zero. If a = 1, the curve may be called an Edwards curve. They further showed that a significant number of elliptic curves over GF(p) (roughly 1/4 of isomorphism classes of elliptic curves) are birationally equivalent to a twisted Edwards curve. Two curves are birationally equivalent if there is an invertible rational mapping between them, which may be undefined at a finite number of points. The chief advantage of Edwards and twisted Edwards curves over standard curves is that the addition laws defined on them can be made unified, i.e., a single addition formula can be used to add points and double points, with no exception for the identity.
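We use the projective twisted Edwards curve

$$(aX^2 + Y^2)Z^2 = Z^4 + dX^2Y^2$$

so as to avoid inversions. The projective point $(X_1 : Y_1 : Z_1)$, $Z_1 \neq 0$, corresponds to the affine point $(X_1/Z_1, Y_1/Z_1)$ on the twisted Edwards curve.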
Addition law on twisted Edwards curve
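Let $(X_3 : Y_3 : Z_3)$ be the sum of the two points $(X_1 : Y_1 : Z_1)$ and $(X_2 : Y_2 : Z_2)$ on the projective twisted Edwards curve, i.e.,

$$X_3 = Z_1Z_2(X_1Y_2 + Y_1X_2)(Z_1^2Z_2^2 - dX_1X_2Y_1Y_2),$$
$$Y_3 = Z_1Z_2(Y_1Y_2 - aX_1X_2)(Z_1^2Z_2^2 + dX_1X_2Y_1Y_2),$$
$$Z_3 = (Z_1^2Z_2^2 - dX_1X_2Y_1Y_2)(Z_1^2Z_2^2 + dX_1X_2Y_1Y_2).$$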
We note that the projective twisted Edwards curve has two singular points, (1 : 0 : 0) and (0 : 1 : 0), and the addition law is not defined at these points. An elliptic curve cryptosystem based on an implementation of Edwards or twisted Edwards curve should not allow either of these points as inputs.
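As a rough illustration of a unified addition law, the sketch below implements the projective twisted Edwards addition in the form usually given for these curves and checks it on a toy curve. The parameters p = 13, a = 4, d = 2 are assumptions chosen only to keep the example small (a is a square and d a non-square modulo 13, so the law has no exceptional affine inputs); they do not come from the paper, and the function names are hypothetical.

```python
# Toy parameters (assumptions, not from the paper).
P, A, D = 13, 4, 2


def on_curve(pt):
    """Check the projective equation (aX^2 + Y^2) Z^2 = Z^4 + d X^2 Y^2 mod P."""
    X, Y, Z = pt
    return ((A * X * X + Y * Y) * Z * Z - Z ** 4 - D * X * X * Y * Y) % P == 0


def unified_add(p1, p2):
    """Add two projective points; the same formula also doubles (p1 == p2)."""
    X1, Y1, Z1 = p1
    X2, Y2, Z2 = p2
    zz = Z1 * Z2 % P
    b = zz * zz % P                       # Z1^2 Z2^2
    e = D * X1 * X2 * Y1 * Y2 % P         # d X1 X2 Y1 Y2
    X3 = zz * (X1 * Y2 + Y1 * X2) * (b - e) % P
    Y3 = zz * (Y1 * Y2 - A * X1 * X2) * (b + e) % P
    Z3 = (b - e) * (b + e) % P
    return (X3, Y3, Z3)


def to_affine(pt):
    """Map (X : Y : Z) with Z != 0 to the affine point (X/Z, Y/Z)."""
    X, Y, Z = pt
    z_inv = pow(Z, P - 2, P)              # inverse via Fermat's little theorem
    return (X * z_inv % P, Y * z_inv % P)


# All affine points of the toy curve, written with Z = 1.
POINTS = [(x, y, 1) for x in range(P) for y in range(P) if on_curve((x, y, 1))]
IDENTITY = (0, 1, 1)
```

Because the same routine handles both addition and doubling, a scalar multiplication built on it executes one kind of point operation only, which is the property the unified formula relies on for SPA resistance.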
Algorithms 5 and 6 give the PA and PD for the non-SPA-resistant twisted Edwards algorithms, while Algorithm 7 gives the unified formula.
Algorithm 5: Point Doubling for twisted Edwards
For the separate PA and PD formulae, each PA requires 12 multiplications and 8 additions, while each PD requires 8 multiplications and 7 additions. The unified point operation uses the same formula for both PA and PD, thereby giving the same power trace for either operation, at a cost of 14 multiplications and 5 additions per point operation.
FPGA Based Elliptic Curve Processor
A reconfigurable architecture for performing elliptic curve cryptography was designed [15] and ported onto an FPGA device. It consists of a controller, containing an instruction set stored in ROM and a finite state machine (FSM); a user-definable number of arithmetic logic units (ALUs) that perform addition, subtraction and multiplication in parallel; and BlockRAM for storage of results, as illustrated in Figure 1. Software was developed in C++ to generate the instruction set and associated VHDL code for the reconfigurable processor. The elliptic curve processor (ECP) can be configured by the user for any characteristic p and extension field m, as well as the respective memory sizes. In this respect it can be modified to perform a number of different algorithms.
The ALUs shown in Figure 2 perform the GF(p) operations described in Section 2, namely modular multiplication, addition and subtraction. Mode bits are used to select between operations.
The modular addition operation adds A and B in the first adder and then subtracts the modulus p from the sum in a second adder. To subtract the modulus from the intermediate result, the modulus is bitwise inverted and added to (A + B) with the carry-in set to 1, thus performing a two's complement subtraction. The carry-out of the second adder selects which intermediate result is correct: if (A + B) is already in the correct range, the output of the first adder is the result; otherwise, the output of the second adder is the result.
For modular subtraction, B is bitwise inverted and added to A with the carry-in set to 1. If the carry-out of this adder is low, the modulus must be added to give an output in the correct range.
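The two datapaths described above can be modelled at the bit level as follows. This is a behavioural sketch rather than the authors' VHDL: the function names and the explicit `width` parameter (the adder width, assumed to be one bit wider than the field so that A + B fits) are assumptions made for illustration.

```python
def mod_add(a, b, p, width):
    """Two-adder modular addition of a, b in [0, p)."""
    mask = (1 << width) - 1
    first = a + b                           # first adder: A + B
    second = first + (~p & mask) + 1        # second adder: (A + B) - p in two's complement
    carry_out = second >> width             # 1 exactly when A + B >= p
    return (second & mask) if carry_out else first


def mod_sub(a, b, p, width):
    """Modular subtraction of a, b in [0, p) via inverted B and carry-in 1."""
    mask = (1 << width) - 1
    diff = a + (~b & mask) + 1              # A - B in two's complement
    carry_out = diff >> width               # 1 exactly when A >= B
    diff &= mask
    # A low carry-out means the difference went negative, so add the modulus.
    return diff if carry_out else (diff + p) & mask
```

With p = 13 and a 5-bit adder, for example, `mod_add(9, 8, 13, 5)` selects the second adder's output, 4, while `mod_sub(3, 5, 13, 5)` takes the correction path and returns 11.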
Modular multiplication is more complex, but by using the Montgomery multiplication algorithm [16] we can compute the modular product while avoiding the need to perform a division by the modulus. Due to the large number of multiplications required by the elliptic curve processor, it is more cost effective to initially convert all values to the Montgomery domain, and then to convert them back afterward. The Montgomery modular product is defined as

$$\mathrm{Mont}(A, B) = A \cdot B \cdot 2^{-(p_b+2)} \pmod{p},$$

where $p_b$ is the field size in bits and Mont is a Montgomery multiplication. The result of a Montgomery multiplication is therefore out by a factor of $2^{-(p_b+2)}$. To correct for this factor, the output must be Montgomery multiplied by $2^{2(p_b+2)} \pmod{p}$; a value is converted back from the Montgomery domain by Montgomery multiplying it by 1.
Algorithm 8: Montgomery Multiplication
Initialise: $R_{-1} \leftarrow 0$; $b_{p_b+1} \leftarrow 0$
for $i \leftarrow 0$ to $p_b + 1$ do
    $q_i = (R_{i-1} + b_i A) \bmod 2$
    $R_i = (R_{i-1} + q_i p + b_i A)/2$
end
For modular multiplication, following the process described in Algorithm 8, the inputs to the first adder are $b_i A$ and the previous result $R_{i-1}$. The modulus p is added to the sum of the first adder if the LSB of the sum ($q_i$) is equal to 1. A shift register is used to step through each bit $b_i$ of B, and the result of each iteration is divided by 2 by a right shift.
Modular multiplication is executed in p b + 2 clock cycles, where p b is the field size in bits, while modular additions and subtractions each take 2 clock cycles.
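The bit-serial loop can be sketched in software as below. This is a behavioural model under stated assumptions, not the hardware design: it takes the loop to run for $p_b + 2$ iterations with $q_i$ equal to the least significant bit of the first adder's sum, and the helper names `to_mont` and `from_mont` are hypothetical. The loop output is not guaranteed to lie below p, so the sketch reduces it once when leaving the Montgomery domain.

```python
def mont_mul(a, b, p, pb):
    """Bit-serial Montgomery product: a * b * 2^-(pb + 2) modulo odd p.

    One loop iteration corresponds to one clock cycle of the multiplier.
    """
    r = 0
    for i in range(pb + 2):
        b_i = (b >> i) & 1
        s = r + b_i * a          # first adder: R + b_i * A
        q_i = s & 1              # LSB of the sum decides whether p is added
        r = (s + q_i * p) >> 1   # the sum is now even, so the halving is exact
    return r


def to_mont(x, p, pb):
    """Enter the Montgomery domain by multiplying with 2^(2 (pb + 2)) mod p."""
    return mont_mul(x, pow(2, 2 * (pb + 2), p), p, pb)


def from_mont(y, p, pb):
    """Leave the Montgomery domain by multiplying with 1, then reduce once."""
    return mont_mul(y, 1, p, pb) % p


# The NIST P-192 prime, used here only as a familiar 192-bit example modulus.
P192 = 2 ** 192 - 2 ** 64 - 1
```

A product of two field elements is then `from_mont(mont_mul(to_mont(a, p, pb), to_mont(b, p, pb), p, pb), p, pb)`, with the two conversions amortised over the many multiplications of a point operation.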
GFP Controller
The ROM is generated with the use of microcode stored in Xilinx BlockROM. This is implemented to reduce the development time of the processor and to increase the flexibility of the design. A major advantage of this is that the instruction set can be updated to perform any number of operations without a need to recompile the entire processor. The instruction set to control the algorithm is stored in ROM and the instructions are processed consecutively.
For the architecture to accommodate this, mode bits are used to set the operation of the ALUs. After initially loading the elliptic curve parameters and Montgomery constants into RAM, the controller performs the operations for the selected cryptographic algorithm. The initial operands and the results from the arithmetic units are stored in a RAM block. This RAM can be configured for single-port or dual-port operation, allowing the speed of the ECP to be increased through parallelisation. The ECP can be programmed to run any number of ALUs in parallel to process an elliptic curve formula; the design is limited only by the size of the FPGA. For this paper, the architecture was evaluated on a Spartan3E XC3S500E, using one to four ALUs operating in parallel.
Performance Results of ECC Algorithms
Table 1 shows the measured results for the FPGA. It first details the post place and route (PPR) clock frequency ($F_{max}$). There is very little variation in clock frequency between the different algorithms; the clock frequency in fact depends on the number and type of ALUs used. The minimum PPR frequency across all configurations was recorded when using four multipliers.
The circuit design also remains the same for each of the four formulae, differing only in the size of the instruction set in ROM. The area, therefore, approximately remains the same for each of the four different formulae, and increases equivalently with more ALUs.
The average power (Pwr) dissipation of the processor was measured at a frequency of 10 MHz. The current drawn by the FPGA on its $V_{CCINT}$ and $V_{CCAUX}$ lines was measured, along with the voltage supplied to each line by the board's voltage regulator. These voltage and current measurements were then used to calculate the total average power consumed on both lines. The average energy per point multiplication is also given. It is calculated from the average power and the average time per point multiplication, as shown in Tables 2 and 3, based on the number of clock cycles and the 10 MHz clock frequency.
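As a compact restatement of this calculation, and using symbols that are assumptions rather than the paper's own notation ($I_{CCINT}$ and $I_{CCAUX}$ for the measured currents, $N_{cyc}$ for the number of clock cycles in one point multiplication, and $f_{clk} = 10$ MHz), the average power and the energy per point multiplication can be written as

$$P_{avg} = V_{CCINT} I_{CCINT} + V_{CCAUX} I_{CCAUX}, \qquad E_{PM} = P_{avg} \cdot \frac{N_{cyc}}{f_{clk}}.$$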
Computation Time
The scheduling was examined, again with a variable number of ALUs, to determine the timing of each algorithm. As described in Algorithm 8, a multiplication is executed in $p_b + 2$ clock cycles, while additions and subtractions each take 2 clock cycles. Using a key size of 192 bits and performing all multiplications and additions/subtractions for Algorithms 3, 4, 5, 6 and 7, defined in Sections 2 and 4, the timing results in Tables 2 and 3 were obtained. As can be seen from the tables, the number of multiplication stages required to process an algorithm decreases as parallelisation increases.
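For orientation, the cycle costs stated above can be combined with the operation counts quoted earlier to give a rough single-ALU cost per point operation. The sketch below is a simplification under stated assumptions: every operation runs strictly in sequence on one ALU, and memory access and control overhead are ignored, so the numbers are not the entries of Tables 2 and 3. The dictionary and function names are hypothetical.

```python
FIELD_BITS = 192       # key size used in the paper
ADD_SUB_CYCLES = 2     # each modular addition or subtraction

# (multiplications, additions/subtractions) per point operation, as quoted in the text.
OP_COUNTS = {
    "Weierstrass PA": (16, 7),
    "Weierstrass PD": (10, 4),
    "twisted Edwards PA": (12, 8),
    "twisted Edwards PD": (8, 7),
    "unified twisted Edwards": (14, 5),
}


def sequential_cycles(n_mul, n_add, field_bits=FIELD_BITS):
    """Cycles for one point operation on a single ALU with no overlap."""
    mul_cycles = field_bits + 2          # one Montgomery multiplication
    return n_mul * mul_cycles + n_add * ADD_SUB_CYCLES


CYCLES = {name: sequential_cycles(m, a) for name, (m, a) in OP_COUNTS.items()}
```

Under these assumptions a Weierstrass PA costs 3118 cycles against 2344 for a twisted Edwards PA; the multiplications dominate, which is why the schedules are discussed in terms of multiplication stages.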
Next we tested each of the formulae with a 192-bit value for the key (k) and a Hamming weight of $k/2$ to measure an iteration of the algorithms. The graph in Figure 3 shows the timing results. The standard twisted Edwards algorithm performs on average 60% faster than its equivalent Double-and-Add algorithm for all numbers of ALUs. The SPA-resistant unified twisted Edwards performs comparably to the non-SPA-resistant Double-and-Add method, and is faster for one or four ALUs. However, neither the standard twisted Edwards nor Double-and-Add achieves any great improvement in timing when increasing from 3 ALUs to 4 ALUs, due to the limited parallelism of the algorithms. The standard twisted Edwards again gives the best value across the range of ALUs, with 3 ALUs giving the best performance. For the unified twisted Edwards and both Double-and-Add formulae, 4 ALUs give the fastest timing.
Efficiency
Although the ECP can be configured to run any number of ALUs in parallel, some operations in a particular formula depend on the results of other operations, which limits the amount of parallelism that can be exploited [17]. This leads to redundancy in the design: there comes a point where adding extra ALUs decreases the efficiency, yielding a small increase in speed at the cost of a large increase in area. We define the efficiency as the number of multiplication operations, and therefore the number of ALUs, that can be run in parallel at each time stage, relative to the overall number of ALUs available for parallel processing in that time stage.
Figure 4 shows the schedule for a point operation for a four-ALU strongly unified twisted Edwards design, where the first eight memory addresses contain the points used for the point doubling and point addition. There is only one stage where all four ALUs can be used in parallel, while there are two stages where only one ALU can be in use. Compared against the twisted Edwards design which uses three ALUs, shown in Table 3, this results in approximately the same completion time for a much larger circuit area.
The efficiency for one to four ALUs for each of the four formulae is shown in Figure 5. From the graph it is clear that all of the formulae drop in efficiency as more ALUs are added, with the unified twisted Edwards being the worst case and the Double-and-Add-Always making the best use of parallelism.
Area Time Product
The area-time (AT) product was calculated to represent any speed increase against the corresponding increase in size, as shown in Figure 6. This gives a more accurate representation of the cost of each additional ALU in relation to the overall system. The minimum AT value, i.e. the most efficient combination in an area-time sense, is again achieved by the standard twisted Edwards, which gives the best value across the range of ALUs, with 2 ALUs giving the best performance. For the unified twisted Edwards, a single ALU gives the best performance, while the Double-and-Add and Double-and-Add-Always formulae give the best AT at 2 and 3 ALUs respectively.
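One way to read the efficiency and area-time measures used in this section is sketched below. The interpretation of efficiency as ALU utilisation averaged over the multiplication stages is an assumption, the function names are hypothetical, and the schedule is an invented example: it merely matches the description of the four-ALU unified design (14 multiplications, one stage using all four ALUs, two stages using a single ALU) and is not the schedule of Figure 4.

```python
def parallel_efficiency(mults_per_stage, n_alus):
    """Fraction of available ALU slots that perform a multiplication."""
    if n_alus < 1 or not mults_per_stage:
        raise ValueError("need at least one ALU and one stage")
    if any(m < 0 or m > n_alus for m in mults_per_stage):
        raise ValueError("a stage cannot use more ALUs than are available")
    return sum(mults_per_stage) / (len(mults_per_stage) * n_alus)


def area_time_product(area, time):
    """Area-time product; lower values mean a better area/speed trade-off."""
    return area * time


# Hypothetical schedule for illustration only (14 multiplications on 4 ALUs).
HYPOTHETICAL_SCHEDULE = [4, 3, 3, 2, 1, 1]
EFFICIENCY_4_ALUS = parallel_efficiency(HYPOTHETICAL_SCHEDULE, 4)
EFFICIENCY_1_ALU = parallel_efficiency([1] * 14, 1)
```

With a single ALU every stage is fully used, whereas the invented four-ALU schedule leaves a large share of the slots idle; an idle slot still costs area, which is what the AT product penalises.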

Conclusions
In this paper, we presented implementation results of an ECP with a reconfigurable architecture and used it to compare the standard and strongly unified formulae for twisted Edwards curves against the Double-and-Add and Double-and-Add-Always formulae. We showed that the twisted Edwards performs on average 60% faster and uses less area than the Double-and-Add, and that the performance of the SPA-resistant strongly unified version of the twisted Edwards far exceeds that of its Double-and-Add-Always equivalent. We also showed that, using one or four ALUs operating in parallel, the strongly unified twisted Edwards executes faster than the Double-and-Add for an equivalent number of ALUs. Future work could examine the cost of converting from a Double-and-Add implementation to a strongly unified twisted Edwards curve to gain SPA resistance at comparable speeds.
