This work revisits the recent complete addition formulas for prime order elliptic curves of Renes, Costello and Batina in light of parallelization. We introduce the first hardware implementation of the new formulas on an FPGA based on three arithmetic units performing Montgomery multiplication. Our results are competitive with current literature and show the potential of the new complete formulas in hardware design. Furthermore, we present algorithms to compute the formulas using anywhere between two and six processors, using the minimum number of parallel field multiplications.
Introduction
The main operation in many cryptographic protocols based on elliptic curves is scalar multiplication, which is performed via repeated point addition and doubling. In early works formulas for the group operation used different sequences of instructions for addition and doubling [23, 30] . This resulted in more optimized implementations, since doublings can be faster than general additions, but naïve implementations suffered from side-channel attacks [24] . Indeed, as all special cases have to be treated differently, it is not straightforward to come up with an efficient and side-channel secure implementation.
A class of elliptic curves which can avoid these problems is the family of curves proposed by Bernstein and Lange, the so-called Edwards curves [9] . Arguably, the primary reason for their popularity is their "complete" addition law, that is, a single addition law which can be used for all inputs. The benefit of having a complete addition law is obvious for both simplicity and side-channel security. Namely, having only one set of formulas that works for all inputs simplifies the task of implementers and helps to thwart some side-channel attacks, e. g. safe-error attacks [41] . After the introduction of Edwards curves, more curves models have been shown to possess complete addition laws [7, 8] . Moreover, (twisted) Edwards curves are being deployed in software, for example in the library NaCl [11] . In particular, software implementations typically rely on specific curves, e. g. on the Montgomery curves Curve25519 [6] by Bernstein or Curve448 [21] proposed by Hamburg. Moving to a hardware scenario, using the nice properties of these specific curves is not as straightforward anymore. Hardware development is costly, and industry prefers IP cores as generic solutions for all possible clients. Moreover, backwards compatibility is a serious concern, and curves defined over large prime fields in most standards (e. g. [13, 16, 31] ) are prime order curves written in short Weierstrass form. This issue prohibits using (twisted) Edwards, (twisted) Hessian and Montgomery models, since these impose that the group order must be composite. The desire for complete addition formulas for prime order curves in short Weierstrass form was recognized and Renes, Costello and Batina [33] proved this to be realistic. They present complete addition formulas with an efficiency loss of 34%-44% (depending on the size of the field) in software when compared to the widely used incomplete formulas based on Jacobian coordinates.
As the authors mention, one can expect to have better performance in hardware, but they do not present results. In particular, when using Montgomery multiplication one can benefit from very efficient modular additions and subtractions (which appear a lot in their formulas), which changes the performance ratio derived in the original paper. Therefore, it is of interest to investigate the new complete formulas from a hardware point of view. In this paper we show that the hardware performance is competitive with the literature, building scalar multiplication on top of three parallel Montgomery multipliers. In more detail, we summarize our contributions as follows:
we present the first hardware implementation based on the work of [28] , working for every prime order curve over a prime field of up to 522 bits, and obtain competitive results; we present algorithms for various levels of parallelism for the new formulas. We show that the number of parallel multiplications decreases for up to six Montgomery multipliers.
All algorithms and their respective Magma [12] verification code can be found in Appendix B and C. The entire VHDL source code is provided on https://github.com/pmassolino/hw-tripleweierstrass.
Related work. There are numerous works on curve-based hardware implementations. Mostly on various FPGA platforms, making a meaningful comparison very difficult. Güneysu and Paar [20] proposed a new speed-optimized architecture that makes intensive use of the DSP blocks in an FPGA platform. Guillermin [19] introduced a prime field ECC hardware architecture and implemented it on several Altera FPGA boards. The design is based on Residue Number System (RNS), facilitating carry-free arithmetic and parallelism. Yao et al. [40] followed the idea of using RNS to design a high-speed ECC co-processor for pairings. Sakiyama et al. [35] proposed a superscalar coprocessor that could deal with three different curve-based cryptosystems, all in characteristic 2 fields. Varchola et al. [38] designed a processor-like architecture, with instruction set and decoder, on top of which they implemented ECC. This approach has the benefit of having a portion written in software, which can be easily maintained and updated, while having special optimized instructions for the elliptic curve operations. The downside of this approach is that the resource costs are higher than a fully optimized processor. As it was the case for Güneysu and Paar [20] , their targets were standardized NIST prime curves P-224 and P-256. Consequently, each of their synthesized circuit would only work for one of the two primes. Pöpper et al. [32] follow the same approach as Varchola et al. [38] , with some side-channel related improvements. The paper focuses on analyze of countermeasures and their effective cost. Roy et al. [34] followed the same path, but with more optimizations with respect to resources and only for curve NIST P-256. However, the number of Block RAMs necessary for the architecture is much larger than of Pöpper et al. [32] or Varchola et al. [38] . Fan et al. [17] created an architecture for special primes and curves, namely the standardized NIST P-192. The approach was to parallelize Montgomery multiplication and formulas for point addition and doubling on the curve. Vliegen et al. [39] attempted to reduce the resources with a small core aimed at 256-bit primes.
Organization. We start with preliminaries in Section 2, and briefly discuss parallelism for the complete formulas in Section 3. Finally we present our hardware implementation using three Montgomery multipliers in Section 4.
Preliminaries for elliptic curve cryptography
We briefly review the basics of elliptic curves. For a more elaborate study see [18, 37] . Let p = 2, 3 be prime, q = p n for a positive integer n, and F q be the finite field of q elements. Let E/F q be an elliptic curve defined over F q with specified point O, which we call the point at infinity. Every such curve is isomorphic to a curve in short Weierstrass form
such that a, b ∈ F q and O = (0 : 1 : 0). Any point not equal to O is called an affine point. It is easy to see that a point (X : Y : Z) is affine if and only if Z = 0, hence setting x = X/Z and y = Y /Z it follows that the affine points correspond to (x, y) such that
The points on E form a group, with O as its identity element. Denote by E(F q ) the subgroup of F q -rational points of E. Its order is its order as a group, and in this document we are only concerned with the prime order case.
There are many ways to compute the group law on E, see [10] . These differ depending on the representation of the curve and the points. As mentioned in the introduction, we put emphasis on complete addition formulas for prime order elliptic curves. The work of Renes et al. [33] presents addition formulas for curves in short Weierstrass form with homogeneous coordinates. They compute the sum of two points P = (X 1 : Y 1 : Z 1 ) and Q = (X 2 :
This computes, without exception, P + Q for any two points P, Q ∈ E, and has been shown to still be efficient (in software).
Elliptic-curve cryptography [23, 30] commonly relies on the hard problem called the "elliptic curve discrete logarithm problem". This means that given two points P, Q on an elliptic curve, it is hard to find a scalar k ∈ Z such that Q = kP , if it exists. Therefore the main component of curve-based cryptosystems is the scalar multiplication operation (k, P ) → kP . Since in many cases k is a secret, this operation is very sensitive to attacks. In particular many sidechannel attacks [24, 5] and correspondingly countermeasures [15] have been proposed. To ensure protection against simple power analysis (SPA) attacks it is important to use regular scalar multiplication algorithms, e. g. the Montgomery ladder [22] or Double-And-Add-Always [15] , executing both an addition and a doubling operation per scalar bit.
Parallelism
An important way to increase the efficiency of the implementation is to use multiple Montgomery multipliers in parallel. In this section we give a brief explanation for our choice of three multipliers.
The addition formulas on which our scalar multiplication is built are shown in Algorithm 1 of [33] . We choose to ignore additions and subtractions since we assume to be relying on a Montgomery multiplier for which the cost of field multiplications is far higher than that of field additions. The total (multiplicative) cost in the most general case is 12M + 2m a + 3m 3b
1 . Because our processors do not distinguish full multiplications and multiplications by constants, we consider this cost simply as 17M. The authors of [33] introduce optimizations for mixed addition and doubling, but in our case this only saves a single multiplication (and some additions). Since this does not make up for the price we would have to pay for the implementation of a second algorithm, we only examine the most general case. In Table 1 we show the interdependencies of the multiplications.
Stage Result
Multiplication
1, 2, 5, 9, 10 3 14 (3 0 + 7) · ( 10 + 9) 0, 7, 9, 10 Table 1 . Dependencies of multiplications inside the complete addition formulas This allows us to write down algorithms for implementations running n processors in parallel. Denote by M n resp. a n the cost of doing n multiplications resp. additions (or subtractions) in parallel. In Table 2 we present the costs for 1 ≤ n ≤ 6. We make the simple approximations that M n = M and a n = a. We note that this ignores some practical aspects. For example a larger number of Montgomery multipliers can result in scheduling overhead, which we do not take into account. For our implementation we have chosen for n = 3, i. e. three Montgomery multipliers. This number of multipliers achieves a good area-time trade-off, while obtaining a favorable speed-up compared to n = 1. Moreover, the aforementioned practical issues (e. g. scheduling) are not as complicated to deal with as for larger n.
Alg. 6 Table 2 . Efficiency approximation of the number of Montgomery multipliers against the area used. All algorithms for 2 ≤ n ≤ 6 are from this work.
Implementation of the formulas with three processors
In this section we introduce a novel hardware implementation, parallelizing the new formulas using three Montgomery processors. We make use of the Montgomery processors which have been proposed by Massolino et al. [28] for Microsemi R IGLOO2 R FPGAs, for which the architecture is shown in Figure 1 . We give a short description of the processor in Section 4.1, but for more details on its internals we refer to [28] . As a consequence of building on top of this processor, we target the same FPGA. However, it is straightforward to port it to other FPGAs, or even ASICs, which have a Montgomery multiplier with the same interface and instructions. The elliptic curve scalar multiplication routine is constructed on top of the Montgomery processors. As mentioned before, to protect against simple power analysis attacks, we im-plement a regular scalar multiplication algorithm (i. e. Double-And-Add-Always [15] ). The algorithm relies on three registers R 0 , R 1 and R 2 . The register R 0 contains the operand which is always doubled. The registers R 1 resp. R 2 contain the result of the addition when the exponent bit is zero resp. one. This algorithm should be applied carefully since it is prone to fault attacks, see [4] . From a very high level point of view the architecture consists of the three Montgomery multipliers and a single BRAM block, shown in Figure 2 . We note that this BRAM block is more than large enough to store the necessary temporary variables. So, although Algorithm 3 tries to minimize the number of these, this is not necessary for our case. In the rest of this section we elaborate on the details of the implementation. 
The Montgomery processor
Massolino et al. [28] proposed two different Montgomery processors. Our scalar multiplication builds on top of "version 2", which has support for two internal multipliers and two memory blocks. It can perform three operations: Montgomery multiplication, addition without reduction and subtraction without reduction. To perform Montgomery multiplication, the processor employs the FIOS algorithm proposed by Koç et al. [25] . In short, FIOS computes the partial product and partial reduction inside the same iterative loop. This can be translated into a hardware architecture, see Figure 1 , with a unit for the partial product and another partial modular reduction. The circuit behaves like a three-stage pipeline: in the first stage operands are fed into the circuit, in the second they are computed and in the third they are stored into memory. The pipeline system is reused for the addition and subtraction operation in the multiplier, and values are added or subtracted directly. In case of subtraction the computation also adds a multiple of the prime modulus. Those operations can be done without applying reduction, because reduction will be applied later during a multiplication operation. However, there is a limit to the number of consecutive additions/subtractions with no reduction, on which we elaborate in Section 4.4.
Memory
The main RAM memory in Figure 2 is subdivided in order to lower control logic resources and to facilitate the interface. The main memory operates as a true dual port memory of 1024 words of 17 bits. We create a separation in the memory, composing a big word of 32 words (i. e. 544 bits). This way we construct the memory as 32 × 32 big words. A big word can accommodate any temporary variable, input or output of our architecture. An exception is possibly the scalar of the point scalar multiplication. Although a single word would be large enough to contain 523-bit scalars (in the largest case of a 523-bit field), the scalar blinding technique can double the size of the scalar [15] . Therefore, we use two words to store the scalar. By doing this, it will in the future be possible to execute scalar multiplication with a blinded scalar [14] . Lastly, there is a 17-bit shift register into which the scalar is loaded word by word.
Control logic
The formulas and control system are done through two state machines: a main one which controls everything, and one related to memory transfer. The memory-transfer state machine was created with the purpose to reduce the number of states in the main machine. This was done by providing the operation of transfer between the main memory and the Montgomery processor's memory. Therefore, the main machine can transfer values with just one state, and can reuse most of the transfer logic. This memorytransfer machine becomes responsible for various parts of the bus between main memories, processors and other counters. However, the main state machine still has to be able to control everything. Hence, the main state machine shares some components with the memory transfer machine, increasing control circuit costs.
The main state machine controls all the circuits that compose the entire cryptographic core. Given that it controls the entire circuit, the machine also has the entire Table 2 scheduling implemented as states. The advantage of doing this through states is the possible optimization of the design and the entire control. However, the cost of maintenance is a lot higher than a small instruction set or microcode that can also implement the addition formulas or scalar multiplication. Because the addition formulas are complete, it is possible to reduce the costs of performing both addition and doubling through only the addition formulas. This decreases the amount of states and therefore makes the final implementation a lot more compact. Hence, the implementation only iterates over the addition formulas, until the end of the computations.
Consecutive additions
For the Montgomery processor to work in our architecture, part of the original design was changed. The authors of [28] did not need to reduce after each addition or subtraction, as they assumed that these operations would always be followed by Montgomery multiplications (and its corresponding reduction). However, they were not able to do multiple consecutive additions and subtractions, as the Montgomery division value r was chosen to be only 4 bits larger than the prime. On the other hand, it is readily seen that in Algorithm 3 there are several consecutive additions and subtractions. To be able to execute these without having to reduce, we need a Montgomery division value at least 5 bits larger than the prime. As a consequence, the processor only works for primes up to 522 bits (as opposed to 523), which is still one bit more than the largest standardized prime curve [31].
Scheduling
The architecture presented in Figure 2 has one dual port memory, whereas it has three processors. This means that we can only load values to two processors at the same time. As a consequence the three processors do not run completely in parallel, but one of the three is unsynchronized. Table 3 showcases how operations are split into different processors. They are distributed with the goal of minimizing the number of loads and stores for each processor and to minimize MM2 being idle. The process begins by loading the necessary values into MM0 and MM1 and executing their respective operations. As soon as the operations in MM0 and MM1 are initialized, it loads the corresponding value into MM2 and executes the operation. As soon as MM0 and MM1 finish their operations, this process restarts. Since the operations executed in MM2 are not synchronized with those in MM0 and MM1, both of the operations in MM0 and MM1 should be independent of the output of MM2, and vice versa. Furthermore, since multiplications are at least ten times slower than additions for our processor choice [28] , the additions and subtractions from lines seven and eight in Algorithm 3 can be done by the otherwise idle processor MM2 in stage six. This makes them basically free of cost. Table 3 . Scheduling for point addition P ← P + Q, where P = (X 1 : Y 1 : Z 1 ) and Q = (X 2 : Y 2 : Z 2 ). For doubling simply put P = Q.
Pre-and post-processing
For completeness, we include cycle counts for the necessary pre-processing and post-processing computations. To initialize the scalar multiplication on a point P = (X : Y : Z) we perform three multiplications X R = X · R, Y R = Y · R and Z R = Z · R to put the values in the Montgomery domain. After the scalar multiplication is complete, we obtain a point kP = (X R : Y R : Z R ). Now we apply Algorithm 1 to Z R to obtain Z −1 . Finally we compute x = (X R ·Z −1 )/R and y = (Y R ·Z −1 )/R, so that kP = (x : y : 1). Note that all multiplications are Montgomery multiplications, and we avoid the need for an explicit division by R to exit the Montgomery domain. Table 4 presents cycle counts for pre-processing, post-processing and point addition for different field sizes. Table 4 . Cycle count for pre-processing, post-processing and point addition. Point doubling has the same cycle count as point addition.
Algorithm 1 Inversion using Fermat's little theorem
Require: Z R Ensure: Z −1 , where Z R = Z · R. 1. M0 ← 1 2. M1 ← Z R 3. for 0 ≤ i ≤ log (p − 2) do 4. if (p − 2)i = 1 then 5. M0 ← (M1 · M0)/R 6. else 7. M1 ← (M1 · M1)/R
Comparison
As our architecture supports primes from 116 to 522 bits, we can run benchmarks and do comparisons for multiple field sizes. The results for common prime sizes are shown in Table 6 in Appendix A. In this section we consider only the currently widely adopted 128-bit security level, presented in Table 5 . Integer addition, subtraction and Montgomery modular multiplication results are the same as in Massolino et al. [28] , while the cycle counts for pre-processing, post-processing and point addition are in Table 4 . This is the first work implementing the new complete formulas for elliptic curves in short Weierstrass form [33] in hardware, and leads to a scalar multiplication routine which takes about 8.61ms for a 256-bit prime.
To understand our results better, we not only provide area results for IGLOO2 FPGA, but also for other technologies. This was done by describing the components architecture with behavioral VHDL. Both the behavioral VHDL and the components instantiation behave the same as in our architecture, guaranteeing correct execution. However, since they are not optimized for other FPGAs inner components, the results are only an approximation, and can likely be further improved. Nevertheless, in some cases we achieve faster scalar multiplication than in our original platform due to better and more expensive technology, e. g. in the Zynq, Virtex 5 and Virtex 6.
Even with a lot of different results, it is not straightforward to do a well-founded comparison among works in the literature. Table 5 contains different implementations of elliptic-curve scalar multiplications, but they have different optimization goals. For example we top [39] in terms of milliseconds per scalar multiplication when compared with the same FPGA and field arithmetic. On the other hand [34, 26] have a trade-off between area and time, even though their proposal was aimed and optimized for a different FPGA than ours. In [38] these optimizations become more clear with a lower area and better time results on the Virtex II-Pro platform.
Other works [1, 20, 36, 19, 29, 27] outperform our architecture in terms of speed, but use a much larger number of embedded multipliers. Also, implementations only focusing on NIST curves are able to use the special prime shape, yielding a significant speed-up. Depending on the needs of a specific hardware designer, this specialization of curves might not always be desirable. As mentioned before, many parties in industry might prefer generic cores. Despite these remarks, we argue that the implementation is competitive with the literature, making a similar trade-off between size and speed. Thus the new formulas can be implemented with little to no penalties, while having the benefit of not having to deal with exceptions.
Conclusions
In this work we provide addition formulas that can be used from one to six processors and a proof of concept architecture for three processors. The formulas can be applied not only in hardware architectures with a great array of processors, but also in software implementations that are using vector instructions. As shown in our proof of concept implementation, the formulas have competitive results with other implementations in the literature. Since our implementation is still a proof of concept, several further optimizations could be made to achieve even better results. Table 5 , the difference is the cycle count for each field.
t8 ← t2 + t8;
t5 ← t5 + t8;
t9 ← t9 · t10; ( 14) t0 ← t0 · t6; ( 11) t1 ← t1 · t5; ( 15) Y9 ← t6 + t9;
Algorithm 3 Parallelized complete addition formulas for a prime order elliptic curve in Weierstrass form, using three processors Require: P = (X1 : Y1 : Z1), Q = (X2 : Y2 : Z2), E : Y 2 Z = X 3 + aXZ 2 + bZ 3 and b3 = 3 · b.
Ensure: (X3 : Y3 : Z3) = P + Q.
t8 ← a · t2; ( 7) t9 ← t0 + t0;
Algorithm 4 Parallelized complete addition formulas for a prime order elliptic curve in Weierstrass form, using four processors Require: P = (X1 : Y1 : Z1), Q = (X2 : Y2 : Z2), E : Y 2 Z = X 3 + aXZ 2 + bZ 3 and b3 = 3 · b.
Ensure: (X3 : Y3 : Z3) = P + Q. t0 ← t0 − t4;
t7 ← a · t6; ( 7) t11 ← t4 + t7;
t2 ← t2 − t6;
t2 ← a · t2; ( 8) t10 ← t0 · t10; ( 16) t1 ← t1 · t5; ( 15) Z3 ← t1 + t10;
t3 ← Y2 + Z2;
t6 ← Z1 · Z2; ( 2) t1 ← t1 − t6;
t8 ← b3 · t6; ( 6) t2 ← t2 − t4;
t10 ← t10 + t11;
Algorithm 5 Parallelized complete addition formulas for a prime order elliptic curve in Weierstrass form, using five processors Require: P = (X1 : Y1 : Z1), Q = (X2 : Y2 : Z2), E : Y 2 Z = X 3 + aXZ 2 + bZ 3 and b3 = 3 · b.
1. t5 ← X1 + Y1;
t8 ← X2 + Z2;
3. t10 ← Y2 + Z2; t11 ← t0 + t0;
4. t3 ← t3 − t1;
5. t5 ← t9 · t10; ( 5) t8 ← a · t4; ( 8)
6. t5 ← t5 − t1;
t10 ← t6 + t8;
7. t0 ← a · t4; ( 10)
8. t0 ← t0 + t9;
t5 ← t5 − t2;
9. t1 ← t3 · t7; ( 11) t8 ← t11 · t0; ( 14) 10. X3 ← t1 − t2;
t6 ← X2 + Y2;
t6 ← b3 · t2; ( 6) t9 ← b3 · t4; ( 9) t11 ← t11 + t7;
t6 ← t3 · t11; ( 16) t7 ← t1 − t10;
t2 ← t5 · t0; ( 12) t9 ← t5 · t10; ( 15) Y3 ← t4 + t8;
t7 ← a · t2; ( 7) t4 ← t0 − t7;
t10 ← t1 + t10;
t4 ← t10 · t7; ( 13) Z3 ← t9 + t6;
Algorithm 6 Parallelized complete addition formulas for a prime order elliptic curve in Weierstrass form, using six processors Require: P = (X1 : Y1 : Z1), Q = (X2 : Y2 : Z2), E : Y 2 Z = X 3 + aXZ 2 + bZ 3 , b3 = 3 · b and a2 = a 2 .
Ensure: (X3 : Y3 : Z3) = P + Q. 
