Abstract. This paper presents a scalable hardware implementation of the two commonly used public-key cryptosystems, RSA and the Elliptic Curve Cryptosystem (ECC), on the same platform. The hardware accelerator can be scaled from a very small design (fewer than 20 Kgates) targeting wireless applications up to a very large design (more than 100 Kgates) used for network security. In the latter configuration it can include a few dedicated large-number arithmetic units, each of which is a systolic array performing Montgomery Modular Multiplication (MMM). The bound on the Montgomery parameter has been optimized to facilitate more secure ECC point operations. Furthermore, we present a new option for the CRT scheme which is less vulnerable to side-channel attacks.
Introduction
The security of communication, or more generally of digital data, is founded on various cryptographic algorithms. Implementations of Public Key Cryptography (PKC) in particular present a challenge on the vast majority of application platforms, from software to hardware. Software platforms are cheaper and more flexible, but it appears that only hardware implementations provide a suitable level of security, especially with respect to side-channel attacks. The two best known and most widely used public-key cryptosystems are RSA [26] and ECC [18], [13]. RSA is believed to be in its "sunset" but still keeps up with requirements. Because of factors such as well-developed speed-ups based on Chinese Remainder Theorem (CRT) techniques and its suitability for hardware, RSA is the main technology for high-speed applications in network security, finance, etc. On the other hand, ECC is expected to take the lead in wireless applications. The reason is that ECC offers higher speed, lower power consumption and smaller certificates, all of which are necessities in these areas, including the smartcard industry. In short, it is highly desirable to develop an architecture which can efficiently perform both RSA and ECC: RSA for VPNs, banking, etc., and ECC mostly for wireless applications.
Our contribution deals with an FPGA implementation of the RSA and ECC cryptosystems over a field of prime characteristic. The architecture for Montgomery Modular Multiplication (MMM) used in this work is efficient and secure [22]. The systolic array supports arbitrary operand precision, hence easily bridging the gap between the bit-lengths for ECC (from 160 bits) and the 2048-bit (or longer) moduli for RSA. The notion of scalability we discuss includes both freedom in the choice of operand precision and adaptability to any desired gate complexity. The latter is usually referred to as "flexibility". We use modular exponentiation based on Montgomery's method without any modular reduction, achieving the best possible bound [29], [3]. We are the first to introduce a similar bound for ECC, which allows us to perform very secure and yet efficient point addition and doubling. We show that with two or more arithmetic units a high level of parallelism can be achieved by alternating ECC operations between those units. The resulting parallelism between units, and also between cells of the systolic array, is beneficial for side-channel resistance. Moreover, in this work we introduce a new variation of Garner's scheme for CRT decryption, which has a built-in countermeasure against timing and power analysis based attacks.
Since the introduced architecture was dedicated to RSA applications, it was natural to implement elliptic curve arithmetic in GF(p). In this way all required components were already available, as ECC in GF(p) is based on ordinary modular arithmetic. Assuming projective coordinates are used, modular multiplication remains the most time-consuming operation for ECC. Hence, an efficient implementation relies on efficient modular multiplication, as is the case for RSA. Nevertheless, it is also important to focus on constant-time algorithms, which are less likely to leak side-channel information. To conclude, in this work we aimed to introduce a secure combined RSA-ECC implementation which also meets the high demands on speed implied by the state of the art in RSA hardware implementations; see for example [8].
The remainder of this paper is organized as follows. Section 2 gives a survey of previous hardware implementations of public-key algorithms relevant to our work. In Section 3, we outline the architecture of the targeted implementation platform. Section 4 describes new options for point operations. In Section 5, the implementation results and timings are given. Section 6 introduces a new variant of Garner's scheme for CRT which is equally efficient but more resistant to side-channel attacks. Implications of the proposed changes for the security of both RSA and ECC are considered in Section 7. Section 8 concludes the paper.
Related Work
This section reviews some of the most relevant previous work on hardware implementations of PKC. The vast majority of published work on PKC implementations deals with software platforms. Some of the work is done on FPGAs, and only very few papers present an ASIC implementation of ECC over a field of prime characteristic. Most of the work is done in binary fields, and some authors have considered dual-field implementations, i.e. ECC in both prime and binary fields.
Goodman and Chandrakasan proposed a domain-specific reconfigurable cryptographic processor (DSRCP) in [8]. The DSRCP performs a variety of algorithms ranging from modular integer arithmetic to elliptic curve arithmetic. They mainly discussed arithmetic in binary fields. The most recent published work is that of Satoh and Takano [27]. They presented the dual-field multiplier with the best performance so far in both types of fields. The throughput of EC scalar multiplication is maximized by the use of a Montgomery multiplier and an on-the-fly redundant binary converter. The great quality of their design lies in its scalability in operand size and its flexibility between speed and hardware area. Another hardware solution for both types of fields was presented by Wolkerstorfer in [31]. The author introduced a low-power design which features a short critical path to enable high clock frequencies. Most operations are executed within a single clock cycle, and a redundant number representation is used. The idea of a unified multiplier was first introduced by Savaş et al. in [28]. The authors discussed a scalable and unified architecture for a Montgomery multiplication module. They deployed an array of word-size processing units organized in a pipeline. The same idea is the basis of the work of Grosschädl [9]. The bit-serial multiplier introduced there performs multiplications in both types of fields. The author also modified the classical MSB-first version of iterative modular multiplication. All concepts are introduced in detail, but the actual VLSI implementation is not given. Some hardware implementations in GF(p) on FPGA are also known. An ECC-only processor over fields GF(p) was proposed by Orlando and Paar [21]. They proposed the so-called Elliptic Curve Processor (ECP), which is scalable in terms of area and speed. The ECP is also best suited for projective coordinates and uses a new type of high-radix precomputation-based Montgomery multiplier. The scalability of the multiplier to larger fields was also verified for a field size of 521 bits. The authors estimated a timing of 3 ms for computing one point multiplication in a 192-bit prime field.
Örs et al. discussed an ECC processor which is optimized for MMM in [23]. They described an efficient implementation of an elliptic curve processor over GF(p). The processor can be programmed to execute modular multiplication, addition/subtraction, multiplicative inversion, EC point addition/doubling and point multiplication. A detailed overview of hardware implementations for PKC is given in [4].
Much of the work on ECC over GF(p) still deals with software implementations, whereas many hardware implementations exist over binary fields. It appears that arithmetic in characteristic 2 is easier to implement, and that area and power consumption are smaller than in the case of GF(p). This is believed to be true, but only for platforms where specialized coprocessors for finite field arithmetic are not available. On the other hand, an advantage of the prime field is its suitability for both RSA and ECC with a resourceful sharing of hardware.
Previous Work and Background
In this paper we discuss how an FPGA implementation of Montgomery multiplication that was originally designed for RSA can efficiently be used to perform prime field ECC operations. This design consists of a Large Modular Montgomery Multiplier (MMM), designed as a systolic array. This array is one-dimensional and consists of a fixed number of Processing Cells (PCs). Figure 1 shows a schematic of the systolic array that was implemented in the MMM. A PC contains adders and multipliers that can process α bits of X and β bits of Y in one clock cycle. Here X and Y are the multiplicand and multiplier. Each PC calculates $(T_j + x_i y_j + m_i N_j)/2^{\alpha}$ in each clock cycle. The detailed description is given in Section 4.4. In Montgomery's original formulation, a reduction was needed after each multiplication in the exponentiation algorithm [19]. The inputs had the restriction X, Y < N and the output T was bounded by T < 2N. As a result, in the case T ≥ N, N must be subtracted so that the output can be used as input of the next multiplication. To avoid this subtraction, a bound for R can be calculated such that for inputs X, Y < 2N the output is also bounded by T < 2N.
Systolic Array
In [3] the need of avoiding this reduction after each multiplication is addressed. In practice this means that the output of the multiplication can be directly used as an input of the next Montgomery multiplication. The following theorem is proven in [3] .
Theorem 1. The result of a Montgomery multiplication $XYR^{-1} \bmod N$ is smaller than $2N$ when $X, Y < 2N$ and $R > 4N$.
The final round in the modular exponentiation is the conversion to the integer domain, i.e. calculating the Montgomery multiplication of the last result and 1. The same arguments prove that this final step remains within the following bound: $\mathrm{Mont}(T, 1) \le N$. In practice, $A^B \bmod N = N$ will never occur since $A \neq 0$.
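The reduction-free chaining described above can be illustrated with a minimal Python sketch. It is not the systolic-array implementation: `mont_mul` and `mont_exp` are hypothetical names, big-integer arithmetic stands in for the hardware, and the choice of R as the smallest power of two with R > 4N (the bound quoted from [11, 29] later in the paper) is an assumption of the sketch. The modulus must be odd.

```python
def mont_mul(x, y, n, r_bits):
    """Montgomery product (x*y + m*n) / R with R = 2**r_bits.

    No final conditional subtraction is performed, so the result is only
    guaranteed to be congruent to x*y*R^-1 mod n, not fully reduced.
    """
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r       # N' = -N^-1 mod R (n must be odd)
    m = (x * y * n_prime) % r            # makes x*y + m*n divisible by R
    return (x * y + m * n) >> r_bits     # exact division by R


def mont_exp(a, e, n):
    """Left-to-right square-and-multiply built only from mont_mul calls."""
    r_bits = n.bit_length() + 2          # assumption: R = 2**(bitlen + 2) > 4N
    r = 1 << r_bits
    a_bar = (a * r) % n                  # operand in Montgomery form
    t = r % n                            # Montgomery form of 1
    for bit in bin(e)[2:]:
        t = mont_mul(t, t, n, r_bits)    # output is fed straight back in
        if bit == "1":
            t = mont_mul(t, a_bar, n, r_bits)
        assert t < 2 * n                 # the bound that makes chaining safe
    return mont_mul(t, 1, n, r_bits)     # conversion back to the integer domain


if __name__ == "__main__":
    print(mont_exp(65, 17, 3233) == pow(65, 17, 3233))
```

Every intermediate value stays below 2N, so each output can be used directly as the next input, and only the final multiplication by 1 brings the result back to the ordinary range.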
ECC Processor
The MMM can be used not only for fast RSA implementations but also for ECC point operations in the prime field. Due to the scalability of the design, the FPGA architecture can perform both, i.e. efficient exponentiations on large operands (for RSA) and modular multiplications on the smaller ECC operands. Figure 2 gives a schematic of an FPGA implementation for ECC. One or two MMMs are used to perform the modular (Montgomery) multiplications. A Large Number Co-Processor (LNCP) is added to the design to perform the additions and subtractions. These units have their own RAMs and are connected by a data bus.
As already explained, the performance of an elliptic curve cryptosystem is primarily determined by the efficient realization of the arithmetic operations (addition, multiplication and inversion) in the underlying finite field. If projective coordinates are used, the cost of inversion becomes negligible. Therefore, coprocessors for elliptic curve cryptography are primarily designed to accelerate the field multiplication. For multiplication in the prime field GF(p), all the work done for the RSA implementation is relevant. The only difference is that shorter bit-lengths are used, i.e. 160-300 bits. Scalability is again a point of concern, and even more so interoperability between different implementations.
New Implementation
In this section we present our FPGA implementation for ECC point operations for prime fields. 
Point Addition
Point addition and doubling can be performed according to the algorithm given in [5] .
Here we assume that the two points that will be added, i.e. $P = (X_1, Y_1, Z_1)$ and $Q = (X_2, Y_2, Z_2)$, are already transformed to projective coordinates and Montgomery representation. The result point
Scheduling of point addition. Point addition can be performed even more efficiently if two MMM units are used. The operations can be conveniently divided between the two units. (Modular) addition and subtraction are done on the Large Number Co-Processor and can be performed at the same time as the Montgomery multiplications. The schedule shown in Table 1 can be used. Table 1 shows that the performance can almost be doubled by using two MMM units.
Point Doubling
Here we discuss a special case of point addition i.e. point doubling, where the points P and Q are respectively given as:
Scheduling of point doubling. Table 2 gives a possible schedule for point doubling over the two MMMs and the LNCP.
The difficulty in scheduling point doubling lies in the operations scheduled on MMM2 and the LNCP, each of which depends on the result of the previous operation.
Table 2. Scheduling of point doubling.
Point multiplication can be implemented as a repeated combination of point addition and point doubling.
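As an illustration of point multiplication as repeated doubling and addition, the following Python sketch runs double-and-add on a toy curve. It deliberately departs from the implementation described here: it uses affine coordinates with a modular inversion instead of projective coordinates in Montgomery representation, and the curve $y^2 = x^3 + 2x + 2$ over GF(17) with base point (5, 1) is a small textbook example chosen only so that results can be checked by hand. All function names are hypothetical.

```python
P_FIELD, A_COEFF = 17, 2          # toy curve y^2 = x^3 + 2x + 2 over GF(17)
G = (5, 1)                        # base point of prime order 19 on this curve


def point_add(p1, p2):
    """Affine point addition; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_FIELD == 0:
        return None                                   # P + (-P) = infinity
    if p1 == p2:                                      # doubling: tangent slope
        lam = (3 * x1 * x1 + A_COEFF) * pow(2 * y1, -1, P_FIELD)
    else:                                             # addition: chord slope
        lam = (y2 - y1) * pow(x2 - x1, -1, P_FIELD)
    lam %= P_FIELD
    x3 = (lam * lam - x1 - x2) % P_FIELD
    y3 = (lam * (x1 - x3) - y1) % P_FIELD
    return (x3, y3)


def scalar_mult(k, point):
    """Left-to-right double-and-add: one doubling per key bit, one addition per set bit."""
    acc = None
    for bit in bin(k)[2:]:
        acc = point_add(acc, acc)                     # point doubling
        if bit == "1":
            acc = point_add(acc, point)               # point addition
    return acc


if __name__ == "__main__":
    print(scalar_mult(2, G))      # (6, 3)
    print(scalar_mult(19, G))     # None, i.e. the point at infinity
```

Note that this plain double-and-add performs an addition only for set key bits, so its operation sequence depends on the scalar; it shows the arithmetic structure only and is not a side-channel-resistant schedule.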
Modular Addition and Subtraction
Modular (i.e. Montgomery) multiplication, modular addition and modular subtraction are the basic operations for point addition. MMM is performed on our highly scalable Montgomery-based multiplier. Modular addition and modular subtraction can be implemented as a repeated addition. However, the number of additions/subtractions would then be data dependent. Let us take a closer look at these two operations. As proven in Section 3.1, the result of an operation on our multiplier will always be smaller than twice the modulus (2N). All modular additions and subtractions in the point addition scheme operate on two outputs of the Montgomery multiplier.
For example:
The result of the modular addition and subtraction is again the input of another Montgomery multiplication and can therefore be larger than the modulus but should be positive. If it would be possible to calculate the previous calculations as "normal" i.e. non-modular addition and subtraction, this would make the operations very efficient but more importantly time constant.
Keeping in mind the "2p" bound for the operands as a result of the bound for the Montgomery parameter, we get:
Our target is now to try to fix a bound for the Montgomery parameter such that we can use these non-modular addition and subtraction instead of the modular forms. To achieve this we must ensure that the inputs X and Y of the Montgomery multiplier that are smaller than 4p result in a Montgomery product that is smaller than 2p.
As already mentioned, in the original implementation of our MMM the inputs of a Montgomery multiplication should be smaller than 2p. We will use the following lemma.
Lemma 1. If the Montgomery parameter R satisfies the following inequality R > 16N , then for inputs X, Y < 4N the result T of the MMM will satisfy: T < 2N (as required).
Proof: The Montgomery multiplication as implemented in the MMM calculates the following:
here m is calculated modulo R. Filling in the bounds for the inputs and R > 16N we get
If n is the length of the modulus N in bits, then $N < 2^n$ and $16N < 2^{n+4}$. With $R = 2^r$, we get $r \ge n + 4$. We have shown that for all modulus lengths, inputs smaller than 4p will result, after a Montgomery multiplication on the MMM, in a value which is smaller than 2p. Therefore we can use the more efficient and time-constant implementation of modular addition and subtraction. Furthermore, there is no loss in efficiency caused by this enlarged bound, because R is usually already bigger than this bound (especially for α, β > 1).
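The bound in Lemma 1 follows from $T = (XY + mN)/R < (16N^2 + RN)/R < 2N$ for $X, Y < 4N$, $m < R$ and $R > 16N$. The Python sketch below checks this numerically for non-modular addition and subtraction. Several details are assumptions of the sketch rather than the paper's definitions: the subtraction is kept positive by adding 2N (the exact form used in the paper is not reproduced here), the names `lazy_add`, `lazy_sub` and `check_lemma` are hypothetical, and the NIST P-192 prime serves only as a sample modulus.

```python
import random

P192 = 2**192 - 2**64 - 1   # sample prime modulus (NIST P-192)


def mont_mul(x, y, n, r_bits):
    """Montgomery product (x*y + m*n) / 2**r_bits without final subtraction."""
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r
    m = (x * y * n_prime) % r
    return (x * y + m * n) >> r_bits


def lazy_add(a, b):
    """Non-modular addition: inputs below 2N give an output below 4N."""
    return a + b


def lazy_sub(a, b, n):
    """Non-modular subtraction, offset by 2N (assumption) so that 0 < result < 4N."""
    return a - b + 2 * n


def check_lemma(n, trials=1000, seed=1):
    """Check that products of lazily added/subtracted values stay below 2N."""
    rng = random.Random(seed)
    r_bits = n.bit_length() + 4                 # R = 2**(bitlen + 4) > 16N
    r_inv = pow(1 << r_bits, -1, n)
    for _ in range(trials):
        a, b, c, d = (rng.randrange(2 * n) for _ in range(4))
        x = lazy_add(a, b)                      # x < 4N
        y = lazy_sub(c, d, n)                   # 0 < y < 4N
        t = mont_mul(x, y, n, r_bits)
        if not (x < 4 * n and 0 < y < 4 * n and t < 2 * n):
            return False
        if t % n != ((a + b) * (c - d) * r_inv) % n:
            return False
    return True


if __name__ == "__main__":
    print(check_lemma(P192))
```

Because neither `lazy_add` nor `lazy_sub` contains a comparison or a conditional correction, their execution path does not depend on the operand values.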
Montgomery Modular Multiplication
The processing cells in the systolic array shown in Figure 1 perform Equation 5. Here $x_i$ and $m_i$ have α bits, $y_j$ and $n_j$ have β bits, and $c1_{i,j}$ and $c0_{i,j}$ denote the carry chain on the array. Because the critical path of the systolic array is the same as the critical path of one PC, the clock frequency of the Montgomery multiplier is the same for all bit-lengths. This property makes the same circuit usable for both RSA and ECC.
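A digit-serial view of this computation can be sketched in Python as follows. This is a simplified software model, not the cell-level Equation 5: it consumes α bits of X per iteration but treats Y and N as whole integers, so the β-bit slicing and the carry chain between cells are not modelled. The function name and the default α = 4 are assumptions of the sketch.

```python
def mont_mul_digit_serial(x, y, n, num_digits, alpha=4):
    """Digit-serial Montgomery multiplication with alpha-bit digits of x.

    Returns T with T = x*y*R^-1 (mod n), where R = 2**(alpha*num_digits).
    Requires n odd and x < R. No final subtraction is performed.
    """
    base = 1 << alpha
    n_prime = (-pow(n, -1, base)) % base         # -N^-1 mod 2^alpha
    t = 0
    for i in range(num_digits):
        x_i = (x >> (alpha * i)) % base          # i-th alpha-bit digit of x
        m_i = ((t + x_i * y) * n_prime) % base   # clears the low alpha bits
        t = (t + x_i * y + m_i * n) >> alpha     # exact division by 2^alpha
    return t


def digits_for(n, alpha=4):
    """Smallest digit count with R = 2**(alpha*digits) > 4N (assumed bound)."""
    return -(-(n.bit_length() + 2) // alpha)


if __name__ == "__main__":
    n = 2**192 - 2**64 - 1
    k = digits_for(n)
    r_inv = pow(1 << (4 * k), -1, n)
    x, y = 2 * n - 1, 2 * n - 3
    t = mont_mul_digit_serial(x, y, n, k)
    print(t < 2 * n, t % n == (x * y * r_inv) % n)
```

The loop body is identical for every digit and every operand length; only the number of iterations changes with the bit-length, which mirrors the observation that the clock frequency does not depend on the operand size.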
Parameters α and β are 4 for this implementation. Table 3 shows the performance of the FPGA implementation of the Montgomery multiplier. Parameter n is the bit-length of N , l in Figure 1 is n 
Results and Timings
In the work of Lenstra and Verheul [16], the authors made a security comparison between RSA and ECC key lengths. They introduced a table of corresponding key bit-lengths assuring minimal security in the years to come for the two public-key systems. In Figure 3 the performance for ECC and RSA is given for the key sizes from their paper. The figures also show that, especially for future applications, the performance of ECC is more attractive than the performance of RSA. Figure 4 shows the performance of an ECC implementation with one and two MMMs. The implementation with two MMMs is scheduled according to the schedules given in Table 1 and Table 2. Figure 4 shows a speed-up by a factor of 2 for the two-MMM variant.
For the sake of precision, we give detailed performance results in Table 4.
Side-Channel Security of CRT
We will now briefly review some benefits of Montgomery's Multiplication Method, which are also evident for CRT implementations. In [11, 29] , R > 4N is proposed which, with some savings in hardware, omits completely all reduction steps. Especially implementations of CRT schemes are found to be very sensitive to side-channel attacks. For example, recently a new SPA-based attack was introduced by Novak [20] , which is targeting the algorithm of Garner [17] . This scheme is often used in all sorts of applications, including smartcards. It is usually implemented as follows:
The third step is the critical one. Novak observed that if the modular subtraction is implemented in the common way it may leak information. More precisely, to perform subtraction (mod p) one has to check the sign of s − t and conditionally add p if s − t < 0 (p > q is required). Novak managed to build a successful attack based on this observation. An implementation of the above algorithm can produce the optional pattern in a power trace as a result of the conditional addition.
We propose the following solution. Instead of the subtraction mod p, one can compute the following:
For p > q the result stays within the bounds 0 < x < 2p, which can be handled easily if Step 4 is implemented using Montgomery's algorithm. Namely, the algorithm as proposed in [11, 29] for Montgomery modular multiplication takes two inputs 0 < X, Y < 2p and the result is also within the same interval, provided the proper bound for the Montgomery parameter R is chosen. This result is converted from the Montgomery domain to the usual domain by a Montgomery multiplication with 1. With Garner's scheme changed in this way, the algorithm always follows a constant execution path. We prove this in more detail.
Claim. The result of x = s + p − t is always smaller than 2p, for the parameters s, p, t defined as above.
Proof. It is obvious that 0 ≤ s < p and 0 ≤ t < q. We assume p > q, which does not cause any loss of generality; the other case is almost the same. Then we get x = s + p − t < p + p − 0 = 2p. Hence, if the multiplication in Algorithm 1 is implemented as a Montgomery multiplication, no conditional addition is required, unlike in the original algorithm. This concludes the proof.
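The modified recombination can be sketched in Python as follows. The step layout is the usual textbook form of Garner's algorithm and is an assumption here, since the exact listing of Algorithm 1 is not reproduced in this text. Python's `%` operator stands in for the Montgomery multiplication that would accept inputs below 2p in hardware, and the function name and the tiny RSA parameters are illustrative only.

```python
def garner_crt_decrypt(c, p, q, d):
    """RSA-CRT decryption using x = s + p - t in place of (s - t) mod p.

    Requires p > q. The subtraction step has no sign check and therefore
    no conditional addition of p.
    """
    d_p, d_q = d % (p - 1), d % (q - 1)
    q_inv = pow(q, -1, p)            # precomputed q^-1 mod p
    s = pow(c, d_p, p)               # 0 <= s < p
    t = pow(c, d_q, q)               # 0 <= t < q < p
    x = s + p - t                    # always 0 < x < 2p
    assert 0 < x < 2 * p
    u = (x * q_inv) % p              # stand-in for a Montgomery multiplication
    return t + u * q                 # recombined plaintext


if __name__ == "__main__":
    p, q, e, d = 61, 53, 17, 2753    # toy parameters, n = 3233
    m = 65
    c = pow(m, e, p * q)
    print(garner_crt_decrypt(c, p, q, d) == m)
```

Since x is congruent to s − t modulo p, the recombined value is unchanged; only the data-dependent branch disappears.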
Security Remarks
In this section we address side-channel security, i.e. resistance to timing [14], [10] and power analysis based attacks [15]. These types of attacks, together with fault-analysis based attacks [6], [12], [2], electromagnetic analysis attacks (EMA) [25], [7] and other physical attacks such as probing attacks [1], are a major concern, especially for wireless applications. Mainly because of space limitations, we only briefly discuss the first two, which are also believed to be the most practical.
Computations performed in non-constant time, i.e. computations whose duration depends on the values of the operands, may leak secret key information. This observation is the basis for timing attacks. On the other hand, power analysis based attacks use the fact that the power consumed at any particular time during a cryptographic operation is related to the function being performed and the data being processed. The attack can usually be performed easily because smartcards, for example, receive their power externally, and an attacker can easily get hold of the source of this side-channel information.
In our implementation all modular reductions are excluded. The conditional statements of the algorithm (used to realize the reduction step) introduce timing variations and should therefore be omitted. By use of an optimal upper bound, the number of iterations required in the algorithm based on Montgomery's method of multiplication can be reduced [30]. Another timing information leakage, observed by Quisquater et al. [10] and Walter [30], is the timing difference between "square" and "multiply". This information can be used to attack RSA, even if advanced exponentiation methods are used. In our architecture this weakness is removed, because the same systolic array performs squarings and multiplications, which are therefore indistinguishable with respect to timing.
Besides that, when considering power analysis attacks, some other precautions have also been introduced. The fact that all of the PCs operate in parallel makes these types of attacks far less likely to succeed. Both, RSA and ECC can benefit from this fact.
As already mentioned, this architecture can be an option for wireless devices, although we have chosen here to introduce a product devoted to network security. Again, because of space limitations we were not able to discuss the smaller, compact implementation, which also features a very secure low-power design with attractive performance.
Örs et al. characterized the power consumption of a XILINX Virtex 800 FPGA in [24]. They showed that it is possible to draw conclusions about the vulnerability of an ordinary ASIC in CMOS technology by performing power-analysis attacks on an FPGA implementation. In this respect, an FPGA design can serve as a good model for an ASIC platform, not just for the usual hardware-related properties but also for security.
Conclusions
We have presented a hardware implementation on a systolic array architecture that is scalable in all parameters and well suited for both RSA and ECC algorithms.
We have also introduced a bound on the Montgomery parameter R, which allows us to perform the most efficient point addition and doubling for ECC, as well as modular exponentiation. Even in the case of CRT, Montgomery's algorithm proves to be the best option for side-channel resistance.
