Recently, the Residue Number System and the Cox-Rower architecture have been used to compute Elliptic Curve Cryptography efficiently on FPGA. In this paper, we rewrite the conditions of Kawamura's theorem for error-free base extension in order to define the maximal range of the set from which the moduli can be chosen to build a base. At the same time, we give a procedure to compute correctly the truncation function of the Cox module. We also present a modified ALU of the Rower architecture using a second level of Montgomery representation. This architecture allows us to select the moduli using the new upper bound given by the rewritten condition. This modification makes the Cox-Rower architecture suitable for computing 521-bit ECC with a radix down to 16 bits, compared to 18 bits with the classical Cox-Rower architecture. We validate our results through an FPGA implementation of scalar multiplication at classical cryptographic security levels (NIST curves). Our implementation uses 35% fewer LUTs than the state-of-the-art generic implementation of ECC using RNS for the same performance [5]. We also slightly improve the computation time (latency), and our implementation shows the best throughput/area ratio for RNS computation supporting any curve independently of the chosen base.
Introduction
The Residue Number System (RNS) has proven attractive for efficient, high-performance implementation of large integer computations in public key cryptography and digital signatures [6, 5]. Because it computes operations quickly ($O(n)$ complexity in RNS versus $O(n^{\log_2 3})$ in multiprecision for multiplications when using Karatsuba), without carry propagation and with natural parallelism, RNS has gained interest in the literature [11, 12, 1]. Recently, it has also been shown to be suitable for pairing computations [3, 13]. The computation of the final exponentiation was made more efficient in [2]. All these implementations are based on the Cox-Rower architecture proposed by Kawamura for RSA [6] and improved by Guillermin for ECC computations [5].
In this paper, we reformulate the conditions for the base extension in order to build bases for the RNS Cox-Rower. Then, we present a new ALU that takes advantage of the new conditions for the base extension.
The paper is organised as follows: in Section 2, we briefly recall the mathematical background on RNS, Montgomery reduction over RNS and the approximations made in the base extension. Section 3 deals with the range of the moduli set induced by the approximation made during the base extension. The truncation function of the Cox is re-evaluated under those conditions. Section 4 presents a new Rower architecture, together with its base extension algorithm, to take advantage of the maximal range of the moduli set defined in Section 3. Section 5 gives results for scalar multiplication as well as area and performance comparisons with the classical Rower architecture. Section 6 concludes the paper.
Background Review

Residue Number System
RNS represents a number using a set of smaller integers. Let $B = \{m_1, \dots, m_n\}$ be a set of pairwise coprime natural integers. $B$ is also called a base. Let $M = \prod_{i=1}^{n} m_i$. The RNS representation of $X \in \mathbb{Z}/M\mathbb{Z}$ is the unique set of non-negative integers $\{X\}_B = \{x_1, \dots, x_n\}$ with $x_i = X \bmod m_i$. The conversion from RNS representation to binary representation can be computed using the Chinese Remainder Theorem (other methods, such as Mixed Radix, are possible):
$$X = \left( \sum_{i=1}^{n} \left| x_i M_i^{-1} \right|_{m_i} M_i \right) \bmod M, \quad M_i = M/m_i. \qquad (1)$$
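Operations in RNS are computed as follows: $\forall X, Y \in \mathbb{Z}/M\mathbb{Z}$, $\exists Z \in \mathbb{Z}/M\mathbb{Z}$ such that $Z = X \odot Y \bmod M \Leftrightarrow z_i = x_i \odot y_i \bmod m_i$ with $\odot \in \{+, -, *, \div\}$, where $\div$ is only available when $Y$ is coprime with $M$ and a divisor of $X$.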
Notation: In the rest of the paper, {X} B will refer to the representation of X in the RNS base B. We use braces to denote the fact that this is a set of integers.
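As an illustration of the definitions above, the following minimal Python sketch converts integers to and from RNS and performs residue-wise operations. The toy base and the function names (`to_rns`, `from_rns`, `rns_mul`, `rns_add`) are assumptions chosen for readability and are not taken from the paper.

```python
from math import gcd, prod

def to_rns(x, base):
    """Residues x_i = x mod m_i for each modulus of the base."""
    return [x % m for m in base]

def from_rns(residues, base):
    """CRT reconstruction: X = (sum |x_i * M_i^-1|_{m_i} * M_i) mod M."""
    M = prod(base)
    total = 0
    for x_i, m_i in zip(residues, base):
        M_i = M // m_i
        xi_i = (x_i * pow(M_i, -1, m_i)) % m_i  # modular inverse (Python 3.8+)
        total += xi_i * M_i
    return total % M

def rns_mul(xs, ys, base):
    """Residue-wise multiplication, with no carry between channels."""
    return [(x * y) % m for x, y, m in zip(xs, ys, base)]

def rns_add(xs, ys, base):
    """Residue-wise addition."""
    return [(x + y) % m for x, y, m in zip(xs, ys, base)]

# Toy base (illustrative values only): pairwise coprime moduli.
base = [13, 15, 16, 17]
assert all(gcd(a, b) == 1 for i, a in enumerate(base) for b in base[i + 1:])
M = prod(base)
X, Y = 1234, 37
product = from_rns(rns_mul(to_rns(X, base), to_rns(Y, base), base), base)
total = from_rns(rns_add(to_rns(X, base), to_rns(Y, base), base), base)
```

Each channel works on small residues independently, which is the source of the parallelism mentioned in the introduction.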
RNS and Montgomery
RNS arithmetic has several drawbacks compared to multiprecision arithmetic. One of them is that reduction modulo $p$ is complex. It remains possible with Montgomery reduction, which computes the value exactly using a base extension [9, 10]. We recall below the algorithm to compute the Montgomery reduction in RNS [6, 5, 10].
The main part of Montgomery Reduction relies on the Base Extension function (BE in the algorithm) that is described in the next section.
Algorithm 1: Montgomery Reduction in RNS
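Input: $\{X\}_B$, $\{X\}_{B'}$
Output: $\{S\}_B$, $\{S\}_{B'}$
1 Precomputed: $\{-p^{-1}\}_B$, $\{p\}_{B'}$, $\{M^{-1}\}_{B'}$
2 $\{Q\}_B \leftarrow \{X\}_B * \{-p^{-1}\}_B$
3 $\{Q\}_{B'} \leftarrow BE(\{Q\}_B, B, B')$
4 $\{S\}_{B'} \leftarrow (\{X\}_{B'} + \{Q\}_{B'} * \{p\}_{B'}) * \{M^{-1}\}_{B'}$
5 $\{S\}_B \leftarrow BE(\{S\}_{B'}, B', B)$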
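The data flow of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the base extension is replaced by an exact CRT reconstruction (`base_extend_exact`, a hypothetical helper) rather than the approximated Cox computation, and the toy bases and prime are arbitrary small values.

```python
from math import prod

def to_rns(x, base):
    return [x % m for m in base]

def from_rns(res, base):
    M = prod(base)
    return sum(((x * pow(M // m, -1, m)) % m) * (M // m)
               for x, m in zip(res, base)) % M

def base_extend_exact(res, src, dst):
    """Reference base extension (assumption): rebuild by CRT, reduce in dst."""
    return to_rns(from_rns(res, src), dst)

def rns_montgomery_reduce(X_B, X_Bp, B, Bp, p):
    """Return S = X * M^-1 mod p (with S < 2p) in both bases."""
    M = prod(B)
    # Precomputed constants (line 1 of Algorithm 1).
    neg_p_inv_B = [(-pow(p, -1, m)) % m for m in B]
    p_Bp = [p % m for m in Bp]
    M_inv_Bp = [pow(M, -1, m) for m in Bp]
    # Line 2: Q = X * (-p^-1) in base B.
    Q_B = [(x * c) % m for x, c, m in zip(X_B, neg_p_inv_B, B)]
    # Line 3: extend Q from B to B'.
    Q_Bp = base_extend_exact(Q_B, B, Bp)
    # Line 4: S = (X + Q*p) * M^-1 in base B'.
    S_Bp = [((x + q * pp) * mi) % m
            for x, q, pp, mi, m in zip(X_Bp, Q_Bp, p_Bp, M_inv_Bp, Bp)]
    # Line 5: extend S from B' back to B.
    S_B = base_extend_exact(S_Bp, Bp, B)
    return S_B, S_Bp

# Toy parameters (illustrative): X < M*p and 2p < M' so that S is representable.
B, Bp, p = [13, 15, 16, 17], [19, 23, 29, 31], 251
X = 200 * 123
S_B, S_Bp = rns_montgomery_reduce(to_rns(X, B), to_rns(X, Bp), B, Bp, p)
S = from_rns(S_Bp, Bp)
```

The result is congruent to $X M^{-1}$ modulo $p$ but is not fully reduced, which is why the bound $S < 2p$ is checked rather than $S < p$.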
Base Extension
Let $n$ be the cardinality of the base in RNS. In [9, 10], Posch and Posch introduced a floating approach to compute the base extension function. In [6], Kawamura came to a similar result, but the base extension function introduced by Kawamura supposes that the moduli $m_i$ are pseudo-Mersenne numbers of the form $m_i = 2^r - \mu_i$ with $0 \le \mu_i \ll 2^r$.
The base extension function relies on the conversion from RNS representation to binary representation. From (1), we have:
$$X = \sum_{i=1}^{n} \xi_i M_i - kM$$
for some $k$ to be determined, where $M_i = M/m_i$ and $\xi_i(x_i) = x_i M_i^{-1} \bmod m_i$ (written $\xi_i$ to lighten the notation). Then it follows:
Thanks to the special form of $m_i$ and to the condition $0 \le \mu_i \ll 2^r$, Kawamura approximated $m_i$ by $2^r$ to ease the computation. Let $\hat{k}$ be:
To evaluate the error due to the truncation approximation, Kawamura introduced some definitions that we recall here:
The error due to the approximation of the denominator is called $\epsilon_{m_i}$, whereas $\delta_{m_i}$ is due to the approximation of the numerator. Kawamura then proved two theorems for the base extension function. The conditions of one of these theorems will help to find the upper bound of the $\mu_i$ (called $\mu_{\max}$), which gives the maximal range of the set from which we can select the moduli to build a base.
Theorem 1 (Kawamura [6]). If $0 \le n(\epsilon + \delta) \le \alpha < 1$ and $0 \le x < (1 - \alpha)M$, then $\hat{k} = k$ and the base extension function extends the base without error.
One can see from the proof of Theorem 1 ([6], or see Appendix A) that the conditions can be relaxed to:
This new condition will help us to estimate µ max 's upper bound. To our knowledge, conditions on µ max have not been clearly established. In order to ease the moduli selection, we define the conditions on µ max in the next section.
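To make the role of the approximation concrete, here is a hedged Python sketch of a base extension that estimates $k$ as in Kawamura's approach. It assumes the usual form $\hat{k} = \lfloor \alpha + \sum_i \mathrm{trunc}_q(\xi_i)/2^r \rfloor$ from [6], where $\mathrm{trunc}_q$ keeps the $q$ most significant bits of an $r$-bit value. The moduli selection (`pick_moduli`) and the parameter values are illustrative assumptions, not the bases used in the paper.

```python
from fractions import Fraction
from math import floor, gcd, prod
import random

def pick_moduli(r, count):
    """Pairwise coprime moduli just below 2^r (illustrative selection)."""
    chosen, m = [], 2**r - 1
    while len(chosen) < count:
        if all(gcd(m, c) == 1 for c in chosen):
            chosen.append(m)
        m -= 1
    return chosen

def trunc_q(x, r, q):
    """Keep the q most significant bits of an r-bit value."""
    shift = r - q
    return (x >> shift) << shift

def base_extend_cox(res, B, Bp, r, q, alpha):
    """Extend residues from B to Bp using the approximated factor k_hat."""
    M = prod(B)
    M_i = [M // m for m in B]
    xi = [(x * pow(Mi, -1, m)) % m for x, Mi, m in zip(res, M_i, B)]
    # Approximation: each m_i is replaced by 2^r and xi_i by trunc_q(xi_i).
    k_hat = floor(alpha + Fraction(sum(trunc_q(v, r, q) for v in xi), 2**r))
    return [(sum(v * (Mi % mp) for v, Mi in zip(xi, M_i)) - k_hat * (M % mp)) % mp
            for mp in Bp]

# Assumed toy parameters: radix r, n moduli per base, q-bit Cox adder.
r, n, q, alpha = 16, 4, 7, Fraction(1, 2)
moduli = pick_moduli(r, 2 * n)
B, Bp = moduli[:n], moduli[n:]
eps = Fraction(2**r - min(B), 2**r)          # denominator error bound
delta = Fraction(2**(r - q) - 1, min(B))     # upper bound on the truncation error
M = prod(B)
rng = random.Random(0)
samples = [rng.randrange(floor((1 - alpha) * M)) for _ in range(200)]
```

With these parameters $n(\epsilon + \delta) \le \alpha$ holds with a wide margin, so the extension is expected to be exact for every $x < (1 - \alpha)M$.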
New Bounds for the Cox-Rower Architecture
µ i 's Upper Bound for RNS Base
In the previous section, we presented Kawamura's approximation of the factor $k$ for the base extension. The only condition given by Kawamura is $0 \le \mu_i \ll 2^r$. In this section, we explore the different equations to evaluate the impact on the upper bound of $\mu_i$. From (4) and (5), we have:
On the other hand, ∀x ∈ Z/M Z we have:
From the new condition (6) , it follows that:
Now, we evaluate Equation (7) in $\epsilon$ to find the condition on $m_i$, since $\epsilon = \frac{2^r - \min(m_i)}{2^r} = \frac{\mu_{\max}}{2^r}$. Substituting $\epsilon$ in (7):
If q = r, then µ max is maximum and is in the range of:
Then, we can rewrite an equivalent condition of the Theorem 1 using only the parameters α, r, n, q and µ max , which is more explicit for implementations:
then $\hat{k} = k$ and the base extension function extends the base without error.
With this new formulation, we can easily build bases for the RNS Cox-Rower architecture.
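The two bounds discussed in this paper can be compared numerically with a short Python sketch. It assumes the bound $\mu_{\max} \le 2^r \alpha / n$ quoted later in the text for condition (9), and the classical efficient-reduction bound $\mu_i \le \sqrt{2^r}$. The function names are hypothetical, and $n = 31$ is an assumption corresponding to $\lceil 521/17 \rceil$.

```python
from fractions import Fraction
from math import floor, isqrt

def mu_max_classical(r):
    """Largest integer mu satisfying the classical condition mu <= sqrt(2^r)."""
    return isqrt(2**r)

def mu_max_base_extension(r, n, alpha):
    """Largest integer mu satisfying mu <= 2^r * alpha / n."""
    return floor(Fraction(2**r) * alpha / n)

# Parameters of the 521-bit experiment: r = 17, alpha = 0.5, n = 31 (assumed).
r, n, alpha = 17, 31, Fraction(1, 2)
classical = mu_max_classical(r)             # 362
relaxed = mu_max_base_extension(r, n, alpha)  # 2114
```

These values match the figures 362 and 2114 reported in the experimental section, and show how much larger the set of candidate moduli becomes.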
Lower Bound for the parameter q of the Cox
In [6], Kawamura described a procedure to determine $n$, $\epsilon$, $\delta$, $\alpha$ and $q$ for a given $p$. While $n$ is easy to determine ($n \approx \log_2(p)/r$), $q$ is determined using the approximations $\epsilon \ll 1$ and $2^{-(r-q)} \ll 1$ with Theorem 1's conditions. While those approximations are asymptotically correct, we want to determine $q$ for any range of parameters. We give here a new procedure to determine $q$ correctly from $\alpha$, $n$, $r$ and $\mu_{\max}$.
Once the bases are chosen using (9), the following equation, derived from Theorem 2's conditions, can be applied to find the parameter $q$:
This is a necessary and sufficient condition to get an exact computation. Unlike Kawamura's method [6], no assumption is made on $\epsilon$ (or equivalently on $\mu_{\max}$) and $2^{-(r-q)}$.
A New Cox-Rower Architecture
In the previous section, conditions on $\mu_{\max}$ have been determined. In this section, we first present the algorithm and the classical ALU used to compute the reduction inside the Rower. To our knowledge, it is the only ALU used with the RNS Cox-Rower architecture [8, 5, 3, 13, 2].
Then, we introduce the new ALU proposed in this paper. This new ALU has been designed to fit on FPGAs, and we compare it with the classical ALU. Our comparison of the 2 ALUs uses 3 types of cells: DSP (Digital Signal Processing) blocks, LUTs (Look-Up Tables) and registers (basic elements of an FPGA). Multipliers are implemented inside DSP blocks on FPGA, with some additional features such as a pre/post-adder/subtracter. LUTs are the basic cells used to implement any combinatorial logic.
Algorithm 2: Efficient Reduction Algorithm
Input: $a \le 2^r$, $b \le 2^r$ and $m_i = 2^r - \mu_i$ with $0 \le \mu_i < \sqrt{2^r}$
Output: $z = (ab) \bmod m_i$
1 $c \leftarrow ab = c_1 2^r + c_0$
2 $d \leftarrow c_1 \mu_i = d_1 2^r + d_0$
3 $e \leftarrow d_1 \mu_i$
4 $z \leftarrow (e + d_0 + c_0) \bmod m_i$
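Algorithm 2 can be sketched in Python as follows. The function name `efficient_reduce` and the final reduction by repeated conditional subtraction are assumptions made for illustration; the hardware unit performs this last step with adder/reducer blocks.

```python
import random

def efficient_reduce(a, b, r, mu):
    """Compute (a*b) mod m with m = 2^r - mu, using 2^r = mu (mod m)."""
    m = (1 << r) - mu
    mask = (1 << r) - 1
    c = a * b                      # line 1: c = c1*2^r + c0
    c1, c0 = c >> r, c & mask
    d = c1 * mu                    # line 2: d = d1*2^r + d0
    d1, d0 = d >> r, d & mask
    e = d1 * mu                    # line 3: fits in r bits when mu < sqrt(2^r)
    z = e + d0 + c0                # line 4: sum is below 3*2^r
    while z >= m:                  # final reduction by conditional subtraction
        z -= m
    return z

# Example with an assumed radix r = 17 and mu = 361 (361^2 < 2^17).
rng = random.Random(1)
r, mu = 17, 361
a, b = rng.randrange(2**r + 1), rng.randrange(2**r + 1)
z = efficient_reduce(a, b, r, mu)
```

The two folding steps replace the high part of the product by a multiple of $\mu_i$, which is why a small $\mu_i$ keeps the second and third multipliers narrow.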
Classical Rower Unit
The Cox-Rower architecture defined in [6, 8, 5, 3, 13] computes the reduction inside the Rower using Algorithm 2 when $0 \le \mu_i < \sqrt{2^r}$.
The last addition (line 4 of Algorithm 2) gives a number up to $3 \cdot 2^r < 4m_i$. It is also possible to reduce the last addition during the computation of the multiplications, if the adder/reducer blocks are not the critical path of the design compared to the multipliers. Such an implementation gives good results for efficient computation over $F_p$/RSA and ECC [6, 8, 5, 3, 13, 2]. Figure 1 presents the ALU of the Rower unit introduced by Guillermin [5]. The first reduction stage (second level in Fig. 1) is not necessary because its output is reduced within the second stage (third level in Fig. 1) (in the design, we have $2^r + m_i < 3m_i$ but also $2^r + 2^r < 3m_i$). The last part of the design consists of two accumulators placed before the addition and reduction of the 2 branches.
New Rower Unit
A drawback of the previous ALU is the condition $0 \le \mu_i \le \sqrt{2^r}$. This restriction on the moduli is imposed to allow efficient reduction. In contrast, the condition we derived in (9) has to be met to ensure a base extension without error.
Then the two following cases can be met:
In that case, choosing moduli in the range [2 r α n ;
In this second case we observe that, using the classical ALU, we are restricted in the choice of moduli, while our condition (9) shows that taking more moduli without inducing errors is possible.
As an example, when $r \ge 14$ and $\log_2(p) = 521$, we are restricted by the condition $0 \le \mu_{\max} \le \sqrt{2^r}$ when selecting the moduli. The condition given for efficient reduction, when $r$ is large, is sufficient to be in (ii), which is the case in [6, 8, 5, 3, 13, 2].
We propose here a new ALU for the Rower unit to exploit the upper bound $\mu_{\max} \le 2^r \alpha / n$ given by our condition (9). Using this upper bound, we are able to use a smaller radix than the classical ALU for an equivalent size of $p$ ($r = 16$ for $\log_2(p) = 521$, whereas we need $r = 18$ with the classical ALU). Our ALU is based on the Montgomery reduction inside the Rower unit (called the inner level of Montgomery). Our ALU computes the reduction using Algorithm 3 without any assumption on $m_i$, except that $m_i$ is coprime with $2^r$ to ease the computation in hardware.
Algorithm 3: Inner Montgomery Reduction algorithm
The most significant bits of the last addition (line 3 of Algorithm 3) give a number up to $2m_i$ (compared to $4m_i$ with the classical ALU). Figure 2 presents the ALU of the Rower unit proposed in this paper.
Levels of multiplication and reduction are also well separated, which makes our design fully pipelinable inside DSP blocks of the FPGA. Our ALU has also one accumulator. Moreover, we can take advantage of the adder integrated in the DSP blocks to compute the last addition of the Montgomery reduction algorithm (Algorithm 3). 
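A standard Montgomery reduction with radix $2^r$ is consistent with the description above (three steps, low bits passed between multipliers, result below $2m_i$). The Python sketch below is written under that assumption; the step numbering in the comments and the name `inner_montgomery` are hypothetical and may not match the paper's Algorithm 3 exactly.

```python
import random

def inner_montgomery(a, b, m, r):
    """Return a value congruent to a*b*2^(-r) mod m, in [0, 2m), for odd m."""
    R = 1 << r
    m_neg_inv = (-pow(m, -1, R)) % R           # precomputed -m^-1 mod 2^r
    c = a * b                                   # step 1: r x r -> 2r multiplier
    q = ((c & (R - 1)) * m_neg_inv) & (R - 1)   # step 2: only the low r bits are kept
    return (c + q * m) >> r                     # step 3: low r bits of the sum are zero

# Assumed example: r = 17 and an odd modulus with mu = 2113, beyond sqrt(2^17).
r, m = 17, 2**17 - 2113
rng = random.Random(3)
a, b = rng.randrange(m), rng.randrange(m)
z = inner_montgomery(a, b, m, r)
```

Nothing in this computation depends on $\mu_i$ being small, which is what allows the moduli to be chosen up to the bound given by the base extension condition.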
Computation Algorithm
The computation of the Montgomery reduction over RNS (called the outer level of Montgomery), when using the classical ALU, is given in [5]. We recall this algorithm in the Appendix. It is based on the precomputation of values depending on the parameters of the elliptic curve ($a_4$, $a_6$, $p$ with $y^2 = x^3 + a_4 x + a_6$) and on the values of the bases. Our ALU uses the same algorithm as the one given in [5]. The differences reside in the precomputed values. Indeed, the values that have to be computed are $\{X 2^r\}_B = \{\tilde{x}_i = x_i 2^r \bmod m_i\}$. In other words, we precompute the values in Montgomery representation inside the ALU (that is, multiplied by $2^r$ modulo $m_i$ in the inner level of Montgomery). When we use the base extension function, we need to compute the real value (from the inner-level Montgomery representation to the normal representation modulo $m_i$) to extend it to the second base. The new ALU needs the same number of cycles to compute the outer Montgomery reduction as the classical ALU (the algorithms for the outer Montgomery computation, as well as the precomputed values, for the classical ALU and for our ALU are given in Appendix B).
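The handling of the inner Montgomery representation can be illustrated with the Python sketch below. It assumes a standard Montgomery multiplication with a final conditional subtraction; the helper names `to_inner` and `from_inner` are hypothetical, and the modulus is an arbitrary odd example.

```python
def mont_mul(a, b, m, r):
    """Inner Montgomery product a*b*2^(-r) mod m, fully reduced (m odd)."""
    R = 1 << r
    q = (((a * b) & (R - 1)) * ((-pow(m, -1, R)) % R)) & (R - 1)
    z = (a * b + q * m) >> r
    return z - m if z >= m else z

def to_inner(x, m, r):
    """Precomputation: Montgomery representation x * 2^r mod m."""
    return (x << r) % m

def from_inner(x_tilde, m, r):
    """Back to the normal representation, as needed before a base extension."""
    return mont_mul(x_tilde, 1, m, r)

# Assumed example values.
r, m = 17, 2**17 - 2113
x, y = 12345, 67890
xy_tilde = mont_mul(to_inner(x, m, r), to_inner(y, m, r), m, r)
xy = from_inner(xy_tilde, m, r)
```

Products of values in Montgomery representation stay in that representation, so only the operands entering a base extension need the extra conversion.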
Comparison Analysis
Despite the fact that our ALU was designed specifically to fit on FPGA, we give some comparisons for ASIC implementations.
Area analysis. The sizes of the multipliers differ between the classical ALU and our ALU. The classical ALU needs 3 multipliers of size $r \times r \to 2r$, $r \times r/2 \to 3r/2$ and $r/2 \times r/2 \to r$ (lines 1, 2 and 3 of Algorithm 2). Our ALU requires the same number of multipliers, but of sizes $r \times r \to 2r$, $r \times r \to r$ and $r \times r \to 2r$. With our ALU, we use the full size of the DSP blocks on FPGA, whereas a quarter and a half of a DSP block are lost with the classical ALU. Regarding the LUTs used on FPGA, our ALU is less complex (in terms of additions and reductions) than the classical ALU, which reduces the number of LUTs it uses. The final adder of the Montgomery reduction algorithm (Algorithm 3) can also be included inside the DSP blocks of the FPGA to further reduce the number of LUTs, which is not the case with the classical ALU. Looking at Fig. 2, we estimate that our ALU uses 5 times fewer LUTs than the classical one. For ASIC, those considerations no longer hold, since the cost of the reduction level is far more important on FPGA than on ASIC (where multipliers consume far more area than adders).
Timing analysis. The timing path of a classical multiplier is an affine function of the size of its inputs. In the classical ALU, each multiplication needs the most significant bits of the previous multiplication (lines 2 and 3 of Algorithm 2). On ASIC or FPGA, this is usually the critical path of the design if it is not well pipelined. In contrast, our ALU only needs the least significant bits from one multiplier to the next (lines 2 and 3 of Algorithm 3), which reduces the length of the critical timing path.
Other considerations. The stages of multiplication and reduction are well separated, which reduces fanout, placement and routing issues. The multiplication stages are also fully pipelinable without any impact on the final reduction in our ALU.
Remarks. With the classical ALU, Kawamura's approximations $\epsilon \ll 1$ and $2^{-(r-q)} \ll 1$ used to determine $q$ are correct when $r$ is large enough to have $\sqrt{2^r} \ll 2^r \alpha / n$. With the new ALU, the procedure to determine $q$ defined in the previous section is available.
Experiments and Comparisons
Validation on FPGA
Target technology. We have implemented our ALU (and also the classical ALU [5] for the purpose of comparison) on a Xilinx Kintex-7 FPGA using the KC705 evaluation board available from Xilinx. This board includes the device xc7k325t which is a mid range FPGA on the 28nm process node.
Design parameters. We have implemented the classical cryptographic security levels from NIST, but no restriction is placed on the parameters of the elliptic curve other than being a valid curve. DSP blocks of the Xilinx 7 series family are signed multipliers of size $25 \times 18 \to 43$. Since we only need the unsigned part of the multiplier and we want to be base-independent, we choose the radix $r = 17$. The base has been chosen such that we can take $\alpha = 0.5$.
Implementation. For both designs (classical ALU and our ALU) and each curve, Table 1 gives the area in terms of slices, the maximum frequency after Place and Route, the number of cycles for a whole computation (binary to RNS or INT2RNS, scalar multiplication or MULT, final inversion or INV, and RNS to binary transformation or RNS2INT), the computation time, $q$ (size of the adder in the Cox module), $\log_2(\mu_{\max})$ and the ratio bits.s$^{-1}$/slices. The slice count does not include DSP slices or BRAM (Block RAM). Table 4, in Appendix C, gives the detailed counts of LUTs, registers, DSP and BRAM, as well as the cycles for each command. The area results take the datapath, the sequencer and the interface into account. Only the ALU and the precomputations have been modified.
[Table 1. Columns: Design, Curve, n, Cycles, Slices, Fmax, Latency, q, $\log_2(\mu_{\max})$, Ratio. First entry: Classical ALU (C), curve 160.]
Comparison of the 2 ALUs. Because of the condition for efficient reduction ($0 \le \mu_i \le \sqrt{2^r} = 362$) with the classical ALU, we were not able to build 2 bases with $r = 17$, which is a critical size for the DSP blocks of the Xilinx FPGA, for $\log_2(p) > 256$. On the other hand, using our ALU and condition (9) ($0 \le \mu_{\max} \le 2^r \alpha / n = 2114$), we were able to build 2 bases up to $\log_2(p) = 521$. To reach a similar size of $p$, Guillermin took $r = 18$ with the classical ALU to overcome this issue [5], which is not acceptable if we want to use 1 DSP block per multiplication without penalizing the maximal frequency and latency. As expected from the previous section, we use 35% less area overall with the Montgomery ALU than with the classical ALU. This area reduction takes into account the logic for the whole datapath, the sequencer and the interface. The area reduction inside the ALU is around 75%. The area of the 256-bit design with the classical ALU is almost the same as that of the 521-bit design with our ALU. The gap in maximal frequency between the 2 ALUs is due to placement and routing issues. Indeed, the critical timing paths of the classical ALU go from the multipliers to the adder/reducer blocks (Fig. 1). The multiple interconnections make those paths difficult to place and route efficiently (essentially due to the fanout). In contrast, the critical timing paths of our ALU go from one multiplier to the next. Thus, to increase the frequency, we have to increase the pipeline depth. For scalar multiplication in ECC, a pipeline of 5 registers is enough to have 95% of the pipeline used during the whole computation (Guillermin came to similar results [5]). For pairing computations, we can increase the pipeline to 10 registers and thus expect a better frequency than for scalar multiplication [3, 13].
Comparison
We compare our design with three other designs, both RNS and non-RNS. Our architecture supports any elliptic curve over $F_p$ and implements the Montgomery Ladder algorithm to be SPA resistant. We use projective coordinates for the computations. We consider the general elliptic curve in Weierstrass form $y^2 = x^3 + a_4 x + a_6$ with no assumption on the parameters. Our architecture makes no assumption on the form of the moduli except that they respect Theorem 2's conditions.
(i) The first design is the one given in [5] and is based on RNS. The ALU used is the classical one. A larger radix was also used in that implementation. This design shows very fast computation with any elliptic curve over $F_p$. To our knowledge, it is the fastest implementation of elliptic curve scalar multiplication with generic curves, independently of the chosen base, on FPGA using the RNS Cox-Rower architecture. For ratio comparison, a slice in recent Xilinx devices (Virtex-5 and beyond) is equivalent to 3 ALMs in Altera. To achieve a high running frequency, all the precomputed values and the GPR are implemented in registers inside ALMs.
(ii) The second design is an implementation of a specific curve where $p$ is a pseudo-Mersenne number [4]. Using the property of the pseudo-Mersenne value, this implementation can be specialized to run at high frequency and to compute the scalar multiplication quickly.
(iii) The third design is based on the fast quotient pipelining Montgomery multiplication algorithm of [7]. The scalar multiplication is based on the window method. Jacobian coordinates are used and the parameter $a_4$ is set to $-3$ (which is not a real restriction with the Weierstrass form, through an isogeny). To our knowledge, it is the fastest implementation of scalar multiplication over ECC and the smallest design for such performance with generic curves.
Design (i) is the one we compare against throughout the paper. Our implementation is smaller and slightly faster than the implementation in [5].
Design (ii) uses the specific form of the parameter $p$ to improve the overall performance. This design is faster than ours, but it depends on the pseudo-Mersenne form of the parameter $p$ of the elliptic curve.
Design (iii) shows very fast computation of ECC scalar multiplication. Compared to our design, the gain in computation time comes from the use of Jacobian coordinates and the window method, whereas we use the Montgomery Ladder and projective coordinates. However, when comparing the number of cycles needed to complete a multiplication and an addition/subtraction, design (iii) needs 35 cycles for a multiplication whereas we need $2n + 3$ cycles (35 cycles for 256 bits), and it needs 7 cycles for an addition/subtraction whereas we need 1 cycle. Finally, unlike our work, this gain in performance does not scale to arbitrary elliptic curve sizes.
Conclusion and perspectives
In this paper, we established the link between the properties of the moduli and the base extension for the Cox-Rower architecture. To our knowledge, this had not been clearly defined before. The bounds given here are more appropriate for designers. We also give a new procedure to determine the parameter $q$, which is used for truncation in the Cox module. We propose a new ALU design based on an inner Montgomery reduction. This ALU is designed to fully use the bounds of the Cox-Rower architecture and to reduce the combinatorial area of the architecture on FPGA without penalizing performance. Moreover, using the same pipeline depth, we manage to increase the frequency of our ALU compared to the classical one.
In future work, we will increase the pipeline depth in the DSP blocks for pairing computations in order to improve the computation time. Furthermore, we will take advantage of the pre-subtracter of the DSP block to easily compute $(-AB) \bmod p$ and reduce the computation time. To improve the algorithms, we will study the use of different coordinates and implementations, such as Jacobian coordinates and the window method. Although our ALU is designed for FPGA, we will also study its potential application to ASIC.
ξi mi > −n . Therefore, we have:
Since $m_i \le 2^r$ and $\mathrm{trunc}_q(\xi_i) \le \xi_i$, it follows:
From here, the same arguments as in [6] are valid. Table 2 recalls the values that have to be precomputed for Algorithm 4.
Algorithm 5 is the algorithm for the Montgomery reduction over RNS when using our ALU. The operation $\otimes$ denotes the inner Montgomery multiplication and reduction (Algorithm 3), such that $a \otimes b \bmod m = ab\,2^{-r} \bmod m$. Table 3 recalls the values that have to be precomputed for Algorithm 5.
