Abstract: Scalar recoding is a popular way to speed up ECC scalar multiplication: non-adjacent form, double-base number system, multi-base number system. But fast recoding methods require pre-computations: multiples of the base point or off-line conversion. In this paper, we present a multi-base recoding method for ECC scalar multiplication based on i) a greedy algorithm starting with the least significant terms first, ii) cheap divisibility tests by multi-base elements and iii) fast exact divisions by multi-base elements. Multi-base terms are obtained on-the-fly using a special recoding unit which operates in parallel to curve-level operations and at very high speed. This ensures that all recoding steps are performed fast enough to schedule the next curve-level operations without interruptions. The proposed method can be fully implemented in hardware without pre-computations. We report FPGA implementation details and very good performance compared to state-of-the-art results.
I. INTRODUCTION
Scalar multiplication is the most time-consuming operation in elliptic curve cryptography (ECC) protocols. It is denoted by [k]P where P is a curve point and k a scalar. The basic scalar multiplication algorithm scans each bit of k and performs some curve-level operations depending on the bit value. Scalar representation significantly impacts the number of point operations to be executed and the overall computation time. Consequently, scalar recoding methods are very popular: non-adjacent forms (NAF and wNAF), double- or multi-base number systems (DBNS/MBNS), etc. Sec. II recalls these methods and basic ECC elements. Previous fast recoding methods require a pre-computation step prior to scalar multiplication. For wNAF, several multiples of P have to be pre-computed and stored. For DBNS/MBNS, the scalar must be recoded off-line.
Below we present a method and its FPGA implementation to recode the scalar on-the-fly using MBNS without pre-computations. Our recoding is performed in parallel to curve-level operations. It uses very cheap divisibility tests for each base element and an efficient implementation of the exact division algorithms used for multiple-precision arithmetic. Exact division refers to division where the remainder is known to be zero. Due to the paper length limit, we only deal here with curves defined over $\mathbb{F}_p$, but our method can easily be applied in the $\mathbb{F}_{2^m}$ case. Sec. III and IV present respectively the unsigned and signed versions of our method. Sec. V compares our results to state-of-the-art ones.
II. STATE-OF-ART IN ECC SCALAR MULTIPLICATION
A brief introduction is presented below. The reader is referred to [1], [2] for further details. An elliptic curve E over the prime field $\mathbb{F}_p$, of large characteristic, can be defined by the simplified Weierstrass equation $y^2 = x^3 + ax + b$ with curve parameters $a, b \in \mathbb{F}_p$ and $4a^3 + 27b^2 \neq 0$. The rational points on the curve and a special point, called the point at infinity and denoted by O, form an abelian group (denoted additively, where O acts as the identity) on top of which the cryptosystem works. Given points P, Q on the curve, curve-level operations are defined: point addition P + Q where $P \neq \pm Q$ (denoted by ADD) and point doubling [2]P = P + P (DBL). Scalar multiplication [k]P is defined by [k]P = P + P + ... + P with k − 1 additions. The scalar k is $(k_{n-1} k_{n-2} \dots k_1 k_0)_2$ with n in the range 160-520 bits for typical cryptographic sizes. Each operation at curve level involves a sequence of operations at field level (multiplication: M, square: S, inversion: I). Curve points can be represented using affine coordinates (A): (x, y). In that case, ADD and DBL operations require expensive field inversions (in $\mathbb{F}_p$ one inversion costs about 15 to 30 multiplications). Hence the most efficient implementations use projective coordinates. In this paper, we use Jacobian coordinates (J), a popular class of projective coordinates where (X : Y : Z) corresponds to the affine point $(X/Z^2, Y/Z^3)$ for $Z \neq 0$.
A. Basic Scalar Multiplication Methods
The basic scalar multiplication algorithm, called double-and-add, is presented in Fig. 1. Its average computation cost is 0.5n ADD + n DBL (0.5n ones in k for security requirement).
Point addition at line 4 always uses the same point P . Then P can be kept in affine coordinates and used by mixed addition mADD (J + A → J ) in order to speed up the computation and reduce the P coordinates storage (see Sec. V-A for cost).
Point subtraction (SUB) is as efficient as point addition (A: −P = (x, −y) and J: −P = (X : −Y : Z) for curves over $\mathbb{F}_p$). This motivates the use of signed digits such as NAF ($k_i \in \{0, \pm 1\}$) where no two consecutive signed digits are non-zero [1, Sec. 3.3.1]. Scalar multiplication using NAF recoding is straightforward: replace line 4 in Fig. 1 by "if $k_i \neq 0$ then $Q \leftarrow Q + \mathrm{sign}(k_i) P$". The average computation cost is 0.3n ADD + n DBL (the cost of SUB is the same as that of ADD).
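As an illustration of the NAF recoding recalled above, the following minimal Python sketch recodes a scalar and uses it in a signed double-and-add loop. It is not taken from Fig. 1: the function names `naf` and `scalar_mult_naf` are hypothetical, and plain integers stand in for curve points (so that DBL is a multiplication by 2 and ADD/SUB are integer additions).

```python
def naf(k):
    """Return the non-adjacent form of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k & 3)   # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d            # k is now divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def scalar_mult_naf(k, P):
    """Signed double-and-add, most significant digit first.

    Integers stand in for curve points (assumption made for illustration).
    """
    Q = 0                       # stand-in for the point at infinity O
    for d in reversed(naf(k)):
        Q = 2 * Q               # DBL
        if d != 0:
            Q = Q + d * P       # ADD (d = 1) or SUB (d = -1)
    return Q
```

For instance, `naf(7)` gives the digits of 7 = 8 − 1, with a single non-zero digit at each end.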
Another optimization, called wNAF, processes a window of w digits of k at a time. wNAF uses digits $k_i \in \{0, \pm 1, \pm 3, \pm 5, \dots, \pm(2^{w-1} - 1)\}$, and at most one of any w consecutive digits is non-zero [1, Sec. 3.3.1]. Multiples $P_j = [j]P$ have to be pre-computed and stored for all $j \in \{3, 5, \dots, 2^{w-1} - 1\}$. Scalar multiplication is done using ADD/SUB of the pre-computed multiple $P_j$ (the corresponding pseudocode is "if $k_i \neq 0$ then $Q \leftarrow Q + \mathrm{sign}(k_i) P_{|k_i|}$"). The average computation cost is n/(w + 1) ADD + n DBL without the pre-computation step. These pre-computations may be interesting if the same point P is reused. In practice, wNAF is used with w ≤ 4 for limited storage overhead.
Fig. 2 illustrates the typical number of operations required at each level. One [k]P operation requires hundreds of curve-level operations. Each curve operation (ADD, DBL) requires a sequence of 8-12 field-level operations. Finally, each field operation requires tens (for large operators) to hundreds (for small iterative operators) of clock cycles.
B. Scalar Multiplication using Double-Base Number System
DBNS was initially introduced in [3], used for modular exponentiation in [4], for signal processing in [5] and for ECC in [6], [7], [8], [9], [10] and [11] (which is very complete). In DBNS, a number x is represented by a sum of mixed powers of two co-prime integers $b_1$ and $b_2$, the two bases, typically $(b_1, b_2) = (2, 3)$: $x = \sum_{i=1}^{n'} x_i 2^{u_i} 3^{v_i}$ with $x_i \in \{-1, 1\}$.
Unsigned DBNS ($x_i = 1$) leads to a larger number of terms $n'$. For ECC computations in DBNS, a new curve-level operation has to be defined: point tripling [3]P = P + P + P (denoted TPL). It is faster than ADD (see Sec. V-A for cost). DBNS is a very sparse representation (the number of terms $n'$ is very small compared to the number of bits n in the standard binary representation), so the number of point additions is reduced. In [7], a special type of DBNS recoding, called DBNS chain, is proposed with a Horner-like factorization of DBL and TPL operations (under the conditions $u_1 \geq u_2 \geq \dots \geq u_{n'}$ and $v_1 \geq v_2 \geq \dots \geq v_{n'}$), leading to a larger improvement. We will report the performance of some DBNS scalar multiplication algorithms from the literature in Sec. V. There are DBNS scalar multiplication extensions using pre-computed multiples of P, leading to higher speed but with a higher storage cost [11]. DBNS helps to reduce the total computation time, but the binary to DBNS conversion is performed off-line. Most of the proposed conversions proceed most significant terms first, by subtracting from k a good/best approximation of k by a term of the form $2^u 3^v$ using huge tables or expensive computations. The authors of [10] claim that the tree-based conversion approach is too costly for hardware implementation of systems using integers in the cryptographic range (p. 437, Sec. 3). In [8], 10 to 72 points have to be pre-computed and stored (a parallel architecture would be a better usage of silicon area). In [12], an FPGA implementation of binary to DBNS conversion is proposed, but only for very small operands (n ≤ 20 bits) in signal processing applications.
DBNS is a very redundant number system. In [13] , this redundancy is used to randomly select the recoding as a counter-measure against some side-channel attacks (SCAs).
C. Scalar Multiplication using Multi-Base Number System
MBNS is a generalization of DBNS with more than two bases [14], [15], [16], [17], [18], [19] and [20]. A multi-base B is a tuple of l co-prime integers $(b_1, b_2, \dots, b_l)$. Number x is represented as the sum of terms $x = \sum_{i=1}^{n'} d_i \prod_{j=1}^{l} b_j^{e_{j,i}}$ with $d_i = \pm 1$. For ECC scalar multiplication in MBNS, new curve-level operations have to be defined: point quintupling [5]P (QPL), point septupling [7]P (SPL), point eleventupling [11]P (EPL), etc. These new operations are more efficient than equivalent sequences of ADD, DBL and TPL operations (see Sec. V-A for typical costs). MBNS scalar multiplication is similar to DBNS algorithms with more curve-level operations QPL, SPL, etc.
MBNS suffers from the same limitation as DBNS: the need for off-line conversion with huge tables and/or long pre-computations. In [14] and [17], conversion uses good approximations of k using terms of the form $\pm \prod_{j=1}^{l} b_j^{e_j}$, similarly to DBNS conversion. In [15] and [16], conversion uses an adaptation of wNAF with detection of multiples of $b_j$ in a limited window, but it requires pre-computations and additional storage. To our knowledge, [15] and [16] provide the best MBNS results, but without hardware implementation details.
III. PROPOSED METHOD
Notations used in the remainder of the paper are:
• k the n-bit scalar, split into t words of w bits with w(t − 1) < n ≤ wt (i.e. the last word may be 0-padded), and $k^{(i)}$ the ith word of k starting from the least significant one for 0 ≤ i < t,
• B the multi-base with l base elements (co-prime integers),
• predicate divisible(x, B) returns true if x is divisible by at least one base element in B (false otherwise),
• number x represented as the sum of terms $x = \sum_{i=1}^{n'} d_i \prod_{j=1}^{l} b_j^{e_{j,i}}$,
• Q, P curve points and Q = [k]P scalar multiplication.
Fig. 3. Unsigned MBNS recoding algorithm (recoverable lines):
8: for j from 1 to l do
9: $e_j \leftarrow 0$
13: $LT \leftarrow LT \cup (d, e_1, e_2, \dots, e_l)$
14: return LT
Due to space limitation, we only present results for elliptic curves defined over $\mathbb{F}_p$, but the method can be used for $\mathbb{F}_{2^m}$ (fine tuning is slightly different due to different cost ratios I/M and S/M). In this section, we present the unsigned version ($d_i = 1$) for the sake of simplicity. Sec. IV details algorithms for signed representations ($d_i = \pm 1$). Units described in this section can be directly used or slightly adapted for the signed representation.
A. Unsigned Algorithms
Our MBNS unsigned recoding algorithm, see Fig. 3, is very simple. Divisibility of k by B elements is tested. When k is not divisible, 1 is subtracted from k. $b_1 = 2$ is selected for efficiency (divisibility is ensured 50% of the time). In lines 8-12, the scalar k is divided by all base elements $b_j$ in B as much as possible using cheap divisibility tests and exact divisions. This division step provides the term exponents $e_1, e_2, \dots, e_l$. LT denotes the list of terms which stores the MBNS recoding of k, $LT = (d_1, e_{1,1}, e_{2,1}, \dots, e_{l,1}), (d_2, e_{1,2}, e_{2,2}, \dots, e_{l,2}), \dots$ with $d_i \in \{0, 1\}$. Only the first term may have $d_1 = 0$ (if the initial k is immediately divisible by an element of B). Divisibility tests at lines 3 and 10 can be shared. The algorithm stops when k ≤ 1 due to the Horner form of the recoding. Divisibility tests are detailed in Sec. III-B. They are implemented using t + ε clock cycles (ε is a small constant) for all $b_j \neq 2$ and only one clock cycle for $b_j = 2^s$ with s ≤ w. Once k is divisible by $b_j$, we use fast exact division algorithms to perform $k \leftarrow k/b_j$ as detailed in Sec. III-C, with t + ε′ clock cycles for all $b_j \neq 2$.
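The recoding loop described above can be sketched in a few lines of Python. This is a software model written from the textual description only, not a transcription of Fig. 3: the function name `recode_unsigned`, the returned pair `(LT, k_final)` and the use of Python's `%` and `//` in place of the hardware divisibility tests and exact divisions are assumptions made for illustration.

```python
def recode_unsigned(k, bases=(2, 3, 5, 7)):
    """Least-significant-first unsigned multi-base recoding of k >= 1.

    Returns (LT, k_final): LT is a list of terms (d, (e_1, ..., e_l)) and
    k_final is the residual scalar left when the loop stops (k <= 1).
    """
    LT = []
    while k > 1:
        if any(k % b == 0 for b in bases):   # divisibility test
            d = 0
        else:
            d = 1
            k -= 1                            # k - 1 is even when 2 is in the multi-base
        exps = []
        for b in bases:                       # divide as much as possible
            e = 0
            while k % b == 0:
                k //= b                       # exact division
                e += 1
            exps.append(e)
        LT.append((d, tuple(exps)))
    return LT, k
```

With B = (2, 3, 5, 7), `recode_unsigned(87)` yields the two terms (0; 0, 1, 0, 0) and (1; 2, 0, 0, 1), i.e. 87 = 0 + 3 × (1 + 2² × 7).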
MBNS recoding algorithm in Fig. 3 works in a serial way: one multi-base term at a time and starting with the least significant one. Each term can be immediately used in the scalar multiplication algorithm in Fig. 4 . This algorithm computes Q = [k]P using LT and is a multi-base adaptation of the standard left-to-right scalar multiplication algorithm (see for instance [1, Sec. 3 
Fig. 4 (recoverable line): 4: for j from 1 to l do
The combination of the recoding (Fig. 3) and scalar multiplication (Fig. 4) algorithms allows recoding steps to be overlapped with curve-level operations. For instance, when divisibility by 3 is detected, the exact division by 3 and the TPL operation can be launched in parallel. Our recoding algorithm produces an MBNS representation with a recursive factorization similar to the Horner scheme. Fig. 10 illustrates a complete example. Unlike previous DBNS and MBNS recoding methods, ours can be fully embedded in hardware and operates on-the-fly. First, we do not need costly tables or computations such as the approximation of k by multi-base terms. Second, as soon as a divisibility is detected, we can launch the corresponding curve-level operation.
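To make the use of the list of terms concrete, here is a hedged Python sketch that consumes the terms in the order they are produced (least significant first). The accumulation scheme, with an accumulator Q and a running multiple R of P, is an assumption consistent with the remark that mADD cannot be used; it is not a transcription of Fig. 4. Integers stand in for curve points, and the names `scalar_mult_from_terms` and `LT_87` are hypothetical.

```python
def scalar_mult_from_terms(LT, k_final, P, bases=(2, 3, 5, 7)):
    """Evaluate [k]P from terms (d, exponents), least significant term first.

    Integers stand in for curve points (assumption made for illustration).
    """
    Q = 0          # accumulator, stand-in for the point at infinity O
    R = P          # running multiple of P
    for d, exps in LT:
        if d != 0:
            Q = Q + d * R             # ADD (d = 1) or SUB (d = -1)
        for b, e in zip(bases, exps):
            for _ in range(e):
                R = b * R             # DBL, TPL, QPL or SPL
    return Q + k_final * R            # residual scalar (0 or 1)


# Terms of 87 = 0 + 3 * (1 + 2^2 * 7) for B = (2, 3, 5, 7)
LT_87 = [(0, (0, 1, 0, 0)), (1, (2, 0, 0, 1))]
result = scalar_mult_from_terms(LT_87, 1, 5)   # 87 * 5 with the integer stand-in
```

In this model each term can be processed as soon as the recoding unit delivers it, which is the property that allows recoding and curve-level operations to overlap.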
As we start with the least significant terms first, we cannot use mixed-coordinate point addition (mADD). We are obliged to use standard point addition, which is a little slower. Clearly our method is not competitive with the fastest state-of-the-art ones when costly off-line recoding is possible. But it provides the first full on-the-fly hardware implementation.
Overlapping recoding operations with curve-level operations is possible thanks to the very fast divisibility tests and exact divisions. For instance, with n = 160, w = 12 and t = 14, divisibility tests by all $b_j \neq 2$ and exact division by one $b_j \neq 2$ respectively require t + 3 = 17 and t + 4 = 18 clock cycles. These small durations have to be compared to the duration of one DBL, TPL, QPL, etc., which are significantly slower.
There is a short latency at the very beginning (less than 0.01% of the total [k]P computation time for n = 160 and even less for larger fields). The first curve-level operation is determined using the first divisibility test results. After t + ε clock cycles there are two cases: i) k is divisible by one multi-base element $b_j$, so a DBL, TPL, QPL, etc. can be launched depending on which $b_j$ divides k, ii) k is not divisible by B elements and an ADD can be launched.
In this work we do not provide an analytical evaluation of the number of curve-level operations of each type. But we provide extensive experimental evaluations of both the number of operations of each type and the complete scalar multiplication duration.
Selection of B elements is required. Fig. 5 reports statistical [k]P timings (in M) for 10 000 random 160-bit values recoded using our unsigned MBNS algorithm for various multi-bases.
The scalar multiplication algorithm in Fig. 4 is not protected against SCAs. Simple power analysis attacks can be used due to the different behavior of curve-level operations in lines 3 and 5. We will see in Sec. IV how signed MBNS recoding can be used as a protection against some attacks.
B. Implementation of the Divisibility Tests
At each recoding step, the scalar remainder has to be tested for divisibility by all $b_j$ for 1 ≤ j ≤ l. Testing divisibility by $2^s$ with s ≤ w in a radix-2 representation is straightforward and implemented in a very small module of the recoding unit, see Sec. III-D. For $b_j \neq 2$, we use a very old method based on specific properties of the sum of the argument digits modulo $b_j$. This method for divisibility test by $b_j$ in radix-r representation is reported in a posthumous publication of Blaise Pascal [21] (in Latin, see [22] for comments in English). This method is often called Pascal's tape. We only provide details for $b_j \in \{3, 5, 7\}$ as they lead to the most efficient B. Tab. I reports the remainders $2^i \bmod b_j$. They form a periodic sequence. Using Tab. I with Pascal's tape in case $b_j = 3$, the periodic sequence is $(2\ 1)^*$, then one has: $k \equiv \alpha \pmod 3$ with $\alpha = \sum_{i=0}^{\lceil n/2 \rceil - 1} (2 k_{2i+1} + k_{2i})$ and $k_i = 0$ for i ≥ n.
Computation of α requires the sum of many 2-bit words (n ≫ 100). α is a multi-bit integer, so it has to be recursively reduced using the same method. There is a trade-off between the size of the intermediate accumulators and the reduction completion. The architecture presented in Fig. 6 decomposes this large operation into partial sums and partial reductions on a limited number of bits for each word of the scalar. This is the purpose of the light block denoted "for $b_j = 3$" and its connected register.
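The digit-sum principle can be sketched in Python as follows. This is a software illustration only, not a model of the architecture of Fig. 6: the names `PERIOD`, `fold` and `is_divisible` are hypothetical, and the sketch relies on the fact that $2^2 \equiv 1 \pmod 3$, $2^4 \equiv 1 \pmod 5$ and $2^3 \equiv 1 \pmod 7$, so that summing digits of 2, 4 or 3 bits preserves the remainder.

```python
PERIOD = {3: 2, 5: 4, 7: 3}   # smallest p with 2**p = 1 (mod b)


def fold(x, p):
    """Sum the p-bit digits of x; this preserves x modulo 2**p - 1."""
    mask = (1 << p) - 1
    s = 0
    while x:
        s += x & mask
        x >>= p
    return s


def is_divisible(k, b):
    """Divisibility of k >= 0 by b in {3, 5, 7} using recursive digit sums."""
    p = PERIOD[b]
    while k >> p:              # reduce until a single p-bit digit remains
        k = fold(k, p)
    return k % b == 0          # k < 2**p here: a tiny comparison to constants
```

Since each $b_j$ divides $2^p - 1$ for its period p, the folded value and k have the same remainder modulo $b_j$; the final test only involves a value of at most 4 bits.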
Hardware implementations reported below have been described in VHDL and implemented on an XC5VLX50T FPGA using ISE 12.4 from Xilinx with standard efforts for synthesis, place and route. We report numbers of clock cycles, best clock frequencies and numbers of occupied slices. We also report numbers of look-up tables (LUTs with 6 inputs in Virtex 5) and flip-flops (FFs) for area. An XC5VLX50T contains 7 200 slices with 4 LUTs and 4 flip-flops per slice. We use flip-flops for all storage elements. FPGA implementation results are reported in Tab. II for divisibility tests by $b_j \in \{3, 5, 7\}$ and n = 160.
C. Implementation of the Exact Division by b j Elements
Here, exact division k/b j means that we know that dividend k is divisible by divisor b j (using divisibility tests presented above). This significantly optimizes the division. [23] provides an efficient algorithm when the radix is prime or power of 2. Division by 2 s is straightforward and is not considered in this section (a shifter is included in the recoding unit Sec. III-D).
We use a dedicated version of algorithm presented in [23] for b j ∈ {3, 5, 7} and optimized for hardware implementation. Our algorithm, presented in Fig. 7 , operates in t iterations in a word-serial way starting with least significant. Iteration number i deals with k (i) the ith word of k. The inverse of divisor b j modulo 2 w is a constant and always exists since multi-base elements are co-prime and include 2.
The main differences for the 3 operations (b j ∈ {3, 5, 7}) are the multiplication by modular inverse on line 4 and comparisons to constants on line 7. All other elements are shared in the architecture (operators, control, registers).
Tab. III reports binary representations of the modular inverses for exact division by 3, 5 and 7. Multiplication $r \times (b_j^{-1} \bmod 2^w)$ at line 4 in Fig. 7 is implemented as a sequence of additions/subtractions and shifts using an in-house multiplication by constants algorithm [24]. The subtraction at line 3 is inserted in the sequence. Some adders in the 3 sequences are shared to reduce area. Tab. III reports γ, the number of additions/subtractions required to perform $r \times (b_j^{-1} \bmod 2^w)$. The architecture of our exact division unit for $b_j \in \{3, 5, 7\}$ is presented in Fig. 8. At iteration i, word $k^{(i)}$, read (R port) from the scalar memory, is added to −c and used in the addition sequence (blocks denoted "×(b j mod 2 w )" and "± seq.") corresponding to $b_j$. The correct value is selected by MUX1 and written in the scalar memory (W port). We use an in-place version of the algorithm $k \leftarrow k/b_j$ to keep the memory footprint as small as possible. Comparisons in the loop at lines 6-8 of the algorithm in Fig. 7 are unrolled and implemented as combinatorial logic ("cmp. b j " block) for each $b_j \in \{3, 5, 7\}$. The correct value c is selected by MUX2 and sent to the addition sequences. The FPGA implementation results for our exact division unit are reported in Tab. IV for n = 160. For w = 24, speed decreases due to the complexity of the comparison blocks. For example, in case $b_j = 7$, six comparisons are required.
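The word-serial exact division can be modelled in Python as below. This is a hedged sketch and not the algorithm of Fig. 7 verbatim: the function name `exact_div` is hypothetical, and the carry c is computed here arithmetically as $\lfloor (r \times b_j + c) / 2^w \rfloor$, whereas the hardware obtains it through comparisons to constants. The built-in `pow(b, -1, m)` used for the modular inverse requires Python 3.8 or later.

```python
def exact_div(k, b, n=160, w=12):
    """Word-serial exact division k / b for odd b dividing k, with 0 <= k < 2**n."""
    t = -(-n // w)                        # number of w-bit words, ceil(n / w)
    mask = (1 << w) - 1
    inv = pow(b, -1, 1 << w)              # constant b^-1 mod 2^w (exists since b is odd)
    q, c = 0, 0
    for i in range(t):                    # least significant word first
        word = (k >> (w * i)) & mask
        r = (((word - c) & mask) * inv) & mask   # subtract carry, multiply by inverse
        c = (r * b + c) >> w              # carry for the next word, 0 <= c < b
        q |= r << (w * i)                 # quotient word written back in place
    return q
```

For example, `exact_div(87, 3)` returns 29. Each iteration only involves one w-bit word, which is what allows the t + ε′ clock cycle count quoted in Sec. III-A.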
D. Unsigned Multiple-Base Recoding Unit
The complete recoding unit architecture is presented in Fig. 9. The scalar memory stores k. The small subtraction block is in charge of line 5 in the recoding algorithm of Fig. 3. The DTD-2 block is in charge of the divisibility test (1-bit result) and exact division (w-bit bus) by 2 or small powers of 2 ($2^s$ with s ≤ w; if s > w several iterations are used). The divisibility test unit for the other base elements is described in Sec. III-B (3-bit output) while the exact division unit is described in Sec. III-C.
Fig. 7 (recoverable lines):
1: c ← 0
2: for i from 0 to t − 1 do
6: for h from 1 to $b_j − 1$ do
7: if $r \geq h \times \lceil (2^w − 1)/b_j \rceil$ then
8: c ← c + 1
FPGA implementation results for the complete unsigned recoding unit are reported in Tab. V for B = (2, 3, 5, 7) and n = 160 (see Sec. IV-B for a comparison to an ECC processor).
Our recoding unit operates significantly faster than curve-level operations and in parallel to them. It provides on-the-fly information on which curve-level operations are to be launched, without interruptions, as illustrated in Fig. 10. In this figure, "CLO" denotes curve-level operations, "DT" denotes divisibility tests, "res." their results and "/b j " exact division by $b_j$. For k = 87, the recoding is $87 = 0 + 3^1 \times (1 + 2^2 7^1)$. The recoding of the first term (0 + 3) is performed as fast as possible while that of the second one $(1 + 2^2 7^1)$ is spread over the curve-level operations, without interruption. This provides us with options when designing the control.
For n = 160, w = 12 and t = 14, recoding a scalar corresponding to $2^8 3^4 5^2 7^1$ requires 292 clock cycles. This is less than one DBL operation (even for a parallel architecture). It reduces to 195 clock cycles for w = 24 and t = 7. The same term recoding requires 772 clock cycles for n = 521 and w = 12 (435 for w = 24).
E. Validation
Both recoding and scalar multiplication algorithms have been implemented in PARI/GP (http://pari.math.u-bordeaux. unsigned version 
IV. SIGNED-DIGIT OPTIMIZATIONS
The unsigned recoding algorithm, Fig. 3 of Sec. III-A, only performs k ← k −1 when k is not divisible by B elements. As with other number systems, such as Booth recoding, wNAF, Avizienis representations, or DBNS, using signed digits may help us to reduce the number of terms.
A. Signed-Digit MBNS Recoding
A simple modification of our MBNS recoding algorithm introduces a selection function S which returns the signed digit d = ±1 when k is not divisible by B elements. Determining S such that the recoding algorithm always produces the shortest list of terms is a very hard problem. We compared 4 heuristic selection functions and trade-offs between recoding performance (LT length) and implementation complexity (i.e. silicon area and recoding speed).
1) Minimum value selection function (min):
min is illustrated in Fig. 12. When k is not divisible, then k − 1 and k + 1 will be divisible by, at least, 2 (we always use $b_1 = 2$ in practice) and potentially other $b_j$. For each value k − 1 and k + 1, the divisibility test and exact division units are used to produce k′ and k″. k′ (resp. k″) corresponds to k − 1 (resp. k + 1) divided as much as possible by B elements. S returns the digit d that corresponds to the smaller of k′ and k″. The min selection function only provides a local minimum for the total number of terms. A second scalar memory (t words × w bits) has been added in the recoding unit to store both k′ and k″. The exponents corresponding to both k − 1 and k + 1 have been computed and stored during the exploration. The controller is adapted such that the correct set of exponents is selected.
Fig. 12. Principle of the min selection function
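A software model of the signed recoding with the min selection function may look as follows. It is a sketch written from the description of min only: the names `strip_bases` and `recode_signed_min` are hypothetical, the convention k ← k − d with d = ±1 is assumed, and ties between k′ and k″ are broken in favor of d = 1, which is an arbitrary choice made here.

```python
def strip_bases(k, bases):
    """Divide k >= 1 by every base as often as possible; return (k', exponents)."""
    exps = []
    for b in bases:
        e = 0
        while k % b == 0:
            k //= b
            e += 1
        exps.append(e)
    return k, tuple(exps)


def recode_signed_min(k, bases=(2, 3, 5, 7)):
    """Signed multi-base recoding of k >= 1 with the min selection heuristic."""
    LT = []
    while k > 1:
        if any(k % b == 0 for b in bases):
            d = 0
            k, exps = strip_bases(k, bases)
        else:
            k_minus, e_minus = strip_bases(k - 1, bases)   # candidate d = +1
            k_plus, e_plus = strip_bases(k + 1, bases)     # candidate d = -1
            if k_minus <= k_plus:                          # tie-break: assumption
                d, k, exps = 1, k_minus, e_minus
            else:
                d, k, exps = -1, k_plus, e_plus
        LT.append((d, exps))
    return LT, k
```

For k = 23, the candidates are 22 → 11 and 24 → 1, so the sketch selects d = −1 and produces the single term 23 = −1 + 2³ × 3.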
2) Maximum number of divisors selection function (max_nb_div):
In min, half of divisibility tests and exact divisions are discarded. In max_nb_div, the number of base elements b j which divide k − 1 and k + 1 is counted. S returns d according to the maximum number of divisors among k − 1 and k + 1. Only divisibility unit for k − 1 and k + 1 is used, not the exact division unit. max_nb_div is a cheap optimization but with low efficiency (see Sec. IV-C).
3) Approximated minimum value selection function (approx):
In order to provide a cheap optimization but with higher performance, the approx selection function compares approximations of k′ and k″ from min (instead of computing them exactly as in min). Exponents $e'_j$ (resp. $e''_j$) correspond to the divisibility test results for k − 1 (resp. k + 1). The approximations of k′ and k″ are respectively defined by $\delta' = \lfloor \log_2(k-1) \rfloor + 1 - \sum_{j=1}^{l} e'_j \log_2 b_j$ and $\delta'' = \lfloor \log_2(k+1) \rfloor + 1 - \sum_{j=1}^{l} e''_j \log_2 b_j$,
where $\lfloor \log_2(k-1) \rfloor + 1$ is the position of the most significant bit (MSB) of k − 1 (idem for k + 1). MSB positions can be easily detected using our divisibility test unit (during the t-iteration loop over the words). The weights $\log_2 b_j$ are approximated by short constants, so there is no need for multiplications. As an example, for B = (2, 3, 5, 7), $\delta' = \mathrm{MSB}(k-1) - e_{b_1=2} - 1.5\, e_{b_2=3} - 2.25\, e_{b_3=5} - 2.75\, e_{b_4=7}$ where $e_{b_1=2} \leq w$ and $e_{b_2=3}, e_{b_3=5}, e_{b_4=7} \leq 1$. The constants come from: $\log_2 3 \approx 1.59$, $\log_2 5 \approx 2.32$, and $\log_2 7 \approx 2.81$. The approximation for both k′ and k″, as well as their comparison, can be easily implemented using a very small circuit (see [24]).
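The example given for B = (2, 3, 5, 7) can be sketched in Python with integer arithmetic only, by scaling the constants 1, 1.5, 2.25 and 2.75 by 4. The names `WEIGHT_X4`, `delta_x4` and `select_approx` are hypothetical, and the tie-breaking rule (d = 1 when both approximations are equal) is an assumption.

```python
WEIGHT_X4 = {2: 4, 3: 6, 5: 9, 7: 11}   # 4 * (1, 1.5, 2.25, 2.75)


def delta_x4(x, w=12):
    """4 * (MSB(x) - e_2 - 1.5 e_3 - 2.25 e_5 - 2.75 e_7) for x >= 1."""
    e2 = min((x & -x).bit_length() - 1, w)       # trailing zeros, capped at w
    score = 4 * x.bit_length() - WEIGHT_X4[2] * e2
    for b in (3, 5, 7):
        if x % b == 0:                            # divisibility flag, exponent <= 1
            score -= WEIGHT_X4[b]
    return score


def select_approx(k):
    """Return d = +1 (continue with k - 1) or d = -1 (continue with k + 1), k > 1."""
    return 1 if delta_x4(k - 1) <= delta_x4(k + 1) else -1
```

For k = 23, the approximation for k + 1 = 24 is much smaller than the one for k − 1 = 22, so `select_approx(23)` returns −1, the same choice as min on this example.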
4) 2 steps minimum value selection function (min2):
It uses the time margin illustrated on Fig. 10 using a recursion limited to the next term. The first step uses min with (k−1, k+ 1) to produce (k ′ , k ′′ ). The second step uses min with (k
B. FPGA Implementation
The signed MBNS recoding unit has been implemented on FPGA (see the end of Sec. III-B for target and tools details). Tab. VI reports the corresponding results for the approx selection function, B = (2, 3, 5, 7) and n = 160 bits. Compared to the unsigned version in Tab. V, the area overhead for the signed version is very small: 13% more slices (8% FF and 13% LUTs). In Virtex 5 FPGAs, very small memories, such as the scalar ones, can be efficiently implemented using distributed RAMs in the LUTs of SLICEMs. This explains why only 25 additional flip-flops are required for the signed version while there is a 168-bit memory (t = 14 and w = 12) for the second scalar memory. The same speed is achieved for both signed and unsigned versions (the critical path is in the exact division unit).
In order to compare our MBNS recoding unit to a complete ECC processor, Tab. VII reports two FPGA implementations of an ECC processor provided by the authors of [25] (for curves over $\mathbb{F}_p$, n = 160 bits and Jacobian coordinates). The first one (small version) uses the NAF method with one arithmetic unit per field operation, while the second (large version) uses the 4NAF method and two arithmetic units per field operation but one modular inversion. Our MBNS signed recoding unit works at a higher frequency than the ECC processor. Its area represents less than 10% (resp. 7%) for w = 12 compared to the complete small (resp. large) version of the ECC processor.
D. Randomized Selection Function
Side-channel attacks exploit some correlations between secret values manipulated in the device and physical parameters measured on the device such as power consumption, electromagnetic emanations or computation timing. Refer to [26] for a complete introduction on power analysis based SCAs. Typical SCAs and counter-measures for ECC are summarized respectively in [27] and [28] . Protections against simple attacks (based on only one or a very few traces) mainly use uniformization (or atomicity) and randomization methods. Protections against differential attacks (based on statistics over many traces) mainly use randomization methods.
We experimented with a simple randomized selection function (rnd) as a protection against some SCAs. When k is not divisible by B elements, S returns d = 1 or d = −1 randomly. Obviously this leads to a larger number of terms (and point additions) in the recoding, as reported in Tab. VIII (for simplified Weierstrass curves with a = −3 and 10 000 random scalars). The proposed randomization scheme allows a scalar substring to be represented using totally different recodings. This is a direct protection against some differential attacks due to the huge number of different representations using signed digits [14]. The robustness of our randomized selection function relies on the fact that point addition and point subtraction cannot be distinguished in traces. Indeed, protecting the sign change when using point subtraction is supposed to be simple in the circuit. But we still have to perform a more complete security evaluation at hardware level using a real attack system and to compare to other protection schemes (e.g. addition chains [29]).
V. COMPARISON TO STATE-OF-ART
Below we compare our MBNS recoding and scalar multiplication algorithms for various multi-bases to state-of-art methods. We report results over F p for simplified Weierstrass curves with unspecified parameter a and a = −3, using Jacobian coordinates, similarly to most of DBNS/MBNS references.
A. Costs of Curve-Level Operations
Tab. IX reports the best computation costs, given in field-level operations (M, S), for various curve-level operations over $\mathbb{F}_p$ from the literature. EFD is the excellent web site Explicit-Formulas Database http://hyperelliptic.org/EFD. We apply the typical cost assumption used in many references: 1S = 0.8M. λDBL (resp. λTPL) denotes a sequence of λ successive DBL (resp. TPL) operations (e.g. $k = 2^\lambda$ or $k = 3^\lambda$). We use λDBL or λTPL operations when they are faster than their equivalent sequence of DBL or TPL.
B. Performance Comparisons
Some previous scalar multiplication algorithms require additional points to speed up computations. These additional points are multiples of the initial point P and are stored in the cryptoprocessor during the complete scalar multiplication (2 n-bit registers per additional point). Most methods assume pre-computed points represented using affine coordinates to benefit from the fast mixed-coordinate addition mADD. Tab. X reports the costs of typical pre-computations. Costs at field level include a conversion to affine coordinates which requires field inversions (usually 1M + 1S + 2I per additional point). We assume 1I = 15M for $\mathbb{F}_p$ inversion.
These costs can be neglected for multiple successive [k]P operations with the same P , but it is not the case if P changes before each scalar multiplication (e.g. support of various protocols/sizes, base point randomization method, . . . ).
Tab. XI and XII compare scalar multiplication methods from the literature to our signed MBNS method using 10 000 random scalars and the approx selection function. In these tables, DBNS results have been computed using the PARI/GP program kindly provided by the author of [7]. This program generates a signed DBNS chain using pre-computations, where approximations are obtained from a search table.
The results presented in references [16] and [32] are based on a left-to-right MBNS scalar multiplication algorithm to benefit from mADD, while the scalar is recoded using a right-to-left algorithm (this strategy prevents them from providing an on-the-fly computation). If we use a similar strategy, the computation cost reduction is estimated at (11 + 5 × 0.8) − (7 + 4 × 0.8) = 4.8M times the number of ADD operations. In case B = (2, 3, 5) and n = 160, this leads to a reduction of about 134M. Hence, references [16] and [32] are still faster than our method but with a much smaller difference.
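The arithmetic behind this estimate can be checked with a short Python sketch. It only uses figures quoted in the text (ADD = 11M + 5S, mADD = 7M + 4S, 1S = 0.8M and the 134M reduction); the variable names are hypothetical and the implied number of additions is a back-of-the-envelope value, not a reported result.

```python
S_OVER_M = 0.8                      # cost assumption 1S = 0.8M


def cost_in_M(m, s):
    """Cost of m multiplications and s squarings, expressed in M."""
    return m + s * S_OVER_M


ADD = cost_in_M(11, 5)              # standard Jacobian addition
mADD = cost_in_M(7, 4)              # mixed-coordinate addition
saving_per_add = ADD - mADD         # about 4.8M saved per point addition
implied_adds = 134 / saving_per_add # average number of ADDs implied by 134M
```

Under these assumptions, the 134M reduction corresponds to roughly 28 point additions per scalar multiplication for B = (2, 3, 5) and n = 160.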
VI. CONCLUSION
In this paper, we proposed a simple multi-base recoding algorithm for ECC scalar multiplication in hardware without any pre-computations. The scalar recoding is performed on-the-fly and in parallel to curve-level operations without additional latency. The proposed recoding circuit uses cheap divisibility tests by multi-base elements and exact division using very small dedicated hardware units. Our MBNS recoding and scalar multiplication method is a little less competitive than other DBNS/MBNS methods when pre-computations or off-line recoding can be used. But our method leads to more efficient solutions in embedded applications fully integrated in hardware, without resources for costly recoding and with limited storage. As future work, we plan to deal with more advanced recoding schemes to reduce the number of produced terms, and with improved randomization schemes to increase robustness against side-channel attacks.
Tab. IX. Costs of curve-level operations (header row and first row label not recoverable):
11M + 5S | 7M + 4S | 3M + 5S | 7M + 7S | 11M + 11S | 18M + 11S | 28M + 15S
[16], [31] | 11M + 5S | 7M + 4S | 3M + 5S | 7M + 8S | 10M + 12S | 14M + 15S | n. a.
curves | references | λDBL | λTPL
a = −3 | [6], [11], [33] | 4λM + (4λ + 2)S | (11λ − 1)M + (4λ + 2)S
curves | references | λTPL / λ′DBL
a = −3 | [6], [11] | (11λ + 4λ′ − 1)M + (4λ + 4λ′ + 3)S
