In this paper, we present modular multipliers for hardware implementations of (hyper)-elliptic curve cryptography on FPGAs. The prime modulus P is generic and can be configured at run-time to provide flexible circuits. A finelypipelined architecture is proposed for overlapping the partial products and reductions steps in the pipeline of hardwired DSP slices. For instance, 2, 3, or 4 independent multiplications can share the hardware resources at the same time to overlap internal latencies. We designed a tool, distributed as open source, for generating VHDL codes with various parameters: width of operands, number of logical multipliers per physical one, speed or area optimization, possible use of BRAMs, target FPGA. Our modular multipliers lead to, at least, 2 times faster as well as 2 times smaller circuits than state of the art operators.
I. INTRODUCTION
Designing efficient modular arithmetic is key for implementing asymmetric cryptography. In such applications, multiplication is the most important operation for speed and cost purpose. Several algorithms and architectures have been proposed. Here, we deal with modular multipliers for GF(P ) used in implementations of (hyper-)elliptic curve cryptography ((H)ECC) on embedded systems. In our cryptographic applications, we target generic primes (i.e. P has a non-specific binary representation) and field sizes about 128 bits for HECC and 256 bits for ECC. We do not consider field sizes smaller than 100 bits or larger than 400 bits for our applications. In order to provide flexible operators, the modulus P can be defined at run time. This allows us to easily support various curves and applications in one circuit.
We focus on FPGAs embedding hardwired DSP slices dedicated to operations such as a × b ± c over small integers (typically 10 to 20 bits). To exploit them at a high frequency, one has to use the dedicated pipeline registers inside each DSP slice. This can be tricky in iterative algorithms where the number of operations per iteration is smaller than the pipeline depth (leading to idle stages in the schedule). Hyper-threading has been proposed in this type of situation, see for instance [1] . One physical unit is simultaneously shared by several logical units for overlapping internal latencies.
In [2] , we proposed finely-pipelined modular multipliers (FPMMs) for 128-bit HECC designed by hand for a few FPGAs (the URL is given in the references Section). Below, we extend these results with new optimizations, a wider G. Gallin is with CNRS, IRISA UMR 6074, University Rennes and INRIA Centre Rennes -Bretagne Atlantique, France.
A. Tisserand is with CNRS, Lab-STICC UMR 6285 and University South Brittany, in Lorient, France. (arnaud.tisserand@univ-ubs.fr).
parameter space and more supported FPGAs. The parameter space is so wide that we decided to design a tool, distributed as open source [3] , to generate optimized FPMMs in VHDL. Compared to [2] , our tool allows us to explore multipliers with:
• flexible prime P defined at run time (P was fixed in [2] ); • fewer DSP slices and a shorter computation time; • a higher ratio between the operations throughput and the area cost compared to state of the art solutions; • a higher frequency (close to the maximum allowed by the technology, e.g. 502 MHz upon 550 MHz in Virtex-5); • a larger parameter space at design time: width of operands, number of logical multipliers per physical one, more supported FPGAs, potential use of BRAMs, and new area/speed optimizations. Thanks to our finely-pipelined architecture, various optimizations and our exploration tool, we can propose more efficient multipliers than the state of the art. As an example, for a 128-bit prime modulus on a Virtex-7 FPGA, our best multiplier (code name F44D below) only requires 9 DSP slices, 600 LUTs and 151 clock cycles at 633 MHz to compute 8 modular multiplications (239 ns). As a comparison, the best operator from the state of the art (code name MA16) requires 21 DSP slices, 1182 LUTs, and 167 clock cycles at 350 MHz (478 ns). At least, our most efficient solutions are 2 times faster as well as 2 times smaller than state of the art one for typical applications in HECC.
Section II recalls the state of the art. Our FPMM operators and tool are described in Sections III and IV respectively. Implementation results and comparisons are reported and analyzed in Section V. Finally, Section VI concludes the paper.
Notations:
• modular multiplication A × B mod P with P prime; • n the width of P (e.g. 128, 256 bits); • m the width of field elements (m > n) in Montgomery domain decomposed into s words of w bits (m = sw); • σ the number of logical multipliers (LMs) per FPMM; • Clock cycles are denoted cycles and abbreviated cc; • λ the multiplication latency (i.e. duration in cc); • τ the interval between 2 multiplications in a LM (in cc); • LSW (/MSW) stands for least (/most) significant word.
II. STATE OF THE ART IN MODULAR MULTIPLICATION
In this work, we do not consider specific modulus forms such as (pseudo)-Mersenne primes where the reduction can be performed using a few shifts and additions. These specific primes lead to fast (H)ECC implementations (see for instance [4] ), but they are limited to only one field characteristic. Our target applications require the support of several curves and fields then we only deal with generic primes.
Several algorithms and architectures for modular multiplications with generic primes have been proposed. Some solutions use the interleaved algorithm from Blakley [5] : e.g. [6] , [7] and [8] . Other solutions use the reduction method from Barrett [9] : e.g. [10] . Montgomery Modular Multiplication (MMM) [11] algorithm and its numerous variants are widely used for both software and hardware implementations.
Thanks to operands and results represented in Montgomery domain, MMM performs the modular reduction by the help of two multiplications, a few additions, and one shift: e.g. [11] , [12] . In [13] , Walter proposed to remove the final subtraction in the original MMM by enlarging the domain such that the internal parameter R is R > 4P (instead of R > 2P ).
In [14] , five MMM variants are compared. Among them, the Coarsely Integrated Operand Scanning (CIOS) algorithm was proposed for software implementations with small computation time and memory requirements, see Algo. 1. CIOS efficiently interleaves partial products and partial reductions steps in a regular decomposition using high-radix subwords (while Blakley algorithm uses interleaving but with very small subwords). Efficient CIOS implementations on FPGAs have been proposed in [15] , [16] , [17] using DSP slices. Among MMM solutions, the CIOS algorithm, and its variants, are widely used in cryptographic applications.
The high-radix algorithms proposed in [12] relax data dependencies inside iterations and simplify the quotient computation by enlarging the internal datapath with a wider Montgomery domain. Later, [18] and [19] used this idea to design FPGA implementations with increased internal parallelism and reduced computation time. The 256-bit ECC processor over generic GF(P ) presented in [19] is one of the fastest on FPGAs. Their multiplier requires 37 DSP slices. The reported frequency is 250 or 291 MHz for Virtex-4 or Virtex-5 respectively (the authors state that the critical path is in the multiplier), but no detailed area metrics are reported.
As the results from [19] are very good, we reproduced their multiplier on more FPGAs and other sizes. In our first work [2] , we only implemented the 128-bit version of the multiplier from [19] . Since this first work, we added a 256bit version of their multiplier. Our implementation results of the multipliers from [19] are reported in Table III and denoted MA16. They are very close to the original results from [19] . For 256 bits, we also need 37 DSP slices, and we have very close frequencies on the same FPGAs. Only our internal schedule is slightly different: for one MMM our implementation requires 37 cycles instead of 35, but the interval between 2 multiplications is reduced to 28 cycles instead of 29. Our re-implementation of the multiplier from [19] is similar to the original one. Thanks to all the details provided in [19] , we were able to reproduce their results with a good accuracy. This was not possible for other works.
Iterative Digit-Digit Montgomery Multiplication (IDDMM) was proposed in [20] for 256-bit to 1024-bit generic primes. This solution was inspired from CIOS. For comparisons, we use their results for 256-bit multipliers on Virtex-5 with 64-bit internal datapath (solution denoted below MO64).
In [21] , a modified IDDMM was proposed on Virtex 7. For comparisons, we use their results for 256-bit multipliers with 32-bit (denoted AM32) and 64-bit (AM64) internal datapaths.
Systolic architectures have been proposed for MMM in [22] and frequently optimized: [23] , [24] and [25] . For comparisons, we use some results from [26] where systolic CIOS implementations are reported for 128 and 256-bit multipliers. Internal decompositions into 8 and 16 words have been proposed and are denoted below MR8 and MR16.
Using the CIOS and FIOS (Finely Integrated Operand Scanning) algorithms from [14] , the work presented in [27] recently proposed area-efficient MMMs for ECC on low-power IGLOO 2 FPGAs. For comparisons, we use their results with CIOS (denoted MAS1) and FIOS (denoted MAS2). Their units embed both MMM and modular addition/subtraction.
Other solutions have been proposed for ASIC implementations where small multipliers (such as DSP slices) cannot be used efficiently. For instance, the Multiple-Word Radix-2 Montgomery Multiplication (MWR2MM) proposed in [28] uses iterations with products of very small 2 or 4-bit words by the full multiplicand. The works [29] , [30] , [31] and [32] provide interesting results for ASIC implementations. Below, we only target FPGA implementations with DSP slices. Then, we do not consider this type of MMM variant.
III. FINELY-PIPELINED MODULAR MULTIPLIERS

A. Fine pipeline for speeding-up GF(P ) multipliers
We use the CIOS algorithm from [14] without final subtraction from [13] , see Algo. 1, because it is simple and regular. In the outer loop (index i), the internal computations have sequential dependencies which are difficult to map in DSP slices without "bubbles" in the pipeline at a high frequency. Two typical solutions in the literature mitigate this dependency problem: i) using a more complex algorithm with relaxed dependencies (e.g. [12] ); ii) reusing some hardware resources in different tasks with a more complex control (e.g. [27] ).
Internal values are represented on m bits, instead of n for "raw" GF(P ) elements due to the Montgomery domain. Based on the CIOS algorithm from [14] without final subtraction from [13] , we select R > 4P (where R = 2 m ). Then m is slightly larger than n. In most of FPGAs, hardwired units are not wide enough for cryptographic sizes (128 or 256 bits in our applications). In our target FPGAs, BRAMs words are at most 36-bit wide while DSP slices are 18 × 18 or 18×25 bits units. Then processing each GF(P ) element would require many parallel BRAMs and DSP slices. But using many BRAMs is useless since the number of intermediate GF(P ) elements is small in ECC and HECC (up to 10 and 20 respectively). In FPMM, we decompose m-bit elements into w-bit smaller words processed sequentially (w m). We denote s the number of w-bit words required for each operand with m = sw. This processing scheme efficiently fits DSP slices and BRAMs in our FPGAs. It also reduces the interconnect area in a complete cryptoprocessor. Finally, it leads to higher clock frequencies.
We use CIOS for its simplicity and regularity properties combined to a finely-pipelined architecture to overlap internal latencies in the hardwired pipeline of DSP slices. A physical FPMM supports σ logical multipliers for independent MMMs simultaneously. In FPMM, operand loading and result output are limited to one LM per cycle (see Figure 1 ).
The behavior for σ = 3 LMs and s = 2 words per element is illustrated in Figure 1 (C means cycle):
• C1: load w-bit words (a 0 , b 0 ) for the first operation; • C2: load (a 1 , b 1 ) and start A × B in LM1; • C3: load (c 0 , d 0 ) for the second operation; • C4: load (c 1 , d 1 ) and start C × D in LM2; • C5: load (e 0 , f 0 ) for the last operation; • in next cycles, all LMs compute σ independent products; • Cδ: output the w-bit word AB 0 of the result; • Cδ + 1: output the word AB 1 of the result; • the next 4 cycles output products CD and EF . After loading the operands (e.g. a 0,...,s−1 , b 0,...,s−1 ), all pipeline stages in each DSP slice perform intermediate computations for all independent MMMs at different clock cycles.
We denote λ the latency between the loading of the first word of the operands and the output of the last word of the result (i.e. δ + 1 cycles in our example).
The interval between two successive MMMs in the same LM, denoted τ , is less than λ. New operands can be loaded in the first stage of the pipeline while the last stages are finishing the current MMM. In our example on Figure 1 , this corresponds to τ = δ − 1 cycles.
In the CIOS Algo. 1, the computations at iteration i of the outer loop are decomposed into 3 dependent tasks:
• Task 1: lines 3-5 compute the multiplication of word a i by B (with accumulation of previous iteration i − 1); • Task 2: line 6 computes the "reduction quotient" q i for the Montgomery algorithm; 
Algorithm 1: CIOS algorithm without final subtraction (based on [14] and [13] ). • Task 3: lines 8-10 compute the product q i × P (with an accumulation of the value computed in task 1). Between task 3 at iteration i and task 1 at iteration i + 1, we discard the least significant word (see Sec. III-F for implementation details) accordingly to the Montgomery algorithm.
Our FPMM architecture, presented in Fig. 2 , is designed to fit this decomposition into 3 tasks. Each task is handled by a dedicated block.
With pipelined DSP slices, the result from task 3 is not immediately available in task 1 for the next iteration i + 1. We need to wait for the pipeline depth. Our FPMM solution is inspired from hyper-threading to overlap this delay. With σ independent MMMs in one FPMM, we can always fill all stages without any "bubbles" in the pipeline. Hyper-threading principle is illustrated in Fig. 3 for a 3-stage iterative pipeline. In the top of Fig. 3 , only 1 MMM (in green) is computed in the pipeline, leading to idle stages. In the bottom of Fig. 3 , idle stages are used to compute 2 other independent MMMs (in red and blue) at the same time and all stages are used after the initial latency.
B. Selection of parameters s and w
Both s and w impact the performances. s is the number of iterations in both outer and inner loops in Algo. 1. In our case, the minimal possible value for s is 2 (this ensures the correct use of the CIOS principle). A larger s leads to a larger latency λ (it grows as O(s 2 )). The decomposition into w bits requires w × w bits partial products. w must be as large as possible to efficiently fill the operand width in the DSP slices and reduce s. Currently, we do not consider "rectangular" DSP configurations such as 18 × 25 bits (we plan to see how we can exploit those asymmetric operators in the future). We use 18 × 18 bits DSP slices (designed for two's complement). For unsigned integers, one has to set the MSB to 0 and only use the 17 LSBs (as stated in the FPGA documentation). Then w should be 17 bits or a multiple of 17 to use the full capacity of the DSP slices. We explored several values for w : 17, 34, 51 and 68 bits, with respectively 1, 4, 9 and 16 DSP slices for w × w bits partial products.
For w = 17 bits, we need only one DSP slice per block (3 in total for one FPMM). One BRAM is large enough to store both operands (e.g. A and B) for each LM. But the latency λ is very large due to a large s.
For w larger than 36 bits, we need several BRAMs per operand (with a very low memory usage, for instance HECC with w = 68 only requires 2 words). w larger or equal to 51 bits leads to huge circuits (at least 9 DSP slices per block) and a lower frequency. Then we do not consider large w parameters.
We select w = 34 bits. We need 4 DSP slices in each block for tasks 1 and 3 as illustrated in Fig. 2 . A complete FPMM requires 11 DSP slices. All words are small enough to fit into one single BRAM. The capacity of the smallest target BRAM is 9 Kb on Spartan FPGAs. The 2 BRAMs in a FPMM allow to store more than 200 w -bit words. They can store at least 50 pairs of 128-bit HECC operands or 25 pairs for 256-bit ECC. In practice, the number of logical multipliers σ is way smaller than those bounds (see Subsection III-C). Setting w = 34 bits is the most efficient solution in our FPGAs for both 128-bit HECC (s = 4) and 256-bit ECC (s = 8).
Internal loops in tasks 1 and 3 are processed in s cycles. One extra cycle is required for final carry propagation in each task: line 5 for task 1 and line 10 for task 3 in Algo. 1. We used this s + 1 cycles schedule in our previous work [2] . In Section III-F, we propose a new improvement to remove this extra cycle in tasks 1 and 3.
For the block dedicated to task 2, we showed in [2] that only 3 DSP slices are required to compute q i due to the reduction modulo 2 w . In Section III-E, we propose an improved version of this hardware block with only one DSP slice.
C. Selection of parameter σ
Parameter σ must be at least greater than 2 for pipelining several logical multiplications in the operator. The outer loop (index i) in Algo. 1, task 1 → task 2 → task 3 → task 1, colored in red on Figure 2 , has a duration of α cycles. The minimal value for α is 15 cycles, coming from the FPMM internal architecture and configuration of DSP slices in hardware blocks for tasks 2 and 3. This value only depends on words size w . It does not depend on the field size (e.g. 128, 256 bits) and on the number s of words.
The inner loops in tasks 1 and 3 require s + 1 cycles each to load and start the partial products. In order to fill all stages without "bubbles", one needs to select σ such that:
If σ is too big, result from task 3 must be delayed to wait until task 1 starts all intermediate computations in all LMs. Then σ(s +1)−α extra registers must be inserted. This would lead to a larger FPMM without any speed gain. If σ is too small, there are α − σ(s + 1) "bubbles" in the pipeline. This leads to under-utilizing some hardware resources.
The 128-bit architecture, s = 4 and w = 34, without the improvement from Section III-F, is illustrated in Fig. 2 . Fortunately, s + 1 = 5 leads to a minimal σ = 3 without need to increase α (no extra register).
For 256-bit, s = 8 and w = 34, the minimal α is not a multiple of s + 1 = 9. To stay close to the minimal α = 15, two σ values are possible: 1 and 2. With σ = 1, we have the same architecture but 6 "bubbles" in the pipeline (40 % of time is then wasted). With σ = 2, we have to add 3 extra w -bit registers between task 3 and task 1. Then there is no "bubble" in the pipeline.
Parameter σ = 3 or 2, respectively for 128 or 256-bit MMM, is the smallest possible value. However, a larger σ (and larger circuit due to extra registers) may lead to a better speedarea trade-off. We explore different values for σ in Section V.
D. Differences with our first FPMM published in [2]
Our first FPMM [2] was manually implemented only for 128-bit GF(P ) fields for the fixed parameters w = 34 bits, s = 4 words and σ = 3 LMs. The prime P was a constant fixed at design time. Implementations were done on Virtex-4, Virtex-5 and Spartan-6 FPGAs. It was more efficient than state of the art. But it was not possible to explore various sizes and parameters by hand.
The current paper presents the tool (see Section IV) we developed for generating various FPMMs, in VHDL, with several major improvements: j=0  j=1  j=2  j=3  j=4  j=0  j=1  j=2  j=3  j=4   j=0  j=1  j=2  j=3  j=0  j=1  j=2  j=3  j=4  j=4 before after Fig. 4 . Schedule example for 2 MMMs (colors) before and after latency reduction (rectangles are w -bit words computed at iterations of index j).
• Reduced latency λ of s × s cycles instead of s × (s + 1) (see optimization in subsection III-F); • Frequency increased up to 20 % (and now close to the maximal frequency of the DSP slices or BRAMs) with improved control; • Alternative architectures inside block for task 1 (see subsection III-G); • More supported FPGA families. As in [2] , we still support architectures with either BRAMs or DRAMs for operands memories.
E. Reduction of the number of DSP slices in block 2
The block dedicated to task 2 computes the "reduction quotient" q i from the Montgomery algorithm for each iteration of the outer loop (line 6 in Algo. 1). One w × w bits partial product modulo 2 w must be computed for each q i . For w = 34, this only requires 3 DSP slices instead of 4 due to the reduction modulo 2 w (see [2] for details). These 3 DSP slices were used with different configurations.
We analyzed the schedule of all internal operations. We were able to optimize the control and spread the three 17 × 17 multiplications at different cycles in only one DSP slice without any penalty on the global latency α. Now the only DSP slice uses 3 different configurations at different cycles. The DSP slice inputs are also selected among several possible internal values at the right cycle by the new internal control.
This optimization reduces the number of DSP slices from 3 to 1 in block 2 (from 11 to 9 in a complete FPMM). It requires a few new registers (less than 10 bits in total), 2 new w -bit multiplexers, but it does not reduce the frequency.
F. Latency reduction
In the FPMM from [2] , inner loops in tasks 1 and 3 both require s cycles for operand loading plus 1 extra cycle for the carry propagation into the MSW for task 1 (u s at line 5 in Algo. 1) and task 3 (t s−1 at line 10). It introduces a "bubble" in the DSP slices pipeline at the transition between 2 consecutive MMMs in 2 LMs as illustrated on top of Fig. 4 .
We propose a modification in the control to remove this "bubble" as illustrated on the bottom of Fig. 4 . Our modification targets the carry propagation in lines 5 and 10 in Algo. 1. Modified CIOS algorithm is described in Algo. 2, and the new control now produces:
• At the end of task 1 (line 4 in Algo. 2):
• At the end of task 2 (line 5 in Algo. 2):
• At the end of task 3 (line 7 in Algo. 2):
Where all variables with (p) as exponent are the values computed in the previous independent MMM. 
Algorithm 2: Proposed CIOS algorithm without final subtraction (based on [14] and [13] ) modified for latency optimization. Value t In our optimized Algo. 2, the result t s−1 of the first LM at iteration j = 4 and the result t −1 of the second LM at iteration j = 0 share the same w -bit word. This optimization is possible as the LSW t −1 of the result is always zero and is discarded in MMM. We use this word during each outer loop iteration to store intermediate values for the next LM. In the following, (c) and (n) denote respectively the values computed in the current and next LM.
Below, we show that sharing one w -bit word between successive LMs to store two different intermediate values throughout intermediate computations does not modify the expected results. We need to verify that t (n)
thanks to the properties of the MMM algorithm. We also verify that overlapping the MSW t s−1 of the current LM with the LSW t −1 of the next LM does not create "hidden" carries between results of successive independent MMMs.
In Algo. 2, carries d and c are propagated between successive LMs. As in CIOS Algo. 1, we define u s = d after the last iteration of the first inner loop (lines 2 to 4 in Algo. 2), and t s−1 = u s + c after the last iteration of the second inner loop (lines 6 and 7 in Algo. 2). To compute T i in the current LM, we need the value t (n) −1 , computed at iteration j = 0 of the second inner loop in the next LM:
Due to the propagation of carry words u s between successive LMs, we have u (n)
However, the "reduction quotients" q i are computed before the propagation of the u s carry words, and then
Putting together equations (1), (2) and (3), we obtain:
Due to Montgomery algorithm, the LSW is equal to 0:
Then, from equation (4), the value t
s is computed as:
From properties of MMM algorithm, we have:
Then the value of u (c) s is bounded by
Similarly, the value of c is
From bounds in Eq. (5) we have:
We now use the bounds for u (c) s and c (c) in equations (6) and (7) to bound the value of t 
We then have:
and the result of MMM is
Results for independent MMMs in successive LMs then do not overlap, and are the same as in original algorithm.
This optimization removes the "bubble" but requires a modified architecture and control to deal with the overlap indicated at line 8 of Algo. 2 and illustrated in Fig. 4 . Fortunately, it leads to a more regular control without any measurable area overhead. The latency reduction impacts the FPMM pipeline configuration. We now have: σ = α/s cycles instead of α/(s + 1) .
G. Optimization of the DSP slices configuration
All the DSP slices in [2] are configured with a 3-stage pipeline as illustrated in Fig. 2 . However, the first DSP slice in task 1 (top left in Fig. 2) does not use the hardwired adder.
We provide two FPMM variants depending on the configuration of the first DSP slice:
• Fast version with a 3-stage pipeline to reach a higher frequency (only this version was available in [2] ); • Small version with only 2-stage pipeline. The total latency decreases by one cycle, but the frequency in the DSP slice also drops a little. This optimization reduces the delay for operands loading in the 3 next DSP slices and allows us to remove 3 w -bit registers. Both variants have some interest depending on the parameters and the target FPGA (see examples in Sec. V).
H. Flexible P definition at run time
Unlike most of MMM literature solutions, our new FPMM allows us to change P at run time. The parameters n, m, w , and s are fixed at design time, but P and P ' can be defined and loaded at run time in the configured FPGA.
Our new FPMM has now two running modes:
• Setup: P and P ' are loaded and all LMs are reset;
• Run: MMMs are performed as described above for the current P . FPMM in "run" mode reads 3 w -bit words from P at every clock cycle in task 3. This would require at least 2 w -bit dual-port BRAMs (with a very few words). Storing P in the operands BRAMs is not possible since they are used at every cycle for the LMs. Only w bits are required to store P '.
We chose to store P in w shift registers of depth s bits (each one uses only one LUT since s is small). For P ', we use a single w -bit register (in regular flip-flops).
In "setup" mode, P and P ' are loaded into dedicated registers (respectively using A and B FPMM inputs). This only requires s cycles and a few additional LUTs and flipflops.
Simul.
Sage
CAD Tools
FPGA
Final Report inputs Synth. Implem. 
I. Validation
At the arithmetic level, the only difference between our FPMM and the original CIOS without final addition is the latency reduction presented and proved in Sec. III-F. At the architecture and implementation levels, we performed very intensive simulations for the validation of our operators (see Sec. IV). We used predefined operands and millions of random ones with success.
IV. VHDL GENERATOR
We developed a tool dedicated to the automatic generation of FPMMs for many parameter combinations. Our tool, available on Unix platforms as open source [3] , is based on Bash scripts and Python programs.
The generator input is a text file with the complete FPMM specification: width m of operands, internal decomposition parameters (s, w ), number σ of LMs, target FPGA, operand storage in BRAMs or DRAMs, enable/disable latency reduction optimization (see III-F), fast or small version (see III-G). After verification of the specification consistency, the generator produces a set of VHDL files ready for implementation of the selected FPMM on the target FPGA. The generator also reports some computed properties such as τ and λ.
In order to help design space exploration, we also provide in [3] scripts for synthesis, implementation and validation steps. The complete flow, illustrated in Fig. 5 (with steps in numbered circles), currently supports the following Xilinx tools: ISE 14.7, ISim simulator and SmartXplorer.
Steps 1 and 2 respectively correspond to the user specification and VHDL generation.
FPGA synthesis and implementation are performed in step 3. SmartXplorer selects the best implementation after many runs of the place and route tool (set to 100 by default).
Step 4 validates the mathematical and functional behavior of the produced FPMM through very intensive simulations (with millions of random and selected operands). The theoretical expected results are computed by the internal math library from the Sage open source mathematical software available at the URL http://www.sagemath.org/. This library uses state of the art software algorithms to compute the modular multiplication. In order to intensively test the generated operators, our tool also generates VHDL codes for simulating the obtained FPMMs and automatically compare their results to the mathematical reference from Sage at bit level and cycle level.
Finally, the last step produces a final report with the main implementation (time and area) and validation results.
Users can easily adapt our generator and flow for their CAD tools and target FPGAs. 
Report results in Database
Algorithm 3: Exploration of FPMM parameters for a given maximal size n of primes P . Λ(k) is the total latency for k MMMs in FPMM (in cc).
In Algo. 3, we describe the exploration of FPMM parameters for a given size n of prime P and a given configuration of the DSP slices (i.e. fast or small, see Sec. III-G). In this algorithm, ε is the delay in clock cycles between the loading of the operands words a 0 and b 0 and computing the first u 0 at end of task 1 in one LM. Delay ε is composed by:
• 2 cc to write a 0 and b 0 to operands memories; • 2 cc to read a 0 and b 0 from operands memories to Task 1; • 3/2 cc in first DSP in Task 1, in fast/small version resp.; 
B. FPGA specific optimizations of FPMM
We implemented many FPMMs on Virtex-4, 5, 7 and Spartan-6 from Xilinx: denoted V4, V5, V7, S6 for XC4VLX100, XC5VLX110T, XC7VX690T and XC6SLX75. These FPGAs embed different types of DSP slices, listed in Table I . Reported maximum frequencies come from Xilinx official documentation: [33] for V4; [34] for V5; [35] for S6; and [36] for V7.
Internal pipelines of DSP slices in our FPMMs are optimized for each FPGA in order to reach the highest frequency allowed by the hardware. The best configurations of DSP slices for each FPMM implemented on a specific FPGA are selected after implementation of various possible internal pipelines in each of these DSP slices.
C. Discussions and Comparisons
Below we report implementation results for some 128 and 256 bits multipliers (more complete results are available in our generator website [3] ). For area metrics, we report the number of DSP slices, logic slices, LUTs, flip-flops (FFs) and BRAMs. For timing aspects, we report the latency λ for one MMM (in cycles), the clock frequency and the computation time for 8 independent MMMs to illustrate internal parallelism benefits (based on the HECC application from [37] ). FPMM specifications are abbreviated by 1 letter, 2 digits and 1 letter: i) F/S for fast/small version; ii) σ in {3, 4} or {2, 3, 4} resp. for n = 128 or 256 bits; iii) s or s +1 cycles for the delay between 2 LMs with or without latency reduction; iv) B/D for BRAM/DRAM operands storage. For instance, F45B means a fast version with σ = 4 LMs, s + 1 = 5 cycles between LMs (i.e. no latency reduction) and BRAMs. Figure 6 presents area/time trade-offs for all 128-bit FPMMs on V4 and V7. Remember that all FPMMs use w = 34 bits and require 9 DSP slices. On V7, the smallest FPMM is S44B while F44B and F44D are the fastest ones with and without BRAMs. On V4, the smallest one is still S44B, the fastest one is F44B but now F44D is not anymore on the Pareto front. Without our generator, it would be difficult to identify the best specification for one application. When exploring FPMMs for different FPGA families, it is also very difficult to predict the impact of some parameters. On a recent V7, the fastest solution uses DRAMs while on an older V4, it uses BRAMs. On V7 (based on LUT-6), the area variations among all FPMMs specifications are only 26 % but 116 % on V4 (based on LUT-4). Table II reports a selection of the most interesting FPMMs on more FPGAs for both 128 and 256-bit finite fields. Bold values indicate the best trade-off metrics and BRAM capacity is denoted for 9Kb and ⊕ for 18Kb.
Our generator helps cryptographic applications designers to explore many trade-offs and select the best FPMM for each FPGA and optimization target (area vs. speed). Table III reports results for recent efficient multipliers, with generic primes, from the state of the art (see Sec. II): MA16 we reimplemented from [19] (on all FPGAs), MR8/16 from [26] (on A7 for Artix-7), AM32/64 multipliers from [21] (on V7), and MO64 multiplier from [20] (on V5). We also report the recent, very small, multipliers MAS1/2 from [27] on IGLOO 2 (I2) but we do not have this FPGA for comparison.
Comparing the raw results from Tables II and III is not easy. Then, we illustrate the main differences between our FPMMs and state of the art multipliers using figures. Figure 7 compares the computation time for several independent 128-bit multiplications. For only one or two MMMs, MA16 is more efficient than FPMM. But for more operations, FPMM is always faster. Our multipliers lead to a stair-shaped evolution since before reaching σ MMMs in one FPMM, launching a new multiplication is almost free: it only adds s extra cycles. For 8 MMMs and 128-bit generic primes on V7, our best FPMM (F44D) is 2 times faster than the best state of the art multiplier (MA16). F44D is also much smaller than MA16: 9 DSP slices instead of 21 and 600 LUTs instead of 1182. Figure 8 depicts the schedule for 8 independent MMMs in various 128-bit multipliers. On the right, we report the number In Table IV we present the achievable throughput (MMM · s −1 ) per area ratio (TPAR) as a global efficiency metric for state-of-the-art multipliers and some of our most efficient FPMMs. For instance, respectively 129k and 37k MMMs can be computed per second per logic slice for our FPMM and for MA16 from [19] on Virtex-7 and 128-bit operands. Figure 9 illustrates TPAR for MA16 from [19] and our best trade-offs on Virtex-7 and 128-bit operands. In this figure, the further from center means the higher TPAR, and thus the best hardware efficiency. Our finely-pipelined solution clearly leads to arithmetic units with a better utilization of the hardware resources without "bubbles" in the pipeline for simple and regular algorithms such as the CIOS. For 256 bits, only the MA16 multiplier is at most 1.5 times faster but it requires 4.1 times more DSP slices and 3.9 times more LUTs. All other multipliers are both slower and larger.
D. Applications to HECC
In [37] , we used our first 128-bit FPMMs from [2] to design cryptoprocessors for HECC over GF(P ). Below, we present new results for these HECC cryptoprocessors where we use our new flexible FPMMs. Table V provides Architecture A2 corresponds to a small architecture with only 1 adder/subtractor unit, 1 FPMM and 1 data memory. A3 corresponds to a parallel architecture with 2 adder/subtractor units, 2 FPMMs and 1 data memory. Finally, A4 is a clustered architecture, with 2 adder/subtractor units, 2 FPMMs and 2 data memories (see [37] for schematics). In Table V , w is the width of the interconnect. More details on these architectures can be found in [37] .
Thanks to our FPMMs, our HECC cryptoprocessors reach high frequencies with reduced area compared to the best equivalent state-of-the-art ECC cryptoprocessors. For example, compared to the very efficient ECC solution for generic primes from [19] , our smallest cryptoprocessors are at least 2 times smaller for the same computation time.
Compared to our first HECC cryptoprocessors published in [37] , our new architectures based on the new F44B are both smaller and faster.
VI. CONCLUSION AND FUTURE PROSPECTS
We propose finely-pipelined modular multipliers (FPMMs) for FPGA implementations with DSP slices. FPMM internal control, inspired from hyper-threading, allows us to more efficiently fill the DSP slices pipeline. Our first FPMMs presented in [2] were limited to 128 bits and led to larger and slower circuits. In this paper, we propose more advanced FPMMs and a tool for generating them for various sizes and FPGAs, leading to both faster and smaller circuits. For instance, we get 2 to 3 times smaller designs for the same speed compared to the state of the art [19] . Our generator is available as open source at [3] (with some VHDL codes and implementation results for many parameters).
In the future, we plan to extend our tool to support other FPGA vendors and rectangular DSP slices (e.g. 25 × 18 bits). For other types of applications, we also plan to investigate other field sizes (e.g. n < 100 or n > 400 bits) with other w values. Finally, we plan to study FPMM architectures with protections against side channel and fault injection attacks.
PLACE PHOTO HERE
Arnaud Tisserand (PhD'97. M'00, SM'06) is senior researcher at CNRS (French National Center for Scientific Research) in computer science in Lab-STICC laboratory. His research interests include computer arithmetic, computer architecture, digital security, VLSI and FPGA design, design automation, low-power design and applications in applied cryptography, scientific computing, digital signal processing. He is senior member of the IEEE (SSCS, CAS).
