Abstract. Elliptic Curve Cryptosystems (ECC) have gained increasing acceptance in practice because their operands have a significantly smaller bit size than those of other public-key cryptosystems. Since their computational complexity is often lower than in the case of RSA or discrete logarithm schemes, ECC are often chosen for high-performance public-key applications. However, despite a wealth of research regarding high-speed software and high-speed FPGA implementation of ECC since the mid 1990s, providing truly high-performance ECC on readily available (i.e., non-ASIC) platforms remains an open challenge. This holds especially for ECC over prime fields, which are often preferred over binary fields due to standards in Europe and the US.
Introduction
With the explosive growth of Internet-based applications like e-commerce, peer-to-peer networks and distributed gaming, as well as embedded ones (ranging from mobile devices and set-top boxes to automotive systems), the demand for security in such systems has also grown dramatically. In these applications, asymmetric cryptography is used to achieve a large variety of security goals. However, asymmetric cryptographic algorithms are extremely arithmetic-intensive since their security assumptions rely on computational problems which are considered to be hard in combination with parameters of significant bit sizes.
In 1985, Neal Koblitz and Victor Miller independently proposed [20, 17] the use of Elliptic Curve Cryptography, which provides security similar to that of classical cryptosystems but with smaller keys. This benefit allows for greater efficiency when using ECC (160-256 bit) compared to RSA or discrete logarithm schemes over finite fields (1024-4096 bit) while providing an equivalent level of security [18]. Due to this, ECC has become the most promising candidate for many new applications, especially in the embedded domain, which is also reflected by several standards by IEEE, ANSI and SECG [15, 1, 5, 6].
In addition to many new "lightweight" applications (e.g., digital signatures on RFID-like devices), there are also many new applications which call for high-performance asymmetric primitives. Even though very fast public-key algorithms can be provided for PC and server applications by accelerator cards equipped with ASICs, providing very high-speed solutions in embedded devices is still a major challenge. Somewhat surprisingly, there appear to be extremely few, if any, commercially available ASICs or chip sets that provide high-speed ECC and are readily available for integration in general embedded systems. A potential alternative is provided by Field Programmable Gate Arrays (FPGAs). Over the last decade, FPGAs have evolved into a powerful alternative to classical ASIC circuits. In addition, FPGAs provide the advantage of dynamic and flexible circuit reconfigurability, allowing for rapid prototyping at low development cost. However, despite a wealth of research regarding high-speed FPGA (and high-speed software) implementation of ECC since the mid 1990s, providing truly high-performance ECC (i.e., less than 100 μs per point multiplication) on readily available platforms remains an open challenge. This holds especially for ECC over prime fields, which are often preferred over binary fields due to standards in Europe and the US, and a somewhat clearer patent situation.
In this work, we propose a novel hardware architecture based on reconfigurable FPGAs supporting ECC over prime fields GF(p) and offering the highest single-chip performance reported in the literature up to now. Known ECC implementations for reconfigurable logic usually implement the computationally expensive low-level arithmetic in configurable logic elements, allowing for greatest flexibility but offering only moderate performance. Some implementations have attempted to address this problem by using dedicated arithmetic hardware in the reconfigurable device for specific parts of the computations, like built-in 18x18 multipliers. However, other components of the circuitry for field addition, subtraction and inversion have still been implemented in the FPGA's fabric, which usually leads to a significant decrease in performance.
The central idea of this contribution is to relocate the arithmetic-intensive operations of ECC over prime fields entirely into dedicated hardcore units on the FPGA that are actually reserved for use in Digital Signal Processing (DSP) filter applications. These DSP accelerating functions are built-in components in the static logic of modern FPGA devices, capable of performing integer multiplication, addition and subtraction as well as a multiply-accumulate operation.
Previous Work
We briefly summarize previously published results of relevance to this contribution. There is a wealth of publications addressing ECC hardware architectures, and a good overview can be found in [8]. In the case of high-speed architectures for ECC, most implementations primarily address elliptic curves over binary fields $GF(2^m)$ since the arithmetic is more hardware-friendly [22, 10]. Our work, however, focuses on the prime field GF(p). The first implementations for ECC over prime fields GF(p) were proposed by [23, 24], demonstrating ECC processors built completely in reconfigurable logic. The contribution by [19] proposes a high-speed ECC crypto core for arbitrary moduli with up to 256 bit length, designed on a large number of built-in multiplier blocks of FPGA devices, providing a significant speedup for modular multiplications. However, other field operations have been implemented in the FPGA fabric, resulting in a very large design (15,755 slices and 256 multiplier blocks) on a large Xilinx XC2VP125 device. The architecture presented in [7] was designed to achieve a better trade-off between performance and resource consumption. According to the contribution, an area consumption of only 1,854 slices and a maximum clock speed of 40 MHz can be achieved on a Xilinx Virtex-2 XC2V2000 FPGA for a parameter bit length of 160 bit.
Our approach to implementing an FPGA-based ECC engine was to shift all field operations into the integrated DSP building blocks available on modern FPGAs. We show that this approach leads to an extremely high throughput. Furthermore, our strategy frees most configurable logic elements on the FPGA for other applications and requires less power compared to a conventional design. To the best of our knowledge, this architecture offers the fastest performance for ECC computations over prime fields with up to 256 bit security in reconfigurable logic.
Mathematical Background
In the following, we briefly introduce the mathematical background relevant for this work. We start with a short review of Elliptic Curve Cryptosystems (ECC). Please note that only ECC over prime fields GF(p) will be the subject of this work, since binary extension fields $GF(2^m)$ require binary arithmetic which is not (yet) natively supported by DSP blocks.
Elliptic Curve Cryptography
Let p be a prime with p > 3 and $F_p = GF(p)$ the Galois field over p. Given the Weierstrass equation of an elliptic curve

$$E : y^2 = x^3 + ax + b$$

with a, b ∈ GF(p) and $4a^3 + 27b^2 \neq 0$, we can compute tuples (x, y) satisfying this equation, which are considered as points $P_i \in E$ on this elliptic curve E. Based on a group of points defined over this curve, ECC arithmetic defines the addition R = P + Q of two points P, Q using the tangent-and-chord rule as the primary group operation. This group operation distinguishes the cases P = Q (point doubling) and P ≠ Q (point addition). Furthermore, formulas for these operations vary for affine and projective coordinate representations. Since affine coordinates require the availability of fast modular inversion, we will focus on projective point representation to avoid the implementation of a costly inversion circuit. Given two points $P_1, P_2$ with $P_i = (X_i, Y_i, Z_i)$ and $P_1 \neq P_2$, the sum $P_3 = P_1 + P_2$ is defined by

$$A = Y_2 Z_1 - Y_1 Z_2, \quad B = X_2 Z_1 - X_1 Z_2, \quad C = A^2 Z_1 Z_2 - B^3 - 2 B^2 X_1 Z_2,$$
$$X_3 = BC, \quad Y_3 = A(B^2 X_1 Z_2 - C) - B^3 Y_1 Z_2, \quad Z_3 = B^3 Z_1 Z_2,$$
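where A, B, C are auxiliary variables and $P_3 = (X_3, Y_3, Z_3)$ is the resulting point in projective coordinates. Similarly, for $P_1 = P_2$ the point doubling $P_3 = 2P_1$ is defined by

$$A = aZ_1^2 + 3X_1^2, \quad B = Y_1 Z_1, \quad C = X_1 Y_1 B, \quad D = A^2 - 8C,$$
$$X_3 = 2BD, \quad Y_3 = A(4C - D) - 8 Y_1^2 B^2, \quad Z_3 = 8B^3.$$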
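As a sanity check on projective group-law formulas of this kind, the following minimal sketch implements the standard homogeneous projective addition and doubling (with x = X/Z, y = Y/Z) and compares them against plain affine arithmetic. The toy curve y^2 = x^3 + 2x + 3 over GF(97), the sample points and all function names are assumptions chosen for illustration; they are not taken from the implementation described in this paper.

```python
# Toy parameters (illustrative only): y^2 = x^3 + a*x + b over GF(p).
p, a, b = 97, 2, 3

def inv(x):
    """Modular inverse via Fermat's little theorem (used for checking only)."""
    return pow(x % p, p - 2, p)

def proj_add(P1, P2):
    """Homogeneous projective addition for P1 != +-P2; no field inversion."""
    X1, Y1, Z1 = P1
    X2, Y2, Z2 = P2
    A = (Y2 * Z1 - Y1 * Z2) % p
    B = (X2 * Z1 - X1 * Z2) % p
    C = (A * A * Z1 * Z2 - B**3 - 2 * B * B * X1 * Z2) % p
    return (B * C % p,
            (A * (B * B * X1 * Z2 - C) - B**3 * Y1 * Z2) % p,
            B**3 * Z1 * Z2 % p)

def proj_double(P1):
    """Homogeneous projective doubling; no field inversion."""
    X, Y, Z = P1
    A = (a * Z * Z + 3 * X * X) % p
    B = Y * Z % p
    C = X * Y * B % p
    D = (A * A - 8 * C) % p
    return (2 * B * D % p,
            (A * (4 * C - D) - 8 * Y * Y * B * B) % p,
            8 * B**3 % p)

def to_affine(P):
    """Convert (X, Y, Z) to (x, y); this is the only step needing an inversion."""
    X, Y, Z = P
    zi = inv(Z)
    return X * zi % p, Y * zi % p

def affine_add(P, Q):
    """Reference chord-and-tangent rule in affine coordinates (no infinity)."""
    (x1, y1), (x2, y2) = P, Q
    if P == Q:
        lam = (3 * x1 * x1 + a) * inv(2 * y1) % p
    else:
        lam = (y2 - y1) * inv(x2 - x1) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

P, Q = (3, 6), (0, 10)          # both satisfy the toy curve equation
print(to_affine(proj_double((3, 6, 1))), affine_add(P, P))
print(to_affine(proj_add((3, 6, 1), (0, 10, 1))), affine_add(P, Q))
```

The projective routines use only multiplications, additions and subtractions modulo p, which is the property that makes such coordinates attractive when no fast inverter is available.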
Most ECC-based cryptosystems rely on the Elliptic Curve Discrete Logarithm Problem (ECDLP) and thus employ point multiplication k·P as a cryptographic primitive, i.e., the k-fold repeated addition of a base point P. More precisely, the ECDLP is the fundamental cryptographic problem used in protocols and crypto schemes like the Elliptic Curve Diffie-Hellman key exchange [9], the ElGamal encryption scheme [12] and the Elliptic Curve Digital Signature Algorithm (ECDSA) [1].
Standardized General Mersenne Primes
The arithmetic for ECC point multiplication is based on modular computations over a prime field GF(p). These computations always include a subsequent step to reduce the result to the domain of the underlying field. Since the reduction is very costly for general primes due to the demand for a multi-precision division, special primes have been proposed by Solinas [26], which have finally been standardized in [21]. These primes provide efficient reduction algorithms based only on a sequence of multi-precision additions and subtractions and eliminate the need for the costly division. Special primes P-l with bit lengths l ∈ {192, 224, 256, 384, 521} are part of the standard. We believe, however, that the primes P-224 and P-256 are the most relevant bit sizes for implementations in the coming decades.
According to Algorithm 1, the modular reduction for P-224 can be performed with two 224-bit subtractions and two 224-bit additions. However, these four consecutive operations can lead to a potential overflow or underflow in step 1. With $Z = z_1 + z_2 + z_3 - z_4 - z_5$, we can determine the bounds $-2p < Z < 3p$, which limits the final correction to at most two additions or subtractions of p to compute the correctly bounded $c \bmod p_{224}$.
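Since Algorithm 1 is not reproduced in this text, the following minimal sketch shows the well-known word-based reduction for P-224 = 2^224 - 2^96 + 1 as found in standard references, using 32-bit words c_13, ..., c_0 of the product. The helper names and the simple correction loops are assumptions for illustration and do not mirror the hardware pipeline.

```python
P224 = 2**224 - 2**96 + 1

def words(c, n):
    """Split c into n little-endian 32-bit words c[0], ..., c[n-1]."""
    return [(c >> (32 * i)) & 0xFFFFFFFF for i in range(n)]

def join(ws):
    """Join 32-bit words given most-significant first, e.g. (c6, ..., c0)."""
    v = 0
    for w in ws:
        v = (v << 32) | w
    return v

def reduce_p224(c):
    """Reduce 0 <= c < P224**2 with additions and subtractions only."""
    c = words(c, 14)
    z1 = join([c[6], c[5], c[4], c[3], c[2], c[1], c[0]])
    z2 = join([c[10], c[9], c[8], c[7], 0, 0, 0])
    z3 = join([0, c[13], c[12], c[11], 0, 0, 0])
    z4 = join([c[13], c[12], c[11], c[10], c[9], c[8], c[7]])
    z5 = join([0, 0, 0, 0, c[13], c[12], c[11]])
    z = z1 + z2 + z3 - z4 - z5      # step 1: -2p < z < 3p
    steps = 0
    while z < 0:                    # step 2: at most two corrections
        z += P224
        steps += 1
    while z >= P224:
        z -= P224
        steps += 1
    assert steps <= 2
    return z

print(reduce_p224((P224 - 1) ** 2))   # (p - 1)^2 mod p
```

The internal `steps` check reflects the bound on Z stated above: no input below p^2 should ever need more than two correction steps.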
Algorithm 2 presents the modular reduction for P-256 requiring two doublings, four 256-bit subtractions and four 256-bit additions. Based on the computation $Z = z_1 + 2z_2 + 2z_3 + z_4 + z_5 - z_6 - z_7 - z_8 - z_9$, the range of the result to be corrected is $-4p < Z < 5p$.
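Algorithm 2 is likewise not reproduced here. As a hedged illustration, the sketch below uses the word selection commonly given in standard references for P-256 = 2^256 - 2^224 + 2^192 + 2^96 - 1, with 32-bit words c_15, ..., c_0. The helper `z` and the correction loops are assumptions for illustration only.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1

def reduce_p256(c):
    """Reduce 0 <= c < P256**2 using doublings, additions and subtractions."""
    w = [(c >> (32 * i)) & 0xFFFFFFFF for i in range(16)]

    def z(*idx):
        """Build a 256-bit value from word indices, most significant first;
        None denotes an all-zero word."""
        v = 0
        for i in idx:
            v = (v << 32) | (0 if i is None else w[i])
        return v

    N = None
    z1 = z(7, 6, 5, 4, 3, 2, 1, 0)
    z2 = z(15, 14, 13, 12, 11, N, N, N)
    z3 = z(N, 15, 14, 13, 12, N, N, N)
    z4 = z(15, 14, N, N, N, 10, 9, 8)
    z5 = z(8, 13, 15, 14, 13, 11, 10, 9)
    z6 = z(10, 8, N, N, N, 13, 12, 11)
    z7 = z(11, 9, N, N, 15, 14, 13, 12)
    z8 = z(12, N, 10, 9, 8, 15, 14, 13)
    z9 = z(13, N, 11, 10, 9, N, 15, 14)
    total = z1 + 2 * z2 + 2 * z3 + z4 + z5 - z6 - z7 - z8 - z9
    while total < 0:            # correction: add multiples of p
        total += P256
    while total >= P256:        # correction: subtract multiples of p
        total -= P256
    return total

print(hex(reduce_p256((P256 - 2) * (P256 - 3))))   # equals 6 mod p
```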
An Efficient ECC Architecture Using DSP Cores
In this section we demonstrate how to implement ECC over NIST primes P-224 and P-256 using available DSP blocks of Xilinx Virtex-4 FPGAs.
DSP-Accelerator Blocks in FPGAs
Modern FPGA devices like Xilinx Virtex-4 and Virtex-5 as well as Altera Stratix FPGAs have been equipped with dedicated arithmetic hardcore extensions to accelerate, in particular, digital signal processing applications. These function blocks (DSP blocks) can be used to build a more efficient implementation in terms of performance and, at the same time, reduce the demand for logic elements. In general, DSP blocks of FPGAs can be programmed to perform basic arithmetic functions, especially multiplication, addition and subtraction of (un)signed integers. A common DSP component comprises an $l_M$-bit signed integer multiplier coupled with an $l_A$-bit signed adder, where $l_A > l_M$ holds. To enable maximum performance, the multiplier and adder block can be augmented with pipeline registers to reduce signal propagation delays between components. Using different data paths, DSP blocks can operate on external inputs A, B, C as well as on feedback values from accumulation or even results $P_{j \pm 1}$ from a neighboring DSP block. Figure 1 shows the generic DSP block used in recent Xilinx FPGA devices [29].
ECC Engine Design Criteria
When using DSP blocks to develop a high-speed ECC design, there are several criteria which should be met to exploit their full performance. Note that the following aspects have been designed to target the requirements of Xilinx Virtex-4 FPGAs:
1. Build DSP cascades: Neighboring DSP blocks can be cascaded to widen or extend their atomic operand width (e.g., from 18 bit to 256 bit).
2. Use DSP routing paths: DSPs have been provided with inner routing paths connecting two adjacent blocks. It is advantageous in terms of performance to use these paths as frequently as possible instead of using the FPGA's general switching matrix for connecting logic blocks.
3. Consider DSP columns: Within a Xilinx FPGA, DSPs are aligned in columns, i.e., routing paths between DSPs within the same column are efficient while a switch between columns can lead to degraded performance. Hence, DSP cascades should not exceed the column width (typically 32/48/64 DSPs per column).
4. Use DSP pipeline registers: DSP blocks feature pipeline stages which should be used to achieve the maximum clock frequency supported by the device (up to 500 MHz).
5. Use different clock domains: Optimally, DSP blocks can be operated at maximum device frequency. This is not necessarily true for the remainder of the design, so separate clock domains should be introduced (e.g., by halving the clock frequency for control signals) to address the critical paths in each domain individually.
Arithmetic Units
According to the EC arithmetic introduced in Section 3.1, an ECC engine over GF(p) based on projective coordinates requires functionality for modular addition, subtraction and multiplication. Since modular addition and subtraction are very similar, both operations are combined. In the following description we will assume a Virtex-4 FPGA as reference device and corresponding DSP block arithmetic with word sizes $l_A = 32$ and $l_M = 16$ for unsigned addition and multiplication, respectively. Note that native support by the DSP blocks on a Virtex-4 device is available for up to 48-bit signed addition and 18-bit signed multiplication.
Modular Addition/Subtraction. Let A, B ∈ GF(P) be two multi-precision operands with lengths |A|, |B| ≤ l and $l = \lfloor \log_2 P \rfloor + 1$. Modular addition C = A + B mod P and subtraction C = A − B mod P can be efficiently computed according to Algorithm 3:
Algorithm 3. Modular addition and subtraction
Input: A, B, P with 0 ≤ A, B < P; operation flag f ∈ {0, 1} denotes a subtraction when f = 1 and an addition otherwise
Output: C = A ± B mod P
To use DSP blocks, we need to divide the l-bit operands into multiple words, each having a maximum size of $l_A$ bits due to the limited width of the DSP input port. Thus, all inputs A, B and P to the DSP blocks can be represented in the form

$$X = \sum_{i=0}^{n_A-1} x_i \cdot 2^{i \cdot l_A},$$

where $n_A = \lceil l/l_A \rceil$ denotes the number of words of an operand. According to Algorithm 3, we employ two cascaded DSP blocks, one for computing $s_{(0,i)} = a_i \pm (b_i + C_{IN0})$ and a second for $s_{(1,i)} = s_{(0,i)} \mp (p_i + C_{IN1})$.
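The word-serial scheme can be sketched in software as follows. This is a minimal model under stated assumptions: operands are split into 32-bit words, a first pass computes S0 = A ± B, a second pass computes S1 = S0 ∓ P, and the final carry/borrow bits decide which of the two is returned. The function names and the selection logic are illustrative and are not taken from Algorithm 3; the duplicate-carry compensation needed in the actual DSP cascade is not modeled.

```python
L_A = 32                      # adder word size assumed in the text
MASK = (1 << L_A) - 1

def split(x, n):
    """Split x into n little-endian L_A-bit words."""
    return [(x >> (L_A * i)) & MASK for i in range(n)]

def word_addsub(xs, ys, sub):
    """Word-serial x +- y; returns (result words, final carry or borrow bit)."""
    out, c = [], 0
    for x, y in zip(xs, ys):
        s = x - (y + c) if sub else x + (y + c)   # one a +- (b + c_in) step
        out.append(s & MASK)
        c = (s >> L_A) & 1                        # carry out, or borrow if sub
    return out, c

def mod_addsub(A, B, P, f, n):
    """Return (A - B) mod P if f == 1, else (A + B) mod P, for 0 <= A, B < P."""
    a, b, p = split(A, n), split(B, n), split(P, n)
    s0, c0 = word_addsub(a, b, sub=bool(f))       # S0 = A +- B
    s1, c1 = word_addsub(s0, p, sub=not f)        # S1 = S0 -+ P
    if f:
        res = s1 if c0 else s0                    # borrow: A < B, so take S0 + P
    else:
        res = s0 if (c0 == 0 and c1 == 1) else s1 # keep S0 only if S0 < P
    return sum(w << (L_A * i) for i, w in enumerate(res))

P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
print(hex(mod_addsub(5, 7, P256, 1, 8)))          # 5 - 7 mod p = p - 2
```

Computing both candidates unconditionally and selecting afterwards avoids a data-dependent second pass, which matches the idea of a 2-to-1 output selection.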
The resulting values $s_{(0,i)}$ and $s_{(1,i)}$, each of size $|s_{(j,i)}| \le l_A + 1$, are temporarily stored and recombined to $S_0$ and $S_1$ using shift registers (SR). Finally, a 2-to-1 l-bit output multiplexer selects the appropriate value $C = S_i$. Figure 2 presents a schematic overview of a combined modular addition and subtraction based on two DSP blocks. Note that DSP blocks on Virtex-4 FPGAs provide a dedicated carry input $c_{IN}$ but no carry output $c_{OUT}$. In particular, this requires extra logic to compensate for duplicate carry propagation to the second DSP, which is due to the fixed cascaded routing path between the DSP blocks. In this architecture, each carry is considered twice, namely in $s_{0,i+1}$ and $s_{1,i}$, which needs to be corrected. This special carry treatment requires a wait cycle to be introduced, so that one $l_A$-bit word can be processed every two clock cycles. However, this is no restriction for our architecture since we design for parallel addition and multiplication, so that the (shorter) runtime of an addition is completely hidden in the duration of a concurrent multiplication operation.
Modular Multiplication. Some multiplication methods, such as the Karatsuba method, trade multiplications for additions using a divide-and-conquer approach. Due to the higher number of additions, such a strategy is only preferable if the cost of an addition is significantly below that of a multiplication [28]. But even when neglecting any further control overhead introduced by the Karatsuba method, this does not hold for Virtex-4 devices since multiplication is comparably cheap within the DSP blocks. Let A, B ∈ GF(P) be two multi-precision integers with bit length $l \le \lfloor \log_2 P \rfloor + 1$. According to the limited input size $l_M$ of DSP blocks, we now split the values A, B into $n_M = \lceil l/l_M \rceil$ words represented as $X = \sum_{i=0}^{n_M-1} x_i \cdot 2^{i \cdot l_M}$.
For parallel execution on $n_M$ DSP units, we compacted the order of inner product computations as shown in Figure 3. All $n_M$ DSP blocks operate in a loadable Multiply-and-Accumulate (MACC) mode so that intermediate results remain in the corresponding DSP block until an inner product $s_i = \sum_{j=0}^{i} a_j \cdot b_{i-j}$ is fully computed. Note that the $s_i$ returned from the $n_M$ DSP blocks are not aligned and can vary in size up to $|s_i| \le 2l_M + \log_2(n_M) = l_{ACC} = 36$ bits. Thus, all $s_i$ need to be converted to a non-redundant representation to form the final product of words $c_i$ with maximum size $2l_M$ each. Hence, we feed all values into a subsequent accumulator to combine each $s_i$ with the corresponding bits of $s_{i-1}$ and $s_{i+1}$. Considering the special input constraints, timing conventions and carry transitions of DSP blocks, we developed Algorithm 4 to address the accumulation of inner products based on two DSP blocks performing $l_{ACC}$-bit additions. Figure 4 gives a schematic overview of the multiplication circuit returning the full-size product C. This result has to be reduced using the fast NIST prime reduction scheme discussed next.
Modular Reduction. At this point we discuss the subsequent modular reduction of the 2l-bit multiplication result C using the NIST reduction scheme. All fast NIST reduction algorithms rely on a reduction step (1), defined as a series of multi-precision additions and subtractions, followed by a correction step (2) to achieve a final value in the interval [0, . . . , P − 1] (cf. Algorithms 1 and 2). To implement (1), we decided to use one DSP block for each individual addition or subtraction, e.g., for the P-256 reduction we reserved a cascade of 8 DSP blocks. Each DSP performs one addition or subtraction and stores the result in a register whose output is taken as input to the neighboring block (data pipeline).
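For the multiplier, the column-wise inner products and their conversion to a non-redundant result can be sketched as below. This is a simplified software stand-in, assuming 16-bit words and a plain sequential carry fold; it is not the authors' Algorithm 4 (which is not reproduced in this text), and the function names are hypothetical.

```python
L_M = 16                       # multiplier word size assumed in the text
WMASK = (1 << L_M) - 1

def inner_products(A, B, n):
    """Return the 2n-1 column sums s_i = sum_j a_j * b_(i-j) of A * B."""
    a = [(A >> (L_M * i)) & WMASK for i in range(n)]
    b = [(B >> (L_M * i)) & WMASK for i in range(n)]
    return [sum(a[j] * b[i - j]
                for j in range(max(0, i - n + 1), min(i, n - 1) + 1))
            for i in range(2 * n - 1)]

def accumulate(s):
    """Fold the overlapping column sums into one non-redundant integer."""
    c, carry = 0, 0
    for i, si in enumerate(s):
        t = si + carry                 # add the overflow of the previous column
        c |= (t & WMASK) << (L_M * i)  # low L_M bits become result word i
        carry = t >> L_M               # the rest moves on to column i + 1
    return c | (carry << (L_M * len(s)))

A = 2**256 - 1
cols = inner_products(A, A, 16)
print(max(x.bit_length() for x in cols))   # column width for n = 16 words
print(accumulate(cols) == A * A)
```

With n = 16 words of 16 bits, each column sums at most 16 products of 32 bits, which is where the 36-bit accumulator width comes from.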
For the correction step (2), we need to determine in advance the possible overflow or underflow of the result returned by (1) to avoid wait or idle cycles in the pipeline. Hence, we introduced a Look-Ahead Logic (LAL) consisting of a separate DSP block which exclusively computes the expected overflow or underflow. Then, the output of the LAL is used to select a corresponding reduction value from the multiples {0, P, . . . , 5P}, which are stored in a ROM table. The selected ROM value is added to or subtracted from the result of (1), ensuring that the final result is always in {0, . . . , P − 1}. Figure 5 depicts the general structure of the reduction circuit, which is applicable for both primes P-224 and P-256.
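The table-based correction can be illustrated with a short sketch. It assumes a table of the multiples 0, P, ..., 5P and selects the entry by a simple scan; in the hardware the selection comes from the look-ahead logic instead, so the scan is only a software stand-in, and the names `make_rom` and `correct` are hypothetical.

```python
def make_rom(P, kmax=5):
    """Table of the multiples 0, P, 2P, ..., kmax * P."""
    return [k * P for k in range(kmax + 1)]

def correct(Z, rom):
    """Map Z with -5P < Z < 6P into [0, P - 1] by one table add or subtract."""
    if Z < 0:
        # underflow: add the smallest multiple that makes the value non-negative
        k = next(k for k in range(1, len(rom)) if Z + rom[k] >= 0)
        return Z + rom[k]
    # overflow (or already in range): subtract the largest multiple <= Z
    k = max(k for k in range(len(rom)) if rom[k] <= Z)
    return Z - rom[k]

P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
rom = make_rom(P256)
print(correct(-3 * P256 - 1, rom) == P256 - 1)
```

A single add or subtract suffices because the table already holds every multiple that can occur within the bounds of the reduction step.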
ECC Core Architecture
With the basic field operations for l-bit computations at hand supporting NIST primes P-224 and P-256, we have combined a modular multiplier and a modular subtraction/addition component with dual-port RAM modules (BRAM) and a state machine to build an ECC core. We have implemented an asymmetric datapath supporting two different operand lengths: the first operand provides the full l bits of data, whereas the second operand is limited to 32-bit words so that several words need to be transferred serially to generate the full l-bit input. This approach allows for direct memory accesses of our serial-to-parallel multiplier architecture. Note further that we introduced different clock domains for the core arithmetic based on the DSP blocks and the state machines for upper layers (running at half clock frequency only). An overview of the entire ECC core is shown in Figure 6. We implemented ECC group operations based on projective Chudnovsky coordinates since the implementation should support the computation of a point multiplication k · P as well as a corresponding linear combination k · P + r · Q based on a fixed base point P ∈ E, k, r ∈ {1, . . . , ord(P) − 1} and Q ∈ ⟨P⟩. Both operations can be considered basic ECC primitives, e.g., used for ECDSA signature generation and verification [1]. The computation of k · P + r · Q can make use of Shamir's trick to efficiently compute several point products simultaneously [12]. For this first implementation of the point multiplication and for the sake of simplicity, we used a standard double-and-add (binary method) algorithm [14], but more efficient windowing methods [2] can also be implemented without significantly increasing the resource consumption.
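The two point-multiplication strategies named above can be sketched on a toy curve. The example below is a minimal illustration, assuming affine coordinates on y^2 = x^3 + 2x + 3 over GF(97) with `None` as the point at infinity; the ECC core itself works in projective coordinates over NIST primes, and all names here are hypothetical.

```python
p, a = 97, 2     # toy curve y^2 = x^3 + 2x + 3 over GF(97), illustration only

def add(P, Q):
    """Affine group law; None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                  # P + (-P)
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, p - 2, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, p - 2, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def double_and_add(k, P):
    """Binary method: scan k from the most significant bit downwards."""
    R = None
    for bit in bin(k)[2:]:
        R = add(R, R)                                # always double
        if bit == '1':
            R = add(R, P)                            # add on a set bit
    return R

def shamir(k, P, r, Q):
    """Shamir's trick: k*P + r*Q with one shared chain of doublings."""
    PQ = add(P, Q)                                   # precomputed once
    R = None
    for i in range(max(k.bit_length(), r.bit_length()) - 1, -1, -1):
        R = add(R, R)
        kb, rb = (k >> i) & 1, (r >> i) & 1
        if kb and rb:
            R = add(R, PQ)
        elif kb:
            R = add(R, P)
        elif rb:
            R = add(R, Q)
    return R

P = (3, 6)
Q = double_and_add(5, P)
print(shamir(11, P, 7, Q) == add(double_and_add(11, P), double_and_add(7, Q)))
```

Shamir's trick needs roughly as many doublings as a single point multiplication, which is why the linear combination k · P + r · Q costs far less than two separate multiplications.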
ECC Core Parallelism
Due to the intensive use of DSP blocks to implement the core functionality of ECC, the resulting implementation requires only a few reconfigurable logic elements on the FPGA. This allows for efficient multiple-core implementations on a single FPGA, improving the overall system throughput by a linear factor n dependent on the number of cores. Note that most other high-performance implementations occupy the full FPGA due to their immense resource consumption, so they cannot easily be instantiated several times.
Based on our synthesis results, the limiting factor of our architecture is the number of available DSP blocks of a specific FPGA device (cf. Section 5).
Implementation
The proposed architecture has been synthesized and implemented for the smallest available Xilinx Virtex-4 device (XC4VFX12-12SF363) and the corresponding results are presented in Subsection 5.1. This FPGA type offers 5,472 slices (12,288 4-input LUTs and flip flops) of reconfigurable logic, 32 DSP blocks and can be operated at a maximum clock frequency of 500 MHz. Furthermore, to demonstrate how many ECC computations can be performed using ECC core parallelism, we take a second device, the large Xilinx Virtex-4 XC4VSX55-12FF1148 providing the maximum number of 512 DSP blocks and 24,576 slices (49,152 4-input LUTs and flip flops) as a reference for a multi-core architecture.
Implementation Results
Based on the Post-Place and Route (PAR) results using Xilinx ISE 9.1, we can present the following performance and area details for ECC cores for primes P-224 and P-256 on the small XC4VFX12 device, as shown in Table 1. Note that the implementation for P-224 has not yet been fully verified in functionality or optimized. The core for P-256, however, is already available for use in real-world products.
Throughput of a Single ECC Core
Given an ECC core with a separate adder/subtracter and multiplier unit, we can perform a field multiplication and field addition simultaneously. By optimizing the execution order of the basic field operations, it is possible to perform all additions/subtractions required for the ECC group operation in parallel to a multiplication. Based on the runtimes of a single field multiplication, we can determine the number of required clock cycles for the operations k · P and k · P + r · Q using the implemented double-and-add algorithm. Moreover, we also give estimates concerning their performance when using a window-based method [2] with a window size w = 4. Note that the specified timing considers signal propagation after complete PAR, excluding the timing constraints from I/O pins since no underlying data communication layer was implemented. Hence, when combined with an I/O protocol of a real-world application, the clock frequency will be slightly lower than specified in Tables 1 and 3.
Multi-core Architecture
Since a single ECC core has moderate resource requirements, it is possible to place multiple instances of the core on a larger FPGA. On a single XC4VSX55 device, we can implement, depending on the underlying prime field, between 16 and 18 ECC cores running in parallel (cf. Table 3). Due to the small number of LUTs and flip flops required for a single core, the number of available DSP blocks (and routing resources) on the FPGA is the limiting factor here.
Comparison
Based on our architecture, we can estimate a throughput of more than 37,000 point multiplications per second on the standardized elliptic curve P-224, which exceeds the throughput of all single-chip hardware implementations known to the authors by far. A detailed comparison with other implementations is presented in Table 4. At this point we would like to point out that the field of highly efficient prime field arithmetic is believed to be dominated by implementations on general purpose microprocessors rather than on FPGAs. Hence, we also compare our hardware implementation against the performance of software solutions on recent microprocessors. Since most performance figures for software implementations are given in cycles rather than absolute times, we assumed for comparing throughputs that, on a modern microprocessor, repeated computations can be performed without interruption simultaneously on all available cores with no further cycles spent, e.g., on scheduling or other administrative tasks. Note that this is indeed a very optimistic assumption, possibly overrating the performance of software implementations with respect to actual applications.
For example, a point multiplication using the highly efficient software implementation by Dan Bernstein based on floating point arithmetic for ECC over P-224 requires 839,000 cycles on an (outdated) Intel Pentium 4 [3] at 1.4 GHz. According to our assumption for cycle count interpretation, this corresponds to about 1670 point multiplications per second.
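The cycle-count interpretation used for this comparison amounts to a simple division, sketched below. The function name and the `cores` parameter are assumptions for illustration; the only inputs taken from the text are the 839,000 cycles and the 1.4 GHz clock.

```python
def ops_per_second(clock_hz, cycles_per_op, cores=1):
    """Idealized throughput: every cycle on every core does useful work."""
    return clock_hz * cores / cycles_per_op

rate = ops_per_second(1.4e9, 839_000)    # Pentium 4 figures quoted above
latency_us = 1e6 / rate                  # time per point multiplication

print(round(rate), round(latency_us))    # about 1669 per second, 599 us each
```

The result of roughly 1669 operations per second agrees with the rounded figure of 1670, and the corresponding latency of about 0.6 ms per operation is well above the 100 μs target named in the introduction.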
Despite the good performance figures on this platform, we prefer to take more recent results, e.g., obtained from ECRYPT's eBATS project. According to the report from March 2007 [11], an Intel Core2 Duo running at 2.13 GHz is able to generate 1868 and 1494 ECDSA signatures per second based on the OpenSSL implementation for P-224 and P-256, respectively. Taking the latest Intel Core2 Quad microprocessors into account, these performance figures might even double. We also compare our work to the very fast software implementation by [13] using an Intel Core2 system at 2.66 GHz. However, in this contribution the special, non-standard Montgomery curve over $F_{2^{255}-19}$ is used instead of a standardized NIST prime. Despite that, for the design based on this curve the authors report the impressive throughput of 6700 point multiplications per second. For a fair comparison with software solutions it should be considered that a single Virtex-4 SX 55 costs about US$ 1,170. Recent microprocessors like the Intel Core2 Duo, however, are available at only about a quarter of that price. With this in mind, we might not be able to beat all software implementations in terms of the cost-performance ratio, but we would still like to point out that our FPGA-based design, as the fastest reported hardware implementation so far, definitely closes the performance gap between software and hardware implementations for ECC over prime fields. Furthermore, we would like to emphasize again that all software-related performance figures are based on very optimistic assumptions.
Conclusion
We presented a novel ECC implementation for fields over NIST primes P-224 and P-256. Due to the exhaustive utilization of DSP blocks, which are contained as hardcores in modern FPGA devices, we are able to run the critical components computing low-level integer arithmetic operations at nearly the maximum device frequency. Furthermore, considering a multi-core architecture on a Virtex-4 XC4VSX55 FPGA, we can achieve a throughput of more than 24,000 and 37,000 point multiplications per second for P-256 and P-224, respectively, which significantly exceeds the performance of all other hardware implementations known to the authors and comes close to the cost-performance ratio provided by the fastest available software implementations in the open literature.
