Abstract. In this paper, we answer the question of whether binary-extension-field or prime-field based processors doing multi-precision arithmetic are better in terms of area, speed, power, and energy. This is done by implementing and optimizing two distinct custom-made 16-bit processor designs and comparing our solutions on different abstraction levels: finite-field arithmetic, elliptic-curve operations, and, on protocol level, the Elliptic Curve Digital Signature Algorithm (ECDSA). On the one hand, our F2m based processor outperforms the Fp based processor by 19.7 % in area, 69.6 % in runtime, 15.9 % in power, and 74.4 % in energy when performing a point multiplication. On the other hand, our Fp based processor (11.6 kGE, 41.4 µW, 1,313 kCycles, and 54.3 µJ) improves the state of the art in Fp192 ECC hardware implementations regarding area, power, and energy results. After extending the designs for ECDSA (signature generation and verification), the area and power-consumption advantages of the F2m based processor vanish, but it still is 1.5-2.8 times better in terms of energy and runtime.
Introduction
Elliptic Curve Cryptography (ECC) was introduced in the 1980s and is used nowadays in a variety of different applications. Every application has its own design criteria and raises special requirements for hardware designs. While contactless-powered devices have to meet low-power constraints, battery-powered devices need energy-aware implementations that consume as little energy as possible to increase the lifetime of the battery.
The most fundamental decision concerning future hardware designs is whether to use a binary-extension field or a prime field as basis of the used elliptic curve. Most related work in dedicated hardware designs has been done in implementing ECC over binary fields using full-precision arithmetic. Only a few papers compared binary and prime fields in hardware. Wolkerstorfer [36] and Satoh [32] used full-precision dual-field hardware with bit-serial multipliers. We however are interested in multi-precision designs, where the big integers are split and processed in small words. This design methodology has the advantage that the Central Processing Unit (CPU) can be reused to perform other work (e.g. protocol handling). In this paper we want to answer the following questions:
-What are the advantages and disadvantages of prime and binary-field processors in custom multi-precision hardware?
-How big are the differences when identical design methodologies and elliptic curves with similar security level are used?
-How does the performance of prime and binary-field processors scale in higher-level protocols?
-Does the speed advantage of carry-less operations make up for the additional need for prime-field arithmetic?
In this paper, we answer these questions by presenting two distinct custom 16-bit processors that leverage binary-field operations and prime-field operations and are based on [35] and [34]. Using a metric consisting of area, speed, power, and energy, we not only compare both designs in terms of finite-field operation and ECC point-multiplication performance, but also investigate a higher-level protocol. When performing an ECC point multiplication, the F2m based processor (9.3 kGE, 34.8 µW, 400 kCycles, and 13.9 µJ) is 3.3 times faster, 20 % smaller, uses 16 % less power, and needs 3.9 times less energy compared to the Fp based processor (11.6 kGE, 41.4 µW, 1,313 kCycles, and 54.3 µJ). Nevertheless, our Fp based processor improves the state of the art in area, power, and energy results for prime-field based ECC (doing point multiplication).
We further present two full hardware implementations of the Elliptic Curve Digital Signature Algorithm (ECDSA). They show that the F2m based processor does not outperform the Fp based processor in every category of the metric. The F2m based processor is 4.4-5.5 % larger and needs up to 6.3 % more power, but it is still 2.8 times faster and needs 2.8 times less energy when calculating a signature. The runtime and energy advantage drops to a factor of 1.5 for signature verification.
The paper is organized as follows. Section 2 discusses related work on ECC implementations. Section 3 gives an introduction to elliptic curve cryptography and introduces a metric. Whereas Section 4 gives a comparison, Section 5 thoroughly discusses all implementation results. Conclusions are given in Section 6.
Related Work on ECC-Hardware Implementations
There exist many hardware implementations of elliptic-curve cryptography. In the following, we consider only lightweight implementations that address embedded systems, wireless sensors, and contactless-powered applications. Most of the given implementations are based on either binary-field, prime-field, or dual-field arithmetic. A very tiny ECC processor over binary fields was proposed by Y. K. Lee et al. [25] in 2008. They based their design on a compact architecture of a Modular Arithmetic Logic Unit (MALU) that was first presented by L. Batina et al. [4] in 2006. The processor performs (full-precision) operations in F2^163 and calculates a scalar multiplication in between about 80 000 and 300 000 clock cycles (depending on the digit size of the hardware multiplier). The final architecture needs about 12-20 kGEs of area. Similar results have also been reported by S. Kumar and C. Paar [24], who presented a generic binary-field processor over F2^113 to F2^193. The runtime and area requirements of the proposed processor are similar, needing between 170 000 and 560 000 clock cycles and 10-19 kGEs. D. Hein et al. [15] reported a low-resource co-processor for passive Radio Frequency Identification (RFID) applications. In contrast to the previous work, they applied multi-precision arithmetic over F2^163. Their ECC design needs about 300 000 clock cycles for one scalar multiplication and consumes about 11 kGEs of chip area. In view of power consumption, all described designs need between 8 and 30 µW at 100 kHz and are thus well applicable to the targeted applications.
Prime-field based processors have been reported by, for example, E. Öztürk et al. [31] in 2004. They presented an ECC architecture over the prime field Fp with p = (2^167 + 1)/3. Their design needs 545 440 clock cycles for one scalar multiplication and requires about 30 kGEs of area. Similar results have also been reported by F. Fürbass and J. Wolkerstorfer [11] in 2007. Their Fp192 processor needs 502 000 clock cycles and about 23 kGEs of area. Recently, M. Hutter et al. [16] presented an ECC processor over the same prime field needing about 750 000 clock cycles for a scalar multiplication and about 19 kGEs of area. E. Wenger et al. [34] reduced the area requirements even further to only about 12 kGEs, but their design needs about 1.4 million clock cycles. The power consumption of most of the reported prime-field processors ranges from about 20 to several hundred µW at 100 kHz.
From the given related work, it seems that binary-field processors benefit from a more efficient computation in application-specific hardware implementations. However, a fair comparison is largely infeasible since the authors used different design techniques, synthesis tools, bit/word sizes, and EC parameters. Only a few publications report dual-field processors for ECC and give detailed comparison results. A. Satoh and K. Takano [32] presented a processor over F2m and Fp supporting 160 to 256 bits. They show that the binary-field operations can be performed about six times faster than their prime-field counterparts (1.21 ms vs. 0.19 ms for a 160-bit scalar multiplication). Furthermore, the area requirement of the prime-field controller is 1.47 times larger than that of the binary-field controller (6 606 GEs vs. 4 490 GEs for 8-bit word size and a 160-bit scalar). J. Wolkerstorfer [36] also presented a dual-field processor that supports 190 to 256 bits. One of his findings was that binary-field operations can be performed about 1.58, 1.42, and 1.27 times faster than prime-field operations for 191/192, 233/224, and 283/256 bits, respectively. However, he did not compare the hardware requirements of the two supported field types and reported only the total area requirement of his processor, which is between 24 and 31 kGEs.
In the following, we design both a binary-field and a prime-field based ECC processor in order to compare them in a fair environment. In contrast to existing work, we consider not only scalar multiplication but also evaluate and compare the performance for higher-level protocols such as ECDSA. In the next section, we first give a brief introduction to ECC and define a metric to compare different criteria.
Implementations of Elliptic-Curve Cryptography
Elliptic curves were introduced by Koblitz [22] and Miller [28] in the 1980s and have been thoroughly analyzed by the community throughout the last decades. They are based on the Weierstrass equation, which can be written as

$$y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6 \qquad (1)$$

with a_1, a_2, a_3, a_4, a_6, x, y ∈ K, where K denotes the finite field. A point P = (x, y) is a valid point on the elliptic curve if it fulfills the Weierstrass equation, i.e. Equation (1). The basic operations performed on the elliptic curve are point addition and point doubling. Using those operations, a point multiplication (often referred to as scalar multiplication) Q = k × P can be calculated. The Elliptic Curve Discrete Logarithm Problem (ECDLP) states that finding k is a mathematically hard problem if the points P and Q are given. For a more detailed introduction to elliptic curves and their properties, we refer the reader to [3, 5, 13, 23].
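The group operations described above can be sketched in a few lines of Python. This is a minimal illustration only: it uses the simplified form y^2 = x^3 + ax + b that Equation (1) takes over prime fields of characteristic greater than 3, and the toy parameters (p = 97, a = 2, b = 3), the point (3, 6), and all function names are assumptions chosen for readability, not values from this paper. Affine formulas and plain double-and-add are used for brevity.

```python
# Toy parameters (assumed for illustration; far too small for real use).
P_MOD, A, B = 97, 2, 3

def on_curve(pt):
    """Check y^2 = x^3 + A*x + B (mod P_MOD); None is the point at infinity."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0

def ec_add(p1, p2):
    """Affine point addition; also covers doubling and the identity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None  # P + (-P) is the point at infinity
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def scalar_mult(k, pt):
    """Left-to-right double-and-add computing Q = k x P."""
    result = None
    for bit in bin(k)[2:]:
        result = ec_add(result, result)
        if bit == "1":
            result = ec_add(result, pt)
    return result

G = (3, 6)
Q = scalar_mult(5, G)
print(on_curve(G), on_curve(Q))
```

Recovering k from G and Q is easy on a 97-element field; the ECDLP only becomes hard for field sizes such as the 191 and 192 bits considered in this paper.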
[Figure 1: hierarchy of ECC implementations, from top to bottom: Protocols, Elliptic-Curve Operations, Finite-Field Arithmetic.]

Figure 1 shows the hierarchy of ECC implementations. All ECC operations are based on finite-field arithmetic. Higher-level protocols make use of the underlying ECC operations to provide various cryptographic services such as authentication, data integrity, non-repudiation, or confidentiality. Note that most of these protocols (such as ECDSA) require different operations over finite fields such as prime-field addition or multiplication.
Among the most commonly used types of finite fields are prime fields Fp and binary-extension fields F2m. These types have different characteristics so that the Weierstrass equation can be simplified and different formulas for point addition and point doubling can be derived. Due to the differences between those fields, the performance of both software and hardware implementations can vary significantly.
In this paper, we compare two ECC implementations that are based on F2^191 and on Fp192. Both implementations use multi-precision arithmetic, which means that all finite-field elements are split into smaller bit vectors of size W. As elliptic curves, we decided to use the recommended NIST prime-field curve P-192 [30] and the ANSI X9.62 compliant binary-field curve B-191 [1], i.e. c2tnb191v1, because we would like to compare curves with nearly identical bit sizes (191 vs. 192 bits). The one-bit difference between those two fields can be considered negligible in a relative comparison since we use the same metric for both implementations.
Throughout the paper, we use the following notation. For prime fields with modulus p, n = ⌈log2(p)⌉ bits are required to represent a number. For binary fields with f(z) = z^m + r(z) denoting an irreducible binary polynomial of degree m, a bit vector with m entries can be used to represent any binary polynomial. Consequently, the number of words needed to represent an Fp number is N = ⌈n/W⌉ and the number of words needed to represent an F2m polynomial is M = ⌈m/W⌉.
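As a small worked example of this notation, the following Python sketch computes N and M for the two fields used here with W = 16 and splits a field element into little-endian words. The helper names (`words_needed`, `to_words`, `from_words`) and the little-endian word order are assumptions made for this illustration.

```python
W = 16  # word size of the processors in bits

def words_needed(bits, w=W):
    """Number of w-bit words needed for a value of the given bit length."""
    return -(-bits // w)  # ceiling division

def to_words(value, count, w=W):
    """Split an integer into `count` little-endian w-bit words."""
    mask = (1 << w) - 1
    return [(value >> (w * i)) & mask for i in range(count)]

def from_words(words, w=W):
    """Reassemble an integer from little-endian w-bit words."""
    return sum(word << (w * i) for i, word in enumerate(words))

p192 = 2**192 - 2**64 - 1          # NIST P-192 prime
n = p192.bit_length()              # bits needed for an Fp element
N = words_needed(n)                # words per Fp element
M = words_needed(191)              # words per F2^191 element
print(n, N, M)
```

Both fields need the same number of 16-bit words, which is one reason the two designs are directly comparable.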
Comparison Metric and Criteria
The efficiency of ECC-hardware implementations depends on different criteria. In order to make a fair comparison, we introduce the following metric consisting of four main attributes:
-The area requirement of a chip is important for any cost-sensitive application. This is because the area largely determines the chip costs at fabrication.
-Embedded systems require low-power and low-energy designs. This is an important issue especially in battery-powered environments.
-Speed of computation is important for many applications to be applicable in practice. The most neutral unit for measurement is the number of cycles it takes to perform a certain operation.
The maximum frequency that can be used to clock a design has a direct impact on the resulting execution time (speed) of any algorithm. However, the previously mentioned applications heavily constrain the maximum frequency anyway, so we do not include the frequency measure in our metric.
Because the energy W = P t is defined as product of the electrical power P and time t, its properties are not handled explicitly within Section 4.
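The relation between power, runtime, and energy can be checked against the point-multiplication figures quoted in the introduction. The sketch below assumes a clock frequency of 1 MHz (the frequency to which the power results of the related-work comparison are scaled); the function name and the assumption that the headline figures refer to this frequency are ours.

```python
def energy_uj(power_uw, cycles, clock_hz=1_000_000):
    """Energy in microjoule for a given power (µW) and cycle count.

    Assumes the power figure refers to operation at `clock_hz`.
    """
    runtime_s = cycles / clock_hz
    return power_uw * runtime_s  # µW * s = µJ

e_fp = energy_uj(41.4, 1_313_000)   # prime-field processor
e_f2m = energy_uj(34.8, 400_000)    # binary-field processor
print(round(e_fp, 1), round(e_f2m, 1))
```

Under this assumption the products agree with the reported 54.3 µJ and 13.9 µJ up to rounding of the inputs.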
Comparing ECC-Hardware Designs over F2m and Fp
In this section, we compare ECC hardware designs and the respective algorithms over F2m and over Fp. We describe the differences of the finite-field operations, the respective elliptic-curve group operations, and compare the hardware designs of both types of fields regarding cryptographic protocols like ECDSA.
Finite-Field Arithmetics
Modular Addition and Subtraction. The most basic finite-field algorithms are addition and subtraction. Algorithm 1 and Algorithm 2 show modular-addition algorithms over Fp and F2m. The major difference of those algorithms is the carry propagation ε. The polynomial addition is a simple XOR operation that does not incorporate a carry.

Algorithm 1 Prime-field addition.
Require: Two integers a, b ∈ [0, p−1] and modulus p.
...
9: end for
10: end if
11: Return(c).

Algorithm 2 Binary-field addition.
Require: Binary polynomials a(z), b(z) with maximum degree m−1.
Ensure: c(z) = a(z) + b(z).
1: for i from 0 to M − 1 do
2: ...

[Fig. 3. 4-bit carry-less multiplier for F2m: partial products AiBj (i, j = 0..3) combined into the result bits R0 to R7.]
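The difference in carry handling can be illustrated with a short Python sketch of word-wise addition in both fields. This follows the common textbook scheme (add with carry, then conditionally subtract p) and is not a reconstruction of the paper's Algorithm 1; the little-endian word lists and all function names are assumptions.

```python
W = 16
MASK = (1 << W) - 1

def to_words(value, count):
    return [(value >> (W * i)) & MASK for i in range(count)]

def from_words(words):
    return sum(w << (W * i) for i, w in enumerate(words))

def fp_add(a, b, p):
    """(a + b) mod p on little-endian word lists, with carry propagation."""
    n = len(p)
    c, carry = [0] * n, 0
    for i in range(n):
        t = a[i] + b[i] + carry
        c[i], carry = t & MASK, t >> W
    # Trial subtraction c - p with borrow propagation.
    d, borrow = [0] * n, 0
    for i in range(n):
        t = c[i] - p[i] - borrow
        d[i], borrow = t & MASK, 1 if t < 0 else 0
    # Keep c - p if the sum overflowed or was at least p.
    return d if carry or not borrow else c

def f2m_add(a, b):
    """Binary-field addition: word-wise XOR, no carry at all."""
    return [x ^ y for x, y in zip(a, b)]

P192 = 2**192 - 2**64 - 1
N = 12
x, y = P192 - 1, P192 - 2
total = from_words(fp_add(to_words(x, N), to_words(y, N), to_words(P192, N)))
xor_sum = from_words(f2m_add(to_words(0b1011, N), to_words(0b0110, N)))
print(total == (x + y) % P192, bin(xor_sum))
```

The prime-field routine needs two passes with carry or borrow propagation plus a final selection, while the binary-field routine is a single pass of independent XORs.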
Modular Multiplication. A multiply-accumulate unit (cf. [12, 15, 16]) can be used to increase the efficiency of the product-scanning method. Such a multiply-accumulate unit can be designed for Fp and F2m. Figures 2 and 3 show the internal structure of 4-bit multipliers for integers and polynomials. The biggest advantages of the carry-less multiplier for F2m are the shorter critical path and the smaller area requirement (logical XOR cells are used instead of full-adder standard cells). Thus, the difference between an F2m and an Fp multiplication module can be up to 40 % in terms of area requirement. The power consumption of a multiplier built from XORs instead of full adders is also lower. However, the execution times (in cycles) for an integer or binary-polynomial multiplication using the product-scanning method are equivalent.

A finite-field multiplication always needs a reduction. There exist many ways to realize modular reduction in hardware. One efficient way is to apply a (fast) reduction method using special primes, so-called Mersenne-like primes, which are often used for recommended and standardized elliptic curves (e.g. the NIST recommended curves [30]). Figure 4 shows how intermediate multiplication results can be reduced using this fast reduction method for primes and polynomials over the curves NIST P-192: p = 2^192 − 2^64 − 1 and ANSI X9.62 c2tnb191v1: f(z) = z^191 + z^9 + 1. The reduction can be performed with only shifts and additions. The shift operations necessary for NIST P-192 fit very well within the addressing scheme of 8-bit, 16-bit, or 32-bit architectures. The shift operations required by c2tnb191v1 do not fulfill this property. However, in cases where the shift operations are smaller than W, an additional hardcoded reduction logic can be used. In terms of area, this reduction logic is very cheap (about the size of an F2m addition).
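The following Python sketch illustrates both points: product scanning uses the same loop structure (and hence the same number of word multiplications) for integers and binary polynomials, and the special form of p and f(z) reduces a double-length product with shifts and additions (or XORs) only. It models the arithmetic, not the hardware; the function names, the software `clmul`, and the bit-serial reference `poly_mod` are assumptions made for this illustration.

```python
W = 16
MASK = (1 << W) - 1
P192 = 2**192 - 2**64 - 1
F191 = (1 << 191) | (1 << 9) | 1     # f(z) = z^191 + z^9 + 1

def to_words(value, count):
    return [(value >> (W * i)) & MASK for i in range(count)]

def from_words(words):
    return sum(w << (W * i) for i, w in enumerate(words))

def clmul(x, y):
    """Carry-less (polynomial) product of two bit vectors."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        y >>= 1
    return r

def product_scan(a, b, carry_less):
    """Column-wise product of two little-endian word lists of equal length."""
    n = len(a)
    out, acc, word_mults = [], 0, 0
    for k in range(2 * n - 1):
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            if carry_less:
                acc ^= clmul(a[i], b[k - i])   # accumulate without carries
            else:
                acc += a[i] * b[k - i]         # accumulate with carries
            word_mults += 1
        out.append(acc & MASK)
        acc >>= W
    out.append(acc)
    return out, word_mults

def reduce_p192(c):
    """Fold the upper half using 2^192 = 2^64 + 1 (mod p)."""
    while c >> 192:
        c = (c & ((1 << 192) - 1)) + (c >> 192) * ((1 << 64) + 1)
    return c - P192 if c >= P192 else c

def reduce_f191(c):
    """Fold the upper half using z^191 = z^9 + 1 (mod f(z))."""
    while c >> 191:
        hi = c >> 191
        c = (c & ((1 << 191) - 1)) ^ (hi << 9) ^ hi
    return c

def poly_mod(c, f):
    """Bit-serial reference reduction, used only to cross-check."""
    while c.bit_length() >= f.bit_length():
        c ^= f << (c.bit_length() - f.bit_length())
    return c

a, b = P192 - 12345, (1 << 191) + 987654321
int_words, int_mults = product_scan(to_words(a, 12), to_words(b, 12), False)
u, v = a & ((1 << 191) - 1), b & ((1 << 191) - 1)
pol_words, pol_mults = product_scan(to_words(u, 12), to_words(v, 12), True)
print(int_mults, pol_mults)
```

Both calls perform 144 word multiplications for 12-word operands; the difference between the two fields lies in the cost of each word multiplication and of the accumulation, not in the loop count.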
Modular Squaring. Modular squaring is equivalent to a modular multiplication with two identical operands. Thus, an explicit implementation is often not necessary, especially in implementations where low area is a stringent requirement. However, if implemented, it improves the performance since it is typically faster than modular multiplication [13]. During a prime-field squaring operation, the two intermediate products A[i] × A[j] and A[j] × A[i] are identical and have to be computed only once. Figure 5 shows the operands of a 6-word squaring operation where only the necessary operations (multiplications) are shaded. Thus, the squaring operation can be up to two times faster than a multiplication.
Squarings over binary fields, in contrast, have the nice property that

$$a(z)^2 = \left(\sum_{i=0}^{m-1} a_i z^i\right)^2 = \sum_{i=0}^{m-1} a_i z^{2i}.$$

Thus, zero values are simply inserted between two consecutive bits a_i. Utilizing the binary multiplier from Figure 3, only M multiplications A[i] × A[i] are required to perform a binary-field squaring operation. As can also be seen in Figure 6, the squaring operation is M times faster than a binary-field multiplication. It can be performed with a runtime complexity similar to that of a modular addition.
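The zero-insertion property can be demonstrated with a short Python sketch that squares a 12-word binary polynomial using exactly one word operation per input word and cross-checks the result against a general carry-less multiplication. The result is not yet reduced modulo f(z). The function names and the test operand are assumptions for this illustration.

```python
W = 16
MASK = (1 << W) - 1

def to_words(value, count):
    return [(value >> (W * i)) & MASK for i in range(count)]

def from_words(words):
    return sum(w << (W * i) for i, w in enumerate(words))

def spread_word(a):
    """Insert a zero between consecutive bits of a W-bit word (2W-bit result)."""
    r = 0
    for i in range(W):
        r |= ((a >> i) & 1) << (2 * i)
    return r

def f2m_square_words(words):
    """Unreduced square: each input word yields two output words."""
    out = []
    for a in words:                 # one word operation A[i] x A[i] per word
        s = spread_word(a)
        out.extend([s & MASK, s >> W])
    return out

def clmul(x, y):
    """General carry-less product, used only as a reference."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        y >>= 1
    return r

M = 12
a = (1 << 190) | 0xDEADBEEF12345
sq_words = f2m_square_words(to_words(a, M))
sq = from_words(sq_words)
print(sq == clmul(a, a))
```

A general multiplication of the same operands needs M^2 word products, which is where the factor of M between squaring and multiplication comes from.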
In terms of runtime and lines-of-code, an F2m squaring can be up to 2M times faster than an Fp squaring.

Modular Inversion. What the inversion operations for prime and binary fields have in common is their very slow execution time. There are two common inversion methods: one is based on the extended Euclidean algorithm and one is based on Fermat's little theorem (a ≡ a^(2^m) mod f(z) ∀a ∈ F2m). For this paper, the Montgomery inversion technique by Kaliski et al. [20] has been used for prime-field inversion operations. Using Fermat's little theorem [18] for binary-field inversions, with a^(−1) ≡ a^(2^m − 2) mod f(z), a field inversion can be performed using m − 1 squarings and several multiplications. In the case of c2tnb191v1, 190 squarings and 12 multiplications are necessary. Because of the fast squaring operations within binary fields, this method is faster than any Euclidean-based algorithm. [14] gives a comparison of different algorithms for an inversion within the NIST B-163 field.
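The operation count for the Fermat-based inversion can be reproduced with a short Python sketch. It uses an Itoh-Tsujii style addition chain, which is an assumption on our part: the text only states the counts, not the chain. With this chain, the exponentiation a^(2^191 − 2) takes 190 squarings and 12 multiplications. Field elements are modelled as Python integers; all names are illustrative.

```python
M_DEG = 191
counts = {"sqr": 0, "mul": 0}

def reduce_f(c):
    """Reduce modulo f(z) = z^191 + z^9 + 1 using z^191 = z^9 + 1."""
    while c >> M_DEG:
        hi = c >> M_DEG
        c = (c & ((1 << M_DEG) - 1)) ^ (hi << 9) ^ hi
    return c

def f_mul(a, b):
    counts["mul"] += 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return reduce_f(r)

def f_sqr(a):
    counts["sqr"] += 1
    r, i = 0, 0
    while a:                       # squaring = inserting zeros between bits
        r |= (a & 1) << (2 * i)
        a >>= 1
        i += 1
    return reduce_f(r)

def f_inv(a):
    """a^(2^m - 2) via beta_k = a^(2^k - 1), scanning the bits of m - 1."""
    beta, k = a, 1
    for bit in bin(M_DEG - 1)[3:]:     # bits of m - 1 below the leading one
        t = beta
        for _ in range(k):
            t = f_sqr(t)
        beta = f_mul(t, beta)          # beta_2k = beta_k^(2^k) * beta_k
        k *= 2
        if bit == "1":
            beta = f_mul(f_sqr(beta), a)   # beta_(k+1) = beta_k^2 * a
            k += 1
    return f_sqr(beta)                 # a^(2^m - 2) = beta_(m-1)^2

a = 0x123456789ABCDEF00FEDCBA9
a_inv = f_inv(a)
ops = dict(counts)                     # counts for the inversion only
check = f_mul(a, a_inv)
print(ops, check)
```

Since a squaring costs roughly as much as an addition in F2m, the 190 squarings contribute little, and the runtime is dominated by the 12 multiplications.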
Elliptic-Curve Operations
The performance of EC-group operations over F2m and Fp differs significantly. We used formulae that reflect the state of the art in efficient ECC implementations. For binary-field arithmetic, we applied the formulae proposed by J. López and R. Dahab [27]. Their formulae need six finite-field multiplications, five squarings, and three additions per key bit. For prime-field arithmetic, we applied the formulae of M. Hutter et al. [17] needing 12 multiplications, four squarings, and 16 additions (incl. subtractions). Both formulae have been applied within the Montgomery powering ladder scalar multiplication [19]. Comparing the formulae clearly shows that the binary formulae need 50 % fewer multiplications than the formulae over prime-field arithmetic. This is one of the most advantageous properties that encourages the use of F2m operations in ECC-hardware implementations. Note that both formulae use projective coordinates, which means that no modular inversion is needed throughout the scalar multiplication.
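The structure of the Montgomery powering ladder, and the effect of the per-bit operation counts quoted above, can be sketched as follows. The ladder is shown on a toy curve with affine arithmetic for readability; the paper's processors use the projective formulae of [27] and [17] instead. The toy parameters (p = 97, a = 2, point (3, 6)) and all names are assumptions.

```python
P_MOD, A = 97, 2   # toy curve y^2 = x^3 + 2x + 3 over F_97 (assumed)

def ec_add(p1, p2):
    """Affine point addition; None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ladder(k, pt):
    """Montgomery ladder: one addition and one doubling per key bit."""
    r0, r1 = None, pt              # invariant: r1 = r0 + pt
    for bit in bin(k)[2:]:
        if bit == "1":
            r0, r1 = ec_add(r0, r1), ec_add(r1, r1)
        else:
            r0, r1 = ec_add(r0, r0), ec_add(r0, r1)
    return r0

# Field operations per key bit as stated in the text: (mul, sqr, add).
PER_BIT = {"F2m": (6, 5, 3), "Fp": (12, 4, 16)}
mults_f2m = PER_BIT["F2m"][0] * 191
mults_fp = PER_BIT["Fp"][0] * 192
print(ladder(5, (3, 6)), mults_f2m, mults_fp)
```

Because every key bit triggers the same sequence of operations, the total cost is simply the per-bit cost times the scalar length, which makes the halved multiplication count of the binary formulae directly visible in the runtime.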
Cryptographic Protocols
After the basic elliptic-curve operation of a scalar multiplication, we compare the performance of F2m and Fp processors in terms of higher-level protocols. In particular, we implemented ECDSA [30] on both types of (binary and prime-field based) processors. The main additional operations needed to support ECDSA are the SHA-1 [29] algorithm to calculate the message digest of the message m and some prime-field operations, i.e. modular addition, multiplication, and inversion, to calculate the digital signature (r, s) = (k × P, k^(−1) (SHA-1(m) + r d)), where d represents the used private key.
For a more efficient ECDSA-verify algorithm, we additionally implemented a different methodology for calculating point multiplications. First, we applied Shamir's trick [8, 9] to improve the performance of multiple point multiplication. Second, we used different formulae to perform the verification using Jacobian-projective coordinates [13] for the prime-field processor and López-Dahab coordinates [13, 26] for the binary-field processor.
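Written out step by step, the compact signature expression corresponds to the standard ECDSA equations below. The symbols n (the prime order of the base point P), e (the message digest), and Q (the public key) do not appear in the text above and are introduced here for illustration. The arithmetic modulo n is why prime-field operations are needed even on the binary-field processor.

```latex
% Signature generation with private key d and random nonce k
\begin{align*}
  e &= \text{SHA-1}(m), \\
  (x_1, y_1) &= k \times P, \\
  r &= x_1 \bmod n, \\
  s &= k^{-1}\,(e + r\,d) \bmod n .
\end{align*}
% Signature verification with public key Q = d x P
\begin{align*}
  u_1 &= e\,s^{-1} \bmod n, \qquad u_2 = r\,s^{-1} \bmod n, \\
  (x_1', y_1') &= u_1 \times P + u_2 \times Q, \\
  \text{accept} &\iff r = x_1' \bmod n .
\end{align*}
```

Verification therefore requires a double-scalar multiplication, whereas signing requires a single one.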
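Shamir's trick evaluates u1 × P + u2 × Q in one pass over the bits of both scalars, sharing the doublings and adding a precomputed P, Q, or P + Q as needed. The Python sketch below shows the idea on a toy curve with affine arithmetic; the curve parameters, points, scalars, and function names are assumptions, and the paper's processors use Jacobian-projective or López-Dahab coordinates instead.

```python
P_MOD, A = 97, 2   # toy curve y^2 = x^3 + 2x + 3 over F_97 (assumed)

def ec_add(p1, p2):
    """Affine point addition; None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def scalar_mult(k, pt):
    """Plain double-and-add, used as the reference."""
    result = None
    for bit in bin(k)[2:]:
        result = ec_add(result, result)
        if bit == "1":
            result = ec_add(result, pt)
    return result

def shamir(u1, p, u2, q):
    """Simultaneous computation of u1 x P + u2 x Q with shared doublings."""
    table = {(1, 0): p, (0, 1): q, (1, 1): ec_add(p, q)}  # precomputed once
    r = None
    for i in reversed(range(max(u1.bit_length(), u2.bit_length()))):
        r = ec_add(r, r)
        bits = ((u1 >> i) & 1, (u2 >> i) & 1)
        if bits != (0, 0):
            r = ec_add(r, table[bits])
    return r

G = (3, 6)
Q = scalar_mult(7, G)
combined = shamir(13, G, 22, Q)
separate = ec_add(scalar_mult(13, G), scalar_mult(22, Q))
print(combined == separate)
```

Two separate multiplications need twice as many doublings; sharing them is what keeps the cost of a double-scalar multiplication close to that of a single one.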
Comparison Results
For a fair comparison of binary field and prime-field ECC implementations, it is important to select a common controlling engine, common development tools, the same process technology, and elliptic curves of nearly the same bit size.
As a controller, we decided to use our own 16-bit microcontroller called Neptun [33] [34] [35] that is especially optimized for elliptic-curve cryptography. The processor comes with twelve special-purpose registers and uses a Harvard architecture with separated program and data memory. The usually area-consuming data memory is made from a very area-efficient single-port RAM macro. The program memory is a synthesized lookup table stored as Read-Only Memory (ROM). In fact, the area requirement of this lookup table is proportional to the number of lines-of-code (LOC) stored within the program memory. The central processing unit (CPU) is capable of the most basic arithmetic operations such as addition/subtraction, logic operations (AND, OR, XOR), and shift operations.
As target technology, we selected a 130 nm low-leakage CMOS technology by UMC. This technology needs less power than larger 180 nm and 350 nm technologies and has a lower leakage power than smaller (e.g. 90 nm) technologies. The standard-cell library has been provided by Faraday Technology. The RAM-macro blocks have been generated using the Standard Memory Compiler FSA0A Memaker 200901.1.1 by the Faraday Technology Corporation [10]. For synthesis we used the Cadence RTL compiler [7] Version v08.10. For power simulations we used Cadence First Encounter Version v08.10.
Finite-Field Arithmetic
The finite-field algorithms have been implemented as described in Section 4.1. All algorithms (except the algorithms for modular inverses) have been unrolled and optimized for our custom microcontroller instruction set (Assembler language). All results are summarized in Table 1 .
It shows that our processor performs the binary-field addition about 40.6 % faster than the prime-field addition (the same holds for modular subtraction). Binary-field multiplication is 19.5 % faster than its prime-field counterpart because of the extra reduction logic provided to take advantage of the Mersenne-like irreducible polynomial. However, even when multi-precision arithmetic is used, the biggest advantage of binary-field operations lies in the squaring operation, which is 4.2 times faster than the prime-field squaring operation. Finally, the two very distinct inversion techniques discussed in Section 4.1 result in very different runtime and LOC results. The binary-field inversion implementation is 3.19 times faster and needs only 29.5 % of the LOC. It reuses the squaring and multiplication methods and consequently only works for a single irreducible polynomial. The prime-field inversion, in contrast, works for any prime. The main reason for its larger code size is the additional utility functions (addition, subtraction, multiplication by 2, division by 2) that had to be implemented.

Table 2 compares the absolute values and Table 3 compares the relative differences of the implemented prime-field and binary-field ECC implementations. The relative differences shown in Table 3 have been calculated using the Formula
Elliptic-Curve Operations
In the following, we separately consider point multiplication as well as signature generation and verification of the higher-level protocol of ECDSA.
In view of point multiplication, the binary-field based implementation is 3.28 times faster than its prime-field based counterpart. The area requirement is 19.7 % lower and the power consumption is 15.9 % lower for the binary-field processor. This results in an energy consumption which is 3.91 times lower than for the calculation over prime fields. Note that the area difference mostly comes from the size of the program memory, the multiplier used within the CPU, and the size of the necessary RAM macro. Note also that in both designs, about 50 % of the total power is consumed within the CPU.
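These ratios can be recomputed from the rounded point-multiplication figures quoted in the introduction. The sketch below assumes that a saving is expressed relative to the prime-field value, i.e. (Fp − F2m)/Fp; this reading is our assumption, chosen because it is consistent with the reported numbers. Dictionary keys and function names are illustrative.

```python
# Rounded point-multiplication figures as quoted in the introduction.
fp = {"area_kGE": 11.6, "power_uW": 41.4, "kcycles": 1313, "energy_uJ": 54.3}
f2m = {"area_kGE": 9.3, "power_uW": 34.8, "kcycles": 400, "energy_uJ": 13.9}

def saving_percent(ref, new):
    """Relative saving of `new` with respect to `ref`, in percent (assumed form)."""
    return 100.0 * (ref - new) / ref

def factor(ref, new):
    """How many times smaller `new` is than `ref`."""
    return ref / new

savings = {key: saving_percent(fp[key], f2m[key]) for key in fp}
speedup = factor(fp["kcycles"], f2m["kcycles"])
energy_factor = factor(fp["energy_uJ"], f2m["energy_uJ"])
for key, value in savings.items():
    print(f"{key}: {value:.1f} % lower for F2m")
print(f"speedup {speedup:.2f}x, energy factor {energy_factor:.2f}x")
```

With the rounded inputs, the results match the reported 19.7 %, 15.9 %, 69.6 %, and 74.4 % to within about 0.1 percentage points, and the factors 3.28 and 3.91 to two decimals.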
Cryptographic Protocols
For ECDSA, only 455 lines of code (38 %) have to be added to the prime-field ECC processor to support all operations to sign data. This and the small increase of necessary RAM entries increased the total area requirement by 26.5 %. The execution time is increased by only 6.2 %. The differences in power and energy consumption are hardly noticeable. The changes to the binary-field ECC processor are much more significant. The CPU had to be extended with a small 8-bit integer multiply-accumulate unit, making it capable of prime and binary-field operations and increasing the area requirements of the CPU by 20 %. Adding all those algorithms increased the size of the program memory by 177 % and the total area of the processor by 64 %. The power and energy consumption also increased, by 13.5 % and 40.5 %, respectively. However, the binary-field based ECDSA processor is still 2.82 times faster than the prime-field based ECDSA processor. Even though the area and power consumption are approximately identical, the binary-field ECDSA processor needs 2.82 times less energy than the prime-field ECDSA processor.

ECDSA verification needs two point multiplications, whereas ECDSA-signature generation needs only one. Because of Shamir's trick, the runtime of the prime-field based algorithms differs by only 2 %. The area differs by 7.5 % and the power and energy results are almost identical. The ECDSA-signature verification algorithm over binary fields does not handle the two point multiplications as well. Whereas the area increased by only 8.7 %, the runtime increased by 80 %. This doubles the required energy needed for an ECDSA-signature verification compared to an ECDSA-signature generation.

Comparison with Related Work
Table 4 gives a comparison with related work. All power results have been scaled to 1 MHz. The first five rows give related work over prime fields. The remaining rows contain related work over binary fields.
Our Fp processor is 51 % smaller than the best related design by Wolkerstorfer [36]. In terms of cycles this processor is above average. Only the energy requirement of the design by Öztürk [31] is lower, but their design is not based on NIST P-192. The area results of the math processor are 10.4 % smaller than the smallest related implementation. Our speed, power, and energy results are larger than those of many other designs, but it should be noted that those designs have an advantage because of the smaller elliptic curve used. Table 5 summarizes related work regarding low-resource ECDSA-hardware implementations. In terms of power and energy consumption, we outperform existing solutions. The area requirements are lower than in the work of Kern [21] and Hutter [16] but higher than in the work of Wenger [34].
Conclusion
In this paper, we compared the performance of two distinct ECC-hardware implementations that are based on prime-field (NIST P-192) and binary-field (ANSI c2tnb191v1) arithmetic. The comparison of the finite-field algorithms showed the clear runtime advantage of the squaring (4.2 times) and addition (1.7 times) operations within the binary-extension field. When doing point multiplications, the F2m based processor outperforms the Fp based processor by 19.7 % in area, 69.6 % in runtime, 15.9 % in power, and 74.4 % in energy. In addition to these outcomes, we analyzed the impact of higher-level protocols on the finite-field processors. The implementation of both digital-signature generation and verification using ECDSA led us to interesting findings. It was shown that the area and power advantages of the F2m based processor vanish, while it is still 1.5-2.8 times faster and consequently more energy efficient than the Fp based processor.
These results can be applied to any future design of an ASIC ECC processor that is integrated in an area, power, or energy constrained device.
