Abstract. Since its proposal by Victor Miller [17] and Neal Koblitz [15] in the mid 1980s, Elliptic Curve Cryptography (ECC) has evolved into a mature public-key cryptosystem. Offering the smallest key size and the highest strength per bit, its computational efficiency can benefit both client devices and server machines. We have designed a programmable hardware accelerator to speed up point multiplication for elliptic curves over binary polynomial fields $GF(2^m)$. The accelerator is based on a scalable architecture capable of handling curves of arbitrary field degrees up to m = 255. In addition, it delivers optimized performance for a set of commonly used curves through hard-wired reduction logic. A prototype implementation running in a Xilinx XCV2000E FPGA at 66.4 MHz shows a performance of 6987 point multiplications per second for $GF(2^{163})$. We have integrated ECC into OpenSSL, today's dominant implementation of the secure Internet protocol SSL, and tested it with the Apache web server and open-source web browsers.
Introduction
Since its proposal by Victor Miller [17] and Neal Koblitz [15] in the mid 1980s, Elliptic Curve Cryptography (ECC) has evolved into a mature public-key cryptosystem. Extensive research has been done on the underlying math, its security strength, and efficient implementations.
ECC offers the smallest key size and the highest strength per bit of any known public-key cryptosystem. This stems from the discrete logarithm problem in the group of points over an elliptic curve. Among the different fields that can underlie elliptic curves, integer fields F(p) and binary polynomial fields $GF(2^m)$ have proven to be best suited for cryptographic applications. In particular, binary polynomial fields allow for fast computation in both software and hardware implementations.
Small key sizes and computational efficiency of both public- and private-key operations make ECC applicable not only to hosts executing secure protocols over wired networks, but also to small wireless devices such as cell phones, PDAs and SmartCards. To make ECC commercially viable, its integration into secure protocols needs to be standardized. As an emerging alternative to RSA, the US government has adopted ECC for the Elliptic Curve Digital Signature Algorithm (ECDSA) and specified named curves for key sizes of 163, 233, 283, 409 and 571 bits [18]. Additional curves for commercial use were recommended by the Standards for Efficient Cryptography Group (SECG) [7]. However, only a few ECC-enabled protocols have been deployed in commercial applications to date. Today's dominant secure Internet protocols such as SSL and IPsec rely on RSA and the Diffie-Hellman key exchange. Although standards for the integration of ECC have been proposed [4], they have not yet been finalized.
Our approach towards an end-to-end solution is driven by a scenario of a wireless and web-based environment where millions of client devices connect to a secure web server.
The aggregation of client-initiated connections/transactions leads to high computational demand on the server side, which is best handled by a hardware solution. While support for a limited number of curves is acceptable for client devices, server-side hardware needs to be able to operate on numerous curves. The reason is that clients may choose different key sizes and curves depending on vendor preferences, individual security requirements and processor capabilities. In addition, different types of transactions may require different security levels and thus, different key sizes.
We have developed a cryptographic hardware accelerator for elliptic curves over arbitrary binary polynomial fields $GF(2^m)$, m ≤ 255. To support secure web transactions, we have fully integrated ECC into OpenSSL and tested it with the Apache web server and open-source web browsers.
The paper is structured as follows: Section 2 summarizes related work and implementations of ECC. In Section 3, we outline the components of an ECC-enabled secure system. Section 4 describes the integration of ECC into OpenSSL. The architecture of the hardware accelerator and the implemented algorithms are presented in Section 5. We give implementation cost and performance numbers in Section 6. The conclusions and future directions are contained in Section 7.
Related Work
Hardware implementations of ECC have been reported in [20], [2], [1], [11], [10] and [9]. Orlando and Paar describe a programmable elliptic curve processor for reconfigurable logic in [20]. The prototype performs point multiplication based on Montgomery Scalar Multiplication in projective space [16] for $GF(2^{167})$. Their design uses polynomial basis coordinate representation. Multiplication is performed by a digit-serial multiplier proposed by Song and Parhi [22]. Field inversion is computed through Fermat's theorem as suggested by Itoh and Tsujii [13]. With a performance of 0.21 ms per point multiplication, this is the fastest reported hardware implementation of ECC. Bednara et al. [2] designed an FPGA-based ECC processor architecture that allows for using multiple squarers, adders and multipliers in the data path. They researched hybrid coordinate representations in affine, projective, Jacobian and López-Dahab form.
Two prototypes were synthesized for $GF(2^{191})$ using an LFSR polynomial basis multiplier and a Massey-Omura normal basis multiplier, respectively. Agnew et al. [1] built an ECC ASIC for $GF(2^{155})$. The chip uses an optimal normal basis multiplier exploiting the composite field property of $GF(2^{155})$. Goodman and Chandrakasan [11] designed a generic public-key processor optimized for low power consumption that executes modular operations on different integer and binary polynomial fields. To our knowledge, this is the only implementation that supports $GF(2^m)$ for variable field degrees m. However, the architecture is based on bit-serial processing and its performance cannot be scaled to levels required by server-type applications.
System Overview
Figure 1 shows the implementation of a client/server system using a secure ECC-enhanced protocol. We integrated new cipher suites based on ECC
Secure Sockets Layer
Secure Sockets Layer (SSL aka TLS) [8] is the most widely deployed and used security protocol on the Internet today. The protocol has withstood years of scrutiny by the security community and, in the form of HTTPS, is now trusted to secure virtually all sensitive web-based applications ranging from banking to online trading to e-commerce.
SSL offers encryption, source authentication and integrity protection for data exchanged over insecure, public networks. It operates above a reliable transport service such as TCP and has the flexibility to accommodate different cryptographic algorithms for key agreement, encryption and hashing. However, the specification does recommend particular combinations of these algorithms, called cipher suites, which have well-understood security properties. The two main components of SSL are the Handshake protocol and the Record Layer protocol. The Handshake protocol allows an SSL client and server to negotiate a common cipher suite, authenticate each other, and establish a shared master secret using public-key algorithms. The Record Layer derives symmetric keys from the master secret and uses them with faster symmetric-key algorithms for bulk encryption and authentication of application data. Public-key cryptographic operations are the most computationally expensive portion of SSL processing, and speeding them up remains an active area for research and development. Figure 2 shows the general structure of a full SSL handshake. Today, the most commonly used public-key cryptosystem for master-key establishment is RSA, but the IETF is considering an equivalent mechanism based on ECC [4].
Public-key Cryptography in SSL

RSA-based Handshake
The client and server exchange random nonces (used for replay protection) and negotiate a cipher suite with ClientHello and ServerHello messages. The server then sends its signed RSA public key either in the Certificate message or the ServerKeyExchange message. The client verifies the RSA signature, generates a 48-byte random number (the pre-master secret) and sends it encrypted with the server's public key in the ClientKeyExchange. The server uses its RSA private key to decrypt the pre-master secret. Both endpoints then use the pre-master secret to create a master secret, which, along with previously exchanged nonces, is used to derive the cipher keys, initialization vectors and MAC (Message Authentication Code) keys for bulk encryption by the Record Layer.
The server can optionally request client authentication by sending a CertificateRequest message listing acceptable certificate types and certificate authorities. In response, the client sends its public key in the Certificate message and proves possession of the corresponding private key by including a digital signature in the CertificateVerify message.
ECC-based Handshake
The processing of the first two messages is the same as for RSA, but the Certificate message contains the server's Elliptic Curve Diffie-Hellman (ECDH) public key signed with the Elliptic Curve Digital Signature Algorithm (ECDSA). After validating the ECDSA signature, the client conveys its ECDH public key in the ClientKeyExchange message. Next, each entity uses its own ECDH private key and the other's public key to perform an ECDH operation and arrive at a shared pre-master secret. The derivation of the master secret and symmetric keys is unchanged compared to RSA. Client authentication is still optional and the actual message exchange depends on the type of certificate a client possesses.
ECC Hardware Acceleration
Point multiplication on elliptic curves is the fundamental and most expensive operation underlying both ECDH and ECDSA. For a point P in the group

$$(\{(x, y) \mid y^2 + xy = x^3 + ax^2 + b\} \cup \{0\},\ +_P)$$

defined by a nonsupersingular elliptic curve with parameters $a, b \in GF(2^m)$ and for a positive integer k, the point multiplication kP is defined by adding P to itself k − 1 times using $+_P$. Computing kP is based on a sequence of modular additions, multiplications and divisions. To efficiently support ECC, these operations need to be implemented for large operands.
The design of our hardware accelerator was driven by the need to both provide high performance for named elliptic curves and support point multiplications for arbitrary, less frequently used curves. It is based on an architecture for binary polynomial fields $GF(2^m)$, m ≤ 255. We believe that this maximal field degree offers adequate security strength for commercial web traffic for the foreseeable future. We chose to represent elements of GF
Architectural Overview
We developed a programmable processor optimized to execute ECC point multiplication. The data path shown in Figure 3 implements a 256-bit architecture. Parameters and variables are stored in an 8 kB data memory DMEM and program instructions are contained in a 1 kB instruction memory IMEM. Both memories are dual-ported and accessible by the host machine through a 64-bit/66 MHz PCI interface. The register file contains eight general purpose registers R0-R7, a register RM to hold the irreducible polynomial and a register RC for curve-specific configuration information. The arithmetic units implement division (DIV), multiplication (MUL) and squaring/addition/shift left (ALU). Source operands are transferred over the source bus SBUS and results are written back into the register file over the destination bus DBUS. Program execution is orchestrated by the Control Unit, which fetches instructions from the IMEM and controls the DMEM, the register file and the arithmetic units.

As shown in Table 1, the instruction set is composed of memory instructions, arithmetic/logic instructions and control instructions. Memory instructions LD and ST transfer operands between the DMEM and the register file. The arithmetic and logic instructions include MUL, MULNR, DIV, ADD, SQR and SL. We implemented a load/store architecture, that is, arithmetic and logic instructions can only access operands in the register file. The execution of arithmetic instructions can take multiple cycles and, in the case of division and multiplication, the execution time may even be data-dependent. To control the flow of the program execution, the conditional branches BMZ and BEQ, the unconditional branch JMP and the program termination END can be used.
The data path allows instructions to be executed in parallel or overlapped. The Control Unit examines subsequent instructions and decides on the execution model based on the type of instruction and data dependencies. An example for parallel and overlapped execution of an instruction sequence $I_0; I_1; I_2$ is given in Figure 4. Parallel execution of $I_0; I_1$ is possible if $I_0$ is a MUL or MULNR instruction, $I_1$ is an ADD or SQR instruction, and no data dependencies exist between the destination register(s) of $I_0$ and the source and destination register(s) of $I_1$. Overlapped execution is possible if source register RS0 of $I_2$ is different from destination register RD1 of $I_0$, i.e. RS0 can be read over the SBUS while RD1 is written over the DBUS.
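The parallel-issue condition can be illustrated with a small predicate. This is only a sketch of the rule as stated in the text: the `Instr` record, its field names and the register strings are hypothetical and do not reflect the actual encoding used by the Control Unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    # Hypothetical instruction record, for illustration only.
    op: str       # mnemonic, e.g. "MUL", "MULNR", "ADD", "SQR"
    dst: str      # destination register, e.g. "R2"
    srcs: tuple   # source registers


def can_run_in_parallel(i0: Instr, i1: Instr) -> bool:
    """True if I0 occupies the multiplier, I1 the ALU, and I1 does not
    read or write the destination register of I0."""
    if i0.op not in ("MUL", "MULNR"):
        return False
    if i1.op not in ("ADD", "SQR"):
        return False
    return i0.dst != i1.dst and i0.dst not in i1.srcs
```

For instance, a MUL writing R0 can be paired with a SQR on unrelated registers, but not with an ADD that reads R0.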
ALU
The ALU incorporates two arithmetic and one logic operation: Addition, squaring and shift left. The addition of two elements a, b ∈ GF (2 m ) is defined as the sum of the two polynomials obtained by adding the coefficients mod 2. This can be efficiently computed as the bitwise XOR of the corresponding bit strings.
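As a minimal illustration of field addition, the sketch below stores a polynomial as a Python integer whose bit i is the coefficient of $t^i$. This representation and the function name `gf2m_add` are assumptions made for the example.

```python
def gf2m_add(a: int, b: int) -> int:
    """Add two binary polynomials: coefficients are added mod 2,
    which is a bitwise XOR of the two bit strings."""
    return a ^ b


# (t^3 + t + 1) + (t^3 + t^2) = t^2 + t + 1
example_sum = gf2m_add(0b1011, 0b1100)
```

Since every element is its own additive inverse, the same function also serves as subtraction.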
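Squaring is a special case of multiplication and is defined in two steps. First, the operand $a \in GF(2^m)$ is multiplied by itself, resulting in a polynomial $c_0 = a^2$ whose degree can be as large as $2m - 2$. Second, $c_0$ is reduced modulo the irreducible polynomial M of degree m. Reduction is based on the congruency

$$u \equiv u + kM \bmod M \qquad (1)$$

for arbitrary polynomials u and k. Splitting $c_0$ into a high part $c_{0,h}$ and a low part $c_{0,l}$ with $c_0 = c_{0,h} \cdot t^m + c_{0,l}$ and using $t^m \equiv M - t^m \bmod M$ as a special case of (1), the congruency

$$c_1 = c_{0,h} \cdot (M - t^m) + c_{0,l} \equiv c_0 \bmod M$$

follows. By iteratively splitting $c_j$ into $c_{j,h}$ and $c_{j,l}$ such that

$$c_{j+1} = c_{j,h} \cdot (M - t^m) + c_{j,l}$$

until $c_{j,h} = 0$, the reduced result $c = c_i$ can be computed in a maximum of $i \le m - 1$ reduction iterations. The minimum number of iterations depends on the second highest term in the irreducible polynomial M [22], [12]. For

$$M = t^m + t^k + \sum_{j=1}^{k-1} M_j t^j + 1, \quad 1 \le k < m$$

it follows that a better upper bound for $\deg(c_1)$ is

$$\deg(c_1) < m + k - 1.$$

The minimum number of iterations i is given by

$$m - 1 - i(m - k) \le 0 \iff i \ge \left\lceil \frac{m - 1}{m - k} \right\rceil.$$

To enable efficient implementations, M is often chosen to be either a trinomial $M_t$ or pentanomial $M_p$:

$$M_t = t^m + t^{k_3} + 1, \qquad M_p = t^m + t^{k_3} + t^{k_2} + t^{k_1} + 1, \qquad m > k_3 > k_2 > k_1 > 1.$$

Choosing M such that $k_3 \le \frac{m-1}{2}$ apparently limits the number of reduction iterations to 2, which is the case for all irreducible polynomials recommended by NIST [18] and SECG [7]. The multiplications $c_{j,h} \cdot (M - t^m)$ can be optimized if $(M - t^m)$ is a constant sparse polynomial. In this case, the two steps of a squaring operation can be hard-wired and executed in a single clock cycle. As shown in Figure 5, the ALU implements hard-wired reduction for the irreducible polynomials $t^{163}$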
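The two steps of a squaring operation can be sketched in software as follows. This is an illustrative model rather than the hard-wired circuit: polynomials are stored as Python integers (bit i is the coefficient of $t^i$), the function names are hypothetical, and the degree-163 pentanomial used as an example is the NIST reduction polynomial for that field size, which is an assumption brought in from the standard.

```python
def square_unreduced(a: int) -> int:
    """First step: a^2 over GF(2) moves bit i of a to bit 2*i,
    i.e. zeros are inserted between the bits of a."""
    c, i = 0, 0
    while a:
        if a & 1:
            c |= 1 << (2 * i)
        a >>= 1
        i += 1
    return c


def clmul(x: int, y: int) -> int:
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        y >>= 1
    return r


def reduce_iterative(c: int, M: int, m: int):
    """Second step: repeat c <- c_h * (M - t^m) + c_l until c_h == 0.
    Returns the reduced value and the number of iterations used."""
    tail = M ^ (1 << m)          # M - t^m (subtraction is XOR)
    low_mask = (1 << m) - 1
    iterations = 0
    while c >> m:                # c_h != 0
        c = clmul(c >> m, tail) ^ (c & low_mask)
        iterations += 1
    return c, iterations


# Assumed example modulus: t^163 + t^7 + t^6 + t^3 + 1
M163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
```

With this modulus the second highest exponent is 7, well below (m − 1)/2, so the loop finishes after at most two iterations for any squared operand of degree below 163.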
Multiplier
We studied and implemented several different architectures and finally settled on a digit-serial shift-and-add multiplier. Figure 6 gives a block diagram of the multiplier. The result is computed in two steps. First, the product is computed by iteratively multiplying a digit of operand X with Y and accumulating the partial products in Z. Next, the product Z is reduced by the irreducible polynomial. In our implementation, the input operands X and Y can have a size of up to n = 256 bits, and the reduced result Z has a size of m = 163, 193 or 233 bits according to the specified named curve. The digit size d is 64. We optimized the number of iterations needed to compute the product Z such that the four iterations it takes to perform a full 256-bit multiplication are only executed for m = 193 and m = 233, whereas three iterations are executed for m = 163. To compensate for the missing shift operation in the latter case, a multiplexer was added to select the bits of Z to be reduced. The reduction is hard-wired and takes another clock cycle.

The alternative designs we studied were based on the Karatsuba algorithm [14] and the LSD multiplier [22]. In the Karatsuba-based design, the multiplication is first split into four partial products, one for each 64-bit digit of X up to X[255..192], and the Karatsuba algorithm is then used to calculate the four partial products. Compared with the shift-and-add algorithm, the Karatsuba algorithm is attractive since it lowers the bit complexity from $O(n^2)$ to $O(n^{\lg 3})$ [6]. It does, however, introduce irregularities into the wiring and, as a result, additional wire delays. As we will show in Table 3, this design did not meet our timing goal.
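The digit-serial shift-and-add scheme can be modeled in a few lines. The sketch below is a software analogy under stated assumptions: polynomials are Python integers, the names `clmul`, `reduce_poly` and `digit_serial_mul` are hypothetical, the digits of X are consumed from most to least significant, and the final reduction is a generic bit-serial one rather than the hard-wired reducer.

```python
def clmul(x: int, y: int) -> int:
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        y >>= 1
    return r


def reduce_poly(z: int, M: int, m: int) -> int:
    """Generic reduction of z modulo M (deg M = m), one bit at a time."""
    for bit in range(z.bit_length() - 1, m - 1, -1):
        if (z >> bit) & 1:
            z ^= M << (bit - m)
    return z


def digit_serial_mul(x: int, y: int, M: int, m: int, d: int = 64):
    """Multiply x*y mod M by processing x in d-bit digits.
    Returns the reduced product and the number of digit iterations."""
    n_digits = -(-m // d)                  # ceil(m / d)
    digit_mask = (1 << d) - 1
    z = 0
    for i in reversed(range(n_digits)):    # most significant digit first
        digit = (x >> (i * d)) & digit_mask
        z = (z << d) ^ clmul(y, digit)     # shift, then add partial product
    return reduce_poly(z, M, m), n_digits
```

With d = 64 the loop runs three times for m = 163 and four times for m = 193 and m = 233, which mirrors the iteration counts given in the text.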
We also implemented the LSD multiplier shown in Figure 7. When compared with the shift-and-add multiplier of Figure 6, the LSD multiplier is attractive since it reduces the size of the register used for accumulating the partial results from 2n bits to n + d bits. This is accomplished by shifting the Y operand rather than the product Z and reducing Y every time it is shifted. The implementation cost is an additional reduction circuit. Since the two reduction operations of Y and Z do not take place in the same clock cycle, it is possible to share one reduction circuit. However, considering the additional placement and routing constraints imposed by a shared circuit, two separate circuits are nevertheless preferred. An analysis of our FPGA implementation shows no advantage in terms of size or performance. The size of the multiplier is dominated by the amount of combinational logic resources and, more specifically, the number of look-up tables (LUTs) needed. Thus, there is no advantage in reducing the size of the register holding Z. Note that as the digit size d is reduced, the ratio of registers and LUTs changes; given the fixed ratio of registers and LUTs available on an FPGA device, the LSD multiplier can therefore be attractive for small digit sizes.
As it is our goal to process arbitrary curve types, we can rely on the hard-wired reducers only for the named curves. All other curve types need to be handled in a more general way, for example, with the algorithm presented in Section 5.5. We therefore need a multiplier architecture that either provides a way to reduce by an arbitrary irreducible polynomial or offers the option to calculate a non-reduced product. We opted for the latter option and added a path to bypass the reducer in Figure 6. Note that with the LSD multiplier a non-reduced product cannot be offered, thus requiring full multipliers to replace the reduction circuits.
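The least-significant-digit-first idea can be sketched in the same software model. The function names are hypothetical and the reduction is again a generic bit-serial one; the point of the example is only that Y is shifted and reduced in every step, so the accumulator never grows beyond roughly m + d bits.

```python
def clmul(x: int, y: int) -> int:
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        y >>= 1
    return r


def reduce_poly(z: int, M: int, m: int) -> int:
    """Generic reduction of z modulo M (deg M = m), one bit at a time."""
    for bit in range(z.bit_length() - 1, m - 1, -1):
        if (z >> bit) & 1:
            z ^= M << (bit - m)
    return z


def lsd_mul(x: int, y: int, M: int, m: int, d: int = 64) -> int:
    """Multiply x*y mod M, consuming x from its least significant digit."""
    digit_mask = (1 << d) - 1
    z = 0
    for i in range(-(-m // d)):            # ceil(m / d) digits
        digit = (x >> (i * d)) & digit_mask
        z ^= clmul(y, digit)               # partial product: fewer than m + d bits
        y = reduce_poly(y << d, M, m)      # shift Y by one digit and reduce it
    return reduce_poly(z, M, m)            # final reduction of the accumulator
```

Both reductions appear explicitly here, which corresponds to the additional reduction circuit mentioned in the text.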
Divider
The hardware accelerator implements dedicated circuitry for modular division based on an algorithm described by Shantz [21] . A block diagram of the divider is shown in Figure 8 . It consists of four 256-bit registers A, B, U and V and a fifth register holding the irreducible polynomial M . It can compute division for arbitrary irreducible polynomials M and field degrees up to m = 255.
Initially, A is loaded with the divisor X, B with the irreducible polynomial M, U with the dividend Y, and V with 0. Throughout the division, the following invariants are maintained:

$$A \cdot Y \equiv U \cdot X \bmod M \qquad (6)$$

$$B \cdot Y \equiv V \cdot X \bmod M \qquad (7)$$
Through repeated additions and divisions by t, A and B are gradually reduced to 1 such that U (respectively V) contains the quotient $Y/X \bmod M$. A polynomial is divisible by t if it is even, i.e. the least significant bit of the corresponding bit string is 0. Division by t can be efficiently implemented as a shift right operation. In contrast to the original algorithm, which included magnitude comparisons of registers A and B, we use two counters CA and CB to test for termination of the algorithm. CB is initialized with the field degree m and CA with m − 1.

The division algorithm consists of the operations 1a, 1b, 2a, 2b, 3a and 3b, each guarded by a precondition. The preconditions ensure that for any configuration of A, B, U and V at least one of the operations can be executed. It is interesting to note that operations whose preconditions are satisfied can be executed in any order without violating invariants (6) and (7). The control logic of the divider chooses operations as preconditions permit, starting with 1a and 2a. To ensure termination, 3a is executed if CA > CB and 3b is executed if CA ≤ CB. CA and CB represent the upper bound for the order of A and B. This is due to the fact that the order of A + B is never greater than the order of A if CA > CB and never greater than the order of B if CA ≤ CB. Postconditions of 2a, 2b, 3a and 3b guarantee that either 1a or 1b can be executed to further decrease the order of A and B towards 1.

The division circuit shown in Figure 8 was designed to execute sequences of operations per clock cycle, e.g. 3a, 2a and 1a could be executed in the same cycle. In particular, it is possible to always execute either 1a or 1b once per clock cycle. Therefore, a modular division can be computed in a maximum of 2m clock cycles.
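A software sketch of binary division in $GF(2^m)$ in the spirit of the description above is given below. It is not the divider's control logic: instead of the counters CA and CB it compares A and B directly, as the text attributes to the original algorithm, and each loop pass combines one addition with one division by t. The function names are hypothetical, and `gf2m_mul` is included only so that the result can be checked.

```python
def gf2m_div(y: int, x: int, M: int) -> int:
    """Return y / x mod M for nonzero x, with deg(x), deg(y) < deg(M).
    Loop invariants: A*y == U*x (mod M) and B*y == V*x (mod M)."""
    a, b, u, v = x, M, y, 0

    def halve(w: int) -> int:
        # Divide w by t modulo M: add M first if w is odd, then shift right.
        return (w ^ M) >> 1 if w & 1 else w >> 1

    while a != b:                 # ends with a == b == 1
        if a & 1 == 0:            # A even
            a, u = a >> 1, halve(u)
        elif b & 1 == 0:          # B even
            b, v = b >> 1, halve(v)
        elif a > b:               # both odd, replace the larger one
            a, u = (a ^ b) >> 1, halve(u ^ v)
        else:
            b, v = (a ^ b) >> 1, halve(u ^ v)
    return u


def gf2m_mul(a: int, b: int, M: int, m: int) -> int:
    """Reference shift-and-add multiplication mod M, used for checking."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= M
    return r
```

In $GF(2^4)$ with $M = t^4 + t + 1$, for example, dividing 1 by t yields $t^3 + 1$, since $t \cdot (t^3 + 1) = t^4 + t \equiv 1$.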
Point Multiplication Algorithms
We experimented with different point multiplication algorithms and settled on Montgomery Scalar Multiplication using projective coordinates as proposed by López and Dahab [16]. This choice is motivated by the fact that, for our implementation, multiplications can be executed much faster than divisions. Expensive divisions are avoided by representing affine point coordinates (x, y) as projective triples (X, Y, Z) with $x = X/Z$ and $y = Y/Z$. In addition, this algorithm is attractive since it provides protection against timing and power analysis attacks as each point doubling is paired with a point addition such that the sequence of instructions is independent of the bits in k.
A point multiplication kP can be computed with $\lfloor \log_2 k \rfloor$ point additions and doublings. Throughout the computation, only the X- and Z-coordinates of two points $P_{1,i}$ and $P_{2,i}$ are stored. Montgomery's algorithm exploits the fact that for a fixed point $P = (X, Y, 1)$ and points $P_1 = (X_1, Y_1, Z_1)$ and $P_2 = (X_2, Y_2, Z_2)$ the sum $P_1 + P_2$ can be expressed through only the X- and Z-coordinates of P, $P_1$ and $P_2$ if $P_2 = P_1 + P$. $P_1$ and $P_2$ are initialized with $P_{1,\lfloor \log_2 k \rfloor} = P$ and $P_{2,\lfloor \log_2 k \rfloor} = 2P$. To compute kP, the bits of k are examined from left ($k_{\lfloor \log_2 k \rfloor}$) to right ($k_0$). For $k_i = 0$, $P_{1,i}$ is set to $2P_{1,i+1}$ (8) and $P_{2,i}$ is set to $P_{1,i+1} + P_{2,i+1}$ (9).
Similarly, for $k_i = 1$, $P_{1,i}$ is set to $P_{1,i+1} + P_{2,i+1}$ and $P_{2,i}$ is set to $2P_{2,i+1}$. The Y-coordinate of kP can be retrieved from its X- and Z-coordinates using the curve equation. In projective coordinates, Montgomery Scalar Multiplication requires $6\lfloor \log_2 k \rfloor + 9$ multiplications, $5\lfloor \log_2 k \rfloor + 3$ squarings, $3\lfloor \log_2 k \rfloor + 7$ additions and 1 division.
Named Curves. An implementation of Equations (8) and (9) for named curves over $GF(2^{163})$, $GF(2^{193})$ and $GF(2^{233})$ is shown in Table 2. The computation of the two equations is interleaved such that there are no data dependencies for any MUL/SQR or MUL/ADD instruction sequences. Hence, all MUL/SQR and MUL/ADD sequences can be executed in parallel. Furthermore, there are no data dependencies between subsequent arithmetic instructions, allowing for overlapped execution.
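The control structure of the Montgomery ladder described above can be sketched independently of the curve arithmetic. In the sketch below, `add` and `double` are caller-supplied group operations (hypothetical parameters), and the loop starts at the bit below the leading 1 of k because the initialization $P_1 = P$, $P_2 = 2P$ already accounts for that leading bit; this indexing is an assumption of the example. Plain integers under addition stand in for curve points in the usage lines.

```python
def montgomery_ladder(k: int, P, add, double):
    """Compute kP for k >= 1 while keeping the invariant P2 = P1 + P.
    Exactly one addition and one doubling are performed per scanned bit,
    regardless of the bit's value."""
    p1, p2 = P, double(P)
    for i in reversed(range(k.bit_length() - 1)):   # bits below the leading 1
        if (k >> i) & 1 == 0:
            p1, p2 = double(p1), add(p1, p2)
        else:
            p1, p2 = add(p1, p2), double(p2)
    return p1


# Stand-in group: integers under addition, so kP is ordinary multiplication.
result = montgomery_ladder(13, 5, add=lambda a, b: a + b, double=lambda a: 2 * a)
```

For k = 13 and P = 5 the pair (P1, P2) evolves as (5, 10), (15, 20), (30, 35), (65, 70), so the ladder returns 65.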
Generic Curves. Squaring and multiplication require reduction, which can either be hard-wired or implemented for arbitrary field degrees through an instruction sequence of polynomial multiplications (MULNR) and additions (ADD).

To evaluate the performance of the divider, we implemented an inversion algorithm proposed by Itoh and Tsujii [13] based on Fermat's theorem. With this algorithm, an inversion optimized for $GF(2^{163})$ takes 938 cycles (0.01413 ms), while the divider is almost three times faster, speeding up point multiplication by about 6.4%.

Table 5 shows hardware and software performance numbers for point multiplication on named and generic curves as well as execution times for ECDH and ECDSA with and without hardware support. The hardware numbers were obtained on a 360 MHz Sun Ultra 60 workstation and all software numbers represent a generic 64-bit implementation measured on a 900 MHz Sun Fire 280R server. For generic curves, the execution time for point multiplications depends on the irreducible polynomial as described in Sections 5.5 and 5.2. The obtained numbers assume irreducible polynomials with $k_3 \le \frac{m-1}{2}$. Hard-wired reduction for named curves improves the execution time by a factor of approximately 10 compared to generic curves.
For ECDH-163, the hardware accelerator offers a 12.5-fold improvement in execution time over the software implementation for named curves. Overhead is created by OpenSSL and accesses to the hardware accelerator, leading to a lower speedup than measured for raw point multiplication. A disproportionately larger drop in speedup can be observed for ECDSA-163 since it includes integer operations executed in software. In addition, ECDSA signature verification requires two point multiplications. All numbers were measured using a single process on one CPU. The hardware numbers for ECDH and ECDSA could be improved by having multiple processes share the hardware accelerator such that while one process waits for a point multiplication to finish, another process can use the CPU.
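Inversion through Fermat's theorem, which serves as the baseline for the divider above, rests on the identity $a^{-1} = a^{2^m - 2}$ in $GF(2^m)$. The sketch below evaluates this exponent with a plain square-and-multiply chain; it is not the Itoh-Tsujii addition chain, which reaches the same result with far fewer multiplications. Function names are hypothetical and polynomials are stored as Python integers.

```python
def gf2m_mul(a: int, b: int, M: int, m: int) -> int:
    """Shift-and-add multiplication of a and b modulo M (deg M = m)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= M
    return r


def fermat_inverse(a: int, M: int, m: int) -> int:
    """Return a^(2^m - 2) mod M, the inverse of a nonzero element a.
    Uses 2^m - 2 = 2 + 4 + ... + 2^(m-1): the product of the first
    m - 1 repeated squarings of a."""
    result = 1
    s = a
    for _ in range(m - 1):
        s = gf2m_mul(s, s, M, m)          # s = a^(2^j)
        result = gf2m_mul(result, s, M, m)
    return result
```

This simple chain needs m − 1 squarings and m − 1 multiplications, which illustrates why the cost of squaring matters so much for Fermat-based inversion.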
Conclusions
We have demonstrated a secure client/server system that employs elliptic curve cryptography for the public-key operations in OpenSSL. We have further presented a hybrid hardware accelerator architecture providing optimized performance for named elliptic curves and support for generic curves over arbitrary fields $GF(2^m)$, m ≤ 255. Previous approaches such as those presented in [11] and [20] focused on only one of these aspects.
The biggest performance gain was achieved by optimizing field multiplication. However, as the number of cycles per multiplication decreases, the relative cost of all other operations increases. In particular, squarings can no longer be considered cheap. Data transport delays become more critical and contribute to a large portion of the execution time. To make optimal use of arithmetic units connected through shared data paths, overlapped and parallel execution of instructions can be employed.
For generic curves, reduction has proven to be the most expensive operation. As a result, squarings become almost as expensive as multiplications. This significantly impacts the cost analysis of point multiplication algorithms. In particular, the Itoh-Tsujii method becomes much less attractive since it involves a large number of squaring operations.
Dedicated division circuitry leads to a performance gain over soft-coded inversion algorithms for both named and generic curves. However, the tradeoff between chip area and performance needs to be taken into account.
Although prototyped in reconfigurable logic, the architecture does not make use of reconfigurability. It is thus well-suited for an implementation in ASIC technology. For commercial applications this means lower cost at high volumes, less power consumption, higher clock frequencies and tamper resistance.
As for future work, we are in the process of setting up a testbed that will allow us to empirically study the performance of ECC-based cipher suites and compare it to conventional cipher suites. This includes measurements and analysis of the system performance at the web server level. As for the hardware accelerator, we intend to improve the performance of point multiplication on generic curves. Furthermore, we want to optimize the hardware-software interface to achieve higher performance at the OpenSSL level. We plan to discuss the results in a follow-on publication.
