We describe a cryptographic processor for Elliptic Curve Cryptography (ECC). ECC is evolving as an attractive alternative to other public-key cryptosystems such as the Rivest-Shamir-Adleman algorithm (RSA) by offering the smallest key size and the highest strength per bit. The cryptographic processor performs point multiplication for elliptic curves over binary polynomial fields $GF(2^m)$. In contrast to other designs that only support one curve at a time, our processor is capable of handling arbitrary curves without requiring reconfiguration. More specifically, it can handle both named curves as standardized by the National Institute of Standards and Technology (NIST) and any other generic curves up to a field degree of 255. Efficient support for arbitrary curves is particularly important for the targeted server applications that need to handle requests for secure connections generated by a multitude of heterogeneous client devices. Such requests may specify curves that are infrequently used or not even known at implementation time.
polynomial fields $GF(2^m)$, the latter requiring arithmetic operations that cannot be performed by standard integer or floating-point units.
Dedicated cryptographic hardware is used on server machines to optimize the throughput of secure web-based applications. Servers running security protocols such as SSL or the Internet Protocol Security (IPSec) are confronted with an aggregation of secure connections created by a multitude of heterogeneous clients. Terminating secure connections on the server side not only demands high computational power but also flexibility in responding to client devices that are limited in the set of cryptographic algorithms supported. As clients are often limited in processing power and memory capacity, they may be capable of supporting only a small number of curves. With respect to ECC, a client might support only a single curve. To be able to establish a secure connection and, with it, provide service, a server, in turn, is required to be flexible enough to support any such curve requested by a client. While a server certainly needs to implement the standardized curves and the associated irreducible polynomials, it should further implement any other arbitrary curve and, thus, any arbitrary irreducible polynomial. In the following, we refer to the former as named curves and to the latter as generic curves. There are several reasons why generic curves need to be supported. The standards only recommend curves and, thus, new curves might emerge in the future. Furthermore, curves might be abandoned for security reasons and replaced by different ones that were not known at implementation time.
While previous work has targeted implementations optimized for specific curves, our design has the unique property of providing optimized performance for multiple named curves and support for arbitrary generic curves. In a previous publication [13], we introduced a technique called partial reduction that allows for generic curve-independent implementations of ECC. While our previous publication only described a firmware implementation of this technique, we are now introducing a novel digit-serial multiplier that implements modular multiplication in hardware for both named and generic curves. By adding hardware support for generic curves, we were able to significantly increase performance for generic curves over $GF(2^{163})$ from 1075 point multiplications per second for the firmware implementation to 3308 point multiplications per second for the new design. Compared with 6955 point multiplications per second for named curves over $GF(2^{163})$, the performance penalty for generic curves is now roughly a factor of two, which is low given the complexity of the problem.
The report is structured as follows. Section 2 summarizes related work. Section 3 describes the ECC-enhanced protocol stack that interfaces the cryptographic processor. Section 4 briefly explains ECC point multiplication. In Section 5, we describe the arithmetic underlying ECC and address the problem of modular reduction. The architecture and implementation of the cryptographic processor are presented in Section 6. The program code for the point multiplication is shown in Section 7. Section 8 analyzes the design and gives performance numbers. Finally, Section 9 contains the conclusions.
Related Work
Hardware and firmware implementations of ECC point multiplication over different fields $GF(2^m)$ have been reported in numerous publications. A design for $GF((2^8 - 17)^{17})$, optimized for 8-bit processors, is described by Woodbury et al. in [22]. The implementation targets a SmartCard based on an Intel 8051 microcontroller. Orlando and Paar describe a programmable elliptic curve processor for reconfigurable logic in [19]. Different curves can be handled by parameterizing the hardware architecture and reconfiguring the logic. The prototype performs point multiplication on curves over $GF(2^{167})$. Bednara et al. [2] designed an FPGA-based cryptographic processor architecture that allows for using multiple squarers, adders and multipliers. They researched hybrid coordinate representations in affine, projective, Jacobian and López-Dahab form. Two prototypes were synthesized for $GF(2^{191})$. Agnew et al. [1] built an application-specific integrated circuit implementing ECC point multiplication for $GF(2^{155})$. The chip uses an optimal normal basis multiplier exploiting the composite field property of $GF(2^{155})$. Goodman and Chandrakasan [9] designed a generic public-key processor that executes modular operations on integer and binary polynomial fields. The internal data path can be reconfigured to support different field degrees. Point multiplication over binary polynomial fields is computed by a microcoded double-and-add algorithm. To our knowledge, this is the only implementation that supports $GF(2^m)$ for variable field degrees $m$. However, the architecture is optimized for low power consumption and its performance cannot be scaled to levels required by server-type applications. All other implementations described above target either one or a small number of specific curves. That is, none of them can handle a curve that is not specified at implementation time without requiring the software to be modified or the hardware to be reconfigured.
ECC-enabled Secure Protocol Stack
This section provides the context for the work described in this report by outlining the ECC-enhanced protocol stack that interfaces the cryptographic processor. Figure 1 shows the implemented client/server system. We integrated new cipher suites based on ECC into OpenSSL [18], the most widely used open-source implementation of the Secure Sockets Layer (SSL). More specifically, we added the Elliptic Curve Digital Signature Algorithm (ECDSA), the Elliptic Curve Diffie-Hellman key exchange (ECDH), and means to generate and process X.509 certificates containing ECC keys. We validated our implementation by integrating it with the Apache web server and the open-source web browsers Dillo and Lynx running on a handheld client device under Linux. The cryptographic processor accelerates public-key operations on the server side, where client connections are aggregated. The processor is connected to the host machine through an IO interface based on the Peripheral Component Interconnect (PCI) standard and accessed by a character device driver running under the Solaris Operating Environment.
The SSL and TLS protocols [8] are the most widely deployed and used security protocols on the Internet today. SSL has withstood years of scrutiny by the security community and, in the form of the Secure Hypertext Transfer Protocol (HTTPS),³ is now trusted to secure virtually all sensitive web-based applications ranging from banking to online trading to electronic commerce. SSL offers encryption, source authentication and integrity protection for data exchanged over insecure, public networks. It operates above a reliable transport service such as the Transmission Control Protocol (TCP) and has the flexibility to accommodate different cryptographic algorithms for key agreement, encryption and hashing. However, the specification does recommend particular combinations of these algorithms, called cipher suites, which have well-understood security properties.
The two main components of SSL are the Handshake protocol and the Record Layer protocol. The Handshake protocol allows an SSL client and server to negotiate a common cipher suite, authenticate each other,⁴ and establish a shared master secret using public-key algorithms. The Record Layer derives symmetric keys from the master secret and uses them with faster symmetric-key algorithms for bulk encryption and authentication of application data. Public-key cryptographic operations are the most computationally expensive portion of SSL processing, and speeding them up remains an active area for research and development. Figure 2 shows the general structure of a full SSL handshake. Today, the most commonly used public-key cryptosystem for master-key establishment is RSA, but the IETF is considering an equivalent mechanism based on ECC [5]. In the following paragraphs we briefly describe the SSL handshakes for RSA- and ECC-based cipher suites, respectively.
RSA-based Handshake:
The client and server exchange random nonces⁵ and negotiate a cipher suite with ClientHello and ServerHello messages. The server then sends its signed RSA public key either in the Certificate message or the ServerKeyExchange message. The client verifies the RSA signature, generates a 48-byte random number (the pre-master secret) and sends it encrypted with the server's public key in the ClientKeyExchange message. The server uses its RSA private key to decrypt the pre-master secret. Both end-points then use the pre-master secret to create a master secret, which, along with previously exchanged nonces, is used to derive the cipher keys, initialization vectors and Message Authentication Code (MAC) keys for bulk encryption by the Record Layer. The server can optionally request client authentication by sending a CertificateRequest message listing acceptable certificate types and certificate authorities. In response, the client sends its public key in the Certificate message and proves possession of the corresponding private key by including a digital signature in the CertificateVerify message.
³ HTTPS is HTTP over an SSL-secured connection.
⁴ Client authentication is optional. Only the server is typically authenticated at the SSL layer and client authentication is achieved at the application layer, e.g., through the use of passwords sent over an SSL-protected channel. However, some deployment scenarios do require stronger client authentication through certificates.
⁵ Nonces are single-use random numbers to guard against replay attacks.
ECC-based Handshake:
The processing of the first two messages is the same as for RSA but the Certificate message contains the server's Elliptic Curve Diffie-Hellman (ECDH) public key signed with the Elliptic Curve Digital Signature Algorithm (ECDSA). After validating the ECDSA signature, the client conveys its ECDH public key in the ClientKeyExchange message. Next, each entity uses its own ECDH private key and the other's public key to perform an ECDH operation and arrive at a shared pre-master secret. The derivation of the master secret and symmetric keys is unchanged compared to RSA. Client authentication is still optional and the actual message exchange depends on the type of certificate a client possesses.
Point Multiplication
The fundamental and most expensive operation underlying ECC is point multiplication, which is defined in terms of finite field operations. The point multiplication $kP$ of an integer $k$ and a point $P$ on an elliptic curve $C: y^2 + xy = x^3 + ax^2 + b$; $x, y \in GF(2^m)$ with curve parameters $a, b \in GF(2^m)$ over a binary polynomial field $GF(2^m)$ can be decomposed into point additions and point doublings. For example, $9P$ can be computed with one point addition and three point doublings since $9P = P + 2 \cdot 2 \cdot 2P$.
Various algorithms have been proposed to efficiently compute point multiplications. We experimented with different point multiplication algorithms and settled on Montgomery's point multiplication algorithm using projective coordinates as proposed by López and Dahab [17]. Affine point coordinates $(x, y)$ are represented as projective triples $(X, Y, Z)$ with $x = X/Z$ and $y = Y/Z$ to avoid expensive divisions. Montgomery's algorithm exploits the fact that for a fixed point $P = (x, y) = (X, Y, 1)$ and points $P_1 = (X_1, Y_1, Z_1)$ and $P_2 = (X_2, Y_2, Z_2)$, $2P_1$, $2P_2$ and $P_1 + P_2$ can be expressed through only the X- and Z-coordinates of $P$, $P_1$ and $P_2$ if $P_2 = P_1 + P$. A point multiplication $kP$ can be computed with $\lfloor \log_2 k \rfloor$ point additions and doublings. The computation is done as follows. The bits of the binary representation of $k$ are examined from left ($k_{\lfloor \log_2 k \rfloor}$) to right ($k_0$). For the first non-zero bit of $k$, $P_1$ and $P_2$ are initialized with $P_{1,\lfloor \log_2 k \rfloor} = P$ and $P_{2,\lfloor \log_2 k \rfloor} = 2P$.
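For all following bits of $k$, with $k_i = 0$, $P_{1,i}$ is set to $2P_{1,i+1}$ as given by (1) and $P_{2,i}$ is set to $P_{1,i+1} + P_{2,i+1}$ as given by (2). Following López and Dahab [17], the X- and Z-coordinates are

$$X_{1,i} = X_{1,i+1}^4 + b Z_{1,i+1}^4, \qquad Z_{1,i} = Z_{1,i+1}^2 \cdot X_{1,i+1}^2 \tag{1}$$

$$X_{2,i} = x Z_{2,i} + (X_{1,i+1} Z_{2,i+1})(X_{2,i+1} Z_{1,i+1}), \qquad Z_{2,i} = (X_{1,i+1} Z_{2,i+1} + X_{2,i+1} Z_{1,i+1})^2 \tag{2}$$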
Similarly, for $k_i = 1$, $P_{1,i}$ is set to $P_{1,i+1} + P_{2,i+1}$ and $P_{2,i}$ is set to $2P_{2,i+1}$. The Y-coordinate of $kP$ can be retrieved from its X- and Z-coordinates using the curve equation. The result $kP = (x_{kP}, y_{kP})$ in affine coordinates is given by
Using projective coordinates, Montgomery point multiplication requires $6\lfloor \log_2 k \rfloor + 9$ multiplications, $5\lfloor \log_2 k \rfloor + 3$ squarings, $3\lfloor \log_2 k \rfloor + 7$ additions and 1 division.
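The ladder described in this section can be sketched in Python as follows. This is a minimal illustration, not the processor's microcode: the toy field $GF(2^7)$ with $M = t^7 + t + 1$, the curve parameters $a = b = 1$, and all function names are assumptions chosen so that the example runs quickly. The doubling and addition steps use the standard X/Z-only formulas of López and Dahab; recovery of the Y-coordinate is omitted, and an affine double-and-add routine is included only as a cross-check.

```python
# Toy parameters (assumptions for illustration, not taken from the report).
M_DEG = 7
M_POLY = 0b10000011          # t^7 + t + 1, irreducible over GF(2)
A, B = 1, 1                  # curve y^2 + xy = x^3 + A*x^2 + B

def fmul(a, b):
    """Field multiplication: shift-and-add with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M_DEG:       # degree reached m: add (XOR) M
            a ^= M_POLY
    return r

def finv(a):
    """Inverse via Fermat's little theorem: a^(2^m - 2)."""
    r = 1
    for _ in range(M_DEG - 1):
        a = fmul(a, a)
        r = fmul(r, a)
    return r

def mdouble(X, Z):
    """X' = X^4 + b*Z^4, Z' = X^2 * Z^2."""
    X2, Z2 = fmul(X, X), fmul(Z, Z)
    return fmul(X2, X2) ^ fmul(B, fmul(Z2, Z2)), fmul(X2, Z2)

def madd(X1, Z1, X2, Z2, x):
    """Z3 = (X1*Z2 + X2*Z1)^2, X3 = x*Z3 + (X1*Z2)*(X2*Z1)."""
    t1, t2 = fmul(X1, Z2), fmul(X2, Z1)
    Z3 = fmul(t1 ^ t2, t1 ^ t2)
    return fmul(x, Z3) ^ fmul(t1, t2), Z3

def ladder_x(k, x):
    """x-coordinate of kP for k >= 1, or None if kP is the point at infinity."""
    X1, Z1 = x, 1                          # P1 = P
    X2, Z2 = mdouble(x, 1)                 # P2 = 2P
    for i in reversed(range(k.bit_length() - 1)):
        if (k >> i) & 1:
            X1, Z1 = madd(X1, Z1, X2, Z2, x)
            X2, Z2 = mdouble(X2, Z2)
        else:
            X2, Z2 = madd(X1, Z1, X2, Z2, x)
            X1, Z1 = mdouble(X1, Z1)
    return None if Z1 == 0 else fmul(X1, finv(Z1))

# Affine reference arithmetic, used only to cross-check the ladder.
def affine_add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:            # Q = -P, or P has order two
            return None
        lam = x1 ^ fmul(y1, finv(x1))
        x3 = fmul(lam, lam) ^ lam ^ A
        return x3, fmul(x1, x1) ^ fmul(lam ^ 1, x3)
    lam = fmul(y1 ^ y2, finv(x1 ^ x2))
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    return x3, fmul(lam, x1 ^ x3) ^ x3 ^ y1

def affine_mul(k, P):
    R = None
    for bit in bin(k)[2:]:
        R = affine_add(R, R)
        if bit == "1":
            R = affine_add(R, P)
    return R

def find_point():
    """Brute-force search for a curve point with x != 0."""
    for x in range(1, 1 << M_DEG):
        rhs = fmul(fmul(x, x), x ^ A) ^ B  # x^3 + A*x^2 + B
        for y in range(1 << M_DEG):
            if fmul(y, y ^ x) == rhs:      # y^2 + x*y
                return x, y
    raise ValueError("no point found")

P = find_point()
```

The loop in `ladder_x` runs once per bit below the leading one, which matches the $\lfloor \log_2 k \rfloor$ additions and doublings stated above; a single inversion is needed at the end to convert back to affine form.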
ECC Arithmetic in $GF(2^m)$
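ECC over finite fields is based on modular addition, subtraction, multiplication, squaring and division. In this report, we will focus on binary polynomial fields $GF(2^m)$. Using a polynomial basis, an element $a \in GF(2^m)$ is represented as a polynomial $a = a_{m-1} t^{m-1} + \dots + a_1 t + a_0$ with coefficients $a_i \in GF(2)$, which can be stored as the bit string $(a_{m-1} \dots a_1 a_0)$.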
The addition of two elements $a, b \in GF(2^m)$ is defined as the sum of the two polynomials obtained by adding the coefficients $a_i$ and $b_i$, which corresponds to a bitwise XOR operation. For example, the polynomial addition $(t^4 + t^2 + 1) + (t^3 + t^2 + t) = t^4 + t^3 + t + 1$ can be computed as 10101 xor 1110 = 11011. Since every element of $GF(2^m)$ is identical to its additive inverse, subtraction is identical to addition.
Multiplication of two elements $a, b \in GF(2^m)$ is carried out in two steps. First, the operands are multiplied using polynomial multiplication resulting in $c_0 = a \cdot b$. The degree of $c_0$ is less than $2m - 1$, i.e., $\deg(c_0) < 2m - 1$. Second, $c_0$ is reduced by an irreducible polynomial $M$ of degree $m$ to $c \equiv c_0 \bmod M$ with $\deg(c) < m$ (see Figure 4). An illustrative way to look at reduction is that $M$ is aligned with the most significant bit of the operand and added until the degree of the result is smaller than $m$. Polynomial multiplication can be efficiently implemented using well-known techniques such as the Ofman-Karatsuba method [16]. Field multiplication, i.e., polynomial multiplication combined with reduction, can be implemented using techniques such as the least significant digit (LSD) first or most significant digit (MSD) first multiplication method [20]. Implementations of reduction will be discussed in detail in Sections 5.1 and 5.2.
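The addition and the two-step multiplication described above can be sketched in a few lines of Python. Polynomials are encoded as integers whose bit $i$ is the coefficient of $t^i$; the function names and the toy field $GF(2^4)$ with $M = t^4 + t + 1$ used in the usage lines are assumptions for illustration only.

```python
def gf2_add(a, b):
    """Addition (and subtraction) in GF(2^m): coefficient-wise XOR."""
    return a ^ b

def poly_mul(a, b):
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def poly_mod(c, M):
    """Reduce c modulo M by aligning M with the top bit of c and adding it."""
    m = M.bit_length() - 1               # degree of M
    while c.bit_length() > m:            # deg(c) >= m
        c ^= M << (c.bit_length() - 1 - m)
    return c

def field_mul(a, b, M):
    """Field multiplication: polynomial multiplication followed by reduction."""
    return poly_mod(poly_mul(a, b), M)

# Example from the text: (t^4 + t^2 + 1) + (t^3 + t^2 + t)
example_sum = gf2_add(0b10101, 0b1110)

# Toy field GF(2^4), M = t^4 + t + 1 (assumption): t * t^3 = t^4 = t + 1
M4 = 0b10011
example_product = field_mul(0b0010, 0b1000, M4)
```

The bit-at-a-time loops are meant to mirror the textbook definition; a hardware multiplier processes several bits per cycle instead.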
Division $a/b$ with $a, b \in GF(2^m)$ is defined as a multiplication of the dividend $a$ with the multiplicative inverse of the divisor $b$. Algorithms for finding the inverse element include the extended Euclidean algorithm [3] and methods employing Fermat's little theorem $a^{p-1} \equiv 1 \bmod p$ for $GF(2^m)$, where every non-zero element satisfies $a^{2^m - 1} = 1$. A method for efficiently implementing division was proposed by Chang-Shantz [7].
Reduction
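Field multiplication and squaring operations require reduction by an irreducible polynomial $M$. Rather than computing a full polynomial division, reduction can be done by executing a sequence of polynomial multiplications and additions based on the congruency

$$u \equiv u + vM \bmod M \tag{4}$$

for an irreducible polynomial $M$ and arbitrary polynomials $u$ and $v$ over $GF(2)$. Reduction of a multiplication result $c_0 = a \cdot b$, $a, b \in GF(2^m)$, to $c \equiv c_0 \bmod M$ with $\deg(c) < m$ can be computed iteratively as follows. Since the degree of $c_0$ is less than $2m - 1$, $c_0$ can be split up into two polynomials $c_{0,h}$ and $c_{0,l}$ with $\deg(c_{0,h}) < m - 1$ and $\deg(c_{0,l}) < m$ such that

$$c_0 = c_{0,h} \cdot t^m + c_{0,l} \tag{5}$$

Subsequent polynomials $c_{j+1}$ can be computed iteratively by setting

$$c_{j+1} = c_{j,h} \cdot (M - t^m) + c_{j,l} = c_{j+1,h} \cdot t^m + c_{j+1,l} \quad \text{until } c_{j,h} = 0 \tag{6}$$

Using $t^m \equiv M - t^m \bmod M$ as a special case of (4), it follows that $c_{j+1} \equiv c_j \equiv c_0 \bmod M$. The reduced result $c = c_i$, $\deg(c) < m$ can be computed in a maximum of $i \le m - 1$ reduction iterations. The minimum number of required iterations depends on the second highest term $t^k$ of the irreducible polynomial $M$ [20, 14]. For

$$M = t^m + t^k + \sum_{j=1}^{k-1} M_j t^j + 1, \quad 1 \le k < m \tag{7}$$

it follows that $\deg(c_{j+1})$ gradually decreases such that

$$\deg(c_{j+1,h}) \le \deg(c_{j,h}) + k - m \tag{8}$$

The minimum number of iterations $i$ is given by

$$i = \left\lceil \frac{m-1}{m-k} \right\rceil \tag{9}$$

That is, $i$ is the number of iterations that need to be executed if the test $c_{j,h} = 0$ is considered too costly.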
Choosing $M$ such that $k \le (m+1)/2$ limits the number of reduction iterations to two, since then $\lceil (m-1)/(m-k) \rceil \le 2$. This is the case for all irreducible polynomials recommended by NIST [21] and SECG [6].
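The iterative reduction of this section can be sketched as follows. The sketch assumes the integer encoding of polynomials (bit $i$ holds the coefficient of $t^i$) and uses the degree-163 pentanomial $t^{163} + t^7 + t^6 + t^3 + 1$ from the NIST recommendations as an example; the function names and the operand values are hypothetical. A bit-serial `poly_mod` is included only as a reference to compare against.

```python
def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def poly_mod(c, M):
    """Reference reduction: bit-serial polynomial division remainder."""
    m = M.bit_length() - 1
    while c.bit_length() > m:
        c ^= M << (c.bit_length() - 1 - m)
    return c

def reduce_iterative(c0, M):
    """Repeat c <- c_h * (M - t^m) + c_l until c_h = 0; return (c, iterations)."""
    m = M.bit_length() - 1
    m_low = M ^ (1 << m)                 # M - t^m (same as M + t^m over GF(2))
    low_mask = (1 << m) - 1
    c, iterations = c0, 0
    while c >> m:                        # c_h != 0
        c = clmul(c >> m, m_low) ^ (c & low_mask)
        iterations += 1
    return c, iterations

def iteration_bound(m, k):
    """ceil((m - 1) / (m - k)), the iteration count derived above."""
    return -(-(m - 1) // (m - k))

# Example: m = 163, second highest term t^7, so two iterations suffice.
M163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
a = (1 << 162) | 0x123456789ABCDEF      # arbitrary operands of degree 162
b = (1 << 162) | (1 << 161) | 0xFEDCBA987
c0 = clmul(a, b)
c, used = reduce_iterative(c0, M163)
```

Because `reduce_iterative` tests $c_{j,h} = 0$ after every step, the number of iterations it reports can be lower than the bound but never higher.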
Partial Reduction
Polynomials $c \in GF(2^m)$ can be represented in reduced canonical form, i.e., $\deg(c) < m$, or in non-reduced form with $\deg(c) \ge m$. Using polynomials in both reduced and non-reduced form is the idea underlying partial reduction. For a chosen integer $n \ge m$, we define a polynomial $c \in GF(2^m)$ to be in partially-reduced representation if $\deg(c) < n$. For hardware implementations, $n$ could, for example, be the maximum operand size of a multiplier. All computations for a point multiplication in $GF(2^m)$ can be executed on polynomials in partially-reduced representation. Reduction of the results to canonical form only needs to be done in a last step.
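For a multiplication $c_0 = a \cdot b$ with $a, b \in GF(2^m)$, $\deg(a) < n$, $\deg(b) < n$, $c_0$ can be partially reduced to $c \equiv c_0 \bmod M$, $\deg(c) < n$ as follows. For an integer $n \ge m$, $c_0$ can be split up into two polynomials $c_{0,h}$ and $c_{0,l}$ with $\deg(c_{0,h}) < n - 1$ and $\deg(c_{0,l}) < n$. Subsequent polynomials $c_{j+1}$ can be computed iteratively by setting

$$c_{j+1} = c_{j,h} \cdot t^{n-m} \cdot (M - t^m) + c_{j,l} = c_{j+1,h} \cdot t^n + c_{j+1,l} \quad \text{until } c_{j,h} = 0$$

The result $c = c_i$, $\deg(c) < n$ can be computed in at most $i \le n - 1$ reduction steps. Given $M$ as defined in (7), the minimum number of iterations $i$ is given by

$$i = \left\lceil \frac{n-1}{m-k} \right\rceil \tag{10}$$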
A more detailed description of partial reduction can be found in [13] .
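A minimal Python sketch of partial reduction, under the same integer encoding of polynomials, is given below. It folds the bits above position $n$ back using $t^n \equiv (M - t^m) \cdot t^{n-m} \bmod M$, so the result has degree below $n$ but is not necessarily canonical. The function names, the choice $n = 256$, and the operand values are assumptions for illustration; the polynomial is the degree-163 pentanomial from the NIST recommendations.

```python
def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def poly_mod(c, M):
    """Full reduction to canonical form (reference, bit-serial)."""
    m = M.bit_length() - 1
    while c.bit_length() > m:
        c ^= M << (c.bit_length() - 1 - m)
    return c

def partial_reduce(c0, M, n):
    """Reduce c0 to a congruent polynomial of degree < n (n >= deg M)."""
    m = M.bit_length() - 1
    m_prime = (M ^ (1 << m)) << (n - m)  # (M - t^m) * t^(n-m), congruent to t^n
    low_mask = (1 << n) - 1
    c = c0
    while c >> n:                        # c_h != 0
        c = clmul(c >> n, m_prime) ^ (c & low_mask)
    return c

M163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
N = 256
# Operands in partially-reduced form: degree < n, not necessarily < m.
a = (1 << 255) | 0x1234567
b = (1 << 254) | 0xABCDEF
c0 = clmul(a, b)
c = partial_reduce(c0, M163, N)
canonical = poly_mod(c, M163)            # final step to canonical form
```

Intermediate results such as `c` can be fed straight back into further multiplications; only the final result needs the extra step that produces `canonical`.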
ECC Processor Architecture
We chose a microprogrammable architecture for the cryptographic processor. The microprogram is stored in static memory and uploaded by the host at initialization time. Although the functionality of the cryptographic processor is fixed, controlling program execution by a microprogram rather than hardwired control logic provided an ideal platform for experimenting with different point multiplication algorithms.
Data Path
We decided on a bus structure for the data path to keep the design as flexible as possible. This design decision proved to be valuable as it allowed us to easily change the function units without affecting the communication infrastructure. Figure 5 shows the data path and the control unit. Dual-ported instruction and data memories called IMEM and DMEM, respectively, connect the cryptographic processor with the PCI bus of the host system. The data path is n = 256 bits wide, that is, the busses, the register file and memories are 256 bits wide and the function units operate on 256-bit operands.
The data memory DMEM, the registers and the function units are connected by the busses SBUS and DBUS. The data memory DMEM has a capacity of 8 kBytes and stores parameters and variables. The register file contains eight general purpose registers R0-R7, a register RM that holds the irreducible polynomial, and a register RC that specifies the field degree and the type of curve; more specifically, it specifies whether a named curve or a generic curve is processed.
Instruction Set
The cryptographic processor implements a load/store architecture. That is, memory can be accessed by load and store operations only, and all arithmetic instructions are limited to register operands. Instructions fall into three categories: memory instructions, arithmetic instructions, and control instructions. All instructions have a fixed length of 16 bits. Figure 6 gives the instruction formats and Table 1 contains the complete instruction set. The memory instructions include LD and ST instructions. Since memory is mainly used for passing parameters between the cryptographic processor and the host, only an absolute addressing mode is supported. The arithmetic instructions include DIV, MUL, ADD, SQR, and SL. DIV, MUL, and SQR implement modular reduction, whereby implementations differ significantly as will be shown later. The control instructions contain the three conditional branch instructions BMZ, BEQ, and BNC, the unconditional branch instruction JMP, the instruction NOP and the instruction END that terminates program execution. 
Control Unit
The control unit consists of the instruction memory IMEM that has a capacity of 1 kByte or 512 instructions and a finite state machine (FSM) that controls the data path according to the instructions fetched. The FSM uses a handshake protocol to coordinate with the function units. Handshake signals are pipelined to not delay instruction execution. This protocol allows for optimizing instruction execution times in that function units only take as many cycles as needed. Execution times vary for the execution of MUL and DIV instructions. For the multiplier, the cycle count varies with the field degree m, and for the divider, the cycle count depends on both the field degree m and the values of the operands.
Program execution times are further optimized by overlapping instruction execution and executing instructions in parallel. The control unit overlaps the execution of arithmetic instructions by prefetching the instruction as well as preloading the first source operand. This is illustrated in Figure 7. While instruction $I_0$ is being executed, the next instruction $I_1$ is prefetched, and register RS0 of $I_1$ is transferred over the SBUS from the register file to an arithmetic unit. Since RS0 of $I_1$ is loaded at the same time as RD of $I_0$ is stored, there must not be a data dependency between RS0 of $I_1$ and RD of $I_0$. Dependencies are detected by the assembler and are considered programming errors. Often, these dependencies can be resolved by swapping RS0 and RS1 of $I_1$. However, if $I_0$ is followed by SQR, SL, ST, or DIV, such a dependency cannot be removed as suggested and a NOP instruction needs to be inserted after $I_0$.

Parallel execution of instructions is implemented for the instruction sequence $I_0; I_1$ if $I_0$ is a MUL instruction and $I_1$ is either an ADD or SQR instruction and there are no data dependencies. The choice of these particular instructions is motivated by an analysis of the program code for point multiplication. As shown in Table 6, the MUL instruction is the most frequently executed instruction and, in many instances, can be executed in parallel with either an ADD or an SQR instruction. Figure 8 illustrates the timing: $I_1$ is executed in parallel to $I_0$, and $I_2$ is prefetched while $I_0$ and $I_1$ are being executed. The following data dependencies need to be considered: $I_0$ and $I_1$ can be executed in parallel if both RS0 and RS1 of $I_1$ do not depend on RD of $I_0$; the execution of $I_2$ can be overlapped with the execution of $I_0$ and $I_1$ if RS0 of $I_2$ does not depend on RD of $I_0$.
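The two dependency rules stated above can be written down as a small Python sketch. The `Instr` record, the function names, and the register assignments in the examples are hypothetical; only the rules themselves (RS0 of $I_1$ must not equal RD of $I_0$ for overlap, and neither source of an ADD or SQR may equal RD of a preceding MUL for parallel execution) are taken from the text.

```python
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    """Hypothetical instruction record: opcode, destination, two sources."""
    op: str
    rd: Optional[str]
    rs0: Optional[str]
    rs1: Optional[str] = None

def can_overlap(i0: Instr, i1: Instr) -> bool:
    """RS0 of I1 is preloaded while RD of I0 is written back."""
    return i0.rd is None or i1.rs0 != i0.rd

def can_run_parallel(i0: Instr, i1: Instr) -> bool:
    """MUL followed by ADD or SQR, with no source of I1 depending on RD of I0."""
    return (i0.op == "MUL"
            and i1.op in ("ADD", "SQR")
            and i0.rd not in (i1.rs0, i1.rs1))

mul = Instr("MUL", "R2", "R0", "R1")
add_independent = Instr("ADD", "R5", "R3", "R4")
add_dependent = Instr("ADD", "R5", "R2", "R4")      # RS0 depends on RD of mul
add_swapped = Instr("ADD", "R5", "R4", "R2")        # operands swapped
sqr_dependent = Instr("SQR", "R3", "R2")            # single operand, needs a NOP
```

In this sketch `add_swapped` passes the overlap check but still cannot run in parallel with `mul`, while `sqr_dependent` fails both checks because a single-operand instruction has no second source to swap with.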
Arithmetic and Logic Unit
The ALU shown in Figure 9 implements the two arithmetic instructions ADD and SQR and the logic instruction SL. ADD translates into a bit-wise XOR of the two source operands. SQR requires the insertion of zeroes between the bits of the source operand and the subsequent reduction of the expanded source operand. A hardwired reduction circuit is used that can only handle named curves. SL shifts the source operand to the left by one bit. The ALU further sets the condition codes EQ and MZ. EQ is set to 1 if the result of an ADD, SQR, or SL instruction is zero, and to 0 otherwise. MZ is set to the most significant bit shifted out of the source operand by the SL instruction.
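The zero-insertion step of SQR can be illustrated with a short Python sketch: squaring a polynomial over $GF(2)$ moves the coefficient of $t^i$ to $t^{2i}$, which is the same as interleaving zeroes between the bits of the operand. The function names are hypothetical, the subsequent reduction is left out, and a generic carry-less multiplication is included only to cross-check the result.

```python
def spread_bits(a):
    """Unreduced square over GF(2): bit i of a moves to bit 2*i."""
    r, i = 0, 0
    while a:
        if a & 1:
            r |= 1 << (2 * i)
        a >>= 1
        i += 1
    return r

def clmul(a, b):
    """Carry-less (polynomial) multiplication, used as a cross-check."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

# (t^3 + t + 1)^2 = t^6 + t^2 + 1
square_example = spread_bits(0b1011)
```

Since the expansion involves no logic beyond wiring, the cost of a squaring is dominated by the reduction that follows it.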
Multiplier
The multiplier constitutes the core of the data path. As the performance analysis contained in Section 8 shows, more than half the number of cycles required to process a point multiplication are spent in the multiplier. For this reason, we optimized its performance as much as possible and spent a significant part of the chip resources on it.
We have implemented a number of digit-serial modular multiplier designs based on algorithms described by Song and Parhi in [20]. Our first design described in [12] used an LSD first multiplier that could perform modular multiplication in hardware for named curves only. In addition, the multiplier could generate an unreduced product so that reduction for generic curves could be performed by microcode. Generic curves were implemented in microcode and processed at a tenth of the throughput achieved for named curves. Here, we describe a novel multiplier design that performs modular multiplication for both types of curves in hardware, thereby significantly improving the performance for generic curves. The new design is based on an MSD first multiplier. We also considered the LSD first multiplier but, as we will explain later, found that pipelining for the MSD multiplier can be done more efficiently.
We will first describe an MSD first multiplier that works for named curves only, before we describe the final design that can handle both named and generic curves.

We further developed a generic MSD first multiplier, shown in Figure 11, that can handle both named and generic curves. It uses a hardwired reducer for named curves and reuses the multiplier circuit to perform reduction for generic curves. Reduction for generic curves is based on the partial reduction algorithm explained in Section 5.2. This technique reduces to the data path width $n$ rather than to the field size $m$, with $m \le n$. This avoids costly shift and mask operations to extract field-sized operands smaller than the data path width. In comparison with Montgomery modular multiplication, our scheme uses fewer multiplications; this is particularly true when a large multiplier is used.

There is one partial product generator that is alternately used to perform a multiplication step and a reduction step. Rather than strictly interleaving these two steps, the computation begins with executing two multiplication steps before the first reduction step is executed. That is, $P$ and $Z$ are computed in the order $\{P_0, P_1, Z_0, P_2, Z_1, \dots\}$ such that $P_i$ is only needed two cycles later when $Z_{i+1}$ is calculated.

Table 2 shows the state diagram for the generic MSD first multiplier of Figure 11. Separate control flows are given for named and generic curves. The state diagram for named curves looks as follows: The source operands are loaded from the SBUS in states S0 and S1; the partial products are computed in states S2, S3, S4 and S5, where S3, S4 and S5 also accumulate and reduce the partial results; S6 performs a final accumulation and reduction; finally, the result is transferred over the DBUS into the register file in state S7 (not shown). For smaller field degrees, state S4 is skipped. Looking at generic curves, the state diagram is specified as follows:
The source operands are loaded from the SBUS in states S0 and S1; the partial products are computed in states S2, S3, S5 and S7; the reduction of the accumulated multiplication results happens in states S4, S6, S8 and S9; S10 performs a final accumulation and reduction; finally, the result is transferred over the DBUS into the register file in state S11 (not shown). Since the multiplier is alternately used for a multiplication step and a reduction step, register X alternately supplies the MSD of $x$ and the MSD of the accumulated result, and register Y alternately supplies $y$ and $M'$, where $M' = (M - t^m) \cdot t^{n-m}$. The state machine for generic curves is again optimized such that states are skipped for smaller field degrees: states S5 and S6 are skipped for $m \le 192$.

We also implemented a similar multiplier using the LSD first method. Comparing the two implementations, we found that the MSD multiplier can be pipelined more efficiently, saving one state for generic curves. The reason is that dependencies between partial products and reduction results are less stringent for the MSD first multiplier.

Table 3 gives the cycle counts for the generic MSD first multiplier. The cycle counts include the time needed to load and store the operands. It takes seven cycles to perform a modular multiplication for named curves over fields GF (2 m
n-bit register. This requirement is equivalent to the partial reduction being executable in a single iteration. Employing Equations (10) and (8) and given a partial product generator that multiplies $d \times n$ bits, the number of reduction iterations $i$ is

$$i = \left\lceil \frac{d}{m-k} \right\rceil$$

For limiting partial reduction to a single iteration it follows that $d \le m - k$. For $d = 64$ this limits irreducible polynomials $M$ to those with $m - k \ge 64$. All polynomials recommended by NIST and SECG satisfy this condition. Polynomials with $m - k < 64$ could be accommodated by allowing for multiple reduction iterations. This, however, would significantly reduce the performance of the multiplier.
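The MSD first principle can be sketched in Python as a Horner-style loop over the digits of one operand, starting with the most significant digit. This is a behavioral illustration only: the function names are hypothetical, a generic bit-serial `poly_mod` stands in for the hardwired reducer, and the interleaved partial reduction and pipelining of the actual multiplier are not modeled. The data path width $n = 256$ and digit size $d = 64$ follow the values given in the text.

```python
def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def poly_mod(c, M):
    """Bit-serial reduction; stands in for the hardwired reducer."""
    m = M.bit_length() - 1
    while c.bit_length() > m:
        c ^= M << (c.bit_length() - 1 - m)
    return c

def msd_first_mul(x, y, M, n=256, d=64):
    """Compute x*y mod M, consuming x in d-bit digits, most significant first."""
    digit_mask = (1 << d) - 1
    z = 0
    for i in reversed(range(n // d)):
        digit = (x >> (i * d)) & digit_mask    # next digit of x
        partial = clmul(digit, y)              # d x n partial product
        z = poly_mod((z << d) ^ partial, M)    # shift, accumulate, reduce
    return z

def reduction_iterations(d, m, k):
    """ceil(d / (m - k)): iterations needed to fold d excess bits."""
    return -(-d // (m - k))

M163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
x = (1 << 162) | 0x0123456789ABCDEF
y = (1 << 160) | 0xFEDCBA9876543210
z = msd_first_mul(x, y, M163)
```

With $n = 256$ and $d = 64$ the loop generates four partial products, which corresponds to the four partial-product states of the state diagram; for $m = 163$ and $k = 7$, `reduction_iterations(64, 163, 7)` evaluates to one.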
Divider
The cryptographic processor implements a modular divider based on an algorithm described by Chang-Shantz [7] that has similarities to Euclid's greatest common divisor (GCD) algorithm. The divider consists of four 256-bit registers -A, B, U and V -and a fifth register holding the irreducible polynomial M . It can compute division for arbitrary irreducible polynomials M and field degrees up to m = 255.
Initially, $A$ is loaded with the divisor $X$, $B$ with the irreducible polynomial $M$, $U$ with the dividend $Y$, and $V$ with 0. Throughout the division, the following invariants are maintained:

$$A \cdot Y \equiv U \cdot X \bmod M, \qquad B \cdot Y \equiv V \cdot X \bmod M$$
Through repeated additions and divisions by $t$, $A$ and $B$ are gradually reduced to 1 such that $U$ (respectively $V$) contains the quotient $Y/X \bmod M$. One should note that a polynomial is divisible by $t$ if it is even, i.e., the least significant bit of the corresponding bit string is 0. Division by $t$ can be efficiently implemented as a shift right operation. We use two counters CA and CB to test for termination of the algorithm. For named curves, CB is initialized with the field degree $m$ and CA with $m - 1$. For generic curves, CB is initialized with the register size $n$ and CA with $n - 1$. CA and CB represent upper bounds for the degrees of $A$ and $B$. This is due to the fact that the degree of $A + B$ is never greater than the degree of $A$ if CA > CB and never greater than the degree of $B$ if CA ≤ CB.

A modular division can be computed in a maximum of $2m$ clock cycles for named curves and in a maximum of $2n$ clock cycles for generic curves. The corresponding block diagram of the hardware implementation of the divider is shown in Figure 12.
Note that the divider fully reduces the result to the field degree. In particular, divisions by 1 can be used to reduce a polynomial of degree less than n to a polynomial of degree less than m. 
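The division scheme can be sketched in Python as a binary GCD-style loop that maintains the two invariants $A \cdot Y \equiv U \cdot X$ and $B \cdot Y \equiv V \cdot X \bmod M$. This is a simplified illustration under stated assumptions: operands are taken to be in canonical form, termination is detected by comparing $A$ and $B$ directly instead of using the counters CA and CB, and the function names and the toy field $GF(2^4)$ in the usage lines are hypothetical.

```python
def field_mul(a, b, M):
    """Reference field multiplication (shift-and-add, then reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    m = M.bit_length() - 1
    while r.bit_length() > m:
        r ^= M << (r.bit_length() - 1 - m)
    return r

def gf2m_divide(y, x, M):
    """Return y / x mod M for 0 < x with deg(x), deg(y) < deg(M)."""
    if x == 0:
        raise ZeroDivisionError("division by zero in GF(2^m)")

    def div_by_t(u):
        # u / t mod M: shift right, adding M first if u is odd
        return u >> 1 if u & 1 == 0 else (u ^ M) >> 1

    A, B, U, V = x, M, y, 0
    while A != B:                       # both reach 1 since gcd(x, M) = 1
        if A & 1 == 0:                  # A divisible by t
            A, U = A >> 1, div_by_t(U)
        elif B & 1 == 0:                # B divisible by t
            B, V = B >> 1, div_by_t(V)
        elif A > B:                     # both odd: A + B is divisible by t
            A, U = (A ^ B) >> 1, div_by_t(U ^ V)
        else:
            B, V = (A ^ B) >> 1, div_by_t(U ^ V)
    return U

M4 = 0b10011                            # toy field GF(2^4), t^4 + t + 1
q = gf2m_divide(0b0110, 0b1011, M4)     # (t^2 + t) / (t^3 + t + 1)
```

Each pass through the loop lowers the combined degree of $A$ and $B$ by at least one, which is the same argument that bounds the hardware divider's cycle count.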
Implementation
We prototyped the cryptographic processor in a Xilinx Virtex-II XCV2000E-7 FPGA. The floorplan of the synthesized design is shown in Figure 13 . Area constraints were provided for the ALU, the divider and the register file, whereas the multiplier was left unconstrained. This way, these blocks do not interfere with each other when resources are allocated, while at the same time as many resources as needed can be allocated to the multiplier which constitutes the critical path. No other constraints and, in particular, no manual placement was required to obtain a synthesized design that runs at the targeted frequency of 66.4 MHz which is derived from the PCI clock. 
Point Multiplication Code
We have implemented Montgomery's point multiplication algorithm that we outlined in Section 4. A fragment of the assembly code implementing Equations (1) and (2) is shown in Table 5 . The computation of the four equations is interleaved to achieve a higher degree of instruction-level parallelism. We use a single code base for named curves and generic curves. This is accomplished by executing MUL and SQR instructions according to the curve type. For named curves, MUL denotes a multiplication with hardwired reduction and, for generic curves, it is executed as a multiplication with partial reduction. The execution of an SQR instruction is slightly more complicated. For named curves, SQR is executed by the ALU. And for generic curves, the SQR instruction is translated into a MUL instruction that is executed as a multiplication with partial reduction. We use the BNC (branch if named curve) instruction in the few places where the program code differs for the two curve types. BNC tests the NC flag which is initialized by the host before a point multiplication computation is started.
As explained in Section 6.3, we make use of the fact that the multiplier and the ALU can operate in parallel. That is, if there are no data dependencies, the MUL instruction can be executed in parallel with either an ADD or an SQR instruction. Since the SQR instruction is executed by the ALU for named curves and by the multiplier for generic curves, the order in which instructions are executed differs depending on the curve type even though the code is the same. Data dependencies are detected in two different ways. For overlapped instruction execution, the assembler checks for dependencies that would prevent overlapping. In these cases, the programmer needs to resolve the dependencies by reordering operands or inserting NOP instructions. For parallel instruction execution, the control unit examines dependencies and decides whether instructions can be executed in parallel or not.
The code fragment in Table 5 shows no data dependencies for any MUL/SQR or MUL/ADD instruction sequence. Hence, for named curves, all MUL/SQR and MUL/ADD sequences are executed in parallel. Furthermore, since there are no data dependencies between subsequent arithmetic instructions, instruction execution can be overlapped, saving one cycle per instruction.
Code execution looks different for generic curves as illustrated. In this case, all MUL/SQR sequences have to be executed sequentially as SQR instructions are now executed as MUL instructions. However, there still is one SQR/ADD sequence and one MUL/ADD sequence left that can be executed in parallel. 
[Table 5: Assembly code fragment. Columns: Instructions; Execution for Named Curves; Execution for Generic Curves.]
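The difference in parallel execution between the two curve types can be illustrated with a toy scheduler. This is a simplified sketch under stated assumptions: two instructions are paired only if they are adjacent, use different units, and have no data dependencies; instruction latencies and overlapped execution are ignored. The four-instruction program and its register names are hypothetical and are not the code of Table 5.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str                 # "MUL", "SQR" or "ADD"
    dst: str                # destination register
    srcs: Tuple[str, ...]   # source registers


def unit(instr: Instr, named_curve: bool) -> str:
    """Unit that executes the instruction for the given curve type."""
    if instr.op == "MUL":
        return "multiplier"
    if instr.op == "SQR":
        return "alu" if named_curve else "multiplier"
    return "alu"


def independent(a: Instr, b: Instr) -> bool:
    """True if b neither reads nor overwrites a's result or a's sources."""
    return b.dst != a.dst and a.dst not in b.srcs and b.dst not in a.srcs


def schedule(program: List[Instr], named_curve: bool) -> List[Tuple[str, ...]]:
    """Greedily pair adjacent instructions that can run in parallel."""
    slots = []
    i = 0
    while i < len(program):
        cur = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if (nxt is not None
                and unit(cur, named_curve) != unit(nxt, named_curve)
                and independent(cur, nxt)):
            slots.append((cur.op, nxt.op))   # issued together
            i += 2
        else:
            slots.append((cur.op,))          # issued alone
            i += 1
    return slots


# Hypothetical fragment with MUL/SQR and MUL/ADD sequences.
program = [
    Instr("MUL", "r0", ("x1", "z2")),
    Instr("SQR", "r1", ("z1",)),
    Instr("MUL", "r2", ("x2", "z1")),
    Instr("ADD", "r3", ("r0", "r1")),
]
named_slots = schedule(program, named_curve=True)
generic_slots = schedule(program, named_curve=False)
```

In this model the fragment needs two issue slots for named curves, where both pairs run in parallel, and three for generic curves, where the SQR occupies the multiplier and only the MUL/ADD pair remains parallel.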
Evaluation
This section contains two parts. First, we look at the distribution of instructions executed by a point multiplication. Next, we compare performance numbers for point multiplication executed in hardware and software. Table 6 gives the distribution of instructions executed by the point multiplication operation for named and generic curves, respectively. For point multiplication on named curves over GF(2^163), field multiplications account for almost 62% of the execution time. For generic curves, field multiplications constitute as much as 81% of the execution time. It is, therefore, justified to allocate a significant portion of the available hardware resources to the multiplier. Parallel and overlapped execution save 36% of the execution time for named curves and 20% for generic curves when compared to sequential execution. There is still room for further improvement since the control flow instructions BMZ, BEQ, SL, JMP and END consume almost 21% of the execution time when processing named curves and 10% when processing generic curves. This time could be saved by separating control flow and data flow.
Instruction Distribution
To evaluate the performance of the divider, we implemented an inversion algorithm that replaces the division with a sequence of multiplications and squarings based on Fermat's little theorem. A hard-coded field inversion for named curves over binary polynomial fields GF(2^163) took 938 cycles (0.01413 ms). Thus, the divider is almost three times as fast as a soft-coded implementation. The speedup is even higher when generic curves are used. Also note that the divider provides a reduced canonical result of degree less than the field degree m, which is a useful property when dealing with generic curves.
Table 7 shows performance numbers for implementations of point multiplication in hardware and software. The hardware implementation uses the prototype system described in Section 6.7. The software implementation considers generic curves and does not contain any curve-specific optimizations. It is executed on a 900 MHz Sun Fire 280R server. Hardwired reduction improves the execution time for named curves by more than a factor of two in comparison to generic curves. The speedup of the hardware implementation over the software implementation is a factor of about 20 for named curves, which is significant considering that the FPGA prototype runs at less than 1/13 of the 900 MHz clock frequency. The poor software performance is mainly due to the lack of support for GF(2^m) arithmetic in general-purpose CPUs. While the software implementation is optimized for irreducible polynomials that are either pentanomials or trinomials, the hardware implementation is more generic in that it can operate on arbitrary irreducible polynomials.
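The Fermat-based inversion used as the baseline for the divider rests on the identity a^(-1) = a^(2^m - 2) in GF(2^m). The Python sketch below shows the principle with a naive chain of m - 1 squarings and m - 1 multiplications; it is not the instruction sequence that was measured, which may well use fewer multiplications. Polynomials are assumed to be encoded as integers (bit i is the coefficient of x^i), the function names are hypothetical, and M163 is the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.

```python
M163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # x^163 + x^7 + x^6 + x^3 + 1


def gf2_mul(a: int, b: int, modulus: int) -> int:
    """Multiply a and b in GF(2^m); both operands must have degree < m."""
    m = modulus.bit_length() - 1
    result = 0
    while b:
        if b & 1:
            result ^= a          # add the current shifted copy of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if (a >> m) & 1:
            a ^= modulus         # reduce as soon as degree m is reached
    return result


def gf2_inv_fermat(a: int, modulus: int) -> int:
    """Invert a nonzero element as a^(2^m - 2) = a^2 * a^4 * ... * a^(2^(m-1))."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    m = modulus.bit_length() - 1
    result = 1
    power = a
    for _ in range(m - 1):
        power = gf2_mul(power, power, modulus)    # squaring: a^(2^i)
        result = gf2_mul(result, power, modulus)  # accumulate the product
    return result


a = 0x1234567890ABCDEF
a_inv = gf2_inv_fermat(a, M163)
product = gf2_mul(a, a_inv, M163)  # equals 1 if a_inv is the inverse of a
```

The loop makes the cost of this approach visible: every inversion is paid for in field multiplications and squarings, which is the work a dedicated divider replaces.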
Point Multiplication Performance

Conclusions
We presented a cryptographic processor that provides optimized performance for a number of named curves and support for generic curves over arbitrary fields GF(2^m), m ≤ 255. This flexibility is needed by server applications that have to perform large numbers of point multiplications on different curves.
We described two novel elements of a cryptographic processor: a modular divider and a modular multiplier, both capable of handling arbitrary polynomial fields GF(2^m). The divider leads to a performance gain of a factor of three over a soft-coded implementation based on Fermat's little theorem. Furthermore, the divider comes in handy when the intermediate results computed by the partial reduction algorithm have to be reduced to the actual field degree.
The modular multiplier is capable of handling named curves as well as generic curves. Processing the two types of curves differs in that hardwired reduction logic is used for named curves, whereas the multiplier logic is reused to perform reduction for generic curves. Since reduction for generic curves reuses existing logic, the additional resources needed to support generic curves are minimal.
Our cryptographic processor uses a common code base to implement point multiplication for both named curves and generic curves. To make this possible, squaring instructions are dynamically translated into multiplication instructions in the case of generic curves. This translation is necessary since modular squaring is implemented in hardware for named curves only.
We increased performance by exploiting the parallelism found in Montgomery's point multiplication algorithm. More specifically, we allow multiplication instructions, which are the most frequently executed instructions, to be executed in parallel with either add or square instructions. Together with overlapped instruction execution, parallel execution reduces the execution time for a point multiplication operation by 36% for named curves and 20% for generic curves.
We are currently working on a cryptographic processor that uses a common architecture capable of executing point multiplication for both GF(p) and GF(2^m).
About the Authors
Her current interest is in the area of efficient algorithms for implementing Elliptic Curve Cryptosystems, and the development of hardware cryptographic accelerators for RSA and ECC.
Vipul Gupta is a Senior Staff Engineer at Sun Microsystems Laboratories, where his research interests include secure networking protocols and mobile computing. Prior to joining Sun, he was an Assistant Professor at the State University of New York where he taught courses in computer networking, parallel processing, and operating systems and conducted research funded by the National Science Foundation and industry sponsors that included IBM and NEC. He has a Ph.D. in Computer Science from Rutgers University.
