Abstract. This work proposes a new elliptic curve processor architecture for the computation of point multiplication for curves defined over fields GF(p). This is a scalable architecture in terms of area and speed, especially suited for memory-rich hardware platforms such as field programmable gate arrays (FPGAs). This processor uses a new type of high-radix Montgomery multiplier that relies on the precomputation of frequently used values and on the use of multiple processing engines.
Introduction
This work introduces, to the authors' knowledge, the first documented processor architecture for the computation of elliptic curve point multiplications for curves defined over fields GF(p). Hardware implementations have been documented for the computation of point multiplications for curves defined over GF($2^m$). The most notable implementations include [1] [2] [3] [4] [5] [6].
The architecture presented here is based on the standalone elliptic curve processor architecture introduced in [6] . This architecture is modular, programmable, and suitable for algorithms that rely on precomputations.
Multiplication is typically the most critical operation in the computation of elliptic curve point multiplications. The architecture introduced here uses a Montgomery multiplier. This type of multiplier has been the subject of extensive research; see, for example, [7] [8] [9] [10] [11].
For the elliptic curve processor (ECP) introduced here, this work develops a new multiplier architecture that draws from [9, 12] an approach for high radix multiplication, from [8, 9] the ability to delay quotient resolution, and from [10] the use of precomputation. In particular, this work extends the concept of precomputation. The resulting multiplier architecture is a high-radix, precomputation-based modular multiplier.
This research was supported in part by NSF CAREER award CCR-9733246.
This section provides a brief introduction to elliptic curve point multiplication. Additional information can be found in [13, 14] .
The ECP computes elliptic curve point multiplications for arbitrary curves defined over GF(p). Point multiplication is defined as the product $kP = P + P + \cdots + P$ ($k$ times), where k is an integer and P is a point on the elliptic curve. For fields GF(p), the curves of interest are defined by $y^2 = x^3 + ax + b$, where $4a^3 + 27b^2 \not\equiv 0 \bmod M$ and $M > 3$. One can visualize the computation of point multiplications as a hierarchy of processing functions. At the top of the hierarchy are the point multiplication functions. These functions compute point multiplications with repeated point additions and point doubles. At the next level of the hierarchy are the point addition and point double functions, which are intimately related to the coordinates used to represent the points. At the bottom of the hierarchy are the finite field functions required to perform the point addition and the point double functions. Figure 1 shows how this hierarchy maps into the ECP architecture.
The ECP is best suited for the computation of point multiplications using projective coordinates. When compared against algorithms for affine coordinates, algorithms for projective coordinates trade inversions in the point addition and in the point double operations for a larger number of multiplications and a single inversion at the end of the algorithm. This inversion can be computed with repeated multiplications: $a^{-1} \bmod M \equiv a^{M-2} \bmod M$, for prime modulus M.
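The inversion by repeated multiplication can be illustrated with a minimal Python sketch. The function name `mod_inverse_fermat` and the square-and-multiply ordering are illustrative assumptions, not the ECP's actual control sequence; the only property taken from the text is $a^{-1} \equiv a^{M-2} \bmod M$ for prime M.

```python
def mod_inverse_fermat(a, M):
    """Return a^(M-2) mod M using only modular multiplications.

    Assumes M is prime and a is not a multiple of M (hypothetical helper,
    right-to-left square-and-multiply).
    """
    result = 1
    base = a % M
    exponent = M - 2
    while exponent:
        if exponent & 1:
            result = (result * base) % M   # multiply step
        base = (base * base) % M           # squaring step
        exponent >>= 1
    return result


# Example with the 192-bit prime field used later in the text.
M192 = 2**192 - 2**64 - 1
a = 123456789
a_inv = mod_inverse_fermat(a, M192)
print((a * a_inv) % M192)
```

For an m-bit modulus this costs roughly m squarings plus a number of multiplications equal to the number of one bits in M − 2, which is why a single inversion at the end of the point multiplication is affordable.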
The ECP uses a Montgomery multiplier. The main advantage of this type of multiplier is that it facilitates quotient estimation and carry propagation in hardware adders. Its main disadvantage is that it computes weighted products: mult(A, B) = $ABR^{-1} \bmod M$, where R is a constant. For Montgomery multiplication to be effective, the input operands to the point multiplication algorithm must be transformed into weighted residues of the form $AR \bmod M$. The algorithm is then executed using these residues. At the end of the algorithm, the results are transformed back to non-weighted residues. Note that, as described in [15], the addition and subtraction of these residues can be performed using traditional modular addition and subtraction operations. For most cryptographic algorithms, the cost of these transformations is amortized over a large number of operations.
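The weighted-residue workflow can be sketched with the textbook (radix-R, single-step) Montgomery reduction. This is a sketch under stated assumptions, not the high-radix pipelined multiplier of this work: the helper names, the choice $R = 2^{192}$, and the final conditional subtraction are all illustrative. The negative exponent in `pow` requires Python 3.8 or later.

```python
def montgomery_setup(M, r_bits):
    """Return (R, -M^{-1} mod R) for an odd modulus M < R = 2**r_bits."""
    R = 1 << r_bits
    assert M % 2 == 1 and M < R
    m_neg_inv = (-pow(M, -1, R)) % R
    return R, m_neg_inv


def mont_mul(a_bar, b_bar, M, r_bits, m_neg_inv):
    """Return a_bar * b_bar * R^{-1} mod M for inputs in [0, M)."""
    mask = (1 << r_bits) - 1
    t = a_bar * b_bar
    q = (t * m_neg_inv) & mask        # quotient digit: makes t + q*M divisible by R
    s = (t + q * M) >> r_bits         # exact division by R
    return s - M if s >= M else s     # single correction step


def to_weighted(a, M, R):
    """Transform a into the weighted residue a*R mod M."""
    return (a * R) % M


def from_weighted(a_bar, M, r_bits, m_neg_inv):
    """Transform a weighted residue back by multiplying by 1."""
    return mont_mul(a_bar, 1, M, r_bits, m_neg_inv)


M = 2**192 - 2**64 - 1
R_BITS = 192
R, M_NEG_INV = montgomery_setup(M, R_BITS)

x, y = 0x1234567890ABCDEF, 0xFEDCBA0987654321
x_bar, y_bar = to_weighted(x, M, R), to_weighted(y, M, R)
product = from_weighted(mont_mul(x_bar, y_bar, M, R_BITS, M_NEG_INV), M, R_BITS, M_NEG_INV)
print(product == (x * y) % M)
```

Note that the product of two weighted residues is again a weighted residue, so the transformations are needed only at the start and end of a chain of operations, and sums of weighted residues need no special handling.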
Processor Architecture
The elliptic curve processor (ECP), shown in Figure 1 , consists of three main components. These components are the main controller (MC), the arithmetic unit controller (AUC), and the arithmetic unit (AU). The MC is the ECP's main controller. It orchestrates the computation of kP and interacts with the host system. The AUC controls the AU. It orchestrates the computation of point additions/subtractions, point doubles, and coordinate conversions. It also guides the AU in the computation of field inversions. The AU is the hardware that computes field additions/subtractions and multiplications, and performs comparisons. The following is a typical sequence of steps for the computation of kP in the ECP using the double-and-add algorithm and the projective coordinates algorithms shown in the appendix.
First, the host loads k into the MC, loads the coordinates of P into the AU, and commands the MC to start processing. The MC does its initialization and then commands the AUC to do its initialization. The AUC initialization includes the conversion of P from affine to projective coordinates and the conversion of these coordinates into weighted residues ($\tilde{X} = XR \bmod M$, $\tilde{Y} = YR \bmod M$, $\tilde{Z} = R \bmod M$). During the computation of kP, the MC scans one bit of k at a time, starting with the second most significant bit and ending with the least significant one. In each of these iterations, the MC commands the AU/AUC to do a point double. If the scanned bit is a 1, it also commands the AU/AUC to do a point addition. For each of these point operations, the AUC generates the control sequence that guides the AU through the computation of the required field and comparison operations. After the least significant bit of k is processed, the MC commands the AU/AUC to convert the result back to affine coordinates. The AU/AUC first converts the result to affine coordinates and then converts the coordinates to non-weighted residues (x, y). Then, the MC signals to the host the completion of the kP operation. Finally, the host reads the coordinates of kP from the AU.
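The bit-scanning loop executed by the MC can be sketched as follows. This is a minimal illustration, assuming the group operations are supplied as callables (`point_double` and `point_add` are hypothetical names); it omits the coordinate and residue conversions described above.

```python
def double_and_add(k, P, point_double, point_add):
    """Left-to-right double-and-add: returns kP for an integer k >= 1.

    The scan starts at the second most significant bit of k because the
    most significant bit is accounted for by initializing Q = P.
    """
    assert k >= 1
    bits = bin(k)[2:]
    Q = P
    for bit in bits[1:]:
        Q = point_double(Q)          # one point double per scanned bit
        if bit == "1":
            Q = point_add(Q, P)      # one point addition per 1 bit
    return Q


# Sanity check in the additive group of integers, where kP is simply k * P.
print(double_and_add(13, 7, lambda q: 2 * q, lambda q, p: q + p))
```

Passing real elliptic curve point routines instead of the integer lambdas gives the scalar multiplication; the control flow is unchanged.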
The ECP uses two loosely coupled controllers, the MC and the AUC, that execute their respective operations concurrently. These are programmable processors that execute one instruction per clock cycle.
The AU incorporates a multiplier, an adder (or adders), and a register file, all of which can operate in parallel on different data. The AU's large register set supports algorithms that rely on precomputations. An example of a precomputation-based algorithm is an adaptation of a fixed-base exponentiation method introduced in [16] for operations involving a known point. This algorithm requires on average $m/w + 2^w$ point additions, the storage of $m/w$ points, and no point doubles. In the previous expressions, w is the window size, which is a measure of the number of bits processed in parallel. To illustrate the benefits of precomputation, consider a fixed point multiplication for an arbitrary curve defined over GF($2^{192} - 2^{64} - 1$), which is one of the fields recommended in [17]. Compared to the traditional double-and-add algorithm, the fixed point algorithm is over four times faster (assuming the use of the projective coordinates in [18] with Z = 1 and w = 4).
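A rough operation count makes the comparison concrete. The sketch below counts point operations only; it assumes that double-and-add needs about m point doubles and m/2 point additions on average and that a double and an addition cost the same. Both are simplifying assumptions, so the ratio is indicative rather than the field-multiplication-level figure quoted above.

```python
def fixed_base_point_ops(m, w):
    """Average point additions for the fixed-base method: m/w + 2^w."""
    return m / w + 2**w


def double_and_add_point_ops(m):
    """Assumed average for double-and-add: m doubles plus m/2 additions."""
    return m + m / 2


m, w = 192, 4
fixed = fixed_base_point_ops(m, w)
classic = double_and_add_point_ops(m)
print(fixed, classic, classic / fixed)
```

With m = 192 and w = 4 the count is 64 point operations against 288, a ratio of 4.5 under these assumptions, at the cost of storing m/w = 48 precomputed points.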
Arithmetic Unit
The Arithmetic Unit (AU) is the ECP's main processing unit. As Figure 2 shows, it consists of a register file, an adder (or adders), and a multiplier. The multiplier is the AU's most critical component, and, consequently, it is the component that drives the AU's architecture. The AU's architecture is defined at a high level by the multiplication algorithm it implements and at a low level by the number representation it uses. The most popular cryptographic algorithms in use today require arithmetic with large operands (160 to 1024 or more bits). To achieve a high rate of computation, most hardware implementations resort to iterated multiplication methods that compute approximate results rather than exact ones. The approximate results are then refined to exact results in post-processing operations. The tradeoff is accuracy for speed. The ECP's multiplier is an example. It implements an iterated multiplication algorithm that approximates the multiplication of $AR \bmod M$ and $BR \bmod M$ as $ABR \bmod M + \epsilon M$, where $\epsilon$ is a measure of the accuracy of the multiplication. Note that for the basic forms of Montgomery multiplication $\epsilon = 1$.
Number representation is an important element of an arithmetic architecture. It defines how the numbers are represented and consequently how arithmetic is conducted. The selection of a number representation is influenced by the design methodology, the target architecture, and the area-time (or cost-speed) goals. The ECP architecture is independent of number representations. To validate the ECP architecture a prototype that uses redundant number representation was developed. The implementation results are discussed in section 8.
Modular Multiplication Algorithm
Algorithm 1 shows the ECP's main multiplication algorithm. This algorithm is a generalized version of the Montgomery multiplication algorithm with quotient pipelining introduced in [9] . This generalized version supports positive and negative operands and incorporates Booth recoding and precomputation. Positive and negative numbers arise naturally in Booth recoding and they are often used in elliptic curve algorithms.
Booth recoding is a technique that allows the representation of a two's complement number B =
Here we assume that B is represented by an integer number of digits of radix $2^u$ and also that its most significant bit represents the sign.
This work uses the Modified Booth Algorithm, which is a window-based method [19, 20]. This method uses s windows, where each window i groups the set of bits $(b_{iu+(u-1)}\, b_{iu+(u-2)} \ldots b_{iu-1})_2$ for $i = 0, \ldots, s-1$, and where
The set of bits enclosed by window i , is encoded as
Note that in Algorithm 1 the recoding is done on a digit-by-digit basis. For this algorithm r, s, u, and v divide k. The variables qh i and bh i are respectively the most significant bits of Q i and B i .
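The window-based recoding can be illustrated with a short Python sketch. It assumes the standard radix-$2^u$ modified Booth digit definition, in which the window's top bit has weight $-2^{u-1}$, the remaining window bits have their usual positive weights, the bit just below the window ($b_{iu-1}$) contributes +1, and $b_{-1} = 0$. The function name and interface are illustrative.

```python
def booth_recode(b, u, s):
    """Recode a two's complement integer b of s*u bits into s signed digits.

    Returns digits d[0..s-1], least significant first, each in
    [-2^(u-1), 2^(u-1)], such that sum(d[i] * 2^(u*i)) == b.
    """
    total_bits = s * u
    assert -(1 << (total_bits - 1)) <= b < (1 << (total_bits - 1))
    pattern = b & ((1 << total_bits) - 1)    # two's complement bit pattern
    digits = []
    below = 0                                # b_{-1} = 0
    for i in range(s):
        window = (pattern >> (i * u)) & ((1 << u) - 1)
        top = window >> (u - 1)              # most significant bit of the window
        low = window & ((1 << (u - 1)) - 1)  # remaining u-1 bits
        digits.append(low + below - top * (1 << (u - 1)))
        below = top                          # becomes b_{iu-1} of the next window
    return digits


digits = booth_recode(-77, u=4, s=2)
print(digits, sum(d << (4 * i) for i, d in enumerate(digits)))
```

Because each digit magnitude is at most $2^{u-1}$, a table of precomputed multiples needs only about half as many entries as unsigned radix-$2^u$ digits would require, with the sign handled by negation.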
The validity of Algorithm 1 can be proven using an induction argument similar to the one used in [9] to prove the validity of the Montgomery multiplication algorithm on which this algorithm is based. One can verify with induction on i that Equation (1) defines an invariant of the loop. For this verification note that $S_i/2^k$ defines a truncated division equivalent to
The symbol |x|M is used to express an approximate modulo reduction that satisfies the following relation:
|x| M is used to express least residue; that is, | |x| M | < M , where the symbol |y| represents the absolute value of y.
Using the loop invariant in Equation (1), one can verify that when i = n+d+1 the output of the algorithm satisfies Equation (2) . This equation establishes that the multiplication output is S n+d+2 ≡ ABR
This equation also defines the range, or accuracy, of the multiplication result in terms of the maximum values of A and B (note $B_{i \geq n} = 0$); the maximum value for the reduction terms, QM, which is defined in Equations (4-5) (note $Q_0 = 0$); the value of the multiplication constant $R = 2^{kn}$; and the quotient resolution delay, d.
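Algorithm 1: Modular Multiplication with Precomputation
[Listing not recoverable. Surviving fragments: an "Inputs:" header; the close of the nested loops that precede the processing stage; the processing loop of step 4, "for i = 0 to n + d do"; and step 4.1, quotient determination.]
Loop invariant, Equation (1): [not recoverable]
Result after n + d + 1 loop iterations, Equation (2): [not recoverable]
QM, Equations (4-5): [not recoverable]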
Note that implementations can take advantage of the parallelism defined in steps 4.2-4.6 of Algorithm 1 without using Booth recoding. These implementations can set $qh_i = bh_i = 0$ for all i, and use digits $ql_{iv+j} \in [0, 2^u)$ and $bl_{is+j} \in [0, 2^r)$.
Analysis of Modular Multiplication Algorithm
In order to realize an area efficient multiplier, the ECP implements Algorithm 1 using precomputation. Precomputation reduces the complexity of the multiple input adder needed to add all the terms in step 4.6 of Algorithm 1 at the expense of a set of additions at the beginning of the algorithm (steps 2 and 3) and storage. The issues associated with the implementation of Algorithm 1 using precomputations are studied in the next sections.
Accuracy
The accuracy of the modular multiplication result is influenced by the range of the input operands, the method employed to compute reduction terms, the multiplication constant R, and the quotient resolution delay d.
In [12] two methods are defined for the computation of the reduction terms |xα|M . These methods are referred to as multiplication-based and lookup-based reduction methods. The multiplication-based approach computes |xα|M = x |α| M . The lookup-based method computes |xα|M = |xα| M . The accuracy of one multiplication-based and two lookup-based reduction methods is summarized in Table 1. (Note that the reduction method affects the value of $Q\alpha_i$.)
R is a design parameter that influences the reduction accuracy of the multiplier. As Table 1 shows, the accuracy of the result is bounded by the magnitude of QM, which grows proportionally with R. For applications requiring iterated multiplications, such as modular exponentiations, R is often chosen so that the accuracy of a multiplication result falls in the range (−2QM/R, 2QM/R), or [0, 2QM/R) when handling only positive numbers. Examples for this last application can be found in [9] , for which A, B ∈ [0, 2(2
The results in Table 1 [...], QM < MR for all the reduction methods listed in Table 1.

Table 1. Accuracy of multiplication- and lookup-based reduction methods
Red. method | $Q\alpha_i$ | QM (worst case)
Processing Time
Equation (6) The expression in Equation (6) is normalized with respect to a reference unit of time. The processing cost of a precomputation operation is weighted by a factor a and the processing cost of a processing operation is weighted by the factor b. The factor c defines the number of multiplications over which the precomputation cost is amortized. (Note that it is common in many cryptographic algorithms to perform a large number of consecutive operations using the same modulus.)
The factor e represents the number of precomputation sets to be computed. As written, Algorithm 1 requires one set for the scalar products AB i and up to v sets for the scalar products Qα i . Note that for the Lookup 2 reduction method v sets need to be computed in step 3.1. For the multiplication and the Lookup 1 reduction methods, the precomputation engine can broadcast a single precomputation set to the relevant processing engines. For the precomputation of a single set, eliminate the loop in step 3.1, compute α[i] = |iα|M in step 3.1.1, and compute
uj in step 4.4.
To determine the optimum number of precomputations it is best to express Equation (6) in terms of $m = \log_2 M$. Equation (7) provides an approximation, where $n = m/k + d + f$, f is a constant, and $R = 2^{m+k(d+f)}$. According to Equation (2) and the possible cases in Table 1, $f \in [0, 2]$ when $A = B = 2^{k/2}\,QM/R$ and the target multiplication accuracy is $S_{n+d+2} \in (-2QM/R,\ 2QM/R)$. These parameters are of interest here because they define a small number of iterations for Algorithm 1 that generate results suitable for repeated multiplications, and they also allow a number of additions to be performed between multiplications without the need for reduction. Unless otherwise specified, this document will assume the use of the aforementioned parameters for general multiplications.
+ a q e q c q 2 u−1 + b q m/uv + b q (2d + f ) [18] . Entry 1 corresponds to the classical multiplication operation. Entry 2 defines a division by 2 requiring just d + 2 iterations of the loop in Algorithm 1. Entry 3 defines a multiplication of a special form which is used here to reduce the magnitude of a value presumed to be |0| M before comparing it to zero. Note that for Entry 3, QM is defined with respect to x (n = x) as shown in Table 1 , and this value may be different from the value of QM used to define A.
Operations of Interest for Scalar Point Multiplication
Some of the elliptic curve algorithms defined in the open literature, such as the ones in [18], use comparisons in time-critical functions, such as point addition and point double. Comparisons are used, among other purposes, to identify the point at infinity during point add and point double operations. These comparisons involve field elements; therefore, numbers A and B are considered equal if $A - B \equiv |0|_M$, which implies that their difference is a multiple of M.
The accuracy of Algorithm 1 is of the order QM/R, where QM is defined in Table 1 . Rather than adding specialized circuitry to perform this function, here we recommend an approach that multiplies a value presumed to be zero by a constant. The idea is to perform this multiplication with high accuracy in a short amount of time. To achieve high accuracy, we recommend the use of Algorithm 1 with low quotient resolution delay (d ≈ 0) and possibly by using a more exact version of Algorithm 1 (see Table 1 ). To achieve a short processing time, we recommend multiplication by 2 −kx M according to Table 2 , where the parameter x is adjusted so that the value of the multiplication result is close to the value of M . Note that the algorithm just described is useful for a large set of applications. If additional accuracy is needed for the reduction operations, one can implement in the ECP a more accurate version of the multiplication algorithm, one such algorithm is presented in [9] . 
Area and Storage
The most complex operation of Algorithm 1 is the computation of the two scalar multiplications $AB_i$ and $Q\alpha_i$ (the multiplication in step 5 is just a shift operation). These scalar multiplications can be computed using scalar multipliers. For the computation of a scalar multiplication, a scalar multiplier would add up to k/2 numbers per clock cycle when employing Booth recoding and up to k numbers when using no recoding. Assuming that all the operands in Algorithm 1 are of the same size, the concurrent computation of step 4.6 would require the addition of k + 1 operands when using recoding or 2k + 1 when using no recoding. On the other hand, when using precomputation the concurrent computation of step 4.6 requires the addition of s + v + 1 operands. A limiting factor in the practical implementation of multiplication with precomputation is the size of the memory required to store the precomputed values. The use of Booth recoding in Algorithm 1 reduces the memory requirements by half when no storage is provided for values known to be zero (e.g., 0 * A). Assuming that each precomputed product used in the computation of $AB_i$ requires m + k(d + 1) + r bits of storage, that each precomputed scalar product used in the computation of $Q\alpha_i$ requires m + u bits of storage, and that each processing unit stores its own set of precomputed values, Algorithm 1 requires
Note that if multiple reduction methods are used concurrently, such as the use of a reduction method with d = 0 and one with d ≠ 0, more than one copy of the reduction coefficients needs to be stored.
Note that the relationships between r and s, and between u and v, allow designers to control the memory size at the expense of processing elements; for example, to achieve a given k, a designer could fix r and then derive s, which defines the required number of processing elements. This approach is particularly attractive for architectures that employ fixed-size memory elements, such as field programmable gate arrays (FPGAs).
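The tradeoff between window size and the number of processing elements can be tabulated with a small sketch. It uses the operand counts stated earlier in this section for the adder in step 4.6 (k + 1 with recoding only, 2k + 1 with no recoding, s + v + 1 with precomputation) and assumes k = rs = uv, as in the prototype parameters; the function name and the returned dictionary layout are illustrative.

```python
def step_4_6_operands(k, r, u):
    """Operand counts for the multi-input adder of step 4.6.

    Assumes k = r*s = u*v, so fixing the window sizes r and u determines
    the number of processing elements s and v.
    """
    assert k % r == 0 and k % u == 0
    s, v = k // r, k // u
    return {
        "s": s,
        "v": v,
        "with_precomputation": s + v + 1,
        "booth_recoding_only": k + 1,
        "no_recoding": 2 * k + 1,
    }


for r in (1, 2, 4, 8):
    print(r, step_4_6_operands(8, r, r))
```

For k = 8, larger windows (r = u = 4) reduce the adder to 5 operands, while the per-engine tables grow with the window size; this is the memory-versus-logic knob described above.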
Effect of Quotient Pipelining
Quotient pipelining is the technique that allows a fast rate of computation by allowing the use of delayed reduction terms (d ≠ 0). The delay is reflected in steps 4.4 and 4.6 of Algorithm 1. The computation in step 4.4 occurs in the background and takes d iterations to complete. To avoid stalling, the results from step 4.4 are consumed as they become available in step 4.6.
The cost of this technique is reduced accuracy, increased processing time and increased area. The impact of this technique can be reduced by eliminating processing functions associated with the quotients, such as recoding, and by hiding quotient operations behind other functions. For example, the scalar products in step 4.6 of Algorithm 1 could be computed serially with all the processing engines dedicated to the computation of a scalar multiplication, instead of having two sets, each working on a different scalar product.
Number Representation
The previous discussion considered the upper layers of the ECP architecture, which are independent of the number representation. This section considers the specific example of stored-carry representation.
Stored-carry representation is attractive for the implementation of an ECP because of, among other reasons, its support for fast addition using carry-save addition, its natural interaction with non-redundant number representation, and its ability to support two's complement arithmetic.
The main drawbacks of stored-carry representation stem from its representation of a number with two numbers; for example, A = C + S, where A, C, and S are numbers of almost equal size. This representation doubles the storage requirements for an operand with respect to non-redundant representation and makes comparisons difficult. A comparison can be carried out by performing a subtraction, converting the result to non-redundant representation, and then comparing the result against zero.
The use of Booth recoding in Algorithm 1 alleviates the storage requirements imposed by stored-carry representation. In addition, the ability to amortize precomputations over a large number of operations can be used to reduce memory requirements by storing precomputed values in non-redundant representation. The ECP's multiplier architecture, shown in Figure 2, also makes provisions for the conversion of numbers to non-redundant representation; for example, the conversion of B can be done on a digit-by-digit basis before recoding. In addition, the system could employ a carry propagate adder for the conversion of numbers to non-redundant representation before storing them in the register file.
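The following Python sketch illustrates stored-carry arithmetic on arbitrary-precision integers. It is a behavioral model only, assuming a bitwise 3:2 compressor and a final carry-propagate addition for conversion; the function names are illustrative and nothing here reflects the prototype's actual circuits.

```python
def carry_save_add(c, s, x):
    """3:2 compressor: returns (carry, sum) with carry + sum == c + s + x.

    No carry propagates between bit positions, which is what makes the
    hardware version fast.
    """
    sum_bits = c ^ s ^ x
    carry_bits = ((c & s) | (c & x) | (s & x)) << 1
    return carry_bits, sum_bits


def to_nonredundant(c, s):
    """Conversion to a single number requires a carry-propagate addition."""
    return c + s


def stored_carry_equals(c, s, value):
    """Compare by subtracting, converting, and testing against zero."""
    dc, ds = carry_save_add(c, s, -value)
    return to_nonredundant(dc, ds) == 0


c, s = 0, 0
for operand in (13, 200, 7, 1024):
    c, s = carry_save_add(c, s, operand)   # accumulate without carry propagation
print((c, s), to_nonredundant(c, s))
```

The accumulation loop never resolves carries, while the comparison helper shows why equality tests are comparatively expensive in this representation: each one needs a full conversion.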
Multiplier Architecture
The AU's architecture is shown in Figure 2 . The multiplier and adder together implement Algorithm 1. The adder, which is optimized for accumulation (A = A + B), feeds precomputation values to the multiplier. Both the adder and the multiplier receive one of their inputs from the register file. They also output results to the register file.
To accomplish a high rate of computation, the architecture shown in Figure 2 can be implemented using stored-carry representation. To balance storage and processing speed requirements one can choose to represent some numbers in stored-carry representation and others in non-redundant representation.
The reduction terms (the approximately reduced values of $i\alpha 2^{uj}$ modulo M) and some of the temporary results can be converted to non-redundant representation before storage. Operand B of the multiplication can be loaded to the multiplier in stored-carry form and then converted to non-redundant representation one digit at a time as the loop in Algorithm 1 progresses. The reduction terms $Q_i$ can also be converted to non-redundant representation before applying Booth recoding.
To support stored-carry representation, the architecture in Figure 2 must be enhanced with a carry propagate adder and with an efficient way to store numbers represented in stored-carry representation. For the ECP prototype described in the next section, we implemented a carry propagate adder with a digit-serial adder placed at point (A) in Figure 2 . For the storage of numbers represented in stored-carry representation, we recommend that the output multiplexer in Figure 2 be able to independently forward to the register file each of the numbers used to represent a number in stored-carry representation; that is, for A = C + S, this multiplexer can send either C or S to the register file.
Note that the two numbers used to represent a number in stored-carry representation can be treated as two numbers represented in non-redundant representation. Therefore, for the terms represented in stored-carry representation, such as A[$|bl_{is+j}|$](sign($bl_{is+j}$)), one can use two processing units per term, where each processing unit handles numbers represented in non-redundant representation. This design approach allows the use of a common processing unit architecture for stored-carry and non-redundant number representations.
Prototype Implementation
The validity of the ECP architecture was verified with a prototype that implemented the double-and-add algorithm using the projective coordinates algorithms defined in [18] for point addition and point double operations (the algorithms are shown in the appendix). This prototype was programmed to support the field GF($2^{192} - 2^{64} - 1$), which is one of the fields specified in [17]. To verify the ECP's architectural scalability to larger fields, a modular multiplier for fields as large as GF($2^{521} - 1$) was also prototyped. This field is the largest one recommended in [17] for elliptic curves defined over GF(p). This prototype exhibits the same area scalability and frequency of operation as does the multiplier of the ECP prototype. The following discussion focuses exclusively on the ECP prototype.
The ECP prototype used a 16-bit MC processor with 256 words of program memory, a 32-bit AUC processor with 2048 words of program memory, and a dual set of 128 registers, each of which is m + k(d + 2) bits wide. The dual set of registers permits the storage of numbers in stored-carry representation. (Note that a single register set capable of storing stored-carry numbers could have been used instead.) The prototype provided a 32-bit I/O interface to the host system. The ECP multiplier exhibits the following attributes: The validity of the prototype was verified with non-optimized code. Assuming that the ECP is coded in a form that extracts 100% throughput from its multiplier, it will compute a point multiplication for an arbitrary point on a curve defined over GF($2^{192} - 2^{64} - 1$) in approximately 3 ms (n = 192/8 + 1, d = 4, k = rs = uv = 8) using the algorithms shown in the appendix. This estimate ignores the processing cost of additions and overhead operations, and it assumes the computation of 17m multiplications per point multiplication: 15.5m for the point double and the point add operations and 1.5m for the inverse required in the conversion to affine coordinates. For the modular multiplications, this estimate assumes negligible precomputation cost for the reduction terms, $Q\alpha_i$, and assumes the precomputation of $2^3 - 2$ values for the terms $AB_i$ (no computation required for 0A or 1A).
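The arithmetic behind this estimate can be reproduced with a back-of-envelope sketch. It takes the stated parameters (m = 192, k = 8, d = 4, n = m/k + 1, 17m multiplications) and the n + d + 1 loop iterations per multiplication from Algorithm 1. The final figure, the loop iteration rate implied by a 3 ms total, is a derived quantity under the assumption that precomputation and overhead are ignored; it is not a clock frequency reported in this text.

```python
def ecp_multiplication_budget(m=192, k=8, d=4, mults_per_bit=17, total_seconds=3e-3):
    """Count multiplier loop iterations for one point multiplication."""
    n = m // k + 1                        # digits per operand, as stated in the text
    iterations_per_mult = n + d + 1       # loop runs for i = 0 .. n + d
    multiplications = mults_per_bit * m   # 15.5m for point ops + 1.5m for inversion
    total_iterations = multiplications * iterations_per_mult
    implied_rate_hz = total_iterations / total_seconds
    return n, iterations_per_mult, multiplications, total_iterations, implied_rate_hz


n, iters, mults, total, rate = ecp_multiplication_budget()
print(n, iters, mults, total, round(rate / 1e6, 2), "M iterations/s")
```

Changing `m`, `k`, or `d` in the call shows how the iteration count scales with field size and radix, which is the scalability argument made for the multiplier.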
The prototypes were implemented using the Xilinx XCV1000E-8-BG680 (Virtex-E) FPGA. The prototypes were coded in VHDL. They were synthesized with Synopsys' FPGA Compiler 3.5.0 and Xilinx's Design Manager M3.1i. Table 3 summarizes the features of the multiplier used in the ECP prototype and the features of one of the multiplier architectures introduced in [10], which also relies on precomputation. Both of these multipliers exhibit comparable area requirements (#LUTs) when one assumes s = v = 1 and r = u = 4. Note that the multiplier in [10] uses a fixed value of k, where this value is highly dependent on the underlying FPGA architecture.
Comparisons with other Implementations
It should be pointed out that the multiplier architecture introduced in [10] can be enhanced with some of the techniques introduced here. For example, to overcome the radix limitation, currently fixed at $2^4$, this multiplier could employ multiple processing engines per cell (s, v > 1), and to reduce memory requirements it could use Booth recoding.
Conclusions
This work proposed a new elliptic curve processor architecture for the computation of point multiplication for curves defined over fields GF(p). This processor uses a new type of high-radix Montgomery multiplier that relies on the precomputation of frequently used values and on the use of multiple processing engines. The ECP's architectural scalability was verified with prototype implementations suitable for the implementation of secure elliptic curve cryptosystems (192- and 521-bit fields). Our estimates indicate that if it were possible to extract 100% throughput from our multiplier, a point multiplication on a curve defined over GF($2^{192} - 2^{64} - 1$) could be computed in about 3 ms using the double-and-add algorithm and the projective coordinates algorithms defined in [18].
