Abstract
Introduction
parameters because of the prohibitive cost of specialized hardware.
Although a lot of work has been done in the area of reconfigurable computing and run-time reconfiguration, we are aware of only a few practical implementations of general-purpose reconfigurable computers. The SRC-6E from SRC Computers, Inc. was chosen for our study [3]. Our goal was not only to confirm the great potential for effective use of reconfigurable computers in cryptography, but also to determine the current and possible future limitations of reconfigurable computing technology. We chose as our benchmark a relatively complex cryptographic operation: scalar multiplication in the group of points on an elliptic curve over GF(2^m) with a polynomial basis representation [4, 5, 6]. This operation is well suited to our study, as it involves a three-level hierarchy of operations. The goal of our study is to find out which level functions need to be implemented by a hardware designer as library macros, and at what level the software designer can take over. Our paper gives an answer to this question for the current generation of reconfigurable computers.
SRC Reconfigurable Computer
SRC-6E is a hybrid-architecture platform, which consists of two dual-microprocessor boards and one MAP board. A block diagram depicting one half of the SRC-6E machine is shown in Fig. 1.
Basic operations of Elliptic Curve Cryptosystems
Elliptic Curve Cryptosystems (ECCs) are commonly used in constrained environments, such as portable and wireless devices, as a small-area, low-energy alternative to the RSA cryptosystem. The primary applications of Elliptic Curve Cryptosystems are secure key agreement and digital signature generation and verification [5, 7, 8]. In both of these applications, the primary optimization criterion from the implementation point of view is minimum latency (rather than maximum throughput). The primary operation of ECCs is elliptic curve scalar multiplication. Below we define this operation in terms of lower-level operations.
A non-supersingular elliptic curve over GF(2^m) is defined as the set of points (x, y) that satisfy the equation

y^2 + xy = x^3 + a_2 x^2 + a_6,

where x, y, a_6 ∈ GF(2^m) and a_2 ∈ {0, 1}, together with a special point called the point at infinity, denoted as O.
The elements of the Galois Field GF(2^m) can be represented in several different bases, such as polynomial basis, normal basis, dual basis, etc. In all these representations, addition is the same and equivalent to the XOR operation, but multiplication is defined differently. Our implementation focuses on the polynomial basis representation.
The addition of two points of an elliptic curve, P = (x_P, y_P) and Q = (x_Q, y_Q), where Q ≠ -P = (x_P, y_P + x_P) and P, Q ≠ O, is defined in Table 1. Additionally, P + O = O + P = P, and P + (-P) = O. Similarly, point doubling, 2P = P + P, where P ≠ O, is also defined in Table 1. Additionally, 2O = O. Note that outside of the special cases, both point addition and point doubling involve one inversion, several multiplications, and several additions in GF(2^m).
The primary elliptic curve operation used in cryptography is scalar multiplication, defined as kP = P + P + ... + P (k times). The well-known right-to-left and left-to-right algorithms for scalar multiplication are given as Algorithms 1 and 2. In Algorithm 1, point addition (line 5) and point doubling (line 7) can be performed in parallel. The same is not true for Algorithm 2. Therefore, we have chosen the right-to-left Algorithm 1 for our implementations.
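The right-to-left method can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1 or Table 1 verbatim: it assumes the standard affine formulas for the curve y^2 + xy = x^3 + a_2 x^2 + a_6, a toy field GF(2^4) with P(x) = x^4 + x + 1, illustrative curve constants a_2 = a_6 = 1, and inversion by exponentiation. All function and constant names are hypothetical.

```python
M, POLY = 4, 0b10011      # toy field GF(2^4), P(x) = x^4 + x + 1 (assumption)
A2, A6 = 1, 1             # illustrative curve constants (assumption)

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M), polynomial basis."""
    c = 0
    for i in range(M - 1, -1, -1):
        c <<= 1
        if (c >> M) & 1:
            c ^= POLY     # reduce modulo P(x)
        if (b >> i) & 1:
            c ^= a        # field addition is XOR
    return c

def gf_inv(a):
    """Inverse of a != 0 computed as a^(2^M - 2)."""
    result = 1
    for _ in range(M - 1):
        a = gf_mul(a, a)
        result = gf_mul(result, a)
    return result

def on_curve(P):
    if P is None:         # None stands for the point at infinity O
        return True
    x, y = P
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A2, x2) ^ A6

def point_double(P):
    if P is None or P[0] == 0:      # 2O = O, and x = 0 means P = -P
        return None
    x1, y1 = P
    lam = x1 ^ gf_mul(y1, gf_inv(x1))
    x3 = gf_mul(lam, lam) ^ lam ^ A2
    return (x3, gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3))

def point_add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:                    # Q is either P or -P
        return point_double(P) if y1 == y2 else None
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A2
    return (x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1)

def scalar_mul_rtl(k, P):
    """Right-to-left binary method: scan the bits of k from the LSB."""
    Q, S = None, P                  # Q accumulates the result, S holds 2^i * P
    while k:
        if k & 1:
            Q = point_add(Q, S)     # independent of the doubling below
        S = point_double(S)
        k >>= 1
    return Q

P = (1, 6)                          # a point on the toy curve
print(scalar_mul_rtl(2, P))         # -> (0, 1)
print(scalar_mul_rtl(3, P))         # -> (1, 7)
```

Within one loop iteration, the addition reads the old value of S while the doubling produces the new one, so the two operations have no data dependency. This is the property that allows them to run in parallel in hardware.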
Investigated partitioning schemes
A hierarchy of operations involved in an elliptic curve scalar multiplication for the case of an elliptic curve over GF(2^m) is given in Fig. 4. Three levels of operations are involved in this hierarchy: scalar multiplication, kP, at the high level (H), point addition and point doubling at the medium level (M), and the GF(2^m) multiplication (MUL), inversion (INV), and addition (XOR) at the low level (L).
Functions belonging to each of these three hierarchy levels (high, medium, and low) can be implemented in C on the general-purpose microprocessor, as a C function for the MAP, or as HDL macros. Each of the resulting partitioning approaches is characterized by a three-letter codename, such as HML, OHL, OHM, etc. The first letter of this codename determines which level of operations (high, medium, low, or none) is implemented in C on a general-purpose microprocessor. The second letter determines which operations are described as a C function for the MAP, and the third letter determines which operations are implemented as HDL macros.
Unfortunately, an additional timing overhead is introduced during each MAP function call because of the control, input, and output transfer between the microprocessor board and the MAP board. In the current generation of the SRC system, this overhead has been measured to be in the range of 370 µs. This value is very large compared to the average execution time of the P+Q and 2P operations in hardware (in the range of 10-20 µs).
In order to minimize this overhead, the OHL partitioning scheme (shown in Fig. 5b) has been implemented. In this scheme, the MAP function is called only once and executes the entire high-level operation kP. As a result, the control, input, and output overheads are decreased, on average, by a factor of m, i.e., by at least two orders of magnitude for practical values of m (such as m=233 used in our experiments).
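The effect of calling the MAP function once instead of repeatedly can be checked with a short back-of-the-envelope calculation. The sketch below uses the 370 µs per-call overhead and m=233 quoted in the text; the assumption that the alternative scheme makes roughly one MAP call per bit of the scalar is ours, and the function name is hypothetical.

```python
def total_call_overhead_us(num_calls, per_call_us=370.0):
    """Total control/input/output overhead for a given number of MAP calls."""
    return num_calls * per_call_us

m = 233
per_bit = total_call_overhead_us(m)   # assumption: about one MAP call per scalar bit
single = total_call_overhead_us(1)    # one call executing the entire kP operation

print(f"{per_bit / 1000:.2f} ms vs {single / 1000:.2f} ms, ratio {per_bit / single:.0f}")
# -> 86.21 ms vs 0.37 ms, ratio 233
```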
Two possible implementation approaches have been considered in the case of the OHL partitioning: the iterative and the unrolled. In the iterative approach (see Fig. 6a), only one multiplier instantiation is used to implement the P+Q operation, and two multiplier instantiations are used to implement the 2P operation. These multipliers are used iteratively to perform a total of 3 multiplications per P+Q operation, and 5 multiplications per 2P operation. In the unrolled approach (see Fig. 6b), the number of instantiations of the multiplication macro is the same as the number of multiplications to be performed. The iterative approach is more efficient in terms of circuit area, and exploits the fact that only a limited number of multiplications can be executed in parallel because of the data dependencies between subsequent multiplications. On the other hand, the unrolled approach simplifies control logic and reduces circuit latency. From the programming point of view, both approaches require a similar amount of effort.
A further reduction in the execution time can be accomplished with the OHM partitioning, or by implementing the entire kP operation as a VHDL macro (see partitioning scheme OOH shown in Fig. 5d).
Each of the aforementioned partitioning schemes can be implemented in principle using either a single User FPGA or two User FPGAs. In case two FPGA devices are used, the first one is used to implement P+Q, input/output, and possibly control operations for kP (for the OHL and OHM approaches), and the second one is used to implement 2P, as shown in Fig. 6 for the two OHL implementation approaches.
Implementation of multiplication and inversion in GF(2^m)
Multiplication in GF(2^m) with polynomial basis representation is defined as follows. The inputs A = (a_0, a_1, ..., a_{m-1}) and B = (b_0, b_1, ..., b_{m-1}) ∈ GF(2^m), and the product C = AB = (c_0, c_1, ..., c_{m-1}), are treated as polynomials A(x), B(x), and C(x) with the respective coefficients. The dependence between these polynomials is given by

C(x) = A(x) B(x) mod P(x),

where P(x) is a constant irreducible polynomial of degree m. The straightforward shift-and-add algorithm for multiplication in GF(2^m) is given as Algorithm 3, and its implementation is presented in Fig. 7. This algorithm has been selected instead of other, more complex algorithms reported in the literature [8, 9] because, unlike these algorithms, it easily supports the 100 MHz clock frequency required by the SRC system. In our implementation, this multiplier performs a single multiplication in m+1 clock cycles.
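As an illustration of the shift-and-add technique, here is a minimal bit-serial sketch in Python. It is not a transcription of Algorithm 3: field elements are assumed to be stored as integers whose bit i holds the coefficient of x^i, and the function and parameter names are hypothetical.

```python
def gf2m_mul(a: int, b: int, m: int, poly: int) -> int:
    """Multiply a and b in GF(2^m); bit i of an int is the coefficient of x^i.

    poly encodes the irreducible polynomial P(x), including its x^m term.
    """
    c = 0
    for i in range(m - 1, -1, -1):   # scan the bits of b from b_{m-1} down to b_0
        c <<= 1                      # c(x) <- c(x) * x
        if (c >> m) & 1:             # degree reached m: reduce modulo P(x)
            c ^= poly
        if (b >> i) & 1:             # add (XOR) a(x) when the current bit of b is 1
            c ^= a
    return c

# Toy example in GF(2^4) with P(x) = x^4 + x + 1: x * x^3 = x^4 = x + 1
print(bin(gf2m_mul(0b0010, 0b1000, 4, 0b10011)))   # -> 0b11
```

The loop runs exactly m times, one bit of B per pass, which mirrors a multiplier that consumes one bit per clock cycle. For m=233 the same function can be called with the NIST reduction polynomial x^233 + x^74 + 1, i.e. `poly = (1 << 233) | (1 << 74) | 1`.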
Our implementation of inversion in GF(

Design methodology and testing
Our implementations are capable of handling elliptic curve operations over GF(2^m) for m ≤ 256. In particular, m=233 was chosen for our experiments, as this is one of the sizes recommended by NIST [7]. Additionally, our implementation can be easily extended to process larger values of m by using multiple memory locations to store a single element of GF(2^m). All hardware macros have been developed first using standard tools for simulation and synthesis of digital circuits, such as Aldec Active-HDL and Synplicity Synplify Pro. All macros have been optimized to work at the clock frequency of 100 MHz. The XOR operation did not need to be implemented as a user macro, as it is a standard macro in the SRC library. This macro is invoked automatically when the compiler encounters the XOR operator (denoted as '^') within a C MAP function.
The C MAP functions describe both the structure of the execution units (data flow) and the behavior of the control units. The contents of these functions determine the configuration of the User FPGAs during program execution. The subset of C accepted at this level is constantly increasing as a result of progress in the development of the SRC MAP compiler. Functions scheduled to run on the microprocessor can be described in standard C. The entire language is accepted at this level.
All our implementations have been tested for correct functionality using an optimized software implementation based on LiDIA, the public domain library for computational number theory [12].
The execution time on the MAP has been measured in the number of clock cycles using the standard SRC macro, read_timer(). The end-to-end time of C functions has been measured in time units using the C timer function of the Linux operating system, gettimeofday(). All measurements have been repeated 100 times, and the median values are reported in Section 7.
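The measurement procedure of repeating each run 100 times and reporting the median can be sketched as follows. This is only an analogue in Python: the authors used the SRC timer macro and gettimeofday() in C, whereas the sketch assumes Python's time.perf_counter() and statistics.median(); the function name is hypothetical.

```python
import statistics
import time

def median_runtime_us(func, repeats=100):
    """Run func repeatedly and return the median wall-clock time in microseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)

# Example: time a small placeholder workload
elapsed = median_runtime_us(lambda: sum(range(10_000)))
print(f"median over 100 runs: {elapsed:.1f} us")
```

Reporting the median rather than the mean keeps occasional outliers, such as runs interrupted by the operating system, from skewing the result.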
Results
The results of the timing measurements for all investigated partitioning schemes are summarized in [...].
The current version of the MAP compiler (SRC-6E Carte 1.4.1) optimizes performance over resource utilization. As it matures, the compiler is expected to balance high performance, ease of coding, and resource utilization to yield truly optimized logic.
Conclusions
Reconfigurable computers offer great promise for solving complex cryptographic problems with the speed of specialized hardware and the flexibility and productivity of software implementations. In this paper, we describe our experiences with programming one of the leading reconfigurable computers available on the market, the SRC-6E.
We have chosen as our benchmark the primary operation of Elliptic Curve Cryptosystems over GF(2^m) in polynomial basis representation: scalar multiplication. This operation is particularly challenging for reconfigurable computers because the primary optimization criterion is latency rather than throughput, and there is only a limited amount of parallelism involved in the medium-level operations, such as point addition and doubling. In spite of these constraints, a speed-up in the range of 8-9 has been demonstrated compared to the highly optimized microprocessor implementation using four different algorithm partitioning approaches (OHL iterative 2-chip, OHL unrolled 2-chip, OHM 2-chip, and OOH 1-chip).
More importantly, our study revealed the optimum boundary between hardware and software, and between the descriptions of hardware in VHDL vs. C, for the three-level hierarchy of operations constituting the Elliptic Curve scalar multiplication. This boundary had to take into account the trade-off between the end-to-end execution time, the resource utilization, and the designer's productivity and ability. While the first two criteria are relatively easy to quantify, the third one is more difficult to measure objectively, as it depends strongly on the designer's skills and background. Additionally, the relative importance and weight of particular criteria might vary depending on the particular application and design environment.
Assuming as a primary criterion the increased application developer productivity and an attempt to minimize the involvement of hardware designers and traditional HDL-based design methodology, we have determined an optimum solution. In this solution, referred to as the unrolled OHL scheme, the entire scalar multiplication is implemented in hardware, but only the low-level operations, GF(2^m) multiplication and inversion, needed to be described in VHDL. This partitioning scheme was shown to increase the execution time by only 8% compared to the scheme based on implementing the entire scalar multiplication in VHDL. This result was accomplished at the cost of an increased use of FPGA resources, such as CLB slices, used mostly as a source of additional flip-flops.
Our research demonstrated that a good knowledge of the system hardware architecture and programming model of a reconfigurable computer, and of the associated overheads, may be needed to fully utilize the potential offered by this promising technology.
