Abstract. We propose a superscalar coprocessor for high-speed curve-based cryptography. It accelerates scalar multiplication by exploiting instruction-level parallelism (ILP) dynamically and processing multiple instructions in parallel. The system-level architecture is designed so that the coprocessor can fully utilize the superscalar feature. The implementation results show that scalar multiplication of Elliptic Curve Cryptography (ECC) over $GF(2^{163})$, Hyperelliptic Curve Cryptography (HECC) of genus 2 over $GF(2^{83})$ and ECC over a composite field, $GF((2^{83})^2)$, can be improved by a factor of 1.8, 2.7 and 2.5, respectively, compared to a basic single-scalar architecture. This speed-up is achieved by exploiting parallelism in curve-based cryptography. The coprocessor uses a single instruction for all field operations such as multiplications and additions, and this instruction alone suffices to compute all point/divisor operations. Furthermore, we also provide a fair comparison between the three curve-based cryptosystems.
Introduction
Public-key cryptosystems form an essential building block for digital communication. Unlike secret-key algorithms that allow for a fast encryption of a large bulk of data, the importance of Public-Key Cryptography (PKC) is to have secure communications over insecure channels without prior exchange of a secret key. In addition, PKC enables digital signatures as an important cryptographic service. Diffie and Hellman introduced the idea of PKC [1] in the mid 70's.
Implementing PKC is a challenge for most application platforms varying from software to hardware. The reason is that one has to deal with very long numbers in conditions that are often constrained in area and power. For the choice of the implementation platform, several factors have to be taken into account.
Hardware solutions provide speed and better physical security, but their flexibility is limited. Software solutions provide that flexibility, but a pure software solution is not a feasible option in most resource-limited environments. Hardware/software co-design potentially offers an efficient design platform that explores the trade-off between cost, performance and security.
The most popular and most widely used public-key cryptosystems are RSA [2] and ECC [3, 4]. In embedded systems, ECC is considered a more suitable choice than RSA because ECC obtains higher performance, lower power consumption, and smaller area on most platforms. Another appealing candidate for PKC is HECC. Recently, many good results have appeared for software and hardware implementations of HECC. At the same time, more theoretical work has shown HECC to be secure also in the case of curves with a small genus [5].
A considerable amount of work has been reported on improving the performance of Elliptic Curve (EC) scalar multiplication. The work can be classified into the following categories. First of all, mathematical investigation has been done for various types of elliptic curves such as Koblitz curves. Secondly, various algorithms for scalar multiplication have been proposed, and criteria for improvements include performance as well as side-channel security. One of the best-known examples that meets both requirements is Montgomery's powering ladder [6]. Lastly, architecture-level improvements can be considered from a hardware implementation point of view. Our interest in this paper mainly lies at this level.
The contribution of this paper is in accelerating curve-based cryptosystems by deploying a superscalar architecture. The solution is algorithm-independent and can be applied to any scalar multiplication algorithm. Some previous work reported parallel use of modular arithmetic units for accelerating scalar multiplication [7, 8, 9, 10, 11, 12]. In those papers, point/divisor doubling and addition are reformulated so that they can take advantage of parallel processing. One original contribution is that our proposed architecture embeds an instruction scheduler that explores the best level of parallelism and assigns tasks to the processing units in an optimal way. In this way the parallelism within the operations can be found on-the-fly by dynamically checking the data dependency between the instructions. We also provide a fair comparison between three cryptosystems: ECC, HECC and ECC over a composite field. Namely, it is known that HECC of genus 2 can work in a field half the size of the one used for ECC while obtaining the same level of security. On the other hand, using ECC over $GF((2^p)^2)$, we end up with the same field arithmetic as HECC. In this way, another contribution of this paper lies in the system architecture of three curve-based cryptosystems enabling one to use the same amount of area.
The remainder of this paper is organized as follows. Section 2 gives a survey of relevant previous work on curve-based cryptography implementations. In Section 3, some background information on ECC and HECC is given. In Section 4 the architecture of our proposed coprocessor is explained. The details of our implementation are introduced in Section 5 and the results are shown for various implementation options in Section 6. Section 7 concludes the paper.
Previous Work
This section lists some relevant previous work. As already mentioned, there is a considerable amount of work done on hardware implementations, especially for ECC [13, 14] , but more recently also some on HECC. Recent improvements on HECC divisor operations' formulae [15, 16, 17] resulted in several hardware implementations featuring efficient HECC performances [18, 11] . The first result showing that HECC performance is comparable to the one of ECC is the work of Pelzl et al. [19] .
In 1989 Agnew et al. reported the first result for performing the elliptic curve operations in hardware [20]. Since then a substantial amount of work has dealt with hardware implementations of ECC, the majority of it over binary fields. In 2000 Orlando and Paar proposed a scalable elliptic curve processor architecture which operates over finite fields $GF(2^n)$ [13]. Gura et al. [14] have introduced a programmable hardware accelerator for ECC over $GF(2^n)$, which can handle arbitrary field sizes up to 255.
There is not much previous work on hardware implementations of HECC. The first complete hardware implementation of HECC was given by Boston et al. [21]. They designed a coprocessor for genus two curves over $GF(2^{113})$ and implemented it on a Xilinx Virtex-II FPGA. The algorithm of Cantor was used for all computations on Jacobians. On the other hand, the work of Elias et al. [18] used Lange's explicit formulae. The results reported were the fastest in hardware at the time. Wollinger et al. investigated an HECC implementation on a VLSI coprocessor. They compared coprocessors using affine and projective coordinates and concluded that the latter should be preferred for hardware implementations [11].
While ECC applications are highly developed and widely used in practice, the use of HECC is still mainly for research purposes. Previous work on exploring the parallelism between the point/divisor operations has been done for both ECC and HECC. Smart [7] showed that up to three field operations could be executed in parallel for the Hessian form of an elliptic curve. On the other hand, the work of Mishra investigated parallelism between divisor operations [10]. Both works are purely at the algorithmic level.
Curve-Based Cryptography
Here, we consider some background information for curve-based cryptography over binary fields; for hyperelliptic curves we are interested only in genus 2 curves. We mention the basic algorithms and the structure of the operations. Good references for the mathematical background are [22, 23, 24] .
The main operation in any curve-based primitive is scalar multiplication. The general hierarchical structure of the operations required for implementations of curve-based cryptography is given in Fig. 1(a). Point/divisor multiplication is at the top level. At the next (lower) level are the point/divisor group operations. The lowest level consists of the finite field operations, such as addition, multiplication and inversion, required to perform the group operations. The only difference between ECC and HECC is in the middle level, which consists of different sequences of operations. Those for HECC are more complex than the ECC point operations, but they use shorter operands. Inversion can also be performed with a chain of multiplications [25], so that hardware is needed only for finite field multiplication and addition. The corresponding hierarchy is illustrated in Fig. 1(b). We use this structure for our proposed coprocessor.
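To illustrate how inversion can be reduced to multiplications, the following minimal sketch computes $a^{-1} = a^{2^n - 2}$ by repeated squaring and multiplying. It is not the specific chain of [25]; the toy field $GF(2^4)$ with $P(x) = x^4 + x + 1$ and the names `gf_mul` and `gf_inv` are assumptions made only for this example.

```python
# Toy field GF(2^4) with P(x) = x^4 + x + 1 (an assumption for illustration).
# Field elements are ints whose bits are the polynomial coefficients.
N = 4
P = 0b10011


def gf_mul(a, b, n=N, p=P):
    """Multiply two field elements, reducing modulo p after every shift."""
    t = 0
    for i in reversed(range(n)):
        t <<= 1                 # t = t * x
        if t >> n:              # degree reached n: subtract (XOR) p
            t ^= p
        if (a >> i) & 1:        # add b when coefficient a_i is 1
            t ^= b
    return t


def gf_inv(a, n=N, p=P):
    """Return a^(2^n - 2), which equals a^(-1), using multiplications only."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    result = 1
    power = gf_mul(a, a, n, p)  # a^2
    # 2^n - 2 = 2 + 4 + ... + 2^(n-1), so multiply a^2, a^4, ..., a^(2^(n-1)).
    for _ in range(n - 1):
        result = gf_mul(result, power, n, p)
        power = gf_mul(power, power, n, p)
    return result
```

In this simple form an inversion costs roughly 2(n − 1) multiplications; shorter chains exist, but the point is only that no dedicated inverter is required.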
ECC over a Binary Field
ECC relies on a group structure induced on an elliptic curve. A set of points on an elliptic curve (with one special point added, the so-called point at infinity $O$) together with point addition as a binary operation has the structure of an abelian group. As we consider a finite field of characteristic 2, i.e. $GF(2^n)$, a non-supersingular elliptic curve $E$ over $GF(2^n)$ is defined as the set of solutions $(x, y) \in GF(2^n) \times GF(2^n)$ of the equation $y^2 + xy = x^3 + ax^2 + b$, where $a, b \in GF(2^n)$ and $b \neq 0$, together with $O$.
HECC
Let $\overline{GF(2^n)}$ be an algebraic closure of the field $GF(2^n)$. Here we consider a hyperelliptic curve $C$ of genus $g = 2$ over $GF(2^n)$, which is given by an equation of the form:
$$C: y^2 + h(x)y = f(x), \qquad (1)$$
where $h(x) \in GF(2^n)[x]$ is a polynomial of degree at most $g$ ($\deg(h) \le g$) and $f(x)$ is a monic polynomial of degree $2g + 1$ ($\deg(f) = 2g + 1$). Also, there are no solutions $(x, y) \in \overline{GF(2^n)} \times \overline{GF(2^n)}$ which simultaneously satisfy equation (1) and the equations $2y + h(x) = 0$, $h'(x)y - f'(x) = 0$. Such points are called singular points. For genus 2, in the general case the following equation is used:
$$y^2 + (h_2 x^2 + h_1 x + h_0)y = x^5 + f_4 x^4 + f_3 x^3 + f_2 x^2 + f_1 x + f_0.$$
A divisor $D$ is a formal sum of points on $C$, and the Jacobian of $C$ is the quotient group of the divisors of degree zero modulo the principal divisors. A divisor $D$ is called principal if $D = \mathrm{div}(f)$ for some element $f$ of the function field of $C$ ($\mathrm{div}(f) = \sum_{P \in C} \mathrm{ord}_P(f) P$). The discrete logarithm problem in the Jacobian is the basis of security for HECC. In practice, the Mumford representation, according to which each divisor is represented as a pair of polynomials $[u, v]$, is usually used. Here, $u$ is monic of degree 2, $\deg(v) < \deg(u)$ and $u \mid f - hv - v^2$ (so-called reduced divisors). For implementations of HECC, we need to implement the multiplication of elements of the Jacobian, i.e. divisors, by a scalar.
ECC over a Composite Field
With respect to cryptographic security it is typically recommended to use fields $GF(2^p)$ where $p$ is a prime. As an example we consider the case where $p = 163$. As already mentioned, HECC on a curve of genus 2 allows one to work in a finite field whose bit-length is shorter by a factor of 2 compared with ECC. That means that for an equivalent level of security we should choose $GF(2^{83})$. We get a similar situation when considering ECC over a quadratic extension of $GF(2^{83})$, i.e. the composite field $GF((2^{83})^2) \cong GF(2^{83})[t]/(g(t))$, where $g(t)$ is an irreducible polynomial over $GF(2^{83})$ and $\deg(g) = 2$. In this way one can obtain a speed-up and benefit even more from the parallelism. The reason is that in a composite field each element is represented as $c = c_1 t + c_0$ where $c_0, c_1 \in GF(2^{83})$, and a multiplication in this field takes 3 multiplications and 4 additions in $GF(2^{83})$ [26].
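The count of 3 multiplications and 4 additions can be checked with a Karatsuba-style sketch. The text does not state which $g(t)$ is used; the example below assumes $g(t) = t^2 + t + 1$, which is irreducible over $GF(2^m)$ for odd $m$, and uses the toy base field $GF(2^3)$ with $P(x) = x^3 + x + 1$ in place of $GF(2^{83})$. All function names are hypothetical.

```python
# Toy base field GF(2^3), P(x) = x^3 + x + 1, standing in for GF(2^83).
# Assumed extension polynomial: g(t) = t^2 + t + 1, so t^2 = t + 1.
N = 3
P = 0b1011


def base_mul(a, b):
    """Multiply in the base field GF(2^N) by shift-and-add with reduction."""
    t = 0
    for i in reversed(range(N)):
        t <<= 1
        if t >> N:
            t ^= P
        if (a >> i) & 1:
            t ^= b
    return t


def ext_mul(a, b):
    """Multiply a = (a1, a0) and b = (b1, b0), each meaning c1*t + c0.

    Uses 3 base-field multiplications and 4 base-field additions (XORs).
    """
    a1, a0 = a
    b1, b0 = b
    m0 = base_mul(a0, b0)
    m1 = base_mul(a1, b1)
    m2 = base_mul(a0 ^ a1, b0 ^ b1)  # additions 1 and 2
    c1 = m2 ^ m0                     # addition 3: a1*b0 + a0*b1 + a1*b1
    c0 = m0 ^ m1                     # addition 4: a0*b0 + a1*b1
    return (c1, c0)


def ext_mul_schoolbook(a, b):
    """Reference version with 4 multiplications, only to cross-check ext_mul."""
    a1, a0 = a
    b1, b0 = b
    hi = base_mul(a1, b1)            # coefficient of t^2, folded via t^2 = t + 1
    mid = base_mul(a1, b0) ^ base_mul(a0, b1)
    lo = base_mul(a0, b0)
    return (mid ^ hi, lo ^ hi)
```

With this choice of $g(t)$ the three base-field multiplications are mutually independent, which is the kind of parallelism that several multipliers working side by side can exploit.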
Algorithms for Our Implementations
In our implementations scalar multiplication is achieved by using the NAF algorithm [23]. The scalar is recoded in non-adjacent form (NAF) and scalar multiplication is done with a series of additions and subtractions of elliptic curve points. We also use projective coordinates for all implementations. Furthermore, we have rewritten the formulae from [23, 16] for EC point operations and HECC divisor doubling, respectively, to obtain an optimal usage of our new datapath. We use the same approach to get the formulae for HECC divisor addition in the case of mixed coordinates. Our datapath performs one basic operation, $AB + C$ or $A(B + D) + C$, over a binary field. This operation can be used for the whole sequence of point/divisor operations. For example, by using the $A(B + D) + C$ operation, the formulae for HECC divisor addition take 48 instructions instead of 44 multiplications and a large number of additions.
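As an illustration of the NAF recoding, the sketch below computes the signed digits of a scalar and uses them in a generic left-to-right double-and-add/subtract loop. The group operations are passed in as callables so that the example stays independent of any curve arithmetic; the names `naf` and `scalar_mult` and this interface are assumptions for illustration, not the sequence used in the coprocessor.

```python
def naf(k):
    """Return the non-adjacent form of k >= 0 as digits in {-1, 0, 1},
    least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)  # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def scalar_mult(k, point, add, neg, double, identity):
    """Left-to-right scalar multiplication driven by the NAF digits of k."""
    acc = identity
    for d in reversed(naf(k)):
        acc = double(acc)
        if d == 1:
            acc = add(acc, point)
        elif d == -1:
            acc = add(acc, neg(point))  # point subtraction
    return acc
```

For example, `naf(7)` returns `[-1, 0, 0, 1]`, i.e. 7 = 8 − 1, so three doublings and one subtraction replace the two additions that plain binary recoding would need.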
Architecture of the Curve-Based Coprocessor

System Architecture
The proposed architecture of the curve-based cryptosystems is composed of the main controller, several Modular Arithmetic Logic Units (MALUs) and the coprocessor memory that shares intermediate variables between the MALUs (i.e. the so-called shared memory). The block diagram of the cryptosystem is shown in Fig. 2. The configuration of the coprocessor is flexible and ranges from the smallest to the fastest implementation depending on the target application. Some components can be added or removed, as will be explained next.
The main CPU communicates with the coprocessor through memory-mapped I/O (e.g. an SRAM interface), which provides three types of 32-bit inputs and outputs. One of them is a signal that tells the controller to stop sending instructions when the instruction buffer is full. A 32-bit input/output passes data back and forth between the main CPU and the coprocessor, and a 32-bit output is used to send instructions. The data transfer between the main CPU and the coprocessor is controlled by a Data Bus Controller (DBC). When the SRAM attached to the main CPU is used for storing intermediate variables of the HECC/ECC operations, the coprocessor can be constructed without the coprocessor memory. Alternatively, for the purpose of reducing the I/O transfer overhead, the data memory can be embedded in the coprocessor. In this case, the path through the DBC is only activated when an initial point and the parameters of an elliptic curve are sent to the RAM, or when the result is retrieved.
Instructions are sent to the MALU either from the main CPU or from pre-set micro code in the μ-code RAM. When the main CPU is in charge of dispatching instructions, the Instruction Bus Controller (IBC) block can be detached from the coprocessor. In this case, the throughput of issuing instructions may not be high enough for the MALU(s) to be utilized effectively. In contrast, when the μ-code RAM is used for assisting the main CPU, the IBC can handle one instruction per cycle. For instance, the sequence of point doubling is stored in the μ-code RAM and the main CPU calls it as a single instruction. Thus multiple MALUs can be activated in parallel without any instruction stalls. During point multiplication, the IBC keeps reading instructions from the μ-code RAM and stores them in an Instruction Queue Buffer (IQB) unless the IQB is full. The IBC looks for instruction-level parallelism (ILP) by checking the data dependency of the instructions in the IQB and forwards them to the MALU(s) (see Sections 4.2 and 4.4).
Modular Arithmetic Logic Unit
In this section the architecture of the MALU is briefly explained. The datapath of the MALU is an MSB-first bit-serial polynomial-basis $GF(2^n)$ multiplier as illustrated in Fig. 3(a). This is a hardware implementation that computes $A(x)B(x) \bmod P(x)$.
The proposed MALU computes $A(x)B(x) + C(x) \bmod P(x)$.
By providing the intermediate result $T_{next}$ as the next input $T$ and repeating the same computation $n$ times, one can obtain the result. A detailed explanation is given in [27].
Moreover, by providing $B(x) + D(x)$ in place of $B(x)$, the operation $A(x)(B(x) + D(x)) + C(x) \bmod P(x)$ can also be supported. This operation requires additional XORs and selector logic for the registers storing the coefficients of $B(x)$ or $(B(x) + D(x))$.
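A bit-level software model may help to make the MALU operation concrete. The sketch below follows the MSB-first scheme: in each of the $n$ steps the accumulator is multiplied by $x$, reduced modulo $P(x)$, and the operand is added when the current coefficient of $A(x)$ is 1. Adding $C(x)$ after the last step is an assumption of this model (the hardware may inject $C(x)$ differently), and the function name `malu` and its signature are hypothetical.

```python
def malu(a, b, c, d, n, p):
    """Model of A(x)(B(x) + D(x)) + C(x) mod P(x) over GF(2^n).

    Operands are ints whose bits are polynomial coefficients of degree < n;
    p encodes P(x) including its x^n term. Pass d = 0 to obtain AB + C.
    """
    operand = b ^ d              # B(x) + D(x): addition in GF(2^n) is XOR
    t = 0
    for i in reversed(range(n)):  # MSB-first: a_(n-1) down to a_0
        t <<= 1                   # T = T * x
        if t >> n:                # reduce when the degree reaches n
            t ^= p
        if (a >> i) & 1:          # T = T + a_i * (B + D)
            t ^= operand
    return t ^ c                  # assumption: C(x) is added at the end


# Example in the toy field GF(2^4) with P(x) = x^4 + x + 1:
# x^3 * x = x^4 = x + 1 (mod P), and adding C = x^2 gives x^2 + x + 1.
example = malu(0b1000, 0b0010, 0b0100, 0, 4, 0b10011)
```

A digit-serial datapath with digit size $d$ would process $d$ coefficients of $A(x)$ per clock cycle instead of one, trading area for cycle count.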
The proposed datapath is scalable in the digit size $d$ (in the vertical direction in Fig. 3(b)), which can be decided by exploring the best combination of performance and cost. The field size $n$ is determined by the key length. A larger field size can also be obtained by interconnecting several MALUs in the horizontal direction. Hence, various implementation options can be chosen with the MALU. For instance, the coprocessor can support arbitrary field sizes up to 335 when using four sets of the MALU whose field size is 83.
The MALU Instruction
Here, a new instruction called $\mathrm{MALU}_n$ is defined. It is worth mentioning that this is the only instruction that operates on the datapath.
$$\mathrm{MALU}_n(A, B, C, D) = A(x)(B(x) + D(x)) + C(x) \bmod P(x). \qquad (2)$$
When using the $A(x)B(x) + C(x) \bmod P(x)$ operation, one can ignore $D(x)$ by setting $D(x) = 0$. The whole procedure to execute $\mathrm{MALU}_n$ starts with an instruction fetch and decode (IF/D). Then, the variables $A(x)$, $B(x)$, $C(x)$ and $D(x)$ are loaded from RAM (R) for the succeeding execution stage. The result is stored to RAM (W) in the last step. Note that data at different addresses can be read in parallel by the different MALUs by replicating the RAM (i.e. four clones of single-port RAMs when using four MALUs). The write cycle is determined by the number of instructions that can be issued in parallel. When using multiple MALUs, the write operations of the MALUs are performed in different cycles to avoid memory-write conflicts. This is illustrated in Fig. 4.
Dynamic Scheduling
ILP is exploited for all instructions as long as two or more instructions are buffered in the IQB. Here, we introduce our strategy to find ILP. A $\mathrm{MALU}_n$ instruction has four source operands and outputs the result to RAM, i.e. $\mathrm{MALU}_n$ deals with five addresses when operating $A(x)(B(x) + D(x)) + C(x) \bmod P(x)$. Here, let $A$, $B$, $C$, $D$ be the addresses of the four inputs and $R$ be the address where the result is stored.
The $\mathrm{MALU}_n$ also refers to $P(x)$, which is stored in RAM. Including out-of-order execution, the following two types of dependencies are possible between two instructions, $\mathrm{MALU}_n^i$ and $\mathrm{MALU}_n^j$ ($i$ and $j$ are labels indicating the order of the instructions in the IQB). By checking the following two dependencies for all $i$ and $j$ that satisfy $i < j < ILP_D$, where $ILP_D$ is the size of the instruction window, one can determine the number of instructions to be issued in parallel.
Read-After-Write (RAW) dependency check for in-order execution.
If the result $R_i$ of the instruction $\mathrm{MALU}_n^i$ is an input of a following instruction, i.e. $R_i$ equals one of its source addresses $A$, $B$, $C$ or $D$, that following instruction cannot be issued until $\mathrm{MALU}_n^i$ completes the operation.
RAW dependency check for out-of-order execution: $(R_i \neq A_j) \wedge (R_i \neq B_j) \wedge (R_i \neq C_j) \wedge (R_i \neq D_j)$.
If not all of these conditions are true, the instruction $\mathrm{MALU}_n^j$ cannot be issued until the instruction $\mathrm{MALU}_n^i$ finishes. An example using the actual sequence of EC point doubling is shown in the Appendix.
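The dependency check can be sketched in software as follows. This is a simplified model under stated assumptions, not the scheduler hardware: each instruction is a tuple of a result address R and four source addresses, the first `ilp_d` entries of the queue form the window, and an instruction is considered issuable when its sources differ from the result address of every earlier instruction in the window. The names `Malu`, `raw_free` and `issuable` are hypothetical.

```python
from collections import namedtuple

# One MALU instruction: result address R and source addresses A, B, C, D.
Malu = namedtuple("Malu", "R A B C D")


def raw_free(earlier, later):
    """True if `later` reads no address that `earlier` writes (no RAW hazard)."""
    return earlier.R not in (later.A, later.B, later.C, later.D)


def issuable(iqb, ilp_d, out_of_order=True):
    """Return window positions of the instructions that may be issued together.

    In-order mode stops at the first blocked instruction; out-of-order mode
    skips it and keeps looking at later instructions in the window.
    """
    window = iqb[:ilp_d]
    ready = []
    for j, instr in enumerate(window):
        if all(raw_free(window[i], instr) for i in range(j)):
            ready.append(j)
        elif not out_of_order:
            break
    return ready


# Hypothetical queue: the second instruction reads address 10, which the
# first one writes; the third instruction is independent of both.
queue = [
    Malu(R=10, A=1, B=2, C=3, D=4),
    Malu(R=11, A=10, B=2, C=3, D=4),
    Malu(R=12, A=5, B=6, C=7, D=8),
]
```

With this queue and a window of three, out-of-order checking can issue the first and third instructions together, whereas in-order checking issues only the first.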
The proposed architecture needs no check for Write-After-Read and Write-After-Write dependencies, contrary to a general superscalar machine. This is because $\mathrm{MALU}_n$ is a fixed-length multi-cycle instruction, and hence we can skip those dependencies in the sequence of point/divisor operations. If the size of the instruction window is $ILP_D$, the number of conditions to check becomes $4(ILP_D - 1)^2$. The hardware complexity of the ILP check grows with a larger $ILP_D$, but in return further parallelism can be expected.
Implementation
Instruction Sets for the Coprocessor
Table 1 shows some of the primary instructions for the coprocessor. The input registers of the MALU are set via data-bus ports. When using a 32-bit CPU such as the ARM, setting a register whose address is src1 requires three STORE(@dst) instructions for HECC over $GF(2^{83})$. After all operands are set in the corresponding registers, a MALU(@dst,@src1-4) operation is executed. When using the μ-code configuration, it is possible to define an instruction that consists of a series of MALU(@dst,@src1-4) operations. In this paper, point/divisor operations are all composed of the MALU instruction (see the Appendix).
System Configurations
The system configurations are explored in two steps. First, in order to make the best use of the superscalar coprocessor, four different coprocessor configurations are explored as listed in Fig. 5(a) . This is the so-called vertical exploration of the hardware/software co-design. Secondly, the performance comparison is made with HECC, ECC and ECC over a composite field by changing the number of MALUs. Thus the coprocessor is also investigated from a parallel processing point of view (horizontal exploration).
Design Environment
The proposed design is constructed in the GEZEL hardware/software co-design environment with the ARM Instruction Set Simulator (ISS) [28]. The platform provides cycle-accurate simulations for various hardware/software system configurations. As mentioned in Section 4, the coprocessor is attached to the memory-mapped interface of the ARM. Thus, various types of system configurations are examined to verify the functionality and estimate the performance at the system level. The GEZEL code is automatically translated into VHDL code that can be used for an FPGA prototype.
Results
Vertical Exploration of System Architecture with Coprocessor
Fig. 5(b) compares the performance of HECC scalar multiplication for different system configurations. For TYPE I and II, the I/O transfer overhead between the main CPU and the coprocessor accounts for the majority of the cycles (about 97%). The reason for this is that the temporary data variables are stored in the memory of the main CPU and travel through the CPU to the coprocessor for processing. For TYPE III, the I/O transfer overhead is reduced significantly due to the data memory allocated in the coprocessor. However, the I/O overhead is still dominant because the main CPU issues instructions via the slow communication channel. The parallel processing feature therefore cannot improve the performance in such system settings. Note that the share of the I/O transfer overhead appears smaller for a smaller $d$, since the datapath then takes more clock cycles. Hence, with TYPE III it is important to find the best digit size $d$ that can hide the I/O transfer overhead. This paper, however, focuses on TYPE IV for a deeper investigation of the parallelism in order to obtain high performance, because TYPE IV assures the highest parallelism regardless of the value of $d$.
Performance Comparison Between Three Cryptosystems
The comparison is based on the MALU whose field size is 83, or $\mathrm{MALU}_{83}$. Up to four clones of the $\mathrm{MALU}_{83}$ are embedded in the coprocessor to observe the performance improvement with the superscalar architecture. For ECC, a pair of $\mathrm{MALU}_{83}$ is equivalent to one $\mathrm{MALU}_{163}$ in terms of hardware cost. The overall performance improves as the number of $\mathrm{MALU}_{83}$ increases for both operation types. A large $ILP_D$ also helps to exploit more parallelism and leads to a higher performance. The results show the effectiveness of an operation of the form $A(B + D) + C$, especially for ECC over a composite field. In our case, the performance of ECC is better than that of the other two cryptosystems on equivalent hardware resources. The results are also summarized in Table 2.
In order to investigate the performance bottleneck of HECC and ECC, the clock cycles required for scalar multiplication are split into two parts: one for the memory accesses and one for the data processing in the datapath. As can be seen from Fig. 7, the operation form $A(B + D) + C$ introduces more memory accesses, while the data can be processed in fewer clock cycles. Overall, the proposed superscalar feature reduces the clock cycles of both the coprocessor memory accesses and the datapath operations. The memory accesses of HECC become dominant as more parallelism is introduced. On the other hand, the memory accesses in ECC account for less than 30% of the total clock cycles. This explains why scalar multiplication of HECC is eventually slower than that of ECC on equivalent hardware resources.
Prototype Results on FPGA
Based on the performance observation, the coprocessor is prototyped with the system configuration of $d = 12$ and $ILP_D = 6$ on a Virtex-II PRO (XC2VP30). The operation that the MALU supports is $A(B + D) + C$. The coprocessor memory consists of several 32×84-bit single-port RAMs, and one RAM is assigned to each $\mathrm{MALU}_{83}$. The μ-code program is implemented as an LUT ROM. As shown in Table 3, our HECC results show a better trade-off between cost and performance than the previous work. With regard to the ECC implementation, our result is based on the IEEE P1363-compliant sequence [23] and is not as fast as some previous work [13, 29]. However, considering the flexibility of our proposed coprocessor, the difference can be regarded as small.
Conclusions
This paper introduced a superscalar coprocessor that can deal with three different curve-based cryptosystems. The implementation results showed that scalar multiplication of ECC over $GF(2^{163})$, HECC of genus 2 over $GF(2^{83})$ and ECC over a composite field, $GF((2^{83})^2)$, was improved by a factor of 1.8, 2.7 and 2.5, respectively, compared to a basic single-scalar architecture. This speed-up was achieved by vertical and horizontal exploration of the system architecture to exploit parallelism in curve-based cryptography. In our design, ECC showed better performance than the other two cryptosystems on the same amount of hardware resources. All operations in the three curve-based cryptosystems were performed with only one instruction that can be flexibly defined as $AB + C$ or $A(B + D) + C$.
