Abstract-This paper presents a reconfigurable curve-based cryptoprocessor that accelerates scalar multiplication of Elliptic Curve Cryptography (ECC) and HyperElliptic Curve Cryptography (HECC) of genus 2 over $GF(2^n)$. By allocating $\beta$ copies of processing cores that embed reconfigurable Modular Arithmetic Logic Units (MALUs) over $GF(2^n)$, the scalar multiplication of ECC/HECC can be accelerated by exploiting Instruction-Level Parallelism (ILP). The supported field size can be arbitrary up to $\beta(n + 1) - 1$. The superscaling feature is facilitated by defining a single instruction that can be used for all field operations and point/divisor operations. In addition, the cryptoprocessor is fully programmable and it can handle various curve parameters and arbitrary irreducible polynomials. The cost, performance, and security trade-offs are thoroughly discussed for different hardware configurations and software programs. The synthesis results with a 0.13-μm CMOS technology show that the proposed reconfigurable cryptoprocessor runs at 292 MHz, whereas field sizes of up to 587 bits can be supported. The compact and fastest configuration of our design is also synthesized with a fixed field size and irreducible polynomial. The results show that the scalar multiplication of ECC over $GF(2^{163})$ and HECC over $GF(2^{83})$ can be performed in 29 and 63 μs, respectively.

INTRODUCTION
Since Diffie and Hellman introduced the idea of Public-Key Cryptography (PKC) [1] in the mid-1970s, public-key cryptosystems have been an essential building block for digital communication. PKC allows for secure communications over insecure channels without prior exchange of a secret key. It can offer both key exchange and digital signature. The most popular and most widely used PKCs are RSA [2] and Elliptic Curve Cryptography (ECC) [3], [4]. In embedded systems, ECC is considered a more suitable choice than RSA because ECC obtains higher performance, lower power consumption, and smaller area on most platforms. Another appealing candidate for PKC is HyperElliptic Curve Cryptography (HECC). Recently, many software and hardware implementations of HECC have been described, whereas more theoretical work has shown that HECC is also secure for curves with a small genus [5]. Nevertheless, the performance of PKC is still much lower than that of private-key cryptography, such as AES [6].
Implementing a fast PKC is a challenge for most application platforms, varying from software to hardware. For the choice of the implementation platform, several factors have to be taken into account. Application-Specific Integrated Circuit (ASIC) solutions provide the speed and more physical security, but their flexibility is limited. For that property, software solutions are needed; however, a pure software solution is not a feasible option because of low performance. Application-Specific Instruction set Processor (ASIP) architectures based on hardware/software codesign potentially allow an efficient design platform that offers a trade-off between cost, performance, and security.
A considerable amount of work has been reported on improving the performance of Elliptic Curve (EC) scalar multiplication. The work can be classified into the following categories: First, mathematical investigations have been reported for various types of ECs, for example, Koblitz curves [7]. Second, various algorithms for scalar multiplication have been proposed, and criteria for improvements include performance and side-channel security. One of the best-known examples that meets both requirements is Montgomery's powering ladder [8]. Third, various types of coordinates have been proposed, as well as a number of approaches to speed up finite-field arithmetic. Last, architecture-level improvements can be considered from a hardware implementation point of view.
The first contribution of this paper is the acceleration of curve-based cryptosystems by deploying a superscalar architecture. The solution is algorithm independent and can be applied to any scalar multiplication algorithm. We discuss the performance improvement for ECC and HECC over binary fields. Some previous work reported parallel use of modular arithmetic units to accelerate scalar multiplication [9], [10], [11], [12], [13], [14]. In those papers, point/divisor doubling and addition operations are reformulated so that they can take advantage of parallel processing. In contrast, our proposed architecture embeds an instruction scheduler that explores the highest level of parallelism and assigns tasks to the processing units in an optimal way. This way, the parallelism within the operations can be found on the fly by dynamically checking the data dependencies in the instructions.
The second contribution of this research is the design of a reconfigurable data path over binary fields. In order to support multiple curve-based cryptosystems and various field sizes for them, it is necessary to provide different field-length modular operations, for example, 193 bits for ECC and 97 bits for HECC. In [15], Satoh and Takano solve this problem by using an $r$-bit $\times$ $r$-bit multiplier and applying an algorithm originally used for software implementations. In other words, the advantage of their solution is that the operand size can be freely chosen, limited only by the size of the memory for storing intermediate variables. The area and time complexities of modular multiplications are considered, respectively, as $O(r^2)$ and $O(m^2)$, where $n = m \cdot r$ is the field size. For high-speed modular multiplications, the parameter $r$ needs to be large and the data path delay of the multiplier becomes longer. In contrast, the critical path delay is independent of the field size in our solution, where an $n$-bit $\times$ $d$-bit digit-parallel multiplier is used. Furthermore, the organization of the multiplier can be reconfigured by changing the interconnections between processing cores that have a Modular Arithmetic Logic Unit (MALU) and memory. The so-called coarse-grained reconfigurable data path offers high-speed modular multiplications, efficient use of hardware resources for various field sizes, and support for arbitrary field sizes by providing enough MALU cores.
The third contribution of this paper is a fair comparison between ECC and HECC. For HECC of genus 2, the field size is half of that for ECC at the same level of security [16]. Our programmable architecture enables one to use the same hardware design for the two curve-based cryptosystems. Moreover, we also explore different arithmetic operations in the MALU and examine the effects on the level of parallelism in ECC and HECC. As a result, we discuss the trade-offs between cost, performance, and security.
The remainder of this paper is organized as follows: Section 2 gives a survey of relevant previous work and some background information for implementations of curve-based cryptography. In Section 3, the architecture of our proposed cryptoprocessor is explained. The performance is evaluated with a system-level simulation and the results are reported in Section 4. The details of our implementation are introduced in Section 5 and the results are shown for various implementation options in Section 6. Section 7 concludes the paper.
CURVE-BASED CRYPTOGRAPHY
Here, we consider some background information for curve-based cryptography over binary fields. For hyperelliptic curves, we are interested in genus 2 curves only. We mention the basic algorithms and the structure of the operations. Good references for the mathematical background are [17], [16], [18].
The main operation in any curve-based primitive is point/divisor multiplication (aka scalar multiplication). The general hierarchical structure for operations required for implementations of curve-based cryptography is given in Fig. 1a . Point/Divisor multiplication is at the top level. At the next (lower) level are the point/divisor group operations. The lowest level consists of finite-field operations, such as finite-field addition, multiplication, and inversion, required to perform the group operations. The only difference between ECC and HECC is the sequence of operations in the middle level. The sequence for HECC is more complex when compared with the ECC point operations; however, HECC uses shorter operands. One can also perform inversion with a chain of multiplications [19] and only provide hardware for finite-field addition and multiplication. The corresponding hierarchy is illustrated in Fig. 1b . The hierarchy uses several copies of operation units at the lowest level to accelerate point/divisor group operations and inversions. We use this structure for our cryptoprocessor.
ECC over a Binary Field
ECC relies on a group structure induced on an EC. A set of points on an EC (with one special point added, that is, the so-called point at infinity $O$), together with a point addition as a binary operation, has the structure of an abelian group. As we consider a finite field of characteristic 2, that is, $GF(2^n)$, a nonsupersingular EC $E$ over $GF(2^n)$ is defined as the set of solutions $(x, y) \in GF(2^n) \times GF(2^n)$ of the equation
$$y^2 + xy = x^3 + ax^2 + b, \tag{1}$$
where $a, b, x, y \in GF(2^n)$ and $b \neq 0$, together with $O$. The inverse of the point $P = (x_1, y_1)$ is $-P = (x_1, x_1 + y_1)$. The sum $P + Q$ of the points $P = (x_1, y_1)$ and $Q = (x_2, y_2)$ ($P, Q \neq O$ and $P \neq \pm Q$) is the point $R = (x_3, y_3)$. Here,
$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \quad y_3 = \lambda(x_1 + x_3) + x_3 + y_1, \quad \lambda = \frac{y_1 + y_2}{x_1 + x_2}. \tag{2}$$
This operation is called point addition. For $P = Q$, the point doubling formulas are
$$x_3 = \lambda^2 + \lambda + a, \quad y_3 = \lambda(x_1 + x_3) + x_3 + y_1, \quad \lambda = x_1 + \frac{y_1}{x_1}. \tag{3}$$
The point at infinity $O$ is the neutral element, similar to the number 0 in ordinary addition. Thus, $P + O = P$ and $P + (-P) = O$ for all points $P$. The scalar multiplication, that is, the multiplication of a point $P$ on the curve with a scalar $k$, is the main operation for ECC. The scalar multiplication $kP$ can be computed as a combination of sequential point doublings and point additions. There are several computation sequences to compute point doubling and addition, including the recommendation in IEEE P1363 [20]. The selection of the sequences also has a great impact on the performance and cost of curve-based cryptosystems. The sequences are generally implemented as a controller block (for example, a Finite-State Machine (FSM)) in hardware design. The number of modular multiplications and the required memory size vary according to the sequence. Moreover, some sequences imply parallelism in point doubling/addition. In this sense, hardware/software codesign, where the arithmetic is executed on hardware acceleration units while the sequences run in software, is an attractive choice for curve-based cryptoprocessors because it offers performance comparable to that of an ASIC while maintaining the flexibility to support a wide range of curve options.
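The group law and the double-and-add idea described above can be illustrated with a minimal Python sketch. It uses affine coordinates on a toy field, which is not what the cryptoprocessor does (the implementation uses projective coordinates and much larger fields). The field $GF(2^4)$, the modulus $x^4 + x + 1$, the curve constants $a = b = 1$, and all function names are assumptions chosen only for illustration. The negative of a point is taken as $(x, x + y)$, the standard form for curves $y^2 + xy = x^3 + ax^2 + b$.

```python
# Toy model of affine point arithmetic on y^2 + xy = x^3 + a*x^2 + b over GF(2^4).
# Field size, modulus, and curve constants are illustrative assumptions.
N = 4
POLY = 0b10011   # x^4 + x + 1
A, B = 1, 1      # hypothetical curve parameters (b != 0)
INF = None       # point at infinity O

def fmul(a, b):
    """Polynomial-basis multiplication modulo POLY (shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= POLY
    return r

def finv(a):
    """Inverse as a^(2^N - 2), computed as a plain chain of multiplications."""
    r = 1
    for _ in range(2**N - 2):
        r = fmul(r, a)
    return r

def on_curve(p):
    if p is INF:
        return True
    x, y = p
    lhs = fmul(y, y) ^ fmul(x, y)
    rhs = fmul(fmul(x, x), x) ^ fmul(A, fmul(x, x)) ^ B
    return lhs == rhs

def neg(p):
    """In characteristic 2, -(x, y) = (x, x + y) on this curve form."""
    return INF if p is INF else (p[0], p[0] ^ p[1])

def add(p, q):
    """Point addition; falls back to doubling when p == q."""
    if p is INF:
        return q
    if q is INF:
        return p
    x1, y1 = p
    x2, y2 = q
    if x1 == x2:
        if y1 != y2 or x1 == 0:       # q == -p (includes points of order 2)
            return INF
        lam = x1 ^ fmul(y1, finv(x1))             # doubling slope
        x3 = fmul(lam, lam) ^ lam ^ A
    else:
        lam = fmul(y1 ^ y2, finv(x1 ^ x2))        # addition slope
        x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = fmul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)

def scalar_mul(k, p):
    """Left-to-right binary double-and-add computation of kP."""
    r = INF
    for bit in bin(k)[2:]:
        r = add(r, r)
        if bit == "1":
            r = add(r, p)
    return r

# All affine points of the toy curve, found by exhaustive search.
points = [(x, y) for x in range(2**N) for y in range(2**N) if on_curve((x, y))]
```

Every field operation in `add` is an addition (XOR), a multiplication, or an inversion, which is the lowest level of the hierarchy in Fig. 1.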
HECC of Genus 2
Let $\overline{GF(2^n)}$ be an algebraic closure of the field $GF(2^n)$. Here, we consider a hyperelliptic curve $C$ of genus $g = 2$ over $GF(2^n)$, which is given in the form
$$C: y^2 + h(x)y = f(x), \tag{4}$$
where $h(x) \in GF(2^n)[x]$ is a polynomial of degree $\deg(h) \le g$ and $f(x)$ is a monic polynomial of degree $\deg(f) = 2g + 1$. Also, there are no solutions $(x, y) \in \overline{GF(2^n)} \times \overline{GF(2^n)}$ that simultaneously satisfy (4) and the equations $2y + h(x) = 0$ and $h'(x)y - f'(x) = 0$. Such solutions are called singular points. For genus 2, in the general case, the following equation is used:
$$y^2 + (h_2x^2 + h_1x + h_0)y = x^5 + f_4x^4 + f_3x^3 + f_2x^2 + f_1x + f_0. \tag{5}$$
A divisor $D$ is a formal sum of points on the hyperelliptic curve $C$, that is, $D = \sum m_P P$, and its degree is $\deg(D) = \sum m_P$. Let $\mathrm{Div}$ denote the group of all divisors on $C$ and $\mathrm{Div}^0$ the subgroup of $\mathrm{Div}$ of all divisors with degree zero. The Jacobian $J$ of the curve $C$ is defined as the quotient group $J = \mathrm{Div}^0 / \mathcal{P}$. Here, $\mathcal{P}$ is the set of all principal divisors, where a divisor $D$ is called principal if $D = \mathrm{div}(f)$ for some element $f$ of the function field of $C$ ($\mathrm{div}(f) = \sum_{P \in C} \mathrm{ord}_P(f) P$). The discrete logarithm problem in the Jacobian is the basis of security for HECC. We use the Mumford representation, according to which each divisor is represented as a pair of polynomials $[u, v]$, where $u$ is monic of degree 2, $\deg(v) < \deg(u)$, and $u \mid f - hv - v^2$ (the so-called reduced divisors). For implementations of HECC, we need to implement the multiplication of elements of the Jacobian, that is, divisors, with some scalar.
Algorithms for Our Implementations
In our implementations, the scalar multiplication of ECC is achieved by two different computation sequences: The first is from the recommendation of IEEE P1363 and the second is based on the idea of Montgomery's powering ladder by López and Dahab (denoted as ECC_M in this paper) [21]. With regard to the scalar multiplication of HECC, we use the formulas of Byramjee and Duquesne [22]. All of the sequences use projective coordinates and we apply the binary nonadjacent form (NAF) or the windowed NAF method for scalar multiplication [16], [23], except for ECC_M. This way, the scalar is decomposed as an NAF and scalar multiplication is performed with a lower cost than the binary method. Modular inversion is performed with a chain of modular multiplications [19]. The total number of modular multiplications required for the modular inverse is $\{\lfloor \log_2(n - 1) \rfloor + w(n - 1) - 1 + (n - 1)\}$. Since we need only one modular inversion for each scalar multiplication of ECC and HECC, the inversion cost is not a serious bottleneck. The details will be discussed in Section 6.
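As a small illustration of two quantities mentioned above, the following Python sketch computes the binary NAF of a scalar and evaluates the stated multiplication count for one inversion. It assumes that $w(\cdot)$ denotes the Hamming weight, as is usual for multiplication-chain inversion, and that the $(n - 1)$ term accounts for squarings executed as multiplications on the same data path. The function names are hypothetical.

```python
def naf(k):
    """Binary non-adjacent form of k > 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d            # k becomes a multiple of 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

def inversion_multiplications(n):
    """floor(log2(n-1)) + w(n-1) - 1 + (n-1), with w() taken as Hamming weight."""
    m = n - 1
    return (m.bit_length() - 1) + bin(m).count("1") - 1 + m

print(naf(7))                           # digits of 7 = 8 - 1
print(inversion_multiplications(163))   # field size used for ECC163
print(inversion_multiplications(83))    # field size used for HECC83
```

Because no two adjacent NAF digits are nonzero, fewer point/divisor additions are needed than with the plain binary expansion of the scalar.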
As our data path performs one basic operation, $AB + C$ or $A(B + D) + C$, over a binary field, we have rewritten the sequences of point/divisor doubling and addition to obtain an optimal usage of our new data path. For example, the formulas for the mixed addition of HECC include 48 operations of $A(B + D) + C$ instead of six squarings, 34 multiplications, and numerous additions. Note that our strategy for refining the sequences also minimizes the number of intermediate variables to save hardware resources. As a result, scalar multiplication can be performed with at most 16 and 32 registers, respectively, for ECC (including ECC_M) and HECC.
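To make the single-operation idea concrete, the sketch below shows how multiplication, squaring, and addition can all be expressed through one operation of the form $A(B + D) + C$. It is a behavioral model only. The toy field $GF(2^4)$ with modulus $x^4 + x + 1$, the list-based register file, and the convention that two registers hold the constants 0 and 1 are assumptions made for illustration, not details taken from the design.

```python
# Toy field GF(2^4) with P(x) = x^4 + x + 1; both choices are illustrative.
N, POLY = 4, 0b10011

def gf_mul(a, b):
    """Polynomial-basis multiplication modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= POLY
    return r

def malu(rf, r, a, b, c, d):
    """Behavioral model: rf[r] = rf[a] * (rf[b] + rf[d]) + rf[c]."""
    rf[r] = gf_mul(rf[a], rf[b] ^ rf[d]) ^ rf[c]

ZERO, ONE = 0, 1        # hypothetical register addresses holding constants 0 and 1
rf = [0] * 16           # 16 entries, as quoted for ECC
rf[ONE] = 1
rf[2], rf[3] = 0b0110, 0b0111

malu(rf, 4, 2, 3, ZERO, ZERO)   # multiplication: r4 = r2 * r3
malu(rf, 5, 2, 2, ZERO, ZERO)   # squaring:       r5 = r2 * r2
malu(rf, 6, ONE, 2, ZERO, 3)    # addition:       r6 = 1 * (r2 + r3)
malu(rf, 7, 2, 3, 5, 6)         # fused form:     r7 = r2 * (r3 + r6) + r5
```

The last call replaces one addition, one multiplication, and a second addition with a single operation, which is the kind of saving the rewritten sequences exploit.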
ARCHITECTURE OF THE CURVE-BASED CRYPTOPROCESSOR
System Architecture
The proposed architecture of the curve-based cryptoprocessor is composed of the main controller, several MALU cores, and the Register Files (RFs) that store intermediate variables and share them with the MALU cores. The block diagram of the cryptoprocessor is illustrated in Fig. 2 . The hardware configuration of the cryptoprocessor is flexible to provide from the smallest to the fastest implementation, depending on the target application. Some components can be added or removed, as will be explained in the next sections.
The main CPU communicates with the cryptoprocessor through memory-mapped I/O (for example, a Static RAM (SRAM) interface) and has three types of 32-bit inputs and outputs: One of them is a signal that tells the controller to stop sending instructions when the instruction buffer is full. A 32-bit I/O passes data backward and forward between the main CPU and the cryptoprocessor and a 32-bit output is used to send instructions. The data transfer between the main CPU and the cryptoprocessor is controlled by a Data Bus Controller (DBC). If the intermediate variables for ECC/HECC operations are stored in the SRAM attached to the main CPU, then the cryptoprocessor can be constructed without use of the RFs. However, the I/O transfer overhead becomes the bottleneck of the performance. Hence, the RFs have to be embedded in the cryptoprocessor for the purpose of reducing the data transfer overhead. This way, the path through the DBC is only activated when an initial point and the curve parameters are sent to the RFs or when the result of a scalar multiplication is retrieved.
Instructions are sent to the MALU cores either from the main CPU or from preset microcodes in the μ-code RAM. When the main CPU is in charge of dispatching instructions, the Instruction Bus Controller (IBC) block can be detached from the cryptoprocessor. In this case, typically, the throughput of issuing instructions is not high enough for the MALU cores to be utilized effectively. However, if the μ-code RAM is used for assisting the main CPU, then the IBC can handle one instruction per cycle. For instance, the sequence of point doubling is stored in the μ-code RAM and the main CPU calls it as a single instruction. Thus, multiple MALU cores can be activated in parallel without any instruction stalls. During point/divisor multiplications, the IBC keeps on reading instructions from the μ-code RAM and stores them in an Instruction Queue Buffer (IQB), unless the IQB is full. The IBC checks if there is Instruction-Level Parallelism (ILP) by checking the data dependency of instructions in the IQB and forwards them to the MALU(s) (see Section 3.6).
Architecture of the MALU Core
The data path plays an important role in accelerating scalar multiplication. One way to implement an efficient data path is to use a specific irreducible polynomial, for example, a trinomial such as $P(x) = x^{193} + x^{15} + 1$. In this case, modular multiplications can be implemented efficiently and the squaring operation only needs several modular adders if it is implemented separately from the multiplier. The critical path delay of the squarer is low enough to use it as a one-cycle operation. Likewise, the modular inversion can be efficiently implemented [24]. Therefore, three different modular operations can be used for the data path. However, this dedicated approach makes the data path inflexible in the size of the operand (that is, the field size).
In contrast, our proposed MALU core is a flexible processor that executes a single operation on a finite field $GF(2^n)$, for example, $A(x)B(x) + C(x) \pmod{P(x)}$. Also, the irreducible polynomial $P(x)$ can be chosen arbitrarily. The core, as illustrated in Fig. 3, decodes an incoming instruction in the FSM, loads operands from the RF, executes the finite-field operation by the data path (the MALU), and writes back the result into the RF. All operations necessary for scalar multiplication of the curve-based cryptography can be processed by using the single core iteratively, including modular inversions.
In order to exploit parallelism in scalar multiplications, multiple MALU cores can be instantiated in the cryptoprocessor. The intermediate variables are then shared with the MALU cores through the data bus. The RF architecture is discussed in detail in Section 3.4, since it is one of the most critical blocks in multicore systems. Another advantage of our multicore system is that wider field sizes can be supported by reconfiguring the data path. That is, our proposed core has additional ports that are used for interconnecting MALUs in neighboring cores by setting the configuration register. Thus, we can construct a new data path that can handle larger operands. The details are explained in Section 3.3.
Reconfigurable Data Path
In this section, the architecture for the MALU is explained. The MALU is a data path that is based on an MSB-first bit-serial polynomial-basis $GF(2^n)$ multiplier, as illustrated in Fig. 4a. This is a hardware implementation that computes $A(x)B(x) \pmod{P(x)}$, where $A(x) = \sum_{i=0}^{n-1} a_i x^i$, $B(x) = \sum_{i=0}^{n-1} b_i x^i$, and $P(x) = x^n + \sum_{i=0}^{n-1} p_i x^i$.
The intermediate result $T(x) = \sum_{i=0}^{n} t_i x^i$ is stored in a register. The case for the bit-serial multiplier is shown in Algorithm 1.
Algorithm 1. Bit-serial MSB-first modular multiplication over $GF(2^n)$.
1. $T(x) = 0$;
2. for $i = n - 1$ downto $0$ do
3. $m_i = t_n$;
4. $T(x) = (T(x) + a_i B(x) + m_i P(x))x$;
5. Return $T(x)/x$;
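A bit-serial MSB-first multiplier of this kind can be modeled in a few lines of Python. This is a sketch of the recurrence $T(x) = (T(x) + a_i B(x) + m_i P(x))x$ with $m_i = t_n$, not a description of the actual circuit. Polynomials are represented as Python integers (bit $i$ holds the coefficient of $x^i$), and the function name is hypothetical.

```python
def bit_serial_mul(a, b, p, n):
    """MSB-first bit-serial A(x)B(x) mod P(x) over GF(2^n).

    a, b: operands of degree < n; p: modulus of degree exactly n.
    """
    t = 0
    for i in range(n - 1, -1, -1):
        m_i = (t >> n) & 1            # m_i = t_n selects the reduction step
        a_i = (a >> i) & 1
        # XOR the three inputs, then shift left by one position (multiply by x)
        t = (t ^ (b if a_i else 0) ^ (p if m_i else 0)) << 1
    return t >> 1                      # T(x)/x undoes the final shift

# Example over GF(2^4) with P(x) = x^4 + x + 1:
# (x^2 + x)(x^2 + x + 1) = x^4 + x = 1 (mod P).
print(bit_serial_mul(0b0110, 0b0111, 0b10011, 4))   # prints 1
```

Each loop iteration corresponds to one clock cycle of the bit-serial data path, so one multiplication takes $n$ iterations.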
The MALU XORs three inputs, which are $a_i B(x)$, $m_i P(x)$, and $T(x)$, and then outputs the next intermediate result $T(x)$ by computing
$$T(x) = (T(x) + a_i B(x) + m_i P(x))x,$$
where $m_i = t_n$. By providing $T(x)$ as the next input and repeating the same computation $n$ times, one can obtain the result $A(x)B(x)$ (see [25]). Moreover, by providing $B(x) + D(x)$ in place of $B(x)$ and XORing the result with $C(x)$, the operation form $A(x)(B(x) + D(x)) + C(x) \pmod{P(x)}$ can also be supported. The proposed data path is scalable in the digit size $d$ (vertical direction in Fig. 4b). The corresponding algorithm can be obtained by loop unrolling Algorithm 1, as shown in Algorithm 2. In this case, one operation finishes in $\lceil n/d \rceil$ cycles. Thus, the appropriate digit size can be parameterized in the data path design and can be determined by exploring the best combination of cost and performance.
Algorithm 2 (excerpt of the unrolled loop body).
. . .
$T(x) = (T(x) + a_{i+1} B(x) + m_{i+1} P(x))x$;
. . .
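The digit-serial variant can be sketched by unrolling the bit-serial step $d$ times per clock cycle. The Python model below also folds in the $B(x) + D(x)$ pre-addition and the final XOR with $C(x)$. It assumes that the operand $A(x)$ is zero-padded on the MSB side up to a multiple of $d$ bits; that padding convention, the function name, and the parameter names are assumptions of this sketch and are not taken from Algorithm 2.

```python
def malu_op(a, b, c, d_op, p, n, d):
    """Digit-serial A(x)(B(x) + D(x)) + C(x) mod P(x); returns (result, cycles).

    a, b, c, d_op: operands of degree < n; p: modulus of degree n; d: digit size.
    """
    bd = b ^ d_op                          # B(x) + D(x) is a plain XOR
    t = 0
    cycles = 0
    top = -(-n // d) * d                   # ceil(n/d) * d bit positions of A are scanned
    for i in range(top - d, -1, -d):       # one d-bit digit of A per clock cycle
        for j in range(i + d - 1, i - 1, -1):    # d unrolled bit-serial steps
            m_j = (t >> n) & 1
            a_j = (a >> j) & 1
            t = (t ^ (bd if a_j else 0) ^ (p if m_j else 0)) << 1
        cycles += 1
    return (t >> 1) ^ c, cycles

# n = 97 with d = 12 needs ceil(97/12) = 9 cycles, whatever the operand values.
P97 = (1 << 97) | (1 << 6) | 1             # a degree-97 modulus, used only for the cycle count
print(malu_op(3, 5, 0, 0, P97, 97, 12)[1])   # prints 9
```

A larger digit size lowers the cycle count at the cost of a deeper XOR network per cycle, which is the cost/performance trade-off mentioned above.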
The field size n is determined by the key length. A larger field size can also be obtained by interconnecting several MALUs in the horizontal direction. Hence, various implementation options can be chosen with the MALU. For instance, the cryptoprocessor can support arbitrary field sizes up to 587 bits when using six copies of the MALU cores, each of which supports a field size of 97 bits.
The schematic circuit diagram illustrated in Fig. 5 describes how the MALUs are reconfigured for supporting different field sizes. When each of the MALUs is used independently without interconnections for the purpose of parallel processing, cfg1 is set to zero so that a $d$-bit vector $m = (m_{i+d-1}, \cdots, m_{i+1}, m_i)$ in Algorithm 2 can be used for the modular reduction in its own core. More precisely, the vector $m$ determines whether the irreducible polynomial should be XORed with the intermediate result so that the degree of $T(x)$ can be at most $n$, that is, $\deg(T) \le n$. This way, the LSBs of $T(x)$ can always be 0 because they are provided by another $d$-bit vector $q = (q_{i+d-1}, \cdots, q_{i+1}, q_i)$ (see Algorithm 2) of the neighboring core. This corresponds to the 1-bit left-shift operation.
On the other hand, when supporting a wider field-sized data path by interconnecting several MALU cores, each vector $m$ is exchanged with one from the neighboring core. For instance, $\beta$ copies of the MALU over $GF(2^n)$ with digit size $d$, that is, $MALU_{n \times d}$, can be reconfigured as one $MALU_{(\beta(n+1)-1) \times d}$. Suppose that $\beta = 3$ in Fig. 5 and the MALU1, MALU2, and MALU3 are reconfigured to make a data path for the triple field size (more precisely, $3n + 2$). The configuration signals in the MALU1, MALU2, and MALU3 should be set to 0, 1, and 1, respectively.
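The relation between the number of interconnected cores and the widest supported field can be checked with a short calculation. The sketch below assumes that chaining a given number of $n$-bit MALUs yields a data path of (cores)(n + 1) - 1 bits, and that the configuration signal is 0 for the first core of a chain and 1 for every following core, generalizing the three-core example above. Both function names are hypothetical.

```python
def max_field_size(cores, n):
    """Widest field supported when `cores` n-bit MALUs are chained."""
    return cores * (n + 1) - 1

def chain_config(cores):
    """Configuration signals for one chain: 0 for the first core, 1 for the rest."""
    return [0] + [1] * (cores - 1)

for cores in (1, 2, 3, 6):
    print(cores, max_field_size(cores, 97), chain_config(cores))
```

With 97-bit cores, this gives 195 bits for a pair, 293 bits for three cores, and 587 bits for six cores, matching the figures quoted in the text.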
Architecture of the RF
If the MALU supports the operation $A(x)(B(x) + D(x)) + C(x) \pmod{P(x)}$, then four different operands need to be read from the RF and the result is written back to the RF after completing the execution. When using three MALU cores, for instance, 12 read and three write operations occur for three parallel executions. This heavy memory access was one of the bottlenecks in our previous multicore cryptoprocessor [26]. In order to reduce the memory-access cycles, especially in read operations, a multiport RF is implemented, as illustrated in Fig. 6a.
The multiport RF supports four simultaneous read operations at four different addresses per cycle (4R). This allows one to read all of the necessary operands for the operation form $A(x)(B(x) + D(x)) + C(x)$ in a single cycle. This way, the number of read-access cycles can be reduced by 3/4, or 75 percent: the read phase takes only three cycles for three parallel executions. A write operation can be performed in any cycle in which no read operation is executed (1W).
Note that one RF can be shared with multiple MALU cores. Only when supporting a wider field size should multiple RFs be allocated in the cryptoprocessor. In addition, the required number of entries in the RF differs from ECC to HECC in that ECC needs 16 registers, whereas HECC uses 32 registers, as mentioned previously. This difference can be a problem when the cryptoprocessor needs to support both cryptosystems. A simple solution is to prepare 32 entries in each RF (denoted as $RF_{n \times 32}$ in Fig. 6a). Another solution is to make one 32-entry RF from two 16-entry RFs, as shown in Fig. 6b. The figure illustrates how $RF_{n \times 32}$ can be configured with $RF1_{n \times 16}$ and $RF2_{n \times 16}$. As will be investigated in detail in Section 5, both solutions have drawbacks and advantages.
The MALU Instruction
We now design a new instruction called MALU(). It is worth mentioning again that this is the only instruction that operates on the data path:
MALU(&R, &A, &B, &C, &D): $R(x) = A(x)(B(x) + D(x)) + C(x) \pmod{P(x)}$.
Here, &A, &B, &C, &D denote the addresses for four inputs of the instruction and &R denotes the address where the result is stored. As illustrated in Fig. 7, the whole procedure to execute MALU() starts from an instruction fetch and decode (IF/D). Then, variables for $A(x)$, $B(x)$, $C(x)$, and $D(x)$ are loaded via the RF (R) for the succeeding execution stage. The result is stored to the RF (W) in the last step. When performing parallel processing, the write operations from every MALU core should be sequential in order to avoid memory-write conflicts. More precisely, in order to keep data integrity between the RFs, only one data item can be written to the RFs within a cycle: This is a consequence of the way in which the 4R1W RF is embedded in our cryptoprocessor. From another viewpoint, the operands for different MALU cores can also be read sequentially, which means that one RF can be shared with multiple MALU cores. The number of instructions that can be issued in parallel determines the number of consecutive write cycles. In total, an $l$-way parallel execution takes $l + 1$ cycles, in addition to the execution cycles that depend on the MALU configuration.
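The cycle counts discussed for the RF and for an $l$-way parallel issue can be put together in a rough timing model. The sketch assumes that the total time for an $l$-way batch is simply the execution time $\lceil n/d \rceil$ plus the stated $l + 1$ overhead cycles, and that operand fetch needs four reads per instruction spread over the available read ports. How the overhead is split between fetch, read, and write stages is not modeled, and the function names are hypothetical.

```python
def exec_cycles(n, d):
    """Execution cycles of one MALU() operation: ceil(n/d)."""
    return -(-n // d)

def parallel_issue_cycles(l, n, d):
    """Cycles for one l-way parallel batch: (l + 1) overhead plus execution."""
    return (l + 1) + exec_cycles(n, d)

def read_cycles(l, read_ports):
    """Cycles needed to fetch the four operands of each of l instructions."""
    return -(-(4 * l) // read_ports)

l, n, d = 3, 97, 12
print(read_cycles(l, 1), read_cycles(l, 4))   # single-port versus 4R register file
print(parallel_issue_cycles(l, n, d))         # total cycles for the 3-way batch
```

For three parallel executions, the model gives 12 read cycles with a single read port and 3 with the 4R register file, which is the 75 percent reduction mentioned for the multiport RF.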
Dynamic Scheduling for Multicore Architecture
ILP is exploited for all MALU() instructions as long as two or more instructions are buffered in the IQB. Here, we introduce our strategy to find ILP. The instruction has four source operands and outputs the result to the RF; that is, MALU(&R, &A, &B, &C, &D) deals with five types of addresses for the operation $A(x)(B(x) + D(x)) + C(x) \pmod{P(x)}$. They are expressed as MALU: &R = &A, &B, &C, &D.
The MALU instruction also refers to the $P(x)$ that is stored in the RF. To include out-of-order executions, two types of dependencies need to be checked between two instructions, $MALU_i$ and $MALU_j$ ($i$ and $j$ are labels indicating the order of instructions in the IQB). For all $i$ and $j$ that satisfy $0 \le i < j < ILP_D$, where $ILP_D$ is the size of the instruction window to exploit ILP, one can determine the number of instructions to be issued in parallel by checking the following dependencies:
Read-After-Write (RAW) dependency check for in-order execution:
$$R_i \neq A_j,\ B_j,\ C_j,\ D_j \quad (0 \le i < j).$$
If the instruction issue includes an out-of-order execution, the next condition has to be verified as well.
RAW dependency check for out-of-order execution:
$$R_j \neq A_i,\ B_i,\ C_i,\ D_i \quad (1 \le i < j).$$
As a result of checking the conditions for an in-order execution, it is possible that the instruction $MALU_j$ can be issued, whereas some preceding instructions cannot. In this case, we need to check whether the result $R_j$ of the instruction $MALU_j$ is used as an input of the preceding instructions that cannot be issued. This is the second condition above.
The proposed architecture needs no check for Write-After-Read and Write-After-Write dependencies, in contrast to a general superscalar machine. Indeed, the instruction MALU() is a fixed-length multicycle instruction and, hence, we can skip those dependencies in checking the sequence of point/divisor operations.
As the zeroth instruction $MALU_0$ is issued unconditionally, the number of conditions to check ILP becomes $4(ILP_D - 1)^2$. This fact indicates that the hardware complexity for ILP grows quadratically with $ILP_D$, but, in exchange, further parallelism can be exploited. The choice of $ILP_D$ is discussed in Section 4.4.
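The dependency checks can be prototyped in software to see which instructions of a window may be issued together. The Python sketch below is one reading of the scheme described in this section, not the IBC logic itself. It assumes that an instruction is held back if any of its sources is the destination of an earlier instruction in the window, and that an instruction may overtake a held-back predecessor only if its destination is not one of that predecessor's sources. The `Malu` tuple, the `issuable` function, and the example addresses are hypothetical.

```python
from typing import List, NamedTuple, Tuple

class Malu(NamedTuple):
    r: int   # destination address &R
    a: int   # source addresses &A, &B, &C, &D
    b: int
    c: int
    d: int

def issuable(window: List[Malu], cores: int = 4) -> Tuple[List[int], int]:
    """Return (indices that may issue together, number of address comparisons)."""
    issued = [0]                    # MALU_0 is issued unconditionally
    checks = 0
    for j in range(1, len(window)):
        ok = True
        for i in range(j):          # in-order check: R_i must not feed MALU_j
            checks += 4
            if window[i].r in window[j][1:]:
                ok = False
        for i in range(1, j):       # out-of-order check against held-back MALU_i
            checks += 4
            if i not in issued and window[j].r in window[i][1:]:
                ok = False
        if ok and len(issued) < cores:
            issued.append(j)
    return issued, checks

window = [
    Malu(5, 1, 2, 3, 4),   # MALU_0
    Malu(6, 5, 2, 3, 4),   # reads the result of MALU_0, so it is held back
    Malu(7, 1, 2, 3, 4),   # independent, so it may overtake MALU_1
    Malu(2, 1, 3, 4, 1),   # would overwrite a source of the held-back MALU_1
]
print(issuable(window))    # prints ([0, 2], 36)
```

For a window of four instructions, the model performs 36 address comparisons, in agreement with $4(ILP_D - 1)^2$; in hardware, these comparisons would be evaluated in parallel rather than in a loop.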
PERFORMANCE EVALUATION
Design Platform
The proposed design is constructed on the GEZEL hardware/software codesign platform with the ARM Instruction Set Simulator (ISS) [27]. The cryptoprocessor is described in an FSM with Data Path (FSMD) manner. The platform provides cycle-accurate simulations for various hardware/software system configurations. As mentioned in Section 3, the cryptoprocessor is attached to the memory-mapped interface of the ARM. Thus, various types of system configurations are examined to verify the functionality and estimate the system-level performance quickly. The GEZEL codes are automatically translated into VHSIC Hardware Description Language (VHDL) codes that can be used to prototype the proposed cryptoprocessor on an FPGA.
Instruction Set for the Cryptoprocessor
Table 1 shows some of the primary instructions for the cryptoprocessor. For a 32-bit CPU such as the ARM, storing data to the address dst requires four STORE() instructions for HECC over $GF(2^{97})$. After all operands are set at the corresponding addresses of the RF, the main CPU sends the instructions MALU(). By using the μ-code RAM in the cryptoprocessor, it is possible to define an instruction that consists of a series of MALU() instructions. In this paper, all necessary point/divisor operations are preprogrammed in the μ-code RAM and the main CPU uses these instructions (for example, ECC_PA() and ECC_PD()) for scalar multiplication.
Configuration of the MALU Cores
The system performance is heavily dependent on the number of MALU cores and the data path configuration in each MALU core. They also determine the supported range of field sizes. The field sizes of interest in this paper are 163 and 193 bits for ECC because they offer a security level greater than or equal to a 1,024-bit RSA [16]. The corresponding field sizes for HECC are 83 and 97 bits. Therefore, it is reasonable to use the MALU cores with a data path of length $n = 97$. As for the digit size, $d = 12$ is chosen as an example case. This data path is denoted as $MALU_{97 \times 12}$ and one modular multiplication over $GF(2^{97})$ can be computed in $\lceil n/d \rceil$, or nine, cycles.
In [28] , the EC Digital Signature Algorithm (ECDSA) standard is designed and the recommended curve parameters and irreducible polynomials are listed for several field sizes of up to 571 bits. Therefore, we also investigate the performance of our cryptoprocessor for a 571-bit ECC.
Suppose that six cores with the data path $MALU_{97 \times 12}$ are allocated in the cryptoprocessor. Various data path configurations can be supported. Table 2 summarizes selected hardware configurations and the number of clock cycles for one MALU instruction ($\lceil n/d \rceil$) over different field sizes. The throughput of the MALU instruction can be estimated with
where $l$ is the degree of parallelism (that is, the execution of $l$-way parallelism). Although the number of data paths used in parallel varies from 1 to 6, depending on the field lengths, we exploit at most four-way parallelism in this paper in order to reduce the logic complexity in the IBC. Fig. 8 illustrates the minimal and maximal throughput of the MALU operation as a function of the field size. As can be seen in the figure, this configuration can offer a high throughput for a field size of around 97, 195, and 293 bits because we use $MALU_{97 \times 12}$ as a building block of the data path. The maximum throughput can be obtained only if all of the data paths are used in parallel. When no parallelism can be found in the MALU instructions (that is, in the case of single execution of the MALU instruction), our cryptoprocessor performs at the minimum throughput. In other words, by exploiting parallelism for $n \le 293$ in the MALU instructions, the throughput can be improved, depending on the degree of parallelism. However, the throughput is almost constant for $n > 293$.
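The configuration space for six $MALU_{97 \times 12}$ cores can be enumerated with a short script. The sketch assumes that a data path built from a group of chained cores supports (group)(n + 1) - 1 bits, that the number of independent data paths is the number of complete groups, capped at four-way issue, and that one MALU instruction at the widest field of a group takes $\lceil \text{size}/d \rceil$ cycles. It does not reproduce Table 2 or the throughput formula, and the function name is hypothetical.

```python
def configurations(cores=6, n=97, d=12, max_ways=4):
    """List (cores per data path, widest field, parallel data paths, cycles)."""
    rows = []
    for group in range(1, cores + 1):
        size = group * (n + 1) - 1              # widest field for this grouping
        paths = min(cores // group, max_ways)   # independent data paths, capped
        cycles = -(-size // d)                  # ceil(size / d)
        rows.append((group, size, paths, cycles))
    return rows

for row in configurations():
    print(row)
```

In this model, more than one data path is available only up to 293 bits, which is consistent with the remark that parallelism improves the throughput for $n \le 293$ but not beyond.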
Degree of Parallelism in ECC and HECC
As the performance of the superscalar architecture is dependent on the degree of parallelism, it is also important to determine $ILP_D$, which is an appropriate number of instructions to search for ILP, as well as the number of MALU cores. Fig. 9 shows the number of clock cycles for a scalar multiplication when setting $ILP_D$ from 1 to 8. Up to eight copies of the MALU cores with $MALU_{97 \times 12}$ are instantiated in the cryptoprocessor to evaluate the performance improvement by the superscaling feature for ECC163, ECC_M163, and HECC83. Here, we assume that enough RFs are allocated in the cryptoprocessor, that is, $RF_{97 \times 32}$ is assigned for each MALU core with $MALU_{97 \times 12}$ so that the cryptosystem has no limitations on the supported field sizes and the type of cryptosystem. As a result of the GEZEL system-level simulation, we observe that, for both operation forms, the overall performance improves as the number of MALU cores increases. Also, a large $ILP_D$ helps exploit more parallelism and leads to higher performance. We can also see the effectiveness of the operation form $A(B + D) + C$. The results of using this operation with $ILP_D = 6$ are also summarized in Table 3.
In order to investigate the performance bottleneck of ECC and HECC, the number of clock cycles for a scalar multiplication is split by the degree of parallelism. Fig. 10 shows the results for ECC163, ECC_M163, and HECC83 by changing the number of MALU cores and the type of the operation. Note that the same performance is obtained for ECC_M163, regardless of the type of the operation.
We consider the utilization of the MALU cores in order to know whether the data paths are effectively used in parallel. If a parallel execution utilizes all data paths, then the utilization is defined as 100 percent during execution of the parallel operations. On the other hand, the utilization is 50 percent when half of the data paths are used, for example, when a two-way computation is executed on a configuration with four data paths. Note that memory accesses are included as a part of a parallel execution. The percentages marked on the bars indicate the utilization of the MALU cores defined as
where l_max is the maximum degree of parallelism under the given hardware configuration and R_i is the number of clock cycles required for i-way computations. As can be seen in Fig. 10, the proposed superscalar feature can reduce the overall number of clock cycles. However, the utilization of the MALU cores decreases as the value of l_max increases. This fact indicates that the area and performance trade-off worsens for a larger l_max, which can be explained by the inherent data dependencies in ECC and HECC. From this observation, we decide to employ l_max ≤ 3 for ECC and l_max ≤ 4 for HECC to maintain high utilization of the MALU cores.
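A utilization measure consistent with the definitions above can be written as below. This is a reconstruction from the stated meanings of l_max and R_i, offered as an assumption rather than as the exact expression behind the percentages in Fig. 10.

```latex
U = \frac{\sum_{i=1}^{l_{\max}} i \cdot R_i}{l_{\max} \cdot \sum_{i=1}^{l_{\max}} R_i} \times 100\ \text{percent}
```

For instance, if only two-way computations run on four data paths, so that R_2 is the only nonzero term, then U = 2R_2 / (4R_2) = 50 percent, which matches the example given for the definition.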
IMPLEMENTATION
The cryptoprocessor discussed in Section 3 has been synthesized with a 0.13-μm CMOS technology by using Synopsys Design Vision. Based on the performance evaluation discussed in Section 4, we allocate up to eight instantiations of the MALU cores with MALU 97×12. A pair of MALU 97×12 can be reconfigured as one MALU 195×12 by changing the interconnection between the MALU cores. This way, both HECC97 and ECC193 can be supported by allocating 2 · MALU 97×12 in the cryptoprocessor. As for the RF, we need to prepare either a pair of RF 97×32 or a pair of RF 97×16 to support the field lengths.
As explained previously, HECC requires RF n×32, whereas ECC can be computed with RF n×16, where n is the field size of ECC and HECC. Therefore, the supported field lengths and the degree of parallelism depend on the configuration of the RF, as shown in Table 4. In other words, the configuration using 6 · RF 97×32 offers enough registers for both ECC and HECC and, hence, a better degree of parallelism can be expected. Moreover, we can apply NAF_4 (NAF with a width-4 window) for ECC by utilizing the redundant 16 entries in the RF. In contrast, 6 · RF 97×16 can be considered an ECC-centric configuration because it saves the redundant registers when ECC is performed. The drawback of this configuration is that the degree of parallelism in HECC is restricted by the number of RF 97×16, that is, the cryptoprocessor can exploit at most three-way parallelism for HECC in this case.
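The effect of the register file configuration on parallelism can be sketched as follows. The model is an assumption and not a reproduction of Table 4: it supposes that register files are chained in width in the same way as the MALU cores (k files cover up to k · 98 − 1 bits) and stacked in depth until 16 entries are available for ECC or 32 for HECC. All names are illustrative.

```python
ENTRIES_NEEDED = {"ECC": 16, "HECC": 32}   # register entries, per the text
RF_WIDTH = 97                               # bit width of one register file


def rf_units_needed(n: int, system: str, rf_depth: int) -> int:
    """Register files assumed to be consumed by one n-bit ECC/HECC data path."""
    wide = (n + RF_WIDTH + 1) // (RF_WIDTH + 1)      # files chained in width
    deep = -(-ENTRIES_NEEDED[system] // rf_depth)    # files stacked in depth (ceil)
    return wide * deep


def rf_limited_parallelism(n: int, system: str, rf_depth: int, total_rfs: int = 6) -> int:
    """Degree of parallelism permitted by the register files alone."""
    return total_rfs // rf_units_needed(n, system, rf_depth)


if __name__ == "__main__":
    for n, system in ((163, "ECC"), (83, "HECC"), (139, "HECC")):
        for depth in (32, 16):
            print(system, n, f"RF 97x{depth}", rf_limited_parallelism(n, system, depth))
```

Under this model, HECC83 is limited to three-way parallelism with 6 · RF 97×16, in line with the restriction stated above, while ECC163 is unaffected by the shallower register files.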
Figs. 11 and 12 show the average time for an ECC and an HECC scalar multiplication for configurations that pair equal numbers of MALU 97×12 cores and RF 97×32 register files, with 1, 2, 3, 4, 5, 6, or 8 of each (CONFIG-I). Figs. 13 and 14 show the results for configurations that pair equal numbers of MALU 97×12 cores and RF 97×16 register files, with 2, 3, 4, 5, 6, or 8 of each (CONFIG-II). Here, we use a 1-Kbyte μ-code RAM in order to support ECC and HECC.
As can be seen in the figures, the cost of supporting ECC571 with CONFIG-I is 393 Kgates, which is more expensive than that with CONFIG-II, whose gate size is 244 Kgates. However, CONFIG-II shows a slightly lower performance for HECC. This performance degradation is more apparent for a larger field size. For example, with six MALU cores, the time for an HECC139 scalar multiplication increases from 284 to 670 μs when changing the configuration from CONFIG-I to CONFIG-II. The computation cost for modular inversion is summarized in Table 5 for CONFIG-I. The ratio of the inversion cost to the computation cost for the scalar multiplication varies from 7 percent to 18 percent in ECC and ECC_M. This is due to the fact that we use the MALU instructions for modular squarings in the Itoh-Tsujii algorithm. However, considering the flexibility of the proposed hardware architecture, the inversion cost can be regarded as low enough. In the case of HECC, the cost for modular inversion is negligible.
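For reference, the Itoh-Tsujii inversion mentioned above can be sketched in software as follows. This is a plain Python model and not the μ-code of the cryptoprocessor: field elements are integers whose bits are polynomial coefficients, and squarings are performed with the general multiplier to mirror the use of MALU instructions for squaring. The function names are illustrative.

```python
def gf2n_mul(a: int, b: int, poly: int, n: int) -> int:
    """Multiply a and b in GF(2^n); poly is the irreducible polynomial including x^n."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1              # multiply a by x
        if (a >> n) & 1:
            a ^= poly        # reduce modulo the irreducible polynomial
    return result


def itoh_tsujii_inverse(a: int, poly: int, n: int) -> int:
    """Return a^(-1) = a^(2^n - 2) in GF(2^n) via the Itoh-Tsujii addition chain."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    beta, k = a, 1                           # invariant: beta = a^(2^k - 1)
    for bit in bin(n - 1)[3:]:               # bits of n-1 after the leading 1
        t = beta
        for _ in range(k):                   # t = beta^(2^k), using k squarings
            t = gf2n_mul(t, t, poly, n)
        beta, k = gf2n_mul(t, beta, poly, n), 2 * k
        if bit == "1":                       # step from 2^k - 1 to 2^(k+1) - 1
            beta = gf2n_mul(gf2n_mul(beta, beta, poly, n), a, poly, n)
            k += 1
    return gf2n_mul(beta, beta, poly, n)     # (a^(2^(n-1) - 1))^2
```

The chain uses n − 1 squarings but only a handful of multiplications (162 squarings and 9 multiplications for n = 163), so the cost of a squaring largely determines the cost of the inversion.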
To achieve faster performance, other configurations are also considered by fixing the field length and the irreducible polynomial and supporting either ECC or HECC only. In these configurations, one fixed-size RF can be shared among the MALU cores. For instance, we can consider the configuration of 4 · MALU 83×12 + RF 83×32 for HECC83. In addition, we use ROM for storing the μ-code program. These configurations offer a higher performance with lower cost compared to CONFIG-I and CONFIG-II at the cost of reduced flexibility and programmability. The results of this configuration are also discussed in Section 6. Table 6 summarizes the performance of ECC and HECC scalar multiplication for selected field sizes and different hardware configurations. Our proposed cryptoprocessor can provide various choices of area and performance. The observed performance remains high overall for all supported field sizes.
RESULTS
When implementing the cryptoprocessor based on CONFIG-I, the highest flexibility and performance can be obtained for both ECC and HECC, as shown in Table 6, with a gate size of 393 Kgates. On the other hand, for CONFIG-II, the gate size becomes 244 Kgates, with some performance penalty for ECC and HECC. In both configurations, the performance of HECC is lower than that of ECC.

Compared with previous work, our HECC implementation results are faster than the implementation reported by Wollinger [13], which was one of the fastest HECC implementations. Furthermore, our implementation can support both ECC and HECC. Our ECC implementation results also show better performance than other previous work, except for an ECC_M implementation by Sozzani et al. [29]. This is because their ASIC design uses a fixed 163-bit field size and a hardwired controller, which offers less scalability and flexibility than our reconfigurable design. In fact, our design with a fixed irreducible polynomial shows better performance than their result while supporting both ECC and ECC_M.
CONCLUSIONS
This paper presented a multicore cryptoprocessor for ECC and HECC to support a wide range of field sizes and to accelerate the scalar multiplication of ECC and HECC of genus 2 over GF(2^n) by exploiting ILP on the fly. The superscaling feature is facilitated by defining a single instruction that is flexibly defined as AB + C or A(B + D) + C and can be used for all field operations, such as modular multiplications and modular additions, and for point/divisor operations. We conclude that the operation A(B + D) + C is effective in decreasing the number of clock cycles for scalar multiplication.
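As an illustration of how one instruction can cover the field operations, the sketch below models A(B + D) + C over GF(2^n) in Python, with field elements held as integers and field addition performed as XOR. The function name `malu` and its operand conventions are assumptions made for this example and do not reflect the instruction encoding of the cryptoprocessor. The small field GF(2^8) with the polynomial 0x11B is chosen only for brevity.

```python
def malu(a: int, b: int, c: int, d: int = 0, *, poly: int, n: int) -> int:
    """Compute A*(B + D) + C in GF(2^n); with d = 0 this is A*B + C."""
    multiplier = b ^ d        # field addition in GF(2^n) is XOR
    acc = 0
    while multiplier:
        if multiplier & 1:
            acc ^= a
        multiplier >>= 1
        a <<= 1               # multiply a by x
        if (a >> n) & 1:
            a ^= poly         # poly includes the x^n term
    return acc ^ c


# Field operations expressed through the single instruction
POLY, N = 0x11B, 8
x, y = 0x57, 0x83
product = malu(x, y, 0, poly=POLY, n=N)        # multiplication: A*B + 0
total = malu(1, x, 0, y, poly=POLY, n=N)       # addition: 1*(B + D) + 0
square = malu(x, x, 0, poly=POLY, n=N)         # squaring: A*A + 0
fused = malu(x, y, 0x0F, 0x01, poly=POLY, n=N) # A*(B + D) + C in one step
```

Folding the extra addition B + D into the multiplication saves a separate instruction whenever a sum is immediately multiplied, which is one way to read the reported reduction in clock cycles.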
The fully programmable cryptoprocessor can handle various curve parameters and an arbitrary irreducible polynomial. In addition, a wide range of field sizes for the modular operations can be supported by reconfiguring the data path in the MALU cores. Thus, a trade-off between performance and security can be obtained.

Hardware and Embedded Systems (CHES '07). Her research interests include circuits, processor architectures, and design methodologies for real-time embedded systems for security, cryptography, digital signal processing, and wireless communications. This includes the influence of new technologies and new circuit solutions on the design of next-generation systems on chip. She is a senior member of the IEEE. More information on her research can be found at www.emsec.ee.ucla.edu or www.esat.kuleuven.be/cosic.
