We present a high-speed public-key cryptoprocessor that exploits three-level parallelism in Elliptic Curve Cryptography (ECC) over GF(2^n). The proposed cryptoprocessor employs a Parallelized Modular Arithmetic Logic Unit (P-MALU) that exploits two different types of parallelism to accelerate modular operations. The computation sequence of scalar multiplication is also accelerated by exploiting Instruction-Level Parallelism (ILP) and processing multiple P-MALU instructions in parallel. The system is programmable and hence independent of the type of elliptic curve and of the scalar multiplication algorithm. The synthesis results show that a scalar multiplication of ECC over GF(2^163) on a generic curve can be computed in 20 and 16 μs with the binary NAF (Non-Adjacent Form) method and the Montgomery method, respectively. The performance can be improved further on a Koblitz curve, where a scalar multiplication takes 12 μs with the TNAF (τ-adic NAF) method. This performance allows us to perform over 80,000 scalar multiplications per second and to enhance security in wireless mobile applications.
Introduction
It is challenging to implement high-performance Public-Key Cryptosystems (PKCs) in embedded devices such as mobile phones and portable RFID readers because they have limited silicon resources and a limited power budget. The idea of PKC was introduced in the mid-1970s [1]. PKCs allow confidential data to be transferred securely over an insecure channel without a prior key exchange. The most widely accepted PKCs are RSA [2] and ECC [3, 4]. Because RSA requires a large integer for its private key (at least 1,024 bits), ECC is often preferred for compact hardware implementations, as it needs much shorter keys. Blake, Seroussi and Smart argue in [5] that 160-bit ECC has the same security level as 1,024-bit RSA.
The performance of ECC (i.e. scalar multiplication) is several orders of magnitude lower than that of secret-key cryptography such as AES [6]. However, fast ECC implementations have been reported recently, which shows great potential for a wider range of ECC applications, especially in the field of secure networks. For instance, RFID tags with ECC (ECC-RFID tags) are proposed in [7, 8, 9] to enhance privacy in physical distribution management systems [10]. Such systems require a local network server that can authenticate thousands (or even millions) of ECC-RFID tags quickly. The position of ECC in mobile network applications is illustrated in Fig. 1a. Other potential applications of a high-speed PKC, presented for the purpose of preserving private information, include electronic voting [11, 12] and private searching on streaming data [13].
In comparison with the relatively simple RSA implementation, ECC offers many options in designing the hardware architecture of the system. The main operation in ECC is the elliptic curve (EC) scalar multiplication (kP). It is performed by a sequence of point doublings and point additions, which is generally implemented as a controller block (e.g. a finite state machine, FSM) in a hardware design or as software in a hardware/software co-design, as shown in Fig. 1b. The computation sequence of scalar multiplication varies according to the type of the curve. For instance, Smart [14] showed that up to three field operations can be executed in parallel for the Hessian form of an elliptic curve. Several other computation sequences have been introduced, including the recommendations in IEEE P1363 [15] and FIPS 186-2 [16]. The selection of the sequence has a great impact on cost and performance because the number of modular operations and the required memory size differ according to the sequence. ECC can be defined over GF(p) or GF(2^n). Namely, 160-bit ECC over GF(p) is determined by a prime number p that satisfies p ≥ 2^160, and ECC over GF(2^n) is specified by an irreducible polynomial P(x) whose degree n is a prime that satisfies n > 160 (e.g. n = 163). The field size and the type of the underlying finite field are important parameters that determine the architecture of the data processing block (see Fig. 1b). In particular, operations in finite fields of characteristic 2, i.e. GF(2^n), are suitable for hardware implementations due to their carry-free arithmetic.
Many interesting computation sequences for ECC over binary fields exist, including the following: (1) the binary NAF method [5], (2) the Montgomery method [17] and (3) the τ-NAF or TNAF method on a Koblitz curve [18]. Given this variety, a domain-specific programmable architecture is an attractive choice for an Elliptic Curve (EC) cryptoprocessor because it offers performance equivalent to that of an ASIC while maintaining the flexibility to support the wide range of options for EC scalar multiplication. As summarized in Table 1, we denote the computation sequences of scalar multiplication as ECC, ECC_M and ECC_KC depending on the type of the curve and the scalar multiplication algorithm. Further performance improvement is possible by applying the window method [5] to ECC and ECC_KC. A detailed treatment is beyond the scope of this paper, and we discuss it only briefly in Section 5.
Our contribution in this paper is a cryptoprocessor architecture with three-level parallelism for ECC over a binary field. More precisely, the Modular Arithmetic Logic Unit (MALU) is based on a digit-serial multiplier that performs an n-bit × d-bit modular multiplication in one cycle, where n is the field size and d is the digit size. The critical path delay depends only on the digit size, and hence this can be considered as the first level of parallelism in modular multiplications. The second level of parallelism is in the Parallelized MALU (P-MALU), which uses two MALUs in parallel, and the third is in the instructions determined by the computation sequence of EC scalar multiplication. Namely, the proposed cryptoprocessor exploits ILP in an EC scalar multiplication algorithm and executes instructions on several parallelized datapaths, i.e. the P-MALUs. The design is synthesized with a 0.13-μm CMOS technology to evaluate the performance of scalar multiplication. This research is of interest because constant progress in public-key cryptography can remove performance bottlenecks and enhance security in wireless mobile applications. This paper is organized as follows. Section 2 lists some relevant previous work. Some mathematical background is explained in Section 3. The proposed architecture of our cryptoprocessor is given in Section 4, where the main contributions of our work, i.e. a parallelized datapath and ILP, are explained in detail. The implementation results are discussed in Section 5. Section 6 concludes the paper.
Related work
This section lists some relevant previous work. As already mentioned, a considerable amount of work has been done on hardware implementations of ECC on FPGAs as well as ASICs since Agnew et al. [19] reported the first result for performing the elliptic curve operations on hardware in 1989. The majority of hardware ECC implementations have been over binary fields. Gao et al. [20] proposed an elliptic curve cryptosystem coprocessor with variable key sizes, which utilizes the internal SRAM/registers in an FPGA. In 2000, Orlando and Paar [21] proposed a scalable elliptic curve processor architecture that operates over finite fields GF(2^n). Gura et al. [22] introduced a programmable hardware accelerator for ECC over GF(2^n), which can handle arbitrary field sizes up to 255 bits. Satoh and Takano [23] presented a dual-field multiplier with high performance in both binary and prime fields. The throughput of an elliptic curve scalar multiplication is maximized by use of the Montgomery modular multiplier and an on-the-fly redundant binary converter. The biggest advantage of their design is its scalability in operand size and its flexibility in trading speed against hardware area. The work by Tenca and Koç [24] also introduces a scalable architecture for the computation of modular multiplication, based on the Montgomery multiplier. Their proposed multiplier works with any precision of the input operands, limited only by memory or control constraints. Andres et al. [25] improved the Tenca-Koç Montgomery multiplier and achieved half the latency and half the queue memory requirement. Recently, significantly faster implementations have been proposed in [26, 27].
Our previous work [28] employs a superscalar architecture to exploit ILP in ECC over GF(2^n). That cryptoprocessor is scalable in field size and programmable for various EC scalar multiplication algorithms. The architecture in this paper is based on it and explores the fastest possible performance of ECC by fixing the field size to 163 bits and using the irreducible polynomial x^163 + x^7 + x^6 + x^3 + 1.
Curve-based cryptography
Here, we consider some background information for curve-based cryptography over a binary field. We mention the basic algorithms and the structure of the operations. Good references for the mathematical background are [5, 29, 30, 31]. In addition, we give the detailed computation sequences for point addition and doubling in projective coordinates for the operation form A(B + D) + C that the proposed P-MALU can handle, as explained in Section 4. We assume that the memory has a 4-bit address space and stores n-bit data as shown in Table 2.
ECC over a binary field
ECC relies on a group structure induced on an elliptic curve. A set of points on an elliptic curve (with one special point added, the so-called point at infinity O) together with point addition as a binary operation has the structure of an abelian group. As we consider a finite field of characteristic 2, i.e. GF(2^n), a non-supersingular elliptic curve E over GF(2^n) is defined as the set of solutions (x, y) ∈ GF(2^n) × GF(2^n) of the equation

$$y^2 + xy = x^3 + ax^2 + b, \tag{1}$$

where a, b ∈ GF(2^n) and b ≠ 0.
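The points on the curve and the point at infinity O form an abelian group. The sum $P_3 = (x_3, y_3)$ of the points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ ($P_1, P_2 \neq O$ and $P_1 \neq \pm P_2$) is given by

$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \quad y_3 = \lambda(x_1 + x_3) + x_3 + y_1, \quad \lambda = \frac{y_1 + y_2}{x_1 + x_2}. \tag{2}$$

This operation is called point addition. For $P_1 = P_2$, the point doubling formulae are

$$x_3 = \lambda^2 + \lambda + a, \quad y_3 = x_1^2 + (\lambda + 1)x_3, \quad \lambda = x_1 + \frac{y_1}{x_1}. \tag{3}$$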
The point at infinity O is the neutral element, similar to the number 0 in ordinary addition. Thus, P + O = P and P + (−P) = O for all points P. We use weighted projective coordinates where an affine point (x, y) is converted to a projective point (X, Y, Z) with $x = X/Z^2$ and $y = Y/Z^3$.
The projective curve equation corresponding to the affine equation (1) is given by

$$Y^2 + XYZ = X^3 + aX^2Z^2 + bZ^6. \tag{4}$$
Suppose that point addition is computed as $P_2 \leftarrow P_2 + P_1$ ($P_1 \neq \pm P_2$), i.e., the result of $P_2 + P_1$ is stored in the register of $P_2$ in order to save memory space. The computation sequence for point addition on the projective curve is presented in (5) for the special case $Z_1 = 1$.
where $c = \sqrt[4]{b}$. We assume that this curve parameter is pre-computed and stored in memory.
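The affine group law described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the datapath of the cryptoprocessor: field elements are Python integers whose bits are polynomial coefficients, the reduction polynomial is the x^163 + x^7 + x^6 + x^3 + 1 named in the text, inversion uses a slow Fermat exponentiation, and the toy point (2, 1) with a = 1 is a hypothetical choice from which b is derived so that the point lies on the curve. The slopes are the standard ones for curves of the form y^2 + xy = x^3 + ax^2 + b.

```python
# GF(2^163) arithmetic; all helper names here are illustrative assumptions.
N = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^N); addition is XOR (carry-free)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:          # degree reached N: reduce by XORing the modulus
            a ^= POLY
    return r

def gf_inv(a):
    """Inverse as a^(2^N - 2) (Fermat); slow but simple."""
    r, e = 1, (1 << N) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def ec_add(P, Q, a):
    """Affine addition on y^2 + xy = x^3 + a*x^2 + b; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 ^ y2 == x1:                      # Q = -P = (x1, x1 + y1); also covers x1 = 0
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))      # doubling slope
        x3 = gf_mul(lam, lam) ^ lam ^ a
        y3 = gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
        return (x3, y3)
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))     # addition slope
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)

def on_curve(P, a, b):
    """Check the affine curve equation; the point at infinity counts as on the curve."""
    if P is None:
        return True
    x, y = P
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(a, x2) ^ b

# Toy parameters: choose a point first, then derive b so that it satisfies the curve equation.
A_COEFF = 1
P = (2, 1)
B_COEFF = 1 ^ 2 ^ 8 ^ 4    # y^2 + xy + x^3 + a*x^2 evaluated at (2, 1)
```

Each affine operation costs one field inversion, which is the motivation for the projective sequences that the text refers to.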
Many techniques for recoding the scalar k have been proposed in the literature. Here we mention the signed-digit representation. Consider an integer representation of the form $k = \sum_{i=0}^{l} k_i 2^i$, where $k_i \in \{-1, 0, 1\}$. This is called the (binary) signed-digit (SD) representation (see Menezes et al. [5, 31]). If an SD representation has no adjacent non-zero digits, it is called a non-adjacent form (NAF). Every integer k has a unique NAF, which has the minimum weight of any signed-digit representation of k. Algorithm 1 shows the binary NAF method for scalar multiplication.
Algorithm 1 Algorithm for point multiplication: binary NAF method [31] .
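As a rough illustration of the binary NAF method, the sketch below recodes k into NAF digits and then evaluates kP left to right with doublings, additions and subtractions. It is written against a generic additive group supplied by the caller (`add`, `neg` and `zero` are hypothetical parameters, not names from the paper), so any point arithmetic can be plugged in. The small demonstration uses integers modulo 1009 as a stand-in group purely to make the result easy to check.

```python
def naf(k):
    """Return the NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)     # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d              # makes k divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

def scalar_mult_naf(k, P, add, neg, zero):
    """Left-to-right binary NAF method: one doubling per digit, one add/sub per non-zero digit."""
    Q = zero
    minus_P = neg(P)
    for d in reversed(naf(k)):
        Q = add(Q, Q)           # point doubling
        if d == 1:
            Q = add(Q, P)       # point addition
        elif d == -1:
            Q = add(Q, minus_P) # point subtraction
    return Q

# Stand-in group for demonstration only: integers modulo 1009 under addition.
def demo_add(u, v):
    return (u + v) % 1009

def demo_neg(u):
    return (-u) % 1009

print(naf(7))                                        # [-1, 0, 0, 1], i.e. 7 = 8 - 1
print(scalar_mult_naf(7, 5, demo_add, demo_neg, 0))  # 35
```

Since on average only about one third of the NAF digits are non-zero, the method needs fewer additions than the plain binary method, at the price of requiring point subtraction.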
Montgomery's powering ladder
The EC scalar multiplication can also be performed with the method of Montgomery, which keeps the difference $P_2 - P_1$ invariant (Algorithm 2) [17]. All computations are performed on the x-coordinate only in affine coordinates, and hence the point multiplication can be performed with fewer operations.
Algorithm 2 Algorithm for point multiplication: Montgomery method [17].
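If $P_1 \neq P_2$, the x-coordinate of the point addition ($P_3 = P_2 + P_1$) is computed as

$$x_3 = x + \left(\frac{x_1}{x_1 + x_2}\right)^2 + \frac{x_1}{x_1 + x_2},$$

where $x$ is the x-coordinate of the invariant difference $P = P_2 - P_1$.
If $P_1 = P_2$,

$$x_3 = x_1^2 + \frac{b}{x_1^2}.$$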
To avoid computationally expensive modular inversions, López and Dahab applied the method to projective coordinates [32]. In this algorithm, all necessary computations are performed on the X and Z coordinates in projective representation. The computation sequences for point addition ($P_2 = P_1 + P_2$) and point doubling ($P_1 = 2P_1$) are described in (11) and (12).
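A minimal sketch of the Montgomery ladder on the X and Z coordinates is given below. It uses the commonly cited López-Dahab update rules, Z3 = (X1·Z2 + X2·Z1)^2, X3 = x·Z3 + (X1·Z2)(X2·Z1) for addition and X' = X^4 + b·Z^4, Z' = X^2·Z^2 for doubling. It does not reproduce the instruction ordering of the paper's sequences (11) and (12), and all function names are assumptions. The field is GF(2^163) with the polynomial named in the text, and the curve parameter b and base x-coordinate used in the demonstration are arbitrary toy values.

```python
N = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^N)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= POLY
    return r

def gf_inv(a):
    """Inverse as a^(2^N - 2) (Fermat)."""
    r, e = 1, (1 << N) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def mdouble(X, Z, b):
    """x-only doubling: (X^4 + b*Z^4 : X^2 * Z^2)."""
    X2, Z2 = gf_mul(X, X), gf_mul(Z, Z)
    return gf_mul(X2, X2) ^ gf_mul(b, gf_mul(Z2, Z2)), gf_mul(X2, Z2)

def madd(X1, Z1, X2, Z2, x):
    """x-only addition of two points whose difference has affine x-coordinate x."""
    t1, t2 = gf_mul(X1, Z2), gf_mul(X2, Z1)
    Z3 = gf_mul(t1 ^ t2, t1 ^ t2)
    return gf_mul(x, Z3) ^ gf_mul(t1, t2), Z3

def montgomery_ladder_x(k, x, b):
    """Return the affine x-coordinate of kP for k >= 1 and x = x(P) != 0."""
    X1, Z1 = x, 1                      # P1 = P
    X2, Z2 = mdouble(x, 1, b)          # P2 = 2P, so P2 - P1 = P throughout
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            X1, Z1 = madd(X1, Z1, X2, Z2, x)
            X2, Z2 = mdouble(X2, Z2, b)
        else:
            X2, Z2 = madd(X1, Z1, X2, Z2, x)
            X1, Z1 = mdouble(X1, Z1, b)
    return gf_mul(X1, gf_inv(Z1))      # single inversion at the very end

B_TOY, X_TOY = 15, 2                   # arbitrary toy values, not a standardized curve
x6 = montgomery_ladder_x(6, X_TOY, B_TOY)
print(x6 == montgomery_ladder_x(2, montgomery_ladder_x(3, X_TOY, B_TOY), B_TOY))
```

Every loop iteration performs exactly one addition and one doubling regardless of the key bit, and only one inversion is needed at the end to return to affine coordinates.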
Koblitz curve
A Koblitz curve is an elliptic curve defined over GF(2) that is given by the equation

$$E_a: y^2 + xy = x^3 + ax^2 + 1, \tag{13}$$

where a ∈ {0, 1}. Koblitz curves are of interest because point doublings in the binary method can be replaced with computationally cheaper operations. Namely, consider the Frobenius map $\tau: E_a(GF(2^n)) \to E_a(GF(2^n))$, $(x, y) \mapsto (x^2, y^2)$ for all points on a curve $E_a$, with $\tau(\infty) = \infty$. This map can be easily computed since it relies only on squaring operations in GF(2^n). It can be shown from the point addition operation that the three points $(x, y)$, $(x^2, y^2)$ and $(x^4, y^4)$ satisfy the following:

$$(x^4, y^4) + 2(x, y) = \mu (x^2, y^2). \tag{14}$$

From the definition of the Frobenius map we get

$$\tau(\tau P) + 2P = \mu \tau P, \tag{15}$$

where $\mu = (-1)^{1-a}$. So, the Frobenius map can be viewed as a complex number τ that satisfies $\tau^2 + 2 = \mu\tau$, and we use $\mathbb{Z}[\tau]$ to denote the ring of polynomials in τ with coefficients from $\mathbb{Z}$. Then we write:

$$(u_{t-1}\tau^{t-1} + \cdots + u_1\tau + u_0)P = u_{t-1}\tau^{t-1}(P) + \cdots + u_1\tau(P) + u_0P. \tag{16}$$
Therefore, one has to find a decomposition of the scalar k of the form $k = u_{t-1}\tau^{t-1} + \cdots + u_1\tau + u_0$ (the so-called τ-adic expansion) and then use the previous equation to compute kP. This scalar multiplication is called the TNAF method, and it is computed with point additions/subtractions and τ multiplications (τP) (see Algorithm 3). Solinas proposed an efficient τ-adic expansion of k by recoding the scalar in this form [18].
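The relation between the Frobenius map and ordinary point arithmetic can be checked numerically. The sketch below, written under stated assumptions, constructs a point on a Koblitz curve over GF(2^163) and verifies that applying τ twice and adding 2P gives μτP. The construction of the point (solving a quadratic with the half-trace and choosing a ∈ {0, 1} so that a solution exists) is a convenience of this example and is not taken from the paper, and all helper names are hypothetical.

```python
N = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^N)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= POLY
    return r

def gf_sqr(a):
    return gf_mul(a, a)

def gf_inv(a):
    """Inverse as a^(2^N - 2) (Fermat)."""
    r, e = 1, (1 << N) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_sqr(a)
        e >>= 1
    return r

def trace(c):
    """Tr(c) = c + c^2 + ... + c^(2^(N-1)); the result is 0 or 1."""
    t = s = c
    for _ in range(N - 1):
        t = gf_sqr(t)
        s ^= t
    return s

def half_trace(c):
    """For odd N and Tr(c) = 0, returns z with z^2 + z = c."""
    t = h = c
    for _ in range((N - 1) // 2):
        t = gf_sqr(gf_sqr(t))
        h ^= t
    return h

def ec_add(P, Q, a):
    """Affine addition on y^2 + xy = x^3 + a*x^2 + 1; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 ^ y2 == x1:
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))
        x3 = gf_sqr(lam) ^ lam ^ a
        return (x3, gf_sqr(x1) ^ gf_mul(lam ^ 1, x3))
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_sqr(lam) ^ lam ^ x1 ^ x2 ^ a
    return (x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1)

def frobenius(P):
    """tau(x, y) = (x^2, y^2); the point at infinity is fixed."""
    return None if P is None else (gf_sqr(P[0]), gf_sqr(P[1]))

def koblitz_point(x):
    """Pick a in {0, 1} so that x != 0 lies on E_a, and solve for y via y = x*z, z^2 + z = x + a + 1/x^2."""
    inv_x2 = gf_inv(gf_sqr(x))
    a = trace(x ^ inv_x2)              # makes Tr(x + a + 1/x^2) = 0 because Tr(1) = 1 for odd N
    z = half_trace(x ^ a ^ inv_x2)
    return a, (x, gf_mul(x, z))

a, P = koblitz_point(2)
tau_P = frobenius(P)
lhs = ec_add(frobenius(tau_P), ec_add(P, P, a), a)                 # tau^2(P) + 2P
rhs = tau_P if a == 1 else (tau_P[0], tau_P[0] ^ tau_P[1])         # mu * tau(P), mu = (-1)^(1-a)
print(lhs == rhs)
```

Because τ costs only two squarings, replacing doublings by τ is what makes the TNAF method cheaper than the binary methods on a generic curve.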
Algorithm 3 Algorithm for point multiplication: TNAF method [31].
Table 3 summarizes the cost of the three different computation sequences for scalar multiplication.
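The τ-adic NAF recoding itself needs only integer arithmetic. The following sketch follows the well-known division-by-τ recoding attributed to Solinas, applied directly to an integer k. It deliberately omits the preliminary reduction of k modulo (τ^n − 1)/(τ − 1), so the expansions it produces are roughly twice as long as those a reduced recoding would give. The helper `eval_tau`, which evaluates an expansion back in Z[τ] using τ^2 = μτ − 2, is a hypothetical checker added for this example.

```python
def tnaf(k, mu):
    """tau-adic NAF digits of the integer k, least significant first; mu is +1 or -1."""
    r0, r1 = k, 0                      # current value is r0 + r1*tau
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 & 1:
            u = 2 - ((r0 - 2 * r1) % 4)    # +1 or -1, chosen so the next digit is 0
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # Divide r0 + r1*tau by tau, using 2 = tau*(mu - tau).
        r0, r1 = r1 + mu * (r0 // 2), -(r0 // 2)
    return digits

def eval_tau(digits, mu):
    """Evaluate sum(u_i * tau^i) as (c0, c1) meaning c0 + c1*tau, with tau^2 = mu*tau - 2."""
    c0, c1 = 0, 0
    for u in reversed(digits):
        c0, c1 = -2 * c1 + u, c0 + mu * c1     # multiply by tau, then add the digit
    return c0, c1

print(tnaf(2, 1))                 # [0, -1, 0, -1], i.e. 2 = -tau - tau^3 for mu = 1
print(eval_tau(tnaf(2, 1), 1))    # (2, 0)
```

In a scalar multiplication, each digit position then costs one application of τ instead of a point doubling, and each non-zero digit costs one point addition or subtraction.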
Cryptoprocessor architecture
The proposed architecture of the cryptoprocessor is composed of the main controller, several P-MALUs and the register file that shares intermediate variables between the P-MALUs (i.e. the so-called shared memory). The block diagram of the cryptoprocessor is illustrated in Fig. 2. Although conventional hardware architectures for ECC employ several different data processing units that accelerate specific operations (e.g. modular squaring and inversion), our cryptoprocessor executes a single operation form, A(B + D) + C, on the P-MALU. As modular inversions are performed with a chain of modular additions and multiplications [33], one can perform all necessary modular operations for ECC with this operation form. This architecture facilitates parallel processing for EC scalar multiplication. The main CPU communicates with the cryptoprocessor through memory-mapped I/O (e.g. an SRAM interface) and has three types of 32-bit inputs and outputs. One of them is a signal that tells the main CPU to stop sending instructions when the instruction buffer is full (IQB full in Fig. 2). A 32-bit input/output passes data back and forth between the main CPU and the cryptoprocessor, and a 32-bit output is used to send instructions. The data transfer between the main CPU and the cryptoprocessor is controlled by a data bus controller (DBC). If SRAM attached to the main CPU is used for storing intermediate variables during ECC operations, the cryptoprocessor can be constructed without the register file. Alternatively, to reduce the I/O transfer overhead, the register file can be embedded in the cryptoprocessor. In this case, the path through the DBC is only activated when an initial point and the parameters of an elliptic curve are sent to the RAM, or when the result is retrieved. The details of the instruction bus controller (IBC), including our strategy for ILP, are discussed in Section 4.3.
Parallelized modular arithmetic logic unit (P-MALU)
Our proposed architecture for the P-MALU is based on the implementation presented in [34]. The datapath of the P-MALU deals with a polynomial basis $(1, \alpha, \alpha^2, \ldots, \alpha^{n-1})$, where α is a root of an irreducible polynomial P(x) of degree n over GF(2). It is composed of a conventional bit-serial MSB-first multiplier and a Montgomery multiplier over GF(2^n), as illustrated in Fig. 3 (the XOR chains on the left and right side, respectively). This datapath implementation computes

$$A(x)B(x) \cdot x^{-N} + C(x) \pmod{P(x)}. \tag{17}$$

This is explained as follows. First, the MSB-first multiplier performs modular multiplication by providing the coefficients of $A(x) = \sum_{i=0}^{n-1} a_i x^i$ from the MSB down to the coefficient of $x^N$. Thus we obtain the partial modular multiplication result as

$$\sum_{i=N}^{n-1} a_i x^{i-N} B(x) \pmod{P(x)}. \tag{18}$$

Second, the Montgomery multiplication is executed by providing the remaining N coefficients of A(x) from the LSB as follows.

$$\sum_{i=0}^{N-1} a_i x^{i-N} B(x) \pmod{P(x)}. \tag{19}$$
Lastly, the above two products and C(x) are added, and (17) is obtained. The advantage of this method lies in the fact that modular multiplication can be accelerated by computing two different multiplications in parallel. As the operation can be completed in N cycles, the latency of the modular multiplication is improved compared to the case of using individual multipliers that take n cycles. Although A(x) can be divided into parts of different bit sizes, it is simplest to split it in half (i.e. about n/2 bits each) so that the computation in (17) is distributed equally over the two different types of multipliers. This paper also employs this strategy. For instance, n = 163 is used for GF(2^163). Then a 163-bit × 163-bit modular multiplication is divided into two multiplications of at most 82 bits × 163 bits.
For convenience of repeated usage of (17), the modified Montgomery form $\tilde{A}(x) = A(x) \cdot x^N \pmod{P(x)}$ is applied because the output is in the Montgomery form as well:

$$\tilde{A}(x)\tilde{B}(x) \cdot x^{-N} + \tilde{C}(x) \pmod{P(x)} = (A(x)B(x) + C(x)) \cdot x^{N} \pmod{P(x)}. \tag{20}$$
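The split of one multiplication into an MSB-first half and a Montgomery (LSB-first) half can be modelled in software. The sketch below is a bit-serial model (digit size d = 1) under stated assumptions: n = 163, N = 82, the reduction polynomial x^163 + x^7 + x^6 + x^3 + 1, and the upper 81 coefficients of A(x) assigned to the MSB-first side. Function names are hypothetical and the D operand of the A(B + D) + C form is omitted. In hardware the two loops would run concurrently, whereas here they simply run one after the other.

```python
N_FIELD = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
HALF = 82            # the N of the text: coefficients a_0 .. a_(N-1) go to the Montgomery side

def msb_first_part(a_hi, nbits, b):
    """Computes a_hi(x) * b(x) mod P(x), scanning a_hi from its most significant bit."""
    t = 0
    for i in range(nbits - 1, -1, -1):
        t <<= 1
        if t >> N_FIELD:
            t ^= POLY
        if (a_hi >> i) & 1:
            t ^= b
    return t

def montgomery_part(a_lo, nbits, b):
    """Computes a_lo(x) * b(x) * x^(-nbits) mod P(x), scanning a_lo from its least significant bit."""
    t = 0
    for i in range(nbits):
        if (a_lo >> i) & 1:
            t ^= b
        if t & 1:               # make t divisible by x by adding P(x), whose constant term is 1
            t ^= POLY
        t >>= 1
    return t

def pmalu(a, b, c):
    """Model of A(x)B(x) * x^(-N) + C(x) mod P(x) built from the two partial products."""
    a_hi = a >> HALF
    a_lo = a & ((1 << HALF) - 1)
    return msb_first_part(a_hi, N_FIELD - HALF, b) ^ montgomery_part(a_lo, HALF, b) ^ c

def gf_mul(a, b):
    """Reference multiplication in GF(2^163), used only for checking."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N_FIELD:
            a ^= POLY
    return r

A = (1 << 162) ^ 0xDEADBEEF     # arbitrary test operands
B = (1 << 160) ^ 0x1234567
C = 0xCAFE
X_N = 1 << HALF                 # the field element x^N
print(gf_mul(pmalu(A, B, C), X_N) == gf_mul(A, B) ^ gf_mul(C, X_N))
```

Feeding operands that are already multiplied by x^N yields a result that carries the same factor, which is why chained operations can stay in the modified Montgomery form.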
MSB-first modular multiplier and Montgomery modular multiplier
Here, the digit-serial implementation of the two multipliers is explained in detail. The MSB-first multiplier sums up three inputs, $a_iB(x)$, $m_iP(x)$ and $T(x)$, and then outputs the intermediate result $T_{next}(x)$ by computing

$$T_{next}(x) = (T(x) + a_iB(x) + m_iP(x))\,x, \tag{21}$$

where $m_i = t_n$. By providing A(x) d bits at a time from the MSB side and feeding $T_{next}(x)$ back as the next-cycle input T(x), one can obtain the result of (18) in N/d cycles. A detailed explanation is given in [35]. As for the Montgomery multiplier, A(x) is sent to the datapath from the LSB. In this way, the intermediate result is updated as

$$T_{next}(x) = (T(x) + a_iB(x) + m_iP(x))/x, \tag{22}$$

where $m_i$ is chosen such that the sum is divisible by x.
The proposed datapath is scalable in the digit size d which can be decided by exploring the best combination of system performance and cost. The field size n is determined by the key-length. Figure 3 illustrates the block diagram of the digit-serial P-MALU.
To check the improvement in the area-performance trade-off obtained with the P-MALU, we also implement a compact modular arithmetic unit that has only the MSB-first multiplier, as shown in Fig. 4. We denote this multiplier as MALU. The MALU performs (23) in n/d cycles, which is almost double the cycle count of the P-MALU.

$$A(x)B(x) + C(x) \pmod{P(x)} \tag{23}$$

Instruction-level parallelism
Instructions are sent to the MALU or P-MALU either from the main CPU or from pre-set microcode in the μ-code RAM. When the main CPU is in charge of dispatching instructions, the IBC block can be detached from the cryptoprocessor. In this case, however, the instruction issue throughput may not be high enough for the MALU(s) or P-MALU(s) to be utilized effectively. In contrast, when the μ-code RAM is used to assist the main CPU, the instruction bus controller (IBC) can handle one instruction per cycle. For instance, the sequence of point doubling is stored in the μ-code RAM and the main CPU calls it as a single instruction. Thus multiple P-MALUs can be activated in parallel without any instruction stalls. This is another level of parallelism in the proposed cryptoprocessor. During point multiplication, the IBC keeps reading instructions from the μ-code RAM and stores them in an instruction queue buffer (IQB) unless the IQB is full. The IBC checks for instruction-level parallelism (ILP) by examining the data dependencies of the instructions in the IQB and forwards them to the P-MALU(s), possibly out of order. Table 4 shows an example of the ILP mechanism for three consecutive point doublings of ECC, for both in-order and out-of-order execution. The first two instructions have a RAW (read-after-write) dependency on t1. ECDB04 has no RAW dependency on the first three instructions, and therefore it can be issued ahead of them.
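The dependency check performed on the instruction queue can be illustrated with a small scheduling model. This is a simplified sketch and not the IBC logic itself: every instruction is assumed to take one time slot, an instruction may be issued ahead of earlier ones only if it has no RAW, WAW or WAR conflict with any earlier instruction still in the queue (a conservative rule chosen for this example), and the instruction names and registers are made up rather than taken from Table 4.

```python
def schedule(instrs, units):
    """Greedy issue model.

    instrs: list of (name, dest, sources) in program order.
    units:  number of parallel datapaths (e.g. P-MALUs).
    Returns a list of time slots, each a list of instruction names issued together.
    """
    pending = list(instrs)
    slots = []
    while pending:
        issued, waiting = [], []
        earlier_writes, earlier_reads = set(), set()
        for name, dest, srcs in pending:
            raw = any(s in earlier_writes for s in srcs)   # reads a value not yet produced
            waw = dest in earlier_writes                   # would overwrite out of order
            war = dest in earlier_reads                    # would clobber a value still needed
            if len(issued) < units and not (raw or waw or war):
                issued.append(name)
            else:
                waiting.append((name, dest, srcs))
            earlier_writes.add(dest)
            earlier_reads.update(srcs)
        slots.append(issued)
        pending = waiting
    return slots

# Hypothetical four-instruction fragment: I2 depends on I1, I4 depends on I2 and I3.
PROGRAM = [
    ("I1", "t1", ("X", "X")),
    ("I2", "t2", ("t1", "t1")),
    ("I3", "t3", ("Z", "Z")),
    ("I4", "t4", ("t2", "t3")),
]

print(schedule(PROGRAM, 1))   # [['I1'], ['I2'], ['I3'], ['I4']]
print(schedule(PROGRAM, 2))   # [['I1', 'I3'], ['I2'], ['I4']]
```

With two datapaths, I3 overtakes I2 because it does not depend on t1, so the fragment finishes in three slots instead of four.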
Results
As discussed in the previous section, the proposed cryptoprocessor is designed to maximize performance by deploying hardware that supports multi-level parallelism in ECC as well as by using an optimized algorithm for scalar multiplication. The cryptoprocessor is synthesized with 0.13-μm CMOS technology. In this section, cost and performance trade-offs are discussed.
Performance improvement by three-level parallel processing
Our proposed cryptoprocessor has three-level parallelism: the first level is in the datapath of the MALU, the second is in the modular operations performed in the P-MALU and the third is in the instructions determined by the computation sequence of EC scalar multiplication. Figure 5 illustrates the performance improvement of the proposed cryptoprocessor for different hardware configurations and different scalar multiplication algorithms, showing the effect of each level of parallelism. Although the performance is not doubled by introducing the P-MALU because of the memory accesses, the performance of scalar multiplication is improved by a factor of 1.8 for all three scalar multiplication algorithms. The performance also improves as the number of P-MALUs or MALUs increases because of the parallel execution of instructions. In order to see the effectiveness of this parallelism, we define the degree of parallelism as
where $M_i$ is the number of instructions executed in i-way parallel processing and $l_{max}$ is the number of P-MALUs or MALUs. For instance, the degree of parallelism becomes 2.0 for ECC_M with two copies of the datapath since almost all instructions are processed in two-way parallel executions. The degree of parallelism is almost proportional to the number of processing elements up to three-way parallelism for ECC_M. In this way, we obtain a degree of parallelism of 2.7 for ECC_M with three copies of the datapath. However, no further effective performance improvement is observed with four P-MALUs or MALUs. For ECC and ECC_KC, we obtain degrees of parallelism of 2.1 and 2.2, respectively, with three copies of the datapath. These results indicate that the computation sequences of ECC and ECC_KC cannot utilize more than two processing elements effectively.
Synthesis results
To observe the fastest possible implementation, the cryptoprocessor is synthesized with 0.13-μm CMOS technology with the P-MALU configuration of d = 12. As summarized in Table 5, our results are, to the best of our knowledge, faster than any previous ASIC work. Sozzani et al. [27] showed several different ECC implementations. Although their design for ECC_M shows comparable performance, their result for ECC is much slower than ours. Satoh and Takano proposed a scalable dual-field cryptoprocessor [23] that supports ECC over GF(2^m) and GF(p). Another scalable dual-field ECC processor was designed by Smyth et al. [36]. Their implementations offer greater flexibility and support various types of ECC. Their performance for ECC over GF(2^163), however, is much lower than ours. Figure 6 illustrates the area and performance trade-offs for the previous work and for our cryptoprocessor with various implementation options.
Cost-performance trade-offs
So far, we have focused on reducing the time for one scalar multiplication. More precisely, our optimization of the cryptoprocessor reduces the latency of one scalar multiplication. This type of improvement is important for applications that perform scalar multiplications sequentially, for example when an ECC-RFID reader needs to verify the identity of an ECC-RFID tag.
On the other hand, we can assume other kinds of PKC applications that need scalar multiplications in parallel. When computing two independent scalar multiplications (e.g. k1P1 and k2P2), the performance can be doubled by using two EC cryptoprocessors (denoted as two channels) in parallel. Namely, this type of improvement increases the throughput. Figures 7 and 8 show the trade-offs between area and performance for ECC_M and ECC_KC, respectively, with up to four channels. The numbers marked on the plots indicate the number of P-MALUs or MALUs. One channel of the cryptoprocessor with one MALU is the smallest hardware configuration in our study, and its size is 78K gates. By increasing the number of MALUs, the performance improves up to 38,200 and 48,600 kP/s for ECC_M and ECC_KC, respectively. When using P-MALUs in place of the MALUs, the performance reaches 62,600 and 82,000 kP/s for ECC_M and ECC_KC, respectively, with four P-MALUs.
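The relation between the latency of one scalar multiplication and the achievable throughput is simple arithmetic, sketched below. The latencies are the rounded values quoted in the abstract, the function name is hypothetical, and the channel scaling assumes fully independent scalar multiplications with no shared bottleneck.

```python
def throughput_per_second(latency_us, channels=1):
    """Scalar multiplications per second for a given latency in microseconds."""
    return channels * 1_000_000 / latency_us

for method, latency_us in (("binary NAF", 20), ("Montgomery", 16), ("TNAF on a Koblitz curve", 12)):
    print(f"{method}: about {throughput_per_second(latency_us):,.0f} kP/s per channel")
```

Because the quoted latencies are rounded, the values computed this way differ slightly from the synthesis figures reported in the text.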
As discussed in Section 5.1, the best cost-performance trade-off is obtained from a cryptoprocessor with two P-MALUs for ECC and ECC_KC, and with three P-MALUs for ECC_M. Figure 9 shows the estimated cost-performance curve when preparing up to ten channels of ECC, each of which has two P-MALUs for ECC and ECC_KC, and three P-MALUs for ECC_M. As can be seen from this figure, our 163-bit ECC implementation offers performances of 689, 446 and 350 kP/s per 1K gates for ECC_KC, ECC_M and ECC, respectively, on the same cryptoprocessor.
The results of the NAF method with windows of width 4 (NAF4) for ECC and the TNAF method with windows of width 4 (TNAF4) for ECC_KC are also plotted for comparison with the binary NAF method and the TNAF method, respectively. NAF4 and TNAF4 show worse area-performance trade-offs in our implementation. This is because we use a flip-flop-based multi-port register file, and hence the additional memory space for pre-computed points is expensive relative to the performance gain.
Conclusions
This paper introduced an EC cryptoprocessor that exploits three-level parallelism in ECC. The implementation results on a generic curve showed that a scalar multiplication of ECC over GF(2^163) was performed in 20 and 16 μs with the binary NAF method and the Montgomery method, respectively. Furthermore, the scalar multiplication was accelerated to 12 μs on a Koblitz curve by using the TNAF method. This speed-up was achieved by thoroughly exploiting parallelism in the system architecture of ECC.

She received her Electrical Engineering degree and Ph.D. degree from the K.U. Leuven in Belgium. She is currently a professor at the K.U. Leuven and an adjunct associate professor at UCLA. At K.U. Leuven, she is co-director of the COSIC (Computer Security and Industrial Cryptography) lab. She was a lecturer and visiting research engineer at UC Berkeley from 1992 to 1994. From 1994 to 1998 she was a principal engineer first with TCSI and then with Atmel in Berkeley, CA. She joined UCLA in 1998 as an associate professor and the K.U. Leuven in 2003.
She is active in many conferences: she was the program chair in 2002 and the general chair for ISLPED 2003. She was a member of the executive committee of the 42nd and 43rd DAC as the design community chair. She is the program chair of the 2007 CHES (Cryptographic Hardware and Embedded Systems) conference. More information on her research can be found at www.emsec.ee.ucla.edu or www.esat.kuleuven.be/cosic.
