Abstract. Elliptic curve cryptography (ECC) provides high security with shorter keys than other public-key cryptosystems and it has been successfully used in security critical embedded systems. We present an FPGA-based coprocessor that communicates with the host processor via a 32-bit bus. It implements ECC over an elliptic curve that offers roughly 128-bit security. It is the first hardware implementation that uses the recently introduced lambda coordinates and the Galbraith-Lin-Scott (GLS) technique with fast endomorphisms. One scalar multiplication requires 65,000 clock cycles with a maximum clock frequency of 274 MHz on a Xilinx Virtex-5 FPGA, which gives a computation time of 0.24 ms. The area utilization is 1552 slices and 4 BlockRAMs. Our coprocessor compares favorably to other published works both in terms of speed and area, which makes it a good choice for embedded systems that require high-security public-key cryptography.
Introduction
Many embedded systems are used in applications where security and safety are of utmost importance. Such security-critical embedded systems include airplanes, cars, medical devices, home automation systems, military devices, etc. They require that confidentiality, integrity, and authenticity are ensured by using strong cryptography. Cryptography is often computationally demanding and efficient implementation of cryptographic computations is a topic that has been an active research field during the last couple of decades. In particular, public-key cryptosystems are challenging to implement efficiently and securely in embedded systems because they are computationally more demanding than many other forms of encryption. Public-key cryptosystems are essential parts of a variety of cryptosystems because they are required, for example, for computing digital signatures. Therefore, techniques for their fast computation with small amounts of resources are needed in embedded systems in practice.
Elliptic curve cryptography (ECC) is a type of public-key cryptography that was introduced in the mid-1980s. ECC has many benefits compared to other forms of public-key cryptography. The main benefit is that high security levels can be achieved with shorter keys than for other public-key cryptosystems (such as RSA, Diffie-Hellman, ElGamal, etc.). ECC uses 2n-bit keys for achieving roughly n bits of security. In comparison, RSA requires significantly longer keys: 1,024 and 3,072 bits for 80-bit and 128-bit security levels, respectively [34]. ECC implementations have also proven to be faster than RSA implementations [15]. ECC is also widely included in multiple standards, which is a significant advantage in commercial applications.
Field-programmable gate arrays (FPGAs) have been popular implementation platforms for cryptography and a plethora of ECC implementations for FPGAs are available in the literature. While many of them primarily target hardware acceleration of ECC by optimizing speed with very loose area constraints (see, e.g., [24, 1, 20, 11, 4]), there also exist many implementations that are suitable for security-critical embedded systems including, e.g., [41, 28, 26, 42]. In them, the primary optimization target is typically either area or speed-area ratio. A vast majority of such designs have been introduced for the 80-bit security level, which should have been used only up to 2010 as recommended by the National Institute of Standards and Technology (NIST) of the United States [33]. Most of the publications target the NIST curves specified in [35]. These curves date back to the 1990s and they cannot utilize certain state-of-the-art optimizations that have been introduced in recent years. Several new curves have been introduced allowing more efficient computations (see, e.g., [14, 13, 7, 9, 8, 17, 37, 10]). Although many studies about software performance of these curves are available, hardware implementations are still largely missing from the literature (for some exceptions, see, e.g., [3, 2, 5, 6, 16, 40]). In particular, to the best of our knowledge, there are no hardware implementations of the new λ-coordinates [37] combined with the Galbraith-Lin-Scott (GLS) technique for binary curves [17], which was shown to be very efficient in software [37]. The implementation in [2] uses the same curve without λ-coordinates, focusing on speed maximization with very loose area constraints. It is widely known that there is often a difference between efficiency in software and hardware: some curves that are fast in software may not be as efficient in hardware, and vice versa. Good examples are prime and binary curves, of which the former are better in software and the latter in hardware. NIST has also shown interest in the new curves and techniques and there appear to be considerations to standardize new elliptic curves (see, e.g., the call for papers of the NIST workshop on elliptic curves [36]). Hence, it is important to shed light on hardware performance of the new curves and techniques in order to complete the picture about their efficiency.
In this paper, we present an FPGA coprocessor for ECC which is designed primarily for security-critical embedded systems. Our coprocessor communicates with a host processor via a 32-bit interface which allows easy integration to various systems. The coprocessor implements ECC for the 128-bit security level which matches, e.g., the security offered by the 128-bit version of the Advanced Encryption Standard (AES) [32] . Most other publications consider the significantly less secure 80-bit security level as discussed above. The coprocessor is designed so that it is both fast and compact and, thus, meets the requirements of various security-critical embedded systems. Our implementation uses the elliptic curve and parameters as well as many of the state-of-the-art optimizations introduced by Oliveira et al. [37] in 2013. To the best of our knowledge, our implementation is the first FPGA-based ECC implementation that uses λ-coordinates from [37] and the GLV/GLS technique from [14, 13, 17] . We compiled our architecture for Xilinx Virtex-4, Virtex-5, and Spartan-6 FPGAs and the results show that our coprocessor compares favorably with the related work available in the literature although it offers a significantly higher security level. Our results also show that λ-coordinates and the GLS technique provide good results not only in software but also in hardware.
The paper is structured as follows. Section 2 presents the preliminaries of ECC and the algorithms we implement in our coprocessor. Section 3 describes the architecture of our coprocessor. Section 4 presents the results on Xilinx FPGAs and compares them to other relevant ECC implementations available in the literature. We end with conclusions and discussion on certain topics for future research in Section 5.
Preliminaries
This section provides background on ECC in general and on GLS curves and λ-coordinates that we implement in this paper in particular.
Elliptic Curve Cryptography
The use of elliptic curves for public-key cryptography was independently proposed by Victor Miller [31] and Neal Koblitz [21] in the mid-1980s. ECC achieves high security levels with significantly shorter key lengths than other public-key cryptosystems such as RSA or ElGamal. Hence, ECC has become a popular choice for public-key cryptography especially in embedded systems.
Elliptic curves defined over a finite field $\mathbb{F}_q$ are used in cryptography. The points $(x, y)$ that satisfy the equation of an elliptic curve combined with a special point called the point-at-infinity $O$ form an additive Abelian group $E$, where $O$ is the zero element. Let $P_1, P_2 \in E$. The group operation $P_1 + P_2$ is called point addition if $P_1 \neq \pm P_2$ and point doubling if $P_1 = P_2$. The most important operation of every elliptic curve cryptosystem is the scalar multiplication:
$$Q = kP \tag{1}$$
where k is an integer in the interval [1, r−1] where r is the order of P (the smallest positive integer for which rP = O). The security of ECC is based on the computational difficulty of the elliptic curve discrete logarithm problem (ECDLP), which is the problem of finding k when given Q and P . The ECDLP is believed to be infeasible to solve if the parameters of the system are chosen properly (similarly as integer factorization in the case of RSA). A secure elliptic curve over a 2m-bit finite field is believed to offer roughly m bits of security. Hence, ECC offers roughly 128-bit security with 256-bit keys whereas, for example, RSA would require approximately 3,072-bit keys [34] .
Computing (1) consists of several hierarchical levels, which are depicted in the ECC pyramid of Fig. 1 . The scalar multiplication algorithm is on the top and it computes (1) with a series of point arithmetic operations (typically, point additions and point doublings). The algorithms that implement the point operations with series of finite field operations are in the middle. The algorithms for computing finite field operations including multiplication, addition (subtraction), and inversion (division) are in the bottom. In the following, we discuss the hierarchical levels and provide descriptions of our choices for implementing them from the bottom to the top. 
Finite Field Arithmetic
Either prime fields, where q is a prime p, or binary extension fields, where q = 2 m , are typically used for ECC. Prime fields are more commonly used in software implementations. Binary fields allow significantly more efficient implementations in hardware because they employ carry-free arithmetic operations. The inclusion of carry-free instructions in modern processors has enabled extremely fast implementations using binary fields also in software [43] . Elliptic curves defined over prime and binary fields are called prime and binary curves, respectively.
In this paper, we follow the approach of [37] and use the binary field $\mathbb{F}_{2^{254}}$ which can be constructed as a quadratic extension of the binary extension field $\mathbb{F}_{2^{127}}$ by using the irreducible polynomial $g(u) = u^2 + u + 1$. That is, we set
$$\mathbb{F}_{2^{254}} = \mathbb{F}_{2^{127}}[u]/(g(u)).$$
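An element $a \in \mathbb{F}_{2^{254}}$ is represented as $a = a_0 + a_1 u$ where $a_0, a_1 \in \mathbb{F}_{2^{127}}$. Arithmetic operations in $\mathbb{F}_{2^{254}}$ can be decomposed into operations in $\mathbb{F}_{2^{127}}$ and they are computed as follows [37]:
$$a + b = (a_0 + b_0) + (a_1 + b_1)u \tag{2}$$
$$a \cdot b = (a_0 b_0 + a_1 b_1) + (a_0 b_1 + a_1 b_0 + a_1 b_1)u \tag{3}$$
$$a^2 = (a_0^2 + a_1^2) + a_1^2 u \tag{4}$$
$$a^{-1} = (a_0 + a_1)t^{-1} + a_1 t^{-1} u \tag{5}$$
where $t = a_0 a_1 + a_0^2 + a_1^2$. Let a, m, s, and i denote the costs of addition, multiplication, squaring, and inversion in $\mathbb{F}_{2^{127}}$ and let A, M, S, and I denote the costs of the respective operations in $\mathbb{F}_{2^{254}}$. The above equations give that A = 2a, M = 4m + 3a, S = 2s + a, and I = i + 3m + 2s + 3a.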
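The decomposition of arithmetic in the quadratic extension into base-field operations can be illustrated with a minimal Python sketch. It uses $g(u) = u^2 + u + 1$ as stated above and models base-field elements as Python integers. The base-field reduction polynomial $x^{127} + x^{63} + 1$, all function names, and the square-and-multiply inversion in the base field are assumptions made for illustration only; they do not describe the hardware.

```python
# Sketch of F_(2^254) = F_(2^127)[u]/(u^2 + u + 1) on top of a bit-vector base field.
# ASSUMPTION: base-field polynomial p(x) = x^127 + x^63 + 1 (illustrative choice).
M = 127
P_POLY = (1 << 127) | (1 << 63) | 1

def f_mul(a, b):
    """Multiply in F_(2^127): carry-less product followed by reduction mod p(x)."""
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= P_POLY << (i - M)
    return c

def f_inv(a):
    """Invert in F_(2^127) as a^(2^127 - 2) with plain square-and-multiply."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = f_mul(r, a)
        a = f_mul(a, a)
        e >>= 1
    return r

# Elements of the extension field are pairs (a0, a1) meaning a0 + a1*u.
def ext_add(a, b):
    return (a[0] ^ b[0], a[1] ^ b[1])                      # 2 additions

def ext_mul(a, b):
    a0b0, a1b1 = f_mul(a[0], b[0]), f_mul(a[1], b[1])
    cross = f_mul(a[0], b[1]) ^ f_mul(a[1], b[0])
    return (a0b0 ^ a1b1, cross ^ a1b1)                     # 4 mult. + 3 additions

def ext_sqr(a):
    s0, s1 = f_mul(a[0], a[0]), f_mul(a[1], a[1])
    return (s0 ^ s1, s1)                                   # 2 squarings + 1 addition

def ext_inv(a):
    t = f_mul(a[0], a[1]) ^ f_mul(a[0], a[0]) ^ f_mul(a[1], a[1])
    t_inv = f_inv(t)                                       # the single base-field inversion
    return (f_mul(a[0] ^ a[1], t_inv), f_mul(a[1], t_inv))

A = (0x1234567890ABCDEF1234567890ABCDE, 0x0FEDCBA9876543210FEDCBA98765432)
B = (0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0, 0x3333333333333333333333333333333)
```

Counting the base-field calls in each function reproduces the costs A = 2a, M = 4m + 3a, S = 2s + a, and I = i + 3m + 2s + 3a given above.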
We construct $\mathbb{F}_{2^{127}}$ by setting $\mathbb{F}_{2^{127}} = \mathbb{F}_2[x]/(p(x))$, where $p(x)$ is an irreducible polynomial of degree 127 over $\mathbb{F}_2$. Addition in $\mathbb{F}_{2^{127}}$ is a bitwise exclusive-or (xor) of the bit vectors representing the elements. Multiplication is carried out by computing a multiplication of polynomials in $\mathbb{F}_2[x]$ followed by a reduction modulo $p(x)$. Because squaring in $\mathbb{F}_2[x]$ can be performed by adding zeros between each bit of the bit vector, squaring in $\mathbb{F}_{2^{127}}$ contains only rewiring followed by the reduction modulo $p(x)$ when implemented in hardware. Inversion can be computed as an exponentiation consisting of multiplications and squarings in $\mathbb{F}_{2^{127}}$ by using the Itoh-Tsujii algorithm [19]. We use a variant of the Itoh-Tsujii inversion from [37] that utilizes the optimal addition chain (1, 2, 3, 6, 12, 24, 48, 96, 120, 126) and requires 126s + 9m in $\mathbb{F}_{2^{127}}$.
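The operation count of the inversion can be checked with a short Python sketch of an Itoh-Tsujii inversion driven by the addition chain above. The sketch assumes the reduction polynomial $x^{127} + x^{63} + 1$ and uses hypothetical function names; it is a software model for counting operations, not a description of the coprocessor datapath.

```python
# Itoh-Tsujii inversion in F_(2^127) with the chain 1, 2, 3, 6, 12, 24, 48, 96, 120, 126.
# ASSUMPTION: p(x) = x^127 + x^63 + 1 (illustrative choice of reduction polynomial).
M = 127
P_POLY = (1 << 127) | (1 << 63) | 1

def reduce(c):
    """Reduce a polynomial over F_2 (stored as an integer) modulo p(x)."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= P_POLY << (i - M)
    return c

def f_mul(a, b):
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    return reduce(c)

def f_sqr(a):
    """Square by inserting a zero between neighbouring bits, then reduce."""
    spread = 0
    for i in range(a.bit_length()):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    return reduce(spread)

# With beta[k] = a^(2^k - 1), each step (i, j) computes
# beta[i + j] = beta[i]^(2^j) * beta[j].
STEPS = [(1, 1), (2, 1), (3, 3), (6, 6), (12, 12),
         (24, 24), (48, 48), (96, 24), (120, 6)]

def itoh_tsujii_inverse(a):
    ops = {"s": 0, "m": 0}
    beta = {1: a}
    for i, j in STEPS:
        t = beta[i]
        for _ in range(j):
            t = f_sqr(t)
            ops["s"] += 1
        beta[i + j] = f_mul(t, beta[j])
        ops["m"] += 1
    ops["s"] += 1                      # final squaring: a^(2^127 - 2) = a^(-1)
    return f_sqr(beta[126]), ops

X = 0x1234567890ABCDEF1234567890ABCDE
X_INV, OPS = itoh_tsujii_inverse(X)
```

The nine chain steps contribute nine multiplications and 125 squarings, and the final squaring brings the total to 126s + 9m.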
Point Representation with λ-Coordinates
Point addition and point doubling are the basic point operations. If the points are represented in affine coordinates by using two coordinates $(x, y)$, then both point addition and point doubling require an inversion, which is a very expensive operation as shown in Section 2.2. Hence, projective coordinates, where points are represented with three coordinates as $(X, Y, Z)$, are commonly used in practical implementations of ECC. They allow computing point additions and point doublings without inversions (but with an increased number of other operations). A single inversion is required in the end of computing (1) in order to obtain the affine coordinates of the result point Q. In the case of binary curves, popular choices have been standard projective coordinates, where $x = X/Z$ and $y = Y/Z$, and López-Dahab (LD) coordinates [27], where $x = X/Z$ and $y = Y/Z^2$. In 2013, Oliveira et al. [37] proposed a new coordinate system called λ-coordinates, where points are represented as $(x, \lambda)$ so that $\lambda = x + y/x$. We refer to this coordinate system as affine λ-coordinates. The projective version of λ-coordinates represents a point with three coordinates $(X, L, Z)$ so that $x = X/Z$ and $\lambda = L/Z$. They result in the fastest formulae that are currently available for computing point arithmetic on binary Weierstrass curves. A point addition and point doubling require (excluding additions) 8M + 2S and 5M + 4S (including one multiplication by a constant), respectively. One of the main benefits of λ-coordinates compared to LD coordinates is that they allow efficient combination of point doubling and point addition operations. Computing $2P_1 + P_2$ requires only 11M + 6S (including one multiplication with a constant).
[Algorithm 1: Double-and-add (left-to-right)]
Scalar Multiplication on the GLS Curves
In this paper, we use the same curve that was used in [37]. It is a Weierstrass curve over $\mathbb{F}_{2^{254}}$ defined by the following equation:
$$E: y^2 + xy = x^3 + ax^2 + b \tag{6}$$
where $a = u$, $b = b_0$, and $b_0$ is a specific element in $\mathbb{F}_{2^{127}}$. Scalar multiplication defined by (1) can be computed with a series of point additions and point doublings. The simplest option is to use the double-and-add algorithm given in Algorithm 1. It scans through the bits of k and performs a point doubling for every bit and an additional point addition if the bit is one. Let k be an n-bit integer and let h(k) be its Hamming weight (the number of ones in the binary expansion). Then, one scalar multiplication requires n − 1 point doublings and h(k) − 1 point additions, where h(k) ≈ n/2.
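The operation counts of the left-to-right double-and-add method can be illustrated with a minimal Python sketch. The group operations are passed in as functions; the toy group used here (integers modulo 1009 under addition) and all names are assumptions chosen only to make the example runnable and carry no cryptographic meaning.

```python
def double_and_add(k, P, add, dbl):
    """Left-to-right double-and-add; returns kP and the operation counts."""
    bits = bin(k)[2:]          # binary expansion, most significant bit first
    Q = P                      # the leading one initializes the accumulator
    n_dbl = n_add = 0
    for bit in bits[1:]:
        Q = dbl(Q)             # one doubling per remaining bit
        n_dbl += 1
        if bit == "1":
            Q = add(Q, P)      # one addition per remaining nonzero bit
            n_add += 1
    return Q, n_dbl, n_add

# Toy additive group standing in for the elliptic curve group (ASSUMPTION).
R = 1009

def toy_add(a, b):
    return (a + b) % R

def toy_dbl(a):
    return (2 * a) % R

Q, N_DBL, N_ADD = double_and_add(0b101101, 5, toy_add, toy_dbl)
```

For the 6-bit scalar 101101 with Hamming weight 4, the sketch performs n − 1 = 5 doublings and h(k) − 1 = 3 additions.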
One way of improving the speed of scalar multiplications is to utilize efficiently computable endomorphisms. Menezes and Vanstone [30] showed how point doublings can be replaced with the Frobenius endomorphisms (x, y) → (x 2 , y 2 ) on certain supersingular elliptic curves over F 2 m , but these curves were found to be cryptographically weak [29] . In 1991, Koblitz [22] introduced a secure class of nonsupersingular elliptic curves over F 2 m which have the advantage of the Frobenius endomorphisms after certain conversions are computed for the scalar k. These curves are commonly known as Koblitz curves and they are nowadays included in many standards (e.g., in [35] ). In 2001, Gallant, Lambert, and Vanstone (GLV) [14] introduced a specific class of elliptic curves over F p which allows utilizing efficiently computable endomorphisms also for prime curves. Galbraith, Lin, and Scott (GLS) [13] generalized the GLV technique to a broader class of elliptic curves defined over F p 2 . The GLS curves were generalized for binary curves over F 2 2m by Hankerson, Karabina, and Menezes [17] . In this paper, we focus on these variants of the GLS curves and, in particular, the curve considered by Oliveira et al. in [37] .
The GLS technique allows splitting the computation of (1) with an n-bit k into $k_1 P + k_2 \psi(P)$ where $k_1$ and $k_2$ are approximately n/2-bit integers. That is, instead of the single scalar multiplication, one computes a sum of two smaller scalar multiplications, which can be computed efficiently with the so-called Shamir's trick (see below). We skip the mathematical subtleties and merely state that the efficiently computable endomorphism ψ of the GLS technique is based on a composition of the Frobenius endomorphism and endomorphisms between E and its quadratic twist. Interested readers can find details, e.g., from [37, 17, 14, 13]. We use $\psi(P)$ as defined in [37] as follows:
$$\psi(x, \lambda) = \big((x_0 + x_1) + x_1 u,\ (\lambda_0 + \lambda_1) + (\lambda_1 + 1)u\big) \tag{7}$$
where x = x 0 + x 1 u and λ = λ 0 + λ 1 u with x 0 , x 1 , λ 0 , λ 1 ∈ F 2 127 . Hence, ψ(P ) requires only three additions (3a) in F 2 127 . The integer k needs to be decomposed into n/2-bit k 1 and k 2 such that k ≡ k 1 + k 2 δ (mod r) where δ is an integer such that ψ(P ) = δP for all P ∈ E. Such a decomposition can be found by using techniques for finding the GLV decomposition given in [14] . The decomposition algorithm can be simplified for specific curve parameters and we use the decomposition algorithm used by Oliveira et al. [37] in the C code that is publicly available 3 . It finds k 1 and k 2 by computing:
where β is a 64-bit constant specific for the curve and $k_l$ and $k_h$ are the lowest and highest 127-bit words of the 254-bit k.
Shamir's trick (see, e.g., [18]) is a technique that allows evaluating a sum of two scalar multiplications $k_1 P + k_2 \psi(P)$ simultaneously. We call this operation double scalar multiplication. Let $k_1$ and $k_2$ be n/2-bit integers. If the double scalar multiplication is computed with two separate scalar multiplications, then it requires n − 2 point doublings and $h(k_1) + h(k_2) − 1 ≈ n/2 − 1$ point additions. Shamir's trick arranges $k_1$ and $k_2$ into a 2 × n/2-bit matrix and precomputes the point $P + \psi(P)$. The double scalar multiplication is computed by scanning through the columns of the matrix. A point doubling is computed for every column and a point addition is computed for all nonzero columns. If the column is $(1, 0)^T$, $(0, 1)^T$, or $(1, 1)^T$, then one adds either $P$, $\psi(P)$, or $P + \psi(P)$, respectively. Hence, Shamir's trick requires n/2 − 1 point doublings and, on average, about 3n/8 point additions because three out of the four possible column values are nonzero.
In [37], Oliveira et al. used a parallelization technique that splits the scalar multiplication in two parallel computations: one based on the double-and-add and the other on halve-and-add. Halve-and-add computes point halvings Q ← Q/2 instead of point doublings. They also used the width-w non-adjacent form (w-NAF) for representing $k_1$ and $k_2$ in the GLV encoding, which leads to a smaller number of point additions but requires precomputations and extra storage for the precomputed points. We decided not to use these optimizations in our implementation because implementing them would lead to a significant growth of the control logic and the parallelization technique would also require another unit for field arithmetic (see MALU in Section 3.4).
To summarize, we implement the scalar multiplication by using the GLS technique which splits an n-bit k into n/2-bit k 1 and k 2 so that both are given in standard binary representation. We precompute P +ψ(P ). We then use Shamir's trick for evaluating the double scalar multiplication by computing point doublings and combined point doublings and additions. The point arithmetic is performed in projective λ-coordinates by using the formulae from [37] . Finite field arithmetic is computed in F 2 254 by decomposing the operations into operations in F 2 127 as shown in (2)-(5).
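The column-wise evaluation of the double scalar multiplication can be sketched in Python as follows. The group operations and the image ψ(P) are supplied by the caller; the toy group (integers modulo 1009 under addition), the stand-in value used for ψ(P), and all names are assumptions for illustration and do not model the actual curve or endomorphism.

```python
def shamir_double_scalar_mult(k1, k2, P, psi_P, add, dbl):
    """Compute k1*P + k2*psi(P) by scanning the columns of the scalar matrix.

    Assumes that at least one of k1 and k2 is nonzero.
    """
    table = {(1, 0): P, (0, 1): psi_P, (1, 1): add(P, psi_P)}  # one precomputed sum
    n = max(k1.bit_length(), k2.bit_length())
    columns = [((k1 >> i) & 1, (k2 >> i) & 1) for i in range(n - 1, -1, -1)]
    Q = table[columns[0]]              # the top column is nonzero by the choice of n
    n_dbl = n_add = 0
    for col in columns[1:]:
        Q = dbl(Q)                     # a doubling for every column
        n_dbl += 1
        if col != (0, 0):
            Q = add(Q, table[col])     # an addition for every nonzero column
            n_add += 1
    return Q, n_dbl, n_add

# Toy additive group and a stand-in for psi(P) = delta * P (both ASSUMPTIONS).
R = 1009
DELTA = 123

def toy_add(a, b):
    return (a + b) % R

def toy_dbl(a):
    return (2 * a) % R

P = 7
PSI_P = (DELTA * P) % R
Q, N_DBL, N_ADD = shamir_double_scalar_mult(0b1011, 0b0110, P, PSI_P, toy_add, toy_dbl)
```

In hardware, the doubling and the following addition for a nonzero column are merged into the combined operation 2Q + P; the sketch keeps them separate only to make the counts visible.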
Architecture
We present a coprocessor architecture for FPGAs that implements (1) using λ-coordinates and the fast GLS endomorphism. In this architecture, the finite field processing unit (FFPU) performs operations in $\mathbb{F}_{2^{127}}$ and three control units (i.e., finite state machines (FSMs)) drive the FFPU in a hierarchical manner according to the ECC hierarchy shown in Figure 1. This provides a natural way to decompose the complex control logic required for computing (1) into a set of smaller FSMs, which can be implemented efficiently in FPGAs. Additionally, smaller processing and control units for integer arithmetic perform the scalar decomposition. A register file and block RAMs (BRAM) are used for storing temporary results. The register file stores frequently used temporary variables. The base point, the result of a scalar multiplication and the curve constant a are also stored in the BRAM. In the following, we describe the architecture of the coprocessor in a hierarchical manner according to Figure 1.
Top Level FSM
Our implementation is designed to be used as an ECC coprocessor. The architecture of the coprocessor is shown in Figure 2. The host processor initiates the coprocessor by sending the base point P and scalar k over a 32-bit bus. The base point is stored in the BRAM and the scalar is stored in the registers. Next, the coprocessor starts the precomputation. The scalar multiplication FSM (the Top FSM in Figure 2) manages the point operation FSM, shifts the scalar registers $k_1$ and $k_2$, and determines the next point operation. The scalar multiplication FSM also organizes precomputations, which are the scalar decomposition, computation of the endomorphism $\psi(P)$, the point addition for computing $P + \psi(P)$, finding the first nonzero column of the scalar matrix, and performing the coordinate transformations. First, $\psi(P)$ is calculated and stored in the BRAM and, then, it is added to P. Since the combined point doubling and point addition requires the other input point to be given in affine λ-coordinates, the point $P + \psi(P)$ is converted to affine λ-coordinates and stored in the BRAM. After that, the scalar decomposition that computes (8) and (9) starts and when it is ready, the precomputation ends with finding the first nonzero column of the scalar matrix. Because all points and the scalars are ready, the scalar multiplication starts after this. As discussed in Section 2.4, the double scalar multiplication $k_1 P + k_2 \psi(P)$ is implemented using Shamir's trick. The accumulator point Q is initialized with $P$ if the first nonzero column is $(1, 0)^T$, with $\psi(P)$ if it is $(0, 1)^T$, and with $P + \psi(P)$ if it is $(1, 1)^T$. After this, either point doublings (for zero columns) or the combined point doublings and additions (for nonzero columns) are performed depending on the bits of $k_1$ and $k_2$ and the points to be added are determined as above.
After the scalar multiplication is finished, the result point Q is first converted to affine λ-coordinates and then to affine coordinates and it is stored into a specific location in the BRAM, where it is available to the host processor. The scalar multiplication algorithm implemented by the coprocessor is shown in Algorithm 2.
Algorithm 2 Scalar multiplication on a GLS curve, Q = kP
Require: Scalars k1 = (k1,n1−1 . . . k1,0), k2 = (k2,n2−1 . . . k2,0), base point P
Ensure: Result point Q = kP = k1P + k2ψ(P)
  P1 ← ψ(P)    // 3a in F_{2^127}
  P2 ← P + P1    // Algorithm 3
  n ← max(n1, n2)
  if k1,n−1 = 1 and k2,n−1 = 1 then
    Q ← P2
  else if k1,n−1 = 0 and k2,n−1 = 1 then
    Q ← P1
  else if k1,n−1 = 1 and k2,n−1 = 0 then
    Q ← P
  end if
  for i = n − 2 down to 0 do
    if k1,i = 1 and k2,i = 1 then
      Q ← 2Q + P2    // Algorithm 5
    else if k1,i = 0 and k2,i = 1 then
      Q ← 2Q + P1    // Algorithm 5
    else if k1,i = 1 and k2,i = 0 then
      Q ← 2Q + P    // Algorithm 5
    else
      Q ← 2Q    // Algorithm 4
    end if
  end for
  return Q
Point Operations FSM
Point operations (point doubling, point addition and combined point doubling and addition) and coordinate conversion are implemented in the point operation FSM. Point addition, point doubling and combined point doubling and addition require 5M + 2S + 5A, 5M + 4S + 5A, and 11M + 6S + 9A operations in F 2 254 , respectively. Affine coordinates to affine λ-coordinates, affine λ-coordinates to affine coordinates, and projective λ-coordinates to affine λ-coordinates conversions require I + M + A, M + A and I + 2M, respectively. One of the multiplications in both point doubling and combined point doubling and addition is a multiplication by a constant (the curve parameter a).
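The conversion costs quoted above follow directly from the definitions of the coordinate systems in Section 2.3 ($\lambda = x + y/x$, $x = X/Z$, and $\lambda = L/Z$). As a short worked summary, with the cost of each line in $\mathbb{F}_{2^{254}}$ operations given on the right:

```latex
\begin{align*}
(x, y) \to (x, \lambda):\quad & \lambda = x + y \cdot x^{-1} && (I + M + A)\\
(x, \lambda) \to (x, y):\quad & y = x \cdot (\lambda + x) && (M + A)\\
(X, L, Z) \to (x, \lambda):\quad & x = X \cdot Z^{-1},\quad \lambda = L \cdot Z^{-1} && (I + 2M)
\end{align*}
```

The second line uses the fact that addition and subtraction coincide in binary fields, so $\lambda + x = y/x$.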
The point operations FSM fetches input operands for an operation in $\mathbb{F}_{2^{254}}$ and writes them into registers of the register file. After the operation in $\mathbb{F}_{2^{254}}$ is finished, the point operation FSM writes the result of the operation to the BRAM unless the result is required only for the next operation. In that case, the writing is skipped and the point operation FSM proceeds to the next operation that will operate directly on the result in the register file. Results are written to the register file by default. In some cases, a result is needed both in the next operation and in later operations. Then, the result is stored in the BRAM for the later operations while the next operation uses the copy in the register file in order to save clock cycles. The temporary variable R3 that is shown in Algorithms 3, 4, and 5 is a register in the register file and all other temporary variables are in the BRAM.
Point addition is used only once when P + ψ(P ) is computed during the precomputation. Two points in affine λ-coordinates are added in this step. The point addition formula for adding P = (x P , λ P ) and Q = (x Q , λ Q ) is shown in (10) . It returns the point P +Q = (X P +Q , L P +Q , Z P +Q ). An operation sequence for computing (10) is shown in Algorithm 3 in the Appendix.
The point doubling formula for a point in projective λ-coordinates Q = (X Q , L Q , Z Q ) is shown in (11) . It returns the point 2Q = (X 2Q , L 2Q , Z 2Q ). An operation sequence for computing (11) is shown in Algorithm 4 in the Appendix.
The efficiency of combined point doubling and addition is one of the reasons why λ-coordinates give faster results [37] . Computing them separately takes 13M + 6S in total, whereas the combined point doubling and addition requires only 11M + 6S. Therefore, combined point doubling and addition saves 2 multiplications in F 2 254 . Since the double scalar multiplication with Shamir's trick requires significantly more combined point doublings and additions (for 75 % of the columns, on average) than point doublings (for 25 % of the columns), this trick significantly reduces the overall execution time. A formula for combined point doubling and addition with the inputs Q = (X Q , L Q , Z Q ) and P = (x P , λ P ) is shown in (12) . It returns the point 2Q + P = (X 2Q+P , L 2Q+P , Z 2Q+P ). An operation sequence for computing (12) is shown in Algorithm 5 in the Appendix.
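The effect of the combined operation on the average cost per column can be worked out with a few lines of Python. The sketch uses only the operation counts quoted in this paper (5M + 4S for doubling, 8M + 2S for addition, 11M + 6S for the combined operation) and the 75 %/25 % split between nonzero and zero columns; the variable names are illustrative.

```python
from fractions import Fraction

# Costs as (multiplications, squarings) in the quadratic extension field.
DOUBLING = (5, 4)
ADDITION = (8, 2)
DOUBLE_AND_ADD = (11, 6)
P_NONZERO = Fraction(3, 4)   # share of nonzero columns with Shamir's trick

def expected_cost(nonzero_cost, zero_cost):
    """Average cost per column given the costs for nonzero and zero columns."""
    return tuple(P_NONZERO * nz + (1 - P_NONZERO) * z
                 for nz, z in zip(nonzero_cost, zero_cost))

# Doubling followed by a separate addition costs 13M + 6S.
SEPARATE = tuple(d + a for d, a in zip(DOUBLING, ADDITION))

combined_avg = expected_cost(DOUBLE_AND_ADD, DOUBLING)   # 9.5M + 5.5S
separate_avg = expected_cost(SEPARATE, DOUBLING)         # 11M + 5.5S
saved_mults_per_column = separate_avg[0] - combined_avg[0]
```

On average, the combined operation therefore saves 1.5 multiplications in the quadratic extension field per column of the scalar matrix.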
Quadratic Extension FSM
Operations in the quadratic extension field (multiplication, constant multiplication, squaring, addition and inversion in $\mathbb{F}_{2^{254}}$) and the endomorphism $\psi(P)$ are implemented by the quadratic operations FSM. The quadratic operations FSM drives the MALU and generates the addresses of input operands of the MALU and generates the output write address for the BRAM. Operation costs of multiplication, squaring, addition and inversion are given in Section 2.2. In order to save time, constant multiplication with the curve parameter a is optimized. As discussed in Section 2.4, a = u for this curve (i.e., $a_0 = 0$ and $a_1 = 1$). Therefore, multiplication $c = a \cdot b$ simplifies to $c = u(b_0 + b_1 u) \bmod g(u) = b_1 + (b_0 + b_1)u$. Hence, multiplication by a only requires the swapping of the two halves followed by an addition in $\mathbb{F}_{2^{127}}$. That is, the cost is only a and we save 4m + 2a compared to a general multiplication. The constant multiplication is used once in every iteration of the double scalar multiplication because it is needed once both in point doubling and combined point doubling and addition. The endomorphism ψ is performed as shown in (7) and it takes three modular additions in $\mathbb{F}_{2^{127}}$. The Itoh-Tsujii inversion algorithm is implemented for coordinate conversions. Itoh-Tsujii inversion uses an addition chain of length 9 as given in [37].
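The two cheap operations of this section can be sketched in Python on pairs (a0, a1) representing a0 + a1·u. The multiplication by a = u follows from u² = u + 1. The form of ψ used below, (x0 + x1) + x1·u and (λ0 + λ1) + (λ1 + 1)·u, is an assumption that is consistent with the stated cost of three additions; the base-field polynomial $x^{127} + x^{63} + 1$ and all names are likewise assumptions for illustration.

```python
# ASSUMPTION: base-field polynomial p(x) = x^127 + x^63 + 1 (illustrative choice).
M = 127
P_POLY = (1 << 127) | (1 << 63) | 1

def f_mul(a, b):
    """Multiply in F_(2^127): carry-less product followed by reduction mod p(x)."""
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= P_POLY << (i - M)
    return c

def ext_mul(a, b):
    """General multiplication modulo g(u) = u^2 + u + 1 (4 mult. + 3 additions)."""
    a0b0, a1b1 = f_mul(a[0], b[0]), f_mul(a[1], b[1])
    cross = f_mul(a[0], b[1]) ^ f_mul(a[1], b[0])
    return (a0b0 ^ a1b1, cross ^ a1b1)

def mul_by_u(b):
    """u * (b0 + b1*u) = b1 + (b0 + b1)*u: swap the halves, one addition."""
    b0, b1 = b
    return (b1, b0 ^ b1)

def psi(x, lam):
    """ASSUMED form of the endomorphism in affine lambda-coordinates (three xors)."""
    (x0, x1), (l0, l1) = x, lam
    return (x0 ^ x1, x1), (l0 ^ l1, l1 ^ 1)

B = (0x1234567890ABCDEF1234567890ABCDE, 0x0FEDCBA9876543210FEDCBA98765432)
X = (0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0, 0x3333333333333333333333333333333)
LAM = (0x5555555555555555555555555555555, 0x7777777777777777777777777777777)
```

Under the assumed form, applying ψ twice leaves x unchanged and adds 1 to λ, which corresponds to point negation in λ-coordinates.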
Modular Arithmetic Logic Unit (MALU)
Operations in $\mathbb{F}_{2^{127}}$ (addition, multiplication and squaring) are performed by the MALU which is imported from [39]. The MALU in [39] supports multiplication and addition in $\mathbb{F}_{2^m}$. The MALU is basically a most-significant-digit-first digit-serial modular multiplier over $\mathbb{F}_{2^m}$ with digit size d. It is adjustable for different irreducible polynomials and digit sizes, which are fixed at the time of implementation. The support for additions is added to the multiplier architecture with a very low cost by utilizing resource sharing. The latency of multiplication is $\lceil m/d \rceil + 2$ clock cycles, where m is the bit size of the operands which are elements in $\mathbb{F}_{2^m}$. The digit size of the MALU is chosen to be d = 16 for this implementation as it provides a good tradeoff between area and latency. Therefore, one multiplication in $\mathbb{F}_{2^{127}}$ takes $\lceil 127/16 \rceil + 2 = 10$ clock cycles. Addition takes 3 clock cycles. The MALU in [39] does not include a dedicated squaring circuitry and, therefore, squaring takes the same time as multiplication. Squaring in $\mathbb{F}_{2^m}$ is a very simple operation and support for it can be added with a very small overhead. Because the point operations involve several squarings, adding the support for squaring provides a significant speedup. Hence, we extended the MALU from [39] with a dedicated squarer. One squaring takes only 3 clock cycles which makes it as fast as an addition. For all field operations, one clock cycle is consumed by loading the inputs and another one goes to a one-stage pipeline for the inputs of the MALU in order to shorten the critical path. The remaining cycles are the actual operation time of the MALU.
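The principle of a most-significant-digit-first digit-serial multiplier and its latency can be illustrated with a behavioural Python model. The model is a sketch under assumptions: the reduction polynomial $x^{127} + x^{63} + 1$ and all names are illustrative, one loop iteration is taken to correspond to one clock cycle, and the internal structure of the MALU in [39] is not modelled.

```python
# Behavioural model of an MSD-first digit-serial multiplier over F_(2^127).
# ASSUMPTION: p(x) = x^127 + x^63 + 1 (illustrative choice of reduction polynomial).
M = 127
P_POLY = (1 << 127) | (1 << 63) | 1

def clmul(a, b):
    """Carry-less (polynomial) multiplication over F_2."""
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    return c

def reduce(c):
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= P_POLY << (i - M)
    return c

def digit_serial_mul(a, b, d=16):
    """Process b in d-bit digits from the most significant digit downwards."""
    n_digits = -(-M // d)                      # ceil(m / d) iterations
    acc = 0
    for i in range(n_digits - 1, -1, -1):
        digit = (b >> (i * d)) & ((1 << d) - 1)
        acc = reduce((acc << d) ^ clmul(a, digit))
    latency = n_digits + 2                     # plus input load and pipeline stage
    return acc, latency

A = 0x1234567890ABCDEF1234567890ABCDE
B = 0x7FEDCBA9876543210FEDCBA987654321
PRODUCT, LATENCY = digit_serial_mul(A, B)
```

With d = 16, the loop runs eight times, which together with the two overhead cycles gives the latency of 10 clock cycles stated above.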
Scalar Decomposition
The scalar decomposition is computed in the scalar-decomposition module which contains an 8-bit ALU that performs integer addition, subtraction and multiplication operations. An FSM splits operations with large operands into 8-bit operations and another FSM controls the execution of the decomposition equations given in (8) and (9). Scalar splitting is performed as described in [13] . After rearranging the equations, maximum operand sizes reduce to 192 bits for addition and subtraction and 128 bits for multiplication. These large operands are processed in 8-bit pieces by the ALU. The integer multiplication is implemented using Algorithm 5.1 given in [38] . Since the decomposition is executed in the beginning, it uses free space in the BRAM. The integer multiplication uses one of the dedicated multipliers in the FPGA (one DSP block). The architecture of the scalar-decomposition module is shown in Figure 3 .
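Processing large integers in 8-bit pieces can be illustrated with a generic schoolbook (operand-scanning) multiplication in Python. This is a sketch only: it is not claimed to be Algorithm 5.1 of [38], and the function names and test values are hypothetical.

```python
WORD_BITS = 8
WORD_MASK = (1 << WORD_BITS) - 1

def to_words(x, n_words):
    """Split an integer into n_words 8-bit words, least significant first."""
    return [(x >> (WORD_BITS * i)) & WORD_MASK for i in range(n_words)]

def from_words(words):
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

def mp_mul(a_words, b_words):
    """Schoolbook multiplication that uses only 8-bit x 8-bit partial products."""
    result = [0] * (len(a_words) + len(b_words))
    for i, a_i in enumerate(a_words):
        carry = 0
        for j, b_j in enumerate(b_words):
            t = result[i + j] + a_i * b_j + carry    # always fits in 16 bits
            result[i + j] = t & WORD_MASK
            carry = t >> WORD_BITS
        result[i + len(b_words)] = carry
    return result

# Two 128-bit operands, the maximum multiplication size mentioned above.
X = 0xFEDCBA9876543210FEDCBA9876543210
Y = 0x0123456789ABCDEF0123456789ABCDEF
PRODUCT = from_words(mp_mul(to_words(X, 16), to_words(Y, 16)))
```

A 128-bit by 128-bit product takes 16 × 16 = 256 word multiplications in this scheme, which is acceptable here because the decomposition is executed only once per scalar multiplication.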
Optimizations
Point operations are optimized for minimum execution time and the temporary variable count is optimized based on this minimum timing constraint. Since the most complex equations stand for the combined point doubling and addition, it has the largest temporary variable count, which is 7 in our case. Temporary variables are stored in the BRAM excluding one of them (R3) which is in the register file. The base point P , ψ(P ), P + ψ(P ), and curve constants are also stored in the BRAM. When the MALU is processing, one of the operands is stored in the register file. Also the output from the MALU is stored into the registers (two 127-bit registers) so that it can be used for the next operation, which is usually the case. Another 127-bit register is used for storing temporary results of multiplication and inversion in F 2 254 . Therefore, there are in total three 127-bit registers in the register file. The latencies of point operations are given in Table 1 . The differences between estimated and actual cycle counts arise from access times to the BRAM between operations. The control logic is pipelined in order to shorten the critical path and to ensure that the critical path is in the processing units (i.e., in the MALU or the 8-bit ALU). Output multiplexers of the register file are pipelined before feeding the data into the MALU. Although this pipeline increases the execution time of every field operation by one clock cycle, the maximum frequency increases significantly. The slice count does not increase significantly because the slice based structure of FPGAs allows to use the flip-flops of a slice essentially for free if the LUT of the slice is already used. Other pipelined paths are address and control signals of the quadratic operations FSM that are controlling the register file and the BRAM. The 8-bit ALU also includes a one-stage pipeline. 
We experimented with different word sizes for the ALU and selected 8 bits because it was the optimal choice considering the critical path and area.
Results and Comparison
The architecture of the coprocessor was described in VHDL. This code was compiled with the Xilinx ISE 12.2 tool for Virtex-5 XC5VLX85-3FF676, Virtex-4 XC4VLX200-11FF1513 and Spartan-6 XC6SLX45T-3FGG484 FPGAs. Simulations for verifying the functionality of the implementations were performed with ModelSim software. Detailed area results after placement & routing are given in Table 2 for Virtex-5 XC5VLX85-3FF676. We compiled the design also for Virtex-4 and Spartan-6 FPGAs. Virtex-4 was selected in order to provide a fair comparison with previous works and Spartan-6 shows the performance of our architecture on a low-cost FPGA, which is commonly used in security-critical embedded systems.
The maximum clock frequency after synthesis is 274.982 MHz for the Virtex-5 implementation. This frequency was set as a design constraint for place & route, and the tool was able to meet this timing goal. One scalar multiplication takes on average around 65,000 clock cycles, but the exact latency depends on the scalar (its length and the number of nonzero columns in the scalar matrix). Therefore, the total time for an entire scalar multiplication is around 0.24 ms. Precomputation, scalar decomposition, and coordinate transformations are included in the given execution time.
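The dependence of the latency on the scalar can be illustrated with a small sketch. It assumes, for illustration only, that the two subscalars k1 and k2 of the decomposed scalar are written as the two rows of a binary matrix and processed column by column from the most significant bit, with one doubling per column and one addition of P, ψ(P), or P + ψ(P) for each nonzero column. The function name and the unsigned binary representation are hypothetical and are not taken from the implementation.

```python
def count_point_operations(k1: int, k2: int) -> dict:
    """Count point operations of a left-to-right joint double-and-add.

    k1 and k2 are the (assumed non-negative) subscalars; each column of the
    2-row scalar matrix is the pair (bit of k1, bit of k2).
    """
    if k1 < 0 or k2 < 0 or (k1 == 0 and k2 == 0):
        raise ValueError("expects non-negative subscalars, not both zero")

    length = max(k1.bit_length(), k2.bit_length())
    # Columns from the most significant to the least significant position.
    columns = [((k1 >> i) & 1, (k2 >> i) & 1) for i in range(length - 1, -1, -1)]
    nonzero = sum(1 for col in columns if col != (0, 0))

    return {
        "columns": length,
        "nonzero_columns": nonzero,
        # The top column only initialises the accumulator, so it costs
        # neither a doubling nor an addition.
        "doublings": length - 1,
        "additions": nonzero - 1,
    }


if __name__ == "__main__":
    print(count_point_operations(0b1011, 0b0110))
    print(count_point_operations(0b1000, 0b0001))
```

Two scalars of the same length can therefore need different numbers of additions, which is one reason why only an average cycle count can be stated.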
There are many FPGA implementations of elliptic curve scalar multiplication over binary fields. However, most of them do not utilize curves with fast endomorphisms. The most popular curve has been the NIST B-163 curve from [35]. It provides a lower security level of 80 bits, but because finite field arithmetic in F_{2^163} cannot be decomposed into arithmetic in a smaller field, the complexity of the FFPU is actually larger. However, the latency of computing (1) on curves with lower security levels is expected to be significantly shorter because fewer operations are computed. These differences should be considered when the designs are compared.
A summary of related work and our implementations is shown in Table 3. The design proposed by Sinha Roy et al. in [41] uses LD coordinates for a binary field EC implementation. They implemented a Karatsuba hybrid multiplier for F_{2^163}. The design by Ansari and Hasan in [1] uses a Montgomery multiplier and represents points in projective coordinates. The work by Liu et al. in [24] is a very fast implementation for NIST B-163. It computes a scalar multiplication in only 9 µs. However, it occupies a huge area due to the extensive use of parallelism in the processor. The implementation presented by Lutz and Hasan in [28] computes a scalar multiplication on NIST K-163, which is a Koblitz curve over F_{2^163}, and represents points in LD coordinates. A scalable design that implements all NIST Koblitz curves was proposed by Loi and Ko in [26]. In [42], various NIST curve implementations were proposed by Sutter et al. Their architecture uses three multipliers and, therefore, it gives fast results at the expense of area. To the best of our knowledge, the only other implementation that uses the GLS technique is the one described by Azarderakhsh and Karabina in [2], but they do not use λ-coordinates. Their implementation provides results on the same security level, but it targets high speed at the expense of area, which makes comparisons difficult. However, we expect that their implementation would also benefit from the use of λ-coordinates through the reduction of the number of field operations.
When our results are compared to the other works in Table 3, it must be borne in mind that our implementation offers about 128 bits of security whereas the ones for B/K-163 and B/K-233 curves offer only about 80 or 112 bits of security, respectively. The security level has significant effects on both the hardware complexity and the total execution time.
As can be seen from Table 3, our design is faster than some related works, although it offers more security. Also, the area consumption of our design is very low compared to other FPGA implementations available in the literature. Hence, our design is suitable for embedded systems that require fast scalar multiplications with a small footprint. In such applications, low-cost FPGAs are more feasible implementation platforms than high-end FPGAs such as the Virtex family FPGAs. Hence, we also give implementation results for a low-cost FPGA from the Spartan-6 family in Table 3. Our results show that the new technique of using λ-coordinates with the GLS endomorphism introduced in [37] results in a fast and compact ECC coprocessor that can be used in various security-critical embedded systems. Note that a slice consists of several LUTs and FFs. In Table 3, we give the number of slices as well as the numbers of LUTs and FFs so that our results can be compared with implementations that report only one of the two.
Conclusion
We presented a fast and compact FPGA implementation of ECC with a 128-bit security level. Our implementation uses many of the optimizations used in the software implementation presented in [37] by Oliveira et al. To the best of our knowledge, it is the first FPGA implementation that utilizes λ-coordinates and the GLS decomposition. We demonstrated that these techniques offer significant advantages not only in software implementations but also in hardware implementations. An especially significant advantage in the case of hardware is that finite field arithmetic is decomposed so that it can be performed in a small field F_{2^m}, where m is about the size of the security level. This results in small areas and high maximum clock frequencies. In the case of many other curves (e.g., the NIST curves), field arithmetic is performed in fields with sizes of twice the security level. We also showed that the GLS decompositions can be computed efficiently in hardware, although extra hardware is required for the integer operations that are not natively supported by field arithmetic units (such as the MALU that we used in this paper). This extra cost is small compared to the benefits that can be obtained from the faster scalar multiplication.
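The decomposition of the field arithmetic can be sketched in software as follows. The sketch assumes that F_{2^254} is built as a quadratic extension F_{2^127}[u]/(u^2 + u + 1) and that F_{2^127} is defined by the trinomial x^127 + x^63 + 1; both choices, as well as all function names, are assumptions made for illustration and are not confirmed by the text above. It is a bit-level model and says nothing about how the MALU schedules these operations.

```python
M = 127
# Assumed reduction polynomial for F_{2^127}: x^127 + x^63 + 1.
POLY = (1 << 127) | (1 << 63) | 1


def clmul(a: int, b: int) -> int:
    """Carry-less multiplication of two binary polynomials held in ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_poly(c: int) -> int:
    """Reduce a binary polynomial modulo POLY, clearing bits from the top."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= POLY << (i - M)
    return c


def fmul(a: int, b: int) -> int:
    """Multiplication in the small field F_{2^127}."""
    return reduce_poly(clmul(a, b))


def ext_mul(a: tuple, b: tuple) -> tuple:
    """Multiply a0 + a1*u by b0 + b1*u with u^2 = u + 1 (assumed).

    Uses three small-field multiplications (Karatsuba) instead of four.
    """
    a0, a1 = a
    b0, b1 = b
    p0 = fmul(a0, b0)
    p1 = fmul(a1, b1)
    p2 = fmul(a0 ^ a1, b0 ^ b1)
    # c0 = a0*b0 + a1*b1, c1 = a0*b1 + a1*b0 + a1*b1 = p2 + p0
    return (p0 ^ p1, p2 ^ p0)


if __name__ == "__main__":
    u = (0, 1)
    print(ext_mul(u, u))  # u^2 expressed in the basis (1, u)
```

Every operation in `ext_mul` works on 127-bit values only, which is the property that keeps the datapath narrow in hardware.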
The target applications for our implementation are various security-critical embedded systems that require high-security public-key cryptography. Hence, the primary concerns are a small area and a good speed-area ratio, and not necessarily the lowest computation time or the highest throughput (scalar multiplications per second). In fact, the highly optimized software implementation reported in [37] requires only 47,900 clock cycles per scalar multiplication on Intel Sandy Bridge processors, which run at significantly higher clock frequencies (e.g., 3.4 GHz) than our implementation (274 MHz), and it therefore outperforms our FPGA implementation. However, this kind of performance is completely out of the reach of embedded software implementations. For example, a single scalar multiplication requires some tens of milliseconds on a 32-bit processor [12] and more than 0.5 s on an 8-bit microcontroller [23, 25]. Thus, our implementation provides major performance advantages compared to software implementations typically used in security-critical embedded systems.
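The comparison above mixes cycle counts and clock frequencies, so a short conversion to wall-clock time may help. The sketch below uses the 47,900 cycles at 3.4 GHz quoted above and, for the coprocessor, the figures of 65,000 cycles at 274 MHz from the abstract; the helper name is hypothetical.

```python
def latency_us(cycles: float, clock_hz: float) -> float:
    """Convert a cycle count at a given clock frequency to microseconds."""
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / clock_hz * 1e6


# Software implementation from [37] on a 3.4 GHz Sandy Bridge processor.
software_us = latency_us(47_900, 3.4e9)
# FPGA coprocessor, using the cycle count and frequency from the abstract.
fpga_us = latency_us(65_000, 274e6)

if __name__ == "__main__":
    print(f"software: {software_us:.1f} us")
    print(f"FPGA:     {fpga_us:.1f} us")
    print(f"ratio:    {fpga_us / software_us:.1f}x")
```

Under these figures the desktop processor needs about 14 µs and the coprocessor about 0.24 ms, both of which are far below the tens of milliseconds quoted for embedded 32-bit software.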
We foresee several directions for future research on this topic. A careful parameter space exploration for our architecture would allow finding the parameters (e.g., the digit size of the MALU) that give the optimum speed-area ratio. Countermeasures against side-channel attacks need to be implemented before our implementation can be used in applications where side-channel attacks are a threat. In particular, timing and operation patterns need to be constant in order to thwart timing attacks and simple power analysis or electromagnetic attacks. Further speedups can be obtained by using precomputations (e.g., w-NAF) and by using the two-core parallelization technique based on double-and-add and halve-and-add from [37]. A further increase in throughput can be achieved by instantiating multiple cores in a single FPGA. For instance, we extrapolated that about 25 parallel cores would fit into the largest Virtex-5 FPGA (filled up to 80 %). Hence, our implementation can be valuable also for accelerating cryptographic computations.
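As background for the precomputation-based speedup mentioned above, the following sketch shows the textbook width-w non-adjacent form (w-NAF) recoding of an integer. It is not part of the presented coprocessor, the function name is hypothetical, and the loop is not constant-time, so it would not by itself satisfy the side-channel requirements discussed above.

```python
def wnaf(k: int, w: int) -> list:
    """Return the width-w NAF digits of k, least significant digit first.

    Each nonzero digit is odd with absolute value below 2^(w-1), and any
    w consecutive digits contain at most one nonzero digit.
    """
    if k < 0 or w < 2:
        raise ValueError("expects k >= 0 and w >= 2")
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)
            if d >= (1 << (w - 1)):
                d -= (1 << w)  # map to the signed residue
            k -= d             # k is now divisible by 2^w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


if __name__ == "__main__":
    ds = wnaf(7, 2)
    print(ds, sum(d << i for i, d in enumerate(ds)))
```

With such a recoding, only the odd multiples of the base point up to 2^(w-1) - 1 need to be precomputed, because negating a point is cheap.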
