Abstract-Elliptic Curve Cryptography (ECC) is considered the best candidate for Public-Key Cryptosystems (PKC) for ubiquitous security. Recently, ECC based on Binary Edwards Curves (BEC) has been proposed, and it shows several interesting properties, e.g., completeness and security against certain exceptional-point attacks. In this paper, we propose a hardware implementation of BEC for extremely constrained devices. First, w-coordinates and the Montgomery powering ladder are used. Second, we give techniques to reduce the size of the register file, which is the largest component of the embedded core. Third, we apply gated clocking to reduce the overall power consumption. The implementation has a size of 13,427 Gate Equivalents (GE), and 149.5 ms are required for one point multiplication. To the best of our knowledge, this is the first hardware implementation of binary Edwards curves.
I. INTRODUCTION
Public-Key Cryptography (PKC), introduced in 1976 by Diffie and Hellman [1], is now widely used for key establishment, digital signatures and data encryption. The best-known and most commonly used public-key cryptosystems are RSA [2] and Elliptic Curve Cryptography (ECC) [3], [4]. The main benefit of ECC is that it offers security equivalent to RSA for much smaller parameter sizes. This results in smaller data-paths, less memory usage and lower power consumption. ECC is widely considered the best candidate for constrained devices such as RFID-tags.
Integrating a Public-Key Cryptosystem into a constrained device such as an RFID-tag is a challenge due to the limitations in cost, area and power. A passive RFID-tag has no battery, thus the processing power of these devices is limited. On the other hand, security is required, in particular to prevent cloning or tracing [5], [6]. It was widely believed that devices with such constrained resources cannot carry out strong cryptographic operations such as Elliptic Curve Scalar Multiplication (ECSM). However, the feasibility of integrating PKCs into such devices has recently been demonstrated by several implementations [7], [8], [9], [10].
Standard formulas for adding two points, say P and Q, on a Weierstrass-form elliptic curve fail if P is at infinity, or if Q is at infinity, or if P+Q is at infinity [11]. Attacks that exploit this feature have been proposed [12]. Binary Edwards curves provide a different equation to define an elliptic curve, one which no longer has points at infinity [11]. This feature is known as completeness. In addition to completeness, new explicit addition formulas for Edwards curves make point addition slightly faster than known methods [13], [14].
In this paper, we present a hardware implementation of binary Edwards curves. We focus on different techniques to optimize the implementation to make it suitable for constrained devices. The optimization mainly aims to reduce the area consumption and the processing time. By fixing the curve parameters ($d_1$ and $d_2$) and using a common-Z coordinate system [15], we reduce the number of registers from 9 to 6. Moreover, by using a squarer module we reduce the number of cycles required for a squaring operation to 1. As a result, the overall BEC core requires from 11.72k to 14.53k GE, and the total processing time ranges from 548 ms down to 107 ms at 400 kHz for six different configurations, i.e., different digit sizes for the modular multiplier.
The rest of the paper is organized as follows. In Section II, we briefly recap previous implementations of ECC and introduce elliptic and binary Edwards curves. Section III summarizes several techniques to optimize the algorithm. The architecture of BEC and the main parts of the design are investigated in Section IV. Section V gives synthesis results and comparisons. Section VI provides a conclusion.
II. BACKGROUND

A. Previous Implementations
In 2006, Kumar and Paar [8] proposed an affine-coordinate ASIC implementation of an ECC processor, using a modified Montgomery point multiplication method for binary elliptic curves. An area consumption of 10k to 18k GE is obtained in a 0.35 μm CMOS process. The ECC processor is composed of a bit-serial multiplier, a squarer unit, an adder and a memory consisting of 6 random access registers. It supports binary curves from $GF(2^{113})$ to $GF(2^{193})$, and one scalar multiplication takes 31.8 ms running at 13.56 MHz over $GF(2^{163})$. In the same year, Sakiyama et al. [16] investigated the impact of the digit size on power consumption and area usage. Generally, bit-serial multipliers are more power efficient, whereas digit-serial multipliers are more energy efficient.
In 2007, Lee et al. [15] presented an optimized architecture based on the so-called Modular Arithmetic Logic Unit. It uses a common Z-coordinate system for representing EC points to minimize memory requirements. Storage of intermediate values is usually the main contributor to area (roughly 66%). Area has a direct impact on the production cost of an integrated circuit. This work was further improved in [7].
In 2009, Hein et al. [9] presented an implementation of modular multiplications using 16-bit multipliers. Bit-serial multipliers clock on average 2 × 163 = 326 registers per clock cycle. Their digit-level multiplier clocks roughly 4 × 16 = 64 registers. So, the proposed approach gives a better power, but not energy, characteristic. Additionally, the core component of the datapath is a 16 × 16 multiplier. The data width of the RAM circuit is adjusted to fit the 16-bit datapath. The RAM is organized as a 16-bit wide memory with 64 entries. The circuit features small silicon size (15k GE) and has low power consumption (8.57 μW). It computes 163-bit ECC point multiplications in 296k cycles and has an ISO 18000-3 RFID interface.
B. Elliptic Curve's Algorithm
Let K be a finite field. An elliptic curve over K is defined by the Weierstrass equation:

$$E: y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6 \qquad (1)$$

where $a_1, a_2, a_3, a_4, a_6 \in K$ and the discriminant $\Delta \neq 0$ [17]. The K-rational points P(x, y) on the curve E, together with the point at infinity O(∞, ∞), form an abelian group E(K). When Char(K) = 2, E can be transformed into the following form without loss of generality.

$$E: y^2 + xy = x^3 + a_2 x^2 + a_6 \qquad (2)$$
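Given two points $P_1(x_1, y_1)$ and $P_2(x_2, y_2)$, one can compute $P_3(x_3, y_3) = P_1 + P_2$ as follows:

$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a_2, \qquad y_3 = \lambda (x_1 + x_3) + x_3 + y_1,$$

where $\lambda = \frac{y_1 + y_2}{x_1 + x_2}$ if $P_1 \neq P_2$, and $\lambda = x_1 + \frac{y_1}{x_1}$ otherwise. For each point addition or doubling, one field inversion is required. Inversion is quite a costly operation (one inversion is equivalent to 8-10 multiplications [18] in $GF(2^{163})$), therefore projective coordinates were proposed. Using projective coordinates, a point P is represented as $(X, Y, Z) \in K^3$, and the affine coordinates (x, y) can be retrieved by dividing X and Y by powers of Z that depend on the coordinate system (for standard projective coordinates, $x = X/Z$ and $y = Y/Z$).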
A scalar multiplication on an elliptic curve is the computation of kP , where k is a large integer and P is a point on the curve. The following algorithm gives a classical way to compute kP .
Due to the conditional branch in steps 4-5, Algorithm 1 is considered insecure against Simple Power Analysis (SPA) and Timing Analysis (TA). The Montgomery powering ladder [19] is believed to be secure against SPA and TA. It performs a point addition and a point doubling in each step regardless of the value of $k_i$.
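Algorithm 1 Add-and-double scalar multiplication.
Input: point P, integer $k = (k_{l-1}, \ldots, k_1, k_0)_2$
Output: Q = kP
1: Q ← O
2: for i = l-1 downto 0 do
3:   Q ← 2Q
4:   if $k_i$ = 1 then
5:     Q ← Q + P
6:   end if
7: end for
8: Return Q

Algorithm 2 Montgomery ladder for scalar multiplication.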
López and Dahab [13] observed that the Y coordinate is unnecessary when using the Montgomery ladder, which makes the Montgomery ladder a preferred choice in terms of both efficiency and security.
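The difference between the two control flows can be sketched in Python. This is only an illustration under stated assumptions: the group is a stand-in (integers modulo a hypothetical `N` under addition, so that kP is trivial to check), the function names are ours, and the ladder is the textbook form of [19], not necessarily the exact step order of Algorithm 2. Python itself gives no constant-time guarantee; the sketch only shows that the ladder performs the same operations for every key bit.

```python
def double_and_add(k, P, add, dbl, zero):
    """Left-to-right binary method: the addition runs only when the key bit is 1."""
    Q = zero
    for bit in bin(k)[2:]:
        Q = dbl(Q)
        if bit == "1":              # key-dependent branch: visible to SPA/TA
            Q = add(Q, P)
    return Q


def montgomery_ladder(k, P, add, dbl):
    """Ladder for k >= 1: one addition and one doubling per key bit."""
    Q = [P, dbl(P)]                 # invariant: Q[1] - Q[0] == P
    for bit in bin(k)[3:]:          # bits after the leading 1
        b = int(bit)
        Q[1 - b] = add(Q[0], Q[1])
        Q[b] = dbl(Q[b])
    return Q[0]


# Stand-in group (assumption): integers modulo N under addition.
N = 1009
add = lambda a, b: (a + b) % N
dbl = lambda a: (2 * a) % N

if __name__ == "__main__":
    k, P = 0b1011011, 5
    print(double_and_add(k, P, add, dbl, 0), montgomery_ladder(k, P, add, dbl))
```

Because the difference Q[1] - Q[0] stays equal to P throughout the ladder, the addition is always a differential addition, which is what allows coordinate-reduced formulas such as the ones used in this paper.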
C. Binary Edwards Curves
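Let $d_1$, $d_2$ be elements of K with $d_1 \neq 0$ and such that no element $t \in K$ satisfies $t^2 + t + d_2 = 0$. Then the binary Edwards curve with coefficients $d_1$ and $d_2$ is the affine curve [11]:

$$E_{B,d_1,d_2}: d_1(x + y) + d_2(x^2 + y^2) = xy + xy(x + y) + x^2 y^2.$$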
This curve is symmetric in x and y and thus it has the property that if (x 1 , y 1 ) is a point on the curve then so is (y 1 , x 1 ). The point (0, 0) is the neutral element of the addition law, while (1, 1) has order 2.
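These properties can be checked numerically. The sketch below assumes the binary Edwards curve equation of [11], $d_1(x+y) + d_2(x^2+y^2) = xy + xy(x+y) + x^2y^2$, over $GF(2^{163})$ with the reduction polynomial $x^{163} + x^7 + x^6 + x^3 + 1$ used later in this paper. The helper names and the parameter value `D` are hypothetical, and `D` is not checked against the condition on $d_2$.

```python
M = 163
# Reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_mul(a, b):
    """Bit-serial multiplication in GF(2^163); elements are ints below 2^163."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def bec_residual(x, y, d1, d2):
    """Left side minus right side of the curve equation (zero on the curve)."""
    s = x ^ y                                        # x + y
    xy = gf_mul(x, y)
    lhs = gf_mul(d1, s) ^ gf_mul(d2, gf_mul(s, s))   # x^2 + y^2 = (x + y)^2
    rhs = xy ^ gf_mul(xy, s) ^ gf_mul(xy, xy)
    return lhs ^ rhs


def on_curve(x, y, d1, d2):
    return bec_residual(x, y, d1, d2) == 0


D = 0x2   # hypothetical example value for d1 = d2

if __name__ == "__main__":
    print(on_curve(0, 0, D, D), on_curve(1, 1, D, D))
```

Swapping x and y leaves every term of the residual unchanged, which is the symmetry stated above.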
Bernstein et al. [11] proposed to use the so-called w-coordinates together with the Montgomery ladder. Assume that $P_1(x_1, y_1)$, $P_2(x_2, y_2)$ and $P_3(x_3, y_3)$ are points on the curve $E_{B,d_1,d_2}$ (we assume $d_1 = d_2$ throughout this paper) and that $P_3 = P_1 + P_2$; then $P_2 + P_3$ and $2P_2$ can be computed using only w-coordinates. Let $P_5(x_5, y_5) = P_2 + P_3$, $P_4(x_4, y_4) = 2P_2$, and $w_i = x_i + y_i$ for $i \in \{1, 2, 3, 4, 5\}$. The explicit formulas for point addition (PA) and doubling (DA) are given below.
Affine point addition with w-coordinates:
Affine point doubling with w-coordinates:
In order to recover x and y values from w-coordinates, we can use the following formula to compute 2(x 2 , y 2 ) given x 1 ,y 1 ,w 2 ,w 3 [11] .
This formula produces $x_2^2 + x_2$; a half-trace computation then reveals either $x_2$ or $x_2 + 1$. Algorithm 3 shows the half-trace computation.
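A half-trace computation of this kind can be sketched as follows. This is the generic textbook half-trace for an odd extension degree m, $H(c) = \sum_{i=0}^{(m-1)/2} c^{2^{2i}}$, which satisfies $H(c)^2 + H(c) = c$ whenever the trace of c is zero; it is not necessarily identical to Algorithm 3. The field $GF(2^{163})$ and its reduction polynomial are taken from this paper, while the function names are ours.

```python
M = 163
# Reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_mul(a, b):
    """Bit-serial multiplication in GF(2^163); elements are ints below 2^163."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def gf_sqr(a):
    return gf_mul(a, a)


def half_trace(c):
    """Return H(c) = c + c^4 + c^16 + ... + c^(2^(M-1)) for odd M."""
    h = c
    t = c
    for _ in range((M - 1) // 2):
        t = gf_sqr(gf_sqr(t))       # t <- t^4
        h ^= t
    return h


if __name__ == "__main__":
    x2 = 0x123456789ABCDEF          # hypothetical value of x_2
    c = gf_sqr(x2) ^ x2             # the quantity x_2^2 + x_2
    h = half_trace(c)
    print(h == x2 or h == x2 ^ 1)
```

The two candidates differ only in the least significant bit, so one extra bit of information about the point is enough to select the correct root.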
III. OPTIMIZATION OF THE ALGORITHM
For an ECC processor integrated in a very constrained device such as a passive RFID-tag, the area and power consumption budget is very limited. In this case, reducing the area for temporary storage is crucial, since it normally takes more than 50% of the overall area (see [7]). Therefore, reducing the number of intermediate results is important.
Although projective coordinates save us one inversion in each iteration, one drawback is that extra registers are required to store Z coordinates. The w-coordinates with the Montgomery ladder involve three points ($P_1$, $P_2$, $P_3$). Hence, at least seven registers are required, not even including two registers for intermediate values. Using mixed w-coordinates, which represent $P_1$ with only $w_1$, can save one register. Thus, we choose the mixed w-coordinates for the implementation. Assume that $w_1$ is given as a field element, that $w_2$, $w_3$ are given as fractions $W_2/Z_2$, $W_3/Z_3$, and that $w_4$, $w_5$ are the outputs of the formulas. The explicit formulas of addition and doubling are presented in register form in Table I. The register allocation of the values is arranged using the linear scan method in [20].
A. Original Mixed w-coordinate
The addition formula uses 5M + 1S + 1D, where M, S and D denote field multiplication, field squaring and multiplication by the curve parameter $d_1$, respectively. The doubling formula uses 1M + 3S + 1D. These formulas can share the computation of $T_1$ (steps 3-4) to reduce the total cost of differential addition and doubling to 5M + 4S + 2D. In order to perform these formulas, seven registers are required to store ($w_1$, $W_2$, $Z_2$, $W_3$, $Z_3$, $T_1$, $T_2$). These formulas can be transformed to a common Z-coordinate system at the cost of three extra field multiplications.

Table I: Mixed w-coordinate algorithms (addition algorithm and doubling algorithm).
B. Mixed w-coordinate Using Common Z
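The common Z-coordinate was introduced by Lee et al. [15] to save registers. After every iteration of the Montgomery ladder, we ensure that $Z_2 = Z_3$. This can be done with three extra field multiplications as given below.

$$W_2 \leftarrow W_2 \cdot Z_3, \qquad W_3 \leftarrow W_3 \cdot Z_2, \qquad Z \leftarrow Z_2 \cdot Z_3.$$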
In the first step of the Montgomery ladder (Q[1] = 2P ), this condition is satisfied with Z = Z 3 and W 2 = W 2 · Z 3 , since the algorithm starts from Z 2 = Z 3 = 1. We apply this method to the mixed w-coordinates of BEC. The formulas of differential addition and doubling shown in Table I are transformed to the sequences in Table II .
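The common-Z conversion can be illustrated with a short sketch. It assumes the usual cross-multiplication ($W_2 \cdot Z_3$, $W_3 \cdot Z_2$, $Z_2 \cdot Z_3$) over $GF(2^{163})$ with the reduction polynomial used in this paper; the function names are ours, and the inversion is only there to check that the represented values $w_2$ and $w_3$ do not change.

```python
M = 163
# Reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_mul(a, b):
    """Bit-serial multiplication in GF(2^163); elements are ints below 2^163."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def gf_inv(a):
    """Inverse by Fermat: a^(2^M - 2) is the product of a^(2^i), i = 1..M-1."""
    r, t = 1, a
    for _ in range(M - 1):
        t = gf_mul(t, t)
        r = gf_mul(r, t)
    return r


def to_common_z(W2, Z2, W3, Z3):
    """Bring W2/Z2 and W3/Z3 to one denominator with three multiplications."""
    return gf_mul(W2, Z3), gf_mul(W3, Z2), gf_mul(Z2, Z3)


if __name__ == "__main__":
    W2, Z2, W3, Z3 = 0x1234, 0x5, 0xABCD, 0x77   # hypothetical values
    W2c, W3c, Z = to_common_z(W2, Z2, W3, Z3)
    print(gf_mul(W2c, gf_inv(Z)) == gf_mul(W2, gf_inv(Z2)))
```

Only one Z value has to be stored afterwards, which is where the register saving comes from.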
In this situation, the formula uses 7M + 3S + 1D, which is less than the (5M + 4S + 2D) + 3M = 8M + 4S + 2D of a direct conversion. Since $Z_2 = Z_3$ after each iteration, step 6 of Table I becomes a squaring, and steps 13-15 are omitted since they are the same as steps 6-8. As a result, we trade (1M + 2S + 1D) for 3M + 1S, and the complexity of one iteration is 7M + 3S + 1D. Due to the use of the common Z-coordinate, the proposed sequence requires only 6 registers ($w_1$, $W_2$, $W_3$, $Z$, $T_1$, $T_2$), or 7 registers if $d_1$ is not fixed. Furthermore, the mapping of the variables in Table II onto these 6 registers is given in Table IV in the following section.
C. Complexity
We compare the complexity of point addition and doubling using different coordinates and different elliptic curves. Table III shows the number of field operations for different combinations. Here we assume that the Montgomery ladder is in use (as it is in [7], [9]). Since $d_1$ can be chosen small [11], D can be very efficient. In this case, the cost of one Montgomery step on BEC is lower than on ordinary curves, or almost the same when using the common-Z trick.
Besides the coordinate system, the selection of curve parameters also has an impact on the complexity. For example, by selecting $a_6 = 1$ in (2), one multiplication is omitted [7]. For extremely constrained devices, fixing one or two curve parameters helps to reduce the area and improve performance. For example, the implementation in [9] chose to implement the B-163 curve [21]. In this paper we choose $d_1 = d_2$, and $d_1$ is also fixed.
IV. IMPLEMENTATION OF A BINARY EDWARDS CURVE
As RFID systems become ubiquitous and are used in more applications, their specification must be standardized. For example, ISO 18000-3 requires a 13.56 MHz clock frequency and a power consumption of less than 15 μW at 1.5 V to guarantee a 1 m operating range [22]. Moreover, [7] states that the clock frequency is chosen to finish protocols within 250 ms and is a factor of 13.56 MHz. Taking these limitations into account, the goals for an RFID system are low area, low power consumption and a short processing time. In this case, the number of registers is the most important issue, since the registers occupy more than 50% of the total area. Besides the common-Z trick, we also optimize the register file for limited access. Clock gating is applied to reduce the power consumption.
A. Modular Arithmetic Logic Unit
In order to perform addition, multiplication and squaring operations in Table III , we design a compact Modular Arithmetic Logic Unit (MALU) [16] as illustrated in Figure  1 . In the MALU architecture, operations are performed over finite fields as shown through Equation (8) .
where
and $P(x) = x^{163} + x^7 + x^6 + x^3 + 1$. A field multiplication takes $\lceil 163/d \rceil$ clock cycles, with d the digit size, and an addition takes one clock cycle. In every cell, multiplication and addition can share the same XOR array. Therefore, the MALU can be easily scaled to different digit sizes by using cells in series. Furthermore, we design a squarer to perform squaring operations in one clock cycle, independent of the digit size. The squaring module is also added into the MALU with one extra control bit.
The MALU block does not contain any internal registers. To implement Equation (8) 
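The relation between digit size and cycle count can be modelled in software. The sketch below is a behavioural model only, not the MALU netlist: it assumes a most-significant-digit-first schedule in which each simulated clock cycle consumes d bits of one operand, giving $\lceil 163/d \rceil$ cycles per multiplication, and it models the squarer as a single combinational step. All function names and the digit sizes tried are hypothetical.

```python
M = 163
# Reduction polynomial P(x) = x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_reduce(v):
    """Reduce a binary polynomial (stored as an int) modulo P(x)."""
    for i in range(v.bit_length() - 1, M - 1, -1):
        if (v >> i) & 1:
            v ^= POLY << (i - M)
    return v


def clmul(a, b):
    """Carry-less (polynomial) product over GF(2), without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def digit_serial_mul(a, b, d):
    """Return (a*b mod P, cycles); each loop pass models one clock cycle."""
    cycles = -(-M // d)                       # ceil(163 / d)
    acc = 0
    for i in reversed(range(cycles)):
        digit = (b >> (i * d)) & ((1 << d) - 1)
        acc = gf_reduce((acc << d) ^ clmul(a, digit))
    return acc, cycles


def squarer(a):
    """Squaring modelled as one combinational step (one cycle)."""
    return gf_reduce(clmul(a, a))


if __name__ == "__main__":
    a, b = 0x1F2E3D4C5B6A7988, 0x0123456789ABCDEF   # hypothetical operands
    for d in (1, 2, 4, 8):                           # hypothetical digit sizes
        product, cycles = digit_serial_mul(a, b, d)
        print(d, cycles, product == gf_reduce(clmul(a, b)))
```

In this model a larger digit size shortens the multiplication at the cost of a wider partial-product step per cycle, which mirrors the area versus latency trade-off explored in the synthesis results.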
B. Register File Architecture
The register file architecture is shown in Figure 2. This architecture was originally proposed in [7] and has been shown to be an effective way to reduce the area of register files. We modify it such that it fits binary Edwards curves. Six registers form one large circular shift register file, which has a lower complexity than a random access register file. Each register is independently controlled for efficient management. RegA, RegB and RegC are used by the MALU, and RegD is used as the input register. RegD has an 8-bit I/O through which data can be loaded and stored. Loading RegD with a new value can occur in parallel with multiplications. Furthermore, RegB is a circular shift register which can be shifted by d bits (the digit size of the data path) to the left. Two extra connections are added to avoid extra shifting in the circular register file. Clock gating is applied to the circular shift register to reduce area and power consumption. An enable signal is created for all registers to control whether to update with new values or to keep previous values. As a result, the number of multiplexers needed for the register file is reduced. RegE and RegF no longer require multiplexers.
C. Architecture of BEC Processor
The main architecture of the binary Edwards curve processor is shown in Figure 3. It consists of a processor, a small interface, a RAM, a ROM and a front-end module. The ROM stores the coordinates of the first point ($x_1$, $y_1$) and a key value (3 × 21 bytes). Internal values and result points ($W_2$, $W_3$, $Z$) of the differential addition and doubling are kept in the register file. At the end of the EC scalar multiplication, the final point is recovered by Algorithm 3 and sent to the RAM. The interface provides the connection from/to memory according to the address bits. The processor part of the design consists of a control block, a register file and a modular arithmetic logic unit (MALU). The control block contains the finite state machine that manages the modules according to the required addition, squaring and multiplication operations.
D. Low Power ASIC Design with Clock Gating
Clock gating is one of the power-saving techniques used in many synchronous circuits. It adds logic that generates a separate gated clock for each register. Thus, flip-flops do not change state in disabled parts of the design. Their switching power consumption goes to zero, and only leakage currents are incurred. After clock gating is applied to the register file, the area and power consumption are reduced by 8% and 20%, respectively, compared to the normally clocked version, due to the removal of the multiplexers. The functionality of the clock-gated version is verified with ModelSim SE. Moreover, the clock-gated version is implemented on a Spartan-3E FPGA board, with some transformations, to verify the functionality.

Table IV: Mapping of the point multiplication (see Table II for the operation of each step).
E. Verifying Design on an FPGA
Global free-running clocks in an FPGA are distributed using dedicated interconnects specifically designed to connect and supply clock inputs to various resources in the FPGA. These clock networks are optimally designed to have low skew, low power, and improved jitter tolerance. Gating a global clock signal with a logic circuit forces the gated clock signal to traverse the much slower routing network, thus introducing significant skew [23]. This can lead to hold errors. Therefore, we manually transform the clock-gated registers while maintaining identical functionality. This is achieved by re-declaring every gated flip-flop with two nodes, a free-running clock node and an enable node. That is, flip-flops that use the gated clocks are changed to enabled flip-flops. Such flip-flops are then clocked by the free-running clock and enabled by the gate signal. Consequently, the functionality of a low-power ASIC design with clock gating is successfully tested on an FPGA.

V. SYNTHESIS RESULTS AND COMPARISONS

The implementation of binary Edwards curves is written in the GEZEL [24] hardware description language, and it is optimized for fast computation and low area consumption. The implementation is synthesized for an ASIC design using a low-leakage UMC 0.13 μm library and the Synopsys Design Compiler tools [25]. Table V shows the area, power and delay estimations of the proposed design using different digit sizes.
A. Area, Power and Energy Estimations
In order to find the best trade-off on digit size, we evaluated six different digit sizes. For example, the implementation with d = 4 uses 13,427 GE and finishes one EC scalar multiplication in 59,800 clock cycles. The power consumption is estimated using both Design Vision and ModelSim SE, which provide average power consumption by including the switching activity of the gates. The total dynamic power is around 12 μW at 400 kHz. Since one point multiplication takes 149.5 ms, the energy consumption is 1.79 μJ. According to the synthesis results, the power consumption is low enough to implement the proposed design in a passive RFID-tag.

Table notes: energy is given for one EC scalar multiplication; † extra memory is required.

Table VI shows the comparison of this work with related work in $GF(2^{163})$. Our strategy is similar to that of [7], which uses a digit-serial multiplier and a cyclic register structure, but no squarer modules. Although the area consumption of [7] is smaller than that of our design, its power consumption is still larger at 400 kHz. Moreover, the energy cost of one EC scalar multiplication is halved in our design.
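The latency and energy figures quoted for d = 4 follow directly from the cycle count, the clock frequency and the average power. A minimal check, using only the numbers stated in this section (the variable names are ours):

```python
cycles = 59_800          # clock cycles for one EC scalar multiplication (d = 4)
f_clk_hz = 400e3         # clock frequency: 400 kHz
power_w = 12e-6          # total dynamic power: about 12 uW

latency_s = cycles / f_clk_hz        # time for one point multiplication
energy_j = power_w * latency_s       # energy for one point multiplication

if __name__ == "__main__":
    print(f"latency = {latency_s * 1e3:.1f} ms, energy = {energy_j * 1e6:.2f} uJ")
```

This reproduces the 149.5 ms latency and an energy of roughly 1.79 μJ. The same two lines can be reused for the other digit sizes once their cycle counts and power estimates are read from Table V.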
B. Comparison with Previous Implementations
The design of [8] requires 15,094 GE, which is larger than our proposal, and needs almost seven times as many cycles. [8] uses affine coordinates rather than projective or mixed coordinates, so each iteration takes considerably more time than in our implementation due to inversions. The design in [9] has the lowest power consumption among the related works, but is much slower than our design.
The synthesis results in Table VI do not include ROM or RAM, therefore the area and power consumption of the memories should be added. Table VII compares the number of memory accesses and the memory requirements for one EC scalar multiplication. Every design must have at least three ROM blocks of 21 bytes each to store the main point ($x_1$, $y_1$) and a key value. Moreover, the number of RAM blocks is chosen depending on the representation of the final point. By storing the base point in shared memory, one 163-bit register can be saved (Type-1 in [7]). However, the ECC core then needs to load 21 bytes in each iteration. This loading dramatically increases the number of memory accesses. Our design only reads the EC base point ($x_1$, $y_1$) once at the beginning to calculate $w_1$. For one EC scalar multiplication, the RAM and ROM blocks are accessed 63 and 84 times, respectively. The memory requirements and the number of memory accesses are not indicated in [8] and [9], thus we list the minimum requirements of ordinary ECC curves in Table VII. Based on these comparisons, this work requires only a limited number of memory units and accesses.
VI. CONCLUSION
In this paper, the first hardware implementation of a binary Edwards curve is presented. We propose a compact architecture for a binary Edwards curve. We suggest the use of mixed w-coordinates with the common Z-coordinate, which reduces the size of the register file. According to the synthesis results, the implementation of a binary Edwards curve can provide the same level of efficiency as previous ECC designs.

Table VII notes: minimum memory requirements for ordinary ECC curves; † Type-1 in the original paper.
