Abstract. Processors based on Elliptic Curve Cryptography (ECC) have gained considerable attention in embedded-system design because they can be implemented efficiently. In this paper, we present a low-resource processor that supports ECC operations in less than 9 kGEs. We base our design on an optimized 16-bit microcontroller that provides high flexibility and scalability for various applications. The design allows the use of an optimized RAM-macro block and reduces complexity by sharing various resources between the controller and the datapath. Our results improve the state of the art in low-resource $\mathbb{F}_{2^{163}}$ ECC implementations (14 % less area than the best solution reported). The total size of the processor is 8,958 GEs in a 0.13 µm CMOS technology, and a point multiplication needs 285 kcycles. With a power consumption of only 3.2 µW at 100 kHz, the proposed solution is well suited for low-power designs.
Introduction
With the rapid development of more powerful and energy-saving devices, we unwittingly move towards the vision of the Internet of things. The security services required within this vision can be achieved, in particular, using Elliptic Curve Cryptography (ECC). This paper focuses on a low-resource hardware processor that provides ECC capabilities while meeting the low-area and low-power requirements of embedded systems.
There exist many proposals for low-resource ECC processors. Most of the processors operate on binary-field elliptic curves and use full-precision arithmetic to increase the performance of point multiplication [3, 12, 24, 35]. One of the most efficient solutions in terms of low-resource requirements has been reported by Lee et al. [25]. They presented a processor supporting a small elliptic curve over $\mathbb{F}_{2^{163}}$ which makes use of a tiny 8-bit microcontroller to handle higher-level protocol implementations. The ECC operation k · P is performed by a separate Modular Arithmetic Logic Unit (MALU). The processor needs 12,506 GEs and 276 kcycles to perform a point multiplication. However, the area estimates do not include the program ROM or the RAM needed to store intermediate results and the secret scalar k. Similar datapath architectures have been reported by Batina et al. [1] and Sakiyama et al. [32]. Hein et al. [16] reported a very efficient co-processor (without microcontroller) for the same elliptic curve supporting multi-precision arithmetic. They applied a control engine based on a finite-state machine and need 11,904 GEs including a standard-cell based RAM.
In this paper, we present a low-resource hardware processor that is based on a 16-bit multi-precision architecture and an area-optimized custom microcontroller. This combination allows several optimizations. First, it allows the use of an efficient RAM-macro block that significantly reduces the area requirements for short-term memory. Second, since both the microcontroller and the datapath use a 16-bit architecture, all resources are shared to minimize the area footprint of the processor. As an outcome, we present a complete solution including memory for short-term (RAM) as well as long-term storage (program ROM), controller, and datapath using a polynomial multiply-accumulate (MAC) unit. In addition, we present results of higher-level protocol implementations of the Elliptic Curve Digital Signature Algorithm (ECDSA) [29] and give results for digital signature generation as well as verification. Our NIST B-163 based processor needs only 8,958 GEs in total and performs a point multiplication within 285 kcycles. With a power consumption of only 3.2 µW at 100 kHz, the proposed solution is also well suited for low-resource embedded systems.
The rest of the article is structured as follows. In Section 2, a brief introduction into elliptic curve cryptography is given. In Section 3, we face the challenge of low-resource ECC hardware implementations and explore various design possibilities. We evaluate appropriate word sizes of a processor and analyze different memory types. Section 4 presents details about the hardware architecture of our processor. Details about the implementation are given in Section 5. In Section 6, the results are presented. Conclusions are drawn in Section 7.
Elliptic Curve Cryptography
Within Elliptic Curve Cryptography (ECC), not a single number or polynomial is used, but a pair of them. Each pair (x, y) that satisfies the general Weierstrass equation

$$y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6 \qquad (1)$$

is called a point on an elliptic curve. When a certain type of number is used, in our case binary polynomials within $GF(2^m)$, the Weierstrass equation can be reduced to

$$y^2 + xy = x^3 + a x^2 + b. \qquad (2)$$
One of the most critical operations in terms of speed and security is the ECC point multiplication. The implementation of this multiplication has to be secure against various implementation attacks such as side-channel and fault-analysis attacks. The Montgomery ladder [27, 20] provides very beneficial properties in this context. We therefore decided to use it for our design and applied the very fast group-operation formulas of López and Dahab [26]. The formulas are based on projective coordinates (which avoid expensive field inversions) and can be combined well with proposed countermeasures (see also the work of Junfeng Fan et al. [10]) such as randomized projective coordinates (RPC) [6] or point-validity checks [7].
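The regular structure of the Montgomery ladder can be sketched in a few lines. The following is a minimal illustration only: it uses the integers modulo a small prime under addition as a stand-in group instead of the López-Dahab formulas, and the names `montgomery_ladder`, `add`, `double`, and `N_MOD` are hypothetical. It shows that every scalar bit triggers exactly one group addition and one doubling, independent of the bit value; the Python branch itself is not constant-time.

```python
def montgomery_ladder(k, P, add, double, zero):
    """Compute k*P with one addition and one doubling per scalar bit."""
    r0, r1 = zero, P  # invariant: r1 - r0 == P
    for i in reversed(range(k.bit_length())):
        if (k >> i) & 1:
            r0, r1 = add(r0, r1), double(r1)
        else:
            r0, r1 = double(r0), add(r0, r1)
    return r0


# Stand-in group (assumption, for illustration): integers mod a small prime.
N_MOD = 1009


def add(x, y):
    return (x + y) % N_MOD


def double(x):
    return (2 * x) % N_MOD


print(montgomery_ladder(163, 5, add, double, 0))  # 163 * 5 mod 1009 = 815
```

In an actual ECC setting, `add` and `double` would be replaced by the differential addition and doubling formulas on projective x-coordinates.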
We use the following notation throughout the paper (similar to [15]). Let $f(z) = z^m + r(z)$ denote an irreducible binary polynomial of degree m. The elements of $\mathbb{F}_{2^m}$ are binary polynomials of degree at most m − 1. Addition of field elements is the usual addition of binary polynomials. Multiplication is performed modulo f(z). A field element $a(z) = a_{m-1} z^{m-1} + \cdots + a_2 z^2 + a_1 z + a_0$ is associated with the binary vector $a = (a_{m-1}, \ldots, a_2, a_1, a_0)$ of length m. For further reading on ECC, we refer to several books [5, 2, 15, 22] that discuss the topic extensively.
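As a small illustration of this notation, the sketch below stores a field element as a Python integer whose bit i is the coefficient $a_i$. The function names are hypothetical, and the toy field $\mathbb{F}_{2^4}$ with $f(z) = z^4 + z + 1$ is an assumption chosen only to keep the numbers readable; it is not the field used by the processor.

```python
def to_bit_vector(a, m):
    """Return (a_{m-1}, ..., a_1, a_0) for the polynomial encoded in a."""
    return [(a >> i) & 1 for i in range(m - 1, -1, -1)]


def field_add(a, b):
    """Addition of binary polynomials is a coefficient-wise XOR."""
    return a ^ b


def field_mul(a, b, f, m):
    """Shift-and-add multiplication of a and b modulo f(z), deg f = m."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:   # degree reached m: subtract (XOR) f(z)
            a ^= f
    return result


F_TOY, M_TOY = 0b10011, 4  # f(z) = z^4 + z + 1 (toy example, assumption)

print(to_bit_vector(0b0101, M_TOY))             # [0, 1, 0, 1]
print(field_add(0b1011, 0b0110))                # 13, i.e. z^3 + z^2 + 1
print(field_mul(0b0010, 0b1000, F_TOY, M_TOY))  # z * z^3 = z^4 = z + 1 -> 3
```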
Design-Space Exploration
In this section, we will explore different hardware-design options to obtain best results for a low-resource ECC processor. The design goals have been to meet all requirements of embedded systems which are low area (due to the production costs), low power (due to a possible contactless operation), appropriate speed (required for certain applications), security and flexibility. Due to the latter requirement, we decided to base our design on a customized microcontroller. This has the advantage of being modular in terms of protocol implementations and modifications of already implemented solutions.
Following the principles of hardware/software co-design, we found that the dominant factors of ECC processors are the finite-field hardware multiplier and the type and size of the applied data memory. In the following, we discuss these factors and explore the design space to find the best solution for our objectives.
The Hardware Multiplier
One of the most area-consuming parts within the ALU of an ECC hardware design is the finite-field multiplier. The size, speed, and power consumption of such a multiplier largely depend on the word size of the processor and the underlying finite field. Figure 1 shows the hardware architecture of a 4-bit multiplier for binary-field (carry-less multiplier), prime-field (integer multiplier), and dual-field arithmetic. The basic structure of all three types of multiplier is the same. Only the adder structure needs to be adapted.

Fig. 1. General 4-bit multiplier structure to the left. Carry-less, integer, and dual-field adder (from top to bottom) on the right.

Table 1 shows the area evaluation of different hardware-multiplier types. We evaluated multipliers for prime-field, binary-field, and dual-field arithmetic for word sizes of 8, 16, 32, and 64 bits (on register-transfer level). For the evaluation we used the UMC-L130 CMOS technology where an AND gate needs 1.25 GEs, an XOR gate needs 2.75 GEs, and a full-adder cell needs 5.5 GEs.
Obviously the area requirement scales quadratically with the given word size and carry-less multipliers provide the lowest area footprint and lowest increase in area for all given word sizes. Runtime approximations for an ECC point multiplication showed that the word size of the carry-less multiplier must be at least 16 bits in order to achieve a sensible runtime.
Next to a carry-less multiplier, an integer multiplier is necessary to provide operations for higher-level protocols (e.g. ECDSA). Note that this multiplier is needed only very few times for most protocols (only four prime-field multiplications are required for ECDSA signature generation, for instance). Thus, smaller word sizes are acceptable since no significant reduction in speed is expected. We therefore decided to implement a 16-bit carry-less multiplier (to provide an appropriate speed for a point multiplication) and an 8-bit integer multiplier instead of a dual-field 16-bit multiplier (which needs 1,946 GEs). Together, the two multipliers sum up to 1,226 GEs, which is 720 GEs less than a dual-field multiplier.
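The difference between the two multiplier types can be made concrete in software. The following sketch is a behavioral model only (the function name `clmul` and the default word size argument are assumptions, not the hardware description): a carry-less multiplier XORs the shifted partial products, whereas an integer multiplier adds them with carry propagation.

```python
def clmul(a, b, w=16):
    """Carry-less product of two w-bit words (polynomials over GF(2)).

    The partial products a << i are combined with XOR instead of integer
    addition, so no carries propagate and the result has at most 2w - 1 bits.
    """
    assert 0 <= a < (1 << w) and 0 <= b < (1 << w)
    result = 0
    for i in range(w):
        if (b >> i) & 1:
            result ^= a << i
    return result


print(bin(clmul(0b11, 0b11)))    # 0b101: (z + 1)^2 = z^2 + 1 over GF(2)
print(bin(0b11 * 0b11))          # 0b1001: integer product 3 * 3 = 9
print(clmul(0xFFFF, 0xFFFF).bit_length())  # 31 = 2 * 16 - 1
```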
The Memory Type and Architecture
One of the most area-expensive chip components of ECC processors is the Random Access Memory (RAM). RAM is necessary to store intermediate values (e.g. point coordinates during point multiplication k · P) and the secret scalar k. The size of the memory varies depending on the requirements of the ECC formulas (the formulas of López and Dahab [26] need at least 5 registers of memory for full-precision architectures and 6 registers for multi-precision architectures, because in-place operations require intermediate storage).
In Table 2, we compare different 16 × 128-bit RAM types concerning their area requirements. We compare standard-cell based implementations with dedicated RAM macro blocks synthesized in CMOS UMC-L130 technology. The standard-cell based RAM implementations (register and latch based) have been designed at RTL and synthesized using Cadence RTL compiler [4]. The RAM-macro blocks have been generated using the Standard Memory Compiler FSA0A Memaker 200901.1.1 by the Faraday Technology Corporation [11]. All but one type of RAM provide a single read port and a single write port. There is one S-RAM macro that features a dual-port read/write interface.
The latch-based RAM is about 12 % smaller than the register-based RAM. This is because a flip-flop occupies 5 GEs whereas a latch occupies 4 GEs. This 25 % difference in cell area is diminished because some additional registers and control logic are required so that the latch-based RAM behaves the same way as the register-based RAM. Adding a second read port to those RAMs would be relatively cheap in terms of chip area (introducing a second multiplexer at the output would require about 3,000 additional GEs). Note that a dual-port memory would increase the performance of a multi-precision multiplication by a factor of about two.
Of the two available single-port RAM macros, the register-file macro is about 50 % smaller than the S-RAM macro. The dual-port S-RAM macro, in contrast, is only 12 % larger than the single-port S-RAM macro; however, it is about 2.3 times larger than the register-file based RAM macro.
The register-file RAM macro provides the best performance in our evaluation scenario. We performed several power simulations using Cadence Encounter and obtained similar results for the register-file RAM macro and the standard-cell based RAM architectures. The main disadvantages of the register-file macro are the lack of a second read port (speed) and the restriction to clock-synchronous read operations. The lack of a second read port can be compensated for by using temporary working registers. The lack of asynchronous read functionality can be compensated for by more complex control logic.
Hardware Architecture
In this section, we introduce the hardware architecture of our processor. It is based on the microprocessor design called Neptun [34], which uses a Harvard architecture. This allows fetching, decoding, executing, and storing data within the same clock cycle and allows low-area optimizations due to the choice of different memory types and sizes. Figure 2 shows the block diagram of the architecture. It is mainly composed of a Central Processing Unit (CPU) including register file and Arithmetic Logic Unit (ALU), and memories for program code, constants, and data.
Central Processing Unit (CPU)
The heart of the processor is the 16-bit CPU. It is composed of several internal registers and an ECC optimized ALU. The register file consists of a program counter (PC), a stack pointer (SP), three base registers, four working registers, and an accumulator register: The program counter is used as index for the program memory. The stack pointer (SP) is needed to store registers on the data memory. The stack is also used to store program-return addresses that are needed for function calls. In order to address certain base addresses within the data memory, three base registers are used. We integrated two source registers and one destination register. They are used together with a 4-bit offset to address data in the memory. The offset address is stored within a program word. We implemented four 16-bit working registers that can be used as general-purpose registers. The registers are needed for almost any ECC operation and are used to reduce the number of memory-read cycles within the finite-field multiplication.
The accumulator register (ACC) is needed for the multiply-accumulate operation of the 163-bit multi-precision multiplication.
We integrated several optimizations to increase the performance of ECC operations. First, the ALU accesses data directly without loading it first into CPU registers (as is the case for conventional microcontrollers). In the first clock cycle, the data is addressed in the memory. In the second cycle, the data is processed by the ALU and the result is stored back in memory within the same clock cycle. This increases the performance of memory-access operations significantly. Second, loading and processing of data is done simultaneously by the processor. This avoids unnecessary idle cycles and improves the efficiency of multi-precision arithmetic operations. Those optimizations are described in more detail in [34].
Arithmetic Logic Unit (ALU). The arithmetic logic unit (ALU) mainly consists of a reduction-logic unit, a carry-less multiplier, an arithmetic unit (addition/subtraction), and a logic unit (supporting OR, AND, XOR, and shift operations). For higher-level protocols, an integer multiplier is needed in addition (drawn with dashed lines). Figure 3 shows a high-level diagram of the ALU. We also integrated an operand isolation technique for each submodule which reduces the power-consumption significantly.
Memory for Program, Data, and Constants
Our processor provides a long-term storage memory that mainly stores the program for ECC point multiplication. The memory provides 72 control signals and contains up to 1,800 entries depending on the implemented algorithms and higher-level protocols. Most of the control signals are used to control the dataflow within the CPU. Best area results have been achieved by directly synthesizing the memory table as Read Only Memory (ROM) using standard cells. Experiments in which a 16-bit instruction set or a ROM macro have been introduced resulted in a larger area requirement.
For short-term data storage, we used a 16-bit RAM macro (register-file based) as discussed in Section 3. Note that in contrast to most processors reported in literature [3, 24, 25, 30] , we include the number for the required storage of the secret scalar k. For an ECC point multiplication, 1,296 bits (81 entries) are necessary (we used a 16 × 84 macro in that case). For higher-level protocols, additional memory is needed (e.g. 1,536 bits for ECDSA signature generation (16 × 96 macro) and 2,384 bits for ECDSA signature verification (16 × 152 macro)).
ECC constants have been stored in a ROM. The ROM has been implemented as a look-up table and stores between 880 and 2,564 bits such as the x and y coordinate of the base point P , the ECC parameters a and b (see Equation (2)), and the irreducible polynomial f (z).
The input/output of data has been realized via memory mapped I/O. Data can be written and read using a 16-bit parallel interface.
Implementation Details
In the following, we give details about the implemented carry-less multiply-accumulate unit and the modular arithmetic used to perform ECC operations.
Carry-Less Multiply-Accumulate Unit
The multi-precision multiplication over $\mathbb{F}_{2^{163}}$ has been realized following a multiply-accumulate (MAC) approach. Several publications make use of MAC units to increase the performance of modular multiplication (see e.g. [8, 13, 14, 16, 33]). We implemented the multiplication in product-scanning form (often referred to as Comba multiplication), where each partial product of
Note that for the polynomial MAC unit the handling of carry propagation is not needed. Thus, the accumulator register needs a size of only (2W − 1) bits.
We implemented several improvements to increase the performance. First, the entire multiplication algorithm has been unrolled so that no extra cycles are wasted for loop operations. Second, we reused the working registers as a memory cache to reduce the number of necessary load operations. With each working register used, the total number of read operations has been reduced by about 2N. Third, we added a third word to the accumulator register ($ACC_2$, $ACC_1$, $ACC_0$) in order to allow efficient reduction of the accumulated sum. Thus, the MAC operation is performed on the words ($ACC_2$, $ACC_1$) instead of ($ACC_1$, $ACC_0$) and $ACC_0$ is used to store the previous intermediate result. A detailed description of the reduction method is given in the following subsection.
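The product-scanning MAC approach can be sketched as follows. This is a simplified software model under stated assumptions: it uses W = 16 and N = 11 words for a 163-bit operand, it omits the interleaved reduction and the third accumulator word, and all function names are hypothetical. It illustrates that each result word is produced column by column and that the polynomial accumulator never exceeds 2W − 1 bits.

```python
import random

W = 16                 # word size in bits
M = 163                # operand size in bits
N = -(-M // W)         # words per operand: ceil(163 / 16) = 11
MASK = (1 << W) - 1


def clmul(a, b):
    """Carry-less product of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def to_words(a):
    return [(a >> (W * i)) & MASK for i in range(N)]


def from_words(words):
    return sum(w << (W * i) for i, w in enumerate(words))


def product_scanning_mul(a_words, b_words):
    """Column-wise (Comba) polynomial multiplication without reduction."""
    c = [0] * (2 * N)
    acc = 0
    max_acc_bits = 0
    for k in range(2 * N - 1):
        # accumulate all partial products a[i] * b[j] with i + j = k
        for i in range(max(0, k - N + 1), min(k, N - 1) + 1):
            acc ^= clmul(a_words[i], b_words[k - i])
        max_acc_bits = max(max_acc_bits, acc.bit_length())
        c[k] = acc & MASK   # lowest word of the accumulator is a result word
        acc >>= W           # shift the accumulator by one word
    c[2 * N - 1] = acc
    return c, max_acc_bits


random.seed(1)
a, b = random.getrandbits(M), random.getrandbits(M)
c, acc_bits = product_scanning_mul(to_words(a), to_words(b))
print(from_words(c) == clmul(a, b))  # True
print(acc_bits <= 2 * W - 1)         # True
```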
Algorithm 1 shows the implemented polynomial multiplication. The polynomials a(z) and b(z) are multiplied and the reduced result is stored in c(z). In lines 1 to 8, the lower N words of the result c(z) are calculated. Note that the $ACC_0$ register is not used in this phase. In line 9, the lower (m − W(N − 1)) bits of the accumulator need to be cleared. Those are the bits of the result that do not need to be reduced. Lines 10-16 calculate the higher N words of c(z) and reduce them immediately.

Algorithm 1 Polynomial multiplication with interleaved reduction.
Require: Binary polynomials a(z) and b(z) of degree at most m − 1.
     …
 5:     end for
 6:     …
 7:     ACC ← ACC ≫ W.
 8: end for
 9: ACC ← higher(ACC).
10: for k from t to 2N − 2 do
11:     for each element of {(i, j) | i …
     …
13:     end for
14:     …
15:     ACC ← ACC ≫ W.
16: end for
17: …

According to the recommended NIST irreducible polynomial B-163, $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$, the reduction function (line 14) can be written as

Finally, in lines 17-20 the rest of the accumulator and the higher bits of
Polynomial NIST B-163 Reduction Logic. We make use of the recommended NIST irreducible polynomial B-163 to perform a very efficient modular reduction for modular multiplication and squaring. We hard-wired the output of the appropriate accumulator register according to the reduction function given above. The dedicated reduction logic is shown in Figure 4.
It should be noted that although the reduction logic has been specially optimized for NIST B-163, the CPU is capable of handling arbitrary irreducible polynomials. Thus requirements such as flexibility and extendability are ensured.
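Why the B-163 polynomial allows such a cheap reduction can be seen from the congruence $z^{163} \equiv z^7 + z^6 + z^3 + 1 \pmod{f(z)}$: every bit above position 162 is folded back with four shifted XORs. The following sketch is a software view of that idea and not the hard-wired word-level logic of Figure 4; the names `reduce_b163` and `poly_mod` are hypothetical, and `poly_mod` is a generic long division included only as a cross-check.

```python
M = 163
F_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # NIST B-163 f(z)


def reduce_b163(c):
    """Reduce a binary polynomial modulo f(z) = z^163 + z^7 + z^6 + z^3 + 1."""
    while c >> M:
        hi = c >> M                # coefficients of z^163 and above
        c &= (1 << M) - 1          # keep the lower 163 coefficients
        # z^(163+i) = z^(i+7) + z^(i+6) + z^(i+3) + z^i  (mod f)
        c ^= hi ^ (hi << 3) ^ (hi << 6) ^ (hi << 7)
    return c


def poly_mod(c, f):
    """Generic polynomial long division over GF(2), used as a reference."""
    df = f.bit_length()
    while c.bit_length() >= df:
        c ^= f << (c.bit_length() - df)
    return c


print(hex(reduce_b163(1 << 163)))  # 0xc9, i.e. z^7 + z^6 + z^3 + 1
```

For a product of two 163-bit operands (at most 325 bits), the loop runs at most twice, because the first pass leaves at most six bits above position 162.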
Modular Arithmetic
Modular Addition. The simplest operation is the modular addition. It is a simple XOR operation. Neither a carry flag nor a finite-field reduction needs to be considered. Modular addition over $\mathbb{F}_{2^{163}}$ needs 35 clock cycles on our processor.

Modular Multiplication. Modular multiplication has been realized using the carry-less multiply-accumulate unit described in Section 5.1. Our processor needs 222 clock cycles for a 163-bit multiplication.

Modular Squaring. Modular squaring can be performed very efficiently. The binary representation of the polynomial can be squared by inserting a 0 between each pair of consecutive bits, e.g. $(1101)_2$ becomes $(1010001)_2$. This can be realized with only a few additional hardware components. The polynomial-reduction logic can be reused for squaring. One modular squaring needs 41 clock cycles on our processor and is thus 5.4 times faster than a modular multiplication.

Modular Inversion. Modular inversion is required to transform the projective coordinates back into affine coordinates. For this operation, we made use of Fermat's little theorem [19], which states that $a = a^{2^m} \bmod f(z)$ for all $a \in \mathbb{F}_{2^m}$. As a result, $a^{-1} = a^{2^m - 2} \bmod f(z)$ for $a \neq 0$. This exponentiation can be performed using 162 squarings and only 9 multiplications for the NIST B-163 binary field. As a result, 11,031 cycles are needed for an inversion.
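Squaring by bit spreading and inversion by exponentiation can be sketched as below. This is a hedged software model: the addition chain is an Itoh-Tsujii style chain, which is an assumption on our side that happens to reproduce the stated count of 162 squarings and 9 multiplications; the text does not say which chain the processor uses. All function names and the `counts` dictionary are hypothetical.

```python
M = 163
counts = {"sqr": 0, "mul": 0}  # operation counters, for illustration only


def reduce_b163(c):
    """Reduce modulo f(z) = z^163 + z^7 + z^6 + z^3 + 1."""
    while c >> M:
        hi = c >> M
        c &= (1 << M) - 1
        c ^= hi ^ (hi << 3) ^ (hi << 6) ^ (hi << 7)
    return c


def spread_bits(a):
    """Insert a 0 between consecutive bits: bit i moves to bit 2*i."""
    result, i = 0, 0
    while a:
        result |= (a & 1) << (2 * i)
        a >>= 1
        i += 1
    return result


def field_sqr(a):
    counts["sqr"] += 1
    return reduce_b163(spread_bits(a))


def field_mul(a, b):
    counts["mul"] += 1
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return reduce_b163(result)


def field_inv(a):
    """Return a^(2^163 - 2) using an Itoh-Tsujii style chain (assumption)."""
    beta, k = a, 1                   # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:       # bits of 162 after the leading 1
        t = beta
        for _ in range(k):
            t = field_sqr(t)         # t = beta^(2^k)
        beta, k = field_mul(t, beta), 2 * k
        if bit == "1":
            beta, k = field_mul(field_sqr(beta), a), k + 1
    return field_sqr(beta)           # (a^(2^162 - 1))^2 = a^(2^163 - 2)


a = 0x123456789ABCDEF0123456789ABCDEF012345678  # arbitrary nonzero element
inv = field_inv(a)
used = dict(counts)
print(used)                    # {'sqr': 162, 'mul': 9}
print(field_mul(a, inv) == 1)  # True
```

With the cycle counts quoted above, the field operations alone account for 162 · 41 + 9 · 222 = 8,640 of the 11,031 cycles; the remainder is presumably control and data-movement overhead.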
Results
We synthesized our processor using different CMOS technologies from various manufacturers. For synthesis, we used the Cadence RTL compiler [4] Version v08.10. Table 3 shows the total area and power-consumption estimation of the processor using latch-based RAMs (described in Section 3.2). The power-consumption estimations were made using Cadence Encounter Version v08.10. All obtained area results are within a 20 % margin. In terms of power consumption, the best result was obtained for the UMC-L130 technology. For all following approximations we used register-based RAM macros.
In Table 4, the area and power requirements for individual chip components are listed. The memory accounts for most of the area, namely 5,399 GEs. The CPU needs 3,556 GEs in total, of which only 849 GEs are used for the carry-less multiplier. The total size of the processor sums up to 8,958 GEs.
In Table 5, we compare our results with related work. There exist many publications of ECC processors over $\mathbb{F}_{2^{163}}$. Most of those processors use full-precision arithmetic to perform the point multiplication. For a fair comparison, we listed the results of the authors for different digit sizes (d = 1...8). All implementations need between 10,392 GEs and 16,247 GEs of chip area and between 47 and 430 kcycles for the computation of k · P. Our implementation needs 8,958 GEs of area, which is 1,434 GEs less than the best reported solution. This is an area improvement of about 14 %. The number of needed clock cycles is comparable to that of the full-precision solutions with d = 1. The power and energy consumption is very low and fulfills most requirements of embedded-system designs.
Results for Higher-Level Protocol Implementations
As a higher-level protocol, we implemented the Elliptic Curve Digital Signature Algorithm (ECDSA) [29]. In addition to a point multiplication over the binary field $\mathbb{F}_{2^{163}}$, ECDSA needs a hash function and several prime-field arithmetic operations to generate and verify a digital signature. As a hash function, we implemented the 160-bit SHA-1 algorithm according to FIPS 180-3 [28]. Replacing the SHA-1 algorithm with one of the current SHA-3 candidates [18] would be easily possible. For prime-field multiplication and inversion, we decided to implement Montgomery-arithmetic operations. We implemented the Finely Integrated Product Scanning Form (FIPS) according to Koç et al. [23]. The algorithm is used only four times, so we optimized the code for low area (no loop unrolling etc.). Furthermore, we implemented the Montgomery-inversion algorithm according to Kaliski et al. [21].
For signature verification, we applied Shamir's trick [31, 9] to improve the performance of multiple-point multiplication. All described operations for ECDSA have been implemented as assembly functions for our processor and are stored in program memory. Table 6 shows the results after synthesizing the processor. For ECDSA signature generation, our processor needs 15,387 GEs, which outperforms existing solutions in terms of area, power, and speed [12, 17, 34, 35]. Signature verification can be realized using a chip area of 16,005 GEs.

a The numbers include y-recovery, the randomized projective coordinates (RPC) side-channel countermeasure [6], and the ECC point-validity check [7].
b Includes the SHA-1 hash function [28], Random Number Generation (RNG) [29], and prime-field arithmetic.
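The idea behind Shamir's trick, as used for signature verification, can be sketched generically: the sum u1 · P + u2 · Q is computed in a single pass over both scalars with one shared doubling per bit. The sketch below is an illustration under assumptions: it uses the integers modulo a small prime as a stand-in group rather than elliptic-curve points, and the names `shamir_trick`, `add`, `double`, and `N_MOD` are hypothetical.

```python
def shamir_trick(u1, P, u2, Q, add, double, zero):
    """Compute u1*P + u2*Q with one doubling per bit of the longer scalar."""
    PQ = add(P, Q)  # precomputed once
    result = zero
    for i in reversed(range(max(u1.bit_length(), u2.bit_length()))):
        result = double(result)
        b1, b2 = (u1 >> i) & 1, (u2 >> i) & 1
        if b1 and b2:
            result = add(result, PQ)
        elif b1:
            result = add(result, P)
        elif b2:
            result = add(result, Q)
    return result


# Stand-in group (assumption, for illustration): integers mod a small prime.
N_MOD = 1009


def add(x, y):
    return (x + y) % N_MOD


def double(x):
    return (2 * x) % N_MOD


print(shamir_trick(45, 7, 200, 11, add, double, 0) == (45 * 7 + 200 * 11) % N_MOD)  # True
```

Compared with two separate point multiplications followed by an addition, this roughly halves the number of doublings.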
Conclusions
In this paper, we presented a low-resource implementation of an ECC hardware processor. The processor needs 8,958 GEs and performs a point multiplication within 285 kcycles. The power consumption is about 3.2 µW at 100 kHz. We met the low-resource constraints of embedded systems by applying a very modular microcontroller architecture that allows the execution of higher-level protocols like ECDSA. The elliptic-curve operations are performed over the NIST $\mathbb{F}_{2^{163}}$ elliptic curve using multi-precision arithmetic. The outcome improves the state of the art in low-area ECC hardware designs and even provides a smaller area footprint than most of the proposed SHA-3 candidates [18].
A Statistics for ECC multiplication
During the development of the ECC and ECDSA functions, we used a statistics feature of our tool-chain to investigate the code-line and cycle consumption of each function. Table 7 shows the number of times each function is called, the size of each function in code lines, and the total runtime of each function. Even though the multiplication algorithm is optimized down to 222 cycles, it still accounts for 74 % of the total runtime. Table 8 shows how often each type of instruction is used. The parallelized commands are a combination of other commands. They account for 71 % of the total runtime. Note that only 4.4 % of the total runtime is used for program-flow instructions such as RET, CALL, BRA, and JMP. This overhead would not exist if a dedicated state machine were used instead of a CPU with an instruction set.

Table 8. Instructions used during an ECC point multiplication with y-recovery and point-validity check.
