Abstract. The performance of elliptic curve based public key cryptosystems is mainly determined by the efficiency of the underlying finite field arithmetic. This work describes two generic and scalable architectures of finite field coprocessors, which are implemented within the latest family of Field Programmable System Level Integrated Circuits (FPSLIC) from Atmel, Inc. The HW architectures are adapted from Karatsuba's divide and conquer algorithm and allow for a reasonable speedup of the top-level elliptic curve algorithms. The VHDL hardware models are automatically generated based on a selectable operand size, which permits the optimal utilization of a particular FPSLIC device.
Introduction
Today there is a wide range of distributed systems which use communication resources that cannot be safeguarded against eavesdropping or unauthorized data alteration. Thus cryptographic protocols are applied to these systems in order to prevent information extraction or to detect data manipulation by unauthorized parties. Besides the widely used RSA method [1], public-key schemes based on elliptic curves (EC) have gained more and more importance in this context. Elliptic curve cryptography (ECC) was first proposed in 1985 by V. Miller [2] and N. Koblitz [3]. Since then a lot of research has been done, and nowadays ECC is widely known and accepted. Because EC methods in general are believed to give a higher security per key bit in comparison to RSA, one can work with shorter keys in order to achieve the same level of security [4]. The smaller key size permits more cost-efficient implementations, which is of special interest for low-cost and high-volume systems. Because ECC scales well over the whole security spectrum, low-security applications in particular can benefit from ECC.
Each application has different demands on the utilized cryptosystem (e.g., in terms of required bandwidth, level of security, incurred cost per node and number of communicating partners). The major market share is probably occupied by the low-bandwidth, low-cost and high-volume applications, most of which are based on SmartCards or similar low complexity systems. Examples are mobile phone SIM cards, electronic payment and access control systems. In the case of access control systems, ECC allows the use of one device and one key pair per person for the entire application. Very fine-grained control is possible and, in contrast to present systems, which are mostly based on symmetric ciphers, there is no problem regarding key handling.
Depending on the application, the performance of pure SW implementations of ECC is not sufficient. In this paper two generic and scalable architectures of finite field coprocessors for the acceleration of ECC are presented. The first one, which is mainly composed of a single combinational Karatsuba multiplier (CKM), allows for a significant speed-up of the finite field multiplication while spending only a small amount of HW resources. The second one is a finite field coprocessor (FFCP), implementing field multiplication, addition and squaring completely within HW. The proposed multi-segment Karatsuba multiplication together with a cleverly selected sequence of intermediate result computations permits high-speed ECC even on devices offering only approx. 40K system gates of HW resources. A variety of fast EC cryptosystems can be built by using the proposed system partitioning. Running the EC level algorithms in SW allows for algorithmic flexibility while the HW accelerated finite field arithmetic contributes the required performance.
Recently, Atmel, Inc. introduced their new AT94K family of FPSLIC devices (Field Programmable System Level Integrated Circuits). This architecture integrates FPGA resources, an AVR microcontroller core, several peripherals and SRAM within a single chip. Based on HW/SW co-design methodologies, this architecture is perfectly suited for System on Chip (SoC) implementations of ECC.
The mathematical background of elliptic curves and finite fields is briefly described in the following section. In Sec. 3 the architectures of the proposed finite field coprocessors are detailed. Sec. 4 introduces the FPSLIC hardware platform. Finally, we report on our implementation results, give some performance numbers and draw conclusions.
Mathematical Background
There are several cryptographic schemes based on elliptic curves, which work on a subgroup of points of an EC over a finite field. Arbitrary finite fields have been shown to be suitable for ECC. In this paper we concentrate on elliptic curves over the finite field $GF(2^n)$ and their arithmetic only. For further information we refer to [5] and [6].
Elliptic Curve Arithmetic
An elliptic curve over $GF(2^n)$ is defined as the cubic equation 
Finite Field Arithmetic
As previously mentioned, the EC arithmetic is based on a FF of characteristic 2 and extension degree $n$: $GF(2^n)$, which can be viewed as a vector space of dimension $n$ over the field $GF(2)$. There are several bases known for $GF(2^n)$. The most common bases, which are also permitted by the leading standards concerning ECC (IEEE 1363 [8] and ANSI X9.62 [9]), are polynomial bases and normal bases.
The representation treated in this paper is a polynomial basis, where field elements are represented by binary polynomials modulo an irreducible binary polynomial (called reduction polynomial) of degree $n$. Given an irreducible polynomial $P(x) = x^n + \sum_{i=0}^{n-1} p_i x^i$, with $p_i \in GF(2)$, an element $a \in GF(2^n)$ is represented by a bit string $(a_{n-1} \dots a_1 a_0)$, so that

$$a(x) = \sum_{i=0}^{n-1} a_i x^i$$

is a polynomial in $x$ of degree less than $n$ with coefficients $a_i \in GF(2)$.
By exploiting a field of characteristic 2, the addition is reduced to just XOR-ing the corresponding bits. The sum of two elements $a, b \in GF(2^n)$ is given by

$$a(x) + b(x) = \sum_{i=0}^{n-1} (a_i \oplus b_i) x^i$$

and therefore takes a total of $n$ binary XOR operations. Squaring is a special case of multiplication. For $a \in GF(2^n)$ the square is given by:

$$a(x)^2 = \sum_{i=0}^{n-1} a_i x^{2i} \bmod P(x)$$
In the case of multiplication and squaring a polynomial reduction step has to be performed, which is detailed in Sec. 
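The addition and the unreduced squaring described above can be sketched in a few lines of Python. This is only an illustration, not part of the implementations reported in this paper: field elements are assumed to be stored as Python integers whose bit $i$ holds the coefficient of $x^i$, and the function names are hypothetical.

```python
def gf2n_add(a: int, b: int) -> int:
    """Add two field elements: bitwise XOR of the coefficient vectors."""
    return a ^ b


def gf2_poly_square(a: int) -> int:
    """Square a binary polynomial without reduction: bit i moves to bit 2*i."""
    result = 0
    i = 0
    while a:
        if a & 1:
            result |= 1 << (2 * i)
        a >>= 1
        i += 1
    return result


# a(x) = x^3 + x + 1 and b(x) = x^2 + x (bit i = coefficient of x^i)
a = 0b1011
b = 0b0110
total = gf2n_add(a, b)        # x^3 + x^2 + 1
square = gf2_poly_square(a)   # x^6 + x^2 + 1, still to be reduced modulo P(x)
```

Squaring only spreads the operand bits apart by inserting zeros, which is why it needs very little logic apart from the subsequent reduction.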
This results in a total of three $n/2$-bit multiplications and some extra additions (XOR operations) to perform one $n$-bit multiplication.
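The two-segment recursion with three half-size multiplications can be sketched as follows. The sketch assumes operands stored as Python integers (bit $i$ is the coefficient of $x^i$) and an operand width that is a power of two; the function names are hypothetical, and `clmul` is a plain schoolbook multiplication that serves only as a reference.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less multiplication of binary polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba(a: int, b: int, n: int) -> int:
    """Multiply two binary polynomials of at most n bits (n a power of two)."""
    if n == 1:
        return a & b  # a 1-bit multiplication over GF(2) is a single AND
    half = n // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    t0 = karatsuba(a0, b0, half)            # product of the low segments
    t1 = karatsuba(a1, b1, half)            # product of the high segments
    t2 = karatsuba(a0 ^ a1, b0 ^ b1, half)  # product of the segment sums
    # the middle coefficient a1*b0 + a0*b1 equals t2 + t1 + t0 in characteristic 2
    return (t1 << n) ^ ((t2 ^ t1 ^ t0) << half) ^ t0


product = karatsuba(0xB7, 0x5D, 8)
```

The result is the unreduced product of up to $2n-1$ bits; it agrees with `clmul(0xB7, 0x5D)`.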
Multi-Segment Karatsuba Multiplication. The fundamental Karatsuba multiplication for polynomials in $GF(2^n)$ is based on the idea of divide and conquer, since the operands are divided into two segments. One may attempt to generalize this idea by subdividing the operands into more than two segments. [11] reports on such an implementation with a fixed number of three segments denoted as Karatsuba-variant multiplication. The Multi-Segment Karatsuba (MSK) multiplication scheme, which is detailed subsequently, is more general because an arbitrary number of segments is supported. Disregarding some slight arithmetic variations, the Karatsuba-variant multiplication is a special case of the MSK approach.
Two polynomials in $GF(2^n)$ are multiplied by a $k$-segment Karatsuba multiplica- 
The application of the above equations for an $MSK_3$ multiplication, made up of six $n/3$-bit multiplications, is illustrated in the appendix of this paper.
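To make the multiplication count concrete, the following Python sketch multiplies two operands that are split into $k$ segments of $m$ bits each, using $k(k+1)/2$ segment multiplications (six for $k = 3$). It is a sketch under stated assumptions: it uses a pairwise formulation with products of two-segment sums, which is not necessarily the formulation used in the equations of this paper, and all names are hypothetical.

```python
from itertools import combinations


def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less multiplication, standing in for the segment multiplier."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def split(value: int, k: int, m: int) -> list:
    """Split an operand into k segments of m bits, least significant first."""
    mask = (1 << m) - 1
    return [(value >> (i * m)) & mask for i in range(k)]


def msk_multiply(a: int, b: int, k: int, m: int):
    """Return (unreduced product, number of segment multiplications)."""
    seg_a, seg_b = split(a, k, m), split(b, k, m)
    count = 0
    result = 0
    single = []
    for i in range(k):
        d = clmul(seg_a[i], seg_b[i])  # single-segment product
        count += 1
        single.append(d)
        result ^= d << (2 * i * m)
    for i, j in combinations(range(k), 2):
        d = clmul(seg_a[i] ^ seg_a[j], seg_b[i] ^ seg_b[j])
        count += 1
        # d + single[i] + single[j] equals the cross terms of segments i and j
        result ^= (d ^ single[i] ^ single[j]) << ((i + j) * m)
    return result, count


product, mults = msk_multiply(0xABCDEF, 0x123456, k=3, m=8)  # mults is 6
```

For $k = 5$ the same function performs 15 segment multiplications.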
Polynomial Reduction. For $a, b \in GF(2^n)$, the maximum degree of the resulting polynomial $c(x) = a(x) \cdot b(x)$ is $2n-2$. Therefore, in order to fit into a bit string of size $n$, $c(x)$ has to be reduced. The polynomial reduction process modulo $P(x)$ is based on the equivalence

$$x^n \equiv \sum_{i=0}^{n-1} p_i x^i \pmod{P(x)}$$
Implementations of the reduction can especially benefit from hard-coded reduction polynomials with low Hamming weight such as trinomials, which are typically used in cryptographic applications.
Given such a trinomial as prime polynomial $P(x) = x^n + x^b + 1$, the reduction process can be performed efficiently by using the identities:

$$x^{n+i} \equiv x^{b+i} + x^i \pmod{P(x)}, \quad i = 0, \dots, n-2$$
As a result, an $n$-bit register is sufficient to store the resulting bit string.
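The reduction modulo a trinomial can be sketched in Python as shown below. The sketch folds every coefficient of degree $n$ or higher back into the lower part, one bit at a time. The name `t` for the middle exponent of the trinomial, the function name, and the example values $n = 113$ and $t = 9$ are assumptions chosen for illustration; the reduction polynomial of the actual implementation is not stated here.

```python
def reduce_trinomial(c: int, n: int, t: int) -> int:
    """Reduce the binary polynomial c(x) modulo P(x) = x^n + x^t + 1."""
    for i in range(c.bit_length() - 1, n - 1, -1):
        if (c >> i) & 1:
            # x^i = x^(i-n) * x^n is congruent to x^(i-n+t) + x^(i-n)
            c ^= (1 << i) | (1 << (i - n + t)) | (1 << (i - n))
    return c


# x^113 folds to x^9 + 1 for the assumed trinomial x^113 + x^9 + 1
folded = reduce_trinomial(1 << 113, n=113, t=9)
```

A hardware implementation performs all of these folds in parallel with a fixed XOR network, which is why hard-coded polynomials of low Hamming weight are attractive.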
Hardware Architecture
An ideal HW/SW partitioning targeting a reconfigurable HW platform for an EC based cryptosystem depends on several parameters. As stated before, the FF arithmetic is the most time critical part of an EC cryptosystem. Depending on the utilized key size and the amount of available FPGA resources, the FF operations cannot necessarily be performed completely within HW. Therefore, flexibility within the HW design flow is essential in order to achieve the maximum performance from a specific FPGA device. In order to ensure this flexibility, the HW design flow is based on the hardware description language VHDL, which is the de-facto standard for abstract modeling of digital circuits. A VHDL generator approach (similar to the one documented in [12]) was exploited to derive VHDL models for both of the subsequently described FF coprocessors. In Sec. 3.1 the combinational Karatsuba multiplier (CKM) is illustrated and Sec. 3.2 details the architecture of the entire finite field coprocessor (FFCP).
Combinational Karatsuba Multiplier (CKM)
As stated in Sec. 2.2 and shown in Fig. 3a, the multiplication over $GF(2)$ is computed by a single AND operation. According to Eqn. 5 the multiplication of two polynomials of degree $m$ can be computed with three $m/2$-bit multiplications and some XOR operations to determine interim results and to accumulate the final result. This leads immediately to a recursive construction process, which builds CKMs of width $m = 2^i$ for arbitrary $i \in \mathbb{N}$ (see Fig. 3). With slight modifications this scheme can be generalized to support arbitrary bit widths. Exploiting the VHDL generator, CKM models for arbitrary $m \in \mathbb{N}$ can be automatically generated.
To determine the number of gates that constitute an $m$-bit CKM, we take a look at Fig. 4. In addition to the resources of the three $m/2$-bit multipliers, $2(m/2) = m$ 2-input XORs are needed to compute the sub-terms $(A_1 \oplus A_0)$ and $(B_1 \oplus B_0)$ of $T_2$. 
With the master method [13] it can easily be shown that all of these recurrences belong to the complexity class $\Theta(m^{\log_2 3})$. Explicit gate counts for CKMs of various bit widths are summarized in Tab. 1. 
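The growth rate can be checked for the AND gates alone, whose recurrence follows directly from the construction: a 1-bit multiplier is one AND gate, and an $m$-bit CKM contains three $m/2$-bit multipliers. The Python sketch below evaluates this recurrence for power-of-two widths; the function name is hypothetical, and the XOR counts are left out because they depend on recurrences that are not reproduced here.

```python
def ckm_and_gates(m: int) -> int:
    """Number of 2-input AND gates in an m-bit CKM (m a power of two)."""
    if m == 1:
        return 1  # base case: a single AND gate
    return 3 * ckm_and_gates(m // 2)  # three half-width multipliers


# AND gate counts for a few power-of-two widths
and_gate_counts = {m: ckm_and_gates(m) for m in (1, 2, 4, 8, 16, 32)}
# a 32-bit CKM needs 3**5 = 243 AND gates, compared to 32**2 = 1024 for schoolbook
```

Solving the recurrence gives $3^{\log_2 m} = m^{\log_2 3}$ AND gates, in line with the stated complexity class.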
Finite Field Coprocessor (FFCP)
This section presents a generic and scalable FFCP architecture, which accelerates field multiplication, addition and squaring. Addition and squaring are operations which require only a few logical resources and hence can be implemented by combinational logic. In contrast, the multiplication cannot reasonably be implemented by combinational logic only. By the use of the proposed MSK multiplication scheme (see Sec. 2.2) and a cleverly selected sequence of intermediate result computations, the resulting datapath has only modest requirements on logic resources and at the same time a low cycle count for a complete field multiplication.
The datapath is built around a low complexity $m$-bit CKM as detailed in Sec. 3.1, but of course any other multiplier design would also do. By application of the sequential $MSK_k$ multiplication algorithm, $k \cdot m$ bit wide operands can be processed. With respect to the implementation in Sec. 5.2 and for reasons of easy illustration we assume $k = 5$ in the following, but the scheme applies and scales in a nice way for an arbitrary number of segments $k$.
Eqn. 6 evaluated for $k = 5$ ($MSK_5$) is illustrated in Fig. 5a. Each rectangle denotes the result of an $m$-bit multiplication. As one would expect, these products are as wide as two segments. The labels in the rectangles determine the indices of the segments whose sums have been multiplied. E.g., the label "234" represents the term $(A_2 \oplus A_3 \oplus A_4) \cdot (B_2 \oplus B_3 \oplus B_4)$, which is denoted $M_2^3(\cdot)$ in Eqn. 7. The horizontal position of a rectangle denotes the exponent of the associated power of $x$. E.g., the rectangle in the lower left edge labeled "4" together with its position denotes the term [...]

First, most partial products are added two times in order to compute the final result. They can be grouped together and placed in one of three patterns, which are indicated in Fig. 5b. This is true for all instances of the multi-segment Karatsuba algorithm. In the datapath, these patterns are computed by some additional combinational logic, which is connected to the output signals of the CKM (see part (c) of Fig. 6).
Second, the resulting patterns are ordered by descending exponent of their associated power of $x$. In this way, the product can be accumulated easily in a shift register. As the third optimization criterion, the remaining degree of freedom is taken advantage of in the following way: The patterns are once more partially reordered, such that when iterating over them from top to bottom, one of the two following conditions holds: Either the current pattern is constructed out of a single segment product (e.g. $A_i \cdot B_i$), or the set of indices of the pattern's segments differs only by one index from its predecessor (as in the partial products $(A_0 \oplus A_1) \cdot (B_0 \oplus B_1)$ and $(A_0 \oplus A_1 \oplus A_2) \cdot (B_0 \oplus B_1 \oplus B_2)$). In Fig. 5b this criterion is met for all but one iteration step (namely it is not met for the step from "23" to "1234"). Thus, based on the datapath in Fig. 6 the computation of the partial product "1234" takes a total of two clock cycles, which is one more compared to all other iteration steps. The number of additional clock cycles due to the fact that this third criterion cannot be met increases slowly with the number of segments $k$.

By applying the third optimization criterion to the pattern sequence, the partial product computations can be performed as follows: By placing $m$-bit accumulator registers at the inputs of the CKM, the terms $M_l^m(\cdot)$ can be computed iteratively. This results in a two stage pipelined design for the complete datapath and keeps the total number of clock cycles for one field multiplication utilizing the $MSK_5$ low.
The complete datapath is depicted in Fig. 6. In part a) the two operand registers of width $l \cdot m$ are shown as well as their partitioning into five segments. Both are implemented as shift registers in order to allow data exchange with the external controller. The multiplexors in part b) select one from the five segments of the operands. They can both be controlled by the same set of signals, since they are always operating synchronously. Besides the combinational addition and squaring blocks, part c) illustrates the two accumulator registers. Both can either be loaded with a new segment, or they can accumulate intermediate segment sums. Section d) of Fig. 6 consists of the CKM. Part e) covers the pattern generation stage, which is mainly composed of multiplexors. Finally, in part f) the multiplication accumulator register is shown. It can either hold its value or the current pattern can be added to it in each cycle. Each time the intermediate result is shifted left by $m$ bits, an interleaved reduction step according to Eqn. 10 is performed. This way, the accumulator needs only to be $n$ bits wide, where $n$ is the degree of the reduction polynomial. Furthermore, the necessary number of logic elements for the reduction step is minimized and no additional clock cycle is needed.
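The effect of the interleaved reduction on the accumulator can be illustrated with a simplified Python model. It is a sketch under several assumptions: the partial results are accumulated column by column with plain schoolbook segment products rather than with the pattern sequence of Fig. 5b, the reduction polynomial is taken to be a trinomial $x^n + x^t + 1$ with illustrative values $n = 113$ and $t = 9$, and all names are hypothetical. The point of the sketch is only that shifting by $m$ bits and reducing in every step keeps the accumulator at $n$ bits while producing the same result as a full multiplication followed by one reduction.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less multiplication, standing in for the CKM."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_trinomial(c: int, n: int, t: int) -> int:
    """Reduce c(x) modulo the assumed trinomial x^n + x^t + 1."""
    for i in range(c.bit_length() - 1, n - 1, -1):
        if (c >> i) & 1:
            c ^= (1 << i) | (1 << (i - n + t)) | (1 << (i - n))
    return c


def multiply_interleaved(a: int, b: int, n: int, t: int, k: int, m: int) -> int:
    """Field multiplication with a shift-by-m and reduce step per column."""
    mask = (1 << m) - 1
    seg_a = [(a >> (i * m)) & mask for i in range(k)]
    seg_b = [(b >> (i * m)) & mask for i in range(k)]
    acc = 0
    for j in range(2 * k - 2, -1, -1):  # highest power of x first
        column = 0
        for i in range(max(0, j - k + 1), min(j, k - 1) + 1):
            column ^= clmul(seg_a[i], seg_b[j - i])
        # shift the intermediate result, add the column, reduce immediately
        acc = reduce_trinomial((acc << m) ^ column, n, t)
    return acc


a = (1 << 112) | 0x1234567890ABCDEF
b = (1 << 100) | 0x0FEDCBA987654321
result = multiply_interleaved(a, b, n=113, t=9, k=5, m=24)
```

The value of `result` equals `reduce_trinomial(clmul(a, b), 113, 9)`, and the accumulator never holds more than $n$ bits between two steps.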
In order to reduce the amount of communication between the controller and the FFCP, the result of the current computation is fed back to one of the operand registers. Thus, interim results need not necessarily be transferred several times between controller and FFCP.
Tab. 2 gives an overview of the amount of structural and logical components, which are required to implement the proposed datapath (excluding the CKM resources, please refer to Sec. 3.1 for the CKM implementation complexity). The number of states of the finite state machine, which controls the datapath, is in the order of Eqn. 8. Thus, logic resources for the FSM are negligibly small.
Atmel FPSLIC Hardware Platform
For the implementation of the previously detailed FF coprocessors the AT94K FPSLIC hardware platform from Atmel, Inc. is used within this work [14] . This product family integrates FPGA resources, an AVR 8-bit RISC microcontroller core, several peripherals and up to 36K Bytes SRAM within a single chip. The AVR microcontroller core is a common embedded processor, e.g., on SmartCards and is also available as a stand-alone device. The AVR is capable of 129 instructions, most of which can be performed within a single clock cycle. This results in a 20+ MIPS throughput at 25 MHz clock rate.
The FPGA resources within the FPSLIC devices are based on Atmel's AT40K FPGA architecture. A special feature of this architecture is the FreeRam cells, which are located at the corners of each 4x4 cell sector. Using these cells results in minimal impact on bus resources and thus in fast and compact FPGA designs. The FPGA part is connected to the AVR over an 8-bit data bus. The amount of available FPGA resources ranges from about 5K system gates within the so-called FPSLIC to about 40K system gates within the AT94K40.
Both the AVR microcontroller core and the FPGA part are connected to the embedded memory separately. Up to 36K Bytes SRAM are organized as 20K Bytes program memory, 4K Bytes data memory and 12K Bytes that can dynamically be allocated as data or program memory.
Atmel provides a complete design environment for the FPSLIC including tools for software development (C Compiler), tools for hardware development (VHDL synthesis tools) and a HW/SW co-verification tool, which supports the concurrent development of hardware and software.
For the implementations detailed subsequently the Atmel ATSTK94 FPSLIC demonstration board is used. This board comes with an AT94K40 device and is running at a 12 MHz clock rate. The FPGA part consists of 2304 logic cells and 144 FreeRam cells, which is equivalent to approx. 40K system gates.
Implementation
Three different prototype implementations were built in order to evaluate the architectures detailed in Sec. 3. Due to the restrictions in terms of available FPGA resources these implementations support 113 bit EC point multiplication only. This is certainly not sufficient for high-security applications, but can be applied in low-security environments.
The following sections present some implementation details and performance numbers for a purely software based implementation, a design that is accelerated with a 32-bit CKM and another one, which applies the FFCP. Furthermore an extension to the FFCP design is proposed and performance numbers for this extended version are estimated.
Pure Software without HW Acceleration
The software variant is entirely coded in assembler and has been optimized regarding the following design criteria:
- High performance.
- Resistance against side channel attacks.
- Easy SW/HW exchange of basic FF operations.
Concerning the performance, special effort has been spent at FF level in optimizing the field multiplication and reduction, which is the performance critical part of the entire $k \cdot P$ algorithm. At the EC level the so-called 2P Algorithm documented in [15] is utilized to perform the EC point multiplication. This algorithm takes only 4 multiplications, 1 squaring and 2 additions in the underlying FF for one EC-Add computation. One EC-Double takes only 2 multiplications, 4 squarings and 1 addition. Summing up, this $k \cdot P$ implementation is about 2 times faster compared to standard Double-and-Add implementations. Furthermore, the 2P Algorithm is inherently resistant against pertinent timing and power attacks, since in every iteration of its inner loop both operations (EC-Add and EC-Double) have to be computed, regardless of the binary expansion of $k$. Thus, besides some pre- and postprocessing overhead, one $k \cdot P$ computation over $GF(2^n)$ takes exactly $n$ EC-Add and $n$ EC-Double operations. At the FF level countermeasures against side-channel attacks based on randomization and avoidance of conditional branches are applied as well [16]. Tab. 3 summarizes the performance of the implementation on FF level as well as on EC level. The analysis of the $k \cdot P$ algorithm identifies the field multiplication as the most time consuming operation, which amounts to about 85% of the overall cycle count.
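The fixed operation pattern of such a point multiplication can be sketched with a generic ladder in Python. This is not the 2P Algorithm of [15], whose field-level formulas are not reproduced here; it only illustrates a loop that performs exactly one addition and one doubling per key bit. The group operations are passed in as functions, all names are hypothetical, and the Python `if` is used for readability only, so the sketch itself is not hardened against side-channel attacks.

```python
def ladder_multiply(k, point, n, add, double, zero):
    """Compute k * point using one add and one double for each of the n key bits."""
    r0, r1 = zero, point  # invariant: r1 equals r0 + point
    for i in range(n - 1, -1, -1):
        if (k >> i) & 1:
            r0, r1 = add(r0, r1), double(r1)
        else:
            r0, r1 = double(r0), add(r0, r1)
    return r0


# stand-in group for illustration: the integers under addition, so k * P is a product
result = ladder_multiply(11, 7, 113, lambda p, q: p + q, lambda p: 2 * p, 0)
```

With the integers as a stand-in group the call above yields $11 \cdot 7 = 77$. Replacing `add`, `double` and `zero` by elliptic curve operations gives the same control flow with $n$ additions and $n$ doublings.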
Hardware Acceleration
The subsequently detailed FPGA designs have been implemented by using the design tools which are packaged with the utilized demonstration board.

Acceleration based on CKM. The pure SW implementation can be accelerated by utilizing a CKM as presented in Sec. 3.1, which is implemented in the FPGA part of the AT94K40 device. Matching the particular bit width $m$ of the raw CKM, two $m$-bit input registers and a $2m$-bit output register are added on the HW side. In order to allow a reasonable communication over the fixed 8-bit interface, the input registers are designed as 8-bit shift-in and parallel-out registers. Accordingly, the output register is parallel-in and 8-bit shift-out. Tab. 4 summarizes the performance of the combined HW/SW implementation based on a 32-bit CKM. The 32-bit CKM takes about 53% of the FPGA resources. At the FF level this results in a speed-up of about 3 and for the $k \cdot P$ algorithm there is still a speed-up factor of about 2.2 compared to the values given in Tab. 3.
The CKM architecture is of special interest for HW platforms offering only a small amount of FPGA resources, such as the FPSLIC (see Sec. 4). This device is still sufficient for the implementation of an 8-bit CKM, which results in 3384 cycles for one 128-bit field multiplication. This is still a speed-up compared to the pure SW implementation.
Acceleration based on FFCP. Utilizing the FFCP architecture detailed in Sec. 3.2 instead of the stand-alone CKM design allows for a further significant performance gain. For the implementation presented here, the particular design parameters are fixed to 113-bit operand width, 24-bit CKM and 5-segment Karatsuba multiplication ($MSK_5$).
This results in an FPGA utilization of 96% for the entire FFCP design. Due to the fact that the result of each operation is fed back into one of the operand registers, the cycle count of a particular operation (I/O overhead plus actual computation) varies with the data dependencies. The corresponding best- and worst-case value for each FF operation is given in Tab. 5.
Tab. 5 reveals that the major part of the cycles is necessary to transfer 113-bit operands over the fixed 8-bit interface between AVR and FPGA. These transfers can be avoided almost completely with an additional register file on the FFCP and an extended version of the finite state machine, which interprets commands given by the software running on the AVR. Assuming a 2-byte command format (4 bit opcode, 12 bit to specify the destination and the source registers) results in cycle counts according to the right column of Tab. 5. With respect to the FPSLIC architecture and its special FreeRAM feature, such a register file can be implemented without demand on additional logic cells. The extended version of the FFCP is currently under development at our site.
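One possible layout of such a 2-byte command can be sketched in Python. Only the 4-bit opcode and the 12 bits for register selection are taken from the description above; the split of the 12 bits into three 4-bit fields (destination and two sources), the field order, the opcode values and all names are assumptions made for illustration.

```python
# hypothetical opcode values, not taken from the actual design
OPCODES = {"MUL": 0x1, "ADD": 0x2, "SQR": 0x3}


def encode_command(opcode: int, dest: int, src_a: int, src_b: int) -> bytes:
    """Pack a 4-bit opcode and three assumed 4-bit register indices into 2 bytes."""
    for field in (opcode, dest, src_a, src_b):
        if not 0 <= field <= 0xF:
            raise ValueError("every field must fit into 4 bits")
    word = (opcode << 12) | (dest << 8) | (src_a << 4) | src_b
    return bytes([word >> 8, word & 0xFF])


def decode_command(data: bytes):
    """Unpack a 2-byte command into (opcode, dest, src_a, src_b)."""
    word = (data[0] << 8) | data[1]
    return (word >> 12) & 0xF, (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF


# r2 := r0 * r1 in the assumed encoding
command = encode_command(OPCODES["MUL"], dest=2, src_a=0, src_b=1)
```

With such a format, two byte transfers over the 8-bit interface start one field operation on operands that already reside in the register file.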
Performance Comparison
There are several FPGA based hardware implementations of EC point multiplication documented in the literature [12] [17] [18] [19]. The performance values of these state-of-the-art implementations are given in Tab. 6. Additionally, Tab. 6 includes the particular figures of the previously described FPSLIC based implementations.
A performance comparison of hardware implementations against each other is in general not straightforward. This is mostly because of different key sizes and due to the fact that different FPGA technologies are used for their implementation.
A basically scalable HW architecture is common to all implementations referenced in Tab. 6. In contrast to our SoC approach, the implementations in [12] [17] [18] and [19] are mainly focusing on high-security, server based applications. Their functionality is entirely implemented within a relatively large FPGA and no countermeasures against side-channel attacks are documented.
In [12] and [17] the underlying field representation is an optimal normal basis. Both implementations are based on FPGAs from Xilinx, Inc. Furthermore, VHDL module generators are used in both cases to derive the particular HW descriptions. The approach in [17] allows for a parameterization of the key size only. Parallelization, which is essential in order to achieve maximum performance from a specific FPGA, is additionally supported by the design in [12] . For the implementation in [17] a XCV300 FPGA with a complexity of about 320K system gates is used. The design in [12] is based on a XC4085XLA device with approx. 180K system gates. The implementations in [18] and [19] are both designed for polynomial bases and the field multiplications are in principle composed of partial multiplications.
The design in [18] is based on an Altera Flex10k family device with a complexity of about 310K system gates. The architecture is centered around a $w_1$-bit $\times$ $w_2$-bit partial multiplier. Due to the flexibility in $w_1$ and $w_2$ it is shown that the architecture scales well, even for smaller FPGA platforms. The best performing implementation, representing the current benchmark with respect to $k \cdot P$ performance, is described in [19]. It is highly optimized, exploiting both pipelining and concurrency. The field multiplication is performed with a digit-serial multiplier. A Xilinx XCV400E FPGA with a complexity of about 570K system gates, running at 76.7 MHz, is used for the implementation. Compared to our design this signifies a factor of more than 10 in space and a factor of about 6 in speed.
Conclusion
Speeding up the most time critical part of EC crypto schemes enables the use of these methods within combined HW/SW systems with relatively low computing power. Running the EC level algorithms in SW facilitates algorithmic flexibility while the required performance is contributed by dedicated coprocessors. Two generic and scalable architectures of FF coprocessors (CKM and FFCP), which are suitable for SoC implementations, have been illustrated in this paper. While the CKM supports only multiplication, the FFCP architecture implements multiplication, addition and squaring completely within HW. The proposed multi-segment Karatsuba multiplication scheme, which is the core of the FFCP architecture, permits fast and resource-saving HW implementations. By exploiting the presented coprocessor architectures a considerable speed-up of EC cryptosystems can be achieved.
