Recently, Lenstra and Verheul proposed an efficient cryptosystem called XTR. This system represents elements of $\mathbb{F}_{p^6}^*$ with order dividing $p^2 - p + 1$ by their trace over $\mathbb{F}_{p^2}$. Compared with the usual representation, this one achieves a ratio of three between security size and manipulated data. Consequently, very promising performance compared with RSA and ECC is expected. In this paper, we deal with the hardware implementation of XTR, and more precisely with Field Programmable Gate Arrays (FPGAs). The intrinsic parallelism of such a device is combined with efficient modular multiplication algorithms to obtain effective implementations of XTR with respect to time and area. We also compare our implementations with hardware implementations of RSA and ECC. This shows that XTR achieves a very high level of speed with small area requirements: an XTR exponentiation is carried out in less than 0.21 ms at a frequency beyond 150 MHz.
Introduction and Basics
Nowadays more and more applications need security components. However, these requirements should not interfere with performance, otherwise security would be disregarded. Ideally, security should not penalize the application at all. Two ways are possible to achieve this: design efficient primitive algorithms and/or find fast and optimized implementations of existing algorithms.
Supported by the FRIA Belgium fund.
XTR, first presented in [12], has been designed as a classical discrete logarithm (crypto)system, see also [11]. However, elements are represented in a special form that allows efficient computation and small communications. This system also has the advantage of very efficient parameter generation. As shown in [26], the performance of XTR is competitive with RSA in software implementations, see also [7] for a performance comparison of XTR and an alternative compression method proposed in [22]. Two main kinds of implementation have to be distinguished: software and hardware. The latter generally allows a very high level of performance since "dedicated" circuits are developed. Moreover it also provides designers with a large array of implementation strategies. This is particularly true for the size of the multiplier, possible parallel processing, pipelining stages, and algorithm strategies. In this paper, we propose an efficient hardware implementation of this primitive that can be used for asymmetric digital signature, key exchange and asymmetric encryption. To our knowledge this is the first hardware implementation of XTR.
In 1994, Smith and Skinner introduced the LUC public key cryptosystem [24] based on Lucas functions. This is an analog to discrete logarithm over $\mathbb{F}_{p^2}^*$ with elements of order $p + 1$ represented by their trace over $\mathbb{F}_p$. More recently, Gong and Harn [6] used a similar idea with elements in $\mathbb{F}_{p^3}^*$ of order $p^2 + p + 1$. Finally, Lenstra and Verheul proposed XTR in [12], which represents elements of $\mathbb{F}_{p^6}^*$ with order (dividing) $p^2 - p + 1$ by their trace over $\mathbb{F}_{p^2}$. These representations induce security over the fields $\mathbb{F}_{p^i}^*$, with $i = 2, 3, 6$ for the LUC, Gong-Harn and XTR cryptosystems respectively, whereas the numbers manipulated are over $\mathbb{F}_{p^2}$ for XTR or $\mathbb{F}_p$ for the others. XTR is the most efficient of the three since it allows a reduction factor of 3 between the size of security and the size of manipulated numbers.
The parameter $p$ is chosen as a prime number. Another condition for security requirements is that there exists a sufficiently large prime number $q$ that divides $p^2 - p + 1$. Typically, $p$ is chosen as a 170-bit integer whereas $q$ is a 160-bit integer. With these parameters, XTR security is considered as "equivalent" to RSA security with a 1024-bit modulus or to an elliptic curve cryptosystem (ECC) based on a 160-bit field. The parameter $p$ is also chosen to be congruent to 2 modulo 3. In this case, $\mathbb{F}_{p^2}$ is isomorphic to $\mathbb{F}_p[X]/(X^2 + X + 1)$.
If $\alpha$ denotes a root of $X^2 + X + 1$, then $(\alpha, \alpha^2)$ is a normal basis of $\mathbb{F}_{p^2}$ over $\mathbb{F}_p$. Thus any element of $\mathbb{F}_{p^2}$ can be represented as $(x_1, x_2)$ with $x_1, x_2 \in \mathbb{F}_p$, standing for $x_1\alpha + x_2\alpha^2$. XTR operations are performed over $\mathbb{F}_{p^2}$. This is achieved by representing elements of the subgroup of $\mathbb{F}_{p^6}^*$ of order $q$ (dividing $p^2 - p + 1$), generated by $g$, by their trace over $\mathbb{F}_{p^2}$. The trace over $\mathbb{F}_{p^2}$ of an element is just the sum of its conjugates. Let $a$ be an element of $\langle g \rangle$, then $\mathrm{Tr}(a) := \mathrm{Tr}_{\mathbb{F}_{p^6}/\mathbb{F}_{p^2}}(a) = a + a^{p^2} + a^{p^4}$ and $\mathrm{Tr}(a) \in \mathbb{F}_{p^2}$. Let $x$, $y$ and $z$ be three elements of $\mathbb{F}_{p^2}$ represented respectively by $(x_1, x_2)$, $(y_1, y_2)$ and $(z_1, z_2)$; then it is shown in [12, Lem. 2.1.1] that
1. $x^p$ is represented by $(x_2, x_1)$, so computing $x^p$ from $x$ only requires permuting the elements representing $x$,
2. $x^2$ is represented by $(x_2(x_2 - 2x_1),\ x_1(x_1 - 2x_2))$, so computing $x^2$ is done with two multiplications in $\mathbb{F}_p$,
3. $x \cdot y$ is represented by $(x_2y_2 - x_1y_2 - x_2y_1,\ x_1y_1 - x_1y_2 - x_2y_1)$ or by $(x_1y_1 + 2x_2y_2 - (x_1 + x_2)(y_1 + y_2),\ 2x_1y_1 + x_2y_2 - (x_1 + x_2)(y_1 + y_2))$, so the product of two $\mathbb{F}_{p^2}$-elements is obtained through three multiplications in $\mathbb{F}_p$,
4. $x \cdot z - y \cdot z^p$ is represented by $(z_1(y_1 - x_2 - y_2) + z_2(x_2 - x_1 + y_2),\ z_1(x_1 - x_2 + y_1) + z_2(y_2 - x_1 - y_1))$, so this special operation on $\mathbb{F}_{p^2}$-elements is obtained through four multiplications in $\mathbb{F}_p$.
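The arithmetic rules of [12, Lem. 2.1.1] can be checked with a short software sketch. The Python below is only an illustration, not the hardware datapath: function names are hypothetical, elements are stored as tuples $(x_1, x_2)$ standing for $x_1\alpha + x_2\alpha^2$, and the four-multiplication operation is taken to be $x z - y z^p$, obtained here by combining the rules for $x^p$ and for the product.

```python
# Sketch of F_{p^2} arithmetic in the normal basis (alpha, alpha^2),
# with alpha^2 + alpha + 1 = 0 and p = 2 (mod 3).
# Function names are illustrative only; inputs are assumed reduced mod p.

def fp2_frobenius(x):
    """x^p: swap the two coordinates, no multiplication needed."""
    return (x[1], x[0])

def fp2_square(x, p):
    """x^2 using two multiplications in F_p."""
    x1, x2 = x
    return (x2 * (x2 - 2 * x1) % p, x1 * (x1 - 2 * x2) % p)

def fp2_mul(x, y, p):
    """x*y using three multiplications in F_p (m1, m2, m3)."""
    x1, x2 = x
    y1, y2 = y
    m1 = x1 * y1
    m2 = x2 * y2
    m3 = (x1 + x2) * (y1 + y2)
    return ((m1 + 2 * m2 - m3) % p, (2 * m1 + m2 - m3) % p)

def fp2_mul_reference(x, y, p):
    """Same product written with the direct (four-product) formula."""
    x1, x2 = x
    y1, y2 = y
    return ((x2 * y2 - x1 * y2 - x2 * y1) % p,
            (x1 * y1 - x1 * y2 - x2 * y1) % p)

def fp2_xz_minus_yzp(x, y, z, p):
    """x*z - y*z^p using four multiplications in F_p."""
    x1, x2 = x
    y1, y2 = y
    z1, z2 = z
    return ((z1 * (y1 - x2 - y2) + z2 * (x2 - x1 + y2)) % p,
            (z1 * (x1 - x2 + y1) + z2 * (y2 - x1 - y1)) % p)
```

With a toy prime such as $p = 11$ (which satisfies $p \equiv 2 \bmod 3$), raising an element to the power $p$ by repeated multiplication gives the same result as swapping its coordinates, which is a convenient sanity check of the basis choice.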
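In the remainder of this paper, we follow the notation used in [12, 13, 14, 15, 26]. We denote $\mathrm{Tr}(g)$ by $c$ and, for any integer $k$, $\mathrm{Tr}(g^k)$ by $c_k$. The basic operation with XTR is the analog to exponentiation, i.e. from an integer $k$ and a subgroup element $g$ of $\mathbb{F}_{p^6}^*$, computing $\mathrm{Tr}(g^k)$. This is performed in an efficient way by using formulae from [13, Cor. 2.3.3] quoted below:
\[
\begin{aligned}
c_{2k} &= c_k^2 - 2c_k^p,\\
c_{k+2} &= c \cdot c_{k+1} - c^p \cdot c_k + c_{k-1},\\
c_{2k-1} &= c_{k-1} \cdot c_k - c^p \cdot c_k^p + c_{k+1}^p,\\
c_{2k+1} &= c_{k+1} \cdot c_k - c \cdot c_k^p + c_{k-1}^p.
\end{aligned}
\]
Writing $S_k(c) = (c_{k-1}, c_k, c_{k+1})$, an XTR exponentiation is carried out with the previous formulae using Algorithm 1.1 from [13].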
We can first remark that computing $S_{2k}(c)$ or $S_{2k+1}(c)$ is done in exactly the same manner. More importantly, the components of the triplets representing $S_{2k}(c)$ and $S_{2k+1}(c)$ can be calculated independently. This is one of the very useful characteristics of XTR that allows us to reach a very high speed in our hardware implementation.

Algorithm 1.1 (XTR exponentiation, from [13])
if n < 0 then apply this algorithm to −n and c, then use negative result.
[...]
if n = 1 then $S_1(c) = (3, c, c^2 - 2c^p)$.
if n = 2 then use formulae from App. A and $S_1(c)$ to compute $c_3$.
[...] $= \sum_{j=0}^{r} m_j 2^j$ with $m_j \in \{0, 1\}$ and $m_r = 1$.
[...]

This paper is organized as follows. The next section deals with modular product evaluation. A new algorithm, of independent interest, using a look-up table is presented together with an algorithm proposed by Koç and Hung [9]. Based on these two algorithms, Section 3 presents the main results of this paper: implementation choices and the performance obtained when computing an XTR exponentiation. We also compare hardware implementations of XTR with those of other cryptosystems such as RSA and ECC. Finally, we conclude in Section 4.
Algorithms: Implementation Options
As already shown in Section 1.1, XTR exponentiation is done with a very uniform set of operations. Contrary to classical exponentiation, where a 'square-and-multiply' algorithm is used, the only changes at each loop of XTR are the inputs. According to the current bit of the binary expansion of the exponent, $S_{2k}(c)$ or $S_{2k+1}(c)$ is computed from $S_k(c)$. Details of the operations performed over $\mathbb{F}_p$ are given in Appendix A.
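To make the uniformity of the loop concrete, here is a minimal software sketch of a trace ladder. It is a simplified variant, not Algorithm 1.1 itself: it keeps the triple $(c_{k-1}, c_k, c_{k+1})$ and moves to index $2k$ or $2k+1$ according to each exponent bit, using the standard XTR identities $c_{2k} = c_k^2 - 2c_k^p$, $c_{2k\pm1} = c_{k\pm1} c_k - c^{p^{(1\mp1)/2}} c_k^p + c_{k\mp1}^p$ and the recurrence $c_{k+2} = c\,c_{k+1} - c^p c_k + c_{k-1}$ for the slow reference. All function names are hypothetical, and elements of $\mathbb{F}_{p^2}$ are tuples in the basis $(\alpha, \alpha^2)$.

```python
# Simplified XTR trace ladder (illustrative sketch, not the paper's Algorithm 1.1).
# F_{p^2} elements are tuples (x1, x2) meaning x1*alpha + x2*alpha^2, p = 2 (mod 3).

def conj(x):
    """x^p (coordinate swap)."""
    return (x[1], x[0])

def add(x, y, p):
    return ((x[0] + y[0]) % p, (x[1] + y[1]) % p)

def sub(x, y, p):
    return ((x[0] - y[0]) % p, (x[1] - y[1]) % p)

def mul(x, y, p):
    m1, m2 = x[0] * y[0], x[1] * y[1]
    m3 = (x[0] + x[1]) * (y[0] + y[1])
    return ((m1 + 2 * m2 - m3) % p, (2 * m1 + m2 - m3) % p)

def c_double(ck, p):
    """c_{2k} = c_k^2 - 2*c_k^p."""
    ckp = conj(ck)
    return sub(mul(ck, ck, p), add(ckp, ckp, p), p)

def c_mixed(x, y, z, w, p):
    """x*z - y*z^p + w^p, the shape shared by c_{2k-1} and c_{2k+1}."""
    return add(sub(mul(x, z, p), mul(y, conj(z), p), p), conj(w), p)

def xtr_exp(n, c, p):
    """Return c_n for n >= 1, starting from S_1(c) = (c_0, c_1, c_2)."""
    three = ((-3) % p, (-3) % p)      # the integer 3 in the basis (alpha, alpha^2)
    lo, mid, hi = three, c, c_double(c, p)
    for bit in bin(n)[3:]:            # exponent bits after the leading 1
        c2k = c_double(mid, p)
        c2k1 = c_mixed(hi, c, mid, lo, p)            # c_{2k+1}
        if bit == "0":                # move to S_{2k}
            lo, mid, hi = c_mixed(lo, conj(c), mid, hi, p), c2k, c2k1
        else:                         # move to S_{2k+1}
            lo, mid, hi = c2k, c2k1, c_double(hi, p)
    return mid

def xtr_exp_slow(n, c, p):
    """Reference: apply c_{k+2} = c*c_{k+1} - c^p*c_k + c_{k-1} n-1 times."""
    three = ((-3) % p, (-3) % p)
    a, b, d = conj(c), three, c       # c_{-1}, c_0, c_1
    for _ in range(n - 1):
        a, b, d = b, d, add(sub(mul(c, d, p), mul(conj(c), b, p), p), a, p)
    return d
```

Whatever the bit value, each iteration evaluates the same kinds of expressions on different inputs, and the three new components do not depend on one another, which is the property exploited by the parallel hardware design.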
Costly operations are products of elements. This can be done using the Koç and Hung algorithm from [9] . An alternative is simply to use a look-up table.
Modular multiplication in hardware
Let $A$ and $B$ be two integers. The product of $A$ and $B$ cannot be computed in one single step without a big loss in timing performance and in consumed hardware resources (area). Thus this product is usually obtained by iteratively accumulating partial products $a_i B$, where $a_i$ denotes the $i$-th bit of $A$. This type of multiplier is also called a scaling accumulator or shift-and-add multiplier. One of its advantages is that a single adder is reused for all the multiplication steps.
Unfortunately, when large numbers have to be manipulated, typically 1024-bit with RSA, the length of the carry chain may become an issue. This is especially true when using reconfigurable hardware, where the length of fast carry chains is limited to the size of columns. An alternative is the use of redundant representations, i.e. carry-save representations. This eliminates the carry propagation delay: the delay of a carry-save adder (CSA) is independent of the length of the operands.
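As a small illustration of why a carry-save adder has a delay independent of the operand length, the following sketch models a 3:2 compressor on Python integers. Each output bit depends only on the three input bits of the same position; no carry travels along the word. The function name is hypothetical.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: returns (s, c) such that s + c == x + y + z.

    s is the bitwise sum without carries, c collects the carries shifted
    one position to the left. Every bit is computed locally.
    """
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c
```

The pair `(s, c)` is the redundant representation of the sum; a conventional carry-propagating addition `s + c` is only needed once, when the final result has to be output.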
Many different algorithms to compute modular multiplication using the shift-and-add technique exist in the literature [2, 4, 17, 21, 23]. Most of them suggest interleaving the reduction step with the accumulating one in order to save hardware resources and computation time. The usual principle is to compute or estimate the quotient $Q = \lfloor U/p \rfloor$ of the intermediate result $U$ and then subtract the required amount $Q \cdot p$ from it.
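The interleaving principle can be sketched in a few lines of software. This is a generic textbook version with an exact quotient, given only to fix ideas; it is not one of the cited hardware algorithms, and the function name is hypothetical. It assumes $0 \le A, B < p$.

```python
def modmul_interleaved(a, b, p):
    """Left-to-right shift-and-add multiplication with interleaved reduction.

    Invariant: 0 <= r < p at the start of each iteration, so after doubling
    and adding the partial product we have r < 3p and the quotient is <= 2.
    """
    r = 0
    for i in range(p.bit_length() - 1, -1, -1):
        a_i = (a >> i) & 1          # current bit of the multiplier
        r = 2 * r + a_i * b         # shift and accumulate the partial product
        q = r // p                  # exact quotient (0, 1 or 2)
        r -= q * p                  # interleaved reduction
    return r
```

Hardware algorithms replace the exact quotient by a cheap estimate computed from a few most significant bits, which is precisely where the two algorithms discussed in this paper differ.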
Modular multiplication using look-up table
As aforementioned, redundant representations can lead to very good timing performances. Moreover, to obtain a light hardware, we have chosen to base the multiplication on a scaling accumulator. In order to prevent the growth in length of the temporary value of the product, the addition steps are interleaved with the reduction ones.
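Let $p$ be a prime of $l$ bits, such that $2^{l-1} < p < 2^l$. Let $A = \sum_{i=0}^{l-1} a_i 2^i$ and $B$ be two integers, $0 \le A, B < p$. Then, the modular multiplication of $A$ and $B$ can simply be written as
\[
A \cdot B \bmod p = \Big(\sum_{i=0}^{l-1} a_i 2^i B\Big) \bmod p = \big((\cdots((a_{l-1}B)\,2 + a_{l-2}B)\,2 + \cdots)\,2 + a_0 B\big) \bmod p.
\]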
This suggests the successive reduction of the temporary value in the case of 'left-to-right' multiplication. Our fairly simple idea is based on the following observation: reduction can be carried out using a look-up table.
If $S$ and $C$ denote the redundant representation, the three most significant bits (MSBs) of $S$ and of $C$ are extracted and added together. The corresponding reduced number is then chosen among the precalculated values. All the $2^{3+1} - 1 = 15$ possible cases are stored in memory. The reduced number is then added to the two MSB-free values, pre-multiplied by 2 before being re-used in the multiplication loop. The next partial product $a_i B$ is also added, providing a new pair $S$ and $C$ of redundant representation.
The operation is repeated until all bits of $A$ have been covered. Finally, the values are processed one last time, but without a new partial product input. This extra step guarantees that the sum of the redundant vectors is lower than $2p$. After step $-1$, the result then requires at most one final reduction. This can be simply proven by the observation that after step 0 the truncated vectors satisfy $S, C < 2^{l-2}$. After the shift and the addition with the feedback of the residues, $S + C < 2^l + 2p$. Since $2^l < 2p$, the following relation holds: $S + C < 4p$. Finally, dividing the result by 2 gives $R < 2p$. Algorithm 2.1 gives a detailed description.
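The description above can be modelled in software as follows. This is a sketch reconstructed from the prose only, under stated assumptions: the table is taken to hold $t \cdot 2^{l-2} \bmod p$ for each MSB sum $t \in \{0, \dots, 14\}$, and both the truncated vectors and the table output are doubled before the partial product is added, which matches the bound $S + C < 2^l + 2p$. Function and variable names are hypothetical.

```python
def csa(x, y, z):
    """Carry-save adder: (s, c) with s + c == x + y + z."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def modmul_lut(a, b, p):
    """Look-up table based modular multiplication, 0 <= a, b < p."""
    l = p.bit_length()
    low_mask = (1 << (l - 2)) - 1
    # Assumed table contents: residue of each possible MSB sum (15 entries).
    table = [(t << (l - 2)) % p for t in range(15)]
    s = c = 0
    # One bit of a per step, MSB first, plus a last step with no partial product.
    bits = [(a >> i) & 1 for i in range(l - 1, -1, -1)] + [0]
    for a_i in bits:
        # Add the 3 MSBs of S and C (both stay below 2^(l+1)) to address the table.
        t = (s >> (l - 2)) + (c >> (l - 2))
        # First CSA: doubled truncated vectors and doubled reduced value.
        s1, c1 = csa(2 * (s & low_mask), 2 * (c & low_mask), 2 * table[t])
        # Second CSA: insert the partial product a_i * B.
        s, c = csa(s1, c1, a_i * b)
    r = (s + c) >> 1        # final carry-propagate addition, undo the last doubling
    if r >= p:              # at most one final subtraction since r < 2p
        r -= p
    return r
```

The only carry-propagating operations are the 3-bit MSB addition inside the loop and the final addition and subtraction, which in hardware correspond to the small address adder and the two RCAs.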
Modular multiplication with sign estimation technique
Another, more advanced, type of algorithm was proposed by Koç and Hung in [9]. Once again, it interleaves the reduction step with the addition of the partial product, and the intermediate result is stored in redundant representation.
This algorithm is based on the following clever idea: the sign of the number represented by the carry-sum pair can be estimated and used to add or subtract a multiple of the modulus in order to keep the intermediate result within two bounds. This is done by the function $ES(S, C)$. The sign estimation requires adding the 5 MSBs of the two vectors $S$ and $C$.
The skeleton is given in Algorithm 2.2 and we refer the interested reader to [9] for further details.
Implementation Results

Methodology
After having introduced a new algorithm for modular multiplication using look-up table in the intermediate reductions and having recalled the Koç and Hung algorithm, let us now consider the subject of this paper: XTR implementation. In this section, the global approach of the design is discussed and two architectures are presented. Implementation results and performances are given as well. Particular considerations about scalability and portability conclude the section.
One of our purposes for implementing XTR architectures on reconfigurable hardware is to achieve a well-balanced trade-off between hardware size and frequency. Nevertheless, particular care has been taken to keep the architectures open to all kinds of FPGAs. This is the reason why some available features have not been used, e.g. internal multipliers. This way, designs can be directly synthesized, whatever the device target.
Even if our architecture is more general than an FPGA-oriented implementation, we decided to adopt the classical design methodology described in [25]. The authors introduced the concept of hardware efficiency, which can be represented as the ratio Nbr. of registers / Nbr. of LUTs. To achieve a high level of sub-pipelining, this ratio must be as close as possible to one. This was presented with a view to designing efficient implementations of symmetric ciphers but remains partially true for general designs; at least it suggests a method. While implementing our design, we tried to use this concept to reach a high clock frequency. Implementation results appear in Section 3.
As aforementioned, the 'parallel characteristic' of XTR is obvious. Indeed, each component of $S_n(c)$, with $n = 4k$ or $4k + 1$, can be computed independently. By contrast, if we consider elliptic curve point addition or doubling, many dependencies exist during computation, see [1]. This issue is removed using the Montgomery ladder principle, see [8] for an overview. Moreover each element of $\mathbb{F}_{p^2}^*$ is represented as a couple. Each component of the couple is evaluated at the same time and independently: computations for the $\alpha$ and $\alpha^2$ components are similar and can thus be executed separately. This means that $S_n(c)$ is represented by 6 components that can be evaluated independently. A closer look shows that the computation of $c_{4k+1}$ (and/or $c_{4k+3}$) is composed of two similar parts, with a final addition. Hence it is possible to process one step of the encryption at once, in parallel, with eight independent processes. Furthermore, the operations are quite similar. A generic cell can easily be derived to design a generic process unit able to perform the encryption in a sequential mode, at a lighter hardware cost. This also underlines the flexibility of design allowed by XTR.
Parallel designs are presented below. The general layout of both architectures is as follows. A 160-bit shift register containing $n$ produces the MSB $m$ on each iteration of Algorithm 1.1. Depending on this bit, different multiplexors forward the data to the inputs of the corresponding processing units. Each of them processes its data and returns the results to the multiplexors for the next iteration.
The core of the process unit is the modular multiplier. It is preceded by some logic dealing with the preliminary additions and subtractions. Its result is stored in a shift-register.
Architectures of a process unit
The internal structures of the Koç and Hung algorithm and ours are displayed in Fig. 1. Our look-up table based algorithm is centered around two CSAs taking as input the partial product $a_i B$, the $(l - 2)$-bit truncated result vectors and the reduced values based on the 3 most significant bits. The originality of this method lies in the modular reduction technique. Recall Algorithm 2.1: the most significant bits are extracted and added together in order to keep the intermediate values within fixed bounds. Given the initial values ($S_l = C_l = 0$), the bounds are $0 \le S_i, C_i < 2^{l+1}$. The 15 possible reduced values, one for each sum of the two 3-bit MSB groups, are precalculated according to $p$ and stored in the memory (denoted M in the figure). Both 3-bit MSB groups are added together in order to produce a 4-bit address. The memory can thus be mapped onto $l$ LUTs. At each iteration of the multiplication, a new partial product is inserted and the feedback values must therefore be doubled. As previously explained, one final iteration without inserting a new partial product ensures that the final result is below $2p$. After addition by a ripple carry adder (RCA), there may thus be an extra $p$ left over. It is easily handled by the use of another RCA and a multiplexor, as suggested in [20]. The RCAs use the fast carry chain available on every FPGA. Nevertheless the carry chain of a 170-bit RCA would lengthen the critical path. The RCAs are therefore composed of smaller pipelined RCAs.
The implementation structure of the Koç and Hung algorithm is very similar to ours. Most of the design choices were identical for both algorithms. The main difference lies in the number of bits taken to evaluate the estimation function (i.e. 2 × 5 for the Koç and Hung algorithm and 2 × 3 for ours). Moreover the Koç and Hung algorithm keeps the whole value intact (no truncation is applied after the registers S and C), which requires a bus width of $l + 4$ bits.
Discussion about algorithms performances
Efficiency of two implementations is always difficult to compare. The same algorithm could lead to very different performances depending on the type of device used (ASIC or FPGA) and on the technology (0.12µm CMOS for Virtex II), on the cleverness of designers (smart trade-offs between area and latency) and finally on the options chosen for place-and-route (PAR).
The algorithms described above share many similarities. Both require two CSAs of size $O(l)$, a final reduction module and an estimation function feeding a look-up table. Both shift their fed-back $S$ and $C$ pair at each iteration. The Koç-Hung algorithm requires less memory than ours. However, the estimation function of Koç-Hung takes 2 × 5 inputs, whereas ours takes 2 × 3 inputs.
A Field Programmable Gate Array is a tool situated between hardware and software. With the increase of powerful internal features it becomes very competitive compared to ASIC. We used FPGA to implement our design. This gives an advantage to our algorithm in terms of latency (critical path) with small area increase.
Most FPGA devices use 4-input look-up tables (LUTs) to implement function generators. Four independent inputs are provided to each of the 2 function generators in a slice. Thus any boolean function of four inputs can be mapped onto a LUT, and the propagation delay is independent of the function implemented. Moreover, each Virtex slice contains two multiplexers (MUXF5 and MUXF6). This allows the combination of slices to obtain higher-level logic functions (any function of 5 or 6 inputs).
From these considerations, we can compare the delays of the two estimation functions. In our algorithm, the estimation function can be mapped as a 6-input boolean function with a propagation delay of 1 LUT. In the case of the Koç-Hung algorithm a 10-input function must be implemented, so 2 stages of LUTs are needed. This implies a latency of 2 LUTs. This gives our algorithm an advantage for an FPGA implementation, but the two algorithms have very similar performance and it is difficult to evaluate the performance of the Koç and Hung algorithm using another technology. Table 1 gives the synthesis results of our implementation. We can see that our algorithm achieves a higher frequency, as expected. In [3], the complexity of the implementation of a binary multiplication is formally defined. This definition includes many parameters such as the technology used, the area and time required and the length of the operands. Accordingly, we adopt an Area-Time complexity, i.e. the product AT, as the element of comparison for the algorithms we implemented.
Performances
As far as we know, this paper is the first dealing with an XTR cryptosystem implementation on reconfigurable hardware. Even if such a comparison is not fully satisfactory, we decided to compare it with the best existing implementations (to our knowledge) of the RSA algorithm [5, 16] and of elliptic curve processors [18, 19]. Table 2 indicates that our implementation is definitely competitive with respect to other designs for equivalent security. Note that no assumption on the form of $p$ has been made: this freedom brings an enormous flexibility in the use of our designs.
Our designs were synthesized on a Virtex2 XC2V6000-6-FF152, which contains 33,792 Slices, 144 Select RAM Blocks and 144 18-bit x 18-bit Multipliers. The synthesis was performed with Synplify Pro 7.3 (SYNPLICITY) and automatic place-and-route (PAR) was carried out with XILINX ISE 6.1i. Moreover, concerning the timing performance, we decided to pack the input/output registers of our implementation into the input/output blocks (IOB) in order to try and reach the achievable performance.
Conclusion
In this article, the first implementation of the XTR (crypto)system on reconfigurable hardware (FPGA) is presented. Various implementations are discussed. Evaluation of modular products is the costly part. This can be carried out using the clever algorithm from Koç and Hung. We also propose a competitive alternative based on a look-up table. The performances of these two algorithms appear to be similar.
The main subject of this paper is XTR implementation. The intrinsic parallelism of XTR allows us to obtain a very high level of performance with very small memory requirements. Compared with RSA exponentiation, XTR appears as a very interesting alternative in hardware: an XTR exponentiation is carried out in about 0.21 ms at frequency beyond 150 MHz.
Moreover, the implementations are fully generic and have been designed for any FPGA device without using any particular feature. Portability is thus another characteristic of our designs. Once again there is absolutely no constraint on $p$ (the characteristic of the field over which XTR is defined). The designs handle any $p$ up to 170 bits and it would be straightforward to extend them to larger sizes. Finally, using special forms of $p$ (e.g. Mersenne primes as used for elliptic curves) could lead to considerable improvements, to the detriment of the present generality.
We stress that porting our implementation on ASIC would also underline the very good efficiency of XTR compared with RSA or elliptic curve cryptosystems.
