FPGA Design for Algebraic Tori-Based Public-Key Cryptography
Junfeng Fan, Lejla Batina, Kazuo Sakiyama and Ingrid Verbauwhede
Katholieke Universiteit Leuven, ESAT/SCD-COSIC,
Kasteelpark Arenberg 10
B-3001 Leuven-Heverlee, Belgium
{Junfeng.Fan, Lejla.Batina, Kazuo.Sakiyama, Ingrid.Verbauwhede}@esat.kuleuven.be
Abstract
Algebraic torus-based cryptosystems are an alternative
for Public-Key Cryptography (PKC). They maintain the
security of a larger group while the actual computations are
performed in a subgroup. Compared with RSA at the same
security level, they allow faster exponentiation and require
much less bandwidth for the transmitted data. In this work
we implement a torus-based cryptosystem, the so-called
CEILIDH, on a multicore platform on an FPGA. This
platform consists of a Xilinx MicroBlaze core and a
multicore coprocessor. The platform supports CEILIDH, RSA
and ECC over prime fields. The results show that one
170-bit torus T6 exponentiation requires 20 ms, which is about
5 times faster than a 1024-bit RSA implementation on the same
platform.
1 Introduction
Diffie and Hellman introduced the idea of Public-Key
Cryptography (PKC) [3] in the mid-1970s. Their breakthrough
showed that one can eliminate the need for prior
agreement on a key in order to exchange confidential
data. One important application of public-key services
is digital signatures. The best-known and most commonly
used public-key cryptosystems are based on factoring
(RSA) and on the discrete logarithm problem in a large
prime field (Diffie-Hellman, ElGamal, Schnorr, DSA) [7].
Elliptic Curve Cryptography (ECC), which was proposed
in the mid-1980s by Miller [8] and Koblitz [6], is based on a
different algebraic structure. In the case of ECC, instead of
the integers modulo n another group is used, i.e., the group of
points on an elliptic curve. It is important to point out that
ECC offers security equivalent to RSA for much smaller
key sizes. Other benefits include higher speed, lower power
consumption and smaller certificates, which are especially
useful in constrained environments (smart cards, mobile
phones, PDAs, etc.).
Algebraic torus-based cryptosystems are another alternative
for PKC. Torus-based cryptography uses an
algebraic torus to construct a group on which the discrete
logarithm problem is defined. This idea was first introduced
by Rubin and Silverberg in 2003 [10], who proposed the
name CEILIDH. The idea behind it was to obtain the
security of Fp6, while the data to be transmitted are compressed
by a factor of 3 and the underlying arithmetic is performed
in a subgroup, e.g. Fp. The benefit is shorter transmissions,
which is of interest for embedded applications.
So, torus-based cryptography makes it possible to work
in a subgroup while maintaining the security of a bigger
group. More precisely, Rubin and Silverberg showed that
a compression factor of $n/\varphi(n)$ can be achieved. Here,
$\varphi(n)$ is Euler's totient function, defined as the
number of positive integers less than or equal to n that are
coprime to n. For example, $\varphi(6) = 2$, since only the numbers
1 and 5 are coprime to 6. They introduced a new public-key
cryptosystem, CEILIDH [10], that is based on the torus T6. Hence,
in this case the compression factor is $6/\varphi(6) = 3$. The
advantage of the torus compared with RSA, for example,
lies in the compression factor, which allows one to use keys
three times shorter than those for RSA. Another
advantage is that the underlying arithmetic is performed
in a prime field where the prime is 160-170 bits long, which
is typical for ECC. Thus, tori and ECC can easily be
implemented using the same arithmetic unit.
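The compression factor quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the totient is computed by brute force, and values of n other than 6 are included purely for comparison and are not discussed in this paper.

```python
from math import gcd

def euler_phi(n: int) -> int:
    """Count the integers in 1..n that are coprime to n (brute force)."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

def compression_factor(n: int) -> float:
    """Compression factor n / phi(n) stated for the torus T_n."""
    return n / euler_phi(n)

# n = 6 is the CEILIDH case; n = 2 and n = 3 are shown only for comparison.
for n in (2, 3, 6):
    print(n, euler_phi(n), compression_factor(n))
```

For n = 6 this gives phi(6) = 2 and a factor of 3, matching the text.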
In this paper we consider efficient implementations of
CEILIDH. The work of Granger et al. [5] was the first to
introduce efficient arithmetic on T6. They implemented T6 on a
PC and concluded that CEILIDH is not much slower
than XTR, which is another public-key cryptosystem using the
same idea of keeping the security of Fp6 while transmitting
only two elements of Fp. In this work we propose a
flexible platform architecture which supports CEILIDH, RSA
and ECC over prime fields. A hierarchical design method
is used, and in this manner we show that all three
cryptosystems can be efficiently supported by the same
hardware platform. To our knowledge, this work presents the
first architecture for efficient implementation of CEILIDH,
ECC and RSA together.
978-3-9810801-3-1/DATE08 © 2008 EDAA
The rest of the paper is organized as follows. Section 2
gives a brief introduction to the mathematical background
of the torus. Section 3 describes the multicore platform and our
implementation considerations. Section 4 presents the
implementation results, and Section 5 concludes the paper and
outlines future work.
2 Mathematical Background
The field Fp6 can be viewed as an extension field of
degree 6 over Fp. More precisely, Fp6 = Fp[x]/(f(x)), where
deg(f) = 6. So, the elements of Fp6 are six-tuples of
elements from Fp. Other representations of the field Fp6 are
possible, e.g. using subfields with exponents 2 and 3. In
principle, it is possible to choose a representation in which
field arithmetic can be performed more efficiently and then
map an arbitrary element from the torus T6 to the corresponding
one in any other representation. An algebraic torus T6(Fp)
is defined in such a way that, over Fp6, its elements can be
represented by a pair of elements from Fp. This means that
one can achieve the security of Fp6 while transmitting only
two elements of Fp. The maps between the various
representations, as well as the representations themselves, are
given in [5]. The choice of representation is usually related to
the implementation platform. This is where our considerations
differ from those in [5].
The first representation, denoted F1, is the basic one,
i.e. viewing Fp6 as an extension field of Fp of degree 6, so
F1 = Fp6 = Fp[x]/(f(x)), where deg(f) = 6. We consider
only this representation for our implementation, but for
a complete cryptosystem the mappings between the different
representations also have to be implemented.
2.1 Overall Structure of Operations
In Fig. 1 we give a complete overview of all operations
included in the torus arithmetic. The mappings τ and ρ are
isomorphisms and so are the corresponding inverse mappings
τ−1 and ψ. These mappings are used to change from one
representation to another, e.g. to map an element from T6 to
one from F2, where F2 is a quadratic extension of Fp3,
so F2 = Fp3[y]/(g(y)) and deg(g) = 2. Fig. 1 also shows
the relations among all those operations and indicates
which subfields are used for the representations F1
and F2.
Figure 1. An overview of the T6(Fp) operations (addition,
multiplication and inversion in Fp, Fp3 and Fp6; mappings τ, ρ,
ψ and τ−1 between the representations F1 and F2).
2.2 The Representation F1
Here we focus on the representation F1 as the one in
which we perform the required arithmetic, which includes
exponentiation in this group. Let $F_1 = F_{p^6} = F_p[x]/(f(x))$
with $p \equiv 2 \bmod 9$ (or $p \equiv 5 \bmod 9$), where
$f(x) = \frac{x^9 - 1}{x^3 - 1} = x^6 + x^3 + 1$
is an irreducible polynomial with a root z. Then
$z^6 = -z^3 - 1$ and each element of F1 can be represented
in the basis $\{1, z, z^2, z^3, z^4, z^5\}$. Hence, an arbitrary
element of this group is denoted as
$A(z) = \sum_{i=0}^{5} a_i z^i$.
In order to perform exponentiation in this group we first
describe the basic operations in this field, i.e. addition and
multiplication. We denote multiplications/squarings and
additions/subtractions in Fp by M and A, respectively.
2.2.1 Addition in Fp6
Take two elements from F1, $A(z) = \sum_{i=0}^{5} a_i z^i$ and
$B(z) = \sum_{i=0}^{5} b_i z^i$. Then the sum is defined as
$C(z) = \sum_{i=0}^{5} c_i z^i = \sum_{i=0}^{5} (a_i + b_i) z^i$
and it requires 6 additions in Fp.
2.2.2 Multiplication in Fp6
Multiplication of two polynomials of degree five can be
performed in 18M plus many additions [1]. We explain
this in more detail, as it was also elaborated in [11]. Let
$A(z) = \sum_{i=0}^{5} a_i z^i$ and $B(z) = \sum_{i=0}^{5} b_i z^i$
be two fifth-degree polynomials to be multiplied. Write
$A = A_0 + A_1 z^3$ and $B = B_0 + B_1 z^3$, where $A_i, B_i$
for $i = 0, 1$ are second-degree polynomials. Then
$A \cdot B = A_0 B_0 + (A_0 B_1 + A_1 B_0) z^3 + A_1 B_1 z^6$,
where one can precompute the values
$C_0 = A_0 B_0$, $C_1 = A_1 B_1$ and $C_2 = (A_0 - A_1)(B_0 - B_1)$.
This results in $A \cdot B = C_0 + (C_0 + C_1 - C_2) z^3 + C_1 z^6$.
Let $A_0 = a_0 + a_1 x + a_2 x^2$ and $B_0 = b_0 + b_1 x + b_2 x^2$,
where $a_i, b_i \in F_p$ for $i = 0, 1, 2$; then for $C_0$ we get
$C_0 = a_0 b_0 + (a_0 b_1 + a_1 b_0) x + (a_2 b_0 + a_1 b_1 + a_0 b_2) x^2 + (a_2 b_1 + a_1 b_2) x^3 + a_2 b_2 x^4$.
If the values $c_0 = a_0 b_0$, $c_1 = a_1 b_1$, $c_2 = a_2 b_2$,
$c_3 = (a_0 - a_1)(b_0 - b_1)$, $c_4 = (a_0 - a_2)(b_0 - b_2)$ and
$c_5 = (a_1 - a_2)(b_1 - b_2)$ are precalculated, we finally get
$C_0 = c_0 + (c_0 + c_1 - c_3) x + (c_0 + c_1 + c_2 - c_4) x^2 + (c_1 + c_2 - c_5) x^3 + c_2 x^4$.
It follows from the above that each $C_i$ requires 6M + 11A, so the
total number of multiplications adds up to 18M. The result
$A \cdot B$ still has to be reduced modulo the irreducible
polynomial f, which adds a few more additions to the total
operation count. According to [5], this all adds up to
18M + 60A as the cost of one multiplication in F1 in the basis
$\{z, z^2, z^3, z^4, z^5, z^6\}$.
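The multiplication described above can be sketched in Python as follows. This is a minimal illustration rather than the hardware schedule: it works in the basis {1, z, ..., z^5} with f(z) = z^6 + z^3 + 1, uses the toy prime 101 (an assumption chosen only because 101 mod 9 = 2), and all function names are hypothetical. The schoolbook routine is included purely as a reference for checking the 18-multiplication version.

```python
P_TOY = 101  # toy prime with 101 % 9 == 2; an assumption, the paper uses ~170-bit primes

def mul3_karatsuba(a, b, p):
    """Product of two degree-2 polynomials using 6 multiplications in F_p."""
    c0, c1, c2 = a[0] * b[0] % p, a[1] * b[1] % p, a[2] * b[2] % p
    c3 = (a[0] - a[1]) * (b[0] - b[1]) % p
    c4 = (a[0] - a[2]) * (b[0] - b[2]) % p
    c5 = (a[1] - a[2]) * (b[1] - b[2]) % p
    return [c0, (c0 + c1 - c3) % p, (c0 + c1 + c2 - c4) % p,
            (c1 + c2 - c5) % p, c2]

def reduce_mod_f(c, p):
    """Reduce a polynomial of degree <= 10 modulo f(z) = z^6 + z^3 + 1."""
    c = list(c) + [0] * (11 - len(c))
    for k in range(10, 5, -1):  # apply z^k = -z^(k-3) - z^(k-6), highest k first
        c[k - 3] = (c[k - 3] - c[k]) % p
        c[k - 6] = (c[k - 6] - c[k]) % p
        c[k] = 0
    return c[:6]

def mul_fp6(a, b, p):
    """Multiply in F_p[z]/(f) with 3 x 6 = 18 multiplications in F_p."""
    a0, a1, b0, b1 = a[:3], a[3:], b[:3], b[3:]
    C0 = mul3_karatsuba(a0, b0, p)
    C1 = mul3_karatsuba(a1, b1, p)
    C2 = mul3_karatsuba([u - v for u, v in zip(a0, a1)],
                        [u - v for u, v in zip(b0, b1)], p)
    prod = [0] * 11
    for i in range(5):
        prod[i] += C0[i]                      # C0
        prod[i + 3] += C0[i] + C1[i] - C2[i]  # (C0 + C1 - C2) * z^3
        prod[i + 6] += C1[i]                  # C1 * z^6
    return reduce_mod_f([v % p for v in prod], p)

def mul_schoolbook(a, b, p):
    """Reference: 36-multiplication schoolbook product, then the same reduction."""
    prod = [0] * 11
    for i in range(6):
        for j in range(6):
            prod[i + j] = (prod[i + j] + a[i] * b[j]) % p
    return reduce_mod_f(prod, p)

a = [1, 2, 3, 4, 5, 6]
b = [6, 5, 4, 3, 2, 1]
assert mul_fp6(a, b, P_TOY) == mul_schoolbook(a, b, P_TOY)
```

The sketch does not attempt to reproduce the 60A addition count from [5]; it only shows where the 18 multiplications come from.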
2.3 Operations in Fp
To summarize, we need the following arithmetic operations
in Fp: addition, subtraction, multiplication and
inversion. Exponentiation is performed via repeated
multiplications. We use Montgomery's modular multiplication
algorithm to perform modular multiplications. Montgomery's
algorithm is an efficient way to avoid the time-consuming
trial division in modular multiplications [9].
Alg. 1 shows a high-radix Montgomery algorithm called
FIOS (Finely Integrated Operand Scanning), which is
suitable for a software implementation on a w-bit datapath.
Algorithm 1 Radix-$2^w$ n-bit Montgomery modular multiplication (FIOS) [2].
Input: integers $P = (p_{s-1}, \ldots, p_0)_r$, $X = (x_{s-1}, \ldots, x_0)_r$,
$Y = (y_{s-1}, \ldots, y_0)_r$, where $0 \le X, Y < P$, $r = 2^w$,
$s = \lceil n/w \rceil$, $R = r^s$ with $\gcd(P, r) = 1$ and $p' = -P^{-1} \bmod r$.
Output: $X \cdot Y \cdot R^{-1} \bmod P$
$Z = (z_{s-1}, \ldots, z_0)_r \leftarrow 0$
for $i = 0$ to $s - 1$ do
    $T \leftarrow (z_0 + x_0 \cdot y_i) \cdot p' \bmod r$
    $Z \leftarrow (Z + X \cdot y_i + P \cdot T)/r$
end for
if $Z \ge P$ then
    $Z \leftarrow Z - P$
end if
return Z
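The loop structure of Alg. 1 can be sketched in Python as shown here. This is an illustration under simplifying assumptions: Python's arbitrary-precision integers stand in for the word-level inner loops of FIOS, the function name and the test modulus 2^127 - 1 are hypothetical choices, and `pow(p, -1, r)` requires Python 3.8 or later.

```python
def mont_mul(x: int, y: int, p: int, w: int = 32) -> int:
    """Return x * y * R^-1 mod p with R = 2^(w*s), following the loop of Alg. 1."""
    s = -(-p.bit_length() // w)          # s = ceil(n / w)
    r = 1 << w
    mask = r - 1
    p_prime = (-pow(p, -1, r)) % r       # p' = -P^-1 mod r (p must be odd)
    x0 = x & mask
    z = 0
    for i in range(s):
        y_i = (y >> (w * i)) & mask      # i-th radix-r digit of Y
        t = (((z & mask) + x0 * y_i) * p_prime) & mask
        z = (z + x * y_i + p * t) >> w   # exact division: low word is zero
    if z >= p:
        z -= p
    return z

# Usage sketch with an arbitrary odd modulus (2^127 - 1, not from the paper).
p_demo = (1 << 127) - 1
R_demo = 1 << (32 * 4)                   # s = 4 words of 32 bits
x_demo, y_demo = 123456789123456789, 987654321987654321
assert mont_mul(x_demo, y_demo, p_demo) == (x_demo * y_demo * pow(R_demo, -1, p_demo)) % p_demo
```

Operands are normally converted into the Montgomery domain (multiplied by R mod P) before a chain of such multiplications and converted back at the end; that conversion is omitted here.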
3 Implementation
3.1 Platform Architecture
We implement the torus-based cryptosystem on a multicore
platform. This platform has multiple datapaths and is
completely programmable, so different algorithms can be
implemented efficiently on it. Fig. 2 shows the block diagram of
the platform. It consists of a MicroBlaze processor
and a multicore coprocessor. MicroBlaze is a synthesizable
core offered by Xilinx and is used here as a controller. The
coprocessor is the workhorse of the implementation. The
multiple cores of the coprocessor can be programmed to perform
different computations, such as modular multiplications and
additions with arbitrary operand length. Therefore, different
public-key cryptosystems such as ECC over Fp and
RSA can also easily be implemented on this platform.
Figure 2. Overview of the multi-core system: (a) schematic block
diagram for the platform; (b) schematic block diagram for the core.
As shown in Fig. 2(a), the MicroBlaze processor
communicates with the coprocessor via memory-mapped
registers, i.e., an instruction register (A) and two data-sharing
registers (B and C), and an interrupt signal. The coprocessor
consists of a decoder, data memory (DataRAM),
microinstruction memory (InsRom) and multiple embedded cores.
Fig. 2(b) shows the block diagram of a core. Each core
is a highly simplified load/store CPU that supports only 7
instructions and has no branch instructions. We also
use the dedicated multipliers on the FPGA to construct the
ALU of each core. In order to reduce the area, both InsRom
and DataRAM are single-port memories implemented in
the Block RAM of the FPGA.
The decoder fetches instructions from the instruction
register (register A) and executes the corresponding
microinstructions stored in InsRom. The microinstructions are
dispatched to the cores in parallel via the instruction bus. The
data memory has only one read/write port; therefore, only a
single data memory access is allowed in each cycle. The
decoder manages the data memory so that conflicts are
avoided.
3.2 Implementation Hierarchy
As shown in Fig. 1, torus arithmetic can be represented in
various ways and on different levels. On the platform shown
in Fig. 2, the torus exponentiation is performed on three
levels. One torus exponentiation consists of a sequence of Fp6
operations, each of which consists of a sequence of Modular
Multiplications (MM) and Modular Additions (MA) in Fp.
The sequence of modular operations can either be
generated in software, e.g. as C code, or be stored in the
coprocessor. We investigate both types of implementation.
3.2.1 Type-A Implementation
Figure 3. Torus exponentiation in hierarchy:
Type-A Implementation.
Fig. 3 shows the Type-A implementation. Here the
MicroBlaze generates the sequence of MM and MA operations and
sends them to the coprocessor one by one. For example, the
MicroBlaze writes an "MM" instruction to register A to perform
a modular multiplication:
MM AddrC, AddrA, AddrB
The coprocessor decodes this instruction and executes the
corresponding microinstructions that are stored in the
InsRom. After finishing the multiplication, the coprocessor
generates an interrupt signal, which is monitored and
handled by the MicroBlaze. Afterwards the MicroBlaze can
send the next instruction.
As one Fp6 multiplication consists of 18M + 60A, a total
of 78 register A accesses and 78 interrupt handlings are
required. One register A access together with one interrupt
handling requires 184 clock cycles, while one 170-bit
modular multiplication requires 193 clock cycles. Therefore, the
communication between the MicroBlaze processor and the
coprocessor becomes the bottleneck of the whole system.
3.2.2 Type-B Implementation
One possible way to improve the performance is to reduce
the communication overhead. Without losing any flexibility,
we add another instruction ROM (InsRom1) to the
coprocessor and denote this architecture as Type-B.
InsRom1 stores the level-2 sequences. Fig. 4 shows this
implementation.
Figure 4. Torus exponentiation in hierarchy:
Type-B Implementation.
Now the MicroBlaze sends instructions on level 1:
T6M AddrC, AddrA, AddrB
The coprocessor decodes this instruction and fetches the
corresponding sequence of MM and MA from InsRom1. For
each MM or MA, the coprocessor executes the corresponding
microinstructions stored in InsRom2. The Type-B
implementation requires only one register A access and one
interrupt for each Fp6 operation; therefore, the performance
is improved.
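A back-of-the-envelope model makes the difference in communication overhead concrete. The sketch below only multiplies out the cycle counts quoted in Section 3.2.1 (184 cycles per register A access plus interrupt, 193 cycles per 170-bit MM, 18 MM and 60 MA per Fp6 multiplication); the names are hypothetical, and it is not a cycle-accurate model of the platform.

```python
HANDSHAKE_CYCLES = 184    # one register A access plus one interrupt handling
MM_CYCLES = 193           # one 170-bit modular multiplication
N_MM, N_MA = 18, 60       # modular operations per Fp6 multiplication

def handshake_overhead(instructions_per_fp6_mult: int) -> int:
    """Cycles spent on MicroBlaze/coprocessor handshakes per Fp6 multiplication."""
    return instructions_per_fp6_mult * HANDSHAKE_CYCLES

type_a_overhead = handshake_overhead(N_MM + N_MA)  # one handshake per MM or MA
type_b_overhead = handshake_overhead(1)            # one handshake per T6M
mm_work = N_MM * MM_CYCLES                         # cycles spent in the 18 MM

print("Type-A handshake cycles:", type_a_overhead)
print("Type-B handshake cycles:", type_b_overhead)
print("Cycles in modular multiplications:", mm_work)
```

Under these figures the Type-A handshakes alone cost roughly four times as many cycles as the 18 modular multiplications, which is consistent with the observation that communication is the bottleneck.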
Both Type-A and Type-B offer high flexibility. Instead
of 170-bit MM/MA, one can compose a microinstruction
program that performs 1024-bit MM/MA, so that 1024-bit
RSA is supported. In order to support ECC, on level 2 we
can also store a sequence of MM/MA that constructs a Point
Addition (PA) or Point Doubling (PD) operation instead of an
Fp6 operation. We also implemented 160-bit ECC and 1024-bit RSA
on this platform to compare their performance with that
of the torus.
3.3 Implementation of Montgomery Modular
Multiplication
The performance of one Montgomery modular multiplication
is bounded by the system architecture and the
instruction scheduling method in use. An efficient instruction
scheduling method for Montgomery modular multiplication
on multicore systems was discussed in [4]. The main
challenge here is to reduce the number of data transfers between
different cores. The data dependency of Alg. 1 is mainly
caused by the carry generated by additions, i.e., by the
computation of $Z \leftarrow (Z + X \cdot y_i + P \cdot T)/r$. To
use all the cores efficiently, a carry should be used only in the
core where it was generated. In [4], we observed that only
$z_0$ has to be generated at the end of each iteration, while
$z_{s-1}, \ldots, z_1$ can be generated at the end of the loop. Based
on this observation, an instruction scheduling method which
avoids carry transfers and efficiently uses all the cores is
proposed. The result in [4] shows that a 256-bit MM on a
4-core system is 2.96 times faster than the single-core
implementation. We use this instruction scheduling method
here in the CEILIDH implementation.
Figure 5. Parallelized 256-bit Montgomery
modular multiplication on a 4-core system.
Fig. 5 shows how this method works. During the whole
loop (z1, z0) is generated and stored in core-1, (z3, z2) in
core-2, (z5, z4) in core-3 and (z7, z6) in core-4. Each carry is
only used in the local core. At the end of each iteration, z2
in core-2 becomes the new z1 and is sent to core-1. Also, z4 is
sent to core-2 and z6 is sent to core-3. After eight iterations
and a conditional subtraction, $Z = X \cdot Y \cdot R^{-1} \bmod P$
is obtained and stored in a distributed manner in the register
files of the cores.
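The data movement described above can be imitated in a small sequential simulation. This is a simplified model and not the instruction schedule of [4]: each simulated core keeps a two-word slice of Z together with its own carry bits in one Python integer, only the lowest word of each slice is handed to the neighbouring core per iteration, and carries are resolved once at the end. The function name, the 32-bit word size and the test modulus 2^255 - 19 are assumptions made for illustration.

```python
def mont_mul_sliced(x, y, p, w=32, cores=4, words_per_core=2):
    """Montgomery product x*y*R^-1 mod p with Z split across simulated cores."""
    s = cores * words_per_core                 # number of words / iterations
    mask = (1 << w) - 1
    slice_bits = w * words_per_core
    slice_mask = (1 << slice_bits) - 1
    p_prime = (-pow(p, -1, 1 << w)) % (1 << w)
    # Each core holds its own slice of X and P.
    xs = [(x >> (slice_bits * k)) & slice_mask for k in range(cores)]
    ps = [(p >> (slice_bits * k)) & slice_mask for k in range(cores)]
    acc = [0] * cores                          # local slice of Z plus local carry
    for i in range(s):
        y_i = (y >> (w * i)) & mask
        # Only z0 (low word of core 1) is needed to compute T.
        t = (((acc[0] & mask) + (xs[0] & mask) * y_i) * p_prime) & mask
        for k in range(cores):                 # independent per-core updates
            acc[k] += xs[k] * y_i + ps[k] * t
        lows = [a & mask for a in acc]         # lows[0] is zero by choice of T
        # Shift by one word: each core receives the low word of its neighbour.
        acc = [(acc[k] >> w)
               + ((lows[k + 1] if k + 1 < cores else 0) << (slice_bits - w))
               for k in range(cores)]
    z = sum(a << (slice_bits * k) for k, a in enumerate(acc))  # resolve carries
    return z - p if z >= p else z

p_256 = (1 << 255) - 19                        # odd 255-bit test modulus (assumption)
R_256 = 1 << 256
x_256 = 0x1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF
y_256 = 0x0FEDCBA9876543210FEDCBA9876543210FEDCBA9876543210FEDCBA987654321
assert mont_mul_sliced(x_256, y_256, p_256) == (x_256 * y_256 * pow(R_256, -1, p_256)) % p_256
```

With the default parameters the loop runs eight times, and in each iteration only one word per core crosses a core boundary, mirroring the transfers of z2, z4 and z6 described above.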
4 Results
We implemented this platform on a Xilinx Virtex-II Pro
FPGA. Table 1 shows the number of clock cycles for
different modular operations. The results show that one 170-bit
Montgomery modular multiplication requires 193 clock
cycles, while one addition needs 47 clock cycles. The
reason that modular additions are relatively slow is that we
use only one core to perform modular additions and
subtractions. This is because the carry would need to be
transferred if multiple cores were used to perform modular additions.
While 160-bit modular operations are slightly faster
than 170-bit operations, a 1024-bit Montgomery modular
multiplication is about 23 times slower than a 170-bit
multiplication.
Table 1. Number of clock cycles for different operations.
Bitlength        Operation           Clock cycles
-                Interrupt handling  184
170-bit (torus)  Modular mult.       193
170-bit (torus)  Modular add.        47
170-bit (torus)  Modular sub.        61
160-bit (ECC)    Modular mult.       163
160-bit (ECC)    Modular add.        40
160-bit (ECC)    Modular sub.        53
1024-bit (RSA)   Modular mult.       4447
Table 2 shows the results of Fp6 multiplication and ECC
PA/PD for the Type-A and Type-B implementations. One
170-bit Fp6 multiplication takes 22348 clock cycles for the
Type-A implementation. However, only 5908 clock cycles
are required for the Type-B implementation, which is 3.78
times faster than the Type-A implementation. For ECC,
PA on the Type-B implementation is about 2.49 times faster
than on Type-A, and PD is about 2.17 times faster.
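The quoted speedups follow directly from the cycle counts reported in Table 2. The short sketch below recomputes them; the dictionary layout and names are illustrative only.

```python
# (Type-A cycles, Type-B cycles) as reported in Table 2
cycles = {
    "torus T6 mult.": (22348, 5908),
    "ECC PA": (7185, 2888),
    "ECC PD": (5793, 2665),
}

# Speedup of Type-B over Type-A for each operation
speedup = {op: type_a / type_b for op, (type_a, type_b) in cycles.items()}

for op, factor in speedup.items():
    print(f"{op}: {factor:.2f}x")
```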
Table 2. Number of clock cycles for different operations in the
Type-A and Type-B implementations.
Architecture type  Operation       Clock cycles
Type-A             torus T6 mult.  22348
Type-A             ECC PA          7185
Type-A             ECC PD          5793
Type-B             torus T6 mult.  5908
Type-B             ECC PA          2888
Type-B             ECC PD          2665
The design is synthesized and implemented on a Xilinx
Virtex-II Pro (XC2VP30) FPGA. A maximum frequency
of 74 MHz can be achieved. The data memory and
instruction memory are implemented in block RAM of the
FPGA board. In total, 5419 slices are used for this design,
of which the coprocessor requires 3285 slices. Table 3 shows
the performance of torus, ECC and RSA on this platform.
One 170-bit T6 exponentiation requires 20 ms, while one
1024-bit RSA exponentiation requires 96 ms. In this case,
CEILIDH is about 5 times faster than RSA on the same
platform. Note that one Fp6 multiplication requires 18 MM and
60 MA. Further performance improvement can be obtained by
executing these modular operations in parallel.
On the same platform, one 160-bit ECC scalar
multiplication requires 9.4 ms, which is about two times
faster than CEILIDH.
Table 3. Performance comparison between torus, ECC and RSA on
the same platform.
PKC            Area [slices]  Freq. [MHz]  Time [ms]
170-bit torus  5419           74           20
1024-bit RSA   5419           74           96
160-bit ECC    5419           74           9.4
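As a rough plausibility check, the reported times can be compared with a simple operation-count model. The sketch below assumes a plain binary (square-and-multiply or double-and-add) method with one squaring or doubling per exponent bit and one multiplication or addition for every second bit, assumes that a squaring costs the same as a multiplication, and uses the Type-B cycle counts from Tables 1 and 2 at 74 MHz. The paper does not state which exponentiation method was used, so this model is an assumption and the function names are hypothetical.

```python
FREQ_HZ = 74e6  # maximum clock frequency reported for the design

def binary_method_cycles(bits: int, square_cycles: int, mult_cycles: int) -> int:
    """Assumed model: `bits` squarings/doublings plus bits // 2 multiplications/additions."""
    return bits * square_cycles + (bits // 2) * mult_cycles

def to_ms(cycle_count: int) -> float:
    """Convert a cycle count to milliseconds at FREQ_HZ."""
    return cycle_count / FREQ_HZ * 1e3

torus_cycles = binary_method_cycles(170, 5908, 5908)   # Fp6 multiplication, Type-B
rsa_cycles = binary_method_cycles(1024, 4447, 4447)    # 1024-bit MM only
ecc_cycles = binary_method_cycles(160, 2665, 2888)     # PD and PA, Type-B

for name, c in (("torus", torus_cycles), ("RSA", rsa_cycles), ("ECC", ecc_cycles)):
    print(f"{name}: {c} cycles, about {to_ms(c):.1f} ms")
```

The model gives estimates of the same order as the 20 ms, 96 ms and 9.4 ms in Table 3, although it ignores communication overhead for RSA and any optimizations in the actual implementation.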
5 Conclusions and Future work
We describe a design approach for CEILIDH on a
multicore platform. A MicroBlaze is used as a controller
together with a multicore coprocessor. The results show that
one 170-bit T6 exponentiation requires 20 ms, which is about 5
times faster than 1024-bit RSA on the same platform.
Compared to ECC, CEILIDH has the same advantage of small
key size and short ciphertext length. However, it is about two
times slower than ECC with equivalent security.
For future work, we believe that the performance can be
improved by deploying fast modular adders. Also, further
improvement is obtainable by exploring parallelism between
modular operations.
6 Acknowledgements
Junfeng Fan, Lejla Batina and Kazuo Sakiyama are
funded by research grants of Katholieke Universiteit
Leuven (OT/06/40) and FWO projects (G.0300.07 and
G.0450.04). This work was supported in part by the IAP
Programme P6/26 BCRYPT of the Belgian State (Belgian
Science Policy), by the EU IST FP6 projects (ECRYPT)
and by the IBBT-QoE project of the IBBT.
References
[1] T. Blum and C. Paar. High-radix Montgomery modular ex-
ponentiation on reconfigurable hardware. IEEE Transac-
tions on Computers, 50(7):759–764, July 2001.
[2] Ç.K. Koç, T. Acar, and B. Kaliski Jr. Analyzing and
comparing Montgomery multiplication algorithms. IEEE Micro,
16(3):26–33, June 1996.
[3] W. Diffie and M. Hellman. New directions in cryptogra-
phy. IEEE Transactions on Information Theory, 22:644–
654, 1976.
[4] J. Fan, K. Sakiyama, and I. Verbauwhede. Montgomery
modular multiplication algorithm on multi-core systems. In
Proceedings of the IEEE Workshop on Signal Processing
Systems: Design and Implementation (SIPS 2007), pages
261–266, 2007.
[5] R. Granger, D. Page, and M. Stam. A comparison of
CEILIDH and XTR. In D. Buell, editor, Proceedings of Al-
gorithmic Number Theory - ANTS-VI, number 3076 in Lec-
ture Notes in Computer Science, pages 235–249, 2004.
[6] N. Koblitz. Elliptic curve cryptosystem. Math. Comp.,
48:203–209, 1987.
[7] A. Menezes, P. van Oorschot, and S. Vanstone. Handbook
of Applied Cryptography. CRC Press, 1997.
[8] V. Miller. Uses of elliptic curves in cryptography. In H. C.
Williams, editor, Advances in Cryptology: Proceedings of
CRYPTO’85, number 218 in Lecture Notes in Computer Sci-
ence, pages 417–426. Springer-Verlag, 1985.
[9] P. Montgomery. Modular multiplication without trial divi-
sion. Mathematics of Computation, 44(170):519–521, 1985.
[10] K. Rubin and A. Silverberg. Torus-based cryptography. In
D. Boneh, editor, Advances in Cryptology: Proceedings of
CRYPTO’03, number 2729 in Lecture Notes in Computer
Science, pages 349–365. Springer-Verlag, 2003.
[11] M. Stam and A. Lenstra. Efficient Subgroup Exponentiation.
In B. Kaliski Jr., Ç.K. Koç, and C. Paar, editors,
Proceedings of 4th International Workshop on Cryptographic
Hardware and Embedded Systems (CHES), number 2535 in
Lecture Notes in Computer Science, pages 318–332.
Springer-Verlag, 2002.
