A soft-core processor for finite field arithmetic with a variable word size accelerator by Iwasaki Aiko et al.
A SOFT-CORE PROCESSOR FOR FINITE FIELD ARITHMETIC WITH A VARIABLE
WORD SIZE ACCELERATOR
Aiko Iwasaki, Keisuke Dohi, Yuichiro Shibata, Kiyoshi Oguri, Ryuichi Harasawa
Graduate School of Engineering, Nagasaki University
1-14, Bunkyo, Nagasaki, Japan
email: {iwasaki,dohi}@pca.cis.nagasaki-u.ac.jp,{shibata,oguri,harasawa}@cis.nagasaki-u.ac.jp
ABSTRACT
This paper presents the implementation and evaluation of an
accelerator architecture for soft-core processors that speeds up
reduction in arithmetic over GF(2^m), which is used in Elliptic Curve
Cryptography (ECC) systems. The word size of the accelerator can be
customized when the architecture is configured on an FPGA. Because the
number of operations needed for reduction over GF(2^m) depends on both
the irreducible polynomial and the word size, we propose choosing an
unconventional word size for the accelerator according to the given
irreducible polynomial, and we implement a MIPS-based soft-core
processor coupled with a variable-word-size accelerator. In an
evaluation with several polynomials, performance improved by up to
10.2 times compared to the 32-bit word size, even taking into account
the maximum frequency degradation of 20.4% caused by changing the word
size. The results also show the advantage of unconventional word
sizes, suggesting the promise of this approach for low-power ECC
systems.
1. INTRODUCTION
In this paper, we focus on the reduction algorithm [1] for finite
fields GF(2^m) of characteristic 2. Reduction is one of the most
time-consuming basic operations in elliptic curve arithmetic. Scott
proposed a technique for choosing optimal irreducible polynomials for
a given architecture [2]. In contrast, we propose to use a soft-core
processor with a configurable word size for ECC arithmetic and to
tailor the architecture to a given irreducible polynomial, making the
best use of the flexibility of FPGAs. Compared to Scott's approach,
ours allows the use of the standard irreducible polynomials widely
used in ECC applications [3]. While many studies on speeding up finite
field arithmetic for ECC using FPGAs have been reported
[4][5][6][7][8][9], to the best of the authors' knowledge there have
been hardly any attempts to implement a soft-core processor with a
configurable ECC arithmetic accelerator of variable word size.
2. MATHEMATICAL BACKGROUND
ECC, which was proposed by Koblitz [10] and Miller [11] in 1985, has
the advantage that a smaller key size can be used compared to RSA of
equivalent cryptographic strength. For implementation reasons, ECC
often uses elliptic curves defined over GF(2^m), where GF(2^m) is an
extension field of degree m over GF(2). The elements of GF(2^m) are
expressed as m-bit binary numbers. While addition and subtraction in
GF(2^m) can be realized as an m-bit XOR operation without carry
propagation, multiplication must be followed by reduction, which
computes the remainder of the intermediate product divided by the
irreducible polynomial of the field and tends to be time-consuming.
2.1. Reduction processing
We express the field GF(2^m) as the residue class ring
GF(2)[x]/(f(x)), where f(x) is an irreducible polynomial
over GF(2) of degree m. When f(x) is of the form
\[ f(x) = x^m + x^a + x^b + x^c + 1 \tag{1} \]
with m > a > b > c > 0, we have
\[ x^m = x^a + x^b + x^c + 1 \tag{2} \]
because f(x) = 0 in GF(2^m). By iteratively substituting Eq. (2),
any polynomial can be reduced to a degree smaller than that of the
irreducible polynomial.
For example, let g(x) be a polynomial of degree 2m − 2.
By Eq. (2), we have
\[
\begin{aligned}
g(x) &= g_{2m-2} x^{2m-2} + \cdots + g_m x^m + g_{m-1} x^{m-1} + \cdots + g_1 x + g_0 \\
     &= (g_{2m-2} x^{m-2} + \cdots + g_m)(x^a + x^b + x^c + 1) + g_{m-1} x^{m-1} + \cdots + g_1 x + g_0
\end{aligned}
\tag{3}
\]
where g_i ∈ GF(2) = {0, 1}. Namely, the term x^i x^m is reduced to
the binary representation of r(x) := x^a + x^b + x^c + 1 shifted left
by i bits. The reduction can therefore be performed by checking each
bit of g(x) from the MSB and, when the corresponding bit is 1,
shifting r(x) and XORing it into g(x).
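The bit-serial procedure described above can be sketched in Python as
follows. This is a minimal illustration rather than code from the
paper: the function name `reduce_bitwise` and the use of a Python
integer whose bit i holds the coefficient g_i are assumptions made for
the example.

```python
def reduce_bitwise(g: int, m: int, a: int, b: int, c: int) -> int:
    """Reduce g modulo f(x) = x^m + x^a + x^b + x^c + 1 over GF(2).

    Bit i of the integer g is the coefficient of x^i (assumed encoding).
    """
    r = (1 << a) | (1 << b) | (1 << c) | 1      # r(x) = x^a + x^b + x^c + 1
    # Scan from the MSB down to bit m; lower bits are already reduced.
    for i in range(g.bit_length() - 1, m - 1, -1):
        if (g >> i) & 1:
            g ^= 1 << i                          # remove the term x^i
            g ^= r << (i - m)                    # add x^(i-m) * r(x) instead
    return g
```

For f(x) = x^283 + x^12 + x^7 + x^5 + 1, the call
`reduce_bitwise(1 << 283, 283, 12, 7, 5)` returns `0x10A1`, the binary
representation of r(x).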
2.2. Fast reduction processing algorithm
In software, the speed of the reduction algorithm can be improved by
operating on whole words [1]. For example, assume that reduction by
f(x) = x^283 + x^12 + x^7 + x^5 + 1 is performed on a machine with a
32-bit word length. As shown in Fig. 1, a polynomial is represented
as a bit string that spans multiple 32-bit words. Most operations in
this algorithm are shifts and XORs, and only eight distinct shift
amounts appear in the code, as shown in Table 1. For a word size ω,
the shift amounts required by the algorithm are given by the formulas
in Table 2.
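As a rough sketch of the word-level approach, the Python code below
derives the shift amounts from the Table 2 formulas and applies them
word by word, in the manner of Fig. 1. The names `shift_amounts` and
`reduce_words` are hypothetical. The sketch only covers word sizes for
which every contribution of a high word lands strictly below it
(ω ≤ m − a) and the final fold stays inside word 0 (0 < m mod ω and
a ≤ m mod ω); this holds for ω = 32 and the polynomial above, while
other word sizes would need extra passes.

```python
def shift_amounts(m, exps, w):
    """Table 2 formulas: (right, left) shift amount for each exponent of r(x)."""
    return [((m - e) % w, w - (m - e) % w) for e in exps]


def reduce_words(g, m, exps, w):
    """Reduce g, a list of w-bit words with g[0] least significant.

    exps lists the exponents of r(x), e.g. (12, 7, 5, 0).
    """
    mask = (1 << w) - 1
    k = m % w                      # bits of the top result word below x^m
    n = m // w + 1                 # number of result words (k != 0 assumed)
    # Assumptions of this sketch; all of them hold for the Fig. 1 setting.
    assert k != 0 and max(exps) <= k and w <= m - max(exps)
    g = list(g)
    for i in range(len(g) - 1, n - 1, -1):       # words entirely above x^m
        t, g[i] = g[i], 0
        for e in exps:
            q, rs = divmod(m - e, w)             # word offset, right shift
            g[i - q] ^= t >> rs                  # part landing in word i-q
            if rs:                               # part spilling one word lower
                g[i - q - 1] ^= (t << (w - rs)) & mask
    t = g[n - 1] >> k                            # leftover bits at or above x^m
    g[n - 1] &= (1 << k) - 1
    for e in exps:
        g[0] ^= t << e                           # fold them into word 0
    return g[:n]
```

For example, `shift_amounts(283, (12, 7, 5, 0), 32)` returns
`[(15, 17), (20, 12), (22, 10), (27, 5)]`, the eight shift amounts that
appear in Fig. 1.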
2.3. Effect of processing word size
The operation count of the reduction algorithm depends largely on the
processing word size. First, the longer the word, the more
parallelism can be extracted from long-word XOR operations. In
addition, since the required shift amounts depend on the relationship
between the form of the irreducible polynomial and the word size, the
number of shift operations can be decreased by selecting a suitable
word size for a given irreducible polynomial.
Fig. 2 shows how the word size affects the reduction process of
Fig. 1. Some special word sizes significantly reduce the operation
count compared to neighboring sizes by canceling out many shift
operations. This motivated us to take an FPGA implementation
approach, which makes it possible to select even an unconventional
word size.
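To make the first effect concrete, the short Python sketch below
counts how many ω-bit words are needed to hold a reduced element
(m bits) and an unreduced product (2m − 1 bits) for m = 283. The
helper names and the list of word sizes are illustrative choices, not
taken from the paper, and the sketch does not attempt to reproduce the
operation counts of Fig. 2.

```python
def words_needed(bits: int, w: int) -> int:
    """Number of w-bit words needed to store a bit string of the given length."""
    return -(-bits // w)          # ceiling division


def word_counts(m: int, sizes):
    """Map each word size to (words per element, words per unreduced product)."""
    return {w: (words_needed(m, w), words_needed(2 * m - 1, w)) for w in sizes}


counts = word_counts(283, (32, 64, 128, 256, 294))
for w, (elem, prod) in counts.items():
    print(f"w={w:3d}: element {elem} words, product {prod} words")
# counts == {32: (9, 18), 64: (5, 9), 128: (3, 5), 256: (2, 3), 294: (1, 2)}
```

With 32-bit words the product occupies 18 words and the result 9,
matching the input and output sizes of the algorithm in Fig. 1; with a
294-bit word the result fits in a single word.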
3. DESIGN AND IMPLEMENTATION
Since ECC is a basic building block for constructing a variety of
public-key cryptosystem protocols, it is desirable that the system
offer some degree of software programmability. We therefore propose
an architecture consisting of an FPGA-based soft-core processor
coupled with an accelerator for ECC arithmetic, as shown in Fig. 3.
The main processor is based on the MIPS architecture [12] and offers
standard functionality for software execution. The accelerator has
the same pipeline structure as the MIPS core and is connected to the
main core at the instruction decode and register fetch (Id) and
memory access (Ma) stages. The accelerator has its own register file
(ECC regfile) and arithmetic circuit (ECC ALU). The colored datapath
in Fig. 3 can be configured to any designated word size. The word
Figure 1: Algorithm for reduction processing modulo f(x) = x^283 + x^12 + x^7 + x^5 + 1
Input: a binary polynomial represented as 18 32-bit words g[.]
Output: g[.] modulo f(x), represented as 9 32-bit words
1: g17 ← g[17], g16 ← g[16], g15 ← g[15], g14 ← g[14], g13 ← g[13], g12 ← g[12],
   g11 ← g[11], g10 ← g[10], g9 ← g[9], g8 ← g[8], g7 ← g[7], g6 ← g[6],
   g5 ← g[5], g4 ← g[4], g3 ← g[3], g2 ← g[2], g1 ← g[1], g0 ← g[0]
2: g[17] ← g[16] ← g[15] ← g[14] ← g[13] ← g[12] ← g[11] ← g[10] ← g[9] ← 0
3: g9 ← g9 ⊕ (g17 ≫ 15) ⊕ (g17 ≫ 20) ⊕ (g17 ≫ 22) ⊕ (g17 ≫ 27)
4: g8 ← g8 ⊕ (g17 ≪ 17) ⊕ (g17 ≪ 12) ⊕ (g17 ≪ 10) ⊕ (g17 ≪ 5)
      ⊕ (g16 ≫ 15) ⊕ (g16 ≫ 20) ⊕ (g16 ≫ 22) ⊕ (g16 ≫ 27)
5: g[7] ← g7 ⊕ (g16 ≪ 17) ⊕ (g16 ≪ 12) ⊕ (g16 ≪ 10) ⊕ (g16 ≪ 5)
      ⊕ (g15 ≫ 15) ⊕ (g15 ≫ 20) ⊕ (g15 ≫ 22) ⊕ (g15 ≫ 27)
6: g[6] ← g6 ⊕ (g15 ≪ 17) ⊕ (g15 ≪ 12) ⊕ (g15 ≪ 10) ⊕ (g15 ≪ 5)
      ⊕ (g14 ≫ 15) ⊕ (g14 ≫ 20) ⊕ (g14 ≫ 22) ⊕ (g14 ≫ 27)
7: g[5] ← g5 ⊕ (g14 ≪ 17) ⊕ (g14 ≪ 12) ⊕ (g14 ≪ 10) ⊕ (g14 ≪ 5)
      ⊕ (g13 ≫ 15) ⊕ (g13 ≫ 20) ⊕ (g13 ≫ 22) ⊕ (g13 ≫ 27)
8: g[4] ← g4 ⊕ (g13 ≪ 17) ⊕ (g13 ≪ 12) ⊕ (g13 ≪ 10) ⊕ (g13 ≪ 5)
      ⊕ (g12 ≫ 15) ⊕ (g12 ≫ 20) ⊕ (g12 ≫ 22) ⊕ (g12 ≫ 27)
9: g[3] ← g3 ⊕ (g12 ≪ 17) ⊕ (g12 ≪ 12) ⊕ (g12 ≪ 10) ⊕ (g12 ≪ 5)
      ⊕ (g11 ≫ 15) ⊕ (g11 ≫ 20) ⊕ (g11 ≫ 22) ⊕ (g11 ≫ 27)
10: g[2] ← g2 ⊕ (g11 ≪ 17) ⊕ (g11 ≪ 12) ⊕ (g11 ≪ 10) ⊕ (g11 ≪ 5)
      ⊕ (g10 ≫ 15) ⊕ (g10 ≫ 20) ⊕ (g10 ≫ 22) ⊕ (g10 ≫ 27)
11: g[1] ← g1 ⊕ (g10 ≪ 17) ⊕ (g10 ≪ 12) ⊕ (g10 ≪ 10) ⊕ (g10 ≪ 5)
      ⊕ (g9 ≫ 15) ⊕ (g9 ≫ 20) ⊕ (g9 ≫ 22) ⊕ (g9 ≫ 27)
12: g0 ← g0 ⊕ (g9 ≪ 17) ⊕ (g9 ≪ 12) ⊕ (g9 ≪ 10) ⊕ (g9 ≪ 5)
13: t ← (g8 ≫ 27)
14: g0 ← g0 ⊕ t
15: t ← (t ≪ 27)
16: g[8] ← g8 ⊕ t
17: g[0] ← g0 ⊕ (t ≫ 15) ⊕ (t ≫ 20) ⊕ (t ≫ 22)
Figure 2: Operation counts for f(x) = x^283 + x^12 + x^7 + x^5 + 1
Table 1: Shift amounts for the algorithm in Fig. 1
term    right shift amount    left shift amount
x^12    15                    17
x^7     20                    12
x^5     22                    10
1       27                    5
Table 2: Shift amounts for the reduction algorithm
term    right shift amount    left shift amount
x^a     (m − a) mod ω         ω − ((m − a) mod ω)
x^b     (m − b) mod ω         ω − ((m − b) mod ω)
x^c     (m − c) mod ω         ω − ((m − c) mod ω)
1       m mod ω               ω − (m mod ω)
size of the ECC accelerator can be configured to suit the irreducible
polynomial of the application system. The main memory of the proposed
soft-core is provided by on-chip BRAM. The architecture is described
at the RTL in Verilog HDL.
3.1. Architecture of the accelerator
In the ECC accelerator, the dedicated ALU provides the XOR and shift
operations needed for reduction. In general, shift hardware for a
large word size tends to consume a large logic area and to become a
bottleneck for the clock speed. Fortunately, as noted above,
reduction requires only a few distinct shift amounts for a given
irreducible polynomial and word size. This allows us to implement
significantly simpler and faster shift hardware for long words. The
ECC register file provides 32 registers of the designated word size,
where register zero always holds 0. The word size for ECC arithmetic
is given as a parameter at logic synthesis.
3.2. Data memory alignment
While the word size for the ECC accelerator can be tailored
to any length, that of the MIPS main core is fixed to be 32
bits. Therefore, memory space sharing between the acceler-
ator and the main core causes a complex alignment issue.
In order to make the architecture simple and efficient,
we put some alignment constraints on memory access made
by the ECC accelerator. When the word size w is designated
by a designer, the main memory is automatically configured





Here, the value of 32 comes from the word size of the MIPS main core.
Since wd ≥ w always holds, load and store instructions for the ECC
registers are executed in one clock cycle. However, to avoid
introducing many multiplexers, the effective address of an ECC
load/store must be aligned to a multiple of wd. Consequently, when
the ECC accelerator accesses the main data memory, it cannot access
the upper wd − w bits of a main memory word. Effective addresses of
ECC load/store instructions are calculated in the MIPS main-core
processor.
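The alignment constraint can be illustrated with a toy memory model in
Python. Everything in this sketch is an assumption made for
illustration: the class name `EccMemoryModel`, the use of byte
addresses, and the requirement that wd be a whole number of bytes. The
memory word size wd is simply passed in as a parameter rather than
derived from w.

```python
class EccMemoryModel:
    """Toy model of a main memory built from wd-bit words (assumed layout)."""

    def __init__(self, w: int, wd: int, n_words: int):
        # wd is an input here; this sketch does not compute it from w.
        if wd < w or wd % 8:
            raise ValueError("need wd >= w and wd a whole number of bytes")
        self.w, self.wd = w, wd
        self.words = [0] * n_words          # one entry per wd-bit memory word

    def _index(self, byte_addr: int) -> int:
        step = self.wd // 8                 # bytes per memory word
        if byte_addr % step:
            raise ValueError("ECC load/store address must be wd-aligned")
        return byte_addr // step

    def load(self, byte_addr: int) -> int:
        # Only the lower w bits of the memory word reach the ECC register.
        return self.words[self._index(byte_addr)] & ((1 << self.w) - 1)

    def store(self, byte_addr: int, value: int) -> None:
        # The upper wd - w bits are out of reach and stay unchanged.
        i = self._index(byte_addr)
        low = (1 << self.w) - 1
        self.words[i] = (self.words[i] & ~low) | (value & low)
```

With purely illustrative values such as `EccMemoryModel(w=294, wd=320,
n_words=4)`, a store to byte address 40 succeeds, while a load from
byte address 4 raises an error because it is not aligned to a memory
word.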
3.3. C library
We also developed a C library for embedding the dedicated ECC
instructions in C source code using inline assembly. Users can
develop applications using a GCC MIPS cross compiler with this
library. Each dedicated ECC instruction can be invoked as a function
call with operand registers as arguments. At present, we support the
functions used in reduction processing.
4. EVALUATION
We used a Xilinx Spartan-6 XC6SLX45 FPGA as the target device and
implemented reduction functions for three irreducible polynomials:
f(x) = x^241 + x^70 + 1, f(x) = x^283 + x^12 + x^7 + x^5 + 1, and
f(x) = x^163 + x^7 + x^6 + x^3 + 1. The evaluated code was compiled
by a MIPS cross compiler based on the GNU Compiler Collection (GCC)
with our function library for the dedicated ECC instructions. We
used GNU binutils-2.23.2 and gcc-core-4.2.4 in little-endian mode.
First, we evaluated how increasing the word size restricts the
maximum operating clock frequency. We mapped every configuration of
our architecture, changing the ECC word size from 32 bits to 300
bits, and performed static timing analysis. Fig. 4 shows that the
frequency degradation caused by increasing the ECC word size was not
severe: approximately 0.03 MHz per bit on average. This is because a
long-word XOR instruction does not cause carry propagation and a
long-word shift for ECC does not increase the number of potential
shift amounts. The irreducible polynomial had only a slight impact
on the frequency, since the hardware differs only in the shift
amounts. There are relatively large frequency drops at 128 bits and
256 bits. This is because additional multiplexers were inserted
between the main core and the long-word main memory bank.
Next, we evaluated how the number of clock cycles required for
reduction is influenced by the ECC word size. As Fig. 5 shows,
selecting an appropriate word size for the given irreducible
polynomial significantly reduced the required clock cycles, and this
effect clearly outweighed the frequency drop shown in Fig. 4.
Fig. 6 shows how the proposed architecture improved the execution
performance of reduction compared to the 32-bit MIPS main-core
processor. The performance improved by up to 10.2 times in spite of
the maximum frequency degradation of 20.4%. In addition, the
evaluation results clearly reveal the advantage of employing an
unconventional word size, one that is not a power of two, for ECC
arithmetic. Meanwhile, the increase in FPGA resources from adding
the accelerator was 2.84 times for FFs and 3.79 times for LUTs on the
294-bit architecture for f(x) = x^283 + x^12 + x^7 + x^5 + 1.
Considering the performance gain obtained, this increase in resources
appears reasonable.
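The trade-off between fewer clock cycles and a lower clock frequency
can be expressed with a small calculation. The Python sketch below
computes the speed-up as a ratio of execution times. The function
name and all numbers in the usage example are hypothetical and are not
measurements from this evaluation.

```python
def speedup(cycles_base: int, freq_base_mhz: float,
            cycles_acc: int, freq_acc_mhz: float) -> float:
    """Speed-up of the accelerated run over the baseline run.

    Execution time is modelled as cycles / frequency (an assumption that
    ignores effects such as memory stalls outside the counted cycles).
    """
    time_base = cycles_base / freq_base_mhz     # microseconds
    time_acc = cycles_acc / freq_acc_mhz        # microseconds
    return time_base / time_acc


# Hypothetical example: 12 times fewer cycles at a 20% lower clock.
example = speedup(1200, 100.0, 100, 80.0)
print(round(example, 2))
```

With these made-up values the cycle count falls by a factor of 12 but
the clock is 20% slower, so the net speed-up is 9.6.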
[Figure residue: pipeline stage labels If, Id, Ex, Ma, Wb]
[Figure residue: legend "least squares regression of three polynomials"]
Figure 6: Obtained speed-up and word size
5. CONCLUSION
This paper presented the implementation of an accelerator with
variable word size for a MIPS-based soft-core to speed up reduction
in GF(2^m) arithmetic, which is widely used in ECC systems. In our
architecture, the word size for ECC arithmetic is a parameter of the
HDL description, so that an optimal word size can be selected
according to the irreducible polynomial. We also developed a
function library so that users can easily insert dedicated ECC
instructions into C source code. Evaluation results showed that,
although the maximum clock frequency degraded as the word size
increased, performance improved by up to 10.2 times when the best
word size was chosen for a given irreducible polynomial. In
addition, adopting an unconventional word size such as 294 bits,
which the flexibility of FPGAs makes possible, was shown to be
significantly advantageous.
6. REFERENCES
[1] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic
Curve Cryptography. Springer, 2004, pp. 53–56.
[2] M. Scott, “Optimal irreducible polynomials for GF(2^m)
arithmetic,” in Cryptology ePrint Archive, Report 2007/197, 2007.
[3] D. Brown, “SEC 2: Recommended elliptic curve domain
parameters,” in Standards for Efficient Cryptography, ver. 2.0.
Certicom Corp., 2010.
[4] T. Ito, Y. Shibata, and K. Oguri, “Implementation of the
extended Euclidean algorithm for the Tate pairing on FPGA,” in
Field-Programmable Logic and Applications. Springer, 2004.
[5] G. Meurice de Dormale, P. Bulens, and J.-J. Quisquater,
“Efficient modular division implementation, ECC over GF(p) affine
coordinates application,” in Field-Programmable Logic and
Applications. Springer, 2004.
[6] S. Kumar and C. Paar, “Reconfigurable instruction set extension
for enabling ECC on an 8-bit processor,” in Field-Programmable Logic
and Applications. Springer, 2004.
[7] M. Keller, T. Kerins, and W. Marnane, “FPGA implementation of a
GF(2^m) multiplier for use in pairing based cryptosystems,” in FPL
2005, 2005.
[8] H. Li, J. Huang, P. Sweany, and D. Huang, “FPGA implementations
of elliptic curve cryptography and Tate pairing over a binary
field,” Journal of Systems Architecture, 2008, pp. 1077–1088.
[9] M. Keller, R. Ronan, W. Marnane, and C. Murphy, “Hardware
architectures for the Tate pairing over GF(2^m),” Computers and
Electrical Engineering, 2007, pp. 392–406.
[10] N. Koblitz, Elliptic curve cryptosystems. American Mathematical
Society, 1987, pp. 203–209.
[11] V. S. Miller, Use of Elliptic Curves in Cryptography. Springer
Berlin Heidelberg, 1986, pp. 417–426.
[12] D. A. Patterson and J. L. Hennessy, Computer Organization and
Design: The Hardware/Software Interface, 4th ed. Morgan Kaufmann,
2009.
