FPGA Realization of Low Register Systolic All One-Polynomial Multipliers Over GF(2^m) and their Applications in Trinomial Multipliers by Chen, Pingxiuqi
Wright State University 
CORE Scholar 
2016 
FPGA Realization of Low Register Systolic All One-Polynomial 
Multipliers Over GF (2m) and their Applications in Trinomial 
Multipliers 
Pingxiuqi Chen 
Wright State University 
Repository Citation 
Chen, Pingxiuqi, "FPGA Realization of Low Register Systolic All One-Polynomial Multipliers Over GF (2m) 
and their Applications in Trinomial Multipliers" (2016). Browse all Theses and Dissertations. 1532. 
https://corescholar.libraries.wright.edu/etd_all/1532 
FPGA REALIZATION OF LOW
REGISTER SYSTOLIC ALL
ONE-POLYNOMIAL MULTIPLIERS
OVER GF (2m) AND THEIR
APPLICATIONS IN TRINOMIAL
MULTIPLIERS
A thesis submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical Engineering
By
PINGXIUQI CHEN






I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY
Pingxiuqi Chen ENTITLED FPGA REALIZATION OF LOW REGISTER SYSTOLIC ALL
ONE-POLYNOMIAL MULTIPLIERS OVER GF(2^m) AND THEIR APPLICATIONS IN TRINOMIAL
MULTIPLIERS BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
















Vice President for Research and
Dean of the Graduate School
Abstract
Chen, Pingxiuqi. M.S.E.E., Department of Electrical Engineering, Wright State University, 2016. FPGA realization of low register systolic all one-polynomial multipliers over $GF(2^m)$ and their applications in trinomial multipliers.
All-one-polynomial (AOP)-based systolic multipliers over $GF(2^m)$ are usually not considered for practical implementation of cryptosystems such as elliptic curve cryptography (ECC) for security reasons. In addition, systolic AOP multipliers usually suffer from high register-complexity, especially on field-programmable gate array (FPGA) platforms, where register resources are limited. In this thesis, however, we show that AOP-based systolic multipliers can achieve low register-complexity implementations, and that the proposed architectures can be employed as computation cores to derive efficient systolic Montgomery multipliers based on trinomials, which are recommended by the National Institute of Standards and Technology (NIST) for cryptosystems. First, we propose a novel data broadcasting scheme that significantly reduces the register-complexity of existing AOP-based systolic multipliers. We find that, for practical usage, the modified AOP-based systolic structure can be packed as a standard computation core. Next, we propose a novel Montgomery multiplication algorithm that can fully employ the proposed AOP-based computation core. The proposed Montgomery algorithm employs a novel pre-computed-modular (PCM) operation, and the systolic structures based on this algorithm fully inherit the advantages of the AOP-based core (low register-complexity, low critical-path delay, and low latency), apart from a marginal hardware overhead introduced by a pre-computation unit. The proposed architectures are then implemented using Xilinx ISE 14.1, and the results show that the proposed designs achieve at least 70.0% lower area-delay product (ADP) and 47.6% lower power-delay product (PDP) than the best of the competing designs.





1.1 Preliminary
1.2 Summary of contribution
1.2.1 Existing AOP Systolic Multiplier Based on Trinomial
1.2.2 Proposed Systolic Structure
1.2.3 Report Outline
2 Finite Field
2.1 Introduction of Finite Field
2.1.1 Field
2.1.2 Finite Field
2.2 Field Operation
2.3 Binary Field
2.4 Extension Field
3 Polynomial basis multiplication over GF(2^m)
3.1 Polynomials
3.2 Polynomial basis representation over GF(2^m)
3.3 Irreducible polynomial
3.3.1 Definition of irreducible polynomial
3.3.2 Important irreducible polynomial based on finite field
3.4 Polynomial Basis Multiplication Over GF(2^m)
3.4.1 AOP Basis Multiplication Over GF(2^m)
3.4.2 Trinomial Basis Multiplication Over GF(2^m)
3.5 Existing Research on Polynomial Basis Multiplication Based on Finite Field GF(2^m)
3.5.1 Existing Example For AOP
3.5.2 Existing Example For Trinomial
4 Low Register-Complexity AOP based Systolic Multipliers (AOP-Based Computation Core)
4.1 Review of AOP Multiplication Algorithm [24]
4.2 Existing Systolic Structures
4.3 Modified Low Register-Complexity Structures
4.4 Low Latency Implementations
4.5 Digit-Parallel Structures
4.6 Area-Time Complexity
4.7 FPGA Implementation of Various AOP-based Structures
4.8 AOP-based Computation Core
5 Application of the Proposed AOP-based Computation Core
5.1 Montgomery Multiplication Algorithm
5.2 Proposed Montgomery Multiplication Algorithm
5.3 Proposed Low Register-Complexity Systolic Structure Employing the AOP-based Computation Core
5.4 Low-Latency Structure
5.5 Digit-Parallel Structure
5.6 Area and Time Complexities
5.6.1 Comparison
5.6.2 FPGA Implementations
5.6.3 Discussion
5.7 Conclusion
6 Conclusion
6.1 Low complexity multiplier based on trinomial
6.2 Efficient systolic structure trinomial multiplier over GF(2^m)
6.3 Digital-Parallel Systolic structure trinomial multiplier over GF(2^m)
7 Future Research
7.1 How to improve this multiplier with different structures





3.1 d(x) = a(x)b(x)
3.2 c(x) = d(x) mod f(x)
3.3 C = A · B mod f(x)
3.4 systolic array
3.5 serial in parallel out structure
3.6 parallel in serial out structure
4.1 cut-set retiming of the SFG with $T_A + T_X$ critical path
4.2 Conventional systolic structure of AOP-based multiplication (structure-I: S-I), where BSC denotes the bit-shifting cell and the black box denotes the registers. (a) Structure. (b) Internal structure of PE-1. (c) Internal structure of regular PE. (d) Internal structure of PE-2.
4.3 cut-set retiming of the SFG with $\max\{T_A, T_X\}$ critical path
4.4 Existing low critical-path structure of [24] for AOP-based multiplication (structure-II: S-II), where the black box denotes the registers. (a) Structure. (b) Internal structure of PE-1. (c) Internal structure of PE-2. (d) Internal structure of regular PE. (e) Internal structure of PE-3. (f) Internal structure of PE-4.
4.5 Modified structure-I (MS-I), where the black box denotes the registers. For AOP implementation, we can remove the PE inside the red-dotted area since $b_k = 0$, but for the formation of standard computation core, this PE will be preserved. (a) MS-I. (b) Internal structure of PE-1. (c) Internal structure of regular PE.
4.6 Modified structure-II (MS-II), where the black box denotes the registers. For AOP implementation, we can remove the PE inside the red-dotted area since $b_k = 0$, but for the formation of standard computation core, this PE will be preserved. (a) MS-II. (b) Internal structure of PE-1. (c) Internal structure of PE-2. (d) Internal structure of regular PE. (e) Internal structure of PE-3.
4.7 Low latency implementation of systolic structure, where the internal PEs can be those of MS-I or MS-II.
4.8 PE design for digit-parallel implementation (d = 2, based on the PEs from MS-I). (a) Original two neighboring PEs. (b) Combined PE. (c) Internal structure of previous two PEs. (d) Internal structure of combined PE.
4.9 AOP-based standard computation core, where the internal PEs can be those of MS-I or MS-II (the internal structure can be as that of Fig. 5, based on specific application environment).
5.1 Proposed low register-complexity systolic multiplier based on the AOP-based computation core (MS-I), where the black box denotes the registers. (a) Proposed structure.
5.2 Proposed low register-complexity systolic multiplier based on the AOP-based computation core (MS-I), where the black box denotes the registers. (b) Internal structure of the AOP-based computation core (MS-I, where e = 2). (c) Detailed design of PE-0. (d) Detailed design of PE-1. (e) Detailed design of regular PE. (f) Detailed design of PE-2.
5.3 Detailed design of two stage XOR operations in PE-0 for trinomial $f(x) = x^{233} + x^{74} + 1$, where the black box denotes bit-register.
5.4 Proposed low-latency systolic multiplier.
5.5 Comparison of register count and latency of various bit-parallel structures based on trinomial $f(x) = x^{233} + x^{74} + 1$ ([8] refers to the super-systolic structure). (a) Comparison of number of registers required by various designs. (b) Comparison of latency (number of cycles) for various designs (we have chosen e = 16 for the proposed structure of Fig. 10).
List of Tables
4.1 COMPARISON OF AREA-TIME COMPLEXITIES OF VARIOUS SYSTOLIC AOP-BASED MULTIPLIERS
4.2 FPGA IMPLEMENTATION RESULTS OF VARIOUS AOP-BASED MULTIPLIERS FOR k=162
5.1 COMPARISON OF AREA-TIME COMPLEXITIES OF VARIOUS SYSTOLIC MULTIPLIERS BASED ON TRINOMIALS
5.2 COMPARISON OF AREA-TIME COMPLEXITIES OF VARIOUS DE-




This chapter gives an outline of the whole thesis. It presents some basic ideas about the related algorithms and the corresponding structures based on these algorithms. The contributions of this thesis are also given.
1.1 Preliminary
In recent years, finite field arithmetic, which offers high efficiency and low complexity, has been used in various fields, such as error-control coding, information theory, and elliptic curve cryptography (ECC). ECC can be used in various applications, such as wearable devices, key agreement, and bank account systems. On the one hand, a cryptographic system and its algorithms should be highly resistant to potential attacks; on the other hand, the complexity of the cryptographic system should be kept low. Basically, there are two bases, the polynomial basis [5-13] and the normal basis [14-17], that can be selected to represent field elements. In hardware realizations, polynomial basis multipliers are more widely used than normal basis multipliers [8].
All-one polynomials (AOPs) and trinomials are two important classes of irreducible polynomials in use [7-11], [17-27]. For security reasons, AOPs are usually not preferred for cryptosystem implementations even though AOP-based multipliers are quite simple and regular, while trinomial-based multipliers are more popular than AOP-based ones, as two trinomials have been recommended by the National Institute of Standards and Technology (NIST) for ECC implementation [5]. However, because of the complexity differences, AOPs and trinomials are usually not considered together in practical field multiplication implementations [18].
There are basically two kinds of structures for multipliers over $GF(2^m)$: systolic designs and non-systolic designs. Systolic multipliers over $GF(2^m)$ based on irreducible polynomials are preferred in high-performance applications because of features such as modularity and regularity [5-11]. Systolic structures, however, have high register-complexity, since all processing elements (PEs) in the systolic array need registers for pipelining [5], while non-systolic designs usually have lower complexity but larger critical-path delay.
For practical applications, especially on field-programmable gate array (FPGA) platforms, where register resources are limited, low register-complexity systolic structures are required. Many efforts have been reported to reduce the register-complexity of systolic multipliers based on irreducible AOPs and trinomials [7-10], [23-27]. A bit-parallel AOP-based systolic multiplier was introduced in [23], and another efficient AOP-based design is presented in [24]. A low-complexity systolic Montgomery AOP-based multiplier was proposed in [26]. In [7], Lee et al. presented a bit-parallel systolic trinomial multiplier. Meher [8] proposed efficient bit-parallel systolic and super-systolic designs. Xie et al. [9] introduced a low register-complexity systolic structure. Very recently, Montgomery systolic multipliers were presented in which the register count was efficiently reduced [10]. Several other works were reported on the efficient realization of finite field Montgomery multiplication over $GF(2^m)$ [11], [17].
In this thesis, we combine a low register-complexity structure with the Montgomery multiplication algorithm to speed up the multiplication process.
1.2 Summary of contribution
1.2.1 Existing AOP Systolic Multiplier Based on Trinomial
Several designs of finite field systolic multipliers based on trinomials have been reported. Most of these designs focus on how to design the PEs inside the multiplier with respect to the critical path. In this thesis, we suggest two kinds of structures with critical paths of $T_A + T_X$ and $\max\{T_A, T_X\}$, where the duration of each cycle period is $T_A + T_X$ ($T_A$ and $T_X$ refer to the delay of an AND gate and an XOR gate, respectively). The critical path of the second structure is shorter than that of the first, so it has lower latency; however, the second multiplier needs more registers.
1.2.2 Proposed Systolic Structure
The efficiency of a multiplier is limited if only one algorithm is used. The main contribution of this thesis is therefore to combine the low register-complexity structure with a novel Montgomery algorithm to improve the overall efficiency. Inside the structure, we apply the strategies of register sharing and parallel-array pipelining to decompose the linear systolic design into several parallel arrays. From the characteristics of the matrix of one of the inputs, we observe that every two adjacent columns share most of their elements. Correspondingly, every two adjacent PEs can share the same input operands. In this way, we decrease not only the latency and the number of registers but also the number of XOR2 gates. To confirm the proposed design, we choose m = 233, which has been recommended by NIST [15].
Since a NAND gate is usually faster than an AND gate, we replace all AND2 gates in the original structures with NAND2 gates. To preserve the logic function, the XOR2 gates must also be changed to XNOR2 gates. In addition, the wire connections between the shift units need to be redesigned. After changing these components, we can realize the same function with circuits of lower latency and lower register-complexity. We use an example with m = 162 to test the proposal in this thesis.
1.2.3 Report Outline
The rest of the report is organized as follows.
Chapter 2 presents the mathematical formulation of polynomial basis multiplication over $GF(2^m)$. Several classes of irreducible polynomials are shown in this chapter.
Chapter 3 discusses polynomial basis multiplication over $GF(2^m)$ and shows some examples of how to implement multiplication over a finite field.
Chapter 4 presents low register-complexity AOP-based systolic multipliers. Existing AOP-based multipliers have high complexity. In this chapter, we introduce two new AOP multipliers with lower component complexity.
Chapter 5 introduces how to use the proposed AOP multiplier as the core component to realize a trinomial multiplier. In addition, in this chapter we apply a new algorithm to change the structure of the previous multiplier, which significantly improves the speed and area-complexity performance. The structure is further improved by using a digit-parallel approach.




This chapter presents some basic mathematical background on finite fields. In the following sections, we introduce the definitions of a field and a finite field and the arithmetic operations in a finite field. We also introduce a special finite field called the binary field, as well as the field operations defined on it.
2.1 Introduction of Finite Field
2.1.1 Field
A field (F, +, ·) consists of a set F together with two operations, addition (denoted by +) and multiplication (denoted by ·), that satisfy the usual arithmetic properties, such as the distributive law (a + b) · c = a · c + b · c for a, b, c ∈ F. Fields are abstractions of number systems and their essential properties; examples include the real numbers R and the complex numbers C.
2.1.2 Finite Field
A finite field (also called a Galois field) is a field that contains a finite number of elements; the order of a finite field is the number of elements it contains. For example, a field with q elements is written F_q (also denoted GF(q)) and has order q. When q is a prime, F_q can be represented as the integers modulo q, Z/qZ.
2.2 Field Operation
Addition and multiplication are the two operations defined in a field F. Subtraction and division are defined in terms of them. Subtraction is defined in terms of addition: a − b = a + (−b) for a, b ∈ F, where −b is the unique element of F such that b + (−b) = 0 (−b is the negative of b). Similarly, division is defined in terms of multiplication: a/b = a · b^(−1) for a, b ∈ F, b ≠ 0, where b^(−1) is the unique element of F such that b · b^(−1) = 1 (b^(−1) is the inverse of b).
Here is an example of arithmetic operations in a finite field. The integers modulo q form a field only when q is prime, so we take q = 29.
The elements of GF(29) are 0, 1, 2, ..., 28, and the arithmetic operations in GF(29) are:
(1). Addition: 15 + 20 = 6 since 35 mod 29 = 6.
(2). Subtraction: 15 − 20 = 15 + (−20) = 24 since −5 mod 29 = 24.
(3). Multiplication: 15 · 20 = 10 since 300 mod 29 = 10.
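The modular rules above can be checked with a few lines of Python. This is a minimal sketch: the prime modulus P = 29 and the helper names are assumptions chosen for illustration, and a prime is used because the integers modulo q form a field only when q is prime.

```python
P = 29  # assumed prime modulus, chosen for illustration


def add(a, b):
    """Field addition: ordinary addition reduced modulo P."""
    return (a + b) % P


def sub(a, b):
    """Subtraction defined as a + (-b), reduced modulo P."""
    return (a + (-b)) % P


def mul(a, b):
    """Field multiplication: ordinary product reduced modulo P."""
    return (a * b) % P


def inv(b):
    """Multiplicative inverse of b modulo P (requires Python 3.8+)."""
    if b % P == 0:
        raise ZeroDivisionError("0 has no inverse in a field")
    return pow(b, -1, P)


def div(a, b):
    """Division defined as a * b^(-1)."""
    return mul(a, inv(b))


print(add(15, 20), sub(15, 20), mul(15, 20), div(15, 20))
```

Division is the one operation that needs a prime modulus: with a composite modulus, some nonzero elements have no inverse and `pow(b, -1, P)` raises `ValueError`.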
2.3 Binary Field
A finite field of order 2^m is called a binary field, and the integer m is called the degree of the field. There are several ways to construct GF(2^m). For example, the elements of GF(2^m) can be represented by the 2^m possible bit strings of length m:
GF(2) = {0, 1}
GF(2^3) = {000, 001, 010, 011, 100, 101, 110, 111}
Addition and multiplication are as follows:
Addition:
0 + 0 = 0, 0 + 1 = 1, 1 + 1 = 0
The addition of two elements is performed by bitwise addition modulo 2.
For example, (11000) + (10111) = (01111)
Multiplication:
0 ∗ 0 = 0, 0 ∗ 1 = 0, 1 ∗ 1 = 1
Alternatively, GF(2^m) can be constructed in two other ways: using a polynomial basis representation or using a normal basis representation. We introduce polynomial basis representations in the next chapter.
GF(2) is the simplest finite field; it consists of only two elements, 0 and 1. Addition and multiplication in GF(2) are performed modulo 2. Consequently, addition is equivalent to logical XOR, and multiplication is equivalent to logical AND.
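The correspondence between GF(2) arithmetic and logic gates can be illustrated with a short sketch. The function names below are hypothetical and only serve the example; bit strings are written with the most significant bit first.

```python
def gf2_add(a, b):
    """Addition in GF(2): identical to logical XOR on single bits."""
    return a ^ b


def gf2_mul(a, b):
    """Multiplication in GF(2): identical to logical AND on single bits."""
    return a & b


def add_bitstrings(u, v):
    """Add two equal-length bit strings by bitwise addition modulo 2."""
    if len(u) != len(v):
        raise ValueError("operands must have the same length")
    width = len(u)
    return format(int(u, 2) ^ int(v, 2), "0{}b".format(width))


print(add_bitstrings("11000", "10111"))  # prints 01111
```

Because addition is a plain XOR, there is no carry propagation between bit positions, which is one reason binary fields map well onto hardware.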
2.4 Extension Field
Let A be a field and let B be a subfield of A. Then A is an extension field of B, denoted A/B. If C is a subset of A, then B(C) denotes the smallest subfield of A that contains both B and C. For instance, the real numbers are an extension field of the rational numbers, and the real




Polynomial basis multiplication over
GF (2m)
3.1 Polynomials
Let F be a field, and let the elements $a_n, a_{n-1}, a_{n-2}, \dots, a_1, a_0$ belong to F. Then
$$f(x) = a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0$$
is called a polynomial of degree n over F. The element $a_i$ is called the coefficient of $x^i$ in f(x), and $a_n \neq 0$. Two polynomials are equal if and only if the coefficients of each power of x are equal.
A polynomial ring R[x] is the ring formed by the set of polynomials in one or more variables (such as x) with coefficients in another ring R (or field). For example, the polynomials in x over a field P, $f(x) = a_n x^n + a_{n-1} x^{n-1} + \dots + a_2 x^2 + a_1 x + a_0$, where $a_n, a_{n-1}, a_{n-2}, \dots, a_1, a_0$ are coefficients that are elements of P, form a ring called the polynomial ring P[x].
Polynomials can be equipped with arithmetic operations. Two standard operations for polynomials are addition and multiplication. Here is an example. Let $A(x) = x^4 + 3x^2 + 2$ and $B(x) = 3x^2 + x + 4$ be elements of the polynomial ring $\mathbb{Z}_4[x]$, whose coefficients are integers modulo 4 (so the constant term 4 of B(x) is equivalent to 0). The product and the sum of these two polynomials are:
$$A(x) \cdot B(x) = 3x^6 + x^5 + x^4 + 3x^3 + 2x^2 + 2x$$
$$A(x) + B(x) = x^4 + 2x^2 + x + 2$$
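The example can be reproduced with a small sketch. It assumes that the coefficients are reduced modulo 4, which is the reading under which the stated sum is obtained; the coefficient-list layout (lowest degree first) and the function names are choices made for this illustration.

```python
def poly_mul(a, b, n):
    """Multiply two coefficient lists (lowest degree first), coefficients mod n."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % n
    return out


def poly_add(a, b, n):
    """Add two coefficient lists (lowest degree first), coefficients mod n."""
    size = max(len(a), len(b))
    a = a + [0] * (size - len(a))
    b = b + [0] * (size - len(b))
    return [(x + y) % n for x, y in zip(a, b)]


A = [2, 0, 3, 0, 1]  # x^4 + 3x^2 + 2
B = [4, 1, 3]        # 3x^2 + x + 4

print(poly_mul(A, B, 4))  # coefficients of x^0 .. x^6
print(poly_add(A, B, 4))  # coefficients of x^0 .. x^4
```

Over the integers the product would be $3x^6 + x^5 + 13x^4 + 3x^3 + 18x^2 + 2x + 8$; reducing each coefficient modulo 4 gives the lists printed above.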
3.2 Polynomial basis representation over GF (2m)
Using a polynomial basis representation to construct the binary field $GF(2^m)$ means that the elements of $GF(2^m)$ are binary polynomials of degree at most m − 1, whose coefficients are in GF(2) = {0, 1}. That is, the elements of $GF(2^m)$ are the polynomials $\{0, 1, x, x+1, x^2, x^2+1, \dots, x^{m-1} + x^{m-2} + \dots + x + 1\}$, where x is a root of an irreducible polynomial f(α) of degree m over GF(2), i.e., f(x) = 0.
An example shows the elements of such a field explicitly. The elements of the finite field $GF(2^3)$ are as follows:

Element   Polynomial      Vector
x0        0               (0,0,0)
x1        1               (0,0,1)
x2        x               (0,1,0)
x3        x + 1           (0,1,1)
x4        x^2             (1,0,0)
x5        x^2 + 1         (1,0,1)
x6        x^2 + x         (1,1,0)
x7        x^2 + x + 1     (1,1,1)
3.3 Irreducible polynomial
3.3.1 Definition of irreducible polynomial
A polynomial f(x) is irreducible if it cannot be factored into the product of two non-constant polynomials over the same field ($f(x) \neq g(x)h(x)$). Whether a polynomial can be factored depends on the field or ring to which its coefficients are considered to belong. For instance, the polynomial $f(x) = x^2 - 2$ is irreducible if the coefficients 1 and −2 are considered to belong to the integers. But if we consider







3.3.2 Important irreducible polynomial based on finite field
Irreducible polynomials are closely related to primitive polynomials: every primitive polynomial is irreducible, although not every irreducible polynomial is primitive. An irreducible polynomial can be used to represent the elements of a finite field. For example, let
$$f(x) = x^m + f_{m-1}x^{m-1} + f_{m-2}x^{m-2} + \dots + f_1 x + f_0$$
be an irreducible polynomial of degree m over GF(2). Let $\alpha \in GF(2^m)$ be a root of this irreducible polynomial; then the set $\{1, \alpha, \alpha^2, \dots, \alpha^{m-2}, \alpha^{m-1}\}$ constitutes the polynomial basis of $GF(2^m)$. This basis can be used to represent the elements of $GF(2^m)$ as polynomials of degree at most m − 1:
$$GF(2^m) = \{a(x) \mid a(x) = a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \dots + a_2x^2 + a_1x + a_0,\ a_j \in GF(2)\}.$$
There are several different kinds of irreducible polynomials; all-one polynomials (AOPs), trinomials, and pentanomials are usually chosen to represent the elements of a finite field. In this thesis, our design is based on AOPs and trinomials.
All One Polynomials(AOP)
An all-one polynomial (AOP) is a polynomial over the finite field GF(2) in which all coefficients are one. An AOP of degree m is irreducible if and only if m + 1 is prime and 2 is a primitive root modulo m + 1. For example, the value of m should be m ∈ {2, 4, 10, 12, 18, ...}.
$$f(x) = \sum_{i=0}^{m} x^i = x^m + x^{m-1} + x^{m-2} + \dots + x^2 + x + 1 \tag{3.1}$$
The degree is m (m + 1 should be a prime number). An AOP has a simple form, so it is often used to define efficient algorithms and implement multiplication.
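The irreducibility condition for an AOP can be tested directly. The following is a minimal sketch under the assumption that the condition is "m + 1 is prime and 2 has multiplicative order m modulo m + 1"; the helper names are illustrative, and the trial-division primality test is only suitable for small m.

```python
def is_prime(n):
    """Trial-division primality test (adequate for small n)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def multiplicative_order(a, n):
    """Smallest k >= 1 with a**k == 1 (mod n); assumes gcd(a, n) == 1."""
    k, x = 1, a % n
    while x != 1:
        x = (x * a) % n
        k += 1
    return k


def aop_is_irreducible(m):
    """Check whether x^m + ... + x + 1 is irreducible over GF(2), for m >= 2."""
    p = m + 1
    return p > 2 and is_prime(p) and multiplicative_order(2, p) == m


print([m for m in range(2, 40) if aop_is_irreducible(m)])
```

Note that m = 6 fails the test even though 7 is prime, because 2 has order 3 modulo 7. The value m = 162 used later in this thesis passes it.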
Trinomial
A trinomial is a polynomial consisting of three non-zero terms. Trinomials are common in mathematics, for example:
• 5x + y + 6z, where x, y, z are variables
• m^2 + 4n + p, where m, n, p are variables
• Ax^m + Bx^n + Cx^p, where x is a variable and m, n, p are nonnegative integers
Trinomials over finite fields are widely applied in different areas, for example to implement multipliers over finite fields, which are the core components of elliptic curve cryptography (ECC). The form of a trinomial over $GF(2^m)$ is
$$f(x) = x^m + x^k + 1$$
The National Institute of Standards and Technology (NIST) [15] has recommended five binary finite fields for ECC implementation. Two of these binary fields are generated by trinomials: $f(x) = x^{233} + x^{74} + 1$ and $f(x) = x^{409} + x^{87} + 1$.
3.4 Polynomial Basis Multiplication Over GF (2m)
Let f(x) be an irreducible polynomial of degree m over GF(2):
$$f(x) = x^m + f_{m-1}x^{m-1} + f_{m-2}x^{m-2} + \dots + f_2x^2 + f_1x + f_0, \quad f_i \in GF(2) = \{0, 1\}$$
The set $\{1, x, x^2, \dots, x^{m-2}, x^{m-1}\}$ is the polynomial basis of the finite field $GF(2^m)$. Comparing this set with the form of a polynomial, we can represent every element of the field as a polynomial
$$\beta_{m-1}x^{m-1} + \beta_{m-2}x^{m-2} + \dots + \beta_2x^2 + \beta_1x + \beta_0,$$
where $\beta_i \in GF(2)$.
Let a(x) and b(x) be two field elements, and let c(x) be their product; then
$$c(x) = a(x)b(x) \bmod f(x)$$
Polynomial basis multiplication therefore consists of two steps: polynomial multiplication, followed by reduction modulo an irreducible polynomial. The product d(x) = a(x)b(x) is a polynomial of degree at most 2m − 2; the modular reduction c(x) = d(x) mod f(x) then reduces d(x) by the degree-m irreducible polynomial f(x) to a polynomial of degree at most m − 1. The multiplication process can be represented in matrix form as shown in Fig. 3.1 and Fig. 3.2.
The process of C = A · B mod f(x) can be represented as shown in Fig. 3.3.
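The two steps, polynomial multiplication followed by modular reduction, can be sketched in software as follows. This is an illustrative model rather than the hardware structure of this thesis: field elements are stored as Python integers whose bit i holds the coefficient of $x^i$, and the function names and the small example polynomial $x^3 + x + 1$ are assumptions made for the example.

```python
def clmul(a, b):
    """Step 1: carry-less product d(x) = a(x) b(x) over GF(2)."""
    d = 0
    while b:
        if b & 1:
            d ^= a      # add (XOR) the shifted copy of a(x)
        a <<= 1         # multiply a(x) by x
        b >>= 1
    return d


def poly_mod(d, f):
    """Step 2: reduce d(x) modulo f(x) over GF(2)."""
    m = f.bit_length() - 1              # degree of f(x)
    while d.bit_length() > m:           # while deg d(x) >= m
        d ^= f << (d.bit_length() - 1 - m)
    return d


def gf_mul(a, b, f):
    """c(x) = a(x) b(x) mod f(x)."""
    return poly_mod(clmul(a, b), f)


F = 0b1011  # f(x) = x^3 + x + 1, irreducible over GF(2)
print(bin(gf_mul(0b110, 0b011, F)))  # (x^2 + x)(x + 1) mod f(x)
```

In this example $(x^2 + x)(x + 1) = x^3 + x$, and since $x^3 = x + 1$ modulo f(x), the result reduces to 1, so the two operands are inverses of each other in $GF(2^3)$.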
Figure 3.1: d(x) = a(x)b(x)
Figure 3.2: c(x) = d(x) mod f(x)
Figure 3.3: C = A · B mod f(x)
3.4.1 AOP Basis Multiplication Over GF(2^m)
Let f(α) be an irreducible AOP of degree m over GF(2):
$$f(\alpha) = \alpha^m + \alpha^{m-1} + \alpha^{m-2} + \dots + \alpha^2 + \alpha + 1,$$
where m + 1 is a prime number and 2 is a primitive root modulo m + 1. Assume x is a root of f(α), i.e., f(x) = 0; then
$$f(x) + x f(x) = 0,$$
that is,
$$(x^m + x^{m-1} + \dots + x + 1) + x(x^m + x^{m-1} + \dots + x + 1) = x^m + x^{m-1} + \dots + x + 1 + x^{m+1} + x^m + \dots + x^2 + x. \tag{1}$$
Because 1 + 1 = 0 in GF(2), every term that appears twice in (1) cancels, leaving
$$x^{m+1} + 1 = 0,$$
and therefore
$$x^{m+1} = 1.$$
We define {xm+1, xm, xm−1, ..., x2, x, 1} as the extended polynomial basis. For any ele-















where $a_j, b_j, c_j \in GF(2)$.
Let C be the product of A and B; then we have











j · A mod f(x)) (3.6)







Then $A^{i+1}$ can be expressed in terms of $A^i$ as
$$A^{i+1} = x \cdot A^i \bmod f(x) = (a^i_0 x + a^i_1 x^2 + \dots + a^i_m x^{m+1}) \bmod f(x)$$






j−1(1 ≤ j ≤ m− 1)
From the above equations we also obtain
$$a^{i+l}_j = \begin{cases} a^i_{m-l+j+1} & \text{if } 0 \le j \le l-1 \\ a^i_{j-l} & \text{otherwise} \end{cases}$$
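The shift relation above means that multiplying by x in the extended basis is a cyclic shift of the m + 1 coefficients. The following is a minimal software sketch of that idea, assuming the extended basis $\{1, x, \dots, x^m\}$ with $x^{m+1} = 1$; operands are Python integers with bit j holding coefficient $a_j$, and all function names are hypothetical. It models the arithmetic only, not the systolic structure.

```python
def aop_mul_extended(a, b, m):
    """Multiply in the extended basis: a(x) b(x) mod (x^(m+1) + 1)."""
    mask = (1 << (m + 1)) - 1
    c = 0
    for j in range(m + 1):
        if (b >> j) & 1:
            c ^= a                          # accumulate b_j * (x^j * A)
        a = ((a << 1) & mask) | (a >> m)    # x * A: cyclic shift by one position
    return c


def to_polynomial_basis(c, m):
    """Remove the x^m term using x^m = x^(m-1) + ... + x + 1."""
    if (c >> m) & 1:
        c ^= (1 << (m + 1)) - 1             # add the AOP f(x) itself
    return c


def reference_mul(a, b, m):
    """Schoolbook product followed by reduction modulo the degree-m AOP."""
    f = (1 << (m + 1)) - 1
    d = 0
    for j in range(m):
        if (b >> j) & 1:
            d ^= a << j
    while d.bit_length() > m:
        d ^= f << (d.bit_length() - 1 - m)
    return d


m = 4  # x^4 + x^3 + x^2 + x + 1 is an irreducible AOP
a, b = 0b1011, 0b0110
print(to_polynomial_basis(aop_mul_extended(a, b, m), m) == reference_mul(a, b, m))
```

The extended-basis product needs no XOR-based reduction inside the loop, only rewiring; a single conditional XOR with f(x) at the end maps the result back to the ordinary polynomial basis.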
3.4.2 Trinomial Basis Multiplication Over GF (2m)
Consider a finite field $GF(2^m)$. The general form of an irreducible polynomial of degree m is
$$f(x) = x^m + f_{m-1}x^{m-1} + f_{m-2}x^{m-2} + \dots + f_2x^2 + f_1x + f_0 \tag{3.8}$$
where $f_i \in GF(2) = \{0, 1\}$ for $1 \le i \le m-1$, and $f_0 = 1$ because f(x) is irreducible.
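Assume α is a root of the polynomial f(x). The polynomial basis $\{1, \alpha, \alpha^2, \dots, \alpha^{m-2}, \alpha^{m-1}\}$ is defined by this irreducible polynomial. Let A and B be two elements of the finite field $GF(2^m)$:
$$A = \sum_{j=0}^{m-1} a_j \alpha^j \tag{3.9}$$
$$B = \sum_{j=0}^{m-1} b_j \alpha^j \tag{3.10}$$
for which $a_j, b_j \in GF(2)$ and $0 \le j \le m-1$.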
Let C be the product of A and B:
$$C = A \cdot B \bmod f(x) \tag{3.11}$$
$$C = \sum_{i=0}^{m-1} b_i \cdot (\alpha^i \cdot A \bmod f(x)) \tag{3.12}$$
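Let $Q_i = b_i A^i$, with $A^0 = A$ and $A^i = \alpha^i \cdot A \bmod f(x)$ (the superscript i is an index, not an exponent); then equation (3.12) can be written as
$$C = \sum_{i=0}^{m-1} Q_i \tag{3.13}$$
$$C = Q_0 + Q_1 + Q_2 + \dots + Q_{m-3} + Q_{m-2} + Q_{m-1}, \tag{3.14}$$
$$Q_i = b_i A^i, \quad Q_{i+1} = b_{i+1} A^{i+1} \tag{3.15}$$
$A^{i+1}$ can be obtained from $A^i$ recursively as:
$$A^{i+1} = \alpha \cdot A^i \bmod f(x) \tag{3.16}$$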
Equation (3.14) describes the addition of the reduced polynomials, equation (3.15) describes the partial product generation, and equation (3.16) describes the modular reduction. The additions can be implemented with XOR gates.
Let $A^i$ be written as
$$A^i = [a^i_0 + a^i_1\alpha + a^i_2\alpha^2 + \dots + a^i_{m-2}\alpha^{m-2} + a^i_{m-1}\alpha^{m-1}] \bmod f(x) \tag{3.17}$$
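Therefore, the right side of equation (3.16) can be expanded as
$$A^{i+1} = \alpha \cdot A^i \bmod f(x) \tag{3.18}$$
$$= [\alpha \cdot (a^i_0 + a^i_1\alpha + a^i_2\alpha^2 + \dots + a^i_{m-2}\alpha^{m-2} + a^i_{m-1}\alpha^{m-1})] \bmod f(x) \tag{3.19}$$
$$= [a^i_0\alpha + a^i_1\alpha^2 + a^i_2\alpha^3 + \dots + a^i_{m-2}\alpha^{m-1} + a^i_{m-1}\alpha^{m}] \bmod f(x) \tag{3.20}$$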




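Recall that α is a root of $f(x) = x^m + f_{m-1}x^{m-1} + f_{m-2}x^{m-2} + \dots + f_2x^2 + f_1x + f_0$, where $f_i \in GF(2) = \{0, 1\}$ for $1 \le i \le m-1$.
Thus,
$$f(\alpha) = \alpha^m + f_{m-1}\alpha^{m-1} + f_{m-2}\alpha^{m-2} + \dots + f_2\alpha^2 + f_1\alpha + f_0 = 0 \tag{3.21}$$
$$\alpha^m = f_{m-1}\alpha^{m-1} + f_{m-2}\alpha^{m-2} + \dots + f_2\alpha^2 + f_1\alpha + f_0 \tag{3.22}$$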
Substituting equation (3.22) into equation (3.20), we obtain
$$A^{i+1} = [a^{i+1}_0 + a^{i+1}_1\alpha + \dots + a^{i+1}_{m-2}\alpha^{m-2} + a^{i+1}_{m-1}\alpha^{m-1}] \bmod f(x) \tag{3.23}$$
Comparing equation (3.23) with equation (3.17), we can observe the transition from $A^i$ to $A^{i+1}$:
$$a^{i+1}_0 = f_0 \cdot a^i_{m-1}, \tag{3.24}$$
$$a^{i+1}_j = a^i_{j-1} \oplus f_j \cdot a^i_{m-1}, \tag{3.25}$$
where j = 1, 2, 3, ..., m − 1.
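In this thesis, the multiplier over the finite field $GF(2^m)$ that we design is based on a trinomial. Assume an irreducible trinomial of degree m of the form
$$f(x) = x^m + x^k + 1 \tag{3.26}$$
Comparing equations (3.26) and (3.8), we find
$$f_j = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } 1 \le j \le m-1 \text{ and } j \ne k \end{cases}$$
so that the recursion becomes
$$a^{i+1}_0 = a^i_{m-1}, \tag{3.27}$$
$$a^{i+1}_k = a^i_{k-1} \oplus a^i_{m-1}, \tag{3.28}$$
$$a^{i+1}_j = a^i_{j-1}, \quad \text{for } 1 \le j \le m-1 \text{ and } j \ne k \tag{3.29}$$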
The logic equations (3.27) to (3.29) describe the recursive modular reduction from $A^i$ to $A^{i+1}$. Furthermore, we can extend these equations to higher-order modular reduction. Suppose we need to go from $A^i$ to $A^{i+l}$, where $1 \le l \le m-k$ (and $l \le k$, so that the first two index ranges below do not overlap). Then the logic equations can be written as
$$a^{i+l}_j = \begin{cases} a^i_{m-l+j} & \text{for } 0 \le j \le l-1 \\ a^i_{j-l} \oplus a^i_{m-l+j-k} & \text{for } k \le j \le k+l-1 \\ a^i_{j-l} & \text{otherwise} \end{cases}$$
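The recursion for trinomials can be exercised with a short software model. This is a sketch under stated assumptions, not the systolic design of this thesis: coefficients are stored in Python lists with index j holding $a_j$, the function names are hypothetical, and `step_l` encodes the l-step update obtained by iterating the single-step rule, assuming $l \le \min(k, m-k)$.

```python
def step(a, m, k):
    """One reduction step A^(i+1) = alpha * A^i mod (x^m + x^k + 1), as in (3.27)-(3.29)."""
    top = a[m - 1]
    nxt = [top] + a[:m - 1]   # a_0 <- a_(m-1), a_j <- a_(j-1)
    nxt[k] ^= top             # a_k <- a_(k-1) XOR a_(m-1)
    return nxt


def step_l(a, m, k, l):
    """l reduction steps at once; assumes 1 <= l <= min(k, m - k)."""
    out = []
    for j in range(m):
        if j <= l - 1:
            out.append(a[m - l + j])
        elif k <= j <= k + l - 1:
            out.append(a[j - l] ^ a[m - l + j - k])
        else:
            out.append(a[j - l])
    return out


def trinomial_mul(a, b, m, k):
    """C = sum of Q_i with Q_i = b_i * A^i, where A^(i+1) = step(A^i)."""
    c = [0] * m
    ai = list(a)
    for i in range(m):
        if b[i]:
            c = [x ^ y for x, y in zip(c, ai)]   # accumulate the partial product
        ai = step(ai, m, k)                      # modular reduction step
    return c


def reference_mul(a, b, m, k):
    """Schoolbook product over GF(2) followed by reduction modulo x^m + x^k + 1."""
    x = sum(bit << j for j, bit in enumerate(a))
    y = sum(bit << j for j, bit in enumerate(b))
    f = (1 << m) | (1 << k) | 1
    d = 0
    while y:
        if y & 1:
            d ^= x
        x <<= 1
        y >>= 1
    while d.bit_length() > m:
        d ^= f << (d.bit_length() - 1 - m)
    return [(d >> j) & 1 for j in range(m)]


a = [1, 0, 1, 1, 0]   # 1 + x^2 + x^3
b = [0, 1, 1, 0, 0]   # x + x^2
print(trinomial_mul(a, b, 5, 2) == reference_mul(a, b, 5, 2))
```

Each call to `step` uses a single XOR regardless of m, which reflects why trinomials lead to simple reduction logic; the same functions apply unchanged to m = 233, k = 74.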
3.5 Existing Research on Polynomial Basis Multiplication over Finite Field GF(2^m)
The most important application of finite field arithmetic is cryptography, such as Elliptic
Curve Cryptography (ECC) and pairing-based cryptography. The polynomial basis multiplier
over a finite field is the main component of ECC.
Such a multiplier is easy to implement on a software platform. In practical applications
such as credit cards and cellphones, however, it is not easy to embed a software platform
in the device. In addition, a software platform has difficulty meeting the requirements
of speed- and time-critical systems. To satisfy these requirements, cryptographic systems
need hardware implementations.
A hardware implementation has two key concerns: area and timing. We prefer devices that
are faster and lighter, such as cellphones. One important reason for the popularity of the
iPhone, for example, is that it is faster than most other cellphones. A device built from
a large number of components has a larger area and a higher cost, so a profitable design
must save resources. High speed and small area have therefore become two important
design demands.
With the development of technologies and algorithms, several efficient finite field
multipliers have been realized. Four main structures are used: (1) parallel-in serial-out
(PISO), (2) serial-in parallel-out (SIPO), (3) parallel-in parallel-out, and (4)
serial/parallel structures. The parallel-in serial-out structure has a smaller area but
low throughput. The serial-in parallel-out structure has higher throughput but needs a
larger area. Hardware implementations usually have critical requirements on both timing
and area, and some works achieve a good balance between the two. These works can be
divided into two types: systolic structures and non-systolic structures. A systolic
structure can meet critical timing requirements and has the features of regularity and
modularity. In addition, all processing elements (PEs) of a systolic design are fully
pipelined to produce a very high throughput rate. A non-systolic design usually has low
latency, but it also has low throughput. The corresponding figures are shown below.
Figure 3.4: systolic array
Figure 3.5: serial in parallel out structure
Figure 3.6: parallel in serial out structure
3.5.1 Existing Example for Trinomial
In paper [7], the authors introduce systolic and super-systolic multipliers for GF(2^m)
based on trinomials. The paper analyses what a systolic structure is and how to implement
trinomial-based multiplication. By combining appropriate cut-set retiming with a bit-level
pipelined systolic structure, the latency can be decreased.
3.5.2 Existing Example for AOP
An AOP has a simpler form, so it is attractive for implementing simple structures, but an
AOP-based multiplier is not preferred as a component for ECC. In paper [16], the authors
present an efficient recursive formulation and use a systolic structure to implement the
AOP-based multiplier. This new multiplier is realized in an efficient way that can






In this section, we briefly review the AOP-based multiplication algorithm first, and then
present our proposed architectures based on the existing structures.
4.1 Review of AOP Multiplication Algorithm [24]
For simplicity of discussion, let $f(\alpha) = \alpha^{k} + \alpha^{k-1} + \cdots + \alpha + 1$ be an irreducible AOP
of degree $k$ over $GF(2)$ (where $k+1$ is prime and 2 is a primitive root modulo $k+1$). For
any $x \in GF(2^k)$ that is a root of $f(\alpha) = 0$, we have
\[ f(x) + x f(x) = (x^{k} + x^{k-1} + \cdots + x + 1) + x(x^{k} + x^{k-1} + \cdots + x + 1) = x^{k+1} + 1 = 0, \tag{4.1} \]
and then we have
\[ x^{k+1} = 1. \tag{4.2} \]
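Then, let $\{x^{k}, x^{k-1}, \ldots, x, 1\}$ be the extended polynomial basis [28]. For any
$A, B, C \in GF(2^k)$, we have
\[ A = \sum_{j=0}^{k} a_j x^{j}, \qquad B = \sum_{j=0}^{k} b_j x^{j}, \qquad C = \sum_{j=0}^{k} c_j x^{j}, \]
where $a_j, b_j, c_j \in GF(2)$ for $0 \le j \le k-1$, and $a_k = 0$, $b_k = 0$, and $c_k = 0$.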
Let us define C as the product of A and B, and then we have
\[ C = A \cdot B \bmod f(x), \tag{4.6} \]
which can be expressed as
\[ C = \sum_{j=0}^{k} b_j \cdot (x^{j} \cdot A \bmod f(x)). \tag{4.7} \]
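Let us define $A^{0} = A$ and $A^{i} = x^{i} \cdot A \bmod f(x)$, such that $A^{i+1}$ can be obtained from $A^{i}$
as
\[ A^{i+1} = x \cdot A^{i} \bmod f(x). \tag{4.8} \]
Then, we have
\[ A^{i+1} = \left(a^{i}_{0}x + a^{i}_{1}x^{2} + \cdots + a^{i}_{k-1}x^{k} + a^{i}_{k}x^{k+1}\right) \bmod f(x), \]
which, using (4.2), can be written as
\[ A^{i+1} = a^{i+1}_{0} + a^{i+1}_{1}x + \cdots + a^{i+1}_{k}x^{k}, \]
where
\[ a^{i+1}_{0} = a^{i}_{k}, \qquad a^{i+1}_{j} = a^{i}_{j-1}, \quad \text{for } 1 \le j \le k. \tag{4.12} \]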
Figure 4.1: cut-set retiming of the SFG with TA + TX critical path
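One can also extend this to obtain $A^{i+l}$ from $A^{i}$ for $1 \le l \le k$, such that
\[ a^{i+l}_{j} = \begin{cases} a^{i}_{k+1-l+j} & \text{for } 0 \le j \le l-1 \\ a^{i}_{j-l} & \text{for } l \le j \le k \end{cases} \]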
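Because x^(k+1) = 1, multiplication by x in the extended basis is a cyclic shift of the k + 1 coefficients, and the product is an XOR of shifted copies of A selected by the bits of B. The sketch below illustrates this under stated assumptions: the names `aop_multiply` and `to_standard_basis` are hypothetical, operands are lists of k + 1 bits with index j holding the coefficient of x^j, and the final fold back to k coefficients uses x^k = x^(k-1) + ... + x + 1.

```python
def aop_multiply(a, b):
    """Multiply in the extended basis {1, x, ..., x^k}, using x^(k+1) = 1.

    a and b are lists of k+1 bits; index j holds the coefficient of x^j.
    """
    shifted = list(a)                 # A^0 = A
    c = [0] * len(a)
    for bit in b:                     # (4.7): accumulate b_j * A^j
        if bit:
            c = [ci ^ si for ci, si in zip(c, shifted)]
        shifted = [shifted[-1]] + shifted[:-1]   # (4.12): cyclic shift gives A^(j+1)
    return c


def to_standard_basis(c):
    """Fold the x^k coefficient back using x^k = x^(k-1) + ... + x + 1."""
    return [cj ^ c[-1] for cj in c[:-1]]


if __name__ == "__main__":
    # k = 4: f(x) = x^4 + x^3 + x^2 + x + 1 (k + 1 = 5 is prime)
    a = [1, 1, 0, 0, 0]   # 1 + x, with a_k = 0
    b = [0, 1, 0, 1, 0]   # x + x^3, with b_k = 0
    print(to_standard_basis(aop_multiply(a, b)))
```

No reduction logic appears inside the accumulation loop, which is why each processing element in the systolic arrays needs only an AND (or NAND) and an XOR (or XNOR) per bit.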





4.2 Existing Systolic Structures
The conventional systolic structure based on the algorithm in Section 4.1 is shown in
Fig. 4.2 (structure-I: S-I). It consists of (k + 1) PEs of three types: PE-1, the regular
PE, and PE-2. The internal structures of these PEs are shown in Figs. 4.2(b), (c), and
(d), respectively, where BSC denotes the bit-shifting cell. The latency of the structure
in Fig. 4.2 is (k + 1) cycles, where the duration of each cycle period is $T_A + T_X$
($T_A$ and $T_X$ refer to the delay of an AND gate and an XOR gate, respectively). The
cut-set retiming that yields the critical path $T_A + T_X$ is shown in Fig. 4.1. A recent
work has presented a systolic structure with a lower critical-path delay (only $T_X$) [24],
shown in Fig. 4.4 (structure-II: S-II). The entire structure contains (k + 2) PEs, whose
internal structures are shown in Figs. 4.4(b), (c), (d), and (e), respectively. The
latency of this structure is (k + 2) cycles (critical-path delay: $T_X$). The cut-set
retiming that yields the critical path $\max\{T_A, T_X\}$ is shown in Fig. 4.3; since usually
$T_X > T_A$, the critical path is $T_X$.
4.3 Modified Low Register-Complexity Structures
For the structures of Figs. 4.2 and 4.4, we find that $k^2$ registers in the PEs pipeline
identical data (in shifted order) to the neighboring PEs. These registers can be removed
Figure 4.2: Conventional systolic structure of AOP-based multiplication (structure-I:
S-I), where BSC denotes the bit-shifting cell and the black box denotes the registers.
(a) Structure. (b) Internal structure of PE-1. (c) Internal structure of regular PE. (d)
Internal structure of PE-2.
Figure 4.3: cut-set retiming of the SFG with max{TA, TX} critical path
if we change the broadcasting strategy. As shown in Figs. 4.5 and 4.6, i.e., MS-I and
MS-II, a shifted connection strategy is used in which the input A is fed directly to each
PE, which eliminates the registers required previously. The details of the shifted
connection are also shown in Figs. 4.5 and 4.6. To reduce the complexity further, we have
used NAND and XNOR gates to replace the original AND and XOR gates, since NAND and XNOR
gates are faster than AND and XOR gates, as depicted in [7] and [11] (the critical path is
then shortened to $T_{NA} + T_{XN}$, where $T_{NA}$ and $T_{XN}$ represent the delays of the
NAND and XNOR gates, respectively). It is noted that for AOP-based multiplication,
the last PE (inside the dotted box) can be removed since $b_k = 0$. The modified structures
involve nearly the same time complexity as the previous ones, but the register complexity
is significantly reduced.
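One way to see why the gate substitution can leave the computed result unchanged is that the inversion introduced by the NAND is cancelled by the inversion of the XNOR. The sketch below checks this identity exhaustively for a single bit slice; the function names are hypothetical, and the exact gate arrangement inside the thesis PEs (Figs. 4.5 and 4.6) is not reproduced here.

```python
from itertools import product


def pe_and_xor(a, b, c_in):
    """Bit slice of an original PE: (a AND b) XOR c_in."""
    return (a & b) ^ c_in


def pe_nand_xnor(a, b, c_in):
    """Bit slice with the substituted gates: XNOR(NAND(a, b), c_in)."""
    nand = 1 - (a & b)
    return 1 - (nand ^ c_in)


if __name__ == "__main__":
    same = all(pe_and_xor(a, b, c) == pe_nand_xnor(a, b, c)
               for a, b, c in product((0, 1), repeat=3))
    print(same)
```

Since the two bit slices agree on all eight input combinations, a chain of such slices accumulates the same partial sums as the AND/XOR version.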
Figure 4.4: Existing low critical-path structure of [24] for AOP-based multiplication
(structure-II: S-II), where the black box denotes the registers. (a) Structure. (b) Internal
structure of PE-1. (c) Internal structure of PE-2. (d) Internal structure of regular PE.
(e) Internal structure of PE-3. (f) Internal structure of PE-4
Figure 4.5: Modified structure-I (MS-I), where the black box denotes the registers. For
AOP implementation, we can remove the PE inside the red-dotted area since bk = 0, but
for the formation of standard computation core, this PE will be preserved. (a) MS-I. (b)
Internal structure of PE-1. (c) Internal structure of regular PE.
4.4 Low Latency Implementations
For practical applications, we can further reduce the latencies of structures shown in Figs.
4.5 and 4.6, for k + 1 = pq + f , where 0 ≤ f ≤ q. Without loss of generality, we assume
Figure 4.6: Modified structure-II (MS-II), where the black box denotes the registers. For
AOP implementation, we can remove the PE inside the red-dotted area since bk = 0, but
for the formation of standard computation core, this PE will be preserved. (a) MS-II.
(b) Internal structure of PE-1. (c) Internal structure of PE-2. (d) Internal structure of
regular PE. (e) Internal structure of PE-3.
Figure 4.7: Low latency implementation of systolic structure, where the internal PEs can
be those of MS-I or MS-II.
f = 0, and then we can decompose the original systolic array of k + 1 PEs into p
parallel arrays to achieve low-latency implementations, as shown in Fig. 4.7. An extra
pipelined adder tree consisting of XNOR gates (m is even) and registers is needed to add
the results from the p arrays together to yield the final result C.
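The decomposition can be modelled in software by splitting the k + 1 partial products into p groups, accumulating each group independently, and combining the group results in a pairwise tree. This is a behavioural sketch only: the name `low_latency_multiply` is hypothetical, plain XOR is used in the tree instead of XNOR gates, an uneven last group is allowed when p does not divide k + 1, and the returned cycle count (array length plus tree depth) is an assumed latency model rather than a synthesized figure.

```python
import math


def xor_vec(u, v):
    return [x ^ y for x, y in zip(u, v)]


def cyclic_shift(a, s):
    """Coefficients of x^s * A in the extended basis (x^(k+1) = 1)."""
    n = len(a)
    s %= n
    return a[n - s:] + a[:n - s]


def low_latency_multiply(a, b, p):
    """Return (product in extended basis, modelled latency in cycles)."""
    n = len(a)
    q = math.ceil(n / p)                       # PEs per array
    partial = []
    for i in range(p):
        acc = [0] * n
        for j in range(i * q, min((i + 1) * q, n)):
            if b[j]:
                acc = xor_vec(acc, cyclic_shift(a, j))
        partial.append(acc)
    levels = 0
    while len(partial) > 1:                    # pairwise adder tree
        paired = [xor_vec(partial[i], partial[i + 1])
                  for i in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:
            paired.append(partial[-1])
        partial = paired
        levels += 1
    return partial[0], q + levels


if __name__ == "__main__":
    a = [1, 0, 1, 1, 0]
    b = [0, 1, 1, 0, 0]
    print(low_latency_multiply(a, b, 1))
    print(low_latency_multiply(a, b, 2))
```

The product is the same for every p; only the modelled latency changes, trading a shorter array for the extra tree levels.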
Figure 4.8: PE design for digit-parallel implementation (d = 2, based on the PEs from
MS-I). (a) Original two neighboring PEs. (b) Combined PE. (c) Internal structure of
previous two PEs. (d) Internal structure of combined PE.
Table 4.1: COMPARISON OF AREA-TIME COMPLEXITIES OF VARIOUS SYSTOLIC AOP-BASED MULTIPLIERS
Design                     | AND     | NAND    | XOR     | XNOR      | Register      | Critical-path delay | Latency
Fig. 4.2 (S-I)             | (k+1)^2 | 0       | k^2 + k | 0         | 2k^2 + 3k + 1 | T_A + T_X           | k + 1
Fig. 4.4 (S-II)            | (k+1)^2 | 0       | k^2 + k | 0         | 3k^2 + 6k + 3 | T_X                 | k + 2
Fig. 4.5 (MS-I)            | 0       | k^2 + k | 0       | k^2 − 1   | k^2 + 2k      | T_NA + T_XN         | k
Fig. 4.6 (MS-II)           | 0       | k^2 + k | 0       | k^2 − 1   | 2k^2 + 2k     | T_XN                | k + 1
Fig. 4.7 (1)               | 0       | k^2 + k | 0       | ≃ k^2 − 1 | ≃ k^2 + 2k    | T_NA + T_XN         | h + log2 p
Digit-parallel (2) (d = 2) | 0       | k^2 + k | 0       | ≃ k^2 − 1 | ≃ k^2/2 + k   | T_NA + T_XN         | (h + log2 p)/2
(1): Based on structure of Fig. 4.5 (MS-I).
(2): Based on low-latency structure of Fig. 4.7 (MS-I).
4.5 Digit-Parallel Structures
We can combine neighboring PEs in a systolic array into one PE to reduce the register
usage further. Fig. 4.8 shows an example of combining two neighboring PEs into one
PE (based on the PEs from MS-I). The critical-path delay of the new PE thus turns
into (TNA + 2TXN). For simplicity, we define the structure based on new PE in Fig.
4.8(b) as a digit-level parallel structure with digit-size d = 2. If we choose the value of
d appropriately, the proposed architecture can achieve the optimal area-time complexity
for specific application environments.
Figure 4.9: AOP-based standard computation core, where the internal PEs can be those
of MS-I or MS-II (the internal structure can be as that of Fig. 4.7, based on specific
application environment).
4.6 Area-Time Complexity
The area-time complexities of the proposed designs in Figs. 4.5, 4.6, 4.7, and 4.8 are
shown in Table 4.1, along with those of the existing and conventional designs of Figs. 4.2
and 4.4. The proposed designs involve significantly lower area-time complexity than the
competing ones, especially in register complexity.
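To give the register column of Table 4.1 a concrete scale, the formulas can be evaluated for a specific field size. The sketch below uses k = 162, the value used for synthesis in Section 4.7; the function name `register_counts` is hypothetical, and the numbers are theoretical register counts from the table, not synthesis results.

```python
def register_counts(k):
    """Register-count formulas from Table 4.1 for the four bit-parallel designs."""
    return {
        "S-I": 2 * k * k + 3 * k + 1,
        "S-II": 3 * k * k + 6 * k + 3,
        "MS-I": k * k + 2 * k,
        "MS-II": 2 * k * k + 2 * k,
    }


if __name__ == "__main__":
    counts = register_counts(162)
    for design, regs in counts.items():
        print(f"{design:6s} {regs:6d}")
    print(f"MS-I / S-I = {counts['MS-I'] / counts['S-I']:.3f}")
```

Under these formulas MS-I needs roughly half the registers of S-I, and MS-II roughly two thirds of those of S-II.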
4.7 FPGA Implementation of Various AOP-based Structures
We have also implemented these AOP-based systolic structures to confirm the efficacy of
the proposed structures. We have synthesized these designs using Xilinx ISE 14.1 on a
Virtex 6 family device with k = 162. The results in terms of area-time-power complexity
are shown in Table 4.2. The proposed structures outperform the existing ones, especially
in area complexity. Since there is only a minor difference between the critical paths
$T_{NA} + T_{XN}$ and $T_{XN}$ on FPGA platforms, the proposed MS-II does not have a
significant advantage over the existing ones. Therefore, the proposed MS-I can be used
more widely than MS-II in practical applications.
Table 4.2: FPGA IMPLEMENTATION RESULTS OF VARIOUS AOP-BASED MULTIPLIERS FOR k = 162
Design           | Area   | Delay (1) | Power | ADP (2)    | PDP (3)
Fig. 4.2 (S-I)   | 53,300 | 154.035   | 2.236 | 8,210,066  | 344.42
Fig. 4.4 (S-II)  | 79,868 | 154.98    | 2.407 | 12,377,943 | 373.04
Fig. 4.5 (MS-I)  | 26,569 | 141.102   | 2.159 | 3,748,939  | 304.64
Fig. 4.6 (MS-II) | 53,138 | 154.035   | 2.23  | 8,185,112  | 343.50
Unit for area: number of slice registers; unit for delay: ns; unit for power: W (power is
estimated at 100 MHz).
(1): Delay = Latency.
(2): ADP: Area-delay product = Area × Delay.
(3): PDP: Power-delay product = Power × Delay.
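The ADP and PDP columns follow directly from the footnote definitions, so they can be recomputed from the area, delay, and power columns. The sketch below does this for the four rows of Table 4.2; the dictionary layout and function names are illustrative choices, and the values are copied from the table.

```python
# (area in slice registers, delay in ns, power in W), copied from Table 4.2
TABLE_4_2 = {
    "S-I": (53300, 154.035, 2.236),
    "S-II": (79868, 154.98, 2.407),
    "MS-I": (26569, 141.102, 2.159),
    "MS-II": (53138, 154.035, 2.23),
}


def area_delay_product(area, delay):
    return area * delay


def power_delay_product(power, delay):
    return power * delay


if __name__ == "__main__":
    for design, (area, delay, power) in TABLE_4_2.items():
        adp = area_delay_product(area, delay)
        pdp = power_delay_product(power, delay)
        print(f"{design:6s} ADP = {adp:13.1f}  PDP = {pdp:7.2f}")
```

Recomputing the products this way is a quick consistency check on the tabulated figures before they are used for comparisons.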
4.8 AOP-based Computation Core
To fully utilize the special property of proposed AOP-based multipliers, we pack the
structure of Fig. 4.7 (or combine with the structure of Fig. 4.8) as a standard computation
core. The standard computation core is shown in Fig. 4.9, which consists of k + 1 input
bits from A, k+1 bits from input B, and k+1 bits of output C. For practical applications
of this standard computation core, we can replace k+1 with any other integer. It is noted




Application of the Proposed
AOP-based Computation Core
In this chapter, we focus on the application of the AOP-based computation core to obtain
a low register-complexity Montgomery multiplication based on trinomials.
5.1 Montgomery Multiplication Algorithm
Montgomery multiplication is a method for fast modular multiplication, introduced by the
American mathematician Peter L. Montgomery in 1985. It replaces the classical modular
product $a \cdot b \bmod N$ with the Montgomery product $a \cdot b \cdot R^{-1} \bmod N$.
Let $f(x)$ be a degree-$m$ irreducible trinomial over $GF(2)$,
\[ f(x) = x^{m} + x^{n} + 1, \tag{5.1} \]
where $1 \le n \le m-1$, such that the Montgomery multiplication is given by [12]
\[ C = A \cdot B \cdot r^{-1} \bmod f(x), \tag{5.2} \]

















where $A = \sum_{j=0}^{m-1} a_j x^{j}$, $B = \sum_{j=0}^{m-1} b_j x^{j}$, and $C = \sum_{j=0}^{m-1} c_j x^{j}$, for $a_j, b_j, c_j \in GF(2)$.
Here $r$ is the Montgomery factor, which satisfies $\gcd(r, f(x)) = 1$ (gcd refers to the
greatest common divisor). Different algorithms select $r$ differently to obtain the
corresponding structures, as shown in [10-12]. In this algorithm, we have chosen
$r = x^{t} = x^{(m-1)/2}$ (for NIST-recommended trinomials, $m$ is an odd number). Then, (5.2)
can be expressed as














bi · A · xi−tmod f(x).
(5.7)
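For $C_1$, we define $A_1^{(0)} = A$, $A_1^{(1)} = A \cdot x^{-1} \bmod f(x)$, $\ldots$, $A_1^{(t)} = A \cdot x^{-t} \bmod f(x)$, such that
\[ A_1^{(i+1)} = A_1^{(i)} \cdot x^{-1} \bmod f(x) \tag{5.8} \]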







































Since $x$ is a root of $f(x) = x^{m} + x^{n} + 1$, we have $x^{m} + x^{n} = 1$ and $x^{m-1} + x^{n-1} = x^{-1}$.




















for 0 ≤ j ≤ m − 2 and j ̸= n. Similarly, for C2, we can define A(0)2 = A, A
(1)
2 =




2 · x mod f(x) (where













2,1x+ · · ·+ a
(i)
2,m−1x



















for 1 ≤ j ≤ m− 1 and j ̸= n.
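The standard process of this section can be summarized in a small software model: multiplying by x^(-1) uses the identity x^(-1) = x^(m-1) + x^(n-1), multiplying by x uses x^m = x^n + 1, and the product is the XOR of the terms b_i · A · x^(i−t). The sketch below is an illustration under stated assumptions, not the proposed hardware algorithm: polynomials are stored as Python integers (bit j is the coefficient of x^j), m is assumed odd with t = (m − 1)/2, and the function names are hypothetical.

```python
def mul_x(a, m, n):
    """a * x mod (x^m + x^n + 1)."""
    a <<= 1
    if (a >> m) & 1:
        a ^= (1 << m) | (1 << n) | 1
    return a


def mul_x_inv(a, m, n):
    """a * x^(-1) mod (x^m + x^n + 1), using x^(-1) = x^(m-1) + x^(n-1)."""
    low = a & 1
    a >>= 1
    if low:
        a ^= (1 << (m - 1)) | (1 << (n - 1))
    return a


def montgomery_multiply(a, b, m, n):
    """C = A * B * x^(-t) mod f(x), with t = (m - 1) // 2 and m odd."""
    t = (m - 1) // 2
    c = a if (b >> t) & 1 else 0          # i = t term: A * b_t
    down = up = a
    for i in range(1, t + 1):
        down = mul_x_inv(down, m, n)      # A * x^(-i)
        up = mul_x(up, m, n)              # A * x^(+i)
        if (b >> (t - i)) & 1:
            c ^= down
        if (b >> (t + i)) & 1:
            c ^= up
    return c


if __name__ == "__main__":
    m, n = 7, 3                           # small example trinomial x^7 + x^3 + 1
    print(bin(montgomery_multiply(0b1011, 0b110, m, n)))
```

Choosing r = x^t with t = (m − 1)/2 splits the m partial products into t terms with negative powers, t terms with positive powers, and the single term A · b_t, so neither chain of shifts is longer than t steps.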
5.2 Proposed Montgomery Multiplication Algorithm
The equations (5.1)-(5.14) represent the standard Montgomery multiplication process. To
facilitate the Montgomery multiplication suitable for employing the proposed AOP-based
computation core, we present the following proposed algorithm: Let xm be an extended













































i (0 ≤ i ≤ m − 1) and a(1)U,ixi (0 ≤ i ≤ n − 1, n + 1 ≤ i ≤ m) can















U,0 + · · · + a
(2)
U,m+1x




































































where xm+1, . . . , xm+t−1 are defined as extended polynomial basis and (applicable to two





















1,j−1 (1 ≤ j ≤ n+ 1),



















(0 ≤ i ≤ n− t, n+1 ≤ i ≤ m+ t− 1) can be chosen, respectively, to construct A(0)1 , A
(1)
1 ,





U , 0) = A
(0)
1 ,
· · · · · · · · ·
ξ(A
(t)




where ξ(·) represents the bit-selection operation. Similarly, for C2, we have
Figure 5.1: Proposed low register-complexity systolic multiplier based on the AOP-based
computation core (MS-I), where the black box denotes the registers. (a) Proposed struc-
ture.
Figure 5.2: Proposed low register-complexity systolic multiplier based on the AOP-based
computation core (MS-I), where the black box denotes the registers (b) Internal structure
of the AOP-based computation core (MS-I, where e = 2). (c) Detailed design of PE-0.





V , 0) = A
(0)
2 ,
· · · · · · · · ·
ξ(A
(t)










































2,m−j (1 ≤ j ≤ t).
(5.23)
Based on the above, (5.6) can be rewritten as














V directly from operand A







V = PCM(A, V ).
(5.25)
Based on (5.15)-(5.25), the proposed Montgomery multiplication algorithm for employing
the AOP-based computation core is thus given by Algorithm 1:
Algorithm 1 Proposed Montgomery multiplication algorithm
Inputs: A and B are the pair of polynomials in GF (2m) to be multiplied.
Output: C = A ·B r−1mod f(x).
1. Initialization step
1.1 r−1 = x−t.







V = PCM(A, V ).
2.1-c. D = Abt.








3.1. C = E.
where Step 2.2 refers to the bit-parallel multiplication process. According to the proposed
algorithm, the operands are generated in the first cycle period and then distributed
into t partial products to be accumulated in a systolic way. This greatly facilitates the
use of the proposed AOP-based computation core, since all involved bits are already
generated by the PCM (the details are given in the following subsection).
5.3 Proposed Low Register-Complexity Systolic Structure Employing the AOP-based Computation Core
The proposed structure based on the proposed Algorithm 1 (employing the proposed
AOP-based computation core) is shown in Fig. 5.1(a). It contains one AOP-based
computation core and three extra PEs. PE-0 yields two outputs (each output with
m + t − 1 bits) to the computation core to be selectively connected with m − 1 input





V ) to the computation core, respectively (m bits of operand A are
shared). The PCM of (31) only takes one XOR delay while the PCM of (27) takes two
XORs’ propagation time. To lower the critical-path delay, we have used two stage XOR
operations to minimize the critical-path delay to one XOR delay (stage-I uses the least
number of XOR gates required by (27), while stage-II realizes the rest of operations of (27)
and (31)), as shown by an example design in Fig. 5.3. PE-1 calculates the multiplication
of operand A and bt according to Algorithm 1, while PE-2 functions as the final addition
to produce the output C.
The internal structure of AOP-based computation core is shown in Fig. 5.2(b), where we
Figure 5.3: Detailed design of two stage XOR operations in PE-0 for trinomial f(x) =
x233 + x74 + 1, where the black box denotes bit-register.
have used PEs from MS-I as internal PEs for e = 2 (one can extend the structure to any
value of e). The computation core contains (2t + 1) PEs, where the detailed designs of
the PEs are shown in Fig. 5.2(d)-(f), respectively. PE-1 performs a multiplication between
one m-bit operand and one bit of operand B and then passes the result to the PE on its
right. The regular PE performs a multiplication between the selected operand and one bit
of operand B. The result of the multiplication is added to the input from the previous PE,
and the sum is passed to the PE on its right. The last PE, PE-2, adds the outputs of the
two systolic arrays and yields the final result.
The duration of maximum cycle period of the proposed multiplier of Fig. 5.1 is (TNA +
TXN) (if we choose the PEs from MS-II, the critical-path delay will be TXN). The
proposed design gives the first output of desired product (t + 3) cycles after the pair of
operands are fed to the structure, while the successive outputs are produced in every
cycle thereafter.
Table 5.1: COMPARISON OF AREA-TIME COMPLEXITIES OF VARIOUS SYSTOLIC MULTIPLIERS BASED ON TRINOMIALS
Design           | AND | NAND | XOR                  | XNOR    | Register           | Latency                      | Critical-path delay
Bit-parallel systolic structures
[7]              | m^2 | 0    | m^2 + m − 1          | 0       | 4m^2 + 2m − 2      | 2m − 1                       | T_A + T_X
[8] (1)          | 0   | m^2  | m^2 − 1              | 0       | 2m^2 − 2m          | m                            | T_NA + T_X
[8] (2)          | 0   | m^2  | m^2 + m − 2√m        | 0       | 2m^2 + m√m − 2m    | 2√m                          | T_NA + T_X
Fig. 4.4 [10]    | 0   | m^2  | < 1.5m^2 + 0.5m + 1  | 0       | 1.5m^2 + 0.5m      | m + 1                        | 2T_X
Fig. 4.5 [10]    | 0   | m^2  | < 1.5m^2 + 0.5m + 1  | 0       | 1.5m^2 + 2m        | m + 2                        | T_NA + T_X
Fig. 5.2 (3)(*)  | 0   | m^2  | 0                    | m^2 − 1 | m^2 + 3m − 1       | (m + 7)/2                    | T_NA + T_XN
Fig. 5.2 (3)(*)  | 0   | m^2  | 0                    | m^2 − 1 | 2m^2 + m           | (m + 9)/2                    | T_XN
Fig. 5.4 (4)(*)  | 0   | m^2  | 0                    | m^2 − 1 | ≃ m^2 + 3m − 1     | (m − 1)/e + 3 + log2 e       | T_NA + T_XN
Digit-parallel systolic structures (d = 2)
[8] (5)          | 0   | m^2  | m^2 + m              | 0       | m^2                | m/2                          | T_NA + 2T_X
[9] (6)          | 0   | m^2  | ≃ m^2 + m            | 0       | ≃ 1.5m^2 + 2m      | < 2√m                        | T_NA + T_X
Fig. 5.2 (7)(*)  | 0   | m^2  | 0                    | m^2 − 1 | m^2/2 + m − 1/2    | (m + 11)/4                   | T_NA + 2T_XN
Fig. 5.2 (7)(*)  | 0   | m^2  | 0                    | m^2 − 1 | m^2 − m + m − 1/2  | (m + 11)/4                   | 2T_XN
Fig. 5.4 (8)(*)  | 0   | m^2  | 0                    | m^2 − 1 | ≃ m^2/2 + 1.5m     | (m − 1)/(2e) + 2 + log2(e/2) | T_NA + 2T_XN
(*): The XOR gates in the PE-0 have been counted as XNOR gates.
(1): The regular systolic structure.
(2): Super-systolic structure.
(3): Structure with e = 2 and d = 1 (PEs of the computation core are from MS-I).
(4): Structure with d = 1 (MS-I) (PEs of the computation core are from MS-I).
(5): Digit-level structure for d = 2.
(6): The structure here can be seen as digit-level structure of d = 2.
(7): Structure with e = 2 and d = 2.
(8): Structure with d = 2 (MS-I) (PEs of the computation core are from MS-I).
5.4 Low-Latency Structure
Let 2t = eh + l, where 0 ≤ l ≤ h. For simplicity, we can assume l = 0, however, it can
be extended to l ̸= 0. Then, we can rewrite (5.24) as













U , i)bt−i + ξ(A
(t)
V , i)bi+t)










where the original two systolic arrays in the computation core of Fig. 5.2 can be divided
into e arrays, as shown in Fig. 5.4. The latency of the structure in Fig. 5.4 is only
$(h + 3 + \log_2 e)$ cycles, which is significantly shorter than that of the previous
structure in Fig. 5.1. A pipelined adder tree is used to add together the results of the
e systolic arrays of the computation core.
Figure 5.4: Proposed low-latency systolic multiplier.
5.5 Digit-Parallel Structure
We can also employ the PEs from Fig. 4.8 to have digit-parallel structure to reduce the
register-complexity further. It is noted that digit-parallel structure can be combined with
low-latency one to achieve optimal implementation.
5.6 Area and Time Complexities
5.6.1 Comparison
The area and time complexities in terms of logic gate count, register count, latency, and
critical-path of the proposed structures and existing structures of [7-10] are listed in Table
5.1.
The proposed designs outperform the existing designs, especially in register count.
The proposed designs have lower area-time complexity than the design of [7]. When
compared with the low-latency super-systolic structure of [8], the proposed design (Fig.
5.4) has shorter latency (if we choose $e = \sqrt{m}$) and fewer registers. Furthermore, when
compared to the two recent designs in [9] and [10], the proposed designs not only have
Figure 5.5: Comparison of register count and latency of various bit-parallel structures
based on trinomial f(x) = x^233 + x^74 + 1 ([8] refers to the super-systolic structure). (a)
Comparison of number of registers required by various designs. (b) Comparison of latency
(number of cycles) for various designs (we have chosen e = 16 for the proposed structure
of Fig. 5.4).
lower register count, but also involve significantly lower latency. Among all the existing
designs, only those of [8-9] are digit-parallel structures similar to that of Fig. 4.8.
Table 5.1 shows that the proposed digit-parallel structures have a lower register count
and shorter latency than those of [8-9].
For a fair comparison, we have also compared the register count and latency of various
designs based on the trinomial f(x) = x^233 + x^74 + 1, as shown in Fig. 5.5. It can be
seen that the proposed design, especially that of Fig. 5.2 (MS-I), has the best
performance in both register count and latency.
5.6.2 FPGA Implementations
We have implemented the proposed designs, including the structures of Figs. 5.1 and
5.2 (e = 2) and Fig. 5.4 (e = 16, d = 2), using Xilinx ISE 14.1 on a Virtex 6 family
device based on the trinomial f(x) = x^233 + x^74 + 1. The area-time-power complexities
of the best existing designs ([9] and Fig. 3 of [10]) are also estimated. The obtained
area-time-power complexities of all these designs are shown in Table 5.2.
As shown in Table 5.2, the proposed structures significantly outperform the existing
designs. The proposed structures are found to have at least 70.0% and 47.6% less ADP and
PDP than the corresponding designs, respectively.
5.6.3 Discussion
Table 5.2 shows that the proposed design of Fig. 5.4 (e = 16 and d = 2) achieves the
best area-time complexity among all the designs. The reduction in registers brought by
the digit-parallel implementation is significant. For practical applications, one can
choose a suitable value of d (in coordination with the selection of e) to obtain an
optimal realization.
5.7 Conclusion
A novel strategy for low complexity implementation of finite field multipliers over GF (2m)
based on trinomials on FPGA platform has been proposed. We have proposed a modified
data broadcasting technique to reduce the register-complexity within existing AOP-based
multipliers. Then, the AOP-based multipliers are packed as standard computation cores
to be used for trinomial based multipliers. A novel low register-complexity Montgomery
multiplication algorithm for systolic trinomial-based finite field multipliers is presented.
The systolic multiplier based on the proposed algorithm can employ the AOP-based com-
putation core to offer low register-complexity implementation. We have also introduced
structures for low-latency and digit-parallel implementations. Both the theoretical anal-
ysis and synthesis results have confirmed the higher efficiency of the proposed designs
than the competing designs.
Table 5.2: COMPARISON OF AREA-TIME COMPLEXITIES OF VARIOUS DESIGNS BASED ON TRINOMIAL f(x) = x^233 + x^74 + 1
Design                | Area   | Delay (1) | Power | ADP (2)    | PDP (3)
Bit-parallel systolic structures
Fig. 4.5 ([10])       | 81,805 | 222.1     | 2.515 | 18,168,891 | 558.58
Figs. 5.1 and 5.2 (4) | 54,032 | 128.2     | 2.336 | 6,926,902  | 299.48
Fig. 5.4 (5)          | 56,400 | 37.29     | 2.351 | 2,103,156  | 87.67
Digit-parallel systolic structures (d = 2)
[9] (6)               | 81,911 | 35.60     | 2.516 | 2,916,032  | 89.57
Fig. 5.4 (7)          | 22,160 | 22.04     | 2.130 | 488,406    | 46.95
Unit for area: number of slice registers; unit for delay: ns; unit for power: W (power is
estimated at 100 MHz).
(1): Delay = Latency.
(2): ADP: Area-delay product = Area × Delay.
(3): PDP: Power-delay product = Power × Delay.
(4): Based on structure of MS-I with e = 2.
(5): Based on structure of MS-I with e = 16 and d = 1.
(6): Structure here has 16 parallel systolic arrays (d = 2).
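Relative savings can be read off Table 5.2 by comparing each proposed design with an existing design of the same class. The sketch below computes the percentage reduction in ADP and PDP from the tabulated values; which pairs of designs the percentages quoted in Section 5.6.2 refer to is not stated, so the pairing used here (bit-parallel against Fig. 4.5 of [10], digit-parallel against [9]) is an assumption, and the helper name `reduction_percent` is hypothetical.

```python
# (ADP, PDP) values copied from Table 5.2
EXISTING_BIT_PARALLEL = (18168891, 558.58)      # Fig. 4.5 ([10])
PROPOSED_E2 = (6926902, 299.48)                 # Figs. 5.1 and 5.2, e = 2
PROPOSED_E16 = (2103156, 87.67)                 # Fig. 5.4, e = 16, d = 1
EXISTING_DIGIT_PARALLEL = (2916032, 89.57)      # [9], d = 2
PROPOSED_DIGIT_PARALLEL = (488406, 46.95)       # Fig. 5.4, d = 2


def reduction_percent(new, old):
    """Percentage by which `new` is smaller than `old`."""
    return 100.0 * (1.0 - new / old)


if __name__ == "__main__":
    pairs = [
        ("e = 2 vs [10]", PROPOSED_E2, EXISTING_BIT_PARALLEL),
        ("e = 16 vs [10]", PROPOSED_E16, EXISTING_BIT_PARALLEL),
        ("d = 2 vs [9]", PROPOSED_DIGIT_PARALLEL, EXISTING_DIGIT_PARALLEL),
    ]
    for label, new, old in pairs:
        adp = reduction_percent(new[0], old[0])
        pdp = reduction_percent(new[1], old[1])
        print(f"{label:15s} ADP -{adp:.1f}%  PDP -{pdp:.1f}%")
```

The same helper can be applied to the Area or Delay columns to separate the contribution of register savings from that of shorter latency.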




This design focuses on a finite field trinomial multiplier with an AOP core, and also uses
the Montgomery algorithm to improve the speed of the multiplier. Speed and complexity are
the two main concerns of hardware design. Multiple structures and algorithms have been
developed on the basis of previous designs. In this thesis, our design focuses on finite
field arithmetic, which reduces the complexity of the arithmetic.
6.1 Low complexity multiplier based on trinomial
We first present a low-complexity multiplier based on AOP. Several papers have shown
structures for AOP multipliers. By examining the signal flow graph under two different
cut-sets, however, we observed that registers in each PE can be saved. In addition, from
the gate-level structures of AND, NAND, XOR, and XNOR, we know that NAND and XNOR are
faster than AND and XOR, so we replace all AND and XOR gates with NAND and XNOR gates.
The new structure significantly decreases the area and latency.
6.2 Efficient systolic structure trinomial multiplier over GF(2^m)
A classic trinomial multiplier usually needs a large number of XOR gates and registers.
From the structures, however, we know that the PEs differ by only one XOR gate each. We
can therefore design an independent pre-computation component that consists of all the
XOR gates. In this way, we can combine the pre-computation component with the AOP
multiplier to build a new trinomial multiplier. Apart from changing the structure, we can
use another algorithm, the Montgomery algorithm. With this algorithm, the multiplier
reduces the latency while keeping the same critical path.
6.3 Digit-Parallel Systolic structure trinomial multiplier over GF(2^m)
After using the Montgomery algorithm to reduce the latency, we can use another strategy
to improve the speed: pipelining the PEs in a digit-parallel way. Suppose there are m
PEs. Previously all PEs were connected in one line; now we separate the single-line
systolic structure into several lines. In this way, all the systolic lines can compute
together, so the latency is decreased. In all, our designs are completed by the above




7.1 How to improve this multiplier with different structures
This thesis shows that designing a better circuit requires combining a new algorithm with
an optimized structure. Merely reducing the components of a structure without applying a
new algorithm has its limits. The Montgomery algorithm plays an important role in this
thesis. A similar strategy can be used with other algorithms, such as the Karatsuba
algorithm or TMVP. Using a different algorithm can lead to totally different structures,
which can change the latency and area significantly.
7.2 Applying the same algorithm to different circuits
In this thesis, we focus on trinomials, so a similar structure can still be used to implement




P. Chen, S. N. Basha, M. M.-Kermani, R. Azarderakhsh, and J. Xie, “FPGA Realization
of Low Register Systolic All-One-Polynomial Multipliers over GF (2m) and Their Appli-




[1] I. Blake, G. Seroussi, and N. P. Smart, Elliptic Curves in Cryptography, ser. London
Mathematical Society Lecture Note Series. Cambridge, U.K.: Cambridge Univ. Press,
1999.
[2] N. R. Murthy and M. N. S. Swamy, “Cryptographic applications of Brahmagupta-Bhaskara
equation,” IEEE Trans. on Circuits and Systems-I, vol. 53, no. 7, pp. 1565-1571,
2006.
[3] M. Sun, L. E. Burke, Z.-H. Mao, Y. Chen, H.-C. Chen, Y. Bai, Y. Li, C. Li, and W. Jia.
“eButton: A wearable computer for health monitoring and personal assistance,” Annual
Design Automation Conference (DAC ’14). ACM, New York, NY, USA, Article 16.
[4] Y. Bai, C. Li, W. Jia, J. Li, Z.-H. Mao, and M. Sun, “Designing a wearable computer for
lifestyle evaluation,” in Proc. Annual Northeast Bioengineering Conference, pp. 93-94,
2012 March 16-18; Philadelphia, PA.
[5] National Institute of Standards and Technology, “FIPS 186-2, Digital Signature Stan-
dard (DSS), Federal Information Processing Standards Publication 186-2,” 2000.
[6] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation.
New York: Wiley, 1999.
[7] C.-Y. Lee, J.-S. Horng, I.-C. Jou, and E.-H. Lu, “Low-complexity bit-parallel systolic
montgomery multipliers for special classes of GF (2m),” IEEE Trans. Comput., vol. 54,
no. 9, pp. 1061-1070, 2005.
[8] P. K. Meher, “Systolic and super-systolic multipliers for finite field GF (2m) based
on irreducible trinomials,” IEEE Trans. Circuits and Systems-I, vol. 55, no. 4, pp.
1031-1040, May, 2008.
[9] J. Xie, P. K. Meher and J. He, “Low-latency area-delay-efficient systolic multiplier
over GF (2m) for a wider class of trinomials using parallel register sharing,” in Proc. of
IEEE International Symposium on Circuits and Systems (ISCAS)-2012, pp. 89-92, 2012.
[10] S. B.-Sarmadi, and M. Farmani, “High-throughput low-complexity systolic Mont-
gomery multiplication over GF (2m) based on trinomials,” IEEE Trans. on Circuits and
Systems-II, vol. 62, no. 4, pp. 377-381, 2015.
[11] J. Xie, J. He and P. K. Meher, “Low latency systolic Montgomery multiplier for finite
field GF (2m) based on pentanomials,” IEEE Trans. on VLSI Systems, vol. 21, no. 2, pp.
385-389, 2013.
[12] P. Montgomery, “Modular multiplication without trial division,” Math. Computation,
vol. 44, no. 170, pp. 519-521, 1985.
[13] R. Azarderakhsh, K. Jarvinen, and V. Dimitrov, “Fast inversion in GF (2m) with
normal basis using hybrid-double multipliers,”IEEE Trans. on Comput., vol. 63, no. 4,
pp. 1041-1047, 2014.
[14] R. Azarderakhsh, D. Jao, and H. Lee, “Space complexity reduction algorithms for
Gaussian normal basis multiplication,”IEEE Trans. on Information Theory, vol. 61, no.
5, pp. 2357-2369, 2015.
[15] R. Azarderakhsh, M. Mozaffari-Kermani, “High-performance two-dimensional fi-
nite field multiplication and exponentiation for cryptographic applications,”IEEE Trans.
Computer Aided Design Integrated Circuits Systems, vol. 34, no. 10, pp. 1-8, 2015
[16] C-Y. Lee and P. K. Meher, “Area-efficient subquadratic space-complexity digit-serial
multiplier for type-II optimal normal basis of GF (2m) using symmetric TMVP and block
recombination techniques,”IEEE Trans. on Circuits and Systems-I, vol. 62, no. 12, pp.
2846-2855, 2015.
[17] S. Talapatra, H. Rahaman, and S. K. Saha, “Unified digit serial systolic Montgomery
multiplication architecture for special classes of polynomials over GF (2m),”in Conf. on
Digital System Design: Architectures, Methods and Tools, pp. 427-432, 2010.
[18] S. Fenn and M. Parker, “Bit-serial multiplication in GF(2^m) using all-one
polynomials,” IEE Proc. Comput. Digit. Tech., vol. 144, no. 6, pp. 391-393, 1997.
[19] K.-Y. Chang, D. Hong, and H. S. Cho, “Low complexity bit-parallel multiplier for
GF(2^m) defined by all-one polynomials using redundant representation,” IEEE Trans.
Comput., vol. 54, no. 12, pp. 1628-1629, 2005.
[20] H.-S. Kim and S.-W. Lee, “LFSR multipliers over GF(2^m) defined by all-one
polynomial,” Integration, the VLSI Journal, vol. 40, no. 4, pp. 571-578, 2007.
[21] P. K. Meher and C.-Y. Lee, “An optimized design of serial-parallel finite field
multiplier for GF(2^m) based on all-one polynomials,” ASP-DAC 2009, pp. 210-215, 2009.
[22] M.-Sandoval, M. F.-Uribe, and C. Kitsos, “Bit-serial and digit-serial GF(2^m)
Montgomery multipliers using linear feedback shift registers,” IET Comput. & Digital Tech.,
vol. 5, no. 2, pp. 86-94, 2011.
[23] C.-Y. Lee, E.-H. Lu, and J.-Y. Lee, “Bit-parallel systolic multipliers for GF(2^m)
fields defined by all-one and equally spaced polynomials,” IEEE Trans. Comput., vol. 50,
no. 6, pp. 385-393, May 2001.
[24] J. Xie, P. K. Meher and J. He, “Low-complexity multiplier for GF(2^m) based on
all-one polynomials,” IEEE Trans. on VLSI Systems, vol. 21, no. 1, pp. 168-172, 2013.
[25] Y.-R. Ting, E.-H. Lu, and Y.-C. Lu, “Ringed bit-parallel systolic multipliers over a
class of fields GF(2^m),” Integration, the VLSI Journal, vol. 38, no. 4, pp. 571-578, 2005.
[26] C.-Y. Lee, J.-S. Horng, I.-C. Jou, and E.-H. Lu, “Low-complexity bit-parallel systolic
Montgomery multipliers for special classes of GF(2^m),” IEEE Trans. Comput., vol. 54,
no. 9, pp. 1061-1070, Sep. 2005.
[27] S. Talapatra, H. Rahaman, and J. Mathew, “Low complexity digit serial systolic
Montgomery multipliers for special class of GF(2^m),” IEEE Trans. VLSI Sys., vol. 18,
no. 5, pp. 847-852, May 2010.
[28] T. Itoh and S. Tsujii, “Structure of parallel multipliers for a class of fields GF(2^m),”
Information and Computation, vol. 83, no. 1, pp. 21-40, 1989.
[29] S. B. Wicker and V. K. Bhargava, Reed-Solomon Codes and Their Applications,
Piscataway, NJ: IEEE Press, 1994.
[30] J.-P. Deschamps, J. L. Imana, and G. D. Sutter, Hardware Implementation of Finite
Field Arithmetic. McGraw-Hill, 2009.
[31] I. Blake, G. Seroussi, and N. P. Smart, Elliptic Curves in Cryptography, ser.
London Mathematical Society Lecture Note Series. Cambridge, U.K.: Cambridge Univ.
Press, 1999.
[32] K. K. Parhi, VLSI Digital Signal Processing Systems: Design
and Implementation. New York: Wiley, 1999.
[33] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[34] H. Fan and M. A. Hasan, “Relationship between GF(2^m) Montgomery and shifted
polynomial basis multiplication algorithms,” IEEE Trans. Computers, vol. 55, no. 9,
pp. 1202-1206, 2006.
[35] C.-S. Yeh, I. S. Reed, and T. K. Truong, “Systolic multipliers for finite fields
GF(2^m),” IEEE Trans. Comput., vol. C-33, no. 4, pp. 357-360, Apr. 1984.
[36] Digital Signature Standard (DSS), FIPS 186-2, National Institute of Standards and
Technology, 2000.
[37] P. K. Meher, “Systolic and non-systolic scalable modular designs of finite field
multipliers for Reed-Solomon Codec,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,
vol. 17, no. 6, pp. 747-757, June 2009.
[38] S. K. Jain, L. Song, and K. K. Parhi, “Efficient semisystolic architectures for finite
field arithmetic,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 6, no. 1,
pp. 734-749, Mar. 1998.
[39] C.-S. Yeh, I. S. Reed, and T. K. Truong, “Systolic multipliers for finite fields
GF(2^m),” IEEE Trans. Comput., vol. C-33, no. 4, pp. 357-360, Apr. 1984.
[40] C.-L. Wang and J.-L. Lin, “Systolic array implementation of multipliers for finite
fields GF(2^m),” IEEE Trans. Circuits Syst., vol. 38, no. 7, pp. 796-800, Jul. 1991.
[41] C. H. Kim, C. P. Hong, and S. Kwon, “A digit-serial multiplier for finite field
GF(2^m),” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 4,
pp. 476-483, 2005.
[42] N.-Y. Kim, H.-S. Kim, and K.-Y. Yoo, “Computation of AB^2 multiplication in GF(2^m)
using a low-complexity systolic architecture,” IEE Proc. Circuits Devices Syst., vol.
150, no. 2, pp. 119-123, Apr. 2003.
[43] L. Song, K. K. Parhi, I. Kuroda, and T. Nishitani, “Hardware/software codesign of
finite field datapath for low-energy Reed-Solomon codecs,” IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 8, no. 2, pp. 160-172, Apr. 2000.
[44] K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede, “Multicore Curve-Based
Cryptoprocessor with Reconfigurable Modular Arithmetic Logic Units over GF(2^n),”
IEEE Trans. Comput., vol. 56, no. 9, pp. 1269-1282, 2007.
[45] H. Wu, “Bit-parallel polynomial basis multiplier for new classes of finite fields,”
IEEE Trans. Computers, vol. 57, no. 8, pp. 1023-1031, 2008.
[46] C. Paar, “Low complexity parallel multipliers for Galois fields GF((2^n)^4) based on
special types of primitive polynomials,” IEEE International Symposium on Information
Theory, 1994.
[47] P. K. Meher, “On efficient implementation of accumulation in finite field over GF(2^m)
and its applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17,
no. 4, pp. 541-550, 2009.
[48] T.-C. Chen, S.-W. Wei, and H.-J. Tsai, “Arithmetic unit for finite field GF(2^m),”
IEEE Trans. on Circuits and Systems-I, vol. 55, no. 3, pp. 828-837, 2008.
[49] N. R. Murthy and M. N. S. Swamy, “Cryptographic applications of Brahmagupta-Bhaskara
equation,” IEEE Trans. on Circuits and Systems-I, vol. 53, no. 7, pp. 1565-1571, 2006.
[50] C. Spagnol, E. M. Popovici, and W. P. Marnane, “Hardware implementation of
GF(2^m) LDPC decoder,” IEEE Trans. on Circuits and Systems-I, vol. 56, no. 12,
pp. 2609-2620, 2009.
