FPGA Realization of Low Register Systolic Multipliers over GF(2^m) by Shao, Qiliang
Wright State University 
CORE Scholar 
Browse all Theses and Dissertations Theses and Dissertations 
2016 
FPGA Realization of Low Register Systolic Multipliers over 
GF(2^m) 
Qiliang Shao 
Wright State University 
 Part of the Electrical and Computer Engineering Commons 
Repository Citation 
Shao, Qiliang, "FPGA Realization of Low Register Systolic Multipliers over GF(2^m)" (2016). Browse all 
Theses and Dissertations. 1657. 
https://corescholar.libraries.wright.edu/etd_all/1657 
FPGA REALIZATION OF LOW
REGISTER SYSTOLIC MULTIPLIERS
OVER GF(2^m)
A thesis submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical Engineering
By
QILIANG SHAO






I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY
Qiliang Shao ENTITLED FPGA REALIZATION OF LOW REGISTER SYSTOLIC MULTIPLIERS
OVER GF(2^m) BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
















Vice President for Research and
Dean of the Graduate School
Abstract
Shao, Qiliang. M.S.E.E., Department of Electrical Engineering, Wright State University,
2016. FPGA realization of low register systolic multipliers over GF(2^m).
Finite field multiplication over GF(2^m) is a critical component of elliptic curve
cryptography (ECC). The National Institute of Standards and Technology (NIST) has
recommended five polynomials (two trinomials and three pentanomials) for ECC implementation.
Although many reports are available on polynomial basis multipliers, efficient
implementations of a design with flexible field size are quite rare. The field can also be
represented in a normal basis. Normal basis multiplication over GF(2^m) is widely used in
applications such as ECC. As a special class of normal basis with low complexity, the
Gaussian normal basis (GNB) has received considerable attention recently. In this thesis, we
first propose a novel low-complexity hybrid-size systolic polynomial basis multiplier, based
on a novel hybrid-size algorithm (covering both pentanomials and trinomials) for efficient
systolization of finite field multiplication. Next, we propose a novel decomposition
algorithm to develop a digit-level (DL) systolic structure with low critical-path delay and
low register-complexity for GNB multiplication over GF(2^m). For the hybrid-size systolic
polynomial multipliers, both the theoretical analysis and the field-programmable gate array
(FPGA) implementation show that the proposed architectures have lower register-complexity
than the existing ones. The proposed hybrid-size multiplier can also be extended to other
field sizes and can be used as a third-party intellectual property (IP) core for various
cryptosystems. Likewise, theoretical and application-specific integrated circuit (ASIC)
comparisons with the existing GNB multipliers show that the proposed systolic GNB
multipliers achieve both low critical-path delay and low register-complexity.
Key Words-Gaussian normal basis (GNB), finite field multiplication, systolic structure,




1.1 Preliminary
1.1.1 Polynomial Multipliers
1.1.2 Gaussian Normal Basis Multipliers
1.2 Summary of Contributions
1.2.1 Proposed Low-Complexity Hybrid-Size Systolic Polynomial Multipliers over GF(2^m)
1.2.2 Proposed Low-Complexity Systolic Gaussian Normal Basis Multipliers over GF(2^m)
1.2.3 Report Outline
2 Efficient Implementation of Low Complexity Hybrid-Size Systolic Polynomial Multipliers over GF(2^m)
2.1 Low Register-Complexity Systolic Multipliers Based on NIST Trinomials and Pentanomials
2.1.1 Review of Conventional Polynomial Multiplication
2.1.2 Conventional Systolic Structure
2.1.3 Modified Low Register-Complexity Systolic Structure
2.1.4 Area-Time Complexities
2.1.5 FPGA Implementation
2.2 Proposed Low Complexity Hybrid-Size Systolic Polynomial Multipliers
2.2.1 Proposed Hybrid Polynomial Multiplication Algorithm
2.2.2 Detailed Example and Computation Steps
2.2.3 Proposed Hybrid-Size Systolic Structure
2.2.4 Low-Latency Structure
2.3 Area and Time Complexity
2.3.1 Theoretical Comparison
2.3.2 FPGA Implementation
3 Low Critical-Path Low-Complexity Digit-Level Systolic Gaussian Normal Basis Multiplier
3.1 Review of the Existing DL Systolic GNB Multiplier
3.2 Proposed Algorithm
3.2.1 Low Critical-Path Delay
3.2.2 Low Register-Complexity
3.2.3 Proposed DL Systolic GNB Multiplication Algorithm
3.3 Proposed Low Critical-path Delay Low Register-Complexity DL Systolic GNB Multiplier
3.3.1 Proposed Structure
3.3.2 An Example
3.4 Area-Time Complexities
3.4.1 Theoretical Comparison
3.4.2 ASIC Implementation
4 Conclusion
4.1 Low complexity hybrid-size systolic polynomial multipliers





2.1 Conventional systolic structure for finite field multiplication over GF(2^m) based on NIST pentanomials and trinomials. (a) Systolic structure. (b) Internal structure of PE-1. (c) Internal structure of a regular PE (PE-2 through PE-(m − 1)). (d) Internal structure of PE-m.
2.2 Detailed design of RC. (a) Detailed design of RC for pentanomial F(t) = t^m + t^s1 + t^s2 + t^s3 + 1. (b) Detailed design of RC for trinomial F(t) = t^m + t^s + 1.
2.3 Detailed design of the modified systolic structure. (a) Modified systolic structure. (b) Modified reduction cell for pentanomial F(t) = t^m + t^s1 + t^s2 + t^s3 + 1.
2.4 Detailed design of the modified systolic structure. (a) Modified systolic structure. (b) Modified reduction cell for trinomial F(t) = t^m + t^s + 1.
2.5 Detailed example according to Algorithm 2.1. (a) Pentanomial example of GF(2^163). (b) Trinomial example of GF(2^233).
2.6 Proposed hybrid-size systolic multiplier, where the dotted box denotes selective connection.
2.7 Detailed internal structure of PE-0.
2.8 Detailed structure of the proposed multiplier, where the black boxes represent the registers. (a) Internal structure of the computation core. (b) Example of detailed design of PE-1. (c) Example of detailed design of a regular PE. (d) Detailed design of PE-2.
2.9 Proposed low-latency systolic structure.
3.1 (a) Existing DL systolic GNB multiplier over GF(2^m) [48]. (b) Detailed structure of PE.
3.2 Proposed strategy of designing low critical-path delay DL systolic GNB multiplier over GF(2^m), where the S box denotes a shift operation. (a) Proposed DL systolic GNB multiplier. (b) Detailed structure of PEs, where black boxes denote registers.
3.3 Data pipelining of operand B among PEs for the structure of Fig. 3.1, where the diagonal line represents data flow between PEs, and the vertical line represents data flow in one PE.
3.4 Data pipelining of operand V among PEs for the structure of Fig. 3.1, where the diagonal line represents data flow between PEs, and the vertical line represents data flow in one PE.
3.5 Data pipelining of operand B among PEs with added operands for the structure of Fig. 3.1, where the gray area represents all added operands, and the green area represents one specific operand cluster fed to all PEs.
3.6 Data pipelining of operand V among PEs with added operands for the structure of Fig. 3.1, where the gray area represents all added operands, and the green area represents one specific operand cluster fed to all PEs.
3.7 Example of rearranging data pipelining by using operand cluster B′i.
3.8 Proposed low critical-path delay low register-complexity DL systolic GNB multiplier over GF(2^m). (a) Proposed structure of DL systolic GNB multiplier. (b) Detailed internal structures of PEs, where black boxes denote registers.
List of Tables
2.1 COMPARISON OF AREA-TIME COMPLEXITIES OF VARIOUS SYSTOLIC NIST POLYNOMIAL MULTIPLIERS
2.2 FPGA IMPLEMENTATION RESULTS OF VARIOUS POLYNOMIAL-BASED MULTIPLIERS
2.3 COMPARISON OF AREA-TIME COMPLEXITIES OF VARIOUS SYSTOLIC POLYNOMIAL MULTIPLIERS
2.4 FPGA IMPLEMENTATION RESULTS OF VARIOUS POLYNOMIAL-BASED MULTIPLIERS
3.1 COMPARISON OF THE AREA AND TIME COMPLEXITIES FOR VARIOUS DL MULTIPLIERS OVER GF(2^m)
3.2 COMPARISON OF LATENCY OVER GF(2^409)
3.3 COMPARISON OF CRITICAL-PATH DELAY WITH DIFFERENT DIGIT-SIZE FOR VARIOUS DL MULTIPLIERS OVER GF(2^409)
3.4 ASIC SYNTHESIS RESULTS OF THE PROPOSED SYSTOLIC MULTIPLIER
3.5 ASIC SYNTHESIS RESULTS FOR THE EXISTING AND THE PRO-




This chapter introduces the thesis as a whole. It reviews existing work on finite field
multipliers, covering both polynomial basis multipliers and GNB multipliers, and then
summarizes the contributions of this thesis.
1.1 Preliminary
1.1.1 Polynomial Multipliers
Elliptic curve cryptography (ECC) is widely used in many fields such as wearable devices
and portable systems [1-4]. Finite field multiplication over GF(2^m) is a crucial part
of ECC, and two bases are mainly used to represent the field elements, i.e., the
polynomial basis [5-13] and the normal basis [14-17]. Owing to their simpler structure,
polynomial basis multipliers are more widely used in hardware implementations than
normal basis ones [8].
In general, pentanomials and trinomials are the two main classes of irreducible polynomials
[7-11], [17-26]. The National Institute of Standards and Technology (NIST) has recommended
three pentanomials and two trinomials, which are widely used for ECC implementation [5].
Although many reports in the literature focus on the efficient implementation of finite
field multipliers based on either a pentanomial or a trinomial, there are still few
designs that target a hybrid-size multiplier.
Systolic and non-systolic designs are the two basic structures for field multipliers over
GF(2^m). Because of their modularity and regularity, polynomial basis systolic
multipliers are well suited to practical applications [5-11]. A systolic design is
pipelined, however, so registers must be used in all the processing elements (PEs) of the
systolic array [5]. Systolic structures therefore usually have higher register-complexity,
whereas non-systolic designs have lower complexity but a longer delay.
Systolic structures with low register-complexity are needed for realization on
field-programmable gate array (FPGA) platforms, where register resources are limited.
Depending on the irreducible polynomial used, considerable effort has been made to reduce
the complexity [7-10], [23-26]. In [23], Lee et al. introduced a bit-parallel systolic
multiplier based on the all-one polynomial (AOP). Xie et al. [24] presented another
efficient AOP-based multiplier structure. In [7], Lee et al. presented a bit-parallel
trinomial-based systolic multiplier. Meher [8] introduced efficient bit-parallel systolic
and super-systolic designs. A low register-complexity systolic structure was proposed in
[9]. José Luis Imaña et al. [28] introduced a low-complexity bit-parallel multiplier based
on irreducible pentanomials. Many other works on the implementation of finite field
multiplication over GF(2^m) have also been reported [11], [17].
1.1.2 Gaussian Normal Basis Multipliers
Low-complexity implementations of finite field multipliers over GF(2^m) have drawn
substantial attention recently due to their widespread applications. Considerable effort
has been devoted to obtaining low-complexity, high-performance multipliers for various
uses, including elliptic curve cryptography (ECC) [1], [31]-[33].
In general, three bases can be selected to represent a finite field, i.e., the polynomial
basis, the normal basis, and the dual basis [34]-[48]. Compared with the other two, the
normal basis is much more efficient in hardware designs that involve many squaring
operations, because the square of a field element is obtained by a simple cyclic shift
at no hardware cost. The normal basis has therefore drawn much attention in
applications that use frequent squarings.
The Gaussian normal basis (GNB), a special class of normal basis over GF(2^m) [34]-[38]
(where m > 1 and m is not divisible by eight), has received considerable attention in the
literature due to its low complexity. GNB has been included in a number of standards
such as IEEE [49] and NIST [50] for the elliptic curve digital signature algorithm (ECDSA).
According to their structure, finite field multipliers based on normal basis, especially
GNB, can be categorized into three groups: 1) bit-level, including parallel-in serial-out
(PISO) [51], serial-in parallel-out (SIPO) [52]-[53], and parallel-in parallel-out (PIPO)
[54]-[55]; 2) digit-level, including parallel-in serial-out [56], parallel-in parallel-out
[46]-[47], and serial-in parallel-out [48]; and 3) bit-parallel, which includes the
multiplier of [32].
For large field sizes in GF(2^m), multiplication can be realized with a systolic array
to achieve high-speed and regular implementations [39]. Systolic structures are widely
used in applications with high-performance requirements, since the processing elements
(PEs) in the structure employ registers for pipelining. In [40], Kwon presented an
efficient digit-serial systolic multiplier based on the optimal normal basis. In [41],
another systolic multiplier was proposed for high-performance implementation. Other
efficient systolic multipliers over GF(2^m) were proposed in [39] and [42]. In addition,
efficient digit-serial systolic multipliers were introduced in [26], [43]-[44]. Very
recently, an efficient digit-level (DL) systolic GNB multiplier was introduced in [48].
Although these GNB multipliers have been optimized to achieve low complexity through DL
implementation, their area-time complexities are still relatively high and need to be
improved.
In Chapter 3 of this thesis, we propose a novel decomposition algorithm to develop DL
systolic GNB multipliers over GF(2^m) that achieve low critical-path delay, high speed,
and low complexity. First, we introduce novel multiplication algorithms that reduce the
critical-path delay and the register-complexity. Then, a new structure for the systolic
GNB multiplier is presented. The proposed multiplier achieves lower critical-path delay
and lower register-complexity than the best existing GNB systolic multiplier, that of
[48]. Finally, we compare the hardware and time complexities of the proposed
architectures with the existing ones through application-specific integrated circuit
(ASIC) synthesis to benchmark the efficiency of the presented design.
1.2 Summary of Contributions
1.2.1 Proposed Low-Complexity Hybrid-Size Systolic Polyno-
mial Multipliers over GF (2m)
Many designs of finite field polynomial systolic multipliers have been reported. Most
of them focus on one specific class of NIST recommended polynomials, i.e., pentanomials
or trinomials. In Chapter 2 of this thesis, we propose a novel hybrid-size algorithm
(covering both pentanomials and trinomials) for efficient systolization of finite field
multiplication. A detailed example of a hybrid-size multiplier (combining GF(2^163) and
GF(2^233)) is then presented, together with the proposed systolic structure. Both the
theoretical analysis and the field-programmable gate array (FPGA) implementation show
that the proposed hybrid-size design (which can perform both GF(2^163) and GF(2^233)
multiplications) has at least 70.0% less area-delay product (ADP) and 47.6% less
power-delay product (PDP) than the combination of the proposed individual GF(2^163)
and GF(2^233) multipliers (the best of all existing designs). Besides that, the
proposed hybrid-size design involves only 70.0% more ADP and 47.6% more PDP than the
proposed individual GF(2^233) multiplier.
1.2.2 Proposed Low-Complexity Systolic Gaussian Normal Ba-
sis Multipliers over GF (2m)
Much work has been done on Gaussian normal basis multipliers. In Chapter 3 of this
thesis, we propose a novel decomposition algorithm to develop a digit-level (DL) systolic
structure with low critical-path delay and low register-complexity for GNB
multiplication over GF(2^m). Both the theoretical and the application-specific
integrated circuit (ASIC) comparisons with the existing digit-level GNB multipliers show
that the proposed multiplier not only has lower critical-path delay but also achieves
significantly less area-delay product (ADP). For example, for a systolic structure with a
digit-size of 7 over GF(2^409), the proposed structure has 28.9% less critical-path delay
and 26.8% less ADP than the best of the existing designs in [24].
1.2.3 Report Outline
The rest of the report is organized as follows.
Chapter 2 proposes hybrid-size systolic polynomial multipliers over GF(2^m) with low
complexity. Several classes of irreducible polynomials are considered in this chapter.
Chapter 3 presents the low critical-path, low-complexity digit-level systolic Gaussian
normal basis multipliers.
Chapter 4 concludes the thesis.
Chapter 2
Efficient Implementation of Low
Complexity Hybrid-Size Systolic
Polynomial Multipliers over GF(2^m)
2.1 Low Register-Complexity Systolic Multipliers Based
on NIST Trinomials and Pentanomials
In this section, we briefly review the conventional algorithm and existing systolic struc-
tures for NIST trinomials and pentanomials first, and then present the modified low
register-complexity systolic structures.
2.1.1 Review of Conventional Polynomial Multiplication
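Let $F(t) = t^m + p_{m-1} \cdot t^{m-1} + \cdots + p_2 \cdot t^2 + p_1 \cdot t + 1$ be an irreducible polynomial
of degree $m$ over $GF(2)$, where $p_i \in GF(2)$ are the coefficients; $F(t)$ defines the field $GF(2^m)$.
The polynomial basis $\{1, t, t^2, \ldots, t^{m-1}\}$, where $t$ is a root of $F(t)$, is used to represent
the field elements. Three field elements $A$, $B$, and $C$ can then be written as
$$A = \sum_{i=0}^{m-1} a_i \cdot t^i, \quad B = \sum_{i=0}^{m-1} b_i \cdot t^i, \quad C = \sum_{i=0}^{m-1} c_i \cdot t^i, \tag{2.1}$$
where $a_i, b_i, c_i \in GF(2)$, for $0 \le i \le m-1$.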
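Let us define $C$ as the product of $A$ and $B$; then we have
$$C = A \cdot B \bmod F(t). \tag{2.2}$$
Expanding $B$ according to (2.1), (2.2) can be rewritten as
$$C = \sum_{i=0}^{m-1} b_i \cdot (t^i \cdot A \bmod F(t)). \tag{2.3}$$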
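Let us define
$$A^{(0)} = A, \qquad A^{(i)} = t^i \cdot A \bmod F(t), \tag{2.4}$$
so that $A^{(i+1)}$ can be derived from $A^{(i)}$ recursively as
$$A^{(i+1)} = t \cdot A^{(i)} \bmod F(t). \tag{2.5}$$
Then, we have
$$C = \sum_{i=0}^{m-1} b_i \cdot A^{(i)}, \tag{2.6}$$
where
$$A^{(i)} = \sum_{j=0}^{m-1} a^{i}_{j} \cdot t^j. \tag{2.7}$$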
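Because $t$ is a root of the polynomial $F(t)$, we have
$$F(t) = t^m + p_{m-1} \cdot t^{m-1} + \cdots + p_1 \cdot t + 1 = 0, \tag{2.8}$$
and, since addition and subtraction coincide in $GF(2)$,
$$t^m = p_{m-1} \cdot t^{m-1} + \cdots + p_1 \cdot t + 1. \tag{2.9}$$
Multiplying (2.7) by $t$ and substituting (2.9) for $t^m$, we have
$$A^{(i+1)} = a^{i}_{m-1} + (a^{i}_{0} + p_1 \cdot a^{i}_{m-1}) \cdot t + \cdots + (a^{i}_{m-2} + p_{m-1} \cdot a^{i}_{m-1}) \cdot t^{m-1}. \tag{2.10}$$
Writing $A^{(i+1)}$ as
$$A^{(i+1)} = a^{i+1}_{0} + a^{i+1}_{1} \cdot t + \cdots + a^{i+1}_{m-1} \cdot t^{m-1}, \tag{2.11}$$
and comparing coefficients with (2.10) gives
$$\begin{cases} a^{i+1}_{0} = a^{i}_{m-1}, \\ a^{i+1}_{j} = a^{i}_{j-1} + p_j \cdot a^{i}_{m-1}, & \text{for } 1 \le j \le m-1. \end{cases} \tag{2.12}$$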
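Given a general pentanomial of degree $m$,
$$F(t) = t^m + t^{s_1} + t^{s_2} + t^{s_3} + 1, \tag{2.13}$$
only $p_{s_1}$, $p_{s_2}$, and $p_{s_3}$ are nonzero, so (2.12) becomes
$$\begin{cases} a^{i+1}_{0} = a^{i}_{m-1}, \\ a^{i+1}_{s_1} = a^{i}_{s_1-1} + a^{i}_{m-1}, \\ a^{i+1}_{s_2} = a^{i}_{s_2-1} + a^{i}_{m-1}, \\ a^{i+1}_{s_3} = a^{i}_{s_3-1} + a^{i}_{m-1}, \\ a^{i+1}_{j} = a^{i}_{j-1}, & \text{for } 1 \le j \le m-1 \text{ and } j \ne s_1, s_2, s_3. \end{cases} \tag{2.14}$$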
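Similarly, for a general trinomial of degree $m$ given by
$$F(t) = t^m + t^s + 1, \tag{2.15}$$
(2.12) becomes
$$\begin{cases} a^{i+1}_{0} = a^{i}_{m-1}, \\ a^{i+1}_{s} = a^{i}_{s-1} + a^{i}_{m-1}, \\ a^{i+1}_{j} = a^{i}_{j-1}, & \text{for } 1 \le j \le m-1 \text{ and } j \ne s. \end{cases} \tag{2.16}$$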
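The recursion above can be illustrated with a short software model. The following Python sketch is not part of the thesis design; it assumes that a field element is stored as an integer whose bit j holds the coefficient a_j, and the names `mul_by_t` and `gf2m_mul` are hypothetical. `mul_by_t` performs one step of (2.5) using the update rules of (2.14) or (2.16), and `gf2m_mul` accumulates the partial products of (2.6).

```python
from typing import Iterable


def mul_by_t(a: int, m: int, low_exps: Iterable[int]) -> int:
    """Return t * a mod F(t) for F(t) = t^m + sum(t^s for s in low_exps) + 1.

    Bit j of `a` is the coefficient a_j. `low_exps` holds (s1, s2, s3) for a
    pentanomial or (s,) for a trinomial.
    """
    carry = (a >> (m - 1)) & 1           # a_{m-1}, the coefficient that wraps around
    a = (a << 1) & ((1 << m) - 1)        # a_j <- a_{j-1}; the t^m term is dropped
    if carry:
        a ^= 1                           # a_0 <- a_{m-1}
        for s in low_exps:
            a ^= 1 << s                  # a_s <- a_{s-1} + a_{m-1}
    return a


def gf2m_mul(a: int, b: int, m: int, low_exps: Iterable[int]) -> int:
    """Compute C = A * B mod F(t) as the sum over i of b_i * A^(i), as in (2.6)."""
    low_exps = tuple(low_exps)
    c = 0
    a_i = a                              # A^(0) = A
    for i in range(m):
        if (b >> i) & 1:
            c ^= a_i                     # add b_i * A^(i); addition in GF(2) is XOR
        a_i = mul_by_t(a_i, m, low_exps) # A^(i+1) = t * A^(i) mod F(t)
    return c
```

As a small usage example, in GF(2^3) with F(t) = t^3 + t + 1, `gf2m_mul(0b100, 0b010, 3, (1,))` evaluates t^2 * t = t^3 = t + 1 and returns `0b011`. For the NIST pentanomial used later in this chapter, the call would take `m=163` and `low_exps=(7, 6, 3)`.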
2.1.2 Conventional Systolic Structure
The conventional systolic multiplier based on NIST polynomials is shown in Fig. 2.1. It
consists of m PEs of three types: PE-1, the regular PEs (PE-2 through PE-(m − 1)), and
PE-m. The internal structures of these PEs are shown in Figs. 2.1(b), (c), and (d),
respectively, where RC denotes the reduction cell, which performs the operations of
(2.14) and (2.16). The internal structure of the RC is shown in Fig. 2.2; it involves
three XOR operations for a pentanomial and one XOR operation for a trinomial. The
latency of the structure in Fig. 2.1 is m cycles, where the duration of each cycle is
T_A + T_X (T_A and T_X refer to the delays of an AND gate and an XOR gate, respectively).
Figure 2.1: Conventional systolic structure for finite field multiplication over GF(2^m)
based on NIST pentanomials and trinomials. (a) Systolic structure. (b) Internal structure
of PE-1. (c) Internal structure of a regular PE (PE-2 through PE-(m − 1)). (d) Internal
structure of PE-m.
Figure 2.2: Detailed design of RC. (a) Detailed design of RC for pentanomial F(t) =
t^m + t^s1 + t^s2 + t^s3 + 1. (b) Detailed design of RC for trinomial F(t) = t^m + t^s + 1.
2.1.3 Modified Low Register-Complexity Systolic Structure
For the structure of Figs. 2.1 and 2.2, we find that (m−3) or (m−1) registers in the RC
of each PE pipeline unprocessed (without XOR operations) signals to the next PE. These
































































Figure 2.3: Detailed design of the modified systolic structure. (a) Modified systolic
structure. (b) Modified reduction cell for pentanomial F(t) = t^m + t^s1 + t^s2 + t^s3 + 1.
2.3 and 2.4, a novel connection strategy with three types of connections is proposed,
namely full connection, selected connection, and recombination. For each PE, the m bits
of operand A^(i) are fed directly, through the full connection, to the AND gates in the
PE for multiplication with one bit of operand B. In addition, 4 (or 2) bits of operand
A^(i) are selected and fed into the modified RC (MRC) for the XOR operations of
(2.14) and (2.16). The 3 (or 1) output bits of the MRC are then recombined into
the operand A^(i+1) used in the next PE. Therefore, compared with the structure of
Fig. 2.1, (m − 3) (or (m − 1)) registers are saved in each PE by the proposed connection
strategy. The modified structure has the same time-complexity as the previous one, but






















































Figure 2.4: Detailed design of the modified systolic structure. (a) Modified systolic
structure. (b) Modified reduction cell for trinomial F(t) = t^m + t^s + 1.
2.1.4 Area-Time Complexities
The area-time complexities of the proposed designs in Figs. 2.3 and 2.4 are shown in
Table 2.1, along with those of existing and conventional designs. The proposed designs
have significantly lower area-complexity than the competing ones, especially in terms of
register-complexity.
2.1.5 FPGA Implementation
We have also implemented the systolic structures listed in Table 2.1 to confirm the
efficiency of the proposed structures. The designs were synthesized using Xilinx ISE 14.1
for the Virtex-6 FPGA family, with the NIST pentanomial F(t) = t^163 + t^7 + t^6 + t^3 + 1
and the NIST trinomial F(t) = t^233 + t^74 + 1. The results in terms of area-time-power
complexities are shown in Table 2.2. The proposed structures outperform the existing
ones, especially in register-complexity.
Table 2.1: COMPARISON OF AREA-TIME COMPLEXITIES OF VARIOUS SYSTOLIC NIST POLYNOMIAL MULTIPLIERS
Design | AND | NAND | XOR | XNOR | Register | Critical-path delay | Latency






[m/(2l + 2) + 1
2lm + 2l + 2 2lm− 2l− 2 +log2(2l + 2)]
MBP-II [29]2 m2 -
m2 − 3m2/d2+
-
m2 +md +m2/d + 3m
TA + TX d + 1 + log2dm/de5m/d + 3m2/d− −ds1 + ds3 + 5m/d−
m− 3− s1 + s3 s1 + s3 − 4d + 4m− 3
Fig. 2.3 (see note 3) | m^2 | - | m^2 + 2m − 1 | - | m^2 + 3m − 1 | T_A + T_X | m
For NIST trinomials F(t) = t^m + t^s + 1
[8] | - | m^2 | m^2 − 1 | - | 2m^2 − 2m | T_NA + T_X | m
Fig. 2 [10] | - | m^2 | < 1.5m^2 + 0.5m + 1 | - | 1.5m^2 + 0.5m | 2T_X | m + 1
Fig. 3 [10] | - | m^2 | < 1.5m^2 + 0.5m + 1 | - | 1.5m^2 + 2m | T_NA + T_X | m + 2
Fig. 8 [30] | - | m^2 | - | m^2 − 1 | m^2 + 3m − 1 | T_NA + T_XN | (m + 7)/2
Fig. 2.4 (see note 3) | m^2 | - | m^2 − 1 | - | m^2 + m | T_A + T_X | m
T_NA: delay of a NAND gate. T_XN: delay of an XNOR gate.
1: Here l = min{m − s1, s1 − s2, s2 − s3}. In [11], the authors also used NAND and XNOR gates to replace part of the original AND and XOR
gates; for simplicity of discussion, we list them here as AND and XOR gates.
2: The design of [29] is a low-latency design, where the original systolic array is decomposed into d arrays for parallel implementation.
3: For simplicity of discussion, we list here only the basic systolic structures, although the proposed designs can be extended for
low-latency implementation, as will be seen in Section III.
Table 2.2: FPGA IMPLEMENTATION RESULTS OF VARIOUS POLYNOMIAL-BASED MULTIPLIERS
Design | Area | Delay (note 1) | Power | ADP (note 2) | PDP (note 3)
For NIST pentanomial F(t) = t^163 + t^7 + t^6 + t^3 + 1
[11] | 52,482 | 1.695 | 1.613 | 88,957 | 2.734
MBP-II [30] | 55,249 | 1.695 | 1.698 | 93,647 | 2.878
Fig. 2.3 | 28,149 | 1.695 | 0.865 | 47,713 | 1.466
For NIST trinomial F(t) = t^233 + t^74 + 1
[8] | 111,840 | 1.695 | 3.435 | 189,569 | 5.822
Fig. 8 [31] | 54,987 | 1.695 | 1.689 | 93,202 | 2.863
Fig. 2.4 | 54,522 | 1.695 | 1.675 | 92,415 | 2.839
Unit for area: number of slice registers; unit for delay: ns; unit for power: W (power is
estimated at 100 MHz).
1: Delay = critical-path delay.
2: ADP: area-delay product = Area × Delay.
3: PDP: power-delay product = Power × Delay.
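The derived columns of Table 2.2 can be cross-checked from the raw area, delay, and power figures. The short Python sketch below is only a sanity check on the tabulated numbers, not part of the thesis flow; the names `TABLE_2_2`, `row_is_consistent`, and `reduction_percent` are hypothetical, and the PDP formula (Power × Delay) is an assumption that happens to match the tabulated values.

```python
# (area in slice registers, delay in ns, power in W, tabulated ADP, tabulated PDP)
TABLE_2_2 = {
    "[11]":        (52482, 1.695, 1.613, 88957, 2.734),
    "MBP-II [30]": (55249, 1.695, 1.698, 93647, 2.878),
    "Fig. 2.3":    (28149, 1.695, 0.865, 47713, 1.466),
    "[8]":         (111840, 1.695, 3.435, 189569, 5.822),
    "Fig. 8 [31]": (54987, 1.695, 1.689, 93202, 2.863),
    "Fig. 2.4":    (54522, 1.695, 1.675, 92415, 2.839),
}


def row_is_consistent(area, delay, power, adp, pdp):
    """Check ADP = area * delay and PDP = power * delay up to table rounding."""
    return abs(area * delay - adp) <= 1.0 and abs(power * delay - pdp) <= 0.001


def reduction_percent(new, old):
    """Relative reduction of `new` with respect to `old`, in percent."""
    return 100.0 * (old - new) / old


for design, row in TABLE_2_2.items():
    print(design, "consistent" if row_is_consistent(*row) else "inconsistent")

# Register (area) reduction of Fig. 2.3 relative to [11] for the pentanomial case.
print(round(reduction_percent(28149, 52482), 1), "% fewer slice registers")
```

Since the delay is identical for all designs in the table, the ADP and PDP ratios between designs reduce to the area and power ratios, respectively.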






























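$$[C_P] = \begin{bmatrix}
a_0 & a_{m_1-1} & a_{m_1-2} & a_{m_1-3} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \\
a_{s_1-1} & a_{s_1-2} & a_{s_1-3} & a_{s_1-4} & \cdots \\
a_{s_1} & a_{s_1-1} + a_{m_1-1} & a_{s_1-2} + a_{m_1-2} & a_{s_1-3} + a_{m_1-3} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \\
a_{s_2-1} & a_{s_2-2} & a_{s_2-3} & a_{s_2-4} & \cdots \\
a_{s_2} & a_{s_2-1} + a_{m_1-1} & a_{s_2-2} + a_{m_1-2} & a_{s_2-3} + a_{m_1-3} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \\
a_{s_3-1} & a_{s_3-2} & a_{s_3-3} & a_{s_3-4} & \cdots \\
a_{s_3} & a_{s_3-1} + a_{m_1-1} & a_{s_3-2} + a_{m_1-2} & a_{s_3-3} + a_{m_1-3} & \cdots \\
\vdots & \vdots & \vdots & \vdots &
\end{bmatrix} \times [B_P] \tag{2.17}$$
$$[C_T] = \begin{bmatrix}
a_0 & a_{m_2-1} & a_{m_2-2} & a_{m_2-3} & \cdots \\
a_1 & a_0 & a_{m_2-1} & a_{m_2-2} & \cdots \\
a_2 & a_1 & a_0 & a_{m_2-1} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \\
a_{s-1} & a_{s-2} & a_{s-3} & a_{s-4} & \cdots \\
a_{s} & a_{s-1} + a_{m_2-1} & a_{s-2} + a_{m_2-2} & a_{s-3} + a_{m_2-3} & \cdots \\
\vdots & \vdots & \vdots & \vdots &
\end{bmatrix} \times [B_T] \tag{2.18}$$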



































2.2 Proposed Low Complexity Hybrid-Size Systolic
Polynomial Multipliers
In this section, we first present the proposed hybrid-size polynomial multiplication
algorithm, and then give the details of the proposed low-complexity systolic structure
by means of a detailed example.
2.2.1 Proposed Hybrid Polynomial Multiplication Algorithm
First, let us assume that the field sizes for the pentanomial and the trinomial are $m_1$
and $m_2$, respectively. Without loss of generality, we also assume $m_1 < m_2$ (the
extension to other cases such as $m_1 > m_2$ is straightforward). Then, the pentanomial and
trinomial multiplication processes of (2.1)-(2.16) can be represented by the two
matrix-vector multiplications of (2.17) and (2.18), respectively. For simplicity of
discussion, we write (2.17) as $[C_P] = [A_P] \times [B_P]$, where $[C_P]$ and $[B_P]$ are
bit-vectors containing all the bits of operands $C$ and $B$, respectively, and $[A_P]$ is
the multiplication process matrix of (2.17). Similarly, we write (2.18) as
$[C_T] = [A_T] \times [B_T]$, where $[C_T]$ and $[B_T]$ are the bit-vectors of operands $C$
and $B$, respectively, and $[A_T]$ is the multiplication process matrix of (2.18).
Comparing $[A_P]$ of (2.17) with $[A_T]$ of (2.18), we find that they share a common
$m_1 \times m_1$ matrix
$$[A_C] = \begin{bmatrix}
a_0 & 0 & 0 & \cdots & 0 \\
a_1 & a_0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{m_1-1} & a_{m_1-2} & a_{m_1-3} & \cdots & a_0
\end{bmatrix}, \tag{2.19}$$
where $[A_C]$ is a "half-cyclic" matrix, i.e., nearly half of its entries are "0", while the
rest are the bits $a_0, a_1, a_2, \ldots, a_{m_1-1}$ (no bit-addition operations are involved).
Substituting (2.19) into $[C_P] = [A_P] \times [B_P]$, we have
$$[C_P] = ([A_{PX}] + [A_C]) \times [B_P], \tag{2.20}$$
where $[A_{PX}]$ is an $m_1 \times m_1$ matrix given by
$$[A_{PX}] = \begin{bmatrix}
0 & a_{m_1-1} & a_{m_1-2} & \cdots \\
0 & 0 & a_{m_1-1} & \cdots \\
0 & 0 & 0 & \cdots \\
0 & a_{m_1-1} & a_{m_1-2} & \cdots \\
0 & 0 & a_{m_1-1} & \cdots \\
0 & 0 & 0 & \cdots \\
0 & a_{m_1-1} & a_{m_1-2} & \cdots \\
0 & a_{m_1-1} & a_{m_1-2} + a_{m_1-1} & \cdots \\
\vdots & \vdots & \vdots & \\
0 & a_{m_1-2} & a_{m_1-3} & \cdots
\end{bmatrix}. \tag{2.21}$$
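Similarly, one can also decompose $[C_T] = [A_T] \times [B_T]$ as
$$[C_T] = ([A_{TX}] + [A'_C]) \times [B_T], \tag{2.22}$$
where $[A'_C]$ is an $m_2 \times m_2$ matrix consisting of the matrix $[A_C]$ with "0" entries
filling the remaining positions:
$$[A'_C] = \begin{bmatrix}
[A_C] & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}, \tag{2.23}$$
and the matrix $[A_{TX}]$ is given by
$$[A_{TX}] = \begin{bmatrix}
0 & a_{m_2-1} & a_{m_2-2} & a_{m_2-3} & \cdots \\
0 & 0 & a_{m_2-1} & a_{m_2-2} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \\
0 & a_{m_2-1} & a_{m_2-2} & a_{m_2-3} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \\
a_{m_1} & a_{m_1-1} & a_{m_1-2} & a_{m_1-3} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \\
a_{m_2-1} & a_{m_2-2} & a_{m_2-3} & a_{m_2-4} & \cdots
\end{bmatrix}. \tag{2.24}$$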
While (2.19) and (2.23) are matrices with no XOR operations, (2.21) and (2.24) still
involve a certain number of XOR operations. To facilitate the proposed systolic
implementation, these XOR operations can be computed in advance, which simplifies the
remaining computation. For example, the term $a_{m_1-2} + a_{m_1-1}$ in the third column of
the matrix $[A_{PX}]$ in (2.21) can be precomputed (similar rules apply to $[A_{TX}]$ of
(2.24)). Accordingly, (2.20) and (2.22) can be rewritten as
$$[C_P] = (\rho\{[A_{PX}]\} + [A_C]) \times [B_P], \tag{2.25}$$
and
$$[C_T] = (\rho\{[A_{TX}]\} + [A'_C]) \times [B_T], \tag{2.26}$$
where $\rho\{\cdot\}$ represents the above-mentioned pre-XOR-computation operations.
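The decomposition of (2.20) can be checked numerically on a small field. The following Python sketch is an illustration only; it assumes the integer bit-vector representation used for software modelling (bit r of a column is the entry in row r), and the helper names `process_matrix`, `half_cyclic`, `split`, and `mat_vec` are hypothetical. It builds the columns A^(i) of the multiplication process matrix, forms the "half-cyclic" part of (2.19), and obtains the remaining part by XOR.

```python
def mul_by_t(a, m, low_exps):
    """t * a mod F(t), F(t) = t^m + sum(t^s for s in low_exps) + 1; bit j of a is a_j."""
    carry = (a >> (m - 1)) & 1
    a = (a << 1) & ((1 << m) - 1)
    if carry:
        a ^= 1
        for s in low_exps:
            a ^= 1 << s
    return a


def process_matrix(a, m, low_exps):
    """Columns A^(0), ..., A^(m-1) of the multiplication process matrix."""
    cols = []
    for _ in range(m):
        cols.append(a)
        a = mul_by_t(a, m, low_exps)
    return cols


def half_cyclic(a, m):
    """Columns of [A_C] in (2.19): entry (r, i) is a_{r-i} for r >= i, else 0."""
    mask = (1 << m) - 1
    return [(a << i) & mask for i in range(m)]


def split(a, m, low_exps):
    """Return ([A_C], [A_X]) with [A_P] = [A_X] + [A_C] over GF(2)."""
    a_c = half_cyclic(a, m)
    a_x = [p ^ c for p, c in zip(process_matrix(a, m, low_exps), a_c)]
    return a_c, a_x


def mat_vec(cols, b):
    """Matrix-vector product over GF(2): XOR of the columns selected by bits of b."""
    acc = 0
    for i, col in enumerate(cols):
        if (b >> i) & 1:
            acc ^= col
    return acc
```

For instance, with the small pentanomial t^8 + t^4 + t^3 + t + 1 (chosen here only to keep the example short), `split(a, 8, (4, 3, 1))` returns a residual matrix whose first column is zero, matching the all-zero first column of (2.21), and `mat_vec(a_c, b) ^ mat_vec(a_x, b)` equals the full product.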
Based on (2.17)-(2.26), we obtain the proposed hybrid multiplication algorithm given
by Algorithm 2.1.
Algorithm 2.1 Proposed hybrid-size polynomial multiplication.
Inputs: $A$ and $B$ are the two field elements of polynomials in $GF(2^m)$ to be multiplied,
$S$ is the select signal.
Output: $C = A \cdot B \bmod F(t)$ ($F(t)$ can be a pentanomial or a trinomial).
1. Initialization step
1.1 $[A_P] = [A_C] + [A_{PX}]$.
1.2 $[A_T] = [A'_C] + [A_{TX}]$.
1.3 $[A_1] = 0$, $[A_2] = 0$, $[D] = 0$, $[D'] = 0$, $[E] = 0$, $[E_1] = 0$, $[E_2] = 0$.
2. Pre-XOR-computation step
2.1-a. $[A_1] = \rho\{[A_{PX}]\}$.
2.1-b. $[A_2] = \rho\{[A_{TX}]\}$.
3. Multiplication step
3.1. $[D] = [A_C] \times [B_P]$ (or $[D'] = [A'_C] \times [B_T]$).
3.2-a. $[E_1] = [A_1] \times [B_P]$.
3.2-b. $[E_2] = [A_2] \times [B_T]$.
4. Selection step
If $S = 0$, then
$[E] = [E_1] + [D]$.
Else,
$[E] = [E_2] + [D']$.
5. Final step
5.1. Operand $C \Leftarrow [C] = [E]$.
where, for simplicity of discussion, $[E]$ is sized as $m_1$ by $m_1$ for the pentanomial (the
same size as $[D]$), and $m_2$ by $m_2$ for the trinomial (the same size as $[D']$).
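The steps of Algorithm 2.1 can be mirrored in a compact software model. The Python sketch below is an assumption-laden illustration rather than the hardware design: field elements are integers (bit j is a_j), the function names are hypothetical, and, unlike the algorithm, only the branch chosen by the select signal is evaluated. The shared matrix [A_C] is built once from the low m1 bits of A and reused in both modes, with zero padding playing the role of [A'_C].

```python
NIST_PENTA = (163, (7, 6, 3))   # F(t) = t^163 + t^7 + t^6 + t^3 + 1
NIST_TRI = (233, (74,))         # F(t) = t^233 + t^74 + 1


def mul_by_t(a, m, low_exps):
    """t * a mod F(t), F(t) = t^m + sum(t^s for s in low_exps) + 1."""
    carry = (a >> (m - 1)) & 1
    a = (a << 1) & ((1 << m) - 1)
    if carry:
        a ^= 1
        for s in low_exps:
            a ^= 1 << s
    return a


def columns(a, m, low_exps):
    """Columns A^(0), ..., A^(m-1) of the multiplication process matrix."""
    cols = []
    for _ in range(m):
        cols.append(a)
        a = mul_by_t(a, m, low_exps)
    return cols


def mat_vec(cols, b):
    """GF(2) matrix-vector product: XOR of the columns selected by the bits of b."""
    acc = 0
    for i, col in enumerate(cols):
        if (b >> i) & 1:
            acc ^= col
    return acc


def reference_mul(a, b, m, low_exps):
    """Plain multiplication C = sum over i of b_i * A^(i), used for comparison."""
    return mat_vec(columns(a, m, low_exps), b)


def hybrid_mul(a, b, select, penta=NIST_PENTA, tri=NIST_TRI):
    """Software model of Algorithm 2.1; select = 0 picks the pentanomial field."""
    m1, _ = penta
    m, exps = penta if select == 0 else tri
    mask1 = (1 << m1) - 1
    # Shared half-cyclic matrix [A_C], built from a_0 ... a_{m1-1} as in (2.19).
    a_c = [((a & mask1) << i) & mask1 for i in range(m1)]
    d = mat_vec(a_c, b & mask1)                  # step 3.1: [D] (or [D'])
    a_c_padded = a_c + [0] * (m - m1)            # [A'_C] of (2.23) when m = m2
    # Residual matrix with all XORs precomputed: rho{[A_PX]} or rho{[A_TX]}.
    a_x = [p ^ c for p, c in zip(columns(a, m, exps), a_c_padded)]
    e = mat_vec(a_x, b)                          # step 3.2: [E1] or [E2]
    return e ^ d                                 # step 4: [E] = [E1]+[D] or [E2]+[D']
```

A call such as `hybrid_mul(a, b, 0)` treats the operands as elements of GF(2^163), while `hybrid_mul(a, b, 1)` treats them as elements of GF(2^233); in both cases the result agrees with `reference_mul` for the corresponding polynomial.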
Figure 2.5: Detailed example according to Algorithm 2.1. (a) Pentanomial example of
GF(2^163). (b) Trinomial example of GF(2^233).
2.2.2 Detailed Example and Computation Steps
To give a clear understanding of the proposed algorithm and design strategy, we take
as an example the pentanomial F(t) = t^163 + t^7 + t^6 + t^3 + 1 and the trinomial
F(t) = t^233 + t^74 + 1. Then, according to Algorithm 2.1, we have the equations as shown
in Fig. 2.5.



















































Figure 2.6: Proposed hybrid-size systolic multiplier, where the dotted box denotes selective connection.






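Here, \(A'^{(0)}_{C}, \cdots, A'^{(232)}_{C}\) are the corresponding columns of [A′C], and
\(A^{(0)}_{PX}, \cdots, A^{(162)}_{PX}\) and \(A^{(0)}_{TX}, \cdots, A^{(232)}_{TX}\) are
the columns of ρ{[APX]} and ρ{[ATX]}, respectively.
Then, the remaining multiplication step according to Algorithm 2.1 is:
\[
[E] = A^{(0)}_{C} b_0 + A^{(1)}_{C} b_1 + \cdots \quad \text{(pentanomial)},
\]
\[
[E] = A'^{(0)}_{C} b_0 + A'^{(1)}_{C} b_1 + \cdots \quad \text{(trinomial)}.
\]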











Then, we can have the following structure.
2.2.3 Proposed Hybrid-Size Systolic Structure
The overall structure of hybrid-size systolic multiplier based on Algorithm 2.1 (where the
example of Fig. 2.5 is employed) is shown in Fig. 2.6, which consists of one computation





















































































































Figure 2.7: Detailed internal structure of PE-0.
to Step 2 of Algorithm 2.1. The detailed structure of PE-0 is shown in Fig. 2.7, where
we use a multi-stage pipelining technique to keep the critical-path of PE-0 at TX (TX
is the delay of an XOR gate; according to Fig. 2.5, the maximum time for the
pre-XOR-computation operations is 4TX for [APX] and 2TX for [ATX]). Note
that the bits without XOR operations do not need additional pipelining registers.
The output bits of PE-0 (956 bits in total: 233 bits of a0 through a232, 491 bits from
the pre-XOR-computation of [APX], and 232 bits from the pre-XOR-computation of [ATX]) are
selectively connected to the computation core (as indicated by the dotted box), where
the computation core mainly executes Steps 3 and 4 of Algorithm 2.1.
The internal structure of the computation core is shown in Fig. 2.8, where four arrays























































































Figure 2.8: Detailed structure of the proposed multiplier, where the black boxes represent
the registers. (a) Internal structure of the computation core. (b) Example of detailed
design of PE-1. (c) Example of detailed design of a regular PE. (d) Detailed design of
PE-2.
second array is for [APX], and the third and fourth arrays are for the calculation of [ATX]
(decomposed into two arrays for parallel processing). The computation core contains 558
PEs (of two kinds), and Figs. 2.8(b) and (c) give examples
of the detailed design of PE-1 and a regular PE, respectively. PE-1 performs multiplication
between one selected operand and one bit of operand B, and then produces the result to











































Figure 2.9: Proposed low-latency systolic structure.
Table 2.3: COMPARISON OF AREA-TIME COMPLEXITIES OF VARIOUS SYSTOLIC POLYNOMIAL MULTIPLIERS
Design          AND      NAND   XOR      XNOR   Register   Latency   Critical-path Delay
For NIST Pentanomial F(t) = t^163 + t^7 + t^6 + t^3 + 1
Fig. 5(a)       27,658   -      27,986   -      28,312     163       TA + TX
[11]            26,569   -      27,225   -      52,482     44        TA + TX
MBP-II [29]     26,569   -      26,890   -      53,136     164       TA + TX
For NIST Trinomial F(t) = t^233 + t^74 + 1
Fig. 5(b)       58,100   -      58,169   -      58,565     233       TA + TX
[8] (1, ∗)      54,289   -      55,454   -      121,684    32        TA + TX
Fig. 3 (∗) [10] 54,289   -      81,551   -      81,900     235       TA + TX
For Hybrid-size Polynomial m1 = 163, m2 = 233
Fig. 2.6        72,858   -      72,882   -      74,940     233       TA + TX
TA and TX are the critical-path delays of an AND gate and an XOR gate, respectively.
∗: The authors have used NAND and XNOR; we list them as AND and XOR gates here
for simplicity of discussion.
1: The super-systolic structure.
2: The structure with e = 2 and d = 1 (PEs from MS-I).
one bit of operand B first, and then adds the input from the left PE to yield the final
result to the right.
The detailed design of PE-2 of Fig. 2.6 is also shown in Fig. 2.8(d), where the selection
signal S produces the output C according to the selected field size (as in Step 4
of Algorithm 2.1). Note that for the pentanomial, only the first 163 output bits (from the top)
are counted as the output C. The proposed systolic structure produces a result in 170
cycles (4 cycles for PE-0, 163 cycles for the computation core, and 3 cycles for PE-2), and the
critical-path of each cycle is TA + TX (TA denotes the delay of an AND gate).
2.2.4 Low-Latency Structure
For practical implementations, we can further reduce the latency of the proposed structure
in Figs. 2.6 and 2.8. Let 233 = l · d + x, where 0 ≤ x < d (this is applicable to any field size
m). We assume x = 0 for simplicity; the approach can be extended to x ≠ 0. Then, we
can decompose each array in the computation core of Fig. 2.8(a) into l arrays, as
shown in Fig. 2.9 (each array has d PEs).
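The decomposition m = l · d + x can be illustrated with a few lines of Python. This is only a sketch of the index bookkeeping: the helper names split_array and pe_groups are hypothetical, and placing the leftover x PEs in a shorter final array is an assumption made here, since the text only treats the case x = 0.

```python
def split_array(m, d):
    """Return (l, x) such that m = l*d + x with 0 <= x < d."""
    if d <= 0:
        raise ValueError("d must be a positive integer")
    return divmod(m, d)


def pe_groups(m, d):
    """Group PE indices 0..m-1 into consecutive arrays of d PEs.

    Assumption: when x != 0, the remaining x PEs form one shorter final array.
    """
    return [range(start, min(start + d, m)) for start in range(0, m, d)]
```

For instance, split_array(233, 8) gives l = 29 and x = 1, so pe_groups(233, 8) yields 29 arrays of 8 PEs followed by one array holding a single PE.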
Table 2.4: FPGA IMPLEMENTATION RESULTS OF VARIOUS POLYNOMIAL-BASED MULTIPLIERS
Design     Area      Delay (1)   Power   ADP (2)   PDP (3)
Fig. 2.3   28,149    1.695       0.865   47,713    1.446
Fig. 2.4   54,552    1.695       1.675   92,415    2.839
Proposed   77,006    1.695       2.364   130,525   4.007
Unit for area: number of slice registers; unit for delay: ns; unit for power: W (power is
estimated at 100 MHz).
1: Delay = critical-path.
2: ADP: area-delay product = Area × Delay.
3: PDP: power-delay product = Power × Delay.
2.3 Area and Time Complexity
2.3.1 Theoretical Comparison
The area and time complexities in terms of logic gate count, register count, latency, and
critical-path delay of the proposed and existing structures are listed in Table 2.3. The
comparison contains two parts: pentanomials and trinomials. According to the table,
the multiplier in [11] requires 52,482 registers and MBP-II in [29] requires 53,136
registers, while our modified pentanomial structure shown in Fig. 2.5(a) requires
28,312 registers, which is 46.05% fewer than [11] and 46.72% fewer than
[29]. A similar result holds for trinomials. Our modified trinomial structure
shown in Fig. 2.5(b) requires 58,565 registers, fewer than
the existing ones: the structure of Fig. 3 in [10] has 81,900 registers,
while the super-systolic structure in [8] needs even more, 121,684. Our
modified trinomial structure thus saves 28.49% and 51.87% of the registers when compared to [10]
and [8], respectively. Our proposed hybrid-size systolic polynomial multiplier shown in
Fig. 2.6 has much lower area complexity than any combination of the listed pentanomial
and trinomial multipliers.
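The percentages quoted above follow directly from the register counts of Table 2.3. A small sketch of the arithmetic is given below; the helper name saving_percent and the dictionary layout are introduced here only for illustration.

```python
def saving_percent(existing, proposed):
    """Relative register saving of the proposed design, in percent."""
    return 100.0 * (existing - proposed) / existing


# Register counts taken from Table 2.3.
PENTANOMIAL = {"proposed": 28312, "[11]": 52482, "[29]": 53136}
TRINOMIAL = {"proposed": 58565, "[10]": 81900, "[8]": 121684}

for table in (PENTANOMIAL, TRINOMIAL):
    for design, registers in table.items():
        if design != "proposed":
            saving = saving_percent(registers, table["proposed"])
            print(f"vs {design}: {saving:.2f}% fewer registers")
```

Rounded to two decimals, the four savings are 46.05%, 46.72%, 28.49%, and 51.87%, which agrees with the figures in the text.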
2.3.2 FPGA Implementation
We also implement our proposed design on hardware (an FPGA platform) to investigate the
performance of the proposed low-complexity, low-latency hybrid-size systolic polynomial
multiplier shown in Fig. 2.9. We have synthesized the proposed structure using Xilinx
ISE 14.1 on the Virtex 6 FPGA family. The results in terms of area-time-power complexities
are shown in Table 2.4. It can be seen that our proposed low-complexity hybrid-size






In this chapter, we propose a digit-level systolic Gaussian normal basis multiplier which
can achieve low critical-path and low register-complexity.
3.1 Review of the Existing DL Systolic GNB Multiplier
A normal basis N = {β, β^2, β^{2^2}, · · · , β^{2^{m−1}}} exists in the finite field GF(2^m) over GF(2) for
any positive integer m, where β is called the normal element. Each field element in GF(2^m),
e.g., A = (a0, a1, · · · , am−1), can be represented as a linear combination
of the elements in N, i.e.,
\[
A = \sum_{i=0}^{m-1} a_i \beta^{2^i} = a_0\beta + a_1\beta^{2} + \cdots + a_{m-1}\beta^{2^{m-1}},
\]
where a_i ∈ GF(2), 0 ≤ i ≤ m − 1. Assume that m > 1 and T > 1 are two integers. Let
p = mT + 1 be a prime number and gcd(mT/k, m) = 1, where k is the multiplicative
order of 2 modulo p. Then, the normal basis N = {β, β^2, β^{2^2}, · · · , β^{2^{m−1}}} of GF(2^m) over
GF(2) is called the Gaussian normal basis of type T [45].
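The existence condition above (p = mT + 1 prime and gcd(mT/k, m) = 1, with k the multiplicative order of 2 modulo p) can be checked mechanically. The following is a minimal sketch; the function names are hypothetical and the primality test is plain trial division, which is adequate only for the small values of p that arise here.

```python
from math import gcd


def is_prime(n):
    """Trial-division primality test (sufficient for small n)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def multiplicative_order(a, p):
    """Smallest k >= 1 with a^k = 1 (mod p); assumes gcd(a, p) = 1."""
    k, x = 1, a % p
    while x != 1:
        x = (x * a) % p
        k += 1
    return k


def gnb_exists(m, T):
    """Check the type-T Gaussian normal basis condition for GF(2^m)."""
    p = m * T + 1
    if not is_prime(p):
        return False
    k = multiplicative_order(2, p)
    return gcd(m * T // k, m) == 1
```

With this check, gnb_exists(7, 4) holds (p = 29 and the order of 2 modulo 29 is 28), as do the (m, T) pairs (163, 4), (283, 6), and (409, 4) that appear later in Table 3.4.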
The multiplication over GNB can be performed based on a multiplication matrix Mm×m











then we can have their product C as







Let us define µ_{i,j} = β^{2^i + 2^j} ∈ GF(2^m) as a field element, where 0 ≤ i, j ≤ m − 1. Then,










































which can also be represented in a matrix form as
c_l = a · M^{(l)} · b^{tr}, 0 ≤ l ≤ m − 1, (3.6)
where a = [ a0, a1, · · · , am−1 ] denotes the row vector corresponding to the field element
A, and b^{tr} represents the transpose of the row vector b = [ b0, b1, · · · , bm−1 ], which
corresponds to the field element B. M^{(l)} can be obtained from the l-fold right and















Figure 3.1: (a) Existing DL systolic GNB multiplier over GF(2^m) [48]. (b) Detailed
structure of PE.
(PE labels in panel (b): C_out = (C_in >> d) + L(B_in, V_in); V_out = V_in << d.)
Algorithm 3.1 Existing DL Systolic multiplication [48].
Inputs: A = (a0, a1, · · · , am−1) and B = (b0, b1, · · · , bm−1) are two field elements over
GF (2m), where m is an odd number.
Output: C = (c0, c1, · · · , cm−1) = A ·B.
1. C = 0
2. V = A 1, where A = (am−1, am−2, · · · , a1, a0).
3. for i = 0 to n− 1 do.
4. Ci = 0.
5. Vi,0 = V  kid+ (k − 1)d.
6. Bi,0 = B  kid+ (k − 1)d.
7. for j = 1 to k do.
8. Ci = (Ci  d) + L(Vi,j−1, Bi,j−1).
9. Vi,j = Vi,j−1  d and Bi,j = Bi,j−1  d.
10. end for.
11. C = (C  kd) + Ci.
12. end for.
Recently, a low-latency systolic GNB multiplier has been proposed in [48]. Algorithm
3.1 describes the existing systolic GNB multiplication. Let us define q = ⌈m/d⌉, where




2id(V  id, B  id), where L(V,B) =
∑d−1
j=0 J
2j(V  j, B  j). Let
us define n and k as two integers that satisfy q = kn, then, we can get the partial
product Ci by Ci =
∑k−1
j=0 L
2id(Vi  jd, Bi  jd). Thus, one can decompose the
product C into n-term partial products, which is C = C0 + C
2kd





2kd + · · · )2kd + C0. Each partial product can be written as Ci =
(((L(Vi, Bi))
2d + L(Vi  d, Bi  d))2
d
+ · · ·+)2d + L(Vi  (k − 1)d, Bi  (k − 1)d).
Based on Algorithm 3.1, Fig. 3.1(a) depicts the existing DL systolic GNB multiplier
over GF(2^m) [48]. The existing multiplier contains k processing elements
(PEs) and one accumulation module (AM). Each PE carries out Steps 8 and
9 of Algorithm 3.1, and the AM carries out Step 11.
Fig. 3.1(b) presents the detailed internal structure of the PEs. According to Algorithm 3.1,
we need to define the output B of PEj as Bi,j to compute the partial product Ci. Two
elements Vi and Bi, 1 ≤ i ≤ n−1, are fed to the multiplier from the left cyclically to compute
the partial product Ci recursively. The latency of the multiplier is (k + n) cycles, i.e.,
it takes (k + n) cycles to get the final product C = AB. The critical-path delay of the
existing systolic GNB multiplier is TA + (⌈log2 T⌉ + ⌈log2(d + 1)⌉)TX, where TA and TX




















In this section, we present our proposed algorithms separately to reduce the critical-path
and register-complexity.
3.2.1 Low Critical-Path Delay
From the structure of Fig. 3.1(b) and Algorithm 3.1, we find that inside the PE, the oper-
ation L(Vin, Bin) consists of three main parts, i.e., two additions and one multiplication.
The first addition consists of a recombined addition operation (RAO), which performs




























Figure 3.2: Proposed strategy of designing low critical-path delay DL systolic GNB multiplier
over GF(2^m), where S box refers to shift operation. (a) Proposed DL systolic GNB
multiplier. (b) Detailed structure of PEs, where black boxes denote registers.
the addition of a series of reconstructed versions of operand B itself. All the bits of operand B are
positionally recombined, and some of them are added together to form the new bits according
to Algorithm 3.1. Then, the multiplication is performed between the operand B
after the recombined addition and the operand V. The result of the multiplication is added
to the input from the previous PE, and the sum is passed to the next PE on its
right. The critical-path delay of this structure is thus TA + (⌈log2 T⌉ + ⌈log2(d + 1)⌉)TX.
Although the structure in Fig. 3.1 is efficient in implementation, it can still be improved
further, e.g., the critical-path delay needs to be shortened for practical high-performance
applications. Here, we introduce a novel algorithm in which the critical-path delay is
shortened by performing the RAO of the operand B in advance.
Fig. 3.2 depicts the proposed strategy of achieving a low critical-path delay implementation
based on the existing DL systolic multiplier over GF(2^m). From Fig. 3.2(a), one can see
that the structure consists of (k + 1) PEs and one accumulation cell (AC). The detailed
internal operational structures of these components are presented in Fig. 3.2(b). First
of all, an additional PE (PE-0) is added to perform the first RAO of operand B before
PE-1. The result of the RAO from PE-0 is then passed to PE-1 to perform the multiplication
and then the addition. Meanwhile, PE-1 still contains an RAO, which passes its result
to the next PE on its right. The “S box” performs the shifting of the bits of both operands
V and B. After (k + 1) clock cycles, we get the first partial product. All the partial
products Ci are recursively fed into the AC to get the final result C after n clock cycles.
The critical-path delay of this proposed architecture is thus TA + (⌈log2(d + 1)⌉)TX (for
the NIST recommended GNB, (⌈log2(d + 1)⌉)TX is larger than ⌈log2 T⌉TX).
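To make the two critical-path expressions concrete, the sketch below counts the XOR levels (the multiplier of TX) for the existing structure, ⌈log2 T⌉ + ⌈log2(d + 1)⌉, and for the proposed one, ⌈log2(d + 1)⌉. The function names are introduced here for illustration, and the example values T = 4 and d ∈ {7, 13, 19, 33} are those used for GF(2^409) later in this chapter.

```python
def ceil_log2(n):
    """Ceiling of log2(n) for an integer n >= 1, using integer arithmetic only."""
    return (n - 1).bit_length()


def xor_levels_existing(T, d):
    """XOR levels on the critical path of the existing PE: RAO plus accumulation."""
    return ceil_log2(T) + ceil_log2(d + 1)


def xor_levels_proposed(d):
    """XOR levels on the critical path once the RAO is moved one PE ahead."""
    return ceil_log2(d + 1)


for d in (7, 13, 19, 33):
    print(d, xor_levels_existing(4, d), xor_levels_proposed(d))
```

For T = 4 the existing structure needs 5, 6, 7, and 8 XOR levels for d = 7, 13, 19, and 33, while the proposed one needs 3, 4, 5, and 6; in both cases one AND-gate delay TA is added.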
According to the strategy shown in Fig. 3.2, we derive here the modified low critical-path
delay DL systolic multiplication algorithm as presented in the proposed Algorithm 3.2.
Algorithm 3.2 Low critical-path delay DL Systolic Multiplication.
Inputs: A = (a0, a1, · · · , am−1) and B = (b0, b1, · · · , bm−1) are two field elements over
GF (2m), where m is an odd number.
Output: C = (c0, c1, · · · , cm−1) = A ·B.
1. C = 0
2. V = A 1, where A = (am−1, am−2, · · · , a1, a0).
3. for i = 0 to n− 1 do.
4. Ci = 0.
5. Vi,0 = V  kid+ (k − 1)d.
6. Bi,0 = B  kid+ (k − 1)d.
7. B∗i,0 = RAO(Bi,0).
8. for j = 1 to k do.
9. Ci = (Ci  d) +MA(Vi,j−1, B∗i,j−1).
10. Vi,j = Vi,j−1  d and Bi,j = Bi,j−1  d.
11. B∗i,j = RAO(Bi,j).
12. end for.
13. C = (C  kd) + Ci.
14. end for.
According to Algorithm 3.2, we perform the first RAO of the operand B independently
in PE-0, which corresponds to Step 7. Each PE (from PE-1 to PE-k) carries out the
computation of Steps 9, 10, and 11 of Algorithm 3.2, where MA denotes the multiplication
and addition inside each PE, and the AC carries out Step 13. By computing the RAO
of operand B in advance in each PE, we have shortened the critical-path delay of the




























Figure 3.3: Data pipelining of operand B among PEs for the structure of Fig. 3.1, where
the diagonal line represents data flow between PEs, and the vertical line represents data
flow in one PE.

































Figure 3.4: Data pipelining of operand V among PEs for the structure of Fig. 3.1, where
the diagonal line represents data flow between PEs, and the vertical line represents data
flow in one PE.
3.2.2 Low Register-Complexity
Systolic structures sometimes suffer from large register-complexity, as all the PEs in the
array are uniform and fully pipelined (there are many registers in the PEs). In this
subsection, we propose a novel strategy to reduce the corresponding registers among PEs.
As seen from Fig. 3.1(b), there are generally two types of registers in one PE:
one type for operand pipelining (after bit shifting, the top one for operand B and the
bottom one for operand V), and another for pipelining of computation (the registers used
to pipeline the data after the L(Bin, Vin) operation). The registers used to pipeline the
computational data are critical to the correctness of the final output, while the registers for
pipelining the shifted operands (the top and bottom ones) are relatively less important.
Based on the above consideration, we propose here a novel strategy to reduce the registers
related to the pipelining of the shifted operands (the top and bottom ones). Let us first




































Figure 3.5: Data pipelining of operand B among PEs with added operands for the struc-
ture of Fig. 3.1, where the gray area represents all added operands, and the green area
represents one specific operand cluster fed to all PEs.





































Figure 3.6: Data pipelining of operand V among PEs with added operands for the struc-
ture of Fig. 3.1, where the gray area represents all added operands, and the green area











Figure 3.7: Example of rearranging data pipelining by using operand cluster B′i.
[48]. It is seen in Fig. 3.3 that the shifted operand's subscript (the subscript denotes
the degree of shifting, according to Fig. 3.1) increases by one per cycle for a single PE (for
neighboring PEs, within the same cycle, the subscript increases with the PE
number). The pipelining of the bits of the shifted operand V is similar to that of the shifted operand B,
as shown in Fig. 3.4.
To give the detailed register-reduction strategy, we first add extra operands with
corresponding subscripts to fill the data flow table (highlighted gray areas), such that the
operand's subscript increases by one per cycle for a single PE and by one between neighboring PEs,
as shown in Figs. 3.5 and 3.6. For simplicity of discussion, we define all the shifted
operands for one specific clock cycle as one operand cluster, as shown in Figs. 3.5, 3.6,
and 3.7. Thus, after the required clock cycles, we still get the same output as that
in Fig. 3.1, since the corresponding operand is still fed to the corresponding PE during
each cycle period.





i (for k number of PEs), where 0 ≤ i ≤ n−1 , to represent each operand cluster,
whose initial value is
\[
\begin{aligned}
B'_i &= [\, B_{(n-1-i)},\ B_{(n-i)},\ \cdots,\ B_{(n+k-2-i)} \,], \\
V'_i &= [\, V_{(n-1-i)},\ V_{(n-i)},\ \cdots,\ V_{(n+k-2-i)} \,],
\end{aligned} \tag{3.8}
\]
which can also be represented as
B
′






























Then, we have the modified low register-complexity DL systolic multiplication algorithm
as proposed in Algorithm 3.3.
Algorithm 3.3 Low Register-Complexity DL Systolic Multiplication.
Inputs: A = (a0, a1, · · · , am−1) and B = (b0, b1, · · · , bm−1) are two field elements over
GF (2m), where m is an odd number.
Output: C = (c0, c1, · · · , cm−1) = A ·B.
1. C = 0
2. V = A 1, where A = (am−1, am−2, · · · , a1, a0).
3. for i = 0 to n− 1 do.
4. Ci = 0.
5. V
′




















7. for j = 1 to k do.




10. C = (C  kd) + Ci.
11. end for.
where Steps 5 and 6 denote the initialization of the value of the operand-vector and
their cyclic shifting. Through this arrangement, the registers used for pipelining
the shifted operands can be removed.
3.2.3 Proposed DL Systolic GNB Multiplication Algorithm
Based on the discussion in Subsections 3.2.1 and 3.2.2, we combine the two modified
algorithms to obtain our novel multiplication algorithm. The proposed multiplication
algorithm for the DL systolic GNB multiplier over GF(2^m), which achieves both
low critical-path delay and low register-complexity, is given in Algorithm 3.4.
Algorithm 3.4 Proposed DL Systolic Multiplication.
Inputs: A = (a0, a1, · · · , am−1) and B = (b0, b1, · · · , bm−1) are two field elements over
GF (2m), where m is an odd number.
Output: C = (c0, c1, · · · , cm−1) = A ·B.
1. C = 0
2. V = A 1, where A = (am−1, am−2, · · · , a1, a0).
3. for i = 0 to n− 1 do.
4. Ci = 0.
5. V
′

























8. for j = 1 to k do.









12. C = (C  kd) + Ci.
13. end for.
As one can see in Algorithm 3.4, Steps 5 and 6 initialize the value of the operand
cluster and perform the cyclic shifting. The RAO is performed by Steps
7 and 10, and the AC is computed by Step 12. By computing the first RAO in advance and
rearranging the data broadcasting, we have successfully shortened the critical-path delay
and reduced the register-complexity.
3.3 Proposed Low Critical-path Delay Low Register-Complexity DL Systolic GNB Multiplier
Based on the proposed Algorithm 3.4, we present here the proposed DL systolic GNB
multiplier over GF (2m), which can achieve both low critical-path delay and low register-
complexity.
3.3.1 Proposed Structure
The proposed DL systolic structure of the GNB multiplier over GF(2^m) based on the proposed
Algorithm 3.4 is depicted in Fig. 3.8. As shown in Fig. 3.8(a), it consists of one AC,
(k + 1) PEs, and two shift-registers for operands B and V, respectively. The
detailed internal structures of the AC and PEs are presented in Fig. 3.8(b). The shift-
registers rearrange all the bits of operands B and V, so that there will be only one operand







































Figure 3.8: Proposed low critical-path delay low register-complexity DL systolic GNB
multiplier over GF (2m). (a) Proposed structure of DL systolic GNB multiplier. (b)
Detailed internal structures of PEs, where black boxes denote registers.
cluster to be fed to the k PEs in one clock cycle. PE-0 has only
one component, the RAO, which yields an output to PE-1 to perform the multiplication with
operand V and then the addition. The internal structures of PE-1 to PE-(k−1)
are the same: each PE contains one RAO, one multiplication operation, and one
addition operation. The RAO performs the recombined addition operation, whose result
is passed to the multiplication operation in the next PE on its right. The multiplication
is then performed between operand V and the result of the RAO from the previous PE, and
the result of the multiplication is then added to the output of the addition operation from
the previous PE. PE-k performs only the multiplication and addition operations. The result
of the last PE (PE-k) is fed to the AC every cycle after the AC receives its first input
from the left, and the final result C is obtained after (n + k + 1) clock cycles.
3.3.2 An Example
Let us take the type 4 GNB over GF(2^7) as an example. Assume d = 1 and k = 7; then, we




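\[
\begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 1 & 1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 1 & 1
\end{bmatrix}. \tag{3.11}
\]
Thus, we can obtain
\[
\begin{aligned}
c_0 = {} & a_0 b_1 + a_1 (b_0 + b_2 + b_5 + b_6) \\
& + a_2 (b_1 + b_3 + b_4 + b_5) + a_3 (b_2 + b_5) \\
& + a_4 (b_2 + b_6) + a_5 (b_1 + b_2 + b_3 + b_6) \\
& + a_6 (b_1 + b_4 + b_5 + b_6).
\end{aligned} \tag{3.12}
\]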













\[
\begin{aligned}
B^{(0)*}_{0} = [\, & b_1,\ (b_0 + b_2 + b_5 + b_6), \\
& (b_1 + b_3 + b_4 + b_5),\ (b_2 + b_5), \\
& (b_2 + b_6),\ (b_1 + b_2 + b_3 + b_6), \\
& (b_1 + b_4 + b_5 + b_6) \,].
\end{aligned} \tag{3.15}
\]
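The GF(2^7) example can be reproduced in software. The sketch below takes the matrix of (3.11), forms the recombined sums of operand B as in (3.15), and evaluates c_0 as in (3.12). The function names rao, c0, and gnb_mul are chosen here for illustration. Obtaining the remaining coordinates c_l by rotating both operands left by l positions is the standard normal-basis property and is stated here as an assumption, since it is not spelled out in this example; vectors are Python lists with index i holding the coefficient of β^{2^i}.

```python
# Multiplication matrix of (3.11) for the type 4 GNB over GF(2^7).
M = [
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
]
m = 7


def rao(b):
    """Recombined sums as in (3.15): entry i XORs the b_j selected by row i of M."""
    return [sum(M[i][j] & b[j] for j in range(m)) & 1 for i in range(m)]


def c0(a, b):
    """Coordinate c_0 as in (3.12): AND each a_i with its recombined sum, then XOR."""
    return sum(ai & si for ai, si in zip(a, rao(b))) & 1


def gnb_mul(a, b):
    """All coordinates: c_l is c_0 applied to both operands rotated left by l (assumed)."""
    return [c0(a[l:] + a[:l], b[l:] + b[:l]) for l in range(m)]
```

Two quick sanity checks follow from normal-basis arithmetic: multiplying by the all-ones vector (the field element 1) returns the other operand unchanged, and multiplying an element by itself yields a cyclic shift of its coordinates, since squaring is a cyclic shift in a normal basis.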
Table 3.1: COMPARISON OF THE AREA AND TIME COMPLEXITIES FOR VARIOUS DL MULTIPLIERS OVER GF(2^m)




(T − 1) + dm                               dm   3m            ⌈m/d⌉    TW1
GNB [46], [47]   ≤ (d(m−1)/2)(T − 1) + dm   dm   3m            ⌈m/d⌉    TW1
PB [28]          dm + 2d                    dm   4m + 3d + 1   2⌈m/d⌉   TW2



















































e + 1 TW3
CPD: Critical-path delay.
TW1 = TA + (⌈log2 T⌉ + ⌈log2(d + 1)⌉)TX,
TW2 = TA + (⌈log2 T⌉ + ⌈log2 d⌉)TX,
TW3 = TA + (⌈log2(d + 1)⌉)TX,
TA and TX are the critical-path delays of an AND gate and an XOR gate, respectively.
We can also obtain the value for V
(1)




0 · · ·B
(6)∗
0 . Then, following the steps
in Algorithm 3.4, we can finally get the partial product Ci, which is also the final product
C in this case.
3.4 Area-Time Complexities
3.4.1 Theoretical Comparison
The area-time complexities of the proposed design and the existing ones of [26, 43-44, 46-48]
in terms of logic-gate count, register count, critical-path delay, and latency are shown
in Table 3.1. According to the table, the latency of the proposed DL systolic multiplier
is (2⌈√(m/d)⌉ + 1), while [26] requires 2⌈m/d⌉ and a number of existing GNB
multipliers require a latency of ⌈m/d⌉. In addition, the multiplier in [43] has a latency of
(d+mT/d(mT/d+ 1)) clock cycles, which is more than the proposed one. We have also
chosen the field GF(2^409) for a detailed comparison of the latencies of various DL
GNB multipliers. As shown in Table 3.2, the latency of our proposed structure is only
1 clock cycle more than that of [48], and it is much
lower than that of the other existing multipliers. As the digit size increases from 2 to 14, our
proposed multiplier has between roughly 2.3 and 13.2 times lower latency than those in [26]
and [44].
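The latency expressions above can be evaluated for GF(2^409) with integer arithmetic. The sketch below is illustrative only: the function names are hypothetical, the formula used for [48] (2⌈√(m/d)⌉) is inferred from the statement that the proposed design needs one more cycle than [48], and ⌈√(m/d)⌉ is computed as the smallest integer s with s²·d ≥ m to avoid floating-point rounding.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def ceil_sqrt_ratio(m, d):
    """Smallest integer s with s*s >= m/d, i.e. the ceiling of sqrt(m/d)."""
    s = 0
    while s * s * d < m:
        s += 1
    return s


def latency_serial(m, d):
    """Latency ceil(m/d) of the conventional DL GNB multipliers."""
    return ceil_div(m, d)


def latency_double(m, d):
    """Latency 2*ceil(m/d), as quoted for [26]."""
    return 2 * ceil_div(m, d)


def latency_existing_systolic(m, d):
    """Latency 2*ceil(sqrt(m/d)); inferred for [48], not stated explicitly."""
    return 2 * ceil_sqrt_ratio(m, d)


def latency_proposed(m, d):
    """Latency 2*ceil(sqrt(m/d)) + 1 of the proposed multiplier."""
    return 2 * ceil_sqrt_ratio(m, d) + 1


for d in (2, 4, 6, 8, 10, 12, 14):
    print(d, latency_double(409, d), latency_serial(409, d),
          latency_existing_systolic(409, d), latency_proposed(409, d))
```

For d = 2 this gives 410, 205, 30, and 31 cycles, and for d = 14 it gives 60, 30, 12, and 13 cycles, so the ratio against the proposed design ranges from about 13.2 (410/31) down to about 2.3 (30/13).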
In terms of register-complexity and critical-path delay, one can see from Table 3.1 that
Table 3.2: COMPARISON OF LATENCY OVER GF(2^409)
Digit-size (d)   2     4     6     8     10    12    14
[26]             410   206   138   104   82    70    60
[44]             205   103   69    52    41    35    30
[48]             30    22    18    16    14    12    12
Proposed         31    23    19    17    15    13    13
Table 3.3: COMPARISON OF CRITICAL-PATH DELAY WITH DIFFERENT DIGIT-SIZE FOR VARIOUS DL MULTIPLIERS OVER GF(2^409)
Design     Digit-size   CPD
Proposed   7            TA + 3TX
           13           TA + 4TX
           19           TA + 5TX
           33           TA + 6TX
[48]       7            TA + 5TX
           13           TA + 6TX
           19           TA + 7TX
           33           TA + 8TX
[26]       7            TA + 5TX
           13           TA + 6TX
           19           TA + 7TX
           33           TA + 8TX
the DL-PIPO multiplier in [48] requires (1 + 3⌈√(m/d)⌉)m registers and its critical-path
delay is TW1 = TA + (⌈log2 T⌉ + ⌈log2(d + 1)⌉)TX, while the proposed architecture requires
fewer than (3 + 2⌈√(m/d)⌉)m registers, and its critical-path delay is TW3 = TA + (⌈log2(d + 1)⌉)TX,
which is significantly less than that of [48]. We have also chosen the field
GF(2^409) for a detailed comparison of our design with the existing ones of [26] and [48], as
shown in Table 3.3. It is seen that the critical-path delay of our proposed structure is
significantly less than that of the existing ones.
3.4.2 ASIC Implementation
We have also synthesized our proposed and the existing designs to obtain the area-time
complexity. We have used Synopsys Design Compiler based on Taiwan Semiconductor
Manufacturing Company (TSMC) 65-nm standard-cell library. The results in terms of
area, critical-path delay, and latency-cycles (including latency time) of our proposed
systolic structure are shown in Table 3.4 with different field sizes (m = 163, 283, and
Table 3.4: ASIC SYNTHESIS RESULTS OF THE PROPOSED SYSTOLIC MULTIPLIER
m, T     Area [µm^2]   CPD [ns]   d    Latency   Time [ns]
163, 4   14,254        0.87       9    11        9.57
         16,146        0.94       11   9         8.46
         38,894        1.96       28   7         13.72
283, 6   109,512       1.13       16   11        12.43
         170,811       1.31       21   9         11.79
         310,254       2.36       36   7         16.52
409, 4   202,972       1.04       13   13        13.52
         260,882       1.27       19   11        13.97
         618,238       2.67       41   9         24.03
Table 3.5: ASIC SYNTHESIS RESULTS FOR THE EXISTING AND THE PROPOSED DL MULTIPLIERS OVER GF(2^409)
Design       Digit-size   Latency   Area [µm^2]   CPD [ns]   Latency time [ns]   Area-delay product (ADP) [pm^2 s]
Proposed     7            17        123,742       0.91       15.47               1,914
             13           13        202,972       1.04       13.52               2,744
             19           11        260,882       1.27       13.97               3,644
[48]         7            16        128,201       1.28       20.4                2,615
             13           12        207,222       1.38       16.6                3,439
             19           10        264,327       1.63       16.3                3,782
[46], [47]   13           32        71,760        1.44       46.1                3,308
             18           23        115,806       1.55       35.6                4,122
             23           18        147,475       1.68       29.7                4,380
[26]         13           64        58,917        1.23       78.7                4,637
409) and digit sizes. As shown in Table 3.4, our proposed design performs better when
the digit size becomes smaller.
We have also compared our systolic multiplier with the existing DL multipliers for
different digit sizes with the same field size m = 409, as shown in Table 3.5. One
can see that for the same digit size d = 13, our proposed multiplier has a shorter critical-
path delay and a smaller latency time than the other multipliers. Moreover, the
area-delay product (ADP) of the proposed one is the smallest among all the multipliers
in Table 3.5. For the field size m = 409 (digit size d = 13), our proposed systolic
multiplier has a latency time of 13.52 ns and an area of 202,972 µm^2, while the multiplier in
[48] (with the same digit size) needs 16.6 ns and 207,222 µm^2 ([46] and [47] have even
larger results). For digit size d = 7, the ADP of the proposed structure is 26.7% lower
than that of [48]. For d = 13, our design is 18.6%, 70.7%, and 82.8% faster (in total
time) than those in [48], [46] ([47]), and [26], respectively. The ADP of the proposed design





4.1 Low complexity hybrid-size systolic polynomial
multipliers
An efficient new hybrid structure of pentanomials and trinomials for low-complexity
implementation of finite field multipliers over GF(2^m) has been proposed in Chapter 2.
Based on the similarities of the computation matrices of operand A of the pentanomial-
based and trinomial-based multipliers, we separate the common bits of operand A of both
multipliers to form a new matrix named the AOP-like matrix, which reduces the register-
complexity. Moreover, a novel low register-complexity algorithm for hybrid systolic finite
field polynomial multipliers has been presented. We have also introduced structures for
low-latency and digit-parallel implementation. Both the theoretical analysis and the
FPGA implementation results have confirmed the higher efficiency of the proposed
architectures compared with the competing ones.
4.2 Low complexity digit-level systolic Gaussian normal basis multipliers
A low critical-path delay, low register-complexity DL systolic GNB multiplier over GF(2^m)
has been proposed in the third chapter of this thesis. We have proposed a novel
multiplication algorithm to reduce the critical-path delay and the register-complexity. Moreover,
both theoretical and ASIC implementation results are presented for comparison. Based
on our presented results, our proposed design has a smaller critical-path delay and
lower register-complexity when compared with the existing DL systolic multipliers. The
proposed DL multiplier, thus, can be extended and employed in sensitive usage models




Q. Shao, Z. Hu, S. Chen, P. Chen, R. Azarderakhsh, M. Mozaffari-Kermani, and J.
Xie, “Low Critical-Path Low-Complexity Digit-Level Systolic Gaussian Normal Basis




[1] I. Blake, G. Seroussi, and N. P. Smart, Elliptic Curves in Cryptography, ser. London
Mathematical Society Lecture Note Series. Cambridge, U.K.: Cambridge Univ. Press,
1999.
[2] N. R. Murthy and M. N. S. Swamy, “Cryptographic applications of Brahmagupta-
Bhaskara equation,” IEEE Trans. Circuits and Systems-I, vol. 53, no. 7, pp. 1565-1571,
2006.
[3] M. Sun, L. E. Burke, Z.-H. Mao, Y. Chen, H.-C. Chen, Y. Bai, Y. Li, C. Li, and W. Jia,
“eButton: A wearable computer for health monitoring and personal assistance,” in Proc.
Design Automation Conference, pp. 1-6, 2014.
[4] D. Schinianakis and T. Stouraitis, “Multifunction residue architectures for cryptog-
raphy,” IEEE Trans. Circuits and Systems-I, vol. 61, no. 4, pp. 1156-1169, 2014.
[5] National Institute of Standards and Technology, “FIPS 186-2, Digital Signature
Standard (DSS), Federal Information Processing Standards Publication 186-2,” 2000.
[6] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation.
New York: Wiley, 1999.
[7] C.-Y. Lee, J.-S. Horng, I.-C. Jou, and E.-H. Lu, “Low-complexity bit-parallel systolic
Montgomery multipliers for special classes of GF(2^m),” IEEE Trans. Comput., vol. 54,
no. 9, pp. 1061-1070, 2005.
[8] P. K. Meher, “Systolic and super-systolic multipliers for finite field GF (2m) based
on irreducible trinomials,” IEEE Trans. Circuits and Systems-I, vol. 55, no. 4, pp.
1031-1040, 2008.
[9] J. Xie, P. K. Meher, and J. He, “Low-latency area-delay-efficient systolic multiplier
44
over GF (2m) for a wider class of trinomials using parallel register sharing,” in Proc. IEEE
Int. Sym. Circuits and Systems, pp. 89-92, 2012.
[10] S. B.-Sarmadi and M. Farmani, “High-throughput low-complexity systolic Mont-
gomery multiplication over GF (2m) based on trinomials,” IEEE Trans. Circuits and
Systems-II, vol. 62, no. 4, pp. 377-381, 2015.
[11] J. Xie, J. He, and P. K. Meher, “Low latency systolic Montgomery multiplier for
finite field GF (2m) based on pentanomials,” IEEE Trans. VLSI Systems, vol. 21, no. 2,
pp. 385-389, 2013.
[12] P. Montgomery, “Modular multiplication without trial division,” Math, Computa-
tion, vol. 44, no. 170, pp. 519-521, 1985.
[13] R. Azarderakhsh, K. Jarvinen, and V. Dimitrov, “Fast inversion in GF (2m) with
normal basis using hybrid-double multipliers,” IEEE Trans. Comput., vol. 63, no. 4, pp.
1041-1047, 2014.
[14] R. Azarderakhsh, D. Jao, and H. Lee, “Space complexity reduction algorithms for
Gaussian normal basis multiplication,” IEEE Trans. Information Theory, vol. 61, no. 5,
pp. 2357-2369, 2015.
[15] R. Azarderakhsh and M. Mozaffari-Kermani, “High-performance two-dimensional fi-
nite field multiplication and exponentiation for cryptographic applications,” IEEE Trans.
Computer Aided Design Integrated Circuits Systems, vol. 34, no. 10, pp. 1-8, 2015.
[16] C-Y. Lee and P. K. Meher, “Area-efficient subquadratic space-complexity digit-serial
multiplier for type-II optimal normal basis of GF (2m) using symmetric TMVP and block
recombination techniques,” IEEE Trans. Circuits and Systems-I, vol. 62, no. 12, pp.
2846-2855, 2015.
[17] S. Talapatra, H. Rahaman, and S. K. Saha, “Unified digit serial systolic Montgomery
multiplication architecture for special classes of polynomials over GF (2m),” in Proc.
Conf. Digital System Design: Architectures, Methods and Tools, pp. 427-432, 2010.
[18] S. Fenn and M. Parker, “Bit-serial multiplication in GF (2m) using all-one polyno-
mials,” IEE proc. Com. Digit. Tech., vol. 144, no. 6, pp. 391-393, 1997.
[19] K.-Y. Chang, D. Hong, and H.S. Cho, “Low complexity bit-parallel multiplier for
GF (2m) defined by all-one polynomials using redundant representation,” IEEE Trans.
Comput., vol. 54, no. 12, pp. 1628-1629, 2005.
[20] H.-S. Kim and S.-W. Lee, “LFSR multipliers over GF (2m) defined by all-one poly-
45
nomial,” Integr., the VLSI J., vol. 40, no. 4, pp. 571-578, 2007.
[21] P. K. Meher and C.Y. Lee, “An optimized design of serial-parallel finite field mul-
tiplier for GF (2m) based on all-one polynomials,” in Proc. ASP-DAC, pp. 210-215,
2009.
[22] M.-Sandoval, M. F.-Uribe, and C. Kitsos, “Bit-serial and digit-serial GF (2m) Mont-
gomery multipliers using linear feedback shift registers,” IET Comput. & Digital Tech.,
vol. 5, no. 2, pp. 86-94, 2011.
[23] C.-Y. Lee, E.-H. Lu, and J.-Y. Lee, “Bit-parallel systolic multipliers for GF (2m)
fields defined by all-one and equally spaced polynomials,” IEEE Trans. Comput., vol.
50, no. 6, pp. 385-393, 2001.
[24] J. Xie, P. K. Meher, and J. He, “Low-complexity multiplier for GF (2m) based on
all one polynomials,” IEEE Trans. VLSI Systems, vol. 21, no. 1, pp. 168-172, 2013.
[25] Y.-R. Ting, E.-H. Lu, and Y.-C. Lu, “Ringed bit-parallel systolic multipliers over a
class of fields GF (2m),” Integr., the VLSI J., vol. 38, no. 4, pp. 571-578, 2005.
[26] S. Talapatra, H. Rahaman, and J. Mathew, “Low complexity digit serial systolic
Montgomery multipliers for special class of GF (2m),” IEEE Trans. VLSI Sys., vol. 18,
no. 5, pp. 847-852, 2010.
[27] T. Itoh and S. Tsujii, “Structure of parallel multipliers for a class of fieldsGF (2m),” In-
formation and Computation, vol. 83, no. 1, pp. 21-40, 1989.
[28] J. L. Imana, R. Hermida, and F. Tirado, “Low complexity bit-parallel multipliers
based on a class of irreducible pentanomials,” IEEE Trans. VLSI Sys., vol. 14, no. 12,
pp. 1388-1393, 2006.
[29] J. Xie, P. Meher, and Z. Mao, “Low-latency high-throughput systolic multipliers over
GF (2m) for NIST recommended pentanomials,” IEEE Trans. Circuits and Sys.-Regular
Papers, vol. 62, no. 3, pp. 881-890, 2015.
[30] P. Chen, S. Basha, M. Mozaffari-Kermani, R. Azarderakhsh and J. Xie, “FPGA
realization of low register systolic all-one-polynomial multipliers over GF (2m) and their
applications in trinomial multipliers,” IEEE Trans. VLSI Sys., vol. PP, no. 3, pp. 1-10,
2016.
[31] R. R. Farashahi and M. Joye, “Efficient arithmetic on Hessian curves,” in Proc. Int.
Conf. Pract. Theory Public Key Cryptograph., pp. 243-260, 2010.
[32] Ç. K. Koç and B. Sunar, “An efficient optimal normal basis Type II multiplier over
46
GF (2m),” IEEE Trans. Comput., vol. 50, no. 1, pp. 83-87, 2001.
[33] J. Xie, P. K. Meher, and Z. Mao, “High-throughput digit-level systolic multiplier
over GF (2m) based on irreducible trinomials,” IEEE Trans. Circ. and Syst.-II, vol. 62,
no. 5, pp. 481-485, 2015.
[34] J. Adikari, V. S. Dimitrov, and R. J. Cintra, “A new algorithm for double scalar
multiplication over Koblitz curves,” in Proc. IEEE Int. Symp. Circuits Syst., pp. 709-
712, 2011.
[35] K. Järvinen and J. Skyttä, “On parallelization of high-speed processors for elliptic
curve cryptograph,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9,
pp. 1162-1175, 2008.
[36] C. Lee, P. Meher, and J. Patra, “Concurrent error detection in bit-serial normal basis
multiplication over GF (2m) using multiple parity prediction schemes,” IEEE Trans. Very
Large Scale Integr. (VLSI) Syst., vol. 18, no. 8, pp. 1234-1238, 2010.
[37] W. Geiselmann and D. Gollmann, “Symmetry and duality in normal basis multiplica-
tion,” in Proc. Sixth Symp. Applied Algebra, Algebraic Algorithms and Error-Correcting
Codes (AAECC), pp. 230-238, 1989.
[38] R. Azarderakhsh, and A. Reyhani-Masoleh, “Efficient FPGA implementation of
point multiplication on binary Edwards and generalized Hessian curves using Gaussian
normal basis,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 8, pp.
1453-1466, 2012.
[39] P. K. Meher, “Systolic and non-systolic scalable modular designs of finite field
multipliers for Reed-Solomon codec,” IEEE Trans. Very Large Scale Integr. (VLSI)
Syst., vol. 17, no. 6, pp. 747-757, 2009.
[40] S. Kwon, “A low complexity and a low latency bit parallel systolic multiplier over
GF (2m) using an optimal normal basis of type II,” in Proc. 16th IEEE Symp. Comput.
Arithmetic, pp. 196-202, 2003.
[41] J. Fan, D. V. Batina, T. Guneysu, C. Paar, and I. Verbauwhede, “Breaking elliptic
curves cryptosystems using reconfigurable hardware,” in Proc. Int. Conf. Field Program.
Logic Appl., pp. 133-138, 2010.
[42] A. Wang and S. Fan, “Efficient montgomery-based semi-systolic multiplier for even-
type GNB of GF (2m),” IEEE Trans. Comput., vol. 61, no. 3, pp. 415-419, 2012.
[43] C. Lee and C. W. Chiou, “Scalable Gaussian normal basis multipliers over GF (2m)
47
using Hankel matrix-vector representation,” J. Signal Process. Syst., vol. 69, no. 2, pp.
197-211, 2012.
[44] R. Azarderakhsh and A. Reyhani-Masoleh, “A modified low complexity digit-level
Gaussian normal baasis multiplier,” in Proc. 3rd Int. Workshop Arithmetic Finite Fields,
pp. 25-40, 2010.
[45] Digital Signature Standard, National Institute of Standards and Technology, Gaithers-
burg, MD, USA, Jan. 2000.
[46] A. Reyhani-Masoleh, “Efficient algorithms and architectures for field multiplication
using Gaussian normal bases,” IEEE Trans. Comput., vol. 55, no. 1, pp. 34-47, 2006.
[47] S. Kwon, K. Gaj, C. H. Kim, and C. P. Hong, “Efficient linear array for multiplica-
tion in GF (2m) using a normal basis for elliptic curve cryptography,” in Proc. 6th Int.
Workshop Cryptograph. Hardw. Embedded Syst., pp. 76-91, 2014.
[48] R. Azarderakhsh, M. Mozaffari Kermani, S. Bayat-Sarmadi, and C.-Y. Lee, “Systolic
Gaussian normal basis multiplier architectures suitable for High-performance application-
s,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 9, pp. 1969-1972,
2015.
[49] IEEE Std 1363-2000, “IEEE Standard Specification for Public-Key Cryptogra-
phy,” Jan. 2010.
[50] US Dept. of Commerce/NIST, “National Institute of Standards and Technolo-
gy,” Digital Signature Standard, FIPS Publications 186-2, Jan. 2010.
[51] J. Massey and J. Omura, Computational Method and Apparatus for Finite Arith-
metic, US Patent 4587627, Washington, D.C., 1986.
[52] T. Beth and D. Gollmann, “Algorithm engineering for public key algorithms,” IEEE
J. Selected Areas in Communications, vol. 7, no. 4, pp. 458-466, 1989.
[53] G. B. Agnew, R. C. Mullin, I. M. Onyszchuk, and S. A. Vanstone, “An implemen-
tation for a fast public-key cryptosystem,” J. Cryptology, vol. 3, no. 2, pp. 63-79,
1991.
[54] A. Reyhani-Masoleh and M. A. Hasan, “Efficient digit-serial normal basis multipliers
over binary extension fields,” ACM Trans. Embedded Computing Systoms, vol. 3, no. 3,
pp. 575-592, 2004.
[55] A. H. Namin, H. Wu, and M. Ahmadi, “A word-level finite field multiplier using
normal basis,” IEEE Trans. Comput., vol. 60, no. 6, pp. 890-895, 2006.
48
[56] C. Lee and P. Chang, “Digit-serial Gaussian normal basis multiplier over GF (2m)
using Toeplitz Matrix-approach,” in Proc. Int. Conf. Computational Intelligence and
Software Eng. (CiSE), pp. 1-4, 2009.
49
