LFSR-based bit-serial $GF(2^m)$ multipliers using irreducible trinomials by Imaña Pascual, José Luis
0018-9340 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TC.2020.2980259, IEEE
Transactions on Computers
IEEE TRANSACTIONS ON COMPUTERS 1




Abstract: In this paper, a new architecture of bit-serial polynomial basis (PB) multipliers over the binary extension field $GF(2^m)$ generated by irreducible trinomials is presented. Bit-serial $GF(2^m)$ PB multiplication offers a performance/area trade-off that is very useful in resource-constrained applications. The proposed architecture is based on an LFSR (Linear-Feedback Shift Register) and can perform a multiplication in $m$ clock cycles with a constant propagation delay of $T_A + T_X$. These values match the best time results found in the literature for bit-serial PB multipliers, with a slight reduction of the space complexity. Furthermore, the proposed architecture can perform the multiplication of two operands for $t$ different finite fields $GF(2^m)$ generated by $t$ irreducible trinomials simultaneously in $m$ clock cycles with the inclusion of $t(m-1)$ flip-flops and $tm$ XOR gates.




• J.L. Imaña is with the Department of Computer Architecture and Automation, Faculty of Physics, Complutense University, 28040 Madrid, Spain. E-mail: jluimana@ucm.es

Binary extension fields $GF(2^m)$ play a fundamental role in several important applications, such as cryptography, coding theory and digital signal processing. These applications require efficient hardware implementations of $GF(2^m)$ arithmetic operations, especially multiplication. This operation is considered the most important and complex one because exponentiation, inversion and division can be performed by means of repeated multiplications. Different representation bases can be used to perform $GF(2^m)$ arithmetic operations, although the polynomial basis (PB) is the most widely used. Efficient methods and architectures for $GF(2^m)$ multiplication have been proposed for PB, where the complexity depends on the irreducible polynomial that generates the finite field. Special irreducible polynomials such as trinomials or pentanomials can provide important optimizations in terms of both area and speed. Different architectures for $GF(2^m)$ PB multiplication have also been proposed, such as bit-parallel, bit-serial and digit-serial multipliers. Bit-parallel multipliers [1] have a high area complexity, but can perform the multiplication in only one clock cycle. Bit-serial multipliers [2], [3] require little area, but need $m$ clock cycles. Digit-serial architectures [4] achieve a speed-area trade-off by processing several coefficients at the same time. Polynomial basis multiplication requires a polynomial multiplication followed by a reduction modulo the irreducible polynomial $f(y)$ selected for the field [5]. Mastrovito proposed a method that combines these two steps [1], [6]. A new PB multiplication approach applied to five types of irreducible trinomials was proposed in [7], where functions $S_i$ and $T_i$ were obtained from the decomposition of a product matrix. These functions are given by sums of product terms, and their addition is used for the computation of the product of two $GF(2^m)$ operands.
In this paper, a new bit-serial polynomial basis multiplier over the binary extension field $GF(2^m)$ generated by irreducible trinomials is presented. Bit-serial PB multiplication offers a performance/area trade-off that is very useful in resource-constrained applications such as smart cards. The proposed architecture is based on an LFSR (Linear-Feedback Shift Register) and can perform a multiplication in $m$ clock cycles with a constant propagation delay of $T_A + T_X$ (where $T_A$ and $T_X$ represent the delay of 2-input AND and XOR gates, respectively). These values match the best time results found in the literature for bit-serial PB multipliers, with a slight reduction of the space complexity. A characteristic of the proposed architecture is that it can perform the multiplication of two operands for $t$ different finite fields $GF(2^m)$ generated by $t$ irreducible trinomials simultaneously in $m$ clock cycles with the inclusion of $t(m-1)$ flip-flops and $tm$ XOR gates. Based on [7], a new general multiplication algorithm over irreducible trinomials $f(y) = y^m + y^n + 1$, with $1 \le n \le m-1$, is also proposed.
The paper is organized as follows. Section 2 provides notation and mathematical background. Irreducible trinomials are introduced in Section 3, where a new general algorithm for multiplication is also given. Section 4 describes the new LFSR-based multiplier architecture, gives an example of multiplication and analyses the theoretical complexity. Comparisons with other similar multipliers and hardware implementation results on Xilinx FPGAs are given in Section 5. Finally, conclusions are given in Section 6.
2 BACKGROUND
Any element $A \in GF(2^m)$ can be represented in the polynomial basis as $A = \sum_{i=0}^{m-1} a_i x^i$. Two-step classic PB multiplication in $GF(2^m)$ requires a polynomial multiplication followed by a reduction modulo the irreducible polynomial. Mastrovito proposed an efficient bit-parallel multiplication method in which a product matrix combines these two steps. In [7], a new $GF(2^m)$ PB multiplication method was given. In order to compute the product $C = A \cdot B$, this approach defined functions $S_i$ ($1 \le i \le m$) and $T_i$ ($0 \le i \le m-2$) given by the addition of terms $x_k = (a_kb_k)$
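and $z_i^j = (a_ib_j + a_jb_i)$, with $a_i, b_i \in GF(2)$. These functions are expressed in [7] using the indices $p = \lfloor i/2 \rfloor$ and $q = \lceil m/2 \rceil + \lfloor i/2 \rfloor$: the term $x_p = a_pb_p$ only appears for $i$ odd, and $x_q$ only appears for ($m$ and $i$ even) or for ($m$ and $i$ odd). In this case, $r = q$. Otherwise, i.e., for ($m$ even and $i$ odd) or for ($m$ odd and $i$ even), the term $x_q$ does not appear and $r = \lceil m/2 \rceil + \lceil i/2 \rceil$. For example, the terms $S_i$ and $T_i$ for $GF(2^6)$ are: $S_1 = x_0 = a_0b_0$, $S_2 = z_0^1 = (a_0b_1 + a_1b_0)$, $S_3 = x_1 + z_0^2 = a_1b_1 + (a_0b_2 + a_2b_0)$, $S_4 = z_0^3 + z_1^2 = (a_0b_3 + a_3b_0) + (a_1b_2 + a_2b_1)$, $S_5 = x_2 + z_0^4 + z_1^3 = a_2b_2 + (a_0b_4 + a_4b_0) + (a_1b_3 + a_3b_1)$, $S_6 = z_0^5 + z_1^4 + z_2^3 = (a_0b_5 + a_5b_0) + (a_1b_4 + a_4b_1) + (a_2b_3 + a_3b_2)$, $T_0 = x_3 + z_1^5 + z_2^4 = a_3b_3 + (a_1b_5 + a_5b_1) + (a_2b_4 + a_4b_2)$, $T_1 = z_2^5 + z_3^4 = (a_2b_5 + a_5b_2) + (a_3b_4 + a_4b_3)$, $T_2 = x_4 + z_3^5 = a_4b_4 + (a_3b_5 + a_5b_3)$, $T_3 = z_4^5 = (a_4b_5 + a_5b_4)$, $T_4 = x_5 = a_5b_5$. The product $C = A \cdot B$ can then be computed as the addition of these terms.

TABLE 1
Coordinates $c_i$ of the product for the trinomials $f(y) = y^6 + y^n + 1$, with $1 \le n \le 5$.

| | Common | $n = 1$ | $n = 2$ | $n = 3$ | $n = 4$ | $n = 5$ |
|---|---|---|---|---|---|---|
| $c_0$ | $S_1$, $T_0$ | | $T_4$ | $T_3$ | $T_2$, $T_4$ | $T_1$, $T_2$, $T_3$, $T_4$ |
| $c_1$ | $S_2$, $T_1$ | $T_0$ | | $T_4$ | $T_3$ | $T_2$, $T_3$, $T_4$ |
| $c_2$ | $S_3$, $T_2$ | $T_1$ | $T_0$, $T_4$ | | $T_4$ | $T_3$, $T_4$ |
| $c_3$ | $S_4$, $T_3$ | $T_2$ | $T_1$ | $T_0$, $T_3$ | | $T_4$ |
| $c_4$ | $S_5$, $T_4$ | $T_3$ | $T_2$ | $T_1$, $T_4$ | $T_0$, $T_2$, $T_4$ | |
| $c_5$ | $S_6$ | $T_4$ | $T_3$ | $T_2$ | $T_1$, $T_3$ | $T_0$, $T_1$, $T_2$, $T_3$, $T_4$ |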
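As a quick numerical check of the $GF(2^6)$ expressions, the sketch below computes the terms directly. It assumes, consistently with the listed examples, that $S_i$ is the coefficient of $x^{i-1}$ and $T_i$ the coefficient of $x^{m+i}$ in the unreduced polynomial product $A \cdot B$. The helper name `s_t_terms` is hypothetical and does not come from [7].

```python
def s_t_terms(a, b):
    """Split the unreduced product of two GF(2) polynomials into S and T terms.

    a, b: lists of m bits, where index i is the coefficient of x^i.
    Returns (S, T) with S[i] = S_{i+1} for i = 0..m-1 and T[i] = T_i for i = 0..m-2.
    Assumption: S and T are the low and high halves of the unreduced product.
    """
    m = len(a)
    d = [0] * (2 * m - 1)              # coefficients of the degree-(2m-2) product
    for i in range(m):
        for j in range(m):
            d[i + j] ^= a[i] & b[j]    # addition in GF(2) is XOR
    return d[:m], d[m:]


a = [1, 0, 1, 1, 0, 1]
b = [0, 1, 1, 0, 1, 1]
S, T = s_t_terms(a, b)

# T_0 = a3b3 + (a1b5 + a5b1) + (a2b4 + a4b2), as given in the text
t0 = (a[3] & b[3]) ^ (a[1] & b[5]) ^ (a[5] & b[1]) ^ (a[2] & b[4]) ^ (a[4] & b[2])
print(T[0] == t0, T[4] == (a[5] & b[5]), S[0] == (a[0] & b[0]))
```

Under this assumption, $c_i = S_{i+1} + T_i$ is simply the product reduced modulo $x^m + 1$.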
An LFSR is a shift register whose feedback value is a linear function of its previous state. It is used in many important applications, such as cryptography, pseudo-random number generation or test pattern generation. The product $P = A \cdot x \bmod f(x)$, where $x$ is a root of the irreducible polynomial $f(y) = y^m + f_{m-1}y^{m-1} + \ldots + f_1y + 1$ and $A \in GF(2^m)$ generated by $f(y)$, can also be performed using an $m$-tap LFSR. The product $P = A \cdot x = (a_{m-1}x^{m-1} + \ldots + a_1x + a_0)x = a_{m-1}x^m + \ldots + a_1x^2 + a_0x$ can be computed using the fact that $x$ is a root of $f(y)$, so $x^m = f_{m-1}x^{m-1} + \ldots + f_1x + 1$. Substituting this expression of $x^m$ into the expression for $P = A \cdot x$ gives $P = p_{m-1}x^{m-1} + \ldots + p_1x + p_0$, with $p_0 = a_{m-1}$ and $p_i = a_{i-1} + a_{m-1}f_i$ for $i = 1, \ldots, m-1$ [5]. This operation can be implemented with an $m$-tap LFSR, where the registers are initially loaded with the coordinates of the element $A$, $(a_0, \ldots, a_{m-1})$, and the coefficients $f_i$, $i = 1, \ldots, m-1$, of the irreducible polynomial are connected to AND gates together with the output of the last 1-bit register of the LFSR. After one clock cycle, the register contents will be the coefficients $(p_0, \ldots, p_{m-1})$ of $P = A \cdot x \bmod f(x)$ [5].
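The update $p_0 = a_{m-1}$, $p_i = a_{i-1} + a_{m-1}f_i$ corresponds to a single shift of the register with feedback. A minimal sketch of that step, assuming bits are stored in lists indexed by the power of $x$ (the function name `lfsr_mul_x` is illustrative):

```python
def lfsr_mul_x(a, f):
    """Return the coordinates of A*x mod f(x) over GF(2).

    a: list of m bits, a[i] is the coefficient of x^i.
    f: list of m bits, f[i] is the coefficient of y^i in f(y), with f[0] = 1;
       the leading term y^m is implicit.
    """
    m = len(a)
    msb = a[m - 1]                          # output of the last 1-bit register
    p = [msb]                               # p_0 = a_{m-1}
    for i in range(1, m):
        p.append(a[i - 1] ^ (msb & f[i]))   # p_i = a_{i-1} + a_{m-1} f_i
    return p


# f(y) = y^6 + y + 1, given as coefficients f_0..f_5
f = [1, 1, 0, 0, 0, 0]
a = [0, 0, 0, 0, 0, 1]          # A = x^5
print(lfsr_mul_x(a, f))         # x^6 = x + 1, i.e. [1, 1, 0, 0, 0, 0]
```

Applying the step $k$ times yields $A \cdot x^k \bmod f(x)$.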
Using an LFSR, a new multiplier was proposed in [11]. This multiplier works in the truncated polynomial ring modulo $q$, $Z_q[x]/(x^m - 1)$, where $Z_q[x]$ is the ring of polynomials with integer coefficients modulo $q$ and all polynomials are taken modulo $(x^m - 1)$. The product of $A = \sum_{i=0}^{m-1} a_ix^i$ and $B = \sum_{i=0}^{m-1} b_ix^i$ (with $a_i, b_i \in Z_q$) is computed as shown in Figure 1, where $\oplus$ refers to an adder, $\otimes$ refers to a multiplier and a box stands for a register. In Figure 1, the registers are initially loaded with 0. The coefficients of $A$, $(a_0, \ldots, a_{m-1})$, are connected to the inputs of each multiplier in parallel, while the coefficients of the operand $B$, $(b_0, \ldots, b_{m-1})$, are input to all the multipliers serially, starting with the least significant coefficient $b_0$. After $m$ clock cycles, the registers will store the coefficients of the product $C = A \times B \bmod (x^m - 1)$ [11].
3 IRREDUCIBLE TRINOMIALS
For hardware implementation of $GF(2^m)$ multiplication, low Hamming weight irreducible polynomials, such as trinomials and pentanomials [1], [9], [10], are normally used. Irreducible trinomials [8] $f(y) = y^m + y^n + 1$ are important because they are abundant and, for a given $m$, an irreducible trinomial can be found when irreducible pentanomials do not exist.

Fig. 1. LFSR for the implementation of $A \times B \bmod (x^m - 1)$ [11]. Registers $c_{m-1}, c_{m-2}, \ldots, c_0$; parallel inputs $a_{m-1}, a_{m-2}, \ldots, a_1, a_0$; serial input $b_{m-1}, b_{m-2}, \ldots, b_1, b_0$.

PB multiplication for irreducible trinomials was studied in [7], where different expressions for the coefficients of the product were given in terms of $S_i$ and $T_i$ functions for specific trinomials $f(y) = y^m + y^n + 1$ with $n = m-1$, $m/2$, $(m+1)/2$, $(m-1)/2$ and $n = 1$. For example, the coefficients of the product for $GF(2^6)$ using trinomials $f(y) = y^6 + y^n + 1$, with $1 \le n \le 5$, are given in Table 1, where the expression of a coefficient $c_i$ is computed by the addition (XOR) of the $S_i$ and $T_i$ terms given in its row. The first column in Table 1 includes common terms that appear in the coefficients for every trinomial $f(y) = y^6 + y^n + 1$, with $1 \le n \le 5$. The columns $n = 1$, $n = 2$, $n = 3$, $n = 4$ and $n = 5$ include specific terms for the coefficients of the different trinomials. For example, the coefficient $c_5$ for the trinomial $f(y) = y^6 + y + 1$ ($n = 1$) is $c_5 = S_6 + T_4$, for $n = 2$ is $c_5 = S_6 + T_3$, for $n = 3$ is $c_5 = S_6 + T_2$, for $n = 4$ is $c_5 = S_6 + T_1 + T_3$, and for $n = 5$ is $c_5 = S_6 + T_0 + T_1 + T_2 + T_3 + T_4$. The method presented in [7] was based on the introduction of a product matrix that can be decomposed into a sum of matrices depending on the irreducible polynomial selected for the field. For example, for the trinomial $f(y) = y^6 + y^3 + 1$, the product $C = A \cdot B$ can be computed as $c = M \cdot b$, where $c = (c_0, \ldots, c_5)^T$ and $b = (b_0, \ldots, b_5)^T$ are the coefficients of $C$ and $B$, respectively, and where $M$ is a $6 \times 6$ matrix whose elements are additions of the coefficients $a_i$ of the operand $A$. In this case, the matrix $M$ can be decomposed as (1):
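$$
M = M_0 + M_1 + M_2 =
\begin{pmatrix}
a_0 & a_5 & a_4 & a_3 & a_2 & a_1 \\
a_1 & a_0 & a_5 & a_4 & a_3 & a_2 \\
a_2 & a_1 & a_0 & a_5 & a_4 & a_3 \\
a_3 & a_2 & a_1 & a_0 & a_5 & a_4 \\
a_4 & a_3 & a_2 & a_1 & a_0 & a_5 \\
a_5 & a_4 & a_3 & a_2 & a_1 & a_0
\end{pmatrix}
+
\begin{pmatrix}
\cdot & \cdot & \cdot & \cdot & a_5 & a_4 \\
\cdot & \cdot & \cdot & \cdot & \cdot & a_5 \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\cdot & a_5 & a_4 & a_3 & a_2 & a_1 \\
\cdot & \cdot & a_5 & a_4 & a_3 & a_2 \\
\cdot & \cdot & \cdot & a_5 & a_4 & a_3
\end{pmatrix}
+
\begin{pmatrix}
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & a_5 & a_4 \\
\cdot & \cdot & \cdot & \cdot & \cdot & a_5 \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot
\end{pmatrix}
\qquad (1)
$$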
It can be observed that the product $M_0 \cdot b$ corresponds to the addition of the $S_i$ and $T_i$ terms given in the first column of Table 1, while the products $M_1 \cdot b$ and $M_2 \cdot b$ correspond to the two subcolumns included in the column labeled $n = 3$ in Table 1, respectively. It must also be noted that the circulant matrix $M_0$ (equivalent to the $K_0$ matrix in [7]) always appears in the decomposition of $M$ for any field $GF(2^m)$, so the product $M_0 \cdot b$ (and therefore the addition of the corresponding $S_i$ and $T_i$ terms) is common to any irreducible polynomial selected for $GF(2^m)$.
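The decomposition in (1) can be checked numerically. The sketch below encodes each matrix entry by the index $k$ of the coefficient $a_k$ it contains (`None` for an empty entry) and compares $M \cdot b$ against a direct multiplication modulo $y^6 + y^3 + 1$. It assumes that the only nonzero entry in row 4 of $M_2$ is $a_5$, so that this row contributes $T_4 = a_5b_5$ as required by Table 1. The encoding and the function names are illustrative choices.

```python
from itertools import product

N = None
# M0 is circulant: entry (i, j) holds a_{(i - j) mod 6}
M0 = [[(i - j) % 6 for j in range(6)] for i in range(6)]
M1 = [[N, N, N, N, 5, 4],
      [N, N, N, N, N, 5],
      [N, N, N, N, N, N],
      [N, 5, 4, 3, 2, 1],
      [N, N, 5, 4, 3, 2],
      [N, N, N, 5, 4, 3]]
M2 = [[N] * 6, [N] * 6, [N] * 6,
      [N, N, N, N, 5, 4],
      [N, N, N, N, N, 5],   # assumed entry a_5 (row 4 must yield T_4 = a5*b5)
      [N] * 6]


def matvec(a, b):
    """Compute c = (M0 + M1 + M2) . b over GF(2)."""
    c = [0] * 6
    for M in (M0, M1, M2):
        for i in range(6):
            for j in range(6):
                if M[i][j] is not None:
                    c[i] ^= a[M[i][j]] & b[j]
    return c


def reference(a, b):
    """Schoolbook product followed by reduction with x^6 = x^3 + 1."""
    d = [0] * 11
    for i in range(6):
        for j in range(6):
            d[i + j] ^= a[i] & b[j]
    for k in range(10, 5, -1):      # fold x^k into x^(k-3) and x^(k-6)
        d[k - 3] ^= d[k]
        d[k - 6] ^= d[k]
        d[k] = 0
    return d[:6]


# exhaustive comparison over all 64 x 64 operand pairs
ok = all(matvec(a, b) == reference(a, b)
         for a in product((0, 1), repeat=6)
         for b in product((0, 1), repeat=6))
print(ok)
```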
3.1 New General Expressions for the Multiplier
The expressions given in [7] for $GF(2^m)$ PB multiplication are only valid for the five specific trinomials $f(y) = y^m + y^n + 1$ with $n = m-1$, $m/2$, $(m+1)/2$, $(m-1)/2$ and $n = 1$. However, using Table 1, a new general multiplication algorithm for trinomials valid for any $1 \le n \le m-1$ can be given. Figure 2 shows these new expressions, where the coefficients of the product can be computed by the addition of $S_i$ and $T_i$ terms.
Fig. 2. New general multiplication algorithm for $1 \le n \le m-1$.
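The general expressions can be sketched in code from the proof in Section 3.1.1. What follows is a reconstruction under stated assumptions, not the author's exact formulation: it takes $z = m - n$, $\delta = \lfloor (m-2-i)/z \rfloor$ and $\gamma = \lfloor (m-2-i+n)/z \rfloor$, adds $T_{i+hz}$ for $h = 1, \ldots, \delta$ when $0 \le i \le n-2$, and adds $T_{i-n+hz}$ for $h = 0, \ldots, \gamma$ when $n \le i \le m-1$. All function names are hypothetical. The result is compared with a direct reduction modulo $y^m + y^n + 1$.

```python
import random


def poly_mul(a, b):
    """Coefficients of the unreduced product of two GF(2) polynomials."""
    d = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            d[i + j] ^= ai & bj
    return d


def general_trinomial_product(a, b, n):
    """Product coefficients c_i built from S and T terms (assumed form of Fig. 2)."""
    m = len(a)
    z = m - n
    d = poly_mul(a, b)
    S = d[:m]                  # S[i] holds S_{i+1}
    T = d[m:] + [0]            # T[i] holds T_i; T_{m-1} does not exist, padded with 0
    c = []
    for i in range(m):
        ci = S[i] ^ T[i]       # common part, i.e. row i of M0 . b
        if i <= n - 2:
            delta = (m - 2 - i) // z
            for h in range(1, delta + 1):
                ci ^= T[i + h * z]
        elif i >= n:
            gamma = (m - 2 - i + n) // z
            for h in range(0, gamma + 1):
                ci ^= T[i - n + h * z]
        # i == n - 1: only the common part S_n + T_{n-1}
        c.append(ci)
    return c


def reference_product(a, b, n):
    """Schoolbook product reduced with x^m = x^n + 1."""
    m = len(a)
    d = poly_mul(a, b)
    for k in range(2 * m - 2, m - 1, -1):
        if d[k]:
            d[k - m + n] ^= 1
            d[k - m] ^= 1
            d[k] = 0
    return d[:m]


random.seed(1)
for m in range(2, 12):
    for n in range(1, m):
        for _ in range(50):
            a = [random.randint(0, 1) for _ in range(m)]
            b = [random.randint(0, 1) for _ in range(m)]
            assert general_trinomial_product(a, b, n) == reference_product(a, b, n)
print("general expressions agree with direct reduction")
```

Terms that appear twice for the same coefficient (for example $T_4$ in $c_4$ for $m = 6$, $n = 4$) cancel under XOR, which is why the common part can be kept unchanged for every $n$.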
3.1.1 Proof of the new multiplication algorithm
The new algorithm given in Figure 2 can be deduced from [7] and from Table 1. From [7], the matrix $M$ is decomposed as $M = M_0 + \sum_{i=1}^{\tau} M_i$, in such a way that $\tau$ determines the number of subcolumns included in the columns for a given $n$ in Table 1. For a given trinomial $f(y) = y^m + y^n + 1$, the coefficients $c_i$ of the product include the addition $(S_{i+1} + T_i)$ corresponding to $M_0 \cdot b$. It can be observed that the coefficient $c_{n-1}$ only includes this addition $(S_n + T_{n-1})$, while the remaining coefficients also include the sum of additional $T_j$ terms corresponding to the products $M_i \cdot b$, with $i = 1, \ldots, \tau$. From [7], the first additional $T_j$ terms (first subcolumn for a given $n$ in Figure 2) for the coefficients $c_i$ are $T_{i+z}$ for $0 \le i \le n-2$ and $T_{i-n}$ for $n \le i \le m-1$, where $z = m - n$. These terms correspond to $M_1 \cdot b$. It can be observed that if additional subcolumns (corresponding to additional products $M_h \cdot b$, with $h = 2, 3, \ldots$) are included, then additional terms $T_{i+hz}$ ($0 \le i \le n-2$) and $T_{i-n+hz}$ ($n \le i \le m-1$) are added to the coefficients $c_i$. As the maximum available term $T_i$ for a given field $GF(2^m)$ is $T_{m-2}$, the index of the maximum additional term $T_{i+hz}$ (for $0 \le i \le n-2$) must satisfy $i + hz \le m-2$, so the maximum value of $h$ is $\lfloor (m-2-i)/z \rfloor$, corresponding to the value $\delta$ in Figure 2. Similarly, the index of the maximum additional term $T_{i-n+hz}$ (for $n \le i \le m-1$) must satisfy $i - n + hz \le m-2$, so the maximum value of $h$ is $\lfloor (m-2-i+n)/z \rfloor$, corresponding to the value $\gamma$ in Figure 2. Furthermore,
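the coefficient with the maximum number of additional $T_j$ terms is $c_n$, for which $\gamma = \lfloor (m-2)/z \rfloor$. As the addition of terms given in Figure 2 ranges from 0 to $\gamma$, the number of additional terms $T_j$ (and therefore the maximum number of additional matrices $M_i$) is $\lfloor (m-2)/z \rfloor + 1$, in agreement with the value $\tau$ given in [7].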
4 NEW LFSR-BASED MULTIPLIER ARCHITECTURE FOR IRREDUCIBLE TRINOMIALS
The LFSR multiplication approach [11] given in Section 2 can be easily adapted to the computation of the product $C = A \cdot B \bmod (x^m + 1)$, where $A = \sum_{i=0}^{m-1} a_ix^i$ and $B = \sum_{i=0}^{m-1} b_ix^i$, with $a_i, b_i \in GF(2)$. In this case, the same architecture given in Figure 1 can be used to perform the product $C = A \cdot B \bmod (x^m + 1)$ if $\oplus$ refers to XOR gates, $\otimes$ refers to AND gates and the boxes stand for 1-bit registers. For example, the product $C = A \cdot B \bmod (x^6 + 1)$ can be implemented with Figure 1 using the fact that $x$ is a root of the polynomial $f(y) = y^6 + 1$, so $x^6 = 1$, $x^7 = x$, $x^8 = x^2$, $x^9 = x^3$ and $x^{10} = x^4$. Substituting these expressions into those obtained from the product of polynomials $(a_5x^5 + a_4x^4 + a_3x^3 + a_2x^2 + a_1x + a_0) \cdot (b_5x^5 + b_4x^4 + b_3x^3 + b_2x^2 + b_1x + b_0)$ we get $c_0 = a_0b_0 + a_5b_1 + a_4b_2 + a_3b_3 + a_2b_4 + a_1b_5$, $c_1 = a_1b_0 + a_0b_1 + a_5b_2 + a_4b_3 + a_3b_4 + a_2b_5$, $c_2 = a_2b_0 + a_1b_1 + a_0b_2 + a_5b_3 + a_4b_4 + a_3b_5$, $c_3 = a_3b_0 + a_2b_1 + a_1b_2 + a_0b_3 + a_5b_4 + a_4b_5$, $c_4 = a_4b_0 + a_3b_1 + a_2b_2 + a_1b_3 + a_0b_4 + a_5b_5$ and $c_5 = a_5b_0 + a_4b_1 + a_3b_2 + a_2b_3 + a_1b_4 + a_0b_5$. It can be observed that after 6 clock cycles, the 1-bit registers $c_0, c_1, c_2, c_3, c_4$ and $c_5$ in Figure 1 contain these expressions of the coefficients of the product. In the above example, it is important to note that the expressions of the coefficients $c_i$, $i = 0, \ldots, 5$, correspond to the addition of the $S_i$ and $T_i$ terms given in Section 2 for $GF(2^6)$ using the approach in [7]. In this case, the coefficients can be written as $c_0 = S_1 + T_0$, $c_1 = S_2 + T_1$, $c_2 = S_3 + T_2$, $c_3 = S_4 + T_3$, $c_4 = S_5 + T_4$ and $c_5 = S_6$. In fact, these coefficients correspond to the result of the product $M_0 \cdot b$ as given in equation (1). This product was stated in Section 3 to be common to any irreducible polynomial selected for $GF(2^6)$. Furthermore, the content of the register $c_5$ in this example is $S_1$ in the first clock cycle of the computation, $S_2$ in the second cycle, $S_3$ in the third one, $S_4$ in the fourth cycle, $S_5$ in the fifth one and $S_6$ in the sixth clock cycle. In conclusion, the LFSR architecture given in Figure 1 can be used for the computation of the product $M_0 \cdot b$ common to any polynomial selected for $GF(2^m)$, where the 1-bit registers $c_i$ store the coefficients of $M_0 \cdot b$ and where the register $c_{m-1}$ contains the term $S_i$ in the $i$-th clock cycle, $i = 1, \ldots, m-1$.
The method given in [7] for $GF(2^m)$ multiplication using irreducible trinomials computes the coefficients of the product by the addition of $S_i$ and $T_i$ terms as given in Table 1 and Figure 2. It can be observed that different terms $T_i$ are shared among the coefficients of the product, while the $S_i$ terms are only used in the $c_{i-1}$ coefficients, for $i = 1, \ldots, m$. Therefore, it would be useful to have access to the individual $T_i$ terms in order to implement the product. To do that, the LFSR architecture given in Figure 1 can be modified in such a way that it computes the matrix-vector product $M_0 \cdot b$ and the register $c_{m-1}$ contains the term $T_{m-1-i}$ in the $i$-th clock cycle, $i = 1, \ldots, m-1$. In the new proposed architecture, the coefficients of $A$, $(a_0, a_1, \ldots, a_{m-1})$, are connected to the inputs of the AND gates in the reverse order with respect to Figure 1, and the coefficients of the operand $B$, $(b_0, b_1, \ldots, b_{m-1})$, are input to all the AND gates serially, starting with the most significant coefficient $b_{m-1}$. Furthermore, the 1-bit registers (from left to right) are $c_{m-1}, c_0, c_1, \ldots, c_{m-2}$, which is a different order from that in Figure 1. After $m$ clock cycles, these registers will store the coefficients of $M_0 \cdot b$, and the 1-bit register $c_{m-1}$ will contain the term $T_{m-1-i}$ in the $i$-th clock cycle, with $i = 1, \ldots, m-1$.
For the specific case $GF(2^6)$, Table 2 shows the evolution in time of the contents of the 1-bit registers $(c_0, c_1, c_2, c_3, c_4, c_5)$ for the new proposed architecture given above, where Cycle denotes the clock cycles and Serial denotes the coefficient $b_i$ that is valid in the serial input for each clock cycle. Coefficients $(a_0, a_1, a_2, a_3, a_4, a_5)$ are valid in parallel throughout the process. In Table 2, during initialization (Init.) the most significant bit $b_5$ of operand $B$ is valid in the serial input and registers $c_i$, $i = 0, \ldots, 5$, are loaded with 0. In the first cycle $t_1$, the products $a_5b_5$, $a_0b_5$, $a_1b_5$, $a_2b_5$, $a_3b_5$ and $a_4b_5$ performed during initialization are loaded into $c_5, c_0, c_1, c_2, c_3$ and $c_4$, respectively. It must be noted that the term $a_5b_5$ loaded into $c_5$ corresponds to the term $T_4$ in $GF(2^6)$. In $t_1$, the bit $b_4$ of $B$ is valid in the serial input and multiplied (AND) with the coefficients of $A$. These products will be added (XOR) to the contents of the 1-bit registers in $t_1$ and loaded into the registers in the second cycle $t_2$, and so on. It can be observed that after 6 clock cycles (in $t_6$), the registers contain the coefficients of $M_0 \cdot b$. Furthermore, the contents of $c_5$ in cycles $t_1, t_2, t_3, t_4$ and $t_5$ (column $c_5$ in Table 2) correspond to $T_4, T_3, T_2, T_1$ and $T_0$, respectively.

Fig. 3. New LFSR multiplier architecture for trinomial $f(y) = y^6 + y + 1$.
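The register updates described above can be simulated in a few lines. This is a behavioural sketch only, assuming each clock edge performs $c_i \leftarrow c_{(i-1) \bmod m} \oplus (a_i \wedge b)$ with $b$ the bit currently at the serial input, which is the update that reproduces the rows of Table 2. The function name `m0_lfsr` is illustrative.

```python
def m0_lfsr(a, b):
    """Simulate the modified LFSR that computes M0 . b, i.e. A*B mod (x^m + 1).

    a, b: lists of m bits, where index i is the coefficient of x^i.
    Returns (c, trace): final registers c_0..c_{m-1} and the content of
    register c_{m-1} after each of the m clock cycles t_1..t_m.
    """
    m = len(a)
    c = [0] * m                                  # registers cleared at initialization
    trace = []
    for k in range(m):
        bit = b[m - 1 - k]                       # serial input, most significant bit first
        # rotate the registers by one position and accumulate a_i AND bit
        c = [c[(i - 1) % m] ^ (a[i] & bit) for i in range(m)]
        trace.append(c[m - 1])
    return c, trace


a = [1, 0, 1, 1, 0, 1]
b = [0, 1, 1, 0, 1, 1]
c, trace = m0_lfsr(a, b)

# unreduced product: d[k] is S_{k+1} for k < 6 and T_{k-6} for k >= 6
d = [0] * 11
for i in range(6):
    for j in range(6):
        d[i + j] ^= a[i] & b[j]

print(trace[:5] == d[6:][::-1])                  # c5 holds T4, T3, T2, T1, T0 in t1..t5
print(c == [d[i] ^ (d[6 + i] if i < 5 else 0) for i in range(6)])
```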
4.1 New LFSR Multiplier Architecture
The architecture proposed above can be used to compute the product $P = A \cdot B$ over $GF(2^m)$ generated by irreducible trinomials. As shown in Table 1 and Figure 2, the product for trinomials can be computed with the addition of $S_i$ and $T_i$ terms [7]. As stated previously, the proposed architecture computes the coefficients of the product $M_0 \cdot b$ (i.e., the addition of $S_i$ and $T_i$ terms in the first column of Table 1), which is common to any irreducible polynomial selected for a given field size $m$, and all the terms $T_i$, $0 \le i \le m-2$, which are successively stored in register $c_{m-1}$ in the first $m-1$ cycles of the product computation. Therefore, if the contents of the 1-bit register $c_{m-1}$ are loaded into a shift register, then the $T_i$ terms can be accessed and XORed with the corresponding coefficients of $M_0 \cdot b$ depending on the selected irreducible trinomial.
4.1.1 Multiplication for $f(y) = y^6 + y + 1$
For example, let us consider the multiplication for $GF(2^6)$ using the trinomial $f(y) = y^6 + y + 1$ given in Table 1 (column $n = 1$). Figure 3 shows the new multiplier architecture. At the top of Figure 3, the architecture for the computation of $M_0 \cdot b$ is represented inside a dotted rectangle, where the 1-bit registers $(c_0, c_1, c_2, c_3, c_4, c_5)$ show their final contents after the six clock cycles (cycle $t_6$ in Table 2). At the bottom of Figure 3, a shift register $(s_0, s_1, s_2, s_3, s_4)$ loads the successive contents of register $c_5$. Therefore, after six clock cycles, the contents of $s_0, s_1, s_2, s_3$ and $s_4$ are $T_0, T_1, T_2, T_3$ and $T_4$, respectively. In Table 3, the evolution in time of the contents of registers $c_5$ and $s_i$, $0 \le i \le 4$, is given. To complete the multiplication, a final addition of the contents of registers $(c_0, c_1, c_2, c_3, c_4, c_5)$ and $(s_0, s_1, s_2, s_3, s_4)$ must be done in order to compute the product $P = A \cdot B$. It must be noted that in this case, only one level of 2-input XOR gates is needed for this final step.

Fig. 4. New LFSR multiplier architecture for trinomial $f(y) = y^6 + y^4 + 1$.
4.1.2 Multiplication for $f(y) = y^m + y^n + 1$
The computation of the product for $f(y) = y^6 + y + 1$ (and, in general, for trinomials $f(y) = y^m + y^n + 1$ with $n = 1$) presents the simplest architecture because only one addition of $T_i$ terms must be done with the results of the product $M_0 \cdot b$. This fact can be observed in Table 1, where only one column with $T_i$ terms appears for $n = 1$. However, for trinomials $f(y) = y^m + y^n + 1$ with $n > 1$, more subcolumns appear for each value of $n$. For $GF(2^6)$, Table 1 shows 2, 2, 3 and 5 subcolumns for $n = 2, 3, 4$ and 5, respectively. The appearance of $h$ subcolumns in Table 1 implies at least one addition of $h$ $T_i$ terms and, therefore, the use of $\lceil \log_2(h+1) \rceil$ levels of 2-input XOR gates in the final step of the multiplication process. In order to reduce the number of XOR levels in the final step of the product computation, the shift register used to store the $T_i$ terms provided by the register $c_{m-1}$ can be modified. It can be observed in Table 1 that for $n > 1$, the addition of at least two $T_i$ terms is needed for the computation. If an XOR gate is included at the input of the first register $s_0$ of the shift register used to store the $T_i$ terms, and a feedback to this XOR is taken from a given $s_i$ of the shift register, then the addition of specific $T_i$ terms can be done.
For example, let us consider the multiplication for $GF(2^6)$ using the trinomial $f(y) = y^6 + y^4 + 1$ given in Table 1 (column $n = 4$). Figure 4 shows the new multiplier architecture, where an XOR gate has been included at the input of the 1-bit register $s_0$ of the bottom shift register. One of its inputs is the output of the $c_5$ register, while the other one comes from the second register $s_1$ of the shift register. Using this architecture, Table 4 shows the evolution in time of the contents of registers $c_5$ and $s_i$, $0 \le i \le 4$, where the contents of $c_5$ are shifted to the right until cycle $t_3$. Due to the feedback from the 1-bit register $s_1$ to the XOR at the input of $s_0$, the XOR of $T_2$ (content of $c_5$ in the previous cycle $t_3$) and $T_4$ (content of $s_1$ in $t_3$) is performed and loaded into $s_0$ in the next clock cycle $t_4$. In $t_4$, the previous contents of $s_0$ and $s_1$ are also loaded into $s_1$ and $s_2$, respectively. A similar situation occurs in $t_5$, where the new value $T_3 + T_1$ is loaded into $s_0$ and the values $T_4 + T_2$, $T_3$ and $T_4$ are shifted to $s_1$, $s_2$ and $s_3$, respectively. After six clock cycles ($t_6$), the values $T_4 + T_2 + T_0$, $T_3 + T_1$, $T_4 + T_2$, $T_3$ and $T_4$ are loaded into $s_0, s_1, s_2, s_3$ and $s_4$, respectively. These final contents of the registers are shown in Figure 4. It can be observed that these values match the additions of $T_i$ terms given in Table 1 needed for the computation of the product $P = A \cdot B$ over $GF(2^6)$ when the trinomial $f(y) = y^6 + y^4 + 1$ is used. A final level of XOR gates performing the sums of the corresponding $c_i$ ($i = 0, 1, 2, 4$) and $s_i$ ($i = 0, 2, 3, 4$) registers completes the implementation of the product. After six clock cycles, the sum $c_5 \oplus s_1$ provides the most significant bit of the product. This sum feeds register $s_0$ and is already implemented, so no additional XOR is needed at the final level of XOR gates.

TABLE 2
Computation of $M_0 \cdot b$ for $GF(2^6)$ using the new LFSR architecture.

| Cycle | Serial | $c_5$ | $c_0$ | $c_1$ | $c_2$ | $c_3$ | $c_4$ |
|---|---|---|---|---|---|---|---|
| Init. | $b_5$ | 0 | 0 | 0 | 0 | 0 | 0 |
| $t_1$ | $b_4$ | $a_5b_5$ | $a_0b_5$ | $a_1b_5$ | $a_2b_5$ | $a_3b_5$ | $a_4b_5$ |
| $t_2$ | $b_3$ | $a_4b_5 + a_5b_4$ | $a_5b_5 + a_0b_4$ | $a_0b_5 + a_1b_4$ | $a_1b_5 + a_2b_4$ | $a_2b_5 + a_3b_4$ | $a_3b_5 + a_4b_4$ |
| $t_3$ | $b_2$ | $a_3b_5 + a_4b_4 + a_5b_3$ | $a_4b_5 + a_5b_4 + a_0b_3$ | $a_5b_5 + a_0b_4 + a_1b_3$ | $a_0b_5 + a_1b_4 + a_2b_3$ | $a_1b_5 + a_2b_4 + a_3b_3$ | $a_2b_5 + a_3b_4 + a_4b_3$ |
| $t_4$ | $b_1$ | $a_2b_5 + a_3b_4 + a_4b_3 + a_5b_2$ | $a_3b_5 + a_4b_4 + a_5b_3 + a_0b_2$ | $a_4b_5 + a_5b_4 + a_0b_3 + a_1b_2$ | $a_5b_5 + a_0b_4 + a_1b_3 + a_2b_2$ | $a_0b_5 + a_1b_4 + a_2b_3 + a_3b_2$ | $a_1b_5 + a_2b_4 + a_3b_3 + a_4b_2$ |
| $t_5$ | $b_0$ | $a_1b_5 + a_2b_4 + a_3b_3 + a_4b_2 + a_5b_1$ | $a_2b_5 + a_3b_4 + a_4b_3 + a_5b_2 + a_0b_1$ | $a_3b_5 + a_4b_4 + a_5b_3 + a_0b_2 + a_1b_1$ | $a_4b_5 + a_5b_4 + a_0b_3 + a_1b_2 + a_2b_1$ | $a_5b_5 + a_0b_4 + a_1b_3 + a_2b_2 + a_3b_1$ | $a_0b_5 + a_1b_4 + a_2b_3 + a_3b_2 + a_4b_1$ |
| $t_6$ | - | $a_0b_5 + a_1b_4 + a_2b_3 + a_3b_2 + a_4b_1 + a_5b_0$ | $a_1b_5 + a_2b_4 + a_3b_3 + a_4b_2 + a_5b_1 + a_0b_0$ | $a_2b_5 + a_3b_4 + a_4b_3 + a_5b_2 + a_0b_1 + a_1b_0$ | $a_3b_5 + a_4b_4 + a_5b_3 + a_0b_2 + a_1b_1 + a_2b_0$ | $a_4b_5 + a_5b_4 + a_0b_3 + a_1b_2 + a_2b_1 + a_3b_0$ | $a_5b_5 + a_0b_4 + a_1b_3 + a_2b_2 + a_3b_1 + a_4b_0$ |

TABLE 3
Computation of $A \cdot B$ over $GF(2^6)$ for $f(y) = y^6 + y + 1$ using the new LFSR architecture.

| Cycle | $c_5$ | $s_0$ | $s_1$ | $s_2$ | $s_3$ | $s_4$ |
|---|---|---|---|---|---|---|
| Init. | 0 | 0 | 0 | 0 | 0 | 0 |
| $t_1$ | $T_4$ | 0 | 0 | 0 | 0 | 0 |
| $t_2$ | $T_3$ | $T_4$ | 0 | 0 | 0 | 0 |
| $t_3$ | $T_2$ | $T_3$ | $T_4$ | 0 | 0 | 0 |
| $t_4$ | $T_1$ | $T_2$ | $T_3$ | $T_4$ | 0 | 0 |
| $t_5$ | $T_0$ | $T_1$ | $T_2$ | $T_3$ | $T_4$ | 0 |
| $t_6$ | $S_6$ | $T_0$ | $T_1$ | $T_2$ | $T_3$ | $T_4$ |
The above approach can be used to implement the $GF(2^m)$ multiplication when a trinomial is used. Figure 5 shows the architecture of the shift registers for the five trinomials $f(y) = y^6 + y^n + 1$, with $1 \le n \le 5$, where the final contents of the registers after six clock cycles are also included. At the top of Figure 5, the computation of $M_0 \cdot b$ is represented inside a dotted rectangle. At the bottom, the shift registers $(s_0, s_1, s_2, s_3, s_4)$ for the five trinomials are shown. It can be observed that for each trinomial $y^6 + y^n + 1$, a feedback from the register $s_{m-1-n} = s_{z-1}$ (with $z = m - n$) to the XOR at the input of the 1-bit register $s_0$ must be included. In fact, this feedback is also present for $f(y) = y^6 + y + 1$ (represented in Figure 5 with a dotted XOR and dotted feedback), although it has no effect because only six clock cycles are needed for the computation. For this reason, the input XOR and feedback can be removed for this trinomial. Figure 5 does not show the final step with one level of XOR gates performing the addition of the corresponding $c_i$ ($i = 0, \ldots, 5$) and $s_i$ ($i = 0, \ldots, 4$) registers in order to complete the multiplication.

TABLE 4
Computation of $A \cdot B$ over $GF(2^6)$ for $f(y) = y^6 + y^4 + 1$ using the new LFSR architecture.

| Cycle | $c_5$ | $s_0$ | $s_1$ | $s_2$ | $s_3$ | $s_4$ |
|---|---|---|---|---|---|---|
| Init. | 0 | 0 | 0 | 0 | 0 | 0 |
| $t_1$ | $T_4$ | 0 | 0 | 0 | 0 | 0 |
| $t_2$ | $T_3$ | $T_4$ | 0 | 0 | 0 | 0 |
| $t_3$ | $T_2$ | $T_3$ | $T_4$ | 0 | 0 | 0 |
| $t_4$ | $T_1$ | $T_4 + T_2$ | $T_3$ | $T_4$ | 0 | 0 |
| $t_5$ | $T_0$ | $T_3 + T_1$ | $T_4 + T_2$ | $T_3$ | $T_4$ | 0 |
| $t_6$ | $S_6$ | $T_4 + T_2 + T_0$ | $T_3 + T_1$ | $T_4 + T_2$ | $T_3$ | $T_4$ |
4.1.3 Proof of the new architecture
It can be proven that the contents of the $s_i$ registers in Figure 5 match the terms that must be added to the coefficients $c_i$ of the product given in Table 1 and in the new algorithm shown in Figure 2. From Table 1, single pairs of additions $(T_{m-i} + T_{n-i})$, $i = 2, 3, \ldots$, appear (when at least two subcolumns are included for a given $n$) as long as no repeated $T_j$ terms are included in these pairs. For example, in Table 1 for $n = 4$, the single pairs $(T_4 + T_2)$ and $(T_3 + T_1)$ are given for $c_0$ and $c_5$, respectively, but no single pair $(T_2 + T_0)$ is given ($T_2$ is already included in the first pair). In this case, the addition of terms in $c_4$ comes from the addition of the single pair $(T_4 + T_2)$ and $T_0$. From Table 4 and Figure 5, the terms $T_{m-2}$ and $T_{n-2}$ are loaded into registers $s_{z-1}$ and $c_{m-1}$, respectively, in the clock cycle $t_{z+1}$. Due to the feedback from register $s_{z-1}$ to the XOR at the input of the register $s_0$, in cycle $t_{z+2}$ the single pair $(T_{m-2} + T_{n-2})$ is loaded into $s_0$. The content of $s_0$ is shifted to $s_1$ in cycle $t_{z+3}$, when the next pair $(T_{m-3} + T_{n-3})$ is loaded into $s_0$, and so on. After $m$ clock cycles, the contents of the $s_i$ registers match the additions of $T_i$ terms given in Figure 2 and Table 1 (for $m = 6$) needed for the computation of the product $P = A \cdot B$ over $GF(2^m)$ for trinomials $f(y) = y^m + y^n + 1$.
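The complete multiplier can be modelled behaviourally to check the argument above for arbitrary $m$ and $n$. The sketch assumes the same register update as Table 2, a feedback from $s_{z-1}$ (with $z = m - n$) into the XOR at the input of $s_0$, and an output wiring $p_i = c_i \oplus s_{(i-n) \bmod m}$ where the nonexistent register $s_{m-1}$ is treated as 0. That wiring is inferred from Table 1 and Figure 5 rather than stated explicitly in the text, and the function names are hypothetical.

```python
import random


def lfsr_trinomial_multiply(a, b, n):
    """Bit-serial product A*B over GF(2^m) for f(y) = y^m + y^n + 1 (behavioural model)."""
    m = len(a)
    z = m - n
    c = [0] * m            # registers c_0..c_{m-1}, accumulate M0 . b
    s = [0] * (m - 1)      # shift register s_0..s_{m-2}, accumulates sums of T terms
    for k in range(m):
        bit = b[m - 1 - k]                     # serial input, most significant bit first
        feed = c[m - 1] ^ s[z - 1]             # XOR at the input of s_0
        s = [feed] + s[:-1]                    # shift s to the right
        c = [c[(i - 1) % m] ^ (a[i] & bit) for i in range(m)]
    # final XOR level (assumed wiring): p_i = c_i XOR s_{(i - n) mod m}
    p = []
    for i in range(m):
        j = (i - n) % m
        p.append(c[i] ^ (s[j] if j < m - 1 else 0))
    return p


def reference_product(a, b, n):
    """Schoolbook product reduced with x^m = x^n + 1."""
    m = len(a)
    d = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            d[i + j] ^= a[i] & b[j]
    for k in range(2 * m - 2, m - 1, -1):
        if d[k]:
            d[k - m + n] ^= 1
            d[k - m] ^= 1
            d[k] = 0
    return d[:m]


random.seed(7)
for m in range(2, 12):
    for n in range(1, m):
        for _ in range(50):
            a = [random.randint(0, 1) for _ in range(m)]
            b = [random.randint(0, 1) for _ in range(m)]
            assert lfsr_trinomial_multiply(a, b, n) == reference_product(a, b, n)
print("LFSR model agrees with direct reduction")
```

The model does not require the trinomial to be irreducible; the reduction identity holds for any $1 \le n \le m-1$, although only irreducible trinomials generate a field.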
Fig. 5. Shift registers for trinomials $f(y) = y^6 + y^n + 1$, $1 \le n \le 5$. Final contents of $(c_5, c_0, c_1, c_2, c_3, c_4)$: $(S_6, S_1 + T_0, S_2 + T_1, S_3 + T_2, S_4 + T_3, S_5 + T_4)$. Final contents of $(s_0, s_1, s_2, s_3, s_4)$: for $n = 5$, $(T_4 + T_3 + T_2 + T_1 + T_0,\ T_4 + T_3 + T_2 + T_1,\ T_4 + T_3 + T_2,\ T_4 + T_3,\ T_4)$; for $n = 4$, $(T_4 + T_2 + T_0,\ T_3 + T_1,\ T_4 + T_2,\ T_3,\ T_4)$; for $n = 3$, $(T_3 + T_0,\ T_4 + T_1,\ T_2,\ T_3,\ T_4)$; for $n = 2$, $(T_4 + T_0,\ T_1,\ T_2,\ T_3,\ T_4)$.
4.1.4 Parallel multiplication
A characteristic of the proposed architecture is that it can perform the multiplication of two operands for $t$ different fields $GF(2^m)$ using $t$ trinomials simultaneously in $m$ clock cycles, including only $t(m-1)$ 1-bit registers and $tm$ XOR gates. For example, Figure 5 would perform the parallel computation of $P = A \cdot B$ over $GF(2^6)$ for the five trinomials $f(y) = y^6 + y^n + 1$, with $1 \le n \le 5$, if the final step of XOR gates performing the addition of the corresponding $c_i$ ($i = 0, \ldots, 5$) and $s_i$ ($i = 0, \ldots, 4$) registers for each trinomial is included. The general architecture of the $GF(2^m)$ multiplier for an irreducible trinomial $f(y) = y^m + y^n + 1$, with $1 \le n < m$, is given in Figure 6. Parallel multiplication of two operands for different irreducible polynomials could be applied in cryptography. In order to hinder cryptanalytic attacks, cryptosystems working over different irreducible polynomials could be proposed in such a way that different polynomials could be selected for different messages without modifying the $GF(2^m)$ arithmetic hardware [12], [13]. Pipelined architectures could also use simultaneous multiplication in such a way that different stages use different polynomials, provided that a given message uses the same irreducible polynomial throughout the pipeline.
4.2 Complexity Analysis
As shown in Figure 6, the three main modules of the new LFSR-based GF(2^m) multiplier architecture for irreducible trinomials f(y) = y^m + y^n + 1, with 1 ≤ n < m, are the module for the computation of M_0 · b, the shift register for the loading/addition of the T_i terms, and the module that performs the XOR of the outputs of these two modules. The computation of M_0 · b requires m AND gates, m XOR gates and m flipflops (1-bit registers), and its critical path delay is T_A + T_X. The shift register for the loading/addition of the T_i terms (see Figure 5) requires m − 1 flipflops and one XOR gate when 2 ≤ n < m (for n = 1 no XOR gates are needed); its critical path delay is T_X. Finally, the module that performs the XOR of the outputs of the above two modules requires only m − 2 XOR gates in parallel when 2 ≤ n < m, or m − 1 XOR gates in parallel for n = 1, so its critical path delay is also T_X. Therefore, in both cases the overall area complexity of the new LFSR-based multiplier is m AND gates, 2m − 1 flipflops and 2m − 1 XOR gates. With respect to time complexity, the product is computed in m clock cycles with a period determined by the critical path delay. The critical propagation delay of the multiplier is given by the maximum path delay of the above three modules, i.e., max{T_A + T_X, T_X} = T_A + T_X. This is a constant propagation delay, independent of the value of n. Furthermore, a characteristic of the proposed architecture is that it can perform the product of two operands for t different fields GF(2^m) generated by t irreducible trinomials simultaneously in m clock cycles with the inclusion of t(m − 1) flipflops and tm XOR gates (see Figure 5).
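The per-module counts stated above can be tallied mechanically. The following sketch only restates those counts in code; the function names (module_costs, total_costs) and the module labels are hypothetical and introduced for illustration.

```python
def module_costs(m: int, n: int) -> dict[str, dict[str, int]]:
    """Per-module gate counts for f(y) = y^m + y^n + 1, with 1 <= n < m."""
    if not 1 <= n < m:
        raise ValueError("n must satisfy 1 <= n < m")
    return {
        # Module computing M_0 * b.
        "M0*b": {"AND": m, "XOR": m, "FF": m},
        # Shift register that loads/adds the T_i terms (no XOR when n = 1).
        "T_i shift register": {"AND": 0, "XOR": 0 if n == 1 else 1, "FF": m - 1},
        # Final XOR of the outputs of the two modules above.
        "output XOR": {"AND": 0, "XOR": m - 1 if n == 1 else m - 2, "FF": 0},
    }


def total_costs(m: int, n: int) -> dict[str, int]:
    """Sum the per-module counts into overall AND/XOR/FF totals."""
    totals = {"AND": 0, "XOR": 0, "FF": 0}
    for costs in module_costs(m, n).values():
        for gate, count in costs.items():
            totals[gate] += count
    return totals


if __name__ == "__main__":
    # NIST trinomial y^233 + y^74 + 1: expect m AND, 2m - 1 XOR, 2m - 1 FF.
    print(total_costs(233, 74))
```

For both n = 1 and 2 ≤ n < m the totals come out as m AND gates, 2m − 1 XOR gates and 2m − 1 flipflops, in agreement with the overall figures given in the text.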
5 COMPARISON WITH OTHER MULTIPLIERS AND
HARDWARE IMPLEMENTATIONS
In Table 5, the theoretical complexity of the proposed multiplier is compared with several GF(2^m) bit-serial multipliers found in the literature. The table gives the number of 2-input AND, XOR and OR gates, the number of 1-bit registers and 2:1 MUXs, the number of clock cycles needed to complete the multiplication, and the critical path delay. T_N, T_O, T_M and T_NAND represent the delay of an inverter, a 2-input OR gate, a 2:1 MUX and a 2-input NAND gate, respectively. In [14], least-significant-bit (LSB) first and most-significant-bit (MSB) first bit-serial multipliers for general irreducible polynomials were given. Versatile GF(2^m) bit-serial multipliers (which can also perform multiplications in all underlying GF(2^n) fields with 1 ≤ n ≤ m) for general irreducible polynomials were given in [15] and [16]. In [17], [20] and [22], bit-serial multipliers for irreducible trinomials were proposed. A bit-serial multiplier for general irreducible polynomials was also given in [21]. In [18] and [19], bit-serial multipliers for all-one polynomials (AOPs) were presented. The bit-serial multiplier for irreducible trinomials proposed here is denoted as Prop. in Table 5.
In order to highlight the differences between the various multiplier designs, specific results for GF(2^233) generated by the NIST irreducible trinomial f(y) = y^233 + y^74 + 1 are given in Table 6 (m = 233, n = 74). From the results, it can be observed that the proposed multiplier presents, together with the LSB/MSB bit-serial multipliers in [14], the lowest propagation delay. Regarding the number of clock cycles needed to perform the product, our multiplier needs m clock cycles. The multipliers in [16] and [21] require fewer clock cycles, although both present higher propagation delays. With respect to area complexity, our multiplier matches the lowest number of AND gates, given in [14], although it needs 2m − 1 XOR gates versus the m XOR gates given in [18], [19]. The multiplier proposed here also presents the lowest number of 1-bit
TABLE 5
Complexities of bit-serial PB multipliers.
Multiplier  #AND    #XOR        #OR  #FF         #Mux   #Clk      Delay
LSB [14]    m       m + 1       0    2m          m      m         T_A + T_X
MSB [14]    m       m + 1       0    2m          m      m         T_A + T_X
[15]        3m      2m          0    m^2         m      3m        T_X + ⌈log2(m)⌉(T_A + T_N + T_O)
[16]        3m      2m          0    4m + 2      2m     (0.664)m  2T_X + T_M
[17]        2m − 1  2m + n − 2  0    3m + n − 1  0      m         T_A + (2 + ⌈log2(m)⌉)T_X
[18]        m + 1   m           0    2m + 2      m + 2  2m        T_A + mT_X
[18]        2m − 1  2m − 2      0    2m + 2      m + 2  2m − 1    T_A + (1 + ⌈log2(m − 1)⌉)T_X
[19]        m + 1   m           0    2m + 2      2      2m − 1    T_A + mT_X
[20]        …       …           …    …           …      …         …
[21]        …       …           0    5m − 1      4m     2n + 1    T_A + ⌈log2(m)⌉T_X + 2T_M
[22]        2m − 1  2m + 1      0    3m − 2      0      m         max{3T_X, T_A + 2T_X}‡
Prop.       m       2m − 1      0    2m − 1      0      m         T_A + T_X
† NAND. ‡ Delay T_A + 2T_X for n ≠ m − 1 and max{3T_X, T_A + 2T_X} for n = m − 1.
Fig. 6. LFSR-based GF(2^m) multiplier architecture for trinomials.
registers (2m − 1 versus 2m given in [14]) and it does not use multiplexers. In Table 6, the area (transistor count) and time estimates for the NIST GF(2^233) multipliers are also compared. In order to compare time complexities fairly, typical propagation delays t_PD of the following STMicroelectronics circuits have been used: M74HC08 (AND gate, t_PD = 6 ns), M74HC86 (XOR, t_PD = 12 ns), M74HC32 (OR, t_PD = 8 ns), M74HC257 (MUX, t_PD = 11 ns), M74HC04 (INV, t_PD = 8 ns) and M74HC00 (NAND, t_PD = 6 ns). In order to estimate the number of CMOS transistors used, the traditional counts (6 transistors for an AND gate, 6 for XOR, 6 for OR, 8 for an SR-latch, 6 for a 2:1 MUX and 4 for NAND) have been used. In Table 6, the Time needed to perform the multiplication (given in nanoseconds) is computed as the product of the critical path Delay and the number of clock cycles used to complete the multiplication, and the Area×Time column is given in transistors × milliseconds. It must be noted that some of the results in Table 6 correspond to multipliers for general irreducible polynomials and AOPs. Although a comparison between different multiplier architectures cannot be entirely fair, it can be observed that the proposed multiplier slightly reduces the best area complexity given in [14] and matches the best time complexities (critical path delay and time needed to perform the multiplication) given in [14] and [20]. With respect to the area×time metric, our multiplier slightly reduces the best values obtained in [14].
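The estimates in Table 6 follow directly from the gate delays and transistor counts quoted above. As a minimal sketch of that arithmetic (the dictionary and function names are hypothetical, and each 1-bit register is assumed to be counted at the 8-transistor latch figure), the Prop. and LSB [14] rows can be reproduced as follows:

```python
# Typical propagation delays (ns) and transistor counts quoted in the text.
T_PD_NS = {"AND": 6, "XOR": 12, "OR": 8, "MUX": 11, "INV": 8, "NAND": 6}
TRANSISTORS = {"AND": 6, "XOR": 6, "OR": 6, "FF": 8, "MUX": 6, "NAND": 4}


def estimate(gates: dict[str, int], delay_terms: dict[str, int], clocks: int):
    """Return (transistors, delay in ns, time in ns, area x time in transistors x ms).

    gates:       number of each component, e.g. {"AND": 233, "XOR": 465, "FF": 465}
    delay_terms: gates on the critical path, e.g. {"AND": 1, "XOR": 1} for T_A + T_X
    clocks:      clock cycles needed for one multiplication
    """
    transistors = sum(TRANSISTORS[g] * count for g, count in gates.items())
    delay_ns = sum(T_PD_NS[g] * count for g, count in delay_terms.items())
    time_ns = delay_ns * clocks
    area_time = transistors * time_ns / 1e6  # 1 ms = 10^6 ns
    return transistors, delay_ns, time_ns, round(area_time, 2)


if __name__ == "__main__":
    critical_path = {"AND": 1, "XOR": 1}  # T_A + T_X
    print("Prop.   ", estimate({"AND": 233, "XOR": 465, "FF": 465}, critical_path, 233))
    print("LSB [14]", estimate({"AND": 233, "XOR": 234, "FF": 466, "MUX": 233},
                               critical_path, 233))
```

With these inputs the sketch yields 7908 transistors and 33.17 transistors × ms for the proposed multiplier, and 7928 transistors and 33.25 transistors × ms for the LSB multiplier of [14], matching the corresponding rows of Table 6.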
5.1 FPGA implementations
In order to further compare the work presented here with other bit-serial multipliers, FPGA implementations of the proposed multiplier and the best one given in [14] have been performed for the NIST field GF(2^233) generated by the irreducible trinomials (m = 233, n = 74) and (m = 233, n = 159). These multipliers have been described in VHDL, synthesized and implemented on a Xilinx Artix-7 XC7A200T-FFG1156 FPGA. Experimental results are those reported by Xilinx ISE 14.7 using the XST synthesizer with high-effort speed optimization. Post-place-and-route results are given in Table 7 for the multiplier proposed here (Prop.) and for the LSB bit-serial multiplier given in [14], for single multiplications with the NIST trinomials (233,74) and (233,159). In the table, Slices gives the number of slices used, Per. represents the minimum clock period (in nanoseconds), Time is the time needed (in nanoseconds) to perform the multiplication in 233 clock cycles (Time = Per. × 233), and A×T expresses the area-time product in Slices × ns (less is better). Furthermore, in order to check the ability of the proposed multiplier to perform parallel multiplications for different fields GF(2^m) generated by different irreducible trinomials simultaneously, experimental results are given in Table 7 (labeled as Parallel) for parallel multiplications using both NIST GF(2^233) trinomials. For the method given in [14], two multipliers for these trinomials were implemented, and for the proposed multiplier the architecture given in Figure 5 was used. In both cases, multiplexers were included in order to select the product for one of the trinomials depending on a control signal. From the experimental results, it can be observed that, for single multiplications over (233,74) and (233,159), the multiplier proposed here uses 6.6% and 13.5% fewer slices and needs 3.8% and 1.6% less time to perform a multiplication than the multiplier given in [14], respectively. For the area×time metric, the proposed multiplier obtains a reduction of 10.2% for (233,74) and 14.9% for (233,159) with respect to [14]. Experimental results for parallel multiplications show that the proposed multiplier uses 17.7% fewer slices, needs 2.9% more time to perform a multiplication and presents a reduction of 15.4% in the area×time metric with respect to [14].
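The percentages quoted above can be rederived from the Slices and Per. columns of Table 7. The short sketch below does so; the names TABLE7, metrics, reduction_percent and compare are hypothetical, and the only inputs are the slice counts and clock periods reported in the table.

```python
# (slices, minimum clock period in ns) taken from Table 7.
TABLE7 = {
    "(233,74)": {"[14]": (120, 1.84), "Prop.": (112, 1.77)},
    "(233,159)": {"[14]": (118, 1.87), "Prop.": (102, 1.84)},
    "Parallel": {"[14]": (338, 1.75), "Prop.": (278, 1.80)},
}
CLOCKS = 233  # clock cycles per multiplication


def metrics(slices: int, period_ns: float) -> tuple[float, float, float]:
    """Return (slices, Time = Per. x 233, A x T = slices x Time)."""
    time_ns = period_ns * CLOCKS
    return float(slices), time_ns, slices * time_ns


def reduction_percent(reference: float, proposed: float) -> float:
    """Relative reduction of proposed w.r.t. reference; negative means an increase."""
    return 100.0 * (reference - proposed) / reference


def compare(case: str) -> tuple[float, ...]:
    """Reductions (slices, time, area x time) of Prop. with respect to [14]."""
    ref = metrics(*TABLE7[case]["[14]"])
    prop = metrics(*TABLE7[case]["Prop."])
    return tuple(reduction_percent(r, p) for r, p in zip(ref, prop))


if __name__ == "__main__":
    for case in TABLE7:
        slices, time, area_time = compare(case)
        print(f"{case:10s} slices {slices:6.2f}%  time {time:6.2f}%  AxT {area_time:6.2f}%")
```

The slice reductions evaluate to about 6.67%, 13.56% and 17.75%, so the 6.6%, 13.5% and 17.7% figures in the text appear to be truncated rather than rounded; the negative time reduction in the Parallel case corresponds to the 2.9% increase reported.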
TABLE 6
Complexities and Area-Time estimates for NIST GF(2^233) defined by trinomial f(y) = y^233 + y^74 + 1.
Delay and Time estimates are in ns; Area×Time is in transistors × ms.
            Complexities                                                    Estimates
Multiplier  #AND   #XOR   #OR  #FF     #Mux  #Clk  Delay                     Transistors  Delay  Time     Area×Time
LSB [14]    233    234    0    466     233   233   T_A + T_X                 7928         18     4194     33.25
MSB [14]    233    234    0    466     233   233   T_A + T_X                 7928         18     4194     33.25
[15]        699    466    0    54289   233   699   T_X + 8(T_A + T_N + T_O)  442700       188    131412   58176.09
[16]        699    466    0    934     466   154   2T_X + T_M                17258        35     5390     93.02
[17]        465    538    0    772     0     233   T_A + 10T_X               12194        126    29358    357.99
[18]        234    233    0    468     235   466   T_A + 233T_X              7956         2802   1305732  10388.40
[18]        465    464    0    468     235   465   T_A + 9T_X                10728        114    53010    568.69
[19]        234    233    0    468     2     465   T_A + 233T_X              6558         2802   1302930  8544.61
[20]        54289  54288  0    108112  0     233   T_NAND + T_X              1407780      18     4194     5904.23
[21]        27261  27261  0    1164    932   149   T_A + 8T_X + 2T_M         342036       124    18476    6319.46
[22]        465    467    0    697     0     233   T_A + 2T_X                11168        30     6990     78.06
Prop.       233    465    0    465     0     233   T_A + T_X                 7908         18     4194     33.17
TABLE 7
Comparison of FPGA implementations.
Multiplier  Slices  Per. (ns)  Time (ns)  A×T        NIST trinomial
[14]        120     1.84       428.72     51446.40   (233,74)
Prop.       112     1.77       412.41     46189.92   (233,74)
[14]        118     1.87       435.71     51413.78   (233,159)
Prop.       102     1.84       428.72     43729.44   (233,159)
[14]        338     1.75       407.75     137819.50  Parallel
Prop.       278     1.80       419.40     116593.20  Parallel
6 CONCLUSION
In this paper, a new bit-serial GF(2^m) polynomial basis multiplier using irreducible trinomials is presented. The proposed architecture is based on a Linear-Feedback Shift Register and can perform a multiplication in m clock cycles with a constant propagation delay of T_A + T_X. These values match the best time results found in the literature for bit-serial PB multipliers, with a slight reduction of the space complexity. Furthermore, an important characteristic of the proposed architecture is that it can perform the multiplication of two operands for t different finite fields GF(2^m) generated by t irreducible trinomials simultaneously in m clock cycles with the inclusion of t(m − 1) flipflops and tm XOR gates. New general expressions for multiplication over irreducible trinomials f(y) = y^m + y^n + 1, with 1 ≤ n ≤ m − 1, have also been proposed in this work.
ACKNOWLEDGMENTS
This work has been supported by the Spanish MINECO and
CM under grants S2018/TCS-4423, TIN 2015-65277-R and
RTI2018-093684-B-I00.
REFERENCES
[1] A. Reyhani-Masoleh and M.A. Hasan, “Low Complexity Bit Parallel Architectures for Polynomial Basis Multiplication over GF(2^m)”, IEEE Trans. Comput., vol. 53, no. 8, pp. 945-959, Aug. 2004.
[2] D. Hankerson, A. Menezes and S. Vanstone, “Guide to Elliptic Curve Cryptography”, Springer, New York, 2004.
[3] H. El-Razouk and A. Reyhani-Masoleh, “New Bit-Level Serial GF(2^m) Multiplication Using Polynomial Basis”, 22nd IEEE Symposium on Computer Arithmetic, ARITH 22, pp. 129-136, Lyon, France, June 22-24, 2015.
[4] L. Song and K.K. Parhi, “Low-Energy Digit-Serial/Parallel Finite Field Multipliers”, J. VLSI Signal Process., vol. 19, pp. 149-166, 1998.
[5] F. Rodríguez-Henríquez, N. Saquib, A. Díaz-Pérez and Ç.K. Koç, “Cryptographic Algorithms on Reconfigurable Hardware”, Springer, New York, 2006.
[6] T. Zhang and K.K. Parhi, “Systematic Design of Original and Modified Mastrovito Multipliers for General Irreducible Polynomials”, IEEE Trans. Comput., vol. 50, no. 7, pp. 734-749, July 2001.
[7] J.L. Imaña, “Bit-Parallel Finite Field Multipliers for Irreducible Trinomials”, IEEE Trans. Comput., vol. 55, no. 5, pp. 520-533, May 2006.
[8] H. Fan and Y. Dai, “Fast Bit Parallel GF(2^m) Multiplier for All Trinomials”, IEEE Trans. Comput., vol. 54, no. 4, pp. 485-490, Apr. 2005.
[9] F. Rodríguez-Henríquez and Ç.K. Koç, “Parallel Multipliers Based on Special Irreducible Pentanomials”, IEEE Trans. Comput., vol. 52, no. 12, pp. 1535-1542, Dec. 2003.
[10] J.L. Imaña, “Fast Bit-Parallel Binary Multipliers Based on Type-I Pentanomials”, IEEE Trans. Comput., vol. 67, no. 6, pp. 898-904, June 2018.
[11] B. Liu and H. Wu, “Efficient Architecture and Implementation for NTRUEncrypt System”, Midwest Symposium on Circuits and Systems, MWSCAS, pp. 1-4, Aug. 2-5, 2015.
[12] S. Gashkov and A. Frolov, “Comparative Analysis of Calculations in Cryptographic Protocols Using a Combination of Different Bases of Finite Fields”, 12th Intl. Conf. Dependability and Complex Systems, pp. 166-177, July 2-6, 2017.
[13] A. Seghier, J. Li and D.Z. Sun, “Advanced encryption standard based on key dependent S-Box cube”, IET Information Security, vol. 13, no. 6, pp. 552-558, Nov. 2019.
[14] T. Beth and D. Gollman, “Algorithm Engineering for public Key Algorithms”, IEEE J. Sel. Areas Commun., vol. 7, no. 4, pp. 458-466, 1989.
[15] M.A. Hasan and M. Ebtedaei, “Efficient architectures for computations over variable dimensional Galois field”, IEEE Trans. Circuits Syst. I. Fundam. Theory Appl., vol. 45, no. 11, pp. 1205-1211, Nov. 1998.
[16] G.N. Selimis, A.P. Fournaris, H. Michail and O. Koufopavlou, “Improved throughput bit-serial multiplier for GF(2^m) fields”, Integration, vol. 42, pp. 217-226, 2009.
[17] A. Reyhani-Masoleh, “A New Bit-Serial Architecture for field Multiplication Using Polynomial Bases”, Cryptographic Hardware and Embedded Systems, CHES 2008, LNCS 5154, pp. 300-314, 2008.
[18] S.T.J. Fenn, M.G. Parker, M. Benaissa and D. Taylor, “Bit-serial multiplication in GF(2^m) using irreducible all-one polynomials”, IEE Proc. Comput. Digit. Tech., vol. 144, no. 6, pp. 391-393, 1997.
[19] H.-S. Kim and K.-Y. Yoo, “AOP arithmetic architectures over GF(2^m)”, Appl. Mathem. and Computation, vol. 158, pp. 7-18, 2004.
[20] P.K. Meher, “Systolic and Super-Systolic Multipliers for Finite Field GF(2^m) Based on Irreducible Trinomials”, IEEE Trans. Circuits Syst. I Reg. Papers, vol. 55, no. 4, pp. 1031-1040, May 2008.
[21] J.L. Imaña, “Low Latency GF(2^m) Polynomial Basis Multiplier”, IEEE Trans. Circuits Syst. I Reg. Papers, vol. 58, no. 5, pp. 935-946, May 2011.
[22] H. El-Razouk and A. Reyhani-Masoleh, “New Bit-Level Serial GF(2^m) Multiplication Using Polynomial Basis”, IEEE 22nd Symposium on Computer Arithmetic, pp. 129-136, 2015.
