Fast Bit Parallel Shifted Polynomial Basis Multipliers in $GF(2^n)$
Haining Fan and M. Anwar Hasan, Senior Member, IEEE
Abstract
A new bit parallel shifted polynomial basis multiplier for $GF(2^n)$ is presented. For some irreducible trinomials, the space complexity of the multiplier matches the best results available in the literature, and its gate delay is equal to $T_A + \lceil \log_2 n \rceil T_X$, where $T_A$ and $T_X$ are the delays of one 2-input AND gate and one 2-input XOR gate, respectively. To the best of our knowledge, this is the first time that the gate delay bound $T_A + \lceil \log_2 n \rceil T_X$ is reached. For some irreducible pentanomials, its gate delay is equal to $T_A + (1 + \lceil \log_2 n \rceil) T_X$. NIST has recommended five binary fields for ECDSA (Elliptic Curve Digital Signature Algorithm) applications: $GF(2^{163})$, $GF(2^{233})$, $GF(2^{283})$, $GF(2^{409})$ and $GF(2^{571})$, but no irreducible trinomials exist for three of these degrees, viz., 163, 283 and 571. For the three corresponding binary fields, we show that the gate delay of the proposed multiplier is $T_A + (1 + \lceil \log_2 n \rceil) T_X$. This result outperforms the previously known results.
Index Terms
Finite ﬁeld, multiplication, polynomial basis, shifted polynomial basis, irreducible polynomial.
I. INTRODUCTION
Efficient VLSI implementation of high speed multipliers over $GF(2^n)$ is important for some cryptosystems. To this end, several bit parallel polynomial basis (PB) multipliers using low Hamming weight irreducible polynomials, such as trinomials and pentanomials, have been proposed. In [1], it has been shown that irreducible trinomials exist for 5148 values of $n$ in the range $1 < n < 10001$. For each of the other 4851 values of $n$ in the same range, where no irreducible trinomial of degree $n$ exists, an irreducible pentanomial has been listed. In fact, there is no known value of $n$ for which an irreducible polynomial of weight $w < 6$ does not exist [1]. Therefore, we need only consider irreducible trinomials and pentanomials for practical purposes.
In hardware, since a two-input XOR (respectively AND) gate can be used to realize an addition (respectively multiplication) operation over the ground field $GF(2)$, the space complexity of bit parallel multipliers in the extension field $GF(2^n)$ is often represented in terms of the total number of XOR and AND gates used. The corresponding time complexity is given in terms of the maximum delay faced by a signal due to these XOR and AND gates. If $T_A$ and $T_X$ correspond to the delays of one 2-input AND gate and one 2-input XOR gate, respectively, then the total delay due to gates can be expressed as $c_A T_A + c_X T_X$, where $c_A$ and $c_X$ are two positive integers. For a given VLSI technology, the values of $T_A$ and $T_X$ cannot be easily changed; hence, in order to reduce the total gate delay, the AND and XOR gates used in the multiplier are carefully organized. For example, XOR gates are often connected in the form of a binary tree. For most of the recently proposed multipliers, $c_A = 1$ and $c_X$ depends on $n$.
Affiliation of authors: Haining Fan and M. Anwar Hasan are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada. E-mails: hfan@vlsi.uwaterloo.ca and ahasan@ece.uwaterloo.ca
For the special irreducible trinomials $f(u) = u^n + u^{n/2} + 1$ and $f(u) = u^n + u + 1$, the gate delay of the multipliers in [2], [3] and [4] is $T_A + (1 + \lceil \log_2 n \rceil) T_X$. When $n = 2^j + 1$ for some $j > 0$, the gate delay of the multipliers in [5] and [6] is also $T_A + (1 + \lceil \log_2 n \rceil) T_X$. But for the general irreducible trinomial $f(u) = u^n + u^k + 1$ ($n - 1$ is not a power of 2, $k \neq 1$ and $2k \neq n$), the gate delay of the multipliers in [2]-[6] is $T_A + (2 + \lceil \log_2 n \rceil) T_X$. For the irreducible trinomial $f(u) = u^n + u^k + 1$, the gate delay of the PB multiplier of [11] is $T_A + (1 + \lceil \log_2 (n - 1 + \lceil k/2 \rceil) \rceil) T_X$, but it requires more XOR gates for $k > 2$. The gate delays of these irreducible trinomial-based PB multipliers are at least $T_A + (1 + \lceil \log_2 n \rceil) T_X$.
The time complexity of a pentanomial-based multiplier depends on the form of the field generating irreducible pentanomial. Multipliers for special types of irreducible pentanomials have been proposed in [3], [4], [5] and [13]. The gate delays of these multipliers are at least $T_A + (3 + \lceil \log_2 (n - 1) \rceil) T_X$. In [14], a redundant representation is used to design the bit parallel multiplier. Although the space complexity of this multiplier is slightly greater than that of other architectures, its gate delay is $T_A + \lceil 2 + \log_2 (n + 1) \rceil T_X$.
In this report, we present a straightforward architecture of a bit parallel multiplier using the shifted polynomial basis (SPB). For all irreducible trinomials, the space complexity of the new multiplier matches the best results. The main contribution of the new multiplier is that its gate delay is equal to $T_A + \lceil \log_2 n \rceil T_X$ for certain irreducible trinomials. To the best of our knowledge, this is the first time that the gate delay bound $T_A + \lceil \log_2 n \rceil T_X$ is achieved. For a special type of irreducible pentanomials, namely, $f(u) = u^n + u^{v+1} + u^v + u^{v-1} + 1$ ($3 < v < (n-3)/2$), we present closed-form formulae for the multiplier. Please note that this special type of pentanomials has been studied in [13], where the gate delay of the multiplier is $T_A + (3 + \lceil \log_2 n \rceil) T_X$. NIST has recommended five binary fields for ECDSA (Elliptic Curve Digital Signature Algorithm) applications: $GF(2^{163})$, $GF(2^{233})$, $GF(2^{283})$, $GF(2^{409})$ and $GF(2^{571})$, but no irreducible trinomials exist for three of these degrees, viz., 163, 283 and 571. For the three corresponding binary fields, we show that the gate delay of the proposed multiplier is $T_A + (1 + \lceil \log_2 n \rceil) T_X$. This result outperforms the previously known results.
The remainder of this report is organized as follows. In Section II, we introduce the notation used in the report. The new multipliers for irreducible trinomials and pentanomials are presented in Sections III and IV, respectively. Finally, concluding remarks are made in Section V.
II. NOTATIONS AND PRELIMINARIES
Let $f(u)$ be an irreducible polynomial over $GF(2)$ and $GF(2^n) = GF(2)[u]/(f(u))$. An SPB of $GF(2^n)$ over $GF(2)$ is defined as follows [12]:
Definition 1: Let $v$ be an integer and the ordered set $M = \{x^i \mid 0 \le i \le m-1\}$ be a polynomial basis of $GF(2^m)$ over $GF(2)$. The ordered set $x^{-v}M := \{x^{i-v} \mid 0 \le i \le m-1\}$ is called a shifted polynomial basis (SPB) with respect to $M$.
In Sections II and III, we assume that $f(u) = u^n + u^k + 1$ ($n > 2$) is an irreducible trinomial over $GF(2)$ and $x$ is a root of $f$. In [12], it is shown that the best values of $v$ are $k$ and $k-1$ when the SPB is used to design parallel multipliers. Given two field elements $A$ and $B$, let $A = (a_0, a_1, \ldots, a_{n-1})^T = x^{-v}\sum_{i=0}^{n-1} a_i x^i$ and $B = (b_0, b_1, \ldots, b_{n-1})^T = x^{-v}\sum_{i=0}^{n-1} b_i x^i$ be their SPB representations. In [15], the structures to compute the product of $A$ and $B$ using SPB have been grouped into two types. The type-I multiplier computes the product $C = (c_0, c_1, \ldots, c_{n-1})^T = x^{-v}\sum_{i=0}^{n-1} c_i x^i$ of $A$ and $B$ using the following two steps.
(i) Perform the conventional polynomial multiplication:
\[
S = AB = x^{-2v}\sum_{t=0}^{2n-2} s_t x^t = \sum_{t=-2v}^{2n-2-2v} s_{t+2v}\, x^t = r_- + r + r_+,
\]
where
\[
r = \sum_{t=-v}^{n-1-v} s_{t+2v}\, x^t, \qquad
r_- = \sum_{t=-2v}^{-1-v} s_{t+2v}\, x^t, \qquad
r_+ = \sum_{t=n-v}^{2(n-1-v)} s_{t+2v}\, x^t
\]
and
\[
s_t = \sum_{\substack{i+j=t \\ 0 \le i,j \le n-1}} a_i b_j =
\begin{cases}
\sum_{i=0}^{t} a_i b_{t-i}, & 0 \le t \le n-1,\\
\sum_{i=t+1-n}^{n-1} a_i b_{t-i}, & n \le t \le 2n-2.
\end{cases} \tag{1}
\]
(ii) Reduce $r_+$ and $r_-$ using the following two reduction formulae, respectively:
\[
x^i = x^{k+i-n} + x^{i-n}, \quad \text{where } n-v \le i \le 2n-2-2v,
\]
\[
x^i = x^{n+i} + x^{k+i}, \quad \text{where } -2v \le i \le -(v+1).
\]
The reduced results $\tilde r_+$ and $\tilde r_-$ are defined as
\[
\tilde r_- = \sum_{t=n-2v}^{n-1-v} s_{t+2v-n}\, x^t + \sum_{t=k-2v}^{k-1-v} s_{t+2v-k}\, x^t
\]
and
\[
\tilde r_+ = \sum_{t=k-v}^{k+n-2-2v} s_{t+2v+n-k}\, x^t + \sum_{t=-v}^{n-2-2v} s_{t+n+2v}\, x^t.
\]
The product of $A$ and $B$ is $C = \sum_{i=0}^{n-1} c_i x^{i-v} = x^{-v}\sum_{i=0}^{n-1} c_i x^i = r + \tilde r_- + \tilde r_+$.
The type-II SPB multiplier combines the two steps into a single matrix-vector product, i.e., $C = (c_0, c_1, \ldots, c_{n-1})^T = Z(a_0, a_1, \ldots, a_{n-1})^T$, where the $n \times n$ matrix $Z = (z_{i,j})_{0 \le i,j \le n-1}$ is called the Mastrovito matrix and depends on both $B$ and $f$. In order to compute $C = Z(a_0, a_1, \ldots, a_{n-1})^T$, $Z$ is computed first; then $c_t$ ($0 \le t \le n-1$) is computed in a vector inner-product module whose output is $c_t = \sum_{i=0}^{n-1} a_i z_{t,i}$.
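The two-step type-I procedure can be checked against a direct software model. The following is a minimal sketch, not a description of any hardware in [12] or [15]: field elements are stored as Python integers whose bit $i$ holds coordinate $i$, and the helper names (`clmul`, `poly_mod`, `spb_mul`, `type1_spb_mul`) are illustrative assumptions. The reference model uses the identity $\sum_i c_i x^i = x^{-v} a(x) b(x) \bmod f$, which follows from $A = x^{-v}a(x)$, $B = x^{-v}b(x)$ and $C = x^{-v}c(x)$.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2) polynomials stored as bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(p: int, f: int) -> int:
    """Remainder of p modulo f over GF(2)."""
    deg_f = f.bit_length() - 1
    while p.bit_length() - 1 >= deg_f:
        p ^= f << (p.bit_length() - 1 - deg_f)
    return p


def spb_mul(a: int, b: int, f: int, v: int) -> int:
    """Reference model: SPB coordinates of A*B, i.e. x^{-v} a(x) b(x) mod f."""
    c = poly_mod(clmul(a, b), f)
    for _ in range(v):      # multiply by x^{-1}, v times (f has constant term 1)
        if c & 1:
            c ^= f          # adding f clears the constant term
        c >>= 1
    return c


def type1_spb_mul(a: int, b: int, n: int, k: int, v: int) -> int:
    """Steps (i) and (ii) for f(u) = u^n + u^k + 1, assuming v = k or v = k - 1."""
    if v not in (k - 1, k):
        raise ValueError("the one-pass reduction formulae assume v = k or k - 1")
    conv = clmul(a, b)      # step (i): s_t is bit t of conv

    def s(t: int) -> int:
        return (conv >> t) & 1

    coeff = {t: s(t + 2 * v) for t in range(-v, n - v)}      # r
    for t in range(n - 2 * v, n - v):                        # reduced r_-
        coeff[t] ^= s(t + 2 * v - n)
    for t in range(k - 2 * v, k - v):
        coeff[t] ^= s(t + 2 * v - k)
    for t in range(k - v, k + n - 1 - 2 * v):                # reduced r_+
        coeff[t] ^= s(t + 2 * v + n - k)
    for t in range(-v, n - 1 - 2 * v):
        coeff[t] ^= s(t + n + 2 * v)
    return sum(coeff[t] << (t + v) for t in range(-v, n - v))
```

For small fields the two functions can be compared exhaustively, for example over all pairs of elements of $GF(2^5)$ with $f(u) = u^5 + u^2 + 1$ and $v \in \{1, 2\}$.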
III. SPB MULTIPLIER FOR IRREDUCIBLE TRINOMIALS
A. Architectures
We now present a straightforward design of the SPB parallel multiplier. Instead of computing $c_t$ ($0 \le t \le n-1$) by a vector inner-product module, the new multiplier calculates $c_t$ using a binary XOR tree of the smallest height. In [12], the detailed design of the type-II SPB multiplier is presented for the case $v = k$, $n+1 \le 2v$ and $v \le n-2$. We now use this special case to illustrate the new multiplier. The product $C = AB$ is given as follows [12]:
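\[
\begin{aligned}
C = \sum_{t=-v}^{n-v-1} c_{v+t}\, x^t
={}& \sum_{t=-v}^{n-2v-1}\left(\sum_{i=0}^{2v+t} a_i b_{2v+t-i}\right)x^t
+ \sum_{t=n-2v}^{n-v-1}\left(\sum_{i=2v-n+t+1}^{n-1} a_i b_{2v+t-i}\right)x^t \\
&+ \sum_{t=n-2v}^{n-v-1}\left(\sum_{i=0}^{2v-n+t} a_i b_{2v-n+t-i}\right)x^t
+ \sum_{t=-v}^{-1}\left(\sum_{i=0}^{v+t} a_i b_{v+t-i}\right)x^t \\
&+ \sum_{t=0}^{n-v-2}\left(\sum_{i=v+t+1}^{n-1} a_i b_{v+n+t-i}\right)x^t
+ \sum_{t=-v}^{n-2v-2}\left(\sum_{i=2v+t+1}^{n-1} a_i b_{2v+n+t-i}\right)x^t.
\end{aligned}
\]
Comparing the coefficients of $x^t$ in this formula, we may obtain explicit expressions of the coordinate $c_{v+t}$ for $-v \le t \le n-1-v$. For example, for the case $t = n-2v-1$, we have
\[
c_{n-v-1} = \sum_{i=0}^{n-v-1} a_i\,(b_{n-v-1-i} + b_{n-1-i}) + \sum_{i=n-v}^{n-1} a_i b_{n-1-i}. \tag{2}
\]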
In order to compute the coordinate $c_{n-v-1}$, the new multiplier defines terms $y_i$ for $0 \le i \le n-v-1$ as $y_i = a_i(b_{n-v-1-i} + b_{n-1-i})$. Obviously, computing each $y_i$ requires a gate delay of $T_A + T_X$. Since computing $a_i b_{n-1-i}$ ($n-v \le i \le n-1$) in (2) requires a delay of $T_A$ only, we may define terms $y_i$ for $n-v \le i \le n-v+\lceil v/2 \rceil - 1$ as the pairwise summation of the expression $\sum_{i=n-v}^{n-1} a_i b_{n-1-i}$ in (2), i.e.,
\[
y_i =
\begin{cases}
a_{2i-(n-v)}\, b_{2n-v-1-2i} + a_{1+2i-(n-v)}\, b_{2n-v-2-2i}, & \text{if } v \text{ is even},\\[4pt]
\begin{cases}
a_{2i-(n-v)}\, b_{2n-v-1-2i} + a_{1+2i-(n-v)}\, b_{2n-v-2-2i}, & n-v \le i \le n-(v+3)/2,\\
a_{n-1} b_0, & i = n-(v+1)/2,
\end{cases} & \text{if } v \text{ is odd}.
\end{cases}
\]
Due to the parallelism, it is easy to see that computing all $y_i$ ($0 \le i \le n-v+\lceil v/2 \rceil - 1$) requires a gate delay of $T_A + T_X$. The coordinate $c_{n-v-1}$ is then obtained by adding all $y_i$ using a binary XOR tree, which requires a gate delay of $\lceil \log_2 (n-v+\lceil v/2 \rceil) \rceil T_X$. Therefore, the total delay due to gates for computing $c_{n-v-1}$ is $T_A + (1 + \lceil \log_2 \lceil (2n-v)/2 \rceil \rceil) T_X = T_A + \lceil \log_2 (2n-v) \rceil T_X$. Please note that the equality $1 + \lceil \log_2 \lceil i/2 \rceil \rceil = \lceil \log_2 i \rceil$ holds for any integer $i > 1$. The other coordinates of $C$ are computed similarly.
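The ceiling identity used in the last step, and the count of terms $y_i$ that feed the XOR tree, can be checked numerically. The sketch below is only an illustration; `ceil_log2` and `num_tree_inputs` are hypothetical helper names, and integer arithmetic is used so that no floating-point rounding is involved.

```python
def ceil_log2(i: int) -> int:
    """Smallest e with 2**e >= i, for an integer i >= 1."""
    return (i - 1).bit_length()


def ceil_half(i: int) -> int:
    """Integer ceiling of i / 2."""
    return (i + 1) // 2


def num_tree_inputs(n: int, v: int) -> int:
    """Number of terms y_i added by the XOR tree for c_{n-v-1}."""
    return (n - v) + ceil_half(v)


def delay_c_n_minus_v_minus_1(n: int, v: int) -> int:
    """XOR levels for c_{n-v-1}: one level inside each y_i plus the tree."""
    return 1 + ceil_log2(num_tree_inputs(n, v))
```

With these helpers, `delay_c_n_minus_v_minus_1(n, v)` agrees with `ceil_log2(2 * n - v)` for every admissible pair, which is the simplification stated in the text.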
Based on the coordinate expressions in [12], we may obtain the gate delay to compute each coordinate. Table I summarizes the total XOR gate delays required to compute each coordinate for $n+1 \le 2v$ and $v \le n-2$. The gate delay of the multiplier corresponds to the maximal one ($t = -1$), which is found to be $T_A + (1 + \lceil \log_2 \lceil (n+v)/2 \rceil \rceil) T_X = T_A + \lceil \log_2 (n+v) \rceil T_X$.
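TABLE I
DELAYS DUE TO XOR GATES TO COMPUTE $c_{v+t}$ FOR $n+1 \le 2v$ AND $v \le n-2$.
| $t$ | Formula of $c_{v+t}$ | Total XOR delays |
|---|---|---|
| $-v \le t \le n-2v-2$ | $\sum_{i=0}^{v+t} a_i(b_{v+t-i} + b_{2v+t-i}) + \sum_{i=v+t+1}^{2v+t} a_i b_{2v+t-i} + \sum_{i=2v+t+1}^{n-1} a_i b_{2v+n+t-i}$ | $\lceil \log_2 (n+v+t+1) \rceil T_X$ |
| $t = n-2v-1$ | $\sum_{i=0}^{n-v-1} a_i(b_{n-v-1-i} + b_{n-1-i}) + \sum_{i=n-v}^{n-1} a_i b_{n-1-i}$ | $\lceil \log_2 (2n-v) \rceil T_X$ |
| $n-2v \le t \le -1$ | $\sum_{i=0}^{2v-n+t} a_i(b_{2v-n+t-i} + b_{v+t-i}) + \sum_{i=2v-n+t+1}^{v+t} a_i(b_{v+t-i} + b_{2v+t-i}) + \sum_{i=v+t+1}^{n-1} a_i b_{2v+t-i}$ | $\lceil \log_2 (n+v+t+1) \rceil T_X$ |
| $0 \le t \le n-v-2$ | $\sum_{i=v+t+1}^{n-1} a_i(b_{v+n+t-i} + b_{2v+t-i}) + \sum_{i=2v-n+t+1}^{v+t} a_i b_{2v+t-i} + \sum_{i=0}^{2v-n+t} a_i b_{2v-n+t-i}$ | $\lceil \log_2 (2n-1-v-t) \rceil T_X$ |
| $t = n-v-1$ | $\sum_{i=v}^{n-1} a_i b_{n+v-1-i} + \sum_{i=0}^{v-1} a_i b_{v-1-i}$ | $\lceil \log_2 n \rceil T_X$ |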
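The statement that the maximum in Table I occurs at $t = -1$ can be confirmed by evaluating the last column for every $t$. The following sketch simply transcribes the five rows of the table; the function names are illustrative assumptions, and the delays are returned as multiples of $T_X$.

```python
def ceil_log2(i: int) -> int:
    """Smallest e with 2**e >= i, for an integer i >= 1."""
    return (i - 1).bit_length()


def table1_xor_delay(n: int, v: int, t: int) -> int:
    """XOR delay (in units of T_X) of coordinate c_{v+t} according to Table I."""
    if not (n + 1 <= 2 * v and v <= n - 2):
        raise ValueError("Table I assumes n + 1 <= 2v and v <= n - 2")
    if -v <= t <= n - 2 * v - 2:
        return ceil_log2(n + v + t + 1)
    if t == n - 2 * v - 1:
        return ceil_log2(2 * n - v)
    if n - 2 * v <= t <= -1:
        return ceil_log2(n + v + t + 1)
    if 0 <= t <= n - v - 2:
        return ceil_log2(2 * n - 1 - v - t)
    if t == n - v - 1:
        return ceil_log2(n)
    raise ValueError("t must satisfy -v <= t <= n - v - 1")


def multiplier_xor_delay(n: int, v: int) -> int:
    """Largest XOR delay over all n coordinates."""
    return max(table1_xor_delay(n, v, t) for t in range(-v, n - v))
```

For instance, `multiplier_xor_delay(5, 3)` evaluates to 3, which equals $\lceil \log_2 (n+v) \rceil = \lceil \log_2 8 \rceil$.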
Since we have only reorganized the computational sequence of the coordinate formulae of $C = (c_0, c_1, \ldots, c_{n-1})^T$, the AND and XOR gate complexities of the proposed multiplier are the same as those of the type-II SPB multiplier. For the irreducible trinomial $f(u) = u^n + u^k + 1$ with $v = k$ and $v = k-1$ [12], these complexities are:
\[
\text{AND gates} = n^2; \qquad
\text{XOR gates} =
\begin{cases}
n^2 - 1, & 2k \neq n,\\
n^2 - n/2, & 2k = n.
\end{cases}
\]
For simplicity, we do not present detailed computational procedures of gate delays for the other cases here. But we note that we have obtained these values for all irreducible trinomials $f(u) = u^n + u^k + 1$. We summarize the gate delays of the proposed multiplier in Table II.
TABLE II
GATE DELAYS FOR $f(u) = u^n + u^k + 1$-BASED SPB MULTIPLIERS FOR $v = k$ OR $k-1$
| $v$ | Gate delay |
|---|---|
| $n \le 2v$ | $T_A + \lceil \log_2 (n+v) \rceil T_X$ |
| $2v \le n-1$ | $T_A + \lceil \log_2 (2n-v-1) \rceil T_X$ |
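Table II translates directly into a small helper. The sketch below is an illustration only: the function names are assumptions, the delay is reported as the multiple of $T_X$ (the $T_A$ term is the same in both rows), and the choice between $v = k$ and $v = k-1$ is made by taking the smaller delay.

```python
def ceil_log2(i: int) -> int:
    """Smallest e with 2**e >= i, for an integer i >= 1."""
    return (i - 1).bit_length()


def trinomial_xor_delay(n: int, v: int) -> int:
    """XOR delay (in units of T_X) from Table II for a given shift v."""
    if n <= 2 * v:
        return ceil_log2(n + v)
    return ceil_log2(2 * n - v - 1)      # remaining case: 2v <= n - 1


def best_trinomial_xor_delay(n: int, k: int) -> int:
    """Smaller of the two delays obtained with v = k and v = k - 1."""
    return min(trinomial_xor_delay(n, v) for v in (k, k - 1))
```

As an example, `best_trinomial_xor_delay(9, 4)` gives 4, which equals $\lceil \log_2 9 \rceil$, so the pair $(n, k) = (9, 4)$ attains the bound $T_A + \lceil \log_2 n \rceil T_X$.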
For $2 < n < 1000$, we now list pairs $(n,k)$ such that $f(u) = u^n + u^k + 1$ is irreducible and the gate delay of the presented multiplier equals $T_A + \lceil \log_2 n \rceil T_X$.
(3,1) (5,2) (9,4) (10,3) (17,6) (18,9) (33,13) (34,7)
(36,15) (39,14) (41,20) (65,32) (66,3) (68,33) (71,35) (73,31)
(74,35) (81,35) (84,39) (129,46) (130,3) (132,29) (134,57) (135,29)
(137,57) (140,65) (145,69) (146,71) (147,49) (150,73) (151,70) (155,62)
(156,65) (162,81) (167,77) (169,84) (257,65) (258,83) (260,105) (263,93)
(265,127) (266,47) (268,61) (270,133) (271,70) (273,113) (274,135) (276,91)
(279,125) (281,99) (282,63) (284,141) (286,73) (287,125) (289,84) (292,97)
(294,81) (295,147) (297,137) (300,147) (305,102) (313,121) (316,135) (319,129)
(321,155) (324,149) (327,152) (513,242) (514,103) (516,91) (518,113) (519,79)
(521,168) (522,259) (524,195) (526,97) (527,239) (529,157) (532,81) (534,261)
(537,94) (538,195) (540,211) (543,235) (545,122) (550,193) (551,240) (553,258)
(556,273) (559,210) (561,155) (564,163) (566,273) (567,275) (569,210) (570,143)
(575,258) (577,231) (580,237) (582,261) (585,256) (588,253) (593,177) (594,195)
(596,273) (599,210) (601,202) (602,215) (607,273) (609,233) (612,297) (615,238)
(618,295) (622,297) (623,311) (626,251) (628,289) (631,307) (633,292) (634,315)
(636,315) (639,305) (641,287) (647,312) (649,321) (657,292) (663,307) (665,317)
For the range $2 < n < 1001$, there are 128 values of $n$ for which the gate delay of the proposed multiplier is equal to $T_A + \lceil \log_2 n \rceil T_X$. We note that for a fixed degree $n$, more than one irreducible trinomial of degree $n$ may exist such that the gate delay of the proposed multiplier equals $T_A + \lceil \log_2 n \rceil T_X$, e.g., the pairs $(9, 4)$ and $(9, 1)$.
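The beginning of the list can be reproduced for small degrees by combining an irreducibility test with the delay rule of Table II. The sketch below is an assumption-laden illustration rather than the search procedure used for the list: polynomials over $GF(2)$ are stored as Python integers, irreducibility is tested with the standard Ben-Or criterion ($f$ of degree $n$ is irreducible iff $\gcd(x^{2^i} - x, f) = 1$ for $1 \le i \le n/2$), and all function names are hypothetical.

```python
def ceil_log2(i: int) -> int:
    return (i - 1).bit_length()


def gf2_mod(p: int, f: int) -> int:
    """Remainder of p modulo f over GF(2)."""
    deg_f = f.bit_length() - 1
    while p.bit_length() - 1 >= deg_f:
        p ^= f << (p.bit_length() - 1 - deg_f)
    return p


def gf2_gcd(a: int, b: int) -> int:
    while b:
        a, b = b, gf2_mod(a, b)
    return a


def gf2_mulmod(a: int, b: int, f: int) -> int:
    """a * b mod f, assuming deg a < deg f."""
    n = f.bit_length() - 1
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:
            a ^= f
    return result


def is_irreducible(f: int) -> bool:
    """Ben-Or test for a polynomial of degree n >= 2 over GF(2)."""
    n = f.bit_length() - 1
    h = 2                                   # the polynomial x
    for _ in range(n // 2):
        h = gf2_mulmod(h, h, f)             # h = x^(2^i) mod f
        if gf2_gcd(f, h ^ 2) != 1:
            return False
    return True


def best_xor_delay(n: int, k: int) -> int:
    """Table II delay in units of T_X, minimized over v = k and v = k - 1."""
    def delay(v: int) -> int:
        return ceil_log2(n + v) if n <= 2 * v else ceil_log2(2 * n - v - 1)
    return min(delay(k), delay(k - 1))


def qualifying_ks(n: int) -> list:
    """All k for which u^n + u^k + 1 is irreducible and the delay is ceil(log2 n)."""
    return [k for k in range(1, n)
            if is_irreducible((1 << n) | (1 << k) | 1)
            and best_xor_delay(n, k) == ceil_log2(n)]
```

Running `qualifying_ks(n)` for $3 \le n \le 20$ returns a non-empty list only for the degrees 3, 5, 9, 10, 17 and 18, in agreement with the first six entries of the list; the search is slow for large $n$ because no fast polynomial arithmetic is used.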
B. Comparison
Reference [12] compares the gate delays of the type-I and type-II SPB multipliers with those of a number of other multipliers published earlier. Since the AND and XOR gate complexities of the proposed SPB parallel multiplier match the best results known to date, we now compare the gate delay of the proposed multiplier with those of some recently published multipliers. We note that the XOR gate complexity of the PB multiplier in [11] is $n^2 + (k^2 - 3k)/2$. As shown in Table III, the gate delay of the proposed SPB multiplier matches or outperforms the previously known best results for the same class of fields.
TABLE III
COMPARISONS OF GATE DELAYS FOR $f(u) = u^n + u^k + 1$-BASED MULTIPLIERS FOR $v = k$ OR $k-1$
| Multiplier | Gate delay |
|---|---|
| PB [11] | $T_A + \lceil \log_2 (2n+k-2) \rceil T_X$ |
| Type-I SPB [12] | $\le T_A + \lceil \log_2 4n \rceil T_X$ |
| Type-II SPB [12] | $T_A + \lceil \log_2 2n \rceil T_X$ |
| Presented here | $T_A + \lceil \log_2 (n+v) \rceil T_X$, for $n \le 2v$; $T_A + \lceil \log_2 (2n-v-1) \rceil T_X$, for $2v \le n-1$ |
C. An Example
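As an example, we now construct the proposed multiplier for $f(u) = u^3 + u^2 + 1$, i.e., $n = 3$ and $k = 2$. Let $v = k - 1 = 1$. The SPB is $\{x^{i-1} \mid 0 \le i \le 2\} = \{x^{-1}, 1, x\}$, and the reduction formulae of Section II give $x^{-2} = x + 1$ and $x^{2} = x + x^{-1}$. The SPB Mastrovito matrix $Z$ is given as
\[
Z = \begin{pmatrix}
b_1 & b_0 & b_2 \\
b_0 + b_2 & b_1 & b_0 \\
b_0 & b_2 & b_1 + b_2
\end{pmatrix}.
\]
The coordinate formulae of $C = (c_0, c_1, c_2)^T$ are
\[
\begin{aligned}
c_0 &= [b_1 a_0 + b_0 a_1] + b_2 a_2,\\
c_1 &= [(b_0 + b_2) a_0] + [b_1 a_1 + b_0 a_2], \text{ and}\\
c_2 &= [b_0 a_0 + b_2 a_1] + [(b_1 + b_2) a_2].
\end{aligned}
\]
Terms in square brackets are computed in parallel. So the gate delay of the proposed multiplier is $T_A + 2T_X$. The multiplier requires 9 AND gates and 8 XOR gates.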
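The three coordinate formulae can be verified exhaustively in software. The check below is a sketch under stated assumptions: it uses $f(u) = u^3 + u^2 + 1$ with $v = 1$, which is the parameter choice for which the rows $c_0$, $c_1$, $c_2$ above follow from the Section II reduction formulae ($x^{-2} = x + 1$, $x^2 = x + x^{-1}$); elements are stored as integers whose bit $i$ is coordinate $i$, and the helper names are illustrative.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2) polynomials stored as bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(p: int, f: int) -> int:
    deg_f = f.bit_length() - 1
    while p.bit_length() - 1 >= deg_f:
        p ^= f << (p.bit_length() - 1 - deg_f)
    return p


def spb_mul(a: int, b: int, f: int, v: int) -> int:
    """Reference model: coordinates of x^{-v} a(x) b(x) mod f."""
    c = poly_mod(clmul(a, b), f)
    for _ in range(v):
        if c & 1:
            c ^= f
        c >>= 1
    return c


def example_multiplier(a: int, b: int) -> int:
    """Gate-level formulae c0, c1, c2 of the example (AND = &, XOR = ^)."""
    a0, a1, a2 = a & 1, (a >> 1) & 1, (a >> 2) & 1
    b0, b1, b2 = b & 1, (b >> 1) & 1, (b >> 2) & 1
    c0 = ((b1 & a0) ^ (b0 & a1)) ^ (b2 & a2)
    c1 = ((b0 ^ b2) & a0) ^ ((b1 & a1) ^ (b0 & a2))
    c2 = ((b0 & a0) ^ (b2 & a1)) ^ ((b1 ^ b2) & a2)
    return c0 | (c1 << 1) | (c2 << 2)


F_EXAMPLE = 0b1101   # u^3 + u^2 + 1
```

Comparing `example_multiplier(a, b)` with `spb_mul(a, b, F_EXAMPLE, 1)` over all 64 input pairs confirms the formulae; each output bit uses at most one XOR inside a bracketed term and one XOR tree level outside it, which is consistent with the stated delay of $T_A + 2T_X$.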
IV. SPB MULTIPLIER FOR SPECIAL IRREDUCIBLE PENTANOMIALS
A. Architectures
Let $g(u) = u^n + u^e + u^h + u^k + 1$ be an irreducible pentanomial ($n > e > h > k > 0$) and $x$ be a root of $g(u) = 0$. We use the SPB $\{x^{i-v} \mid 0 \le i \le n-1\}$ ($0 < v < n$) to represent field elements, i.e., $A = x^{-v}\sum_{i=0}^{n-1} a_i x^i$. To obtain the formulae of the coordinates of $C = AB$ for this case, we first compute:
\[
S = AB = \sum_{t=-2v}^{2(n-v-1)} s_{t+2v}\, x^t = r_- + r + r_+, \tag{3}
\]
where $r = \sum_{t=-v}^{n-1-v} s_{t+2v}\, x^t$, $r_- = \sum_{t=-2v}^{-v-1} s_{t+2v}\, x^t$, $r_+ = \sum_{t=n-v}^{2(n-v-1)} s_{t+2v}\, x^t$ and $s_{t+2v}$ is defined in (1).
In (3), the terms $r_-$ and $r_+$ are reduced by the following reduction formulae:
\[
x^i = x^{e+i-n} + x^{h+i-n} + x^{k+i-n} + x^{i-n}, \quad \text{where } n-v \le i \le 2n-2v-2, \tag{4}
\]
and
\[
x^i = x^{n+i} + x^{e+i} + x^{h+i} + x^{k+i}, \quad \text{where } -2v \le i \le -(v+1). \tag{5}
\]
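The reduced results $\tilde r_+$ and $\tilde r_-$ are defined as follows:
\[
\begin{aligned}
\tilde r_+ :={}& \sum_{t=n-v}^{2n-2v-2} s_{t+2v}\, x^{t+e-n} + \sum_{t=n-v}^{2n-2v-2} s_{t+2v}\, x^{t+h-n} + \sum_{t=n-v}^{2n-2v-2} s_{t+2v}\, x^{t+k-n} + \sum_{t=n-v}^{2n-2v-2} s_{t+2v}\, x^{t-n} \\
={}& \sum_{t=e-v}^{e+n-2v-2} s_{t+2v+n-e}\, x^t + \sum_{t=h-v}^{h+n-2v-2} s_{t+2v+n-h}\, x^t + \sum_{t=k-v}^{k+n-2v-2} s_{t+2v+n-k}\, x^t + \sum_{t=-v}^{n-2v-2} s_{t+n+2v}\, x^t,
\end{aligned} \tag{6}
\]
and
\[
\begin{aligned}
\tilde r_- :={}& \sum_{t=-2v}^{-v-1} s_{t+2v}\, x^{n+t} + \sum_{t=-2v}^{-v-1} s_{t+2v}\, x^{e+t} + \sum_{t=-2v}^{-v-1} s_{t+2v}\, x^{h+t} + \sum_{t=-2v}^{-v-1} s_{t+2v}\, x^{k+t} \\
={}& \sum_{t=n-2v}^{n-v-1} s_{t+2v-n}\, x^t + \sum_{t=e-2v}^{e-v-1} s_{t+2v-e}\, x^t + \sum_{t=h-2v}^{h-v-1} s_{t+2v-h}\, x^t + \sum_{t=k-2v}^{k-v-1} s_{t+2v-k}\, x^t.
\end{aligned} \tag{7}
\]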
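In this section, we present the architecture of the new multiplier for the special type of irreducible pentanomials $g(u) = u^n + u^{v+1} + u^v + u^{v-1} + 1$, i.e., $e = v+1$, $h = v$ and $k = v-1$. With these values, the exponents produced by (6) and by the first three sums of (7) already lie in the range $[-v, n-1-v]$, while the last sum of (7) starts at $t = k-2v = -v-1$. Therefore, only the last term in (7) will be reduced again, i.e.,
\[
\begin{aligned}
\sum_{t=k-2v}^{k-v-1} s_{t+2v-k}\, x^t
&= \sum_{t=-v}^{k-1-v} s_{t+2v-k}\, x^t + \sum_{t=k-2v}^{-v-1} s_{t+2v-k}\, x^t \\
&= \sum_{t=-v}^{-2} s_{t+v+1}\, x^t + s_0 x^{n-v-1} + s_0 x^{0} + s_0 x^{-1} + s_0 x^{-2}.
\end{aligned}
\]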
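The product of $A$ and $B$ is then computed by the formula
\[
C = \sum_{i=0}^{n-1} c_i x^{i-v} = x^{-v}\sum_{i=0}^{n-1} c_i x^i = r + \tilde r_- + \tilde r_+. \tag{8}
\]
For this special type of irreducible pentanomials, the formula for $C = AB$ obtained by applying (1), (6), (7) to (8) may be simplified as follows:
\[
\begin{aligned}
C = \sum_{t=-v}^{n-1-v} c_{v+t}\, x^t
={}& \sum_{t=-v}^{n-2v-1}\left(\sum_{i=0}^{2v+t} a_i b_{2v+t-i}\right)x^t
+ \sum_{t=n-2v}^{n-v-1}\left(\sum_{i=2v-n+t+1}^{n-1} a_i b_{2v+t-i}\right)x^t \\
&+ \sum_{t=1}^{n-v-1}\left(\sum_{i=t+v}^{n-1} a_i b_{v+n-1+t-i}\right)x^t
+ \sum_{t=0}^{n-v-2}\left(\sum_{i=t+v+1}^{n-1} a_i b_{v+n+t-i}\right)x^t \\
&+ \sum_{t=-1}^{n-v-3}\left(\sum_{i=t+v+2}^{n-1} a_i b_{v+n+1+t-i}\right)x^t
+ \sum_{t=-v}^{n-2v-2}\left(\sum_{i=t+2v+1}^{n-1} a_i b_{n+2v+t-i}\right)x^t \\
&+ \sum_{t=n-2v}^{n-v-1}\left(\sum_{i=0}^{t+2v-n} a_i b_{2v-n+t-i}\right)x^t
+ \sum_{t=1-v}^{0}\left(\sum_{i=0}^{t+v-1} a_i b_{v-1+t-i}\right)x^t \\
&+ \sum_{t=-v}^{-1}\left(\sum_{i=0}^{t+v} a_i b_{v+t-i}\right)x^t
+ \sum_{t=-v}^{-2}\left(\sum_{i=0}^{t+v+1} a_i b_{v+1+t-i}\right)x^t \\
&+ a_0 b_0 x^{n-v-1} + a_0 b_0 + a_0 b_0 x^{-1} + a_0 b_0 x^{-2}.
\end{aligned}
\]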
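Because every inner sum in the simplified formula is one of the convolution coefficients $s_j$ of (1), the formula can be checked coordinate by coordinate against a direct computation of $x^{-v} a(x) b(x) \bmod g$. The sketch below is illustrative only; the function names are assumptions, elements are stored as integers whose bit $i$ is coordinate $i$, and the test parameters $(n, v) = (19, 6)$ are taken from the list of pairs given later in this section.

```python
import random


def clmul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(p: int, f: int) -> int:
    deg_f = f.bit_length() - 1
    while p.bit_length() - 1 >= deg_f:
        p ^= f << (p.bit_length() - 1 - deg_f)
    return p


def spb_mul(a: int, b: int, f: int, v: int) -> int:
    """Reference model: coordinates of x^{-v} a(x) b(x) mod f."""
    c = poly_mod(clmul(a, b), f)
    for _ in range(v):
        if c & 1:
            c ^= f
        c >>= 1
    return c


def pentanomial_coordinate(a: int, b: int, n: int, v: int, t: int) -> int:
    """c_{v+t} for g(u) = u^n + u^{v+1} + u^v + u^{v-1} + 1, read off the formula."""
    conv = clmul(a, b)

    def s(j: int) -> int:            # s_j of equation (1)
        return (conv >> j) & 1

    bit = s(t + 2 * v)               # first two double sums
    if 1 <= t <= n - v - 1:
        bit ^= s(t + n + v - 1)
    if 0 <= t <= n - v - 2:
        bit ^= s(t + n + v)
    if -1 <= t <= n - v - 3:
        bit ^= s(t + n + v + 1)
    if -v <= t <= n - 2 * v - 2:
        bit ^= s(t + n + 2 * v)
    if n - 2 * v <= t <= n - v - 1:
        bit ^= s(t + 2 * v - n)
    if 1 - v <= t <= 0:
        bit ^= s(t + v - 1)
    if -v <= t <= -1:
        bit ^= s(t + v)
    if -v <= t <= -2:
        bit ^= s(t + v + 1)
    if t in (n - v - 1, 0, -1, -2):  # the four isolated a_0 b_0 terms
        bit ^= s(0)
    return bit


def check_formula(n: int, v: int, trials: int = 200, seed: int = 1) -> bool:
    g = (1 << n) | (1 << (v + 1)) | (1 << v) | (1 << (v - 1)) | 1
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        c = spb_mul(a, b, g, v)
        for t in range(-v, n - v):
            if (c >> (v + t)) & 1 != pentanomial_coordinate(a, b, n, v, t):
                return False
    return True
```

Calling `check_formula(19, 6)` exercises all 19 coordinates on random operands. The comparison relies only on the reduction identities (4) and (5), so it does not by itself establish that $g$ is irreducible.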
Comparing the coefficients of $x^t$ in this formula, we can obtain explicit expressions of the coordinate $c_{v+t}$ for $-v \le t \le n-1-v$. The expressions differ according to the value of $v$. For three of the five NIST recommended values of $n$ for ECDSA, namely, $n = 163$, 283 and 571, all pairs $(n,v)$ for which $g(u) = u^n + u^{v+1} + u^v + u^{v-1} + 1$ is irreducible are as follows: (163, 67), (163, 69), (163, 71), (163, 92), (163, 94), (163, 96), (283, 24), (283, 133), (283, 150), (283, 259), (571, 104), (571, 230), (571, 341) and (571, 467). These values of $v$ are in the range $3 < v < (n-3)/2$ and we can now present explicit expressions of the coordinate $c_{v+t}$. To this end, the corresponding values of $t$ are divided into ten segments/cases. Below we present Case 1 only and leave the rest for the appendix.
Case 1: $t = -v$
\[
\begin{aligned}
c_0 &= \sum_{i=0}^{v} a_i b_{v-i} + \sum_{i=v+1}^{n-1} a_i b_{n+v-i} + a_0 b_0 + a_0 b_1 + a_1 b_0 \\
&= a_0(b_0 + b_1 + b_v) + \left\{ a_1(b_0 + b_{v-1}) + \left[ \sum_{i=2}^{v} a_i b_{v-i} + \sum_{i=v+1}^{n-1} a_i b_{n+v-i} \right] \right\}
\end{aligned} \tag{9}
\]
The entries of the corresponding row of the Mastrovito matrix are
\[
z_{0,i} =
\begin{cases}
b_0 + b_1 + b_v, & i = 0,\\
b_0 + b_{v-1}, & i = 1,\\
b_{v-i}, & 2 \le i \le v,\\
b_{n+v-i}, & v+1 \le i \le n-1.
\end{cases}
\]
There are $n-2$ terms in the square brackets in (9), and they are added pairwise first. Then the $\lceil (n-2)/2 \rceil$ summations and the term $a_1(b_0 + b_{v-1})$ in the curly brackets are added pairwise. Finally, the $\lceil (1 + \lceil (n-2)/2 \rceil)/2 \rceil$ summations obtained in the second step and the term $a_0(b_0 + b_1 + b_v)$ are added using a binary XOR tree. We note that the square and curly brackets indicate the same computation sequence in the following cases. The AND gate delay in this case and in the following cases is the same, namely, $T_A$. So we only consider the XOR gate delay. Due to the parallelism, the XOR gate delay for computing $c_0$ is $2 + \lceil \log_2 (1 + \lceil (1 + \lceil (n-2)/2 \rceil)/2 \rceil) \rceil = 2 + \lceil \log_2 (1 + \lceil n/4 \rceil) \rceil$.
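The delay expression for Case 1 can be cross-checked by simulating an XOR tree over inputs that become available at different times. The sketch below rests on an assumption that is not part of the paper: signals are combined greedily, always adding the two that are ready earliest, which is a standard way to build a minimum-depth tree for unequal arrival times. Arrival times are counted in XOR levels after the AND gate; the names are illustrative.

```python
import heapq


def ceil_log2(i: int) -> int:
    return (i - 1).bit_length()


def xor_tree_depth(arrival_levels) -> int:
    """Depth of a tree that repeatedly XORs the two earliest-ready signals."""
    heap = list(arrival_levels)
    heapq.heapify(heap)
    while len(heap) > 1:
        first = heapq.heappop(heap)
        second = heapq.heappop(heap)
        heapq.heappush(heap, max(first, second) + 1)
    return heap[0]


def case1_arrival_levels(n: int):
    """Inputs of c_0 in (9), with the XOR levels each one needs on the b side."""
    return ([2]                 # a_0 (b_0 + b_1 + b_v): three-term sum, 2 levels
            + [1]               # a_1 (b_0 + b_{v-1}): two-term sum, 1 level
            + [0] * (n - 2))    # plain products a_i b_j in the square brackets


def case1_formula(n: int) -> int:
    """2 + ceil(log2(1 + ceil(n / 4))), as stated for Case 1."""
    return 2 + ceil_log2(1 + (n + 3) // 4)
```

For example, with $n = 163$ both `xor_tree_depth(case1_arrival_levels(163))` and `case1_formula(163)` give 8 XOR levels, below the maximum of the multiplier, which the text attributes to Case 5.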
Using the formula for each coordinate of $C$ as given above and in the appendix, we now compute the complexities of the proposed multiplier. The maximal gate delay occurs in Case 5 ($t = 0$), which is found to be $T_A + (1 + \lceil \log_2 (2n-v-1) \rceil) T_X$.
From the discussions in Case 3 and Case 8, which are given in the appendix, we know that the multiplier requires $n^2$ AND gates. Now we count the number of XOR gates. First we need to compute the following summations of the $b_i$. The number of XOR gates used to compute the corresponding expression is listed in curly brackets.
$b_{v-1} + b_{n-1}$, {1};
$b_v + b_{n-1}$, {1};
$b_0 + b_{v-1}$, {1};
$b_0 + b_v$, {1};
$b_0 + b_1 + b_v = (b_0 + b_v) + b_1$, {1};
$b_{v-1} + b_{n-2} + b_{n-1} = (b_{v-1} + b_{n-1}) + b_{n-2}$, {1};
$b_{v-2} + b_{v-1} + b_{2v-1} + b_0 = (b_{v-2} + b_{2v-1}) + (b_{v-1} + b_0)$, {2};
$b_{v-1} + b_{2v} + b_0 = (b_{v-1} + b_0) + b_{2v}$, {1};
$b_{j+v-1-n} + b_j$, where $n-v+1 \le j \le n-1$, {$v-1$};
$b_{j+v-1} + b_j$, where $v+1 \le j \le n-v$, {$n-2v$};
$b_{j+v+1} + b_j$, where $0 \le j \le v-2$, {$v-2$};
$b_{j+v+1} + b_{j+1} + b_j = (b_{j+v+1} + b_j) + b_{j+1}$, where $0 \le j \le v-3$, {$v-2$};
$b_{j+v-n} + b_{j+1} + b_j = (b_{j+v-n} + b_{j+1}) + b_j$, where $n-v \le j \le n-2$, {$v-1$};
$b_{j+v} + b_{j+1} + b_j = (b_{j+v} + b_{j+1}) + b_j$, where $v+1 \le j \le n-v-1$, {$n-2v-1$};
$b_{j+v+1} + b_{j+2} + b_{j+1} + b_j = (b_{j+v+1} + b_j) + (b_{j+2} + b_{j+1})$, where $0 \le j \le v-3$, {$2v-4$};
$b_{j+1+v-n} + b_{j+2} + b_{j+1} + b_j = (b_{j+1+v-n} + b_{j+2}) + (b_{j+1} + b_j)$, where $n-v-1 \le j \le n-3$, {$2v-2$};
$b_{j+1+v} + b_{j+2} + b_{j+1} + b_j = (b_{j+1+v} + b_{j+2}) + (b_{j+1} + b_j)$, where $v+1 \le j \le n-v-2$, {$2(n-2v-2)$}.
We note that a previous expression may be reused by a following one. For example, we need only $v-2$ XOR gates to compute $b_{j+v+1} + b_j$ ($0 \le j \le v-2$) since the term $(b_{v-2} + b_{2v-1})$ has been computed in the expression $b_{v-2} + b_{v-1} + b_{2v-1} + b_0 = (b_{v-2} + b_{2v-1}) + (b_{v-1} + b_0)$. Therefore, $4n-8$ XOR gates are required to calculate all expressions. Based on the discussions in Case 3, we know that the total number of XOR gates of the multiplier is $n(n-1) + 1 + 4n - 8 = n^2 + 3n - 7$.
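The arithmetic behind the totals $4n - 8$ and $n^2 + 3n - 7$ is easy to confirm by summing the bracketed counts of the list. The sketch below only adds the numbers given there; the function names are illustrative assumptions.

```python
def b_sum_xor_counts(n: int, v: int) -> list:
    """XOR counts in curly brackets, in the order in which the sums are listed."""
    return [
        1, 1, 1, 1, 1, 1, 2, 1,      # the eight individual sums
        v - 1,                       # b_{j+v-1-n} + b_j
        n - 2 * v,                   # b_{j+v-1} + b_j
        v - 2,                       # b_{j+v+1} + b_j
        v - 2,                       # b_{j+v+1} + b_{j+1} + b_j
        v - 1,                       # b_{j+v-n} + b_{j+1} + b_j
        n - 2 * v - 1,               # b_{j+v} + b_{j+1} + b_j
        2 * v - 4,                   # b_{j+v+1} + b_{j+2} + b_{j+1} + b_j
        2 * v - 2,                   # b_{j+1+v-n} + b_{j+2} + b_{j+1} + b_j
        2 * (n - 2 * v - 2),         # b_{j+1+v} + b_{j+2} + b_{j+1} + b_j
    ]


def total_xor_gates(n: int, v: int) -> int:
    """n(n - 1) + 1 XOR gates on the product side plus the sums of b_i."""
    return n * (n - 1) + 1 + sum(b_sum_xor_counts(n, v))
```

The dependence on $v$ cancels in the sum, so `total_xor_gates(163, 71)` evaluates to 27051 for any admissible $v$ with $n = 163$.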
For simplicity, we do not present detailed computational procedures of gate delays for the other cases here. But we note that we have obtained these complexities for all irreducible pentanomials $g(u) = u^n + u^{v+1} + u^v + u^{v-1} + 1$. We summarize the gate delay of the proposed multiplier as follows.
TABLE IV
GATE DELAYS FOR $g(u) = u^n + u^{v+1} + u^v + u^{v-1} + 1$-BASED SPB MULTIPLIERS
| $v$ | Gate delay |
|---|---|
| $n+1 < 2v$ | $T_A + (1 + \lceil \log_2 (n+v-1) \rceil) T_X$ |
| $2v \le n+1$ | $T_A + (1 + \lceil \log_2 (2n-v-1) \rceil) T_X$ |
For values of $n$ with $2 < n < 1000$ for which no irreducible trinomial of degree $n$ exists, we now list pairs $(n,v)$ such that $g(u) = u^n + u^{v+1} + u^v + u^{v-1} + 1$ is irreducible and the gate delay of the presented multiplier equals $T_A + \lceil 1 + \log_2 n \rceil T_X$.
(19,6) (38,16) (40,20) (67,6) (70,20) (133,63) (139,60) (149,43)
(157,73) (160,72) (163,71) (164,75) (259,118) (262,91) (269,43) (275,122)
(277,128) (283,133) (290,142) (298,109) (299,134) (301,146) (307,130) (326,139)
(331,159) (515,205) (523,202) (533,92) (535,257) (536,144) (541,237) (542,191)
(547,134) (548,246) (554,206) (557,97) (560,272) (562,173) (563,241) (565,204)
(568,124) (571,230) (572,162) (578,286) (581,231) (584,224) (586,286) (587,195)
(589,169) (592,244) (595,166) (598,268) (605,299) (611,241) (638,252) (644,319)
(661,317) (667,316) (680,336)
For the range $2 < n < 1001$, there are 59 values of $n$ for which the gate delay of the proposed multiplier equals $T_A + \lceil 1 + \log_2 n \rceil T_X$. In particular, for the binary fields $GF(2^{163})$, $GF(2^{283})$ and $GF(2^{571})$, which have been recommended by NIST for ECDSA applications but for which no irreducible trinomials of degree $n = 163$, 283 and 571 exist, we have found pairs $(n,v)$ such that the gate delay of the proposed SPB multipliers is $T_A + \lceil 1 + \log_2 n \rceil T_X$.
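Whether a given pair $(n, v)$ reaches $T_A + \lceil 1 + \log_2 n \rceil T_X$ follows from Table IV by a short calculation. The sketch below is illustrative (the function names are assumptions) and returns the multiple of $T_X$ only; it does not test irreducibility.

```python
def ceil_log2(i: int) -> int:
    return (i - 1).bit_length()


def pentanomial_xor_delay(n: int, v: int) -> int:
    """XOR delay (in units of T_X) from Table IV."""
    if n + 1 < 2 * v:
        return 1 + ceil_log2(n + v - 1)
    return 1 + ceil_log2(2 * n - v - 1)     # remaining case: 2v <= n + 1


def reaches_bound(n: int, v: int) -> bool:
    """True if the delay equals ceil(1 + log2 n) XOR levels."""
    return pentanomial_xor_delay(n, v) == 1 + ceil_log2(n)
```

For $n = 163$ the pair $(163, 71)$ gives $1 + \lceil \log_2 254 \rceil = 9$ levels and reaches the bound, whereas $(163, 67)$ gives $1 + \lceil \log_2 258 \rceil = 10$ levels, which explains why only some of the irreducible pentanomials of a given degree appear in the list.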
B. Comparison
We now compare the gate delay of the proposed multiplier with those of some recently published multipliers. Please note that the multiplier proposed in [14] uses $n+1$ bits to represent $GF(2^n)$ elements; therefore the width of the data-path is equal to $n+1$ and its AND gate complexity is $(n+1)^2$. The other multipliers in Table V do not use the redundant representation.
TABLE V
COMPARISONS OF SOME SELECTED MULTIPLIERS
| Polynomials | Bases | # XOR | Gate delay |
|---|---|---|---|
| $u^n + u^{v+1} + u^v + u^{v-1} + 1$ | Dual Basis [13] | $n^2 + \lceil 1.5n \rceil + 3v - 6$ | $T_A + \lceil 3 + \log_2 n \rceil T_X$ |
| $u^n + u^{v+1} + u^v + u + 1$ | PB [5] | $n^2 + n$ | $T_A + \lceil 3 + \log_2 (n-1) \rceil T_X$ |
| $u^{n+1} + u^v + u^k + 1$ | Redundant [14] | $n^2 + 2n + v - k$ | $T_A + \lceil 2 + \log_2 (n+1) \rceil T_X$ |
| $u^n + u^{v+1} + u^v + u^{v-1} + 1$ | SPB Proposed | $n^2 + 3n - 7$ | $T_A + \lceil 1 + \log_2 (n+v-1) \rceil T_X$, if $n+1 < 2v$; $T_A + \lceil 1 + \log_2 (2n-v-1) \rceil T_X$, if $2v \le n+1$ |
For the purpose of illustration, in Table VI, we list the number of AND/XOR gates and the gate delay of some available bit parallel multipliers for the fields $GF(2^{163})$, $GF(2^{283})$ and $GF(2^{571})$. Since the existence of suitable 4-term polynomials for $n = 283$ and 571 is unknown, the algorithm in [14] is not applicable to these two fields. From Table VI, we conclude the following:
For the field $GF(2^{163})$, the proposed design improves the gate delay of [14] by 10%; for the field $GF(2^{283})$, the proposed design improves the gate delay of [5] by 16.7% at the cost of a 0.7% increase in the number of XOR gates; for the field $GF(2^{571})$, the proposed design improves the gate delay of [13] by 15.4% at the cost of a 0.2% increase in the number of XOR gates.
TABLE VI
COMPLEXITIES OF SOME PRACTICAL FIELDS
| Polynomials | Bases | # AND | # XOR | Gate delay |
|---|---|---|---|---|
| $u^{163} + u^{68} + u^{67} + u^{66} + 1$ | Dual Basis [13] | 26569 | 27009 | $T_A + 11T_X$ |
| $u^{163} + u^{60} + u^{59} + u + 1$ | PB [5] | 26569 | 26732 | $T_A + 11T_X$ |
| $u^{164} + u^{33} + u^{22} + 1$ | Redundant [14] | 26896 | 26906 | $T_A + 10T_X$ |
| $u^{163} + u^{72} + u^{71} + u^{70} + 1$ | SPB Proposed | 26569 | 27051 | $T_A + 9T_X$ |
| $u^{283} + u^{25} + u^{24} + u^{23} + 1$ | Dual Basis [13] | 80089 | 80580 | $T_A + 12T_X$ |
| $u^{283} + u^{60} + u^{59} + u + 1$ | PB [5] | 80089 | 80372 | $T_A + 12T_X$ |
| $u^{283} + u^{134} + u^{133} + u^{132} + 1$ | SPB Proposed | 80089 | 80931 | $T_A + 10T_X$ |
| $u^{571} + u^{105} + u^{104} + u^{103} + 1$ | Dual Basis [13] | 326041 | 327204 | $T_A + 13T_X$ |
| $u^{571} + u^{277} + u^{275} + u^{2} + 1$ | PB [5] | 326041 | 326612 | $T_A + 14T_X$ |
| $u^{571} + u^{231} + u^{230} + u^{229} + 1$ | SPB Proposed | 326041 | 327747 | $T_A + 11T_X$ |
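The percentages quoted in the comparison follow from the entries of Table VI. The sketch below recomputes them; the dictionary simply copies the XOR counts and the XOR-delay multiples from the table, the pairing of each proposed design with a reference design follows the comparison in the text, and the data layout and function names are illustrative assumptions.

```python
# (XOR gates, XOR-delay multiple) copied from Table VI
TABLE_VI = {
    163: {"proposed": (27051, 9), "reference": ("Redundant [14]", 26906, 10)},
    283: {"proposed": (80931, 10), "reference": ("PB [5]", 80372, 12)},
    571: {"proposed": (327747, 11), "reference": ("Dual Basis [13]", 327204, 13)},
}


def delay_improvement_percent(n: int) -> float:
    """Reduction of the XOR-delay multiple relative to the reference design."""
    _, new_delay = TABLE_VI[n]["proposed"]
    _, _, old_delay = TABLE_VI[n]["reference"]
    return 100.0 * (old_delay - new_delay) / old_delay


def xor_increase_percent(n: int) -> float:
    """Increase of the XOR gate count relative to the reference design."""
    new_xor, _ = TABLE_VI[n]["proposed"]
    _, old_xor, _ = TABLE_VI[n]["reference"]
    return 100.0 * (new_xor - old_xor) / old_xor


if __name__ == "__main__":
    for n in sorted(TABLE_VI):
        print(n, round(delay_improvement_percent(n), 1), round(xor_increase_percent(n), 1))
```

The improvement is measured on the multiple of $T_X$ only, as in the text; the common $T_A$ term is not included in the ratio.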
V. CONCLUSIONS
After reorganizing the computational procedure of the coordinate formulae of $C = AB = (c_0, c_1, \ldots, c_{n-1})^T$, we have proposed a new SPB multiplier. For all irreducible trinomials, its space complexity matches the previously known best results, but for certain irreducible trinomials its gate delay reaches a new low value of $T_A + \lceil \log_2 n \rceil T_X$. For a special type of irreducible pentanomials, namely, $f(u) = u^n + u^{v+1} + u^v + u^{v-1} + 1$ ($3 < v < (n-3)/2$), we have presented exact formulae for the multiplier and shown that the gate delay is $T_A + (1 + \lceil \log_2 (n+v-1) \rceil) T_X$ for the case $n+1 < 2v$. Combining the result of reference [12] and this report, we know that it is possible to design SPB bit parallel multipliers with a gate delay of no more than $T_A + (1 + \lceil \log_2 n \rceil) T_X$ for the five binary fields recommended by NIST for ECDSA applications.
APPENDIX
Formulae for $c_{v+t}$: Case 1 has been presented in Section IV; Cases 2 to 10 are given below.
Case 2: $1-v \le t \le -3$
\[
\begin{aligned}
c_{v+t} ={}& \sum_{i=0}^{2v+t} a_i b_{2v+t-i} + \sum_{i=t+2v+1}^{n-1} a_i b_{n+2v+t-i} + \sum_{i=0}^{t+v-1} a_i b_{v-1+t-i} + \sum_{i=0}^{t+v} a_i b_{v+t-i} + \sum_{i=0}^{t+v+1} a_i b_{v+1+t-i} \\
={}& \sum_{i=0}^{v+t-1} a_i\,(b_{2v+t-i} + b_{v-1+t-i} + b_{v+t-i} + b_{v+1+t-i}) + a_{v+t}(b_0 + b_1 + b_v) \\
&+ \left\{ a_{v+t+1}(b_0 + b_{v-1}) + \left[ \sum_{i=v+t+2}^{2v+t} a_i b_{2v+t-i} + \sum_{i=t+2v+1}^{n-1} a_i b_{n+2v+t-i} \right] \right\}
\end{aligned}
\]
The entries of the corresponding rows of the Mastrovito matrix are
\[
z_{v+t,i} =
\begin{cases}
b_{v-1+t-i} + b_{v+t-i} + b_{v+1+t-i} + b_{2v+t-i}, & 0 \le i \le v-1+t,\\
b_0 + b_1 + b_v, & i = v+t,\\
b_0 + b_{v-1}, & i = v+t+1,\\
b_{2v+t-i}, & v+t+2 \le i \le 2v+t,\\
b_{n+2v+t-i}, & 2v+t+1 \le i \le n-1.
\end{cases}
\]
Please note that the summations $b_0 + b_1 + b_v$ and $b_0 + b_{v-1}$ also appear in Case 1, and they may be reused.
The XOR gate delay for computing $c_{v+t}$ ($1-v \le t \le -3$) is $2 + \lceil \log_2 (v + t + 1 + \lceil (1 + \lceil (n-v-t-2)/2 \rceil)/2 \rceil) \rceil = 2 + \lceil \log_2 \lceil (n+3v+3t+4)/4 \rceil \rceil$. The maximal delay is $2 + \lceil \log_2 \lceil (n+3v-5)/4 \rceil \rceil$ when $t = -3$.
From $0 \le i \le v-1+t$ and $1-v \le t \le -3$, we know that $0 \le v-1+t-i \le v-4$. Therefore, the sums $b_{v-1+t-i} + b_{v+t-i} + b_{v+1+t-i} + b_{2v+t-i}$ are the same as the sums $b_{j+v+1} + b_{j+2} + b_{j+1} + b_j$, where $0 \le j \le v-4$.
Case 3: $t = -2$
\[
\begin{aligned}
c_{v-2} ={}& \sum_{i=0}^{2v-2} a_i b_{2v-2-i} + \sum_{i=2v-1}^{n-1} a_i b_{n+2v-2-i} + \sum_{i=0}^{v-3} a_i b_{v-3-i} + \sum_{i=0}^{v-2} a_i b_{v-2-i} + \sum_{i=0}^{v-1} a_i b_{v-1-i} + a_0 b_0 \\
={}& \sum_{i=0}^{v-3} a_i\,(b_{v-3-i} + b_{v-2-i} + b_{v-1-i} + b_{2v-2-i}) + a_{v-2}(b_0 + b_1 + b_v) \\
&+ \left\{ a_{v-1}(b_0 + b_{v-1}) + \left[ \sum_{i=v}^{2v-2} a_i b_{2v-2-i} + \sum_{i=2v-1}^{n-1} a_i b_{n+2v-2-i} + a_0 b_0 \right] \right\}
\end{aligned} \tag{10}
\]
The entries of the corresponding row of the Mastrovito matrix are
\[
z_{v-2,i} =
\begin{cases}
b_{v-3} + b_{v-2} + b_{v-1} + b_{2v-2} + b_0, & i = 0,\\
b_{v-3-i} + b_{v-2-i} + b_{v-1-i} + b_{2v-2-i}, & 1 \le i \le v-3,\\
b_0 + b_1 + b_v, & i = v-2,\\
b_0 + b_{v-1}, & i = v-1,\\
b_{2v-2-i}, & v \le i \le 2v-2,\\
b_{n+2v-2-i}, & 2v-1 \le i \le n-1.
\end{cases}
\]
Please note that $z_{v-2,0} = b_{v-3} + b_{v-2} + b_{v-1} + b_{2v-2} + b_0$ is a summation of five terms. Thus the gate delay of the type-II SPB multiplier is $T_A + (3 + \lceil \log_2 n \rceil) T_X$. But we may use formula (10) to compute $c_{v-2}$, where $a_0 b_0$ is added in the square brackets. The total XOR gate delay for computing $c_{v-2}$ is $2 + \lceil \log_2 (1 + v - 3 + 1 + \lceil (1 + \lceil (n-v+1)/2 \rceil)/2 \rceil) \rceil = 2 + \lceil \log_2 \lceil (n+3v-1)/4 \rceil \rceil$, and $n+1$ AND gates are required in (10). We note that only $n$ AND gates are required in each of the other $n-2$ cases, i.e., in all cases except this one and the case $t = n-2v$. Please refer to Case 8 for more details.
From $1 \le i \le v-3$ and $t = -2$, we know that $0 \le v-3-i \le v-4$. Therefore, the sums $b_{v-3-i} + b_{v-2-i} + b_{v-1-i} + b_{2v-2-i}$ are the same as the sums $b_{j+v+1} + b_{j+2} + b_{j+1} + b_j$, where $0 \le j \le v-4$.
Case 4: $t = -1$
\[
\begin{aligned}
c_{v-1} ={}& \sum_{i=0}^{2v-1} a_i b_{2v-1-i} + \sum_{i=v+1}^{n-1} a_i b_{n+v-i} + \sum_{i=2v}^{n-1} a_i b_{n+2v-1-i} + \sum_{i=0}^{v-2} a_i b_{v-2-i} + \sum_{i=0}^{v-1} a_i b_{v-1-i} + a_0 b_0 \\
={}& a_0(b_{v-2} + b_{v-1} + b_{2v-1} + b_0) + \sum_{i=1}^{v-2} a_i\,(b_{v-2-i} + b_{v-1-i} + b_{2v-1-i}) \\
&+ \left\{ a_{v-1}(b_0 + b_v) + a_v b_{v-1} + \sum_{i=v+1}^{2v-1} a_i\,(b_{2v-1-i} + b_{n+v-i}) + \sum_{i=2v}^{n-1} a_i\,(b_{n+2v-1-i} + b_{n+v-i}) \right\}
\end{aligned}
\]
The entries of the corresponding row of the Mastrovito matrix are
\[
z_{v-1,i} =
\begin{cases}
b_{v-2} + b_{v-1} + b_{2v-1} + b_0, & i = 0,\\
b_{v-2-i} + b_{v-1-i} + b_{2v-1-i}, & 1 \le i \le v-2,\\
b_0 + b_v, & i = v-1,\\
b_{v-1}, & i = v,\\
b_{2v-1-i} + b_{n+v-i}, & v+1 \le i \le 2v-1,\\
b_{n+v-i} + b_{n+2v-1-i}, & 2v \le i \le n-1.
\end{cases}
\]
The XOR gate delay for computing $c_{v-1}$ is $2 + \lceil \log_2 (v - 1 + \lceil (n-v+1)/2 \rceil) \rceil = 2 + \lceil \log_2 \lceil (n+v-1)/2 \rceil \rceil$.
From $1 \le i \le v-2$, we know that $0 \le v-2-i \le v-3$. Therefore, the sums $b_{v-2-i} + b_{v-1-i} + b_{2v-1-i}$ are the same as the sums $b_{j+v+1} + b_{j+1} + b_j$, where $0 \le j \le v-3$.
From $v+1 \le i \le 2v-1$, we know that $n-v+1 \le n+v-i \le n-1$. Therefore, the sums $b_{2v-1-i} + b_{n+v-i}$ are the same as the sums $b_{j+v-1-n} + b_j$, where $n-v+1 \le j \le n-1$.
From $2v \le i \le n-1$, we know that $v+1 \le n+v-i \le n-v$. Therefore, the sums $b_{n+v-i} + b_{n+2v-1-i}$ are the same as the sums $b_{j+v-1} + b_j$, where $v+1 \le j \le n-v$.
Case 5: $t = 0$
\[
\begin{aligned}
c_v ={}& \sum_{i=0}^{2v} a_i b_{2v-i} + \sum_{i=v+1}^{n-1} a_i b_{n+v-i} + \sum_{i=v+2}^{n-1} a_i b_{n+v+1-i} + \sum_{i=2v+1}^{n-1} a_i b_{n+2v-i} + \sum_{i=0}^{v-1} a_i b_{v-1-i} + a_0 b_0 \\
={}& a_0(b_0 + b_{v-1} + b_{2v}) + \left\{ \sum_{i=1}^{v-1} a_i\,(b_{v-1-i} + b_{2v-i}) + a_v b_v + a_{v+1}(b_{v-1} + b_{n-1}) \right\} \\
&+ \sum_{i=v+2}^{2v} a_i\,(b_{2v-i} + b_{n+v-i} + b_{n+v+1-i}) + \sum_{i=2v+1}^{n-1} a_i\,(b_{n+v-i} + b_{n+v+1-i} + b_{n+2v-i})
\end{aligned}
\]
The entries of the corresponding row of the Mastrovito matrix are
\[
z_{v,i} =
\begin{cases}
b_0 + b_{v-1} + b_{2v}, & i = 0,\\
b_{v-1-i} + b_{2v-i}, & 1 \le i \le v-1,\\
b_v, & i = v,\\
b_{v-1} + b_{n-1}, & i = v+1,\\
b_{2v-i} + b_{n+v-i} + b_{n+v+1-i}, & v+2 \le i \le 2v,\\
b_{n+v-i} + b_{n+v+1-i} + b_{n+2v-i}, & 2v+1 \le i \le n-1.
\end{cases}
\]
The XOR gate delay for computing $c_v$ is $2 + \lceil \log_2 (1 + n - v - 2 + \lceil (v+1)/2 \rceil) \rceil = 2 + \lceil \log_2 \lceil (2n-v-1)/2 \rceil \rceil$, which is the maximal XOR gate delay of the new multiplier.
From $1 \le i \le v-1$, we know that $0 \le v-1-i \le v-2$. Therefore, the sums $b_{v-1-i} + b_{2v-i}$ are the same as the sums $b_{j+v+1} + b_j$, where $0 \le j \le v-2$.
From $v+2 \le i \le 2v$, we know that $n-v \le n+v-i \le n-2$. Therefore, the sums $b_{2v-i} + b_{n+v-i} + b_{n+v+1-i}$ are the same as the sums $b_{j+v-n} + b_{j+1} + b_j$, where $n-v \le j \le n-2$.
From $2v+1 \le i \le n-1$, we know that $v+1 \le n+v-i \le n-v-1$. Therefore, the sums $b_{n+v-i} + b_{n+v+1-i} + b_{n+2v-i}$ are the same as the sums $b_{j+v} + b_{j+1} + b_j$, where $v+1 \le j \le n-v-1$.
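Since Case 5 determines the delay of the whole multiplier, its grouped expression is worth checking against a direct computation. The following sketch evaluates the second form of $c_v$ term by term and compares it with bit $v$ of $x^{-v} a(x) b(x) \bmod g$. It is an illustration only: the helper names are assumptions, elements are integers whose bit $i$ is coordinate $i$, and the parameters $(n, v) = (19, 6)$ are taken from the list in Section IV.

```python
import random


def clmul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(p: int, f: int) -> int:
    deg_f = f.bit_length() - 1
    while p.bit_length() - 1 >= deg_f:
        p ^= f << (p.bit_length() - 1 - deg_f)
    return p


def spb_mul(a: int, b: int, f: int, v: int) -> int:
    """Reference model: coordinates of x^{-v} a(x) b(x) mod f."""
    c = poly_mod(clmul(a, b), f)
    for _ in range(v):
        if c & 1:
            c ^= f
        c >>= 1
    return c


def case5_cv(a: int, b: int, n: int, v: int) -> int:
    """Grouped form of c_v (Case 5, t = 0); AND is &, XOR is ^."""
    x = [(a >> i) & 1 for i in range(n)]
    y = [(b >> i) & 1 for i in range(n)]
    total = x[0] & (y[0] ^ y[v - 1] ^ y[2 * v])
    for i in range(1, v):
        total ^= x[i] & (y[v - 1 - i] ^ y[2 * v - i])
    total ^= x[v] & y[v]
    total ^= x[v + 1] & (y[v - 1] ^ y[n - 1])
    for i in range(v + 2, 2 * v + 1):
        total ^= x[i] & (y[2 * v - i] ^ y[n + v - i] ^ y[n + v + 1 - i])
    for i in range(2 * v + 1, n):
        total ^= x[i] & (y[n + v - i] ^ y[n + v + 1 - i] ^ y[n + 2 * v - i])
    return total


def check_case5(n: int, v: int, trials: int = 300, seed: int = 7) -> bool:
    g = (1 << n) | (1 << (v + 1)) | (1 << v) | (1 << (v - 1)) | 1
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        if (spb_mul(a, b, g, v) >> v) & 1 != case5_cv(a, b, n, v):
            return False
    return True
```

The check covers the algebraic form only; the depth of $2 + \lceil \log_2 \lceil (2n-v-1)/2 \rceil \rceil$ XOR levels depends on how the terms are grouped in hardware and is not modelled here.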
Case 6: 1 ≤ t ≤ n − 2v − 2
cv+t =
2v+t X
i=0
aib2v+t−i +
n−1 X
i=t+v
aibn+v−1+t−i +
n−1 X
i=t+v+1
aibn+v+t−i
+
n−1 X
i=t+v+2
aibn+v+1+t−i +
n−1 X
i=t+2v+1
aibn+2v+t−i
=
("
v+t−1 X
i=0
aib2v+t−i
#
+ at+v(bv + bn−1)
)
+ at+v+1(bv−1 + bn−2 + bn−1)
+
2v+t X
i=t+v+2
ai(b2v+t−i + bn+v−1+t−i + bn+v+t−i + bn+v+1+t−i)
+
n−1 X
i=t+2v+1
ai(bn+v−1+t−i + bn+v+t−i + bn+v+1+t−i + bn+2v+t−i)
The corresponding rows of the Mastrovito matrix are
\[
z_{v+t,i} =
\begin{cases}
b_{2v+t-i} & 0 \le i \le t+v-1 \\
b_v + b_{n-1} & i = t+v \\
b_{v-1} + b_{n-2} + b_{n-1} & i = t+v+1 \\
b_{2v+t-i} + b_{n+v-1+t-i} + b_{n+v+t-i} + b_{n+v+1+t-i} & t+v+2 \le i \le 2v+t \\
b_{n+v-1+t-i} + b_{n+v+t-i} + b_{n+v+1+t-i} + b_{n+2v+t-i} & t+2v+1 \le i \le n-1
\end{cases}
\]
The XOR gate delay for computing $c_{v+t}$ is
$2 + \lceil \log_2 (n - t - v - 1 + \lceil (v+t+1)/2 \rceil) \rceil = 2 + \lceil \log_2 \lceil (2n - v - t - 1)/2 \rceil \rceil$. The maximal delay is $2 + \lceil \log_2 \lceil (2n - v - 2)/2 \rceil \rceil$ when $t = 1$.
From $t+v+2 \le i \le 2v+t$ and $1 \le t \le n-2v-2$, we know that $n-v-1 \le n+v-1+t-i \le n-3$.
Therefore, with $j = n+v-1+t-i$, the sums $b_{2v+t-i} + b_{n+v-1+t-i} + b_{n+v+t-i} + b_{n+v+1+t-i}$ are the same as the sums $b_{j+1+v-n} + b_{j+2} + b_{j+1} + b_j$, where $n-v-1 \le j \le n-3$.
From $t+2v+1 \le i \le n-1$ and $1 \le t \le n-2v-2$, we know that $v+1 \le n+v-1+t-i \le n-v-2$.
Therefore, with $j = n+v-1+t-i$, the sums $b_{n+v-1+t-i} + b_{n+v+t-i} + b_{n+v+1+t-i} + b_{n+2v+t-i}$ are the same as the sums $b_{j+1+v} + b_{j+2} + b_{j+1} + b_j$, where $v+1 \le j \le n-v-2$.
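The two remarks above say that the four-term sums of every row $z_{v+t,i}$ in Case 6 come from two fixed families indexed by $j$, independently of $t$, so they only need to be built once. A hedged Python sketch of that check follows; the function names and the tuple ordering are assumptions made here for illustration.

```python
def case6_four_term_sums(n, v, t):
    """Index 4-tuples of the four-term b-sums in row z_{v+t} (Case 6)."""
    low, high = [], []
    for i in range(t + v + 2, 2 * v + t + 1):       # t+v+2 <= i <= 2v+t
        low.append((2 * v + t - i, n + v - 1 + t - i,
                    n + v + t - i, n + v + 1 + t - i))
    for i in range(t + 2 * v + 1, n):               # t+2v+1 <= i <= n-1
        high.append((n + v - 1 + t - i, n + v + t - i,
                     n + v + 1 + t - i, n + 2 * v + t - i))
    return low, high


def shared_families(n, v):
    """The two j-indexed families named in the text."""
    # b_{j+1+v-n} + b_j + b_{j+1} + b_{j+2},  n-v-1 <= j <= n-3
    wrap = {(j + 1 + v - n, j, j + 1, j + 2) for j in range(n - v - 1, n - 2)}
    # b_j + b_{j+1} + b_{j+2} + b_{j+1+v},    v+1 <= j <= n-v-2
    plain = {(j, j + 1, j + 2, j + 1 + v) for j in range(v + 1, n - v - 1)}
    return wrap, plain


def check_case6_sharing(n, v):
    """True if every row of Case 6 draws its four-term sums from the families."""
    wrap, plain = shared_families(n, v)
    for t in range(1, n - 2 * v - 1):               # 1 <= t <= n-2v-2
        low, high = case6_four_term_sums(n, v, t)
        if set(low) != wrap or not set(high).issubset(plain):
            return False
    return True
```

Under these assumptions the two families together contain $(v-1) + (n-2v-2) = n-v-3$ distinct four-term sums, which is the number the rows of Case 6 can reuse.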
Case 7: t = n − 2v − 1
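\[
\begin{aligned}
c_{n-v-1} &= \sum_{i=0}^{n-1} a_i b_{n-1-i} + \sum_{i=n-v-1}^{n-1} a_i b_{2n-v-2-i} + \sum_{i=n-v}^{n-1} a_i b_{2n-v-1-i} + \sum_{i=n-v+1}^{n-1} a_i b_{2n-v-i} \\
&= \left\{ \left[ \sum_{i=0}^{n-v-2} a_i b_{n-1-i} \right] + a_{n-v-1} (b_v + b_{n-1}) \right\} + a_{n-v} (b_{v-1} + b_{n-2} + b_{n-1}) \\
&\quad + \sum_{i=n-v+1}^{n-1} a_i (b_{n-1-i} + b_{2n-v-2-i} + b_{2n-v-1-i} + b_{2n-v-i})
\end{aligned}
\]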
The corresponding row of the Mastrovito matrix is
\[
z_{n-v-1,i} =
\begin{cases}
b_{n-1-i} & 0 \le i \le n-v-2 \\
b_v + b_{n-1} & i = n-v-1 \\
b_{v-1} + b_{n-2} + b_{n-1} & i = n-v \\
b_{n-1-i} + b_{2n-v-2-i} + b_{2n-v-1-i} + b_{2n-v-i} & n-v+1 \le i \le n-1
\end{cases}
\]
The XOR gate delay for computing $c_{n-v-1}$ is
$2 + \lceil \log_2 (v + \lceil (1 + \lceil (n-v-1)/2 \rceil)/2 \rceil) \rceil = 2 + \lceil \log_2 \lceil (n + 3v + 1)/4 \rceil \rceil$.
From $n-v+1 \le i \le n-1$, we know that $n-v-1 \le 2n-v-2-i \le n-3$. Therefore, with $j = 2n-v-2-i$, the sums $b_{n-1-i} + b_{2n-v-2-i} + b_{2n-v-1-i} + b_{2n-v-i}$ are the same as the sums $b_{j+1+v-n} + b_{j+2} + b_{j+1} + b_j$, where $n-v-1 \le j \le n-3$.
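The closed form of the Case 7 delay rests on a nested-ceiling simplification that is not spelled out. A short worked step, assuming only the standard identity $\lceil \lceil x \rceil / m \rceil = \lceil x/m \rceil$ for real $x$ and a positive integer $m$, and writing $m' = n-v-1$ (a symbol introduced here for brevity):

\[
\left\lceil \frac{1 + \lceil m'/2 \rceil}{2} \right\rceil
= \left\lceil \frac{\lceil (m'+2)/2 \rceil}{2} \right\rceil
= \left\lceil \frac{m'+2}{4} \right\rceil ,
\qquad
v + \left\lceil \frac{n-v+1}{4} \right\rceil
= \left\lceil \frac{n+3v+1}{4} \right\rceil .
\]

The same step with $m' = v+t$ or $m' = n-2$ gives the closed forms quoted for Cases 8 and 9.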
Case 8: n − 2v ≤ t ≤ n − v − 3
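\[
\begin{aligned}
c_{v+t} &= \sum_{i=2v-n+t+1}^{n-1} a_i b_{2v+t-i} + \sum_{i=t+v}^{n-1} a_i b_{n+v-1+t-i} + \sum_{i=t+v+1}^{n-1} a_i b_{n+v+t-i} \\
&\quad + \sum_{i=t+v+2}^{n-1} a_i b_{n+v+1+t-i} + \sum_{i=0}^{2v-n+t} a_i b_{2v-n+t-i} \\
&= \left\{ \left[ \sum_{i=0}^{2v-n+t} a_i b_{2v-n+t-i} + \sum_{i=2v-n+t+1}^{v+t-1} a_i b_{2v+t-i} \right] + a_{v+t} (b_v + b_{n-1}) \right\} + a_{v+t+1} (b_{v-1} + b_{n-2} + b_{n-1}) \\
&\quad + \sum_{i=v+t+2}^{n-1} a_i (b_{2v+t-i} + b_{n+v-1+t-i} + b_{n+v+t-i} + b_{n+v+1+t-i})
\end{aligned}
\tag{11}
\]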
The corresponding rows of the Mastrovito matrix are
\[
z_{v+t,i} =
\begin{cases}
b_{2v-n+t-i} & 0 \le i \le 2v-n+t \\
b_{2v+t-i} & 2v-n+t+1 \le i \le v+t-1 \\
b_v + b_{n-1} & i = v+t \\
b_{v-1} + b_{n-2} + b_{n-1} & i = v+t+1 \\
b_{2v+t-i} + b_{n+v-1+t-i} + b_{n+v+t-i} + b_{n+v+1+t-i} & v+t+2 \le i \le n-1
\end{cases}
\]
Note that the term $a_0 b_0$ has already been obtained in Case 3 at the cost of one AND gate. Since this term also
appears in the case $t = n-2v$ (the first term in the square brackets of (11), with $t = n-2v$ and $i = 0$), only $n-1$
AND gates are required for the case $t = n-2v$.
The XOR gate delay for computing $c_{v+t}$ is
$2 + \lceil \log_2 (n - v - t - 1 + \lceil (1 + \lceil (v+t)/2 \rceil)/2 \rceil) \rceil = 2 + \lceil \log_2 \lceil (4n - 3v - 3t - 2)/4 \rceil \rceil$. The maximal delay for
$n-2v \le t \le n-v-3$ is $2 + \lceil \log_2 \lceil (n + 3v - 2)/4 \rceil \rceil$, which occurs in the case of $t = n-2v$.
From $v+t+2 \le i \le n-1$ and $n-2v \le t \le n-v-3$, we know that $n-v \le n+v-1+t-i \le n-3$.
Therefore, with $j = n+v-1+t-i$, the sums $b_{2v+t-i} + b_{n+v-1+t-i} + b_{n+v+t-i} + b_{n+v+1+t-i}$ are the same as the sums $b_{j+1+v-n} + b_{j+2} + b_{j+1} + b_j$,
where $n-v \le j \le n-3$.
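The Case 8 delay expression and its claimed maximum at $t = n-2v$ can be checked numerically with integer arithmetic only. The sketch below is illustrative; the helper names `ceil_div` and `ceil_log2` are assumptions introduced here, and $\lceil \log_2 x \rceil$ is computed as the bit length of $x-1$ to avoid floating point.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def ceil_log2(x):
    """Smallest k with 2**k >= x, for an integer x >= 1."""
    return (x - 1).bit_length()


def case8_delay_nested(n, v, t):
    """Delay of c_{v+t} from the nested-ceiling expression in Case 8."""
    operands = n - v - t - 1 + ceil_div(1 + ceil_div(v + t, 2), 2)
    return 2 + ceil_log2(operands)


def case8_delay_closed(n, v, t):
    """Delay of c_{v+t} from the closed form in Case 8."""
    return 2 + ceil_log2(ceil_div(4 * n - 3 * v - 3 * t - 2, 4))


def case8_max_delay(n, v):
    """Largest delay over n-2v <= t <= n-v-3 (requires v >= 3)."""
    return max(case8_delay_closed(n, v, t)
               for t in range(n - 2 * v, n - v - 2))
```

Because the operand count decreases as $t$ grows, the maximum is attained at the smallest admissible $t$, which is what `case8_max_delay` is expected to reproduce.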
Case 9: t = n − v − 2
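\[
\begin{aligned}
c_{n-2} &= \sum_{i=v-1}^{n-1} a_i b_{n+v-2-i} + a_{n-2} b_{n-1} + a_{n-1} b_{n-2} + a_{n-1} b_{n-1} + \sum_{i=0}^{v-2} a_i b_{v-2-i} \\
&= \left\{ \left[ \sum_{i=0}^{v-2} a_i b_{v-2-i} + \sum_{i=v-1}^{n-3} a_i b_{n+v-2-i} \right] + a_{n-2} (b_v + b_{n-1}) \right\} + a_{n-1} (b_{v-1} + b_{n-2} + b_{n-1})
\end{aligned}
\]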
The corresponding row of the Mastrovito matrix is
\[
z_{n-2,i} =
\begin{cases}
b_{v-2-i} & 0 \le i \le v-2 \\
b_{n+v-2-i} & v-1 \le i \le n-3 \\
b_v + b_{n-1} & i = n-2 \\
b_{v-1} + b_{n-2} + b_{n-1} & i = n-1
\end{cases}
\]
The XOR gate delay for computing $c_{n-2}$ is $2 + \lceil \log_2 (1 + \lceil (1 + \lceil (n-2)/2 \rceil)/2 \rceil) \rceil = 2 + \lceil \log_2 \lceil (n+4)/4 \rceil \rceil$.
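To illustrate how a Mastrovito row is used, the sketch below evaluates $c_{n-2}$ in two ways on bit vectors: directly from the first line of the Case 9 expression, and as the GF(2) inner product of $a$ with the row $z_{n-2,i}$. This is a software model under stated assumptions, not the hardware structure itself; bits are Python integers 0 or 1, and all function names are hypothetical.

```python
import random


def case9_row(n, v, b):
    """Row z_{n-2,i} for i = 0..n-1, built from the bits b[0..n-1]."""
    row = []
    for i in range(n):
        if i <= v - 2:
            row.append(b[v - 2 - i])
        elif i <= n - 3:
            row.append(b[n + v - 2 - i])
        elif i == n - 2:
            row.append(b[v] ^ b[n - 1])
        else:  # i == n - 1
            row.append(b[v - 1] ^ b[n - 2] ^ b[n - 1])
    return row


def case9_sum_form(n, v, a, b):
    """c_{n-2} from the ungrouped sums (first line of the Case 9 expression)."""
    c = 0
    for i in range(v - 1, n):
        c ^= a[i] & b[n + v - 2 - i]
    c ^= a[n - 2] & b[n - 1]
    c ^= a[n - 1] & b[n - 2]
    c ^= a[n - 1] & b[n - 1]
    for i in range(0, v - 1):
        c ^= a[i] & b[v - 2 - i]
    return c


def case9_from_row(n, v, a, b):
    """c_{n-2} as the GF(2) inner product of a with the row z_{n-2}."""
    c = 0
    for a_i, z_i in zip(a, case9_row(n, v, b)):
        c ^= a_i & z_i  # one AND per entry, XOR accumulation
    return c


def random_bits(n, rng):
    """n random bits from the given random.Random instance."""
    return [rng.randint(0, 1) for _ in range(n)]
```

Comparing `case9_from_row` with `case9_sum_form` on random inputs (for example with `rng = random.Random(0)`) gives a quick consistency check of the row entries for a chosen $n$ and $v$.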
Case 10: t = n − v − 1
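\[
\begin{aligned}
c_{n-1} &= \sum_{i=v}^{n-1} a_i b_{n+v-1-i} + \sum_{i=0}^{v-1} a_i b_{v-1-i} + a_0 b_0 + a_{n-1} b_{n-1} \\
&= a_0 (b_0 + b_{v-1}) + \left\{ \left[ \sum_{i=1}^{v-1} a_i b_{v-1-i} + \sum_{i=v}^{n-2} a_i b_{n+v-1-i} \right] + a_{n-1} (b_v + b_{n-1}) \right\}
\end{aligned}
\]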
The corresponding row of the Mastrovito matrix is
\[
z_{n-1,i} =
\begin{cases}
b_0 + b_{v-1} & i = 0 \\
b_{v-1-i} & 1 \le i \le v-1 \\
b_{n+v-1-i} & v \le i \le n-2 \\
b_v + b_{n-1} & i = n-1
\end{cases}
\]
The XOR gate delay for computing $c_{n-1}$ is $1 + \lceil \log_2 (n-1) \rceil$.
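The delay expressions quoted in Cases 5 to 10 can be tabulated side by side to see which case dominates. The sketch below is a hedged illustration: it covers only the cases listed in this part of the derivation, uses the worst-case $t$ stated for Cases 6 and 8, and assumes $v \ge 3$ and $n \ge 2v+3$ so that every case is non-empty. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def ceil_log2(x):
    """Smallest k with 2**k >= x, for an integer x >= 1."""
    return (x - 1).bit_length()


def xor_delays(n, v):
    """XOR gate delays quoted for Cases 5 to 10, keyed by case number."""
    return {
        5: 2 + ceil_log2(ceil_div(2 * n - v - 1, 2)),
        6: 2 + ceil_log2(ceil_div(2 * n - v - 2, 2)),  # worst case t = 1
        7: 2 + ceil_log2(ceil_div(n + 3 * v + 1, 4)),
        8: 2 + ceil_log2(ceil_div(n + 3 * v - 2, 4)),  # worst case t = n - 2v
        9: 2 + ceil_log2(ceil_div(n + 4, 4)),
        10: 1 + ceil_log2(n - 1),
    }


def dominant_case(n, v):
    """Lowest-numbered case that attains the largest quoted delay."""
    delays = xor_delays(n, v)
    worst = max(delays.values())
    return min(case for case, d in delays.items() if d == worst)
```

For the parameter ranges tried here, `dominant_case` returns 5, which agrees with the statement in Case 5 that its delay is the maximal XOR gate delay of the multiplier.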
REFERENCES
[1] G. Seroussi, “Table of Low-Weight Binary Irreducible Polynomials,” Technical Report HPL-98-135, Hewlett-Packard Laboratories, Palo
Alto, Calif., Aug. 1998, Available at http://www.hpl.hp.com/techreports/98/HPL-98-135.html.
[2] B. Sunar and C. K. Koc, “Mastrovito Multiplier for All Trinomials,” IEEE Transactions on Computers, vol. 48, no. 5, pp. 522-527, May
1999.
[3] T. Zhang and K. K. Parhi, “Systematic Design of Original and Modiﬁed Mastrovito Multipliers for General Irreducible Polynomials,”
IEEE Transactions on Computers, vol. 50, no. 7, pp. 734-749, July 2001.
[4] A. Halbutogullari and C. K. Koc, “Mastrovito Multiplier for General Irreducible Polynomials,” IEEE Transactions on Computers, vol. 49,
no. 5, pp. 503-518, May 2000.
[5] A. Reyhani-Masoleh and M.A. Hasan, “Low Complexity Bit Parallel Architectures for Polynomial Basis Multiplication over $GF(2^m)$,”
IEEE Transactions on Computers, vol. 53, no. 8, pp. 945-959, Aug. 2004.
[6] H. Wu, “Bit parallel Finite Field Multiplier and Squarer Using Polynomial Basis,” IEEE Transactions on Computers, vol. 51, no. 7, pp.
750-758, July 2002.
[7] H. Wu, “Montgomery Multiplier and Squarer for a Class of Finite Fields,” IEEE Transactions on Computers, vol. 51, no. 5, pp. 521-529,
May 2002.
[8] C. Paar, “A New Architecture for a Parallel Finite Field Multiplier with Low Complexity Based on Composite Fields,” IEEE Transactions
on Computers, vol. 45, no. 7, pp. 856-861, July 1996.
[9] E. D. Mastrovito, “VLSI Architectures for Multiplication over Finite Field $GF(2^m)$,” Applied Algebra, Algebraic Algorithms and Error-
Correcting Codes, T. Mora, ed., pp. 297-309, Springer-Verlag, 1988.
[10] C. K. Koc and T. Acar, “Montgomery Multiplication in $GF(2^k)$,” Designs, Codes, and Cryptography, vol. 14, pp. 57-69, 1998.
[11] S. O. Lee, S. W. Jung, Ch. H. Kim, J. Yoon, J. Koh, and D. Kim, “Design of Bit Parallel Multiplier with Lower time complexity,” In
Proc. ICICS’2003, LNCS 2971, pp. 127-139, Springer-Verlag, 2004.
[12] H. Fan and Y. Dai, “Fast bit parallel $GF(2^n)$ Multiplier for All Trinomials,” IEEE Transactions on Computers, vol. 54, no. 4, pp. 485-490,
2005.
[13] F. Rodriguez-Henriquez, and C. K. Koc, “Parallel Multipliers Based on Special Irreducible Pentanomials,” IEEE Transactions on Computers,
vol. 52, no. 12, pp. 1535-1542, Dec. 2003.
[14] Christophe Negre, “Quadrinomial Modular Arithmetic Using Modified Polynomial Basis,” In Proceedings of the International Conference
on Information Technology: Coding and Computing (ITCC’2005), volume-I, pp. 550-555, 2005.
[15] H. Fan and M.A. Hasan, “Relationship between $GF(2^m)$ Montgomery and Shifted Polynomial Basis Multiplication Algorithms,” Technical
Report, CACR 2005-30, University of Waterloo, Waterloo, Canada, Aug. 2005.