Subquadratic space complexity multiplier for a class
of binary fields using Toeplitz matrix approach
M. A. Hasan¹ and C. Negre²
¹ ECE Department and CACR, University of Waterloo, Ontario, Canada.
² Team DALI/ELIAUS, Université de Perpignan, France.
Abstract
In the recent past, subquadratic space complexity multipliers have been proposed for binary fields defined by
irreducible trinomials and some specific pentanomials. Alternative irreducible polynomials can also be used for
such multipliers; in particular, nearly all one polynomials (NAOPs) seem to be better than pentanomials (see [7]).
For improved efficiency, multiplication modulo an NAOP is performed modulo a quadrinomial whose degree is one
more than that of the original NAOP. In this paper, we present a Toeplitz matrix-vector product based approach for
multiplication modulo a quadrinomial. We obtain a fully parallel (non-sequential) multiplier with a subquadratic
space complexity, which has the same order of space complexity as that of Fan and Hasan [4].
The Toeplitz matrix-vector product based approach is also of interest in the design of sequential multipliers.
In this paper, we present two such multipliers: one with bit serial output and the other with bit parallel output.
Index Terms
Subquadratic complexity, binary field, multiplication, double basis.
I. INTRODUCTION
For hardware implementation of certain cryptosystems [2], [8], [9], a finite field multiplier can be
one of the most circuit or space demanding blocks. In order to make such a multiplier circuit-efficient,
low weight irreducible polynomials are used for defining the representation of the field elements. For an
irreducible polynomial with coefficients 0 and 1 only, the least weight is three. Such irreducible
binary trinomials, however, do not exist for all degrees. The second least weight for an irreducible binary
polynomial is five, and such pentanomials appear to exist for all degrees of practical interest.
A slightly different approach is to use a low weight composite, instead of irreducible, polynomial to
define the representation of the field elements [7]. This leads to redundancy in the representation, but can
reduce the circuit requirement of the multiplier. To this end, composite binomials of the form
$X^n + 1$ have been suggested in the past. For reduced redundancy, such a binomial is chosen to be the product of
$X + 1$ and an irreducible all-one polynomial (AOP) [6]. The latter is however not so abundant, and the use
of nearly AOPs (NAOPs) has been suggested. Irreducible NAOPs appear to be abundant. The multiplication
of $X + 1$ and an NAOP results in a polynomial of weight four. Such a composite quadrinomial may be
a better choice than an irreducible pentanomial when an irreducible trinomial is not available.
In this paper, we use quadrinomials. In addition, for multiplication we represent the two inputs with
respect to two different bases. The main motivation for using two bases is to be able to formulate the field
multiplication as a Toeplitz matrix-vector product (TMVP). It is well known that such a product can be
obtained in a bit parallel fashion with subquadratic circuit/space complexity [4]. In the recent past there
has been a considerable amount of interest in designing subquadratic space complexity field multipliers,
especially for cryptographic applications where the dimension of the field is of intermediate size, i.e., in
the range of hundreds.
The main contributions of this work are as follows. For the modified polynomial basis $\mathcal{B}$ introduced
in [10], we have formulated the problem of binary field multiplication as a TMVP. The TMVP approach
has been known for other bases, but to the best of our knowledge this is the first time that a direct TMVP
approach has been derived for the modified polynomial basis. The TMVP formulation for the modified
polynomial basis has been achieved by expressing one of the inputs with respect to another basis $\mathcal{B}'$. We
are not aware of any prior use of $\mathcal{B}'$ in finite field arithmetic, and in our work we have given explicit
formulas for basis conversions from $\mathcal{B}$ to $\mathcal{B}'$ and vice versa. We have also proposed bit serial multiplier
structures involving the bases $\mathcal{B}$ and $\mathcal{B}'$. Although bit serial multipliers are available for various bases, to the
best of our knowledge the proposed structures are the first for the bases $\mathcal{B}$ and $\mathcal{B}'$.
II. BINARY FIELD MULTIPLICATION
A binary field $\mathbb{F}_{2^{n-1}}$ is the set of binary polynomials modulo an irreducible polynomial $P$ of degree
$n-1$:
$$\mathbb{F}_{2^{n-1}} = \mathbb{F}_2[X]/(P).$$
Each element can be seen as a polynomial of degree at most $n-2$, and elements are often expressed
in the polynomial basis $\mathcal{B} = \{1, X, X^2, \ldots, X^{n-2}\}$. In $\mathbb{F}_{2^{n-1}}$, for such a representation, field operations like
multiplication and inversion are done modulo $P$. Multiplication is widely used in practice, and in this paper
we focus on this operation. Let $A = \sum_{i=0}^{n-2} a_i X^i$ and $B = \sum_{i=0}^{n-2} b_i X^i$ be two elements expressed in $\mathcal{B}$.
We can compute $C = A \cdot B \bmod P$ as
$$C = A \cdot B = \sum_{i=0}^{n-2} (A X^i)\, b_i \bmod P = \sum_{i=0}^{n-2} A^{(i)} b_i,$$
after expanding the expression of $B$ in $\mathcal{B}$ and noting that $A^{(i)} = A X^i \bmod P$. This can be written as
a matrix vector product
$$C = \begin{bmatrix} A^{(0)} & A^{(1)} & \cdots & A^{(n-2)} \end{bmatrix} \cdot B.$$
The two most used strategies to design an efficient hardware multiplier via the above matrix vector
product are the following:
1) The choice of the polynomial $P$ must provide an efficient computation of the columns $A^{(i)}$. Until
now the all one polynomials (AOPs) and trinomials seem to be the best possible choices. However,
neither of them exists for all degrees. Consequently other types of irreducible polynomials have been
considered. A class of pentanomials of the form
$$P = X^{n-1} + X^{k+2} + X^{k+1} + X^{k} + 1$$
seems to be very interesting for the computation of the columns $A^{(i)}$ [4], [11]. Almost irreducible
trinomials have also been proposed [7], [1], [3]; these trinomials $X^{n-1+\delta} + X^k + 1$ have an irreducible
factor of degree $n-1$. For a large set of fields, $\delta$ is small.
2) The second strategy consists of transforming the matrix
$$\begin{bmatrix} A^{(0)} & A^{(1)} & \cdots & A^{(n-2)} \end{bmatrix}$$
into a Toeplitz form. We will see in Subsection II-B that a Toeplitz matrix vector product can be done
efficiently through a subquadratic complexity circuit. Generally we can obtain the Toeplitz form of
the matrix by performing some row operations or column operations, or in other words, by using
different bases of representation.
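The column view of the product given at the start of this section, $C = \sum_i A^{(i)} b_i$ with $A^{(i)} = A X^i \bmod P$, can be illustrated with a minimal Python sketch. It is not taken from the paper: polynomials are encoded as integers (bit $d$ holds the coefficient of $X^d$), the function names are hypothetical, and the degree-8 polynomial used in the usage note is the NAOP of Section V.

```python
def columns(a, p, deg):
    """Return [A^(0), ..., A^(deg-1)] with A^(i) = A * X^i mod P.

    a   : polynomial of degree < deg, as a bit mask
    p   : modulus P of degree deg, as a bit mask
    """
    cols = []
    for _ in range(deg):
        cols.append(a)
        a <<= 1                  # multiply by X
        if (a >> deg) & 1:       # degree reached deg: reduce by XOR-ing P
            a ^= p
    return cols


def mul_by_columns(a, b, p, deg):
    """C = sum_i A^(i) * b_i, i.e. the matrix-vector product [A^(0) ... ] * B."""
    c = 0
    for i, col in enumerate(columns(a, p, deg)):
        if (b >> i) & 1:         # b_i selects column A^(i)
            c ^= col
    return c
```

For example, with `p = 0x1CF` ($X^8+X^7+X^6+X^3+X^2+X+1$) and `deg = 8`, `mul_by_columns(2, 0x80, 0x1CF, 8)` computes $X \cdot X^7 \bmod P$. The cost of this direct approach is quadratic in the degree, which is what the Toeplitz reformulation aims to reduce.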
A. Nearly all one polynomials
Here we will design a multiplier modulo the irreducible polynomials introduced by Katti and Brennan
in [7]. We call such polynomials nearly all one polynomials (NAOPs). They have the following form:
$$P = \sum_{i=0}^{k_2-1} X^i + \sum_{i=k_1}^{n-1} X^i \quad \text{with } k_2 < k_1.$$
In other words, for an NAOP all the coefficients are equal to 1
unless their index lies in the interval $[k_2, k_1 - 1]$. If $P$ is irreducible, we can define $\mathbb{F}_{2^{n-1}}$ as $\mathbb{F}_2[X]/(P)$.
As noticed in [7], multiplication in these fields can be done efficiently modulo $Q = (X + 1) \cdot P$ since
$Q$ is a quadrinomial:
$$Q = 1 + X^{k_2} + X^{k_1} + X^{n}.$$
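The identity $(X+1) \cdot P = 1 + X^{k_2} + X^{k_1} + X^n$ can be checked mechanically. The following Python sketch is only an illustration; the function names are hypothetical and polynomials are encoded as integers whose bit $d$ is the coefficient of $X^d$.

```python
def naop(n, k1, k2):
    """Bit mask of P = sum_{i<k2} X^i + sum_{k1<=i<n} X^i (degree n-1)."""
    low = (1 << k2) - 1                       # X^0 ... X^(k2-1)
    high = ((1 << n) - 1) ^ ((1 << k1) - 1)   # X^k1 ... X^(n-1)
    return low | high


def times_x_plus_1(p):
    """Multiply a binary polynomial by (X + 1): shift once and XOR."""
    return (p << 1) ^ p


def quadrinomial(n, k1, k2):
    """Bit mask of Q = 1 + X^k2 + X^k1 + X^n."""
    return 1 | (1 << k2) | (1 << k1) | (1 << n)
```

With $n = 9$, $k_1 = 6$, $k_2 = 4$ (the example of Section V), `naop(9, 6, 4)` encodes $X^8+X^7+X^6+X^3+X^2+X+1$ and `times_x_plus_1` maps it to $X^9+X^6+X^4+1$. The sketch checks only the product form; it does not test irreducibility of $P$.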
In Table I we give several irreducible NAOPs with degrees suitable for cryptographic applications.
Irreducible NAOPs are abundant, so we do not list all of them for each degree. However, it appears to
be an open question whether there exists an NAOP for each degree. As we will see later, irreducible NAOPs
which satisfy $k_1 = k_2 + 1$ give a more efficient multiplier. As much as possible we give such NAOPs
for each degree $(n - 1)$ in Table I (they are marked by †). When there are no irreducible NAOPs with
$k_1 = k_2 + 1$ for a certain degree, Table I lists the NAOP with minimum $k_1$ for the smallest possible $k_2$.
TABLE I
IRREDUCIBLE NAOP OF DEGREE n − 1

n − 1 | k1, k2 | n − 1 | k1, k2 | n − 1 | k1, k2
161†  | 66, 65 | 247†  | 11, 10 | 503†  | 11, 10
163   | 33, 22 | 248   | 26, 6  | 504   | 26, 14
164   | 7, 5   | 249   | 12, 9  | 505†  | 17, 16
165   | 15, 8  | 250   | 16, 14 | 506   | 13, 3
166   | 22, 12 | 251   | 33, 2  | 507   | 25, 2
167†  | 7, 6   | 252   | 25, 17 | 508   | 17, 1
...   | ...    | 253†  | 43, 42 | 509†  | 75, 76
189†  | 35, 34 | 254   | 18, 14 | 510   | 34, 8
190   | 10, 2  | 255†  | 53, 52 | 511†  | 11, 10
191†  | 24, 23 | 256   | 34, 28 | 512   | 24, 22
192   | 20, 2  | 257†  | 69, 68 | 513   | 38, 29
193†  | 22, 21 | 258   | 11, 1  | 514   | 18, 6
194   | 17, 13 | 259   | 55, 22 | 515   | 315, 38
195†  | 3, 2   | 260†  | 9, 1   | 516   | 22, 2
...   | ...    | 261   | 35, 34 | ...   | ...
...   | ...    | ...   | ...    | ...   | ...
B. Asymptotic Complexities of Toeplitz Matrix Vector Product
As previously mentioned, multiplication in the binary field is often expressed as a Toeplitz matrix
vector product. In this section we recall the method used to build a subquadratic circuit which computes
a Toeplitz matrix-vector multiplication. Recall that a Toeplitz matrix is defined as follows.
Definition 1: An $n \times n$ Toeplitz matrix is a matrix $[t_{i,j}]_{0 \le i,j \le n-1}$ such that $t_{i,j} = t_{i-1,j-1}$ for $1 \le i, j$.
If $2 \mid n$ we can use the two-way approach presented in Table II to compute a matrix vector product $T \cdot V$,
where $T$ is an $n \times n$ Toeplitz matrix. If $3 \mid n$ we can use the three-way approach which is also presented
in Table II.
If $n$ is a power of 2 or a power of 3, the formulas of Table II can be used recursively to perform
$T \cdot V$. Using these recursive processes through parallel computation, the resulting multipliers [4] have the
complexity given in Table III. In this table $D_A$ represents the delay of an AND gate and $D_X$ the delay of
an XOR gate. It is also possible to design subquadratic TMVP multipliers for size $n = 3^i 2^j$ by combining
two-way and three-way splits in the recursive computations.
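As an illustration of the recursion, the two-way split of Table II, with $P_0 = (T_0+T_1)V_1$, $P_1 = (T_1+T_2)V_0$, $P_2 = T_1(V_0+V_1)$ and $T \cdot V = (P_0+P_2,\ P_1+P_2)$, can be sketched in Python as follows. This is a software sketch under stated assumptions, not the hardware design of [4]: $n$ is assumed to be a power of 2, and an $n \times n$ Toeplitz matrix is stored (a choice made here for convenience) as the list `d` of its $2n-1$ diagonal values with $t_{i,j}$ = `d[i - j + n - 1]`.

```python
def xor(a, b):
    """Component-wise addition over GF(2)."""
    return [x ^ y for x, y in zip(a, b)]


def tmvp_naive(d, v):
    """Quadratic reference: (T*V)_i = sum_j t_{i,j} v_j with t_{i,j} = d[i-j+n-1]."""
    n = len(v)
    return [sum(d[i - j + n - 1] & v[j] for j in range(n)) & 1 for i in range(n)]


def tmvp_two_way(d, v):
    """Two-way split of Table II applied recursively; len(v) must be a power of 2."""
    n = len(v)
    if n == 1:
        return [d[0] & v[0]]          # a single AND gate
    m = n // 2
    # T = [[T1, T0], [T2, T1]]: each block is an m x m Toeplitz matrix
    t0, t1, t2 = d[:2 * m - 1], d[m:3 * m - 1], d[2 * m:]
    v0, v1 = v[:m], v[m:]
    p0 = tmvp_two_way(xor(t0, t1), v1)
    p1 = tmvp_two_way(xor(t1, t2), v0)
    p2 = tmvp_two_way(t1, xor(v0, v1))
    return xor(p0, p2) + xor(p1, p2)  # upper half, then lower half
```

Each level replaces one product of size $n$ by three products of size $n/2$, which is the source of the $n^{\log_2 3}$ AND count in Table III.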
III. DOUBLE BASIS REPRESENTATION
Let $Q = 1 + X^{k_2} + X^{k_1} + X^n$ be a quadrinomial with $k_2 < k_1 < n$ in $\mathbb{F}_2[X]$. Let $l_1 = n - k_1$ and
$l_2 = n - k_2$. We now present our contribution on multiplication modulo $Q$. Our main goal is to get a
subquadratic complexity multiplier modulo such a quadrinomial. To reach this goal, we express
the product of two elements modulo $Q$ as a Toeplitz matrix vector product.
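TABLE II
SUBQUADRATIC COMPLEXITY TOEPLITZ MATRIX VECTOR PRODUCT

Two-way split. Matrix decomposition:
$$T \cdot V = \begin{bmatrix} T_1 & T_0 \\ T_2 & T_1 \end{bmatrix} \cdot \begin{bmatrix} V_0 \\ V_1 \end{bmatrix}$$
Recursive formulas:
$$T \cdot V = \begin{bmatrix} P_0 + P_2 \\ P_1 + P_2 \end{bmatrix} \quad \text{where} \quad
\begin{array}{l} P_0 = (T_0 + T_1) \cdot V_1, \\ P_1 = (T_1 + T_2) \cdot V_0, \\ P_2 = T_1 \cdot (V_0 + V_1). \end{array}$$

Three-way split. Matrix decomposition:
$$T \cdot V = \begin{bmatrix} T_2 & T_1 & T_0 \\ T_3 & T_2 & T_1 \\ T_4 & T_3 & T_2 \end{bmatrix} \cdot \begin{bmatrix} V_0 \\ V_1 \\ V_2 \end{bmatrix}$$
Recursive formulas:
$$T \cdot V = \begin{bmatrix} P_0 + P_3 + P_4 \\ P_1 + P_3 + P_5 \\ P_2 + P_4 + P_5 \end{bmatrix} \quad \text{where} \quad
\begin{array}{l} P_0 = (T_0 + T_1 + T_2) \cdot V_2, \\ P_1 = (T_1 + T_2 + T_3) \cdot V_1, \\ P_2 = (T_2 + T_3 + T_4) \cdot V_0, \\ P_3 = T_1 \cdot (V_1 + V_2), \\ P_4 = T_2 \cdot (V_0 + V_2), \\ P_5 = T_3 \cdot (V_0 + V_1). \end{array}$$

TABLE III
ASYMPTOTIC COMPLEXITY

       | Two-way split method              | Three-way split method
# AND  | $n^{\log_2 3}$                    | $n^{\log_3 6}$
# XOR  | $5.5\,n^{\log_2 3} - 6n + 0.5$    | $\frac{24}{5} n^{\log_3 6} - 5n + \frac{1}{5}$
Delay  | $D_A + 2\log_2(n)\,D_X$           | $D_A + 3\log_3(n)\,D_X$

First, we represent the elements of $\mathbb{F}_2[X]/(Q)$ using the basis $\mathcal{B} = \{e_0, e_1, \ldots, e_{n-1}\}$ given in equation (1).
This basis was introduced in [10]. In this basis the matrix involved in the multiplication is easier to
compute than in the polynomial basis. However, the matrix of [10] does not have the Toeplitz form.
Here we obtain the Toeplitz form by performing some column operations on this matrix. These column
operations must also be performed on the entries of $B$, which is thus expressed in a different, second
basis $\mathcal{B}' = \{e'_0, e'_1, \ldots, e'_{n-1}\}$. Both bases are given below.
$$\begin{array}{lll}
\mathcal{B} & \mathcal{B}' & \text{index} \\
e_i = X^i & e'_i = X^i & \text{for } i \in [0, l_1), \\
e_i = X^i + X^{i-l_1} & e'_i = X^i & \text{for } i \in [l_1, l_2), \\
e_i = X^i + X^{i-l_1} + X^{i-l_2} & e'_i = e_i & \text{for } i \in [l_2, n).
\end{array} \tag{1}$$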
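Let us specify the general construction of such a double basis multiplier. Let $A = \sum_{i=0}^{n-1} a_i e_i$ be
expressed relative to $\mathcal{B}$ and $B = \sum_{j=0}^{n-1} b'_j e'_j$ be expressed relative to $\mathcal{B}'$. The product $C = AB$ can be
written as
$$C = A \left( \sum_{j=0}^{n-1} b'_j e'_j \right) = \sum_{j=0}^{n-1} b'_j \left( A e'_j \right). \tag{2}$$
So if we denote $A^{(j)} = A e'_j$ we obtain $C = \sum_{j=0}^{n-1} b'_j A^{(j)}$. Then if $A^{(j)}$ is expressed with respect to $\mathcal{B}$,
using vector notation we obtain the $\mathcal{B}$ representation of $C$ as
$$C = \begin{bmatrix} A^{(0)}, A^{(1)}, \ldots, A^{(n-1)} \end{bmatrix} \cdot B = M_A \cdot B, \tag{3}$$
where $B$ denotes the vector of coordinates $b'_j$.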
We will show that a simple permutation of the columns of $M_A$ results in a Toeplitz matrix, and thus
the previous product can be performed using the subquadratic approach given in Subsection II-B.
To have a complete multiplier, one would expect to use the same basis to represent the elements $A$, $B$, $C$.
Here we use the basis $\mathcal{B}$. Thus, there is a preliminary step which performs a conversion of $B$ from $\mathcal{B}$ to
$\mathcal{B}'$. Below we present formulas to perform this conversion.
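A. Conversion from $\mathcal{B}$ to $\mathcal{B}'$
We assume here that $l_1 > l_2 - l_1$. Let $B = \sum_{i=0}^{n-1} b_i e_i$ be expressed relative to $\mathcal{B}$. To convert the
coefficients of $B$ from $\mathcal{B}$ to $\mathcal{B}'$ we use the relations
$$\begin{array}{ll}
e_i = e'_i & \text{if } i \in [0, l_1) \cup [l_2, n), \\
e_i = X^i + X^{i-l_1} = e'_i + e'_{i-l_1} & \text{if } i \in [l_1, l_2).
\end{array}$$
For $B$, we can write
$$\begin{array}{rcl}
B &=& \sum_{i=0}^{l_1-1} b_i e_i + \sum_{i=l_1}^{l_2-1} b_i e_i + \sum_{i=l_2}^{n-1} b_i e_i \\
  &=& \sum_{i=0}^{l_1-1} b_i e'_i + \sum_{i=l_1}^{l_2-1} b_i (e'_i + e'_{i-l_1}) + \sum_{i=l_2}^{n-1} b_i e'_i \\
  &=& \sum_{i=0}^{l_2-l_1-1} (b_i + b_{i+l_1})\, e'_i + \sum_{i=l_2-l_1}^{n-1} b_i e'_i.
\end{array}$$
In other words, $b'_i = b_i + b_{i+l_1}$ if $i < l_2 - l_1$ and $b'_i = b_i$ if $i \ge l_2 - l_1$.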
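The conversion rule $b'_i = b_i + b_{i+l_1}$ for $i < l_2 - l_1$ (and $b'_i = b_i$ otherwise) can be checked against the definitions of the two bases in (1). The Python sketch below is illustrative only; the function names are hypothetical and basis elements are encoded as integers whose bit $d$ is the coefficient of $X^d$.

```python
def basis_pair(n, k1, k2):
    """Bit masks of the bases B and B' of equation (1)."""
    l1, l2 = n - k1, n - k2
    B, Bp = [], []
    for i in range(n):
        e = 1 << i
        if i >= l1:
            e ^= 1 << (i - l1)
        if i >= l2:
            e ^= 1 << (i - l2)
        B.append(e)
        Bp.append(e if i >= l2 else 1 << i)   # e'_i = X^i below l2, e_i above
    return B, Bp


def expand(coords, basis):
    """Polynomial (bit mask) having the given coordinates in the given basis."""
    p = 0
    for c, e in zip(coords, basis):
        if c:
            p ^= e
    return p


def to_b_prime(b, n, k1, k2):
    """Coordinates in B' of the element whose coordinates in B are b."""
    l1, l2 = n - k1, n - k2
    return [b[i] ^ b[i + l1] if i < l2 - l1 else b[i] for i in range(n)]
```

For $n = 9$, $k_1 = 6$, $k_2 = 4$, the element $e_3 = X^3 + 1$ has $\mathcal{B}$-coordinates with a single 1 at index 3, and `to_b_prime` returns ones at indices 0 and 3, i.e. $e'_0 + e'_3$. The conversion uses $l_2 - l_1$ XOR operations.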
B. Conversion from $\mathcal{B}'$ to $\mathcal{B}$
The reverse conversion is not required in our double basis multiplier, but for completeness we present
it here. We need to express $b_i$ in terms of the $b'_j$. From the previous conversion from $\mathcal{B}$ to $\mathcal{B}'$ we know that
$$\begin{array}{ll}
b_i = b'_i & \text{if } i \ge l_2 - l_1, \\
b_i = b'_i + b_{i+l_1} & \text{if } i < l_2 - l_1.
\end{array}$$
Thus we need to replace $b_{i+l_1}$ by its own expression in terms of the $b'_j$:
$$\begin{array}{ll}
b_i = b'_i + b'_{i+l_1} & \text{if } i + l_1 \ge l_2 - l_1, \\
b_i = b'_i + b'_{i+l_1} + b_{i+2l_1} & \text{if } i + l_1 < l_2 - l_1.
\end{array}$$
For the last case, we have to repeat the process with $b_{i+2l_1}$. If we expand the recursion we obtain the
following formula for $i < l_2 - l_1$:
$$b_i = b'_i + b'_{i+l_1} + \cdots + b'_{i+\lambda l_1} \quad \text{with} \quad \lambda = \left\lceil \frac{l_2 - l_1 - i}{l_1} \right\rceil.$$
Consequently, if such a conversion is needed, it is better to take a quadrinomial such that $l_1 > l_2 - l_1$, since
in this case $\lambda$ is at most equal to 1 and the conversion requires at most $l_2 - l_1$ XOR operations. Most of the
NAOPs given in Table I satisfy $l_1 > l_2 - l_1$, and thus conversions between $\mathcal{B}$ and $\mathcal{B}'$ can be done using the
formulas of the current section.
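The reverse conversion can be sketched in Python by back-substitution, processing the indices from high to low so that $b_{i+l_1}$ is already known when $b_i$ is computed. The sketch is illustrative and the function names are hypothetical; the second function evaluates the expanded sum with $\lambda = \lceil (l_2 - l_1 - i)/l_1 \rceil$ (taken as 0 when $i \ge l_2 - l_1$, an assumption made here so that one expression covers all indices).

```python
def to_b_prime(b, l1, l2):
    """Forward conversion of Subsection III-A."""
    return [b[i] ^ b[i + l1] if i < l2 - l1 else b[i] for i in range(len(b))]


def from_b_prime(bp, l1, l2):
    """Reverse conversion by back-substitution: b_i = b'_i + b_{i+l1} for i < l2-l1."""
    b = list(bp)
    for i in range(l2 - l1 - 1, -1, -1):   # highest index first
        b[i] ^= b[i + l1]                  # b[i + l1] is already final here
    return b


def from_b_prime_closed(bp, l1, l2):
    """Same conversion via the expanded sum b'_i + b'_{i+l1} + ... + b'_{i+lambda*l1}."""
    out = []
    for i in range(len(bp)):
        lam = max(0, -(-(l2 - l1 - i) // l1))   # ceiling division, clamped at 0
        acc = 0
        for j in range(lam + 1):
            acc ^= bp[i + j * l1]
        out.append(acc)
    return out
```

When $l_1 > l_2 - l_1$ the inner loop of the closed form has at most two terms, which matches the bound of $l_2 - l_1$ XOR operations stated above. The back-substitution version also covers quadrinomials with $l_1 \le l_2 - l_1$, at the cost of a longer dependency chain.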
IV. CONSTRUCTION OF THE MATRIX $M_A$
Let $M_A = \begin{bmatrix} A^{(0)}, A^{(1)}, \ldots, A^{(n-1)} \end{bmatrix}$ be the matrix of the double basis multiplier associated with the
bases $\mathcal{B}$ and $\mathcal{B}'$ defined in (1). The $n \times n$ matrix $M_A$ can be generated column by column starting from
the left end as follows.
A. Generating the first $l_2$ columns of $M_A$
First, note that $A^{(0)} = A$ and for $1 \le i \le l_2 - 1$ one can write
$$A^{(i)} = A e'_i = A X^i = A X^{i-1} X = A^{(i-1)} X \bmod Q.$$
If we call $a^{(i)}_j$ the $j$-th coefficient in $\mathcal{B}$ of $A^{(i)}$, the previous equation becomes
$$A^{(i)} = \sum_{j=0}^{n-1} a^{(i-1)}_j e_j X.$$
We need to express $e_j X$ in $\mathcal{B}$, and this is the goal of the following lemma.
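Lemma 1: Let $\mathcal{B} = \{e_0, \ldots, e_{n-1}\}$ be as defined in (1). Then the following equation holds:
$$e_i X = \begin{cases}
e_{i+1} & \text{if } i \ne l_1 - 1,\ l_2 - 1,\ n - 1, \\
e_{i+1} + e_0 & \text{if } i = l_1 - 1 \text{ or } i = l_2 - 1, \\
e_0 & \text{if } i = n - 1.
\end{cases}$$
Proof: Suppose that $i \ne n - 1,\ l_1 - 1,\ l_2 - 1$, thus $i + 1 \ne n,\ l_1,\ l_2$. We recall that
$$e_i = \begin{cases}
X^i & \text{if } 0 \le i < l_1 - 1, \\
X^i + X^{i-l_1} & \text{if } l_1 \le i < l_2 - 1, \\
X^i + X^{i-l_1} + X^{i-l_2} & \text{if } l_2 \le i < n - 1.
\end{cases}$$
Thus if we multiply these expressions by $X$, we get
$$e_i X = \begin{cases}
X^{i+1} = e_{i+1} & \text{if } 0 \le i < l_1 - 1, \\
X^{i+1} + X^{i+1-l_1} = e_{i+1} & \text{if } l_1 \le i < l_2 - 1, \\
X^{i+1} + X^{i+1-l_1} + X^{i+1-l_2} = e_{i+1} & \text{if } l_2 \le i < n - 1.
\end{cases}$$
For the special cases $i \in \{l_1 - 1,\ l_2 - 1,\ n - 1\}$, we have
$$\begin{array}{l}
e_{l_1-1} X = X^{l_1-1} X = (X^{l_1} + 1) + 1 = e_{l_1} + e_0, \\
e_{l_2-1} X = (X^{l_2-1} + X^{l_2-1-l_1}) X = (X^{l_2} + X^{l_2-l_1} + 1) + 1 = e_{l_2} + e_0, \\
e_{n-1} X = X^n + X^{k_2} + X^{k_1} = 1 \bmod Q = e_0.
\end{array}$$
This completes the proof.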
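Lemma 1 can also be checked numerically for a given quadrinomial. The Python sketch below is illustrative only (hypothetical function names, polynomials encoded as integers with bit $d$ for $X^d$); it compares $e_i X \bmod Q$ with the right-hand side of the lemma for every $i$.

```python
def basis_b(n, k1, k2):
    """Bit masks of e_0, ..., e_{n-1} from equation (1)."""
    l1, l2 = n - k1, n - k2
    out = []
    for i in range(n):
        e = 1 << i
        if i >= l1:
            e ^= 1 << (i - l1)
        if i >= l2:
            e ^= 1 << (i - l2)
        out.append(e)
    return out


def times_x(p, n, k1, k2):
    """p * X mod Q with Q = X^n + X^k1 + X^k2 + 1; p must have degree < n."""
    p <<= 1
    if (p >> n) & 1:
        p ^= (1 << n) | (1 << k1) | (1 << k2) | 1
    return p


def lemma1_rhs(i, e, n, k1, k2):
    """Right-hand side of Lemma 1 for e_i * X, as a bit mask."""
    l1, l2 = n - k1, n - k2
    if i == n - 1:
        return e[0]
    if i in (l1 - 1, l2 - 1):
        return e[i + 1] ^ e[0]
    return e[i + 1]


def check_lemma1(n, k1, k2):
    e = basis_b(n, k1, k2)
    return all(times_x(e[i], n, k1, k2) == lemma1_rhs(i, e, n, k1, k2)
               for i in range(n))
```

For instance `check_lemma1(9, 6, 4)` covers the quadrinomial $X^9+X^6+X^4+1$ of Section V.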
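Now we can write
$$A^{(i)} = \sum_{j=0}^{n-1} a^{(i-1)}_j e_j X
= \left( \sum_{j=0}^{n-2} a^{(i-1)}_j e_{j+1} \right) + a^{(i-1)}_{n-1} e_0 + a^{(i-1)}_{l_1-1} e_0 + a^{(i-1)}_{l_2-1} e_0.$$
This expression enables us to compute $A^{(i)}$ as illustrated below:
$$\begin{array}{llll}
A^{(0)} & A^{(1)} & A^{(2)} & \cdots \\
\hline
a_0 & a_{n-1} + a_{l_1-1} + a_{l_2-1} & a_{n-2} + a_{l_1-2} + a_{l_2-2} & \cdots \\
a_1 & a_0 & a_{n-1} + a_{l_1-1} + a_{l_2-1} & \cdots \\
a_2 & a_1 & a_0 & \cdots \\
\vdots & \vdots & \vdots & \\
a_{n-1} & a_{n-2} & a_{n-3} & \cdots
\end{array}$$
This gives the columns $A^{(i)}$ of $M_A$ for $i < l_2$. Now consider how to express $A^{(i)}$ for $l_2 \le i \le n - 1$.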
B. Generating the last $k_2$ columns of $M_A$
We remark that $A^{(i)} = A (X^i + X^{i-l_1} + X^{i-l_2})$ for $i = l_2, l_2 + 1, \ldots, n - 1$. We extend this
definition to $i = n$, i.e., $A^{(n)} = A (X^n + X^{n-l_1} + X^{n-l_2})$. Thus, if we factor $X$
out from the right side we obtain, for $i = l_2 + 1, \ldots, n$,
$$A^{(i)} = A (X^{i-1} + X^{i-1-l_1} + X^{i-1-l_2}) X = A^{(i-1)} X,$$
which can be rewritten as
$$A^{(i-1)} = A^{(i)} X^{-1}.$$
Again if we replace $A^{(i)}$ by its expression in $\mathcal{B}$ we get
$$A^{(i-1)} = \sum_{j=0}^{n-1} a^{(i)}_j e_j X^{-1}.$$
To proceed, we need to compute $e_j X^{-1}$, and this is done in the following lemma.
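Lemma 2: Let $\mathcal{B} = \{e_0, \ldots, e_{n-1}\}$ be the basis of $\mathbb{F}_2[X]/(Q(X))$ given in (1). Then we have
$$e_i X^{-1} = \begin{cases}
e_{i-1} & \text{if } i \ne 0,\ l_1,\ l_2, \\
e_{i-1} + e_{n-1} & \text{if } i = l_1 \text{ or } i = l_2, \\
e_{n-1} & \text{if } i = 0.
\end{cases} \tag{4}$$
Proof: We first deal with the cases $i \ne 0,\ l_1,\ l_2$. We have
$$e_i = \begin{cases}
X^i & \text{if } 0 < i < l_1, \\
X^i + X^{i-l_1} & \text{if } l_1 < i < l_2, \\
X^i + X^{i-l_1} + X^{i-l_2} & \text{if } l_2 < i < n.
\end{cases}$$
If we multiply $e_i$ by $X^{-1}$ we get
$$e_i X^{-1} = \begin{cases}
X^{i-1} = e_{i-1} & \text{if } 0 < i < l_1, \\
X^{i-1} + X^{i-l_1-1} = e_{i-1} & \text{if } l_1 < i < l_2, \\
X^{i-1} + X^{i-l_1-1} + X^{i-l_2-1} = e_{i-1} & \text{if } l_2 < i < n,
\end{cases}$$
as required.
Now we assume that $i = 0$, $l_1$ or $l_2$. To prove these three cases we use $e_{n-1} = X^{-1} \bmod Q$; indeed
$$e_{n-1} X = (X^{n-1} + X^{n-1-l_1} + X^{n-1-l_2}) X = X^n + X^{k_1} + X^{k_2} = 1 \bmod Q.$$
Thus $e_0 X^{-1} = X^{-1} = e_{n-1} \bmod Q$. Similarly, we have $e_{l_1} X^{-1} = X^{l_1-1} + X^{-1} = e_{l_1-1} + e_{n-1}$, and
$e_{l_2} X^{-1} = e_{l_2-1} + e_{n-1}$.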
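Since $X$ is invertible modulo $Q$ (the constant term of $Q$ is 1), Lemma 2 can be checked by multiplying its right-hand side back by $X$ and comparing with $e_i$. The Python sketch below is illustrative only, with hypothetical function names and polynomials encoded as integers (bit $d$ for $X^d$).

```python
def basis_b(n, k1, k2):
    """Bit masks of e_0, ..., e_{n-1} from equation (1)."""
    l1, l2 = n - k1, n - k2
    out = []
    for i in range(n):
        e = 1 << i
        if i >= l1:
            e ^= 1 << (i - l1)
        if i >= l2:
            e ^= 1 << (i - l2)
        out.append(e)
    return out


def times_x(p, n, k1, k2):
    """p * X mod Q with Q = X^n + X^k1 + X^k2 + 1; p must have degree < n."""
    p <<= 1
    if (p >> n) & 1:
        p ^= (1 << n) | (1 << k1) | (1 << k2) | 1
    return p


def lemma2_rhs(i, e, n, k1, k2):
    """Right-hand side of equation (4) for e_i * X^{-1}, as a bit mask."""
    l1, l2 = n - k1, n - k2
    if i == 0:
        return e[n - 1]
    if i in (l1, l2):
        return e[i - 1] ^ e[n - 1]
    return e[i - 1]


def check_lemma2(n, k1, k2):
    # y = e_i * X^{-1} is the unique element with y * X = e_i (mod Q)
    e = basis_b(n, k1, k2)
    return all(times_x(lemma2_rhs(i, e, n, k1, k2), n, k1, k2) == e[i]
               for i in range(n))
```

The check relies on the uniqueness of the inverse image under multiplication by $X$, so it avoids computing $X^{-1}$ explicitly.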
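For $A^{(i-1)}$, we get
$$A^{(i-1)} = \sum_{j=0}^{n-1} a^{(i)}_j e_j X^{-1}
= \sum_{j=1}^{n-1} a^{(i)}_j e_{j-1} + a^{(i)}_0 e_{n-1} + a^{(i)}_{l_2} e_{n-1} + a^{(i)}_{l_1} e_{n-1}.$$
We can thus compute $A^{(i)}$ for $i = n - 1, \ldots, l_2$ recursively, beginning with $A^{(n)}$ and repeatedly multiplying by
$X^{-1}$. For $A^{(n)}$ we have
$$A^{(n)} = A \cdot (X^n + X^{n-l_1} + X^{n-l_2}) = A \cdot (X^n + X^{k_1} + X^{k_2}) = A,$$
since $X^n + X^{k_1} + X^{k_2} = 1 \bmod Q$. We can then compute $A^{(n-1)} = A X^{-1}$, $A^{(n-2)} = A^{(n-1)} X^{-1}, \ldots$
as shown below:
$$\begin{array}{llll}
\cdots & A^{(n-2)} & A^{(n-1)} & A \\
\hline
\cdots & a_2 & a_1 & a_0 \\
\cdots & a_3 & a_2 & a_1 \\
\cdots & a_4 & a_3 & a_2 \\
 & \vdots & \vdots & \vdots \\
\cdots & a_0 + a_{l_1} + a_{l_2} & a_{n-1} & a_{n-2} \\
\cdots & a_1 + a_{l_1+1} + a_{l_2+1} & a_0 + a_{l_1} + a_{l_2} & a_{n-1}
\end{array}$$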
We easily remark that the matrix
$$T_A = [A^{(l_2)}, A^{(l_2+1)}, \ldots, A^{(n-1)}, A^{(0)}, A^{(1)}, \ldots, A^{(l_2-1)}] \tag{5}$$
is a Toeplitz matrix. In addition, the first $k_2$ columns (resp. the last $l_2 = n - k_2$ columns) of $T_A$ are the
last $k_2$ columns (resp. the first $n - k_2$ columns) of $M_A = [A^{(0)}, A^{(1)}, \ldots, A^{(n-1)}]$ from (3). Similarly we
denote
$$\widetilde{B} = \begin{bmatrix} b'_{l_2} & b'_{l_2+1} & \cdots & b'_{n-1} & b'_0 & b'_1 & \cdots & b'_{l_2-1} \end{bmatrix}^{T}$$
such that the first $k_2$ (resp. the last $n - k_2$) coefficients are equal to the last $k_2$ (resp. the first $n - k_2$)
coefficients of $B$ in $\mathcal{B}'$. In this situation, the coordinates in $\mathcal{B}$ of $C = AB$ are given by
$$C = T_A \cdot \widetilde{B}. \tag{6}$$
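The construction can be checked end to end in software. The following Python sketch is illustrative only: it builds the columns $A^{(j)} = A e'_j \bmod Q$ directly from the definition rather than by the shift recursions, applies the column permutation (5), and evaluates (6). All function names are hypothetical, and polynomials are encoded as integers (bit $d$ for $X^d$).

```python
def basis_pair(n, k1, k2):
    """Bit masks of the bases B and B' of equation (1)."""
    l1, l2 = n - k1, n - k2
    B, Bp = [], []
    for i in range(n):
        e = 1 << i
        if i >= l1:
            e ^= 1 << (i - l1)
        if i >= l2:
            e ^= 1 << (i - l2)
        B.append(e)
        Bp.append(e if i >= l2 else 1 << i)
    return B, Bp


def expand(coords, basis):
    """Polynomial (bit mask) having the given coordinates in the given basis."""
    p = 0
    for c, e in zip(coords, basis):
        if c:
            p ^= e
    return p


def coords_in_B(p, B):
    """Coordinates of p in B; valid because e_i = X^i + lower-degree terms."""
    out = [0] * len(B)
    for i in range(len(B) - 1, -1, -1):
        if (p >> i) & 1:
            out[i] = 1
            p ^= B[i]
    return out


def mulmod(a, b, q, n):
    """a * b mod q over GF(2); a must have degree < n = deg(q)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:
            a ^= q
    return r


def build_TA(a, n, k1, k2):
    """Rows of T_A for the element A with B-coordinates a (list of n bits)."""
    B, Bp = basis_pair(n, k1, k2)
    q = (1 << n) | (1 << k1) | (1 << k2) | 1
    a_poly = expand(a, B)
    cols = [coords_in_B(mulmod(a_poly, e, q, n), B) for e in Bp]  # columns of M_A
    l2 = n - k2
    order = list(range(l2, n)) + list(range(l2))                  # permutation (5)
    return [[cols[j][i] for j in order] for i in range(n)]


def is_toeplitz(T):
    n = len(T)
    return all(T[i][j] == T[i - 1][j - 1] for i in range(1, n) for j in range(1, n))


def multiply(a, b_prime, n, k1, k2):
    """B-coordinates of C = A*B computed as T_A * B~ (equation (6))."""
    l2 = n - k2
    b_tilde = b_prime[l2:] + b_prime[:l2]
    return [sum(t & b for t, b in zip(row, b_tilde)) & 1
            for row in build_TA(a, n, k1, k2)]
```

In a hardware design the product `T_A * B~` would be evaluated with the subquadratic splits of Subsection II-B; the quadratic inner product above is used only to keep the check short.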
V. EXAMPLE
We use the irreducible NAOP $P = X^8 + X^7 + X^6 + X^3 + X^2 + X + 1$ to define the field $\mathbb{F}_{2^8}$. We
construct the matrix $M_A$ of an element $A$ modulo $Q = P \cdot (X + 1) = X^9 + X^6 + X^4 + 1$. The basis $\mathcal{B}$
of $\mathbb{F}_2[X]/(Q)$ is defined by the 9 elements
$$\begin{array}{l}
e_0 = 1,\quad e_1 = X,\quad e_2 = X^2, \\
e_3 = X^3 + 1,\quad e_4 = X^4 + X, \\
e_5 = X^5 + X^2 + 1,\quad e_6 = X^6 + X^3 + X, \\
e_7 = X^7 + X^4 + X^2,\quad e_8 = X^8 + X^5 + X^3.
\end{array}$$
In this situation, we have
$$l_1 = 3,\quad l_2 = 5,\quad k_1 = 6,\quad k_2 = 4.$$
Let $A = \sum_{i=0}^{8} a_i e_i$ be expressed in $\mathcal{B}$. We first construct the columns $A^{(i)}$ for $i = 0, \ldots, l_2 - 1$ as in
Subsection IV-A.
A(0) | A(1)         | A(2)         | A(3)         | A(4)
a0   | a8 + a4 + a2 | a7 + a3 + a1 | a6 + a2 + a0 | a5 + a1 + a8 + a4 + a2
a1   | a0           | a8 + a4 + a2 | a7 + a3 + a1 | a6 + a2 + a0
a2   | a1           | a0           | a8 + a4 + a2 | a7 + a3 + a1
a3   | a2           | a1           | a0           | a8 + a4 + a2
a4   | a3           | a2           | a1           | a0
a5   | a4           | a3           | a2           | a1
a6   | a5           | a4           | a3           | a2
a7   | a6           | a5           | a4           | a3
a8   | a7           | a6           | a5           | a4

Similarly we compute the other part as explained in Subsection IV-B.

A(5)         | A(6)         | A(7)         | A(8)         | A(0)
a4           | a3           | a2           | a1           | a0
a5           | a4           | a3           | a2           | a1
a6           | a5           | a4           | a3           | a2
a7           | a6           | a5           | a4           | a3
a8           | a7           | a6           | a5           | a4
a0 + a3 + a5 | a8           | a7           | a6           | a5
a1 + a4 + a6 | a0 + a3 + a5 | a8           | a7           | a6
a2 + a5 + a7 | a1 + a4 + a6 | a0 + a3 + a5 | a8           | a7
a3 + a6 + a8 | a2 + a5 + a7 | a1 + a4 + a6 | a0 + a3 + a5 | a8
VI. MULTIPLIER ARCHITECTURE
In this section, we present several multiplier architectures associated with the Toeplitz matrix vector
expression in (6). Multiplier architectures can be classified into two types:
• Parallel architecture. Computations are done with no reuse of circuits and all the bits of the product
$C = AB$ are output at each clock cycle.
• Sequential architecture. Computations are done with reuse of circuits and the result is obtained after
$n$ clock cycles. There are mainly two types of sequential multipliers: one which outputs one bit per
clock cycle, and the other which outputs all $n$ bits (in parallel) only at the end of $n$ clock cycles. In
this paper the former will be referred to as sequential multiplier with bit serial output (SMSO) and
the latter as sequential multiplier with bit parallel output (SMPO).
A. Parallel architecture
Here we present a parallel architecture associated with the double basis approach. This architecture is
sketched in Figure 1. It consists of two preliminary parallel computations followed by a Toeplitz matrix
vector product. The preliminary step computes the entries of the matrix $T_A$ from the coordinates of $A$
and in parallel performs the conversion of $B$ from $\mathcal{B}$ to $\mathcal{B}'$. For the former, we use the expression of the
columns $A^{(i)}$ given in Section IV. Let $a^{(i)}_j$ denote the $j$-th entry of $A^{(i)}$, with entries numbered from 0. Then we have
$$a^{(i)}_0 = a_{n-i} + a_{l_2-i} + a_{l_1-i}, \quad 1 \le i \le l_1, \tag{7}$$
$$a^{(i)}_0 = a^{(i-1)}_{l_1-1} + a_{n-i} + a_{l_2-i}, \quad l_1 < i < l_2, \tag{8}$$
$$a^{(n-i)}_{n-1} = a_{i-1} + a_{i-1+l_2} + a_{i-1+l_1}, \quad 1 \le i \le k_2. \tag{9}$$
To compute the coordinates $b'_i$, $i = 0, \ldots, n - 1$, we only need to apply the formula given in Subsection III-A.
We evaluate the complexity of this architecture below.
Fig. 1. Parallel architecture for double basis multiplication modulo a quadrinomial.
[Block diagram: the inputs $b_0, b_1, \ldots, b_{n-1}$ pass through the basis conversion block to give $b'_0, b'_1, \ldots, b'_{n-1}$; the inputs $a_0, a_1, \ldots, a_{n-1}$ pass through the matrix entries computation block to give the entries of $T_A$; both feed the subquadratic Toeplitz matrix-vector multiplier, which outputs $c_0, c_1, \ldots, c_{n-1}$.]
• Space complexity. It includes the space (i.e., logic gates) needed for the precomputations as well as
the matrix-vector product. Using (7), (8) and (9), we deduce that the precomputations require
$$\underbrace{(2(l_2 - 1) + 2k_2)}_{\text{computation of } T_A} + \underbrace{(l_2 - l_1)}_{\text{conversion of } B} \ \text{XOR gates.}$$
We already know the space complexity of the matrix vector product, which is given in Table III.
• Time complexity. The delay of the critical path is equal to the delay of the preliminary computations
plus the delay of the Toeplitz matrix-vector product part. The critical path for the precomputations
corresponds to the computation of Eq. (8), and the corresponding delay to compute
the coefficients of $T_A$ is equal to $3D_X$.
Remark 1: The parallel architecture in Figure 1 has a small improvement in the complexity of the
multiplier when $k_1 = k_2 + 1$. In this case, all the entries of $T_A$ are computed with (7) and (9). Consequently
the space complexity required for the computation of the entries of $T_A$ is equal to $2(n - 1)$ XOR gates
and the time delay is $2D_X$.
For a fixed field $\mathbb{F}_{2^{n-1}}$ we give in Table IV the complexity results of the two-way and three-way
approaches for the double basis multiplier. We also give the complexities of two-way and three-way
splitting multipliers based on [4] for almost irreducible trinomials [1] $X^{n-1+\delta} + X^k + 1$, also called redundant
trinomials in [3]. These redundant trinomials admit an irreducible factor of degree $(n - 1)$ which defines
the field $\mathbb{F}_{2^{n-1}}$. The authors of [1] proposed several algorithms to find almost irreducible trinomials with
small $\delta$.

TABLE IV
COMPLEXITY COMPARISON FOR FIELD $\mathbb{F}_{2^{n-1}}$

Method | Splitting | # XOR | # AND | Delay
This paper | Two-way | $5.5\,n^{\log_2 3} - 3n + 0.5$ | $n^{\log_2 3}$ | $D_A + (2\log_2(n) + 3)D_X$
Pentanomial [4] | Two-way | $5.5\,(n-1)^{\log_2 3} - 3n + 4$ | $(n-1)^{\log_2 3}$ | $D_A + (2\log_2(n-1) + 4)D_X$
Redundant trinomial [4], [1], [3] | Two-way | $5.5\,(n-1+\delta)^{\log_2 3} - 6[\ldots] + k + 5$ | $(n-1+\delta)^{\log_2 3}$ | $D_A + (2\log_2(n-1+\delta) + 1)D_X$
This paper | Three-way | $\frac{24}{5} n^{\log_3 6} - 2n + \frac{1}{5}$ | $n^{\log_3 6}$ | $D_A + (3\log_3(n) + 2)D_X$
Pentanomial [4] | Three-way | $\frac{24}{5} (n-1)^{\log_3 6} - 2(n-1) + \frac{7}{10}$ | $(n-1)^{\log_3 6}$ | $D_A + (3\log_3(n-1) + 3)D_X$
Redundant trinomial [4], [1], [3] | Three-way | $\frac{72}{15} (n-1+\delta)^{\log_3 6} - 4(n-1+\delta) - \frac{4}{5}$ | $(n-1+\delta)^{\log_3 6}$ | $D_A + (3\log_3(n-1+\delta) + 1)D_X$
Sunar [12] | Winograd | $6\,(n-1)^{\log_2 3} - 8n + 10$ | $(n-1)^{\log_2 3}$ | $D_A + (2\log_2(n-1) + 1)D_X$
Sunar [12] | Winograd | $\frac{16}{3} (n-1)^{\log_3 6} - \frac{22}{3}(n-1) + 2$ | $(n-1)^{\log_3 6}$ | $D_A + (4\log_3(n-1) + 1)D_X$
In Table IV we also give the complexity of the multiplier of [4] for specific pentanomials and the
complexity of the multiplier of Sunar [12]. The complexity results given in Table IV are valid if $n$ (resp.
$n - 1$, $n - 1 + \delta$) is a power of 2 or 3. Some combination of two-way and three-way approaches could
also be used, which extends the use of TMVP multiplication to sizes $n = 2^i 3^j$.
In this situation, we can see that the quadrinomial with double basis approach gives a parallel multiplier with
the same order of space complexity as the multiplier of Fan and Hasan for pentanomials, but with an improvement
of $D_X$ ($2D_X$ for special quadrinomials) in the delay of the multiplier. The redundant trinomial is better
when $\delta$ is small; when $\delta$ is large, its space complexity is higher.
For a general $n$, there are several strategies to obtain a subquadratic multiplier using the Toeplitz matrix.
Using a small example, we present below two possible approaches: the first one enlarges the dimension
of the Toeplitz matrix to the nearest $2^i 3^j$ bigger than $(n - 1)$; the second one decomposes the matrix
into a number of Toeplitz blocks of size $2^i 3^j$.
Example 1: Here we study subquadratic space complexity multipliers for $n \ne 2^i 3^j$. We do it for
the field $\mathbb{F}_{2^{235}}$, first assuming double basis with a quadrinomial and then redundant trinomials.
• Quadrinomial approach. We represent the field as a subring of $\mathbb{F}_2[X]/(X^{237} + X^2 + X + 1)$. In this
situation, a multiplication of two elements is expressed as $T_A \cdot \widetilde{B}$ where $T_A$ is a $237 \times 237$ matrix.
Since 237 cannot be decomposed as a product of powers of 2 and 3, we expand the matrix $T_A$ to a $243 \times 243$
Toeplitz matrix $T'_A$.
Fig. 2. Redundant quadrinomial architecture for $\mathbb{F}_{2^{235}}$.
[Diagram: the $237 \times 237$ matrix $T_A$ and the vector $\widetilde{B}$ are extended to the $243 \times 243$ matrix $T'_A$ and the vector $\widetilde{B}'$ with appended zeros.]
We also append 6 zeros to $\widetilde{B}$ to obtain $\widetilde{B}'$. The product of $A$ and $B$ is now given by the first 237
coefficients of $T'_A \cdot \widetilde{B}'$, and the three-way splitting approach can be applied since $243 = 3^5$. The
resulting multiplier has a delay of $17D_X + D_A$ and a space complexity of 36586 XOR and 7776
AND gates.
• Almost irreducible trinomial approach. We represent the field $\mathbb{F}_{2^{235}}$ as a subring of $\mathbb{F}_2[X]/(X^{249} +
X^7 + 1)$. The polynomial $X^{249} + X^7 + 1$ is the smallest degree trinomial which has an irreducible factor
of degree 235. Following [4], the multiplication of two elements $A$ and $B$ can be
expressed as a Toeplitz matrix vector product $T_A \cdot B$. In order to use the three-way splitting approach
we propose to split the matrix $T_A$ into three blocks (cf. Fig. 3).
Fig. 3. Redundant trinomial architecture for $\mathbb{F}_{2^{235}}$.
[Diagram: the matrix $T_A$ of size 249 is split into blocks $T_0$, $T_1$ and $T_2$, with dimension labels 243 and 6, acting on $B_0$, $B_1$ and $B$; $T_0$ uses a subquadratic space TMVP, while the other two blocks use parallel AND gates and a binary tree of XORs.]
The computation of $T_A \cdot B$ is decomposed into $T_0 \cdot B_0$, $T_1 \cdot B_1$ and $T_2 \cdot B$. We can perform these
three TMVPs in parallel. This approach is depicted in Fig. 3. The computation of $T_0 \cdot B_0$ is done
using three-way split subquadratic multiplication. For both $T_1 \cdot B_1$ and $T_2 \cdot B$ we use AND gates
in parallel followed by XOR gates arranged in binary tree fashion. Other multiplication strategies
could be applied, for example for $T_2 \cdot B$, by splitting $T_2$ into several $6 \times 6$ Toeplitz matrices, computing
each product in parallel and adding the results with a binary tree of XORs. But this would not be
advantageous considering both space and time complexities, since the size of the Toeplitz matrices is
really small. The resulting space complexity of this architecture is equal to 39061 XOR and 10728
AND gates and the delay is $17D_X + D_A$.
To conclude this example, we point out that the construction of the subquadratic multiplier for
degree $n \ne 2^i 3^j$ involves some decomposition or expansion of the Toeplitz matrices. The consequence is
that it requires some additional computation, and the choice between quadrinomial and redundant
trinomial depends strongly on which of them admits the best decomposition or expansion.
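As a sanity check on the orders of magnitude quoted in Example 1, the closed forms of Table III can be evaluated exactly for $n = 2^k$ and $n = 3^k$. The Python sketch below is illustrative (hypothetical function names); it covers only the Toeplitz matrix-vector product itself and does not model the precomputation gates or delays that the totals in the example also include.

```python
def two_way_cost(k):
    """Table III, two-way split, for n = 2**k: (# AND, # XOR, XOR levels)."""
    n = 2 ** k
    ands = 3 ** k                           # n^{log2 3}
    xors = (11 * 3 ** k - 12 * n + 1) // 2  # 5.5 n^{log2 3} - 6n + 0.5
    return ands, xors, 2 * k                # delay is D_A + 2 log2(n) D_X


def three_way_cost(k):
    """Table III, three-way split, for n = 3**k: (# AND, # XOR, XOR levels)."""
    n = 3 ** k
    ands = 6 ** k                           # n^{log3 6}
    xors = (24 * 6 ** k - 25 * n + 1) // 5  # 24/5 n^{log3 6} - 5n + 1/5
    return ands, xors, 3 * k                # delay is D_A + 3 log3(n) D_X
```

For $n = 243 = 3^5$, `three_way_cost(5)` gives 7776 AND gates, which matches the AND count quoted for the quadrinomial approach, and 15 XOR levels for the TMVP; adding the $2D_X$ of Remark 1 for the precomputation gives the quoted $17D_X$.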
B. Sequential multiplier with bit serial output (SMSO)
This multiplier is based on (6), where the $i$-th coordinate of $C$ is obtained as the inner product of the
$i$-th row of $T_A$ and the vector $\widetilde{B}$. We remark that if we begin from the $l_2$-th row of $T_A$, we have
$$R_{l_2} = \begin{bmatrix} a_{n-1} & a_{n-2} & \cdots & a_0 \end{bmatrix},$$
and $R_{l_2+1}$ is generated from $R_{l_2}$ by performing a right shift and placing $a_0 + a_{l_1} + a_{l_2}$ into the resulting
empty position at the leftmost end. Then $R_{l_2+2}$ is generated from $R_{l_2+1}$ in the same way, and so on.
This process can be performed by applying right shifts to a bidirectional linear feedback shift register
(bLFSR) as shown in the lower part of Figure 4. The bidirectional feature implies that the feedback shift
register can be shifted in either the left or the right direction as desired.
A hardware structure for such a bLFSR has been presented in [5]. Specifically, the implementation of
the modulo two addition of the linear relation for both left and right shifts requires an XOR gate and
some additional hardware which manages the change of direction of the output (see Fig. 5, where the control signal $L$
determines the direction of data flow). The same is true for the storage cells. More details about bLFSRs
can be found in [5].
Fig. 4. Sequential multiplier with bit serial output.
[Diagram: a bLFSR holding $a_{n-1}, \ldots, a_{l_2+1}, a_{l_2}, a_{l_2-1}, \ldots, a_{l_1}, a_{l_1-1}, \ldots, a_0$, whose cells are combined with the coefficients $b'_{l_2}, \ldots, b'_{n-1}, b'_0, \ldots, b'_{l_2-1}$ and fed to an XOR tree.]
Fig. 5. Bi-directional storage cell and modulo two adder [5].
We remark also that if we begin with $R_{l_2}$, we can shift the bLFSR contents to the left and place
$a_{n-1} + a_{l_1-1} + a_{l_2-1}$ into the rightmost cell to obtain $R_{l_2-1}$. The rows $R_{l_2-2}, \ldots, R_0$ can be computed in
the same way. So using the left shift property of the bLFSR we can produce the first $l_2$ rows of $T_A$ and
thus compute the corresponding coefficients of $C$.
Consequently, we propose an original bit serial multiplier which uses these properties. This bit serial
multiplier is depicted in Figure 4 and its operation has the following four steps.
1) Load the register with $A$.
2) Apply right shifts to the bLFSR to output the bits $c_{l_2}, c_{l_2+1}, \ldots, c_{n-1}$.
3) Reload the register with $A$.
4) Apply $l_2 + 1$ left shifts to the bLFSR to output the bits $c_{l_2}, c_{l_2-1}, \ldots, c_0$.
We remark that the bit $c_{l_2}$ is output twice, since the rows in the upper and the lower parts of $T_A$ are
generated from the same row $R_{l_2}$. One of these bits should be removed.
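The row generation by shifts can be simulated in software and compared with a direct construction of $T_A$. The Python sketch below is illustrative and makes several assumptions that are not in the text: rows and register positions are numbered from 0, so the row $(a_{n-1}, \ldots, a_0)$ has index $l_2 - 1$; the feedback taps are expressed as register positions ($n-1$, $n-1-l_1$, $n-1-l_2$ for right shifts and $0$, $n-l_2$, $n-l_1$ for left shifts), which is how the relations above read once the register has been shifted; and all function names are hypothetical.

```python
def ta_rows_by_shifts(a, n, k1, k2):
    """Rows 0..n-1 of T_A produced by right/left shifts of a register loaded with A."""
    l1, l2 = n - k1, n - k2
    start = list(reversed(a))               # (a_{n-1}, ..., a_0): row index l2-1
    rows = {l2 - 1: start}
    reg = start
    for r in range(l2, n):                  # right shifts give the rows below
        fb = reg[n - 1] ^ reg[n - 1 - l1] ^ reg[n - 1 - l2]
        reg = [fb] + reg[:-1]
        rows[r] = reg
    reg = start
    for r in range(l2 - 2, -1, -1):         # left shifts give the rows above
        fb = reg[0] ^ reg[n - l2] ^ reg[n - l1]
        reg = reg[1:] + [fb]
        rows[r] = reg
    return [rows[r] for r in range(n)]


def ta_direct(a, n, k1, k2):
    """Reference: T_A built from the definition A^(j) = A * e'_j mod Q."""
    l1, l2 = n - k1, n - k2
    q = (1 << n) | (1 << k1) | (1 << k2) | 1
    B = [(1 << i) ^ ((1 << (i - l1)) if i >= l1 else 0)
         ^ ((1 << (i - l2)) if i >= l2 else 0) for i in range(n)]
    Bp = [B[i] if i >= l2 else 1 << i for i in range(n)]
    a_poly = 0
    for c, e in zip(a, B):
        if c:
            a_poly ^= e
    cols = []
    for e in Bp:
        p, x, y = 0, a_poly, e              # p = a_poly * e mod q
        while y:
            if y & 1:
                p ^= x
            y >>= 1
            x <<= 1
            if (x >> n) & 1:
                x ^= q
        col = [0] * n                       # coordinates of p in B
        for i in range(n - 1, -1, -1):
            if (p >> i) & 1:
                col[i] = 1
                p ^= B[i]
        cols.append(col)
    order = list(range(l2, n)) + list(range(l2))
    return [[cols[j][i] for j in order] for i in range(n)]
```

Each output bit $c_i$ is then the inner product over $\mathbb{F}_2$ of row $i$ with $\widetilde{B}$, which is what the AND gates and XOR tree of Figure 4 compute.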
C. Sequential multiplier with bit parallel output (SMPO)
We have seen in Section IV that there are two recursive relations which allow us to generate the columns
of $T_A$. By generating the columns $A^{(j)} = A e'_j$ for $j = 0, 1, \ldots, n - 1$ with a weight of $b'_j$ we obtain $C$ as
in equation (6). Note that the $A^{(j)}$ are the columns of $T_A$.
The expressions of $A^{(j)}$ in terms of $A^{(j-1)}$ or $A^{(j+1)}$ given in Section IV correspond to left and right
shifts of a bidirectional LFSR, as described above for generating the rows of $T_A$. This leads to a multiplier
with bit parallel output (Figure 6). The operation of the multiplier has the following steps.
1) Load the register with $A$.
2) Apply $k_2$ right shifts to the bLFSR.
3) Reload the register with $A$ and apply a left shift with 0 as input to the AND gates.
4) Apply $l_2$ left shifts to the bLFSR to output the bits $c_{n-1}, c_{n-2}, \ldots, c_0$.
Fig. 6. Sequential multiplier with bit parallel output.
[Diagram: a bLFSR holding the coefficients $a_0, \ldots, a_{n-1}$, with the input sequence $b'_0, b'_1, \ldots, b'_{l_2-1}, 0, b'_{n-1}, \ldots, b'_{l_2+1}, b'_{l_2}$.]
D. Comparison of SMSO and SMPO
Using Figures 4 and 6, we can easily evaluate the complexity of the two sequential multipliers. We
can see that the two architectures have roughly the same number of XOR and AND gates. Compared to
SMSO, SMPO requires twice as many flip-flops but is faster by a factor of about $\log_2(n + 1)$.

TABLE V
COMPLEXITY OF SEQUENTIAL MULTIPLIERS

           | SMSO                                              | SMPO
# XOR      | $n + 1$                                           | $n + 2$
# AND      | $n$                                               | $n$
Flip-flops | $n$ cells                                         | $2n$ cells
Delay      | $(n + 1)(D_A + \lceil \log_2(n + 1) \rceil D_X)$  | $(n + 1)(D_A + D_X)$
VII. CONCLUSION
In this paper we have considered multipliers over binary fields defined by nearly all one polynomials. To
this end, we have used multiplication modulo a quadrinomial. We have introduced a double basis approach
which provides a Toeplitz matrix vector product expression for multiplication modulo the quadrinomial.
This has resulted in a subquadratic space complexity parallel multiplier, which has lower delay than the
Fan and Hasan multiplier modulo pentanomials [4]. We have also presented two sequential multipliers
using a bidirectional LFSR.
REFERENCES
[1] R. P. Brent and P. Zimmermann. Algorithms for almost irreducible and almost primitive trinomials. In Primes and
Misdemeanours: Lectures in Honour of the Sixtieth Birthday of Hugh Cowie Williams, Fields Institute, 2004.
[2] W. Diffie and M. E. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 24:644–654, 1976.
[3] C. Doche. Redundant trinomials for finite fields of characteristic 2. In ACISP 2005, pages 122–133. Springer-Verlag, 2005.
[4] H. Fan and M. A. Hasan. A new approach to sub-quadratic space complexity parallel multipliers for extended binary fields. IEEE
Trans. Computers, 56(2):224–233, September 2007.
[5] M. A. Hasan and V. K. Bhargava. Architecture for a Low Complexity Rate-Adaptive Reed-Solomon Encoder. IEEE Trans. Computers,
pages 938–942, July 1995.
[6] T. Itoh and S. Tsujii. Structure of parallel multipliers for a class of fields GF(2^m). Inform. Comp., 83:21–40, 1989.
[7] R. Katti and J. Brennan. Low complexity multiplication in a finite field using ring representation. IEEE Trans. Comput., 52(4):418–427,
2003.
[8] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 48:203–209, 1987.
[9] V. Miller. Use of elliptic curves in cryptography. In Advances in Cryptology, proceedings of CRYPTO'85, volume 218 of LNCS, pages
417–426. Springer-Verlag, 1986.
[10] C. Negre. Quadrinomial modular multiplication using modified polynomial basis. In ITCC 2005, Las Vegas USA, LNCS, April 2005.
[11] F. Rodriguez-Henriquez and C. K. Koç. Parallel multipliers based on special irreducible pentanomials. IEEE Transactions on Computers,
52(12):1535–1542, December 2003.
[12] B. Sunar. A generalized method for constructing subquadratic complexity GF(2^k) multipliers. IEEE Transactions on Computers,
53:1097–1105, 2004.