Digit-Level Serial-In Parallel-Out Multiplier Using Redundant Representation for a Class of Finite Fields by Hosseinzadeh Namin, Parham et al.
University of Windsor
Scholarship at UWindsor
Electrical and Computer Engineering Publications, Department of Electrical and Computer Engineering
3-2017
Digit-Level Serial-In Parallel-Out Multiplier Using
Redundant Representation for a Class of Finite
Fields
Parham Hosseinzadeh Namin
University of Windsor
Roberto Muscedere
University of Windsor
Majid Ahmadi
University of Windsor
Recommended Citation
Hosseinzadeh Namin, Parham; Muscedere, Roberto; and Ahmadi, Majid. (2017). Digit-Level Serial-In Parallel-Out Multiplier Using
Redundant Representation for a Class of Finite Fields. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI)
SYSTEMS, 25 (5).
http://scholar.uwindsor.ca/electricalengpub/10
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1
Digit-Level Serial-In Parallel-Out Multiplier
Using Redundant Representation
for a Class of Finite Fields
Parham Hosseinzadeh Namin, Student Member, IEEE, Roberto Muscedere, Member, IEEE,
and Majid Ahmadi, Life Fellow, IEEE
Abstract— Two digit-level finite field multipliers in F2m using
redundant representation are presented. Embedding $F_{2^m}$ in
cyclotomic field $F_2^{(n)}$ causes a certain amount of redundancy and
consequently performing field multiplication using redundant
representation would require more hardware resources. Based
on a specific feature of redundant representation in a class of
finite fields, two new multiplication algorithms along with their
pertaining architectures are proposed to alleviate this problem.
Considering area-delay product as a measure of evaluation, it has
been shown that both the proposed architectures considerably
outperform existing digit-level multipliers using the same basis.
It is also shown that, for a subset of the fields, the proposed
multipliers achieve lower area-delay complexity than several
recently proposed optimal normal basis multipliers. The main
characteristics of the post-place-and-route application-specific
integrated circuit implementation of the proposed multipliers
for three practical digit sizes are also reported.
Index Terms— Digit-level architecture, finite field arithmetic,
multiplication algorithm, redundant representation.
Manuscript received June 22, 2016; revised September 24, 2016 and
November 16, 2016; accepted December 17, 2016.
The authors are with the Department of Electrical and Computer
Engineering, University of Windsor, Windsor, ON N9B 3P4, Canada (e-mail:
hosseinp@uwindsor.ca; rmusced@uwindsor.ca; ahmadi@uwindsor.ca).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2016.2646479
I. INTRODUCTION
Finite field computation has recently gained growing attention
due to its wide range of applications in coding
theory, error control coding, and especially in cryptography,
where ElGamal [1] and elliptic curve cryptography (ECC) [2],
two of the three well-known cryptosystems, are based
on finite field arithmetic [3]. Finite field computation is
performed using arithmetic operations in the underlying finite
field. Among the basic field operations, multiplication plays
a fundamental role, as more complicated operations, namely,
field exponentiation and field inversion, can be carried out with
consecutive use of field multiplication [2], [4], [5].
Similar to linear algebra, the concept of representation bases
is also used in finite field arithmetic to represent field elements.
The choice of representation system, which is mainly affected by the
hardware in use and the requirements of the cryptosystem, has
a great impact on computational performance. Several
representation systems for binary extension fields have
been proposed in the literature, such as polynomial basis [6],
normal basis (NB) [7], redundant basis (RB) [8], and dual
basis [9]. In both NB and redundant representation, squaring
operation can be performed by applying a simple permutation
operation on the coordinates. This makes them more efficient
for the hardware implementations of cryptographic algorithms
that utilize frequent squaring or exponentiation, such as point
addition/doubling in ECC. Moreover, redundant representation
is of special interest due to its unique feature of accommodating
ring-type operations. This not only offers almost cost-free
squaring operation but also eliminates the need for modular
reduction in multiplication.
The idea of embedding a field in a larger ring was first
put forward by Gao et al. [10], [11] for performing fast
multiplication using NB. Later on, Wu et al. [8] introduced
redundant representation, also known as RB, and finite field
multiplication using this representation system. In efforts to
increase the multiplication speed or to reduce the hardware
complexities, several architectures have been proposed after-
ward, such as comb-style architecture [12] and linear feedback
shift register (LFSR)-based architectures [13], [14]. More
recently, Xie et al. [15] proposed a recursive decomposition
scheme for digit-level serial/parallel structures to achieve lower
area-time-power complexities.
Regardless of the structure of the architecture in use, the main
drawback of redundant representation is that it contains a
certain amount of redundancy, as embedding field $F_{2^m}$ of
size $m$ in cyclotomic field $F_2^{(n)}$ of size $n$ ($n > m$) is
not a one-to-one mapping operation. As a result, redundant
representation requires more bits to represent a field element,
where the number of representation bits depends on the size of
the cyclotomic field in which the underlying field is embedded.
In this paper, our focus is on digit-level architectures for
RB multipliers. We show that a specific feature of redundant
representation can be used for a class of finite fields to signifi-
cantly reduce the architectural complexity of RB multipliers to
compensate for the inherent redundancy in this representation
system. Two variants of the multiplication algorithm, along with
their corresponding architectures, are presented. It is shown
that the proposed architectures have highly regular structures
and are thus suitable for hardware implementation. Comparisons
with existing digit-level RB architectures reveal that
both proposed architectures outperform other RB architectures
when area-delay product is considered as a measure
of performance. A comparison between the performance
of the proposed multipliers and that of several optimal
NB (ONB) multipliers is also given. Finally, hardware real-
izations of the proposed multipliers for three practical digit
sizes are presented.
The organization of this paper is as follows. Section II
contains a brief review of RB and finite field multiplication
using this representation system. In Section III, two new digit-level
algorithms and architectures for RB multiplication are
presented. The architectural complexity and the performance
comparison are discussed in Section IV, followed by the
details of VLSI implementations of three practical field size
multipliers in Section V. Concluding remarks are given
in Section VI.
II. BRIEF REVIEW OF REDUNDANT BASIS
REPRESENTATION IN $F_{2^m}$
A. Redundant Representation for $F_{2^m}$
Let $F_2$ denote a field of characteristic 2 and $x^n - 1$ be
a polynomial of degree $n$ over $F_2$, where $n$ is a positive integer. Then,
the splitting field of $x^n - 1$, denoted by $F_2^{(n)}$, is called the
$n$th cyclotomic field over $F_2$. Let $\beta$ be a primitive $n$th root
of unity in an extension field of $F_2$. Then, $F_2^{(n)}$ is generated
by $\beta$ over $F_2$ and elements of $F_2^{(n)}$ can be represented in the
form of
$$A = a_0 + a_1\beta + a_2\beta^2 + \cdots + a_{n-1}\beta^{n-1}, \quad a_i \in F_2. \tag{1}$$
This representation of $A$ is not unique, i.e., for a given element
of $F_2^{(n)}$ represented by the $n$-tuple $(a_0, a_1, \ldots, a_{n-1})$, $a_i \in F_2$,
there exist different tuples representing the same element.
For instance, the $n$-tuple of each element in $F_2^{(n)}$ and its ones' complement
represent the same field element, as explained in Lemma 1.
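Lemma 1: Assume that field element $E \in F_{2^m}$ is represented
by $(e_0, e_1, \ldots, e_{n-1})$, $e_i \in F_2$, with respect to
$I = \{1, \beta, \ldots, \beta^{n-1}\}$. Then
$$E = e_0 + e_1\beta + \cdots + e_{n-1}\beta^{n-1} = (1 + e_0) + (1 + e_1)\beta + \cdots + (1 + e_{n-1})\beta^{n-1}. \tag{2}$$
Proof: Since the powers of a primitive $n$th root of
unity, i.e., $\{\beta^i, 0 \le i \le n-1\}$, form a cyclic group of
order $n$, we have $\beta^n = 1$ and, accordingly,
$1 + \beta + \beta^2 + \cdots + \beta^{n-1} = 0$. Adding this zero sum to $E$
complements every coordinate without changing the element. $\blacksquare$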
An interesting example would be the identity element of the
field with respect to operation "+," namely, "0," which can be
represented by both $n$-tuples $(0, 0, \ldots, 0)$ and $(1, 1, \ldots, 1)$.
Due to the redundancy in this representation system, (1) is
called a redundant representation of $A$ and $I = \{1, \beta, \beta^2, \ldots, \beta^{n-1}\}$
is referred to as an RB for any subfield of $F_2^{(n)}$ [8].
In order for a field of characteristic two, $F_{2^m}$, to be embedded
in $F_2^{(n)}$, the following relationship between $n$ and $m$ should
be satisfied.
Theorem 1: Let $n$ be an odd positive integer greater than $m$.
Then, $F_{2^m}$ is contained in $F_2^{(n)}$ iff $m$ divides the multiplicative
order of 2 modulo $n$ [16, Th. 2.47].
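The condition of Theorem 1 is easy to check numerically. The following is a minimal Python sketch, not taken from the paper: the function names are hypothetical, and the brute-force search is only an illustration and is not [18, Algorithm 1].

```python
def multiplicative_order_of_2(n: int) -> int:
    """Return the smallest k >= 1 with 2**k == 1 (mod n), for odd n > 1."""
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be an odd integer greater than 1")
    k, power = 1, 2 % n
    while power != 1:
        power = (power * 2) % n
        k += 1
    return k


def is_embeddable(m: int, n: int) -> bool:
    """Condition of Theorem 1: n odd, n > m, and m divides ord_n(2)."""
    if n <= m or n % 2 == 0:
        return False
    return multiplicative_order_of_2(n) % m == 0


def smallest_n_by_search(m: int) -> int:
    """Brute-force search (illustrative only) for the smallest odd n > m
    that satisfies the condition of Theorem 1."""
    n = m + 1 if m % 2 == 0 else m + 2
    while not is_embeddable(m, n):
        n += 2
    return n
```

For instance, `is_embeddable(233, 467)` holds because the multiplicative order of 2 modulo 467 is 466, which is divisible by 233.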
For more information about conversion from/to normal
representation system to/from redundant representation, the
reader is referred to [8] and [17].
B. Multiplication Using Redundant Representation in $F_{2^m}$
One of the unique advantages of using RB in finite field
arithmetic is that it eliminates the need for modular reduction
in the multiplication operation. This useful feature stems from the
fact that the basis elements $1, \beta, \beta^2, \ldots, \beta^{n-1}$ form a cyclic
group of order $n$. As a direct result
$$\beta \cdot \beta^i = \begin{cases} \beta^{i+1}, & i \neq n-1 \\ 1, & i = n-1. \end{cases} \tag{3}$$
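Let field elements $A$ and $B \in F_{2^m}$ be expressed with respect
to the RB $I = \{1, \beta, \beta^2, \ldots, \beta^{n-1}\}$ as
$$A = \sum_{i=0}^{n-1} a_i\beta^i \quad \text{and} \quad B = \sum_{i=0}^{n-1} b_i\beta^i$$
respectively, where $a_i, b_i \in F_2$. Note that $n \ge m + 1$ and
$\beta^n = 1$. Then $C$, the product of $A$ and $B$, can be given by
$$\begin{aligned}
C = A \cdot B &= \sum_{i=0}^{n-1} (a_i\beta^i) \cdot B \\
&= \sum_{i=0}^{n-1} a_i \left( \sum_{j=0}^{n-1} b_j\beta^{i+j} \right) \\
&= \sum_{i=0}^{n-1} a_i \left( \sum_{j=0}^{n-1} b_{(j-i)_n}\beta^j \right) \\
&= \sum_{j=0}^{n-1} \left( \sum_{i=0}^{n-1} a_i b_{(j-i)_n} \right) \beta^j
\end{aligned} \tag{4}$$
where $(j-i)_n$ denotes that $(j-i)$ is to be reduced modulo $n$.
Define $A \cdot B = C \triangleq \sum_{j=0}^{n-1} c_j\beta^j$; then, $c_j$ can be given by
$$c_j = \sum_{i=0}^{n-1} a_i b_{(j-i)_n}, \quad j = 0, 1, \ldots, n-1. \tag{5}$$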
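Equation (5) is a cyclic convolution over $F_2$, which can be sketched in a few lines of Python. This is an illustrative bit-level model only, with coordinates held in a list of 0/1 integers; the function name is hypothetical and the 5-tuples in the usage note are chosen only for illustration.

```python
from typing import List


def rb_multiply(a: List[int], b: List[int]) -> List[int]:
    """Coordinates c_j of (5): c_j = sum_i a_i * b_((j - i) mod n) over F_2."""
    n = len(a)
    if len(b) != n:
        raise ValueError("both operands must have n coordinates")
    c = [0] * n
    for j in range(n):
        acc = 0
        for i in range(n):
            acc ^= a[i] & b[(j - i) % n]  # AND = F_2 product, XOR = F_2 sum
        c[j] = acc
    return c
```

With $n = 5$, `rb_multiply([0, 1, 0, 0, 1], [0, 0, 1, 1, 0])` gives `[0, 1, 1, 1, 1]`, which is the ones' complement of `[1, 0, 0, 0, 0]` and therefore, by Lemma 1, another representation of the element 1. No modular reduction step appears anywhere in the loop.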
III. PROPOSED DIGIT-LEVEL SIPO MULTIPLIERS
USING REDUNDANT REPRESENTATION
In this section, we first present a new algorithm for
RB multiplication. Based on this algorithm, we propose two
new optimized digit-level serial-in parallel-out (SIPO) architectures.
These architectures are adopted for a class of finite
fields in which $n$ can be expressed as $n = Tm + 1$, where
$T \ge 2$ and is an even number. As will be seen in the
remainder of this section, this condition enables us to devise
an architecture that significantly reduces the complexity of the
multiplier. Corollary 1, which corresponds to [8, Lemma 2],
describes a specific feature that results from the
above-mentioned condition.
Corollary 1: Let $A \in F_{2^m}$ and assume that its redundant
representation is given by $(a_0, a_1, \ldots, a_{n-1})$ with respect to
RB $I$ over $F_{2^m}$. If $n$ can be expressed as $Tm + 1$, assuming
that $T \ge 2$ and is even, then
$$a_k = a_{n-k}, \quad k = 1, 2, \ldots, n-1. \tag{6}$$
The degree of the smallest cyclotomic field that contains
$F_{2^m}$ can be calculated using [18, Algorithm 1].
For ECC applications, field sizes are recommended to be
selected within the range of 150 to 600 according to the
security standards [19]. It can be shown, via [18], that Corollary 1
covers over 60% of all finite fields within the practical range.
As an example, for the first 100 fields in the aforementioned
range, Table I lists the orders of the smallest cyclotomic fields that
contain $F_{2^m}$ for the subset of fields in which the relationship between
$n$ and $m$ satisfies the requirement of Corollary 1.
TABLE I
SMALLEST CYCLOTOMIC FIELD $F_2^{(n)}$ THAT CONTAINS $F_{2^m}$ FOR
$150 < m < 250$, WHEN $n$ CAN BE EXPRESSED AS
$n = Tm + 1$, $T \ge 2$ AND EVEN ∗
A. Proposed Digit-Level RB Multiplication Algorithm, Type-a
Assume $w$, $1 \le w \le \frac{n-1}{2}$, denotes the digit size of
the multiplier. Excluding $a_0$ from the coordinate set, let the
rest of the coordinates be equally divided into $2w$ parts, $d$ bits
long each, where $d = \left\lceil \frac{n-1}{2w} \right\rceil$, as
$$A = a_0\ \underbrace{a_1 \ldots a_d}_{A_0}\ \underbrace{a_{d+1} \ldots a_{2d}}_{A_1}\ \ldots\ \underbrace{a_{(2w-1)d+1} \ldots a_{n-1}\ 0 \ldots 0}_{A_{2w-1}}.$$
Note that the outside of the coordinate set is padded with
zeros. Replace subscript $i$ of $a_i$ in (5) with $kd + \ell$ for
$0 \le k \le 2w-1$ and $1 \le \ell \le d$. The product coefficient $c_j$,
$j = 0, 1, \ldots, n-1$, can be rewritten as
$$c_j = a_0 b_j + \sum_{k=0}^{2w-1} \sum_{\ell=1}^{d} a_{(kd+\ell)}\, b_{(j-kd-\ell)}. \tag{7}$$
Based on the definition of $d$, we have $\frac{n-1}{2w} \le d < \frac{n-1}{2w} + 1$.
As a result, the upper bound of the subscript
$kd + \ell$ in the above-mentioned double summation falls within
the range of $n-1$ to $n-1+2w$, and thus, all the nonzero
terms of the product coefficient $c_j$ are included in (7).
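The partition above can be illustrated with a short Python sketch. It assumes that $d$ is the ceiling of $(n-1)/2w$, as the inequality on $d$ implies; the function names are hypothetical.

```python
from typing import List


def digit_length(n: int, w: int) -> int:
    """d = ceil((n - 1) / (2w)), computed with integer arithmetic."""
    return -(-(n - 1) // (2 * w))


def split_into_words(a: List[int], w: int) -> List[List[int]]:
    """Split a_1..a_{n-1} into 2w words A_0..A_{2w-1} of d entries each,
    zero-padding the tail as in the partition of A above (a_0 is excluded)."""
    n = len(a)
    d = digit_length(n, w)
    padded = a[1:] + [0] * (2 * w * d - (n - 1))
    return [padded[k * d:(k + 1) * d] for k in range(2 * w)]
```

Feeding position labels instead of bits makes the layout visible: for $n = 11$ and $w = 2$, `split_into_words(list(range(11)), 2)` returns `[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 0, 0]]`, where the two trailing zeros are padding.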
Under the required conditions of Corollary 1, the last
$Tm/2$ coordinates of a field element are mirror reflections
of the first $Tm/2$ coordinates. A new function $\varphi(i)$ can be
utilized to map the set of integers used in the subscript of the
coordinates to the set $\{0, 1, \ldots, \frac{n-1}{2}\}$, as follows:
$$\varphi(i) = \begin{cases} i \bmod n, & 0 \le i \bmod n \le \frac{n-1}{2} \\ n - (i \bmod n), & \text{otherwise.} \end{cases} \tag{8}$$
Now assume that $\hat{a}_i$, $0 \le i \le \frac{n-1}{2}$, denotes the first
$\frac{n+1}{2}$ coordinates of $A$, as
$$\hat{a}_i = \begin{cases} a_i, & 0 \le i \le \frac{n-1}{2} \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$
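Definitions (8) and (9) translate directly into code. The sketch below is illustrative only; the names `phi` and `a_hat` are hypothetical.

```python
from typing import List


def phi(i: int, n: int) -> int:
    """Index-folding function of (8): maps any integer to 0..(n - 1) / 2."""
    r = i % n  # Python's % is non-negative for positive n
    return r if r <= (n - 1) // 2 else n - r


def a_hat(a: List[int], i: int) -> int:
    """Truncated coordinate of (9): a_i in the first half, zero elsewhere."""
    n = len(a)
    return a[i] if 0 <= i <= (n - 1) // 2 else 0
```

For a tuple that satisfies the symmetry of Corollary 1, `b[i % n] == b[phi(i, n)]` for every integer `i`, which is what allows every subscript to be folded onto the first half of the coordinates.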
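Taking into account (8) and (9), (7) can be rewritten as
\begin{align}
c_j &= a_0 b_j + \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)}\, b_{\varphi(j-kd-\ell)} + \sum_{k=0}^{w-1} \sum_{\ell=\frac{n-1}{2}+1}^{\frac{n-1}{2}+d} a_{(kd+\ell)}\, b_{\varphi(j-kd-\ell)} \tag{10} \\
&= a_0 b_j + \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)}\, b_{\varphi(j-kd-\ell)} + \sum_{k=w-1}^{0} \sum_{\ell=d}^{1} \hat{a}_{(kd+\ell)}\, b_{\varphi(j+kd+\ell)} \tag{11} \\
&= a_0 b_j + \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)} \left[ b_{\varphi(j-kd-\ell)} + b_{\varphi(j+kd+\ell)} \right] \tag{12}
\end{align}
for $j = 0, 1, \ldots, n-1$. In the last term of (11), $a$ is replaced
with $\hat{a}$ due to the fact that coordinates $a_i$ for $i = \frac{n+1}{2}$ to
$n-1$ are equal to their corresponding $\hat{a}_i$ for $i = \frac{n-1}{2}$ to 1.
Also, from (8), it is clear that $b_{\varphi(j-\frac{n+1}{2})} = b_{\varphi(j+\frac{n-1}{2})}$
and $b_{\varphi(j-(n-1))} = b_{j+1}$.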
The complexity of the multiplication operation carried out
using (12) can be further reduced by utilizing Lemma 1.
Taking into consideration (2), the term $a_0 b_j$ can be removed
from (12). If coordinate $a_0$ is equal to zero, then the original
representation of $A$ is used without being changed. Otherwise,
in the case $a_0 = 1$, the complement of $A$ can be used instead of $A$
without having any effect on the multiplication operation.
As a result, the product coefficient $c_j$ can be obtained as
$$c_j = \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)} \left[ b_{\varphi(j-kd-\ell)} + b_{\varphi(j+kd+\ell)} \right]. \tag{13}$$
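As a sanity check, (13) can be compared against the plain cyclic convolution of (5). The following Python sketch assumes both operands satisfy the symmetry of Corollary 1 and takes $d$ as the ceiling of $(n-1)/2w$; all function names are hypothetical.

```python
from typing import List


def phi(i: int, n: int) -> int:
    """Index-folding function of (8)."""
    r = i % n
    return r if r <= (n - 1) // 2 else n - r


def cyclic_product(a: List[int], b: List[int]) -> List[int]:
    """Reference product, equation (5)."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) % 2 for j in range(n)]


def folded_product(a: List[int], b: List[int], w: int) -> List[int]:
    """Product coordinates from (13), for symmetric a and b."""
    n = len(a)
    half = (n - 1) // 2
    d = -(-(n - 1) // (2 * w))          # assumed: d = ceil((n - 1) / 2w)
    if a[0] == 1:
        a = [bit ^ 1 for bit in a]      # Lemma 1: use the complement of A
    c = [0] * n
    for j in range(n):
        acc = 0
        for k in range(w):
            for ell in range(1, d + 1):
                idx = k * d + ell
                a_bit = a[idx] if idx <= half else 0   # a-hat of (9)
                acc ^= a_bit & (b[phi(j - idx, n)] ^ b[phi(j + idx, n)])
        c[j] = acc
    return c
```

For symmetric inputs with $a_0 = 0$, the two functions return identical tuples for any admissible digit size `w`. When $a_0 = 1$, `folded_product` returns the product of the complemented operand, which represents the same field element.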
Lemma 2 shows that if $a_0 = 0$, then $c_0$ will be equal
to zero too.
Lemma 2: Let field elements $A$, $B$, and $C \in F_{2^m}$ be
expressed with respect to RB $I = \{1, \beta, \beta^2, \ldots, \beta^{n-1}\}$ as
$A = \sum_{i=0}^{n-1} a_i\beta^i$, $B = \sum_{i=0}^{n-1} b_i\beta^i$, and $C = \sum_{i=0}^{n-1} c_i\beta^i$,
respectively. Assume $n$ is expressed as $Tm + 1$, where $T \ge 2$
and is an even integer. Then, if $a_0 = 0$, $c_0 = 0$.
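Proof: Substituting zero for $j$ in (13) and according to
the definition of function $\varphi(i)$ in (8), we have
\begin{align*}
c_0 &= \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)} \left[ b_{\varphi(-kd-\ell)} + b_{\varphi(kd+\ell)} \right] \\
&= \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)} \left[ b_{\varphi(kd+\ell)} + b_{\varphi(kd+\ell)} \right] = 0.
\end{align*}
$\blacksquare$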
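Using (13), it can be easily proven that $c_{n-j} = c_j$ for
$j = 1, 2, \ldots, n-1$:
\begin{align*}
c_{n-j} &= \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)} \left[ b_{\varphi(n-j-kd-\ell)} + b_{\varphi(n-j+kd+\ell)} \right] \\
&= \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)} \left[ b_{\varphi(-j-kd-\ell)} + b_{\varphi(-j+kd+\ell)} \right].
\end{align*}
The definition of function $\varphi(i)$ in (8) implies that $\varphi(n-i) =
\varphi(-i) = \varphi(i)$, hence
\begin{align*}
c_{n-j} &= \sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)} \left[ b_{\varphi(j+kd+\ell)} + b_{\varphi(j-kd-\ell)} \right] \\
&= c_j.
\end{align*}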
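Both properties, $c_0 = 0$ when $a_0 = 0$ and $c_{n-j} = c_j$, can be confirmed exhaustively for a small $n$. The Python sketch below uses the cyclic convolution of (5) as the product and enumerates every pair of symmetric tuples; the helper names are hypothetical and the check is an illustration, not a substitute for the proofs.

```python
from itertools import product
from typing import List, Sequence


def cyclic_product(a: List[int], b: List[int]) -> List[int]:
    """Product coordinates according to (5)."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) % 2 for j in range(n)]


def symmetric_tuple(first: int, half_bits: Sequence[int]) -> List[int]:
    """Build (a_0, a_1, ..., a_{n-1}) with a_k = a_{n-k}, as in (6)."""
    return [first] + list(half_bits) + list(reversed(half_bits))


def is_symmetric(v: List[int]) -> bool:
    n = len(v)
    return all(v[k] == v[n - k] for k in range(1, n))


def product_keeps_symmetry(n: int) -> bool:
    """Check c_0 = 0 and c_{n-j} = c_j for all symmetric A (a_0 = 0) and B."""
    half = (n - 1) // 2
    for ha in product((0, 1), repeat=half):
        a = symmetric_tuple(0, ha)
        for b0 in (0, 1):
            for hb in product((0, 1), repeat=half):
                c = cyclic_product(a, symmetric_tuple(b0, hb))
                if c[0] != 0 or not is_symmetric(c):
                    return False
    return True
```

The practical consequence is that only the coordinates $c_1, \ldots, c_{(n-1)/2}$ need to be computed in hardware.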
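In order to obtain a digit-level multiplication algorithm, first
decompose (13) into two double summations as
$$c_j = \underbrace{\sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)}\, b_{\varphi(j-kd-\ell)}}_{P} + \underbrace{\sum_{k=0}^{w-1} \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)}\, b_{\varphi(j+kd+\ell)}}_{Q}. \tag{14}$$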
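Then, define two signals $p_{j,k}^{(\ell)}$ and $q_{j,k}^{(\ell)}$,
$j = 1, 2, \ldots, \frac{n-1}{2}$ and $k = 0, 1, \ldots, w-1$, as follows:
$$\begin{cases}
p_{j,k}^{(0)} = 0 \ \text{and} \ p_{j,k}^{(\ell)} = p_{j,k}^{(\ell-1)} + \hat{a}_{(kd+\ell)}\, b_{\varphi(j-kd-\ell)} \\
q_{j,k}^{(0)} = 0 \ \text{and} \ q_{j,k}^{(\ell)} = q_{j,k}^{(\ell-1)} + \hat{a}_{(kd+\ell)}\, b_{\varphi(j+kd+\ell)}
\end{cases} \tag{15}$$
where $\ell = 1, 2, \ldots, d$ indicates the current clock cycle. $p_{j,k}^{(\ell)}$
and $q_{j,k}^{(\ell)}$ hold the sum of inner products of certain coordinates
of $A$ and $B$ [denoted by $P$ and $Q$ in (14)] up to the $\ell$th clock
cycle. After $d$ clock cycles, the values of signals $p_{j,k}^{(d)}$ and $q_{j,k}^{(d)}$
would be equal to
$$\begin{cases}
p_{j,k}^{(d)} = \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)}\, b_{\varphi(j-kd-\ell)} \\
q_{j,k}^{(d)} = \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)}\, b_{\varphi(j+kd+\ell)}.
\end{cases} \tag{16}$$
Comparing (16) with (13), it follows that
$$c_j = \sum_{k=0}^{w-1} \left[ p_{j,k}^{(d)} + q_{j,k}^{(d)} \right]. \tag{17}$$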
If the values of $p_{j,k}^{(\ell)}$ and $q_{j,k}^{(\ell)}$ can be calculated and
accumulated for all the values of $j$ and $k$ at each clock cycle,
then it takes $d = \left\lceil \frac{n-1}{2w} \right\rceil$ clock cycles to obtain all the
product coefficients. Algorithm 1 describes the multiplication
process in more detail. To perform arithmetic operations in
binary field $F_2$, one XOR gate and one AND gate are required
to realize a bitwise addition and a bitwise multiplication,
respectively. In Step 5 of the algorithm, for given $j$ and $k$, an
AND gate is used to multiply one bit of each input operand
together, and then, an XOR gate is required to perform the addition
operation. Step 6 could also be implemented using one XOR
and one AND gate in a similar way to Step 5.
Algorithm 1 Digit-Level RB Multiplication Algorithm
Where $n$ Can Be Expressed as $n = Tm + 1$, $T \ge 2$ and
Even
A pair of flip-flops is also required for given $j$ and $k$ to store
the values of the two signals $p_{j,k}^{(\ell)}$ and $q_{j,k}^{(\ell)}$ after each clock cycle.
Note that Steps 5 and 6 of the multiplication algorithm are
computed in parallel at each clock cycle, while the resulting
values are accumulated throughout $\ell$ clock cycles in serial.
Finally, the accumulation in Step 12 is computed right after
the $d$th clock cycle. For a given $j$, $w$ pairs of intermediate
signals $p_{j,k}^{(d)}$ and $q_{j,k}^{(d)}$ (for $k$ from 0 to $w-1$) are to be added
together to form the final value of the corresponding product
coordinate.
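The recurrences (15) to (17) can be modeled cycle by cycle in software. The following Python sketch is a behavioral illustration under stated assumptions, not the authors' Algorithm 1 listing: it assumes symmetric operands, takes $d$ as the ceiling of $(n-1)/2w$, and uses hypothetical function names. The step numbers in the comments follow the description in the text.

```python
from typing import Dict, List, Tuple


def phi(i: int, n: int) -> int:
    """Index-folding function of (8)."""
    r = i % n
    return r if r <= (n - 1) // 2 else n - r


def simulate_type_a(a: List[int], b: List[int], w: int) -> List[int]:
    """Behavioral model of recurrences (15)-(17) for symmetric a and b."""
    n = len(a)
    half = (n - 1) // 2
    d = -(-(n - 1) // (2 * w))               # assumed: ceil((n - 1) / 2w)
    if a[0] == 1:
        a = [bit ^ 1 for bit in a]           # Lemma 1: complement of A
    keys = [(j, k) for j in range(1, half + 1) for k in range(w)]
    p: Dict[Tuple[int, int], int] = {key: 0 for key in keys}
    q: Dict[Tuple[int, int], int] = {key: 0 for key in keys}
    for ell in range(1, d + 1):              # one iteration per clock cycle
        for j, k in keys:                    # all (j, k) update in parallel
            idx = k * d + ell
            a_bit = a[idx] if idx <= half else 0
            p[j, k] ^= a_bit & b[phi(j - idx, n)]   # Step 5
            q[j, k] ^= a_bit & b[phi(j + idx, n)]   # Step 6
    c = [0] * n                              # c_0 = 0 by Lemma 2
    for j in range(1, half + 1):
        for k in range(w):
            c[j] ^= p[j, k] ^ q[j, k]        # Step 12, equation (17)
        c[n - j] = c[j]                      # symmetry c_{n-j} = c_j
    return c
```

Each pass of the outer loop corresponds to one clock cycle in which every accumulator consumes a single bit of $\hat{A}$; the final loop plays the role of the XOR network applied after the $d$th cycle.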
B. New Multiplier Architecture, DL-SRB-a
An architecture for the proposed multiplier can be constructed
based on the steps described in Algorithm 1 at
$\ell = 1$. Fig. 1 shows the proposed architecture, hereafter
referred to as digit-level symmetrical RB type-a multiplier
(DL-SRB-a). From top to bottom, the architecture consists
of an $n$-bit circular shift register which should be initialized
with the coordinates of operand $B$. This shift register provides
inputs to a wire expansion module with $n$ inputs and
$w(n-1)$ outputs followed by $\frac{n-1}{2}$ identical modules
($M_1, M_2, \ldots, M_{(n-1)/2}$) shown inside the dashed boxes. At the
bottom, there is a network of XOR gates adding $2w$ outputs
of each module $M_j$ together to form output coordinates.
Each module $M_j$ is made of a layer of $2w$ AND gates
receiving the outputs of the wire expansion module as their
first input set. The second input set is received from certain
bits of operand $A$ in a digit-serial fashion. Each AND gate is
followed by an XOR gate connected immediately to a flip-flop.
The output of the flip-flop is fed back to the XOR gate, together
forming an accumulation unit. Two AND gates along with their
respective accumulation units form a structure responsible
for realizing the operations performed in Steps 5 and 6 of
Algorithm 1. One of these structures is shown in Fig. 1
inside a dotted block for $j = 0$ and $k = 0$. In total, the
proposed architecture contains $w\frac{n-1}{2}$ such structures,
each of which consists of two AND gates, two XOR gates,
and two flip-flops to generate and store $p_{j,k}^{(\ell)}$ and $q_{j,k}^{(\ell)}$ in each
clock cycle.
Fig. 1. Proposed architecture for digit-level SIPO RB multiplier, DL-SRB-a.
As mentioned earlier, input $A$ should be fed into the multiplier
in a digit-serial fashion (comb style). According to (13),
the multiplication operation is performed using the $\hat{a}_i$ coefficients,
which are necessarily equal to the $\frac{n-1}{2}$ coordinates of $A$
starting from coordinate number 1 to $\frac{n-1}{2}$. We will refer
to this set of coordinates of $A$ as $\hat{A}$. Let $\hat{A}$ be divided into
$w$ parts of length $d$ in the same way we did earlier for $A$, as
$$\hat{A} = \underbrace{\hat{a}_1 \ldots \hat{a}_d}_{\hat{A}_0}\ \underbrace{\hat{a}_{d+1} \ldots \hat{a}_{2d}}_{\hat{A}_1}\ \ldots\ \underbrace{\hat{a}_{(w-1)d+1} \ldots \hat{a}_{\frac{n-1}{2}}\ 0 \ldots 0}_{\hat{A}_{w-1}}.$$
Note that $\hat{A}$ is padded with $wd - \frac{n-1}{2}$ zeros in the most
significant word. In the first clock cycle, the first bits of every
word, i.e., $a_1, a_{d+1}, \ldots, a_{(w-1)d+1}$, form an input set to the
multiplier. In the second clock cycle, the inputs would be the
set of second bits of every word, $a_2, a_{d+2}, \ldots, a_{(w-1)d+2}$,
and so forth.
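The comb-style input order can be made concrete with a small Python sketch. It assumes $a_0 = 0$ (or that $A$ has already been complemented) and that $d$ is the ceiling of $(n-1)/2w$; the function name is hypothetical.

```python
from typing import List


def digit_schedule(a: List[int], w: int) -> List[List[int]]:
    """Return, for each clock cycle 1..d, the w bits of A-hat fed in parallel.

    Entry [ell - 1][k] is a-hat_(kd + ell): the ell-th bit of word A-hat_k.
    """
    n = len(a)
    half = (n - 1) // 2
    d = -(-(n - 1) // (2 * w))                    # assumed: ceil((n - 1) / 2w)
    a_hat = a[1:half + 1] + [0] * (w * d - half)  # zero-padded last word
    return [[a_hat[k * d + ell] for k in range(w)] for ell in range(d)]
```

Using position labels instead of bits, `digit_schedule(list(range(11)), 2)` yields `[[1, 4], [2, 5], [3, 0]]`: coordinates $a_1$ and $a_4$ enter in the first cycle, $a_2$ and $a_5$ in the second, and $a_3$ together with a padding zero in the third.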
Fig. 2. Circular $n$-bit shift register to store coordinates of operand $B$.
For given $j$ and $k$, in each clock cycle, the variable of
function $\varphi$ in $b_{\varphi(j-kd-\ell)}$ decreases by one in Step 5. An $n$-bit
circular shift register can be used, as shown in Fig. 2 by $R_1$,
to generate the required coefficients in Step 5. This circular
shift register should be initially loaded as, from left to right,
$b_{n-1}, b_{n-2}, \ldots, b_0$. On the contrary, the variable of function $\varphi$
in $b_{\varphi(j+kd+\ell)}$ in Step 6 increases by one in each clock cycle.
In this case, a similar circular shift register, namely, $R_2$, with
the same initial contents but with the opposite shift direction
should be utilized to produce the required coefficients.
Lemma 3: Assuming the required conditions of the symmetry
property explained in Corollary 1 are satisfied, only one
circular shift register of length $n$ would suffice to facilitate
both the operations of Steps 5 and 6.
Let the upper half of register $R$ be initialized with equivalent
coordinates from the lower half of operand $B$ in the way
shown in Fig. 2. Since $\varphi(i) = \varphi(n-i)$, an increase/decrease
by one in the variable of function $\varphi$ within the range of
$\frac{n+1}{2}$ to $n$ would be equal to a decrease/increase by one
within the range of $\frac{n-1}{2}$ to 0. Also, since $\varphi(-i) = \varphi(i)$,
an increase/decrease by one in the variable of function $\varphi$
within the range of $-\frac{n+1}{2}$ to 0 would be equal to a
decrease/increase by one within the range of 0 to $\frac{n-1}{2}$.
Consequently, the lower half of register $R$ is used when an
initial decrease in the lower half or an initial increase in the
upper half of $B$ is required, and the upper half of register $R$
is used when an initial increase in the lower half or an initial
decrease in the upper half of $B$ is needed.
Take an example in which $j = 5$ and $k = 0$. At the first
clock cycle, the value of function $\varphi$ in Step 5 is equal to
$\varphi(5-1) = 4$. It decreases by one in each clock cycle up to
the fifth cycle and will increase by one at each cycle afterward
($\varphi(5-2) = 3, \ldots, \varphi(5-5) = 0, \varphi(5-6) = \varphi(-1) = 1, \ldots$).
The similar function value in Step 6 is initially equal to
$\varphi(5+1) = 6$ at the first clock cycle. It increases by one in each clock
cycle up to the $\left(\frac{n-1}{2} - 5\right)$th cycle, and will decrease by
one at each cycle afterward ($\varphi(5+2) = 7, \ldots, \varphi\!\left(\frac{n+1}{2}\right) =
\frac{n-1}{2}, \varphi\!\left(\frac{n+3}{2}\right) = \frac{n-3}{2}, \ldots$). As a result, $R_4$ and
$R_{n-6}$ should be used to produce $p_{5,0}^{(\ell)}$ and $q_{5,0}^{(\ell)}$, respectively.
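The two index sequences in this example can be reproduced with a short Python sketch. The values $n = 23$ and $d = 11$ (that is, $w = 1$) are illustrative choices, not taken from the paper, and the function names are hypothetical.

```python
from typing import List, Tuple


def phi(i: int, n: int) -> int:
    """Index-folding function of (8)."""
    r = i % n
    return r if r <= (n - 1) // 2 else n - r


def index_trajectories(j: int, k: int, n: int, d: int,
                       cycles: int) -> Tuple[List[int], List[int]]:
    """Subscripts of b used in Step 5 and Step 6 over the first clock cycles."""
    step5 = [phi(j - k * d - ell, n) for ell in range(1, cycles + 1)]
    step6 = [phi(j + k * d + ell, n) for ell in range(1, cycles + 1)]
    return step5, step6
```

For `index_trajectories(5, 0, 23, 11, 8)`, the Step 5 sequence is `[4, 3, 2, 1, 0, 1, 2, 3]` and the Step 6 sequence is `[6, 7, 8, 9, 10, 11, 11, 10]`. The first sequence turns around at 0 after the fifth cycle, and the second turns around at $(n-1)/2 = 11$ after the sixth cycle, matching the behavior described above.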
As can be seen in Fig. 1, the number of AND gates
exceeds the number of flip-flops in register $R$. The role of the
wire expansion module with $n$ inputs and $w(n-1)$ outputs
is to receive input bits from register $R$ and to deliver them
to AND gates as follows: for given $j$ and $k$, inputs to $p_{j,k}^{(\ell)}$
and $q_{j,k}^{(\ell)}$ should be connected to $R_{\varphi(j-kd-1)}$ and $R_{\varphi(j+kd+1)}$,
respectively. It is evident that the wire expansion module
does nothing but permute and reorder the input bits
and that it does not contain any logic gates.
Depending on the choices of $n$ and $w$, the complexity of
the multiplier may be reduced one step further. Recall that
$d = \left\lceil \frac{n-1}{2w} \right\rceil$, so $\frac{n-1}{2w} \le d < \frac{n-1}{2w} + 1$.
If $w$ is chosen such that $d(w-1) > \frac{n-1}{2}$, then both
$p_{j,w-1}^{(d)}$ and $q_{j,w-1}^{(d)}$ in (16) become zero for all the values
of $j$, $1 \le j \le \frac{n-1}{2}$. The reason lies in the fact that
under the above-mentioned condition (or in other words, the
use of $2w$ instead of $w$ in the denominator of $d$), subscript
$kd + \ell$ of $\hat{a}$ for $k = w-1$ becomes greater than $\frac{n-1}{2}$
for all the values of $\ell$ from 1 to $d$. Consequently, the last
pair of accumulation units along with their respective AND
gates can be discarded from all modules $M_j$. For example,
when $m = 233$, $n = 2m + 1 = 467$ and $w = 32$, $d =
\lceil 466/64 \rceil = 8$ and $(w-1)d = 248$, which is greater than 233.
Likewise, $(w-2)d = 240$ and is still greater than 233. In this
case, each module $M_j$ only needs to contain $w-2$ pairs of
AND-XOR-flop units to generate $p_{j,k}^{(\ell)}$ and $q_{j,k}^{(\ell)}$, $0 \le k \le w-3$.
In the rest of this paper, the notation $\tilde{w}$ will be used in place
of $w$ to denote the number of parallel branches required
in each module $M_j$. $\tilde{w}$ can be defined as follows:
$$\tilde{w} = \arg\max_{w \in \mathbb{N}} \left( (w-1)d \ \middle|\ (w-1)d < \frac{n-1}{2} \right). \tag{18}$$
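Definition (18) amounts to finding the largest branch index whose first coordinate still falls within the first half of $A$. A minimal Python sketch follows, with a hypothetical function name and $d$ taken as the ceiling of $(n-1)/2w$.

```python
from typing import Tuple


def branch_count(n: int, w: int) -> Tuple[int, int]:
    """Return (d, w_tilde), where w_tilde is the largest integer such that
    (w_tilde - 1) * d < (n - 1) / 2, as in (18)."""
    half = (n - 1) // 2
    d = -(-(n - 1) // (2 * w))       # ceil((n - 1) / 2w)
    w_tilde = (half - 1) // d + 1    # largest x with (x - 1) * d <= half - 1
    return d, w_tilde
```

For the example in the text, `branch_count(467, 32)` returns `(8, 30)`, in agreement with the $w - 2 = 30$ branches per module noted above. For a smaller digit size such as $w = 4$, no branch can be removed and the result is `(59, 4)`.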
A noteworthy feature of the proposed architecture is that
the critical path of the multiplier is independent of the field
size (m), the degree of the cyclotomic field (n), and the digit
size (w). The length of critical path in terms of the number of
logic gates used remains constant regardless of the number of
flip-flops in register R and the values of j and k. As the wire
expansion module does not require logic cells, the critical path
is composed of one AND gate and one XOR gate. Assuming
$T_A$ and $T_X$ denote the time delays required by a two-input
AND gate and a two-input XOR gate, respectively, the critical
path delay is equal to $T_{cp} = T_A + T_X$. Note that the XOR
network shown in Fig. 1 (bottom) is not part of the critical path
of the multiplier as the summation in Step 12 of Algorithm 1
is only needed to be performed once at the end of the
multiplication operation.
In other words, the proposed architecture can be viewed as a
sequential circuit followed by a combinational circuit. In the
sequential part (which contains the whole circuit excluding
the XOR tree), partial products $p_{j,k}^{(\ell)}$ and $q_{j,k}^{(\ell)}$ are recursively
generated and stored in the flip-flops at each clock cycle. Note
that during the first $d$ clock cycles, the outputs of the XOR trees
are not required to be stored as they do not play any role in
the computations performed in the sequential circuit. However,
the product coordinates will not be available immediately after
$d$ clock cycles. It takes another time delay of $\log_2(2w)\,T_X$
associated with the binary tree of $(2w-1)$ two-input XOR
gates (combinational circuit) before the product coordinates
can be read from the output end. To prevent the combinational
circuit from becoming the critical path, this step should be
performed in multiple cycles. A common solution would be the
use of intermediate flip-flops to break a long path into smaller
pieces. However, the use of extra flip-flops can be avoided
provided that the inputs of the combinational circuit are kept
unchanged so that the combinational circuit has enough time
to generate valid outputs. In the proposed architecture, this
is done by padding each input sequence $\hat{A}_0, \hat{A}_1, \ldots, \hat{A}_{w-1}$
with $d_{ex}$ zeros, where $d_{ex}$ is the number of extra clock cycles
needed after the operations of Steps 5 and 6 are done. $d_{ex}$ can
be calculated as
$$d_{ex} = \left\lceil \frac{\log_2(2w)\,T_X}{T_{clock}} \right\rceil \tag{19}$$
where $T_{clock}$ refers to the clock period. If the clock
period is chosen to be equal to the critical path delay to
achieve the maximum operation frequency, $T_{clock}$ should be
replaced with $T_{cp}$ in (19). Finally, the total number of clock
cycles needed for a single multiplication operation is equal
to $d + d_{ex}$.
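The latency $d + d_{ex}$ can be evaluated with a few lines of Python. This sketch reads (19) literally, without an inner ceiling on the logarithm, and the gate delays passed in are hypothetical normalized values rather than figures from the paper.

```python
import math
from typing import Optional, Tuple


def type_a_cycles(n: int, w: int, t_and: float, t_xor: float,
                  t_clock: Optional[float] = None) -> Tuple[int, int, int]:
    """Return (d, d_ex, d + d_ex) for the type-a multiplier.

    If no clock period is given, the critical path delay T_A + T_X is used.
    """
    if t_clock is None:
        t_clock = t_and + t_xor                 # T_cp = T_A + T_X
    d = -(-(n - 1) // (2 * w))                  # ceil((n - 1) / 2w)
    d_ex = math.ceil(math.log2(2 * w) * t_xor / t_clock)   # equation (19)
    return d, d_ex, d + d_ex
```

With $n = 467$, $w = 32$, and unit gate delays, `type_a_cycles(467, 32, 1.0, 1.0)` gives `(8, 3, 11)`: eight accumulation cycles followed by three cycles for the six-level XOR tree to settle.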
C. New Multiplier Architecture, DL-SRB-b
At the expense of a slight increase in the critical path
delay, the number of logic gates and flip-flops used in the
architecture of Fig. 1 can be significantly reduced. Starting
from the closed formula of (13), instead of the decomposition
shown in (14), define two intermediate signals $s_{j,k}^{(\ell)}$ and
$r_{j,k}^{(\ell)}$, $j = 1, 2, \ldots, \frac{n-1}{2}$ and $k = 0, 1, \ldots, w-1$ for
$\ell = 1, 2, \ldots, d$ as
$$\begin{cases}
s_{j,k}^{(\ell)} = \left[ b_{\varphi(j-kd-\ell)} + b_{\varphi(j+kd+\ell)} \right] \\
r_{j,k}^{(0)} = 0 \ \text{and} \ r_{j,k}^{(\ell)} = r_{j,k}^{(\ell-1)} + \hat{a}_{(kd+\ell)}\, s_{j,k}^{(\ell)}.
\end{cases} \tag{20}$$
Fig. 3. Proposed architecture for digit-level SIPO RB multiplier, DL-SRB-b.
$r_{j,k}^{(d)}$ holds the value of signal $r$ after $d$ clock cycles and is
equal to
$$r_{j,k}^{(d)} = \sum_{\ell=1}^{d} \hat{a}_{(kd+\ell)} \left[ b_{\varphi(j-kd-\ell)} + b_{\varphi(j+kd+\ell)} \right]. \tag{21}$$
Comparing (21) with (13), product coordinates $c_j$ can be
expressed as
$$c_j = \sum_{k=0}^{w-1} r_{j,k}^{(d)}. \tag{22}$$
The new algorithm can be obtained by replacing Steps 5–12
in Algorithm 1 with the following steps:
5: $s_{j,k}^{(\ell)} = \left[ b_{\varphi(j-kd-\ell)} + b_{\varphi(j+kd+\ell)} \right]$
6: $r_{j,k}^{(\ell)} = r_{j,k}^{(\ell-1)} + \hat{a}_{(kd+\ell)}\, s_{j,k}^{(\ell)}$
7: end for
8: end for
9: end for
10: for all values of $j = 1, 2, \ldots, \frac{n-1}{2}$, compute in parallel
11: for all values of $k = 0, 1, \ldots, w-1$, compute in
serial
12: $c_j = \sum_{k=0}^{w-1} r_{j,k}^{(d)}$.
Note that in each clock cycle, Steps 5 and 6 should be
computed in serial. Fig. 3 shows the modified architecture,
referred to as DL-SRB-b. As can be seen from Fig. 3, the new
architecture is similar to the previously proposed architecture,
DL-SRB-a, in the sense that it utilizes the same wire expansion
module and the same $n$-bit circular shift register to store
operand $B$. Operand $A$ is also fed into the multiplier in the
same way as earlier. The main difference between the two
architectures originates from the difference between the two
modules shown inside the dotted boxes in Figs. 1 and 3.
In the type-a architecture, one bit of operand $B$ is multiplied by
one bit of operand $A$, and the resulting partial product is stored
separately in its respective accumulation unit. On the contrary,
in the type-b architecture, two bits of operand $B$ are first added
together before they enter the AND gate and are fed into the
accumulation unit. As a result, the critical path delay of the
new architecture changes from $T_A + T_X$ to $T_A + 2T_X$. In the
latter architecture, the numbers of accumulation units and AND
gates are each reduced by half, from $w(n-1)$ to $w\frac{n-1}{2}$.
Since half of the addition operations are performed before the
accumulation units, the size of the binary XOR tree is also
reduced from $2w - 1$ to $w - 1$.
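The type-b recurrence (20) to (22) can be modeled in the same cycle-by-cycle fashion. The Python sketch below is a behavioral illustration under the same assumptions as before (symmetric operands, $d$ as the ceiling of $(n-1)/2w$, hypothetical names); it is not a description of the hardware netlist.

```python
from typing import Dict, List, Tuple


def phi(i: int, n: int) -> int:
    """Index-folding function of (8)."""
    r = i % n
    return r if r <= (n - 1) // 2 else n - r


def simulate_type_b(a: List[int], b: List[int], w: int) -> List[int]:
    """Behavioral model of recurrences (20)-(22) for symmetric a and b."""
    n = len(a)
    half = (n - 1) // 2
    d = -(-(n - 1) // (2 * w))               # assumed: ceil((n - 1) / 2w)
    if a[0] == 1:
        a = [bit ^ 1 for bit in a]           # Lemma 1: complement of A
    keys = [(j, k) for j in range(1, half + 1) for k in range(w)]
    r: Dict[Tuple[int, int], int] = {key: 0 for key in keys}
    for ell in range(1, d + 1):              # one iteration per clock cycle
        for j, k in keys:
            idx = k * d + ell
            s = b[phi(j - idx, n)] ^ b[phi(j + idx, n)]   # Step 5: pre-add B bits
            a_bit = a[idx] if idx <= half else 0
            r[j, k] ^= a_bit & s                          # Step 6: one accumulator
    c = [0] * n                              # c_0 = 0 by Lemma 2
    for j in range(1, half + 1):
        for k in range(w):
            c[j] ^= r[j, k]                  # Step 12, equation (22)
        c[n - j] = c[j]
    return c
```

Compared with the type-a recurrence, each $(j, k)$ pair keeps a single accumulator instead of two, at the cost of one extra XOR ahead of the AND operation, which mirrors the change in critical path from $T_A + T_X$ to $T_A + 2T_X$.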
Similar to DL-SRB-a, the multiplication delay of
DL-SRB-b is composed of two parts: $d$ and $d_{ex}$. The first
part corresponds to Steps 5 and 6 of the algorithm, carried out
by modules $M_j$ during $d$ clock cycles. The second part
corresponds to the time delay of a w-input XOR gate or a
binary tree of $(w-1)$ two-input XOR gates. Assuming that a
binary tree of two-input XOR gates is used, the total number
of clock cycles required to complete a single multiplication
can be calculated as
$$d + \left\lceil \frac{\log_2 w \; T_X}{T_{\text{clock}}} \right\rceil. \tag{23}$$
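To make (23) concrete, the short Python sketch below evaluates the cycle count for given gate and clock delays. It is only an illustration: the function name `multiplication_cycles` is hypothetical, the delays are passed as integers in arbitrary units, and the tree depth is taken as $\lceil \log_2 w \rceil$ levels, which coincides with $\log_2 w$ in (23) when $w$ is a power of two.

```python
def multiplication_cycles(d, w, t_x, t_clock):
    """Clock cycles per multiplication following (23).

    d       : cycles spent in Steps 5 and 6.
    w       : digit size, i.e. number of inputs to the final XOR tree.
    t_x     : delay of one two-input XOR gate (integer, arbitrary units).
    t_clock : clock period in the same units (integer).
    """
    depth = (w - 1).bit_length()         # ceil(log2(w)) levels in the XOR tree
    tree_delay = depth * t_x             # total delay through the tree
    extra = -(-tree_delay // t_clock)    # integer ceiling division
    return d + extra
```

For example, with a hypothetical XOR delay of 2 units and a clock period of 5 units, a tree with $w = 8$ inputs adds two cycles to $d$.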
IV. ARCHITECTURAL COMPLEXITIES AND COMPARISON
The area complexities of the proposed architectures can be
readily calculated from Figs. 1 and 3. In the case of the type-a
structure shown in Fig. 1, the circular shift register contains n
flip-flops to store the coordinates of operand B. As described
earlier, the structure utilizes $(n-1)/2$ identical modules $M_j$,
$j = 1, 2, \ldots, (n-1)/2$, each of which employs $2w$ flip-flops
to store the values of signals $p^{(\ell)}_{j,k}$ and $q^{(\ell)}_{j,k}$ at each clock
cycle for all values of $k$ from 0 to $w-1$. In total, the
number of required flip-flops comes to $w(n-1)+n$. There are
also $w(n-1)$ two-input AND gates, each followed immediately
by a two-input XOR gate. Assuming that the XOR network
at the bottom of Fig. 3 is made only of two-input XOR
gates, the architecture requires $(4w-1)(n-1)/2$ XOR gates
altogether.
In the case of the type-b architecture, each module $M_j$ contains
only $w$ parallel branches instead of the $2w$ in its type-a counterpart.
As a result, the number of AND gates and the total number
of flip-flops decrease to $w(n-1)/2$ and $w(n-1)/2 + n$,
respectively. XOR gates appear in two separate layers in the
structure of Fig. 3. The first layer consists of $w(n-1)/2$ gates,
and the layer at the bottommost part of the structure requires
$(w-1)(n-1)/2$ gates. In sum, a total of $(2w-1)(n-1)/2$ XOR
gates is used in the structure of the type-b multiplier.
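The gate-count expressions above can be collected into a small calculator. The Python sketch below simply evaluates the formulas stated in this section for a given $n$ and $w$; the function names and the dictionary layout are illustrative choices, not part of the paper.

```python
def type_a_counts(n, w):
    """Resource counts for DL-SRB-a as stated in the text (n odd)."""
    half = (n - 1) // 2                      # number of modules M_j
    return {
        "AND": w * (n - 1),                  # two-input AND gates
        "XOR": (4 * w - 1) * half,           # accumulators plus XOR tree
        "FF": w * (n - 1) + n,               # accumulators plus shift register
    }


def type_b_counts(n, w):
    """Resource counts for DL-SRB-b as stated in the text (n odd)."""
    half = (n - 1) // 2
    return {
        "AND": w * half,
        "XOR": (2 * w - 1) * half,
        "FF": w * half + n,
    }
```

Evaluating these expressions for $n = 467$ and $w = 8$ gives 3728 AND gates, 7223 XOR gates, and 4195 flip-flops for type-a, against 1864, 3495, and 2331 for type-b.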
ONBs are the most efficient class of Gaussian
NBs (GNBs) [16], [24]. To achieve smaller area and time
overheads when using NBs over binary extension fields, it
is recommended to use a GNB of the lowest possible type.
The lowest possible type for a GNB is 2, and a type-2
GNB is also known as a type-II ONB. Since ONBs are the
most efficient class of NBs, it is worthwhile to make
a complexity comparison between the proposed multipliers
and several recently proposed ONB multipliers. Table II draws
a comparison between the hardware complexities of the two
proposed multipliers, those of existing digit-level RB multi-
pliers and several ONB multipliers. The comparison has been
made in terms of the number of logic cells used, critical path
delay ($T_{cp}$), and multiplication delay. Among the architectures
listed in Table II, the three architectures presented
in [13], [14], and [21] are based on an LFSR structure, whereas
the others have non-LFSR structural designs. The architecture
most comparable to that being proposed is the “comb-style”
architecture presented in [12]. Although the overall structure
of the two architectures might seem similar, there are two
important differences between them. First, the comb-style
architecture in [12] implements a general RB multiplier and
does not utilize the symmetry property even if applicable.
Second, for each output coordinate in this architecture, the
results of partial bitwise products are added together first,
and then, the resultant value enters the accumulation unit.
The addition operation is applied over w partial products
together with the current data stored in the accumulation
unit before updating the output flip-flop at each clock cycle.
As a result, the critical path contains the XOR chain, thus
causing additional delay cost.
As can be seen in Table II, the second proposed architecture,
DL-SRB-b, requires the smallest gate count
compared with the other RB multipliers. In terms of maximum
operating frequency, PS-III has the smallest critical path delay.
However, the reduction in critical path delay is achieved by
utilizing a layer of flip-flops between the AND gates and the
pipelined XOR tree in the structure of PS-III [15, Fig. 7]
at the cost of using about w times more flip-flops and sig-
nificantly longer multiplication delay. The proposed structure
DL-SRB-a together with “High-speed” structure in [14] has
the second smallest critical path delay amongst all the struc-
tures under comparison.
In order to enable a better comparison, the area and delay
complexities of the multipliers listed in Table II have been
calculated and tabulated in Table III as a case study. Among
the five field sizes recommended by the National Institute of
Standards and Technology for elliptic curve applications [19],
m = 233 is the only one for which a type-II ONB exists.
For this reason, in all the calculations made for Table III, the
field size was selected as m = 233. Note that $F_{2^{233}}$ can be
embedded into the cyclotomic field $F_2^{(467)}$. As mentioned earlier,
accommodating ring type operations is a unique feature of
redundant representation which not only provides a cost-free
squaring operation but also eliminates the need for modular
reduction in finite field operations. However, these remarkable
advantages are achieved at the cost of a certain level of
redundancy in the number of bits required to represent field
elements. It should be noted that the appropriate choice of
representation system generally depends on the overall speci-
fications of the cryptographic system being implemented, such
as field size (security level), the frequency of using multiplica-
tion and exponentiation operations, the overhead of using basis
conversions, fault tolerance, and so on. Although numerical
comparison can reveal that the proposed architectures can
effectively reduce the area-delay complexity of RB multipliers
(by almost half) for 60% of all field sizes, the main focus of
this paper is placed on about 20% of the fields for which
T is equal to 2. In that case, not only does the complexity of
the proposed RB multipliers become comparable to that of
ONB multipliers, but, as suggested in Table III, they may even
outperform ONB multipliers.
For each multiplier listed in Table III, the calculations
were made for three practical digit sizes 8, 16, and 32 based
on the following assumptions. The required areas for an
AND gate, an XOR gate, and a D-type flip-flop are assumed
to be equal to $\delta_A$, $\delta_X$, and $\delta_R$ square units, respectively.¹
Parameter r in the second row of Table II represents the
number of output product bits generated simultaneously in
each clock cycle. To make a fair comparison in terms of
gate counts and multiplication delay, the value of parameter
r is assumed to be equal to w in Table III where needed.
It is also assumed that the propagation delay of an XOR
gate is twice as long as the delay of an AND gate [25]. The
column entitled “Area Cost” in Table III shows the total
area required by logic gates and registers for each multiplier.
Assuming that the propagation delay of an AND gate is equal
to 1 delay unit, the column entitled “Delay Cost” presents the
relative multiplication delays in proportion to the delay of an
AND gate.
¹In the CMOS065LP standard cell library (from STMicroelectronics), a
two-input AND gate, a two-input XOR gate, and a D-type flip-flop with set/reset
are implemented by 6, 12, and 28–30 transistors, respectively, and the area
requirements for the pertaining standard cells are reported in [25].
TABLE II
COMPLEXITY COMPARISON BETWEEN DIGIT-LEVEL ARCHITECTURES; THE PROPOSED MULTIPLIERS
VERSUS SOME EXISTING RB AND ONB MULTIPLIERS ($n = Tm + 1$)
As shown in Table III, DL-SRB-a offers much lower delay
costs compared with the other multipliers. DL-SRB-b stands
at the second position, having the second lowest delay cost
except for only one case in which PS-III shows a slightly
better performance when the digit size is equal to 8. In the
design of digit-level finite field multipliers, there is always a
tradeoff between delay and area costs as two important design
factors, and reducing one of them generally results in an increase
in the other. To achieve a fair comparison, the area-delay
product of the multipliers has been calculated and listed in
the rightmost column of Table III. As can be seen, both of
the proposed architectures show much lower area-delay costs
than all the existing RB multipliers for all digit sizes listed in
Table III. In the case of DL-SRB-b, the area-delay cost
is 53%, 51%, and 47% lower than that of the most comparable
architecture when w = 8, 16, and 32, respectively. In comparison
with ONB multipliers, the DL-SRB-b architecture offers
24%, 29%, and 7% area-delay improvement for digit sizes
8, 16, and 32, respectively.
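The cost metrics discussed above can be written out explicitly. The Python sketch below is a hedged illustration: it assumes that the area cost is the gate-count-weighted sum of $\delta_A$, $\delta_X$, and $\delta_R$, that the delay cost is the number of clock cycles multiplied by the critical path delay (that is, the clock period is taken equal to $T_{cp}$), and that $T_A = 1$ and $T_X = 2$ as stated in the text. The function names are hypothetical, and any numeric area weights passed in are placeholders rather than values from [25].

```python
def area_cost(counts, delta_a, delta_x, delta_r):
    """Weighted area: AND, XOR, and flip-flop counts times their unit areas."""
    return (counts["AND"] * delta_a
            + counts["XOR"] * delta_x
            + counts["FF"] * delta_r)


def delay_cost(cycles, t_cp):
    """Multiplication delay in AND-gate delay units, assuming T_clock = T_cp."""
    return cycles * t_cp


def area_delay_product(area, delay):
    """Figure of merit combining both costs; lower is better."""
    return area * delay


def relative_reduction(reference, proposed):
    """Fractional reduction of `proposed` with respect to `reference`."""
    return 1.0 - proposed / reference


T_A, T_X = 1, 2                 # AND delay = 1 unit, XOR delay = 2 units (from the text)
T_CP_TYPE_A = T_A + T_X         # critical path of DL-SRB-a
T_CP_TYPE_B = T_A + 2 * T_X     # critical path of DL-SRB-b
```

Under these assumptions the critical paths of the two proposed multipliers are 3 and 5 AND-gate delays, respectively; a `relative_reduction` of 0.5 would correspond to an area-delay cost that is 50% lower than the reference.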
It has been proven that if there exists a type-II ONB for
representing field elements in $F_{2^m}$, then a cyclotomic field of
degree $2m+1$ ($T = 2$) always exists [8]. However, the converse
statement is not always true. The existence of a cyclotomic
field of degree $n = 2m+1$ for $F_{2^m}$ does not necessarily imply
that a type-II ONB for that particular field size exists. As a
result, the advantage of using the proposed multipliers would
become more distinct when T = 2 but no ONB exists; for
example, in the case of m = 200, 204, or 224.
Fig. 4. Design flow used to implement the proposed architectures.
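The gap between the two existence conditions for T = 2 and a type-II ONB can be explored numerically. The Python sketch below relies on two standard number-theoretic criteria that are not spelled out in this section and should be read as assumptions of the example: $F_{2^m}$ lies in the cyclotomic field of degree $n = 2m+1$ when $m$ divides the multiplicative order of 2 modulo $n$, and a type-II ONB exists when $n$ is prime and either 2 is primitive modulo $n$, or $n \equiv 3 \pmod 4$ and 2 has order $m$ modulo $n$. The function names are illustrative.

```python
def multiplicative_order_of_2(n):
    """Smallest k > 0 with 2**k == 1 (mod n); n must be odd and > 1."""
    order, x = 1, 2 % n
    while x != 1:
        x = (2 * x) % n
        order += 1
    return order


def is_prime(n):
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def t2_embedding_exists(m):
    """Assumed criterion: m divides the order of 2 modulo n = 2m + 1."""
    n = 2 * m + 1
    return multiplicative_order_of_2(n) % m == 0


def type2_onb_exists(m):
    """Assumed criterion for a type-II ONB of F_{2^m}."""
    n = 2 * m + 1
    if not is_prime(n):
        return False
    k = multiplicative_order_of_2(n)
    return k == 2 * m or (n % 4 == 3 and k == m)
```

For m = 233 both checks succeed, whereas for m = 200 the embedding check succeeds and the ONB check fails, in line with the examples given in the text.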
V. HARDWARE IMPLEMENTATION
In order to verify the theoretical results, both of the proposed
multipliers were implemented in hardware as separate
application-specific integrated circuit (ASIC) modules for three
digit sizes: 8, 16, and 32. The multipliers have been realized
for the binary extension field of degree 233. Note that in
this case T is equal to 2 and the result of Corollary 1 is
applicable to the cyclotomic field of degree n = 467. All the
implementations were carried out in a seven-metal-layer 65-nm
CMOS process from STMicroelectronics with the CMOS065LP
standard cell library.
TABLE III
NUMERICAL COMPLEXITY COMPARISON OF DIGIT-LEVEL RB AND ONB MULTIPLIERS IN $F_{2^{233}}$
(EMBEDDED IN $F_2^{(467)}$ IN THE CASE OF RB) FOR DIFFERENT VALUES OF w
TABLE IV
ASIC IMPLEMENTATION DETAILS FOR THE PROPOSED DIGIT-LEVEL RB MULTIPLIERS IN $F_{2^{233}}$
Fig. 4 shows the design flow used to realize each multiplier.
The implementation process started with writing
Verilog code to describe the multiplier in a hardware description
language. The C language was used to generate the netlist blocks
describing the numerous interconnections between logic gates
as the main part of the RTL code. Then, the RTL design was
synthesized to an optimal gate-level design using Design Compiler
from Synopsys. In the final stage, the netlist was imported
into Cadence SoC Encounter to perform floorplanning, cell
placement, clock tree synthesis, reset net synthesis, and routing
tasks. Three rounds of simulations were also carried out after
the RTL design, synthesis, and place-and-route stages to ensure the
correct functionality of the multiplier. A set of golden results
was initially created by simulating the multiplier with a large
set of randomly generated input operands in MATLAB. Then,
in each round of simulation, the same set of input data was
fed into the multiplier and the product values were compared
against the golden set. To obtain accurate power estimation,
the final netlist was generated by Encounter and then simulated
for 1000 pairs of random input vectors by NCSim to extract
and store the switching activity information of all internal nets
in value change dump (VCD) format. The switching activity
information was fed into Encounter afterward to calculate the
power consumption values.
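The golden-vector comparison described above can be illustrated with a small software harness. The Python sketch below is not the authors' MATLAB model: it assumes that multiplication in the redundant representation is cyclic convolution of the coordinate vectors over $F_2$ (multiplication modulo $x^n + 1$), and the names `ring_mul`, `make_vectors`, and `mismatches` are hypothetical. Operands are packed into Python integers, with bit $i$ holding coordinate $i$.

```python
import random


def ring_mul(a, b, n):
    """Reference product of two n-bit operands in F_2[x]/(x^n + 1)."""
    mask = (1 << n) - 1
    acc = 0
    for i in range(n):
        if (a >> i) & 1:
            # Cyclic left shift of b by i positions, then XOR-accumulate.
            acc ^= ((b << i) | (b >> (n - i))) & mask
    return acc


def make_vectors(n, count, seed=0):
    """Reproducible list of random operand pairs (fixed seed)."""
    rng = random.Random(seed)
    return [(rng.getrandbits(n), rng.getrandbits(n)) for _ in range(count)]


def mismatches(dut, vectors, n):
    """Operand pairs for which the design under test disagrees with the reference."""
    return [(a, b) for a, b in vectors if dut(a, b) != ring_mul(a, b, n)]
```

In a flow like the one described, `dut` would wrap the simulator output for the RTL, post-synthesis, or post-layout netlist, and the same vector set would be reused in all three rounds so that an empty mismatch list indicates agreement with the golden results.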
The main characteristics of ASIC implementations for the
proposed multipliers are listed in Table IV. It should be noted
that the gap between the critical path delays measured in the
postsynthesis stage and the post-place-and-route stage increases
as the value of w changes from 8 to 32. These changes in
critical path delay stem from two facts. First, increasing the
level of parallelism in the architectures of multipliers can
significantly increase the capacitive load of certain nets, such
as input A, reset, and clock. Consequently, a longer buffer
chain is required to be able to properly drive logic cells
connected to the high-fan-out nets, thus causing additional
delay. Second, contributing factors, such as interconnect and
parasitic capacitances, can only be taken into account for
timing analysis after place-and-route, when the layout is fully
routed. Such factors eventually lead to a longer critical path
delay.
VI. CONCLUSION
Two new digit-level SIPO finite field multipliers using
redundant representation have been proposed. For about 60%
of the field sizes within the practical range for ECC applications,
the relationship between the extension degree m and the
size n of the smallest cyclotomic field in which $F_{2^m}$ can be
embedded is expressed as $n = Tm + 1$ for T even and greater
than or equal to 2 [18]. In this case, a specific feature of
redundant representation was used to alleviate the redundancy
problem in this representation system. Numerical complexity
comparison showed that both new architectures have the
lowest delay cost compared with the existing RB architectures.
One of the proposed architectures achieved at least 2.12 times
higher performance (for different digit sizes over $F_{2^{233}}$) in
comparison with the most comparable RB architecture when
considering area-delay complexity as a measure of performance.
In about 20% of cases, where T = 2, the proposal can
show better performance than ONB multipliers, where these exist, and
can show much better performance than NB multipliers when
T = 2 but no ONB exists (e.g., field sizes 200, 204, and 224).
A VLSI implementation of the proposed architectures for the binary
extension field of degree 233 and three practical digit sizes in 65-nm
CMOS technology was also presented.
REFERENCES
[1] T. ElGamal, “A public key cryptosystem and a signature scheme based
on discrete logarithms,” IEEE Trans. Inf. Theory, vol. 31, no. 4,
pp. 469–472, Jul. 1985.
[2] I. F. Blake, G. Seroussi, and N. P. Smart, Elliptic Curves in Cryptography
(London Mathematical Society Lecture Note Series). Cambridge, U.K.:
Cambridge Univ. Press, 1999.
[3] A. J. Menezes, P. C. Van Oorschot, and S. A. Vanstone, Handbook
of Applied Cryptography (Discrete Mathematics and Its Applications).
Boca Raton, FL, USA: CRC Press, 1996.
[4] T. Itoh and S. Tsujii, “A fast algorithm for computing multiplicative
inverses in $GF(2^m)$ using normal basis,” Inf. Comput., vol. 78, no. 3,
pp. 171–177, 1988.
[5] C. Rebeiro, S. Roy, D. Reddy, and D. Mukhopadhyay, “Revisiting
the Itoh–Tsujii inversion algorithm for FPGA platforms,” IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 8, pp. 1508–1512,
Aug. 2011.
[6] E. D. Mastrovito, “VLSI architectures for computations in Galois fields,”
Ph.D. dissertation, Dept. Electr. Eng., Linköping Univ., Linköping,
Sweden, 1991.
[7] J. Omura and J. Massey, “Computational method and apparatus for finite
field arithmetic,” U.S. Patent 4 587 627, May 6, 1986.
[8] H. Wu, M. A. Hasan, I. F. Blake, and S. Gao, “Finite field multiplier
using redundant representation,” IEEE Trans. Comput., vol. 51, no. 11,
pp. 1306–1316, Nov. 2002.
[9] D. Jungnickel, A. J. Menezes, and S. A. Vanstone, “On the number
of self-dual bases of $GF(q^m)$ over $GF(q)$,” Proc. Amer. Math. Soc.,
vol. 109, no. 1, pp. 23–29, 1990.
[10] S. Gao, J. von zur Gathen, D. Panario, and V. Shoup, “Algorithms for
exponentiation in finite fields,” J. Symbolic Comput., vol. 29, no. 6,
pp. 879–889, 2000.
[11] S. Gao, J. von zur Gathen, and D. Panario, “Gauss periods and
fast exponentiation in finite fields,” in LATIN Theoretical Informatics
(Lecture Notes in Computer Science), vol. 911. Berlin, Germany:
Springer, 1995, pp. 311–322.
[12] A. H. Namin, H. Wu, and M. Ahmadi, “Comb architectures for finite
field multiplication in $F_{2^m}$,” IEEE Trans. Comput., vol. 56, no. 7,
pp. 909–916, Jul. 2007.
[13] A. H. Namin, H. Wu, and M. Ahmadi, “A new finite-field multiplier
using redundant representation,” IEEE Trans. Comput., vol. 57, no. 5,
pp. 716–720, May 2008.
[14] A. H. Namin, H. Wu, and M. Ahmadi, “An efficient finite field multiplier
using redundant representation,” ACM Trans. Embedded Comput. Syst.,
vol. 11, no. 2, Jul. 2012, Art. no. 31.
[15] J. Xie, P. Meher, and Z.-H. Mao, “High-throughput finite field multipliers
using redundant basis for FPGA and ASIC implementations,” IEEE
Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 1, pp. 110–119,
Jan. 2015.
[16] R. Lidl and H. Niederreiter, Introduction to Finite Fields and Their
Applications, 2nd ed. New York, NY, USA: Cambridge Univ. Press,
1997.
[17] D. W. Ash, I. F. Blake, and S. A. Vanstone, “Low complexity normal
bases,” Discrete Appl. Math., vol. 25, no. 3, pp. 191–210, 1989.
[18] H. Wu, M. Hasan, and I. Blake, “Highly regular architectures for
finite field computation using redundant basis,” in Cryptographic Hard-
ware and Embedded Systems (Lecture Notes in Computer Science),
vol. 1717, C. K. Koç and C. Paar, Eds. Berlin, Germany: Springer, 1999,
pp. 269–279.
[19] C. F. Kerry and P. D. Gallagher, “Digital signature standard (DSS),”
U.S. Dept. Commerce, Nat. Inst. Standards Technol., Tech. Rep. FIPS 186-4, Jul. 2013.
[Online]. Available: http://csrc.nist.gov/publications/PubsFIPSArch.html
[20] R. Azarderakhsh and A. Reyhani-Masoleh, “Low-complexity multiplier
architectures for single and hybrid-double multiplications in Gaussian
normal bases,” IEEE Trans. Comput., vol. 62, no. 4, pp. 744–757,
Apr. 2013.
[21] A. H. Namin, H. Wu, and M. Ahmadi, “A word-level finite field
multiplier using normal basis,” IEEE Trans. Comput., vol. 60, no. 6,
pp. 890–895, Jun. 2011.
[22] C.-Y. Lee and P. Meher, “Area-efficient subquadratic space-complexity
digit-serial multiplier for type-II optimal normal basis of $GF(2^m)$ using
symmetric TMVP and block recombination techniques,” IEEE Trans.
Circuits Syst. I, Reg. Papers, vol. 62, no. 12, pp. 2846–2855, Dec. 2015.
[23] A. Namin, H. Wu, and M. Ahmadi, “A parallel-in serial-out multiplier
using redundant representation for a class of finite fields,” in Proc.
13th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Dec. 2006,
pp. 502–505.
[24] S. Kwon, K. Gaj, C. H. Kim, and C. P. Hong, “Efficient linear array
for multiplication in $GF(2^m)$ using a normal basis for elliptic curve
cryptography,” in Cryptographic Hardware and Embedded Systems,
M. Joye and J.-J. Quisquater, Eds. Berlin, Germany: Springer, 2004,
pp. 76–91.
[25] 65nm STMicroelectronics CMOS Technology, Standard Cell Library for
65 Nanometer CMOS065LP VLSI Digital Design Platform, Jun. 2006.
Parham Hosseinzadeh Namin (S’09) received the
B.Sc. degree in electrical engineering from the
Islamic Azad University of Karaj, Alborz, Iran,
in 2006, and the M.Sc. degree in telecommunication
engineering from the University of Tabriz, Tabriz,
Iran, in 2009. He is currently pursuing the Ph.D.
degree with the Department of Electrical and Com-
puter Engineering, University of Windsor, Windsor,
ON, Canada.
His current research interests include digital and
analog integrated circuits, architectures in finite
fields, the hardware implementation of cryptosystems, spread spectrum com-
munications, and cognitive radio networks.
Roberto Muscedere (S’95–M’96) was born in
Windsor, ON, Canada, in 1973. He received the
B.A.Sc., M.A.Sc., and Ph.D. degrees from the Uni-
versity of Windsor, Windsor, in 1996, 1999, and
2003, respectively, all in electrical engineering.
He managed the microelectronics computing
environment with the Research Center for Inte-
grated Microsystems, University of Windsor from
1996 to 2001. His current research interests include
the implementation of high performance and low
power VLSI circuits, full and semicustom VLSI
design, computer arithmetic, HDL synthesis, and digital signal processing.
Majid Ahmadi (S’75–M’77–SM’84–F’02–LF’14)
received the B.Sc. degree in electrical engineering
from the Sharif University of Technology, Tehran,
Iran, in 1971, and the Ph.D. degree in electrical
engineering from the Imperial College of Science,
Technology and Medicine, London, U.K., in 1977.
He has been with the Department of Electrical
and Computer Engineering, University of Windsor,
Windsor, ON, Canada, since 1980 and he is cur-
rently a Distinguished University Professor and an
Associate Dean of Engineering for Research and
Graduate Studies. He has co-authored the book Digital Filtering in One-D
and Two-Dimensions; Design and Applications (New York, Plenum, 1989)
and has authored over 500 articles in these areas. His current research interests
include digital signal processing, machine vision, pattern recognition, neural
network architectures, applications, and VLSI implementation, computer
arithmetic, and MEMS.
Dr. Ahmadi is a fellow of IET. He was a recipient of an Honorable
Mention Award from the Editorial Board of the Journal of Pattern Recog-
nition in 1992, and the best paper award from the 2011 IEEE International
Electro/Information Technology Conference. He received the Distinctive Con-
tributed Paper Award from the Multiple-Valued Logic Conference Technical
Committee and the IEEE Computer Society in 2000, the Distinguished
University Professorship in 2003, the Faculty of Engineering Dean’s Special
Recognition Award in 2007, and the University of Windsor Award for
Excellence in Scholarship, Research, and Creative Activity in 2008. He was
the IEEE-CAS representative on the Neural Network Council and the Chair
of the IEEE Circuits and Systems Neural Systems Applications Technical
Committee in 2000. He has served on the Editorial Board of the Journal
of Circuits, Systems, and Computers as an Associate Editor and a Regional
Editor from 1992 to 2012, and has been an Associate Editor for the Journal of Pattern
Recognition since 1992.
