Trade-Off Approach for GHASH Computation Based on a Block-Merging Strategy by Negre, Christophe
HAL Id: hal-01852027
https://hal.archives-ouvertes.fr/hal-01852027
Preprint submitted on 31 Jul 2018
Trade-Off Approach for GHASH Computation Based on a
Block-Merging Strategy
Christophe Negre (1,2)
(1) Team DALI, Université de Perpignan, France
(2) LIRMM, Université de Montpellier and CNRS, Montpellier, France
christophe.negre@univ-perp.fr
Abstract: In the Galois counter mode (GCM) of encryption, an authentication tag is computed with a sequence of multiplications and additions in $\mathbb{F}_{2^m}$. In this paper we focus on multiply-and-add architectures with a subquadratic space complexity multiplier in $\mathbb{F}_{2^m}$. We propose a recombination of the architecture of P. Patel [16], which is based on a subquadratic space complexity Toeplitz matrix-vector product. We merge some blocks of the recombined architecture in order to reduce the critical path delay. We obtain an architecture with a subquadratic space complexity of $O(\log_2(m)\, m^{\log_2(3)})$ and a reduced delay of $(1.59 \log_2(m) + \log_2(\delta))D_X + D_A$, where $\delta$ is a small constant. To the best of our knowledge, this is the first multiply-and-add architecture with subquadratic space complexity and delay smaller than $2\log_2(m)D_X$.
Keywords: GHASH, Galois counter mode, multiply-and-add architecture, binary field multiplier,
subquadratic space complexity
1. Introduction
One extensively used cryptographic operation is encryption with a secret key. The Galois counter mode (GCM) [11] is a mode of encryption which encrypts a message of any length and produces an authentication tag. Encryption is performed in counter mode, which can be parallelized. Authentication is done with the GHASH function, which consists of a sequence of multiplications and additions in a field $\mathbb{F}_{2^m}$. This sequence of multiplications and additions can also be parallelized [18]. The resulting high level of parallelism makes GCM attractive for high-throughput encrypted communication. GCM used with the block cipher AES [3, 13] was published as a standard by NIST in 2007 [14].
Efficient implementation of GCM relies first on a parallelized and pipelined implementation of the block cipher [18, 9, 17, 1]. The multiplier in $\mathbb{F}_{2^m}$ used in the computation of the authentication tag also has to be implemented efficiently. The most efficient multipliers in $\mathbb{F}_{2^m}$ are the parallel multipliers, which compute a product in one clock cycle and have a small critical path delay. Such parallel multipliers either have a quadratic space complexity and a delay of $(\log_2(m) + O(1))D_X + D_A$ (cf. [10, 6, 2, 21]), or have a subquadratic space complexity and a delay of $(2\log_2(m) + O(1))D_X + D_A$ [15, 20, 12, 4, 5, 7]. Here $D_X$ is the delay of an XOR gate, $D_A$ is the delay of an AND gate, and $m$ is a power of 2. Subquadratic space complexity multipliers are based either on subquadratic complexity formulas for polynomial multiplication [15, 20, 12] or on subquadratic complexity formulas for the Toeplitz matrix-vector product (TMVP) [4, 5, 7].
In this paper we investigate a new strategy for a subquadratic space complexity architecture with an improved delay. Our approach consists of a recombination of the different blocks of the multiply-and-add architecture based on a subquadratic TMVP multiplier in $\mathbb{F}_{2^m}$. This recombination enables us to merge some blocks in order to reduce the delay. We obtain a multiply-and-add architecture which has a subquadratic space complexity of $(\log_2(m) + O(1))\,m^{\log_2(3)} + O(m)$ gates and a delay of $(\log_2(3)\log_2(m) + O(1))D_X + 2D_A$. To the best of our knowledge, this is the first time a subquadratic space complexity multiply-and-add architecture has a delay smaller than $2\log_2(m)D_X + D_A$.
The remainder of this paper is organized as follows. In Section 2 we review the GCM and GHASH algorithms. In Section 3 we review the subquadratic space complexity multiply-and-add architecture of [16], which is based on a subquadratic space complexity TMVP multiplier. In Section 4 we present a recombination and some block-merging approaches that reduce the critical path delay of the multiply-and-add architecture. In Section 5 we compare the complexity of the proposed architecture to the state of the art and give some concluding remarks.
2. Galois Counter Mode and GHASH function
GCM performs two main cryptographic operations: encryption/decryption and authentication. Encryption and decryption use a block cipher, denoted CIPH, of block size $m$, operated in counter mode [11]. In the sequel we assume that $m$ is a power of 2. The counter used for encryption is initialized with an $m$-bit value called the initial counter block (ICB). It is encrypted with the block cipher CIPH under a secret key $K$, and the result is XORed with the first $m$-bit block $P_1$ of the plaintext $P$. This process is repeated, with the counter incremented by one, for $P_2, P_3, \ldots$, until the last block $P_n$ of the message is encrypted. The counter mode of operation is depicted in Fig. 1.
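Fig. 1: Block-cipher CIPH in Counter Mode. [Diagram: the counter blocks ICB, CB_2, ..., CB_n are obtained by successive increments (+1); each counter block is encrypted with CIPH_K and XORed with the plaintext block P_i to give the ciphertext block C_i.]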
The authentication tag is computed from the ciphertext $C$ and the hash key $H = \mathrm{CIPH}_K(0^m)$ with the GHASH function. This function is defined as follows: let $C = C_1 C_2 \cdots C_n$ be the decomposition of $C$ into blocks of $m$ bits; then the hash value of the message $C$ is

$$\mathrm{GHASH}(C) = C_n H + C_{n-1} H^2 + \cdots + C_1 H^n,$$

where the $C_i$ and $H$ are considered as elements of $\mathbb{F}_{2^m}$. GHASH(C) can be computed as a sequence of multiplications by $H$, each followed by the addition of a block $C_i$ in $\mathbb{F}_{2^m}$. This method is shown in Algorithm 1.
Algorithm 1 GHASH(C, H)
Require: $C = (C_1, \ldots, C_n)$ where $C_i \in \mathbb{F}_{2^m}$ and $H \in \mathbb{F}_{2^m}$
Ensure: $V = \mathrm{GHASH}(C, H)$
  $V \leftarrow C_1$
  for $i = 2$ to $n$ do
    $V \leftarrow H \times V + C_i$
  end for
  $V \leftarrow H \times V$
  return $V$
Algorithm 1 computes $C_1 H^n + C_2 H^{n-1} + \cdots + C_n H$ with $n$ multiplications and $n-1$ additions in $\mathbb{F}_{2^m}$. These operations can be performed in sequence with one multiplier and one adder in $\mathbb{F}_{2^m}$. Let $S_M$ and $D_M$ be the space and time complexities of the considered $\mathbb{F}_{2^m}$ multiplier and let $D_X$ be the delay of an XOR gate. The space complexity $S$ of this multiply-and-add architecture is

$$S = S_M + m \text{ XORs}. \tag{1}$$

The overall delay for the computation of GHASH(C, H) is

$$D = n(D_M + D_X). \tag{2}$$
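As an illustration of Algorithm 1, the following minimal sketch checks that the multiply-and-add loop returns the same value as the direct sum $C_1H^n + \cdots + C_nH$. It is only a sketch under stated assumptions: it works in a toy field $\mathbb{F}_{2^8}$ with $f(x) = x^8 + x^4 + x^3 + x + 1$ (AES-GCM uses $m = 128$), stores the coefficient of $x^i$ in bit $i$ of an integer (the bit ordering of the GCM standard is not reproduced), and the function names are hypothetical.

```python
M = 8                                  # toy field size (assumption; AES-GCM uses m = 128)
F = (1 << 8) | 0b00011011              # f(x) = x^8 + x^4 + x^3 + x + 1

def gf_mul(a, b, m=M, f=F):
    """Product in F_2[x]/(f); bit i of an integer is the coefficient of x^i."""
    r = 0
    for i in range(m):
        if (b >> i) & 1:
            r ^= a << i                # carry-less schoolbook product
    for i in range(2 * m - 2, m - 1, -1):
        if (r >> i) & 1:
            r ^= f << (i - m)          # replace x^i by x^(i-m) * e(x)
    return r

def ghash_horner(blocks, h):
    """Algorithm 1: one multiplication by H and one addition per block."""
    v = blocks[0]
    for c in blocks[1:]:
        v = gf_mul(h, v) ^ c
    return gf_mul(h, v)

def ghash_direct(blocks, h):
    """Direct evaluation of C_1 H^n + C_2 H^(n-1) + ... + C_n H."""
    n = len(blocks)
    acc = 0
    for i, c in enumerate(blocks, start=1):
        p = c
        for _ in range(n - i + 1):
            p = gf_mul(p, h)
        acc ^= p
    return acc
```

Both functions take the list of blocks $C_1, \ldots, C_n$ as integers; the loop version performs exactly $n$ calls to `gf_mul`, which is what the delay formula (2) counts.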
3. Field Multiplier based on TMVP
The main operation in the GHASH computation is the multiplication in $\mathbb{F}_{2^m}$. It is thus important to have an efficient $\mathbb{F}_{2^m}$ multiplier with respect to both time and space complexity. Consequently, we focus on subquadratic complexity multipliers; in particular, we consider the approach of Patel [16], which uses the subquadratic space complexity multiplier based on a TMVP approach [4].
A binary field $\mathbb{F}_{2^m}$ is defined as the set of binary polynomials modulo an irreducible polynomial $f(x)$ of degree $m$. An element of $\mathbb{F}_{2^m} = \mathbb{F}_2[x]/(f(x))$ is then a binary polynomial of degree less than $m$.
We assume that $f(x) = x^m + e(x)$ where $e(x) = 1 + x^{e_1} + \cdots + x^{e_s}$ and $\delta = \deg e(x) = e_s$ is small compared to $m$. This is for example the case for the NIST standard for AES-GCM [14], where $f(x) = x^{128} + x^7 + x^2 + x + 1$. Following P. Patel [16], we can derive a subquadratic multiplier modulo such an irreducible polynomial with the following steps.
• Step 1: matrix reduction. We first follow the construction of Halbutogullari and Koç [6]. We start with the following $2m \times m$ matrix-vector formulation of the multiplication of two polynomials $U(x) = \sum_{i=0}^{m-1} u_i x^i$ and $V(x) = \sum_{i=0}^{m-1} v_i x^i$:

$$
\begin{bmatrix}
u_0 & 0 & 0 & \cdots & 0 \\
u_1 & u_0 & 0 & \cdots & 0 \\
u_2 & u_1 & u_0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
u_{m-1} & u_{m-2} & u_{m-3} & \cdots & u_0 \\
0 & u_{m-1} & u_{m-2} & \cdots & u_1 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & u_{m-1} \\
0 & 0 & 0 & \cdots & 0
\end{bmatrix}
\cdot
\begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ \vdots \\ v_{m-1} \end{bmatrix}
\tag{3}
$$

Then we use that $x^m = e(x) \bmod f(x)$ to reduce the rows of the matrix in (3) which correspond to the monomials $x^i$ with $i \geq m$. This leads to the following scheme for the reduction.
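[Diagram: reduction scheme. The lower $m$ rows of the $2m$-row matrix are added to the upper part once for each monomial of $e(x)$, with row offsets $0, e_1, \ldots, e_s$; after this first pass, $\delta$ non-zero rows remain below row $m$.]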
This diagram shows that the matrix is not yet fully reduced: after this first matrix reduction there remain $\delta$ non-zero rows corresponding to the monomials $x^m, x^{m+1}, \ldots, x^{m+\delta-1}$. We have to perform the matrix reduction a second time to obtain a fully reduced matrix.
• Step 2: matrix decomposition. P. Patel noticed in [16] that after the second matrix reduction, the resulting $m \times m$ matrix, denoted $M_U$, has a submatrix formed by the rows $\delta, \delta+1, \ldots, m-1$ which is Toeplitz. One can then rewrite this matrix as the sum of an $m \times m$ Toeplitz matrix $T_U$ and a matrix $S_U$ which has non-zero entries only in the first $\delta$ rows, as shown in Fig. 2.
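Fig. 2: Matrix decomposition. [Diagram: $M_U$, whose first $\delta$ rows form a non-Toeplitz sub-matrix and whose remaining rows form a Toeplitz sub-matrix, is written as the sum of the Toeplitz matrix $T_U$ and the matrix $S_U$, which consists of a non-Toeplitz sub-matrix in its first $\delta$ rows and a zero sub-matrix below.]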
The matrix-vector product $M_U \cdot V$ is then split into one Toeplitz matrix-vector product $T_U \cdot V$ and one non-Toeplitz matrix-vector product $S_U \cdot V$:

$$M_U \cdot V = T_U \cdot V + S_U \cdot V.$$

• Step 3: computation of $S_U \cdot V$. The matrix-vector product $S_U \cdot V$ is computed through $\delta$ independent circuits, each performing a row-vector product and consisting of $m$ parallel AND gates and a binary tree of $m-1$ XOR gates. This computation of the matrix product $S_U \cdot V$ has the following complexity:

$$
\begin{cases}
S = \delta(m-1) \text{ XORs and } \delta m \text{ ANDs}, \\
D = \log_2(m) D_X + D_A.
\end{cases}
\tag{4}
$$
• Step 4: computation of $T_U \cdot V$. The Toeplitz matrix-vector product $T_U \cdot V$ can be computed with a divide-and-conquer approach leading to a subquadratic multiplier. Specifically, if $m$ is even, Fan and Hasan proposed in [4] to use the two-way split formula of Winograd [22] shown in Table 1 to compute $T \cdot V$, where $T$ is an $m \times m$ Toeplitz matrix and $V$ is a vector of size $m$.
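Table 1: TMVP two-way split formula
Splitting: $T = \begin{bmatrix} T_1 & T_0 \\ T_2 & T_1 \end{bmatrix}$, $V = \begin{bmatrix} V_0 \\ V_1 \end{bmatrix}$
Recursion: $P_0 = (T_0 + T_1) \cdot V_1$, $P_1 = T_1 \cdot (V_0 + V_1)$, $P_2 = (T_1 + T_2) \cdot V_0$
Reconstruction: $T \cdot V = \begin{bmatrix} P_0 + P_1 \\ P_2 + P_1 \end{bmatrix}$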
When $m = 2^t$, Fan and Hasan obtained a TMVP multiplier with the following complexities when the formula in Table 1 is used recursively:

$$
\begin{cases}
S = \frac{11}{2} m^{\log_2(3)} - 6m + \frac{1}{2} \text{ XORs and } m^{\log_2(3)} \text{ ANDs}, \\
D = 2\log_2(m) D_X + D_A.
\end{cases}
\tag{5}
$$
Hasan et al. [8] noticed that the TMVP multiplier can be decomposed into four independent computations: Component Matrix Formation (CMF), Component Vector Formation (CVF), Component Multiplication (CM) and Reconstruction (R). The recursive formulas for these four computations are shown in Table 2 along with their time and space complexities.
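Table 2: Space and time complexities of the sub-computations in the Fan-Hasan multiplier
Computation | Split | Recursion | Complexity
CMF | $T = \begin{bmatrix} T_1 & T_0 \\ T_2 & T_1 \end{bmatrix}$ | $\mathrm{CMF}(T) = (\mathrm{CMF}(T_0 + T_1), \mathrm{CMF}(T_1), \mathrm{CMF}(T_1 + T_2))$ | $S_{CMF} = \frac{5}{2} m^{\log_2(3)} - 3m + \frac{1}{2}$ XORs, $D_{CMF} = \log_2(m) D_X$
CVF | $V = \begin{bmatrix} V_0 \\ V_1 \end{bmatrix}$ | $\mathrm{CVF}(V) = (\mathrm{CVF}(V_1), \mathrm{CVF}(V_0 + V_1), \mathrm{CVF}(V_0))$ | $S_{CVF} = m^{\log_2(3)} - m$ XORs, $D_{CVF} = \log_2(m) D_X$
CM | (none) | $\hat{W} = \mathrm{CMF}(T) \otimes \mathrm{CVF}(V)$ | $S_{CM} = m^{\log_2(3)}$ ANDs, $D_{CM} = D_A$
R | $\hat{W} = [\hat{W}_0, \hat{W}_1, \hat{W}_2]$ | $W = R(\hat{W}) = (R(\hat{W}_0) + R(\hat{W}_1), R(\hat{W}_1) + R(\hat{W}_2))$ | $S_R = 2m^{\log_2(3)} - 2m$ XORs, $D_R = \log_2(m) D_X$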
• Final step: multiply-and-add architecture. In Fig. 3 we provide the multiply-and-add architecture of [16] for the GHASH computation, which is based on the decomposition

$$M_H \cdot V = T_H \cdot V + S_H \cdot V$$

for a multiplication by $H$. The matrix-vector product $T_H \cdot V$ is computed with a subquadratic TMVP multiplier which is decomposed into the different computations (CMF, CVF, CM and R). We assume that the entries of $T_H$, $\mathrm{CMF}(T_H)$ and $S_H$ are precomputed. The products $T_H \cdot V$ and $S_H \cdot V$ are computed in parallel and their results are added to get $M_H \cdot V$. In Fig. 3, RVP stands for row-vector product and $\mathrm{row}_{i,S_H}$ is the $i$-th row of $S_H$. The space complexity of this multiply-and-add architecture is obtained by adding the complexity of $S_H \cdot V$, the complexity of $T_H \cdot V$ and that of an adder. The delay of this architecture is equal to the delay of a TMVP multiplier plus $D_X$:

$$
\begin{cases}
S(m) = \frac{11}{2} m^{\log_2(3)} + (\delta - 6)m + \frac{1}{2} - \delta \text{ XORs and } m^{\log_2(3)} + \delta m \text{ ANDs}, \\
D(m) = (2\log_2(m) + 1) D_X + D_A.
\end{cases}
$$
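Steps 1 and 2 above can be checked on a small example. The sketch below builds the reduced matrix $M_U$ for a toy modulus, verifies that the rows $\delta, \ldots, m-1$ are constant along diagonals, and splits $M_U$ into $T_U$ and $S_U$. The helper names are hypothetical, the reduction is done in a single highest-row-first pass (equivalent to the two passes described in Step 1), and the diagonals of $T_U$ that are not determined by the Toeplitz sub-matrix are set to 0, which is one possible choice and not necessarily the one made in [16].

```python
def mult_matrix(u, e_exps, m):
    """Reduced m x m matrix M_U such that M_U . V = U * V mod x^m + e(x).

    u[i] is the coefficient of x^i in U; e_exps lists the exponents of e(x),
    e.g. [0, 1, 3, 4] for e(x) = 1 + x + x^3 + x^4.
    """
    # 2m x m matrix of (3): column j holds the coefficients of x^j * U(x).
    rows = [[u[i - j] if i - j in range(m) else 0 for j in range(m)]
            for i in range(2 * m)]
    # Fold each row of degree i >= m using x^i = x^(i-m) * e(x), highest first.
    for i in range(2 * m - 1, m - 1, -1):
        for e in e_exps:
            rows[i - m + e] = [a ^ b for a, b in zip(rows[i - m + e], rows[i])]
        rows[i] = [0] * m
    return rows[:m]

def decompose(M, delta):
    """Split M into a Toeplitz matrix T and a matrix S supported on rows 0..delta-1."""
    m = len(M)
    diag = {}                                   # i - j -> value on that diagonal
    for i in range(delta, m):
        for j in range(m):
            # setdefault returns the stored value, so this checks the Toeplitz property
            assert diag.setdefault(i - j, M[i][j]) == M[i][j]
    # Diagonals never seen in rows delta..m-1 are free; this sketch sets them to 0.
    T = [[diag.get(i - j, 0) for j in range(m)] for i in range(m)]
    S = [[M[i][j] ^ T[i][j] for j in range(m)] for i in range(m)]
    return T, S

def matvec(A, v):
    """Matrix-vector product over F_2."""
    return [sum(a & b for a, b in zip(row, v)) & 1 for row in A]
```

With $m = 8$, `e_exps = [0, 1, 3, 4]` and $\delta = 4$, `decompose(mult_matrix(u, e_exps, 8), 4)` returns a pair `(T, S)` in which only the first four rows of `S` can be non-zero.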
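Fig. 3: Multiply-and-add architecture with TMVP multiplier [16]. [Diagram: the $m$-bit register $V$ feeds the TMVP multiplier (blocks CVF, CM with the precomputed $\mathrm{CMF}(T_H)$, and R) and, in parallel, $\delta$ RVP blocks with the rows $\mathrm{row}_{1,S_H}, \mathrm{row}_{2,S_H}, \ldots, \mathrm{row}_{\delta,S_H}$; the outputs are added together with $C_i$ and written back to $V$.]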
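The decomposition of the TMVP multiplier into the four computations of Table 2 can be sketched as follows. This is an illustrative software model, not a hardware description: the function names are hypothetical, and an $m \times m$ Toeplitz matrix is assumed to be stored as the list `t` of its $2m-1$ diagonal values with $T[i][j] =$ `t[i - j + m - 1]`.

```python
def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def cmf(t):
    """Component matrix formation; len(t) = 2m - 1 and T[i][j] = t[i - j + m - 1]."""
    if len(t) == 1:
        return t[:]
    h = (len(t) + 1) // 4                        # h = m / 2
    t0, t1, t2 = t[:2 * h - 1], t[h:3 * h - 1], t[2 * h:]   # blocks T0, T1, T2
    return cmf(xor(t0, t1)) + cmf(t1) + cmf(xor(t1, t2))

def cvf(v):
    """Component vector formation, in the order of Table 2."""
    if len(v) == 1:
        return v[:]
    h = len(v) // 2
    v0, v1 = v[:h], v[h:]
    return cvf(v1) + cvf(xor(v0, v1)) + cvf(v0)

def rec(w):
    """Reconstruction R of Table 2."""
    if len(w) == 1:
        return w[:]
    k = len(w) // 3
    r0, r1, r2 = rec(w[:k]), rec(w[k:2 * k]), rec(w[2 * k:])
    return xor(r0, r1) + xor(r1, r2)

def tmvp(t, v):
    """T . V computed as R(CMF(T) (x) CVF(V))."""
    w_hat = [a & b for a, b in zip(cmf(t), cvf(v))]   # CM: component-wise AND
    return rec(w_hat)
```

For $m = 8$ the component arrays returned by `cmf` and `cvf` have $3^3 = 27 = m^{\log_2(3)}$ entries, which is the register size used in Section 4.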
4. Recombination of the single-multiply-and-add architecture based on
Fan-Hasan multiplier
We present in this section our contribution for the multiply-and-add architecture. The multiply-and-add architecture of Fig. 3 has a critical path delay of $(2\log_2(m) + 1)D_X + D_A$. This delay is for the most part due to the blocks CVF and R, each contributing $\log_2(m)D_X$ to the critical path delay. Our strategy to reduce the critical path delay is to recombine the blocks of the architecture in order to obtain a block CVF∘R which can be computed with a delay of $\log_2(m)D_X$. We first explain the recombination of the blocks of the architecture in the next subsection. The remainder of the section is dedicated to explaining how the new blocks can be implemented.
4.1. Recombination of the multiply-and-add architecture
We recombine the blocks of the multiplier so that the output of the reconstruction block (R) is fed directly into the CVF block of the product $T_H \cdot V$ and into the computation of $S_H \cdot V$. In Fig. 3, the blocks R and CVF are separated by a bit-wise XOR. We use the following lemma to reverse the order of the block R and the addition.
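Lemma 1. We consider the term $T_U \cdot V + S_U \cdot V + C_i$ involved in Fig. 3, where $T_U$ is an $m \times m$ Toeplitz matrix. Let $\hat{W} = \mathrm{CMF}(T_U) \otimes \mathrm{CVF}(V)$. Then the following equation holds:

$$
\underbrace{R(\hat{W}) \oplus S_U \cdot V \oplus C_i}_{(*)} = R\Big(\hat{W} \oplus \big(\mathrm{CMF}(\mathrm{Id}) \otimes (\mathrm{CVF}(S_U \cdot V) \oplus \mathrm{CVF}(C_i))\big)\Big). \tag{6}
$$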
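Proof. Let us first denote $Y = S_U \cdot V + C_i$. The multiplication of $Y$ by the identity matrix Id can be done with a TMVP multiplier as follows:

$$Y = R(\mathrm{CMF}(\mathrm{Id}) \otimes \mathrm{CVF}(Y)). \tag{7}$$

Now we add $Y$ and $R(\hat{W})$ to get $(*)$ in (6), and we use the linearity of $R$ to obtain

$$
\begin{aligned}
R(\hat{W}) \oplus S_U \cdot V \oplus C_i &= R(\hat{W}) \oplus Y \\
&= R(\hat{W}) \oplus R(\mathrm{CMF}(\mathrm{Id}) \otimes \mathrm{CVF}(Y)) && \text{(using (7))} \\
&= R\big(\hat{W} \oplus (\mathrm{CMF}(\mathrm{Id}) \otimes \mathrm{CVF}(Y))\big) && \text{(by linearity of } R\text{)}.
\end{aligned}
$$

Finally, we obtain the required expression (6) by using the linearity of CVF, which gives $\mathrm{CVF}(Y) = \mathrm{CVF}(S_U \cdot V) \oplus \mathrm{CVF}(C_i)$.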
Lemma 1 provides a way to transform the addition in

$$R(\mathrm{CMF}(T_H) \otimes \mathrm{CVF}(V)) \oplus S_H \cdot V \oplus C_i$$

into an addition in component formation representation. We use this property to recombine the multiply-and-add architecture.
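Main recombination of the multiply-and-add architecture. We use Lemma 1 to recombine the multiply-and-add architecture as follows:
• We keep $\hat{V}$ in component formation in a register of size $m^{\log_2(3)}$.
• At each iteration we compute $\hat{W} = \mathrm{CMF}(T_H) \otimes \mathrm{CVF}(R(\hat{V}))$, and then the new value of $\hat{V}$ is
$$\hat{V} = \hat{W} \oplus \big(\mathrm{CMF}(\mathrm{Id}) \otimes (\mathrm{CVF}(S_H \cdot V) \oplus \mathrm{CVF}(C_i))\big), \quad \text{where } V = R(\hat{V}).$$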
This recombination leads to the architecture in Fig. 4. In this architecture we have what we sought: a block R which is directly connected to the block CVF. The block R is, however, also connected to the $\delta$ blocks RVP. The recombined architecture contains some new blocks: two blocks CVF, one applied to $S_H \cdot V$ and one to $C_i$, and a block CM for the multiplication by $\mathrm{CMF}(\mathrm{Id})$.
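The recombined iteration can be compared with the iteration of Fig. 3 in a small software model. The sketch below is only illustrative: the names are hypothetical, a Toeplitz matrix is assumed to be given by its $2m-1$ diagonal values (`t[i - j + m - 1]` is the entry in row $i$, column $j$), `S` is a dense 0/1 matrix, and the register is initialised with $\mathrm{CMF}(\mathrm{Id}) \otimes \mathrm{CVF}(C_1)$ so that, by (7), its reconstruction is $C_1$.

```python
import random

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def mul(a, b):                                   # CM: component-wise AND
    return [x & y for x, y in zip(a, b)]

def cmf(t):
    if len(t) == 1:
        return t[:]
    h = (len(t) + 1) // 4
    t0, t1, t2 = t[:2 * h - 1], t[h:3 * h - 1], t[2 * h:]
    return cmf(xor(t0, t1)) + cmf(t1) + cmf(xor(t1, t2))

def cvf(v):
    if len(v) == 1:
        return v[:]
    h = len(v) // 2
    return cvf(v[h:]) + cvf(xor(v[:h], v[h:])) + cvf(v[:h])

def rec(w):
    if len(w) == 1:
        return w[:]
    k = len(w) // 3
    r0, r1, r2 = rec(w[:k]), rec(w[k:2 * k]), rec(w[2 * k:])
    return xor(r0, r1) + xor(r1, r2)

def matvec(S, v):
    return [sum(a & b for a, b in zip(row, v)) & 1 for row in S]

def step_fig3(t, S, v, c):
    """One iteration V <- T.V + S.V + C with the register holding V."""
    return xor(xor(rec(mul(cmf(t), cvf(v))), matvec(S, v)), c)

def step_recombined(t, S, v_hat, c, cmf_id):
    """Same iteration with the register holding V_hat, where V = R(V_hat)."""
    v = rec(v_hat)
    w_hat = mul(cmf(t), cvf(v))
    return xor(w_hat, mul(cmf_id, xor(cvf(matvec(S, v)), cvf(c))))

def run_demo(m=8, delta=3, n=6, seed=4):
    """Return True if both models agree after every iteration."""
    rng = random.Random(seed)
    t = [rng.randint(0, 1) for _ in range(2 * m - 1)]
    S = [[rng.randint(0, 1) if i in range(delta) else 0 for _ in range(m)]
         for i in range(m)]
    blocks = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]
    cmf_id = cmf([0] * (m - 1) + [1] + [0] * (m - 1))   # Toeplitz vector of Id
    v, v_hat = blocks[0], mul(cmf_id, cvf(blocks[0]))
    ok = rec(v_hat) == v
    for c in blocks[1:]:
        v = step_fig3(t, S, v, c)
        v_hat = step_recombined(t, S, v_hat, c, cmf_id)
        ok = ok and rec(v_hat) == v
    return ok
```

Calling `run_demo()` runs both models on the same random data and compares the reconstructed register with the classic one after each block.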
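Fig. 4: Recombination of the mult-and-add architecture. [Diagram: the register $\hat{V}$ of size $m^{\log_2(3)}$ feeds a block R. The output of R goes to a CVF block followed by CM with $\mathrm{CMF}(T_H)$, and to the $\delta$ RVP blocks with the rows $\mathrm{row}_{1,S_H}, \mathrm{row}_{2,S_H}, \ldots, \mathrm{row}_{\delta,S_H}$, whose $\delta$ output bits go through a second CVF block. $C_i$ goes through a third CVF block. The two latter CVF outputs are added, multiplied (CM) by $\mathrm{CMF}(\mathrm{Id})$, added to the output of the first CM, and the result of size $m^{\log_2(3)}$ is written back to $\hat{V}$.]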
Final recombination by block merging. In order to get the final version of the proposed architecture, we modify the architecture in Fig. 4 by merging the following blocks:
• Merging the block R and the block CVF into a CVF∘R block.
• Merging the block R with each block RVP into an RVP∘R block.
The multiply-and-add architecture with such merged blocks is shown in Fig. 5. In this architecture the block δ-CVF is a regular CVF computation whose input vector has only $\delta$ non-zero coefficients. In the next three subsections we provide explicit methods for the computation of CVF∘R, RVP∘R and δ-CVF.
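Fig. 5: Recombination of the mult-and-add architecture with merged blocks. [Diagram: the register $\hat{V}$ of size $m^{\log_2(3)}$ feeds the block CVF∘R, followed by CM with $\mathrm{CMF}(T_H)$, and the $\delta$ blocks RVP∘R with the rows $\mathrm{row}_{1,S_H}, \mathrm{row}_{2,S_H}, \ldots, \mathrm{row}_{\delta,S_H}$, whose $\delta$ output bits go through the block δ-CVF. $C_i$ goes through a CVF block. The outputs of δ-CVF and CVF are added, multiplied (CM) by $\mathrm{CMF}(\mathrm{Id})$, added to the output of the first CM, and written back to $\hat{V}$.]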
4.2. Merging the block Reconstruction and Component Vector Formation
We show here that the reconstruction block R and the block CVF in Fig. 4 can be merged into a CVF∘R block in order to halve the delay of the sequence of operations "Reconstruction followed by CVF". We denote by CVF∘R the function $\{0,1\}^{m^{\log_2(3)}} \to \{0,1\}^{m^{\log_2(3)}}$ which reconstructs an array of size $m^{\log_2(3)}$ and then applies the component vector formation. The following proposition establishes a recursive formula for this merged CVF∘R.
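Proposition 1 (Recursive formula for CVF∘R). Let $\hat{W}$ be an $m^{\log_2(3)}$-bit array where $m = 2^s$. The CVF∘R function can be computed recursively as follows:
i) If $|\hat{W}| = 1$ we have $\mathrm{CVF} \circ R(\hat{W}) = \hat{W}$.
ii) If $|\hat{W}| > 1$ we split $\hat{W} = [\hat{W}_0, \hat{W}_1, \hat{W}_2]$ and we recursively compute:

$$
\mathrm{CVF} \circ R(\hat{W}) = [\mathrm{CVF} \circ R(\hat{W}_1 \oplus \hat{W}_2),\ \mathrm{CVF} \circ R(\hat{W}_0 \oplus \hat{W}_2),\ \mathrm{CVF} \circ R(\hat{W}_0 \oplus \hat{W}_1)]. \tag{8}
$$

Proof. We prove it by induction.
• Proof of i). For $|\hat{W}| = 1$, using the definition of $R$ we have $R(\hat{W}) = \hat{W}$ and then
$$\mathrm{CVF} \circ R(\hat{W}) = \mathrm{CVF}(R(\hat{W})) = \mathrm{CVF}(\hat{W}) = \hat{W},$$
using the definition of CVF.
• Proof of ii). For $|\hat{W}| > 1$, the recursive formula for $R$ in Table 2 provides
$$R(\hat{W}) = [R(\hat{W}_0) \oplus R(\hat{W}_1),\ R(\hat{W}_1) \oplus R(\hat{W}_2)].$$
We apply the recursive definition of CVF (cf. Table 2), $\mathrm{CVF}([A, B]) = [\mathrm{CVF}(B), \mathrm{CVF}(A \oplus B), \mathrm{CVF}(A)]$, to this vector and we obtain

$$
\begin{aligned}
\mathrm{CVF} \circ R(\hat{W}) &= \mathrm{CVF}([R(\hat{W}_0) \oplus R(\hat{W}_1),\ R(\hat{W}_1) \oplus R(\hat{W}_2)]) \\
&= [\mathrm{CVF}(R(\hat{W}_1) \oplus R(\hat{W}_2)),\ \mathrm{CVF}((R(\hat{W}_0) \oplus R(\hat{W}_1)) \oplus (R(\hat{W}_1) \oplus R(\hat{W}_2))),\ \mathrm{CVF}(R(\hat{W}_0) \oplus R(\hat{W}_1))] \\
&= [\mathrm{CVF}(R(\hat{W}_1) \oplus R(\hat{W}_2)),\ \mathrm{CVF}(R(\hat{W}_0) \oplus R(\hat{W}_2)),\ \mathrm{CVF}(R(\hat{W}_0) \oplus R(\hat{W}_1))] && \text{(cancellation)} \\
&= [\mathrm{CVF}(R(\hat{W}_1 \oplus \hat{W}_2)),\ \mathrm{CVF}(R(\hat{W}_0 \oplus \hat{W}_2)),\ \mathrm{CVF}(R(\hat{W}_0 \oplus \hat{W}_1))] && \text{(by linearity of } R\text{)} \\
&= [\mathrm{CVF} \circ R(\hat{W}_1 \oplus \hat{W}_2),\ \mathrm{CVF} \circ R(\hat{W}_0 \oplus \hat{W}_2),\ \mathrm{CVF} \circ R(\hat{W}_0 \oplus \hat{W}_1)],
\end{aligned}
$$

which concludes the proof of Proposition 1.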
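We derive a circuit from the formula of Proposition 1, which is shown in Fig. 6. The following lemma establishes the complexity of this circuit.
Fig. 6: Circuit for the CVF∘R operation. [Diagram: the input $\hat{W} = [\hat{W}_0, \hat{W}_1, \hat{W}_2]$ enters a layer of XOR gates forming the three pairwise sums of $\hat{W}_0, \hat{W}_1, \hat{W}_2$; each sum feeds a recursive CVF∘R block, and the three outputs are concatenated to give $\mathrm{CVF} \circ R(\hat{W})$.]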
Lemma 2. We consider the circuit shown in Fig. 6 for CVF∘R obtained by applying (8) recursively. This circuit has the following complexity:

$$
\begin{cases}
S = m^{\log_2(3)} \log_2(m) \text{ XORs}, \\
D = \log_2(m) D_X.
\end{cases}
\tag{9}
$$
Proof. We prove Lemma 2 by induction on $t$, where $m = 2^t$. For $m = 1 = 2^0$, the complexity in (9) follows directly from i) in Proposition 1.
Now, we assume that (9) holds for the arithmetic circuit provided by (8) for $\hat{W}$ of size $\ell = m^{\log_2(3)}$ with $m = 2^t$, and we show it for $m' = 2^{t+1}$, i.e., for $\hat{W}$ of size

$$\ell' = (m')^{\log_2(3)} = (2m)^{\log_2(3)} = 3\,m^{\log_2(3)}.$$

From Fig. 6, we notice that $S(\ell') = 3S(\ell) + 3m^{\log_2(3)}$. The induction hypothesis provides $S(\ell) = \log_2(m)\, m^{\log_2(3)}$, which gives the required expression:

$$
\begin{aligned}
S(\ell') &= 3\log_2(m)\, m^{\log_2(3)} + 3m^{\log_2(3)} \\
&= (\log_2(m) + 1) \cdot 3m^{\log_2(3)} \\
&= \log_2(m')\,(m')^{\log_2(3)}.
\end{aligned}
$$

For the delay we have from Fig. 6 that $D(\ell') = D(\ell) + D_X$. By the induction hypothesis we have $D(\ell) = \log_2(m)D_X$, and then we obtain the required expression $D(\ell') = (\log_2(m) + 1)D_X = \log_2(m')D_X$.
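The merged block can be modelled in a few lines and compared with the composition of the separate blocks. The sketch below assumes the component order of Table 2 for CVF, i.e. $\mathrm{CVF}([A, B]) = [\mathrm{CVF}(B), \mathrm{CVF}(A \oplus B), \mathrm{CVF}(A)]$, so that one recursion of the merged block outputs the sums $\hat{W}_1 \oplus \hat{W}_2$, $\hat{W}_0 \oplus \hat{W}_2$, $\hat{W}_0 \oplus \hat{W}_1$ in that order. The function names and the optional gate counter are hypothetical.

```python
def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def cvf(v):
    """CVF in the order of Table 2."""
    if len(v) == 1:
        return v[:]
    h = len(v) // 2
    return cvf(v[h:]) + cvf(xor(v[:h], v[h:])) + cvf(v[:h])

def rec(w):
    """Reconstruction R of Table 2."""
    if len(w) == 1:
        return w[:]
    k = len(w) // 3
    r0, r1, r2 = rec(w[:k]), rec(w[k:2 * k]), rec(w[2 * k:])
    return xor(r0, r1) + xor(r1, r2)

def cvf_r(w, counter=None):
    """Merged CVF∘R: one XOR layer per recursion level.

    If counter is a one-element list, counter[0] accumulates the number of
    two-input XOR gates used by the corresponding circuit.
    """
    if len(w) == 1:
        return w[:]
    k = len(w) // 3
    w0, w1, w2 = w[:k], w[k:2 * k], w[2 * k:]
    if counter is not None:
        counter[0] += 3 * k                     # three k-bit XORs at this level
    return (cvf_r(xor(w1, w2), counter)
            + cvf_r(xor(w0, w2), counter)
            + cvf_r(xor(w0, w1), counter))
```

For an input of $3^t$ bits the recursion has $t$ levels, each made of a single layer of XOR gates, and the counter ends at $t \cdot 3^t$, in agreement with (9).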
4.3. Merging the Reconstruction and Row Vector Product
In this subsection we provide a method to merge the reconstruction block and the RVP block in
order to reduce the critical path delay of RVP ◦ R.
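Lemma 3. Let row be a row matrix and let $\hat{W}$ be the component formation of a vector $W$, i.e. $W = R(\hat{W})$. We can compute $\mathrm{RVP}(\mathrm{row}, R(\hat{W})) = \mathrm{row} \cdot R(\hat{W})$ with the following formula

$$\mathrm{RVP}(\mathrm{row}, R(\hat{W})) = \mathrm{Sum}(\mathrm{CVF}(\mathrm{row}) \otimes \hat{W}),$$

where Sum is the XOR of all the coefficients of an array. Here $\mathrm{CVF}(\mathrm{row})$ is formed with its components in the order $[\mathrm{CVF}(\mathrm{row}_0), \mathrm{CVF}(\mathrm{row}_0 \oplus \mathrm{row}_1), \mathrm{CVF}(\mathrm{row}_1)]$, which is the reverse of the order of Table 2; since row is a precomputed row of $S_H$, this choice has no cost.
Proof. We prove the lemma by induction on the size of $W$. For $|W| = 1$ both sides are equal to $\mathrm{row} \cdot \hat{W}$. Otherwise, we split $\hat{W}$ in three parts $\hat{W} = [\hat{W}_0, \hat{W}_1, \hat{W}_2]$. By definition of $R$ in Table 2 we have $R(\hat{W}) = [R(\hat{W}_0) \oplus R(\hat{W}_1), R(\hat{W}_1) \oplus R(\hat{W}_2)]$. We then split row in two parts $\mathrm{row} = [\mathrm{row}_0, \mathrm{row}_1]$ and we re-express the RVP $\mathrm{row} \cdot R(\hat{W})$ as follows:

$$
\begin{aligned}
\mathrm{row} \cdot R(\hat{W}) &= \mathrm{row} \cdot [R(\hat{W}_0) \oplus R(\hat{W}_1),\ R(\hat{W}_1) \oplus R(\hat{W}_2)] \\
&= \mathrm{row}_0 \cdot (R(\hat{W}_0) \oplus R(\hat{W}_1)) \oplus \mathrm{row}_1 \cdot (R(\hat{W}_1) \oplus R(\hat{W}_2)) \\
&= \underbrace{\mathrm{row}_0 \cdot R(\hat{W}_0)}_{(*)} \oplus \underbrace{(\mathrm{row}_0 \oplus \mathrm{row}_1) \cdot R(\hat{W}_1)}_{(**)} \oplus \underbrace{\mathrm{row}_1 \cdot R(\hat{W}_2)}_{(***)}.
\end{aligned}
$$

We can apply the induction hypothesis to the three RVPs $(*)$, $(**)$ and $(***)$, which leads to the following:

$$
\begin{aligned}
\mathrm{row} \cdot R(\hat{W}) &= \mathrm{Sum}(\mathrm{CVF}(\mathrm{row}_0) \otimes \hat{W}_0) \oplus \mathrm{Sum}(\mathrm{CVF}(\mathrm{row}_0 \oplus \mathrm{row}_1) \otimes \hat{W}_1) \oplus \mathrm{Sum}(\mathrm{CVF}(\mathrm{row}_1) \otimes \hat{W}_2) \\
&= \mathrm{Sum}([\mathrm{CVF}(\mathrm{row}_0), \mathrm{CVF}(\mathrm{row}_0 \oplus \mathrm{row}_1), \mathrm{CVF}(\mathrm{row}_1)] \otimes [\hat{W}_0, \hat{W}_1, \hat{W}_2]) \\
&= \mathrm{Sum}(\mathrm{CVF}(\mathrm{row}) \otimes \hat{W}) \quad \text{(by the recursion for CVF, with the component order stated in the lemma)},
\end{aligned}
$$

and this ends the proof.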
Complexity of RVP∘R. The circuit performing RVP∘R with inputs row and $\hat{V}$ consists of $m^{\log_2(3)}$ AND gates and $m^{\log_2(3)} - 1$ XOR gates organized in a binary tree performing the Sum. The delay of this computation is thus equal to

$$D_A + \lceil \log_2(m^{\log_2(3)}) \rceil D_X = D_A + \lceil \log_2(3)\log_2(m) \rceil D_X. \tag{10}$$
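Lemma 3 can be checked numerically with the following sketch. The names are hypothetical; `row_cf` forms the component array of the row in the order $[\cdot(\mathrm{row}_0), \cdot(\mathrm{row}_0 \oplus \mathrm{row}_1), \cdot(\mathrm{row}_1)]$ used in the proof, `rec` is the reconstruction R of Table 2, and `xor_tree_depth` gives the depth of a balanced binary XOR tree as in (10).

```python
def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def rec(w):
    """Reconstruction R of Table 2."""
    if len(w) == 1:
        return w[:]
    k = len(w) // 3
    r0, r1, r2 = rec(w[:k]), rec(w[k:2 * k]), rec(w[2 * k:])
    return xor(r0, r1) + xor(r1, r2)

def row_cf(row):
    """Component array of a row: [cf(row0), cf(row0 xor row1), cf(row1)]."""
    if len(row) == 1:
        return row[:]
    h = len(row) // 2
    r0, r1 = row[:h], row[h:]
    return row_cf(r0) + row_cf(xor(r0, r1)) + row_cf(r1)

def rvp(row, v):
    """Plain row-vector product over F_2."""
    acc = 0
    for a, b in zip(row, v):
        acc ^= a & b
    return acc

def rvp_r(row_hat, w_hat):
    """Merged RVP∘R: one AND layer followed by the XOR of all products."""
    return rvp(row_hat, w_hat)

def xor_tree_depth(n_inputs):
    """Depth ceil(log2(n_inputs)) of a balanced binary XOR tree."""
    return (n_inputs - 1).bit_length()
```

In the architecture, `row_cf(row)` would be precomputed once per row of $S_H$, so only `rvp_r` lies on the data path. For $m = 128$ the tree has $3^7 = 2187$ inputs and `xor_tree_depth(3 ** 7)` evaluates to 12, i.e. $\lceil 7\log_2(3) \rceil$.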
4.4. Computing δ-CVF
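This CVF takes as input a vector $V$ whose last $m - \delta$ coefficients are equal to 0, where $\delta$ is assumed to be small compared to $m$. We show here that the computation of $\mathrm{CVF}(V)$ is simplified in this case. In this subsection we list the three components of CVF in the order $[\mathrm{CVF}(V_0), \mathrm{CVF}(V_0 \oplus V_1), \mathrm{CVF}(V_1)]$, which is the reverse of the order of Table 2 and only changes the position of the components, not the gates needed to compute them. When we apply one recursion of CVF to a vector $V = [V_0, 0]$ we have

$$\mathrm{CVF}(V) = [\mathrm{CVF}(V_0), \mathrm{CVF}(V_0 \oplus 0), \mathrm{CVF}(0)] = [\mathrm{CVF}(V_0), \mathrm{CVF}(V_0), 0]. \tag{11}$$

This means that the computation of $\mathrm{CVF}(V)$ is reduced to the computation of $\mathrm{CVF}(V_0)$. We can apply this property recursively. The following lemma establishes the expression of CVF for multiple recursions.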
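Lemma 4. Let $V = [v_0, \ldots, v_{2^s-1}, 0, \ldots, 0]$ be a vector of size $m = 2^t$ with $t = s + u$ and let $\tilde{V} = [v_0, \ldots, v_{2^s-1}]$. Then we have $\mathrm{CVF}(V) = [\hat{V}_0, \hat{V}_1, \ldots, \hat{V}_{3^u-1}]$ such that $|\hat{V}_i| = 3^{t-u}$ and

$$
\begin{cases}
\hat{V}_i = \mathrm{CVF}(\tilde{V}) & \text{if } i = \sum_{j=0}^{u-1} i_j 3^j \text{ with all } i_j \in \{0, 1\}, \\
\hat{V}_i = 0 & \text{if } i = \sum_{j=0}^{u-1} i_j 3^j \text{ and there exists one } i_j = 2.
\end{cases}
\tag{12}
$$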
Proof. We prove the lemma by induction on $u$. For $u = 1$ this is direct from (11). Then, we assume the lemma is true for $u$ and we prove it for $u + 1$. We consider a vector $V$ of size $2^{s+u+1}$. We split $V$ as $V = [V_0, 0]$, which implies that $V_0$ is of size $2^{s+u}$, and we apply one recursion of CVF to get

$$\mathrm{CVF}(V) = [\mathrm{CVF}(V_0), \mathrm{CVF}(V_0), 0].$$

We then use the induction hypothesis for $\mathrm{CVF}(V_0)$, which yields:

$$
\mathrm{CVF}(V) = [\underbrace{(\widehat{V_0})_0, (\widehat{V_0})_1, \ldots, (\widehat{V_0})_{3^u-1}}_{\mathrm{CVF}(V_0)},\ \underbrace{(\widehat{V_0})_0, (\widehat{V_0})_1, \ldots, (\widehat{V_0})_{3^u-1}}_{\mathrm{CVF}(V_0)},\ 0, \ldots, 0]. \tag{13}
$$

By the induction hypothesis, $(\widehat{V_0})_i = \mathrm{CVF}(\tilde{V})$ if $i = \sum_{j=0}^{u-1} i_j 3^j$ does not contain any digit equal to 2, and otherwise it is 0. Let us relabel the terms in (13) as $\mathrm{CVF}(V) = [\hat{V}_0, \hat{V}_1, \ldots, \hat{V}_{3^{u+1}-1}]$. In this case we have

$$
\begin{cases}
\hat{V}_i = (\widehat{V_0})_i & \text{for } i = 0, \ldots, 3^u - 1, \\
\hat{V}_i = (\widehat{V_0})_{i - 3^u} & \text{for } i = 3^u, \ldots, 2 \cdot 3^u - 1, \\
\hat{V}_i = 0 & \text{for } i = 2 \cdot 3^u, \ldots, 3^{u+1} - 1.
\end{cases}
\tag{14}
$$

We can easily check that:
• For $i \leq 3^u - 1$, $\hat{V}_i = \mathrm{CVF}(\tilde{V})$ only if $i = \sum_{j=0}^{u} i_j 3^j$ (with $i_u = 0$) does not contain any digit equal to 2, since this is already the case for $(\widehat{V_0})_i$.
• For $i = 3^u, \ldots, 2 \cdot 3^u - 1$, we can write $\hat{V}_i = (\widehat{V_0})_{i'}$ with $i' = i - 3^u$. This implies that $\hat{V}_i$ is equal to $\mathrm{CVF}(\tilde{V})$ when $i' = \sum_{j=0}^{u-1} i'_j 3^j$ does not contain a 2, and consequently when $i = \sum_{j=0}^{u-1} i'_j 3^j + 3^u$ does not contain a 2.
• For $i \geq 2 \cdot 3^u$, we have $\hat{V}_i = 0$, and since $i = \sum_{j=0}^{u-1} i_j 3^j + 2 \cdot 3^u$ it contains a 2, as required.
This ends the proof.
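Lemma 4 shows that the computation of δ-CVF($V$) is reduced to the computation of $\mathrm{CVF}(\tilde{V})$ where

$$\tilde{V} = [v_0, \ldots, v_{\delta-1}, \underbrace{0, \ldots, 0}_{2^{\lceil \log_2(\delta) \rceil} - \delta \text{ zeros}}].$$

The other coefficients of $\mathrm{CVF}(V)$ are either 0 or a copy of a coefficient of $\mathrm{CVF}(\tilde{V})$. This leads to the following complexity of the computation of δ-CVF($V$), which is the complexity of a CVF applied to a vector of size $2^{\lceil \log_2(\delta) \rceil}$:

$$
\begin{cases}
S_{\delta\text{-CVF}} = 3^{\lceil \log_2(\delta) \rceil} - 2^{\lceil \log_2(\delta) \rceil} \text{ XORs}, \\
D_{\delta\text{-CVF}} = \lceil \log_2(\delta) \rceil D_X.
\end{cases}
$$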
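Lemma 4 can be illustrated with the following sketch, which builds the CVF of a zero-padded vector from the CVF of its short non-zero part only. It uses the component order of (11), $[\mathrm{CVF}(V_0), \mathrm{CVF}(V_0 \oplus V_1), \mathrm{CVF}(V_1)]$; the function names are hypothetical and the length of the short vector is assumed to be a power of 2.

```python
def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def cvf11(v):
    """CVF with the component order of (11)."""
    if len(v) == 1:
        return v[:]
    h = len(v) // 2
    v0, v1 = v[:h], v[h:]
    return cvf11(v0) + cvf11(xor(v0, v1)) + cvf11(v1)

def delta_cvf(v_short, m):
    """CVF of [v_short, 0, ..., 0] of size m, using only CVF(v_short).

    len(v_short) = 2^s and m = 2^(s+u); block i of the output is a copy of
    CVF(v_short) when the u base-3 digits of i are all 0 or 1, and 0 otherwise.
    """
    small = cvf11(v_short)                     # the only part that needs XOR gates
    u = (m // len(v_short)).bit_length() - 1
    out = []
    for i in range(3 ** u):
        digits, n = [], i
        for _ in range(u):
            digits.append(n % 3)
            n //= 3
        out += small if 2 not in digits else [0] * len(small)
    return out
```

For example, with $\delta = 7$ the short vector is padded to length $2^3 = 8$, so the block costs $3^3 - 2^3 = 19$ XOR gates and has a delay of $3D_X$, independently of $m$; `delta_cvf(v_short, 128)` then only copies these 27 bits into the positions given by (12).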
4.5. Overall complexity
We evaluate the complexity of the proposed architecture (Fig. 5), with the designs of the blocks
CV F ◦ R and RV P ◦ R and δ−CVF presented in Subsection 4.2, 4.3 and 4.4, respectively. We
obtain the overall space complexity of the proposed multiply-and-add architecture by adding the
space complexity of each block which appears in Fig. 5. We obtain the following
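$$
\begin{aligned}
2 \times S_{\mathrm{CVF}} &= 2m^{\log_2(3)} - 2m \text{ XORs} \\
1 \times S_{\delta\text{-CVF}} &= 3^{\lceil \log_2(\delta) \rceil} - 2^{\lceil \log_2(\delta) \rceil} \text{ XORs} \\
2 \times S_{\mathrm{CM}} &= 2m^{\log_2(3)} \text{ ANDs} \\
2 \times S_{\mathrm{CA}} &= 2m^{\log_2(3)} \text{ XORs} \\
1 \times S_{\mathrm{CVF} \circ R} &= m^{\log_2(3)} \log_2(m) \text{ XORs} \\
\delta \times S_{\mathrm{RVP} \circ R} &= \delta m^{\log_2(3)} - \delta \text{ XORs and } \delta m^{\log_2(3)} \text{ ANDs} \\
\text{Total} &= (\log_2(m) + \delta + 4)\, m^{\log_2(3)} - \delta - 2m + 3^{\lceil \log_2(\delta) \rceil} - 2^{\lceil \log_2(\delta) \rceil} \text{ XORs} \\
&\quad \text{and } (2 + \delta)\, m^{\log_2(3)} \text{ ANDs}
\end{aligned}
$$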
Now, we evaluate the time complexity of the architecture in Fig. 5. There are two main paths:
• The critical path is the thick line starting and finishing at the register containing $\hat{V}$ in Fig. 5. On this path we have the following blocks: RVP∘R, δ-CVF, CA, CM and CA. Adding the delay of each block, we obtain the overall delay of this path:
$$(\log_2(3)\log_2(m) + \lceil \log_2(\delta) \rceil + 2)D_X + 2D_A.$$
• The path going through CVF∘R, CM and CA has a smaller delay, which is equal to
$$(\log_2(m) + 1)D_X + D_A.$$
We can conclude that the delay of the proposed multiply-and-add architecture is as follows:
$$D = (\log_2(3)\log_2(m) + \lceil \log_2(\delta) \rceil + 2)D_X + 2D_A.$$
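To get a feeling for these expressions, the two delay formulas can be evaluated numerically. The helper names below are hypothetical; each function returns the pair (coefficient of $D_X$, coefficient of $D_A$) taken from the formula for the architecture of Fig. 3 and from the delay $D$ above, without rounding the term $\log_2(3)\log_2(m)$.

```python
import math

def delay_fig3(m):
    """(XOR levels, AND levels) of the architecture of Fig. 3: (2 log2(m) + 1, 1)."""
    return 2 * math.log2(m) + 1, 1

def delay_proposed(m, delta):
    """(XOR levels, AND levels) of the critical path of Fig. 5."""
    xor_levels = math.log2(3) * math.log2(m) + math.ceil(math.log2(delta)) + 2
    return xor_levels, 2

def report(delta, exponents):
    """List (m, Fig. 3 XOR levels, proposed XOR levels) for m = 2^t."""
    rows = []
    for t in exponents:
        m = 2 ** t
        rows.append((m, delay_fig3(m)[0], round(delay_proposed(m, delta)[0], 2)))
    return rows
```

With $\delta = 7$, the formulas give about 16.1 XOR levels for the proposed architecture against 15 for Fig. 3 at $m = 2^7$, and about 30.4 against 33 at $m = 2^{16}$. This illustrates that the gain depends on $\lceil \log_2(\delta) \rceil$ being small relative to $\log_2(m)$, and that the proposed path also contains one additional AND level.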
5. Comparison and conclusion
We considered in this paper the implementation of the GHASH function used in the Galois counter mode for the generation of the authentication tag. We presented a recombination of the multiply-and-add architecture of P. Patel [16], which is based on a subquadratic space complexity TMVP multiplier. This recombination was meant to reduce the overall critical path delay of the architecture. This delay reduction was obtained by merging some blocks of the recombined architecture.
In Table 3 we provide the complexity of the proposed approach and also the complexities of the usual multiply-and-add architecture based on the best approaches for the design of the multiplier.
The proposed approach is a subquadratic space complexity architecture whose delay is smaller than $2\log_2(m)D_X + D_A$ when $\delta$ is sufficiently small. This improvement was obtained at the cost of an increase of the space complexity. We notice that it is the first approach with subquadratic space complexity which has a delay smaller than $(2\log_2(m) + O(1))D_X + D_A$. Indeed, Table 3 shows that the only approach of the literature which has a delay smaller than $2\log_2(m)D_X + D_A$ is the one with a quadratic space complexity. This proposal remains a bit theoretical, but it is a first step towards obtaining a subquadratic space complexity multiply-and-add architecture, or a multiplier in $\mathbb{F}_{2^m}$, with an optimal delay of $(\log_2(m) + O(1))D_X + O(1)D_A$.

Table 3: Complexities of multiply-and-add architectures
Approach | # XOR | # AND | Delay
Mastrovito [6] | $m^2 + O(m)$ | $m^2 + O(m)$ | $(\log_2(m) + 1)D_X + D_A$
Karatsuba [19] | $6m^{\log_2(3)} + O(m)$ | $m^{\log_2(3)}$ | $(2\log_2(m) + O(1))D_X + D_A$
Karatsuba [12] | $5.25m^{\log_2(3)} + O(m)$ | $m^{\log_2(3)}$ | $(2\log_2(m) + O(1))D_X + D_A$
TMVP [16] | $5.5m^{\log_2(3)} + O(m)$ | $m^{\log_2(3)} + \delta m$ | $(2\log_2(m) + 1)D_X + D_A$
TMVP [7] | $5.5m^{\log_2(3)} + O(m)$ | $m^{\log_2(3)}$ | $(2\log_2(m) + O(1))D_X + D_A$
Proposed* | $(\log_2(m) + \delta + 4)\,m^{\log_2(3)} - \delta + 3^{\lceil \log_2(\delta) \rceil} - 2^{\lceil \log_2(\delta) \rceil} - 2m$ | $(2 + \delta)\,m^{\log_2(3)}$ | $(1.59\log_2(m) + \lceil \log_2(\delta) \rceil + 2)D_X + 2D_A$
* For multiplication modulo $x^m + e(x)$ and $\delta = \deg e(x) + 1$
6. References
[1] D. Canright. A very compact S-box for AES. In CHES 2005, volume 3659 of LNCS, pages 441–455. Springer, 2005.
[2] Ç. K. Koç and B. Sunar. Low-Complexity Bit-Parallel Canonical and Normal Basis Multipliers for a Class of Finite Fields. IEEE Trans. on Comput., 47:353–356, March 1998.
[3] J. Daemen and V. Rijmen. The Design of Rijndael: AES - The Advanced Encryption Standard. Information Security and Cryptography. Springer, 2002.
[4] H. Fan and M. A. Hasan. A New Approach to Sub-quadratic Space Complexity Parallel Multipliers for Extended Binary Fields. IEEE Trans. Computers, 56(2):224–233, 2007.
[5] H. Fan and M. A. Hasan. Subquadratic Computational Complexity Schemes for Extended Binary Field Multiplication Using Optimal Normal Bases. IEEE Trans. Computers, 56(10):1435–1437, 2007.
[6] A. Halbutogullari and Ç. K. Koç. Mastrovito multiplier for general irreducible polynomials. IEEE Trans. on Comp., 49(5):503–518, May 2000.
[7] J. Han and H. Fan. GF(2^n) Shifted Polynomial Basis Multipliers Based on Subquadratic Toeplitz Matrix-Vector Product Approach for All Irreducible Pentanomials. IEEE Trans. Computers, 64(3):862–867, 2015.
[8] M. A. Hasan, N. Meloni, A. H. Namin, and C. Negre. Block Recombination Approach for Subquadratic Space Complexity Binary Field Multiplication Based on Toeplitz Matrix-Vector Product. IEEE Trans. Computers, 61(2):151–163, 2012.
[9] M. M. Kermani and A. Reyhani-Masoleh. Efficient and High-Performance Parallel Hardware Architectures for the AES-GCM. IEEE Trans. Computers, 61(8):1165–1178, 2012.
[10] E. D. Mastrovito. VLSI Architectures for Computation in Galois Fields. PhD thesis, Dept. of Electrical Eng., Linköping Univ., Sweden, 1991.
[11] D. A. McGrew and J. Viega. The Security and Performance of the Galois/Counter Mode (GCM) of Operation. In INDOCRYPT, volume 3348 of LNCS, pages 343–355, 2004.
[12] C. Negre. Efficient binary polynomial multiplication based on optimized Karatsuba reconstruction. J. Cryptographic Engineering, 4(2):91–106, 2014.
[13] NIST. Advanced Encryption Standard (AES), November 2001.
[14] NIST. Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC, November 2007.
[15] C. Paar. A New Architecture for a Parallel Finite Field Multiplier with Low Complexity Based on Composite Fields. IEEE Trans. Comput., 45(7):856–861, 1996.
[16] P. Patel. Parallel multiplier designs for the Galois/counter mode of operation. Master's thesis, Electrical and Computer Engineering, University of Waterloo, 2008.
[17] A. Satoh, S. Morioka, K. Takano, and S. Munetoh. A Compact Rijndael Hardware Architecture with S-Box Optimization. In ASIACRYPT 2001, volume 2248 of LNCS, pages 239–254. Springer, 2001.
[18] A. Satoh, T. Sugawara, and T. Aoki. High-performance hardware architectures for Galois Counter Mode. IEEE Transactions on Computers, 58(7):917–930, 2009.
[19] J. Sun, M. Gu, K.-Y. Lam, and H. Fan. Overlap-free Karatsuba-Ofman Polynomial Multiplication Algorithm. IET Information Security, 4:8–14, March 2010.
[20] B. Sunar. A Generalized Method for Constructing Subquadratic Complexity GF(2^k) Multipliers. IEEE Trans. on Comp., 53:1097–1105, 2004.
[21] B. Sunar and Ç. K. Koç. An Efficient Optimal Normal Basis Type II Multiplier. IEEE Trans. on Comp., 50(1):83–87, 2001.
[22] S. Winograd. Arithmetic Complexity of Computations. Society For Industrial & Applied Mathematics, U.S., 1980.
