On Software Implementation of High Performance GHASH Algorithms by Umair, Iqbal Muhammad





presented to the University of Waterloo
in fulfillment of the
thesis requirement for the degree of
Master of Applied Science
in
Electrical and Computer Engineering
Waterloo, Ontario, Canada, 2012
© Iqbal Muhammad Umair 2012
I hereby declare that I am the sole author of this thesis. This is a true copy of the
thesis, including any required final revisions, as accepted by my examiners.
I understand that my thesis may be made electronically available to the public.
Abstract
Several modes of operation are available for symmetric key block ciphers,
among which the Galois Counter Mode (GCM) of operation is a standard. GCM
provides confidentiality with the help of a symmetric key block cipher
operating in counter mode. The authentication component of GCM consists of
the Galois hash (GHASH) computation, which is a keyed hash function. The most
important component of the GHASH computation is carry-less multiplication of
128-bit operands, which is followed by a modular reduction. A number of
schemes have been proposed for efficient software implementation of carry-less
multiplication to improve the performance of GHASH by increasing the speed of
multiplications. This thesis focuses on providing an efficient software
implementation of the high performance GHASH function proposed by Meloni et
al., and also on the implementation of GHASH using a carry-less multiplication
instruction provided by Intel on their Westmere architecture.
The thesis work includes implementation of the high performance GHASH and
its comparison to the older or standard implementation of the GHASH function.
It also includes comparison of the two implementations using Intel's carry-less
multiplication instruction. This is the first time that this kind of comparison
is being done on software implementations of these algorithms. Our software
implementations suggest that the new GHASH algorithm, which was originally
proposed for hardware implementations due to the required parallelization,
cannot take advantage of the Intel carry-less multiplication instruction
PCLMULQDQ. On the other hand, when implementations are done without using the
PCLMULQDQ instruction, the new algorithm performs better, even if its inherent
parallelization is not utilized. This suggests that the new algorithm will
perform better on embedded systems that do not support PCLMULQDQ.
Acknowledgements
I am grateful to all who have guided me and helped me to reach this milestone, but
there are a few whom I would like to mention specially here. First of all, I would
like to thank my graduate supervisor, Professor Anwar Hasan, who has been a great
support and guide throughout my program. I would also like to thank Professor
Catherine Gebotys and Professor Mark Aagaard for reviewing my work. I would
also like to take this opportunity to thank my family, who have always been there
for me and encouraged me at every step. In the end I would also like to thank all the




List of Tables
List of Figures
List of Abbreviations
1 Introduction
1.1 Brief Overview of Previous Work on GHASH
1.2 Contributions
1.3 Organization
2 Background
2.1 Galois Fields
2.1.1 Polynomial Representation of Galois Field
2.1.2 Galois Field Arithmetic
2.2 Hash Functions in Cryptography
2.2.1 Security Properties of Hash Functions
2.2.2 Types of Hash Algorithms
2.3 Galois Counter Mode (GCM)
2.3.1 GCM Operation
3 GHASH Algorithms and Implementation Issues
3.1 Standard GHASH Computation
3.1.1 GHASH Description
3.1.2 Parallel Architecture For GHASH
3.2 Characteristic Polynomial Based GHASH
3.3 Implementation Issues in Software
3.3.1 Carry-less Multiplication
3.3.2 Efficient Carry-less Multiplication for Large Operands
3.3.3 Basics of Karatsuba Algorithm
3.3.4 Karatsuba Algorithm for GHASH
4 Software Implementation of GHASH Algorithms
4.1 GHASH Building Blocks
4.1.1 Implementation of Standard GF(2^m) Multiplication
4.1.2 Gordon's Algorithm
4.1.3 Intel's PCLMULQDQ instruction
4.1.4 Intel's Karatsuba Implementation using PCLMULQDQ
4.1.5 Efficient Reduction Modulo Implementation
4.2 GHASH Implementation Results
4.2.1 Implementation using Common Place Instructions
4.2.2 Implementation using PCLMULQDQ Instruction
5 Concluding Remarks
5.1 Summary
5.2 Future Work
A Software Implementation of Algorithms
A.1 Standard GHASH without PCLMULQDQ
A.2 High Performance GHASH without PCLMULQDQ
A.3 Standard GHASH with PCLMULQDQ
A.4 High Performance GHASH with PCLMULQDQ




4.1 Selection of quadwords
4.2 Computation Time of Implementations with Customary Instructions
4.3 Computation Time of Implementations with PCLMULQDQ
List of Figures
2.1 Simplified Davies-Meyer Hash function
2.2 Simple Block Cipher in ECB Mode
2.3 Encryption and Decryption in counter mode of operation
2.4 Encryption and Decryption in GCM
2.5 Authentication in GCM
3.1 GHASH Computation
3.2 Feedback architecture for GHASH
3.3 Parallel GHASH Computation
3.4 Polynomial Reduction Unit
4.1 Performance Comparison of Implementations with Customary Instructions
4.2 Performance Comparison of Implementations with PCLMULQDQ
List of Abbreviations
AD additional Authentication Data
AES Advanced Encryption Standard
C Ciphertext
FPGA Field Programmable Gate Array
GCM Galois Counter Mode
GF Galois Field
GHASH Galois Hash
GMAC Galois Message Authentication Code
IV Initialization Vector
K random Key
LSB Least Significant Bit
MAC Message Authentication Code
MD4/5 Message Digest
MNH Meloni, Negre and Hasan
MSB Most Significant Bit
NIST National Institute of Standards and Technology
OML Ordinary Multiplication
P Plaintext
PRU Polynomial Reduction Unit
SHA Secure Hash Algorithm
T authentication Tag




The three main goals of information systems security, namely confidentiality,
integrity, and availability, have always been a point of interest to the
cryptographic research world. Apart from being highly secure, an important
required feature for some practical information systems is to perform
cryptographic operations at high speed. Block ciphers have proven themselves to
be useful for this purpose. Among many block ciphers, the Advanced Encryption
Standard (AES) is one of the most widely used symmetric key block ciphers [3].
The National Institute of Standards and Technology (NIST) standardized the
operation of symmetric key block ciphers in the Galois Counter Mode (GCM) due
to its suitability for efficient implementation in hardware as well as
software. A lot of research work has been done on proposing efficient ways of
implementing GCM, and on its usage in different types of networks and
applications. GCM provides data authentication/integrity by using the Galois
hash (GHASH), and supports data confidentiality via encryption/decryption
operations of AES. Below we give a brief overview of previous research work
related to the implementations of GHASH.
1.1 Brief Overview of Previous Work on GHASH
Over the past years, various schemes have been proposed for improving the data
authentication component of AES-GCM based systems, i.e., GHASH. A GCM variation
is proposed in [9], where the authors address the slowness of the GHASH
computation and the memory requirements of the pre-computed GHASH. An efficient
GHASH implementation on FPGA is proposed in [25], which uses a parallel
architecture for polynomial multiplication in Galois fields [26]. Another
efficient implementation is presented in [5], again using the multiplier from
[26], but this time combined with pipelining for higher throughput. A hardware
implementation on a per key basis is proposed in [7], where GHASH is
implemented using Verilog, resulting in improved throughput. In [18], an
efficient implementation of GHASH using Intel's PCLMULQDQ instruction [17] is
proposed. The work in [18] attempts to optimize the assembler implementation of
the GHASH algorithm, and performs better than the standard implementation.
Finally, Intel itself has proposed an optimized implementation of GHASH in GCM,
using their own PCLMULQDQ instruction [13].
Although several implementations have been proposed for GHASH, they all have
one thing in common: they all try to implement or improve the standard GHASH
algorithm. The algorithm becomes slower as the number of blocks being processed
increases, since the number of 128-bit Galois field multiplications in the
standard GHASH algorithm is almost as many as the number of blocks. Although
schemes have been proposed that utilize parallel hardware to overcome some of
these problems, it is hard to mimic that level of parallelism in a software
implementation.
Recently, a new GHASH algorithm has been proposed by Meloni, Negre and
Hasan [29]. We will refer to this new algorithm as MNH GHASH. This algorithm
replaces all extension field multiplications in excess of 127 by an equal
number of polynomial reduction operations. It has been primarily designed for
dedicated hardware implementations to take advantage of its inherent
parallelism. To the best of our knowledge, no work has been reported yet
investigating the performance of the algorithm when implemented in software on
a general purpose processor.
1.2 Contributions
The work presented in this thesis concerns faster computation and timing
analysis of different implementations of GHASH. The main contributions are as
follows:
• Using commonplace instructions, perform software design and implementation
of the MNH GHASH algorithm recently proposed by Meloni et al. [29].
• Implement the above mentioned GHASH algorithm [29] using Intel's new
64-bit carry-less multiplication PCLMULQDQ instruction [17].
• Compare performance of the software implementations of the new and the
standard GHASH with and without Intel's carry-less multiplication instruction.
There have been other implementations of GHASH presented in the past, e.g.,
[13, 5, 18]. These mainly aim to improve the implementation of the standard
GHASH algorithm, and most of them are either FPGA or ASIC implementations.
On the other hand, in this work we deal with the new MNH GHASH algorithm
[29], and present software implementations and a comparison. Unlike previous
research papers on this topic, our work does not compare two implementations of
the standard GHASH algorithm, but rather two different algorithms.
1.3 Organization
There are four more chapters in this thesis, starting with Chapter 2, in which
we give related background on Galois field (GF) arithmetic, GCM and
cryptographic hash functions. In Chapter 3, we discuss algorithms for computing
GHASH in GCM, and explain how the GHASH algorithm can be improved using minimal
polynomials, as presented by Meloni et al. [29]. In Chapter 4, the software
implementations, their analysis and the results obtained from them are
discussed. In Chapter 5 some concluding remarks are made on the analysis and
results. This chapter also includes




In order to better understand GHASH, which is used in GCM, we first need to
understand how the GCM mode works for a block cipher. Furthermore, to understand
GHASH we need to deal with Galois field operations, and to understand GCM we
need to know a bit about cryptographic hash functions and how they are computed.
This chapter starts with a brief introduction to Galois fields, some of their
basic operations and a bit about characteristic polynomials. Then we discuss
cryptographic hash functions. The chapter ends with some discussion of the GCM
operation.
2.1 Galois Fields
Galois fields are widely used in coding theory and in the field of information
security. A Galois field has a finite number of elements, with which one can
perform addition, subtraction, multiplication and division (by any non-zero
element). The Galois field of q elements is denoted as GF(q). The value of q
must be a prime or a prime power. If q is prime (respectively, a prime power),
the field GF(q) is referred to as a prime (respectively, extension) field. For
example, a prime Galois field is GF(2), which can be extended to the field
GF(2^m), where m can be any integer greater than 1 [31].
2.1.1 Polynomial Representation of Galois Field
Although several representations for finite fields have been proposed, the one
using polynomial basis has been the most useful, especially when it comes to
large fields. In order to give a more general representation of a Galois field
in polynomial basis, assume a Galois field GF(p^n), where p is prime. Let us
assume that F(x) is an irreducible polynomial of degree n whose coefficients
belong to GF(p). An irreducible polynomial does not have any polynomial as its
factor which has a degree greater than 0 and smaller than n. Since F(x) is a
polynomial of degree n, it is often convenient to write it as follows [15]:
$$F(x) = x^n + \sum_{j=0}^{n-1} f_j x^j, \quad f_j \in GF(p)$$
Now, if we assume that a root of F(x) is β, then any element B in field GF(p^n)
can be written as
$$B = b_{n-1}\beta^{n-1} + b_{n-2}\beta^{n-2} + b_{n-3}\beta^{n-3} + \cdots + b_1\beta + b_0$$
where b_j ∈ GF(p), and the polynomial basis of GF(p^n) over GF(p) is formed using
{1, β, β^2, β^3, ..., β^{n-2}, β^{n-1}}.
2.1.2 Galois Field Arithmetic
In order to proceed further with our discussion, it is important to give a brief
introduction to some basic Galois field operations. In the next few paragraphs,
we will look into Galois field addition, multiplication and the concept of the
minimal polynomial (the importance of which we will see in the next chapter).
Addition Operation in Galois Field
Addition in a Galois field is a very simple operation. For example, if we have
Galois field elements C(x) and D(x), in polynomial basis form for field GF(p^n),
then their sum is obtained by modulo p addition of the corresponding
coefficients of C(x) and D(x). A more concrete example is the binary field
GF(2^n). Let C(x) be x^2 + x + 1, and D(x) be x + 1. Then their sum S(x) is
$$S(x) \equiv C(x) + D(x) \equiv ((x^2 + x + 1) + (x + 1)) \bmod 2$$
$$S(x) \equiv (x^2 + (x + x) + (1 + 1)) \bmod 2 \equiv x^2$$
Another way of looking at addition in a binary field from the implementation
perspective is to observe that if the elements are stored in bit form, then
addition is nothing but XORing of corresponding bits.
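The XOR view of binary field addition can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that an element is stored as an integer whose bit i holds the coefficient of x^i; the function name gf2_add is hypothetical and is not taken from the thesis code.

```python
def gf2_add(c: int, d: int) -> int:
    """Add two GF(2^n) elements in polynomial basis.

    Coefficient-wise addition modulo 2 is exactly a bitwise XOR.
    """
    return c ^ d


C = 0b111  # x^2 + x + 1
D = 0b011  # x + 1
S = gf2_add(C, D)
print(bin(S))  # 0b100, i.e. x^2
```

Since every element is its own additive inverse in a binary field, the same function also performs subtraction.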
Multiplication Operation in Galois Field
Multiplication in a Galois field is a slightly more complicated operation than
addition. To multiply two elements of GF(p^n), the polynomials corresponding
to the field elements are first multiplied, and the product then goes through a
modular reduction using polynomial F(x), which as mentioned earlier is an
irreducible polynomial of degree n. To illustrate with a small example, let us
assume that we have two elements C(x) and D(x) of Galois field GF(2^3). Let C(x)
be x^2 and D(x) be x, and let the field defining irreducible polynomial F(x) be
x^3 + x + 1. Then we can multiply C(x) and D(x) as follows,
$$M(x) \equiv C(x) \cdot D(x) \bmod F(x),$$
$$M(x) \equiv (x^2) \cdot (x) \bmod (x^3 + x + 1),$$
$$M(x) \equiv x^3 \bmod (x^3 + x + 1) \equiv x + 1.$$
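The two steps of the example above, polynomial multiplication followed by reduction modulo F(x), can be sketched in Python as follows. This is an illustrative sketch only: it assumes elements are integers whose bit i is the coefficient of x^i, and the helper names clmul, poly_mod and gf_mul are hypothetical rather than names used in the thesis implementation.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (polynomial) product of a and b over GF(2)."""
    result = 0
    while b:
        if b & 1:          # current coefficient of b is 1: add shifted a
            result ^= a
        a <<= 1            # multiply a by x
        b >>= 1
    return result


def poly_mod(a: int, f: int) -> int:
    """Remainder of polynomial a modulo polynomial f over GF(2)."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        # Cancel the leading term of a with a shifted copy of f.
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def gf_mul(a: int, b: int, f: int) -> int:
    """Multiply two field elements modulo the field polynomial f."""
    return poly_mod(clmul(a, b), f)


F = 0b1011  # x^3 + x + 1
C = 0b100   # x^2
D = 0b010   # x
M = gf_mul(C, D, F)
print(bin(M))  # 0b11, i.e. x + 1
```

Separating the carry-less product from the reduction mirrors the structure discussed later for GHASH, where the 128-bit product and the modular reduction are treated as distinct steps.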
Minimal Polynomial in Galois Field
An important concept related to Galois fields, which is worth mentioning here,
is the minimal polynomial. The minimal polynomial of an element α of field
GF(p^n) is the monic polynomial M(x) of least degree with coefficients in GF(p)
such that M(α) = 0 [31]. For example, if we have element 0 in Galois field
GF(2^m), then its minimal polynomial is x, and similarly for element 1 the
minimal polynomial is x + 1.
Now let us look at a more elaborate example. Let us consider field GF(2^4) with
field defining polynomial F(x) = x^4 + x + 1. Let α be a root of F(x). Then
for field element α^2 + α, we have minimal polynomial x^2 + x + 1, which can be
verified as follows,
$$M(x) \equiv x^2 + x + 1,$$
$$M(\alpha^2 + \alpha) \equiv (\alpha^2 + \alpha)^2 + \alpha^2 + \alpha + 1,$$
$$M(\alpha^2 + \alpha) \equiv \alpha^4 + \alpha^2 + \alpha^2 + \alpha + 1,$$
$$M(\alpha^2 + \alpha) \equiv \alpha^4 + \alpha + 1,$$
where the cross term 2α^3 of the square vanishes in characteristic 2. Since α is
a root of F(x), F(α) = α^4 + α + 1 = 0, resulting in M(α^2 + α) = 0. Because
α^2 + α is neither 0 nor 1, no polynomial of degree 1 over GF(2) has it as a
root, so M(x) is the minimal polynomial of α^2 + α.
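The verification above can also be carried out numerically. The following Python sketch assumes the same field GF(2^4) with F(x) = x^4 + x + 1 and stores α^2 + α as the integer 0b0110 (bit i is the coefficient of α^i); the names gf16_mul and eval_m are illustrative and not part of the thesis.

```python
F = 0b10011  # x^4 + x + 1


def gf16_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^4) defined by F(x) = x^4 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by x
        if a & 0b10000:      # degree reached 4: reduce by F(x)
            a ^= F
    return result


def eval_m(e: int) -> int:
    """Evaluate M(x) = x^2 + x + 1 at the field element e."""
    return gf16_mul(e, e) ^ e ^ 1


element = 0b0110  # alpha^2 + alpha
print(eval_m(element))  # 0, so M(alpha^2 + alpha) = 0

# Exhaustive search over all 16 elements for the roots of M(x).
roots = [e for e in range(16) if eval_m(e) == 0]
print(roots)  # [6, 7]
```

The search finds exactly two roots, as expected for a polynomial of degree 2: α^2 + α and its conjugate α^2 + α + 1.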
2.2 Hash Functions in Cryptography
A cryptographic hash function can be considered to be an algorithm which takes
blocks of data and converts them to strings often referred to as tags. A tag can
be viewed as a fingerprint or compact representation of the message and is, for
practical purposes, unique to it. In a normal hash function, there is no concept
of a key; but we will briefly look into keyed hash functions or message
authentication codes (MACs) as well, since the hash function of our interest,
GHASH, which is used in GCM, uses keys. In short, a hash function provides an
easy and efficient way of representing a message of arbitrary length by a tag
that is a finite bit string, which helps in signing messages and resolves the
issue of high computation and message overhead costs involved when computing
digital signatures without hash functions [34].
Hash functions are generally expected to be easy to compute, and even if one
bit changes in the message, the newly generated hash value should not be the
same. That is, they need to be highly sensitive to any change, and satisfy
other properties. Below we discuss a couple of important security properties of
hash functions.
2.2.1 Security Properties of Hash Functions
A cryptographic hash function should be one-way and collision resistant.
One-wayness, which is also known as preimage resistance, guarantees that it is
computationally infeasible to recover the message from a hash tag.
Collision resistance is one of the most important properties or requirements for
hash functions. It implies that it is computationally infeasible to find two
input messages which produce the same hash tag. A hash function's collision
resistance can be either weak or strong. In the weak case, one message is
already given and the attacker tries to find a second message which produces
the same hash tag. In the strong case, the attacker has the opportunity to
select any two messages and see if it is possible to get the same hash tag from
both.
2.2.2 Types of Hash Algorithms
Hash algorithms can be divided into two major categories: dedicated hash
functions and block cipher based hash functions [34]. It is also worth
mentioning that hash functions can be keyed or un-keyed. Our hash function of
interest, i.e., GHASH, is a keyed hash function. A keyed hash function uses both
the message and the key for computing a hash tag and is generally used in a
Message Authentication Code (MAC). Un-keyed hash functions, on the other hand,
are used mostly in error detection codes, and their computation does not
require any key.
Dedicated Hash Functions
As the name suggests, dedicated hash functions are specifically designed for
computing hashes and do not usually rely upon complex computations like discrete
logarithm or integer factorization. Examples of dedicated hash functions are MD4,
MD5 and various SHAs.
Figure 2.1: Simplified Davies-Meyer Hash function
Block Cipher based Hash Functions
In terms of computation speed, block cipher based hash functions are a bit
slower than the dedicated ones, but they give the added advantage of reusing
the same block cipher that is being used for encryption. Many different methods
have been proposed in the past to generate hash tags using block ciphers. In
Fig. 2.1, a very basic method known as the Davies-Meyer method [6] is shown.
In the Davies-Meyer method, the hash is constructed by taking the previous hash
value h_{i-1} as input to a block cipher encryption function E, using the ith
message block m_i as the key. The output is then XORed with the previous hash
value to obtain the new ith hash value h_i, i.e.,
h_i = E_{m_i}(h_{i-1}) ⊕ h_{i-1}. For the first hash tag, a pre-computed
specific initial hash value h_0 is usually used.
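The chaining structure of the Davies-Meyer method can be sketched in Python. The block cipher E is replaced here by a toy 64-bit Feistel function, toy_encrypt, which is a hypothetical placeholder invented for this illustration and offers no security; only the chaining rule h_i = E_{m_i}(h_{i-1}) ⊕ h_{i-1} is the point of the sketch. The initial value h_0 = 0 is likewise an arbitrary assumption.

```python
MASK32 = 0xFFFFFFFF


def toy_encrypt(key: int, block: int) -> int:
    """Toy 64-bit Feistel cipher standing in for E. NOT secure."""
    left, right = (block >> 32) & MASK32, block & MASK32
    for rnd in range(4):
        round_key = (key >> (16 * rnd)) & MASK32
        f = ((right * 0x9E3779B1) ^ round_key) & MASK32
        left, right = right, left ^ f
    return (left << 32) | right


def davies_meyer(blocks, h0: int = 0) -> int:
    """Chain h_i = E_{m_i}(h_{i-1}) XOR h_{i-1} over all message blocks."""
    h = h0
    for m in blocks:
        # The message block is used as the cipher key, the previous
        # hash value as the cipher input.
        h = toy_encrypt(m, h) ^ h
    return h


message = [0x0123456789ABCDEF, 0x0FEDCBA987654321]
digest = davies_meyer(message)
print(f"{digest:016x}")
```

In a real construction E would be a standard block cipher such as AES, and the block and key sizes would follow from that cipher.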
2.3 Galois Counter Mode (GCM)
Galois Counter Mode (GCM) is a mode of operation for symmetric key block
ciphers recommended by NIST [17, 33]. GCM handles confidentiality through
encryption in the counter mode, and authentication is taken care of by a
computation involving a secure hash function. The Galois field used for ease of
hardware/software implementation is the binary field GF(2^128). GCM provides
encryption using the symmetric block cipher AES (Advanced Encryption Standard)
[33].
The term GMAC is often heard in the context of GCM. It simply means that if
the input data does not contain any information that needs to be encrypted,
then the operation of GCM is called GMAC. In that case it only provides data
authentication, and the authentication provided by GCM is far stronger than any
error detecting code or checksum [33]. GCM also provides many opportunities for
pre-computation and parallelized implementation [33]. For example, the length
of the input data is not required in advance; but if it is known and fixed, and
the initialization vector is also known, then many of the block cipher
computations related to an invocation can be done beforehand [33].
2.3.1 GCM Operation
As mentioned earlier, GCM is composed of two parts: authentication and
encryption. Data authentication is achieved via the keyed hash function GHASH
and encryption via the block cipher AES in the counter mode. As cryptographic
hash functions have already been discussed, a brief description of block
ciphers and the counter mode of operation is given below.
Block Ciphers
A block cipher causes the input data to go through a particular transformation,
and in every transformation a fixed amount of data from the input is taken,
which is called a block. The operation of a block cipher depends on a random
key, say K, which regulates the transformation that the input block goes
through. To achieve confidentiality, the block cipher uses two functions that
are inverses of each other; one is called encryption and the other decryption.
To present a visual and simplified example of a block cipher, Fig. 2.2 shows a
simple block cipher in electronic code book mode.
Figure 2.2: Simple Block Cipher in ECB Mode
So, as we can see in the above figure, it has two functions: one for encryption,
ENC_K, and the other for decryption, DEC_K, where DEC_K = ENC_K^{-1}.
Counter Mode of Operation for Block Ciphers
There are different modes of operation for block ciphers. The mode used in Fig.
2.2 is known as electronic code book mode. The mode that is most important
from the GCM perspective is the counter mode of operation. One important
feature of the counter mode is that it does not require two functions like the
example of a block cipher we saw earlier. It only needs the forward cipher
function, which is advantageous from the implementation point of view. In this
mode of operation, forward cipher function transformations are applied on
counter blocks, which are special input blocks that must be distinct for every
block processed under the same key. The output from those transformations is
XORed with the plaintext to produce the ciphertext. To get the plaintext back,
the same counter block goes through the forward cipher function and is then
XORed with the ciphertext [32]. To give a simple example, let us consider a
counter block X_1, to which we apply the forward cipher function with key K to
obtain Forw_K(X_1). Now we XOR it with plaintext P_1, so the ciphertext C_1
will be P_1 ⊕ Forw_K(X_1). To get back P_1, we just need to apply the forward
cipher function to the same counter block, that is, compute Forw_K(X_1), and
then XOR it with C_1, which gives back our plaintext P_1. A simplified block
diagram of the counter mode of operation of block ciphers can be seen in
Fig. 2.3.
Figure 2.3: Encryption and Decryption in counter mode of operation
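The fact that counter mode needs only the forward function can be sketched in Python. As an assumption made purely for this illustration, the forward cipher function Forw_K is replaced by a truncated SHA-256 of the key and counter block; this is a stand-in, not AES, and the names forw and ctr_xor are hypothetical. The same routine is used for both encryption and decryption.

```python
import hashlib

BLOCK = 16  # block size in bytes (128 bits)


def forw(key: bytes, counter_block: bytes) -> bytes:
    """Stand-in for the forward cipher function Forw_K (NOT AES)."""
    return hashlib.sha256(key + counter_block).digest()[:BLOCK]


def ctr_xor(key: bytes, first_counter: int, data: bytes) -> bytes:
    """XOR data with Forw_K applied to successive, distinct counter blocks."""
    out = bytearray()
    for offset in range(0, len(data), BLOCK):
        counter = first_counter + offset // BLOCK
        keystream = forw(key, counter.to_bytes(BLOCK, "big"))
        chunk = data[offset:offset + BLOCK]
        out.extend(d ^ k for d, k in zip(chunk, keystream))
    return bytes(out)


key = b"demo key"
plaintext = b"counter mode needs only the forward function"
ciphertext = ctr_xor(key, 1, plaintext)   # C_i = P_i XOR Forw_K(X_i)
recovered = ctr_xor(key, 1, ciphertext)   # P_i = C_i XOR Forw_K(X_i)
print(recovered == plaintext)  # True
```

Because each counter block is processed independently, the keystream blocks could be computed in parallel or ahead of time, which is one reason the mode suits high speed implementations.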
As we can see from Fig. 2.3, Count_1 is a distinct counter block for a block
P_1 of plaintext, whereas ForwCiph_K is the forward cipher function, with key




As we already know, GCM requires a block cipher for establishing the
confidentiality feature. Let us assume that the block cipher function is using
random key K, and that the input required is composed of the Plaintext P, which
is the actual data to be encrypted. Plaintext P is usually broken into blocks
128 bits long, except for the last block. If the last block is not already 128
bits, then extra zeros are padded. Another part of the input consists of
additional Authentication Data (AD), which is not encrypted and is used only
for the purpose of authentication. The length restriction for blocks of this
data is the same as for the plaintext [33, 28]. The final part of the input is
the Initialization Vector (IV), which is a nonce, is unique within its context,
and has a main role in the invocation of the forward cipher function. Its
construction and properties are discussed in more detail in the NIST
specification [33]. The resulting output of the GCM operation is the Ciphertext
(C) and the authentication Tag (T). The decryption part takes the
initialization vector, ciphertext, authentication data, and tag as input, and
using the initialization vector and ciphertext it produces the plaintext. A
simplified version of the encryption and decryption blocks can be seen in
Fig. 2.4. It does not include how the authentication part works in GCM, which
is part of our following discussion.
Now as we can see in the figure, input data for encryption goes through a block
called GCTR_K, which is essentially a modified counter mode operation as
discussed earlier and uses the block cipher for encryption with key K. The
ciphertext output is also broken into blocks with the same length restrictions
and padding as the plaintext, and the authentication tag produced is 128 bits
in length. Although shorter tags can be created, they are not encouraged by
NIST due to some security concerns.
Figure 2.4: Encryption and Decryption in GCM.
To create the authentication tag, the ciphertext along with the unencrypted
authentication data is passed through a GHASH_H block and then through a GCTR_K
block. Similarly, at the receiving end during decryption, the authentication
tag is computed using the received authentication data, which is in the clear,
and the ciphertext, again using hash subkey H. The computed tag is then compared
to the received tag. If the tags are the same, the authentication passes;
otherwise it fails. A simplified version of authentication is shown in Fig. 2.5.
Hence, we can see from the above discussion that GCM provides authentication
as well as confidentiality, and it allows some of the data to be in the clear,
namely the additional authentication data. Such data in practice could contain
addresses or any other information related to the encrypted data. If there is
no data to be encrypted, then GCM acts purely as an authentication mechanism
called GMAC, which again can be classified in the category of block cipher
based message authentication algorithms as we discussed previously.
Figure 2.5: Authentication in GCM
In the next chapter,





As discussed in the previous chapter, one of the modes for symmetric key block
ciphers recommended by NIST is GCM [33], which can provide
encryption/decryption and authentication (i.e., integrity of data) at the same
time. For the authentication computation, GCM has to generate a tag using a
keyed hash function known as GHASH. In this chapter, the discussion starts with
a brief introduction, followed by a description of the standard GHASH algorithm
[28]. A brief description is also provided of the parallelized computation of
GHASH as discussed in [29, 28]. We then review a new GHASH algorithm proposed
by Meloni et al. Finally, we discuss the carry-less multiplication schemes that
can be used in GHASH computations.
3.1 Standard GHASH Computation
There have been several proposals in the past for improving GCM implementation.
Some researchers have proposed faster computation of the associated symmetric
block cipher itself [11, 14], while some have tried to come up with faster ways
of multiplication [4, 27]. Several efficient implementations of AES-GCM as a
whole have also been proposed [21, 9, 5]. Although these proposed schemes vary,
at their core the GHASH algorithm is almost the same, that is, it involves as
many GF(2^128) multiplications as the number of blocks.
The Galois counter mode of operation provides opportunities for parallelization
of computation steps. However, the computation of GHASH, which involves
GF(2^128) multiplications, poses a bottleneck to the whole operation. A solution
has been proposed by the authors of GCM themselves [28], which we will see later
in the chapter, but that method increases the number of required multipliers.
Below, we first give a description of the standard GHASH algorithm as specified
in [28, 33].
3.1.1 GHASH Description
As a brief description of how GCM handles condentiality has been given in the
previous chapter, let us assume that the associated block cipher uses block size of
m bits. In order to authenticate data, the ciphertext generated by the block cipher
goes through a series of GF(2128) multiplications, using a key say H, which is also
in GF(2128). Also, assume that the block cipher being used for encryption and
decryption is AES.
In order to understand the operation of GHASH, assume that there is an input
data stream of bits P , divided into n blocks of size m bits. Let us represent those
19
blocks as P1, P2, P3. . . . . .Pn, and they are all m-bit long as mentioned. There might
be an exception with the last block Pn, which might not be m bits in length. In
order to x the length of the last block, if it falls short of m bits, extra 0's are
padded to it to increase its length to m bits [33]. The hash key H, which we have
mentioned earlier, is also m bits in length. Based on the GCM specication in
[33], the input blocks are 128 bits in length and could be either actual input data,
which is the output from ciphertext, or just additional authentication data (which
is also divided into equal block sizes of m bits as seen in the previous chapter), but
to avoid any confusion and to give a more formal denition of algorithm, we will
assume that any kind of block used as an input will be represented byPi, where
i = 1, 2, 3, ....n. So, using all these representations we can dene the resultant or
required GHASH as follows:
GHASHH(P ) = P1H
n + P2H
n−1 + P3H
n−2 + · · ·+ PnH (3.1)
where the hash subkey $H$ is obtained by applying the block cipher to the zero block,
i.e., a block whose bits are all zero, and $\mathrm{GHASH}_H$ denotes the GHASH computed
using key $H$.
Consider two parties communicating using GCM-AES who decide to use one shared
key $K$ as the session key. Each party can pre-compute $H$, which is simply the AES
encryption of a zero block under key $K$. This $H$ can then be used by both parties
throughout the session, without any need to recompute it every time a GHASH
computation is performed.
As can be seen from Eq. (3.1), the computation of GHASH is a series of
multiplication and addition operations in the field $\mathrm{GF}(2^m)$. In a more formal
form, and as described by its authors [28], the operation can be represented as in
Algorithm 3.1.
Algorithm 3.1 GHASH Standard [29, 33]
Input: $P$, $H$
Output: $T_n$
Steps:
    $T_0 \leftarrow 0$
    for $i = 1$ to $n$ do
        $T_i \leftarrow (T_{i-1} \oplus P_i) \cdot H$
    end for
    return $T_n$
Algorithm 3.1 can be graphically represented as in Fig. 3.1. The zero block $T_0$ in
Fig. 3.1 corresponds to the step in Algorithm 3.1 where the variable $T_0$ is initialized
to 0, and the tag $T_n$ produced in the last step of the figure represents the
computed GHASH tag.
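The loop of Algorithm 3.1 can be sketched in C as below. This is only an illustrative sketch under stated assumptions: it works in the toy field $\mathrm{GF}(2^4)$ with reduction polynomial $x^4 + x + 1$ rather than in $\mathrm{GF}(2^{128})$, blocks are stored one per byte, and the names `gf16_mul` and `ghash16` are hypothetical helpers that do not appear in the thesis code.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two elements of the toy field GF(2^4) modulo x^4 + x + 1.
   (Assumption: a 4-bit field is used only to keep the numbers small.) */
static uint8_t gf16_mul(uint8_t a, uint8_t b)
{
    uint8_t d = 0;
    for (int i = 0; i < 4; i++) {
        if (a & 1)
            d ^= b;                      /* add b * x^i when bit i of a is set */
        a >>= 1;
        uint8_t top = b & 0x8;           /* coefficient of x^3 before the shift */
        b = (uint8_t)((b << 1) & 0xF);
        if (top)
            b ^= 0x3;                    /* x^4 = x + 1 */
    }
    return d;
}

/* Algorithm 3.1: T_i = (T_{i-1} xor P_i) * H, starting from T_0 = 0. */
static uint8_t ghash16(const uint8_t *p, size_t n, uint8_t h)
{
    uint8_t t = 0;
    for (size_t i = 0; i < n; i++)
        t = gf16_mul((uint8_t)(t ^ p[i]), h);
    return t;
}
```

Each loop iteration performs one XOR and one field multiplication, which is the feedback structure described in the text.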
From Fig. 3.1 it is evident that if we have $n$ blocks of $m$ bits each, the computation
requires $n$ multiplications in $\mathrm{GF}(2^m)$. It can also be seen from Fig. 3.1 and
Algorithm 3.1 that the overall architecture of the GHASH computation is inherently a
feedback structure. Using this feedback characteristic, a more compact graphical
representation with one multiplier and one XOR operator in feedback is shown in
Fig. 3.2. This is a more practical approach to GHASH implementation as it requires less
hardware. On the other hand, it takes more time to compute the GHASH.
In order to determine the computation time of GHASH using the feedback
structure of Fig. 3.2, let us assume that the delay due to the XOR operation on a whole
block is $d_{xor}$, and the delay due to one multiplication is $d_{mul}$. The total delay
for computing GHASH using this architecture can be approximated as
$$d_{total} = (d_{xor} + d_{mul}) \cdot n \tag{3.2}$$
Figure 3.1: GHASH Computation
Figure 3.2: Feedback architecture for GHASH
The multiplication used by the GHASH function in the field $\mathrm{GF}(2^{128})$ is built on
carry-less multiplication. Intel has proposed a carry-less instruction to multiply two
64-bit operands, and has used it to come up with efficient software implementations of
GHASH. As mentioned earlier, the GHASH operation can also be parallelized, although
this has its restrictions in practical implementations. Before we move on to Intel's
proposed scheme, we give a brief overview of a parallel architecture for GHASH
computation.
3.1.2 Parallel Architecture For GHASH
The GHASH formulation allows its computation to be parallelized [28, 35]. To
understand a parallel architecture, assume that we have $g$ multipliers and adders, and
that the data stream $P$ is our input. For simplicity, we assume that $g$ is a factor of
$n$, where $n$ is the number of blocks into which $P$ is divided, and that the block size
is $m$ bits. We now divide $P$ again, this time into $g$ interleaved sub-sequences, so
that the GHASH can be written as
$$\mathrm{GHASH}_H(P) = S_1 H^{g} + S_2 H^{g-1} + S_3 H^{g-2} + \cdots + S_g H \tag{3.3}$$
and we dene all the Si's as in [29],
Si = Pi (H
g)n/g−1 + Pi+g (H
g)n/g−2 + ...+ Pn−g+i (H
g)0 (3.4)
These $S_i$'s can be computed in parallel in $(n/g - 1)$ steps, with a delay of
$(n/g - 1)(d_{mul} + d_{xor})$, where $d_{mul}$ is the delay of a single multiplier and
$d_{xor}$ is the delay of a single XOR gate. An additional multiplier delay $d_{mul}$ is
incurred by the multiplication of each $S_i$ with its respective power of $H$ from
Eq. (3.3); the powers $H^i$, where $1 \le i \le g$, are assumed to be already computed.
This can also be represented graphically as in Fig. 3.3 [29].
Figure 3.3: Parallel GHASH Computation
After all the parallel computations, the results can be added in a binary tree
fashion (to allow parallel XOR operations), as can be seen in Fig. 3.3. This addition
requires a delay of $(\log_2 g) \cdot d_{xor}$, and as a result the total delay of the
parallel architecture is
$$d_{total} = \left(\frac{n}{g} - 1\right) \cdot (d_{mul} + d_{xor}) + d_{mul} + (\log_2 g) \cdot d_{xor} \tag{3.5}$$
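The decomposition of Eqs. (3.3) and (3.4) can be checked with a short sequential C sketch. It is written under several assumptions: the toy field $\mathrm{GF}(2^4)$ with polynomial $x^4 + x + 1$ stands in for $\mathrm{GF}(2^{128})$, the $g$ lanes are simply evaluated one after another (real hardware would run them concurrently), and `gf16_mul` and `ghash16_parallel` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy field multiplication in GF(2^4) modulo x^4 + x + 1 (assumed field). */
static uint8_t gf16_mul(uint8_t a, uint8_t b)
{
    uint8_t d = 0;
    for (int i = 0; i < 4; i++) {
        if (a & 1)
            d ^= b;
        a >>= 1;
        uint8_t top = b & 0x8;
        b = (uint8_t)((b << 1) & 0xF);
        if (top)
            b ^= 0x3;
    }
    return d;
}

/* GHASH via Eqs. (3.3)/(3.4) with g lanes; n must be a multiple of g, g <= 8. */
static uint8_t ghash16_parallel(const uint8_t *p, size_t n, uint8_t h, size_t g)
{
    uint8_t hpow[9];                       /* hpow[k] = H^k for 1 <= k <= g */
    hpow[1] = h;
    for (size_t k = 2; k <= g; k++)
        hpow[k] = gf16_mul(hpow[k - 1], h);

    uint8_t result = 0;
    for (size_t i = 0; i < g; i++) {       /* lane i computes S_{i+1} */
        uint8_t s = p[i];
        for (size_t k = i + g; k < n; k += g)
            s = (uint8_t)(gf16_mul(s, hpow[g]) ^ p[k]);   /* Horner in H^g */
        result ^= gf16_mul(s, hpow[g - i]);               /* S_{i+1} * H^(g-i) */
    }
    return result;
}
```

With `g = 1` the function degenerates to the serial loop of Algorithm 3.1, which gives a convenient sanity check for larger `g`.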
3.2 Characteristic Polynomial Based GHASH
A high performance GHASH computation algorithm has been proposed in [29],
based on the concept of the characteristic or minimal polynomial. For the purpose of
GHASH, a characteristic polynomial of an element $E$ in the field $\mathrm{GF}(2^m)$ is
defined in [29] to be a polynomial $\chi_E(t)$ of degree $m$ with all coefficients
belonging to $\mathrm{GF}(2)$ such that $\chi_E(E) = 0$. In the case of GHASH computation,
let $\chi_H$ denote the characteristic polynomial of the hash sub-key $H$. If the
characteristic polynomial is irreducible, then it can be shown that it is the minimal
polynomial of $H$. Writing its coefficients as $c_i$, we have
$$\chi_H(H) = \sum_{i=0}^{m} c_i H^{i} = 0 \tag{3.6}$$
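Since all the $c_i$'s are either 0 or 1, and the degree of $\chi_H$ is $m$ (so that
$c_m = 1$), we have
$$H^{m} = \sum_{i=0}^{m-1} c_i H^{i} \tag{3.7}$$
Now consider the expression
$$G = P_1 H^{m} + P_2 H^{m-1} + P_3 H^{m-2} + \cdots + P_m H \tag{3.8}$$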
where all the $P_i$'s are in the field $\mathrm{GF}(2^m)$. If we apply modular reduction to
$G$ using $\chi_H$, we get
$$G \bmod \chi_H = c_0 P_1 + (P_m + c_1 P_1) \cdot H + \cdots + (P_3 + c_{m-2} P_1) \cdot H^{m-2} + (P_2 + c_{m-1} P_1) \cdot H^{m-1} \tag{3.9}$$
Since each $c_i$ can only be 0 or 1, the term $(P_{m-i+1} + c_i \cdot P_1)$ requires no
computation if $c_i = 0$ and one addition in $\mathrm{GF}(2^m)$ if $c_i = 1$. These
operations can be represented in the form of a circuit as shown in Fig. 3.4. The
registers shown in the figure are loaded in the following sequence:
$P_1, P_2, \ldots, P_m \to Y_{m-1}, Y_{m-2}, \ldots, Y_0$. We can see from Fig. 3.4 and
Eq. (3.9) that the additions can be performed in parallel. The circuit that performs
these parallel operations is called the Polynomial Reduction Unit (PRU) [29].
Now, let us consider a polynomial similar to Eq. (3.8) but of degree $n > m$, which we
can break up as follows,
$$G = ((\cdots((P_1 H^{m} + P_2 H^{m-1} + \cdots + P_{m+1}) H + P_{m+2}) H + \cdots + P_{n-1}) H + P_n) H$$
which can be simplified after applying reduction modulo $\chi_H$ as [29],
$$G \bmod \chi_H = ((\cdots((P_1 H^{m} + P_2 H^{m-1} + \cdots + P_{m+1} \bmod \chi_H) H + P_{m+2} \bmod \chi_H) H + \cdots + P_{n-1} \bmod \chi_H) H + P_n \bmod \chi_H) H \bmod \chi_H \tag{3.10}$$
Basically, the idea is to replace $n - m + 1$ multiplications by $H$ with the same
number of polynomial reductions, using the circuit shown in Fig. 3.4. More formally,
and as defined in [29], the algorithm can be represented as in Algorithm 3.2. For
simplicity, this GHASH algorithm, which was proposed by Meloni, Negre, and
Hasan [29], will from here onwards be referred to as the MNH GHASH algorithm.
We clearly see that, compared to the older Algorithm 3.1, the new one requires
fewer multiplications when $n \ge m$. For example, if we have $n$ blocks for which to
compute the GHASH, the older algorithm requires $n$ multiplications, whereas the MNH
algorithm restricts the number of multiplications to $m - 1$ and replaces the rest
with $n - m + 1$ parallel rounds of a PRU, which mainly consists of XOR operations
and AND operations with the fixed $c_i$'s. One important thing to note here is that,
inside the first loop, the computations of all the $Y_i$'s, including $Y_0$, represent
the operation of the PRU and are computed in parallel at every iteration of $j$.
Figure 3.4: Polynomial Reduction Unit
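Algorithm 3.2 MNH GHASH Algorithm
Input: $P = P_1, P_2, \ldots, P_n$, $\chi_H(H) = \sum_{i=0}^{m} c_i H^{i}$, where $n \ge m$
Output: $\mathrm{GHASH}_H(P) = P_1 H^{n} + P_2 H^{n-1} + P_3 H^{n-2} + \cdots + P_n H$
Steps:
    $Y_{m-1}, Y_{m-2}, \ldots, Y_0 \leftarrow P_1, P_2, \ldots, P_m$
    $T \leftarrow 0$, $P_{n+1} = 0$
    for $j = m$ to $n$ do
        $C \leftarrow Y_{m-1}$
        $Y_i \leftarrow Y_{i-1} + c_i C$, $m - 1 \ge i \ge 1$    (in parallel)
        $Y_0 \leftarrow P_{j+1} + c_0 C$                         (in parallel)
    end for
    for $i = m - 1$ down to 1 do
        $T \leftarrow (T + Y_i) \cdot H$
    end for
    return $(T + Y_0)$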
Such parallel computations are easily possible in special purpose hardware. On
the other hand, this is not the case in software running on general purpose processors,
where these computations are most likely to be performed in sequence, as will be
discussed in the next chapter. A brief description will also be given there of how to
compute the characteristic polynomial required by the algorithm. For now, we give an
example to clarify the operation of Algorithm 3.2.
Let us assume that we have $\mathrm{GF}(2^4)$, i.e., $m = 4$, and that the reduction
polynomial is $x^4 + x + 1$. Let us further assume that we have five blocks $P_1 = 1$,
$P_2 = x$, $P_3 = x + 1$, $P_4 = x^2$, $P_5 = x^2 + 1$ for the computation of
$\mathrm{GHASH}_H$, where $H = x^3$, and $P_6 = 0$. From the assigned value of $H$, we
obtain the values of the $c_i$'s through the characteristic polynomial, which is
$t^4 + t^3 + t^2 + t + 1$ for the given $H$, so that every $c_i$ equals 1. The
expression for $\mathrm{GHASH}_H$ is
$$\mathrm{GHASH}_H(P) = P_1 H^{5} + P_2 H^{4} + P_3 H^{3} + P_4 H^{2} + P_5 H \tag{3.11}$$
Now, using the MNH GHASH algorithm, we first assign the input block values to the
PRU registers as follows,
$Y_3 = P_1 = 1$
$Y_2 = P_2 = x$
$Y_1 = P_3 = 1 + x$
$Y_0 = P_4 = x^2$
Now, applying the PRU iteration for $j = 4$,
$C = Y_3 = 1$
$Y_3 = Y_2 + C = x + 1$
$Y_2 = Y_1 + C = x$
$Y_1 = Y_0 + C = x^2 + 1$
$Y_0 = P_5 + C = x^2$
Again, applying the PRU iteration for $j = 5$,
$C = Y_3 = x + 1$
$Y_3 = Y_2 + C = 1$
$Y_2 = Y_1 + C = x^2 + x$
$Y_1 = Y_0 + C = x^2 + x + 1$
$Y_0 = P_6 + C = x + 1$
Now, applying the iterations of the multiplication loop, starting with $i = 3$ and $T = 0$,
$T = (T + Y_3) \cdot H = (0 + 1) \cdot x^3 = x^3$
for $i = 2$,
$T = (T + Y_2) \cdot H = (x^3 + x^2 + x) \cdot x^3 = 1 + x^3$
and again, for $i = 1$,
$T = (T + Y_1) \cdot H = (x^3 + 1 + x^2 + x + 1) \cdot x^3 = 1 + x^3$
Now, for the final step, we add $Y_0$ and $T$,
$\mathrm{GHASH}_H(P) = Y_0 + T = 1 + x + 1 + x^3 = x^3 + x$
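The worked example can be replayed with a small C sketch of Algorithm 3.2. The sketch makes several assumptions: it is fixed to the toy field $\mathrm{GF}(2^4)$ with polynomial $x^4 + x + 1$, the PRU update that hardware performs in parallel is emulated sequentially, and the names `M`, `gf16_mul` and `ghash16_mnh` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { M = 4 };   /* field degree m of the assumed toy field GF(2^4) */

/* Toy field multiplication in GF(2^4) modulo x^4 + x + 1. */
static uint8_t gf16_mul(uint8_t a, uint8_t b)
{
    uint8_t d = 0;
    for (int i = 0; i < 4; i++) {
        if (a & 1)
            d ^= b;
        a >>= 1;
        uint8_t top = b & 0x8;
        b = (uint8_t)((b << 1) & 0xF);
        if (top)
            b ^= 0x3;
    }
    return d;
}

/* Algorithm 3.2. c[i] is the coefficient of t^i in chi_H(t) for 0 <= i < M
   (the leading coefficient is implicitly 1); p holds n >= M blocks. */
static uint8_t ghash16_mnh(const uint8_t *p, size_t n, uint8_t h,
                           const uint8_t c[M])
{
    uint8_t y[M];
    for (int i = 0; i < M; i++)
        y[M - 1 - i] = p[i];                    /* Y_{m-1} = P_1, ..., Y_0 = P_m */

    for (size_t j = M; j <= n; j++) {           /* n - m + 1 PRU rounds */
        uint8_t carry = y[M - 1];               /* C <- Y_{m-1} */
        uint8_t next = (j < n) ? p[j] : 0;      /* P_{j+1}, with P_{n+1} = 0 */
        for (int i = M - 1; i >= 1; i--)        /* descending order keeps old Y_{i-1} */
            y[i] = (uint8_t)(y[i - 1] ^ (c[i] ? carry : 0));
        y[0] = (uint8_t)(next ^ (c[0] ? carry : 0));
    }

    uint8_t t = 0;
    for (int i = M - 1; i >= 1; i--)            /* m - 1 field multiplications */
        t = gf16_mul((uint8_t)(t ^ y[i]), h);
    return (uint8_t)(t ^ y[0]);
}
```

Updating the registers from the highest index downwards lets a single array stand in for the simultaneous register transfer of the PRU.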
As mentioned earlier, the MNH algorithm restricts the combined multiplication and
XOR operations to $m - 1$, and the rest of the multiplications are replaced by
$n - m + 1$ PRU operations. Hence, the GHASH computation time using the MNH algorithm
is
$$d_{total} = (n - m + 1) \cdot (d_{xor} + d_{and}) + (m - 1) \cdot (d_{mul} + d_{xor}) \tag{3.12}$$
where $d_{and}$ is the delay of an AND operation.
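To get a feel for how Eq. (3.2) and Eq. (3.12) compare, the two delay models can be evaluated numerically. The following C sketch is only illustrative: the function names are hypothetical, and the delays are abstract time units chosen by the caller rather than measured values.

```c
#include <stdint.h>

/* Eq. (3.2): serial feedback architecture, n blocks. */
static uint64_t delay_standard(uint64_t n, uint64_t d_xor, uint64_t d_mul)
{
    return (d_xor + d_mul) * n;
}

/* Eq. (3.12): MNH algorithm, n >= m blocks in GF(2^m). */
static uint64_t delay_mnh(uint64_t n, uint64_t m,
                          uint64_t d_xor, uint64_t d_and, uint64_t d_mul)
{
    return (n - m + 1) * (d_xor + d_and) + (m - 1) * (d_mul + d_xor);
}
```

For instance, with the purely hypothetical unit delays $d_{xor} = d_{and} = 1$ and $d_{mul} = 20$, and with $n = 1024$ and $m = 128$, the first model gives 21504 units and the second 4461 units.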
3.3 Implementation Issues in Software
The most challenging task in the software implementation of GHASH, using either
the standard or the MNH algorithm, is the multiplication in $\mathrm{GF}(2^{128})$. As
mentioned earlier, such a multiplication can be performed by first multiplying two
polynomials of degree less than 128 over the ground field $\mathrm{GF}(2)$ and then
reducing the resulting polynomial of degree 254 or less using the field defining
polynomial of degree 128. The polynomial multiplication over $\mathrm{GF}(2)$ can be
viewed as a carry-less multiplication, which is described below. We also present
Intel's new instruction to speed up such carry-less multiplication and its use with
the Karatsuba algorithm.
3.3.1 Carry-less Multiplication
Carry-less multiplication can be defined, in simple words, as the multiplication of
two operands with no generation or propagation of carries during the process. Let us
assume that we have two operands $X$ and $Y$, each represented as an array of $m$ bits,
with index 1 denoting the least significant bit:
$X = [x_m, \ldots, x_3, x_2, x_1]$
$Y = [y_m, \ldots, y_3, y_2, y_1]$
The carry-less product has $2m - 1$ bits; let us call it $Z$, where $Z$ can be
represented as
$Z = [z_{2m-1}, z_{2m-2}, z_{2m-3}, \ldots, z_2, z_1]$
and each bit of $Z$ is given by
$$z_i = \bigoplus_{j=1}^{i} x_j y_{i-j+1}, \quad 1 \le i \le m$$
$$z_i = \bigoplus_{j=i-m+1}^{m} x_j y_{i-j+1}, \quad m + 1 \le i \le 2m - 1$$
It is evident from the equations above that the result is similar to integer
multiplication, but without any carries. In the small example considered here, the
result of normal multiplication of $X$ and $Y$ is 144, i.e., [10010000] in binary,
whereas the result of carry-less multiplication is [1010000], which is equivalent
to 80.
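The shift-and-XOR definition of carry-less multiplication translates directly into C. In the sketch below, `clmul32` is a hypothetical helper name, and the operand pair $X = Y = 12$ (binary 1100) used in the usage note is an assumption: it is one pair of 4-bit operands consistent with the values 144 and 80 quoted above.

```c
#include <stdint.h>

/* Carry-less (polynomial over GF(2)) product of two 32-bit operands.
   The result has at most 63 significant bits. */
static uint64_t clmul32(uint32_t x, uint32_t y)
{
    uint64_t z = 0;
    for (int i = 0; i < 32; i++)
        if ((y >> i) & 1u)
            z ^= (uint64_t)x << i;   /* XOR instead of add: no carries */
    return z;
}
```

With these assumed operands, `clmul32(12, 12)` evaluates to 80, while the ordinary product `12 * 12` is 144.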
3.3.2 Efficient Carry-less Multiplication for Large Operands
For GHASH, the size of the operands for carry-less multiplication is $m = 128$ bits.
There are basically two types of techniques used in efficient software implementations
of carry-less multiplication for operands of this size: look-up table methods and
Karatsuba methods.
A look-up table based implementation consists of two major steps. The first is
pre-processing, in which all the tables are generated in $\mathrm{GF}(2^{128})$ and stored.
The second involves finding the right table entries for the given input and XORing
them together to obtain the output. Further details of the look-up table based
approach are not relevant to our discussion and are omitted, but the key point is that
the scheme incurs a memory storage cost and can be inefficient when high performance
(speed) is required.
The other popular technique is to use a carry-less Karatsuba algorithm, in which the
multiplication is followed by a reduction algorithm. A brief introduction to the
Karatsuba algorithm is presented next, followed by a brief description of the modified
Karatsuba algorithm.
3.3.3 Basics of Karatsuba Algorithm
The Karatsuba algorithm is named after its inventor, Anatolii A. Karatsuba. It
enables faster multiplication of two $n$-digit numbers than the traditional algorithm.
The older algorithm, also called ordinary multiplication (OML) [20, 19], has an
algorithmic complexity of $O(n^2)$, which is reduced to $O(n^{\log_2 3})$ by the
Karatsuba algorithm.
In order to get a better understanding of the algorithm, assume two $n$-digit numbers,
$a$ and $b$, in base $B$. Also, choose a positive integer $x$ less than $n$, such that we
can split the two numbers as follows,
$$a = a_1 \cdot B^{x} + a_0 \tag{3.13}$$
$$b = b_1 \cdot B^{x} + b_0 \tag{3.14}$$
We also have to make sure that $B^x$ is greater than $a_0$ and $b_0$. As a result, the
product $p$ of $a$ and $b$ can be represented as
$$p = a \cdot b = c_2 \cdot B^{2x} + c_1 B^{x} + c_0 \tag{3.15}$$
where
$c_2 = a_1 \cdot b_1$
$c_1 = a_1 \cdot b_0 + a_0 \cdot b_1$
$c_0 = a_0 \cdot b_0$
Now, it appears that we need to perform four multiplications, but Karatsuba
reduced these to three multiplications at the cost of extra additions, by computing
$c_1$ as follows,
$c_1 = (a_1 + a_0) \cdot (b_1 + b_0) - c_2 - c_0$
Computing $c_1$ in this way reduces the number of multiplications by one. Let us look
at a small example to verify the working of this algorithm. Assume that we want to
multiply two 3-digit numbers, $a = 123$ and $b = 456$, that the base is 10, and that the
value of $x$ is 1. We can then split the two numbers as in Eq. (3.13) and Eq. (3.14),
$a = 123 = 12 \cdot 10^1 + 3$
$b = 456 = 45 \cdot 10^1 + 6$
The values of $c_2$, $c_1$, and $c_0$ can then be computed as
$c_2 = a_1 \cdot b_1 = 12 \cdot 45 = 540$
$c_0 = a_0 \cdot b_0 = 3 \cdot 6 = 18$
$c_1 = (a_1 + a_0) \cdot (b_1 + b_0) - c_2 - c_0 = (12 + 3) \cdot (45 + 6) - 540 - 18 = 207$
Now, using Eq. (3.15), we can compute the final result as
$p = a \cdot b = c_2 \cdot B^{2x} + c_1 B^{x} + c_0 = 540 \cdot 10^2 + 207 \cdot 10 + 18 = 56088$
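The same computation can be written as a short C sketch of a single Karatsuba level for ordinary integers. The function name `karatsuba_once` and the parameter `split` (standing for $B^x$) are hypothetical, and the sketch assumes the operands are small enough that no 64-bit overflow occurs.

```c
#include <stdint.h>

/* One level of Karatsuba for non-negative integers.
   `split` is B^x, e.g. 10 for base 10 with x = 1. */
static uint64_t karatsuba_once(uint64_t a, uint64_t b, uint64_t split)
{
    uint64_t a1 = a / split, a0 = a % split;          /* a = a1 * B^x + a0 */
    uint64_t b1 = b / split, b0 = b % split;          /* b = b1 * B^x + b0 */
    uint64_t c2 = a1 * b1;                            /* first multiplication */
    uint64_t c0 = a0 * b0;                            /* second multiplication */
    uint64_t c1 = (a1 + a0) * (b1 + b0) - c2 - c0;    /* third multiplication */
    /* Scaling by powers of the base is a digit shift, not a general product. */
    return c2 * split * split + c1 * split + c0;
}
```

Calling `karatsuba_once(123, 456, 10)` reproduces the worked example and returns 56088.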
3.3.4 Karatsuba Algorithm for GHASH
Intel has proposed the use of the Karatsuba algorithm for computing
$\mathrm{GF}(2^{128})$ field multiplications using their PCLMULQDQ instruction. Since the
instruction multiplies 64-bit operands, only one recursion of the Karatsuba algorithm
is needed. If we have two 128-bit operands, $A$ and $B$, then to apply the Karatsuba
algorithm we divide each of them into two 64-bit parts, represented as
$A = [A_1 : A_0]$ and $B = [B_1 : B_0]$, where ':' denotes concatenation [16]. As we saw
in the previous section, for the Karatsuba algorithm we compute $c_2$, $c_1$ and $c_0$;
here we compute their equivalents, $[G_1 : G_0]$, $[E_1 : E_0]$ and $[D_1 : D_0]$,
respectively. To understand the multiplication by splitting into two 64-bit halves,
let us write $A$ and $B$ in polynomial form as follows (addition in the following
equations is not normal addition, but the addition used in field arithmetic, i.e., XOR),
$$A = A_1 x^{64} + A_0, \qquad B = B_1 x^{64} + B_0$$
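Similarly, $[G_1 : G_0]$, $[E_1 : E_0]$ and $[D_1 : D_0]$ can be computed as
$$A_1 B_1 = [G_1 : G_0] = G_1 x^{64} + G_0$$
$$A_0 B_0 = [D_1 : D_0] = D_1 x^{64} + D_0$$
$$(A_0 + A_1) \cdot (B_1 + B_0) = [E_1 : E_0] = E_1 x^{64} + E_0$$
Applying the Karatsuba identity of the previous section, with subtraction replaced by
addition in $\mathrm{GF}(2)$, the product can then be written as
$$A \cdot B = A_1 B_1 x^{128} + ((A_0 + A_1) \cdot (B_1 + B_0) + A_1 B_1 + A_0 B_0) x^{64} + A_0 B_0 \tag{3.16}$$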
Now, substituting the values of $[G_1 : G_0]$, $[E_1 : E_0]$ and $[D_1 : D_0]$ in
Eq. (3.16) and simplifying, we get
$$A \cdot B = G_1 x^{192} + (G_1 + G_0 + D_1 + E_1) x^{128} + (D_1 + D_0 + G_0 + E_0) x^{64} + D_0 \tag{3.17}$$
Thus, Eq. (3.17) represents the final product and shows how the carry-less
multiplication of 128-bit operands can be performed using 64-bit halves.
Implementation details will be discussed in the next chapter.
The next step after multiplication is the reduction of the product modulo
$g(x) = x^{128} + x^7 + x^2 + x + 1$. In terms of implementation, this can be achieved
simply, using only a few shift and XOR operations, as we will see in the next chapter.
Chapter 4
Software Implementation of GHASH
Algorithms
In the previous chapter we discussed the standard GHASH algorithm [33] and the
new MNH GHASH algorithm [29]. We also mentioned Intel's new instruction
PCLMULQDQ [17] and its use in GHASH computation. In this chapter we look at the
performance of the old and the new GHASH algorithms, based primarily on software
implementations of the two. At the start of the chapter, we look into the field
multiplication algorithm suggested in [33] and how a modified implementation of this
algorithm is done, followed by a small example in $\mathrm{GF}(2^4)$. A brief description
of Gordon's algorithm for computing characteristic polynomials is then presented,
again followed by a small example in $\mathrm{GF}(2^4)$. In the later sections of the
chapter, the GHASH implementation using Intel's carry-less instruction PCLMULQDQ
is presented. The chapter also includes implementation results and a comparison
between the standard and the MNH GHASH algorithms.
4.1 GHASH Building Blocks
It is evident from the discussion in the previous two chapters that the most
important part of GHASH computation is carry-less multiplication. If we look closely
at the standard way of computing GHASH, it is nothing but field multiplication and
XOR operations done repeatedly. The carry-less multiplication algorithm used in our
performance comparison is the modified implementation of the multiplication algorithm
proposed in [33]. For the MNH GHASH algorithm, one of the most important ingredients
is the computation of the characteristic polynomial $\chi_H$ for the corresponding
sub-key $H$ [29]. To compute the characteristic polynomial, Gordon's algorithm, as
presented in [29], has been implemented. Before we discuss the implementation
results, brief discussions with examples are provided on the modified multiplication
algorithm, Gordon's algorithm, and the algorithms needed to utilize the PCLMULQDQ
instruction.
4.1.1 Implementation of Standard GF(2m) Multiplication
An algorithm for multiplication in $\mathrm{GF}(2^{128})$ is given in [33]. The operations
involved in it are right shifts and XORs. Our implementation of the algorithm is
modified to use left shift operations instead of right shifts. The modification allows
us to avoid the bit reflection that was previously required, as mentioned in [33, 17],
and it also makes the algorithm easier to understand. The modified scheme is described
in Algorithm 4.1; in the literature it is known as the least significant bit first
multiplication algorithm.
Algorithm 4.1 Least Significant Bit First Multiplication
Input: $A$, $B$ (input blocks) and $R$ (reduction polynomial block)
Output: $D_m = A \cdot B$
Steps:
    $A \to a_{m-1} a_{m-2} \ldots a_2 a_1 a_0$ (input block $A$ represented as a bit string)
    $D_0 \leftarrow 00 \cdots 0$, $E_0 \leftarrow B$
    for $i = 0$ to $m - 1$ do
        $D_{i+1} \leftarrow D_i$ if $a_i = 0$; $D_i \oplus E_i$ if $a_i = 1$
        $E_{i+1} \leftarrow E_i \ll 1$ if $\mathrm{MSB}(E_i) = 0$; $(E_i \ll 1) \oplus R$ if $\mathrm{MSB}(E_i) = 1$
    end for
    return $D_m$
In order to clarify the working of Algorithm 4.1, we give a small example. Let
us start by assuming that we have two 4-bit blocks, $A = 0110$ and $B = 0011$, in the
field $\mathrm{GF}(2^4)$. The field polynomial used for modulo reduction is $x^4 + x + 1$,
i.e., $x^4 \equiv x + 1$, which is represented as a 4-bit block, $R = 0011$. Here
$\ll 1$ denotes a left shift by one bit with the result truncated to $m$ bits. Let us
initialize the values of $D_0$ and $E_0$ as follows
$D_0 = 0^4 = 0000$,
$E_0 = B = 0011$.
Now, we can also represent block $A$ as
$A = a_3 a_2 a_1 a_0 = 0110$
Now, going through the iterations of the for loop, for $i = 0$, $a_0 = 0$ and
$\mathrm{MSB}(E_0) = 0$, so we have
$D_1 = D_0 = 0000$
$E_1 = E_0 \ll 1 = 0110$
Now, for $i = 1$, $a_1 = 1$ and $\mathrm{MSB}(E_1) = 0$, so we have
$D_2 = D_1 \oplus E_1 = 0110$
$E_2 = E_1 \ll 1 = 1100$
Now, for $i = 2$, $a_2 = 1$ and $\mathrm{MSB}(E_2) = 1$, so we have
$D_3 = D_2 \oplus E_2 = 1010$
$E_3 = (E_2 \ll 1) \oplus R = 1011$
Finally, for $i = 3$, $a_3 = 0$ and $\mathrm{MSB}(E_3) = 1$, and we obtain our result $D_4$
as follows,
$D_4 = D_3 = 1010$.
Hence, the result of multiplying the blocks $A = 0110$ and $B = 0011$ over the field
$\mathrm{GF}(2^4)$ using Algorithm 4.1 is 1010.
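Algorithm 4.1 can be sketched in C for any field degree up to 64 bits. This is an illustrative sketch, not the thesis implementation (which operates on 128-bit blocks): the function name `gf2m_mul_lsb_first` is hypothetical, and `r` is assumed to hold the low $m$ bits of the field polynomial with the $x^m$ term left implicit.

```c
#include <stdint.h>

/* Algorithm 4.1 for GF(2^m), 1 <= m <= 64.
   r: low m bits of the field polynomial (the x^m term is implicit). */
static uint64_t gf2m_mul_lsb_first(uint64_t a, uint64_t b, unsigned m, uint64_t r)
{
    uint64_t mask = (m == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << m) - 1);
    uint64_t msb = UINT64_C(1) << (m - 1);
    uint64_t d = 0;              /* D_0 = 0 */
    uint64_t e = b & mask;       /* E_0 = B */
    for (unsigned i = 0; i < m; i++) {
        if ((a >> i) & 1u)
            d ^= e;              /* D_{i+1} = D_i xor E_i when a_i = 1 */
        uint64_t top = e & msb;  /* MSB(E_i) */
        e = (e << 1) & mask;     /* E_i << 1, truncated to m bits */
        if (top)
            e ^= r;              /* fold x^m back into the field */
    }
    return d;
}
```

With `m = 4` and `r = 0x3` the call `gf2m_mul_lsb_first(0x6, 0x3, 4, 0x3)` returns `0xA`, matching the worked example; with `m = 8` and `r = 0x1B` the same routine multiplies in the AES byte field.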
4.1.2 Gordon's Algorithm
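In order to calculate the characteristic polynomial, Gordon's method is used, which
is based on the product
$$\chi_H(t) = \prod_{i=0}^{m-1} \left(t + H^{2^i}\right) \tag{4.1}$$
where $\chi_H$ is the characteristic polynomial of the hash sub-key $H$, with
$H \in \mathrm{GF}(2^m)$. This can be represented in the form of Algorithm 4.2 [29].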
Algorithm 4.2 Gordon's Algorithm [29]
Input: $H \in \mathrm{GF}(2^m)$
Output: $\chi_H$
Steps:
    $\chi_H \leftarrow t + H$
    $Z \leftarrow H$
    for $i = 1$ to $m - 1$ do
        $Z \leftarrow Z^2$
        $\chi_H \leftarrow \chi_H \cdot t + \chi_H \cdot Z$
    end for
    return $\chi_H$
The for loop in Algorithm 4.2 starts from $i = 1$, not $i = 0$ as suggested by
Eq. (4.1). This is because the initialization step $\chi_H \leftarrow t + H$ already
represents the stage $i = 0$. In order to clarify the working of Algorithm 4.2, let us
assume that we have $H = x^3$, that the field we are using is $\mathrm{GF}(2^4)$, and that
the field polynomial is $x^4 + x + 1$. Let us begin with the initialization step,
$\chi_H = t + H = t + x^3$
$Z = H = x^3$
Now, in the first iteration of the for loop, when $i = 1$,
$Z = Z^2 = (x^3)^2 = x^6 = x^3 + x^2$
$\chi_H = \chi_H \cdot t + \chi_H \cdot Z$
$\chi_H = (t + x^3) \cdot t + (t + x^3) \cdot (x^3 + x^2)$
$\chi_H = t^2 + t x^2 + x^3 + x$
When $i = 2$,
$Z = Z^2 = (x^3 + x^2)^2 = x^3 + x^2 + x + 1$
$\chi_H = \chi_H \cdot t + \chi_H \cdot Z$
$\chi_H = (t^2 + t x^2 + x^3 + x) \cdot t + (t^2 + t x^2 + x^3 + x) \cdot (x^3 + x^2 + x + 1)$
$\chi_H = t^3 + (1 + x + x^3) t^2 + (1 + x) t + x^2 + x^3$
Again, when $i = 3$,
$Z = Z^2 = (x^3 + x^2 + x + 1)^2 = x^3 + x$
$\chi_H = \chi_H \cdot t + \chi_H \cdot Z$
$\chi_H = (t^3 + (1 + x + x^3) t^2 + (1 + x) t + x^2 + x^3) \cdot t + (t^3 + (1 + x + x^3) t^2 + (1 + x) t + x^2 + x^3) \cdot (x + x^3)$
$\chi_H = t^4 + t^3 + t^2 + t + 1$
Now, $\chi_H$ is our required characteristic polynomial.
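The iteration of Algorithm 4.2 can be sketched in C for the toy field of the example. The sketch is written under assumptions: it is fixed to $\mathrm{GF}(2^4)$ with polynomial $x^4 + x + 1$, the polynomial $\chi_H$ is stored as an array of field elements indexed by the power of $t$, and the names `M`, `gf16_mul` and `gordon_charpoly` are hypothetical (the thesis itself used a Maple program for this step).

```c
#include <stdint.h>

enum { M = 4 };   /* field degree m of the assumed toy field GF(2^4) */

/* Toy field multiplication in GF(2^4) modulo x^4 + x + 1. */
static uint8_t gf16_mul(uint8_t a, uint8_t b)
{
    uint8_t d = 0;
    for (int i = 0; i < 4; i++) {
        if (a & 1)
            d ^= b;
        a >>= 1;
        uint8_t top = b & 0x8;
        b = (uint8_t)((b << 1) & 0xF);
        if (top)
            b ^= 0x3;
    }
    return d;
}

/* Algorithm 4.2: chi[k] receives the coefficient of t^k, 0 <= k <= M. */
static void gordon_charpoly(uint8_t h, uint8_t chi[M + 1])
{
    for (int k = 0; k <= M; k++)
        chi[k] = 0;
    chi[0] = h;                     /* chi <- t + H */
    chi[1] = 1;
    uint8_t z = h;                  /* Z <- H */
    for (int i = 1; i <= M - 1; i++) {
        z = gf16_mul(z, z);         /* Z <- Z^2 */
        /* chi <- chi * t + chi * Z, updated in place from the top down */
        for (int k = i + 1; k >= 1; k--)
            chi[k] = (uint8_t)(chi[k - 1] ^ gf16_mul(chi[k], z));
        chi[0] = gf16_mul(chi[0], z);
    }
}
```

For $H = x^3$ (the value `0x8`) every entry of `chi` ends up equal to 1, i.e. $t^4 + t^3 + t^2 + t + 1$, in agreement with the derivation above.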
In order to implement Gordon's algorithm, only two major components are
needed: a field multiplier and an XOR operator. For XOR operations, Intel's
intrinsic XOR operation is used, and for field multiplications, the modified carry-less
multiplier implementation discussed in Section 4.1.1 is used.
4.1.3 Intel's PCLMULQDQ instruction
Intel introduced the PCLMULQDQ instruction in 2010 for carry-less multiplication
on their Westmere architecture [17]. This instruction multiplies two operands that are
64 bits in length, and provides a faster way of computing carry-less multiplication
than the methods available before it [17]. It can further be used to compute the
carry-less multiplication of two 128-bit operands, as we will see in the next
subsection. In its assembly form, this instruction can be written as [17],
pclmulqdq immbyte, reg1, reg2
where reg1 and reg2 are two 128-bit registers. The carry-less multiplication is
performed on one quadword (8 bytes) of reg1 and one quadword of reg2. The
selection of the quadwords from reg1 and reg2 depends on the value of immbyte
(the result is stored in reg2; in C, the instruction can be used by calling a
function which returns the result of the multiplication). If we refer to the bits of
reg1, reg2 and immbyte as
reg1[127:0]
reg2[127:0]
immbyte[7:0]
then we can represent the selection of quadwords on the basis of the immbyte value as
in Table 4.1.
immbyte (in hex)    Quadword Selection
0x00                reg2[63:0], reg1[63:0]
0x01                reg2[63:0], reg1[127:64]
0x10                reg2[127:64], reg1[63:0]
0x11                reg2[127:64], reg1[127:64]
Table 4.1: Selection of quadwords
In terms of software implementation, an intrinsic function for PCLMULQDQ can be
used. Intel allows the use of the intrinsic function without explicitly specifying
PCLMULQDQ [2]. A small example of how to use this intrinsic function in the C language
is also given in [2]. The intrinsic function _mm_clmulepi64_si128( ) can be formally
defined as [2],
__m128i _mm_clmulepi64_si128(__m128i a1, __m128i a2, const int immbyte)
The definition above means that the function returns a value of type __m128i, and
as inputs it takes two 128-bit parameters and a constant integer immbyte, which
decides which halves of the two 128-bit operands are taken for multiplication, using
the criteria shown in Table 4.1.
4.1.4 Intel's Karatsuba Implementation using PCLMULQDQ
In Chapter 3, we discussed Intel's modified Karatsuba algorithm and how it works in
terms of polynomial arithmetic. In this subsection we look at it more closely in terms
of implementation. A more formal definition of Intel's carry-less Karatsuba algorithm,
as proposed in [16], is presented in Algorithm 4.3.
Algorithm 4.3 Intel's modified Karatsuba algorithm [16]
Input: $X = [X_1 : X_0]$, $Y = [Y_1 : Y_0]$
Output: $X \cdot Y$
Steps:
    $[Z_1 : Z_0] = X_1 \cdot Y_1$
    $[W_1 : W_0] = X_0 \cdot Y_0$
    $[V_1 : V_0] = (X_1 \oplus X_0) \cdot (Y_1 \oplus Y_0)$
    $X \cdot Y = [Z_1 : Z_0 \oplus Z_1 \oplus W_1 \oplus V_1 : W_1 \oplus Z_0 \oplus W_0 \oplus V_0 : W_0]$
    return $X \cdot Y$
In Algorithm 4.3, $X$ and $Y$ represent the two blocks to be multiplied, each divided
into two halves. In the case of the $\mathrm{GF}(2^{128})$ field, $X_1$ and $Y_1$ are the
upper 64-bit halves, and $X_0$ and $Y_0$ the lower 64-bit halves, of the 128-bit
operands $X$ and $Y$, respectively. The symbol ':' represents concatenation of blocks,
and the symbol '·' represents the carry-less multiplication operator.
Below we use a small example to clarify the working of Algorithm 4.3. To keep
things simple, assume that we have two blocks $X$ and $Y$ in the field $\mathrm{GF}(2^4)$.
These blocks can be divided into two halves of 2 bits each to keep the representation
consistent with Algorithm 4.3.
$X = [X_1 : X_0] = [01 : 10]$
$Y = [Y_1 : Y_0] = [00 : 11]$
The expected result of the multiplication is 1010, and the steps of the operation can
be seen below, where '·' represents carry-less multiplication and XOR operations are
used to form the operands of the third step.
$[Z_1 : Z_0] = X_1 \cdot Y_1 = 01 \cdot 00 = [00 : 00]$
$[W_1 : W_0] = X_0 \cdot Y_0 = 10 \cdot 11 = [01 : 10]$
$[V_1 : V_0] = (X_1 \oplus X_0) \cdot (Y_1 \oplus Y_0) = 11 \cdot 11 = [01 : 01]$
Now, the final step of computing the product involves XOR operations and the
concatenation of four blocks (four 64-bit blocks producing a 256-bit output in the
128-bit case; four 2-bit blocks producing an 8-bit output here).
$X \cdot Y = [Z_1 : Z_0 \oplus Z_1 \oplus W_1 \oplus V_1 : W_1 \oplus Z_0 \oplus W_0 \oplus V_0 : W_0]$
$X \cdot Y = [00 : 00 \oplus 00 \oplus 01 \oplus 01 : 01 \oplus 00 \oplus 10 \oplus 01 : 10]$
$X \cdot Y = [00 : 00 : 10 : 10]$
Hence, the result is the same as the expected result.
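For 128-bit operands, the steps of Algorithm 4.3 can be sketched in portable C as follows. This is a sketch under assumptions rather than the thesis code: `clmul64` is a hypothetical bit-by-bit software stand-in for a single PCLMULQDQ, so that the example does not depend on the instruction being available, and 128-bit and 256-bit values are held in arrays of 64-bit words with index 0 as the least significant word.

```c
#include <stdint.h>

/* Software stand-in for one PCLMULQDQ: 64 x 64 -> 128-bit carry-less product.
   out[1] is the upper quadword, out[0] the lower quadword. */
static void clmul64(uint64_t x, uint64_t y, uint64_t out[2])
{
    out[0] = 0;
    out[1] = 0;
    for (int i = 0; i < 64; i++) {
        if ((y >> i) & 1u) {
            out[0] ^= x << i;
            if (i > 0)
                out[1] ^= x >> (64 - i);   /* bits shifted past bit 63 */
        }
    }
}

/* Algorithm 4.3: x = [x[1] : x[0]], y = [y[1] : y[0]],
   prod = [prod[3] : prod[2] : prod[1] : prod[0]]. */
static void clmul128_karatsuba(const uint64_t x[2], const uint64_t y[2],
                               uint64_t prod[4])
{
    uint64_t z[2], w[2], v[2];
    clmul64(x[1], y[1], z);                  /* [Z1 : Z0] = X1 * Y1 */
    clmul64(x[0], y[0], w);                  /* [W1 : W0] = X0 * Y0 */
    clmul64(x[1] ^ x[0], y[1] ^ y[0], v);    /* [V1 : V0] */
    prod[3] = z[1];
    prod[2] = z[0] ^ z[1] ^ w[1] ^ v[1];
    prod[1] = w[1] ^ z[0] ^ w[0] ^ v[0];
    prod[0] = w[0];
}
```

Only three 64-bit carry-less multiplications are issued; on hardware that supports it, each `clmul64` call would correspond to one use of the PCLMULQDQ intrinsic.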
As we can see from Algorithm 4.3, the first three steps involve only
multiplications, and the last step involves multiple XOR operations. In terms of
implementation, the three carry-less multiplications are implemented using the
PCLMULQDQ instruction, in the same manner as discussed in the last subsection. The XOR
operations can again be implemented using the intrinsic function for XORing two
128-bit operands. Similar to PCLMULQDQ, the intrinsic function for the XOR operation
defined in [1] can be represented as shown below
__m128i _mm_xor_si128(__m128i x, __m128i y)
where x and y are two 128-bit operands.
4.1.5 Efficient Reduction Modulo Implementation
In [16, 17] Intel has proposed a modular reduction algorithm that takes into
consideration the field defining polynomial $x^{128} + x^7 + x^2 + x + 1$. The algorithm
is then implemented in combination with the carry-less Karatsuba algorithm explained
in the previous section. A more formal description of this modular reduction is given
in Algorithm 4.4.
Algorithm 4.4 Modular Reduction in $\mathrm{GF}(2^{128})$ [17, 16]
Input: $[X_4 : X_3 : X_2 : X_1]$, where $X_4, X_3, X_2, X_1$ are each 64 bits long
Output: $[Y_1 : Y_0]$ (128-bit reduction result, where $Y_1, Y_0$ are each 64 bits long)
Steps:
    $U = X_4 \gg 63$
    $V = X_4 \gg 62$
    $W = X_4 \gg 57$
    $Z = U \oplus V \oplus W \oplus X_3$
    Now, using $Z$ we form $[X_4 : Z]$, and proceed as follows,
    $[P_1 : P_0] = [X_4 : Z] \ll 1$
    $[Q_1 : Q_0] = [X_4 : Z] \ll 2$
    $[R_1 : R_0] = [X_4 : Z] \ll 7$
    $Y_1 = X_2 \oplus X_4 \oplus P_1 \oplus Q_1 \oplus R_1$
    $Y_0 = X_1 \oplus Z \oplus P_0 \oplus Q_0 \oplus R_0$
    return $[Y_1 : Y_0]$
In terms of implementation, a combined implementation of this algorithm with the
carry-less Karatsuba algorithm, as presented in [17], is used. The combined
implementation serves the purpose of $\mathrm{GF}(2^{128})$ field multiplication, and the
implementations of the standard and the MNH GHASH algorithms are then built on top
of it.
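The shift-and-XOR reduction can be sketched in portable C as below. The sketch follows the structure of the reduction published by Intel for the polynomial $x^{128} + x^7 + x^2 + x + 1$ in the non-reflected bit order; the function name `gf128_reduce` and the array layout (index 0 is the least significant 64-bit word, so `x[3]` plays the role of $X_4$) are assumptions made for illustration.

```c
#include <stdint.h>

/* Reduce the 256-bit polynomial [x[3] : x[2] : x[1] : x[0]] modulo
   x^128 + x^7 + x^2 + x + 1; the result is [y[1] : y[0]]. */
static void gf128_reduce(const uint64_t x[4], uint64_t y[2])
{
    /* Bits of the top word that overflow when it is shifted by 1, 2 and 7. */
    uint64_t u = x[3] >> 63;
    uint64_t v = x[3] >> 62;
    uint64_t w = x[3] >> 57;
    uint64_t z = x[2] ^ u ^ v ^ w;

    /* [x[3] : z] shifted left by 1, 2 and 7 bits, truncated to 128 bits. */
    uint64_t p1 = (x[3] << 1) | (z >> 63), p0 = z << 1;
    uint64_t q1 = (x[3] << 2) | (z >> 62), q0 = z << 2;
    uint64_t r1 = (x[3] << 7) | (z >> 57), r0 = z << 7;

    /* Multiply the high half by x^7 + x^2 + x + 1 and add the low half. */
    y[1] = x[1] ^ x[3] ^ p1 ^ q1 ^ r1;
    y[0] = x[0] ^ z ^ p0 ^ q0 ^ r0;
}
```

As a quick check, the input $x^{128}$ (only the lowest bit of `x[2]` set) reduces to `0x87`, the bit pattern of $x^7 + x^2 + x + 1$.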
4.2 GHASH Implementation Results
4.2.1 Implementation using Common Place Instructions
A software implementation of the standard and the MNH GHASH functions has
been written in the C programming language without using Intel's special
carry-less multiplication instruction. The 128-bit multiplication is performed using
the modified version of the algorithm provided by NIST in [33] (see Section
4.1.1). The standard GHASH implementation can be found in Appendix A.1. In
order to compute the characteristic polynomial, a Maple program has been written.
The input hash sub-key value ($H$) used for the Maple program is a large 128-bit
random value in which half of the bits are 1 and half are 0. The output of the Maple
program is in polynomial form, and the resulting characteristic polynomial is used in
the C code for the MNH GHASH in the form of an array of 1's and 0's. The MNH GHASH
implementation is included in Appendix A.2. Both GHASH algorithms are compared in
terms of computation time. In order to increase the accuracy of the timing results,
each algorithm was run 10,000 times for each number of input blocks (each block is
128 bits long), and the average time was obtained by dividing the total by 10,000.
The system used for running the implementations was a Xeon E3-1270 (quad-core,
3.4 GHz) and the operating system was Linux Ubuntu Server 11.10 (with gcc 4.6.1).
Computation time was measured using the 'time' command in Linux. The computation
times for the standard and high performance GHASH can be seen in Table 4.2, and a
graphical representation of the results can be seen in Fig. 4.1.
Table 4.2: Computation Time of Implementations with Customary Instructions
As can be seen from Fig. 4.1, the MNH GHASH shows a smaller delay than the
standard GHASH, and the improvement grows as the number of blocks increases.
The improved delay is due to the fact that the MNH GHASH algorithm keeps the
number of 128-bit multiplication operations fixed at 127. The rest of the 128-bit
multiplications, which are part of the standard GHASH algorithm, are replaced
by PRU computations as discussed in Section 3.2. In terms of implementation, PRU
computations are much faster than 128-bit multiplications. In the software
implementation, the PRU operations are not performed in parallel but computed
sequentially, as it is not feasible to compute 128 operations exactly in parallel in
software on our processors. It is interesting to note that, even though the PRU
operations do not occur in parallel, the new algorithm still gives better results
than the standard one.
4.2.2 Implementation using PCLMULQDQ Instruction
Figure 4.1: Performance Comparison of Implementations with Customary Instructions
The GHASH algorithms have also been implemented using Intel's PCLMULQDQ
instruction. In order to perform 128-bit multiplication, the algorithm described by
Intel in [17], which we discussed in Section 4.1.4, is used. The implementation of the
reduction algorithm discussed in Section 4.1.5 is used in combination with it to
obtain the 128-bit multiplication result. The characteristic polynomial for the MNH
GHASH algorithm is obtained in the same way as described in Section 4.2.1. The
computation times for the two algorithms can be seen in Table 4.3, and a graphical
representation of the results can be seen in Fig. 4.2.
As can be seen from Fig. 4.2, the MNH GHASH algorithm does not perform
well when compared to the standard one. The main reason for the better performance of
the standard GHASH algorithm in this case is the use of Intel's PCLMULQDQ
instruction, which greatly speeds up the carry-less multiplication operation. Another
reason is that the software implementation of the MNH GHASH algorithm is not able to
exploit the parallelism required by the PRU operations, as was also noted for the
implementation without PCLMULQDQ.










Table 4.3: Computation Time of Implementations with PCLMULQDQ







In this thesis, software implementations of the MNH GHASH are compared with
those of the standard GHASH. In implementations where Intel's PCLMULQDQ
instruction is not used, the MNH GHASH performs better than the standard GHASH.
In contrast, in implementations where Intel's PCLMULQDQ instruction is used,
the standard GHASH performs better than the MNH GHASH. At its core, the MNH
GHASH algorithm attempts to reduce the number of multiplications required to
compute the GHASH by using multiple XOR operations in parallel. Regardless of
whether PCLMULQDQ is used, the implementations have not been able to take
advantage of the parallelism present in the MNH GHASH algorithm. This
parallelism can be utilized by a hardware polynomial reduction unit, which
today's mainstream processors do not have.
Even though parallelism is not utilized by our software implementations due to
the above-mentioned limitation, the MNH algorithm performs better than the
standard implementation when PCLMULQDQ is not used. This suggests that on
architectures older than Westmere, or on architectures that do not support
Intel's PCLMULQDQ instruction, the MNH GHASH algorithm can be expected to
perform better than the standard GHASH algorithm.
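One practical consequence is that a program could choose between the two
variants at run time, depending on whether the processor reports PCLMULQDQ
support. The following is a minimal sketch of such a check, not part of the
implementations measured in this thesis. It assumes a GCC or Clang toolchain
that provides <cpuid.h> and its __get_cpuid helper; the names has_pclmulqdq,
ghash_variant and choose_ghash_variant are hypothetical.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang CPUID helper (assumed toolchain) */
#endif

/* Returns 1 if the processor reports PCLMULQDQ support, 0 otherwise.
 * On non-x86 targets the instruction does not exist, so 0 is returned. */
static int has_pclmulqdq(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 1 not available */
    return (int)((ecx >> 1) & 1u); /* CPUID.01H:ECX bit 1 = PCLMULQDQ */
#else
    return 0;
#endif
}

/* Hypothetical selector reflecting the measurements summarized above:
 * standard GHASH when PCLMULQDQ is available, MNH GHASH otherwise. */
typedef enum { GHASH_STANDARD_CLMUL, GHASH_MNH_SOFTWARE } ghash_variant;

static ghash_variant choose_ghash_variant(void)
{
    return has_pclmulqdq() ? GHASH_STANDARD_CLMUL : GHASH_MNH_SOFTWARE;
}
```

The selection rule only mirrors the results reported here; on other processors
the crossover point would have to be measured again.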
5.2 Future Work
As we discussed, because software implementations are unable to exploit the
parallelism of the MNH GHASH algorithm, the latter has not outperformed the
standard GHASH algorithm when PCLMULQDQ is used. It is hard to carry out the 128
XOR operations needed by the polynomial reduction unit truly in parallel in a
software implementation. It is, however, possible to try high-end systems with
multi-core and/or programmable-logic-equipped processors to do at least some
part of the computation in parallel. If, for example, we can perform four XOR
operations in parallel, we can reduce the 128 sequential XOR operations to 32
rounds of XOR operations, with each round having 4 XOR operations. Further work
can be done on exploiting the parallelism available on various high-end systems
to speed up software implementations.
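The idea of 32 rounds of four XOR operations can be sketched as follows. This is
an illustrative restructuring of the register update used in Appendix A.2, not
a measured implementation: the names block128, PRU_LEN and pru_step are assumed
here, blocks are modelled as two 64-bit words instead of __m128i, and whether
the four operations of a round actually execute in parallel depends on the
compiler and the hardware.

```c
#include <stdint.h>
#include <string.h>

#define PRU_LEN 128   /* number of register cells (assumed name) */

typedef struct { uint64_t lo, hi; } block128;

/* One shift step of the register Y. c[i] are the 0/1 feedback coefficients
 * and in is the next input block. The 128 cell updates are grouped into
 * 32 rounds of 4 updates that do not depend on each other. */
static void pru_step(block128 Y[PRU_LEN], const int c[PRU_LEN], block128 in)
{
    block128 old[PRU_LEN];
    memcpy(old, Y, sizeof old);           /* snapshot: removes ordering constraints */
    const block128 fb = old[PRU_LEN - 1]; /* feedback block (C in Appendix A.2)     */

    for (int r = 0; r < PRU_LEN; r += 4) {      /* 32 rounds           */
        for (int k = 0; k < 4; k++) {           /* 4 independent lanes */
            int i = r + k;
            block128 src = (i == 0) ? in : old[i - 1];
            /* All-ones mask if c[i] is set, zero otherwise (branch-free). */
            uint64_t m = (uint64_t)0 - (uint64_t)(c[i] & 1);
            Y[i].lo = src.lo ^ (fb.lo & m);
            Y[i].hi = src.hi ^ (fb.hi & m);
        }
    }
}
```

Working from a snapshot of the old register contents costs one extra copy, but
it makes the four updates of each round independent of one another, which is the
property a parallel implementation would need.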





A.1 Standard GHASH without PCLMULQDQ
// Uses implementation of multiplication from section 4.1.1
#include <stdint.h>
#include <inttypes.h>
#include <wmmintrin.h>
#include <emmintrin.h>
#include <smmintrin.h>
#include <stdio.h>
#include <time.h>
struct aes_block { uint64_t a; uint64_t b; };
void gfmulos(__m128i x, __m128i y, __m128i *res);
void print_m128i_with_string(char *string, __m128i data);
int main() {
    // (declarations of the input arrays a[] and b[] are not reproduced here)
    __m128i X[1537];
    int i = 0;
    // Input Initialization
    for (i = 0; i <= 1536; i++) {
        a[i] = a[i] * 1000000000000000000ULL; // enlarging input
        b[i] = b[i] * 1000000000000000000ULL; // enlarging input
        X[i] = _mm_set_epi64((__m64)a[i], (__m64)b[i]);
    }
    __m128i H = _mm_set_epi64((__m64)5708010131839353156ULL,
                              (__m64)3405470159317640703ULL); // Hash Sub-key
    /// Standard GHASH
    __m128i temp = {0x00, 0x00};
    int k = 0; // for multiple runs ... to magnify timing results
    // The commented for loop is only used when running the code for
    // timing analysis
    // for (k = 0; k <= 10000; k++) {
    for (i = 0; i <= 1535; i++) {
        temp = _mm_xor_si128(temp, X[i]);
        gfmulos(temp, H, &temp);
    }
    // } //--------------
    print_m128i_with_string("TagResult f:", temp);
}
void print_m128i_with_string(char *string, __m128i data) {
    unsigned char *pointer = (unsigned char *)&data;
    int i;
    printf("%-40s[0x", string);
    for (i = 0; i < 16; i++)
        printf("%02x", pointer[i]);
    printf("]\n");
}
void gfmulos(__m128i x, __m128i y, __m128i *res) {
    __m128i R = _mm_set_epi64((__m64)0ULL, (__m64)135ULL); // x^7 + x^2 + x + 1
    __m128i Z = _mm_set_epi64((__m64)0ULL, (__m64)0ULL);
    __m128i V = y; __m128i X = x;
    int i = 0;
    __m64 ch;
    __m64 che;
    for (i = 0; i < 128; i++) {
        // Current bit of X: low word for i < 64, high word afterwards
        if (i < 64) { ch = (__m64)X[0]; }
        else { ch = (__m64)X[1]; }
        uint64_t ch1 = (uint64_t)ch & 1ULL;
        if (ch1) {
            Z = _mm_xor_si128(Z, V); }
        // Top bit of V decides whether a reduction is needed after the shift
        che = (__m64)V[1];
        uint64_t che1 = (uint64_t)che & 9223372036854775808ULL;
        if (che1) {
            __m64 ad = (__m64)V[0];
            uint64_t tx1 = (uint64_t)ad & 9223372036854775808ULL;
            V = _mm_slli_epi64(V, 1);
            if (tx1) { // carry from the low word into the high word
                V[1] = (uint64_t)_mm_or_si64((__m64)V[1], (__m64)1ULL);
            }
            V = _mm_xor_si128(V, R);
        } else {
            __m64 ad1 = (__m64)V[0];
            uint64_t tx2 = (uint64_t)ad1 & 9223372036854775808ULL;
            V = _mm_slli_epi64(V, 1);
            if (tx2) { // carry from the low word into the high word
                V[1] = (uint64_t)_mm_or_si64((__m64)V[1], (__m64)1ULL);
            }
        }
        if (i < 64) { X[0] = (uint64_t)_mm_srli_si64((__m64)X[0], 1); }
        else { X[1] = (uint64_t)_mm_srli_si64((__m64)X[1], 1); }
    }
    *res = Z;
}
A.2 High Performance GHASH without PCLMULQDQ
// Uses modified implementation of multiplication from section 4.1.1,
// and array for 'ci' values (which is assumed as precomputed)
// is constructed from characteristic polynomial computed using
// Gordon's Algorithm Maple implementation described in Appendix A.5.
#include <stdint.h>
#include <inttypes.h>
#include <wmmintrin.h>
#include <emmintrin.h>
#include <smmintrin.h>
#include <stdio.h>
#include <time.h>
struct aes_block { uint64_t a; uint64_t b; };
void gfmulos(__m128i x, __m128i y, __m128i *res);
void print_m128i_with_string(char *string, __m128i data);
int main() {
    // (declarations of the input arrays a[] and b[] are not reproduced here)
    __m128i X[1537];
    int i = 0;
    // Input Initialization
    for (i = 0; i <= 1536; i++) {
        a[i] = a[i] * 1000000000000000000ULL; // enlarging input
        b[i] = b[i] * 1000000000000000000ULL; // enlarging input
        X[i] = _mm_set_epi64((__m64)a[i], (__m64)b[i]);
    }
    __m128i H = _mm_set_epi64((__m64)5708010131839353156ULL,
                              (__m64)3405470159317640703ULL); // Hash Sub-key
    /// Standard GHASH
    __m128i temp = {0x00, 0x00};
    int k = 0; // for multiple runs ... to magnify timing results
    // New GHASH ---------------
    int j = 0;
    // Values of ci set based on characteristic polynomial
    // from Maple Code
    int ci[129] = {1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
                   0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
                   1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
                   0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
                   1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0,
                   1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0,
                   1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1,
                   0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1};
    __m128i Y[128];
    // The commented for loop is only used when running the code for
    // timing analysis
    // for (k = 0; k <= 10000; k++) {
    for (i = 0; i <= 127; i++) {
        Y[i] = X[127 - i];
    }
    for (j = 127; j <= 1535; j++) {
        __m128i C = Y[127];
        for (i = 127; i >= 1; i--) {
            if (ci[i] == 1) { Y[i] = _mm_xor_si128(Y[i - 1], C); }
            else { Y[i] = Y[i - 1]; }
        }
        if (ci[0] == 1) { Y[0] = _mm_xor_si128(X[j + 1], C); }
        else { Y[0] = X[j + 1]; }
    }
    for (i = 127; i >= 1; i--) {
        temp = _mm_xor_si128(temp, Y[i]);
        gfmulos(temp, H, &temp);
    }
    temp = _mm_xor_si128(temp, Y[0]);
    // }
    print_m128i_with_string("TagResult f:", temp);
}
void print_m128i_with_string(char *string, __m128i data) {
    unsigned char *pointer = (unsigned char *)&data;
    int i;
    printf("%-40s[0x", string);
    for (i = 0; i < 16; i++)
        printf("%02x", pointer[i]);
    printf("]\n");
}
void gfmulos(__m128i x, __m128i y, __m128i *res) {
    __m128i R = _mm_set_epi64((__m64)0ULL, (__m64)135ULL); // x^7 + x^2 + x + 1
    __m128i Z = _mm_set_epi64((__m64)0ULL, (__m64)0ULL);
    __m128i V = y; __m128i X = x;
    int i = 0;
    __m64 ch;
    __m64 che;
    for (i = 0; i < 128; i++) {
        // Current bit of X: low word for i < 64, high word afterwards
        if (i < 64) { ch = (__m64)X[0]; }
        else { ch = (__m64)X[1]; }
        uint64_t ch1 = (uint64_t)ch & 1ULL;
        if (ch1) {
            Z = _mm_xor_si128(Z, V); }
        // Top bit of V decides whether a reduction is needed after the shift
        che = (__m64)V[1];
        uint64_t che1 = (uint64_t)che & 9223372036854775808ULL;
        if (che1) {
            __m64 ad = (__m64)V[0];
            uint64_t tx1 = (uint64_t)ad & 9223372036854775808ULL;
            V = _mm_slli_epi64(V, 1);
            if (tx1) { // carry from the low word into the high word
                V[1] = (uint64_t)_mm_or_si64((__m64)V[1], (__m64)1ULL);
            }
            V = _mm_xor_si128(V, R);
        } else {
            __m64 ad1 = (__m64)V[0];
            uint64_t tx2 = (uint64_t)ad1 & 9223372036854775808ULL;
            V = _mm_slli_epi64(V, 1);
            if (tx2) { // carry from the low word into the high word
                V[1] = (uint64_t)_mm_or_si64((__m64)V[1], (__m64)1ULL);
            }
        }
        if (i < 64) { X[0] = (uint64_t)_mm_srli_si64((__m64)X[0], 1); }
        else { X[1] = (uint64_t)_mm_srli_si64((__m64)X[1], 1); }
    }
    *res = Z;
}
A.3 Standard GHASH with PCLMULQDQ
// Uses multiplication implementation from Intel's PCLMULQDQ white paper,
// which is combination of algorithms in Sections 4.1.4 and 4.1.5.
#include <stdint.h>
#include <inttypes.h>
#include <wmmintrin.h>
#include <emmintrin.h>
#include <smmintrin.h>
#include <stdio.h>
#include <time.h>
struct aes_block { uint64_t a; uint64_t b; };
void gfmulintel(__m128i a, __m128i b, __m128i *res);
void print_m128i_with_string(char *string, __m128i data);
int main() {
    // (declarations of the input arrays a[] and b[] are not reproduced here)
    __m128i X[1537];
    int i = 0;
    // Input Initialization
    for (i = 0; i <= 1536; i++) {
        a[i] = a[i] * 1000000000000000000ULL; // enlarging input
        b[i] = b[i] * 1000000000000000000ULL; // enlarging input
        X[i] = _mm_set_epi64((__m64)a[i], (__m64)b[i]);
    }
    __m128i H = _mm_set_epi64((__m64)5708010131839353156ULL,
                              (__m64)3405470159317640703ULL); // Hash Sub-key
    /// Standard GHASH
    __m128i temp = {0x00, 0x00};
    int k = 0; // for multiple runs ... to magnify timing results
    // The commented for loop is only used when running the code for
    // timing analysis
    // for (k = 0; k <= 10000; k++) {
    for (i = 0; i <= 1535; i++) {
        temp = _mm_xor_si128(temp, X[i]);
        gfmulintel(temp, H, &temp);
    }
    // } //--------------
    print_m128i_with_string("TagResult f:", temp);
}
void print_m128i_with_string(char *string, __m128i data) {
    unsigned char *pointer = (unsigned char *)&data;
    int i;
    printf("%-40s[0x", string);
    for (i = 0; i < 16; i++)
        printf("%02x", pointer[i]);
    printf("]\n");
}
void gfmulintel(__m128i a, __m128i b, __m128i *res) {
    __m128i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6,
            tmp7, tmp8, tmp9, tmp10, tmp11, tmp12;
    __m128i XMMMASK = _mm_setr_epi32(0xffffffff, 0x0, 0x0, 0x0);
    // Carry-less multiplication (Section 4.1.4)
    tmp3 = _mm_clmulepi64_si128(a, b, 0x00);
    tmp6 = _mm_clmulepi64_si128(a, b, 0x11);
    tmp4 = _mm_shuffle_epi32(a, 78);
    tmp5 = _mm_shuffle_epi32(b, 78);
    tmp4 = _mm_xor_si128(tmp4, a);
    tmp5 = _mm_xor_si128(tmp5, b);
    tmp4 = _mm_clmulepi64_si128(tmp4, tmp5, 0x00);
    tmp4 = _mm_xor_si128(tmp4, tmp3);
    tmp4 = _mm_xor_si128(tmp4, tmp6);
    tmp5 = _mm_slli_si128(tmp4, 8);
    tmp4 = _mm_srli_si128(tmp4, 8);
    tmp3 = _mm_xor_si128(tmp3, tmp5);
    tmp6 = _mm_xor_si128(tmp6, tmp4);
    // Reduction (Section 4.1.5)
    tmp7 = _mm_srli_epi32(tmp6, 31);
    tmp8 = _mm_srli_epi32(tmp6, 30);
    tmp9 = _mm_srli_epi32(tmp6, 25);
    tmp7 = _mm_xor_si128(tmp7, tmp8);
    tmp7 = _mm_xor_si128(tmp7, tmp9);
    tmp8 = _mm_shuffle_epi32(tmp7, 147);
    tmp7 = _mm_and_si128(XMMMASK, tmp8);
    tmp8 = _mm_andnot_si128(XMMMASK, tmp8);
    tmp3 = _mm_xor_si128(tmp3, tmp8);
    tmp6 = _mm_xor_si128(tmp6, tmp7);
    tmp10 = _mm_slli_epi32(tmp6, 1);
    tmp3 = _mm_xor_si128(tmp3, tmp10);
    tmp11 = _mm_slli_epi32(tmp6, 2);
    tmp3 = _mm_xor_si128(tmp3, tmp11);
    tmp12 = _mm_slli_epi32(tmp6, 7);
    tmp3 = _mm_xor_si128(tmp3, tmp12);
    *res = _mm_xor_si128(tmp3, tmp6);
}
A.4 High Performance GHASH with PCLMULQDQ
// Uses multiplication implementation as presented in
// Intel's PCLMULQDQ white paper, and which is combination
// of algorithms in section 4.1.4 and 4.1.5, and also uses
// Gordon's algorithm in a similar way as in Appendix A.2
#include <stdint.h>
#include <inttypes.h>
#include <wmmintrin.h>
#include <emmintrin.h>
#include <smmintrin.h>
#include <stdio.h>
#include <time.h>
struct aes_block { uint64_t a; uint64_t b; };
void gfmulintel(__m128i a, __m128i b, __m128i *res);
void print_m128i_with_string(char *string, __m128i data);
int main() {
    // (declarations of the input arrays a[] and b[] are not reproduced here)
    __m128i X[1537];
    int i = 0;
    // Input Initialization
    for (i = 0; i <= 1536; i++) {
        a[i] = a[i] * 1000000000000000000ULL; // enlarging input
        b[i] = b[i] * 1000000000000000000ULL; // enlarging input
        X[i] = _mm_set_epi64((__m64)a[i], (__m64)b[i]);
    }
    __m128i H = _mm_set_epi64((__m64)5708010131839353156ULL,
                              (__m64)3405470159317640703ULL); // Hash Sub-key
    /// Standard GHASH
    __m128i temp = {0x00, 0x00};
    int k = 0; // for multiple runs ... to magnify timing results
    // New GHASH ---------------
    int j = 0;
    // Values of ci set based on characteristic polynomial
    // from Maple Code
    int ci[129] = {1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
                   0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
                   1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
                   0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
                   1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0,
                   1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0,
                   1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1,
                   0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1};
    __m128i Y[128];
    // The commented for loop is only used when running the code for
    // timing analysis
    // for (k = 0; k <= 10000; k++) {
    for (i = 0; i <= 127; i++) {
        Y[i] = X[127 - i];
    }
    for (j = 127; j <= 1535; j++) {
        __m128i C = Y[127];
        for (i = 127; i >= 1; i--) {
            if (ci[i] == 1) { Y[i] = _mm_xor_si128(Y[i - 1], C); }
            else { Y[i] = Y[i - 1]; }
        }
        if (ci[0] == 1) { Y[0] = _mm_xor_si128(X[j + 1], C); }
        else { Y[0] = X[j + 1]; }
    }
    for (i = 127; i >= 1; i--) {
        temp = _mm_xor_si128(temp, Y[i]);
        gfmulintel(temp, H, &temp);
    }
    temp = _mm_xor_si128(temp, Y[0]);
    // }
    print_m128i_with_string("TagResult f:", temp);
}
void print_m128i_with_string(char *string, __m128i data) {
    unsigned char *pointer = (unsigned char *)&data;
    int i;
    printf("%-40s[0x", string);
    for (i = 0; i < 16; i++)
        printf("%02x", pointer[i]);
    printf("]\n");
}
void gfmulintel(__m128i a, __m128i b, __m128i *res) {
    __m128i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6,
            tmp7, tmp8, tmp9, tmp10, tmp11, tmp12;
    __m128i XMMMASK = _mm_setr_epi32(0xffffffff, 0x0, 0x0, 0x0);
    // Carry-less multiplication (Section 4.1.4)
    tmp3 = _mm_clmulepi64_si128(a, b, 0x00);
    tmp6 = _mm_clmulepi64_si128(a, b, 0x11);
    tmp4 = _mm_shuffle_epi32(a, 78);
    tmp5 = _mm_shuffle_epi32(b, 78);
    tmp4 = _mm_xor_si128(tmp4, a);
    tmp5 = _mm_xor_si128(tmp5, b);
    tmp4 = _mm_clmulepi64_si128(tmp4, tmp5, 0x00);
    tmp4 = _mm_xor_si128(tmp4, tmp3);
    tmp4 = _mm_xor_si128(tmp4, tmp6);
    tmp5 = _mm_slli_si128(tmp4, 8);
    tmp4 = _mm_srli_si128(tmp4, 8);
    tmp3 = _mm_xor_si128(tmp3, tmp5);
    tmp6 = _mm_xor_si128(tmp6, tmp4);
    // Reduction (Section 4.1.5)
    tmp7 = _mm_srli_epi32(tmp6, 31);
    tmp8 = _mm_srli_epi32(tmp6, 30);
    tmp9 = _mm_srli_epi32(tmp6, 25);
    tmp7 = _mm_xor_si128(tmp7, tmp8);
    tmp7 = _mm_xor_si128(tmp7, tmp9);
    tmp8 = _mm_shuffle_epi32(tmp7, 147);
    tmp7 = _mm_and_si128(XMMMASK, tmp8);
    tmp8 = _mm_andnot_si128(XMMMASK, tmp8);
    tmp3 = _mm_xor_si128(tmp3, tmp8);
    tmp6 = _mm_xor_si128(tmp6, tmp7);
    tmp10 = _mm_slli_epi32(tmp6, 1);
    tmp3 = _mm_xor_si128(tmp3, tmp10);
    tmp11 = _mm_slli_epi32(tmp6, 2);
    tmp3 = _mm_xor_si128(tmp3, tmp11);
    tmp12 = _mm_slli_epi32(tmp6, 7);
    tmp3 = _mm_xor_si128(tmp3, tmp12);
    *res = _mm_xor_si128(tmp3, tmp6);
}
A.5 Gordon's Algorithm in Maple
# Input is variable 'A'. For actual results,
# a large value of input is used. x+1 here
# is just to give an example.
A := x+1;
G := GF(2, 128, x^128+x^7+x^2+x+1);
aa := G:-ConvertIn(A);
T := x;
tt := G:-ConvertIn(T);
XA := G:-`+`(aa, tt);
Z := aa;
for i to 127 do
    aa := G:-`*`(aa, aa);
    XA := G:-`+`(G:-`*`(XA, tt), G:-`*`(XA, aa))
end do;
y := x^128+x^7+x^2+x+1;
XA := G:-ConvertOut(XA);
result := `mod`(y+XA, 2);
References
[1] 128 bit XOR intrinsic function usage.
    http://msdn.microsoft.com/en-us/library/fzt08www.aspx. Accessed: 16/08/2012.
[2] PCLMULQDQ intrinsic function usage.
    http://msdn.microsoft.com/en-us/library/cc664767.aspx. Accessed: 15/08/2012.
[3] Using AES-GCM IETF DRAFT.
    http://tools.ietf.org/html/draft-ietf-smime-cms-aes-ccm-and-gcm-00.
    Accessed: 16/08/2012.
[4] J. Bajard, L. Imbert, and G. Jullien. Parallel Montgomery multiplication
    in GF(2^k) using trinomial residue arithmetic. In 17th IEEE Symposium on
    Computer Arithmetic (ARITH), 2005.
[5] Tianshan Chen, Wenjie Huo, and Zhenglin Liu. Design and Efficient FPGA
    Implementation of Ghash Core for AES-GCM. In Computational Intelligence
    and Software Engineering (CiSE), pages 1-4. IEEE, 2010.
[6] Jean-Sebastien Coron, Yevgeniy Dodis, Cecile Malinaud, and Prashant Puniya.
    Merkle-Damgård revisited: How to construct a hash function. pages 430-448.
    Springer-Verlag, 2005.
[7] Jeremie Crenne, Pascal Cotret, Guy Gogniat, Russell Tessier, and
    Jean-Philippe Diguet. Efficient key-dependent message authentication in
    reconfigurable hardware. In FPT'11, pages 1-6, 2011.
[8] Ashwini M. Deshpande, Mangesh S. Deshpande, and Devendra N. Kayatanavar.
    FPGA Implementation of AES Encryption and Decryption. In International
    Conference on Control, Automation, Communication and Energy Conservation,
    2009.
[9] Mohamed Abo El-Fotouh and Klaus Diepold. Galois Substitution Counter
    Mode (GSCM). In Enterprise Distributed Object Computing Conference
    Workshops, 12th, pages 199-206. IEEE, 2008.
[10] AJ Elbirt. Fast and Efficient Implementation of AES Via Instruction Set
    Extensions. In 21st International Conference on Advanced Information
    Networking and Applications Workshops (AINAW'07). IEEE, 2007.
[11] Bulens et al. Implementation of the AES-128 on Virtex-5 FPGAs. In Progress
    in Cryptology - AFRICACRYPT, 2008.
[12] Chetna Sangwan et al. VLSI Implementation of Advanced Encryption
    Standard. In 2012 Second International Conference on Advanced Computing &
    Communication Technologies. IEEE, 2012.
[13] Vinodh Gopal et al. Optimized Galois Counter Mode Implementation on Intel
    Architecture Processors. White paper, Intel Corporation, 2010.
[14] T. Good and M. Benaissa. AES on FPGA from the fastest to the smallest. In
    Cryptographic Hardware and Embedded Systems - CHES, 2005.
[15] Jorge Guajardo, Sandeep S. Kumar, Christof Paar, and Jan Pelzl. Efficient
    Software-Implementation of Finite Fields with Applications to Cryptography.
    In Acta Appl Math. Springer Science, 2006.
[16] Shay Gueron and Michael Kounavis. Efficient implementation of the Galois
    counter mode using a carry-less multiplier and a fast reduction algorithm.
    pages 549-553. Elsevier North-Holland, Inc., 2010.
[17] Shay Gueron and Michael E. Kounavis. Intel Carry-Less Multiplication
    Instruction and its Usage for Computing the GCM Mode. White paper, Intel
    Corporation, 2010.
[18] Krzysztof Jankowski and Pierre Laurent. Packed AES-GCM Algorithm Suitable
    for AES/PCLMULQDQ Instructions. In IEEE Transactions on Computers. IEEE
    Computer Society, 2011.
[19] A. Karatsuba and Y. Ofman. Multiplication of Multidigit Numbers on
    Automata. In Soviet Physics Doklady, 1963.
[20] A. A. Karatsuba. The Complexity of Computations. In Proceedings of the
    Steklov Institute of Mathematics, 1995.
[21] Mehran Mozaffari Kermani and Arash Reyhani Masoleh. Efficient and High
    Performance Parallel Hardware Architectures for the AES-GCM. In IEEE
    Transactions on Computers. IEEE, 2011.
[22] Chae Hoon Lim and Pil Joong Lee. More flexible exponentiation with
    precomputation. In CRYPTO, pages 95-107, 1994.
[23] Kuan Jen Lin, Chin-Mu Hsiao, and Ching Hung Jhan. Exploring HW/SW
    Codesign of AES Algorithm Using Custom Instructions. In The 13th IEEE
    International Symposium on Consumer Electronics (ISCE2009). IEEE, 2009.
[24] Julio Lopez and Ricardo Dahab. High-speed software multiplication in
    F_{2^m}. In INDOCRYPT'00, pages 203-212, 2000.
[25] Yang Lu, Guochu Shou, Yihong Hu, and Zhigang Guo. The Research and
    Efficient FPGA Implementation of GHASH Core for GMAC. In E-Business
    and Information System Security EBISS '09. IEEE, 2009.
[26] Arash Reyhani Masoleh and M. Anwar Hasan. Low Complexity Bit Parallel
    Architectures for Polynomial Basis Multiplication over GF(2^m). In IEEE
    Transactions on Computers. IEEE, 2004.
[27] E. D. Mastrovito. Multiplication of Multidigit Numbers on Automata. In PhD
    thesis, Dept. of Electrical Eng., Linköping Univ., Sweden, 1991.
[28] D.A. McGrew and J. Viega. The Galois/Counter Mode of Operation (GCM).
    In Submission to NIST Modes of Operation Process, 2004.
[29] Nicolas Meloni, Christophe Negre, and M. Anwar Hasan. High performance
    GHASH and impacts of a class of unconventional bases. In Journal of
    Cryptographic Engineering, pages 201-218. Springer-Verlag, 2011.
[30] A.J. Menezes, P.C. Van Oorschot, and S.A. Vanstone. Handbook of Applied
    Cryptography. CRC Press, 1997.
[31] Jorge Castineira Moreira and Patrick Guy Farrell. Essentials of
    Error-Control Coding. John Wiley & Sons Ltd., 2006.
[32] NIST. Recommendation for Block Cipher Modes of Operation: Methods and
    Techniques. NIST Special Publication 800-38A, 2001.
[33] NIST. Recommendation for Block Cipher Modes of Operation: Galois/Counter
    Mode (GCM) and GMAC. NIST Special Publication 800-38D, 2007.
[34] Christof Paar and Jan Pelzl. Understanding Cryptography: A Textbook for
    Students and Practitioners. Springer Science, 2010.
[35] Akashi Satoh. High-Speed Parallel Hardware Architecture for Galois Counter
    Mode. In IEEE International Symposium on Circuits and Systems, ISCAS
    2007. IEEE, 2007.
