Analysis of KECCAK Tree Hashing on GPU Architectures by Lowden, Jason Michael
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
6-2014 
Analysis of KECCAK Tree Hashing on GPU Architectures 
Jason Michael Lowden 
Recommended Citation 
Lowden, Jason Michael, "Analysis of KECCAK Tree Hashing on GPU Architectures" (2014). Thesis. 
Rochester Institute of Technology. Accessed from 




A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of
Science in Computer Engineering
Supervised by
Dr. Marcin Łukowiak
Department of Computer Engineering
Kate Gleason College of Engineering





Thesis Advisor, Department of Computer Engineering
Dr. Sonia Lopez Alarcon
Committee Member, Department of Computer Engineering
Dr. Stanisław P. Radziszowski
Committee Member, Department of Computer Science
Thesis Release Permission Form
Rochester Institute of Technology
Kate Gleason College of Engineering
Title:
Analysis of KECCAK Tree Hashing on GPU Architectures
I, Jason Michael Lowden, hereby grant permission to the Wallace Memorial Library to





I dedicate this thesis to my parents, Michael and Donna Lowden, my siblings, Stacy and
Matthew Lowden, and the rest of the family who have shown their support for this work.
Acknowledgments
I would like to thank Dr. Marcin Łukowiak for serving as my thesis advisor and
introducing me to the world of cryptography. I would also like to thank Dr. Sonia Lopez
Alarcon and Dr. Stanisław Radziszowski for serving on my thesis committee and
providing guidance along the way. Finally, I would like to thank all of the faculty and staff
who provided support during this journey.
Without the support of many people along the way, this work would not have been
possible. I would like to thank Colin Donahue, Elizabeth Fischer, and Samantha Kenyon
for keeping me focused on the work at hand. I am grateful for Ben Wheeler, Zack
Sigmund, Gordon Werner, and Matt Kelly for their encouragement along the way. I
extend a special thank you to Mallory Rauch and Brady Hrabovsky for their continued
support during the conclusion of this thesis. Finally, I would like to thank Kerrie Bondi for
her support, encouragement, and teaching of the lessons that cannot be learned in a
classroom. And last, but not least, I would like to thank Dr. Danielle Smith and the RIT
Honors Program for the numerous opportunities during my time here at RIT.
Abstract
Analysis of KECCAK Tree Hashing on GPU Architectures
Jason Michael Lowden
Supervising Professor: Dr. Marcin Łukowiak
In an effort to provide security and data integrity, hashing algorithms have been designed
to consume an input of any length to produce a fixed length output. KECCAK was selected
by NIST to become the next Secure Hashing Algorithm (SHA-3) after nearly five years of
competition. In addition to providing a sequential operating mode, there is also a tree mode
that allows large input messages to be hashed in parallel.
This thesis focuses on the exploration and analysis of the KECCAK tree hashing mode
on a GPU platform. Based on the implementation, there are core features of the GPU
that could be used to accelerate the time it takes to complete a hash due to the massively
parallel architecture of the device. In addition to analyzing the speed of the algorithm, the
underlying hardware is profiled to identify the bottlenecks that limited the speed.
The results of this work show that tree hashing can hash data at rates of up to 3 GB/s for
the fixed size tree mode. On a 3.40 GHz CPU, this is the equivalent of 1.03 cycles per byte,
more than six times faster than a sequential implementation for a very large input. For the
variable size tree mode, the throughput was 500 MB/s. Based on the performance analysis,
modification of the input rate of the KECCAK sponge resulted in a negligible change to the
overall speed. As a result of the hardware profiling, the register and L1 cache usage in the
GPU was a major bottleneck to the overall throughput. In a simulated GPU environment,
it was shown that increasing the L1 cache by 25 percent could increase the throughput by
up to 30 percent for a small tree and 15 percent for a tree that will achieve the greatest
throughput on a real GPU. When this modification is combined with an increase of the L2
cache, performance can be improved by up to 20 percent.
Contents
Dedication
Acknowledgments
Abstract
Acronyms
Glossary
1 Introduction
1.1 Motivation
1.2 Thesis Organization
2 The KECCAK Specification
2.1 Sponge Construction
2.2 Padding
2.3 Algorithm
2.3.1 Theta (θ)
2.3.2 Rho (ρ) and Pi (π)
2.3.3 Chi (χ)
2.3.4 Iota (ι)
2.4 Tree Hashing
2.4.1 Encoding and Padding
2.5 Modifications for SHA-3 Standardization
3 Implementation Architectures
3.1 CUDA Enabled Graphics Processing Units (GPUs)
3.1.1 Processing Architecture
3.1.2 Memory Architecture
3.2 Cycle-Accurate GPU Simulator
3.3 Field Programmable Gate Arrays (FPGAs)
4 Supporting Work
4.1 Software
4.1.1 CPU Sequential Implementations
4.1.2 CPU SSE2 Tree Hashing
4.1.3 GPU-Based Hashing
4.2 FPGA-Based Hardware
4.2.1 KECCAK-defined Processing Cores
4.2.2 Sequential
4.2.3 Pipelined
5 Design Methodology
5.1 High Level Design
5.1.1 Additional Types of Parallelism
5.2 Kernel Architecture
5.2.1 Leaf Kernels
5.2.2 Internal Processing Kernels
5.2.3 Top Level Processing
5.3 Memory Architecture
5.3.1 GPU Memory Use
5.3.2 Efficient Memory Design for Low-End GPUs
6 Results and Analysis
6.1 Hashing Speed using Page-Locked Memory
6.1.1 CPU Clock Cycle Conversion
6.2 Hashing Speed using Page-Locked Memory Rate Comparison
6.3 GPU Memory Use Effect on Hash Speed
6.4 Kernel Level Timing Analysis
6.5 Hardware Profiling
7 Simulated Cryptographic GPU Design
7.1 Registers
7.2 L1 Cache
7.3 L2 Cache
7.4 L1 and L2 Cache Combined
8 Conclusions
9 Future Work
Bibliography
A Hashing Performance Results
List of Tables
2.1 KECCAK ρ Offsets
2.2 KECCAK r Matrix Constants
2.3 KECCAK Round Constants
2.4 Standardized KECCAK Modes
3.1 Constant Memory Microbenchmark Memory Access Pattern
4.1 KECCAK Hashing Speeds [9]
4.2 SSE2 Implementation Performance [6]
4.3 SSE2 Tree Implementation Performance [6]
4.4 Tree Hash Mode Performance Results [27]
4.5 GPU Tree Hashing Results [13]
4.6 CPU Sequential Hashing Reference Results [13]
4.7 FPGA Implementation Summary
4.8 ModelSim Performance Estimates [6]
4.9 KECCAK FPGA Performance [21]
4.10 FPGA Pipeline Data
6.1 Testing Platform
6.2 Ideal Tree Hashing Operating Points
6.3 Relative Performance Comparison
6.4 Comprehensive Relative Performance Comparison
6.5 Maximum Hash Speeds using Different Memories
6.6 Minimum Required Data Sizes for Adequate Performance
6.7 Percentage of Kernel Execution Times for First Two Kernel Launches
6.8 Kernel Throughput for Fixed Tree Height
6.9 Compute Capability 3.5 Kernel Resource Requirements
7.1 Limiting Factors for Thread Execution
List of Figures
2.1 Sponge Construction [7]
2.2 Graphical Representation of the KECCAK Internal State [8]
2.3 Naming conventions for parts of the KECCAK-f state [8]
2.4 KECCAK θ [8]
2.5 KECCAK ρ [8]
2.6 KECCAK π [8]
2.7 KECCAK χ [8]
2.8 KECCAK LI Tree, Height = 3
2.9 KECCAK FNG Tree, Height = 3
2.10 KECCAK FNG Tree with Extra Leaves
2.11 KECCAK FNG Tree, Height = 2
2.12 KECCAK FNG Tree, Height = 4
2.13 KECCAK FNG Tree, Height = 5
3.1 CUDA Fermi Streaming Multiprocessor (SM) [15]
3.2 CUDA Grid Hierarchy
3.3 CUDA Memory Hierarchy [14]
3.4 CUDA Fermi Cache Hierarchy [15]
3.5 CUDA Fermi Constant Cache Overview
3.6 CUDA Fermi Constant Cache L1 and L2 Parameter Identification
3.7 CUDA Kepler Constant Cache Overview
3.8 CUDA Kepler Constant Cache L1 and L2 Parameter Identification
3.9 Accuracy of Simulator to Fermi Architecture [1]
3.10 High Level Architecture of Simulator [1]
3.11 Streaming Processor Core Architecture [1]
3.12 Memory Subsystem Architecture [1]
3.13 Basic FPGA Fabric Visualization
3.14 Virtex-7 SLICEL [32]
4.1 Basic Tree Hash Mode [27]
4.2 Second Tree Hash Mode - Secure [27]
4.3 KECCAK Round Parallelism Performance [12]
4.4 KECCAK High-Speed Core [11]
4.5 KECCAK Low Area Co-processor [11]
4.6 Hash Function Wrapper [4]
4.7 Common SHA-3 Interface [21]
4.8 KECCAK Pipeline Stages [20]
5.1 High Level Description of Tree Mode Hash Functions
5.2 High Level Kernel Architecture
5.3 Tree Hashing Kernel Launch Sequence
5.4 Kernel Execution Timeline
5.5 Efficient Tree Layout in Memory
5.6 Pages of Memory for Sequential Processing
6.1 Hash Performance using LI Mode
6.2 Block Size Variations
6.3 Hash Performance using FNG Mode
6.4 Normalized Hash Speeds for Different KECCAK Rates
6.5 Performance Comparison of Hash Speeds with Associated Overhead using Global Memory
6.6 Performance Comparison of Hash Speeds with Associated Overhead using Page-Locked Memory
6.7 Minimum Required Data Size for Hashing
6.8 Leaf Kernel Execution Time Distribution
6.9 Kernel Performance for Small Tree Size
6.10 First Two Kernel Throughputs for a Tree with Degree = 2
6.11 Kernel Throughputs for Varying Heights with Degree = 2
6.12 Kernel Throughputs for Varying Heights and Degrees
6.13 Average Percentage of Misses to Global Memory
6.14 Comparison of Block Size and Cache Miss Rate for Degree = 2
6.15 Stall Percentages for Data Requests
6.16 Stall Percentages for Data Requests
7.1 Register Modifications
7.2 Register Modifications
7.3 L1 Cache Modifications
7.4 L2 Cache Modifications
7.5 Detailed L2 Cache Modifications
7.6 L1 and L2 Combined Modifications for Degree = 10
7.7 L1 and L2 Combined Modifications for Degree = 2
7.8 L1 and L2 Combined Modifications for Degree = 3
Acronyms
CUDA Compute Unified Device Architecture
FNG Final Node Growth
LI Leaf Interleaving





B The block size, in bits, that is processed at one time by each leaf node in the hash tree.
For LI mode, this is the number of bits that are read before moving to the next section
to be processed; for FNG mode, this is the maximum number of bits that will be
processed by each leaf.
Compute Unified Device Architecture An extension to the C programming language designed
by NVIDIA to program general purpose computations on a GPU
D The fixed degree for the hash tree that specifies how many child nodes each parent node
will have
Digest The output of a hashing function
G The tree growth mode that is used to define the size of the tree
H The number of layers of nodes in the hash tree below the root; a height of zero implies
that the tree consists of a single node
Kernel A group of threads that are launched simultaneously to perform computation on a
GPU
R The variable degree for the child nodes of the parent nodes which is determined from
the fixed degree (D), height (H), and the length of the input message
Special Function Unit A special core within a SM that computes special functions, such
as sine, cosine, etc.
Streaming Multiprocessor A group of SPs, SFUs, caches, and schedulers that work to
execute a GPU kernel
Streaming Processor A light-weight processing core with an integer and a floating point
pipeline to perform the actual execution of kernel instructions
Warp A group of 32 threads that execute in lock-step; conditional branches may cause




Hash functions accept an input of arbitrary length and produce an output of fixed length
[28]. Since the introduction of GPUs in heterogeneous computing, there have been a number
of attempts to convert slow, sequential applications into high-speed parallel applications.
The use of GPUs in cryptography is limited by the sequential nature of the operations.
Hash functions are unique in the way that an output is generated: a message is broken
into a number of blocks, and the hash function consumes each block of the message into
an internal state, with the final output produced after the last block is consumed.
Because each block is absorbed into the state left by the previous one, it is difficult to
parallelize the hash function itself. Tree hashing, however, runs many instances of the
hash function in parallel in order to reduce the time it takes to hash large input messages.
In addition to providing a unique signature for a given input, hashing has two
applications in the field of digital forensics. First, a hash function can be used with a
database of known good or bad files to determine whether there are files that can be
ignored as part of an investigation. Second, hashing can be used to determine whether a
physical medium has been tampered with [2]. By hashing the image before an investigation
begins, it can be shown that nothing has been modified if the hash computed after the
investigation is the same. In this case, the inputs could range from a few bytes to a few
terabytes, and a sequential hash function is not the best choice. A function with a tree
hashing mode could be used to significantly reduce the amount of time that is required to
compute the hash.
1.1 Motivation
When NIST announced KECCAK as the winner of the SHA-3 competition in October 2012, a
number of sequential CPU and FPGA-based implementations were available. However, only a
couple of published works detail a GPU-based implementation. This thesis provides a
greater understanding of the performance limitations that exist in the implementation of
the KECCAK hashing function on a CUDA-enabled GPU. To give a realistic picture, this
thesis analyzes both tree modes defined in the KECCAK specification [6]. To evaluate the
implementation for speed and other performance metrics, both tree modes were executed on
an NVIDIA Tesla GPU while varying a number of the tree parameters. Finally, the execution
results are used to modify a simulated GPU architecture in an effort to improve resource
utilization and system throughput for a GPU that would be designed for a cryptographic
application.
1.2 Thesis Organization
This thesis begins with a description of the KECCAK algorithm in Chapter 2 and a detailed
overview of the potential platforms on which it could be implemented in Chapter 3.
Chapter 4 describes the previous implementations that have been designed, ranging from
CPU and GPU implementations to FPGA-based designs. Chapter 5 presents the structure and
methodology of the solution, which was written using the CUDA C/C++ extensions. The
results are presented in Chapter 6 with a detailed discussion of the overall results, in
addition to low-level hardware details obtained by profiling the application. Chapter 7
presents the work of modifying a simulated GPU architecture to further improve the
performance of the hash function. Chapter 8 presents the conclusion
about all of the results that were obtained and provides some insights into a tree hashing





On November 2, 2007, the National Institute of Standards and Technology (NIST) announced
a competition to create the next Secure Hashing Algorithm (SHA-3). The reason for the
competition was to find a complement to the SHA-1 and SHA-2 algorithms. Recent
publications had shown that SHA-1 has some serious vulnerabilities [23]. Since SHA-2 has
a similar design to SHA-1, there was also a concern that it may be compromised by similar
attacks. The SHA-3 competition consisted of three rounds in which the designs were
available for scrutiny by the cryptography community and at various conferences around
the world. As a result of this scrutiny, the selection pool was reduced for various
reasons, such as cryptographic vulnerabilities or performance considerations. By October
31, 2008, sixty-four candidates had been submitted to the competition. In December 2008,
fifty-one candidates were selected for the first round, and the second round of fourteen
candidates was announced in July 2009. For the third and final round of the competition,
BLAKE, Grøstl, JH, KECCAK, and Skein were selected in December 2010 [24]. On October 2,
2012, KECCAK was announced as the winner of the competition [23]. On April 7, 2014, NIST
released a draft of Federal Information Processing Standard (FIPS) 202, which details the
standardized hashing algorithm [25]; Section 2.5 provides an overview of the changes that
were made in the standardized document.
The name KECCAK can refer to one of two things. The first is the underlying construction
that manipulates the bits to perform the hash function in the 24 rounds of operations;
these operations are done on a fixed block of the input. The second is the overall hash
function that takes a message of arbitrary length and produces a fixed size output. When
the internal function is being referred to, KECCAK defines the order of operations that
are used. After this section, the second definition of KECCAK will be used to refer to
the hash function.
2.1 Sponge Construction
The KECCAK function is defined as a sponge function because of the way that input is
absorbed and output is squeezed. Figure 2.1 illustrates the general concept of a sponge.
The first step in using the sponge is to pad the input to a multiple of the specified
input rate. Section 2.2 provides a detailed explanation of the padding scheme. The sponge
absorbs a block of input using an exclusive OR operation, runs the KECCAK algorithm
specified in Section 2.3, and repeats this processing until all of the input blocks are
absorbed. In Figure 2.1, each f represents one execution of the KECCAK function. Once all
of the input blocks are absorbed, the sponge is able to produce an output of any length
by squeezing [7], which is one of the benefits of the sponge construction. At each stage
of squeezing, a fixed number of bits can be generated, where the number of bits in one
block of output is equal to the size of an input block. If the requested output length is
greater than the number of bits that can be generated in one stage, the KECCAK function
is applied to the state again, without combining it with any other input [7]. There are
no differences in the operation of the KECCAK function, f, between the absorbing and the
squeezing phases.
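The absorbing and squeezing phases can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis: the permutation `toy_f` is a hypothetical stand-in for the KECCAK function f with no cryptographic strength, the state is handled in bytes rather than bits, and the input is assumed to be already padded to a multiple of the rate.

```python
def toy_f(state: bytes) -> bytes:
    """Hypothetical stand-in for the KECCAK function f.

    A simple byte mix used only to show the data flow of the sponge;
    it offers no security whatsoever.
    """
    out = bytearray(state)
    n = len(out)
    for _ in range(3):
        for i in range(n):
            out[i] = (out[i] + out[(i - 1) % n] * 7 + i + 1) % 256
    return bytes(out)


def sponge(padded: bytes, rate: int, capacity: int, out_len: int, f=toy_f) -> bytes:
    """Absorb `padded` in blocks of `rate` bytes, then squeeze `out_len` bytes."""
    assert len(padded) % rate == 0, "input must already be padded"
    state = bytes(rate + capacity)  # the state starts as all zeros

    # Absorbing: XOR each block into the first `rate` bytes, then apply f.
    for offset in range(0, len(padded), rate):
        block = padded[offset:offset + rate]
        mixed = bytes(s ^ m for s, m in zip(state[:rate], block))
        state = f(mixed + state[rate:])

    # Squeezing: read `rate` bytes, applying f again whenever more are needed.
    out = state[:rate]
    while len(out) < out_len:
        state = f(state)
        out += state[:rate]
    return out[:out_len]
```

Requesting more than `rate` bytes of output triggers additional applications of the function without any new input, which matches the squeezing behavior described in the text.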
Figure 2.1: Sponge Construction [7]
As shown in Figure 2.1, references are made to parameters r and c in the sponge. These
values define the rate, r, at which input is processed and output is generated, in addition to
the security capacity, c, that the sponge is able to provide [8]. As the rate increases, input
can be absorbed faster, requiring fewer invocations of the KECCAK function. However, the
capacity will be reduced by the same amount, thereby reducing the underlying security.
The purpose of the capacity is to define how much of the internal structure will not be
directly generating output [7]. Equation 2.1 shows the relationship between rate, capacity,
and the internal state size, b.
r + c = b (2.1)
The hashing function KECCAK-f[b] is defined by selecting some value b that will specify
the number of bits that are present in the internal state. By definition, b is found within
the set of numbers of the form $5^2 \times 2^l$, where l is an integer in [0, 6]. For the
SHA-3 competition, the internal state size was set to be 1600 bits, where l = 6 [5]. As the
KECCAK function is performed on the input message, an internal state is maintained during
the process. Figure 2.1 shows the rectangular boxes filled with 0 on the left to illustrate
this state. Figure 2.2 shows how the state is made up of a 3-dimensional grid of bits. The
X and Y dimensions of the state will always be 5 bits, but the size of the Z dimension is
dependent on the mode that is used. The Z dimension is referred to as a lane, and the size
of the lane is given as $2^l$. A complete breakdown of the state is shown in Figure 2.3. A
register of $2^l$ bits is the ideal size for the operations to be performed. In the SHA-3
competition, the ideal register size would be 64 bits.
2.2 Padding
The algorithm is designed to implement a 10*1 bit padding scheme, meaning that there is
a 1 followed by a suitable number of 0’s, and finally another 1 [8]. The definition from the
KECCAK Reference manual is shown in Definition 2.1. Depending on the size of the input
and the rate that is used, the padding scheme will append a variable number of bits. In the
best case, padding would append two bits, 0b11; this would occur when the input block is
two bits less than the absorption rate. In the worst case, an entire block of padding could
be generated, meaning that the hash function would have to be called two times for a single
block of input. The worst case padding would occur when the input block is one bit less
than the input rate. For this case, 0b1 would be appended to the input block, the hash would
be performed on that block, and then another input block, represented as $0^{r-1}1$, would
be hashed.
Figure 2.2: Graphical Representation of the KECCAK Internal State [8]
Definition 2.1. Multi-rate padding, denoted by pad10*1, appends a single bit 1 followed
by the minimum number of bits 0 followed by a single bit 1 such that the length of the
result is a multiple of the block length [8].
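Definition 2.1 can be expressed as a short Python sketch. The function name `pad10star1` and the representation of the message as a list of bits are assumptions made for illustration; they are not taken from the implementation in this thesis.

```python
def pad10star1(msg_bits, rate):
    """Append 1, the minimum number of 0 bits, and a final 1.

    `msg_bits` is a list of 0/1 values and `rate` is the block length in
    bits. The padded length is always a multiple of `rate`.
    """
    # Number of zeros needed after accounting for the two mandatory 1 bits.
    zeros = (-(len(msg_bits) + 2)) % rate
    return list(msg_bits) + [1] + [0] * zeros + [1]


rate = 8
best = pad10star1([0] * (rate - 2), rate)   # two bits short of a block
worst = pad10star1([0] * (rate - 1), rate)  # one bit short of a block
print(len(best), len(worst))
```

With a rate of 8 bits, the first call appends only the two bits 11 and fills a single block, while the second call spills into a second block that consists of seven 0 bits and a final 1, matching the best and worst cases described in Section 2.2.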
2.3 Algorithm
In order to securely obfuscate the input, the KECCAK algorithm is designed to run five
different operations using a fixed number of rounds. Regardless of the round, these
operations perform the same transformations on the state of the sponge construction [8].
As shown in Equation 2.2, the number of rounds, $n_r$, increases linearly with the value
of l that was selected to determine the size of the state.
\[ n_r = 12 + 2l \tag{2.2} \]
The five operations are performed in the following order in every round: theta (θ), rho (ρ),
pi (π), chi (χ), iota (ι). The details of each of the operations are shown in the following
sections.
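As a small worked example, the lane size, state size, and round count can be tabulated for every permitted value of l. The helper name `keccak_f_parameters` is hypothetical; the formulas are the ones given in Section 2.1 and Equation 2.2.

```python
def keccak_f_parameters(l):
    """Return (lane size in bits, state size b in bits, number of rounds)."""
    w = 2 ** l          # lane size (Z dimension)
    b = 25 * w          # 5 x 5 lanes make up the state
    n_r = 12 + 2 * l    # Equation 2.2
    return w, b, n_r


for l in range(7):
    w, b, n_r = keccak_f_parameters(l)
    print(f"l={l}: lane={w} bits, state b={b} bits, rounds={n_r}")
```

For l = 6 this gives the 64-bit lane, 1600-bit state, and 24 rounds used for the SHA-3 competition.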
Figure 2.3: Naming conventions for parts of the KECCAK-f state [8]
Figure 2.4: KECCAK θ [8]
2.3.1 Theta (θ)
The purpose of the θ step is to diffuse the input by performing a bit-wise summation over
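the nearby bits of the state. Equation 2.3 shows the mathematical operation that is being
performed.
\[ a[x][y][z] \leftarrow a[x][y][z] + \sum_{y'=0}^{4} a[x-1][y'][z] + \sum_{y'=0}^{4} a[x+1][y'][z-1] \tag{2.3} \]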
The implementation of this step is based on a summation over GF(2), which translates to
an exclusive OR operation in hardware and software. Equation 2.4 describes the way that
this step is implemented. The variables C and D are intermediate variables, while A is the
state of the sponge. For each operation, the state is read at the beginning and the
computed values are written back to the state at the end of the same operation. The ROT
operation is a bit-wise cyclic shift (rotation) of the first parameter by the number of
bits specified in the second parameter, and the index arithmetic on x is performed modulo
5. Using this method, all of the operations occur on lanes of the state, which provides a
benefit when it comes to mapping the operations to registers in the desired architecture.
C[x] = A[x, 0]⊕ A[x, 1]⊕ A[x, 2]⊕ A[x, 3]⊕ A[x, 4],∀x ∈ [0, 4]
D[x] = C[x− 1]⊕ROT (C[x+ 1], 1), ∀x ∈ [0, 4]
A[x, y] = A[x, y]⊕D[x],∀(x, y) ∈ ([0, 4], [0, 4])
(2.4)
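Equation 2.4 maps directly onto lane-wise operations. The following Python sketch assumes 64-bit lanes stored as integers in a 5 × 5 list indexed as `A[x][y]`; the names `rot` and `theta` are chosen for illustration only.

```python
def rot(lane, n, w=64):
    """Cyclically rotate a w-bit lane left by n bits."""
    n %= w
    return ((lane << n) | (lane >> (w - n))) & ((1 << w) - 1)


def theta(A):
    """Apply the theta step to a 5x5 state of 64-bit lanes, indexed A[x][y]."""
    # Column parities: XOR of the five lanes that share the same x.
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    # Combine the neighboring columns; x - 1 and x + 1 wrap around modulo 5.
    D = [C[(x - 1) % 5] ^ rot(C[(x + 1) % 5], 1) for x in range(5)]
    # XOR the result into every lane of the column.
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
```

A single set bit in lane (0, 0) is spread by this step into every lane of the columns x = 1 and x = 4, which illustrates the diffusion that θ is intended to provide.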
2.3.2 Rho (ρ) and Pi (π)
The purpose of the ρ step is to provide diffusion between the slices of the state by
applying a rotation to each of the lanes. The operation is mathematically represented in
Equation 2.5 and graphically in Figure 2.5. The center of the figure represents x = y = 0.
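\[ a[x][y][z] \leftarrow a[x][y][z - (t+1)(t+2)/2] \tag{2.5} \]
with t satisfying $0 \le t \le 23$ and
\[ \begin{pmatrix} 0 & 1 \\ 2 & 3 \end{pmatrix}^{t} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix} \text{ in } GF(5)^{2 \times 2}, \]
or t = −1 if x = y = 0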
Figure 2.5: KECCAK ρ [8]
The operation of ρ is done by using offsets to calculate the number of positions that have
to be rotated. Table 2.1 gives the offsets that are used as part of the operation. The values
in this table are generated by an iterative equation that is derived from Equation 2.5.
Table 2.1: KECCAK ρ Offsets
        x = 3   x = 4   x = 0   x = 1   x = 2
y = 2     153     231       3      10     171
y = 1      55     276      36     300       6
y = 0      28      91       0       1     190
y = 4     120      78     210      66     253
y = 3      21     136     105      45      15
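The iterative generation of these offsets can be sketched in Python. The function name `rho_offsets` is hypothetical. The loop starts at (x, y) = (1, 0) for t = 0 and applies the matrix from Equation 2.5 once per iteration, which maps (x, y) to (y, 2x + 3y) modulo 5.

```python
def rho_offsets():
    """Return a dict mapping (x, y) to the rho offset (t + 1)(t + 2) / 2."""
    offsets = {(0, 0): 0}   # the special case t = -1 gives an offset of 0
    x, y = 1, 0             # position reached for t = 0
    for t in range(24):
        offsets[(x, y)] = (t + 1) * (t + 2) // 2
        # Multiply (x, y) by the matrix [[0, 1], [2, 3]] over GF(5).
        x, y = y, (2 * x + 3 * y) % 5
    return offsets


offsets = rho_offsets()
print(offsets[(3, 2)], offsets[(1, 1)], offsets[(2, 0)])  # 153 300 190
```

The printed values agree with the entries of Table 2.1 for the positions (x, y) = (3, 2), (1, 1), and (2, 0).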
The purpose of the π step is to provide dispersion to help with long-term diffusion [8].
Equation 2.6 shows the matrix operation that is performed on the lanes of the state. Figure
Figure 2.6: KECCAK π [8]
2.6 shows the graphical representation of the transformation. Since all of the operations










The implementation of this step is shown in Equation 2.7. It is more efficient to combine the
ρ and π steps together, rather than perform the matrix operations individually. The ROT
operation is the same bit-wise rotation that was shown in the θ step in Equation 2.4. When
these steps are combined, the number of bits by which each lane is rotated depends on its
position, as defined by the rotation matrix, r. The values of the matrix can be determined
by solving the equations above. Additionally, Table 2.2 shows the rotation amounts for each
pair of x and y values. Comparing Table 2.2 with Table 2.1 shows that the number of bits to
rotate by is the corresponding value in Table 2.1 mod $2^l$, which is mod 64 for the
1600-bit state. The state, A, is read and the variable B is generated, which will be used
in a later step.
B[y, 2x+ 3y] = ROT (A[x, y], r[x, y]),∀(x, y) ∈ ([0, 4], [0, 4]) (2.7)
Table 2.2: KECCAK r Matrix Constants
         x = 3   x = 4   x = 0   x = 1   x = 2
y = 2      25      39       3      10      43
y = 1      55      20      36      44       6
y = 0      28      27       0       1      62
y = 4      56      14      18       2      61
y = 3      21       8      41      45      15
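A combined ρ and π step following Equation 2.7 might look like the sketch below. It assumes
64-bit lanes, a state indexed as A[x][y], and the constants of Table 2.2 transcribed into a
list R[x][y]; the names R, rot, and rho_pi are illustrative.

```python
# Rotation constants from Table 2.2, stored as R[x][y] for x, y in 0..4.
R = [
    [0, 36, 3, 41, 18],    # x = 0
    [1, 44, 10, 45, 2],    # x = 1
    [62, 6, 43, 15, 61],   # x = 2
    [28, 55, 25, 21, 56],  # x = 3
    [27, 20, 39, 8, 14],   # x = 4
]


def rot(lane, n, w=64):
    """Cyclic left rotation of a w-bit lane by n bit positions."""
    n %= w
    return ((lane << n) | (lane >> (w - n))) & ((1 << w) - 1)


def rho_pi(A):
    """Return B with B[y][(2x + 3y) mod 5] = ROT(A[x][y], R[x][y])."""
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rot(A[x][y], R[x][y])
    return B


if __name__ == "__main__":
    ones = [[1] * 5 for _ in range(5)]
    for x, column in enumerate(rho_pi(ones)):
        print(x, [hex(lane) for lane in column])
```

Each source lane is written to exactly one destination lane, so B can be filled in a single
pass without a temporary copy of A.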
2.3.3 Chi (χ)
The χ step is one of the most important functions in KECCAK since it is the only non-linear
step; Equation 2.8 mathematically represents the operation. The non-linearity protects the
hash function from being broken by linear cryptanalysis. Without the presence of a non-linear
step, the entire round would be linear. While this step is non-linear, it remains invertible.
However, the complexity of computing the inverse operation is greatly increased [8]. Figure 2.7
shows the pictorial representation using AND and NOT gates.
a[x]← a[x] + (a[x+ 1] + 1)a[x+ 2] (2.8)
The non-linear part of the function is shown in Equation 2.9 by the AND and NOT opera-
tions on all of the lanes of the state. The intermediate variable B is used from the previous
operation, as described in Equation 2.7 of Section 2.3.2.
A[x, y] = B[x, y]⊕((NOT B[x+1, y])AND B[x+2, y]),∀(x, y) ∈ ([0, 4], [0, 4]) (2.9)
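Equation 2.9 can be sketched directly on integer lanes. The sketch assumes 64-bit lanes and a
state indexed as B[x][y]; because Python integers are unbounded, the NOT operation is masked to
the lane width. The function name chi is illustrative.

```python
def chi(B, w=64):
    """Apply Equation 2.9 to a 5x5 state B indexed B[x][y]."""
    mask = (1 << w) - 1
    return [
        [
            # NOT is restricted to w bits before the AND with the lane at x + 2.
            B[x][y] ^ ((~B[(x + 1) % 5][y] & mask) & B[(x + 2) % 5][y])
            for y in range(5)
        ]
        for x in range(5)
    ]


if __name__ == "__main__":
    state = [[0] * 5 for _ in range(5)]
    state[2][0] = 1
    print([column[0] for column in chi(state)])
```

The AND between two state lanes is the source of the non-linearity: changing B[x + 1, y] only
affects the output where B[x + 2, y] has bits set.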
2.3.4 Iota (ι)
The ι step is the only operation that uses a different value depending on the round. The
purpose of this step is to disrupt the symmetry that may exist. If the value were not added,
all of the rounds would perform identical operations on the data. Each round constant is a
unique value that ensures the data in the state is modified. Equation 2.10 mathematically
defines the operation.
Figure 2.7: KECCAK χ [8]
a ← a + RC[i_r] (2.10)
While this operation appears to be relatively simple in relation to the other operations, the
complexity of the operation is in generating the round constants that are used. Equation
2.11 shows the definition of the round constants. The first index, i_r, is the round number;
it is also used as an index into a multi-dimensional array that contains the constants. For
each round, the entire round constant is 0, except for lane (0, 0), whose bits are set by a
Linear Feedback Shift Register (LFSR). The LFSR function rc[t] is defined in Equation 2.12.
\[
RC[i_r][0][0][2^j - 1] = rc[j + 7 i_r], \quad \forall\, 0 \le j \le l \tag{2.11}
\]
\[
rc[t] = \left( x^t \bmod x^8 + x^6 + x^5 + x^4 + 1 \right) \bmod x \quad \text{in } GF(2)[x] \tag{2.12}
\]
Since these constants will be used for every round of every KECCAK function call, they can
be pre-computed and stored in memory. Table 2.3 shows all of the round constants that will
be used during the 24 rounds when l = 6. In the event that a smaller value of l is used, the
constants can be truncated to the required length. Once the table of constants is created, the
implementation is an exclusive OR of lane (0, 0) with the constant; Equation 2.13 shows
the equation for implementation.
Table 2.3: KECCAK Round Constants
Round Constant Round Constant
0 0x0000000000000001 12 0x000000008000808B
1 0x0000000000008082 13 0x800000000000008B
2 0x800000000000808A 14 0x8000000000008089
3 0x8000000080008000 15 0x8000000000008003
4 0x000000000000808B 16 0x8000000000008002
5 0x0000000080000001 17 0x8000000000000080
6 0x8000000080008081 18 0x000000000000800A
7 0x8000000000008009 19 0x800000008000000A
8 0x000000000000008A 20 0x8000000080008081
9 0x0000000000000088 21 0x8000000000008080
10 0x0000000080008009 22 0x0000000080000001
11 0x000000008000000A 23 0x8000000080008008
A[0, 0] = A[0, 0]⊕RC[round] (2.13)
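The round constants of Table 2.3 can be regenerated from Equations 2.11 and 2.12. The sketch
below is one possible way to do this, assuming the polynomial is held in an integer whose bit i
is the coefficient of x^i, so that the modulus x^8 + x^6 + x^5 + x^4 + 1 corresponds to 0x171.
The helper names rc_bit, round_constant, and iota are hypothetical.

```python
def rc_bit(t):
    """Equation 2.12: constant term of x^t mod (x^8 + x^6 + x^5 + x^4 + 1)."""
    r = 1                          # the polynomial 1, i.e. x^0
    for _ in range(t):
        r <<= 1                    # multiply by x
        if r & 0x100:              # degree reached 8: reduce by the modulus
            r ^= 0x171
    return r & 1                   # "mod x" keeps only the constant term


def round_constant(i_r, l=6):
    """Equation 2.11: build lane (0, 0) of the round constant for round i_r."""
    value = 0
    for j in range(l + 1):
        if rc_bit(j + 7 * i_r):
            value |= 1 << ((1 << j) - 1)   # bit position 2^j - 1
    return value


def iota(A, i_r):
    """Equation 2.13: XOR the round constant into lane (0, 0) of a copy of A."""
    out = [column[:] for column in A]
    out[0][0] ^= round_constant(i_r)
    return out


if __name__ == "__main__":
    for i_r in range(24):
        print(i_r, "0x%016X" % round_constant(i_r))
```

In a performance-oriented implementation the 24 values would normally be stored as a constant
table, as described above, rather than recomputed for every call.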
2.4 Tree Hashing
As illustrated in Section 2.3, the algorithm processes each block of the input message and
generates output after all of the blocks are absorbed. Since the underlying algorithm does
not change, multiple instances can be run in parallel to speed up the computation. There
are two types of parallel computation that can be done on multiple instances: multiple mes-
sage or single message. The multiple message approach would provide separate sponges
that have no relationship to each other. The single message method is known as tree hash-
ing because the results of each sponge are used to feed other sponges. The root of the tree
does not feed another sponge; it produces the output of the hash function. The KECCAK
specification provides two types of tree modes that can be used. After specifying the pa-
rameters of the tree, the number of nodes can be determined based on those parameters and
the length of the input message [6]. Each node of the tree contains a KECCAK sponge that
will process a unique portion of the input message. Within a given level of the tree, there
are no dependencies between the nodes. However, each level of the tree is dependent on the
level below it.
In order to use the KECCAK tree mode for hashing, the rate, r, and capacity, c, have to
be specified as before, and these parameters will apply to every node in the tree. The rate
controls how quickly the input is absorbed into the node. The capacity still serves the same
purpose for security as before, but it also specifies the number of bits that are output at each
node and used as input to the next node; the only node where this does not apply is the root,
where the selected output size is used.
In addition to the rate and capacity of the underlying sponge, the height, H, the degree, D,
the input processing block size, B, and the tree growth mode, G, have to be specified [6].
The height of the tree specifies the number of levels below the root node. The degree,
D, of the tree determines the number of children of each node. By definition of the tree,
H, D ∈ [0, 255] [6]. The tree growth mode G defines how the tree is organized. Because
there are two types of trees, two different degrees can be defined. However, the degree D
will be a common parameter for both tree modes; the other degree will be defined in detail
with the tree growth modes. The block size B is used to control how the input is distributed
to the leaves of the tree. B is constrained to be a multiple of 8 bits, with
B/8 ∈ [1, 2^16 − 1] [6].
The two tree growth modes are Leaf Interleaving, LI , and Final Node Growth, FNG. LI
mode will use the aforementioned parameters to create a fixed size tree that will work for
any input size. Therefore, the number of nodes in the tree is easily computed using Equation
2.14. In this mode, B specifies how much of the input will be processed at each leaf before
stepping to a new section of input. With this method, blocks of data are interleaved across
all of the leaves. Equation 2.16 describes how many bits of input are between the input








distanceBetweenBlocks = B × NumLeaves_LI bits (2.16)
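The LI layout can be illustrated with a short sketch. It assumes a complete tree in which every
node has D children, so that there are D^H leaves, and it assumes breadth-first node numbering
with the root as node 0; both assumptions are inferred from the example in this section rather
than taken from the specification. The function name li_tree is hypothetical.

```python
def li_tree(height, degree, block_bits, message_bits):
    """Return (nodes, leaves, distance, leaf_of_block) for an LI tree."""
    num_leaves = degree ** height                     # assumed complete tree
    num_nodes = sum(degree ** level for level in range(height + 1))
    first_leaf = num_nodes - num_leaves               # root is node 0
    num_blocks = -(-message_bits // block_bits)       # ceiling division
    # Blocks are dealt to the leaves in round-robin order (interleaving).
    leaf_of_block = [first_leaf + (k % num_leaves) for k in range(num_blocks)]
    distance = block_bits * num_leaves                # Equation 2.16
    return num_nodes, num_leaves, distance, leaf_of_block


if __name__ == "__main__":
    nodes, leaves, distance, mapping = li_tree(3, 2, 64, 1024)
    print("nodes:", nodes, "leaves:", leaves, "distance (bits):", distance)
    print("leaf for each 64-bit block:", mapping)
```

With B = 64 bits, H = 3, D = 2, and a 1024-bit input, each leaf receives two blocks whose
starting positions are 512 bits apart.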
A tree that uses the FNG mode will be a variable size tree that takes into account the length
of the input message, M, in addition to the other tree parameters. When this mode is used,
the height must be at least 1, and each leaf will only process the amount of data specified
by B bits [6]. As previously stated, this tree mode introduces a new degree, R, because of
its dependence on the length of the input. The parents of the leaves will have degree R,
whereas all of the higher level nodes will have degree D. The new degree is calculated











In order to illustrate the differences between the tree modes, a small example can be con-
sidered. For this example, assume that B = 64 bits, H = 3, and D = 2. The input will be
1024 bits in length. Using the equations shown above, the number of nodes in the trees
can be calculated. Figures 2.8 and 2.9 show G = LI and G = FNG, respectively. In
the figures, the input is shown below the tree and the node that processes the section of
input is given above the line. The major differences between the trees are the number of
leaves and the data that each node will process. Figure 2.8 shows that each leaf will process
two blocks of input, where the blocks being processed are separated by Equation 2.16. In
[Figure 2.8: LI tree for the example (H = 3, D = 2, B = 64 bits); input marked at bit offsets
0 to 1024 in steps of 64; leaf nodes 7 to 14 process the input]
[Figure 2.9 diagram: input marked at bit offsets 0 to 1024 in steps of 64; leaf nodes 7 to 22
each process one 64-bit block]
Figure 2.9: KECCAK FNG Tree, Height = 3
One of the most important constraints with the FNG tree mode is the number of leaves
that will be created. Because the degree R is based on the message length, it is possible
that extra leaves will be created that will process no input. This case is shown in Figure
2.10 where D = 2, H = 4, B = 64, and |M |= 1152.
When the |M |, D, and B are fixed, each tree that is constructed will have approximately
the same number of leaf nodes. When H is changed, extra leaves could be created that do
not process input. From Figure 2.9, there were sixteen leaf nodes to process the input. If
the height of the tree were changed to two, four, or five, the same number of leaf nodes are
[Figure 2.10: FNG tree with D = 2, H = 4, B = 64, and |M| = 1152; input marked at bit offsets
0 to 1152 in steps of 64; leaf nodes 15 to 32 each process one 64-bit block]
[Figure 2.11 diagram: input marked at bit offsets 0 to 1024 in steps of 64; leaf nodes 3 to 18
each process one 64-bit block]
Figure 2.11: KECCAK FNG Tree, Height = 2
Figure 2.13 shows a special set of inputs because the variable degree R is calculated to be 1.
When this occurs, extra processing occurs at the second level of the tree where no input is
combined from multiple leaf nodes. The height of the tree needs to be considered when
defining a general tree hashing parameter set with FNG mode in order to avoid this case.
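The effect of the height on an FNG tree can be explored with the sketch below. The formula used
for R, the ceiling of the number of B-bit blocks divided by D^(H−1), is an assumption that was
chosen because it reproduces the leaf counts and node numbers of the examples in this section;
it is not quoted from the specification. The function name fng_tree and the breadth-first
numbering with the root as node 0 are also assumptions.

```python
def fng_tree(height, degree, block_bits, message_bits):
    """Return the assumed FNG tree shape for the given parameters."""
    if height < 1:
        raise ValueError("FNG mode requires a height of at least 1")
    num_blocks = -(-message_bits // block_bits)       # one block per leaf
    parents = degree ** (height - 1)                  # nodes directly above the leaves
    r = -(-num_blocks // parents)                     # assumed definition of R
    num_leaves = r * parents
    first_leaf = sum(degree ** level for level in range(height))
    return {
        "R": r,
        "leaves": num_leaves,
        "first_leaf": first_leaf,
        "idle_leaves": num_leaves - num_blocks,       # leaves with no input
    }


if __name__ == "__main__":
    for h in (2, 3, 4, 5):
        print("H =", h, fng_tree(h, 2, 64, 1024))
    print("H = 4, |M| = 1152:", fng_tree(4, 2, 64, 1152))
```

Under this assumption, a 1152-bit message with H = 4 yields leaves that receive no input, and a
1024-bit message with H = 5 yields R = 1, which are the two cases discussed above.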
2.4.1 Encoding and Padding
At each node of the tree, there are some parameters that need to be encoded into the data
before the hash is performed. Because there are different methods of encoding, namespaces
[Figure 2.12: FNG tree with input marked at bit offsets 0 to 1024 in steps of 64; leaf nodes
15 to 30 each process one 64-bit block]
[Figure 2.13 diagram: input marked at bit offsets 0 to 1024 in steps of 64; leaf nodes 31 to 46
each process one 64-bit block]
Figure 2.13: KECCAK FNG Tree, Height = 5
the given namespace and some prefix or salt to the hash. The length of the prefix is selected
to be in the range of [0, 255] bytes. In an effort to reduce the complexity of input distribution
with this encoding, the length of the prefix can be selected to fill the remainder of the state.
The KECCAK Main Document defines the default namespace to be http://keccak.noekeon.org/tree/,
which has a length of 31 bytes [6]. By definition of the encoding, this is followed by eight
zero bits, for a total of 32 bytes, or 256 bits. The reduced complexity occurs when the length
of the prefix is r/8 − 32 bytes. Once the contents of the prefix are filled, r bits will be
ready to be hashed into the initialized internal state. Once this hash is performed, it can be
saved for later use by all nodes before the hashing of the actual data begins. Because this
hash is pre-computed, it serves as the initial state when the data is combined with the state
to produce output for other nodes.
Once all of the encoding is taken care of for the prefix block, the data that is hashed by each
node needs to be padded with some additional information. First of all, the data is padded to
be a multiple of 8 bits using the padding scheme in Section 2.2. This process will add up to
8 bits to the message; if it is already byte aligned, 0x81 will be added. After the alignment
padding is completed, the tree parameters are encoded by concatenation to the end of the
input message [6]. If the node is not the root node, then 0x00 will be appended, for a total
of one byte. The root node will encode tree parameters. The first value to be appended is
the tree mode: 0x00 when G = LI or 0x01 when G = FNG. The second parameter is
H encoded in 8 bits, followed by D encoded in 8 bits, and finally B/8 encoded in 16 bits.
Once the parameters are given, 0x01 is appended, for a total of six bytes. After the input
message is padded and the encoding parameters are added, the 10*1 padding scheme from
Section 2.2 needs to be applied once again to make the size of the block a multiple of r.
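The parameter bytes that follow the byte-aligned data can be sketched as below. The byte order
of the 16-bit B/8 field is not stated in this section, so the sketch exposes it as a parameter
with little-endian as an assumed default. The function names tree_suffix and
prefix_length_bytes are hypothetical, and the sketch covers only the parameter encoding, not the
alignment padding or the final 10*1 padding.

```python
def tree_suffix(is_root, mode="LI", height=0, degree=0, block_bits=8,
                byteorder="little"):
    """Return the bytes appended after the byte-aligned node data."""
    if not is_root:
        return b"\x00"                                  # non-root nodes: one byte
    mode_byte = {"LI": 0x00, "FNG": 0x01}[mode]
    return (
        bytes([mode_byte, height, degree])              # G, H, and D in 8 bits each
        + (block_bits // 8).to_bytes(2, byteorder)      # B/8 in 16 bits (order assumed)
        + b"\x01"                                       # closing byte of the root encoding
    )


def prefix_length_bytes(rate_bits, namespace_bytes=32):
    """Prefix length that fills the first r-bit block after the namespace."""
    return rate_bits // 8 - namespace_bytes


if __name__ == "__main__":
    print(tree_suffix(False).hex())
    print(tree_suffix(True, "FNG", 3, 2, 64).hex())
    print(prefix_length_bytes(1024))
```

For r = 1024 the prefix would be 96 bytes long, which lies inside the permitted range of
[0, 255] bytes.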
2.5 Modifications for SHA-3 Standardization
While the original KECCAK specification was written in general terms for a variety of
parameters, the FIPS document fixes some of the parameters. It is required that the size
of the state be 1600 bits, corresponding to l = 6. The output, or digest, sizes are fixed
and the capacity is set to twice the digest size. Table 2.4 provides a summary of the four
standardized modes of KECCAK, each called SHA3-d, where d is the size of the output in bits
[25]. In addition to the standard hashing modes, extendable-output functions (XOFs) are
defined that can be used to provide output of any desired length [25]. In Table 2.4, these
modes are represented by SHAKEz, where z is one-half of the capacity. SHAKE represents
the combination of Secure Hash Algorithm with KECCAK [25]. The SHA-3 standard is
designed to adhere to the Sakura coding scheme, which is used for domain separation and
chaining [10]. Currently, there are no standardized tree hashing modes; they are scheduled
to be specified in an additional document at a later time [25].
Table 2.4: Standardized KECCAK Modes
Function   Digest Size (bits)   Capacity (bits)   Rate (bits)   State Size (bits)
SHA3-224   224                  448               1152          1600
SHA3-256   256                  512               1088          1600
SHA3-384   384                  768                832          1600
SHA3-512   512                  1024               576          1600
SHAKE128   d                    256               1344          1600
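The relationship between the columns of Table 2.4 can be checked with a few lines of Python.
The sketch assumes only what is stated above: the state is 1600 bits, the capacity of SHA3-d is
twice the digest size, and the capacity of SHAKEz is twice z. The function names are
illustrative.

```python
STATE_BITS = 1600


def sha3_parameters(digest_bits):
    """Return (capacity, rate) in bits for SHA3-d."""
    capacity = 2 * digest_bits
    return capacity, STATE_BITS - capacity


def shake_parameters(z):
    """Return (capacity, rate) in bits for SHAKEz, where z is half the capacity."""
    capacity = 2 * z
    return capacity, STATE_BITS - capacity


if __name__ == "__main__":
    for d in (224, 256, 384, 512):
        print("SHA3-%d" % d, sha3_parameters(d))
    print("SHAKE128", shake_parameters(128))
```

The rate is always the part of the state that is not reserved as capacity, which is why a
larger digest size lowers the number of input bits absorbed per permutation call.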




3.1 CUDA Enabled Graphics Processing Units (GPUs)
Modern trends in scientific computing have moved towards a high performance environ-
ment that makes use of GPUs. The CUDA-C programming language provides an abstrac-
tion layer to programming an NVIDIA GPU. In these high performance environments,
thousands of threads are executed simultaneously. The ability of a GPU to perform this
type of work is described in detail in Section 3.1.1. CUDA provides two layers of abstraction
for programming. First, there is the low-level Driver API, which provides more direct
access to the underlying hardware [18]. These functions are more difficult to use, but the
additional control can have a significant effect on device utilization. On the other hand,
the CUDA Runtime API provides higher level abstractions [18]. All of the Runtime API
functions call Driver API functions, but the programming effort is reduced because the
runtime handles all of the complex initialization.
3.1.1 Processing Architecture
The GPUs are designed with many lightweight processing cores, called streaming proces-
sors (SPs), each of which can execute a single thread at a time. Each SP contains an integer
core and a floating point core that can be used [15]. Within the NVIDIA CUDA architec-
ture, a GPU is comprised of streaming multiprocessors (SM); each SM contains many SPs,
which can vary depending on the generation of the GPU. The Fermi architecture only con-
tained 48 per SM, whereas the Kepler architecture contains 192 [15, 16]. Figure 3.1 shows
the high level organization of a SM and SP. Within a SM, there are many other components
that are required. The register file is one of the most important components because it
provides fast access to data for all of the threads that are spawned. Additionally, the
load/store units provide access to external memory. The Special Function Units (SFUs)
provide the ability to perform complex mathematical operations, such as sine or cosine.
Figure 3.1: CUDA Fermi Streaming Multiprocessor (SM) [15]
GPUs are designed to spawn thousands of threads simultaneously, which are organized into
groups of threads known as blocks. Collectively, these blocks make up a grid, which is also
referred to as a kernel launch. Figure 3.2 shows the thread hierarchy for a given kernel
call. While it is shown as a two dimensional coordinate system, grids and blocks can be
defined in three dimensions. By providing this flexibility, the kernels can be configured to
map to the algorithm that is being analyzed. Even though each of the threads is assigned to
a block, the threads are also assigned to groups of 32 threads known as warps [18]. These
[Figure 3.2: 4 × 4 arrangement of threads labeled T0,0 (bottom left) through T3,3 (top right)]
Figure 3.2: CUDA Grid Hierarchy
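The relationship between a thread's coordinates in a block and its warp can be modeled in a few
lines. The sketch follows the linearization given in the CUDA programming guide, where the
thread at (x, y, z) in a block of size (Dx, Dy, Dz) has the ID x + y·Dx + z·Dx·Dy and warps are
formed from consecutive IDs. It is a host-side model written in Python for illustration, not
device code, and the function names are hypothetical.

```python
WARP_SIZE = 32


def linear_thread_id(thread_idx, block_dim):
    """Linear ID of a thread inside its block."""
    x, y, z = thread_idx
    dx, dy, _ = block_dim
    return x + y * dx + z * dx * dy


def warp_index(thread_idx, block_dim):
    """Index of the warp, within the block, that contains the thread."""
    return linear_thread_id(thread_idx, block_dim) // WARP_SIZE


def warps_per_block(block_dim):
    """Number of warps needed for one block (the last warp may be partial)."""
    dx, dy, dz = block_dim
    return -(-(dx * dy * dz) // WARP_SIZE)


if __name__ == "__main__":
    print(linear_thread_id((3, 3, 0), (4, 4, 1)))
    print(warp_index((0, 2, 0), (16, 16, 1)))
    print(warps_per_block((16, 16, 1)))
```

A 4 × 4 block such as the one in Figure 3.2 contains only 16 threads, so it occupies half of a
single warp.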
When the threads in a GPU are spawned, there are often many more threads than there are
SP cores to handle the processing. The warp scheduler in the SM is responsible for issuing
instructions that can execute simultaneously on the resources. The ability to run thousands
of threads comes from the fact that the SMs are able to context switch the threads when
one of them is waiting for a long latency operation, such as a global memory access. By
scheduling instructions from multiple warps, the instruction mix of memory operations and
compute operations can be interleaved to hide the latency of other instructions.
In order to effectively take advantage of all available hardware, the CUDA platform has
a concept of streams that allows work to be performed simultaneously. By default, any
CUDA operation that involves the actual GPU is launched on the default stream [26]. Even
though multiple streams can be created and used, the operations within a single stream are
still executed sequentially. When the streams are used effectively, computation can be
overlapped with communication, thereby reducing the overall execution time. In the Fermi
architecture, there was a single hardware queue that managed all of the work; the Kepler
architecture increased this limit to 32 queues [16]. With this design, multiple execution
engines can be utilized to reduce the amount of idle time when work is scheduled for GPU
execution.
3.1.2 Memory Architecture
One of the most important features of the GPU is the unique memory hierarchy that is
available. Similar to all CPU designs, there are registers that provide the fastest level of
access. However, the major difference is the number of registers that are present on a SM.
Even though a single thread may require only a few registers, the number of threads that are
spawned requires a total of thousands of registers. Due to this requirement, there can be up
to 65,536 registers in a single SM [16]. Until recent versions of the architecture, registers
were local to a given thread and they could not be accessed by any other threads. Beginning
with the Kepler architecture, warp shuffle instructions were introduced to allow register
sharing between threads in the same warp [19]. If threads need to share data with threads in
other warps, another type of memory has to be used, resulting in increased latency for
the instructions to complete.
In addition to the registers for fast access, there are a number of read-only and read/write
memories available. Figure 3.3 shows the memories and the directions in which data can be sent.
Constant memory and texture memory are read-only, whereas local memory, global memory,
and shared memory are all read/write, from the perspective of the kernel. By using
the CUDA API, constant and texture memories are set up by the C code that launches the
kernel. The goal of these memories is to aggressively cache values that are read by the
kernels. Constant memory has some additional characteristics that are explained in further
detail in Section 3.1.2.1. Texture memory is designed to cache values that are located
nearby in memory, especially two dimensional accesses that can take advantage of spatial
locality. Because the data is cached, it is best suited to reduce the demand on main memory
by providing increased bandwidth from on-chip memory [26].
Figure 3.3: CUDA Memory Hierarchy [14]
The remaining memory types are read/write memories. Global memory is the highest level
of GPU memory, which is stored in off-chip DRAM. An access to global memory incurs
a relatively long delay because the data has to be obtained off-chip. In order to overcome
the latency, a cache hierarchy was introduced with the Fermi architecture. Figure 3.4 shows
the hierarchy. There is a single L2 cache that services all of the SMs on the GPU. Each SM
contains its own L1 cache. In addition to being used for caching global memory accesses, the
L1 cache is where the shared memory is stored. When a kernel is launched, a portion of
L1 is allocated to shared memory space. Shared memory is specific to the blocks that are
launched; threads in the same block share the same shared memory area [26]. In general,
the shared memory is a software-based, user-controlled cache. Local memory is the last
memory type shown in the figure. The purpose of local memory is to hold the data that each
thread requires when it exceeds the number of registers that are available during execution.
Local memory is located in a portion of global memory, making the access long relative to
a register access. Each thread is allocated its own portion of memory and there is no way
to directly access the local memory of another thread.
Figure 3.4: CUDA Fermi Cache Hierarchy [15]
While the L1 cache and shared memory can be useful, the Kepler architecture takes a
different approach to their use. In previous generations, the hierarchy was used to serve all
of the global memory requests in an effort to increase performance. The Kepler architecture
does not use the L1 cache for any load/store caching [19]. Instead, the cache is used for
register spills or local memory. Register spills occur when the register requirements for a
given kernel launch exceed the number of registers available in the SM. As a result of this
design, the penalty to access local memory is reduced when a thread requires too much
memory.
The final type of memory that can be used is not shown in Figure 3.3. The GPU is able
to access CPU RAM using page locked memory. There are special CUDA API functions
that can be called to allocate this type of memory. When the allocation is made, the pages
will not be paged out of CPU memory by the memory manager when a page fault occurs
[17]. There are clear advantages and disadvantages to using it, however. The primary
advantage is to overcome the limitation of the amount of memory available in the GPU
DRAM; it is not uncommon for a CPU to have tens or even hundreds of gigabytes available. It
can also be used as a method of communication between the CPU and GPU without having
to perform an explicit memory copy. The disadvantage is system performance. If it is not
used efficiently, it could bring the system to a halt as the operating system is unable to
adequately manage all of the page faults. Since the memory is located on the CPU and not
the GPU, all of the data transfers need to happen over the PCI bus. Depending on the system
configuration, this type of transfer could incur high execution overhead [17]. Despite the
potential disadvantages, its use could provide a significant benefit to an application when a
GPU does not have enough on-board memory to meet the application requirements.
3.1.2.1 Constant Memory Hierarchy
From the design and description of the constant cache, it appears to be a single area of cache
that aggressively manages values that are declared by a programmer and read by a kernel.
Since the design is a trade secret of NVIDIA, the implementation details are unknown,
which makes the characterization difficult. Using a technique called microbenchmarking,
the parameters of the cache can be identified. A team of researchers at the University of
Toronto created a microbenchmarking suite that analyzed the performance of a variety of
architectural components, including the constant cache [31]. Using a similar method of
design, the CUDA Fermi and Kepler architectures were analyzed to determine the levels
and size of the constant cache.
The design of the microbenchmark focused on reading values from an array separated by some
value, referred to as a stride. For example, a stride of one means that each entry in the array
is accessed sequentially. In the following figures, an array of 4-byte integers was used;
the size of the constant cache is fixed in the hardware to 64 KB [17]. The microbenchmark
reads from the array and then calculates the average amount of time that it takes to complete
each read, as measured by clock cycles on the GPU. Table 3.1 illustrates a portion of the
array that has been filled; a stride of three integers and an array size of 64 bytes is used for
simplicity. Each read from the array indicates the next index that will be accessed.
Table 3.1: Constant Memory Microbenchmark Memory Access Pattern
Array Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Next Access 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 0
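The access pattern of Table 3.1 can be reproduced with a short sketch. The wrap-around rule
used here, in which an index that would run past the end of the array returns to index
i mod stride, is inferred from the values in the table and is an assumption about how the
array was filled. The function names build_chain and chase are hypothetical, and the sketch
models only the index pattern, not the timing measurement on the GPU.

```python
def build_chain(num_elements, stride):
    """Fill the array so that entry i holds the index of the next access."""
    return [
        i + stride if i + stride < num_elements else i % stride
        for i in range(num_elements)
    ]


def chase(chain, start, steps):
    """Follow the chain for a number of dependent reads; return visited indices."""
    visited = []
    index = start
    for _ in range(steps):
        index = chain[index]       # each read determines the next index
        visited.append(index)
    return visited


if __name__ == "__main__":
    chain = build_chain(16, 3)     # 16 four-byte integers = 64 bytes, stride 3
    print(chain)
    print(chase(chain, 0, 6))
```

Because every read depends on the value returned by the previous one, the reads cannot be
overlapped, so the average time per read reflects the latency of the cache level that holds
the data.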
Figure 3.5 shows the results of running the microbenchmark on the Fermi architecture.
Based on the points where there is a steep increase in the access time, two levels of caching
can be identified. The first level of cache is 4 KB, whereas the second level is 32 KB.
With further analysis, the associativity of each cache can also be determined. In order
to determine these parameters, a stride of 16 integers, or 64 bytes, was used. Using Figure
3.6a, the line size of the L1 cache is 64 bytes, the way size is 1 KB, and it is four-way set
associative. Figure 3.6b indicates that the line size of L2 is 256 bytes, the way size is 4 KB,
and it is eight-way set associative.
Figure 3.5: CUDA Fermi Constant Cache Overview
(a) L1 Cache (b) L2 Cache
Figure 3.6: CUDA Fermi Constant Cache L1 and L2 Parameter Identification
Using the same microbenchmark, the parameters of the constant cache in the Kepler ar-
chitecture can also be identified. Figure 3.7 shows the L1 cache size as 2KB, and the L2
cache size as 32KB. As identified from Figure 3.8a, the line size is 64 bytes, the way size
is 512 bytes, and the L1 cache is four-way set associative. Figure 3.8b indicates that L2 has
a cache line size of 256 bytes, a way size of 4KB, and it is eight-way set associative.
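The reported parameters can be cross-checked with simple arithmetic, since the total size of a
set-associative cache is the way size multiplied by the number of ways, and the number of sets
is the way size divided by the line size. The sketch below applies this to the values stated
above; the function name cache_geometry is illustrative.

```python
def cache_geometry(line_bytes, way_bytes, ways):
    """Return (total size in bytes, number of sets) of a set-associative cache."""
    total_bytes = way_bytes * ways
    num_sets = way_bytes // line_bytes
    return total_bytes, num_sets


if __name__ == "__main__":
    print("Fermi L1: ", cache_geometry(64, 1024, 4))
    print("Fermi L2: ", cache_geometry(256, 4096, 8))
    print("Kepler L1:", cache_geometry(64, 512, 4))
    print("Kepler L2:", cache_geometry(256, 4096, 8))
```

The computed totals of 4 KB and 32 KB for Fermi, and 2 KB and 32 KB for Kepler, agree with the
cache sizes identified from the overview figures.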
Figure 3.7: CUDA Kepler Constant Cache Overview
(a) L1 Cache (b) L2 Cache
Figure 3.8: CUDA Kepler Constant Cache L1 and L2 Parameter Identification
3.2 Cycle-Accurate GPU Simulator
With the numerous advances in the design of modern processors, a functional and per-
formance simulator for the GPU architecture can be used to understand the system and
bottlenecks that may exist. GPGPU-Sim is a general purpose GPU simulator that allows
CUDA programs to be executed with it, thereby providing a detailed insight into the ar-
chitectural details of the device that NVIDIA does not release [3]. When the simulator
was developed, the details of the interconnection network were analyzed and determined
to have a major influence on the overall performance. Overall, the simulator is designed
to mimic the real hardware configuration of the NVIDIA Fermi architecture. Beyond the
basic configuration of the architecture, a power simulation model is integrated into the
simulator to provide a power analysis of the subsystems using GPUWattch [1]. In order to
illustrate the accuracy, a number of RODINIA benchmarks were executed to compare the
IPC of the simulator to that of the actual hardware; Figure 3.9 shows that there is more
than a 97 percent correlation between the two, indicating that the simulator is an accurate
model of the hardware.
Figure 3.9: Accuracy of Simulator to Fermi Architecture [1]
Figure 3.10 shows a high level architectural description of the simulator. The core process-
ing within the streaming multiprocessors, represented as an SIMD Core Cluster, operates
on a different clock frequency than the L2 cache, the DRAM memory subsystem, and the
interconnection network [3]. In addition to a representation of the SM clusters, a detailed
pipeline is implemented for the streaming processors within the cluster; Figure 3.11 illus-
trates the design of the SP core that is used. The final major component of the simulator is
the memory architecture. The L1 cache is implemented inside of the SP because each core
has its own cache. The L2 cache and external DRAM are shared among all SM clusters,
resulting in the subsystem shown in Figure 3.12.
Figure 3.10: High Level Architecture of Simulator [1]
Through the use of the simulator, a number of hardware level metrics can be analyzed.
After each kernel is executed, detailed information can be viewed, such as the number
of clock cycles and number of instructions that were simulated [3]. In addition to the
core processing information, the memory subsystem also provides a significant amount of
detail. The L1 cache can be analyzed for each SM cluster to understand hits and misses for
all types of memory accesses, including constant, texture, and instruction accesses. Since
the L2 cache is shared across the device, cumulative statistics are kept for all types of
accesses [1]. The configuration for each of the subsystems is controlled by a configuration
file that contains sizes, latencies, etc.
Figure 3.11: Streaming Processor Core Architecture [1]
Figure 3.12: Memory Subsystem Architecture [1]
3.3 Field Programmable Gate Arrays (FPGAs)
The FPGA architecture provides a unique environment where dedicated functions or state
machines can be implemented by having direct access to the underlying hardware. In order
to take advantage of the environment, a hardware description language (HDL), such as
VHDL or Verilog, needs to be used. At the most basic level, FPGAs contain basic logic
blocks, configurable I/O blocks, and a complex interconnection network [30]; altogether,
these components can be considered the FPGA fabric. Figure 3.13 illustrates a simplified
representation of how the basic blocks are arranged; in a real device, there will be tens
of thousands of these basic blocks. When the device is programmed, the interconnection
network and I/O blocks will be configured to ensure that the basic blocks are connected
correctly. Within the basic blocks, multiplexers, look-up tables (LUTs), flip-flops and some
other basic features exist. With this configuration, a single block may be configured to
realize a certain function, but multiple blocks can also be configured to collaboratively
[Figure 3.13 labels: Configurable I/O Block, Interconnection Network]
Figure 3.13: Basic FPGA Fabric Visualization
Since there are so many resources that exist within the device, it may be possible to put
multiple instances of the datapath on a single FPGA. While the parallelism will not match
the GPU, the parallel nature of the operations across multiple instances may provide similar
performance benefit due to more direct access to the underlying hardware. As an example
to show the contents of a basic block, Figure 3.14 represents a SLICEL from a Xilinx
Virtex-7 FPGA [32]. Within the Xilinx family of devices, these basic blocks are called
Configurable Logic Blocks (CLBs). While some CLBs in the FPGA may also contain a
SLICEM, the basic components are still present. The flip-flops, multiplexers, and LUTs
are all essential components that make it possible to realize almost any HDL design.




There are a number of implementations of KECCAK across many different platforms. Many
of the implementations are based in software using various optimizations, but there are
also some implementations that are hardware based. For the ASIC and FPGA hardware
implementations, there are resource optimizations that can be made, such as pipelining to
improve the overall performance. In general, there are only a few implementations that are
focused on tree hashing.
4.1 Software
4.1.1 CPU Sequential Implementations
CPU implementations of KECCAK exist for a multitude of different platforms. While the
number of implementations is high, all of them are designed to perform a sequential hash.
Table 4.1 shows the hashing speeds for a large variety of processors. The most important
column in the table is the Long column, which shows the number of cycles per byte needed
to operate on a long message. All of these parameters are important for
an optimized implementation. These sequential results will serve as a basis of comparison
when it comes to evaluating the performance of a tree hashing scheme. The following
information can be used to understand the implementation details [9].
• Implementation: this characterizes the fastest implementation on this machine. It
includes the following properties:
– plain 32-bit: plain C code using 32-bit operations;
– plain 64-bit: plain C code using 64-bit operations;
– LC: code using the lane complementing technique (see the KECCAK Implementation
Overview [11] for more details);
– BI: code using the bit interleaving technique to use 32-bit rotations instead of
64-bit ones;
– BI(T): same, where the bit interleaving is implemented with tables;
– SIMD64: the code uses the 64-bit SIMD operations of the processor (MMX on
the AMD and Intel processors);
– SIMD128: the code uses the 128-bit SIMD operations of the processor (SSE2
on the AMD and Intel processors).
• Short gives the number of cycles to hash a 1-block message (≤ 127 bytes when r =
1024).
• Long gives the number of cycles per byte to hash a long message with KECCAK[r =
1024, c = 576].
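The bit interleaving (BI) technique in the list above can be illustrated with a short Python sketch. It is a minimal model and is not taken from [9] or [11]; the helper names (`interleave`, `deinterleave`, `rotl64_interleaved`) are hypothetical. A 64-bit lane is split into one 32-bit word holding the even-indexed bits and one holding the odd-indexed bits, after which a 64-bit rotation reduces to 32-bit rotations of the two words.

```python
def rotl(value, amount, width):
    """Rotate `value` left by `amount` bits within a word of `width` bits."""
    amount %= width
    mask = (1 << width) - 1
    return ((value << amount) | (value >> (width - amount))) & mask


def interleave(lane):
    """Split a 64-bit lane into (even, odd) 32-bit words.

    Bit j of `even` is bit 2j of the lane; bit j of `odd` is bit 2j + 1.
    """
    even = odd = 0
    for j in range(32):
        even |= ((lane >> (2 * j)) & 1) << j
        odd |= ((lane >> (2 * j + 1)) & 1) << j
    return even, odd


def deinterleave(even, odd):
    """Inverse of interleave(): rebuild the 64-bit lane."""
    lane = 0
    for j in range(32):
        lane |= ((even >> j) & 1) << (2 * j)
        lane |= ((odd >> j) & 1) << (2 * j + 1)
    return lane


def rotl64_interleaved(even, odd, amount):
    """Rotate the interleaved lane left by `amount` using only 32-bit rotations."""
    amount %= 64
    half = amount // 2
    if amount % 2 == 0:
        # Even rotation: both words rotate by half the amount.
        return rotl(even, half, 32), rotl(odd, half, 32)
    # Odd rotation: the words swap roles and the old odd word rotates one extra bit.
    return rotl(odd, half + 1, 32), rotl(even, half, 32)
```

The conversion to and from the interleaved form costs extra work, which is why it is normally done once when data enters or leaves the state rather than around every rotation.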
4.1.2 CPU SSE2 Tree Hashing
While there have been many implementations of the KECCAK algorithm on a variety of
platforms, tree hashing remains a relatively unexplored area. The authors of the KECCAK
algorithm describe a method of implementing it with the Streaming SIMD Extensions 2 (SSE2)
instructions of the CPU [6]. SSE2 instructions are designed to improve the performance of
applications by operating on 128-bit registers [22]. Because each lane of the state is 64 bits
wide, a 128-bit register can hold one lane from each of two sponge instances, so the CPU is
able to natively operate on both instances at the same time. When SSE2 instructions are used
with the tree mode, a block of 64 bits allows interleaving of the data across the two sponges.
While this appears to be a good approach, it offers little room to expand the size of the tree.
In most cases, this method only works easily for a tree of degree 2 and height 1 [6]; this is
the biggest limitation that presents itself. The method also requires intricate knowledge of
assembly language in order to create the hashing scheme.
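To make the register-level idea concrete, the following Python sketch models a 128-bit register as an integer holding one 64-bit lane from each of two sponge states. It is an illustration only, not the SSE2 code of [6]; the function names are hypothetical, and a real implementation would use SSE2 instructions or intrinsics. The model shows why bitwise operations act on both states at once, and why a rotation has to be assembled from two shifts that act on each 64-bit half separately.

```python
MASK64 = (1 << 64) - 1


def pack(lane_a, lane_b):
    """Model a 128-bit register: lane_a in the low half, lane_b in the high half."""
    return (lane_b << 64) | lane_a


def unpack(reg):
    """Return the (low, high) 64-bit halves of the modeled register."""
    return reg & MASK64, reg >> 64


def shift_left_per_lane(reg, n):
    """Shift each 64-bit half left by n, discarding bits that leave the half."""
    low, high = unpack(reg)
    return pack((low << n) & MASK64, (high << n) & MASK64)


def shift_right_per_lane(reg, n):
    """Shift each 64-bit half right by n, filling with zeros."""
    low, high = unpack(reg)
    return pack(low >> n, high >> n)


def rotl_per_lane(reg, n):
    """Rotate both lanes left by n (0 <= n < 64) with two shifts and an OR."""
    if n == 0:
        return reg
    return shift_left_per_lane(reg, n) | shift_right_per_lane(reg, 64 - n)


# XOR on the packed value updates both sponge states in one operation.
both = pack(0x1111, 0x2222) ^ pack(0x00FF, 0xFF00)
```

Since every step of the permutation is built from XOR, AND, NOT, and rotations, the same instruction stream can advance two independent states, which is what limits this approach to two leaves per register.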
When SSE2 instructions are used, the performance can vary significantly. Table 4.2 shows
the results of the implementation without using the tree mode, whereas Table 4.3 shows
the tree mode using SSE2 instructions [6]. For each table, c is the number of cycles, and
c/b is cycles per byte. The difference between the first two columns of Table 4.2 exists
Table 4.1: KECCAK Hashing Speeds [9]

Processor | Compiler | Implementation | Short (cycles) | Long (cycles/byte)
--- | --- | --- | --- | ---
AMD Athlon 64 X2 | GCC 4.4.3 | plain 64-bit, LC | 1974 | 13.05
AMD Phenom 9550 | GCC 4.4.1 | plain 64-bit, LC | 2016 | 12.99
AMD Phenom II X4 955 | GCC 4.4.1 | plain 64-bit, LC | 2026 | 13.07
AMD Phenom II X6 1090T | GCC 4.4.3 | plain 64-bit, LC | 2670 | 12.98
HP Itanium II | GCC 3.2.3 | plain 64-bit | 1318 | 6.28
HP Itanium II | GCC 3.3.3 | plain 64-bit | 1428 | 7.03
IBM POWER4 | GCC 4.0.0, XLC 8.0 | plain 64-bit, LC | 3480 | 20.92
IBM POWER5 | GCC 4.4.3 | plain 64-bit, LC | 2984 | 16.91
IBM PowerPC G5 970 | GCC 4.3.2 | plain 64-bit, LC | 3348 | 19.46
ICT Loongson-2 V0.3 | GCC 4.3.3 | plain 64-bit, LC | 5416 | 24.72
Intel Core 2 Duo | GCC 4.4.3 | plain 64-bit, LC | 2008 | 12.64
Intel Core 2 Duo E4600 | GCC 4.4.3, ICC 11.10 | plain 64-bit, LC | 3965 | 12.63
Intel Core 2 Duo E8400 | GCC 4.4.3, ICC 11.10 | plain 64-bit, LC | 1926 | 12.66
Intel Core 2 Quad Q9550 | GCC 4.4.1 | plain 64-bit, LC | 1938 | 12.64
Intel Core i5 750 | GCC 4.4.1 | plain 64-bit, LC | 1926 | 10.98
Intel Core i5 M 520 | GCC 4.4.3, ICC 11.10 | plain 64-bit, LC | 1794 | 10.87
Intel Core i7 920 | GCC 4.4.4 | plain 64-bit, LC | 4200 | 13.09
Intel Xeon E5420 | GCC 4.6.0 | plain 64-bit, LC | 1935 | 12.64
Intel Xeon E5530 | GCC 4.4.1, ICC 11.10 | plain 64-bit, LC | 2012 | 13.12
Sun UltraSPARC IIIi | GCC 3.4.3 | plain 64-bit | 5724 | 37.89

Processor | Compiler | Implementation | Short (cycles) | Long (cycles/byte)
--- | --- | --- | --- | ---
AMD Athlon | GCC 4.4.3 | SIMD64 | 5430 | 37.97
Atmel AT91RM9200 | | plain 32-bit, BI | 29025 | 115.00
Freescale i.MX515 | GCC 4.4.1 | plain 32-bit, BI | 8064 | 62.88
Intel Pentium 3 | GCC 4.4.1 | SIMD64 | 5995 | 40.86
Intel Pentium 4 | GCC 4.4.1 | plain 32-bit, BI | 7484 | 48.89
Intel Pentium M | GCC 4.4.1 | SIMD64 | 5060 | 33.82
Luminary Micro LM3S811 | | ARM assembly | 16466 | 103.19
Motorola PowerPC 750CXe | GCC 4.3.2 | plain 32-bit, BI | 7440 | 46.82
Motorola PowerPC G4 7410 | GCC 4.3.2 | plain 32-bit, BI | 7408 | 46.72
Motorola PowerPC G4 7447a | GCC 3.3 | plain 32-bit, LC, BI(T) | 7514 | 52.59
TI OMAP 2420 | GCC 3.4.4 | plain 32-bit, BI | 14439 | 97.37

Processor | Compiler | Implementation | Short (cycles) | Long (cycles/byte)
--- | --- | --- | --- | ---
AMD Athlon 64 X2 | GCC 4.4.3 | SIMD64 | 5093 | 35.71
AMD Phenom 9550 | GCC 4.4.1 | SIMD64 | 4979 | 34.21
IBM POWER4 | GCC 4.0.0 | plain 32-bit, BI | 8080 | 46.88
IBM POWER5 | GCC 4.4.3 | plain 32-bit, LC, BI(T) | 5376 | 35.52
IBM PowerPC G5 970 | GCC 4.3.2 | plain 32-bit, LC, BI(T) | 6480 | 43.72
ICT Loongson-2 V0.3 | GCC 4.3.3 | plain 64-bit, LC | 5410 | 25.29
ICT Loongson-2 V0.3 | GCC 4.3.3 | plain 64-bit, LC | 9428 | 54.49
Intel Core 2 Duo E4600 | GCC 4.4.3, ICC 11.10 | SIMD128 | 6048 | 19.18
Intel Core 2 Quad Q9550 | GCC 4.4.1 | SIMD128 | 3392 | 21.95
Intel Core i5 750 | GCC 4.4.1 | SIMD128 | 2814 | 18.40
Intel Xeon E5420 | GCC 4.6.0 | SIMD128 | 3413 | 22.07
Intel Xeon E5530 | GCC 4.4.1 | SIMD128 | 3336 | 21.63
Table 4.2: SSE2 Implementation Performance [6]

Operation \ Implementation | 64 Bit | 32 Bit | SSE2 - 64 Bit | SSE2 - 32 Bit
--- | --- | --- | --- | ---
KECCAK-f[1600] only | 1648 c | 4408 c | 2520 c | 2816 c
Squeezing with r = 1024 | 12.9 c/b | 34.4 c/b | 19.7 c/b | 22.0 c/b
KECCAK-f and XORing 1024 bits | 1680 c | 4600 c | 2528 c | 2832 c
Absorbing with r = 1024 | 13.1 c/b | 35.9 c/b | 19.8 c/b | 22.1 c/b
Table 4.3: SSE2 Tree Implementation Performance [6]

Operation | Implementation
--- | ---
2 x (KECCAK-f and XORing 1024 bits) | 2520 c
Absorbing with r = 1024 | 9.8 c/b
2 x (KECCAK-f and XORing 1088 bits) | 2528 c
Absorbing with r = 1088 | 9.3 c/b
because the code was generated for 64-bit instructions and 32-bit instructions, respectively.
When SSE2 instructions are used in place of the 64-bit instructions, the performance is shown
to drop. However, the performance of the 32-bit code increased significantly [6]. For the tree
mode in Table 4.3, the 64-bit instructions were used with SSE2. Based on the performance,
approximately twice the work is completed in nearly the same number of cycles as a single
SSE2 instance, so the cost per byte is roughly halved. Furthermore, it is shown that an
increased rate results in a marginally faster hash computation [6].
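The cycles-per-byte figures in Tables 4.2 and 4.3 follow directly from the cycle counts and the rate. The short Python check below is an illustrative calculation, not code from [6]; the function name is hypothetical. It divides the cycles for one permutation-plus-XOR step by the number of bytes absorbed, counting two blocks per step in the tree case.

```python
def cycles_per_byte(cycles, rate_bits, instances=1):
    """Cycles spent per absorbed byte when `instances` sponges run side by side."""
    bytes_absorbed = instances * rate_bits // 8
    return cycles / bytes_absorbed


# Plain 64-bit code, one sponge, r = 1024 (Table 4.2).
single_64bit = cycles_per_byte(1680, 1024)

# SSE2 tree mode, two sponges per step (Table 4.3).
tree_r1024 = cycles_per_byte(2520, 1024, instances=2)
tree_r1088 = cycles_per_byte(2528, 1088, instances=2)

print(f"{single_64bit:.2f} {tree_r1024:.2f} {tree_r1088:.2f}")
```

The larger rate absorbs 16 more bytes per sponge for only 8 more cycles, which accounts for the small improvement at r = 1088.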
4.1.3 GPU-Based Hashing
There are a few papers that have used CUDA to implement KECCAK on a GPU. In the first
work, Sevestre presents an implementation based on the version of KECCAK that was submitted
to the SHA-3 competition at the time [27]. This paper implemented KECCAK-f[800], meaning
that 32-bit words were used to represent the lanes of the internal state. Several types of
tree structures were explored in [27]. The first design created a basic streaming
architecture, where each GPU thread would hash a single input block and then copy the
hashed value back to the CPU. Afterwards, the CPU would act as the root node and perform a
sequential hash of all of the outputs, which would produce the final hash [27]. Figure 4.1
shows the architecture of this design. After the initial results of the tree
Figure 4.1: Basic Tree Hash Mode [27]
mode, modifications can be made to improve the performance. By design of the hash mode,
the CPU can sit idle while the GPU is performing work, even if some results have already
been computed. Using the concept of streams in CUDA, multiple GPU operations, such as a
memory copy and a computation, can be performed at the same time [26]. Overlapping the copy
and compute operations significantly increases the performance over the basic mode; when
additional streams are used, the performance increases even more. Table 4.4 shows the
performance results that were obtained for the basic tree mode and these variations of it.
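The benefit of overlapping copies with computation can be seen in an idealized timing model. The Python sketch below is not derived from the measurements in [27]; it assumes that the input is split into equal chunks, that each chunk has a fixed copy time and compute time (in arbitrary units), and that a copy can overlap perfectly with the computation on the previous chunk. The function names are hypothetical.

```python
def serial_time(chunks, copy_time, compute_time):
    """Every chunk is copied and then processed, with no overlap."""
    return chunks * (copy_time + compute_time)


def overlapped_time(chunks, copy_time, compute_time):
    """Two-stage pipeline: the copy of chunk i + 1 overlaps the compute of chunk i."""
    if chunks == 0:
        return 0.0
    # The first copy and the last compute cannot be hidden; in between, the
    # slower of the two stages sets the pace.
    return copy_time + (chunks - 1) * max(copy_time, compute_time) + compute_time


# Example with hypothetical numbers: 8 chunks, copy and compute each take 1 unit.
print(serial_time(8, 1.0, 1.0), overlapped_time(8, 1.0, 1.0))
```

In this model the gain is largest when the copy and compute times are balanced, and it disappears when there is only one chunk to process.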
Building on the first tree concept, a second tree was also analyzed that is more secure in
terms of chaining values. The second tree mode takes advantage of shared memory and
thread blocks to generate the subtrees on the GPU [27]. As a result, there are fewer blocks
that need to be hashed on the CPU. Figure 4.2 shows the new tree structure. Since threads
in the same block are able to communicate using shared memory, a single kernel launch
can be used to execute a number of subtrees, meaning that the depth of the tree can be
greater. The Secure tree mode row of Table 4.4 shows the results of this implementation.
The reason for the degraded performance compared to the first implementation is
the increased security margin; the chaining values between nodes of the tree were doubled
for this implementation [27].
Figure 4.2: Second Tree Hash Mode - Secure [27]
Table 4.4: Tree Hash Mode Performance Results [27]

Platform | Tree Mode | A | B
--- | --- | --- | ---
CPU | Basic | Core2 Duo 2.6 GHz | Core i5-750 2.6 GHz
GPU | Basic | Quadro FX 370M | GTS 250
CPU Hash Speed | Basic | 25 MBps | 15 MBps
CPU/GPU Hash Speed - Basic | Basic | 61 MBps | 682 MBps
CPU/GPU Hash Speed - Overlapped | Basic | 63 MBps | 1032 MBps
CPU/GPU Hash Speed - Streams | Basic | 64 MBps | 1219 MBps
CPU/GPU Hash Speed - Streams | Secure | 59 MBps | 1183 MBps
The results of this work show that the performance of the tree hash mode is heavily dependent
on the CUDA features that are used. Furthermore, the structure of the tree can have an
impact on the overall performance, as shown by the second tree mode where the chaining
values are larger [27]. The final contribution of this work highlights the differences that are
present between various GPUs; in this case, a mobile card as opposed to a desktop card.
The mobile card, the Quadro FX 370M, has far fewer resources and is therefore saturated
quickly regardless of the performance optimizations [27]. For a low end implementation,
this could be an important aspect to consider when designing the algorithm to run on all
types of devices.
Figure 4.3: KECCAK Round Parallelism Performance [12]
Another way to use the GPU is to exploit the parallelism within the KECCAK algorithm.
From the algorithm description in Section 2.3, the amount of parallelism ranges from 1 to
25 in each of the five steps within a given round; the maximum amount of parallelism is set
by the number of lanes that exist in the state. An implementation using CUDA is provided
by Bobrov [12]. For each round of the algorithm, one warp of 32 threads was launched,
but only 25 of those threads performed useful work. Furthermore, some calculations were
repeated across threads in order to minimize communication between them [12]. Figure
4.3 shows the resulting execution time and throughput for the algorithm. It is clear that the
GPU execution dominates the overall performance time due to the associated overhead.
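The range of parallelism within a round can be made explicit with a plain Python model of KECCAK-f[1600]. This is a sketch written for illustration, not the CUDA code of [12]; the names (`keccak_round`, `RHO`, `RC`) and the flat lane indexing `x + 5*y` are choices made here. Each comment states how many lanes could be updated independently in that step, which is the quantity a lane-per-thread GPU mapping can exploit.

```python
MASK64 = (1 << 64) - 1


def rol64(value, amount):
    """Rotate a 64-bit value left by `amount` bits."""
    amount %= 64
    return ((value << amount) | (value >> (64 - amount))) & MASK64


def rho_offsets():
    """Rotation offset of each lane, indexed as x + 5*y."""
    offsets = [0] * 25
    x, y = 1, 0
    for t in range(24):
        offsets[x + 5 * y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return offsets


def round_constants():
    """The 24 iota round constants, generated bit by bit with an 8-bit LFSR."""
    constants, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                rc ^= 1 << ((1 << j) - 1)
        constants.append(rc)
    return constants


RHO = rho_offsets()
RC = round_constants()


def keccak_round(a, rc):
    """Apply one round to 25 lanes stored as a[x + 5*y]."""
    # theta, part 1: five independent column parities (parallelism 5).
    c = [a[x] ^ a[x + 5] ^ a[x + 10] ^ a[x + 15] ^ a[x + 20] for x in range(5)]
    d = [c[(x - 1) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
    # theta, part 2: every lane is updated on its own (parallelism 25).
    a = [a[i] ^ d[i % 5] for i in range(25)]
    # rho and pi: each lane is rotated and moved independently (parallelism 25).
    b = [0] * 25
    for x in range(5):
        for y in range(5):
            b[y + 5 * ((2 * x + 3 * y) % 5)] = rol64(a[x + 5 * y], RHO[x + 5 * y])
    # chi: each output lane reads three lanes of its own row (parallelism 25).
    a = [
        b[i] ^ (~b[(i % 5 + 1) % 5 + 5 * (i // 5)] & b[(i % 5 + 2) % 5 + 5 * (i // 5)])
        for i in range(25)
    ]
    # iota: only lane (0, 0) changes (parallelism 1).
    a[0] ^= rc
    return a


def keccak_f1600(lanes):
    """Apply all 24 rounds to a list of 25 lanes and return the new list."""
    for rc in RC:
        lanes = keccak_round(lanes, rc)
    return lanes
```

With one lane per thread, 25 of the 32 threads in a warp have work in the widest steps, while the column parities and the iota step leave most of the warp idle.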
The work by Bobrov is similar to another work by Cayrel that analyzed more deeply
the causes of poor performance in the internal permutation. Because all of the threads
belong to a single warp, they execute in lock-step, resulting in an
implementation that does not need explicit thread synchronization [13]. Profiling the
application revealed a significant number of bank conflicts as a result of using
shared memory for the temporary variables that need to be shared between threads [13]. In
addition to analyzing the architectural reasons for the slow performance, a batch mode was
proposed. In the batch mode, multiple documents could be processed simultaneously using
the parallelism within a round [13]. While this could be useful for keeping the device
saturated with processing, it is limited by the amount of available memory; GPUs are
Table 4.5: GPU Tree Hashing Results [13]

File Size | H = 0 | H = 1 | H = 2 | H = 3 | H = 4 | MD5
--- | --- | --- | --- | --- | --- | ---
1,050,112 b | 0.415 s | 0.104 s | 0.020 s | 0.014 s | 0.019 s | 0.003 s
10,500,096 b | 4.110 s | 0.994 s | 0.144 s | 0.069 s | 0.063 s | 0.025 s
25,200,000 b | 9.854 s | 2.375 s | 0.332 s | 0.151 s | 0.129 s | 0.057 s
50,400,000 b | 19.702 s | 4.742 s | 0.655 s | 0.291 s | 0.199 s | 0.112 s
Table 4.6: CPU Sequential Hashing Reference Results [13]





not known to have a high amount of memory to support all of the documents. This would
impose some additional constraints on the algorithm [13].
The final piece of work presented by Cayrel was an implementation of the leaf interleaving
tree mode defined in Section 2.4 using KECCAK-f[1600]. The performance was analyzed in
terms of execution time to determine how the implementation performed in relation to the
MD5 hash and a sequential KECCAK hash; the MD5 hash was performed sequentially. All of
the tree hashing was performed on a GTX 295 GPU. For the performance measurements, trees
of degree D = 4 were used and the height H was varied [13]. Table 4.5 shows the results
that were obtained by varying the height, while using a constant block size of
B = 128. Compared with a sequential CPU-based hash, the tree hash mode
begins to outperform the sequential hash once a sufficient height is reached; for the largest
file size, a height of 4 is enough to match the workload. Table 4.6 shows the CPU results
that were used for a comparison.
4.2 FPGA-Based Hardware
As the SHA-3 competition progressed, a number of FPGA implementations surfaced, ranging
from benchmarks of the KECCAK team's designs to custom architectures
that instantiate a KECCAK core. Table 4.7 provides an overview of the various
implementations. Over time, each of these implementations provided valuable insight that
helped refine the final VHDL version of the algorithm that was submitted to the SHA-3
competition.
Table 4.7: FPGA Implementation Summary

Design | Type | Device | Area Used (Capacity) | Frequency (MHz) | Throughput (Mbps)
--- | --- | --- | --- | --- | ---
[29] | High Speed Core | Cyclone III EP3C10F256C2 | 5842 (10320) LEs | 123 | 7168.0
[29] | High Speed Core | Stratix III EP3SE50F484C2 | 4550 (38000) ALUTs | 176 | 10240.0
[29] | High Speed Core | Spartan-3A XC3S1400ANFGG676-5 | 3393 (11264) slices | 85 | 4915.2
[29] | High Speed Core | Virtex 5 XC5VLX50FF324-3 | 1483 (7200) slices | 118 | 6860.8
[6] | High Speed Core | Stratix III EP3SE50F484C2 | 4686 (38000) ALUTs | 206 | 8908.8
[6] | High Speed Core | Cyclone III EP3C10F256C6 | 5770 (10320) LEs | 145 | 6246.4
[6] | High Speed Core | Virtex 5 XC5VLX50FF324-3 | 1330 (7200) slices | 122 | 5324.8
[29] | Low Area Co-processor | Cyclone III EP3C10F256C26 | 1769 (5136) LEs | 85 | 22.1
[29] | Low Area Co-processor | Stratix III EP3SE50F484C2 | 1026 (38000) ALUTs | 133 | 34.6
[6] | Low Area Co-processor | Stratix III EP3SE50F484C2 | 855 (38000) ALUTs | 359 | 71.2
[6] | Low Area Co-processor | Cyclone III EP3C5F256C6 | 1570 (5136) LEs | 183 | 36.3
[6] | Low Area Co-processor | Virtex 5 XC5VLX50FF324-3 | 448 (7200) slices | 65 | 53.5
[4] | KECCAK-512 | Virtex 5 XC5VLX330T-2-FF1738 | 1117 (-) slices | 189 | 8518.0
[21] | KECCAK[r = 1024, c = 576] | Virtex 5 | 1129 (-) slices | 239 | 10170.9
[20] | Pipeline | Virtex 5 XC5VLX50FF324-3 | 3117 (7200) slices | 452 | 7884.8
4.2.1 KECCAK-defined Processing Cores
The designers of KECCAK provided two implementations of processing cores that can be
used for different purposes. First, a high speed core is proposed that removes all of the
work from a CPU. It is designed to be a stand-alone core with enough internal memory
to handle the state and all of the operations [11]. Figure 4.4 represents a high-level block
diagram of the solution. The I/O buffer and the processing core are designed so that data
can be transferred over the bus and processed at the same time [11].
The second type of proposed core is a low-area co-processor. Compared to the high speed
core, this contains no internal memory and it only performs the basic round operations [11].
The main purpose of this co-processor is use in an embedded device, where the overall
area needs to be low and Gbps performance is not a requirement. Figure 4.5 shows the
block diagram of this co-processor. As shown, all of the memory is located in a separate
portion of the chip that is outside of the processing architecture; the only internal memory
is for the temporary values that are obtained when performing a round of the algorithm [11].
Figure 4.4: KECCAK High-Speed Core [11]
4.2.2 Sequential
Many of the FPGA publications are based on the provided KECCAK core. Towards the
beginning of the competition, Joachim worked to implement the high speed core and low
area co-processor on different platforms using both Altera and Xilinx suites; this work
proved to be essential in creating later versions of the reference packages [29]. The results
of the synthesis provided throughputs between 4.8 and 10.0 Gbps [29]. At approximately
the same time, the KECCAK designers also started to implement the algorithm on multiple
platforms. In a number of simulations using ModelSim, the performance was estimated
based on the number of round instances that were provided. Table 4.8 shows the result of
varying the number of round instances, in simulation only [6]. Based on the estimations,
there is a clear trend: increasing the amount of combinational logic decreases the
operating frequency, but the throughput can still increase significantly because more rounds
are completed per clock cycle. The major disadvantage of this implementation is that the
data bus width needs to increase in order to keep the system saturated [6]; this could pose
a potential constraint on the size of the overall system. [6] also provides implementations
of the high performance core and low area co-processor on the same FPGAs as [29]. While
there are some differences
Figure 4.5: KECCAK Low Area Co-processor [11]
that exist in terms of area consumption, they can be explained through the modifications
that were made in order to improve the processors.
Table 4.8: ModelSim Performance Estimates [6]

Round Instances | Frequency (MHz) | Throughput (Gbps)
--- | --- | ---
n = 1 | 526 | 22.44
n = 2 | 333 | 28.44
n = 3 | 244 | 31.22
n = 4 | 192 | 32.82
n = 6 | 135 | 34.59
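The trend in Table 4.8 can be reproduced with a simple estimate. The Python sketch below assumes, as a working hypothesis rather than a statement from [6], that a block of r = 1024 bits is absorbed every 24/n clock cycles when n round instances are chained combinationally. Under that assumption the estimate matches the tabulated throughputs to within the rounding of the listed frequencies.

```python
ROUNDS = 24  # rounds of KECCAK-f[1600]


def throughput_gbps(freq_mhz, round_instances, rate_bits=1024):
    """Estimated throughput when `round_instances` rounds complete per clock cycle."""
    cycles_per_block = ROUNDS / round_instances
    return rate_bits * freq_mhz / cycles_per_block / 1000.0


# (round instances, frequency in MHz, throughput in Gbps) from Table 4.8.
TABLE_4_8 = [(1, 526, 22.44), (2, 333, 28.44), (3, 244, 31.22),
             (4, 192, 32.82), (6, 135, 34.59)]

for n, freq, reported in TABLE_4_8:
    print(n, round(throughput_gbps(freq, n), 2), reported)
```

Going from one to six round instances multiplies the work per cycle by six, but the frequency falls by a factor of almost four, so the net gain in throughput is only about 1.5 times.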
In addition to implementing and analyzing the provided cores, some work has been done by
Baldwin to implement a generic wrapper that could be used to benchmark any of the SHA-
3 candidates [4]. When this wrapper is used, it can also be synthesized and incorporated
into the overall design. Figure 4.6 illustrates the structural design of the wrapper. For
KECCAK-512, it can be shown that the core operates at 189 MHz with a throughput of
8518 Mbps. The core is implemented to perform one round per clock cycle. However, the
Figure 4.6: Hash Function Wrapper [4]
wrapper actually increases the clock frequency of the implementation to 195.733 MHz, but
leaves the throughput unchanged [4]. The biggest disadvantage of using the wrapper
in this case is the overhead of approximately 800 slices, which accounts for almost fifty
percent of the total area that would be consumed after synthesis.
A lot of effort has been made to determine the exact performance metrics of the SHA-3
candidates, in terms of throughput and area. Gaj presented a closed form equation to
estimate the throughput of any SHA-3 candidate [21]. Equation 4.1 can be used to estimate
throughput for the KECCAK implementation, where T is the clock period. In the design of
the KECCAK core, the latency to hash an input block is 24 cycles, which is the same as the
number of rounds that are performed [21]. In order to accurately calculate throughput, the
message is assumed to be padded to the appropriate length in software before it is passed





The work of [21] was not limited to building a common testing interface and a closed-form
performance equation. The KECCAK core was also synthesized for other hardware FPGAs
Figure 4.7: Common SHA-3 Interface [21]
to determine the operating frequency for the design; all of the frequencies were determined
after the place and route phase of synthesis [21]. Table 4.9 provides the implementation
details for other types of FPGAs using Equation 4.1 to determine the throughput.
Table 4.9: KECCAK FPGA Performance [21]

FPGA Type | Frequency (MHz) | Throughput (Mbps)
--- | --- | ---
Spartan 3 | 96.32 | 4109.65
Virtex 4 | 202.47 | 8638.72
Virtex 5 | 238.38 | 10170.88
Cyclone II | 165.07 | 7042.99
Cyclone III | 174.28 | 7435.95
Stratix II | 198.65 | 8475.73
Stratix III | 296.30 | 12642.13
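The entries of Table 4.9 are consistent with a throughput expression of the following form. It is given here as a sketch inferred from the tabulated values, assuming a rate of r = 1024 bits per block and the 24-cycle block latency; it is not quoted from [21]. T is the clock period and f = 1/T is the operating frequency.

```latex
\[
  \text{Throughput} \;=\; \frac{r}{24 \cdot T} \;=\; \frac{r \cdot f}{24}
\]
\[
  \text{Virtex 5:}\quad \frac{1024 \times 238.38\ \text{MHz}}{24} \approx 10170.88\ \text{Mbps},
  \qquad
  \text{Spartan 3:}\quad \frac{1024 \times 96.32\ \text{MHz}}{24} \approx 4109.65\ \text{Mbps}
\]
```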
4.2.3 Pipelined
Due to the nature of the KECCAK algorithm within a round, multiple pipeline stages can be
used to provide greater performance. Dacêncio Pereira, Moreno Ordonez, Daun Sakai, and
Mariano de Souza designed the pipeline such that each of the steps defined in Section 2.3 is
a stage in the pipeline [20]. Because it is common to combine the ρ and π operations, there
is a four stage pipeline that can be created, which means that four independent messages
can be processed at the same time. In the ideal scenario, one message would be output
every clock cycle after the initial delays for filling the pipeline. Figure 4.8 illustrates the
method in which the pipeline is constructed. For the notation, T |1|0|, T is the operation
Figure 4.8: KECCAK Pipeline Stages [20]
being performed, 1 represents the block being operated on, and 0 is the round.
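The occupancy of such a pipeline can be tabulated with a small Python model. The sketch assumes that each block recirculates through the same four stages once per round, which is one reading of the operation, block, round notation and is not taken from [20]; the stage names and the function name are hypothetical.

```python
STAGES = ("theta", "rho-pi", "chi", "iota")
ROUNDS = 24


def pipeline_schedule(num_blocks=4, rounds=ROUNDS):
    """Return {cycle: [(stage, block, round), ...]} for blocks recirculating in the pipeline.

    Block b enters the first stage at cycle b and re-enters it every
    len(STAGES) cycles until all rounds are complete.
    """
    depth = len(STAGES)
    if num_blocks > depth:
        raise ValueError("at most one block per pipeline stage can be in flight")
    schedule = {}
    for block in range(num_blocks):
        for rnd in range(rounds):
            for s, stage in enumerate(STAGES):
                cycle = block + rnd * depth + s
                schedule.setdefault(cycle, []).append((stage, block, rnd))
    return schedule


schedule = pipeline_schedule()
# Once the pipeline is full, every stage works on a different block.
print(sorted(schedule[3]))
```

Under these assumptions four blocks finish all 24 rounds in 99 cycles, compared with 96 cycles for a single block, so the pipeline approaches a fourfold gain in blocks per cycle when four independent messages are available.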
Table 4.10: FPGA Pipeline Data








Table 4.10 shows the resulting implementation for each operation on a Xilinx Virtex 5
FPGA. From the table, the θ operation uses the greatest number of slices and has the lowest
operating frequency; this is shown in the table as θ1+2. Based on the implementation of
that step, it is possible to break it down into two separate steps, thereby adding another stage
to the pipeline [20]. After making this modification and re-analyzing the performance, the
operating frequencies of the additional stages are shown as θ1 and θ2 in the table. As
a result of the implementation, all stages of the pipeline achieve approximately the same
operating frequency. For the complete implementation, the pipeline is stated to run at 452




Converting an algorithm from a sequential to a parallel implementation is often a non-trivial
task and there are a number of considerations that need to be made. In order to write an
efficient tree hashing scheme, the software must be written in a way that makes efficient
use of the underlying GPU hardware. The main goal of this work is to determine the
performance of KECCAK in a tree hashing mode on a GPU. The previous works in Section
4.1.3 showed that a method exists for the implementation. The software is designed to
work with hash trees of varying size, provided that the input length in bits, |m|, satisfies
|m| mod 8 ≡ 0 and the block size in bits, B, satisfies B mod 64 ≡ 0. While the tree
hashing scheme is designed to work with any instance of the KECCAK function, this work
only uses KECCAK-f[1600] for the function used at each leaf in the tree.
The restrictions that are placed on |m| and B are artificial constraints imposed by the design
to reduce some of the added complexity with the GPU kernels. From the GPU standpoint,
any conditional branch has the potential to cause warp divergences. Without the constraints,
bit level manipulation would have to be performed; while it is possible to operate with any
number of bits, it is not ideal to perform this low level operation due to the number of
conditional branches that would be required. Furthermore, the operations are best suited
for byte operations without additional complexity. Finally, setting B to a multiple of the
lane size helps when the data is being XORed into the KECCAK state because one lane is
processed at a time, reducing any additional overheads.
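The two constraints can be expressed as a host-side check, with the lane-aligned block size making the split into lanes trivial. The Python sketch below is illustrative only; the function names are hypothetical, and the little-endian byte order within a lane follows the usual KECCAK convention rather than anything stated in this chapter.

```python
LANE_BITS = 64


def check_parameters(message_bits, block_bits):
    """Reject inputs that would force bit-level handling inside the kernels."""
    if message_bits % 8 != 0:
        raise ValueError("message length must be a whole number of bytes")
    if block_bits <= 0 or block_bits % LANE_BITS != 0:
        raise ValueError("block size must be a positive multiple of 64 bits")


def block_to_lanes(block):
    """Split a block of bytes into 64-bit lanes ready to be XORed into the state."""
    if len(block) % 8 != 0:
        raise ValueError("block must contain a whole number of lanes")
    return [int.from_bytes(block[i:i + 8], "little") for i in range(0, len(block), 8)]


check_parameters(message_bits=8 * 1000, block_bits=1024)
lanes = block_to_lanes(bytes(range(16)))
```

Because every full block maps onto whole lanes, the XOR into the state needs no masking or partial-lane branch except where the message itself ends.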
5.1 High Level Design
Tree hashing is computationally feasible due to the parallel nature of all of the nodes at
a given level of the tree. When the algorithm is mapped to a GPU platform, one thread is
launched for each node, resulting in parallelism for every step. However, the overall amount
of parallelism decreases as the tree is traversed back to the root node. The high level flow
of the algorithm for both the LI and FNG tree modes is shown in Figure 5.1. The algorithm
is iterative, requiring the use of synchronization points after each kernel call to ensure that
the results of the lower levels are computed before the root is computed.
Theoretically, the amount of work and resulting execution time of each kernel will decrease
by a factor of D each time, until a point is reached where the launch overhead begins to
dominate the execution time. Section 5.1.1 describes methods of reducing the overhead
associated with the kernel launches when the number of threads is small. Section 5.2 will
discuss the specific implementation details of the kernels and how the performance of each
launch can be characterized with respect to the tree mode.
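The shrinking amount of work per launch can be listed for a given tree shape. The Python sketch below assumes a full tree of degree D and height H with one thread per node and one launch per level; the threshold used to flag small launches is a hypothetical tuning value, not a figure from this work.

```python
def nodes_per_level(degree, height):
    """Thread counts for successive launches, from the leaves (D**H) to the root (1)."""
    return [degree ** level for level in range(height, -1, -1)]


def small_launches(degree, height, min_threads):
    """Launches with fewer than `min_threads` threads, where overhead is likely to dominate."""
    return [n for n in nodes_per_level(degree, height) if n < min_threads]


# Example: D = 4, H = 4 gives 256 leaves and four further launches.
print(nodes_per_level(4, 4))
print(small_launches(4, 4, min_threads=32))
```

For this example the last three launches involve fewer threads than a single warp, which is the situation the overhead-reduction methods are meant to address.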
5.1.1 Additional Types of Parallelism
As described in Chapter 4, there are varying levels of parallelism that can be exploited,
which range from SIMD instructions to GPU-based implementations. One of the enhancements
in the latest version of CUDA is the ability to spawn kernels from within a kernel,
also known as dynamic parallelism [16]. This additional level of parallelism could be
exploited when a thread needs to perform a hash using shared memory, but it comes at a
high price; each thread would launch an additional 32 threads, but that would be the limit of
the parallelism. If all parallelization overheads were eliminated, then this would be a great
addition to the design. However, having thousands of threads each launch an additional 32
threads incurs an overhead that the gain in computational efficiency cannot offset. With
this type of management, the coordination of executing grids would be severely limited and









































































Figure 5.1: High Level Description of Tree Mode Hash Functions
Another method of using CUDA features to speed up the processing is through the use
of streams. This work uses only the default stream.
However, there are places where they could be applied if designed correctly. First of all,
the bottlenecks that are identified in Section 5.2.3 could be broken down across multiple
streams for processing. As an extension of this design, multiple streams could be used
for multiple messages if the design were constructed to handle thousands of messages at a
time, as opposed to a large amount of data from a single file.
The other type of applicable parallelism is to take advantage of multiple CPU cores. The
concept of a multicore CPU is the same as that of a GPU, except for the reduced number
of cores. Using OpenMP, multiple threads can be launched for each level of the tree that
execute without the GPU. With this approach, the GPU could be bypassed for levels where the
number of threads is small enough that the launch overhead would eliminate the performance
benefit. CPU threads require less overhead to launch and there are no memory copies that need
to take place because they share the same physical memory.
5.2 Kernel Architecture
In order to support the various tree hashing modes, four different kernels need to be used.
The kernels can be broken down into two groups: leaf processing and internal node processing.
The order of operations shown in Figure 5.2 is essentially the same for all kernels,
but the locations that the kernels read from and write to differ. For all leaf kernels,
input is read from the message memory and output is written to the internal tree structure
memory. The internal processing kernels always read from and write to the internal tree
memory. The initialization refers to the process of setting the initial value of the state and
any other processing that needs to occur. When a block of input is read, it is moved into
a temporary buffer that will be used during the consume state. During the read, up to one
block of input will be read. Consuming a block of input refers to the process by which input
is combined into the state using the XOR operation and running the KECCAK function on
the state; this will also perform the padding that will ensure that the input is of the proper
length. Finally, generating the output hash will provide the input values to the next level of

















Figure 5.2: High Level Kernel Architecture
5.2.1 Leaf Kernels
In terms of implementation, both leaf kernels for LI and FNG tree mode operate in
a similar way. The difference between the two is the iteration when reading data. LI
mode requires a loop to iterate over all of the input until the end of the message is reached,
whereas FNG mode only requires a fixed number of consecutive bits. When all of the
data is read, there are a number of validations that need to be made to ensure that the
block is of the proper length. In the ideal scenario, the length of the input message would
always be a multiple of 64 bits. However, this is often a constraint that cannot be met. As a
result, there are a number of places where warp divergence can occur based on the number
of conditional branches. Nevertheless, warp divergence will only occur when a thread
operates in a portion of memory where a full block cannot be read. As a result of this
structure, a majority of the threads should take the same branches, meaning that the warp
divergence will not be as apparent.
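The difference in how the two leaf kernels read the message can be sketched as index arithmetic. The Python functions below give one plausible mapping, with sizes in bytes and hypothetical names; they are not the kernels of this work. In the LI sketch, a leaf loops over every block assigned to it by interleaving, while in the FNG sketch a leaf reads a single run of consecutive bytes.

```python
def li_leaf_ranges(leaf, num_leaves, block_bytes, message_bytes):
    """Byte ranges read by one leaf in LI mode: every num_leaves-th block."""
    ranges = []
    start = leaf * block_bytes
    while start < message_bytes:
        ranges.append((start, min(start + block_bytes, message_bytes)))
        start += num_leaves * block_bytes
    return ranges


def fng_leaf_range(leaf, block_bytes, message_bytes):
    """Single byte range read by one leaf in FNG mode."""
    start = min(leaf * block_bytes, message_bytes)
    return (start, min(start + block_bytes, message_bytes))


# Example: a 1000-byte message, 128-byte blocks, 4 leaves in LI mode.
for leaf in range(4):
    print(leaf, li_leaf_ranges(leaf, 4, 128, 1000))
```

In the example only the final range, (896, 1000), is shorter than a full block, so only the thread that reads it takes the partial-block branches.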
5.2.2 Internal Processing Kernels
The internal processing kernels are at the heart of the implementation as they work to
reduce everything from the lower levels of the tree to a single hashed value. The two
kernels that are responsible for this operation perform the same operation, but one of
them, InternalKernelNonRoot FNG, is designed to take into account the differences
for the layer of nodes immediately after the leaf kernel for FNG mode. This kernel requires
a couple of additional parameters in order to make the appropriate calculations
for the indices of the input and output compared to the other kernel. The second kernel,
InternalKernelNonRoot, is designed to read from the internal memory structure
and write to other locations in the same area of allocated space. When using FNG mode,
the InternalKernelNonRoot FNG kernel runs first, followed by iterative runs of the
InternalKernelNonRoot kernel.
In terms of execution, the internal processing kernels are simpler than the leaf kernels
because there is a fixed data size that is hashed; the execution of the inner loops is
determined by the specified rate, r. As previously described, the execution time should
theoretically decrease by a factor of D each time due to the fixed input and output sizes for
the kernel, once the initial processing occurs to reduce the amount of data that needs to be hashed.
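The reduction performed by the internal processing can be modeled on the host in a few lines of Python. This is a simplified sketch: `hashlib.sha3_256` stands in for the KECCAK instance used at each node, the tree-mode framing and padding bits are omitted, and the function names are hypothetical, so the resulting digest is not interoperable with any defined tree mode. Each call to `reduce_level` corresponds to one launch in which every parent node hashes the chaining values of its D children.

```python
import hashlib


def reduce_level(chaining_values, degree):
    """Hash each group of `degree` consecutive child values into one parent value."""
    parents = []
    for first_child in range(0, len(chaining_values), degree):
        children = chaining_values[first_child:first_child + degree]
        parents.append(hashlib.sha3_256(b"".join(children)).digest())
    return parents


def reduce_to_root(chaining_values, degree):
    """Apply reduce_level repeatedly and return (root, node count per level)."""
    if degree < 2:
        raise ValueError("degree must be at least 2")
    if not chaining_values:
        raise ValueError("at least one chaining value is required")
    sizes = [len(chaining_values)]
    while len(chaining_values) > 1:
        chaining_values = reduce_level(chaining_values, degree)
        sizes.append(len(chaining_values))
    return chaining_values[0], sizes


# Example: 16 leaf outputs with D = 4 are reduced in two steps (16 -> 4 -> 1).
leaf_outputs = [hashlib.sha3_256(bytes([i])).digest() for i in range(16)]
root, sizes = reduce_to_root(leaf_outputs, 4)
print(sizes, root.hex())
```

Every parent reads a fixed number of fixed-size inputs, which is why the work per launch falls by a factor of D from one level to the next.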
5.2.3 Top Level Processing
Figure 5.3 illustrates how the leaf kernels and internal processing kernels are connected
together to compute the hash. Figure 5.3a shows that there are only two kernels used to
complete the tree hashing algorithm. Figure 5.3b indicates that there are three separate
kernels that are used. Based on the figure, there is one additional kernel call for FNG mode
that is required when compared to LI mode. Because of this design, any tree that uses FNG




























Figure 5.3: Tree Hashing Kernel Launch Sequence
an alternative solution to calling each of the kernels so many times. If the OpenMP solution
were used, there would be fewer launches to the InternalKernelNonRoot, where calls
H − 1, H − 2, etc. are replaced with OpenMP thread launches.
It was described that the execution time for each kernel decreases by a factor of D for each
kernel launch until the overhead associated with it dominates the overall time. While this
is true for the InternalKernelNonRoot kernel, it does not hold true for the processing
of the leaf kernels and the InternalKernelNonRoot FNG kernel. Figure 5.4 shows the
timeline that the LI and FNG tree modes follow when it comes to kernel launches. LKx
represents the leaf kernel, whereas IKx represents an internal processing kernel. In both
examples, it is assumed that D = 2. As always, there are some variations to the timeline
when the ideal values of D and H are selected, but it holds true when the ideal values are
not selected.
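As a rough illustration of this behavior, the following Python sketch models the run time of
successive InternalKernelNonRoot launches. It is only a toy model under stated assumptions:
the function name, the constant per-launch overhead, and the numeric values are hypothetical
and are not measurements from this work.

```python
def internal_kernel_times(first_time, degree, launches, launch_overhead):
    """Model the run time of successive internal kernel launches.

    Assumption (not measured): the useful work shrinks by a factor of
    `degree` with every launch, while a constant launch overhead is
    paid each time.
    """
    return [first_time / degree**k + launch_overhead for k in range(launches)]


# Hypothetical numbers in arbitrary time units, with D = 2.
times = internal_kernel_times(first_time=8.0, degree=2, launches=4,
                              launch_overhead=0.5)
print(times)  # [8.5, 4.5, 2.5, 1.5]
```

In this model the useful work halves with each launch, so the fixed overhead makes up a
growing share of every later launch.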
From the generalized timeline, there is one major performance bottleneck for each tree mode.
For the LI tree mode, the leaf kernel introduces the performance limit because the amount of
computation that needs to be done is not known in advance. The kernels are designed to
iterate over all of the input until the end is reached, making it impossible to statically
predict how long they will take. After the leaf kernel is complete, the amount of data into
and out of the internal processing kernels is fixed by D and r; this is where the execution
time is reduced by a factor of D each time. Figure 5.4a shows the execution time of the leaf
kernel relative to each of the other kernel launches.
Figure 5.4: Kernel Execution Timeline. (a) LI Mode: LK_LI, IK1, IK2, IK3, ... (b) FNG Mode:
LK_FNG, IK1, IK2, IK3, ...
Hashing in FNG mode shares the same characteristics as LI mode for the shared internal
processing kernel. However, the performance bottleneck is no longer associated with the
leaf kernel, since each kernel does a fixed amount of processing that is specified by the
block size, B. The bottleneck is the InternalKernelNonRoot FNG kernel because this is where
the variable degree R is used as input and the fixed degree D is used for output. Because R
can be large depending on the length of the input and the tree parameters, this kernel can
be regarded as a second leaf kernel that needs to be executed. Figure 5.4b illustrates how
its performance compares to the rest of the kernels; it is assumed that R ≫ D.
5.3 Memory Architecture
Due to the nature of the algorithm, there are certain types of memory that can be used to take
advantage of the architecture. At a very high level, the use of memory can play a vital role in
the execution of the algorithm. While execution is typically analyzed from a performance
or speed perspective, the memory selection may not be considered an important area of
design. However, the various GPU memory types can provide a significant benefit to the
overall performance.
5.3.1 GPU Memory Use
Section 3.1.2 describes a number of memory types that are available for use. This work
targets four of the memory spaces: global, local, constant, and page-locked. Each of these
types provides a significant benefit in the implementation because of the architectural opti-
mizations of the device on which the program is executed.
5.3.1.1 Constant Memory
First of all, there are a number of constants that can be precomputed in order to reduce the
overall execution time. From Section 2.3, the rotation matrix and the round constants are
fixed for any KECCAK iteration. To the kernel, these values are read-only. Therefore, the
values are placed in constant memory in order to take advantage of the aggressive hardware
caching that is supported.
The use of constant memory is not limited to the constants that are part of the KECCAK
specification. In order to properly define the tree hashing, Section 2.4.1 defines the notion
of a prefix block that contains some metadata about the hash. When the proper salt length
is chosen, a single input block of r bits can be formed. Since this prefix block is used for
all nodes in the tree, it can be computed once on the CPU and then transferred to constant
memory for the initial state during state initialization in the kernels; afterwards, the kernel
simply copies the precomputed values into its initial state using a memory copy.
5.3.1.2 Message Memory
When the input message is hashed, it needs to reside in some type of memory that the GPU can
access; the only viable options are global memory and page-locked memory. This work allows
the message to exist in either location, and the overall impact of each is evaluated. There
is an advantage to each method. For global memory, the access is faster than having to
access the PCI bus, but the price is paid when the memory copy transfers the data from the
CPU to the GPU. Page-locked memory allows the data to remain in CPU memory, and the device
makes a memory request whenever a kernel requires a specified location in that memory. While
there is no explicit memory copy operation, each access remains much slower than an access
to the GPU DRAM.
Figure 5.5: Efficient Tree Layout in Memory. Each parent (Parent 0, Parent 1, ..., Parent n)
owns a contiguous region I_0 I_1 ... I_{D−1} P.
5.3.1.3 Internal Tree Memory Structure
Global memory is one of the most critical components in the entire design, as it is the
location where all of the intermediate values are read from and written to by the processing
kernels. As shown in Figure 5.1, there are two uses for global memory depending on the
tree mode. When using LI mode, Figure 5.1a indicates that there is one area of global
memory that is used. The reason for a single area is based on the fixed degree size for
every node in the tree; it is simple to determine what locations need to be read from and
written to. The tree is allocated as an array of memory based on the number of nodes in
the tree and the amount of data that needs to be passed between all of the levels. Figure 5.5
shows the simplified tree structure of a k-ary tree. Because each internal node of the tree
has at least one child, the array is created with enough space for each child to write its data.
In the figure, I refers to an index, or child, and P is the remaining space that a child
never writes to; its purpose is to hold the padding that is inserted before the node
performs a hash of the inputs.
Whenever a kernel is launched, each thread calculates a unique ID from its location in the
grid and block using Equation 5.1. For the leaf kernel, a unique zero-based index in the
tree can be calculated using Equation 5.2 by combining Equations 2.14 and 2.15. Once the
TreeID is determined, the parent of the leaf can be identified and the location for writing
output can be determined. Equation 5.3 is used to determine the parent and Equation 5.4 is
used to determine the index to write within each parent, since each parent has a number of
indices equal to the degree. The area of global memory that is used for the tree storage is
only written to by the leaf kernel; the message may exist in another area of memory.
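\[
\begin{aligned}
\text{GlobalThreadID} = {} & (\text{blockIndex}_y \times \text{blockDimension}_y + \text{threadIndex}_y) \\
& \times \text{blockDimension}_x \times \text{gridDimension}_x \\
& + \text{blockIndex}_x \times \text{blockDimension}_x + \text{threadIndex}_x
\end{aligned}
\tag{5.1}
\]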







\[
\text{ParentIndexID} = (\text{TreeID} - 1) \bmod D \tag{5.4}
\]
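The indexing above can be sketched in a few lines of Python. This is only an illustration:
the function names are hypothetical, global_thread_id follows Equation 5.1, and the slot
index follows Equation 5.4. The parent computation is an assumption, namely the usual
zero-based k-ary heap rule, which is consistent with Equation 5.4 but is not taken from the
text.

```python
def global_thread_id(block_idx, block_dim, thread_idx, grid_dim_x):
    """Equation 5.1: flatten a 2D grid of 2D blocks into one linear index.

    block_idx, block_dim and thread_idx are (x, y) tuples.
    """
    bx, by = block_idx
    dx, dy = block_dim
    tx, ty = thread_idx
    row = by * dy + ty            # global row of the thread
    row_width = dx * grid_dim_x   # number of threads in one global row
    return row * row_width + bx * dx + tx


def parent_slot(tree_id, degree):
    """Return (parent, index within parent) for a non-root node.

    The index follows Equation 5.4. The parent rule is an ASSUMPTION
    (standard zero-based k-ary heap indexing).
    """
    parent = (tree_id - 1) // degree
    index_in_parent = (tree_id - 1) % degree
    return parent, index_in_parent


# Thread (3, 2) of block (1, 0) in a grid that is 4 blocks wide, 16 x 16 blocks.
print(global_thread_id((1, 0), (16, 16), (3, 2), grid_dim_x=4))  # 147
print(parent_slot(5, degree=2))  # (2, 0)
print(parent_slot(6, degree=2))  # (2, 1)
```

Under the assumed parent rule, nodes 5 and 6 of a binary tree both map to parent 2 and
occupy its two slots, matching the one-slot-per-child layout of Figure 5.5.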
Once the leaf kernel is launched, each successive kernel uses a similar methodology to
determine where to read from and write to. The major difference that exists between the
kernels in terms of indexing is that the number of nodes at the current level and above the
current level are passed as parameters. Equation 5.5 shows the calculation that is performed
to determine the unique zero-based index for the node in the tree; h is the current height in
the tree, starting with the leaf level at h = H − 1, and h = H − currentKernelInvocation
for each successive kernel call. The allocated space of global memory is always used for






When FNG is used for the tree mode, a similar memory pattern is used. However, as
indicated in Figure 5.1b, there are two different areas of global memory that are used.
Because there are two different degrees that are used, the two different structures provide a
slightly different method of access to the kernels. There is a smaller portion of memory that
is used with the variable degree R, as opposed to the LI partition that is described above.
5.3.1.4 Kernel Memory
For all kernels, registers are allocated in the register file of the streaming multiprocessor
for use during execution. However, registers are a finite resource. Each thread that is
launched needs registers for the intermediate calculations that are performed. In addition
to those registers, each thread needs 200 bytes of storage to maintain the internal state of
the sponge construction. Because of this requirement, register spills into local memory
occur for each thread. While the Kepler architecture mitigates this by using the L1 cache
for spills, other architectures place the local memory overflow directly in global memory.
Spills can still hurt performance because of the misses that may occur when accessing the L1
cache. In an effort to aid the caching of these local memory accesses, the cache is
configured to prefer L1, meaning that the 64KB of on-chip memory is split into 48KB for L1
and 16KB for shared memory.
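The storage requirement can be checked with a short calculation. The sketch below is only
illustrative: the names are hypothetical, the 1600-bit state and the 16 × 16 thread block
are the values used for the results in Chapter 6, and the 32-bit register width is the
usual GPU register size.

```python
STATE_BITS = 1600      # KECCAK state width used in this work
LANE_BITS = 64         # one lane of the 5 x 5 state
REGISTER_BITS = 32     # width of one GPU register


def state_footprint(threads_per_block):
    """Storage needed just for the sponge state (illustrative arithmetic)."""
    state_bytes = STATE_BITS // 8
    return {
        "lanes": STATE_BITS // LANE_BITS,
        "state_bytes": state_bytes,
        "registers_per_thread": STATE_BITS // REGISTER_BITS,
        "state_bytes_per_block": state_bytes * threads_per_block,
    }


footprint = state_footprint(threads_per_block=16 * 16)
print(footprint)
```

Under these assumptions, the state alone corresponds to 50 registers per thread and about
50 KB per thread block, before any registers for intermediate values are counted.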
5.3.2 Efficient Memory Design for Low-End GPUs
Tree hashing is an application that can benefit from the use of a GPU accelerator. However,
high-end GPUs, such as NVIDIA Tesla cards, should not be the only focus when the algorithm
is implemented. Because there are a large number of streaming processors in some of the
low-end devices, computational resources are available for use, but memory resources are
not as plentiful. Furthermore, a GPU is not a device that supports expandable memory. When
all of these factors are considered, a low-end device can be targeted by using page-locked
memory. On the host side, it is common to have tens of gigabytes of relatively inexpensive
RAM, whereas most GPUs have between one and six gigabytes. Using this method, the GPU
memory could be used to hold the internal data for the tree and CPU page-locked memory
could be used for the input message.
Beyond the low-end desktop cards, there are also integrated GPUs, which share a portion of
memory with the CPU [26]. Because of this design, there may be little to no memory
available for the message on the GPU at all. In this case, page-locked memory provides
another benefit as it could be used to hold all of the internal
tree memory as well. While the overall memory subsystem may be a limiting factor, there






Figure 5.6: Pages of Memory for Sequential Processing
Even though page-locked memory can provide a reasonably sized resource for containing
the input message, there is still a limitation when the amount of data that needs to be hashed
exceeds the limits of memory. When this occurs, the input would need to be broken down
into blocks, or pages, as shown in Figure 5.6. After each page is processed, a relatively
small portion of memory would be needed to keep the chaining values that connect it to
the rest of the hash tree. If the device had enough global memory to support a sufficiently
large tree and a significant portion of the input message, then each page would have to be
copied to the GPU individually. While it may not appear to be a significant amount of data,
the overhead of actually performing the copy may outweigh any performance benefit of the
memory accesses. Overall, this design using page-locked memory provides a significant




This chapter presents all of the results that were obtained from executing the design on a
real GPU. While there are many parameters that could be modified, only a few values were
selected for each one to analyze the performance. For all of the graphs, a thread block
size of 16 × 16 was used, as this value fully utilizes the SMs. Unless otherwise noted, the
rate that was used for the KECCAK sponge was 1024 bits and the total size of the state was
1600 bits.
For this work, an NVIDIA Tesla card was used for the analysis because the Tesla family of
GPUs is designed for parallel computation. Furthermore, the Kepler architecture allows the
analysis to be performed on the latest GPU architecture. Table 6.1 shows the system
configuration that was used.
Table 6.1: Testing Platform
Component            System A
CPU                  Intel Core i7-2600, 3.40 GHz
OS                   Windows 7 x64
RAM                  16 GB
GPU                  Tesla K20c (GK110)
GPU RAM              6 GB GDDR5
Compute Capability   3.5
6.1 Hashing Speed using Page-Locked Memory
The first test that was completed was an evaluation of the hashing speed using the
page-locked memory approach described in Chapter 5. When this test was designed, the input
file size was selected to be approximately five gigabytes due to the nature of tree hashing.
Because of this file size, the input file and the tree could not be stored in global memory
simultaneously without some sort of paging algorithm. Therefore, the input file was
allocated in CPU RAM using page-locked memory, allowing the GPU to perform the operation
without an explicit memory copy to device memory. Whenever a request for data is made, the
value is transferred over the PCI bus to the GPU; all of the speeds that are presented take
into account the access time to the CPU memory.
Using the aforementioned test, both tree modes were analyzed. Figure 6.1a shows the results
using the LI mode for a tree degree of two. There are three important regions of the graph
for the performance analysis. First, for all heights less than 11, there are not enough
threads to perform the computation and maintain full utilization of the GPU; it is therefore
unable to hide the latency of memory operations. The second region, between heights 11 and
20, inclusive, shows the best performance. In this region, there is a well-balanced mix of
threads to hide many of the long memory operations, resulting in a speed of more than
3 GB/s. Finally, increasing the height above 20 only degrades performance because the
additional kernel launches add overhead without providing any benefit to the hashing speed.
Figure 6.1b shows the same type of behavior for a tree of degree three. However, the
performance regions differ because the size of the tree depends on the degree. With this
tree, the hashing speed is found to be less than 2.5 GB/s. Figures A.1a to A.7a show the
hash speed for degrees four to ten, respectively. For all of the even degrees, the
performance is slightly more than 3 GB/s, whereas the performance is less than 2.5 GB/s for
odd degrees.
The other important observation from Figures 6.1a and 6.1b is the relationship between the
speed of the hash function and the block size. In the figures, the block size is increased
by 64 bits each time. As the block size increases, the speed of the hash function begins to
decrease. With a larger block size, the kernels are required to perform more reads from the
message memory to get all of the data. Furthermore, each thread accesses a different part of
the message memory, thereby potentially helping other threads in the process due to caching.
Based on the data requirements for a single thread, one read from memory may bring in more
than one thread's worth of data, reducing the total number of memory requests that need to
be made. More information about caching in relation to block size is given in Section 6.5.
Figure 6.1: Hash Performance using LI Mode. (a) Degree = 2; (b) Degree = 3
The final piece of information that can be extracted from the LI performance figures is the
relationship between the speed and the block size. As stated, the speed for all even
degrees is approximately 3 GB/s, whereas odd degrees exhibit a speed of just under 2.5 GB/s.
The effect of the block size on the speed is a linear relationship that depends on whether
the degree is even or odd. Figure 6.2a shows how the speed of the hash is related to the
multiple of a 64-bit block size by an exponential decay model. While differences between the
degrees are observed for small block sizes, the performance eventually levels off to the
same value for both degrees at approximately 530 MB/s. This result indicates that a large
block size is not the ideal selection, as the kernels are required to perform more work to
read all of the data into the internal state; a 64-bit block size provides the best
performance.
Figure 6.2: Block Size Variations. (a) Block Size Effect on Speed of the Hash Function;
(b) Maximum Speed Ratio for Block Sizes
Based on the maximum speeds that are attainable, the ratio of the maximum speed over all
block sizes, which was obtained with a 64-bit block size, to the maximum speed for each
block size can be compared. The same behavior exists with these ratios as a result of the
behavior of the speeds for odd and even degrees. Figure 6.2b shows the ratios for degrees
two and three. Both even and odd degrees displayed similar trends, but the changes were
somewhat different. Equations 6.1 and 6.2 show the complete summary of the various regions
of the graph for even and odd degrees, respectively, and how the increase in block size
relates to




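\[
\begin{cases}
(0.9857\,\alpha + 0.0118)^{-1} & \text{for } 1 \le \alpha \le 4 \\
(0.5592\,\alpha + 1.7159)^{-1} & \text{for } 4 \le \alpha \le 8 \\
(-0.1476\,\alpha + 7.3753)^{-1} & \text{for } 8 \le \alpha \le 11
\end{cases}
\]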





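\[
\begin{cases}
(0.7219\,\alpha + 0.2774)^{-1} & \text{for } 1 \le \alpha \le 4 \\
(0.4109\,\alpha + 1.5217)^{-1} & \text{for } 4 \le \alpha \le 8 \\
(-0.1085\,\alpha + 5.6786)^{-1} & \text{for } 8 \le \alpha \le 11 \\
4.5786^{-1} & \text{for } \alpha > 11
\end{cases}
\tag{6.2}
\]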
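The piecewise model for odd degrees in Equation 6.2 can be evaluated with a small Python
function. This is a sketch only: the function name is hypothetical, the result is
interpreted here as the speed relative to the 64-bit maximum (an assumption based on the
value being close to one at α = 1), and at the shared boundaries α = 4 and α = 8 the earlier
segment is used.

```python
def relative_speed_odd(alpha):
    """Evaluate Equation 6.2 for a block size of alpha * 64 bits.

    Interpreting the result as speed relative to the 64-bit maximum is
    an assumption. Boundary values use the earlier segment.
    """
    if alpha < 1:
        raise ValueError("alpha must be at least 1")
    if alpha <= 4:
        denominator = 0.7219 * alpha + 0.2774
    elif alpha <= 8:
        denominator = 0.4109 * alpha + 1.5217
    elif alpha <= 11:
        denominator = -0.1085 * alpha + 5.6786
    else:
        denominator = 4.5786
    return 1.0 / denominator


for alpha in (1, 4, 8, 16):
    print(alpha, round(relative_speed_odd(alpha), 4))
```

For block sizes above 11 × 64 bits the model is constant, which corresponds to the region
where the speed levels off in Figure 6.2a.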
When the LI tree mode is used, there are clear differences in hash speed as the block size
and height of the tree are changed. The FNG tree mode shows results that differ greatly
from the LI mode. Figure 6.3 shows the hash speed when using FNG mode for degrees two and
three. The first major difference compared with LI mode is the block sizes that are used.
Because the input is so large and the FNG mode is designed to process only B bits at each
leaf, the number of threads could easily exceed the available resources if the block size
is not carefully chosen. For example, a 64-bit block size with five gigabytes of input and
degree two will always result in 671,088,640 threads; a degree of three results in up to
679,181,598 threads. Between all of the registers and memory that are required, these
kernels cannot be launched on the current generation of hardware. Therefore, the block
sizes that were tested had to be greater than 1024 bits (α = 16), which is in the range
where the hash speed remains constant.
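The thread count for degree two can be reproduced with a short calculation. The sketch
below assumes one thread per B-bit leaf and takes five gigabytes to mean 5 × 2^30 bytes,
which is the interpretation that yields the figure quoted above; the function name is
hypothetical.

```python
def fng_leaf_threads(message_bytes, block_bits):
    """Number of B-bit leaves (one thread each) needed to cover the message."""
    message_bits = message_bytes * 8
    return -(-message_bits // block_bits)  # ceiling division


FIVE_GIB = 5 * 2**30  # assumed meaning of "five gigabytes"

print(fng_leaf_threads(FIVE_GIB, 64))    # 671088640
print(fng_leaf_threads(FIVE_GIB, 1024))  # 41943040
```

Moving from a 64-bit block to a 1024-bit block reduces the number of leaf threads by a
factor of 16.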
Figures 6.3a and 6.3b indicate that the speed of the hash function using FNG mode is
approximately 500 MB/s for both degrees and all block sizes once a sufficiently large height
is used. The maximum speed that was obtained was 516.55 MB/s. Appendix A shows the FNG mode
speeds for degrees four to ten in Figures A.1b to A.7b. Even though the degree of the tree
is changed, the performance does not change because the number of threads is approximately
the same for all input cases. These results show that the FNG mode is not the best option
for a general instance of a tree hash because the resources required on the GPU depend on
the length of the input message. Compared to the LI mode, hash speeds are approximately
one-sixth the value.
Figure 6.3: Hash Performance using FNG Mode. (a) Degree = 2; (b) Degree = 3
Based on the comparison of the hashing speeds for both tree modes, the best performance is
obtained when LI mode is used. Table 6.2 shows the ideal operating points when running the
tree hash for speed. Even though there is a range of heights, all values within the range
produce approximately the same performance. With the smaller heights, the memory
requirement for the tree is the lowest, reducing the overall GPU memory requirement. While
all of the degrees are listed, the even degree trees provide the best performance.
Therefore, the design of the system should use an even degree tree to take full advantage
of the parallelism.
Table 6.2: Ideal Tree Hashing Operating Points
Degree   Heights        Maximum Speed (GB/s)   Cycles per byte
2        11 ≤ h ≤ 20    3.06                   1.034
3        7 ≤ h ≤ 12     2.38                   1.330
4        6 ≤ h ≤ 10     3.07                   1.030
5        5 ≤ h ≤ 9      2.38                   1.330
6        6 ≤ h ≤ 8      3.07                   1.031
7        4 ≤ h ≤ 7      2.37                   1.332
8        4 ≤ h ≤ 7      3.06                   1.031
9        5 ≤ h ≤ 7      2.37                   1.331
10       4 ≤ h ≤ 6      3.07                   1.030
6.1.1 CPU Clock Cycle Conversion
There are a number of ways to express the speed of the hash function. Up to this point, all
of the speeds have been reported in bytes per second (MB/s or GB/s). In order to provide a
more direct comparison to the sequential implementations, these values can be converted to
cycles per byte. On the Windows platform, there are two ways to count cycles. The first is
the Time Stamp Counter (TSC), a processor counter of clock cycles that is read through a
compiler intrinsic. In the C/C++ language, the cycles are obtained using the __rdtscp()
function, which returns the number of cycles from some point in time. The second method is
to use the Query Performance Counter (QPC) API in the Windows operating system. This
high-resolution timer is platform independent and is intended to provide the highest
possible resolution. The two functions, QueryPerformanceFrequency() and
QueryPerformanceCounter(), get the frequency of the clock and the number of ticks from some
point in time. One tick of the QPC clock does not correspond to one tick of the TSC; it can
be shown that 1024 TSC ticks correspond to a single tick of the QPC. Equation 6.3 shows




×DataSizeMB ×QPCFrequency × 1024 (6.3)
If it is assumed that the speed is in MB/s and the data size is in MB, then the conversion







For all of the timing measurements that were obtained, the QPC method was used. On the test
machine, the frequency reported by the QPC API was 3312841 ticks per second. Table 6.2
shows the equivalent cycles per byte in terms of CPU clock cycles using this method of
conversion. Even though the GPU is being invoked and the processing really happens there,
the CPU cycles per byte metric can still be useful when comparing the overall speed of the
system to other implementations. Compared to Table 4.1, these speeds show significant
improvement in the speed of long message calculations. The rates for the given
implementations range from 6.28 cycles per byte up to 115 cycles per byte. For an Intel
Core i7 920, the speed was measured at 13.09 cycles per byte. The tree hashing
implementation with a GPU is up to 12.71 times faster when LI mode is used. As a basis for
comparison, the FNG tree mode operates at 6.26 cycles per byte. Compared to the fastest
sequential implementation, the speedup is 1.003, meaning that the FNG tree mode is not a
good choice for a general purpose tree hash; the sequential implementation would perform
just as well.
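The conversion can be sketched in Python as follows. This is a reconstruction from the
quantities named in the text (the QPC frequency of 3312841 and the factor of 1024 TSC ticks
per QPC tick), not the exact form of Equation 6.3. It assumes that one MB is 2^20 bytes,
which reproduces the 6.26 cycles per byte quoted for the 516.55 MB/s FNG result. The
function and constant names are hypothetical.

```python
QPC_FREQUENCY = 3312841          # QPC ticks per second on the test machine
TSC_TICKS_PER_QPC_TICK = 1024    # ratio stated in the text
BYTES_PER_MB = 2**20             # assumption: binary megabytes


def cycles_per_byte(speed_mb_per_s, qpc_frequency=QPC_FREQUENCY):
    """Convert a hash speed in MB/s to CPU cycles per byte."""
    seconds_per_mb = 1.0 / speed_mb_per_s
    qpc_ticks_per_mb = seconds_per_mb * qpc_frequency
    cycles_per_mb = qpc_ticks_per_mb * TSC_TICKS_PER_QPC_TICK
    return cycles_per_mb / BYTES_PER_MB


fng_cpb = cycles_per_byte(516.55)
print(round(fng_cpb, 2))         # 6.26
print(round(13.09 / 1.030, 2))   # LI speedup over the i7 920: 12.71
print(round(6.28 / 6.26, 3))     # FNG speedup over the fastest sequential: 1.003
```

The last two lines reproduce the speedups quoted above directly from the cycles-per-byte
figures.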
6.2 Hashing Speed using Page-Locked Memory Rate Comparison
In the sequential hash with KECCAK, the defined rate, r, can have a large impact on the
speed of the function. As the rate increases, the underlying security decreases, but the
speed can increase as a result. The third column of Table 6.3 shows the relative
performance when using a sequential hash function [11]. Figure 6.4 shows the normalized
comparison of the rates for various heights at degree two using LI mode. When the height is
between five and ten, inclusive, there is a significant difference between the rates due to
the lack of parallelism in the tree; these heights correspond to relatively low hash speeds
in Figure 6.1a. When the height is between 11 and 20, inclusive, the change in rate has
little impact on the hash speed. The last two columns of Table 6.3 compare these two
regions to a rate of 1024 bits using tree hashing for a tree with degree two. From the
ideal operating points in Table 6.2, the same approximate values for the relative
performance were obtained. Table 6.4 shows the performance comparison for the ideal


































Table 6.3: Relative Performance Comparison
r      c      Relative Performance   Tree, 5 ≤ h ≤ 10   Tree, 11 ≤ h ≤ 20
              (sequential)
256    1344   ÷ 4.000                ÷ 2.717            ÷ 1.911
576    1024   ÷ 1.778                ÷ 1.377            ÷ 1.039
832    768    ÷ 1.231                ÷ 1.135            ÷ 1.006
1024   576    1                      1                  1
1088   512    × 1.063                × 1.021            × 1.001
1152   448    × 1.125                × 1.114            × 1.000
1216   384    × 1.188                -                  -
1280   320    × 1.250                -                  -
1344   256    × 1.312                × 1.208            × 0.994
1408   192    × 1.375                -                  -
1536   64     × 1.500                × 1.402            × 1.011
Based on these multipliers for the tree hash mode, it is clear that the parallelism in the tree
dominates the performance more than the rate of the underlying sponge. For all compar-
isons to the tree hash with r = 1024, all of the speeds are the same within three percent,
with the exception of r = 256. The relative performance compared with the sequential
hash is always better, even in the cases where the tree is not operating at its peak speed for
degree two. The other tree degrees show the same behavior where there is almost no dif-
ference in the overall hash speed based on a change in rate. Because the rate does not have
a large impact on speed, a lower rate could be used in the hash function, thereby increasing
the security envelope and providing a more secure hash function without sacrificing hash
speed.
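The first three columns of Table 6.3 follow a simple pattern that can be checked with a few
lines of Python: the capacity is c = 1600 − r, and the sequential relative performance
values match r / 1024. The sketch below only illustrates this observation; the names are
hypothetical, and the proportional scaling is inferred from the tabulated values rather
than stated in the text.

```python
STATE_BITS = 1600      # total KECCAK state size used in this work
BASELINE_RATE = 1024   # rate used as the reference in Table 6.3


def capacity(rate):
    """Capacity c of the sponge for a given rate r."""
    return STATE_BITS - rate


def sequential_relative_speed(rate):
    """Sequential speed relative to r = 1024, assuming speed scales with r."""
    return rate / BASELINE_RATE


for rate in (256, 576, 1024, 1344, 1536):
    print(rate, capacity(rate), sequential_relative_speed(rate))
```

A relative speed of 0.25 for r = 256 corresponds to the ÷ 4.000 entry, and 1.3125 for
r = 1344 corresponds to the × 1.312 entry.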
6.3 GPU Memory Use Effect on Hash Speed
Based on the previous section, there were multiple tree degrees and heights that produced
the maximum speed of the system using page-locked memory. In addition to using page-
locked memory, this section shows the impact of using GPU global memory to perform the
hash. For all of the figures and tables, a tree of degree two is used because it is one of the
configurations that produces the highest speed. Since the fastest sequential implementation




























































































































































































































































































































































































































Figure 6.5: Performance Comparison of Hash Speeds with Associated Overhead using Global
Memory. (a) LI Mode; (b) FNG Mode
speed that should be seen for the data is 515.16 MB/s.
Figure 6.5a shows the speed of the hash function when global memory is utilized with an LI
mode tree. For this particular height, the function is able to operate at speeds greater
than 9 GB/s, which corresponds to approximately 0.359 cycles per byte. The reason for this
significant performance improvement is the interconnection network between the memory and
the processing cores. When the GPU accesses data in global memory, the bandwidth is very
high, which allows this computation speed to be obtained. However, the downside is the cost
of performing the memory copy from CPU memory to GPU memory. The plot also shows the
resulting speed when the copy overhead is accounted for. While the speed remains
significantly faster than the sequential hash, the overall speed is reduced to
approximately 2 GB/s.
In addition to LI mode, the same analysis can be performed using FNG mode, despite the
performance difference. Figure 6.5b shows the resulting speed for FNG mode. Even though the
data exists in GPU memory, the resulting speeds are still very low due to the number of
threads that are running. When the speed is calculated for the hash function, it falls
short of 200 MB/s; when the extra cost of the memory transfer is included, the speed drops,
but not as much as when LI mode is used. In terms of a comparison to the sequential hash,
this mode will never be faster. One of the benefits of exploring FNG mode with smaller
input sizes is the ability to reduce the block size. The data presented in the figure was
obtained using a 64-bit block size. Based on LI mode, the best performance is obtained when
a small block size is used; despite this fact, the performance does not exceed the speed
that was calculated for a large input size. Even though there were still a large number of
threads, the device limits for memory and resources were not exceeded until the size of the
input exceeded 350 MB.
Figure 6.6: Performance Comparison of Hash Speeds with Associated Overhead using
Page-Locked Memory. (a) LI Mode; (b) FNG Mode
The first half of Table 6.5 compares, for both tree modes, the maximum hashing speed of the
raw computation with the combined speed when the copy overhead is taken into account, for
the case where global memory and memory copies are used. Since FNG mode already has a high
dependency on resource availability, the resulting performance degradation is not as large
as can be seen with LI mode. For FNG mode, the greatest decrease is about 7.2%, whereas the
LI mode decreased by up to 74.8%.
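The LI figure can be reproduced from the height 16 global memory entries of Table 6.5
(9050.0 MB/s for the hash alone and 2278.6 MB/s including the copy). The Python sketch
below also estimates the copy bandwidth that these two numbers would imply under a simple
model in which the copy and the hash run strictly one after the other. That model and the
function names are assumptions made for illustration, not part of the original analysis.

```python
def overhead_decrease_percent(hash_speed, complete_speed):
    """Percentage by which the overhead lowers the measured speed."""
    return 100.0 * (1.0 - complete_speed / hash_speed)


def implied_copy_bandwidth(hash_speed, complete_speed):
    """Copy bandwidth implied by a serial copy-then-hash model.

    Assumes 1 / complete = 1 / hash + 1 / copy (illustrative only).
    """
    return 1.0 / (1.0 / complete_speed - 1.0 / hash_speed)


# LI mode, height 16, global memory (speeds in MB/s, from Table 6.5).
decrease = overhead_decrease_percent(9050.0, 2278.6)
bandwidth = implied_copy_bandwidth(9050.0, 2278.6)
print(round(decrease, 1))  # 74.8
print(round(bandwidth))
```

Under this model the implied transfer rate is roughly 3 GB/s, which would explain why the
complete speed stays close to 2 GB/s regardless of how fast the hash itself runs.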
In order to perform a baseline comparison, the same analysis is done using page-locked
memory and calculating the associated overhead. When global memory was used, the extra
memory copies resulted in a significant performance degradation. The extra overhead
associated with using page-locked memory is the conversion of the host pointer into a
device pointer, which is significantly less costly than a memory transfer. In Figure 6.6,
both LI and FNG modes are presented with the use of page-locked memory. For both tree
modes, the performance is nearly identical with and without the overheads. Comparing FNG
mode across both types of memory shows that it performs at essentially the same speed. This
shows that the resource requirements are very high when using FNG mode and that it should
be avoided for general purpose use. The second half of Table 6.5 shows the maximum speeds
that were obtained using page-locked memory. The performance decrease for LI mode is 0.1%
and the decrease for FNG mode is 0.2% for some cases.
While analyzing the performance overhead of different types of memory is important, it is
also necessary to determine the data size at which the tree hash outperforms a sequential
hash. The behavior of Figures 6.5 and 6.6 shows that for small data sizes, the speed
increases quickly and then converges to a fixed value once a sufficient data size is used
for input. In order for the tree hash to be more beneficial than a sequential hash, the
speed must exceed 515.16 MB/s. For all cases of using the FNG mode, the speed never
exceeded 200 MB/s. When LI mode was used, the speed always beat the sequential hash except
for the case when the height of the tree was eight; anything greater provided a significant
speedup.
Figure 6.7 illustrates the minimum data sizes that are required in order to achieve near-
maximum performance. In the plots, the hash speed refers to the speed of only the hash
function with no overhead; the complete speed takes into account the additional overheads
of memory copies or initializations. The indicated percentage determines how close the
values are to the maximum. For example, if the maximum speed were 2000 MB/s and
the one percent metric was used, the minimum size would be determined when the speed
reached a minimum of 1980 MB/s. For the FNG mode, there are few visible differences
between the plots. However, LI mode indicates that there are a number of differences
between the start and end of the heights, but the converged value is the approximately the
77
Table 6.5: Maximum Hash Speeds using Different Memories
(Hash and Complete speeds in MB/s; Ratio = Complete / Hash)

                      LI Mode                         FNG Mode
Height      Hash    Complete    Ratio        Hash    Complete    Ratio

Global Memory
8          549.9     465.6     0.84694      29.58     29.29     0.99028
9         1097.8     806.7     0.73483      52.05     51.17     0.98295
10        2190.3    1274.1     0.58171      85.19     82.85     0.97247
11        3918.7    1712.9     0.43712     125.80    120.69     0.95941
12        6103.2    2030.8     0.33275     156.66    148.87     0.95027
13        6547.9    2077.4     0.31727     180.17    169.94     0.94324
14        8010.3    2206.3     0.27544     185.04    174.25     0.94166
15        8069.9    2209.2     0.27376     194.16    182.38     0.93931
16        9050.0    2278.6     0.25177     196.48    184.45     0.93879
17        8376.9    2231.9     0.26644     199.65    187.23     0.93778
18        8056.4    2209.1     0.27420     197.40    185.09     0.93764
19        6983.1    2118.0     0.30331     197.49    185.35     0.93851
20        5824.0    1997.1     0.34292     193.51    181.81     0.93955
21        4220.6    1767.7     0.41882     188.71    177.48     0.94049

Page-Locked Memory
8          489.1     489.0     0.99982      29.31     29.303    0.99978
9          975.0     974.7     0.99967      52.00     51.99     0.99976
10        1928.3    1927.0     0.99934      85.06     85.028    0.99961
11        3071.0    3068.3     0.99911     125.51    125.41     0.99918
12        3103.2    3099.9     0.99894     156.17    156.04     0.99917
13        3042.8    3039.4     0.99890     179.55    179.38     0.99905
14        3112.5    3110.9     0.99948     184.38    184.20     0.99903
15        3037.7    3034.5     0.99896     193.28    193.11     0.99911
16        3088.9    3086.6     0.99924     195.71    195.53     0.99910
17        3045.2    3042.5     0.99911     197.48    197.32     0.99917
18        2991.7    2987.3     0.99852     196.61    196.43     0.99910
19        2865.6    2863.4     0.99923     196.42    196.25     0.99913
20        2677.4    2674.8     0.99905     192.73    192.59     0.99926
21        2342.3    2339.8     0.99892     188.04    187.88     0.99913
22        1899.4    1897.8     0.99916     175.67    175.36     0.99828
Table 6.6: Minimum Required Data Sizes for Adequate Performance
              Page-Locked Memory               Global Memory
Height    1% (MB)  2% (MB)  5% (MB)     1% (MB)  2% (MB)  5% (MB)
8            91       45       19         108       54       23
9           176       94       36         174       98       39
10          385      217       95         301      166       66
11          609      353      156         366      229       94
12          575      371      173         519      308      132
13          443      304      158         528      324      137
14          710      468      228         633      418      182
15          730      510      261         739      494      229
16          827      626      354         813      588      326
17          990      805      499         946      726      448
18         1045      896      620        1045      883      597
19         1118     1014      791        1123     1000      750
20         1170     1100      930        1134     1107      955
21         1219     1178     1055        1227     1203     1132
22         1227     1192     1110        1234     1213     1148
same at 1200 MB for a large tree height. Table 6.6 shows the one, two, and five percent
data sizes that are required for the complete hash speed including the overhead times. In
a number of the scenarios, the global memory design requires less data to reach its peak
performance, but the global memory design also suffers from running at two-thirds the
speed of the page-locked memory design.
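The percentage metric used for Table 6.6 can be illustrated with a short sketch. The function name and the sample measurements below are hypothetical and are not taken from the experiments; the sketch only assumes that a series of (data size, complete speed) pairs is available and returns the smallest size whose speed is within the given percentage of the peak.

```python
def min_size_for_threshold(sizes_mb, speeds_mb_s, percent):
    """Smallest data size (MB) whose speed is within `percent` of the peak speed."""
    peak = max(speeds_mb_s)
    target = peak * (1.0 - percent / 100.0)   # e.g. 1980 MB/s for a 2000 MB/s peak at 1%
    for size, speed in sorted(zip(sizes_mb, speeds_mb_s)):
        if speed >= target:
            return size
    return None


# Hypothetical measurements: data size in MB and complete hash speed in MB/s.
sizes = [64, 128, 256, 512, 1024, 2048]
speeds = [900.0, 1500.0, 1910.0, 1985.0, 1995.0, 2000.0]

for pct in (1, 2, 5):
    print(pct, "percent:", min_size_for_threshold(sizes, speeds, pct), "MB")
```

With a peak of 2000 MB/s, the one percent target is 1980 MB/s, matching the example given earlier in this section.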
When all of this data is accounted for, the use of global memory for the input message does
not provide the optimal performance. Even though the hash function itself runs nearly
three times faster from global memory, the memory overheads cannot be neglected when the
data has to be copied to the GPU before the hash is performed. Page-locked memory
provides a useful alternative that does not suffer from the same overhead limitations. The
choice of memory only appears to have an impact on the hash speed when LI mode is used; FNG
mode does not show any performance difference between the different types of memories. Using
this information, a general-purpose tree hashing design should take advantage
of page-locked memory in LI mode.
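The relationship between the hash speed and the complete speed can be made explicit with a small calculation. For a fixed message size, speed is size divided by time, so the ratio of complete speed to hash speed equals the fraction of total time spent in the hash itself, and one minus that ratio is the fraction spent on overheads. The sketch below applies this to the LI mode, height 16 rows of Table 6.5; the function name is hypothetical.

```python
def overhead_fraction(hash_speed, complete_speed):
    """Return (ratio, overhead share) for speeds measured on the same message size."""
    ratio = complete_speed / hash_speed      # t_hash / (t_hash + t_overhead)
    return ratio, 1.0 - ratio                # remainder is copy/initialization time


# LI mode, height 16, values from Table 6.5 (MB/s).
global_ratio, global_overhead = overhead_fraction(9050.0, 2278.6)
pinned_ratio, pinned_overhead = overhead_fraction(3088.9, 3086.6)

print(f"global memory:      ratio={global_ratio:.5f}, overhead share={global_overhead:.1%}")
print(f"page-locked memory: ratio={pinned_ratio:.5f}, overhead share={pinned_overhead:.1%}")
```

Under this reading, roughly three quarters of the total time in the global memory case is spent outside the hash kernels, while the page-locked case spends almost none.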
Figure 6.7: Minimum Required Data Size for Hashing
(a) LI Mode (b) FNG Mode
Figure 6.8: Leaf Kernel Execution Time Distribution
6.4 Kernel Level Timing Analysis
In all of the previous results, the average throughput and timing were calculated using the
overall system performance. In addition to that analysis, kernel level timing information
can be obtained with the NVIDIA Visual Profiler. Running the application in the tool
reports, with microsecond precision, the amount of time that each kernel
required.
The first type of analysis that can be done is performed on the leaf kernels for the different
tree modes. Figures 6.8a and 6.8b show the histograms of how the leaf kernel execution
times are affected by the tree parameters for the LI and FNG modes, respectively. As part
of the original design, LI mode has a high dependence on the amount of input, whereas
FNG mode reads a fixed number of bits from the input. For LI mode, the average kernel
time was 15.398 seconds, with a standard deviation of 46.336 seconds; FNG mode has an
average time of 9.5464 seconds with a standard deviation of 0.0404 seconds.
In the general case for LI mode, the execution times are very short due to the available
resources. When the tree has a small height, the amount of input processed by each leaf is
very high, resulting in high execution times. The average in this experiment is very high
(a) Leaf and First Internal Processing Kernels (b) Final Four Processing Kernels
Figure 6.9: Kernel Performance for Small Tree Size
as a result of the data set that was used; varying degrees and heights were used to find the
average and standard deviation. Even though the same test cases were used for FNG mode,
the results show that all of the leaf kernels take the same amount of time. After all, each
thread is processing a fixed size input and the tree attempts to create approximately the
same number of leaves, regardless of the tree parameters.
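The dependence of the LI leaf time on the tree height can be illustrated with a rough calculation. The sketch below assumes that an LI tree of degree D and height H has D^H leaves, which is consistent with the node counts used in Equation 6.5, and that the input is divided evenly among the leaves. The five gigabyte input is the approximate size quoted for the kernel throughput experiments; the function name is hypothetical.

```python
def li_leaf_workload(input_bytes, degree, height):
    """Return (number of leaves, bytes per leaf) for an LI tree, assuming degree**height leaves."""
    leaves = degree ** height
    return leaves, input_bytes / leaves


INPUT_BYTES = 5 * 10**9   # roughly five gigabytes, taken as 5e9 bytes for illustration

for height in (5, 10, 16):
    leaves, per_leaf = li_leaf_workload(INPUT_BYTES, 2, height)
    print(f"H={height:2d}: {leaves:6d} leaves, {per_leaf:14.1f} bytes per leaf")
```

With a degree of two, a height of five gives only 32 leaves that each process more than 150 MB, whereas a height of 16 spreads the same input over 65536 leaves.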
In the grand scheme of tree hashing, the leaf kernel is responsible for a majority of the
processing time. Table 6.7 shows the percentage of total execution time associated with the
leaf kernel and first internal processing kernel for both tree modes. For both tree modes, the
sum of the first two kernels accounts for almost all of the processing time. In LI mode, the
leaf kernel always takes the majority of the time. However, FNG
mode has some cases where the first processing kernel takes far more time than the leaf kernel.
In this mode, the first processing kernel is responsible for converting the variable degree R
into degree D, meaning that a significant amount of time can be spent in that kernel.
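The percentages reported for the first two kernels are shares of the summed kernel execution time. A minimal sketch of that calculation is shown below; the per-kernel durations are hypothetical placeholders for the microsecond timings that a profiler would report, and the function name is an assumption.

```python
def kernel_time_shares(durations_us):
    """Percentage of the total kernel time taken by each kernel launch."""
    total = sum(durations_us)
    return [100.0 * t / total for t in durations_us]


# Hypothetical timings (microseconds): leaf kernel, first internal kernel,
# then the remaining internal kernels.
durations = [300.0, 100.0, 50.0, 50.0]
shares = kernel_time_shares(durations)

print("leaf kernel:           %.2f%%" % shares[0])
print("first internal kernel: %.2f%%" % shares[1])
```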
Figure 6.9a shows one of the special cases when the first internal processing kernel takes
longer than the leaf kernel for FNG mode. Even though there are six kernel invocations,
only the first two are able to be seen because of the amount of work that is done for both
tree modes. For the test tree, the degree was two and the height was six. In FNG mode, the
Table 6.7: Percentage of Kernel Execution Times for First Two Kernel Launches
                  FNG Mode                        LI Mode
Height    Leaf Kernel  First Kernel     Leaf Kernel  First Kernel

5            17.30        82.70           99.999     2.54 × 10^-4
6            27.80        72.20           99.998     6.37 × 10^-4
7            43.82        56.17           99.994     1.26 × 10^-3
8            60.61        39.38           99.986     2.53 × 10^-3
9            75.47        24.51           99.967     5.03 × 10^-3
10           86.02        13.97           99.926     9.92 × 10^-3
11           92.47         7.51           99.863     1.63 × 10^-2
12           95.80         4.18           99.845     1.72 × 10^-2
13           97.10         2.87           99.830     2.07 × 10^-2
14           97.48         2.49           99.780     4.19 × 10^-2
15           97.79         2.17           99.712     7.23 × 10^-2
16           97.77         2.18           99.571     1.34 × 10^-1
17           98.08         1.85           99.335     2.41 × 10^-1
18           97.95         1.94           98.897     4.49 × 10^-1
19           97.93         1.88           98.055     8.51 × 10^-1
20           97.71         1.94           96.453     1.61 × 10^0
21           97.29         2.09           93.502     3.09 × 10^0
22           96.61         2.23           88.067     5.78 × 10^0

4            72.22        27.78           99.960     1.47 × 10^-2
5            93.73         6.27           99.928     1.97 × 10^-2
6            97.24         2.75           99.861     6.17 × 10^-2
7            98.00         1.98           99.610     2.51 × 10^-1
8            98.02         1.91           98.367     1.25 × 10^0

1              -            -             100        0
2            13.43        86.57           99.998     2.08 × 10^-3
3            63.56        36.44           99.954     2.48 × 10^-2
4            95.17         4.83           99.920     3.02 × 10^-2
5            97.86         2.13           99.762     1.57 × 10^-1
6            98.02         1.94           98.269     1.50 × 10^0
variable degree was calculated to be 12315926. There is almost as much work done by the
first processing kernel as the leaf kernel because all of the input data has to be converted
to degree two. Based on the major differences that exist between the first two kernels, the
last four kernels can be analyzed as well. Figure 6.9b shows the comparison between the
final four kernel invocations. Even though the tree modes are different, the first two kernels
have done enough processing to make the rest of the kernels take approximately the same
amount of time.
The final analysis that can be done with kernel level timing information is the throughput
at each stage. The input size for both tree modes was approximately five gigabytes. For
each successive kernel invocation, the amount of data is reduced. Equation 6.5 shows the
calculation to determine the number of input bits into each of the internal kernels. This
equation can be used for all internal kernels in LI mode and the last H − 2 internal kernels
for FNG mode; Equation 6.6 shows the calculation for the size of the data in bits of the first
internal kernel in FNG mode. Both of these equations take into account the padding that
has to be performed.
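\[ \mathrm{InputSize}_{LI} = D^{h} \times \left[ (D \times c) + r - \left[ (D \times c) \bmod r \right] \right], \quad 0 < h < H \tag{6.5} \]
\[ \mathrm{InputSize}_{FNG} = D^{H-1} \times \left[ (R \times c) + r - \left[ (R \times c) \bmod r \right] \right] \tag{6.6} \]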
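Equations 6.5 and 6.6 can be evaluated directly, as in the sketch below. The function names are hypothetical, and the values c = 512 and r = 1088 are assumed purely for illustration; they are not stated in this section. The degree of two, height of six, and variable degree R = 12315926 are the values quoted for the FNG test tree. Note that the padding term always adds between 1 and r bits, so a node input that is already a multiple of r grows by one full block.

```python
def padded_bits(children, c, r):
    """Bits absorbed by one internal node: children * c bits, padded up to the next multiple of r."""
    message_bits = children * c
    return message_bits + r - (message_bits % r)


def input_size_li(D, h, c, r):
    """Equation 6.5: input bits to the internal kernel at level h (0 < h < H) in LI mode."""
    return D ** h * padded_bits(D, c, r)


def input_size_fng_first(D, H, R, c, r):
    """Equation 6.6: input bits to the first internal kernel in FNG mode."""
    return D ** (H - 1) * padded_bits(R, c, r)


C_BITS, R_BITS = 512, 1088   # assumed values, for illustration only

print(input_size_li(2, 3, C_BITS, R_BITS))
print(input_size_fng_first(2, 6, 12315926, C_BITS, R_BITS))
```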
Figure 6.10a shows the throughput for the leaf kernels when the degree of the tree is two
operating in LI mode. This plot shows two regions that have differing performance. When
the height is between five and ten, inclusive, the throughput approximately doubles each
time. The second region shows that throughput remains constant as the height continues to
increase. Once the peak is reached, additional height does not provide any improvement.
This is the same type of relationship that is seen in Figure 6.1a with the overall hash speed,
except for the lack of degradation at large heights. Figure 6.10b shows the same tree in FNG
mode throughput for the leaf kernel. Because each leaf kernel executes in approximately
the same amount of time, there is essentially no difference as the tree height is changed.
(a) LI Leaf Kernel Throughput (b) FNG Leaf Kernel Throughput
(c) LI First Processing Kernel Throughput (d) FNG First Processing Kernel Throughput
Figure 6.10: First Two Kernel Throughputs for a Tree with Degree = 2
Based on both of these plots, the throughput of the leaf kernel provides an accurate estimate
of the maximum throughput that is possible for the entire tree.
Due to the differences that exist between the leaf kernel throughputs, there should also
be a difference in the first internal processing kernel. Figures 6.10c and 6.10d show the
first internal kernel throughputs for LI and FNG mode, respectively, in a tree with degree
two. For nearly all heights that were used, the FNG throughput is almost twice that of the
LI mode. Even though this throughput is greater, it is not enough to overcome the slow
FNG leaf kernel, so FNG mode remains slower than LI mode overall.
Figure 6.11 shows the kernel throughputs for all kernel invocations for selected heights
(a) LI Kernels (b) FNG Kernels
Figure 6.11: Kernel Throughputs for Varying Heights with Degree = 2
with a tree of degree two. When the plots are compared to each other, the only difference is
the throughput of the first and second kernel invocations, referring to the leaf kernel and the
first internal processing kernel; the rest of the kernel invocations share the same throughput
for the matching heights.
When a height of 16 is used, the overall throughput of the system is greatest for LI mode.
As seen in Figure 6.1a, the change from H = 16 to H = 23 results in a performance
decrease. Figure 6.11a shows that the reason for the decrease is not a lack of resources,
but extra levels of processing that are unnecessary. Even though the throughput
is unchanged from other invocations, the small amounts of time for each kernel add up to
decrease performance.
Figures 6.12a and 6.12b show the difference when the degree is increased to six and twelve,
respectively. Compared to the tree with a degree of two, the resulting throughputs are
almost identical. Because of this relationship, a higher degree does not substantially improve
the overall throughput of the system. The final comparison of the throughput is when a
fixed height is used and the degrees are changed. Table 6.8 shows the resulting throughput
for each kernel when the height of the tree is six. As the degrees increase, the throughput
(a) Degree = 6 (b) Degree = 12
Figure 6.12: Kernel Throughputs for Varying Heights and Degrees
for each kernel also increases, even though the overall throughput is not affected, as
indicated by the graphs in Appendix A. In general, the per-kernel throughputs for a high degree tree
are higher, but there is no benefit in terms of overall performance.
6.5 Hardware Profiling
In addition to the detailed timing information for each kernel, there is a lot of additional
information that can be extracted from the hardware, such as the number of registers that
were used and cache performance. As part of the compilation process, the CUDA compiler
is capable of reporting the number of registers that were required for the specified
architecture. When a fat binary file is generated, the Parallel Thread Execution (PTX) assembly
instructions for the architecture are embedded in the executable. During the first stage of
execution, this PTX code may be recompiled by the Just-In-Time (JIT) compiler as part
of the device driver [17]. This will optimize the kernel for the specific device that it is
executed on. Table 6.9 shows the resources that were required for this to run on the testing
platform. Because each thread is required to hold the internal state of the sponge, there
is also a high memory requirement. When compute capability 3.5 is used, all of the
additional data that is needed gets stored into a stack frame, which is internally stored in local
Table 6.9: Compute Capability 3.5 Kernel Resource Requirements
Kernel                      Registers Used    Stack Frame Used
Leaf LI                           95             328 bytes
Leaf FNG                         123             328 bytes
Internal Processing               90             200 bytes
Internal Processing FNG           90             200 bytes
and try to minimize register spilling, KECCAK has a very high memory requirement that is
a limiting factor for resource allocation during execution.
As shown by the overall performance plots in Section 6.1, the speed of the hash function is
affected by both the height and block size that are used when LI mode is used. Because the
block size can have a significant impact on performance, it is essential to understand how a
change in block size affects the performance of the cache subsystem. Since local memory
is stored in the global memory address space, global memory still needs to be accessed for
the local data when it is requested. For this experiment, the block sizes were changed to
collect the percentage of all accesses that miss the L1 and L2 cache. The block sizes ranged
from 64 bits to 3072 bits. Since the leaf kernel was identified as the biggest bottleneck to
performance, it was the only kernel that was profiled to obtain all of the data.
Figure 6.13 shows the average miss rate taken over all of the block sizes for degrees two
and three. The heights with the highest percentage of misses to global memory correspond
to the ideal operating points defined by Table 6.2. Even though the cache misses are very
high for these heights, the performance remains high because there is a well-balanced mix
of threads that are waiting for memory accesses to complete and threads that are computing
a hash.
Using the same method of analysis for the FNG mode, the performance remains
approximately constant for all block sizes because the number of threads for the leaf kernel does
not significantly change as the tree changes size. Figure 6.14 shows the comparison
between an increase in block size for LI and FNG modes. With LI mode, the miss rate begins
(a) Degree = 2 (b) Degree = 3
Figure 6.13: Average Percentage of Misses to Global Memory
to level out as the block size increases. As the height of the tree increases, the percentage
of misses also increases. However, in FNG mode, the percentage of misses to global memory
does appear to increase linearly with an increase in block size, and the height of the tree
does not have any impact on the miss rate.
Because the FNG mode has a number of disadvantages when it comes to performance and
input requirements, there is no more analysis that can be done to significantly improve the
system. However, the cache behavior for the LI mode can be analyzed further. From this point on,
only the LI mode will be analyzed.
As previously shown in Figure 6.13, the best performance is observed when the cache
misses are the highest. Figure 6.15 shows the stall percentages due to memory accesses
that were required by the pipeline as the height and block sizes were adjusted. For the small
block sizes, the percentage of stalls is very low for both degrees two and three. Therefore,
the goal is to reduce the stall cycles by having a balanced mix of memory accesses and
computation to adequately hide long latency operations.
In addition to analyzing the stalls due to memory accesses, the hit rate of the L1 cache is
(a) LI Mode (b) FNG Mode
Figure 6.14: Comparison of Block Size and Cache Miss Rate for Degree = 2
(a) Degree = 2 (b) Degree = 3
Figure 6.15: Stall Percentages for Data Requests
(a) Degree = 2 (b) Degree = 3
Figure 6.16: Stall Percentages for Data Requests
one of the most important metrics. Since the Kepler architecture uses the entire L1 cache
for register spills and local memory accesses, the cache is directly related to the overall
system performance. Figure 6.16 shows the hit rate for trees of degrees two and three. The
lowest hit rate in the graphs corresponds to the maximum performance as well.
Chapter 7
Simulated Cryptographic GPU Design
Based on the hardware analysis in Section 6.5, the register usage and the L1 cache proved
to be two of the biggest bottlenecks when it came to the overall system performance. With
this understanding, GPGPU-Sim was used to simulate the modified architectures to
determine how simple changes could help improve the overall performance. The simulator was
designed for use with the Fermi architecture, even though the bottlenecks were found using
the Kepler architecture. Despite the differences in architecture, these modifications are
applicable to any CUDA architecture that has the same characteristics as the ones described.
The baseline configuration that was used had 32K registers and a 48K L1 cache that was
6-way set associative. Once the baseline values were obtained, the results for modifications
are always shown as normalized values to this baseline. The total number of simulated
clock cycles was used as the metric to determine how the change affected the speed.
Because LI mode showed the best performance on the physical hardware, these modifications
were tested in LI mode.
7.1 Registers
As it was found, the number of registers required by each thread is very high. Therefore,
an increase in the number of registers in the register file that are available for use should
provide a performance benefit because there are fewer accesses that have to be made to local
memory. Figure 7.1 shows how a change in the number of registers affects the speedup for
a tree with a degree of two. Based on the figure, there is no change in the speedup when the
tree height is sufficiently low. Once the height reaches ten, the register changes start to show
a small performance benefit. Figure 7.2 shows a more detailed view of the speedups with a
Figure 7.1: Register Modifications
varying number of registers. At its maximum, the register change to 48K provided a seven
percent speedup over the baseline; all of the other register changes resulted in no performance
gain.
Based on the conceptual benefits, an increased register file should provide a more
significant speedup, especially when the number of registers is significantly increased. However,
the results show that 48K registers is the only register size that shows a difference when
compared to the other sizes. For all of the other register sizes, the speedup was the same.
The lack of performance benefit is a result of the limiting factors associated
with block scheduling, not the actual size of the register file. The simulator is designed
to use the specified number of registers in an analysis for the limiting factor; the number
is never used as part of the actual simulation for register accesses. Table 7.1 shows the
factors that were identified to be limiting the performance. Each thread block that
is launched contains 256 threads, and 1536 is the maximum number of threads
that can be assigned to an SM. Once the 64K register count is reached, the limiting factor
moves to threads; therefore, the number of registers no longer has any impact on the overall
performance.
Figure 7.2: Register Modifications
Table 7.1: Limiting Factors for Thread Execution
                  Leaf Kernel                 Internal Kernel
Registers    Factor        Threads      Factor                Threads
32768        registers       768        registers               1024
49152        registers      1280        registers, threads      1536
65536        threads        1536        threads                 1536
Overall, increasing the number of registers has the potential to increase the speed as long as
registers are a limiting factor for the scheduler. Once the transition is made from registers to
threads as the limiting factor, the registers no longer play any role in the simulation, which
results in a constant speedup for any additional change.
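The limiting factor analysis summarized in Table 7.1 can be approximated with a simple two-limit model: the number of resident thread blocks per SM is the smaller of what the register file allows and what the thread limit allows. The sketch below is not the simulator's code. The per-thread register counts of 36 for the leaf kernel and 32 for the internal kernel are hypothetical values chosen only because they make this simplified model reproduce Table 7.1; other limits such as shared memory, the maximum number of blocks per SM, and register allocation granularity are ignored.

```python
def resident_threads(register_file, regs_per_thread, block_size=256, max_threads=1536):
    """Return (threads per SM, limiting factors) under a simple two-limit model."""
    blocks_by_regs = register_file // (regs_per_thread * block_size)
    blocks_by_threads = max_threads // block_size
    blocks = min(blocks_by_regs, blocks_by_threads)
    factors = []
    if blocks_by_regs <= blocks_by_threads:
        factors.append("registers")
    if blocks_by_threads <= blocks_by_regs:
        factors.append("threads")
    return blocks * block_size, factors


LEAF_REGS = 36       # hypothetical per-thread register count for the leaf kernel
INTERNAL_REGS = 32   # hypothetical per-thread register count for the internal kernel

for register_file in (32768, 49152, 65536):
    print(register_file,
          resident_threads(register_file, LEAF_REGS),
          resident_threads(register_file, INTERNAL_REGS))
```

Under these assumptions, any register file larger than 64K leaves both kernels limited by the 1536-thread cap, which is why further increases cannot change the result of the analysis.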
7.2 L1 Cache
Figure 6.16 indicated that the L1 hit rate significantly decreased when the performance of
the tree was the greatest. By increasing the size of the L1 cache, it is possible to improve
the performance of the system. Figure 7.3 shows the speedup for a tree of degree two
when the associativity is changed to seven and eight. With the change in associativities
to seven and eight, the size of the cache is changed to 56K and 64K, respectively. From
Figure 7.3: L1 Cache Modifications
an implementation perspective, an associativity of seven is not a good choice as it requires
additional or unused logic. From that perspective, an associativity of eight is the best
choice. Furthermore, shared memory is configured as a two-way set associative cache
because there are eight ways total in L1. With this configuration, L1 could be broken down
into two separate regions that have a power-of-two associativity.
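The cache sizes quoted for the L1 modification follow from holding the size of each way constant while adding ways. A minimal sketch of that arithmetic, assuming the number of sets and the line size are left unchanged (the function name is hypothetical):

```python
def cache_size_kb(base_size_kb, base_ways, new_ways):
    """Cache size after changing associativity while keeping the size of each way fixed."""
    kb_per_way = base_size_kb // base_ways   # 48K / 6 ways = 8K per way
    return kb_per_way * new_ways


for ways in (6, 7, 8):
    print(f"{ways}-way L1: {cache_size_kb(48, 6, ways)}K")
```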
As could be expected, the speed always increases for an increased associativity. When
a small height is used, there are only a few threads that are being executed. As a result,
an increased associativity will provide more space for memory accesses. As the height of
the tree increases, the speedup from the L1 modification begins to diminish, but it is still
shown to provide a speedup of five percent for an associativity of eight. For all heights, the
speed increases by between eight and thirty percent.
Figure 7.4: L2 Cache Modifications
7.3 L2 Cache
The final independent modification that can be made is an increased size for the L2 cache.
By default, the L2 cache subsystem is defined by the number of memory channels. There
are six memory channels with two subpartitions per channel, which provide a total of 12
partitions. Each subpartition contains an eight-way set associative cache that is 64K in
size. The entire cache started at a size of 768K. For this modification, the size of the cache
was increased by changing the associativity for each subpartition. The new associativities
range from nine to sixteen.
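The total L2 sizes implied by these associativities can be tabulated from the partition structure described above. The sketch below assumes that each way of a subpartition stays at 8K (64K divided by eight ways), so that only the number of ways changes; the function and parameter names are hypothetical.

```python
def l2_total_kb(ways, channels=6, subpartitions_per_channel=2, kb_per_way=8):
    """Total L2 size in KB for a given per-subpartition associativity."""
    partitions = channels * subpartitions_per_channel   # 12 subpartitions in total
    return partitions * ways * kb_per_way


baseline = l2_total_kb(8)
for ways in range(9, 17):
    total = l2_total_kb(ways)
    print(f"{ways:2d} ways: {total:4d}K ({100.0 * (total - baseline) / baseline:.1f}% larger)")
```

The baseline of eight ways reproduces the 768K starting size, and sixteen ways doubles it to 1536K.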
Figure 7.4 shows the overall speedup for these changes when the tree degree is equal to
two. Similar to the register modifications that were made, there is no observable speedup
for the trees when the height is small. There are not enough memory requests going to the
L2 cache to make any impact on the amount of time. Figure 7.5 shows a more detailed view
of the graph when the tree has a height greater than nine. By increasing the associativity
of each subpartition to nine, the speed was only shown to increase between zero and four
Figure 7.5: Detailed L2 Cache Modifications
percent. If the original size is doubled to 16 ways, the speed increased by up to 17 percent
for a large tree. In general, the L2 size changes show greater improvement as the trees
become larger due to more threads causing contention in the underlying subsystems.
7.4 L1 and L2 Cache Combined
Since the L1 and L2 cache modifications showed that a significant performance benefit can
be gained, the combined changes have the potential to further reduce the execution time.
Because the register modifications were also used in a limiting factor analysis, the change
was not included in this section as it would not have been an accurate comparison to the
improvements.
Figure 7.7 shows the combined modifications for the L1 cache having an associativity of
seven and eight with the variations in the associativity of L2. For all of the simulations,
a degree of two was used for the tree. The trend for both of the plots is the same. For
the heights less than nine, the L2 cache increase has little effect due to the lack of threads.
Figure 7.6: L1 and L2 Combined Modifications for Degree = 10
However, the increased L2 cache sizes start to show a performance improvement when
the tree has a height of at least ten. At that point, both the L1 and L2 cache modifications are
showing a performance benefit. Compared to the real executions, the tree needs to have
a height greater than 11 in order to be operating at its peak. At a height of 11 in the
simulations, the performance can be increased by nearly 20 percent.
Figure 7.8 shows the combined modifications when the degree is changed to three. A
tree with a height greater than seven shows the best improvement, between 20 and 25
percent. Even though the benefit is more than for the tree with degree two, it is not enough of
an improvement to make this tree run at a higher throughput than the even degree trees; the
performance limitation remains, but is reduced slightly. Finally, Figure 7.6 shows a tree
with a degree of ten. When the height is equal to four, the tree is operating at its fastest
speed. With these modifications, it improves by between ten and 25 percent.
Figure 7.7: L1 and L2 Combined Modifications for Degree = 2




This thesis accomplished a number of the goals that were defined. First, an implementation
of KECCAK was designed for the sponge construction required by NIST. This design allows
the execution of the two tree modes that have fixed and variable size to understand the
NVIDIA Tesla GPU throughput. In all of the testing, each parameter that was required for
tree hashing was modified in some way to determine its effect on the overall throughput.
A performance analysis of the various tree parameters showed how the parameters impact
the total execution time, resulting throughput, and even the cache level usage and hit rates to
improve the overall performance. When a tree grows to a sufficient size and has a
throughput greater than 520 MB/s, it will be faster than a sequential implementation. For a number
of testing scenarios, the throughput was found to be more than 3000 MB/s. Using the CPU
clock for a cycle count, this was operating at a speed of 1.03 cycles per byte. The type of
memory that is used in a GPU implementation plays a large role in the overall throughput
because it may be possible to avoid some overheads. As shown with using global memory
in the GPU, data transfer overheads cannot be ignored and cause a significant decrease in
throughput when they are accounted for.
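The cycles per byte figure is the CPU clock rate divided by the throughput in bytes per second. The sketch below shows the conversion; the 3.2 GHz clock is an assumed value that is not stated in this section, and one MB is taken to be 10^6 bytes, which is also an assumption. The throughput of 3112.5 MB/s is the largest page-locked LI hash speed in Table 6.5.

```python
def cycles_per_byte(clock_hz, throughput_mb_s, bytes_per_mb=10**6):
    """CPU clock cycles that elapse per byte hashed at the given throughput."""
    return clock_hz / (throughput_mb_s * bytes_per_mb)


ASSUMED_CLOCK_HZ = 3.2e9   # hypothetical CPU clock, for illustration only

print(f"{cycles_per_byte(ASSUMED_CLOCK_HZ, 3112.5):.2f} cycles per byte at 3112.5 MB/s")
print(f"{cycles_per_byte(ASSUMED_CLOCK_HZ, 520.0):.2f} cycles per byte at 520 MB/s")
```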
In general, even degree trees shared the same throughput, whereas all odd degree trees
shared another, slower throughput. For the general case implementation, an even degree
should be used. Furthermore, there were a number of heights that could be used to get the
maximum throughput. Even though there is a range, the smallest height should be used
because it would require the least amount of memory to hold the internal tree structure. In
addition to the speed of the hashing tree, it was shown that the input rate for the KECCAK
sponge no longer had an impact on the overall execution time for a sufficiently large tree.
As a result, it is possible to use a lower input rate to improve the security of the hash without
compromising speed.
The simulated GPU showed that the bottlenecks of registers and L1 cache use could be
minimized by increasing the sizes of both. In addition to those, the L2 cache can also be
increased to provide a more significant improvement. When the L1 and L2 caches were
simultaneously modified, the speed was improved by up to 20 percent, thereby significantly




Parallel and heterogeneous architectures are an emerging field of computing that is
making its way into a variety of application fields. Tree hashing has been shown to work well
and achieve high throughput on a CUDA enabled GPU. For this work, the Kepler
architecture was used for all of the benchmarking. The next generation of CUDA enabled GPUs
is being designed with power efficiency and performance improvements. Tree hashing
could be explored on the NVIDIA Maxwell architecture to determine how the hardware
modifications affect the performance.
Additional features of the CUDA extensions could also be used. CUDA streams are able to
take advantage of more resources on the GPU when properly used. An architecture could
be designed to use streams to improve the overall throughput. A streamed tree hashing
implementation would be able to operate in multiple modes, such as performing a tree hash
or hashing many messages at one time; it could also be designed to perform tree hashing
using many files in a directory tree.
There are many other parallel environments and languages that could be used to understand
the performances on each. Different languages could be used for the same implementation
to determine which language provides the least overhead. The additional languages for tree
hashing to be implemented in are C++ AMP, MPI, and OpenCL. In addition to using other
languages, different hardware could also be used. AMD manufactures GPUs that can be
programmed for general purpose computation. Intel released the Xeon Phi coprocessor that
could be used to explore another architecture.
The final extension for this work is focused on the cryptographic design rather than the
system architecture. Parallel tree hashing can be explored using different tree designs and
different parameters, such as the block length or padding scheme. A complete and thorough




[1] T.M. Aamodt, W.W.L. Fung, I. Singh, A. El-Shafiey, J. Kwa, T. Hetherington, A. Gubran, A. Boktor, T. Rogers, A. Bakhoda, and H. Jooybar. GPGPU-Sim 3.x Manual, 2012. http://gpgpu-sim.org/manual/index.php5/GPGPU-Sim_3.x_Manual.
[2] D. Ashfield. Why are cryptographic hash functions important in digital forensics?, 2014. http://www.cclgroupltd.com/cryptographic-hash-functions-important-digital-forensics/.
[3] A. Bakhoda, G.L. Yuan, W.W.L. Fung, H. Wong, and T.M. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009), pages 163–174, April 2009.
[4] B. Baldwin, A. Byrne, Liang L., M. Hamilton, N. Hanley, M. O’Neill, and W.P. Marnane. FPGA Implementations of the Round Two SHA-3 Candidates. In 2010 International Conference on Field Programmable Logic and Applications (FPL), pages 400–407, Aug 2010.
[5] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. KECCAK Specifications, October 2008. http://keccak.noekeon.org/Keccak-specifications.pdf.
[6] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. KECCAK Sponge Function Family Main Document, June 2010. http://keccak.noekeon.org/.
[7] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. Cryptographic Sponge Functions, January 2011. http://sponge.noekeon.org/.
[8] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. The KECCAK reference, January 2011. http://keccak.noekeon.org/.
[9] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. Software Performance Figures, 2011. http://keccak.noekeon.org/sw_performance.html.
[10] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. Sakura: A Flexible Coding
for Tree Hashing. Cryptology ePrint Archive, Report 2013/231, 2013. http://
eprint.iacr.org/.
[11] G. Bertoni, J. Daemen, M. Peeters, G. Van Assche, and R. Van Keer. KECCAK im-
plementation overview, May 2012. http://keccak.noekeon.org/.
[12] M. Bobrov. Cryptographic Algorithm Acceleration Using CUDA Enabled GPUs in
Typical System Configurations. Master’s thesis, Rochester Institute of Technology,
New York, August 2010.
[13] P.L. Cayrel, G. Hoffmann, and M. Schneider. GPU Implementation of the Keccak
Hash Function Family. Information Security and Assurance, page 3342, 2011.
[14] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Pro-
gramming Guide (v1.0), 2007.
[15] NVIDIA Corporation. NVIDIA’s Next Generation CUDA Compute Architecture:
Fermi, 2009.
[16] NVIDIA Corporation. NVIDIA’s Next Generation CUDA Compute Architecture:
Kepler GK110, 2012.
[17] NVIDIA Corporation. CUDA C Programming Guide (v5.5), 2013.
[18] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Pro-
gramming Guide (v5.5), 2013.
[19] NVIDIA Corporation. Tuning CUDA Applications for Kepler (v5.5), 2013.
[20] F. Dacêncio Pereira, E. Moreno Ordonez, I. Daun Sakai, and A. Mariano de Souza.
Exploiting Parallelism on Keccak: FPGA and GPU Comparison. Parallel & Cloud
Computing, 2(1), 2013.
[21] K. Gaj, E. Homsirikamol, and M. Rogawski. Fair and Comprehensive Methodology
for Comparing Hardware Performance of Fourteen Round Two SHA-3 Candidates
Using FPGAs. In Proceedings of the 12th International Conference on Cryptographic
Hardware and Embedded Systems, CHES’10, pages 264–278, Berlin, Heidelberg,
2010. Springer-Verlag.
[22] Intel. Processors - Define SSE2,SSE3 and SSE4.
http://www.intel.com/support/processors/sb/CS-030123.htm (Accessed 02-17-2014).
[23] Computer Science Division. National Institute of Standards and Technology.
Cryptographic Hash and SHA-3 Standard Development.
http://csrc.nist.gov/groups/ST/hash/index.html.
[24] Computer Science Division. National Institute of Standards and Technology.
SHA-3 Competition (2007-2012).
http://csrc.nist.gov/groups/ST/hash/sha-3/index.html.
[25] Computer Science Division. National Institute of Standards and Technology. DRAFT
FIPS PUB 202, 2014.
http://csrc.nist.gov/publications/drafts/fips-202/fips_202_draft.pdf.
[26] J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General-Purpose
GPU Programming. Addison-Wesley Professional, 2010.
[27] G. Sevestre. Implementation of Keccak Hash Function in Tree Hashing Mode on
NVIDIA GPU. 2010.
[28] D. Stinson. Cryptography: Theory and Practice. CRC Press, third edition, 2006.
[29] J. Strömbergson. Implementation of the Keccak Hash Function in FPGA Devices.
December 2008.
[30] M. Łukowiak. Field Programmable Gate Arrays: Overview and Trends. 2012.
[31] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos.
Demystifying GPU Microarchitecture through Microbenchmarking. In 2010 IEEE
International Symposium on Performance Analysis of Systems & Software (ISPASS),
pages 235–246, March 2010.




These graphs show the hashing speed achieved when the input message of approximately
five gigabytes is held in page-locked memory. The degree, height, and block size are each
varied, and the results are presented for both tree modes. Section 6.1 provides a detailed
explanation of the graphs. Their main purpose is to show how the overall throughput of
the system is affected by changes in degree, height, block size, and tree mode.
(a) LI Tree Mode (b) FNG Tree Mode
Figure A.1: Hash Performance with Degree = 4
(a) LI Tree Mode (b) FNG Tree Mode
Figure A.2: Hash Performance with Degree = 5
(a) LI Tree Mode (b) FNG Tree Mode
Figure A.3: Hash Performance with Degree = 6
(a) LI Tree Mode (b) FNG Tree Mode
Figure A.4: Hash Performance with Degree = 7
(a) LI Tree Mode (b) FNG Tree Mode
Figure A.5: Hash Performance with Degree = 8
(a) LI Tree Mode (b) FNG Tree Mode
Figure A.6: Hash Performance with Degree = 9
(a) LI Tree Mode (b) FNG Tree Mode
Figure A.7: Hash Performance with Degree = 10
