GPU Accelerated Keccak (SHA3) Algorithm by Wang, Canhui & Chu, Xiaowen
arXiv:1902.05320v1 [cs.DC] 14 Feb 2019
GPU Accelerated Keccak (SHA3) Algorithm
Canhui Wang †, Xiaowen Chu †,⋆
† Computer Science, Hong Kong Baptist University, Hong Kong, China
⋆ HKBU Institute of Research and Continuing Education, Shenzhen, China
{chwang, chxw}@comp.hkbu.edu.hk
Abstract
Hash functions such as SHA-1 and MD5 are among the most important
cryptographic primitives, especially in the field of information integrity.
Because a growing number of attacks have been proposed against these hash
algorithms, the US National Institute of Standards and Technology held a
competition for a new family of hash functions. Keccak was the winner and was
selected as the next-generation hash function standard, named SHA-3.
We aim to implement and optimize Batch mode based Keccak algorithms on the
NVIDIA GPU platform. Our work considers the case of processing multiple hash
tasks at once and implements this case on both the CPU and the GPU. Our
experimental results show that GPU performance is significantly higher than
CPU performance when processing large batches of small hash tasks.
1 Introduction
Security techniques [1, 2] are acknowledged to be an integral part of many
fields, e.g., business, national defense and the military. One important class
of cryptographic algorithms is the hash family, which has a wide range of
applications in modern information security, e.g., digital signatures, message
authentication codes and password authentication. Unfortunately, a growing
number of existing hash algorithms, such as MD5 and SHA-1, are at high risk of
being broken. To improve the security of hash algorithms, the new SHA-3 [3]
algorithm, derived from Keccak, has been proposed to replace the older hash
functions. Hash functions are distinctive in the way an output is generated. A
message is broken down into a number of blocks, and the hash function absorbs
each block of the message into an internal state, with the final output
produced after the last block is consumed. This structure is difficult to
parallelize. Because inputs can range from a few bytes to a few terabytes, a
sequential hash function is not always the best choice. A function with a tree
hashing mode could be used to significantly reduce the amount of time required
to compute the hash.
Some classes of problems can be expressed as data-parallel computations with
high arithmetic intensity, for which a CPU is not particularly efficient.
Multi-core CPUs excel at managing multiple discrete tasks and at processing
data sequentially, using loops to handle each element. In contrast, the GPU
architecture maps the data to thousands of parallel threads, each handling one
element. This architecture looks ideal for our fast algorithm implementation.
2 Related Work
Lowden et al. [4] focus on the exploration and analysis of the Keccak tree
hashing mode on a GPU platform. Their implementation shows that core features
of the GPU, in particular its massively parallel architecture, can be used to
reduce the time it takes to complete a hash. In addition to analyzing the speed
of the algorithm, they profile the underlying hardware to identify the
bottlenecks that limit the hash speed. Their results show that tree hashing can
hash data at rates of up to 3 GB/s in the fixed-size tree mode.
Qinjian et al. [5] propose a GPU-based AES implementation. In their
implementation, the frequently accessed T-boxes are allocated in on-chip shared
memory, and a granularity of one thread per 16-byte AES block is adopted. They
achieve a throughput of around 60 Gbps on an NVIDIA Tesla C2050 GPU, which is
up to 50 times faster than a sequential implementation on an Intel Core i7-920
2.66 GHz CPU.
Kaiyong et al. [6] develop G-BLASTN, a GPU-accelerated nucleotide alignment
tool based on the widely used NCBI-BLAST. G-BLASTN produces exactly the same
results as NCBI-BLAST and has very similar user commands. Compared with the
sequential NCBI-BLAST, G-BLASTN achieves an overall speedup of 14.80X in
megablast mode. They [7] also propose to exploit the computing power of
Graphics Processing Units (GPUs) for homomorphic hashing. Specifically, they
demonstrate how to use NVIDIA GPUs and the Compute Unified Device Architecture
(CUDA) programming model to achieve a 38-fold speedup over the CPU counterpart.
They also develop a multi-precision modular arithmetic library on the CUDA
platform, which is not only key to their specific application but also very
useful for a large number of cryptographic applications.
Xinxin et al. [8, 9] propose a novel fine-grained benchmarking approach and
apply it to two popular GPU architectures, namely Fermi and Kepler, to expose
previously unknown characteristics of their memory hierarchies. They also
investigate the impact of bank conflicts on shared memory access latency.
Thuong et al. [10] propose a high-speed implementation of the Keccak hash
function (SHA3-512) on the GPU using the CUDA development environment. In
addition, they discuss the security level of Keccak, especially with respect to
pre-image resistance. To provide a high-speed hash function for password
cracking, they also develop a special program for passwords of up to 71
characters. Moreover, the throughput of 2-time hashing is evaluated in their
work.
Chengjian et al. [11] propose G-CRS, a GPU-based implementation of erasure
coding that employs the Cauchy Reed-Solomon (CRS) code to address the coding
performance bottleneck. To maximize the coding performance of G-CRS, they
design and implement a set of optimization strategies, such as a compact
structure to store the bitmatrix in GPU constant memory, efficient data access
through shared memory, and decoding parallelism, to fully utilize the GPU
resources.
Xiaowen et al. [12, 13] exploit the potential of the huge computing power of
Graphics Processing Units (GPUs) to reduce the computational cost of network
coding and homomorphic hashing. With their GPU implementations of network
coding and homomorphic hash functions (HHFs), they observe a significant
computational speedup in comparison with the best CPU implementation. This
implementation can lead to a practical solution for defending against pollution
attacks in distributed systems.
Cheong et al. [14] contribute to the cryptography research community by
presenting techniques to accelerate symmetric block ciphers (IDEA, Blowfish and
Threefish) on the NVIDIA GTX 690 with Kepler architecture. The results are
benchmarked against an OpenMP implementation and existing GPU implementations
in the literature. They achieve encryption throughputs of 90.3 Gbps, 50.82 Gbps
and 83.71 Gbps for IDEA, Blowfish and Threefish, respectively. A block cipher
can be used as a pseudorandom number generator (PRNG) when operating in counter
(CTR) mode, but it is usually slower than other PRNGs that use lighter
operations. Hence, they modify IDEA and Blowfish to achieve faster pseudorandom
number generation. The modified IDEA and Blowfish pass all NIST Statistical
Tests and TestU01 SmallCrush, but not the more stringent tests in TestU01
(Crush and BigCrush).
3 Preliminary
The Secure Hash Algorithm-3 (SHA-3) family is based on an instance of the
Keccak algorithm, which NIST selected as the winner of the SHA-3 cryptographic
hash algorithm competition in 2012. SHA-3 consists of four cryptographic hash
functions, SHA3-224, SHA3-256, SHA3-384 and SHA3-512, as well as two
extendable-output functions (XOFs), SHAKE128 and SHAKE256. The
extendable-output functions differ from the hash functions in that their output
length is not fixed: it can be chosen to match the requirements of individual
applications. In general, hash functions play an important role in many fields,
including digital signatures and pseudorandom bit generation.
3.1 Keccak-p
The SHA-3 functions can be viewed as modes of the Keccak-p permutations, which
are designed as the main components of various cryptographic functions. The
Keccak-p permutations have two core parameters: the width and the number of
rounds. The width, denoted by b, is the fixed length of the permuted strings;
the number of rounds, denoted by nr, is the number of iterations of an internal
transformation.
The state of Keccak-p[b, nr] consists of b bits. In addition, the standard
specifies two other quantities related to b: b/25 and log2(b/25), denoted by w
and l, respectively. The seven possible cases for these variables that are
defined for Keccak-p[b, nr] are given in Table 1.
Table 1: The widths and other quantities of Keccak-p[b, nr]
b   25   50  100  200  400  800  1600
w    1    2    4    8   16   32    64
l    0    1    2    3    4    5     6
It is convenient to represent the input and output states of the step mappings
as 5-by-5-by-w arrays of bits. A state array is denoted by A, and A[x, y, z] is
the bit indexed by the integer triple (x, y, z), where 0 ≤ x < 5, 0 ≤ y < 5 and
0 ≤ z < w. Let S denote the b-bit string that represents the state. The state
array is a three-dimensional representation of this string, and their
relationship is expressed in equation (1).
A[x, y, z] = S[w(5y + x) + z]   (1)
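As an illustration, the index mapping of equation (1) can be sketched in Python.
This is a minimal sketch and not the CUDA code of this work; the function names
string_to_state and state_to_string and the example bit pattern are
hypothetical.

```python
def string_to_state(S, w):
    """Build A[x][y][z] from a flat bit list S of length 25*w, per equation (1)."""
    assert len(S) == 25 * w
    return [[[S[w * (5 * y + x) + z] for z in range(w)]
             for y in range(5)]
            for x in range(5)]


def state_to_string(A, w):
    """Inverse mapping: flatten A[x][y][z] back into the bit list S."""
    S = [0] * (25 * w)
    for x in range(5):
        for y in range(5):
            for z in range(w):
                S[w * (5 * y + x) + z] = A[x][y][z]
    return S


w = 8                                          # lane size for b = 200 (Table 1)
S = [(7 * i + 3) % 2 for i in range(25 * w)]   # arbitrary example bits
A = string_to_state(S, w)
```

Consecutive values of z are adjacent in S, so each lane A[x, y, ·] is a
contiguous run of w bits. For b = 1600 a lane is therefore one 64-bit word,
which is the representation assumed by the later sketches.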
After converting a string into a state array, we next need to operate on the
state. The step mappings, ϑ, ρ, π, χ and ζ, are specified in the following
definitions. Note that the algorithm for each step mapping takes a state array,
denoted by A, as input and returns an output state array, denoted by A′. The
size of the state is a parameter that is omitted from the notation because b is
always specified when the step mappings are invoked.
Definition 1. ϑ: the input state array is denoted by A and the output state
array is denoted by A′. Then,
C[x, z] = A[x, 0, z] ⊕ A[x, 1, z] ⊕ A[x, 2, z] ⊕ A[x, 3, z] ⊕ A[x, 4, z]   (2)
D[x, z] = C[(x − 1) mod 5, z] ⊕ C[(x + 1) mod 5, (z − 1) mod w]   (3)
A′[x, y, z] = A[x, y, z] ⊕ D[x, z]   (4)
where 0 ≤ x < 5, 0 ≤ y < 5 and 0 ≤ z < w. The effect of ϑ is to XOR each bit in
the state with the parities of two columns in the array. In particular, for the
bit A[x0, y0, z0], the x-coordinate of one of the columns is (x0 − 1) mod 5,
with the same z-coordinate z0, while the x-coordinate of the other column is
(x0 + 1) mod 5, with z-coordinate (z0 − 1) mod w.
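Equations (2) to (4) can be sketched on whole lanes rather than single bits.
The sketch below assumes w = 64 and stores each lane A[x][y] as a Python
integer whose bit z is (lane >> z) & 1. Under that assumption the index
(z − 1) mod w in equation (3) becomes a left rotation by one bit. The names
rotl and theta are hypothetical.

```python
W = 64
MASK = (1 << W) - 1


def rotl(v, n):
    """Rotate the W-bit lane v left by n positions."""
    n %= W
    return ((v << n) | (v >> (W - n))) & MASK


def theta(A):
    """Column-parity step on a 5x5 list of 64-bit lanes A[x][y]."""
    # Equation (2): parity of each column, one lane per x.
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    # Equation (3): combine the two neighbouring column parities.
    D = [C[(x - 1) % 5] ^ rotl(C[(x + 1) % 5], 1) for x in range(5)]
    # Equation (4): XOR D into every lane of column x.
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
```

Working on lanes lets one thread process 64 bit positions with a single XOR,
which is why lane-oriented code is the usual choice on both CPUs and GPUs.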
Definition 2. ρ: the input state array is denoted by A and the output state
array is denoted by A′. The lane at (0, 0) is unchanged, i.e., A′[0, 0, z] =
A[0, 0, z]. Starting from (x, y) = (1, 0), for t = 0, 1, ..., 23 in turn, ρ is
specified as follows.
A′[x, y, z] = A[x, y, (z − (t + 1)(t + 2)/2) mod w]   (5)
where 0 ≤ z < w, and after each value of t the coordinates are updated as
(x, y) ← (y, (2x + 3y) mod 5). The effect of ρ is to rotate the bits of each
lane by a length called the offset, which depends on the fixed x and y
coordinates of the lane. Equivalently, for each bit in the lane, the z
coordinate is modified by adding the offset modulo the lane size.
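The offsets of ρ can be tabulated once and reused in every round. The sketch
below follows the ρ algorithm of the SHA-3 standard [3]: it starts at
(x, y) = (1, 0), assigns the offset (t + 1)(t + 2)/2 mod w for t = 0, ..., 23,
and moves to (y, (2x + 3y) mod 5) after each step. Lanes are assumed to be
64-bit integers with bit z at (lane >> z) & 1; the names rho_offsets and rho
are hypothetical.

```python
def rho_offsets(w=64):
    """Return off[x][y], the rotation offset of lane (x, y) modulo w."""
    off = [[0] * 5 for _ in range(5)]   # lane (0, 0) keeps offset 0
    x, y = 1, 0
    for t in range(24):
        off[x][y] = ((t + 1) * (t + 2) // 2) % w
        x, y = y, (2 * x + 3 * y) % 5
    return off


def rho(A, w=64):
    """Rotate every lane A[x][y] left by its offset."""
    mask = (1 << w) - 1
    off = rho_offsets(w)
    out = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            r = off[x][y]
            out[x][y] = ((A[x][y] << r) | (A[x][y] >> (w - r))) & mask
    return out
```

Because the offsets do not depend on the data, a GPU implementation can keep
them as compile-time constants instead of recomputing the walk per thread.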
Definition 3. π: the input state array is denoted by A and the output state
array is denoted by A′. The specification of π is expressed as follows.
A′[x, y, z] = A[(x + 3y) mod 5, x, z]   (6)
where 0 ≤ x < 5, 0 ≤ y < 5 and 0 ≤ z < w. The effect of π is to rearrange the
positions of the lanes.
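Since equation (6) does not change z, the lane-rearranging step moves whole
lanes. A minimal sketch, assuming the state is a 5x5 list of lanes A[x][y] and
using the hypothetical name pi for the function:

```python
def pi(A):
    """Lane permutation of equation (6): out[x][y] = A[(x + 3y) mod 5][x]."""
    return [[A[(x + 3 * y) % 5][x] for y in range(5)] for x in range(5)]


# Label each lane with 10*x + y so the permutation is easy to read off.
labels = [[10 * x + y for y in range(5)] for x in range(5)]
moved = pi(labels)
```

In this labelled example, moved[0][1] holds the label 30, i.e. the lane that
was at (3, 0), and the lane at (0, 0) stays in place.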
Definition 4. χ: the input state array is denoted by A and the output state
array is denoted by A′. The specification of χ is expressed as follows.
A′[x, y, z] = A[x, y, z] ⊕ (A[(x + 2) mod 5, y, z] · (1 ⊕ A[(x + 1) mod 5, y, z]))   (7)
where 0 ≤ x < 5, 0 ≤ y < 5 and 0 ≤ z < w. Note that the dot in equation (7)
indicates integer multiplication, which in this case is equivalent to the
Boolean AND operation. The effect of χ is to XOR each bit with a non-linear
function of two other bits in its row.
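On lanes, equation (7) becomes one NOT, one AND and one XOR per lane. The
sketch below assumes lanes are Python integers of w bits; because Python
integers are unbounded, the complement is masked to w bits. The name chi is
hypothetical.

```python
def chi(A, w=64):
    """Non-linear step of equation (7) on a 5x5 list of w-bit lanes A[x][y]."""
    mask = (1 << w) - 1
    return [[A[x][y] ^ ((~A[(x + 1) % 5][y] & mask) & A[(x + 2) % 5][y])
             for y in range(5)]
            for x in range(5)]
```

The term 1 ⊕ A[(x + 1) mod 5, y, z] in equation (7) is the bitwise complement
of the neighbouring lane, and the multiplication is the bitwise AND.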
Definition 5. ζ (denoted by ι in the standard [3]): the input state array is
denoted by A, the round index by ir, and the output state array by A′. The
specification of ζ is expressed as follows.
RC[2^j − 1] = rc(j + 7ir)   (8)
A′[0, 0, z] = A[0, 0, z] ⊕ RC[z]   (9)
where 0 ≤ z < w and 0 ≤ j ≤ l; all other bits of RC are 0 and all other lanes
are unchanged. Within the specification of ζ, the round index determines l + 1
bits of a lane value called the round constant, denoted by RC. Each of these
l + 1 bits is generated by a function, denoted by rc, that is based on a linear
feedback shift register. The effect of ζ is to modify some of the bits of the
lane A[0, 0, z], where 0 ≤ z < w, in a manner that depends on the round index
ir.
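The round constants can be generated once and stored in a table. The sketch
below follows the rc algorithm of the SHA-3 standard [3], which shifts an 8-bit
linear feedback shift register t mod 255 times, and then places the l + 1
generated bits at positions 2^j − 1 for 0 ≤ j ≤ l. It assumes l = 6 (w = 64);
the names rc and round_constant are hypothetical.

```python
def rc(t):
    """One round-constant bit, produced by the LFSR of the standard."""
    if t % 255 == 0:
        return 1
    R = [1, 0, 0, 0, 0, 0, 0, 0]
    for _ in range(t % 255):
        R = [0] + R                 # shift in a zero; R[8] is the bit shifted out
        for i in (0, 4, 5, 6):      # feedback taps
            R[i] ^= R[8]
        R = R[:8]                   # truncate back to 8 bits
    return R[0]


def round_constant(ir, l=6):
    """Lane value RC for round index ir: bit 2^j - 1 is rc(j + 7*ir)."""
    RC = 0
    for j in range(l + 1):
        if rc(j + 7 * ir):
            RC |= 1 << (2 ** j - 1)
    return RC


RC_TABLE = [round_constant(ir) for ir in range(24)]
```

A table such as RC_TABLE has only 24 entries of 64 bits for b = 1600, so a
precomputed copy is cheap to keep in a small read-only memory.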
Thus, given a state array A and a round index ir, the round function Rnd is the
transformation that results from applying the step mappings as follows.
Rnd(A, ir) = ζ(χ(π(ρ(ϑ(A)))), ir)   (10)
Note that the Keccak-p[b, nr] permutation consists of nr iterations of Rnd.
3.2 Sponge
The sponge construction is a framework for specifying functions on binary data
with arbitrary output length. The construction employs three components: an
underlying function on fixed-length strings, denoted by f; a parameter called
the rate, denoted by r; and a padding function, denoted by pad. These
components form a sponge function denoted by Sponge[f, pad, r]. The sponge
function takes two inputs: a bit string, denoted by N, and the bit length,
denoted by d, of the output string, denoted by Z. Note that the input d
determines the number of bits that the sponge algorithm returns but does not
affect their values. In principle, the output can be regarded as an infinite
string whose computation is halted after the desired number of output bits has
been produced.
The padding rule for the Keccak family is called multi-rate padding. Given a
positive integer x and a non-negative integer m, the padding rule for Keccak,
denoted by pad10*1(x, m), is specified in equation (11). The number of zeros is
chosen so that m plus the length of the padding is a positive multiple of x.
pad10*1(x, m) = 1 || 0^j || 1, where j = (−m − 2) mod x   (11)
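A short sketch of the multi-rate padding rule, following pad10*1 in the SHA-3
standard [3], where the number of zero bits is (−m − 2) mod x. Bits are held in
a Python list, and the name pad10star1 is hypothetical.

```python
def pad10star1(x, m):
    """Return the padding bits 1 || 0^j || 1 with j = (-m - 2) mod x."""
    j = (-m - 2) % x
    return [1] + [0] * j + [1]


# Example with the SHA3-256 rate: x = r = 1600 - 512 = 1088 bits.
r = 1088
message_bits = 82            # e.g. a 10-byte message plus the two-bit suffix 01
padding = pad10star1(r, message_bits)
```

The padding is never empty: even when m + 2 is already a multiple of x, the two
1 bits are appended, and when only one bit of space is left, a whole extra
block is added.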
3.3 SHA-3
Keccak is the family of sponge functions with the Keccak-p[b, 12 + 2l]
permutation as the underlying function and multi-rate padding. The family is
parameterized by any choice of the rate r and the capacity c such that r + c is
in {25, 50, 100, 200, 400, 800, 1600}. When restricted to the case b = 1600,
the Keccak family is denoted by Keccak[c]. In this case, r is determined by the
choice of c. The algorithm Keccak[c] is specified as follows.
Keccak[c](N, d) = Sponge[Keccak-p[1600, 24], pad10*1, 1600 − c](N, d)   (12)
The four SHA-3 hash functions and the two SHA-3 XOFs are defined next. Given a
message M, the four SHA-3 hash functions are defined from the Keccak[c](N, d)
function by appending a two-bit suffix to M and by specifying the length of the
output, and the two XOFs are defined by appending a four-bit suffix, as
follows.
SHA3-224(M) = Keccak[448](M || 01, 224)   (13)
SHA3-256(M) = Keccak[512](M || 01, 256)   (14)
SHA3-384(M) = Keccak[768](M || 01, 384)   (15)
SHA3-512(M) = Keccak[1024](M || 01, 512)   (16)
SHAKE128(M, d) = Keccak[256](M || 1111, d)   (17)
SHAKE256(M, d) = Keccak[512](M || 1111, d)   (18)
For the hash functions, the capacity is double the digest length, in other
words c = 2d, and the resulting input N to Keccak[c] is the message with the
suffix appended, N = M || 01. The suffix supports domain separation: it
distinguishes the inputs to Keccak[c](N, d) arising from the SHA-3 hash
functions from the inputs arising from the SHA-3 XOFs.
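To show how the pieces of this section fit together, the following is a compact
reference sketch of SHA3-256, i.e. Keccak[512](M || 01, 256), for byte-aligned
messages. It is an illustration under stated assumptions and not the
implementation evaluated in this paper: lanes are 64-bit little-endian
integers, the names rc, keccak_p1600 and sha3_256 are hypothetical, and
Python's hashlib is imported only so the result can be compared with an
independent implementation.

```python
import hashlib

W = 64
MASK = (1 << W) - 1


def rotl(v, n):
    n %= W
    return ((v << n) | (v >> (W - n))) & MASK


def rc(t):
    """Round-constant bit from the 8-bit LFSR of the standard."""
    if t % 255 == 0:
        return 1
    R = [1, 0, 0, 0, 0, 0, 0, 0]
    for _ in range(t % 255):
        R = [0] + R
        for i in (0, 4, 5, 6):
            R[i] ^= R[8]
        R = R[:8]
    return R[0]


# Round-constant table: bit 2^j - 1 of RC[ir] is rc(j + 7*ir), 0 <= j <= 6.
RC = [sum(rc(j + 7 * ir) << (2 ** j - 1) for j in range(7)) for ir in range(24)]

# Rotation offsets of the rho step.
OFFSET = [[0] * 5 for _ in range(5)]
_x, _y = 1, 0
for _t in range(24):
    OFFSET[_x][_y] = ((_t + 1) * (_t + 2) // 2) % W
    _x, _y = _y, (2 * _x + 3 * _y) % 5


def keccak_p1600(A):
    """24 iterations of Rnd on a 5x5 list of 64-bit lanes A[x][y]."""
    for ir in range(24):
        C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
        D = [C[(x - 1) % 5] ^ rotl(C[(x + 1) % 5], 1) for x in range(5)]
        A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]      # theta
        A = [[rotl(A[x][y], OFFSET[x][y]) for y in range(5)]
             for x in range(5)]                                         # rho
        A = [[A[(x + 3 * y) % 5][x] for y in range(5)]
             for x in range(5)]                                         # pi
        A = [[A[x][y] ^ ((~A[(x + 1) % 5][y] & MASK) & A[(x + 2) % 5][y])
              for y in range(5)] for x in range(5)]                     # chi
        A[0][0] ^= RC[ir]                                # round-constant step
    return A


def sha3_256(M):
    """SHA3-256 of the bytes object M via the sponge construction."""
    rate = (1600 - 512) // 8            # r = 1600 - c bits, expressed in bytes
    P = bytearray(M)
    P += bytes(rate - len(P) % rate)    # room for suffix and pad10*1
    P[len(M)] ^= 0x06                   # bits 0,1 (suffix 01) then first pad 1
    P[-1] ^= 0x80                       # final 1 bit of pad10*1
    A = [[0] * 5 for _ in range(5)]
    for start in range(0, len(P), rate):            # absorbing phase
        block = P[start:start + rate]
        for i in range(rate // 8):
            lane = int.from_bytes(block[8 * i:8 * i + 8], "little")
            A[i % 5][i // 5] ^= lane                # lane index i = 5y + x
        A = keccak_p1600(A)
    # Squeezing phase: 256 bits are the first four lanes of row y = 0.
    return b"".join(A[i][0].to_bytes(8, "little") for i in range(4))


digest = sha3_256(b"0123456789")        # one 10-byte message
```

A quick sanity check is to compare digest with
hashlib.sha3_256(b"0123456789").digest(). In Batch mode, each thread would run
a routine like sha3_256 on its own message, with RC and OFFSET shared by all
threads as read-only tables.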
4 GPU Accelerated SHA-3
SHA-3 parallel hashing can be divided into two modes: Batch mode and Tree mode.
Batch mode either divides a message into multiple equal-sized slices and hashes
the slices in parallel, or processes multiple messages of the same form in
parallel at one time. Tree mode combines multiple messages into a single hash
root: the messages are hashed in pairs, level by level, until the hash root is
obtained. This hash root can be understood as a Merkle tree root. In this
article, we adopt Batch mode to implement the parallel calculation of the hash
algorithm.
Table 2: Environment specification
CPU               Intel Core i5-7200U @ 2.5GHz*4
GPU               GeForce 940MX/PCIe/SSE2
Memory            12 GiB
OS                Ubuntu 16.04 LTS, 64 bits
CUDA compilation  V7.5.17
GCC               V5.4.0
4.1 Parallel Granularity
SHA-3 parallel modes can be divided into Batch mode and Tree mode, and
different modes use different GPU parallelism. For example, Batch mode normally
uses a 'one thread per message' parallel granularity, which means that multiple
GPU threads process multiple messages at the same time. Tree mode normally uses
a 'one thread per tree' parallel granularity, which means that multiple GPU
threads process multiple hash trees at the same time. In this paper, we mainly
focus on Batch mode, and our parallel granularity is 'one thread per message'.
4.2 RC Tables Allocation
SHA-3 RC tables are essentially look-up tables of round constants, through
which part of the cryptographic operations can be implemented quickly. Every
thread needs to access the RC tables in each round of cryptographic operations.
Hence, we need to load the RC tables into GPU memory in advance. In our
implementation, we load the RC tables into CUDA constant memory. Another
possible solution is to load the RC tables into CUDA shared memory.
4.3 Plaintext Allocation
In our implementation of the SHA-3 algorithm, we mainly use Batch mode, in
which a large number of messages are hashed at the same time. In our
experiment, we hash a large number of messages of the same length (10 bytes)
with a large number of synchronized GPU threads. For example, if we have 100
messages and the length of each message is 10 bytes, then we need at least 100
CUDA threads for hashing.
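The 'one thread per message' layout of this example can be mimicked on the CPU.
The sketch below is a host-side analogue only, not CUDA code: it assumes a
hypothetical flat input buffer in which message i occupies bytes
[10i, 10i + 10), uses Python's hashlib.sha3_256 as a stand-in for the per-thread
hash routine (the SHA-3 variant is an assumption), and uses a thread pool in
place of GPU threads.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

MSG_LEN = 10      # bytes per message, as in the example above
NUM_MSGS = 100    # one task per message

# Hypothetical flat buffer: message i is the 10-byte big-endian encoding of i.
buffer = b"".join(i.to_bytes(MSG_LEN, "big") for i in range(NUM_MSGS))


def hash_one(i):
    """Work of 'thread' i: hash only its own slice of the buffer."""
    start = i * MSG_LEN
    return hashlib.sha3_256(buffer[start:start + MSG_LEN]).digest()


with ThreadPoolExecutor(max_workers=8) as pool:
    digests = list(pool.map(hash_one, range(NUM_MSGS)))
```

Each task reads a disjoint slice and writes a separate digest, so no
synchronization between tasks is needed; this independence is what makes Batch
mode map naturally onto GPU threads.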
5 Experimental Results
Table 2 shows the configuration of our experimental platform. We conduct our
experiments on an Ubuntu 16.04 LTS 64-bit operating system with 12 GiB of
memory, an Intel Core i5-7200U @ 2.5GHz*4 CPU and a GeForce 940MX/PCIe/SSE2
GPU.
Table 3: SHA-3 CPU vs. GPU
File Size   Hash Time (seconds)    Throughput (bytes per second)
(bytes)     CPU        GPU         CPU          GPU
1202        0.002656   0.000431    452560.24    2788863.11
4652        0.008600   0.000330    540930.23    14096969.69
9302        0.016400   0.000330    567195.12    28187878.78
18602       0.032475   0.000346    572809.85    53763005.78
37202       0.065230   0.000373    570320.40    99737265.42
74402       0.129495   0.000382    574555.00    194769633.41
148802      0.258204   0.000437    576296.26    340508009.15
297602      0.516079   0.000557    576659.77    534294434.47
595202      1.036339   0.000755    574331.37    788347019.87
1190402     2.060367   0.001204    577762.12    988705980.06
The version of the nvcc compiler is V7.5.17 and the version of GCC is V5.4.0.
Our CPU code is written in standard C and the GPU code is written in CUDA.
Table 3 shows the SHA-3 hashing performance comparison between the CPU and the
GPU. Note that the GPU's parallel computing performance is affected by several
factors. When the number of concurrent tasks is small, for example when the
file size is less than 1K bytes, the GPU's parallel computing capability cannot
be fully exploited. Once the number of parallelizable tasks is large, the GPU's
parallel computing capability is utilized much better. In our experiments, once
the file size is larger than 1K bytes, the performance of the GPU exceeds the
performance of the CPU by more than 4 times, and the advantage of the GPU grows
as the number of files processed in parallel increases.
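The throughput columns of Table 3 are the file size divided by the hash time,
and the speedup of the GPU over the CPU is the ratio of the two hash times. The
short sketch below recomputes both from the timings in Table 3; the variable
names are hypothetical.

```python
# (file size in bytes, CPU time in seconds, GPU time in seconds) from Table 3
rows = [
    (1202, 0.002656, 0.000431),
    (4652, 0.008600, 0.000330),
    (9302, 0.016400, 0.000330),
    (18602, 0.032475, 0.000346),
    (37202, 0.065230, 0.000373),
    (74402, 0.129495, 0.000382),
    (148802, 0.258204, 0.000437),
    (297602, 0.516079, 0.000557),
    (595202, 1.036339, 0.000755),
    (1190402, 2.060367, 0.001204),
]


def throughput(size, seconds):
    """Bytes hashed per second."""
    return size / seconds


speedups = [cpu / gpu for _, cpu, gpu in rows]

for (size, cpu, gpu), s in zip(rows, speedups):
    print(f"{size:>8} bytes: CPU {throughput(size, cpu):12.2f} B/s, "
          f"GPU {throughput(size, gpu):14.2f} B/s, speedup {s:8.1f}x")
```

With these timings the speedup is roughly 6 for the smallest file and grows
with every row, reaching roughly 1700 for the largest file, while the CPU
throughput stays nearly constant.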
6 Conclusion
This paper implements and optimizes Batch mode based Keccak algorithms on the
NVIDIA GPU platform. Our work considers the case of processing multiple hash
tasks at once and implements this case on both the CPU and the GPU. Our
experimental results show that GPU performance is significantly higher than CPU
performance when processing large batches of small hash tasks. In future work,
we aim to implement and analyze Tree mode based Keccak algorithms, in which
many CUDA reduce operations are involved. Our project is available on GitHub:
https://github.com/Canhui/SHA3-ON-GPU.
Acknowledgements
This work is supported by Shenzhen Basic Research Grant SCI-2015-SZTIC-002.
References
[1] I. A. Saeed, “Examine the effectiveness of using traditional techniques for
teaching and learning CUDA programming,” Journal of Computing Sciences in
Colleges, vol. 33, no. 5, pp. 173–178, 2018.
[2] S. Kumar, J. Shekhar, and J. P. Singh, “Data security and encryption
technique for cloud storage,” in Cyber Security: Proceedings of CSI 2015.
Springer, 2018, pp. 193–199.
[3] M. J. Dworkin, “SHA-3 standard: Permutation-based hash and
extendable-output functions,” Tech. Rep., 2015.
[4] J. Lowden, M. Łukowiak, and S. Lopez Alarcon, “Design and performance
analysis of efficient Keccak tree hashing on GPU architectures,” Journal of
Computer Security, vol. 23, no. 5, pp. 541–562, 2015.
[5] Q. Li, C. Zhong, K. Zhao, X. Mei, and X. Chu, “Implementation and analysis
of AES encryption on GPU,” in High Performance Computing and Communication &
2012 IEEE 9th International Conference on Embedded Software and Systems
(HPCC-ICESS), 2012 IEEE 14th International Conference on. IEEE, 2012, pp.
843–848.
[6] K. Zhao and X. Chu, “G-BLASTN: accelerating nucleotide alignment by
graphics processors,” Bioinformatics, vol. 30, no. 10, pp. 1384–1391, 2014.
[7] K. Zhao and X. Chu, “GPUMP: A multiple precision integer library for GPUs,”
pp. 1164–1168, 2010.
[8] X. Mei, K. Zhao, C. Liu, and X. Chu, “Benchmarking the memory hierarchy of
modern GPUs,” in IFIP International Conference on Network and Parallel
Computing. Springer, 2014, pp. 144–156.
[9] X. Mei and X. Chu, “Dissecting GPU memory hierarchy through
microbenchmarking,” IEEE Transactions on Parallel and Distributed Systems, vol.
28, no. 1, pp. 72–86, 2017.
[10] J. Yang, W. Wang, Z. Xie, J. Han, Z. Yu, and X. Zeng, “Parallel
implementations of SHA-3 on a 24-core processor with software and hardware
co-design,” in ASIC (ASICON), 2017 IEEE 12th International Conference on. IEEE,
2017, pp. 953–956.
[11] C. Liu, Q. Wang, X. Chu, and Y.-W. Leung, “G-CRS: GPU accelerated Cauchy
Reed-Solomon coding,” IEEE Transactions on Parallel and Distributed Systems,
2018.
[12] X. Chu, K. Zhao, and M. Wang, “Practical random linear network coding on
GPUs,” in International Conference on Research in Networking. Springer, 2009,
pp. 573–585.
[13] X. Chu, K. Zhao, and M. Wang, “Practical random linear network coding on
GPUs,” in International Conference on Research in Networking. Springer, 2009,
pp. 573–585.
[14] H.-S. Cheong and W.-K. Lee, “Fast implementation of block ciphers and
PRNGs for Kepler GPU architecture,” in IT Convergence and Security (ICITCS),
2015 5th International Conference on. IEEE, 2015, pp. 1–5.
