GPU Accelerated AES Algorithm by Wang, Canhui & Chu, Xiaowen
arXiv:1902.05234v1 [cs.DC] 14 Feb 2019
GPU Accelerated AES Algorithm
Canhui Wang †, Xiaowen Chu †,⋆
† Computer Science, Hong Kong Baptist University, Hong Kong, China
⋆ HKBU Institute of Research and Continuing Education, Shenzhen, China
{chwang, chxw}@comp.hkbu.edu.hk
Abstract
It is widely accepted that Graphics Processing Units (GPUs) are one of the
most promising platforms for accelerating encryption. In particular, their
support for integer and logical operations makes implementation easier.
However, complexities such as the choice of parallel granularity and memory
allocation still impose a burden on real-world implementations.
In this paper, we propose a benchmarking approach for AES acceleration,
covering both encryption and decryption. Specifically, we adopt the
Electronic Codebook (ECB) mode for the cryptographic transformation, the
T-box scheme for fast lookups, and a granularity of 'one state per thread' for
thread scheduling. Our benchmarking results offer researchers a good
understanding of GPU architectures and software acceleration. In addition,
both our source code and experimental results are freely available.
1 Introduction
A graphics processing unit (GPU) is a specialized electronic circuit that was
originally designed to rapidly manipulate and alter memory in order to
accelerate the creation of images in a frame buffer. Modern GPUs are very
efficient at computer graphics and image processing. Moreover, their highly
parallel structure makes them more efficient than general-purpose CPUs in
fields such as statistical mechanics, mathematical biology and information
security, where large blocks of data are processed in parallel. Compute Unified
Device Architecture (CUDA), a parallel computing platform and application
programming interface model developed by NVIDIA, is the earliest widely
adopted programming model for GPU computing. To make it easier for specialists
in parallel programming, the CUDA platform is designed to work with
programming languages such as C, C++, Fortran and MATLAB. Additionally, CUDA
supports programming frameworks such as OpenACC and OpenCL. It has become one
of the most popular choices for implementing computationally demanding
algorithms.
2 Related Work
Abdelrahman et al. [1] propose an implementation of AES-128 ECB encryption on
three different GPU architectures (Kepler, Maxwell and Pascal). Their results
show encryption speeds of 207 Gbps on the NVIDIA GTX TITAN X (Maxwell) and
280 Gbps on the NVIDIA GTX 1080 (Pascal), achieved by applying new
optimization techniques with a granularity of 32 bytes/thread.
Liu et al. [2] propose a GPU-based implementation of erasure coding named
G-CRS, which employs the Cauchy Reed-Solomon (CRS) code to overcome the
performance bottleneck of erasure coding. To maximize the coding performance
of G-CRS, they design and implement a set of optimization strategies, such as
a compact structure to store the bitmatrix in GPU constant memory, efficient
data access through shared memory, and decoding parallelism, to fully utilize
the GPU resources.
Zhao and Chu [3] develop G-BLASTN, a GPU-accelerated nucleotide alignment
tool based on the widely used NCBI-BLAST. G-BLASTN produces exactly the same
results as NCBI-BLAST and has very similar user commands. Compared with the
sequential NCBI-BLAST, G-BLASTN achieves an overall speedup of 14.80X under
megablast mode. In [4], they propose to exploit the computing power of GPUs
for homomorphic hashing. Specifically, they demonstrate how to use NVIDIA GPUs
and the Compute Unified Device Architecture (CUDA) programming model to
achieve a 38-fold speedup over the CPU counterpart. They also develop a
multi-precision modular arithmetic library on the CUDA platform, which is not
only key to their specific application but also very useful for a large number
of cryptographic applications. Li et al. [5] propose a GPU-based AES
implementation in which the frequently accessed T-boxes are allocated in
on-chip shared memory and each thread handles one 16-byte AES block. They
achieve a throughput of around 60 Gbps on an NVIDIA Tesla C2050 GPU, which is
up to 50 times faster than a sequential implementation on an Intel Core i7-920
2.66 GHz CPU.
Iwai et al. [6] present the results of several experiments conducted to
elucidate the relation between the memory allocation styles of AES variables
and the granularity of the parallelism exploited in AES encoding, using CUDA
with an NVIDIA GeForce GTX 285. The results of these experiments show that the
16 bytes/thread granularity had the highest performance, achieving
approximately 35 Gbps throughput.
Mei et al. [7, 8] propose a novel fine-grained benchmarking approach and
apply it to two popular GPU architectures, namely Fermi and Kepler, to expose
previously unknown characteristics of their memory hierarchies. They also
investigate the impact of bank conflicts on shared memory access latency.
Chu et al. [9, 10] exploit the huge computing power of GPUs to reduce the
computational cost of network coding and homomorphic hashing. With their
network coding and homomorphic hash function (HHF) implementation on GPU, they
observe a significant computational speedup in comparison with the best CPU
implementation. Their implementation can lead to a practical solution for
defending against pollution attacks in distributed systems.
Abdelrahman et al. [11] propose an AES-128 (ECB mode) implementation on three
different GPU architectures with different granularities (32, 64 and 128
bytes/thread). Their results show that the throughput reaches 277 Gbps,
201 Gbps and 78 Gbps on the NVIDIA GTX 1080 (Pascal), the NVIDIA GTX TITAN X
(Maxwell) and the GTX 780 (Kepler) GPU architectures, respectively.
Conti et al. [12] present a direct comparison between FPGAs and GPUs used as
accelerators for the AES cipher. The results achieved on both platforms are
compared with several others in order to establish which device is best suited
to the role of hardware accelerator, with each solution offering interesting
considerations in terms of throughput, speedup factor, and resource usage.
Their analysis suggests that, while hardware design on FPGA remains the
natural choice for consumer-product design, GPUs are nowadays the preferable
choice for PC-based accelerators, especially when the processing routines are
highly parallelizable.
Ma et al. [13] implement AES decryption in CBC mode, a mode that is widely
used by many applications, on GPU using CUDA. To achieve the best performance,
they give a comprehensive performance analysis of GPU implementations under
different parameter settings, including the size of the input data, the number
of threads per block, the memory allocation style and the parallel
granularity. In their experimental evaluation, the best performance of AES on
GPU is 112 times that of the serial AES algorithm on CPU.
3 Preliminary
3.1 Block Cipher Mode
A block cipher mode is a scheme that uses a block cipher to encrypt messages
of arbitrary length in a way that provides confidentiality or authenticity.
Many block cipher modes [14, 15] have been defined. Five of them are discussed
below.
3.1.1 The Electronic Codebook Mode
The Electronic Codebook (ECB) mode assigns a fixed ciphertext block to each
plaintext block, as defined in equation (1).
$$C_i = E_K(P_i) \tag{1}$$
where $E_K$ denotes the block cipher encryption under the key $K$ and
$i = 1, 2, 3, \ldots, n$. The plaintext is $P = P_1 P_2 P_3 \ldots P_n$ and
the ciphertext is $C = C_1 C_2 C_3 \ldots C_n$.
3.1.2 The Cipher Block Chaining Mode
The Cipher Block Chaining (CBC) mode combines each plaintext block with the
previous ciphertext block before encrypting it. An initialization vector is
required to combine with the first plaintext block. The encryption in CBC mode
can be defined as equation (2).
$$C_i = E_K(P_i \oplus C_{i-1}) \tag{2}$$
where $E_K$ denotes the block cipher encryption under the key $K$ and
$i = 1, 2, 3, \ldots, n$. The plaintext is $P = P_1 P_2 P_3 \ldots P_n$ and
the ciphertext is $C = C_1 C_2 C_3 \ldots C_n$. The initialization vector
$C_0$ need not be secret, but it should be unpredictable.
3.1.3 The Cipher Feedback Mode
The Cipher Feedback (CFB) mode is similar to the CBC mode described above. The
main difference is that the block cipher is fed with the ciphertext (not the
plaintext) of the previous block, and its output is XORed with the current
plaintext block. The encryption in CFB mode (with full-block feedback) can be
defined as equation (3).
$$C_i = P_i \oplus E_K(C_{i-1}) \tag{3}$$
where $E_K$ denotes the block cipher encryption under the key $K$, $C_0$ is
the initialization vector and $i = 1, 2, 3, \ldots, n$. The plaintext is
$P = P_1 P_2 P_3 \ldots P_n$ and the ciphertext is
$C = C_1 C_2 C_3 \ldots C_n$. If a single bit of a plaintext block $P_i$ is
damaged, the ciphertext block $C_i$ and, through the feedback, the subsequent
ciphertext blocks are damaged as well.
3.1.4 The Output Feedback Mode
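The Output Feedback (OFB) mode generates a key stream from an initialization
vector IV by repeatedly encrypting it, and uses this key stream for data
encryption. The encryption in OFB mode can be defined as equation (4).
$$C_i = P_i \oplus E_K(O_{i-1}) \tag{4}$$
where $E_K$ denotes the block cipher encryption under the key $K$,
$O_0 = IV$, $O_i = E_K(O_{i-1})$ and $i = 1, 2, 3, \ldots, n$. The plaintext
is $P = P_1 P_2 P_3 \ldots P_n$ and the ciphertext is
$C = C_1 C_2 C_3 \ldots C_n$.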
3.1.5 The Counter Mode
The Counter (CTR) mode turns a block cipher into a stream cipher in a way
similar to the OFB mode. The key-stream bits are created regardless of the
content of the data blocks: successive values of an increasing counter are
added to a unique number $R$, the result is encrypted, and the output is
XORed with the plaintext block. The encryption in CTR mode can be defined as
equation (5).
$$C_i = P_i \oplus E_K(R + i) \tag{5}$$
where $E_K$ denotes the block cipher encryption under the key $K$,
$R \in \{0, 1, 2, \ldots, 2^{|P|} - 1\}$ and $i = 1, 2, 3, \ldots, n$. The
plaintext is $P = P_1 P_2 P_3 \ldots P_n$ and the ciphertext is
$C = C_1 C_2 C_3 \ldots C_n$.
3.1.6 Comparison
Table 1: Comparison

Block Cipher Mode    Parallel Capacity
ECB                  Suitable
CBC                  Unsuitable
CFB                  Unsuitable
OFB                  Unsuitable
CTR                  Suitable
Table 1 shows the parallel capacity of the five block cipher modes. As we can
see, both the ECB and CTR modes naturally support parallel computing because
every block can be encrypted independently of the others. In the following
sections, we focus on the ECB mode and analyze the implementation of ECB-based
AES algorithms.
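The data dependencies behind Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_block_cipher` is a hypothetical, insecure stand-in for AES, and the helper names are ours rather than the paper's. In the ECB and CTR functions every block is computed from its own input alone, whereas the CBC loop cannot start block i before block i-1 is finished.

```python
BLOCK = 16  # block size in bytes, as in AES


def toy_block_cipher(block: bytes, key: bytes) -> bytes:
    """Hypothetical keyed permutation on 16-byte blocks (NOT secure, NOT AES)."""
    rotated = block[1:] + block[:1]
    return bytes(b ^ k for b, k in zip(rotated, key))


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def split(msg: bytes) -> list:
    return [msg[i:i + BLOCK] for i in range(0, len(msg), BLOCK)]


def ecb_encrypt(msg: bytes, key: bytes) -> bytes:
    # Each output block depends only on its own plaintext block.
    return b"".join(toy_block_cipher(p, key) for p in split(msg))


def cbc_encrypt(msg: bytes, key: bytes, iv: bytes) -> bytes:
    out, prev = [], iv
    for p in split(msg):
        # The previous ciphertext block is needed before this one can start.
        prev = toy_block_cipher(xor(p, prev), key)
        out.append(prev)
    return b"".join(out)


def ctr_encrypt(msg: bytes, key: bytes, nonce: int) -> bytes:
    out = []
    for i, p in enumerate(split(msg), start=1):
        # The key stream depends only on the counter value, not on other blocks.
        counter = ((nonce + i) % (1 << 128)).to_bytes(BLOCK, "big")
        out.append(xor(p, toy_block_cipher(counter, key)))
    return b"".join(out)
```

The sketch also makes a known property of ECB visible: two identical plaintext blocks produce two identical ciphertext blocks.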
3.2 AES Basics
Rijndael, the algorithm behind the Advanced Encryption Standard (AES), is a
government encryption standard that has been submitted to the International
Organization for Standardization (ISO), the Internet Engineering Task Force
(IETF) and the Institute of Electrical and Electronics Engineers (IEEE) as an
approved encryption standard. One of the main factors in its successful
acceptance is that it supports parallel implementations on a wide range of
platforms. In this section, we discuss both the theory and the construction of
AES.
3.2.1 The Field GF(2^n)
GF(2^n) is a finite field that contains 2^n elements. It supports basic
mathematical operations such as addition and multiplication. Elements of a
finite field can be represented in several different ways, such as polynomial
expressions or integers. AES adopts the field GF(2^8). To describe a finite
field in detail, a simple example, GF(2^3), is given in Table 2.
Table 2: All elements in the finite field GF(2^3)

Polynomial       Tri-Tuple    Integer
0                (0, 0, 0)    0
1                (0, 0, 1)    1
x                (0, 1, 0)    2
x + 1            (0, 1, 1)    3
x^2              (1, 0, 0)    4
x^2 + 1          (1, 0, 1)    5
x^2 + x          (1, 1, 0)    6
x^2 + x + 1      (1, 1, 1)    7
Table 2 shows all elements of the field GF(2^3) in three different
representations: polynomial expressions, tri-tuples and integers. In general,
every element $a(x)$ of a finite field GF(2^n) can be represented as in
equation (6).
$$a(x) = \sum_{i=0}^{n-1} a_i x^i \tag{6}$$
where $a_i = 0$ or $1$ for $i = 0, 1, 2, \ldots, n-1$. In particular, for AES,
$n = 8$.
Definition 1. Addition: in the polynomial representation, the addition of two
elements of a finite field simply means adding the corresponding coefficients
of the two elements modulo 2, which is a bitwise XOR. For example, given
$f(x) = x^2 + x$ and $g(x) = x^2 + x + 1$, the sum $f(x) + g(x)$ is given in
equation (7).
$$(x^2 + x) + (x^2 + x + 1) = 1 \tag{7}$$
Definition 2. Multiplication: in the polynomial representation, the
multiplication of two elements of a finite field is similar to ordinary
polynomial multiplication; the only difference is that the product is reduced
modulo an irreducible polynomial. An irreducible polynomial is a non-constant
polynomial that cannot be factored into the product of two non-constant
polynomials. In AES, the irreducible polynomial is denoted as $m(x)$ and given
in equation (8).
$$m(x) = x^8 + x^4 + x^3 + x + 1 \tag{8}$$
For example, given $f(x) = x^6 + x^4 + x^2 + x + 1$ and
$g(x) = x^7 + x + 1$, the product $f(x)g(x)$ can be derived as follows,
$$(x^6 + x^4 + x^2 + x + 1)(x^7 + x + 1) \tag{9}$$
$$= x^{13} + x^{11} + x^9 + x^8 + x^7 + x^7 + x^5 + x^3 + x^2 + x + x^6 + x^4 + x^2 + x + 1 \tag{10}$$
$$= x^{13} + x^{11} + x^9 + x^8 + x^6 + x^5 + x^4 + x^3 + 1 \tag{11}$$
and
$$(x^{13} + x^{11} + x^9 + x^8 + x^6 + x^5 + x^4 + x^3 + 1) \bmod m(x) \tag{12}$$
$$= x^7 + x^6 + 1 \tag{13}$$
thus
$$f(x)g(x) = x^7 + x^6 + 1 \tag{14}$$
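The derivation above can be checked mechanically. The following is a minimal Python sketch, assuming the usual encoding in which bit i of an integer holds the coefficient of x^i (so f(x) is 0x57 and g(x) is 0x83); the function names are ours.

```python
M_POLY = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1


def poly_mul(f: int, g: int) -> int:
    """Carry-less product of two polynomials over GF(2), cf. equations (9)-(11)."""
    result = 0
    while g:
        if g & 1:
            result ^= f  # adding coefficients modulo 2 is XOR
        f <<= 1
        g >>= 1
    return result


def poly_mod(p: int, m: int = M_POLY) -> int:
    """Remainder of p(x) divided by m(x) over GF(2), cf. equations (12)-(13)."""
    deg_m = m.bit_length() - 1
    while p.bit_length() - 1 >= deg_m:
        # Cancel the leading term of p with a shifted copy of m.
        p ^= m << (p.bit_length() - 1 - deg_m)
    return p


def gf_mul(f: int, g: int) -> int:
    """Product of two elements of GF(2^8)."""
    return poly_mod(poly_mul(f, g))
```

With this encoding, `poly_mul(0x57, 0x83)` yields the unreduced product of equation (11) and `gf_mul(0x57, 0x83)` yields 0xC1, which is x^7 + x^6 + 1.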
Definition 3. xtime(): in AES, the multiplication by 02 in hexadecimal (or $x$
in polynomial form) is denoted as xtime(). For example, if we multiply
$f(x) = x^6 + x^4 + x^2 + x + 1$ by the polynomial $x$, we obtain
$$xtime(f(x)) = x f(x) \tag{15}$$
$$= x^7 + x^5 + x^3 + x^2 + x \tag{16}$$
thus,
$$xtime(f(x)) = x^7 + x^5 + x^3 + x^2 + x \tag{17}$$
No reduction modulo $m(x)$ is needed here because the degree of the result is
below 8.
One way to implement xtime() is to make it take a fixed number of cycles,
independent of the value of its argument; however, such an approach is
vulnerable to power analysis attacks [16–18]. A better approach is to define a
look-up table denoted by $M$, where $M[a] = xtime(a)$. All elements of the AES
field can be written as a sum of powers of $x$, so multiplication by any value
in AES can be implemented by repeated use of the look-up table $M$. Note that
xtime() amounts to one shift operation and one conditional XOR operation.
Therefore, the function xtime() helps to improve the performance of AES.
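As an illustration of the last two points, here is a small Python sketch of xtime() as a shift plus a conditional XOR, of the look-up table M, and of a general multiplication built from repeated look-ups in M. The names `M` and `gf_mul_by_table` are ours, and the Python conditional is of course not a constant-time implementation.

```python
def xtime(a: int) -> int:
    """Multiply a GF(2^8) element by x (hex 02)."""
    a <<= 1            # the shift
    if a & 0x100:      # a term x^8 appeared
        a ^= 0x11B     # the conditional XOR with m(x)
    return a


# Look-up table M with M[a] = xtime(a).
M = [xtime(a) for a in range(256)]


def gf_mul_by_table(a: int, b: int) -> int:
    """Multiply a by b using only look-ups in M and XOR.

    b is treated as a sum of powers of x; for each set bit i of b,
    the value a * x^i is added (XORed) to the result.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = M[a]   # a <- a * x
        b >>= 1
    return result
```

For the polynomial f(x) used above, encoded as 0x57, `xtime(0x57)` gives 0xAE, which is x^7 + x^5 + x^3 + x^2 + x as in equation (17).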
3.2.2 AES Specification
Definition 4. sub-bytes(): in AES, the sub-bytes() transformation is a
non-linear operation that operates on each byte of a state independently. To
accelerate AES, a 256-byte look-up table named sbox [19] is used during
encryption and decryption. The sub-bytes() transformation can be represented
as follows.
$$state'_{i,j} = sbox[t] \tag{18}$$
where $t = state_{i,j}$, $0 \le i, j \le 3$, $state$ is the state matrix
before the transformation and $state'$ is the state matrix after it. Because
each byte is handled independently, the sub-bytes() operation can be
implemented in parallel over the bytes of the state matrix.
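The following Python sketch shows sub-bytes() as a per-byte table look-up. The paper only uses sbox as a given table; to keep the example self-contained, the table is generated here with the standard AES construction (multiplicative inverse in GF(2^8) followed by an affine transformation), which is our addition and not part of the paper's code.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a: int) -> int:
    """Multiplicative inverse computed as a^254 (0 is mapped to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox() -> list:
    sbox = []
    for x in range(256):
        b = gf_inv(x)
        # Affine transformation of the AES S-box.
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_bytes(state: list) -> list:
    """Apply equation (18) to a 4x4 state given as state[i][j]."""
    return [[SBOX[t] for t in row] for row in state]
```

In a real implementation the table would normally be stored as a constant array rather than recomputed.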
Definition 5. shift-rows(): in AES, the shift-rows() operation shifts the rows
of a state. The rows of the state matrix are cyclically shifted by different
offsets. Let the original state matrix be denoted as $state$ and the state
matrix after shifting rows be denoted as $state'$. Then the shift-rows()
operation can be expressed as equation (19).
$$state'_{i,j} = state_{i,t} \tag{19}$$
where $t = (i + j) \bmod 4$. After the shift-rows() operation, the four bytes
of each column are spread over four distinct columns. Note that the
shift-rows() operation is a fixed permutation and does not depend on the round
key.
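Equation (19) translates directly into code. A minimal Python sketch, assuming the state is stored as a list of four rows `state[i][j]` (a layout chosen here for readability):

```python
def shift_rows(state: list) -> list:
    """Row i of the state is rotated left by i positions, cf. equation (19)."""
    return [[state[i][(i + j) % 4] for j in range(4)] for i in range(4)]


# Example: number the bytes 0..15 row by row and shift.
example = [[4 * i + j for j in range(4)] for i in range(4)]
shifted = shift_rows(example)
```

Row 0 stays in place, while rows 1, 2 and 3 are rotated by one, two and three positions, so the bytes of any original column end up in four different columns.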
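Definition 6. mix-columns(): let the state matrix before the operation be
denoted as $state$ and the state matrix after mixing columns be denoted as
$state'$. In addition, let $A$ denote a fixed matrix. Then the mix-columns()
operation can be expressed as equation (20).
$$state' = A \otimes state \tag{20}$$
$$A = \begin{bmatrix}
\mathtt{0x02} & \mathtt{0x03} & \mathtt{0x01} & \mathtt{0x01} \\
\mathtt{0x01} & \mathtt{0x02} & \mathtt{0x03} & \mathtt{0x01} \\
\mathtt{0x01} & \mathtt{0x01} & \mathtt{0x02} & \mathtt{0x03} \\
\mathtt{0x03} & \mathtt{0x01} & \mathtt{0x01} & \mathtt{0x02}
\end{bmatrix}$$
where $A$ is the fixed mix-columns matrix and $\otimes$ denotes matrix
multiplication over GF(2^8), applied to each column of the state.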
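Since the only non-trivial entries of A are 02 and 03, each output byte needs at most a few xtime() calls and XORs. A minimal Python sketch under that observation (multiplication by 03 is computed as xtime(a) XOR a; the `state[row][col]` layout and function names are our choice):

```python
def xtime(a: int) -> int:
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


def mix_column(col: list) -> list:
    """Multiply one 4-byte column by the matrix A of equation (20)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,   # 02 03 01 01
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,   # 01 02 03 01
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),   # 01 01 02 03
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),   # 03 01 01 02
    ]


def mix_columns(state: list) -> list:
    """Apply mix_column to every column of a 4x4 state[row][col]."""
    cols = [mix_column([state[r][c] for r in range(4)]) for c in range(4)]
    return [[cols[c][r] for c in range(4)] for r in range(4)]
```

The four columns are processed independently of each other, which is one more source of parallelism inside a single state.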
Definition 7. add-roundkey(): a round key is applied to the state by a simple
bitwise XOR. Let the state matrix before the operation be denoted as $state$,
the state matrix after adding the round key be denoted as $state'$, and the
expanded round key used in this step be denoted as $epdkey$. Then the
add-roundkey() operation can be expressed as equation (21).
$$state' = state \oplus epdkey \tag{21}$$
An important question is how the round keys are generated and used. The round
keys are generated from the input cipher key by the key schedule [20]. The key
schedule involves two steps: key expansion and round key selection. In the
first step, the cipher key is expanded into an expanded key whose total length
equals the block length multiplied by the number of rounds plus one. For
example, for a block size of 128 bits and 10 rounds, the total length of the
round keys (the expanded key) is 128 × (10 + 1) = 1408 bits. In the second
step, round keys are taken sequentially from the expanded key. For example,
the first round key consists of the first Nb words of the expanded key, the
second round key consists of the next Nb words, and so on, where Nb is the
number of 32-bit words in a block (Nb = 4 for AES).
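The two steps can be sketched in Python for the AES-128 case (Nb = 4, a 4-word cipher key, 10 rounds). This follows the standard AES key expansion rather than any code from the paper; the S-box is generated inline only to keep the example self-contained, and all function names are ours.

```python
def gf_mul(a: int, b: int) -> int:
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def build_sbox() -> list:
    rotl8 = lambda x, n: ((x << n) | (x >> (8 - n))) & 0xFF
    sbox = []
    for x in range(256):
        b = 1
        for _ in range(254):      # b = x^254, the inverse of x (0 maps to 0)
            b = gf_mul(b, x)
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]
NB, NK, NR = 4, 4, 10  # AES-128 only


def expand_key(key: bytes) -> list:
    """Step 1: expand a 16-byte key into NB * (NR + 1) = 44 four-byte words."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(NK)]
    for i in range(NK, NB * (NR + 1)):
        temp = list(w[i - 1])
        if i % NK == 0:
            temp = temp[1:] + temp[:1]          # rotate the word by one byte
            temp = [SBOX[b] for b in temp]      # substitute each byte
            temp[0] ^= RCON[i // NK - 1]        # add the round constant
        w.append([a ^ b for a, b in zip(w[i - NK], temp)])
    return w


def round_key(w: list, r: int) -> list:
    """Step 2: round key r consists of words NB*r .. NB*r + NB - 1."""
    return w[NB * r:NB * (r + 1)]
```

The 44 words of 32 bits each add up to the 1408 bits computed above.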
3.2.3 T Tables
So far, all specific operations of AES, including sub-bytes(), shift-rows(), mix-
columns(), add-roundkey(), have been discussed (also see Algorithm 1).
Algorithm 1: An overview of AES implementation
Input : a state, a cipherkey.
Output: an encrypted state.
keyExpansion(cipherkey, expandedkey);
add-roundkey(state, expandedkey[0]); (see Definition 7)
for int i = 1; i < Nr; i++ do
    sub-bytes(state); (see Definition 4)
    shift-rows(state); (see Definition 5)
    mix-columns(state); (see Definition 6)
    add-roundkey(state, expandedkey[i]); (see Definition 7)
sub-bytes(state); (see Definition 4)
shift-rows(state); (see Definition 5)
add-roundkey(state, expandedkey[Nr]); (see Definition 7)
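To make the round structure concrete, the following Python sketch encrypts a single 16-byte state with AES-128. It follows the standard AES structure (an initial add-roundkey, Nr - 1 full rounds, and a final round without mix-columns); it is an illustration written for this text, not the paper's C or CUDA code. The state is kept as a flat list in which byte (row r, column c) sits at index 4c + r, and the S-box is generated inline for self-containment.

```python
def gf_mul(a: int, b: int) -> int:
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def build_sbox() -> list:
    rotl8 = lambda x, n: ((x << n) | (x >> (8 - n))) & 0xFF
    sbox = []
    for x in range(256):
        b = 1
        for _ in range(254):      # b = x^254, the inverse of x (0 maps to 0)
            b = gf_mul(b, x)
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]
NR = 10  # number of rounds for AES-128


def expand_key(key: bytes) -> list:
    """Return NR + 1 round keys, each a flat list of 16 bytes."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    for i in range(4, 4 * (NR + 1)):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]
            t[0] ^= RCON[i // 4 - 1]
        w.append([a ^ b for a, b in zip(w[i - 4], t)])
    return [sum(w[4 * r:4 * r + 4], []) for r in range(NR + 1)]


def add_round_key(s: list, k: list) -> list:
    return [a ^ b for a, b in zip(s, k)]


def sub_bytes(s: list) -> list:
    return [SBOX[b] for b in s]


def shift_rows(s: list) -> list:
    # new(r, c) = old(r, (c + r) mod 4)
    return [s[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]


def mix_columns(s: list) -> list:
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        out += [
            gf_mul(a[0], 2) ^ gf_mul(a[1], 3) ^ a[2] ^ a[3],
            a[0] ^ gf_mul(a[1], 2) ^ gf_mul(a[2], 3) ^ a[3],
            a[0] ^ a[1] ^ gf_mul(a[2], 2) ^ gf_mul(a[3], 3),
            gf_mul(a[0], 3) ^ a[1] ^ a[2] ^ gf_mul(a[3], 2),
        ]
    return out


def aes128_encrypt_block(block: bytes, key: bytes) -> bytes:
    rk = expand_key(key)
    s = add_round_key(list(block), rk[0])
    for i in range(1, NR):                       # NR - 1 full rounds
        s = add_round_key(mix_columns(shift_rows(sub_bytes(s))), rk[i])
    s = add_round_key(shift_rows(sub_bytes(s)), rk[NR])  # final round
    return bytes(s)
```

In the 'one state per thread' setting discussed later, each GPU thread would run the body of `aes128_encrypt_block` on its own state, with the round keys computed once and shared.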
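However, in order to achieve higher performance, four look-up tables named
T tables are introduced. Each T table combines the sub-bytes and mix-columns
transformations for one input byte; the shift-rows transformation is realized
by the choice of the input bytes, and add-roundkey by a final XOR with the
round key. In other words, the four T tables together represent one round
transformation. The T tables are computed as in equations (22), (23), (24) and
(25), where $w = 0, 1, 2, 3, \ldots, 255$ and $\cdot$ denotes multiplication
in GF(2^8), which can be implemented with xtime().
$$T_0(w) = \begin{bmatrix} \mathtt{02} \cdot sbox(w) \\ sbox(w) \\ sbox(w) \\ \mathtt{03} \cdot sbox(w) \end{bmatrix} \tag{22}$$
$$T_1(w) = \begin{bmatrix} \mathtt{03} \cdot sbox(w) \\ \mathtt{02} \cdot sbox(w) \\ sbox(w) \\ sbox(w) \end{bmatrix} \tag{23}$$
$$T_2(w) = \begin{bmatrix} sbox(w) \\ \mathtt{03} \cdot sbox(w) \\ \mathtt{02} \cdot sbox(w) \\ sbox(w) \end{bmatrix} \tag{24}$$
$$T_3(w) = \begin{bmatrix} sbox(w) \\ sbox(w) \\ \mathtt{03} \cdot sbox(w) \\ \mathtt{02} \cdot sbox(w) \end{bmatrix} \tag{25}$$
Given a round input $p$, the round output can be represented as follows,
$$e_j = T_0[p_{0,j}] \oplus T_1[p_{1,j+1}] \oplus T_2[p_{2,j+2}] \oplus T_3[p_{3,j+3}] \oplus k_j \tag{26}$$
where $k_j$ represents the $j$-th column of a round key and the column indices
are taken modulo 4.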
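A possible construction of the four tables and of the round function of equation (26) is sketched below in Python. We assume here that each table entry is packed into a 32-bit word with the first (top) byte of the column in the most significant position; under that packing T1, T2 and T3 are byte rotations of T0. The S-box is generated inline for self-containment, and the names `t_round`, `T0` to `T3` and `ror8` are ours.

```python
def gf_mul(a: int, b: int) -> int:
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def build_sbox() -> list:
    rotl8 = lambda x, n: ((x << n) | (x >> (8 - n))) & 0xFF
    sbox = []
    for x in range(256):
        b = 1
        for _ in range(254):      # b = x^254, the inverse of x (0 maps to 0)
            b = gf_mul(b, x)
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def ror8(word: int) -> int:
    """Rotate a 32-bit word right by one byte."""
    return ((word >> 8) | (word << 24)) & 0xFFFFFFFF


# T0(w) = (02*s, s, s, 03*s) with s = sbox(w), cf. equation (22).
T0 = [(gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3) for s in SBOX]
T1 = [ror8(t) for t in T0]   # (03*s, 02*s, s, s), equation (23)
T2 = [ror8(t) for t in T1]   # (s, 03*s, 02*s, s), equation (24)
T3 = [ror8(t) for t in T2]   # (s, s, 03*s, 02*s), equation (25)


def t_round(p: list, k: list) -> list:
    """One full AES round via equation (26).

    p[i][j] is the state byte in row i, column j; k[j] is column j of the
    round key packed as a 32-bit word. Returns the four output columns e_j.
    """
    return [
        T0[p[0][j]] ^ T1[p[1][(j + 1) % 4]] ^ T2[p[2][(j + 2) % 4]]
        ^ T3[p[3][(j + 3) % 4]] ^ k[j]
        for j in range(4)
    ]
```

Each output column costs four table look-ups and four XORs, which is why the T tables are attractive for a per-thread implementation.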
4 GPU Accelerated AES
4.1 Parallel Granularity
Our GPU accelerated AES adopts the 'one state per thread' parallel
granularity, which means that each thread is mapped to one AES state. Messages
are divided into multiple 16-byte states and these states are processed by the
GPU in parallel. Note that each GPU thread can complete independently, since
the states of different threads have no interdependencies; no thread needs to
wait for any other thread.
4.2 T-Tables Allocation
T tables are essentially look-up tables with which part of the cryptographic
operations can be implemented quickly. Every thread needs to access the
T tables in each round of the cryptographic operations. Hence we need to load
the T tables into GPU memory in advance. In our benchmarking approach, we load
the T tables into CUDA constant memory. Another possible solution is to load
the T tables into CUDA shared memory or registers.
4.3 Round-keys Allocation
The security of AES rests on the confidentiality and the length of the key. In
an implementation of a specific AES algorithm, each round of AES encryption
requires a key operation, so every thread needs a different round key in each
round. All of these round keys (Nr + 1 in total, i.e. 11 for AES-128) are
generated from one single input key and are identical for all threads. We
therefore put the round keys in constant memory, where all threads share them.
4.4 Plaintext Allocation
In AES in ECB mode, the plaintext is divided into several states (blocks) to
be encrypted. In our benchmarking approach, we use the 'one thread per state'
method. For example, if we have 100 states, that is 1600 bytes, then we need
at least 100 GPU threads. After each thread acquires its 16 bytes of data, the
plaintext data is stored in the thread's registers. A thread's registers are
not accessible to other threads. After all threads complete the data
encryption, we copy the encrypted data back from GPU memory.
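The bookkeeping behind this mapping can be sketched as follows. This is a host-side Python illustration only: the block size of 256 threads is an assumed value, padding of a final partial state is our assumption, and `launch_config` and `state_slice` are hypothetical helper names. The index formula mirrors the usual CUDA expression blockIdx.x * blockDim.x + threadIdx.x.

```python
STATE_BYTES = 16  # one AES state


def launch_config(message_len: int, threads_per_block: int = 256) -> tuple:
    """Return (number of states, number of thread blocks) for a message.

    A final partial state is assumed to be padded to 16 bytes.
    """
    n_states = -(-message_len // STATE_BYTES)        # ceiling division
    n_blocks = -(-n_states // threads_per_block)     # ceiling division
    return n_states, n_blocks


def state_slice(block_idx: int, thread_idx: int, threads_per_block: int = 256) -> tuple:
    """Byte range [start, end) of the state handled by one thread."""
    tid = block_idx * threads_per_block + thread_idx
    return tid * STATE_BYTES, (tid + 1) * STATE_BYTES
```

For the 1600-byte example above this gives 100 states, which fit into a single block of 256 threads; threads whose index is 100 or larger would simply have no state to process.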
5 Experimental Results
Table 3 shows the configuration of our experiment platform. We conduct our
experiments on a 64-bit Ubuntu 16.04 LTS operating system with 12 GiB of
memory, an Intel Core i5-7200U @ 2.5 GHz × 4 CPU and a GeForce
940MX/PCIe/SSE2 GPU. The version of the nvcc CUDA compiler is V7.5.17 and the
version of GCC is V5.4.0. Our CPU code is written in standard C and the GPU
code is written in CUDA.
Table 3: Environment specification

CPU                 Intel Core i5-7200U @ 2.5 GHz × 4
GPU                 GeForce 940MX/PCIe/SSE2
Memory              12 GiB
OS                  Ubuntu 16.04 LTS, 64 bits
CUDA compilation    V7.5.17
GCC                 V5.4.0

Table 4 compares the performance of CPU-based and GPU-based AES encryption.
When the file size is small, say 1202 bytes, the CPU performs better than the
GPU; however, once the file size grows to 9302 bytes or more, the GPU performs
better than the CPU, and its advantage increases with the file size. The
performance of the CPU is relatively stable, while the throughput of the GPU
keeps growing with the input size, meaning that the GPU is better utilized
when more concurrent tasks are available. In particular, when the file size is
1190402 bytes, the CPU's encryption performance is still about 7510103 bytes
per second, while the GPU's performance has increased to 223928517 bytes per
second, a speedup of roughly 30 times.
Table 4: AES encryption, CPU vs. GPU

File Size    Encryption Time (seconds)    Encryption Throughput (bytes per second)
(bytes)      CPU         GPU              CPU            GPU
1202         0.000152    0.000732         7921052.63     1644808.74
4652         0.000689    0.000770         6769230.77     6057142.85
9302         0.001418    0.000784         6564174.89     11872448.97
18602        0.002545    0.000807         7313163.06     23063197.03
37202        0.004985    0.000886         7463189.57     41990970.65
74402        0.010099    0.001048         7367462.12     70996183.21
148802       0.020109    0.001302         7399870.70     114288786.48
297602       0.039651    0.001896         7505586.24     156964135.02
595202       0.079503    0.003145         7486560.26     189254054.05
1190402      0.158507    0.005316         7510103.65     223928517.68
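The throughput columns follow from the file size and the measured time, and the speedup from the ratio of the two times. A short Python sketch of this arithmetic for the last row of Table 4 (the helper names are ours; recomputed values can differ slightly from the table, presumably because the listed times are rounded):

```python
def throughput(size_bytes: int, seconds: float) -> float:
    """Throughput in bytes per second."""
    return size_bytes / seconds


def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """How many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds


size, cpu_time, gpu_time = 1190402, 0.158507, 0.005316  # last row of Table 4

cpu_tp = throughput(size, cpu_time)      # about 7.5e6 bytes per second
gpu_tp = throughput(size, gpu_time)      # about 2.2e8 bytes per second
gpu_gbps = gpu_tp * 8 / 1e9              # the same figure in Gbit/s
ratio = speedup(cpu_time, gpu_time)
```

Expressed in the unit used in the related work, the GPU figure corresponds to a little under 2 Gbps on this hardware.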
Table 5 compares the performance of CPU-based and GPU-based AES decryption.
When the file size is small, say 1202 bytes, the CPU performs better than the
GPU; however, once the file size reaches 4652 bytes, the GPU outperforms the
CPU, and its advantage grows with the file size. In particular, when the file
size is 1190402 bytes, the CPU's decryption performance is still about 5239385
bytes per second, while the GPU's performance has increased to 203592269 bytes
per second, a speedup of roughly 39 times.

Table 5: AES decryption, CPU vs. GPU

File Size    Decryption Time (seconds)    Decryption Throughput (bytes per second)
(bytes)      CPU         GPU              CPU            GPU
1202         0.000264    0.000771         4560606.06     1561608.30
4652         0.001168    0.000821         3993150.68     5680876.98
9302         0.002081    0.000790         4472849.59     11782278.48
18602        0.003889    0.000821         4785806.12     22669914.74
37202        0.007196    0.000918         5170094.49     40527233.12
74402        0.014296    0.001040         5204532.74     71542307.69
148802       0.028493    0.001331         5222475.69     111798647.63
297602       0.056839    0.002018         5235911.96     147474727.45
595202       0.113610    0.003577         5239010.65     166397539.84
1190402      0.227203    0.005847         5239385.04     203592269.54

6 Conclusion
In this paper, we propose a benchmarking approach for GPU-based AES
implementation covering both the encryption and the decryption process. We
adopt the Electronic Codebook (ECB) mode for the cryptographic transformation,
the T-box scheme for fast lookups, and a granularity of 'one state per thread'
for thread scheduling. Our benchmarking results offer researchers a good
understanding of GPU architectures and software acceleration. Our experimental
results show that the GPU has a significant performance advantage over the CPU
when AES is used to encrypt (or decrypt) large files; in our measurements the
GPU is faster from about 9 KB for encryption and from about 4.6 KB for
decryption. As the size of the parallelizable files grows, the GPU is utilized
more fully. In addition, both our source code and experimental results are
freely available on GitHub (https://github.com/Canhui/AES-ON-GPU).
Acknowledgements
This work is supported by Shenzhen Basic Research Grant SCI-2015-SZTIC-002.
References
[1] A. A. Abdelrahman, M. M. Fouad, H. Dahshan, and A. M. Mousa, “High perfor-
mance cuda aes implementation: A quantitative performance analysis approach,” in
Computing Conference, 2017. IEEE, 2017, pp. 1077–1085.
[2] C. Liu, Q. Wang, X. Chu, and Y.-W. Leung, “G-crs: Gpu accelerated cauchy reed-
solomon coding,” IEEE Transactions on Parallel and Distributed Systems, 2018.
[3] K. Zhao and X. Chu, “G-blastn: accelerating nucleotide alignment by graphics pro-
cessors,” Bioinformatics, vol. 30, no. 10, pp. 1384–1391, 2014.
[4] Zhao and X. Chu, “Gpump: A multiple precision integer library for gpus,” pp. 1164–
1168, 2010.
[5] Q. Li, C. Zhong, K. Zhao, X. Mei, and X. Chu, “Implementation and analysis of aes
encryption on gpu,” in High Performance Computing and Communication & 2012
IEEE 9th International Conference on Embedded Software and Systems (HPCC-
ICESS), 2012 IEEE 14th International Conference on. IEEE, 2012, pp. 843–848.
[6] K. Iwai, N. Nishikawa, and T. Kurokawa, “Acceleration of aes encryption on cuda
gpu,” International Journal of Networking and Computing, vol. 2, no. 1, pp. 131–
145, 2012.
[7] X. Mei, K. Zhao, C. Liu, and X. Chu, “Benchmarking the memory hierarchy of
modern gpus,” in IFIP International Conference on Network and Parallel Comput-
ing. Springer, 2014, pp. 144–156.
[8] X. Mei and X. Chu, “Dissecting gpu memory hierarchy through microbenchmark-
ing,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 1, pp.
72–86, 2017.
[9] X. Chu, K. Zhao, and M. Wang, “Practical random linear network coding on gpus,” in
International Conference on Research in Networking. Springer, 2009, pp. 573–585.
[10] X. Chu, K. Zhao, and M. Wang, “Practical random linear network coding on gpus,”
in International Conference on Research in Networking. Springer, 2009, pp.
573–585.
[11] A. A. Abdelrahman, M. M. Fouad, and H. Dahshan, “Analysis on the aes imple-
mentation with various granularities on different gpu architectures,” Advances in
Electrical and Electronic Engineering, vol. 15, no. 3, p. 526, 2017.
[12] V. Conti and S. Vitabile, “Design exploration of aes accelerators on fpgas and gpus,”
Journal of Telecommunications and Information Technology, no. 1, p. 28, 2017.
[13] J. Ma, X. Chen, R. Xu, and J. Shi, “Implementation and evaluation of different
parallel designs of aes using cuda,” in Data Science in Cyberspace (DSC), 2017
IEEE Second International Conference on. IEEE, 2017, pp. 606–614.
[14] J. Black and P. Rogaway, “A block-cipher mode of operation for parallelizable mes-
sage authentication,” in International Conference on the Theory and Applications of
Cryptographic Techniques. Springer, 2002, pp. 384–397.
[15] M. Dworkin, “Recommendation for block cipher modes of operation: methods for
format-preserving encryption,” NIST Special Publication, vol. 800, p. 38G, 2016.
[16] G. Fontana, S. Donatiello, and G. Di Sirio, “Method for protecting ic cards against
power analysis attacks,” Aug. 12 2014, US Patent 8,804,949.
[17] M. Alioto, S. Bongiovanni, M. Djukanovic, G. Scotti, and A. Trifiletti, “Effective-
ness of leakage power analysis attacks on dpa-resistant logic styles under process
variations,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61,
no. 2, pp. 429–442, 2014.
[18] P. Luo, L. Zhang, Y. Fei, and A. A. Ding, “Towards secure cryptographic software
implementation against side-channel power analysis attacks,” in Application-specific
Systems, Architectures and Processors (ASAP), 2015 IEEE 26th International Con-
ference on. IEEE, 2015, pp. 144–148.
[19] J.-S. Coron, A. Greuet, E. Prouff, and R. Zeitoun, “Faster evaluation of sboxes via
common shares,” in International Conference on Cryptographic Hardware and Em-
bedded Systems. Springer, 2016, pp. 498–514.
[20] J. Huang, H. Yan, and X. Lai, “Transposition of aes key schedule,” in International
Conference on Information Security and Cryptology. Springer, 2016, pp. 84–102.
