Hardware/Software Obfuscation against Timing Side-channel Attack on a
  GPU by Karimi, Elmira et al.
Hardware/Software Obfuscation against Timing
Side-channel Attack on a GPU
Elmira Karimi, Yunsi Fei, David Kaeli
ECE Department, Northeastern University Boston, MA 02115 USA
{elmirakarimi, yfei, kaeli}@ece.neu.edu
Abstract—GPUs are increasingly being used in security appli-
cations, especially for accelerating encryption/decryption. While
GPUs are an attractive platform in terms of performance, the
security of these devices raises a number of concerns. One
vulnerability is the data-dependent timing information, which
can be exploited by adversary to recover the encryption key.
Memory system features are frequently exploited since they
create detectable timing variations. In this paper, our attack
model is a coalescing attack, which leverages a critical GPU
microarchitectural feature - the coalescing unit. As multiple
concurrent GPU memory requests can refer to the same cache
block, the coalescing unit collapses them into a single memory
transaction. The access time of an encryption kernel is dependent
on the number of transactions. Correlation between a guessed
key value and the associated timing samples can be exploited to
recover the secret key.
In this paper, a series of hardware/software countermeasures
are proposed to obfuscate the memory timing side channel,
making the GPU more resilient without impacting performance.
Our hardware-based approach attempts to randomize the width
of the coalescing unit to lower the signal-to-noise ratio. We
present a hierarchical Miss Status Holding Register (MSHR)
design that can merge transactions across different warps. This
feature boosts performance, while, at the same time, secures
the execution. We also present a software-based approach to
permute the organization of critical data structures, significantly
changing the coalescing behavior and introducing a high degree
of randomness. Equipped with our new protections, the effort
to launch a successful attack is increased up to 1433X × 178X ,
while also improving encryption/decryption performance up to
7%.
I. INTRODUCTION
Graphic Processing Units (GPUs) have become the acceler-
ator of choice in a number of compute-intensive applications.
GPUs achieve high throughput by leveraging thread-level
parallelism of applications on thousands of simple cores. The
range of applications that can leverage GPU acceleration spans
deep learning models, computational physics, and security
functions such as data encryption and signing [1], [2], [3], [4],
[5].
CPUs are now offloading encryption workloads to GPUs,
given their efficiency. Encryption/decryption operations are
commonly present in security libraries and they can benefit
greatly from the high performance provided by a GPU [6],
[7], [8], [9], [10], [11]. Iwai et al. [7], Li et al. [12] and
Schonberger [13] each developed GPU implementations of the
Advanced Encryption Standard (AES) algorithm, demonstrating
double-digit speedups on a GPU compared to a CPU. However
research has shown that GPUs are vulnerable to attacks that
leverage the physical properties associated with applications
execution on the device, a class of vulnerabilities called
side-channel attacks [14], [15], [16], [17]. Although some
companies have developed secure encryption cores, these
designs are specifically implement a single algorithm (e.g.,
AES) [18]. Side-channel attack methods can be used to attack
different table-based algorithms. Therefore we need a more
comprehensive defense strategy. Timing side-channel attacks
have typically targeted the memory hierarchy of a GPU. If the
attacker can measure the execution time of an encryption,
and with knowledge of the encryption algorithm’s layout
in memory, the encryption keys can be recovered through
correlation analysis [15], [16], [17]. RCoal [19], has been
proposed as a countermeasure that can change the warp size and
thread allocation randomly. Unfortunately, this solution hurts
performance significantly due to reduced memory hierarchy
efficiency, while only slightly improving security.
In this paper, we introduce a series of hardware and software
countermeasures against GPU coalescing attacks. Given the
fact that attackers can use the relationship between execution
time and number of memory accesses to launch an attack,
our approach attempts to decorrelate the measured timing
signal and memory access pattern. In most GPU memory
systems, the coalescing unit can dynamically merge multiple
memory requests across different threads into a single memory
access, called a transaction. Given the organization of the
data structures used during encryption, and their associated
key-dependent memory access behavior, the memory access
pattern produces a timing channel [17].
To confuse the attacker, we first enhance the hardware
by randomizing the width of the coalescing unit to reduce
the Signal-to-Noise Ratio (SNR) of the timing sample. By
modifying the width of this unit during program execution, we
can introduce randomness in terms of the number of memory
transactions generated. In order to further reduce the SNR, we
utilize multiple coalescing unit widths across different cache
lines. By dynamically changing the width of coalescing unit
for different cache lines, we can introduce another dimension
to the noise injected. Since each Streaming Multiprocessor
(SM) in a NVIDIA GPU has its own coalescing unit, we
use a combination of different transaction sizes for each SM
to introduce additional noise to the timing channel. We also
redesign the Miss Status Holding Registers (MSHRs) to provide
miss tracking across SMs, which not only adds some additional
noise to the timing channel, but also improves performance.
1
ar
X
iv
:2
00
7.
16
17
5v
1 
 [c
s.C
R]
  3
1 J
ul 
20
20
In order to explore these changes in hardware, we utilize an
execution-driven GPU simulator.
For our software approach, we also focus on the memory
model of the GPU. During AES encryption/decryption we need
to access several look-up tables stored in GPU memory. We
introduce a software-based lookup table hashing algorithm to
obfuscate the associated access pattern to one of these tables.
This helps to hide the number of transactions observed by the
attacker through the measured timing sample.
The rest of this paper is organized as follows. Section 2
covers background on GPU microarchitecture and describes
a GPU-based coalescing attack. Section 3 provides an intro-
duction to our methodology. In Section 4, we characterize
the timing channel present on a GPU, as well as the design
of our hardware/software countermeasures based on changing
the coalescing unit width, modifying the MSHR design and
applying look-up table rotation. Then we evaluate how well
we can thwart a timing attack in Section 5. Finally, in Section
6, we conclude the paper.
II. BACKGROUND AND RELATED WORK
A. GPU Architecture
In contrast to CPUs, GPUs typically provide many cores.
NVIDIA refers to these cores as Streaming Processors (SMs),
while AMD calls them Compute Units (CUs). In this paper
we will use NVIDIA terminology and provide a brief review
of the Fermi GPU architecture. While Fermi is an older
GPU architecture, it is used in our simulation infrastructure
(GPGPU-Sim [20]), and most of the features discussed in this
paper are present in later generations of NVIDIA GPUs, as
well as AMD GPUs. So the attack strategies and defences
proposed are relevant for a broad class of architectures. In
this paper we use GPGPU-Sim based on the Fermi GTX480
GPU. architecture, which has 15 SMs. Each SM includes 32
CUDA cores, a load/store(LD/ST) unit, 4 Special Function
Units (SFUs), a 64KB block of high speed on-chip memory
(L1+Shared Memory) and an interface to the L2 dcache. The
L2 cache is a unified shared cache across all the SMs. Each
SM can run two warps, with each warp running 32 threads in
lockstep. Warps are scheduled by a warp scheduler. Detailed
configuration information is provided in Table I.
TABLE I
THE FERMI GTX480 ARCHITECTURE.
Number of SMs 15
Number of Threads per
Warp 32
L1 Data Cache Size 48KB
L2 Cache Size 768KB
Transaction Size 64B
Memory Bus Width 258B
MSHR Entries 32
Memory Coalescing: Threads within a warp run in a Single
Instruction Multiple Thread (SIMT) fashion. As a result, each
memory instruction in a warp will produce 32 memory requests,
which are sent to the LD/ST unit. Each LD/ST unit includes
a coalescing unit that inspects the addresses generated across
the warp and collapses them into a minimum set of memory
transactions. The size of each transaction is equal to one cache
line. Accesses are non-blocking, such that memory can handle
multiple requests concurrently. Wang et. al [21] proposed an
intra-cluster coalescing unit to merge memory requests from
different SMs in a cluster.
Miss Status Holding Register Functionality: Every SM
has a private L1 cache. On an L1 read miss, an entry in the
Miss Status Holding Registers (MSHRs) will be allocated for
tracking the miss. Each outstanding request to the L2 cache is
tracked by an MSHR entry. Once the L2 cache responds, the
MSHR entry will be released. Each MSHR entry can be shared
by multiple accesses in an SM in order to avoid redundant
entries, though redundant entries from across different warps
may cause a performance loss. Tuck et. al [22] described
sharing MSHR entries across L1 banks, improving bandwidth
management to L2.
B. Side-channel Attacks
Side-channel attacks exploit information recorded during
execution of encryption/decryption. There are many types
of attacks and countermeasures in different levels of the
design [23], [24], [25], [26], [27], [28]. The memory hierarchy
is one of the most common targets for the attacker. There are
two main types of memory attacks: access-driven and time-
driven attacks. In an access-driven attack, the attacker exploits
the principle of cache sharing to observe the victim’s memory
access behaviour. The adversary intentionally creates contention
on the shared cache resource to be able to infer if certain
addresses have been accessed by the victim. Alternatively, a
time-driven attack is based on measuring the execution time of
an encryption/decryption and establishing a correlation between
this timing value and an embedded secret.
Most previous studies of side-channel attacks, as well as
attack countermeasures, have targeted CPU devices. Bern-
stein [29] proposed a simple class of timing attack strategies
that leverage the layout of data structures in the memory
hierarchy of a CPU and used the correlation between the
data structure index and the measured time to recover the
encryption key. Tromer et al. [30] describe several access-
driven attacks and countermeasures against such timing. These
methods attempt to obscure the timing channel by introducing
time constant algorithms, masking the access pattern by
disabling shared memory or performing random permutations
and preloading. Both software and hardware approaches have
been used, but the main problem with most of these obfuscation
techniques is that they come with significant performance costs,
rendering them infeasible or costly to implement.
Given the growing popularity of GPU platforms, it is
important to thoroughly evaluate vulnerabilities of these high-
performance devices, as well as employ effective countermea-
sures against various attacks. The most important difference
between CPUs and GPUs is the source of the execution time
variation. In CPUs, the variations in timing are the result
of cache misses (or hits), providing a clean timing channel.
2
While on a GPU, instead of tracking cache hits and misses,
timing variations are a result of memory request management
by the GPU’s coalescing unit. While the degree of thread-
level concurrency is very high on a GPU, the noise associated
with this concurrency is not enough to obscure the timing
channel [15], [16], [17]. The attacker generally only observes
a single timing value associated with the execution of all the
threads in a kernel. Although, the correlation between the
execution time and the memory access pattern becomes weaker
as the degree of concurrency is increased, the GPU memory
reference pattern is mainly a function of the plaintext input and
encryption/decryption algorithm. An attacker can track data
access patterns using timing. Even if the programmer chooses
to store the encryption data structures in shared memory,
a partition of memory only accessible to each Streaming
Processor on the device, a successful attack is still possible.
The good news is that the same memory features that provide
attackers with an attractive attack surface, can also be used as
a form of protection.
C. Baseline Attack
Memory coalescing is a common feature that has been
used by GPU designers to improve memory throughput. This
mechanism also provides a strong timing channel, since the
effectiveness of the coalescing unit is a function of the
memory address pattern generated across threads. An attacker
can use this timing information to quantify the number of
transactions, and eventually derive the encryption key by
computing the correlation between the measured time and
number of transactions. This attack scenario was first identified
by Jiang et al. [16], where the attacker remotely sends a large
number of random plaintext inputs to the victim. The victim
uses a GPU to encrypt the data and the attacker finally collects
the resulting ciphertext and the execution time. The attack
is repeated many times (typically 100K to 1M samples are
collected).
Cj = T4[ti]⊕ kj : 0 < j < 15, 0 < i < 15 (1)
This attack method is adopted as our baseline methodology.
Based on this scheme, we will introduce randomness during
encryption to obfuscate the timing channel. Jiang et al.
described a T-table implementation of AES encryption (ECB
mode). We implement the same scheme in CUDA, following
the OpenSSL 0.9.7 library implementation. The CUDA kernel
takes 16 bytes of plaintext and 16 key bytes, and outputs
ciphertext. The key bytes are expanded to 160 bytes, because
in total AES has 10 rounds. First, the 16-byte plaintext is
XORed with the first round key. Then during the next 9 rounds,
a combination of the four T-table lookups (T0, T1, T2 and T3)
are XORed with the key. For the last round, another T-table,
T4, are performed. As there is no MixColumn operation, the
ciphertext can be derived using Equation 1.
In our CUDA implementation, the entire encryption is
implemented in a kernel. Each thread in the kernel takes 16
bytes of plaintext, performs an independent encryption, and
Possible Values
0 50 100 150 200 250Ab
so
lu
te
 V
al
ue
 o
f C
or
re
la
tio
n #10-3
0
2
4
6
8
Fig. 1. The absolute value of the correlation for possible values of k5.
produces 16 bytes of ciphertext. A warp contains 32 threads,
so 32 encryptions are performed together. Each T-Table has
256 4-byte elements (1KB in total). The T-table is stored in
global memory.
In the attack scenario, each key byte is guessed (from
the values 0 to 255). By knowing the ciphertext of each
sample and using Equation 2, the attacker can produce all
indices (ti). Equipped with a coalescing unit, the 32 concurrent
memory requests result in a smaller (and computable) number
of memory transactions. The attacker only needs to know the
width of the coalescing unit. For example, if the transaction
size is 32 bytes, each memory transaction would contain 8
consecutive addresses. To recover the transaction index, the
attacker simply shifts the original index by log28 (i.e., 3).
After determining the number of transactions based on the
cipherthext, the attacker can calculate the correlation between
the measured execution time and the predicted number of
transactions under a key byte guess.
For a key byte, there are 256 possible values. The attacker can
calculate the set of 256 correlation values using the execution
time and the number of transactions for each key guess. The
value with the highest correlation between the execution time
and the number of transactions will be associated with the
correct key guess. The attacker will repeat this process for the
other 15 bytes of the key and find the highest correlation for
each. Figure 1 shows the correlation of all possible values of
k5, using 500,000 timing samples.
ti = T
−1
4 [Cj ⊕ kj ] : 0 < j < 15, 0 < i < 15 (2)
III. OUR METHODOLOGY
Typically, the degree of difficulty associated with an attack
can be measured by the number of timing samples needed
to recover the key. An ideal countermeasure increases the
number of samples required for the attack, without hurting
execution performance. Kadam et al. [19] changed the number
of threads that are grouped together in a warp randomly. While
introducing randomness to help protect a device, GPUs are
3
designed to execute regular (e.g., dense matrix) operations,
where the execution time is well behaved. So reducing the
warp size in order to introduce noise in the timing channel
should greatly impact performance.
We introduce a series of hardware/software countermeasure
based on commonly design principles used in most GPU
memory systems. Given the fact that the attacker can use the
relationship between time and the associated memory access
pattern, our approach attempts to reduce the correlation between
number of memory accesses and the measured timing signal.
A. Obfuscation in Hardware
1) Randomizing The Width of the Coalescing Unit: The
transaction size of coalescing unit is equal to the size of an L1
cache line. The address range of the coalescing unit is always
fixed. Across the threads in a warp, each address pattern will
result in a predictable number of cache line accesses. In order
to confuse any attacker who is tracking memory behavior, we
propose an obfuscation approach that dynamically modifies the
coalescing unit. By changing the range of addresses that result
in a transaction, we render the execution time associated with
a memory address pattern as non-deterministic. For example,
if the cache line size is l bytes, the basic coalescing unit maps
addresses from the range 0 to l − 1 into the first cache line,
but with our coalescing unit, the width of coalescing unit is
randomly selected to be a fraction of the total cache line size.
This will introduce a source of noise, and by randomly changing
the number of transactions, this will significantly increase the
number of executions required to launch a successful coalescing
attack.
We start by randomizing the width of the coalescing unit
based on a normal probability density function. In this step, the
width of the coalescing unit is randomly chosen, but the value
is fixed for each individual cache line. To inject more noise,
we can use different coalescing widths for different cache lines.
One simple way to do this is to generate a random width for
every cache line. Given that an L1 cache has a large number of
cache lines, generating a random value for the width to be used
for every line would be costly. In order to reduce the overhead,
we generate a limited number of random widths, for example
16, and use this set of widths for every 16 cache lines, and
then repeat the pattern through the whole cache. This approach
introduces additional noise into the timing channel, and makes
it significantly harder for an attacker to recover the key.
2) A Hierarchical MSHRs for Increased Obfuscation
Strength: MSHRs help to manage memory bandwidth when
encountering L1 misses. GPUs provide a private set of MSHRs
for each SM. Kernels tend to fully utilize all SMs in order to
exploit the highest degree of parallelism available on a GPU.
As a result, there can be a varying number number of misses
experiences across different SMs to the unified L2, which
results in a performance loss due to L2’s limited bandwidth
and the number of MSHRs per SM available. To provide further
obfuscation, while also considering performance, we present
a hierarchical MSHRs design, where at the first level, each
SM has its own set of MSHRs. Then at the second level, the
MSHR entries are shared across all SMs. This can benefit
both the security and performance of the system. In terms of
security, a hierarchical MSHRs will reduce the number of L1
misses from different SMs and hide a portion of the latency,
which introduces additional noise into the timing channel. Our
proposed design should also improve memory performance as
compared to current GPU-based MSHRs.
B. Obfuscation in Software
Changing the T-table mapping in the cache is a commonly
proposed countermeasure which attempts to randomly permute
the address mapping using either software or hardware. Wang
et al. [31] proposed novel cache design named RPcache, which
can only defeat access-driven attacks. Brickell et al. [32]
considered obfuscation that employed a software solution,
though a fixed permutation only obfuscates the timing channel
×
× ×
× ×
× ×
× × × ×
×
× ×
×
× ×
× ×
× ×
× × × ×
×
× ×
×
× ×
× ×
× ×
× × × ×
×
× ×
Cache line0
Cache line1
Cache line2
Cache line3
Cache line0
Cache line1
Cache line2
Cache line3
(a) (b)
(b)
(d)(c)
(a)
Fig. 2. Hardware and Software Obfuscation using: a) a fixed coalescing unit
width, b) a fixed coalescing unit width with rotated accesses, c) a dynamic
coalescing unit width (for row 1 to 4 the widths are 2, 2, 1 and 4, respectively),
and (d) a dynamic coalescing unit width combined with rotated accesses.
temporarily until the attacker figures out the permutation
algorithm. In this paper we propose a GPU-oriented random
rotating algorithm that works by rotating T-table addresses,
changing the permutation over time. The frequency of applying
rotation can be controlled to minimize the overhead, while
providing the right level of obfuscation. Dynamic rotation also
can exploit the parallelism of a GPU to prevent performance
loss.
Our approach effectively disassociates elements stored in a
cache line and re-associates them with new elements in other
cache lines. As an example, suppose that we have n elements
in the T-table that will be stored in l continuous cache lines
and each cache line contains m elements. Figure 2(a) shows
a simple case in which the look-up table has 16 elements and
each cache line can store 4 elements. As we can see, this is
a two-dimensional structure, where the rows represents cache
lines. In this example, we have 7 random accesses, where the
number of accessed cache lines (i.e., rows) is 4. The attacker can
use this number to launch an attack by finding the correlation
between the execution time and the number of cache accesses.
One approach to obfuscate the number of rows accessed is
to change the mapping of each entry, column-by-column, as
shown in Figure 2(b), where columns 1, 2, 3 and 4 are shifted
by 2, 3, 0 and 1, respectively This rotation changes the access
pattern as shown, resulting in changes to the row accessed
from 4 to 3.
Figure 2 also shows the impact of obfuscation by changing
the coalescing unit width, combined with rotation in the T-table.
4
In Figure 2(a), we show results using a fixed coalescing unit
width (= 2) with different colors. In Figure 2(b), we combine
a fixed coalescing unit width with the T-table rotation (as
described before). Figure 2(c) shows results for dynamically
changing the width of the coalescing unit for different rows.
Figure 2(d) shows results when dynamically changing the
coalescing unit width, combined with T-table rotation. The
number of rows accessed is 6, 5, 5 and 4 for cases a-d,
respectively.
IV. HARDWARE/SOFTWARE COUNTERMEASURES
In this section, first we present a series of hardware
countermeasures based on a redesign of the GPU’s coalescing
unit. Then we provide details of our hierarchical MSHR
design. Our solution improves performance and provides further
protection. Then we discuss our software approach, which
randomly rotates columns of the T-table in the GPU’s global
memory.
A. Randomizing the Width of the Coalescing Unit
As discussed earlier, we use GPGPUsim to evaluate a timing
attack of an AES running on a Fermi GPU. In this simulator,
we set the memory transaction size to 64 bytes, which is equal
to L1 cache line size. To study the effects of changing the
width of the coalescing unit on the execution time of a kernel,
we first develop a simple kernel to learn the behavior of the
coalescing unit and then design the countermeasure based on
our observation.
1) benchmarking Observations: We developed a microbench-
mark kernel to study the impact of changing the width of the
coalescing unit. The kernel loads an array of floats from global
memory, where each warp generates 32 loads. We set the array
indices to access n unique addresses. We modify this pattern,
varying the value of n from 1 to 32 over each kernel, for 32
threads. We vary the width of the coalescing unit, setting this
value to 8, 16, 32 and 64 bytes. Our design space includes
32 × 4 configurations. Each microbenchmark kernel is run
10,000 times per configuration. We collect the corresponding
timing samples for each configuration, 128×104 timing samples
in total. Figure 3 presents the measured execution time, versus
the number of unique addresses, for coalescing unit widths of
32 and 64 bytes. We added a random noise to the simulator’s
timing to simulate the real hardware. As the results show,
the execution time is linearly proportional to the number of
unique addresses, and changing the width of coalescing unit
changes the slope of this line. Using regression analysis, we
can construct a linear equation that expresses the relationship
between the time samples and the number of unique addresses.
The coefficients of this equation are found using a least-squares
approximation method, which minimizes the distance between
the actual timings and the expected ones. In equation 3, te, n
and  are the execution time, the number of unique accesses
and the error due to noise, respectively. We use regression
analysis to find β1, which represents the slope of the line and
β0 that is a constant.
te = β1n+ β0 +  (3)
The attacker can use the fact that the time measured is
linearly proportional to the number of unique accesses, and
can be inferred from the execution time. But when varying
the width of the coalescing unit, we observe a less tractable
relationship between time and the memory addresses. This
confirms that the design of the coalescing unit can impact the
leakiness of the GPU. One metric that can be used to measure
the leakiness of an architecture is SNR. A higher SNR is
desirable for the attacker. Therefore, to protect a GPU from
a side-channel attack, we should try to reduce the SNR. The
SNR for the linear model produced by our regression analysis
(Equation 3) is calculated by Equation 4, where σ2s and σ
2

represent the variance of the signal and the noise.
SNR =
σ2s
σ2
=
β21σ
2
n
σ2
(4)
If the access pattern and noise remain constant, increasing
the slope of the line results in a higher SNR. In other words,
β1 captures how likely the attacker can distinguish addresses
i and i + 1 when she records the corresponding execution
times (te(i) and te(i + 1)). The first four rows of Table II
show the characteristics of the linear model while varying the
width of the coalescing units. As we can see, by increasing
the width of the coalescing units, β1 is reduced, which results
in a lower SNR based on Equation 4. To calculate the SNR
using Equation 4, σ2 is calculated in Equation 5.
σ2 =
∑m
i=1(te − (β1n+ β0))2
m
(5)
TABLE II
REGRESSION ANALYSIS CHARACTERIZATION.
width of coalescing unit β1 β0 SNR
8 19.463 346.2 1.8656
16 11.42 344.3 0.9429
32 6.032 343.6 0.4494
64 3.601 336.9 0.1940
Fixed Random Width 4.426 337.3 0.0985
Dynamic Random Width 4.087 336.6 0.0050
Fig. 3. Execution time of accessing 1 to 32 unique address with different
width of coalescing unit
The results of our micro-benchmarking exercise illustrate
how changes in the width of coalescing unit affect the SNR.
5
Using the maximum width of the coalescing unit, the SNR
is still high due to the lack of noise. One parameter to
represent the noise is the standard deviation (STD) of the
measured execution time. A lower STD means that the timing
samples are less noisy. In our observations, when running our
microbenchmark kernel 10,000 times, the STD is very low,
meaning that the execution time is still highly correlated with
the number of unique accesses. So the attacker can filter out
noise by running a reasonable number of samples. To introduce
more noise into the system, we randomly change the width of
the coalescing unit to increase σ2 , and eventually, reduce the
SNR. This randomness can be fixed and dynamic, which will
be described next.
2) Using a Fixed Random Width for the Coalescing Unit:
Introducing fixed randomness means that during each kernel
run, the width of the coalescing unit is randomly chosen, but
the width is fixed for all cache lines. Here, we have used 4
different coalescing unit width values as our reference set. The
width of coalescing unit is 2k bytes, where k is varied from
3 to 6. The maximum width of the coalescing unit (64 bytes)
is equal to the width of an L1 cache line. The total number
of timing traces collected is still 10,000 and the probability
density function (PDF) representing the coalescing unit width is
a skewed normal distribution, with a mean k = 5, as shown in
Figure 4. By using this distribution, the probability of choosing
a larger transaction size is much higher (e.g., for only 500 out
of 10,000 runs, we see 8-byte transactions). If we run 10,000
samples while selecting a random width for the coalescing
unit, the SNR is reduced (as shown in Table II) due to the
increased noise.
k
0 2 4 6
Pr
ob
ab
ilit
y 
De
ns
ity
0
0.1
0.2
0.3
Fig. 4. Probability density func-
tion, while changing the width
of the coalescing unit.
3) Using a Dynamic Random
Width for the Coalescing Unit::
As we observed, randomizing the
width of coalescing unit increases
the amount of noise and reduces
the SNR. In regression analysis,
the variance of the noise is equiv-
alent to the sum of the squared dis-
tances between the predicted and
actual values. Randomly changing
the width of the coalescing unit
increases this distance, resulting
in more noise.
Random 
Generator
Coalescing Unit
Detecting 
Cache line 
Size and 
Offset
.
.
.3
2
 A
d
d
re
ss
es
C
ac
h
e 
lin
es
 N
u
m
b
er
an
d
 O
ff
se
ts
r[16]
.
.
.
Request to L1
New Transaction
Detecting New 
Transaction
Fig. 5. Redesign Coalescing Unit for Dynamic Random Width
But we still utilize a fixed random value selected from
a reference set. Another level of randomness would be to
use different widths for different cache lines. But generally
speaking, supporting a unique random width for every cache
line has a huge overhead. Therefore, we choose a limited
number of cache lines and produce n different width values and
repeat the random width pattern for every n cache lines. Here,
we choose 16 cache lines due to the T-table size, which occupies
16 cache lines. A high-level design for this functionality is
shown in Figure 5.
In this design, a pseudo-random number generator generates
16 values during each kernel run, r[16], from the set 1, 2, 4, 8,
recognizing that each cache line width should be split into r[i]
parts that we call subtransactions. Once a request is issued
by a warp, the cache lines and offsets within the cache lines
are calculated for all 32 requests. At this point, a traditional
coalescing unit would check each request to see if there is
a memory request being processed for this cache line being
issued by another thread. If a match is found, the current
request will be coalesced and checking will move to the next
request.
In our proposed design, when the coalescing unit receives a
request and finds an access referencing the same cache block
in the ordered memory request queue, it will first check the
number of subtransactions within the cache line based on the
r array entry, indexed as follows: r[cachelinenumber%16].
Then the width of each subtransaction will be calculated as:
64
r[cachelinenumber%16] . We refer to this width as the subtransac-
tion width. The coalescing unit will resolve the subtransaction
number of the current request (0 to r[cachelinenumber%16]−
1) based on the request’s offset, and if there is a queued
request with the same cache line and subtransaction number
for the same cache line, the current request can be coalesced.
Otherwise, a new transaction will be sent to L1. While this new
transaction is possibly unnecessary given that there is already
an outstanding transaction for the same cache line underway,
the added time will help obfuscate this timing channel. We
provide the SNR values of this case in Table II. Using this
method makes the design more resilient to the attack due to
randomly chaning the pattern in every single kernel run.
B. Hierarchical MSHR Design
As we discussed earlier, each SM in our architecture has a
set of 32 MSHRs for tracking L1 misses. One issue with this
microarchitecture is that threads across different SMs could
request the same data from L2, generating redundant requests
and producing a bottleneck at L2. This issue can significantly
impact performance when many SMs are making requests to
L2. In order to avoid this issue, we introduce a second level
of MSHRs. This set of MSHRs is unified and shared across
all SMs, with each MSHR having 32 entries. In this design,
each L1 miss first allocates an entry in the MSHR local to
the SM issuing the request to L2. These requests arrive at the
second-level MSHRs, and if there are no outstanding requests
issued by other SMs for the same L2 cache line in the unified
MSHRs, the request will be sent to the L2 cache. This design
will increase performance whenever the competition for L2
access by multiple SM is high. Although a unified MSHR
6
set is a shared hardware structure across SMs, it cannot be
an attack surface. First of all, in most GPU applications, the
GPU resources are fully utilized such that the attacker would
not be able to run their application concurrently with victim’s.
Second, a kernel would be run on several SMs and all of these
SMs would send L2 requests via the MSHRs. Therefore, even
though the attacker can get an SM to observe the requests of
a unified MSHR, it is very challenging to track the requests
coming from different SMs that are executing concurrently.
While our MSHR design has the potential of improving
performance, the second-level unified MSHRs will introduce
another source of obfuscation. A hierarchical MSHR set can
merge and collapse different incoming requests within a range
of 128 bytes(L2 cache line size). This architecture will lower
the SNR due to hiding the latency of L1 misses. In Equation 3
we have modeled the time as a linear function of the number
of transactions. Suppose that we have β10 L1 misses resulting
from β1 memory accesses. We can rewrite Equation 3 as
Equation 6, where h is the hit time and m is the miss time for
the L1 cache and m = m0h .
te = (β10m0 + (β1 − β10))hn+ β0 +  (6)
When we use a hierarchical MSHR set, there is some
possibility that we will reduce the miss penalty for some
L2 requests. This nondeterministic behavior will add noise to
m0h by reducing the miss time to mnoiseh, which is in the
range of the time required to service the outstanding miss.
If the requests can be merged with probability pm0 , then
Equation 6 can be rewritten as Equation 7. In Equation 8, 0
is the noise generated using our hierarchical MSHR set.
te = (β10m0+(β1−β10))hn+β0+−pm0(m0−mnoise)β10nh
(7)
te = (β10m0 + (β1 − β10))hn+ β0 + + 0 (8)
With this architecture the SNR will be lowered due to adding
0 as:
SNR =
β21σ
2
n
σ2 + σ
2
0
(9)
C. GPU-Oriented Random Rotation
During a GPU timing attack, the only information available
to the attacker is the execution time and kernel inputs. The
attacker attempts to correlate the execution time with predicted
number of transactions under each key guess in view of
the coalecsing width. Another way to decorrelate these two
parameters is to change the access pattern of the threads
to memory, which results in a nondeterministic number of
transactions. In order to change the access pattern, we should
study the GPU’s memory model. Based on how we map data
in to the GPU memory, we can introduce nondeterminism to
the access pattern.
In an AES encryption/decryption baseline attack, 5 T-table
are copied from the CPU memory to the GPU memory. As
0 1 2 … m-1
m m+1 m+2 … 2m-1
… … … … …
(l-1)m (l-1)m+1 (l-1)m+2 … lm-1
Cache line size=m
N
u
m
b
er
 o
f 
ca
ch
e 
lin
es
=l
j
i
j+j0
i+i0
(i,j)
(i+i0,j+j0)(i+i0,j)
Fig. 6. General structure of a lookup table with size n = ml in cache, where
the cache line can hold m elements.
the attack targets the last round of the encryption by tracking
accesses to T4, we will focus on mapping of this table to
GPU memory. While we focus on accesses to T4 due to their
observability, we can apply rotation to any look-up table. A
look-up table is a one-dimensional array of size n. If we assume
the whole array can fit in the L1 cache, and each cache line can
store m, then the table would occupy l lines, where l =
[
n
m
]
+1.
Figure 6 shows the general structure of this look-up table in
cache. Whenever threads within a warp make a request to
memory, the coalescing unit combines the requests within a
cache line (without randomizing the width of coalescing unit)
and will access the whole cache line. Therefore, the number of
accessed cache lines depends on the row-wise location of the
request. To introduce randomness to the number of transactions
in this structure, we can dynamically change the mapping of
an address to the cache.
Algorithm 1 Random Rotation
1: m← SizeofCacheline()
2: n← SizeofT table
3: l← n/m+ 1
4: for i = 0 : m− 1 do
5: r ← UniqueRand()
6: for j = 0 : l − 1 do
7: T4[((j + r)%l)m+ i]← T4[jm+ i].
An example of mapping the look-up table is when we want
to map element (i, j) to (i+i0, j+j0), which is in the range of
[m, l]. But as discussed, the coalescing unit accesses the whole
row and produces a deterministic memory behavior. But if we
use a column-wise random rotation, we can introduce noise
into the timing channel. Algorithm 1 provides pseudocode
for rotation algorithm. Using this scheme, each column will
circulate based on a unique randomly chosen number between
[0, l − 1], providing the highest level of diffusion across the
cache as possible.
The rotation can be performed on either the CPU or the GPU
side. The main drawback of performing this transformation
on the CPU side is that the rotation would be performed only
one time, since it is too expense to repeat the memory transfer
from the CPU to the GPU. When performed on the GPU side,
this not only eliminates data transfer overhead, but also our
7
rotation can be done in parallel on the many cores of the GPU.
As a result, the performance overhead is much smaller, thus
we use a GPU implementation, and can frequenty run a table
rotation. The question of how often rotation is necessary is
discussed next.
The last round of AES is the target for our attack. This
round involves a number of look-up table accesses to table T4.
The size of the table is 1K entries, occupying 16 64B cache
lines. Suppose that we rotate columns in the T-table every
100,000 kernel executions. So every 100,000 iterations, the
algorithm 1 is executed. Each thread is responsible for rotating
a column of 4B elements, 16 threads in total. Then the AES
encryption will be started. Then, the probability of having the
same number of transactions is significantly decreased, which
reduces the correlation between execution time and number of
transactions significantly.
V. COUNTERMEASURE ANALYSIS
One approach to defend against a timing attack would result
in an increase in the number of timing samples required to
launch a successful attack. As we discussed earlier, the attacker
uses the correlation between the execution time and number
of transactions to find the key, so-called distinguisher. The
attacker must be able to do this using a reasonable number of
samples. Our series of countermeasures increase the number
of required samples while keeping and in some cases even
improving performance. Here, we first tried to find the number
of samples for a successful attack using hardware and software
countermeasures separately. Then we used all countermeasures
together and tried to attack. As the number of samples increase
significantly in this case, a detectable correlation was not
observed in a reasonable amount of time. So we calculated the
number of samples for this case using statistical analyses.
The probability of launching an attack can be calculated as
Equation 10 [33], where S is the number of samples and ρpeak
is the highest detectable correlation value. We have modified
this equation by replacing ρ = 0 with ρ = ρave, because when
using a smaller number of samples, this value is not negligible.
α = Φ
 12 ln 1+ρpeak1−ρpeak − 12 ln 1+ρave1−ρave√
2
S−3
 (10)
Our countermeasures reduce ρpeak, then the attacker would
need to collect more timing samples in order to detect ρpeak
in the correlation values. As we discussed earlier, our possible
hardware countermeasure options include randomizing the
width of coalescing unit, utilizing our hierarchical MHSR
design, both of which can reduce the SNR. Equation 12 shows
the contribution of noise when computing the correlation value
between the execution and the number of memory transactions.
ρpeak(te,n)=
µten−µteµn
σteσn
=
µ(β1n+β0)n+n
−µβ1n+β0µn−µµn
σ(β1n+β0)
σn+σσn
(11)
ρpeak(te, n) =
ρpeak(β1n+ β0)√
1 + 1SNR
(12)
The highest possible value for ρpeak is 1. In this case, the
variables would be strongly correlated and the measurements
would need to be free of any noise. When launching a timing
attack on an unprotected GPU, the highest ρpeak that we can
obtain is 0.4031, which is limited only by the inherent noise
in the timing channel. The highest value for ρpeak, when
employing a single (but randomly selected) coalescing unit
width for each kernel run with average value of 32, is 0.2994,
as shown in Figure 4. If we change the coalescing unit width
dynamically during runtime, the max ρpeak value is reduced
to 0.0547, meaning that using different sizes for different
cache lines introduces even more noise. Finally employing our
hierarchical MSHR design, the SNR is reduced to 0.0447. But
in this case, in order to fully utilize all SMs, we increased the
number of threads to 32threadsperwarp× 2warpsperSM ×
15SMs.
For this series of countermeasures, it is important to choose
the right average width based on the normal distribution of
memory access requests in order to achieve a good balance
between performance and security. Figure 7 shows the number
of timing samples and performance of different average widths.
Performance is defined as 1executionTime . As we can see,
using a dynamic coalescing width with hierarchical MSHRs
improves the resiliency of the design to attacks, while also
boosting performance. With this design, the effort required
by the attacker (measured in the number of timing samples
collected) to launch a successful attack increases by 1433X. On
the other side, we can enjoy a 7% performance improvement.
Considering the fact that deceasing the average width makes the
design harder to attack, while losing performance, the designer
can choose different widths based on the performance/security
tradeoff. For a design to introduce negligible performance
overhead, we should select an average width no narrower than
16B. Adopting a value of 16B will increase the difficulty of
launching a successful attack by 175X, while a more aggressive
design that increases the attacker effort by 1433X would suffer
a 3% loss in performance.
While hardware countermeasures help to obscure the rela-
tionship between the kernel execution time and the number of
transactions by lowering SNR, employing our software scheme
of rotating T-table columns will also decrease the correlation
since this also changes the number of transactions. Equation 14
computes the correlation after applying T-table rotation in the
original hardware without applying hardware countermeasures.
In this equation, t is the measured time, n is the number of
transactions to original table, o in the number of transactions
after rotation and po=n is the probability of o = n. In this
equation, we assume that the time is not related to o whenever
the number of transactions is not equal to n.
ρpeak(t, o) =
po=n(µtn − µtµn) + (1− po=n)(µtµo − µtµo)
σtσo
(13)
ρpeak(t, o) = ρpeak(t, n)po=n
σn
σo
(14)
The values for po=n and σo depend on the frequency of
rotations. Using every 1000 encryptions as the frequency for
8
0200
400
600
800
1000
1200
1400
1600
0 10 20 30 40 50 60 70
N
o
rm
al
iz
ed
 N
u
m
b
e
r 
o
f 
Sa
m
p
le
s
Average Coalescing Unit Width
Security
0.88
0.9
0.92
0.94
0.96
0.98
1
1.02
1.04
1.06
1.08
0 10 20 30 40 50 60 70
N
o
rm
al
iz
ed
 P
er
fo
rm
an
ce
Average Coalescing Unit Width
Performance
FW
DW
DW+H-MSHR
FW
DW
DW+H-MSHR
Fig. 7. Security and performance for hardware countermeasures.FX:Fixed
Width, DW:Dynamic Width, H-MSHRs: Hierarchical MSHRs
rotation, ρpeak(t, o) can be decreased to 0.0858× ρpeak(t, n),
without applying any hardware countermeasures. Figure 8
shows the resulting security and performance values based
on the rotation frequency. This figure shows that rotating T-
table columns every 1M to 1000 samples, we do not see
any detectable performance loss, but increases the number of
samples by 68X. Decreasing the time between rotations to less
than 1000 encryptions would result in a slight performance
loss. Using the maximum frequency, the T-table will be rotated
after every run of the encryption algorithm, resulting in a 2%
performance loss, while increasing the effort for a successful
timing attack by 178X.
Figure 9 shows the correlation values when attacking key
byte 5, As we can see, the number of samples required for
the original attack with α = 0.9 is around 5× 105. When we
apply the hardware (average width=32B )and software (rotation
frequency=1000) countermeasures individually, the number of
traces needed grows to 4 × 107 and 3.5 × 107, respectively,
increasing the effort by two orders of magnitude.
Our hardware and software methods reduce the correlation
to 0.1109× ppeak and 0.0858× ppeak, respectively. When we
apply both methods, the correlation will be at least 0.1109×
0.0858× ppeak. By reducing the width of the coalescing unit,
the number of cache lines accessed will increase, which will
decrease po=n. For example, with a 64B L1 cache line, 32
0
20
40
60
80
100
120
140
160
180
200
1 10 100 1000 10000 100000 1000000
N
o
rm
al
iz
ed
 N
u
m
b
e
r 
o
f 
Sa
m
p
le
s
Rotation Frequency
Security
0.975
0.98
0.985
0.99
0.995
1
1.005
1 10 100 1000 10000 100000 1000000
N
o
rm
al
iz
ed
 P
er
fo
rm
an
ce
Rotation Frequency
Performance
Fig. 8. Security and performance for software countermeasures.
Sample Size#105
0 5 10
Co
rre
la
tio
n
0
0.005
0.01
0.015
0.02
(a)
Sample size #107
0 2 4 6
Co
rre
la
tio
n
#10-3
0
1
2
3
4
5
(b)
Sample size #107
0 2 4 6
Co
rre
la
tio
n
#10-3
0
1
2
3
4
5
(c)
Fig. 9. ρpeak of k5 vs. the number of samples a)the baseline attack, b)the
attack only with hardware countermeasures, c)the attack only with software
countermeasures
requests will access 2 to 16 cache lines. To change each number
from this range, we have 14 other possible values. But having
32B cache lines results in accessing 4 to 64 cache lines. So the
possibility of hiding the original number of cache lines is higher
due to using a wider range. Therefore, hardware obfuscation
will increase the level of obfuscation performed in software. If
the number of required samples with hardware and software
countermeasures individually are ghX and gsX , respectively
(as compared to the traces required for the basic attack), the
number of required traces needed to launch a successful attack
using a combination of both countermeasures will be increased
by at least gh ∗ gs times.
VI. CONCLUSION
GPUs are vulnerable to timing side-channel attacks. The
memory system of these devices can leak timing information.
9
If the attacker can time the encyrption, and with general
knowledge of an encryption algorithm’s memory access pattern,
she can recover all bytes of the key.
In this paper we explore multiple hardware-based and
software-based strategies to obfuscate the timing leakage of the
memory system on a GPU. By randomizing the width of the
coalescing unit dynamically, we are able to introduce significant
noise into the timing channel. By introducing a second-level
unified MSHRs across all SMs, we can further obfuscate the
timing channel, while improving encyption performance. We
also propose software-based methods the obfuscate T-table
accesses and the mapping of entries to cache lines. Applying
all of these countermeasures together to thwart an attack, an
attacker will still be able to recover the encryption key, but the
number of traces required for a successful attack has increased
by 1400*178X over the baseline, while slightly improving AES
performance.
REFERENCES
[1] M. Aaftab, “The opencl specification,” in Hot Chips 21 Symposium
(HCS), 2009 IEEE, pp. 1–314, IEEE, 2009.
[2] D. R. Kaeli, P. Mistry, D. Schaa, and D. P. Zhang, Heterogeneous
Computing with OpenCL 2.0. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc., 1st ed., 2015.
[3] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel
programming with cuda,” in ACM SIGGRAPH 2008 classes, p. 16, ACM,
2008.
[4] N. Corporation, “Gpu-accelerated applications,” 2018.
[5] N. Corporation, “Nvidia cuda toolkit v10.0 documentation,” 2018.
[6] S. A. Manavski et al., “Cuda compatible gpu as an efficient hardware
accelerator for aes cryptography,” Signal Processing and Communications,
vol. 2007, 2007.
[7] K. Iwai, T. Kurokawa, and N. Nisikawa, “Aes encryption implementation
on cuda gpu and its analysis,” in 2010 First International Conference
on Networking and Computing, pp. 209–214, IEEE, 2010.
[8] A. D. Biagio, A. Barenghi, G. Agosta, and G. Pelosi, “Design of a parallel
aes for graphics hardware using the cuda framework,” in Proceedings
of the 2009 IEEE International Symposium on Parallel&Distributed
Processing, pp. 1–8, IEEE Computer Society, 2009.
[9] D. Le, J. Chang, X. Gou, A. Zhang, and C. Lu, “Parallel aes algorithm for
fast data encryption on gpu,” in Computer Engineering and Technology
(ICCET), 2010 2nd International Conference on, vol. 6, pp. V6–1, IEEE,
2010.
[10] A. E. Cohen and K. K. Parhi, “Gpu accelerated elliptic curve cryptography
in gf (2 m),” in 2010 53rd IEEE International Midwest Symposium on
Circuits and Systems, pp. 57–60, IEEE, 2010.
[11] R. Szerwinski and T. Gu¨neysu, “Exploiting the power of gpus for
asymmetric cryptography,” in International Workshop on Cryptographic
hardware and embedded systems, pp. 79–99, Springer, 2008.
[12] Q. Li, C. Zhong, K. Zhao, X. Mei, and X. Chu, “Implementation and
Analysis of AES Encryption on GPU,” in IEEE 14th International
Conference on High Performance Computing and Communication,
pp. 843–848, June 2012.
[13] G. Scho¨nberger and J. Fuß, “GPU-assisted AES Encryption Using GCM,”
in Proceedings of the 12th IFIP TC 6/TC 11 International Conference on
Communications and Multimedia Security, pp. 178–185, Springer-Verlag,
2011.
[14] M. Patterson, “Vulnerability analysis of gpu computing,” 2013.
[15] E. Karimi, Z. H. Jiang, Y. Fei, and D. Kaeli, “A timing side-shannel
atack on a mobile gpu.,” in ICCD, vol. 2018, 2018.
[16] Z. H. Jiang, Y. Fei, and D. Kaeli, “A complete key recovery timing
attack on a gpu,” in High Performance Computer Architecture (HPCA),
2016 IEEE International Symposium on, pp. 394–405, IEEE, 2016.
[17] Z. H. Jiang, Y. Fei, and D. Kaeli, “A novel side-channel timing attack
on gpus,” in Proceedings of the on Great Lakes Symposium on VLSI
2017, pp. 167–172, ACM, 2017.
[18] K. Akdemir, M. Dixon, W. Feghali, P. Fay, V. Gopal, J. Guilford,
E. Ozturk, G. Wolrich, and R. Zohar, “Breakthrough aes performance
with intel aes new instructions,” White paper, June, p. 11, 2010.
[19] G. Kadam, D. Zhang, and A. Jog, “Rcoal: mitigating gpu timing attack via
subwarp-based randomized coalescing techniques,” in High Performance
Computer Architecture (HPCA), 2018 IEEE International Symposium
on, pp. 156–167, IEEE, 2018.
[20] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt,
“Analyzing cuda workloads using a detailed gpu simulator,” in IEEE
International Symposium on Performance Analysis of Systems and
Software, pp. 163–174, April 2009.
[21] L. Wang, X. Zhao, D. Kaeli, Z. Wang, and L. Eeckhout, “Intra-cluster
coalescing to reduce gpu noc pressure,” in 2018 IEEE International
Parallel and Distributed Processing Symposium (IPDPS), pp. 990–999,
IEEE, 2018.
[22] J. Tuck, L. Ceze, and J. Torrellas, “Scalable cache miss handling for
high memory-level parallelism,” in Proceedings of the 39th Annual
IEEE/ACM International Symposium on Microarchitecture, pp. 409–422,
IEEE Computer Society, 2006.
[23] S. Keshavarz and D. Holcomb, “Privacy leakages in approximate adders,”
in 2017 IEEE International Symposium on Circuits and Systems (ISCAS),
pp. 1–4, IEEE, 2017.
[24] S. Keshavarz, F. Schellenberg, B. Richte, C. Paar, and D. Holcomb, “Sat-
based reverse engineering of gate-level schematics using fault injection
and probing,” in 2018 IEEE International Symposium on Hardware
Oriented Security and Trust (HOST), pp. 215–220, IEEE, 2018.
[25] S. Keshavarz, C. Paar, and D. Holcomb, “Design automation for
obfuscated circuits with multiple viable functions,” in Design, Automation
& Test in Europe Conference & Exhibition (DATE), 2017, pp. 886–889,
IEEE, 2017.
[26] M. A. Vosoughi and S. Ko¨se, “Combined distinguishers to enhance
the accuracy and success of side channel analysis,” in 2019 IEEE
International Symposium on Circuits and Systems (ISCAS), pp. 1–5,
IEEE, 2019.
[27] M. A. Vosoughi, L. Wang, and S. Ko¨se, “Bus-invert coding as a low-
power countermeasure against correlation power analysis attack,” in
2019 ACM/IEEE International Workshop on System Level Interconnect
Prediction (SLIP), pp. 1–5, IEEE, 2019.
[28] A. Vosoughi and S. Ko¨se, “Leveraging on-chip voltage regulators against
fault injection attacks,” in Proceedings of the 2019 on Great Lakes
Symposium on VLSI, pp. 15–20, ACM, 2019.
[29] D. J. Bernstein, “Cache-timing attacks on aes,” 2005.
[30] E. Tromer, D. A. Osvik, and A. Shamir, “Efficient cache attacks on aes,
and countermeasures,” Journal of Cryptology, vol. 23, no. 1, pp. 37–71,
2010.
[31] Z. Wang and R. B. Lee, “New cache designs for thwarting software cache-
based side channel attacks,” ACM SIGARCH Computer Architecture News,
vol. 35, no. 2, pp. 494–505, 2007.
[32] E. Brickell, G. Graunke, M. Neve, and J.-P. Seifert, “Software mitigations
to hedge aes against cache-based software side channel vulnerabilities.,”
IACR Cryptology ePrint Archive, vol. 2006, p. 52, 2006.
[33] S. Mangard, “Hardware countermeasures against dpa–a statistical analysis
of their effectiveness,” in Cryptographers Track at the RSA Conference,
pp. 222–235, Springer, 2004.
10
