GPU Acceleration of Real-Time Control Loops
Mohamed A. Bamakhrama
ASML, Netherlands
Alejandro Arrizabalaga
ASML, Netherlands
Frank Overman
ASML, Netherlands
Jean-Paul Smeets
ASML, Netherlands
Kornel van der Sommen
ASML, Netherlands
Remko van der Vossen
ASML, Netherlands
John Wagensveld
ASML, Netherlands
Abstract
Extreme Ultraviolet (EUV) photolithography is seen as the key enabler for increasing transistor
density in the next decade. In EUV lithography, 13.5 nm EUV light is illuminated through a
reticle, holding a pattern to be printed, onto a silicon wafer. This process is performed about
100 times per wafer, at a rate of over a hundred wafers an hour. During this process, a certain
percentage of the light energy is converted into heat in the wafer. In turn, this heat causes the
wafer to deform which increases the overlay error, and as a result, reduces the manufacturing
yield. To alleviate this, we propose a firm real-time control system that uses a wafer heat feed-
forward model to compensate for the wafer deformation. The model calculates the expected wafer
deformation, and then, compensates for that by adjusting the light projection and/or the wafer
movement. However, the model computational demands are very high. As a result, it needs to be
executed on dedicated HW that can perform computations quickly. To this end, we deploy Graphics
Processing Units (GPUs) to accelerate the calculations. In order to fit the computations within the
required time budgets, we combine, in a novel manner, multiple techniques such as compression and
mixed-precision arithmetic, with recent advancements in GPUs to build a GPU-based real-time
control system. A proof-of-concept implementation using NVIDIA P100 GPUs is able to deliver
decompression throughput of 33 GB/s and a sustained 198 GFLOP/s per GPU for mixed-precision
dense matrix-vector multiplication.
2012 ACM Subject Classification Computer systems organization → Embedded and cyber-physical
systems; Computer systems organization → Real-time system architecture; Hardware → Hardware
accelerators
Keywords and phrases GPU Acceleration, Feed-Forward Control, Mixed-Precision BLAS
Acknowledgements The authors would like to thank Peter Lindstrom and Matthew Larsen from
Lawrence Livermore National Laboratory (LLNL) for their support with zfp.
1 Introduction
Extreme Ultraviolet (EUV) photolithography is seen as the key enabler for reducing the
transistor dimensions, and hence, increasing transistor density in the next decade. EUV
lithography has been in development for more than two decades. However, EUV systems
have been deployed in production only in the past five years. As the adoption ramps up into
high volume production, EUV lithography manufacturers are focusing now on improving the
technology further [18]. One of the improvements is increasing the power of the light dosage
delivered to the wafer. Naturally, a certain amount of the delivered power is converted into
heat in the wafer. As the power increases, the amount of power converted into heat increases
too. This extra heat poses a problem; it causes the wafer to deform which means that the
overlay error is increased. As a result, this extra error leads to reduced manufacturing yield.
To alleviate this, a cooling hood is introduced to absorb the extra heat [18]. However, the
cooling hood alone is not sufficient to reduce the overlay error to the target values. As a
solution, a wafer heat feed-forward (WHFF) model is used in addition to the cooling
hood [17]. The WHFF model is composed of two parts: (i) the thermal model which computes
the temperature of each point in the wafer after being exposed to a certain light dosage,
and (ii) the deformation model which computes the deformation in wafer shape due to the
absorbed heat. The WHFF model computes an accurate approximation of the temperature
and deformation which are then sent to the actuators. After that, the actuators perform
corrections to the EUV light projection and/or wafer movement in order to minimize the
overlay error.
One major issue in realizing the WHFF model is its huge computational requirements [17].
The WHFF model involves solving a heat dissipation equation (for the thermal model)
together with a 2.5D mechanical deformation equation. The heat dissipation equation is
solved using Finite Elements Method (FEM) with a uniform mesh grid. The temperature
points resulting from solving the FEM formulation are applied to the 2.5D mechanical
deformation equation to compute the deformation values in x, y, and z axes. To reach the
required levels of accuracy, the thermal model uses a 2D mesh grid that has more than
360000 points. For a 300 mm wafer, this means computing a temperature point for every 0.25
mm². Once the thermal and deformation equations are formulated in matrix-vector form, the
computational needs of the two models can be derived. The performance needed to compute
the WHFF model “on-time” to compensate for deformation is around 600 GFLOP/s for
all dimensions. By “on-time”, we mean computing the WHFF and delivering the resulting
corrections to the actuators within a time window of 50 ms that allows those corrections
to take effect. If the corrections were to be sent after the 50 ms time window, then the
mechanical parts would have moved and the corrections will have no impact. That’s why we
classify the control system as firm real-time; missing a deadline is tolerable, however, it
results in an increased overlay error (and hence lower yield) for the impacted dies and wafers.
In this paper, we present a realization of a firm real-time control system that runs the
WHFF model on Commercial Off-The-Shelf (COTS) hardware components. The proposed
realization uses Graphics Processing Units (GPUs) to deliver the required throughput and
memory bandwidth. In the past decade, there has been a lot of interest in using GPUs in
embedded and industrial control systems [9, 10, 12, 13, 15, 16, 20, 21]. Most of this interest
is driven by the proliferation of machine learning workloads that benefit significantly from
acceleration on GPUs (e.g., [20]). Acceleration of other compute-intensive control loops
remains a largely unexplored arena despite the few attempts [9, 15, 16]. In this work, we
propose a novel solution for the acceleration of compute-intensive control models used in
firm real-time systems. The novelty of the proposed solution stems from the way in which
it combines multiple domains (real-time control, high-performance computing, and digital
signal processing) together with recent advancements in GPU technology in a unique way
to solve a real industrial problem. Several key aspects of our proposed solution can be
summarized as follows:
High-Performance + Real-Time: The system has to deliver a throughput of 10
GFLOP every 50 ms per axis. Such a high throughput is characteristic of HPC systems
used for scientific computing. At the same time, the 50 ms is a firm deadline; any
violation of this deadline will render the computed data useless. In our proposed system,
we combine such high-performance with a tight firm real-time control loop.
High Bandwidth Memory: The WHFF model needs to access around 400 GB/s
per axis. Traditionally, GPUs' main drawback was the limited memory bandwidth.
However, this is starting to change with the advancement of High Bandwidth Memory
(HBM, [6]). HBMs use 3D stacked memories with Through-Silicon-Vias (TSVs) to form a
memory die that is connected to the logic die via an interposer. The latest GPUs from
NVIDIA (e.g., P100 [4] and V100 [5]) provide 16-32 GB of HBM with a bandwidth in
the range of 700-900 GB/s.
Compression: HBMs alone are not sufficient as the connection between system memory
and HBM is still based on the rather slow PCI-Express interconnect. In order to overcome
the PCI-Express bandwidth bottleneck, we compress all the data that goes over PCI-
Express to GPU. With compression, we are able to achieve 4-10x reduction in the data
volume transferred over PCI-Express.
Mixed-Precision Arithmetic [7]: A key technique to maximize throughput on modern
GPUs is to perform different operations using different precision depending on: (i)
operation cost, (ii) accuracy needed, and (iii) rounding error introduced by each operation.
In this work, we perform multiplications (costly, low rounding error) using single-precision,
and reductions (cheap, high rounding error) using double-precision. Key advantages of this
approach are: (i) the ability to store all the inputs as single-precision compressed arrays
which reduces data transfer volume, and (ii) the ability to utilize both single- and double-precision
execution units.
The rest of this paper is organized as follows. Section 2 gives a detailed overview of the
WHFF model and the environment in which it runs. Section 3 gives an overview of the
related work. Section 4 describes the proposed system and SW implementation. Section 5
presents the results of evaluating a proof-of-concept implementation of the proposed system
based on NVIDIA P100 GPUs. Finally, Section 6 finishes the paper with conclusions.
2 Background
In this section, we provide an overview of how the WHFF model works. For an in-depth
mathematical description of the model, one can refer to [17].
2.1 Wafer Heat Feed-Forward Model
Figure 1 illustrates the scan pattern of a wafer. A wafer is divided into fields, where each
field corresponds roughly to a single die. Within a single field, the pattern is printed
by illuminating EUV light through the reticle. At a given time, the illuminated pattern
corresponds to an arc-shaped slit. The scan speed can be either fast or slow depending
on the energy dosage to be delivered to the wafer. A fast scan has a shorter illumination
period, which means less energy and less time to scan a full field. Conversely, a slow scan
has a longer illumination period, which means more energy and more time to scan a full field.
Furthermore, the scan time is classified into two inter-mixed stages: light (i.e., when the
illuminator is on), and dark (i.e., when the illuminator is off). In a real EUV machine,
the field fast scan is around 70 ms (light + dark) and the field slow scan is around 110 ms
(light + dark).
Figure 1 Wafer scan pattern
Unless mentioned otherwise, all real-valued data are stored as single-precision
floating-point numbers.
The thermal model in the WHFF tracks the temperature development during the complete
wafer scan. As mentioned earlier, to reach the desired accuracy, the wafer is divided into
a uniform 2D mesh grid consisting of more than 360000 points. For every millisecond of the
entire scan time (i.e., both light and dark stages), the thermal model has to be evaluated to
compute the new temperature values. In addition to the wafer, the thermal model takes also
into account the wafer clamp (i.e., the part holding the wafer). In total, there are around
3.7 × 10^5 temperature values (stored in ~Tk) that are updated in each iteration of the thermal
model. ~Tk is updated using Equation 1. It is important to note that A is a sparse matrix,
while B is a diagonal one.

$$\vec{T}_{k+1} = A \cdot \vec{T}_k + B \cdot \vec{u}_k \tag{1}$$
After computing the temperature points, a thermal interpolation step is performed which
reduces ~Tk+1 to a smaller vector ~Sk+1 that contains around 2.5 × 10^5 values.
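The structure of Equation 1, a sparse A and a diagonal B, is what keeps the thermal update cheap compared to the deformation step. As an illustration only, a single thermal time step could be sketched as follows, assuming A is stored in CSR format (the storage layout is not specified in the text):

/* Sketch of one thermal-model time step, T_{k+1} = A*T_k + B*u_k (Equation 1).
 * Assumes A is stored in CSR format and B is represented by its diagonal only;
 * neither storage choice is prescribed by the text. */
void thermal_step(int n,                       /* number of mesh points (T)  */
                  const int   *row_ptr,        /* CSR row pointers of A      */
                  const int   *col_idx,        /* CSR column indices of A    */
                  const float *val,            /* CSR non-zero values of A   */
                  const float *B_diag,         /* diagonal of B              */
                  const float *T_k,            /* current temperatures       */
                  const float *u_k,            /* source term                */
                  float       *T_next)         /* output: T_{k+1}            */
{
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            acc += val[j] * T_k[col_idx[j]];   /* sparse row of A times T_k  */
        T_next[i] = acc + B_diag[i] * u_k[i];  /* add diagonal B times u_k   */
    }
}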
The 2.5D mechanical deformation model takes vector ~Sk+1 (resulting from the thermal
model) and multiplies it by matrix Cd at every light time step as shown in Equation 2.

$$\forall k \in \text{light stage}: \quad \vec{\Delta}_d = C_d \cdot \vec{S}_{k+1} \qquad \forall d \in \{x, y, z\} \tag{2}$$

The Cd matrix in Equation 2 is a dense matrix that describes, for every dimension d ∈ {x, y, z},
the deformation response of the complete mechanical setup, consisting of the wafer and
the layers below it, to a variation in temperature at any point in the wafer surface. The
resulting ~∆d contains (per dimension) the resulting deformations. Equation 2 needs to be
evaluated only for the light stage. During the dark stage, the light is off and there is no
extra heat transferred into the wafer. However, keep in mind that the thermal model has to
be evaluated for both light and dark stages to track the decrease of temperature during the
dark stage.
Computing the real size of Cd (assuming single-precision floating-point) reveals that it is
229 GB per dimension d. This poses a huge challenge for implementing the WHFF on any
HW platform since storing and processing such a huge matrix will be very costly. To solve
this challenge, we apply a divide-and-conquer approach. Recall from Figure 1 that the scanner
moves sequentially across the fields and slits. Therefore, one can reduce the size of Cd used in
Equation 2 by computing only the deformations impacting the field-under-scan. This means
that Cd can be reduced into Cd,field. This new smaller matrix has much fewer rows than
the original one (but the same number of columns as the original Cd). Furthermore, during
the scan of a field, one can repeat the same strategy by computing only the deformations
impacting the slit-under-scan using a matrix Cd,slit that is a subset of Cd,field. This means
that, for every field, a single Cd,field is fetched, then multiple instances of Cd,slit are fetched
from Cd,field. In other words: Cd,slit ⊂ Cd,field ⊂ Cd. Computing the real sizes of these new
smaller matrices (assuming single-precision) shows that Cd,field is 2067 MB and Cd,slit is 361
MB.
A complete pseudo-code summary of the WHFF model is depicted in Figure 2. Based on
this summary, we can compute the complexity of the WHFF model per scan millisecond.
One can easily observe that the WHFF model is dominated by GEneralized Matrix-Vector
(GEMV) multiplication which is a Basic Linear Algebra Subprograms (BLAS) Level-2 routine.
GEMV is an I/O-bound algorithm with an arithmetic intensity [19] ranging from 1/6 to 1/2
asymptotically. Having said that, we proceed now with computing the exact FLOP and I/O
cost of the model. Recall from linear algebra that dense matrix-vector product A ·~b of a
matrix with M rows and N columns costs MN multiplications and M(N − 1) summations.
Hence, the total FLOP cost is M(2N − 1) FLOPs. For sparse matrices, the FLOP cost is
M(2 · nnz(A)− 1) FLOPs, where nnz(A) is the number of non-zero elements per row. Let tl
be the number of light time steps and td be the number of dark time steps. Then, the FLOP
cost of the WHFF model per field is given by Equation 3.
$$\frac{\text{FLOPS}}{\text{Time Budget (ms)}} = \underbrace{3}_{\text{per axis}} \cdot \underbrace{t_d \big( M (2S - 1) \big)}_{\text{Deformation Model}} + \underbrace{T (t_l + t_d) \big( 2 \cdot nnz(A) + 2 \cdot nnz(B) - 1 \big)}_{\text{Thermal Model}} \tag{3}$$
where T , M , and S are the dimensions specified in Figure 2. The time budget in Equation 3
is the time allocated for executing the model and it is set to around 50 ms (for fast scans)
and around 80 ms (for slow scans).
If we substitute T , M , and S in Equation 3 with their real values, then we obtain the
following lower bounds on the total WHFF computational requirements: 398.7 GFLOP/s
(for fast scans) and 587.2 GFLOP/s (for slow scans). However, one can easily note that the
loop in line 8 in Figure 2 is fully parallelizable since there are no loop-carried dependencies.
This allows us to speed-up the model execution by running all three iterations in parallel.
By doing that, we find that the throughput requirements per axis are 150 GFLOP/s (for fast
scans) and 195 GFLOP/s (for slow scans).
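As a quick sanity check of these numbers (using the Cd,slit dimensions of 378 × 256000 given in Section 4.4, and assuming on the order of fifty evaluations of Equation 2 per field, a count that is not stated explicitly in the text), the dominant per-axis cost is consistent with the 10 GFLOP per 50 ms figure quoted in Section 1:

$$378 \cdot (2 \cdot 256\,000 - 1) \approx 1.94 \times 10^{8} \ \text{FLOPs per slit GEMV}, \qquad 50 \cdot 1.94 \times 10^{8} \approx 9.7 \ \text{GFLOP per axis per field.}$$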
In a similar manner, one can compute the I/O cost of the WHFF model (I/O here means
the number of accesses needed to fetch the data into the HW). For dense matrix-vector
multiplication, one needs to fetch MN + N values and write M values. So in total, the I/O
cost is MN + N + M I/O operations (IOPS). Therefore, the I/O cost of the WHFF model
to scan a single field is given by:
$$\frac{\text{IOPS}}{\text{Time (ms)}} = (t_l + t_d)\, T \big( 2T + 3 \big) + 3\, t_d \big( MS + S + M \big) \tag{4}$$
If we combine the actual throughput and memory bandwidth requirements of the WHFF
model, then we obtain the picture shown in Figure 3.
3 Related Work
The past decade witnessed an increase in the interest of using General Purpose GPU
(GPGPU) computing in embedded, real-time, and industrial systems [9, 10, 12, 13, 15, 16, 20].
In [10], Elliot and Anderson discussed the applications that can benefit from GPUs and the
Require: Dark phase load
Require: Wafer position
Require: Heat loads impacting the wafer
Require: Sparse matrix A ∈ R^{T×T}
Require: Diagonal matrix B ∈ R^{T×T}
Require: Dense matrices Cd ∈ R^{K×S}, where d ∈ {x, y, z}
1: for Every exposure scan do
2: Fetch a sub-matrix Cd,field ∈ R^{L×S} corresponding to the field-under-scan from Cd
3: for Every millisecond during scan (light and dark) do
4: Source term calculation (produces ~uk ∈ R^{T×1})
5: Thermal Model (Equation 1)
6: Thermal Interpolation (produces ~Sk ∈ R^{S×1})
7: if scan during light stage then
8: for Every dimension d ∈ {x, y, z} do
9: Fetch a sub-matrix Cd,slit ∈ R^{M×S} from Cd,field
10: Deformation Model: ~∆d = Cd,slit · ~Sk+1
11: Deformation Interpolation
12: end for
13: end if
14: end for
15: end for
16: return Wafer deformation ~∆d in x, y, and z
Figure 2 A pseudo-code implementation of the WHFF model
constraints on using them. In [12,13], Hallmans et al. discussed the potential of GPUs in
embedded and industrial systems and the issues facing their adoption.
In [9], the first real attempt at accelerating a real-time control loop (involving a particle
filter) on GPUs is proposed. The proposed implementation was done on commercial GeForce
cards and can be integrated in large industrial systems. Similar to our proposed solution, [9]
is targeting latencies below 50 ms. However, a key difference between [9] and our solution is
the model computational complexity and the amount of data needed to execute the model.
Particle filters are compute-bound and require small amounts of data compared to the dense
matrices involved in solving the 3D mechanical deformation equation described in Section 2.
In [20], Windmann and Niggemann presented a method for fault detection in an industrial
automation process. The method incorporates a particle filter with switching neural networks
in a fault detection method. The execution time of the method was reduced from 80 s on
CPUs to around 6 s on GPUs. Compared to our proposed solution, [20] differs in: (i) their
target latency is 100x our target latency, and (ii) the model is compute-bound with rather
low memory bandwidth requirements.
In [15, 16], Maceina and Manduchi presented an assessment of GPGPU in real-time
control systems used in nuclear fusion reactors. According to them, GPUs had limited
success in real-time control due to: (i) lack of highly parallel real control applications, and (ii)
memory bandwidth bottleneck between system memory and GPU. They demonstrated that
GPUs are very useful to accelerate multiple classes of compute-intensive control applications
such as: (i) dense matrix-vector multiplication, (ii) image analysis, and (iii) synthetic
magnetic measurements in magnetic confinement fusion. Their first application (matrix-
vector multiplication) is the same as the one that we accelerate in the WHFF.
Figure 3 Throughput and bandwidth requirements (per axis) for WHFF model
However, key differences between our solution and the one in [16] are:
The size of the matrix and vector are much larger in our case. To alleviate the memory
bandwidth bottleneck, we use data compression.
The use of mixed-precision arithmetic to increase the GPU utilization.
Another approach we use to alleviate the memory bandwidth bottleneck is the deployment
of the latest GPUs equipped with High Bandwidth Memory (HBM, [6]) such as NVIDIA’s
P100 and V100 GPUs.
4 Proposed Solution
In this section, we outline the proposed solution to accelerate the WHFF model. We start,
in Section 4.1, by describing the criterion we followed to select GPUs from the different
HW acceleration technologies available in the market. Then, in Section 4.2, we describe the
overall system architecture. After that, we describe the data compression scheme deployed
in our system. Next, in Section 4.4, we describe in details the mixed-precision matrix-vector
multiplication scheme implemented on our system.
4.1 Why GPUs?
EUV lithography machines are unique industrial systems. A single EUV lithography machine
costs more than $100 M, is power-rated at 1 MW, and has a lifespan of 15-20 years.
A large portion of the machine power budget goes into the EUV light source, vacuum
environment systems, and the mechanical parts.
Figure 4 Roofline model of modern computing HW platforms and the maximum achievable
performance for single-precision GEMV on these HW platforms. [Two panels: the canonical
Roofline in GFLOP/s and the price-normalized Roofline in MFLOP/s/$, each plotting Tesla V100,
Tesla P100, Xeon Gold 6140, and Stratix 10 MX against arithmetic intensity (FLOP/byte), with a
vertical line at the GEMV intensity.]
Typically, when selecting electronics for
EUV machines, cost efficiency (i.e., throughput/$) is often much more important than energy
efficiency (i.e., throughput/W). Figure 4 shows the Roofline model [19] for a collection
of high-end modern HW platforms covering CPUs, GPUs, and FPGAs together with a
vertical line denoting the arithmetic intensity of GEMV. The figure shows the canonical
Roofline (as defined in [19]) and the normalized Roofline model (i.e., obtained through
dividing throughput by price). The y-axis shows the maximum achievable single-precision
performance. We observe the following:
1. HW with HBM2 memory (i.e., Teslas and Stratix) offers the best performance in the
canonical Roofline.
2. FPGAs fare worse in MFLOP/s/$ than GPUs and CPUs. This is due to the high price tag
of HBM2-enabled FPGAs.
3. We observe an "inversion" between P100 and V100 in the normalized Roofline (i.e., P100
is better than V100). V100 is a newer card with higher bandwidth. However, the extra
bandwidth (and hence throughput) does not pay off once throughput is normalized by cost
for I/O-bound applications, as the Roofline sketch below illustrates.
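To make the Roofline numbers concrete, the sketch below evaluates the attainable-performance formula directly. The peak-FLOP and bandwidth values are approximate published figures for the Tesla P100 and are used purely as an illustration, not as measurements from our system.

#include <stdio.h>

/* Roofline model [19]: attainable performance is capped either by peak
 * compute or by (arithmetic intensity x memory bandwidth). */
static double roofline_gflops(double peak_gflops, double bw_gbs, double intensity)
{
    double bw_bound = intensity * bw_gbs;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void)
{
    /* Approximate single-precision peak and HBM2 bandwidth of a Tesla P100;
     * illustrative values only. */
    double peak = 9300.0;   /* GFLOP/s */
    double bw   = 732.0;    /* GB/s    */

    /* An arithmetic intensity of roughly 1/2 FLOP/byte (the upper end of the
     * GEMV range given in Section 2) puts GEMV firmly on the bandwidth-limited
     * part of the roof. */
    printf("GEMV roofline: %.0f GFLOP/s\n", roofline_gflops(peak, bw, 0.5));
    return 0;
}

With these numbers, the bandwidth-limited ceiling for GEMV on the P100 is roughly 366 GFLOP/s, comfortably above the 195 GFLOP/s per-axis requirement derived in Section 2.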
4.2 System Architecture
If we take the requirements shown in Figure 3 and map them to actual COTS HW, then
we obtain the system architecture shown in Figure 5. The mapping in Figure 3 shows the
need for "level 1 memory" that can store > 2067 MB and support more than 410 GB/s. At
the time of writing, the only COTS memories capable of realizing Level-1 memory are HBM
and Graphics DDR (GDDR).
Figure 5 Overall System Architecture. Each GPU has its own dedicated 16-lane PCI Express
Gen3 connection to the system memory. The host CPU is assumed to have enough PCI lanes (e.g.,
48) to have a direct point-to-point connection with each GPU (i.e., no PCI-Express switch).
If we combine the FLOP requirements with Level-1 memory
requirements, then we find that NVIDIA’s V100 and P100 cards satisfy both requirements.
For level-2 memory, the Cd matrix is so huge (> 200 GB) that it can be stored only in
the system DDR memory or permanent storage. However, the interconnection from the
system memory/storage to the GPU is slower than the required bandwidth as it is based on
PCI-Express¹ Gen3 (16 GB/s vs. 45 GB/s). To tackle this problem, we deploy compression
on the data transferred over PCI Express. Compression is able to achieve compression ratios
in the range of 4-10x. A factor 4 allows the bandwidth requirement to drop from 45 GB/s
to 11.25 GB/s, which is well below the PCI-Express maximum bandwidth. In the final system,
three GPU cards are used where each card processes one dimension of the 3D mechanical
deformation model (i.e., one iteration of the loop in line 8 in Figure 2). Each card is connected
to the system memory using a 16-lane PCI-Express Gen3 interconnect.
1 We exclude NVIDIA’s NVLink because it is a proprietary interconnect with no clear end-of-life strategy
and is supported only on IBM POWER CPUs.
Figure 6 The system model with data compression
Table 1 Pipeline period and latency of the system with and without applying data compression
                      Without compression           With compression
Pipeline Period (s)   max(Ttransfer, Tcompute)      max(Ttransfer / R, Tcompute + Tdecode)
Latency (s)           Tcompute + Ttransfer          Tcompute + Ttransfer / R + Tdecode
4.3 Data Compression
As mentioned in Section 4.2, we apply compression on the data transferred over the PCI-Express
bus to the GPUs. Figure 6 illustrates how compression is deployed. The interconnect has a
fixed bandwidth (denoted by B and measured in byte/s). Suppose that we have a block of
data consisting of S bytes that needs to be transferred from the host CPU to the GPU. We
assume that data transfers are pipelined with computations [1]. Without compression, it
would take Ttransfer = S/B seconds to transfer the data from the host to GPU (red line in
Figure 6). With compression, the data is reduced in size by a compression factor R which
results in a compressed piece of data with size Sc = SR bytes. This means that the transfer
time is reduced by a factor R (green line in Figure 6). Let Tcompute be the time needed to
for the application computations and Tdecode be the time needed to decode the compressed
data. Then, Table 1 shows the pipeline period and latency of the total system with and
without compression. Note that in bandwidth-bound applications (e.g., GEMV), Ttransfer is
much larger than Tcompute (i.e., Ttransfer ≫ Tcompute). For compression to be beneficial, the
following must hold:
1. The decompression throughput (denoted by Wc and measured in byte/s) must be larger
than or equal to the interconnect bandwidth. Wc is equal to S/Tdecode. We show later in
Section 5 that we achieve a Wc on the NVIDIA P100 in the range of 30-35 GB/s,
which is higher than the 16 GB/s maximum bandwidth of PCI-Express Gen3.
2. The GPU must have enough resources to accommodate both the application and the
decompressor. This is true in our case as we do not run any other applications beside the
WHFF on the GPU.
3. The total latency after introducing decompression is still within the latency budget of
the application. In other words, Tdecode + Tcompute + Ttransfer/R < Tbudget. This also
holds in our application. Moreover, compression can be done offline (i.e., before the
application starts), which keeps the encoding step completely out of the time budget.
A small numeric sketch of these conditions, using the formulas in Table 1, is given below.
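The following sketch evaluates the Table 1 formulas for a representative case: a 361 MB Cd,slit block, a 16 GB/s PCI-Express Gen3 link, a compression ratio R = 4, and the roughly 33 GB/s decompressor of Section 5. The compute time is a placeholder assumption, so the printed numbers are illustrative only.

#include <stdio.h>

static double max2(double a, double b) { return a > b ? a : b; }

int main(void)
{
    /* Representative figures taken from the text; T_compute is an assumption. */
    double S = 0.361;                 /* block size in GB (C_{d,slit})     */
    double B = 16.0, Wc = 33.0;       /* link and decoder throughput, GB/s */
    double R = 4.0;                   /* compression ratio                 */
    double T_transfer = S / B;        /* uncompressed transfer time (s)    */
    double T_decode   = S / Wc;       /* decode time of the full block (s) */
    double T_compute  = 0.002;        /* assumed compute time (s)          */

    printf("No compression:   period %.1f ms, latency %.1f ms\n",
           1e3 * max2(T_transfer, T_compute),
           1e3 * (T_compute + T_transfer));
    printf("With compression: period %.1f ms, latency %.1f ms\n",
           1e3 * max2(T_transfer / R, T_compute + T_decode),
           1e3 * (T_compute + T_transfer / R + T_decode));
    return 0;
}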
Table 2 Results of compressing Cd under different modes of zfp. RMSE stands for Root Mean
Square Error. NRMSE is Normalized RMSE. PSNR is Peak Signal-to-Noise Ratio.
Mode Axis Bit/value Rate RMSE NRMSE Max. pointwise error PSNR
Fixed rate (8)
x 8 4 3.510 × 10^-13 2.36 × 10^-5 1.522 × 10^-10 86.54
y 8 4 2.113 × 10^-13 1.95 × 10^-5 1.122 × 10^-10 88.19
z 8 4 6.204 × 10^-13 2.77 × 10^-5 2.900 × 10^-10 85.14
Fixed-accuracy (< 10^-12)
x 2.635 12.14 4.326 × 10^-14 2.90 × 10^-6 3.974 × 10^-13 104.72
y 2.579 12.41 4.526 × 10^-14 4.17 × 10^-6 4.081 × 10^-13 101.58
z 3.38 9.47 4.942 × 10^-14 2.21 × 10^-6 4.107 × 10^-13 107.11
Fixed-precision (17 bits)
x 7.6 4.21 3.925 × 10^-15 2.63 × 10^-7 1.156 × 10^-12 125.57
y 8.073 3.96 4.090 × 10^-15 3.77 × 10^-7 1.273 × 10^-12 122.46
z 7.575 4.22 2.142 × 10^-14 9.55 × 10^-7 4.791 × 10^-12 114.38
We choose zfp [14] as a compression scheme. zfp is a compression algorithm that operates
on integer and floating-point data. It can process data sets that are organized as 1D, 2D, 3D
and 4D arrays. zfp provides three modes of operation:
1. Fixed accuracy: In this mode, the compressor accepts the uncompressed data and a
tolerance. Then, it produces a compressed data stream which, compared to the original
data, has an absolute difference that is less than or equal to the tolerance. If the tolerance is
set to zero, then zfp will operate in near-lossless mode.
2. Fixed precision: In this mode, the compressor is controlled by specifying the number
of most significant bits that the compressor must preserve from the original data into the
compressed data.
3. Fixed rate: In this mode, the compressor is controlled by specifying a fixed Bit-Per-Value
(BPV) rate. Then, the encoder encodes each floating point value using BPV bits.
Table 2 highlights the results of compressing Cd (from a mesh of 150× 200) under the
different modes of zfp. It shows that fixed accuracy gives the best compression ratio (~10x)
while fixed-precision achieves the best NRMSE and PSNR. Fixed-rate is very deterministic
in timing and bandwidth, at the cost of much larger error.
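For reference, a minimal sketch of how the three zfp modes above could be configured offline for a 2D single-precision field using zfp's C API is shown below; the exact function signatures can differ slightly between zfp versions, so treat this as an illustration rather than the code used in the PoC.

#include <stdlib.h>
#include "zfp.h"

/* Compress a 2D single-precision array with zfp in one of its three modes.
 * Returns the compressed size in bytes (0 on failure). Caller owns *out. */
size_t compress_2d(float *data, size_t nx, size_t ny,
                   double tolerance,      /* > 0: fixed-accuracy mode   */
                   unsigned precision,    /* > 0: fixed-precision mode  */
                   double rate,           /* > 0: fixed-rate mode (BPV) */
                   void **out)
{
    zfp_field  *field = zfp_field_2d(data, zfp_type_float, nx, ny);
    zfp_stream *zfp   = zfp_stream_open(NULL);

    if (tolerance > 0.0)
        zfp_stream_set_accuracy(zfp, tolerance);
    else if (precision > 0)
        zfp_stream_set_precision(zfp, precision);
    else
        zfp_stream_set_rate(zfp, rate, zfp_type_float, 2, 0);

    size_t bufsize = zfp_stream_maximum_size(zfp, field);
    *out = malloc(bufsize);
    bitstream *stream = stream_open(*out, bufsize);
    zfp_stream_set_bit_stream(zfp, stream);
    zfp_stream_rewind(zfp);

    size_t compressed = zfp_compress(zfp, field);  /* 0 indicates failure */

    zfp_field_free(field);
    zfp_stream_close(zfp);
    stream_close(stream);
    return compressed;
}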
The next step is to investigate the error that the lossy decompressed data introduces into
the deformations computed by the WHFF model. Upon evaluating the error introduced by
compression on the final Quantity of Interest (QoI), we found that the error in deformations
is 0.1% of the deformations computed by the original Cd matrix (in fixed accuracy and
precision modes). Such a low error in the final QoI is acceptable as long as it is below 1%.
4.4 Mixed-Precision Matrix-Vector Multiplication
As mentioned earlier, the core of the WHFF model is dense matrix-vector multiplication (i.e.,
GEMV). Therefore, it is very critical to accelerate GEMV as much as possible. Most HW
vendors provide an optimized version of BLAS routines for their platforms. For example,
Intel provides Intel’s Math Kernel Library (MKL, [2]) for the x86 platform, while NVIDIA
provides cuBLAS library [3] for its GPUs. In the WHFF, Cd,slit is a very wide matrix with
378 rows and 256000 columns. Such a wide matrix poses an interesting challenge. In A ·~b = ~r,
each element of ~r is computed using the dot operator as:
$$r_i = \sum_{j=1}^{N} A(i, j)\, b_j \tag{5}$$
/* float is single-precision (32 bits) and double is double-precision (64 bits).
   HEIGHT and WIDTH are the compile-time matrix dimensions. */
void gemv(const float * __restrict__ M,
const float * __restrict__ V,
float * __restrict__ R)
{
int x, y;
double sum;
float product;
for (y = 0; y < HEIGHT; y++) {
sum = 0.0;
for (x = 0; x < WIDTH; x++) {
product = (float)M[y*WIDTH + x] * (float)V[x];
sum += (double) product;
}
R[y] = (float) sum;
}
}
Figure 7 Mixed-precision sequential implementation of GEMV in C
If A is very wide, then the summation in Equation 5 becomes a very long reduction sequence.
If single-precision floating-point numbers are used, then for an A with a width of 256000
elements (the same as Cd,slit), we lose 17 bits of the mantissa due to rounding errors [11].
This leaves us with only 6 "un-contaminated" bits. Therefore, it is essential to use double-
precision for performing the reduction. On the other hand, performing the multiplication in
Equation 5 using single-precision is very safe (in terms of rounding error) and much faster
than double-precision. Therefore, we use a mixed-precision implementation of GEMV, where:
(i) the inputs and outputs are both single-precision arrays, (ii) multiplication is performed
using single-precision, and (iii) reduction is performed by casting the product from (ii) and
adding it to a double-precision sum variable. A sequential mixed-precision GEMV in C is
shown in Figure 7.
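The bit-loss figure above follows from the standard worst-case bound for recursive summation. With unit roundoff u = 2^-24 for single precision and n = 256000 terms:

$$\left|\, \mathrm{fl}\Big(\sum_{j=1}^{n} a_j b_j\Big) - \sum_{j=1}^{n} a_j b_j \,\right| \;\lesssim\; n\, u \sum_{j=1}^{n} |a_j b_j|, \qquad n\, u = 256\,000 \cdot 2^{-24} \approx 2^{-6},$$

so in the worst case only about 6 leading mantissa bits of the result are guaranteed to be unaffected, in line with the figure above. Performing the same accumulation in double precision (u = 2^-53) brings the bound down to roughly 2^-35, which is negligible for this application.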
In this work, we implemented two variants of GEMV:
1. CPU-MP: A CPU implementation obtained by auto-vectorizing the sequential mixed-
precision GEMV with the compiler, targeting AVX2, and parallelizing across cores with OpenMP.
2. GPU-MP: A manual mixed-precision implementation of GEMV (using CUDA and
OpenCL).
In the following subsections, we explain each variant in detail.
4.4.1 Variant 1: CPU-MP
The CPU variant is based on the sequential implementation shown in Figure 7. We applied
one modification to the code shown in Figure 7 to enable auto-vectorization. The change
deals with telling the compiler that the input pointers are allocated using aligned memory
allocators (e.g., POSIX’s posix_memalign). Aligning the pointers in memory enables the
compiler to generate efficient vector code. In addition, we also parallelized the outer loop
over different cores using OpenMP. The core of the GEMV kernel is vectorized nicely using
GCC into AVX instructions as shown in Figure 8. We can easily see that the compiler
produces single-precision instructions (vmulps) for the multiplication and double-precision
ones (vaddpd) for the addition. The conversion from single to double precision is done using
the vcvtps2pd instruction.
vmovaps (%rdx,%rax,1),%ymm0
vmulps (%rcx,%rax,1),%ymm0,%ymm0
add $0x20,%rax
vextractf128 $0x1,%ymm0,%xmm1
vcvtps2pd %xmm0,%ymm0
vcvtps2pd %xmm1,%ymm1
vaddpd %ymm0,%ymm1,%ymm0
vaddpd %ymm0,%ymm2,%ymm2
cmp $0x8000,%rax
Figure 8 AVX instructions implementing the product and summation inside GEMV
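For illustration, the two modifications described above (promising the compiler aligned pointers and parallelizing the outer loop with OpenMP) could be expressed roughly as in the sketch below. The use of __builtin_assume_aligned, the 32-byte alignment, and the illustrative dimensions are our own choices for the sketch; the paper does not state the exact mechanism used.

#include <stddef.h>

#define WIDTH  (256 * 1024)   /* illustrative C_{d,slit}-like dimensions */
#define HEIGHT 378

/* Variant of Figure 7 with the two modifications described above: the
 * pointers are promised to be 32-byte aligned (so GCC can emit aligned
 * AVX loads) and the outer loop is parallelized with OpenMP. */
void gemv_cpu_mp(const float * __restrict__ M,
                 const float * __restrict__ V,
                 float       * __restrict__ R)
{
    const float *Ma = __builtin_assume_aligned(M, 32);
    const float *Va = __builtin_assume_aligned(V, 32);

    #pragma omp parallel for
    for (int y = 0; y < HEIGHT; y++) {
        double sum = 0.0;
        for (int x = 0; x < WIDTH; x++)
            sum += (double)(Ma[(size_t)y * WIDTH + x] * Va[x]);
        R[y] = (float)sum;
    }
}

/* The matching host allocation would use an aligned allocator, e.g.:
 *   float *M;
 *   posix_memalign((void **)&M, 32, sizeof(float) * (size_t)HEIGHT * WIDTH);
 */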
/* gemv multiplies matrix M (H x W) by
vector V (W x 1) and stores the result
in vector R (H x 1). */
__global__ void gemv(const real_t * __restrict__ M,
const real_t * __restrict__ V,
real_t * __restrict__ R)
{
int x;
__shared__ double partial_sums[THREAD];
double sum = 0.0;
float product;
#pragma unroll
for(x = threadIdx.x; x < W; x += blockDim.x) {
product = (float)M[blockIdx.x*W+x] * (float)V[x];
sum += (double) product;
}
partial_sums[threadIdx.x] = sum;
__syncthreads();
// Perform parallel reduction on partial_sums.
// Code is omitted for brevity. The result
// of the parallel reduction is stored in
// partial_sums[0]
if(threadIdx.x == 0)
R[blockIdx.x] = partial_sums[0];
}
Figure 9 Mixed-precision CUDA implementation of GEMV
4.4.2 Variant 2: GPU-MP
The second variant is implemented using CUDA as shown in Figure 9. It applies, next to
mixed-precision arithmetic, memory coalescing and parallel reduction [1]. Both techniques
are widely used in GPUs to achieve higher performance. In addition, it uses the GPU shared
memory to achieve very short access time to the partial_sums array.
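For completeness, the reduction omitted from Figure 9 can be realized with a standard shared-memory tree reduction. The sketch below is our completion of Figure 9, assuming real_t is float and that THREAD (the thread-block size) is a power of two; neither assumption is stated in the paper.

// Figure 9 with the omitted reduction filled in (our completion).
#define THREAD 256
typedef float real_t;

__global__ void gemv_mp(const real_t * __restrict__ M,
                        const real_t * __restrict__ V,
                        real_t * __restrict__ R, int W)
{
    __shared__ double partial_sums[THREAD];
    double sum = 0.0;

    // One matrix row per block; threads stride over the columns (coalesced).
    for (int x = threadIdx.x; x < W; x += blockDim.x)
        sum += (double)((float)M[(size_t)blockIdx.x * W + x] * (float)V[x]);

    partial_sums[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory: each step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial_sums[threadIdx.x] += partial_sums[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        R[blockIdx.x] = partial_sums[0];
}

A typical launch uses one thread-block per matrix row, e.g. gemv_mp<<<H, THREAD>>>(M, V, R, W) for an H × W matrix.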
5 Evaluation Results
We have implemented a Proof-of-Concept (PoC) version of the WHFF model that runs the
model for a single dimension. The PoC HW uses Intel Xeon E5-2660 V4 clocked at 2.0 GHz
interconnected to an NVIDIA P100 (16 GB HBM) using 16-lane PCI-Express Gen3. For the
SW, we use RedHat Enterprise Linux (RHEL) 7 together with Intel MKL 11.3.1.150, GCC
4.9.3, and CUDA 8.0.61.
Table 3 Decompression performance on NVIDIA P100 for fixed accuracy and precision modes
using the Brain dataset from [8]
Mode Compression Ratio R Throughput Wc (GB/s)
Fixed accuracy 11.1 33.2
Fixed precision 9.3 35.7
In the PoC, we evaluated the two mixed-precision matrix-vector multiplication schemes
on our proposed system. To judge the performance of our two variants implemented in
Section 4.4, we use the following two standard BLAS implementations of gemv:
1. Intel MKL: Uses GEMV from Intel’s Math Kernel Library (MKL) running on the host
CPU.
2. cuBLAS: Uses GEMV from NVIDIA’s cuBLAS library running on the GPU.
It is important to note that the two BLAS implementations do not support mixed-precision
arithmetic for BLAS Level-2 routines. Therefore, we use the double-precision version of
GEMV (called DGEMV in contrast to the single-precision version SGEMV).
5.1 GEMV Performance
Figure 10 shows the throughput (in GFLOP/s) of the different GEMV variants. The first
observation is that all vendor-provided variants fall short of the 195 GFLOP/s threshold
shown in Figure 3 which is needed to execute the WHFF model within its time budget.
The maximum achieved performance on P100 was obtained using the GPU-MP (CUDA)
variant which is able to provide a sustained 198 GFLOP/s. This finding is interesting since
the common wisdom is to use vendor-provided routines (e.g., MKL and cuBLAS) as they
represent the most optimized implementation for the underlying HW. However, in this specific
case, using a custom mixed-precision implementation tailored for the application proved to
be the right choice.
Another key observation from Figure 10 is the better performance of CUDA over OpenCL
for wide and tall matrices. For square ones, OpenCL is slightly better. Another interesting
observation is the poor performance of cuBLAS on wide matrices. This poor performance is
due to the fact that the cuBLAS GEMV kernel is parallelized along the row direction with a 1D
thread-block, and in wide matrices the row dimension is too small compared to the number of
CUDA cores and SMs.
5.2 Decompression Performance
Table 3 shows the decompression performance under P100 using the Brain dataset from [8].
The decoder is able to achieve around 33-35 GB/s with compression ratio R ≈ 10. The
impact of applying decompression is shown in Figure 11. Applying decompression proves
to be beneficial as it reduces the total execution time by 47%. However, the decompressor
contribution to the total execution time is quite high (52%). This high overhead is due to
the following reasons:
1. The decoding phase in zfp is inherently sequential. The only parallelization that can be
achieved is through dispatching "chunks" of compressed bitstream to each CUDA thread.
Future work should focus on either: (i) implementing such parallelization and finding
the suitable chunk size for the application and underlying HW, or (ii) investigating other
"GPU-friendly" compression schemes.
Figure 10 Results of evaluating different GEMV variants under different matrix layouts. [Bar
chart of GFLOP/s for MKL (DGEMV), cuBLAS (DGEMV), CPU-MP (AVX+OpenMP), GPU-MP
(OpenCL), and GPU-MP (CUDA), each on a wide matrix (378 × 256K), a tall matrix (256K × 378),
and a square matrix (10K × 10K).]
Figure 11 Execution time breakdown (computation, transfer, decompression) before and after
applying decompression. Total execution time is reduced by around 47% after applying
decompression.
2. The current CUDA implementation is still experimental and has room for optimizations
which were not implemented in this work.
We also believe that the advent of faster and more powerful GPUs will increase the decoder
throughput and hence reduce its overhead.
6 Conclusions
A GPU-based firm real-time system for executing the WHFF model is proposed. The
system is based on COTS HW components. To run the model within the given constraints,
compression, mixed-precision arithmetic, and HBM-enabled GPUs are used. The system is
capable of achieving around 198 GFLOP/s for mixed-precision GEMV. Data compression
helps alleviate the memory bandwidth bottleneck between system memory and the GPUs. Future work should focus on
devising GPU-friendly data decompression schemes.
References
1 CUDA C Best Practices Guide. last accessed on August 23, 2018. URL: https://docs.nvidia.
com/cuda/cuda-c-best-practices-guide/index.html.
2 Intel® Math Kernel Library (Intel MKL). last accessed on August 20, 2018. URL: https:
//software.intel.com/en-us/mkl.
3 NVIDIA cuBLAS. last accessed on August 20, 2018. URL: https://developer.nvidia.com/
cublas.
4 Tesla P100 Data Center Accelerator. last accessed on August 13, 2018. URL: https://www.
nvidia.com/en-us/data-center/tesla-p100/.
5 Tesla V100 Data Center GPU. last accessed on August 13, 2018. URL: https://www.nvidia.
com/en-us/data-center/tesla-v100/.
6 High Bandwidth Memory (HBM) DRAM, Nov 2015. last accessed on August 8, 2018. Requires
membership. URL: https://www.jedec.org/sites/default/files/docs/JESD235A.pdf.
7 Marc Baboulin, Alfredo Buttari, Jack Dongarra, Jakub Kurzak, Julie Langou, Julien Langou,
Piotr Luszczek, and Stanimire Tomov. Accelerating scientific computations with mixed
precision algorithms. Computer Physics Communications, 180(12):2526–2533, 2009. URL:
https://doi.org/10.1016/j.cpc.2008.11.005, doi:10.1016/j.cpc.2008.11.005.
8 M. Burtscher and P. Ratanaworabhan. FPC: A High-Speed Compressor for Double-Precision
Floating-Point Data. IEEE Transactions on Computers, 58(1):18–31, Jan 2009. URL: https:
//doi.org/10.1109/TC.2008.131, doi:10.1109/TC.2008.131.
9 M. Chitchian, A. Simonetto, A. S. van Amesfoort, and T. Keviczky. Distributed Computation
Particle Filters on GPU Architectures for Real-Time Control Applications. IEEE Transactions
on Control Systems Technology, 21(6):2224–2238, Nov 2013. URL: https://doi.org/10.1109/
TCST.2012.2234749, doi:10.1109/TCST.2012.2234749.
10 G. A. Elliott and J. H. Anderson. Real-World Constraints of GPUs in Real-Time Systems. In
Proceedings of the IEEE 17th International Conference on Embedded and Real-Time Computing
Systems and Applications, volume 2, pages 48–54, Aug 2011. URL: https://doi.org/10.
1109/RTCSA.2011.46, doi:10.1109/RTCSA.2011.46.
11 David Goldberg. What Every Computer Scientist Should Know About Floating-point Arith-
metic. ACM Computing Surveys, 23(1):5–48, March 1991. URL: https://doi.org/10.1145/
103162.103163, doi:10.1145/103162.103163.
12 D. Hallmans, M. Åsberg, and T. Nolte. Towards using the Graphics Processing Unit (GPU)
for embedded systems. In Proceedings of 2012 IEEE 17th International Conference on
Emerging Technologies Factory Automation (ETFA 2012), pages 1–4, Sept 2012. URL:
https://doi.org/10.1109/ETFA.2012.6489715, doi:10.1109/ETFA.2012.6489715.
13 D. Hallmans, K. Sandström, M. Lindgren, and T. Nolte. GPGPU for industrial control systems.
In Proceedings of the IEEE 18th Conference on Emerging Technologies Factory Automation
(ETFA), pages 1–4, Sept 2013. URL: https://doi.org/10.1109/ETFA.2013.6648166, doi:
10.1109/ETFA.2013.6648166.
14 P. Lindstrom. Fixed-Rate Compressed Floating-Point Arrays. IEEE Transactions on Visu-
alization and Computer Graphics, 20(12):2674–2683, Dec 2014. URL: https://doi.org/10.
1109/TVCG.2014.2346458, doi:10.1109/TVCG.2014.2346458.
15 T. Maceina, P. Bettini, G. Manduchi, and M. Passarotto. Fast and Efficient Algorithms
for Computational Electromagnetics on GPU Architecture. IEEE Transactions on Nuclear
Science, 64(7):1983–1987, July 2017. URL: https://doi.org/10.1109/TNS.2017.2713888,
doi:10.1109/TNS.2017.2713888.
16 T. J. Maceina and G. Manduchi. Assessment of General Purpose GPU Systems in Real-
Time Control. IEEE Transactions on Nuclear Science, 64(6):1455–1460, June 2017. URL:
https://doi.org/10.1109/TNS.2017.2691061, doi:10.1109/TNS.2017.2691061.
17 D. van den Hurk, S. Weiland, and K. van Berkel. Modeling and localized feedforward control
of thermal deformations induced by a moving heat load. In Proceedings of the 2018 SICE
International Symposium on Control Systems (SICE ISCS), pages 171–178, March 2018.
URL: https://doi.org/10.23919/SICEISCS.2018.8330172, doi:10.23919/SICEISCS.2018.
8330172.
18 Jan van Schoot, Kars Troost, Frank Bornebroek, Rob van Ballegoij, Sjoerd Lok, Peter
Krabbendam, Judon Stoeldraijer, Erik Loopstra, Jos P. Benschop, Jo Finders, Hans Meiling,
Eelco van Setten, Bernhard Kneer, Peter Kuerz, Winfried Kaiser, Tilmann Heil, Sascha
Migura, and Jens Timo Neumann. High-NA EUV lithography enabling Moore’s law in the
next decade. In Proceedings Volume 10450, International Conference on Extreme Ultraviolet
Lithography 2017, volume 10450, 2017. URL: https://doi.org/10.1117/12.2280592, doi:
10.1117/12.2280592.
19 Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An Insightful Visual
Performance Model for Multicore Architectures. Commun. ACM, 52(4):65–76, April 2009.
URL: http://doi.acm.org/10.1145/1498765.1498785, doi:10.1145/1498765.1498785.
20 S. Windmann and O. Niggemann. A GPU-based method for robust and efficient fault
detection in industrial automation processes. In Proceeings of the IEEE 14th International
Conference on Industrial Informatics (INDIN), pages 442–445, July 2016. URL: https:
//doi.org/10.1109/INDIN.2016.7819201, doi:10.1109/INDIN.2016.7819201.
21 Ming Yang, Nathan Otterness, Tanya Amert, Joshua Bakita, James H. Anderson, and
F. Donelson Smith. Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in
Autonomous Systems. In Proceedings of the 30th Euromicro Conference on Real-Time Systems
(ECRTS 2018), volume 106, pages 20:1–20:21, 2018. URL: https://doi.org/10.4230/LIPIcs.
ECRTS.2018.20, doi:10.4230/LIPIcs.ECRTS.2018.20.
