Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures by Karunanithi, Kamesh Arumugam
Old Dominion University 
ODU Digital Commons 
Computer Science Theses & Dissertations Computer Science 
Fall 2017 
Efficient Machine Learning Approach for Optimizing Scientific 
Computing Applications on Emerging HPC Architectures 
Kamesh Arumugam Karunanithi 
Old Dominion University, a.k.kamesh001@gmail.com 
Follow this and additional works at: https://digitalcommons.odu.edu/computerscience_etds 
 Part of the Computer Sciences Commons 
Recommended Citation 
Karunanithi, Kamesh A.. "Efficient Machine Learning Approach for Optimizing Scientific Computing 
Applications on Emerging HPC Architectures" (2017). Doctor of Philosophy (PhD), Dissertation, Computer 
Science, Old Dominion University, DOI: 10.25777/s49t-1525 
https://digitalcommons.odu.edu/computerscience_etds/33 
This Dissertation is brought to you for free and open access by the Computer Science at ODU Digital Commons. It 
has been accepted for inclusion in Computer Science Theses & Dissertations by an authorized administrator of 
ODU Digital Commons. For more information, please contact digitalcommons@odu.edu. 
EFFICIENT MACHINE LEARNING APPROACH FOR
OPTIMIZING SCIENTIFIC COMPUTING




B.E. July 2010, Visvesvaraya Technological University, India
A Dissertation Submitted to the Faculty of
Old Dominion University in Partial Fulfillment of the











EFFICIENT MACHINE LEARNING APPROACH FOR
OPTIMIZING SCIENTIFIC COMPUTING APPLICATIONS ON
EMERGING HPC ARCHITECTURES
Kamesh Arumugam Karunanithi
Old Dominion University, 2017
Co-Directors: Dr. Mohammad Zubair
Dr. Desh Ranjan
Dr. Baľsa Terzić
Efficient parallel implementations of scientific applications on multi-core CPUs
with accelerators such as GPUs and Xeon Phis is challenging. This requires - exploi-
ting the data parallel architecture of the accelerator along with the vector pipelines
of modern x86 CPU architectures, load balancing, and efficient memory transfer
between different devices. It is relatively easy to meet these requirements for highly-
structured scientific applications. In contrast, a number of scientific and engineering
applications are unstructured. Getting performance on accelerators for these appli-
cations is extremely challenging because many of these applications employ irregular
algorithms which exhibit data-dependent control-flow and irregular memory acces-
ses. Furthermore, these applications are often iterative with dependency between
steps, and thus making it hard to parallelize across steps. As a result, parallelism in
these applications is often limited to a single step. Numerical simulation of charged
particles beam dynamics is one such application where the distribution of work and
memory access pattern at each time step is irregular. Applications with these pro-
perties tend to present significant branch and memory divergence, load imbalance
between different processor cores, and poor compute and memory utilization. Prior
research on parallelizing such irregular applications have been focused around op-
timizing the irregular, data-dependent memory accesses and control-flow during a
single step of the application independent of the other steps, with the assumption
that these patterns are completely unpredictable. We observed that the structure of
computation leading to control-flow divergence and irregular memory accesses in one
step is similar to that in the next step. It is possible to predict this structure in the
current step by observing the computation structure of previous steps.
In this dissertation, we present novel machine learning based optimization techni-
ques to address the parallel implementation challenges of such irregular applications
on different HPC architectures. In particular, we use supervised learning to predict
the computation structure and use it to address the control-flow and memory access
irregularities in the parallel implementation of such applications on GPUs, Xeon Phis,
and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon
Phis. We use numerical simulation of charged particles beam dynamics simulation
as a motivating example throughout the dissertation to present our new approach,
though they should be equally applicable to a wide range of irregular applications.
The machine learning approach presented here use predictive analytics and forecas-
ting techniques to adaptively model and track the irregular memory access pattern
at each time step of the simulation to anticipate the future memory access pattern.
Access pattern forecasts can then be used to formulate optimization decisions du-
ring application execution which improves the performance of the application at a
future time step based on the observations from earlier time steps. In heterogeneous
architectures, forecasts can also be used to improve the memory performance and
resource utilization of all the processing units to deliver a good aggregate perfor-
mance. We used these optimization techniques and anticipation strategy to design a
cache-aware, memory efficient parallel algorithm to address the irregularities in the
parallel implementation of charged particles beam dynamics simulation on different
HPC architectures. Experimental result using a diverse mix of HPC architectures
shows that our approach in using anticipation strategy is effective in maximizing
data reuse, ensuring workload balance, minimizing branch and memory divergence,
and in improving resource utilization.
iv
Copyright, 2017, by Kamesh Arumugam Karunanithi, All Rights Reserved.
v
ACKNOWLEDGMENTS
The work presented in this dissertation, as well as other work completed during
my graduate career, would have not have been possible without the support of the
following people -
• I would like to thank Dr. Mohammad Zubair who decided to supervise my
doctorate degree and share his experience and knowledge of high performance
computing. His ability to ask fundamental questions to understand the given
problem has helped me to develop a rational thought process.
• I would like to thank Dr. Desh Ranjan who decided to co-supervise my
doctorate degree despite his many other academic and administrative com-
mitments. His capability of proposing interesting solutions to problems and
proving correctness of the proposed solution always inspired and motivated
me.
• I would like to thank Dr. Baľsa Terzić for co-supervising my doctorate degree,
and for being a constant source of real world scientific problems that have
given my dissertation its purpose with immense practical value. His patience
and intuitive explanations of the complex computational physics problems have
sparked in me a newfound interest in computational sciences.
• I would like to thank Dr. Andrey Chernikov for giving precious advice and
participating in my final defense committee.
• I would like to thank my parents, sister, and friends for their continued sup-
port and understanding when my PhD has led to periods of reduced social
interaction and contact.
This work was supported in part by National Science Foundation through grant
1535641, Jefferson Science Associates Project No. 712336 and the U.S. Department of
Energy (DOE) Contract No. DE-AC05-06OR23177, and Old Dominion University’s
Modeling and Simulation Graduate Research Fellowship during 2013-2016. I would
also like to acknowledge the support of NVIDIA Corporation for the donation of




LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 AIM OF THE THESIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 THESIS CONTRIBUTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 THESIS ORGANIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 OVERVIEW OF PARALLEL ARCHITECTURES . . . . . . . . . . . . . . . . . 9
2.2 PARALLEL COMPUTING CHALLENGES . . . . . . . . . . . . . . . . . . . . . . 12
2.3 IRREGULAR ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 BEAM DYNAMICS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 NUMERICAL INTEGRATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 BEAM DYNAMICS SIMULATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1 OUTLINE OF THE ALGORITHM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 SEQUENTIAL SIMULATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 PRIOR RESEARCH IN PARALLEL SIMULATION . . . . . . . . . . . . . . . 50
4 EFFICIENT PARALLEL SIMULATION ON GPUS . . . . . . . . . . . . . . . . . . . . . . . 65
4.1 MODELING ACCESS PATTERNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 PARALLEL ALGORITHM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 EVALUATION AND EXPERIMENTAL RESULTS . . . . . . . . . . . . . . . . 75
5 EFFICIENT PARALLEL SIMULATION ON HETEROGENEOUS ARCHI-
TECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.1 HETEROGENEOUS ALGORITHM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 PERFORMANCE RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
REFERENCES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
APPENDICES
A ADAPTIVE MULTI-DIMENSIONAL INTEGRATION. . . . . . . . . . . . . . . . . . . . . 110
A.1 OVERVIEW OF CUHRE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
A.2 GPU-ACCELERATED PARALLEL ALGORITHM . . . . . . . . . . . . . . . . 111
vii
B PERFORMANCE ANALYSIS TOOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
B.1 ROOFLINE PERFORMANCE MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . 115
B.2 PROFILER METRICS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116




1 Specifications of three supercomputers from world’s top-5 (as of June 2017). 2
2 Newton-Cotes formulae for different degrees. . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Performance of Two-Phase-RP-Kernel for computing retarded po-
tentials in a beam dynamics simulation with 100000 particles and for
different grid resolutions on NVIDIA Tesla K40 GPU. . . . . . . . . . . . . . . . . . 53
4 Performance of Heuristics-RP-Kernel for computing retarded poten-
tials in a beam dynamics simulation with 100000 particles and for different
grid resolutions on NVIDIA Tesla K40 GPU. . . . . . . . . . . . . . . . . . . . . . . . . . 62
5 Performance of Predictive-RP-Kernel for computing retarded poten-
tials in a beam dynamics simulation with 100000 particles and for different
grid resolutions on NVIDIA Tesla K40 GPU. . . . . . . . . . . . . . . . . . . . . . . . . . 78
6 Execution time of compute retarded potentials stage of the simulation
using Predictive-RP-Kernel for different simulation configurations
on NVIDIA Tesla K40 GPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7 Speedup of Predictive-RP-Kernel compared against Two-Phase-
RP-Kernel and Heuristics-RP-Kernel for different simulation con-
figurations on NVIDIA Tesla K40 GPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8 Execution time of the parallel implementation of compute retarded po-
tentials stage of the simulation on Machine-G in GPU-only execution
mode for different simulation configurations with varying number of Tesla
K40 GPUs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9 Execution time of the parallel implementation of compute retarded poten-
tials stage of the simulation on Machine-X in Xeon Phi-only execution
mode for different simulation configurations with varying number of KNC
Xeon Phi coprocessor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
10 Performance comparison of the parallel implementation of compute retar-
ded potentials stage on different HPC architectures - (a) Multi-core CPU
with 20 cores (using the CPU from Machine-X which is the fastest of the
two available multi-core CPUs), (b) Tesla K40 GPU from Machine-G,
and (c) KNC Xeon-Phi from Machine-X. . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
ix
11 Performance of the heterogeneous implementation of compute retarded
potentials stage of the simulation on Machine-G which is composed of
multi-core CPU and four NVIDIA Tesla K40 GPUs. . . . . . . . . . . . . . . . . . . . 97
12 Performance of the heterogeneous implementation of compute retarded
potentials stage of the simulation on Machine-X which is composed of




1 CUDA programming model for general purpose computing on NVIDIA
GPUs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Memory latency divergence when the addresses accessed by SIMD group
are scattered in the memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Memory latency divergence when: (a) the addresses accessed by SIMD
group are uniform (all equal), (b) the addresses accessed by SIMD group
are consecutive. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Control-flow divergence in GPUs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5 Conceptual view of charged particles beam dynamics on a 2D plane. . . . . . 23
6 Simulation of charged particles beam dynamics. (a) Particle distribution
for charged particle beam at time t0 is generated from Monte Carlo sam-
pling of initial DF of N particles with a total charge of Q. (b) - (e) The
charged particles in the beam emits synchrotron radiation when forced to
travel along a curved trajectory under the influence of a bending force of a
accelerator magnets (referred to as Bending Magnet in the above figure).
(f) The radiations emitted from all the earlier time steps catch up with
the charged particle beam at time tk = k∆t for all integers k > 0, and
these synchrotron radiations leads to hazardous self-interactions. . . . . . . . . 24





0.005 dx using Simp-
son’s quadrature rule with error tolerance τ = 0.001. . . . . . . . . . . . . . . . . . . 29
8 Outline of charged particle beam dynamics simulation algorithm. . . . . . . . 33
9 Spatial-grid enclosing the particle distribution at a particular time step
of the simulation. Its size is determined by the outliers of the distribution
along the principal axes. Blue line denotes the design orbit. . . . . . . . . . . . . 35
10 Steps in particle deposition stage of the beam dynamics simulation. . . . . . 36
11 Evaluation of integration limits in rp-integral. . . . . . . . . . . . . . . . . . . . . . . . . 38
12 Numerical approximation of integrand values in rp-integral using interpo-
lation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
xi
13 Percentage of sequential execution time spent by different stages of the
beam dynamics simulation per time step averaged over all time steps for
various grid resolutions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
14 Computational workload characteristics of beam dynamics simulation at
different time steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
15 Computational workload characteristics of beam dynamics simulation at
different grid points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
16 Control-flow properties of rp-integral at different grid points. . . . . . . . . . . . 47
17 Memory access properties of rp-integral evaluation at different grid points. 49
18 Roofline model analysis for Two-Phase-RP-Kernel on NVIDIA Tesla
K40 GPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
19 Roofline model analysis for Heuristics-RP-Kernel on NVIDIA Tesla
K40 GPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
20 Analytic versus computed effective longitudinal (left) and transverse (right)
forces for the LCSL bend [47]: N = 1000000 particles on a 128×128 grid,
bend radius R0 = 25.13 m, θb = 11.4
◦, longitudinal rms beam size σs = 50
µm, emittance ε = 1 nm, and total beam charge of Q = 1nC. . . . . . . . . . . 77
21 Mean-square error for the longitudinal force, as defined in the text, as a
function of the number of particles per cell Nppc = N/Ngrid, for a fixed
grid of 128× 128 (or Ngrid = 1282). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
22 Roofline model analysis for Predictive-RP-Kernel on NVIDIA Tesla
K40 GPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
23 Roofline model analysis of Predictive-RP-Kernel (red line) compa-
red against Heuristics-RP-Kernel (green line) and Two-Phase-RP-
Kernel (grey line) on NVIDIA Tesla K40 GPU. . . . . . . . . . . . . . . . . . . . . . 80
24 Efficiency of the heterogeneous implementation of compute retarded po-
tentials stage of the simulation on Machine-G which is composed of
multi-core CPU and four NVIDIA Tesla K40 GPUs. . . . . . . . . . . . . . . . . . . . 97
25 Efficiency of the heterogeneous implementation of compute retarded po-
tentials stage of the simulation on Machine-X which is composed of
multi-core CPU and two KNC Xeon Phi coprocessor. . . . . . . . . . . . . . . . . . . 98




Heterogeneous architectures are now becoming ubiquitous in the computing sys-
tems ranging from supercomputers to embedded systems. These architectures inte-
grate different types of processing units (PUs) with different hardware and perfor-
mance characteristics, for example, multi-core CPUs, GPUs, Intel Many Integrated
Core (MIC), and FPGAs. These processing units have the potential to improve per-
formance for many scientific applications [38]. At present, a large fraction of Top500
[72] and Green500 [31] supercomputers now use heterogeneous computing architectu-
res. Accelerators, in particular, have played a significant role in the evolution of these
architectures. In fact, systems composed of multi-core CPUs and different types of
accelerators, like NVIDIA GPUs and Intel MIC (e.g. Xeon Phis), are among the most
common type of heterogeneous architectures in world’s top supercomputers [72]. For
instance, three of top-5 supercomputers in the world are heterogeneous architectures
with hardware accelerators: Tianhe-2 (Milkyway-2) employs Intel Xeon Phi many-
core accelerators [57], Piz Daint [74] and ORNL Titan [29] employ NVIDIA Tesla
GPUs. Table 1 gives a breakdown on the hardware specification of different PUs in
these three systems along with their theoretical performance in PFlop/s. The pre-
sence of accelerators in these systems boost the overall raw performance as well as the
price-to-performance and power-to-performance ratios of these systems when com-
pared to the traditional symmetric CPU architectures. With power consumption as
one of the major design constrains in todays computing systems, this trend towards
heterogeneous architectures comprising of multi-core CPUs with GPUs, Xeon-Phi
and other accelerators will continue.
The vastly different architectures and programming models of multi-core CPUs
and accelerators, however, present several challenges in achieving good performance.
Optimizing overall application performance requires taking into account the indivi-
dual characteristics of the PUs. For instance, multi-core CPUs use fewer cores, which
are typically out-of-order, multi-instruction issue cores which run at high-frequency
and use large-sized caches to minimize the latency of a single thread. This makes
2
Tianhe-2 Piz Daint Titan
CPU cores per node 24 12 16
Accelerators per node 3 1 1
Number of nodes 16000 5320 18688
Total system memory (TB) 1024 340 710
Theoretical Peak Performance (PFlop/s) 54.90 25.33 27.11
Power consumption (MW) 17.81 2.27 8.21
Table 1: Specifications of three supercomputers from world’s top-5 (as of June 2017).
CPUs more suitable for latency-critical applications. In contrast, GPUs use thou-
sands of cores, which are in-order cores that share their control unit and are designed
for handling multiple tasks simultaneously where the memory access latency is ty-
pically hidden with calculations instead of big data caches. This makes GPUs more
suitable for throughput-oriented applications which are typically expressed as data-
parallel computations - the same program is executed on many data elements in
parallel. For this reason, conventional architecture specific optimizations techniques
alone may not work well in a heterogeneous system and hence, novel techniques are
required to realize the potential and promise of heterogeneous computing. In other
words, performance on heterogeneous architecture can only be achieved if the appli-
cation workload is partitioned and mapped to the PUs such that all the PUs are best
utilized and combined to deliver good aggregate performance.
Heterogeneous architectures composed of multi-core CPUs and accelerators are
most effective in accelerating applications with dense and highly-structured worklo-
ads common in many problem domains ranging from graphics applications to mo-
lecular dynamics simulation and climate modeling. This is because many of these
applications use regular algorithms that operate on structured data like large vectors
or matrices, and access them in statically predictable ways which fit well on these ar-
chitectures where they can exploit the data-parallel, single-instruction multiple data
(SIMD) nature of the accelerators and the vectorization support of x86 CPUs to
improve performance. In particular, these algorithms exhibit high computational de-
mands, extensive data parallelism, access memory in a streaming fashion, and require
little synchronization. These characteristics in an application make them effective
in achieving coalesced memory accesses, minimizing thread divergence and synchro-
nization, improving SIMD and vector pipeline utilizations, etc., which are some of
3
the important factors to achieve good performance on many parallel architectures.
Moreover, there exist a plethora of optimization techniques, methods, and languages
models to achieve efficient parallelization of regular algorithms [44, 34, 55], and their
implementation on GPUs and Intel MIC can be at least an order of magnitude faster
than fine-tuned parallel CPU version [24].
However, a number of scientific and engineering applications are unstructured.
Getting performance on accelerators for these applications is extremely challenging
because many of these applications employ algorithms which exhibit data-dependent
control-flow and memory accesses that are not readily amenable to these architec-
tures. Algorithms with these properties are said to be irregular, and pose problems
for high-performance parallel implementations due to the following characteristics in
them -
• Irregular algorithms often demonstrate significant memory access irregularity
which leads to severe performance bottlenecks on SIMD architectures [18, 14].
The data-dependent memory accesses in these programs tend to have less spa-
tial locality compared to traditional graphics and regular general-purpose ap-
plications.
• Input values in these algorithms determine the program’s runtime behavior,
which therefore cannot be statically predicted. These properties in the al-
gorithm pose problems for high-performance parallel implementations, where
equal distribution of work over processor cores and locality of reference are
required within each cache sharing processor core.
• Performance of applications on data-parallel, SIMD architectures relies on high
SIMD lane occupancy and efficient memory coalescing for inter-thread data
locality, where the former requires minimal divergent branching for threads in
a SIMD group, while the latter requires regular memory access patterns and
data structure layouts [23, 59]. Unfortunately, irregular algorithms tend to
present both significant branch and memory divergence which leads to severe
performance bottlenecks.
Irregular algorithms are the core of many scientific computing applications that
arise from several domains of science and engineering. Some of the well-known appli-
cations that employ irregular algorithms are charged particles beam dynamics simu-
lation [77, 39], n-body simulations [9], data mining [75], Boolean satisfiability [17],
4
social networks [37], system modeling [66], compilers [2], meshing [25], and discrete-
event simulation [54]. The irregular nature of the underlying algorithms makes these
applications difficult to parallelize and more challenging to map to modern parallel
architectures. Several efficient implementations for some of these applications using
accelerators and other parallel architectures have been published in recent literature,
demonstrating that individually most of these architectures are capable of accelera-
ting at least some irregular applications relative to the CPUs [19, 51, 53]. However,
developing parallel implementations for these application using collaborative compu-
ting on heterogeneous architectures to deliver the best aggregate performance from
all the PUs remains a challenge.
1.1 AIM OF THE THESIS
Efficient parallel implementations of scientific applications on multi-core CPUs
with accelerators such as GPUs and Xeon Phis is challenging. This requires - exploi-
ting the data parallel architecture of the accelerator along with the vector pipelines of
modern x86 CPU architectures (using 128-bit Streaming SIMD Extensions (SSE) or
256-bit Advanced Vector Extensions (AVX)), load balancing, and efficient memory
transfer between different devices. It is relatively easy to meet these requirements for
highly-structured scientific applications. In contrast, getting good performance on
accelerators for unstructured applications that employ irregular algorithms is extre-
mely challenging, thereby making their efficient parallel implementation a daunting
task. Furthermore, these applications are often iterative with dependency between
steps, and thus making it hard to parallelize across steps. As a result, the parallelism
in these applications is often limited to a single step.
Numerical simulation of charged particles beam dynamics is one such irregular
application that has gained increased interest in computational physics, especially in
recent years, as these simulations are crucial in understanding and the design of: (i)
high-brightness synchrotron light sources - powerful tools for cutting-edge research
in physics, biology, medicine and other fields, and (ii) electron-ion particle colliders,
which probe the nature of matter at unprecedented depths. This application simu-
lates the time evolution of charged particles in particle accelerator by computing
the collective effects (e.g. forces from self-interaction, forces from external magnetic
fields, and so on) acting on individual particles of a beam for a few hundreds or
thousands of time steps where the computation of collective effects at each time step
5
is irregular. In particular, distribution of work and data in the accurate computation
of collective effects at each time step of the simulation is highly unstructured and
cannot be characterized a priori, as these quantities are input-dependent and evolve
with the computation itself. To obtain high performance in such irregular algorithms
is extremely challenging, and to this end, much effort has been devoted in the de-
velopment of suitable algorithms that enable unprecedented fidelity and precision in
the study of collective effects [10]. However, implementing algorithms with such high-
accuracy and resolution have proven to be extremely challenging due to the data- and
compute-intensive nature of the underlying numerical methods [47, 46, 42, 16, 49].
Consequently, many of the existing algorithms employ a number of approximations
and simplifications to reduce the computational load [16, 49]. This improves the
performance while sacrificing the accuracy.
Another well-known irregular application is n-body simulation using the Barnes-
Hut algorithm which computes the gravitational forces acting on n different celes-
tial objects for a number of time steps where each time step simulates a particular
moment in the time evolution of the celestial bodies. Barnes-Hut algorithm hierar-
chically decomposes the space around the celestial bodies into successively smaller
volumes, called grids, and computes summary information for the bodies contained
inside each grid, allowing the algorithm to quickly approximate forces that the n bo-
dies induce upon each other. The hierarchical decomposition is recorded in an octree
data structure and the force calculations at each step require tree-building and repe-
ated traversal of the unbalanced octree which is highly irregular. Other examples of
such irregular scientific applications include - molecular dynamics simulation, finite
elements methods, simulation of wave and sound propagation in 3D objects, etc.
Prior research on parallelizing such applications have been focused around op-
timizing the irregular, data-dependent memory accesses and control-flow during a
single step of the application, independent of the other steps, with the assumption
that these patterns are completely unpredictable [18, 9, 5]. Multiple analysis of these
applications executing irregular workloads for a few hundreds or thousands of steps
show that control-flow and data access patterns made by the irregular algorithm fol-
low a loosely similar pattern between steps. In such situation, one effective approach
to reduce the irregularities is to analyze the control-flow and data access patterns at
each step of the application and then anticipate future data dependence and control-
flow before it is needed. Given the complexity and diversity of control-flow and data
6
access patterns in these applications, we believe anticipation strategies are best rea-
lized via intelligent application-specific prediction models that can adaptively model
and track access patterns. Access pattern forecasts can then be used to make opti-
mization decisions during application execution which improves the performance at
a future step based on the observations from the earlier steps.
In this thesis, we aim to use such predictive analytics and forecasting techniques to
optimize irregular scientific computing applications like beam dynamics simulation on
emerging high-performance computing (HPC) architectures. In particular, we target
on attaining the following two optimization goals while developing efficient parallel
implementations on GPUs, Xeon Phis and heterogeneous architectures composed of
multi-core CPUs with GPUs or Xeon Phis -
• Performance exploitation of individual PUs of heterogeneous systems - Archi-
tecture specific optimizations to reduce the control-flow and memory access
irregularities while mapping these applications on to the parallel architectures.
• Effective workload partitioning between the hybrid mix of PUs of the under-
lying heterogeneous architecture to obtain the best aggregate performance.
To optimize the irregularities, we explore the use supervised learning to adaptively
model and track irregular access patterns in the irregular algorithm at each step
of the application to anticipate the future control-flow and data access patterns.
Access pattern forecasts are then used to formulate runtime decisions that optimize
the irregular computations at a future step based on the observations from earlier
time steps. For example, forecasts can be used to determine computations to thread
mapping that maximize data reuse within a cache sharing thread group and minimize
thread divergence, improve data prefetching, linearize the irregularities, etc. Most
of these runtime decisions improve the performance of the application on each PU
independently. In order to improve the performance in the collective computing
environment of the heterogeneous architecture, the forecasts are used to create and
distribute sub-problems to different PUs of the heterogeneous architecture which
maximizes the resource utilization.
Throughout this thesis, we use numerical simulation of charged particles beam
dynamics simulation as a motivating example to develop and illustrate all the opti-
mization techniques. However, the techniques presented here are equally applicable
7
to a wide range of iterative applications that have characteristics similar to that of
beam dynamics simulation.
1.2 THESIS CONTRIBUTIONS
The main contributions of this thesis are -
• We present supervised learning based optimization techniques to address the
control-flow and memory access irregularities in the parallel implementation of
iterative scientific applications on GPU and Intel MIC architectures. The new
optimization technique uses predictive analytics and forecasting techniques to
adaptively model and track the irregular memory access patterns at each step
of the application to anticipate the future memory access patterns. Access
pattern forecasts are used to make optimization and prefetch decisions during
application execution which improves the performance at a future step based
on observations from earlier steps.
• We present optimization techniques that use machine learning algorithms to
divide the original problem into multiple smaller sub-problems and then dis-
tribute these sub-problems efficiently between different PUs of the underlying
heterogeneous architecture such that it improves the memory performance and
resource utilization of all the PUs and delivers a good aggregate performance.
• We demonstrate all our optimization techniques from previous bullets using
numerical simulation of charged particle beam dynamics simulation which re-
quire execution of irregular workloads for multiple time steps. In particular,
we present a cache-aware and memory efficient parallel algorithm that use the
proposed machine learning based optimization techniques to address the irre-
gularities in the parallel implementation of beam dynamics simulation on hete-
rogeneous architectures composed of GPUs, Xeon Phis and multi-core CPUs.
1.3 THESIS ORGANIZATION
The structure of this thesis as follows:
• Chapter 2 provides a brief overview of several topics that are essential for un-
derstanding the problems addressed in this thesis, namely: overview of different
8
parallel architectures, implementation challenges on GPUs and Xeon Phis, he-
terogeneous computing challenges, irregular algorithms and its implementation
challenges, physical problem of charged particles beam dynamics simulation
and related work, and numerical integration algorithms which is one of the
core irregular algorithm used in beam dynamics simulations and in many other
scientific computing applications.
• Chapter 3 describes the algorithm to numerically simulate charged particles
beam dynamics, its limitation on sequential machines, challenges in developing
efficient parallel implementation, and a brief survey of previous research in
developing parallel algorithms for this problem.
• Chapter 4 presents the memory efficient parallel algorithm that use machine
learning based optimization techniques to address the challenges from irregu-
larities in the parallel implementation of charged particles beam dynamics on
GPUs. Further, it presents a quantitative analysis of its performance on NVI-
DIA Tesla K40 GPU.
• Chapter 5 presents the parallel algorithm and its implementation on hetero-
geneous architectures for the irregular computations in beam dynamics simu-
lation. In addition to addressing the irregularities, the algorithm presented in
this chapter extends the machine learning approach from Chapter 4 to optimize
the resource utilization of all the PUs of underlying heterogeneous architecture.
• Chapter 6 provides a concluding discussion on the optimization techniques and




This chapter provides a brief overview of several topics that are essential for un-
derstanding the problems addressed in this thesis. In particular, Section 2.1 provides
a overview of the parallel architectures used in this study, and Section 2.2 presents
the parallel implementation challenges on these architectures. Next, in Section 2.3,
we discuss irregular algorithms and its parallel implementation challenges. Section
2.4 presents the physical problem of charged particle beam dynamics simulation and
provides a overview of the related work in its numerical simulation methods. Finally,
in Section 2.5, we a take brief look at numerical integration algorithms, which is
the core irregular algorithm used in beam dynamics simulations and in many other
scientific computing applications.
2.1 OVERVIEW OF PARALLEL ARCHITECTURES
In this section, we take a brief look at the typical architecture of two widely
used accelerator - NVIDIA’s General Purpose GPU and Intel’s Xeon Phi coprocessor
architecture.
2.1.1 GENERAL PURPOSE GPUS
At the hardware level, NVIDIA’s GPU architecture is an scalable array of mul-
tithreaded Streaming Multiprocessors (SMs). Each SM features several Streaming
Processor (SP) cores and double-precision logic units (DP unit), where each SP core
is a fully pipelined integer arithmetic logic unit (ALU) and single-precision floating
point unit (FPU). In addition to SP cores and DP units, each SM features (i) lo-
ad/store units, (ii) special function units (SFU) for transcendental instructions such
as sin, cosine, reciprocal, and square root, (iii) schedulers and instruction dispatch
units, (iv) instruction cache, (v) register file, (vi) on-chip shared-memory and L1-
cache, (vii) read-only cache, and (viii) texture units. The size and number of each
unit vary form one generation of GPUs to another. Each core also supports Fu-
sed Multiply-Add (FMA) instructions for both single precision and double precision
10
(a) Grid of thread blocks (b) Memory hierarchy
Figure 1: CUDA programming model for general purpose computing on NVIDIA
GPUs.
floating point operations. GPUs also supports larger off-chip global, constant and
texture memory that are shared among all SMs. The global/texture memory are of-
ten cached and use two-level caching system, where L1-cache is located within each
SM, while the L2-cache is located off-chip and is shared among all the SMs.
Compute Unified Device Architecture (CUDA) is the general purpose parallel
computing platform and programming model used for designing parallel computati-
ons on NVIDIA GPUs. CUDA programming allows the programmer to define functi-
ons, called kernels, that when called, are executed on the GPUs by many different
parallel CUDA threads. The programmer or compiler organizes these CUDA threads
into one-dimensional, two-dimensional, or three-dimensional block of threads, called
a thread block where each thread within a thread block executes an instance of the
kernel. The maximum number of threads in a thread block vary from one generation
of GPUs to another, for example, a thread block may contain up to 1024 CUDA
threads for programming Kepler GPUs. The thread blocks are further organized
into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks
11
as illustrated by Figure 1a. The number of thread blocks in a grid is usually dictated
by the size of the data being processed or the number of processors in the system,
which it can greatly exceed. Thread blocks are executed independently which allows
it to be scheduled in any order across any number of cores as illustrated in [60], and
this enables the programmers to write code that scales with the number of cores.
CUDA threads have access to data from six different memory space during their
course of execution: register memory, constant memory, shared memory, texture
memory, local memory, and global memory. Figure 1b illustrates how CUDA threads
can access data from the different memory spaces. Each thread has private per-
thread registers that are often used to hold frequently accessed data, and these are
not programmer controlled. Each thread has private local memory that is used for
register spills, function calls, and automatic array variables. Each thread block has a
per-block shared memory which is visible to all threads of the block and has the same
lifetime as the block. Shared memory is often used for inter-thread communication
and data sharing in parallel algorithms. The global, constant and texture memory
spaces are accessible from all threads and these three memory spaces are persistent
across kernel launches by the same application.
When a CUDA kernel is invoked by the host CPU, the blocks of threads that
constitute the kernel grid are enumerated and scheduled on the available SMs in any
order, concurrently or sequentially, so that the compiled CUDA kernel can execute
on any number of SMs. Multiple thread blocks can execute concurrently on a single
SM and as the thread blocks terminate, new blocks are launched on the vacated SM.
Each SM employs SIMT architecture to manage and execute hundreds of threads con-
currently [60]. The SM creates, manages, schedules and executes threads in groups
of 32 parallel threads called warps. Individual threads composing a warp start their
execution at the same program address, but they have their own program counter
and register state and are therefore free to branch and execute independently. At
any given time, all threads within a warp execute the same instruction in a lockstep.
As a result, full warp efficiency is realized when all 32 threads of a warp agree on
their execution path, or more formally referred to as control-flow (warp execution
efficiency is the average percentage of active threads in each executed warp). Howe-
ver, the presence of data dependent conditional branch often cause threads within
the same warp to follow different control-flow paths (also known as branch or control-
flow divergence) and this causes the warp to serially execute each branch path taken,
12
disabling threads that are not on that path, and when all paths complete, the thre-
ads converge back to the same execution path. It is important to note that such
branch divergence occurs only within a warp; different warps execute independently
regardless of whether they are executing common or disjoint code paths.
2.1.2 INTEL XEON PHI COPROCESSOR
The Intel Xeon Phi coprocessor is based on the Many Integrated Core (MIC)
architecture. This architecture uses previous concepts developed by Intel for the
Larrabee many core architecture, as well as the Teraflops Research Chip and the
Intel Single-chip Cloud computer. We will discuss one of the Xeon Phi variants, the
Knight Corner (KNC). This is the model used in this thesis. The KNC architecture is
primarily composed of processing cores (more than 50), caches, memory controllers,
PCIe client logic, and a very high bandwidth, bidirectional ring interconnect . Each
core is equipped with (i) Vector processing unit (VPU), (ii) scalar processing unit,
(iii) Extended Math Unit (EMU) for transcendental instructions such as sin, cosine,
reciprocal, and square root, (iv) scalar and vector registers, (v) a L1 cache (data and
instruction), and (vi) a unified L2 cache. Caches within a core are fully coherent and
implement the x86 memory order model. The L1 and L2 caches provide an aggregate
bandwidth that is approximately 15 and 7 times, respectively, faster compared to the
aggregate memory bandwidth.
The Xeon Phi is highly optimized for vector processing and it implements SIMD
execution model in all VPUs. Each VPU features a 512-bit SIMD instructions, as a
replacement for the more commonly found Intel SSE, MMX and AVX instructions.
With 512-bit SIMD instructions, VPU provides data parallelism at a very fine grain,
working on 512 bits of 16 single-precision floats or 16 integers or 8 double-precision
floats at a time. The VPU also supports Fused Multiply-Add (FMA) instructions and
hence can execute 32 single-precision or 16 double-precision floating point operations
per cycle. A more comprehensive overview of Intel Xeon Phi architecture can be
found in [40].
2.2 PARALLEL COMPUTING CHALLENGES
In this section, we first illustrate the impact of data-parallel, SIMD nature of NVI-
DIA’s GPU and Intel’s Xeon Phi architecture on application performance. Next, in
13
Section 2.2.2, we take a brief look at the parallel computing challenges in heteroge-
neous architectures.
2.2.1 ARCHITECTURAL INFLUENCES ON PERFORMANCE
Effectively exploiting data parallelism using SIMD pipelines is one of the most
important aspects in achieving high performance on PUs such as GPU, Intel MIC and
modern x86 architectures. For example, Intel MIC has 512-bit VPU units per core
to expose SIMD parallelism, GPUs by design employ SIMT architecture within each
streaming multiprocessors that execute threads in group of 32 (equivalent to 1024-bit
SIMD for single precision floating point), x86 architectures uses 128-bit Streaming
SIMD Extensions or 256-bit Advanced Vector Extensions (AVX) to support SIMD
parallelism. In these PUs, data parallelism using SIMD pipelines is a power efficient
way of boosting the peak performance, and such parallelism is exploited at different
granularities: usually, several work items (also called threads) are organized into a
work group. These are broken down into several SIMD groups (e.g. warps in GPUs)
that are executed by a SIMD pipeline. Each PU may have multiple SIMD pipelines
to execute SIMD groups in parallel, where effective hardware utilization of the PU
relies largely on exploiting the SIMD pipelines.
Programming applications to maximize the utilizations of SIMD pipelines and to
deliver high-performance of the application code on the PUs remains a challenge. In
particular, the nature of SIMD execution requires that all threads in a SIMD thread
group (e.g., a warp in GPUs) to execute the same instruction in lockstep. While this
allows the processor design in PUs to be relatively simple, application performance
may suffer significantly whenever threads in the same SIMD group behave differently
due to control or memory latency divergence [60]. Control divergence results in
serialized execution of divergent control paths, leaving execution resources idle and
throttling parallelism. Similarly, memory latency divergence causes a SIMD group to
stall until the longest memory request for a vector load completes before executing
any dependent instructions. In this section, we first briefly overview the effects of
control-flow and memory latency divergence on application performance. Then, we
look at the SIMD architectural features that affect control-flow and memory latency
divergence, and those that try to mitigate them.
14
Memory Latency Divergence
When a SIMD thread group of size w issues a load instruction (e.g., w = 32
threads for GPUs, w = 8 double precision loads in Intel Xeon-Phi’s 512-bit VPU),
the SIMD group will block once an instruction dependent upon the load data becomes
the next to issue. As a result, this group of threads is unable to make progress until
all the data for the constituent load instructions are available. In particular, for
SIMT architectures, a single delinquent load can block forward progress of the SIMD
group. This introduces the problem of memory latency divergence, where a SIMD
group can be stalled until the last memory request from a vector load instruction
is returned, potentially long after other memory requests from the vector load have
completed. In many workloads, this load latency cannot be hidden by executing other
SIMD groups. Several studies have highlighted how memory latency divergence can
be a significant performance bottleneck in GPUs and Xeon Phis [52, 68, 22]. This
problem of memory latency divergence is not unique to GPUs (or Xeon Phis) and
can also manifest itself in other SIMD/vector architectures that support “gather”
load operations.
𝑇1 𝑇2 𝑇3 𝑇4 𝑇5 𝑇6 𝑇7 𝑇8
Figure 2: Memory latency divergence when the addresses accessed by SIMD group
are scattered in the memory.
15
𝑇1 𝑇2 𝑇3 𝑇4 𝑇5 𝑇6 𝑇7 𝑇8
(a)
𝑇1 𝑇2 𝑇3 𝑇4 𝑇5 𝑇6 𝑇7 𝑇8
(b)
Figure 3: Memory latency divergence when: (a) the addresses accessed by SIMD
group are uniform (all equal), (b) the addresses accessed by SIMD group are conse-
cutive.
Memory latency divergence arises from several factors. First, the target data for
a set of memory requests from a SIMD group may reside in different levels of memory
hierarchy, and the memory hierarchy service different requests at different times due
to hits and misses at various levels. Second, the time to complete each memory
request depends on several factors, including which DRAM bank the target memory
resides in, contention in the interconnect network, and availability of resources in
the memory controller. Third, current memory systems are primarily optimized to
support traditional, structured workloads with low degree of memory access irregu-
larity. In case of workloads with high degree of irregular or non-coalesced memory
accesses, the memory system serializes the execution of memory references from a
SIMD thread group, which stalls the SIMD group until the last memory reference
is returned. For example, consider a SIMD group of size w that executes a load
instruction. In the worst case, all w addresses are scattered in the address space,
which results in w data loads from the memory. This serializes the execution of load
instructions, which stalls the SIMD group until the last load instruction is returned.
Figure 2 illustrates the memory divergence for load instruction within a SIMD group
of size w = 8 where all the w addresses are scattered in the address space. On the
other hand, if the addresses are uniform (i.e., all equal) or consecutive, only a single
load has to be issued (see Figure 3a and Figure 3b).
16
g l o b a l void ke rne l ( ){
. . . .
i f ( cond i t i on ){
. . . .
Code−Block A
. . . .
} else {
. . . .
Code−Block B
. . . .
}
. . . .
}
(a) CUDA kernel with if − else construct.
















(b) Control-flow for an warp executing the if − else branch condition when half
of warps thread evaluate the branch condition to true.
Figure 4: Control-flow divergence in GPUs.
Control-flow Divergence
Consider a SIMD group of threads executing the if − else construct shown in
Figure 4a. The threads within the SIMD group works with full efficiency when all
the threads execute the if part or all execute the else part. On the other hand, when
threads within a SIMD group take different control-flow paths then the execution of
the group takes multiple passes through the divergent paths. The first pass executes
the threads that follow if part (Code-Block A) and the second pass executes those
that follow the else part (Code-Block B). In other words, each thread from the SIMD
17
group takes both the branch paths sequentially, even though it just executes one of
them. This results in poor to suboptimal hardware utilization which often degrades
the application performance. For example, Figure 4b illustrates the control-flow in
a GPU warp when half of the its thread evaluate the branch condition to true. In
the first pass, only half the threads takes the path for if condition (denoted by black
arrow) and the other half remains disabled (denoted by grey arrow). Similarly, in
the second pass, half the threads take the path for else condition (denoted by black
arrow) and the other half remains disabled (denoted by grey arrow). This reduces
the SMID groups or warp execution efficiency to 50%. Such reduction in execution
efficiency often leads to poor utilization of the compute resources, which severely
degrades the application performance on SIMD architectures [32]. As a result, it is
of vital importance to mitigate the control-flow divergence and subsequently increase
the execution efficiency of all SIMD thread groups for obtaining good application
performance on SIMD architectures.
Mitigating Divergence
The architectural features in GPUs and Intel MIC that affect memory latency
divergence (besides the SIMD execution model illustrated in the previous section)
and those that try to mitigate divergence are -
• Memory Coalescing - A common architectural technique that reduces me-
mory bandwidth demand in SIMD architectures is memory coalescing. In this
technique, the individual requests from a SIMD group are combined, based
on their target address, to form as few cache-line sized (e.g. 128B in GPUs)
requests as possible. Coalescing is primarily designed to reduce bandwidth re-
quirements by eliminating redundant accesses to the same cache line. However,
it also reduces the opportunity for memory latency divergence by minimizing
the number of distinct memory requests per SIMD group. Coalescing is not
effective, however, if the data accessed by the threads in a SIMD group are
not spatially colocated, as is commonly the case in applications that exhibit
memory access patterns that are not readily amenable to the SIMD architec-
ture. Previous studies have shown that coalescing is extremely effective for
traditional structured workloads with uniform memory access patterns, but its
effectiveness falls short for workloads with irregular memory accesses [22].
18
• Multithreading - SIMD architectures leverage thread-level parallelism to hide
memory access latency. The effect of long divergence-induced stalls can be
mitigated if there are enough ready SIMD thread groups in the system to hide
the latency of the slowest request. Previous studies [22, 27, 41] have shown,
however, that in spite of having a large number of thread contexts to choose
from, a SIMD processor core will frequently sit idle as all the SIMD groups
are stalled on pending memory accesses. For instance, recent NVIDIA GPUs
support at most 48 to 64 warps within a SP, while the main memory latency
have been measured to exceed 400 cycles. It is, therefore, important to note that
thread-level parallelism cannot always completely hide main memory latency
[8].
• Caches - Caches have low latency when compared to the global memory.
As a result, memory requests from a SIMD thread group that hit in a cache
are returned sooner even with access irregularities or latency divergence when
compared to servicing the same requests from global memory. This improves
the average memory latency. However, for memory latency divergence to be
meaningfully addressed with caches, a substantial fraction of SIMD groups
must be able to serve all of their memory requests with cache hits. Otherwise,
the cache misses for a SIMD group will be serviced from the global memory,
and faces the latency divergence issues described above.
2.2.2 HETEROGENEOUS COMPUTING CHALLENGES
The presence of different type of PUs with different processing capabilities in-
troduces many challenges to the application design. In particular, the level of he-
terogeneity in these systems introduces non-uniformity in system development, pro-
gramming practices, and overall system capability. These non-uniformities presents
a significant challenge to the programming community in developing applications
that can fully exploit the capabilities of the multiple PUs and in approaching their
combined theoretical performance. Several studies have been devoted to developing
programming models that can exploit these heterogeneous architectures in a unified
way, e.g., OpenCL [43], OpenACC [63] and OpenMP [64]. These models simplify
the programming effort required in exploiting these architectures, however, achie-
ving good performance on heterogeneous architectures still largely depends on (i)
19
performance exploitation of each single PU, and (ii) performance aggregation from
all the PUs. This requires partitioning the application workload and mapping them
efficiently between the PUs such that all the PUs are best utilized and combined to
deliver best aggregate performance.
Performance exploitation of the individual PU is achieved using architecture-
specific optimization strategies, [60, 59, 80]. Researchers have developed several op-
timizations techniques and tools on top of the programming models to extract more
performance on each type of architecture [70, 69, 79]. On the other hand, vastly
different architectures and programming models of different PUs in a heterogeneous
system presents several challenges in optimizing applications to deliver good aggre-
gate performance from all the PUs. Several factors relating to both the PU and the
application itself, and spanning from microarchitecture-level to system-level need to
be taken into account for fully leveraging their potential in heterogeneous compu-
ting systems [55]. In particular, designing effective workload partitioning between a
hybrid mix of PUs is challenging because -
• Heterogeneity of the platform requires partitioning algorithms to fully consi-
der the PUs differences in processing capability, hardware architecture, and
memory model. For example, GPUs and Xeon Phis usually offer higher pro-
cessing throughput than CPUs (on average 2.5X in [45]), but have separate
memory spaces and low data communication bandwidth to the CPU memory
(GPUs are connected with the host CPU through a PCI Express bus). Due to
the interaction between the PUs in a heterogeneous environment, optimizing
performance and energy efficiency requires taking into account the individual
characteristics of the PUs. Some of the dominant PU-specific factors that needs
to be considered are
– Current load on PUs and achieving load balancing between them,
– Memory bandwidth and CPU to accelerator data transfer overhead; avoi-
ding and/or amortizing overhead of launching kernels on accelerators,
– Overlapping data transfer with computation or CPU computation with
computation on accelerators,
– Taking into account the limitations of PUs, e.g. CPU to GPU/Xeon-Phi
memory bandwidth, size of GPU/Xeon-Phi and CPU memory, number
20
of threads in GPU/Xeon-Phi and CPU cores, reduced performance of
GPU/Xeon-Phi for double-precision computations etc. [78]. Additionally,
aggressively using CPU for computation affects its ability to act as a host,
which harms the performance [30, 73].
• The diversity of platforms, applications, and datasets increases the partitioning
complexity [71]. The diversity lies in the fact that heterogeneous platforms exist
in different configurations and forms (e.g., multi-core CPUs with accelerators,
embedded systems, MPSoC, etc.), and that data parallel applications (with
different datasets) present various kinds of performance behavior even on the
same platform.
• Application or problem specific factors that influence workload partitioning -
– Subdividing the workload and selecting suitable work sizes to be allocated
to different PUs
– Accounting for data dependencies, e.g. if a task has data dependencies on
previous task, where was the previous task executed?
• From programmer’s viewpoint, it is challenging to obtain, with limited time
and effort, the optimal partitioning that leads to the best performance for a
given heterogeneous platform, application, and dataset.
A more comprehensive survey of heterogeneous computing techniques and the
challenges involved in heterogeneous computing can be found in [55].
2.3 IRREGULAR ALGORITHMS
The terms regular and irregular stem from the compiler literature. Regular al-
gorithms operate on structured data like large vectors or matrices, and access them
in statically predictable ways - e.g. dense linear algebra computations, matrix vec-
tor multiplications, etc. These algorithms often have high computational demands,
exhibit extensive data parallelism, access memory in a streaming fashion, and re-
quire little synchronization [50]. This sort of regularity in the algorithm exhibit data
independent control-flow and memory references which can be exploited to minimize
memory divergence by coalescing memory accesses, and minimize thread divergence
and synchronization by maintaining workload balance. In fact, there exists a plethora
21
of techniques, methods and languages to achieve efficient parallelization of regular
algorithms [44, 34], and their implementation on GPUs and Xeon Phis can be at-least
an order of magnitude faster than fine tuned parallel CPU version [24].
On the other hand, irregular algorithms are characterized by control-flow and
memory references that are data dependent. In particular, distribution of work and
data in these algorithms cannot be characterized a priori because these quantities
are input-dependent and evolve with the computation itself. Irregular algorithms
are the core of many high-performance applications, such as n-body simulations [9],
data mining [75], Boolean satisfiability [17], social networks [37], system modeling
[66], compilers [2], meshing [25], and discrete-event simulation [54]. Efficient parallel
implementation of irregular algorithms are challenging because:
• Irregular algorithms often demonstrate significant Memory Access Irregularity
(MAI) [18] which leads to severe performance bottlenecks on SIMD architec-
tures [14]. The data dependent memory accesses in these programs tend to
have less spatial locality compared to traditional graphics and regular general-
purpose applications.
• Input values in these algorithms determine the program’s runtime behavior,
which therefore cannot be statically predicted. These properties in the al-
gorithm pose problems for high-performance parallel implementations, where
equal distribution of work over processors cores and locality of reference are
required within each cache sharing processor cores.
• Performance of applications on SIMD architectures relies on high SIMD lane
occupancy and efficient memory coalescing for inter-thread data locality, where
the former requires minimal divergent branching for threads in a SIMD group,
while the latter requires regular memory access patterns and data structure
layouts [23, 59]. Unfortunately, irregular algorithms tend to present both signi-
ficant branch and memory divergence which leads to severe performance bott-
lenecks.
2.4 BEAM DYNAMICS
In this section, we present an overview of charged particle beam dynamics and
the general equations that govern the dynamics of charged particles in a synchrotron
(a type of cyclic particle accelerator).
22
2.4.1 PHYSICAL PROBLEM
The dynamics of charged particle beams is captured by the Lorentz force [47]:
d
dt
(γmev) = e (E + β ×B) , (1)
with the relativistic β and γ, velocity v, electric fieldE and magnetic fieldB specified
as, respectively,




1 + p · p/(mec)2
, (2a)
E = −∇φ− 1
c
∂tA, B = ∇×A. (2b)
φ and A the retarded scalar and retarded vector potentials, respectively. They are
obtained by integrating the charge distribution ρ and the charge current density J




























r and p are particle coordinates and momentum, respectively, f(r,p, t) is the particle
distribution function (DF) of the beam in phase space, me particle mass, c the speed
of light. Both electric and magnetic fields are composed of two components, one
due to external fields and the other due to self-fields: E = Eext +Eself , B = Bext +
Bself . Eext andBext are external electromagnetic (EM) fields fixed by the accelerator
lattice, and Eself and Bself are the EM fields from the beam self-interaction. The
beam self-interaction depends on the history of the beam charge distribution ρ and
current density J via the retarded potentials φ and A.
The computation of retarded potentials requires integration over the history of
charge distribution and current density, as can be seen from Equation 3a. This is
the main computational bottleneck of the beam dynamics simulations. In particular,
the problems to overcome in a successful beam dynamics simulation are: (i) data
storage for the time-dependent beam quantities (ρ and J); (ii) numerical treatment
of retardation and singularity in the integral equation for retarded potentials; and
23
Figure 5: Conceptual view of charged particles beam dynamics on a 2D plane.
(iii) accurate and efficient multi-dimensional integration in the equation for retarded
potentials.
Figure 5 and Figure 6 illustrates the conceptual view of charged particle beam
dynamics in a synchrotron (a type of cyclic particle accelerator) that is approximated
on a 2D plane. Figure 5 provides an overall view of the simulation, and Figure 6
illustrates the simulation process in a step-by-step manner. In both these figures,
blue line denotes the design orbit along which a beam bunch with N charged particles
travel under the influence of a bending force of a magnet. In order to numerically
simulate the dynamics of this beam, first the charged particles at time t0 = 0 are
generated from Monte Carlo sampling of initial DF of N particles with a total charge
ofQ. The particles evolve by a small time increment ∆t between each time step due to
the forces acting on the individual particles. These forces are computed using Vlasov-
Maxwell equations that are solved either directly, by sampling the entire phase space
of the DF, either on a grid or in an appropriate basis [15], or by using a particle
tracking approach [46, 47, 77, 76, 39]. The computational requirements associated
with sampling the entire phase space limit the direct solvers to lower dimensions
(usually 1D). On the other hand, tracking methods are less restrictive owing to the





Figure 6: Simulation of charged particles beam dynamics. (a) Particle distribution
for charged particle beam at time t0 is generated from Monte Carlo sampling of initial
DF of N particles with a total charge of Q. (b) - (e) The charged particles in the beam
emits synchrotron radiation when forced to travel along a curved trajectory under the
influence of a bending force of a accelerator magnets (referred to as Bending Magnet
in the above figure). (f) The radiations emitted from all the earlier time steps catch
up with the charged particle beam at time tk = k∆t for all integers k > 0, and these
synchrotron radiations leads to hazardous self-interactions.
25
allows the study in higher-dimensional systems, which gives them a clear advantage
and makes them a preferred method for modeling collective effects in beam dynamics
simulations. The most dominant of the particle tracking approaches is the particle-
in-cell method [77, 76, 39], where the individual particles are tracked in continuous
phase space, whereas moments of the distribution such as densities and currents are
computed simultaneously on discrete spatial grid points. The continuous phase space
in the plane of the beam lattice is referred as Lab Frame (LF)(XY−plane in Figure
5).
Furthermore, at each time step, the particle distribution experiences the effects of
coherent synchrotron radiation (csr) and other collective effects. Figure 6 illustrates
the collective effects on the particles due to beam’s self-interaction. In particular,
when a charged particle beam travels along a curved trajectory under the influence
of a bending force of an accelerator magnets, it emits synchrotron radiations, as il-
lustrated in Figure 6b - 6e. The radiations emitted from all the earlier time steps
catch up with the charged particle beam at time tk for all integers k > 0, and these
synchrotron radiations leads to hazardous self-interactions. This self-interaction cau-
ses significant emittance degradation, as well as fragmentation and microbunching
within the beam bunch, rendering the beam useless for physics research.
2.4.2 RELATED WORK
Present beam dynamics simulation methods employ a number of approximation
in the study of collective (including CSR) effects. For example, the beam dynamics
calculation in elegant [16] is based on the analysis of bunch self-interaction for a
rigid-line bunch. This method is widely used for accelerator design and is the first
to reveal CSR-induced microbunching in bunch compressors. However, in the re-
gime of extreme bunch compression when the bunch deflection is appreciable, the
1D approximation used in elegant may not be appropriate [49]. The earliest 2D
beam dynamics simulation is TraFiC4 [42]. Here electro-magnetic (EM) fields are
generated from the source particles moving along prescribed orbit, and CSR effects
are calculated from the impact of these EM fields on the dynamics of test particles.
An early self-consistent beam dynamics simulation was developed by [46, 47]. This
code calculates direct interaction between microparticles, with the retarded poten-
tials obtained by integrating bunch distribution over retarded times. However, the
computation efficiency for this code is severely limited by the direct particle-particle
26
interaction employed in the model. Recently, Bassi et al. [11] have developed a
highly efficient, high-resolution 2D self-consistent code for simulation of the CSR
effects. This simulation has generated interesting results on CSR-induced microbun-
ching in bunch compressors. Currently, the code assumes linear optics, so the effects
caused by nonlinear optics are not included. Self-consistent CSR simulations based
on finite element method was pioneered by Agoh and Yokoya [1]. This method can
include boundary effect by chamber walls much easier than the Greens function ap-
proach. More comprehensive review of the status of beam dynamics simulation can
be found in the review article by Bassi et al. [10].
2.5 NUMERICAL INTEGRATION
Numerical integration (also called as quadrature) constitutes a broad family of
algorithms for calculating the numerical value of a definite integral. In this section,
we describe the two widely used quadrature algorithms to calculate one-dimensional
definite integrals: Newton-Cotes formulas and Adaptive quadrature. Then, we ex-
tend the description to definite integrals for higher dimensions (commonly called as
multiple or multi-dimensional integrals).
2.5.1 NEWTON-COTES FORMULAS
Newton-Cotes formulae are a family of numerical integration techniques where
the value of a definite integral is approximated as a weighted sum of the integrand
values at equally spaced points. More formally, given a function or integrand f(x),
Newton-Cotes formula of degree n that approximates the definite integral of f(x)













wi are derived from Lagrange basis polynomials [21, p. 500], and it depends only on
the degree n of the Newton-Cotes formula.
Table 2 lists some of the widely used Newton-Cotes formulae of varying degree.
The notation fi is a shorthand for f(xi), and the number ξ in error term is between
a and b. The exponent of step size h in the error term shows the rate at which
the approximation error decreases, and the derivative of f in the error term shows
27
Degree (n) Common name Formula Error term
















Table 2: Newton-Cotes formulae for different degrees.
which polynomials can be integrated exactly (i.e., with error equal to zero). For the
Newton-Cotes rules to be accurate, the step size h needs to be small, which means
the domain of integration [a, b] must be small itself, which is not true most of the
time. One way to improve the accuracy is to partition the integration interval from a
to b into a number of subintervals and apply Newton-Cotes rule on each subinterval,
where the integral and error estimate from each subinterval is accumulated to give
the final estimates for the original integral. The resulting set of equation is called
as composite integration rule. For example, the composite integration rule using









f (a+ kh) + f(b)
]
(5)
where m is the number of subintervals, and subintervals have the form [kh, (k +
1)h], with h = (b−a)
m
and k = 0, 1, 2, . . . ,m − 1. The overall error estimate for
composite trapezoidal rule can be obtained by accumulating the individual errors for
each subinterval. A more comprehensive survey of Newton-Cotes formulae and their
corresponding composite rules can be found in [21, p. 601].
2.5.2 ADAPTIVE QUADRATURES
Newton-Cotes formulae and their corresponding family of composite rules are
based on evaluating the integrand at evenly spaced points. This global perspective
ignores the fact that many functions have regions of high variability along with
other sections where change is gradual. As a result, numerical integration using
28
Newton-Cotes formulae are accurate only when the integrand f is smooth (i.e., if it
is sufficiently differentiable or it features only subintervals where change is gradual).
On the other hand, when the integrand is highly oscillatory or it lacks derivatives
at certain points, then integration methods such as adaptive quadrature are used. In
adaptive quadrature, step sizes are adaptively adjusted so that small intervals are
used in regions of rapid variations and larger intervals are used where the function
changes gradually.




f(x)dx to a user-specified error tolerance of τ using static
quadrature rules on adaptively refined subregions of the integration domain [21,








subregion [a, b]. Some of the most commonly used quadrature rules are Simpson’s
rule, Newton-Cotes formulas, Gauss-Kronrod 7/15-point, and Gauss-Kronrod 10/21-
point [21].
Adaptive quadrature is a recursive algorithm which builds a binary recursion tree
of subregions as the computation proceeds. This recursion tree of subregions is a
visual and conceptual representation of the control-flow in adaptive quadrature. The
root of subregion recursion tree is the input integration domain [a, b] over which
the initial estimates for integral and error are determined using quadrature rules.
When the estimated error is larger than the required error tolerance, the subregion




, b]). The partitioned
subregions constitute the left and right child node of the parent node [a, b]. This
process is repeated recursively on the left and right nodes until the error estimate on
the subregion associated with the tree node is smaller than the error tolerance. Once
the recursion terminates, the final estimates of the integral and the error are given
by the corresponding estimates from the subregions represented by the leaves of the
subregion recursion tree. Moreover, nature of this recursive algorithm adapts to the
integrand automatically by partitioning the integration domain into subregions with
fine spacing where the integrand is varying rapidly and coarse spacing where the
integrand is varying slowly.
Let P = 〈x0, x1, x2, . . . , x|P |〉 be the partition on the integration domain [a, b] that






























(a) Partitions on the integration region [0, 1] along x-axis for an Gaussian































































(b) Subregion recursion tree





0.005 dx using Simp-
son’s quadrature rule with error tolerance τ = 0.001.
30





where Q(xi, xi+1) is the integral estimate given by quadrature rule on the subregion
[xi, xi+1] for all integers i in the range 0 < i < |P |. Moreover, ε(xi, xi+1) < τ , for all
integers i in the range 0 < i < |P |, where ε(xi, xi+1) is the error estimate given by
quadrature rule along the subregion [xi, xi+1]. It is important to note that the set
of all subregions [xi, xi+1] for all integers i in the range 0 < i < |P | represents the
leaves of subregion recursion tree.
For example, consider a Gaussian function f(x) = e−
(x−0.2)2
0.005 ∀x ∈ [0, 1] (see
Figure 7a). The subregion recursion tree generated by the adaptive quadrature to
approximate the Gaussian integral
∫ 1
0
f(x)dx to a error tolerance of τ = 0.001 using
Simpson’s quadrature rule is shown in Figure 7b. The partition generated along the
















, 1〉, where the partitions
along the subregion [0, 3
8





The multiple integral is a generalization of the definite integral to functions of
more than one real variable. For example, multiple integral of a function f(x1, x2, . . . , xn)







f(x1, x2, . . . , xn)dx1dx2 . . . dxn (7)
where [ai, bi] is the integration region for the variable xi, and [a1, b1] × [a2, b2] ×
. . . [an, bn] is the hyper-rectangle that represents the integration domain for Equation
7. In general, multiple integral with n variables is referred to as n-dimensional or
n-D integral. The simplest of all multiple integrals is the 2D integral (also called as





where [a1, b1]× [a2, b2] is a integration region in R2. Typically, above double integral









where multiple integral of the function f(x1, x2) is first performed over the variable
x1 and then performed over the variable x2. In other words, numerical approximation
of Equation 9 involves applying one-dimensional integration methods like Newton-
Cotes formula or Adaptive quadrature along inner dimension with each value of the
outer dimension held constant. Then the result of inner dimension is integrated in the
outer dimension again by using any one of the standard one-dimensional integration
methods. The choice of integration method along each dimension is usually based
on the nature of integrand along that particular dimension. In particular, Newton-
Cotes formula is used when the integrand is smooth; otherwise adaptive quadrature
is used.
Multiple integrals of higher dimensions (n > 2) can also be computed as repeated
one-dimensional integrals. However, this approach requires the function evaluations
to grow exponentially as the number of dimensions increases. For instance, using
repeated one-dimensional integral with m function evaluations along each dimension
requires a total of mn functional evaluations to compute n-dimensional integral.
In order to overcome this curse of dimensionality, methods such as Monte Carlo
integration, Sparse Grids, Bayesian Quadrature, or algorithms adaptive on the entire
n-dimensional space are used for multiple integrals with higher dimension (n > 5)
[12, 13, 33, 20]. The fastest known such open source adaptive method is cuhre
[12, 13], which is available as part of cuba library [35, 36]. A brief overview of
cuhre is illustrated in Appendix A.1, and a more comprehensive survey of other




In this chapter, we first introduce the algorithm to numerically simulate the char-
ged particles beam dynamics and then provide a detailed discussion of the irregular
algorithms used in this simulation. Then, in Section 3.2, we present the limitati-
ons of implementing this simulation algorithm on sequential machines and present
with the irregular properties in the numerical simulation which makes the parallel
implementation a challenging task. Finally, Section 3.3 presents our prior work in
developing efficient parallel algorithms for this problem.
3.1 OUTLINE OF THE ALGORITHM
Numerical simulation of charged particle beam dynamics on a 2D plane of the
beam lattice consists of four consecutive steps that are computed at each time step
of the simulation, and are repeated for a few hundreds or thousands of time steps
(see Figure 8). Formally, for a simulation with step size ∆t, following four steps are
executed during each time step k, for some integer k in the range 0 to Nt, where Nt
is the number of time steps required for the simulation -
1. Particle Deposition - Deposit the DF sampled by N particles onto a 2D
spatial grid of NX × NY resolution using particle-in-cell method [77, 76, 39],
thereby yielding 2D grid of moments, one for each grid-point. The moments
here is a multidimensional quantity representing the distribution’s deposited
charge, current densities, etc.
2. Compute Retarded Potentials - The collective effects in the particle distri-
bution due to beam’s self-interaction are modeled through retarded potentials,
which are computed at each grid-point of the 2D spatial grid using the quadra-
ture defined in Equation 3a.
3. Compute Self-Forces - Self-forces are computed on each grid-point of the
spatial grid using Equation 1. Next, the self-forces acting on each particle are
computed by linear interpolation from the 2D grid of self-forces. It is required
33
Bin particles on 𝑁𝑋 × 𝑁𝑌
spatial grid
Particle Deposition  
Compute forces




Evaluate retarded potentials 
on 𝑁𝑋 × 𝑁𝑌 spatial grid
Initial Distribution
of 𝑁 particles 
at  𝑡 = 0
Advance particles by Δ𝑡





Figure 8: Outline of charged particle beam dynamics simulation algorithm.
that the particle deposition onto the grid and interpolation from the grid onto
particles is done in the same manner, so as to avoid “ghost forces”.
4. Push Particles - The particles are advanced to next time step by a small
increment ∆t in time by solving Lorentz equation (Equation 1), using leap-frog
scheme [47].
The steps 1-4 are repeated for Nt steps, where Nt is usually in the order of few
hundreds to thousands. The heart of beam dynamics simulation is the stage at which
the retarded potentials are computed, which, as later illustrated in this chapter, is
the crucial and by far the most computationally-intensive step of the simulation. As
a result, to obtain high performance in the overall beam dynamics simulation, it is
important to optimize this step. However, the algorithm required to compute the
retarded potentials is highly irregular, making its efficient parallel implementation a
daunting task.
3.1.1 PARTICLE DEPOSITION
During particle deposition stage of the simulation, moments of the particle distri-
bution are computed on discrete grid points using PIC approach [77, 76, 39], where
34
the distribution is first transformed to a normalized representation in which the
charged particles are enclosed within a 2D unit-grid. Then, the moments of charged
particles are deposited onto the nearest grid points of the unit-grid using inverse
interpolation as described in [39]. As a result, during time step k with simulation
time tk = k∆t for some integer k in range 0 < k < Nt, the particle deposition stage
generates a 2D grid of moments which denotes the moments of particle distribution
at that particular time step. For simplicity, we use the notation Dk to denote the
2D grid of moments at time step k.
Figure 9 and 10 outlines the transformation of particle distribution to a 2D unit-
grid representation using PIC approach. The top panel of the Figure 9 shows the
charged particle dynamics along a design orbit, where the particle distribution at
each time step is tightly enclosed within a spatial grid of resolution NX ×NY . The
bottom panel of Figure 9 denotes one such spatial grid that tightly envelops the
particle distribution at time step k, where the spacing between the extreme points is
declared to be LX and LY , respectively, so that the outliers are in the middle of the
boundary cells. The spatial grid makes an angle α with the XY−plane. Orienting
the beam in such a way so as to occupy the smallest volume while containing all
the particles yields optimal spatial resolution on a fixed-size rectangular grid. In
general, the spatial grid enclosing the particle distribution at time step k is uniquely
identified by its tilt angle α, physical size of the grid in X− and Y−directions, LX
and LY respectively, and the location of its center of charge point (X0, Y0). In other
words, given the quintuple 〈α,LX , LY , X0, Y0〉k at time step k, the spatial grid that
envelops the particle distribution at tk can be constructed from the quintuple. Next,
particles within the spatial grid are rotated through an angle α from the design orbit
in XY−plane, so as to account for the (X, Y ) correlation and is further normalized to
be contained within a unit-grid given by [−0.5, 0.5]×[−0.5, 0.5]. The coordinate space
in the rotated and normalized plane is referred to as Grid Frame (GF) (X̃Ỹ−plane
in Figure 10). Once the particle distribution is normalized, all the charged particles
are deposited onto the nearest grid points using PIC deposition scheme, thereby
yielding the moments of the distribution on each grid-point of the unit-grid, which
involves inverse interpolation (scatter operation) from the particle position to the
nearest grid points. This results in a 2D grid of moments gk[1..NX , 1..NY ] for the
particle distribution at tk, where gk[i, j] denote the moments deposited on a grid-point
(x̃i, ỹj) (in GF ) located in i













Figure 9: Spatial-grid enclosing the particle distribution at a particular time step of
the simulation. Its size is determined by the outliers of the distribution along the
principal axes. Blue line denotes the design orbit.
36






















Figure 10: Steps in particle deposition stage of the beam dynamics simulation.
gk[i, j] is a triplet 〈ρ, JX , JY 〉, where ρ is the deposited charge, JX and JY are the
current densities in X- and Y - directions, respectively. It is important to note that
the moments deposited on a point (x̃i, ỹj) on the unit-grid (in GF ) and on a point
(xi, yj) on the spatial grid (in LF ) at tk are identical and is given by gk[i, j], where

























3.1.2 COMPUTE RETARDED POTENTIALS
The collective effects in the particle distribution due to beam’s self-interaction
is modeled through retarded potentials, which are computed at each grid-point of
the 2D spatial grid. Formally, suppose Vk denotes the set of grid points on the 2D
grid at time step k, such that |Vk| = NXNY and each grid-point p ∈ Vk is a two-
dimensional point (x, y), denoting Cartesian coordinate (or LF -coordinate) of the
grid-point on the 2D spatial grid (i.e., (x, y) is the position of p on Dk). Then, the
retarded potentials at p are computed using the quadrature in Equation 3a, which,












(r′, θ′, t′) dθ′ (12)
where 0 < R(p) ≤ κc∆t for some positive integer κ ≤ k, retarded time t′ = k∆t−r′/c,
and i∆t < t′ ≤ (i+ 1)∆t for some integer i in the range k − κ ≤ i < k. The integral
estimate I(p) is a triplet 〈φ,AX , AY 〉, where φ is the scalar potential, AX and AY
are the vector potentials in X− and Y− directions, respectively. The domain of
integration, {(r′, θ′) ∈ R2 | 0 ≤ r′ ≤ R(p), θ(p)min(r′) ≤ θ′ ≤ θ
(p)
max(r′)}, is a subregion
of R2 and it is denoted S0,R(p). For convenience, we shall refer to the integral in
Equation 12 that computes retarded potentials as rp-integral.
Computation of Integration Limits
Figure 11 illustrates the computation of integration limits, where for all r′ sampled
along the outer dimension, the portion of circle with radius r′ and center at (x, y) that
is within the spatial grid enclosing the particle distribution at time t′ constitutes the
inner integral limits. Moreover, these limits are discontinuous for some r′. The red-
line in Figure 11 denotes the integration range in θ′−domain, where [θ(p)min(r′), θ
(p)
max(r′)]
represents the inner integral limits and the circle of radius r′ with center at p is
referred to as circle of causality at p (denoted by grey dotted line). The outer
integral limit R(p) is the smallest value of r′ such that the circle of causality for all
r′ > R(p) has no intersection with the spatial grid at t′, and for all 0 ≤ r′ ≤ R(p)
















: A → R3, where A ∈ {(r′, θ′, t′) | 0 ≤ r′ ≤ R(p), t′ = k∆t −
r′
c
, θ′ ∈ [0, 2π]} and a point (r′, θ′, t′) on the integrand represents the polar coordinate
(r′, θ′) in LF with center at (x, y) as shown in Figure 12a. Cartesian equivalent for















(r′, θ′, t′) in rp-integral denote the moments, 〈ρ, JX , JY 〉, deposited
on the polar coordinate (r′, θ′) (or (x′, y′) in Cartesian system) by the particle dis-
tribution at time t′. However, f
(p)
does not have an analytic form, and as a result,
f
(p)
(r′, θ′, t′) is numerically approximated using neighboring points from the 2D grid
of moments deposited on the spatial grid at time t′. When t′ = i∆t, moments on the
























( 𝑥′,  𝑦′)
(b)
Figure 12: Numerical approximation of integrand values in rp-integral using inter-
polation.
40
deposition stage of time step i. On the other hand, when i∆t < t′ < (i + 1)∆t,
the moments at t′ is approximated using interpolation from the 2D grid of moments,
Di−1, Di, and Di+1. Furthermore, for all t
′ < t, the value of f
(p)
(r′, θ′, t′) is defined
only when (x′, y′) is within the spatial grid at time t′ and is 0 otherwise. For exam-
ple, Figure 12a and Figure 12b illustrates the approximation of f
(p)
(r′, θ′, t′) using 27
neighboring points.
In Figure 12a, spatial grids from earlier time steps (i− 1), i, and (i+ 1) that are
computed during the particle deposition stage of the corresponding time steps are
shown by grey colored grids, whereas the spatial grid at the current time step k is
shown by black colored grid. Further, the spatial grid at retarded time t′ which is
approximated from the neighboring three spatial grids is shown by blue colored grid.
The point (r′, θ′, t′) on the integrand lies on the portion of causality circle that is
within the spatial grid at t′. Figure 12b illustrates the approximation of f
(p)
(r′, θ′, t′)
using the normalized unit-grid (in GF ) representation of the moments, where (x̃′, ỹ′)
is the GF equivalent for the LF -coordinate (x′, y′). The moments deposited on (x̃′, ỹ′)
is approximated using moments from the grid points of neighboring three unit-grids,
where the set of grid points required by the interpolation is shown using blue dots.
Formally, for all r′ sampled from the interval [(k− i− 1)c∆t, (k− i)c∆t] (i.e. for
all t′ ∈ [i∆t, (i+1)∆t]), integrand f (p)(r′, θ′, t′) is approximated using 27 neighboring
points from the 2D grids of moments, Di−1, Di and Di+1 (nine points from each of
the three data grids). In other words, 2D grids of moments from time steps i − 1,
i, and i + 1 are required for calculating rp-integral along the subregion, {(r′, θ′) ∈
R2 | (k − i − 1)c∆t ≤ r′ ≤ (k − i)c∆t, θ(p)min(r′) ≤ θ′ ≤ θ
(p)
max(r)}. Adapting to
this relation, rp-integral along the entire integration region S0,R(p) requires grids of
moments between time steps k − κ and k (i.e. Dk−κ to Dk), where the computation
of integral along a subregion Sjc∆t,(j+1)c∆t uses data from Dk−j, Dk−j−1, and Dk−j−2.
Numerical Integration
Numerical approximation of rp-integral use repeated one-dimensional integration
algorithm, as illustrated in Section 2.5.3. The choice of integration method along
each dimension is based on the nature of integrand along that particular dimension.
Empirical analysis on rp-integral behavior shows that the integrand has strongly
varying orders of magnitude in different parts of the subregion in r′-domain. In
contrast, the inner dimension features only regions where change is gradual (i.e.
41
integrand is smooth in θ′-domain). As a result, outer integral is computed using
adaptive quadrature and the inner integral using Newton-Cotes formulas.
More formally, for a grid-point p ∈ Vk and a subregion Sa,b = {(r′, θ′) ∈ R2 |
a ≤ r′ ≤ b, θ(p)min(r′) ≤ θ′ ≤ θ
(p)
max(r′)}, integration algorithm results in a partition,
〈r(p)0 , r
(p)
1 , . . . , r
(p)
n 〉, along the outer dimension, where r(p)0 = a < r
(p)
1 < · · · < r
(p)
n = b
and n > 0 is an integer, such that the partition has fine spacing where the integrand
is varying rapidly and coarse spacing where the integrand is varying slowly. Given

















and for every r′ sampled along that subregion, inner integral is computed using
















































































are the estimates given






















(r′, θ′, t′) dθ′ i.e. fin(r
′) is the inner integral in θ′−domain
and it is approximated as the weighted sum of integrand values at equally spaced
points within the domain of integration, where the weights are given by Newton-Cotes
formulas [21, p. 613]. For the purpose of this study, we use three-point Newton-Cotes
formula to compute the inner integral.
Note that high-fidelity computation of collective effects by calculating retarded
potentials at all grid points on the 2D spatial grid is the crucial and by far the
most computationally intensive step of this simulation. As a result, to obtain high




In this section, we first present the limitations of implementing beam dynamics
simulation on sequential machines and show that the performance of sequential si-
mulation is severely limited by the retarded potentials computation stage. Next, in
Section 3.2.2, we present the empirical analysis of sequential beam dynamics simu-
lation and show that distribution of work and data in the accurate computation of
retarded potentials at each time step is irregular and exhibits control-flow and me-
mory access patterns that are not readily amenable to the data-parallel, SIMD nature
of GPU and Intel MIC architectures. In particular, we show that the computation of
rp-integral is highly irregular, and as a result, naive parallel implementations of such
computations tend to present significant branch and memory divergence on SIMD
architectures which leads to severe performance bottlenecks.
3.2.1 LIMITATIONS OF SEQUENTIAL IMPLEMENTATION
Figure 13: Percentage of sequential execution time spent by different stages of the
beam dynamics simulation per time step averaged over all time steps for various grid
resolutions.
The heart of beam dynamics simulation is the high-fidelity computation of col-
lective effects which require calculating retarded potentials at all grid points on the
2D spatial grid of each time step. This is crucial and by far the most computatio-
nally intensive step of the beam dynamics simulation. We illustrate this empirically
by simulating the beam dynamics sequentially with N = 100000, where the ini-
tial distribution is generated by Monte Carlo sampling of N particles with a total
charge of the beam bunch Q = 1nC. This simulation is executed sequentially for
43
three different spatial grid resolutions: (i) 32 × 32, (ii) 64× 64, and (iii) 128 × 128.
Figure 13 illustrates the percentage of execution time required by different stages of
this simulation per time step averaged over all time steps. Notice that, during each
time step, compute retarded potentials stage of the simulation takes 95 − 99% of
the overall execution time. This shows that the efficiency of sequential simulation
is severely limited by the computational requirement associated with computing re-
tarded potentials. Empirical study with different simulation configuration generates
behavior that is consistent with that shown in Figure 13. In other words, for each
time step of the beam dynamic simulation, computing retarded potentials is the most
computationally-intensive step for all the valid simulation configurations.
3.2.2 EMPIRICAL ANALYSIS
This section presents the computational requirement and memory access proper-
ties of retarded potentials computation stage of the simulation that poses problems
for high-performance parallel implementations. We use empirical analysis on sequen-
tial beam dynamics implementation to study the computational properties, and then,
based on the observed behavior, we make a general inference about the properties
that tend to present significant challenge in developing parallel implementation on
GPUs and Intel MICs.
The input configuration considered for empirical analysis using sequential simula-
tion is illustrated below and the values are chosen so as to illustrate specific properties
of rp-integral and retarded potentials computations that are commonly observed for
all valid simulation configurations. It is important to note that individual values of
parameters in the chosen configuration are not relevant to the discussion that fol-
lows, and the conclusions inferred from the observed properties apply to all valid
simulation configurations.
Simulation configuration for empirical study: N = 100000, NX = 32,
NY = 32, Nt = 1000, τ = 0.001, and the initial distribution is generated
by Monte Carlo sampling of N particles with a total charge of beam bunch
Q = 1nC.
Note that the resolution of the above described simulation configuration is very low
to provide any valuable insights about the physics behind beam dynamics process.
However, they are chosen only to study the properties of the computation involved.
44
Further, high-resolution and high-fidelity beam dynamics simulation is computatio-
nally intractable with sequential codes. In practice, to study all the relevant physics
of beam dynamics in particle accelerators, N is in the order of few hundred thou-
sands to millions of particles, NX and NY varies from 64 to 1024, error tolerance for
rp-integral approximation τ is between 10−12 to 10−6, and Nt is of the order of few
hundreds to thousands of time steps. Such high-resolution and high-fidelity simulati-
ons are intractable with sequential implementations, and thus necessitating the need
for high-performance parallel implementations.
Computational Workload Characteristics
Figures 14 and 15 illustrate the computational requirement for beam dynamics
simulation using different simulation configurations, where computational require-
ment or workload is measured in terms of the number of double-precision floating
point operations required for the corresponding computation. In particular, Figure
14 shows the workload behavior of compute retarded potentials stage at different
time steps of a beam dynamics simulation using the above described configuration,
where x-axis denote different time steps of the simulation and y-axis denote double-
precision floating point operations required for computing retarded potentials at a
particular time step in GFlops (109 Flops). Next, Figure 15 shows the computa-
tional workload for rp-integral evaluations at all grid points on a spatial grid at a
particular time step of the simulation using above described configuration but with
8 × 8 spatial grid (We choose 8 × 8 instead of 32 × 32 grids for visual presentation
convenience). The x-axis in figure denotes the set of grid points at a particular
time step and y-axis denotes double-precision floating point operations required for
numerically approximating rp-integral in MFlops (106 Flops).
It is evident from Figure 14 and Figure 15 that the computational requirement
is not uniform and varies unpredictably between time steps and grid points. In
particular, Flops required for rp-integral computation at each grid-point is different
from one another which leads to non-uniform workload between time steps. Such
non-uniform and input dependent workload poses multiple challenges to develop
a high-performance parallel implementation, where equal distribution of work over






































Figure 14: Workload characteristics of compute retarded potentials stage of the beam
dynamics simulation for 1000 time steps calculated using the sequential simulation
code with N = 100000 particles, 32 × 32 spatial grid resolution, error tolerance for
rp-integral computations τ = 0.001, and total beam charge of Q = 1nC. (Note that
the computational workload is measured as a function of double-precision floating
point operations in GFlops)
Figure 15: Workload characteristics of rp-integral computation at different grid
points on a 8×8 spatial grid at a particular time step of the beam dynamics simulation
with N = 100000 particles, error tolerance for rp-integral computations τ = 0.001,
and total beam charge of Q = 1nC. (Note that the computational workload is mea-
sured as a function of double-precision floating point operations in GFlops)
46
Control-flow Properties
Figure 16 illustrates the subregion recursion tree generated by adaptive quadra-
ture along the outer dimension of rp-integral at grid points p1, p2, p3 and p4 on the
spatial grid at time step k, where the values of p1, p2, p3, p4 and k are empirically
selected to illustrate the control-flow properties of rp-integral. In Figure 16, flow of
computation or control-flow is determined by the pre-order traversal of the corre-
sponding recursion tree. In particular, the red-arrows on each subregion recursion
tree denotes the control-flow path for evaluating rp-integral at the corresponding
grid-point, and the order of execution is shown by the number on each arrow. Addi-
tionally, the partition generated by the adaptive quadrature for the outer integral is
shown at the bottom of each sub-figure.
Now, consider computing retarded potentials at time step k which requires eva-
luating rp-integral at all grid points Vk = {p1, p2, . . . , p|V |}, where |Vk| = NXNY
and pi is a point on the spatial grid at tk, for all integers i in range 0 < i ≤ |Vk|.
From Figure 16, and from the empirical analysis on sequential beam dynamics simu-
lation, we make the following general inference about rp-integral properties at any
two observation-point pi and pj (i 6= j)
1. Partition generated by adaptive quadrature along the outer dimension of rp-
integral is unique for each grid-point i.e. outer integral partition for rp-integral
at pi is different form that of rp-integral at pj. For example, in Figure 16, the
partition generated for rp-integral at p1 and p4 are
















2. The pre-order traversal for the subregion recursion tree generated by adaptive
quadrature along the outer dimension of rp-integral evaluation at pi is different
from that of rp-integral at pj. In other words, control-flow for rp-integral at pi
and pj are substantially different from one another.
3. The number of floating-point operations required to calculate rp-integral at pi
is different from that of rp-integral at pj.
The properties listed above shows that numerical approximation of rp-integral






























































































































































































































































Figure 16: Subregion recursion tree generated by adaptive quadrature along the outer
dimension of RP-integral at grid points p1, p2, p3 and p4.
48
not readily amenable to many modern parallel architectures. Moreover, these irre-
gular properties in an algorithm tend to present multiple challenges in developing
high-performance parallel implementations on GPU and Intel MIC architectures, as
illustrated in Section 2.3.
Memory Access Properties
Let the moments computed from all time steps be stored linearly on the host/de-
vice memory as a 2D array M [0..Nt, 1..NXNY ] in row major order such that moments
deposited on a grid-point located in ith row and jth column of the spatial grid during
the particle deposition stage at tk = k∆t is stored in M [k, (iNX + j)], for all integers
i, j and k in the range 0 ≤ i ≤ NX , 0 ≤ j ≤ NY , and 0 ≤ k ≤ Nt, respectively.
Figure 17 illustrates the memory access on a portion of 2D array M (between
row-index 350 − 410) that is observed during the rp-integral evaluation at each of
the four grid points (p1, p2, p3 and p4), where a grid-cell at m
th row and nth column
denotes the data element M [m,n]. Each cell has a gradient color that represents
the number of times the data element corresponding to that cell is accessed while
computing rp-integral. When M [m,n] is not accessed during the lifetime of rp-
integral evaluation at tk, then the cell corresponding to M [m,n] is marked green. On
the other hand, blue gradient at M [m,n] denotes that the data element is accessed
at least once during the course of execution and the gradient magnitude denotes
the number of times M [m,n] is accessed during the rp-integral evaluation at the
corresponding grid-point.
Now, consider computing retarded potentials at tk which requires evaluating rp-
integral at all grid points Vk = {p1, p2, . . . , p|Vk|}, where |Vk| = NXNY and pi for all
integers i in range 0 < i ≤ |Vk| is a point on the spatial grid at tk. From Figure 17,
and from the empirical analysis on sequential beam dynamics simulation, we make
the following general inference about the memory-access characteristics of rp-integral
computation at any two grid points pi and pj (i 6= j)
1. Temporal locality - A certain subset of data accessed during the computation of
rp-integral at pi is likely to be referenced again in relative temporal proximity.
In other words, for rp-integral computation at pi, there exist one or more data
elements M [m,n] that is referenced more than once. In Figure 17, a cell in
mth row and nth column with a gradient magnitude of greater than one de-








































































































Figure 17: Memory requests on a portion of 2D array M (between row-index 350−
410) for rp-integral at grid points p1, p2, p3, and p4.
times during the course of corresponding rp-integral evaluation. For example,
data element M [406, 17] is accessed and used roughly 160 times during the
rp-integral computation at p3.
2. Spatial locality - Value of the integrand f
(pi) that constitutes the rp-integral at
pi is approximated using interpolation from 27 neighboring points, as described
in Section 3.1.2. Consequently, memory access to read these 27 points exhibits
3D spatial locality for every integration point that gets sampled during the
numerical integration procedure.
3. For each rp-integral evaluation, there exists a small set of data elements that
are reused more frequently than others which are denoted by the darkest shade
of the blue gradient in Figure 17. For example, consider the memory access
on array M for rp-integral computation at grid-point p1. Notice that the data
elements between row index 404 and 407 have higher gradient magnitude i.e.
they are reused more frequently that others. In other words, the moments
computed between t404 and t407 are reused more frequently. This happens when
50
the integral computation is focused around the subregion [(k∆t− t407)c, (k∆t−
t404)c] (since, t
′ = k∆t−r′/c⇒ r′ = (k∆t−t′)c from the conditions in Equation
12) or the partition has fine spacing within that subregion.
4. The memory access footprint for rp-integral at pi has some overlap with that of
rp-integral at pj for some integer i and j. In other words, there is a significant
reuse of data for two grid points. For example, the data elements accessed
between the row index 404 and 407 for rp-integral at point p3 overlaps with
that of the rp-integral at point p4, as illustrated in Figure 17c and Figure 17d.
However, the gradient magnitude of the overlapping data access are different
i.e. the frequency of reuse is different.
5. While computing rp-integral on grid points, there is a significant reuse of data
for two nearby grid points. The reuse of data for two grid points is inversely
proportional to the distance between the two grid points. More formally, given
two points u and v, where u, v ∈ V and u 6= v, the data locality or the number
of overlapping memory access between rp-integral evaluation at u and v is
inversely proportional to the Euclidean distance between them.
3.3 PRIOR RESEARCH IN PARALLEL SIMULATION
This section presents two parallel algorithms that improve the overall performance
of beam dynamics simulation by offloading the retarded potentials computation stage
at each time step of the simulation onto GPUs. The first algorithm is Two-phase
Algorithm, and Section 3.3.1 presents a brief overview of this algorithm that com-
putes retarded potentials at each time step by focusing on load-balancing, which,
due to the uneven computation load associated with rp-integrals evaluation, is cri-
tical for good performance and scalability. The second algorithm is Heuristics
Algorithm, and Section 3.3.2 presents a overview of this algorithm which uses
two different heuristics to maximize data reuse and to balance the workload among
threads during the retarded potentials computation stage of the simulation. The
heuristics used in this algorithm are based on the properties observed from empirical
analysis of sequential simulation illustrated in Section 3.2.2. A more comprehen-
sive review of these two algorithms and their implementation performance on GPU
architectures is published in [5, 6].
51
3.3.1 TWO-PHASE ALGORITHM
The two-phase approach is a globally adaptive algorithm that approximates rp-
integral estimates at NXNY grid-points at each time step by adaptively locating
subregions in parallel where the error estimate is greater than some user-specified
error tolerance. It then calculates the rp-integral estimates on these subregions in
parallel.
Compute-Potentials(k, V, τ,M)
1 L← RP-PhaseOne(k, V, τ,M)
2 RP-PhaseTwo(t, V, L, τ,M)
RP-PhaseOne(k, V, τ,M)
1 L← ∅
2 for each grid-point p ∈ V in parallel
3 p.I ← 0, p.ε← 0
4 compute outer integral limit [0, R]
5 List-Insert(L, ([0, R], p))
6 while (|L| < Lmax) and (|L| 6= 0)
7 for each tuple ([a, b], p) ∈ L in parallel
8 (I, ε)← RP-QuadRule(([a, b], p), τ,M)
9 List-Insert(S, (([a, b], p), I, ε))
10 L← Partition(S, Lmax, τ)
11 for each tuple (([a, b], p), I, ε) ∈ S
12 if ε < τ
13 p.I ← p.I + I
14 p.ε← p.ε+ ε
15 return L
RP-PhaseTwo(t, V, L, τ,M)
1 for each ([a, b], p) ∈ L parallel
2 (I, ε)← RP-Quadrature(([a, b], p), τ,M)
3 p.I ← p.I + I
4 p.ε← p.ε+ ε
The procedure Compute-Potentials implements this algorithm where RP-
PhaseOne and RP-PhaseTwo method illustrates the two phases of the algorithm
52
to compute rp-integral to a error tolerance of τ at all grid points p ∈ V , where p is a
grid-point on the spatial grid at time t = k∆t. Grid-point p is a 5-tuple (x, y, t, I, ε)
element, where (p.x, p.y) denotes the Cartesian coordinate of a point on the spatial
grid at time p.t, p.I is the integral estimate for the rp-integral at p and p.ε is the error
estimate. The integral estimate p.I is a 3-tuple (φ,AX , AY ) element which represents
the scalar and vector potentials on p. The moments computed from all time steps
are stored linearly on the device memory as a 2D array M [1..Nt, 1..NXNY ] in row
major order such that moments deposited on a grid-point located in ith row and jth
column of the spatial grid during the particle deposition stage at tk = k∆t is stored
in M [k, (iNX + j)], for all integers i, j and k in the range 0 ≤ i ≤ NX , 0 ≤ j ≤ NY ,
and 0 ≤ k < Nt, respectively. In the description below, a subregion is identified by
the record ([a, b], p), where p denotes a grid-point on the spatial grid at time step k
and [a, b] denotes the limits of integration along outer dimension of rp-integral at p.
The two-phase approach is an extension of the GPU-accelerated multidimensional
numerical integration algorithm described in Appendix A.1.
RP-Quadrature(([a, b], p), τ,M)
1 (I ′, ε′)← RP-QuadRule(([a, b], p), τ,M)
2 H ← ∅
3 Push(H, (p, [a, b], I ′, ε′))
4 while ε′ > τ
5 (([a, b], p), I, ε)← Pop(H)
6 m← a+b
2
7 (Il, εl)← RP-QuadRule(([a,m], p), τ,M)
8 (Ir, εr)← RP-QuadRule(([m, b], p), τ,M)
9 I ′ ← I ′ − I + Il + Ir
10 ε′ ← ε′ − ε+ εl + εr
11 Push(H, (([a,m], p), Il, εl))
12 Push(H, (([m, b], p), Ir, εr))
13 return (I ′, ε′)
The procedure RP-PhaseOne, RP-PhaseTwo and RP-Quadrature is iden-
tical to Quadrature-PhaseOne, Quadrature-PhaseTwo and Sequential-
Cuhre, respectively, from Appendix A.1. However, instead of using Cuhre method
53
Grid Resolution (NX ×NY ) 64× 64 128× 128 256× 256
Double Precision Performance (GFlops/sec) 60 61 61
Arithmetic Intensity (Flops/DRAM byte) 0.30 0.31 0.31
Effective Bandwidth (GB/sec) 73 75 76
Global Load Efficiency 29% 30% 32%
Global Load Transactions per Request 7.8 8.0 8.1
Warp Execution Efficiency 60% 75% 77%
L1-cache Global Hit Rate 46% 40% 38%
L2-cache Hit Rate 99.6% 97.3% 90%
Table 3: Performance of Two-Phase-RP-Kernel for computing retarded poten-
tials in a beam dynamics simulation with 100000 particles and for different grid
resolutions on NVIDIA Tesla K40 GPU.
for numerical integration as described in the appendix, we use repeated one dimen-
sional integral to compute rp-integral estimates, where outer integral is calculated
using adaptive quadrature and inner integral using Newton-Cotes rules. In parti-
cular, procedure RP-Quadrature implements this repeated one-dimensional inte-
gration method to solve rp-integral, where outer integral is computed using adap-
tive Simpson’s rule and inner integral using three-point Newton-Cotes rule. Furt-
hermore, the procedure C-Rule is replaced with RP-QuadRule (([a, b], p), τ,M),
which calculates Simpson’s quadrature rule estimates, I and ε, along a subregion
S = {(r′, θ′) ∈ R2 | a ≤ r′ ≤ b, θ(p)min(r′) ≤ θ′ ≤ θ
(p)
max(r′)}, for rp-integral at a grid-
point p, where I is the integral estimate and ε is the error estimate. In other words,
procedure RP-QuadRule calculates Equation 15 from Section 3.1.2. We omit the
pseudocode for RP-QuadRule, as it is identical to the standard Simpson’s rule
(three point Newton Cotes rule) [21]. A more comprehensive review of the two-
phase algorithm, its implementation and performance analysis on GPU architectures
can be found in [5].
Performance Analysis and Limitations
Table 3 illustrates the double-precision floating-point performance for computing
retarded potentials at a particular time step of a beam dynamics simulation with









1/8 1/4 1/2  1  2  4  8  16  32
Peak DP GFlops/sec 




























NVIDIA Tesla K40 GPU
Figure 18: Roofline model analysis for Two-Phase-RP-Kernel on NVIDIA Tesla
K40 GPU.
memory accesses configured to be cached in both L1 and L2 (commonly called the
Caching mode). Initial distribution for all the simulations are generated by Monte
Carlo sampling of N particles with a total charge of beam bunch Q = 1nC, and the
rp-integral at all grid points are approximated to a error tolerance of τ = 10−6.
The performance metrics illustrated in this section are measured using NVIDIA
profiler [62], and a detailed description about each metric and its relation to kernel’s
performance is illustrated in Appendix B.2. In this study, for convenience, we shall
refer to the kernel implementing GPU specific code from Compute-Potentials
procedure of the two-phase algorithm as Two-Phase-RP-Kernel.
Roofline Model Analysis - Figure 18 shows the performance of Two-Phase-RP-
Kernel illustrated on the Roofline model for K40 GPU. A detailed description about
Roofline plot and its use to bound the performance of GPU kernels is illustrated in
Appendix B.1. The performance achieved by Two-Phase-RP-Kernel for the
55
simulation with 100000 particles and 128 × 128 grid resolution is 61 GFlops/sec
(see Table 3), and it is indicated by a blue horizontal line in Figure 18. The point
where this blue horizontal line intersects the bandwidth ceiling marks the achieved
arithmetic intensity. In particular, the green and red X marks achieved arithmetic
intensity for Two-Phase-RP-Kernel when the system being modeled delivers a
memory bandwidth of BWTheoretical-Peak = 288 GB/sec and BWExperimental-Peak = 200
GB/sec, respectively. Table 3 illustrates the arithmetic intensity (with the bandwidth
assumed to be BWExperimental-Peak) achieved by Two-Phase-RP-Kernel measured
for different simulation configuration. We notice that arithmetic intensity achieved
for the kernel is approximately 0.31 Flops/DRAM byte-accessed when the memory
the system being modeled delivers a memory bandwidth of BWExperimental-Peak. It is
clear from Figure 18 and the achieved arithmetic intensity in Table 3 that the kernel
is memory-bound and it performs poorly because it does not make good use of the
available bandwidth, which, due to low arithmetic intensity of the implementation,
is the main bottleneck.
Branch Divergence - The warp execution efficiency Two-Phase-RP-Kernel is far
less than 100% for all simulation configurations, as illustrated in Table 3. This
indicates that the kernel implementation has large number divergent branches that
results in poor utilization of the GPU’s hardware resources, thereby leading to poor
execution performance.
Memory Performance - The memory performance of Two-Phase-RP-Kernel on
GPU is analyzed using profiler metrics - Global load efficiency, Global load tran-
saction per requests, Cache hits, etc. These metrics and its relation to the kernel’s
performance is described in Appendix B.2. The following key observations about the
memory performance of Two-Phase-RP-Kernel are inferred from the metrics in
Table 3:
1. Global load efficiency is far less than 100% which indicates scattered and non-
coalesced memory access, and such accesses waste off-chip bandwidth by over-
fetching unnecessary data. Typically, caching in L2 only (non-caching mode)
is enabled to reduce such over-fetch. However, as illustrated in [6], this kernel
has poor global load efficiency even with non-caching mode indicating high
bandwidth waste in both caching and non-caching mode.
56
2. The global load transactions per request is approximately 4X times greater
than the ideal value; in other words, the global load transaction is replayed
8 times for each global memory load. This shows that substantial number of
memory request from the kernel are non-coalesced that results in transaction
replays.
3. Global memory requests from the kernel almost always hits in L2-cache which is
evident by near 100% L2-cache hit rate. This indicates that (i) kernel exhibits
high data locality and/or (ii) the problem fits entirely in L2-cache. Even though
nearly 100% of the memory request is serviced from L2-cache, which in practice
has a bandwidth much greater than BWTheoretical-Peak, the achieved or effective
bandwidth is still far less than BWTheoretical-Peak.
It is clear from the above observations that Two-Phase-RP-Kernel performs
poorly because of irregular, and non-coalesced global memory accesses, which, toget-
her with low arithmetic intensity of the kernel, is the main performance bottleneck.
Moreover, poor data locality and massively multithreaded nature of the kernel with
little cache capacity per thread results in high L1-cache miss rates. Such behavior sig-
nificantly over-fetches off-chip data for the application, wasting memory bandwidth,
and on-chip storage. It is, therefore, clear that optimizing the global memory access
and improving the effective bandwidth is most important for the best performance
of the beam dynamics simulation on GPUs. On the other hand, this also suggests
that other optimization methods such as improving the floating point performance
through the optimization of the arithmetic instructions or hiding the latency of global
memory access through maximizing the multiprocessor occupancy are not effective.
3.3.2 HEURISTICS ALGORITHM
The heuristics algorithm presented in this section use two different heuristics to
maximize data reuse and to minimize divergence in the parallel implementation of
compute retarded potentials stage of the beam dynamics simulation on GPUs. The
heuristics used in the algorithm helps in effectively utilizing the bandwidth at diffe-
rent levels of the memory hierarchy by coalescing the memory accesses, and maximize
the data reuse (i.e. increase the probability of a data block to be reused more ex-
tensively before the block is replaced) by improving data locality. Furthermore, the
57
algorithm uses techniques that effectively balance the workload among threads, mi-
nimize their divergence, and reduce the overall floating point operations required to
compute the retarded potentials when compared to prior implementations.
The procedure Compute-Potentials-H implements the heuristics based com-
pute retarded potential stage of the simulation that approximates rp-integral to a
error tolerance of τ at all points p ∈ V , where p is a grid-point on the spatial grid at
time step k with simulation time t = k∆t and |V | = NXNY . Grid-point p is a 5-tuple
(x, y, t, I, ε) element, where (p.x, p.y) denotes the Cartesian coordinate of a point on
the spatial grid at time step p.t, p.I is the integral estimate for the rp-integral at
p and p.ε is the error estimate. The integral estimate p.I is a 3-tuple (φ,Ax, Ay)
element which represents the scalar and vector potentials on p. The moments com-
puted from all time steps are stored linearly on the device memory as a 2D array
M [1..Nt, 1..NXNY ] in row major order such that moments deposited on a grid-point
located in ith row and jth column of the spatial grid during the particle deposition
stage at tk = k∆t is stored in M [k, (iNX + j)], for all integers i, j and k in the range
0 ≤ i ≤ NX , 0 ≤ j ≤ NY , and 0 ≤ k < Nt, respectively.
In Compute-Potentials-H, lines 1-2 initialize the estimates p.I and p.ε to 0
for all points p ∈ V . Next, RP-Classifier procedure at line-3 partitions the set
of grid points into a small number of clusters based on the data locality properties
of the corresponding rp-integrals, using standard clustering techniques like k-means.
More formally, given a set of grid points V and an integer k > 0, RP-Classifier
partitions the |V | grid points into k(≤ |V |) classes, C = {C1, C2, . . . , Ck}, such that








where µi is the center of cluster Ci, and the distance function d(p, µi) is the measure
of similarity between the grid-point p ∈ Ci and its cluster center µi for all integer i
in range 0 < i ≤ k. A value of zero to the distance function implies strong similarity,
and the similarity decreases with the increase in distance value. In RP-Classifier,
two points u and v (u, v ∈ V , u 6= v) are considered to have strong similarity when
the rp-integral evaluations at u and v exhibit high data locality against one another,
in such a case, d(u, v) ≈ 0. One of the simplest measure of data locality is the
number of overlapping memory locations required to evaluate rp-integral at u and
v. Consequently, the distance function d(u, v) can be expressed as the inverse of
58
data locality i.e. inverse of the number of overlapping memory locations required to
evaluate rp-integral at u and v.
Compute-Potentials-H(k, V, τ,M)
1 for each grid-point p ∈ V in parallel
2 p.I ← 0, p.ε← 0
3 C ← RP-Classifier(V, k) // implements k-means clustering
4 for each class c ∈ C in parallel
5 P ← RP-IntegralPartition(c, t,M)
6 for each grid-point p ∈ c in parallel
7 for i = 0 to P.length− 1
8 a← P [i], b← P [i+ 1]
9 (I, ε)← RP-QuadRule(p, [a, b], τ,M)
10 if ε < τ
11 p.I ← p.I + I
12 p.ε← p.ε+ ε
13 else
14 List-Insert(L, ([a, b], p))
15 for each ([a, b], p) ∈ L in parallel
16 (I, ε)← RP-Quadrature(p, [a, b], τ,M)
17 p.I ← p.I + I
18 p.ε← p.ε+ ε
The main motivations behind such locality based classification is that when a
set of rp-integrals that exhibit high data locality between each other are mapped to
parallel CUDA threads with one-to-one correspondence, they exhibit strong inter-
thread locality. Such data locality among the threads can be exploited by grouping
them into one or more thread blocks, where the memory performance is improved
due to the benefit form data locality by using L1-cache or shared-memory. In ad-
dition, these thread blocks when scheduled on a single core or simultaneously on
different cores can also benefit from locality using shared L2-cache. However, accu-
rate prediction of data locality without actually evaluating the integral is non-trivial
and computationally challenging due to the data-dependent, irregular, and statically
unpredictable (i.e. unknown until run time) memory accesses patterns of rp-integral
59
evaluations at different grid points. As a result, we propose a heuristic to approx-
imate d(u, v), referred to as locality heuristic. The key observation behind locality
heuristic is that while computing rp-integral on grid points, there is a significant
reuse of data for two nearby grid points. The reuse of data for two grid points is
inversely proportional to the distance between the two grid points. More formally,
Locality heuristic: Given two points u and v, where u, v ∈ V and u 6= v,
the data locality or the number of overlapping memory access between
rp-integral evaluation at u and v is inversely proportional to the Euclidean
distance between them.
We performed empirical analysis using sequential beam dynamics simulation for
different input configurations to validate the above assumed heuristics, and all our
experiments confirm the locality heuristic. Consequently, in RP-Classifier, Eucli-
dean distance is used as the distance function to measure the data locality between
two points. (Note that other distance metrics such as L1-distance will do fine also.)
Next, for each class c ∈ C, we require to calculate the rp-integral estimate at all
grid points p ∈ c. Typically, numerical approximation of rp-integral at each grid-
point results in a partition that is independent of the rp-integral at other points, and
methods such as adaptive quadrature are often used to compute these partitions, as
is the case in [5]. Once the partition is computed, integral estimate is calculated
using Equation 14. However, in our proposed approach, a single unique partition per
class that combines the partition of rp-integral at all grid points of that particular
class is calculated using heuristics instead of using traditional adaptive quadrature
methods on each point.
The main motivations for calculating such unique partition for a group of points
instead of individual grid-point is that it eliminates the need for adaptive quadrature
or data-dependent control-flow on each integral evaluation, which, as illustrated in
[3, 4, 5], is the main performance bottleneck for such adaptive computations on SIMD
architectures. The procedure RP-IntegralPartition implements this heuristics
approach, where for each class c ∈ C, it generates a unique partition P [1..P.length]
that denotes a rp-integral partition along the outer integration domain (r′-domain).
Ideally, P should be a combination of the partitions generated by rp-integral at all
p ∈ c. However, computing such partition per class instead of individual grid-point
is computationally challenging due to the data-dependent, and irregular control-flow
60
behavior of different rp-integrals. Moreover, it requires prior understanding of the
integrand being integrated. As a result, we propose a heuristic to compute the par-
tition for each class c ∈ C, referred to as control-flow heuristic. The key observation
behind control-flow heuristic is that while working on a set of grid points, we can
use partial results of the same set of grid points at an earlier time step. More formally,
Control-flow heuristic: Given a class of grid points c ∈ C, where c is one
of the k-classes generated by RP-Classifier. Then, the unique partition
required to evaluate the rp-integral at all p ∈ c is approximated by combining
the partition for rp-integral at the class center and the partition for same
set of grid points from earlier time step.
In the heuristics, partition for class center is computed sequentially using tradi-
tional adaptive quadrature method, whereas the partition for each grid-point from
previous time step is computed during the Compute-Potentials-H procedure of
that particular time step. We performed empirical analysis using sequential beam
dynamics simulation for different input configurations to validate the above assu-
med heuristics, and all our experiments confirm the control-flow heuristics. Like
locality heuristics, when rp-Integral computations at all grid points of a particular
class are mapped to parallel CUDA threads with one-to-one correspondence, they
exhibit uniform flow of computation. Such uniform control-flow will eliminate the
branch divergence and load balancing complexities that are introduced when adap-
tive quadrature method is used, as is the case in [5]. Moreover, uniform control-flow,
combined with inter-thread data locality from locality heuristics will further increase
the memory performance when the threads accessing same cache line are grouped in
a warp.
Once the procedure RP-IntegralPartition at line-5 outputs the partition
P [1..P.length] for a class c ∈ C, then the rp-integral estimate for all grid points













′, θ′, t′)dθ′ (18)
where for all i, integral estimate along the subregion [P [i], P [i + 1]] is computed
using Simpsons quadrature rule when the error estimate is less than τ ; otherwise,
adaptive quadrature method is used to compute the integral. Ideally, for all p ∈ c,
all the subregions in Equation 18 should have error estimate less than τ , as defined
61
by the control-flow heuristics. However, there may exist some grid-point p′ ∈ c for
which the error estimate along the subregion [P [j], P [j + 1]] is larger than τ , for
some j in range 0 < j < P.length (i.e. the heuristic fails for some subregion). We
hypothesize that this particular scenario will seldom occur, and even if it should, the
number of subregions that fail the heuristics will be insignificant. However, in order
to maintain the correctness of the integral estimate, we use adaptive quadrature on
these subregions.
The procedure RP-QuadRule at line-9 of Compute-Potentials-H denotes
the Simpsons quadrature rule computation for rp-integral at p along the subregion
[a, b], and it outputs a pair I and ε which correspond to the integral and error es-
timate for that particular subregion. Lines 10-13 accumulates the integral estimate
when error estimate for the subregion is less than τ and all other subregions are
inserted into a list L (line-14) for later application using parallel adaptive quadra-
ture method (line 15-18). The computation between line 15-18 is identical to that
of RP-PhaseTwo procedure from Section 3.3.1. A more comprehensive review
of the heuristics algorithm, its implementation and performance analysis on GPU
architectures can be found in [6].
Performance Analysis and Limitations
Table 4 illustrates the double-precision floating-point performance for computing
retarded potentials at a particular time step of a beam dynamics simulation with
100000 particles and varying grid resolution on NVIDIA Tesla K40 GPU with global
memory accesses configured to be cached in both L1 and L2 (commonly called the
Caching mode). Initial distribution for all the simulations are generated by Monte
Carlo sampling of N particles with a total charge of beam bunch Q = 1nC, and the
rp-integral at all grid points are approximated to a error tolerance of τ = 10−6.
The performance metrics illustrated in this section are measured using NVIDIA
profiler [62], and a detailed description about each metric and its relation to kernel’s
performance is illustrated in Appendix B.2. In this study, for convenience, we shall
refer to the kernel implementing GPU specific code from Compute-Potentials-H
procedure of the heuristics algorithm as Heuristics-RP-Kernel.
Roofline Model Analysis - Figure 19 shows the performance of Heuristics-RP-
Kernel illustrated on the Roofline model for K40 GPU. A detailed description
62
Grid Resolution (NX ×NY ) 64× 64 128× 128 256× 256
Double Precision Performance (GFlops/sec) 401 420 440
Arithmetic Intensity (Flops/DRAM byte) 2.00 2.10 2.22
Effective Bandwidth (GB/sec) 398 446 476
Global Load Efficiency 105% 115% 120%
Global Load Transactions per Request 1.8 1.8 1.8
Warp Execution Efficiency 92% 96% 97%
L1-cache Global Hit Rate 100% 99% 99%
L2-cache Hit Rate 100% 100% 97%
Table 4: Performance of Heuristics-RP-Kernel for computing retarded poten-
tials in a beam dynamics simulation with 100000 particles and for different grid
resolutions on NVIDIA Tesla K40 GPU.
about Roofline plot and its use to bound the performance of GPU kernels is illustra-
ted in Appendix B.1. The performance achieved by Heuristics-RP-Kernel for
the simulation with 100000 particles and 128× 128 grid resolution is 420 GFlops/sec
(see Table 4), and it is indicated by a blue horizontal line in Figure 19. The point
where this blue horizontal line intersects the bandwidth ceiling marks the achieved
arithmetic intensity. In particular, the green and red X marks achieved arithmetic
intensity for Heuristics-RP-Kernel when the system being modeled delivers a
memory bandwidth of BWTheoretical-Peak = 288 GB/sec and BWExperimental-Peak = 200
GB/sec, respectively. Table 4 illustrates the arithmetic intensity (with the bandwidth
assumed to be BWExperimental-Peak) achieved by Heuristics-RP-Kernel measured
for different simulation configuration. We notice that arithmetic intensity achieved
for the kernel is approximately 2.10 Flops/DRAM byte-accessed when the memory
the system being modeled delivers a memory bandwidth of BWExperimental-Peak, which
is 7X more than Two-Phase-RP-Kernel. Furthermore, the effective bandwidth
for the kernel is greater than the experimental peak, BWExperimental-Peak. The increase
in effective bandwidth indicates that Heuristics-RP-Kernel is effective in utili-
zing the caches to filter the number of accesses that go to memory, thereby increasing
the arithmetic intensity.
Branch Divergence - The warp execution efficiency Heuristics-RP-Kernel is ne-









1/8 1/4 1/2  1  2  4  8  16  32
Peak DP GFlops/sec 




























NVIDIA Tesla K40 GPU
Figure 19: Roofline model analysis for Heuristics-RP-Kernel on NVIDIA Tesla
K40 GPU.
that GPU kernel has fewer divergent branches and has near uniform control-flow.
In other words, locality heuristics in the proposed algorithm is effective in reducing
the control-flow irregularity among threads and minimizing their divergence, which,
as illustrated in Section 2.2 and [60, 59], is one of the most important performance
consideration in programming for CUDA-capable GPU architectures.
Memory Performance - The memory performance of Heuristics-RP-Kernel on
GPU is analyzed using profiler metrics - Global load efficiency, Global load tran-
saction per requests, Cache hits, etc. These metrics and its relation to the kernel’s
performance is described in Appendix B.2. The following key observations about the
memory performance of Heuristics-RP-Kernel are inferred from the metrics in
Table 4:
• Global load efficiency for the kernel, which is the ratio of number of bytes
requested by the kernel to number of bytes transferred, is greater than 100.
64
This indicates that, on average, the load requests of multiple threads in a warp
are fetched from the same memory address.
• Global load transactions per request value is much closer to the ideal value of
2.0 for transactions with 8-byte words. This shows that most of the memory
requests from within a warp are coalesced, and the memory accesses are within
at most two cache lines.
• Global memory requests from the kernel almost always hits in L2-cache which is
evident by near 100% L2-cache hit rate. This shows that the kernel fits entirely
in L2-cache, and the behavior is identical to that of Two-Phase-RP-Kernel.
• The L1-cache hit rate is nearly 100%. Typically, increased cache hit from
a kernel reduces the DRAM bandwidth which contributes to the increase in
effective bandwidth of that particular kernel, as is the case for Heuristics-
RP-Kernel.
It is clear from the above observations that the performance of Heuristics-
RP-Kernel is substantially improved when compared to that of Two-Phase-RP-
Kernel. In particular, heuristcs based algorithm is effective in improving data
locality, maximizing data reuse, coalescing the memory accesses, and in increasing
the effective bandwidth of the kernel. Additionally, effective utilization of the band-
width at different levels of the memory hierarchy results in an increase in arithmetic
intensity, which is shown in Figure 19. The heuristics algorithm and its implemen-
tation on GPU is currently the fastest known method for high-fidelity computation
of retarded potentials in a charged particle beam dynamics simulations.
65
CHAPTER 4
EFFICIENT PARALLEL SIMULATION ON GPUS
In Section 3.2, we showed that the distribution of work and data in the accurate
computation of retarded potentials at each time step of the simulation is highly
unstructured and they cannot be characterized a priori, as these quantities are input-
dependent and evolve with the computation itself. As a direct consequence of these
properties, obtaining high-performance in such algorithms is extremely challenging,
and to this end, we have developed two parallel algorithms to accurately calculate
the retarded potentials at each time step of the simulation using GPUs, Two-phase
Algorithm and Heuristics Algorithm. Two-phase Algorithm is illustrated
in Section 3.3.1 and it targets equal distribution of work over processor to reduce
control-flow irregularity. Heuristics Algorithm is illustrated in Section 3.3.2 and
it uses heuristics to maximize data reuse and to balance the workload among threads,
thereby reducing both control-flow and memory access divergence on GPUs. Both
these algorithms focus on optimizing the irregular, data-dependent memory accesses
and control-flow during a single time step of the simulation independent of the other
steps, with the assumption that these patterns are completely unpredictable.
Multiple analysis of beam dynamics simulation executing irregular workloads for
a few hundreds or thousands of time steps show that control-flow and data access
patterns made during the computation of retarded potentials follow a loosely similar
pattern between time steps. In such situation, one effective approach to reduce the
irregularities is to analyze the control-flow and data access patterns at each time
step of the simulation and then anticipate future data dependence and control-flow
before it is needed. Given the complexity and diversity of control-flow and data
access patterns in beam dynamics simulation, we believe anticipation strategies are
best realized via intelligent application-specific prediction models that can adapti-
vely model and track access patterns. Access pattern forecasts can then be used to
formulate runtime decisions that optimize future computations of collective effects
on GPUs, such as determining computations to thread mapping that maximize data
reuse within a cache-sharing thread group and minimize thread divergence, data
prefetching, computational workload balancing, linearizing the irregularities, etc.
66
This chapter presents the use of predictive analytics and forecasting techniques
to optimize the computation of retarded potentials on GPUs, thereby improving
the overall performance of beam dynamics simulation. In particular, we present a
cache-aware algorithm that use machine learning to forecast the control-flow and
data access patterns required to calculate the retarded potentials at a future time
step based on the computations and access patterns observed from earlier time steps.
The remainder of the chapter is organized as follows. Section 4.1 presents the ma-
chine learning approach to model irregular data access patterns in the computation of
retarded potentials where the future values of control-flow and data access patterns
are predicted based on the previously observed values. Next, the parallel algorithm
to calculate retarded potentials using predictive analytics and its implementation on
GPU architecture is illustrated in Section 4.2. Section 4.3 validates the parallel im-
plementation and then illustrates the performance of parallel algorithm on NVIDIA
Tesla K40 GPU.
4.1 MODELING ACCESS PATTERNS
An effective model for forecasting irregular data access patterns must predict
when, what, and how many data blocks are required by an application before it
is needed. To obtain these predictions, we have modeled data access patterns in
the computation of retarded potentials using application-specific supervised learning
algorithms described in this section. First, we outline the representation of data
access pattern in the numerical approximation of rp-integrals, which are later used
as input to the learning algorithm. Next, we present the online prediction model
that use supervised learning on the observed data access patterns to train the model.
Finally, we outline the algorithm to forecast the data access and control-flow patterns
in rp-integral evaluations at a future time step using the prediction model learned at
an earlier time step.
4.1.1 REPRESENTATION OF DATA ACCESS PATTERN
The computation of rp-integral at a grid-point p ∈ Vk during time step k requires
referencing data from the 2D grids of moments computed from one or more earlier
time steps. In particular, calculating rp-integral along the subregion Sic∆t,(i+1)c∆t
requires referencing data from the 2D grids computed during time steps k−i, k−i−1,
and k−i−2 (i.e., gridsDk−i, Dk−i−1, andDk−i−2), for all positive integers i and k such
67
that i < k and k < Nt. Further, when Sic∆t,(i+1)c∆t is subdivided into ni partitions
then rp-integral evaluation within that subregion will result in αni memory references
to each of the three data grids, where α is the number of memory references made
during the computation of inner integral, which is constant for a given Newton-Cotes
formulae.
Adapting to this relation between the number of partitions and the memory
references, we choose to represent the data access pattern in rp-integral evaluation
using the partition generated along subregions, Sic∆t,(i+1)c∆t, for integers i such that
0 ≤ i < Nt. More formally, data access pattern observed during rp-integral evaluation
at a grid-point p ∈ Vk is represented by a list, [n(p)0 , n
(p)





the number of partitions along the subregion Sic∆t,(i+1)c∆t required during rp-integral
evaluation at p, and given the access pattern, we can easily calculate the memory








Data access patterns are extracted during regular execution of the beam dyna-
mics simulation with negligible overhead except for additional storage required to log
the access patterns. These access patterns are later used by a supervised learning
algorithm to train the online prediction model. Note that we have chosen to mo-
del the access patterns using coarser data grids instead of individual data elements.
The rationale behind coarser modeling is that the irregular nature of beam dyna-
mics simulation makes it challenging to track access to individual data elements.
Moreover, even if tracking individual data elements was feasible, the overhead as-
sociated with storing the number of references to each data element will increase
the memory requirement of beam dynamics simulation, which is already a memory
intensive application.
4.1.2 ONLINE PREDICTION MODEL
To capture and forecast the irregular data access patterns, we model the appli-
cation access patterns using online prediction techniques where the future values are
predicted based on the previously observed values. Formally, suppose the current
time step of the simulation is k and we are given a set of access patterns observed
during rp-integral computations at all grid points up to time step k. Then, using all
the data access patterns observed up to time step k, the prediction model forecasts




1 , . . . , n
(q)
Nt
] required to compute rp-integral at a grid-point
68
q ∈ Vj for a future time step j, where j > k. This forecast is used to formulate intel-
ligent runtime decisions that optimize the application execution during time step j.
Further, forecasts can be one-step ahead forecasting where j = k+1, or multiple step
ahead forecasting where j >> k. We use one-step ahead forecasting in this study.
Training and Prediction
For model training, given a set of training examples of the form (x1, y1), . . . , (xn, yn)
where xi is the grid-point of the i
th example and yi is the access pattern observed
during rp-integral evaluation at xi, a learning algorithm seeks a function g : X → Y ,
where X is the space of inputs (i.e., grid points) and Y is space of outputs (i.e.,
rp-integral data access patterns). In particular, at time step k, the set of rp-integral
computations at all grid points from one or more earlier time steps is used as training
data by a supervised learning algorithm to seek a predictor function gk : X → Y at
that particular time step. However, keeping track of rp-integral computations and
access patterns from multiple time steps may increase the computational load and
memory requirement of the application. In such situation, we can use supervised
learning algorithms with online training where the function predictor at kth time
step, gk, is learned just from the access patterns observed during time step k, and
the previous best predictor gk−1.
The choice of learning algorithm depends on the data distribution, quality and
nature of the data, required accuracy of prediction, and so on. Typically, each
learning algorithm have different effect on a given problem, and as result, choosing
the right algorithm often requires studying multiple algorithms and its effects on the
problem before choosing the best performing one. In this study, we use k-nearest
neighbor algorithm (kNN) in regression setting to train the prediction model, which,
based on the heuristics illustrated in Section 3.3.2 and in [6] is an intuitive choice.
4.1.3 FORECASTING MEMORY ACCESS
Given a predictor function gk−1 learned at time step k−1, the data access pattern
for rp-integral evaluation at a grid-point p ∈ Vk for time step k is approximated as,
gk−1(p).
In the algorithm illustrated in Section 4.2, the predicted access patterns for all
points p ∈ Vk are used to determine rp-integral computations to thread mapping
that maximizes the data reuse within a cache-sharing thread group on the target
69
architecture. This leads to an improved cache performance, even with the presence of
memory access irregularity. Also, note that the predicted access patterns are just an
approximation to the observed access patterns, and they are primarily used to reduce
memory access irregularities in the simulation by optimizing computation to thread
mapping, therefore, does not compromise the correctness of integral computation.
4.1.4 FORECASTING CONTROL-FLOW
The flow of computation or control-flow for numerically approximating rp-integral
at grid-point p ∈ Vk is typically determined by the algorithm used to compute the
partition, 〈r(p)0 , r
(p)
1 , . . . , r
(p)
n 〉, along the outer dimension. Adaptive quadrature is tra-
ditionally used to compute such partitions, which, as illustrated in [3, 4], is characteri-
zed by control-flow and memory access irregularities that leads to severe performance
bottlenecks on GPU architectures.
In the proposed algorithm we use predicted access patterns to approximate rp-
integral partition, and given the partition, computation of rp-integral simply in-
volves evaluating Equation-14. Such evaluation exhibit uniform and deterministic
control-flow that can be mapped to GPUs with minimal thread divergences, the-





1 , . . . , n
(p)
Nt
] corresponding to a grid-point p ∈ Vk, the forecasting algorithm
computes a partition list, 〈r(p)0 , r
(p)
1 , . . . , r
(p)
m 〉, required to calculate rp-integral at p,
which is an approximation to the partition required to calculate rp-integral within
the required error tolerance. The following two methods are used to transform the
data access pattern to integral partition -
1. Uniform partitioning - Each subregion Sic∆t,(i+1)c∆t is divided into (n
(p)
i −1) finer
subregions of equal size (i.e., n
(p)
i partitions along Sic∆t,(i+1)c∆t), for all integers





the entire integration region [0, R(p)].
2. Adaptive partitioning - Partition generated at an earlier time step is used al-
ongside the access pattern forecast to approximate the partition at time step
k. The choice of partition from an earlier time step is identical to the approach
used in Section 3.3.2, and this partition is updated using the access pattern
forecast to generate a new partition which is used during time step k. For
example, let the partition from an earlier time step contain di partitions along
70
Si, then each subregion in Sic∆t,(i+1)c∆t from the earlier partition is divided into
n
(p)
i /di finer subregions to generate a new partition, which is used for rp-integral





i on the entire integration region S0,R(p).
Note that the partition forecast computed from the access pattern is an approx-
imation to the partition required to calculate rp-integral within the required error
tolerance, and it is possible that the rp-integral estimate calculated using this par-
tition forecast is not within the required error tolerance. The proposed algorithm
illustrated in Section 4.2 handles this situation by considering the prediction as an
initial condition for numerically approximating rp-integral and ensures that integral
estimate always achieves the required error tolerance.
4.2 PARALLEL ALGORITHM
The procedure Compute-Potentials-ML implements the second step of the
four step beam dynamics simulation algorithm where it approximates the rp-integral
at all grid points on a 2D grid for a given time step. The procedure takes input
k, V, τ, g, and D, where k is the current time step of simulation, V is a set of grid
points on the 2D grid at kth time step such that |V | = NXNY , τ is the required error
tolerance for rp-integral evaluations, g denotes the predictor function learned using
supervised learning algorithm at time step k − 1, and D is the list of 2D data grids
of moments from each time step stored linearly on the device memory. Each grid-
point p ∈ V is a reference to 7-tuple object, (x, y, t, I, ε, access pattern, partition),
where (p.x, p.y) denote the Cartesian coordinate of the grid-point on the 2D grid
at time step k, p.t is the simulation time of the corresponding time step, p.I is the
rp-integral estimate, p.ε is the rp-integral error estimate, p.access pattern is a list
containing the data access pattern for rp-integral computation, and p.partition holds
a list containing the partition for rp-integral computation.
The procedure Compute-Potentials-ML works as follows. Line 1-4 initializes
different attributes of the grid-point object. In particular, for each grid-point p ∈
V , line-2 initializes integral and error estimates to 0, line-3 uses the best predictor
function g learned at time step k − 1 to forecast the access pattern required for rp-
integral computation for the current time step k, and line-4 calls a procedure that
implements the algorithm described in Section 4.1.4 to convert access pattern forecast
to rp-integral partition. Next, RP-Clustering procedure at line-5 implements a
71
clustering algorithm to partition the grid points based on their data access patterns
such that access patterns for grid points in the same cluster are similar to one another.
Formally, given a set of grid points V and an integer m, RP-Clustering procedure
partitions the |V | grid points into m disjoint clusters, C = {C1, C2, . . . , Cm}, such








||p.access pattern− µi||2 (19)
where µi is the center of cluster Ci, and m = NX or NY . We use k-means clustering
algorithm to implement RP-Clustering procedure.
Compute-Potentials-ML(k, V, τ, g,D)
1 for each grid-point p ∈ V in parallel
2 p.I ← 0, p.ε← 0
3 p.access pattern← g(p.x, p.y, p.t)
4 p.partition← Compute-Partition(p.access pattern)
5 C ← RP-Clustering(V )
6 L← ∅
7 for each cluster c ∈ C in parallel
8 P ← ∅
9 for each grid-point p ∈ c in parallel
10 P ←Merge-Lists(P, p.partition)
11 for each grid-point p ∈ c in parallel
12 L′ ← Compute-RP-Integral(p, P, τ,D)
13 L←Merge-Lists(L,L′)
14 for each ([a, b], p) ∈ L in parallel
15 (I, ε, P, A)← RP-Quadrature(([a, b], p), τ,D)
16 p.access pattern←Merge-Lists(p.access pattern,A)
17 p.partition←Merge-Lists(p.partition, P )
18 p.I ← p.I + I
19 p.ε← p.ε+ ε
20 g ← Online-Learning(V, g)
Furthermore, two grid points u, v ∈ V , where u 6= v, gets partitioned into same
cluster when u.access pattern exhibits stronger similarity to that of v.access pattern,
72
which also implies that rp-integral computation at u and v have maximum data reuse
between them. This property of data reuse and access pattern similarity within a
cluster is used to optimize rp-integral computations to thread mapping such that
the overall memory performance on the target architecture is maximized. In other
words, when a set of rp-integral computations that exhibit similar data access pattern
between each other are mapped to parallel threads with one-to-one correspondence,
they exhibit strong inter-thread locality. Such data locality among threads can be
exploited by grouping them into one or more thread blocks in GPU architectures,
where the memory performance is improved due to the benefit from data locality
by using L1-cache or shared-memory. Note that, RP-Clustering procedure in
Compute-Potentials-ML is similar to the one used in Section 3.3.2, however,
algorithm in Section 3.3.2 uses heuristics to measure the data reuse between two grid
points, whereas in here, we use a more accurate measure of data reuse between two
points by comparing their corresponding rp-integral computations access patterns.
Even though we measure data reuse using predicted access patterns which are just an
approximation to the observed data access patterns, experimental results in Section
4.3 shows that this approach is effective in improving the memory performance.
The for loop in lines 7-13 evaluates rp-integral at all grid points using the partition
approximated in line-4. First, for each cluster c ∈ C, line-8 initializes a list P , and
for each grid-point p ∈ c, the for loop in line 9 merges the list p.partition with P
by calling an auxiliary procedure Merge-List. The procedure Merge-List(P, P ′)
returns the sorted list that is the merge of its two sorted input lists P , and P ′ with
duplicate values removed. In other words, lines 9-10 combine the predicted partition
of all the points p ∈ c into a single unique partition P , such that the combined
partition is an approximation to individual partitions. The main objective behind
combining the partitions is to have uniform control-flow in the for loop at line-11,
which aids in minimizing the thread divergence when computations of this loop are
mapped to GPU threads.
The procedure Compute-RP-Integral(p, P, τ,D) approximates rp-integral at
a grid-point p using only the subregions from a partition list P where the rp-integral
error estimate is less than τ , and the integral and error estimates along each subre-
gion is approximated using Simpson’s quadrature rule. We use an auxiliary procedure
RP-QuadRule([a, b], p,D) to compute Simpson’s quadrature rule estimates, (I, ε),
along a subregion [a, b] for rp-integral at a point p, where I is the integral estimate
73
and ε is the error estimate. We omit the pseudocode for RP-QuadRule, as it is
identical to the standard Simpson’s quadratue rule with the inner integral approxi-
mated using Newton-Cotes formulae, as illustrated in [5]. In ith iteration of the for
loop in Compute-RP-Integral procedure, integral and error estimates along an
integration region SP [i],P [i+1] is calculated by calling RP-QuadRule. Next, when
the error estimate returned from RP-QuadRule is less than τ , both integral and
error estimates are accumulated to the input grid-point’s global estimates, p.I and
p.ε, respectively. Otherwise, the grid-point object and the corresponding subregion
where the error estimate is larger than τ is inserted to a list L. Once the for loop
terminates, partition list used to compute the integral is stored in p.partition, data
access pattern observed during the computation is stored in p.access pattern, and
the list L is returned as the output from Compute-RP-Integral procedure.
Compute-RP-Integral(p, P, τ,D)
1 L← ∅
2 for i = 0 to P.length− 1
3 a← P [i], b← P [i+ 1]
4 (I, ε)← RP-QuadRule(([a, b], p), τ,D)
5 if ε < τ
6 p.I ← p.I + I
7 p.ε← p.ε+ ε
8 else
9 List-Insert(L, ([a, b], p))
10 p.partition← P
11 update p.access pattern with the observed data access pattern
12 return L
In the pseudocode for Compute-Potentials-ML, the method Compute-RP-
Integral is called on for each grid-point p ∈ c inside the for loop at line-11, and
it returns a list L′. Each element of this list is a pair ([a, b], p) where [a, b] denotes a
subregion such that the Simpson’s quadrature rule error estimate for rp-integral at a
grid-point p along that subregion is larger than τ . Furthermore, individual list from
each iteration of the for loop is merged to a global list L using the auxiliary proce-
dure Merge-List. The accumulated subregions and grid points from the global list
are processed using traditional adaptive quadrature algorithm in lines 14-19. We use
74
the procedure RP-Quadrature to implement Simpson’s adaptive quadrature algo-
rithm, where, in addition to integral and error estimates, our implementation returns
the integral partition along outer dimension and the data access pattern observed du-
ring the algorithm execution. In particular, RP-Quadrature([a, b], p, τ,D) outputs
a tuple, (I, ε, P, A), where I and ε are the integral and error estimates, respectively,
P is the partition along the outer dimension generated by adaptive quadrature’s
control-flow, and A is the observed data access pattern. Next, access pattern and
partition returned from RP-Quadrature is merged with the corresponding grid-
point’s access pattern and partition, respectively, that is seen during Compute-RP-
Integral procedure. Furthermore, rp-integral estimates from adaptive quadrature
are accumulated to the grid-point’s global estimates.
Online-Learning(V, g)
1 X ← ∅
2 Y ← ∅
3 for each grid-point p ∈ V
4 List-Insert(X, (p.x, p.y, p.t))
5 List-Insert(Y, p.access pattern)
6 update the predictor function g : X → Y using supervised learning
on the inputs X and Y
7 return g
Next, in Online-Learning procedure, access patterns observed during rp-integral
computations at all grid points p ∈ V is used by a supervised learning algorithm to
train and update the predictor function g. The updated prediction function g is used
by Compute-Potentials-ML procedure during the next time step.
4.2.1 GPU IMPLEMENTATION
Initialization steps between lines 1-4 are implemented on CPU, where the for
loop is parallelized using OpenMP, and the grid-point attributes: access pattern and
partition, are stored only in CPU memory. The procedure RP-Clustering imple-
menting the k-means clustering algorithm is also implemented on CPU, using scikit-
learn library [65]. Further, in the implementation of RP-Clustering, we choose
number of clusters to be m = max(NX , NY ), and since k-means algorithm prefers
clusters of approximates similar size, each cluster size is approximately min(NX , NY ).
75
Lines 7-13 is the heart of beam dynamics simulation that approximates the rp-
integral at all grids points using the predicted partitions, and it is implemented on
GPUs. In particular, computations of the for loop at line-7 is assigned to one or more
thread blocks, where multiple thread blocks are used to take advantage of the Thread
Level Parallelism (TLP). Within each thread block, the computation of for loop at
line-11 is assigned to GPU threads with one-to-one correspondence. In other words,
each cluster c ∈ C is assigned to one or more thread blocks where the computation of
each grid-point p ∈ c is assigned to GPU threads of the corresponding thread-block.
The required number of threads for each block depends on the number of grid points
in the cluster assigned to that particular block, and as a result, number of threads
per block for GPU execution is chosen to be the maximum of all cluster sizes. Each
thread implements the Compute-RP-Integral procedure for the assigned grid-
point and the data access patterns observed during the evaluation of this procedure is
updated on CPU based on the partition used within the procedure. Further, TLP for
the GPU execution is governed by assigning multiple thread blocks to each cluster’s
rp-integral computation, such that the loop iteration in Compute-RP-Integral
procedure is shared between multiple thread blocks. It is important to note that the
flow of computation in for loop at line-11 is uniform between different threads of a
thread-block, and as a result, it eliminates the thread divergences between rp-integral
computations at different grid-points assigned to the threads of the block.
Next, the computations in lines 14-19 is also implemented on GPU using a diffe-
rent kernel from the one explain before. In this kernel, the list elements are mapped
to parallel GPU threads with one-to-one correspondence. Each parallel threads im-
plements the RP-Quadrature procedure in parallel and independent of other thre-
ads. This implementation is identical to the globally adaptive algorithm illustrated in
Section 3.3.1. Finally, Online-Learning procedure to update the prediction model
using supervised learning is implemented on CPU using scikit-learn and OpenMP,
where both the libraries use all the CPU cores to speed up the training process.
4.3 EVALUATION AND EXPERIMENTAL RESULTS
Simulation experiments for studying beam dynamics and performance analysis of
the parallel implementation of Compute-Potentials-ML is carried out on NVI-
DIA Tesla K40 GPU with global memory accesses configured to be cached in both
L1 and L2 (commonly called the Caching mode), unless specified otherwise. Initial
76
distribution for all the simulations are generated by Monte Carlo sampling of N par-
ticles with a total charge of beam bunch Q = 1nC, and the rp-integral at all grid
points are approximated to a error tolerance of τ = 10−6. The performance metrics
for all the GPU kernels illustrated in this section are measured using NVIDIA profiler
[62], and the results are averaged over multiple runs. For convenience, we shall re-
fer to the kernel implementing GPU specific code from Compute-Potentials-ML
procedure as Predictive-RP-Kernel.
4.3.1 VALIDATION OF PARALLEL SIMULATION
The correctness and accuracy of beam dynamics simulation depends on the fidelity
of Predictive-RP-Kernel algorithm that use predictive analytics and forecasting
techniques to calculate the retarded potentials in a multistep beam dynamics simu-
lation. To ensure that such prediction based algorithm does not trade simulation’s
correctness for performance, it is imperative to validate the parallel implementation
and its effect on the simulation. We validate the parallel simulation by comparing the
simulation output to the only special case for which the exact analytical results are
available - that of a 1D monochromatic rigid bunch. Exact analytical solutions for
the longitudinal and transverse force for a 1D rigid-line bunch study state model is
given in [28, 48]. The parallel implementation presented here is benchmarked against
the analytical results described in [28, 48] for the parameters of the Linac Coherent
Light Source (LCLS) bend [47]: bend radius R0 = 25.13 m, θb = 11.4
◦, longitudinal
rms beam size σs = 50 µm, emittance ε = 1 nm, total beam charge Q = 1 nC. From
Figure 20, it is evident that both longitudinal and transverse forces computed with
our parallel algorithm agree perfectly with the exact analytical solution.
Further, a closer look into the nature of convergence of the computed forces to








Fi − F exacti
)2
,
with N the number of particles, Fi the computed force and F
exact
i the analytic force
on individual particles. As one should expect from Monte-Carlo type simulations,
the accuracy of the computed forces, as measured by the mean-square error, scales
































































Figure 20: Analytic versus computed effective longitudinal (left) and transverse
(right) forces for the LCSL bend [47]: N = 1000000 particles on a 128 × 128 grid,
bend radius R0 = 25.13 m, θb = 11.4
◦, longitudinal rms beam size σs = 50 µm,



















Number of particles per cell (Nppc)
Figure 21: Mean-square error for the longitudinal force, as defined in the text, as
a function of the number of particles per cell Nppc = N/Ngrid, for a fixed grid of
128× 128 (or Ngrid = 1282).
78
Grid Resolution (NX ×NY ) 64× 64 128× 128 256× 256
Double precision performance (GFlops/sec) 460 480 485
Arithmetic Intensity (Flops/DRAM byte) 2.30 2.40 2.43
Effective Bandwidth (GB/sec) 532 567 580
Global Load Efficiency 135% 150% 160%
Global Load Transactions per Request 1.9 1.9 1.9
Warp Execution Efficiency 97% 99% 99%
L1-cache Global Hit Rate 100% 100% 99%
L2-cache Hit Rate 100% 100% 100%
Table 5: Performance of Predictive-RP-Kernel for computing retarded poten-
tials in a beam dynamics simulation with 100000 particles and for different grid
resolutions on NVIDIA Tesla K40 GPU.
4.3.2 PERFORMANCE ANALYSIS
Table 5 illustrates the double-precision floating-point performance of Predictive-
RP-Kernel at a particular time step of a beam dynamics simulation with 100000
particles and varying grid resolution on NVIDIA Tesla K40 GPU. The results from
Table 5 are used to provide a quantitative analysis on the effects of using predictive
analytics and forecasting techniques in improving the performance of compute retar-
ded potentials stage of the beam dynamics simulations on GPUs. The performance
metrics illustrated in this section are measured using NVIDIA profiler [62], and a
detailed description about each metric and its relation to kernel’s performance is
illustrated in Appendix B.2.
Roofline Model Analysis
Figure 22 shows the performance of Predictive-RP-Kernel illustrated on the
Roofline model for K40 GPU. A detailed description about Roofline plot and its
use to bound the performance of GPU kernels is illustrated in Appendix B.1. The
performance achieved by Predictive-RP-Kernel for the simulation with 100000
particles and 128 × 128 grid resolution is 480 GFlops/sec (see Table 5), and it is
indicated by a blue horizontal line in Figure 22. The point where this blue hori-
zontal line intersects the bandwidth ceiling marks the achieved arithmetic intensity.









1/8 1/4 1/2  1  2  4  8  16  32
Peak DP GFlops/sec 




























NVIDIA Tesla K40 GPU
Figure 22: Roofline model analysis for Predictive-RP-Kernel on NVIDIA Tesla
K40 GPU.
achieved arithmetic intensity for Predictive-RP-Kernel when the memory band-
width attained by the kernel is assumed to be BWTheoretical-Peak = 288 GB/sec and
BWExperimental-Peak = 200 GB/sec, respectively.
Figure 23 compares the performance of Predictive-RP-Kernel against exis-
ting two kernels - Two-Phase-RP-Kernel and Heuristics-RP-Kernel, illus-
trated on the Roofline model for K40 GPU. The vertical lines indicate the achie-
ved arithmetic intensity when the attained memory bandwidth is assumed to be
BWExperimental-Peak and the X marks performance attained for that particular GPU
kernel. It is clear from Figure 22 and Figure 23 that the parallel algorithm and its
implementation on GPUs based on predictive analytics and forecasting technique
has sufficiently high arithmetic-intensity when compared to the previous two im-
plementations. In particular, Predictive-RP-Kernel delivers 480 GFlops/sec of
double precision floating-point performance on K40 GPU and this translates to achie-









1/8 1/4 1/2  1  2  4  8  16  32
Peak DP GFlops/sec 































































NVIDIA Tesla K40 GPU
Figure 23: Roofline model analysis of Predictive-RP-Kernel (red line) compared
against Heuristics-RP-Kernel (green line) and Two-Phase-RP-Kernel (grey
line) on NVIDIA Tesla K40 GPU.
more than the Two-Phase-RP-Kernel and Heuristics-RP-Kernel, respecti-
vely. Furthermore, the effective bandwidth for Predictive-RP-Kernel kernel is
greater than BWExperimental-Peak (see Table 5). The increase in effective bandwidth in-
dicates that the kernel implementation is effective in utilizing the caches to filter the
number of accesses that go to memory, thereby increasing the arithmetic intensity.
Branch Divergence
The warp execution efficiency for Predictive-RP-Kernel is nearly 100%, il-
lustrated in Table 5. This indicates that the GPU kernel has fewer divergent branches
and has near uniform control-flow. In other words, use of anticipation strategies are
effective in reducing the control-flow irregularity among parallel threads, which, as
illustrated in Section 2.2 and [60, 59], is one of the most important performance consi-
deration in programming CUDA-capable GPU architectures. Moreover, such higher
81
value of warp execution efficiency for Predictive-RP-Kernel indicates that this
kernel implementation is better at utilizing the GPU device to its full potential when
compared to the other two kernels .
Memory Performance
The memory performance of Predictive-RP-Kernel on GPU is analyzed
using profiler metrics - Global load efficiency, Global load transaction per requests,
Cache hits, etc. These metrics and its relation to the kernel’s performance is des-
cribed in Appendix B.2. The following analysis about the memory performance of
Predictive-RP-Kernel is inferred from studying different profiler metrics illus-
trated in Table 5 -
• Global load efficiency of Predictive-RP-Kernel is greater than 100%, which
indicates that on average, the load requests of multiple threads in a warp are
fetched from the same memory address and are also coalesced. In other words,
implementation is effective in taking advantage of the memory coalescing fea-
ture of the GPU’s architecture.
• Global load transactions per request for Predictive-RP-Kernel is much
closer to the ideal value of 2.0 for transactions with 8-byte words. This shows
that most of the memory requests from within a warp are coalesced, and the
memory accesses are within at most two cache lines.
• Global memory requests from Predictive-RP-Kernel almost always hits in
L2-cache which is evident by near 100% L2-cache hit rate. This indicates that
kernel exhibits high data locality and/or the problem fits entirely in L2-cache.
• The L1-cache hit rate for global loads is nearly 100%, which indicates elevated
data reuse between cache sharing threads groups. Further, increased cache hit
from the kernel reduces the DRAM bandwidth, which contributes to the incre-
ase in effective bandwidth and in subsequent increase of arithmetic intensity.
It is evident from the above observations that memory performance for the
GPU implementation of Predictive-RP-Kernel on Tesla K40 GPU is substanti-
ally improved when compared to Two-Phase-RP-Kernel and Heuristics-RP-
Kernel. In particular, computation to thread mapping based on the data access
pattern forecast is effective in maximizing the data reuse within all cache-sharing
82
thread groups. This leads to an improved cache performance and aids in reducing
the impact of any unforeseen memory access irregularity. Moreover, optimizations
based on predictive analytics and forecasting is effective in coalescing the memory
accesses and in increasing the effective bandwidth of the kernel. This improves the
overall utilization of the bandwidth at different levels of the memory hierarchy, the-
reby increasing the arithmetic intensity, as shown in Figure 22.
Speedup
Table 6 and Table 7 illustrates the performance of compute retarded potentials
stage of the simulation using Predictive-RP-Kernel compared against Two-
Phase-RP-Kernel and Heuristics-RP-Kernel for different simulation confi-
gurations. In Table 6, GPU time refers to the kernel execution time on the GPU
together with the time spent in memory operations (allocations, initialization, and
memory copy between CPU and GPU), Clustering time and Online-training time re-
fers to the execution time of procedures RP-Clustering and Online-Learning,
respectively, and the Overall time is the total execution time of the compute retarded
potentials stage of the simulation.
The results indicate that depending on the grid resolution and the number of
particles in the simulation, computing retarded potentials using Predictive-RP-
Kernel achieves a speedup gain of up to 6.4X and 2.8X compared to the Two-
Phase-RP-Kernel and Heuristics-RP-Kernel, respectively. Further, the split
execution time in Table 6 shows that the time spent in online-training and clustering
are negligible when compared to the GPU time. In other words, the time spent in
using the access pattern forecast to optimize the computation to thread mapping and
in formulating the runtime decisions are negligible when compared to the performance














64× 64 0.08 1.04 0.05 1.17
128× 128 0.40 4.77 0.10 5.27
256× 256 1.50 53.40 0.40 55.30
1000000
64× 64 0.08 0.93 0.05 1.06
128× 128 0.16 3.41 0.10 3.67
256× 256 1.18 30.10 0.40 31.68
Table 6: Execution time of compute retarded potentials stage of the simulation using Predictive-RP-Kernel for different
simulation configurations on NVIDIA Tesla K40 GPU.
Number of Grid Two-Phase-RP Heuristics-RP Predictive-RP
Particles Resolution Execution Execution Execution Speedup with respect to
(N) (NX ×NY ) Time (sec.) Time (sec.) Time (sec.) Two-Phase-RP Heuristics-RP
100000
64× 64 3.55 1.75 1.17 3.0 1.5
128× 128 28.56 14.20 5.27 5.5 2.7
256× 256 293.63 149.55 55.30 5.3 2.7
1000000
64× 64 2.97 1.40 1.06 2.9 1.3
128× 128 22.82 10.15 3.67 6.2 2.8
256× 256 201.29 88.35 31.68 6.4 2.8
Table 7: Speedup of Predictive-RP-Kernel compared against Two-Phase-RP-Kernel and Heuristics-RP-Kernel
for different simulation configurations on NVIDIA Tesla K40 GPU.
84
CHAPTER 5
EFFICIENT PARALLEL SIMULATION ON
HETEROGENEOUS ARCHITECTURE
This chapter presents the parallel algorithm and its implementation on hete-
rogeneous architectures for the retarded potentials computation stage of the beam
dynamics simulation which uses machine learning algorithms to create and distribute
sub-problems to different PUs of the underlying heterogeneous architecture. In par-
ticular, supervised learning algorithm illustrated in the previous chapter and in [7] is
used to adaptively model and track irregular data access patterns in the computation
of retarded potentials at each time step of the simulation to anticipate the future
data access patterns. Then, at some future time step, access pattern forecast is used
to approximately divide the original problem of evaluating the rp-integral into mul-
tiple smaller sub-problems that are defined only on a subset of data (a subproblem
is also referred to as tasks in this chapter). These tasks created from subdividing all
the NXNY rp-integrals of that particular time step are distributed between multiple
PUs of the heterogeneous architectures such that each PU is scheduled with tasks
localized to a portion of the entire data at a given time. In particular, data requi-
red for computing retarded potentials is first partitioned into multiple smaller data
blocks, and these data blocks are mapped to PUs dynamically such that no two PU
shares the same data block. Then, tasks localized to a data block is scheduled on
a PU based on data block to PU mapping which is determined dynamically. This
approach of task creation and distribution based on the data access pattern has the
advantage of assigning tasks with different memory footprint to different PUs which
improves the memory performance on the heterogeneous architectures by ensuring
that only a portion of data is required to be transferred to a particular PU.
The remainder of this chapter is organized as follows. Section 5.1 presents the
algorithm to compute retarded potentials on heterogeneous architecture composed of
one or more PUs of distinct hardware characteristics and processing capabilities. In
Section 5.1, we first provide a brief overview of the data layout (Section 5.1.1) and the
access pattern representation (Section 5.1.2) considered for designing efficient parallel
85
algorithm on heterogeneous architecture. Next, in Section 5.1.3, we illustrate the task
creation algorithm that divides the original problem of evaluating the rp-integral at
a grid-point into multiple smaller sub-problems that are defined only on a subset of
data. Section 5.1.4 presents the algorithm for calculating retarded potentials using
the sub-problems created from task creation algorithm, where the set of sub-problems
are distributed across different PUs of the underlying architectures based on their
processing capabilities. Finally, Section 5.2 presents the implementation performance
of the algorithm on two different heterogeneous architectures and show that the use
of machine learning algorithms is effective in partitioning and mapping the workload




The heterogeneous algorithm illustrated here divides the original problem of eva-
luating the rp-integral at different grid points into multiple smaller sub-problems or
tasks that are defined only on a subset of data, and it later maps these tasks to
different PUs based on their memory access and computational load requirement.
In order to create such tasks, we first partition the memory logically into multiple
smaller data blocks. In particular, the array of data grids containing the moments
from all the time steps is logically partitioned into multiple smaller data blocks, B0,
B1, . . . , BNt/m where a data block Bi corresponds to the data generated between
time steps mi − 1 to m(i + 1) (i.e.Dmi−1 to Dm(i+1), for all integers i in range 0
to Nt/m, and m ≥ 3 is an integer which defines the granularity of data partition.
The value of m also controls the granularity of the sub-problems generated by the
task creation algorithm, which is illustrated in later sections of this chapter. It is
important to note that the rp-integral approximation algorithm is such that even the
smallest possible sub-problem require data from at-least three data grids which is
why the value of m must be at-least 3.
The task creation and distribution algorithm presented in this chapter use the
partition among the data grids to identify tasks that are localized around individual
data blocks. Then, the distribution algorithm maps these tasks to PUs such that
each PU computes the tasks localized to one or more data-block, where a data-block
86
is not shared between multiple PUs. The motivation for distributing the tasks based
on their data block requirement is that such mapping has the advantage of assigning
tasks with different memory footprint to different PUs. This ensures that only a
portion of data is required to be transferred to a particular PU and this is crucial for
improving the memory performance on heterogeneous architectures.
5.1.2 DATA ACCESS PATTERN
The representation of data access pattern forecast in the numerical approxima-
tion of rp-integrals is identical to the one described in Section 4.1 of Chapter 4. In
particular, data access pattern for rp-integral evaluation at a grid-point p ∈ Vk is








i denotes the number of partiti-
ons along the subregion Sic∆t,(i+1)c∆t required during rp-integral evaluation at p, and
given the access pattern, we can easily calculate the memory references to any data








The task creation algorithm divides the original problem of evaluating the rp-
integral at a grid-point into multiple smaller sub-problems that are defined only
on a subset of data. Formally, suppose the current time step of the simulation is
k and gk−1 be the predictor function learned by the supervised learning algorithm
using data access patterns observed from one or more time step up to time step
k− 1. Then, data access pattern for rp-integral evaluation at a grid-point p ∈ Vk for




1 , . . . , n
(p)
Nt
]. This forecast is used
to approximate the partition P (p) = 〈r(p)0 , r
(p)
1 , . . . , r
(p)
n 〉, along the outer dimension,
which is required to numerically approximate rp-integral at p within the required
error tolerance (illustrated in Section 4.1.3). Next, for each grid-point p ∈ Vk, the




2 , . . . , P
(p)
mp ,
such that rp-integral computation using the partitions from a sub-array P
(p)
i requires
data from at-most one data block Bj for some integer 0 < j < Nt/m and 0 < i ≤ mp.
This divides the original problem of integrating along the region S0,R(p) using the
partition array P (p) into mp sub-problems, where a sub-problem refers to the rp-
integral computation using a partition sub-array P
(p)
i for some integer i ≤ mp (a sub-
problem is also referred to as a task). Once subdivided, the solution to the original
problem is given by sum of rp-integral estimates of the individual sub-problems.
87
Each sub-problem or task is represented using a tuple, 〈p, P, j,M〉, where p denotes
a grid-point on the 2D spatial grid at the current time step, P denotes a partition
sub-array such that rp-integral at p using the partitions from P is localized to at-most
one data block, an integer j that identifies the data block required for computing
the sub-problem (i.e. Bj), and M denotes the data access pattern expected from the
computation of rp-integral at p using partitions from P (Note that the access pattern
M falls into data block Bj).
5.1.4 ALGORITHM TO COMPUTE RETARDED POTENTIALS
The procedure Compute-Potentials-HT implements the second step of the
four step beam dynamics simulation algorithm on heterogeneous architecture where
it approximates the rp-integral at all grid points on a 2D grid for a given time
step. The procedure takes input k, V, τ, g, and D, where k is the current time step
of simulation, V is a set of grid points on the 2D grid at kth time step such that
|V | = NXNY , τ is the required error tolerance for rp-integral evaluations, g deno-
tes the predictor function learned using supervised learning algorithm at time step
k − 1, and D is the list of 2D data grids of moments from each time step stored
linearly on the device memory. Each grid-point p ∈ V is a reference to 7-tuple
object, (x, y, t, I, ε, access pattern, partition), where (p.x, p.y) denote the Cartesian
coordinate of the grid-point on the 2D grid at time step k, p.t is the simulation
time of the corresponding time step, p.I is the rp-integral estimate, p.ε is the rp-
integral error estimate, p.access pattern is a list containing the data access pattern
for rp-integral computation, and p.partition holds a list containing the partition for
rp-integral computation.
The procedure Compute-Potentials-HT works as follows. Line 2-7 initializes
different attributes of the grid-point object. In particular, for each grid-point p ∈
V , line-3 initializes integral and error estimates to 0, line-4 uses the best predictor
function g learned at time step k − 1 to forecast the access pattern required for rp-
integral computation for the current time step k, and line-5 calls a procedure that
implements the algorithm described in Section 4.1.4 to convert access pattern forecast
to rp-integral partition. Line-6 divides the original problem of calculating rp-integral
at a grid-point p using the predicted partition array p.partition into multiple smaller
sub-problems such that each sub-problem is localized to at-most one data block Bj,
for some integer j (a sub-problem is also referred to as a task). The procedure
88
Create-Tasks implements this algorithm which is described in Section 5.1.3.
Compute-Potentials-HT(k, V, τ, g,D)
1 T ← ∅
2 for each grid-point p ∈ V in parallel
3 p.I ← 0, p.ε← 0
4 p.access pattern← g(p.x, p.y, p.t)
5 p.partition← Compute-Partition(p.access pattern)
6 T ′ ← Create-Tasks(p)
7 T ←Merge-Lists(T, T ′)
8 bins← Task-Binning(T )
9 for each available processing unit in parallel
10 while bins is not empty
11 get list of tasks T ′ from a bin i for some integer i based on the bin
weight and the processing unit where it is scheduled
12 V ′ ← ∅
13 for each task 〈p, j, P,M〉 ∈ T ′
14 q ← p
15 q.I ← 0, q.ε← 0
16 q.access pattern←M
17 q.partition← P
18 List-Insert(V ′, q)
19 copy data generated between time steps mi− 1 to m(i+ 1) from D to D′
20 move D′ to the device memory
21 RP-Computation-PU(k, V ′, τ, g,D′)
22 update the estimates from V ′ to the global list of grid points V
23 g ← Online-Learning(V, g)
The procedure Create-Tasks for a grid-point p returns a list of tasks created
by subdividing the rp-integral at p into multiple smaller sub-problems or tasks. The
individual tasks list corresponding to each grid-point p ∈ V is merged to the global
list T . Next, Tasks-Binning procedure at line-8 implements a binning algorithm
which partitions the list of tasks into multiple disjoint clusters based on the data
block required to compute the tasks. Formally, given the set of tasks T and an
integer k = Nt/m, where each task is a tuple 〈p, j, P,M〉, the binning algorithm
89
partitions the |T | tasks into at-most k disjoint clusters or bins, C = {c1, c2, . . . , ck},
such that the tasks in a bin ci require data from data-block Bi. The Tasks-Binning
procedure returns a pool of tasks, bins, which contain the tasks grouped into multiple
bins based on their data-block requirement. These tasks in bins data structure is
shared between multiple PUs of the underlying heterogeneous architectures. Further,
each bin in bins is assigned a weight which denotes the collective computational
workload of the tasks in that particular bin. The weight for a bin is calculated by
reducing all the access vectors of the tasks in that bin. This weight is equivalent to
the total memory reference made by the tasks of that particular bin.
RP-Computation-PU(k, V, τ, g,D)
1 C ← RP-Clustering(V )
2 L← ∅
3 for each cluster c ∈ C in parallel
4 P ← ∅
5 for each grid-point p ∈ c in parallel
6 P ←Merge-Lists(P, p.partition)
7 for each grid-point p ∈ c in parallel
8 L′ ← Compute-RP-Integral(p, P, τ,D)
9 L←Merge-Lists(L,L′)
10 for each ([a, b], p) ∈ L in parallel
11 (I, ε, P, A)← RP-Quadrature(([a, b], p), τ,D)
12 p.access pattern←Merge-Lists(p.access pattern,A)
13 p.partition←Merge-Lists(p.partition, P )
14 p.I ← p.I + I
15 p.ε← p.ε+ ε
Lines 9-22 implements the task distribution algorithm where tasks from bins
are mapped to PUs such that each PU computes the tasks from one or more bins,
where a particular bin is not shared between multiple PUs. The motivation for
distributing the tasks based on their data block requirement is that such mapping
has the advantage of assigning tasks with different memory footprint to different
PUs. This ensures that only a portion of data is required to be transferred to a
particular PU instead of moving the entire data which is crucial for improving the
memory performance. Each PU dynamically selects the bins from the shared pool,
90
one at a time according to the weights assigned to them and then computes the tasks
of that particular bin in parallel. GPU and Xeon Phi selects the bins starting with
larger weight and work towards the bins with smaller weight, and CPU selects the
bins starting with smaller weight and work towards the bins with larger weight. This
sort of scheduling results in tasks with heavy computational load to be scheduled on
accelerators and the tasks with lighter computational load to be scheduled on CPUs.
The main motivation for such weighted scheduling is because accelerators offer larger
raw compute power than the CPUs.
Each PU implements RP-Computation-PU procedure which evaluates the
tasks assigned to them from the bin (i.e. evaluates rp-integral for the sub-problems
from the bin). The procedure RP-Computation-PU is similar to the pseudo-
code between lines 5 - 19 of Compute-Potentials-ML. Next, in the procedure
Online-Learning, access patterns observed during rp-integral computations at
all grid points p ∈ V is used by a supervised learning algorithm to train and up-
date the predictor function g. The updated prediction function g is used by RP-
Computation-PU procedure during the next time step.
5.2 PERFORMANCE RESULTS
The performance analysis of the heterogeneous implementation of compute retar-
ded potentials stage of the simulation is carried out on two different heterogeneous
system -
• Machine-G - Heterogeneous architecture composed of multi-core CPU with
four NVIDIA Tesla K40 GPUs
– Multi-core CPU - Dual socketed Intel R© Xeon R© CPU E5-2650 v3 @ 2.30GHz
with 10 cores per socket and supports 2 threads per core, making a total
of 20 CPU cores for the multi-core CPU platform.
– GPU - The Tesla K40 used here is a GK110B GPU-processor based on the
popular Kepler micro-architecture [61]. The GK110B processor in K40 of-
fers 12 GB of GDDR5 on-board memory with a peak memory bandwidth
of 288 GB/sec, and it contains 15 streaming multiprocessors (SMs) each
with 192 single-precision CUDA cores and 64 double-precision units cloc-
ked at 745 MHz. These cores in SMs collectively delivers a peak floating-
point performance of 4.29 Tflops and 1.43 Tflops in single-precision and
91
double-precision, respectively. The global memory accesses in all the GPU
kernels are configured to be cached in both L1 and L2 (commonly called
the Caching mode), unless specified otherwise.
• Machine-X - Heterogeneous architecture composed of multi-core CPU with
two Intel Xeon Phi 5110P coprocessor
– Multi-core CPU - Dual socketed Intel R© Xeon R© CPU E5-2670 v2 @ 2.50
GHz with 10 cores per socket and supports 2 threads per core, making a
total of 20 CPU cores for the multi-core CPU platform.
– Xeon Phi - The Xeon Phi 5110P used in this study is based on the Knights
Corner architecture. Each 5110P coprocessor consist of 60 cores and offers
8GB of GDDR5 on-board memory with a peak memory bandwidth of 320
GB/sec. The cores collectively delivers a peak floating-point performance
of 2.02 Tflops and 1.01 Tflops in single-precision and double-precision,
respectively.
The global memory access for the implementation on GPU architecture is configured
to be cached in both L1 and L2 (commonly called the Caching mode), unless spe-
cified otherwise. The initial distribution for all the simulations used to analyze the
performance of heterogeneous implementation is generated by Monte Carlo sampling
of N particles with a total charge of beam bunch Q = 1nC, and the rp-integral at
all grid points are approximated to a error tolerance of τ = 10−6. The results pre-
sented in this section are generated by compiling the heterogeneous code using GCC
5.4 compiler for multi-core portion of the code, Intel’s ICC 16 compiler for Xeon
Phi portion of the code, and NVCC compiler with CUDA 8.0 environment for GPU
portion of the code.
5.2.1 PERFORMANCE ON DIFFERENT ARCHITECTURES
GPU-only Execution Performance
Table 8 illustrates the performance of parallel implementation of compute retar-
ded potentials stage of the simulation on Machine-G using only GPUs for different
simulation configuration with varying number of Tesla K40 GPUs. The speedup re-
ported in Table 8 is measured by comparing the multi-GPU execution against single






One GPU Two GPUs Four GPUs
Time (sec.) Time (sec.) Speedup Time (sec.) Speedup
100000
64× 64 1.03 0.62 1.7 0.47 2.2
128× 128 4.71 2.68 1.8 1.51 3.1
256× 256 55.96 29.61 1.9 16.48 3.4
1000000
64× 64 0.94 0.58 1.6 0.45 2.1
128× 128 3.31 1.87 1.8 1.12 3.0
256× 256 34.53 18.14 1.9 10.15 3.4
Table 8: Execution time of the parallel implementation of compute retarded potenti-
als stage of the simulation on Machine-G in GPU-only execution mode for different
simulation configurations with varying number of Tesla K40 GPUs.
of GPUs and achieves up to 1.9X and 3.4X speedup with two and four GPUs, re-
spectively. This shows that the task creation and distribution algorithm is effective
in partitioning the computational load nearly uniformly between multiple GPUs with
minimal overhead.
Xeon Phi-only Execution Performance
Table 9 illustrates the performance of parallel implementation on Machine-X
using only Xeon Phis for different simulation configuration with varying number
of KNC Xeon Phi accelerators. The speedup reported in Table 9 is measured by
comparing the execution on multiple Xeon Phi against single Xeon Phi. We notice
that performance scales near linearly with the increase in number of Xeon Phi and






One Xeon Phi Two Xeon Phis
Time (sec.) Time (sec.) Speedup
100000
64× 64 1.73 1.0 1.7
128× 128 11.02 6.37 1.7
256× 256 149.34 91.73 1.6
1000000
64× 64 1.54 0.92 1.7
128× 128 8.06 4.68 1.7
256× 256 93.03 57.03 1.6
Table 9: Execution time of the parallel implementation of compute retarded poten-
tials stage of the simulation on Machine-X in Xeon Phi-only execution mode for
different simulation configurations with varying number of KNC Xeon Phi coproces-
sor.
and distribution algorithm is effective in partitioning the computational load nearly
uniformly between multiple Xeon Phis.
Performance Comparison Between Different Architectures
Table 10 illustrates the performance of parallel implementation of compute retar-
ded potentials stage of the simulation on different parallel architectures. The speedup
reported for the implementation on accelerators (GPU and Xeon Phi) in Table 10
is calculated by comparing the execution on accelerator against the multi-core CPU
execution. It is evident from Table 10 that the implementation on KNC Xeon Phi
is up to 2.3X faster than the multi-core CPU implementation, and the GPU imple-






20 CPU cores Tesla K40 GPU KNC Xeon Phi
Time (sec.) Time (sec.) Speedup Time (sec.) Speedup
100000
64× 64 4.00 1.03 3.9 1.73 2.3
128× 128 23.50 4.71 5.0 11.02 2.1
256× 256 279.97 55.96 5.0 149.34 1.9
1000000
64× 64 3.57 0.94 3.8 1.54 2.3
128× 128 17.20 3.31 5.2 8.06 2.1
256× 256 176.15 34.53 5.1 93.03 1.9
Table 10: Performance comparison of the parallel implementation of compute retar-
ded potentials stage on different HPC architectures - (a) Multi-core CPU with 20
cores (using the CPU from Machine-X which is the fastest of the two available
multi-core CPUs), (b) Tesla K40 GPU from Machine-G, and (c) KNC Xeon-Phi
from Machine-X.
Xeon Phi implementations, respectively.
5.2.2 PERFORMANCE ON HETEROGENEOUS ARCHITECTURES
Consider a heterogeneous system with n processing units, P = {P1, P2 . . . , Pn},
and an application A which requires W floating point operations where W represents
the computational workload of A. Let tp and fp represent the execution time in
seconds and the achieved performance in Flops/sec, respectively, for the parallel
implementation of A using processing unit p, where tp = W/fp, for all p ∈ P . Then,
the ideal execution time (without any overheads) for the parallel implementation of
95











We analyze the performance of all our heterogeneous implementations by com-
paring the execution time observed experimentally (denoted by tobserved) against the
ideal execution time (tideal).
Table 11 and Table 12 illustrates the performance of the parallel implementa-
tion of compute retarded potentials stage of the simulation on Machine-G and
Machine-X, respectively. In particular, Table 11 illustrates the performance with
two different heterogeneous system configurations on Machine-G: multi-core CPU
with one Tesla K40 GPU, and multi-core CPU with all four Tesla K40 GPUs. Like-
wise, Table 12 illustrates the performance with two different heterogeneous system
configurations on Machine-X: multi-core CPU with one KNC Xeon Phi, and multi-
core CPU with two KNC Xeon Phis. In both these tables, execution time of the
implementation using only multi-core CPU is denoted by tcpu. In Table 11, execu-
tion time using GPU-only mode with one and four Tesla K40 GPU(s) is denoted
by t1gpu and t4gpu, respectively. Similarity, in Table 12, execution time using Xeon
Phi-only mode with one and two KNC Xeon Phi(s) is denoted by t1phi and t2phi,
respectively. In Table 11 and Table 12, execution time observed on the collective
computing environment of the corresponding heterogeneous architectures is deno-
ted by tobserved, and the ideal execution time on the corresponding heterogeneous
architectures is denoted by tideal, which is calculated using Equation 20.
The efficiency of the heterogeneous implementation on Machine-G and Machine-
X is shown in Figure 24 and Figure 25, respectively. The results indicate that depen-
ding on the grid resolution and the number of particles in the simulation, the parallel
algorithm for heterogeneous architectures achieves 70-97% execution efficiency. This
efficiency is the direct consequence of the effectiveness of heterogeneous algorithm
in partitioning the workload efficiently between different PUs of the underlying ar-
chitecture. In particular, high-efficiency here indicates that the task creation and
distribution approach used in the heterogeneous algorithm is effective in breaking
down the problem into multiple smaller sub-problems and efficiently mapping them
to different PUs based on their individual processing capabilities such that together
they deliver a good aggregate performance.
It is important to note that the task creation and distribution approach used in
96
the current implementation has a overhead associated with creating and maintaining
the shared pool of tasks between the PUs. In particular, the constant communication
from the shared pool to fetch tasks by each PU contributes to the overhead, and incre-
ase in this overhead reduces the achieved execution efficiency. This overhead is more
prominent for simulations with smaller resolutions where the computation time is not
sufficiently large enough to hide the cost of the overhead, as is the case for simulati-
ons with 64× 64 grid-resolution. In such scenarios, accelerator only implementation
delivers good performance than the corresponding heterogeneous implementation.
This is evident from the results in Table 11, where for the simulations with 64× 64











CPU GPU-only Heterogeneous Architectures
20 cores One K40 Four K40 20 CPU cores + One K40 GPU 20 CPU cores + Four K40 GPUs







64× 64 4.80 1.03 0.47 0.97 0.85 0.51 0.43
128× 128 24.72 4.71 1.51 4.30 3.96 1.50 1.42
256× 256 282.95 55.96 16.48 50.11 46.72 16.42 15.57
1000000
64× 64 4.36 0.94 0.45 0.84 0.77 0.50 0.41
128× 128 18.35 3.31 1.12 3.01 2.81 1.11 1.06
256× 256 178.13 34.53 10.15 31.23 28.92 10.12 9.60
Table 11: Performance of the heterogeneous implementation of compute retarded potentials stage of the simulation on
Machine-G which is composed of multi-core CPU and four NVIDIA Tesla K40 GPUs.
Figure 24: Efficiency of the heterogeneous implementation of compute retarded potentials stage of the simulation on Machine-










CPU GPU-only Heterogeneous Architectures
20 cores One KNC Two KNC 20 CPU cores + One Xeon Phi 20 CPU cores + Two Xeon Phi







64× 64 4.00 1.73 1.00 1.39 1.21 1.10 0.80
128× 128 23.50 11.02 6.37 7.75 7.51 5.81 5.01
256× 256 279.97 149.34 91.73 99.37 97.40 71.13 69.09
1000000
64× 64 3.57 1.54 0.92 1.36 1.08 1.05 0.73
128× 128 17.20 8.06 4.68 5.96 5.50 4.61 3.68
256× 256 176.15 93.03 57.03 63.45 60.88 45.53 43.08
Table 12: Performance of the heterogeneous implementation of compute retarded potentials stage of the simulation on
Machine-X which is composed of multi-core CPU and two KNC Xeon Phi coprocessor.
Figure 25: Efficiency of the heterogeneous implementation of compute retarded potentials stage of the simulation on Machine-




This dissertation targeted several goals relating to optimizing the performance
of irregular scientific applications on emerging HPC architectures like GPUs, Xeon
Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or
Xeon Phis as accelerators. In particular, the dissertation primarily focuses on opti-
mizing the irregular workloads in scientific applications which require execution of
one or more irregular algorithms for multiple time steps (in the order of few hund-
red thousand to millions), where the irregular algorithm at each time step exhibit
control-flow and memory access pattern that are not readily amenable to most pa-
rallel architectures. Using numerical simulation of charged particles beam dynamics
simulations as a motivating example, this dissertation presented novel machine lear-
ning based optimization techniques to address the computational challenges in the
efficient parallel implementation of such irregular applications on HPC architectures.
The machine learning approach presented here relies on supervised learning algo-
rithms to adaptively model and track irregular access patterns observed during the
computation of irregular workloads at each time step of the simulation to anticipate
the future control-flow and data access patterns. We demonstrated that the access
pattern forecasts from anticipation strategies can be successfully used to formulate
optimization decisions that improve the application performance at a future time
step based on the observation from earlier time steps. In particular, we success-
fully used the forecasts to minimize both branch and memory divergence, thereby
reducing the control-flow and memory access irregularities in its parallel implemen-
tation on GPU and Intel MIC architecture. We also showed that the forecast can be
used to optimize the computation-to-thread mapping which maximize the data reuse
by improving data locality. For implementation on heterogeneous architectures, we
demonstrated that the forecast can be used to approximately divide the original pro-
blem into multiple smaller sub-problems, and once divided, we showed that they can
be efficiently mapped between the hybrid mix of processing units of a heterogeneous
architecture to deliver a good aggregate performance.
100
We further quantified the impact of using machine learning approach in resolving
the computational challenges in the parallel implementation of beam dynamics simu-
lation using NVIDIA GPUs and two different heterogeneous architectures. Then, we
presented a detailed performance comparison of this new algorithm against the only
two published parallel algorithms for high-fidelity computation of collective effects on
GPUs: Two-Phase-RP and Heuristics-RP algorithm. Two-Phase-RP algo-
rithm is the first high-performance parallel algorithm for beam dynamics simulation
that enabled high-fidelity simulation both feasible and computationally tractable.
Heuristics-RP algorithm addressed the memory inefficiencies in Two-Phase-RP
algorithm, which further provided a substantial boost in the performance. Now, with
the new and improved algorithm presented in this work that can run on a wide variety
of computing platforms compared to the existing algorithms, it enables unpreceden-
ted efficiency in numerical simulation of all the relevant physics of synchrotron light
sources and electron-ion particle colliders. The newly improved efficiency, coupled
with high-fidelity and precision of our earlier implementations, makes the previously
inaccessible physics tractable. For accelerator physics community in general, this
research is a step forward in developing ultra-bright light sources which are essen-
tial tools for discoveries and innovations in physical, biological, energy and medical
sciences.
Though the reported work has focused on addressing the irregularities in parallel
implementation of charged particles beam dynamics simulation, the work presented
has the potential to be much wider reaching. The optimizations should be equally
applicable to other irregular applications which suffers from similar implementation
challenges as that of beam dynamics simulation. For instance, the approach used
to optimize the computation-to-thread mapping based on the predicted access pat-
tern can be adapted by other applications while developing parallel implementations
on GPUs. As an example, in the parallel implementation of n-body simulation
on GPUs using Barnes-Hut algorithm, we can use this approach to determine the
particles-to-thread mapping to minimize the divergence, improve data reuse and to
make prefetch decisions which improves the overall application performance. Other
applications where this approach can be adapted are molecular dynamics simulation,




[1] T. Agoh and K. Yokoya. Calculation of coherent synchrotron radiation using
mesh. Physical Review Special Topics: Accelerators and Beams, 7:054403, May
2004.
[2] A. Aho, R. Sethi, and J. Ullman. Compilers: principles, techniques, and tools.
Addison Wesley, 1986.
[3] K. Arumugam, D. Ranjan, M. Zubair, A. Godunov, and B. Terzić. An effi-
cient deterministic parallel algorithm for adaptive multidimensional numerical
integration on GPUs. In 42nd International Conference on Parallel Processing,
ICPP’13, pages 486–491, Washington, DC, USA, October 2013. IEEE Computer
Society.
[4] K. Arumugam, D. Ranjan, M. Zubair, A. Godunov, and B. Terzić. A
memory-efficient algorithm for adaptive multidimensional integration with mul-
tiple GPUs. In 20th Annual International Conference on High Performance
Computing, HiPC’13, pages 169–175, December 2013.
[5] K. Arumugam, D. Ranjan, M. Zubair, A. Godunov, and B. Terzić. High-fidelity
simulation of collective effects in electron beams using an innovative parallel
method. In Summer Computer Simulation Conference (SCSC), SummerSim’15,
pages 1–10, San Diego, CA, USA, 2015. Society for Computer Simulation Inter-
national.
[6] K. Arumugam, D. Ranjan, M. Zubair, A. Godunov, and B. Terzić. Memory-
efficient parallel simulation of electron beam dynamics using GPUs. In 23rd
International Conference on High Performance Computing, HiPC’16, pages 212–
221, December 2016.
[7] K. Arumugam, D. Ranjan, M. Zubair, B. Terzi, A. Godunov, and T. Islam. A
machine learning approach for efficient parallel simulation of beam dynamics on
GPUs. In 46th International Conference on Parallel Processing (ICPP), pages
462–471, August 2017.
102
[8] S. S. Baghsorkhi, I. Gelado, M. Delahaye, and W.-m. W. Hwu. Efficient perfor-
mance evaluation of memory hierarchy for highly multithreaded graphics pro-
cessors. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, PPoPP ’12, pages 23–34, 2012.
[9] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm.
Nature, 324:446–449, December 1986.
[10] G. Bassi, T. Agoh, M. Dohlus, L. Giannessi, R. Hajima, A. Kabel, T. Lim-
berg, and M. Quattromini. Overview of CSR codes. Nuclear Instruments and
Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors
and Associated Equipment, 557(1):189–204, 2006. Energy Recovering Linacs
2005.
[11] G. Bassi, J. A. Ellison, K. Heinemann, and R. Warnock. Microbunching in-
stability in a chicane: Two-dimensional mean field treatment. Physical Review
Special Topics: Accelerators and Beams, 12:080704, August 2009.
[12] J. Berntsen, T. O. Espelid, and A. Genz. An adaptive algorithm for the approx-
imate calculation of multiple integrals. ACM Transactions on Mathematical
Software (TOMS), 17(4):437–451, December 1991.
[13] J. Berntsen, T. O. Espelid, and A. Genz. DCUHRE: An adaptive multide-
mensional integration routine for a vector of integrals. ACM Transactions on
Mathematical Software (TOMS), 17(4):452–456, December 1991.
[14] E. Blem, M. Sinclair, and K. Sankaralingam. Challenge benchmarks that must
be conquered to sustain the GPU revolution. In Proceedings of the 4th Workshop
on Emerging Applications for Manycore Architecture, 2011.
[15] C. L. Bohn and J. R. Delayen. Fokker-Planck approach to the dynamics of
mismatched charged-particle beams. Physical Review E, 50:1516–1534, August
1994.
[16] M. Borland, Y. C. Chae, P. Emma, J. Lewellen, V. Bharadwaj, W. M. Fawley,
P. Krejcik, C. Limborg, S. V. Milton, H. D. Nuhn, R. Soliday, and M. Wood-
ley. Start-to-end simulation of self-amplified spontaneous emission free elec-
tron lasers from the gun through the undulator. Nuclear Instruments and
103
Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors
and Associated Equipment, 483(1):268–272, 2002. Proceedings of the 23rd In-
ternational Free Electron Laser Confere nce and 8th FEL Users Workshop.
[17] A. Braunstein, M. Mézard, and R. Zecchina. Survey propagation: An algorithm
for satisfiability. Random Structures and Algorithms, 27(2):201–226, September
2005.
[18] M. Burtscher, R. Nasre, and K. Pingali. A quantitative study of irregular
programs on GPUs. In 2012 IEEE International Symposium on Workload
Characterization (IISWC), pages 141–151, November 2012.
[19] M. Burtscher and K. Pingali. An efficient CUDA implementation of the
tree-based barnes hut n-body algorithm, pages 75–92. Morgan Kaufmann, 2011.
[20] R. E. Caflisch. Monte carlo and quasi-monte carlo methods. Acta Numerica,
7:1–49, 1998.
[21] S. Chapra and R. Canale. Numerical Methods for Engineers. McGraw-Hill, 6
edition, 2009.
[22] N. Chatterjee, M. O’Connor, G. H. Loh, N. Jayasena, and R. Balasubramo-
nian. Managing DRAM latency divergence in irregular GPGPU applications. In
Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis, SC’14, pages 128–139, 2014.
[23] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron. Pannotia: Un-
derstanding irregular GPGPU graph applications. In 2013 IEEE International
Symposium on Workload Characterization (IISWC), pages 185–195, September
2013.
[24] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. A
performance study of general-purpose applications on graphics processors using
CUDA. Journal of Parallel and Distributed Computing, 68(10):1370–1380, 2008.
General-Purpose Processing using Graphics Processing Units.
[25] L. P. Chew. Guaranteed-quality mesh generation for curved surfaces. In
Proceedings of the Ninth Annual Symposium on Computational Geometry, SCG
’93, pages 274–280, 1993.
104
[26] G. Dalquist and A. Björck. Numerical Methods in Scientific Computing, vo-
lume 1. Society for Industrial and Applied Mathematics, 2008.
[27] G. Dasika, A. Sethia, T. Mudge, and S. Mahlke. PEPSC: A power-efficient
processor for scientific computing. In 2011 International Conference on Parallel
Architectures and Compilation Techniques, pages 101–110, October 2011.
[28] Ya. S. Derbenev and V. D. Shiltsev. Transverse effects of microbunch radiative
interaction. SLAC-Pub-7181, 1996.
[29] DOE/SC/Oak Ridge National Laboratory. Titan - Cray XK7, Opteron 6274
16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x. http://www.olcf.
ornl.gov/titan/.
[30] T. Endo, S. Matsuoka, A. Nukada, and N. Maruyama. Linpack evaluation on
a supercomputer with heterogeneous accelerators. In 2010 IEEE International
Symposium on Parallel Distributed Processing (IPDPS), pages 1–8, April 2010.
[31] W. Feng and T. Scogland. Green Top 500 Supercomputers. https://www.
top500.org/green500/lists/2017/06/.
[32] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation
and scheduling for efficient GPU control flow. In 40th Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO 2007), pages 407–420,
December 2007.
[33] A. C. Genz and A. A. Malik. An adaptive algorithm for numerical integration
over an n-dimensional rectangular region. Journal of Computational and Applied
Mathematics, 6(4):295–302, 1980.
[34] G. Golub and C. V. Loan. Matrix Computations. Johns Hopkins University
Press, 3 edition, 1996.
[35] T. Hahn. CUBA-A library for multidimensional numerical integration.
Computer Physics Communications, 168(2):78–95, 2005.
[36] T. Hahn. The CUBA library. Nuclear Instruments and Methods in Physics
Research Section A: Accelerators, Spectrometers, Detectors and Associated
Equipment, 559(1):273–277, 2006. Proceedings of the X International Workshop
on Advanced Computing and Analysis Techniques in Physics Research.
105
[37] K. Hildrum and P. S. Yu. Focused community discovery. In Fifth IEEE
International Conference on Data Mining (ICDM’05), November 2005.
[38] M. D. Hill and M. R. Marty. Amdahl’s law in the multicore era. Computer,
41(7):33–38, July 2008.
[39] R. W. Hockney and J. W. Eastwood. Computer simulations using particles.
Institute of Physics Publishing, London, 1988.
[40] Intel. Intel Xeon Phi Core Micro-architecture. https://software.intel.com/
en-us/articles/intel-xeon-phi-core-micro-architecture.
[41] A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kande-
mir, O. Mutlu, R. Iyer, and C. R. Das. OWL: Cooperative thread array aware
scheduling techniques for improving GPGPU performance. In Proceedings of the
Eighteenth International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS ’13, pages 395–406, 2013.
[42] A. Kabel. Coherent synchrotron radiation calculations using TraFiC4: Multi-
processor simulations and optics scans. In PACS2001. Proceedings of the 2001
Particle Accelerator Conference (Cat. No.01CH37268), volume 4, pages 2988–
2990 vol.4, 2001.
[43] Khronos Group. The OpenCL Specification v2.0. https://www.khronos.org/
opencl/.
[44] D. B. Kirk and W. W. Hwu. Programming Massively Parallel Processors: A
Hands-on Approach, volume 1. Morgan Kaufmann Publishers Inc., 2010.
[45] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish,
M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey.
Debunking the 100X GPU vs. CPU myth: An evaluation of throughput com-
puting on CPU and GPU. In Proceedings of the 37th Annual International
Symposium on Computer Architecture, ISCA ’10, pages 451–460, New York,
NY, USA, 2010. ACM.
[46] R. Li. Progress on the study of CSR effects. In 2nd ICFA Advanced Accelerator
Workshop on the Physics of High Brightness Beams, pages 369–390, November
1999.
106
[47] R. Li. Self-consistent simulation of the CSR effect on beam emittance. In
Nuclear Instruments and Methods in Physics Research Section A, volume 429,
pages 310–314, June 1999.
[48] R. Li. Curvature-induced bunch self-interaction for an energy-chirped bunch
in magnetic bends. Physical Review ST Accelerators and Beams, 11:024401,
February 2008.
[49] R. Li, R. Legg, B. Terzić, J. Bisognano, and R. Bosh. Two-dimensional ef-
fects on the behavior of the CSR force in a bunch compressor chicane. In 33rd
International Free Electron Laser Conference, Shanghai, 2011.
[50] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A
unified graphics and computing architecture. IEEE Micro, 28(2):39–55, March
2008.
[51] M. Mendez-Lojo, M. Burtscher, and K. Pingali. A GPU implementation of
inclusion-based points-to analysis. In Proceedings of the 17th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, PPoPP ’12,
pages 107–116, 2012.
[52] J. Meng, D. Tarjan, and K. Skadron. Dynamic warp subdivision for integrated
branch and memory divergence tolerance. In Proceedings of the 37th Annual
International Symposium on Computer Architecture, ISCA ’10, pages 235–246,
2010.
[53] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal. In
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice
of Parallel Programming, PPoPP ’12, pages 117–128, 2012.
[54] J. Misra. Distributed discrete-event simulation. ACM Computing Surveys
(CSUR), 18(1):39–65, March 1986.
[55] S. Mittal and J. S. Vetter. A survey of CPU-GPU heterogeneous computing
techniques. ACM Computing Surveys, 47(4):69:1–69:35, July 2015.
[56] NAG. Fortran 90 library. Numerical Algorithms Group Inc., Oxford, U.K., 2000.
107
[57] National Super Computer Center in Guangzhou. Tianhe-2 (MilkyWay-2) - TH-
IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon
Phi 31S1P. https://www.top500.org/system/177999.
[58] NVIDIA. CUDA Bandwidth Test. http://docs.nvidia.com/cuda/
cuda-samples/#bandwidth-test.
[59] NVIDIA. CUDA C Best Practices Guide. http://docs.nvidia.com/cuda/
cuda-c-best-practices-guide/.
[60] NVIDIA. CUDA C Programming Guide. http://docs.nvidia.com/cuda/
cuda-c-programming-guide/index.html.
[61] NVIDIA. Next Generation CUDA Compute Architecture: Kepler GK110.
https://www.nvidia.com/content/PDF/kepler/
NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.
[62] NVIDIA. Profiler:CUDA Toolkit Documentation. http://docs.nvidia.com/
cuda/profiler-users-guide/index.html.
[63] OpenACC.org. The OpenACC Application Programming Interface. https:
//www.openacc.org/.
[64] OpenMP Architecture Review Board. OpenMP Application Program Interface
v4.0. http://www.openmp.org/.
[65] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Ma-
chine learning in Python. Journal of Machine Learning Research, 12:2825–2830,
2011.
[66] J. L. Peterson. Petri Nets. ACM Computing Surveys (CSUR), 9(3):223–252,
September 1977.
[67] R. Piessens, E. de Doncker-Kapenga, C. Überhuber, and D. Kahaner.
QUADPACK:A Subroutine Package for Automatic Integration. Springer-Verlag,
Berlin, 1983.
108
[68] J. Sartori and R. Kumar. Branch and data herding: Reducing control and me-
mory divergence for error-tolerant GPU applications. In Proceedings of the 21st
International Conference on Parallel Architectures and Compilation Techniques,
PACT ’12, pages 427–428, 2012.
[69] S. Seo, G. Jo, and J. Lee. Performance characterization of the NAS paral-
lel benchmarks in OpenCL. In IEEE International Symposium on Workload
Characterization (IISWC), pages 137–148, November 2011.
[70] J. Shen, J. Fang, H. Sips, and A. L. Varbanescu. An application-centric evalua-
tion of OpenCL on multi-core CPUs. Parallel Computing, 39(12):834–850, 2013.
Programming models, systems software and tools for High-End Computing.
[71] J. Shen, A. L. Varbanescu, Y. Lu, P. Zou, and H. Sips. Workload partitioning
for accelerating applications on heterogeneous platforms. IEEE Transactions on
Parallel and Distributed Systems, 27(9):2766–2780, September 2016.
[72] E. Strohmaier, J. Dongarra, H. Simon, and M. Meuer. Top 500 Supercomputers.
https://www.top500.org/lists/2017/06/.
[73] E. Sun, D. Schaa, R. Bagley, N. Rubin, and D. Kaeli. Enabling task-level
scheduling on heterogeneous platforms. In 5th Annual Workshop on General
Purpose Processing with Graphics Processing Units, GPGPU-5, pages 84–93,
2012.
[74] Swiss National Supercomputing Centre (CSCS). Piz Daint - Cray XC50, Xeon
E5-2690v3 12C 2.6GHz, Aries interconnect , NVIDIA Tesla P100. http://www.
cscs.ch/computers/piz_daint/index.html.
[75] P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson
Addison Wesley, 2005.
[76] B. Terzić and G. Bassi. New density estimation methods for charged particle
beams with applications to microbunching instability. Physical Review Special
Topics: Accelerators and Beams, 14:070701, July 2011.
[77] B. Terzić, I. Pogorelov, and C. Bohn. Particle-in-cell beam dynamics simu-
lations with a wavelet-based poisson solver. Physical Review Special Topics:
Accelerators and Beams, 10:034201, March 2007.
109
[78] J. S. Vetter and S. Mittal. Opportunities for nonvolatile memory systems in
extreme-scale high-performance computing. Computing in Science Engineering,
17(2):73–82, March 2015.
[79] M. Viñas, Z. Bozkus, and B. B. Fraguela. Exploiting heterogeneous parallelism
with the heterogeneous programming library. Journal of Parallel and Distributed
Computing, 73(12):1627–1638, December 2013.
[80] A. Vladimirov, R. Asai, and V. Karpusenko. Parallel Programming and
Optimization with Intel Xeon Phi Coprocessors. Colfax International, 2nd edi-
tion, 2015.
[81] S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual
performance model for multicore architectures. Communications of the ACM -
A Direct Path to Dependable Software, 52(4):65–76, April 2009.





A.1 OVERVIEW OF CUHRE
This section describes the sequential cuhre algorithm for n-dimensional integra-









where x is an n-vector, and f is an integrand. We use [a,b] to denote the hyper
rectangle [a1, b1]× [a2, b2] . . .× [an, bn].
The heart of cuhre algorithm is the procedure C-Rule([a,b], f, n) which out-
puts a triple (I, ε, κ) where I is an estimate of the integral over hyper rectangle [a,b]
([a1, b1] × [a2, b2] . . . × [an, bn]), ε is an error estimate for I, and κ is the axis along
which [a,b] should be split if needed. An important feature of C-Rule is that it eva-
luates the integrand only for 2n + p(n) points where p(n) is Θ(n3) [12]. This is much
fewer than 15n function evaluations required by a simple repeated one-dimension
integration scheme based on 7/15-point Gauss-Kronrod method.
The procedure Sequential-Cuhre implements the cuhre method to compute
n-dimensional multiple integral. The procedure takes input n, a, b, f , a relative error
tolerance τrel and an absolute error tolerance τabs, where a = (a1, a2, ..., an) and b
= (b1, b2, ..., bn). In the description provided below, H is a priority queue of 4-tuples
([x,y], I, ε, κ) where [x,y] is a subregion, I is an estimate of the integral over this
region, ε an estimate of the error and κ the dimension along which the subregion
should be split if needed. The parameter ε determines the priority for extraction of
elements from the priority queue. The algorithm maintains a global error estimate
εg and a global integral estimate Ig. The algorithm repeatedly splits the region with
greatest local error estimate and updates εg and Ig. The algorithm terminates when
the εg ≤ max(τabs, τrel|Ig|) and outputs integral estimate Ig and error estimate εg.
111
Sequential-Cuhre(n, a,b,f, τrel, τabs)
1 (Ig, εg, κ)← C-Rule([a,b], f, n)
2 H ← ∅
3 insert(H, ([a,b], Ig, εg, κ)
4 while εg > max(τabs, τrel|Ig|)
5 ([a,b], I, ε, κ)← Extract-Max(H)
6 a′ ← (a1, a2, . . . , (aκ + bκ)/2, . . . , an)
7 b′ ← (b1, b2, . . . , (aκ + bκ)/2, . . . , bn)
8 (Ileft, εleft, κleft)← C-Rule([a,b′], f, n)
9 (Iright, εright, κright)← C-Rule([a′,b], f, n)
10 Ig ← Ig − I + Ileft + Iright
11 εg ← εg − ε+ εleft + εright
12 insert(H, ([a,b′], Ileft, εleft, κleft))
13 insert(H, ([a′,b], Iright, εright, κright))
14 return Ig and εg
A.2 GPU-ACCELERATED PARALLEL ALGORITHM
The sequential adaptive quadrature for multi-dimensional integral is poorly suited
to GPUs, as it does not take advantage of the GPU’s data or task parallelism. This
section presents a parallel quadrature algorithm that can utilize the parallel proces-
sors of GPUs to speed up the computation. The parallel algorithm approximates the
integral by adaptively locating the subregions in parallel where the error estimate is
greater than some user-specified error tolerance. It then calculates the integral and
error estimates on these subregions in parallel. The pseudocode for this algorithm is
provided below in Quadrature-PhaseOne and Quadrature-PhaseTwo pro-
cedures.
In Quadrature-PhaseOne, Lmax is a parameter that is based on target GPU
architecture. The goal of this procedure is to create a list of subregions from the
whole region [a,b], with at least Lmax elements for which further computation is
necessary for estimating the integral to desired accuracy. This list is later passed
on to Quadrature-PhaseTwo. The algorithm maintains an list L of subregions,
stored as [aj,bj]. Initially the whole integration region is split into roughly Lmax equal
parts through the procedure Init-Partition. In each iteration of the while loop in
Quadrature-PhaseOne, first the cuhre rules are applied to all subregions in L
112
in parallel to get the integral estimate, error estimate, and the split axis. A list S is
created to store the intervals with these values. Thereafter the algorithm essentially
identifies the “good” and the “bad” subregions in S – the good subregions have
error estimate that is below a chosen threshold, whereas bad subregions have error
estimates exceeding this threshold. The bad subregions need to be further divided,
while the integral and error estimates for the good regions can simply be accumulated.
This is accomplished through the procedures Partition and Update.
Quadrature-PhaseOne(n, a,b, f , d, τrel, τabs, Lmax)
1 Ip ← 0, Ig ← 0, εp ← 0, εg ←∞
// Ip, εp keep sum of integral and error estimates for the “good” subregions
// Ig, εg keep sum of integral and error estimates for all subregions
2 L← Init-Partition(a,b, Lmax, n)
3 while (|L| < Lmax) and (|L| 6= 0) and (εg > max(τabs, τrel|Ig|)
4 S ← ∅
5 for all j in parallel
6 (Ij, εj, κj)← C-Rule(L[j], f, n)
7 insert(S, (L[j], Ij, εj, κj))
8 L← Partition(S, Lmax, τrel, τabs)
9 (Ip, εp, Ig, εg)← Update(S, τrel, τabs, Ip, εp)
10 return (L, Ip, εp, Ig, εg)
Quadrature-PhaseTwo(n, f , τrel, τabs, L, I
g, εg)
1 for j = 1 to |L| parallel
2 Let [aj,bj] be the j
th record in L
3 (Ij, εj)← Sequential-Cuhre(n, aj, bj, f , τrel, τabs)








6 return Igand εg
It is worth noting that the original cuhre algorithm always divides selected
subregion into two parts along the chosen axis where the integrand has the largest
fourth divided difference [12].The proposed algorithm here uses this strategy of choo-
sing the axis,with the distinction that the selected subregion is divided into d pieces
113
Init-Partition(a,b, Lmax, n)
1 l← max{j|jn ≤ Lmax}
2 split [a,b] along each dimension into l equal parts and
save these ln subregions into L
3 return L
Update(S, τrel, τabs, I
p, εp)
1 t1 ← Ip, t2 ← εp, t3 ← 0, t4 ← 0
// t1, t2 accumulates integral and error estimates for the “good” subregions
// t3, t4 accumulates integral and error estimates for all the subregions
2 for j = 1 to |S|
3 Let ([aj,bj], Ij, εj, κj) be the j
th record in S
4 if εj < max(τabs, τrel|Ij|)
5 t1 ← t1 + Ij
6 t2 ← t2 + εj
7 else t3 ← t3 + Ij
8 t4 ← t4 + εj
9 t3 ← t3 + t1
10 t4 ← t4 + t2
11 return (t1, t2, t3, t4)
Partition(S, Lmax, τrel, τabs)
1 L1 ← ∅, L2 ← ∅
// L1 stores the “bad” subregions before subdivision
// L2 stores the subregions after subdivision of “bad” subregions
2 for j = 1 to |S|
3 Let ([aj,bj], Ij, εj, κj) be the j
th record in S
4 if εj ≥ max(τabs, τrel|Ij|)
5 insert ([aj,bj], κj) into L1
6 d← split-factor(Lmax, |L1|)
7 for j = 1 to |L1|
8 Let ([aj,bj], κj) be the j
th record in L1
9 split [aj,bj] into d equal parts along the axis κj and
insert all these subregions into L2
10 return L2
Listing: Auxiliary procedures in Quadrature-PhaseOne and Quadrature-
PhaseTwo
114
along the chosen axis instead of two. The parameter d is dynamically calculated
using a heuristic Split-Factor based on the target architecture and on the number
of bad intervals.Subdivision of a region refines the resolution of that region along with
generating enough subregions to balance the computational load for second phase.
First phase continues until (i) a long enough list of “bad” subregions is created in
which case we proceed to the second phase or (ii) there are no more “bad” subregions
in which case we can return the integral and error estimates Ig and εg as the answer
or (iii) Ig, εg satisfy the error threshold criteria in which case we also return Ig and
εg as the answer. Note that, in case (ii) or (iii) second phase of the algorithm is not
used.
The algorithm continues with the second phase when the global error estimate is
still larger than the required global tolerance. In second phase, on every subregion
[aj,bj] in the list L the algorithm calls sequential cuhre routine ( Sequential-
Cuhre) to compute global integral and error estimate for the selected subregion
(Line 3). Line 4 and 5 update the global integral and error estimate. Second phase
implements a modified version of cuhre to run in parallel for each of the subregions
in the list L returned from first phase. The modified version of cuhre implemented
for GPU take advantage of state-of-the art GPU architectures to speed-up the com-
putations. Our approach combines the original features of cuhre with the improved




B.1 ROOFLINE PERFORMANCE MODEL
Roofline is a visually intuitive performance model used to bound the performance
of various applications running on multicore, manycore, or accelerator processor ar-
chitectures. Rather than simply using percent-of-peak estimates, the model can be
used to assess the quality of attained performance by combining locality, bandwidth,
and different parallelization paradigms into a single performance figure. One can
examine the resultant Roofline figure in order to determine both the implementation
and inherent performance limitations.
Figure 26 shows the Roofline model for NVIDIA Tesla K40 GPU [81]. The
graph is on a log-log scale. The y-axis is attainable double-precision floating-point
performance in units of GFlops/sec, and the x-axis is arithmetic intensity, varying
from 0.125 Flops/DRAM byte-accessed to 32 Flops/DRAM byte-accessed. The sy-
stem being modeled has a peak double precision floating-point performance of 1.4
Tflops/sec and peak memory bandwidth of BWTheoretical-Peak = 288 GB/sec from har-
dware specifications. Additionally, when fused multiply and add (FMA) instructions
are not utilized then the peak double precision floating-point performance drops to
0.7 Tflops/sec. The two horizontal lines denotes the performance ceilings for peak
double-precision (DP) floating point operations with and without FMA. The black
solid diagonal line in Figure 26 indicates the bandwidth ceiling for BWTheoretical-Peak.
However, peak memory bandwidth is often unachievable in practice. So, in order
to analyze the performance more accurately, we measure the experimental memory
bandwidth using the benchmarks from NVIDIA’s official SDK [58]. Experimental
memory bandwidth for K40 is calculated to be BWExperimental-Peak = 200 GB/sec,
and its bandwidth ceiling in roofline model is shown using the blue solid diagonal
line.
The Roofline sets an upper bound on performance of a kernel depending on the









1/8 1/4 1/2  1  2  4  8  16  32
Peak DP GFlops/sec 

















































NVIDIA Tesla K40 GPU
Figure 26: Roofline model for NVIDIA Tesla K40 GPU.
vertical line that hits the roof then it either hits the slanted part or the flat part of
the roof. When the column hits slanted part of the roof, kernel is memory-bound
and the only way to reach peak performance is to increase the arithmetic intensity.
For example, the red vertical column indicates the arithmetic intensity of a memory-
bound kernel and red X marks performance achieved for that particular kernel when
the achieved bandwidth is BWTheoretical-Peak. On the other hand, when the column
hits the flat part of the roof, memory bandwidth is not the limiting factor and the
kernel is compute-bound. For example, the green column indicates the arithmetic
intensity for a compute-bound kernel and green X marks performance achieved for
that particular kernel when the achieved bandwidth is BWTheoretical-Peak.
B.2 PROFILER METRICS
This section contains detailed descriptions of the profiler metrics that are used
to analyze the performance of GPU kernels in this thesis report. These metrics are
117
collected using NVIDIA’s nvprof and the Visual Profiler [62].
• Metric name: Warp execution efficiency
Profiler metric: warp execution efficiency
Description: Ratio of the average active threads per warp to the maximum
number of threads per warp supported on a multiprocessor expressed as per-
centage. Values for warp execution efficiency indicate the following -
– Values of less than 100% indicate the presence of threads with different
control-flow paths which leads to performance bottlenecks on GPU archi-
tectures, as illustrated in Section 2.2.1.
– 100% warp execution efficiency indicate that the kernel has no divergent
threads.
• Metric name: Global load efficiency
Profiler metric: gld efficiency
Description: Ratio of number of bytes requested by the kernel to number
of bytes transferred from the global memory expressed as percentage (or ra-
tio of requested global memory load throughput to required global memory
load throughput expressed as percentage). Values for gld efficiency indicate
following performance behavior of a kernel -
– Efficiency of 100% indicates perfect coalescing.
– Values larger than 100% shows that, on average, the load requests of
multiple threads in a warp are fetched from the same memory address
and are also coalesced.
– Values less than 100% indicates scattered and non-coalesced memory access,
and such accesses waste off-chip bandwidth by over-fetching unnecessary
data.
• Metric name: Global load transactions per request
Profiler metric: gld transactions per request
Description: Average number of global memory load transactions performed
for each global memory load.
118
– Ideal value for transaction per requests is (32 threads per warp × word
size in bytes / 128 bytes per line), which for 8-byte words is 2.0. The ideal
value is obtained by perfect coalescing.
– Values greater than the ideal value shows that substantial number of me-
mory request from the kernel are non-coalesced and only a fraction of
cache line is used , thereby resulting in transaction replays.
• Metric name: Global L1-cache hit rate
Profiler metric: l1 cache global hit rate
Description: Hit rate in L1 cache for global loads
• Metric name: L2-cache hit rate
Profiler metric: l2 l1 read hit rate




Department of Computer Science
Old Dominion University
Norfolk, VA 23529
Kamesh Arumugam Karunanithi received his Bachelor of Engineering degree in
Computer Science and Engineering from Visvesvaraya Technological University, In-
dia in 2010. During his bachelor studies, he worked on various research projects in
the field of parallel computing, bioinformatics, and wireless sensor networks. After
receiving his bachelor degree, he worked as Software developer and Consultant for
web-based startup industries in Bangalore, India. In Fall 2011, Kamesh joined the
Computer Science Department of Old Dominion University to pursue a PhD de-
gree. He is currently working with Dr. Mohammad Zubair and Dr. Desh Ranjan of
Department of Computer Science, and Dr. Baľsa Terzić of Department of Physics
as research assistant, developing high-performance parallel algorithms and its imple-
mentation on emerging HPC architectures for a wide variety of scientific applications.
Most of his research work are interdisciplinary which is a result of close collaboration
with the computational scientists and physicists from ODU and Jeffersons Lab. His
work is published in several peer reviewed international conferences - ICPP (2013
& 2017), HiPC (2013 & 2016), GTC (2013), SCSC (2015 & 2017), IPAC (2015 &
2017), and ICAP (2017). One of his research work titled “High-Fidelity Simulation
of Collective Effects in Electron Beams Using an Innovative Parallel Method” won
the best paper award at the 47th Summer Computer Simulation Conference (SCSC)
held by the Society for Modeling & Simulation International (SCS). This work has
been well received within the computational physics community and has enabled un-
precedented fidelity and precision in studying all the relevant physics of synchrotron
light sources. Further, he received the Modeling and Simulation Research Fellows-
hip from Virginia Modeling, Analysis & Simulation Center of ODU to support his
research for 3 years (maximum allowance). He also serves as a reviewer for BMC
Bioinformatics Journal.
Typeset using LATEX.
