Old Dominion University

ODU Digital Commons
Computer Science Theses & Dissertations

Computer Science

Summer 2017

Multi-GPU Accelerated High-Fidelity Simulations of Beam-Beam
Effects in Particle Colliders
Naga Sai Ravi Teja Majeti
Old Dominion University, nmaje001@odu.edu

Follow this and additional works at: https://digitalcommons.odu.edu/computerscience_etds
Part of the Computer Sciences Commons

Recommended Citation
Majeti, Naga S.. "Multi-GPU Accelerated High-Fidelity Simulations of Beam-Beam Effects in Particle
Colliders" (2017). Master of Science (MS), Thesis, Computer Science, Old Dominion University, DOI:
10.25777/gam8-e879
https://digitalcommons.odu.edu/computerscience_etds/89

This Thesis is brought to you for free and open access by the Computer Science at ODU Digital Commons. It has
been accepted for inclusion in Computer Science Theses & Dissertations by an authorized administrator of ODU
Digital Commons. For more information, please contact digitalcommons@odu.edu.

MULTI-GPU ACCELERATED HIGH-FIDELITY
SIMULATIONS OF BEAM-BEAM EFFECTS IN PARTICLE
COLLIDERS
by
Naga Sai Ravi Teja Majeti

A Thesis submitted to the Faculty of
Old Dominion University in Partial Fulfillment of the
Requirements for the Degree of
MASTER OF SCIENCE
COMPUTER SCIENCE
OLD DOMINION UNIVERSITY
August 2017
Approved by:
Mohammad Zubair (Director)
Desh Ranjan (Co-Director)
Balša Terzić (Co-Director)

i

ABSTRACT
MULTI-GPU ACCELERATED HIGH-FIDELITY SIMULATIONS OF
BEAM-BEAM EFFECTS IN PARTICLE COLLIDERS
Naga Sai Ravi Teja Majeti
Old Dominion University, 2017
Director: Dr. Mohammad Zubair
Numerical simulation of beam-beam effects in particle colliders are crucial in
understanding and the design of future machines such as electron-ion colliders (JLEIC), linacring machines (eRHIC) or LHeC. These simulations model the non-linear collision dynamics of
two counter rotating beams in particle colliders for millions of turns. In particular, at each turn,
the algorithm simulates the collision of two directed beams propagating at different speeds with
different number of bunches each. This leads to non-pair-wise collisions of beams with different
number of bunches that results in an increase in the computational load proportional to the
number of bunches in the beams. Simulating these collisions for millions of turns using
traditional CPUs is challenging due to the complexity in modeling non-linear dynamics of the
beams and the need to simulate collision of every bunch in a reasonable amount of time.
In this Thesis, we present a high-performance scalable implementation to simulate the
beam-beam effects in electron-ion colliders using a cluster of NVIDIA GPUs. The parallel
implementation is optimized to minimize the communication overhead and the performance
scales near linearly with number of GPUs. Further, the new code enables tracking and collision
of the beams for millions of turns, thereby making the previously inaccessible long-term
simulations tractable. As of now, there is no other code in existence that can accurately model

ii

the single particle non-linear dynamics and the beam-beam effects at the same time for a large
enough number of turns required to verify the long-term stability of a collider.

iii

ACKNOWLEDGMENTS

This work is funded by a grant from Jefferson National Laboratory. I must acknowledge
ODU ITS and Jefferson Laboratory for allowing me to access their computational resources.

iv

Table of Contents
List of Figures ................................................................................................................................. v
List of Tables ................................................................................................................................ vii
Chapter
I.

Page

INTRODUCTION ................................................................................................................... 1

II. BACKGROUND AND STATE OF ART ............................................................................... 4
II.1 PHYSICAL PROBLEM .....................................................................................................................4
II.1.1 Tracking .......................................................................................................................................4
II.1.2 Collision ......................................................................................................................................4
II.2 GPU ARCHITECTURE .....................................................................................................................9

III. EXISTING SERIAL ALGORITHM .................................................................................... 13
III.1 OUTLINE OF THE ALGORITHM ................................................................................................13
III.2 TRACKING ALGORITHM ...........................................................................................................15
III.3 COLLISION ALGORITHM ...........................................................................................................16

IV. GPU IMPLEMENTATION................................................................................................... 21
IV.1 PARALLEL ALGORITHM FOR TRACKING .............................................................................21
IV.2 PARALLEL ALGORITHM FOR COLLISION .............................................................................21
IV.2.1 Slicing in Parallel .....................................................................................................................21
IV.2.2 Parallel Apply Kick..................................................................................................................25

V. MULTI-GPU IMPLEMENTATION ..................................................................................... 28
V.1 SCHEDULING BUNCHES .............................................................................................................32
V.2 COMMUNICATIONS USING MESSAGE PASSING INTERFACE (MPI) .................................36

VI. RESULTS .............................................................................................................................. 39
VI.1 SINGLE-GPU PERFORMANCE ...................................................................................................39
VI.2 MULTI-GPU PERFORMANCE ....................................................................................................52

VII. CONCLUSION AND FUTURE WORK .............................................................................. 60
VII.1 CONCLUSION ..............................................................................................................................60
VII.2 FUTURE WORK ...........................................................................................................................61

REFERENCES ............................................................................................................................. 62
VITA ............................................................................................................................................. 64

v

List of Figures
Figure

Page

1. Figure 2.1 – Slicing .................................................................................................................. 5
2. Figure 2.2 – Collisions .............................................................................................................. 6
3. Figure 2.3 – CUDA Programming Model. ............................................................................... 9
4. Figure 4.1 – Slicing-1 ............................................................................................................. 22
5. Figure 4.2 – Slicing-2. ............................................................................................................ 23
6. Figure 4.3 – Slicing-3. ............................................................................................................ 24
7. Figure 4.4 – Slicing-4. ............................................................................................................ 25
8. Figure 4.5 – Collision Traingle ............................................................................................... 25
9. Figure 5.1 – Setup of collider rings ........................................................................................ 28
10. Figure 5.2 – Schedule. ............................................................................................................ 29
11. Figure 5.3 – Schedule of a single random............................................................................... 32
12. Figure 6.1 – Speedup behavior of the GPU implementation .................................................. 41
13. Figure 6.2 – Execution time of Collide procedure.................................................................. 43
14. Figure 6.3 – Roofline model analysis for Compute-Kick ...................................................... 46
15. Figure 6.4 – Warp Divergence-1 ............................................................................................ 49
16. Figure 6.5 – Warp Divergence-2 ............................................................................................ 50
17. Figure 6.6 – Warp Divergence-3 ............................................................................................ 50
18. Figure 6.7 – Warp Divergence-4 ............................................................................................ 51
19. Figure 6.7 – Execution schedule ............................................................................................. 53
20. Figure 6.8 – Time slots-1 ........................................................................................................ 53
21. Figure 6.9 – Time slots-2 ........................................................................................................ 54

vi

22. Figure 6.10 – Time Slots-3 ..................................................................................................... 55
23. Figure 6.11 – Time Slots-4. .................................................................................................... 55
24. Figure 6.12 – Time Slots-5. .................................................................................................... 56

vii

List of Tables
Table

Page

1. Table 5.1 - Distribution of Bunches on GPUs .......................................................................... 32
2. Table 6.1 - Single turn performance results .............................................................................. 40
3. Table 6.2 - Single turn performance of COLLIDE procedure. ................................................. 44
4. Table 6.3 – Performance results of Compute-Kick kernel ....................................................... 45
5. Table 6.4 – Performance of Multi-GPU algorithm on a cluster of GPUs................................. 52
6. Table 6.5 – Predictions-1 .......................................................................................................... 57
7. Table 6.6 – Predictions-2. ......................................................................................................... 58

1

CHAPTER I
INTRODUCTION
Future particle colliders such as the Jefferson Lab Electron-Ion Collider (JLEIC) [13],
linac-ring machines (eRHIC) [5] or LHeC [6] are particularly sensitive to beam-beam effects.
Their design, construction and operation costs routinely measure in billions of dollars. A nonnegligible portion of the cost can be reduced by optimization of the design and performance using
computer simulations. The long-term stability of the beams in the collider is the fundamental
criterion of the proper design and operation.
In order to simulate accurately the dynamics of the beams in a particle collider, it is
necessary to track and collide the beam particles for millions to billions of turns. These long-term
simulations are very time-consuming on a single processor system and need to be implemented on
the massively parallel computer architectures to reduce the simulation time from the order of
months or years to the order of days.
We choose a map-based tracking of the particle transport through the ring. Map
generation techniques in application to accelerator lattices are well developed and are available
in various codes. Therefore, we rely on existing tools and build upon the well-established
verified algorithms of COSY Infinity [7]. The beam-beam interaction requires solving the 3D
Poisson equation for each collision, which is computationally very expensive. The Poisson
equation can be directly solved via a number of standard techniques, including multi-grid,
conjugate gradient [16], or Fourier transform-based approach [10]. But, because of their higher
computational load, simulating long-term beam dynamics in colliders becomes difficult.
Therefore, we chose to invoke various approximations to alleviate the numerical load. One

2

approximation is assuming that the beam distribution is Gaussian. Another is the BassettiErskine approximation [2] which further reduces the problem by assuming the interacting
bunches to be infinitesimally short.
There are two scenarios for this problem:
1 - When each collider ring has only one bunch in it. In this case, all the interactions
happening in this simulation are between these two bunches. For each interaction, we Track both
the beams and Collide them. A single turn involves a single interaction in this scenario.
2 - When two rings have different harmonic numbers, each bunch from the first ring will
interact with all the bunches present in the second ring. For example, when there are n-1 and n
bunches, there is a total of (n-1)n interactions between these bunches. A single turn involves n-1
interactions in this scenario. As there are different harmonic numbers in each ring, each bunch will
interact with a different bunch in each turn. So, it takes n turns until all the n-1 bunches from one
ring interact with all the n bunches from the other ring which sums up to a total of (n-1)n
interactions. In this thesis, one schedule completion of Multi-Bunch implementation refers to the
completion of n turns; in practice this schedule repeats from millions to billions number of times.
In this thesis, we propose a new, high-fidelity model for simulation of long-term beambeam dynamics. The proposed model is optimized to run efficiently on GPU platform which
gives us the chance to study efficiently and accurately the long-term dynamics in colliders. Our
implementation of the inherently parallelizable computations of beam tracking and collision on
GPUs leads to orders-of-magnitude reduction in computational time, thereby making the
previously inaccessible physics tractable. On the other hand, to simulate the interactions between
the multiple bunches we propose a scheduling algorithm to simulate the interactions between

3

multiple bunches on a given number of GPUs. The algorithm is optimized to minimize the
communication overhead and the performance scales nearly linear with the number of GPUs.
The remainder of this paper is organized as follows. Chapter - 2 provides the background
of the physical problem, GPUs, and SIMD Challenges. In Chapter - 3 we outline the steps in
numerical simulation of beam-beam dynamics and describe the existing core algorithms for
Tracking and Collision. In Chapter - 4, we discuss the GPU implementation of Tracking and
Collision algorithms. In Chapter - 5, we discuss the scheduling algorithm used for simulating the
interactions between Multi-Bunch on multiple GPUs. Chapter - 6 presents the performance
results of the proposed parallel algorithms for Tracking and Collision on NVIDIA Tesla K40
GPU. Also, the results about the scaling of Multi-Bunch implementation on multiple GPUs are
also presented in this chapter. Finally, in Chapter - 7, we summarize our findings, conclude and
discuss the future work.

4

CHAPTER II
BACKGROUND AND STATE OF ART
II.1 PHYSICAL PROBLEM
II.1.1 Tracking
Particle tracking for each of the six phase-space coordinates: 𝑥, a ≡ 𝑝2 /𝑝4 , 𝑏 ≡ 𝑝6 /𝑝4 , 𝑙
and 𝛿 is done using the equation
𝑥=

EGIJKL 𝑀(𝑥|𝛼𝛽𝛾𝜂𝜆𝜇)𝑥

E G I J K L

𝑎 𝑦 𝑏 𝑙 𝛿 ,

(1)

where 𝑀(𝑥|𝛼𝛽𝛾𝜂𝜆𝜇) is a single turn map that is generated using a readily available
accelerator lattice design and tracking codes. 𝑥 and 𝑦 are the transverse particle positions, a and
𝑏 are the associated transverse momentum components 𝑝2 and 𝑝6 , respectively, normalized to
the reference momentum 𝑝4 , 𝑙 = −(𝑡 − 𝑡4 )𝑣4 𝛾4 and 𝛿 = (𝐾 − 𝐾4 )/𝐾4 . Here 𝑡, 𝐾, 𝑣4 , 𝛾4 are
the time of the flight, kinetic energy, velocity and Lorentz factor, respectively. The subscript 0
indicates the reference value of the variable.
II.1.2 Collision
Beam-beam effects are one of the most dominant effects limiting the luminosity in
electron-ion colliders [12]. The interaction between the two colliding beams (or a single particle
in the field of particle beam) is described by the Poisson equation:
∆∅ 𝒓 = −

U
VW

𝜌 𝒓 ,

(2)

where 𝜌 is the charge distribution, ∅ the scalar potential, 𝜀4 the permittivity of free space and r the
vector containing spatial coordinates. Solving the Poisson equation can be done directly via a
number of techniques, including multigrid, conjugate gradient, or Fourier transform based

5

approach. These methods provide the exact numerical solution to an arbitrary beam charge
distribution; however, their high computational cost makes them inadequate and inefficient for
simulating long-term beam dynamics in colliders.

Figure 2.1 - Particles (denoted with black circles) in a beam are partitioned into 𝑚 = 3 slices
along longitudinal direction, where 𝐿 is the maximum length of the beam, Δ is the width of each
slice and 𝑙]^_ is the longitudinal coordinate of the left most particle.

6

Figure 2.2 - Collisions between two multi-sliced beams, starting at Position 1 and ending with
Position 2. After each line, all slices in both beams drift in the direction of the arrow by Δ/2, where
Δ is the width of each slice. Grey rectangles denote slices that are colliding at each time.

7

In this thesis, we use Basetti-Erskine approximation [2] to model efficiently the beambeam interaction. Our approach assumes the interacting bunches to be infinitesimally short. The
finite bunch length is modeled by composing the beam of several infinitesimal slices. Each of these
slices can then be treated as an infinitesimally short bunch. Figure 2.1 explains the slicing process,
where the beam distribution is divides into 3 slices. 𝐿 is the total length of the beam, 𝑚 is the
number of slices, ∆ = 𝐿/𝑚 is the width of the individual slice, and the slice number of the particle
is 𝑆𝑙𝑖𝑐𝑒 = ( 𝑙 − 𝑙]^_ )/∆ . The collision between the two beams at the interaction point (IP) is
simulated by collisions of individual slices which is illustrated in the Figure 2.2 where each beam
is divides into 3 slices. Thus, when the beam is divided into m slices, it is evident from the figurecollision that there is a total of m2 collisions between the slices of two beams. The collision between
any two slices with longitudinal positions 𝑧 e and 𝑧 f occurs at 𝑠 = 𝑆 𝑧 e , 𝑧 f ≡ (𝑧 e − 𝑧 f )/2,
taking into the account that the beam sizes are different from those at the IP (𝑠 = 0). The kicks
experienced by both beams can be calculated by:
±
𝑥_jk
= 𝑥 ± ± 𝑆 𝑧 e , 𝑧 f 𝑓n± ,

(3)

±
𝑝2,_jk
= 𝑝2± − 𝑓n± ,
±
𝑦_jk

= 𝑦 ± ± 𝑆 𝑧 e , 𝑧 f 𝑓o± ,

±
𝑝6,_jk
= 𝑝6± − 𝑓o± ,
U

U

U

U

q

q

q

q

±
𝑝p,_jk
= 𝑝p± − 𝑓n± 𝑝2± − 𝑓n± − 𝑓6± 𝑝6± − 𝑓o± − 𝑔± ,

and
𝑓n± =
𝑓o± =

_s∓ u ∓ v±
_ ∓ I±

_s∓ u ∓ v±
_ ∓ I±

𝐹2 𝑋 ± − 𝑋 ∓ , 𝑌 ± − 𝑌 ∓ ; 𝜎2∓ 𝑆 , 𝜎6∓ 𝑆 ,
𝐹6 𝑋 ± − 𝑋 ∓ , 𝑌 ± − 𝑌 ∓ ; 𝜎2∓ 𝑆 , 𝜎6∓ 𝑆 ,

(4)

8

𝑔± =

_s∓ u ∓ v±
_ ∓ I±

[𝑅qq 0, 𝑧 ∗ 𝑔2 𝑋 ± − 𝑋 ∓ , 𝑌 ± − 𝑌 ∓ ; 𝜎2± 𝑆 , 𝜎6± 𝑆

+𝑅€€ 0, 𝑧 ∗ 𝑔6 𝑋 ± − 𝑋 ∓ , 𝑌 ± − 𝑌 ∓ ; 𝜎2± 𝑆 , 𝜎6± 𝑆 ]𝑆,

𝑔2 𝑥, 𝑦, 𝜎2 , 𝜎6 = −

𝑔6 𝑥, 𝑦, 𝜎2 , 𝜎6 = −

U
q ‚ƒ„ f‚…„

U
q ‚ƒ„ f‚…„

𝑥𝐹2 + 𝑦𝐹6 + 2

𝑥𝐹2 + 𝑦𝐹6 + 2

‚…
‚ƒ

‚ƒ
‚…

𝑒

𝑒

f

ƒ„
…„
e „
„†„
„†
ƒ
…

f

ƒ„
…„
e „
„†„
„†
ƒ
…

−1 ,

−1

where 𝑟f is the electron radius and, 𝑟e is the proton radius, 𝑛^f and 𝑛^e are the number of simulation
particles in the ith slice of the electron and proton beam, respectively, with which the slice
containing the particle being advanced is colliding, and 𝐹± is given below. The 𝜎 , 𝑠 are evaluated
at 𝑆 as, e.g., 𝜎2± 𝑆 =

[(𝑋 ± − 𝑋 ± )q ], where averages are evaluated at 𝑠 = 0. 𝑁 ± is the number

of electrons (−) and protons (+) in the actual beam bunches, and 𝑛± is the total number of
simulation particles in electrons (−) and protons (+) beam bunches.
The flat beam approximation (𝜎2 > 𝜎6 ), denoted below by subscript f, is relaxed by
deriving generalized solutions for upright (𝜎2 < 𝜎6 ), given by subscript u, respectively.
𝐸Ž 𝑥, 𝑦; 𝜎2 , 𝜎6 ≡

q•
‚ƒ„ f‚…„

𝑤

2e^6
q ‚ƒ„ f‚…„

−𝑒

f

ƒ„
…„
e „
„†„
„†
ƒ
…

𝑤

†…
†
2e^ ƒ 6
†ƒ
†…

q ‚ƒ„ f‚…„

,

(5)

9
q•

𝐸‘ 𝑥, 𝑦; 𝜎2 , 𝜎6 ≡ 𝑖

𝑤

‚…„ f‚ƒ„

6f^2
q ‚…„ f‚ƒ„

−𝑒

f

ƒ„
…„
e „
„†„
„†
ƒ
…

𝑤

†…
†
6f^ ƒ 2
†ƒ
†…

q ‚…„ f‚ƒ„

,

where
𝐸 𝑥, 𝑦; 𝜎2 , 𝜎6 ≡ 𝐹6 𝑥, 𝑦; 𝜎2 , 𝜎6 + 𝑖𝐹2 𝑥, 𝑦; 𝜎2 , 𝜎6 ,

(6)

and
„

𝑤 𝑧 ≡ 𝑒 fp 1 +

p ’„
𝑒
• 4

q^

„

𝑑𝜉 = 𝑒 fp 𝑒𝑟𝑓𝑐 −𝑖𝑧 ,

(7)

is the complex error function (also known as Faddeeva function), and erfc is the complementary
error function. Complex error function is implemented using the optimized algorithm reported in
[9].

II.2 GPU ARCHITECTURE

Figure 2.3 – CUDA Programming Model.

10

At the hardware level, NVIDIA GPU architecture can be considered as an array of
multithreaded Streaming Multiprocessors (SMs) which are scalable. Each SM comprises of several
Streaming Processor (SP) cores, double-precision logic units (DP units), load/store units, special
function units (SFU) for transcendental instructions such as sin, cosine, reciprocal, and square root,
schedulers and instruction dispatch units, instruction cache, register file, on-chip shared-memory
and L1-cache, read-only cache, and texture units.
Each SP core is a fully pipelined integer arithmetic logic unit (ALU) and single-precision
floating point unit (FPU). Memory can be shared among all SMs as the GPUs support memory
sharing in the form of global, constant and texture memory. The global/texture memory are often
cached and use two-level caching system, where L1- cache is located within each SM, while the
L2-cache is located off-chip and is shared among all the SMs.
NVIDIA invented Compute Unified Device Architecture (CUDA) [1]. CUDA is a parallel
computing platform and programming model used to design parallel computations on NVIDIA
GPUs. Any application using CUDA will have an increased computing performance by using the
power of GPU. Two important components of CUDA programming are Device and Host. Device
is the GPU and Host is the CPU. Kernels are the functions where the logic for the actual
computation which is to be run in parallel resides. These kernels are launched by the Host and
executed on the Device by different parallel CUDA threads. The programmer or compiler
organizes these CUDA threads into one-dimensional, two-dimensional, or three-dimensional
block of threads, called a thread block where each thread within a thread block executes an instance
of the kernel. The size of the thread block varies from one generation of GPUs to another. The
thread blocks are combined into a one-dimensional, two-dimensional, or three-dimensional grid
of thread blocks as illustrated in the Figure 2.3a.

11

The number of thread blocks in a grid is usually dictated by the size of the data being
processed or the number of processors in the system, which it can greatly exceed. The programmer
can write the code that may run on any number of cores as the thread blocks are executed
independently and can be scheduled in any order across any number of cores as illustrated in
CUDA C programming guide. During their execution, CUDA threads can access data from six
different memory sources: register memory, constant memory, shared memory, texture memory,
local memory, and global memory as illustrated in the Figure 2.3b.
Each thread has private register to hold frequently accessed data, which are not controlled
by the programmer. Each thread has private local memory that is used for register spills, function
calls, and automatic array variables. Each thread block has a private shared memory generally used
for inter-thread communication and is accessible by all the threads of a block with the same lifetime
as the block. The global, constant and texture memory can be accessed by all the threads and are
available across all the kernel launches through out the execution timeframe of the same
application.
When a kernel is compiled and ready to be executed, the thread blocks within the kernel
grid are scheduled either concurrently or sequentially on the available SMs as multiple thread
blocks can be executed concurrently on a single SM. As the thread blocks terminate at any given
time, new blocks are launched on the vacated SM.
A warp is a group of 32 parallel threads and the SM creates, manages, schedules and
executes threads in such groups. Even though the individual threads within the warp start their
execution from the same program address, they have their own program counter and register state
and hence free to branch and execute independently. All the threads within a warp may execute

12

the same instruction at any given time and which is why SM is considered to be following SIMT
architecture to manage and execute hundreds of threads concurrently.
When all the 32 threads within a warp agree to the same control-flow or the same execution
path then the full warp efficiency is realized. Warp efficiency is the average percentage of active
threads in each executed warp. Often, data-dependent conditional branch causes threads within the
same warp to follow different execution paths which is called as branch divergence or controlflow divergence which then prompts the warp to execute each branch path serially, disabling
threads that are not on that path. The threads converge back to the same execution path when all
the paths are complete.
Note: Branch divergence occurs only within a warp. Threads within different warps execute
independently regardless of whether they are following common or disjoint execution paths.

13

CHAPTER III
EXISTING SERIAL ALGORITHM
In this chapter, we discuss the working of existing serial algorithm that was developed to
establish the proof-of-concept of beam-beam interactions in particle colliders using BassettiErskine approximation [2].

III.1 OUTLINE OF THE ALGORITHM
At the top-most level, numerical simulation of beam-beam effects consists of two major steps
- Tracking and Collision. These two steps are executed during each turn of the simulation, which
in practice, runs for millions to billions of turns to simulate long-term beam-beam dynamics in
particle colliders.
1. Tracking - The particles from the two input beams, e- and p-beam, are transported through
the ring to bring them to an interaction point using an arbitrary-map generated from readily
available accelerator lattice design and tracking codes (e.g. COSY Infinity [7]). This
requires solving Equation (1) for all particles in the two input beams.
2. Collision - The simulation of collision (or beam-beam interactions) between the two input
beams, e- and p-beam, consist of two consecutive steps.
a. Slicing - Each input beam is sliced into m equal parts along longitudinal
direction, as illustrated in Chapter 2. For example, Figure 2.1 illustrates the
slicing of a beam into three parts along longitudinal axis.
b. Apply Kick - The collision of two beams is simulated using slice-to-slice
interactions, where each slice from one beam collides with every other slice of
the other beam such that the order of collision captures the beams drift along

14

the collider ring. For example, Figure 2.2 illustrates the collision of two beams
that is partitioned into three slices each, where particles from both the colliding
beams experience a total of three kicks (or beam-beam effects), one from each
slice of the counter-rotating beam. This kick computation between a pair of
colliding slices, which is the beam-beam effect of one slice on the other, is
calculated using Equations (3)-(7).
Algorithm 1 – 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏 𝐵𝑒𝑎𝑚 − 𝐵𝑒𝑎𝑚 (𝐸, 𝑃, 𝑀j , 𝑀ž , 𝑑, 𝑡, 𝑚)
1: 𝒇𝒐𝒓 𝑖 = 0 𝑡𝑜 𝑡 𝑑𝑜
2:

𝑇𝑟𝑎𝑐𝑘(𝐸, 𝑀j , 𝑑)

3:

𝑇𝑟𝑎𝑐𝑘(𝑃, 𝑀ž , 𝑑)

4:

𝐶𝑜𝑙𝑙𝑖𝑑𝑒(𝐸, 𝑃, 𝑚)

5: 𝒆𝒏𝒅 𝒇𝒐𝒓
6: 𝒆𝒏𝒅 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏
The pseudo code for numerical simulation of the beam-beam effects is illustrated in
Algorithm 1. In this algorithm, each beam is represented as a list of particles, where each particle
is a six dimensional object denoting the six phase-space coordinates of that particular particle. The
map required to transport the particles through the collider ring is given as a list of 2D matrices,
where each matrix represents the 2D map along one of the six phase-space coordinates, and each
row of the matrix is a 7-tuple object, (α, β, γ, η, λ, µ, M(x|α β γ η λ µ)), denoting the variables with
same representation from Equation (1). The procedure 𝐵𝑒𝑎𝑚 − 𝐵𝑒𝑎𝑚 simulating the beam-beam
effects takes input 𝐸, 𝑃, 𝑀j , 𝑀ž , 𝑑, 𝑡 and 𝑚, where 𝐸 and 𝑃 are the list containing particles from ebeam and p-beam, respectively, 𝑀j and 𝑀ž are the transport maps for e-beam and p-beam,

15

respectively, 𝑑 is the dimension of particles in simulation space (in this case, we have six phasespace coordinates i.e. 𝑑 = 6), 𝑡 is the number of turns required for the simulation, and 𝑚 is the
number of slices required for the collision step of the simulation. In this procedure, each iteration
of the for loop implements beam-beam effects for a single turn of the simulation, where particles
from each beam is first transported through the collider ring using the procedure 𝑇𝑟𝑎𝑐𝑘 on each
beam, and then the collision of the two beams are implemented using 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 procedure. The
pseudocode for these two procedures are presented in Algorithms 2 and 4.

III.2 TRACKING ALGORITHM
Algorithm 2 – 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏 𝑇𝑟𝑎𝑐𝑘 (𝐵, 𝑀¥ , 𝑑)
1: 𝒇𝒐𝒓 𝑖 = 0 𝑡𝑜 𝑑 – 1 𝑑𝑜
2:

𝑀 ← 𝑀¥ [𝑖]

3:

𝒇𝒐𝒓 𝑒𝑎𝑐ℎ 𝑝𝑎𝑟𝑡𝑖𝑐𝑙𝑒 𝑝 ∈ 𝐵 𝒑𝒂𝒓𝒂𝒍𝒍𝒆𝒍 𝒅𝒐

4:

𝑝[𝑖] ← 𝐴𝑝𝑝𝑙𝑦 − 𝑀𝑎𝑝(𝑝, 𝑀, 𝑑, 𝑖)

5:

𝒆𝒏𝒅 𝒇𝒐𝒓

6: 𝒆𝒏𝒅 𝒇𝒐𝒓
7: 𝒆𝒏𝒅 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏
Algorithm 3 – 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝐴𝑝𝑝𝑙𝑦 − 𝑀𝑎𝑝(𝑝, 𝑀, 𝑑, 𝑑𝑖𝑚)
1: 𝑥 ← 𝑝[𝑑𝑖𝑚]
2: 𝒇𝒐𝒓 𝑖 = 0 𝑡𝑜 𝑀. 𝑟𝑜𝑤𝑠 – 1 𝒅𝒐
3:

𝑦 ← 𝑀[𝑖][𝑀. 𝑐𝑜𝑙𝑢𝑚𝑛𝑠 − 1]

4:

𝒇𝒐𝒓 𝑗 = 0 𝑡𝑜 𝑑 – 1 𝒅𝒐

5:

𝑦 ← 𝑦. 𝑝[𝑖]𝑀[𝑖][𝑗]

16

6:

𝒆𝒏𝒅 𝒇𝒐𝒓

7:

𝒙←𝑥+𝑦

8: 𝒆𝒏𝒅 𝒇𝒐𝒓
9: 𝒓𝒆𝒕𝒖𝒓𝒏 𝒙
10: 𝒆𝒏𝒅 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏
The procedure 𝑇𝑟𝑎𝑐𝑘 𝐵, 𝑀¥ , 𝑑

in Algorithm 2 evaluates Equation (1) for all 𝑑-

dimensional particles in a input list 𝐵 using a transport map 𝑀¥ . In particular, for each dimension
in the 𝑑-dimensional coordinate space of the particles, transport map of the corresponding
dimension is applied to all particles in the list 𝐵 using an auxiliary procedure 𝐴𝑝𝑝𝑙𝑦 − 𝑀𝑎𝑝, where
the procedure 𝐴𝑝𝑝𝑙𝑦 − 𝑀𝑎𝑝(𝑝, 𝑀, 𝑑, 𝑑𝑖𝑚) called along a dimension 𝑑𝑖𝑚 for a particle 𝑝 returns
the updated value for 𝑝[𝑑𝑖𝑚] by evaluating Equation (1) using the transport map 𝑀. The pseudo
code for 𝐴𝑝𝑝𝑙𝑦 − 𝑀𝑎𝑝 and other auxiliary methods required in 𝐵𝑒𝑎𝑚 − 𝐵𝑒𝑎𝑚 procedure are
illustrated in Algorithm 3.

III.3 COLLISION ALGORITHM
Algorithm 4 – 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 (𝐸, 𝑃, 𝑚)
1: 𝑆j ← 𝑆𝑙𝑖𝑐𝑒(𝐸, 𝑚)
2: 𝑆ž ← 𝑆𝑙𝑖𝑐𝑒(𝑃, 𝑚)
3: 𝒇𝒐𝒓 𝑖 = 1 𝑡𝑜 𝑚 𝒅𝒐
4:
5:

𝒇𝒐𝒓 𝑗 = 0 𝑡𝑜 𝑖 – 1 𝒑𝒂𝒓𝒂𝒍𝒍𝒆𝒍 𝒅𝒐
Let 𝑠 be the 𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛 − 𝑝𝑜𝑖𝑛𝑡 for 𝑆j [𝑗] and 𝑆ž [𝑖 – 𝑗 – 1]
calculated as described in Chapter 2

6:

𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘(𝑆j [𝑗], 𝑆ž [𝑖 – 𝑗 – 1], 𝑠)

17

7:

𝒆𝒏𝒅 𝒇𝒐𝒓

8: 𝒆𝒏𝒅 𝒇𝒐𝒓
9: 𝒇𝒐𝒓 𝑖 = 𝑚 + 1 𝑡𝑜 2 ∗ 𝑚 – 1 𝒅𝒐
10:
11:

𝒇𝒐𝒓 𝑗 = 𝑖 – 𝑚 𝑡𝑜 𝑚 – 1 𝒑𝒂𝒓𝒂𝒍𝒍𝒆𝒍 𝒅𝒐
Let 𝑠 be the 𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛 − 𝑝𝑜𝑖𝑛𝑡 for 𝑆j [𝑗] and 𝑆ž [𝑖 – 𝑗 – 1]
calculated as described in Chapter 2

12:
13:

𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘(𝑆j [𝑗], 𝑆ž [𝑖 – 𝑗 – 1], 𝑠)
𝒆𝒏𝒅 𝒇𝒐𝒓

14: 𝒆𝒏𝒅 𝒇𝒐𝒓
15: 𝐸 ← 𝑀𝑒𝑟𝑔𝑒 − 𝑆𝑙𝑖𝑐𝑒𝑠(𝑆j , 𝑚)
16: 𝑃 ← 𝑀𝑒𝑟𝑔𝑒 − 𝑆𝑙𝑖𝑐𝑒𝑠(𝑆ž , 𝑚)
17: 𝒆𝒏𝒅 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏
Algorithm 5 – 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏 𝑆𝑙𝑖𝑐𝑒(𝐵, 𝑚)
1: let 𝑆[0 . . 𝑚 − 1] be a new array
2: 𝒇𝒐𝒓 𝑖 = 0 𝑡𝑜 𝑚 – 1 𝒅𝒐
3:

make 𝑆[𝑖] an empty list

4: 𝒆𝒏𝒅 𝒇𝒐𝒓
5: 𝒇𝒐𝒓 𝑒𝑎𝑐ℎ 𝑝𝑎𝑟𝑡𝑖𝑐𝑙𝑒 ∈ B 𝒑𝒂𝒓𝒂𝒍𝒍𝒆𝒍 𝒅𝒐
6:

𝑠𝑙𝑖𝑐𝑒 ← 𝐹𝑖𝑛𝑑 − 𝑆𝑙𝑖𝑐𝑒(𝐵, 𝑝, 𝑚)

7:

𝐿𝑖𝑠𝑡 − 𝐼𝑛𝑠𝑒𝑟𝑡(𝑆[𝑠𝑙𝑖𝑐𝑒], 𝑝)

8: 𝒆𝒏𝒅 𝒇𝒐𝒓
9: 𝒓𝒆𝒕𝒖𝒓𝒏 𝑆

18

10: 𝒆𝒏𝒅 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏
11: 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏 𝐹𝑖𝑛𝑑 − 𝑆𝑙𝑖𝑐𝑒(𝐵, 𝑝, 𝑚)
12:

calculate and return the slice number to which 𝑝 belongs based on the geometry
along longitudinal direction

13: 𝒆𝒏𝒅 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏
The collision between two counter-rotating beams is implemented in the procedure
𝐶𝑜𝑙𝑙𝑖𝑑𝑒 (𝐸, 𝑃, 𝑚), where 𝐸 and 𝑃 represents the list of particles in the two colliding beams and 𝑚
is the number of slices required per beam. This procedure updates all the particles in 𝐸 and 𝑃 to
reflect the beam-beam interaction, and it works as follows. Line 1 and 2 divides the list of particles
in the input beams into 𝑚 slices (or sublists) using 𝑆𝑙𝑖𝑐𝑒 method on each beam 𝐸 and 𝑃. In
particular, procedure 𝑆𝑙𝑖𝑐𝑒(𝐵, 𝑚) partitions the input list of particles 𝐵 into 𝑚 sublists based on
their corresponding slice number which is calculated using 𝐹𝑖𝑛𝑑 − 𝑆𝑙𝑖𝑐𝑒 method that implements
the slicing algorithm described in Chapter 2. Next, the 𝑓𝑜𝑟 loops in lines 3 - 14 calculates the
beam-beam effects (or kicks) for every pair of colliding slices, where the kick computation on all
particles in a pair of colliding slices is implemented using the procedure 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘. The
procedure 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘(𝑆j , 𝑆ž , 𝑠) calculates the kick at a interaction point 𝑠 on all particles in
the input list 𝑆j and 𝑆ž , where the interaction point is calculated as described in Section 2 and [15],
and then it updates all the particles in the input slice with the computed kick. The pseudo code for
𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘 is illustrated in Algorithm 6. Finally, in lines 15 and 16, particles from individual
slices are merged into a single list using the procedure 𝑀𝑒𝑟𝑔𝑒 − 𝑆𝑙𝑖𝑐𝑒𝑠 on 𝑆j and 𝑆ž , respectively.
The 𝑀𝑒𝑟𝑔𝑒 − 𝑆𝑙𝑖𝑐𝑒𝑠 procedure returns a sorted list that is the merge of its input array of lists, and
the output from the two calls to this procedure is stored in the input lists, 𝐸 and 𝑃 respectively.
The updated particles in 𝐸 and 𝑃 are used for simulating the beam-beam effects in the next turn.

19
Algorithm 6 - 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘(𝑆j , 𝑆ž , 𝑠)

1. (𝑥 j , 𝑦 j , 𝜎2j , 𝜎6j ) ← 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝑀𝑒𝑎𝑛 − 𝑆𝐷(𝑆𝑒, 𝑠)
ž

ž

2. (𝑥 ž , 𝑦 ž , 𝜎2 , 𝜎6 ) ← 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝑀𝑒𝑎𝑛 − 𝑆𝐷(Sp, s)
3. 𝒇𝒐𝒓 each particle 𝑒 ∈ 𝑆𝑒 𝒑𝒂𝒓𝒂𝒍𝒍𝒆𝒍 𝒅𝒐
ž

ž

4.

(𝐹2 , 𝐹6 ) ← 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘(𝑒, 𝑥 ž , 𝑦 ž ,𝜎2 , 𝜎6 )

5.

𝑒[0] ← 𝑒[0] − 𝑠 ∗ 𝐹𝑥

6.

𝑒[1] ← 𝑒[1] − 𝐹𝑥

7.

𝑒[2] ← 𝑒[2] − 𝑠 ∗ 𝐹𝑦

8.

𝑒[3] ← 𝑒[3] − 𝐹𝑦

9. 𝒆𝒏𝒅 𝒇𝒐𝒓
10. 𝒇𝒐𝒓 each particle 𝑝 ∈ 𝑆𝑝 𝒑𝒂𝒓𝒂𝒍𝒍𝒆𝒍 𝒅𝒐
11.

(𝐹2 , 𝐹6 ) ← 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘(𝑝, 𝑥 j , 𝑦 j ,𝜎2j , 𝜎6j )

12.

𝑝[0] ← 𝑒 0 + 𝑠 ∗ 𝐹𝑥

13.

𝑝[1] ← 𝑝 1 + 𝐹𝑥

14.

𝑝[2] ← 𝑝 2 + 𝑠 ∗ 𝐹𝑦

15.

𝑝[3] ← 𝑝 3 + 𝐹𝑦

16. 𝒆𝒏𝒅 𝒇𝒐𝒓
17. 𝒆𝒏𝒅 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏
18. 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏 𝑀𝑒𝑟𝑔𝑒 − 𝑆𝑙𝑖𝑐𝑒𝑠(𝑆, 𝑚)
19.

B←∅

20.

𝒇𝒐𝒓 𝑖 = 0 𝑡𝑜 𝑚 – 1 𝒅𝒐

21.

𝐵 ← 𝑀𝑒𝑟𝑔𝑒 − 𝐿𝑖𝑠𝑡(𝐵, 𝑆[𝑖])

20

22.

𝒆𝒏𝒅 𝒇𝒐𝒓

23.

𝒓𝒆𝒕𝒖𝒓𝒏 𝑩

24. 𝒆𝒏𝒅 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏
The procedure 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘(𝑆j , 𝑆ž , 𝑠) is the heart of 𝐵𝑒𝑎𝑚 − 𝐵𝑒𝑎𝑚 algorithm, and it
works as follows. The two calls to 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝑀𝑒𝑎𝑛 − 𝑆𝐷 method calculates and returns the
mean and standard deviation along the first and third dimension for the particles in 𝑆j and 𝑆ž ,
respectively, where the two dimensions correspond to the transverse position of particles. Next,
for each particle 𝑒 in 𝑆j , the kick from all the particles in 𝑆ž on a particle 𝑒 is calculated using the
procedure 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘, which takes input, a particle 𝑒, mean and standard deviation of the
particles in 𝑆ž , and it returns a pair (𝐹2 , 𝐹6 ). The output values, 𝐹2 and 𝐹6 , represents the kick from
particles in 𝑆ž on 𝑒, and it is calculated using equations 3 to 7 (𝐹2 and 𝐹6 denotes the variables with
same notations from Chapter 2). These computed kicks are used to update the particles in 𝑆j in the
first 𝑓𝑜𝑟 loop. Similarly, in the next 𝑓𝑜𝑟 loop, kick on all the particles in 𝑆ž due to the particles
from 𝑆j is calculated using the procedure 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 and the output from this procedure is
used to update the particles in 𝑆ž .

21

CHAPTER IV
GPU IMPLEMENTATION
In this chapter, we discuss our GPU implementation of Single-Bunch beam-beam
simulation algorithm. We first present a brief overview of the existing GPU algorithm for Tracking
and then discuss the GPU algorithm for Collision in detail.

IV.1 PARALLEL ALGORITHM FOR TRACKING
In our implementation, the input lists of particles and the transport map are always stored
in the GPU memory. The procedure TRACK, which is highly data-parallel, is implemented on
GPU as an independent kernel where the computation of each particle is assigned to parallel
threads with one-to-one correspondence. In this kernel, transport map is stored in shared memory
and it is accessed efficiently to improve the memory performance.

IV.2 PARALLEL ALGORITHM FOR COLLISION
The essential routines for 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 procedure such as 𝑆𝑙𝑖𝑐𝑒, 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘, and 𝑀𝑒𝑟𝑔𝑒 −
𝑆𝑙𝑖𝑐𝑒𝑠 are implemented on GPU. The implementations of all the routines are discussed in the next
subsections.
IV.2.1 Slicing in Parallel
The Single-Threaded way of Slicing and Sorting the particles is to compute the slice
number of one particle at a time, store the slice numbers of each particle, and then sort them using
one of the popular sorting techniques according to their slice number. Although there are some
existing libraries for sorting on GPUs, as we have to sort the particles which has 6 dimensions,

22

none of the libraries were compatible for integration into our implementation. So, we propose a
fast and efficient algorithm for Slicing and Sorting the particles of a Beam on GPU.

Figure 4.1 – Initial positions of all the particles and global counter stores particles size in each
slice. A new array is also initialized to store particles according to their slice number.

During this phase, a thread is assigned to each particle to calculate its slice number. Each
thread will find the slice number of the particle and increments the global counter to update the
count of particles in each slice. For example, in Figure 4.1 there are a total of 1000 particles in the
beam and the beam is to be divided into 3 slices. All the threads calculate their respective particle's
slice number and update the global counter. In this case, Slice-1(S1) has 150 particles, Slice-2(S2)
has 550 particles, and Slice-3(S3) has 300 particles. Then, an empty list is initialized and the spaces
for each slice are allocated separately as shown in the Figure 4.1.

23

Figure 4.2 – First block accessing the global counter to write the particles in their allocated regions.

Next, a lead thread from each block examines the global slice count status and broadcasts
the status to all the particles. For example, in the Figure 4.2, a lead thread from the block (say
block-x) examines the status for all the slices and broadcasts the difference between the actual total
slice count and the current status. In this case, the difference is 0 as none of the blocks have started
sorting their particles according to their slice numbers. So, as there are 60 particles in block-x
which belongs to Slice-1 and the first 60 spaces are allocated to the particles belonging to Slice-1
in block-x. A similar procedure is followed for the particles in Slice-2 and Slice-3. After these
steps, the global slice count will be 90, 310, 170 respectively for Slice-1, 2, and 3.

24

Figure 4.3 - Second block accessing the global counter to write the particles in their allocated
regions.

Figure 4.3 illustrates the scenario when another block (say block-y) tries to access the
global slice count after block-x. Now the difference between actual total slice count and current
status is 60, 240, 330 respectively for Slice-1, 2, and 3. The number of particles in block-y which
belong to Slice-1, 2 and 3 are 40, 120 and 80 respectively. So the starting position for particles in
block-y that belong to Slice-1 will start from 61 in the area that is specially allocated for Slice-1
particles. Similarly, the starting positions for Slice-2 and 3 particles are 240 and 130 in their
respective allocations. Figure 4.4 illustrates the final positions of particles after slicing and sorting
them. In this way, all the particles belonging to the same slice are arranged consecutively which

25

in turn directs the threads to perform a coalesced access memory access in 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝑀𝑒𝑎𝑛 −
𝑆𝐷 and 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘 procedures.

Figure 4.4 – Final positions of particles after slicing and sorting.

IV.2.2 Parallel Apply Kick

Figure 4.5 – Triangle illustrating the slice-to-slice collisions that can happen in parallel at each
stage. In this example, each bunch has 5 slices.

Two main building blocks of 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘 procedure are 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝑀𝑒𝑎𝑛 − 𝑆𝐷 and
𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 routines. Figure 4.5 illustrates the slice-to-slice collision process when the beam
is divided into 5 slices. When the beam is divided into m slices, there will be a total of 2m - 1
stages of slice-to-slice collisions. So, here there are a total 9 stages of slice-to-slice collisions. At
each stage, the numbers represent the slices of e- and p-beam that are colliding with each other.
For example, in the Stage-2, 1st slice from e-beam is colliding with the 2nd slice of p-beam which
is represented as 12, and 2nd slice from e-beam is colliding with the 1st slice from p-beam. In

26

stage-1 and stage-9, only one pair of slices are colliding with each other where as in all the other
stages there are more than one pair of slices that are colliding with each other. We first present the
brief outline of what is happening in the 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘 procedure here. When a pair of slices are
colliding with each other, the procedure 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝑀𝑒𝑎𝑛 − 𝑆𝐷 is called on both the slices which
return mean and standard deviations of the slice in x and y directions. Next, the procedure
𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 is called on all the particles of a slice to compute the kick of the opposite slices
on each particle and return the forces in x and y directions which are later applied on the particle
to get its updated dimensions.
Both routines, 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝑀𝑒𝑎𝑛 − 𝑆𝐷 and 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 are implemented on GPU.
At first, we compute the mean and standard deviation (SD) of all the m slices and store them in
memory. The routine 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝑀𝑒𝑎𝑛 − 𝑆𝐷 is implemented using CUDA based Thrust Library
[3] which has powerful and efficient reduction operation implementations. Then at each stage of
slice-to-slice collision process, we pass those mean and SDs to the 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 procedure
and calculate the updated mean and SDs of the slices at end of the 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 procedure.
For example, during the Stage-1 the mean and SDs of 1e and 1p are passed to 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘
and at the end of the 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 procedure we calculate the updated mean and SDs of 1e
and 1p and update them in the memory so that we retrieve the correct values of mean and SDs of
1e and 1p when they are participating in collision process again during Stage-2.
During Stage-2 to Stage-8, there are more than one pair of slices participating in the
collision process. The sequential way of doing all these pair-wise collisions in a stage is to process
each pair-wise collision one after the other. In our GPU implementation, we do all the pair-wise
collisions in parallel. We fire up the number of threads that are equal to the number of particles in
all the slices and all the thread call the 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 procedure on their respective particles in

27

parallel. The advantage of arranging all the particles that belong to the same slice comes into the
picture here. As the threads that belong to the same warp access the consecutive memory locations
all the memory requests of the threads are completed in at most two transactions.

28

CHAPTER V
MULTI-GPU IMPLEMENTATION
In this chapter, we present the algorithm for simulating the beam-beam effects in particle
colliders where each of the collider rings carry more than one bunch. We refer to the two types of
input beams in the following algorithm description as e-beam and p-beam.

Figure 5.1 – Setup of collider rings with 4p and 3e bunches. Number of e bunches is always one
less than number of p bunches.

29

Figure 5.1 illustrates the setup of two collider rings which has three e-beam and four pbeam bunches. It is to be noted that the number of e-beam bunches is always one less than p-beam
bunches. When there are 𝑛𝑏 p-beam bunches, there will be a total of 𝑛𝑏 (𝑛𝑏 − 1) bunch-to-bunch
interactions. The interaction between two bunches consists of two major steps - Tracking and
Collision as described in Chapter - 3.
Figure 5.2 illustrates the schedule that is repeated for two times for all the 𝑛𝑏(𝑛𝑏 − 1)
bunch-to-bunch interactions when 𝑛𝑏 = 4. During iteration – 1, at each turn every e-beam sees a
different p-beam and a single iteration is said to be completed after every e-beam sees every pbeam and this also refers to completion of one schedule. Next, from iteration – 2, the same schedule
repeats.

Figure 5.2 – Schedule repeated for two iterations for 3e and 4p bunches. A schedule has 4 turns
which is equal to number of p bunches.

30

Sequential Algorithm
Algorithm 7 - function 𝐵𝑒𝑎𝑚 − 𝐵𝑒𝑎𝑚(𝐵j , 𝐵ž , 𝑀j , 𝑀ž , 𝑑, 𝑡, 𝑚, 𝑛𝑏)
1. 𝑒_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚 = 0, 𝑝_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚 = 1
2. 𝒘𝒉𝒊𝒍𝒆 𝑖 < 𝑡:
3.
4.

𝒇𝒐𝒓 𝑖 = 0 𝑡𝑜 𝑛𝑏 ∗ (𝑛𝑏 − 1) 𝒅𝒐
𝑰𝒇 (𝑒_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚 >= 𝑛𝑏 − 1):

5.

𝑒_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚 = 0

6.

𝑖 =𝑖+1

7.
8.

𝑰𝒇 (𝑝_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚 >= 𝑛𝑏):
𝑝_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚 = 0

9.

𝑇𝑟𝑎𝑐𝑘 (𝐵j [𝑒_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚], 𝑀j , 𝑑)

10.

𝑇𝑟𝑎𝑐𝑘 (𝐵ž [𝑝_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚], 𝑀ž , 𝑑)

11.

𝐶𝑜𝑙𝑙𝑖𝑑𝑒 (𝐵j [𝑒_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚], 𝐵ž [𝑝_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚], 𝑚)

12.

𝑒_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚 + +, 𝑝_𝑏𝑢𝑛𝑐ℎ_𝑛𝑢𝑚 + +

13.

𝒆𝒏𝒅 𝒇𝒐𝒓

14. 𝒆𝒏𝒅 𝒘𝒉𝒊𝒍𝒆
15. 𝒆𝒏𝒅 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏
Algorithm - 7 illustrates the pseudo code for the simulation of beam-beam effects when
there are multiple bunches. This simulation is driven by the procedure 𝐵𝑒𝑎𝑚 − 𝐵𝑒𝑎𝑚 which takes
(𝐵j , 𝐵ž , 𝑀j , 𝑀ž , 𝑑, 𝑡, 𝑚, 𝑛𝑏) as inputs. Here 𝐵j and 𝐵ž are the lists of lists where the outer lists
represent the bunches and the inner lists represent the particles in a bunch. As described in Chapter
3, each particle is a six-dimensional object denoting the six phase-space coordinates of that
particular particle, 𝑀j and 𝑀ž are the transport of maps of e- and p-beams respectively, 𝑑 is the

31

dimension of the particles, 𝑡 is the number of turns required for the simulation which should be
the multiple of 𝑛𝑏, and 𝑚 is the number of slices required for the collision step of the simulation.
The schedule given above is implemented sequentially using the 𝑓𝑜𝑟 − 𝑙𝑜𝑜𝑝 in this procedure by
passing the respective e- and p-beams to the Track and Collide procedures and their working is
described in Chapter 3.
Multi-GPU Approach
The set of interactions happening within each turn of the schedule are independent of each
other and the interactions happening in different turns are dependent on each other. For example,
in Figure – 5.2, the interaction (1,1) of Turn – 2 cannot happen until interaction (1,2) of Turn-1
finishes, as both the e-beams are same here. So the idea is to simulate all the interactions in a single
turn simultaneously on a cluster of GPUs. The pair of beams interacting at each turn is different
from the previous iteration. The naive way of running all these turns on multiple GPUs is to move
the bunches between GPUs for each turn and perform the simulations. We propose an algorithm
where the bunch is stored on a particular GPU throughout the simulation without further moving
the bunch data between GPUs for each turn. Once the GPU for all the bunches is decided, during
the phase of beam-beam interaction we only move the parameters between the GPUs that are
required to apply the interaction effects on the bunches. A robust scheduling algorithm is necessary
to schedule all the interactions on the GPUs in a turn without any delay and make neither of the
beams wait for the parameters from the opposite GPU. Table – 5.1 depicts the GPU assignment
and Figure – 5.3 illustrates the scheduling algorithm for a single turn when there are 6 GPUs and
15 bunches.

32
GPU 1
GPU 2
GPU 3
GPU 4
GPU 5
GPU 6

1
4
7
10
12
14

2
5
8
11
13
15

3
6
9

Table 5.1 - Distribution of Bunches on GPUs (6 GPUs, 15 Bunches).

V.1 SCHEDULING BUNCHES

Figure 5.3 – Schedule of a single random turn when there are 15 bunches, and the execution
schedule formed by the scheduling algorithm using the Current Turn Schedule.

The first step of this scheduling is to logically divide all the interactions in a turn according
to the GPU where the e-beam is stored. For example, (1, 2, 3) e-beams are kept in the same logical

33

partition (lg) as they are all stored in the same GPU. This division remains constant all over the
simulation as the order of the e-beams is same for every turn. As there is a maximum of 3 beams
stored in a GPU, this turn is scheduled to finish the execution in 3 steps because a GPU can process
only one interaction at a given time. The execution schedule in the Figure-schedule illustrates the
order in which all the interactions in the turn are executed in three steps. At each step, a GPU
selects a set of e- and p-beams in an order of from a lg for execution. For example, during step –
1, GPU – 2 scans line-1 of lg-2 (as e-beams that belong to GPU – 2 are present in lg-2 in any given
turn) and selects 4th e-beam for execution. Next, it scans line-1 of each lg and selects 5th p-beam
from lg-6 as it belongs to GPU – 2. The multi-GPU algorithm is capable of halting the execution
of one or both the beams for that particular step. The symbol '-' indicates that the particular beam
slot for the GPU is empty during that step. There are two possible reasons for that:
Reason: 1. During step - 1, there are two p-beams (1, 3) that appear in line-1 in their logical
partition and they belong to the same GPU (GPU - 1). Due to this reason, only one of the beam
will be scheduled to execute on GPU - 1 for that particular step. In this case, p-beam - 1 is selected
for execution during step - 1, and p-beam - 3 is scheduled to execute in step - 2. Because of this
the execution for e-beam - 12 which is interacting with p-beam - 3 is also scheduled to execute in
step - 2 which is why the e-beam slot for GPU - 5 is empty. The p-beam slot for GPU - 6 is empty
as none of the logical partitions have p-beam that belongs to this GPU that appears in line-1.
Reason: 2. All the GPUs have finished the execution of their e- and p-beams for that
particular turn. For example, GPU - 4 has completed the execution of all of its e- and p-beams.
The bi-directional arrows in each step indicate the GPUs that are communicating with each
other to execute that respective interaction. For example, during Step - 1, GPU - 2 and GPU - 4
are communicating with each other to execute the interaction 4 - 10 which appears in line-1 of lg-

34

2. Once this execution schedule shown in Figure – 5.3 is formed, the Track and Collide procedures
for e- and p-beams are executed on the GPUs according to the execution schedule.
Algorithm 8 – function 𝐵𝑒𝑎𝑚 − 𝐵𝑒𝑎𝑚 −
𝑀𝑢𝑙𝑡𝑖𝐺𝑃𝑈(𝐵j , 𝐵ž , 𝑀j , 𝑀ž , 𝑑, 𝑡, 𝑚, 𝑛¥ , 𝐺𝑃𝑈𝑜𝑓𝐵𝑒𝑎𝑚, 𝑀𝑎𝑥𝐵𝑢𝑛𝑐ℎ𝐺𝑃𝑈, 𝑛𝑢𝑚𝐺𝑃𝑈)
1. 𝒘𝒉𝒊𝒍𝒆 𝑖 = 1 𝑡𝑜 𝑡:
2.

𝑡𝑜𝑡𝑎𝑙_𝑖𝑗 = 0

3.

𝒇𝒐𝒓 𝑖 = 1 𝑡𝑜 𝑛𝑏 𝒅𝒐

4.

𝒇𝒐𝒓 𝑗 = 1 𝑡𝑜 𝑛𝑏 − 1 𝒅𝒐

5.

𝑐𝑢𝑟_𝑒_𝑠𝑐ℎ𝑑[𝑗] = 𝑒_𝑠𝑐ℎ[𝑡𝑜𝑡𝑎𝑙_𝑖𝑗]

6.

𝑐𝑢𝑟_𝑝_𝑠𝑐ℎ𝑑[𝑗] = 𝑝_𝑠𝑐ℎ[𝑡𝑜𝑡𝑎𝑙_𝑖𝑗]

7.

𝑡𝑜𝑡𝑎𝑙_𝑖𝑗 = 𝑡𝑜𝑡𝑎𝑙_𝑖𝑗 + 1

8.

𝒆𝒏𝒅 𝒇𝒐𝒓

9.

𝒇𝒐𝒓 𝑔𝑝𝑢 = 1 𝑡𝑜 𝑛𝑢𝑚𝐺𝑃𝑈 𝒑𝒂𝒓𝒂𝒍𝒍𝒆𝒍 𝒅𝒐

10.
11.
12.

𝒇𝒐𝒓 𝑠𝑡𝑒𝑝 = 1 𝑡𝑜 𝑀𝑎𝑥𝐵𝑢𝑛𝑐ℎ𝐺𝑃𝑈[𝑔𝑝𝑢] 𝒅𝒐
𝒊𝒇 𝑠𝑡𝑒𝑝 ! = 1
𝑒_𝑏, 𝑜𝑝𝑝_𝑝_𝑏 ← pop any remained e-bunch and its opposite p-bunch in line(step-1) from division-GPU of 𝑐𝑢𝑟_𝑒_𝑠𝑐ℎ𝑑

13.
14.

𝒆𝒍𝒔𝒆
𝑒_𝑏, 𝑜𝑝𝑝_𝑝_𝑏 ← pop e-bunch and its opposite p-bunch in line-step from any
division-GPU of 𝑐𝑢𝑟_𝑒_𝑠𝑐ℎ𝑑

15.
16.

𝒊𝒇 𝑠𝑡𝑒𝑝 ! = 1
𝑝_𝑏, 𝑜𝑝𝑝_𝑒_𝑏 ← pop any remained p-bunch and its opposite e-bunch in line(step-1) from division-GPU of 𝑐𝑢𝑟_𝑝_𝑠𝑐ℎ𝑑

35

17.

𝒆𝒍𝒔𝒆
𝑝_𝑏, 𝑜𝑝𝑝_𝑒_𝑏 ← pop p-bunch and its opposite e-bunch in line-step from any

18.

division-GPU of 𝑐𝑢𝑟_𝑝_𝑠𝑐ℎ𝑑
19.

𝑇𝑟𝑎𝑐𝑘(𝐵j [𝑒_𝑏], 𝑀j , 𝑑)

20.

𝑇𝑟𝑎𝑐𝑘(𝐵ž [𝑝_𝑏], 𝑀ž , 𝑑)

21.

𝐶𝑜𝑙𝑙𝑖𝑑𝑒(𝐵j [𝑒_𝑏], 𝐵ž [𝑝_𝑏], 𝐺𝑃𝑈𝑜𝑓𝐵𝑒𝑎𝑚[𝑜𝑝𝑝_𝑒_𝑏], 𝐺𝑃𝑈𝑜𝑓𝐵𝑒𝑎𝑚[𝑜𝑝𝑝_𝑝_𝑏])

22.
23.
24.

𝒆𝒏𝒅 𝒇𝒐𝒓
𝒆𝒏𝒅 𝒇𝒐𝒓
𝒆𝒏𝒅 𝒇𝒐𝒓

25. 𝒆𝒏𝒅 𝒘𝒉𝒊𝒍𝒆
26. 𝒆𝒏𝒅 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏
Algorithm - 8 illustrates the Multi-GPU version of 𝐵𝑒𝑎𝑚 − 𝐵𝑒𝑎𝑚 algorithm that includes
the scheduling algorithm and the execution of 𝑇𝑟𝑎𝑐𝑘 and 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 functions after the formation of
execution schedule for each step. Here 𝐺𝑃𝑈𝑜𝑓𝐵𝑒𝑎𝑚 is list of list where the outer list represents
each GPU and the inner list represents the bunches store in that particular GPU, 𝑀𝑎𝑥𝐵𝑢𝑛𝑐ℎ𝐺𝑃𝑈
is the maximum out of count of bunches stored in each GPU, and 𝑛𝑢𝑚𝐺𝑃𝑈 is the number of GPUs
in a cluster. At first the current turn schedule is formed using the for-loop in the lines 4-10. At each
step, every GPU will extract the bunch numbers of e and p from the current turn schedule and
executes the 𝑇𝑟𝑎𝑐𝑘 and 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 functions, and the algorithmic steps related to this are described
from lines 9-23. The 𝑇𝑟𝑎𝑐𝑘 procedure is independent and does not need any communication from
other GPUs. Hence the implementation of 𝑇𝑟𝑎𝑐𝑘 procedure is same as described in Chapter - 3.
The 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 procedure has three auxiliary procedures called 𝑆𝑙𝑖𝑐𝑒, 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘, 𝑀𝑒𝑟𝑔𝑒, out of
which the 𝑆𝑙𝑖𝑐𝑒 and 𝑀𝑒𝑟𝑔𝑒 procedures does not need any communication from other GPUs and

36

their implementation remains same as described in Chapter - 3. Also, the execution flow and
parallelism of for-loops calling the procedure 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘 remains the same. As a refresher, the
for-loops in the 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 procedure calculates the beam-beam effects (or kicks) for every pair of
colliding slices, where the kick computation on all particles in a pair of colliding slices is
implemented using the procedure 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘.

V.2 COMMUNICATIONS USING MESSAGE PASSING INTERFACE
(MPI)
For our multi-bunch algorithm to simulate multi-bunch collisions, we used MPI [18] to
exchange the messages between the GPUs
Algorithm 9 - function 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘(𝑆j , 𝑆ž , 𝑠, 𝑜𝑝𝑝_𝑒𝐺𝑃𝑈, 𝑜𝑝𝑝_𝑝𝐺𝑃𝑈)
25. (𝑥 j , 𝑦 j , 𝜎2j , 𝜎6j ) ← 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝑀𝑒𝑎𝑛 − 𝑆𝐷(𝑆𝑒, 𝑠)
ž

ž

26. (𝑥 ž , 𝑦 ž , 𝜎2 , 𝜎6 ) ← 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝑀𝑒𝑎𝑛 − 𝑆𝐷(Sp, s)
27. Send (𝑥 j , 𝑦 j , 𝜎2j , 𝜎6j ) to 𝑜𝑝𝑝_𝑝𝐺𝑃𝑈
ž

ž

28. Send (𝑥 ž , 𝑦 ž , 𝜎2 , 𝜎6 ) to 𝑜𝑝𝑝_𝑒𝐺𝑃𝑈
j
j
j
j
29. Receive (𝑥v¿ÀÁ
, 𝑦v¿ÀÁ
, 𝜎2_v¿ÀÁ
, 𝜎6_v¿ÀÁ
) from 𝑜𝑝𝑝_𝑒𝐺𝑃𝑈
j
j
j
j
30. Receive (𝑥v¿ÀÁ
, 𝑦v¿ÀÁ
, 𝜎2_v¿ÀÁ
, 𝜎6_v¿ÀÁ
) from 𝑜𝑝𝑝_𝑝𝐺𝑃𝑈

31. 𝒇𝒐𝒓 each particle 𝑒 ∈ 𝑆𝑒 𝒑𝒂𝒓𝒂𝒍𝒍𝒆𝒍 𝒅𝒐
ž

ž

ž

ž

32.

(𝐹2 , 𝐹6 ) ← 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘(𝑒, 𝑥v¿ÀÁ , 𝑦v¿ÀÁ , 𝜎2_v¿ÀÁ , 𝜎6_v¿ÀÁ )

33.

𝑒[0] ← 𝑒[0] − 𝑠 ∗ 𝐹𝑥

34.

𝑒[1] ← 𝑒[1] − 𝐹𝑥

35.

𝑒[2] ← 𝑒[2] − 𝑠 ∗ 𝐹𝑦

36.

𝑒[3] ← 𝑒[3] − 𝐹𝑦

37

37. 𝒆𝒏𝒅 𝒇𝒐𝒓
38. 𝒇𝒐𝒓 each particle 𝑝 ∈ 𝑆𝑝 𝒑𝒂𝒓𝒂𝒍𝒍𝒆𝒍 𝒅𝒐
39.

j
j
j
j
(𝐹2 , 𝐹6 ) ← 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘(𝑝, 𝑥v¿ÀÁ
, 𝑦v¿ÀÁ
, 𝜎2_v¿ÀÁ
, 𝜎6_v¿ÀÁ
)

40.

𝑝[0] ← 𝑒 0 + 𝑠 ∗ 𝐹𝑥

41.

𝑝[1] ← 𝑝 1 + 𝐹𝑥

42.

𝑝[2] ← 𝑝 2 + 𝑠 ∗ 𝐹𝑦

43.

𝑝[3] ← 𝑝 3 + 𝐹𝑦

44. 𝒆𝒏𝒅 𝒇𝒐𝒓
45. 𝒆𝒏𝒅 𝒇𝒖𝒏𝒄𝒕𝒊𝒐𝒏
The procedure 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘(𝑆j , 𝑆ž , 𝑠, 𝑜𝑝𝑝_𝑒𝐺𝑃𝑈, 𝑜𝑝𝑝_𝑝𝐺𝑃𝑈) which is the heart of 𝐵𝑒𝑎𝑚 −
𝐵𝑒𝑎𝑚 algorithm is dependent on the communication from other GPUs. 𝑆j and 𝑆ž are the slices of
e- and p-beam respectively, 𝑠 is the interaction point of both the slices, 𝑜𝑝𝑝_𝑒𝐺𝑃𝑈 and 𝑜𝑝𝑝_𝑝𝐺𝑃𝑈
are the GPUs that the particular GPU has to communicate with to receive the mean and standard
deviations of opposite e- and p-beams. The modified algorithm of 𝐴𝑝𝑝𝑙𝑦 − 𝐾𝑖𝑐𝑘 is described in
Algorithm 9. Initially, the procedure 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝑀𝑒𝑎𝑛 − 𝑆𝐷 is called to calculate the mean and
standard deviations of both e- and p-beam along the first and third dimensions of the particles in
𝑆j and 𝑆ž . Then these calculated mean and standard deviations are sent to the respective GPUs
where the colliding e- and p-beams are stored. In the next step the GPU receives the mean and
standard deviations of colliding e- and p-beams from the same set of GPUs to which it sent the
parameters earlier. Then, for each particle 𝑒 belongs to 𝑆j , the kick from all the particles in the
colliding slice of the p-beam which is potentially residing on the other GPU is calculated using the
ž

ž

ž

ž

procedure 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘. This procedure takes 𝑥v¿ÀÁ , 𝑦v¿ÀÁ , 𝜎2_v¿ÀÁ , 𝜎6_v¿ÀÁ as inputs and

38

returns a pair (𝐹𝑥, 𝐹𝑦). The output values 𝐹𝑥 and 𝐹𝑦 represents the kick from particles in the slice
of the opposite colliding p-beam and it is calculated using Equations (3) to (7) described in Chapter
- 2. These computed kicks are used to update the particles in 𝑆j using the first for loop. Similarly,
in the next for loop, kick on all the particles in 𝑆ž due to the particles in the slice of the colliding
e-beam which is potentially residing on other GPU is computed using the 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘
procedure and the output from this procedure is used to update all the particles in 𝑆ž .

39

CHAPTER VI
RESULTS
In this chapter, we discuss the performance of both the implementations. It is to be noted
that the Single-GPU implementation is only for Single Bunch simulations on a single GPU and
the Multi-GPU implementation is only for Multi-Bunch simulations on Multiple GPUs.
We used NVIDIA Tesla K40 GPUs to run the simulations using Single-Bunch and MultiBunch algorithms. Each GPU on a cluster is hosted on a standalone desktop machine with NVIDIA
Tesla K40 GPU hosted on a multi-core CPU platform with two Intel® Xeon® E5-2630 v4
processors, where each E5-2630 v4 processor consist of 10 cores, making a total of 20 CPU cores
for the multi-core platform. The Tesla K40 used in this study is a GK110B GPU-processor based
on the popular Kepler microarchitecture [8]. The GK110B processor in K40 offers 12 GB of
GDDR5 on-board memory with a peak memory bandwidth of 288 GB/sec, and it contains 15
streaming multiprocessors (SMs) each with 192 single-precision CUDA cores and 64 doubleprecision units clocked at 745 MHz. These cores in SMs collectively delivers a peak floating-point
performance of 4.29 Tflops and 1.43 Tflops in single-precision and double-precision, respectively.
We use double-precision for all the floating-point arithmetic in our implementation of beam-beam
effects simulation. The results reported in this Chapter illustrates the performance for a single turn
of the simulation that is averaged over multiple turns of the entire beam-beam dynamics
simulation, which in practice runs for millions to billions of turns.

VI.1 SINGLE-GPU PERFORMANCE
The performance of our parallel implementation of 𝐵𝑒𝑎𝑚 − 𝐵𝑒𝑎𝑚 procedure on a single
GPU is evaluated against an existing out-of-the-box code that was developed to establish the proof-

40

of-concept of beam-beam interactions in particle colliders using Bassetti-Erskine approximation
[11] [15]. It is important to note that this sequential simulation code is a single threaded
implementation, and it is not optimized to take advantage of the multiple cores of CPU
architectures. In order to establish a fair comparison, we used OpenMP to develop a naively
parallel implementation of this sequential code that uses all the cores of the underlying multi-core
CPU architecture and delivers near-linear speedup in the number of cores used. We use this multicore implementation on 20 CPU cores, along with the sequential implementation on a single CPU
core to analyze the performance of our parallel implementation on K40 GPU.

Table 6.1 - Single turn performance results of sequential (on a single CPU core), multi-core CPU
(on 20 CPU cores), and GPU implementation (on K40 GPU) of beam-beam dynamics simulation
that is averaged over multiple turns for varying number of particles with the number of slices
fixed to 𝑚 = 6.
Table 6.1 illustrates the single turn performance results of sequential (on a single CPU
core), multi-core CPU (on 20 CPU cores), and GPU implementation (on K40 GPU) of the 𝐵𝑒𝑎𝑚 −
𝐵𝑒𝑎𝑚 procedure for varying number of particles with the number of slices fixed to m=6. The
tracking time reported in Table 6.1 is the combined execution time of the two 𝑇𝑟𝑎𝑐𝑘 calls in
𝐵𝑒𝑎𝑚 − 𝐵𝑒𝑎𝑚 procedure, and the collision time is the execution time of the 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 procedure.
The results indicate that depending on the number of particles, parallel implementation of beambeam effects on GPU achieves two to three orders of magnitude speedup when compared against
the non-optimized sequential simulation. This speedup behavior is also illustrated in Figure 6.1
with a blue colored plot. On the other hand, GPU implementation achieves two orders-ofmagnitude speedup when compared against the multi-core CPU implementation.

41

Figure 6.1 - Speedup behavior of the GPU implementation compared against the multi-core CPU
implementation (on 20 CPU cores) for different number of slices with varying number of particles.

Figure 6.1 shows the simulation speedup behavior of the GPU implementation for different
number of slices with varying number of particles. From Figure 6.1 and Table 6.1, we notice that
speedup of the GPU implementation increases near linearly up to one million particles and it
saturates beyond that. This behavior is independent of the number of slices considered in the
simulation. The reason for this behavior is that amount of thread-level parallelism offered by the
GPU implementation and the device utilization has a linear dependence on the number of particles
in the simulation. In other words, fewer number of particles per beam leads to a underutilized GPU
which results in poor to suboptimal performance, and the device utilization (or occupancy) grows
near linearly with the number of particles which leads to a proportional increase in the
performance. The current implementations on K40 GPU achieves full occupancy at approximately
one million particles, and any increase in the input size beyond this point results in a serialized
execution on the GPU, thereby deviating from the linear speedup growth. Note that the point of

42

saturation for a given implementation depends on the GPU and it often varies with each target
architecture.
Tracking Performance
The split execution time in Table 6.1 shows that tracking in the GPU implementation is
two to three orders of magnitude faster than the sequential implementation, and it is two orders of
magnitude faster than the multi-core CPU implementation. The main reason for such a large
performance gain is that the parallel implementation on GPU is highly optimized to take advantage
of the data parallel nature of 𝑇𝑟𝑎𝑐𝑘 procedure, whereas sequential and multi-core implementation
is a proof-of-concept code that is not optimized for performance. In particular, data parallelism in
𝑇𝑟𝑎𝑐𝑘 procedure is exploited in the GPU implementation by mapping the computation of particles
to parallel threads with one-to-one correspondence such that it minimizes both branch and memory
divergence on GPUs, which leads to the effective utilization of GPU resources. In addition, data
reuse is maximized by using shared memory to store the shared transport map. These performance
optimizations together with the massive parallelism offered by the GPU architectures results in the
large performance gain.

43

Collision Performance

Figure 6.2 - Execution time of 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 procedure in sequential (on a single CPU core), multi-core
CPU (on 20 CPU cores), and GPU (on Tesla K40) implementation for varying number of particles
with the number of slices fixed to m = 6.
Figure 6.2 illustrates the execution time of the 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 procedure in both sequential and
GPU implementations from Table 6.1 for varying number of particles. We notice that the collision
time in the sequential implementation is proportional to the number of particles in the simulation,
in other words, collision time in sequential code grows linearly with the input size. This behavior
of the sequential code is expected, as the number of operations (floating-point and integer)
involved in the 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 procedure is proportional to the number of particles. On the other hand,
collision time in the GPU implementation exhibits a non-linear behavior with the number of
particles. The reason for this behavior is that the GPU implementation simulates the collision
between two input beams by executing a slice-to-slice collision on a subset of slices at a given
time, where the number of threads, operations and data parallelism used on GPU is proportional
to the number of particles involved in the current set of colliding slices. This number typically

44

depends on the particle distribution, and it varies from slice to slice and from turn to turn. As a
result, when there are fewer number of particles in the colliding slices, it leads to a underutilized
GPU, and the utilization improves as the number of particles increase. For example, for collision
in Figure 2.2 each row represents the collision on a subset of slices that is executed on GPU in
parallel where the performance depends on the number of particles involved in each row. It is
evident from figure that the number of particles participating in the collision increases from the
top to the center, and then decreases from the center to the bottom. In other words, occupancy and
device utilization starts with a minimum value at the top row and increases as we move to the
center row, and then it starts to decrease from the center to the bottom row. This variation in the
utilization results in the non-linear increase in the execution time on GPU.

Table 6.2 - Single turn performance of the sequential, multi-core CPU, and GPU implementation
of COLLIDE procedure in the beam-beam effects simulation for different input configurations.
Table 6.2 illustrates single turn performance of the sequential (on a single CPU core),
multi-core CPU (on 20 CPU cores), and GPU implementation (on K40 GPU) of 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 procedure
in the beam-beam effects simulation for different input configurations. The results indicate that,
depending on the number of particles and slices, GPU implementation of 𝐶𝑜𝑙𝑙𝑖𝑑𝑒 procedure
delivers a speedup gain of up to 98X and 11X when compared to non-optimized sequential and
multi-core CPU implementation, respectively.

45

Table 6.3 – Performance results of 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 kernel with and without maximum registers
per thread limitation.
Table – 6.3 illustrates the performance of 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 Kernel on NVIDIA Tesla K40
GPU for a different number of input particles and with/without the maxregcount option given
during compilation. In NVIDIA CUDA Compiler, using the maxregcount option we can limit the
number of registers used by a thread to a particular number. These performance metrics are
extracted from NVIDIA Profiler.

46

Analysis Using Roofline Model

Figure 6.3 – Roofline model analysis for 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 kernel on NVIDIA Tesla K40 GPU.
Figure 6.3 shows the Roofline model for K40 GPU. The graph is on a log-log scale. The
y-axis is attainable double-precision floating-point performance in units of Gflops/Sec, and the xaxis is arithmetic intensity, varying from 0.125 Flops/DRAM byte-accessed to 32 Flops/DRAM
byte-accessed. The system being modeled has a peak double precision floating-point performance
of 1.4 Tflops/sec and peak memory bandwidth of BWTheoretical-Peak = 288 GB/Sec from hardware
specifications. The black solid line in Figure 2 indicates the bandwidth ceiling for BWTheoretical-Peak.
However, the peak theoretical bandwidth is often unachievable in practice. So, in order to analyze
the performance more accurately, we measure the experimental memory bandwidth using the
benchmarks from NVIDIA's official SDK [19]. Experimental memory bandwidth for K40 is
calculated to be BWExperimental-Peak = 200 GB/sec, and its bandwidth ceiling in roofline model is

47

shown using the blue solid line plot. The black vertical line indicates the theoretical arithmetic
intensity of 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 kernel.
The theoretical arithmetic intensity of the kernel is around 17 Flops/byte. This indicates
that Collision is a compute bound kernel. This is because only the dimensions of the particles,
Mean and Standard Deviation of the opposite slice are loaded from the memory in the beginning
of the kernel and later there are no other memory accesses performed in between the computations.
Our implementation has achieved a performance of around 210 GFlops/sec. This poor performance
of the kernel is because of the local memory overhead and Warp Divergence happening inside the
kernel. In addition to these two metrics all the other metrics from Table 6.3 are discussed below.
Occupancy
We notice that when maxregcount option is disabled, the number of registers used by each thread
are 82 where as the ideal number lies around 32 for K40 GPU. Due to this excessive use of
registers, each SM is limited to executing a lower number of blocks simultaneously resulting in
low occupancy of GPU. Because of this, the kernel is able to achieve a peak occupancy of only
0.3 and a peak arithmetic intensity of 0.68. When the number of registers used by each thread are
limited to 48, the kernel has achieved a peak occupancy of 0.62 and a peak arithmetic intensity of
1.05. Also, we observe that occupancy of the kernel increases with increase in a number of input
particles. Hence, occupancy is one of the major contributors of kernel’s performance.

48

Local Memory Overhead
We notice that when the maxregcount option is disabled, there is no local memory
overhead. But, when we limit the number of registers per thread to 48, SM runs out registers and
starts spilling into Local memory resulting in increased memory traffic. Even though the kernel
time here is less when compared to the kernel time for which the maxregcount option is disabled,
the local memory overhead remains one of the main factors of the kernel’s poor performance in
this case.
Control-Flow Divergence
Table 6.3 illustrates the Warp Execution Efficiency for different input configurations. We
observe that the Warp Execution Efficiency remains constant and is only 40%. The reason for this
poor efficiency is, that we used an existing algorithm for the complex error function in our GPU
implementation. We extracted some information from NVIDIA Profiler that shows the information
about the divergence happening inside the error function. We see that in Figure 6.4 at line 1195
and 1196 of Collide.cu file which has the source code of 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 kernel, 100% of the
threads are executing the error function (WOFZ). Then, in Figure 6.5, inside the error function, at
line 49, we observe that 96% of threads are active, and in Figure 6.6 at line 60, out of that 96%,
only 57% of threads are executing the for-loop. Later, in Figure 6.7, the else condition at line 73
is executed by remaining 4% of threads and all the other 96% of threads are inactive during this
time. As the computations inside the error function are the major contributors for the overall
computations inside the 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 kernel, the Control-Flow Divergence inside the error
function is one of the main reasons for the poor kernel performance.

49

Figure 6.4 – All the threads are calling error function (WOFZ) at lines 1195 and 1196.

50

Figure 6.5 – 4% of inactive threads at line 49.

Figure 6.6 – Out of 96% active threads, 43% of threads are inactive inside the for-loop.

51

Figure 6.7 – 92% of threads are inactive at line 73.

Memory Performance
The threads inside the 𝐶𝑜𝑚𝑝𝑢𝑡𝑒 − 𝐾𝑖𝑐𝑘 kernel access the global memory only once during
the beginning to fetch the particle information. Except these, there are no other global memory
fetches performed by the threads inside the kernel. The initial fetches from global memory are
perfectly coalesced because of which we observe a global load efficiency of 88% from Table 6.3.
Also, the ideal number of global transactions per request is 2 for 8-byte words and we achieved
1.42 which is much closer to the ideal value. This shows that the kernel performs very well in
terms of accessing global memory efficiently and is one of the positive contributors for kernel’s
performance.
From the above analysis, it is clear that the performance of the kernel is limited by the
number of registers that each thread is using inside the kernel. Limiting the number of registers

52

increases the occupancy decreasing the time taken by the kernel. But, as SM runs out of registers
and spills the variables into local memory, the traffic created due to the local memory access
remains one of the limiting factors of kernel’s performance. In addition to the Local Memory
overhead, Control-Flow-Divergence also remains as the limiting factor of kernel performance.

VI.2 MULTI-GPU PERFORMANCE
We performed experiments on Multiple-GPUs to see how the Multi-Bunch simulation code
scales with the number of GPUs on a cluster. Table – 6.4 illustrates the performance of Multi-GPU
algorithm when the number of GPUs and bunches are increased in the powers of 2. All the
performance results reported for Multi-GPU algorithm are for one iteration / single schedule, 100K
Particles per Bunch, and 3 Slices. We observe that except for a small number of bunches, the
Multi-GPU algorithm scales nearly linearly with the number of GPUs. The reasons for the nonlinear behavior for a smaller number of bunches and near linear behavior for higher bunch numbers
are discussed below.

Table 6.4 – Performance of Multi-GPU algorithm on a cluster of GPUs. GPUs and bunches are
increased in powers of 2.

53

Figure 6.7 – Execution schedule formed by the scheduling algorithm when there are 2 GPUs and
4 Bunches (3e bunches, 4p bunches).

Figure 6.8 – Time slots required to complete all the interactions of a schedule on a Single GPU
when there are 3e and 4p bunches.

Reason 1 - The number of time slots required to complete the interactions does not linearly
decrease with the number of GPUs. For example, in Figure - 6.8 shows the number of time slots
and the total time required to complete all the interactions on a Single GPU when there are 4
bunches. When there are 4 bunches, there is a total of 12 interactions in a schedule. Assuming that
the total time taken to complete an interaction on a single GPU when both e- and p-beams are
active is 't', it takes a total time of 12t to complete a single schedule on a single GPU.

54

Figure 6.9 - Time slots required to complete all the interactions of a schedule on two GPUs when
there are 3e and 4p bunches.

Figure - 6.9 shows the number of time slots and the total time required to complete all the
interaction on 2 GPUs when there are 4 bunches. Now each GPU has to handle the work of 2 ebeams and 2 p-beams. If we look at Figure - 6.7 which is execution schedule formed, at first GPU
1 will handle the execution of the beams 1e and 2p and at the same time, GPU 2 will handle the
execution of the beams 3e and 4p. As shown in Figure - 6.9, the time taken for these two
interactions which are occurring in parallel is t. Now, for the interaction (2e, 3p), GPU 1 will
handle the execution of 2e and GPU 2 will handle the execution of 3p. As the GPUs here are not
handling the execution of both e- and p-beams, the time required here is only t/2 as shown in Figure
- 6.9. Also, as both the beams are residing on different GPUs, there is an extra time taken to
exchange the parameters between the GPUs which is called the communication time c. So the total
time taken here is t/2 + c/2. In a similar way, both GPUs will execute all the interactions according
to the execution schedule and the total time taken for all these interactions is 7t + 3c. Here, we
observe that even though we have 2 GPUs, the execution time is not exactly half of the total time
taken when we have only 1 GPU because of the dependency on previous turn which is indeed the
nature of the problem.
Reason 2 - Most of the times, a GPU has to communicate with other GPUs to get the
parameters needed for applying effects on the beam. For example, in figure 6.9, during the 1st time

55

slot of Turn-3, GPU 1 and GPU 2 are communicating with each other to send and receive the
parameters needed to apply effects on their respective beams. Hence, there is an extra overhead
added in the form of communication time which is the reason for near-linear speed up of MultiGPU algorithm.
Reason 3 - The computations performed in the error function are different for each particle.
Hence a GPU might have to wait for the other GPU until it finishes the execution of error function
and send the required messages.
Timelines similar to Fig - 6.8, 6.9 are shown in the Figures - 6.10, 6.11, 6.12, but only for
a single and random turn when there are 8 Bunches and the figures are for 1, 2, 4 GPUs
respectively. From these Timelines, we observe that the time taken for single turn scales nearly
linear with the number of GPUs.

Figure 6.10 - Time slots and total time required to complete all the interactions of a single turn on
a single GPU when there are 7e and 8p bunches.

Figure 6.11 - Time slots and total time required to complete all the interactions of a single turn on
2 GPUs when there are 7e and 8p bunches.

56

Figure 6.12 - Time slots and total time required to complete all the interactions of a single turn on
4 GPUs when there are 7e and 8p bunches.

Time Slots
For any given number of bunches 𝑛¥ and given number of GPUs 𝑔, the number of time
slots required to complete the schedule can be calculated by the equation,
𝑇𝑖𝑚𝑒 𝑠𝑙𝑜𝑡𝑠 = 𝑛¥ (𝑛¥ /𝑔) for 𝑛𝑏 > 1
𝑇𝑖𝑚𝑒 𝑠𝑙𝑜𝑡𝑠 = 𝑛¥ ((𝑛¥ − 1)/𝑔) for 𝑛𝑏 = 1

(8)

where 𝑛¥ /𝑔 is the number of timeslots required to complete a single turn and 𝑛¥ is the number of
turns required to complete a schedule/one-iteration. It is to be noted that the number of turns
required to complete an iteration is always equal to number of p bunches (as e bunches are always
𝑛¥ − 1).
Communications between GPUs
When there are 𝑔 GPUs and 𝑛¥ bunches (𝑛¥ − 1 e-bunches, 𝑛¥ p-bunches), each of the
𝑛¥ − 1 e-bunches will interact with all the 𝑛¥ p-bunches. Each GPU has 𝑛¥ /𝑔 (𝑛¥ >> 𝑔) bunches
stored in it. So as each bunch has to interact with all the other bunches, a GPU has to communicate
𝑛¥ – (𝑛¥ /𝑔) times to execute all the interactions related to that particular bunch. In the same way,
for all the bunches stored on a GPU, it has to communicate (𝑛¥ /𝑔)(𝑛¥ − 𝑛¥ /𝑔) times with other
GPUs to execute all the interactions that are stored on it. On the whole, when all the GPUs are

57

considered, a total of 𝑛¥ (𝑛¥ − 𝑛¥ /𝑔) communications are required in a schedule / one-iteration.
But, as all the interactions in a particular time slot are happening in parallel, the communications
required during the interactions in a time slot also happen in parallel. In a particular time slot, we
observe that (refer to Figure 6.7) either all or none of the GPUs communicate with other GPUs to
execute the interaction. Hence the number of communications (ignoring the communications
happening in parallel) are less than the number of time slots required to execute a schedule and we
observe the ratio of the number of time slots to communications is always equal to 𝑔/(𝑔 − 1).
Hence the number of communications (ignoring the communications happening in parallel)
required to execute the schedule / one-iteration is given by
𝐶𝑜𝑚𝑚𝑢𝑛𝑖𝑐𝑎𝑡𝑖𝑜𝑛𝑠 𝑅𝑒𝑞𝑢𝑖𝑟𝑒𝑑 = 𝑛¥ q (𝑔 − 1)/ 𝑔q

(9)

Predicting the time required for given number of GPUs and Bunches
The time taken by the Multi-GPU algorithm for given number of bunches 𝑛¥ (> 1) and
GPUs 𝑔 is resolved by the Equation 10 which is the addition of time(𝑡) required by all the time
slots (Equation – 8) and the time(𝑐) required for all the communications (Equation – 9).
𝑇𝑜𝑡𝑎𝑙 𝑇𝑖𝑚𝑒(𝑇_Ä ) =

_Ä „
Å

(𝑡 +

ÅfU
Å

𝑐)

(10)

Table 6.5 – Comparison between Actual time taken on 7 GPUs and the predicted time on 7 GPUs
when there are bunches that are multiples of 7.
Table 6.5 illustrates the comparison between the actual timings and the predicted timings
(using equation 6.3) of the Multi-GPU algorithm. For these predictions, we have taken bunches in
the multiples of 7 and fixed the number of GPUs to 7. We observe that the predicted timings are
almost closer to the actual timings. The error between those two timings is mainly because the

58

computations inside the error function are different for different particles. So the time taken by
each collision is not exactly the same. But while predicting these timings we assumed a fixed
time (𝑡) of 22.5 msec for each interaction on Tesla K40 Architecture. Also, according to our
experiments, the average time taken (𝑐) for each communication is 1.2 msec which also often
varies in practice depending on the distance between the nodes in cluster. But to predict the timings
more accurately, we used Least Squares method to find the best fit of 𝑐 for a given number of
nodes on a cluster. In this way, by knowing the average time taken (𝑡) for an Interaction on a
particular GPU, we can calculate the communication time for any cluster size to predict the total
time taken taken for a given number of bunches more accurately.
When we apply the Least Squares method to Equation 10 and apply differentiation on it
with respect to 𝑐, we get the equation below which we can use to find 𝑐.
_Ä 2(𝐴_Ä

− 𝐵_Ä 𝑐)(−𝐵_Ä ) = 0

where 𝐴_Ä = 𝑇_Ä −

_Ä „
Å

𝑡 and 𝐵_Ä =

(11)
_Ä „ ÅfU
Å„

𝑐

When we substitute the bunch numbers (𝑛¥ ) from 7 to 56 in the multiples of 7, we get the 𝑐 = 1.66
msec as the best fit of communication time. Below are the predicted results when the inputs to the
equation 10 are 𝑡 = 22.5 msec and 𝑐 = 1.66 msec.

Table 6.6 – Comparison between Actual time taken on 7 GPUs and the predicted time using Least
Squares method on 7 GPUs when there are bunches that are multiples of 7.
Hence, when we have a new cluster setup potentially with different GPU architecture. We
first calculate the time taken (𝑡) for an interaction for that particular input configuration and then

59

run some experiments with different bunch numbers (𝑛¥ ) to find out the best fit of communication
time (𝑐) for Equation 11.

60

CHAPTER VII
CONCLUSION AND FUTURE WORK
VII.1 CONCLUSION
We presented a high-fidelity, high-performance parallel model for simulation of beambeam effects in particle colliders using GPUs. This pioneering implementation on modern GPU
architectures results in orders-of-magnitude speedup over its serial version, thereby bringing the
previously intractable physics within reach for the first time. The parallel implementation of this
simulation model on NVIDIA Tesla K40 GPU outperforms the non-optimized sequential
simulation and it delivers as much as three orders-of-magnitude reduction in computation time.
The development of this advanced new simulation tool will enable carrying out a truly long-term
simulations spanning 400 million turns, which in case of the proposed electron-ion collider JLEIC
is on the order of an hour of machine operation. This will facilitate fine tuning the collider
parameters for more efficient operations which will lead to substantial savings in the design and
operation of these expensive machines. Below is the summary of our contributions in this thesis.
•

Implemented the simulation algorithm for beam-beam effects when the particles collider
carries one or more bunches.

•

Achieved an overall speedup of around 880X with our GPU implementation of SingleBunch beam-beam simulations when compared to the existing sequential implementation.

•

Implemented a Multi-GPU algorithm for Multi-Bunch beam-beam simulations with a
minimal data movement between the GPUs.

•

Our Multi-GPU implementation of Multi-Bunch beam-beam simulations achieved a nearly
linear speedup with number of GPUs on a cluster.

61

VII.2 FUTURE WORK
In the Future, we plan to address the following:
•

Analyze the computations happening inside the error function.

•

Reduce the control-flow divergence occurring inside the error function.

•

Minimize the local memory overhead by reducing the register usage.

•

Use OpenMP to taken advantage of Multiple Cores present in the same node to
parallelize the CPU computations between kernels.

62

REFERENCES
[1] NVIDA “Cuda Programming Guide”. Available via http://docs.nvidia.com/cuda/cuda-cprogramming-guide.
[2] Bassetti, M., and G. Erskine. 1980. Technical report, CERN Report No. CERN-ISRTH/80-06.
[3] Bell, N., and J. Hoberock. Thrust library for GPUs. Available via
http://docs.nvidia.com/cuda/.
[4] Furman, M. 1991. In Particle Accelerator Conference, pp. 422.
[5] Harrison, M., T. Ludlam, and S. Ozak. 2003. Nuclear Instruments and Methods in
Physics Research Section A vol. 499, pp. 235–244.
[6] J. L. Abelleira Fernandez et al. 2012. Journal of Physics vol. G 39 (075001).
[7] Makino, K., and M. Berz. 2002. Nuclear Instrumentation and Methods in Physics
Research Section A vol.427, pp. 338.
[8] NVIDIA. Next Generation CUDA Compute Architecture: Kepler GK110. Available via
https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-ArchitectureWhitepaper.pdf.
[9] Poppe, G. P. M., and C. M. J. Wijers. 1990. ACM Transactions on Mathematical
Software (TOMS) vol. 16,pp. 38–46.
[10] Qiang, J., R. Ryne, and M. Furman. 2002. Physical Review Special Topics Accelerators and Beams vol. 5 (104402).
[11] Roblin, Y., V. Morozov, B. Terzić, M. Aturban, D. Ranjan, and M. Zubair. 2013. In 4th
International Particle Accelerator Conference (IPAC), pp. 1064 – 1066.
[12] S. Abeyratne et al. 2012. arXiv:1209.0757.

63

[13] S. Abeyratne et al. 2015. arXiv:1504.07961.
[14] Schulte, D. 1996. Ph.D. thesis, University of Hamburg. TESLA-97-08.
[15] Terzić, B., A. Godunov, K. Arumugam, D. Ranjan, and M. Zubair. 2015. In 12th
International Computational Accelerator Physics Conference (ICAP’15), pp. 40–43.
[16] Terzić, B., I. Pogorelov, and C. Bohn. 2007. Physical Review Special Topics Accelerators and Beams vol.10 (034201).
[17] Yokoya, K., and H. Koiso. 1990. Particle Accelerators vol. 27, pp. 181.
[18] MPICH. Available via https://www.mpich.org/
[19] NVIDIA, “CUDA Bandwidth Test.” [Online]. Available:
http://docs.nvidia.com/cuda/cuda-samples/#bandwidth-test

64

VITA
Naga Sai Ravi Teja Majeti
Department of Computer Science
Old Dominion University
Norfolk, VA 23529

EDUCATION
M.S. in Computer Science, Old Dominion University, 2017
B.Tech. in Electrical and Electronics Engineering, JNTU Kakinada, India, 2012

EMPLOYMENT
Jan 2016 – May 2017

Graduate Research Assistant
Old Dominion University, Norfolk, VA, USA.

Aug 2015 – Dec 2015

Graduate Teaching Assistant
Old Dominion University, Norfolk, VA, USA.

Jan 2013 – July 2015

Assistant Systems Engineer
Tata Consultancy Services, Bangalore, KA, India.

