Computing techniques by Buffat, X.
Proceedings of the 2018 CERN–Accelerator–School course on
Numerical Methods for Analysis, Design and Modelling of Particle Accelerators, Thessaloniki, (Greece)
Computing techniques
X. Buffat
CERN, Geneva, Switzerland
Abstract
This lecture aims at providing a user’s perspective on the main concepts used
nowadays for the implementation of numerical algorithm on common com-
puting architecture. In particular, the concepts and applications of Central
Processing Units (CPUs), vectorisation, multithreading, hyperthreading and
Graphical Processing Units (GPUs), as well as computer clusters and grid
computing will be discussed. Few examples of source codes illustrating the
usage of these technologies are provided.
1 Introduction
Nowadays several problems can be addressed only via numerical algorithms. The study of particle ac-
celerators is clearly among the many fields relying heavily on highly performing computers to execute
various types of numerical algorithms for design, measurement, correction and operation of the machines
and their individual components as described in the various contributions to these proceedings. Concep-
tually, all these algorithms consist of a series of operations to be executed on a given dataset. In practice,
the number of these operation as well as the size of the dataset is limited by the available technolo-
gies. Here we aim at describing the main concepts that are needed to understand the current technologies
commonly used for the implementation of numerical algorithms, with emphasis on the accelerator world.
This field is rather wide, this lecture is mainly designed for beginners in computing wishing to broaden
their knowledge in view of using more efficiently their personal computers or to investigate which high
performance computing technologies would be most suited for a given application.
In Sec. 2, we describe the main components of the Central Processing Unit (CPU) of the most
common computer, i.e. based on transistors, and the basics of the corresponding nomenclature. The
possibility to combine the power of multiple CPUs to achieve a common task is discussed in Sec. 3. Sec-
tion 4 describes the improvements of the performance that can be expected using parallelism. We’ll then
consider an evolution of the CPU available commercially, aiming at massively parallel computations, i.e.
the Graphical Processing Unit (GPU) (Sec. 5). Eventually, the combination of multiple computers to
achieve large tasks will be discussed in Secs. 6 and 7, with the concept of computer clusters, volunteer
computing and computing grids.
2 The central processing unit
From the user’s point of view, a CPU can be viewed as a device capable of performing a given set of
operations on a given data set. The CPU is operated in cycles during which one of the instruction from
the set is executed, as illustrated in Fig. 1. The main characteristic of the CPU is then the number of
instructions that can be performed per unit time, usually referred to as the clock speed, which nowadays
is limited to a few GHz.
Most CPUs implement mathematical operations ranging from the simple arithmetic or comparison
operators to more complex functions such as trigonometric or exponential functions. The capabilities of
a given CPU are described by the instruction sets that they implement. Most of the CPUs in nowadays
personal computers implement the so-called x86-64 instruction set featuring the operations required for
most applications in accelerator physics, more generally for scientific computing. On the other hand, the
CPUs in smart phones or tablets implement other sets of instructions, making them less performant for
scientific computing.
Available online at https://cas.web.cern.ch/previous-schools 1
ar
X
iv
:2
00
6.
10
66
4v
1 
 [p
hy
sic
s.a
cc
-p
h]
  1
8 J
un
 20
20
Fig. 1: Sketch of the main steps of the execution of a CPU cycle. A given list of instruction is generated by a
compiler or an interpreter based on the source code. At each step, a given instruction and the corresponding data
are loaded in the registers (solid lines), the operation is executed by the CPU, and the result is copied back into the
memory or the next instructions are modified depending on the type of instruction executed (dashed lines).
Each instruction applies to a given data type. This data has to be stored in a dedicated memory on
the CPU, called the register. During each of its cycles, the CPU will perform a given instruction to the
data stored in the register. We note that the instruction itself is also encoded in another dedicated register
on the CPU. For example, the operation could be the replacement of two double precision numbers each
represented by 64 bits by their sum, also represented by 64 bits. To execute such an operation, the register
has to be of at least 128 bits. The size of the register is therefore an important characteristic of the CPU,
it will be further discussed in the next section.
While the CPU instructions can be addressed directly even in high level programming languages,
via so-called intrinsics [1], their direct usage is rather uncommon. The building of the list of instructions
to be executed by the CPU is usually done by a compiler or an interpreter, based on a human readable
source code. The role of the compiler or the interpreter is therefore crucial in the performance of a given
computer code.
For completeness we note that, on top of the mathematical operations, the capabilities of the
CPU include the transfer of the data from the computer’s memory to the register and back, as well as
the modifications of the following instructions that will be assigned to it in the next cycles (e.g. an if
statement leading to the execution of one or another part of the code, i.e. another list of instructions).
2.1 Vectorisation
Above we took the example of the addition of two numbers represented by 64 bits each, thus requiring
at least a 128 bits register. For many implications, the same operations has to be performed on several
elements of an array, in that case an improvement of the speed of the execution by a factor Nv can be
obtained by loading Nv pairs of numbers in the register and perform the addition of all the pairs in one
cycle. To allow for such an operation, the size of the register has to be at least Nv·128 bits. This form of
parallelism is known as vectorisation.
The vectorisation capabilities are evolving significantly nowadays, e.g. the Intel Xeon Phi features
a 512 bits register, thus allowing for an eight-fold vectorisation for double precision operations. The vec-
torisation capability of a given CPU can be inferred from the standard instruction sets that it implements:
AVX and AVX-2 correspond to 256 bits registers and AVX-512 to 512 bits [1].
The code developer can access vectorised instructions via the corresponding intrinsics, however
the most common usage is to rely on the compiler or the interpreter to perform the right choice of instruc-
tions. For most compilers the vectorisation is not enabled by default. For gcc, the flag -ftree-vectorize
2
(implied by the optimisation flag -O3) enables vectorisation [2]. However the capability of the compiler
to recognise vectorisable loops is limited in particular for complicated ones. Thus, while this parallelisa-
tion does in principle not require any modification of the source code, it favours the implementation of
simple loops for an efficient execution of the heavy tasks.
3 Multiple CPUs
The vectorisation goes under the classification of Single Instruction, Multiple Data (SIMD) parallelisa-
tion, which are mostly suited for homogeneous tasks to be performed on large datasets. In many cases
however the tasks are heterogeneous and such a parallelism becomes insufficient. In such cases, it is
more suited to perform Multiple Instruction on Multiple Data (MIMD). On common computers, this is
achieved by affecting multiple CPUs to a given task in an organised way, using the concept of threads.
3.1 Multithreading
So-called serial codes considered in the previous section feature a single sequence of instructions to be
performed by a single CPU. In order to allow for multiple CPUs to perform different tasks simultaneously
one requires separate lists of instructions for each of the CPUs, so-called threads. A certain level of
synchronisation and communication between the threads is ensured via an access to a shared memory
which is discussed in more detail in the next section. Conceptually, the number of threads is independent
of the number of CPUs on which the program is executed. Multithreaded source codes can in principle
be written without any knowledge of the architecture. If it occurs that there are more threads than CPUs,
the threads will share the available resources, having consequences on the memory, and consequently on
the performance (Sec. 3.2). Since the operating systems is responsible for the allocation of the resources
to the different threads, his role can be crucial in the performance of a multithreaded code. In some
cases, a dedicated setting up of the operating system can be necessary, an example is discussed in the
next sections.
An example of C source code featuring multithreading using POSIX threads [3] is shown in
Sec. A.2. This simple example does not demonstrate the flexibility of POSIX threads, yet it features
the two most basic elements, i.e. the creation and synchronisation of the threads. At their creation (lines
59 to 62), the four additional threads will execute independently, possibly on different CPUs if available,
until their synchronisation (lines 65 to 68). This parallelisation is highly efficient since there is no copy
of the memory, i.e. all of the threads act on the same memory. This efficiency comes with the responsi-
bility to ensure that the threads do not interfere with each other, through the common memory. This can
be achieved even for complex tasks through several concepts such as critical sections, reductions, atomic
operations, locks or barriers. These concepts are detailed for example in [4].
While the POSIX threads allows for the implementation of various complex tasks, their usage
requires the usage of a given syntax in the source code which differs significantly from the equivalent
serial code. The Application Programming Interface (API) OpenMP [5] is particularly handy for the
parallelisation of loops using multithreading without effort on the source code. An example of C source
code parallelised with OpenMP is shown in Sec. A.1. This code is in fact a serial code, for which a single
statement has been added (lines 33-35). This so-called pragma instructs the compiler to parallelise the
following loop using multiple threads. This method effortlessly generates a significant speed-up for most
applications even on a common desktop computer featuring few CPUs. For more demanding applica-
tions, high performance computers featuring few tens of CPUs with a shared memory are commercially
available nowadays.
In the accelerator world, this type of parallelism is not only used for scientific computing but is
also particularly suited for controls software. The independent threads can be dedicated to the control of
several individual systems interacting together through a single application.
3
Fig. 2: Topology of the cache memories on a machine with two sockets each featuring two NUMA zones with 8
cores and four levels of cache. Courtesy [6].
3.2 Cache memories
When increasing the number of CPUs, the performance of multithreaded codes can become limited by
the transfer of data between CPUs, as it is technologically challenging to implement large memories close
to the CPU, i.e. with a fast interconnection. In order to maximise the efficiency of the data transfer to
the CPUs’ registers, cache memories are implemented close to the CPUs, thus allowing for less frequent
transfers to the main computer memory which are significantly slower. In fact, the caches are sub-divided
in various levels, from small and close ones to larger and further ones. The example of such a hierarchical
implementation is shown in Fig. 2.
While the optimisation of a code for a given architecture is beyond the scope of this lecture, it is
good to keep in mind that the strategy of the allocation of the memory has an impact on the performance.
A simple example: for applications in which each thread acts mainly on a given fraction of the dataset,
such as the application illustrated in Sec. A, it is advantageous for the data to be allocated close to the
CPU executing the thread. Consequently, the performance is expected to be optimal if the execution
of the threads are constrained to a given CPU. This is not the behaviour by default for most operating
systems, but can be setup when needed.
3.3 Simultaneous multithreading (Hyperthreading)
The time required for the transfer of the data between the CPU registers and the memory (i.e. either the
caches or the main memory) can be significantly longer than a CPU cycle. During this time, the CPU
is stalled for potentially several cycles. In order to profit from this downtime, modern CPUs feature
additional registers allowing them to hold the data required for the execution of two threads, yet a single
instruction can be executed at once for one of them. The so-called simultaneous multithreading is mostly
known under the name of Intel’s implementation: Hyperthreading. While each CPU equipped with this
technology appears as a separate core from the operating system point of view, the performance is usually
between the equivalent of 1 to 2 separate CPUs. For example, the code shown in Sec. A.1 executed in
277 s serially and in 15 s with OpenMP enabled with 24 threads executing on 12 2.3 GHz Intel Xeon
CPUs with hyperthreading enabled, thus generating a speed-up of ≈18.
4
Fig. 3: Illustration of the saturation of the speed up that can be achieved by parallelisation due to the fraction of
non-parallelisable code, namely Amdhal’s law. Courtesy [7].
4 Amdhal’s law
Leaving aside memory limitations such as those discussed in the previous section, the execution of a
parallelised code remains limited by its fraction of non-parallelisation code. If we denote this fraction
of the execution α, only the execution of the remaining fraction (1 − α) is reduced to (1 − α)/Nc by a
parallelisation with Nc cores. Thus we can write the total speed up [8]:
S =
1
α+ 1−αNc
. (1)
This so-called Amdhal’s law quantifies the saturation of the performance of parallelised codes due to the
non-parallelisable part (Fig. 3). It is useful on one side to estimate the resources worth investing into the
solution of a given problem as well as for diagnosing a given implementation of a parallelised code.
5 The graphical processing unit
The vectorisation (Sec. 2.1) is particularly suited for the processing of images and videos, where the
processing of large amounts of quasi-independent data is required, leading to the development of a ded-
icated technology, the GPU. This technology combines large vectorisation capabilities with concepts
similar to multithreading and hyperthreading to achieve a number of operations per second that currently
exceeds those of CPUs by one to two orders of magnitude. This technology is therefore suited for other
applications featuring a similar level independence in the data as in image processing.
While some technologies enable the usage of GPUs without effort from the developers point of
view (e.g. see Sec. A.4 or via the usage of libraries implementing GPU acceleration), an efficient usage
of the GPU requires some understanding of its operation and often the usage of dedicated languages.
As the architecture of a GPU differs significantly from the CPU, they feature specific instruction
sets. Whereas instruction sets are quite standard on CPUs for desktop computers, there exist a variety of
5
Fig. 4: Illustration of the operation of a GPU. The execution of a kernel on a GPU card consists first of the loading
of the instructions as well as the corresponding data to the card memory (solid black line). Then the threads are
loaded in the SMPs caches and executed (here illustrated by one yellow arrow per SMP, each executing the threads
in a given thread block). Finally the data is transferred back from the card memory to the host memory (dashed
black line).
GPUs optimised for different purposes. Mostly due to their optimisation towards single precision oper-
ations, cards for graphics are not optimal for scientific computing applications, which are usually based
on double precision computations. Cards dedicated to scientific computing are available commercially,
often referred to as General-Purpose GPU (GPGPU). They can be recognised by the absence of video
port. The number of single and double precision floating point operations per second (FLOPs) is usually
provided by the manufacturer allowing for comparison between models.
The GPU is usually implemented as a separate card with its own memory. It is operated by the
host, i.e. the computer equipped with CPUs, as illustrated in Fig. 4. The threads on a GPU are grouped
in blocks on a grid. This structure of the threads is reflected on the memory, since only the threads in the
same block have access to a shared memory.
The processing units in a GPU are called Streaming Multiprocessors (SMPs), each are composed
of multiple processing units featuring high vectorisation capabilities, such that 32 or 64 threads can be
executed simultaneously. Following NVIDIA’s nomenclature we’ll refer to the set of threads executed
simultaneously by a single unit of the SMP as warps. The warps are executed independently by the
computing units of the SMPs. There is a strong analogy between the execution of threads on a CPU
and of warps (i.e. sets of threads) on a GPU, except that communication through a shared memory
(synchronisation, locks, etc.) is possible only for threads within the same block. Only recent GPU
architectures allow for block synchronisation [9].
Analogously to hyperthreading, the SMPs feature a large amount of registers allowing for the
storage of multiple warps (and multiple blocks) and thus fast switching between the execution of different
warps during stalls.
This architecture has some implications for the developer:
– Minimum data transfer between the GPU and the host memories is advisable.
– The GPU performs optimally for independent identical tasks to separate datasets, at least within
thread blocks. In other words, the divergence of the execution of the threads (e.g. with if state-
ments) within thread blocks and the communication between threads in different blocks should be
6
minimised.
– Optimally the number of threads in each block should be a multiple of the warp size.
– Blocks featuring as many threads as possible are optimal, unless the corresponding number of
blocks drops below the number of SMPs, then it is advisable to adjust the block size such that all
SMPs can be active simultaneously.
There exists multiple languages to write GPU compatible codes. For example OpenCL is widely
used to develop portable codes. For NVIDIA cards the corresponding proprietary language CUDA is
most popular. The main difference with programming languages for CPUs is the introduction of another
device with it own memory and its own source code: the kernel. The allocation of the space on the GPU
memory as well as the transfer of the data and the execution of the kernel are illustrated with an example
in Sec. A.3. We note in particular the need to specify the number of blocks in the grid as well as the
number of threads in each of the block.
As for CPUs, the clock speed and the memory size are important characteristics of the GPU to
consider. Whereas the first is usually lower than those of CPUs, the GPU is meant to handle large
datasets and thus feature significantly larger memories.
6 The computer cluster
Computer clusters are commonly used for problem sizes beyond the capabilities of single computer. They
combine the computing power of a set of individual computers, usually called nodes in this frame, by
implementing a fast communication network between them thus allowing for a distribution of the tasks.
Nowadays the computer clusters are usually based on Gbit/s second data transfer. These fast networks
links remain ≈100 slower than CPU communication with its caches, such a technology is therefore
appropriate only when single nodes are not sufficient. Clearly, large data transfer between nodes has to
be minimised when using such a technology.
Today the most performing computer cluster is composed of 4806 individual nodes each featuring
2 CPUs and 6 GPUs to reach ≈200 PFLOPs (theoretical peak) [10, 11]. However highly performing
clusters are not the most common, this concept is also widely used to implement smaller machine due
to it cost effectiveness. Indeed, a set of cheap desktop computers with a network switch are sufficient to
build an efficient computer cluster.
Similarly to multithreading, the computing load is shared between individual processes. Unlike
threads, these processes do not have access to a shared memory, since they could be executing on different
nodes. From the developers’ point of view, the communication between processes is handled via a
Message Passing Interface (MPI) [12], independently of the location of the execution of the processes,
i.e. whether they are on the same node or not. The MPI consists of a set of instructions that allows
communication between processes, such as transfer of data or synchronisation. An example is shown in
Sec. A.5. This example is limited to the sole usage of MPI, however multithreading or GPU acceleration
can still be implemented within each node.
6.1 Batch processing
Large computing resources such as computer clusters or computing grids (Sec. 7) are usually shared
between several users. In addition, the execution of heavy tasks on such machines are expected to last
long, possibly up to days or even month. To deal with those constraints, it is convenient to implement an
automated management of the resources. This is usually achieved using the concept of batch processing:
a given task such as the execution of a program with a given input file, is described to the batch system,
along with the resources required to perform the task, such as the time, the number of CPUs, the number
of GPUs or even the amount of memory. Based on these information, the batch system will allocate the
proper resources when available and execute the task. As opposed to the traditional usage of a personal
7
computer, the user only submits job descriptions to the batch system, the time of the execution is no
longer under his control. The allocation of the resources can then be subject to arbitrarily complex
scheduling schemes, e.g. featuring equal sharing between users or favouring some according to a given
payment scheme. An example of batch job description is shown in Sec. A.5.
7 Volunteer and grid computing
For applications which do not require fast communication between processes, yet a large computing
power is needed to perform a set of fully independent tasks, the usage of idling computers through a
web interface becomes interesting. Most volunteer computing projects are based on the Berkeley Open
Infrastructure for Network Computing (BOINC) [13]. Its role is to dispatch individual jobs through the
web on the heterogeneous hardware made available by volunteers who subscribed to it, and retrieve the
results of their execution. The main challenges linked to this type of computing is the portability of the
codes which have to run on several platforms, as well as the attraction of large amounts of volunteers,
which limits its usage to popular scientific projects. Due to these constraints, GPUs are often preferred
for such applications.
The concept of volunteer computing became popular with the project SETI@home (Search for
Extraterrestrial Intelligence) [14]. Today it is also used in the accelerator world, for example with the
LHC@home project [15], allowing for large scale single particle tracking studies at relatively low cost.
More generally, heavy tasks can be distributed over dedicated infrastructures, such as large com-
puter clusters located at various laboratories or universities contributing to a given project, rather than
only idling personal computers. For example, such a large scale computing grid was implemented in
order to store and treat the data generated by the LHC [16].
The interaction of the user with a computing grid is similar to the computer cluster described in
Sec. 6.1.
8 Summary
Due to the saturation of the CPU clock speed, high performance computing nowadays is achieved through
parallelisation of the tasks on several computing units. We may categorise the parallelisation techniques
in two main groups: SIMD and MIMD.
The most common technology implementing SIMD is multithreading (and similarly hyperthread-
ing). From a hardware point of view, this corresponds to the combination of the computing power of
multiple CPUs acting on a common memory. From a developer’s point of view, this requires the writing
of multithreaded codes, the usage of libraries with multithreading capabilities or the usage of compiler
options to automatically parallelise loops using multiple threads. The latter comes essentially without
cost on the code development side and allows for significant speed up even on common desktops. Addi-
tionally, commercial computers are available with few tens of CPUs, thus allowing for the treatment of
problem sizes about an order of magnitude larger with a single machine. This parallelism is well suited
for most of the algorithms used in accelerator physics. On the other hand, applications limited by mem-
ory transfer, encountered for example in the field of data analysis, may not profit from this parallelism.
GPUs, featuring a large memory bandwidth, are usually more suited then.
MIMD is implemented in most CPUs, through vectorised instructions, and in GPUs through
warps. The vectorisation on commercially available CPUs can reach up to 8 double precision opera-
tions performed in parallel and can be used without major modification to a regular serial code. SMPs
in commercially available GPUs allow for up to 64 operations in parallel. From the developer’s point
of view, the usage of GPUs usually requires an additional effort as it is based on the usage of specific
languages. Nevertheless, GPU accelerated libraries as well as compiler options to enable GPU accel-
eration of regular codes also exist. This type of parallelism is particularly suited for independent tasks.
The GPU architecture is optimal for the treatment of large datasets. In accelerators they are used for
8
data analysis (optics corrections, processing of beam instrumentation data, machine learning), solution
of differential equations (time domain electromagnetic / mechanical / thermal simulations, beam evolu-
tion), linear algebra (matrix manipulation, frequency domain electromagnetic / mechanical simulations,
beam stability), Monte-Carlo (particle-matter interaction, detector design, vacuum) or beam physics via
independent multiparticle tracking. The architecture of the GPU is however less suited for strongly
dependent data or divergent execution of the threads, such as those encountered in Particle-in-Cell or
collective effects simulations. Nevertheless, these applications require the handling of large amounts of
data that is less efficient on CPUs, thus they may also profit from GPU acceleration, yet not at the level
of more independent tasks. Tasks featuring small datasets are usually not well suited for GPUs, such as
single particle tracking simulation, optics computations and more generally heterogeneous tasks such as
those encountered in accelerator or equipment control applications.
For tasks that exceeds the capabilities of a single machine, the computer cluster is the most com-
mon tool, combining of the resources of several machine using a MPI to communicate through a fast
network connecting the nodes. For even larger scale computing projects, computing grids are used to
combine the power of several computer clusters.
References
[1] https://software.intel.com/sites/landingpage/IntrinsicsGuide/.
[2] https://gcc.gnu.org/.
[3] Standard for information technology–portable operating system interface (posix(r)) base specifica-
tions, issue 7. IEEE Std 1003.1, 2016 Edition, pages 1–3957, Sep. 2016.
https://doi.org/10.1109/IEEESTD.2016.7582338.
[4] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann
Publishers Inc., San Francisco, CA, USA, 2008.
[5] https://www.openmp.org/.
[6] https://en.wikipedia.org/wiki/Non-uniform_memory_access/.
[7] https://en.wikipedia.org/wiki/Amdahl’s_law/.
[8] G.M. Amdhal. Validity of the single processor approach to achieving large-scale computing capa-
bilities. AFIPS Conference Proceedings, 30:483–485, 1967.
[9] https://devblogs.nvidia.com/cooperative-groups/.
[10] https://www.top500.org/list/2018/11/.
[11] https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/.
[12] https://www.mpi-forum.org/docs/.
[13] https://boinc.berkeley.edu/.
[14] https://setiathome.berkeley.edu/.
[15] http://lhcathome.web.cern.ch/.
[16] https://wlcg.web.cern.ch/.
[17] https://www.openacc.org/.
[18] https://slurm.schedmd.com/documentation.html.
9
Appendices
A Source codes
The sources codes implementing the same algorithm but parallelised with five different technologies are
shown in the following. The codes are written in C aiming at being explicit more than performant. We
note that these technologies are also available in many other languages with slightly different syntaxes.
Also, many libraries implement one or more possibilities for parallelism that can be enabled if setup
properly for a given hardware. Consequently it is clear that this set of examples is far from providing a
complete overview of parallelised computing techniques, but rather aims at easing the readers’ first steps
into this field.
The example is taken from the study of non-linear dynamics, in particular the impact of the elec-
tromagnetic interaction between the beams in a circular collider. The result of the execution is shown in
Fig. A.1.
Fig. A.1: Position in phase space of 5 · 105 particles with different initial conditions after 104 turns in a circular
collider with a strong head-on beam-beam interaction. The islands generated by the 10th order resonance are
visible, along with more complex substructures.
10
A.1 Serial / OpenMP
To enable multithreading with OpenMP using gcc, the flag -fopenmp is required at compilation.
1 #include <math.h>
2 #include <cstdio >
3 #include <stdlib.h>
4
5 int main(int argc , char *argv []){
6 // Allocation of the memory for the particles coordinates
7 const int nPart = 500000;
8 double x[nPart];
9 double px[nPart ];
10 // Initiallisation of the particles coordinates with Gaussian distributions
11 double radius;
12 double angle;
13 for(int i = 0;i<nPart ;++i){
14 radius = (double)rand() / RAND_MAX;
15 while(radius ==0){
16 radius = rand ();
17 }
18 radius = sqrt (-2.0*log(radius ));
19 angle = (double )2.0* M_PI*rand() / RAND_MAX;
20 x[i] = radius*sin(angle );
21 px[i] = radius*cos(angle);
22 }
23 // Tracking variables
24 int nTurn = 10000;
25 double sinPhix = sin (2.0* M_PI *0.31);
26 double cosPhix = cos (2.0* M_PI *0.31);
27 double scale = 1.0;
28 double thres = 1E-10;
29 double oldX = 1.0;
30 double oldPx = 1.0;
31 // Tracking
32 for(int turn = 0;turn <nTurn ;++ turn){
33 // Multithreading with the large arrays shared and the tracking variables
34 // copied for each thread (firstprivate ). The size of the threads
35 // is adaptive starting at 1000 particles per thread (schedule ).
36 #pragma omp parallel for default(none) shared(x,px) \
37 firstprivate(oldX ,oldPx ,cosPhix ,sinPhix ,thres ,scale) \
38 schedule(guided ,1000)
39 for(int i = 0;i<nPart ;++i){
40 // Phase space rotation
41 oldX = x[i];
42 oldPx = px[i];
43 x[i] = cosPhix*oldX + sinPhix*oldPx;
44 px[i] = -sinPhix*oldX + cosPhix*oldPx;
45 // Beam -beam kick
46 if(abs(x[i]) > thres){
47 px[i] += scale *(1.0 -exp (-0.5*x[i]*x[i]))/x[i];
48 }
49 }
50 }
51 // Output the particles coordinates to a file
52 FILE* file = fopen("phaseSpace_OMP.csv","w");
53 for(int i=0;i<nPart ;++i){
54 fprintf(file ,"%E,%E\n",x[i],px[i]);
55 }
56 fclose(file);
57 }
11
A.2 POSIX threads
To enable multithreading with POSIX thread using gcc, the flag -pthread is required at compilation.
1 #include <math.h>
2 #include <cstdio >
3 #include <stdlib.h>
4 #include <pthread.h>
5
6 // Allocation of the memory for the particles coordinates
7 // as global variables
8 const int nPart = 500000;
9 double x[nPart];
10 double px[nPart ];
11 // Tracking variables defined as global
12 int nTurn = 10000;
13 double sinPhix = sin (2.0* M_PI *0.31);
14 double cosPhix = cos (2.0* M_PI *0.31);
15 double scale = 1.0;
16 double thres = 1E-10;
17 // Definition of the code to be executed by each thread
18 void* track(void* indices_ptr ){
19 int* indices = (int*) indices_ptr;
20 double oldX = 0.0;
21 double oldPx = 0.0;
22 // Tracking
23 for(int turn = 0;turn <nTurn ;++ turn){
24 for(int i = indices [0];i<indices [1];++i){
25 // Phase space rotation
26 oldX = x[i];
27 oldPx = px[i];
28 x[i] = cosPhix*oldX + sinPhix*oldPx;
29 px[i] = -sinPhix*oldX + cosPhix*oldPx;
30 // Beam -beam kick
31 if(abs(x[i]) > thres){
32 px[i] += scale *(1.0 -exp (-0.5*x[i]*x[i]))/x[i];
33 }
34 }
35 }
36 };
37 int main(int argc , char *argv []){
38 // Initiallisation of the particles coordinates with Gaussian distributions
39 double radius;
40 double angle;
41 for(int i = 0;i<nPart ;++i){
42 radius = (double)rand() / RAND_MAX;
43 while(radius ==0){
44 radius = rand ();
45 }
46 radius = sqrt (-2.0* log(radius ));
47 angle = (double )2.0* M_PI*rand() / RAND_MAX;
48 x[i] = radius*sin(angle );
49 px[i] = radius*cos(angle);
50 }
51 // Definition of the chunks of particles for each thread
52 int indices0 [2] = {0 ,100000};
53 int indices1 [2] = {400000 ,500000};
54 int indices2 [2] = {100000 ,200000};
55 int indices3 [2] = {200000 ,300000};
56 int indices4 [2] = {300000 ,400000};
57 // Creation of 4 additional threads
58 pthread_t thread1;
59 pthread_t thread2;
60 pthread_t thread3;
61 pthread_t thread4;
62 pthread_create (&thread1 , NULL , track , indices1 );
63 pthread_create (&thread2 , NULL , track , indices2 );
64 pthread_create (&thread3 , NULL , track , indices3 );
65 pthread_create (&thread4 , NULL , track , indices4 );
66 // Tracking on the current thread , executing in parallel
67 // to the other 4.
68 track(indices0 );
69 // Synchronisation of the threads with the current one
70 pthread_join(thread1 , NULL);
71 pthread_join(thread2 , NULL);
72 pthread_join(thread3 , NULL);
73 pthread_join(thread4 , NULL);
74 // Output the particles coordinates to a file once all threads are synchronised ,
75 // i.e. have finished their execution
76 FILE* file = fopen("phaseSpace_phtreads.csv","w");
77 for(int i=0;i<nPart ;++i){
78 fprintf(file ,"%E,%E\n",x[i],px[i]);
79 }
80 fclose(file);
81 }
12
A.3 CUDA
CUDA codes require compilation with a dedicated compiler (nvcc).
1 #include <math.h>
2 #include <cstdio >
3 #include <stdlib.h>
4
5 // Kernel definition (single turn tracking)
6 __global__
7 void track(double* x, double* px ,int nPart , double sinPhix ,double cosPhix , \
8 double scale ,double thres)
9 {
10 // Mapping of the block -thread indices to 1D array indices
11 int i = blockIdx.x*blockDim.x + threadIdx.x;
12 // Phase space rotation
13 double oldX = 0.0;
14 double oldPx = 0.0;
15 oldX = x[i];
16 oldPx = px[i];
17 x[i] = cosPhix*oldX + sinPhix*oldPx;
18 px[i] = -sinPhix*oldX + cosPhix*oldPx;
19 // Beam -beam kick
20 if(abs(x[i]) > thres){
21 px[i] += scale *(1.0 -exp (-0.5*x[i]*x[i]))/x[i];
22 }
23 }
24
25 int main(int argc , char *argv []){
26 // Allocation of the memory for the particles coordinates on the host memory
27 const int nPart = 500000;
28 double x[nPart];
29 double px[nPart ];
30 // Initiallisation of the particles coordinates with Gaussian distributions
31 // on the host memory
32 double radius;
33 double angle;
34 for(int i = 0;i<nPart ;++i){
35 radius = (double)rand() / RAND_MAX;
36 while(radius ==0){
37 radius = rand ();
38 }
39 radius = sqrt (-2.0* log(radius ));
40 angle = (double )2.0* M_PI*rand() / RAND_MAX;
41 x[i] = radius*sin(angle );
42 px[i] = radius*cos(angle);
43 }
44 // Allocation of the memory for the particles coordinates on the GPU memory
45 double* x_cuda;
46 double* px_cuda;
47 cudaMalloc (&x_cuda , nPart*sizeof(double ));
48 cudaMalloc (&px_cuda , nPart*sizeof(double ));
49 // Copy of the particles coordinates to the GPU memory
50 cudaMemcpy(x_cuda , x, nPart*sizeof(double), cudaMemcpyHostToDevice );
51 cudaMemcpy(px_cuda , px, nPart*sizeof(double), cudaMemcpyHostToDevice );
52 // Definition of the thread structure on the GPU
53 int threadsperblock = 512;
54 int blockspergrid = ceil(nPart / threadsperblock );
55 // Tracking variables
56 int nTurn = 10000;
57 double sinPhix = sin (2.0* M_PI *0.31);
58 double cosPhix = cos (2.0* M_PI *0.31);
59 double scale = 1E-4;
60 double thres = 1E-10;
61 // Tracking
62 for(int turn = 0;turn <nTurn ;++ turn){
63 track <<<blockspergrid ,threadsperblock >>>(x_cuda ,px_cuda ,nPart ,sinPhix ,cosPhix , \
64 scale ,thres );
65 }
66 // Copy the particles coordinates back to the host memory
67 cudaMemcpy(x, x_cuda , nPart*sizeof(double), cudaMemcpyDeviceToHost );
68 cudaMemcpy(px , px_cuda , nPart*sizeof(double), cudaMemcpyDeviceToHost );
69 // deallocate the GPU memory
70 cudaFree(x_cuda );
71 cudaFree(px_cuda );
72 // Output the particles coordinates to a file
73 FILE* file = fopen("phaseSpace_cuda.csv","w");
74 for(int i=0;i<nPart ;++i){
75 fprintf(file ,"%E,%E\n",x[i],px[i]);
76 }
77 fclose(file);
78 }
13
A.4 OpenACC
To enable GPU acceleration using gcc and OpenACC [17], the flag -fopenacc is required at compilation,
along with the path to a CUDA compiler to which the parallelised part is offloaded, e.g. -foffload=nvptx-
none
1 #include <math.h>
2 #include <cstdio >
3 #include <stdlib.h>
4
5 int main(int argc , char *argv []){
6 // Allocation of the memory for the particles coordinates
7 const int nPart = 500000;
8 double x[nPart];
9 double px[nPart ];
10 // Initiallisation of the particles coordinates with Gaussian distributions
11 double radius;
12 double angle;
13 for(int i = 0;i<nPart ;++i){
14 radius = (double)rand() / RAND_MAX;
15 while(radius ==0){
16 radius = rand ();
17 }
18 radius = sqrt (-2.0*log(radius ));
19 angle = (double )2.0* M_PI*rand() / RAND_MAX;
20 x[i] = radius*sin(angle );
21 px[i] = radius*cos(angle);
22 }
23 // Tracking variables
24 int nTurn = 10000;
25 double sinPhix = sin (2.0* M_PI *0.31);
26 double cosPhix = cos (2.0* M_PI *0.31);
27 double scale = 1.0;
28 double thres = 1E-10;
29 double oldX = 1.0;
30 double oldPx = 1.0;
31 // Tracking , to be treated as GPU Kernel
32 #pragma acc parallel
33 for(int turn = 0;turn <nTurn ;++ turn){
34 #pragma acc loop
35 for(int i = 0;i<nPart ;++i){
36 // Phase space rotation
37 oldX = x[i];
38 oldPx = px[i];
39 x[i] = cosPhix*oldX + sinPhix*oldPx;
40 px[i] = -sinPhix*oldX + cosPhix*oldPx;
41 // Beam -beam kick
42 if(abs(x[i]) > thres){
43 px[i] += scale *(1.0 -exp (-0.5*x[i]*x[i]))/x[i];
44 }
45 }
46 }
47 // Output the particles coordinates to a file
48 FILE* file = fopen("phaseSpace_OACC.csv","w");
49 for(int i=0;i<nPart ;++i){
50 fprintf(file ,"%E,%E\n",x[i],px[i]);
51 }
52 fclose(file);
53 }
14
A.5 MPI
The implementations of the MPI feature both a compiler, e.g. mpicc, and a run command such as mpirun
or mpiexec. The latter requires at least the number of processes to be specified when running the code,
e.g. using the flag -np. This command will launch the corresponding numbers of identical copies of the
executable, but featuring a different rank.
1 #include <math.h>
2 #include <cstdio >
3 #include <stdlib.h>
4 #include "mpi.h"
5
6 int main(int argc , char *argv []){
7 // Definition of tags for MPI calls (optional)
8 int TAGREADY = 0;
9 int TAGX = 1;
10 int TAGPX = 2;
11 // Allocation of the memory for the particles coordinates
12 // (for each process)
13 const int nPart = 500000;
14 double x[nPart];
15 double px[nPart ];
16 // MPI initialisation , retrival of the process rank and the total
17 // number of processes
18 MPI_Init (&argc , &argv);
19 int myRank;
20 int activeProcs;
21 MPI_Comm_rank(MPI_COMM_WORLD , &myRank );
22 MPI_Comm_size(MPI_COMM_WORLD , &activeProcs );
23 int chunkSize = ceil(nPart/activeProcs );
24 if(myRank == 0){
25 // Initiallisation of the particles coordinates with Gaussian distributions
26 // performed by one process only
27 double radius;
28 double angle;
29 for(int i = 0;i<nPart ;++i){
30 radius = (double)rand() / RAND_MAX;
31 while(radius ==0){
32 radius = rand ();
33 }
34 radius = sqrt (-2.0* log(radius ));
35 angle = (double )2.0* M_PI*rand() / RAND_MAX;
36 x[i] = radius*sin(angle );
37 px[i] = radius*cos(angle);
38 }
39 // Send a subset of the initial coordinates to other processes
40 for(int iProc =1;iProc <activeProcs ;++ iProc){
41 MPI_Send (&x[iProc*chunkSize], chunkSize , MPI_DOUBLE , iProc ,TAGX , \
42 MPI_COMM_WORLD );
43 MPI_Send (&px[iProc*chunkSize], chunkSize , MPI_DOUBLE , iProc ,TAGPX , \
44 MPI_COMM_WORLD );
45 }
46 } else{
47 // Processes with other ranks wait until they receive their subset of
48 // initial particles coordinate
49 MPI_Status status;
50 MPI_Recv(x,chunkSize , MPI_DOUBLE , 0,TAGX , MPI_COMM_WORLD , &status );
51 MPI_Recv(px,chunkSize , MPI_DOUBLE , 0,TAGPX , MPI_COMM_WORLD , &status );
52 }
53 // Tracking variables (executed by all processes)
54 int nTurn = 10000;
55 double sinPhix = sin (2.0* M_PI *0.31);
56 double cosPhix = cos (2.0* M_PI *0.31);
57 double scale = 1.0;
58 double thres = 1E-10;
59 double oldX = 0.0;
60 double oldPx = 0.0;
61 // Tracking (executed by all processes)
62 for(int turn = 0;turn <nTurn ;++ turn){
63 for(int i = 0;i<chunkSize ;++i){
64 oldX = x[i];
65 oldPx = px[i];
66 x[i] = cosPhix*oldX + sinPhix*oldPx;
67 px[i] = -sinPhix*oldX + cosPhix*oldPx;
68 if(abs(x[i]) > thres){
69 px[i] += scale *(1.0 -exp (-0.5*x[i]*x[i]))/x[i];
70 }
71 }
72 }
73 if(myRank ==0){
74 // One process collects the particles coordinates computed by other processes
75 int nRecv = 1;
76 while(nRecv <activeProcs ){
15
77 MPI_Status status;
78 MPI_Recv(NULL , 0, MPI_BYTE , MPI_ANY_SOURCE ,TAGREADY , MPI_COMM_WORLD ,& status );
79 int iProc = status.MPI_SOURCE;
80 MPI_Recv (&x[iProc*chunkSize], chunkSize , MPI_DOUBLE , status.MPI_SOURCE ,TAGX , \
81 MPI_COMM_WORLD ,& status );
82 MPI_Recv (&px[iProc*chunkSize], chunkSize , MPI_DOUBLE , status.MPI_SOURCE ,TAGPX , \
83 MPI_COMM_WORLD ,& status );
84 ++nRecv;
85 }
86 } else{
87 // All other processes send their data after tracking
88 MPI_Send(NULL , 0, MPI_BYTE , 0,TAGREADY , MPI_COMM_WORLD );
89 MPI_Send(x,chunkSize , MPI_DOUBLE , 0,TAGX , MPI_COMM_WORLD );
90 MPI_Send(px,chunkSize , MPI_DOUBLE , 0,TAGPX , MPI_COMM_WORLD );
91 }
92 MPI_Finalize ();
93 // Only the process that has collected all the data outputs the particles
94 // coordinates to a file
95 if(myRank ==0){
96 FILE* file = fopen("phaseSpace_MPI.csv","w");
97 for(int i=0;i<nPart ;++i){
98 fprintf(file ,"%E,%E\n",x[i],px[i]);
99 }
100 fclose(file);
101 }
102 }
When using batch processing, the syntax for the submission of the job depends on the technology. Below
an example of a script for SLURM [18] illustrates the request for resources (here 20 nodes are requested,
each running 40 processes for 10 minutes). The mpirun/mpiexec and the corresponding flags which are
used in particular for the allocation of the resources are replace by the command srun, as the allocation
is now under the control of the batch system. The submission to the scheduler is done via a given com-
mand, here sbatch jobFileName.
1 #!/bin/bash
2 #SBATCH --nodes 20
3 #SBATCH --tasks -per -node 40
4 #SBATCH --cpus -per -task 1
5 #SBATCH --time 00:10:00
6 #SBATCH --workdir /pathTo/phaseSpace
7
8 srun phaseSpace_MPI
16
