Parallelprogrammering med CUDA by Nielsen, Jon & Mosegaard, Truls
Parallel programming with CUDA
Parallelprogrammering med CUDA
Truls Mosegaard
Jon Nielsen
Supervisor: Keld Helsgaun
Master Thesis. September 2011 - February 2012.
February 27, 2012
Computer Science at CBIT
Building 43.2
Roskilde Universitet 2012
Abstract
English
This report documents our master thesis project, which is about parallel programming with CUDA, the
NVIDIA GPU architecture with support for general purpose computing.
The purpose of the thesis is to uncover the qualities of CUDA as a parallel computing platform, determining
the possibilities and limitations of its ability to handle different types of algorithms.
We examine this by performing a case study of two algorithms used in the computationally intensive field of
n-body simulations. In our report we present the topics of our thesis through chapters containing overviews
of the relevant theory. Based on this we investigate how CUDA performs using the embarrassingly parallel
n-body all-pairs algorithm, as well as the Barnes-Hut algorithm, which is partially irregular with regard
to parallelization due to its data structure.
We have found that CUDA performs exceptionally well on n-body all-pairs, observing up to a 100× speed-
up on an optimized GPU implementation compared to an implementation running on a computer with 16
CPU cores. The CUDA implementation of the Barnes-Hut algorithm also shows increased performance, as
the most costly part of the algorithm is parallelizable. We find that although it is possible to implement an
irregular algorithm in CUDA, doing so with success requires an understanding of CUDA programming
and the CUDA model of parallelism.
We conclude that CUDA performs well on massively parallel problems and can be useful for irregular
problems as well. Programming for it can be complex when optimizing or when the algorithm is not easily
parallelized. The platform offers good performance and has the potential to accelerate suitable applications.
Danish (translated into English)
This report documents our master thesis project, which concerns parallel programming with CUDA,
NVIDIA’s GPU architecture for general purpose computing.
The purpose of the thesis is to examine the qualities of CUDA as a parallel platform, and to determine its
possibilities and limitations in handling different types of algorithms.
We examine this by performing a case study of two algorithms used in the computationally
intensive field of n-body simulations. In our report we present the topics of the thesis in chapters
containing overviews of the relevant theory. Based on this, we investigate how CUDA performs
on the “embarrassingly parallel” n-body all-pairs algorithm, and on the Barnes-Hut algorithm, which is partially
irregular with regard to parallelization due to its data structure.
We have found that CUDA performs exceptionally well on n-body all-pairs, where we observe a 100×
increase in speed on an optimized GPU implementation compared to an implementation
running on a computer with 16 CPU cores. The CUDA implementation of the Barnes-Hut algorithm also
shows increased performance, as the most costly part of the algorithm is parallelizable. We find that although it is
possible to implement an irregular algorithm in CUDA, doing so successfully requires an understanding of CUDA programming
and of CUDA’s model of parallelism.
We conclude that CUDA performs well on massively parallel problems, and can be useful for irregular
problems as well. It can be complex to program for when optimizing or when the algorithm is not
easily parallelizable. The platform has good performance and the potential to accelerate suitable applications.
Preface
Our interest in parallel computing with CUDA was first sparked during the last session of the introductory
parallel and distributed computing course, IPDC, which we attended at Computer Science at Roskilde University.
In the lecture, guest lecturer Thomas Schrøder from the group Glass & Time at IMFUFA at Roskilde
University gave a presentation on CUDA and the group’s work involving molecular dynamics simulations
conducted on a cluster of GPUs. At the beginning of this project, and after corresponding with the group,
we were allowed access to resources on their cluster for the work on this thesis on parallel computing using
CUDA.
In this thesis we have approached CUDA and parallel programming with the purpose of learning CUDA
programming, and being able to describe CUDA as a parallel computing platform. We have investigated
the use and performance of CUDA through two cases from the field of n-body simulations. The cases are
based on the CUDA implementation of the all-pairs and Barnes-Hut algorithms for n-body simulations.
Though the algorithms are for solving the same problem, they are quite different in the way they work,
their performance and parallelizability.
The making of this project has been a learning process for us and we feel the thesis report should reflect
that. The report contains a chapter on CUDA. This chapter gives the reader an introduction to CUDA
and CUDA programming, and some details on specific uses and benefits of the platform. To put this in
context we have written a chapter on general parallel computing, platforms and programming concepts.
This chapter is used to introduce some concepts used or referenced later. Our intended reader is interested
in computer science and parallel computing and has some knowledge of C programming. Readers who already
have knowledge of parallel computing or programming may skip that chapter at their discretion. We
have included a chapter giving an overview of the physics and mathematics of the simulations in our cases.
In support of our case chapters, a chapter detailing the software and hardware of the experimental setup
is included as an appendix. As cases we use two different algorithms for performing n-body simulations.
The cases are meant to illustrate the use of CUDA when implementing regular and irregular parallelism.
Our two cases are in separate chapters detailing their implementations and the experiments performed.
The thesis report is written in English, partly because we found English to be the more natural language in
which to write on this subject, and partly because we want to give more people the opportunity to read it.
We have included a glossary (Appendix D) in which the reader can find descriptions of relevant
words, concepts and abbreviations used throughout.
This master thesis report represents the documentation of our project, which was performed as part of
our final study module, K2-S at Computer Science at CBIT at Roskilde University. It is also the intention of
our efforts to fulfill the requirements of this module, as stipulated in section 21 of the study guide
(contained in the document “Datalogi studieordning (2006)”).
Table of contents
1 Introduction
2 Problem description
3 Method
3.1 Selection of cases
3.2 Study of theory
3.3 n-body all-pairs case
3.4 n-body Barnes-Hut case
4 Parallel computing
4.1 Parallelism
4.1.1 Amdahl’s law
4.1.2 Flynn’s taxonomy
4.1.3 Granularity and parallelism
4.2 Parallel systems
4.2.1 Von Neumann machine
4.2.2 The microprocessor
4.2.3 Memory
4.2.4 Parallel computers
4.3 Parallel programming
4.3.1 Message passing / distributed memory
4.3.2 Shared memory
4.3.3 “Concepts” of parallel programming
5 CUDA
5.1 Introduction
5.2 CUDA hardware
5.2.1 GPU
5.2.2 Memory
5.2.3 CUDA capability
5.3 Parallel programming in CUDA
5.3.1 CUDA C
5.3.2 Built-in variables and device querying
5.3.3 Kernels and threads
5.3.4 Device memory
5.3.5 Data coalescence
5.3.6 Parallel constructs
5.3.7 Special functions and instructions
5.4 Optimization in CUDA
5.4.1 Knowing the architecture
5.4.2 Occupancy
6 The n-body problem
6.1 Physics
6.2 Numerical integration methods
6.2.1 Euler
6.2.2 Verlet
6.3 Simulation algorithms
6.3.1 All-pairs
6.3.2 Barnes-Hut
6.3.3 Other algorithms
6.4 Data
6.4.1 Plummer model
7 N-body all-pairs
7.1 All-pairs n-body simulation, sequential version
7.1.1 Implementation
7.1.2 Notes
7.1.3 Tests & results
7.2 All-pairs n-body, parallel OpenMP version
7.2.1 Implementation
7.2.2 Tests & results
7.3 All-pairs n-body first CUDA version
7.3.1 Implementation
7.3.2 Known issues
7.3.3 Tests & results
7.4 All-pairs N-body, CUDA: float4 optimization
7.4.1 Implementation
7.4.2 Known issues
7.4.3 Tests & results
7.5 All-pairs N-body, CUDA: shared memory optimization
7.5.1 Implementation
7.5.2 Known issues
7.5.3 Tests & results
7.6 All-pairs n-body, CUDA: loop optimization
7.6.1 Investigating the loop
7.6.2 Implementation
7.6.3 Notes
7.6.4 Tests & results
7.7 All-pairs N-body, CUDA: Verlet integration
7.7.1 Implementation
7.7.2 Known issues
7.7.3 Tests & results
7.8 Summary of all-pairs case
8 N-body Barnes-Hut
8.1 Barnes-Hut, sequential version
8.1.1 General implementation
8.1.2 Implementation of functions
8.1.3 Known issues
8.1.4 Tests & results
8.2 Barnes-Hut, OpenMP version
8.2.1 Implementation
8.2.2 Tests & results
8.3 Barnes-Hut, CUDA version
8.3.1 Program structure and main method
8.3.2 “Radius” calculating kernel
8.3.3 Octree building kernel
8.3.4 Compute center of mass kernel
8.3.5 Compute force & advance bodies kernels
8.3.6 Performance
8.4 Tests & results
8.5 Summary of the Barnes-Hut case
9 Analysis and discussion
9.1 Cases
9.2 CUDA
9.3 Project
10 Conclusion
Bibliography
11 Appendix A: Experimental setup
11.1 Software used
11.2 Hardware resources
11.3 Utility programs
11.3.1 File format
11.3.2 Input and output
11.3.3 Plummer galaxy generator
11.3.4 Energy calculator
11.3.5 Bitmap creator
12 Appendix B: Floating point precision and performance
13 Appendix C: CUDA compute capability
14 Appendix D: Glossary
15 Appendix E: Source code
1 Introduction
Since the 1960s, a tendency known as Moore’s Law has been observed. This “law” states that the number
of transistors that it is possible to place on an integrated circuit doubles around every 2 years. Due to the
lower limits of how small transistors can get, amongst other limitations, the tendency is predicted to be
close to an end. The processing capability of a computer’s Central Processing Unit (CPU) is connected with
this tendency, and for a long time the processing speed of CPUs would increase in a predictable way. This
meant that the same programs and algorithms would run faster as the processors evolved. In the past
decade, the limit of the performance that can effectively be achieved on a single processing unit has been
pushed, and there has been a shift towards integrating multiple cores in a single chip. It is now the norm
for a CPU to have 2, 4, 6 or more cores. To use the extra processing capability offered by multicore CPUs,
programs and algorithms need to be parallelized. Parallel programming is not a new technique, and has
been in use in industrial and scientific applications as a way of solving large problems faster for many years.
Another category of processor that has evolved in the same way is known as a Graphics Processing Unit
(GPU). GPUs are specialized processors that are optimized for graphics processing, for instance
floating point calculations and matrix operations. Graphical tasks are highly parallelizable, and a modern
graphics device now consists of a many-core architecture with tens or hundreds of cores. The GeForce GTX
480 GPU used in a case in this project, for instance, has 480 cores. The development of these devices has
primarily been motivated by the requirements of 3D games, and their processing power was originally
accessible only through 3D graphics APIs such as Microsoft’s Direct3D for Windows or the cross-platform OpenGL.
In 2006 NVIDIA, which is one of the largest producers of GPUs, introduced CUDA, an architecture
for general purpose computing using GPUs (GPGPU). It includes a new parallel programming model,
and its purpose is to solve certain computational problems more efficiently than on a CPU. The majority
of NVIDIA’s graphics devices are designed for computer games, making them affordable for the amount of
processing capability provided. Recent device models, for instance, have a theoretical maximum number of floating
point operations per second (FLOPS) that is ten times that of a comparable high-end CPU [30, p. 2]. CUDA
enables this processing capability to be used generally and not just for graphics, and it is today often used
in a scientific setting, for instance in simulations.
2 Problem description
An increase in the speed of a single-core CPU will almost always also increase the speed of existing programs
for the platform. However, in parallel computing the program must be designed to use multiple processor
cores. There exist several programming languages and paradigms to support parallel programming on
different architectures and for various applications. Depending on the application and the choices made,
a parallel implementation may offer significant increases in computing speed. In the case of CUDA GPUs
this is especially true. A GPU is not a general purpose processor in the same way a CPU is. A CUDA capable
GPU comes in the form of a device that is either built-in or added to a supported host computer system.
The device, as it is known in CUDA terminology, is accessible only through programs running on the host.
Programming of CUDA devices is done using C for CUDA, which consists of a subset of the C/C++ language,
extensions specific to CUDA, and an API. Restrictions are placed on the C language due to limitations
of the device hardware, and indeed not all applications can be optimized by an implementation
in CUDA. Additional restrictions apply depending on the CUDA capability version of the GPU, a number
used to indicate the features and functionality specific to that revision. Examples are missing or limited support for
double precision floating point computations [35] and restricted use of branching and function calls, but
each version also brings optimized or new features.
In CUDA programming, a program is written in which parts of the program run on the host CPU, and
parts are executed on the device GPU. Special functions called kernels are written into the host program
source code, but compiled for and executed on the device. Since CUDA devices have tens or even hundreds
of cores, and support the execution of thousands of threads, they are best suited for computations
that are highly parallelizable, and that can run without communicating with the code running on the host
computer too frequently [28].
With this in mind, we wish to answer the question of the quality of CUDA as a parallel computing platform.
To answer this question, we need to determine which possibilities and which limitations CUDA has
to offer compared to other parallel platforms. CUDA is advertised as being very able to handle certain types
of algorithms for solving computational problems. The questions are to what degree this holds, and how well it
handles other types of algorithms. Specifically, which applications can successfully take advantage of the
high number of cores, computational capability and architecture of the GPU?
3 Method
Following up on the issues listed in the problem description we will expand a bit on them, and propose a
method to address them.
In order to determine what kind of a parallel computing platform CUDA is and what its qualities are,
we first need to study parallel computing in general to put it into context. We then need to look at CUDA
itself, which means learning CUDA programming and its relation to the workings of the hardware. This
will allow us to recognize the possibilities of the platform, as well as the limitations.
Regarding the ability of CUDA to help in solving computational problems through its model of parallelism,
we will need to examine a problem that suits the scale and architecture of a GPU. We would also
like to know what the possibilities are for CUDA helping in solving computational problems where achieving
this kind of parallelism appears to be less straightforward. It is not within the scope of this thesis to
provide a complete and full survey of CUDA and all possible applications. Instead we will exemplify the use
of CUDA through two cases.
3.1 Selection of cases
N-body simulations and similar simulations are often performed using parallel systems, and they are
frequently used for demonstrating performance of parallel systems. They are useful for benchmarking
hardware performance, as the problem size can be adjusted to fit the system. The benchmark works
both ways, though, as the requirements for performing n-body simulations are similar to other types of
simulations. We have selected two cases for investigating different uses of CUDA. The first case, the all-pairs
algorithm for simulating the n-body problem, is well-suited for massive parallelization due to the
nature of the problem. We wish to determine how well CUDA handles this problem, by measuring its
performance. The second case we have selected is the Barnes-Hut optimized algorithm for performing n-body
simulations. While it is applied to the same problem, it is very different from the all-pairs algorithm.
The algorithm is noted for reducing the time complexity of an n-body simulation from O(n²) to O(n log n)
by placing its data in an octree. It is not obviously parallelizable in CUDA, and it is our intention to
determine if it can be implemented, and if so what performance can be achieved.
3.2 Study of theory
This thesis is and has been a learning experience for us. The purpose of this project is partly to document
what we have learned, and partly to introduce the reader to the use of CUDA in the context of general
parallel computing and programming. Since we will also be applying CUDA to specific cases, we will need
to study and communicate relevant aspects concerning the theory behind the cases in the project. The
aim is to give the reader an impression of the work we have performed and of our process, and to create a basis for
understanding the project cases. We have therefore included three chapters of theory relevant to this thesis.
Parallel computing
In this chapter we introduce the basic background information referenced in our thesis such as “classical”
parallel computing architectures, programming concepts and techniques.
CUDA
Use of CUDA requires learning a new programming paradigm, and a certain amount of base knowledge
about this and the hardware is required to implement and understand CUDA programs. We have therefore
included a chapter that documents GPU hardware, the CUDA programming model, how the two relate,
and finally strategies for CUDA optimizations.
Theory related to the cases
As a reference we have included a chapter that briefly explains the physical and mathematical theories that
are the foundations of the implementation of the n-body simulations, as well as explaining the computational
algorithms.
3.3 n-body all-pairs case
The n-body problem solved by the all-pairs algorithm, which is our first case, is embarrassingly parallel,
and is useful in exemplifying and demonstrating parallelization both in general and especially on CUDA.
The time needed to solve the problem depends on both processing capability and problem size, and the
problem size can be increased arbitrarily as processing capability allows.
Implementation
As a point of reference, we will implement a sequential and a parallel version for a normal computer. We
will then convert the parallel version to CUDA and create successive implementations, each optimizing
performance using new CUDA functionality and optimization methods, as documented in the CUDA
chapter.
Experiments
To measure the performance, we will do tests and experiments on each version and each optimization. We
will verify and explain performance gains caused by each optimization. For each version we will perform
benchmarks to be able to discuss the significance of the optimizations and use of CUDA, and to compare
performance across versions.
3.4 n-body Barnes-Hut case
Implementing the Barnes-Hut algorithm for n-body simulations is our second case. The algorithm uses a
different strategy for performing n-body simulations, optimized for time complexity. This strategy involves
approximating the simulation by placing bodies in an octree data structure and using a heuristic to
determine when to apply calculations. We have selected this case both for its potential to speed up n-body
simulations and to see if it is possible to implement in CUDA.
Implementation
Implementing the Barnes-Hut algorithm in parallel is not trivial. Dividing up the work and keeping
processors busy is a challenge. Ensuring that the algorithm is performed correctly is also more of an issue
for this case. We will again create sequential and parallel reference versions for normal computers to
compare both performance and results. We will then implement a CUDA parallel version.
Experiments
As with the first case, we will perform tests and experiments on the different versions. Since the algorithm
performs approximations, we will need to consider the “correctness” of the simulations by some metric,
to determine if the implementations are acceptable. Due to the promising features of the algorithm, we
will perform benchmarks comparing its performance to the performance of the optimized implementation
of the all-pairs algorithm in the first case. Performance between our implementations of the Barnes-Hut
algorithm will also be measured by benchmarking. Since we do not expect as high gains in performance due
to parallelization as with the all-pairs case, we will do a more in-depth analysis of where in the algorithm
we observe performance gains and how they relate to CUDA.
4 Parallel computing
In this chapter we will be focusing on the topic of parallel computing, in order to be able to correctly
understand and describe CUDA in this context. It is not meant to be a complete survey, and will mostly be
limited to what is relevant to our project and a later discussion on CUDA and its place in parallel computing.
The chapter is divided into three sections: parallelism, parallel systems and parallel programming.
The section on parallelism will contain background information and classifications of the different types
of parallelism as well as theory on parallelism and speedup.
The section on parallel systems will describe parts of the architecture of the systems typically used in
parallel computing. Some attention will be paid to memory here, as it is generally important to consider
when writing parallel programs, especially so when using CUDA.
Finally, the section on parallel programming will provide an overview of parallel programming paradigms
and terminology, leading into the chapter about the architecture and programming model of CUDA.
4.1 Parallelism
This section covers the theory of parallelism that forms the basis of parallel systems and parallel
programming. First we present the two major laws concerning the possible speedup of programs
using parallelism, Amdahl’s law and Gustafson’s law. We then present the general computer architectures
classified in Flynn’s taxonomy, and finally we briefly cover the granularity of parallelism.
4.1.1 Amdahl’s law
The primary motivation for performing parallel computing, and also the primary criteria by which a
successful application of parallel computing is measured, is the observed decrease in elapsed time. If an
application running on a single processor is observed to finish its execution in T1 time, one might assume
that a perfectly parallelized version of the application would perform the same task in T1/P time, given P
processors. This cannot be assumed, however, and an observed excution time TP on P processors is used
to measure what is known as the speedup ratio:
S = T1/TP (4.1)
This ratio can be used to measure the success of the parallelization as compared to a linear speedup, which
is rarely possible. One reason for this is that almost all parallel applications contain parts that
cannot be parallelized and remain sequential. Assuming that the parallel application is working
on the same problem, the parallel parts will execute faster given more processors, while the sequential
parts will spend the same amount of time executing. Amdahl’s law states that the expected speedup
is given by[9]:
$$S = \frac{1}{f_{par}/P + (1 - f_{par})} \tag{4.2}$$
where $f_{par}$ is the fraction of the program that is parallelizable. The implication is that, given a
parallel application and a fixed problem, the speedup provided by extra processors will eventually be
limited by the sequential part. An application with $f_{par} = 0.8$ would for instance get a speedup of
1.67 on 2 processors, while 16 processors would yield a speedup of only 4, and no number of processors
could yield more than 1/(1 − 0.8) = 5.
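The figures above can be reproduced with a few lines of C. This is a minimal sketch of equation 4.2 and not part of our implementations; the function names `amdahl_speedup` and `amdahl_limit` are chosen here for illustration only.

```c
#include <assert.h>

/* Expected speedup according to Amdahl's law (equation 4.2).
 * f_par: fraction of the program that is parallelizable, in [0, 1].
 * p:     number of processors, at least 1. */
double amdahl_speedup(double f_par, int p)
{
    assert(f_par >= 0.0 && f_par <= 1.0);
    assert(p >= 1);
    return 1.0 / (f_par / p + (1.0 - f_par));
}

/* Upper bound on the speedup as the number of processors grows without
 * limit: 1 / (1 - f_par). Only meaningful for f_par < 1. */
double amdahl_limit(double f_par)
{
    assert(f_par >= 0.0 && f_par < 1.0);
    return 1.0 / (1.0 - f_par);
}
```

Calling `amdahl_speedup(0.8, 2)` and `amdahl_speedup(0.8, 16)` gives roughly 1.67 and 4 respectively, while `amdahl_limit(0.8)` gives 5.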
Gustafson’s law
Luckily the scenario described above, and indeed Amdahl’s law, is not indicative of the scalability and
successful application of parallel computing in general. In his 1988 article “Reevaluating Amdahl’s law”,
John L. Gustafson criticises Amdahl’s law and the skepticism it has caused towards the viability of massive
parallelism, as his own research at Sandia National Laboratories indicated otherwise. He points to
research in which a 1024-processor system ran various parallelized simulations. A speedup ratio of
between 1016 and 1021 was achieved using programs with sequential parts ranging from 0.4 to 0.8 percent.
According to Amdahl’s law, a maximum speedup ratio of only about 200 would have been possible. From this he
presents a formula for estimating the scaled speedup, which today is also known as Gustafson’s law[18]:
$$S = P + (1 - P) f_{ser} \tag{4.3}$$
This formula assumes that the program and problem have been scaled to the parallel computer
used. P is again the number of processors and $f_{ser}$ is the serial fraction of the program. Gustafson concludes that
programs and problem sizes can scale with the number of processors available, and that speedup should be estimated
assuming that this is the case. Simulations in which the programmer can modify sizes and parameters
depending on the number of processors are specifically mentioned as an ideal application of parallelization
using this theory.
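To make the contrast between the two laws concrete, the following sketch evaluates equation 4.3 next to Amdahl's law expressed in terms of the serial fraction. The function names are illustrative assumptions and not taken from Gustafson's article.

```c
#include <assert.h>

/* Scaled speedup according to Gustafson's law (equation 4.3).
 * f_ser: serial fraction of the scaled program, in [0, 1].
 * p:     number of processors, at least 1. */
double gustafson_speedup(double f_ser, int p)
{
    assert(f_ser >= 0.0 && f_ser <= 1.0);
    assert(p >= 1);
    return p + (1 - p) * f_ser;
}

/* Amdahl's law (equation 4.2) rewritten with f_par = 1 - f_ser,
 * so both laws can be called with the same arguments. */
double amdahl_from_serial(double f_ser, int p)
{
    assert(f_ser >= 0.0 && f_ser <= 1.0);
    assert(p >= 1);
    return 1.0 / (f_ser + (1.0 - f_ser) / p);
}
```

For 1024 processors and a serial fraction of 0.4 percent, `gustafson_speedup(0.004, 1024)` evaluates to about 1020, whereas `amdahl_from_serial(0.004, 1024)` evaluates to about 201, which matches the figures quoted above.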
4.1.2 Flynn’s taxonomy
Flynn’s taxonomy, first proposed by Michael J. Flynn in research articles from 1966 and 1972, is a
classification of computer architectures. He revisits the classification in a 1996 article, Parallel Architectures,
which he co-authored. In the article, he presents the idea that streams can represent objects in the form of
data or actions in the form of instructions. The four possible combinations, shown in table
4.1, each represent a parallel architecture and give the following classifications[13]:
               Single Instruction   Multiple Instruction
Single Data    SISD                 MISD
Multiple Data  SIMD                 MIMD
Table 4.1 Flynn’s taxonomy
SISD
Single instruction, single data. This is arguably the most familiar architecture, often exemplified by
the uniprocessor PC, suggesting that no concurrency occurs. Even though most new PCs today feature
multicore processors, this classification is still relevant. The 1996 article points out that, while not obvious,
concurrency can occur when instruction level parallelism is exploited. A processor that exploits instruction
level parallelism still operates on a sequential stream of instructions, but may choose to optimize its internal
execution of them. Such optimization is used, for instance, in the superscalar architecture where some
instructions can be executed concurrently, and complex instructions are executed using a pipeline.
SIMD
Single instruction, multiple data. In this architecture a single instruction stream is used to process multiple
data streams by using an array or vector processor. In an array processor, processor elements execute the
same instructions on many data elements simultaneously. In a vector processor, a single processor element
works efficiently on a sequence of multiple elements of data. The vector processor relies on smaller datasets
and pipelining[13]. An example of SIMD type processing is found in today’s graphics processing units.
MISD
Multiple instruction, single data. This type of processing could, for instance, be used to build a pipeline of
processors used to process data, each feeding its output as input to the next. The article notes that while this
is what a vector processor does on a microarchitecture level, the lack of suitable programming constructs
to support MISD style processing is a drawback[13]. Another conceivable use of MISD is
redundancy, where multiple processors process the same data in case one or more should
fail.
MIMD
Multiple instruction, multiple data. The MIMD class of parallel architecture consists of multiple processors
that work independently on multiple sets of data. Although the processors are separate, they often
work on the same program[13], dividing the workload. Typical examples of this are shared-memory
symmetric multiprocessors (SMPs) or multicore processors, although identical processors are not required
in this classification. Distributed, networked processors are also a part of this classification. These may
communicate using various memory models, depending on program implementation and programming
paradigm.
Use of Flynn’s taxonomy
While some parallel architectures of today do not strictly fit into Flynn’s taxonomy, it is still relevant.
Parallel computing is here to stay, and thinking along the lines of Flynn’s taxonomy, knowing which basic
architectures exist, can help in choosing the correct system for a given problem or for using resources
optimally. Architectures that do not fit into one of the four classifications can usually be described using a
combination of classifications.
4.1.3 Granularity and parallelism
The success of parallelism depends on making good use of available resources. This implies keeping
the parallel system busy and using all of its resources to solve the problem in an efficient manner.
Identifying what to parallelize and how is crucial in this. For small problems, fine-grained parallelism
is often preferred. A typical example of this is threaded parallel programs on shared-memory systems.
Here, it may be beneficial to only parallelize certain parts of a program, for instance loops[9]. Overuse
of parallel constructs can cause the program to scale poorly or even adversely affect performance due to
overhead. When problems are not obviously parallelizable, or do not scale well, it may be beneficial to
maximize resource use by working on several different kinds of problems at once, which is sometimes
called task-parallelism. Coarse-grained parallelism is when large chunks of code run in parallel and/or
large amounts of data are processed in parallel. In fine-grained parallelism, less work is
done between each communication between processes, sometimes causing more overhead. The parallelization
strategy whereby data is partitioned to be processed on a large grid of processors is often referred to as
data-parallelism[9]. Data-parallelism sometimes has the added advantage of locality, precisely because each
processor works on its own part of the problem. Some types of inherently parallel problems can be so
easily exploited that they are referred to as embarrassingly parallel. Examples of this are found in
brute-force searches, graphics rendering, and certain kinds of simulations.
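Data-parallelism as described above starts with deciding which part of the data each processor works on. The sketch below shows one common way of doing this, a contiguous block partition. It is an illustration under our own assumptions: the type `chunk_t` and the function `block_partition` are hypothetical names, and other partitioning schemes exist.

```c
#include <assert.h>
#include <stddef.h>

/* A contiguous range of element indices, half-open: [begin, end). */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Split n elements as evenly as possible among `workers` workers and
 * return the range owned by worker number `rank` (0-based).
 * The first n % workers workers receive one extra element. */
chunk_t block_partition(size_t n, size_t workers, size_t rank)
{
    assert(workers > 0 && rank < workers);
    size_t base = n / workers;
    size_t extra = n % workers;
    chunk_t c;
    c.begin = rank * base + (rank < extra ? rank : extra);
    c.end = c.begin + base + (rank < extra ? 1 : 0);
    return c;
}
```

With 10 elements and 4 workers the ranges become [0, 3), [3, 6), [6, 8) and [8, 10). Each worker touches only its own contiguous range, which is what gives this strategy its locality.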
4.2 Parallel systems
In this section we will not be describing the details of concrete architectures or listing esoteric examples
of specific parallel systems. Instead we will be discussing the general computer which is the basis of most
parallel systems, its background, and particular issues which must be considered when high performance
is the goal. As we shall see later, this will apply to CUDA as well.
4.2.1 Von Neumann machine
The Von Neumann machine is another name for what is also known as the common computer,
named in honor of the mathematician John von Neumann. While he was not the first to come up with the
concept of a computer system, he was the first to develop a system that worked by accepting instructions
and storing them in memory[6]. Considering that previous computer designs worked on fixed, hardcoded
programs, this was and is a significant contribution. The outline of how the Von Neumann machine works
still applies today on a conceptual level. This concept revolves around a memory for storing programs and
data and a processor for executing the programs and processing the data.
The processor in a Von Neumann style machine works on a sequence of instructions from a supported
instruction set. This instruction set defines what the processor is capable of at the lowest programmable
level. These instructions are encoded and stored together with the data. The processor, at a minimum,
consists of two main parts: a control unit (CU) and an arithmetic logic unit (ALU). See figure 4.1. The control
unit fetches a word from memory, places it in an instruction register, and decodes the instruction[36]. Until this
happens, the processor cannot distinguish an instruction from data. If the instruction is valid, the processor
will then proceed to execute it. If the instruction is a type with one or more operands, these are also fetched
from memory and placed in an input register. Registers are a form of memory built into the processor,
on which it can perform its operations directly. Registers may serve a specific purpose, such as holding
instructions, input or output, or be general-purpose. Once the instruction has been fetched from memory and decoded,
and its operand or operands fetched, the CU can direct the ALU to perform the decoded logic or arithmetic
instruction. The result is placed in an output register, and may then be worked on further or written back
to the main memory.
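The fetch, decode and execute cycle described above can be sketched as a toy machine in C. Everything here is an assumption made for illustration: the instruction encoding, the four opcodes and the function names are invented, and a single accumulator stands in for the separate input and output registers. The point is only that instructions and data share one memory and that a word becomes an instruction when the control unit decodes it.

```c
#include <assert.h>

enum { TOY_MEM_SIZE = 100 };
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

/* An instruction word is encoded as opcode * TOY_MEM_SIZE + address. */
static int toy_encode(int opcode, int address)
{
    return opcode * TOY_MEM_SIZE + address;
}

/* Run the program stored at address 0 until HALT.
 * Returns the number of instructions executed, or -1 on an invalid
 * instruction or if the program counter leaves the memory. */
int toy_run(int *memory)
{
    int pc = 0;        /* program counter */
    int acc = 0;       /* accumulator register */
    int executed = 0;

    while (pc >= 0 && pc < TOY_MEM_SIZE) {
        int ir = memory[pc++];             /* fetch into instruction register */
        if (ir < 0)
            return -1;
        int opcode = ir / TOY_MEM_SIZE;    /* decode */
        int address = ir % TOY_MEM_SIZE;
        executed++;
        switch (opcode) {                  /* execute */
        case OP_HALT:  return executed;
        case OP_LOAD:  acc = memory[address];  break;
        case OP_ADD:   acc += memory[address]; break;
        case OP_STORE: memory[address] = acc;  break;
        default:       return -1;
        }
    }
    return -1;
}

/* Store a four-instruction program and two operands in the same memory,
 * run it, and read the result back from memory. */
int toy_add(int a, int b)
{
    int memory[TOY_MEM_SIZE] = {0};
    memory[0] = toy_encode(OP_LOAD, 10);
    memory[1] = toy_encode(OP_ADD, 11);
    memory[2] = toy_encode(OP_STORE, 12);
    memory[3] = toy_encode(OP_HALT, 0);
    memory[10] = a;
    memory[11] = b;
    int steps = toy_run(memory);
    assert(steps == 4);
    (void)steps;
    return memory[12];
}
```

Every instruction in this sketch costs at least one memory access for the fetch and often a second for the operand, which is the traffic that the Von Neumann bottleneck refers to.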
The main memory is the only external memory which the processor can access directly, and the means by
which it retrieves both program code and data. This in itself causes latency, as the memory is not located
on the processor like the registers, and its data must be transferred there. Further time is lost because
the memory is slow relative to the speed at which the processor operates. Regardless of the speed
of the memory, the type or speed of the system bus and any additional caches of the processor, this part of the design
is most often considered a bottleneck. For this reason it is also known as the Von Neumann bottleneck.
Figure 4.1 Conceptual illustration of the processor in a Von Neumann machine
4.2.2 The microprocessor
The processors of today are based on integrated circuit technology, as they have been since the introduction
of the microprocessor. The first microprocessor was introduced by Intel in 1971 and it was commercially
successful[6]. The co-founder of Intel, Gordon E. Moore, had by this time already noted the consistent
advances in integrated circuit research. He suggested that about every second year¹, the number of transistors
that could be placed on a given area would double, at approximately the same price. He predicted that this
tendency would continue to be true for the foreseeable future, in what is known as Moore’s law. The law
still holds, but the extra transistors are today increasingly used for parallelism.
The first microprocessor was revolutionary because of the amount of computing capability available in
its relatively small size. Since then advances in processing capability have been closely tied to the steady
increase in transistors. An increasing number of transistors directly affects how much logic can be represented
per area. Through the history of microprocessor advances this fact has primarily been exploited in two
ways: to do more processing each clock cycle, and to increase the clock rate. The first microprocessor was a
4-bit processor running at a low clock rate and with a limited instruction set. The next generation doubled
the word size to 8 bits and increased the clock rate and the number of instructions[6]. The word size is the size
of data the processor is designed to work with, i.e. perform calculations on and use for addressing
memory. Since then, the extra logic allowed by an increase in transistors has generally been used to increase
the word size and to add support for new types of instructions, floating point operations for instance, and other
specialised processing instructions. It has also been utilised for performing more instructions per clock
cycle, thereby increasing processing capability as well as speed. An increase in transistor density makes the
processor more energy-efficient, thereby allowing the increase in clock rate. Over time, these possibilities
have led to countless incremental improvements in new processors being released. Today, the tendency
is to focus less on increasing the clock rate. As a result of this, we have extremely small and low-power
processors, fast mainstream CPUs that offer extensive capabilities on multiple CPU cores, and specialized
processors such as the GPU with a high number of relatively simple cores.
1 In his original article this was approximated to be one year[25], but it was later adjusted to two years.
4.2.3 Memory
While the speed of the CPU of any computer system is what ultimately determines how fast it can
potentially process data, memory is perhaps the most important limiting factor. As described earlier, a
requirement of Von Neumann-style machines is that program code and data are loaded into the system
memory, a task handled by the operating system and program code[36]. The speed of the memory and the
bus it uses are the factors that limit the speed at which data is transferred to the CPU. Depending on the
instruction, it takes one or several clock cycles to carry out. Clock rate is measured in hertz, i.e. the number
of clock cycles per second; a typical high-end CPU as of 2011, for example a model of the Intel i7, runs at
3.2 GHz (3,200,000,000 hertz)².
Memory controllers are used to control the flow of data to and from the CPU, and the bus itself is a path
of parallel wires that connects the memory to the CPU. The frequency of the memory is also measured in
hertz, and the latency of the memory and the width of the bus determine how much data can be transferred
per cycle. The operating speed of the memory³ and bus is typically a fraction of the speed of the processor.
Various techniques that enable several transactions per bus cycle are used to compensate. Even so, the
disparity increases with the speed of the processor. Additionally, when dealing with a multiprocessor or
multicore system that shares memory, the total bandwidth of the bus is also shared, as seen in figure 4.2.
Figure 4.2 Multiple processors sharing a bus
Just as it is required to load data from the slower storage memories into the main memory before the CPU
can access it, moving the data in the memory closer to the CPU is used to compensate for the disparity
between processing speed, memory speed and bus bandwidth. This concept, known as data locality, will be
described later in this chapter. CPUs use memory pipelines and special cache memories
to achieve this[6]. The cache memories are placed closer to or on the microprocessor, and are faster and
more expensive than the main memory, so smaller quantities are used. Different methods and caches are
used. We will now list the different kinds of memory that are used in a typical computer. See figure 4.3 for
an illustration.
Figure 4.3 This figure shows the different levels of memory in a computer
2 According to specifications found at http://www.intel.com/
3 The DDR3 type memory required for the i7 operates at between 100 and 266 MHz
Registers are the fastest memory as they are a built-in part of the CPU. A modern CPU has many registers,
categorised by several different types. General purpose or multipurpose registers may be used for
several purposes, as implied, or have special purposes depending on the instruction used. On an
Intel CPU for instance, these include accumulator and data registers for calculations, and registers for
counting, indexing and pointers[6]. Some special purpose registers, such as those used for the instruction
pointer, stack pointer and status, can only be changed by specific instructions or implicitly. Segment registers
are used for addressing and addressing using offsets. The size of the registers depends on their use
and the word size(s) supported by the CPU; single precision floating point instructions, for instance,
operate on 32-bit registers. Packed registers that consist of several pieces of data are used with special
instructions for vector calculations, or other specific purposes. Registers may be used to exploit data
locality, sometimes referred to as locality of reference, for instance by storing constants used in looped
calculations instead of fetching them again.
Cache The cache memories are fast types of memory that exist between the RAM and the CPU.
There are typically several layers of cache named after their proximity to the processor core. L1 (level
1) is the closest to the CPU, and L2 is typically slower and larger. Some processors also have an L3
cache. In the Intel i7 example CPU mentioned earlier, the processor has a split L1 cache with 32 KB for
instructions and 32 KB for data and a 256 KB L2 cache per core, and an 8 MB L3 cache shared between the
cores. Use, speed, capacity and placement differ between CPUs. Caches and especially the size of
the caches can have a tremendous impact on performance due to locality. Data that is reused can be
kept in a cache, so that when it is referred to, it will be close to the processor. In the case of a parallel
system or multicore CPU, each core may have reusable data relevant only to that core stored, for an
aggregated exploitation of locality.
RAM (Random Access Memory) is the main memory of a computer system, typically using DRAM
(Dynamic Random Access Memory) technology. Its size reflects the amount of data and program code
that the computer can contain without accessing storage, and thus what the processor can address.
As mentioned earlier, the speed of accessing data stored in memory depends on the speed of the
memory (latency), the clock rate and the width of the bus. Reading one byte from a random place in memory
will leave the CPU waiting, while a predictable memory access pattern such as reading sequential
memory addresses can help achieve high throughput. Several levels of cache and pipelining strategies
are used to realize this.
Storage: SSD, HDD and Network Permanent storage memory typically consists of a hard disk drive
(HDD), solid-state drive (SSD) or the storage of another computer accessed over a network. The
advantage is that the storage is non-volatile, relatively large and inexpensive. SSDs use non-volatile
memory chips to store data. A regular HDD works by rotating magnetic disks and moving heads
that read from the disks. To fetch data from the HDD, it needs to be located, the disk rotated to the
right position and the head moved there. Reading from or writing to an HDD is for this reason very
slow. SSDs are faster at locating data, but compared to the faster memories, their throughput is in the same
range as that of HDDs. Network storage depends on the storage devices of the external computer, as well as the speed
and congestion of the network. In any case, these storage devices represent a bottleneck: data that is
stored must be loaded into main memory as the first step in processing it.
Knowledge of the levels of memory is important when writing high performance programs. When writing
programs for mainstream CPUs some of the optimizations are done by the compiler and the processor
itself. In CUDA, data placement in memory and use of different memories for optimizing access must be
done by the programmer.
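The effect of access patterns on locality can be illustrated with two ways of summing the same matrix. The sketch below is our own illustration with hypothetical function names; it assumes a matrix stored row by row in a flat array. Both functions return the same sum, but the first reads sequential memory addresses while the second jumps `cols` elements between reads. On typical cached hardware the first is expected to be faster for large matrices, although the actual difference depends on the machine and is not measured here.

```c
#include <stddef.h>

/* Sum a rows x cols matrix stored row by row, walking along each row.
 * Consecutive iterations read consecutive memory addresses. */
double sum_sequential(const double *m, size_t rows, size_t cols)
{
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += m[i * cols + j];
    return sum;
}

/* Sum the same matrix walking down each column instead.
 * Consecutive iterations read addresses that are cols elements apart. */
double sum_strided(const double *m, size_t rows, size_t cols)
{
    double sum = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += m[i * cols + j];
    return sum;
}
```

Only the loop order differs between the two functions, which makes this a simple case where a change that is invisible in the result can still matter for performance.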
4.2.4 Parallel computers
In this section we have described the abstract concept of the Von Neumann machine and the way the CPU and
memory generally work in a mainstream computer. This is because most parallel computers today work
the same way. Parallel computer clusters are constructed using normal computers, and multicore CPUs
have made shared-memory parallel computers the norm. The Von Neumann bottleneck applies, and
memory speed, caches and data locality are important issues to consider, as the bottlenecks and issues can also
scale. While the CUDA architecture is not what we would consider a Von Neumann-style machine, and its
hardware and programming paradigm are not typical when compared to other parallel systems, all of this
is still relevant. In addition to being a peripheral installed on the device bus of a regular
computer, and thus an additional step away from the data, it shares many of the same performance
concerns illustrated here, as we shall see later.
When implementing parallel code, it is important to choose the correct algorithm, to partition the source
data correctly, and to write the code so that it scales. To write scalable code, it is necessary to consider potential
speed-up on existing or future parallel systems with more processors or resources. A parallel program can
be said to scale well if it is able to yield a high speed-up ratio given extra processors, or if it will be able to
handle larger problems, as suggested by Gustafson. To maximize performance, knowledge of exploitable
features or pitfalls of the platform is important, especially when writing programs targeting
several platforms, or when relying on the programming language or compiler. We are not asserting that tuning
parallel programs this way is impossible using C and OpenMP, but as we shall see later, CUDA does provide
some useful tools and methods to make explicit use of platform features.
4.3 Parallel programming
In this final section of the chapter we will look into parallel programming paradigms, concepts, designs
and considerations. We will touch on the major concepts, but subjects that do not
concern our problem at hand are addressed only briefly.
Within parallel programming two major memory models exist: “distributed memory” and “shared
memory”. The memory model is of interest as it dictates how processes interact with each other and how
the parallelism needs to be implemented in order to get the best results.
This section will be rounded off by describing major programming paradigms, including classical concepts,
pitfalls and designs, all of which are of interest to our cases.
Programming approaches
In parallel programming, the terms explicit and implicit are sometimes used to refer to two general
approaches[8, p. 8].
Implicit parallel programs are developed without full control over the parallelism, and it is assumed that
a preprocessor or the compiler handles the parallelization.
In explicit parallel programming it is up to the programmer to manage the parallelism. The programmer
is responsible for most of the parallelization, such as decomposing the program in order to distribute tasks
between processors, or deciding how to optimally place, fetch and handle memory. Explicit programming is used
when it is assumed that the developer is a better judge of how a program should be structured in order to
optimize the use of parallelism.
4.3.1 Message passing / distributed memory
The distributed memory model, or message passing model (MPM) as it is also called, typically deals with cluster
computing where several computers work on a problem over a network. Generally there can be one or several
tasks or processes running on several computers that each have their own local memory. In order for tasks
to exchange data with other tasks they make use of a message system to communicate, as illustrated in
figure 4.4. Each message needs a set of instructions, depending on which kind of data is being transferred,
to make sure the right task gets it and that the data is used correctly [37]. While passing or receiving
messages does not necessarily halt processes, message instructions might make processes wait for several
reasons, such as waiting for more data or waiting for another process to finish some task. In
message passing systems it is very important to make sure data is stored optimally and to minimize unnecessary
communication, as transferring extra data over the network or having idle processes waiting
for answers would introduce a huge overhead to the system. Because it is imperative that the messages are
encoded correctly and that the flow of data through the processes is handled correctly,
message passing programs are almost exclusively written explicitly [8].
MPI (message passing interface) is one of the most popular libraries for message passing systems [8] and
can be used from the Fortran, C and C++ programming languages. MPI has a vast library of message
passing functions and makes for portable programs that run on compliant MPI implementations [17].
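The principle of message passing can be sketched without a cluster. The example below does not use MPI; as a stand-in it uses two POSIX processes on one machine, created with `fork`, that communicate through a `pipe`. After the fork the worker has its own copy of the address space, so the only way the master learns the result is through the message. The function name `sum_in_worker` is hypothetical, and the sketch assumes a POSIX system and a non-negative sum, since -1 is used to signal failure.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Let a separate worker process sum the data and send the result back
 * as one message. Returns the sum, or -1 if anything fails. */
long sum_in_worker(const int *data, size_t n)
{
    assert(data != NULL || n == 0);

    int fd[2];                       /* fd[0]: read end, fd[1]: write end */
    if (pipe(fd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {
        /* Worker: works on its own copy of the data in local memory. */
        close(fd[0]);
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += data[i];
        ssize_t written = write(fd[1], &sum, sizeof sum);   /* send */
        close(fd[1]);
        _exit(written == (ssize_t)sizeof sum ? 0 : 1);
    }

    /* Master: read blocks until the worker's message arrives. */
    close(fd[1]);
    long result = 0;
    ssize_t got = read(fd[0], &result, sizeof result);      /* receive */
    close(fd[0]);

    int status = 0;
    waitpid(pid, &status, 0);
    if (got != (ssize_t)sizeof result)
        return -1;
    return result;
}
```

The blocking `read` in the master is an instance of the waiting described above: the master is idle until the worker has finished and sent its answer.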
Figure 4.4 This figure shows a conceptual view of the two memory models. Left: distributed memory, where each CPU
has its own local memory and uses messages to pass data and instructions. Right: shared memory, where several CPUs
share the memory space through a high speed bus and all can view and change this memory.
4.3.2 Shared memory
The shared memory model (SMM) deals with an architecture where several processors share and can access
a global memory space through a high speed bus. Changes to a memory location performed by one process are
visible to all other processes in the system, and so communication between processes is also done through
the shared memory space (illustrated in figures 4.4 and 4.2). The architecture means that communication
is more straightforward and faster in SMM than in MPM, due to the proximity of data. There are two
main concerns when using this architecture. The first is making sure it is “safe” to access a memory location,
i.e. determining if the right data is present or if the process accessing the memory is in conflict with other
processes (see 4.3.3). The second is that the scalability of a program is limited, as extra CPUs increase the load
on the bus as the traffic to the shared memory increases.
In computer science, we often refer to the smallest independent instruction stream or unit of processing
as a thread [17, p. 3]. In the shared memory paradigm the parallelism is usually achieved using threads
that execute different parts of a program, or work on different data using the same code.
Multithreading is a concept that allows independent threads to execute simultaneously. Multiple threads
can be created by a single process, regardless of the number of processors. If there are fewer processors than
threads, the application or operating system makes use of timesharing instead of concurrent execution.
Although executing independently, each thread then shares the resources of a processor. When a single
processor has several threads assigned to it, context switching is used to simulate concurrency. While one
thread executes, the others remain dormant, as shown in figure 4.5. Each thread has its own resources, such
as a program counter and a stack[9, p. 329], and managing this state makes context switching an expensive
operation. This makes it counterproductive in some situations to launch many more threads than there
are processors, as the overhead can start to consume the time gained from speed-up. Depending on
scheduling, however, a number of threads less than or equal to the number of processors will execute
concurrently.
Figure 4.5 Threads on three processors. Note that multiple threads, residing on the same processor or process, cannot
execute simultaneously
OpenMP
One of the best-known APIs for implementing shared memory multithreading is the Open
Multi-Processing (OpenMP) API. OpenMP is a multi-platform API for C, C++ and Fortran that supports most
architectures and operating systems. It provides programmers with a simple interface for writing scalable and
portable parallel programs for the shared-memory model.
OpenMP makes use of compiler directives, called pragmas, and library routines to parallelize code, and
environment variables to affect the parallelism at runtime. It provides an interface that makes it easy to
parallelize sequential code out of the box [9, p. 8-10]. Using the API, it is possible to write a program where
the parallelization is done somewhat implicitly, but with OpenMP the programmer does have the option
to be increasingly explicit as needed.
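As an illustration of the pragma-based style, the sketch below parallelizes a dot product with a single directive. It is a generic example and not taken from our n-body code; the function name `dot` is chosen here. When compiled with OpenMP support (for instance the `-fopenmp` flag in GCC) the loop iterations are divided among threads and the `reduction` clause combines the per-thread partial sums. A compiler without OpenMP support ignores the pragma and the loop runs sequentially.

```c
#include <assert.h>

/* Dot product of two vectors of length n. */
double dot(const double *x, const double *y, int n)
{
    assert(n >= 0);
    double sum = 0.0;

    /* Each thread accumulates a private partial sum, which OpenMP adds
     * to the shared variable `sum` when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

The number of threads can be set at runtime through the `OMP_NUM_THREADS` environment variable, without recompiling the program.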
In our CPU C implementations of the n-body problem, we use OpenMP to parallelize the sequential
versions that are then benchmarked and compared with CUDA implementations. We chose OpenMP as
the appropriate tool for the following reasons, amongst others:
• A relatively small amount of effort is required to parallelize sequential code, giving more immediate
results when experimenting with parallelization.
• It does not break existing sequential code, and can easily be disabled for testing, allowing us to
incrementally parallelize programs.
• OpenMP is portable and a standard part of the compilers we used, allowing us to write and test
programs on our own computer, later benchmarking and analyzing their performance on more
powerful hardware setups.
• Both CUDA and OpenMP natively support C and C++ programming, making benchmark
comparison between implementations viable.
4.3.3 “Concepts” of parallel programming
While the choice of memory model depends largely on the problem at hand, most concepts of parallelization
overlap and can be applied to either model. This section has been divided into three parts where we will
introduce the reader to parallelization architectures, concepts and considerations respectively. This section
focuses on concepts within parallel programming that are of interest to CUDA and our implementation of
our case programs.
Classical parallel programming architectures
When deciding how a specific task should be solved, there are several techniques within parallel computing
that may be chosen, depending on the problem at hand. It is quite probable that several of the designs can
be used in the same program or to solve the same task. Some may be more or less efficient than others,
though. A schematic illustration can be seen in figure 4.6. The four major designs are as follows[8]:
Master - slave In this design one master process decomposes a problem into smaller tasks that are
each assigned to another process, a so-called slave. Partial results are gathered from the slaves by the
master process, which in the end combines them into the final result. This method can achieve high
computational speedup, but the hierarchical design might cause communication to the master process to
become a bottleneck.
Divide and conquer This design is typically used in recursion and refers to a design where a problem is
divided into two subproblems and these subproblems may in turn be divided. The results are thereafter
combined by the processes that split them, finally solving the problem at the top level process.
Pipelining A design where a process gets data from the process to the left and, after performing its
designated task on the data, sends it to the process to the right.
Single Program, Multiple Data (SPMD) This technique makes use of processes that perform the same
task, but on different data. It is the most common design and scales well if there is an appropriate way of
dividing data between processors.
There exist other designs as well, and a more extensive project may well incorporate several of the designs
mentioned above.
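Of the designs above, divide and conquer maps most directly onto a short code example. The sketch below sums an array by recursively splitting it in two and combining the partial results. Using OpenMP tasks for the two halves is our own choice for the illustration, and the names `dc_sum`, `dc_sum_rec` and the `cutoff` parameter are hypothetical. Below the cutoff the work is done sequentially, since spawning a task for very little work costs more than it gains.

```c
#include <assert.h>
#include <stddef.h>

/* Recursively sum a[0..n-1]. Ranges of at most `cutoff` elements are
 * summed sequentially; larger ranges are split into two subproblems. */
static double dc_sum_rec(const double *a, size_t n, size_t cutoff)
{
    if (n <= cutoff) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    size_t half = n / 2;
    double left = 0.0, right = 0.0;

    /* Divide: the two halves are independent and may run as tasks. */
    #pragma omp task shared(left)
    left = dc_sum_rec(a, half, cutoff);
    #pragma omp task shared(right)
    right = dc_sum_rec(a + half, n - half, cutoff);

    /* Combine: wait for both subproblems before adding the results. */
    #pragma omp taskwait
    return left + right;
}

/* Entry point: one thread starts the recursion, and the other threads
 * of the team pick up the tasks it creates. */
double dc_sum(const double *a, size_t n, size_t cutoff)
{
    assert(cutoff >= 1);
    double result = 0.0;
    #pragma omp parallel
    #pragma omp single
    result = dc_sum_rec(a, n, cutoff);
    return result;
}
```

Without OpenMP support the pragmas are ignored and the function is an ordinary recursive sum, which makes the structure of the design easy to test on its own.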
Concepts
There are a number of concepts that are encountered when working with parallel programming. We will
present a few here together with their attributes.
Fork/join: When a process or thread spawns a number of sub-threads or sub-processes it is called a fork.
The program goes from executing a sequential stream of instructions into a parallel region where the
spawned threads perform work in parallel. After the spawned threads have completed their work, they
return to the spawning process, and when they have all returned, the sequential execution continues. This
16
4.3. Parallel programming
Figure 4.6 Schematic over the four presented designs. Top left: master-slave, master process divides tasks to slaves.
Top right: divide and conquer, each process divides the problem further to other processes. Bottom left: pipelining -
each process completes its designated task on data before sending it to the next process. Bottom right: SPMD - each
process performs the same task but on different data.
is called a join.
Synchronization: As threads or processes may well work out of step and at different speeds,
threads sometimes need to “wait up” for other threads. When this happens, individual threads wait at a
specific point in execution until all threads have reached it, before resuming execution. This is called a
synchronization and can be done for several reasons, but most often because the threads wait for the right
data to be present, retrieved by other threads.
Critical regions: Sometimes it is important inside a parallel region that only one thread works at a time.
This is called a critical region. See figure 4.7 for a schematic illustration of fork/join and critical regions.
There are several ways to implement critical regions, depending on the programming language or the problem
at hand.
One way of enforcing a critical region is using locks. If a resource within a program has a lock, a thread
that wants to make use of the locked resource has to wait for the lock to be freed. The thread holding the
lock can use the resource while other threads wait, competing for access to the lock once it is freed[9, p. 324].
Changing a value in a memory location means reading it, processing it, and writing it back. Because of how this
works, it is possible for one thread to read a piece of memory and a second thread to do the same right
afterwards. This is called a read/write (R/W) hazard. After the first thread has done its computation on the
data and written it back, the second thread overwrites this data with a value that was computed from the
original data. Hereby, the contribution of the first thread is lost.
It is imperative for some implementations that threads are prevented from attempting to read and write
the same value simultaneously. An operation that reads and writes a critical location securely, without risk
of those types of errors, is called an atomic operation. Atomic operations exist for various types of
operations, and their behavior is well defined. Atomically reading from or writing to a critical location
removes some of the speed-up achieved by parallelism, as the operations are essentially sequential. They may
come with additional overhead and should therefore only be used when necessary.
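The read/write hazard described above can be sketched deterministically in plain C by writing out the interleaving by hand. The two functions below are our own illustration (the names lost_update and serialized_update are hypothetical); the local variables reg_a and reg_b stand in for the private registers of two threads that both add 1 to a shared value.

```c
/* Two logical threads A and B each add 1 to a shared value.
 * Hazardous interleaving: both read before either writes back. */
int lost_update(int start)
{
    int shared = start;
    int reg_a = shared;   /* thread A reads */
    int reg_b = shared;   /* thread B reads the same, old value */
    reg_a += 1;
    shared = reg_a;       /* thread A writes back */
    reg_b += 1;
    shared = reg_b;       /* thread B overwrites A's contribution */
    return shared;
}

/* Same work, but each read-modify-write completes before the next
 * one starts, as a lock or an atomic operation would guarantee. */
int serialized_update(int start)
{
    int shared = start;
    int reg_a = shared;
    reg_a += 1;
    shared = reg_a;       /* thread A finishes first */
    int reg_b = shared;   /* thread B now sees A's result */
    reg_b += 1;
    shared = reg_b;
    return shared;
}
```

In the first function one of the two increments is lost, whereas the serialized version keeps both, at the price of executing the updates one after the other.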
Loop parallelism: The first and easiest part of a program to parallelize is the loops. Parallelizing loops is
a typical example of the SPMD design and the programmer needs to make sure the indexes are divided
between available threads (as can be seen by the code example below).
    int threads = number_of_threads;
    int id = get_thread_id(); // 0 <= id < threads
    for (int i = id; i < n; i += threads) {
        // do something with element i
    }
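To check that such a strided division covers every index exactly once, the loop can be simulated sequentially in C. The sketch below is our own illustration: strided_visit is a hypothetical helper that performs the iterations one thread would execute and records them in a visit counter.

```c
#include <stddef.h>

/* Performs the iterations that thread `id` out of `threads` would
 * execute for a loop over n elements, counting each visited index.
 * Returns the number of iterations done by this thread. */
size_t strided_visit(int id, int threads, size_t n, int *visits)
{
    size_t count = 0;
    for (size_t i = (size_t)id; i < n; i += (size_t)threads) {
        visits[i]++;   /* stands in for the real work on element i */
        count++;
    }
    return count;
}
```

Calling the function once for each id from 0 to threads - 1 leaves every entry of visits at exactly 1, and the threads differ by at most one iteration in the amount of work they receive.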
When parallelizing loops it is important to keep locality of reference and latency in mind when fetching
data, for example from arrays. Useful techniques are merging loops that share data elements (loop fusion),
or splitting loops so that each loop only accesses one array and thereby might be able to cache the entire
array. Another trick is to unroll the loop, that is, to include more operations in each iteration. Loop
unrolling helps keep the processor occupied while data is being fetched, thereby “hiding” the latency
associated with the fetch. Loop unrolling might also facilitate instruction level parallelism as well as data
reuse, increasing performance [9]. Loop unrolling is exemplified by the two code snippets below. First the
non-unrolled code:
    for (int i = 0; i < n; i++) {
        array_Sum[i] = array_A[i] + array_B[i];
    }
The loop, unrolled once, will look as follows:
    for (int i = 0; i < n; i += 2) {
        // remember: the arrays need to have a size divisible by 2
        array_Sum[i] = array_A[i] + array_B[i];
        array_Sum[i+1] = array_A[i+1] + array_B[i+1];
    }
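The requirement that the array size is divisible by 2 can be lifted with a small remainder step. The following C function is a sketch of one common way to do this; the name add_unrolled2 is our own and not part of any library.

```c
#include <stddef.h>

/* Element-wise addition, unrolled once, for arrays of any length n. */
void add_unrolled2(const float *a, const float *b, float *sum, size_t n)
{
    size_t i = 0;

    /* Main loop: two elements per iteration while a full pair remains. */
    for (; i + 1 < n; i += 2) {
        sum[i]     = a[i]     + b[i];
        sum[i + 1] = a[i + 1] + b[i + 1];
    }

    /* Remainder: at most one element is left when n is odd. */
    if (i < n)
        sum[i] = a[i] + b[i];
}
```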
Reduction: The process of taking the elements of an array and producing a smaller array, usually repeatedly
until an array of size one is reached, is called a reduction. Below is “pseudo” C code exemplifying
a parallel reduction. In the code we consider an application that makes use of 4 threads to add the 8
elements of an array together.
    float array[8] = {1.0, 2.5, -1.2, 9.2, 6.9, -0.1, 3.2, -0.7};
    int n = size_of_array(array) / 2;
    int id = get_thread_id(); // four threads with the ids 0, 1, 2, 3
    while (n > 0)
    {
        if (id < n)
            array[id] += array[id + n];
        n /= 2; // integer division
        synchronize(); // all threads wait here before the next iteration
    }
In each iteration of the reduction the n values in one half of the array are added to the
corresponding n values in the other half. After this is done we divide n by 2, effectively halving the array
and the number of threads needed for the operation, and redo the addition. Between iterations we need to
synchronize the threads, as we want to avoid threads using values in the array that have not yet been updated
by other threads. This reduction is done through the while loop in 3 iterations, after which all elements have
been added together in the first element of the array. Reductions are often seen in parallel programming and
can be used efficiently to avoid critical regions.
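The reduction can be traced on a single core by simulating the threads with an inner loop. The C sketch below is our own illustration under two assumptions: the array length is a power of two, and the end of the inner loop plays the role of the synchronization point. The name tree_reduce is hypothetical.

```c
#include <stddef.h>

/* Sequential simulation of the parallel reduction. Each value of
 * `id` in the inner loop corresponds to one thread of that step.
 * Assumes size is a power of two and at least 1. Overwrites array. */
float tree_reduce(float *array, size_t size)
{
    for (size_t n = size / 2; n > 0; n /= 2) {
        for (size_t id = 0; id < n; id++)
            array[id] += array[id + n];
        /* All "threads" of this step are done here: the barrier. */
    }
    return array[0];
}
```

For the 8 element array used above, the outer loop runs with n = 4, 2 and 1, which are the 3 iterations mentioned in the text.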
Figure 4.7 Schematic over the concepts of fork/join and critical regions. In a fork several processes are spawned and in
a join they are merged again. In a critical region, only one thread is working at a time.
Considerations concerning concurrency
The whole concept of several processes working simultaneously, sharing the workload of some task and
the problems introduced by this is called concurrency. Some of the major concerns regarding concurrency
will be summarized below. Considering the points made below is a good way to achieve performance in a
parallel program.
Load balancing: Make sure that all threads are occupied as much as possible by giving them the same
amount of work. If one or more threads are unoccupied, the performance will drop correspondingly.
Race conditions: The event of two processes trying to access the same shared resource at the same time
is called a race condition. This can lead to several problems, as described in the section on critical regions.
The solution is to make this code critical, for example by using locks or atomics, or better still, to rethink
the implementation.
When threads do not make any progress, perhaps while waiting for data they rely on from other threads
or because other threads have higher priority, it is called starvation. Threads that are created but never
used are said to be dormant. To avoid dormancy, one has to make sure that “sleeping” threads know when to
start executing their task.
If (at least) two threads are dependent on each other at the same time to continue an execution path, this
is called deadlock. Two deadlocked threads will never execute to the end.
Parallel overhead: When executing a parallel region there is often some overhead introduced to
the program. Changing between single and parallel execution might slow the program and another
implementation might be worth considering.
5 CUDA
Compute Unified Device Architecture, or CUDA, is the name given to NVIDIA's parallel computing
platform and programming model. It is a complete computing platform with a hardware architecture
specification, supported by extended versions of existing programming languages, an API and a runtime
environment.
The CUDA hardware is based on GPU technology. The GPU, or graphics processing unit, is a term coined
by NVIDIA in 1999[31]. Around this time, VGA (video graphics array) controllers had been advancing to
support acceleration of 2D and 3D graphics, and the GPU introduced an integrated processing unit with
support for the graphics pipeline of a traditional high-end workstation, hence the need for a term. Since
then, GPUs have steadily become more general, replacing fixed function logic with programmable
functionality[33].
The first uses of GPUs for general purpose computing (GPGPU) were achieved by exploiting graphics
programming APIs that interfaced with the hardware driver, such as Microsoft's DirectX libraries and the open
standard OpenGL. This was made possible by the well-defined behavior of these APIs. The disadvantage was
that the user had to have intimate knowledge of the APIs and the ability to express programs in terms of
graphics[31].
To address the interest in and issues with GPGPU programming, NVIDIA defined the unified device
architecture and released CUDA C, a version of standard C with extensions to support GPU programming[31].
The first CUDA capable device, representing CUDA capability v1.0, was the G80 architecture, first released
in 2006. Since then each new CUDA-based architecture has added features resulting in updates of the
CUDA capability specification, followed by support in CUDA C.
In this chapter we will highlight features of the hardware, and describe the concepts of the programming
model. This includes hardware features with implications for use as well as specific functionality.
5.1 Introduction
This section introduces basic concepts related to CUDA and CUDA programming. Following this section,
the rest of the chapter will focus on GPU hardware and CUDA programming in more detail.
CUDA C is designed for parallel computing, and allows the programmer to program and to schedule
massive amounts of threads on a compatible GPU.
The idea behind the CUDA programming paradigm is that programs are structured in such a way that
some code is executed on the CPU of the host, while small functions called kernels run in threads on the GPU.
The code to be executed by the CPU of the host is simply referred to as host code. Host code schedules
kernels to be executed on the GPU, which in the context of CUDA is referred to as the device. A kernel can
be launched using thousands or even millions of “lightweight” threads that are to be run on the device.
CUDA threads are thought of as lightweight for the following reasons:
A CUDA thread has almost no creation overhead, meaning that thousands can be created quickly.
On a low level, handling of CUDA threads is hardware accelerated. Much of the performance is also
obtained by automatic resource handling and sharing done on a hardware level. The programmer is
encouraged to “think big” and schedule threads liberally and as needed. More threads can hide latency
caused by data fetching, leading to performance gains.
The threads are also lightweight in the way that the programmer has less control over them and their order
of execution. Scheduling of the execution of threads and blocks of threads is also handled on the hardware.
In a kernel the same code is executed by each thread. Consider a normal sequential C program performing
vector addition in the code snippet below.
    float vectorA[3] = {2.0, -1.5, 2.1}, vectorB[3] = {1.0, 0.5, -1.7};
    float vectorC[3];
    for (int i = 0; i < 3; i++)
    {
        vectorC[i] = vectorA[i] + vectorB[i];
    }
In CUDA, executing the code above on the device would make every thread perform the whole loop, overwriting
each other's results. In order to make use of the parallel structure, each thread is assigned an id that can be
used for the parallelization. The id can specify which work a thread should perform or be used conditionally.
In efficient CUDA programs, data is arranged so that each thread can perform work on it naturally. An
example of CUDA code that does the same as the above sequential C program is listed below:
    // simplified: in a real kernel the vectors would reside in device memory
    float vectorA[3] = {2.0, -1.5, 2.1}, vectorB[3] = {1.0, 0.5, -1.7};
    float vectorC[3];
    int id = threadIdx.x;
    if (id < 3)
        vectorC[id] = vectorA[id] + vectorB[id];
The loop has been removed and each thread with an id below 3 does its part of the addition. In the example
the use of CUDA could seem a bit overkill, but the addition could in principle be done for arrays of almost
arbitrary size. In that case we would just launch as many threads as are needed.
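The per-thread view can be emulated on the CPU in plain C, which makes the role of the id and of the bounds check explicit. In the sketch below, which is our own illustration, vec_add_thread is the work of one thread and launch_vec_add is a hypothetical stand-in for a kernel launch that simply runs the threads one after another.

```c
/* Work of a single thread: handle element `id` if it exists. */
void vec_add_thread(int id, int n, const float *a, const float *b, float *c)
{
    if (id < n)                 /* surplus threads do nothing */
        c[id] = a[id] + b[id];
}

/* Hypothetical stand-in for a kernel launch: on a GPU the threads
 * would run in parallel, here they are executed sequentially. */
void launch_vec_add(int threads, int n, const float *a, const float *b, float *c)
{
    for (int id = 0; id < threads; id++)
        vec_add_thread(id, n, a, b, c);
}
```

Launching more threads than there are elements is harmless because of the check on id, which is why the thread count can be rounded up freely.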
The GPU consists of multiprocessor units that each contain a number of processor cores that execute
threads in parallel. CUDA programs are portable between compatible versions of GPUs, and regardless of
the number of multiprocessors or cores, the same number of threads can be scheduled. In order to facilitate
this, CUDA introduces a level of abstraction. Threads are conceptually divided into blocks. Each block of
a specific kernel launch contains the same number of threads, and the blocks are arranged
in a so-called grid. The grid consists of all the blocks, and thereby all the threads, of one kernel. A
conceptual view of this is seen in figure 5.1 [30].
On making a kernel call the blocks are divided up and assigned to the multiprocessors. This way CUDA
makes sure a program is portable across hardware versions, and is able to execute regardless of the
processor count or resources of the GPU. As the GPUs evolve, however, new functionality and features are
introduced that are not available on older devices. We will return to this later in the chapter.
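The arithmetic behind the grid and block abstraction can be sketched in C. The two helper functions below are our own illustration (their names are hypothetical): the first computes how many blocks are needed to cover n elements, the second computes the global index of a thread in a one-dimensional grid, which in CUDA C is conventionally written as blockIdx.x * blockDim.x + threadIdx.x.

```c
/* Number of blocks needed so that blocks * threads_per_block >= n. */
unsigned blocks_needed(unsigned n, unsigned threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;  /* round up */
}

/* Global index of a thread in a one-dimensional grid. */
unsigned global_index(unsigned block_idx, unsigned block_dim, unsigned thread_idx)
{
    return block_idx * block_dim + thread_idx;
}
```

Because the number of blocks is rounded up, the last block may contain threads whose global index is n or larger; these are the threads that a bounds check inside the kernel has to exclude.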
CUDA operates with an execution unit called a warp. A warp consists of 32 threads that in principle
have their instructions executed simultaneously on the multiprocessor. If one or several threads execute
conditional code that differs in code path from that of other threads in the warp, these different execution
paths are effectively serialized, as the threads need to wait for each other. This phenomenon is referred to as
thread divergence.
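A deliberately simplified cost model of thread divergence can be written in C: if the threads of a warp follow k different code paths, the warp needs k serialized passes. The sketch below is our own illustration, not a description of the actual hardware scheduler; the function name serialized_passes is hypothetical, and path ids are assumed to lie in the range 0 to 31.

```c
#define WARP_SIZE 32

/* path[i] identifies the code path taken by thread i of the warp
 * (assumed to be in 0..WARP_SIZE-1). Returns the number of distinct
 * paths, i.e. the number of serialized passes in this simple model. */
int serialized_passes(const int path[WARP_SIZE])
{
    int seen[WARP_SIZE] = {0};
    int passes = 0;

    for (int i = 0; i < WARP_SIZE; i++) {
        if (!seen[path[i]]) {
            seen[path[i]] = 1;   /* first thread on this path */
            passes++;
        }
    }
    return passes;
}
```

In this model a warp where all threads agree costs one pass, a simple if/else split costs two, and the worst case of 32 different paths costs 32.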
As we will show in this chapter, CUDA comes with rich possibilities as well as some caveats, that the
programmer should be aware of to fully utilize the parallelism offered by the GPU hardware through the
use of CUDA.
Figure 5.1 The CUDA architecture on a conceptual level. The grid is divided into blocks that each consists of a number
of threads. Picture taken from the CUDA programming guide[30].
5.2 CUDA hardware
CUDA capable hardware is found in the NVIDIA GPUs released since the introduction of the G80
architecture. These GPUs are often found in mid to high-end laptops and workstation PCs, making them quite
ubiquitous. Versions that are integrated into the motherboard may share the main memory of a system,
which reduces their performance. Other mid to high-end versions come in the form of device cards
that are inserted into the system, and depending on the system, several cards may be installed.
Special high-end cards exist that are designed solely for executing GPGPU programs and come without any
video interface ports, such as the Tesla series. In any case, the device is connected to a bus on the system
typically used for graphics peripherals, such as the PCI-Express bus[31]. The latency and width of the bus
determine the rate of transfers from the memory on the host computer to the memory of the device card.
As mentioned previously, CUDA hardware comes with a CUDA capability version, which entails changes in
the configuration of processors and memory, or the addition of new features that can be used for programming.
For our experiments we have used the following products: Tesla C1060 with CUDA capability v1.3 and
GeForce GTX 480 with CUDA capability v2.0. The latter model is of the architecture codenamed “Fermi”,
which is the architecture we refer to unless otherwise mentioned.
5.2.1 GPU
A CUDA device contains a GPU consisting of a hierarchy of processors, which together represent the
combined processing capability. The top level processor is called a Streaming Multiprocessor (SM) and contains
a number of Streaming Processors (SP). The SP cores are also called CUDA cores. For CUDA capability v2.0
GPUs the SM consists of 32 SPs[31]. For v1.x this number is 8, and for the new v2.1 it is 48. The GTX 480
GPU that we will use later has 15 SMs, each with 32 SPs, for a total of 480 CUDA cores. See figure 5.2 for
an overview of the SM architecture.
The way the GPU works is closely related to how it is used when writing GPGPU programs. The GPU
executes one or more kernel grids, where a kernel represents a function that is run by every thread of the grid.
Each SM executes one or more blocks of threads, one warp at a time. A warp refers to 32 threads that are
executed together. A block can consist of a maximum of 1024 threads. The GPU contains what is called a
GigaThread scheduler, which distributes blocks to the SM thread schedulers. Each SM contains 16 load/store
units, enabling source/destination addresses for a half warp to be calculated per clock cycle[31]. This can be
used when reading from or writing to caches or memory.
Figure 5.2 The architecture of a Fermi streaming multiprocessor. Picture taken from the Fermi whitepaper[31].
Each SP has its own integer arithmetic logic unit (ALU) and floating point unit (FPU). The FPU supports
both single and double precision, the latter with significantly improved performance compared to the
double precision performance of earlier CUDA devices with double precision support[31]. The measure of
single and double precision floating point performance is that of multiply-add (MAD) instructions per
clock. As the name implies, a multiply-add instruction performs those two calculations at
once, meaning that the potential number of floating point operations is double that amount. Fermi
additionally supports fused multiply-add (FMA) instructions for both single and double precision floating point
operations. FMA instructions do no rounding or truncating in the intermediate stage, thereby decreasing
loss of precision. In the earlier architecture FMA was only supported for double precision calculations.
Fermi supports 256 FMAs per clock versus 30 for the previous architecture on double precision. The
corresponding numbers for single precision are 512 FMAs versus 240 MADs per clock. For notes on floating
point values and precision see the appendix included in chapter 12.
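The relation between multiply-add instructions per clock and floating point operations per clock can be written out as a small C sketch. The function names are our own, and the clock frequency is a hypothetical parameter, since no clock rate is given in the text above.

```c
/* Each multiply-add counts as two floating point operations. */
unsigned flops_per_clock(unsigned fma_per_clock)
{
    return 2u * fma_per_clock;
}

/* Theoretical peak in GFLOPS for an assumed clock rate in GHz.
 * clock_ghz is an illustrative parameter, not a figure from the text. */
double peak_gflops(unsigned fma_per_clock, double clock_ghz)
{
    return flops_per_clock(fma_per_clock) * clock_ghz;
}
```

With the figures quoted above, 512 single precision FMAs per clock correspond to 1024 floating point operations per clock, and 256 double precision FMAs to 512.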
In addition to containing 32 SPs, each SM contains 4 special function units (SFU), which contain
hardware support for performing sine, cosine, square root and reciprocal square root, amongst other
operations. Earlier CUDA hardware contains 2 SFUs per SM. Each SFU performs one of these operations for one
thread in one clock cycle, which means that an entire warp will execute the operation in 8 clocks. The SFUs are
separate from the SPs, meaning that the dispatch unit can direct other instructions to the SPs while waiting[31].
The Fermi architecture comes with 2 warp schedulers and 2 dispatch units per SM, where earlier architectures
have 1 of each. This enables it to execute instructions from two warps simultaneously. Additionally, many
instructions can be issued concurrently, such as single precision floating point and integer instructions.
Double precision instructions cannot.
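The figure of 8 clocks per warp follows from dividing the warp size by the number of units. As a rough sketch, assuming one operation per unit per clock and ignoring pipelining, this can be expressed in C as follows; the function name cycles_per_warp is our own.

```c
/* Clocks needed for all threads of a warp to pass through a group of
 * identical units, assuming one operation per unit per clock. */
unsigned cycles_per_warp(unsigned warp_size, unsigned units)
{
    return (warp_size + units - 1) / units;   /* round up */
}
```

For a warp of 32 threads, 4 SFUs give 8 clocks, the 2 SFUs of earlier hardware would give 16 under the same assumption, and 16 load/store units give 2 clocks, in line with a half warp being served per clock.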
When considering the architecture of a GPU, it would appear to fit best within the SIMD classification.
However, in CUDA literature the classification is referred to as single instruction, multiple threads (SIMT)[33,
A.4]. In SIMT, a warp of 32 threads can be executed synchronously as long as the threads do not diverge
because of conditional statements. But it is still possible to execute the threads independently when they
do diverge. This has its advantages. Programs do not need to be written so that they always handle data
the same way, and the GPU has hardware barrier synchronization, which reduces overhead significantly.
Additionally, different warps may be executed simultaneously, making a CUDA GPU more like a MIMD
architecture with the added benefits of SIMT.
5.2.2 Memory
In this subsection we will describe the memory configuration of a GPU. This review of the architecture is
based on the Fermi architecture introduced by NVIDIA in 2010. The architecture supports compute
capability 2.0 (described in the following section (5.2.3)), and the graphics card, GeForce GTX 480, that we use
for one of our cases makes use of the Fermi architecture. As a graphics card is first and foremost designed
for graphics processing, the whole conceptual design is characterized by this. Especially due to numerous
graphics workloads, such as pixel writes, buffer reads and attribute reads, there is a high demand on memory
transfers within a GPU[33, p. A36]. To transfer these massive amounts of data, several DRAM chips are used
so that the bus bandwidth can be fully utilized. The largest part of a GPU's memory is made up of
DRAM, which has some characteristics that need to be addressed. The data within a DRAM is divided
into data banks, which contain several rows of data. While opening one row for reading data takes dozens
of clock cycles, subsequent reads to the same row only take 4 cycles. The GPU's memory controller holds
back requests to a given row until enough requests are present, and then opens the row to transfer all the data
at once. This is done in order to fully utilize the data bus through DRAM data locality [33, p. A37]. This can
result in higher latency, as a data fetch request for a row might wait for more requests before completing.
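The benefit of grouping requests by row can be illustrated with a small cost model in C. The sketch is our own simplification for a single bank with one open row at a time: a request to a row other than the open one pays an opening cost, a request to the open row pays a lower hit cost. The function name dram_cycles is hypothetical, and the costs are parameters; the text only says "dozens" of cycles for opening a row, so any concrete opening cost is an assumption.

```c
#include <stddef.h>

/* Cycles to serve `count` requests in the given order. rows[i] is
 * the (non-negative) row of request i. Only one row is open at a
 * time in this model. */
unsigned long dram_cycles(const int *rows, size_t count,
                          unsigned open_cost, unsigned hit_cost)
{
    unsigned long total = 0;
    int open_row = -1;               /* no row open initially */

    for (size_t i = 0; i < count; i++) {
        if (rows[i] != open_row) {
            total += open_cost;      /* open a new row */
            open_row = rows[i];
        } else {
            total += hit_cost;       /* row already open */
        }
    }
    return total;
}
```

With an assumed opening cost of 40 cycles and the hit cost of 4 cycles from the text, the interleaved request order 0, 1, 0, 1 costs 160 cycles in this model, while the grouped order 0, 0, 1, 1 costs 88.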
Apart from the main memory consisting of the DRAM blocks, the GPU makes use of a variety of
specialized global and local memory caches, as well as registers within the streaming multiprocessors. Below is a
general description of the different types of memory within the GPU. The schematic in figure
5.3, which is a conceptual overview of a Fermi card's memory architecture, will help the reader throughout
the section.
Host Memory: The host memory does not reside on the device card. It is a term used to refer to the RAM
of the host. A Fermi device card is designed for use with a PCIe 2.0 × 16 bus, which has a peak bandwidth
of 20 GB per second. Even so, data transferred between host and device comes with a high latency.
Figure 5.3 A conceptual view of the memory structure on a GPU. It shows two (of many) multiprocessors, running two
threads, and how different memory spaces are accessible and divided between these.
Global Memory: The global memory, or main memory, of the graphics card is in our case
made up of 1536 MB (1 MB = 1024*1024 bytes) of GDDR5 RAM. GDDR5 RAM is a DRAM type designed for
applications that require a high bandwidth. It has a bus width of 384 bits and can make 32-, 64-, or
128-byte memory transactions, with a theoretical memory bandwidth of 177.4 GB/s. Due to DRAM
bandwidth optimizations, and because each multiprocessor works independently, memory accesses are not
coherent between multiprocessors. All threads on the GPU are able to read from all memory
spaces, but all reads and writes are, unless otherwise specified, unordered between SMs. This is a
consequence of the GPU architecture.
Due to the nature of graphics processing, the memory architecture of a GPU favors high data
throughput at the expense of latency [33, p. A37].
L2 cache: All SMs share the same 768 KB unified level 2 cache. It services all operations: load, store, as
well as texture. It has not been possible for us to determine the type of memory used, nor the speed
of the bus for this memory (see footnote 1).
Texture Memory: In graphics it is important to texture different objects with some surface. A GPU has
a special cache that takes this into consideration, called the texture cache. Each SM has a read-only
texture cache, that points to a texture memory space within device memory. The texture cache is
optimized for 2D spatial locality. Threads of the same warp that read texture or surface addresses
that are close together, in what we can refer to as two dimensions, will achieve better performance
using the texture cache[33, p. A40] [30, p. 167].
1 The sources consulted include the Fermi whitepaper[31], the CUDA programming guide[30], and NVIDIA's homepage, http://www.nvidia.com/.
Constant Memory: Just as with texture memory, each SM has a read-only cache that points to global
memory. This is the constant memory cache, which points to the constant memory space. Requests to the
constant memory from a single warp are split into as many requests as there are different memory addresses in
the request. The requested data is thereafter fetched into the constant cache, from where it can be read
by the SMs, hereby minimizing bandwidth usage [30, p. 170]. Constant memory is thus useful
when it is known that few memory locations will be accessed each clock cycle.
L1 cache and Shared Memory: Present on each SM. On the Fermi architecture card, there is 64 KB of level 1
cache made up of SRAM. SRAM, or static random-access memory, is an expensive type of RAM that
is much faster than DRAM. This cache is divided into two parts: a normal cache and a user managed
cache called the shared memory [30, p. 92]. Depending on the program, the L1 cache can be set to
either 16 or 48 KB, where the size of the shared memory is the remainder. If the programmer knows
that fast transactions of memory between threads in a block are needed, shared memory use is a possibility.
If the programmer is in need of a general speedup of all data transfers, a larger L1
cache can be chosen. The cache makes use of an on-chip high bandwidth bus [33, p. A39], and therefore reads
from and writes to this memory are very fast. The limitations of the shared memory are that it is SM
specific, that it has a limited size, and that it is only alive during a kernel call. One also has to look out for
bank conflicts. A memory bank is a division of the shared memory. A bank conflict occurs if two or more
threads access bytes within different 32-bit words belonging to the same bank, in which case the accesses
are serialized.
Registers: The registers on the GPU are general purpose. Each SM has a number of registers to share
between its cores. If too much register space is used by a kernel, the number of cores per SM that can
be utilized is lowered, or local memory (described below) is used instead [23, p. 91]. The GeForce GTX
480 has 32,768 32-bit registers on each of the 15 SMs.
Local Memory: Within the global memory, memory that is reserved for specific threads is referred to as
local memory. These locations are used if more data than fits is attempted to be stored in registers and extra
storage is needed. This is also referred to as register spilling. The local memory suffers from the same
latency as global memory access does [30, p. 96][33, p. A40].
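The definition of a bank conflict given in the shared memory description can be turned into a small checker in C. The sketch below is our own illustration and rests on one assumption that is not stated in the text: that shared memory is organized in 32 banks, with successive 32-bit words assigned to successive banks. The function name bank_conflict_degree is hypothetical.

```c
#include <stddef.h>

#define NUM_BANKS 32   /* assumption: 32 banks of 32-bit words */

/* byte_addr[i] is the shared memory byte address accessed by thread i.
 * Returns the largest number of distinct 32-bit words requested from
 * any single bank: 1 means conflict free, k means a k-way conflict. */
unsigned bank_conflict_degree(const unsigned *byte_addr, size_t threads)
{
    unsigned worst = 0;

    for (unsigned bank = 0; bank < NUM_BANKS; bank++) {
        unsigned distinct = 0;
        for (size_t i = 0; i < threads; i++) {
            unsigned word = byte_addr[i] / 4;
            if (word % NUM_BANKS != bank)
                continue;
            /* Count a word only the first time it appears. */
            int seen = 0;
            for (size_t j = 0; j < i; j++) {
                if (byte_addr[j] / 4 == word) {
                    seen = 1;
                    break;
                }
            }
            if (!seen)
                distinct++;
        }
        if (distinct > worst)
            worst = distinct;
    }
    return worst;
}
```

Under this assumption, 32 threads reading consecutive 32-bit words are conflict free, a stride of two words gives a 2-way conflict, and a stride of 32 words sends every thread to the same bank. Threads reading the same word do not conflict according to the definition above, since they do not access different words.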
5.2.3 CUDA capability
As new GPUs are released, they feature increased capacity and new functionality. Programs can be written
to take advantage of extra resources or to incorporate new functionality. CUDA comes in several versions,
each mapping to a generation of graphics cards. Each generation of CUDA GPUs is identified with a number,
called the CUDA compute capability, which refers to a version of CUDA with a certain set of capabilities.
Older GPUs will not work with a program compiled for a higher CUDA compute capability.
CUDA is backwards compatible, however: programs written for a lower compute capability run on newer GPUs.
At the time of writing there exist 6 versions of CUDA
compute capability: 1.0, 1.1, 1.2, 1.3, 2.0 and 2.1. Each revision of the compute capability comes with a
number of added or changed features, and a chart is the easiest way to illustrate them. We have included
the chart found in the NVIDIA Programmers Guide [30] in the appendix in chapter 13.
As mentioned in section 5.2 we have access to two semi-recent generations of GPUs which we use in our
respective cases. To highlight and summarize some significant differences between the 1.3 and 2.0 versions
which we are using [31][30]:
• More and faster atomic operations.
• More shared memory and the possibility to configure the ratio between L1 cache and shared memory.
• Increased speed on double precision floating point arithmetics.
• Increased number of registers and threads for each SM.
• Increased accuracy of certain floating point operations as well as implementation of all standardized
rounding methods for floats.
5.3 Parallel programming in CUDA
The default CUDA programming language is the CUDA version of C/C++. We will only be focusing on
the C part, as that is all we use in device code and because C++ in device code has additional rules and
restrictions that apply[30, p. D]. Likewise, besides describing general concepts, extensions, and some API
usage, we will limit ourselves mainly to what we have used during our study of CUDA.
5.3.1 CUDA C
The CUDA version of C extends standard C in a number of ways to support the programming model and
device features. These extensions allow use of the device, but come with some restrictions compared to
standard C. Some of these restrictions do not apply to CUDA capability v2.0 and above devices, only those
below, v1.x.
Function qualifiers
Additional qualifiers for functions are used to specify if and how functions should be compiled for host or
device.
The __global__ function qualifier indicates a device function that is callable from the host, and only
the host. Global functions must have return type void, and cannot be used recursively. Parameters can be
transferred from the host to the device. In CUDA v1.x parameters are transferred to shared memory, and
are limited to 256 bytes total. In v2.0 they are transferred via constant memory and are limited to 4 KB[29].
Function pointers to global functions are allowed in host code. A call to a global function must specify
a launch configuration, which will be described in the section about kernels, 5.3.3.
The __device__ function qualifier is used to mark a function as a device function. Device functions are
compiled to device code, and are only callable from other device functions. In CUDA v1.x device functions
are inlined by default and cannot be used recursively. The __noinline__ qualifier can be used to request
that the compiler does not inline a function, but the compiler is not obliged to honor the request. In CUDA
v2.0 device functions can be called recursively[29].
The __host__ function qualifier can be used in combination with the device qualifier to indicate
that a function should be compiled for and exist in both host and device code. Used alone it only has a
cosmetic effect, as regular function definitions are considered host functions by default.
Variable qualifiers
Like functions, CUDA offers additional qualifiers to indicate types of variables, which differ both in scope
and use. They have a few things in common. They cannot be used on structs or on local variables within
host functions. They are also only allowed at source file scope.
The __device__ variable qualifier is used to declare device variables. By default, device variables are
stored in the main global memory, but this can be changed by using one of the other qualifiers. Device
variables are accessible directly from device functions or from host code using the CUDA runtime library.
They are alive as long as the program is.
The __constant__ variable qualifier is used alone or with the above device qualifier (optional, the
effect is the same). It works the same way as device variables stored in global memory, but is suitable to
use for constants as the constant part of memory is cached. An additional restriction is that it can only be
written to by CUDA runtime library calls from host code[29].
The __shared__ variable qualifier, also optionally used with the device variable qualifier, is used to
declare a variable in shared memory. A variable in shared memory has the lifetime of a block (see section
5.3.3), and is only visible from threads in the block. An additional extern keyword can be used when
dynamically allocating shared memory, but more on that in section 5.3.3.
The volatile qualifier is not unique to CUDA C, but it is needed more often there. In CUDA programs,
threads often work on the same array, reading and writing data that other threads use. The compiler may,
however, optimize away repeated accesses to the same memory location and reuse the value already held in
a register. The volatile qualifier prevents this. It only solves the issue of unwanted optimizations; the
various synchronization functions must still be used to ensure that all threads see the same data.
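As a minimal sketch of the variable qualifiers described above, the following code declares a device variable, a constant variable and a shared array, and accesses the first two from the host through the runtime library. All names (hitCount, scaleFactor, scaleKernel, runScale) are hypothetical, and the fixed block size of 256 is an assumption of the sketch. The shared array serves no performance purpose here; it is included only to show the qualifier.

```cuda
#include <cuda_runtime.h>

__device__   int   hitCount;     // global memory, lifetime of the application
__constant__ float scaleFactor;  // cached constant memory, written only from the host

// Hypothetical kernel: stages input in shared memory, scales it, and counts
// the elements whose scaled value is positive.
__global__ void scaleKernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];  // one copy per block; assumes blockDim.x <= 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];
        float r = tile[threadIdx.x] * scaleFactor;
        out[i] = r;
        if (r > 0.0f)
            atomicAdd(&hitCount, 1);
    }
}

// Host wrapper: returns the number of positive results, or -1 on a CUDA error.
int runScale(const float *in_h, float *out_h, int n, float scale)
{
    float *in_d = 0, *out_d = 0;
    int zero = 0, hits = -1;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&in_d, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&out_d, bytes) != cudaSuccess) { cudaFree(in_d); return -1; }
    cudaMemcpy(in_d, in_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scaleFactor, &scale, sizeof(float));  // write the constant
    cudaMemcpyToSymbol(hitCount, &zero, sizeof(int));        // reset the device variable
    scaleKernel<<<(n + 255) / 256, 256>>>(in_d, out_d, n);
    cudaMemcpy(out_h, out_d, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(&hits, hitCount, sizeof(int));      // read the device variable
    cudaFree(in_d);
    cudaFree(out_d);
    return hits;
}
```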
Data types
CUDA has support for vector types, in which several values of the same integer or floating point type
are packed together. For the (un)signed versions of char, short, int and long, and for float, vectors with
1, 2, 3 and 4 values exist. The values are accessed individually by using the x, y, z and w fields respectively.
For double and for (un)signed longlong, vectors with 1 or 2 values exist. An additional type, dim3, which is
based on uint3, exists for the purpose of defining dimensions. Any component of a dim3 that is not specified
is set to 1[29]. It is used, for instance, when launching multidimensional kernel configurations.
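The vector types and dim3 can be tried out on the host as well as on the device. The sketch below uses the make_float4 helper from the CUDA headers; the two function names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Sums the four components of a float4 built with the make_float4 helper.
float componentSum(float x, float y, float z, float w)
{
    float4 v = make_float4(x, y, z, w);
    return v.x + v.y + v.z + v.w;
}

// Total number of threads described by a two-dimensional dim3. The z component
// is never given and therefore defaults to 1.
unsigned int threadsIn2D(unsigned int nx, unsigned int ny)
{
    dim3 block(nx, ny);
    return block.x * block.y * block.z;
}
```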
5.3.2 Built-in variables and device querying
Device functions all have access to a set of built-in variables used to determine their location in the grid
and the dimensions of the grid. gridDim is of type dim3 and contains the dimensions of the grid. blockIdx is
of type uint3 and contains the index of the current block within the grid. blockDim is of type dim3 and
contains the block dimensions. threadIdx is of type uint3 and contains the index of the current thread
within the block. warpSize is of type int and contains the number of threads in a warp. All these
built-in variables come with the restriction that they must not be assigned to, nor have their address taken[29].
Their use is shown in 5.3.3.
On the host side, functions in the CUDA runtime and the cudaDeviceProp struct are used to make runtime
decisions. The cudaGetDeviceCount() function is used to get the number of CUDA GPUs available in
the system, and cudaGetDeviceProperties() is used to retrieve a cudaDeviceProp struct[35].
// requires <stdio.h>, <stdlib.h> and <cuda_runtime.h>
int devCount = 0;
cudaGetDeviceCount(&devCount);
if (devCount > 0)
{
    cudaDeviceProp devProp;
    cudaGetDeviceProperties(&devProp, 0);
    printf("Device name: %s\n", devProp.name);
    int ccMajor = devProp.major; // major revision: ccMajor.x
    int ccMinor = devProp.minor; // minor revision: x.ccMinor
    printf("this device has CUDA capability version %d.%d\n", ccMajor, ccMinor);
    if (ccMajor < 2)
    {
        printf("CUDA capability version 2 or above required. Exiting.\n");
        exit(-1);
    }
    // execute rest of program
}
In the above example we check if any devices are available, print out the name of the first device, and check
if it is of compute capability 2 or above. The major field of the struct (stored in ccMajor) is used to
determine the major version, while the minor field (stored in ccMinor) goes unused in this example. It could
be used to make a runtime decision based on features specific to compute capabilities later than 2.0. The
cudaDeviceProp struct contains additional information about the device, such as multiprocessor (MP) count,
amount of global memory, and maximum threads and shared memory available per block, which can be used
during runtime. Many of the listed values correlate to the compute capability as described in 5.2.3. This
can be used, for instance, to launch a kernel with a dynamic configuration, or, as in the example, to abort
execution on an incompatible device.
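A dynamic launch configuration based on the device properties could be sketched as follows. The helper names are hypothetical, and the cap of 256 threads per block is an arbitrary choice made for this sketch, not a recommendation from the sources cited here.

```cuda
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements with the given block size.
int blocksNeeded(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: picks a one-dimensional launch configuration for n
// elements on device 0. Returns false if no usable device is found or if the
// grid would exceed the device limit.
bool chooseLaunchConfig(int n, dim3 *grid, dim3 *block)
{
    int devCount = 0;
    if (cudaGetDeviceCount(&devCount) != cudaSuccess || devCount == 0)
        return false;
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;
    int threads = prop.maxThreadsPerBlock < 256 ? prop.maxThreadsPerBlock : 256;
    int blocks = blocksNeeded(n, threads);
    if (blocks > prop.maxGridSize[0])
        return false;
    *grid = dim3(blocks);
    *block = dim3(threads);
    return true;
}
```

The resulting grid and block values can be passed directly as the first two launch parameters of a kernel.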
5.3.3 Kernels and threads
As mentioned in section 5.3.1, CUDA extends C with keyword qualifiers used to indicate functions to run on
the device. A function using the __global__ qualifier is referred to as a kernel under the CUDA programming
model. Kernels are executed by calling the function using a special syntax indicating its launch parameters.
This syntax is best explained by illustrating it:
__global__ void myKernel(void); // kernel declaration
...
myKernel<<<1, 1>>>(); // call kernel
int n = 16;
dim3 threads(n, n);
myKernel<<<1, threads>>>(); // call using a different launch configuration
myKernel<<<1, threads, n * n * sizeof(float)>>>(); // call that also reserves dynamic shared memory
In the above listing we see a kernel declared and later called from host code. The <<<1, 1>>> syntax
between the function name and its parameter list specifies the kernel launch configuration. In the first
case it launches one block consisting of a single thread. After that we call the same kernel using a dim3
type variable (see 5.3.1) with dimensions (16, 16, 1) to define the thread count. Launching it this way
provides a natural way for threads to work on, for instance, multidimensional vectors or matrices[29]. As
mentioned in 5.3.1, the kernel launch configuration can also be used to dynamically allocate shared memory
for the thread blocks. When a third argument is added, its value is the number of bytes of shared memory
reserved for each block[30]. In this case it reserves the size of a float for each thread in the block. This
memory can then be used, for instance, by declaring an extern __shared__ array of floats inside the kernel.
As mentioned in 5.3.1, global functions must not return any value. Parameters to the kernel function are
passed as usual in the parameter list. On devices of compute capability 1.x they are transferred through
shared memory, which is why it is often recommended to use constant memory variables instead of parameters
to functions[28], especially if the list of parameters is long. As parameters cannot be used to return
values (see footnote 2), this is a sensible thing to do even for shorter lists of parameters.
2 Pointers to device memory can be used to write to device memory, but the data still needs to be transferred
back to the host by a CUDA runtime function.
As mentioned in 5.3.2, CUDA makes available built-in variables that each thread can access. Below is an
example of their use in a one-dimensional launch configuration:
__global__ void myKernel()
{
    int threadsInBlock = blockDim.x;
    int threadsInGrid = threadsInBlock * gridDim.x;
    int tid = threadIdx.x + blockIdx.x * blockDim.x; // thread index within entire grid
    extern __shared__ float sharedData[]; // dynamically allocated shared memory
    ...
}
Here the number of threads in the current block, the total number of threads in the grid, and the index of
the current thread within the grid are derived from the built-in variables. Also shown is an example of how
to make use of dynamically allocated shared memory. All arrays declared with the extern __shared__
qualifiers start at the same address in memory[29]. For this reason, if multiple arrays are to share the
memory allocated to the block, they must be laid out by calculating offsets into it, a technique also known
as aliasing. In this example the sharedData array is dynamically allocated the amount of memory declared in
the launch configuration of the kernel.
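The offset technique for placing several arrays in one dynamic shared memory allocation can be sketched as below. The kernel and wrapper names are hypothetical, the block size of 128 is an assumption, and error handling is reduced to a single check for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each block stages its slice of a[] and b[] in two arrays
// carved out of one dynamic shared memory allocation, then adds them.
__global__ void addStaged(const float *a, const int *b, float *out, int n)
{
    extern __shared__ float shared[];        // the whole dynamic allocation
    float *sa = shared;                      // first blockDim.x floats
    int   *sb = (int *)&shared[blockDim.x];  // next blockDim.x ints, found by offset

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        sa[threadIdx.x] = a[i];
        sb[threadIdx.x] = b[i];
        out[i] = sa[threadIdx.x] + (float)sb[threadIdx.x];
    }
}

// Host wrapper; the third launch argument is the shared memory size in bytes.
bool runAddStaged(const float *a_h, const int *b_h, float *out_h, int n)
{
    const int threads = 128;
    int blocks = (n + threads - 1) / threads;
    size_t sharedBytes = threads * (sizeof(float) + sizeof(int));
    float *a_d = 0, *out_d = 0;
    int *b_d = 0;
    cudaMalloc((void **)&a_d, n * sizeof(float));
    cudaMalloc((void **)&b_d, n * sizeof(int));
    cudaMalloc((void **)&out_d, n * sizeof(float));
    cudaMemcpy(a_d, a_h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, n * sizeof(int), cudaMemcpyHostToDevice);
    addStaged<<<blocks, threads, sharedBytes>>>(a_d, b_d, out_d, n);
    cudaError_t err = cudaMemcpy(out_h, out_d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(out_d);
    return err == cudaSuccess;
}
```

Since float and int have the same size and alignment, the second array can start directly after the first; arrays of types with stricter alignment would need padded offsets.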
5.3.4 Device memory
In the CUDA programming model, host and device memory are separate, and kernels can only operate
on device memory. Likewise, host code cannot access device memory directly, and kernels cannot return any
value or change the contents of any parameter passed to them. As mentioned in 5.3.1, special qualifiers are
used to mark device memory. Reading from and writing to device memory from the host is supported through
calls to the CUDA runtime library, which provides functions for allocating and freeing device memory and
for copying data between host and device. CUDA versions of the malloc, memcpy and free functions known
from C exist for handling linear memory. Additional functions exist for allocating and copying CUDA arrays,
which are special memory layouts optimized for texture fetching[29].
The cudaMalloc function is used to allocate memory on the device. The function takes the address of a
pointer type as parameter as well as size. The thing to note is that this pointer type is a host variable and
the value that is filled into it is the starting address of the memory allocated on the device. Thus, for the
kernel to access the pointer, it must be passed as a parameter at launch, or placed in a (constant) device
variable before launch.
The cudaFree function is used to free memory on the device. It is important to make use of this function,
as in our experience omitting it will cause host code to crash as the program unloads.
The cudaMemcpy function is used to copy memory between host and device, and from device to device.
It takes as parameters a destination pointer, a source pointer, a size, and a variable indicating source and
destination type (host or device).
The cudaMemcpyToSymbol function is used to copy from host memory or device memory to a device variable.
It supports copying to variables in constant memory as well as global memory. It takes as parameters a
destination symbol, a source pointer and the size of the data to copy, and optionally an offset from the
start of the symbol and a value indicating the direction of the copy (host to device or device to device).
The symbol can be either a char string naming the device variable, or a device variable declared with a
CUDA qualifier, if it exists in the same source file.
In the code samples below, we show some example uses of variables, memory allocation and copying
to and from the device.
int n = 1024;
float4 *dataTemp; // temp host pointer for cudaMalloc use
float4 *data; // host pointer to data on host
__constant__ float4 *data_Dev; // constant device pointer to data on device
In the above code sample, we first declare our variables. Two host pointers are declared globally, one
temporary for cudaMalloc use and one for storing the data on the host. A constant device pointer type
variable is also declared.
Memory for n float4s is then allocated on the host, and on the device using cudaMalloc. The pointer to the
data on the device is placed in the temporary host pointer.
data = (float4 *) malloc(n * sizeof(float4)); // allocate host memory
cudaMalloc((void **) &dataTemp, n * sizeof(float4)); // allocate device memory
/* prepare data */
After preparing the data, that is, loading or filling the host memory with the data to be processed, we copy
the data to the device. We also copy the temporary pointer to the constant device pointer, as shown below:
cudaMemcpy(dataTemp, data, n * sizeof(float4), cudaMemcpyHostToDevice); // copy data to device
cudaMemcpyToSymbol(data_Dev, &dataTemp, sizeof(void *)); // copy pointer to device
myKernel<<<BLOCKS, THREADS>>>();
During kernel calls the data is now accessible through the constant device pointer, without the need for
passing it as a parameter.
After the kernel call returns, we copy the memory back, use or save the data computed by the CUDA kernel,
and free the memory as seen below:
cudaMemcpy(data, dataTemp, n * sizeof(float4), cudaMemcpyDeviceToHost); // copy data back to host
/* use data */
free(data);
cudaFree(dataTemp);
As illustrated here, managing variables and data that exist on both host and device, or are moved
between them, requires a bit of extra code and variables. In much of the CUDA literature, appending h,
host, d, dev, device or variations thereof to the variable name is used as a naming convention, either to
keep track of (temporary) variables that exist in both places, or for all variables.
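For comparison, a complete round trip can also be written with the device pointer passed as a kernel parameter instead of being placed in a constant device variable. The sketch below is an illustrative variant, not the code used in the thesis: the kernel doubleAll, the helper ok and the wrapper roundTrip are hypothetical names, and the _h/_d suffixes follow the naming convention mentioned above. It also checks the return codes of the runtime calls.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel: doubles every component of every element in place.
__global__ void doubleAll(float4 *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i].x *= 2.0f; data[i].y *= 2.0f;
        data[i].z *= 2.0f; data[i].w *= 2.0f;
    }
}

// Reports a failed runtime call; returns true when the call succeeded.
static bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return err == cudaSuccess;
}

// Copies n elements to the device, runs the kernel, and copies them back.
bool roundTrip(float4 *data_h, int n)
{
    float4 *data_d = 0;
    size_t bytes = n * sizeof(float4);
    if (!ok(cudaMalloc((void **)&data_d, bytes), "cudaMalloc"))
        return false;
    bool good = ok(cudaMemcpy(data_d, data_h, bytes, cudaMemcpyHostToDevice), "copy to device");
    if (good) {
        doubleAll<<<(n + 255) / 256, 256>>>(data_d, n);
        good = ok(cudaGetLastError(), "kernel launch")
            && ok(cudaMemcpy(data_h, data_d, bytes, cudaMemcpyDeviceToHost), "copy to host");
    }
    cudaFree(data_d);
    return good;
}
```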
5.3.5 Data coalescence
Coalescing of data is a central point in the CUDA programming paradigm when aiming for performance.
The concept refers to the meaning of the word itself, which is “to unite into a whole” [21]. The basic idea
is to arrange data fetches in such a way that data is transferred to a warp in as few loads as possible.
Data transfers are done in segments of 32, 64 or 128 bytes. If a warp uses fewer bytes than the segment
that is loaded, the remaining bytes of the transfer are “wasted”. When a warp asks for data from a memory
location, this memory location needs to be opened (as explained in section 5.2.2), which takes time, and
as reading extra data is fast, all the data in the memory location needed by the warp might as well be
fetched. Coalescence means aligning data in memory in such a way that data fetches from the same warp
utilize the transfer size optimally. Sometimes it can be easier to think of coalescence the other way
around, namely to arrange data fetches so they utilize aligned data in memory, for example arrays.
Consider the two constructions below for a single warp (32 threads):
float *array; // points to 1024 floats in global memory
int tid = threadIdx.x;
int stride = 32;
// coalesced data fetch
for (int i = tid; i < 1024; i += stride) {
    array[i] += 1.0f;
}
// non-coalesced data fetch
for (int i = tid * 32; i < (tid + 1) * 32; i++) {
    array[i] += 1.0f;
}
Both for loops perform the same operations on the array, but the first loop gets its data in a coalesced
manner, as the warp retrieves 32 consecutive, aligned memory locations at a time, while the second loop is
totally uncoalesced, with each thread retrieving only a single 32-bit word (one float) in each fetch, as
can be seen below.
// coalesced, as the threads fetch aligned memory locations
           1st iteration   2nd iteration   3rd iteration   ...
thread 0   array[0]        array[32]       array[64]       ...
thread 1   array[1]        array[33]       array[65]       ...
thread 2   array[2]        array[34]       array[66]       ...
...
// uncoalesced, as each iteration has no aligned memory fetches
           1st iteration   2nd iteration   3rd iteration   ...
thread 0   array[0]        array[1]        array[2]        ...
thread 1   array[32]       array[33]       array[34]       ...
thread 2   array[64]       array[65]       array[66]       ...
...
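The difference between the two access patterns can be quantified with a small host-side calculation. The sketch below counts how many distinct 128-byte segments one warp touches in a single loop iteration, assuming the array starts on a 128-byte boundary. The function name and its parameters are hypothetical, and the model ignores caching.

```cuda
#include <set>

// Number of distinct 128-byte segments touched by one warp of 32 threads in
// one loop iteration, when thread t reads the float at index
// t * threadStride + iteration * iterStride.
int segmentsPerWarp(int threadStride, int iterStride, int iteration)
{
    std::set<long> segments;
    for (int t = 0; t < 32; ++t) {
        long index = (long)t * threadStride + (long)iteration * iterStride;
        segments.insert(index * (long)sizeof(float) / 128);
    }
    return (int)segments.size();
}
```

With threadStride = 1 and iterStride = 32 (the coalesced loop) the warp touches a single segment per iteration, whereas threadStride = 32 and iterStride = 1 (the uncoalesced loop) touches 32 different segments.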
5.3.6 Parallel constructs
Control of the parallel execution in CUDA is achieved primarily through its constructs for atomic
operations, memory fences and synchronization functions.
Atomic functions
An atomic function performs a read-modify-write operation on a memory location without interference from
other threads. CUDA’s atomic functions can be used on a 32- or 64-bit word. There are a number of different
atomic operations, which are listed in the programming guide [30, p. 119]. Using an atomic function could
look like the code snippet below:
int old = atomicAdd(&counter, threadIdx.x);
The atomicAdd() function adds the thread’s index to the value of counter and returns the value that counter
held before the addition. It is done as one operation, without any other thread being able to read or write
to the counter location in between.
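A complete use of atomicAdd() could be sketched as below, where many threads increment one shared counter. The kernel and wrapper names are hypothetical, and error handling is kept minimal.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread whose element is negative atomically
// increments a single counter in global memory.
__global__ void countNegatives(const float *data, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] < 0.0f)
        atomicAdd(counter, 1);  // the returned old value is not needed here
}

// Host wrapper: returns the number of negative elements, or -1 on error.
int countNegativesOnDevice(const float *data_h, int n)
{
    float *data_d = 0;
    int *counter_d = 0;
    int result = -1;
    if (cudaMalloc((void **)&data_d, n * sizeof(float)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&counter_d, sizeof(int)) != cudaSuccess) {
        cudaFree(data_d);
        return -1;
    }
    cudaMemcpy(data_d, data_h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(counter_d, 0, sizeof(int));  // start counting from zero
    countNegatives<<<(n + 255) / 256, 256>>>(data_d, n, counter_d);
    if (cudaMemcpy(&result, counter_d, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        result = -1;
    cudaFree(data_d);
    cudaFree(counter_d);
    return result;
}
```

Without the atomic operation, two threads could read the same old value and both write old + 1, losing one of the increments.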
Memory fences
Due to how the GPU works, memory operations done by one thread are not always instantly visible to other
threads. The scheduling of blocks and their memory operations means that the programmer cannot be sure of
the ordering of the writes to global memory. If one needs to make sure data is visible to all other threads
in the grid, a memory fence can be used.
The CUDA API has several functions useful for implementing a memory fence. One of them is the
__threadfence() function. These functions make sure that certain data is visible to all threads of a given
scope, determined by the specific threadfence function called. The __threadfence() function ensures that
changes to shared memory are visible within the thread block, and that changes to global memory are visible
to all threads on the device[30]. In other words, all memory writes that precede the __threadfence() call
must have become visible before the call itself completes. Variants of the function have other specific
rules and uses.
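A typical use of __threadfence() is to publish a result before signalling that it is ready. The sketch below is a simplified variant of the "last block finishes the work" pattern: each block writes a partial sum, issues a fence, and then takes a ticket with atomicInc(); the block that receives the last ticket adds up all partial sums. The names are hypothetical, and for brevity only thread 0 of each block does any work, which is of course not how a performant reduction would be written.

```cuda
#include <cuda_runtime.h>

__device__ unsigned int blocksDone = 0;  // number of blocks that have published a result

// Hypothetical kernel: thread 0 of each block sums its block's slice of in[],
// publishes the partial sum, and the block that finishes last adds them up.
__global__ void sumWithFence(const float *in, int n, volatile float *partial, float *total)
{
    if (threadIdx.x != 0)
        return;
    int perBlock = (n + gridDim.x - 1) / gridDim.x;
    int begin = blockIdx.x * perBlock;
    int end = min(begin + perBlock, n);
    float s = 0.0f;
    for (int i = begin; i < end; ++i)
        s += in[i];
    partial[blockIdx.x] = s;

    __threadfence();  // the partial sum must be visible before the ticket is taken

    unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
    if (ticket == gridDim.x - 1) {       // this block finished last
        float t = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            t += partial[b];
        *total = t;
        blocksDone = 0;                  // ready for the next launch
    }
}

// Host wrapper: returns the sum of n floats, computed with 4 blocks.
float sumOnDevice(const float *in_h, int n)
{
    const int blocks = 4;
    float *in_d = 0, *partial_d = 0, *total_d = 0;
    float total = 0.0f;
    cudaMalloc((void **)&in_d, n * sizeof(float));
    cudaMalloc((void **)&partial_d, blocks * sizeof(float));
    cudaMalloc((void **)&total_d, sizeof(float));
    cudaMemcpy(in_d, in_h, n * sizeof(float), cudaMemcpyHostToDevice);
    sumWithFence<<<blocks, 32>>>(in_d, n, partial_d, total_d);
    cudaMemcpy(&total, total_d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(in_d);
    cudaFree(partial_d);
    cudaFree(total_d);
    return total;
}
```

The partial array is declared volatile so that the last block reads the values from memory rather than from a cached copy.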
Synchronization
Synchronization in CUDA using the __syncthreads() function is limited to threads within a block. The
function waits until all threads within the block have reached the synchronization point and all memory
locations altered are visible to the block. The variants __syncthreads_and(int predicate),
__syncthreads_count(int predicate) and __syncthreads_or(int predicate) can also be used to synchronize.
In addition to synchronizing, they evaluate the predicate for all threads in the block and return,
respectively, the logical “and”, the count of non-zero values, or the logical “or” of the results.
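The sketch below shows both plain barrier synchronization and one of the predicate variants in a single-block reduction. The names and the block size of 256 are assumptions of the sketch, and the predicate variant requires a device of compute capability 2.0 or above.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; the reduction assumes a power of two

// Hypothetical kernel for a single block: sums up to BLOCK values in shared
// memory and counts how many of them are positive.
__global__ void sumAndCount(const float *in, int n, float *sum, int *positives)
{
    __shared__ float partial[BLOCK];
    int t = threadIdx.x;
    float v = (t < n) ? in[t] : 0.0f;
    partial[t] = v;
    __syncthreads();                 // all values are stored before anyone reads

    for (int s = BLOCK / 2; s > 0; s /= 2) {
        if (t < s)
            partial[t] += partial[t + s];
        __syncthreads();             // finish this step before starting the next
    }

    int count = __syncthreads_count(v > 0.0f);  // barrier plus count of true predicates
    if (t == 0) {
        *sum = partial[0];
        *positives = count;
    }
}

// Host wrapper for 1 <= n <= BLOCK values; returns false on error.
bool runSumAndCount(const float *in_h, int n, float *sum_h, int *positives_h)
{
    if (n < 1 || n > BLOCK)
        return false;
    float *in_d = 0, *sum_d = 0;
    int *pos_d = 0;
    cudaMalloc((void **)&in_d, n * sizeof(float));
    cudaMalloc((void **)&sum_d, sizeof(float));
    cudaMalloc((void **)&pos_d, sizeof(int));
    cudaMemcpy(in_d, in_h, n * sizeof(float), cudaMemcpyHostToDevice);
    sumAndCount<<<1, BLOCK>>>(in_d, n, sum_d, pos_d);
    cudaError_t e1 = cudaMemcpy(sum_h, sum_d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(positives_h, pos_d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(in_d);
    cudaFree(sum_d);
    cudaFree(pos_d);
    return e1 == cudaSuccess && e2 == cudaSuccess;
}
```

Note that every thread of the block reaches each barrier, including threads with t >= n; a barrier inside a branch that only some threads take would lead to undefined behaviour.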
5.3.7 Special functions and instructions
CUDA programmers have access to extra fast, hardware supported arithmetic functions that use the SFUs.
When a math function is called with the prefix __ (for example __sinf() instead of sinf()), the compiler
uses the built-in, also called intrinsic, version of the function. A compiler option (-use_fast_math) makes
the compiler use the intrinsic version by default for all mathematical functions supported by the
SFUs[30, p. 138]. While considerably faster, these functions may come at the price of lower precision in
the floating point operations. The maximum ULP error of the operations in comparison to the regular math
functions can be found in a table of the programming guide[30, p. 134], and should be consulted when
determining if the error is acceptable. Notes on floating point values and precision are included in the
appendix found in chapter 12.
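The precision trade-off can be examined directly by comparing a regular function with its intrinsic counterpart. The sketch below does this for sinf() and __sinf() on sample points in the interval [0, 1); the names sinError and maxSinError are hypothetical. If the file is compiled with -use_fast_math, both calls map to the intrinsic and the measured difference becomes zero.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// Stores |sinf(x) - __sinf(x)| for n sample points spread over [0, 1).
__global__ void sinError(float *err, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = (float)i / (float)n;
        err[i] = fabsf(sinf(x) - __sinf(x));  // regular versus intrinsic version
    }
}

// Host wrapper: returns the largest difference found, or -1 on error.
float maxSinError(int n)
{
    float *err_d = 0;
    float *err_h = new float[n];
    float worst = -1.0f;
    if (cudaMalloc((void **)&err_d, n * sizeof(float)) == cudaSuccess) {
        sinError<<<(n + 255) / 256, 256>>>(err_d, n);
        if (cudaMemcpy(err_h, err_d, n * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess) {
            worst = 0.0f;
            for (int i = 0; i < n; ++i)
                worst = fmaxf(worst, err_h[i]);
        }
        cudaFree(err_d);
    }
    delete[] err_h;
    return worst;
}
```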
Threads within a warp have the ability to make use of three voting functions [30]:
int __all(int predicate); // returns 0 if any thread's predicate == 0
int __any(int predicate); // returns 0 if all threads' predicate == 0
unsigned int __ballot(int predicate); // evaluates predicate for all threads of the warp and
                                      // returns an integer whose Nth bit is set if and only if
                                      // predicate evaluates to non-zero for the Nth thread of the warp
The warp voting functions can be used to avoid warp divergence, as they let all threads of a warp base a
branch on the same warp-wide result.
5.4 Optimization in CUDA
When working with cores numbering in the hundreds, it is quite obvious that there is a lot of performance
to be gained from parallelization compared to sequential execution. When using a GPU it is therefore
very important to optimize the parallelization to get a desirable speedup (see section 4.1.1). The
architecture of the GPU hardware is designed for highly parallel execution and, as described, it has several
special hardware units to make massive parallelization possible. Sometimes speedup comes at the cost of
precision though. CUDA C makes for a highly explicit parallel programming language, and there are several
techniques that can help a programmer achieve high performance of a program running on a GPU.
In this section we will cover optimization techniques for CUDA programming, especially those important
for our thesis. As will be seen, most optimizations relate in some way to general parallelism (4.3),
but often relate directly to the hardware as well. Therefore, in the first subsection we will look into how
knowing the architecture of a GPU can help us increase the performance of a program. The second subsection
will look at techniques to keep the processors busy in order to achieve high performance levels of a CUDA
program implementation. If nothing else is mentioned, NVIDIA’s “Best Practices Guide”[28] is used for
reference in this section.
5.4.1 Knowing the architecture
The architecture of a GPU, as explained at the start of this chapter (5.2), brings both limitations and
possibilities when writing code for a CUDA program. Addressing both sides will make your program run faster.
We will now look at what the hardware favours.
As mentioned in section 5.2.2, the memory bandwidth between the host and the device is considerably
lower than that of any internal bus on the card. Also, transfers made with cudaMemcpy block the host until
they have completed. Therefore keeping as much of the execution on the device as possible is of high
priority. This is true even if it means running parts of the program on the GPU that do not show any
performance gain there, as this keeps memory transfers to a minimum.
The concept “locality of reference” is of importance for CUDA programmers, as they have control over which
memory types are used when writing a CUDA program. They can make sure that frequently used data is located
close to the processing cores, in either register space or shared memory. Data fetches from global memory
occur in segments of 32, 64 and 128 bytes and take hundreds of clock cycles [28]. The load and store (L/S)
units on each SM manage data fetches from the same warp in such a way that if the data is aligned correctly,
several threads retrieve data in the same fetch. Coalescing data so nearby threads in a warp can retrieve
data in the same fetch can speed up a program. Imagine the following example: one thread requests a float
(4 bytes) from global memory. If data is coalesced, 7 nearby threads can retrieve their 7 floats in the same
fetch, making the data transfer of aligned floats a total of 32 bytes and speeding the process up by a
factor of 8. Coalescing global memory data access is possibly the most important optimization for a CUDA
programmer [28].
The different kinds of caches within the GPU are available to the programmer, and it is possible to utilize
these caches to speed up a program. To avoid repeated fetches of the same data from global memory, data can
be saved in shared memory for fast access by all threads within the block. Shared memory data is only
visible to threads in the same block, but has a much lower latency than global memory (approximately
100× faster [28]). Using shared memory can also help achieve coalescence, since chunks of data that would
not otherwise be used at the same time in the program’s execution can be fetched simultaneously. While
using shared memory, avoiding accesses that cause bank conflicts is an important factor.
Constant cache use can increase speed under certain circumstances. Constant cache reads (in the case of a
cache hit) from the same half-warp have the same latency as reading from a register, as long as all threads
are reading from the same address. Multiple addresses increase the time linearly, as the reads need to be
serialized. The constant cache is, as the name suggests, usable for storing constants in a CUDA program.
The texture memory cache is optimized for memory reads that have two-dimensional spatial locality, as
mentioned earlier (5.2.2). Texture memory can also be used for streaming fetches; it is optimized towards
lower bandwidth demand and does not decrease fetch latency as such. Streaming texture fetching can in turn
perform operations on the fetched data, like unpacking or interpolating. It is not something we make use of
in this thesis though.
Minimizing sequential execution is, as always, a major concern when writing parallel programs, but
especially so with a GPU. The GPU provides hardware-supported atomic operations that are extremely fast [31],
meaning the overhead of performing atomics is not as great as in regular CPU parallelism.
By using the volatile keyword in conjunction with atomics, it is possible to use values at memory locations
as lightweight locks (4.3.3), locking the location and not entire code segments[7].
A programmer must also keep in mind that, due to the SIMT architecture of the SMs, any divergence caused
by conditional statements (see footnote 3) in the execution path within a warp will slow down execution on
a given SM. Threads within the same warp that do not execute a given branch remain dormant while it is
executed, meaning that each path is executed serially[28, p. 49].
In each SM the SFUs support instructions for special arithmetic operations directly in hardware. Using
these special instructions will increase speed, possibly at the cost of accuracy. Mathematical functions
can be called using the “__” prefix to make use of the fast hardware functions.
The way an SM schedules its processors and allocates registers to increase performance requires some special
attention and will therefore be described more extensively in the next section.
5.4.2 Occupancy
Knowing that all multiprocessors are kept busy while executing a program is a good indication of an
optimized use of the GPU hardware. Keeping “load balancing” in mind when designing the kernels is of
great value for a CUDA programmer. As the CUDA language design with warps, threads, blocks and grids
somewhat maps the GPU’s architecture, it is possible to predict how to keep the multiprocessors busy. The
metric used to decide how busy the processors are is called occupancy. Although a high occupancy is
preferred, an occupancy below 100% does not necessarily decrease performance, as it might not be possible to
keep all CUDA cores occupied.
A single warp executes all its instructions sequentially. Therefore having several warps resident on the
same SM will improve performance, as there is always a warp ready to execute while others wait for data.
This is a good way to work around the latency of data fetches, also referred to as latency hiding. Occupancy
is a ratio: the number of active warps on an SM divided by the maximum number of warps the SM can hold.
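The occupancy ratio can be computed from the device properties. The sketch below assumes a more recent CUDA runtime than the one used in this thesis, as it relies on the runtime function cudaOccupancyMaxActiveBlocksPerMultiprocessor() to obtain the number of blocks of a given kernel that can be resident on one SM. The kernel workKernel is a placeholder and the helper names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel whose resource usage is being examined.
__global__ void workKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

// Ratio of active warps to the maximum number of warps on one SM.
float occupancyRatio(int activeWarps, int maxWarps)
{
    return (float)activeWarps / (float)maxWarps;
}

// Theoretical occupancy of workKernel on device 0 for a given block size,
// or -1 on error.
float theoreticalOccupancy(int blockSize)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return -1.0f;
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, workKernel,
                                                      blockSize, 0) != cudaSuccess)
        return -1.0f;
    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return occupancyRatio(blocksPerSM * warpsPerBlock, maxWarps);
}
```

As an example of the ratio itself, 24 active warps on an SM that can hold 48 warps gives an occupancy of 0.5.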
When an instruction needs a result stored in a register by a previous arithmetic instruction, there is a
latency of 24 clock cycles. This latency can only be hidden by executing other warps. For a GPU of compute
capability 2.0, this means that each SM should be assigned at least 3824 threads to accommodate this register
latency. The described phenomenon is called register dependency.
As the register space is limited within a multiprocessor and several warps may reside within the same SM,
it is possible to overuse register space. If too many registers are used per thread block, the number
of blocks within each SM will be limited, and occupancy will drop. It is possible to place the data in
local memory instead, also called register spilling, but this will increase latency by a hundredfold. Ergo,
if the use of register space is excessive, performance will drop, and the register use needs to be addressed.
Keeping the number of threads per block a multiple of 32 suits the architecture of an SM, again
making sure no single processor within the SM has to sleep while waiting for excess threads to finish
their work. Keeping the number of threads a multiple of 32 will also help data fetches be coalesced, as
arranged data will be retrieved in the right data sizes.
3 i.e. using control commands such as: if, switch, do, for, while
4 the number of threads needed to hide register latency changes with compute capability
Loop unrolling can be explicitly requested by the programmer (see footnote 5) by using the directive #pragma
unroll and can lead to performance gains if the register count can accommodate this. The reason behind
this is that loop unrolling reduces loop overhead [33, p. A67] and can sometimes optimize away thread
divergence within a warp [28].
5 the compiler might unroll inner loops by itself
6 The n-body problem
In this chapter we will give a brief introduction to our chosen case, the n-body problem, exemplified by
simulating the dynamics of a galaxy of celestial bodies. Firstly, as the problem is based on a physical
problem, we will explain the physical laws of force on which the simulation is based. Secondly, we will
describe our choice of n-body algorithms and numerical methods, as these can change the computational work
considerably and therefore are of interest to our thesis. Lastly, we will explain the Plummer model, which
is a density model for the galaxy we will perform our simulations on.
The n-body problem is a classical problem in physics, as it applies to several physical fields
where particles of some sort interact through physical forces. Originally the n-body problem was the
problem of predicting the motion of a group of celestial objects that interact with each other gravitationally.
Solving this problem has been motivated by the need to understand the motion of the Sun, planets and
the visible stars. Understanding how particles influence each other has been an interest for physicists since
before Newton proposed the laws we still use today. Examining how celestial bodies act on each other is
a commonly simulated n-body problem, but similar uses do exist in other fields, for example molecular
dynamics and quantum mechanics. An n-body problem with n > 2 cannot be solved analytically for all cases
and has to be solved discretely through simulation. To describe the dynamics of a large physical system, or
of a system whose dynamics lie within a very short timespan, an immense number of calculations has
to be done. As the size of the simulated system or the number of time-steps increases, so do the
requirements for computational power. The n-body problem represents the requirements of many other types of
simulations[34, p 1], and the classical all-pairs implementation of a program to solve the problem is easily
parallelized. This makes the n-body problem well suited as a case for our study of parallel computing and
CUDA.
6.1 Physics
The classic n-body problem is the problem of simulating how a number of celestial bodies, like suns in
a galaxy or planets in a solar system, interact with each other through gravitation. The gravitational force
acting on the stars is governed by the potential energy between them. The movement of the star bodies
is given by Newton’s laws of motion[38, p 53-59] and can be found by computing the gravitational forces
between them and thereby finding the acceleration of each body:
1st law: “Every object persists in its state of rest or uniform motion in a straight line unless it is compelled
to change that state by forces impressed on it”, i.e. the velocity of a body remains constant unless it
is acted upon by an external force.
2nd law: “The acceleration produced by a particular force acting on a body is directly proportional to the
magnitude of the force and inversely proportional to the mass of the body”. The velocity of a body
changes if acted upon by an external force F, and the magnitude of this change is given by the formula
F = ma, where m is the mass of the body and a the acceleration.
3rd law: “To every action there is always opposed an equal reaction; or, the mutual actions of two bodies
upon each other are always equal, and directed to contrary parts.” I.e. for every action there is an
equal and opposite reaction.
The force F between two bodies is calculated as follows:
\[
F = G \frac{m_1 m_2}{r^2} \tag{6.1}
\]
where G is the gravitational constant ($6.67300 \times 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$), $m_1$ and
$m_2$ are the masses of the first and second bodies respectively and r is the distance between the bodies.
The velocity of a particle is the derivative of the position and the acceleration is the derivative of the
velocity, that is $p = \int v\,dt$ and $v = \int a\,dt$, where p, v and a are the position, velocity and
acceleration vectors respectively. Using the formulas above and by the use of numerical integration we are
able to approximate the position.
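Combining equation 6.1 with F = ma gives the acceleration of body i due to body j as G m_j / r^2, directed from i towards j. A minimal sketch of this calculation is shown below. The storage layout (position in x, y, z and mass in w of a float4) and the function name are assumptions of the sketch, and the two bodies are assumed to be at different positions, since r = 0 would lead to a division by zero.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// Gravitational constant in m^3 kg^-1 s^-2.
#define G_CONST 6.673e-11f

// Acceleration of body i caused by body j. Positions are stored in x, y, z
// and the mass in w (a storage layout assumed for this sketch).
__host__ __device__ float3 accelerationFrom(float4 bi, float4 bj)
{
    float dx = bj.x - bi.x;
    float dy = bj.y - bi.y;
    float dz = bj.z - bi.z;
    float r2 = dx * dx + dy * dy + dz * dz;  // squared distance r^2
    float r = sqrtf(r2);
    float a = G_CONST * bj.w / r2;           // |a| = F / m_i = G m_j / r^2
    return make_float3(a * dx / r, a * dy / r, a * dz / r);  // directed towards j
}
```

In an all-pairs simulation each thread would sum this contribution over all other bodies j to obtain the total acceleration of its own body i.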
Within mechanics the total energy of a conservative system, $E_{tot}$, is given by the sum of the potential
energy $E_{pot}$ and the kinetic energy $E_{kin}$. If no energy is considered dissipated to other forms (for
example by conversion to heat) the energy of the system should remain constant. Finding $E_{tot}$ for a
galaxy with N bodies is done using the formula below [38].
\[
E_{tot} = E_{pot} + E_{kin} = -\sum_{i=1}^{N} \sum_{j=i+1}^{N} G \frac{m_i m_j}{r_{ij}} + \frac{1}{2} \sum_{i=1}^{N} m_i v_i^2 \tag{6.2}
\]
where $r_{ij}$ is the distance between bodies i and j, and $v_i$ is the velocity vector of the i’th body.
6.2 Numerical integration methods
To determine the position of a particle, “all” we need to do is integrate the acceleration twice. As mentioned
before, this cannot be done analytically for n-body problems with an n above 2. To predict the movement of
the particles in the system we have to solve the problem discretely, using numerical integration. A myriad
of different integration methods exists that suit various problems and differ vastly in complexity, accuracy
and computational time. We will describe two of the simplest methods, which are also the two we have
used in our simulation implementations: Euler and Verlet integration. Afterwards we will briefly mention
what a different numerical integration method could bring to the table, for use in our discussion.
6.2.1 Euler
Euler integration approximates the solution to a differential equation using a first order prediction. Euler’s
method calculates the next step from the current one by assuming that the current state (velocity) plus the
current rate of change (acceleration) multiplied by the step size is a good approximation to the next state. To
determine the value of v at the time $t_n$ of iteration n, i.e. $v_n = v(t_n)$, one only needs to know the
previous state, namely $v_{n-1} = v(t_{n-1})$. The time between $t_{n-1}$ and $t_n$ is the step size $\Delta t$. The same is true
for the position, its rate of change being the velocity. To find the position at the next time-step using
Euler we will therefore be using the following formulas:
$$v(t_{n+1}) = v(t_n) + \Delta t\, a(t_{n+1}) \tag{6.3}$$
$$p(t_{n+1}) = p(t_n) + \Delta t\, v(t_{n+1}) \tag{6.4}$$
Euler’s formula is rather inaccurate for large ∆t. For longer simulations the step size needs to be very
small to keep the error bounded, and Euler’s formula handles rapid changes in acceleration badly. This
makes Euler unsuitable for scientific work where accuracy is important. On the other hand, Euler is easy to
understand and implement, uses a minimal number of calculations per iteration and captures the dynamics of
most systems over shorter timescales. This makes Euler integration useful for testing initial implementations
of algorithms, to see if the algorithm works without also having to worry about the implementation
of a more complex numerical integration. We have used Euler integration for some of the implementations
of n-body simulations during our project, and later switched method, as will be described later.
6.2.2 Verlet
Verlet integration, first introduced by Loup Verlet in 1967 [40], is commonly used in particle physics on both
gravitational and molecular type problems. The algorithm is an energy-conserving method which only has a
linear error growth [19, p 399]. A slightly modified version that we use in our report is the velocity Verlet
algorithm. It is mathematically equivalent to the original Verlet algorithm but deals better with
floating point round-off errors [11, p 352-354]. The velocity Verlet algorithm is presented below [39].
$$v(t_{n+1}) = v(t_n) + \frac{\Delta t}{2} \left( a(t_{n+1}) + a(t_n) \right)$$
$$p(t_{n+1}) = p(t_n) + \Delta t\, v(t_n) + \frac{\Delta t^2}{2} a(t_n)$$
A numerical integration method approximates an integral, usually by Taylor expansions. The expansions
are truncated at some order of the polynomial, and the lowest order of the truncated terms dictates the
order of the truncation error. The accuracy of Euler and Verlet integration is given by the truncation error
per time-step ∆t. For Euler the truncation error per time-step is $O(\Delta t^2)$, meaning that for a time span of
$1/\Delta t$ the expected error is $O(\Delta t)$. For Verlet the truncation error per time-step is $O(\Delta t^4)$, corresponding to
an expected error of $O(\Delta t^3)$ for a time span of $1/\Delta t$ [12]. There exist several other numerical methods that
all have their advantages and disadvantages. Methods of a higher order¹ have a smaller error for the same
time-step size than methods of a lower order. Higher order methods require more computations per
iteration and might also require the use of several data points in a time series for interpolation.
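The practical difference between the two methods can be illustrated on a small toy problem. The sketch below uses a one-dimensional harmonic oscillator with unit mass (a = −p), which is chosen only for illustration and is not part of our simulations. The Euler step evaluates the acceleration at the current position, updates the velocity first and then advances the position with the new velocity, as in equation 6.4. All function names are assumptions made for this example.

```c
#include <math.h>

/* Toy system (an assumption for this sketch): harmonic oscillator, a = -p. */
double osc_accel(double p) { return -p; }

/* Euler step: velocity first, then position using the new velocity. */
void euler_step(double *p, double *v, double dt)
{
    *v += dt * osc_accel(*p);
    *p += dt * *v;
}

/* Velocity Verlet step. */
void verlet_step(double *p, double *v, double dt)
{
    double a_old = osc_accel(*p);
    *p += dt * *v + 0.5 * dt * dt * a_old;
    double a_new = osc_accel(*p);       /* acceleration at the new position */
    *v += 0.5 * dt * (a_old + a_new);
}

/* Largest deviation of the energy 0.5 * (v^2 + p^2) from its start value. */
double max_energy_error(void (*step)(double *, double *, double),
                        double dt, int steps)
{
    double p = 1.0, v = 0.0;            /* start at rest at p = 1 */
    double e0 = 0.5 * (v * v + p * p);
    double worst = 0.0;
    for (int i = 0; i < steps; i++) {
        step(&p, &v, dt);
        double err = fabs(0.5 * (v * v + p * p) - e0);
        if (err > worst) worst = err;
    }
    return worst;
}
```

Calling max_energy_error(euler_step, 0.01, 1000) and max_energy_error(verlet_step, 0.01, 1000) gives the worst energy deviation of each method over the same time span. On this toy problem the deviation of the velocity Verlet step stays well below that of the Euler step for the same step size.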
Methods that use a variable step size also exist. These methods avoid large errors where the function at
hand has large values and/or rapidly changing values of the derivative, by lowering the step size.
Before choosing a higher order method or a method with variable step size, one should consider whether
speed or accuracy is the most important aspect of each time step. It might even be the case that the same
accuracy can be achieved in the same time span by using a smaller step size and a faster method. A method
with a variable time step could possibly slow down an entire simulation because of some star pair that needs
extra attention while passing each other at very close range. Additionally, in a chaotic system such as a
galaxy, and in a simulation where single precision is used for the floating point operations, the accuracy of
the numerical method might also be overshadowed by errors from other sources, such as round-off errors
or the inherent sensitivity to the initial conditions.
1 A higher order numerical integration method has an error term with a high exponent k of the step size ∆t ($O(\Delta t^k)$), meaning that the
error will be small for small step sizes.
6.3 Simulation algorithms
In order to see how a galaxy evolves over time, the collective contribution to each star’s acceleration needs
to be found. Every other star in the galaxy contributes to the acceleration according to equation 6.1 shown above.
The force decreases with the square of the distance between two stars, and therefore minor masses that are not
located in the galaxy itself will not affect the stars within it significantly. Big star clusters
or galaxy clusters that have a huge collective mass do affect a galaxy, but all stars will be affected in the
same direction. As we want to look at the dynamics within a galaxy we can therefore ignore possible bodies
outside the galaxy. The acceleration $a_0$ of a star in an n-body cluster can be found by adding the contributions
of the forces from all other stars together.
$$a_0 = \sum_{i=1}^{N} G \frac{m_i}{r^2} \hat{r}_{0i} \tag{6.5}$$
where $m_i$ is the mass of the attracting star, r is the distance between the two stars and $\hat{r}_{0i}$ is the unit direction
vector between the two stars.
Several algorithms for simulating the dynamics of star clusters exist in computer science. The easiest to
understand is the all-pairs algorithm, which in principle is just a direct conversion of equation (6.5) above with
slight modifications.
6.3.1 All-pairs
Calculating the forces between all celestial bodies in a system and advancing the system iteratively using
these force calculations is called an all-pairs simulation. The all-pairs algorithm is straightforward to
implement in a serial program as well as a parallelized version. Considering a galaxy at some time t, each
particle has a position and a velocity. In this simulation the stars have no size meaning the masses are just
point masses and there can be no collision between two different stars.
We can find how the system progresses in time by first using equation 6.1 to calculate the force for each
star pair, and then using Newton’s 2nd law (see section 6.1) to derive the acceleration of each particle. We
then find how the velocity and position vectors have evolved during some small time step ∆t using
numerical integration (see section 6.2).
Although easy to understand and implement, the time complexity of an all-pairs simulation
is $O(n^2)$ [34], as the interaction of each body with every other body in the system needs to be calculated
(see figure 6.1). This limits the method, because the computation time grows quadratically with the size
of the problem.
In some implementations of an all-pairs simulation we want² to make the same calculations for all pairwise
interactions, even between a star and itself. Here the acceleration contribution is zero. As the distance r
is also zero in this case, we add a softening factor ε to $r^2$ to avoid division by zero. The softening
factor ε needs to be sufficiently small to not introduce a large error in all the other pairwise calculations.
The acceleration for all stars can therefore be calculated as:
$$a_n = \sum_{i=0}^{N} G \frac{m_i}{r^2 + \epsilon} \hat{r}_{ni} \tag{6.6}$$
The force calculation and progression of an all-pairs system is a case of an embarrassingly parallel problem,
as the same calculations need to be done on the same data for all particles.
6.3.2 Barnes-Hut
The Barnes-Hut algorithm (BH) was first published in a 1986 article by Josh Barnes and Piet Hut [3]. It
deals with the complexity problem of the all-pairs algorithm, reducing it to $O(n \log n)$ for uniform
distributions [4][3]. The algorithm is adaptable, meaning it deals well with slightly skewed distributions,
such as a galaxy with a Plummer distribution (described later in section 6.4.1), and has a controllable loss of precision
[34].
2 This will be explained later.
Figure 6.1 The figure shows that the computations needed for one body are also needed for all other bodies in the system
In essence the BH algorithm envelops the galaxy in a cubic bounding box and spatially divides the box into
8 sub-cubes, one for each corner of the original box. Thereafter we keep dividing these boxes further into
8 boxes and so on, until each sub-box contains only one star. Instead of calculating the force contribution of
each single star, the collective mass of all stars in a cube can be used. This can be done if the center of mass
of the cube is located far enough away from the star on which the force acts. The value
which determines whether this center is far enough away is chosen by the user of the algorithm, and is also the
mechanism with which precision is controlled.
The BH algorithm makes use of an octree data structure to keep track of the spatial division of the galaxy.
The leaves of the tree represent the actual stars of the galaxy, while internal nodes represent the cubes
that divide space. The internal nodes also keep track of the collective mass of the underlying leaf nodes, and
where the center of this mass is located. For each iteration in time the BH algorithm has the following
steps [3]:
Building the octree (see figure 6.2): The root of the tree is created as an internal node representing the
entire bounding box of the galaxy. Starting at the root node, each star is inserted into the tree by
choosing one of the eight branches, representing the eight corners of the box represented by the
internal node. Once the star has chosen the position to be inserted in, there are 3 possibilities. If
there is no node in that position, a leaf node is created with the star in it. If there is an internal node,
the space has already been sub-divided and the star chooses its position in this internal node
accordingly. If there already is a leaf node in the spot, this is converted into an internal node and both
stars (the one already at the spot and the one we are trying to insert) have to be inserted into this
newly created internal node. Building an octree has a best-case time complexity of $O(n \log n)$. If two
particles are almost on top of each other, the space needs to be divided many times before the
two particles are each bounded by their own box. Each such pair increases the tree building time
accordingly. The worst-case time complexity can therefore in theory be arbitrarily high.
Calculating center of mass: Each internal node has a mass representing the collective mass of all of its
child nodes, internal (in) or leaf (ln). The position vector $p_{in}$ of an internal node represents the center of
mass of all its child nodes. The position vector is calculated as follows:
$$p_{in} = \frac{1}{m_{in}} \sum_{i=0}^{7} p_i m_i \tag{6.7}$$
where $m_{in}$ is the collective mass of the internal node in, and $p_i$ and $m_i$ are the position
vector and mass of each of its child nodes respectively. This step has a time complexity of $O(in + ln)$, with $in + ln$ being the
total number of nodes in the tree.
Force calculation: To calculate the force exerted on each star, we traverse the tree from the top, following
each branch and stopping at a node only if one of the following two criteria is met: the node is a leaf,
or the internal node’s center of mass is located far enough away from the current star (see figure 6.3).
The tree is then not traversed any further in that branch, and the force is calculated using the node reached.
To determine if a node is far enough away we look at the following inequality:
$$s < \theta d \tag{6.8}$$
where s is the diameter of the current internal node’s bounding box, d is the distance between the
star and the current internal node, and θ is a parameter that determines the level of accuracy of the
algorithm (θ is usually 0.5; a θ of 0 would mean that all leaf nodes are visited, which is
equivalent to a regular all-pairs calculation).
Each particle will traverse the tree at least once (to itself). In a balanced tree with a reasonable θ,
some branches will be visited down to the leaf nodes but most branches will be truncated early. Therefore
the force calculation has a time complexity equivalent to that of the tree building ($O(n \log n)$ [20]),
but with a much higher constant, as the general particle traverses the tree several times.
Progress system: After the forces, and thereby the accelerations, are found, the system can be progressed
in time using numerical integration. The time complexity of this step is $O(n)$.
Figure 6.2 This figure shows a Barnes-Hut tree in a two dimensional example, where the octree instead is a quad-tree.
The figure shows how the stars in space (in the left picture) have been placed in the quad-tree (in the right picture)
Figure 6.3 When an internal node’s center of mass is far enough away from the current star G, the collective mass and
position of the node are used in lieu of the bodies beneath it in the tree, in this case A and B.
6.3.3 Other algorithms
There are several other algorithms for simulating n-body problems. The algorithms differ in complexity
and there are several algorithms that are specific to the physical problem at hand. For reference we list two
of the major ones below:
Fast multipole method. This algorithm places bodies in the leaves of a balanced octree, and determines
the multipole expansions³ of the nodes in the octree [4]. By truncating the expansions, a speedup can
be reached, but at the cost of accuracy.
Variants of the particle mesh method (PMM). The PMM is an obsolete method where particles are
placed in a grid, and force calculations are done using Fourier transformations on the grid points instead
of on each particle. Although the method itself is imprecise, several hybrids with other methods exist,
for example all-pairs particle mesh.
6.4 Data
For our project we have chosen to simulate celestial bodies. This can be anything from asteroids to planets
or whole galaxy clusters with vast numbers of stars. To be able to use the hardware to the fullest, we
need at least enough particles to occupy all threads supported by the available processors on the GPU.
The problem has to be about the size of a galaxy. Galaxies come in a variety of shapes (elliptical, spiral,
and irregular) and sizes, ranging from several million to some trillion stars on average [15]. Actual data
on real galaxies that is readily available and suitable for simulations of different proportions is hard to
come by. Taking into consideration that we need to test our programs on different sizes of galaxies, and
furthermore that we need to keep the values we do calculations on normalized in order to avoid floating
point problems (see appendix, chapter 12), we have decided to use a model galaxy. A popular model to use
in galaxy simulations, although not very realistic, is the Plummer model [26][22].
6.4.1 Plummer model
The Plummer model models a spherical stellar system. It consists only of stars and makes
use of a density function to create initial conditions for the stars in the galaxy, that is, their initial positions and
velocities. The Plummer model is normalized in order to give an easy overview of the galaxy, to be
able to compare the results of our simulations to others, and to make calculations better suited for floating
point operations.
The normalization is done by setting the gravitational constant G to 1. The collective mass M of the
galaxy is also set to 1, so that each of the n stars has a mass of M/n. The dimension of the galaxy is calculated
based on a variable R, which is also set to 1 [2]. This variable represents the radius within which most bodies will
be located. A Plummer galaxy is spherical, meaning that the cumulative mass of the stars has approximately the
same distribution in all directions from the center. The distribution of the mass is what determines each
particle’s position, and makes for a loose star cluster where the core is quite vast. The time unit of the
Plummer model is the crossing time, that is, the average time it takes one particle to cross from one
side of the galaxy to the other [22]. A recipe for Plummer galaxy creation was published in 1974 by Aarseth et
al. [2] and goes as follows. Given the values of M, R and G above, and where $X_i$ is a random number
drawn from a uniform distribution on the interval [0, 1], the mass inside a sphere of radius r of a
Plummer galaxy is:
$$M(r) = r^3 (1 + r^2)^{-\frac{3}{2}} \tag{6.9}$$
r is given by
$$r = \left( X_1^{-\frac{2}{3}} - 1 \right)^{-\frac{1}{2}} \tag{6.10}$$
3 A multipole expansion is a mathematical series of an angle-dependent function. It is frequently used when describing gravitational
fields [16].
The Cartesian coordinates x, y and z of our star are now placed on the surface of the sphere with radius r:
$$z = (1 - 2X_2)\, r$$
$$x = \sqrt{r^2 - z^2} \cos(2\pi X_3)$$
$$y = \sqrt{r^2 - z^2} \sin(2\pi X_3)$$
Now the velocity V of each star needs to be less than the escape velocity⁴ $V_e$ for that particular
star:
$$V_e = \sqrt{2}\, (1 + r^2)^{-\frac{1}{4}} \tag{6.11}$$
It can be shown [2] that the distribution of $q = V / V_e$ is proportional to $g(q) = q^2 (1 - q^2)^{\frac{7}{2}}$. Drawing q from this
distribution we can find V and thereby determine the velocity components $v_x$, $v_y$ and $v_z$:
$$v_z = (1 - 2X_4)\, V$$
$$v_x = \sqrt{V^2 - v_z^2} \cos(2\pi X_5)$$
$$v_y = \sqrt{V^2 - v_z^2} \sin(2\pi X_5)$$
In this way all n initial positions and velocities of the stars in the galaxy can be created. In figure 6.4 an
image of a generated Plummer galaxy consisting of 25000 stars can be seen from the z-axis towards the x,
y plane.
Figure 6.4 The figure shows a Plummer galaxy seen from the z-axis and consists of 25000 bodies created by our
Plummer model program
4 The velocity needed for a given star to escape from the gravity of a galaxy
7 N-body all-pairs
In this chapter we will describe the implementations of our all-pairs n-body case programs and perform
experiments on them. The purpose of this case is to implement a CUDA version of an embarrassingly parallel
problem, namely the all-pairs n-body algorithm, and investigate the effects of various CUDA optimizations
on it.
In order to do so, we have implemented all-pairs n-body simulator programs in different versions. All
versions consist of a simple all-pairs n-body simulation using Euler’s formula for integration, with the
exception of one version that uses velocity Verlet integration. All versions use generated input data based on
the Plummer model. The physics of n-body simulations, integration methods, and the Plummer model
are described in chapter 6.
For each program we will describe its implementation, or the changes to the implementation compared to the
previous version. We will then perform experiments on them. All host code is implemented mainly in C
with some use of C++ libraries and functionality. All device code is implemented in C with CUDA
extensions, as described in the CUDA chapter, and does not make use of C++ features.
The first program that we implemented is a sequential version which we have created to get acquainted
with the n-body problem, and to have a reference version for comparing results. This version is host only,
as is the parallel version using the OpenMP parallel programming API. The purpose of the latter is to
compare the performance of our parallel CUDA GPU implementations with the performance of a parallel
implementation for a multicore CPU.
The first parallel CUDA version uses basic CUDA functionality. The versions following it contain
optimizations related to main memory reading, use of the faster shared memory, and loops. Each optimized
version builds on the previous one, and in the final version we change the integration method to the slightly
more computationally expensive velocity Verlet.
Experiments will be performed on all programs using the software and hardware setup described in the
appendix (chapter 11). The experiments will consist of tests and benchmarks on a range of datasets and
varying parameters. The observed results will be presented in each section, and the chapter will be rounded
off with a summary of the results.
Each section in this chapter describing an implementation of an n-body simulation program will be
structured in the following way.
First we will introduce the implementation, the motivation behind it, and the expected outcome.
Thereafter we will go into detail specific to the implementation, illustrated with code samples. If there are any
known issues with the implementation these will be explained.
Next, we will present the tests we perform on the program. We are testing performance on different data
sizes for all problems. The galaxy sizes we examine are presented in table 7.1. Parameters that will vary are
thread count and configurations. Additionally, simulation parameters will be adjusted so that the runtime
is long enough to provide a meaningful benchmark. Tests relevant for the particular implementation will
be conducted as well.
Finally, the results of the tests will be presented.
name   1k     2k     4k     8k     16k     32k     64k     128k
size   1024   2048   4096   8192   16384   32768   65536   131072
Table 7.1 The galaxy sizes and the names by which they are referred to
7.1 All-pairs n-body simulation, sequential version
The first n-body simulation program we have created is a sequential version of the all-pairs algorithm,
made in C with some use of C++. There are several reasons behind the decision to make a sequential
version.
First of all, we wanted a strictly sequential version which we know gives the correct results. For the
purposes of debugging parallel implementations, it is useful to have a sequential version of the program. We
use it to check the output of a given parallel version, using the same simulation data and parameters. By
comparing with the output data from our sequential version, we can be fairly confident that the parallel
implementation has not introduced errors specific to the parallelism. Even though the main purpose of our
project is not high-precision simulations, making sure that there are no errors or potential problems with
the parallel versions is naturally important.
Secondly, by using the benchmarks from the sequential version we are also able to investigate performance
gains from parallelism.
Finally, knowing the original version of the implementation, the reader should be able to incrementally
follow the development of the later versions. In the next section we will describe the implementation of a
parallel OpenMP version, which is directly based on this version, as is the first CUDA implementation.
7.1.1 Implementation
This is the first version of the n-body all-pairs algorithm. The implementation uses formula 6.6 presented
in chapter 6. The description of the implementation will be thorough, even though this version is the simplest,
as the rest of the chapter will build on this original implementation. The full source code can be found
through the appendix (chapter 15).
The main method of our program reads in the simulation data from files containing Plummer model
galaxies that were generated by the program described in the appendix (chapter 11). The filename of the input file
is taken from a command line parameter, as are parameters overriding the default number and size of
timesteps. Simulation parameter values are saved in global variables, as are the pointers to the arrays
containing the data. Variables of interest are:
t, dt and n, representing the number of time iterations, the size of each time-step and the number of bodies
respectively.
The softening factor is called EPSILON and the universal gravitational constant is G.
The position, velocity and acceleration vectors of all bodies are represented by three arrays
each (px, py, pz, vx, vy, vz, ax, ay, az), one for each Cartesian coordinate. The mass of each body is saved
in the array m. It should be noted that the mass is the same for all bodies and G is always one due to our
use of a simplified Plummer model.
After loading the data and parameters, our program runs the main loop of the simulation. The simulate
function discretely advances the movement of the system by one time-step, and is called t times. We time the entire
simulation, and optionally the final state of the galaxy can be printed to an output file.
The main loop of the program can be seen below. It is worth noting that the force calculation is
the same for both stars in a pair, and therefore the number of calculations is halved by reusing the temporary
variables. Also worth noting is the intentional use of floats instead of doubles. We do this as we want to be
able to compare with the CUDA implementations, where floats are used for performance reasons.
void simulate(void)
{
    for (int i = 0; i < n; i++)         // for each body
    {
        for (int j = i; j < n; j++)     // increment acceleration for both stars in the pair
        {
            // calculate squared distance, adding the softening factor
            float rtemp = EPSILON + dist2(px[i], py[i], pz[i], px[j], py[j], pz[j]);
            float temp = G / sqrtf(rtemp * rtemp * rtemp);
            ax[i] += m[j] * temp * (px[j] - px[i]);
            ay[i] += m[j] * temp * (py[j] - py[i]);
            az[i] += m[j] * temp * (pz[j] - pz[i]);
            ax[j] += m[i] * temp * (-(px[j] - px[i]));
            ay[j] += m[i] * temp * (-(py[j] - py[i]));
            az[j] += m[i] * temp * (-(pz[j] - pz[i]));
        }
        // update velocity using Euler
        vx[i] += ax[i] * dt;
        vy[i] += ay[i] * dt;
        vz[i] += az[i] * dt;
    }
    for (int i = 0; i < n; i++)         // update position using Euler
    {
        px[i] += dt * (vx[i]);
        py[i] += dt * (vy[i]);
        pz[i] += dt * (vz[i]);
        ax[i] = 0.0f; ay[i] = 0.0f; az[i] = 0.0f;   // reset acceleration
    }
}
As can be seen, the first part of the simulate function makes use of two nested for loops to calculate and
accumulate the acceleration, and thereafter the velocity, of each body. Afterwards the entire system is
progressed by another for loop that increments the position using the velocity.
The energy function mentioned in the setup chapter in the appendix (chapter 11) is called when a debug flag is set using
a #define. This is to measure the energy of the system, a value we use as verification.
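The helper dist2 called in the simulate function is not listed in this section. A minimal sketch that is consistent with how it is called, returning the squared distance between two points, could look as follows. The body of the function is an assumption.

```c
/* Squared Euclidean distance between (x1, y1, z1) and (x2, y2, z2).
   Hypothetical body for the dist2 helper; only its call is shown in the text. */
float dist2(float x1, float y1, float z1, float x2, float y2, float z2)
{
    float dx = x2 - x1;
    float dy = y2 - y1;
    float dz = z2 - z1;
    return dx * dx + dy * dy + dz * dz;
}
```

Returning the squared distance avoids a square root here. The only square root needed per pair is then the sqrtf in the force calculation itself.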
7.1.2 Notes
As explained in chapter 6, the time complexity of the all-pairs algorithm is $O(n^2)$, so even if the force
calculation is halved the algorithm runs progressively slower as the problem grows.
Using Euler integration will make the calculations imprecise, especially for “large” step sizes.
As the program uses single-precision floating point calculations, large systems with low values of mass,
and by extension of acceleration, will be more affected by the limited precision.
7.1.3 Tests & results
As we are using Euler integration, it is of interest to examine with which time step size, dt, a good simulation
result is achieved. The energy should stay about the same in order for the time-step size to be
deemed acceptable. We examine the energy over the same time interval of 1, equivalent to the crossing time
of one particle, but with different iteration counts. The difference between the starting energy and
the energy after the entire simulation, as a function of dt, can be seen in table 7.2. It can be
seen that there is a significant difference in energy for values of dt larger than 0.0002, so to be on the safe
side we use 0.0001 as dt for the remainder of our experiments.
dt         1      0.1    0.02   0.01   0.002   0.001   0.0008   0.0005   0.0002   0.0001
E diff.    0.30   1.76   7.21   2.96   0.586   0.28    0.290    0.130    0.013    0.000
E diff.%   44.1   262    1060   403    86.7    41.8    42.8     19.1     2.0      0
Table 7.2 The table shows the difference in energy of the system progressed by the sequential all-pairs program, as a
function of the step-size dt for a time span of 1 (the average crossing time of a star).
We will now benchmark the time it takes for the program to finish simulations of the galaxy sizes listed
in table 7.1. Only the actual simulation is timed, and the mean time of one iteration is used as the benchmark.
The results of the benchmarking can be seen in figure 7.1. On the graph it can be seen that the run-times
are increasing as expected. The program becomes unbearably slow for larger problems. With a galaxy size
of about 1 million bodies, one time-step of the simulation can take 10 hours or longer depending on the CPU.
Figure 7.1 The average time for one iteration of the all-pairs sequential simulation for various sizes of galaxies. The
y-axis is representing the timings in milliseconds presented on a logarithmic scale.
7.2 All-pairs n-body, parallel OpenMP version
To move into parallelism, the first obvious step was to convert our sequential program into a
parallel one using OpenMP. As described in chapter 4, OpenMP handles parallelism easily, with much of it
being implicit. We wanted a version that was a bridge between the CPU version and the GPU versions without much
change to the general structure of the program. That said, the parallelized version does not include
the halving of the force calculation. The reason for this is twofold. Firstly, we wanted the implementation
to be as simple as possible, without having to worry about race conditions or having to implement a new
force matrix. Secondly, the CUDA versions do not halve the force calculations either, as this would cause
thread divergence.
7.2.1 Implementation
The implementation of the parallelized version is almost identical to that of the sequential version. The nested
main loop conducting the velocity calculations traverses all bodies in each iteration, as just mentioned,
making the program almost half as fast as the sequential version for a single thread.
The parallelization of the two main loops has been done using OpenMP pragmas. Two kinds of pragmas
are used in this version.
#pragma omp parallel for: This pragma tells the compiler to parallelize the for loop, making sure the iterations are done
by different threads, see the pseudocode below. Both the velocity and the position calculations are done using
a parallel for loop.
#pragma omp parallel for default(none) shared(n, dt, px, py, pz, vx, vy, vz, ax, ay, az, m)
for (int tk = 0; tk < n; tk++)
{
    for (int j = 0; j < n; j++)
    {
        // calculate force interaction between body tk and j
    }
}
#pragma omp barrier: This pragma synchronizes the threads; each thread waits for all others to reach
the barrier before continuing beyond it. The barriers are used as we need to be sure that all
velocities are updated before we start using them in the position calculations.
7.2.2 Tests & results
One point that can be made immediately about OpenMP is its extreme ease of use, as seen in this example.
For an embarrassingly parallel algorithm such as this one, the results are great, as we will now see.
The benchmarks for the different galaxy sizes can be seen in figure 7.2. The graph shows the mean time
used for one iteration of the simulation for thread counts of 1, 2, 4, 8, 16 and 32 on the 16-core
machine Alvin. We observe quadratic growth in time consumption as the galaxy size grows, for all
thread counts.
When we use 32 threads we do not experience any further speedup; for the smaller problem sizes the
speed actually decreases. This is of course because Alvin only has 16 cores, so the extra threads only
cause overhead in the program.
We also observe an additional increase in time between galaxy sizes of 16k and 32k. We do not know
exactly why this occurs; it might be related to cached data that needs to be re-read from some galaxy
size onwards, introducing overhead. To give an exact answer we would need to perform further
investigations, which we consider outside the scope of this report, whose focus is CUDA
programming.
Figure 7.3 shows the relative speedup given the number of threads for all galaxy sizes. We observe a
speedup factor for most configurations of the same magnitude as the number of threads used in the
simulation. Deviating from this is of course the galaxy size of 32k and the use of 32 threads, for the reasons explained above.
We consider the speedup quite satisfactory as the algorithm is embarrassingly parallel and therefore the
speedup should be almost linear.
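The notion of relative speedup used here, and how close it is to linear, can be expressed as a small calculation. The following is a minimal sketch; the function names and the timings in the usage note are hypothetical and are not measurements from our benchmarks.

```c
#include <assert.h>

/* Relative speedup of a run with several threads over the single-thread run. */
double relative_speedup(double t1_ms, double tp_ms)
{
    assert(t1_ms > 0.0 && tp_ms > 0.0);
    return t1_ms / tp_ms;
}

/* Parallel efficiency: 1.0 corresponds to perfectly linear speedup. */
double parallel_efficiency(double t1_ms, double tp_ms, int threads)
{
    assert(threads > 0);
    return relative_speedup(t1_ms, tp_ms) / threads;
}
```

With hypothetical timings of 1600 ms on one thread and 100 ms on 16 threads, the speedup is 16 and the efficiency is 1.0. Running the same 100 ms with 32 threads would give an efficiency of 0.5, which is the situation on a 16-core machine when the extra threads add nothing.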
Figure 7.2 The graph shows the benchmark speed for the all-pairs simulation using our OpenMP program. The x-axis
shows the galaxy size and the y-axis the time used for one iteration of the simulation. Both axes are logarithmic.
Each curve represents a configuration of the number of threads used in the simulation.
Figure 7.3 The figure shows the relative speed increase as a function of the number of threads. Each column represents a
certain size of the galaxy, shown in the legend.
7.3 All-pairs n-body first CUDA version
We have implemented a succession of CUDA versions of the all-pairs algorithm, each making use of new
CUDA features or optimizations. The idea of the first program is to make a rather “naive” version to see
what speedup, if any, a simple re-write from C to CUDA C can accomplish. With this program we also
demonstrate the minimum required to successfully convert a C program to CUDA. Even without any
optimizations we expect to observe a speedup compared to the sequential all-pairs program, due to the
number of processors on the CUDA device.
7.3.1 Implementation
The implementation of this first CUDA all-pairs version has been kept close to the OpenMP version.
Only the changes to the code that relate to CUDA are described here. The implementation does not
use the optimization of the algorithm that requires only half the force calculations, as this would cause
thread divergence. The full code can be found in the appendix (chapter 15) for reference.
The program has two new includes, namely #include "cuda.h" and #include "book.h". The first
defines functions for the host and types for the CUDA driver API [35]. The latter is a package that, among
other things, handles errors, which is its main purpose for us. The HANDLE_ERROR() function (used in CUDA
by Example [35]) is from this package.
Two important new variables are added: threads, which defines the number of threads in each block,
and blocks, which defines the number of blocks in the grid, as explained in chapter 5. In the first test version we
set these values from the command line before executing the program. We did so to find an
optimal configuration of blocks and threads. Later on these will be changed.
In a CUDA program we need to keep track of data on both the host and the device. We have marked
all variables on the device with the suffix _Dev. Most arrays and variables are initialized as normal
on the host and have a _Dev counterpart on the device to which the data is transferred. This is done in our
initAndCopyDataToDevice() function, which performs all memory allocation for the device. As
an example, the x value of the position vector is stored in the array px on the host. We allocate memory for
a device array px_Dev to store the data in using cudaMalloc((void **) &px_Dev, n * sizeof(float)).
Then we transfer the data from the host array to the device array using cudaMemcpy(px_Dev, px, n *
sizeof(float), cudaMemcpyHostToDevice). After the GPU has finished the simulation we copy the data back and
free the memory space using cudaMemcpy(px, px_Dev, n * sizeof(float), cudaMemcpyDeviceToHost)
and cudaFree(px_Dev) respectively.
Our program includes two kernels launched by the host and one device-only function. The __device__
float dist2 is a helper function that calculates the squared distance between two vectors. The two
functions with the __global__ qualifier make up the main loop of the simulation. Both are called with the
number of blocks and threads as launch configuration parameters. The first, __global__ void
updateVelocity<<<blocks, threads>>>, does as the name suggests and calculates the velocities of the system. The
second, __global__ void updatePosition<<<blocks, threads>>>, progresses the system using the
found velocities. There is no global barrier as such in CUDA, but the second kernel does not start until all
threads of the first have finished, so the kernel boundary works as a barrier. The first kernel executes as seen
in the code fragment below.
    __global__ void updateVelocity(variables)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x; // unique global thread id
        int numThreads = gridDim.x * blockDim.x;         // total number of threads in the grid
        for (int tk = tid; tk < n; tk += numThreads)
        { ... }
    }
The calculation inside the outer for loop is known from earlier, but the variables tid and numThreads are
not. We use the built-in variables threadIdx.x, blockIdx.x, blockDim.x and gridDim.x (described in section 5.3.1) to
compute a unique global thread id, tid, and the number of threads in the configuration, numThreads. The
two variables are then used as the start value and the stride of the loop.
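To illustrate why this loop pattern covers every body exactly once for any launch configuration, the index arithmetic can be emulated on the CPU. The following is a sketch only: the function names are hypothetical, and the CUDA built-in variables are replaced by ordinary parameters.

```c
#include <assert.h>

/* Emulates which bodies one CUDA thread would visit in the loop above.
   Increments visits[tk] for every index tk the thread handles. */
void visit_as_thread(int block_idx, int thread_idx, int block_dim, int grid_dim,
                     int n, int *visits)
{
    assert(block_dim > 0 && grid_dim > 0);
    int tid = thread_idx + block_idx * block_dim;  /* unique global thread id */
    int num_threads = grid_dim * block_dim;        /* stride of the loop */
    for (int tk = tid; tk < n; tk += num_threads)
        visits[tk]++;
}

/* Runs every thread of the grid in turn.
   Returns 1 if each of the n bodies was visited exactly once, otherwise 0. */
int grid_covers_all(int grid_dim, int block_dim, int n, int *visits)
{
    for (int i = 0; i < n; i++)
        visits[i] = 0;
    for (int b = 0; b < grid_dim; b++)
        for (int t = 0; t < block_dim; t++)
            visit_as_thread(b, t, block_dim, grid_dim, n, visits);
    for (int i = 0; i < n; i++)
        if (visits[i] != 1)
            return 0;
    return 1;
}
```

The same holds when there are more threads than bodies: threads whose tid is not below n simply never enter the loop.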
The second kernel creates its loop counters in the same way as the first, and its loop is unchanged from
the original version of the program.
We use CUDA runtime library functions together with the type cudaEvent_t to time the simulation on the
device.
7.3.2 Known issues
This is a first version of the CUDA program and is not optimized as such, which the benchmarks will
reflect. Large problems will experience increasing floating point problems, as the acceleration added to the
velocity during the velocity calculation will be of a significantly smaller magnitude than the velocity itself,
due to the mass being small. This does not affect the benchmark timings, only the accuracy of the calculations.
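The floating point problem mentioned above can be demonstrated in isolation. The sketch below is not taken from our program; the function names and the values used are illustrative assumptions. It adds a very small increment to a value many times, once in single and once in double precision.

```c
#include <assert.h>

/* Adds a small increment to v repeatedly in single precision. */
float accumulate_float(float v, float increment, int times)
{
    assert(times >= 0);
    for (int i = 0; i < times; i++)
        v += increment;   /* lost entirely if increment is below half an ULP of v */
    return v;
}

/* The same accumulation in double precision, for comparison. */
double accumulate_double(double v, double increment, int times)
{
    assert(times >= 0);
    for (int i = 0; i < times; i++)
        v += increment;
    return v;
}
```

Starting from 1.0f and adding 1e-8f one million times leaves the single precision value at exactly 1.0f, because each individual addition is rounded away. The double precision version ends close to 1.01. The same effect occurs when a very small acceleration contribution is added to a much larger velocity.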
7.3.3 Tests & results
As the calculations are now done on a new platform, we re-examine the energy of the system. As with
the sequential C version, we examine decreasing sizes of dt until the energy stays the same. As can be
seen in table 7.3, the energy stabilizes at a step-size of 0.0001, just as with the sequential version.
dt          1      0.1    0.02   0.01    0.002   0.001   0.0008   0.0005   0.0002   0.0001
E diff.     0.30   1.77   0.53   0.50    0.27    0.24    0.10     0.130    0.02     0.00
E diff. %   262    472    45.3   25.00   132     25.24   4.41     19.1     0.00     0
Table 7.3 The table shows the difference in energy of the system progressed by the all-pairs CUDA program, as a
function of the step-size dt for a time span of 1 (the average crossing time of a star).
The number of threads in the execution of the program is of interest when benchmarking the speed of
the simulation. Just as with the parallelized C version, we look at benchmarks of several configurations of
the thread count. The number of CUDA cores on the Tesla card is 240, but the number of
threads supported in a single launch is above 33 million (64k blocks × 512 threads for compute capability 1.0).
The total number of threads is divided into blocks, and we have chosen to look at the different configurations
presented below (blocks / threads):
• first setup: 1 / 1. Only one thread runs on the GPU. This test corresponds to a sequential
execution of the program. The execution time for larger galaxies has not been tested as the run-times
were too high.
• second setup: 1 / 512. The program runs exclusively on one SM. In theory this means that only
32 threads (one warp) are executed at once.
• third setup: 1024 / 1. Each SM runs only one thread at a time. On the Tesla card that means 30
threads.
• fourth setup: 30 / 32. One block for each streaming multiprocessor on the card, and each SM runs
one warp's worth of threads. This is 4 times as many threads as there are cores on the card.
• fifth setup: 512 / 192. 192 is the recommended minimum number of threads in each block to hide the
latency of register dependencies (see section 5.4). The block-size was chosen to accommodate all galaxy sizes.
We assume this setup should be close to an optimal setup.
• sixth setup: 65535 / 512. The maximum number of blocks and threads possible on the Tesla card.
Chosen to see if this has any effect.
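The thread counts behind these setups follow from simple arithmetic, sketched below. The function names are hypothetical; the second function assumes the loop pattern of our kernels, where a thread starts at its global id and advances by the total number of threads.

```c
#include <assert.h>

/* Total number of threads in a launch of `blocks` blocks of `threads` threads. */
long total_threads(long blocks, long threads)
{
    assert(blocks > 0 && threads > 0);
    return blocks * threads;
}

/* Number of bodies handled by the busiest thread (ceiling division). */
long bodies_per_thread(long n, long blocks, long threads)
{
    assert(n >= 0);
    long total = total_threads(blocks, threads);
    return (n + total - 1) / total;
}
```

For the sixth setup, total_threads(65535, 512) gives 33,553,920, the figure of "above 33 million" mentioned earlier. For the fourth setup the result is 960, which is 4 times the 240 cores. For a hypothetical galaxy of 1024 bodies, the single block of the second setup has to pass through its loop twice per thread, while the fifth setup needs at most one pass.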
The timing results from the different simulations mentioned above are summarized in figure 7.4. Each
simulation has been run for several iterations, and the graph shows the mean elapsed time of one iteration.
Figure 7.4 The average time for one iteration of the all-pairs CUDA simulation, first version, for various sizes of galaxies
and various block and thread counts. The y-axis shows the timings in milliseconds on a logarithmic
scale.
From the results we can see that the fastest setup can be up to 1000 times faster than single-thread
execution, which is more than the 240 cores alone can account for. The results can be explained in the light of
the architecture (section 5.4).
The third setup is the 2nd slowest, but still approximately 100 times faster than single-thread execution, even though
only one thread runs on each of the 30 multiprocessors. As several blocks are assigned to each SM, some
optimizations of data fetching and scheduling are done by the hardware.
The second setup performs better for smaller systems than for bigger ones, which is to be expected as only
one SM works on the problem. Looking at the smallest galaxy size, we can see that the 2nd setup
“only” works at half the speed of the optimal setup. This can easily be explained by looking at the code: only 2 SMs
will work on the data with the way we made the program, as threads from block 3 and up will never
enter the for-loops.
Setups 4 and 2 have the same speed for the smallest galaxy size. The 2nd setup only makes use of one SM,
so in comparison to setup 4 it looks as if its single SM works 30 times faster than the SMs in setup 4. This can
be explained by occupancy and latency hiding. Both setups will have SMs that run the algorithm twice:
all threads in setup 2, but only 64 in setup 4. This means that 28 SMs in the fourth setup will stand idle for half
the duration of the simulation. Occupancy means that, due to scheduling, 192 threads can execute
simultaneously in the 2nd setup while of course only 32 execute in setup 4. The initial data fetch from global
memory for each warp cannot be hidden and takes hundreds of clock-cycles (see section 5.4), but
subsequent data fetches for setup 2 do not slow the process down, as the SM is working during the
later fetches. This example shows us that using the processor optimally really increases the performance of a
streaming multiprocessor.
Adding a lot more blocks than needed does not increase performance, as can be seen by comparing setups
5 and 6. It does create some overhead, slowing down simulations of smaller problems. Only when all
SMs are occupied, at galaxy sizes of 64k and more, does the performance of setup 6 equal that of setup 5.
Figure 7.5 shows that thread and block counts larger than those of setup 5 do not really introduce any
speed-up at all.
Figure 7.5 The graph shows the average time for one simulation iteration of the all-pairs CUDA first version, for the chosen
sizes of galaxies and various larger block and thread counts. The y-axis shows the timings in milliseconds
on a logarithmic scale.
Finally we observe figure 7.6, which is a comparison between our OpenMP version (using 16 threads)
and the best simple CUDA version (the blocks=512 and threads=192 configuration). We can see that for
small problem sizes the CUDA version performs worse than the CPU version. This can easily be explained,
as the CUDA card is far from fully utilized: at least 5,760 threads (192 threads on each of the 30 SMs) have to be
used in order to make use of all multiprocessors. The best speed increase is gained for the 32k galaxy, where the
speed-up is 400%, but the other large galaxy sizes have similar speedups. We find the speedup
interesting considering the relatively powerful Alvin machine and the fact that the CUDA version is not fully
optimized.
Figure 7.6 The graph shows the speedup gained by using a simple CUDA version, compared to an OpenMP version
using 16 cores.
7.4 All-pairs N-body, CUDA: float4 optimization
Latency of data fetches is, as described in chapter 5, of great concern, and so is coalescing. A simple
optimization that shows this is the use of the compound type float4, which packs 4 floats into one single unit. The type
is well suited to handling vectors and can help coalesce data fetches, as 4 times more data is retrieved
with one fetch; data that would not have been coalesced when present in four different arrays.
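The change in data layout can be sketched on the host side as follows. CUDA provides float4 as a built-in type; the struct float4_host and the function pack_position_mass below are hypothetical CPU stand-ins used for illustration. Storing the mass in the w component is an assumption based on how the mass is read in our later kernels.

```c
#include <assert.h>

/* CPU stand-in for CUDA's float4: four floats, 16 bytes in total. */
typedef struct { float x, y, z, w; } float4_host;

/* Packs four separate arrays into one array of structures, so that all
   data of one body lies next to each other in memory. */
void pack_position_mass(int n, const float *px, const float *py,
                        const float *pz, const float *m, float4_host *p_m)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        p_m[i].x = px[i];
        p_m[i].y = py[i];
        p_m[i].z = pz[i];
        p_m[i].w = m[i];   /* mass travels along with the position */
    }
}
```

After packing, one 16-byte read delivers everything needed about a body, where the separate arrays required four reads from four different memory locations.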
7.4.1 Implementation
The implementation of float4 in our CUDA all-pairs program is quite straightforward. The four arrays
representing position (px, py, pz) and mass are packed into one float4 array called p_m for all bodies. The
velocities are stored in a float4 array called vel. Temporary float variables are swapped for float4 variables where
we are able to do so.
7.4.2 Known issues
For higher versions of the CUDA compute capability the float3 type might be used to save space for
the velocity vector. For lower compute capabilities float3 is slower than float4 due to coalescing issues.
As we only fetch one compound value at a time, the use of float3 does not show any speed increase for
this implementation.
7.4.3 Tests & results
In figures 7.7 and 7.8 the results from the float4 optimization can be seen. We note that the relation between
simulation time and galaxy size follows that of the previous version.
The speedup is greatest for small galaxy sizes. We explain the extra speedup for the small galaxies
by the inability of the graphics card to hide the latency of global memory fetches when the loop only
executes once.
Figure 7.7 The graph shows the mean time in milliseconds used for one iteration of the all-pairs simulation with the
float4 optimization for CUDA, compared to the first version. The x-axis shows galaxy size and the y-axis the time in
milliseconds; both axes are on a logarithmic scale.
Figure 7.8 The graph shows the speedup gained with using float4 optimization, compared to a version without it.
7.5 All-pairs N-body, CUDA: shared memory optimization
The SM specific level 1 cache on the GPU, referred to as the shared memory, can be utilized to greatly
increase performance of a CUDA program. In this next version of the all-pairs program we will demon-
strate that using the shared memory will increase the speed of our program. The memory space works as
a buffer for programs where it is possible to store reusable data that has low latency compared to that of
global memory. In our case using the shared memory will also increase data coalescing as well as having
threads cooperate in the data handling.
The size of the shared memory is limited and it is only visible to threads in the same block, which means we
need to do a small “trick” to be able to use the shared memory. This is done by utilizing a technique
called tiling. The idea of tiling is to divide a problem into smaller parts that can be handled separately in a
beneficial way. In our case we will tile the acceleration calculation.
In order to explain the technique we will use a small example. We look at the acceleration calculations as
a matrix, where each body contributes its partial quantity of acceleration to the other bodies. We are
then able to divide the acceleration calculation into smaller portions.
Consider a 12-body system, where we are using a program with 3 blocks with 4 threads in each. As there
are several blocks we cannot utilize shared memory to fit in the entire galaxy. All the acceleration
contributions $a_{i,j}$, where i and j denote the i’th and j’th body of the system, can be described as entries in the
acceleration matrix Acc below.
\[
Acc =
\begin{pmatrix}
a_{0,0} & a_{0,1} & \cdots & a_{0,11} \\
a_{1,0} & a_{1,1} & \cdots & a_{1,11} \\
\vdots & \vdots & \ddots & \vdots \\
a_{11,0} & a_{11,1} & \cdots & a_{11,11}
\end{pmatrix}
\]
Consider that one thread handles the calculations for one body, but instead of calculating all the
contributions in one go, we divide the matrix into 9 tiles that are all smaller 4 × 4 matrices $A_{(r,c)}$:
\[
Acc =
\begin{pmatrix}
A_{(0,0)} & A_{(0,1)} & A_{(0,2)} \\
A_{(1,0)} & A_{(1,1)} & A_{(1,2)} \\
A_{(2,0)} & A_{(2,1)} & A_{(2,2)}
\end{pmatrix}
\]
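Each tile looks as follows, where ts is the side length of a tile, each tile being
a ts × ts square (here ts = 4):
\[
A_{(r,c)} =
\begin{pmatrix}
a_{(r \cdot ts+0,\, c \cdot ts+0)} & a_{(r \cdot ts+0,\, c \cdot ts+1)} & a_{(r \cdot ts+0,\, c \cdot ts+2)} & a_{(r \cdot ts+0,\, c \cdot ts+3)} \\
a_{(r \cdot ts+1,\, c \cdot ts+0)} & a_{(r \cdot ts+1,\, c \cdot ts+1)} & a_{(r \cdot ts+1,\, c \cdot ts+2)} & a_{(r \cdot ts+1,\, c \cdot ts+3)} \\
a_{(r \cdot ts+2,\, c \cdot ts+0)} & a_{(r \cdot ts+2,\, c \cdot ts+1)} & a_{(r \cdot ts+2,\, c \cdot ts+2)} & a_{(r \cdot ts+2,\, c \cdot ts+3)} \\
a_{(r \cdot ts+3,\, c \cdot ts+0)} & a_{(r \cdot ts+3,\, c \cdot ts+1)} & a_{(r \cdot ts+3,\, c \cdot ts+2)} & a_{(r \cdot ts+3,\, c \cdot ts+3)}
\end{pmatrix}
\]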
The threads of a block enter one tile at a time, in order to find each incremental acceleration
$a_{i,j}$.
Before calculating the accelerations in the tile, the threads read the data (the position and mass float4
construct) for the four bodies of the tile's columns from global memory into the shared memory. This data transfer
will be coalesced, as the 4 float4s are placed next to each other in an array and fill up a full 64-byte transfer.
Each thread needs the data from all 4 bodies to calculate its four incremental accelerations, so we
have traded 3 slow data transfers from global memory for 4 fast ones from the shared memory. The process
is then repeated for all the tiles.
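The effect of tiling on the order of data accesses can be emulated on the CPU. The following is a sketch under stated assumptions: the function names, the one-dimensional state and the placeholder contribution are hypothetical, and a small local buffer stands in for the shared memory. It is meant to show that the tiled traversal computes the same sum as the direct one.

```c
#include <assert.h>

#define TILE 4  /* tile size, matching the 4-thread blocks of the example */

/* Placeholder pairwise contribution of body j on body i (1D, not gravity). */
static float contribution(float xi, float xj, float mj)
{
    return mj * (xj - xi);
}

/* Direct version: every contribution is read straight from the full arrays. */
float accel_direct(int n, const float *x, const float *m, int i)
{
    float a = 0.0f;
    for (int j = 0; j < n; j++)
        a += contribution(x[i], x[j], m[j]);
    return a;
}

/* Tiled version: the bodies of one tile are first copied into a small
   buffer, then all contributions of that tile are read from the buffer. */
float accel_tiled(int n, const float *x, const float *m, int i)
{
    assert(n > 0);
    float tile_x[TILE], tile_m[TILE];
    float a = 0.0f;
    int tiles = (n + TILE - 1) / TILE;
    for (int t = 0; t < tiles; t++) {
        int count = 0;
        for (int l = 0; l < TILE && t * TILE + l < n; l++, count++) {
            tile_x[l] = x[t * TILE + l];   /* one read from the full array per body */
            tile_m[l] = m[t * TILE + l];
        }
        for (int l = 0; l < count; l++)    /* reads served from the buffer */
            a += contribution(x[i], tile_x[l], tile_m[l]);
    }
    return a;
}
```

On the GPU the copy into the buffer is shared between the threads of a block, each thread loading one body, which is where the saving comes from. In this sequential emulation only the access pattern is visible, not the speedup.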
7.5.1 Implementation
When implementing our shared memory version of the all-pairs simulation, the first thing we do is
to add a third launch configuration parameter to the velocity kernel call:
updateVelocity<<<BLOCKS, THREADS, THREADS*(sizeof(float4))>>>(n, dt, p_m_Dev, vel_Dev). This defines
the size of our shared memory, which in this case is the number of threads in a block times the size of a float4.
We then assign this memory space to an array shared_p_m[] using the extern __shared__ keywords. Within
the velocity kernel we need to keep track of how many tiles we have, so a tiles variable is defined.
We make use of the __syncthreads() function, which works as a synchronization barrier for the block, in
order to avoid concurrency problems when reading from and writing to the shared memory space.
The velocity calculation loop is now restructured using 3 nested loops: the first one running through all
bodies, the second one running through all tiles, and the third reading in data and performing acceleration
calculations for each body in the tile. The changes made in this version of the CUDA all-pairs program can
be seen in the code below (reduced for brevity):
    __global__ void updateVelocity(int n, float dt, float4 *p_m, float4 *vel)
    {
        ...
        int tiles = n / blockDim.x + 1; // number of tiles
        // a shared memory array on the device; there is 2KB shared
        // per processor (16K per SM).
        extern __shared__ float4 shared_p_m[];
        for (int k = tid; k < n; k += threadsInGrid)
        {
            float4 p_mK = p_m[k];
            float4 accK; accK.x = 0.0f; accK.y = 0.0f; accK.z = 0.0f; // temporary acceleration float4
            for (int j = 0; j < tiles; j++) // the block takes one tile at a time and reads it into shared memory
            {
                if (j * threadsInBlock + threadIdx.x < n) // guard the global read below against indices past n
                {
                    shared_p_m[threadIdx.x] = p_m[j * threadsInBlock + threadIdx.x];
                }
                __syncthreads(); // make sure all memory is read in
                for (int l = 0; l < threadsInBlock; l++) // each thread calculates the velocity of its body
                                                         // using the shared tile memory
                {
                    // for each star in the tile, do the usual acceleration calculations
                }
                __syncthreads(); // makes sure no memory is overwritten
            }
            // saving the temp variables back to global memory
        }
    }
7.5.2 Known issues
The size of the shared memory can be an issue, but in our case we can maximally read in 8kB (512 threads
* 16 bytes per float4), which is below the 16kB available in an SM on a Tesla card.
Our implementation of the program only accepts block sizes that divide the galaxy size evenly. This
means that we change the block size slightly, to 256, for the rest of our implementations.
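Both constraints can be written down as a small check. The sketch below uses hypothetical names, takes the 16kB per SM from the text above as a constant, and uses a galaxy size of 1024 in the usage note purely as an example.

```c
#include <assert.h>
#include <stddef.h>

#define FLOAT4_BYTES 16u            /* four 4-byte floats */
#define SHARED_BYTES_PER_SM 16384u  /* 16kB, the figure quoted for the Tesla card */

/* Bytes of shared memory needed for one float4 per thread of the block. */
size_t tile_bytes(unsigned threads_per_block)
{
    return (size_t)threads_per_block * FLOAT4_BYTES;
}

/* Returns 1 if the tile fits in shared memory and the block size divides n. */
int block_size_ok(unsigned n, unsigned threads_per_block)
{
    assert(threads_per_block > 0);
    return tile_bytes(threads_per_block) <= SHARED_BYTES_PER_SM
        && n % threads_per_block == 0;
}
```

tile_bytes(512) gives the 8192 bytes mentioned above. For an example galaxy of 1024 bodies, a block size of 192 fails the divisibility check (1024 = 5 * 192 + 64), while 256 passes it.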
7.5.3 Tests & results
As can be seen in figures 7.9 and 7.10, the optimization contributes to a speed-up for all galaxy sizes. The
speedup increases with the galaxy size, and speedups close to 600% are seen for the 128k galaxy. This is not
odd, as the part optimized, the acceleration calculation, is the most demanding for both the computations
and the data fetching. Reducing this will of course benefit bigger problems the most. As we can see in
figure 7.9, the time complexity still follows $O(n^2)$, just with a lower constant.
Figure 7.9 The graph shows the mean time in milliseconds used for one iteration of the all-pairs simulation with the tiling
(shared memory) optimization, compared to the float4 version. The x-axis shows galaxy size and the y-axis the time in
milliseconds; both axes are on a logarithmic scale.
7.6 All-pairs n-body, CUDA: loop optimization
The code that is most often executed in our program is within loops, and the iteration count increases with
problem size, making loops an interesting place to optimize. We know, as described in our CUDA chapter,
that we can achieve gains in performance by unrolling loops. Loop unrolling transforms several iterations
of the loop into one code segment. Doing so can decrease the overhead of performing loop iterations.
Another method for optimizing loops involves making the code in the body of the loop as efficient as possible
and removing unneeded code.
Figure 7.10 The graph shows the speedup gained using tiling (shared memory) optimization, compared to a version
with just float4.
CUDA has several fast arithmetic operations that can be used to increase performance of certain functions,
sometimes at the price of precision (see appendix chapter (12)). One of these is the square root function,
which we use in the innermost loop of our program. For this version of the CUDA all-pairs program, we
will optimize by making use of our knowledge of the CUDA square root function and by applying loop
unrolling.
7.6.1 Investigating the loop
Before performing the optimizations, we will look at the code being optimized. The updateVelocity and
updatePosition CUDA functions are used to perform the simulation in our program. Both iterate over the
data using loops. Most of the calculation work of the updateVelocity function, the larger of the two
functions, is performed within the inner loop.
The code looks as follows:
    float rtemp = EPSILON + dist2(this_p_m, shared_p_m[l]);
    float temp = dt * G * shared_p_m[l].w / sqrtf(rtemp * rtemp * rtemp);
    thisVel.x += temp * (shared_p_m[l].x - this_p_m.x);
    thisVel.y += temp * (shared_p_m[l].y - this_p_m.y);
    thisVel.z += temp * (shared_p_m[l].z - this_p_m.z);
Focusing on the second line, we see two rather expensive calculations: a division, and a divisor which is
the result of a call to a square root function. In CUDA compute capability 2 and above, calls to sqrtf
introduce no additional error [30]. However, in CUDA compute capability below 2, the square root function is
approximated by making use of the reciprocal square root operation supported by CUDA. We will now see
how that works out.
By using the cuobjdump -sass command that is part of the CUDA SDK we disassemble the CUDA part
of the compiled program, enabling us to see the program translated to assembly code. Locating the
relevant part of the updateVelocity function we see the following:
    /*01b0*/ /*0xc00c1834*/         FMUL32 R13, R12, R12;
    /*01b4*/ /*0xc00d1830*/         FMUL32 R12, R12, R13;
    /*01b8*/ /*0x9000183140000780*/ RSQ R12, R12;
    /*01c0*/ /*0x9000186100000780*/ RCP R24, R12;
We notice that the reciprocal square root (RSQ) operation is used in conjunction with the reciprocal
(RCP) operation to yield an approximated square root. Both are approximated operations, with an extra
ULP error of 2 and 1 respectively [30], for a total of 3, and on top of that we do not strictly need to
perform both. Since our square root is in the divisor, we can use the reciprocal square root directly. Now
that we know that sqrtf is translated this way in our program for a CUDA compute capability 1.x device, we
can make explicit use of rsqrtf.
7.6.2 Implementation
We change the line containing the call to sqrtf with the following:
    float temp = dt * G * shared_p_m[l].w * rsqrtf(rtemp * rtemp * rtemp);
Notice that the division is now gone as well. The disassembled code confirms that the RCP operation mentioned
previously is now gone, and a quick run of the program on a set of test data reveals a speed-up ratio of 1.5x (50%
increase).
Our next loop optimization is loop unrolling. Loop unrolling typically benefits small loops with few
calculations the most. This is because, without unrolling, the loop-specific code also needs to be executed, thus
causing overhead. While loop unrolling can be done by hand, we will use the #pragma unroll directive offered
by the compiler. According to the CUDA programming manual, the compiler only ensures correctness of
code, and it is up to the programmer to use a proper value to increase performance.
In the code below, we see the code using the unroll pragma. A value of 8 was found to be best, following
the benchmarks presented later. The loop of the updatePosition function has the same pragma added.
    #pragma unroll 8
    for (int l = 0; l < threadsInBlock; l++) // each thread calculates the velocity of its body
                                             // using the shared tile memory
    {
        if (j * threadsInBlock + l < n) // skip tile entries past the last body
        {
            float rtemp = EPSILON + dist2(this_p_m, shared_p_m[l]);
            float temp = dt * G * shared_p_m[l].w * rsqrtf(rtemp * rtemp * rtemp);
            thisVel.x += temp * (shared_p_m[l].x - this_p_m.x);
            thisVel.y += temp * (shared_p_m[l].y - this_p_m.y);
            thisVel.z += temp * (shared_p_m[l].z - this_p_m.z);
        }
    }
After loop unrolling, the disassembler reveals that the size of the entire updateVelocity function has
increased from 872 (0x368) to 2104 (0x838) bytes, indicating that the kernel now mainly consists of the
unrolled code of the inner loop.
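What the unroll pragma asks the compiler to do can be illustrated by unrolling a simple loop by hand. The sketch below is not taken from our program: it uses a plain summation and an unroll factor of 4 to keep the example short, and the function names are hypothetical.

```c
#include <assert.h>

/* Plain loop: one addition plus loop bookkeeping
   (compare, increment, branch) per element. */
float sum_rolled(const float *v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* The same loop unrolled by hand with a factor of 4. */
float sum_unrolled4(const float *v, int n)
{
    assert(n >= 0);
    float s = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {  /* bookkeeping once per four elements */
        s += v[i];
        s += v[i + 1];
        s += v[i + 2];
        s += v[i + 3];
    }
    for (; i < n; i++)           /* remainder when n is not a multiple of 4 */
        s += v[i];
    return s;
}
```

The unrolled body is four times as long, which mirrors the growth in kernel size we see in the disassembly, while the additions are performed in the same order and give the same result.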
7.6.3 Notes
We use the rsqrtf function, which calculates the inverse square root $1/\sqrt{f}$ of a nonzero float f. This function is
faster, but comes with a lower accuracy, with an error of up to 2 ULPs. However, as we uncovered, the standard sqrtf, when
compiled for CUDA compute capability 1.x, already makes use of this approximated function. Only when
used in a CUDA compute capability 2.x program does it represent reduced precision compared to
use of the standard function. In our case, we have optimized and possibly reduced potential imprecision.
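To make the unit of these error bounds concrete, the distance in ULPs (units in the last place) between two floats can be measured by comparing their bit patterns. The sketch below is an illustration only: the function name is hypothetical, and it assumes IEEE 754 single precision and positive finite inputs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Distance in ULPs between two positive finite floats.
   For such values the bit patterns are ordered like the values themselves. */
uint32_t ulp_distance(float a, float b)
{
    assert(a > 0.0f && b > 0.0f);
    uint32_t ia, ib;
    memcpy(&ia, &a, sizeof ia);  /* reinterpret the bits without aliasing issues */
    memcpy(&ib, &b, sizeof ib);
    return ia > ib ? ia - ib : ib - ia;
}
```

A result of 2 ULPs, the bound quoted for rsqrtf, means that the computed value lies at most two representable floats away from the reference value.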
7.6.4 Tests & results
As mentioned, we found the loop unrolling value by benchmarking. The benchmarks of various unroll
values are presented in figure 7.11. Though not easy to see, the optimal unroll value is found to be 8.
Easier to see is the speed-up compared to our previous tiling version, shown in figure 7.12.
Figure 7.11 The graph shows the time of one iteration for the loop unrolling and rsqrtf optimization. Each
curve shows one galaxy size and the x-axis shows how many times the loop has been unrolled.
The loop optimized version has an observed increase in performance of 55-70% compared to tiling,
depending on problem size.
7.7 All-pairs N-body, CUDA: Verlet integration
Thus far in our all-pairs programs we have been using Euler integration. The choice of numerical integrator
could be a thesis in itself and is not really in the scope of this computer science project. In order to illustrate
the power of another integration method we have made a last CUDA all-pairs program that uses velocity
Verlet integration as described in section 6.2.2.
Euler is a very imprecise method because of its large truncation error, and it does not inherently conserve
energy, as for example Verlet integration does. Velocity Verlet integration should enable us to take
longer iteration steps (dt) due to its smaller truncation error, in practice speeding up a simulation. When using
Verlet, we need a few more lines of calculations in each integration step, so we expect the execution
time to increase somewhat.
Figure 7.12 The graph shows the mean time in milliseconds used for one iteration of the all-pairs simulation with loop
optimizations, compared to the tiling version. The x-axis shows galaxy size and the y-axis the time in milliseconds; both
axes are on a logarithmic scale.
7.7.1 Implementation
Implementing velocity Verlet is pretty straightforward; we only need to keep track of temporary variables
for the velocity and acceleration from the last iteration. Reduced code showing how we calculate the position for
one body is presented below. See the formula in section 6.2.2 for reference:
    // aT is a temporary acceleration vector from the velocity kernel
    // p_m, vel and acc are position, velocity and acceleration vector (float4) arrays.
    if (steps > 0) // skip updating vel in the first step, already considered done.
        vel[k] += (aT - acc[k]) * 0.5; // update last iteration's velocity
    acc[k] = aT;
    float4 dv, hv; // temporary velocity vectors

    dv = acc[k] * dt * 0.5;
    hv = vel[k] + dv;
    p_m[k] += hv * dt; // update position
    vel[k] = dv + hv;  // save collective velocity
As can be seen the code mimics the Verlet formula reusing acceleration and velocity from this and last
iteration, unless it is the first iteration where the velocity is considered updated already.
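The difference in energy behaviour between the two integrators can be illustrated on a much simpler system than a galaxy. The sketch below integrates a one-dimensional harmonic oscillator with unit mass and unit spring constant, starting at x = 1, v = 0. This toy system, the function names and the use of double precision are assumptions made for illustration. The Euler variant shown is the explicit (forward) form, which may differ from the update order used in our all-pairs programs, and the Verlet variant is the textbook kick-drift-kick form of the formula in section 6.2.2.

```c
#include <assert.h>

static double accel(double x) { return -x; }  /* unit spring, unit mass */

static double energy(double x, double v) { return 0.5 * v * v + 0.5 * x * x; }

/* Energy after `steps` explicit Euler steps from x = 1, v = 0. */
double energy_after_euler(int steps, double dt)
{
    assert(steps >= 0 && dt > 0.0);
    double x = 1.0, v = 0.0;
    for (int s = 0; s < steps; s++) {
        double a = accel(x);
        x += v * dt;   /* position uses the old velocity */
        v += a * dt;   /* velocity uses the old acceleration */
    }
    return energy(x, v);
}

/* Energy after `steps` velocity Verlet steps from x = 1, v = 0. */
double energy_after_verlet(int steps, double dt)
{
    assert(steps >= 0 && dt > 0.0);
    double x = 1.0, v = 0.0, a = accel(x);
    for (int s = 0; s < steps; s++) {
        double hv = v + 0.5 * a * dt;  /* half-step velocity */
        x += hv * dt;                  /* full-step position */
        a = accel(x);                  /* acceleration at the new position */
        v = hv + 0.5 * a * dt;         /* complete the velocity step */
    }
    return energy(x, v);
}
```

The exact energy of this system is 0.5. With explicit Euler the energy grows by a factor of (1 + dt^2) in every step, so it drifts away steadily, whereas with velocity Verlet it stays within a small band around the starting value for the same step size.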
7.7.2 Known issues
Even though velocity Verlet integration is energy conservative as such, the energy of a system may change
or drift. This is due to a phenomenon called floating point drifting. Drifting is caused by the numerous
floating point operation rounding errors, and is present in most dynamics simulations regardless of
integration method [10].
7.7.3 Tests & results
When testing our Verlet version of all-pairs we found that the time consumption of an iteration is almost
identical to that of an Euler implementation. The calculations performed are a minimal part of our
optimized CUDA version and it does not introduce any extra data fetches. This means that the resulting
performance hit is negligible, as most computational power is used elsewhere in the program.
Figure 7.13 shows the energy of the system simulated for one unit of time, so the step-size is the inverse of
the number of steps taken. We can see that velocity Verlet is accurate for much larger values of dt. The
step-size of 0.005 that can be used with velocity Verlet, compared to an Euler step-size of 0.0001, makes
for a 50 times faster simulation. When examining the energy of the system for even smaller step-sizes using
velocity Verlet, we found that it did introduce a minor decrease in the energy level. This could be explained
by an increase in floating point drifting, as a number of extra arithmetic instructions were added.
Figure 7.13 The energy difference between the original energy and the energy after a number of time-steps. The size of
the time-step dt is the inverse of the number of steps t, i.e. dt = 1/t.
7.8 Summary of all-pairs case
Throughout this chapter we have looked at several versions of the all-pairs program, from an unoptimized
CPU version to an optimized CUDA version using velocity Verlet integration. The speedup gained through
the versions for our 128k Plummer galaxy can be seen in figure 7.14.
We can see that the huge processing capability of the graphics card yields a speedup even in the simplest
possible implementation when compared to a high-end 16-core computer.
We also notice that the most optimized version of the CUDA implementations is 23 times faster than the
“naive” version. Such optimizations are achieved through knowledge of the architecture and functionality
of CUDA as well as through experimentation.
Optimizing memory usage and data fetches through the use of shared memory and data coalescing has
gained us the most speedup. Knowledge of the available functionality has also improved performance,
through the use of the hardware accelerated rsqrtf function. Through experimentation with block and
thread sizes as well as loop unrolling, even more processing power can be squeezed out of the GPU.
We should mention that from the initial versions (the OpenMP version and the first CUDA version) to the
latest version of the program, the relative complexity of the code has increased many times.
Most programmers would quite easily understand the first versions, while the later ones need more effort
to understand.
Figure 7.14 The graph shows performance gain percentages for the different versions of the all-pairs program, when
simulating a galaxy of 131072 stars. The first CUDA version is set as the baseline for the speedup.
8 N-body Barnes-Hut
In this chapter we will describe our implementations of the n-body simulation optimized by the Barnes-
Hut algorithm. As we saw in chapter 7, CUDA can be used to achieve high performance on problems that
naturally map well to massive parallelism. We achieved a tremendous speedup due to the embarrassingly
parallel nature of the all-pairs n-body problem. So much so that problem sizes that would otherwise take
hours for a single time-step became possible to run in minutes. However, given the $O(n^2)$ time complexity
of the all-pairs algorithm, a large enough problem will still be impractical to simulate. The Barnes-Hut
optimized algorithm, however, has a time complexity of $O(n \log n)$ [3], making simulations of larger,
interesting problem sizes viable [7]. Motivated by the performance gains of the all-pairs CUDA version, and the
promising reduction in complexity of the Barnes-Hut algorithm, we decided to see what the performance
of a CUDA implementation would be.
Though the problem being simulated is the same, the Barnes-Hut algorithm is very different from the
all-pairs algorithm, as described in 6.3.2. The Barnes-Hut algorithm is an approximation that uses a
hierarchical tree-based data structure to subdivide space [3]. It is not embarrassingly parallel. Our
approach to the implementation was the same as for the all-pairs algorithm: create a sequential
version for reference and an OpenMP version for benchmarking parallel CPU performance, and finally
make the changes required to convert it into a CUDA version. We did this both to compare performance on
similar implementations and to help us debug aspects of the code related to parallelizing the algorithm.
We initially created sequential and OpenMP parallel C++ versions, implementing the Barnes-Hut algorithm
in a fairly obvious manner, by using classes to represent nodes and leaves of the tree and recursion
for the force calculations. However, when converting this to CUDA, we experienced problems. CUDA does
not support recursive functions at all on devices of compute capability 1.x, which was what we had access
to at the time. Further restrictions apply to the use of pointers and class features. More importantly,
reasonable memory performance is hard to obtain when pointers are used to access data stored in objects.
For this reason we discarded the object-oriented C++ implementations and instead opted to create all
versions with the CUDA implementation in mind. All mentioned versions are included in the source code
appendix (see chapter 15).
During our development of the Barnes-Hut programs, we found a highly optimized CUDA implementation of
the Barnes-Hut algorithm (see footnote 1). Its implementation is described in an article written as a chapter
of a volume of the GPU Gems series that is yet unpublished. The article, An Efficient CUDA Implementation of
the Tree-Based Barnes Hut n-Body Algorithm [7], has been made available online by one of its authors,
Martin Burtscher (see footnote 2). This implementation has inspired us, and parts of our code are based on it.
As mentioned above, our CPU versions resemble the CUDA implementation so that the versions can be
compared directly.
1 We used v2.1, available at http://www.gpucomputing.net/?q=node/1314, as is v2.2, which had just been
released at the time of writing.
2 At the time of writing, a copy is hosted on the website of the Texas State University:
http://www.cs.txstate.edu/~mb92/papers/gcg11.pdf
In the following we will describe our Barnes-Hut programs in the same manner as the all-pairs chapter.
Descriptions of implementations will be longer because they require a more extensive explanation, and the
sequential C and parallel OpenMP versions will be closer to the final CUDA implementation.
8.1 Barnes-Hut, sequential version
The sequential C version of the Barnes-Hut algorithm that we have created serves as a
reference point. The Barnes-Hut algorithm is challenging to implement in parallel compared to the all-pairs
algorithm. More so than with the implementations of the all-pairs n-body, we have consulted the sequential
version when comparing results or debugging the different stages of the algorithm. The benchmarks
performed on this version will also show how the algorithm compares to a massively parallel version of
the all-pairs algorithm.
8.1.1 General implementation
The Barnes-Hut programs all follow the same scheme and structure. The data structure
and variables defined in our sequential version will appear in the parallelized versions as well. As the
Barnes-Hut programs are more extensive than the all-pairs versions, the implementation section of this
and the OpenMP versions will be an overview, only highlighting interesting code. The reader is advised to
refer to section 6.3.2 for a walk-through of the Barnes-Hut algorithm. The full source code can be found in
the appendix (see chapter 15).
Data structures and variables
Most variables in the Barnes-Hut algorithm remain the same as in the all-pairs implementation. We are still
performing the same simulation on the same system, hence the only new variables concern the tree,
the maximum “radius” of the system (r), the accuracy parameter of the algorithm THRESH (threshold), as
well as some extra timing variables.
The tree data structure is represented by two variables and an array. A schematic of our octree data
structure can be seen in figure 8.1.
• nnodes: The total number of nodes in the octree, leaf nodes and internal nodes, is represented by the
number nnodes. The leaves, i.e. the bodies, are represented by the numbers [0 : (n − 1)] and the
internal nodes are represented by the numbers [n : (2n − 1)], with the root being 2n − 1.
• childNodes is an array of size 8 × n. It keeps track of the children of the internal nodes. An internal
node internal refers to a segment of 8 elements that represents its 8 children, addressed with the formula
(internal − n) ∗ 8 + index, where index is a number between 0 and 7. Within the segment of 8 children
each element index represents a direction in space as shown in table 8.1. Each element points to a
child, which is either a leaf (0 to n − 1) or an internal node (n to nnodes − 1). If no child is present,
the number -1 is used to indicate this.
• usedInternal is an integer variable that keeps track of the last used internal node. As the tree is built,
internal node names are polled starting at index position nnodes and counting backwards from
there.
With these variables and array explained we are ready to look at the main method of the program.
Figure 8.1 This figure shows a conceptual view of how our octree is constructed. This tree has used three internal nodes
so far. The root node has the 8 last places in the childNode array representing its children, the next internal has the next
8 and so on. The usedInternal variable represents the last internal node created in the tree.
index              0           1          2          3         4          5         6         7
direction (x,y,z)  (-x,-y,-z)  (x,-y,-z)  (-x,y,-z)  (x,y,-z)  (-x,-y,z)  (x,-y,z)  (-x,y,z)  (x,y,z)
Table 8.1 The table shows the direction in space indicated by the indexing of child nodes.
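The direction table and the childNodes addressing formula can be combined into two small helpers. This is a sketch under our own assumptions: the function names octant_index and child_slot are hypothetical, and a coordinate equal to the centre is assigned to the negative side, which the thesis does not specify.

```c
#include <assert.h>

/* Octant index following table 8.1: bit 0 is set for +x, bit 1 for +y and
   bit 2 for +z, each relative to the centre (cx, cy, cz) of the node.
   Assumption: a coordinate equal to the centre counts as negative. */
static int octant_index(float px, float py, float pz,
                        float cx, float cy, float cz)
{
    int index = 0;
    if (px > cx) index |= 1;
    if (py > cy) index |= 2;
    if (pz > cz) index |= 4;
    return index;
}

/* Position in childNodes of child `index` of internal node `internal`,
   using the formula (internal - n) * 8 + index from the text. */
static int child_slot(int internal, int n, int index)
{
    assert(internal >= n && index >= 0 && index < 8);
    return (internal - n) * 8 + index;
}
```

With n = 4 bodies the root is node 7, so its children occupy the last 8 elements of the 32-element childNodes array, as in figure 8.1.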
Main method
The structure of the main program does not differ significantly from our all-pairs implementation. The
same command line parameters can be used to control input, output, time-step-size and the number of
iterations of the main loop. Galaxy input files are read using the function found in input.c, and output files
written similarly (see appendix chapter 11). Memory is allocated using our init() function and the galaxy
data is loaded into this memory. The position and mass arrays are double the count of bodies in the galaxy,
in order to accommodate the position and mass of the internal nodes in the octree.
The timeval struct is used with the C library function gettimeofday() to measure the elapsed time of the entire
simulation as well as individual timings for the functions in the main loop.
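A minimal sketch of this kind of timing is shown below. The helper names elapsed_seconds and time_call are hypothetical and not taken from the thesis source; only struct timeval and gettimeofday() are the POSIX facilities named in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Seconds elapsed between two gettimeofday() samples. */
static double elapsed_seconds(const struct timeval *start,
                              const struct timeval *stop)
{
    assert(start != NULL && stop != NULL);
    return (double)(stop->tv_sec - start->tv_sec)
         + (double)(stop->tv_usec - start->tv_usec) * 1e-6;
}

/* Wall-clock time in seconds spent in one call of fn. */
static double time_call(void (*fn)(void))
{
    struct timeval start, stop;
    gettimeofday(&start, NULL);
    fn();
    gettimeofday(&stop, NULL);
    return elapsed_seconds(&start, &stop);
}
```

The microsecond field can be smaller at the stop sample than at the start sample, which is why both fields are combined as signed differences.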
The main loop consists of calls to 5 functions that correspond to the 4 steps explained in section 6.3.2. After
the main loop has finished, if the program is compiled with the debug flag set, results of the simulation
will be printed including the calculated energy of the system. Printing of data and energy values is not
done by default as it produces a lot of output, and takes considerable time.
8.1.2 Implementation of functions
This section describes the 5 functions that make up the main simulation loop of our Barnes-Hut
sequential program. The first and second functions correspond to the first step of the algorithm in
section 6.3.2 and the other three functions correspond to the last three steps.
“Radius” calculating function
The calculateRadius function represents the first step of the algorithm. It could be argued that its name is
misleading, as a radius is something a circle or a sphere has, and what we really have is a simple bounding
box. The function calculates a single value used to represent a cubic bounding box. This value, which we
refer to as the radius, is the largest absolute value of any x, y or z coordinate of a body, and the cubic
bounding box always has its centre at (0, 0, 0). Our reason for this simplification is that, initially, the
Plummer model is uniformly distributed. However, it does have the effect that the real center of the system
of bodies is moved slightly, which can cause the distribution of bodies to become askew.
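The idea can be sketched in a few lines of C. The function name calculate_radius and the flat x, y, z array layout are assumptions for illustration; the thesis programs store their data differently.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute coordinate over all n bodies. pos holds x, y, z triples
   (an assumed layout chosen to keep the sketch short). */
static float calculate_radius(const float *pos, int n)
{
    assert(pos != NULL && n > 0);
    float r = 0.0f;
    for (int i = 0; i < 3 * n; i++) {
        float c = pos[i] < 0.0f ? -pos[i] : pos[i];  /* absolute value */
        if (c > r)
            r = c;
    }
    return r;
}
```

The cube from (-r, -r, -r) to (r, r, r) then contains every body, at the cost of being centred on the origin rather than on the bodies themselves.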
Tree building function
The implementation of the building of the octree is characterized by the fact that it is done without using
recursion. First we initialize temporary variables, then poll and create the root node of the octree. The values
x, y and z are used to determine the path through the tree as shown in table 8.1. The variables leaf, current
and child represent the leaf we are inserting, the current node we are inserting into, and the child of
the current node on the given path. The root node is created in the center of the galaxy.
The main loop iterates over all bodies, or leaves, inserting them into the tree. It is crucial to know that when
a child is polled in the childNodes array, the integer value represents an internal node if it is equal to n or
larger, an empty spot if it is -1, or a leaf node if it is between 0 and n-1. The flow of the main loop can be
seen in the code below, reduced for brevity:
    while (child >= n) // as long as we hit an internal node keep traversing the tree
    {
        // find which child in the current internal node the leaf should be inserted in
        child = childNodes[(current-n)*8+index];
    }
    // leaves the while loop as a leaf or empty spot is hit.
    if (child == NOCHILD) // insert leaf in empty spot
    {
        childNodes[(current-n)*8+index(leaf)] = leaf;
    }
    else // there is another leaf in the current position, i.e. child
    {
        do // start a loop that finishes when leaf is inserted in tree
        {
            // we have the two leaves, namely child and leaf
            int newInternal = --usedInternal; // new internal is created
            pos[newInternal] = halfRadius * (x, y, z); // find position of the new internal
            insert(child) --> newInternal; // insert child in the new internal
            child = childNodes[(newInternal-n)*8+index(leaf)]; // now try to insert leaf
        }
        while (child > NOCHILD); // while leaf and child traverse in the same direction
        childNodes[(current-n)*8+index] = leaf; // do-loop finished and leaf inserted in tree.
    }
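The reduced listing leaves out how current, index and the node positions are updated. The following self-contained sketch fills in those steps in one plausible way. It is not the thesis code: the Octree struct, the helper names and the tie-breaking rule in octant are our assumptions, and the assert stands in for whatever the real program does when it runs out of internal nodes.

```c
#include <assert.h>

#define NOCHILD (-1)

/* Hypothetical container for the arrays described in the text. */
typedef struct {
    int n;             /* number of bodies (leaves 0 .. n-1) */
    int usedInternal;  /* last internal node handed out */
    int *childNodes;   /* 8 * n slots, NOCHILD when empty */
    float (*pos)[3];   /* 2 * n positions: bodies first, then internal nodes */
} Octree;

/* Child index as in table 8.1: bit 0 = +x, bit 1 = +y, bit 2 = +z. */
static int octant(const float *p, const float *c)
{
    return (p[0] > c[0]) | ((p[1] > c[1]) << 1) | ((p[2] > c[2]) << 2);
}

/* Empty the tree and create the root (node 2n - 1) at the origin. */
static void init_tree(Octree *t)
{
    for (int i = 0; i < 8 * t->n; i++)
        t->childNodes[i] = NOCHILD;
    t->usedInternal = 2 * t->n - 1;
    for (int d = 0; d < 3; d++)
        t->pos[t->usedInternal][d] = 0.0f;
}

/* Insert body `leaf` without recursion; radius is the half-width of the root cube. */
static void insert_leaf(Octree *t, int leaf, float radius)
{
    int n = t->n;
    int current = 2 * n - 1;  /* start at the root */
    float half = radius;      /* half-width of the cube of `current` */
    int index = octant(t->pos[leaf], t->pos[current]);
    int child = t->childNodes[(current - n) * 8 + index];

    while (child >= n) {      /* descend while we hit internal nodes */
        current = child;
        half *= 0.5f;
        index = octant(t->pos[leaf], t->pos[current]);
        child = t->childNodes[(current - n) * 8 + index];
    }
    while (child != NOCHILD) {  /* the slot holds another leaf: split it */
        int newInternal = --t->usedInternal;
        assert(newInternal >= n);  /* out of internal nodes */
        half *= 0.5f;
        for (int d = 0; d < 3; d++)  /* centre of the octant being split */
            t->pos[newInternal][d] = t->pos[current][d]
                + (((index >> d) & 1) ? half : -half);
        t->childNodes[(current - n) * 8 + index] = newInternal;
        t->childNodes[(newInternal - n) * 8
                      + octant(t->pos[child], t->pos[newInternal])] = child;
        current = newInternal;
        index = octant(t->pos[leaf], t->pos[current]);
        child = t->childNodes[(current - n) * 8 + index];
    }
    t->childNodes[(current - n) * 8 + index] = leaf;
}
```

Two bodies at exactly the same position would make the second loop split forever, which in this sketch ends at the assert once the internal nodes are used up.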
Center of mass calculating function
In this function we calculate the center of mass (CoM) for an internal node, iterating over its children using
formula 6.7 found in chapter 6. The collective mass of each internal node is found by adding the masses of all
its children together. The collective masses and the CoM of each internal node are found by iterating through
the internal nodes, starting at the last created node. In that way we know we do not start a calculation on an
internal node that contains another internal node that has not had its CoM and mass calculated yet. The full
code can as usual be found in the appendix (see chapter 15).
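The bottom-up order can be sketched as follows. We assume the usual mass-weighted average for the centre of mass, a flat x, y, z layout, and that the position of an internal node is overwritten by its CoM; the function name compute_center_of_mass is also ours.

```c
#include <assert.h>

#define NOCHILD (-1)

/* Mass and centre of mass for every internal node, processed from the most
   recently created node (lowest index) up to the root (index 2n - 1).
   pos holds x, y, z triples and mass one value per node, 2n nodes each. */
static void compute_center_of_mass(const int *childNodes, float *pos,
                                   float *mass, int n, int usedInternal)
{
    assert(usedInternal >= n && usedInternal <= 2 * n - 1);
    for (int node = usedInternal; node < 2 * n; node++) {
        float m = 0.0f, cx = 0.0f, cy = 0.0f, cz = 0.0f;
        for (int k = 0; k < 8; k++) {
            int child = childNodes[(node - n) * 8 + k];
            if (child == NOCHILD)
                continue;  /* empty slot contributes nothing */
            m  += mass[child];
            cx += mass[child] * pos[3 * child];
            cy += mass[child] * pos[3 * child + 1];
            cz += mass[child] * pos[3 * child + 2];
        }
        mass[node] = m;
        if (m > 0.0f) {    /* skip the division for nodes without mass */
            pos[3 * node]     = cx / m;
            pos[3 * node + 1] = cy / m;
            pos[3 * node + 2] = cz / m;
        }
    }
}
```

Because new internal nodes receive decreasing numbers, every internal child has a lower index than its parent and is therefore finished before the parent is processed.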
Acceleration/force calculating function
This function calculates the acceleration of all leaf nodes using the tree built in the 2 previous functions.
For each node the tree is traversed down each branch until either a leaf is reached or an internal node that
is far enough away is reached.
In the program we make use of two variables that together work as a stack. These variables are an array
called parent and a long integer (see footnote 3) called visited. The parent array keeps track of which
internal nodes are in the path of the current traversal. visited remembers which directions (i.e. children)
are chosen during each step of the traversal.
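Packing child indices of 3 bits each into one 64-bit word can be sketched as below. The bit layout (level 0 in the lowest bits) and the function names are assumptions; the thesis only states that the indices are stored in a long.

```c
#include <assert.h>
#include <stdint.h>

/* Store a child index (0..7) at level `depth` of a stack packed into one
   64-bit word. The 3 bits of that level are assumed to be zero beforehand.
   At most 21 levels fit, since 21 * 3 = 63 bits. */
static uint64_t visited_push(uint64_t visited, int depth, int child)
{
    assert(depth >= 0 && depth < 21 && child >= 0 && child < 8);
    return visited | ((uint64_t)child << (3 * depth));
}

/* Read the child index stored at level `depth`. */
static int visited_get(uint64_t visited, int depth)
{
    assert(depth >= 0 && depth < 21);
    return (int)((visited >> (3 * depth)) & 7u);
}

/* Clear the entry at level `depth`, i.e. pop it. */
static uint64_t visited_pop(uint64_t visited, int depth)
{
    assert(depth >= 0 && depth < 21);
    return visited & ~((uint64_t)7 << (3 * depth));
}
```

Keeping the whole direction stack in a single register-sized value avoids a second array next to parent.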
The variable depth keeps track of at which level the current traversal is, level 0 being the root node. In the
rdepth array we keep track of the squared radius relative to the depth, in order to use it in the inequality
$\mathrm{THRESH}^2 < r^2 / d^2$, which is a rewrite of the formula for θ given in section 6.3.2.
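A small sketch of how such a table and test could look is given below. It assumes that the radius halves for every level of the tree and evaluates the inequality without a division by multiplying both sides by d^2; the function names and the levels parameter are ours.

```c
#include <assert.h>
#include <stddef.h>

/* rdepth[d] = (radius / 2^d)^2 for d = 0 .. levels - 1, assuming the
   radius of a node halves with every level of the tree. */
static void fill_rdepth(float *rdepth, int levels, float radius)
{
    assert(rdepth != NULL && levels > 0 && radius > 0.0f);
    float r = radius;
    for (int d = 0; d < levels; d++) {
        rdepth[d] = r * r;
        r *= 0.5f;
    }
}

/* Evaluates THRESH^2 < r^2 / d^2 with both sides multiplied by d^2.
   r2 is the squared radius from rdepth, dist2 the squared distance. */
static int inequality_holds(float r2, float dist2, float thresh)
{
    return thresh * thresh * dist2 < r2;
}
```

Working with squared quantities throughout means that no square root is needed for the distance test.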
currentNode and currentChild are variables that hold the internal node and the child direction at
the current depth. Three temporary variables acc_x, acc_y and acc_z keep track of the accumulated
acceleration vector.
As the tree is traversed we start at the root node, which is our first current node, and with its first child 0. As
the root cannot be far enough away from any node with a THRESH of 0.5, the root will be pushed on the
stack and the node in the position corresponding to the first child will be set as current. The child chosen (i.e.
0) will also be pushed on the stack. We traverse the tree in this way and keep pushing children and parents on
the stack until the current node is either a leaf or an empty node, or the inequality is fulfilled. If either a leaf
is reached or the inequality is satisfied, currentNode is used to calculate the acceleration component, which
is added to the temporary vector. After the end of a branch is reached, the traversal goes up one level, pops
the stacks and continues with the next child by incrementing currentChild by one. This approach is
continued until the whole tree is traversed. An overview of the algorithm can be seen in the pseudo-code
below:
    do // while there are still unvisited branches of the tree
    {
        while (currentChild < 8) // go up a level when all 8 children are visited
        {
            do
            {
                push current node on stack & find next node in child array
                go one level down & push current child onto stack
            } while (current node is an internal node, too close to current leaf)

            if (currentNode != NOCHILD) // if the node is not empty
            {
                calculate_and_update(temp_acc); // calculate acceleration component
            }
            pop child & parent from stacks
            go to next child i.e. currentChild++
        }
        do
        {
            pop child & parent from stacks
            go to next child i.e. currentChild++
        } while (currentChild > 7); // make sure the current child does not exceed 7
    } while (there are still unvisited branches of the tree)
Each leaf node finds its acceleration in the same way as in earlier body to body calculations. Lastly we
update the acceleration and velocity arrays with the temporary variable according to the Verlet scheme, as
3 A long is 64 bits long, and an integer between 0 and 7 is represented by 3 bits. This means that a maximum of 21 child variables can
be stored in the visited variable. This translates to a balanced tree consisting of 1.15 × 10^18 internal nodes, which is large compared to our
simulation sizes.
also seen in the last all-pairs CUDA program (see section 7.7).
Updating the positions is also done as in the all-pairs Verlet CUDA version.
8.1.3 Known issues
The fact that the Barnes-Hut tree building can become unbounded if two bodies are placed on top of each
other is of particular relevance to this implementation. There is only room for as many internal nodes as leaf
nodes in the program. This means that an unbalanced tree could make the program crash. This
does not happen initially, as the density function of a Plummer galaxy makes for a somewhat uniform
distribution. After a large number of iterations, however, there are two related scenarios that could make the
octree become unbalanced. As mentioned, two bodies that come very close to each other will make for a
long branch of the tree. Another scenario is where one body moves away from the vicinity of the galaxy,
making the “radius” become huge and, relatively speaking, pushing all other bodies closer together. The
escaping body requires the space to be divided one extra time each time the body doubles its distance from
the galaxy center.
8.1.4 Tests & results
It is apparent when running the Barnes-Hut sequential program that it is more efficient than the all-pairs
program when we increase the sample sizes. As we can see in figure 8.2, the Barnes-Hut program uses less
time for one iteration of the simulation than the all-pairs CPU program. This is of course expected,
due to the time complexity of the problem.
It can be seen that even at the largest galaxy size we simulate, the program is still outperformed by an order
of magnitude by the most optimized CUDA version. The slope of the curve indicates that the Barnes-Hut
program will outperform the all-pairs CUDA version at some galaxy size, which is of course also expected.
It can be seen in the pie chart in figure 8.3 that almost all time used by the algorithm is spent in the force
calculation for all sizes of galaxies. It is also hinted that the share of time spent in the force calculation
increases with the size of the galaxy, which is again expected.
Figure 8.2 The graph shows the average time it takes for one iteration of the simulation for the Barnes-Hut simulation.
For comparison, the sequential all-pairs CPU version and the most optimized CUDA all-pairs version with velocity
Verlet integration are also shown.
Figure 8.3 The pie shows the distribution of computation between the different functions. Each “wheel” shows one
size of the galaxy, the outermost being the 1k galaxy and the innermost the 128k galaxy.
8.2 Barnes-Hut, OpenMP version
As with the original sequential all-pairs implementation, we wanted to parallelize our C version using
OpenMP while keeping the upcoming CUDA implementation in mind. While some of the functions in the
C program can be easily parallelized, others have a very irregular structure that is less suitable for
parallelization.
We know that calculating the force is the most expensive function, and at the same time its implementation
is embarrassingly parallel. Parallelizing this function alone would gain us most of the achievable speedup.
Sequential execution on the GPU is extremely slow, though, and so is performing the calculations on the
host CPU instead. For this reason we decided to try to parallelize all functions for the CPU version, as this
has to be done in the later CUDA version anyhow.
We found that for this problem, the parallelization of the irregular functions did not increase speed at all.
The overhead and solving of concurrency issues overshadowed the speedup otherwise possible through
data parallelism. It also made our code less robust, as will be explained in this section.
8.2.1 Implementation
The parallelization of the sequential C version of the Barnes-Hut algorithm is mostly done using the
OpenMP compiler directive #pragma. In order to deal with concurrency, some parts of the original
implementation have been changed.
The functions calculateRadius, calculateAccleration and progressSystem have all been easily parallelized
using the #pragma omp for directive on their main for-loops. The pseudo-code below shows
how this is done for the function progressSystem:
    int leaf;
    #pragma omp parallel for private(leaf)
    for (leaf = 0; leaf < n; leaf++)
    {
        update acceleration, velocity & position for leaf.
    }
In the radius calculation we have implemented a reduction in order to make each thread contribute to the
final result.
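The thesis does not show how its reduction is written. One way to express it, assuming a compiler that supports the max reduction operator for C (introduced with OpenMP 3.1), is sketched below. The function name and the flat x, y, z layout are again our assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute coordinate over all n bodies, computed in parallel.
   Each thread keeps its own partial maximum, and OpenMP combines the
   partial results when the loop ends. Without OpenMP enabled, the pragma
   is ignored and the loop runs sequentially with the same result. */
static float calculate_radius_omp(const float *pos, int n)
{
    assert(pos != NULL && n > 0);
    float r = 0.0f;
    int i;
    #pragma omp parallel for reduction(max:r)
    for (i = 0; i < 3 * n; i++) {
        float c = pos[i] < 0.0f ? -pos[i] : pos[i];  /* absolute value */
        if (c > r)
            r = c;
    }
    return r;
}
```

On compilers limited to older OpenMP versions, the same effect can be obtained with a private maximum per thread that is merged inside a critical region after the loop.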
Irregular parallelism
In the last two functions, more work-sharing has to be done, requiring a thorough explanation. For the
function that calculates the center of mass, a form of task parallelism and pipelining has been implemented.
Each of the threads calculates the mass and CoM for an internal node assigned to it, beginning at the
last used internal node. The threads then leapfrog through the internal nodes using a parallel for-loop.
It is possible for one thread to depend on data from another thread, so the parallelized function
needs to handle this. The mass of an internal node is used as an indicator of whether the node is ready
to be used in a calculation. A value over zero indicates that it is. To avoid deadlocks incurred by threads
reading a cached value of the mass of an internal node, instead of the actual value that was changed by
another thread, a volatile float array is used for this function.
The function now consists of two parts. The first part resembles the sequential code, but now caches the
indexes of internal nodes that are not ready to use. The second part loops over the cached nodes and, once
their mass has been updated, uses them for the calculation.
In the buildTree function, we have tried to implement two versions. They both cause the program to
perform worse. The concurrency problems in the tree building are mainly seen in the part where new internal
nodes are created.
In a “naive” parallelization, the variable usedInternal is used by all threads at the same time. Due to
read/write hazards (see 4.3.3), this would cause a race condition. In OpenMP this can be solved by using
atomic operations, critical regions, or locks. Due to the lack of functionality of atomics in OpenMP (atomics
cannot read and change a value in one operation), we were not able to use atomics for the implementation.
Using either locks or critical regions in OpenMP locks the whole code segment where new internal
nodes are created, which is where the majority of the code is located. A critical region was implemented,
but it meant a great increase in execution time. Using more threads only increased the execution time due to
overhead, instead of speeding anything up. This is due to threads staying idle while one is inserting new
internal nodes.
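For reference, OpenMP 3.1 (published in July 2011) added an atomic capture construct that updates a variable and reads the result in one atomic operation. The thesis does not state which OpenMP version or compiler was used, so this may not have been available to the authors. The sketch below shows how claiming an internal node could look with that construct; the function name claim_internal_node is hypothetical.

```c
#include <assert.h>

static int usedInternal;  /* shared counter of the last used internal node */

/* Atomically decrement the shared counter and return the new value, so that
   no two threads can claim the same internal node. Requires OpenMP 3.1 or
   newer; without OpenMP the pragma is ignored and the code is sequential. */
static int claim_internal_node(void)
{
    int claimed;
    #pragma omp atomic capture
    claimed = --usedInternal;
    return claimed;
}
```

This only protects the counter itself. Linking the claimed node into the tree would still need its own synchronization.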
We implemented a second version where we sorted the galaxy spatially, in order to insert the bodies into
subtrees corresponding to their spatial location. That way we could avoid concurrency problems by using
a divide and conquer implementation, where each thread independently could handle its own part of the
tree.
This version has two problems. Firstly, it is somewhat slower than the sequential version for the galaxy sam-
ple sizes we benchmark on. The indexing and sorting using counting sort, as well as a sequential building
of the top of the tree, causes significant overhead.
Secondly, each thread needs a range of internal nodes to poll from when expanding the tree. This makes
the problem described in section 8.1 worse. The threads handling problematic areas of the galaxy will run
out of internal nodes more often, as there are fewer internal nodes available. This meant that the program
might crash in some simulation configurations. The code for this is included in the code appendix (see
chapter 15), but in the end we used a strictly sequential implementation of the buildTree function, as this
was faster and more stable.
8.2.2 Tests & results
We have made some benchmarks on the OpenMP program. The results can be seen in figure 8.4. The figure
shows that the elapsed runtime of the simulation is reduced satisfactorily when using more threads. We are
inclined to say that the implementation scales well with extra computing cores.
Figure 8.4 The graph shows the average time it takes for one iteration of the simulation for the Barnes-Hut OpenMP
version. The graph shows the timing of simulations run with 1, 2, 4, 8, and 16 threads.
If we look at the distribution of work (see figure 8.5) we observe the following: The force calculation is
the function gaining speedup from an increasing number of threads. The irregular parts of the code do
not. The time spent on “radius” calculation and progressSystem is negligible. This means that the time
consumed by the buildTree and computeCenterOfMass functions becomes a larger part of the total elapsed
simulation time as more cores are used. This can be seen in the pie chart. The main point of parallelizing
the Barnes-Hut algorithm is to have the force calculation run faster. In the irregular parts of the program,
it appears that too few calculations are performed relative to the overhead of
the parallelization. We have also observed while implementing this program that the use of OpenMP’s critical
regions causes too much overhead to be useful. Additionally, the OpenMP atomic functions do not have
the “return” functionality required for this implementation.
Figure 8.5 The pie chart shows the distribution of computation time between the different functions in the Barnes-Hut
OpenMP simulation of a 128k galaxy. Each “wheel” shows a number of threads used, the innermost represents 16
threads, and continuing outwards 8, 4, 2, and 1 thread(s). The right graph shows a bar chart of the time used per
function in the same simulation.
8.3 Barnes-Hut, CUDA version
In this section, the main part of our second case, the CUDA implementation of the Barnes-Hut n-body
algorithm, will be explained. The data structures and the use of the algorithm are the same as in the previous
implementations described in this chapter. As with the previous versions, the steps of the algorithm are
implemented in functions. However, due to how CUDA works, especially in relation to this algorithm, the
functions are much more extensive. Parallelism across warps, blocks and the entire grid of threads must be
considered. Doing so requires the introduction of extra barriers, checks, and reductions to the algorithm. For
better distribution of work and to avoid issues, each kernel is also executed with its own set of launch
parameters.
As explained in the beginning of the chapter, we base parts of our CUDA implementation on Burtscher’s
program. There are two major differences between the implementations. The first is that we are using float4s
for the acceleration and velocity vectors and position coordinates. We did this in order to increase data
coalescence in the force calculation kernel, as we observed that this worked well in our all-pairs case.
Secondly, we did not implement a sorting kernel, as we had none in the parallel CPU version to base it on.
This led us to some interesting observations, as we will show later.
It is important for us to describe the implementation more thoroughly in order to show the complexity
of the code in CUDA and our understanding of its workings. For this reason, a reader not interested in the
in-depth explanation of the code, can just read the last paragraph of section 8.3.1 and the first paragraph
of sections 8.3.2 through 8.3.5 in order to gain an overview of the implementation.
Each function will again be described separately, but in more detail, especially with regard to the flow of the
algorithm, the use of thread-based indexing, and the use of barriers and synchronization. In our results
section, each function will also be benchmarked.
8.3.1 Program structure and main method
The entire program is contained within one source file, except for the runtime C and CUDA includes, and
some of our host code utility functions. It is due to the restriction of file scope on device variables that
all CUDA code is in the same file. All the device functions making up the implementation of Barnes-Hut
algorithm are built around globally accessible variables for simulation parameters and data pointers. The
equivalent host code variables are also kept globally accessible. The define directive is used extensively
for constant simulation values and parameters. NOCHILD, LOCK, and CHILDVISITIED are distinct
negative integer values used for building and traversing the tree, as used previously also. MAXDEPTH and
WARPSIZE, both set at 32, are used in the functions for computing center of mass and computing force.
Due to how the latter works, the effective maximum depth is shorter. Each CUDA function has individual
launch configurations defined as well, for instance:
    #define RADIUSTHREADS 512
    #define RADIUSBLOCKSF 2
Here, the thread count and block factor for the calculateRadius kernel are set. The thread count is the number
of threads per block. Restrictions may apply to the thread count when kernel function code depends on
the value. The block factor is used in conjunction with the variable MPCOUNT to schedule a number of
blocks. The MPCOUNT variable is filled at runtime with the number of multiprocessors (MPs) on the
GPU. For some kernels, the factor must be 1, so that each MP is assigned only one block of threads. Other
than that, the thread count and block factors can be adjusted for performance.
The main method of the program checks for a CUDA capable device using cudaGetDeviceCount(), exits
if none are found, and loads the device properties of the default GPU using cudaGetDeviceProperties().
It prints information about the GPU, saves the number of MPs to the MPCOUNT variable, and warns and
exits if the device is not of compute capability 2.0 or above. The main method is also responsible for loading
and parsing simulation data files. At least one argument to the main method is required, an input filename.
Using the function in input.c, temporary arrays are filled with input position, velocity and mass data as well
as default parameters. Parameters overriding the number of timesteps (t) and the timestep size (dt) are
optionally loaded from arguments 2 and 3 respectively, and output is optionally written following a
simulation using the function in output.c if a filename is given as the 4th parameter.
Global variables representing simulation parameters and data pointers generally exist for both host and
device. Device variables are distinguished by the postfix _Dev, and temporary variables by T
or Temp. The init() and uninit() functions allocate and free memory on the host, respectively.
The initDevice() and uninitDevice() functions do the same on the device. As
described in the section on CUDA device memory in 5.3.4, cudaMalloc() saves a pointer to device memory
in a host variable. The transferToDevice() function transfers these pointers to pointer variables in constant
memory on the device using the CUDA function cudaMemcpyToSymbol(). It also transfers simulation
parameters to constant memory, and copies the simulation data using cudaMemcpy(). The
transferToHost() function performs the opposite, i.e. it copies data back to the host after a simulation has been
performed. In these functions for initialization and data transfer, we check whether the return value of each call to
the CUDA runtime equals cudaSuccess.
Before, during and after the main simulation loop, we use CUDA events for recording
timings. Events consist of event types and calls to the CUDA runtime. We create the events start and stop for
timing the entire simulation, and startlap and stoplap for timing individual kernels.
cudaEvent_t startlap, stoplap;
HANDLE_ERROR(cudaEventCreate(&startlap));
HANDLE_ERROR(cudaEventCreate(&stoplap));
In the above listing, we create the events for timing kernel executions. Each timing is performed as follows:
HANDLE_ERROR(cudaEventRecord(startlap, 0));
calculateRadius<<<RADIUSBLOCKSF*MPCOUNT, RADIUSTHREADS>>>();
HANDLE_ERROR(cudaEventRecord(stoplap, 0));
HANDLE_ERROR(cudaEventSynchronize(stoplap));
HANDLE_ERROR(cudaEventElapsedTime(&lap, startlap, stoplap));
Event calls, like kernel launches, are asynchronous, meaning that they return before being completed on the
device[30]. The second parameter of the cudaEventRecord() function sets which stream the event applies
to. Setting it to stream 0 guarantees that the event completes only after all preceding commands and functions
in all streams are finished[30]. We do not make any other explicit use of streams, but they can be used to control
the order of execution of commands and kernels per stream. Unlike in the initialization and transfer functions,
we use a generic error handler for these runtime commands. Errors returned by runtime commands following
one or more kernel calls do not necessarily apply to that particular call[30], and are usually the result of
a crash caused by a preceding kernel call. The cudaEventSynchronize() command waits for the event to
finish, and cudaEventElapsedTime() writes the elapsed time between the two events to the lap variable. The lap
values are saved in an array of dimensions [t][5]: one timing for each of the 5 kernel calls, for each of the t timesteps.
Removing file and parameter loading, test for CUDA device, basic sanity checks, text output and code
used for timings and debugging, the main method boils down to the following:
init(); initDevice(); transferToDevice();
initDev<<<1,1>>>();
for (int i = 0; i < t; i++)
{
    calculateRadius<<<RADIUSBLOCKSF*MPCOUNT, RADIUSTHREADS>>>();
    buildTree<<<TREEBLOCKSF*MPCOUNT, TREETHREADS>>>();
    computeCenterOfMass<<<CENTERBLOCKSF*MPCOUNT, CENTERTHREADS>>>();
    computeForce<<<FORCEBLOCKSF*MPCOUNT, FORCETHREADS>>>();
    advanceBodies<<<ADVANCEBLOCKSF*MPCOUNT, ADVANCETHREADS>>>();
}
transferToHost(); uninit(); uninitDevice();
The initDev kernel call shown above sets block and step counters, used in subsequent kernels.
8.3.2 “Radius” calculating kernel
The primary function of the calculateRadius kernel is to calculate what we refer to as the radius, by which we
mean the maximum absolute value of any x, y or z coordinate. It also creates the root of the tree. The kernel
consists of three steps. In step one, each thread determines the minimum and maximum of each coordinate from
its share of the data, and the results are stored in shared memory. In step two, each block performs a reduction
on the values in its shared memory. In the third step, the last block saves the radius to a
variable in global memory and creates the root of the tree.
Temporary variables for the minimum and maximum of x, y and z are declared together with corresponding
arrays in shared memory. The arrays hold the step one results of each thread in the block, and are
statically allocated using the RADIUSTHREADS define. In the following code, the temporary xmax and
the xmaxs shared array are used as an example.
int tid = threadIdx.x; //thread id in current block
inc = RADIUSTHREADS * gridDim.x; //total threads
for (i = tid + (blockIdx.x * RADIUSTHREADS); i < n_Dev; i += inc)
{
    val = posm_Dev[i].x;
    xmax = (xmax > val) ? xmax : val;
    //find other max/min
}
xmaxs[tid] = xmax; //save thread max
//save other max/min
Here the thread id within the current block is saved in the tid variable, and the incrementor variable holds
the total number of threads. Using the thread id within the entire grid as a base, the loop iterates over the
bodies in strides of the incrementor. Each thread thus determines the minimum and maximum coordinates of about n/inc bodies.
The tid variable is used as the index for saving the results of each thread in the block in the shared memory arrays.
Next, a reduction is performed.
for (i = RADIUSTHREADS/2; i > 0; i /= 2)
{
    __syncthreads();
    if (tid < i)
    {
        temp = tid + i;
        xmax = (xmax > xmaxs[temp]) ? xmax : xmaxs[temp];
        xmaxs[tid] = xmax;
        ...
    }
}
if (tid == 0)
{
    temp = blockIdx.x;
    xmax_Dev[temp] = xmax;
    ...
}
The counter i starts at RADIUSTHREADS/2 and is halved at each iteration. In each iteration, the threads with
tid < i compare and update their minimum/maximum values against those stored in the shared arrays at position tid + i.
Because of the tid < i condition, only the 0th thread in the block takes part in the last iteration. It then saves the block
minimum/maximum to arrays in global memory, which are allocated with one element per block. Note that,
due to the way the reduction works, this kernel requires a thread count that is a power of 2. If it is not,
values are missed whenever the integer division is performed on an odd number greater than 1.
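The power-of-2 requirement can be illustrated with a small host-side emulation of the reduction (a sketch in plain C++, not the kernel itself; the function name blockMax is an assumption, the input is assumed non-empty, and the inner loop stands in for the threads of one block running between two __syncthreads() calls):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Emulates the shared-memory max reduction of one block on the host.
// vals plays the role of the shared array; its size plays the role of
// the thread count (RADIUSTHREADS in the thesis code).
float blockMax(std::vector<float> vals) {
    int n = static_cast<int>(vals.size());
    for (int i = n / 2; i > 0; i /= 2) {
        // "Threads" with tid < i combine their value with the one at tid + i.
        for (int tid = 0; tid < i; ++tid) {
            vals[tid] = std::max(vals[tid], vals[tid + i]);
        }
    }
    return vals[0]; // result held by "thread" 0
}
```

With 8 values the true maximum is found. With 6 values, the second round halves 3 to 1, so the element at index 2 is never combined, and a maximum stored there is lost.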
In the final step, the incrementor (inc) variable is reused to hold the block count minus 1. The atomicInc()
function is then used to check whether the last block has been reached. The function compares the value stored at
blockCount_Dev to inc: if the value is larger than or equal to inc, blockCount_Dev is set to 0, otherwise
it is incremented by one. In both cases the old value is returned by the function[30]. When this return value
equals inc, the calling block is the last one to increment the counter. Again, xmax is used as an example for brevity in the
following code:
inc = gridDim.x - 1;
if (inc == atomicInc((unsigned int*)&blockCount_Dev, inc))
{
    for (i = 0; i <= inc; i++)
    {
        xmax = (xmax > xmax_Dev[i]) ? xmax : xmax_Dev[i];
        ...
    }
    xmax = (fabsf(xmax) > fabsf(xmin)) ? fabsf(xmax) : fabsf(xmin);
    ...
    val = (xmax > ymax) ? xmax : ymax;
    ...
    r_Dev = val;
    //create root node
    step_Dev++;
    posm_Dev[nnodes_Dev].x = 0.0f;
    posm_Dev[nnodes_Dev].y = 0.0f;
    posm_Dev[nnodes_Dev].z = 0.0f;
    posm_Dev[nnodes_Dev].w = 0.0f;
    for (i = 0; i < 8; i++)
    {
        childNodes_Dev[(nnodes_Dev-n_Dev)*8+i] = NOCHILD;
    }
}
After the “radius” has been found and saved to r_Dev, the step count variable step_Dev is incremented.
This value is -1 before the first simulation iteration due to the Verlet integration. Finally the root node is created.
The assignments to the members of a float4 variable in the posm_Dev array show the awkward situation we were
forced into. To avoid errors due to compiler optimizations, the float4 posm_Dev array is declared volatile
in the program. Due to a restriction enforced by the compiler, however, assigning to or from such a float4 variable
is not permitted, hence the individual access to the float4 members.
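The last-block detection with atomicInc() used in this kernel can be emulated on the host. The following is a sketch in plain C++ under stated assumptions: atomicIncEmu and countLastBlocks are hypothetical names, std::atomic stands in for the device counter, and the blocks are simulated one after another rather than concurrently.

```cpp
#include <atomic>
#include <cassert>

// Emulates the semantics of CUDA's atomicInc(address, val):
// stores (old >= val) ? 0 : old + 1 and returns the old value.
unsigned int atomicIncEmu(std::atomic<unsigned int>& counter, unsigned int val) {
    unsigned int old = counter.load();
    unsigned int next;
    do {
        next = (old >= val) ? 0u : old + 1u;
    } while (!counter.compare_exchange_weak(old, next)); // old is refreshed on failure
    return old;
}

// Simulates `blocks` blocks finishing in turn and counts how many of them
// conclude that they are the last block. Returns -1 if the counter did not
// wrap back to 0, which would break the next kernel launch.
int countLastBlocks(unsigned int blocks) {
    std::atomic<unsigned int> blockCount{0};
    unsigned int inc = blocks - 1; // block count minus 1, as in the kernel
    int last = 0;
    for (unsigned int b = 0; b < blocks; ++b) {
        if (atomicIncEmu(blockCount, inc) == inc) {
            ++last;
        }
    }
    return (blockCount.load() == 0u) ? last : -1;
}
```

Exactly one block sees the return value inc, and the counter is back at 0 afterwards, so it does not need to be reset before the next timestep.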
8.3.3 Octree building kernel
The flow of the buildTree kernel in principle follows the flow of our sequential implementation of the Barnes-Hut tree
building algorithm. The function diverges at some points in order to make it suitable for the
CUDA architecture, and to deal with concurrency. Octree nodes and their entire subtrees can be locked in
order to avoid threads overwriting each other's work. The kernel is implemented with a divide and conquer
algorithm, and this introduces some thread divergence within the warps as well as some overhead
and redundancy.
Firstly, each thread is assigned a body in the usual manner. That is, threadIdx.x, blockIdx.x and
blockDim.x are used to compute a unique id (tid) for each thread, and the first body worked on by the thread has
the index tid.
The code of the main loop in the function looks similar to that of the sequential version, but is changed from
a for-loop to a while-loop. This has been done in order to accommodate the CUDA architecture
by attempting to ensure that the execution paths do not differ too much, as well as to avoid deadlock
scenarios.
After the local variables have been initialized, the tree building starts in the main loop. Each thread tries
to insert a leaf into the tree. The flow of code for a thread in the buildTree() function is illustrated by the
flowchart in figure 8.6, which the reader can consult when reading this section.
Figure 8.6 Flowchart illustrating the main while-loop of the tree building kernel performed by a thread
Inside the while-loop, the thread starts by saving the position of the leaf into temporary variables. The
thread then tries to insert the leaf in the root node.
if (skip != 0)
{
    temp_leaf_vector = posm_Dev[i]; //save position in temporary variable
    //start at root and find direction of travel
    x = -1.0f, y = -1.0f, z = -1.0f;
    if (0.0f < px) { index = index | 1; x = 1.0f; }
    if (0.0f < py) { index = index | 2; y = 1.0f; }
    if (0.0f < pz) { index = index | 4; z = 1.0f; }
}
This code section is skipped during the next iteration of the loop if the thread fails to insert its leaf.
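The three comparisons pack the direction of travel into a 3-bit child index. A minimal host-side sketch in plain C++ (the function name octantIndex is an assumption, and the cell center is taken to be the origin, as for the root node):

```cpp
#include <cassert>

// Child slot (0-7) for a position relative to a cell centered at the origin.
// Bit 0 is set for positive x, bit 1 for positive y, bit 2 for positive z.
int octantIndex(float px, float py, float pz) {
    int index = 0;
    if (0.0f < px) index |= 1;
    if (0.0f < py) index |= 2;
    if (0.0f < pz) index |= 4;
    return index;
}
```

A body at (1, -1, 1), for example, lands in child slot 1 | 4 = 5.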
The leaf now traverses the tree, finding and following children in the childNodes_Dev array as long as
internal nodes are encountered. When an empty branch or a leaf node is reached, we insert the leaf node, just
as in the sequential code. In order to avoid concurrency problems, indices of the childNodes_Dev array
can be locked, meaning that the entire subtree of that child is locked until the thread is finished updating
it. If this were not done, the threads would overwrite each other's subtrees.
If the thread finds that the child of the current internal node is locked, the thread starts over with the loop.
If it is not locked, the thread will try to lock that index in childNodes_Dev. The lock is applied by setting
the childNodes index at the current child to LOCK using an atomic operation. The atomic operation makes
sure that no other thread can do the same. When a node is inserted in the locked spot, that part of the tree
is again open to other threads, as the index no longer contains the LOCK value. The locking mechanism can
be seen in the reduced code below:
if (child != LOCK) //skip if locked
{
    locked = (current-n_Dev)*8+index;
    if (child == atomicCAS((int*)&childNodes_Dev[locked], child, LOCK)) //try to lock
    {
        if (child == NOCHILD) //spot is available for leaf insertion
        {
            childNodes_Dev[locked] = leaf;
        }
        else do //we have two leafs, child and leaf
        {
            newInternal = atomicSub((int*)&usedInternal, 1) - 1; //make new internal node
            ...
            //create internal nodes and insert child in tree
        }
        childNodes_Dev[(current-n_Dev)*8+index] = leaf;
        __threadfence(); //make sure it's written
        ...
The atomicCAS() and atomicSub() functions are both atomic operations supported by CUDA, which we
use to avoid race conditions in the buildTree kernel. The atomicCAS() function is used to ensure
that no other thread writes to or locks the childNodes index that the current thread is trying to lock.
The CAS part of the name is short for “Compare And Swap”, and the function takes 3 arguments:
an address, a compare value and a new value. In our program we use the function as follows: the value of the childNodes_Dev
array at index “locked” (addressed by the first argument) is compared to the value of child (the second argument). If the two
values are equal, the value LOCK (the third argument) is stored in the array. Otherwise the value in
the array remains in place. In both cases the function returns the value that was in the array before the call. If the returned
value is not the same as child, another thread has changed the value in the meantime. While the lock is held, other threads
cannot modify the childNodes array at that particular index, so no data can be overwritten.
The atomicSub() function is used to make sure that only one thread decrements and reads the usedInternal
value at a time. This ensures that creating a new internal node will not introduce a race condition
wherein two threads get the same value for their newInternal.
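The compare-and-swap locking can be emulated on the host with std::atomic. This is a sketch in plain C++, not the kernel code: the values chosen for NOCHILD and LOCK, and the names atomicCasEmu and tryInsertLeaf, are assumptions, and the case where the slot already holds a leaf is reduced to releasing the lock again instead of subdividing.

```cpp
#include <atomic>
#include <cassert>

constexpr int NOCHILD = -1; // assumed value; the thesis only says "distinct negative integers"
constexpr int LOCK = -2;    // assumed value

// Emulates atomicCAS(address, compare, val): stores val only if the slot
// holds compare, and always returns the value that was there before.
int atomicCasEmu(std::atomic<int>& slot, int compare, int val) {
    int old = compare;
    slot.compare_exchange_strong(old, val); // on failure, old receives the current value
    return old;
}

// Tries to lock a child slot and insert a leaf into it if it is empty.
// Returns true if the leaf was inserted, false if the caller has to retry.
bool tryInsertLeaf(std::atomic<int>& slot, int leaf) {
    int child = slot.load();
    if (child == LOCK) return false;                             // locked by someone else
    if (child != atomicCasEmu(slot, child, LOCK)) return false;  // lost the race
    if (child == NOCHILD) {
        slot.store(leaf);  // writing a non-LOCK value releases the lock
        return true;
    }
    slot.store(child);     // sketch only: the kernel would subdivide here
    return false;
}
```

A thread that receives false simply retries in the next loop iteration, which corresponds to starting over with the while-loop in the kernel.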
The __threadfence() function is used so that all other threads in the grid can see the new nodes created in
the innermost do-loop before the lock is released. The __threadfence() function guarantees that the global
memory write is visible to all threads in all blocks.
At the end of the loop, the __syncthreads() function is called. This makes all threads in the
block wait for each other at the end of each iteration of the while-loop. This has nothing to do with
concurrency control, but is an optimization presented by Burtscher[7]. The idea is to minimize the use of atomic
operations and to maximize the bandwidth available to the threads doing the work. Especially at the start of the
tree building, most threads run into locked locations. Having a lot of threads in the
block repeatedly try to acquire locks using the atomicCAS() function, which is expensive and sequential,
is counterproductive. Instead most threads wait at the end of the loop, while the remaining threads perform
the calculation and data fetching intensive work in the innermost do-loop.
Looking at the flowchart in figure 8.6, it can be seen that there is a lot of thread divergence in the flow of the code.
Thread divergence means that the code has to be serialized within the warp. Additionally, threads will
repeatedly try to acquire locks, fail, and not perform any work. Combined, this means that although the
octree building is parallelized, the greatest performance cannot be expected from it.
8.3.4 Compute center of mass kernel
This kernel is very similar to the equivalent function in our parallel OpenMP implementation. Each thread
leapfrogs through the internal nodes created in the tree building kernel and finds the center of mass
and collective mass of their up to 8 children. As the mass of the children of one internal node may or may
not depend on the calculations of another thread, internal nodes are locked until ready to be used by other
threads. This way, the kernel works a bit like a pipeline wherein threads wait for data from previous threads.
As always with CUDA, some initialization work needs to be done before the execution of the algorithm. The
local variables in the function can be seen in the code below:
__shared__ volatile int childrenCaches[CENTERTHREADS * 8];
int i, j, internal, bottom, currentChild;
int index = threadIdx.x * 8;
int missing = 0;
int tid = threadIdx.x + blockIdx.x * blockDim.x; //absolute thread number
int threads = gridDim.x * blockDim.x; //total number of threads
float cent_pos_x, cent_pos_y, cent_pos_z, cent_mass;
Each thread will cache up to 8 nodes during the execution of this kernel. In order to speed up the program,
and to save register space, this is done in shared memory using an array declared with the __shared__
qualifier. The array holds 8 ints per thread in the block. As the memory is shared between all threads in
the block, each thread uses the index variable to point to its own part of the shared memory cache.
Temporary floating point variables (cent_pos_x, cent_pos_y, cent_pos_z, cent_mass) are used for the
intermediate results in the function. This speeds up the function, as these will most likely be kept in registers,
but it also serves another purpose: we do not want to update the mass until an entire step of the main loop
has been completed. A missing variable is also introduced. It is used to remember how many internal child nodes
the current thread has cached.
The id (tid) of the thread is found in the usual manner, and it is used to find the starting internal node for a given
thread using the formula below.
internal = (bottom & (-WARPSIZE)) + tid; //bottom - (bottom % WARPSIZE) + tid
Here, bottom is the last used internal node in the tree building kernel. The line above is an optimized way
of doing the modulus operation, which increases performance a tad[28].
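The equivalence between the bit mask and the modulus form holds for non-negative values of bottom when WARPSIZE is a power of 2, since -WARPSIZE in two's complement has all bits set except the low five. A minimal host-side sketch in plain C++ (the function names are assumptions introduced for illustration):

```cpp
#include <cassert>

constexpr int WARPSIZE = 32; // as set in the thesis code

// Rounds bottom down to a multiple of WARPSIZE using a bit mask.
int roundDownMask(int bottom) {
    return bottom & (-WARPSIZE);
}

// The same computation written with the modulus operator.
int roundDownMod(int bottom) {
    return bottom - (bottom % WARPSIZE);
}

// Starting internal node for the thread with absolute id tid.
int startInternal(int bottom, int tid) {
    return roundDownMask(bottom) + tid;
}
```

For bottom = 70, both forms give 64, so the thread with tid 5 would start at node 69.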
The main part of the code consists of the while-loop, in which each thread leapfrogs over the existing
internal nodes and sums up their centers of mass (CoM) and collective masses. As the calculations are done in parallel,
we need to ensure that all child nodes are ready to be used for calculation purposes. As in the OpenMP
version, mass is used as a lock that indicates whether a child is ready or not. If the mass is positive, the child is ready to
use. Otherwise the child is remembered in the cache in order to be used when ready. Therefore, the
main loop has two parts, as can be seen in the pseudo-code below.
while (internal <= nnodes_Dev)
{
    //temporary variables
    float cent_pos_x = 0.0f, cent_pos_y = 0.0f, cent_pos_z = 0.0f, cent_mass = 0.0f;
    for (int child = 0; child < 8; child++) //iterate through possible children
    {
        //use nodes that are ready to update temporary cent_pos & cent_mass
        //cache internal nodes that are not ready in shared memory
    }
    do { //while there are missing children
        for (int child = 0; child < 8; child++)
        {
            //iterate through cached children
            //if child is ready, update temporary mass & CoM
        }
        if (missing == 0)
        { //when all nodes are calculated
            //update position of the internal node in global memory
            __threadfence(); //make sure the updated position is seen by other threads
            posm_Dev[internal].w = cent_mass; //update mass of internal node and unlock
            internal += threads; //go to next internal node
        }
    } while (missing > 0);
}
The first part iterates over the 8 children, and any node that is ready contributes to the temporary CoM and mass.
Any empty nodes and unready children are saved in the shared memory cache. For each non-empty node
saved, we increment the variable missing by one, in order to know when to end the second part of the loop.
The empty nodes are moved to the back of the 8 children, as seen in the following code snippet:
for (child = 0; child < 8; child++) {
    j = 0;
    if (child != j) //for calc. accel., move children to end
    {
        childNodes_Dev[(internal-n_Dev)*8+child] = NOCHILD;
        childNodes_Dev[(internal-n_Dev)*8+j] = currentChild;
    }
}
This will also speed up the next kernel, as it will reduce the length of its code and save threads from re-
dundant work.
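The reduced snippet does not show how j advances. One way such a compaction can work is sketched below on the host in plain C++, under the assumption that j counts the non-empty children seen so far. The function name compactChildren and the value of NOCHILD are assumptions, and this is not necessarily how the kernel does it.

```cpp
#include <array>
#include <cassert>

constexpr int NOCHILD = -1; // assumed value

// Moves all non-empty children to the front of the 8 slots, keeping their
// order, so that the empty slots end up at the back.
// Returns the number of non-empty children.
int compactChildren(std::array<int, 8>& children) {
    int j = 0; // next free slot at the front
    for (int child = 0; child < 8; ++child) {
        int currentChild = children[child];
        if (currentChild != NOCHILD) {
            if (child != j) {
                children[child] = NOCHILD;
                children[j] = currentChild;
            }
            ++j;
        }
    }
    return j;
}
```

A later traversal can then stop at the first empty slot instead of testing all 8.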
The second part of the main loop in the pseudo-code above iterates over the cached children. If a child
has a positive mass, it is now ready: it can be used for the CoM calculations, and its mass can be added
to the temporary variables. When all children have been processed, the position of the internal node can be
updated in the global memory array. The __threadfence() function is called to make sure that the updated
positions are visible to the other threads. Afterwards the mass of the internal node can be updated. The
thread can now move on to the next internal node assigned to it and start over.
When an SM has several blocks assigned to it, there is no guarantee as to which block gets scheduled for
execution first. This means that only one block can be scheduled per SM when calling this kernel from the
host. If more blocks are assigned to the same SM, we cannot know whether the threads will perform the
calculations beginning from the last created internal node, which is crucial. If the SM schedules the blocks in the
wrong order, the kernel deadlocks. This is because all threads would wait for child nodes to become
ready, and these nodes may belong to a block scheduled for later execution. Therefore we only launch as many
blocks as there are SMs in this kernel.
8.3.5 Compute force & advance bodies kernels
The last two kernel functions called in the main loop are the two that are the most parallelizable. The
function that advances bodies is embarrassingly parallel, as always. The acceleration calculations are done in the
computeForce function, which uses some thread cooperation within the warps in order to avoid divergence.
In this kernel, the tree built in the last two functions is traversed in order to find the acceleration for each
body. The tree is traversed down each branch until a leaf node, or an internal node that is far
enough away for all the threads in the warp, is found. The kernel makes use of a CUDA-specific thread
voting function as well as stack implementations to traverse the tree.
As seen earlier in this chapter (figures 8.2 and 8.5), computing the force is the part of the algorithm that
takes the most time. The actual calculations are not as intensive as the fetching of data, as observed in the
all-pairs case (see chapter 7) where we made more extensive use of the __shared__ memory.
The main loop is close to that of the CPU versions described by the pseudo-code on page 69, but with mi-
nor differences. We have included a flowchart for reference in figure 8.7 which illustrates the flow of the
algorithm for a thread.
Quite a number of added variables in the computeForce kernel need some explanation.
The currentNode, currentChild, leaf and depth variables are used the same way as in earlier versions.
leaf = the leaf node for which the thread is finding the acceleration
depth = the depth/level in the tree of the current traversal
currentNode = the node in the octree we are currently at
currentChild = int between 0-7, representing the direction we are taking from the current node
The variables px, py, pz, ax, ay, az, dx, dy, dz are temporary variables for the current leaf's acceleration,
velocity vectors and position coordinates. The variable tmp is used temporarily for different purposes.
In order to have the threads in the warp cooperate, the three variables described below are created.
base = threadIdx.x / WARPSIZE; //warp number in block
sbase = base * WARPSIZE; //first thread in warp
did = base * MAXDEPTH; //depth index, i.e. did corresponds to depth = 0
These variables are used to access the parts of the shared memory that correspond to the
warp the thread resides in.
We make use of two stacks, the array parent and the long integer visited. The parent array is located in
shared memory for fast memory access. Although not actual stack implementations as such, they work in
a similar way as can be seen in the code below.
Figure 8.7 Flowchart showing the flow of the main for-loop in the forceCalculation kernel
//push on stack
depth++;
visited = visited << 3;
visited = visited | currentChild;
parent[depth] = currentNode;
//pop from stack
currentChild = visited & 7;
visited = visited >> 3;
currentChild++;
depth--;
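The packing of child indices into the visited integer can be tried out on the host. This is a sketch in plain C++ under stated assumptions: the Traversal struct and the push and pop names are introduced for illustration, visited is assumed to be 64 bits wide, and the parent array is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the traversal state of one thread.
struct Traversal {
    std::uint64_t visited = 0; // 3 bits per level, most recent level in the low bits
    int depth = 0;
};

// Descend into a child: remember which child index we took at this level.
void push(Traversal& t, int currentChild) {
    t.depth++;
    t.visited = (t.visited << 3) | static_cast<std::uint64_t>(currentChild & 7);
}

// Return to the parent: recover the child index taken there and
// return the next child to visit (the recovered index plus one).
int pop(Traversal& t) {
    int currentChild = static_cast<int>(t.visited & 7);
    t.visited >>= 3;
    t.depth--;
    return currentChild + 1;
}
```

Under the 64-bit assumption, the integer has room for 21 levels of 3 bits each, which is fewer than the 32 levels allowed by MAXDEPTH.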
The rdepth array is also created in shared memory. Instead of doing the calculations in the loop,
most calculations corresponding to formula 6.8 in chapter 6 are done beforehand. In these
calculations, the diameter s is 2 × r and θ is 0.5. We also use the fact that index 0 of the rdepth array is
actually used for the first subdivision of the tree and not the root. The entries can therefore be found using this
formula⁴: $d^2 > 4 r^2 \cdot 0.25^l$, where $l$ is the level/depth of the traversal. The first thread in the block finds these
values, and then one thread in each warp updates the variables for that warp's part of the shared memory.
The __syncthreads() function is used between these steps to make sure that the shared memory has been
written before continuing. Inside the loop, we use the values from rdepth in the following way:
tmp = dx*dx + dy*dy + dz*dz + EPSILON; //squared distance from leaf to currentNode
if ((currentNode < n_Dev) || __all(tmp >= rdepth[depth])) //currentNode is a leaf node or far enough away on all threads in warp
{
    //update temporary acceleration
    currentChild++; //iterate to next child
}
else
{
    //push on stack
    ...
⁴ $0.5^l s < \theta d \rightarrow d^2 > \frac{s^2 \cdot 0.5^{2l}}{\theta^2} \rightarrow d^2 > r^2 \cdot 2^2 \cdot 0.25^l \cdot 4 \rightarrow$ ($l$ is one lower than the actual depth) $\rightarrow d^2 > 4 r^2 \cdot 0.25^{l+1} \cdot 4 \rightarrow d^2 > 0.25 \cdot r^2 \cdot 2^2 \cdot 0.25^l \cdot 4 = 4 r^2 \cdot 0.25^l$
The variable tmp, which holds the squared distance d² from the formula, is compared to the value of rdepth
at the current depth for all threads in the warp. The __all() voting function (see section 5.3.7) enables us
to avoid thread divergence, as it returns true if and only if all threads in the warp evaluate the logical
statement to be true. In the traversal, the __all() function makes the warp keep descending until the current
node is far enough away for all threads in the warp. This means that some threads may
traverse a branch deeper than they need to, because another thread in the warp needs to go deeper. Without the
__all() function, thread divergence would cause threads to wait instead of traversing, so we might as well
welcome the extra accuracy this extra depth provides.
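The precomputed thresholds and the warp vote can be sketched on the host as follows. This is plain C++ under stated assumptions: buildRdepth and warpAll are hypothetical names, the thresholds follow the formula $d^2 > 4 r^2 \cdot 0.25^l$ given in the text, and the loop over a vector stands in for the __all() vote across the threads of one warp.

```cpp
#include <cassert>
#include <vector>

// Precomputes the squared-distance threshold for each level l:
// rdepth[l] = 4 * r^2 * 0.25^l.
std::vector<float> buildRdepth(float r, int maxDepth) {
    std::vector<float> rdepth(maxDepth);
    float value = 4.0f * r * r;
    for (int l = 0; l < maxDepth; ++l) {
        rdepth[l] = value;
        value *= 0.25f; // each level halves the cell size, so the square shrinks by 4
    }
    return rdepth;
}

// Emulates __all(tmp >= threshold): true only if the test holds for the
// squared distance of every thread in the warp.
bool warpAll(const std::vector<float>& distSq, float threshold) {
    for (float d : distSq) {
        if (!(d >= threshold)) return false;
    }
    return true;
}
```

If a single thread in the warp is too close to the node, the vote fails and the whole warp descends one level further.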
It can also be seen that any changes to data in shared memory are made by the first thread in the warp in
order to save bandwidth.
After the tree has been traversed, each thread updates acceleration and velocity according to the Verlet
scheme.
When the computeForce kernel is finished, the velocity and position updating is done exactly as in previ-
ous versions of the program in the advanceBodies kernel.
8.3.6 Performance
As mentioned in the section about the general structure of the program, defines at the beginning of the
source file are used to set the number of threads per block, and how many blocks to schedule per SM. For
tuning our program, we have generated a 524288 body input data file. Using this, each kernel was tested
with all combinations of block factors 1 through 5 (except where prohibited) and thread counts of 256, 512,
768 and 1024. Below we present the optimal kernel launch parameters for that particular configuration on
the GPU which we use:
#define RADIUSTHREADS 256 /*must be a power of 2*/
#define RADIUSBLOCKSF 1
#define TREETHREADS 512
#define TREEBLOCKSF 2
#define CENTERTHREADS 768
#define CENTERBLOCKSF 1 /*must remain 1*/
#define FORCETHREADS 512
#define FORCEBLOCKSF 4
#define ADVANCETHREADS 768
#define ADVANCEBLOCKSF 5
Total runtimes with 10 timesteps for each kernel (radius, tree, center, force, advance) on the 512k n-body
system: 1.77, 77.52, 15.21, 12058.01 and 3.08 ms.
We have previously mentioned that we have based parts of our code on Burtscher’s program⁵, specifically v2.1.
Differences that may affect performance include our use of the float4 data type for storing vectors
and coordinates. It was our assumption that float4 would increase performance. But given how it is used, we do not
expect our use of float4 to give any increase in performance, nor do we expect a significant decrease.
⁵ Obtainable from http://www.gpucomputing.net/?q=node/1314 as of the time of writing.
While a sorting function was at one point implemented in the Barnes-Hut OpenMP version, we did not finish that
implementation, nor did we implement a sorting function in the CUDA version. Sorting the bodies, according to Burtscher’s
article[7], has a significant impact on the function for force calculations. At this point we found it tempting
to perform a single benchmark on his program, for comparison.
We are running his program on the same hardware and software setup, and it is compiled the same way.
His program does not accept input data, but generates it in host code before execution given a number (n)
as parameter. His data generation routine does not truncate the Plummer model as ours does.
Total runtimes with 10 timesteps for each kernel of Burtscher’s program on a 512k (524288) body system:
0.9, 83.6, 33.5, 13.1, 1974.1 and 4.0 ms.
The fourth timing is the sorting kernel. The first three timings, and the last, are comparable to ours, but
the force kernel and indeed the total runtime are close to 6 times shorter. Curious to determine whether this large
disparity in runtime is in fact caused by the lack of sorting, we disabled sorting⁶ in his program and ran the
test again: 0.9, 83.8, 33.5, 0.0, 11727.6 and 3.9 ms. We note that the disabled kernel now has a runtime of
zero and that the force kernel is slowed significantly, to a level comparable to our program.
Now that we have an indication that the lack of sorting is a performance issue, we will perform a single
test benchmark to determine what effect sorting would have on our program, before proceeding
with our regular benchmarks. To do so, we will use the generated 512k n-body data set, and a specially
sorted version produced with our counting sort function, implemented originally for the OpenMP version.
In our counting sort we use the octree to define the index we sort by. This is done in a similar way to how
we extend the visited variable in the forceCalculation kernel, appending a 3-bit word representing the tree
branch to an integer called sortIndex.
sortIndex = 1;
for (int i = 0; i < depth; i++)
{
    sortIndex = sortIndex << 3; //making room for 3 bits
    sortIndex |= branch; //adding the 3-bit word for the branch at level i to the index
}
The granularity of the sorting is defined by how many levels of the octree we use for the sorting. After
choosing the granularity, we order the bodies according to which subtree they belong to in the tree
at a certain level (figure 8.8 shows how this is done for a quadtree).
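Building the sort key and ordering bodies by it can be sketched on the host as follows. This is plain C++ under stated assumptions, not the counting sort from the thesis: makeSortIndex and countingSort are hypothetical names, the branch taken at each level is assumed to be given as a list of values 0-7, and the keys passed to countingSort are assumed to be already mapped to the range 0 to numKeys - 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds the sort key by appending one 3-bit branch index per tree level,
// starting from 1 so that leading zero branches are not lost.
unsigned int makeSortIndex(const std::vector<int>& branches) {
    unsigned int sortIndex = 1;
    for (int branch : branches) {
        sortIndex = (sortIndex << 3) | static_cast<unsigned int>(branch & 7);
    }
    return sortIndex;
}

// Stable counting sort: returns the body indices ordered by key.
// keys[i] must lie in the range [0, numKeys).
std::vector<int> countingSort(const std::vector<unsigned int>& keys, unsigned int numKeys) {
    std::vector<int> start(numKeys + 1, 0);
    for (unsigned int k : keys) start[k + 1]++;                               // histogram
    for (unsigned int k = 0; k < numKeys; ++k) start[k + 1] += start[k];      // prefix sums
    std::vector<int> order(keys.size());
    for (std::size_t body = 0; body < keys.size(); ++body) {
        order[start[keys[body]]++] = static_cast<int>(body);
    }
    return order;
}
```

Bodies with the same key, i.e. in the same subtree at the chosen level, end up next to each other in the returned order.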
In the sorted version, the bodies are sorted spatially at 5 levels using counting sort, meaning that they are
ordered as they would be placed in a balanced tree down to level 5. We run 10 timesteps on each data set, with a large (0.1)
and a small (0.005) timestep size, to determine the effect over long and short timespans of the
simulation. The timings are presented in table 8.2, with measurements of time in ms.
As we see, sorting has a significant impact on performance. It is no surprise that when using a small dt,
individual timings are not significantly different from the average. This is because very little happens
in the simulations, and the bodies move very little. When a larger value of dt is used, the bodies move
more, effectively undoing the impact of the initial sort. With the unsorted data and the large dt, we see
Footnote 6: Done by removing the call to the kernel and patching i = sortd[k] to i = k in the 2nd for loop of the force computing kernel.
Figure 8.8 This figure shows how our counting sort would sort bodies using a quadtree according to depths 0, 1 and 2.
Bodies are arranged in the array after which subtree they belong to.
Timestep   Sorted,     Sorted,     Unsorted,   Unsorted,
           dt 0.1      dt 0.005    dt 0.1      dt 0.005
1           136.51      136.60      1207.58     1207.21
2           139.01      136.54      1207.08     1205.78
3           143.95      136.96      1206.97     1206.80
4           155.76      137.29      1212.23     1206.90
5           175.19      137.30      1222.58     1207.13
6           205.27      136.85      1232.87     1206.92
7           265.86      136.90      1249.90     1206.52
8           403.59      137.56      1277.30     1205.20
9           626.14      136.69      1293.69     1206.36
10          803.20      137.12      1314.12     1206.71
Average     305.45      136.98      1242.43     1206.55
Total      3054.47     1369.80     12424.34    12065.53
Table 8.2 Timings in ms for comparing elapsed times when using sorted and unsorted data.
a slight change in runtime from the first to the last step. But with the sorted data set this difference is
considerable, increasing the runtime almost sixfold from the first to the tenth step and getting closer to the runtimes of initially
unsorted data. We note that the timings of the force function, as a result of presorting the data, are faster
than the timing of an equivalent run of Burtscher’s program. We suspect that, given an efficient CUDA
implementation of our counting sort algorithm, a speedup could be gained compared to his program.
8.4 Tests & results
We have benchmarked the simulation for 10 galaxy sizes. The average times for one full iteration of the
simulation and for each kernel can be seen in table 8.3. The O(n log n) trend can be seen in the data, but
the numbers are somewhat skewed because the GPU is not fully utilized for smaller sample sizes, as we also saw in chapter 7.
We observe that the force calculation takes up most of the calculation time, as we have also seen in the
earlier versions. For the larger galaxy sizes the runtime of the 4 other kernels is almost negligible.
The two irregular kernels still take up some time in the simulation, while the radius and advance kernels
are sub-linear and linear in time complexity, using almost no time.
n         Radius    Tree   Center     Force   Advance       All
1024        0.10    0.15     0.12      1.56      0.01      1.99
2048        0.10    0.16     0.14      2.70      0.01      3.18
4096        0.11    0.21     0.15      4.41      0.01      4.95
8192        0.11    0.39     0.17      8.50      0.01      9.22
16384       0.11    0.72     0.18     19.80      0.02     20.87
32768       0.11    1.06     0.25     42.00      0.03     43.50
65536       0.11    1.49     0.33     94.44      0.05     96.46
131072      0.12    2.38     0.50    225.21      0.08    228.34
262144      0.14    4.08     0.84    521.58      0.16    526.84
524288      0.18    7.75     1.50   1205.80      0.31   1215.58
Table 8.3 Average times in ms for single kernels and all kernels on different size simulations. All with t = 100 and dt = 0.0001.
We compare our CUDA implementation to the OpenMP implementation using 16 threads in figure 8.9. We
see that the CUDA version performs better for all galaxy sizes, but the increase in performance is nowhere
near the difference between the all-pairs versions. The pie chart in figure 8.9 shows
that the irregular parts of Barnes-Hut in particular take up a larger share in the OpenMP version. In fact, the OpenMP force
calculation is only twice as slow as the CUDA one for a 128k galaxy.
Figure 8.9 The leftmost figure shows a graph of the simulation time for one iteration of the Barnes-Hut CUDA and
OpenMP programs as a function of the galaxy size. The rightmost pie charts show the time usage of the 5 parts of the
programs; the innermost “wheel” shows the distribution for the CUDA program.
It can be seen in graph 8.10 that the most optimized CUDA all-pairs version has better performance for
small sample sizes, while Barnes-Hut has better performance for larger ones. This is as expected, since the
overhead of the Barnes-Hut algorithm itself, and the parallelization of it, overshadows the performance
gains for small sample sizes. At larger galaxy sizes the quadratic time complexity of the all-pairs algorithm
overshadows the computational throughput of the implementation.
8.5 Summary of the Barnes-Hut case
In the Barnes-Hut case we have found that it is much harder to utilize the massive parallelism offered by
CUDA, and to achieve the same kind of performance gains as in the all-pairs case. CUDA and GPUs are
made for massive parallelism and in this chapter we have shown use of a few specific CUDA functions that
Figure 8.10 A comparison between the all-pairs program and the Barnes-Hut program, executed on a C1060 GPU and
a GTX 480 GPU respectively.
we did not use in the previous case, for instance the implementation of locks using atomics and thread
fences, as well as voting functions to avoid thread divergence. These make the CUDA version handle
concurrency better than the OpenMP program, the way we chose to implement it. Admittedly, the OpenMP
implementation could have been more elegant and made better use of OpenMP features. Presumably a
different implementation, using OpenMP tasks and recursion, could have performed significantly better.
But as mentioned, the purpose of the OpenMP version was to learn, to help with debugging and to compare
performance against a similar CUDA implementation.
Though it is possible to implement irregular code in CUDA, it requires quite some knowledge and a developed
skill-set in CUDA programming to do so with success. We had difficulties implementing the Barnes-Hut
algorithm in CUDA, and we acknowledge that we were able to finish our CUDA Barnes-Hut program
only by learning from and using parts of Burtscher's solution.
The way CUDA works and the additional constraints that must be followed to achieve high performance
make it difficult to implement many kinds of classical optimizing algorithms. In the implementation of the
tree building we can see how the architecture counteracts the parallelism through warp divergence. When
the threads are on different paths, they are not working simultaneously, effectively reducing the number
of cores in use.
The implementation of this program in CUDA also required more use of debugging, and deeper debugging.
We found the experience of debugging CUDA frustrating, and not just because the cuda-gdb debugger
that NVIDIA provides with the Linux SDK is not user friendly. We found that in some cases, use of the
debugger itself caused changes in behavior, or could bring down the CUDA driver of the system. Moreover,
the kind of problems that would occur were related to threads and memory access, and would not
reproduce easily or at all when changes to launch configurations were made in order to debug. But that is
obviously a consequence of the use of CUDA.
Despite this, we did succeed in implementing the Barnes-Hut algorithm in CUDA in a way that surpassed
the performance of the OpenMP version running on a system with 16 CPU cores. We cannot rule out that
a much better OpenMP implementation could be made that would outperform this CUDA version. However,
we also know that the CUDA version can be made to perform significantly better. We explain the relatively
low performance of the benchmarked version with the distribution of the data and the corresponding path
of execution taken by the algorithm. We do so based on tests on sorted data, which also show that the
increasing lack of sorting, as the algorithm moves the bodies around, causes a corresponding decrease in
performance. In our tests, this is directly caused by the chosen timestep sizes being quite large, but the
effect would be the same given a longer simulation and smaller timesteps.
In both CUDA and OpenMP implementations, the majority of the work done lies in the force calculating
function. But the OpenMP version does not suffer from a lack of sorted bodies. With the CUDA version
this causes a significant decrease in performance, showing the potential weakness of the SIMT model.
Something we would like to point out, but are wary of drawing any conclusions from, is the performance
achieved using our sorting method. We tested on an n-body dataset of spatially sorted bodies, sorted
using the counting sort algorithm. Because the entire warp has to traverse the tree together in the force
calculation kernel, we see that a spatially sorted array of bodies achieves speedups of roughly an order of magnitude.
The timings and speedup observed in our force calculating function, using the galaxy sorted by our counting
sort algorithm, appear to be better than the corresponding timings and speedup observed in Burtscher's
program with a sorting kernel.
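Why sorting matters for a warp that traverses the tree together can be illustrated with a toy cost model in C++. The model is an assumption of ours for illustration, not a measurement: each thread needs a set of tree nodes (given here as distinct integer ids), and the warp as a whole pays for the union of those sets. The names warpCost and warpUtilization are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy cost model (an assumption of this sketch): the warp moves through the
// tree together, so it pays once for every node that any of its threads needs.
std::size_t warpCost(const std::vector<std::vector<int>>& nodesPerThread)
{
    std::set<int> visited;
    for (const auto& nodes : nodesPerThread)
        visited.insert(nodes.begin(), nodes.end());
    return visited.size();
}

// Fraction of the warp's thread-steps that do useful work. Assumes the node
// ids within one thread's list are distinct.
double warpUtilization(const std::vector<std::vector<int>>& nodesPerThread)
{
    std::size_t useful = 0;
    for (const auto& nodes : nodesPerThread)
        useful += nodes.size();
    const std::size_t cost = warpCost(nodesPerThread);
    if (cost == 0 || nodesPerThread.empty())
        return 1.0;
    return static_cast<double>(useful) / (cost * nodesPerThread.size());
}
```

In this model two threads that need the same three nodes give a cost of 3 and a utilization of 1.0, while two threads that need disjoint sets of three nodes give a cost of 6 and a utilization of 0.5. Spatially close bodies tend to need the same nodes, which is the effect sorting aims for.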
The speedup gained in the context of the n-body problem, going from O(n^2) in the all-pairs algorithm to
O(n log n) in Barnes-Hut, is observed in all three implementations.
9 Analysis and discussion
We will now analyze and describe the results of our cases, our use of CUDA and discuss the project in
general. We will give a summary of the case studies by explaining the observed results and putting them
into context.
9.1 Cases
In this project our goal was to gain knowledge about CUDA as a parallel platform, learn how to use it, and
to implement and optimize programs using it. Demonstrating use of CUDA through cases has been the
intention from the start. We ended up looking into two rather large cases, each extensive in its own way.
Using our cases as a basis for the analysis and discussion, we will now sum up the observations and results
from the two case chapters.
A main point of our cases is the performance increases achieved through parallel implementations running
on GPUs and optimizations applied using CUDA. But given the nature of our cases, we will begin by
evaluating the quality of the simulations.
After attempting to use various methods of verifying results of simulations, we ended up using an energy
function to determine if the energy is conserved in a simulation with reasonable parameters. Though
we know that our simplifications and assumptions regarding the n-body problem mean that the simulations
are not realistic per se, we feel that doing simulations where the results are wrong solely due to a
faulty implementation of the algorithm or errors caused by parallelisation is unsatisfactory. We have used
the energy calculation as a tool during debugging and verification of implementations to indicate unexpected
problems.
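A check of this kind can be sketched in C++ as follows. This is not our actual energy function: the Body layout, the function names, the choice G = 1, the use of double precision and the absence of a softening term are all assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body { double x, y, z, vx, vy, vz, mass; };   // hypothetical layout

// Total energy E = sum(1/2 m v^2) - sum over pairs of G m_i m_j / r_ij.
// G = 1 and no softening term are assumptions of this sketch.
double totalEnergy(const std::vector<Body>& bodies, double G = 1.0)
{
    double kinetic = 0.0, potential = 0.0;
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        const Body& a = bodies[i];
        kinetic += 0.5 * a.mass * (a.vx * a.vx + a.vy * a.vy + a.vz * a.vz);
        for (std::size_t j = i + 1; j < bodies.size(); ++j) {
            const Body& b = bodies[j];
            const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
            potential -= G * a.mass * b.mass / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return kinetic + potential;
}

// Relative drift between the energy at the start (e0, nonzero) and later (e).
double relativeEnergyDrift(double e0, double e)
{
    assert(e0 != 0.0);
    return std::fabs((e - e0) / e0);
}
```

In use, the energy would be computed before the first timestep and again after some number of steps; a drift that grows much faster than expected for the chosen dt points to a bug rather than to integration error.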
Besides catching errors, this also allowed us to observe problems with the same cause: the use of single
precision floating point variables and arithmetic. We knew from the beginning that this would be an issue
to some degree. As an example, we observed that two mathematically equivalent implementations, i.e. adding
the incremental accelerations to the velocity vector instead of summing them up and then adding, do not
give identical results. This is due to the use of single precision floating point arithmetic, on which we
have written some notes in the appendix in chapter 12. Errors introduced by use of single precision in the
former were only a problem for large galaxies, making the problem less obvious initially. While it is possible
to work around these issues, there is no reason to avoid the obvious fix when performing simulations
that require high precision. This fix is to use double precision floating point arithmetic, which we opted
not to use in our project (footnote 1) due to the big differences in performance on the hardware available to us. The
newer GPUs released are addressing this with increased double precision performance, which promises
to eventually make this a non-issue.
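The effect of the two orderings can be reproduced with a few lines of C++. The values below are artificial and chosen only to make the rounding visible; the function names are hypothetical and the code is not taken from our programs.

```cpp
#include <cassert>

// Variant 1: add every contribution straight onto the velocity.
float addIncrementally(float velocity, const float* contributions, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        velocity += contributions[i];
    return velocity;
}

// Variant 2: sum the contributions first, then add the total once.
float addSummed(float velocity, const float* contributions, int n)
{
    assert(n >= 0);
    float total = 0.0f;
    for (int i = 0; i < n; ++i)
        total += contributions[i];
    return velocity + total;
}
```

With IEEE 754 single precision, a velocity of 16777216.0f (2^24) and four contributions of 1.0f, the first variant returns 16777216.0f because each single addition is rounded away, while the second returns 16777220.0f. The small terms are lost when they are added one at a time to a much larger value.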
Footnote 1: As a note, this would be trivial to change in our programs, except for float4 types for which there is no double precision equivalent.
To measure performance, we ran timed simulations on benchmark datasets consisting of 8 generated Plummer
models of increasing sizes (starting with 1024 and doubling up to 131072) on all versions of the programs.
For the Barnes-Hut case, we added some larger sizes for testing at the end. The parallel OpenMP
programs, which were executed on the 16 CPU core server, were timed on 1, 2, 4, 8 and 16 threads to see
if the programs scaled efficiently with the number of cores. We did similar performance measurements with the
CUDA versions. Though a benchmark of speedup still speaks to the scalability of the program, in CUDA it
would not be related to core count alone. Much of the speedup obtained is through optimized data fetching
and the other resources the cores share. To sum up, the benchmarks for the CUDA programs showed the
following:
The all-pairs simulation done with an optimized version of the CUDA implementation showed us a “whopping”
performance gain (shown earlier in figure 7.14) compared to our parallel OpenMP program. The
observed speedup of 100 times compared to the 16 core computer indicates that the CUDA GPU used
provides good computational value on that particular implementation and setup.
Regarding the Barnes-Hut case, it is true that the sequential Barnes-Hut implementation is sped up by
introducing extra CPU cores and threads, and even more when implementing it to run on a GPU. However,
we only observe a speedup of approximately 2.5 when comparing our CUDA version to the parallel OpenMP
version, when all SMs on the GPU are utilized. When testing the consequences of sorting data, we notice
that a speedup of 25 times may be possible, but this does not compare to the speedups observed in the
all-pairs implementations, which were up to 100 times faster.
As previously pointed out, in order to either use CUDA functionality optimally or to implement the
irregular algorithm, the complexity of the code has increased tremendously in all of our implementations.
When considering the simulation performance of a Barnes-Hut implementation compared to an all-pairs one,
we observe that using Barnes-Hut is not worth the effort for small datasets when comparing CUDA
implementations, but it is when comparing CPU implementations.
We have located two major contributions to the lack of speedup and have a possible solution in mind
(hinted at towards the end of chapter 8), which we will present below.
The first contribution to the lack of speedup when using the Barnes-Hut algorithm in CUDA is warp divergence,
which severely slows down the execution. Burtscher solves this with a sorting kernel that
sorts bodies semi-spatially [7]. We performed a mini-experiment using data that was pre-sorted by using
the indexing of the bodies (presented in section 8.3.6). We observed in this instance that our forceCalculation
kernel performed faster than Burtscher’s; however, we are not drawing any conclusion from this
observation.
Radix sort implemented on GPUs has been shown to be possible with a performance of one million sorted keys per
millisecond [24]. Finding the index for a body is done in linear time and is embarrassingly parallel. Combining
a fast CUDA sorting algorithm with the indexes as keys could potentially improve the performance of
the forceCalculation kernel and indeed the entire simulation.
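The combination of indexing and sorting can be sketched on the CPU in C++. This is an illustration under our own assumptions, not an implementation of the proposal: the names Position, spatialKey and spatialOrder are hypothetical, the assignment of bits to the x, y and z directions is arbitrary, the bodies are assumed to lie inside the cube starting at lo with edge length size, and std::stable_sort stands in for the GPU radix sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct Position { float x, y, z; };   // hypothetical layout

// Key of one body: descend `levels` levels of an octree over the cube with
// lower corner (lo, lo, lo) and edge length `size`, appending the 3-bit
// octant number at each level. Every body is independent of the others, so
// on a GPU each body could be handled by its own thread.
std::uint32_t spatialKey(Position p, float lo, float size, int levels)
{
    assert(levels >= 0 && levels <= 10);
    float cx = lo, cy = lo, cz = lo;          // lower corner of the current cell
    std::uint32_t key = 1;
    for (int level = 0; level < levels; ++level) {
        size *= 0.5f;                         // child cells have half the edge length
        unsigned octant = 0;
        if (p.x >= cx + size) { octant |= 1; cx += size; }
        if (p.y >= cy + size) { octant |= 2; cy += size; }
        if (p.z >= cz + size) { octant |= 4; cz += size; }
        key = (key << 3) | octant;
    }
    return key;
}

// Returns the body indices ordered by key.
std::vector<std::size_t> spatialOrder(const std::vector<Position>& bodies,
                                      float lo, float size, int levels)
{
    std::vector<std::uint32_t> keys(bodies.size());
    for (std::size_t i = 0; i < bodies.size(); ++i)
        keys[i] = spatialKey(bodies[i], lo, size, levels);
    std::vector<std::size_t> order(bodies.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    return order;
}
```

Since the keys are small integers of fixed width, they are the kind of input a radix sort handles well, which is why the two steps fit together.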
The second contribution to the lack of speedup is related to the above. Compared to our all-pairs implementation,
data fetches are not optimized in the same way in the forceCalculation kernel. Data fetching as
seen in the tiling version of the CUDA all-pairs implementation is fully coalesced, while in each iteration
of the force kernel we are lacking coalescence, potentially only retrieving one body at a time. If the bodies
were to be ordered spatially using our suggested sorting method, it would be possible to better
utilize shared memory. Bodies within entire blocks would be grouped together spatially, and using shared
memory through tiling would be possible, as we would have better control over which body data needs to
be fetched to the block during the traversal of the tree.
9.2 CUDA
Based on our experiences with the cases we will evaluate CUDA as a platform and its place in parallel computing.
We have observed the possibilities offered by CUDA and its limitations through the implementation
of two types of algorithms.
As expected, CUDA handles embarrassingly parallel problems well, with moderate speedups for even naive
implementations. Seen in the context of performance value for money, CUDA certainly has the potential to solve
problems similar to ours. We believe problems such as simulations, processing, and bruteforcing can be
implemented with handsome performance gains even by a novice in CUDA.
Optimizing an implementation in CUDA can be a whole other matter indeed. We find that writing and tuning
an optimized CUDA program is comparable to writing an optimized CPU program in assembler: an
intimate knowledge of the problem and system architecture is required to get the best performance. For a
full optimization, the programmer has a myriad of things to consider, such as (but not limited to): hardware
resources, problem sizes, use of memory, use of special functions, setup of individual kernels and speed
vs. accuracy. As seen with the simple all-pairs problem, many optimizations were made, but it is still not
optimized for small problems that do not allow utilization of all SMs. For more elaborate problems with
several kernel calls, even though all of them are embarrassingly parallel, optimization might require very
extensive work.
The Barnes-Hut case shows that CUDA can be used for a parallelization of an algorithm that is irregular.
The irregular functions especially, as shown in figure 8.5 and table 8.3, take up increasingly more time
as the galaxy sizes grow, compared to the sequential version as seen in figure 8.3. The parallelization of
Barnes-Hut still shows increases in performance due to parallelization of the force calculation. The implication
is that in order to gain something from the parallelization, there has to be a major part of the program
that is suitable for CUDA's form of parallelism. We consider the warp divergence caused by branching code
in CUDA to represent a bottleneck. It is a consequence of the SIMT architecture. This bottleneck has similar
implications for the potential for speedup as the sequential part of the code has in Amdahl's and Gustafson’s
laws. As the bottleneck is harder to see or detect intuitively, and as single thread performance in CUDA is
low, it can be a problem.
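The two laws referred to above are short enough to state as code. The C++ sketch below uses the standard textbook forms; the function and parameter names are our own, and treating divergent code as if it were the serial fraction is only the analogy drawn in the text, not a performance model of CUDA.

```cpp
#include <cassert>

// Amdahl's law: speedup for a fixed problem size when a fraction
// `parallelFraction` of the runtime is spread over `workers` processors.
double amdahlSpeedup(double parallelFraction, double workers)
{
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0 && workers >= 1.0);
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
}

// Gustafson's law: scaled speedup when the parallel part of the work grows
// with the number of processors.
double gustafsonSpeedup(double parallelFraction, double workers)
{
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0 && workers >= 1.0);
    return (1.0 - parallelFraction) + parallelFraction * workers;
}
```

As an example, with 90% of the runtime parallelizable and 16 processors, Amdahl's law gives a speedup of 6.4 and Gustafson's law a scaled speedup of 14.5. Even a small non-parallel share therefore limits what many cores can deliver.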
That said, we have seen that it is possible to utilize CUDA hardware well given a suitable algorithm and
by using specific functionality. It is also possible to implement irregular code that performs better than a
CPU implementation, but it is not easy.
One needs to consider if the effort is worth the payoff. The use of CUDA and potential for optimization of
the code is problem dependent, and it can make previously unfeasible simulations viable, for example.
GPGPU programming is still a rather young paradigm and during the span of the project we have encountered
issues that are a consequence of this. How an existing program will scale between GPUs is not
completely obvious as the hardware resources vary, and it is hard to get actual specifications on all parts
of the card. Optimized programs also often run best on the GPU they were tuned for. For example, the
number of cores on the GTX 480 GPU is twice that of the Tesla C1060 GPU, but our most optimized CUDA
all-pairs program does not run at twice the speed on it.
We found the CUDA documentation to be lacking at times, and a few times we found example CUDA code
published by NVIDIA that would not compile. But it does appear that the software is maturing
rapidly in its latest versions.
On a hardware level, each generation of GPUs with a higher version of the CUDA compute capability
comes with more and more new features and improvements. Some of these also make CUDA
perform better on existing supported features, such as double precision floating point arithmetic.
NVIDIA is not the only producer of GPUs nor is CUDA the only platform that supports programming of
GPUs. The OpenCL framework, for instance, supports a similar programming paradigm to that of CUDA,
but supports both GPUs and CPUs from several manufacturers.
Programs that make use of GPUs for accelerating certain non-graphical parts are also increasingly common.
As more and more standard computers contain relatively powerful GPUs for accelerating graphics, we can
expect them to also be used to accelerate programs more frequently, perhaps taking on a more general role
much like the floating-point coprocessor had.
9.3 Project
In this section we will briefly reflect on the project as a whole.
Most of our time in the beginning of the process was spent learning and examining CUDA, participating
in a CUDA study group, and training by implementing example programs. Later, we spent considerable
time studying the n-body field from a non-physicist point of view and implementing a C++ version
of the Barnes-Hut algorithm that ultimately proved impossible to port to CUDA. Our concerns with
implementing the algorithms correctly in an unfamiliar environment are partly what caused us to spend
time programming sequential and parallel CPU versions for reference. Benchmarking for performance was
of course the main motivation, and while implementation of CUDA versions did cause some “headaches”,
observing the results was quite satisfactory. As always with these kinds of projects, we have some strong
ideas towards the end as to what could have been done better, or what could also have been interesting to
implement, given the knowledge. Then again, we only had those ideas due to what we learned.
It has taken the course of the whole project to get where we are now, and if we had known what we do
now, we might have chosen other cases to study. The n-body all-pairs problem, while an exciting example
when incrementally optimizing and illustrating the implementation in CUDA, is tried and tested. It is
an often used benchmark for parallelism, and several CUDA implementations exist. With regards to the
more difficult Barnes-Hut algorithm, it was perhaps naive of us to think that we could come up with
an implementation as good as the existing ones within the scope of a single case. Burtscher, who researches
the implementation of irregular programs on GPUs, spent two months just on converting C code and
implementing his CUDA solution [7]. We ended up basing much of our Barnes-Hut CUDA code on his,
though we learned quite a bit from studying his code.
We each have interests in other fields; cryptography and mathematical modeling in general, for example,
are areas in which we respectively see ourselves using CUDA in the future. CUDA is an interesting platform
to study and use, and has a lot of potential for accelerating many applications, as we have learned over the
course of this project. Everything considered, we are satisfied with the project.
10 Conclusion
We conclude that CUDA is exceptionally capable at handling massively parallel problems and algorithms,
but for irregular algorithms research is required to achieve good performance. It is not as easy to quickly
implement a parallel program as it is with other parallel programming paradigms, and implementing
optimized programs is a complex endeavor. In any case, hardware supporting the platform offers relatively
good performance for its cost and has the potential to help accelerate suitable applications.
Bibliography
[1] Joshua A. Anderson, Chris D. Lorenz, and A. Travesset. “General purpose molecular dynamics
simulations fully implemented on graphics processing units”. In: Journal of Computational Physics 227
(2008).
[2] S. J. Aarseth, M. Hénon, and R. Wielen. “A Comparison of Numerical Methods for the Study of Star
Cluster Dynamics”. In: Astronomy and Astrophysics 37 (1974).
[3] Josh Barnes and Piet Hut. “A hierarchical O(N log N) force-calculation algorithm”. In: Nature 324
(1986).
[4] Guy Blelloch and Girija Narlikar. “A Practical Comparison of N-Body Algorithms”. In: Parallel
Algorithms, Series in Discrete Mathematics and Theoretical Computer Science 30 (1997).
[5] Guy Blelloch and Girija Narlikar. “A Practical Comparison of N-Body Algorithms”. In: Parallel
Algorithms, Series in Discrete Mathematics and Theoretical Computer Science 30 (1997).
[6] Barry B. Brey. The Intel Microprocessors. 5th ed. Prentice-Hall, 2000.
[7] Martin Burtscher and Keshav Pingali. “An Efficient CUDA Implementation of the Tree-Based Barnes
Hut n-body algorithm”. In: GPU Computing Gems Unknown volume (2011).
[8] Rajkumar Buyya. High Performance Cluster Computing: Programming and Applications. vol 2. Prentice-Hall,
1999.
[9] Barbara Chapman, Gabriele Jost, and Ruud Van Der Pas. Using OpenMP. 1st edition. The MIT Press,
2008.
[10] Peter H. Colberg and Felix Höfling. “Highly accelerated simulations of glassy dynamics using
GPUs: Caveats on limited floating-point precision”. In: Computer Physics Communications 182 (2010).
[11] Germund Dahlquist and Åke Björck. Numerical Methods. Prentice-Hall, 1974.
[12] Katheryn Ferrel. “Hamiltonian Mechanics and the Construction of numerical Integrators”. Department
of Mathematics, University of California, San Diego, 2009.
[13] Michael J. Flynn and Kevin W. Rudd. “Parallel Architectures”. In: ACM Computing Surveys, Vol. 28,
No. 1, March 1996 (1996).
[14] Richard Friedman. About the OpenMP ARB and OpenMP.org. Jan. 26, 2012. url:
http://openmp.org/wp/about-openmp/.
[15] Hartmut Frommert and Christine Kronberg. Galaxies. 2009. url: http://messier.seds.org/.
[16] L. Greengard and V. Rokhlin. “A Fast Algorithm for Particle Simulations”. In: journal of computational
physics 135 (1997).
[17] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI. 2nd edition. The MIT Press, 1999.
[18] John L. Gustafson. “Reevaluating Amdahl’s law”. In: Communications of the ACM, May 1988, Volume
31, Number 5 (1988).
[19] Ernst Hairer, Christian Lubich, and Gerhard Wanner. “Geometric numerical integration illustrated
by the Störmer - Verlet method”. In: Acta Numerica Cambridge University Press (2003).
[20] Lars Hernquist. “Performance Characteristics of tree Codes”. In: the Astrophysical Journal Supplement
Series 64 (1987).
[21] A. S. Hornby, E. V. Gatenby, and H. Wakefield. The Advanced Learner’s Dictionary of Current English.
Second edition. Oxford, 1963.
[22] Piet Hut and Jun Makino. The Art of Computational Science, The Kali Code. 2005. url:
http://www.artcompsci.org/.
[23] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach.
1st. Morgan Kaufmann publisher, 2010.
[24] Duane Merrill and Andrew Grimshaw. “High performance and scalable radix sorting: A case study
of implementing dynamic parallelism for GPU computing”. In: Department of Computer Science,
University of Virginia (2011).
[25] Gordon E. Moore. “Cramming more components onto integrated circuits”. In: Electronics Volume 38,
Number 8, April 19, 1965 (1965).
[26] K. S. V. S. Narasimhan, K. S. Sastry, and Saleh Mohammed Alladin. “Gravitational Potential Energy
of Interpenetrating Spherical Galaxies in Hernquist's Model”. In: J. Astrophys. Astr. 18 (1996).
[27] NVIDIA. GPU Gems 3. First. Pearson Education, Inc., 2008.
[28] NVIDIA. NVIDIA CUDA Best Practices Guide. Version 3.0. NVIDIA, 2010.
[29] NVIDIA. NVIDIA CUDA C, Programming guide. Version 3.1.1. NVIDIA, 2010.
[30] NVIDIA. NVIDIA CUDA C, Programming guide. Version 4.0. NVIDIA, 2011.
[31] NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. Jan. 30, 2012. url:
http://www.nvidia.co.uk/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[32] Lars Nyland, Mark Harris, and Jan Prins. “Fast N-Body Simulation with CUDA”. In: GPU Gems 3, p.
667 (2007).
[33] David A. Patterson and John L. Hennessy. Computer Organization and Design. 4th. Morgan Kaufmann
publisher, 2009.
[34] D.P. Playne, M.G.B. Johnson, and K.A. Hawick. “Benchmarking GPU Devices with N-Body Simulations”.
In: (2009).
[35] Jason Sanders and Edward Kandrot. CUDA By Example, An Introduction to General-Purpose GPU
Programming. First printing, July 2010. Addison-Wesley, 2010.
[36] Abraham Silberschatz, Peter Galvin, and Greg Gagne. Applied Operating System Concepts. First edition.
John Wiley and Sons, Inc., 2000.
[37] David B. Skillicorn and Domenico Talia. “Models and Languages for Parallel Computation”. In: ACM
Computing Surveys 30 (1998).
[38] Morten M. Sternheim and Joseph W. Kane. General Physics. Second. Wiley, 1991.
[39] William C Swope et al. “A computer simulation method for the calculation of equilibrium constants
for the formation of physical clusters of molecules: Application to small water clusters”. In: the Journal
of Chemical Physics 76 (1982).
[40] Loup Verlet. “Computer Experiments on Classical Fluids. I. Thermodynamical Properties of
Lennard-Jones Molecules”. In: Physical Review 30 (1967).
11 Appendix A: Experimental setup
In this chapter we will give a brief overview of the software and hardware resources we have used. These
have been used for development, testing and experiments during our project.
11.1 Software used
Operating system and environment
The CUDA runtime and SDK are available for PCs running a version of a Linux-, OSX-, or Windows-based
operating system that supports NVIDIA's driver [35]. For the duration of our project we have used CUDA
on Linux-based setups.
The Linux setups used for development of all programs, and for running CUDA programs, are all based on
a 64-bit (AMD64) version of the Linux kernel v2.6.32 and v4.4.5 of the GNU toolchain. The GNU toolchain
consists of the GNU compiler collection (GCC), the make program for automated compiling and linking,
the GNU Debugger (GDB) for debugging, and other related tools. For CUDA development, version 4.0 of
NVIDIA's CUDA Toolkit was used. This toolkit consists of a compiler, include headers for source files, the
linkable runtime library, a debugger, and associated tools.
Compiler
The NVIDIA C Compiler (nvcc) works similarly to the GCC C/C++ compiler (g++). Since source files contain
both host and device code, the compiler acts as a front-end that compiles host and device code separately [35],
combining them and the CUDA runtime into single “fat” binaries. Hence the need for the C
compiler of the GCC toolchain. For our non-CUDA programs we have used the g++ compiler alone. It
should be noted that when specifying 32 or 64-bit (the default) architecture to the nvcc compiler, it will
apply to both host and device code. This is enforced for compatibility reasons, one of which is the ability
to store pointers. Other compiler parameters that affect code generation are the optimization and debug
parameters. For tests and experiments we have used the -O3 parameter for level 3 (maximum) optimization
on both host and device code. For debugging sessions we have used the -g (for host) and -G
(for device) parameters. Since CUDA capability determines if certain functionality is supported, the nvcc
compiler accepts a parameter that defines the architecture. The -arch sm_20 parameter is used to indicate that
the code uses functionality supported by the 2.0 architecture.
As a side note, the nvcc compiler has some interesting parameters for automatic optimization using the
faster CUDA instructions. The --prec-sqrt parameter, for instance, can be used to turn on or off a faster
approximation of square root calculations. The -use_fast_math parameter turns this and many other faster
approximations on. The --maxrregcount parameter can be used to control the maximum number of registers a GPU
kernel function can use, which when increased can increase per-thread performance, but in turn limits the
number of threads per block. We do not use these parameters, however. All of our source code in the project
97
11.2. Hardware resources
comes with appropriate make-files with the applicable parameters, see 15.
Debugging
Debugging was done using cuda-gdb, a version of GDB extended to support device as well as host code
debugging. Debugging CUDA threads is a different experience from debugging multithreaded CPU
programs, as not all threads may be alive when a bug occurs or a breakpoint is reached. Likewise, kernel
calls or calls to the CUDA runtime that fail may have failed due to errors in an earlier kernel launch.
During our project, errors of this type were mostly caused by race conditions or by addressing the wrong
memory. For these cases we first used the cuda-memcheck program to determine which kernel, block and
thread caused the problem, or to verify that the bug was no longer there.
11.2 Hardware resources
The computers used for development and CUDA tests are part of a cluster hosted and used by IMFUFA,
a research group at the NSM institute at RU. We have used the test machines bead30 and i055, which are
usually not part of the job queue system. They both have an AMD Phenom 9950 CPU with 4 cores and
4 GB of RAM. The first one has a Tesla C1060 with a GPU of compute capability 1.3, which we have used for
all of our n-body all-pairs simulation implementations in CUDA. The latter has two GeForce GTX 480 cards
of compute capability 2.0, of which we only use one for our second case, the CUDA version of the
Barnes-Hut optimized n-body simulation.
We have opted not to use the same machines to benchmark the parallel CPU programs that we compare
with the CUDA implementations, as they only feature a single quad-core CPU. Instead we used
a server at CBIT, alvin, with 4 quad-core CPUs for a total of 16 cores. The server has 32 GB of RAM, and
each CPU is a quad-core Intel Xeon E7430 with 12 MB L3 cache. To ensure that the maximum amount of
resources was available to us, we made sure to only run the parallel CPU programs when the operating
system indicated that no other users or resource-intensive processes were active.
11.3 Utility programs
We have created some utility code and small programs to help handle loading, saving, creating and
inspecting our n-body simulation data. All code is included in the appendix (chapter 15).
11.3.1 File format
Since we have created several programs and versions of programs for performing n-body simulations, we
have defined a very simple file format so that simulations can more easily be performed on the same source
data. The file format is plain text and starts with a configuration line where the number of bodies (n), the
number of timesteps (t) and the size of a timestep (dt) are defined. The values are parsed as two
text-formatted ints and a float respectively, separated by spaces. Only the n value is required to be
correct; t and dt may be dummy values or may be used to configure default simulation parameters.
Following this, n lines containing the position coordinates, velocity vector coordinates and mass of each
body are expected. These values are represented as 7 space-separated formatted floats.
    # Example n-body data file
    # n t dt
    3 100 0.001
    0.37401 0.347141 0.315877 0.0173262 0.0272703 0.0155869 0.333333
    0.283956 0.044704 0.510042 0.00174227 -0.00652188 -0.0331298 0.333333
    0.65741 -0.593354 -0.241873 -0.0155045 -0.028615 -0.0258132 0.333333
In the above example we see two comment lines, which are skipped when parsing the file. The configuration
line specifies n, t, and dt to be 3, 100 and 0.001 respectively, and is followed by lines containing the 3 bodies.
We would already now like to note that this is not an optimal solution. The initial idea was to make it
possible to easily inspect file data and to create the files by hand. For larger sets of simulation data,
however, the files become large and slow to load and parse. An improvement could be a binary file format
as opposed to one composed of formatted strings.
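The file format described above can be read with a few lines of code. The following is a minimal sketch, not the NbodyInput class from our source code: the Bodies struct and the function names nextDataLine and parseBodies are hypothetical and only used for illustration, and we assume that comment lines always start with a # character.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for illustration; not the layout used by NbodyInput.
struct Bodies {
    int n = 0;        // number of bodies
    int t = 0;        // number of timesteps
    float dt = 0.0f;  // size of a timestep
    std::vector<float> px, py, pz, vx, vy, vz, m;
};

// Fetches the next line that is neither empty nor a '#' comment (assumed syntax).
static bool nextDataLine(std::istream& in, std::string& line) {
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] != '#') return true;
    }
    return false;
}

// Parses the configuration line "n t dt" followed by n lines of 7 floats:
// position (3), velocity (3) and mass. Returns false on malformed input.
bool parseBodies(std::istream& in, Bodies& b) {
    std::string line;
    if (!nextDataLine(in, line)) return false;
    std::istringstream cfg(line);
    if (!(cfg >> b.n >> b.t >> b.dt) || b.n < 0) return false;
    for (int i = 0; i < b.n; i++) {
        if (!nextDataLine(in, line)) return false;
        std::istringstream body(line);
        float v[7];
        for (float& x : v) {
            if (!(body >> x)) return false;
        }
        b.px.push_back(v[0]); b.py.push_back(v[1]); b.pz.push_back(v[2]);
        b.vx.push_back(v[3]); b.vy.push_back(v[4]); b.vz.push_back(v[5]);
        b.m.push_back(v[6]);
    }
    return true;
}
```

Passing a std::ifstream opened on a data file to parseBodies fills the struct. Parsing formatted text line by line like this is the cost that a binary format would avoid for large data sets.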
11.3.2 Input and output
Input and output of our programs is done using the NbodyInput and NbodyOutput C++ (.h/.cpp) classes
we have created, or the input.c and output.c files, each containing a C function for the same purpose. They
support the file format described above, and read and write lines of text-formatted variables one by one.
11.3.3 Plummer galaxy generator
In order to create input data files representing Plummer model galaxies we have written a small program.
The program is called PlummerModel and follows the recipe presented in section 6.4.1, with one
modification: we truncate the galaxy at radius one, so the distribution is somewhat
skewed. This means that the energy of the system is not -0.25, which it is for a usual Plummer model
galaxy[22], but approximately -0.67, as the potential energy is larger in magnitude in the denser cluster.
The Plummer model program requires 3 parameters: the size of the galaxy, the requested output filename
of the galaxy and a random seed used when creating the random numbers in the algorithm. The code can
be found in the appendix (chapter 15).
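To illustrate the truncation, the sketch below draws a radius from the standard Plummer mass profile with scale length 1 and simply redraws whenever the radius exceeds the cut-off. This is an assumption made for illustration: the function name truncatedPlummerRadius is hypothetical, and the PlummerModel program may scale the radii or enforce the truncation differently.

```cpp
#include <cmath>
#include <random>

// Draws one radius by inverting the cumulative mass profile of a Plummer
// sphere with scale length 1 (an assumed normalization), and redraws until
// the radius is at most rmax. The rejection step is an assumed way of
// truncating the galaxy, not necessarily the one used in PlummerModel.
double truncatedPlummerRadius(std::mt19937& rng, double rmax = 1.0) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    for (;;) {
        double x = uniform(rng);  // enclosed mass fraction in [0, 1)
        if (x <= 0.0) continue;   // x = 0 would make pow() below infinite
        double r = 1.0 / std::sqrt(std::pow(x, -2.0 / 3.0) - 1.0);
        if (r <= rmax) return r;  // infinite or too large radii are rejected
    }
}
```

With these assumptions roughly 35% of the draws fall inside radius one, so the loop terminates quickly. Discarding the outer bodies concentrates the mass, which is in line with the lower system energy noted above.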
11.3.4 Energy calculator
The energy.c file contains a function for calculating the energy contained in an n-body system. The energy
function is an implementation of the energy equation (equation 6.2) for a conservative system found in
section 6.1. In a conservative system the energy stays the same. We know that our galaxy has an energy
of approximately -0.67 when created. After a simulation we use this function to look at the energy of
the system again, which gives us an indication of the quality of the calculation. If the energy stays close
to its initial value, we are inclined to accept the simulation. If the energy changes significantly during a
reasonably configured simulation, we know that something is potentially wrong. We have used this
function extensively during debugging to determine whether changes in simulation parameters,
integration method, algorithms or parallelism have introduced errors. Usually, small variations where
none were expected have been caused by floating-point related rounding or truncation errors. The code
can be found through the appendix (see chapter 15).
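The energy check can be sketched as follows. We assume here that the energy equation has the usual form for a gravitational system, the sum of the kinetic energies minus the pairwise potential terms, with G = 1 as in our simulations. The names systemEnergy and relativeEnergyDrift are hypothetical and this is not the computeEnergy function from energy.c.

```cpp
#include <cmath>

// Total energy of an n-body system: kinetic energy plus gravitational
// potential energy. G = 1 is assumed. Sums are accumulated in double
// precision to keep rounding errors in the check itself small.
double systemEnergy(int n, const float* px, const float* py, const float* pz,
                    const float* vx, const float* vy, const float* vz,
                    const float* m) {
    const double G = 1.0;
    double kinetic = 0.0, potential = 0.0;
    for (int i = 0; i < n; i++) {
        double v2 = (double)vx[i] * vx[i] + (double)vy[i] * vy[i]
                  + (double)vz[i] * vz[i];
        kinetic += 0.5 * m[i] * v2;
        for (int j = i + 1; j < n; j++) {  // each pair is counted once
            double dx = (double)px[i] - px[j];
            double dy = (double)py[i] - py[j];
            double dz = (double)pz[i] - pz[j];
            potential -= G * m[i] * m[j] / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return kinetic + potential;
}

// Relative change in energy between the start and the end of a simulation.
double relativeEnergyDrift(double before, double after) {
    return std::fabs((after - before) / before);
}
```

A drift close to zero suggests that the simulation conserved energy. What counts as "close" depends on the timestep and the integration method, so no fixed threshold is implied here.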
11.3.5 Bitmap creator
We have created a small C++ program for visualising the output data of the n-body simulations. We use
it to inspect the effects of the simulation and to produce images for this report. It takes as input the
filename of a data file in the format described above, loads the n bodies, and scales and plots the x and
y coordinates, within a maximum of ±2.0, into a 1000 by 1000 bitmap. Each star is presented as a single
white pixel. The bitmap creator uses v1.06 of the open source EasyBMP Cross-Platform Windows Bitmap
Library (found at http://easybmp.sourceforge.net) to output standard bitmap (.bmp) files. Sample output
is shown in figure 11.1.
Figure 11.1 These 4 images show a galaxy of 1024 stars and how it evolves over a time span of 1.2t, with
0.3t between each image.
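The scaling step can be sketched as below. The function names toPixel and plotBodies are hypothetical, the image is kept as a plain byte array rather than an EasyBMP object, and both the rounding and the flipped y axis are assumptions, so this is an illustration of the mapping and not a copy of our bitmap creator.

```cpp
#include <cstddef>
#include <vector>

// Maps a coordinate in [-limit, limit] to a pixel index in [0, size - 1].
// Returns -1 for coordinates outside the plotted range.
int toPixel(float c, float limit = 2.0f, int size = 1000) {
    if (c < -limit || c > limit) return -1;
    return (int)((c + limit) / (2.0f * limit) * (size - 1) + 0.5f);
}

// Plots each body as one white pixel (255) on a black (0) background.
// The image is stored row by row; the y axis is flipped so that positive
// y points upwards in the image (an assumed convention).
std::vector<unsigned char> plotBodies(const std::vector<float>& px,
                                      const std::vector<float>& py,
                                      int size = 1000) {
    std::vector<unsigned char> image((std::size_t)size * size, 0);
    for (std::size_t i = 0; i < px.size() && i < py.size(); i++) {
        int col = toPixel(px[i], 2.0f, size);
        int row = toPixel(-py[i], 2.0f, size);
        if (col >= 0 && row >= 0) image[(std::size_t)row * size + col] = 255;
    }
    return image;
}
```

Bodies outside the ±2.0 window are skipped rather than clamped to the border, so stars that escape the galaxy do not pile up along the edges of the image.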
12 Appendix B: Floating point precision and performance
The nature of floating point numbers (floats) and issues with floating point arithmetic are presented in
this appendix chapter. The reason for including this is twofold. Firstly, GPUs sometimes use special
hardware-accelerated operations to speed up calculations, which may affect floating point precision.
Secondly, the problem referred to as floating point rounding errors is most pronounced with the use of
single precision floats. Although this is not a problem that relates to parallel programming as such, single
precision floats are often used in GPU computing, because some GPUs do not support double precision
floating point values (doubles) at all, or only at a lower speed than that of floats. In this chapter we focus
on communicating some of the problems that arise with floats, as they might cause various inaccuracies in
the kind of simulations we are performing.
A floating point value is represented in a computer using a binary word. The float is constructed
by dividing the binary word into 3 parts:
• the sign (S), which represents whether the number is negative or positive.
• the mantissa (M), also called the precision, as it determines the precision of the float.
• the exponent (E), which determines the size of the number.
The normalized float representation v of a decimal number is in theory found from the quantities
presented above using formula (12.1), where n is the number of bits in E:
$$v = (-1)^{S} \times 1.M \times 2^{E - 2^{(n-1)} + 1} \tag{12.1}$$
Of course not every rational number can be represented in this way, as there will be “gaps” between the
numbers that equation 12.1 can produce. It can also be seen that a given float has a maximum and a
minimum magnitude. The larger the word, the larger the numbers the float can represent, as
the exponent can be larger. A small value of E gives a finer spacing between neighbouring values and
thereby a higher precision, whereas a larger E makes the float less precise. This tendency is exemplified in
figure 12.1. The figure shows the numeric range of a 5 bit float (the sign is 0, ergo a positive number) with
a 2 bit stored mantissa (3 significant bits when the implicit leading 1 is counted) and a 2 bit exponent.
Note that the resolution is better closer to zero and that there is no number larger than 7 [23, p. 134].
Figure 12.1 The numeric range of a 5 bit normalized float.
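The range in figure 12.1 can be reproduced by evaluating formula (12.1) directly. The sketch below assumes 2 exponent bits and 2 stored mantissa bits (3 significant bits once the implicit leading 1 is counted); the function name normalizedValue is hypothetical.

```cpp
#include <cmath>

// Value of a normalized float according to formula (12.1):
//   v = (-1)^S * 1.M * 2^(E - 2^(n-1) + 1)
// where expBits is n and manBits is the number of stored mantissa bits.
double normalizedValue(int sign, int E, int M, int expBits, int manBits) {
    double significand = 1.0 + (double)M / (1 << manBits);  // 1.M
    int exponent = E - (1 << (expBits - 1)) + 1;            // E - 2^(n-1) + 1
    double v = std::ldexp(significand, exponent);           // significand * 2^exponent
    return sign ? -v : v;
}
```

With these bit widths the smallest value is normalizedValue(0, 0, 0, 2, 2) = 0.5 and the largest is normalizedValue(0, 3, 3, 2, 2) = 7. Neighbouring values are 0.125 apart when E = 0 but a full 1.0 apart when E = 3, which is the loss of resolution for larger exponents described above.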
The unit in the last place (ULP), the weight of the least significant bit of the mantissa, is a term used to
indicate the magnitude of the error when dealing with floats. Sometimes when rounding or performing
arithmetic operations on floats, one or more of the last bits in a float's mantissa are disregarded due to the
nature of the operation. These disregarded bits are what is counted in ULPs when speaking of the error
of a float. Rounding a given decimal number to the nearest float is one obvious source of error, where the
rounding error can be up to half an ULP [33]. Arithmetic operations performed on floats introduce another
problem that concerns the ULP. For example, when adding or subtracting two floats that have different
exponents, the mantissa of the number with the smaller exponent is right shifted until the exponents
match. The numbers are then added, but as there are now more bits than the format allows, the final
number is rounded to fit the size of the original float, possibly removing a couple of ULPs from the float
with the smaller exponent. This can result in the smaller number getting virtually eaten by the larger one.
The error introduced by other more complex operations could possibly be even larger [23, p. 135+].
Figure 12.2 The construction of a single precision float of the IEEE format. (Source: http://phimuemue.com)
The rounding errors also mean that it matters in which order operations are performed, and in which
order numbers are arranged, even though mathematically the results should be the same. In parallel
computing, the results from the same problem might show small variations because floating point
operations are done in a slightly different order. Trying to keep operations between numbers with the
same order of E can greatly increase accuracy [23, p. 137].
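The effect of the order of operations can be demonstrated with three numbers. The sketch below is only an illustration and the function name sumInOrder is hypothetical. It sums single precision values strictly from left to right, using a volatile accumulator so that every intermediate result is rounded to a float.

```cpp
#include <vector>

// Sums the values from left to right in single precision.
// The accumulator is volatile so that each intermediate sum is stored,
// and thereby rounded, as a 32-bit float.
float sumInOrder(const std::vector<float>& values) {
    volatile float sum = 0.0f;
    for (float v : values) {
        sum = sum + v;
    }
    return sum;
}
```

Summing {1.0f, 1e8f, -1e8f} gives 0, because 1.0 is smaller than the spacing between floats near 1e8 and is eaten by the larger number. Summing the same values as {1e8f, -1e8f, 1.0f} gives 1, since the two large numbers cancel before the small one is added.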
The current standard encoding for floating point numbers is the IEEE 754 encoding. The encoding for a
single precision float uses a 32-bit word that has a 1-bit S, an 8-bit E and a 23-bit M (see figure 12.2), while
a double precision float uses a 64-bit word that has a 1-bit S, an 11-bit E and a 52-bit M [29]. As a double
has 29 more bits for the mantissa, the round-off error is reduced to 1/2^29 of that of a single. The IEEE
encoding defines that a float for which all the exponent bits are one and the mantissa is zero represents
infinity. If the mantissa is not zero while the exponent is all ones, the composition represents “not a
number” (NaN).
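The single precision layout can be inspected by copying the bits of a float into an integer and masking out the three fields. The following sketch assumes that float is the 32-bit IEEE 754 type, as it is on the machines we used; the struct and function names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits (E)
    std::uint32_t mantissa;  // 23 bits (M)
};

// Splits a 32-bit IEEE 754 float into its S, E and M fields.
FloatFields splitFloat(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined way to read the bit pattern
    FloatFields p = { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
    return p;
}

// Rebuilds the value with formula (12.1) for n = 8. Only valid for
// normalized numbers, i.e. when E is neither 0 nor 255.
double rebuild(const FloatFields& p) {
    double significand = 1.0 + p.mantissa / 8388608.0;  // 1.M, 8388608 = 2^23
    double v = std::ldexp(significand, (int)p.exponent - 128 + 1);
    return p.sign ? -v : v;
}

// Exponent all ones and mantissa zero: infinity.
bool isInfinity(const FloatFields& p) { return p.exponent == 0xFFu && p.mantissa == 0; }

// Exponent all ones and mantissa non-zero: not a number.
bool isNaN(const FloatFields& p) { return p.exponent == 0xFFu && p.mantissa != 0; }
```

As an example, -6.5 equals -1.625 × 2^2 and is therefore stored with S = 1, E = 129 and M = 0x500000; rebuild returns -6.5 for these fields.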
The CUDA architecture performs floating point operations in differentiated hardware units, specially
designed for those types of operations. Common floating point operations done within the normal FPUs
are: multiplication, multiply-add (MAD), minimum, maximum, compare and conversion between
float and integer. Special function units (SFUs) are hardware units designed for specific operations. These
units are typically faster than using regular instructions, but in some cases have lower accuracy. Floating
point instructions executed on SFUs are: sine, cosine, binary logarithm and exponential, reciprocal, and
reciprocal square root, as well as interpolation functions [33].
While floating point precision has been an issue in earlier versions of the hardware, the Fermi
architecture is more precise[31][23, p. 136], with the precision of a normal CPU. The exact deviations from
the IEEE 754 standard in CUDA (depending on version) can be found in the programming guide [30, p. 159].
GPUs with CUDA capability 1.3 were the first to support doubles, but are exceptionally slow at handling
double precision arithmetic (8× slower than single precision). According to NVIDIA, GPUs with capability
2.0 handle doubles faster, with a maximum speed that is approximately half of the maximum single
precision performance[31].
13 Appendix C: CUDA compute capability
Figure 13.1 Features that depend on compute capability. (CUDA programming guide 4.0 [30])
Figure 13.2 Technical specifications that depend on compute capability. (CUDA programming guide 4.0 [30])
Figure 13.3 Technical specifications that depend on compute capability. (CUDA programming guide 4.0 [30])
14 Appendix D: Glossary
Block A block consists of a number of threads performing a kernel function on a streaming multiprocessor.
CoM Center of Mass. An approximated positional vector representing the center of the masses of a given
internal node in the Barnes-Hut octree.
CPU The Central Processing Unit of a computer.
CUDA Compute Unified Device Architecture, NVIDIA's computing platform for programming GPUs.
Device In the context of this project, a CUDA GPU, or GPU code.
Divergence In the context of CUDA, divergence is when different code paths are taken by threads in a
warp due to conditional statements in the source code. Causes parts of the kernel to be executed
sequentially.
Embarrassingly parallel A problem that, when parallelized, leaves each process with no dependency on
the other processes.
Fermi With regards to CUDA, the codename for an NVIDIA GPU architecture of capability v2.0.
FPU Floating point unit (of a processor).
GPGPU General-purpose computing on GPUs
GPU The Graphics Processing Unit, a term coined by NVIDIA for their graphics processors, also used
generally.
Grid A grid consists of all blocks in a kernel.
Host In the context of CUDA and this project, a computer, or non-GPU computer code.
Kernel In the context of GPGPU, a function executing on the GPU.
n-body problem A field within physics where n particles all affect each other through some force.
Octree An octree is a tree data structure where each internal node has eight children.
Pipeline A number of processes that in series conduct operations on data, so that the output of one process
is the input of the next one.
Pragma A directive used in programming, providing additional information to the compiler.
Quadtree A quadtree is a tree data structure where each internal node has four children.
R/W hazard Read / Write hazard. A concurrency problem that occurs when 2 threads or more try to
change a word in the same memory location simultaneously.
SFU Special function unit. A special part of an SM for accelerated mathematical and special functions.
SM Streaming multiprocessor consisting of several SPs.
SP Streaming processor. SPs are grouped in SMs.
Tesla NVIDIA's high-end GPU series, based on various architectures and of different CUDA capability
versions.
ULP Units in Last Place. The number of ULPs that is “lost” in a floating point operation determines the
error of the function.
Warp 32 threads of a block executing concurrently on an SM. See also divergence.
Word A group of bits seen as one unit.
15 Appendix E: Source code
We have included the entire source code and all related files in a zip archive (digital library version) and
on a DVD attached to the printed version. Most source code files have also been printed out as an external
appendix.
Below is a listing of the contents of the entire code directory of our project repository:
• barnes-hut: Disregard. Discarded C++ Barnes-Hut implementation.
• barnes-hut-c: Barnes-Hut C implementation.
• barnes-hut-cuda: Barnes-Hut CUDA implementation.
• barnes-hut-c-openmp: Barnes-Hut parallel OpenMP implementation.
• barnes-hut-openmp: Disregard. Discarded C++ OpenMP Barnes-Hut implementation.
• galaxy: Test data used for all simulation benchmarks (1d, 2d, .. 10d. IIsort512k is the sorted galaxy).
• libs: EasyBMP library for bitmap creation program.
• misc: Disregard. Miscellaneous files.
• nbody-cuda: First n-body all-pairs implementation in CUDA.
• nbody-openmp: n-body all-pairs parallel OpenMP implementation.
• nbody-seq: n-body all-pairs sequential C implementation.
• nbodydata: Various utilities and helper files. See Appendix A in chapter 11.
• opt-cuda-nbody: Optimized n-body all-pairs CUDA implementations (in subdirectories float4, tiling,
loop, verlet).
The directory for each program contains a makefile for compiling the program with GCC using the GNU
make command. The directories may also contain additional unneeded files that should be disregarded.
Appendix E: Source Code
Truls Mosegaard
Jon Nielsen
Supervisor: Keld Helsgaun
Master Thesis. September 2011 - February 2012.
April 7, 2012
Computer Science at CBIT
Building 43.2
Roskilde Universitet 2012

Table of contents
1 Printed source code
2 All-pairs, sequential, C++
3 All-pairs, OpenMP
4 All-pairs, CUDA, simple
5 All-pairs, CUDA, float4
6 All-pairs, CUDA, tiling
7 All-pairs, CUDA, loop optimizations
8 All-pairs, CUDA, velocity Verlet
9 Barnes-Hut, Sequential, C
10 Barnes-Hut, OpenMP
11 Discarded sorting function, Barnes-Hut, OpenMP
12 Barnes-Hut, CUDA
13 Utility, input, C
14 Utility, input, C++
15 Utility, output, C
16 Utility, output, C++
17 Utility, Plummer galaxy creator
18 Utility, bitmap generator
19 Utility, energy calculator
1 Printed source code
This extra appendix is a printout of the source code referenced in the report, for easy reference. We have
used the listings package from LaTeX in order to print out with syntax highlighting and line numbers. Each
program has one chapter in this appendix, starting with the all-pairs programs, followed by the Barnes-Hut
programs and finally our utility code.
2 All-pairs, sequential, C++
/*
Sequential version of the all-pairs n-body simulation using Euler integration
By Jon & Truls
*/

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <cstdlib>
#include <math.h>
#include "NbodyInput.h"
#include "NbodyOutput.h"
#include "NBUtils.h"
#include "benchmark.c"
#include "energy.c"

using namespace std;

// simulation parameters
#define G 1.0f
#define EPSILON 0.0000001f
#define DEBUG false
int n = 0;
int t = 0;
float dt = 0.0f;
float r = 0.0f;

// body value arrays (position, velocity, acceleration, mass)
float *px, *py, *pz;
float *vx, *vy, *vz;
float *ax, *ay, *az;
float *m;

// helper functions
//bool readBodies(char *fn);
void printBodies(void);
// simulation functions
void simulate(void);
inline float dist2(float x1, float y1, float z1, float x2, float y2, float z2);

int main(int argc, char *argv[])
{
    cout << "N-body simulator, sequential version." << endl;
    if (argc >= 2) //filename required as a minimum
    {
        stringstream ssin;
        ssin << argv[1];
        string inputFileName = ssin.str();
        NbodyInput in(inputFileName);
        in.debug = DEBUG;
        if (in.read())
        {
            cout << "Input-file " << inputFileName << " successfully loaded." << endl;
            px = in.px; py = in.py; pz = in.pz;
            vx = in.vx; vy = in.vy; vz = in.vz;
            m = in.m;
            n = in.n;
            r = in.getMaxCoordinate();
            if (argc >= 4) //get t, dt parameters from argv
            {
                t = (int) strtol(argv[2], NULL, 10);
                dt = (float) strtod(argv[3], NULL);
            }
            else
            {
                t = in.t;
                dt = in.dt;
            }
            ax = (float *) malloc(n * sizeof(float));
            ay = (float *) malloc(n * sizeof(float));
            az = (float *) malloc(n * sizeof(float));
            for (int i = 0; i < n; i++)
            {
                ax[i] = 0.0f; ay[i] = 0.0f; az[i] = 0.0f;
            }
            cout << "n: " << n << " t: " << t << " dt: " << dt << " r: " << r << endl;
        }
        else
        {
            cout << "Error reading input-file " << inputFileName << ", exiting." << endl;
            exit(-1);
        }

        if (DEBUG)
            printBodies();
        cout << "*** SIMULATION STARTS ***" << endl;
        struct timeval begin, end;
        gettimeofday(&begin, NULL); //begin stopwatch
        for (int steps = 0; steps < t; steps++)
        {
            simulate();
        }
        gettimeofday(&end, NULL); //end stopwatch
        long elapsed = timevaldiff(&begin, &end);
        cout << "*** SIMULATION ENDS ***" << endl;
        if (DEBUG)
            printBodies(); // print results
        cout << "Time elapsed: " << elapsed << "ms" << endl;
        if (DEBUG)
            cout << "System energy after simulation: " << computeEnergy(n, px, py, pz, vx, vy, vz, m) << endl;
        if (argc == 5) //write output file
        {
            stringstream ssout;
            ssout << argv[4];
            string outputFileName = ssout.str();
            NbodyOutput out(n, t, dt);
            //out.addBodies(n, px, py, pz, vx, vy, vz, m);
            out.addBodies(n, px, py, pz, vx, vy, vz, m, ax, ay, az);
            out.addTimestamp();
            string sscomment("Simulation output from nbody-seq");
            out.addComment(sscomment);
            if (out.writeBodyFile(outputFileName))
            {
                cout << "Output-file " << outputFileName << " successfully written." << endl;
            }
            else
            {
                cout << "Error writing output-file " << outputFileName << ", exiting." << endl;
                exit(-1);
            }
        }
        free(ax); free(ay); free(az); //the other arrays are freed by input object being destroyed
    }
    else
    {
        cout << "Incorrect number of parameters supplied. To run: " << endl;
        cout << argv[0] <<
            " <input-file> [<timesteps> <size of timestep>] [<output-file>]"
            << endl;
        return 1;
    }
    return 0;
}


void printBodies(void)
{
    cout << "Printing system of bodies:" << endl;
    cout << "n: " << n << endl;
    cout << "r: " << r << endl;
    for (int i = 0; i < n; i++)
    {
        cout << "body " << i << ": pos=(" << px[i] << ", " << py[i] << ", " <<
            pz[i] << ") vel=(" << vx[i] << ", " << vy[i] << ", " << vz[i] <<
            ") m=" << m[i] << endl;
    }
}

void simulate(void)
{
    for (int i = 0; i < n; i++) //for each body
    {
        for (int j = i; j < n; j++) //increment acceleration for both bodies in pairs
        {
            float rtemp = EPSILON + dist2(px[i], py[i], pz[i], px[j], py[j], pz[j]);
            float temp = G / sqrtf(rtemp * rtemp * rtemp);
            ax[i] += m[j] * temp * (px[j] - px[i]);
            ay[i] += m[j] * temp * (py[j] - py[i]);
            az[i] += m[j] * temp * (pz[j] - pz[i]);
            ax[j] += m[i] * temp * (-(px[j] - px[i]));
            ay[j] += m[i] * temp * (-(py[j] - py[i]));
            az[j] += m[i] * temp * (-(pz[j] - pz[i]));
        } // update velocity using euler
        vx[i] += ax[i] * dt;
        vy[i] += ay[i] * dt;
        vz[i] += az[i] * dt;
    }
    for (int i = 0; i < n; i++) // update position using euler
    {
        px[i] += dt * (vx[i]);
        py[i] += dt * (vy[i]);
        pz[i] += dt * (vz[i]);
        ax[i] = 0.0f; ay[i] = 0.0f; az[i] = 0.0f; // reset acceleration
    }
}

// returns the squared distance; the caller cubes it and takes the square root
inline float dist2(float x1, float y1, float z1, float x2, float y2, float z2)
{
    return ((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) +
        (z1 - z2) * (z1 - z2));
}
3 All-pairs, OpenMP
/*
OpenMP version of the all-pairs n-body simulation using Euler integration
By Truls & Jon
*/

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <cstdlib>
#include <math.h>
#include <omp.h>
#include "NbodyInput.h"
#include "NbodyOutput.h"
#include "NBUtils.h"
#include "benchmark.c"
#include "energy.c"

using namespace std;

#define G 1.0f
#define EPSILON 0.0000001f
#define DEBUG false
// simulation parameters
int n = 0;
int t = 0;
float dt = 0.0f;
float r = 0.0f; //radius of system
int threads = 0;

// body value arrays (position, velocity, mass, acceleration)
float *px, *py, *pz;
float *vx, *vy, *vz;
float *m;
float *ax, *ay, *az;

// helper functions
void printBodies(int tail);
long timevaldiff(struct timeval *time1, struct timeval *time2);

// simulation functions
void simulate(void);
inline float dist2(float x1, float y1, float z1, float x2, float y2, float z2);

int main(int argc, char *argv[])
{
    #pragma omp parallel shared(threads) //pragma required to get number of threads
    {
        threads = omp_get_num_threads();
    }
    printf("N-body simulator, OpenMP version. Using %d threads.\n", threads);
    if (argc >= 2) //filename required as a minimum
    {
        stringstream ssin;
        ssin << argv[1];
        string inputFileName = ssin.str();
        NbodyInput in(inputFileName);
        in.debug = DEBUG;
        if (in.read())
        {
            cout << "Input-file " << inputFileName << " successfully loaded." << endl;
            px = in.px; py = in.py; pz = in.pz;
            vx = in.vx; vy = in.vy; vz = in.vz;
            m = in.m;
            n = in.n;
            r = in.getMaxCoordinate();
            if (argc >= 4) //get t, dt parameters from argv
            {
                t = (int) strtol(argv[2], NULL, 10);
                dt = (float) strtod(argv[3], NULL);
            }
            else //use default parameters from input file
            {
                t = in.t;
                dt = in.dt;
            }

            ax = (float *) malloc(n * sizeof(float));
            ay = (float *) malloc(n * sizeof(float));
            az = (float *) malloc(n * sizeof(float));
            for (int i = 0; i < n; i++)
            {
                ax[i] = 0.0f; ay[i] = 0.0f; az[i] = 0.0f;
            }

            cout << "n: " << n << " t: " << t << " dt: " << dt << " r: " << r << endl;
        }
        else
        {
            cout << "Error reading input-file " << inputFileName << ", exiting." << endl;
            exit(-1);
        }

        if (DEBUG)
            printBodies(20);
        cout << "*** SIMULATION STARTS ***" << endl;
        struct timeval begin, end;
        gettimeofday(&begin, NULL); //begin stopwatch
        for (int steps = 0; steps < t; steps++)
        {
            simulate();
        }
        gettimeofday(&end, NULL); //end stopwatch
        long elapsed = timevaldiff(&begin, &end);
        cout << "*** SIMULATION ENDS ***" << endl;
        if (DEBUG)
            printBodies(20); // print results
        cout << "Time elapsed: " << elapsed << "ms" << endl;
        if (DEBUG)
            cout << "System energy after simulation: " << computeEnergy(n, px, py, pz, vx, vy, vz, m) << endl;
        if (argc == 5) //write output file
        {
            stringstream ssout;
            ssout << argv[4];
            string outputFileName = ssout.str();
            NbodyOutput out(n, t, dt);
            //out.addBodies(n, px, py, pz, vx, vy, vz, m);
            out.addBodies(n, px, py, pz, vx, vy, vz, m, ax, ay, az);
            out.addTimestamp();
            string sscomment("Simulation output from nbody-omp");
            out.addComment(sscomment);
            if (out.writeBodyFile(outputFileName))
            {
                cout << "Output-file " << outputFileName << " successfully written." << endl;
            }
            else
            {
                cout << "Error writing output-file " << outputFileName << ", exiting." << endl;
                exit(-1);
            }
        }
        free(ax); free(ay); free(az); //the other arrays are freed by input object being destroyed
    }
    else
    {
        cout << "Incorrect number of parameters supplied. To run: " << endl;
        cout << argv[0] <<
            " <input-file> [<timesteps> <size of timestep>] [<output-file>]"
            << endl;
        return 1;
    }
    return 0;
}

void printBodies(int tail)
{
    int start = 0;
    if (tail > 0 && tail < n) //only print the last 'tail' bodies; guard against n < tail
        start = n - tail;
    cout << "Printing system of bodies:" << endl;
    cout << "n: " << n << endl;
    cout << "r: " << r << endl;
    for (int i = start; i < n; i++)
    {
        cout << "body " << i << ": pos=(" << px[i] << ", " << py[i] << ", " <<
            pz[i] << ") vel=(" << vx[i] << ", " << vy[i] << ", " << vz[i] << ") m=" << m[i]
            << endl;
    }
}

// returns the squared distance; the caller cubes it and takes the square root
inline float dist2(float x1, float y1, float z1, float x2, float y2, float z2)
{
    return ((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) +
        (z1 - z2) * (z1 - z2));
}

void simulate(void)
{
    #pragma omp parallel for default(none) shared(n, dt, px, py, pz, vx, vy, vz, ax, ay, az, m)
    for (int tk = 0; tk < n; tk++)
    {
        ax[tk] = 0.0f; ay[tk] = 0.0f; az[tk] = 0.0f;
        for (int j = 0; j < n; j++)
        {
            float rtemp = EPSILON + dist2(px[tk], py[tk], pz[tk], px[j], py[j], pz[j]);
            float temp = G * m[j] / sqrtf(rtemp * rtemp * rtemp);
            ax[tk] += temp * (px[j] - px[tk]);
            ay[tk] += temp * (py[j] - py[tk]);
            az[tk] += temp * (pz[j] - pz[tk]);
        }

        vx[tk] += dt * ax[tk];
        vy[tk] += dt * ay[tk];
        vz[tk] += dt * az[tk];
    }
    #pragma omp barrier
    #pragma omp parallel for default(none) shared(n, dt, px, py, pz, vx, vy, vz, m)
    for (int tk = 0; tk <= n; tk++)
    {
        if (tk < n) //update position
        {
            px[tk] += dt * (vx[tk]);
            py[tk] += dt * (vy[tk]);
            pz[tk] += dt * (vz[tk]);
        }
    }
    #pragma omp barrier
}
4 All-pairs, CUDA, simple
/*
CUDA version of the all-pairs n-body simulation using Euler integration
First attempt, naive implementation.
By Jon and Truls
*/

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <cstdio>   // printf
#include <cstdlib>
#include <math.h>
#include "input.c"
#include "output.c"
#include "energy.c"

using namespace std;

// CUDA includes
#include "cuda.h"
#include "book.h"

#define G 1.0f
#define EPSILON 0.0000001f
#define DEBUG false

// simulation parameters
int blocks = 512, threads = 384;
int n = 0;
int t = 0;
float dt = 0.0f;
float r = 0.0f; // radius of system, may be needed later

// body value arrays (position, velocity, mass)
float *px, *py, *pz;
float *vx, *vy, *vz;
float *m;

// body value arrays for device
float *px_Dev, *py_Dev, *pz_Dev;
float *vx_Dev, *vy_Dev, *vz_Dev;
float *m_Dev;

// helper functions
//bool readBodies(char *fn);
void printBodies(int tail);

// simulation functions on device
__global__ void updateVelocity(int n, float dt, float *px, float *py, float *pz, float *vx, float *vy, float *vz, float *m);
__global__ void updatePosition(int n, float dt, float *px, float *py, float *pz, float *vx, float *vy, float *vz, float *m);
__device__ float dist2(float x1, float y1, float z1, float x2, float y2, float z2);
void initAndCopyDataToDevice(void);
void copyDataToHostAndFree(void);

int main(int argc, char *argv[])
{
    char filename[64];
    char filenameOutput[64];
    printf("N-body simulator, CUDA version.\n");
    if (argc >= 2) // filename required as a minimum
    {
        // bounded copy: an over-long path is truncated instead of overflowing the buffer
        snprintf(filename, sizeof(filename), "%s", argv[1]);
        if (1 == readinput(filename, &n, &t, &dt, &px, &py, &pz, &vx, &vy, &vz, &m))
        {
            printf("Input-file %s successfully loaded.\n", filename);
            if (argc >= 4) // override with t, dt parameters from argv
            {
                t = (int) strtol(argv[2], NULL, 10);
                dt = (float) strtod(argv[3], NULL);
            }
            if (argc >= 6) // override with blocks, threads from argv
            {
                blocks = (int) strtol(argv[4], NULL, 10);
                threads = (int) strtol(argv[5], NULL, 10);
            }
            printf("n: %d t: %d dt: %f blocks: %d threads: %d\n", n, t, dt, blocks, threads);
        }
        else
        {
            printf("Error reading input-file %s, exiting.\n", filename);
            exit(-1);
        }

        if (DEBUG)
            printBodies(20);
        printf("*** CUDA SIMULATION STARTS ***\n");
        initAndCopyDataToDevice();
        // begin timing
        cudaEvent_t start, stop;
        float elapsed = 0.0f;
        HANDLE_ERROR(cudaEventCreate(&start));
        HANDLE_ERROR(cudaEventCreate(&stop));
        HANDLE_ERROR(cudaEventRecord(start, 0));
        for (int steps = 0; steps < t; steps++)
        {
            updateVelocity<<<blocks, threads>>>(n, dt, px_Dev, py_Dev, pz_Dev, vx_Dev, vy_Dev, vz_Dev, m_Dev);
            updatePosition<<<blocks, threads>>>(n, dt, px_Dev, py_Dev, pz_Dev, vx_Dev, vy_Dev, vz_Dev, m_Dev);
        }
        HANDLE_ERROR(cudaEventRecord(stop, 0));
        HANDLE_ERROR(cudaEventSynchronize(stop));
        HANDLE_ERROR(cudaEventElapsedTime(&elapsed, start, stop));
        HANDLE_ERROR(cudaEventDestroy(start));
        HANDLE_ERROR(cudaEventDestroy(stop));
        // end timing
        copyDataToHostAndFree();
        printf("*** CUDA SIMULATION ENDS ***\n");
        if (DEBUG)
            printBodies(20); // print results
        printf("CUDA time elapsed: %f ms\n", elapsed);
        if (DEBUG)
            printf("System energy after simulation: %f\n", computeEnergy(n, px, py, pz, vx, vy, vz, m));
        if (argc == 7) // write output file
        {
            snprintf(filenameOutput, sizeof(filenameOutput), "%s", argv[6]);
            if (1 == writeoutput(filenameOutput, n, t, dt, px, py, pz, vx, vy, vz, m))
            {
                printf("Output-file %s successfully written.\n", filenameOutput);
            }
            else
            {
                printf("Error writing output-file %s, exiting.", filenameOutput);
                exit(-1);
            }
        }
        free(px); free(py); free(pz); free(vx); free(vy); free(vz); free(m);
    }
    else
    {
        printf("Incorrect number of parameters supplied. To run:\n");
        printf("%s <input-file> [<timesteps> <size of timestep>] [<blocks> <threads>] [<output-file>]\n", argv[0]);
        return 1;
    }
    return 0;
}

void printBodies(int tail)
{
    int start = 0;
    if (tail > 0 && tail < n) // print only the last `tail` bodies; never start before body 0
        start = n - tail;
    cout << "Printing system of bodies: " << endl;
    cout << "n: " << n << endl;
    cout << "r: " << r << endl;
    for (int i = start; i < n; i++)
    {
        cout << "body " << i << ": pos=(" << px[i] << ", " << py[i] << ", "
             << pz[i] << ") vel=(" << vx[i] << ", " << vy[i] << ", " << vz[i]
             << ") m=" << m[i] << endl;
    }
}

// squared distance: the callers need r^2, so the square root is deliberately skipped
__device__ float dist2(float x1, float y1, float z1, float x2, float y2, float z2)
{
    return ((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) + (z1 - z2) * (z1 - z2));
}

__global__ void updateVelocity(int n, float dt, float *px, float *py, float *pz, float *vx, float *vy, float *vz, float *m)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int numThreads = gridDim.x * blockDim.x; // total number of threads in the grid, used as loop stride
    for (int tk = tid; tk < n; tk += numThreads)
    {
        for (int j = 0; j < n; j++)
        {
            float rtemp = EPSILON + dist2(px[tk], py[tk], pz[tk], px[j], py[j], pz[j]);
            float temp = dt * G * m[j] / sqrtf(rtemp * rtemp * rtemp);
            vx[tk] += temp * (px[j] - px[tk]);
            vy[tk] += temp * (py[j] - py[tk]);
            vz[tk] += temp * (pz[j] - pz[tk]);
        }
    }
}

__global__ void updatePosition(int n, float dt, float *px, float *py, float *pz, float *vx, float *vy, float *vz, float *m)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int numThreads = gridDim.x * blockDim.x; // total number of threads in the grid, used as loop stride
    for (int tk = tid; tk < n; tk += numThreads)
    {
        px[tk] += dt * (vx[tk]);
        py[tk] += dt * (vy[tk]);
        pz[tk] += dt * (vz[tk]);
    }
}

void initAndCopyDataToDevice(void)
{
    HANDLE_ERROR(cudaMalloc((void **) &px_Dev, n * sizeof(float)));
    HANDLE_ERROR(cudaMalloc((void **) &py_Dev, n * sizeof(float)));
    HANDLE_ERROR(cudaMalloc((void **) &pz_Dev, n * sizeof(float)));
    HANDLE_ERROR(cudaMalloc((void **) &vx_Dev, n * sizeof(float)));
    HANDLE_ERROR(cudaMalloc((void **) &vy_Dev, n * sizeof(float)));
    HANDLE_ERROR(cudaMalloc((void **) &vz_Dev, n * sizeof(float)));
    HANDLE_ERROR(cudaMalloc((void **) &m_Dev, n * sizeof(float)));
    HANDLE_ERROR(cudaMemcpy(px_Dev, px, n * sizeof(float), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(py_Dev, py, n * sizeof(float), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(pz_Dev, pz, n * sizeof(float), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(vx_Dev, vx, n * sizeof(float), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(vy_Dev, vy, n * sizeof(float), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(vz_Dev, vz, n * sizeof(float), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(m_Dev, m, n * sizeof(float), cudaMemcpyHostToDevice));
}

void copyDataToHostAndFree(void)
{
    HANDLE_ERROR(cudaMemcpy(px, px_Dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaMemcpy(py, py_Dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaMemcpy(pz, pz_Dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaMemcpy(vx, vx_Dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaMemcpy(vy, vy_Dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaMemcpy(vz, vz_Dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaFree(px_Dev));
    HANDLE_ERROR(cudaFree(py_Dev));
    HANDLE_ERROR(cudaFree(pz_Dev));
    HANDLE_ERROR(cudaFree(vx_Dev));
    HANDLE_ERROR(cudaFree(vy_Dev));
    HANDLE_ERROR(cudaFree(vz_Dev));
    HANDLE_ERROR(cudaFree(m_Dev));
}
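Both kernels in this listing walk the bodies with a grid-stride loop: the thread with global index `tid` handles bodies `tid`, `tid + numThreads`, `tid + 2 * numThreads` and so on, which is why any `blocks`/`threads` combination covers all `n` bodies. The following host-side sketch emulates that index mapping in plain C++; the function names `bodiesForThread` and `coverage` are assumptions made for illustration only.

```cpp
#include <vector>

// Host-side emulation of the kernels' grid-stride loop: returns the body
// indices that the thread with global id `tid` would process.
std::vector<int> bodiesForThread(int tid, int numThreads, int n)
{
    std::vector<int> out;
    for (int tk = tid; tk < n; tk += numThreads)
        out.push_back(tk);
    return out;
}

// Counts how often each body is visited across the whole emulated grid.
std::vector<int> coverage(int blocks, int threadsPerBlock, int n)
{
    std::vector<int> hits(n, 0);
    const int numThreads = blocks * threadsPerBlock;
    for (int tid = 0; tid < numThreads; ++tid)
        for (int tk : bodiesForThread(tid, numThreads, n))
            ++hits[tk];
    return hits;
}
```

With 2 blocks of 3 threads and 10 bodies, thread 1 handles bodies 1 and 7, and every body is visited exactly once. Threads whose index is at or beyond `n` simply do no work.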
5 All-pairs, CUDA, float4
/*
CUDA implementation of the all-pairs N-body simulation using Euler integration
Optimisation: Instead of separate arrays for scalar components, arrays using the CUDA float4 type
for position & mass and velocity & padding.
By Jon and Truls
*/

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <cstdio>   // printf
#include <cstdlib>
#include <math.h>
#include "input.c"
#include "output.c"
#include "energy.c"

using namespace std;

// CUDA includes
#include "cuda.h"
#include "book.h"

#define G 1.0f
#define EPSILON 0.0000001f
#define DEBUG 0

// simulation parameters
int blocks = 512, threads = 192;
bool debug = false;
int n = 0;
int t = 0;
float dt = 0.0f;
float r = 0.0f; // radius of system, may be needed later

// body value arrays (pos + mass, velocity)
float4 *p_m;
float4 *vel;

// body value arrays for device
float4 *p_m_Dev;
float4 *vel_Dev;

// helper functions
//bool readBodies(char *fn);
void printBodies(int tail);

// simulation functions on device
__global__ void updateVelocity(int n, float dt, float4 *p_m, float4 *vel);
__global__ void updatePosition(int n, float dt, float4 *p_m, float4 *vel);
__device__ float dist2(float4 a, float4 b);
void initAndCopyDataToDevice(void);
void copyDataToHostAndFree(void);

int main(int argc, char *argv[])
{
    char filename[64];
    char filenameOutput[64];
    printf("N-body simulator, CUDA float4 version.\n");
    if (argc >= 2) // filename required as a minimum
    {
        // bounded copy: an over-long path is truncated instead of overflowing the buffer
        snprintf(filename, sizeof(filename), "%s", argv[1]);
        float *pxT = 0, *pyT = 0, *pzT = 0;
        float *vxT = 0, *vyT = 0, *vzT = 0;
        float *mT = 0;
        if (1 == readinput(filename, &n, &t, &dt, &pxT, &pyT, &pzT, &vxT, &vyT, &vzT, &mT))
        {
            printf("Input-file %s successfully loaded.\n", filename);
            if (argc >= 4) // override with t, dt parameters from argv
            {
                t = (int) strtol(argv[2], NULL, 10);
                dt = (float) strtod(argv[3], NULL);
            }
            // pack the scalar arrays into float4: xyz = position, w = mass (velocity: w = padding)
            p_m = (float4 *) malloc(n * sizeof(float4));
            vel = (float4 *) malloc(n * sizeof(float4));
            for (int i = 0; i < n; i++)
            {
                p_m[i].x = pxT[i]; p_m[i].y = pyT[i]; p_m[i].z = pzT[i]; p_m[i].w = mT[i];
                vel[i].x = vxT[i]; vel[i].y = vyT[i]; vel[i].z = vzT[i]; vel[i].w = 0.0f;
            }
            printf("n: %d t: %d dt: %f\n", n, t, dt);
        }
        else
        {
            printf("Error reading input-file %s, exiting.\n", filename);
            exit(-1);
        }

        if (DEBUG)
            printBodies(20);
        printf("*** CUDA SIMULATION STARTS ***\n");
        initAndCopyDataToDevice();
        // begin timing
        cudaEvent_t start, stop;
        float elapsed = 0.0f;
        HANDLE_ERROR(cudaEventCreate(&start));
        HANDLE_ERROR(cudaEventCreate(&stop));
        HANDLE_ERROR(cudaEventRecord(start, 0));
        for (int steps = 0; steps < t; steps++)
        {
            updateVelocity<<<blocks, threads>>>(n, dt, p_m_Dev, vel_Dev);
            updatePosition<<<blocks, threads>>>(n, dt, p_m_Dev, vel_Dev);
        }
        HANDLE_ERROR(cudaEventRecord(stop, 0));
        HANDLE_ERROR(cudaEventSynchronize(stop));
        HANDLE_ERROR(cudaEventElapsedTime(&elapsed, start, stop));
        HANDLE_ERROR(cudaEventDestroy(start));
        HANDLE_ERROR(cudaEventDestroy(stop));
        // end timing
        copyDataToHostAndFree();
        printf("*** CUDA SIMULATION ENDS ***\n");
        if (DEBUG)
            printBodies(20); // print results
        printf("CUDA time elapsed: %f ms\n", elapsed);
        // unpack the float4 arrays back into the scalar arrays
        for (int i = 0; i < n; i++)
        {
            pxT[i] = p_m[i].x; pyT[i] = p_m[i].y; pzT[i] = p_m[i].z; mT[i] = p_m[i].w;
            vxT[i] = vel[i].x; vyT[i] = vel[i].y; vzT[i] = vel[i].z;
        }
        if (DEBUG)
            printf("System energy after simulation: %f\n", computeEnergy(n, pxT, pyT, pzT, vxT, vyT, vzT, mT));
        if (argc == 5) // write output file
        {
            snprintf(filenameOutput, sizeof(filenameOutput), "%s", argv[4]);
            if (1 == writeoutput(filenameOutput, n, t, dt, pxT, pyT, pzT, vxT, vyT, vzT, mT))
            {
                printf("Output-file %s successfully written.\n", filenameOutput);
            }
            else
            {
                printf("Error writing output-file %s, exiting.", filenameOutput);
                exit(-1);
            }
        }
        free(pxT); free(pyT); free(pzT); free(vxT); free(vyT); free(vzT); free(mT);
        free(p_m); free(vel); // the packed host arrays were allocated with malloc above
    }
    else
    {
        printf("Incorrect number of parameters supplied. To run:\n");
        printf("%s <input-file> [<timesteps> <size of timestep>] [<output-file>]\n", argv[0]);
        return 1;
    }
    return 0;
}

void printBodies(int tail)
{
    int start = 0;
    if (tail > 0 && tail < n) // print only the last `tail` bodies; never start before body 0
        start = n - tail;
    cout << "Printing system of bodies: " << endl;
    cout << "n: " << n << endl;
    cout << "r: " << r << endl;
    for (int i = start; i < n; i++)
    {
        cout << "body " << i << ": pos=(" << p_m[i].x << ", " << p_m[i].y << ", "
             << p_m[i].z << ") vel=(" << vel[i].x << ", " << vel[i].y << ", " << vel[i].z
             << ") m=" << p_m[i].w << endl;
    }
}

// squared distance: the callers need r^2, so the square root is deliberately skipped
__device__ float dist2(float4 a, float4 b)
{
    return ((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) + (a.z - b.z) * (a.z - b.z));
}

__global__ void updateVelocity(int n, float dt, float4 *p_m, float4 *vel)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int threads = (blockDim.x * gridDim.x); // total number of threads in the grid, used as loop stride
    for (int k = tid; k < n; k += threads)
    {
        float4 p_mK = p_m[k];
        for (int j = 0; j < n; j++)
        {
            float4 p_mJ = p_m[j];
            float rtemp = EPSILON + dist2(p_mK, p_mJ);
            // note: vel[k] is read from and written back to global memory on every inner iteration
            float4 velK = vel[k];
            float temp = dt * G * p_mJ.w / sqrtf(rtemp * rtemp * rtemp);
            velK.x += temp * (p_mJ.x - p_mK.x);
            velK.y += temp * (p_mJ.y - p_mK.y);
            velK.z += temp * (p_mJ.z - p_mK.z);
            vel[k] = velK;
        }
    }
}

__global__ void updatePosition(int n, float dt, float4 *p_m, float4 *vel)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int threads = (blockDim.x * gridDim.x); // total number of threads in the grid, used as loop stride
    for (int k = tid; k < n; k += threads)
    {
        p_m[k].x += dt * (vel[k].x);
        p_m[k].y += dt * (vel[k].y);
        p_m[k].z += dt * (vel[k].z);
    }
}

void initAndCopyDataToDevice(void)
{
    HANDLE_ERROR(cudaMalloc((void **) &p_m_Dev, n * sizeof(float4)));
    HANDLE_ERROR(cudaMalloc((void **) &vel_Dev, n * sizeof(float4)));

    HANDLE_ERROR(cudaMemcpy(p_m_Dev, p_m, n * sizeof(float4), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(vel_Dev, vel, n * sizeof(float4), cudaMemcpyHostToDevice));
}

void copyDataToHostAndFree(void)
{
    HANDLE_ERROR(cudaMemcpy(p_m, p_m_Dev, n * sizeof(float4), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaMemcpy(vel, vel_Dev, n * sizeof(float4), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaFree(p_m_Dev));
    HANDLE_ERROR(cudaFree(vel_Dev));
}
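The float4 version packs each body into a single 16-byte element: `x`, `y`, `z` carry the position and `w` the mass, so one load fetches everything an interaction needs. The packing and unpacking done in `main` can be sketched on the host as follows. `Float4` is a stand-in for CUDA's built-in `float4` so that the sketch compiles without the CUDA toolkit, and the names `packPosMass` and `unpackPosMass` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Host-side stand-in for CUDA's built-in float4 (an assumption of this sketch).
struct Float4 { float x, y, z, w; };

// Pack separate coordinate and mass arrays into one array of structures:
// x, y, z hold the position and w holds the mass, as p_m does in the listing.
std::vector<Float4> packPosMass(const std::vector<float> &px, const std::vector<float> &py,
                                const std::vector<float> &pz, const std::vector<float> &m)
{
    std::vector<Float4> out(px.size());
    for (std::size_t i = 0; i < px.size(); ++i)
        out[i] = Float4{px[i], py[i], pz[i], m[i]};
    return out;
}

// Inverse operation, as done in main() before the output file is written.
void unpackPosMass(const std::vector<Float4> &pm, std::vector<float> &px, std::vector<float> &py,
                   std::vector<float> &pz, std::vector<float> &m)
{
    px.resize(pm.size()); py.resize(pm.size());
    pz.resize(pm.size()); m.resize(pm.size());
    for (std::size_t i = 0; i < pm.size(); ++i) {
        px[i] = pm[i].x; py[i] = pm[i].y; pz[i] = pm[i].z; m[i] = pm[i].w;
    }
}
```

Packing followed by unpacking returns the original arrays unchanged, which is the property the listing relies on when it converts back for `writeoutput`.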
6 All-pairs, CUDA, tiling
/*
Simple CUDA implementation of the all-pairs N-body simulation
Optimisation: added tiling. Builds on top of the float4 optimisation.
By Jon and Truls
*/

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <cstdio>   // printf
#include <cstdlib>
#include <math.h>
#include "input.c"
#include "output.c"
#include "energy.c"

// CUDA includes
#include "cuda.h"
#include "book.h"

using namespace std;

// simulation parameters
#define G 1.0f
#define EPSILON 0.0000001f
#define DEBUG 0
#define BLOCKS 512
#define THREADS 256 // the galaxy size n must be a whole multiple of THREADS
int n = 0;
int t = 0;
float dt = 0.0f;
float r = 0.0f; // radius of system, may be needed later

// body value arrays (pos + mass, velocity)
float4 *p_m;
float4 *vel;

// body value arrays for device
float4 *p_m_Dev;
float4 *vel_Dev;

// helper functions
//bool readBodies(char *fn);
void printBodies(int tail);

// simulation functions on device
__global__ void updateVelocity(int n, float dt, float4 *p_m, float4 *vel);
__global__ void updatePosition(int n, float dt, float4 *p_m, float4 *vel);
__device__ float dist2(float4 a, float4 b);
void initAndCopyDataToDevice(void);
void copyDataToHostAndFree(void);

int main(int argc, char *argv[])
{
    char filename[64];
    char filenameOutput[64];
    printf("N-body simulator, CUDA tiling version.\n");
    if (argc >= 2) // filename required as a minimum
    {
        // bounded copy: an over-long path is truncated instead of overflowing the buffer
        snprintf(filename, sizeof(filename), "%s", argv[1]);
        float *pxT = 0, *pyT = 0, *pzT = 0;
        float *vxT = 0, *vyT = 0, *vzT = 0;
        float *mT = 0;
        if (1 == readinput(filename, &n, &t, &dt, &pxT, &pyT, &pzT, &vxT, &vyT, &vzT, &mT))
        {
            printf("Input-file %s successfully loaded.\n", filename);
            if (argc >= 4) // override with t, dt parameters from argv
            {
                t = (int) strtol(argv[2], NULL, 10);
                dt = (float) strtod(argv[3], NULL);
            }
            p_m = (float4 *) malloc(n * sizeof(float4));
            vel = (float4 *) malloc(n * sizeof(float4));
            for (int i = 0; i < n; i++)
            {
                p_m[i].x = pxT[i]; p_m[i].y = pyT[i]; p_m[i].z = pzT[i]; p_m[i].w = mT[i];
                vel[i].x = vxT[i]; vel[i].y = vyT[i]; vel[i].z = vzT[i]; vel[i].w = 0.0f;
            }
            printf("n: %d t: %d dt: %f\n", n, t, dt);
        }
        else
        {
            printf("Error reading input-file %s, exiting.\n", filename);
            exit(-1);
        }

        if (DEBUG) printBodies(20);
        printf("*** CUDA SIMULATION STARTS ***\n");
        initAndCopyDataToDevice();
        // begin timing
        cudaEvent_t start, stop;
        float elapsed = 0.0f;
        HANDLE_ERROR(cudaEventCreate(&start));
        HANDLE_ERROR(cudaEventCreate(&stop));
        HANDLE_ERROR(cudaEventRecord(start, 0));
        for (int steps = 0; steps < t; steps++)
        {
            // third launch parameter: bytes of dynamic shared memory per block (one tile)
            updateVelocity<<<BLOCKS, THREADS, THREADS * (sizeof(float4))>>>(n, dt, p_m_Dev, vel_Dev);
            updatePosition<<<BLOCKS, THREADS>>>(n, dt, p_m_Dev, vel_Dev);
        }
        HANDLE_ERROR(cudaEventRecord(stop, 0));
        HANDLE_ERROR(cudaEventSynchronize(stop));
        HANDLE_ERROR(cudaEventElapsedTime(&elapsed, start, stop));
        HANDLE_ERROR(cudaEventDestroy(start));
        HANDLE_ERROR(cudaEventDestroy(stop));
        // end timing
        copyDataToHostAndFree();
        printf("*** CUDA SIMULATION ENDS ***\n");
        if (DEBUG)
            printBodies(20); // print results
        printf("CUDA time elapsed: %f ms\n", elapsed);
        for (int i = 0; i < n; i++)
        {
            pxT[i] = p_m[i].x; pyT[i] = p_m[i].y; pzT[i] = p_m[i].z; mT[i] = p_m[i].w;
            vxT[i] = vel[i].x; vyT[i] = vel[i].y; vzT[i] = vel[i].z;
        }
        if (DEBUG)
            printf("System energy after simulation: %f\n", computeEnergy(n, pxT, pyT, pzT, vxT, vyT, vzT, mT));
        if (argc == 5) // write output file
        {
            snprintf(filenameOutput, sizeof(filenameOutput), "%s", argv[4]);
            if (1 == writeoutput(filenameOutput, n, t, dt, pxT, pyT, pzT, vxT, vyT, vzT, mT))
            {
                printf("Output-file %s successfully written.\n", filenameOutput);
            }
            else
            {
                printf("Error writing output-file %s, exiting.", filenameOutput);
                exit(-1);
            }
        }
        free(pxT); free(pyT); free(pzT); free(vxT); free(vyT); free(vzT); free(mT);
        free(p_m); free(vel); // the packed host arrays were allocated with malloc above
    }
    else
    {
        printf("Incorrect number of parameters supplied. To run:\n");
        printf("%s <input-file> [<timesteps> <size of timestep>] [<output-file>]\n", argv[0]);
        return 1;
    }
    return 0;
}

void printBodies(int tail)
{
    int start = 0;
    if (tail > 0 && tail < n) // print only the last `tail` bodies; never start before body 0
        start = n - tail;
    cout << "Printing system of bodies: " << endl;
    cout << "n: " << n << endl;
    cout << "r: " << r << endl;
    for (int i = start; i < n; i++)
    {
        cout << "body " << i << ": pos=(" << p_m[i].x << ", " << p_m[i].y << ", "
             << p_m[i].z << ") vel=(" << vel[i].x << ", " << vel[i].y << ", " << vel[i].z
             << ") m=" << p_m[i].w << endl;
    }
}

// squared distance: the callers need r^2, so the square root is deliberately skipped
__device__ float dist2(float4 a, float4 b)
{
    return ((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) + (a.z - b.z) * (a.z - b.z));
}

__global__ void updateVelocity(int n, float dt, float4 *p_m, float4 *vel)
{
    int threadsInBlock = blockDim.x;
    int threadsInGrid = threadsInBlock * gridDim.x;
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int tiles = (n + threadsInBlock - 1) / threadsInBlock; // number of tiles, rounded up
    // dynamically sized shared memory array on the device; there is 2KB shared per processor (16K per SM).
    extern __shared__ float4 shared_p_m[];
    // __syncthreads() below is only safe if all threads of a block run this loop the same
    // number of times, which holds when n is a whole multiple of the block size.
    for (int k = tid; k < n; k += threadsInGrid)
    {
        float4 p_mK = p_m[k];
        float4 accK;
        // velK = vel[k];
        accK.x = 0.0f;
        accK.y = 0.0f;
        accK.z = 0.0f;
        for (int j = 0; j < tiles; j++) // the block takes one tile at a time and reads it into shared memory
        {
            // bounds check on the global body index (tile j starts at body j * threadsInBlock)
            if (j * threadsInBlock + threadIdx.x < n)
            {
                shared_p_m[threadIdx.x] = p_m[j * threadsInBlock + threadIdx.x];
            }
            __syncthreads(); // make sure all memory is read in
            for (int l = 0; l < threadsInBlock; l++) // each thread accumulates the acceleration of its body using the shared tile memory
            {
                if (j * threadsInBlock + l < n)
                {
                    float rtemp = EPSILON + dist2(p_mK, shared_p_m[l]);
                    float temp = G * shared_p_m[l].w / sqrtf(rtemp * rtemp * rtemp);
                    accK.x += temp * (shared_p_m[l].x - p_mK.x);
                    accK.y += temp * (shared_p_m[l].y - p_mK.y);
                    accK.z += temp * (shared_p_m[l].z - p_mK.z);
                }
            }
            __syncthreads(); // makes sure no memory is overwritten
        }
        // saving the temp variables back to global memory
        vel[k].x += dt * accK.x;
        vel[k].y += dt * accK.y;
        vel[k].z += dt * accK.z;
    }
}

__global__ void updatePosition(int n, float dt, float4 *p_m, float4 *vel)
{
    int threadsInGrid = blockDim.x * gridDim.x; // total number of threads in the grid, used as loop stride
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int k = tid; k < n; k += threadsInGrid)
    {
        // update position
        p_m[k].x += dt * (vel[k].x);
        p_m[k].y += dt * (vel[k].y);
        p_m[k].z += dt * (vel[k].z);
    }
}

void initAndCopyDataToDevice(void)
{
    HANDLE_ERROR(cudaMalloc((void **) &p_m_Dev, n * sizeof(float4)));
    HANDLE_ERROR(cudaMalloc((void **) &vel_Dev, n * sizeof(float4)));
    HANDLE_ERROR(cudaMemcpy(p_m_Dev, p_m, n * sizeof(float4), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(vel_Dev, vel, n * sizeof(float4), cudaMemcpyHostToDevice));
}

void copyDataToHostAndFree(void)
{
    HANDLE_ERROR(cudaMemcpy(p_m, p_m_Dev, n * sizeof(float4), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaMemcpy(vel, vel_Dev, n * sizeof(float4), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaFree(p_m_Dev));
    HANDLE_ERROR(cudaFree(vel_Dev));
}
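The tiling kernel splits the inner all-pairs loop into tiles of one block's worth of bodies: each tile is first loaded into shared memory, then every thread sums the contributions of the bodies in that tile. The host-side sketch below emulates this traversal for a single body and compares it with the plain sum. It is an illustration under stated assumptions, not the thesis code: `Float4` stands in for CUDA's `float4`, a `std::vector` plays the role of the shared-memory tile, and the names `tileCount`, `accumulate`, `directAcceleration` and `tiledAcceleration` are hypothetical. The bounds guards use the global body index `j * tileSize + l`.

```cpp
#include <cmath>
#include <vector>

struct Float4 { float x, y, z, w; };  // host stand-in for CUDA's float4 (assumption)

// Tiles needed to cover n bodies with tileSize bodies per tile (rounded up).
int tileCount(int n, int tileSize) { return (n + tileSize - 1) / tileSize; }

// Adds the pull of body pj on body pk to acc (the w component of pj is its mass).
void accumulate(Float4 &acc, const Float4 &pk, const Float4 &pj,
                float G = 1.0f, float eps = 1e-7f)
{
    const float dx = pj.x - pk.x, dy = pj.y - pk.y, dz = pj.z - pk.z;
    const float r2 = eps + dx * dx + dy * dy + dz * dz;   // softened r^2
    const float s = G * pj.w / std::sqrt(r2 * r2 * r2);   // G * m_j / r^3
    acc.x += s * dx;
    acc.y += s * dy;
    acc.z += s * dz;
}

// Reference: plain all-pairs sum for body k.
Float4 directAcceleration(const std::vector<Float4> &pm, int k)
{
    Float4 acc = {0.0f, 0.0f, 0.0f, 0.0f};
    for (const Float4 &pj : pm)
        accumulate(acc, pm[k], pj);
    return acc;
}

// Same sum, visited tile by tile; `tile` plays the role of the shared-memory array.
Float4 tiledAcceleration(const std::vector<Float4> &pm, int k, int tileSize)
{
    const int n = static_cast<int>(pm.size());
    Float4 acc = {0.0f, 0.0f, 0.0f, 0.0f};
    std::vector<Float4> tile(tileSize);
    for (int j = 0; j < tileCount(n, tileSize); ++j) {
        for (int t = 0; t < tileSize; ++t)       // load phase (one element per "thread")
            if (j * tileSize + t < n)
                tile[t] = pm[j * tileSize + t];
        for (int l = 0; l < tileSize; ++l)       // compute phase
            if (j * tileSize + l < n)
                accumulate(acc, pm[k], tile[l]);
    }
    return acc;
}
```

Because every body index below `n` appears in exactly one tile, the tiled sum visits the same bodies in the same order as the direct sum, including a partially filled last tile.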
7 All-pairs, CUDA, loop optimizations
/*
Simple CUDA implementation of the all-pairs N-body simulation
Optimisation: inner loops optimized in device code. Builds on the tiling optimised version.
By Jon and Truls
*/

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <cstdio>   // printf
#include <cstdlib>
#include <math.h>
#include "input.c"
#include "output.c"
#include "energy.c"

// CUDA includes
#include "cuda.h"
#include "book.h"

using namespace std;

// simulation parameters
#define G 1.0f
#define EPSILON 0.0000001f
#define DEBUG 0
#define BLOCKS 512
#define THREADS 256
int n = 0;
int t = 0;
float dt = 0.0f;
float r = 0.0f; // radius of system, may be needed later

// body value arrays (pos + mass, velocity)
float4 *p_m;
float4 *vel;

// body value arrays for device
float4 *p_m_Dev;
float4 *vel_Dev;

// helper functions
//bool readBodies(char *fn);
void printBodies(int tail);

// simulation functions on device
__global__ void updateVelocity(int n, float dt, float4 *p_m, float4 *vel);
__global__ void updatePosition(int n, float dt, float4 *p_m, float4 *vel);
__device__ float dist2(float4 a, float4 b);
void initAndCopyDataToDevice(void);
void copyDataToHostAndFree(void);

int main(int argc, char *argv[])
{
    char filename[64];
    char filenameOutput[64];
    printf("N-body simulator, CUDA loop optimized version.\n");
    if (argc >= 2) // filename required as a minimum
    {
        // bounded copy: an over-long path is truncated instead of overflowing the buffer
        snprintf(filename, sizeof(filename), "%s", argv[1]);
        float *pxT = 0, *pyT = 0, *pzT = 0;
        float *vxT = 0, *vyT = 0, *vzT = 0;
        float *mT = 0;
        if (1 == readinput(filename, &n, &t, &dt, &pxT, &pyT, &pzT, &vxT, &vyT, &vzT, &mT))
        {
            printf("Input-file %s successfully loaded.\n", filename);
            if (argc >= 4) // override with t, dt parameters from argv
            {
                t = (int) strtol(argv[2], NULL, 10);
                dt = (float) strtod(argv[3], NULL);
            }
            p_m = (float4 *) malloc(n * sizeof(float4));
            vel = (float4 *) malloc(n * sizeof(float4));
            for (int i = 0; i < n; i++)
            {
                p_m[i].x = pxT[i]; p_m[i].y = pyT[i]; p_m[i].z = pzT[i]; p_m[i].w = mT[i];
                vel[i].x = vxT[i]; vel[i].y = vyT[i]; vel[i].z = vzT[i]; vel[i].w = 0.0f;
            }
            printf("n: %d t: %d dt: %f\n", n, t, dt);
        }
        else
        {
            printf("Error reading input-file %s, exiting.\n", filename);
            exit(-1);
        }

        if (DEBUG) printBodies(20);
        printf("*** CUDA SIMULATION STARTS ***\n");
        initAndCopyDataToDevice();
        // begin timing
        cudaEvent_t start, stop;
        float elapsed = 0.0f;
        HANDLE_ERROR(cudaEventCreate(&start));
        HANDLE_ERROR(cudaEventCreate(&stop));
        HANDLE_ERROR(cudaEventRecord(start, 0));
        for (int steps = 0; steps < t; steps++)
        {
            // third launch parameter: bytes of dynamic shared memory per block (one tile)
            updateVelocity<<<BLOCKS, THREADS, THREADS * (sizeof(float4))>>>(n, dt, p_m_Dev, vel_Dev);
            updatePosition<<<BLOCKS, THREADS>>>(n, dt, p_m_Dev, vel_Dev);
        }
        HANDLE_ERROR(cudaEventRecord(stop, 0));
        HANDLE_ERROR(cudaEventSynchronize(stop));
        HANDLE_ERROR(cudaEventElapsedTime(&elapsed, start, stop));
        HANDLE_ERROR(cudaEventDestroy(start));
        HANDLE_ERROR(cudaEventDestroy(stop));
        // end timing
        copyDataToHostAndFree();
        printf("*** CUDA SIMULATION ENDS ***\n");
        if (DEBUG)
            printBodies(20); // print results
        printf("CUDA time elapsed: %f ms\n", elapsed);
        for (int i = 0; i < n; i++)
        {
            pxT[i] = p_m[i].x; pyT[i] = p_m[i].y; pzT[i] = p_m[i].z; mT[i] = p_m[i].w;
            vxT[i] = vel[i].x; vyT[i] = vel[i].y; vzT[i] = vel[i].z;
        }
        if (DEBUG)
            printf("System energy after simulation: %f\n", computeEnergy(n, pxT, pyT, pzT, vxT, vyT, vzT, mT));
        if (argc == 5) // write output file
        {
            snprintf(filenameOutput, sizeof(filenameOutput), "%s", argv[4]);
            if (1 == writeoutput(filenameOutput, n, t, dt, pxT, pyT, pzT, vxT, vyT, vzT, mT))
            {
                printf("Output-file %s successfully written.\n", filenameOutput);
            }
            else
            {
                printf("Error writing output-file %s, exiting.", filenameOutput);
                exit(-1);
            }
        }
        free(pxT); free(pyT); free(pzT); free(vxT); free(vyT); free(vzT); free(mT);
        free(p_m); free(vel); // the packed host arrays were allocated with malloc above
    }
    else
    {
        printf("Incorrect number of parameters supplied. To run:\n");
        printf("%s <input-file> [<timesteps> <size of timestep>] [<output-file>]\n", argv[0]);
        return 1;
    }
    return 0;
}

bool readBodies(char *fileName)
{
    string line;
    ifstream file(fileName, ios::in);
    if (file.is_open())
    {
        if (getline(file, line))
        {
            stringstream ss(line);
            ss >> n; // save n to global variable, and allocate memory
            p_m = (float4 *) malloc(n * sizeof(float4));
            vel = (float4 *) malloc(n * sizeof(float4));
            if (getline(file, line))
            {
                stringstream ss(line);
                ss >> r; // save r (radius) to global variable
                for (int i = 0; i < n; i++)
                {
                    if (getline(file, line))
                    {
                        stringstream ss(line);
                        // temp values for reading
                        float pxT = 0.0f, pyT = 0.0f, pzT = 0.0f;
                        float vxT = 0.0f, vyT = 0.0f, vzT = 0.0f;
                        float mT = 0.0f;
                        ss >> pxT >> pyT >> pzT >> vxT >> vyT >> vzT >> mT;
                        float4 p_mT = { pxT, pyT, pzT, mT };
                        float4 velT = { vxT, vyT, vzT, 0.0f };
                        p_m[i] = p_mT;
                        vel[i] = velT;
                    } // getline in loop
                    else
                    {
                        return false;
                    }
                } // for loop
            } // second getline
            else
            {
                return false;
            }
        } // first getline
        else
        {
            return false; // empty file: n could not be read
        }
        file.close();
    }
    else
    {
        cout << "Error opening file " << fileName << endl;
        return false;
    }
    return true;
}
194
void printBodies(int tail)
{
    int start = 0;
    if (tail > 0 && tail < n) // print only the last `tail` bodies; never start before body 0
        start = n - tail;
    cout << "Printing system of bodies:" << endl;
    cout << "n: " << n << endl;
    cout << "r: " << r << endl;
    for (int i = start; i < n; i++)
    {
        cout << "body " << i << ": pos=(" << p_m[i].x << "," << p_m[i].y << "," <<
            p_m[i].z << ") vel=(" << vel[i].x << "," << vel[i].y << "," << vel[i].z << ") m=" << p_m[i].w << endl;
    }
}

// special case of distance: returns the squared distance, since the caller would square the result anyway
__device__ float dist2(float4 a, float4 b)
{
    return ((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) + (a.z - b.z) * (a.z - b.z));
}

__global__ void updateVelocity(int n, float dt, float4 *p_m, float4 *vel)
{
    int threadsInBlock = blockDim.x;
    int threadsInGrid = threadsInBlock * gridDim.x;
    int tiles = n / threadsInBlock + 1; // number of tiles
    extern __shared__ float4 shared_p_m[]; // a shared memory array on the device, there is 2KB shared per processor (16K per SM).
    // The loop bound depends only on the block, so every thread of a block runs the same
    // number of iterations, helps load every tile and reaches every __syncthreads().
    for (int base = blockIdx.x * threadsInBlock; base < n; base += threadsInGrid)
    {
        int k = base + threadIdx.x;
        bool active = k < n; // threads past the last body only help load the tiles
        float4 this_p_m = active ? p_m[k] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        float4 thisVel = active ? vel[k] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        for (int j = 0; j < tiles; j++) // the block takes one tile at a time and reads it into shared memory
        {
            int tileStart = j * threadsInBlock; // index of the first body in this tile
            if (tileStart + threadIdx.x < n)
            {
                shared_p_m[threadIdx.x] = p_m[tileStart + threadIdx.x];
            }
            __syncthreads(); // make sure all memory is read in
            #pragma unroll 8
            for (int l = 0; l < threadsInBlock; l++) // each thread calculates the velocity of its body using the shared tile memory
            {
                if (tileStart + l < n)
                {
                    float rtemp = EPSILON + dist2(this_p_m, shared_p_m[l]);
                    float temp = dt * G * shared_p_m[l].w * rsqrtf(rtemp * rtemp * rtemp);
                    thisVel.x += temp * (shared_p_m[l].x - this_p_m.x);
                    thisVel.y += temp * (shared_p_m[l].y - this_p_m.y);
                    thisVel.z += temp * (shared_p_m[l].z - this_p_m.z);
                }
            }
            __syncthreads(); // make sure no memory is overwritten
        }
        // saving the temp variables back to global memory (the position is unchanged in this kernel)
        if (active)
        {
            vel[k] = thisVel;
        }
    }
}

__global__ void updatePosition(int n, float dt, float4 *p_m, float4 *vel)
{
    // int threadsInBlock = blockDim.x;
    const int threadsInGrid = blockDim.x * gridDim.x;
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float4 p_mK;
    #pragma unroll 8
    for (int k = tid; k < n; k += threadsInGrid)
    {
        p_mK = p_m[k];
        p_mK.x += dt * (vel[k].x);
        p_mK.y += dt * (vel[k].y);
        p_mK.z += dt * (vel[k].z);
        p_m[k] = p_mK;
    }
}

void initAndCopyDataToDevice(void)
{
    HANDLE_ERROR(cudaMalloc((void **) &p_m_Dev, n * sizeof(float4)));
    HANDLE_ERROR(cudaMalloc((void **) &vel_Dev, n * sizeof(float4)));
    HANDLE_ERROR(cudaMemcpy(p_m_Dev, p_m, n * sizeof(float4), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(vel_Dev, vel, n * sizeof(float4), cudaMemcpyHostToDevice));
}

void copyDataToHostAndFree(void)
{
    HANDLE_ERROR(cudaMemcpy(p_m, p_m_Dev, n * sizeof(float4), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaMemcpy(vel, vel_Dev, n * sizeof(float4), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaFree(p_m_Dev));
    HANDLE_ERROR(cudaFree(vel_Dev));
}
8 All-pairs, CUDA, velocity Verlet
/*
 CUDA implementation of the all-pairs N-body simulation
 Optimisation: changes integration method to Verlet. Builds on the loop optimised version.
 By Jon and Truls
*/

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <cstdio>  // printf
#include <cstdlib>
#include <cstring> // strcpy
#include <math.h>
#include "input.c"
#include "output.c"
#include "energy.c"

// CUDA includes
#include "cuda.h"
#include "book.h"

using namespace std;

// simulation parameters
#define G 1.0f
#define EPSILON 0.0000001f
#define DEBUG 0
#define BLOCKS 512
#define THREADS 256
int n = 0;
int t = 0;
float dt = 0.0f;
float r = 0.0f; // radius of system, may be needed later

// body value arrays (pos + mass, velocity, acceleration)
float4 *p_m;
float4 *vel;
float4 *acc;


// body value arrays for device
float4 *p_m_Dev;
float4 *vel_Dev;
float4 *acc_Dev;

// helper functions
//bool readBodies(char *fn);
void printBodies(int tail);

// simulation functions on device
__global__ void updateVelocity(int n, float dt, int steps, float4 *p_m, float4 *vel, float4 *acc);
__global__ void updatePosition(int n, float dt, float4 *p_m, float4 *vel, float4 *acc);
__device__ float dist2(float4 a, float4 b);
void initAndCopyDataToDevice(void);
void copyDataToHostAndFree(void);

int main(int argc, char *argv[])
{
    char filename[64];
    char filenameOutput[64];
    printf("N-body simulator, CUDA version using Verlet integration method.\n");
    if (argc >= 2) // filename required as a minimum
    {
        strcpy(filename, argv[1]);
        float *pxT = 0, *pyT = 0, *pzT = 0;
        float *vxT = 0, *vyT = 0, *vzT = 0;
        float *mT = 0;
        if (1 == readinput(filename, &n, &t, &dt, &pxT, &pyT, &pzT, &vxT, &vyT, &vzT, &mT))
        {
            printf("Input-file %s successfully loaded.\n", filename);
            if (argc >= 4) // override with t, dt parameters from argv
            {
                t = (int) strtol(argv[2], NULL, 10);
                dt = (float) strtod(argv[3], NULL);
            }
            p_m = (float4 *) malloc(n * sizeof(float4));
            vel = (float4 *) malloc(n * sizeof(float4));
            acc = (float4 *) malloc(n * sizeof(float4));
            // pack the separate input arrays into float4 arrays (w of p_m holds the mass)
            for (int i = 0; i < n; i++)
            {
                p_m[i].x = pxT[i]; p_m[i].y = pyT[i]; p_m[i].z = pzT[i]; p_m[i].w = mT[i];
                vel[i].x = vxT[i]; vel[i].y = vyT[i]; vel[i].z = vzT[i]; vel[i].w = 0.0f;
                acc[i].x = 0.0f; acc[i].y = 0.0f; acc[i].z = 0.0f; acc[i].w = 0.0f; // w is unused, but is copied to the device
            }
            printf("n: %d t: %d dt: %f\n", n, t, dt);
        }
        else
        {
            printf("Error reading input-file %s, exiting.\n", filename);
            exit(-1);
        }

        if (DEBUG) printBodies(20);
        printf("*** CUDA SIMULATION STARTS ***\n");
        initAndCopyDataToDevice();
        // begin timing
        cudaEvent_t start, stop;
        float elapsed = 0.0f;
        HANDLE_ERROR(cudaEventCreate(&start));
        HANDLE_ERROR(cudaEventCreate(&stop));
        HANDLE_ERROR(cudaEventRecord(start, 0));
        for (int steps = 0; steps < t; steps++)
        {
            // the third launch parameter reserves one float4 of shared memory per thread for the tile
            updateVelocity<<<BLOCKS, THREADS, THREADS * (sizeof(float4))>>>(n, dt, steps, p_m_Dev, vel_Dev, acc_Dev);
            updatePosition<<<BLOCKS, THREADS>>>(n, dt, p_m_Dev, vel_Dev, acc_Dev);
        }
        HANDLE_ERROR(cudaEventRecord(stop, 0));
        HANDLE_ERROR(cudaEventSynchronize(stop));
        HANDLE_ERROR(cudaEventElapsedTime(&elapsed, start, stop));
        HANDLE_ERROR(cudaEventDestroy(start));
        HANDLE_ERROR(cudaEventDestroy(stop));
        // end timing
        copyDataToHostAndFree();
        printf("*** CUDA SIMULATION ENDS ***\n");
        if (DEBUG)
            printBodies(20); // print results
        printf("CUDA time elapsed: %f ms\n", elapsed);
        for (int i = 0; i < n; i++)
        {
            pxT[i] = p_m[i].x; pyT[i] = p_m[i].y; pzT[i] = p_m[i].z; mT[i] = p_m[i].w;
            vxT[i] = vel[i].x; vyT[i] = vel[i].y; vzT[i] = vel[i].z;
        }
        float energyTemp = 0.0f;
        if (DEBUG)
        {
            energyTemp = computeEnergy(n, pxT, pyT, pzT, vxT, vyT, vzT, mT);
            printf("System energy after simulation: %f\n", energyTemp);
        }
        if (argc == 5) // write output file
        {
            strcpy(filenameOutput, argv[4]);
            if (1 == writeoutput(filenameOutput, n, t, dt, pxT, pyT, pzT, vxT, vyT, vzT, mT))
            {
                printf("Output-file %s successfully written.\n", filenameOutput);
            }
            else
            {
                printf("Error writing output-file %s, exiting.\n", filenameOutput);
                exit(-1);
            }
        }
        free(pxT); free(pyT); free(pzT); free(vxT); free(vyT); free(vzT); free(mT);
    }
    else
    {
        printf("Incorrect number of parameters supplied. To run:\n");
        printf("%s <input-file> [<timesteps> <size of timestep>] [<output-file>]\n", argv[0]);
        return 1;
    }
    return 0;
}
void printBodies(int tail)
{
    int start = 0;
    if (tail > 0 && tail < n) // print only the last `tail` bodies; never start before body 0
        start = n - tail;
    cout << "Printing system of bodies:" << endl;
    cout << "n: " << n << endl;
    cout << "r: " << r << endl;
    for (int i = start; i < n; i++)
    {
        cout << "body " << i << ": pos=(" << p_m[i].x << "," << p_m[i].y << "," <<
            p_m[i].z << ") vel=(" << vel[i].x << "," << vel[i].y << "," << vel[i].z << ") m=" << p_m[i].w << endl;
    }
}

// special case of distance: returns the squared distance, since the caller would square the result anyway
__device__ float dist2(float4 a, float4 b)
{
    return ((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) + (a.z - b.z) * (a.z - b.z));
}

__global__ void updateVelocity(int n, float dt, int steps, float4 *p_m, float4 *vel, float4 *acc)
{
    int threadsInBlock = blockDim.x;
    int threadsInGrid = threadsInBlock * gridDim.x;
    int tiles = n / threadsInBlock + 1; // number of tiles
    extern __shared__ float4 shared_p_m[]; // a shared memory array on the device, there is 2KB shared per processor (16K per SM).
    // The loop bound depends only on the block, so every thread of a block runs the same
    // number of iterations, helps load every tile and reaches every __syncthreads().
    for (int base = blockIdx.x * threadsInBlock; base < n; base += threadsInGrid)
    {
        int k = base + threadIdx.x;
        bool active = k < n; // threads past the last body only help load the tiles
        float4 this_p_m = active ? p_m[k] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        float ax = 0.0f;
        float ay = 0.0f;
        float az = 0.0f;
        for (int j = 0; j < tiles; j++) // the block takes one tile at a time and reads it into shared memory
        {
            int tileStart = j * threadsInBlock; // index of the first body in this tile
            if (tileStart + threadIdx.x < n)
            {
                shared_p_m[threadIdx.x] = p_m[tileStart + threadIdx.x];
            }
            __syncthreads(); // make sure all memory is read in
            #pragma unroll 8
            for (int l = 0; l < threadsInBlock; l++) // each thread calculates the acceleration of its body using the shared tile memory
            {
                if (tileStart + l < n)
                {
                    float rtemp = EPSILON + dist2(this_p_m, shared_p_m[l]);
                    // acceleration only: dt is applied in the integration steps,
                    // as in the sequential Barnes-Hut listing
                    float temp = G * shared_p_m[l].w * rsqrtf(rtemp * rtemp * rtemp);
                    ax += temp * (shared_p_m[l].x - this_p_m.x);
                    ay += temp * (shared_p_m[l].y - this_p_m.y);
                    az += temp * (shared_p_m[l].z - this_p_m.z);
                }
            }
            __syncthreads(); // make sure no memory is overwritten
        }
        // saving the temp variables back to global memory
        if (active)
        {
            if (steps > 0)
            {
                // updatePosition predicted the velocity with the old acceleration;
                // replace half of that contribution with the new acceleration
                float half_dt = dt * 0.5f;
                vel[k].x += (ax - acc[k].x) * half_dt;
                vel[k].y += (ay - acc[k].y) * half_dt;
                vel[k].z += (az - acc[k].z) * half_dt;
            }
            acc[k].x = ax;
            acc[k].y = ay;
            acc[k].z = az;
        }
    }
}

__global__ void updatePosition(int n, float dt, float4 *p_m, float4 *vel, float4 *acc)
{
    // int threadsInBlock = blockDim.x;
    const int threadsInGrid = blockDim.x * gridDim.x;
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float dth = dt * 0.5f;
    float dvx, dvy, dvz;
    float vhx, vhy, vhz;
    #pragma unroll 8
    for (int k = tid; k < n; k += threadsInGrid)
    {
        dvx = acc[k].x * dth;
        dvy = acc[k].y * dth;
        dvz = acc[k].z * dth;

        vhx = vel[k].x + dvx;
        vhy = vel[k].y + dvy;
        vhz = vel[k].z + dvz;

        p_m[k].x += vhx * dt;
        p_m[k].y += vhy * dt;
        p_m[k].z += vhz * dt;

        vel[k].x = vhx + dvx;
        vel[k].y = vhy + dvy;
        vel[k].z = vhz + dvz;
    }
}

void initAndCopyDataToDevice(void)
{
    HANDLE_ERROR(cudaMalloc((void **) &p_m_Dev, n * sizeof(float4)));
    HANDLE_ERROR(cudaMalloc((void **) &vel_Dev, n * sizeof(float4)));
    HANDLE_ERROR(cudaMalloc((void **) &acc_Dev, n * sizeof(float4)));
    HANDLE_ERROR(cudaMemcpy(p_m_Dev, p_m, n * sizeof(float4), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(vel_Dev, vel, n * sizeof(float4), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(acc_Dev, acc, n * sizeof(float4), cudaMemcpyHostToDevice));
}

void copyDataToHostAndFree(void)
{
    HANDLE_ERROR(cudaMemcpy(p_m, p_m_Dev, n * sizeof(float4), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaMemcpy(vel, vel_Dev, n * sizeof(float4), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaMemcpy(acc, acc_Dev, n * sizeof(float4), cudaMemcpyDeviceToHost));
    HANDLE_ERROR(cudaFree(p_m_Dev));
    HANDLE_ERROR(cudaFree(vel_Dev));
    HANDLE_ERROR(cudaFree(acc_Dev));
}
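
The listing splits velocity Verlet over two kernels: updatePosition advances the position and predicts the velocity with the old acceleration, and updateVelocity in the next step corrects that prediction once the new acceleration is known. The sketch below, which is not part of the thesis code, reproduces this split for a single particle in plain C and compares it with the textbook form of velocity Verlet. The force (a unit-mass harmonic oscillator, a(x) = -x), the State struct and all function names are assumptions made for illustration. The final call to correct_velocity in simulate_split is also an addition of the sketch: the host loop of the listing ends after updatePosition, so the velocity it copies back is the predicted one.

```c
#include <math.h>

/* Hypothetical test force: unit-mass harmonic oscillator, a(x) = -x. */
static float accel(float x) { return -x; }

typedef struct { float x, v, a; } State;

/* Mirrors updateVelocity: compute the new acceleration and, after the first
   step, correct the velocity that was predicted with the old acceleration. */
static void correct_velocity(State *s, float dt, int step)
{
    float a_new = accel(s->x);
    if (step > 0)
        s->v += (a_new - s->a) * 0.5f * dt;
    s->a = a_new;
}

/* Mirrors updatePosition: half kick, drift, then a second half kick that
   predicts the end-of-step velocity with the current acceleration. */
static void advance_position(State *s, float dt)
{
    float dv = s->a * 0.5f * dt;
    float vh = s->v + dv;
    s->x += vh * dt;
    s->v = vh + dv;
}

/* Runs the two phases in the same order as the host loop of the listing. */
static State simulate_split(State s, float dt, int steps)
{
    for (int i = 0; i < steps; i++)
    {
        correct_velocity(&s, dt, i);
        advance_position(&s, dt);
    }
    correct_velocity(&s, dt, steps); /* final correction, added in this sketch */
    return s;
}

/* Textbook velocity Verlet, for comparison. */
static State simulate_textbook(State s, float dt, int steps)
{
    s.a = accel(s.x);
    for (int i = 0; i < steps; i++)
    {
        s.x += s.v * dt + 0.5f * s.a * dt * dt;
        float a_new = accel(s.x);
        s.v += 0.5f * (s.a + a_new) * dt;
        s.a = a_new;
    }
    return s;
}
```

Starting both variants from x = 1, v = 0 and integrating to t = 1 should give the same state up to rounding, close to the exact solution x = cos(1), v = -sin(1).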
9 Barnes-Hut, Sequential, C
/*
 * bh.c
 Sequential Barnes-Hut implementation using C and arrays
*/

#include <math.h>
#include <malloc.h>
#include <string.h>
#include <float.h>
#include <stdio.h> // printf
#include <stdlib.h>
#include "input.c"
#include "output.c"
#include "energy.c"
#include <sys/time.h>

#define NOCHILD -1
#define THRESH 0.5f
#define G 1.0f
#define EPSILON 0.0000001f // softening factor
#define DEBUG 0

int n, t; // number of bodies/leaves, timesteps
float dt; // size of integration step
int nnodes; // number of nodes, leaves + internal
int *childNodes;
float *px, *py, *pz;
float *vx, *vy, *vz;
float *ax, *ay, *az;
double *timings0, *timings1, *timings2, *timings3, *timings4;
float *m;
float r; // max radius of system
int usedInternal; // keeps track of the smallest value of internal node index
int INIT = 0;
int steps;

void printTree(void);
inline float sqdist(float x1, float y1, float z1, float x2, float y2, float z2);
int getIndex(float *x, float *y, float *z, int centre, int leaf, float *px, float *py, float *pz);

int init(int nT)
{
    // positions and masses hold leaves and internal nodes (2 * nT entries)
    px = (float *) malloc(2 * nT * sizeof(float));
    py = (float *) malloc(2 * nT * sizeof(float));
    pz = (float *) malloc(2 * nT * sizeof(float));
    vx = (float *) malloc(nT * sizeof(float));
    vy = (float *) malloc(nT * sizeof(float));
    vz = (float *) malloc(nT * sizeof(float));
    ax = (float *) malloc(nT * sizeof(float));
    ay = (float *) malloc(nT * sizeof(float));
    az = (float *) malloc(nT * sizeof(float));
    m = (float *) malloc(2 * nT * sizeof(float));
    childNodes = (int *) malloc((nT + 1) * 8 * sizeof(int));
    // t + 1 entries each: entry t holds the accumulated total
    // (the size must be sizeof(double) * (t + 1), not sizeof(double) * t + 1)
    timings0 = (double *) malloc(sizeof(double) * (t + 1));
    timings1 = (double *) malloc(sizeof(double) * (t + 1));
    timings2 = (double *) malloc(sizeof(double) * (t + 1));
    timings3 = (double *) malloc(sizeof(double) * (t + 1));
    timings4 = (double *) malloc(sizeof(double) * (t + 1));
    if ((px != NULL) && (py != NULL) && (pz != NULL) && (vx != NULL) && (vy != NULL) && (vz != NULL)
        && (m != NULL) && (ax != NULL) && (ay != NULL) && (az != NULL) && (childNodes != NULL)
        && (timings0 != NULL) && (timings1 != NULL) && (timings2 != NULL) && (timings3 != NULL) && (timings4 != NULL))
        INIT = 1;
    return INIT;
}

int unInit(void)
{
    if (INIT == 1)
    {
        free(px); free(py); free(pz); free(vx); free(vy); free(vz); free(m);
        free(ax); free(ay); free(az); free(childNodes);
        free(timings0); free(timings1); free(timings2); free(timings3); free(timings4);
    }
    return INIT;
}

float calculateRadius() // naive radius calculation, returns abs of the biggest coordinate number
{
    float rT = 0.0f;
    if (1 == INIT)
    {
        // FLT_MIN is the smallest positive float, so -FLT_MAX is the correct start value for a maximum
        float xmax = -FLT_MAX, ymax = -FLT_MAX, zmax = -FLT_MAX;
        float xmin = FLT_MAX, ymin = FLT_MAX, zmin = FLT_MAX;
        int i;
        for (i = 0; i < n; i++)
        {
            xmax = (px[i] > xmax) ? px[i] : xmax; xmin = (px[i] < xmin) ? px[i] : xmin;
            ymax = (py[i] > ymax) ? py[i] : ymax; ymin = (py[i] < ymin) ? py[i] : ymin;
            zmax = (pz[i] > zmax) ? pz[i] : zmax; zmin = (pz[i] < zmin) ? pz[i] : zmin;
        }
        rT = (fabsf(xmax) > rT) ? fabsf(xmax) : rT; rT = (fabsf(xmin) > rT) ? fabsf(xmin) : rT;
        rT = (fabsf(ymax) > rT) ? fabsf(ymax) : rT; rT = (fabsf(ymin) > rT) ? fabsf(ymin) : rT;
        rT = (fabsf(zmax) > rT) ? fabsf(zmax) : rT; rT = (fabsf(zmin) > rT) ? fabsf(zmin) : rT;
    }
    return rT;
}

void buildTree()
{
    // build root
    float x = 0.0f, y = 0.0f, z = 0.0f;
    // values to change position
    usedInternal = nnodes - 1;
    int leaf, current, child; // the leaf we insert, the current node, and the current node's child
    current = usedInternal;
    int i;
    for (i = 0; i < 8; i++)
    {
        childNodes[(current - n) * 8 + i] = NOCHILD;
    }
    px[current] = 0.0f;
    py[current] = 0.0f;
    pz[current] = 0.0f;
    m[current] = 0.0f;
    // insert leaves
    for (leaf = 0; leaf < n; leaf++)
    {
        child = nnodes - 1;
        float r_this = r;
        int index = 0;
        while (child >= n) // descend through internal nodes (indices >= n)
        {
            current = child;
            index = getIndex(&x, &y, &z, current, leaf, px, py, pz);
            r_this *= 0.5f;
            child = childNodes[(current - n) * 8 + index];
        }
        if (child == NOCHILD)
        {
            childNodes[(current - n) * 8 + index] = leaf;
        }
        else // there is a leaf in the current position, i.e. child
        {
            int originalInternal = (current - n) * 8 + index; // remember this node through iteration
            int altNode = -1; // make sure not to insert node first time
            do
            {
                // we have the two leaves, namely child and leaf
                int newInternal = --usedInternal; // take the next free internal node
                if (altNode < newInternal)
                    altNode = newInternal;
                px[newInternal] = px[current] + r_this * x;
                py[newInternal] = py[current] + r_this * y;
                pz[newInternal] = pz[current] + r_this * z;
                m[newInternal] = 0.0f;
                // do not insert the new internal node in the first loop
                if (altNode != newInternal)
                {
                    childNodes[(current - n) * 8 + index] = newInternal;
                }
                int j;
                for (j = 0; j < 8; j++) // create new empty nodes for the newly created internal node
                {
                    childNodes[(newInternal - n) * 8 + j] = NOCHILD;
                }

                index = getIndex(&x, &y, &z, newInternal, child, px, py, pz);
                childNodes[(newInternal - n) * 8 + index] = child; // child is inserted
                current = newInternal;
                index = getIndex(&x, &y, &z, newInternal, leaf, px, py, pz);
                child = childNodes[(current - n) * 8 + index];
                r_this *= 0.5f;
            }
            while (child > NOCHILD); // keep looping until there is an empty slot for the leaf
            childNodes[(current - n) * 8 + index] = leaf;
            childNodes[originalInternal] = altNode;
        }
    }
}
void calculateCentreOfMass()
{
    // internal nodes are visited from the most recently created one up to the root,
    // so the children of a node are finished before the node itself
    while (usedInternal < nnodes)
    {
        // iterate through children.
        float cent_mass = 0.0f;
        float cent_pos_x = 0.0f;
        float cent_pos_y = 0.0f;
        float cent_pos_z = 0.0f;
        int k = 0;
        for (k = 0; k < 8; k++)
        {
            int currentChild = childNodes[(usedInternal - n) * 8 + k];
            if (currentChild >= 0)
            {
                float tempx = cent_pos_x * cent_mass + px[currentChild] * m[currentChild];
                float tempy = cent_pos_y * cent_mass + py[currentChild] * m[currentChild];
                float tempz = cent_pos_z * cent_mass + pz[currentChild] * m[currentChild];
                cent_mass += m[currentChild];
                cent_pos_x = tempx / cent_mass;
                cent_pos_y = tempy / cent_mass;
                cent_pos_z = tempz / cent_mass;
            }
        }
        px[usedInternal] = cent_pos_x;
        py[usedInternal] = cent_pos_y;
        pz[usedInternal] = cent_pos_z;
        m[usedInternal] = cent_mass;
        usedInternal++;
    }
}
float sqdist(float x1, float y1, float z1, float x2, float y2, float z2)
{
    return (x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) + (z1 - z2) * (z1 - z2);
}

void calculateAcceleration()
{
    // traverse tree from root down, updating acceleration when reaching a leaf or when the threshold is reached.
    int leaf;
    long visited; // stack implementation of visited children (0-7), 3 bits per level; a 64-bit long holds roughly 20 levels
    int currentNode;
    int currentChild;
    int depth;
    int MAXDEPTH = 32;
    int parent[MAXDEPTH]; // parent internalNode stack implementation
    float rdepth[MAXDEPTH]; // squared side length of a cell at each depth
    int i;
    rdepth[0] = r * r * 4;
    for (i = 1; i < MAXDEPTH; i++)
    {
        rdepth[i] = rdepth[i - 1] * 0.25f;
    }
    for (leaf = 0; leaf < n; leaf++)
    {
        visited = 1;
        currentNode = nnodes - 1; // start at root
        currentChild = 0;
        depth = 0;
        // temporary variables; these must be float (declaring them int truncates every contribution)
        float acc_x = 0.0f; float acc_y = 0.0f; float acc_z = 0.0f;
        do // while there are still unvisited branches of the tree
        {
            while (currentChild < 8) // go up a level when all 8 children are visited
            {
                do
                {
                    parent[depth] = currentNode;
                    currentNode = childNodes[(currentNode - n) * 8 + currentChild];
                    depth++;
                    visited = visited << 3;
                    visited = visited | currentChild;
                    currentChild = 0; // remark that this will be reset if the do loop is exited.
                }
                while ((currentNode >= n) && (THRESH / 2 < (rdepth[depth] / sqdist(px[currentNode], py[currentNode], pz[currentNode], px[leaf], py[leaf], pz[leaf]))));
                // keep descending while the current node is an internal node that is too close to the leaf for the threshold test
                // if the node is not empty, calculate and increment acceleration of the leaf with the force of the current node.
                if (currentNode != NOCHILD)
                {
                    float rtemp = EPSILON + sqdist(px[leaf], py[leaf], pz[leaf], px[currentNode], py[currentNode], pz[currentNode]);
                    float temp = G * m[currentNode] / sqrtf(rtemp * rtemp * rtemp);
                    acc_x = acc_x + temp * (px[currentNode] - px[leaf]);
                    acc_y = acc_y + temp * (py[currentNode] - py[leaf]);
                    acc_z = acc_z + temp * (pz[currentNode] - pz[leaf]);
                }
                currentChild = visited & 7; // remark, current child somewhat reset
                visited = visited >> 3;
                depth--;
                currentNode = parent[depth];
                currentChild++;
            }
            do // make sure the current child does not exceed 7 if we are at two or more succeeding far-right branches
            {
                depth--;
                currentChild = visited & 7;
                visited = visited >> 3;
                currentNode = parent[depth];
                currentChild++;
            }
            while (currentChild > 7 && depth >= 0);
        }
        while (depth >= 0);
        if (steps > 0)
        {
            // correct the velocity predicted by progressSystem with the new acceleration
            float half_dt = dt * 0.5f;
            vx[leaf] += (acc_x - ax[leaf]) * half_dt;
            vy[leaf] += (acc_y - ay[leaf]) * half_dt;
            vz[leaf] += (acc_z - az[leaf]) * half_dt;
        }
        ax[leaf] = acc_x;
        ay[leaf] = acc_y;
        az[leaf] = acc_z;
    }
}


void progressSystem(void)
{
    float dth = dt * 0.5f;
    float dvx, dvy, dvz;
    float vhx, vhy, vhz;
    int leaf;
    for (leaf = 0; leaf < n; leaf++)
    {
        dvx = ax[leaf] * dth;
        dvy = ay[leaf] * dth;
        dvz = az[leaf] * dth;

        vhx = vx[leaf] + dvx;
        vhy = vy[leaf] + dvy;
        vhz = vz[leaf] + dvz;

        px[leaf] += vhx * dt;
        py[leaf] += vhy * dt;
        pz[leaf] += vhz * dt;

        vx[leaf] = vhx + dvx;
        vy[leaf] = vhy + dvy;
        vz[leaf] = vhz + dvz;
    }
}

void printTree()
{
    // debug helper: prints the tree level by level; next holds the internal nodes
    // of the current level, last collects those of the level below
    int last[32000];
    int next[32000];
    int i = 0;
    for (i = 0; i < 32000; i++)
    {
        next[i] = -1;
        last[i] = -1;
    }
    i = 0;
    int depth = 0;
    int count = 0;
    printf("\n depth: %d ", depth);
    int currentInternal;
    int iteration = 0;
    next[0] = nnodes - 1;
    do
    {
        int j;
        do
        {
            currentInternal = next[iteration];
            printf(" || %d || ", currentInternal);
            for (j = 0; j < 8; j++)
            {
                int child = childNodes[(currentInternal - n) * 8 + j];
                printf("%d ", child);
                if (child >= n) // internal nodes have indices n and above
                {
                    last[count] = child;
                    count++;
                }
            }
            i++;
            iteration++;
        } while (next[iteration] > 0);
        // copy and reset the whole arrays (the original used 3200 here against a size of 32000)
        memcpy(next, last, sizeof(int) * 32000);
        for (j = 0; j < 32000; j++)
        {
            last[j] = -1;
        }
        iteration = 0;
        depth++;
        count = 0;
        printf("\n \n depth: %d \n", depth);
        if (next[0] == -1)
        {
            i = n;
            printf("done! \n");
        }
    }
    while (i < n);
}

// offset1 truncates from start, offset2 from end; withAcc > 0 also prints the acceleration
void printBodies(int count, int offset1, int offset2, int withAcc)
{
    if ((1 == INIT) && (n > 0))
    {
        int i, a, b;
        a = (offset1 >= count) ? 0 : offset1;
        b = (offset2 >= count) ? 0 : count - offset2;
        printf("n: (pos) (vel) [acc] (m)\n");
        for (i = a; i < b; i++)
        {
            printf("%d: (%f %f %f) (%f %f %f) ", i, px[i], py[i], pz[i], vx[i], vy[i], vz[i]);
            if (withAcc > 0) printf("[%f %f %f] ", ax[i], ay[i], az[i]);
            printf("(%f)\n", m[i]);
        }
        fflush(stdout);
    }
}

// returns an integer number of elapsed ms
long timevaldiff(struct timeval *time1, struct timeval *time2)
{
    long msec;
    msec = (time2->tv_sec - time1->tv_sec) * 1000;
    msec += (time2->tv_usec - time1->tv_usec) / 1000;
    return msec;
}

// returns a double number of elapsed ms for shorter timings
double timevaldiffd(struct timeval *time1, struct timeval *time2)
{
    double usec;
    usec = (double) (time2->tv_sec - time1->tv_sec) * 1000000;
    usec += (double) (time2->tv_usec - time1->tv_usec);
    return (usec / 1000.0);
}


401 i n t main ( i n t argc , char ∗ argv [ ] )
402 {
403 char filename [ 6 4 ] ;
404 char filenameOutput [ 6 4 ] ;
405 printf ( "N−body simulator , Barnes−Hut sequen t i a l C vers ion\n" ) ;
406 i f ( argc >= 2) //fi lename required as a minimum
407 {
408 strcpy ( filename , argv [ 1 ] ) ;
409 // i n i t i a l i s e to avoid warnings
410 f l o a t ∗pxT = 0 , ∗pyT = 0 , ∗pzT = 0 ;
411 f l o a t ∗vxT = 0 , ∗vyT = 0 , ∗vzT = 0 ;
412 f l o a t ∗mT = 0 ;
413 i f (1 == readinput ( filename , &n , &t , &dt , &pxT , &pyT , &pzT , &vxT , &vyT , &vzT , &mT ) )
414 {
415 printf ( " Input−f i l e %s suc c e s fu l l y loaded .\n" , filename ) ;
416 i f ( argc >= 4) //overr ide with t , dt parameters from argv
417 {
418 t = ( i n t ) strtol ( argv [ 2 ] , NULL , 10 ) ;
                dt = (float) strtod(argv[3], NULL);
            }
            printf("Parameters. n: %d t: %d dt: %f\n", n, t, dt);
            nnodes = n * 2;
            init(n);
            int count = n * sizeof(float);
            memcpy(px, pxT, count); memcpy(py, pyT, count); memcpy(pz, pzT, count); memcpy(m, mT, count);
            memcpy(vx, vxT, count); memcpy(vy, vyT, count); memcpy(vz, vzT, count);
            free(pxT); free(pyT); free(pzT); free(vxT); free(vyT); free(vzT); free(mT);
            // printBodies(n, 0, 0, 0);
            struct timeval begin, end, lap_begin, lap_end;
            gettimeofday(&begin, NULL);
            for (steps = 0; steps < t; steps++)
            {
                // slot t of each timings array holds the accumulated total
                timings0[t] = 0.0; timings1[t] = 0.0; timings2[t] = 0.0; timings3[t] = 0.0; timings4[t] = 0.0;
                double lap;
                gettimeofday(&lap_begin, NULL);
                r = calculateRadius();
                gettimeofday(&lap_end, NULL);
                lap = timevaldiffd(&lap_begin, &lap_end);
                if (DEBUG)
                    printf("Radius calculation done in ms: %f \n", lap);
                timings0[steps] = lap;

                gettimeofday(&lap_begin, NULL);
                buildTree();
                gettimeofday(&lap_end, NULL);
                lap = timevaldiffd(&lap_begin, &lap_end);
                if (DEBUG)
                    printf("Tree built (using %d internalNodes) in ms: %f \n", nnodes - usedInternal, lap);
                timings1[steps] = lap;

                gettimeofday(&lap_begin, NULL);
                calculateCentreOfMass();
                gettimeofday(&lap_end, NULL);
                lap = timevaldiffd(&lap_begin, &lap_end);
                if (DEBUG)
                    printf("Centre of mass calculated in ms: %f \n", lap);
                timings2[steps] = lap;

                gettimeofday(&lap_begin, NULL);
                calculateAcceleration();
                gettimeofday(&lap_end, NULL);
                lap = timevaldiffd(&lap_begin, &lap_end);
                if (DEBUG)
                    printf("Acceleration calculated in ms: %f \n", lap);
                timings3[steps] = lap;

                gettimeofday(&lap_begin, NULL);
                progressSystem();
                gettimeofday(&lap_end, NULL);
                lap = timevaldiffd(&lap_begin, &lap_end);
                if (DEBUG)
                    printf("System progressed in ms: %f \n energy at this point %f \n", lap, computeEnergy(n, px, py, pz, vx, vy, vz, m));
                timings4[steps] = lap;
            }
            gettimeofday(&end, NULL);
            long elapsed = timevaldiff(&begin, &end);
            printf("n: %d t: %d dt: %f, tot. time: %ld ms \n", n, t, dt, elapsed);
            if (DEBUG)
                printf("Intermediate timings: \nstep | radius | build tree | centerOfMass | calc acc | progress |\n");
            int i;
            for (i = 0; i < t; i++)
            {
                if (DEBUG)
                    printf("%5d |%7f |%10f |%12f |%9f |%9f |\n", i, timings0[i], timings1[i], timings2[i], timings3[i], timings4[i]);
                timings0[t] += timings0[i];
                timings1[t] += timings1[i];
                timings2[t] += timings2[i];
                timings3[t] += timings3[i];
                timings4[t] += timings4[i];
            }
            printf("cal. r %f tree %f CoM %f cal. acc %f prog. sys %f \n", timings0[t], timings1[t], timings2[t], timings3[t], timings4[t]);
            if (DEBUG)
                printf("System energy after simulation: %f\n", computeEnergy(n, px, py, pz, vx, vy, vz, m));
        }
        if (argc == 5) // output file requested
        {
            strcpy(filenameOutput, argv[4]);
            if (1 == writeoutput(filenameOutput, n, t, dt, px, py, pz, vx, vy, vz, m, ax, ay, az))
            {
                printf("Output-file %s successfully written.\n", filenameOutput);
            }
            else
            {
                printf("Error writing output-file %s, exiting.\n", filenameOutput);
                exit(-1);
            }
        }
        unInit();
    }
    else
    {
        printf("Incorrect number of parameters supplied. To run:\n");
        printf("%s <input-file> [<timesteps> <size of timestep>] [<output-file>]\n", argv[0]);
        return 1;
    }
    return 0;
}

// returns the octant (0..7) of body `leaf` relative to node `centre`;
// *x, *y, *z receive the sign (+1 or -1) of the offset along each axis
int getIndex(float *x, float *y, float *z, int centre, int leaf, float *px, float *py, float *pz)
{
    int index = 0;
    *x = -1.0f; *y = -1.0f; *z = -1.0f;
    if (px[centre] < px[leaf])
    {
        index = index | 1;
        *x = 1.0f;
    }
    if (py[centre] < py[leaf])
    {
        index = index | 2;
        *y = 1.0f;
    }
    if (pz[centre] < pz[leaf])
    {
        index = index | 4;
        *z = 1.0f;
    }
    return index;
}
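
The octant encoding used by getIndex above can be tried in isolation. The following is a minimal sketch, not part of the thesis code: the helper names octant_of and octant_sign are hypothetical, and the functions take plain coordinates instead of the global position arrays.

```c
/* Hypothetical helper (not in the thesis code): octant (0..7) of the point
   (x, y, z) relative to a cell centre (cx, cy, cz). Bit 0 is set when x lies
   above the centre, bit 1 for y, bit 2 for z, mirroring getIndex. */
int octant_of(float cx, float cy, float cz, float x, float y, float z)
{
    int index = 0;
    if (cx < x) index |= 1;
    if (cy < y) index |= 2;
    if (cz < z) index |= 4;
    return index;
}

/* Hypothetical helper: sign of the offset along one axis (0 = x, 1 = y,
   2 = z) for a given octant; +1.0f if the axis bit is set, else -1.0f. */
float octant_sign(int index, int axis)
{
    return (index & (1 << axis)) ? 1.0f : -1.0f;
}
```

With the centre at the origin, a body at (1, -1, 1) falls in octant 5 (bits 0 and 2 set), and the centre of the corresponding child cell is offset by +r, -r, +r along the three axes, which is how buildTree uses the x, y and z values written by getIndex.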
10 Barnes-Hut, OpenMP
/*
OpenMP Barnes-Hut implementation using C and arrays
by Truls & Jon
*/

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <float.h>
#include <sys/time.h>
#include <omp.h>
#include "input.c"
#include "output.c"
#include "energy.c"

#define NOCHILD -1
#define LOCK -2
#define THRESH 0.5f
#define G 1.0f
#define EPSILON 0.0000001f // softening factor
#define DEBUG 0

int n, t; // number of bodies/leaves, timesteps
float dt; // size of integration step
int nnodes; // number of nodes, leaves + internal
int volatile *childNodes;
int *sortValue; // represents spatial locations
int *sortingIntervals;
float *px, *py, *pz;
float *vx, *vy, *vz;
float *ax, *ay, *az;
volatile float *vol_px, *vol_py, *vol_pz, *vol_m;
double *timings0, *timings1, *timings2, *timings3, *timings4, *timings5;
float *m;
float r; // max radius of system
float *rmax;
float *rmin;
int intervals;
int steps, build_depth;
int errorInTree = 0; // non-zero if an error was detected in the tree

int volatile usedInternal; // keeps track of the smallest value of internal node index
int INIT = 0;

int threads;

void progressSystem(void);

void printTree(void);
inline float sqdist(float x1, float y1, float z1, float x2, float y2, float z2);
int getIndex(float *x, float *y, float *z, int centre, int leaf, float *px, float *py, float *pz);


int init(int nT)
{
    #pragma omp parallel shared(threads) // pragma required to get number of threads
    {
        threads = omp_get_num_threads();
    }
    printf("Using %d threads \n", threads);
    px = (float *) malloc(2 * nT * sizeof(float));
    py = (float *) malloc(2 * nT * sizeof(float));
    pz = (float *) malloc(2 * nT * sizeof(float));
    vol_px = (volatile float *) malloc(2 * nT * sizeof(volatile float));
    vol_py = (volatile float *) malloc(2 * nT * sizeof(volatile float));
    vol_pz = (volatile float *) malloc(2 * nT * sizeof(volatile float));
    vol_m = (volatile float *) malloc(2 * nT * sizeof(volatile float));
    vx = (float *) malloc(nT * sizeof(float));
    vy = (float *) malloc(nT * sizeof(float));
    vz = (float *) malloc(nT * sizeof(float));
    ax = (float *) malloc(nT * sizeof(float));
    ay = (float *) malloc(nT * sizeof(float));
    az = (float *) malloc(nT * sizeof(float));
    m = (float *) malloc(2 * nT * sizeof(float));
    sortValue = (int *) malloc(nT * sizeof(int));
    childNodes = (int *) malloc(((nT + 1) * 8) * sizeof(int)); // plus one used for this version
    timings0 = (double *) malloc(sizeof(double) * (t + 1));
    timings1 = (double *) malloc(sizeof(double) * (t + 1));
    timings2 = (double *) malloc(sizeof(double) * (t + 1));
    timings3 = (double *) malloc(sizeof(double) * (t + 1));
    timings4 = (double *) malloc(sizeof(double) * (t + 1));
    timings5 = (double *) malloc(sizeof(double) * (t + 1));
    rmax = (float *) malloc(threads * sizeof(float));
    rmin = (float *) malloc(threads * sizeof(float));
    int i;
    for (i = 0; i < threads; i++)
    {
        rmax[i] = FLT_MIN;
        rmin[i] = FLT_MAX;
    }
    if ((px != NULL) && (py != NULL) && (pz != NULL) && (vx != NULL) && (vy != NULL) && (vz != NULL)
        && (m != NULL) && (ax != NULL) && (ay != NULL) && (az != NULL) && (childNodes != NULL)
        && (rmax != NULL) && (rmin != NULL) && (timings0 != NULL) && (timings1 != NULL)
        && (timings2 != NULL) && (timings3 != NULL) && (timings4 != NULL) && (timings5 != NULL))
        INIT = 1;
    return INIT;
}

int unInit(void)
{
    if (INIT == 1)
    {
        free(px); free(py); free(pz); free(vx); free(vy); free(vz); free(m);
        free(ax); free(ay); free(az);
        free(rmax); free(rmin);
        free(timings0); free(timings1); free(timings2); free(timings3); free(timings4); free(timings5);
        free((int *) childNodes); free((float *) vol_px); free((float *) vol_py); free((float *) vol_pz);
        // also release the remaining buffers allocated in init
        free((float *) vol_m); free(sortValue);
    }
    return INIT;
}

// naive radius calculation, returns biggest abs value of a coordinate number
// note: for most sizes of n, this is slower than serial version
float calculateRadius()
{
    float result = -1.0f;
    int i, tid;
    if (1 == INIT)
    {
        float rmaxT = 0.0f;
        float rminT = 0.0f;
        #pragma omp parallel firstprivate(rmaxT, rminT) private(i, tid)
        {
            tid = omp_get_thread_num();
            // each thread scans its own strided share of the bodies
            for (i = tid; i < n; i += threads)
            {
                rmaxT = (px[i] > rmaxT) ? px[i] : rmaxT;
                rmaxT = (py[i] > rmaxT) ? py[i] : rmaxT;
                rmaxT = (pz[i] > rmaxT) ? pz[i] : rmaxT;
                rminT = (px[i] < rminT) ? px[i] : rminT;
                rminT = (py[i] < rminT) ? py[i] : rminT;
                rminT = (pz[i] < rminT) ? pz[i] : rminT;
            }
            // publish the per-thread extremes, also when the thread had no bodies
            rmax[tid] = rmaxT;
            rmin[tid] = rminT;
        } // end omp
        // combine the partial results after the implicit barrier of the parallel
        // region; this is race-free and works for any number of threads
        for (i = 1; i < threads; i++)
        {
            rmax[0] = (rmax[0] > rmax[i]) ? rmax[0] : rmax[i];
            rmin[0] = (rmin[0] < rmin[i]) ? rmin[0] : rmin[i];
        }
        result = (fabsf(rmax[0]) > fabsf(rmin[0])) ? fabsf(rmax[0]) : fabsf(rmin[0]);
    }
    return result;
}

void buildTree()
{
    // build root
    float x = 0.0f, y = 0.0f, z = 0.0f;
    // values to change position
    usedInternal = nnodes - 1;
    int leaf, current, child; // the leaf we insert, the current node, and the current node's child
    current = usedInternal;
    int i;
    for (i = 0; i < 8; i++)
    {
        childNodes[(current - n) * 8 + i] = -1;
    }
    px[current] = 0.0f;
    py[current] = 0.0f;
    pz[current] = 0.0f;
    m[current] = 0.0f;
    // insert leaves
    for (leaf = 0; leaf < n; leaf++)
    {
        child = nnodes - 1;
        float r_this = r;
        int index = 0;
        while (child >= n)
        {
            current = child;
            index = getIndex(&x, &y, &z, current, leaf, px, py, pz);
            r_this *= 0.5;
            child = childNodes[(current - n) * 8 + index];
        }
        if (child == NOCHILD)
        {
            childNodes[(current - n) * 8 + index] = leaf;
        }
        else // there is a leaf in the current position, i.e. child
        {
            int originalInternal = (current - n) * 8 + index; // remember this node through iteration
            int altNode = -1; // make sure not to insert node first time
            do
            {
                // we have the two leaves, namely child and leaf
                int newInternal = --usedInternal; // a new internal node is taken from the pool
                if (altNode < newInternal)
                    altNode = newInternal;
                px[newInternal] = px[current] + r_this * x;
                py[newInternal] = py[current] + r_this * y;
                pz[newInternal] = pz[current] + r_this * z;
                m[newInternal] = 0.0f;
                // do not insert the new internal node in the first loop
                if (altNode != newInternal)
                {
                    childNodes[(current - n) * 8 + index] = newInternal;
                }
                int j;
                for (j = 0; j < 8; j++) // create new empty nodes for the newly created internal node
                {
                    childNodes[(newInternal - n) * 8 + j] = NOCHILD;
                }

                index = getIndex(&x, &y, &z, newInternal, child, px, py, pz);
                childNodes[(newInternal - n) * 8 + index] = child; // child is inserted
                current = newInternal;
                index = getIndex(&x, &y, &z, newInternal, leaf, px, py, pz);
                child = childNodes[(current - n) * 8 + index];
                r_this *= 0.5;
            }
            while (child > NOCHILD); // keep looping while there is no empty space for the leaf
            childNodes[(current - n) * 8 + index] = leaf;
            childNodes[originalInternal] = altNode;
        }
    }
}

int calculateMassForOne(int internal)
{
    int cache_children[8];
    int missing = 0;
    // iterate through children, first reset positions used in tree building.
    float cent_mass = 0.0f;
    float cent_pos_x = 0.0f;
    float cent_pos_y = 0.0f;
    float cent_pos_z = 0.0f;
    int k = 0;
    for (k = 0; k < 8; k++)
    {
        cache_children[k] = -1;
    }
    for (k = 0; k < 8; k++)
    {
        int currentChild = childNodes[(internal - n) * 8 + k];
        if (currentChild >= 0)
        {
            if (vol_m[currentChild] > 0)
            {
                float tempx = cent_pos_x * cent_mass + px[currentChild] * vol_m[currentChild];
                float tempy = cent_pos_y * cent_mass + py[currentChild] * vol_m[currentChild];
                float tempz = cent_pos_z * cent_mass + pz[currentChild] * vol_m[currentChild];
                cent_mass += vol_m[currentChild];
                cent_pos_x = tempx / cent_mass;
                cent_pos_y = tempy / cent_mass;
                cent_pos_z = tempz / cent_mass;
            }
            else
            {
                // mass not ready yet: remember the child and retry below
                cache_children[k] = currentChild;
                missing++;
            }
        }
    }
    // spin until every remaining child has had its mass published by another thread
    while (missing > 0)
    {
        for (k = 0; k < 8; k++)
        {
            int currentChild = cache_children[k];
            if (currentChild >= 0)
            {
                float temp_m = vol_m[currentChild];
                if (temp_m > 0)
                {
                    float tempx = cent_pos_x * cent_mass + px[currentChild] * vol_m[currentChild];
                    float tempy = cent_pos_y * cent_mass + py[currentChild] * vol_m[currentChild];
                    float tempz = cent_pos_z * cent_mass + pz[currentChild] * vol_m[currentChild];
                    cent_mass += vol_m[currentChild];
                    cent_pos_x = tempx / cent_mass;
                    cent_pos_y = tempy / cent_mass;
                    cent_pos_z = tempz / cent_mass;
                    cache_children[k] = NOCHILD;
                    missing--;
                }
                else if (temp_m < -0.5f)
                {
                    cache_children[k] = NOCHILD;
                    missing--;
                }
            }
        }
    }
    px[internal] = cent_pos_x;
    py[internal] = cent_pos_y;
    pz[internal] = cent_pos_z;
    vol_m[internal] = cent_mass;
    return 1;
}
void calculateCentreOfMass()
{

    int counter = usedInternal;
    int i;
    #pragma omp parallel for
    for (i = usedInternal; i < nnodes; i++)
    {
        vol_m[i] = 0.0f;
        px[i] = 0.0f;
        py[i] = 0.0f;
        pz[i] = 0.0f;
    }
    // copy the leaf masses only; leaves occupy indices 0..n-1
    #pragma omp parallel for
    for (i = 0; i < n; i++)
    {
        vol_m[i] = m[i];
    }

    #pragma omp parallel for firstprivate(counter)
    for (i = counter; i < nnodes; i++)
    {
        calculateMassForOne(i);
    }

    #pragma omp parallel for
    for (i = usedInternal; i < nnodes; i++)
    {
        m[i] = vol_m[i];
    }
}
float sqdist(float x1, float y1, float z1, float x2, float y2, float z2)
{
    return (x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) + (z1 - z2) * (z1 - z2);
}

void calculateAcceleration()
{
    // traverse tree from root down, updating acceleration when reaching a leaf or when the threshold is reached.
    int leaf;
    long visited;
    int currentNode;
    int currentChild;
    int depth;
    int MAXDEPTH = 32;
    int parent[MAXDEPTH];
    float rdepth[MAXDEPTH];
    int i;
    rdepth[0] = r * r * 4;
    for (i = 1; i < MAXDEPTH; i++)
    {
        rdepth[i] = rdepth[i - 1] * 0.25f;
    }
    #pragma omp parallel for private(leaf, depth, parent, currentNode, currentChild, visited)
    for (leaf = 0; leaf < n; leaf++)
    {
        int maxdepth = 0;
        visited = 1;
        currentNode = nnodes - 1; // start at root
        currentChild = 0;
        depth = 0;
        // temporary accumulators; these must be float, as int would truncate every contribution
        float acc_x = 0.0f; float acc_y = 0.0f; float acc_z = 0.0f;
        do // while there are still unvisited branches of the tree
        {
            while (currentChild < 8) // go up a level when all 8 children are visited
            {
                do
                {
                    parent[depth] = currentNode;
                    currentNode = childNodes[(currentNode - n) * 8 + currentChild];
                    depth++;
                    if (depth > maxdepth)
                        maxdepth = depth;
                    visited = visited << 3;
                    visited = visited | currentChild;
                    currentChild = 0; // remark that this will be reset if the do loop is exited.
                }
                // descend while the current node is not a leaf and is too close to be approximated
                while ((currentNode >= n) && (THRESH / 2 < (rdepth[depth] / sqdist(px[currentNode], py[currentNode], pz[currentNode], px[leaf], py[leaf], pz[leaf]))));
                // if the node is not empty, calculate and increment acceleration of the leaf with the force of the current node.
                if (currentNode != NOCHILD)
                {
                    float rtemp = EPSILON + sqdist(px[leaf], py[leaf], pz[leaf], px[currentNode], py[currentNode], pz[currentNode]);
                    float temp = G * m[currentNode] / sqrtf(rtemp * rtemp * rtemp);
                    acc_x = acc_x + temp * (px[currentNode] - px[leaf]);
                    acc_y = acc_y + temp * (py[currentNode] - py[leaf]);
                    acc_z = acc_z + temp * (pz[currentNode] - pz[leaf]);
                }
                currentChild = visited & 7; // remark, current child somewhat reset
                visited = visited >> 3;
                depth--;
                currentNode = parent[depth];
                currentChild++;
            }
            do // make sure the current child does not exceed 7 if we are at two or more succeeding far-right branches
            {
                depth--;
                currentChild = visited & 7;
                visited = visited >> 3;
                currentNode = parent[depth];
                currentChild++;
            }
            while (currentChild > 7 && depth >= 0);
        } while (depth >= 0);
        if (steps > 0)
        {
            float half_dt = dt * 0.5f;
            vx[leaf] += (acc_x - ax[leaf]) * half_dt;
            vy[leaf] += (acc_y - ay[leaf]) * half_dt;
            vz[leaf] += (acc_z - az[leaf]) * half_dt;
        }
        ax[leaf] = acc_x;
        ay[leaf] = acc_y;
        az[leaf] = acc_z;
    }
}

void progressSystem(void)
{

    int leaf;
    #pragma omp parallel for private(leaf)
    for (leaf = 0; leaf < n; leaf++)
    {
        float dth = dt * 0.5f;
        float dvx, dvy, dvz;
        float vhx, vhy, vhz;

        // half-step velocity change from the current acceleration
        dvx = ax[leaf] * dth;
        dvy = ay[leaf] * dth;
        dvz = az[leaf] * dth;

        vhx = vx[leaf] + dvx;
        vhy = vy[leaf] + dvy;
        vhz = vz[leaf] + dvz;

        // full-step position update using the half-step velocity
        px[leaf] += vhx * dt;
        py[leaf] += vhy * dt;
        pz[leaf] += vhz * dt;

        vx[leaf] = vhx + dvx;
        vy[leaf] = vhy + dvy;
        vz[leaf] = vhz + dvz;
    }
}

void printTree()
{
    int last[32000];
    int next[32000];
    int i = 0;
    for (i = 0; i < 32000; i++)
    {
        next[i] = -1;
        last[i] = -1;
    }
    i = 0;
    int depth = 0;
    int count = 0;
    printf("\n depth: %d ", depth);
    int currentInternal;
    int iteration = 0;
    next[0] = nnodes;
    do
    {
        int j;
        do
        {
            currentInternal = next[iteration];
            printf(" || %d || ", currentInternal);
            for (j = 0; j < 8; j++)
            {
                int child = childNodes[(currentInternal - n) * 8 + j];
                printf("%d ", child);
                if (child > n)
                {
                    last[count] = child;
                    count++;
                }
            }
            i++;
            iteration++;
        } while (next[iteration] > 0);
        // move the collected children to next and clear last, over the whole buffers
        memcpy(next, last, sizeof(int) * 32000);
        for (j = 0; j < 32000; j++)
        {
            last[j] = -1;
        }
        iteration = 0;
        depth++;
        count = 0;
        printf("\n \n depth: %d \n", depth);
        if (next[0] == -1)
        {
            i = n;
            printf("done! \n");
        }
    }
    while (i < n);
}

// offset1 truncates from start, offset2 from end, print with acceleration?
void printBodies(int count, int offset1, int offset2, int withAcc)
{
    if ((1 == INIT) && (n > 0))
    {
        int i, a, b;
        a = (offset1 >= count) ? 0 : offset1;
        b = (offset2 >= count) ? 0 : count - offset2;
        printf("n: (pos) (vel) [acc] (m)\n");
        for (i = a; i < b; i++)
        {
            printf("%d: (%f %f %f) (%f %f %f) ", i, px[i], py[i], pz[i], vx[i], vy[i], vz[i]);
            if (withAcc > 0) printf("[%f %f %f] ", ax[i], ay[i], az[i]);
            printf("(%f)\n", m[i]);
        }
        fflush(stdout);
    }
}

// returns an integer number of elapsed ms
long timevaldiff(struct timeval *time1, struct timeval *time2)
{
    long msec;
    msec = (time2->tv_sec - time1->tv_sec) * 1000;
    msec += (time2->tv_usec - time1->tv_usec) / 1000;
    return msec;
}

// returns a double number of elapsed ms for shorter timings
double timevaldiffd(struct timeval *time1, struct timeval *time2)
{
    double usec;
    usec = (double) (time2->tv_sec - time1->tv_sec) * 1000000;
    usec += (double) (time2->tv_usec - time1->tv_usec);
    return (usec / 1000.0);
}

// builds a sort key for a body: 3 bits per level, one octant choice per level down to depth
int getSortValue(int leaf, int depth)
{
    float Tr = r;
    float grid_posx = 0.0f;
    float grid_posy = 0.0f;
    float grid_posz = 0.0f;
    float x = px[leaf];
    float y = py[leaf];
    float z = pz[leaf];
    int index = 0;
    int currentDepth;
    for (currentDepth = 0; currentDepth < depth; currentDepth++)
    {
        int temp_index = 0;
        if (grid_posx > x)
        {
            temp_index = temp_index | 1;
            grid_posx = grid_posx - Tr * 0.5;
        }
        else
        {
            grid_posx = grid_posx + Tr * 0.5;
        }
        if (grid_posy > y)
        {
            temp_index = temp_index | 2;
            grid_posy = grid_posy - Tr * 0.5;
        }
        else
        {
            grid_posy = grid_posy + Tr * 0.5;
        }
        if (grid_posz > z)
        {
            temp_index = temp_index | 4;
            grid_posz = grid_posz - Tr * 0.5;
        }
        else
        {
            grid_posz = grid_posz + Tr * 0.5;
        }

        index = index << 3;
        index = index | temp_index;
        Tr *= 0.5;
    }
    return index;
}

int main(int argc, char *argv[])
{
    char filename[64];
    char filenameOutput[64];
    printf("N-body simulator, Barnes-Hut OpenMP C version\n");
    if (argc >= 2) // filename required as a minimum
    {
        strcpy(filename, argv[1]);
        // initialise to avoid warnings
        float *pxT = 0, *pyT = 0, *pzT = 0;
        float *vxT = 0, *vyT = 0, *vzT = 0;
        float *mT = 0;
        if (1 == readinput(filename, &n, &t, &dt, &pxT, &pyT, &pzT, &vxT, &vyT, &vzT, &mT))
        {
            printf("Input-file %s successfully loaded.\n", filename);
            if (argc >= 4) // override with t, dt parameters from argv
            {
                t = (int) strtol(argv[2], NULL, 10);
                dt = (float) strtod(argv[3], NULL);
            }
            printf("Parameters. n: %d t: %d dt: %f\n", n, t, dt);
            nnodes = n * 2;
            init(n);
            int count = (n) * sizeof(float);
            memcpy(px, pxT, count); memcpy(py, pyT, count); memcpy(pz, pzT, count); memcpy(m, mT, count);
            memcpy(vx, vxT, count); memcpy(vy, vyT, count); memcpy(vz, vzT, count);
            free(pxT); free(pyT); free(pzT); free(vxT); free(vyT); free(vzT); free(mT);

            struct timeval begin, end, lap_begin, lap_end;
            gettimeofday(&begin, NULL);

            for (steps = 0; steps < t; steps++)
            {
                double lap;
                gettimeofday(&lap_begin, NULL);
                r = calculateRadius();
                gettimeofday(&lap_end, NULL);
                lap = timevaldiffd(&lap_begin, &lap_end);
                if (DEBUG)
                    printf("Radius (r=%f) calculation done in ms: %f \n", r, lap);
                timings0[steps] = lap;

                gettimeofday(&lap_begin, NULL);
                buildTree();
                gettimeofday(&lap_end, NULL);
                lap = timevaldiffd(&lap_begin, &lap_end);
                if (DEBUG)
                    printf("Tree built using %d Internal Nodes, in ms: %f \n", nnodes - usedInternal, lap);
                timings1[steps] = lap;

                gettimeofday(&lap_begin, NULL);
                calculateCentreOfMass();
                gettimeofday(&lap_end, NULL);
                lap = timevaldiffd(&lap_begin, &lap_end);
                if (DEBUG)
                    printf("Centre of mass calculated in ms: %f \n", lap);
                timings2[steps] = lap;

                gettimeofday(&lap_begin, NULL);
                calculateAcceleration();
                gettimeofday(&lap_end, NULL);
                lap = timevaldiffd(&lap_begin, &lap_end);
                if (DEBUG)
                    printf("Acceleration calculated in ms: %f \n", lap);
                timings3[steps] = lap;

                gettimeofday(&lap_begin, NULL);
                progressSystem();
                gettimeofday(&lap_end, NULL);
                lap = timevaldiffd(&lap_begin, &lap_end);
                if (DEBUG)
                    printf("System progressed in ms: %f \n energy at this point %f \n", lap, computeEnergy(n, px, py, pz, vx, vy, vz, m));
                timings4[steps] = lap;
            }
            gettimeofday(&end, NULL);
            long elapsed = timevaldiff(&begin, &end);
            if (DEBUG)
                printf("Entire simulation finished in %ld ms \n energy of system %f \n", elapsed, computeEnergy(n, px, py, pz, vx, vy, vz, m));
            if (DEBUG)
                printf("Intermediate timings: \nstep build tree | centerOfMass | calc acc | progress |\n");
            int i;
            // slot t of each timings array holds the accumulated total
            timings0[t] = 0.0; timings1[t] = 0.0; timings2[t] = 0.0; timings3[t] = 0.0; timings4[t] = 0.0;
            if (errorInTree)
            {
                printf("Program finished but there was an error.\n");
            }
            for (i = 0; i < t; i++)
            {
                if (DEBUG)
                    printf("|%7d |%8f |%10f |%12f |%9f |%9f |\n", i, timings0[i], timings1[i], timings2[i], timings3[i], timings4[i]);
                timings0[t] += timings0[i];
                timings1[t] += timings1[i];
                timings2[t] += timings2[i];
                timings3[t] += timings3[i];
                timings4[t] += timings4[i];
            }
            float energyTemp = 0.0f;
            if (DEBUG)
                energyTemp = computeEnergy(n, px, py, pz, vx, vy, vz, m);
            printf("n: %d, t: %d, dt: %f, threads: %d, time: %ld ms, e: %f\n", n, t, dt, threads, elapsed, energyTemp);
            printf("sum |radius | build tree | centerOfMass | calc acc | progress |\n");
            printf(" |%8f |%10f |%12f |%9f |%9f |\n", timings0[t], timings1[t], timings2[t], timings3[t], timings4[t]);
            if (argc == 5) // output file requested
            {
                strcpy(filenameOutput, argv[4]);
                if (1 == writeoutput(filenameOutput, n, t, dt, px, py, pz, vx, vy, vz, m, ax, ay, az))
                {
                    printf("Output-file %s successfully written.\n", filenameOutput);
                }
                else
                {
                    printf("Error writing output-file %s, exiting.\n", filenameOutput);
                    exit(-1);
                }
            }
            unInit();
        }
    }
    return 0;
}
// returns the octant (0..7) of body `leaf` relative to node `centre`;
// *x, *y, *z receive the sign (+1 or -1) of the offset along each axis
int getIndex(float *x, float *y, float *z, int centre, int leaf, float *px, float *py, float *pz)
{
    int index = 0;
    *x = -1.0f; *y = -1.0f; *z = -1.0f;
    if (px[centre] < px[leaf])
    {
        index = index | 1;
        *x = 1.0f;
    }
    if (py[centre] < py[leaf])
    {
        index = index | 2;
        *y = 1.0f;
    }
    if (pz[centre] < pz[leaf])
    {
        index = index | 4;
        *z = 1.0f;
    }
    return index;
}
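
The comment on calculateRadius in the listing above notes that the hand-written parallel radius calculation is slower than the serial version for most n. One alternative, shown here only as a sketch and not as the thesis implementation, is to let OpenMP combine the per-thread maxima with a max reduction. This assumes a compiler supporting OpenMP 3.1 or later, where min and max reductions are available in C; the function name max_abs_coordinate and its parameter list are hypothetical.

```c
/* Hypothetical alternative to calculateRadius (not from the thesis):
   returns the largest absolute coordinate value over n bodies.
   The reduction(max:r) clause gives every thread a private copy of r and
   combines the copies when the loop ends. Without OpenMP support the
   pragma is ignored and the loop simply runs serially. */
float max_abs_coordinate(const float *px, const float *py, const float *pz, int n)
{
    float r = 0.0f;
    int i;
    #pragma omp parallel for reduction(max:r)
    for (i = 0; i < n; i++)
    {
        float ax = (px[i] < 0.0f) ? -px[i] : px[i];
        float ay = (py[i] < 0.0f) ? -py[i] : py[i];
        float az = (pz[i] < 0.0f) ? -pz[i] : pz[i];
        if (ax > r) r = ax;
        if (ay > r) r = ay;
        if (az > r) r = az;
    }
    return r;
}
```

A call such as max_abs_coordinate(px, py, pz, n) would take the place of the per-thread rmax and rmin arrays and needs no knowledge of the thread count. Whether it is actually faster than the serial loop depends on n and on the machine, and has not been measured here.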
11 Discarded sorting function, Barnes-Hut, OpenMP
// unfinished sorting functions saved for reference and used to sort data for the CUDA Barnes-Hut


void countingSort(void)
{
    int b, i;
    int par_level = build_depth;
    int no_buckets = pow(8, par_level); // index starts at 1
    intervals = no_buckets;
    sortingIntervals = (int *) malloc((no_buckets + 1) * sizeof(int));
    int temp_sortingIntervals[threads][no_buckets + 1];
    int buckets[no_buckets];
    int temp_buckets[threads][no_buckets];
    int *sort;
    sort = (int *) malloc(n * sizeof(int));
    float *Tpx = (float *) malloc(n * sizeof(float));
    float *Tpy = (float *) malloc(n * sizeof(float));
    float *Tpz = (float *) malloc(n * sizeof(float));
    float *Tvx = (float *) malloc(n * sizeof(float));
    float *Tvy = (float *) malloc(n * sizeof(float));
    float *Tvz = (float *) malloc(n * sizeof(float));
    float *Tm = (float *) malloc(n * sizeof(float));

    for (b = 0; b < no_buckets; b++)
    {
        buckets[b] = 0;
        for (i = 0; i < threads; i++)
        {
            temp_buckets[i][b] = 0;
        }
    }
    int total;
    #pragma omp parallel private(total, b, i) shared(buckets, temp_buckets, temp_sortingIntervals, sortingIntervals)
    {
        int tid = omp_get_thread_num();
        // each thread counts its own strided share of the bodies
        for (i = tid; i < n; i += threads) // adding up index-occurrences for each bucket
        {
            int this_index = sortValue[i];
            temp_buckets[tid][this_index]++;
        }

        temp_sortingIntervals[tid][0] = 0;
        total = 0;
        for (b = 0; b < no_buckets; b++) // using size of buckets to determine sorting intervals
        {
            total += temp_buckets[tid][b];
            temp_sortingIntervals[tid][b + 1] = total;
        }
    }

    #pragma omp parallel for private(i) shared(sortingIntervals, temp_sortingIntervals)
    for (b = 0; b < no_buckets + 1; b++)
    {
        sortingIntervals[b] = 0;
        for (i = 0; i < threads; i++)
        {
            sortingIntervals[b] += temp_sortingIntervals[i][b];
        }
    }
    // cannot easily be implemented as parallel in OpenMP
    for (i = 0; i < n; i++) // might be parallelized with intervals
    {
        int temp = 0;
        buckets[sortValue[i]] += 1;
        temp += buckets[sortValue[i]];
        sort[i] = -temp + sortingIntervals[sortValue[i] + 1];
    }
    #pragma omp parallel for
    for (i = 0; i < n; i++)
    {
        int sortedIndex = sort[i];
        Tpx[sortedIndex] = px[i];
        Tpy[sortedIndex] = py[i];
        Tpz[sortedIndex] = pz[i];
        Tvx[sortedIndex] = vx[i];
        Tvy[sortedIndex] = vy[i];
        Tvz[sortedIndex] = vz[i];
        Tm[sortedIndex] = m[i];
    }
    #pragma omp parallel for private(i)
    for (i = 0; i < n; i++)
    {
        px[i] = Tpx[i];
        py[i] = Tpy[i];
        pz[i] = Tpz[i];
        vx[i] = Tvx[i];
        vy[i] = Tvy[i];
        vz[i] = Tvz[i];
        m[i] = Tm[i];
    }
    // release the temporary buffers
    free(sort);
    free(Tpx); free(Tpy); free(Tpz); free(Tvx); free(Tvy); free(Tvz); free(Tm);
}

/* MISSING PART OF MAIN

gettimeofday(&lap_begin, NULL);
#pragma omp parallel for
for (ii = 0; ii < n; ii++)
{
    sortValue[ii] = getSortValue(ii, build_depth); // hardcoded depth, if we have more than 64 threads it does not work, also needs to be changed for small problems.
}
gettimeofday(&lap_end, NULL);
lap = timevaldiffd(&lap_begin, &lap_end);
if (DEBUG)
    printf("Found sortValues in ms: %f \n", lap);
timings1[steps] = lap;

gettimeofday(&lap_begin, NULL);
countingSort();
gettimeofday(&lap_end, NULL);
lap = timevaldiffd(&lap_begin, &lap_end);
if (DEBUG)
    printf("Sorting done in ms: %f \n", lap);
timings2[steps] = lap;
*/

void buildTree() // parallel version
{
    // build root
    float x = 0.0f, y = 0.0f, z = 0.0f; // values to change position
    usedInternal = nnodes;
    // top building start
    int current, j, depth, i, index = 0;
    int childrenNeeded = 0;
    int currentDepth;
    float rtemp = r;
    depth = build_depth;
    current = nnodes;
    vol_px[current] = 0.0f;
    vol_py[current] = 0.0f;
    vol_pz[current] = 0.0f;

    for (i = 0; i < (n + 1) * 8; i++)
    {
        childNodes[i] = NOCHILD;
    }
    childNodes[(nnodes - n + 1) * 8] = current;
    for (currentDepth = 1; currentDepth <= depth; currentDepth++)
    {
        childrenNeeded += pow(8, (currentDepth));
    }
    currentDepth = 1;
    int childrenForNext = 8;
    for (j = 0; j < childrenNeeded; j++) // start after root
    {
        usedInternal--;
        childNodes[(n + 1) * 8 - j - 1] = usedInternal;
        int parent = childNodes[(n + 1) * 8 - j / 8];
        if (j / childrenForNext == 1)
        {
            currentDepth++;
            childrenForNext += pow(8, (currentDepth));
            rtemp *= 0.5;
        }
        index = 7 - j % 8;
        int xT = 1 & index;
        int yT = 2 & index;
        yT = yT >> 1;
        int zT = 4 & index;
        zT = zT >> 2;
        x = (float) xT;
        y = (float) yT;
        z = (float) zT;
        vol_px[usedInternal] = vol_px[parent] + x * rtemp - rtemp * 0.5;
        vol_py[usedInternal] = vol_py[parent] + y * rtemp - rtemp * 0.5;
        vol_pz[usedInternal] = vol_pz[parent] + z * rtemp - rtemp * 0.5;
    }
    // top building end
    int child;
    int tid;
    float r_this;
    // insert leaves
    int temp_usedInternal;
    #pragma omp parallel private(child, r_this, tid, i, temp_usedInternal) firstprivate(current, index, x, y, z) shared(childNodes)
    {
        tid = omp_get_thread_num();
        int tjeck = (usedInternal - n) / threads;
        temp_usedInternal = usedInternal - tid * ((usedInternal - n) / threads);
        int step = intervals / threads;
        int start = sortingIntervals[tid * step];
        int end = sortingIntervals[(tid + 1) * step];
        #pragma omp barrier
        for (i = start; i < end; i++)
        {
            child = nnodes;
            depth = 0;
            r_this = r;
            current = nnodes;
            while (child >= n)
            {
                current = child;
                // index = getIndex(x, y, z, current, i, px, py, pz);
                index = getIndex(&x, &y, &z, current, i, px, py, pz, vol_px, vol_py, vol_pz);
                r_this *= 0.5;
                depth++;
195 depth++;
196 child = childNodes [ ( current−n ) ∗8+index ] ;
197
198 }
199 i f ( child == NOCHILD )
200 {
201 childNodes [ ( current−n ) ∗8+index ]=i ;
202 }
203 e l s e // there i s a l e a f in the current pos i t ion , i . e . ch i ld
204 {
205 i n t originalInternal=(current−n ) ∗8+index ; // remember t h i s node through i t e r a t i o n
206 i n t alt=−1; //make sure not to i n s e r t node f i r s t time
207 do
208 {
209 // we have the two l e a f s namely ch i ld and i ;
210 i n t newInternal ;
211 newInternal = −−temp_usedInternal ;
212 tjeck −−;
213 i f ( tjeck < 0) { errorInTree=true ; }
214 alt = ( newInternal > alt ) ?newInternal : alt ;
215 vol_px [ newInternal ]=vol_px [ current ]+r_this∗x ;
216 vol_py [ newInternal ]=vol_py [ current ]+r_this∗y ;
217 vol_pz [ newInternal ]=vol_pz [ current ]+r_this∗z ;
218 m [ newInternal ]=0 . 0 f ;
219 // do not i n s e r t the new in t e r na l node in the f i r s t loop
220 i f ( alt != newInternal )
221 {
222 childNodes [ ( current−n ) ∗8+index ]=newInternal ;
223 }
224 depth++;
225 i n t j ;
226 fo r ( j=0;j<8;j++) // c r ea t e new empty nodes fo r the newly created i n t e rn a l ←↩
node
227 {
228 childNodes [ ( newInternal−n ) ∗8+j ]=NOCHILD ;
229 }
230
231 i f ( alt != newInternal )
232 {
233 childNodes [ ( current−n ) ∗8+index ] = newInternal ;
234 }
235 index = getIndex(&x ,&y ,&z , newInternal , child , px , py , pz , vol_px , vol_py , vol_pz ) ;
236
237 childNodes [ ( newInternal−n ) ∗8+index ] = child ; // ch i ld i s in se r t ed
238 current=newInternal ;
239 index = getIndex(&x ,&y ,&z , newInternal , i , px , py , pz , vol_px , vol_py , vol_pz ) ;
240
241 child=childNodes [ ( current−n ) ∗8+index ] ;
242 r_this ∗=0 . 5 ;
243
244 }
245 while ( child > NOCHILD ) ; // while there i s no empty space fo r the l ea f , keep on ←↩
looping
246 childNodes [ ( current−n ) ∗8+index ]=i ;
247 childNodes [ originalInternal ]=alt ;
248
249 }
250 }
251
252 i f ( ( tid+1) == threads )
253 {
254 usedInternal = temp_usedInternal ;
255 }
256 }
257
61
11. Discarded sorting function, Barnes-Hut, OpenMP
258
259 }
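
The reordering step at the start of this listing (scatter every body attribute into a temporary array at its sorted index, then copy the temporary array back) can be isolated as a small helper. The following is a minimal sketch in plain C under stated assumptions: the name `scatter_reorder` is hypothetical and does not appear in the thesis code, the helper handles a single array instead of the seven attribute arrays, the OpenMP pragmas are left out, and `perm` is assumed to be a permutation of 0..n-1, as the `sort` array is expected to be.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the thesis code.
   Reorders a[0..n-1] so that the element at position i ends up at
   position perm[i]. perm is assumed to be a permutation of 0..n-1.
   Returns 0 on success and -1 if the temporary buffer cannot be allocated. */
static int scatter_reorder(float *a, const int *perm, int n)
{
    float *tmp = malloc((size_t) n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* Scatter pass: corresponds to Tpx[sort[i]] = px[i] in the listing. */
    for (int i = 0; i < n; i++)
    {
        assert(perm[i] >= 0 && perm[i] < n); /* guard against a bad index */
        tmp[perm[i]] = a[i];
    }

    /* Copy-back pass: corresponds to px[i] = Tpx[i] in the listing. */
    for (int i = 0; i < n; i++)
    {
        a[i] = tmp[i];
    }

    free(tmp);
    return 0;
}
```

For example, with `a = {10, 20, 30, 40}` and `perm = {2, 0, 3, 1}`, the array becomes `{20, 40, 10, 30}`. Because each `perm[i]` is distinct, no two iterations of the scatter pass write to the same element, which is why the listing can run that loop as a parallel for without further synchronization.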
12 Barnes-Hut, CUDA
/*
CUDA implementation of the Barnes-Hut N-body algorithm using arrays
By Jon and Truls
Parts based on Burtscher's CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm (v2.1)
*/

// C includes
#include <stdio.h> // printf
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <float.h>
#include <sys/time.h>
// CUDA includes
#include "cuda.h"
#include "book.h"
// Own includes
#include "input.c"
#include "output.c"
#include "energy.c"

// simulation and kernel launch defines
#define DEBUG 1
#define NOCHILD -1
#define LOCK -2
#define CHILDVISITED -3
#define EPSILON 0.000001f
#define MAXDEPTH 32
#define WARPSIZE 32
// the radius kernel must use a number of threads that is a power of 2 due to the reduction algorithm
#define RADIUSTHREADS 256 /* must be a power of 2 */
#define RADIUSBLOCKSF 1
#define TREETHREADS 512
#define TREEBLOCKSF 2
#define CENTERTHREADS 768
#define CENTERBLOCKSF 1 /* must remain 1 */
#define FORCETHREADS 512
#define FORCEBLOCKSF 4
#define ADVANCETHREADS 768
#define ADVANCEBLOCKSF 5

// simulation variable parameters, host
int n = 0, nnodes = 0;
int t = 0;
float r = 0.0f;
float dt = 0.0f;

// state variables, host
int MPCOUNT = 0;
int INIT = 0;
int INIT_DEVICE = 0;

// simulation data, host
float *pxTemp = NULL, *pyTemp = NULL, *pzTemp = NULL,
      *vxTemp = NULL, *vyTemp = NULL, *vzTemp = NULL,
      *mTemp = NULL; // temp for file loading
float4 *posm = NULL; // combined position and mass values
float4 *vel = NULL;
float4 *acc = NULL;
int *childNodes;

// simulation data, device
// pointers to allocated memory are first stored in temporary host variables
// and then transferred to the device constant memory
float4 *posm_DevT, *vel_DevT, *acc_DevT;
__constant__ volatile float4 *posm_Dev;
__constant__ volatile float4 *vel_Dev;
__constant__ volatile float4 *acc_Dev;
float *xmax_DevT, *ymax_DevT, *zmax_DevT, *xmin_DevT, *ymin_DevT, *zmin_DevT;
__constant__ volatile float *xmax_Dev;
__constant__ volatile float *ymax_Dev;
__constant__ volatile float *zmax_Dev;
__constant__ volatile float *xmin_Dev;
__constant__ volatile float *ymin_Dev;
__constant__ volatile float *zmin_Dev;

int *childNodes_DevT;
__constant__ volatile int *childNodes_Dev;
float *rtest_DevT; // test; only used from host code (cudaMalloc/cudaMemcpy/cudaFree), so it is a plain host variable

// simulation parameters, device
__constant__ int n_Dev;
__constant__ int nnodes_Dev;
__constant__ float dt_Dev;
__constant__ float *rtest_Dev;
__device__ volatile float r_Dev;
__device__ volatile int blockCount_Dev;
__device__ volatile int step_Dev;
__device__ volatile int usedInternal;

// function declarations, device
__global__ void initDev(void);
__global__ void test(void);
__global__ void calculateRadius(void);
__global__ void buildTree(void);
__global__ void computeCenterOfMass(void);
__global__ void sortBodies(void);
__global__ void computeForce(void);
__global__ void advanceBodies(void);

// function declarations, host
int init(void);
int uninit(void);
int initDevice(void);
int uninitDevice(void);
int transferToDevice(void);
int transferToHost(void);

int main(int argc, char *argv[])
{
    char filename[64];
    printf("Barnes-Hut N-body simulator, CUDA version.\n");
    if (argc >= 2)
    {
        strncpy(filename, argv[1], sizeof(filename) - 1); // bounded copy, avoids overflowing filename
        filename[sizeof(filename) - 1] = '\0';
    }
    else
    {
        printf("Incorrect number of parameters supplied.\nTo run: %s <input file> [<timesteps> <timestep size>] [<output file>]\n", argv[0]);
        return 1;
    }
    // check CUDA
    int devCount = 0;
    cudaGetDeviceCount(&devCount);
    if (devCount > 0)
    {
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, 0);
        printf("\nCUDA info:\n");
        printf("Device name:\t\t\t%s\n", devProp.name);
        int ccMajor = devProp.major, ccMinor = devProp.minor;
        printf("CUDA capability:\t\t%d.%d\n", ccMajor, ccMinor);
        if (ccMajor < 2)
        {
            printf("Capability >= 2.x required! Exiting...\n");
            exit(-1);
        }
        printf("Max threads per block:\t\t%d\n", devProp.maxThreadsPerBlock);
        printf("Shared memory per block:\t%zu\n", devProp.sharedMemPerBlock); // sharedMemPerBlock is a size_t
        MPCOUNT = devProp.multiProcessorCount;
        printf("Core (multiprocessor) count:\t%d (%d)\n", (ccMajor > 1) ? 48 * MPCOUNT : 8 * MPCOUNT, MPCOUNT);
    }
    else
    {
        printf("No CUDA capable device found! Exiting...\n");
        exit(-1);
    }

    printf("Loading input-file %s...", filename);
    if (1 == readinput(filename, &n, &t, &dt, &pxTemp, &pyTemp, &pzTemp, &vxTemp, &vyTemp, &vzTemp, &mTemp))
    {
        printf("Done!\n");
        if (argc >= 4) // t, dt override default parameters from arguments
        {
            t = (int) strtol(argv[2], NULL, 10);
            dt = (float) strtod(argv[3], NULL);
        }
        // todo: copy over arrays, free temp arrays
        printf("Simulation parameters: n = %d, t = %d, dt = %f\n", n, t, dt);
        if (!init())
        {
            printf("Error initialising memory on host, exiting.\n");
            exit(-1);
        }
        if (!initDevice())
        {
            printf("Error initialising memory on device, exiting.\n");
            exit(-1);
        }
        if (!transferToDevice())
        {
            printf("Error transferring simulation parameters and data to device, exiting.\n");
            exit(-1);
        }
        // begin timing
        printf("*** BH NBODY CUDA SIMULATION STARTS ***\n");
        float elapsed = 0.0f, lap = 0.0f;
        float (*laps)[][5];
        laps = (float (*)[][5]) malloc(t * sizeof(float[5])); // dynamically allocate t
        cudaEvent_t start, stop;
        HANDLE_ERROR(cudaEventCreate(&start));
        HANDLE_ERROR(cudaEventCreate(&stop));
        cudaEvent_t startlap, stoplap;
        HANDLE_ERROR(cudaEventCreate(&startlap));
        HANDLE_ERROR(cudaEventCreate(&stoplap));
        HANDLE_ERROR(cudaEventRecord(start, 0));
        initDev<<<1, 1>>>();
        for (int i = 0; i < t; i++)
        {
            HANDLE_ERROR(cudaEventRecord(startlap, 0));
            calculateRadius<<<RADIUSBLOCKSF * MPCOUNT, RADIUSTHREADS>>>();
            HANDLE_ERROR(cudaEventRecord(stoplap, 0));
            HANDLE_ERROR(cudaEventSynchronize(stoplap));
            HANDLE_ERROR(cudaEventElapsedTime(&lap, startlap, stoplap));
            (*laps)[i][0] = lap;
            HANDLE_ERROR(cudaEventRecord(startlap, 0));
            buildTree<<<TREEBLOCKSF * MPCOUNT, TREETHREADS>>>();
            HANDLE_ERROR(cudaEventRecord(stoplap, 0));
            HANDLE_ERROR(cudaEventSynchronize(stoplap));
            HANDLE_ERROR(cudaEventElapsedTime(&lap, startlap, stoplap));
            (*laps)[i][1] = lap;
            HANDLE_ERROR(cudaEventRecord(startlap, 0));
            computeCenterOfMass<<<CENTERBLOCKSF * MPCOUNT, CENTERTHREADS>>>();
            HANDLE_ERROR(cudaEventRecord(stoplap, 0));
            HANDLE_ERROR(cudaEventSynchronize(stoplap));
            HANDLE_ERROR(cudaEventElapsedTime(&lap, startlap, stoplap));
            (*laps)[i][2] = lap;
            HANDLE_ERROR(cudaEventRecord(startlap, 0));
            computeForce<<<FORCEBLOCKSF * MPCOUNT, FORCETHREADS>>>();
            HANDLE_ERROR(cudaEventRecord(stoplap, 0));
            HANDLE_ERROR(cudaEventSynchronize(stoplap));
            HANDLE_ERROR(cudaEventElapsedTime(&lap, startlap, stoplap));
            (*laps)[i][3] = lap;
            HANDLE_ERROR(cudaEventRecord(startlap, 0));
            advanceBodies<<<ADVANCEBLOCKSF * MPCOUNT, ADVANCETHREADS>>>();
            HANDLE_ERROR(cudaEventRecord(stoplap, 0));
            HANDLE_ERROR(cudaEventSynchronize(stoplap));
            HANDLE_ERROR(cudaEventElapsedTime(&lap, startlap, stoplap));
            (*laps)[i][4] = lap;
        }
        HANDLE_ERROR(cudaEventRecord(stop, 0));
        HANDLE_ERROR(cudaEventSynchronize(stop)); // wait for stop event to finish
        HANDLE_ERROR(cudaEventElapsedTime(&elapsed, start, stop));
        HANDLE_ERROR(cudaEventDestroy(start));
        HANDLE_ERROR(cudaEventDestroy(stop));
        HANDLE_ERROR(cudaEventDestroy(startlap)); // the lap events were created above and must be released too
        HANDLE_ERROR(cudaEventDestroy(stoplap));

        // end timing
        printf("*** BH NBODY CUDA SIMULATION ENDS ***\n");
        printf("CUDA time elapsed: %fms\n", elapsed);
        if (!transferToHost())
        {
            printf("Error transferring simulation parameters and data to host, exiting.\n");
            exit(-1);
        }
        printf("System radius at end of simulation: %f\n", r);
        printf("RADIUS | TREE | CENTER | FORCE | ADVANCE\n");
        float lap0 = 0.0f, lap1 = 0.0f, lap2 = 0.0f, lap3 = 0.0f, lap4 = 0.0f;
        for (int i = 0; i < t; i++)
        {
            lap0 += (*laps)[i][0]; lap1 += (*laps)[i][1]; lap2 += (*laps)[i][2]; lap3 += (*laps)[i][3]; lap4 += (*laps)[i][4];
            if (DEBUG)
                printf("t=%d: %f| %f| %f| %f| %f\n", i, (*laps)[i][0], (*laps)[i][1], (*laps)[i][2], (*laps)[i][3], (*laps)[i][4]);
        }
        printf("Avg. %f| %f| %f| %f| %f\n", lap0 / t, lap1 / t, lap2 / t, lap3 / t, lap4 / t);
        printf("Total %f| %f| %f| %f| %f\n", lap0, lap1, lap2, lap3, lap4);
        free(laps);
        float *pxT = (float *) malloc(n * sizeof(float));
        float *pyT = (float *) malloc(n * sizeof(float));
        float *pzT = (float *) malloc(n * sizeof(float));
        float *mT = (float *) malloc(n * sizeof(float));
        float *vxT = (float *) malloc(n * sizeof(float));
        float *vyT = (float *) malloc(n * sizeof(float));
        float *vzT = (float *) malloc(n * sizeof(float));
        for (int i = 0; i < n; i++)
        {
            pxT[i] = posm[i].x;
            pyT[i] = posm[i].y;
            pzT[i] = posm[i].z;
            mT[i] = posm[i].w;
            vxT[i] = vel[i].x;
            vyT[i] = vel[i].y;
            vzT[i] = vel[i].z;
        }
        if (DEBUG) { printf("Energy of system..."); printf(": %f\n", computeEnergy(n, pxT, pyT, pzT, vxT, vyT, vzT, mT)); }
        if (argc == 5) // write output file
        {
            char filenameOut[64];
            strncpy(filenameOut, argv[4], sizeof(filenameOut) - 1); // bounded copy, avoids overflowing filenameOut
            filenameOut[sizeof(filenameOut) - 1] = '\0';
            printf("Writing output-file %s...", filenameOut);
            if (1 == writeoutput(filenameOut, n, t, dt, pxT, pyT, pzT, vxT, vyT, vzT, mT))
            {
                printf("Done!\n");
            }
            else
            {
                printf("Error! Exiting.\n");
                exit(-1);
            }
        }
        free(pxT); free(pyT); free(pzT); free(vxT); free(vyT); free(vzT); free(mT);
        if (!(uninit() && uninitDevice()))
        {
            printf("Error uninitialising memory on host and/or device, exiting.\n");
            exit(-1);
        }
    }
    else // reading file failed
    {
        printf("Error! Exiting.\n");
        exit(-1);
    }
    return 0;
}

int init(void)
{
    nnodes = n * 2;
    posm = (float4 *) malloc((nnodes + 1) * sizeof(float4));
    vel = (float4 *) malloc(n * sizeof(float4));
    acc = (float4 *) malloc(n * sizeof(float4));
    childNodes = (int *) malloc(8 * (n + 1) * sizeof(int));
    // acc is written below, so its allocation is checked as well
    INIT = ((posm != NULL) && (vel != NULL) && (acc != NULL) && (childNodes != NULL));
    if (INIT)
    {
        int i;
        for (i = 0; i < n; i++)
        {
            posm[i].x = pxTemp[i];
            posm[i].y = pyTemp[i];
            posm[i].z = pzTemp[i];
            posm[i].w = mTemp[i];
            vel[i].x = vxTemp[i];
            vel[i].y = vyTemp[i];
            vel[i].z = vzTemp[i];
            vel[i].w = 0.0f;
            acc[i].x = 0.0f;
            acc[i].y = 0.0f;
            acc[i].z = 0.0f;
            acc[i].w = 0.0f;
        }
    }
    return INIT;
}

int uninit(void)
{
    if (INIT)
    {
        free(posm);
        free(vel);
        free(acc);
        free(childNodes);
    }
    return INIT;
}

int initDevice(void)
{
    // simulation data
    INIT_DEVICE = (cudaSuccess == cudaMalloc((void **) &posm_DevT, (nnodes + 1) * sizeof(float4)));
    INIT_DEVICE = (INIT_DEVICE) && (cudaSuccess == cudaMalloc((void **) &vel_DevT, n * sizeof(float4)));
    INIT_DEVICE = (INIT_DEVICE) && (cudaSuccess == cudaMalloc((void **) &acc_DevT, n * sizeof(float4)));
    // radius data
    INIT_DEVICE = (INIT_DEVICE) && (cudaSuccess == cudaMalloc((void **) &xmax_DevT, MPCOUNT * RADIUSBLOCKSF * sizeof(float)));
    INIT_DEVICE = (INIT_DEVICE) && (cudaSuccess == cudaMalloc((void **) &ymax_DevT, MPCOUNT * RADIUSBLOCKSF * sizeof(float)));
    INIT_DEVICE = (INIT_DEVICE) && (cudaSuccess == cudaMalloc((void **) &zmax_DevT, MPCOUNT * RADIUSBLOCKSF * sizeof(float)));
    INIT_DEVICE = (INIT_DEVICE) && (cudaSuccess == cudaMalloc((void **) &xmin_DevT, MPCOUNT * RADIUSBLOCKSF * sizeof(float)));
    INIT_DEVICE = (INIT_DEVICE) && (cudaSuccess == cudaMalloc((void **) &ymin_DevT, MPCOUNT * RADIUSBLOCKSF * sizeof(float)));
    INIT_DEVICE = (INIT_DEVICE) && (cudaSuccess == cudaMalloc((void **) &zmin_DevT, MPCOUNT * RADIUSBLOCKSF * sizeof(float)));
    INIT_DEVICE = (INIT_DEVICE) && (cudaSuccess == cudaMalloc((void **) &rtest_DevT, sizeof(float)));
    // tree data
    // INIT_DEVICE = (INIT_DEVICE) && (cudaSuccess == cudaMalloc((void **) &nodes_DevT, 2 * n * sizeof(int)));
    INIT_DEVICE = (INIT_DEVICE) && (cudaSuccess == cudaMalloc((void **) &childNodes_DevT, 8 * (n + 1) * sizeof(int)));
    return INIT_DEVICE;
}

int uninitDevice(void)
{
    int success = 0;
    if (INIT_DEVICE)
    {
        // simulation data
        success = (cudaSuccess == cudaFree(posm_DevT));
        success = (success) && (cudaSuccess == cudaFree(vel_DevT));
        success = (success) && (cudaSuccess == cudaFree(acc_DevT)); // allocated in initDevice, so it is freed here as well
        // radius data
        success = (success) && (cudaSuccess == cudaFree(xmax_DevT));
        success = (success) && (cudaSuccess == cudaFree(ymax_DevT));
        success = (success) && (cudaSuccess == cudaFree(zmax_DevT));
        success = (success) && (cudaSuccess == cudaFree(xmin_DevT));
        success = (success) && (cudaSuccess == cudaFree(ymin_DevT));
        success = (success) && (cudaSuccess == cudaFree(zmin_DevT));
        success = (success) && (cudaSuccess == cudaFree(rtest_DevT)); // test
        // tree data
        success = (success) && (cudaSuccess == cudaFree(childNodes_DevT));
    }
    return success;
}

int transferToDevice(void)
{
    int success = 0;
    if ((INIT) && (INIT_DEVICE))
    {
        // copy simulation data
        success = (cudaSuccess == cudaMemcpy(posm_DevT, posm, n * sizeof(float4), cudaMemcpyHostToDevice));
        success = (success) && (cudaSuccess == cudaMemcpy(vel_DevT, vel, n * sizeof(float4), cudaMemcpyHostToDevice));
        success = (success) && (cudaSuccess == cudaMemcpy(acc_DevT, acc, n * sizeof(float4), cudaMemcpyHostToDevice));
        // copy simulation parameters
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(n_Dev, &n, sizeof(int)));
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(nnodes_Dev, &nnodes, sizeof(int)));
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(dt_Dev, &dt, sizeof(float)));
        // copy simulation pointers
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(posm_Dev, &posm_DevT, sizeof(void *)));
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(vel_Dev, &vel_DevT, sizeof(void *)));
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(acc_Dev, &acc_DevT, sizeof(void *)));
        // copy radius pointers
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(xmax_Dev, &xmax_DevT, sizeof(void *)));
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(ymax_Dev, &ymax_DevT, sizeof(void *)));
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(zmax_Dev, &zmax_DevT, sizeof(void *)));
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(xmin_Dev, &xmin_DevT, sizeof(void *)));
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(ymin_Dev, &ymin_DevT, sizeof(void *)));
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(zmin_Dev, &zmin_DevT, sizeof(void *)));
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(rtest_Dev, &rtest_DevT, sizeof(float *))); // test
        // copy tree pointers
        // success = (success) && (cudaSuccess == cudaMemcpyToSymbol(nodes_Dev, &nodes_DevT, sizeof(void *)));
        success = (success) && (cudaSuccess == cudaMemcpyToSymbol(childNodes_Dev, &childNodes_DevT, sizeof(void *)));
    }
    return success;
}

int transferToHost(void)
{
    int success = 0;
    if ((INIT) && (INIT_DEVICE))
    {
        // copy back data
        success = (cudaSuccess == cudaMemcpy(posm, posm_DevT, (nnodes + 1) * sizeof(float4), cudaMemcpyDeviceToHost));
        success = (success) && (cudaSuccess == cudaMemcpy(vel, vel_DevT, n * sizeof(float4), cudaMemcpyDeviceToHost));
        success = (success) && (cudaSuccess == cudaMemcpy(acc, acc_DevT, n * sizeof(float4), cudaMemcpyDeviceToHost));
        success = (success) && (cudaSuccess == cudaMemcpy(&r, rtest_DevT, sizeof(float), cudaMemcpyDeviceToHost)); // test
        success = (success) && (cudaSuccess == cudaMemcpy(childNodes, childNodes_DevT, 8 * (n + 1) * sizeof(int), cudaMemcpyDeviceToHost)); // test
    }
    return success;
}

__global__ void initDev(void)
{
    blockCount_Dev = 0; // for counting blocks executed
    step_Dev = -1; // skip velocity update until after first step
}

__global__ void calculateRadius(void)
{
    int i, temp, inc; // note: these variables are reused
    float val;
    float xmax = 0.0f, xmin = 0.0f, ymax = 0.0f, ymin = 0.0f, zmax = 0.0f, zmin = 0.0f;
    __shared__ volatile float xmaxs[RADIUSTHREADS], xmins[RADIUSTHREADS],
        ymaxs[RADIUSTHREADS], ymins[RADIUSTHREADS], zmaxs[RADIUSTHREADS], zmins[RADIUSTHREADS];

    int tid = threadIdx.x; // thread id in current block
    inc = RADIUSTHREADS * gridDim.x; // total threads
    // iterate over all bodies in increments of total number of threads
    for (i = tid + (blockIdx.x * RADIUSTHREADS); i < n_Dev; i += inc)
    {
        val = posm_Dev[i].x;
        xmax = (xmax > val) ? xmax : val;
        xmin = (xmin < val) ? xmin : val;
        val = posm_Dev[i].y;
        ymax = (ymax > val) ? ymax : val;
        ymin = (ymin < val) ? ymin : val;
        val = posm_Dev[i].z;
        zmax = (zmax > val) ? zmax : val;
        zmin = (zmin < val) ? zmin : val;
    }
    // save to shared memory
    xmaxs[tid] = xmax; xmins[tid] = xmin;
    ymaxs[tid] = ymax; ymins[tid] = ymin;
    zmaxs[tid] = zmax; zmins[tid] = zmin;

    // reduction, this will only work when the block is executing using a number
    // of threads that is a power of 2
    for (i = RADIUSTHREADS / 2; i > 0; i /= 2)
    {
        __syncthreads();
        if (tid < i)
        {
            temp = tid + i;
            xmax = (xmax > xmaxs[temp]) ? xmax : xmaxs[temp];
            xmin = (xmin < xmins[temp]) ? xmin : xmins[temp];
            xmaxs[tid] = xmax; xmins[tid] = xmin;
            ymax = (ymax > ymaxs[temp]) ? ymax : ymaxs[temp];
            ymin = (ymin < ymins[temp]) ? ymin : ymins[temp];
            ymaxs[tid] = ymax; ymins[tid] = ymin;
            zmax = (zmax > zmaxs[temp]) ? zmax : zmaxs[temp];
            zmin = (zmin < zmins[temp]) ? zmin : zmins[temp];
            zmaxs[tid] = zmax; zmins[tid] = zmin;
        }
    }
    // save results
    if (tid == 0)
    {
        temp = blockIdx.x;
        xmax_Dev[temp] = xmax; xmin_Dev[temp] = xmin;
        ymax_Dev[temp] = ymax; ymin_Dev[temp] = ymin;
        zmax_Dev[temp] = zmax; zmin_Dev[temp] = zmin;
    }

    inc = gridDim.x - 1;
    // if old value == gridDim-1, we are at the last block
    if (inc == atomicInc((unsigned int *) &blockCount_Dev, inc))
    {
        for (i = 0; i <= inc; i++)
        {
            xmax = (xmax > xmax_Dev[i]) ? xmax : xmax_Dev[i];
            xmin = (xmin < xmin_Dev[i]) ? xmin : xmin_Dev[i];
            ymax = (ymax > ymax_Dev[i]) ? ymax : ymax_Dev[i];
            ymin = (ymin < ymin_Dev[i]) ? ymin : ymin_Dev[i];
            zmax = (zmax > zmax_Dev[i]) ? zmax : zmax_Dev[i];
            zmin = (zmin < zmin_Dev[i]) ? zmin : zmin_Dev[i];
        }
        // naive radius calculation: max. abs. value of any x, y or z coordinate
        xmax = (fabsf(xmax) > fabsf(xmin)) ? fabsf(xmax) : fabsf(xmin);
        ymax = (fabsf(ymax) > fabsf(ymin)) ? fabsf(ymax) : fabsf(ymin);
        zmax = (fabsf(zmax) > fabsf(zmin)) ? fabsf(zmax) : fabsf(zmin);
        val = (xmax > ymax) ? xmax : ymax;
        val = (val > zmax) ? val : zmax;
        r_Dev = val;
        *rtest_Dev = val; // test
        step_Dev++;
        posm_Dev[nnodes_Dev].x = 0.0f; // root node
        posm_Dev[nnodes_Dev].y = 0.0f;
        posm_Dev[nnodes_Dev].z = 0.0f;
        posm_Dev[nnodes_Dev].w = 0.0f;
        for (i = 0; i < 8; i++)
        {
            childNodes_Dev[(nnodes_Dev - n_Dev) * 8 + i] = NOCHILD;
        }
    }
}

__global__ void buildTree(void)
{
    float px, py, pz;
    float x = -1.0f, y = -1.0f, z = -1.0f;
    float rthis;
    rthis = r_Dev;
    usedInternal = nnodes_Dev;
    int current = nnodes_Dev, newInternal, child, locked, alt, index, i, j, depth, tid, threads, skip = 1;
    tid = threadIdx.x + blockIdx.x * blockDim.x; // absolute thread number
    threads = gridDim.x * blockDim.x; // total number of threads
    i = tid;

    while (i < n_Dev)
    {
        if (skip != 0)
        {
            skip = 0;
            current = nnodes_Dev;
            index = 0;
            px = posm_Dev[i].x;
            py = posm_Dev[i].y;
            pz = posm_Dev[i].z;
            depth = 1;
            rthis = r_Dev;
            x = -1.0f, y = -1.0f, z = -1.0f;
            if (0.0f < px) { index = index | 1; x = 1.0f; }
            if (0.0f < py) { index = index | 2; y = 1.0f; }
            if (0.0f < pz) { index = index | 4; z = 1.0f; }
        }
        child = childNodes_Dev[(current - n_Dev) * 8 + index];
        while (child >= n_Dev)
        {
            current = child;
            index = 0;
            x = -1.0f, y = -1.0f, z = -1.0f;
            if (posm_Dev[current].x < px)
            {
                index = index | 1;
                x = 1.0f;
            }
            if (posm_Dev[current].y < py)
            {
                index = index | 2;
                y = 1.0f;
            }
            if (posm_Dev[current].z < pz)
            {
                index = index | 4;
                z = 1.0f;
            }
            rthis *= 0.5;
            depth++;
            child = childNodes_Dev[(current - n_Dev) * 8 + index];
        }

        if (child != LOCK) // skip if locked
        {
            locked = (current - n_Dev) * 8 + index;
            if (child == atomicCAS((int *) &childNodes_Dev[locked], child, LOCK)) // try to lock
            {
                if (child == NOCHILD) // spot is available for leaf insertion
                {
                    childNodes_Dev[locked] = i;
                }
                else
                {
                    alt = -1; // remember not to insert node first time
                    do // we have two leaves, child and i
                    {
                        newInternal = atomicSub((int *) &usedInternal, 1) - 1;
                        alt = (newInternal > alt) ? newInternal : alt;
                        rthis *= 0.5;
                        posm_Dev[newInternal].x = posm_Dev[current].x + rthis * x;
                        posm_Dev[newInternal].y = posm_Dev[current].y + rthis * y;
                        posm_Dev[newInternal].z = posm_Dev[current].z + rthis * z;
                        posm_Dev[newInternal].w = 0.0f;

                        if (alt != newInternal)
                        {
                            childNodes_Dev[(current - n_Dev) * 8 + index] = newInternal;
                        }

                        depth++;

                        for (j = 0; j < 8; j++) // mark nodes empty
                        {
                            childNodes_Dev[(newInternal - n_Dev) * 8 + j] = NOCHILD;
                        }

                        if (alt != newInternal)
                        {
                            childNodes_Dev[(current - n_Dev) * 8 + index] = newInternal;
                        }
                        index = 0;
                        if (posm_Dev[newInternal].x < posm_Dev[child].x)
                        {
                            index = index | 1;
                        }
                        if (posm_Dev[newInternal].y < posm_Dev[child].y)
                        {
                            index = index | 2;
                        }
                        if (posm_Dev[newInternal].z < posm_Dev[child].z)
                        {
                            index = index | 4;
                        }
                        childNodes_Dev[(newInternal - n_Dev) * 8 + index] = child; // child is inserted
                        current = newInternal;
                        index = 0;
                        x = -1.0f, y = -1.0f, z = -1.0f;
                        if (posm_Dev[newInternal].x < px)
                        {
                            index = index | 1;
                            x = 1.0f;
                        }
                        if (posm_Dev[newInternal].y < py)
                        {
                            index = index | 2;
                            y = 1.0f;
                        }
                        if (posm_Dev[newInternal].z < pz)
                        {
                            index = index | 4;
                            z = 1.0f;
                        }

                        child = childNodes_Dev[(current - n_Dev) * 8 + index];
                    }
                    while (child > NOCHILD); // while there is no space for the leaf, keep on looping
                    childNodes_Dev[(current - n_Dev) * 8 + index] = i;
                    __threadfence(); // make sure it's written
                    childNodes_Dev[locked] = alt;
                } // end else
                i += threads;
                skip = 1;
            } // end lock
        } // end if not locked
        __syncthreads();
    }
}

__global__ void computeCenterOfMass(void)
{
    __shared__ volatile int childrenCaches[CENTERTHREADS * 8];
    int i, j, internal, bottom, currentChild;
    int index = threadIdx.x * 8;
    int missing = 0;
    int tid = threadIdx.x + blockIdx.x * blockDim.x; // absolute thread number
    int threads = gridDim.x * blockDim.x; // total number of threads
    float cent_pos_x, cent_pos_y, cent_pos_z, cent_mass;
    float massT;
    bottom = usedInternal;
    internal = (bottom & (-WARPSIZE)) + tid; // bottom - (bottom % WARPSIZE) + tid
    if (internal < bottom)
        internal += threads;
    while (internal <= nnodes_Dev)
    {
        cent_pos_x = 0.0f; cent_pos_y = 0.0f; cent_pos_z = 0.0f; cent_mass = 0.0f;
        j = 0;
        for (i = 0; i < 8; i++)
        {
            currentChild = childNodes_Dev[(internal - n_Dev) * 8 + i];
            if (currentChild >= 0)
            {
                if (i != j) // for calc. accel., move children to end
                {
                    childNodes_Dev[(internal - n_Dev) * 8 + i] = NOCHILD;
                    childNodes_Dev[(internal - n_Dev) * 8 + j] = currentChild;
                }
                massT = posm_Dev[currentChild].w;
                if (massT > 0)
                {
                    float tempx = cent_pos_x * cent_mass + posm_Dev[currentChild].x * massT;
                    float tempy = cent_pos_y * cent_mass + posm_Dev[currentChild].y * massT;
                    float tempz = cent_pos_z * cent_mass + posm_Dev[currentChild].z * massT;
                    cent_mass += massT;
                    cent_pos_x = tempx / cent_mass;
                    cent_pos_y = tempy / cent_mass;
                    cent_pos_z = tempz / cent_mass;
                    childrenCaches[index + i] = CHILDVISITED;
                }
                else
                {
                    childrenCaches[index + i] = currentChild;
                    missing++;
                }
                j++;
            }
            else
            {
                childrenCaches[index + i] = currentChild;
            }
        }
        do
        {
            for (i = 0; i < 8; i++)
            {
                currentChild = childrenCaches[index + i];
                if (currentChild >= 0)
                {
                    massT = posm_Dev[currentChild].w;
                    if (massT > 0)
                    {
                        float tempx = cent_pos_x * cent_mass + posm_Dev[currentChild].x * massT;
                        float tempy = cent_pos_y * cent_mass + posm_Dev[currentChild].y * massT;
                        float tempz = cent_pos_z * cent_mass + posm_Dev[currentChild].z * massT;
                        cent_mass += massT;
                        cent_pos_x = tempx / cent_mass;
                        cent_pos_y = tempy / cent_mass;
                        cent_pos_z = tempz / cent_mass;
                        childrenCaches[index + i] = CHILDVISITED;
                        missing--;
                    }
                }
            }
            if (missing == 0)
            {

                posm_Dev[internal].x = cent_pos_x;
                posm_Dev[internal].y = cent_pos_y;
                posm_Dev[internal].z = cent_pos_z;
                __threadfence();
                posm_Dev[internal].w = cent_mass;
                internal += threads;
            }

746 } while ( missing > 0) ;
747
748
749 }
750 }
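
The kernel above folds each child into a running mass-weighted average instead of summing first and dividing once at the end. The following host-side C++ sketch isolates that update rule. The `Body` and `CentreOfMass` types are hypothetical names introduced for illustration only.

```cpp
#include <cassert>

// Hypothetical types; the kernel keeps the same data in posm_Dev (x, y, z, w = mass).
struct Body { float x, y, z, mass; };

struct CentreOfMass
{
    float x = 0.0f, y = 0.0f, z = 0.0f, mass = 0.0f;

    // Fold one body (or one child cell) into the running centre of mass,
    // using the same sequence of operations as the kernel.
    void add(const Body& b)
    {
        assert(b.mass > 0.0f); // the kernel treats mass <= 0 as "not ready yet"
        float tx = x * mass + b.x * b.mass;
        float ty = y * mass + b.y * b.mass;
        float tz = z * mass + b.z * b.mass;
        mass += b.mass;
        x = tx / mass;
        y = ty / mass;
        z = tz / mass;
    }
};
```

Keeping the intermediate result normalised means that the position and mass can be written out as soon as the last missing child arrives, without a separate finalisation pass.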

__global__ void computeForce(void)
{
    int currentNode, currentChild, leaf, depth, did;
    int i, base, sbase, diff;
    float px, py, pz, ax, ay, az, dx, dy, dz, tmp;
    int tid = threadIdx.x + blockIdx.x * blockDim.x; // absolute thread number
    int threads = gridDim.x * blockDim.x; // total number of threads
    long visited; // keeps track of path to currentNode
    // each warp has its own parent cache within the parent array
    __shared__ volatile int parent[MAXDEPTH * FORCETHREADS / WARPSIZE];
    // each warp has its own radius (by depth) cache
    __shared__ volatile float rdepth[MAXDEPTH * FORCETHREADS / WARPSIZE];
    __shared__ volatile int step;

    base = threadIdx.x / WARPSIZE; // warp number in block
    sbase = base * WARPSIZE; // first thread in warp
    did = base * MAXDEPTH; // depth index

    if (0 == threadIdx.x) // first thread in block
    {
        step = step_Dev; // used for velocity Verlet
        // precompute values that depend only on tree level
        rdepth[0] = r_Dev * r_Dev * 4.0f; // (2*r)^2
        for (i = 1; i < MAXDEPTH; i++)
        {
            rdepth[i] = rdepth[i - 1] * 0.25f;
        }
    }

    __syncthreads();

    diff = threadIdx.x - sbase;
    if (diff < MAXDEPTH)
    {
        rdepth[diff + did] = rdepth[diff];
    }

    __syncthreads();

    for (leaf = tid; leaf < n_Dev; leaf += threads)
    {
        px = posm_Dev[leaf].x;
        py = posm_Dev[leaf].y;
        pz = posm_Dev[leaf].z;
        ax = 0.0f; ay = 0.0f; az = 0.0f;

        if (sbase == threadIdx.x)
        {
            parent[did] = nnodes_Dev;
        }
        visited = 1;
        __threadfence(); // make sure it's visible to all threads
        currentChild = 0;
        depth = did;
        while (depth >= did)
        {
            while (currentChild < 8)
            {
                currentNode = childNodes_Dev[(parent[depth] - n_Dev) * 8 + currentChild];

                if (currentNode >= 0)
                {
                    dx = posm_Dev[currentNode].x - px;
                    dy = posm_Dev[currentNode].y - py;
                    dz = posm_Dev[currentNode].z - pz;
                    tmp = dx * dx + dy * dy + dz * dz + EPSILON;
                    if ((currentNode < n_Dev) || __all(tmp >= rdepth[depth])) // far enough away on all threads in warp
                    {
                        tmp = rsqrtf(tmp); // 1/sqrt(tmp) approximation
                        tmp = posm_Dev[currentNode].w * tmp * tmp * tmp;
                        ax += dx * tmp;
                        ay += dy * tmp;
                        az += dz * tmp;
                        currentChild++;
                    }
                    else
                    {
                        depth++;
                        visited = visited << 3;
                        visited = visited | currentChild;
                        if (sbase == threadIdx.x)
                        {
                            parent[depth] = currentNode;
                        }
                        __threadfence();
                        currentChild = 0;
                    }
                }
                else
                {
                    if (did <= (depth - 1))
                    {
                        currentChild = visited & 7;
                        visited = visited >> 3;
                        currentChild++;
                        depth--;
                    }
                    else
                    {
                        depth = did;
                        currentChild++;
                    }
                }
            }
            do
            {
                depth--;
                currentChild = visited & 7;
                visited = visited >> 3;
                currentChild++;
            }
            while (currentChild > 7 && depth >= did);
        }
        if (step > 0)
        {
            float half_dt = dt_Dev * 0.5f;
            vel_Dev[leaf].x += (ax - acc_Dev[leaf].x) * half_dt;
            vel_Dev[leaf].y += (ay - acc_Dev[leaf].y) * half_dt;
            vel_Dev[leaf].z += (az - acc_Dev[leaf].z) * half_dt;
        }
        acc_Dev[leaf].x = ax;
        acc_Dev[leaf].y = ay;
        acc_Dev[leaf].z = az;
    }
}
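
The force kernel walks the tree without recursion: the child index taken at each level is packed into the integer `visited`, three bits per level, above a sentinel bit. The sketch below shows this packing in isolation as host-side C++. The `PathStack` name and the use of a fixed 64-bit integer are assumptions made for the example; the kernel itself uses `long`, whose width depends on the platform.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: a stack of child indices (0..7), three bits each.
struct PathStack
{
    std::uint64_t bits = 1; // sentinel bit, as in "visited = 1"

    // Descend one level: remember which child was taken.
    void push(int child)
    {
        assert(child >= 0 && child < 8);
        bits = (bits << 3) | static_cast<std::uint64_t>(child);
    }

    // Return to the parent level and recover the child index taken there.
    int pop()
    {
        assert(!empty());
        int child = static_cast<int>(bits & 7u);
        bits >>= 3;
        return child;
    }

    bool empty() const { return bits == 1; }
};
```

With one sentinel bit, a 64-bit integer holds 21 levels; a 32-bit integer would hold only 10, so the maximum tree depth has to be chosen with the integer width in mind.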

__global__ void advanceBodies(void)
{
    int i;
    int threads = gridDim.x * blockDim.x;
    int tid = (blockIdx.x * blockDim.x) + threadIdx.x;

    float dt = dt_Dev;
    float dth = dt * 0.5f;

    float dvx, dvy, dvz;
    float vhx, vhy, vhz;

    for (i = tid; i < n_Dev; i += threads)
    {
        dvx = acc_Dev[i].x * dth;
        dvy = acc_Dev[i].y * dth;
        dvz = acc_Dev[i].z * dth;

        vhx = vel_Dev[i].x + dvx;
        vhy = vel_Dev[i].y + dvy;
        vhz = vel_Dev[i].z + dvz;

        posm_Dev[i].x += vhx * dt;
        posm_Dev[i].y += vhy * dt;
        posm_Dev[i].z += vhz * dt;

        vel_Dev[i].x = vhx + dvx;
        vel_Dev[i].y = vhy + dvy;
        vel_Dev[i].z = vhz + dvz;
    }
}
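
The integration is split over two kernels: `advanceBodies` moves the positions and advances the velocities a full step with the old acceleration, and the end of `computeForce` corrects the velocities once the new acceleration is known. A one-dimensional host-side sketch in C++ shows that the two parts together give the usual velocity Verlet update. The `State` type and the function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical one-dimensional state: position, velocity, last acceleration.
struct State { float x, v, a; };

// Part 1 (cf. advanceBodies): half kick, drift, then a second half kick
// that still uses the old acceleration.
inline void advance(State& s, float dt)
{
    assert(dt > 0.0f);
    float dv = s.a * dt * 0.5f;
    float vh = s.v + dv;  // velocity at the half step
    s.x += vh * dt;       // x += v*dt + 0.5*a*dt^2
    s.v = vh + dv;        // provisional velocity: v + a_old*dt
}

// Part 2 (cf. the end of computeForce): replace half of the old
// acceleration by the new one, giving v + 0.5*(a_old + a_new)*dt.
inline void correct(State& s, float aNew, float dt)
{
    s.v += (aNew - s.a) * dt * 0.5f;
    s.a = aNew;
}
```

The correction is skipped on the very first force computation (`step > 0` in the kernel), since no provisional velocity has been formed at that point.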
13 Utility, input, C
/*
   Code for reading input files in C
   Use: pass references to the values and pointers that are to be filled to readinput
*/

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int countsp(char *str);
int readinput(char *filename, int *n, int *t, float *dt,
              float **posx, float **posy, float **posz,
              float **velx, float **vely, float **velz, float **mass);

/*
int main(int argc, char *argv[])
{

    if (argc < 2)
    {
        printf("Please supply a valid filename as argument.\n");
        return 1;
    }
    char filename[64];
    strcpy(filename, argv[1]);
    int n = 0, t = 0, i = 0;
    float dt = 0.0f;
    float *px, *py, *pz;
    float *vx, *vy, *vz;
    float *m;
    if (1 == readinput(filename, &n, &t, &dt, &px, &py, &pz, &vx, &vy, &vz, &m))
    {
        printf("%d %d %f\n", n, t, dt);
        for (i = 0; i < n; i++)
        {
            printf("%f %f %f %f %f %f %f\n", px[i], py[i], pz[i], vx[i], vy[i], vz[i], m[i]);
        }
        free(px); free(py); free(pz); free(vx); free(vy); free(vz); free(m);
    }
    return 0;
}
*/
int readinput(char *filename, int *n, int *t, float *dt,
              float **posx, float **posy, float **posz,
              float **velx, float **vely, float **velz, float **mass)
{
    FILE *file = fopen(filename, "r");
    char line[256];
    int len = 0, sp = 0, INIT = 0, count = 0;
    int nT = 0, tT = 0;
    float dtT = 0.0f;
    float posxT, posyT, poszT, velxT, velyT, velzT, massT;
    if (file == NULL) // the file could not be opened
        return 0;
    while (fgets(line, 256, file))
    {
        len = strlen(line);
        // skip comments, lines that start with space, lines not containing any data
        if (('#' == line[0]) || (' ' == line[0]) || ('\n' == line[0]) || (len < 2))
            continue;
        sp = countsp(line);
        if (sp == 2) // assume config setting "n t dt"
        {
            if (3 == sscanf(line, "%d %d %f", &nT, &tT, &dtT))
            {
                if (nT > 0)
                {
                    (*posx) = (float *) malloc(nT * sizeof(float));
                    (*posy) = (float *) malloc(nT * sizeof(float));
                    (*posz) = (float *) malloc(nT * sizeof(float));
                    (*velx) = (float *) malloc(nT * sizeof(float));
                    (*vely) = (float *) malloc(nT * sizeof(float));
                    (*velz) = (float *) malloc(nT * sizeof(float));
                    (*mass) = (float *) malloc(nT * sizeof(float));
                    *n = nT; *t = tT; *dt = dtT;
                    // check the allocated arrays themselves, not the addresses of the caller's pointers
                    if ((*posx != NULL) && (*posy != NULL) && (*posz != NULL) &&
                        (*velx != NULL) && (*vely != NULL) && (*velz != NULL) && (*mass != NULL))
                        INIT = 1;
                }
            }
        }
        else if ((INIT == 1) && (sp >= 6) && (count < nT)) // never write past the allocated arrays
        {
            if (7 == sscanf(line, "%f %f %f %f %f %f %f", &posxT, &posyT, &poszT, &velxT, &velyT, &velzT, &massT))
            {
                (*posx)[count] = posxT; (*posy)[count] = posyT; (*posz)[count] = poszT;
                (*velx)[count] = velxT; (*vely)[count] = velyT; (*velz)[count] = velzT;
                (*mass)[count] = massT;
                count++;
            }
        }
    }
    fclose(file);
    return ((INIT == 1) && (nT == count));
}

int countsp(char *str)
{
    int sp = 0, i;
    for (i = 0; i < (int) strlen(str); i++)
    {
        if (' ' == str[i])
            sp++;
    }
    return sp;
}
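
The reader above distinguishes the parameter line from body lines by counting spaces, so a line with a doubled or trailing space is classified differently from the same data with single spaces. A possible alternative, sketched here in C++ under the assumption that all fields are whitespace-separated numbers, is to count the values that actually parse. The function name `parseFloats` is hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns every whitespace-separated float on a line.
// Parsing stops at the first token that is not a number.
inline std::vector<float> parseFloats(const std::string& line)
{
    std::istringstream in(line);
    std::vector<float> values;
    float f;
    while (in >> f)
    {
        values.push_back(f);
    }
    return values;
}
```

A line yielding three values would then be taken as the "n t dt" parameter line, and one yielding seven or ten values as a body with or without an acceleration vector.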
14 Utility, input, C++
#ifndef NBODYINPUT_H_
#define NBODYINPUT_H_

#include <string>

class NbodyInput
{
public:
    std::string fileName;
    int n, t;
    float dt;
    float *px, *py, *pz, *vx, *vy, *vz, *m, *ax, *ay, *az;
    float xmax, ymax, zmax;
    float xmin, ymin, zmin;
    bool debug, init;
    NbodyInput(std::string);
    ~NbodyInput(void);
    float getMaxCoordinate(void);
    bool read(void);
    void print(void);
    int countSpaces(std::string str);
};

#endif /* NBODYINPUT_H_ */

//TODO: fix so it doesn't print nonsense acceleration vectors
// or fix output to write 0.0's

#include "NbodyInput.h"

#include <cstdio>
#include <string>
#include <iostream>
#include <sstream>
#include <fstream>
#include <cfloat>
#include <cmath>

#define DEBUG true
using namespace std;

NbodyInput::NbodyInput(string fn)
{
    fileName = fn;
    init = false;
    n = 0; t = 0; dt = 0.0f;
    // -FLT_MAX is the most negative float; FLT_MIN is the smallest positive one
    xmax = -FLT_MAX, ymax = -FLT_MAX, zmax = -FLT_MAX;
    xmin = FLT_MAX, ymin = FLT_MAX, zmin = FLT_MAX;
    debug = DEBUG;
    if (debug) cout << "New NbodyInput: " << fileName << endl;
}

NbodyInput::~NbodyInput(void)
{
    if (init)
    {
        delete[] px;
        delete[] py;
        delete[] pz;
        delete[] vx;
        delete[] vy;
        delete[] vz;
        delete[] m;
        // the acceleration arrays are allocated in read() as well
        delete[] ax;
        delete[] ay;
        delete[] az;
    }
}

int NbodyInput::countSpaces(std::string str)
{
    int c = 0;
    for (int i = 0; i < (int) str.size(); i++)
    {
        if (str[i] == ' ') c++;
    }
    return c;
}

bool NbodyInput::read(void)
{
    string line;
    bool paramsRead = false;
    int count = 0;
    int nT = 0, tT = 0;
    float dtT = 0.0f;
    ifstream file(fileName.c_str(), ios::in);
    if (file.is_open())
    {
        while (getline(file, line))
        {
            if ("#" == line.substr(0, 1) || (2 > line.size()))
            {
                // an empty line has no character after position 0 to print
                if (debug && !line.empty()) cout << "NbodyInput comment: " << line.substr(1) << endl;
                continue; // comment, restart loop
            }
            stringstream ss(line);
            if (!paramsRead) // read parameters, initialise arrays
            {
                ss >> nT >> tT >> dtT;
                if (nT > 0)
                {
                    n = nT;
                    t = tT;
                    dt = dtT;
                    px = new float[n];
                    py = new float[n];
                    pz = new float[n];
                    vx = new float[n];
                    vy = new float[n];
                    vz = new float[n];
                    m = new float[n];
                    ax = new float[n];
                    ay = new float[n];
                    az = new float[n];
                    init = true;
                    paramsRead = true;
                }
                else
                {
                    if (debug) cout << "NbodyInput error: missing parameters." << endl;
                    return false;
                }
            }
            else if (count < n) // read bodies; valid indices are 0..n-1
            {
                float pxT = 0.0f, pyT = 0.0f, pzT = 0.0f,
                      vxT = 0.0f, vyT = 0.0f, vzT = 0.0f, mT = 0.0f,
                      axT = 0.0f, ayT = 0.0f, azT = 0.0f;
                if (9 == countSpaces(line))
                {
                    ss >> pxT >> pyT >> pzT >> vxT >> vyT >> vzT >> mT >>
                          axT >> ayT >> azT;
                }
                else if (6 == countSpaces(line))
                {
                    ss >> pxT >> pyT >> pzT >> vxT >> vyT >> vzT >> mT;
                }
                px[count] = pxT;
                py[count] = pyT;
                pz[count] = pzT;
                vx[count] = vxT;
                vy[count] = vyT;
                vz[count] = vzT;
                m[count] = mT;
                ax[count] = axT;
                ay[count] = ayT;
                az[count] = azT;
                xmax = (pxT > xmax) ? pxT : xmax; xmin = (pxT < xmin) ? pxT : xmin;
                ymax = (pyT > ymax) ? pyT : ymax; ymin = (pyT < ymin) ? pyT : ymin;
                zmax = (pzT > zmax) ? pzT : zmax; zmin = (pzT < zmin) ? pzT : zmin;
                count++;
            }
            else
            {
                if (debug) cout << "NbodyInput warning: more lines than expected (" << n << ")." << endl;
            }
        }
    }
    file.close();
    return (init && (n == count));
}

float NbodyInput::getMaxCoordinate(void)
{
    float max = 0.0f;
    using namespace std;
    max = (abs(xmax) > max) ? abs(xmax) : max;
    max = (abs(ymax) > max) ? abs(ymax) : max;
    max = (abs(zmax) > max) ? abs(zmax) : max;
    max = (abs(xmin) > max) ? abs(xmin) : max;
    max = (abs(ymin) > max) ? abs(ymin) : max;
    max = (abs(zmin) > max) ? abs(zmin) : max;
    return max;
}

void NbodyInput::print(void)
{
    if (!init)
    {
        if (debug) cout << "NbodyInput error: file not read." << endl;
    }
    else
    {
        cout << n << " " << t << " " << dt << endl;
        for (int i = 0; i < n; i++)
        {
            cout << px[i] << " " << py[i] << " " << pz[i] << " "
                 << vx[i] << " " << vy[i] << " " << vz[i] << " " << m[i] << " "
                 << ax[i] << " " << ay[i] << " " << az[i]
                 << endl;
        }
    }
}
15 Utility, output, C
/*
   Code for writing output to a file in C
   Use: pass output filename, simulation parameters, and references to the bodies' arrays to writeoutput
   Note: the two writeoutput overloads require the file to be compiled as C++.
*/

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
//#include "input.c"
int writeoutput(char *filename, int n, int t, float dt,
                float *posx, float *posy, float *posz,
                float *velx, float *vely, float *velz, float *mass,
                float *accx, float *accy, float *accz);
int writeoutput(char *filename, int n, int t, float dt,
                float *posx, float *posy, float *posz,
                float *velx, float *vely, float *velz, float *mass);
/*
// test by reading a file and writing it again
int main(int argc, char *argv[])
{

    if (argc < 2)
    {
        printf("Please supply a valid filename as argument.\n");
        return 1;
    }
    char filename[64];
    strcpy(filename, argv[1]);
    int n = 0, t = 0;
    float dt = 0.0f;
    float *px, *py, *pz;
    float *vx, *vy, *vz;
    float *m;
    float *ax, *ay, *az;
    if (1 == readinput(filename, &n, &t, &dt, &px, &py, &pz, &vx, &vy, &vz, &m) && (n > 0))
    {
        printf("Read %d bodies from %s\n", n, filename);
        ax = (float *) malloc(n * sizeof(float));
        ay = (float *) malloc(n * sizeof(float));
        az = (float *) malloc(n * sizeof(float));
        // reset allocated memory
        memset(ax, '\0', n * sizeof(float));
        memset(ay, '\0', n * sizeof(float));
        memset(az, '\0', n * sizeof(float));
        strcat(filename, ".out");
        if (1 == writeoutput(filename, n, t, dt, px, py, pz, vx, vy, vz, m, ax, ay, az))
            printf("Wrote %d bodies to %s\n", n, filename);
        free(px); free(py); free(pz); free(vx); free(vy); free(vz); free(m); free(ax); free(ay); free(az);
    }
    return 0;
}
*/

int writeoutput(char *filename, int n, int t, float dt,
                float *posx, float *posy, float *posz,
                float *velx, float *vely, float *velz, float *mass)
{
    return writeoutput(filename, n, t, dt, posx, posy, posz, velx, vely, velz, mass, NULL, NULL, NULL);
}

int writeoutput(char *filename, int n, int t, float dt,
                float *posx, float *posy, float *posz,
                float *velx, float *vely, float *velz, float *mass,
                float *accx, float *accy, float *accz)
{
    FILE *file = fopen(filename, "w");
    int i, count = 0;
    time_t tt;
    struct tm *tmt;
    if (file == NULL) // the output file could not be opened
        return 0;
    time(&tt);
    tmt = localtime(&tt);
    // write file comment header + parameters on a separate line
    if (0 > fprintf(file, "# N-body simulation output file.\n# File written %s# Simulation parameters (n, t, dt):\n%d %d %f\n", asctime(tmt), n, t, dt)) // asctime produces newline
    {
        fclose(file); // do not leak the handle on failure
        return 0;
    }
    for (i = 0; i < n; i++)
    {
        // print line, check if written to file
        if ((accx == NULL) || (accy == NULL) || (accz == NULL))
        {
            if (1 <= fprintf(file, "%f %f %f %f %f %f %f\n", posx[i], posy[i], posz[i], velx[i], vely[i], velz[i], mass[i]))
                count++;
        }
        else
        {
            if (1 <= fprintf(file, "%f %f %f %f %f %f %f %f %f %f\n", posx[i], posy[i], posz[i], velx[i], vely[i], velz[i], mass[i], accx[i], accy[i], accz[i]))
                count++;
        }
    }
    fflush(file);
    fclose(file);
    return (count == n);
}
16 Utility, output, C++
/*
 * NbodyOutput.h
 *
 * Created on: Nov 23, 2011
 * Author: truls
 */
#include <string>

#ifndef NBODYOUTPUT_H_
#define NBODYOUTPUT_H_

class NbodyOutput {
public:
    int n, t, curr_line; // number of bodies, timesteps
    float dt;
    float *px, *py, *pz, *vx, *vy, *vz, *m, *ax, *ay, *az;
    std::string comment;
    NbodyOutput(int nT, int tT, float dtT);
    virtual ~NbodyOutput();
    bool addBody(float p_x, float p_y, float p_z, float v_x, float v_y, float v_z, float m, float a_x, float a_y, float a_z);
    bool addBody(float p_x, float p_y, float p_z, float v_x, float v_y, float v_z, float m);
    bool addBodies(int bodies, float *p_x, float *p_y, float *p_z, float *v_x, float *v_y, float *v_z, float *m, float *a_x, float *a_y, float *a_z);
    bool addBodies(int bodies, float *p_x, float *p_y, float *p_z, float *v_x, float *v_y, float *v_z, float *m);
    void addTimestamp(void);
    void addComment(std::string morecomment);
    bool writeBodyFile(std::string fileName);
};

#endif /* NBODYOUTPUT_H_ */

/*
 * NbodyOutput.cpp
 */

#include <iostream>
#include <cstdio>
#include <string>
#include <sstream>
#include <fstream>
#include <time.h>
#include "NbodyOutput.h"

using namespace std;

NbodyOutput::NbodyOutput(int nT, int tT, float dtT)
{
    n = nT; t = tT; dt = dtT;
    curr_line = 0;
    string first = "# Bodies file created by NbodyOutput class";
    comment = first;
    px = new float[n];
    py = new float[n];
    pz = new float[n];
    vx = new float[n];
    vy = new float[n];
    vz = new float[n];
    m = new float[n];
    ax = new float[n];
    ay = new float[n];
    az = new float[n];
    for (int i = 0; i < n; i++)
    {
        px[i] = 0.0f; py[i] = 0.0f; pz[i] = 0.0f;
        vx[i] = 0.0f; vy[i] = 0.0f; vz[i] = 0.0f;
        m[i] = 0.0f;
        ax[i] = 0.0f; ay[i] = 0.0f; az[i] = 0.0f;
    }
}

NbodyOutput::~NbodyOutput()
{
    delete[] px;
    delete[] py;
    delete[] pz;
    delete[] vx;
    delete[] vy;
    delete[] vz;
    delete[] m;
    // the acceleration arrays are allocated in the constructor as well
    delete[] ax;
    delete[] ay;
    delete[] az;
}

bool NbodyOutput::addBody(float p_x, float p_y, float p_z, float v_x, float v_y, float v_z, float mT)
{
    return NbodyOutput::addBody(p_x, p_y, p_z, v_x, v_y, v_z, mT, 0.0f, 0.0f, 0.0f);
}

bool NbodyOutput::addBody(float p_x, float p_y, float p_z, float v_x, float v_y, float v_z, float mT, float a_x, float a_y, float a_z)
{
    if (curr_line >= n)
    {
        return false;
    }
    else
    {
        px[curr_line] = p_x;
        py[curr_line] = p_y;
        pz[curr_line] = p_z;
        vx[curr_line] = v_x;
        vy[curr_line] = v_y;
        vz[curr_line] = v_z;
        m[curr_line] = mT;
        ax[curr_line] = a_x;
        ay[curr_line] = a_y;
        az[curr_line] = a_z;

        curr_line++;
        return true;
    }
}

bool NbodyOutput::addBodies(int bodies, float *p_x, float *p_y, float *p_z, float *v_x, float *v_y, float *v_z, float *mT)
{
    return NbodyOutput::addBodies(bodies, p_x, p_y, p_z, v_x, v_y, v_z, mT, ax, ay, az);
}

bool NbodyOutput::addBodies(int bodies, float *p_x, float *p_y, float *p_z, float *v_x, float *v_y, float *v_z, float *mT, float *a_x, float *a_y, float *a_z)
{
    if ((curr_line + bodies) > n)
    {
        return false;
    }
    else
    {
        for (int i = 0; i < bodies; i++)
        {
            addBody(p_x[i], p_y[i], p_z[i], v_x[i], v_y[i], v_z[i], mT[i], a_x[i], a_y[i], a_z[i]);
        }
    }
    return true;
}

void NbodyOutput::addComment(std::string morecomment)
{
    comment += "\n";
    comment += "# ";
    comment += morecomment;
}

void NbodyOutput::addTimestamp(void) // adds current system time to file comments
{
    time_t tt;
    struct tm *tmt;
    time(&tt);
    tmt = localtime(&tt);
    stringstream ss;
    ss << asctime(tmt);
    string s = ss.str();
    s.erase(s.length() - 1, 1); // remove the \n added by asctime
    NbodyOutput::addComment(s);
}

bool NbodyOutput::writeBodyFile(std::string fileName)
{
    ofstream outstream(fileName.c_str(), ios::out);
    if (!outstream.is_open())
    {
        return false;
    }
    else
    {
        outstream << comment << endl;
        outstream << n << " " << t << " " << dt << endl;
        for (int i = 0; i < n; i++)
        {
            outstream << px[i] << " " << py[i] << " " << pz[i] << " "
                      << vx[i] << " " << vy[i] << " " << vz[i] << " " << m[i];
            // write the acceleration vector if any of its components is nonzero
            if ((ax[i] != 0) || (ay[i] != 0) || (az[i] != 0))
            {
                outstream << " " << ax[i] << " " << ay[i] << " " << az[i];
            }
            outstream << endl;
        }
    }
    outstream.close();
    return true;
}
17 Utility, Plummer galaxy creator
/*
 * PlummerModel.h
 *
 * Created on: Nov 25, 2011
 * Author: joni
 */

#ifndef PLUMMERMODEL_H_
#define PLUMMERMODEL_H_

#include <string>

class PlummerModel
{
public:
    int n;
    float rSeed;
    float *px, *py, *pz, *vx, *vy, *vz, *m;
    void generateBodies(void);
    bool writeBodies(std::string);
    inline float randf(void);
    PlummerModel(int, float);
    ~PlummerModel();
};

#endif /* PLUMMERMODEL_H_ */

/*
 * PlummerModel.cpp
 *
 */

#include "PlummerModel.h"
#include "NbodyOutput.h"

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <cstdlib>
#include <math.h>

#define SQRT2 1.414213562
#define PI 3.1415926535897932384626433832795
#define sqrtf(x) (float) sqrt(x)

using namespace std;

PlummerModel::PlummerModel(int nBodies, float randomSeed)
{
    n = nBodies;
    rSeed = randomSeed;
    px = new float[n];
    py = new float[n];
    pz = new float[n];
    vx = new float[n];
    vy = new float[n];
    vz = new float[n];
    m = new float[n];
}

PlummerModel::~PlummerModel()
{
    delete[] px;
    delete[] py;
    delete[] pz;
    delete[] vx;
    delete[] vy;
    delete[] vz;
    delete[] m;
}

inline float PlummerModel::randf(void)
{
    return (float) rand() / RAND_MAX;
}

void PlummerModel::generateBodies(void)
{
    float posx, posy, posz, velx, vely, velz, mass, r, theta, phi, a, b, velfact;
    mass = 1.0 / n;

    // seed the prng function
    srand(rSeed);
    for (int body = 0; body < n; body++)
    {

        float test;
        do {
            r = 1.0 / sqrtf((pow(randf(), (-2.0 / 3.0)) - 1.0));
            theta = acos((randf() - 0.5) * 2);
            phi = randf() * 2 * PI;
            posx = r * sin(theta) * cos(phi);
            posy = r * sin(theta) * sin(phi);
            posz = r * cos(theta);
            test = posx * posx + posy * posy + posz * posz;
        }
        while (test > 1.0); // truncates at radius one
        a = 0.0;
        b = 0.1;
        while (a < pow(b * b * (1.0 - b * b), 3.5))
        {
            a = randf();
            b = randf() / 10.0;
        }

        do
        {
            velfact = b * SQRT2 * pow(1.0 + r * r, -0.25);
            theta = acos((randf() - 0.5) * 2);
            phi = randf() * 2 * PI;
            velx = velfact * sin(theta) * cos(phi);
            vely = velfact * sin(theta) * sin(phi);
            velz = velfact * cos(theta);
            test = velx * velx + vely * vely + velz * velz;
        }
        while (test > 1.0);
        px[body] = posx;
        py[body] = posy;
        pz[body] = posz;
        vx[body] = velx;
        vy[body] = vely;
        vz[body] = velz;
        m[body] = mass;
    }
}
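
The radius in `generateBodies` is drawn by inverting the cumulative mass profile of the Plummer sphere. In units where the scale length and total mass are one, the mass fraction inside radius $r$ is $u = r^3 (1 + r^2)^{-3/2}$, which solves to $r = (u^{-2/3} - 1)^{-1/2}$. The C++ sketch below states both directions so the inversion can be checked numerically; the function names are hypothetical and double precision is used only for the check.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: radius enclosing mass fraction u, 0 < u < 1.
inline double plummerRadius(double u)
{
    assert(u > 0.0 && u < 1.0);
    return 1.0 / std::sqrt(std::pow(u, -2.0 / 3.0) - 1.0);
}

// Hypothetical helper: mass fraction enclosed within radius r >= 0.
inline double plummerMassFraction(double r)
{
    assert(r >= 0.0);
    return r * r * r / std::pow(1.0 + r * r, 1.5);
}
```

Feeding a uniform random number into `plummerRadius` therefore produces radii with the Plummer mass distribution, which is the role `randf()` plays in the listing.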

bool PlummerModel::writeBodies(string fileName)
{
    NbodyOutput nbOut(n, 1, 0.0f);
    // build the comment with a stream, so no fixed-size buffer can overflow
    stringstream ss;
    ss << "Plummer model galaxy created with parameters n: " << n
       << ", random seed: " << fixed << rSeed;
    nbOut.addComment(ss.str());
    nbOut.addTimestamp();
    // a single call copies all n bodies
    if (!nbOut.addBodies(n, px, py, pz, vx, vy, vz, m))
    {
        return false;
    }
    return nbOut.writeBodyFile(fileName);
}
18 Utility, bitmap generator
#ifndef NBODYBITMAP_H_
#define NBODYBITMAP_H_

#include <string>
#include "EasyBMP.h"

class NbodyBitmap
{
public:
    float *px, *py, *pz;
    int n;
    int dimx, dimy;
    float maxradius;
    bool init;
    BMP *bitmap;
    NbodyBitmap(int dx, int dy, float mr);
    ~NbodyBitmap(void);
    void setInputData(int n, float *ipx, float *ipy, float *ipz);
    bool setInputFile(std::string fn);
    bool writeBitmap(std::string fn);
};


#endif /* NBODYBITMAP_H_ */

/*
 * NbodyBitmap.cpp
 */
#include <iostream>
#include <cstdio>
#include <string>
#include <sstream>
#include <fstream>
#include <cmath>
#include "EasyBMP.h"
#include "EasyBMP_DataStructures.h"
#include "NbodyInput.h"
#include "NbodyBitmap.h"

using namespace std;

NbodyBitmap::NbodyBitmap(int dx, int dy, float mr)
{
    dimx = dx; dimy = dy;
    maxradius = mr;
    bitmap = new BMP();
    bitmap->SetBitDepth(24);
    bitmap->SetSize(dimx, dimy);
    init = false;
}

NbodyBitmap::~NbodyBitmap(void)
{
    delete bitmap;
}

void NbodyBitmap::setInputData(int nbodies, float *ipx, float *ipy, float *ipz)
{
    n = nbodies;
    px = ipx; py = ipy; pz = ipz;
    init = true;
    RGBApixel b;
    b.Red = 0; b.Green = 0; b.Blue = 0;
    RGBApixel w;
    w.Red = 248; w.Green = 248; w.Blue = 255; // "ghost white"
    for (int i = 0; i < dimx; i++)
    {
        for (int j = 0; j < dimy; j++)
        {
            bitmap->SetPixel(i, j, b); // fill out bitmap with black
        }
    }
    for (int i = 0; i < n; i++)
    {
        float pxT = px[i], pyT = py[i]; // we're only plotting 2d
        if ((abs(pxT) <= maxradius) && (abs(pyT) <= maxradius))
        {
            int pxx, pxy;
            pxx = (int) ((pxT + maxradius) * (dimx / (2.0 * maxradius)));
            pxy = (int) ((pyT + maxradius) * (dimy / (2.0 * maxradius)));
            // a coordinate equal to +maxradius maps to dimx (or dimy), one past the last pixel
            if (pxx >= dimx) pxx = dimx - 1;
            if (pxy >= dimy) pxy = dimy - 1;
            bitmap->SetPixel(pxx, pxy, w); // plot body as one pixel
        }
    }
}
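
The plotting loop maps a coordinate in the interval [-maxradius, maxradius] linearly onto a pixel column or row. A small C++ sketch of that mapping follows, with the result clamped to the valid pixel range; the function name `toPixel` is hypothetical and the clamping is an addition for the example.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: maps coord in [-maxRadius, maxRadius] to a pixel
// index in [0, dim - 1]. Values outside the interval are clamped to the edge.
inline int toPixel(float coord, float maxRadius, int dim)
{
    assert(maxRadius > 0.0f && dim > 0);
    int p = static_cast<int>((coord + maxRadius) * (dim / (2.0f * maxRadius)));
    return std::min(std::max(p, 0), dim - 1);
}
```

With a 1000 pixel wide image and a maximum radius of 2.0, as used in `main`, each pixel covers 0.004 length units and the origin lands on pixel 500.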

bool NbodyBitmap::setInputFile(std::string fn)
{
    NbodyInput in(fn);
    bool b = in.read();
    if (b)
    {
        NbodyBitmap::setInputData(in.n, in.px, in.py, in.pz);
    }
    return (b && init);
}

bool NbodyBitmap::writeBitmap(std::string fn)
{
    if (!init) return init;
    return bitmap->WriteToFile(fn.c_str());
}

int main(int argc, char *argv[])
{
    NbodyBitmap nbbmp(1000, 1000, 2.0);

    if (argc == 2)
    {
        stringstream ss;
        ss << argv[1];
        string fileName = ss.str();
        if (nbbmp.setInputFile(fileName))
        {
            cout << fileName << " loaded." << endl;
            ss << ".bmp";
            string bitmapFileName = ss.str();
            if (nbbmp.writeBitmap(bitmapFileName))
            {
                cout << bitmapFileName << " written." << endl;
            }
        }
    }
    else
    {
        cout << "Usage: " << argv[0] << " <input data file>" << endl;
    }
    return 0;
}
19 Utility, energy calculator
// For calculating the energy of an n-body system

#include <math.h>

inline float dist_sqrd(float x1, float y1, float z1, float x2, float y2, float z2)
{
    return (x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) + (z1 - z2) * (z1 - z2);
}

float computeEnergy(int n, float *px, float *py, float *pz, float *vx, float *vy, float *vz, float *m)
{
    // Calculates the energy of the galaxy by adding potential and kinetic energy together.
    // The potential energy of each pair is negative, therefore the minus.
    double kin = 0.0; double pot = 0.0;
    for (int i = 0; i < n; i++)
    {
        // m * v^2; the factor 0.5 is applied in the return statement
        kin += (double) (m[i] * dist_sqrd(vx[i], vy[i], vz[i], 0.0f, 0.0f, 0.0f));
    }
    for (int i = 0; i < n - 1; i++)
    {
        for (int j = i + 1; j < n; j++) // each pair is counted once
        {
            pot -= (double) (m[i] * m[j]) / sqrt(dist_sqrd(px[i], py[i], pz[i], px[j], py[j], pz[j]));
        }
    }
    return (float) (0.5 * kin + pot);
}
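
The total energy is mainly useful as a conserved quantity: comparing its value before and after a run indicates how well the integration behaved. The sketch below repeats the calculation in double precision, with G = 1 and no softening as in the listing, and adds a relative drift measure. The `Particle` type and both function names are hypothetical and not part of the thesis code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle record: position, velocity and mass.
struct Particle { double x, y, z, vx, vy, vz, m; };

// Kinetic plus potential energy, every pair counted once.
inline double totalEnergy(const std::vector<Particle>& p)
{
    double kin = 0.0, pot = 0.0;
    for (std::size_t i = 0; i < p.size(); i++)
    {
        kin += 0.5 * p[i].m * (p[i].vx * p[i].vx + p[i].vy * p[i].vy + p[i].vz * p[i].vz);
        for (std::size_t j = i + 1; j < p.size(); j++)
        {
            double dx = p[i].x - p[j].x;
            double dy = p[i].y - p[j].y;
            double dz = p[i].z - p[j].z;
            double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            assert(r > 0.0); // coincident bodies would give an infinite potential
            pot -= p[i].m * p[j].m / r;
        }
    }
    return kin + pot;
}

// Relative change of the energy e with respect to the initial value e0.
inline double relativeDrift(double e0, double e)
{
    assert(e0 != 0.0);
    return (e - e0) / std::fabs(e0);
}
```

Two unit masses at rest one length unit apart have a total energy of -1; giving one of them unit speed adds 0.5 of kinetic energy.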
