CADISHI: Fast parallel calculation of particle-pair distance histograms
  on CPUs and GPUs by Reuter, Klaus & Köfinger, Jürgen
CADISHI: Fast parallel calculation of particle-pair distance histograms on
CPUs and GPUs
Klaus Reutera,∗, Jürgen Köfingerb
aMax Planck Computing and Data Facility, Gießenbachstraße 2, 85748 Garching, Germany
bMax Planck Institute of Biophysics, Max-von-Laue-Straße 3, 60438 Frankfurt, Germany
Abstract
We report on the design, implementation, optimization, and performance of the CADISHI software pack-
age, which calculates histograms of pair-distances of ensembles of particles on CPUs and GPUs. These
histograms represent 2-point spatial correlation functions and are routinely calculated from simulations of
soft and condensed matter, where they are referred to as radial distribution functions, and in the analysis of
the spatial distributions of galaxies and galaxy clusters. Although conceptually simple, the calculation of ra-
dial distribution functions via distance binning requires the evaluation of O(N2) particle-pair distances where
N is the number of particles under consideration. CADISHI provides fast parallel implementations of the
distance histogram algorithm for the CPU and the GPU, written in templated C++ and CUDA. Orthorhom-
bic and general triclinic periodic boxes are supported, in addition to the non-periodic case. The CPU kernels
feature cache-blocking, vectorization and thread-parallelization to obtain high performance. The GPU ker-
nels are tuned to exploit the memory and processor features of current GPUs, demonstrating histogramming
rates of up to a factor 40 higher than on a high-end multi-core CPU. To enable high-throughput analyses of
molecular dynamics trajectories, the compute kernels are driven by the Python-based CADISHI engine. It
implements a producer-consumer data processing pattern and thereby enables the complete utilization of all
the CPU and GPU resources available on a specific computer, independent of special libraries such as MPI,
covering commodity systems up to high-end HPC nodes. Data input and output are performed efficiently
via HDF5. In addition, our CPU and GPU kernels can be compiled into a standard C library and used
with any application, independent from the CADISHI engine or Python. The CADISHI software is freely
available under the MIT license.
Keywords: radial distribution function, pair-distance distribution function, two-point correlation function,
distance histogram, GPU, CUDA
PROGRAM SUMMARY
Program Title: CADISHI
Licensing provisions: MIT
Programming language: C++, CUDA, Python
∗Corresponding author
Email addresses: klaus.reuter@mpcdf.mpg.de (Klaus
Reuter), juergen.koefinger@biophys.mpg.de (Jürgen
Köfinger)
Nature of problem:
Radial distribution functions are of fundamental impor-
tance in soft and condensed matter physics and astro-
physics. However, the calculation of distance histograms
scales quadratically with the particles number. To be able
to analyze large data sets, fast and efficient implementa-
tions of distance histogramming are crucial.
Solution method:
CADISHI provides parallel, highly optimized implemen-
Preprint submitted to Computer Physics Communications August 7, 2018
ar
X
iv
:1
80
8.
01
47
8v
1 
 [p
hy
sic
s.c
om
p-
ph
]  
4 A
ug
 20
18
tations of distance histogramming. On the CPU, high
performance is achieved via an advanced cache blocking
scheme in combination with vectorization and threading.
On the GPU, the problem is decomposed via a tiling
scheme to exploit the GPU’s massively parallel architec-
ture and hierarchy of global, constant and shared memory
efficiently, resulting in significant speedups compared to
the CPU. Moreover, CADISHI exploits all the resources
(GPUs, CPUs) available on a compute node in parallel.
Additional comments including Restrictions and Unusual
features (approx. 50-250 words):
CADISHI implements the minimum image convention for
orthorhombic and general triclinic periodic boxes. We
provide Python interfaces and the option to compile the
kernels into a plain C library.
1. Introduction
Radial distribution functions link the structural
and thermodynamic properties of soft and condensed
matter [1, 2]. Structurally, these particle pair corre-
lation functions provide the probability of finding a
particle at a certain distance from another particle of
a system. Thermodynamically, these functions deter-
mine the equation of state for systems with pair-wise
interactions. In astronomy and astrophysics, these
spatial two-point correlation functions are used to de-
scribe the distribution of galaxies or galaxy clusters
in the universe [3, 4].
In experiments on condensed matter, the Fourier
transform of the radial distribution function is mea-
sured by elastic scattering of x-rays or neutrons, or
light scattering in the case of microscopically sized
particles like colloids. Scattering intensities can be
calculated accurately from radial distribution func-
tions using Debye’s equation [5]. This approach is
used, for example, to calculate small- and wide-angle
x-ray scattering intensities from molecular dynamics
simulations of biological macromolecules in solution
[6, 7].
Radial distribution functions are central to liq-
uid state theory and facilitate the interpretation of
molecular simulations. For simple liquids, we can
predict phase transitions of liquid mixtures using ra-
dial distributions functions obtained from the integral
theory of Ornstein and Zernike [8, 1]. In simulations,
radial distribution functions help us to interpret or-
dering effects and the resulting effective interactions
between particles [9]. For complex systems, we usu-
ally lack feasible theoretical approaches to calculate
radial distribution functions. We thus estimate ra-
dial distribution functions from molecular dynamics
simulation (MD) trajectories and Monte Carlo simu-
lation ensembles by calculating histograms of particle
pair-distances [10].
This task of calculating a radial distribution func-
tion scales with the number of particles squared and
is thus computationally challenging for large systems.
Large-scale parallel molecular dynamics simulations
of soft and condensed matter in explicit solvent can
generate large amounts of trajectory data with hun-
dreds of thousands of frames with potentially mil-
lions of particles per frame [11]. Levine, Stone, and
Kohlmeyer were the first to tackle this challenge by
taking advantage of the processing power of CUDA-
enabled GPUs [12]. Their software can be easily used
via VMD [13], a widely used program to set up, vi-
sualize, and analyze molecular dynamics simulations.
Extending their pioneering efforts, we provide here
a novel software for the "CAlculation of DIStance
HIstograms" (CADISHI), which uses both CPUs and
GPUs on a single node to calculate radial distribu-
tion functions at very high performance. In addition
to non periodic systems, orthorhombic and general
triclinic boxes are supported.
This paper is structured as follows. Section 2
briefly introduces the mathematical background and
discusses sequential and parallel distance histogram-
ming. Section 3 details how the distance histogram
algorithms are efficiently implemented on the CPU
and on the GPU. We report and discuss extensive
benchmark results for both kinds of processors in
section 4. Finally, section 5 closes the paper with
a summary.
2. Methods
To calculate a radial distribution function from an
ensemble of particles, we calculate all pair distances
and collect them in a histogram. If the ensemble
stems from molecular simulations then we have to
properly take into account the boundary conditions.
2
Commonly, we use periodic boundary conditions in
simulations and apply the minimum image conven-
tion. We distinguish two scenarios: In the first sce-
nario, we are interested in bulk properties and we
calculate radial distribution functions up to half the
minimum image distance, e.g., half the box-length
for a cubic box. In the second scenario, we sim-
ulate a single macromolecule in a box as a model
for a dilute system. To calculate scattering intensi-
ties (SAXS/WAXS), we have to cut out the macro-
molecule and a sufficiently thick layer of solvent and
effectively embed this system in infinite solvent [6]. In
this case, we calculate the radial distribution function
for the complete sub-system we have cut out, without
applying any periodic boundary conditions.
In the following, we first show how radial distri-
bution functions are calculated from histograms of
pair-distances following the notation of Levine et al.
[12]. We then sketch the basic algorithm, recapit-
ulate how to take periodic boundary conditions for
orthorhombic and triclinic boxes into account, and
discuss different methods for the implementation and
parallelization.
2.1. Mathematical background
The radial distribution function [1, 2] is defined as
g(r) = lim
dr→0
p(r)
4pi(Npairs/V )r2dr
. (1)
Here, r is the distance between a pair of particles,
p(r) is the average number of atom pairs found at
a distance between r and r + dr, Npairs is the total
number of unique atom pairs in the system, and V is
the total volume of the system. For MD simulations,
p(r) is calculated from a finite number of trajectory
frames Nframes for all unique atom pairs indexed by
i, j as
p(r) =
1
Nframes
Nframes∑
k
∑
i,j(6=i)
δ(r − rijk), (2)
where rijk is the distance between particles i and j
at frame k. The δ function is replaced by a uniform
Algorithm 1 Two-species particle pair distance his-
togramming algorithm.
initialize histogram to zero
for i = 0 to N1 {loop over species 1} do
for j = 0 to N2 {loop over species 2} do
compute distance vector dxij between particle
i and particle j,
apply minimum image convention to dxij {for
periodic systems only},
compute distance rij = (x2ij + y2ij + z2ij)1/2
obtain bin index κij = (int)nbinsrij/rmax
increment histogram bin at index κij
end for
end for
histogram on a grid by introducing
p(r) =
1
Nframes
Nframes∑
k
∑
i,j(6=i)
∑
κ
dκ(r, rijk). (3)
Here, κ is the histogram bin index. The value of a
histogram bin is defined as
dκ(r, ri,j,k) =
{
∆r−1 if rκ ≤ r, rijk < rκ + ∆r
0 else.
(4)
where ∆r is the width of a histogram bin, and rκ =
κ∆r is the lower bound of a histogram bin.
2.2. Basic sequential particle pair distance histogram
computation
Algorithm 1 sketches the basic sequential two-
species distance histogram computation, which is a
common use case. From a set of N1 particles of
species 1 and a set of N2 particles of species 2, the
distance for each combination of two particles is eval-
uated, rescaled to an integer index which is finally
used to increment the bin counter. In periodic sys-
tems, the minimum image convention is applied to
the distance first. In total, distances between N1×N2
particle pairs need to be binned.
Considering the single species distance histogram
computation of a set with N particles, the difference
to algorithm 1 is given by the constraint that we per-
form only evaluations of unique pairs. To this end,
3
the inner loop in algorithm 1 is modified to start from
j = i + 1, with N = N1 = N2. In this single species
case, in total N(N − 1)/2 particle pairs need to be
considered.
It is obvious that the floating-point computation
intense part of the algorithm is given by the distance
computation. Further numerical costs are added by
periodic boundary conditions, which introduce the
need for rounding operations and, in the case of the
general triclinic box, the need for multiple distance
evaluations between the different images.
On a side note, it seems tempting to avoid the
numerically costly square root operation in the dis-
tance calculation and to use a quadratically scaled
histogram instead. However, in practice it turns out
that the resulting larger histogram array spoils the
cache efficiency and leads in combination with the
necessary postprocessing of the histogram to inferior
results compared to a high-performance implementa-
tion of the direct Euclidean distance computation.
2.3. Periodic boundary conditions
Commonly, molecular dynamics simulations and
Monte Carlo simulations of soft and condensed mat-
ter apply periodic boundary conditions (PBCs) to
minimize surface effects, which would otherwise cause
artifacts. Moreover, using triclinic box geometries
corresponding to the truncated octahedron or the
rhombododecahedron, for example, we can calculate
radial distribution functions for larger distances than
if we used an orthorhombic box with the same num-
ber of particles. The use of triclinic boxes can also
increase performance of simulations of single macro-
molecules by minimizing the number of solvent parti-
cles we have to add. In a periodic box the distance be-
tween two particles is given by the distance between
one particle in the central box and the nearest image
of the other particle. CADISHI implements this so-
called the minimum image convention for both, the
orthorhombic and the general triclinic periodic box
on both the CPU and the GPU, as detailed on in the
following.
Under the minimum image convention, the dis-
tance vector dx′ between two points in a general pe-
riodic box is given by
dx′ = dx− b nint (b−1dx) , (5)
where dx is the difference vector between the images
of two particles in the same box, b is the 3 × 3 ma-
trix of box vectors, and nint denotes rounding to
the nearest integer. For further details, we refer to
Appendix B in the book by Tuckerman [14].
An orthorhombic box has a rectangular basis and
the basis vectors can have different lengths. In this
case, b and b−1 are diagonal which simplifies the
evaluation of equation (5) in practice.
A general triclinic box has three basis vectors of
different lengths which intersect at arbitrary angles.
Hence, an orthorhombic box is a special case of the
triclinic box. Equation (5) gives the minimum dis-
tance vector between two images for distances up to
half of the minimum width of the periodic box. Note
that b is now non-diagonal. To support larger dis-
tances, the distance vector from equation 5 needs to
be shifted to the neighboring cells in order to find the
true minimum distance. We refer the reader to the
PhD thesis of Tsjerk Wassenaar for details [15].
2.4. Parallel particle-pair distance histogram compu-
tation.
In the case of distance histogramming, the dis-
tance calculations between particle pairs are embar-
rassingly data parallel. The computational load is
easily balanced between processing units by distribut-
ing equally sized subsets of the data, i.e. the combina-
torial set of all the relevant particle pairs. Handling
the bin updates and the computation of the final his-
togram correctly in parallel turns out to be more in-
tricate. Basically, one of the following approaches is
possible, as pointed out in Ref. [12].
First, all the processing units may update the bins
of a single shared histogram concurrently, requiring
atomic hardware instructions or other mechanisms
to synchronize the individual memory updates. Sec-
ond, each processing unit may fill its own private his-
togram, followed by a global reduction to sum up all
the private partial histograms in order to obtain the
final histogram. Third, a combination of both the
previous approaches may be favorable, where groups
4
of processing units share a private histogram. As will
be detailed on in the following sections, the second
option is suitable for the CPU whereas the third op-
tion is well suited for the GPU.
3. Implementation
The CPU and GPU kernels discussed in the fol-
lowing sections are implemented in templated C++.
Templates avoid code duplication and allow, for ex-
ample, to compile executables for single and double
precision coordinate input from the same source code.
Moreover, templates facilitate the generation of effi-
cient code for all supported cases, i.e., with or with-
out periodic boxes, and with or without the check if
a distance falls within the maximum allowed value
in the non-periodic case. This is critical for perfor-
mance because branches in inner loops are avoided
completely at runtime. The distance computation
including the minimum image convention is imple-
mented only once and used by both, CPU and GPU,
via header file inclusion. We provide Python inter-
faces to the kernels.
Finally, to enable processing of large-scale MD sim-
ulation data, the CPU and GPU kernels need to be
driven efficiently to exploit all the compute resources
available on a computer. To this end we have imple-
mented a Python layer labeled the CADISHI engine
which is presented in section 3.3.
3.1. Histogram computation on CPUs
An efficient implementation of the particle-pair
distance histogram algorithm needs to take advan-
tage of all levels of parallelism and caches of mod-
ern x86_64 CPUs. First, there are several physi-
cal cores per chip which typically support more than
one hardware thread each (simultaneous multithread-
ing, called hyper-threading for Intel CPUs). Second,
each core supports SIMD parallelism on vectors with
a width of 128 (SSE2), 256 (AVX), or even 512 bits
(AVX512), being able to operate on 4, 8, and 16 single
precision numbers with a single instruction, respec-
tively. Moreover, to hide memory latencies, a cache
hierarchy exists with individual caches per physical
core (L1 and L2), and caches shared between mul-
tiple cores (L3). Finally, in multi-socket machines,
0 1 2 3 j
i
0
1
2
3
1,1
2,1bs
bs
0 1 2 3 j
i
0
1
2
2,1bs
bs
Figure 1: Loop tiling for the distance histogram calculation in
the case of a single particle species (top) and in the case of two
particle species (bottom). For a single particles species, the
tiles are triangular directly below the diagonal and quadratic
or rectangular elsewhere, with edges of maximum length bs.
For two particle species, the i axis indicates the tile index of
the first species and the j axis indicates tile index of the second
species.
different chips may access the same physical mem-
ory, however, at different latencies and bandwidths
(NUMA).
To optimize for cache utilization, we implemented
cache-blocked versions of the algorithms in addition
to a direct implementation of the double-loop struc-
ture of algorithm 1. The cache-blocked versions tar-
get the L2 cache, which has a size of 256 kB to 1 MB
on modern CPUs and is therefore large enough to
hold for each thread a block of coordinates, an index
buffer, and the partial histogram.
In our loop tiling, shown schematically in Fig. 1,
the block size bs is defined as the length of an edge
of a tile. The block size depends on the histogram
5
width n_bins and is given by the solution of
L2_cache_size− reserve + extension
= sizeof(uint32_t) · bs2
+2 · sizeof(coordinate_tuple) · bs
+sizeof(uint32_t) · n_bins , (6)
which is a simple quadratic equation. On the
left-hand side of Eq. (6) we define the amount of
cache in bytes we make available for cache blocking.
This amount is determined by the size of the cache
L2_cache_size, determined at runtime once during
initialization, by the amount of of cache we reserve for
other data reserve, and by extension. Here, we set
reserve to 16 kB. The value of extension is set to
zero per default. For histograms wider than 45k bins,
the value of extension is increased proportional to
the histogram width in order to avoid that the block
size gets too small and to enable the blocking scheme
also for wide histograms. The transition at 45k bins
was determined by benchmarking.
The right-hand side Eq. (6) depends on the size
sizeof(uint32_t) of the cache array for the indices,
the size of the two sets of particle coordinates, 2 ·
sizeof(coordinate_tuple), and the width of the
histogram, sizeof(uint32_t) · n_bins. As shown
in the benchmark section below, the blocking scheme
allows to scale to large problem sizes without any
performance degradation. At small problem sizes,
the non-blocked versions are faster and a heuristic in
the code decides which kernel is to be called for a
specific input.
The CPU kernels are threaded by means of
OpenMP directives. Each thread works on a sub-
set of the combinatorial set of all relevant particle
pairs and updates its own private instance of the
histogram. The cache-blocked implementation par-
allelizes over tiles, whereas the non-blocked version
parallelizes the double-loop directly using OpenMP
directives. For each thread, the final step consists
of the reduction of all the private partial histograms
into the complete histogram.
In addition to threading, SIMD parallelism (vec-
torization) is of key importance to achieve good per-
formance on modern CPU cores. Two implementa-
tion details turn out to be essential for vectoriza-
tion. First, to adjust the memory alignment, the
coordinate triples must be padded, either implicitly
via compiler-specific attributes or explicitly by ex-
tending the triple by a fourth dummy element. Sec-
ond, incrementing a histogram bin immediately after
each distance calculation would access memory in a
non-predictable fashion and potentially lead to cache
thrashing. Thus, the bin indices are stored temporar-
ily in a contiguous buffer, which is of size bs2 in case
of the kernels with cache blocking. The bin updates
are done from that buffer array, an operation that is
inherently non vectorizable due to its non contiguous
memory access pattern on the histogram array. Note
that we do not use intrinsics for portability reasons.
Hence, it is up to the compiler to actually vectorize
the code which may fail in some cases as shown be-
low. The software Intel Amplifier XE was used to
guide the optimization work.
For performance reasons, we use integers for bin
counts and thus have to take care to avoid inte-
ger overflows. The thread-private histograms use
unsigned 32 bit integers (uint32_t). Whenever
a thread’s number of processed particle pairs ap-
proaches the upper limit of uint32_t, the thread-
private histogram is added atomically to the global
histogram (“flushed”) and reset to zero. Th global
histogram uses unsigned 64 bit integers (uint64_t).
Compared to a pure 32 bit implementation, the per-
formance penalty turns out to be marginal. This
strategy is also applied in the GPU implementation.
3.2. Histogram computation on GPUs
GPUs are not only successfully used to speed up
simulations of molecular dynamics [16, 17], they are
similarly well suited to accelerate analysis tasks such
as the pair-distance histogram calculation. State-of-
the art graphics processing units are comprised of
several streaming multiprocessors that have on the
order of 10 to 100 cores each. All the multiproces-
sors have access to a cached single global memory on
the GPU, distinct from the host’s main memory. We
partly follow the lines of Levine et. al. [12] for the
implementation of a tiling scheme suitable to obtain
high performance. Moreover, we extend their work,
e.g. with kernels enabled to scale into the large bin
6
number regime and with the support of triclinic pe-
riodic boxes.
Our implementations are based on the NVIDIA
CUDA programming model [18] but the key points
made are applicable to other platforms as well. In the
spirit of a heterogeneous programming model, kernels
are launched from the host code to perform computa-
tion on the GPU using numerous lightweight threads.
Logically, CUDA organizes threads in thread blocks,
and arranges the thread blocks on a grid. All threads
from a thread block run on the same streaming multi-
processor, grouped into so called warps of 32 threads
that run simultaneously. Threads of the same thread
block are able to communicate via a shared memory
that can be regarded as a user-managed cache. Dif-
ferent thread blocks are independent. When multiple
threads read from the same memory address at the
same time, GPU constant memory with its associ-
ated constant memory cache is highly beneficial. For
further details, we refer the reader to the NVIDIA
CUDA documentation [18].
With typically on the order of a thousand to a mil-
lion of particles per MD trajectory frame, the combi-
natorial set of all particle pairs contains on the order
of 106 to 1012 elements. Clearly, the large number
of independent distance calculations can be mapped
to independent threads very well in order to keep
the numerous GPU cores busy. However, the effi-
cient handling of the histogram bin updates is a ma-
jor challenge. This operation involves some kind of
synchronization between threads and is, moreover,
characterized by scattered memory accesses.
In general, the data transfer between host and
GPU memory is a performance critical aspect. For
the distance histogram computation, the cost of the
data transfer is insignificant for sufficiently large
problem sizes due to the computational complex-
ity O(N2), nevertheless we take several optimization
steps. To avoid the overhead from multiple individ-
ual transfers, the complete multi-species coordinate
set of an MD trajectory frame is prepared in a con-
tiguous memory area in pinned memory on the host
and is then copied to a contiguous area of GPU global
memory in a single operation. The GPU kernels then
calculate the histograms for all combinations of par-
ticle species. Finally, the resulting set of histograms
is transferred from GPU global memory back to CPU
pinned memory in a single operation. GPU memory
is allocated at the first kernel call only and reused at
subsequent calls.
To obtain the optimum performance for all relevant
input parameters and to allow for cross validation,
three kernel implementations of increasing complex-
ity were developed that are explained in detail below.
In the production code, the most suitable kernel is
selected together with its optimum launch parame-
ters for a particular GPU model based on the par-
ticle numbers and on the histogram width. To this
end, the implementation internally provides heuris-
tics covering recent GPU architectures. The NVIDIA
Visual Profiler was used to guide the optimization
work.
In the following sections, we present the GPU his-
togram kernel implementations in order of increasing
complexity.
3.2.1. A basic GPU kernel
A straight-forward approach to implement the
pair-particle distance histogram calculation using the
CUDA framework is to map the double-loop struc-
ture of algorithm 1 to a two-dimensional grid of
thread blocks such that each thread works on an in-
dividual particle pair. Bin increments are naively im-
plemented by performing atomic updates of a single
shared histogram in global memory. The overhead of
this approach can be reduced by choosing the actual
CUDA grid smaller than the total grid spanned by
the number of particle coordinates. Doing so we let
each thread work on several particle pairs by looping
over the full coordinate set using the CUDA total grid
size as offset. Consequently, global memory accesses
are coalesced automatically.
On older GPUs (pre Kepler), the binning rate
could be increased by cloning the histogram bins in
global memory and reducing them in a final step,
which lowers the access frequency of individual bins
and therefore the collision rate. On Kepler and more
recent GPUs we used during the final development
of this work, we find that the new fast atomic op-
erations in global memory do not make such cloning
necessary any more [18, Kepler tuning guide].
7
We refer to this implementation as the simple ker-
nel. It is characterized by a rather flat performance
profile independent of the number of bins. In the
scope of this work it is only used to perform cor-
rectness checks and as the starting point for more
complex kernels. It is outperformed substantially by
the two implementations presented in the following,
which are actually intended for production use.
3.2.2. Improving performance by coordinate tiling in
constant memory
To speed up the simple GPU kernel, a tiling scheme
is introduced in the spirit of a cache blocking tech-
nique aimed at a reduction of the number of accesses
to global memory. We first make use of the GPU’s
constant memory segment in order to accelerate the
memory access to the particle coordinate data. Lim-
ited in size to 64 kB, constant memory has an as-
sociated fast on-chip cache that delivers a value as
quickly as if it was read directly from a register, pro-
vided that all the threads in the warp access the same
address. However, data can be copied to constant
memory only from the host code.
In our implementation, the inner loop of algorithm
1 is mapped to a one-dimensional CUDA grid, cover-
ing the second coordinate set stored in global mem-
ory. The outer loop is written explicitly inside the
kernels. It iterates over a tile of the first coordinate
set that is stored in constant memory. Hence, each
thread reads a coordinate tuple from global memory
in a coalesced fashion and performs the distance bin-
ning for all the particles stored in the constant mem-
ory tile. The latter is the same for all the threads
at each loop iteration such that we exploit the fast
constant memory cache. We added to the host code
an additional loop which handles the copying of co-
ordinate data to constant memory before launching
the kernel. For single precision data, a tile in the 64
kB of constant memory comprises about 5300 coor-
dinate tuples. For each such tile a kernel launch has
to be performed.
To optimize the performance for frames with sev-
eral different particle species and different particle
numbers, we minimize the number of kernel launches
for two-species histograms.We first sort the particle
coordinate sets by increasing particle number. As a
consequence, the set for the second species located
in global memory has typically more members than
the first one located in constant memory. This ar-
rangement reduces the number of necessary kernel
launches.
We label this implementation the global memory
kernel because it keeps the histogram in global mem-
ory. It turns out to be the fastest kernel when going
to larger histogram bin numbers, as we will demon-
strate below. In addition, it is an important interme-
diate step towards optimizing the kernel further, as
will be done in the following by introducing a tiling
scheme for the histogram in shared memory.
3.2.3. Improving performance by histogram tiling in
shared memory
The global memory kernel presented in the previ-
ous section optimizes the coordinate tuple accesses
but performs the atomic bin updates in compara-
bly slow global memory. The key step to increase
the binning performance is to introduce private par-
tial histograms in shared memory. Compared to
pre-Maxwell GPUs, the performance benefits greatly
from the fast shared memory atomic operations that
became available with Maxwell chips [18, Maxwell
tuning guide].
In general, one shared memory buffer can be al-
located per CUDA thread block. It can be accessed
by all the threads in the block at near register speed.
However, shared memory is a scarce resource and lim-
ited to a maximum of 48 kB per thread block on most
of the GPUs relevant to the present work. Up to now,
only the Volta V100 GPU can be configured to pro-
vide 96 kB of shared memory to a thread block, which
significantly improves performance of this implemen-
tation [18, Volta tuning guide]. Hence, 32 bit partial
histograms in shared memory can be at most 12288
(×2 on the V100) bins wide. Consequently, to al-
low the kernels to process wider histograms, the bin-
ning range must be tiled, requiring multiple sweeps
through all the particle pairs, increasing the run time
proportionally. Before the lifetime of a thread block
ends, the private partial 32-bit histogram in shared
memory is added atomically to the 64-bit histogram
in global memory.
8
subset of
coord_1
(constant)
coord_1
coord_2
(global)coord_2
histogram
(global)histogram
thread_block_2
partial histogram (shared)
thread_block_N
partial histogram (shared)
thread_block_1
partial histogram (shared)
. . .
CPU GPU
Figure 2: Tiling scheme employed for the parallel decomposition on the GPU, including the use of the memory hierarchy. From
CPU memory (left, blue), a set of coordinate data from the first species is copied to GPU constant memory (constant, grey)
before each kernel launch. The coordinate data of the second species is copied initially once to global GPU memory (global,
red). The loop over that second set of coordinates is implemented via a 1d grid of thread blocks, such that each thread block
operates on a different subset of coord_2. Each thread block internally loops over the first species, accessing the identical
element at a time from each thread and thereby exploiting the fast constant memory cache. Each thread block updates its
private partial histogram in shared memory (shared, violet). Before a thread block terminates, the private partial histogram is
added atomically to the histogram in global GPU memory, which is finally copied back to CPU memory.
Figure 2 shows a schematic of our shared-memory-
based GPU implementation, providing a detailed ex-
planation of the memory access in the caption.
We label this implementation the shared memory
kernel. It is by far the fastest kernel at compara-
bly small bin numbers, when only one or few sweeps
through the particle pairs is necessary. For any re-
quested number of bins, the implementation deter-
mines the number of sweeps from the available shared
memory. The shared memory histograms are then
tiled to have (about) the same size for all the sweeps.
3.2.4. GPU kernel optimization
We determined the optimum block sizes of both,
the global memory and the shared memory kernels
for the 1d CUDA grid from benchmark runs of suffi-
ciently large problem sets, which saturate the GPUs.
These optimal size are chosen automatically by a sim-
ple heuristic. For Maxwell and newer GPUs, a block
size of 512 threads is chosen, whereas on older (Ke-
pler) GPUs it is beneficial to use larger blocks of 1024
threads. A user may override these defaults when
calling the kernels.
In case of the shared memory kernel, we deter-
mine the size of the tiles for the partial histograms
by the number of sweeps necessary to compute a re-
quested histogram width. The number of sweeps is
determined by the maximum amount of shared mem-
ory available per thread block. In general, as little
shared memory as possible should be used in order
to keep the occupancy of the streaming multiproces-
sors sufficiently high, primarily to hide global mem-
ory latencies. The best performance is not necessar-
ily achieved at an occupancy of 100%, which refers
to the maximum number of threads a GPU is able
to keep active. For some compute-bound kernels the
instruction-level parallelism is able to use the GPU
very well at occupancies smaller than one [19]. As
shown in section 4.2 the distance histogram kernels
fall into this class and are able to saturate the GPUs
at occupancies down to 25%.
So far, we only discussed the two-species compu-
9
tation. As pointed out before, for the single-species
case the only difference is given by the start index of
the inner loop, such that duplicate and same-particle
evaluations are avoided. In the GPU kernels, an if
branch is used to determine from the thread index
within the CUDA grid if the thread shall evaluate a
particle pair. Note that for sufficiently large prob-
lem sizes only a small fraction of the CUDA thread
blocks are actually affected by this if branch, very
similar to the diagonal blocks required for the CPU
cache blocking (see fig. 1). If the condition is true for
all the threads of a warp, no overhead is introduced.
If it is true for only some of the threads, the branches
are serialized, i.e., some threads of the warp stay idle
while the other threads perform their computations
in parallel.
Intentionally, the present implementation does not
split large sets of particle pairs onto several GPUs.
Rather, motivated by realistic application scenarios
that require the processing of numerous frames, in-
dividual frames are processed completely on a single
GPU, allowing to exploit trivial frame parallelism by
using several GPUs.
3.3. CADISHI parallel engine
Next, we discuss the CADISHI engine, which en-
ables users to exploit all the resources (CPUs, GPUs)
available on a compute node simultaneously. Such
an efficient use of resource is especially useful for the
parallel analysis of long MD trajectories with many
frames.
The concept of a data processing pipeline serves as
the design principal, as shown schematically in fig-
ure 3. Frames are provided and buffered in a queue
by a reader process. To ensure high performance,
the input trajectory is read from HDF5 files [20].
The frames are picked up and processed in parallel
by multiple worker processes, each computing all the
histograms for a particular frame using the CPU or
GPU kernels. The results are put into a second queue
from which a writer process fetches the histograms,
averages them optionally, and saves them to HDF5.
To enable users to import MD simulation data eas-
ily, we provide a conversion tool, which uses a generic
reader [21, 22].
reader
queue
worker_1worker_0 worker_N
queue
writer
. . .
Figure 3: Schematic of the CADISHI trajectory data process-
ing framework. Multiple processes (reader, writer, workers)
that communicate and synchronize via queues are used to im-
plement node-level parallelism. The workers either use GPU
or CPU resources.
This design offers a high degree of modularity and
flexibility for current and future methodical exten-
sions, e.g., the integration into more complex analy-
sis workflows using the CAPRIQORN package [7, 6].
We use the Python programming language to im-
plement CADISHI. In particular, we use Python’s
multiprocessing module, which is part of the stan-
dard library [23] and enables node-level parallelism
out of the box on virtually any platform. The im-
plementation does not use nor require a third-party
dependency such as the message passing interface
(MPI) for distributed-memory parallelism. The cost
of the inter-process communication of the atom co-
ordinates and the histograms is negligible compared
to the N2 complexity of the computational prob-
lem. The workers release the global interpreter lock
of Python explicitly when calling the compiled CPU
and GPU kernels such that the inter-process commu-
nication continues to run smoothly during the com-
putation.
A useful configuration on a typical two-socket two-
GPU compute node could be as follows. Two CPU
workers are used, each running the previously dis-
10
cussed thread-parallel CPU histogram code on an in-
dividual CPU socket. In addition, two GPU workers
are used, each of them running the GPU histogram
code on an individual GPU. It is important to reserve
a physical CPU core for each, the reader, the writer,
and the GPU workers, in order to guarantee quick
data transfer and avoid IO becoming the bottleneck.
CADISHI picks appropriate core numbers and also
handles the process pinning automatically.
4. Performance benchmarks
We performed extensive performance benchmarks
and investigate the binning rates of the CPU and
GPU kernels individually for input data of various
sizes. The input data was generated by putting par-
ticles at pseudo-random coordinates into a unit box.
To profile the CPU and GPU codes, we used a driver
program to supply the coordinate data, to launch the
kernels, to determine the time-to-solution, and to cal-
culates the binning rate in billion atom pairs per sec-
ond (bapps). All computations discussed below were
performed in single precision, which is the relevant
use case when MD simulation data is processed. In
general, double precision runs turn out to be between
a factor of 1.3 to 2 slower on CPUs and non-consumer
GPUs, depending on the problem size and the pres-
ence of a periodic box. In all cases, we measured the
time to solution, which includes memory transfers to
and from the GPU and overhead from GPU kernel
launches.
In the second part, we show results for the node-
level performance obtained on a compute node with
two GPUs, running the CADISHI engine on a prac-
tically relevant data set from an MD simulation.
4.1. CPU performance
The CPU kernel was profiled on a shared memory
machine with two Intel Xeon Platinum 8164 proces-
sors [24], providing 26 physical cores and 52 hardware
threads each. The base clock of a core is 2 GHz, and
the cores support the AVX512 instruction set.
Benchmark results in this section are exclusively
based on the binary generated by the Intel compiler
icc compiler in version 2017 using the optimiza-
tion flags "-fast -xHost -qopt-zmm-usage=high
-qopenmp". Indeed, the Intel compiler manages to
generate AVX512 code for the inner loop with the
distance computation, covering all the possible box
cases. The instruction set in use was checked at run-
time by reading out hardware counters for floating
point instructions via the Linux perf tool.
For comparison, we compiled kernels using the
GNU g++ compiler in version 7.2 with the op-
timization flags "-O3 -march=native -ffast-math
-funroll-loops -fopenmp" applied. In doing so,
vectorized AVX512 code is generated by g++ for the
inner loop of algorithm 1 in the case without periodic
boundary conditions. In the cases of orthorhombic
and triclinic boxes, rounding operations and opera-
tions to determine the minimum prevent the vector-
ization by the GNU compiler. Compared to the bi-
nary generated by icc, we find that the binary from
g++ runs virtually at the same speed for the vector-
ized case without any box, whereas the cases with a
periodic box run significantly slower due to the lack-
ing vectorization.
As will be seen in the next section, the GPUs per-
form much better in general, which mitigates the
drawback of having the boxed computation not vec-
torized well on the CPU with the widely used GNU
compiler.
Figure 4 shows a scan in the number of OpenMP
threads for a fixed problem size of 1M× 1M particle
pairs and 10k bins, covering the three possible box
cases. Initially, the number of threads is increased
from 1 up to all the 26 physical cores on a single
socket. While the scaling is ideal at the beginning,
the curve starts to deviate from the ideal scaling when
the socket gets increasingly filled. Since the kernel
is of complexity O(N2) and therefore compute and
not memory limited, this effect is likely to be caused
by the dynamic clocking of the vector units, which
reduces the frequencies as more and more cores are
used in order to limit the heat dissipation. Memory
accesses are of minor importance, in particular due
to the cache blocking optimization. Going from 1 to
2 sockets, the scaling is ideal. Enabling hyperthread-
ing in addition shows no benefit in the case without
PBCs and only marginal benefit in the other cases,
indicating that the CPU pipelines are already used
quite well. The scaling is rather similar for the three
11
 10
 100
 1000
1 2 4 8 16 26 52 104
w
al
l c
lo
ck
 ti
m
e 
[s 
pe
r 1
01
2  
at
om
 p
ai
rs
]
number of threads
no box
orthorhombic
triclinic
1 socket 2 sockets
SMT
Figure 4: Wall clock time of the histogram calculation as a
function of the number of OpenMP threads. Calculations were
performed on the Skylake CPU for a fixed problem size of 1M×
1M particle pairs and 10k bins, with cache-blocking enabled.
The vertical dotted lines indicate the transition from 1 to 2
sockets, and, moreover, the transition into the simultaneous
multithreading regime.
cases. The run times clearly indicate the cost asso-
ciated with the orthorhombic and the triclinic box,
in particular. For all the following investigations on
the CPU, we use 26 threads on a single socket, in or-
der to provide a 1:1 baseline for the comparison with
current GPU models done in the following section.
Figure 5 shows a scan in the number of histogram
bins on the Skylake processor, comparing the im-
plementations with and without the cache blocking
scheme for a fixed problem size of 500k × 500k par-
ticle pairs and 10k bins, with and without periodic
boxes. At histogram widths of up to about 10k bins,
the implementation with cache blocking achieves the
highest binning rate of about 13 bapps in the case
without PBCs. In the following, the binning rate de-
creases to about 10 bapps at 100k bins, around and
beyond which a linear decrease is observed. The rea-
son for the initial steep decrease is that, on each core,
the thread-private histogram occupies an increasingly
larger fraction of the L2 cache, as the number of
bins is increased. Therefore, the coordinate blocks
are chosen accordingly smaller, decreasing the effi-
ciency of the blocking scheme, cf. Eq. 6. Note that
 0
 2
 4
 6
 8
 10
 12
 14
 16
 0  50000  100000  150000  200000  250000
bi
lli
on
 a
to
m
 p
ai
rs
 p
er
 ti
m
e 
[1
/s]
number of bins
cache blocking
no blocking
no box
orthorhombic
triclinic
Figure 5: Histogramming rate as a function of the number of
bins. Calculations were performed on the Skylake CPU for a
fixed problem size of 500k × 500k atom pairs. We compare
the implementations with and without cache blocking for the
cases without PBCs and with PBCs using orthorhombic and
triclinic unit cells.
the cache blocking scheme was not designed to opti-
mize for scans in the histogram width but rather for
large problem sizes as will be discussed next. More-
over, in practice many applications require only mod-
erate histogram widths which lie within the regime of
highest performance. In comparison, the implemen-
tation without cache blocking is significantly slower,
for small bin numbers by about one third in the case
without box. The relative advantage from the cache
blocking decreases when going to the orthorhombic
case and virtually vanishes for the triclinic case which
is caused by the increasing arithmetic intensity mak-
ing the memory accesses less important.
Figure 6 shows scans in the problem size while
keeping a fixed histogram width of 10k bins. Re-
sults for the implementations with and without cache
blocking are shown. In addition to cases without
PBCs, performance data for the kernels handling pe-
riodic boxes is included. We first discuss the scans
without periodic box. Up to a problem size of about
100k×100k atom pairs, the code without cache block-
ing turns out to be faster than the variant with block-
ing, clearly indicating some overhead of the blocking
scheme. Going beyond that problem size, the per-
12
 0
 2
 4
 6
 8
 10
 12
 14
 1000  10000  100000
 1x106
bi
lli
on
 a
to
m
 p
ai
rs
 p
er
 ti
m
e 
[1
/s]
number of atoms per selection
cache blocking
no blocking
no box
orthorhombic
triclinic
Figure 6: Histogramming rate as a function of the problem size.
Calculations were performed on the Skylake CPU for fixed his-
togram width of 10k. We compare the cache-blocked with the
non-blocked implementation. In addition to the default case
without PBCs, we show results for PBCs using orthorhombic
and triclinic boxes.
formance of the code with cache blocking is nearly
constant, whereas the performance without blocking
drops severely by more than two thirds in the range
under consideration. Turning towards the cases with
periodic boxes, it is observed that the binning rate
is clearly slower compared to the case without pe-
riodic boxes. Comparing the plateaus, the perfor-
mance goes down to about two thirds for the or-
thorhombic case and down to about one third for
the triclinic case. The overhead is partially caused
by the nearbyint() function which is used to per-
form the rounding during the application of the min-
imum image convention. In microbenchmarks, the
nearbyint() function turned out to be faster than
the round() function by about 10%, which is likely
due to the fact that it does not raise the Inexact ex-
ception on the CPU. For the triclinic box, a three-fold
loop over all neighbouring boxes is executed in addi-
tion to find the minimum image for the general case,
causing the additional overhead.
4.2. GPU performance
The CUDA code was compiled with the gen-
eral optimization flags "-O3 -use_fast_math" and
was in addition adapted for the GPU ar-
chitecture under consideration, e.g., by apply-
ing the flags "–generate-code arch=compute_70,
code=compute_70" for the Volta GPU. For the host
code, GCC and the same flags as previously were
used.
We profiled the GPU kernel on several state-of-
the-art hardware platforms as shown in table 1. We
present results for NVIDIA V100 [27], P100 [26], and
GTX1080 [25] GPUs. The V100 GPU is part of an
NVIDIA DGX-1 system [29], whereas the P100 GPU
is part of an IBM POWER8+ system [28]. In the
DGX-1 system, the V100 GPUs are connected to the
host CPUs via PCIe 3.0. In the POWER8+ system,
the P100 GPUs are connected via NVLink which is
about a factor 3 faster in host-device bandwidth than
PCIe 3.0. Note that the GTX1080 GPU was designed
for entertainment applications, lacks ECC memory,
and has only very few units enabled for double pre-
cision operations.
Below we compare the performance of both im-
plementations of interest. First, we investigate
the shared memory kernel which keeps partial his-
tograms in shared memory, potentially requiring sev-
eral sweeps through all the particle pairs when going
to larger bin numbers (see section 3.2.3). In addition
we profile the global memory kernel, which updates a
single histogram in global memory (see section 3.2.2)
and turns out to be of advantage for larger bin num-
bers.
We present scans in the histogram width in figure
7 with the problem size kept fixed at 4M× 4M atom
pairs, which is large enough to saturate the GPUs
(see below). Initially, the performance curves for the
shared memory kernels are constant at rather high
levels. Remarkably, for the V100 GPU we find a
plateau of highest performance between 9k and 12k
bins, where the curve peaks around 9k bins at a bin-
ning rate of 495 bapps. Following that initial range
of highest performance, the binning rate decreases
in steps, with a width determined by the size of of
the shared memory on the GPU. Due to the lim-
ited memory, multiple sweeps are required for larger
bin numbers. Here, the V100 GPU has twice the
step width due to its twice as large shared memory
of 96 kB per thread block. The performance profile
13
Table 1: Overview on the GPUs and host systems used for the performance benchmarks. For more detailed specifications, we
refer to the references [25, 26, 27, 28, 29]. Note that the system with GTX1080 GPUs is also used for benchmark runs based
on real MD simulation data, cf. Sec. 4.4.
GPU (2×) GTX1080 (4×) P100 (8×) V100
fp32 peak [TFLOPS] 8.87 10.6 15.7
mem-bandw. [GB/s] 320 732 900
Bus PCIe 3.0 NVLink PCIe 3.0
CPU Intel Haswell IBM POWER8+ Intel Broadwell
2 × E5-2680v3 2 packages 2 × E5-2698v4
Cores (Threads) 2 × 12 (× 2) 2 × 10 (× 8) 2 × 20 (× 2)
 1
 10
 100
 1000
 1000  10000  100000
bi
lli
on
 a
to
m
 p
ai
rs
 p
er
 ti
m
e 
[1
/s]
number of bins
shared memory
global memory
GTX1080
P100
V100
Figure 7: Scan in the number of histogram bins for a problem
size kept fixed at 4M × 4M atom pairs without any periodic
box, comparing the implementations using shared and global
memory for the histogram updates on the three GPUs under
consideration.
of the consumer-grade GTX1080 GPU is very simi-
lar to that of the enterprise-grade P100 GPU, both
featuring the Pascal microarchitecture. The P100
GPU is somewhat faster, likely due to its higher in-
ternal memory bandwidth. In contrast, the perfor-
mance curves of the global memory kernel rise ini-
tially and reach plateaus at 12k bins. For larger bin
numbers, the possibility of collisions during bin up-
dates in global memory is low. Again the V100 GPU
is the fastest, followed by the P100 and GTX1080 de-
vices. The break-even point for the global memory
kernels is around 60k bins for the Pascal GPUs and
at 128k bins for the V100 GPU. This information is
used by the implementation for a simple heuristic to
decide which kernel is to be called for a given bin
number.
The achieved occupancy on the GPUs is nearly
100% for histogram widths up to 4k bins (8k on the
V100), when the CUDA grid is chosen to launch 512
threads per block for which we observe the best per-
formance. As confirmed using the nvprof tool the
occupancy decreases down to 25% as the histogram
width in shared memory is increased to the maxi-
mum possible value of about 12k (24k for the V100)
bins. The scans shown in figure 7 indicate that the
performance of the shared memory kernel is not at
all affected by the varying occupancy which is due to
their compute-bound characteristics and instruction-
level parallelism [19]. Rather the number of sweeps is
decisive as seen clearly from the staircase-type curves.
Figure 8 shows scans in the problem size for cases
with and without periodic boxes, while the histogram
width is kept fixed at 10k bins. Following the initial
steep rise of the performance curves, it can be seen
that there are at least 1010 atom pairs required for
the Pascal-based GPUs and at least 1012 atom pairs
for the V100 GPU to reach performance saturation
which is indicated by a flat-top in all cases. For the
P100 GPU, the effect from the fast NVLink intercon-
nect is clearly observed for small problem sizes when
the GPUs are not saturated and the costs of mem-
ory transfer and kernel launches are significant. In
particular, the curves of the P100 device are shifted
towards the left compared to the GTX1080 and the
V100 GPUs which are connected via the relatively
slower PCIe 3 bus to the host CPUs. The curves
for the periodic boxes already saturate the GPUs
14
 0.001
 0.01
 0.1
 1
 10
 100
 1000
 1000  10000  100000
 1x106  1x107  1x108
bi
lli
on
 a
to
m
 p
ai
rs
 p
er
 ti
m
e 
[1
/s]
number of atoms per selection
no box
orthorhombic
triclinic
GTX1080
P100
V100
Figure 8: Scan in the problem size for a fixed histogram width
of 10k bins, comparing the implementations with and without
the periodic boxes on the three GPUs under consideration.
for somewhat smaller problem sizes and level off at
lower binning rates, which is both due to the higher
arithmetic intensity. Interestingly, on both the non-
consumer-grade P100 and V100 GPUs and compared
to the results from the GTX1080 GPU, the perfor-
mance of the periodic box cases is much closer to the
cases without PBCs. For example, in the case of the
orthorhombic box, the binning rate with box is 0.69
(0.48) of the binning rate without box on the P100
(V100) GPU, whereas it is only 0.07 on the GTX1080
device. The reason is likely to be linked to the fact
that a significant part of the Pascal processor’s arith-
metic capabilities are disabled on the consumer-grade
GTX1080, in particular a large fraction of the double-
precision units. We speculate that this might apply
as well to the round operation required by the peri-
odic boxes.
4.3. Performance comparison between CPU and
GPU
Table 2 gives a direct comparison between the per-
formance on the CPU and on the GPUs under consid-
eration, based on data from the previously discussed
runs. A large problem size with 4M× 4M atom pairs
was selected at which the performance on both types
of processors is saturated, and a histogram width of
10k bins, where the highest binning rates are seen on
both types of processors. Absolute and relative per-
formance numbers are compared for the cases with-
out and with periodic boxes. In all cases the GPUs
are significantly faster than the CPU. While the con-
sumer grade GTX1080 is competitive with the pro-
fessional Pascal model P100 in the case without peri-
odic box, the GTX1080 is significantly outperformed
in the orthorhombic and triclinic cases, presumably
linked to disabled floating point rounding capabili-
ties. The Volta V100 GPU is by far the fastest, beat-
ing the 26 core Skylake by a factor of up to ∼ 40 in
the case without any box.
Note that our implementations are optimized to
be most efficient for small and moderately large bin
numbers up to 24k. In many application scenarios
it is possible and sufficient to choose the number of
bins, i.e., the desired numerical resolution of the one-
dimensional radial distribution function, to lie within
that range of highest performance. Finally, we point
out that the CPU code is faster for small problem
sizes, below selections of about 100k atoms, which
is due to the overhead induced by the heterogeneous
programming model and hardware of the GPU.
4.4. CADISHI single-node application performance
This section reports on the CADISHI application
performance based on actual MD simulation data
measured on a standard 2-socket server equipped
with 2 GPUs.
The data set under consideration comprises 2000
frames with about 280000 particles each from an
MD simulation of F1-ATPase. Simulations were per-
formed using NAMD [30]. With 9 chemical species,
36 partial histograms have to be evaluated for all
the possible intra- and inter-species combinations for
each frame, where the individual numbers of the
particle-pairs are highly different, ranging from ∼ 102
to ∼ 1010, which is in particular challenging for the
GPU implementation. No periodic box was consid-
ered. The compressed HDF5 trajectory file has a size
of 9 GB. A resolution of 8000 histogram bins was
chosen, going up to a maximum radius of 300Å. We
summed partial histograms at intervals of 100 frames
and wrote the summed-up histograms to the disk,
leading to a compressed HDF5 output file of 18 MB
15
Table 2: Comparison between the performance on the Skylake chip with 26 cores and the performance on the three GPUs
under consideration for a problem size of 4M× 4M atom pairs, 10k bins, without and with periodic boxes. For the GPUs, the
performance relative to the CPU is given.
no box orthorhombic box triclinic box
processor bapps [s−1] rel. bapps [s−1] rel. bapps [s−1] rel.
Skylake 12.38 1.00 8.63 1.00 4.22 1.00
GTX1080 135.66 10.96 9.44 1.09 6.28 1.49
P100 175.04 14.14 120.89 14.01 30.11 7.14
V100 494.45 39.95 239.34 27.74 55.25 13.10
in size. The performance benchmarks were run on
a HPC cluster with two Haswell CPUs [E5-2680 v3,
with 12 (24) physical cores (hardware threads) each]
and two consumer grade GTX1080 GPUs per node,
and a shared GPFS file system on which the IO was
performed.
Table 3 gives performance numbers for a selection
of setups found to perform well. Note that for the
reader, writer, and any (potentially present) GPU
workers one physical core was reserved each, and that
for the setups with CPUs involved all the remaining
physical cores were used. The waiting time for new
work packages is about 10 ms for all the setups shown.
The plain CPU runs C1 and C2 processed the com-
plete dataset in less than 3 hours. Here, using two
workers on separate NUMA domains turns out to be
slightly faster than using only one worker on both
the domains. Simultaneous multithreading was en-
abled for the sake of a small speedup (cf. Fig. 4). On
the other hand, the plain GPU run G2 processed the
2000 frames in slightly below 6 minutes. Relative to
the run C2 the speedup is 27.0, demonstrating that
the GPU has a significant advantage not only with
synthetic benchmark data but also with real MD sim-
ulation data.
It seems tempting to perform hybrid runs to fur-
ther speed up the GPU runs, however, our experi-
ments indicate that keeping all the CPU cores busy
in addition to the GPUs does pay off only marginally,
if at all. For such hybrid runs we find that the num-
ber of threads per CPU worker must be not more
than the number of available physical cores. For
larger numbers of threads, hyperthreading clogs the
threaded multiprocessing queues between the reader,
the writer, and the worker processes. Only the case
G2C1 with a single CPU worker is slightly faster than
the plain GPU case G2. For hybrid runs, the imbal-
ance in processing speed between the CPU and the
GPU leads to the situation that the CPU worker is
processing a final work package while the GPU work-
ers have already finished. This effect is still signifi-
cant in the present example with 2000 frames but will
become less important for very large frame numbers.
Moreover, a run-time system for task-based paral-
lelism may be helpful to mitigate such situations, see
e.g. [31].
5. Summary and conclusions
The CADISHI software achieves very high perfor-
mance on both, CPUs and GPUs. The kernels for
both types of processors can be driven by the Python-
based CADISHI engine to enable high-throughput
analysis of MD trajectories. CADISHI implements
a producer-consumer model and thereby allows for
the complete utilization of all the CPU and GPU re-
sources available on a specific computer, independent
of special libraries such as MPI, covering commod-
ity systems up to high-end HPC nodes. CADISHI
enables the analysis of trajectories with many thou-
sands of frames in a minimum amount of time. Pro-
cessing 2000 frames with 280000 particles each of
trajectory data from an F1-ATPase simulation was
demonstrated to run in less than 6 minutes on a stan-
dard two-socket compute node with two consumer-
grade GPUs.
To achieve high performance on the CPU, we pro-
posed a cache tiling scheme tailored to fit the L2
cache size of a CPU core, OpenMP SIMD directives
in combination with a linear index buffer to help the
16
Table 3: Overview on the single-node CADISHI application performance achieved for the F1-ATPase dataset with 2000 frames,
measured on the 2-socket Haswell node with 2 GTX1080 GPUs. The time was taken until all the partial histograms were
written to disk, i.e., the total time to solution is given, whereas the performance in bapps is reported per compute worker and
does not include buffering and IO time. The use of simultaneous multithreading is indicated by an asterisk (∗). Three runs per
setup were performed and averaged.
setup workers bapps [s−1] time [s]
CPU (threads) GPU CPU GPU
C1 1 (44∗) 0 8.2 0 9594.6
C2 2 (22∗) 0 4.2 0 9422.9
G2 0 2 0 117.1 348.8
G2C1 1 (20) 2 6.6 118.1 341.7
G2C2 2 (10) 2 3.3 116.0 351.3
compiler generate vectorized code, and thread par-
allelism over tiles using classical OpenMP directives.
In our test, the implementation performs and scales
well up to a full shared memory node consisting of
two 26-core Intel Skylake processors.
Compared to running the optimized CPU code on
a 26-core Intel Skylake processor, we find that our
optimized GPU code achieves a speedup of up to 40
on an NVIDIA V100 GPU for a case without any pe-
riodic box. For orthorhombic and triclinic periodic
boxes the speedup is 28 and 14, respectively. The
consumer-grade GTX1080 GPU is competitive with
the professional models, in particular for cases with-
out a periodic box.
Here, we confirmed the observation of Levine et
al. [12] that the use of constant memory to cache co-
ordinate data is the key to achieve high performance,
even though the maximum available constant mem-
ory per GPU did not increase over the GPU gen-
erations. In contrast to Levine et al., we use scarce
shared memory exclusively for storing histogram data
and not for actively caching coordinate data. We
rather rely on the GPU’s native hardware caches.
Levine at al. use shared memory to implement over-
flow protection, which we handle differently via con-
stant memory tiles of known size in combination with
a CUDA grid directly mapping the loop over the sec-
ond particle species. The histogram updates in both,
global and shared memory, have seen significant im-
provement due to the introduction of fast atomic in-
structions with recent GPU generations [18]. More-
over, the 96 kB of shared memory per streaming mul-
tiprocessor of the V100 GPU accelerates the compu-
tation for medium and large bin numbers. This hard-
ware feature was previously unavailable.
Importantly, we complement the work of Levine et
al. [12] by providing a high-performance CPU imple-
mentation featuring vectorization and parallelization,
the support for the triclinic box, template-based sup-
port for single and double precision and for runtime
checks of the distances to fit within the maximum
binning range, and the CADISHI parallel engine for
node-level parallelism, leveraging unprecedented pos-
sibilities of large-scale MD data analysis. We provide
Python interfaces and the option to compile the ker-
nels into a plain C library.
The CADISHI software package presented in this
paper is available free of charge in source code under
the permissive MIT license at [32]. It can be used to-
gether with the CAPRIQORN software package [7, 6]
to calculate SAXS/WAXS scattering intensities from
molecular dynamics trajectories.
Acknowledgements
We thank Prof. Gerhard Hummer, Max Linke, and
Dr. Markus Rampp for fruitful discussions. We thank
Prof. Kei-ichi Okazaki for providing an initial NAMD
setup for F1-ATPase. We acknowledge financial sup-
port by the Max Planck Society.
17
References
[1] J.-P. Hansen, I. R. McDonald, Theory of Simple
Liquids, Fourth Edition: with Applications to
Soft Matter, 4th Edition, Academic Press, 2013.
[2] D. A. McQuarrie, Statistical mechanics / Donald
A. McQuarrie, Harper and Row New York, 1975.
[3] V. Springel, S. D. M. White, A. Jenkins,
C. S. Frenk, N. Yoshida, L. Gao, J. Navarro,
R. Thacker, D. Croton, J. Helly, J. A. Peacock,
S. Cole, P. Thomas, H. Couchman, A. Evrard,
J. Colberg, F. Pearce, Simulations of the for-
mation, evolution and clustering of galaxies and
quasars, Nature 435 (2005) 629 EP –. doi:
10.1038/nature03597.
[4] M. Kerscher, I. Szapudi, A. S. Szalay, A com-
parison of estimators for the two-point correla-
tion function, The Astrophysical Journal Letters
535 (1) (2000) L13.
[5] P. Debye, Molecular-weight determination by
light scattering., The Journal of Physical and
Colloid Chemistry 51 (1) (1947) 18–32. doi:
10.1021/j150451a002.
[6] J. Köfinger, G. Hummer, Atomic-resolution
structural information from scattering experi-
ments on macromolecules in solution, Phys. Rev.
E 87 (2013) 052712. doi:10.1103/PhysRevE.
87.052712.
[7] J. Köfinger, K. Reuter, Capriqorn software
package (2018).
URL https://github.com/bio-phys/
capriqorn
[8] L. S, Z. F. Ornstein, Accidental deviations of
density and opalescence at the critical point of
a single substance, Royal Netherlands Academy
of Arts and Sciences (KNAW). Proceedings. 17
(1914) 793.
[9] C. N. Likos, Effective interactions in soft
condensed matter physics, Physics Reports
348 (4) (2001) 267 – 439. doi:10.1016/
S0370-1573(00)00141-1.
[10] D. Frenkel, B. Smit, Chapter 4 - Molecu-
lar Dynamics Simulations, in: D. Frenkel,
B. Smit (Eds.), Understanding Molecular Simu-
lation (Second Edition), second edition Edition,
Academic Press, San Diego, 2002, pp. 63 – 107.
[11] Protein dynamics: Moore’s law in molecular bi-
ology, Current Biology 21 (2) (2011) R68–R70.
doi:10.1016/j.cub.2010.11.062.
[12] B. G. Levine, J. E. Stone, A. Kohlmeyer, Fast
analysis of molecular dynamics trajectories with
graphics processing units — radial distribution
function histogramming, Journal of Computa-
tional Physics 230 (9) (2011) 3556 – 3569. doi:
10.1016/j.jcp.2011.01.048.
[13] W. Humphrey, A. Dalke, K. Schulten, VMD:
Visual molecular dynamics, Journal of Molec-
ular Graphics 14 (1) (1996) 33 – 38. doi:
10.1016/0263-7855(96)00018-5.
[14] M. Tuckerman, Statistical Mechanics: The-
ory and Molecular Simulation, Oxford graduate
texts, Oxford University Press, 2011.
[15] T. A. Wassenaar, Molecular dynamics of sense
and sensibility in processing and analysis of
data, Ph.D. thesis, university of Groningen
(2006).
URL http://hdl.handle.net/11370/
b0c3a19b-9f60-4911-ab23-d9725a2d45a2
[16] B. G. Levine, D. N. LeBard, R. DeVane,
W. Shinoda, A. Kohlmeyer, M. L. Klein, Mi-
cellization studied by gpu-accelerated coarse-
grained molecular dynamics, Journal of Chemi-
cal Theory and Computation 7 (12) (2011) 4135–
4145. doi:10.1021/ct2005193.
[17] C. Kutzner, S. Páll, M. Fechner, A. Eszter-
mann, B. L. de Groot, H. Grubmüller, Best
bang for your buck: Gpu nodes for gromacs
biomolecular simulations, Journal of Computa-
tional Chemistry 36 (26) (2015) 1990–2008. doi:
10.1002/jcc.24030.
[18] NVIDIA Corporation, CUDA C Programming
Guide (2018).
18
URL https://docs.nvidia.com/cuda/index.
html
[19] V. Volkov, Better performance at lower occu-
pancy, 2010.
URL http://www.nvidia.com/content/
GTC-2010/pdfs/2238_GTC2010.pdf
[20] The HDF Group, Hierarchical Data Format, ver-
sion 5 (1997-2018).
URL http://www.hdfgroup.org/HDF5/
[21] M.-A. Naveen, D. E. J., W. T. B., B. Oliver,
MDAnalysis: A toolkit for the analysis of molec-
ular dynamics simulations, Journal of Compu-
tational Chemistry 32 (10) 2319–2327. doi:
10.1002/jcc.21787.
[22] Richard J. Gowers, Max Linke, Jonathan
Barnoud, Tyler J. E. Reddy, Manuel N. Melo,
Sean L. Seyler, Jan Domański, David L. Dot-
son, Sébastien Buchoux, Ian M. Kenney, Oliver
Beckstein, MDAnalysis: A Python Package for
the Rapid Analysis of Molecular Dynamics Sim-
ulations, in: Sebastian Benthall, Scott Rostrup
(Eds.), Proceedings of the 15th Python in Sci-
ence Conference, 2016, pp. 98 – 105.
[23] Python Software Foundation, The Python Stan-
dard Library (2018).
URL https://docs.python.org/2/library/
[24] Intel Corporation, Intel Xeon Platinum 8164
Processor (2017).
URL https://ark.intel.
com/products/120503/
Intel-Xeon-Platinum-8164-Processor-35_
75M-Cache-2_00-GHz
[25] NVIDIA Corporation, NVIDIA GeForce GTX
1080 (2016).
URL https://international.download.
nvidia.com/geforce-com/international/
pdfs/GeForce_GTX_1080_Whitepaper_FINAL.
pdf
[26] NVIDIA Corporation, NVIDIA Tesla P100
(2016).
URL https://images.nvidia.com/
content/pdf/tesla/whitepaper/
pascal-architecture-whitepaper.pdf
[27] NVIDIA Corporation, NVIDIA Tesla V100
GPU Architecture (2017).
URL http://images.nvidia.com/
content/volta-architecture/pdf/
volta-architecture-whitepaper.pdf
[28] A. B. Caldeira, V. Haug, S. Vetter, IBM
Power System S822LC for High Performance
Computing (2016).
URL https://www.redbooks.ibm.com/
redpapers/pdfs/redp5405.pdf
[29] NVIDIA Corporation, NVIDIA DGX-1 With
Tesla V100 System Architecture (2017).
URL http://images.
nvidia.com/content/pdf/
dgx1-v100-system-architecture-whitepaper.
pdf
[30] J. C. Phillips, R. Braun, W. Wang, J. Gum-
bart, E. Tajkhorshid, E. Villa, C. Chipot, R. D.
Skeel, L. Kalé, K. Schulten, Scalable molecu-
lar dynamics with namd, Journal of Compu-
tational Chemistry 26 (16) 1781–1802. doi:
10.1002/jcc.20289.
[31] C. Augonnet, S. Thibault, R. Namyst, P.-A.
Wacrenier, StarPU: A Unified Platform for Task
Scheduling on Heterogeneous Multicore Archi-
tectures, Concurrency and Computation: Prac-
tice and Experience, Special Issue: Euro-Par
2009 23 (2011) 187–198. doi:10.1002/cpe.
1631.
[32] K. Reuter, J. Köfinger, Cadishi software package
(2018).
URL https://github.com/bio-phys/cadishi
19
