Calculating interactions or correlations between pairs of particles is
typically the most time-consuming task in particle simulation or correlation
analysis. Straightforward implementations using a double loop over particle
pairs have traditionally worked well, especially since compilers usually do a
good job of unrolling the inner loop. Reaching high performance on modern CPU
and accelerator architectures, however, now requires single-instruction
multiple-data (SIMD) parallelization. Avoiding memory bottlenecks is
also increasingly important and requires reducing the ratio of memory to
arithmetic operations. Moreover, when pairs only interact within a certain
cut-off distance, good SIMD utilization can only be achieved by reordering
input and output data, which quickly becomes a limiting factor. Here we present
an algorithm for SIMD parallelization based on grouping a fixed number of
particles, e.g. 2, 4, or 8, into spatial clusters. Calculating all interactions
between particles in a pair of such clusters improves data reuse compared to
the traditional scheme and results in a more efficient SIMD parallelization.
Adjusting the cluster size allows the algorithm to map to SIMD units of various
widths. This flexibility not only enables fast and efficient implementation on
current CPUs and accelerator architectures like GPUs or Intel MIC, but it also
makes the algorithm future-proof. We present the algorithm with an application
to molecular dynamics simulations, where we can also make use of the effective
buffering the method introduces.

Comment: Accepted for publication in Computer Physics Communications
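
As a rough illustration of the cluster-pair idea described above, the following minimal C sketch evaluates all particle pairs between two clusters of fixed size with a cut-off test. It is not the authors' implementation: the names `Cluster`, `CLUSTER_SIZE`, and `count_pairs_within_cutoff` are hypothetical, the cluster size of 4 is just one of the sizes the abstract mentions, and counting pairs stands in for a real interaction kernel.

```c
#include <assert.h>

/* Hypothetical cluster size; the abstract mentions 2, 4, or 8. */
#define CLUSTER_SIZE 4

/* Coordinates of one cluster in structure-of-arrays layout, so that each
   coordinate of the whole cluster sits in contiguous memory. */
typedef struct {
    float x[CLUSTER_SIZE];
    float y[CLUSTER_SIZE];
    float z[CLUSTER_SIZE];
} Cluster;

/* Count the pairs (i in a, j in b) whose distance is below the cut-off.
   All CLUSTER_SIZE * CLUSTER_SIZE distances are computed; pairs beyond the
   cut-off are masked out instead of being skipped with a branch. The pair
   count is a toy stand-in for a force or energy accumulation. */
int count_pairs_within_cutoff(const Cluster *a, const Cluster *b, float cutoff)
{
    assert(cutoff >= 0.0f);
    const float cutoff2 = cutoff * cutoff;
    int count = 0;

    for (int i = 0; i < CLUSTER_SIZE; i++) {
        /* Fixed trip count and no data-dependent branch: the kind of loop
           a compiler can usually map onto SIMD lanes. */
        for (int j = 0; j < CLUSTER_SIZE; j++) {
            const float dx = a->x[i] - b->x[j];
            const float dy = a->y[i] - b->y[j];
            const float dz = a->z[i] - b->z[j];
            const float r2 = dx * dx + dy * dy + dz * dz;
            count += (r2 < cutoff2); /* comparison yields 0 or 1 */
        }
    }
    return count;
}
```

Each cluster's coordinates are loaded once and reused for `CLUSTER_SIZE` partners, which is the data-reuse argument made in the abstract. Changing `CLUSTER_SIZE` is how a sketch like this would be matched to a different SIMD width.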