SIMD Vectorization for the Lennard-Jones Potential with AVX2 and AVX-512
  instructions by Watanabe, Hiroshi & Nakagawa, Koh M.
ar
X
iv
:1
80
6.
05
71
3v
2 
 [c
s.M
S]
  2
2 O
ct 
20
18
SIMD Vectorization for the Lennard-Jones Potential with AVX2 and AVX-512
instructions
Hiroshi Watanabe∗ and Koh M. Nakagawa
The Institute for Solid State Physics, The University of Tokyo,
Kashiwanoha 5-1-5, Kashiwa, Chiba 277-8581, Japan
(Dated: October 23, 2018)
This work describes the SIMD vectorization of the force calculation of the Lennard-Jones potential
with Intel AVX2 and AVX-512 instruction sets. Since the force-calculation kernel of the molecular
dynamics method involves indirect access to memory, the data layout is one of the most important
factors in vectorization. We find that the Array of Structures (AoS) with padding exhibits better
performance than Structure of Arrays (SoA) with appropriate vectorization and optimizations. In
particular, AoS with 512-bit width exhibits the best performance among the architectures. While
the difference in performance between AoS and SoA is significant for the vectorization with AVX2,
that with AVX-512 is minor. The effect of other optimization techniques, such as software pipelin-
ing together with vectorization, is also discussed. We present results for benchmarks on three CPU
architectures: Intel Haswell (HSW), Knights Landing (KNL), and Skylake (SKL). The performance
gains by vectorization are about 42% on HSW compared with the code optimized without vectoriza-
tion. On KNL, the hand-vectorized codes exhibit 34% better performance than the codes vectorized
automatically by the Intel compiler. On SKL, the code vectorized with AVX2 exhibits slightly better
performance than that with vectorized AVX-512.
I. INTRODUCTION
Since Alder and Wainwright performed molecular dy-
namics (MD) simulations for the first time [1], MD has
been an important tool for exploring the wide variety
of fields in science. Starting from the first MD simula-
tion with 32 atoms, the number of atoms involved in MD
has continued to increase owing to the development of
computational power and has reached hundreds of bil-
lions to trillions of atoms [2, 3]. It is expected that the
size and complexity of systems will continue to increase
with the development of computers. Since the increase
in the CPU frequency stopped in the early 2000s [4], the
performance development of CPUs has mainly depended
on increasing the number of CPU cores and the SIMD
width. SIMD, which is short for single instruction mul-
tiple data, enables us to process multiple data with a
single instruction. The width of SIMD is the number of
data that can be processed simultaneously and it is de-
termined by the bit length of registers. Beginning with
SSE, which supports 128-bit registers, 256-bit registers
for AVX and 512-bit registers for AVX-512 are avail-
able, where SSE and AVX are shorts for Streaming SIMD
extensions and advanced vector extensions, respectively.
Since the theoretical peak performance of a CPU is the
value when the application uses the vector width to the
full, the utilization of SIMD is crucial for the performance
of applications on modern CPU architecture. However,
it is not trivial to transform a scalar kernel to a SIMD-
vectorized one. Additionally, the optimal method of vec-
torization is different for each instruction set architecture
∗Corresponding author; Electronic address:
hwatanabe@issp.u-tokyo.ac.jp
(ISA). Portability is one of the important issues for vec-
torization. The SIMD capabilities are usually provided
as an extension of the existing ISA. SIMD and related
instructions are different for each architecture. To ad-
dress this issue, several approaches have been proposed.
One of them is a directive-based approach [5]. By using
directives, one can utilize GPGPU or many-core proces-
sors effectively while retaining portability. Another is
a framework approach. Karpin´ski and McDonald pro-
vided a framework called Unified Multi/Many-Core En-
vironment (UME), which allows a programmer to utilize
the SIMD capabilities without detailed knowledge of the
specific architecture [6]. With these approaches, one can
develop an application that exhibits high performance
without increasing software complexity.
Despite the above efforts, SIMD vectorization remains
a difficult and cumbersome task, since the performance
would be unsatisfactory simply by vectorizing the exist-
ing scalar code. To utilize the SIMD capability of mod-
ern CPUs, it is necessary to combine SIMD vectorization
with an optimal data layout and other optimization tech-
niques. In this paper, we describe the SIMD vectoriza-
tion of the force calculation for the Lennard-Jones (LJ)
potential with AVX2 and AVX-512 on several types of
CPU. The force calculation is the most time-consuming
part of MD, and therefore, the efficiency of the kernel
directly determines the performance of MD. Since the
force-calculation kernel is a typical example of the vec-
torization of a loop with indirect access, there have been
many reports on the vectorization of MD codes, includ-
ing that of the many-body potential [7]. Additionally,
the important kernels of widely used MD packages, such
as Gromacs [8], LAMMPS [9], and NAMD [10], have al-
ready been effectively vectorized and are available. How-
ever, the previous works mainly focused on the perfor-
mance of the whole application instead of that of the
2vectorized kernel. While the performance of the applica-
tion is ultimately most important, the total performance
depends on various factors such as the communication
and the cost of thread synchronization. Such factors
make it difficult to evaluate the impact of vectorization.
Therefore, we focus on force-calculation kernel in this
paper.In particular, we investigate in detail the impact
of optimization techniques when used with vectorization.
We also discuss the difference between the vectorization
with AVX2 and AVX-512 and between CPU architec-
tures. Since the computational intensity of LJ is low,
i.e., a number of floating point operations is small rela-
tive to the amount of memory access, the memory access
is the main bottleneck. Therefore, most of our efforts
are devoted to optimizing memory access. The optimiza-
tion techniques described in this paper can be applied to
other cases with low computational intensity. Although
mixed-precision calculations are known to improve the
performance of MD, we use double-precision numbers for
all calculations throughout this work for simplicity.
The rest of the article is organized as follows. The
benchmark conditions are described in the next section.
Then we give a brief introduction to the optimization
techniques employed before vectorization in Sec. III. Vec-
torization with AVX2 is described in Sec. IV and that
with AVX-512 and related optimizations are described
in Sec. V. The benchmark results on SKL are shown in
Sec. VI. Section VII is devoted to a summary and dis-
cussion. The associated code is available at [11].
II. BENCHMARK CONDITIONS
We consider three types of CPU architecture; Intel
Haswell (HSW), Knights Landing (KNL), and Skylake
(SKL), which we refer to as HSW, KNL, and SKL, re-
spectively. We performed benchmark simulations on the
systems of the Institute for Solid State Physics of the
University of Tokyo for HSW and SKL and of the Infor-
mation Technology Center of the University of Tokyo for
KNL. The conditions for the benchmark simulations are
as follows. The simulation box is a cube with a linear
size of 100 σ, where σ is the diameter of an LJ atom.
Atoms are placed at the face-centered-cubic lattice. The
cutoff length is 3.0 σ. The number of atoms is 119164,
and consequently, the number density is 1.0. Only force
calculations are performed and the time to calculate the
force 100 times is measured. The program is executed
as a single-threaded process on a single CPU core. The
software is compiled using Intel C++ compiler. The de-
tails of the CPU, the versions of the compiler, and the
compilation options are listed in Table I.
 ✁✂✄☎✆ ✝✞✟✠✡☛ ☞
✌✍✎✏✑✒ ✓✔ ✕✖ ✗✘
✙✚
✛✜✢✣✤✥ ✦✧★✩✪
✫✬✭✮✯ ✰✱ ✲✳✴✵✶✷✸✹✺ ✻✼✽✾✿❀❁❂❃❄❅❆❇❈ ❉❊ ❋●❍■❏ ❑▲▼◆❖
P◗❘❙❚❯ ❱❲❳❨❩❬ ❭❪❫❴❵❛ ❜❝❞❡❢❣
❤✐❥❦❧♠♥♦♣qrst✉✈✇①②
③④⑤⑥⑦⑧⑨⑩❶❷❸❹❺❻❼❽❾❿
➀➁➂➃➄➅➆➇➈➉➊➋➌➍➎➏➐➑
➒➓➔→➣↔↕➙➛➜➝➞➟➠➡➢➤➥
FIG. 1: (Color online) Data layout. The numbers represent
the indices of atoms. Left: Structure of Array (SoA). The
x coordinates of the atoms are contiguously arranged in the
memory. The same is true for the y and z coordinates. Right:
Array of Structure (AoS). Coordinates are stored for each
atom. Examples of data definition in C language are also
shown for both cases.
III. OPTIMIZATION BEFORE
VECTORIZATION
In this section, we describe several optimization tech-
niques employed before SIMD vectorization since these
techniques affect the SIMD vectorization. While some of
the techniques have been described in a past paper [12],
we discuss them here to make this paper self-contained.
Since we consider a three dimensional system, a position
vector and velocity vector each contain three elements.
When a system contains N atoms, we have to store N
position and velocity vectors in memory. There are two
ways to arrange such data in memory, Structure of Array
(SoA) and Array of Structure (AoS). See Fig. 1 for the
difference between the two layouts. Here, we adopt SoA
as the data layout, i.e., the data are arranged so that the
x-coordinates of the i-atom and (i + 1)-atom are adja-
cent. The results of optimizations are shown in Fig. 2.
We describe each optimization below.
 0
 5000
 10000
 15000
 20000
 25000
 30000
HSW SKL KNL
El
ap
se
d 
tim
e 
[m
s]
SoA+Pair
SoA+Sorted
SoA+Sorted+SWP
FIG. 2: (Color online) Effects of optimizations before vec-
torization.
3Name Processor Vector ISA Compiler Options
HSW Intel Xeon E5-2680 v3 AVX2 icpc (ICC) 16.0.4 -xHOST
KNL Intel Xeon Phi 7250 AVX2, AVX-512 icpc (ICC) 18.0.1 -axMIC-AVX512
SKL Intel Xeon Gold 6148 AVX2, AVX-512 icpc (ICC) 18.0.1 -xCORE-AVX512 -qopt-zmm-usage=high
TABLE I: Details of CPUs, versions of the compiler, and compilation options used for benchmark simulations. The common
compiler option is -O3 -std=c++11 -w2 -w3 -diag-disable:remark -restrict and only CPU-dependent options are shown.
Name Description
Pair Naive implementation using a pair list.
Sorted Reduce the memory access by sorting a list.
AoS-4 Adopt the AoS data structure with padding.
TABLE II: List of optimizations.
A. Lennard-Jones Potential
We consider a classical MD simulation of the LJ po-
tential with truncation. The force calculation of the two-
body potential consists of double loops. The loop counter
of the outer loop is denoted by i and that of the inner loop
is denoted by j. The loop counters i and j correspond to
the indices of atoms. Consider the atom pair (i, j), for
which we calculate the force between them. We call the
atom with index i the i-atom and the other the j-atom.
The coordinates of the i- and j-atoms are denoted by ~qi
and ~qj , respectively, and the distance between them is
given by r = |~qj −~qi|. Then the LJ potential is described
by
V (r) = 4ε
[(σ
r
)12
−
(σ
r
)6]
, (1)
where ε is the well depth and σ is the atomic diameter.
Hereafter, we set σ and ε to unity. Since the potential
decays rapidly as the distance increases, it is wasteful
to calculate the force between pairs at long distances.
Therefore, a cutoff distance is introduced and only the
interactions of pairs within the distance are considered.
In this paper, we adopt the simple truncation, i.e., we in-
troduce a cutoff distance rc and we ignore force between
pairs of atoms at a distance greater than rc. This trunca-
tion modifies the potential function by adding a constant
so that V (rc) = 0. Note that this simple method may
exhibit problems in terms of energy conservation. For
practice use, it is better to adopt a truncation method
with not only the potential but also the force becoming
zero at the truncation distance [13–15]. While we here
adopt the simple truncation, the optimization and vector-
ization techniques described in this paper can be applied
to other truncation methods with minor modifications.
The force calculation for a single pair is shown in Al-
gorithm 1, for which we count the number of arithmetic
operations. For example, the operation ~qj − ~qi involves
three subtractions. The total number of arithmetic op-
erations required to calculate the force between a single
pair is 27 addition/multiplications and one division. To
Algorithm 1 Force calculation of LJ potential
1: ~r ← ~qj − ~qi
2: r2 ← ~r · ~r
3: r6 ← r2 × r2 × r2
4: r14 ← r6 × r6 × r2
5: df ← (48− 24× r6)× dt/r14
6: ~pi ← ~pi − df × ~r
7: ~pj ← ~pj − df × ~r
calculate the force between a single pair, 48 bytes must
be read and 24 bytes must be written since it is neces-
sary to load the coordinates and momenta of the atoms
and write back the momenta. Considering the above,
the force calculation of the LJ system becomes memory-
bound.
B. Verlet List and Bookkeeping Method
Since the potential function is truncated, we must con-
struct a pair list including only pairs for which the dis-
tance is shorter than the cutoff distance. While the trivial
implementation to construct the list has a complexity of
O(N2) for the total number of atoms N , the complexity
of the computation can be reduced to O(N) by adopting
the linked-list method [16, 17]. In the construction of
the pair list, we register each pair within a search length
rs, which is set at longer than rc. Then the list has
a margin of rs − rc and we can reuse the list for sev-
eral time steps. This technique is called the bookkeeping
method, which greatly reduces the time required to con-
struct the list [18]. By monitoring the fastest atom in
the system, we can safely determine whether we can con-
tinue to use the list or whether we must rebuild it [19].
By adopting the bookkeeping method, pairs beyond the
cutoff distance can be registered in the pair list. There-
fore, we must exclude such pairs from the calculation of
forces using the pair list. This fact affects the subsequent
SIMD vectorization.
C. Sorting by Indices
An interacting pair can be denoted by a pair of indices.
Therefore, the pair list is represented by a list of index
pairs (see Fig. 3 (a)). The pseudocode for calculating
forces using the pair list is shown in Algorithm 2. The
force-calculation kernel contains a single loop and it re-
41 0 0 2 3 3 1 1 2 3 0 2
9 2 1 7 9 5 2 4 8 4 5 6
0
1 2 5
1
2 4 9
2
6 7 8
3
4 5 9
(a)
(b)
SortedList[0]
1 2 5 2 4 9 6 7 8 4 5 9
SortedList[1] SortedList[2] SortedList[3]
(c)
i-Atoms
j-Atoms
i-Atoms
j-Atoms
FIG. 3: Data layout of a pair list. (a) List of pairs. (b)
Sorted list. The j-atoms interacting with the same i-atom
are grouped together. We expect that the data of i-atoms
are on registers. (c) Array representation of (b). The array
SortedList[i] denotes the list of j-atoms that interact with
i-atom.
quires memory access both for i- and j-atoms. Since the
number of arithmetic operations is small as described be-
fore, the memory access becomes the bottleneck, result-
ing poor performance. To reduce the number of memory
accesses, we sort the pair list into group j-atoms inter-
acting with the same i-atom (see Fig. 3 (b)). The array
representation of the sorted list is shown in Fig. 3 (c).
The array SortedList[i] denotes the array that stores
the indices of j-atoms interacting with i-atoms. This list
can be constructed by a counting sort and the compu-
tational complexity of the construction is O(N). Using
the sorted list, the force-calculation kernel contains dou-
ble loops. Then the information of the i-atom, which is
denoted by the loop counter of the outer loop, is stored
in registers. Then only the memory access of j-atoms is
required in the inner loop. The pseudocode of the force
calculation using the sorted list is shown in Algorithm 3
Hereafter, we refer to this optimization to“Sorted”. This
optimization works effectively on KNL as shown in Fig. 2.
Algorithm 2 Calculating the force in a simple manner
1: for all pairs (i, j) in PairList do
2: Load ~qi
3: Load ~qj
4: if |~qj − ~qi| < rc then
5: Load ~pi
6: Load ~pj
7: Calculate force between i- and j-particles.
8: Update ~pi and ~pj
9: Store ~pi
10: Store ~pj
11: end if
12: end for
Algorithm 3 Calculating the force with a sorted list
1: for i = 0 to N − 1 do
2: Load ~qi
3: Load ~pi
4: for all j in SortedList[i] do
5: Load ~qj
6: if |~qj − ~qi| < rc then
7: Load ~pj
8: Calculate force between i- and j-particles.
9: Update ~pi and ~pj
10: Store ~pj
11: end if
12: end for
13: Store ~pi
14: end for
Memory
Access
Arithmetic
Ops.
Iteration 1
1 1Load 
Position
1Calc. 
Force
Store 
Impulse
Prologue Iteration 1 Epilogue
(a)
(b)
Calc. 
Distance
1
Iteration 2
2 2Load 
Position
Store 
Impulse
1Load 
Position
1Store 
Impulse
2Load 
Position
Iteration 2
2Store 
Impulse
3Load 
Position
NStore 
Impulse
1Calc. 
Force
Calc. 
Distance
1 2Calc. 
Force
Calc. 
Distance
2 Calc. 
Distance
3
NCalc. 
Force
2Calc. 
Force
Calc. 
Distance
2
FIG. 4: (Color online) Software pipelining.
D. Software Pipelining
Software pipelining (SWP) is one of the loop optimiza-
tion techniques that reforms the loop to reduce the ex-
ecution time [20]. While SWP is often used to hide the
latency of instructions so that pipelines are kept busy
without stalling, we adopt this technique to increase the
number of instructions per cycle (IPC) rather than to
avoid pipeline hazards. The structure of the force cal-
culation has a double-loop form where the outer loop is
for i-atoms and the inner loop is for j-atoms. In the
body of the inner loop, there are four different opera-
tions, A,B,C, and D, which are A: load the position
of a j-atom, B: calculate the distance between i- and
j-atoms, C: calculate the force and update the momenta
D: store the momentum of the j-atom. Suppose n is the
number of j-atoms interacting with the i-atom, then the
inner loop has the form {ABCD}n. Since the four op-
erations are mutually dependent, the operations should
be performed sequentially (see Fig. 4 (a)). To increase
parallelism, we change the inner loop from {ABCD}n to
AB{CDAB}n−1CD (see Fig. 4 (b)) so that the memory
accesses and the arithmetic operations in the body of the
inner loop are no longer mutually dependent. Since the
recent CPUs are superscalar, independent memory ac-
cesses and arithmetic operations can be executed simul-
taneously, increasing IPC. Since the modification of the
loop causes an increase in the number of instructions,
the overall performance is improved when the increase
in IPC is larger than the increase in the number of in-
5structions. The impact of this optimization technique,
therefore, strongly depends on the environment, such as
the CPU architecture, the version of the compiler, and
so forth.
With the Intel 16.0.4 C++ compiler and on HSW, we
find that SWP improves IPC by 55% while the number
of instructions increases by 22%, and therefore, the total
performance gain is about 21%. We confirm that SWP
significantly improves the performance on SKL with the
Intel 17.0.6 C++ compiler. However, the gain in the
performance vanishes when we use the Intel 18.0.1 C++
compiler, since the performance without SWP is im-
proved by using a later version of the compiler. The
difference between the 17.0.6 and the 18.0.1 compilers
is the vector optimization capability. While the 17.0.6
compiler used xmm registers only, the 18.0.1 compilers
vectorized the code with using zmm registers. As the re-
sult, the code compiled by the 18.0.1 compilers is faster
than that by the 17.0.6 compiler. Note that, we used
the identical compiler options for both cases. If we apply
SWP by hand, then the loop was not vectorized since
the loop body becomes too complicated. However, the
performance is improved since IPC increases. Whether
the performance improves by SWP depends on the bal-
ance between performance improvement by improving
IPC and performance degradation due to inhibition of
vectorization. In SKL, the performances with and with-
out SWP happened to be nearly the same. However, the
performance on KNL becomes worse with SWP. This
is because the auto-vectorization by the Intel compiler
works efficiently for the code “SoA+Sorted” on KNL.
The auto-vectorization by the compiler on KNL is dis-
cussed in Sec. VA.
IV. SIMD VECTORIZATION WITH AVX2
 0
 5000
 10000
 15000
 20000
 25000
 30000
HSW SKL KNL
El
ap
se
d 
tim
e 
[m
s]
AoS4+Sorted
AoS4+Sorted+SWP
AoS4+Sorted+AVX2
AoS4+Sorted+AVX2+SWP
FIG. 5: (Color online) Effects of optimizations with AVX2.
In this section, we describe the vectorization with
AVX2 instructions. We can use a 256-bit-width regis-
ter with AVX2. Therefore, four pairs of forces can ideally
be calculated simultaneously. Since the innermost loop of
the MD code contains indirect accesses, it is not straight-
forward to pack data from the memory to the register and
to unpack data from the register to the memory. Since
the force calculations of the LJ potential are relatively
light, the packing and unpacking processes may be the
bottleneck. While AVX2 includes a gather instruction, it
is very slow. Additionally, AVX2 does not have a scatter
instruction. Therefore, some ingenuities are required to
pack and unpack the data. The results for the codes vec-
torized with AVX2 are shown in Fig. 5. In the following,
we describe how to vectorize the force calculation with
AVX2.
A. AoS with Padding
The key idea in the vectorization of the force calcula-
tion is to make the data structure the Array of Structures
(AoS) with padding [5]. The atom data are aligned so
that the data structure has 256-bit boundaries. Since the
coordinates and momenta have three elements, one dou-
ble precision floating number is inserted as a padding for
every three elements (see Fig. 6). We refer to this layout
as “AoS4” since each structure contains four elements
including a padding. This data structure has two advan-
tages over SoA. One is the cache efficiency. This data
structure ensures that the data for a single atom will be
on a single cache line with 256-bit width. The other is
the memory-access efficiency. Since the data are aligned
on 256-bit boundaries, the coordinates or momenta of
a single atom can be moved to the YMM register by a
single vector-load instruction (vmoveupd). We find that
the latter advantage is more effective for increasing the
speed. After loading the data of the coordinates into a
register, we calculate the relative coordinate vector be-
tween i- and j-atoms (Fig. 7 (a)). Performing the above
process four times, we obtain four registers containing
relative coordinate vectors. Then we transpose the four
vectors to obtain three vectors (Fig. 7 (b)). By calculat-
ing the sum of the squares of the vectors, we obtain a
single register that contains the relative distances of the
four pairs (Fig. 7 (c)).
Since we adopt the bookkeeping method, the pair list
contains the pairs with distances greater than the trun-
cation length. This list can be treated by masking op-
erations. We first calculate impulses of all pairs. Next,
we set the impulses between pairs that are outside the
interaction range to zero (see Fig. 8). Then we load the
momenta of the j-atoms, update them using the calcu-
lated impulses, and write them back.
The results of the above vectorization are de-
noted by “AoS4+Sorted+AVX2” in Fig. 5. The in-
crease in speed compared with the scalar code with
SWP, i.e., the improvements from ”AoS4+Sorted” to
”AoS4+Sorted+AVX2”, are 51% on HSW and 44% on
SKL. While the vectorization with AVX2 works effec-
tively on HSW and SKL, the vectorized code becomes
6x_1y_1z_1x_2y_2z_2x_3y_3z_3
x_2y_2z_2YMM register
 ✁ ✂✄☎✆ ✝✞✟✠ ✡☛☞✌
✍✎ ✏✑✒✓ ✔✕✖✗ ✘✙✚✛
FIG. 6: (Color online) Array of Structures (AoS) with
padding. This data layout allows us to load simultaneously
three elements of coordinates to a YMM register.
x_iy_iz_i
x_jy_jz_j
dxdydz
(a) (b)
dx1dy1dz1
dx2dy2dz2
dx3dy3dz3
dx4dy4dz4
dx1dx2dx3dx4
dy1dy2dy3dy4
dz1dz2dz3dz4
transpose
sum of squares
r_1
2
r_2
2
r_3
2
r_4
2
(c)
(d)
FIG. 7: (Color online) Calculation of relative distances. (a)
By calculating the difference between the coordinates of the
i- and j-atoms stored in each YMM register, the relative posi-
tion vector is stored in the YMM register. (b) The calculation
of the relative position vector is performed four times for four
j-atoms. (c) The four YMM registers are transposed to obtain
three registers. (d) The sums of the squares of the three reg-
isters are calculated to obtain four squared relative distances
of four pairs.
significantly slower than the original code on KNL. This
is because automatic vectorization by the compiler works
effectively with AVX-512 on KNL. This issue will be dis-
cussed later.
B. Software Pipelining
We can apply the SWP technique to the vectorized
code with AVX2 in a similar manner to the scalar code.
However, the performance improvements are moderate.
While the performance is improved by 6% on HSW, the
effect on SKL is not significant. Compared with the
fastest scalar codes which is “AoS4+Sorted+SWP”, the
performance gains by vectorization are about 42% on
HSW and 44% on KNL. The performance on KNL dete-
riorates upon applying SWP.
V. SIMD VECTORIZATION WITH AVX-512
In this section, we describe the vectorization with
AVX-512 instructions. The AVX-512 includes gather,
scatter, and mask operations, which are useful for the
vectorization of the loop involving indirect access. Here
r_1
2
r_2
2
r_3
2
r_4
2
r_c
2
r_c
2
r_c
2
r_c
2
m1m2m3m4mask
(a) (b)
0 0 0 0
df1df2df3df4src1
src2
m1m2m3m4mask
blend
df1df30 0
FIG. 8: Conditional execution using masks. The impulses
between the pairs with distances greater than the truncation
length are set to zero.
 0
 2000
 4000
 6000
 8000
 10000
 12000
 14000
 16000
10428
14710
10368
9627
7916
7269
8491
6908
El
ap
se
d 
tim
e 
[m
s]
AoS4 (auto vectorized)
AoS4+AVX2+SWP
SoA (auto vectorized)
SoA+AVX-512+CDE
SoA+AVX-512+CDE+RLE
SoA+AVX-512+CDE+RLE+SWP
AoS8+AVX-512+CDE+RLE
AoS8+AVX-512+CDE+RLE+SWP
FIG. 9: (Color online) Results on KNL. The meanings of
the abbreviations in the figure are AoS4 (Array of Structures
with 4 members), AoS8 (Array of Structures with 8 members),
SoA (Structure of Arrays), CDE (collision detection elimina-
tion), RLE (remainder loop elimination), and SWP (software
pipelining).
we describe the efficiency of the optimization on KNL.
The results are shown in Fig. 9. The results on SKL are
discussed in the next section.
A. Auto-Vectorization by Compiler
While the Intel compiler cannot vectorize the bench-
mark program automatically with AVX2 without direc-
tives, it can vectorize the program with AVX-512. Actu-
ally, the compiler vectorized the code in Algorithm 3 as
follows.
1. Unroll the inner loop eight times.
2. Gather the positions and momenta of j-atoms with
vgatherdpd.
3. Calculate the forces and update the momenta of
j-atoms.
4. Detect conflicts between the indices of j-atoms with
vpconflictd.
5. Scatter the updated momenta with vscatterdpd.
7Here, the compiler inserted the codes for conflict detec-
tion since it did not know that there would be no conflict
between the indices of j-atoms interacting with i-atoms.
However, the penalty for inserting conflict detection is
not expensive as shown later. Inversely, the compiler
cannot vectorize the code without conflict detection in-
structions. In that sense, the conflict detection support
in AVX-512 is crucial for auto-vectorization by the com-
piler as suggested by Ho¨hnerbach et al. [7].
The code vectorized by the compiler with AVX-512
exhibits better performance than the code vectorized by
hand with AVX2. SWP reduces the performance since it
interferes with the optimization by the compiler.
B. AoS to SoA
Suppose xj is the x-coordinate of a j-atom. The dis-
tance between xj and xj+1 in memory is 8 bytes for SoA,
while it is 32 bytes for AoS (see Fig. 10). The gather
(vgatherdpd) and scatter (vscatterdpd) instructions af-
ford a scale factor that takes a values of 1, 2, 4, or 8.
When the data layout is SoA, then the indices of j-atoms
can be directly used for gather/scatter instructions with
a scale factor of 8. However, bit shifts are required for
AoS, and therefore, the bit shifts impose some penalty on
performance. The efficiency of this optimization strongly
depends on the code. Actually, the improvement of the
performance by changing the data layout is 1% for the
codes vectorized by the compiler. However, it improves
the performance by 7% for the codes vectorized by hand
with some additional optimizations, such as the remain-
der loop elimination and software pipelining, which are
described later.
 ✁✂ ✄☎✆
x_0x_1x_2x_3x_4x_5x_6x_7
y_0y_1y_2y_3y_4y_5y_6y_7
z_0z_1z_2z_3z_4z_5z_6z_7
✝✞✟ ✠✡☛
1
☞✌✍✎✏✑✒✓✔✕✖✗✘
346
✙✚✛✜
✢✣✤✥
✦✧★✩✪✫✬✭ ✮✯✰✱✲✳
4121624
✴✵✶ ✷✸✹✺✻
x_1x_3x_4x_6
✼✽ ✾✿❀❁❂
x_1x_3x_4x_6
❃ ❄❅❆❇❈
❉❊❋● ❍■❏❑▲▼◆
❖P◗❘ ❙❚❯❱❲❳❨
1
❩❬❭❪❫❴❵❛❜❝❞❡❢
346
FIG. 10: (Color online) Gather procedures. While eight
double-precision floating elements can be gathered with AVX-
512, only four are shown for visibility. (a) Gather process
for the SoA layout. By using 8 as the scale factor of the
gather instruction, the data can be gathered by using the
sorted list as is. (b) Gather process for the AoS layout. Since
each element is separated by 32 bytes and the maximum scale
factor is 8, it is necessary to carry out the gather instruction
using bit-shifted indices of the sorted list.
C. Collision Detection Elimination
As described above, the compiler inserted unnecessary
codes for the conflict detection. For the case of collision,
the compiler prepares a code storing the momenta se-
quentially which will be never called. Since there are no
conflicts between the indices, the conflict detection can
be eliminated if we call gather and scatter instructions
directly by using intrinsic functions. By this conflict de-
tection elimination (CDE), the performance is improved
by 7% (from ”SoA+AVX-512+CDE” to ”SoA+AVX-
512+CDE+RLE”). Note that, it may be possible to tell
the compiler that there are no conflicts using directives.
However, as far as we tried, the collision detection codes
could not be eliminated by adding directives.
D. Remainder Loop Elimination
The number of iterations of the inner loop of the force
calculation is equals the number of atoms existing within
the interaction distance, which depends on the density
of atoms. For the benchmark condition adopted in the
manuscript, the number of j-atoms interacting with one
i-atom is approximately 55 to 80. If we unroll the inner
loop eight times, the number of iterations of the vector-
ized loop kernel becomes 6 to 10. Since the number of
iterations of the remaining loop is up to 7, the compu-
tational time to process the remainder loop is compara-
ble to the time required for the vectorized loop kernel.
Therefore, we eliminate the remainder loop by mask-
ing. The pseudocode for the remainder loop elimination
(RLE) is shown in Algorithm 4. The key idea is to pre-
pare a vector of the loop counter (dˆ in the pseudocode).
Then we can make a mask mˆloop for the remainder loop.
Note that we must also to make a mask mˆcutoff for the
truncation of the interaction. The forces are zeroed by a
mask mˆtotal which is the logical conjunction of mˆloop and
mˆcutoff . The scatter of the momenta of j-atoms should
be masked by mˆloop since there is a possibility of con-
flict due to the RLE. RLE improves the performance
by 18% (from “SoA+AVX-512+CDE” to “SoA+AVX-
512+CDE+RLE”).
E. Software Pipelining
The performance of the vectorized kernel loop can be
improved by the SWP technique described in Sec. III D.
The performance improvement by SWP strongly depends
on the optimization methods used together. While SWP
does not improve the efficiency of the code vectorized
with AVX2, it improves the performance of the code vec-
torized with AVX-512 by 9% .
8Algorithm 4 Remainder loop elimination
1: tˆ← {8, 8, 8, 8, 8, 8, 8, 8}
2: vˆc ← {rc, rc, rc, rc, rc, rc, rc, rc}
3: for i = 0 to N − 1 do
4: qˆi ← Position of i-Atom ⊲ Broadcast
5: dˆ← {7, 6, 5, 4, 3, 2, 1, 0}
6: n← |SortedList[i]|
7: k ← 0
8: nˆ← {n, n, n, n, n, n, n, n}
9: while k < n do
10: qˆj ← Positions of j-Atoms ⊲ Gather
11: vˆd ← Distance between qˆj and qˆi
12: Calculate forces between i- and j-atoms
13: dˆ← dˆ+ tˆ
14: mˆcutoff ← _mm512_cmp_pd_mask(vˆc , vˆr, _CMP_LE_OS)
15: mˆloop ←_mm512_cmp_epi64_mask(dˆ)
16: mˆtotal ←_mm512_kand(mˆcutoff , mˆloop)
17: Zero the forces with the mask mˆ
18: Store ~pj with the mask mˆloop
19: k ← k + 8
20: end while
21: Store ~pj
22: end for
F. AoS with Eight Elements
x_1y_1z_1vx_1vy_1vz_1x_2y_2z_2
 ✁✂✄☎✆✝✞✟✠✡ ☛☞ ✌✍✎ ✏✑✒✓
✔✕✖✗✘ ✙✚✛✜ ✢✣✤✥ ✦✧★ ✩✪✫✬✭✮
FIG. 11: (Color online) AoS data layout with eight elements
(AoS8). Each structure has eight members, three of them are
coordinates, three of them are momenta, and two of them are
padding, respectively. The information of one atom is placed
on one cache line.
From the viewpoint of the number of instructions, SoA
is more advantageous than AoS since AoS requires addi-
tional instructions to prepare the vector of indices for
gather/scatter instructions. However, the AoS data lay-
out is more advantageous than that of SoA from the view-
point of cache efficiency. In the previous section, we in-
troduced AoS with four elements: three of them were the
coordinates or momenta of atoms and the other one was
padding. Here, we adopt AoS with eight elements so that
all the information of one atom is placed within one cache
line (see Fig. 11). The optimal data layout depends on
the optimization methods used together. For the code
applying CDE and RLE, SoA is superior to AoS8. How-
ever, AoS8 exhibits better performance than SoA with
SWP since SWP significantly improves the performance
of the code with AoS8.
VI. RESULTS ON SKYLAKE
 0
 1000
 2000
 3000
 4000
 5000
 6000
4588
2553 2532 2495
3054
2568
2967
2692
El
ap
se
d 
tim
e 
[m
s]
AoS4
AoS4+AVX2
AoS4+AVX2+SWP
AoS8+AVX2+SWP
AoS8+AVX-512+RLE
AoS8+AVX-512+RLE+SWP
SoA+AVX-512+RLE
SoA+AVX-512+RLE+SWP
FIG. 12: (Color online) Results on SKL. The meanings of
the abbreviations in the figure are AoS4 (Array of Structures
with 4 members), AoS8 (Array of Structures with 8 mem-
bers), SoA (Structure of Arrays), AVX2 (hand-vectorization
with AVX2), AVX-512 (hand-vectorization with AVX-512),
CDE (collision detection elimination), RLE (remainder loop
elimination), and SWP (software pipelining).
The results on SKL are shown in Fig. 12. The code
with the AoS having four members and sorting by in-
dices is denoted by “AoS4”. This code is automatically
vectorized by the Intel compiler using ZMM registers,
but gather/scatter instructions are not used and the per-
formance is not satisfactory. The results with “AVX2”
or “AVX-512” are vectorized by hand using AVX2 or
AVX-512, respectively. We find that the code with AoS
that is vectorized with AVX2 exhibits the best perfor-
mance, which is denoted by “Aos8+AVX2+SWP” in the
figure. The performance improvement compared with the
code automatically vectorized by the compiler, i.e., from
“AoS4” to “AoS8+AVX2+SWP”, is about 46%.
VII. SUMMARY AND DISCUSSION
In this paper, we investigated the effect of combin-
ing different data layouts and optimization methods used
with vectorization for the truncated-LJ-force kernel on
different architectures, HSW, KNL, and SKL. It was
found that the choice of data layout is crucial for vec-
torization using AVX2 on HSW. While AoS8 was found
to exhibit the best performance on all the architectures
studied in this work, the differences in performance be-
tween AoS and SoA are less significant when the code is
vectorized with AVX-512. We found that the data lay-
out AoS8 may interfere with optimization by the com-
piler. In the force-calculation kernel, the coordinates are
only read, but the momenta involve writing back. Since
each structure in AoS8 includes both coordinates and
momenta, the compiler cannot determine which mem-
bers can be modified. Therefore, AoS4 is the optimal
9choice from the viewpoint of portability. KNL and SKL
support both AVX2 and AVX-512. While the code vec-
torized with AVX-512 is much faster than that vectorized
with AVX2 on KNL, the code with AVX2 is slightly faster
than that with AVX-512 on SKL. Although SWP is an
optimization technique that can be used together with
vectorization with any instruction set, the gain in perfor-
mance by SWP is not significant in this work. However,
we found that SWP works efficiently with another com-
pilers such as GCC or with other architectures [12]. Al-
though its improvement in performance strongly depends
on the optimization ability of the compiler,the SWP is
worth considering as an optimization method to be used
with vectorization. In this work, we did not consider the
compiler directives in detail since it was easier to write
codes with builtin-functions than searching for effective
combinations of compiler directives. This situation may
change for force calculations of more complex interac-
tions.
While we performed vectorization faithfully to the orig-
inal scalar code, the performance can be improved by us-
ing reciprocal approximations. Although reciprocal ap-
proximation instructions also exist in SSE and AVX2,
more accurate instructions are available in AVX-512.
For example, vrcp28pd computes an approximate re-
ciprocal with 28-bit accuracy. Therefore, full precision
can be obtained with the first-order correction. How-
ever, vrcp28pd is included in the AVX-512ER instruc-
tion set, which is supported by KNL but not SKL. In-
stead, vrcp14pd is available with SKL but it requires
the second-order correction to obtain full precision. Us-
ing these instructions, the performance of vectorization
on SKL and KNL may be changed, which should be con-
sidered as a future issue.
While we were writing this paper, Intel announced dis-
continuance of Xeon Phi family [21]. Therefore, some
part of this study became obsolete. However, we think
that the knowledge obtained in this study is still useful.
It is almost certain that the number of CPU cores and
SIMD width will increase without increasing the oper-
ating frequency. Then the code vectorized with masked
gather and scatter instructions are likely to be more ef-
ficient as it was on KNL, while the code vectorized with
AVX2 was faster than that with AVX-512 on SKL. In any
case, it is difficult to predict the optimal combination of
optimization techniques, and trial and error are neces-
sary. We hope that our work will assist vectorizations
and optimizations on the future architectures.
Acknowledgements
The authors would like to thank S. Mitsunari and H.
Noguchi for fruitful discussions. This work was supported
by JSPS KAKENHI Grant Number 15K05201 and by
the MEXT project as “Exploratory Challenge on Post-K
Computer” (Frontiers of Basic Science: Challenging the
Limits). The computations were carried out using the fa-
cilities of the Information Technology Center of the Uni-
versity of Tokyo and the Institute for Solid State Physics
of the University of Tokyo.
[1] B. Alder and T. Wainwright, J. Chem. Phys. 27, 1208
(1957).
[2] T. C. Germann and K. Kadau, Int. J. Mod. Phys. C 19,
1315 (2008).
[3] H. Watanabe, M. Suzuki, and N. Ito, Comput. Phys.
Commun. 184, 2775 (2013).
[4] K. Rupp, https://github.com/karlrupp/microprocessor-trend-data.
[5] W. M. Brown, J.-M. Y. Carrillo, N. Gavhane, F. M.
Thakkar, and S. J. Plimpton, Comput. Phys. Commun.
195, 95 (2015).
[6] P. Karpin´ski and J. McDonald, in Proceedings of the
8th International Workshop on Programming Models and
Applications for Multicores and Manycores, PMAM17
(ACM Press, 2017), pp. 21–28.
[7] M. Ho¨hnerbach, A. E. Ismail, and P. Bientinesi, in SC16:
International Conference for High Performance Comput-
ing, Networking, Storage and Analysis (IEEE, 2016).
[8] M. J. Abraham, T. Murtola, R. Schulz, S. Pa´ll, J. C.
Smith, B. Hess, and E. Lindahl, SoftwareX 1-2, 19
(2015).
[9] H. C. Edwards, C. R. Trott, and D. Sunderland, J. Par-
allel Distrib. Comput. 74, 3202 (2014).
[10] J. C. Phillips, R. Braun, W. Wang, J. Gumbart,
E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale´,
and K. Schulten, J. Comput. Chem. 26, 1781 (2005).
[11] https://github.com/kaityo256/lj_simd .
[12] H. Watanabe, M. Suzuki, and N. Ito, Prog. Theor. Phys.
126, 203 (2011).
[13] S. D. Stoddard and J. Ford, Phys. Rev. A 8, 1504 (1973).
[14] J. Q. Broughton and G. H. Gilmer, J. Chem. Phys. 79,
5095 (1983).
[15] B. L. Holian, A. F. Voter, N. J. Wagner, R. J. Ravelo,
S. P. Chen, W. G. Hoover, C. G. Hoover, J. E. Hammer-
berg, and T. D. Dontje, Phys. Rev. A 43, 2655 (1991).
[16] B. Quentrec and C. Brot, J. Comput. Phys. 13, 430
(1973).
[17] R. W. Hockney and J. W. Eastwood, Computer Simula-
tion Using Particles (Taylor & Francis Ltd, 1988), ISBN
0852743920.
[18] L. Verlet, Phys. Rev. 159, 98 (1967).
[19] M. Isobe, Int. J. Mod. Phys. C 10, 1281 (1999).
[20] V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan,
ACM Comput. Surv. 27, 367 (1995).
[21] http://qdms.intel.com/dm/i.aspx/9C54A9A7-BF37-4496-B268-BD2746EA54D3/PCN116378-00.pdf .
