Student Cluster Competition 2017, Team University ofTexas at
  Austin/Texas State University: Reproducing Vectorization of the Tersoff
  Multi-Body Potential on the Intel Skylake and NVIDIA V100 Architectures by Sullivan, James et al.
Student Cluster Competition 2017, Team University of
Texas at Austin/Texas State University: Reproducing
Vectorization of the Tersoff Multi-Body Potential on
the Intel Skylake and NVIDIA V100 Architectures
James Sullivana, Collin Weirb, Austin Reicherta, R. Todd Evansb, W. Cyrus
Proctorb, Nicolas Thorneb
aThe University of Texas at Austin, 110 Inner Campus Drive, Austin, Texas 78705
bTexas Advanced Computing Center, 10100 Burnet Rd, Austin, Texas, 78758
1. Introduction
Reproducibility is a trait all scientific papers should aspire toward, and work
in the high-performance community is no exception. There is still significant
room for improvement, and the importance of the issue of reproducibility has
brought us to consider the paper from SC16 titled “The Vectorization of the
Tersoff Multi-Body Potential: An Exercise in Performance Portability” [6]. In
this paper, Hohnerbach et al. applied a vectorization scheme to the computa-
tionally intense Tersoff multi-body potential, which requires calculations that
move beyond pairwise interactions, relying instead on local complex interactions
involving relative positions between multiple particles. The authors attempt to
optimize the Tersoff kernel for LAMMPS [8] and increase its portability to vari-
ous cluster architectures. The authors provide an implementation in their code
to accomplish both these goals through vectorization and related methods.
The authors’ code provides for three vectorization schemes for appropriate
architectures, and optimizes both the USER-INTEL and USER-KOKKOS pack-
ages. The implementation of their vectorization scheme splits their algorithm
into computational and filter parts. The filtering forces LAMMPS to wait to
compute until the vector lanes are full or near-full, increasing the number of
operations performed in a given time interval. The authors make the following
claims:
• their single and mixed precision code additions are accurate when run
compared to double precision calculations,
Email addresses: jsull3@utexas.edu (James Sullivan), cweir@tacc.utexas.edu (Collin
Weir), austinskyreichert@yahoo.com (Austin Reichert), rtevans@tacc.utexas.edu (R.
Todd Evans), cproctor@tacc.utexas.edu (W. Cyrus Proctor), nthorne@tacc.utexas.edu
(Nicolas Thorne)
Preprint submitted to Elsevier August 22, 2018
ar
X
iv
:1
80
8.
07
02
7v
1 
 [c
s.D
C]
  2
1 A
ug
 20
18
• their code increases performance for running calculations of the Tersoff
potential across CPU-only, GPU co-processor, and Xeon Phi co-processor
systems,
• and that their vectorized code scales well across various hardware config-
urations.
We investigated all of these claims using some of the most powerful CPU
and GPU hardware options available at the time of this writing. This gave
us the ability to analyze and determine how well the claims of vectorization
portability have held up on contemporary hardware. Moreover, we also applied
the Tersoff potential to the task of calculating the thermodynamic properties
of silicon [9]. This “porter” problem potentially provides an additional test for
the vectorization scheme of Hohnerbach et al., the portability of the scheme to
other systems, and the reproducibility of the authors’ results.
This paper is organized as follows: we describe our cluster in terms of hard-
ware and software and our build process for the application in Section 2, we
present our attempt to reproduce the results of the original paper in Section 3,
and we conclude in Section 4.
2. Cluster and Builds
2.1. Machine description
The machine used was an eight-node cluster consisting of eight Dell Pow-
erEdge R740s. Each node contained two 24-core 2.1GHz Intel Xeon Platinum
8160 “Skylake” processors amounting to 48 cores per node. Additionally, each
node contained one NVIDIA Tesla V100 GPU, and had a total 192GB DRAM.
The cluster ran CentOS 7.4, was connected by a Mellanox InfiniBand EDR
switch capable of transmitting 100Gb/s, and had high-performance storage
managed with two striped Intel P4500 Series 4.0TB NVMe SSDs.
The competition required us to source hardware from sponsors and then
operate our cluster below 3000 Watts while competing. The original paper
involved runs on various compute architectures, including CPU-only runs, GPU
accelerated runs, and Intel Xeon Phi co-processor accelerated runs. Due to our
hardware constraints, we only attempted to replicate the CPU-only runs and
GPU accelerated runs. These runs were sufficient to satisfy the terms of the
competition.
2.2. Compilation/Run description
Using the March 2016 version of LAMMPS pulled directly from the lammps-
tersoff-vector repository of the original paper, we compiled different builds of
LAMMPS with and without the vectorization scheme of Hohnerbach et al. using
the USER-INTEL, USER-KOKKOS, and USER-GPU packages. For the USER-
INTEL package, we built using the Intel MPI (17.0.4) [3], Math Kernel [4] and
Threading Building Blocks [7] libraries, and each package was compiled with the
Intel C++ Compiler (17.0.4) [2]. We followed the procedure of Hohnerbach et
2
Figure 1: Accuracy study of LAMMPS precision level. Evaluation of single and mixed pre-
cision implementations when compared to double precision. Relative difference in values of
total energy produced by both the single vs. double (solid blue) and mixed vs. double precision
solvers (dashed orange) for 32,000 atoms for 106 steps.
al. for building with the USER-KOKKOS package and used the Kokkos package
from the Trilinos library [5]. Additionally, we used CUDA 9.0 [1] to build both
the USER-KOKKOS package and USER-GPU packages. We also used OpenMP
for both the USER-INTEL and USER-GPU builds.
To properly build with the USER-GPU package, we used the Tesla70 CUDA
architecture specification corresponding to the V100s. However, for the USER-
KOKKOS package associated with the repository containing the code of Hohner-
bach et al., the most recent available CUDA architecture was Maxwell53. There
was no version of Kokkos yet released for the Tesla70 architecture at the time
of this work. This meant that our USER-KOKKOS build will likely experience
increased performance with an updated version of Kokkos corresponding to the
Tesla70 architecture. As in Hohnerbach et al., we also use the most advanced
vector instructions available for a given processor architecture, in our case those
enabled by the “skylake-avx512” flag, in all of our builds.
3. Results
3.1. Accuracy study
Hohnerbach et al. claim that the single and mixed precision versions of their
code are accurate when compared with the double precision offered by LAMMPS
by default. Their evidence for this claim is in a comparison of relative difference
3
between single and double precision in the total energy of a long-running sim-
ulation. The authors quote a relative difference of 0.002% in this total energy
for 32,000 atoms over 106 timesteps. We test this claim on our cluster, and
go a step further by comparing the relative difference between their mixed and
double precision implementations as well.
In an attempt to replicate the accuracy of Hohnerbach et al., we calculated
the relative difference in total energy using the same problem size and number
of timesteps as in the original paper (Fig. 1). We find a relatively low maximum
value of deviation between the double and reduced precision runs of less than
0.25%. This is much larger than the deviation quoted in Hohnerbach et al., but is
still small, so we verify their claim of accuracy but to a weaker degree. Notably,
the deviation grows nearly monotonically after about 2.5 × 105 timesteps for
the single precision case. The mixed-precision case’s deviation initially grows
at the same rate as the single-precision case but appears to be bounded after
roughly 9 × 105 timesteps. Both cases’ behavior is different from the random
but non-increasing deviation displayed in Fig. 3 of the original paper. These
discrepant behaviors may be due to a variety of factors including differences in
instruction set and/or optimizations performed at compile time.
3.2. Performance differences
In this Section, we replicate nearly exactly the results of Hohnerbach et
al. using GPU offloading in the LAMMPS calculations. The only difference
lies in our problem size. Whereas the authors originally used a specific GPU
benchmarking problem with 256,000 atoms, we solve a similar problem with
512,000 atoms using the same number of timesteps (2420). Here Opt-KK refers
to calculations run using the optimized code with the USER-KOKKOS package,
while Ref-KK refers to non-optimized calculations using this package. Ref-GPU
refers to calculations using the non-optimized USER-GPU package, and is run
with single, mixed, and double precision.
We find results (Fig. 2) that are very similar to those of the original paper,
and verify the claim of increased performance on GPUs. Even at a problem
double in size, there is about a factor of two increase in performance (measured
in ns/day) between single and double precision runs, but only a very slight
growth in performance between mixed and single precision runs. Without the
optimization applied, Ref-KK performs better than the similar double precision
REF-GPU-D, but performs worse compared to the mixed and single precision
GPU runs.
The most significant result we find is the relatively large increase in per-
formance for Opt-KK-D. When compared to Ref-KK-D we find nearly a factor
of 4X increase in performance. This is only a slightly greater factor than the
one found for the same setup in Hohnerbach et al., but the near 3X increase
in performance between Opt-KK-D and the Ref-GPU-M/S cases is an improve-
ment upon the authors’ original increase of 1.5X in performance in these cases.
This is likely due to the closer balance of single precision to double precision
capability found in an NVIDIA V100 (2 to 1) versus a NVIDIA K20x or K40
(∼ 3 to 1).
4
Figure 2: Evaluation of performance portability for NVIDIA V100s for 512,000 atoms. Kokkos
runs are shown in purple (hatched) and reference GPU runs are in blue (unhatched). Error
bars are shown with limits denoting the minimum and maximum measured ns/day over 5 runs.
Ref-KK-D uses the non-optimized USER-KOKKOS package, Opt-KK-D uses the optimized
USER-KOKKOS package, Ref-GPU-D, Ref-GPU-M, and Ref-GPU-S use the USER-GPU
package with double, mixed, and single precision, respectively.
5
It is also interesting that the absolute ns/day values we quote for our GPU
runs are much higher than those offered by the authors in their original work.
This is likely due to the architectural improvements on the NVIDIA V100 as well
as the updated CUDA Toolkit [1]. In the original paper, the authors’ highest
value in ns/day for any GPU offloaded run was accomplished with Opt-KK-D
on a NVIDIA K40 GPU with about 1.8 ns/day. In comparison, when running
on the NVIDIA V100s, the lowest value we find is at about the same value
of 1.8 ns/day for Ref-GPU-D. The highest-performing run (Opt-KK-D) on the
V100s gives a value of approximately 8.5 ns/day, almost 5X the value produced
for the same run on the K40. Even without using Kokkos, it would seem that
upgrading GPU hardware would outperform the best possible outcome of op-
timization. However, the surprisingly high return on performance in a large
dataset using Opt-KK-D illustrates the power of the vectorization scheme of
Hohnerbach et al. For future use, note that building using Kokkos limits the
CUDA compute capability to version 5.3, while the V100s are capable of 7.0.
Thus, we might expect an even greater performance gain if Kokkos could use
this updated architecture.
We also attempted to run the porter problem while offloading to GPU, but
were unsuccessful. The reason for this is two-fold. First, Kokkos in the version
of LAMMPS provided by the original authors does not support the minimize
function required to run the porter problem. Second, we were not able to run
the non-optimized reference versions built using the GPU package, since we
received an error due to the inability of the GPU package to use a triclinic
crystal system box required by the porter problem.
3.3. Scaling study
Hohnerbach et al. originally perform a scaling study on nodes both with
and without Xeon Phi Knight’s Corner coprocessors to support their claim of
scalability for large problems. For the competition, we performed a similar study
for nodes with GPUs, as well as an additional study of scaling within a node.
We turn to the latter first. We used our Intel Skylake processors possessing 24
cores each, more than any socket available to the original authors, to do so. We
also did this in part due to our lack of access to the Xeon Phi co-processors,
though further investigation using those would certainly be worthwhile.
We performed a run with a problem size of 512,000 atoms utilizing a varied
number of cores with no explicit affinity settings. We found an intra-socket
speedup of 8X when scaling from 1 core to 24 cores. This scaling, however, did
not continue when using cores from both sockets as can be seen in Fig. 3. In
fact, using all 48 cores on a node resulted in comparable or worse performance
than using a single core. Interestingly, we saw a smaller speedup in the porter
problem, but not such a large drop-off in performance at a large number of cores.
In both cases, we saw a rise in performance across one socket, and a drop as
we used the whole node, though with looser time constraints we would dedicate
more time to getting more data in the region between 12 and 48 cores. It is also
important to keep in mind the size of the error bars on these reported values,
especially for the porter problem where we see large variability in ns/day when
6
Figure 3: Optimization results with scaling across one node with two 24-core Skylake sockets
(Intel Platinum 8160) for 512,000 atoms. Top panel : strong scaling of tersoff problem runs.
Bottom panel : strong scaling of porter problem runs. The error bars represent the maxima
and minima of reported ns/day for our 5 runs.
7
Figure 4: Reference GPU (using the USER-GPU package) results with strong scaling across
eight nodes with NVIDIA V100s for 512,000 atoms using double precision. We see roughly
linear scaling across our cluster. The error bars represent the maxima and minima of reported
ns/day for our 5 runs.
running with a small number of cores. Perhaps the size of the variability is due to
the nature of the porter problem itself, but it would be worth repeating runs for
more than just the 5 we completed to better understand this behavior. Again,
competition time constraints prevented us from generating further scaling data.
Our results imply scaling within a socket is excellent and beyond a single socket
poor for these problem sizes. However, the performance gains from using the
optimized code is retained at all scales much as in the original paper.
To see if the scaling exhibited by the Xeon Phis is also present for scaling
with GPUs, we performed an additional scaling study using our V100 GPUs.
In Fig. 4, we present the scaling in performance measured as we increased the
number of GPUs in our runs using the USER-GPU package. The error bars are
very small here, showing consistency in our runs. We used 512,000 atoms, in
order to provide a direct comparison to the CPU results that were gathered from
the competition. This was a deviation from the 256,000 atom size presented
in the original paper so we could demonstrate the increased performance of
the GPUs when compared to the results generated from the CPU runs. We
observe roughly linear scaling for the double precision run that was used as
a reference in the original paper on our V100 GPUs, which is superior to the
original reported scaling. Thus, we support the original claim of scaling across
nodes with accelerators.
Interestingly, there is super-linear scaling observed when going from 2 to 3
nodes as seen in Fig 4. There is the possibility that the problem size could
8
fit inside the memory of 3 GPUs but not 2. This could reduce data transfer
overhead. The resulting benefit to performance could then be overwhelmed
by communication overhead when using 4 or greater nodes. A second possible
explanation is that the MPI configuration happened to be tuned to perform well
using 3 nodes. Unfortunately, without access to the original hardware we can’t
be sure of the cause.
We were unable to perform the GPU-based scaling study using the Kokkos
build and associated optimizations. The LAMMPS package documentation for
USER-KOKKOS states Kokkos cannot run on multiple GPUs, but if Kokkos
were to provide multi-GPU support in the future, such a run would provide an
interesting followup to this large-scale scaling study.
4. Conclusions
We attempted to replicate the results of Hohnerbach et al. on a GPU-enabled
cluster. We found using CPU tests of accuracy that the optimization is still
accurate, though there is a nearly monotonic decline in accuracy of the single
precision versions at large timesteps. In the future it is worth considering the
effect on accuracy of Xeon Phi co-processor acceleration versus CPU only or
GPU accuracy. However, this decline is small and though a trend is present, we
replicate the order of accuracy produced in Hohnerbach et al.
When examining performance differences using GPUs, we found significant
performance increases due to our V100 GPUs. We found similar relative perfor-
mance in the various precision and optimization cases for GPU offloading. As
a result, we were able to replicate this outcome. In the future, we suggest test-
ing hardware such as the V100s along with its appropriate CUDA architecture,
i.e. a more recent version of compute capability, for increased performance.
We test scaling on a 48-core Skylake node and find good scaling up to 24 cores
and poor scaling when using all 48 cores in the node. This is in contradiction
to what is observed in Hohnerbach et al., where scaling from 24 to 48 cores is
efficient. It is not a direct comparison, however; the scaling presented in the
original paper is for whole nodes, going from 1 24-core node to 2 24-core nodes,
while the scaling in our study was restricted to within a node.
In a more direct comparison to the original paper’s scaling results using
accelerators, we found excellent scaling on NVIDIA V100 GPUs across many
nodes for the reference GPU run. In fact, the scaling we observed was better
than that reported for the Phi’s. We thus replicated, and even exceeded, the
strong increase in performance across multiple nodes for accelerators reported
by Hohnerbach et al. In the future, it may be interesting to consider scaling on
multi-GPU runs.
5. Acknowledgments
We would like to thank the other members of our Student Cluster Competi-
tion Team, Vivian Nguyen, Jeremie Gallegos, Carlos Hurtado, as well as Cyrus
9
Proctor, Nick Thorne, and Todd Evans for extensive mentoring and support.
We also thank our corporate sponsors, including TACC, Dell, NVIDIA, Intel,
Geist, Mellanox, and Raytheon. We thank the Student Cluster committee for
their hard work organizing and running the competition, especially Kris Garrett
for being patient with our questions.
6. References
References
[1] CUDA Development Toolkit (CUDA 9.0.176 384.81).
[2] Intel Corporation, Intel C++ Compiler 17.0.4.
[3] Intel Corporation, Intel MPI Library 17.0.4.
[4] Intel Math Kernel Library. Reference Manual. Intel Corporation, Santa
Clara, USA, 2009. ISBN 630813-054US.
[5] H. C. Edwards, C. R. Trott, and D. Sunderland. Kokkos: Enabling
manycore performance portability through polymorphic memory access pat-
terns. Journal of Parallel and Distributed Computing, 74(12):3202 – 3216,
2014. Domain-Specific Languages and High-Level Frameworks for High-
Performance Computing.
[6] M. Hohnerbach, A. E. Ismail, and P. Bientinesi. The vectorization of the
tersoff multi-body potential: An exercise in performance portability. In Pro-
ceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis, SC ’16, pages 7:1–7:13, Piscataway, NJ,
USA, 2016. IEEE Press.
[7] C. Pheatt. Intel threading building blocks. J. Comput. Sci. Coll., 23(4):298–
298, Apr. 2008.
[8] S. Plimpton. Fast parallel algorithms for short-range molecular dynamics.
Journal of Computational Physics, 117(1):1 – 19, 1995.
[9] L. J. Porter, S. Yip, M. Yamaguchi, H. Kaburaki, and M. Tang. Empirical
bond-order potential description of thermodynamic properties of crystalline
silicon. pages 96–106, 1997.
10
