Abstract This living paper reviews the present High Performance Computing (HPC) capabilities of the Tinker-HP molecular modeling package. We focus here on the reference, double precision, massively parallel molecular dynamics engine present in Tinker-HP and dedicated to perform large scale simulations. We show how it can be adapted to recent Intel ® Central Processing Unit (CPU) petascale architectures. First, we discuss the new set of Intel ® Advanced Vector Extensions 512
Introduction
Tinker-HP is a massively MPI parallel package dedicated to classical molecular dynamics (MD) and to multiscale simulations, especially using advanced polarizable force fields (PFF) encompassing distributed multipoles electrostatics [1] . It is an evolution of the popular Tinker package code [2] which conserves its simplicity of use and its developers-friendliness allowing for the rapid development of new algorithms. Tinker-HP offers the possibility to perform large scale simulations while keeping the Tinker reference double precision implementation dedicated to CPUs (Central Processing Unit) petascale architectures. The parallel scalability of the software has been demonstrated via benchmarks calculations. Overall, a several thousand-fold acceleration over a single-core computation is observed for the largest molecular systems, allowing therefore for reference long polarizable MD simulations on large molecular systems up to millions of atoms.
Despite this strong acceleration, and due to the development model of the Tinker software suite (now version 8 [2] ), which favours new scientific development over optimization, no attempt has really been made yet to adapt the initial Tinker-HP code to a particular CPU architecture. Each execution of the code initially took little or no advantage of the underlying capabilities of the CPU it's running on. processors, leads us to change the overall design of the code, while trying to keep its simplicity and readability. The goal of this paper is double. First, it intends to propose a comprehensive living review dedicated to the capabilities and performances of Tinker-HP's main production methods (i.e. force fields) on Intel's architectures. Second, it is organized to present in a pedagogical way our code optimization efforts, giving a optimal strategy to vectorize Tinker-HP. Such practical case is rarely documented and we think it could be useful to many developers in the field. The present version of the paper is then organized as follows. After reviewing the specificities of the latest Intel Xeon Scalable Processors (code-named Skylake) and particularly their Intel AVX-512 vector instructions set, we will give the general structure of the most computationally intensive FORTRAN subroutines in Tinker-HP, show their hotspots and propose a general strategy to vectorize the code with the Intel AVX-512 instruction set. We then give concrete examples of the vectorization process we follow. A performance comparisons between the released version and the vectorized version is then made, first for an execution on 1 core, to show brute acceleration of the code, and then in the context of a realistic parallel execution (with up to 16 000 cores). Finally, extended benchmarks on meaningful systems are provided with the AMOEBA polarizable force field [3] [4] [5] and also using the newly introduced initial implementation of classical force fields such as CHARMM [6] , as an illustration of what we can obtain with any non-polarizable force field (AMBER, [7] OPLS-AA [8] etc...).
Intel Xeon Scalable processors
We are using in our study a system with a CPU from the Intel Xeon Scalable processors family (code-named Skylake). These processors feature up to 28 cores per processor with 2 hyper-threads per core for a total of up to 56 threads per processor. A new mesh interconnect reduces the latency of communication within cores and controllers in the processor. Each core is capable of issuing up to four instructions per cycle out-of-order. Up to two of them can be Intel AVX-512 instructions [9] . The Intel AVX-512 instruction set is a 512-bit SIMD instruction set that allows to perform computations with a single instruction using SIMD registers that contain eight double-precision (DP) or sixteen single-precision (SP) floating-point values as well as a variety of integer data sizes. The Intel AVX-512 instruction set also supports unaligned loads, fused-multiply and add, vector masking, shuffle and permutation instructions, histogram support, and hardwareaccelerated transcendentals. Using all available cores and SIMD units is key to unlocking all the potential performance from these processors.
A significant change from its predecessor, the Intel Xeon processor v4, is the reorganization of the cache hierarchy to better balance the cache usage for server workloads. To achieve this, the L2 size has increased in size to 1MB and the last level cache (LLC) has been reduced in size (up to 38.5 MBs) but it is now a non-inclusive cache (meaning that data is not evicted from caches closer to the cores when evicted from the LLC).
The Intel Xeon Scalable processors provides two memory controllers with 3 memory channels each that support DDR4 up to 2600 MHz. This provides up to 123 GB/s of bandwidth to main memory for each socket. Three Intel ® Ultra Path
Interconnect links, each providing 10.4 GT/s, allow multiple processors to communicate to create bigger systems (e.g, dual-socket systems).
Considerations on vectorization
Efficient use of the SIMD units available in the processor is very important to achieve the best performances on modern (and probably future) processors. Most vector parallelism is extracted from loops. In this section, we outline some ideas that can help to get good performance from loop vectorization.
Modern compilers are able to auto-vectorize loops but need to be able to determine that vectorization does not break any possible cross-iteration dependencies. This is not always possible due for example to variable aliasing or loop complexity [10] . Programmers can help compilers by rewriting their loops with idioms that compilers recognize and/or using directives that guide the vectorization process from the compiler. Once a loop is vectorized, it is useful to consider if vector code generated for the loop is the most efficient possible. There are tools that can help on this assessment as well as providing feedback on how to fix it [11] . Common issues that are worth looked into are:
• Unaligned loads or stores. When data is not aligned to cache boundaries, some penalty will be incurred in load and store instructions compared to when data is well aligned. Therefore, it is recommended to align data to cache boundaries and also use the proper mechanisms to inform the compiler of this.
• Loop prologues and remainders. To provide better data alignment, the compiler might generate a different code for the first iterations of the loop to ensure the main vectorized part of the loop runs on aligned data. Similarly, if the compiler cannot deduce the number of iterations of a loop, it will generate a loop reminder that computes the last iterations of a loop that do not fill a vector register completely.
• Unnecessary masking. While the Intel AVX-512 instruction set supports vectorization of loops with conditional statements, supporting them requires the use of mask instructions and masked vector operations which reduces the vectorization efficiency. Therefore it is recommended to move as much as possible conditional statements out of vectorized loops. This might require to split the loop (with different code for those iterations where the branch was taken/not taken) or duplicate the loop (with different code for each branch of the conditional).
• Non-unit strides. Use of non-unit strides will most of the time force the compiler to generate gather and/or scatter instructions which are less efficient than regular loads and stores. This includes also accessing a field in a structure, as a non-unit stride will be needed to access the same field between consecutive elements of an array. This is why a Struct-of-Arrays (SoA) layout is preferred over a more conventional Array-of-Structs (AoS) layout for an efficient vectorization.
• Indirect accesses. Indexing one array with the values from another array will also result in generation of gather and/or scatter instructions. Therefore, this pattern should be avoided as much as possible.
• Register pressure. The number of vector registers is limited (e.g., Intel AVX-512 provides 32 vector registers).
If the number of arrays that are used in a given loop plus temporary values that might be required for the operations of the loop exceeds the number of registers, the compiler will need to spill some of the arrays to the stack and restore them afterwards which reduces significantly the vectorization efficiency. To avoid this, it is better to use different loops for independent array operations rather a single big loop with all arrays inside.
Working environment and definitions

Definitions
In this paper, we will use two versions of Tinker-HP :
1. the Release Version 1.1, referred to as Rel. 2 . the Vectorized Version 1.1v, referred to as Vec.
A third version, Release Version 1.2 (referred to as Rel2), is used in the Perspective subsection 8.3 to give a taste of the performance gain obtained with new algorithms. Vectorization of Rel2 is in progress. It will be referred to as Vec2.
We ran Tinker-HP exclusively on supercomputers under UNIX/LINUX Operating System (OS). These machines aggregate hundreds or even thousands of interconnected systems called computing nodes, or simply nodes, each of which having tens of CPU cores and usually hundreds of gigabytes of memory. On UNIX/LINUX computers, the code is executed by a process, which uses memory and CPU resources managed by the OS.
What we called the code can be split in two parts :
1. the User Code (UC), which comprises all the FORTRAN code. Here, it's the code for Tinker-HP and the 2DECOMP library. 2 . the Non-User Code (NUC), which comprises all the code executed by the process because of library calls from the UC, system calls done on behalf of the UC or code introduced implicitly by the compiler.
Of course, the way we write the UC (what library we used, how we setup our data,...) has an influence on what and how NUC is executed by the process. As we want to raise the performances, we have to take into account what NUC gets executed because of the UC code that we write.
We use the term Molecular System (MS) to denote all the physical systems we have simulated for this paper.
Note that the FORTRAN code listings shown in this paper have been taken as is, while all compilation reports and assembly code listings have been edited for publication purposes.
Compilation setup
We worked with the Intel ® Parallel Studio XE 2018 development suite [12] , containing the Intel ® Fortran Compiler, the Intel ® MPI Library, the Intel ® Math Kernel Library (Intel MKL) with implementations of BLAS [13] , LAPACK [14] and FFTW3 [15] routines, and Intel ® VTune ™ Amplifier for profiling and analysis.
The Tinker-HP sources are compiled with the flags shown in listing 1 where :
• -xCORE-AVX512 flag forces the generation of binaries for the Intel Xeon Scalable processors.
• -qopt-zmm-usage=high flag instructs the compiler to use zmm (512 bits) registers as much as possible.
• -align array64byte instructs the compiler to align all static arrays to 64 bits memory address boundaries
• -falign-functions=64 tells the compiler to align functions on 64 bits boundaries
• -qopt-report-phase and -qopt-report=5 flags produce vectorization reports.
• -S flag produces assembly code listings. Even if we do not use OpenMP* [16] in the dynamic simulation engine, other parts of the Tinkertools suite (Tinker 8 [2] ) use OpenMP directives. So, object files are linked with flags shown in listing 2 where :
• -mkl=sequential flag tells the linker to use the sequential Intel MKL library, which is lighter and faster than the multi-threaded ones (e.g., OpenMP).
• -qopenmp-stubs flag enables compilation of OpenMP programs in sequential mode. 
Execution setup
For the performance tests, calculations have been performed on nodes running under the RedHat* Enterprise Linux* Server Release 7.4 operating system, with the following configuration :
• 2 × Intel Scalable Xeon 8168 processor -2,7 GHz -24 cores/processor
• 192 GB of DDR4 memory, • InfiniBand* EDR interconnect.
We chose 8 MS from those studied in [1] with increasing sizes ranging from 9 737 to 3 484 755 atoms : the Ubiquitin protein, the prototypic Dihydrofolate Reductase (DHFR), the COX-2 dimer, the Satellite Tobacco Mosaic Virus (STMV), the Ribosome full structure in polarizable water, and three water boxes (Puddle, Pond and Lake).
The table 1 gives for each MS the name, the number of atoms and the number of cores used for the parallel calculations. We focus here on the part of the code dedicated to the AMOEBA polarizable force field which is the most computationally challenging and gives a lower bound to Tinker-HP performances [3, 18] . AMOEBA has been shown to have a wide applicability for physical systems ranging from liquids to metals ions, [19, 20] including heavy ones, [21, 22] in solution and to proteins [4, 5] and to DNA/RNA [5] . It uses distributed atomic multipoles up to quadrupole moments and a Thole/Applequist point dipole polarization model. Van 
MS
General Structure
Tinker-HP uses a 3D spatial decomposition to distribute atoms on the cores. Every process is assigned to a subsection of the simulation box and is responsible of updating the positions, speeds and accelerations of the atoms present in this subdomain at each timestep [1] . The most computationally intensive part of Tinker-HP is devoted to forces and electric fields calculations.
All the computation routines follow the same organization scheme :
• an external loop over all the atoms held by the core • the selection of the neighbour sites using various criteria (e.g. cutoff distances, ...) • a (very) big internal loop over the selected sites, where all quantities are computed.
In Rel, this internal loop computes all quantities for each atom-neighbour pair on the fly, with no attempt to precalculate or store intermediate quantities that can eventually be re-used. This results in a big harm on cache registers and processing units, and in a big use of memory-core transfer instructions. By contrast, there's almost no use of array (apart from indexing). Thus, the possibility to take advantage of the vector extension capabilities of the Intel AVX-512 instructions is very low.
Hotspots
The table 2 shows the profiling of Rel, running on 1 core for DHFR with AMOEBA (polarizable model) and with CHARMM (non-polarizable model). We give the module or routine, the real CPU time spent executing it, and the vector usage percentage. All routines are sorted with higher time-consuming first. We can see that Rel has two kinds of hotspots :
1. NUC hotspots : these are mainly due to libraries calls, system calls and memory management operations (initialization, copy, allocation and de-allocation). 2. Computational hotspots : these are mainly due to the computation of :
• the matrix-vector product operation applied at each iteration of the polarization solver (tmatxb_pme2), which can be called up to 12 times at each step, depending on the convergence criterion Table 2 . Profiling of Rel using Intel VTune Amplifier. Simulations ran on one core and 100 steps. MS is DHFR with AMOEBA polarizable force field and with CHARMM force field (no polarization). Most important NUC and computational hostspots are shown in separate frames. vmlinux is the system kernel, performing memory operations and system calls. For CHARMM calculation, image is splitted in two parts. The vectorized routines will use image(2). So, only the starred lines are counted in the total CPU time for comparison with Vec.
• the dipole polarization energy and forces (epolar1) • the van der Waals energy and forces with Halgren buffered 14-7 function (ehal1) • the multipolar permanent electrostatic energy and forces (empole1) • the right hand size of the polarization equation (efld0_direct2) • the van der Waals energy and associated forces with Lennard-Jones 6-12 function (elj1) • the charge-charge interaction energy and associated forces (echarge1)
Other widely used utility routines (image and torque) also appears.
In order to raise the performances of Rel, we needed to address these hotspots. Two observations have guided us :
1. vmlinux is taking almost as much CPU time as the multipole polarization energy and forces computation routine. That means the process makes many system calls and memory operations. 2 . the vector usage percentage is strictly 0.00 for all the computation subroutines. It confirms that these routines only use scalar operations.
The first remark led us to investigate the library and system calls and, first and foremost, to work on the memory management of a process running Tinker-HP in order to reduce memory operations.
The second remark led us to rewrite the computation routines. As using vector operations means most of the time using loops on arrays, the on the fly method of computation in Rel must no longer be used. 
Libraries and System calls Libraries calls
The vast majority of libraries calls comes from the Intel MKL library, which actually does computing work. As using it wisely can give a significant speedup, we have to provide the right FORTRAN code (UC) to access the right vectorized functions of Intel MKL.
System calls
When running, Tinker-HP reads a few files at the very beginning and outputs a log file which contains the simulation results and, periodically, a file containing atoms coordinates. As these input/output operations are done by the MPI-rank-0 process only, the open, read and close system calls do not really stress the UNIX/LINUX system, even if the simulation runs on millions of atoms.
So, most of the system calls come from the MPI library and are due to two design choices in Tinker-HP :
1. As explained before, each Tinker-HP process hold a portion of the space (a domain) and maintains MPI communications with MPI processes that hold other domains nearby it, so that each process can exchange information with the others and track atoms coming in or out of its own domain. As the other processes can run on other nodes, there can be even more time spent in system calls because of network transmissions. 2. A memory region is shared between all MPI processes on a computing node, using the sharing capabilities of the MPI library. The access control of this region is implemented through system calls (semaphores, flock(). ..) to synchronize processes and guarantee non overlapping when writing data.
Minimizing the time spent in system calls is not so easy. We tried different distributions of the MPI processes over the nodes to favour local MPI communications, but that did not give convincing results. We also tried to improve the use of MPI framework by masking communications with computations. We used non blocking versions of MPI_SEND and MPI_RECEIVE functions, and did some calculations before calling the corresponding MPI_WAIT. The performance gain is not really noticeable for now. But improving this part of the code is a work in progress.
Memory management
The figure 1 gives a simple picture of the memory layout of a UNIX/LINUX process. The text zone contains the code, the data zone contains initialized data. The bss zone contains uninitialized fixed size data and has itself a fixed and limited size, set at compile time. The heap zone contains allocated data and can grow or shrink, subject to allocations or de-allocations. The stack zone contains a Last-In-FirstOut structure, where values are pushed and pulled at each subroutine or function call or if the processor runs out of free registers.
Process memory setup
Historically, Tinker dynamically allocates and de-allocates all arrays it uses, because it was originally built to run on workstations with limited amount of memory and cores. These allocations are made by calls to the system malloc() group of functions. As a consequence, data are put in the heap section of the process, whose size is managed by the OS, allowing it to grow or shrink as needed.
As Tinker-HP is a pure MPI program which distributes data and can potentially run on hundreds of different nodes, each of them with gigabytes of memory, the problem of the memory consumption is not that important. On a computing node, each core (so, each MPI process) can easily have 2 or even 4 gigabytes of memory for its own use.
Still the size of some arrays are proportional to the size of the systems and therefore, the MS-size dependent data, declared when entering a subroutine, can be very large. normal run, each process maintains hundreds of numbers for each atom it holds. And we can have thousands of atoms held by each process and tens of MPI processes on one node. So, allocation and de-allocation of data for each process and in each subroutine constitutes a big stress for the OS.
Considering that the overall size of data held by one process is often under 2Gb, and that we maintain size derived constants throughout the program, we decided to remove all dynamic allocations in the vectorized routines, and declare all arrays with fixed sizes. That moves the data in the bss section of the process, which is lighter than heap for the OS to handle, lowering the stress on it.
Memset and memcpy
The execution cost of the memset and memcpy operations cannot be easily evaluated, as they come from the compiler libraries and are built upon the C library. But, because of their potential (big) effect on the performances (see table 2 and discussion on page 5), these operations have been extensively tracked.
Many of memset come from unnecessary zeroing and have been easily removed. Some of them come from the use of intrinsic FORTRAN90 functions, where the compiler creates temporary storage and introduces memcpy operations to work with it (for example, PACK). We tried to remove the use of intrinsic functions as much as possible. Some of the memset or memcpy operations also come from the way FOR-TRAN passes arrays to subroutines or functions (as a whole array, or as a slice). We avoided these operations wherever possible.
After this optimization, the real CPU time on one core for NUC hotspots can be shorter by up to 10%. But this depends a lot on the MS simulated and the activity of the UNIX/LINUX system outside of Tinker-HP.
Computational hotspots
The strategy we used can be developed in 4 guidelines : 1 . Rewrite all big internal loops. As using vector operations means using arrays, big loops can be split in numerous short ones, each loop computing only one or a few quantities for all the involved atoms. This way, the quantities calculated can be kept in arrays and vector operations can be executed on them. 2 . Cope with the way the compiler works on loops. As stated in section 3, when the compiler tries to vectorize a loop, it can generate 3 execution loops :
• a Peeled loop (P-loop), to treat array elements up to the first aligned one.
• a Kernel loop (K-loop), to treat the biggest possible number of array elements with vector operations.
• a Remainder loop (R-loop), to treat elements that remain untreated by the previous loops.
As the K-loops are the most effective and the fastest loops, we must eliminate as much P-loops and R-loops as possible. We'll show below what we did to achieve this goal. 3 . Use vectorized mathematical operations as much as possible. This can be difficult sometimes, because each compiler or library implements them in its own way.
For example, using the Intel Compiler, the sqrt(X) function is not vectorized. But the power function ** is. So loops with X**0.5 have a better vectorization score than loops with sqrt(X). As the two functions can give slightly different numerical results, care must be taken to be sure to always keep the big accuracy needed by Tinker-HP. 4 . Have no dependency between arrays in the loops, because the compiler will refuse to vectorize any loop where it cannot be sure that there is no dependency.
With that in mind, knowing that Intel AVX-512 vector registers can hold eight 8-bytes reals or sixteen 4-bytes integers, we should have a significant improvement of the speed if the number of neighbouring atoms is big enough to fill them. That's generally the case in Tinker-HP calculations, except for very in-homogeneous MS.
To summarize, filling in the 512 bits registers in an efficient way and using as much vector operations as possible in a loop need :
• No dependency, to be vectorized • Low number of arrays used, to reduce the register pressure • Arrays as close as possible in memory, to reduce cache miss and cost of memory operations • Data aligned on a suitable boundary, to eliminate the P-loop • No subroutine calls, no un-vectorized math operations, to get the best of the K-loop.
• Loop count carefully chosen, to eliminate the R-loop
If tests are mandatory (as in the selection process), they should be built in logical arrays before being used in the loop.
Dependency
Short loops calculate only one or a few unrelated quantities. They use the lower possible number of arrays. Thus, dependencies do not often occur. Where the non dependency cannot be automatically determined, we can easily see it and give directives to the compiler, or at worst rewrite the loop.
Data alignment
Historically, in Tinker, data were located in commons, that were themselves organized with scientific development in mind. Some compilers have options to align commons. But they may be inefficient if data are not correctly organized, with memory representation in mind.
We decided to replace commons with modules, which have many advantages :
• Arrays can be efficiently aligned using directives when declared • Overall readability is better, due to modularity • Code can be introduced in modules, so we can group operations and write them once and for all.
In all the modules, we used an ATTRIBUTE ALIGN::64 directive for each array declaration. At the very beginning of this work, we used arrays like pos(n,3) to represent, for example, the three spatial coordinates of an atom. But, sometimes, we saw that the initial alignment of the first row of pos was not kept by the compiler for the following ones, preventing it from fully optimizing the code and forcing it to generate extra P-loops. All arrays are now mono-dimensional. The coordinates are represented by arrays like Xpos(n), Ypos(n) and Zpos(n). The three coordinates are treated in the same loop, with no dependency, allowing for vectorization.
Data layouts in memory
The figure 2 shows 3 different data layouts for arrays in memory.
• In the setup x, no ATTRIBUTE ALIGN::64 directive has been given. There is no memory loss, but the real*8 array is not on a 64bits boundary. During execution, elements in this array will not be aligned. If no ASSUME_ALIGNED::64 directive is given, the compiler will generate P-loops. If an ASSUME_ALIGNED::64 directive is given, no P-loop will be generated. But the process will pick up wrong real*8 numbers in the Kloop, and give wrong results, or even crash. • In the setup y, all arrays are aligned on a 64 bits boundary with an ATTRIBUTE ALIGN::64 directive. No Ploop will be generated. But there can be a memory hole, if the number of elements in the integer arrays is odd. When running, the process could have to do some jumps in memory to get the real*8 numbers, loosing time.
• In the setup z, all arrays are aligned on 64bits with an ATTRIBUTE ALIGN::64 directive. No P-loop will be generated. There's no memory hole, because the number of elements in the integer arrays is kept even. We decided to implement the setup z, which represents the best trade-off between speed and memory consumption. So, we ended up with the typical array declarations in a module (here, MOD_vec_vdw) shown in listing 3, where :
• All 4-bytes integer arrays are listed before 8-bytes real ones, and each parametric dimension is a multiple of 64, eliminating the holes in the data layout and ensuring correct alignment in any cases. For example, in the listing 3, the parameter maxvlst, which represents the maximum number of neighbours in van der Waals pair list, is set to 2560.
• Arrays used in a loop are listed close to each others in modules and in the same order of utilization as much as possible, to reduce the time for memory operations.
For arrays declared in the subroutines (outside of modules), with shapes depending on the size of the MS, we calculated the next multiple of 16 bigger than the size, and used it as the dimension of the array.
Although the setup z implies an over-consumption of memory for arrays with MS-dependent sizes, this is almost not noticeable, because we add at worst 15 elements to the size, which goes usually from thousands to millions in Tinker-HP.
Almost all P-loops have been removed this way.
Loop counts
As we have many short loops in each subroutine, we really must carefully choose the loop counts. The number of sites being dependent of the size of the MS we work on, we cannot impose a fixed value. To be sure to completely fill the 512 bits registers at each loop, we decided to maintain two working loop counts : Since nnvlst can already be a multiple of 8 or 16, we should use the mod and merge constructs to get the smallest loop count. We used nvloop8 for real*8 operations, and nvloop16 for integer operations. real*8 arrays are loaded in registers by chunks of 8 elements, and integer arrays by chunks of 16 elements. We eliminated almost all the R-loops this way.
The flaw of this method is that, when nnvlst is very low, we do an overwork. However, the number of neighbours is generally big enough (typically between 200 and 2000), so these extra iterations are not too expensive. Furthermore, they are executed as K-loops, where maximum vectorization is in effect. We just have to remember that we have now extra (useless) calculated values and find a way to drop them in our calculations.
Design of loops
The use of short loops allows the programmer to better understand the code. That's exactly the same for the compiler ! So, loops in Vec are :
• simple. We use a small number of arrays (a maximum of 8 seems to be a good start). Due to the intrinsic complexity of the mathematical model, we use many temporary arrays.
• short. The loops contains 10 instructions at most, and 3 or 4 most of the time.
• mostly if-test free. Most of the time, an if-test in a loop prevents the compiler from vectorizing. For the loops in the neighbours selection process, which cannot be if-test free by nature, tests are built as logical array masks before being used. This way, loops using these masks can easily be vectorized.
Editing considerations
Big editing effort may seem to be useless at first glance. But we found that, most of the time, it was far easier to understand and vectorize loops once they are really readable. Also, modifications and debugging can be faster.
Typical vectorized loops in Tinker-HP
We have 2 kinds of vectorized loops in Tinker-HP :
• selection loops that select sites using various cutoffs and work on integers or logicals • compute loops that compute quantities and work on reals.
We will show each of them in details, and give some insights on how we have improved the vectorization. The typical loops have been extracted from ehal1vec, which use the module MOD_vec_vdw shown in listing 3.
Typical selection loop
We built a selection mask mask1 with the appropriate test, using nvloop16 here, as we work on integers (listing 5). We first tell the compiler to assume the loop count value is a multiple of 16, so that it does not generate any R-loop. We really have to be sure of that, otherwise the process will pick up numbers from other arrays in the memory and wrong results or bad crashes will occur. The vectorization report on listing 6 shows that the loop is perfectly vectorized.
We then applied the mask on the set of atoms we worked on to select those of interest. In FORTRAN, we have an intrinsic function PACK that does exactly this. So, in a first attempt, we wrote the selection loop as shown in the listing 7. Unfortunately, because of the PACK function, which does an implicit loop over all the elements of the array it works on, each line was seen as an independent loop by the compiler, and optimization was made on that line only. The corresponding FORTRAN compiler report on listing 8
shows that a vectorized loop is generated for the count line, with a vector cost of 0.810. Then, the first PACK line generates 3 loops :
• one over the 2-bytes logical array mask1 (vector length 32) with a vector cost of 1.250.
• one over the 4-bytes integer array kglobvec (vector length 16) with a vector cost of 1.000.
• one reduced to a memset or memcpy for the affectation of kglobvec1
The total vector cost is 2.250 for one PACK operation. We also have 3 loads and 1 store for each PACK line.
For this selection loop, we obtained a total vector cost of 0.810 + 4 * 2.250 = 9.810, plus 4 memset or memcpy, and a total of 13 loads and 4 stores. We cannot easily know the final cost of this selection loop, because, as stated before, the implementation of the 4 memory operations depends on the compiler.
To get a controlled and constant vector cost, whichever compiler we use, we decided to get rid of the PACK function. After all, packing data is just a matter of selecting array elements and putting them contiguously in a new array. So, we ended up with a functionally equivalent loop depicted in listing 9. Although there is a test in this loop, the corresponding FORTRAN compiler report (listing 10) clearly shows that :
LOOP BEGIN at ( 3 0 9 , 2 2 ) v ec t o r length 16 u n r o l l f a c t o r s e t to 2 normalized v e c t o r i z a t i o n overhead 1.192 LOOP WAS VECTORIZED unmasked a l i g n e d u n i t s t r i d e loads : 1 ---begin ve c t or cost summary ---
• The loop is vectorized • Every reference is aligned, so are the loads. The stores cannot be aligned, because of the packing.
• The vector length is 16 which means 16 integers will be picked up in each operation • The potential speedup is more than 7. This is very good in the presence of a test. • 4 vector compress instructions are generated, which correspond to the 4 affectations. We obtained a vector cost of only 2.500 and no memset or memcpy. That's 4 times smaller than the PACK version, and much more if we count the memset or memcpy operations.
The number of loads is also reduced by a factor of 3. This version is really faster than the first one.
Typical compute loop
The calculation loops follow the scheme we have described above. They are short, simple and easy to read. The listing 12 shows a typical compute loop.
We first tell the compiler to assume the loop count value is a multiple of 8 (we work on reals here). We use the arrays in the order they were declared in the module (see listing 3). There is no dependency, since all arrays are independent and used after being completely set. The compiler can easily optimize the loads and stores.
! DIR$ ASSUME (MOD( nvloop8 , 8 ) . eq As all arrays here are aligned, no P-loop are generated by the compiler. Because of the loop count, no R-loop are generated either.
LOOP BEGIN at ( 4 1 7 , 1 0 ) reference rvvec2 ( k ) has a l i g n e d access reference rv7vec
( k ) has a l i g n e d access reference r i k 3 v e c ( k ) has a l i g n e d access reference r i k 2 v e c ( k ) has a l i g n e d access reference r i k v e c ( k ) has a l i g n e d access reference r i k 4 v e c ( k ) has a l i g n e d access reference r i k 3 v e c ( k ) has a l i g n e d access reference r i k v e c ( k ) has a l i g n e d access reference r i k 5 v e c ( k ) has a l i g n e d access reference r i k 4 v e c ( k ) has a l i g n e d access reference r i k v e c ( k ) has a l i g n e d access reference r i k 6 v e c ( k ) has a l i g n e d access reference r i k 5 v e c ( k ) has a l i g n e d access reference r i k v e c ( k ) has a l i g n e d access reference r i k 7 v e c ( k ) has a l i g n e d access reference r i k 6 v e c ( k ) has a l i g n e d access reference r i k v e c ( k ) has a l i g n e d access reference r i k 7 v e c ( k ) has a l i g n e d access reference rv7vec
( k ) has a l i g n e d access reference invrhovec ( k ) has a l i g n e d access reference r i k v e c ( k ) has a l i g n e d access reference rvvec2
( k ) has a l i g n e d access reference invtmpvec ( k ) has a l i g n e d access v ec t o r length 8 normalized v e c t o r i z a t i o n overhead 0.049 LOOP WAS VECTORIZED unmasked a l i g n e d u n i t s t r i d e loads : 14 unmasked a l i g n e d u n i t s t r i d e s t o r e s : 8 ---begin ve c t or cost summary ---s c a l a r cost : 272 v ec t o r cost : 25.750 estimated p o t e n t i a l speedup : 10. We can see from the corresponding FORTRAN compiler report in the listing 13 that :
• The loop is vectorized • Every reference is aligned, so are the loads and stores.
• The vector length is 8 which means 8 numbers will be picked up in each operation • The potential speedup is around 10.5.
• 2 vectorized math library calls are made for the 2 ** function.
A look to the assembly code in listing 14 shows that all multiplications are done with vector operations vmulpd and vfmadd213pd and vfmadd231pd, which are fused multiplyadd operations. These vector instructions operate on zmm registers. We can also see the two calls to the vector version of the ** function so we are fully using Intel AVX-512 capabilities. If ever we had used a division, instead of **-one, we would have got : The estimated potential speedup in this case is less than half the previous one. And the utilization of the vector units is not so optimal. So, a careful reading of the vectorization report is always necessary to ensure the best choices have been made.
Final profile for Vec.
The tables 3 shows the profile and the boost factors between Rel and Vec for the final vectorized routines.
NUC hotspots
The Real CPU Time is almost the same as for the Rel version. We can see a reduction of about 10% for the real CPU time of vmlinux. The libmkl_vml_avx512.so shared library has appeared, because we use vectorized mathematical functions and replaced all calls to the complementary error function erfc, which was in the sources of Rel, by calls to vderfc, which is a vectorized implementation in Intel MKL library.
Computational hotspots
The vector usage percentage varies between 64% and 100%, and the boost factors are between 1.60 and more than 5. 10 . The real CPU time has shrunk from 278.95s to 136.46s for AMOEBA calculation, and from 24.8s to 13.0s for CHARMM calculation. The vectorized parts of code are now about 2 times faster.
Performance on Intel Scalable Processors
Sequential performance
We have evaluated the brute performance boost of Intel AVX-512 by running calculations on only one core from a dedicated node. In this situation, we can easily measure the time of execution of each interesting part of the code, with limited perturbation coming from the MPI part of the code or the presence of other processes.
We chose to measure 3 execution times :
1. time_vdw, which is the time taken by Van der Waals calculations. Depending on the setup, we used ehal1(vec) or elj1(vec). 2 . time_elec, which is the time taken by electrostatic calculations. Depending on the setup, we used echarge1(vec) (charges) or empole1(vec) (multipoles). 3 . time_polar, which is the time taken by polarization calculation and by far the biggest. We used epolar1(vec).
In this case, the boost is always a tradeoff between feeding the CPU with enough numbers, which goes better with the size of the MS, and minimizing the exchanges between memory and cores, which goes worse with the size.
For every MS, we ran 10 steps of calculations using the polarizable AMOEBA force field. We took the average value of each times, removing the lowest and the biggest. Results are given in the table 4. Performance boost factors are always very good, even when the size of the MS is very big (more than 1 million atoms on one core for STMV !). The boosts we obtained are very significant and justify the important vectorization efforts we made to get them. But the real gain should be estimated in a more realistic situation, where Tinker-HP is running in parallel. In this case, there could be from 8 to 48 processes running on one node, each competing for resources, and up to 340 nodes involved, multiplying MPI communications.
Best parallel performances
Polarizable force field : AMOEBA
We focus here on the absolute timings improvements over previous publications. As vectorization did not affect the scaling of the methods, interested readers can refer to the Tinker-HP software publication for detailed analysis of the scalability [1] . Here, we ran calculations on 2 000 steps (4ps), with the core setups shown in table 1. The best performance was taken as the average of the 20 performance evaluations made by Tinker-HP during the run, after having removed the first, middle and last values, which are lower because Tinker-HP writes intermediate files at these steps.
Results given in the boost factors. We are still able to get little gains with CPU2 sets, because most of the supplementary cores use vectorized routines. Anyway, these results are very encouraging, knowing that not all the code has been optimized. We pushed forward and tried simulations on STMV and Ribosome with up to 16200 cores (CPU2 set). The figures 3 and 4 give the performances obtained for Rel and Vec when increasing the number of cores.
The boost factors remain relatively constant for these two MS. With very large number of cores (and very large number of nodes), both Rel and Vec speeds are bounded by MPI communications and memory operations. 
Non-polarizable force field : CHARMM
Tinker-HP is yet not optimized for these kind of force fields as no specific "modern algorithmics" is present. Indeed, the initial implementation is mainly a massively parallel version of the initial Tinker that was initially aimed at performing comparisons for Steered Molecular dynamics between polarizable and non-polarizable approaches.
[23] Also, we used for the tests a very conservative molecular dynamics setup where bonds are not rigidified, reciprocal space computations are done at each timestep, etc... Such setup is chosen in order to provide reference numbers but actual timings could be accelerate in many ways. Thus, we know that we need much more cores to get results comparable to those of other prominent codes [24] [25] [26] . Nevertheless, we decided to make timings, firstly to get an idea of the boost the vectorization can provide in this case and, secondly, to know if we can still benefit from the scalability of the code, which is one of its greatest strengths. We used the same MS and the same CPU sets, limiting us to a maximum of 2 400 cores, since using more cores would not make any sense and knowing that these setups may not be optimal for this other type of computations (i.e. as they were chosen for AMOEBA).
Vectorization boost
The table 6 shows the timings we obtained for Rel and Vec.
Overall, the speedup factor in using non-polarizable force fields is found to be between 3 and 4. The boost factors are lower than for AMOEBA, mainly because the vectorized part of the code which is really executed is itself smaller. They show the same behaviour than for AMOEBA when the size of the MS increases, with a peak value reached for smaller systems (around 200 000 atoms). Beyond this size, the non-vectorized code become the limiting speed factor. 
CHARMM non polarizable Force Field
MS
Scalability
We tested the scalability of the code with 3 MS : Ubiquitin, DHFR and COX-2. As for AMOEBA, we ran 2000 steps of calculation with increasing number of cores, and took the average performance given by the code. The figures 5, 6 and 7 show the performances obtained for Rel and Vec when increasing the number of cores. Whatever MS we simulate, the scalability is still very good. The boost factor remains almost constant for the 2 smaller MS (Ubiquitin and DHFR). For the COX-2, the boost factor decrease from 1.41 to 1.39 when increasing the number of cores, because, with 2 400 cores, communications tends to lower the benefit of the vectorization. In practice, this part of the code is a first step towards an efficient engine for non-polarizable MD but some work is still required and is in progress to obtain better performances with an updated module to come. 
Perspectives on Tinker-HP 1.2 performance
This section intends to give a taste of the incoming performance gain that will appear in the Tinker-HP Release 1.2 version (Rel2). Indeed, despite being not fully vectorized yet, this major update proposes significant algorithmic speedups. For now, we can point out that a strong performance gain without accuracy loss can be observed by using Tinker-HP coupled to the new multi-timestep BAOAB-RESPA1 integrator [27] with hydrogen mass repartitioning. This newly introduced integrator splits the energy terms in 3 levels evaluated at different timesteps: the bonded terms are evaluated every 1fs, the non-bonded terms (including polarization) are split into short and long range, the short-range being evaluated every 10/3 fs and the long range every 10fs. Furthermore, short-range polarization is treated with the non-iterative TCG-1 (Truncated Conjugate Gradient) solver [28, 29] and the outerlevel uses the Divide-and-Conquer Jacobi Iterations (DC-JI) [30] approach, offering a net global acceleration about 4.91 time compared to standard 1fs/Beeman/ASPC (7 without ASPC) simulations without loss of accuracy enabling a good evaluation of properties such as free energies [27] .
Preliminary results (where not all routines are yet vectorized) are reported in Table 7 . We intend to review the full 1 Table 7 . Best production timings for the different MS using Rel2, Rel2-multi (multi-timestep) and Vec2-multi (multi-timestep). For DHFR, COX-2, STMV and Ribosome, optimal results with CPU2 setup are also shown (see table 1 ).
Finally, beside the focus on the AMOEBA polarizable force field, timings will be given for other polarizable models as well as on classical force fields (CHARMM, AMBER, OPLS etc...). For now, despite the non-optimization and the absence of use of lower precision of this part of the code, more than a 4-time speedup of the values reported in Table 7 give an initial idea of the reasonable code performances for non-polarizable simulations.
Conclusion
This work has been a fundamental step in many ways. First, it demonstrates that new HPC architectures can offer large acceleration to existing massively parallel code like Tinker-HP. A brute performance boost between 1.32 and 1.70 can be achieved on computationally intensive parts of the code, leading to an overall acceleration factor between 1.45 and 1.59 for AMOEBA (1.24 and 1.40 for CHARMM) in realistic conditions, even for the simulation of molecular systems with millions of atoms. Considering that the most costly calculations must be done with a simulation time of a few microseconds, such a gain in speed is undoubtedly a big progress.
Second, it shows that every gain in speed is not just a matter of raising the frequency of the CPU or buying more powerful computers. Such a large gain involves a close cooperation between the computational chemists, who write the code, and the HPC specialists, who know how the CPU and the system software work. To get this acceleration, we had to dig into the pieces of code that were the most CPU consuming and to rewrite them almost completely, with simplicity and efficiency in mind. But it was worth the effort. Furthermore, considering the past trend of past CPUs, we consider that vectorization will also play an important role in future architectures. Third, it gives us a strategy and some methods to further improve the code. It can serve as a solid starting point for the future. We are now able to more easily adapt Tinker-HP to new underlying hardware or software technologies. That would allow us to make the best of them.
Of course, optimization is far from finished as some parts of the code are not yet vectorized (for example the reciprocal space computations of permanent electrostatics and polarization) and other sources of speedups exist and will be investigated. In particular, we have to review how we can improve the creation of the neighbour lists, how we can have a faster indexing of all the atoms (sorting indexes could be a solution) and how we can better mask MPI communications with computations. Decreasing precision is also possible in specific cases to gain performances while retaining accuracy. This paper never ceases to be updated as we will accumulate new data on Github until a new version of this living paper is pushed to review. The next iteration of this paper will also incorporate results on next generation of Intel Xeon Scalable processors (codenamed Cascade Lake) and intend to evolve towards the adaptation of the Tinker-HP code to any of the future architectures proposed by Intel. Future work will focus on the algorithmic boosting of our initial implementation of classical non-polarizable force fields as the next iteration of the paper will propose more detailed benchmarks about incoming new polarizable approaches including SIBFA [31, 32] and major evolutions of AMOEBA such as AMOEBA+ [33] and HIPPO [34, 35] .
Acknowledgments
We want to give a special thank to David Guibert and Cedric Bourasset (Center for Excellence in Parallel Programming, Atos-Bull) for initial discussions.
LHJ wishes to thank Prof Esmaïl Alikhani (Sorbonne Université) for sharing his 2x18 cores Skylake computer in the early phase of this work.
LHJ also wants to thank Bernard Voillequin for fruitful discussions at the very beginning of this work. B. Voillequin is not a scientist, not even an engineer. But he's a man of good sense and good will. That's invaluable.
For a more detailed description of contributions from the community and others, see the GitHub issue tracking and changelog at https://github.com/myaccount/ homegithubrepository. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
For more information go to www.intel.com/benchmarks. Performance results are based on testing as of March 14th 2019 and may not reflect all publicly available security updates. See configuration in the paper for details. No product or component can be absolutely secure.
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at [intel.com].
Intel, the Intel logo, Xeon, VTune and Xeon Phi are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Author Information ORCID:
Luc-Henri Jolly : 0000-0003-3220-5759 Louis Lagardère : 0000-0002-7251-0910 Jay W. Ponder : 0000-0001-5450-9230 Pengyu Ren : 0000-0002-5613-1910 Jean-Philip Piquemal : 0000-0001-6615-9426
