The main performance bottleneck of gravitational N -body codes is the force calculation between two particles. We have succeeded in speeding up this pair-wise force calculation by factors between two and ten, depending on the code and the processor on which the code is run. These speedups were obtained by writing highly fine-tuned code for x86 64 microprocessors. Any existing N -body code, running on these chips, can easily incorporate our assembly code programs.
Introduction
Some N-body simulations can be sped up in various ways, by using faster algorithms such as tree codes (Barnes and Hut, 1986) and/or special purpose hardware such as the GRAPE family (Sugimoto et al., 1990) . For some regimes, such as low N values, these speed-up methods are not very efficient, and it would be nice to find other ways to improve the speed of such calculations. It would be even better if these alternative ways can be combined with other methods of speed-up.
We explore here a general approach based on speeding up the inner loop of gravitational force calculations, namely the interactions between one pair of particles. Also when using tree codes this approach will be useful, since in that case the calculation cost is still dominated by force calculations. Even for GRAPE applications, this approach will still be useful in many cases, since there are always some calculations which are done more efficiently on the front end.
In particular, we consider the optimization of the inner force loop on the x86-64 (or AMD64 or EM64T) architecture, the newest incarnation of the architecture that originated with the Intel 8080 microprocessor. Processors with an x86-64 instruction set are currently the most widely used. Athlon 64 and Opteron microprocessors from AMD, and many recent models of Pentium 4 and Xeon microprocessors from Intel, support this instruction set.
As will be shown in section 2, a straightforward implementation of the inner force loop using either Fortran or C, compiled with standard compilers like GCC or icc for x86-64 processors, results in a performance that is significantly lower than the theoretical peak value one can expect from the hardware. In the following, we discuss how we can improve the performance of the force loop on processors with an x86-64 instruction set.
The SSE2 Instruction Set
Our approach is based on the use of new features added to the x86 microprocessors in the last eight years. The first one is the SSE2 instruction set for double-precision floating-point arithmetic. Traditionally, the instruction set for x86 microprocessors has included the so-called x87 instruction set, which was originally designed for the 8087 math coprocessor for the 8086 16-bit microprocessor. This instruction set is stack-based, in the sense that it does not have any explicit way to specify registers. Instead, registers are indirectly accessed as a stack, where two operands for an arithmetic operation are taken from the top of the stack (popped) and the result is placed back at the top of the stack (pushed). Memory access also takes place through the top of the stack. This x87 instruction set had the advantage that the instructions are simple and few in number, but for the last fifteen years the design of a fast floating-point unit for this x87 instruction set has been a major problem for all x86 microprocessors. If one would really design stack-based hardware, any pipelining would be practically impossible. In order to allow pipelining, current x86-based microprocessors, from Intel as well as AMD, translate the stack-based x87 instructions to RISC-like, presumably standard three-address register-to-register instructions in hardware at execution time.
This approach has given quite high performance, certainly much higher than what would have been possible with the original stack-based implementation. However, it was clear that pipelining and better use of hardware registers would be much easier if one could use an instruction set with explicit reference to the registers. In 2001, with the introduction of Pentium 4 microprocessors, Intel added such a new floating-point instruction set, which is called SSE2. It is still not a real three-address instruction set; it rather uses a two-address form where the address of the source register and that of the destination register are the same. Moreover, SSE2 still supports operations between the data in the main memory and that in the register, as was the case with the IBM System/360. Thus, it still has the look and feel of an instruction set from the 1960s.
Minimizing Memory Access
Even though operations between operands in memory and operands in registers are supported, clearly execution would be much faster if all operands could reside in registers. However, with the original SSE2 instruction set it was difficult to eliminate memory access for intermediate results, because there were only eight registers available for SSE2 instructions. For whatever reason, these registers are called "XMM" registers in the manufacturer's document, and we follow this convention. With the new x86 64 instruction set, the number of these "XMM" registers was doubled from 8 to 16. The implication for N-body calculations was that it now became possible to minimize memory access during the inner force loop. A form of optimization using this approach will be discussed in section 3.
Exploiting Two-Word Double-Precision Parallelism
Another important feature of SSE2 (which is actually Streaming SIMD extension 2) is that it is defined to operate on a pair of two 64-bit floating point words, instead of a single floating-point word. This effectively means that the use of SSE2 instructions automatically result in the execution of two floatingpoint operations in parallel. While this feature cannot be easily used with any compiler-based optimization, it is possible to gain considerable profit from this feature through judicious hand coding. We discuss the use of this parallel nature of the SSE2 instruction set in section 4.
Exploiting Four-Word Single-Precision Parallelism
SSE2 is not the only new floating-point instruction set that has been made available for the x86 hardware. As the name SSE2 already suggests, there is an earlier SSE instruction set, which is similar to SSE2 but works only on single-precision floating-point numbers. As is the case with SSE2, SSE also works on multiple data in parallel, but instead of the two double-precision data of SSE2, SSE works on four single-precision floating-point numbers simultaneously. Thus, the peak calculation speed of SSE is at least a factor of two higher than that of SSE2. For those force calculations where single precision gives us a sufficient degree of accuracy, we can make use of SSE, gaining a performance that is even higher than what would have been possible with SSE2, as we discuss in section 5.
Utilizing Built-In Inverse Square Root Instructions
SSE was designed mainly to speed up coordinate transformation in threedimensional graphics. As a result, it has a special instruction for the very fast calculation of an approximate inverse square root, which is intended as a good initial value for Newton-Raphson iteration. This is exactly what we need for the calculation of gravitational forces. We discuss the use of this approximate inverse square root for double-precision calculation, in section 4.
Organization
This paper is organized as follows. In section 2, we give the standard Clanguage implementation of the force loop. We consider the code fragment which calculates both the acceleration and its first time derivative. It is used with Hermite integration scheme (Makino & Aarseth, 1992) .
We present the assembly language output and measured performance, and describe possible room for improvements. We call these implementations the baseline implementation.
In section 3, we discuss the optimized C-language implementation of the force loop for the Hermite scheme. The difference from the baseline implementation is that the C-language code is hand-tuned so that the load and store of intermediate results are eliminated from the generated assembly-language output. This version gives us the speed-up of 46% compared to the baseline method.
In section 4, we discuss a more efficient use of SSE2 instruction, where forces on two particles are calculated in parallel using the SIMD nature of the SSE2 instruction set. Here, we use the intrinsic data types defined in GCC, which allows us to use this SIMD nature within the syntax of C-language. Also, we use the fast approximate square root instruction and a Newton-Raphson iteration. This implementation is 88% faster than baseline.
In section 5, we discuss a mixed SSE-SSE2 implementation of the force calculation for the Hermite scheme. In many applications, full double-precision accuracy is not necessary, except for the first subtraction between positions and final accumulation of the forces. "High-accuracy" GRAPE hardwares (GRAPE-2, 4 and 6) rely on this mixed-accuracy calculation. Thus, it is possible to perform most of the force-calculation operations using SSE single-precision instructions, thereby further speeding up the force calculation. In this way, we can speed-up the calculation by another 67 % from the SSE2 parallel implementation of section 4, achieving 219% speedup (3.19 times faster) from the baseline implementation.
Baseline implementation

Target functions
The target functions that we want to calculate are:
(1)
(2)
where a i and φ i are the gravitational acceleration and the potential of particle i, the jerk j i is the time derivative of the acceleration, and r i , v i , and m i are the position, velocity and mass of particle i, r ij = r j − r i and v ij = v j − v i .
The calculation of a and φ requires 9 multiplications, 10 addition/subtraction operations, 1 division and 1 square root calculation. Clearly, calculation of an inverse square root is more expensive than addition/multiplication operations. Therefore, if we want to measure the speed of a force calculation loop in terms of the number of floating-point operations per second, we need to introduce some conversion factor for division and square root calculations. In this paper, we use 38 as the total number of floating point operations for the calculation of a and φ. This implies that we effectively assign around 10 operations to each division and to each square root calculation. This particular convention was introduced by Warren et al. (1997) , and we follow it here since it seems to be a reasonable representation of the actual computational costs of force calculations on typical scalar processors.
In addition, it requires 11 multiplications and 11 addition/subtraction operations to calculate j. Thus, the total number of floating-point operations per inner force loop for the Hermite scheme can be given as 60. This is the number that we will use here. Note that we have previously used numbers that were slightly less, by about 5%, using 57 instead of 60 (Makino et al., 2001 ).
In the case of a simple leapfrog method or traditional "Aarseth" scheme, we need to calculate only a i and φ i . For a Hermite scheme we need to determine j i as well.
Baseline implementation and its performance
The following code fragments contain what we regard as the baseline implementation of the force calculations, where we use the word 'force' loosely, to indicate the calculations for accelerations, jerks as well as the potential. In other words, we consider 'force calculations' to comprise all the basic low-level dynamical calculations governing the interactions between pairs of particles. List 1. Baseline implementation calculates force on ith particle. 
r 2 = e p s 2 ; Line 29 in list 1 is a branch to avoid self interaction. We might remove this branch when we can use softening, but in this case branch prediction of the microprocessor works ideally, hence the performance changes little if we remove this. Table 1 shows the performance of this code on an AMD Athlon 64 3000+ (2.0GHz) processor with several different compilers. The first column gives the compiler used, the second the compiler options, the third the clock cycles per pairwise force calculation, and the fourth column gives the speed in Gflops. All compilers generate SSE2 instructions instead of x87 instructions unless we explicitly set options to use x87. The performance is fairly good, but not ideal. In the following we will investigate how we can improve the performance.
List 2 is the assembly language output of the Hermite force loop using the GCC compiler:
List 2. Assembly output of list 1 using GCC 3.3.1 with option -S -O3 -ffast-math -funroll-loops, commented by the authors for clarity. 3 C-level optimization
Code modification
As we have seen in the previous section, the assembly language output shows a significant amount of unnecessary load/store instructions. In principle, if the compilers were clever enough they would be able to eliminate these unnecessary operations. In practice, we need to guide the compilers so that they do not generate unnecessary code.
We have achieved significant speedup using the following two guiding principles:
(1) Eliminate assignment to any array element in the force loop.
(2) Reuse variables as much as possible in order to minimize the number of registers used.
Apparently, present-day compilers are not clever enough to eliminate load/store operations, if elements of arrays are used as left-hand-side values. Therefore, we hand-unroll all loops of length three and use scalar variables instead of arrays for three-dimensional vectors.
In well-written programs, variables which contain the values for different physical quantities should have different names. However, the use of many different names prevents optimization, since it results in a number of variables too large to be fitted into the register set of SSE2 instructions (there are 16 "XMM" registers for SSE2 instructions). Therefore, we explicitly reuse variables such as x, y, z so that these can hold both the position difference components as well as the pairwise force components, at different times in the computation.
The resulted C-language code is the following:
List 3. An optimized force loop. We can see that now all load/store instructions of the intermediate results are eliminated. Figure 1 shows the use of registers and table 2 presents the performance data for this code. This version is 46% faster than the code in section 2. Very roughly, this speed-up is consistent with the reduction in the number of assembly language instructions (from 82 to 62), but is somewhat larger. This is partly because the instructions which take memory arguments have a larger latency than register-only instructions.
Prefetch insertion and alignment
Line 22 of list 3 shows a built-in function for GCC, which is compiled into a prefetch instruction. In this case, the prefetch instruction loads the data which will be needed two iterations after the current one. The second and third parameters are read/write flags for which zero indicates preparing for a read and the degree of temporal locality takes value from 1 to 3.
Line 6 of list 3 shows a pad to make the size of the predictor structure to be exactly 64-byte, which is the cache-line size of Athlon 64 processor. To make the predictor structure align at 64-byte boundary, we use memalign() or posix memalign() instead of malloc(). The fastest case depends on N. For the region N ≤ 1024, in which data fits into 64 kB of L1 data cache, the code with prefetch and 64-byte alignment is the best. For larger N, the code with prefetch and without alignment is the best.
Assembly-level optimization
In this section, we give two extensions specialized for x86/x86 64 architecture for the force loop. The first is using SSE2 vector (SIMD) mode instead of SSE2 scalar (SISD) mode. The second is replacing one division and one square root, by a special instruction for fast approximate inverse square root in SSE and Newton-Raphson iteration.
SSE2 vector mode
In the previous section we improved the performance of the C-language implementation of the force loop, essentially by hand-optimizing the C code so that the generated assembly code becomes optimal. However, in this way, we did not use the full capability of SSE2. While SSE2 instructions can process two double-precision numbers in parallel, the force loop discussed in the pre-vious section uses only one of these two words. Clearly, we did not yet use the "SIMD" nature of the instruction set.
Whether or not we can gain by using this SIMD nature depends on the particular code, and also on the particular processor. On Intel P4 architecture the SIMD mode can offer up to a factor of two speedup, while on AMD Athlon 64 or Intel Pentium M, the speed increase can be small (or even negative).
There are many ways to use this SIMD feature. Since there are similarities with the vector instructions of some old vector processors, in particular the Cyber 205, one could make use of an automatic vectorizing compiler (such as icc version 6.0 and later). However, just as was the case with the old vectorizing compilers, the vectorizing capability of modern compilers are still very limited, and it is hard to rewrite the force loop so that the compiler can make use of the SIMD capability.
In fact, part of the reason why vectorization is difficult is that the load/store capability of SSE2 instructions is rather limited: it can work efficiently only on a pair of two double-precision words in consecutive and 16-byte aligned addresses. This means that the basic loop structure cannot work, and one needs to copy the data of two particles into some special data structure, in order to let the compiler generate the appropriate SIMD instructions.
Here we have adopted a low-level approach, in which we make use of the special data type of pairs of double precision words defined in GCC. Basically, this data type, which we call v2df, corresponds to what is in one XMM register (a pair of double-precision words). We can perform the usual arithmetic operations and even function calls for this data type. Here is the code which makes use of this v2df data type:
List 5. Defining SSE/SSE2 data type.
t y p e d e f d o u b l e v 2 d f _ _ a t t r i b u t e _ _ ( ( m o d e ( V 2 D F ) ) ) ; 2 t y p e d e f f l o a t v 4 s f _ _ a t t r i b u t e _ _ ( ( m o d e ( V 4 S F ) ) ) ;
The v4sf data type packs four single-precision words for SSE which we will use later.
The basic idea here is to calculate the forces from one particle on two different particles in parallel. One could try to calculate the forces from, rather than on, two different particles on one particle in parallel, but that would result in a more complicated program since then we will need to add up the two partial forces in the end. Also, from the point of view of memory bandwidth, our approach is more efficient, since we need to load only one particle per iteration. Note that this is the same approach as what is called i-parallelism in the various versions of GRAPE hardware, where parallel pipelines calculate the forces on different particles from the same set of particles.
In this code, we use macros for gathering load and scattering store using instructions for loading/storing the higher/lower half of an XMM register:
List 6. Gathering load and scattering store for SSE2. We should be careful about the fact that these built-in functions are undocumented feature of GCC. The GCC document describes built-in functions for most SSE/SSE3 instructions but not for any of SSE2 instruction. We can call most SSE2 instruction through the prefix builtin ia32 . Note that, for whatever reason, instructions such as "movxxx" from/to memory are changed to "loadxxx"/"storexxx". But this does not mean that GCC always generates correct assembly output. For example, GCC generates wrong code for the first macro in list 6 when reg is not a register variable. This can be considered either a bug or a feature, and it shows the risk of using these undocumented aspects of GCC.
We also use macros for numerical values since we cannot use numerical values like "3.0" for vector operations: Note that we need to copy the data of a single particle into both high and low words of an XMM register. The new SSE3 extension supports this "broadcasting", while the original SSE2 set did not. Therefore, we use the following macro:
List 8. Broadcast loading for double precision. We show the vectorized force loop using these data types and macros in list 9.
List 9. Vectorized force loop using SSE2 vector mode. The predictor structure has changed to store the same mass value in two places (line 4), in order to save an instruction (line 54).
Fast approximate inverse square root
The most expensive part of the force calculation, for recent microprocessors, is the calculation of an inverse square root which requires one division and one square root caluclations. Several attempts to speed this up using table lookup, polynomial approximation and Newton-Raphson iteration have been reported (e.g. Karp 1992 ). The main difficulty in these approaches is how to get a good approximation for the starting value of a Newton-Raphson iteration quickly.
Here, we use RSQRTSS/PS instruction in the SSE instruction set, which provide approximate values for the inverse square root for scalar/vector singleprecision floating-point numbers, to an accuracy of about 12 bits. With one Newton-Raphson iteration, we can obtain 24-bit accuracy, which is sufficient for most "high-accuracy" calculations. If higher accuracy is really necessary, we could apply a second iteration. The Newton-Raphson iteration formula is expressed as:
Here, x 0 is an initial guess for 1/ √ a.
We show the implementation for a scalar version using inline assembly code with GCC extensions:
List 10. Calling rsqrtss through inline assembly code. and for a vector version using built-in functions:
List 11. Calling rsqrtps using built-in functions of GCC. Note that we skip here the multiplication by −1/2 in equation (4). This can be done after the total force is obtained.
To use RSQRTSS/PS instructions, we first convert r 2 into single precision, then apply these instructions, and finally convert the result back to double precision.
Note that the actual value returned by this RSQRTSS/PS instruction is implementation dependent. In particular, the AMD Athlon 64 processors and the Intel Pentium 4 processors return different values. Figure 3 shows the errors of the return values as a function of the input values. The AMD implementation has smaller average errors, but at the same time they show a relatively large systematic bias. Even after one Newton-Raphson iteration in double precision, results from both implementations show relatively large biases. Table 3 presents the root mean square error, max error, and bias (mean error) of the approximate value 1/ √ x, on a Intel Pentium 4 and an AMD Athlon 64 before/after one Newton-Raphson iteration. Here the error is measured as 1/2(x[rsqrt(x)] 2 − 1), and the weight is 1/frac(x), where frac(x) is the normalized fraction of the floating-point number x. We can correct these biases, at least statistically, by multiplying the resulted force by constants which depend on the processor type used. Table 4 shows the performance of four different types of force loops, using SSE2 scalar/vector mode and with/without fast inverse square root. AMD, after N-R 1.57 × 10 −8 1.00 × 10 −7 −9.51 × 10 −9
Performance
Intel, before N-R 1.16 × 10 −4 3.26 × 10 −4 −8.37 × 10 −8
Intel, after N-R 3.05 × 10 −8 1.60 × 10 −7 −2.01 × 10 −8 IEEE 754 Single 3.58 × 10 −8 8.94 × 10 −8 1.52 × 10 −11 As we have summarized in the introduction, SSE2 is the SIMD instruction set for pairs of double-precision words. There are also SSE instructions that work on quadruples of single-precision words. Thus, using the SSE instruction set, we can in principle double the performance. If we perform the initial subtraction between positions and final accumulation of acceleration and potential in double-precision, we can use single-precision SSE functions for all other calculations, including subtraction between velocities and accumulation of jerk, and still maintain a pretty high accuracy. The main complexity here is the question of how to make use of four elements of SSE data type, in parallel. The simplest approach is to calculate the forces on four particles, but in that case we need too many variables, which do not all fit into the registers. We need two XMM registers for each element of force in this way. Instead, we have tried to achieve the maximum speed by calculating the forces from two particles on two particles (making a total of four force calculations) in parallel in SSE.
In the following, we present a detailed description of the implementation of this mixed-precision force loop using SSE/SSE2. We use the term i particles for particles which feel the gravitational force, and j particles for particles which exert the force.
The data structure for j particles
The data structure for the j particles should contain two particles. The code in list 12 achieves this goal by using the data types v2df and v4sf. Note that for velocities we use the single-precision v4sf data type, since we do not need double-precision accuracy for the velocity. We store the data of the two particles (with indices j and j+1) as (j, j+1) for position, and s (j, j, j+1, j+1) for velocity and mass.
List 12. The j particle structure. The variable pad is used to make the size of the structure an exact multiple of 64 bytes.
The data structure for i particles
We use the following local variables to store two i particles (with indices i 0 and i 1 ).
List 13. Two i particles packed into local variables.
Note that xi0 keeps the position of particle i 0 in a duplicated way (two 64-bit words of v2df data type store the same data). Similarly, xi1 keeps the data for i 1 . In this way, by subtracting x of the j particle variable from x0 or x1, we can calculate the displacement of two j particles from one i particle. The results are then converted to single-precision format using a cvtpd2ps instruction. (See list 14). After unpcklps (an instruction to pack two words into one words, not to unpack) is issued, the contents of
For the velocity, we store the data in the order of (i 0 , i 1 , i 0 , i 1 ), so that a single subtraction operation provides the four displacement vectors of two j particles from two i particles. 
Inverse square root and Newton-Raphson iteration
We use the same NR-iteration as in the previous section. Since we do not need the data conversion between single and double precision formats, the code here actually becomes simpler. List 15 gives this part of the code.
List 15. Calculation of inverse square root in v4sf.
1 v 4 s f r2 , r i n v ; 2 3 r i n v = _ _ b u i l t i n _ i a 3 2 _ r s q r t p s ( r 2 ) ; 4 r 2 * = r i n v ; / * s t a r t Newton -R a p h s o n * / 5 r 2 * = r i n v ; 6 V 4 S F _ O P _ S C A L A R ( r2 , -= , 3 . 0 f ) ; 7 r i n v * = r 2 ; / * now , r i n v = -2.0/r * /
Accumulating acceleration and potential
Before accumulation of acceleration and potential, we now have to convert single-precision data back to double precision. Before doing so, we need to split one piece of 128-bit data with 4 single-precision words into two (effectively) 64-bit words with two single-precision words. This is done by the movhlps instruction. Then we convert the result to double precision using a cvtps2pd instruction. The code appears in list 16. For the jerk, we directly accumulate them in quadruples of single-precision words, since we do not need double-precision accuracy for jerk. After total force is obtained, we add up higher two words and lower two words of the accumulated data.
The whole code
List 17 shows the entire code. We provide a simple library which one can call to use this function in a way similar to the way the GRAPE hardware is used. We show an example N-body code using this library in Appendix 8. Note that actual code in list 17 is slightly different from code in list (12-16). We use arrays instead of vector types in the j particle structure. Codes in list 14 and 16 are abstracted into macros for readability and saving of registers.
List 17. Total code of mixed precision force calculation. Table 5 gives the performance of the code listed above. The second column gives the performance of the code after hand-tuning the assembly output.
Performance
The best performance we have measured is 4.05 Gflops, or 3.19 times faster than the original C code described in section 2.
Discussion and Summary
In this paper we describe in detail various ways of improving the performance of the force-calculation loop for gravitational interactions between particles. Since modern microprocessors have many instructions which cannot be easily exploited by existing compilers, we can achieve quite a significant performance improvement if we write a few small library in assembly language and/or using instruction-set specific extensions for the compiler offered to us.
Our implementation will give a significant speedup for almost any N-body integration program. In addition, we believe that similar optimization is possible in many other compute-intensive applications, within astrophysics as well as in other areas of physics and science in general.
The source code and documentation are available at: http://grape.astron.s.u-tokyo.ac.jp/~nitadori/phantom/
Acknowledgment
We thank Jumpei Niwa for making his Opteron machines available for the development of the optimized force loop and also for his helpful advice. We are also grateful to Eiichiro Kokubo and Toshiyuki Fukushige for their encouragement during the development of the code. We would like to thank all of those who have been involved in the GRAPE project, which has given us a lot of hints for speeding-up of the gravity calculation. P.H. thanks Prof. Ninomiya for his kind hospitality at the Yukawa Institute at Kyoto University, through the Grants-in-Aid for Scientific Research on Priority Areas, number 763, "Dynamics of Strings and Fields", from the Ministry of Education, Culture, Sports, Science and Technology, Japan. This research is partially supported by the Special Coordination Fund for Promoting Science and Technology (GRAPE-DR project), also from the Ministry of Education, Culture, Sports, Science and Technology, Japan.
8 Sample N-body code using Hermite hierarchical time step scheme
We show sample N-body code using Hermite hierarchical time step scheme and the library in subsection 5.5. 
