We present a high-performance N -body code for self-gravitating collisional systems accelerated with the aid of a new SIMD instruction set extension of the x86 architecture: Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). With one processor core of Intel Core i7-2600 processor (8MB cache and 3.40 GHz) based on Sandy Bridge micro-architecture, we implemented a fourth-order Hermite scheme with individual timestep scheme (Makino & Aarseth, 1992) , and achieved the performance of ∼ 20 giga floating point number operations per second (GFLOPS) for double-precision accuracy, which is two times and five times higher than that of the previously developed code implemented with the SSE instructions (Nitadori et al., 2006b), and that of a code implemented without any explicit use of SIMD instructions with the same processor core, respectively. We have parallelized the code by using so-called NINJA scheme (Nitadori et al., 2006a), and achieved ∼ 90 GFLOPS for a system containing more than N = 8192 particles with 8 MPI processes on four cores. We expect to achieve about 10 tera FLOPS (TFLOPS) for a self-gravitating collisional system with N ∼ 10 5 on massively parallel systems with at most 800 cores with Sandy Bridge micro-architecture. This performance will be comparable to that of Graphic Processing Unit (GPU) cluster systems, such as the one with about 200 Tesla C1070 GPUs (Spurzem et al., 2010) . This paper offers an alternative to collisional N -body simulations with GRAPEs and GPUs.
Introduction
N -body simulations are a powerful tool to follow dynamical evolution of self-gravitating many-body systems, such as planetary systems, star clusters, galaxies, galaxy clusters, and the large scale structure in the universe. In an Nbody simulation, we solve the following equation of motion of particles:
where G is the gravitational constant, ε is the softening length, N is the number of particles, m i and r i are respectively the mass and position of ith particle, and r ij = r j −r i . For self-gravitating collisional systems such as planetary systems and star clusters, one needs to keep sufficiently good accuracy and hence resort to a brute-force scheme, a direct scheme to compute gravitational forces.
In fact, the number of particles adopted in recent numerical simulations of these systems is not always large enough. For example, the number of stars in a single globular cluster is typically ∼ 10 6 , whereas the number of particles in recent numerical simulation of globular clusters is ∼ 10 5 at most. This is the reason why efficient ways to compute gravitational force for a given set of particles are intensively studied in many aspects.
The use of external hardwares is one of the most popular ways to compute gravitational force very quickly. GRAPEs (Ito et al., 1991; Makino et al., 1997 Makino et al., , 2003 , which are special purposed accelerators for N -body simulations, have greatly contributed to N -body simulations of collisional systems. Recently, Nitadori (2009) and Gaburov et al. (2009) explored the capability of commodity Graphics Processing Units (GPUs) as hardware accelerators for N -body simulation and achieved a performance close to that of GRAPE-DR (Makino, 2008) , the latest version of the GRAPE family.
Another approach to accelerate the computation of gravitational forces is to utilize SIMD (Single Instruction, Multiple Data) instructions, which can perform the same operation on multiple data simultaneously, implemented on recent central processing units (CPUs) instead of making use of external hardware accelerators. Since Pentium III released by Intel Corporation in 1999, CPUs with x86 architecture have supported a set of SIMD instructions called Streaming SIMD Extensions (SSE). Nitadori et al. (2006b) (NMH06) applied the SSE instructions to the force calculation in N -body simulations, and calculated four interactions among particles in parallel. NMH06 achieved about three times higher performance than the case without the use of the SSE instructions. Furthermore, Nitadori et al. (2006a) (NMA06) demonstrated that the performance of CPUs with the aid of the SSE instructions is comparable to GRAPEs and GPUs in massively parallel simulations.
Recently, a new processor family with Sandy Bridge micro-architecture has been released by Intel Corporation. The processor supports a new set of instructions known as Advanced Vector eXtensions (AVX), an enhanced version of the SSE instructions. In the AVX instruction set, the width of the SIMD registers are extended from 128-bit to 256-bit. Hence, an AVX instruction set is able to process twice amount of data than the SSE instruction set. This suggests that a processor with the AVX instruction set should be able to compute accelerations twice as fast compared to its SSE-only counterpart. The AVX instructions are also supported by the next-generation CPU "Bulldozer" released by AMD Corporation.
In this paper, we present a new N -body code implemented using the AVX instructions. In our N -body code, we adopt a fourth-order Hermite scheme with individual timestep scheme (Makino & Aarseth, 1992) for its time integration, which is widely used for collisional N -body simulations. We have achieved 20 giga floating point number operations per second (GFLOPS) per processor core for N -body simulations. This performance is two times higher than the case in which we use the SSE instructions. This paper is organized as follows. In section 2, we outline the algorithm of the fourth-order Hermite scheme. In section 3, we describe our implementation for the N -body code using the AVX instruction set. In section 4 and 5, we show an accuracy and performance of our N -body code, respectively. The results are summarized and discussed in section 6.
Algorithm
The fourth-order Hermite scheme is a kind of a predictorcorrector method, developed by Makino & Aarseth (1992) . We outline its algorithm below. The predictor of ith particle is given by Taylor series as
where ∆t i , r i , v i , a i , and j i are the timestep, position, velocity, acceleration, and its first-order time derivative (jerk) of ith particle, respectively, and the superscripts (0) and (p) indicate quantities at the current time and predicted quantities, respectively. Using the predicted position and velocity, r i , the acceleration and jerk of ith particle are evaluated as
where r
i , and r
ij |, and the superscript (1) indicates quantities at the next time. Additionally, the potential of ith particle, φ
is computed for checking the validity of the energy conservation. The corrector is based on the third-order Hermite interpolation constructed using the old acceleration and jerk, and the new ones (a
i , and j
(1) i , respectively). From the third-order Hermite interpolation polynomial, we can obtain the second-order and third-order time derivatives of acceleration, which are so-called snap (s Before we describe our implementation, let us briefly summarize the SSE and AVX instructions and the difference between them.
The SSE instruction set is a SIMD instruction set introduced for the first time in Pentium III processors to improve the performance of media streaming, image processing and three-dimensional graphics by executing 4 single-precision (SP) or 2 double-precision (DP) floating-point operations in parallel. Operations supported by the SSE instruction set include addition, subtraction, multiplication, division, square root, inverse square root, etc. In such operations, dedicated registers with 128-bit length called "XMM registers" are used to store the 4 SP and 2 DP floating-point numbers. Although operations between data in the XMM registers and in main memory are supported, those with all data stored in the XMM registers are faster. The available number of the XMM registers in a single processor core is 8 for x86 processors and 16 for x86 64 processors. NMH06 utilized the SSE instruction set to accelerate the force calculations in their collisional N -body code.
The AVX instruction set is an enhanced version of the SSE instruction set, and there are two major differences between them. One is the number of data on which an instruction is operated in parallel. The length of the dedicated registers for the AVX instructions, "YMM registers", is 256 bits, two times longer than that of the XMM registers. Thus, 8 SP or 4 DP floating-point operations can be carried out simultaneously by the AVX instructions, while the SSE instructions can execute 4 SP floating-point operations or 2 DP floating-point operations in parallel. The lower 128 bits of the YMM registers are regarded as the XMM registers used by the SSE instructions for backward compatibility. The number of the YMM registers in one processor core is 16, equal to that of the XMM registers.
The other is that the number of operands of most instructions has been increased from two in the SSE instruction to three operands in the AVX instructions. Two of the three operands are source operands, and the other one is a destination operand, where the result of the operation is stored. Owing to this, while one of source operands is used as a destination operand in the SSE instructions, and overwritten, we can preserve both source operands at each AVX instruction, and use a limited number of the YMM registers very efficiently.
Methods to implement the code explicitly using the SSE and AVX instructions should be also addressed. In principle, if the compilers were clever enough to detect any concurrent loop, it could generate the code that effectively utilize these instructions. In reality, however, the presentday compilers cannot fully resolve the dependency among variables in the loops very well. Therefore, we have to implement the SSE and AVX instructions explicitly using assembly-languages or compiler-dependent intrinsic functions. Intrinsic functions are supported by several compilers and easy to program with. On the other hand, inline assembly-languages, we can manually control the assignment of the XMM and YMM registers to computational data, and minimize the access to the memory or the cache memory by optimizing the use of individual registers. As already noted above, calculations using SSE and AVX instructions are very efficient if all the data are stored in the dedicated registers. Therefore, the optimization of the use of the XMM and YMM registers is of crucial importance. In this work, we adopt an implementation of the SSE and AVX instructions using inline-assembly embedded in Clanguage. In the followings, we only present implementations and results for x86 64 processors with 16 XMM and YMM registers.
Arithmetic precision
In our implementation, we perform the force calculation in the "mixed" precision, in which, as is done in NMH06, only the relative position vector between i-and j-particles (r (p) ij in equation (4), (5), and (6)), and the accumulation of individual acceleration and gravitational potential (summation in equation (4) and (6)) are computed in DP, and the remaining portions of the force calculation are done in SP.
In computing accelerations, jerks, and potentials in equations (4), (5) and (6), respectively, we calculate the inverse square root of (r (p) ij ) 2 + ǫ 2 using the VRSQRTPS instruction, which computes an approximated value of the inverse square root very quickly to an accuracy of 12 bits. To obtain an accuracy equivalent to SP, a Newton-Raphson iteration is applied, such that
where x 0 is an initial guess for 1/ √ a, and x 1 is improved value of 1/ √ a. NMH06 reported that values returned by the VRSQRTPS instruction contain statistical bias dependent on implementation of this instruction. We statistically correct the bias in the same way as NMH06.
3.3. SIMD parallelization of the force calculation using the AVX instruction set Since we perform the force calculation mostly in SP, we calculate eight interactions between i-and j-particles simultaneously. In our implementation, interactions between four j-particles and two i-particles are calculated in parallel. Of course, there exist other possible combinations of interactions such as four i-particles and two j-particles. We have compared the performance of various combinations of interactions and found that this choice exhibits the best performance among others by 5-10%.
The calculation of forces on two i-particles (a force loop) in our implementation consists of three parts as follows.
1. Prepare the i-and j-particle data in a suitable form for SIMD calculations. 2. Calculate the forces on the two i-particles exerted by all j-particles on the YMM registers. 3. Write back the calculated forces of the two i-particles on the YMM registers into the memory. The steps 1 and 3 are written in C language, while the step 2 is written in C language with inline assembly language.
First, we describe the implementation of the steps 1 and 3. List 1 shows the definitions of structures for i-and jparticles as well as a structure for the results (accelerations, jerks, and potentials) used in our implementation. Before computing the accelerations, jerks, and potentials of i-particles, the positions, velocities, and indices of iparticles are stored in an array of the structure, Ipart, and the positions, velocities, masses, and indices of j-particles are stored in an array of the structure Jpart. The calculated accelerations, jerks, and potentials are stored into an array of the structure NewAccJerk. Note that the size of each array in the structures of i-and j-particles is multiple of 256 bits, the length of the YMM registers, so that each array can be readily loaded onto the YMM registers. The structures of i-particles (Ipart) and j-particles (Jpart) contain the data of two i-particles and four j-particles, respectively. In figure 1 , we show the assignments of particle indices in the arrays in these structures, where i0 and i1, and j0, j1, j2, and j3 are the indices of i-and j-particles, respectively. Such assignments enable us to calculate the accelerations, jerks, and potentials of two i-particles from four j-particles in a efficient manner with the AVX instructions. In the step 1, we arrange the particle data and substitute them into the structures Ipart and Jpart before calculating the accelerations and jerks in the step 2.
Next, we describe the implementation of the step 2, in which the accelerations, jerks, and potentials of i-particles are computed using the AVX instructions. In issuing the AVX instructions, we use the inline assembly language and manually control the assignment of the YMM registers to the computational data to achieve a good performance. We also try to obtain a high issue rate of the AVX instructions by optimizing the order of operations so that operands in adjacent instruction calls do not have dependencies as much as possible.
For the readability of the source code, we introduce a set of preprocessor macros of the AVX instructions. The macros used in our implementation are shown in List 2. Most of them have three operands, in which src1 and src2 are the source operands and the results of the instructions are stored in the last operand (destination operand, dst). In these macros, operands named src, src1, src2 and dst are in the XMM/YMM registers, and those named mem indicate the main memory or the cache memory. Brief descriptions of these macros are summarized in Table 1 . More detailed documents on the AVX instructions can be found in Intel's website. Table 2 Aliases of the YMM registers and the assignment of variables to the registers.
Alias ID Variables
41 # d e f i n e V C M P N E Q P S ( src1 , src2 , d s t ) \ 42 a s m ( " v c m p p s %0 , % " s r c 1 " , % " s r c 2 " , % " d s t " \ 43 " : : " g " ( 4 ) ) ; 44 # d e f i n e P R E F E T C H ( m e m ) a s m ( " p r e f e t c h t 0 % 0 " : : " m " ( m e m ) ) Table 2 shows the assignment of the YMM registers to the physical quantities in the calculations of accelerations, jerks, and potentials, where the subscripts of x, y and z indicates x, y, and z components of vectors, respectively. Hereafter, for simplicity, we omit the superscripts (p) and (1) of r and v, and of a, j and φ, respectively. With this assignment of the YMM registers, we can carry out the computation using the data in only the YMM registers except for one load instruction for each j-particle data. Figure 2 shows a schematic illustration of a force loop to calculate accelerations, jerks, and potentials on two iparticles from four j-particles by using the AVX instructions. The variables with a subscript i and j in figure 2 contains the 8 SP data of i-and j-particles in the same order of indices shown in figure 1, and those with a subscript ij in figure 2 such as r ij and v ij contains the data with indices of i-and j-particles in the order shown in figure 3. In figure 2 , the blocks DX, DY and DZ indicate the calculation of relative coordinates between two i-particles and four j-particles for x, y and z components, respectively, in the mixed precision, and the block SUM indicates the accumulation of physical quantities (accelerations and potentials) of two i-particles interacting with four j-particles in the DP. The details of the blocks DX, DY, DZ and SUM are described later.
First we describe the procedure depicted in figure 2. 1. Calculate r ij,x , r ij,y , and r ij,z in DX, DY, and DZ (see figure 4) , and store them into YMM03, YMM04, and YMM05, respectively. 2. Square r ij,x in YMM03, r ij,y in YMM04, and r ij,z in YMM05, and then sum up them and the squared softening length to compute the softened squared distance, r 2 ij ≡ r 2 ij + ǫ 2 , and store it into YMM01. 3. Calculate an inverse square root ofr 2 ij , operate Newton-Raphson iteration for that value (the block N.R. in figure 2) , and store the result 1/r ij into YMM01. If indices of i-and j-particles are the same, zero is stored in the corresponding element of YMM01 (the block MASK in figure 2 ). 4. Multiply 1/r ij in YMM01 by m j in the main memory obtaining the gravitational potential exerted by j-particles φ ij ≡ m j /r ij , and store the result into YMM02. 5. Calculate the summation of m j /r ij in YMM02 over the subscript j (the block SUM in figure 2) , and accumulate the results into YMM09 (the block PHI in figure 2 ). (see also figure 5) 6. Load v j,x , v j,y , and v j,z to YMM06, YMM07, and YMM08, respectively. 7. Subtract v i,x , v i,y , and v i,z in the main memory from v j,x in YMM06, v j,y in YMM07, and v j,z in YMM08, and store the results (v ij,x , v ij,y , and v ij,z ) into YMM06, YMM07, and YMM08, respectively. 8. Calculate inner product of relative position and relative velocity vectors, using r ij,x in YMM03, r ij,y in YMM04, r ij,z in YMM05, v ij,x in YMM06, v ij,y in YMM07, and v ij,z in YMM08, and store the result (r ij · v ij ) into YMM00. 9. Square 1/r ij in YMM01, and store the result 1/r 2 ij into YMM01. 10. Multiply m j /r ij in YMM02 and r ij · v ij in YMM00 by 1/r 2 ij in YMM01, and store the results (m j /r 3 ij and r ij · v ij /r 2 ij ) into YMM02 and YMM00, respectively. 11. Multiply r ij,x in YMM03, r ij,y in YMM04, and r ij,z in YMM05 by m j /r 3 ij in YMM02 obtaining the acceleration vector a ij exerted by j-particles, and store the results into YMM03, YMM04, and YMM05, respectively. 12. Calculate the summation of m j r ij,x /r 3 ij in YMM03, m j r ij,y /r 3 ij in YMM04, and m j r ij,z /r 3 ij in YMM05 over the subscript j (the block SUM in figure 2) , and accumulate the results into in YMM10, YMM11, and YMM12, respectively. These correspond to the blocks AX, AY, and AZ in figure 2, respectively. 13. Multiply v ij,x in YMM06, v ij,y in YMM07, and v ij,z in YMM08 by m j /r 3 ij in YMM02, obtaining the first term of the jerks in equation (5), and store the results into YMM06, YMM07, and YMM08, respectively. 14. Accumulate the first term of the jerks m j v ij,x /r The calculation of the relative coordinates between two iparticles and four j-particles done in the blocks DX, DY, and DZ are depicted in figure 4 . As an example, the calculation of r ij,x are processed as follows.
1. Load four r j,x (j = j0 · · · j3) to YMM00 from the main memory in DP. 2. Subtract the coordinate of the first i-particles r i0,x in the main memory from r j,x in YMM00, and store the result, r i0j,x , in YMM01. 3. Subtract the coordinate of the second i-particle r i1,x in the main memory from r j,x in YMM00, and store the result, r i1j,x , in YMM02. 4. Convert type of r i0j,x in YMM01 from DP to SP, and store the result in the lower bits of YMM01. 5. Convert type of r i1j,x in YMM02 from DP to SP, and store the result in the lower bits of YMM02. 6. Copy r i0j,x in the lower bits of YMM01 to the lower bits of YMM03, and r i1j,x in the lower bits of YMM02 to the upper bits of YMM03. Figure 5 illustrates the procedure to accumulate the accelerations and potentials on two i-particles exerted by four j-particles in DP. In the step 2, eight accelerations a ij and potentials φ ij between two i-particles and four j-particles are calculated on the YMM registers in SP. Note that jerks are accumulated in SP, instead of DP. In fact, we cannot compute the summations of accelerations and potentials over all four j-particles in DP using the AVX instructions. Instead, two partial summations over two j-particles can be computed for each i-particle. In accumulating gravitational potentials, for example, the procedures are given as follows.
1. Convert φ i0j (j = j0 · · · j3) in the lower 128 bits of YMM02 from SP to DP, and store the result in YMM06. 2. Copy φ i1j (j = j0 · · · j3) in the upper 128 bits of YMM02 to the lower 128 bits of YMM07. 3. Convert φ i1j in the lower 128 bits of YMM07 from SP to DP, and store the result in YMM07. 4. Operate a horizontal sum reduction of YMM06 and YMM07, obtaining φ i0j0 + φ i0j1 , φ i1j0 + φ i1j1 , φ i0j2 + φ i0j3 , and φ i1j2 + φ i1j3 and store them in YMM06. 5. Accumulate the partially reduced potentials of iparticles with indices of i0 and i1 in YMM06 to φ i in YMM09.
The function calc_accjerkpot which computes the forces on two i-particles from all the j-particles is shown in List 3. In the force loop at line 15 in List 3, the accelerations, jerks, and potentials on i-particles are calculated as we described. Before entering the force loop, i-particles are set, the pointer to j-particles is set to point the first element of the array of Jpart, and all the registers are set to zero. Once the force loop has been finished, sum reductions are operated on the partially accumulated accelerations, jerks, and potentials, and the results are stored in components of the array of structure NewAccJerk. The resulting accelerations, jerks, and potentials are multiplied by some coefficients. This is because we omit the multiplication of −1/2 in equation (12) in the force loop. (In this source code, we do not correct the bias in the VRSQRTPS instruction for simplicity.)
As seen in List 3, some operands are specified by XMM, which are aliases of the XMM registers. In some instructions, the lower bits of the YMM register is only used for their operands. When such instructions are operated, the XMM registers have to be specified. s t r u c t I p a r t * i p t r = i p a r t ; 11 s t r u c t J p a r t * j p t r = j p a r t ; 12 13 / / i n i t i a l i z e 14 / / j_i , z + = j_ij , z ( f i r s t t e r m ) 
Accuracy
In this section, we present the accuracy of our implementation in terms of errors in accelerations, jerks, and potentials of individual particles for a given snapshot as well as errors in the total energy in time integration.
We measure the errors of accelerations, jerks, and potentials of individual particles in the Plummer's model with N = 1K, 4K, and 16K (1K = 1024). Our implementation adopts the standard N -body units, G = M = r v = 1, where G is the gravitational constant, M and r v are the mass and virial radius of the system, respectively. The softening length is set to 4/N . The relative errors of accelerations, jerks, and gravitational potentials are defined as |a − a DP |/|a|, |j − j DP |/|j| and |φ − φ DP |/|φ|, respectively, where the reference values, a DP , j DP and φ DP are obtained by computing fully in DP for all operations, such as computation of distances, square roots, divisions, etc. Figure 6 shows the cumulative distribution of errors of accelerations, jerks, and gravitational potentials for individual particles in the Plummer's model with N = 1K, 4K, and 16K. Most accelerations and potentials have relative errors around 10 −8 . This is quite natural because, in our implementation, accelerations and potentials for individual pairs of i-and j-particles are calculated in SP. Accuracies of jerks are lower than those of accelerations and potentials because relative velocity vectors between i-and j-particles are calculated in SPs, and their accumulations are also operated in SP.
Note that the relative errors of jerks increase with the number of particles, which can be clearly seen in the upper panel of figure 7 . This is caused by the errors in accumulating jerks of individual pairs of i-and j-particles in SP. If we accumulate jerks in DP in the same manner as accelerations and potentials, the distributions of errors become almost independent of the number of particle as can be seen in the lower panel of figure 7 . We speculate that the origin of the errors are round-off errors in accumulating jerks in SP. In order to address the effect of errors in jerks, we perform the two runs with jerks accumulated in SP, and in DP with various number of particles up to N = 256K. We do not find any significant differences in the energy conservation throughout the duration of our calculations (t = 0.125 N -body units). We thus employ the accumulations of jerks in SP, since the accumulation of jerks in DP degrades the total performance by 20 -25 %. However, the accumulations of jerks in SP could lower the overall accuracy when N is large (say N > 10 6 ). n step , are averaged over all particles during 1 crossing time, which corresponds to 2 √ 2 in the N -body unit. In order to obtain the errors originated only by the force calculation in the mixed precisions, we calculate potential energies in DP. The errors in the total energy conservation decrease with the fourth power of n step , since we adopt the fourth-order Hermite scheme for the orbit integration of particles. Because of the mixed precision for the force calculation, the accuracy of the energy conservation stops improving at the relative errors in the total energy of 10 −9 even if the number of timesteps increases. On the other hand, the accuracy of the energy conservation in DP continues to improve for larger n step .
Performance
In this section, we present the performance of our implementation of the collisional N -body simulation using the AVX instructions. For the measurement of the performance, we use Intel Core i7-2600 processors with 8MB cache memory and a frequency of 3.40 GHz, which contain four processor cores. We disable Intel Turbo Boost Technology (TB), unless otherwise stated. A compiler which we adopt is GCC 4.4.5. We use compilation options -O3, -ffast-math, and -funroll-loops. Our code is parallelized with the Message Passing Interface (MPI) in the same manner as the NINJA scheme described in NMA06, and we measure the performance with various numbers of MPI processes on a single processors. The code units and softening length are the same as those in the previous section. We adopt the Plummer's model for the initial conditions, and evolve the system from time t = 0 to time t = 1 in the N -body units. Figure 9 shows the performance of our implementation with N = 1K, 2K, 4K, 8K, and 16K, in which the performance is calculated by estimating the total calculation cost of an acceleration, jerk and gravitational potential for a pair of i-and j-particles to be 60 floating-point operations. Although Makino & Fukushige (2001) estimated it to be 57 floating-point operations, we adopt the estimate by NMH06 so that one can directly compare the performance of our implementation with that in NMH06. To evaluate the performance gain of our implementation with the AVX instructions over that implemented in C-language without any explicit use of the SSE and AVX instructions, we measure the performance of the C-language code shown in the List 3 of NMH06 for N = 1K, 2K, and 4K on the same processor. In addition to that, to compare the performance of the AVX instructions and the SSE instructions, we also develop the implementation only with the SSE instructions, and measure its performance for N = 1K, 2K, 4K and 8K. The performance of our implementation with the AVX instructions is ∼ 20 GFLOPS and higher than that implemented with only the SSE instructions (∼ 10 GFLOPS) by about a factor of two as expected, and 5 times higher than that implemented without any SIMD instructions (∼ 4 GFLOPS). One can see that the performance with 4 MPI processes on four processor cores reaches ∼ 50 GFLOPS for N = 1K and ∼ 70 GFLOPS for N = 16K, which are 2.5 and 3.5 times higher than those with 1 MPI process on one processor core, respectively. When the Hyper-Threading Technology (HT) is enabled, the performance with 8 MPI processes on four processor cores reaches 90 GFLOPS for N = 16K, which is higher than that with 4 MPI processes on four processor cores (70 GFLOPS). Note that the performance with 8 MPI processes on four processor cores (50 GFLOPS for N = 16K) becomes lower than that with 4 MPI processes on four processor cores when HT is disabled. Table 3 shows a comparison between the performance when TB is enabled and disabled. TB enables CPU fre- The performances of GRAPE-DR and GPU (GTX 470) are plotted in dashed lines for comparison. In order to obtain the performance of the GPU, Yebisu N -body code (Nitadori, 2009; Spurzem et al., 2010 ) is used. We use Nitadori's code for the calculation of "w/o SIMD". His source code for the force calculation is described in List 3 of NMH06.
quency to be increased from 3.4 GHz to 3.8 GHz (one active core) and to 3.5 GHz (four active cores); thus TB contributes to performance gains of 10 % (one active core) and 3 % (four active cores). As expected, one can see that the gains are, respectively, about 10 % and 3 % in the cases of 1 MPI processes on one processor core and 8 MPI processes on four processor cores.
Note that the performance of our implementation is almost independent of the number of particles. This feature is advantageous over the dependences on the number of particles by external hardware accelerators such as GPUs and GRAPEs. The performances obtained by a GRAPE-DR board with 4 GRAPE-DR chips (400 MHz) mounted and by a GTX 470 GPU with Yebisu code (Nitadori, 2009; Spurzem et al., 2010) are presented in figure 9 in dashed lines, which strongly depends on the number of particles and relatively lower for the smaller numbers of particles. This is due to the overhead in transferring the data between extended hardware and host computers. In our implementation with the AVX instructions, we do not have such an overhead arising from the transfer of particle data, and thus the performance is almost independent of the number of particles. One can see that the performance of our implementation with a single CPU is slightly lower than that of the GRAPE-DR and GPU systems for N 10 4 (by half or one third), and even higher for N 1 × 10 3 . This feature is quite desirable for simulations on massively parallel computers, because the performance independent of the number of particles allows us to achieve fairly good "strong scaling". Furthermore, this feature is also advantageous under wide dynamic range, such as the case of the presence of binaries in star clusters.
Summary and discussion
We have developed an N -body code for a collisional system with the new SIMD instruction set available on the Sandy Bridge architecture released by Intel Corporation, the AVX instruction set. The performance is ∼ 20 GFLOPS with one processor core, which is two times higher than that with the SSE instructions as expected, and five times higher than that in C-language without any explicit use of SIMD instructions. The performance on a single CPU with four processor cores reaches 90 GFLOPS.
Here, let us estimate the performance of our code parallelized with the NINJA scheme on massively parallel systems. As a reference, NMA06 achieved about 4 GFLOPS with one processor core in a dual-core Opteron processors, and 2.03 tera FLOPS (TFLOPS) with 400 dual-core Opteron processors for an N = 64K system using the NINJA scheme. Since we have achieved 20 GFLOPS with one core on a Intel Core i7-2600 processor, the performance of 10 TFLOPS can be achieved with a massively parallel system with the same number of processor cores of Core-i7 2600 processors.
There are two advantages of our approach to accelerate N -body simulations using SIMD instructions implemented on CPUs over the use of external hardwares such as GPUs and GRAPE families. One is the portability of our code. We can carry out N -body simulations with good performance on any computer systems without GPUs and GRAPE boards. Even on the systems with instruction set architectures other than the x86 architecture, the fact that SIMD instruction sets similar to the SSE and AVX instruction set are available (e.g. Vector Multimedia Extension on Power series, HPC-ACE on SPARC64 VIIIfx, etc.) suggests that our approach to use SIMD instructions is successful on any architecture. The second advantage is the fact that the performance of our implementation is almost independent of the number of particles because there is no overhead associate with the communication between a CPU and an external hardware, such as a GRAPE and a GPU. As a result, the performance of our implementation is better than that aided by external hardwares. In other words, our code have very good strong scaling on massively parallel systems.
Finally, let us mention the future improvement of our implementation. The performance of our N -body code will be higher when Fused Multiply-Add (FMA) instructions are introduced. Using the FMA instructions which compute the product of two numbers and add the result to another number in one step, we will be able to improve the performance and the accuracy of the code, especially accumulation of jerks in the force loop. The FMA instructions will be available in the next-generation CPU named "Bulldozer" by AMD Corporation, and "Haswell" by Intel Corporation, which are scheduled to be released in mid-2011, and in 2013, respectively. Our code is publicly available at http://code.google.com/p/phantom-grape/.
