In this paper, we describe the architecture and performance of the GraCCA system, a graphic-card cluster for astrophysics simulations. It consists of 16 nodes, with each node equipped with two modern graphic cards, the NVIDIA GeForce 8800 GTX. This computing cluster provides a theoretical performance of 16.2 TFLOPS. To demonstrate its performance in astrophysics computation, we have implemented a parallel direct N-body simulation program with shared time-step algorithm in this system. Our system achieves a measured performance of 7.1 TFLOPS and a parallel efficiency of 90% for simulating a globular cluster of 1024 K particles. In comparing with the GRAPE-6A cluster at RIT (Rochester Institute of Technology), the GraCCA system achieves a more than twice higher measured speed and an even higher performance-per-dollar ratio. Moreover, our system can handle up to 320M particles and can serve as a general-purpose computing cluster for a wide range of astrophysics problems.
Introduction
The gravitational N-body simulation plays a significant role in astrophysics, including planetary systems, galaxies, galactic nuclei, globular clusters, galaxy clusters, and large-scale structures of the universe. The number of particles involved (denoted as N) ranges from O(10) in planetary systems to O(10 10 ) in cosmological simulations. Since gravity is a long range force, the main challenge of such simulation lies in the calculation of all N 2 pairwise interactions. Therefore anything involves particle number exceeding 10 6 will have to employ chiefly a mean-field scheme (see below). In the case of collisional system, the evolution time-scale is roughly determined by two-body relaxation time which is proportional to N/log(N) (Spitzer, 1987) . It implies that the total simulation time approximately scales as O(N 3 ) (Giersz and Heggie, 1994; Makino, 1996) . Therefore, the size of such astrophysical simulation is usually limited. For example, for a CPU with 10 GFLOPS (Giga Floating Operations per Second) sustained performance, it would take more than 5 years to simulate the core collapse in a globular cluster with N = 64 K.
A common way to speed up the N 2 force calculation is to adopt the individual time-step scheme (Aarseth, 1963) along with block time-step algorithm (McMillan, 1986; Makino, 1991) . The former assigns a different and adaptive time-step to each particle. Since the characteristic time-scale in some astrophysical simulations varies greatly between a dense region and a sparse region, it is more efficient to assign an individual time-step to each particle. The latter normally quantizes the time-steps to the power of two and advances particles group-by-group. Such an algorithm is especially suitable for vector machines and cluster computers, since a group of particles may be advanced in parallel. Moreover, it also reduces the time for predicting the particle attributes.
An alternative approach to improve performance is to replace the direct-summation scheme by an approximate and efficient scheme, which has a better scaling than O(N 2 ). Examples of such schemes include the Barnes-Hut tree code (Barnes and Hut, 1986) , Particle-Mesh (PM) code (Klypin and Holtzman, 1997) , Particle-Particle/ParticleMesh (P 3 M) code (Efstathiou and Eastwood, 1981) , and Tree-Particle-Mesh (TPM) code (Xu, 1995) . These schemes are efficient and can deal with a large number of particles. Accordingly, they are often used in large-scale structure simulations. The drawbacks of such schemes are the limited accuracy and the incapability to deal with close encounters, which make them inappropriate to study some physics, such as the core collapse in globular cluster.
To achieve both accuracy and efficiency, one needs a high-performance computer with direct-summation algorithm. The development of GRAPE (GRAvity piPE) (Sugimoto et al., 1990; Makino et al., 2003; Fukushige et al., 2005) is made for this purpose. It is a special-purpose hardware dedicated to the calculation of gravitational interactions. By implementing multiple force calculation pipelines to calculate multiple pairwise interactions in parallel, it achieves an ultra-high performance. The latest version, GRAPE-6, comprises 12,288 pipelines and offers a theoretical performance of 63.04 TFLOPS. There is also a less powerful version, GRAPE-6A, released in 2005. It is designed for constructing a PC-GRAPE cluster system, in which each GRAPE-6A card is attached to one host computer. A single GRAPE-6A card has 24 force calculation pipelines and offers a theoretical performance of 131.3 GFLOPS. Some research institutes have constructed such PC-GRAPE clusters (Fukushige et al., 2005; Johnson and Ates, 2005; Harfst et al., 2007; MODEST 1 ) , where the peak performance is reported to be about four TFLOPS. However, the main disadvantages of such system are the relatively high cost, the low communication bandwidth, and the lack of flexibility due to its special-purpose design (Portegies Zwart et al., in press) .
By contrast, the graphic processing unit (GPU) now provides an alternative for high-performance computation (Dokken et al., 2005) . The original purpose of GPU is to serve as a graphics accelerator for speeding up image processing and 3D rendering (e.g., matrix manipulation, lighting, fog effects, and texturing). Since these kinds of operations usually involve a great number of data to be processed independently, GPU is designed to work in a Single Instruction, Multiple Data (SIMD) fashion that processes multiple vertexes and fragments in parallel. Inspired by its advantages of programmability, high performance, large memory size, and relatively low cost, the use of GPU for general-purpose computation (GPGPU  2 ) has become an active area of research ever since 2004 (Fan et al., 2004; Owens et al., , 2007 . The theoretical performance of GPU has grown from 50 GFLOPS for NV40 GPU in 2004 to more than 500 GFLOPS for G80 GPU (which is adopted in GeForce 8800 GTX graphic card) in late 2006. This high computing power mainly arises from its fully pipelined architecture plus the high memory bandwidth. The traditional scheme in GPGPU works as follows (Pharr and Fernando, 2005; Dokken et al., 2005) . First, physical attributes are stored in a randomly accessible memory in GPU, called texture. Next, one uses the highlevel shading languages, such as GLSL, 3 Cg (Fernando and Kilgard, 2003) , Brook (Buck et al., 2004), or HLSL, 4 to program GPU for desired applications. After that, one uses graphics application programming interface (API) such as OpenGL 5 or DirectX 6 to initialize computation, to define simulation size, and to transfer data between PC and GPU memory. Note that the original design of graphic card is to render calculation results to the screen, which only supports 8-bit precision for each variable. So finally, in order to preserve the 32-bit accuracy, one needs to use a method called ''frame buffer object" (FBO) to redirect the calculation result to another texture memory for further iterations. In addition, this method also makes the iterations in GPU more efficient. For example in many GPGPU applications, the entire computation may reside within the GPU memory (except for initializing and storing data in hard disk), which minimizes the communication between GPU and the host computer.
In February 2007, the NVIDIA Corporation releases a new computing architecture in GPU, the Compute Unified Device Architecture (CUDA) (NVIDIA, 2007) , which makes the general-purpose computation in GPU even more efficient and user friendly. In comparing with the traditional graphic API, CUDA views GPU as a multithreaded coprocessor with standard C language interface. All threads that execute the same kernel in GPU are divided into several thread blocks, and each block contains the same number of threads. Threads within the same block may share their data through an on-chip parallel data cache, which is small but has much lower memory latency than the off-chip DRAMS. So, by storing common and frequently used data in this fast shared memory, it is possible to remove the memory bandwidth bottleneck for computation-intensive applications.
For hardware implementation, all stream processors in GPU are grouped into several multiprocessors. Each multiprocessor has its own shared memory space and works in a SIMD fashion. Each thread block mentioned above is executed by only one multiprocessor, so these threads may share their data through the shared memory. Take the NVIDIA GeForce 8800 GTX graphic card (NVIDIA, 2006) for example. It consists of 16 multiprocessors. Each multiprocessor is composed of eight stream processors and has 16 KB shared memory. By allowing the dual-issue of MAD (multiplication and addition) and MUL (multiplication) instructions, this graphic card gives a theoretical computing power of 518.4 GFLOPS. Besides, it has 768 MB GDDR3 memory (named as the device memory or GPU memory) with memory bandwidth of 86.4 GB/s and supports IEEE-754 single-precision floating-point operations. By contrast, the currently most advanced memory bus, dual-channel DDR2 800, in a workstation has a memory bandwidth of 12.8 GB/s.
Scientific computations such as finite-element method and particle-particle interactions are especially suitable for GPGPU applications, since they can easily take advantage of the parallel-computation architecture of GPU. In previous works, Nyland et al. (2004) and Harris (2005) implemented the N-body simulation in GPU but with limited performance improvement. More recently, a 50-fold speedup over Xeon CPU was achieved by using GeForce 8800 GTX graphic card and Cg shading language (Portegies Zwart et al., in press ), but it is still about an order of magnitude slower than a single GRAPE-6A card. Elsen et al. (2007) achieved nearly 100 GFLOPS sustained performance by using ATI X1900XTX graphic card and Brook shading language. Hamada and Iitaka (submitted for publication) proposed the ''Chamomile" scheme by using CUDA, and achieved a performance of 256 GFLOPS for acceleration calculation only. Belleman et al. (in press) proposed the ''Kirin" scheme also by using CUDA, and achieved a performance of 236 GFLOPS for acceleration, jerk, and potential calculations. Although the works of Hamada et al. have outperformed what can be achieved by a single GRAPE-6A card, these are either a sequential code that applies to a single GPU (Hamada and Iitaka, submitted for publication) or a parallel code but only has been tested on a 2-GPU system (Belleman et al., in press ). Consequently, their performances are still incomparable to those of GRAPE-6 and GRAPE-6A cluster.
Based on these works, we have built a 32-GPU cluster named GraCCA, which is compatible to CUDA and has achieved a measured performance of about 7 TFLOPS. In this paper, we describe the architecture and performance of our GPU cluster. We first describe the hardware architecture in detail in Section 2, and then our implementation of parallel direct N-body simulation in Section 3. We discuss the performance measurements in Section 4. In Section 5, we give a theoretical performance model, and finally a discussion of comparison with GRAPE, stability of GraCCA, and some future outlook are given in Section 6.
GPU cluster
In this section, we first show the architecture of GraC-CA, and then discuss the bandwidth measurement between PC and GPU memory, an issue that can be the bottleneck of scientific computation. each node equipped with two graphic cards, the NVIDIA GeForce 8800 GTX, for general-purpose computation. Each graphic card has a GDDR3 memory with 768 MB and a G80 GPU. Its theoretical computing power amounts to 518.4 GFLOPS. The whole system therefore provides a theoretical computing power of 518.4 Â 32 = 16.2 TFLOPS (exclusive of computation executed by CPUs).
Architecture of GraCCA
In Table 1 , we list the main components of a single node in our system. Apart from graphic cards, other components are similar to those of a general PC cluster. Each graphic card is installed in a PCI-Express x16 slot and each node is connected to a gigabit Ethernet switch.
Figs. 2 and 3 are the photos of our GPU cluster and a single node, respectively. We use MPI as the API to transfer data between different CPU processes (including two processes in the same node). Each process is taken by one GPU. For transferring data between PC memory and GPU memory, we adopt CUDA library as API. Since GPU is capable of ultra-fast computation, the communication between PC and GPU memory could be a bottleneck if it is not sufficiently optimized. We illustrate this point in next section.
By installing two graphic cards in a single PC, we maximize the performance of a single computing node. Moreover, as shown in Section 2.2, this architecture also utilizes the total bandwidth between PC and GPU memory more efficiently.
Bandwidth between PC and GPU memory
Data transfer between PC and GPU memory contains two parts: from PC to GPU memory (downstream) and from GPU to PC memory (upstream). Although the theoretical bandwidth of PCI-Express x16 is 4 GB/s in each direction, it is well known that for traditional OpenGL API, the effective bandwidth is asymmetric. So it would be more prudent to measure them separately.
Figs. 4 and 5 show the effective downstream and upstream bandwidths measured by CUDA API, as a function of the size of transferred data. The measured result is obtained by averaging over 1000 steps. In the case of single GPU, we can see that for package size >1MB, the bandwidth in each direction both achieve about 1500-1700 MB/s. It makes no significant difference between downstream and upstream bandwidth for CUDA API. Moreover, GPUs installed in northbridge and southbridge give about the same performance (the southbridge GPU exceeds the northbridge GPU by about 150 MB/s in downstream bandwidth for large data). For a more realistic case, with multiple GPUs running parallelly, two GPUs in the same node must transfer data to and from the PC memory simultaneously. In this case, the bandwidth of each GPU reduces to about 950 MB/s for package size >256 KB, giving a total bandwidth of about 1900 MB/s. As discussed in Section 4.2, given the drop in data speed this high-speed transfer still makes the communication time between GPU and PC memory nearly negligible for a direct N-body problem.
Direct N-body simulation in GPU cluster
To demonstrate the practicability and performance of GraCCA, we have implemented the direct N-body simulation in this system. In the following, we first describe the single-GPU implementation in detail, and then follow the parallel algorithm.
Single-GPU implementation
To implement the gravitational N-body calculation in a single GPU, we follow the basic ideas of Chamomile scheme (Hamada and Iitaka, submitted for publication) and Kirin scheme (Belleman et al., in press ), but with some modifications and a more detailed description. As described in Section 1, one of the most important features of CUDA and GeForce 8800 GTX graphic card is the small but fast on-chip shared memory. It is the key to fully explore the computing power of GPU. In addition, all threads executed in GPU are grouped into several thread blocks, and each of these blocks contains the same number of threads. For simplicity, we use the term ''Grid Size (GS)" to denote the number of thread blocks, and ''Block Size (BS)" to denote the number of threads within each thread block. Therefore, the total number of threads is given by GS * BS. In our current implementation, both BS and GS are free parameters which should be given before compilation. Also note that only threads within the same thread block may share their data through shared memory.
In our current implementation, only acceleration and its time derivative (jerk) are evaluated by GPU. Other parts of the program, such as advancing particles, determining time-step, and decision making, are performed in host computer. Fig. 6 shows the schematic diagram of our single-GPU implementation for acceleration and jerk calculations. Following the convention in N-body simulation, interactive particles are divided into i-particles and j-particles. The main task of GPU is to calculate the acceleration and jerk on i-particles exerted by j-particles according to the following:
and
where m, r ij , v ij , a, j, and e are mass, relative position, relative velocity, acceleration, jerk, and softening parameter.
To make it more clearly, we use P ks to denote the pairwise interaction between sth i-particle and kth j-particle. So, to match the CUDA programming model and extract the maximum performance of GPU, all N 2 pairwise interactions are grouped into (N/BS) 2 groups (denoted as G mn , m = 1, 2,. . . , N/BS, n = 1, 2,. . . , N/BS). Each group G mn contains BS 2 pairwise interactions between i-particles and j-particles. It may be expressed as
Groups within the same column are computed by the same thread block sequentially. In other words, for those i-particles belong to the same column of groups, the acceleration and jerk are evaluated group-by-group by a single thread block. For the case that N/BS > GS, a thread block should evaluate more than one column of G ij . For example,
The evaluation of group G 1,1 in Fig. 6 is shown in detail in Fig. 7 . Block(1) is comprised of BS threads, and each thread(s) evaluates the acceleration and jerk on sth i-particle exerted by j-particles. For example, P 2,1 , P 3,1 , . . . , P BS,1 are evaluated by Thread(1); P 1,2 , P 3,2 , . . . , P BS,2 are evaluated by Thread(2), etc. This kind of computation decomposition fully exploits the immense parallel computing power of modern GPU. Besides, since threads within the same thread block may share their data through fast shared memory, each thread only needs to load one j-particle (the sth j-particle) into the shared memory. It reduces the number of data transfers between device memory and shared memory, which has much higher memory latency and lower bandwidth than the on-chip memory. Moreover, since the number of pairwise force calculations in G mn is proportional to BS 
Fig. 6. The schematic diagram of our single-GPU implementation for acceleration and jerk calculations. The interaction groups computed by Block(1) are highlighted with blue border. The red regions in i-particle and j-particle arrays are the particles used to compute the group G 1,1 .
device memory to shared memory is proportional to BS, we could further eliminate this memory bandwidth bottleneck by having larger BS (128 for example). On the other hand, because different thread evaluates force on different i-particle, we may store the information of i-particles in perthread registers instead of shard memory.
The calculation procedure of a force loop may be summarized as following:
(1) The host computer copies the data of i-particles and j-particles from PC memory to device memory through PCI-Express x16 slot. (2) Each thread loads the data of i-particle into registers based on one-to-one correspondence. (3) Each thread loads the data of j-particle into shared memory based on one-to-one correspondence. (4) Each thread block evaluates one group of pairwise interactions (G mn ). Note that by iterating over j-particles first, all data of iparticles may stay in the same registers during the calculation of whole column of G mn . Moreover, when switching from one G mn to another, each thread only needs to reload seven variables (mass, position, and velocity of j-particles) instead of 12 (position, velocity, acceleration, and jerk of i-particles) from the device memory. So, it reduces the communication time between device memory and on-chip memory and results in a better performance, especially for small number of particles.
Finally, to integrate the orbits of particles, currently we adopt the fourth-order Hermite scheme (Makino and Aarseth, 1992 ) with shared time-step algorithm. For time-step determination, we first use the formula (Aarseth, 1985) dt
where m is an accuracy parameter, to evaluate the time-step for each particle, and then adopt the minimum of them as the shared time-step.
Parallel algorithm
To parallelize the direct N-body simulation in GraCCA, we adopt the so called ''Ring Scheme". In this scheme, all GPUs are conceptually aligned in a circle. Each GPU contains a subset of N/N gpu i-particles (denoted as Sub-I, N gpu denotes the total number of GPUs). Besides, j-particles are also divided into N gpu subsets (denoted as Sub-J), and a force loop is composed of N gpu steps. During each step, each GPU evaluates the force from a Sub-J on its own Sub-I, and then transfer the data of Sub-J between different GPUs.
(1) Initialize the acceleration and jerk backup arrays of each Sub-I as zeros. Note that in this scheme, we may use the non-blocking send and receive (ISEND and IRECV in MPI) to start the data transfer before step (4). The next force loop will wait until the data transfer is complete. By doing so, we could reduce the network communication time since it would be partially overlapped with the force computation (Dorband et al., 2003) .
Performance
In this section, we discuss the performance of GraCCA for direct N-body simulation. For all performance-testing BS j-particle sub-arrays (mass, position, velocity) i-particle sub-arrays (position, velocity) i-particle sub-arrays (acceleration, jerk) simulations, we used the Plummer model with equal-mass particles as initial condition and adopted the standard units (Heggie and Mathieu, 1986) , where gravitational constant G is 1, total mass M is 1, and total energy E is À1/4. This initial condition is constructed by using the software released by Barnes (1994 In the following, we first discuss the optimization of GS and BS. We then assess the performance of single-GPU system, and finally the performance of multi-GPU system.
T (1) T (2) T (3) T (4) T (5) T (BS) T (6) T (7) T (8)

Optimization of GS and BS
As mentioned in Section 3.1, both GS (number of thread blocks) and BS (number of threads within each thread block) are free parameters in our current implementation. In theory, in order to maximize the utilization of GPU resources, both GS and BS should be chosen as large as possible. But on the other hand, a larger BS would introduce a higher cumulative error (Hamada and Iitaka, submitted for publication). So, it would be necessary to determine the optimized values of GS and BS. Fig. 8 shows the calculation time per step as a function of GS for different number of particles. BS is set to 128. It can be seen that for GS 6 16, the calculation time per step is inversely proportional to GS. This result is consistent with the architecture of GeForce 8800 GTX, which has exactly 16 multiprocessors. Since each thread block is executed by only one multiprocessor, executing a kernel in GPU with GS 6 16 will result in 16-GS ''idle" multiprocessors. On the other hand, for GS = n * 16, n = 2, 3, 4,. . . , each multiprocessor processes more than one thread block concurrently. It enables a more efficient utilization of GPU resources. As shown in Fig. 8 , a single-GPU system is able to achieve its maximum performance for GS P 32. Fig. 9 shows the calculation time per step as a function of BS for different number of particles. GS is set to 32. It can be seen that for BS P 96 (except for BS = 160), it approaches the maximum performance. The best performance occurs for BS = 128 in our current implementation. Note that for BS = 160, the performance drops about 13 percent. It is mainly due to the current implementation of CUDA and graphic driver, and may be improved in future versions.
Accordingly, we adopt (GS, BS) = (32,128) for all performance tests in the following sections (except for the cases when N/N gpu < 4 K). Fig. 10 shows the calculation time per step versus the total number of particles. The measured result is obtained by averaging over 1000 steps for N < 128 K and over 100 steps for N P 128 K. Calculation time for initialization procedure is excluded. For the GRAPE system (Makino et al., 2003; Fukushige et al., 2005) , a time estimate is provided to evaluate the system speed. In a similar fashion, the total calculation time per step for a single-GPU system can be expressed as where T host is the time for host computer to predict and correct particles, as well as to determine the next time-step, T PCIe is the time for transferring data between PC and GPU memory through PCI-Express x16 slot, and T GPU is the time for GPU to calculate the acceleration and jerk.
Single-GPU performance
It is clear from Fig. 10 that for N P 4 K, the performance curve has a slope of 2, which is the signature of N 2 calculation. It also verifies that for a large number of N, both T host and T PCIe are negligible. But for N < 4 K, insufficient number of particles results in inefficient utilization of GPU resources. Moreover, the time for communication between PC and GPU memory and for computation in host computer become non-negligible. These factors further reduce the performance. We will describe the performance modeling for single-GPU calculation in detail in Section 5.1. Fig. 11 gives the measured performance in GFLOPS as a function of the total number of particles. The performance in GFLOPS is defined as
where N FLOP is the total number of floating-point operations for one pairwise acceleration and jerk calculation, and T single is the average calculation time per step in single-GPU system. Here we adopt N FLOP = 57 (Makino et al., 2003; Fukushige et al., 2005) in order to compare to the result of the GRAPE system. As discussed above, the performance drops for small values of N(N < 4 K) due to data communication, host computer computation, and insufficient threads in GPU. On the other hand for N P 16 K, the single-GPU system approaches its peak performance, which is about 250 GFLOPS for acceleration and jerk calculations. We note that the performance of the single-GPU system is limited by the computing power of GPU itself. Also note that here we use T single instead of T GPU in Eq. (6) for calculating GFLOPS. It makes Fig. 11 more practical and illustrative since T single and T GPU could be significantly different for small N (see Fig. 18 ).
Multi-GPU performance
Fig . 12 shows the calculation time per step versus the total number of particles for different number of GPUs (N gpu ). Six curves denote the 1-, 2-, 4-, 8-, 16-, and 32-GPU systems, respectively. For the multi-GPU system, the total calculation time per step may be expressed as
where T host , T PCIe , T GPU are defined in the same way as Section 4.2, and T net is the time for transferring data between different nodes through the gigabit network. Note that for a dual-GPU system, the result is measured by two GPUs installed in the same node, which provides higher communication bandwidth between the two GPUs.
In Fig. 12 , all six curves have slope of 2 for N/N gpu P 4 K, which is consistent with Fig. 10 . It shows that for large numbers of N, T host , T PCIe , and T net are all negligible, giving T multi $ T GPU . We will describe the performance modeling for multi-GPU calculation in detail in Section 5.2. Fig. 13 shows the results of performance measurements in GFLOPS as a function of the total number of particles for different numbers of GPUs. We can see that for N/N gpu P 16 K, each system with different number of GPUs approaches their peak performance. Moreover, it demonstrates a great scalability of our system. The maximum performance of the multi-GPU system is still limited by the computing power of GPU itself. For the 32-GPU case, the system achieves a total computing power of 7.151 TFLOPS, in which case each GPU achieves a performance of 223 GFLOPS. It is about 89 percent of the peak performance in a single-GPU system.
In Figs. 12 and 13, the crossover points indicate that the system with more GPUs becomes marginally faster or even slower than the system with fewer GPUs. All these points appear when N/N gpu = 1 K. This result is consistent with Fig. 11 , since when the number of particles changes from 2 K to 1 K, the performance of single GPU drops more than 50%. For N/N gpu P 4 K, all curves have slope of about À1, which indicates that the calculation time per step is roughly inversely proportional to the number of GPUs. It further verifies the great scalability of our GPU cluster, especially for N P 128 K. For the N = 1024 K case, it has a slope of À0.974. The calculation speed of 32-GPU system is 28.8 times faster than single-GPU system, in which the parallel efficiency achieves 90.1%. Fig. 15 shows the speedup factor versus the number of GPUs for different N, where the speedup factor s is defined as 
To demonstrate the GPU's ability for conducting the most time-consuming astrophysical computation, in Fig. 16 we show the temporal evolution of core density in the Plummer model up to N = 64 K. The core density is estimated by using the method proposed by Casertano and Hut (1985) , and with a faster convergence property suggested by McMillan et al. (1990) . The time (x axis) in Fig. 16 is scaled by 212.75 * log(0.11 N)/N (Giersz and Heggie, 1994) . Note that to capture the post-collapse behavior (e.g., gravothermal oscillation), one needs an alternative integration scheme such as KS regularization (Mikkola and Aarseth, 1998) to handle close two-body encounters and stable binaries (Makino, 1996) . Currently this scheme is not yet implemented in our program.
Performance modeling
In this section, we construct a performance model of direct N-body simulation in GraCCA, and compare that to the measured performance. The performance model we adopt is similar to that of GRAPE-6 (Makino et al., 2003) and GRAPE-6A (Fukushige et al., 2005) , but modified to fit the architecture of our system. In the following, we first present a performance model of the single-GPU system, and then follow the model of cluster system.
Performance modeling of single-GPU system
As mentioned in Section 4.2, the calculation time per step for direct N-body simulation in a single-GPU system (T single ) may be modeled by Eq. (5): T single (N) = T host (N) + T PCIe, single (N) + T GPU (N). T host is the time spent on the host computer. In our current implementation, it may be written as
where T pred is the time for prediction, T corr is the time for correction, and T timestep is the time for time-step determination. All of these operations are roughly proportional to the number of particles, so we may rewrite Eq. (9) as
where the lower-case letter ''t" represents the computation time ''per particle". This number is mainly determined by the computing power of host computer, and is roughly the same for different N. So in our performance model, we take t host as a constant and T host (N) is directly proportional to N. T PCIe, single is the time spent on data transfer in PCI-Express x16 lanes. Since the effective bandwidth between PC and GPU memory in a single-GPU case is different from the multi-GPU case (see Section 2.2), here we use the subscript ''single" to emphasize the difference. T PCIe, single may be written as
where T i is the time for transferring i-particle position and velocity downstream to GPU memory, T j is the time for transferring j-particle mass, position, and velocity downstream to GPU memory, and T force is the time for transferring i-particle acceleration and jerk upstream to PC memory. The lower-case ''t" represents the communication time per particle. 
where t pair is the calculation time for a single pairwise interaction. Note that T GPU scales with N 2 . The measured results of t host , t PCIe, single , and t pair are recorded in unit of millisecond in Table 2 . Table 2 are the optimum values, it would be more prudent to define efficiency factors both for t PCIe, single and t pair for small N (denoted as model 2). In practice, we may rewrite Eq. (5) as f PCIe, single (N) and f pair (N) are efficiency factors for t PCIe, single and t pair , respectively. These factors are purely empirical and determined by fitting to the measured results of t PCIe, single and t pair . Note that although f PCIe, single and f pair are discontinuous at N = 16 K and 4 K, respectively, they are negligible since these discontinuities have influences on predicted performance by less than 2%. The wall-clock time per step predicted by model 2 is also presented in Fig. 17 . It is clear that for N < 4 K, model 2 is in better agreement with the measured performance than model 1. Fig. 18 shows the relative ratios of T host , T PCIe, single , and T GPU in model 2. Since T GPU scales with N 2 , but T host and T PCIe, single scale with N, T GPU /T single would increase with N. This feature is clearly verified in Fig. 18 . T GPU /T single reaches 71% for N = 4 K and 91% for N = 16 K. These predicted ratios are consistent with the timing measurement discussed in Section 4.2.
Performance modeling of multi-GPU system
Following the performance model of the single-GPU system, the calculation time per step in our GPU cluster may be modeled as
where nN/N gpu is number of i-particles held by each GPU. The subscripts ''multi" in t PCIe, multi and f PCIe, multi are used to highlight the difference of bandwidth between single-and multi-GPU systems. For the latter case, the efficiency factor f PCIe, multi may be expressed as 
T net is the time for transferring data of j-particles through gigabit network. In the ring communication topology, T net may be expressed as
where t net is the time for transferring a single particle. It may be expressed as t net = 28/BW net , where BW net is the average measured bandwidth of gigabit network. The estimated value of t PCIe, multi and t net are recorded in Table 2 . Note that for the dual-GPU case, since two GPUs are installed in the same node, we set T net = 0 in our performance model. Fig. 19 shows the wall-clock time per step predicted by Eq. (14) for N gpu = 2, 8, 32, along with the measured performance for comparison. Again, the agreement between modeled and measured performance is very good. It indeed captures the feature of lower slope when N/N gpu < 4 K (see Section 4.3). Fig. 20 shows the relative ratios of T host , T PCIe, multi , T net , and T GPU in multi-GPU performance model with N gpu = 2, 8, 32. It can be seen that although the total calculation time is generally dominated by T GPU , the time spent on network communication is non-negligible in some cases. T net /T multi exceeds 20% for N 6 128 K in 32-GPU simulations and 10% for N 6 64 K in 8-GPU simulations. Also note that although T PCIe, multi plays a minor role in performance modeling, T PCIe, multi /T multi still exceeds 9% for N 6 128 K in 32-GPU simulations and 9% for N 6 32 K in 8-GPU simulations. Finally, it shows that T host is negligible in all cases.
Discussion
In this section, we address on the comparison of performance between GraCCA and GRAPE system, the stability of GraCCA, and finally a discussion of future work.
Comparison with GRAPE
6.1.1. Single-GPU system
The GRAPE-6A board with four GRAPE-6 processor chips has become available since 2005 (Fukushige et al., 2005) . It is a single PCI card (currently a GRAPE PCI-X board is available) attached to a host computer. The theoretical peak performance of a single GRAPE-6A board is 131.3 GFLOPS for acceleration and jerk calculation. The maximum number of particles it can handle is up to 256 K. A single GRAPE-6A card costs about $6 K.
In comparison with GRAPE-6A, the NVIDIA GeForce 8800 GTX graphic card transfers data between PC and GPU memory through PCI-Express x16 slot, which has a body part turns out to be the most time consuming. This is where the GPU computation comes to play. The GPU program can replace the existing CPU sub-routine of direct Nbody calculations. The replacement will shorten the cosmology computation time by a sizable factor.
Finally, Gualandris et al. (2007) have presented the first highly parallel, grid-based N-body simulations. It opens a paradigm for the GraCCA system to connect with the grid-computing community in the future.
