We have developed a special-purpose computer for gravitational many-body simulations, GRAPE-5. GRAPE-5 is the successor of GRAPE-3. Both consist of eight custom pipeline chips (G5 chip and GRAPE chip). The difference between GRAPE-5 and GRAPE-3 are: (1) The G5 chip contains two pipelines operating at 80 MHz, while the GRAPE chip had one at 20 MHz. Thus, the calculation speed of the G5 chip and that of GRAPE-5 board are 8 times faster than that of GRAPE chip and GRAPE-3 board. (2) The GRAPE-5 board adopted PCI bus as the interface to the host computer instead of VME of GRAPE-3, resulting in the communication speed one order of magnitude faster. (3) In addition to the pure 1/r potential, the G5 chip can calculate forces with arbitrary cutoff functions, so that it can be applied to Ewald or P 3 M methods. (4)The pairwise force calculated on GRAPE-5 is about 10 times more accurate than that on GRAPE-3. On one GRAPE-5 board, one timestep of 128k-body simulation with direct summation algorithm takes 14 seconds. With Barnes-Hut tree algorithm (θ = 0.75), one timestep of 10 6 -body simulation can be done in 16 seconds.
Introduction
In this paper, we describe the hardware architecture and performance of GRAPE-5, the newest addition to GRAPE series of special-purpose computers.
GRAPE (for "GRAvity PipE"; see Taiji 1998 ) is a special-purpose computer to accelerate the gravitational many-body simulation. It has pipelines specialized for the force calculation, which is the most expensive part of the gravitational many-body simulation. All other calculations, such as time integration of orbits, are performed on a host computer connected to GRAPE. Figure 1 illustrates the basic structure of a GRAPE system. In the simplest case, the host computer sends positions and masses of all particles to GRAPE. Then GRAPE calculates the forces between particles, and sends them back to the host computer. The GRAPE system achieved high performance on the gravitational many-body simulation through pipelined and highly parallelized architecture specialized for the force calculation. GRAPE-3 (Okumura et al. 1993 ) is the first GRAPE system with multiple force calculation pipelines. Figure 2 shows the architecture of GRAPE-3A board with 8 pipelines. One pipeline is integrated into the GRAPE chip. The peak performance of the GRAPE chip is 0.6 Gflops at a clock cycle of 20 MHz, and that of a system which contains 8 GRAPE chips is 4.8 Gflops. Like GRAPE-1 ) and GRAPE-1A (Fukushige et al. 1991) , GRAPE-3 uses the number format with short mantissa, in order to reduce the chip size. The r.m.s. relative error of the pairwise force is about 2%, which is low enough for most of collisionless simulations , Hernquist et al. 1993 , Athanassoula et al. 1998 . Copies of GRAPE-3 are used in many institute inside and outside Japan (e.g. Brieu et al. 1995 , Padmanabhan et al. 1996 , Steinmetz 1996 , Nakasato et al. 1997 , Klessen et al. 1998 , Theis and Spurzem 1999 , Koda et al. 1999 . GRAPE-5 is the successor of GRAPE-3. GRAPE-5 embodies the improvement of a factor of 10 in calculation speed, communication speed, and accuracy, over GRAPE-3. In addition, it can be applied to Ewald or P 3 M algorithms, since it can evaluate the force and potential with arbitrary cutoff.
Host
The structure of this paper is as follows: In section 2 we briefly describe force calculation algorithms used with GRAPE-5. In section 3 we describe the hardware of the GRAPE-5 system. In section 4 we discuss the calculation accuracy of GRAPE-5. In section 5 we present the timing results. In section 6 we discuss future prospects.
Force Calculation Algorithms
In this section we briefly discuss force calculation algorithms used on GRAPE-5. The O(N 2 ) direct summation is the simplest algorithm. To obtain the force on a particle, GRAPE-5 simply calculates and adds up forces from all particles in the system. Though the direct summation is simple and efficient for small-N (say N <10 5 ) systems, for large-N systems force calculation becomes expensive, even with GRAPE hardware. In the following, we describe two algorithms to reduce the calculation cost. First we describe the Barnes-Hut tree algorithm (Barnes and Hut 1986) , and then P 3 M (particle-particle particlemesh) method (Hockney and Eastwood 1981) .
Barnes-Hut Tree Algorithm
The Barnes-Hut tree algorithm (Barnes and Hut 1986) reduces the calculation cost from O(N 2 ) to O(N log N ), by replacing forces from distant parti-cles by that from a particle at their center of mass (or multipole expansion). Vectorization of tree algorithm is discussed in Barnes (1990) , Hernquist (1990) and . Parallelization has been discussed in numerous articles (Salmon and Warren 1992 , Salmon et al. 1994 , Dubinski 1996 , Warren et al. 1997 , Yahagi et al. 1999 .
The application of GRAPE hardwares to the tree algorithm are discussed in Makino (1991) and Athanassoula et al.(1998) . Vectorized versions of tree algorithms based on Barnes' modified algorithm (1990) are used in these articles. With this algorithm, we can use GRAPE more efficiently than with the original algorithm. In the original algorithm, tree traversal is performed for each particle. In the modified algorithm, tree traversal is performed for a group of neighboring particles and an interaction list is created. GRAPE calculates the force from particles and nodes in this interaction list to particles in the group.
The modified tree algorithm reduces the calculation cost of the host computer by roughly a factor of n g , where n g is the average number of particles in groups. On the other hand, the amount of work on GRAPE increases as we increase n g , since the interaction list becomes longer. There is, therefore, an optimal value of n g at which the total computing time is minimum (Makino 1991) . The optimal value of n g depends on various factors, such as the relative speed of GRAPE and its host computer, the opening parameter and number of particles. For present GRAPE-5, n g = 2000 is close to optimal.
When we need high accuracy for the force calculation, we can use P 2 M 2 (pseudo-particle multipole method; Makino 1998). In the algorithm described above, we can not handle higher-order terms of the multipole expansion on GRAPE, since GRAPE can calculate 1/r 2 force only. Thus, the amount of the calculation grows quickly when high accuracy is required. In P 2 M 2 , high-order expansion terms are expressed by forces from a small number of pseudoparticles, and thus we can evaluate these terms on GRAPE. Using P 2 M 2 , Kawai and Makino (1999) implemented arbitrary-order tree algorithm on GRAPE-4. The timing results show that the calculation with P 2 M 2 is faster than that without P 2 M 2 , when the total force error smaller than ∼ 10 −4 is required.
P 3 M Method
The P 3 M method (Hockney and Eastwood 1981) calculates gravitational interaction under periodic boundary condition. The total force is divided into longrange and short-range forces. The long-range (PM) force is computed in wave number space using fast Fourier transform (FFT). The short-range (PP) force is calculated directly.
Since the PM part takes care of the long-range interaction, the force calculation in the PP part is not a pure 1/r 2 gravity but with cutoff. For example, Efstathiou and Eastwood (1981) used the following form as the PP force exerted from a particle located at r
where m is the mass of the particle and r ′ is the position at which the force is evaluated. The cutoff g P3M is expressed as
where R ≡ |r − r j |/η, and η is a scale length. If higher accuracy is desired, one can use a Gaussian cutoff (Ewald 1921) . GRAPE-5 can calculate these forces using its programmable cutoff table. Brieu et al. (1995) implemented the P 3 M method on GRAPE-3. Since GRAPE-3 can calculate force with Plummer softening only, they expressed the PP force as a combination of three forces with different Plummer softening radius. Therefore they used GRAPE-3 three times to evaluate one PP force. This procedure is rather complex, and yet the accuracy is rather low. GRAPE-5 evaluates this interaction with cutoff in single call, and in high accuracy.
Hardware of GRAPE-5 System
In this section we describe the function and the architecture of GRAPE-5 system. In section 3.1 we show the overall architecture of the GRAPE-5 system and overview the function of the GRAPE-5 processor board. In section 3.2 we give a detailed description of the G5 chip, a custom chip for force calculation. Sections 3.3-3.6 are devoted to description of components of the processor board other than the G5 chip. Peak performance and other miscellaneous aspects of the board are described in section 3.7. Figure 3 shows the basic structure of the GRAPE-5 system. It consists of three components: a GRAPE-5 processor board, a PCI Host Interface Board (Kawai et al. 1997) , and the host computer. The processor board is connected to PCI Host Interface Board (hereafter we call PHIB) via Hlink . PHIB is attached to PCI I/O bus of the host computer. PCI bus is an I/O bus standard widely used from desktop PCs to supercomputers. Figure 4 shows the block diagram of the GRAPE-5 processor board. It consists of five units: the G5 chips, the memory unit, the particle index unit, the neighbor list unit, and the interface unit. The G5 chip is a custom VLSI chip which integrates two pipeline processors to calculate gravitational interactions. The memory unit supplies particle data to the G5 chips. The particle index unit supplies indices of particles to the memory unit during force calculation. This unit can supply indices in a special manner optimized to the cell-index method to calculate a short-range force, such as the PP force in P 3 M method. The neighbor list unit constructs the list of the nearest neighbor particles. The interface unit handles the communication with the host computer. In the following subsections we describe these units. 
Overall Architecture

Function
The G5 chip calculates the force exerted on particle i at position r i . The force f i is expressed as
where N is the number of particles and
Here f ij is the force (per unit mass) from particle j to particle i. Note that r j and m j are the position and the mass of particle j, and that r s,ij is the softened distance between particle i and j defined as r 2 s,ij ≡ |r i − r j | 2 + ǫ 2 i , where ǫ i is the softening parameter for particle i. Here and hereafter, we use index i for the particle on which the force is evaluated and index j for particles that exert forces on particle i. The function g is an arbitrary cutoff function, η is a scale length for the cutoff function, and r cut is the cutoff length. The G5 chip also calculates potential φ i associated with force f i , using cutoff function different from that for force calculation. During force calculation, The G5 chip asserts the neighbor flag if the distance of particles r s,ij is less than a given neighbor radius, h i .
Overall architecture
The G5 Figure 6 shows the block diagram of the pipeline unit. The number attached to each line is the number of bits used. The number format will be discussed later. The position vector r j and the mass data m j are supplied from the memory unit. The position vector r i , the softening parameter ǫ i , and the neighbor radius h i are stored in the on-chip register before the pipeline starts calculation. The pipeline unit calculates one interaction per clock cycle, and accumulate it to on-chip registers. The pipeline unit outputs the neighbor flag when r G5 chip has the virtual multiple pipeline architecture to reduce the necessary bandwidth of data transfer during force calculation. In this architecture, one pipeline acts as multiple pipelines operating at a slower speed. The G5 chip has six virtual pipelines per real pipeline unit and 12 virtual pipelines in total. Each real pipeline unit calculates the forces exerted on 6 different particles. Data for particle j is used for 6 clock cycles. The memory unit operates at a clock frequency 1/6 of that of the G5 chip. This architecture simplifies the board design. 
Number format
Following the design of GRAPE-1 and GRAPE-3, we adopted word length shorter than those used on conventional computers for the G5 chip. The word length directly determines the number of transistors. In order to achieve high performance at low cost, it is crucial to use the minimum word length that ensures the validity of the calculation. The word length used in the G5 chip are shown in figure 6 and figure 7. The number of bits is attached to each data line.
We adopted the logarithmic format for most of the operations in the pipeline unit except for the subtraction of the position vectors, lookup of the cutoff table, and the final accumulation of the force. We prefer the logarithmic format over the fixed-point format because it has larger dynamic range for the same word length. Of course, the usual floating-point format also can achieve a wider dynamic range. We chose the logarithmic format because operations such as multiplication and square root are easier to implement in the logarithmic format than in the floating-point format.
Although the addition becomes complex, the logarithmic format is more efficient for the word length we used for GRAPE-5. In the logarithmic format, a positive, non-zero real number x is represented by its base-2 logarithm y as
In G5 chip, we use 15 bits to express y in unsigned fixed-point format, with 8 bits below binary point. The 16th bit indicates whether x is 0 or not (nonzero bit), and the 17th bit indicates the sign. Thus the total number of bits per word is 17. This format can express real number in the range of [1, 2 128 ), and its resolution is 2 1/256 − 1 ≃ 0.003.
We use 32-bit fixed-point 2's-complement format for each component of the position vector r i and r j , in order to guarantee uniform resolution and to simplify implementation of the periodic boundary condition. The selection of the minimum image is performed automatically, by setting the size of the box length to the maximum expressible number (2 32 − 1). This format gives a spatial resolution of 2 −32 ≃ 10 −9 .
We use 64-bit and 50-bit fixed-point format for accumulation of the force and potential, respectively. We adopt the fixed-point format because the adder (accumulator) is simpler and its cost is lower in this format than that in logarithmic or floating-point format.
For conversion between the fixed-point format and the logarithmic format, we use format converters described in section 3.2.4 and 3.2.7. For addition in the logarithmic format, we use logarithmic adder described in section 3.2.5.
Format converter (fixed point to logarithmic)
Here we describe the hardware to convert the fixedpoint format to the logarithmic format. Output number has β = 2 + γ + δ bits, where γ and δ are the number of bits above and below the binary point, respectively. In the case of the G5 chip, β = 17, γ = 7 and δ = 8. Figure 9 shows the block diagram of the format converter. First we convert the 2's complement format to the sign+magnitude format. Then we calculate the "integer" part of logarithm (upper γ bits) using a priority encoder. The fractional part of logarithm is calculated as follows. We first normalize the magnitude of x by removing leading zeros. This is done by a logical shifter with control input from the priority encoder. The output of the shifter is then supplied to a table which convert a normalized number to the base-2 logarithm. In case of G5 chip, the table has 512 (= 2 9 ) entries to ensure that the round-off error generated at conversion is small.
Logarithmic adder
The (unsigned-)logarithmic adder performs addition of two positive number X and Y in logarithmic format. The design of logarithmic adder of G5 chip is basically the same as that of GRAPE chip. We designed it using the following relation
for X ≥ Y , or,
for general X and Y (Kingsbury and Rayner 1971, Swartzlander and Alexopoulos 1975) . Figure 10 shows the block diagram of the logarithmic adder. First we calculate X − Y and Y − X, and choose the positive one using a multiplexer. Then we use table lookup to obtain log 2 (1 + 2 δ − 1, the result of addition Z is equal to X (assuming X > Y ), after we properly rounded the result. Thus, in the case of the G5 chip with δ = 8, we need table Makino and Taiji 1998 , section 4.6 for detailed discussion). Figure 7 shows the block diagram of the cutoff function generator. The cutoff function g(r s,ij /η) and g φ (r s,ij /η) are calculated from r 2 s,ij and η 2 . First we divide r 2 s,ij by η 2 using subtracter and input the result to the entry generator. Then we supply the output of the entry generator to the cutoff function tables, and obtain 1/g and 1/g φ as output of those tables. The contents of the cutoff function tables are set by the user before the force calculation starts.
Cutoff function generator
The cutoff function table is realized by an on-chip RAM. The RAM table consumes significantly larger number of transistors per bit than ROM tables used in the format converter and the logarithmic adder. In order to achieve a good cost performance, it is important to keep the size of the table as small as possible.
We reduce the size of the table by taking advantage of the nature of the cutoff function. The cutoff function g converges to 1 when r s,ij /η approaches to 0, and converges to 0 when r s,ij /η ≫ 1 (see figure  11 ). Therefore we need high resolution only when r s,ij /η ∼ 1. Using these characteristics, we can reduce the number of entries to the cutoff table for small r s,ij /η ( ∼ < 0.1) and large r s,ij /η ( ∼ > 2). The entry generator reduces the number of entries to the cutoff function table in the following two steps. At the first step, the input r 2 s,ij /η 2 expressed in 15-bit logarithmic format (without sign and non-zero bit) is converted to a 9-bit integer number. The conversion is expressed as
where A is the logarithmic part of the input number interpreted as unsigned integer, and B is the output. This conversion reduces the number of entries at small r s,ij /η. At the second step, we reduce the entries at large r s,ij /η using the conversion given in table 1. An 8-bit integer number is obtained as the conversion result. The maximum r s,ij /η that is expressible in the final format is 3.0. This value is large enough for typical cutoffs such as g P3M given by equation (2) and a Gaussian cutoff, since the value of cutoff at r s,ij /η = 3.0 is smaller than the force calculation error. In the actual implementation, the entry generator directly converts the input to the final format by single table lookup.
In order to reduce the size of the RAM table, we should reduce not only the number of entries but also the word length. The logarithmic format we used has 17 bits. We do not need sign and non-zero bits for g. In addition, we do not need the full 7-bit integer part, since g smaller than 1/256 can be treated as zero without affecting the overall accuracy. So the minimum number of bits above binary point is three. We choose to assign 4 bits above the binary point. Figure 12 shows the block diagram of the circuit to convert the logarithmic format to the fixed-point format. First we convert the fractional part of the logarithmic format to a normalized number (exponential of the input) by table lookup. Then we shift it according to the integer part of the logarithmic format. 
Packaging and other miscellaneous aspects
The G5 chip is fabricated by NEC Corporation using 0.5 µm gate array process (CMOS-8L family). It consists of ∼ 200 K gates (nominal 300 K gates with 65 % usage) and is packaged in a ceramic pin-grid array with 364 pins. It operates at 80MHz clock cycle. The power supply voltage is 3.3 V and the power consumption is about 10 W. We designed the G5 chip using logic synthesis tool (AutoLogic II by Mentor Graphics Corp.). All design is expressed in the VHDL language.
Memory Unit
The memory unit supplies position vectors r j and masses m j of the particles to the G5 chip according to the indices supplied by the particle index unit. The particle data r j and m j are shared by all G5 chips. The memory unit is composed of four 4 Mbit (128 Kword × 32 bit) synchronous SRAM chips. Three are for position vectors and one is for masses. This unit can store up to 131072 particles. To simulate a system with more than 131072 particles, we need to divide the particles into subgroups each of which includes less than 131072 particles. We can calculate the total forces by summing up the partial forces from each subgroup.
Particle Index Unit
The particle index unit supplies particle indices (memory address) to the memory unit. This unit is optimized for the cell-index method (also known as the linked-list method, Quentrec and Brot 1975, Hockney and . We adopt the hardware design used in MD-GRAPE (Fukushige et al. 1996) .
The cell-index method is a scheme to reduce the calculation cost of the short-range force. In this method, the cube which includes entire simulation space (simulation cube) is divided into M 3 cells where M is the number of cells in one dimension. Usually, we set M to the largest integer not exceeding L/r cut , where r cut is the cutoff length of the short-range force and L is the side length of the simulation cube. In order to calculate the forces on particles in a cell, we need to calculate the contributions only from the particles in 27 neighbor cells. Therefore, the calculation cost is reduced by a factor of 27/M 3 . One could also further reduce the calculation cost by reducing the cell size and calculating the contribution from all cells whose distance from the cell in question is not exceeding r cut . Figure 13 shows the block diagram of the particle index unit. It consists of the cell-index memory and two counters: the cell counter and the particle index counter. When we store particles to the memory unit, we rearrange the order of particles so that particles in the same cell occupy consecutive locations. Thus, we can specify all particles in one cell by its start and end addresses.
The cell counter is a 7-bit counter, and the particle index counter is a 17-bit counter. These two counters 
Neighbor List Unit
The neighbor list unit stores the list of the nearest neighbors for particle i. Particle j is the neighbor of particle i if r s,ij < h i . One neighbor list unit handles the neighbors of particles calculated on four G5 chips. So we have two neighbor list units on one board. Figure 14 shows the block diagram of the neighbor list unit. Each of four G5 chips outputs neighbor flags for 12 particles at every clock cycle of the board. Thus the neighbor list unit receives the flags for 48 particles. These neighbor flags are stored together with the corresponding particle index j in the neighbor list memory if any of them is asserted. The host computer reads the data from the neighbor list memory after the force calculation is finished.
can hold up to 32768 neighbors for 48 particles. The memory stores 48 flag bits and lower 16 bits of particle index. The MSB of the particle index is stored in the FPGA chip which implements other functions of neighbor list as well. The MSB of the particle index changes from zero to one only once, since the particle index always increases. Therefore, we need to store only the value of the internal address counter of the neighbor list at which MSB of the index counter is first set. Using the value of this register, we can recover the MSB of particle index on the host computer.
The logic for the neighbor list unit other than the neighbor list memory is implemented in an EPF10K30 FPGA device.
Interface Unit
The interface unit controls communication between the host computer and the GRAPE-5 system. It recognizes the following five commands: (1) receive data for particle j; (2) receive data for particle i; (3) start the force calculation; (4) send back the calculated force f i and potential φ i ; (5) send back the neighbor list. The control logic is implemented in an EPF10K30 FPGA device, together with the particle index unit.
Peak Performance and Other Miscellaneous Aspects
The peak performance of a GRAPE-5 board is 38.4 Gflops. The G5 chip calculates 1.6 × 10 8 interactions per second. It calculates two pairwise interactions in each clock cycle, and operates at a clock cycle of 80 MHz. If we count the calculation of the gravitational force as 30 floating-point operations, the peak performance of the G5 chip is equivalent to 4.8 Gflops. Thus the peak speed of a board with eight G5 chips is 38.4 Gflops. The sustained performance is discussed in section 5. Figure 15 shows photograph of the GRAPE-5 processor board. The size of the board is 275 mm × 367 mm. Power supply voltage is 3.3 V and the total power consumption is about 70 W.
We started designing the G5 chip in 1996 May. The chip was completed in 1998 June. The first prototype of GRAPE-5 board with four G5 chips was completed in 1998 October. The production version of the processor board was completed in 1999 April.
Accuracy of Force Calculation
In this section we discuss the calculation accuracy of the force from one particle. In section 4.1, we discuss the accuracy of the 1/r 2 gravity with softening. In section 4.2, we discuss the accuracy of the force with cutoff.
Accuracy of 1/r 2 Force
A detailed analysis of the error propagation process gives an estimate of the relative error of the force from one particle as If ǫ ≪ r, the inequality becomes
Here f andf are exact and calculated forces from a particle at distance r. The parameter ǫ i is the r.m.s. absolute error of the fixed-point format used for the position vectors (r i and r j ), and ǫ f is the r.m.s. relative error of the logarithmic format used for internal number expression. The first term of the right hand side of inequality (11) is due to the round-off error of r i and r j expressed in fixed-point format. The second term is due to the error of internal calculation in logarithmic format.
As we described in section 3.2.3, we used 32-bit fixed-point format for position data and 17-bit (8 bits for fractional part) logarithmic format for internal number expression. Therefore, ǫ i and ǫ f are expressed as
and
Here r max is the maximum value of the coordinates. Note that r max can be specified by software. Figure 16 shows the theoretical and measured values of the error S r (f ) 1/2 as functions of r. The "exact" force f is calculated using IEEE-754 standard 64-bit arithmetic, and the forcef is obtained from the software emulator that gives the same results as G5 chip at bit level. The softening parameter is set to ǫ = 10 −6 r max . The mass is set to m = m min , where m min is the smallest mass with which the force from a particle located at far most distance (r max ) does not underflow. The theoretical estimate and measured errors show a good agreement. For r in the range of [10 −6 r max , r max ], the relative force error is around 0.3%. We also plot theoretical estimate of error of GRAPE-3 in Figure 16 . We can see that the G5 chip has 10 times higher accuracy and the dynamic range 10 3 times wider than those of the GRAPE Chip.
In the case of calculation with an extremely small softening, the force from close particle can overflow, i.e., the force can be too large to be expressed in the format we adopted for force representation. Since the maximum force from one particle scales as m/ǫ 2 , the minimum softening parameter ǫ 0 (m) with which the force does not overflow is expressed as
where ǫ min ≃ 10 −7 r max . Figure 17 shows such an extreme case. The force error for a small softening(ǫ = 10 −8 r max ≃ 10 −1 ǫ min ) is plotted. We can see that the force overflows at r ≃ 10 −7 r max .
We can avoid overflow of the force using masses Fig. 16 .-Estimated error of the force from one particle in the case of pure 1/r 2 force with softening, as functions of distance r scaled with r max . The solid and dashed curves are theoretical estimate for GRAPE-5 and GRAPE-3, respectively. The crosses are errors obtained with GRAPE-5. The softening parameter is ǫ = 10 −6 r max . Values are averaged over 1000 trials. smaller than m min . Equation (14) implies that ǫ 0 (m) is smaller than ǫ min , if m is smaller than m min . In this case, however, the force underflows at large r, i.e., the force from distant particle is so small that a part of its lower bits are lost. The maximum distance r 0 (m) with which the force does not underflow is expressed as Figure 18 shows the force error for softening smaller than ǫ min (ǫ = 10 −8 r max ≃ 10 −1 ǫ min ) and mass smaller than m min (m = 10 −2 m min ). We can see that the force does not overflow even with softening smaller than ǫ min , although it underflows at r > 10 −1 r max . This technique to use masses smaller than m min is particularly useful with the tree algorithm. In the tree algorithm, distant nodes typically represent many particles, while close nodes usually represent fewer number of particles. Thus, both the overflow and underflow are automatically avoided.
Accuracy of Force with Cutoff
First we estimate the accuracy of the cutoff function table. The error of the table output is estimated as
where g is the exact value of arbitrary cutoff function, g is the output of the table, and g ′ is the derivative of g. The parameter ζ i is the absolute error of the table entry for the worst case, and ζ f is the relative error of the table output for the worst case. The first term of the right hand side of inequality (16) is due to the round-off error of the table entry. The second term is due to the round-off error of the table output. The resolution of the entry of the table depends on the magnitude of the entry itself, as shown in table 1. The table entry r s,ij /η has the finest resolution (1/128) in the range of [0.0, 1.5], and the resolution doubles at r s,ij /η = 1.5 and r s,ij /η = 2.0. Therefore ζ i is given by
where
for 0.0 < r s,ij /η < 1.5 2 for 1.5 < r s,ij /η < 2.0 4 for 2.0 < r s,ij /η < 3.0 ∞ for 3.0 < r s,ij /η.
The table output is expressed in logarithmic format with 8-bit fractional part, and thus ζ f is given by Figure 19 shows the theoretical and measured values of the error as functions of r s,ij /η. We used the cutoff function for P 3 M method, g P3M , given by equation (2). The theoretical estimate and measured errors show a good agreement. The error increases discontinuously at r s,ij /η = 1.5 because the resolution of the table entry degrades. Even though, the magnitude of the error is not more than 0.5% in all range. Solid curve denotes measured error. Dashed curve is an estimate of the error for the worst case. As the function g we used g P3M for P 3 M method given by equation(2).
In the following we estimate the error of the force with cutoff. We define the error S r (f c ) as
where f c andf c are exact and calculated forces with cutoff, and f is a force without cutoff. We defined the error relative to the force without cutoff, since that is what affects the overall accuracy. Following a similar procedure as that for the 1/r 2 force, the force calculation error is given by 
If ǫ ≪ r, the inequality becomes
where ξ i is the r.m.s. absolute error of the format used for the entry of the cutoff function table, and is expressed as Figure 20 shows the theoretical and the measured error S(f ) 1/2 for the force with cutoff g P3M . The softening parameter is set to ǫ = 10 −6 r max . The scale length is set to η = 1/8. We can see that the error of the force with cutoff is no more than 0.4% for r in the range of [10 −6 r max , r max ]. Note that the measured error is systematically better than the theoretical estimate of equation (21) for 10 −6 r max < r < 10 −2 r max . This is simply because in this range g is very close to unity. The output of the table is exactly one. Thus, the round-off error of the cutoff table is smaller than the estimate of it. figure 16 , but for a force with cutoff g P3M for P 3 M method given by equation(2). The softening parameter ǫ and the scale length of the cutoff function η are 10 −6 r max and r max /8, respectively.
Timing Results
Here we give timing results of the GRAPE-5 system for direct and tree algorithms. For P 3 M method, we give theoretical estimate only. We used one processor board and a host computer COMPAQ AlphaStation XP1000 (Alpha 21264 processor 500 MHz). For comparison, we also give theoretical estimate for GRAPE-3 system. As the host computer of GRAPE-3 system, we use SUN Ultra 2/200 (UltraSPARC processor 200 MHz).
Direct Summation Algorithm
The total calculation time per timestep is expressed as
where T host1 , T grape1 , and T comm1 are the time spent on the host computer, the time spent on GRAPE-5, and the time spent for data transfer between the host computer and GRAPE-5, respectively. The time spent on the host computer is expressed as
where t misc1 represents calculation time spent on the host computer per particle, which includes time integration of the particles, diagnostics, and other miscellaneous O(N ) contributions. The value of t misc1 and other timing constants used for theoretical estimates in this section are summarized in table 2. 4.0 × 10
The time spent on GRAPE-5 is expressed as
Here n pipe is the number of real pipelines per processor board, which equals 16. The number t pipe is the cycle time of the G5 chip. The time spent for data transfer is expressed as
where t comm,j and t comm,i are the time to transfer one byte data from the host computer to GRAPE-5 for particle j and particle i, respectively. The transfer speed for particle j is faster than that for particle i, since the size of the data transferred in one transaction is larger for particle j. Large data implies small overhead and fast overall speed. The constant t comm,f is the time to transfer one byte data from GRAPE-5 to the host computer for calculated force, and q is given by
where ⌊x⌋ denotes the maximum integer which does not exceed x, and n mem (= 131072) is the maximum number of particles which can be stored in the memory unit of a GRAPE-5 processor board. This q indicates how the total force on a particle is divided. If N > n mem , the force on a particle must be divided into q pieces, and GRAPE-5 must be used q times to obtain the total force. Figure 21 shows measured and estimated calculation time per one timestep as functions of N . Figure  22 shows measured and estimated speed as functions of the number of particle N . For GRAPE-5 system, we chose timing constant of the host computer, t host1 to fit the measured results. For GRAPE-3 system, we used t host1 scaled from the one for GRAPE-5 system, according to the ratio of SPECfp95 values of the hosts. For the data transfer times, we used measured values for GRAPE-5 system, and the value given in Athanassoula et al. (1998) for GRAPE-3 system.
In figure 21 , we can see that the calculation with GRAPE-5 is about 8 times faster than that with GRAPE-3, for the entire range of N we used. In figure 22 , we can see that the effective performance of GRAPE-5 exceeds 70 % of the peak performance, for rather small N (= 10 4 ).
Barnes-Hut Tree Algorithm
where T host2 , T grape2 , and T comm2 are the time spent on the host computer, the time spent on GRAPE-5, and the time spent for data transfer between the host computer and GRAPE-5, respectively. The time spent on the host computer is expressed as where t con is the time to construct the tree structure, t list is the time to construct the interaction lists, and t misc2 represents O(N ) miscellaneous contributions (Fukushige et al. 1991) . In this equation, n terms is the average length of the interaction list. According to Makino (1991) , n terms is estimated as follows:
2/3 g + 84n
1/3 g + 56 log 8 n g −31θ −3 log 10 n g − 72
Here, θ is the opening angle. The time spent on GRAPE-5 is expressed as
and the time spent for data transfer is expressed as
(34) Figure 23 shows the calculation time per one timestep. Estimates for GRAPE-3 are also shown. Table 3 shows the calculation time spent on GRAPE-5, on the host computer, and for data transfer for the number of particle N =1048576. As the initial particle distribution, we use a Plummer model. All particles have equal mass. The opening angle θ is 0.75. For GRAPE-5, n g ≃ 2000. For GRAPE-3, we used n g ≃ 1000. The timing constants for the host computer and data transfer are chosen in the same way as in section 5.1.
In figure 23 , we can see that the calculation with GRAPE-5 is about 6 times faster than that with GRAPE-3, and that the tree algorithm is faster than the direct summation algorithm for N ∼ > 10 4 . Table 3 indicates that we can use up to two or three processor boards to increase the calculation speed further. In order to use more than four boards effectively, we need a host computer which has faster computing speed and multiple communication buses, so that the calculation on the host computer and the data transfer do not limit the total performance.
The calculation time with single processor board for a million particle simulation is 16 seconds per timestep. To put this number into perspective, parallel treecode by Dubinski (1996) would took around 60 seconds for one million particles on 64 processor T3D, with θ = 1.2. Taking into account the difference in θ, it is perhaps fair to say our treecode on GRAPE-5 is about 6 times faster than Dubinski's parallel code on 64 processor T3D (his code does not scale well for more than 128 processors). Yahagi et al. (1999) described an implementation of treecode on Fujitsu VPP300/16R, which took ∼ 7 seconds for one million particles, with the opening criterion roughly corresponding to θ = 1. Again taking into account the difference in θ, the effective speed of our code is ∼ 70% of VPP300/16R.
P 3 M Method
where T host3 , T grape3 , and T comm3 are the time spent on the host computer, the time spent on GRAPE-5, and the time spent for data transfer between the host computer and GRAPE-5, respectively. In the following we present the estimate for the case of homogeneous distribution of particles. The time spent on the host computer is expressed as
The first term represents the time for FFT and the second term represents the time for O(N ) miscellaneous calculations. Here M pm is the number of meshes in one-dimension used in PM force calculation, and t misc3 represents time for miscellaneous calculation per particle. The time spent on GRAPE-5 is expressed as
where M pp is the number of meshes for PP force. We set the ratio M pm /M pp to 2.9, following Brieu et al. (1995) and Fukushige et al. (1996) . We set M pm so as to minimize the total calculation time. The optimal value of M pm is given approximately by t pipe n pipe t fft N.
The time spent for data transfer is expressed as T comm3 = N (16t comm,j +12t comm,i +32t comm,f ). (39) Figure 24 shows the calculation time per one timestep. Theoretical estimate for GRAPE-5 is plotted. As the timing constants for the host computer, t fft and t misc3 , we used values scaled from the ones given in Fukushige et al. 1996 , according to the SPECfp95 values of the hosts. The estimated calculation speed is 40 and 25 times faster than that with GRAPE-3 (Brieu et al. 1995) and MD-GRAPE (Fukushige et al. 1996) , respectively. The calculation time with single processor board for 16 million particles simulation is 150 seconds per timestep. As discussed by Brieu et al. (1995) , this calculation time depends only weakly on the degree of the clustering. This is simply because the calculation cost of O(N ) part of the code (mass assignment, particle update, etc.) dominates the total cost. Even at a highly clustered stage, the increase in the calculation cost would be a factor of two or so. On the other hand, CPU time of parallel P 3 M implementation is very sensitive to clustering. For example, MacFarland et al. (1998) reported the CPU time per timestep increased from ∼ 10 seconds to ≥ 100 seconds for their 16 million particle simulation on 128 processor T3E-600. Thus, depending on the degree of clustering, P 3 M on GRAPE-5 runs at the speed of 5-50 % of a 128 processor T3E-600.
6. Future Prospects 6.1. Massively-Parallel GRAPE-5
We plan to build a massively-parallel GRAPE-5 system with peak performance of about 1 Tflops. This system will consist of about 20 processor boards and a host computer. Each processor board is connected to a PHIB which is inserted into one PCI slot. The host computer has multiple processors and multiple PCI slots.
In order for 1 Tflops-peak GRAPE-5 system to operate efficiently with the tree algorithm, a host computer should have the effective computing speed and communication speed of about 5 Gflops and a few hundreds MB/s, respectively. Currently, these performance figures are offered only by machines with multiple processors.
We consider two types of host computers. One is a shared-memory multiprocessors (SMPs), and the other is a workstation cluster.
The advantage of the SMP over the workstation cluster is that the implementation of the code is relatively easy. The disadvantages are that the price is relatively high and the number of processors is limited to 10 or smaller. This limit is due to bottleneck in the memory access from multiple processors.
The advantage of the workstation cluster is that they are inexpensive and that it is possible to connect ∼ 100 processors. For example, Warren et al.(1998) performed cosmological N -body simulation with tree algorithm on Avalon, a cluster of 70 DEC Alpha processors (DEC Alpha 21164A, 533 MHz). The disadvantage is that the implementation of the code is more difficult.
We plan to build a system based on a cluster of 4-8 workstations. A preliminary analysis indicates that 100BT Ethernet connection offers sufficient communication bandwidth. On this system, one timestep of 10 7 -body simulation with tree algorithm would take about 10 seconds. Using this system, for example, we can perform cosmological 10 8 -body simulation for 10 3 steps in one day.
GRAPE-5/PROGRAPE System
We plan to construct a heterogeneous computing system, GRAPE-5/PROGRAPE system. PRO-GRAPE (PROgrammable GRAPE; Hamada et al. 1999 ) is a multi-purpose computer for many-body simulation. It consists of reconfigurable processor implemented on FPGA (Field Programmable Gate Array) chips and memory which stores particle data. PROGRAPE has a very similar architecture to GRAPE except that it uses FPGA chips as pipeline processor instead of custom LSI chips such as G5 chip. PRO-GRAPE can calculate any interaction which is ex
