GRAPE project  by Makino, Junichiro
Journal of Computational and Applied Mathematics 149 (2002) 131–145
www.elsevier.com/locate/cam
GRAPE project
Junichiro Makino
Department of Astronomy, School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
Received 6 November 2001; received in revised form 22 January 2002
Abstract
We overview our GRAvity PipE (GRAPE) project to develop special-purpose computers for astrophysical
N-body simulations. The basic idea of GRAPE is to attach a custom-built computer dedicated to the calculation
of gravitational interactions between particles to a general-purpose programmable computer. With this hybrid
architecture, we can achieve both a wide range of applications and very high peak performance. Our newest
machine, GRAPE-6, achieved a peak speed of 32 Tflops and a sustained performance of 11.55 Tflops, for a
total budget of about 4 million USD.
We also discuss relative advantages of special-purpose and general-purpose computers and the future of
high-performance computing for science and technology.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Computational science; Special-purpose computer; Numerical algorithms
1. Introduction
In this paper, we discuss a relatively unexploited approach to computational science, namely
to design and build computer hardware specialized and optimized for a relatively narrow range
of problems.
The development of computers themselves has not been regarded as really being a part of
computational science. It has been considered to be what industry and/or computer scientists would
do for us. This view was reasonable when relatively simple computers were still very expensive and
had to be shared by a number of researchers from a wide variety of fields. For example, the Cray-1,
which was completed in 1976, was really a simple and small computer by today’s standards. It had
just 8 MB of SRAM memory in 16 banks, one multiplier unit and one adder unit. The estimated total
E-mail address: makino@astron.s.u-tokyo.ac.jp (J. Makino).
0377-0427/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0377-0427(02)00525-3
132 J. Makino / Journal of Computational and Applied Mathematics 149 (2002) 131–145
Fig. 1. Number of floating-point operations performed per single machine cycle of representative high-performance
microprocessors.
gate count was just 300K. In comparison, a single microprocessor today integrates 40M transistors,
which is equivalent to around 10M gates. Thus, the world’s largest supercomputer of 25 years ago had
less than one-tenth of the gate count of today’s desktop (or even notebook) PC.
With such a small gate count, it was impractical to design a machine specialized for one application.
Since it was difficult to make even one floating-point calculation unit, the natural way to make
a computer was to add control logic to that floating-point unit so that it could perform operations
specified by the software.
The available number of transistors, however, has been increasing exponentially over the last half
century, at a rate of about a factor of 100 per decade. Thus, present-day microprocessors integrate
enough transistors to implement several hundred floating-point units.
Fig. 1 shows this situation rather clearly for the case of single-chip microprocessors. When it
was impossible to implement a full-decode multiplier unit on a single chip, we had to use smaller
multipliers which needed multiple cycles to perform one operation (before 1989). During this period,
the increase in the available number of transistors could be used simply to make bigger and bigger
multipliers. Thus, in the 1980s a factor of 100 improvement in the cycle count, which is consistent with
the factor of 100 increase in the number of transistors, was achieved.
In the 1990s, however, the improvement in the number of operations per cycle practically halted, as is clearly seen
in Fig. 1. In 1988, Intel shipped the i860, the first single-chip microprocessor that could perform more
than one floating-point operation per clock cycle. If the exponential increase of the 1980s had
continued through the 1990s, we would have seen microprocessors with several hundred floating-point units in
2001. Instead, processors like the Intel Pentium 4 still perform only a few operations per cycle, even
though the number of transistors has increased by a factor of 50 or so.
In other words, practically all the increase in the available number of transistors in the 1990s was
spent on things other than arithmetic units. These “other” things include large L1 and sometimes
L2 caches, multiple execution units for integer operations, datapath and control units for
out-of-order execution of multiple instructions, additional logic to implement deeper pipelines, and
so on.
If we draw the same figure for high-end supercomputers, we see essentially the same trend, but
shifted by 15 years. With the Cray-1 (1976), supercomputers reached the regime where they could
perform one floating-point operation per cycle. Until the early 1990s, the growth in the number of
floating-point units in a single supercomputer was very slow. Machines of the early 1990s such
as the Hitachi S-3800, NEC SX-4 and Cray T-90 had only a few tens of arithmetic units, instead of the
few thousand that were theoretically possible.
The most obvious way to use multiple floating-point units in a single computer is to develop
a parallel computer. Thus, in the 1980s, parallel computers based on one-chip microprocessors and
those based on supercomputers were built. Parallel computers based on microprocessors were mostly
distributed-memory processors, which were essentially a cluster of simple computers connected by
a relatively slow and inexpensive network. Thus, large machines with more than 1000 processors
were possible. Parallel computers based on supercomputers were more complicated shared-memory
machines. The shared-memory architecture limited the number of processors to 32 or less.
Although microprocessor-based machines used a much larger number of processors, they were still
slow in absolute performance simply because each microprocessor was slow. Parallel supercomputers
offered higher peak performance.
However, in the 1980s, the performance of microprocessors improved drastically, because they
could fully utilize the increase in the available number of transistors. Thus, by the early 1990s,
microprocessor-based machines started to offer a better price-performance ratio than parallel supercomputers.
In 1993, Fujitsu announced the VPP-500, the first parallel supercomputer with a distributed-memory
architecture. With this architecture, the available number of processors was boosted to more than 100,
resulting in a very large improvement in the price performance of supercomputers.
However, this move was counteracted by the birth of “Beowulf-class” machines [2,10], which are
really just simple clusters of commodity PCs with Intel x86 CPUs connected by standard Ethernet. This
architecture typically offers one order of magnitude better price performance than other computer
architectures, simply because the production cost is low due to mass production. “High-end” Intel
x86 CPUs are made and sold in units of hundreds of millions per year, while supercomputers were
sold in units of hundreds. Thus, the sheer number of chips reduces the production cost per chip, and
yet a huge development cost can be spent on Intel x86 chips since the total revenue is tens of
billions of USD.
Thus, at present the computer architecture which offers both the best price performance and the best
absolute throughput is a large cluster of PCs with microprocessors of high-volume production. However,
as we saw at the beginning, this architecture, at present, does not make good
use of the transistors available on one chip. To make matters worse, the fraction of transistors used
for floating-point operations will decrease in the future. In a few years, the architecture called the on-chip
multiprocessor will be common. This evolutionary path again follows that of supercomputers,
but with a delay of 20 years. Thus, we can “predict” the near future of the on-chip multiprocessor
architecture rather accurately by just looking at the evolution of supercomputers
in the 1980s: the evolution will be slow.
So our question here is: Is there any better way to design a computer for scientific simulation?
We believe the answer is “yes”, and in the rest of this paper we describe why we believe so.
In Section 2, we discuss the potential advantages and drawbacks of designing and developing
special-purpose systems. The potential gain is very large, since, at least in some cases, we might
be able to use a fair fraction of the available transistors to perform useful arithmetic operations. On the
other hand, there are many practical difficulties which would offset the potential gain. In Section
3, we discuss our GRAPE project [11,8] as an example of a reasonably successful project
to develop special-purpose computers. In Section 4, we speculate on the future of large-scale
scientific computing.
2. Special-purpose computing
2.1. Astrophysical N-body problem
The idea of building special-purpose computers for specific numerical problems is certainly not
new. In fact, the very first digital electronic computers (ABC and ENIAC) were designed to solve
specific problems. However, the designers of early computers found that their machines could be used
for a much wider range of problems, and thus the evolution of the programmable general-purpose
computer started.
At the time when even the largest supercomputers did not have a fully parallel multiplier, it did not
make sense to design a special-purpose computer. Any computer, either special- or general-purpose,
consists of arithmetic units (mainly multipliers), control logic and memory. A special-purpose com-
puter might have simpler control logic or smaller memory than a general-purpose computer. However,
if the cost of the arithmetic unit itself is a fair fraction of the total cost of the machine, we cannot
gain much by trying to reduce the cost of the rest of the system.
This situation, however, has changed completely, as we have seen in the previous section. To
give a concrete example, let us discuss the astrophysical N-body problem and our GRAvity PipE
(GRAPE) hardware specialized for that problem. The master equation for the gravitational N-body
problem is given by
\[
\frac{d^2 \mathbf{x}_i}{dt^2} = -\sum_{j \neq i} G m_j \frac{\mathbf{x}_i - \mathbf{x}_j}{|\mathbf{x}_i - \mathbf{x}_j|^3}, \tag{1}
\]
where $\mathbf{x}_i$ and $m_i$ are the position and mass of the particle with index $i$ and $G$ is the gravitational
constant. The summation is taken over all particles (typically stars) in the system under study.
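As a minimal sketch of what evaluating the right-hand side of Eq. (1) involves, the following Python function sums the pairwise terms directly. The function name, the default G = 1, and the optional softening length eps (which appears in Fig. 2 but not in Eq. (1)) are illustrative assumptions, not part of any GRAPE software.

```python
def accelerations(pos, mass, G=1.0, eps=0.0):
    """Direct O(N^2) evaluation of the right-hand side of Eq. (1).

    pos  : list of [x, y, z] positions
    mass : list of particle masses
    eps  : optional softening length (assumption; zero reproduces Eq. (1))
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the sum excludes the self-interaction
            dx = [pos[i][k] - pos[j][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps ** 2
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] -= G * mass[j] * dx[k] * inv_r3
    return acc
```

The double loop makes the N^2 cost explicit: every particle needs the contribution of every other particle.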
The number of particles N varies widely depending on the kind of system studied. A small
star cluster like the Hyades might consist of several thousand stars. Globular clusters typically contain
one million stars, and galaxies hundreds of billions of stars. In the case of galaxies, it is impossible
to model them on a star-by-star basis, and we model the distribution of stars in the six-dimensional
phase space with a much smaller number of particles.
If N is larger than several tens, the calculation of the right-hand side of Eq. (1) dominates
the total calculation cost. If N is very large, we could use sophisticated algorithms such as the
Barnes–Hut treecode [1], FMM [3], or the particle-mesh or particle–particle particle-mesh method
[4]. However, with treecode or FMM, the direct evaluation of gravitational interaction between
near-neighbor particles still dominates the total cost. Even in the case of the particle-mesh method,
if we want high accuracy, near-neighbor interaction must be calculated directly.
The calculation itself is rather simple. All we have to do is to calculate and accumulate the
gravitational force between two particles. Also, there is a very large degree of potential parallelism,
since in principle we can evaluate all $N^2$ interactions in parallel. Thus, N-body simulations are quite
well suited for execution on massively parallel processors or PC clusters.
As discussed earlier, however, our aim here is to achieve a better use of transistors than what is
realized with present microprocessors. Since N-body simulations have a large degree of parallelism,
we first concentrate on how to make use of the large number of transistors available on one chip. We
return to the problem of parallelization later.
The problem with present microprocessors is that a very small fraction of the available transistors
is used to implement floating-point arithmetic operations. The number of floating-point units in a
single one-chip microprocessor has been almost constant for the last 10 years, though the number
of transistors has increased by almost two orders of magnitude.
Why has the number of floating-point units remained just a few, even though there are enough
transistors to integrate hundreds of them? There are essentially two reasons. The first reason is that we
do not know how to use such a large number of arithmetic units. To be precise, it is of course
possible to use multiple arithmetic units in parallel for applications with a high degree of parallelism
such as N-body simulations. Most scientific applications do have a high degree of parallelism, and
would run with high efficiency on processors with a large number of arithmetic units. However, at
present, microprocessors are designed to achieve the following two goals. The first is to run the most
widely used applications as fast as possible, that is, the Microsoft Windows operating system
and the Microsoft Office application package. The other is to run standard benchmarks such as the SPEC
CPU2000 suite as fast as possible. Of course, these are the most important applications and benchmarks
from the viewpoint of commercial success. However, it goes without saying that the performance of
a microprocessor for scientific applications has rather little to do with its performance in processing
MS-Word documents. Clearly, multiple floating-point units would not do much to improve the
performance of MS-Word.
The second reason is that it is difficult to provide sufficient memory bandwidth to keep the processors
busy. It is not impossible to integrate 100 floating-point arithmetic units on a chip. However, if they
operate independently, or even if they operate in SIMD fashion, each unit needs to get two
words of data from memory and to store one word of data in order to perform one floating-point operation.
If these 100 arithmetic units operate on a 500 MHz clock, which is rather slow by today’s standards,
the necessary memory bandwidth would be 150 Gwords/s or 1.2 Tbytes/s, which is about 500 times
higher than the off-chip transfer bandwidth of today’s microprocessors. A 2 GHz Intel
Pentium 4 processor has a theoretical peak transfer bandwidth of just 3.2 GB/s.
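The bandwidth estimate above can be reproduced with a few lines of arithmetic. In this sketch the function name is hypothetical, and the 8-byte word size is an assumption chosen because it is what makes 150 Gwords/s equal to 1.2 Tbytes/s.

```python
def required_bandwidth(units, clock_hz, words_per_op=3, bytes_per_word=8):
    """Memory traffic needed to keep `units` arithmetic units busy.

    words_per_op   : two operands loaded plus one result stored
    bytes_per_word : assumed 64-bit words
    Returns (words per second, bytes per second).
    """
    words_per_s = units * clock_hz * words_per_op
    return words_per_s, words_per_s * bytes_per_word

# 100 units on a 500 MHz clock, as in the text
words_per_s, bytes_per_s = required_bandwidth(100, 500_000_000)
print(words_per_s / 1e9, "Gwords/s")   # 150.0 Gwords/s
print(bytes_per_s / 1e12, "Tbytes/s")  # 1.2 Tbytes/s
```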
Of course, the necessary bandwidth is smaller for operations like summation. The difference,
however, is small compared to the factor of 500 discrepancy shown above. This huge discrepancy
means it is not easy to keep even one processor busy. With the present clock speed of around 2 GHz,
one processor needs a bandwidth of 48 GB/s, or 15 times more than what is actually provided.
It is quite understandable that architects of microprocessors are not too interested in adding more
floating-point units. External memory cannot feed even a single unit. Unless there is a way to vastly
increase the bandwidth, little or no gain is achieved by increasing the number of floating-point
units, even for scientific applications. It makes a lot of sense that they used a large fraction of the
Fig. 2. The force calculation pipeline.
transistors on chip for cache memory, which can somewhat reduce the necessary off-chip memory
bandwidth.
To summarize, though the current chip design with just one or two arithmetic units looks like a
waste of the transistors, processor architects have good reasons to choose such designs.
2.2. Special-purpose computer
If we are to design a computer specialized for N-body simulation, or rather, for the calculation of the
gravitational interaction between particles, we can work around the limitations discussed above.
Fig. 2 shows the basic “processor” for our N-body computer. Calculation of the gravitational
force from one particle to another requires some 40 floating-point operations, depending on how you
count division and square-root operations. If we assign 10 operations to each of them, we end up
with about 40 operations per interaction. Instead of using one arithmetic unit to perform these 40
operations, we connect 40 floating-point arithmetic units in the form of a pipeline, which calculates
and accumulates the force from one particle at each clock cycle.
With this approach, we can work around the two limitations discussed above. The first one was the
efficiency of the system on commercial applications. This we simply ignore. Clearly not everybody
who buys a PC wants a fast computer for astrophysical N-body simulation.
The second limitation is the memory bandwidth. With the simple pipeline of Fig. 2, 40 floating-point
operations are performed for the input of four words. Thus, there is a saving of a factor of 30,
compared to a single arithmetic unit or multiple units working in SIMD fashion.
When we integrate multiple pipelines, we can let these pipelines calculate the forces on different
particles from the same particle. Thus, we can integrate multiple pipelines into a chip without
increasing the necessary memory bandwidth.
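The factor of 30 and the multi-pipeline argument can be checked with exact arithmetic. The sketch below is only a bookkeeping exercise: the constant and function names are hypothetical, and reading the "four words" as the three coordinates and the mass of the source particle is an assumption.

```python
from fractions import Fraction

OPS_PER_INTERACTION = 40   # figure quoted in the text
WORDS_PER_J_PARTICLE = 4   # assumed: x, y, z and mass of the source particle

def words_per_op_conventional():
    # two operands loaded and one result stored per operation
    return Fraction(3, 1)

def words_per_op_pipelined(n_pipelines):
    # every pipeline receives the same broadcast source particle,
    # so the input traffic does not grow with the number of pipelines
    return Fraction(WORDS_PER_J_PARTICLE, OPS_PER_INTERACTION * n_pipelines)

saving_one = words_per_op_conventional() / words_per_op_pipelined(1)
saving_six = words_per_op_conventional() / words_per_op_pipelined(6)
print(saving_one, saving_six)  # 30 180
```

With six pipelines sharing one source particle, the traffic per operation drops by a further factor of six while the input stream stays the same.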
It is also possible to reduce the required bandwidth for one pipeline, by letting a single pipeline
calculate forces on multiple particles. This can be done by adding register files for the input
particles and calculated forces, and by switching them at each clock cycle. We call this the virtual
multiple pipeline (VMP) architecture [9]. The idea of VMP is quite similar to what is now called
multithreading, but the difference is that VMP reduces the required memory bandwidth, while usual
multithreading only relaxes the requirement for the memory latency.
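A software analogy may make the VMP idea concrete. In the sketch below, one loop body stands for one physical pipeline, the list acc plays the role of the force register file, and each source particle is fetched once and reused for every virtual pipeline. The function name, G = 1, and the skipping of zero-distance pairs are assumptions for illustration; this is not a description of the actual hardware.

```python
def vmp_forces(i_positions, j_particles, eps2=0.0):
    """Time-share one pipeline among len(i_positions) virtual pipelines.

    i_positions : positions of the particles receiving the force
    j_particles : list of (position, mass) source particles
    Returns (accelerations, number of j fetches, number of clock cycles).
    """
    n_virtual = len(i_positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n_virtual)]  # force register file
    j_fetches = 0
    cycles = 0
    for xj, mj in j_particles:
        j_fetches += 1                  # one memory read per source particle
        for v in range(n_virtual):      # one clock cycle per virtual pipeline
            dx = [xj[k] - i_positions[v][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps2
            if r2 > 0.0:                # skip self-interaction (assumption)
                w = mj * r2 ** -1.5
                for k in range(3):
                    acc[v][k] += w * dx[k]
            cycles += 1
    return acc, j_fetches, cycles
```

The ratio cycles / j_fetches equals the number of virtual pipelines, which is the factor by which the input bandwidth per cycle is reduced.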
Note that many of the above ideas can be applied to the calculation of particle–particle interaction
on general-purpose programmable computer. Thus, with some careful programming (sometimes in
Fig. 3. The basic architecture of GRAPE. The host computer performs the O(N) calculations and GRAPE performs the O(N^2) force calculation.
assembly languages), it is possible to achieve the performance close to the theoretical peak perfor-
mance of the processor. Here, the simple fact that there are only a few arithmetic units on a chip
limits the ultimate performance.
Our GRAPE project is based on this idea of integrating multiple (both physical and virtual)
pipelines specialized for gravitational force calculation into one chip (or one board, when the number
of available transistors was not as large as it is now). To give an idea, the GRAPE-6 system,
which was completed in July 2001, integrates 1024 pipeline chips, each with 6 pipeline processors
which calculate the gravitational force and its first time derivative. As a result, a single GRAPE-6
processor chip integrates about 350 floating-point arithmetic units.
The clock speed of GRAPE-6 chips is only 100 MHz, though they use a reasonably advanced 0.25 μm
technology. The reason for this rather low clock speed is simply that we did not have sufficient
resources to fine-tune the design. Even so, with 350 arithmetic units a single chip offers a peak speed
exceeding 30 Gflops, which is still more than a factor of 10 better than that of the fastest
microprocessors. In addition, because of this rather low clock speed, the power consumption and therefore heat
dissipation are small (around 15 W). This low power consumption is quite important in reducing
the overall cost of the system and the running cost (electricity is expensive in Japan).
Since the GRAPE hardware performs only the force calculation, all other operations, such as
the time integration of the orbits of stars and analysis of the calculated results, must be done
elsewhere. Therefore, we connect the GRAPE hardware and a general-purpose computer by some
communication link. This is the basic structure of our GRAPE systems, as shown in Fig. 3.
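The division of labour can be sketched in a few lines. Here grape_forces is a hypothetical software stand-in for the accelerator (the only O(N^2) part), and host_step is the O(N) work left to the host. The leapfrog integrator, G = 1, and all names are assumptions chosen for brevity; they are not the scheme or interface used with the real hardware.

```python
def grape_forces(pos, mass, eps2):
    """Stand-in for the accelerator: O(N^2) pairwise accelerations (G = 1)."""
    n = len(pos)
    acc = [[0.0] * 3 for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dx = [pos[j][k] - pos[i][k] for k in range(3)]
                r2 = sum(d * d for d in dx) + eps2
                w = mass[j] * r2 ** -1.5
                for k in range(3):
                    acc[i][k] += w * dx[k]
    return acc

def host_step(pos, vel, mass, dt, eps2=0.0):
    """Host side: O(N) kick-drift-kick leapfrog update around two force calls."""
    acc = grape_forces(pos, mass, eps2)            # offloaded part
    vel = [[v[k] + 0.5 * dt * a[k] for k in range(3)] for v, a in zip(vel, acc)]
    pos = [[p[k] + dt * v[k] for k in range(3)] for p, v in zip(pos, vel)]
    acc = grape_forces(pos, mass, eps2)            # offloaded part
    vel = [[v[k] + 0.5 * dt * a[k] for k in range(3)] for v, a in zip(vel, acc)]
    return pos, vel
```

Only positions and masses go to the force routine and only accelerations come back, so the traffic over the link is O(N) per step while the work behind it is O(N^2).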
3. GRAPE project
We started the GRAPE project in 1988. The first machine we completed, GRAPE-1 [5], was a
single-board unit on which around 100 IC and LSI chips were mounted and wire-wrapped. We used
commercially available IC and LSI chips to implement the force calculation pipeline. This choice was
a natural consequence of the fact that we lacked both the money and the experience to design custom LSI
chips. In fact, none of the original design and development team of GRAPE-1 had any knowledge
of electronic circuits beyond what is learned in a basic undergraduate course for physics students.
GRAPE-2 is similar to GRAPE-1 since it is also based on commercial LSI chips. The difference
is in the numerical accuracy. For GRAPE-1, we used an unusually short word format, to make the
hardware as simple as possible. The input coordinates are expressed in a 15-bit fixed-point format.
After subtraction, the result is converted to an 8-bit logarithmic format, in which we use just 3 bits
for the “fractional” part. This format is used until we obtain $1/r^3$. The final accumulation was done in
48-bit fixed point to avoid overflow and underflow. The advantage of a short format like the 8-bit
logarithmic format is that we could use ROM chips to implement complex functions that require two
inputs. Any function of two 8-bit words can be implemented by one ROM chip with a 16-bit address
input. Thus, all operations other than the initial subtraction of the coordinates and the final accumulation
of the force were implemented by ROM chips.
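The ROM trick is easy to mimic in software: concatenating two 8-bit words gives a 16-bit address into a 65536-entry table. The helper names below are hypothetical, and the saturating addition is only an illustrative function, not one of the tables used in GRAPE-1.

```python
def build_rom(func):
    """Tabulate an 8-bit-valued function of two 8-bit inputs as a 64K-entry ROM."""
    rom = bytearray(1 << 16)
    for a in range(256):
        for b in range(256):
            rom[(a << 8) | b] = func(a, b) & 0xFF  # address = a concatenated with b
    return rom

def rom_lookup(rom, a, b):
    """Evaluate the tabulated function with a single memory access."""
    return rom[(a << 8) | b]

# Illustrative content only: saturating addition of two 8-bit words
sat_add = build_rom(lambda a, b: min(a + b, 255))
print(rom_lookup(sat_add, 3, 4))      # 7
print(rom_lookup(sat_add, 200, 100))  # 255
```

Any two-input function, however complicated, costs the same single lookup, which is why a short word format made a ROM-based pipeline practical.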
The drawback of GRAPE-1 was its limited accuracy, which was insufficient for a fairly wide range
of astrophysical simulations, though for many other applications the accuracy of GRAPE-1 turned out
to be sufficient. With GRAPE-2, we used the standard IEEE-754 format (64 bit for the initial subtraction
and accumulation, and 32 bit for all other operations).
GRAPE-3 was our first machine with a custom LSI chip. The number format was again a
combination of fixed-point and logarithmic formats similar to those used in GRAPE-1, but the
implementation of the arithmetic operations was quite different since we could not integrate large
tables into a custom LSI chip. Conversions between fixed and logarithmic formats were implemented
by shifters and a small lookup table, and addition in logarithmic format was implemented by two adders
and one small lookup table. The chip design was done as a joint research project between the University of
Tokyo and Fuji Xerox Corp. The chip was fabricated using a 1 μm design rule by National
Semiconductor. The number of transistors on the chip was 110K. A single chip operates at a 20 MHz clock speed,
offering a speed of about 0.8 Gflops. We designed a printed-circuit board with eight chips, for a
speed of 6.4 Gflops per board. Thus, GRAPE-3 was also our first attempt to integrate multiple pipelines
into a system.
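Addition in a logarithmic format with two adders and one small table can be illustrated with the standard identity log2(a + b) = log2(a) + log2(1 + 2^-(log2(a) - log2(b))) for a >= b. The sketch below is generic: the number of fractional bits, the table size, and all names are assumptions, and signs and zero are ignored. It is not the GRAPE-3 word layout.

```python
import math

FRAC_BITS = 5                      # hypothetical number of fractional bits
SCALE = 1 << FRAC_BITS
TABLE_SIZE = 16 * SCALE            # beyond this difference the correction rounds to 0

# TABLE[d] = round(SCALE * log2(1 + 2**(-d / SCALE)))
TABLE = [round(SCALE * math.log2(1.0 + 2.0 ** (-d / SCALE)))
         for d in range(TABLE_SIZE)]

def log_encode(x):
    """Quantized base-2 logarithm of a positive number."""
    return round(SCALE * math.log2(x))

def log_decode(code):
    return 2.0 ** (code / SCALE)

def log_add(la, lb):
    """Add two positive numbers given in logarithmic format."""
    hi, lo = max(la, lb), min(la, lb)
    d = hi - lo                    # first adder (used as a subtractor)
    corr = TABLE[d] if d < TABLE_SIZE else 0
    return hi + corr               # second adder

print(log_decode(log_add(log_encode(3.0), log_encode(5.0))))  # close to 8
```

Multiplication in this format is a plain integer addition of the codes, which is why the logarithmic representation suits a pipeline dominated by products and powers.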
Also, GRAPE-3 was the first GRAPE machine to be manufactured and sold by a commercial
company. Nearly 100 copies of GRAPE-3 have been sold to more than 30 institutes (more than 20
outside Japan).
With GRAPE-4, we finally integrated a high-accuracy pipeline into one chip. Also, with this chip
we added an additional pipeline to calculate the first time derivative of the force, so that we can
implement high-order time integration algorithms in a simple and efficient way [7]. We could not
fit a full single pipeline into a chip with the technology available at that time. So we designed a
“1/3” version of the pipeline which processes the x, y and z components of the coordinates serially in
three consecutive clock cycles. The chip was fabricated using a 1 μm design rule by LSI Logic. The total
transistor count was about 400K.
For GRAPE-4, we received a rather large grant of about 2 million USD, so we were able to build
a large system consisting of 1728 pipeline chips (36 PCB boards each with 48 pipeline chips).
The GRAPE-4 system operates on a 32 MHz clock, delivering a speed of 1.1 Tflops. Completed in 1995,
GRAPE-4 was the first computer for scientific calculation to achieve a peak speed higher than
1 Tflops. Also, in 1995 and 1996 it was awarded the Gordon Bell Prize for peak performance,
which is given for a real scientific calculation on a parallel computer with the highest performance.
Technical details of the machines from GRAPE-1 to GRAPE-4 can be found in our book [8] and
references therein.
GRAPE-5 [6] was an improvement over GRAPE-3, with a similar but more accurate number
format than that used for GRAPE-3. Also, it integrated two full pipelines which operate on an
80 MHz clock. Thus, a single GRAPE-5 chip offered a speed 8 times that of the GRAPE-3
chip, or the same speed as that of an 8-chip GRAPE-3 board. GRAPE-5 was awarded the 1999 Gordon
Bell Prize for price performance. The GRAPE-5 chip was fabricated with a 0.35 μm design rule by
NEC.
Table 1
History of GRAPE project

Machine    Period           Speed            Notes
GRAPE-1    89/4 to 89/10    120 Mflops       low accuracy
GRAPE-2    89/8 to 90/5     40 Mflops        high accuracy (32/64 bit)
GRAPE-1A   90/4 to 90/10    240 Mflops       low accuracy
GRAPE-3    90/9 to 91/9     14 Gflops        high accuracy
GRAPE-2A   91/7 to 92/5     180 Mflops       high accuracy
HARP-1     92/7 to 93/3     180 Mflops       high accuracy, Hermite scheme
GRAPE-3A   92/1 to 93/7     6 Gflops/board   some 80 copies are used all over the world
GRAPE-4    92/7 to 95/7     1 Tflops         high accuracy, some 10 copies of small machines
MD-GRAPE   94/7 to 95/4     1 Gflops/chip    high accuracy, programmable interaction
GRAPE-5    96/4 to 99/8     5 Gflops/chip    low accuracy
GRAPE-6    97/8 to 01/7     32 Tflops        high accuracy
Fig. 4. The evolution of GRAPE and general-purpose parallel computers. The peak speed is plotted against the year of
delivery. Open circles, crosses and stars denote GRAPEs, vector processors, and parallel processors, respectively.
GRAPE-6 is similarly an improvement over GRAPE-4. Since it is our newest hardware, we will
take a closer look at its architecture in the next section.
Table 1 summarizes the history of GRAPE project. Fig. 4 shows the evolution of GRAPE systems
and general-purpose parallel computers. One can see that evolution of GRAPE is faster than that of
general-purpose computers.
These GRAPE machines, including GRAPE-6, have been applied to a number of astrophysical
problems both by our group and by other researchers worldwide. Since there are too many interesting
results, we do not try to list them. Some of the recent results are summarized in the Proceedings
of IAU Symposium 208 “Astrophysical Supercomputing Using Particles”, held in July 2001.
More information can be found at http://www.astrogrape.org.
3.1. Machines for molecular dynamics (MD)
Classical MD calculation is quite similar to astrophysical N-body simulation, since in both cases
we integrate the orbits of particles (atoms or stars) which interact with other particles through a
simple pairwise force. In the case of the Coulomb force, the force law itself is the same as that of the
gravitational force, and the calculation of the Coulomb force can be accelerated by GRAPE hardware.
However, in MD calculations the calculation cost of the van der Waals force is not negligible, though
the van der Waals force decays much faster than the Coulomb force ($r^{-7}$ compared to $r^{-2}$).
It is fairly straightforward to design hardware which can handle a particle–particle force that is
some arbitrary function of the distance between particles. We approximate the given function by a
table of polynomials. In fact, we use this combination of lookup table and polynomial approximation
for the calculation of $1/r^3$ from $r^2$ in GRAPE hardware. So the actual change in the design of the
hardware is rather minor.
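A table of polynomials can be sketched as follows. The uniform segments, the quadratic order, and the function names are all assumptions made for illustration; the text does not specify how the hardware table is indexed or what polynomial order it uses.

```python
def build_table(func, x_min, x_max, n_segments):
    """Per-segment quadratic coefficients fitted through three samples."""
    h = (x_max - x_min) / n_segments
    table = []
    for s in range(n_segments):
        x0 = x_min + s * h
        f0, f1, f2 = func(x0), func(x0 + 0.5 * h), func(x0 + h)
        # Quadratic in t = (x - x0) / h passing through t = 0, 1/2, 1
        c0 = f0
        c1 = -3.0 * f0 + 4.0 * f1 - f2
        c2 = 2.0 * f0 - 4.0 * f1 + 2.0 * f2
        table.append((c0, c1, c2))
    return table, x_min, h

def evaluate(table_info, x):
    """Look up the segment containing x, then evaluate its polynomial."""
    table, x_min, h = table_info
    s = min(int((x - x_min) / h), len(table) - 1)
    t = (x - x_min) / h - s
    c0, c1, c2 = table[s]
    return c0 + t * (c1 + t * c2)   # Horner form: two multiply-adds

# Example: 1/r^3 as a function of r^2 on the interval [1, 4]
inv_r3 = build_table(lambda r2: r2 ** -1.5, 1.0, 4.0, 64)
print(evaluate(inv_r3, 2.3), 2.3 ** -1.5)
```

Changing the force law only means loading different coefficients, which is the sense in which the hardware change is minor.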
We have designed two machines, GRAPE-2A and MD-GRAPE, along these lines.
GRAPE-2A was built using commercial chips and MD-GRAPE used a custom-designed pipeline chip.
Another difference between astrophysical simulations and MD calculations is that in MD
calculations a periodic boundary condition is usually applied. Thus, we need some way to calculate
Coulomb forces from image particles. The direct Ewald method is rather well suited for
implementation in hardware. In 1991 we developed WINE-1, a pipeline to calculate the wavespace part of the
direct Ewald method. The real-space part can be handled by GRAPE-2A or MD-GRAPE hardware.
In 1995, a group led by Ebisuzaki at RIKEN started to develop MDM, a massively parallel
machine for large-scale MD simulations. Their primary goal is the simulation of protein molecules.
MDM consists of two special-purpose machines, a massively parallel version of MD-GRAPE
(MDGRAPE-2) and one of WINE (WINE-2). The MDGRAPE-2 part consists of 1536 custom chips
with four pipelines each, for a theoretical peak speed of 25 Tflops. The WINE-2 part consists of 2304 custom
pipeline chips, for a peak speed of 46 Tflops.
MDM shared the 2000 Gordon-Bell performance Prize with GRAPE-6. It was also selected as a
finalist for the 2001 Gordon-Bell performance Prize, again along with GRAPE-6.
4. GRAPE-6
In 1997, we started the GRAPE-6 project. It is a 5-year project funded by the Japan Society for the
Promotion of Science (JSPS), and the planned total budget is about 500M JPY.
GRAPE-6 is essentially a scaled-up version of GRAPE-4 [9], with a peak speed of around
100 Tflops. As of the time of writing, a 32 Tflops system with 1024 chips is in operation. The peak
speed of a single pipeline chip is 31 Gflops. In comparison, GRAPE-4 consists of 1728 pipeline
chips, each with 600 Mflops. The increase of a factor of 50 in speed is achieved by integrating
Fig. 5. The GRAPE-6 processor chip.
Fig. 6. The processor board of the GRAPE-6 with 32 processor chips. Four processor chips are mounted on modules, on
which eight memory chips are also mounted on the bottom side. One board houses eight modules.
six pipelines into one chip (the GRAPE-4 chip has one pipeline which needs three cycles to calculate
the force from one particle) and using a 3 times higher clock frequency. The advance of the device
technology (from 1 to 0.25 μm) made these improvements possible. Fig. 5 shows the processor chip
delivered in early 1999. The six pipeline units are visible.
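The quoted speeds are consistent with one another, as the following arithmetic shows. All numbers are taken from the text; the variable names are only for illustration.

```python
from fractions import Fraction

# Per-chip gain of GRAPE-6 over GRAPE-4
pipelines_grape6 = Fraction(6)         # six full pipelines per chip
pipelines_grape4 = Fraction(1, 3)      # one pipeline needing three cycles
clock_ratio = 3                        # about 3 times higher clock frequency
chip_speedup = pipelines_grape6 / pipelines_grape4 * clock_ratio

# System peak speeds from chip counts and per-chip speeds
grape6_peak_gflops = 1024 * 31         # 1024 chips at 31 Gflops each
grape4_peak_mflops = 1728 * 600        # 1728 chips at 600 Mflops each

print(chip_speedup)                    # 54
print(grape6_peak_gflops / 1000)       # 31.744 (Tflops)
print(grape4_peak_mflops / 1e6)        # 1.0368 (Tflops)
```

The factor of 54 from pipelines and clock agrees with the ratio of the quoted chip speeds, 31 Gflops / 600 Mflops, which is about 52.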
Fig. 7. The 32-board, 32-Tflops GRAPE-6 system with its host computer. The host is a cluster of four PCs with 1.7 GHz
Intel P4 processors connected by 100 Mb Ethernet.
Figs. 6 and 7 show the processor board with 32 processor chips and the 32-board system. This
32-board system has a theoretical peak speed of 32 Tflops, and has achieved a sustained speed
of 11.5 Tflops for the simulation of a 1.4-million-body system.
We plan to extend this system to an 80-board, 80-Tflops system by the end of FY 2001. Single-board
systems (4–32 chips) are commercially available.
5. Discussion
5.1. Are special-purpose computers difficult to build?
We have seen that it is possible to develop special-purpose computers which offer price
performance and also absolute performance better than those of general-purpose computers by one or two
orders of magnitude.
However, our GRAPE project is not the only project to develop special-purpose computers, and
yet it is not easy to name other projects which have achieved a similar level of success. We briefly
discuss why.
There are a variety of reasons why a project ends up as a failure. These include:
(1) Long development time: Even though there is enormous inefficiency, the performance of
general-purpose computers is still improving at the rate of a factor of 10 every 5 years.
Therefore, if you spend 5 years developing a machine, you lose a factor of 10 in relative
advantage. Of course, in many cases the designers originally underestimated the development
time, so this problem has been very difficult to avoid.
(2) Too small gain: Even if your machine achieved the advantage of a factor of 10 by the time
of completion, in 5 years it will lose all advantages. So the lifetime of your machine is rather
short. If the advantage is a factor of 10 at the time of the design, the project is guaranteed to
fail.
(3) Too wide applications: This is not necessarily a failure, if high performance is achieved.
However, it is the most common cause of failure, since it leads to both small gain and long
development time.
(4) Too narrow applications or too difficult to use: If there is not much scientific outcome from
your machine, it would not be regarded as a great success even if it achieves very good
performance.
(5) Obsolete technology/design method: Device technology is moving very fast, and so is the design
software/methodology. So, if the technology you used is 3 years old, you have already lost
a relative advantage of a factor of 4.
(6) Untested technology/design method: Even though the advance of the device technology is fairly
predictable, in most cases what the manufacturers claim they can deliver is not delivered
on time. So be careful.
Of course, projects to develop general-purpose computers also suffer from most of the above problems.
The advantage of the special-purpose design is the possibility to make better use of the available
transistors. The disadvantage is the limitations in available design resources (design experts, budget,
access to the latest technology, etc.).
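The arithmetic behind items (1), (2) and (5) can be sketched as follows. The only input taken from the text is the improvement rate of a factor of 10 every 5 years. The smooth exponential growth and the function names are our illustrative assumptions.

```python
import math

# Rate quoted in item (1): general-purpose performance improves
# by a factor of 10 every 5 years.
GP_GAIN = 10.0
GP_PERIOD_YEARS = 5.0


def general_purpose_gain(years):
    """Improvement of general-purpose computers over `years`.

    ASSUMPTION: smooth exponential growth at the quoted rate.
    """
    return GP_GAIN ** (years / GP_PERIOD_YEARS)


def remaining_advantage(initial_advantage, years):
    """Relative advantage left after `years` of general-purpose progress."""
    return initial_advantage / general_purpose_gain(years)


def useful_lifetime_years(advantage_at_completion):
    """Years until the advantage over general-purpose machines drops to 1."""
    return GP_PERIOD_YEARS * math.log10(advantage_at_completion)


if __name__ == "__main__":
    # Item (5): a 3-year-old technology costs about a factor of 4.
    print("loss after 3 years: %.2f" % general_purpose_gain(3))
    # Item (2): a factor-of-10 advantage at design time is gone
    # after a 5-year development.
    print("advantage left    : %.2f" % remaining_advantage(10, 5))
    # A factor-of-100 advantage at completion lasts about 10 years.
    print("lifetime (years)  : %.1f" % useful_lifetime_years(100))
```

Under these assumptions a 3-year lag costs a factor of about 4, as stated in item (5), and a design-time advantage of 10 is fully consumed by a 5-year development, as stated in item (2).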
5.2. Future prospect
We overviewed the technological and economical trends in high-performance scientific computing.
At least for a certain range of problems, a special-purpose computer, or a combination of special- and
general-purpose computers such as our GRAPE systems, offers a real and proven advantage over
traditional general-purpose computers. The relative advantage has been increasing, and will continue
to do so for the next one or two decades, unless some radical change in the design method for
general-purpose computers takes place.
Currently, it seems that general-purpose computers are evolving toward clusters of
commodity PCs. Current microprocessors for commodity PCs offer very high performance at very
low cost, partly because of their low production cost due to mass production, and partly because
the investment for the design of the chip is actually very high. The large volume of production
justifies the high development cost. However, the fact that microprocessors for commodity PCs offer
performance better than any other general-purpose computers on the market does not necessarily
imply that their design is optimal. As we have seen, only a tiny fraction of the available transistors
is used for actual arithmetic operations, and that fraction has been decreasing quite rapidly.
Special-purpose systems, at least in principle, will not suffer from these problems. In practice, however,
the approaches we have taken so far are becoming more and more difficult, because the initial
development cost of the custom LSI goes up as the technology advances. The development cost
goes up for two reasons. The first is that the amount of work to do the logic design,
test design, physical layout and design validation increases as the number of transistors in a chip
increases. In the case of special-purpose systems, the amount of work for the logic and test design,
which we can do in-house, does not increase too rapidly, but physical layout and design validation,
which are the work of the semiconductor company, take a long time and therefore lots of money.
The second reason is that the investment needed to build the semiconductor plant increases rapidly
as the technology advances.
Roughly speaking, for the same physical size of the chip, the amount of money we paid was inversely
proportional to the design rule: around a quarter million USD for 1 µm (GRAPE-4) and around 1
million USD for 0.25 µm (GRAPE-6). In both cases, the size of the chip is around 110 mm². If
this trend continues, the design cost of 0.13 µm would be 2 million USD. The total budget must be
significantly larger than the development cost of the chip, since otherwise the price of a single chip
would be too high. It is not easy to get such a large fund for a project of theoretical study in pure
science.
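The extrapolation above follows from the stated inverse proportionality between cost and design rule. A minimal sketch, anchored on the GRAPE-4 data point and using function and constant names of our own choosing:

```python
# Data point quoted in the text: about 0.25 million USD at a 1 µm design rule
# (GRAPE-4), for a chip of around 110 mm².
REFERENCE_RULE_UM = 1.0
REFERENCE_COST_MUSD = 0.25


def chip_design_cost_musd(design_rule_um):
    """Initial chip cost in million USD for a given design rule in µm.

    ASSUMPTION: cost is exactly inversely proportional to the design rule
    at fixed chip size, which the text states only as a rough trend.
    """
    if design_rule_um <= 0:
        raise ValueError("design rule must be positive")
    return REFERENCE_COST_MUSD * REFERENCE_RULE_UM / design_rule_um


if __name__ == "__main__":
    for rule in (1.0, 0.25, 0.13):
        print("%.2f um -> %.2f million USD" % (rule, chip_design_cost_musd(rule)))
```

The model reproduces the 1 million USD paid for the 0.25 µm GRAPE-6 chip and gives about 1.9 million USD for 0.13 µm, which the text rounds to 2 million USD.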
One possible compromise is the use of field programmable gate array (FPGA) chips as the
building blocks of the pipelined processors. This choice will reduce the initial cost from 1 million
dollars to less than 10,000 dollars (the price of the design software), since the FPGA chip itself is
mass-produced and the design is loaded to an FPGA chip by configuring the switches and lookup
tables in the chip.
Roughly speaking, because of this programmability, the calculation speed achieved by one FPGA
chip is 100 times lower than what can be achieved with a custom LSI of the same size and same
technology. However, this difference can be offset by the possibility of using the most advanced
technology, the possibility to fine-tune the design to individual problems, and most importantly,
a much shorter design cycle time.
To give an example, large FPGA chips at the time of writing (Summer 2001) have the nominal
gate count of around 1 million, which is sufficient to implement the logic for a GRAPE-4 chip (100K
gates), and the clock speed would be around 75 MHz or more. Thus, one FPGA chip can deliver
1.5 Gflops. This might not sound very impressive compared to 31 Gflops of the GRAPE-6 chip or around
2 Gflops of present microprocessors. However, compared to microprocessors, building a massively
parallel machine out of FPGAs is much easier, and we can expect higher execution efficiency for the
same reason that we have achieved higher efficiency on GRAPE hardware.
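The 1.5 Gflops estimate can be reproduced by scaling the 600 Mflops of the GRAPE-4 chip with the clock frequency. The GRAPE-4 clock is not given in this section, so the value of 30 MHz used below is only our assumed round figure and should be replaced by the actual value. The helper names are also ours.

```python
import math

# Figures quoted in the text.
GRAPE4_CHIP_GFLOPS = 0.6     # 600 Mflops per GRAPE-4 chip
GRAPE6_CHIP_GFLOPS = 31.0    # 31 Gflops per GRAPE-6 chip
FPGA_CLOCK_MHZ = 75.0        # expected FPGA clock speed
FPGA_CHIP_GFLOPS = 1.5       # quoted speed of one FPGA chip

# ASSUMPTION: not stated in this section; a round illustrative value.
ASSUMED_GRAPE4_CLOCK_MHZ = 30.0


def scaled_fpga_gflops(grape4_clock_mhz=ASSUMED_GRAPE4_CLOCK_MHZ):
    """Speed of a GRAPE-4 design on an FPGA, scaled linearly with the clock."""
    return GRAPE4_CHIP_GFLOPS * FPGA_CLOCK_MHZ / grape4_clock_mhz


def fpga_chips_to_match(target_gflops):
    """Number of FPGA chips needed to reach a given peak speed."""
    return math.ceil(target_gflops / FPGA_CHIP_GFLOPS)


if __name__ == "__main__":
    print("scaled FPGA speed         : %.2f Gflops" % scaled_fpga_gflops())
    print("FPGAs per GRAPE-6 chip    :", fpga_chips_to_match(GRAPE6_CHIP_GFLOPS))
    print("FPGAs for a 32 Tflops peak:", fpga_chips_to_match(32000.0))
```

With these numbers about 21 FPGA chips are needed to match one GRAPE-6 chip. This is why the FPGA route is attractive mainly for smaller projects, where the low initial cost matters more than the per-chip speed.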
Here again, in the long run we might see the same problem as with general-purpose computers. The
on-chip wiring would ultimately limit the speed and the circuit density. Because of the requirement
of the programmability, as the number of transistors on the FPGA chips increases, a larger and larger
fraction of these chips will be used for wiring. However, for the next several years, this limitation
would not be too severe, and it is possible that some new design philosophy will allow us to make
better use of FPGA chips.
It is at least possible to use FPGAs for “proof of concept” studies, where we demonstrate
that one particular custom design is actually usable and that it achieves better cost-performance than
general-purpose solutions. If that demonstration is successful, a grant large enough to make a custom
LSI might be offered.
To summarize, the initial cost of a large custom LSI might become too high for the
grants we can reasonably expect. However, machines based on FPGAs can be used for
small projects. The cost advantage of FPGAs will not be as large as that of custom LSI chips, but
compared to general-purpose microprocessors, they still offer a large advantage. So we expect to see
many successful projects to apply FPGAs for large-scale computing in the near future. The largest
ones will be done on custom LSIs, but the rest will be done on FPGAs.
Acknowledgements
We would like to thank Daiichiro Sugimoto, Toshikazu Ebisuzaki, Makoto Taiji, Tomoyoshi Ito,
Toshiyuki Fukushige and many others who were involved in the development of the six generations
of GRAPE hardware, and Yoko Funato, Simon Portegies Zwart, Steve McMillan, Piet Hut and
again many others for discussions and collaborations in software development and scientific works
on GRAPE hardware. This work is supported by the Research for the Future Program of Japan
Society for the Promotion of Science (JSPS-RFTP97P01102).
References
[1] J. Barnes, P. Hut, A hierarchical O(N log N) force calculation algorithm, Nature 324 (1986) 446–449.
[2] D.J. Becker, T. Sterling, D. Savarese, D.J.E., U.A. Ranawake, C.V. Packer, Beowulf: a parallel workstation for
scientific computation, in: Proceedings of the 1995 International Conference on Parallel Processing (ICPP), IEEE
Computers Society, Los Alamitos, 1995, pp. 11–14.
[3] L. Greengard, V. Rokhlin, A fast algorithm for particle simulations, J. Comput. Phys. 73 (1987) 325–348.
[4] R.W. Hockney, J.W. Eastwood, Computer Simulation Using Particles, IOP Publishing, Ltd., Bristol, 1988.
[5] T. Ito, J. Makino, T. Ebisuzaki, D. Sugimoto, A special-purpose N-body machine GRAPE-1, Comput. Phys. Comm.
60 (1990) 187–194.
[6] A. Kawai, T. Fukushige, J. Makino, M. Taiji, GRAPE-5: a special-purpose computer for N-body simulations, Publ.
Astronom. Soc. Japan 52 (2000) 659–676.
[7] J. Makino, S.J. Aarseth, On a Hermite integrator with Ahmad-Cohen scheme for gravitational many-body problems,
Publ. Astronom. Soc. Japan 44 (1992) 141–151.
[8] J. Makino, M. Taiji, Scientific Simulations with Special-Purpose Computers: The GRAPE Systems, Wiley,
Chichester, 1998.
[9] J. Makino, M. Taiji, T. Ebisuzaki, D. Sugimoto, GRAPE-4: a massively parallel special-purpose computer for collisional
N-body simulations, Astrophys. J. 480 (1997) 432–446.
[10] T.L. Sterling, J. Salmon, D.J. Becker, D.F. Savarese, How to Build a Beowulf: A Guide to Implementation and
Application of PC Clusters, MIT Press, Cambridge, MA, 1999.
[11] D. Sugimoto, Y. Chikada, J. Makino, T. Ito, T. Ebisuzaki, M. Umemura, A special-purpose computer for gravitational
many-body problems, Nature 345 (1990) 33–35.
