The CP-PACS Project by Iwasaki, Y.
arXiv:hep-lat/9709055v1 17 Sep 1997
The CP-PACS Project
Y. Iwasaki (a),
for the CP-PACS collaboration
(a) Center for Computational Physics and Institute of Physics, University of Tsukuba, Ibaraki 305, Japan
The CP-PACS project is a five-year plan, which formally started in April 1992 and was completed in
March 1997, to develop a massively parallel computer for carrying out research in computational physics with
primary emphasis on lattice QCD. The initial version of the CP-PACS computer, with a theoretical peak speed
of 307 GFLOPS and 1024 processors, was completed in March 1996. The final version, with a peak speed of 614
GFLOPS and 2048 processors, was completed in September 1996 and has been in full operation since October
1996. We describe the architecture, the final specification, the hardware implementation, and the software of the
CP-PACS computer. The CP-PACS has been used for hadron spectroscopy production runs since July 1996. The
performance for lattice QCD applications and for the LINPACK benchmark is given.
1. Introduction
Numerical studies of lattice QCD have devel-
oped significantly during the past decade in par-
allel with the development of computers. Of par-
ticular importance in this regard has been the
construction of dedicated QCD computers (for
reviews see Ref. [1]) and the move of commercial
vendors toward parallel computers in recent
years. In Japan the first dedicated QCD com-
puter was developed in the QCDPAX project [2].
The QCDPAX computer with a peak speed of
14 GFLOPS is actually the fifth computer in the
PAX project [3], which pioneered the develop-
ment of parallel computers for scientific and en-
gineering applications in Japan.
The CP-PACS project was conceived as a suc-
cessor of the QCDPAX project in the early sum-
mer of 1991. The project name CP-PACS is an
acronym for Computational Physics by a Parallel
Array Computer System. The aim of the project
was to develop a massively parallel computer for
carrying out research in computational physics
with primary emphasis on lattice QCD.
The CP-PACS project started in April 1992
and, after five years, is coming to a conclusion in
March 1997. It is therefore timely to give an overview
of the CP-PACS project and the CP-PACS computer
at this workshop, held in the middle of
March 1997. In this article we present an
overview of the chronology and the organization
of the CP-PACS project in Sec. 2, and describe the
details of the CP-PACS computer, including the
architecture, the final specification, the hardware
implementation, and the software, in Sec. 3.
Research areas covered by the CP-PACS
project are given in Sec. 4. In Sec. 5 the performance
of the computer for lattice QCD applications
as well as for the LINPACK benchmark
is given. Physics results obtained on the CP-PACS
computer are presented in other contributions
[4,5]. Sec. 6 is devoted to conclusions.
2. The CP-PACS Project
The CP-PACS project [6] aims at developing a
massively parallel computer designed to achieve
high performance for numerical research of the
major problems of computational physics. It fur-
ther aims at significant progress in the solution
of these problems through the application of the
computer upon completion of its development.
The planning of the project was started in the
summer of 1991. The proposal, made to the Ministry
of Education, Science and Culture, was approved
in the spring of 1992 as one of the projects of
the Ministry’s “Program for New Development
of Academic Research”. The project formally
started in April of 1992, and has received about
2.2 billion yen spread over the five year period
ending in March 1997. The funding comes from a
special allocation of the Grant-in-Aid of the Ministry
of Education, Science and Culture supporting
innovative fundamental research.
Table 1
CP-PACS Project members (letters in parentheses refer to the affiliations listed below the table)
computer science (hardware, software)   computational physics (particle physics, astrophysics, condensed matter)
K. Nakazawa (a)   I. Nakata (e)   Y. Iwasaki (c)   S. Miyama (o)   S. Miyashita (p)
H. Nakamura (b)   Y. Yamashita (e)   A. Ukawa (k)   T. Nakamura (l)   M. Imada (q)
T. Boku (c)   Y. Oyanagi (f)   K. Kanaya (c)   M. Umemura (c)   K. Nemoto (r)
T. Hoshino (d)   T. Kawai (g)   S. Aoki (k)   Y. Nakamoto (c)   A. Oshiyama (k)
T. Shirakawa (d)   M. Mori (h)   T. Yoshie (c)   S. Gunji (c)
K. Wada (e)   Y. Watase (i)   M. Okawa (l)
M. Yasunaga (e)   S. Ichii (j)   N. Ishizuka (c)
S. Sakai (c)   M. Fukugita (m)
H. Kawai (n)
(a) Department of Computer Science, University of Electro-Communications
(b) Center for Advanced Science and Technology, University of Tokyo
(c) Center for Computational Physics, University of Tsukuba
(d) Institute of Engineering Mechanics, University of Tsukuba
(e) Institute of Information Sciences and Electronics, University of Tsukuba
(f) Department of Information Science, University of Tokyo
(g) Department of Physics, Keio University
(h) Department of Engineering, University of Tokyo
(i) Data Handling Division, KEK
(j) Computer Center, University of Tokyo
(k) Institute of Physics, University of Tsukuba
(l) Numerical Theory Division, KEK
(m) Yukawa Institute for Theoretical Physics, Kyoto University
(n) Theory Division, KEK
(o) National Astronomical Observatory
(p) Department of Physics, Osaka University
(q) Institute of Solid State Physics, University of Tokyo
(r) Department of Physics, Hokkaido University
The Center for Computational Physics was
founded in April 1992 at the University of Tsukuba
to carry out the project, as well as to promote re-
search in computational physics and parallel com-
puter science. The Center is an inter-university
facility open to researchers in academic institu-
tions in Japan.
The number of project members, which was
22 when the project started, has increased to 33,
of which 15 are computer scientists and 18 are
physicists, as listed in Table 1. The project
was headed by Y. Iwasaki. The development of
the CP-PACS computer was led by K. Nakazawa.
A unique feature of the project, as is clear from
Table 1, is its emphasis on cross-disciplinary re-
search involving both physicists and computer
scientists. This is a tradition carried over from
the QCDPAX project [2], which is the predeces-
sor and stepping stone for the CP-PACS project.
A close collaboration of researchers from the two
disciplines has been both important and fruitful
in reaching a design for the CP-PACS computer
which best balances the computational needs of
physics applications with the latest of computer
technologies.
Development of a massively parallel computer
requires advanced semiconductor technology. We
discussed the aim of the project with a number of
manufacturers and invited proposals in the period
of 1991–1992. We selected Hitachi Ltd. as the
industrial partner through a formal bidding process
in the early summer of 1992, and we have
worked in close collaboration on the hardware
and software development of the CP-PACS computer.
The fundamental design of the computer
was laid down in 1992, its details were worked out in
1993, the logical design and the physical packaging
design were completed in 1994, and chip fabrication
and assembly of parts started in early
1995. The first stage of the CP-PACS computer,
consisting of 1024 processing units with a peak
speed of 307 GFLOPS, was completed in March
1996. An upgrade to a 2048-unit system with a peak
speed of 614 GFLOPS was completed at the end
of September 1996.
3. CP-PACS Computer
3.1. Architecture
The CP-PACS computer is an MIMD (Multiple
Instruction-streams Multiple Data-streams) parallel
computer with a theoretical peak speed of
614 GFLOPS and a distributed memory of 128
Gbyte. The system consists of 2048 processing
units (PU’s) for parallel floating point processing
and 128 I/O units (IOU’s) for distributed
input/output processing. These units are connected
in an 8×17×16 three-dimensional array by
a three-dimensional crossbar network. The specification
of the CP-PACS computer is summarized
in Table 2.
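The headline figures in this paragraph are consistent with the per-unit numbers quoted elsewhere in this article (a peak of 300 MFLOPS per PU in Table 3, and 64 MByte of DRAM per PU in Sec. 3.6). The short Python check below only illustrates that arithmetic. The function names are ours, and the conversion 1 Gbyte = 1024 Mbyte is an assumption.

```python
# Per-PU figures quoted elsewhere in the article.
PEAK_MFLOPS_PER_PU = 300   # "peak 300 MFLOPS" per PU (Table 3)
MEMORY_MBYTE_PER_PU = 64   # 64 MByte of DRAM per PU (Sec. 3.6)

N_PU = 2048                # processing units
N_IOU = 128                # I/O units
NODE_ARRAY = (8, 17, 16)   # full node array, including the IOU plane


def peak_gflops(n_pu):
    """Theoretical peak speed in GFLOPS for n_pu processing units."""
    return n_pu * PEAK_MFLOPS_PER_PU / 1000.0


def total_memory_gbyte(n_pu):
    """Total distributed memory in Gbyte (ASSUMPTION: 1 Gbyte = 1024 Mbyte)."""
    return n_pu * MEMORY_MBYTE_PER_PU / 1024.0


if __name__ == "__main__":
    print(peak_gflops(1024))          # first-stage system
    print(peak_gflops(N_PU))          # final system
    print(total_memory_gbyte(N_PU))   # distributed memory
    nx, ny, nz = NODE_ARRAY
    print(nx * ny * nz == N_PU + N_IOU)  # PUs plus IOUs fill the 8x17x16 array
```

The 8×17×16 array therefore holds 2176 units: the 8×16×16 block of PU's plus one 8×16 plane of IOU's.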
The basic strategy we have adopted for the de-
sign is the usage of a fast RISC micro-processor
for high arithmetic performance at each node and
a linking of nodes with a flexible network so as to
be able to handle a wide variety of problems in
computational physics. The unique features of
the CP-PACS computer reflecting these goals are
represented by the special node processor archi-
tecture called pseudo vector processor based on
slide-windowed registers (PVP-SW) [7] and the
choice of a three-dimensional Hyper Crossbar net-
work. A well-balanced performance of CPU, net-
work and I/O devices supports the high capability
of CP-PACS for massively parallel processing.
Table 2
Specification of the CP-PACS computer
peak speed              614 GFLOPS (64 bit data)
main memory             128 GB
parallel architecture   MIMD with distributed memory
number of nodes         2048
node processor          HP PA-RISC 1.1 + PVP-SW
  #FP registers         128
  clock cycle           150 MHz
  1st level cache       16 KB (I) + 16 KB (D)
  2nd level cache       512 KB (I) + 512 KB (D)
network                 3-d crossbar
  node array            8 × 17 × 16 ∗
  through-put           300 MB/sec
  latency               2.5 ∼ 3.1 µsec
distributed disks       3.5” RAID-5 disk
  total capacity        1059 GB
software
  OS                    UNIX, micro kernel
  language              FORTRAN, C, assembler
size                    7.0 m (width) × 4.2 m (depth) × 2.0 m (height)
power dissipation       275 kW maximum
∗ including nodes for disk I/O
3.2. Node Processor
Each PU of the CP-PACS has a custom-made
superscalar RISC processor with an architecture
based on PA-RISC 1.1. In large scale computa-
tions in scientific and engineering applications on
a RISC processor, the performance degradation
occurring when the data size exceeds the cache
memory capacity is a serious problem. The PVP-SW
is our solution to this problem, and it maintains
upward compatibility with the PA-RISC
architecture.
A schematic illustration of the PVP-SW architecture
is given in Fig. 1. The Slide Window
mechanism allows the use of a large number
of physical registers, 128 in the case of
CP-PACS, through a logical register window
of 32 registers that slides continuously along the
physical registers. The Preload and Poststore
instructions can be issued without waiting for the
completion of memory access. These features enable
pipelined access to main memory, which is
built from multiple interleaved banks, so that a
long latency for memory access can be tolerated.
Efficient vector processing without degradation
for very long vector loops is thus realized
in spite of the superscalar architecture of the
CP-PACS processor.
[Figure 1: physical registers 0 to 127 and the previous, current and following logical windows (SW registers 0 to 31), each divided into global (0 to g-1) and local (g to 31) registers, with preload from memory and poststore to memory.]
Figure 1. Structure of slide-windowed registers.
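As a rough illustration of the slide-window idea, the Python sketch below maps a logical register number to a physical one. It is not the actual PVP-SW addressing logic. The register counts (128 physical, 32 logical) come from the text, and the global/local split at g follows the labels of Fig. 1. The names `physical_register`, `window_base` and `n_global`, the example value g = 8, and the wrap-around of the window at the end of the physical register file are our assumptions.

```python
N_PHYSICAL = 128  # physical floating-point registers (from the text)
N_LOGICAL = 32    # size of the logical register window (from the text)


def physical_register(logical, window_base, n_global):
    """Map a logical register number (0..31) to a physical one (0..127).

    logical     : register number as seen by an instruction
    window_base : hypothetical offset of the current window, counted
                  within the local part of the physical register file
    n_global    : g in Fig. 1; logical registers 0..g-1 are global
    """
    if not 0 <= logical < N_LOGICAL:
        raise ValueError("logical register must be in 0..31")
    if not 0 <= n_global <= N_LOGICAL:
        raise ValueError("n_global must be in 0..32")
    if logical < n_global:
        # Global registers are visible unchanged from every window.
        return logical
    n_local_physical = N_PHYSICAL - n_global
    # ASSUMPTION: the window wraps around at the end of the register file.
    offset = (window_base + (logical - n_global)) % n_local_physical
    return n_global + offset


if __name__ == "__main__":
    g = 8  # hypothetical number of global registers
    for base in (0, 4, 119):
        mapped = [physical_register(r, base, g) for r in (3, 8, 31)]
        print(f"window_base={base:3d}: logical 3, 8, 31 -> physical {mapped}")
```

Sliding the window (changing `window_base`) lets a loop address fresh physical registers for preloaded data without changing the register numbers encoded in its instructions, which is the property the text relies on.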
3.3. Network
The 2048 processors are arranged in a three-
dimensional 8× 16× 16 array. The Hyper Cross-
bar network is made of crossbar switches in the
x, y and z directions, connected together by an
Exchanger at each of the three-dimensional cross-
ing points of the crossbar array, as illustrated by
a schematic diagram shown in Fig. 2. Each ex-
changer is connected to a PU or IOU. Thus any
pattern of data transfer can be performed with
the use of at most three crossbar switches. Since
the network has a huge switching capacity due to
the large number of crossbar switches, the sus-
tained data transfer throughput in general appli-
cations is very high.
Data transfer on the network is made through
Remote DMA (Remote Direct Memory Access),
in which processors exchange data directly between
their respective user memories with a minimum
of intervention from the operating system.
This leads to a significant reduction in the startup
latency and to a high throughput.
[Figure 2: processing units, IO units with disks, crossbar switches and exchangers arranged along the x, y and z directions.]
Figure 2. Schematic diagram of CP-PACS.
Inter-node communication is made by message
passing. Transfer of data within the network
proceeds via wormhole routing through the ex-
changers. The direction of routing is fixed to
x → y → z to avoid deadlocks. The bandwidth
of each crossbar is 300 Mbyte/sec. The latency,
namely the initial overhead due to hardware and
software combined, for sending and receiving data
is 2.45, 2.83 and 3.09 µsec, respectively, for the
cases of the data transfer through one crossbar
switch in the x direction, two switches in the x
and y directions, and three switches in the x, y
and z directions.
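The fixed x → y → z routing order and the quoted latencies can be combined into a simple cost estimate, sketched below in Python. The latencies and the 300 Mbyte/sec bandwidth are taken from the text. The linear model (latency plus size divided by bandwidth), the reading of 1 Mbyte as 10^6 bytes, and the use of the same latency for any path with the same number of crossbar switches are our assumptions, as are the function names.

```python
# Latency in microseconds for paths through 1, 2 and 3 crossbar switches
# (values quoted in the text for x, x+y and x+y+z transfers).
LATENCY_USEC = {1: 2.45, 2: 2.83, 3: 3.09}
BANDWIDTH_MBYTE_PER_SEC = 300.0  # per crossbar; 1 Mbyte/sec = 1 byte/usec


def route(src, dst):
    """Return the crossbar hops from src to dst in fixed x -> y -> z order.

    src, dst : (x, y, z) node coordinates
    Each hop is (axis name, coordinates reached after that crossbar).
    """
    hops = []
    current = list(src)
    for axis, name in enumerate("xyz"):
        if current[axis] != dst[axis]:
            current[axis] = dst[axis]  # one crossbar covers a whole axis
            hops.append((name, tuple(current)))
    return hops


def transfer_time_usec(src, dst, n_bytes):
    """Estimate transfer time with a simple latency + size/bandwidth model."""
    n_switches = len(route(src, dst))
    if n_switches == 0:
        raise ValueError("source and destination are the same node")
    # ASSUMPTION: latency depends only on the number of switches traversed.
    return LATENCY_USEC[n_switches] + n_bytes / BANDWIDTH_MBYTE_PER_SEC


if __name__ == "__main__":
    print(route((0, 0, 0), (3, 5, 7)))
    print(transfer_time_usec((0, 0, 0), (3, 5, 7), 3000))
```

Because every message resolves its x coordinate first, then y, then z, no cyclic dependency between crossbars can form, which is the usual argument for why a fixed dimension order avoids deadlock.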
The network allows a hardware bisection of PU
arrays in each of the x, y and z directions. Hence
the full system can be divided into up to 8 independent
subsystems.
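The count of 8 follows from one bisection in each of the three directions (2 × 2 × 2). The Python sketch below works this out for the 8 × 16 × 16 PU array. It assumes that a bisection splits a direction into two equal halves; the function names and the partition labelling are hypothetical.

```python
PU_ARRAY = (8, 16, 16)  # PU array dimensions in x, y, z (from the text)


def partition_count(split):
    """Number of subsystems when the flagged directions are bisected."""
    return 2 ** sum(bool(s) for s in split)


def partition_shape(split, shape=PU_ARRAY):
    """Shape of each subsystem (ASSUMPTION: bisection gives equal halves)."""
    return tuple(n // 2 if s else n for n, s in zip(shape, split))


def partition_of(node, split, shape=PU_ARRAY):
    """Hypothetical label (0 or 1 per direction) of the subsystem of a node."""
    return tuple((c // (n // 2)) if s else 0
                 for c, n, s in zip(node, shape, split))


if __name__ == "__main__":
    all_three = (True, True, True)
    print(partition_count(all_three))          # 8 subsystems
    print(partition_shape(all_three))          # (4, 8, 8), i.e. 256 PUs each
    print(partition_of((5, 3, 12), all_three)) # (1, 0, 1)
    print(partition_shape((False, True, False)))  # bisect y only
```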
3.4. Distributed Disks
The distributed disk system of CP-PACS is
connected by a SCSI-II bus to 128 IOU’s on the 8 × 16 plane at
the end of the y direction of the Hyper Crossbar
network. RAID-5 disks are used
for fault tolerance. The IOU’s handle parallel file
I/O requests issued by the PU’s in an efficient
and distributed way using Remote DMA through
the Hyper Crossbar network.
Figure 3. Floor-plan of the CPU chip
3.5. Connection to the Front End
The HIPPI connection to the front host is at-
tached to one of the IOU’s. A special FTP pro-
tocol has been developed for a high speed file
transfer between the distributed disk system of
CP-PACS and the disk storage of the front host.
The peak throughput is 100 Mbytes/sec and the
effective throughput is about 65 Mbytes/sec in
the case when the data with a size of 512 Mbyte
are sent from CP-PACS to the front host or from
the host to CP-PACS.
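To put these throughput figures in perspective, the following Python lines estimate how long the quoted 512 Mbyte transfer takes at the effective and at the peak rate. This is plain arithmetic on the numbers in the text; the function name is ours.

```python
PEAK_MBYTE_PER_SEC = 100.0       # peak throughput of the connection
EFFECTIVE_MBYTE_PER_SEC = 65.0   # measured effective throughput
FILE_SIZE_MBYTE = 512.0          # data size used for the measurement


def transfer_seconds(size_mbyte, rate_mbyte_per_sec):
    """Time in seconds to move size_mbyte at the given rate."""
    return size_mbyte / rate_mbyte_per_sec


if __name__ == "__main__":
    effective = transfer_seconds(FILE_SIZE_MBYTE, EFFECTIVE_MBYTE_PER_SEC)
    ideal = transfer_seconds(FILE_SIZE_MBYTE, PEAK_MBYTE_PER_SEC)
    print(f"at effective rate: {effective:.1f} s")
    print(f"at peak rate:      {ideal:.2f} s")
    print(f"fraction of peak:  {EFFECTIVE_MBYTE_PER_SEC / PEAK_MBYTE_PER_SEC:.0%}")
```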
3.6. Hardware Implementation
The CPU chip is fabricated using 0.3 micron
CMOS semiconductor technology, with the size
being 15.7mm × 15.7mm. Fig. 3 shows the floor
plan of the chip. The PVP-SW feature is imple-
mented with 128 floating point registers occupy-
ing the top left block together with floating point
execution units.
The CPU, the storage controller (SC) and the
network interface adapter (NIA) are mounted in-line
on a ceramic multi-chip module of size 5.7cm
× 7.2cm, which is shown in Fig. 4: the left one is
the CPU, the central one is the SC and the right
one is the NIA. The SC and NIA chips are fabricated
using 0.5 micron CMOS gate-array technology.
The twelve pieces surrounding them are
the second-level cache memory chips.
Figure 4. Ceramic multichip module of CPU
Figure 5. One board consists of eight CPU units
Eight PU modules together with their DRAM
memory are mounted on a board of size 45.6cm
× 62.5cm as shown in Fig. 5. The central piece
of each of the eight sections is the PU module,
now with fins for air-cooling. The other
white pieces are main memory address/data control
units. The black pieces are DIMM modules of 4
Mbit DRAM, 64 MByte for each PU. Each board
has two more chips for the crossbar switches in
the x direction, and one chip for the clock
distributor.
Sixteen PU boards and one IOU board are
placed vertically on a backplane, and two
backplanes, one on top of the other, are housed in a
cabinet. A crossbar switch in the y direction is
mounted on the backplane in the cabinet. Crossbar
switches in the z direction are mounted on
separate boards, which are housed in separate
cabinets. A picture of the CP-PACS computer
is shown in Fig. 6.
Figure 6. Outlook of the CP-PACS computer
[Figure 7: floor plan showing the PU cabinets within an area of 7 m × 4.2 m.]
Figure 7. Floor-plan of the CP-PACS computer
A schematic floor plan of the cabinets is
shown in Fig. 7. The squares labeled “PU” rep-
resent cabinets housing the PU and IOU’s, and
those labeled “Z” are for the crossbar switches
in the z direction. The cabinets labeled “IOA”
contain I/O adapters. The RAID-5 distributed
disk system, placed a few meters from the CP-
PACS, is connected to the IOU’s by a SCSI-II
bus through adapters in the IOA cabinets. The
system is cooled by air drawn in from beneath the
cabinets.
3.7. Software of the CP-PACS
3.7.1. Operating System
The CP-PACS computer runs under the UNIX
OSF/1 operating system. Each node processor,
however, carries only a kernel based on Mach
3.0 in order to save memory for user applica-
tion programs and to avoid performance degrada-
tion. The kernel handles memory control, inter-
node communication, process scheduling, inter-
rupt handling and I/O. The full UNIX interface
and file server functions are implemented on the
IOU’s. One of the IOU’s, named the SIOU, controls
the whole system through the network.
The operating system has several new functions
added for parallel processing: software partitioning
of the processor array, so that independent
programs may be run on different partitions, and
the generation of processes over a user-specified
number of nodes to execute a parallel program.
The file system is logically structured to form
a single tree for the entire CP-PACS computer.
The file sets required to execute a single job can
be distributed over the disks connected to the par-
allel IOU’s so as to reduce I/O overheads. The
logical and physical mapping of the file system is
automatically controlled by the operating system.
3.7.2. Programming Environment
FORTRAN90, C, C++ and assembly language
are available for programming on the CP-PACS
computer. Assembler code can be included as
a subroutine in a FORTRAN or C code in or-
der to maximize the performance. Remote DMA
data transfer through the Hyper Crossbar net-
work is made by calling special library routines
for communications. FORTRAN90 and C com-
pilers generate assembler codes which incorporate
the PVP-SW enhancement, using the technique
of modulo scheduling and register coloring.
The Real-Time Performance Monitor allows an
on-line check of the performance of the CP-PACS
in applications. Various data, including the flops
of each CPU and the busy rate of the network,
can be collected at regular intervals and
displayed graphically on terminals.
[Figure 8: CP-PACS, front-end host computer, disk storage, disk array and MO disk library, CMT library, small-scale CP-PACS, QCDPAX, workstation cluster, workstations, graphic workstation, VTR, color printer and gateway, linked by FDDI and Ethernet to the campus LAN.]
Figure 8. The computing system at the Center
for Computational Physics.
3.8. Front End and Mass Storage
The computing system at the Center for Com-
putational Physics is shown in Fig. 8. The CP-
PACS computer is connected by a HIPPI channel
and Ethernet to the front host, which in turn is
connected to the disk storage (350 GByte) and a
tape archive (780 GByte). The front host is a vec-
tor computer with a peak speed of 256 MFLOPS
and 1 GByte of main memory. Job requests for
the CP-PACS are submitted through the front
host using NQS. Data I/O between the disk stor-
age and the distributed disk system of the CP-
PACS is made through the HIPPI channel. Out-
put data files are sent back to the disk storage at
the termination of each job request.
The front host has a disk storage of 350 GByte
connected by multiple channels to achieve a high
data transfer throughput. The front host is also
connected to a magnetic cartridge tape library
which holds 980 cartridges, each with a capacity
of 800 MByte.
The Center computing facility includes a work-
station cluster connected by a high speed switch.
One of the workstations functions as a file server
accessing a RAID-5 disk system with a total ca-
pacity of 89 GByte.
The QCDPAX, which was developed at the University
of Tsukuba by the QCDPAX project (1987-1990),
is also a part of the system. It has been
in continuous operation since completion for the
numerical simulation of lattice QCD.
The computing facilities of the Center are con-
nected by a LAN consisting of an FDDI loop and
Ethernet, which in turn is connected to the Uni-
versity of Tsukuba campus network.
4. Research Areas in Computational Physics
In computational physics the project aims to
use the CP-PACS computer for carrying out
research in the following three areas: particle
physics, condensed matter physics and astro-
physics.
A major goal of the project is to significantly
advance the numerical study of lattice QCD in
particle physics. Large-scale numerical simula-
tions will be pursued with the CP-PACS com-
puter in order to verify the theory and to ex-
tract new physical predictions. Since the CP-
PACS computer with 1024 nodes was completed,
hadron spectroscopy calculations in the quenched
approximation as well as in full QCD have been
intensively performed and physics results are re-
ported at this workshop [4,5].
Important problems in condensed matter
physics, such as strongly interacting electron systems,
high-temperature superconductivity and first-principles
calculations of material properties, and
those in astrophysics, such as the formation of
galaxies and stellar/planetary systems and
gravitational collapse, will also be pursued with
the CP-PACS computer. Preparations for astrophysics
and condensed matter physics applications
have been started, and will gradually expand
in time.
Table 3
Performance for lattice QCD programs (peak 300 MFLOPS per PU)
program                      phase           time   MFLOPS/PU   coding
red/black MR solver          calculation      77%      191      assembler + Fortran
for Wilson quark matrix      communication    23%       -
                             sustained       100%      148
red/black MR solver          calculation      84%       99      Fortran
for Wilson quark matrix      communication    16%       -
                             sustained       100%       84
conjugate gradient solver    calculation      90%      139      Fortran
for Kogut-Susskind quark     communication    10%       -
matrix                       sustained       100%      125
Heat bath Monte Carlo        calculation      96%      100      Fortran
program                      communication     4%       -
for SU(3) gauge theory       sustained       100%       95
Over-relaxation              calculation      91%      156      Fortran
program                      communication     9%       -
for SU(3) gauge theory       sustained       100%      142
Hybrid Monte Carlo           calculation      74%      151      Fortran
program                      communication    26%       -
for full QCD                 sustained       100%      112
5. Performance
Our codes for large-scale simulations in lattice
QCD have been written in Fortran 90
with libraries for data communication. The Fortran
compiler, which has been newly developed
to incorporate the PVP-SW feature, produces efficient
object code, typically achieving 90 – 150
MFLOPS per node, depending on the structure
of the do-loops. We have further developed
hand-optimized assembler code for the core part of the
red/black solver of the Wilson quark matrix. In
this case the performance reaches 191 MFLOPS
per node, which is about 64 % of the peak speed
(see Table 3). Even when the overhead due to
data communication is included, the sustained
speed in this case is 148 MFLOPS, which is about
half of the peak speed. The performance for
typical application programs in lattice QCD is
shown in Table 3.
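One way to read Table 3 is that no floating point work is done during communication, so the sustained speed should be close to the calculation speed multiplied by the fraction of time spent in calculation. The Python sketch below checks this reading against the quoted numbers. The interpretation and the function name are ours; the data are copied from Table 3.

```python
PEAK_MFLOPS = 300  # peak speed per PU (Table 3)

# (program, calculation share in %, calculation MFLOPS, quoted sustained MFLOPS)
TABLE3 = [
    ("red/black MR, Wilson (assembler + Fortran)", 77, 191, 148),
    ("red/black MR, Wilson (Fortran)", 84, 99, 84),
    ("conjugate gradient, Kogut-Susskind", 90, 139, 125),
    ("heat bath Monte Carlo, SU(3)", 96, 100, 95),
    ("over-relaxation, SU(3)", 91, 156, 142),
    ("hybrid Monte Carlo, full QCD", 74, 151, 112),
]


def estimated_sustained(calc_percent, calc_mflops):
    """Sustained MFLOPS if communication time contributes no flops."""
    return calc_percent * calc_mflops / 100.0


if __name__ == "__main__":
    for name, percent, calc, quoted in TABLE3:
        estimate = estimated_sustained(percent, calc)
        of_peak = 100.0 * quoted / PEAK_MFLOPS
        print(f"{name:44s} estimate {estimate:6.1f}  "
              f"quoted {quoted:4d}  ({of_peak:4.1f}% of peak)")
```

The estimates agree with the quoted sustained speeds to within about 1 MFLOPS, which is the size of the error expected from percentages rounded to whole numbers.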
We have also measured the performance for the
LINPACK benchmark. The results are summa-
rized in Table 4. The sustained speed for the case
of 2048 PU’s is 368.2 GFLOPS, which is 59.9%
of the theoretical peak speed.
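The efficiency column of Table 4 is simply Rmax divided by Rpeak. The Python sketch below recomputes it for a few rows and also compares the 2048-processor result with 2048 times the single-processor Rmax. That second ratio is our own illustrative measure, not one reported in the text, and it compares runs with different problem sizes (Nmax).

```python
# (number of processors, Rmax in GFLOPS, Rpeak in GFLOPS), from Table 4
TABLE4 = [
    (1, 0.1969, 0.3),
    (16, 3.022, 4.8),
    (256, 46.81, 76.8),
    (1024, 186.5, 307.2),
    (2048, 368.2, 614.4),
]


def efficiency_percent(rmax, rpeak):
    """Rmax as a percentage of the theoretical peak Rpeak."""
    return 100.0 * rmax / rpeak


def scaled_efficiency(n_procs, rmax, rmax_single):
    """Rmax relative to n_procs copies of the single-processor Rmax."""
    return rmax / (n_procs * rmax_single)


if __name__ == "__main__":
    rmax_single = TABLE4[0][1]
    for n_procs, rmax, rpeak in TABLE4:
        print(f"{n_procs:5d} procs: {efficiency_percent(rmax, rpeak):5.1f}% of peak, "
              f"scaled efficiency {scaled_efficiency(n_procs, rmax, rmax_single):.3f}")
```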
6. Conclusions
We have been able to develop a massively parallel
computer with a peak speed of 614 GFLOPS
through a very effective collaboration of computer
scientists, physicists and a vendor. Throughout
the development phase we held a joint meeting
at least once a month, discussing every aspect
of the CP-PACS computer from the architectural
design to the details of the hardware implementation.
The CP-PACS computer achieves a high performance
of 40 - 50 % of the peak speed for lattice
QCD application programs. The machine is very
stable and we are obtaining interesting results on
the hadron spectrum in quenched QCD as well
as in full QCD. Preparations for astrophysics and
condensed matter physics applications have also
started.
Table 4
Performance of LINPACK benchmark
No. of    Rmax       Nmax      N1/2      Rpeak      Rmax/Rpeak
Procs.    (GFLOPS)   (order)   (order)   (GFLOPS)   (%)
1         0.1969     2340      360       0.3        65.6
2         0.3873     3240      600       0.6        64.5
4         0.7704     4680      960       1.2        64.2
8         1.527      6480      1440      2.4        63.6
16        3.022      9360      2160      4.8        62.95
32        6.022      12960     3360      9.6        62.7
64        12.0       18720     4800      19.2       62.5
128       23.9       25920     6720      38.4       62.2
256       46.81      37440     9600      76.8       61.0
512       93.99      51840     15360     153.6      61.2
1024      186.5      74880     21120     307.2      60.7
2048      368.2      103680    30720     614.4      59.9
ACKNOWLEDGEMENTS
I would like to thank the members of the CP-
PACS project, in particular, K. Nakazawa and
A. Ukawa for valuable discussions. I also would
like to thank Hitachi Ltd. for a close collabora-
tion on the development of the hardware as well
as the software of the CP-PACS computer. This
work is supported in part by the Grant-in-Aid of
the Ministry of Education, Science and Culture
(No. 08NP0101).
REFERENCES
1. N. Christ, Nucl. Phys. B (Proc. Suppl.) 9
(1989) 549; R. Tripiccione, ibid. 17 (1990) 137;
N. Christ, ibid. 20 (1991) 129; D. Weingarten,
ibid. 26 (1992) 126; E. Marinari, ibid. 30 (1993)
122; Y. Iwasaki, ibid. 34 (1994) 78; J.C. Sexton,
ibid. 47 (1996) 236.
2. Y. Iwasaki, T. Hoshino, T. Shirakawa, Y. Oyanagi
and T. Kawai, Comp. Phys. Comm. 49
(1988) 449.
3. T. Hoshino, PAX Computer, High Speed
Parallel Processing and Scientific Computing
(Addison-Wesley, New York, 1989).
4. T. Yoshie’s contribution in these proceedings.
5. K. Kanaya’s contribution in these proceedings.
6. Y. Iwasaki, Nucl. Phys. B (Proc. Suppl.) 34 (1994)
78; A. Ukawa, ibid. 42 (1995) 194; Y. Iwasaki,
ibid. 53 (1997) 1007.
7. H. Nakamura, H. Imori, K. Nakazawa, T.
Boku, I. Nakata, Y. Yamashita, H. Wada and
Y. Inagami, Proc. of International Conference
on Supercomputing ’93, 298 (1993); H.
Nakamura, K. Nakazawa, H. Li, H. Imori, T.
Boku, I. Nakata and Y. Yamashita, Proc. of
27th Hawaii International Conference on System
Sciences, 368 (1994).
