Status of the 0.8 Teraflops Supercomputer at Columbia by Arsenin, I. et al.
hep-lat/9509075, 21 Sep 95
CU-TP-710
Status of the 0.8 Teraflops Supercomputer at Columbia

I. Arsenin (a), D. Chen (a), N. Christ (a), R. Edwards (d), A. Gara (b), S. Hanson (c),
C. Jung (a), A. Kahler (a), A.D. Kennedy (d), G. Kilcup (e), Y. Luo (a), C. Malureanu (a),
R. Mawhinney (a), J. Parsons (b), J. Sexton (f), C. Sui (a), and P. Vranas (a, †)
(a) Dept. of Physics, Columbia University, New York, NY 10027
(b) Nevis Laboratory, Columbia University, Irvington on Hudson, NY 10533
(c) Fermilab, P.O. Box 500, Batavia, IL 60510
(d) SCRI, Florida State University, Tallahassee, FL 32306-4052
(e) Dept. of Physics, Ohio State University, 174 West 18th Ave, Columbus, Ohio 43210
(f) Dept. of Mathematics, Trinity College, Dublin 2, Republic of Ireland
The first stage in the construction of the 0.8 Teraflops Supercomputer at Columbia, a working, two node parallel
computer, has been successfully completed. The next stage, a 512 node, 26 Gigaflops prototype, is in its final
construction phase. A general description and current status of the hardware and software is presented.
1. Introduction
The 0.8 Teraflops Supercomputer [1] consists of 16,384 nodes connected in a
16^3 × 4 four-dimensional mesh. Each node delivers a peak speed of 50 Mflops,
has 0.5 Mwords of DRAM, and 25/50 MHz bit-serial communications. The machine
is single precision, except for global sums that can be of arbitrary
precision. The cost is 3 million U.S. dollars.
Each node is mounted on a daughterboard and consists of a Texas Instruments
Digital Signal Processor (DSP) chip, DRAM chips, and a custom made chip
called the Node Gate Array (NGA) that provides an efficient memory and
communication interface. Sixty-four nodes are housed on a motherboard. Eight
motherboards are plugged into a backplane and are housed in a crate that
provides power and cooling. Thirty-two crates are cabled together to make up
the machine (8 racks of 4 vertically stacked crates).
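The figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are hypothetical, and the totals are derived from the per-node numbers given in the text rather than taken from any separate specification.

```python
# Figures quoted in the text; the names themselves are hypothetical.
NODES_PER_MOTHERBOARD = 64
MOTHERBOARDS_PER_CRATE = 8
CRATES = 32
PEAK_MFLOPS_PER_NODE = 50
DRAM_MWORDS_PER_NODE = 0.5
MESH = (16, 16, 16, 4)  # the 16^3 x 4 four-dimensional mesh


def machine_summary():
    """Derive machine-wide totals from the per-node figures."""
    nodes_from_packaging = NODES_PER_MOTHERBOARD * MOTHERBOARDS_PER_CRATE * CRATES
    nodes_from_mesh = 1
    for extent in MESH:
        nodes_from_mesh *= extent
    return {
        "nodes_from_packaging": nodes_from_packaging,
        "nodes_from_mesh": nodes_from_mesh,
        "peak_teraflops": nodes_from_packaging * PEAK_MFLOPS_PER_NODE / 1e6,
        "total_dram_mwords": nodes_from_packaging * DRAM_MWORDS_PER_NODE,
    }


def prototype_peak_gigaflops(nodes=512):
    """Peak speed of an n-node prototype, in Gigaflops."""
    return nodes * PEAK_MFLOPS_PER_NODE / 1e3


if __name__ == "__main__":
    print(machine_summary())
    print(prototype_peak_gigaflops())
```

Both ways of counting give 16,384 nodes, the peak comes out at about 0.82 Teraflops, and the 512 node prototype at 25.6 Gigaflops, which the text rounds to 26.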
2. Programming Environment
A well supported programming environment will be available to the user. The
portions of the code that are not time critical can all be written in C,
using the C compiler that accompanies the Texas Instruments DSP and a few
C-level routines that will perform the communications and, if needed, improve
the memory access timings. As with most machines, the time critical portions
of the code must be "hand-coded" for optimal performance. This can be done by
using the standard Texas Instruments assembler and the memory mapped NGA. The
Wilson and staggered inverters have already been "hand-coded" and they
sustain more than 30% of the peak speed on 2^4 lattices per node.
[Footnote: Research was supported by the U.S. Dept. of Energy.]
[Footnote †: Speaker]
Code development tools include the Texas Instruments DSP debugging tools that
can be used to debug the serial portion of the code. A debugging kernel that
can be used to debug the parallel part of the code is being planned. Also,
few-node machines dedicated to code development will be built.
Finally, the lattice sizes directly available to the user are multiples of
16^3 × 4. Other sizes can be achieved at the software level using the
pass-through capability of the NGA or at the hardware level by recabling the
machine.
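To make the statement about lattice sizes concrete, the following sketch maps a global lattice site to a node of the 16^3 × 4 mesh. It assumes a simple block decomposition in which every node holds an equal sub-lattice; the text does not spell out the actual data layout, so the function names and the decomposition itself are assumptions made for illustration.

```python
MESH = (16, 16, 16, 4)  # node mesh quoted in the text


def local_extent(lattice, mesh=MESH):
    """Sub-lattice held by each node, assuming an even block decomposition."""
    if any(l % m for l, m in zip(lattice, mesh)):
        raise ValueError("lattice extents must be multiples of the mesh extents")
    return tuple(l // m for l, m in zip(lattice, mesh))


def site_to_node(site, lattice, mesh=MESH):
    """Return (node coordinate, offset inside that node) for a global site."""
    local = local_extent(lattice, mesh)
    node = tuple(x // n for x, n in zip(site, local))
    offset = tuple(x % n for x, n in zip(site, local))
    return node, offset


if __name__ == "__main__":
    lattice = (32, 32, 32, 8)
    print(local_extent(lattice))                # (2, 2, 2, 2): a 2^4 lattice per node
    print(site_to_node((5, 0, 31, 7), lattice))
```

With a 32^3 × 8 lattice each node holds a 2^4 sub-lattice, the per-node size quoted for the hand-coded inverters. A lattice whose extents are not multiples of the mesh is rejected here, which corresponds to the cases that need the pass-through capability or recabling.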
3. Hardware
1) NGA: The most important component of the machine is the Node Gate Array
(NGA). This is the device that links all other components together in an
efficient way. It provides interfaces to the DSP, the DRAM, and each of the
eight nearest neighbors via eight bit-serial lines. It is an Application
Specific Integrated Circuit (ASIC), about 1.25 × 1.25 inches in size, and has
208 pins. It consists of 260,000 transistors (as many as in an 80286 chip).
For more details regarding the design of the NGA the reader is referred to
[1].
The NGA was entirely designed and tested on a PC and a Sparc-10 workstation
at Columbia, using the Viewlogic software tools. Software models of the DSP
and DRAM, available from Logic Modeling Co., were "connected" with the
software model of the NGA to provide a fully operational model of a single
node. In this model each serial line along one direction was connected with
the one in the opposite direction.
In order to test the NGA, a quarter of a million cycles of code, including
conjugate gradient inverters for staggered and Wilson fermions, were run.
This code was first run on the single node model using the Viewlogic
simulator to test the logic of the design. Next, the design was compiled to
an actual circuit of gates. The test code was run again using the simulator,
but this time to test the hardware operation of this circuit. At this stage
the software simulates the operation of the actual chip as if it were already
manufactured. In particular, for a given set of external parameters, such as
temperature, voltage and manufacturing quality, the simulator reproduces the
delays the signals encounter as they propagate through the circuit. The chip
must work over a range of these parameters since in reality they may vary.
Collectively this range is parametrized by a quantity called derating. The
chip was tested for three deratings covering the expected range of
parameters.
To understand one of the main sources of malfunctions that occurred as the
derating was varied, consider a signal propagating in the circuit. At the
rising edge (re) of a clock cycle, this signal is expected to have arrived at
a given point in the circuit where an "edge sensitive" device such as a
flip-flop is located. The flip-flop sets its output (Q) for the whole cycle
to have the same value as the value its input had at the beginning of the
cycle. A situation where a signal is on time, or is "missed" because it
arrived too early or too late, is shown in Fig. 1. Since a signal is a
combination of many other signals, the critical path that caused the signal
to arrive too late or too early had to be found in order to identify and fix
the cause of the problem. Because of the complexity of the circuit and the
fact that the simulator needed about a second to simulate just a few chip
cycles, this was a non-trivial and time consuming task.
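[Figure 1: timing diagram of a flip-flop with input D, output Q and clock,
showing a signal that arrives too early, on time, and too late relative to
the rising edge (re) of the clock. Truth table: D = 0 gives Q(t+1) = 0;
D = 1 gives Q(t+1) = 1.]
Figure 1. Signal arriving at a flip-flop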
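The timing failure illustrated in Fig. 1 can be sketched as a simple window check. This is a minimal illustration, not the procedure used by the simulator: the setup and hold times, the path delays, and the modelling of derating as a multiplicative factor on the delay are all assumptions introduced here. Only the 20 ns cycle comes from the text.

```python
def classify_arrival(path_delay_ns, period_ns=20.0, setup_ns=2.0, hold_ns=1.0):
    """Classify a signal launched at one rising edge and captured at the next.

    setup_ns and hold_ns are hypothetical values. The signal must not change
    the flip-flop input before the hold time of the launching edge has passed
    ("too early"), and it must be stable setup_ns before the next edge
    ("too late").
    """
    if path_delay_ns < hold_ns:
        return "too early"
    if path_delay_ns > period_ns - setup_ns:
        return "too late"
    return "on time"


def sweep_deratings(nominal_delay_ns, deratings):
    """Scale a nominal path delay by each derating factor (assumed model)."""
    return {d: classify_arrival(nominal_delay_ns * d) for d in deratings}


if __name__ == "__main__":
    # A path that is fine at nominal conditions can fail at a slow corner.
    print(sweep_deratings(12.0, [0.5, 1.0, 1.6]))
```

In this toy model a 12 ns path is on time for the two faster deratings and too late for the slowest one, which is the kind of derating-dependent failure that forces a search for the critical path.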
In order for the chip to be manufactured, the final design must be
accompanied by a file containing test vectors. A test vector is a collection
of input signals and the corresponding outputs they produce. Test vectors are
used by the manufacturer to test the proper operation of the device as it is
transcribed from a software model to an actual chip. The test vectors were
produced by running the quarter million cycles worth of code using the
simulator. However, these test vectors were not in accordance with the
manufacturer's requirement that in every 20 ns cycle there must be a 5 ns
window (called the strobe window) where no output signal is changing.
This problem was fixed by using the fact that a set of output signals that go
to a peripheral device is important only when a corresponding output control
signal indicates that the peripheral device should "pay attention". By
design, these output signals do not change for a few ns before this (setup
time). The NGA has three types of control signals: one for the DSP, one for
the DRAM and one for the serial communication signals. For each control
signal a strobe window can be created by keeping the corresponding output
signals only during the setup time. The rest of the time these signals are
not important and can therefore be neglected. Still, this is not enough
because the three control signals occur at different points in the cycle and
therefore their corresponding strobe windows do not overlap. To avoid this
final problem the test vectors were split into three groups, one for each set
of signals. Special software was developed to analyze and manipulate the test
vectors produced by the simulator and bring them to their final form of three
groups, each with its own strobe window.
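The strobe-window requirement and the effect of splitting the test vectors into groups can be illustrated with a small check. This is only a sketch of the idea, not the special software mentioned above: the transition times are invented, and the function name and the way a cycle is represented are assumptions. The 20 ns cycle and 5 ns window are the figures from the text.

```python
def has_strobe_window(transition_times_ns, cycle_ns=20.0, window_ns=5.0):
    """True if some window_ns-long stretch of the cycle has no output transitions."""
    edges = sorted(t for t in transition_times_ns if 0.0 <= t <= cycle_ns)
    points = [0.0] + edges + [cycle_ns]
    return any(later - earlier >= window_ns
               for earlier, later in zip(points, points[1:]))


if __name__ == "__main__":
    # Hypothetical output transition times within one cycle, per signal group.
    groups = {
        "dsp": [2.0, 17.0],
        "dram": [6.0, 9.0],
        "serial": [13.0],
    }
    all_outputs = [t for times in groups.values() for t in times]
    print(has_strobe_window(all_outputs))  # False: no quiet 5 ns stretch
    print({name: has_strobe_window(times) for name, times in groups.items()})
```

With all outputs considered together no quiet 5 ns stretch exists, while each group taken on its own has one, which is why one set of vectors per group satisfies the requirement.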
The design and testing of the NGA have been completed and several chips have
been manufactured and delivered by the Atmel Corporation.
2) Daughterboard: The NGA, the DSP, and the DRAM are mounted on the
daughterboard. The daughterboard is a 2.6 × 1.75 inch, 6-layer custom made
printed circuit board. Several have been manufactured and delivered.
3) Two node prototype: A prototype motherboard that houses two daughterboards
was designed as an intermediate testing step. It is a 4-layer custom made
printed circuit board. Several have been manufactured and delivered. Using
this board, two daughterboards and some basic software (to be discussed
later), a two node prototype parallel computer was built and was operational
at the beginning of August. The careful design and exhaustive testing of the
various components resulted in a prototype that was basically free of errors
and worked (running the staggered conjugate gradient algorithm) within one
week from the day its assembly began.
At present seven nodes have been assembled and have passed an exhaustive set
of tests. In particular, the code that produced the test vectors was run in a
tight loop for several hours with no errors. All tests were run at the high
serial communication speed of 1.2 Mwords/sec, at room temperature and with no
air blowing at the node. The peak power consumption of a single node was
measured to be less than 3.5 Watts. Furthermore, all nodes also ran
successfully with a 64 MHz clock (28% faster than the 50 MHz design speed).
4) Motherboard: The motherboard is a 14.5 × 20.5 inch, 8-layer custom made
printed circuit board that houses 64 nodes. One node, called node 0, is
mounted directly on the motherboard. The remaining 63 are mounted on
daughterboards and are plugged into the motherboard using SIMM sockets. Node
0 is special in that it has more memory (4 Mwords), is connected to a PROM
from which it will boot, and controls two SCSI chips and a Programmable Array
Logic (PAL) switch that connect it with the control/diagnostic network (to be
described later). The motherboard schematic is complete and the board is
currently being laid out.
5) Backplane: The backplane is a 14.5 × 24 inch, 12-layer custom made printed
circuit board with sockets to plug in eight motherboards. It provides each
motherboard with power and connections to all necessary signals. The design
has been completed and the board is currently being manufactured.
6) Crate: The crate is a cabinet that contains the 8 motherboards, the
backplane and power supplies at the rear, and, at the bottom when mounted in
a rack, muffin fans and a cold water radiator. It is currently being
manufactured by our supplier.
4. Networks
There are two independent networks:
1) The physics network connects all nodes in a 16^3 × 4 hypercubic mesh using
bit-serial lines. It is software programmable for a 25/50 MHz (0.6/1.2
Mwords/sec) transmission rate. This network will be used by the physics code
for nearest neighbor communication. It will also be used to "unload" lattices
from the whole machine to the node 0 of each motherboard and from there via a
SCSI link to disk. The proper operation of this network was tested using the
prototype two node computer, where the signal along one of the serial
directions was carried by a 10 foot long cable. It worked successfully at 50
MHz. The longest cable in the full machine will be about 6 feet.
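The nearest neighbor structure of the physics network can be sketched as follows. The example assumes that the mesh wraps around periodically in every direction, which the text does not state explicitly; the function name and coordinate convention are likewise hypothetical.

```python
MESH = (16, 16, 16, 4)  # the 16^3 x 4 mesh quoted in the text


def nearest_neighbors(node, mesh=MESH):
    """Return the eight neighbors of a node (two per direction).

    Periodic wrap-around is an assumption made for this sketch.
    """
    neighbors = []
    for axis, extent in enumerate(mesh):
        for step in (+1, -1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % extent
            neighbors.append(tuple(coord))
    return neighbors


if __name__ == "__main__":
    for neighbor in nearest_neighbors((0, 0, 0, 0)):
        print(neighbor)
```

Each node has two neighbors in each of the four directions, matching the eight bit-serial lines per NGA described in the hardware section.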
2) The control/diagnostic network is completely independent of the physics
network and does not involve the NGA or DRAM. Because of this it is less
sensitive to hardware failures. It will provide the front-end connection to
the host workstation (SUN Sparc-10) and diagnostic information, and will be
used to load and initiate execution of programs as well as unload small
amounts of data. It consists of two parts. The first part is a SCSI tree with
the host workstation at the root, connecting the 256 motherboards and a
number of disks. A motherboard is connected to the SCSI tree via two SCSI
chips controlled by node 0. The second part is a tree on each motherboard
that connects node 0 with the remaining 63 nodes via their DSP serial ports.
Node 0 can select any, or all, of the 63 nodes via a programmable PAL switch.
A SCSI link has a bandwidth of 2.5 Mwords/sec and the DSP serial port has a
bandwidth of 0.3 Mwords/sec.
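The two bandwidths quoted for the control/diagnostic network give a rough lower bound on program load times. The sketch below is a back-of-the-envelope estimate only: it assumes a single SCSI hop from the host to node 0, a store-and-forward step, and a simultaneous broadcast over the DSP serial ports, and it ignores all protocol overhead. None of these assumptions is stated in the text.

```python
SCSI_MWORDS_PER_SEC = 2.5        # from the text
DSP_SERIAL_MWORDS_PER_SEC = 0.3  # from the text


def transfer_seconds(words, mwords_per_sec):
    """Ideal transfer time for a number of words at a given bandwidth."""
    return words / (mwords_per_sec * 1e6)


def load_time_seconds(program_words):
    """Host -> node 0 over SCSI, then node 0 -> other nodes over the serial ports.

    Assumes one SCSI hop, store-and-forward at node 0, and a broadcast to all
    selected nodes at once; protocol overhead is ignored.
    """
    return (transfer_seconds(program_words, SCSI_MWORDS_PER_SEC)
            + transfer_seconds(program_words, DSP_SERIAL_MWORDS_PER_SEC))


if __name__ == "__main__":
    print(load_time_seconds(100_000))
```

Under these assumptions the DSP serial port dominates the load time, since it is roughly eight times slower than a SCSI link.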
5. Software
When the power is turned on, all the DSPs start running a boot-loader code
that has been imprinted on the chip by the manufacturer. Every node 0 has a
pin set so that the boot-loader will read boot-code from the PROM into memory
and start executing it. This code sets up the DSP serial port and the PAL
switch and broadcasts a copy of itself to the serial ports of the other 63
nodes. These nodes have a pin set so that the boot-loader waits until code
arrives at their serial ports. Once the copy of the boot-code sent by node 0
arrives, the boot-loader puts it into memory and starts executing it. The
boot-code performs some diagnostic tests and then branches to a kernel.
The kernel on nodes 1-63 services a set of fundamental requests that arrive
at the DSP serial port, such as write, read and execute. The kernel on node 0
is more complex since it has to control the two SCSI chips and the PAL
switch, and handle the routing and service requests along the SCSI tree as
well as along the local DSP serial port tree. The kernel expects to receive
requests according to a predefined communications protocol. This consists of
packets with a maximum size of 512 words (so that they can be easily buffered
in the DSP chip memory, which is 2 Kwords long) that contain a "payload" of
data, as well as routing and control information. Once the code has branched
to the kernel, the machine is accessible from the host workstation. The host
communicates with the machine through a front-end program that exchanges
packets via the SCSI connection.
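A packet-based protocol of this kind can be sketched as follows. Only the 512 word maximum packet size and the write, read and execute requests come from the text. The header layout, its size of four words, the field names and the `packetize` helper are all hypothetical and are not the protocol actually defined for the machine.

```python
from dataclasses import dataclass
from typing import List

MAX_PACKET_WORDS = 512  # maximum packet size quoted in the text
HEADER_WORDS = 4        # hypothetical: one word per header field below


@dataclass
class Packet:
    motherboard: int    # hypothetical routing field (position on the SCSI tree)
    node: int           # hypothetical routing field (0-63 on the motherboard)
    request: str        # "write", "read" or "execute"
    sequence: int       # hypothetical control field
    payload: List[int]

    def size_words(self) -> int:
        return HEADER_WORDS + len(self.payload)


def packetize(words, motherboard, node, request="write"):
    """Split a list of words into packets no larger than MAX_PACKET_WORDS."""
    chunk = MAX_PACKET_WORDS - HEADER_WORDS
    return [
        Packet(motherboard, node, request, seq, list(words[start:start + chunk]))
        for seq, start in enumerate(range(0, len(words), chunk))
    ]


if __name__ == "__main__":
    packets = packetize(list(range(1200)), motherboard=3, node=17)
    print([p.size_words() for p in packets])
```

Keeping every packet within 512 words means that several packets fit in the 2 Kword on-chip memory at once, which is the buffering argument made above.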
The communications protocol has been finalized. The kernel for nodes 1-63 has
been written and was used to boot a single node and download and run some of
the code that tested the NGA, as well as to boot the two node prototype. The
SCSI driver for both the SUN host workstation and the motherboard SCSI chips,
the fundamental part of the kernel for node 0, and a program loader for the
SUN have also been written. They have been tested successfully on existing
commercial hardware and are ready to be ported to the actual hardware when it
is assembled. The remainder of the kernel for node 0 has been designed and is
currently being written. Also, physics code is currently under development.
6. Timetable
We expect to have the 26 Gigaflops, 512 node prototype working by the end of
1995 and hope to have the full machine finished a year later.
7. Acknowledgments
We would like to thank the Columbia summer students X. Chen, S. Kasow and
C. Lazaroiu for their help.
REFERENCES
1. I. Arsenin et al., Nucl. Phys. B (Proc. Suppl.) 34, 820 (1994);
   R. Mawhinney, Nucl. Phys. B (Proc. Suppl.) 42, 140 (1995);
   I. Arsenin, Nucl. Phys. B (Proc. Suppl.) 42, 902 (1995);
   J. Sexton, these proceedings.
