Architectural choices for the Columbia 0.8 Teraflops machine by Arsenin, Igor V.
arXiv:hep-lat/9412093v1 20 Dec 1994
CU–TP–662 hep-lat/9412093
Architectural choices for the Columbia 0.8 Teraflops machine
I. V. Arsenin^a,∗
^a Physics Department, Columbia University,
New York, NY 10027, U.S.A.
We discuss the hardware design choices made in our 16K-node 0.8 Teraflops supercomputer project, a ma-
chine architecture optimized for full QCD calculations. The efficiency of the conjugate gradient algorithm in
terms of balance of floating-point operations, memory handling and utilization, and communication overhead is
addressed. We also discuss the technological innovations and software tools that facilitate hardware design and
what opportunities these give to the academic community.
1. OVERVIEW
The preliminary discussions of the Columbia
0.8 Teraflops project began in Spring 1993. Since
1989 the Columbia group has performed lattice
QCD calculations on the Columbia 256-node
parallel computer with 16 Gflops peak speed. By
then it had become clear that the capacity of
this computer would be exhausted within a few
years, and that it was necessary to look into
ways to enhance the computing power in order to
stay at the forefront of lattice calculations.
The timing of our decision to embark on the
building of a new supercomputer was greatly in-
fluenced by advances in technology that have
produced inexpensive and relatively fast
processors, external memory units, and peripherals
that could be customized for the desired tasks.
Our choice was in favor of a machine that con-
sists of a large number of simple nodes connected
by an efficient communication network — a con-
figuration, we believe, more promising and cost
efficient than a few high speed and high capacity
processors which have yet to become available at
reasonable prices.
∗Work done in collaboration with D. Chen, N. Christ,
C. Jung, A. Kahler, Y. Luo, R. Mawhinney, P. Vranas
at Columbia Univ; A. Gara, J. Parsons at Nevis Labs;
R. Edwards, A. Kennedy at SCRI; S. Hansen at Fermilab;
J. Sexton at Trinity College, Dublin; G. Kilcup at
Ohio State Univ. Research is sponsored in part by the
Department of Energy.
Figure 1. The functional diagram of one of the
16K nodes: the DSP, DRAM, and NGA, with serial
links in the + and − directions.
The Columbia 0.8 Teraflops computer is a
collection of 16,384 nodes connected in a 16^3 × 4
four-dimensional serial network. The main
elements of the node (Fig. 1) are:
• Digital Signal Processor (DSP) capable of
performing a floating point multiplication
and an accumulation in a single 25MHz cy-
cle. We are currently using Texas Instru-
ments TMS320C30 50MHz parts. For more
information on DSP applications see [4].
• Dynamic Random Access Memory
(DRAM). We support either 256k × 16
or 512k × 8 parts from various vendors.
These memory devices are connected to a
39-bit data bus (32-bit data word with 7
bits for Error Detection and Correction).
• The cornerstone of our design, the NGA
(Node Gate Array), is a customized gate
array (Fig. 2). The NGA provides a con-
troller for the serial communication pro-
tocol, arbitrates memory accesses, handles
recovery from errors in DRAM and se-
rial wires, and includes a buffer that bal-
ances data transfers between the DSP and
DRAM.
Figure 2. The main components of the NGA: the
DSP I/O controller, the SCU, the Circular Buffer,
and DRAM I/O and EDC.
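The node parameters listed above are enough to reproduce the headline peak speed. The short Python sketch below does that arithmetic; the variable names are ours, and the only inputs are figures quoted in the text (16^3 × 4 nodes, a 25 MHz cycle, one multiplication and one accumulation per cycle, a 32-bit word with 7 check bits).

```python
# Peak-speed arithmetic from the figures quoted in the text.
# All names are illustrative; only the numbers come from the paper.

NODES = 16 ** 3 * 4          # 16^3 x 4 four-dimensional network
CLOCK_HZ = 25e6              # DSP instruction cycle rate
FLOPS_PER_CYCLE = 2          # one multiplication + one accumulation

node_peak = CLOCK_HZ * FLOPS_PER_CYCLE      # flops per node
machine_peak = NODES * node_peak            # flops for the whole machine

DATA_BITS, EDC_BITS = 32, 7
bus_width = DATA_BITS + EDC_BITS            # width of the DRAM data bus

print(f"nodes:        {NODES}")
print(f"node peak:    {node_peak / 1e6:.0f} Mflops")
print(f"machine peak: {machine_peak / 1e12:.3f} Tflops")
print(f"DRAM bus:     {bus_width} bits")
```

The result, 16,384 nodes at 50 Mflops each, is about 0.82 Tflops, which is the "0.8 Teraflops" of the project name.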
The opportunity to include almost all I/O con-
trollers and global signal handlers in one VLSI
chip significantly simplified the configuration of
the node and at the same time added a new level
of flexibility to the hardware. In designing the
NGA, we tried to shift as much of the low level
functionality as possible to the hardware thus al-
lowing the programmer to concentrate more on
the implementation of the physics algorithms. At
the same time we are leaving enough hooks for a
sophisticated user concerned with getting maxi-
mum efficiency.
The design of a truly parallel computer requires
a well balanced communication scheme with the
throughput matching the speed of the CPU. The
optimal communications vs. computations ratio
depends on the algorithms to be run on the com-
puter. Our task was to produce a Lattice QCD
machine able to run full QCD calculations effi-
ciently. Our benchmark algorithm for determin-
ing balance with the serial communications and
memory accesses is a single conjugate gradient
(CG) update from a staggered fermion evolution.
This is more or less an obvious choice for the full
QCD calculation since matrix inversions take up
most of the CPU time. We also used code for
Wilson fermions which, though less communica-
tions intensive, provides a useful tool for estimat-
ing overall performance. A serial communication
scheme with transfers going in four “space–time”
directions does not create substantial overhead
for vector–matrix multiplications if run at a speed
exceeding 17MHz. On the other hand, global dot
products that require the sum of the spinors on
all nodes would take up to 70% of the CPU time if
serial transfers were run at 25MHz. Our SCU (Se-
rial Communication Unit) thus has been designed
to perform “on-the-fly” add, max, and broadcast
operations between the data residing on a partic-
ular node and the data coming from the adjacent
nodes. This mode significantly reduces the la-
tency of transferring information across the com-
puter.
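The data flow of such a global sum can be illustrated with a small software model. The Python sketch below is not the SCU protocol, which is not specified here; it assumes, for illustration only, that the sum is formed one network direction at a time, with each node adding its own datum to a word as it passes ("add") and the ring total then being passed back to every node on that ring ("broadcast"). The function name and the reduced lattice size are ours.

```python
from itertools import product

def global_sum_torus(local, shape):
    """Model of a global sum on a periodic grid of nodes.

    local: dict mapping a node coordinate tuple to its local value.
    shape: number of nodes along each direction.
    Returns a dict in which every node holds the global sum.
    """
    data = dict(local)
    for axis, n in enumerate(shape):
        new = {}
        for coord in product(*(range(s) for s in shape)):
            if coord[axis] != 0:
                continue  # handle each ring once, starting at its node 0
            ring = [coord[:axis] + (k,) + coord[axis + 1:] for k in range(n)]
            partial = 0
            for node in ring:      # "add": each node adds its datum in passing
                partial += data[node]
            for node in ring:      # "broadcast": every node receives the total
                new[node] = partial
        data = new
    return data

# Small stand-in for the 16 x 16 x 16 x 4 machine, to keep the demo quick.
shape = (4, 4, 4, 2)
local = {c: i for i, c in enumerate(product(*(range(s) for s in shape)))}
result = global_sum_torus(local, shape)
print(result[(0, 0, 0, 0)], sum(local.values()))
```

After the last direction has been processed, every node holds the same total, so a dot product needs no further communication before the next CG step.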
Another aspect of achieving balance among dif-
ferent components of the node is managing slow
DRAM. This is done by introducing a pipeline
stage in the path between the DSP and the
memory. A 32-word Circular Buffer can be in-
structed to prefetch a certain number of words
from DRAM which can be read by the DSP in
zero wait state mode later on. The Circular
Buffer has an embedded protection against ac-
cesses to invalid data locations unless the pro-
grammer specifically turns this protection off.
The size of the Circular Buffer comfortably
accommodates all eighteen real components (nine
complex entries) of an SU(3) color matrix.
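The prefetch-then-read pattern can be sketched with a small behavioral model in Python. Everything about the interface below is hypothetical (the class, its method names, and the choice to model the protection as an exception); the text fixes only the buffer size of 32 words, the prefetch of a chosen number of words, and the existence of a protection that the programmer can switch off.

```python
class CircularBuffer:
    """Behavioral model of a prefetching circular buffer (hypothetical API)."""

    def __init__(self, dram, size=32, protect=True):
        self.dram = dram            # backing store, modeled as a list of words
        self.size = size
        self.protect = protect
        self.slots = [None] * size
        self.head = 0               # next slot the DSP will read
        self.tail = 0               # next slot a prefetch will fill
        self.count = 0              # number of valid, unread words

    def prefetch(self, addr, nwords):
        """Copy nwords from DRAM, starting at addr, into the buffer."""
        if self.count + nwords > self.size:
            raise ValueError("prefetch would overwrite unread words")
        for i in range(nwords):
            self.slots[self.tail] = self.dram[addr + i]
            self.tail = (self.tail + 1) % self.size
            self.count += 1

    def read(self):
        """Return the next word; no DRAM access is made here."""
        if self.count == 0:
            if self.protect:
                raise RuntimeError("read of an invalid buffer location")
            # Protection off: fall through and return whatever the slot holds.
        else:
            self.count -= 1
        word = self.slots[self.head]
        self.head = (self.head + 1) % self.size
        return word


dram = [float(i) for i in range(64)]
buf = CircularBuffer(dram)
buf.prefetch(10, 18)                          # one SU(3) matrix: 18 real words
matrix = [buf.read() for _ in range(18)]
print(matrix == dram[10:28])
```

How the hardware actually responds to a protected access (for example by stalling the DSP) is not stated in the text; the exception is only a convenient way to make the condition visible in a model.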
2. MODERN DESIGN TOOLS
The NGA incorporates almost all the nontrivial
functionality that makes a collection of CPUs
and memory devices a parallel computer. Please
refer to [1] or [2] for the description of the upper
level design blocks of our project. In this section
we will concentrate on the implementation of the
ideas described in the previous section.
Not long ago, at the final stages of the design
process, one had to produce schematic drawings,
then solder available logic elements and
more sophisticated standardized medium-scale
integration components on printed circuit boards,
and test the result using a logic analyzer. The
design process was time consuming and hardly
feasible outside large companies or research centers. A
major advance occurred a few years ago with the
advent of hardware description languages, such as
VHDL and Verilog, that allow a designer to for-
malize some higher level concepts and provide a
high degree of automation in transforming this
description into an actual schematic accepted by
VLSI manufacturers.
The basic entities that VHDL (our language
of choice) describes are signals and their sequen-
tial and concurrent assignments. One can write
boolean style logic equations connecting incom-
ing signals such as the DSP address bus with
outgoing signals such as serial wires constituting
the communication network or controls for
peripheral devices. Finite state machines, which
react to the incoming signals according to the state
the machine is in, are naturally implemented as
clocked processes where the state is represented
by an internal signal that is allowed to change
value only at a specific edge of the clock. The
following is an example of a two-bit counter:
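process
begin
    wait until rising_edge(clock);   -- resume on each rising clock edge
    count0 <= not count0;            -- low bit toggles on every edge
    count1 <= count1 xor count0;     -- high bit toggles when the low bit was 1
end process;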
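The behavior of the counter above can be checked with a few lines of Python. The sketch relies on one property of VHDL signal assignment: both right-hand sides are evaluated with the values the signals held before the clock edge, and the new values take effect together. The function name is ours.

```python
def clock_edge(count1, count0):
    """Apply one rising clock edge to the two-bit counter.

    Both new values are computed from the pre-edge values, mirroring
    the way VHDL signal assignments inside a process are scheduled.
    """
    new_count0 = count0 ^ 1          # count0 <= not count0
    new_count1 = count1 ^ count0     # count1 <= count1 xor count0
    return new_count1, new_count0


state = (0, 0)
sequence = []
for _ in range(5):
    sequence.append(2 * state[0] + state[1])   # value of the two-bit number
    state = clock_edge(*state)
print(sequence)
```

Starting from zero, the two bits step through 0, 1, 2, 3 and wrap around to 0, as expected of a two-bit counter.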
Once the desired logical equations are writ-
ten, the design can be simulated as a black box
which reacts to the external stimuli represented
by time patterns for each incoming signal. We
also use models for the standard components such
as DSPs and DRAMs available from the Logic
Modeling Corporation which allow us to check the
logical consistency of a design by running a piece
of CG code on a model of the DSP which com-
municates with our model of the NGA which, in
turn, reads and writes to a model of the memory.
The next important step is synthesis of the
VHDL source code, the output of which is the
desired schematic. This process is similar to
compiling an ordinary program. The gate array
manufacturer supplies a library of components which
represent basic constructs in VHDL. The synthe-
sizer parses the code and substitutes these con-
structs with the elements of the library, properly
connecting them to each other. After the initial
run, the synthesizer proceeds to the iterative op-
timization task by looking at larger clusters of
components and trying to reduce them to more
compact ones following the list of constraints such
as speed, size, or load requirements set by the
user. The development of efficient synthesis tools
is very much in progress now and the current ones
are not perfect. Some human intervention is still
required to produce sound results. For more in-
formation on this aspect of design see [3].
The final step in the design is the simulation
which includes correct propagation delays sup-
plied by the vendor. At this stage one can identify
parts of the design that are too slow to keep up
with the desired speed of the computer and opti-
mize them either by rearranging the VHDL code
or by applying more severe constraints to the syn-
thesizer. There are a number of tools available for
this: one can list the asynchronous paths with
delays larger than a set number, include such sig-
nals in a watch list from the simulator, look at
the diagnostic output of the synthesizer, and so
on. At the same time one can run large pieces of
assembly code on the model of the DSP, checking
the performance of physics code on a cycle by cy-
cle basis and revealing obscure bugs that might
otherwise show up only after thousands of cycles
of simulation. The performance figures for the
Columbia 0.8 Teraflops project are based on these
simulations and presented in reference [1].
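Listing the paths whose delay exceeds a set number amounts to a longest-path calculation over the netlist. The Python sketch below illustrates the idea on an invented example: the signal names and gate delays are hypothetical and are not taken from the NGA design; only the 40 ns budget follows from the 25 MHz cycle quoted earlier.

```python
from functools import lru_cache

CLOCK_MHZ = 25
PERIOD_NS = 1000.0 / CLOCK_MHZ      # 40 ns per cycle at 25 MHz

# Hypothetical netlist: signal -> (gate delay in ns, signals feeding the gate)
NETLIST = {
    "addr_decode": (12.0, ["dsp_addr"]),
    "arbiter":     (15.0, ["addr_decode", "scu_req"]),
    "edc_check":   (22.0, ["dram_data"]),
    "mem_grant":   (9.0,  ["arbiter"]),
    "buf_write":   (11.0, ["edc_check", "mem_grant"]),
}
PRIMARY_INPUTS = frozenset({"dsp_addr", "scu_req", "dram_data"})


def slow_signals(netlist, inputs, limit_ns):
    """Return {signal: arrival time} for signals later than limit_ns."""

    @lru_cache(maxsize=None)
    def arrival(signal):
        if signal in inputs:
            return 0.0                       # inputs are valid at time zero
        delay, fan_in = netlist[signal]
        # A gate output settles after its slowest input plus its own delay.
        return delay + max(arrival(s) for s in fan_in)

    return {s: arrival(s) for s in netlist if arrival(s) > limit_ns}


slow = slow_signals(NETLIST, PRIMARY_INPUTS, PERIOD_NS)
print(slow)
```

In this invented netlist only one signal misses the cycle budget, and that is the kind of path one would then shorten by rearranging the VHDL code or tightening the synthesizer constraints.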
REFERENCES
1. R. Mawhinney in these Proceedings.
2. I. Arsenin, et al. Nucl. Phys. B (Proc. Suppl.)
34 (1994) April 1994.
3. See for example: A. Airiau, Circuit synthesis
with VHDL, Kluwer Academic Publishers,
Boston, 1994; or L. Baker, VHDL programming:
with advanced topics, Wiley, New York, 1993.
4. R. J. Higgins, Digital signal processing in
VLSI, Prentice Hall, 1990.
