Cost-Effective Clustering by Gottlieb, Steven
arXiv:hep-lat/0112026v1 17 Dec 2001
Cost-Effective Clustering
Steven Gottlieb
Department of Physics, Indiana University, Bloomington, IN 47405, USA
Small Beowulf clusters can effectively serve as personal or group supercomputers. In such an environment, a
cluster can be optimally designed for a specific problem (or a small set of codes). We discuss how theoretical
analysis of the code and benchmarking on similar hardware lead to optimal systems.
PACS/Keywords: 12.38.Gc, Lattice QCD, Linux, Beowulf Clusters, Supercomputing
1. INTRODUCTION
The Beowulf project began in 1994 at the
NASA Goddard Space Flight Center. The his-
tory, current status and a list of about 100 clus-
ters may be found at http://www.beowulf.org.
This year at Supercomputing 2000, Gordon Bell
sponsored a prize for a computer constructed or
purchased for under $10,000. Thus, cost-effective
clustering is a timely topic.
I assume you need to do a lot of computation
and need to get it done inexpensively rather than
in the shortest time. It helps if you can run more
than one job at a time. This should be contrasted
with weather prediction, where being able to run
seven one-day predictions that each take seven
days is not very useful. For weather we must ex-
ceed a specific minimal speed that will let us pre-
dict tomorrow’s weather from current conditions.
The cost model presented here is based on cap-
ital expenditure for hardware. A more sophisti-
cated model would include maintenance, electric-
ity, cooling, floor space, etc.
These design principles are advocated:
1) Know the bottlenecks and requirements of your problem. With this
knowledge, you can avoid building an unbalanced system that, for
instance, might put too much money into a network whose high
performance is not required.
2) Design for the sweet spot. (Note: the sweet spot changes with time
and may depend on the problem.) As an example, the highest density
memory available at any time tends to be quite expensive. Currently,
it is less expensive to buy two 256 MB parts than one 512 MB part.
3) Design for total system cost effectiveness. If a 10% increase in
the speed of the processor results in a 5% performance increase, the
system price increase might still be less than 5%, since the
processor is only one component.
4) Benchmark as much as you can before deciding on a design.
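The arithmetic behind principle 3 can be sketched in a few lines. This is an illustration only: the function name, the 15% processor price premium and the cost shares are hypothetical, and only the 5% performance gain mirrors the example in the text.

```python
def relative_price_performance(cpu_cost_fraction, cpu_price_increase, perf_increase):
    """Return (new price / new performance) relative to the original system.

    A value below 1.0 means the upgraded system is more cost-effective.
    All arguments are fractions, e.g. 0.05 for 5%.
    """
    new_price = 1.0 + cpu_cost_fraction * cpu_price_increase
    new_perf = 1.0 + perf_increase
    return new_price / new_perf


# Hypothetical case: the faster CPU costs 15% more and gives 5% more
# system performance.
cpu_is_30_percent = relative_price_performance(0.30, 0.15, 0.05)
cpu_is_50_percent = relative_price_performance(0.50, 0.15, 0.05)

print(f"CPU is 30% of system cost: {cpu_is_30_percent:.3f}")
print(f"CPU is 50% of system cost: {cpu_is_50_percent:.3f}")
```

Under these assumed numbers the upgrade pays off only when the processor is a small enough share of the total system cost, which is the point of the principle.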
A cluster compute node can be very simple.
Six items are (almost) mandatory: motherboard,
CPU, memory, network card, case and floppy
drive. The first three items will be the most ex-
pensive if using a FastEthernet network (unless
you insist on using rack mounted cases). Oth-
erwise, the network card will be a major ex-
pense. Some motherboards come equipped with
a FastEthernet interface. I have also seen some
systems without individual cases. A hard drive or
video card may be useful. Some Beowulf design-
ers discourage a hard drive on each compute node.
If your application requires lots of scratch disk
I/O, then you may need a disk on the node. When
a node repeatedly fails to reboot, a video card can
help diagnose the problem. (At IU in two years
of running with 40 nodes, this has only been nec-
essary about half a dozen times.) We have a few
spare video cards to install when needed.
The Indiana University Physics Department re-
ceived $50,000 in 1998 to build a 32-node Linux
cluster. The machine we built in Nov. 1998
is called CANDYCANE, which stands for CPUs
And Network Do Your Calculation And Nothing
Else. CANDYCANE is an appropriate name be-
cause it was designed for the “sweet spot,” that
is, components were picked to give the best price-
performance ratio attainable. The cost per node
was $693 for a Pentium II 350, with a 4.3 GB
hard drive and 64 MB of ECC RAM. Each node
has a floppy drive and a FastEthernet card. The
40-port HP Procurve switch cost about $2,000,
so the total cost was about $25,000. In Novem-
ber, 2000 it would have been possible to build
this system for <$320 per node, or for approxi-
mately $12,000. An even more attractive alterna-
tive would be a diskless Athlon 600 MHz system
for which the per node cost is about $275. This
node would have much better performance than
the PII 350; however, the FastEthernet would
be a bottleneck on the MILC code with Kogut-
Susskind quarks. Even so, a 32 node system with
a minimum performance of 1280 and 1660 Mflops,
for 8^4 and 14^4 sites per node, respectively, could
be built for under $10,500. This works out to a
cost of between $6.3 and $8.2 per Mflop (MF).
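The cost per Mflop quoted for the Athlon system follows directly from the system price and the two minimum performance figures. A minimal check, with a function name chosen here for illustration:

```python
def cost_per_mflop(system_cost_usd, sustained_mflops):
    """Capital cost in dollars per sustained Megaflop/s."""
    return system_cost_usd / sustained_mflops


SYSTEM_COST_USD = 10_500          # upper bound quoted for the 32 node system
MIN_PERFORMANCE_MF = (1280, 1660)  # the two minimum performance figures quoted

for mflops in MIN_PERFORMANCE_MF:
    print(f"{mflops} MF: ${cost_per_mflop(SYSTEM_COST_USD, mflops):.1f}/MF")
```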
In Sec. 2, we describe the key issues for good
performance. Section 3 gives details of single
node performance. Section 4 points the reader
to a web site with extensive benchmarks, gives
cost estimates for several designs and compares
cost-performance ratios for these clusters and a
number of supercomputers. For additional infor-
mation about emerging technologies for clusters
see Ref. [1].
2. KEYS TO PERFORMANCE
A very simple approach to achieving good per-
formance for domain decomposition codes like
Lattice QCD codes is to optimize single node per-
formance and to try to avoid degrading perfor-
mance too much when one has to communicate
boundary values to neighboring nodes. The key
to a cost effective design is an appropriate bal-
ance. Floating point performance is more eas-
ily adjustable than network performance because
processors come in many speeds, but there are
only a few choices for the network.
A simple performance model of the Kogut-
Susskind Conjugate Gradient algorithm gives this
bandwidth requirement to overlap communica-
tion and floating point operations:
MB = 48MF/(132L) = 0.364MF/L, (1)
where MB is the achieved bandwidth in
Megabyte/s, MF is the achieved floating point
speed in Megaflop/s on matrix-vector multiplication
and an L^4 portion of the grid is on each node.
We assume there are neighboring nodes in each
direction, i.e., 16 or more nodes. The constant
factor 0.364 is specific to KS quarks. However,
the 1/L behavior is typical of the domain decomposition
approach to parallelism and comes from
the surface to volume ratio.

Figure 1. Measured bandwidth and the simple
performance model.
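Eq. (1) is easy to tabulate. The sketch below evaluates it for the four matrix-times-vector speeds used for the model lines in Figure 1. The function name and the choice of L values (the even sizes from 4 to 14 used in the single-node tables) are illustrative choices.

```python
def required_bandwidth(mflops, L):
    """Bandwidth in MB/s needed to overlap communication and floating
    point work for an L^4 local lattice, following Eq. (1)."""
    return 48.0 * mflops / (132.0 * L)


lattice_sizes = (4, 6, 8, 10, 12, 14)

# Header row, then one row of required MB/s per processor speed.
print("MF    " + "".join(f"L={L:<6d}" for L in lattice_sizes))
for mflops in (50, 100, 200, 400):
    row = "".join(f"{required_bandwidth(mflops, L):<8.2f}" for L in lattice_sizes)
    print(f"{mflops:<6d}{row}")
```

Doubling L halves the required bandwidth at fixed MF, which is the 1/L surface to volume behavior described above.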
Figure 1 shows a log-log plot of measured band-
width on a ping-pong test for three types of hard-
ware and the performance model for several pro-
cessor speeds. The messages vary in size from 800
bytes to 30 KB for problem sizes of interest. The
arrows near the bottom of the graph correspond
to different L values. The FastEthernet (blue)
and Myrinet (green) curves come from measured
performance on the Roadrunner (RR) superclus-
ter at the Albuquerque High Performance Com-
puter Center. Two curves are shown for Myrinet.
With the newer drivers, bandwidth is better and
smoother. The Quadrics curve comes from the
Teracluster at Lawrence Livermore National Lab-
oratory (LLNL). The measurement was done us-
ing the Netpipe program from the Ames Scal-
able Computing Laboratory [2]. The straight red
lines come from the performance model presented
above and are plotted for matrix times vector
speeds of 50, 100, 200 and 400 MF. We need to
run at a large enough value of L so that the measured
bandwidth is above the straight line (for
whatever speed our processor achieves for the cor-
responding value of L). Because of cache effects,
the processors will achieve higher speeds when
L is small, but that requires the highest band-
width. Thus, pushing up the communication rate
for small messages is important. Being able to
run for a small value of L with high efficiency
allows running a fixed size problem at high to-
tal performance. We see that none of the net-
works achieves more than a small fraction of its
peak bandwidth for the message sizes of inter-
est. A system design based on achieving that
peak bandwidth would almost certainly be com-
munication bound, i.e., money would have been
spent on floating point capacity that could not be
used. There are large differences in the prices of
FastEthernet, Myrinet and Quadrics hardware.
Choice of network can obviously play a critical
role in system performance and cost-effectiveness.
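The balance condition can be sketched by combining Eq. (1) with single-node rates that depend on L. The example below uses the Athlon 600 MHz rates from Table 2. The achieved bandwidth is a hypothetical placeholder, and treating it as independent of message size is a simplification, since the measured curves in Figure 1 vary with message size.

```python
# Single-node Athlon 600 MHz rates in MF, indexed by L (from Table 2).
ATHLON_600_MFLOPS = {4: 276, 6: 135, 8: 102, 10: 97, 12: 95, 14: 93}


def required_bandwidth(mflops, L):
    """Eq. (1): MB/s needed to overlap communication with computation."""
    return 48.0 * mflops / (132.0 * L)


def smallest_balanced_L(mflops_by_L, achieved_mb_per_s):
    """Smallest listed L whose required bandwidth is within the achieved
    bandwidth; None if no listed L qualifies."""
    for L in sorted(mflops_by_L):
        if required_bandwidth(mflops_by_L[L], L) <= achieved_mb_per_s:
            return L
    return None


for L, mflops in sorted(ATHLON_600_MFLOPS.items()):
    print(f"L={L:2d}: needs {required_bandwidth(mflops, L):5.2f} MB/s")

HYPOTHETICAL_BANDWIDTH = 5.0  # MB/s, placeholder and not a measurement
print("smallest balanced L:", smallest_balanced_L(ATHLON_600_MFLOPS, HYPOTHETICAL_BANDWIDTH))
```

Because the processor runs fastest when the problem fits in cache, the required bandwidth rises much more steeply than 1/L at small L.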
3. SINGLE NODE PERFORMANCE
The single node performance is likely to depend
upon such issues as the quality of the CPU, the
performance and size of cache(s), the bandwidth
to main memory and the quality of the compiler.
For message passing performance, key issues are
the latency, peak bandwidth, processor overhead
and the message passing software. It is important
to make the right choices when designing your
system.
For the CPU, one can choose among Intel
Celeron, Pentium II, PIII, PIV, Itanium; AMD
Athlon, Thunderbird, Duron; Compaq Alpha and
other possibilities. The Celeron may have limited
performance because it only has a 66 MHz front
side bus (FSB). If access to memory is important
(as it is for this application) the 100 or 133 MHz
FSB of the PII and PIII will be useful. The PIV
is quite new and currently expensive. Itanium is
soon to be available and some results are provided
below. The Alpha has great performance, but it
is expensive.
There are currently several memory types
available for different processors. They include
PC100, PC133, Rambus and Double Data Rate
(DDR), which is also known as PC1600 or PC2100.
PC100 and PC133 are fairly mature at this
stage, and there is little difference in price
(currently about $0.50/MB). Rambus was quite expensive
at introduction, but its price has recently decreased
quite a bit (currently slightly more than $1/MB).
DDR is just now coming to market (Micron), but
it does not carry too much of a premium
(currently similar to Rambus). Motherboards that
use DDR memory are just coming to market.
Pick the right amount of memory for your prob-
lems: you never want your code to swap, but nei-
ther do you want to buy a lot of memory that you
never use.
Choice of motherboard is crucial. It must be
matched to the processor and memory. The sup-
port chip can have an important impact on per-
formance (as we explore below). The mother-
board determines the number of processors per
node. We discuss below whether dual processor
systems are more or less cost-effective for a par-
ticular application. The motherboard will also
determine whether you have a faster or wider PCI
bus than the initial standard 33 MHz-32bit bus.
This may be important if you have a higher speed
network like Myrinet.
The last critical choice is the network hardware
and software. It properly deserves its own sec-
tion, but to save space we briefly discuss it here.
FastEthernet is the commodity network. Other
choices such as Myrinet, Giganet, Gigabit Ethernet,
Quadrics QSnet and SCI from Dolphin/Scali
have higher performance, but are quite expen-
sive compared to FastEthernet. (Including the
card and switch, you can expect to spend about
$1,500 per node, except for Quadrics which is over
$3,000 per node.) Because of the big jump in
price and performance, it is necessary to make
sure the system maintains balance between CPU
and network. Regarding software choices, under
FastEthernet, MVIA [3] and GAMMA [4] software
have reduced latency compared with standard
TCP/IP. Under Myrinet, I have recently tried VMI,
which is being developed at NCSA, and
found it to be superior to running under the standard
GM (Myrinet supplied) driver.
Turning next to single node performance, we
note that it is easy to waste a lot of money
on a poor system design. To illustrate this, we
consider the various speed AMD Athlon processors
available and their prices on two particular
days. Although we focus on Athlon here, the
same considerations apply to Intel or other processors.
Figure 2 shows that processor price is a
rapidly increasing function of speed. It can also
be a rapidly decreasing function of time, especially
for the faster processors.

Figure 2. Processor price vs. speed.
Dividing the price by the speed of the chip, we
find that the relative expense rises rapidly for the
faster chips [5]. On April 7, 2000, there was an
apparent sweet spot at 600 MHz. The faster chips
have a higher price-performance ratio. Depend-
ing upon the costs of the other components of the
system, the entire system may have a higher or
lower price-performance ratio.
For our QCD codes, access to memory is quite
important. In our first example, we compare a
Celeron chip with a 66 MHz FSB and a Pentium II
chip with a 100 MHz FSB. The Pentium’s
clock speed is only 6% faster than the Celeron,
but its performance is 38% faster for L = 6 and
23% faster for L ≥ 8. This is because of its larger
cache and 52% faster FSB. When we overclock the
Celeron, it has a slower FSB but a faster CPU
speed than the PII, and performance is about
equal except for L = 6 where the smaller cache
probably is the source of the difference.
Table 1
Megaflop rate of Intel Processors
CPU          Celeron   Celeron    PII
Clock (MHz)  330       416 (OC)   350
FSB (MHz)    66        82         100
L            MF        MF         MF
4            102       117        113
6            60        74         83
8            57        71         72
10           57        71         70
12           56        70         70
14           56        71         70
Table 2
Megaflop rate of Athlon Processors
L     500 MHz   600 MHz
4     231       276
6     129       135
8     97        102
10    92        97
12    90        95
14    89        93
By comparing 500 MHz and 600 MHz Athlons,
we demonstrate that performance does not in-
crease in proportion to the speed of the chip.
This is because memory access is fixed by the
FSB speed.
The 600 MHz chip has a peak speed 20% faster
than the 500 MHz chip. With 4^4 lattice points,
we do see a 20% speed up, but for the larger problem
sizes that do not fit into cache, there is only a
5% speedup. We expect that for even faster processors,
performance increases will be marginal.
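The speedups quoted here can be recomputed from the Table 2 entries. In the short check below, the variable and function names are illustrative:

```python
# (500 MHz, 600 MHz) Megaflop rates indexed by L, copied from Table 2.
ATHLON_MFLOPS = {
    4: (231, 276),
    6: (129, 135),
    8: (97, 102),
    10: (92, 97),
    12: (90, 95),
    14: (89, 93),
}

CLOCK_SPEEDUP = 600 / 500 - 1.0  # 20% higher clock rate


def speedup(slow, fast):
    """Fractional gain of the faster chip over the slower one."""
    return fast / slow - 1.0


speedups = {L: speedup(*rates) for L, rates in ATHLON_MFLOPS.items()}
for L, gain in sorted(speedups.items()):
    print(f"L={L:2d}: {100 * gain:4.1f}% measured vs {100 * CLOCK_SPEEDUP:.0f}% clock")
```

Only the L = 4 case tracks the clock ratio; every larger lattice gains roughly 5%.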
Since memory access is so crucial, I have purchased
a Pentium III 533B chip that uses PC133
memory. In theory, it should provide about
33% better performance than a similar chip with
PC100 memory. I have tried three different
motherboards using different support chips and
the results are disappointing. The Gigabyte
GA6VXE+ motherboard uses a VIA chip set,
the Supermicro (SM) PIIISED uses the Intel 810e
chip set, and I also tried an Intel CC820 motherboard
using the Intel 820 chip set. The results
are not particularly better than a PII 350 or 450
MHz chip using a BX motherboard, except when
the problem fits in cache. In the future, I hope
to test whether the PIII prefetch instructions can
be used to improve this situation. (This technique
would also be applicable to a BX motherboard
with PIII installed.) The next version of
the Portland Group Compiler is supposed to have
better facilities for telling the compiler when to
prefetch data to cache. (See Table 3.)

Table 3
Megaflop rates of various motherboards and CPU combinations or cluster nodes
L    Gigabyte GA6VXE+  Intel CC820  SM PIIISED  SM P6SBA  RR †     LL ‡
     PIII 533B         PIII 533B    PIII 533B   PII 350   PII 450  PIII 733
4    186               182          174         114       142      319
6    106               98           94          83        99       140
8    81                75           73          72        82       130
10   76                72           70          70        79       127
12   76                70           69          70        78       127
14   73                70           69          70        78       126
† Roadrunner: Portland Group compiler
‡ Los Lobos: Portland Group compiler
It is possible that the 815e support chip from
Intel will provide better performance with the 133
MHz FSB, but I have not done any tests myself.
There are benchmarks at the Intel developer’s
web site that look promising. However,
there are support chips from ServerWorks that
support PC133 memory well. Table 3 includes results
from Los Lobos (LL), which uses Intel 733 MHz
chips in IBM Netfinity servers with a ServerWorks
chip set. SuperMicro is manufacturing a
dual processor motherboard that uses a support
chip from ServerWorks, but the motherboard currently
costs about $280, which is about twice the
price of a dual processor BX motherboard. Also,
this board requires registered memory, which will
add to the cost of the system.
You might have noticed above that the Athlon
numbers are fairly impressive especially consider-
ing that an Athlon chip will cost less than an Intel
chip at comparable speed. There has been good
news recently on the Athlon front. DDR memory
is now available from Micron, and there will soon
be motherboards that use the AMD 760 chipset
which supports DDR memory. Dual processors
have not been available for the Athlon, but there
will be an AMD 760MP chip that will support
dual processors. On October 10, 2000, AMD
announced a demonstration dual-CPU DDR sys-
tem, so it should not be too many more months
before these are available to the consumer.
With the help of programmers at Intel, the
MILC benchmarks have been tuned and run on
several Intel Itanium(tm) based systems. The
code that was run does not include any assembler
instructions, but prefetching hints to the com-
piler have been inserted to enhance performance.
The Itanium(tm) processors have the best perfor-
mance we have seen for our code.
Table 4
Megaflop rate of Itanium Processors
L     600 MHz    667 MHz    800 MHz
      100×1      133×2      133×2
      2MB/1GB    4MB/4GB    4MB/2GB
4     692        761        916
6     646        726        867
8     290        591        732
10    214        464        539
12    187        330        359
14    178        301        326
It should be noted that the 667 MHz processor
is a pre-production pilot. The two faster processors
have a double-pumped front side bus and they
also have a larger cache. By looking at ratios of
results we can see the effects of cache size, external
memory speed and processor speed. A second
table and interpretation of the relative performance
may be found in the original transparencies [5].
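As a rough sketch of the ratio analysis mentioned above, the code below divides neighboring columns of Table 4. The helper name is illustrative, and any reading of the ratios is only suggestive; the author's own interpretation is in the transparencies.

```python
# Megaflop rates from Table 4: (600 MHz, 667 MHz, 800 MHz) indexed by L.
ITANIUM_MFLOPS = {
    4: (692, 761, 916),
    6: (646, 726, 867),
    8: (290, 591, 732),
    10: (214, 464, 539),
    12: (187, 330, 359),
    14: (178, 301, 326),
}


def column_ratios(numerator_col, denominator_col):
    """Ratio of two Table 4 columns for every L."""
    return {
        L: row[numerator_col] / row[denominator_col]
        for L, row in ITANIUM_MFLOPS.items()
    }


ratio_667_to_600 = column_ratios(1, 0)  # systems differ in clock, bus and cache
ratio_800_to_667 = column_ratios(2, 1)  # same bus and cache; clock ratio 800/667

for L in sorted(ITANIUM_MFLOPS):
    print(f"L={L:2d}: 667/600 = {ratio_667_to_600[L]:.2f}, "
          f"800/667 = {ratio_800_to_667[L]:.2f}")
```

For small L the 800/667 ratio is close to the clock ratio of about 1.2, while at L = 8 and 10 the 667 MHz system is about twice as fast as the 600 MHz one, far more than the clock difference alone would suggest.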
4. PRICE-PERFORMANCE RATIOS &
BENCHMARKS
The simple performance model presented above
can help us predict when the communication and
floating point are in reasonable balance, but it
is no substitute for real benchmarks. A web
site for MILC benchmarks may be found at
physics.indiana.edu/~sg/milc/benchmark.html.
All the benchmarks presented are for single pre-
cision Kogut-Susskind conjugate gradient. Due
to limitations of space and time, we refer the
reader to the web site.
We consider bare-bones cluster nodes with ei-
ther 450 MHz PII chips as in Roadrunner, or 733
MHz PIII chips as in Los Lobos. For the Los
Lobos level system, we are assuming a Server-
Works dual-CPU capable motherboard will have
comparable performance to the IBM Netfinity
nodes. Each node contains 64 MB of ECC mem-
ory per processor and a 4.3 GB hard drive. Prices
are based on a search of www.pricewatch.com
on Oct. 31, 2000. Performance expectations are
based on Roadrunner or Los Lobos. (An AMD
Athlon based system was considered in Sec. 1.)
A single CPU node like those in Roadrunner,
but with 64 MB of memory, would cost $325. A
dual CPU node with 128 MB would cost $527. If
more memory is desired, it should cost less than
$1 per Megabyte. Los Lobos style nodes are $634
and $878, for single and dual CPU, respectively.
Per port FastEthernet switch costs are about $56,
$185 and $240, for 32, 72 and 144 ports, respectively.
Myrinet cost is $1527 per port and scales
linearly up to 128 ports with the Clos switches.
These prices are for LANai 9, Myrinet 2000 cards,
and better network performance than on Los Lobos
is expected.
For a single CPU RR level system, the price-performance
ratio in $/MF is 7.2–9.3, 10–13,
≈11–15 and 28–31, for FastEthernet with 32, 64,
128 nodes and Myrinet, respectively. With dual
CPUs, the numbers are 8–11, 12–16, 14–19 and
20–23. We see that the second CPU makes the
Myrinet based system considerably more cost effective;
however, for FastEthernet, although the
marginal cost is small, the performance gain is not
that great either, and the system is less cost effective.
With the more expensive and higher performance
LL level nodes, FastEthernet cost in $/MF
is about 11, 14 and 17, and Myrinet is 20–22 for
single CPU systems. With a second processor
the Myrinet number drops to 16–21. Dual CPU
benchmarks have not been run with FastEthernet,
but the network performance should be even
more of an issue here, and we expect the cost effectiveness
to be somewhat less.

Table 5
Price-performance ratios
Computer (date of quote)                 $/MF
RR level 1 CPU FE (10/00)                7–15
RR level 1 CPU Myrinet (10/00)           28–31
RR level 2 CPU FE (10/00)                8–19
RR level 2 CPU Myrinet (10/00)           20–23
LL level 1 CPU FE (10/00)                11–17
LL level 1 CPU Myrinet (10/00)           20–22
LL level 2 CPU Myrinet (10/00)           16–21
64-node SGI Origin 250 MHz (2/99)        193
44 node Cray T3E (2/99)                  480
256 node IBM Power 3 SP (2/00)           166
  with estimated discount                91
64 CPU Compaq Alpha Server SC            150
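The hardware side of these ratios can be reproduced from the node and per-port prices quoted in this section. The sketch below assumes one network port per node, including dual CPU nodes, which the text does not state explicitly. The sustained rate passed to dollars_per_mflop is a hypothetical placeholder, not a benchmark result; real values should come from the benchmark web site.

```python
# Prices in dollars, as quoted in this section (Oct. 31, 2000).
NODE_COST = {"RR single": 325, "RR dual": 527, "LL single": 634, "LL dual": 878}
FE_PORT_COST = {32: 56, 72: 185, 144: 240}  # per port, by switch size
MYRINET_PORT_COST = 1527                    # per port, card plus switch


def cost_per_node(node_type, port_cost):
    """Node hardware plus its share of the network (assumes one port per node)."""
    return NODE_COST[node_type] + port_cost


def dollars_per_mflop(node_total_cost, mflops_per_node):
    """Price-performance ratio in $/MF for a given sustained per-node rate."""
    return node_total_cost / mflops_per_node


for ports, port_cost in sorted(FE_PORT_COST.items()):
    total = cost_per_node("RR single", port_cost)
    print(f"RR single, FastEthernet {ports}-port switch: ${total}")
print(f"RR single, Myrinet: ${cost_per_node('RR single', MYRINET_PORT_COST)}")

HYPOTHETICAL_MF_PER_NODE = 50.0  # placeholder sustained rate, not measured
example = dollars_per_mflop(cost_per_node("RR single", FE_PORT_COST[32]),
                            HYPOTHETICAL_MF_PER_NODE)
print(f"example: ${example:.2f}/MF")
```

Separating the cost per node from the sustained rate makes it easy to see how quickly the per-port network price comes to dominate a node that itself costs only a few hundred dollars.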
This work was supported by the U.S. DOE un-
der grant DE-FG02-91ER 40661. Special thanks
to the MILC collaboration, the Albuquerque High
Performance Computer Center, Indiana Univer-
sity, LLNL, National Center for Supercomputing
Applications, Pittsburgh Supercomputer Center
and San Diego Supercomputer Center.
REFERENCES
1. See http://physics.indiana.edu/~sg/pccluster.html
for several talks on cluster technology and
application performance.
2. Visit http://www.scl.ameslab.gov. Netpipe
can be downloaded from there.
3. To find out more about M-VIA, see
http://www.nersc.gov/research/FTG/via.
4. See the GAMMA web site of G. Ciaccio,
http://www.disi.unige.it/project/gamma/.
5. See http://physics.indiana.edu/~sg/ccp2000/
for this graph and others not included here.
