Parametric Estimation of the Ultimate Size of Hypercomputers by Zinoviev, Dmitry
arXiv:1111.4287v1 [cs.PF] 18 Nov 2011
Parametric Estimation of the Ultimate Size of Hypercomputers
Dmitry Zinoviev
Mathematics and Computer Science Department
Suffolk University, Boston, 02114 USA
dmitry@mcssuffolk.org
Abstract
The performance of the emerging petaflops-scale
supercomputers of the near future (hypercomputers)
will be governed not only by the clock frequency of the
processing nodes or by the width of the system bus, but
also by such factors as the overall power consumption
and the geometric size. In this paper, we study the
influence of such parameters on one of the most
important characteristics of a general-purpose computer:
the degree of multithreading that must be present
in an application to make the use of the hypercomputer
justifiable. Our major finding is that for the class of
applications with purely random memory access patterns,
“super-fast computing” and “high-performance
computing” are essentially synonyms for
“massively-parallel computing”.
1. Introduction
Super-fast computers processing data at a sustained
rate on the order of $10^{15}$ integer or floating-point
operations per second (1 petaops, or 1 petaflops), also known
as hypercomputers [9], will be emerging within the next
decade as ultimate tools for solving very large-scale
problems of computational fluid dynamics, weather
forecasting, nuclear stockpile stewardship, cryptanaly-
sis, real-time image processing and rendering, and the
like [3].
Common sense supported by the results of prelimi-
nary case studies [11] suggests that the hypercomputers
will materialize as hardware installations of substantial
size and power consumption. The average geometric
diameter of the installation, combined with the
ultra-high clock frequency, will eventually translate into
a memory access latency of hundreds to thousands of
processor cycles, a situation unthinkable in
the domain of personal computers but quite common
on the Internet. To achieve and sustain the required
performance, the hypercomputer must be designed from
the outset as a highly multithreaded machine [10, 4].
Preemptive multithreading helps to hide the memory
access latency. However, it implies high parallelism,
which inevitably limits the usability of a hypercom-
puter to a narrow domain of intrinsically parallel ap-
plications. Careful consideration of physical factors can
help to anticipate the potential problems that may ren-
der the design of a hypercomputer doomed to failure.
In this paper, we will obtain a rough parametric es-
timation of the performance of hypercomputers based
on their fundamental physical and geometric proper-
ties, such as power consumption and wire size.
2. Model
For the purpose of this study, the following simpli-
fied model of a hypercomputer has been used. We as-
sume that the hypercomputer consists of Q nodes, each
node being either a processing element (PE), or a mem-
ory bank. The nodes are connected using a multistage
internal network. The diameter of the network D is on
the order of $\log_2 Q$ (this is true for delta networks and
approximately true for other high-performance
networks). For ease of application development, all
processing elements have uniform access to the globally
shared memory. A typical application using the hyper-
computer generates purely random memory traffic at a
rate of 1.32 (“load”) requests and 0.78 replies (“store”)
per clock cycle [7], or approximately 1 outbound mes-
sage per cycle per node. All instructions are presumed
to be fetched from local instruction caches and do not
contribute to the total traffic. Data caches are not con-
sidered, taking into account the random pattern of the
memory usage. Finally, we assume that the processor
word width is W bits, the processor clock frequency
is f0, and that each PE completes one instruction per
clock cycle.
Figure 1. Arrangement of the components of
a hypercomputer. A sample message path is
shown with a dashed line. (Labels in the figure:
processing elements, memory, switches; passive interconnects.)
To achieve its ultimate performance, the system
must be well balanced in the sense that the round-trip
memory access latency, measured in PE clock cycles,
should be approximately equal to the degree of
multithreading. In this case, a thread blocked at a memory
request will be scheduled for execution by the hardware
exactly when the results arrive at the local registers.
A smaller degree of multithreading will reduce the
performance of the hypercomputer, while a higher degree
will require extra hardware for thread contexts, most
of which will never be used.
As a first step toward the refining of the proposed
model, we observe that the design of a petaflops-scale
hypercomputer implies three-dimensional integration.
Indeed, it has been shown [2] that the footprint of
a hypercomputer flattened into two-dimensional
space would be as large as a soccer field (namely,
$\sim 1{,}000$ m$^2$). The actual arrangement of the
components (PEs, memory banks, and internal network
nodes) is not essential for the study. We will focus on
a rather unrealistic, but easy to model, spherical con-
figuration, with all active components evenly placed
on the surface of a sphere of diameter L, and all pas-
sive components (wires) hidden under the surface, as
shown in Figure 1. (A similar — but technically more
sound — cylindrical arrangement has been proposed
in [1] and [12].) Such configuration permits relatively
easy access to the active components in case they need
maintenance or replacement.
3. Power Consumption
Electrical power is consumed by the hypercomputer
statically and dynamically. The static term is at-
tributed to the leakage current (which can be ignored,
at least in theory) and to the power dissipation in pas-
sive interconnecting wires. The dynamic term depends
on the performance Θ of the hypercomputer.
Let us begin with the evaluation of the total number
of wires required to interconnect the processing
elements. The signal transfer rate on a wire (limited by
the wire bandwidth $B_w$) may be substantially lower
than the PE clock rate $f_0$. Accordingly, the number
of passive wires in the network must be proportionally
larger, so that the available bandwidth of the network
matches the total bandwidth B of requests
generated by the PEs. Each stage of a multistage network
contributes proportionally to the total number of wires,
too. Finally, we must add extra wires to compensate
for the network saturation, which typically takes place
at α ∼ 60% load:
$$
N = \frac{B}{B_w}\,\frac{D}{\alpha}
  = \frac{(1.1 f_0 W Q)\,D}{B_w \alpha}
  \sim \frac{f_0 W Q D}{B_w \alpha}. \tag{1}
$$
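As a rough illustration of Eq. 1, the following minimal sketch evaluates the wire count for the Test Vehicle parameters listed in Section 3.1. The function name `wire_count` and its argument names are hypothetical and chosen only for this example; the `traffic` argument stands for the 1.1 factor in Eq. 1.

```python
def wire_count(f0, W, Q, D, Bw, alpha, traffic=1.1):
    """Number of wires N from Eq. 1 (illustrative helper, not from the paper).

    f0      PE clock frequency, Hz
    W       processor word width, bits
    Q       number of nodes
    D       network diameter (number of stages)
    Bw      bandwidth of a single wire, bit/s
    alpha   network saturation load (fraction, e.g. 0.6)
    traffic outbound messages per cycle per node (1.1 in Eq. 1)
    """
    B = traffic * f0 * W * Q          # total request bandwidth, bit/s
    return (B / Bw) * (D / alpha)


# Test Vehicle values from Section 3.1, with alpha = 0.6 as stated above.
N_full = wire_count(20e9, 128, 50_000, 16, 3.6e9, 0.6)
# Approximate form of Eq. 1, with the 1.1 factor dropped.
N_approx = wire_count(20e9, 128, 50_000, 16, 3.6e9, 0.6, traffic=1.0)

print(f"N (with 1.1 factor)  = {N_full:.3g}")
print(f"N (approximate form) = {N_approx:.3g}")
```

Under these inputs both forms give on the order of a billion wires, which is the quantity that drives the size estimates in the following sections.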
Second, we must establish a relationship between
the performance of the hypercomputer and its config-
uration and clock frequency. The aggregate peak per-
formance of the hypercomputer, measured in floating
point operations per second, can be roughly estimated
as
$$
\Theta = Q f_0 \,(W/W_0), \tag{2}
$$
where W0 is the number of bits per word in a “stan-
dard” processing element. The ratio in the parentheses
takes into account the fact that, for instance, a 128-bit
PE is twice as powerful as its 64-bit counterpart run-
ning at the same clock rate.
The size of the hypercomputer installation will be
defined by the amount of power $p_v$ that can be
removed from a unit volume by means of either forced-air
or water cooling. The maximum power that can be
removed in the former case is $p_s = 5 \cdot 10^5$ W/m$^2$ [5].
Water cooling can remove more power, but requires
more sophisticated and bulky plumbing. At the moment,
we do not know what the ultimate vertical
chip pitch h for three-dimensional integration will be. A
pitch of h = 5 mm seems a reasonable approximation,
with a proper allowance for the packaging and cooling
infrastructure. Under this assumption, the maximum
power that can be removed from a unit volume
is $p_v = p_s/h = 10^8$ W/m$^3$.
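The conversion from the surface cooling limit to the volumetric one is a single division; the short sketch below reproduces it with the values quoted above. The variable names are chosen for this example only.

```python
import math

p_s = 5e5    # W/m^2, forced-air cooling limit per unit surface [5]
h = 5e-3     # m, assumed vertical chip pitch (5 mm)

# Stacking cooled layers at pitch h turns a per-area limit
# into a per-volume limit.
p_v = p_s / h

print(f"p_v = {p_v:.3g} W/m^3")
```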
3.1. “Test Vehicle” Hypercomputer
To verify our theoretical reasoning, we will consider
a hypothetical hypercomputer of the year 2007. This “Test
Vehicle” hypercomputer (TVHC) will be driven by Q =
50,000 super-fast 128-bit Intel chips ($f_0$ = 20 GHz [8]).
The nodes will be connected using a banyan network
($D = \log_2 Q \approx 16$) implemented as a collection of
insulated thin pure copper wires (bandwidth per wire
$B_w \approx 3.6$ Gbps [5]; resistivity $\rho = 17.5 \cdot 10^{-9}$ Ω·m; wire
electrical cross-section $\sigma_w = 2.5 \cdot 10^{-8}$ m$^2$). One can
verify using Eq. 2 that the peak performance of this
hypercomputer will be $10^{15}$ operations per second, or
1 petaops.
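The check suggested above can be sketched in a few lines. The text does not state the value of $W_0$ used for the TVHC; the sketch below assumes $W_0 = W = 128$ bits, which is the choice that reproduces the quoted $10^{15}$ operations per second (with $W_0 = 64$ the same formula would give twice that).

```python
import math

Q = 50_000    # number of nodes
f0 = 20e9     # Hz, PE clock frequency [8]
W = 128       # bits, processor word width
W0 = 128      # bits, ASSUMPTION: "standard" word width taken equal to W

# Network diameter of a banyan network, D = log2(Q), rounded up to 16 in the text.
D_exact = math.log2(Q)

# Eq. 2: aggregate peak performance.
theta = Q * f0 * (W / W0)

print(f"log2(Q) = {D_exact:.2f}")
print(f"Theta   = {theta:.3g} operations per second")
```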
3.2. Static Power Dissipation
Power dissipated statically by a passive resistive
electrical system is given by Ohm’s law: $P_s = I^2 R$,
where I is the signal current, and R is the overall
resistance of the system. We assume that I ≈ ±20 mA,
although higher-current drivers may be needed to
sustain error-prone high bit rate transmission at
meter-scale distances.
The interconnection network can be ultimately
considered as a collection of N individual wires of length
$l_i$, with electrical cross-section $\sigma_w$, made out of a good
conductor with resistivity ρ. It can be shown that the
average distance between any two components on a
sphere, $\bar{L}$, is $2L/\pi$. The wires are connected in series,
and the total resistance is:
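$$
R = \sum_i^N R_i = \frac{\rho}{\sigma_w} \sum_i^N l_i
  = \frac{\rho N \bar{L}_{st}}{\sigma_w}
  = \frac{2 \rho N L_{st}}{\pi \sigma_w}.
$$
Finally,
$$
P_s = \frac{2 I^2 L_{st} \rho N}{\pi \sigma_w}. \tag{3}
$$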
The total heat generated by the static power
dissipation $P_s$ must be removed from the chip, according
to the conditions stated above. This is only possible
if the volume occupied by the wiring is large enough:
$P_s \le p_v V = \pi p_v L_{st}^3 / 6$. Substituting $P_s$ from Eq. 3 and
N from Eq. 1 and Eq. 2, we get the final dependency
of $L_{st}$ on Θ:
$$
L_{st} \approx \sqrt{\Theta W_0 \left( \frac{\rho I^2}{\sigma_w p_v B_w}\,\frac{D}{\alpha} \right)}. \tag{4}
$$
The diameter of the “static thermal core” for the
TVHC, $L_{st}$, is ≈ 0.008 m.
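The step from Eq. 3 to Eq. 4 can be spelled out as follows. This is a sketch of the algebra implied by the text, under the assumption that the numerical prefactor $12/\pi^2 \approx 1.2$ is what the “≈” sign in Eq. 4 absorbs.

```latex
\begin{align*}
\frac{2 I^2 L_{st}\,\rho N}{\pi \sigma_w} \le \frac{\pi p_v L_{st}^3}{6}
  \;&\Longrightarrow\;
  L_{st}^2 \ge \frac{12}{\pi^2}\,\frac{\rho I^2 N}{\sigma_w p_v}, \\
N \sim \frac{f_0 W Q D}{B_w \alpha} &= \frac{\Theta W_0 D}{B_w \alpha}
  \qquad \text{(Eq. 1 with } f_0 W Q = \Theta W_0 \text{ from Eq. 2)}, \\
L_{st} &\gtrsim \sqrt{\frac{12}{\pi^2}\,\Theta W_0\,
  \frac{\rho I^2}{\sigma_w p_v B_w}\,\frac{D}{\alpha}}.
\end{align*}
```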
3.3. Dynamic Power Dissipation
Dynamic power dissipation is due to the fact that
each operation executed by any PE requires certain
energy (in our case, $w \approx 10^{-10}$ J/op [5]). We have to
consider heat generated by both processing elements
and memories (there are 2Q of them), and switching
elements (there are at least QD/2 of them, assuming a
delta-class interconnection network). We do not know
the exact relationship between the complexity of op-
erations executed by the switching engines and com-
putational engines, and for the purpose of this study
we will assume that they are equivalent. Therefore,
the total dynamic power dissipation Pd in the hyper-
computer is equal to Θw (2 +D/2). According to the
model proposed in Sec. 2, active processing and switch-
ing elements are spread on the surface of the sphere
enclosing the passive interconnection wires, forming
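an “active shell”. The surface of the sphere must
be spacious enough to enable adequate heat removal:
$\Theta w\,(2 + D/2) \le \pi L_{dyn}^2 w_0$, where $w_0$ is the power that can be
removed from a unit surface area (the result quoted below
corresponds to $w_0 = p_s$). Obviously,
$$
L_{dyn} = \sqrt{\frac{\Theta w}{\pi w_0} \left( 2 + \frac{D}{2} \right)}. \tag{5}
$$
For the TVHC, $L_{dyn} \approx 0.8$ m. This is certainly an
optimistic estimate, because a lot of power is required
for various support operations, such as PE “housekeeping”
and memory refreshing.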
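The TVHC figure can be reproduced with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the surface cooling limit in Eq. 5 is the forced-air value of $5 \cdot 10^5$ W/m$^2$ from Section 3.

```python
import math


def dynamic_core_diameter(theta, w, D, p_surface):
    """Eq. 5: diameter of the sphere whose surface can shed the dynamic power.

    theta      aggregate performance, op/s
    w          energy per operation, J/op
    D          network diameter
    p_surface  removable power per unit surface, W/m^2 (ASSUMED equal to p_s)
    """
    P_d = theta * w * (2 + D / 2)     # PEs + memories + switches, in watts
    return math.sqrt(P_d / (math.pi * p_surface))


L_dyn = dynamic_core_diameter(theta=1e15, w=1e-10, D=16, p_surface=5e5)
print(f"L_dyn = {L_dyn:.2f} m")
```

With these inputs the total dynamic power is about 1 MW, spread over a sphere a little under a meter across.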
3.4. Power Dissipation in Drivers
Yet another source of dynamic power consumption
is the set of drivers responsible for the transmission
of digital signals from one agent to another along the
interconnecting wires. Each driver constitutes a current
source injecting either +I or −I into the attached
wire, at voltage U. To reduce noise and decrease the bit
error rate, the drivers must be placed as close to the
agents as possible, and therefore are located on the
same surface of the “active core”. Altogether, 2N
drivers are required, with a total power dissipation
of $P_{dr} = 2NIU$. Again, the surface of the core must
be spacious enough: $2NIU \le \pi L_{dr}^2 p_s$. Naturally,
$$
L_{dr} = \sqrt{\frac{2NIU}{\pi p_s}}
       = \sqrt{\Theta W_0}\,\sqrt{\frac{2IUD}{\pi p_s B_w \alpha}}. \tag{6}
$$
Under the assumption of a very low-voltage driver
(U = 1 V), the diameter of the “thermal core” expands
to $L_{dr} \approx 4.9$ m.
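A minimal numerical sketch of Eq. 6 for the TVHC follows. It takes N from the approximate form of Eq. 1 and a 1 V driver voltage, as in the text; the function and variable names are hypothetical.

```python
import math


def driver_core_diameter(N, I, U, p_s):
    """Eq. 6: diameter of the sphere whose surface can shed the driver power."""
    P_dr = 2 * N * I * U              # two drivers per wire, in watts
    return math.sqrt(P_dr / (math.pi * p_s))


# Approximate form of Eq. 1 with the Section 3.1 parameters:
# f0 = 20 GHz, W = 128 bits, Q = 50,000, D = 16, Bw = 3.6 Gbps, alpha = 0.6.
N = 20e9 * 128 * 50_000 * 16 / (3.6e9 * 0.6)

L_dr = driver_core_diameter(N, I=20e-3, U=1.0, p_s=5e5)
print(f"N    = {N:.3g} wires")
print(f"L_dr = {L_dr:.2f} m")
```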
The diameters of all three thermal spheres considered
so far (Eq. 4, Eq. 5, and Eq. 6) scale as
$\sqrt{\Theta}$. This means, in particular, that the size of the
shell will be determined by static, dynamic, or
driver-related power dissipation, but not by all of them at a
time. More specifically,
$$
L_{pow} = \max(L_{st}, L_{dyn}, L_{dr}). \tag{7}
$$
To summarize: the size of the “minimal thermal
core” of the hypercomputer suggested in Subsection 3.1
must conform to the driver power dissipation require-
ments. The surface of the conforming core will be large
enough to accommodate the processing and switch-
ing elements, and the volume of the core will be large
enough to fit the interconnection wires — without in-
troducing additional power constraints.
4. Wiring Constraints
Alternatively, the ultimate size of a hypercomputer
can be estimated by considering how much space is
required to contain the copper wires constituting the
interconnection network.
If the cross-section of a single interconnecting wire
(including appropriate insulation, cooling, mechanical
support, etc.) is σ, and there is a total of N wires
constituting the interconnection network, then the total
physical volume $V_1$ occupied by the wiring is:
$$
V_1 = \sigma N \bar{L}_g = 2 \sigma N L_g / \pi.
$$
On the other hand, this volume cannot exceed the volume
of the core:
$$
V_2 = \pi L_g^3 / 6.
$$
Therefore, the following simple equation holds:
$$
L_g = \sqrt{12 \sigma N} / \pi \sim \sqrt{\sigma N}. \tag{8}
$$
For interchip connections implemented on a printed circuit
board (PCB), σ may be chosen to be on the order
of $10^{-7}$ m$^2$ (wires are placed at ≈ 0.3 mm pitch).
Substituting Eq. 1 into Eq. 8, we obtain the de-
pendence of the average network size on the PE clock
frequency:
$$
L_g = \sqrt{f_0 W Q \left( \frac{\sigma D}{B_w \alpha} \right)}. \tag{9}
$$
Notice that the parameters in the parentheses are be-
yond our control. (D is a slow function of N and can
be considered a constant.)
Combining Eq. 2 and Eq. 9, we discover that the
average “packing” size of the interprocessor network
again scales as the square root of the performance of
the hypercomputer:
$$
L_g = \sqrt{\Theta W_0 \left( \frac{\sigma D}{B_w \alpha} \right)}. \tag{10}
$$
We would like to emphasize that Eq. 10 has been ob-
tained exclusively by considering the geometric volume
necessary to contain the passive interconnecting wires.
For the TVHC, the “packing” size of the hypercomputer given by
Eq. 10 is almost 9.8 m.
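The packing estimate can be checked numerically. The sketch below evaluates Eq. 8 in both its approximate form, $\sqrt{\sigma N}$, which is the one carried into Eq. 9 and Eq. 10, and with the prefactor $\sqrt{12}/\pi \approx 1.1$ retained. Variable names are chosen for this example, and N is taken from the approximate form of Eq. 1.

```python
import math

sigma = 1e-7      # m^2, wire cross-section including insulation etc. (PCB value)

# Approximate form of Eq. 1 with the Section 3.1 parameters.
N = 20e9 * 128 * 50_000 * 16 / (3.6e9 * 0.6)

L_g = math.sqrt(sigma * N)                           # approximate form of Eq. 8
L_g_prefactor = math.sqrt(12 * sigma * N) / math.pi  # same, prefactor retained

print(f"L_g (approximate)      = {L_g:.2f} m")
print(f"L_g (prefactor kept)   = {L_g_prefactor:.2f} m")
```

Either way the wiring alone calls for a core roughly ten meters across, about twice the driver-limited diameter of Eq. 6.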
5. Parallelism
The “well-balanced” condition postulated in Sec. 2
imposes even stricter requirements on the scaling of a
hypercomputer. The net effect of the geometry of the
system on the expected degree of parallelism will be
discussed in this section.
In a “well balanced” system, the number of thread
contexts per PE (or the amount of parallelism, T ) must
be large enough to tolerate the round trip latency of a
memory access measured in PE clock cycles. The la-
tency includes the signal propagation time τp, message
processing overhead τn, and memory response time τm:
$$
T = (\tau_p + \tau_n + \tau_m)\, f_0
  = \left( \frac{2LD}{c_s} + \frac{DC}{f_0} + \tau_m \right) f_0. \tag{11}
$$
Here, $c_s$ is the signal propagation speed (in copper,
$c_s \approx 9 \cdot 10^7$ m/s), and C is the number of PE cycles
required for message processing at one internal network
node (we take C ∼ 10, but believe that it may be as
low as 1). It can be shown that for the hypercomputer
proposed above, the first term dominates the other two.
Indeed, $\tau_p \approx 2.25$ µs, $\tau_n \approx 5$ ns (the first and the
second terms in Eq. 11), and $\tau_m \approx 1$ ns [5]. For the rest
of our reasoning, we may safely assume that
$$
T \approx L f_0 \,(2D/c_s). \tag{12}
$$
The comparison of Eq. 7 and Eq. 10 with re-
spect to “our” hypercomputer suggests (Figure 2)
that the geometric considerations dominate the power-
management considerations, regardless of the perfor-
mance of the installation. Therefore, the study of the
power consumption may be safely omitted, and we can
concentrate on the geometric term.
The combination of Eq. 10 and Eq. 12 gives the de-
pendence of T on the hypercomputer clock frequency
and overall performance:
$$
T = f_0 \sqrt{\Theta} \left( \frac{2}{c_s} \sqrt{\frac{\sigma D^3 W_0}{B_w \alpha}} \right). \tag{13}
$$
For the TVHC, T ∼ 70,000. As usual, the factors
collected in the parentheses are beyond our control.
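The order of magnitude of T can be reproduced from Eq. 11 directly. The sketch below expresses each latency term in PE clock cycles, taking L equal to the packing size of about 9.8 m from Section 4 and the parameter values quoted in the text; the variable names are chosen for this example.

```python
f0 = 20e9       # Hz, PE clock frequency
D = 16          # network diameter
c_s = 9e7       # m/s, signal propagation speed in copper
C = 10          # PE cycles per message at one network node
tau_m = 1e-9    # s, memory response time [5]
L = 9.8         # m, packing size of the TVHC (Section 4)

cycles_propagation = (2 * L * D / c_s) * f0   # first term of Eq. 11
cycles_network = D * C                        # second term: (D*C/f0) * f0
cycles_memory = tau_m * f0                    # third term

T = cycles_propagation + cycles_network + cycles_memory

print(f"propagation: {cycles_propagation:,.0f} cycles")
print(f"network:     {cycles_network:,.0f} cycles")
print(f"memory:      {cycles_memory:,.0f} cycles")
print(f"T = {T:,.0f} thread contexts per PE")
```

The propagation term exceeds the other two combined by more than two orders of magnitude, which is why Eq. 12 keeps only that term.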
Figure 2. The minimal diameter of the TVHC
installation as a function of its performance.
(Vertical axis: size, m, from 1e-05 to 10; horizontal axis:
performance, OPS, from 1 GIGA to 1 PETA; curves: "Static core",
"Dynamic core", "Driver core", "Wiring core"; reference point:
Intel Celeron.)
6. Solutions
An unpleasant consequence of Eq. 13 is
that the amount of intrinsic parallelism required from
an application in order to be efficiently executed by a
hypercomputer is proportional to the clock frequency
of the PE and to the square root of the overall per-
formance of the machine. This means that “super-fast
computing” and “high-performance computing” are es-
sentially synonyms for “massively-parallel computing”,
and as such cannot be considered suitable for general-
purpose applications with a purely random memory
access pattern.
A number of solutions may be suggested to this
problem. One way to circumvent the “packing”
constraint is to use open-space optical interconnects. For
this kind of link, one can expect to have the bandwidth
$B_0 \approx 40$ Gbps per link, with signal propagation
speed $c_s = 3 \cdot 10^8$ m/s. An important property of an
open-space network is that the links can actually overlap.
Therefore, the size of the core will no longer be limited
by volume. Instead, it will be limited by the
area of the inner surface of the shell:
$$
L_g = \sqrt{4 N \sigma_{LE} / \pi}.
$$
Here, $\sigma_{LE}$ is the footprint of a light-emitting element,
for instance, a vertical-cavity surface-emitting
laser (VCSEL). Assuming that the size of a VCSEL
is 200 µm × 200 µm [6], the diameter of the shell $L_g$
will be ≈ 2 m, a big improvement compared to the
“copper” shell. It is also worth mentioning that the
static power dissipation in an open-space network is
zero, due to the absence of wires.
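The optical estimate can be sketched the same way. The example below assumes that the number of links follows the approximate form of Eq. 1 with the wire bandwidth replaced by $B_0$, and that the saturation load α = 0.6 and D = 16 still apply; these carry-overs are assumptions of this sketch rather than statements from the text.

```python
import math

B0 = 40e9                    # bit/s, bandwidth of one optical link
sigma_LE = (200e-6) ** 2     # m^2, footprint of one VCSEL (200 um x 200 um) [6]

# Approximate form of Eq. 1 with Bw -> B0 (ASSUMED to carry over unchanged).
N_opt = 20e9 * 128 * 50_000 * 16 / (B0 * 0.6)

# Shell diameter limited by the emitter footprint on the inner surface.
L_g_opt = math.sqrt(4 * N_opt * sigma_LE / math.pi)

print(f"N_opt   = {N_opt:.3g} links")
print(f"L_g_opt = {L_g_opt:.2f} m")
```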
We could not find reliable information on the power
consumption of very high-speed VCSELs and
photodiodes. An educated guess is that at 40 Gbps, the power
required by a single emitter is ∼ 0.1 mW. Equation 6
gives the size of the “driver core”: $L_{dr} \approx 3.3$ m. As
one can see, the “driver” shell becomes bigger than the
“packing” shell and determines the size of the TVHC.
Once again, we would like to emphasize that we have
no solid numbers for very high-speed VCSELs, and the
result of this calculation must be considered exclusively
as a rough estimate.
There exists at least one more alternative to copper
wires. They can be replaced with high-speed ballistic
high-Tc superconductor (HTSC) ceramic wires.
HTSC wires promise high data transfer rates ($B_w \approx
10$ Gbps) and high signal propagation speed ($c_s \approx
2 \cdot 10^8$ m/s). These two factors together can reduce the
“packing” size and the degree of parallelism by 40% and
60%, respectively. However, the ultimate cross-section
of ceramic wires is not known at present, and this third
factor may potentially undo the improvement. There will
still be at least some gain, unless the HTSC wires are
$6 \cdot 10^{-7}$ m$^2$ in cross-section or thicker.
The biggest improvement that can be brought in
by the HTSC wires is the shrinkage of the “driver”
core. Superconductor drivers may consume as little as
10µW of power, compared to 20mW for semiconduc-
tor drivers. This would reduce the size of the respective
core to Ldr ≈ 0.5m, which would allow us to totally
exclude it from the consideration.
Unfortunately, HTSC wires can operate only at the
temperature of liquid nitrogen and require deep
refrigeration. The dissipated power will be removed
elsewhere (namely, at the nitrogen liquefier setup, which
may be located outside of the shell) and will not
contribute to the power balance of the core. However, the
cryogenic infrastructure may (and apparently will) in-
flate the effective cross-section σ of the interconnects.
The net effect of this inflation is not known yet.
7. Conclusion
We have considered the parametric dependences of
the geometric size of a hypothetical petaflops-scale hy-
percomputer on the geometric size and power proper-
ties of its interconnection network. We discovered that
the size of a hypercomputer with spherical arrange-
ment of active components (processing and switching
elements and memories) scales as the square root of
the aggregate peak performance: $L \sim \sqrt{\Theta}$. In order to
sustain the execution rate, the hypercomputer must be
designed as a highly multithreaded machine. As such,
it will be most suited for highly parallel applications.
Even though it may be possible to reduce the degree
of parallelism by optimizing the implementation of the
network, it is questionable whether a general-purpose
application with purely random memory access pattern
can benefit from being executed by the hypercomputer.
8. Acknowledgments
The author would like to thank Paul Ezust and Dan
Stefănescu (Suffolk University) for useful discussions
and help with the preparation of the manuscript, and
T. Sterling (JPL), K. Likharev (SUNY), and P. Bunyk
(TRW) for the inspiration.
References
[1] L. Abelson, Q. Herr, G. Kerber, M. Leung, and
T. Tighe. Manufacturability of superconductor elec-
tronics for a petaflops-scale computer. IEEE Trans.
on Appl. Supercond., 9(2):3202–3207, June 1999.
[2] M. Dorojevets, P. Bunyk, D. Zinoviev, and
K. Likharev. “COOL-0”: Design of an RSFQ subsystem
for petaflops computing. IEEE Trans. on Appl.
Supercond., 9(2):3606–3614, June 1999.
[3] G. Gao, K. K. Likharev, P. C. Messina, and T. L.
Sterling. Hybrid technology multithreaded architecture.
In Proc. Frontiers’96, pages 98–105, Annapolis,
MD, Feb. 1996.
[4] G. Gao, K. Theobald, A. Marquez, and T. Ster-
ling. The HTMT program execution model. Technical
Memo 09, Univ. of Delaware, CAPSL, July 1997.
[5] ITWG. International technology roadmap for semi-
conductors. Available electronically at:
http://public.itrs.net/, 1999.
[6] LaserMate. VCSEL (Vertical Cavity Surface Emitting
Laser): The Next Generation Laser. Lasermate Cor-
poration, 2001. Available electronically at:
http://www.lasermate.com//vcsels.htm.
[7] D. Patterson and J. Hennessy. Computer Architecture:
A Quantitative Approach. Morgan Kaufmann
Publishers, Inc., 2nd edition, 1996.
[8] L. D. Paulson. Intel technology promises 20-GHz
chips. IEEE Computer, pages 25–27, Dec. 2001.
[9] T. Sterling, 1999. Personal communication.
[10] Tera. Tera: Principles of Operation. Tera Computer
Company, 1997.
[11] L. Wittie, G. Sazaklis, Y. Zhou, and D. Zinoviev.
High throughput networks for petaflops computing. In
Proc. 17th IEEE Symp. on Reliable Distributed Sys-
tems (SRDS’98), pages 312–317, West Lafayette, IN,
Oct. 1998.
[12] L. Wittie, D. Zinoviev, G. Sazaklis, and K. Likharev.
CNET: Design of an RSFQ switching network for
petaflops-scale computing. IEEE Trans. on Appl. Su-
percond., 9(2):4034–4039, June 1999.
