The quest for petascale computing by Dongarra, Jack J. & Walker, David William
One petaflops is a rate of computation corresponding to 10^15 floating-point operations per second. To be of use in scientific computing, a
computer capable of this prodigious speed needs a
main memory of tens or hundreds of terabytes and
enormous amounts of mass storage. Sophisticated
compilers and high memory and I/O bandwidth
are also essential to exploit the architecture efﬁ-
ciently. To mask the hardware and software com-
plexities from the scientiﬁc end user, it would be
advantageous to access and use a petascale com-
puter through an advanced problem-solving en-
vironment. Immersive visualization environments
could play an important role in analyzing and nav-
igating the output from petascale computations.
Thus, petascale computing is capable of driving
the next decade of research in high-performance
computing and communications and will require
advances across all aspects of it.
A report from the President’s Information
Technology Advisory Committee recommends
that federal research programs “… drive high-
end computing research by trying to attain a sus-
tained petaops/petaﬂops on real applications by
2010 through a balance of software and hard-
ware strategies.” (For a copy of the committee’s
report, see www.ccic.gov/ac/report.) The report
identifies many applications likely to benefit
from petascale computing, including1  
• nuclear weapons stewardship,
• cryptology,
• climate and environmental modeling,
• 3D protein molecule reconstruction,
• severe storm forecasting,
• design of advanced aircraft,
• molecular nanotechnology,
• intelligent planetary spacecraft, and
• real-time medical imaging.
The application of petascale computing to these
areas should enhance US economic competitive-
ness and our knowledge of the universe around
us,2 and such factors have motivated a small band
of experts to address petascale computing issues
through a series of workshops and conferences.
Such activities engendered small-scale point de-
sign studies, which the National Science Foun-
dation, DARPA, and NASA funded to look at
different aspects of petascale computer design.
Recently, these agencies funded a project to ex-
amine in more detail the so-called hybrid tech-
nology multithreaded (HTMT) computer as a
pathway to petascale computing.
COMPUTING IN SCIENCE & ENGINEERING, MAY/JUNE 2001
HIGH-PERFORMANCE COMPUTING
THE QUEST FOR PETASCALE COMPUTING
Although the challenges to achieving petascale computing within the next decade are
daunting, several software and hardware technologies are emerging that could help us
reach this goal. The authors review these technologies and consider new algorithms
capable of exploiting a petascale computer’s architecture.
JACK J. DONGARRA
University of Tennessee
DAVID W. WALKER
Cardiff University
1521-9615/01/$10.00 © 2001 IEEE
This article discusses the current status of efforts to make petascale computing a reality, but first we review trends in supercomputer performance over the past 50 years.
Trends in supercomputer
performance
In the last 50 years, the sci-
entiﬁc computing ﬁeld has seen
rapid changes in vendors, ar-
chitectures, technologies, and
system usage. However, despite
all these changes, the long-
term evolution of performance
seems to remain steady and
continuous. Moore’s Law is of-
ten cited in this context. If we
plot the peak performance of
the leading supercomputers
over the last ﬁve decades (see Figure 1), we see
that this law holds well for almost the complete
lifespan of modern computing; on average, per-
formance increases by two orders of magnitude
every decade.3
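As a quick check on what "two orders of magnitude every decade" implies, the following Python sketch converts that rate into an equivalent annual growth factor and a doubling time. The function names are illustrative and do not come from the article.

```python
import math

def annual_growth_factor(factor_per_decade: float) -> float:
    """Convert a per-decade growth factor into the equivalent per-year factor."""
    return factor_per_decade ** (1.0 / 10.0)

def doubling_time_years(factor_per_decade: float) -> float:
    """Years needed for performance to double at the given per-decade factor."""
    return 10.0 * math.log(2.0) / math.log(factor_per_decade)

if __name__ == "__main__":
    # Two orders of magnitude per decade, as stated in the text.
    print(f"annual growth factor: {annual_growth_factor(100.0):.2f}")
    print(f"doubling time: {doubling_time_years(100.0):.2f} years")
```

A factor of 100 per decade corresponds to a doubling time of roughly a year and a half, which is the form in which Moore's Law is usually quoted.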
To provide a better basis for statistics on high-
performance computers, a group of researchers
initiated the Top500 list,4 which reports the sites
with the 500 most powerful computer systems
installed. The best Linpack benchmark perfor-
mance5 achieved is used as a performance mea-
sure in ranking the computers. The Top500 list
has been updated twice a year since June 1993.
Although many aspects of the HPC market
change over time, the evolution of performance
seems to follow empirical laws, such as Moore’s
Law. The Top500 data provides an ideal basis to
verify an observation like this. Looking at the
computing power of the individual machines in
the Top500—and the evolution of the total in-
stalled performance—we can plot the perfor-
mance of the systems at positions one, 10, 100
and 500 in the list, as well as the total accumu-
lated performance. In Figure 2, the curve of po-
sition 500 shows an average increase by a factor
of two per year. All other curves show a growth
rate of 1.8, plus or minus 0.07 per year.
Figure 1. Moore’s Law and the peak performance of various computers over time. (Plot of peak performance, from 1 Kflops to 1 Tflops, against year, 1950 to 2000, for machines ranging from EDSAC 1 and UNIVAC 1 to ASCI White Pacific.)
Figure 2. The overall growth of accumulated and individual performance as seen in the Top500 report. (Plot of performance in Gflops against Top500 list date, June 1993 to November 2000, for the sum of all systems and for positions N=1 and N=500.)
Based on current Top500 data (which cover the last seven years), and the assumption that the current performance trends will continue for some time to come, we can extrapolate the observed
performance. In Figure 3, we do this by using lin-
ear regression on a logarithmic scale. This means
that exponential growth is ﬁtted for all levels of
performance in the Top500. These simple ﬁts to
the data show surprisingly consistent results.
Based on the extrapolation from these ﬁts, we can
expect the ﬁrst 100 Tﬂops system to be available
by 2005, which is one or two years later than the
Accelerated Strategic Computing Initiative
(ASCI) anticipates. By 2005, no system smaller
than 1 Tﬂops will make the Top500 list.
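The fitting procedure described here, linear regression on a logarithmic scale, can be sketched as follows. This is a minimal illustration, not the authors' code: the data points are synthetic values generated from an assumed growth factor of 1.8 per year, not Top500 measurements, and the function names are hypothetical.

```python
import math

def fit_exponential(years, perf):
    """Least-squares fit of log10(perf) = a + b * year; returns (a, b)."""
    n = len(years)
    logs = [math.log10(p) for p in perf]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def year_reaching(a, b, target):
    """Year at which the fitted curve reaches `target` (same units as perf)."""
    return (math.log10(target) - a) / b

# Synthetic, illustrative data only (NOT Top500 measurements): a series that
# starts at 1.0 units in year 0 and grows by a factor of 1.8 every year,
# sampled twice a year like the Top500 list.
years = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
perf = [1.8 ** y for y in years]

a, b = fit_exponential(years, perf)
growth_per_year = 10 ** b  # slope of the log fit, expressed as a yearly factor
print(f"fitted growth factor per year: {growth_per_year:.3f}")
print(f"year at which 1000 units is reached: {year_reaching(a, b, 1000.0):.2f}")
```

A straight line in log space corresponds to exponential growth, so the slope `b` gives the yearly growth factor directly and the extrapolation reduces to solving a linear equation for the target level.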
Looking even further into the future, we can
speculate, based on the current doubling of per-
formance every year, that the ﬁrst petaﬂops system
should be available around 2009. However, due to
the rapid changes in the technologies used in HPC
systems, predicting such a system’s architecture is
difﬁcult—although the HTMT architecture and
IBM’s Blue Gene design offer promising avenues
to petaﬂops performance. Even though the HPC
market has changed substantially in the 25 years
since the Cray 1’s debut, there is still no end in sight
to the rapid cycles of technological innovation that
characterize supercomputing’s history.
Hardware technologies for petascale
computing
We can group hardware technologies for
achieving petaﬂops computing performance into
ﬁve main categories: conven-
tional technologies, processing-
in-memory (PIM) designs, de-
signs based on superconducting
processor technologies, special-
purpose hardware designs, and
schemes that use the aggregate
computing power of Web-distributed processors. We consider here only technologies that are currently available or expected to be available in the near future. Thus, we don’t discuss
designs based on speculative
technologies, such as quantum6
or macromolecular7 computing,
although they might be impor-
tant in the long run.
Conventional technology
Systems based on conven-
tional technology are those pro-
duced by extrapolating current
mainstream compute chip de-
sign and fabrication technologies into the future.
This extrapolation is largely based on the projections
published by the Semiconductor Industry
Association, the leading trade association repre-
senting US computer chip manufacturers.8 SIA
predicts 3.5-GHz clock speeds by 2006, which
should increase to 6 GHz by 2009. Assuming 70-
nanometer lithography, the SIA also expects
chips to possess 8 Gbytes of DRAM and 1 Gbyte
of SRAM by 2009.
Rick Stevens at Argonne National Laboratory
posits that a 2.5-Pﬂops system will be available
by 2010.9 This system is based on 4.8-GHz
processors and is composed of 8,000 symmetric
multiprocessor nodes. Each node contains 16
CPUs and has an expected performance of 300
Gflops. The system’s total memory would be
32.8 Tbytes, with 1.3 Pbytes of disk storage.
Stevens predicts the system will cost approxi-
mately $16 million. However, because it would
need 8 megawatts of power, energy consump-
tion would cost about $8 million per year.
Steve Wallach of CenterPoint Venture Part-
ners has put forward a design for a commercial
off-the-shelf petaﬂops system that is similar to
that of Stevens.10 Wallach proposes an 8,192-
node system with four CPUs per node and a per-
formance of 120 Gﬂops per node. This yields a
system with a total performance of approxi-
mately 1 Pﬂops. In estimating his system’s power
requirements, Wallach includes the energy the main memory and the CPUs will consume, giving a total of about 4.6 megawatts. Wallach’s design proposes an optical interconnection network between the system’s nodes based on OC768 channels. These would provide a bisection bandwidth of 50 Tbyte/s.
Figure 3. Extrapolation of recent growth rates of performance seen in the Top500. (Plot of performance in Gflops, on a logarithmic axis from 0.1 to 1,000,000, against Top500 list date, June 1993 to November 2009, for the sum of all systems and for positions N=1, N=10, and N=500, with the 1 Tflops and 1 Pflops levels and the ASCI and Earth Simulator systems marked.)
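The aggregate figures quoted for the Stevens and Wallach designs follow from simple arithmetic on the node counts and per-node rates. The sketch below reproduces that arithmetic; the helper names are illustrative, and the electricity price calculation assumes the machine draws its full 8 megawatts continuously all year, which the article does not state.

```python
GFLOPS = 1.0e9
PFLOPS = 1.0e15
HOURS_PER_YEAR = 8760.0

def aggregate_peak(nodes: int, gflops_per_node: float) -> float:
    """Aggregate peak rate in flops for `nodes` identical nodes."""
    return nodes * gflops_per_node * GFLOPS

def implied_price_per_kwh(megawatts: float, annual_cost_dollars: float) -> float:
    """Electricity price implied by a yearly energy bill.

    Assumption: the system draws `megawatts` continuously for a full year.
    """
    kwh_per_year = megawatts * 1000.0 * HOURS_PER_YEAR
    return annual_cost_dollars / kwh_per_year

stevens = aggregate_peak(8_000, 300.0)  # node figures quoted for the Stevens design
wallach = aggregate_peak(8_192, 120.0)  # node figures quoted for the Wallach design
price = implied_price_per_kwh(8.0, 8.0e6)

print(f"Stevens design: {stevens / PFLOPS:.2f} Pflops")
print(f"Wallach design: {wallach / PFLOPS:.2f} Pflops")
print(f"implied electricity price: ${price:.3f} per kWh")
```

With the quoted node figures, the Stevens design comes to about 2.4 Pflops, close to the 2.5 Pflops stated, and the Wallach design to just under 1 Pflops.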
PIM designs
A central issue in the design and use of HPC
systems is that the peak processing speeds of CPUs have increased at a faster rate than memory access speeds. With conventional technologies, the ratio of memory access time to CPU cycle time will continue to increase and is expected
to exceed 400 by 2010.11 This has led to HPC sys-
tems with complex memory hierarchies. A num-
ber of latency-hiding techniques reduce this dis-
parity’s impact to improve the efﬁciency with
which applications use the hardware, including
hardware and compiler support for multithreaded
programming and structuring algorithms to max-
imize data reuse in the upper levels of memory. 
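Structuring an algorithm to maximize data reuse usually means working on small tiles that fit in the fast upper levels of memory. The sketch below shows the loop structure of a blocked (tiled) matrix multiplication next to the straightforward version. It is written in pure Python to show the structure only; the block size of 2 is an arbitrary illustrative value, and real speedups come from compiled kernels tuned to a specific cache.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c

def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile so each tile of a, b, and c
    is reused several times while it is (notionally) resident in fast memory."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # Multiply one tile of a by one tile of b into a tile of c.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += aik * b[k][j]
    return c

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]
print(matmul_blocked(a, b) == matmul_naive(a, b))
```

Both versions perform the same arithmetic; only the order of memory accesses differs, which is what matters on a machine with a deep memory hierarchy.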
PIM designs seek to reduce the memory access
bottleneck by integrating processing logic and
memory on the same chip, thereby dramatically
improving the memory access time to CPU cycle
time ratio. Peter Kogge and his colleagues at
IBM developed Execube, the first such chip.12
Analog Devices introduced the Sharc system-on-
a-chip in 1994, incorporating a digital signal
processor and 0.5 Mbytes of SRAM. In 1996,
Mitsubishi produced the M32R/D, the ﬁrst com-
mercial DRAM PIM chip, which integrates a 20-
MHz processor and 2 Mbytes of DRAM. Devel-
opers have since produced a number of PIM
products and projects (see www.ai.mit.edu/projects/aries/course/notes/pim.html).
Digital superconductor technologies
Clock speeds for semiconductor logic circuits
are expected to reach a peak of a few GHz within
the next 10 years, mainly due to prohibitively
large power requirements. However, digital su-
perconductor technology, based on Rapid Sin-
gle-Flux-Quantum (RSFQ) logic, has achieved
speeds of 760 GHz and has extremely low power
consumption. These features, together with a
simple fabrication technology, make digital su-
perconductor devices a promising basis for
petascale computers. In such devices, a single
quantum of magnetic flux encodes digital bits,
while data are transferred as picosecond “SFQ”
pulses. However, logic devices based on superconducting
niobium must be cooled to between 4 and 5 kelvins, although the cost of refrigeration
would make up only a fraction of a
petascale system’s total cost. Current plans13 are
to use 0.8-µm RSFQ technology as a key com-
ponent in an HTMT petascale computer.
Special-purpose hardware
Researchers on the Grape project (short for
gravity pipe, see http://grape.c.u-tokyo.ac.jp/grape)
at the University of Tokyo have designed and built
a family of special-purpose attached processors for
performing the gravitational force computations
that form the inner loop of N-body simulation
problems. The computational astrophysics com-
munity has extensively used Grape processors for
N-body gravitational simulations. A Grape-4 sys-
tem consisting of 1,692 processors, each with a
peak speed of 640 Mﬂops, was completed in June
1995 and has a peak speed of 1.08 Tﬂops.14 In
1999, a Grape-5 system won a Gordon Bell Award
in the performance per dollar category, achieving
$7.3 per Mﬂops on a tree-code astrophysical sim-
ulation. A Grape-6 system is planned for comple-
tion in 2001 and is expected to have a performance
of about 200 Tﬂops. A 1-Pﬂops Grape system is
planned for completion by
2003.
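The inner loop that special-purpose hardware of this kind accelerates is the pairwise gravitational force summation. A minimal software sketch of that direct-summation loop is shown below; it is a generic illustration rather than the Grape pipeline itself, and the `softening` parameter (a common device for avoiding singularities at small separations) and the unit gravitational constant are assumptions.

```python
import math

def accelerations(positions, masses, g=1.0, softening=0.0):
    """Direct-summation gravitational accelerations for N bodies.

    positions: list of (x, y, z) tuples; masses: list of floats.
    Cost is O(N^2) pairwise interactions, which is why this loop dominates
    the running time of N-body codes.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + softening * softening
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            f = g * masses[j] * inv_r3
            acc[i][0] += f * dx
            acc[i][1] += f * dy
            acc[i][2] += f * dz
    return acc

# Two bodies two length units apart along the x axis.
acc = accelerations([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 3.0])
print(acc)
```

Each body is pulled toward the other with an acceleration proportional to the other body's mass, so the heavier body accelerates less.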
Web-based petascale
computing
The Web’s aggregate com-
puting power is enormous, but
because of severe latency and
bandwidth constraints, we can
harness it for only very loosely
coupled applications. We can
divide examples of Web-based
computations into those that
are community-based (and use
underutilized PCs across the
Web) and those that involve a
higher degree of coordination
and more dedicated resources. Examples of com-
munity-based Web computing projects include
• Cracking the 56-bit DES encryption standard, which the Electronic Frontier Foundation’s “Deep Crack” computer, working with a worldwide network of nearly 100,000 PCs, achieved in January 1999 in a record time of 22 hours and 15 minutes.
• The Great Internet Mersenne Prime Search,
which recently discovered a two-million-digit
prime number by using a network of tens of
thousands of PCs. This work was done under
the auspices of Entropia (www.entropia.com)
and claims a performance of 1 Tﬂops.
• The SETI@home project, which uses un-
derutilized PCs across the Web to search for
extraterrestrial intelligence.15
We need one million 1-Gﬂops PCs to reach a
performance of 1 Pﬂops for a
community-based Web com-
puting project; it is not incon-
ceivable to achieve this within
the next few years. Here, the
deﬁnition of an application be-
comes rather strained because
once the computational parts
that come together to solve a
problem become sufﬁciently
uncoordinated, it is not clear
that they still comprise an ap-
plication in the traditional
sense.
Computing resources dis-
tributed over the Web have
been used for a number of scientific applica-
tions—generally, they possess a greater degree
of coordination than the community-based Web
computing projects discussed earlier. However,
the high latency and low bandwidth that currently
characterize communication on the Web
mean that we can exploit concurrency in such
distributed scientiﬁc applications only at coarse
granularity. Communication between the coarse-
grain components of these applications is typi-
cally performed with a network-oriented trans-
port layer such as Corba, Globus,16 or Legion.17
An example of this class of Web-based applica-
tion is the numerical relativity computations that
have been performed using computers distrib-
uted across the US and Europe.18 Because of
high communication costs, typical scientiﬁc sim-
ulation applications are unlikely to achieve
petaflops performance across the Web in the
foreseeable future.
The HTMT project
The HTMT project is a partnership between
industry, academia, and the US Government to
develop a petascale computer by the year 2010.
The intention is to exploit a number of leading-
edge hardware technologies to build a scalable
architecture that delivers high performance and
has acceptable cost and power consumption.
The HTMT machine has a deep memory hier-
archy, and latency reduction and hiding are key
in its design and use. Here are some of the tech-
nologies central to the HTMT concept:
• superconductor RSFQ logic, which is the
basis for superconductor processing ele-
ments (Spells) with a clock speed of around
150 GHz. HTMT contains 2,048 proces-
sors, each capable of 600 Gﬂops, giving a to-
tal peak speed of 1.2 Pﬂops; 
• SRAM and DRAM PIM chips;
• a data vortex optical interconnection net-
work with communication speed of 500
Gbps per channel; and
• hardware support for latency management.
The HTMT architecture has four levels of
distributed memory. Each Spell has 1 Mbyte of
cryogenically cooled RAM maintained at liquid
helium temperature. The next two levels in the
memory hierarchy are SRAM and DRAM PIM-
based memory, and the fourth is holographic 3/2
memory. The SRAM PIMs are cooled to liquid
nitrogen temperature and are connected to
DRAM main memory by an optical packet-
switched network.
Latency management across these multiple
levels of memory is crucial to the HTMT exe-
cution model. Multithreading is a commonly
used way of hiding latency by context switching
between concurrent threads and is the basis of
the HTMT execution model. Multithreading
and context prefetching are important tech-
niques for latency tolerance in the hierarchical
memory. In the HTMT execution model, a mul-
tilevel multithreaded context-management strat-
egy, known as thread percolation, automatically
migrates contexts through the memory hierar-
chy to keep the Spells supplied with work.19 The
memory hierarchy’s PIM-based levels control
thread percolation. They decide which compu-
tations will be performed next at a coarse granu-
larity, and the Spells schedule and perform the
computations at a ﬁne granularity.
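To see why multithreading hides latency, consider a simple textbook-style model, sketched below. It is not the HTMT execution model: it assumes each thread computes for a fixed number of cycles, then stalls on memory for a fixed latency, and that switching between threads is free. The parameter names and the example values are hypothetical.

```python
import math

def utilization(threads: int, run_cycles: int, latency_cycles: int) -> float:
    """Fraction of time the processor does useful work.

    Assumptions: every thread alternates `run_cycles` of computation with
    `latency_cycles` of waiting on memory, and context switches cost nothing.
    """
    return min(1.0, threads * run_cycles / (run_cycles + latency_cycles))

def threads_to_saturate(run_cycles: int, latency_cycles: int) -> int:
    """Smallest number of threads that keeps the processor fully busy."""
    return math.ceil((run_cycles + latency_cycles) / run_cycles)

# Hypothetical example: 10 cycles of work per memory access, 400 cycles of latency.
for n in (1, 8, 41):
    print(n, "threads ->", f"{utilization(n, 10, 400):.3f}")
print("threads needed:", threads_to_saturate(10, 400))
```

In this model a single thread keeps the processor busy only a few percent of the time, and dozens of concurrent threads are needed before the latency is fully hidden, which is why applications must expose a high degree of concurrency.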
Programming methodologies for
petascale computing
Many of the lessons learned in the HTMT
project are likely to be incorporated in first-
generation petaflops computers—a multi-
threaded execution model and sophisticated
context management are approaches that might
be necessary to provide hardware and system-
level software support. These approaches could
help control the impact of latency across mul-
tiple levels of hierarchical memory. Thus, al-
though a petaflops computer might have a
global name space, the application programmer
will still need to be aware of the memory hier-
archy and program in a style that makes opti-
mal use of it and exploits data locality wherever
possible. Algorithms and applications will have
to be carefully crafted to achieve an acceptable
fraction of peak performance. Furthermore, ap-
plications will have to possess a high degree of
concurrency to provide enough threads to keep
the processors busy. 
The main programming methodologies likely
to be used on petascale systems will be
• A data parallel programming language, such
as HPF, for very regular computations.
• Explicit message passing (using, for exam-
ple, MPI) between sequential processes, pri-
marily to ease porting of legacy parallel ap-
plications to the petascale system. In this
case, an MPI process would be entirely vir-
tual and correspond to a bundle of threads.
• Explicit multithreading to get acceptable
performance on less regular computations.
Given the requirements of high concurrency
and latency tolerance, highly tuned software li-
braries and advanced problem-solving environ-
ments are important in achieving high perfor-
mance and scalability on petascale computers.
Clearly, the observation that applications with a
high degree of parallelism and data locality per-
form best on high-performance computers will
be even truer for a petascale system character-
ized by a deep memory hierarchy. It is also often
the case that to achieve good performance on an
HPC system, a programming style must be
adopted that closely matches the system’s execu-
tion model. Compilers should be able to extract
concurrent threads for regular computations,
but a class of less regular computations must be
programmed with explicit threads to get good
performance. It might also be necessary for ap-
plication programmers to bundle groups of
threads that access the same memory resources
into strands in a hierarchical way that reﬂects the
hierarchical memory’s physical structure. Opti-
mized library routines will probably need to be
programmed in this way.
IBM’s proposed Blue Gene computer (www.research.ibm.com/bluegene) uses massive parallelism to achieve petaops performance. Therefore, it can be based on a less radical design than
the HTMT machine, although it still uses on-
chip memory and multithreading to support a
high-bandwidth, low-latency communication
path from processor to memory. Blue Gene is in-
tended for use in computational biology applica-
tions, such as protein folding, and will contain on
the order of one million processor–memory
pairs. Its basic building block will be a chip that
contains an array of processor–memory pairs and
interchip communication logic. These chips will
be connected into a 3D mesh.
New algorithms for petascale
computing
Computations that possess insufﬁcient inher-
ent parallelism (or in which we cannot exploit par-
allelism efﬁciently because of data dependencies)
are not well suited for petascale systems. There-
fore, a premium is placed on algorithms that are
regular in structure but that may have a higher op-
eration count over more irregular algorithms with
a lower operation count. Dense matrix computa-
tions, such as those the Lapack software library
provides,20 optimized to exploit data locality are
an example of regular computations that should
perform well on a petascale system.
The accuracy and stability of numerical meth-
ods for petascale computations
are also important issues be-
cause some algorithms can
suffer signiﬁcant losses of nu-
merical precision. Thus, slow-
er but more accurate and sta-
ble algorithms are favored
over faster but less accurate
and stable ones. To this end,
hardware support for 128-bit
arithmetic is desirable, and in-
terval-based algorithms may
play a more important role on
future petascale systems than
they do on current high-per-
formance machines. Interval
arithmetic provides a means of tracking errors
in a computation, such as initial errors and un-
certainties in the input data, errors in analytic
approximations, and rounding error. Instead of
being represented by a single floating-point
number, each quantity is represented by a lower
and upper bound within which it is guaranteed
to lie. The interval representation, therefore,
provides rigorous accuracy information about a
solution that is absent in the point representa-
tion.
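The idea of carrying a lower and an upper bound through a computation can be sketched with a small interval type. This is a minimal illustration, not a production interval library: it widens every result by one unit in the last place in each direction as a conservative stand-in for directed rounding, and it relies on `math.nextafter`, which requires Python 3.9 or later. The class and helper names are hypothetical.

```python
import math
from dataclasses import dataclass

def _down(x: float) -> float:
    """Next representable float below x (conservative lower bound)."""
    return math.nextafter(x, -math.inf)

def _up(x: float) -> float:
    """Next representable float above x (conservative upper bound)."""
    return math.nextafter(x, math.inf)

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(_down(self.lo + other.lo), _up(self.hi + other.hi))

    def __sub__(self, other: "Interval") -> "Interval":
        return Interval(_down(self.lo - other.hi), _up(self.hi - other.lo))

    def __mul__(self, other: "Interval") -> "Interval":
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(_down(min(products)), _up(max(products)))

    def contains(self, x: float) -> bool:
        return self.lo <= x <= self.hi

# An interval guaranteed to enclose the real number 1/10, which has no exact
# binary floating-point representation.
tenth = Interval(_down(0.1), _up(0.1))

total = Interval(0.0, 0.0)
for _ in range(10):
    total = total + tenth

print(total)                # a narrow interval around 1.0
print(total.contains(1.0))  # the exact answer lies inside the bounds
```

Summing 0.1 ten times in ordinary floating point does not give exactly 1.0, but the interval result is guaranteed to bracket the exact answer, and its width reports how much accuracy has been lost.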
The advent of petascale computing systems
might promote the adoption of completely new
methods to solve certain problems. For exam-
ple, cellular automata are highly parallel, are very
regular in structure, can handle complex geome-
tries, and are numerically stable, and so are well
suited to petascale computing. CA provide an in-
teresting and powerful alternative to classical
techniques for solving partial differential equa-
tions. Their power derives from the fact that
their algorithm dynamics
mimic (at an abstract level)
the fine-grain dynamics of 
the actual physical system,
permitting the emergence 
of macroscopic phenomena
from microscopic mecha-
nisms. Thus, complex collec-
tive global behavior can arise
from simple components
obeying simple local interac-
tion rules. CA algorithms are
well suited for applications
involving nonequilibrium and
dynamical processes. The use of CA on future
types of “programmable matter” computers is
discussed elsewhere21—such computers can de-
liver enormous computational power and are
ideal platforms for CA-based algorithms. Thus,
on the ﬁve-to-10-year timescale, CA will play an
increasing role in the simulation of physical (and
social) phenomena.
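The locality and regularity that make cellular automata attractive for parallel machines are visible even in the simplest case. The sketch below implements a one-dimensional, two-state automaton with periodic boundaries; the choice of Wolfram's rule 90 is arbitrary and purely illustrative, and this toy is far simpler than the CA models used for physical simulation.

```python
def step(cells, rule=90):
    """Advance a 1D binary cellular automaton by one synchronous step.

    Each new cell depends only on its left neighbour, itself, and its right
    neighbour, so every cell can be updated independently and in parallel.
    `rule` is the Wolfram rule number (0 to 255); boundaries are periodic.
    """
    n = len(cells)
    out = []
    for i in range(n):
        left = cells[(i - 1) % n]
        centre = cells[i]
        right = cells[(i + 1) % n]
        index = (left << 2) | (centre << 1) | right  # neighbourhood as a 3-bit number
        out.append((rule >> index) & 1)              # look up the new state in the rule
    return out

cells = [0, 0, 0, 1, 0, 0, 0]
first = step(cells)
second = step(first)
for row in (cells, first, second):
    print("".join("#" if c else "." for c in row))
```

Because the update rule is identical at every site and reads only nearest neighbours, the work partitions evenly across processors and communication is limited to the edges of each partition.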
Another aspect of software development for
petascale systems is the ability to automatically
tune a library of numerical routines for optimal
performance on a particular architecture. The Automatically
Tuned Linear Algebra Software (Atlas) project
exempliﬁes this concept.22 Atlas is an approach for
the automatic generation and optimization of nu-
merical software for computers with deep mem-
ory hierarchies and pipelined functional units.
With Atlas, numerical routines are developed with
a large design space spanned by many tunable pa-
rameters, such as blocking size, loop nesting per-
mutations, loop unrolling depths, pipelining
strategies, register allocations, and instruction
schedules. When Atlas is ﬁrst installed on a new
platform, a set of runs automatically determines
the optimal parameter values for that platform,
which are then used in subsequent uses of the
code. We could apply this idea (with further re-
search) to other application areas, in addition to
numerical linear algebra, and extend it to develop
numerical software that can dynamically explore
its computational environment and intelligently
adapt to it as resource availability changes.
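The install-time search described above can be sketched, in a much reduced form, as timing one kernel over a handful of candidate parameter values and keeping the fastest. The code below is an illustration of the empirical tuning idea only and is not how Atlas is implemented: the kernel (a blocked matrix transpose), the candidate block sizes, and the function names are all hypothetical, and a real tuner searches a far larger parameter space.

```python
import time

def blocked_transpose(a, block):
    """Transpose a square matrix (list of lists) one tile at a time."""
    n = len(a)
    t = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for i in range(i0, min(i0 + block, n)):
                for j in range(j0, min(j0 + block, n)):
                    t[j][i] = a[i][j]
    return t

def autotune(kernel, candidates, repeats=3):
    """Time kernel(param) for each candidate; return (best_param, timings).

    The best of `repeats` runs is kept for each candidate to reduce the
    influence of timing noise.
    """
    timings = {}
    for param in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(param)
            best = min(best, time.perf_counter() - start)
        timings[param] = best
    best_param = min(timings, key=timings.get)
    return best_param, timings

n = 64
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
candidates = (4, 8, 16, 64)
best_block, timings = autotune(lambda blk: blocked_transpose(a, blk), candidates)
print("fastest block size on this machine:", best_block)
```

The selected value depends on the machine the search runs on, which is the point: the parameter is measured on the target platform rather than chosen in advance.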
In general, a multithreaded execution model
implies that the application developer does not
have control over the detailed order of arithmetic
operations (such control would incur a large per-
formance cost). Thus, numerical reproducibility,
which is already difﬁcult to achieve with current
architectures, will be lost. A corollary of this is
that for certain problems, it will be impossible to
predict a priori which solution method will per-
form best or converge at all. An example is the
iterative solution of linear systems for which the
matrix is nonsymmetric, indeﬁnite, or both. In
such cases, an “algorithmic bombardment” ap-
proach can help. The idea here is to concurrently
apply several methods for solving a problem in
the hope that at least one will converge to the so-
lution.23 A related approach is to apply a fast, but
unreliable, method ﬁrst and then to check a pos-
teriori if any problem occurred in the solution.
If so, we use a slower method to ﬁx the problem.
These poly-algorithmic approaches can be avail-
able for application developers either as black
boxes or with varying degrees of control over the
methods used.
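The "fast method first, then check a posteriori" variant can be sketched for small dense linear systems as follows. This is an illustration under stated assumptions, not the approach of reference 23: Jacobi iteration stands in for the fast but unreliable method, Gaussian elimination with partial pivoting for the slower fallback, the matrix is assumed to have a nonzero diagonal, and the tolerance and iteration count are arbitrary.

```python
import math

def residual_norm(a, x, b):
    """Largest absolute component of a*x - b."""
    n = len(b)
    return max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))

def jacobi(a, b, iterations=50):
    """Fast but unreliable: converges only for suitable matrices."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x = [(b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
             for i in range(n)]
    return x

def gauss_solve(a, b):
    """Slower but robust: Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def solve(a, b, tol=1e-8):
    """Try the fast method, check the residual, fall back if the check fails."""
    x = jacobi(a, b)
    if all(math.isfinite(v) for v in x) and residual_norm(a, x, b) <= tol:
        return x, "jacobi"
    return gauss_solve(a, b), "gauss"

x1, used1 = solve([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])  # diagonally dominant
x2, used2 = solve([[1.0, 2.0], [3.0, 1.0]], [5.0, 5.0])  # Jacobi diverges here
print(used1, x1)
print(used2, x2)
```

The caller receives a verified solution in both cases and can inspect which method produced it, which corresponds to the "varying degrees of control" mentioned above. Running the candidate methods concurrently instead of one after the other would give the bombardment variant.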
Two promising avenues for achieving petaflops performance by the year 2010 are the IBM Blue Gene initiative and the HTMT design, both of
which use a number of emerging hardware tech-
nologies. A team involving Compaq, Sandia Na-
tional Laboratories, and Celera Genomics an-
nounced in January 2001 their intention to
develop a 100 Teraops computer by 2004, with
the intention of building a 1 Petaop machine in
a second development phase.
A deep memory hierarchy with disparate la-
tencies and bandwidths characterizes the HTMT
design, but any computer designed for petascale
performance needs to address the memory access
bottleneck problem. To perform well, these types of machines need to exploit on-chip integration of processors and memory, and a multithreaded execution model, with support at the
hardware, system software, and compiler levels.
In addition, latency-tolerant algorithms with low
bandwidth requirements will be of crucial im-
portance in achieving an acceptable fraction of
peak performance. These algorithms generally
have higher operation counts than the corre-
sponding optimal sequential code, but they ac-
cess the memory hierarchy more efficiently.
Thus, CA algorithms might come into favor over
traditional methods for solving partial differen-
tial equations on petascale architectures. 
Intelligent, resource-aware numerical algo-
rithms that can automatically tune themselves
are also a desirable feature. Extensive use of
highly tuned software libraries and problem-
solving environments to assist in application
composition and job submission will be ex-
tremely important. Novel approaches, such as
algorithmic bombardment, will address issues of
numerical reproducibility. We will need interval
arithmetic methods to get robust and accurate
solutions to some problems.
Acknowledgments
This article is partly based on presentations made at the
Second Conference on Enabling Technologies for
Petaflops Computing, held 15–19 February 1999 in
Santa Barbara, California. 
References
1. D. Bailey, “Onwards to Petaflops Computing,” Comm. ACM, vol.
40, no. 6, June 1997, pp. 90–92.
2. K. Kennedy, “Do We Need A Petaflops Initiative?” Parallel Computing Research, vol. 5, no. 1, Winter 1997; www.crpc.rice.edu/CRPC/newsletters/win97/index.html.
3. E. Strohmaier et al., “The Marketplace of High-Performance
Computing,” Parallel Computing, vol. 25, nos. 13–14, Dec. 1999,
pp. 1517–1545.
4. J.J. Dongarra, H.W. Meuer, and E. Strohmaier, Top500 Super-
computer Sites, tech. report UT-CS 99-434, Dept. of Computer
Science, Univ. of Tennessee, Knoxville, 1999.
5. J.J. Dongarra, Performance of Various Computers Using Standard
Linear Equations Software, tech. report CS-89-85, Dept. of Com-
puter Science, Univ. of Tennessee, Knoxville, 2000.
6. D. Aharonov, “Quantum Computation,” Ann. Reviews of Com-
putational Physics, D. Stauffer, ed., World Scientific, Singapore,
1998.
7. L. Adleman, “Molecular Computation of Solutions to Combinatorial Problems,” Science, vol. 266, 1994, p. 1021.
8. International Technology Roadmap for Semiconductors, Semiconductor Industry Association, Austin, Texas, 1999; http://public.itrs.net/files/1999_SIA.Roadmap/Home.htm.
9. R. Stevens et al., “The High Performance Computing Extreme
Linux Open Source Systems Software Initiative,” Second Extreme
Linux Workshop at the Usenix Ann. Technical Conf., 1999,
www.extremelinux.org/activities/usenix99/docs/betagrid/StevensEL99/index.htm.
10. S. Wallach, “Petaflop Architectures,” Proc. Second Conf. Enabling
Technologies for Petaflops Computing, to be published;
www.cacr.caltech.edu/pflops2/presentations/WallachPeta2.pdf.
11. P.M. Kogge, “In Pursuit of the Petaflop: Overcoming the La-
tency/Bandwidth Wall with PIM Technology,” Proc. Second Conf.
Enabling Technologies for Petaflops Computing, to be published;
www.cacr.caltech.edu/pflops2/presentations/KoggePeta2.pdf.
12. P.M. Kogge, “The EXECUBE Approach to Massively Parallel Pro-
cessing,” Proc. Int’l Conf. Parallel Processing, CRC Press, 1994.
13. P. Bunyk et al., “RSFQ Subsystem for Petaflops-Scale Comput-
ing: ‘COOL-0,’” Proc. Third Petaflop Workshop, 1999, pp. 3–9.
14. J. Makino et al., “Grape-4: A Massively Parallel Special-Purpose
Computer for Collisional N-Body Simulations,” Astrophysical J.,
vol. 480, no. 1, May 1997, pp. 432–446.
15. E. Korpela et al., “SETI@home: Massively Distributed Computing for SETI,” Computing in Science & Eng., vol. 3, no. 1, Jan./Feb. 2001, pp. 78–83.
16. I. Foster and C. Kesselman, “The Globus Project: A Status Re-
port,” Future Generation Computer Systems, vol. 15, nos. 5–6,
Oct. 1999, pp. 607–621.
17. A.S. Grimshaw, W.A. Wulf, and the Legion Team, “The Legion
Vision of a Worldwide Virtual Computer,” Comm. ACM, vol. 40,
no. 1, Jan. 1997.
18. W. Benger et al., “Numerical Relativity in a Distributed Environ-
ment,” Proc. Ninth SIAM Conf. Parallel Processing for Scientific Com-
puting, SIAM, Philadelphia, 1999.
19. G.R. Gao et al., The HTMT Program Execution Model, CAPSL tech-
nical memo 09, Dept. of Electrical and Computer Eng., Univ. of
Delaware, 1997.
20. E. Anderson et al., LAPACK Users’ Guide, SIAM Press, Philadelphia,
1995.
21. T. Toffoli, “Programmable Matter Methods,” Future Generation
Computer Systems, vol. 16, nos. 2–3, Dec. 1999, pp. 187–201.
22. R.C. Whaley, A. Petitet, and J.J. Dongarra, “Automated Empiri-
cal Optimizations of Software and the ATLAS Project,” to be pub-
lished in Parallel Computing.
23. R. Barrett et al., “Algorithmic Bombardment for the Iterative So-
lution of Linear Systems: A Poly-Iterative Approach,” J. Compu-
tational and Applied Mathematics, vol. 74, nos. 1–2, 1996, pp.
91–100.
Jack J. Dongarra is a University Distinguished Professor
of Computer Science at the University of Tennessee,
Knoxville. His technical interests include the design and
implementation of open source software packages and
systems. He received a BS in mathematics from
Chicago State University, an MS in computer science
from the Illinois Institute of Technology, and a PhD in
applied mathematics from the University of New Mex-
ico. He is a fellow of the AAAS, ACM, and the IEEE and
a member of the National Academy of Engineering.
Contact him at the Dept. of Computer Science, Univ.
of Tennessee, 1122 Volunteer Blvd., Knoxville, TN
37996-3450; dongarra@cs.utk.edu.
David W. Walker is Professor of High-Performance Computing
in the Department of Computer Science at
Cardiff University in the UK. His research interests include
software, algorithms, and environments for computa-
tional science on HPC systems and the design and im-
plementation of problem-solving environments for sci-
entific computing. He received a BA in mathematics
from the University of Cambridge and an MSc and a PhD
in physics from the University of London. Contact him
at the Dept. of Computer Science, Cardiff Univ., PO Box
916, Cardiff CF24 3XF, UK; david@cs.cf.ac.uk.
