tron@cesdis.gsfc.nasa.gov 301-286-2757
ABSTRACT
Petaflops is a scale of computer performance equal to a million billion operations per second, thousands of times beyond the performance of today's most powerful massively parallel processors.
Although representative of a class of system far beyond feasibility with contemporary technology, petaflops computer architecture is the target of active investigation and the topic of government sponsored workshops. Realizing petaflops capability will depend on advances in device technology, architectural structures, and parallel algorithms for scientific and engineering applications. Technologies such as advanced semiconductors, optics, and superconductors will yield basic clock rate enhancements of between one and three orders of magnitude. New architectural structures will integrate between thousand-way and million-way parallelism and incorporate new mechanisms for mitigating the impact of memory access latency and resource management overhead. Application algorithms will extend the scaling of parallelism while bounding the critical path length of the operation execution sequence. Together these advances will bring a revolutionary capability to the aerospace and other engineering communities for modeling, simulation, and analysis. This paper describes the field of petaflops computing in terms of the evolution of enabling technologies, new architectures likely to deliver effective performance at the petaflops level, and examples of important applications that will be significantly advanced through the availability of future petaflops computers. A summary of the seminal findings of several recent workshops exploring this regime of computing systems will be presented in detail. Analysis of the critical technologies and their evolution will be used to support estimates of the likely timeframe in which such capability will become practical.
INTRODUCTION
High performance computing is an enabling technology for both engineering and the sciences. Investigation of complex design trade-off spaces and exploration of diverse physical phenomena are made possible through the combination of high throughput computing capability and high capacity memory and mass storage facilities. As advances in computing extend our ability to understand and manipulate the natural world, so too do the bounds on such capabilities limit our opportunities to delve further. Even while the most aggressive massively parallel computing systems reach tentatively towards Teraflops performance, scientists across a range of disciplines are considering the possibility of practical systems capable of orders-of-magnitude greater performance, in the realm of petaflops scale computing. A petaflops is a million billion operations per second, thousands of times the capability of the most powerful contemporary massively parallel processing system. Indeed, a system of this scale would exceed the combined capability of all the computing resources on the Internet today. The capability implied by a petaflops extends far beyond known experience, and its feasibility demands technologies unavailable to us today. But understanding its future requirements is important to establishing research directions in device technology, computer architecture, and application algorithms, as well as to gaining insight into its potential for advancing engineering and science. This paper introduces the topic of petaflops scale computing and presents a summary of the basic understanding of this extreme realm derived from two years of investigation by a growing community of interdisciplinary researchers.
Each major advance in computing performance has required a significant change in technology, architecture, and execution model. These paradigm shifts have permitted new computing capabilities to be brought to bear on problems of ever increasing scale and resolution, but often at a cost to programmability and ease of use. The supercomputer evolved from the venerable mainframe in the early 1970's by shifting to vector processing. Architectures were pipelined to overlap memory access, communication, and arithmetic processing in order to hide the latency of data movement and provide very high operation throughput. The Cray-1 is the best known manifestation of these concepts in that era, and the ideas continue in the C-90 today. But advances in performance for vector processing were limited by clock rates, and high end technology did not yield sufficient growth in that area. Indeed, achieving a 1 nanosecond (ns) clock time seemed to be a major barrier; it was finally realized in the prototype of the Cray-4, although no such machine has ever been delivered. Performance in the range of 1 Gigaflops (billion floating point operations per second) has been achieved by vector processors.
To go much beyond 1 Gflops, a second shift in approach has been required, towards massive parallelism. While it is true that vector processing itself exploits parallelism at the instruction level, massively parallel processing (MPP) exploits a much broader range of parallelism at the task or process level and is enabled by the microprocessor, itself revolutionized through major advances in high density semiconductor technology. Parallelism at this granularity, albeit on a modest scale, is employed even in today's supercomputers. The MPP paradigm combines hundreds or even thousands of such microprocessors in a single system, yielding performance in the range of 100 Gflops for the largest configurations. But these systems require a very different approach to programming than did earlier systems because the resources are now distributed, imposing long latencies for resource access across the diameter of the hardware structure. Therefore, message passing programming models have been devised by which applications are partitioned and distributed across the processors of an MPP. These separate processes then communicate by exchanging messages across the system's internal communication network. The programmer and system software must deal with a very different set of programming challenges, and such systems are more difficult to employ effectively than a simple workstation or even a vector supercomputer.
To build a petaflops scale MPP from contemporary technologies would require millions of microprocessors filling a large building and consuming the power of a major electrical generating plant. And because of the very long latencies, on the order of tens of thousands of cycles or more, only those applications exhibiting high locality of variable access would be practical targets for exploiting such a system. The cost would be prohibitive at approximately 100 billion dollars. Clearly such systems are infeasible at present. But will they ever be possible and, if so, when and how? These are the questions examined in this paper. In doing so, it is also necessary to examine the possible shifts in approach required for this next revolution in high performance computing.
The findings presented in this paper are preliminary and are derived from a set of Federally sponsored workshops. The Pasadena Workshop on Enabling Technologies for Petaflops Computing was conducted in February 1994 and included over 60 invited participants from diverse fields. This workshop provided the first substantive examination of petaflops scale computing and established this emerging discipline as a legitimate field of inquiry. One year later, in February 1995, The Petaflops Frontier (TPF) workshop was organized as an open forum at the Frontiers '95 conference, attracting over a hundred participants from a broad range of disciplines and organizations. Then, in August 1995, a third workshop involved over forty invited scientists to examine petaflops scale applications in detail. At the same time, more than one point design study is underway to develop ideas for petaflops architecture and to achieve a better understanding of the feasibility, time frame, technology requirements, and applicability of such future systems. This rapidly evolving body of knowledge is the basis of this paper.
PETAFLOPS SCALED APPLICATIONS
3D+T: A surprisingly large number of problems fall into this category, in which the application scales in both data set size and number of time steps as the underlying machine performance scales up. The consequence is that the amount of memory required scales at most in proportion to the 3/4 power of the performance. This follows from the fact that the simulated time step must shrink in proportion to the spatial distance. For a simulation of fixed duration, as the data set expands, more time steps are required as well.
While this may seem a distinction without a difference, it means that a petaflops computer for the 3D+T class of application requires about 30 Terabytes of memory or 1/32 of the full petabyte. Given the cost sensitivity of a petaflops computer to memory capacity, this can make a petaflops system cost viable years before it might otherwise have been.
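The 3/4-power relationship can be checked with a few lines of arithmetic. The sketch below assumes, purely for illustration, a reference point of 1 Gbyte of memory at 1 Gflops; that baseline and the function name are assumptions introduced here, not figures from the workshops. Scaling the baseline by performance to the 3/4 power lands close to the 30 Terabytes quoted above.

```python
def memory_bytes_3d_plus_t(perf_flops, base_flops=1e9, base_bytes=1e9):
    """Memory for a 3D+T problem, assuming memory ~ performance**(3/4).

    base_flops and base_bytes are an assumed reference point
    (1 Gbyte at 1 Gflops), not a value taken from the paper.
    """
    return base_bytes * (perf_flops / base_flops) ** 0.75


petaflops = 1e15
mem = memory_bytes_3d_plus_t(petaflops)

print(f"memory at 1 Pflops: {mem / 1e12:.1f} Tbytes")
print(f"fraction of a petabyte: 1/{1e15 / mem:.0f}")
```

Under this assumed baseline the result is about 31.6 Tbytes, or roughly 1/32 of a petabyte; a different baseline shifts the absolute figure but not the 3/4-power trend.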
Another important property of candidate petaflops applications is the scaling of intrinsic parallelism with level of performance demand. Many applications, such as those described above, increase in parallelism as the data set size increases, permitting additional processing resources to be applied. But certain algorithms are critical path length dominant.
As a consequence, while more integration steps are required for greater precision or longer extrapolation times, the data set size does not grow, nor does the available parallelism.
Some climate modeling formulations fall into this category. An implication is that system design advances must include faster clock rates and not rely entirely on parallelism for performance gain. Thus, a multiprocessor comprising fewer fast nodes would be preferred to a very large number of slow nodes, all other considerations being equal (they never are). This is just one instance of Amdahl's Law.
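The preference for fewer, faster nodes can be illustrated with a minimal sketch of Amdahl's Law. The workload split (1 unit of serial work, 99 units of parallel work) and the two machine configurations are hypothetical values chosen only to give both machines the same peak rate; they are not drawn from any application studied in the workshops.

```python
def amdahl_speedup(parallel_fraction, n_nodes):
    """Speedup over one node when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_nodes)


def run_time(serial_work, parallel_work, n_nodes, node_speed):
    """Serial work runs on one node; parallel work is spread over all nodes."""
    return serial_work / node_speed + parallel_work / (n_nodes * node_speed)


# Two hypothetical machines with the same peak rate (n_nodes * node_speed).
few_fast = run_time(serial_work=1.0, parallel_work=99.0, n_nodes=1_000, node_speed=10.0)
many_slow = run_time(serial_work=1.0, parallel_work=99.0, n_nodes=10_000, node_speed=1.0)

print(f"few fast nodes:  {few_fast:.4f} time units")
print(f"many slow nodes: {many_slow:.4f} time units")
print(f"speedup of 99% parallel work on a million nodes: {amdahl_speedup(0.99, 10**6):.2f}")
```

With identical peak rates, the machine built from fast nodes finishes sooner because the serial portion runs at node speed regardless of how many nodes exist, and the speedup of a 99% parallel workload stays below 100 however many nodes are added.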
Memory size is not the only issue related to data set storage that affects performance. Data access patterns vary dramatically among applications, yet many latency hiding techniques rely on certain important properties of memory access behavior. Memory latency is the time for a processor to access a word in memory. By making certain the required word is nearby, or by overlapping the access of the operand word with other computation, much of the negative effect of latency can be mitigated. The dominant form of latency management in modern computers, including MPPs, is cache buffering: the use of small high speed memories that can be accessed in one or two processor cycles and into which the working set can be copied. But for caches to work, data access patterns must exhibit locality in space and time. Applications can be divided into those which exhibit strong locality and those whose access patterns are often independent.
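The sensitivity of cache buffering to locality can be sketched with the usual weighted average of hit and miss times. The 2 cycle hit time follows the "one or two processor cycles" mentioned above; the 1000 cycle miss time and the hit rates are illustrative assumptions, and the function name is hypothetical.

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Expected cycles per memory access for a single-level cache."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# Assumed values: 2 cycle cache hit, 1000 cycle access to remote memory.
for hit_rate in (0.99, 0.90, 0.50):
    cycles = average_access_cycles(hit_rate, hit_cycles=2, miss_cycles=1000)
    print(f"hit rate {hit_rate:.2f}: {cycles:.2f} cycles per access")
```

Even a 99% hit rate leaves the average access several times slower than a cache hit under these assumptions, and an application with weak locality sees latencies dominated almost entirely by the remote access time.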
Finally, analyses of the nature of applications as possible candidates for petaflops computing are highly sensitive to the way the application is cast. Specifically, message passing, data parallel, and shared memory models all contend as alternative programming approaches with their respective advantages and disadvantages.
ENABLING TECHNOLOGIES
The driving parameters that will ultimately determine the viability of petaflops computing are:
• Ultimate system clock speed,
• Cost of fabrication,
• Power and cooling requirements,
• Degree and range of parallelism exploited, and
• Effectiveness.
There are, of course, many other issues such as reliability, language and operating system support, and so forth. But the threshold of feasibility is dominated by the above. The order in which these are given is not by importance, whatever that would mean, but rather in terms of the means through which they are addressed. Clock speed, fabrication cost, and power consumption are primarily determined by the device technologies used. Parallelism and effectiveness are determined by the system architecture and influenced by the support system software provided. Total peak performance is the product of clock rate and parallelism. Effectiveness is the combined influence of efficiency, generality, and programmability that yields the sustained or useful performance for real-world applications. In this section, we consider the opportunities and challenges in the realm of device technology and therefore the first three parameters above. The next section addresses the last two through consideration of architecture issues and approaches.
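Because peak performance is the product of clock rate and parallelism, the parallelism needed to reach a petaflops follows directly from the clock rate. The sketch below uses two clock rates quoted elsewhere in this paper (200 MHz for current semiconductor parts and 50 GHz projected for superconducting logic); the function name and the simplification of one operation per cycle per unit are assumptions made for illustration.

```python
def required_parallelism(target_flops, clock_hz, ops_per_cycle=1):
    """Concurrent operations per cycle needed to reach target_flops."""
    return target_flops / (clock_hz * ops_per_cycle)


PETAFLOPS = 1e15

for label, clock_hz in (("200 MHz", 200e6), ("50 GHz", 50e9)):
    ways = required_parallelism(PETAFLOPS, clock_hz)
    print(f"{label}: {ways:,.0f}-way parallelism")
```

Under these assumptions a 200 MHz technology needs five million concurrent operations per cycle, while a 50 GHz technology needs twenty thousand, which is why clock rate and parallelism have to be considered together.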
Semiconductor Devices
Semiconductor technology is the dominant means of realizing computing systems. While logic devices employ CMOS technology and memory devices employ variants of NMOS technology (for density), they are all a form of saturable semiconductor material. Higher speed non-saturable logic families, such as the venerable ECL devices and the highly advanced GaAs (Gallium Arsenide), have provided some performance advantage. GaAs technology, for example, provided the 1 nanosecond clock times of the Cray-4. But its high cost, high power consumption, and relatively low density constrain it to a narrow set of system applications. CMOS has dominated the rapid growth of semiconductors over the last decade. Any consideration of future petaflops computing systems must start with the opportunities and limitations of semiconductor technology now and in the future.
The Semiconductor Industry Association (SIA) has developed a best-guess extrapolation [3] of the rate of progress anticipated for the evolution of semiconductor technology. This is shown in Table 3.1 and extends to the year 2007. These studies were in-depth and were intended not as an optimistic appraisal but rather as a high-confidence assessment for planning purposes. The numbers were extrapolated further to the year 2015, although the values indicated here are more speculative. It should be noted that the SIA projections are being extended formally, in part motivated by the Petaflops initiative.
Today, parts comprising almost 10 million transistors and clock rates exceeding 200 MHz are commercially feasible. Memory densities of 16M and 64M bit DRAMs are viable today, although memory prices are not declining but are, in fact, rising slightly. Single processors are being manufactured using these resources, although a significant part of the transistor count is being dedicated to cache memory on the die. The bounding condition is the device count. The mechanical packaging constraints based on reliability, size, wire count, and so forth limit high end systems to between 10,000 and 50,000 discrete devices.
Optical Devices
Optical technologies are emerging as an important component of the digital information infrastructure. Although optics is dedicated principally to the specific roles of long distance digital communication and high density read only storage, these rapidly growing markets, combined with its exceptional potential, have positioned optics as the technology of the future. Its role in achieving petaflops scale computing systems has been shown to be important if not essential. It is expected that optics will become the dominant form of high bandwidth communications, not only between computing sites or computing systems at a given site, but between the principal components within a single parallel computing system. As computer capability approaches petaflops scale, optics may well be used at all levels of communication down to that of interchip data transfer. Optical information storage now dominates the music industry and is rapidly replacing the floppy disk in the PC and workstation market. Although relatively slow in access time, its high density and moderate bit transfer rates have made it the medium of choice for most software dissemination as well as a vehicle for large data sets. Ironically, the one area in which optics is not impacting digital computing is in the computing itself. Optical logic has not been adequately developed, nor is it expected to pose a threat to semiconductor logic in the foreseeable future. Nonetheless, the other attributes of optical technology are viewed as critical to the ultimate goal of petaflops computing.
For typical applications, a sustained throughput of a petaflops may be expected to require a processor to memory system data transfer rate of a petabyte per second (PBps). While we may assume that on-chip communications will be managed by transmission line methods and that much of the interaction between processors and the memory system will be with the on-chip cache, the off-chip aggregate bandwidth requirements can be expected to be daunting. For example, taking the general configuration described in the previous section and assuming a 98% data cache hit rate and a 100% instruction cache hit rate (both highly optimistic) would still require transfer rates of 2 Terabytes per second onto the processor chips, even with multiple processors per chip. The limitations of electrical communication are fundamental to the physics of the medium. The energy requirements for electrical transmission grow with increased frequency and distance. With the frequency requirements anticipated, even sub-centimeter distances may make the power demands of electrical transmission prohibitive. Add to that the problems of cross-coupling and other isolation related issues at high frequencies, and it soon becomes clear that wires may prove inadequate to the task. Optical communications suffer from none of these shortcomings.
Optical communications can take the form of either guided or free-space transmission. Guided optics are implemented primarily with fiber optics, the same technology used for long distance communications by the major carriers and adapted for use by the computer industry in such devices as FDDI. Using light emitting diode sources, current capabilities are on the order of 100 Mbits per second. Higher powered laser diode signal sources can increase that by more than an order of magnitude, to more than 2 Gbits per second. Researchers in this field have projected that combinations of wavelength division multiplexing and time division multiplexing will yield guided wave optical transmission rates in the range of a terabit per second within the next 15 to 20 years, the same time frame required to realize 0.05 micron semiconductor devices.
One of the weaknesses of fiber optics is the fiber itself. Fibers have to be connected to the source and receive points of the communication transfer, and they take up space both at the points of interconnection and over the intervening distance. Given the thousands or tens of thousands of processing and memory chips expected to comprise a petaflops computer, the number of fibers required could easily surpass a million. Just the reliability issue of keeping all the interconnections intact is a serious concern. The alternative approach afforded by optics is the free-space medium. Free-space data communication eliminates the need for fibers and sends light signals directly through the air. Laser diodes and optical sensors may be directly integrated onto semiconductor wafers using such components as self electro-optic effect devices. Large arrays of such elements have been fabricated in the laboratory. This permits easy exploitation of parallel communication channels in highly dense configurations. Even today, single channel speeds of 100 Mbits per second have been combined in thousand point arrays for an aggregate throughput of more than 100 Gbits per second.
Experts have predicted that the development of new integrated laser device technology will increase channel rates by two orders of magnitude over the next 20 years. Advanced forms of the self electro-optic effect devices, along with multiple quantum well modulators and vertical-cavity surface emitting laser diodes, will enable 10 Gbit per second bandwidths. Higher density packing will enable 1000 by 1000 laser diode pixel arrays on a single chip, yielding an aggregate throughput of 10 Pbits per second. Free-space optics has the additional advantage over guided techniques of permitting signals to cross each other without interference. With light deflecting technologies such as acousto-optics, such arrays can be used as a cross switch, permitting interconnect topologies to be altered, at least within a narrow domain. While speeds of such interconnect switching will be much slower than communications rates, there will be many opportunities in which this kind of flexibility can enhance system effectiveness, especially for special purpose device architectures such as those for digital signal processing and post sensor processing applications. One disadvantage is the expected higher energy requirement of free-space versus guided techniques.
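The aggregate throughput figures for free-space arrays are simply the channel count multiplied by the per-channel rate. The following sketch reproduces the two cases cited in the text; the function name is hypothetical, and the model assumes every channel runs at its full rate simultaneously.

```python
def aggregate_bits_per_second(channels, channel_rate_bps):
    """Total throughput of an array of independent optical channels."""
    return channels * channel_rate_bps


# Today: thousand point array at 100 Mbits per second per channel.
today = aggregate_bits_per_second(1_000, 100e6)

# Projected: 1000 by 1000 array at 10 Gbits per second per channel.
projected = aggregate_bits_per_second(1_000 * 1_000, 10e9)

print(f"today:     {today / 1e9:.0f} Gbits per second")
print(f"projected: {projected / 1e15:.0f} Pbits per second")
```

The projected figure reflects both a hundredfold increase in channel rate and a thousandfold increase in channel count, five orders of magnitude in total.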
As was discussed, memory requirements for petaflops computing are highly sensitive to the class of applications being performed but can easily range from tens of terabytes to a petabyte. In fact, the cost of the computer could be dominated by the cost of the memory to the point that the processors essentially come for free. While perhaps a slight exaggeration, this does highlight the critical nature of memory in determining the feasibility of petaflops computers. Optical storage technology may make an important contribution in addressing this problem. Even today, optical disks are emerging as a major force in data archiving and dissemination. Their high density and low duplication costs have achieved significant market penetration in spite of the relative disadvantages of slow access times and difficulties in providing read/write media. Such disks hold 600 Mbytes of storage, and this is expected to increase to 20 Gbytes over the next decade. New technologies are being developed that will extend optical storage from the limitations of 2 dimensions to 3 dimensional arrays at extremely high densities. And because entire 2-D planes can be accessed at a time, the average bandwidth can be extraordinary. Such technologies include holostore based on photorefractive rods, two-photon 3-D memory, and spectral hole burning. Holostore devices may hold up to a Tbit. Two-photon 3-D devices may provide 100 Tbits of capacity. And spectral hole burning for holographic storage may yield a capability of storing a Pbit. Researchers in optical memory technologies predict that within 20 years two-photon 3-D technology with spatial light modulator addressing will be capable of storing 100 Petabits and providing a transfer rate of a Pbit per second, although the access time will be on the order of 100 nanoseconds.
From the above discussion, it is concluded that optical technology will be an important part of any petaflops computer architecture, both for intra-system data communications and for large data storage capacity. But because of its weakness in performing logical operations quickly and cheaply, an alternate technology needs to be found for logic. That technology may be superconductivity.
Superconducting Devices
While both semiconductor and optical technologies play an important role in today's computing, superconductivity remains largely in the experimental arena, with the possible exception of some exotic sensor electronics for infrared astronomy and similar applications. Superconductivity exploits a remarkable phenomenon of quantum mechanics that eliminates all electrical resistance from certain materials below specific temperatures peculiar to each material. Until recently, the temperatures required were in the realm of a few Kelvins. Today, special ceramic based materials have demonstrated superconductivity at temperatures near that of liquid nitrogen. Nonetheless, special cooling facilities are required, with added challenges to packaging and maintenance. But the effort may be worth it because superconductive logic devices are fast and require extraordinarily low power. Where the alternative is in the range of a megawatt, the inconveniences implied by superconducting circuits may be acceptable.
In recent years, the majority of research in this technology has been conducted in Japan. There, experimental processors and memory have been fabricated and evaluated. Simple processors on the scale of 100,000 gates have been operated at 1 GHz clock rates using 2 micron technology. Also, memory chips of 4 Kbits have been produced with access times of half a nanosecond. While the sizes are not spectacular by the standards of mature semiconductor technology, the speeds certainly are, and these experimental devices are only beginning to realize the promise of superconductivity. One of the most important breakthroughs in recent years is the rapid single flux quantum (RSFQ) gate, which permits low loss storage and logic and may pave the way to a future generation of superconducting devices.
Most important is the potential of this not so new but still young technology. Physicists predict, with confidence, that logic parts operating at a 50 GHz clock rate will be feasible and that processors of several million gates can be fabricated. The expected power consumption of such a processor would be approximately 4 watts while delivering a peak performance of 200 Gflops. Using cryogenically cooled semiconductor memory would require a total power budget of tens of kilowatts, or about a hundredth of what might be needed for a more typical semiconductor based system as described earlier. Also, the time frame may be significantly earlier than that for the semiconductor system.
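The processor count and processor power implied by these projections can be worked out directly. The sketch below uses only the figures quoted above (50 GHz clock, 200 Gflops and 4 watts per processor); the function name is hypothetical, and the calculation deliberately excludes memory and cooling power, which the text indicates make up part of the total budget.

```python
import math


def processors_needed(target_flops, per_processor_flops):
    """Whole processors required to reach the target peak rate."""
    return math.ceil(target_flops / per_processor_flops)


PETAFLOPS = 1e15
PER_PROCESSOR_FLOPS = 200e9   # projected peak per superconducting processor
PER_PROCESSOR_WATTS = 4.0     # projected power per processor
CLOCK_HZ = 50e9               # projected clock rate

count = processors_needed(PETAFLOPS, PER_PROCESSOR_FLOPS)
processor_power_watts = count * PER_PROCESSOR_WATTS
ops_per_cycle = PER_PROCESSOR_FLOPS / CLOCK_HZ

print(f"processors: {count}")
print(f"processor power: {processor_power_watts / 1e3:.0f} kilowatts (logic only)")
print(f"operations per cycle per processor: {ops_per_cycle:.0f}")
```

The logic alone accounts for about 20 kilowatts, consistent with a total budget of tens of kilowatts once memory is included, and each processor must complete four operations per 50 GHz cycle to reach its 200 Gflops peak.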
SYSTEM ARCHITECTURES
A computer is more than the sum of its parts. While technology may determine the peak potential performance of a system, it is the architecture in conjunction with its support software that determines the possible sustained or delivered performance. And as always when parallelism is a performance factor, application algorithms determine the actual performance observed. Architecture specifies the logical interface between the software and the underlying hardware through the set of primitive instructions and the name space of the system, together referred to as the instruction set architecture or ISA. Architecture also establishes the major component subsystems and their interrelationships that make up the overall computer structure. Each subsystem is defined by its functionality, its external interface, and its operational performance. Ideally, a computer would look like a single monolithic system with no sensitivity to usage. But the extreme use of parallelism, necessitated by the ultra high performance, requires system structures that are intrinsically distributed. Interactions between subsystems incur appreciable delay measured in system clock cycles. Even in today's massively parallel processing systems, the range of memory access times from closest cache to farthest remote memory can be three orders of magnitude or more. If, as has been the case for past microprocessor architectures, the instruction issue pipeline is stalled until a previous remote memory access is completed, that processor could wait for the equivalent of thousands of instruction cycles. The problem is anticipated to be much worse for a petaflops computer.
Latency is one of several major problems that must be addressed by system architecture. A second is overhead, the amount of work required to manage the parallel resources and concurrent activities of a parallel computer. Overhead can be thought of as that work needed in a parallel execution that would not be necessary in a purely uniprocessor implementation of the same application. Overhead activities include synchronization of concurrent tasks, scheduling and load balancing of tasks, message passing, and other housekeeping chores. Overhead imposes a lower bound on the granularity of parallelism that can be effectively exploited. Below this bound, the overhead required to manage a thread (sequence of operations) is more than the actual useful work itself, leading to significant inefficiency. Yet another challenge to parallel architecture is starvation, the unavailability of work to be performed for some or most functional units. Starvation is either a result of the total parallel work being poorly and unevenly distributed across the parallel resources or a result of inadequate parallelism in the application. For many cases, the finer the granularity that can be exploited, the more parallelism there is available. Thus, a secondary effect of overhead is contributing to starvation in some cases. Overhead also determines to what degree it is efficient to perform dynamic task scheduling. If the overhead for such functionality is low, then load balancing can be performed on a continuing basis, thus alleviating the possibility of starvation. Finally, contention for shared functional elements results in resource conflicts that in turn cause some part of the execution to be delayed while some other part uses a needed and shared resource. Most typically, contention occurs for interconnect networks and memory banks but can result from the need to share any physical or logical service. Architecture can approach the problem of contention by incorporating sufficient resources to satisfy demand and by minimizing the service time of any resource to reduce wait times. These four factors (latency, overhead, starvation, and contention) combine to degrade system performance from the ideal, sometimes dramatically. The extremes of petaflops only aggravate them.
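The lower bound that overhead places on useful granularity can be seen in a simple efficiency ratio. In this sketch the 100 cycle management overhead and the thread sizes are illustrative assumptions, and the function name is hypothetical; the model treats overhead as a fixed cost per thread.

```python
def thread_efficiency(useful_cycles, overhead_cycles):
    """Fraction of a thread's total cycles spent on useful work."""
    return useful_cycles / (useful_cycles + overhead_cycles)


OVERHEAD_CYCLES = 100  # assumed fixed cost to create, schedule, and retire a thread

for useful_cycles in (10, 100, 1_000, 10_000):
    eff = thread_efficiency(useful_cycles, OVERHEAD_CYCLES)
    print(f"{useful_cycles:>6} useful cycles per thread: efficiency {eff:.3f}")
```

Efficiency is exactly one half when the useful work equals the overhead and falls off quickly for finer threads, so lowering the overhead is what allows finer grained parallelism to be exploited without waste.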
There are perhaps an infinite number of possible architectures that may be considered for achieving petaflops, especially given the array of technologies that might be applied and the range of time frames in which such a system could be delivered. But these can be classified according to how the latency problem is addressed and whether a single technology or a mix of technologies is used to fabricate the system. In the first case, we consider three broad classes of architecture. The first category comprises those systems for which the latency is very low. Memory access times are short and memory accesses are pipelined, as in contemporary vector supercomputers. A relatively few ultra high speed, heavily pipelined processors are employed. For example, one could conceive of a thousand (yes, that's relatively few) 1 Tflops processors, each with substantial internal parallelism and tightly coupled to its own very high speed memory. Such a system would incorporate a high degree of internal parallelism.
The second case would employ more modest processors on the scale of a high end workstation and exploit instruction level parallelism to a significant degree. Each might exhibit a peak performance of 32 Gflops employing 16-way instruction issue and multiple functional units. There would be a small number of processors per chip along with substantial local fast buffer storage. Main memory would be at some substantial distance and much slower than the processor clock rate in order to achieve the highest density and low power. This category of architecture would incorporate latency management strategies in hardware to overcome some of the potential performance degradation due to long memory access times. Such techniques in use even today include caching and prefetching. New techniques beginning to be demonstrated are multiple issue and out of order completion of memory accesses, and special streaming and gather/scatter hardware.
The third case would exploit lightweight processors tightly integrated with memory on the same die to expose the high intrinsic memory bandwidth. This bandwidth exists because, internally, each DRAM pulls an entire row of data whose width is equal to the square root of the memory chip bit capacity. It is only when an access request from off chip is satisfied that a final select chops this large data string down to a byte, a nibble, or even a bit, discarding most of the data retrieved from the memory array. By merging the processors with the memory, each processor has access to the entire row of data fetched, yielding a much higher aggregate bandwidth. This class of system approaches latency by relying on data partitioning to minimize off chip access requests to remote memory. Also, by using slower processors, the difference between local and remote access times, measured in processor cycles, can be reduced. Over a hundred processors per chip would be integrated along with the memory. A total chip count of 10,000 would make up a Pflops system assuming a per processor capability of 1 Gflops.
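The row width argument and the chip count can both be checked numerically. The sketch below assumes a square memory array and takes 1 Mbit as 2**20 bits; both are assumptions for illustration, as are the function name and the choice of exactly 100 processors per chip (the text says "over a hundred").

```python
import math


def dram_row_bits(capacity_bits):
    """Row width of a square DRAM array: the square root of its bit capacity."""
    return math.isqrt(capacity_bits)


MBIT = 2**20  # assumed definition of a megabit for this sketch

for label, capacity in (("16 Mbit", 16 * MBIT), ("64 Mbit", 64 * MBIT)):
    row = dram_row_bits(capacity)
    print(f"{label} DRAM: {row} bits per row, {row // 8}x the width of a byte-wide output")

# System peak for the third architecture class, using the figures in the text.
chips = 10_000
processors_per_chip = 100
flops_per_processor = 1e9
system_flops = chips * processors_per_chip * flops_per_processor
print(f"system peak: {system_flops / 1e15:.0f} Pflops")
```

Under these assumptions a 64 Mbit part fetches 8192 bits per row access, about a thousand times more than a byte-wide interface delivers off chip, and a million 1 Gflops processors spread over 10,000 chips give the 1 Pflops peak.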
The first category of system is extremely difficult to imagine given the current expectations in technology. But the latter two are feasible within the time frame discussed. The second case may be best implemented as a hybrid structure using a mix of technologies. The third case could be implemented primarily with semiconductor technology, although some optical communications would likely be applied between major modules such as chassis or racks.

But even with this requirement, using today's cooling technology, the cost of such cooling would be about $20,000, which is incidental compared to the rest of the system. The processor is operated at 32 GHz, which is well within the anticipated capability of the technology and, indeed, is somewhat conservative. Instruction-level parallelism is exploited, but only at a level of 8-way issue per cycle. While this is more than current microprocessors provide, it is only a factor of two more. This may be matched by microprocessors within 5 years, so it is not aggressive either.
The aspect of this architecture which departs from the norm is that it is multithreaded. This means that each processor maintains sufficient state to support a large number of threads concurrently. In any one cycle, only one thread issues instructions to the execution pipeline. The processor scheduler goes round-robin among the set of active and pending threads, drawing one multi-operation instruction from a different thread each cycle. This has a number of beneficial effects, among them simplifying pipeline design and interlock control, because no more than one operation from a particular thread can be in the execution pipeline at one time. But the principal benefit of such a multithreaded architecture is its effect on memory access latency. When a thread issues a request to memory which is not satisfied by the contents of the cache, the thread itself suspends but does not block the pipeline from issuing instructions from other threads. This is very similar to the Tera computer currently under development by Tera Computer Company. This dynamic and adaptive approach to latency provides resiliency in the presence of varying and unpredictable latencies. It also addresses the problem even when there is little of the locality of data access that cache-based solutions require.
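The latency-hiding effect of round-robin multithreading can be illustrated with a toy cycle-level model. This is a minimal sketch under stated assumptions, not a model of the Tera machine or of the design discussed here: the function name `simulate`, the fixed miss interval, the fixed miss latency, and the one-cycle pipeline depth are all hypothetical simplifications.

```python
def simulate(num_threads, instrs_per_thread, miss_every, miss_latency,
             pipeline_depth=1):
    """Toy round-robin multithreaded processor.

    Each cycle, at most one ready thread issues one instruction. A thread may
    not issue again until its previous instruction has left the pipeline
    (pipeline_depth cycles). ASSUMPTION: every miss_every-th instruction of a
    thread misses the cache and suspends that thread for miss_latency extra
    cycles, without blocking the other threads.
    Returns (total_cycles, pipeline_utilization).
    """
    remaining = [instrs_per_thread] * num_threads
    issued = [0] * num_threads
    ready_at = [0] * num_threads      # first cycle at which a thread may issue
    total = num_threads * instrs_per_thread
    done = 0
    cycle = 0
    next_thread = 0                   # round-robin starting point
    while done < total:
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                issued[t] += 1
                done += 1
                stall = miss_latency if issued[t] % miss_every == 0 else 0
                ready_at[t] = cycle + pipeline_depth + stall
                next_thread = (t + 1) % num_threads
                break                 # only one issue per cycle
        cycle += 1                    # an idle cycle if no thread was ready
    return cycle, total / cycle


if __name__ == "__main__":
    for n in (1, 8, 64):
        cycles, utilization = simulate(n, 100, miss_every=10, miss_latency=50)
        print(n, cycles, round(utilization, 2))
```

In this toy model a single thread keeps the pipeline busy less than a fifth of the time, while a thread count comparable to the miss latency in cycles keeps it essentially full. That is the sense in which the thread state held by each processor trades hardware resources for tolerance of long and unpredictable memory latencies.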
Optical fiber is used as a high bandwidth but thermally isolating interface to the remaining system components. The total power requirement for the processors themselves is under 2 kilowatts. The processors are insulated from the external ambient conditions by an intermediate environment of liquid nitrogen. But this layer provides more than just thermal shielding. It also provides cryogenic cooling for CMOS SRAMs that act as the level 2 cache for the system. Such technology was explored and exploited by ETA during the 1980s and found to provide much higher access speeds and lower energy dissipation.
The SRAMs are directly immersed in the liquid nitrogen bath and are interfaced to the external system by an optical packet switch that provides high bandwidth and thermal isolation. The switch pipelines many requests from the level 2 caches to the large DRAM memory. The size of the DRAM is approximately 40 Tbytes, more than enough to solve a large number of interesting problems directly but much cheaper than a system with a full Pbyte of main memory. However, the larger problems are supported by a layer in the memory hierarchy not found in today's systems. A 3-D photo-refractive optical memory is used that holds multiple Pbytes of storage. Access time to this optical memory is relatively long, but it delivers many Mbytes at a time to the DRAM memory, providing high bandwidth. The main memory acts, in part, as a cache for the slower but larger optical memory.
The second architecture to consider in detail is a single-technology solution and falls into the third category discussed earlier. This structure belongs to a class referred to as PIM, for processor in memory, a name that highlights that the processing logic and memory cells are closely bound together. The key advantages of the PIM model are that it exploits the high innate memory bandwidth as previously described and requires relatively low power per computation. As shown in Figure 4.3, the memory takes up the majority of the die. The processors are in fact quite simple, each operating at 2 GHz with single-issue 32-bit instructions. The figure again shows only 8 processors, but this would be expanded in a hierarchical array on the die to achieve the total number of processors. Figure 4.4 shows the internal structure of each cell, including the memory, external interface, and processor logic. This processor can be controlled in either SIMD or MIMD mode, allowing mixed-mode operation as well. In SIMD mode, all processors receive a stream of broadcast instructions from a central or control processor. In MIMD mode, each processor executes a separate thread of locally stored instructions. When a processor requires data from an external resource, it stalls until the data is returned.
This represents wasted time. But the processors are very fine grain, and the loss of a processor for a few hundred cycles is not very serious. One mitigating mechanism in this architecture is the direct access to adjacent processors, which can supply data as fast as the local memory can. Thus, nearest-neighbor remote access is very efficient. Many real-world problems exhibit this kind of memory access pattern. Otherwise, this architecture must rely on spatial and temporal locality to minimize wasted cycles.

These two architectures represent the diversity of choices in the trade-off space and also demonstrate the opportunities available for achieving petaflops scale computation. But an architecture is only as good as its supporting system software. Little work has been done on this aspect of petaflops computing, although a workshop on the topic is being planned for June 1996. Given the challenge of programming the sub-teraflops architectures of today, it may be that the system software and tools will prove to be even more difficult than building the machine itself.

The discussion so far has assumed architectures that are fairly general in their applicability. It should be noted that there is an entire range of additional opportunities if architectures are application driven and are either domain specific or special purpose. In such systems, the interconnect topology reflects the application algorithm and is optimized for it. Work in this area has not been pursued but is expected to play a more central role over the coming year.
5. DISCUSSION AND CONCLUSIONS
A series of workshops and studies have provided a detailed examination of the opportunities and challenges for petaflops computing systems and applications. These investigations have been pursued to determine the potential of petaflops computing for important problems in science, engineering, information management, and commerce. Many significant applications have been identified that either demand or would greatly benefit from computers capable of petaflops performance.
It has been determined that petaflops computer systems will be realizable through enhanced natural evolution of technology and systems architecture in 20 to 25 years using largely commercially available components. But processor architectures will require enhancements for broadest applicability at petaflops scale, including 64-bit and 128-bit integer and floating-point data formats, bit and byte operations, 64-bit addressing, and divide and square root operations.

Latency will be a major factor in determining performance efficiency. Latency management in hardware and software will be a pacing element in achieving sustained petaflops performance for a broad range of real-world problems. Memory access patterns and bandwidth requirements of petaflops applications show wide diversity. Conventional cache policies and mechanisms will be inadequate. Conventional memory hierarchy precludes convenient exploitation of alternative storage technologies for basic computation. Memory capacity requirements vary dramatically across the array of important petaflops applications. Data-intensive applications can require in excess of a petabyte of working storage. Response-time-critical applications may employ less than a terabyte of memory. A large and general class of simulation-oriented applications exhibits memory demands less than or equal to the 3/4 power of the performance, yielding a requirement for 32 terabytes or less.

Limitations of parallelism in some petaflops scale applications will require system architectures to run at the highest possible clock rates within the constraints of economic viability. Fewer, higher-performance processors are preferred over a greater number of lightweight processors. Petaflops architectures may be feasible in a decade, at least for domain-specific applications, if approaches other than those based on commercial microprocessor technology are pursued. Petaflops application programming will require formalisms incorporating the necessary constructs for better application-driven management of resources to circumvent latencies, expose rich forms of parallelism, and balance workloads.

With the availability of important candidate applications for petaflops computing determined, and the feasibility of implementing a petaflops architecture in the twenty-year time frame validated, many questions remain concerning efficiency, programmability, cost, near-term opportunities, generality, and other issues related to the practical exploitation of petaflops. Future work is required in the near term to address these. While substantial progress has been made, further application studies need to be performed to expand the range of known petaflops applications and to identify and characterize requirements and limiting factors. Research needs to be conducted into specific advanced architecture techniques, including latency hiding and memory hierarchy structures. This includes establishing a memory continuum for seamless out-of-core computation. Exploration needs to be undertaken in the development of advanced technologies including superconductor logic and storage, optoelectronic communications, optical storage, and advanced lithography and semiconductor fabrication.
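The memory-capacity finding stated earlier in this section, the 3/4 scaling rule for simulation-oriented applications (read here as the 3/4 power of performance), can be reproduced numerically. The sketch below assumes, as a working hypothesis rather than something stated in the text, that memory is measured in Gbytes and performance in Gflops, so that one petaflops corresponds to 10^6 Gflops. The function name is illustrative.

```python
def memory_gbytes(perf_gflops, exponent=0.75):
    """Memory demand in Gbytes for a given performance in Gflops.

    ASSUMPTION: the 3/4 rule is applied with Gbytes against Gflops;
    the text does not state the units.
    """
    return perf_gflops ** exponent


PFLOPS_IN_GFLOPS = 10**6
mem_tbytes = memory_gbytes(PFLOPS_IN_GFLOPS) / 1000  # decimal Tbytes

print(f"{mem_tbytes:.1f} Tbytes")
```

On that basis the rule gives a little under 32 Tbytes for a petaflops system, which agrees with the "32 terabytes or less" quoted above. Other choices of units would not reproduce that figure, which is why the assumption is worth stating explicitly.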
Point design studies of petaflops architectures should be initiated. One area of focus would be the likely path of evolution by the vendors. Another should be a closer examination of the PIM concept. A third area of pursuit should consider petabyte mass storage management oriented systems. SIMD, cellular arrays, and systolic structures all are viable paths of exploration as well. Performance models of both applications and architectures should be devised to permit detailed analysis and projection of capabilities under diverse workloads and circumstances.
These activities in the near term will enhance the community's understanding of key issues, possibilities, and trade-offs. Without this deeper insight, it is premature to engage in major efforts committed to one particular approach over another. One problem with recent work has been the degree of conservatism in the class of systems examined. The conclusion that it will take 20 years to achieve petaflops computing following the more obvious paths should provide motivation for innovative approaches to be considered. A goal should be to devise a strategy that might yield petaflops computing in 12 years. Even if more risky, there is plenty of time to conduct low cost studies to evaluate such ideas.
Of the technologies that must move forward in tandem, system software is the least developed. Within the high performance computing community, there is a consensus that system software is in serious trouble. Programming methods for MPPs are primitive: they are unreliable, difficult to optimize and debug, and do not support portability. Many of the proposed architectural approaches under consideration for petaflops computing share characteristics with today's MPPs. Without adequate solutions to current problems, it is proving untenable to propose methods that will address the challenges of petaflops scale computing. One possible hope is that the problems of today that are so challenging to HPC software designers are a product of poor architecture, and that these deficiencies will be corrected prior to the realization of actual petaflops systems. Then, system software may have an easier set of issues to address. At this point, any suggestion as to direction in system software would be mere speculation.
The field of petaflops computing has only just emerged in the last two years. Prior to that, even posing this topic would have been considered a fringe activity, and such work would not have been credible. Now researchers from academia and industry have joined in cooperation with government agencies to pursue a modest but sustained exploration to determine the potential and the approaches that will lead to petaflops scale computing systems. This is a very exciting time.
