In this paper we analyze the major trends and changes in the High-Performance Computing (HPC) marketplace since the beginning of the journal 'Parallel Computing'. The initial success of vector computers in the seventies was driven by raw performance. The introduction of this type of computer system started the era of 'Supercomputing'. In the eighties the availability of standard development environments and of application software packages became more important. Next to performance, these criteria determined the success of MP vector systems, especially with industrial customers. MPPs became successful in the early nineties due to their better price/performance ratios, which were enabled by the attack of the 'killer micros'. In the lower and medium market segments the MPPs were replaced by microprocessor-based SMP systems in the middle of the nineties. This success was the basis for the emerging cluster concepts for the very high-end systems. In the last few years only the companies which have entered the emerging markets for massively parallel database servers and financial applications have attracted enough business volume to be able to support the hardware development for the numerical high-end computing market as well. Success in the traditional floating-point-intensive engineering applications no longer seems to be sufficient for survival in the market.
Introduction
"The Only Thing Constant Is Change": looking back over the last decades, this certainly seems to be true for the market of High-Performance Computing (HPC) systems. This market was always characterized by a rapid change of vendors, architectures, technologies and the usage of systems. Despite all these changes, however, the evolution of performance on a large scale seems to be a very steady and continuous process. Moore's Law is often cited in this context. If we plot in Figure 1 the peak performance of various computers of the last five decades which could have been called the 'supercomputers' of their time [4,2], we indeed see how well this law holds for almost the complete lifespan of modern computing. On average we see an increase in performance of two orders of magnitude every decade. In this paper we analyze the major trends and changes in the HPC market over the last three decades. For this we focus on systems which had at least some commercial relevance. Historical overviews with a different focus can be found in [8,9].
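To put the growth rate of two orders of magnitude per decade into more familiar terms, the following minimal sketch converts it into an equivalent doubling time and an annual growth factor. The function names are illustrative only, and the calculation assumes smooth exponential growth, which is an idealization of the measured data.

```python
import math


def doubling_time_years(orders_of_magnitude_per_decade: float) -> float:
    """Years needed for performance to double, assuming steady exponential growth."""
    growth_per_decade = 10.0 ** orders_of_magnitude_per_decade
    return 10.0 * math.log(2.0) / math.log(growth_per_decade)


def annual_growth_factor(orders_of_magnitude_per_decade: float) -> float:
    """Factor by which performance grows in one year at the same steady rate."""
    return 10.0 ** (orders_of_magnitude_per_decade / 10.0)


if __name__ == "__main__":
    # Two orders of magnitude per decade, the rate quoted in the text.
    print(f"doubling time: {doubling_time_years(2.0):.2f} years")
    print(f"annual growth factor: {annual_growth_factor(2.0):.2f}")
```

Under this assumption the quoted rate corresponds to a doubling of peak performance roughly every 18 months.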
In the second half of the seventies the introduction of vector computer systems marked the beginning of modern supercomputing. These systems offered a performance advantage of at least one order of magnitude over conventional systems of that time. Raw performance was the main, if not the only, selling argument. In the first half of the eighties the integration of vector systems into conventional computing environments became more important. Only the manufacturers which provided standard programming environments, operating systems and key applications were successful in getting industrial customers and survived. Performance was mainly increased by improved chip technologies and by producing shared memory multiprocessor systems.
Fostered by several government programs, massively parallel computing with scalable systems using distributed memory came into the focus of interest at the end of the eighties. Overcoming the hardware scalability limitations of shared memory systems was the main goal. The increase in performance of standard microprocessors after the RISC revolution, together with the cost advantage of large-scale production, formed the basis for the "Attack of the Killer Micros". The transition from ECL to CMOS chip technology and the usage of "off-the-shelf" microprocessors instead of custom-designed processors for MPPs was the consequence.
The traditional design focus for MPP systems was the very high end of performance.
In the early nineties the SMP systems of various workstation manufacturers, as well as the IBM SP series, which targeted the lower and medium market segments, gained great popularity. Their price/performance ratios were better due to the missing overhead in the design for support of the very large configurations and due to the cost advantages of larger production numbers. Due to the vertical integration of performance it was no longer economically feasible to produce and focus on the highest end of computing power alone. The design focus for new systems shifted to the market of medium performance systems.
The acceptance of MPP systems not only for engineering applications but also for new commercial applications, especially database applications, emphasized different criteria for market success such as stability of the system, continuity of the manufacturer and price/performance. Success in commercial environments is now a new important requirement for a successful supercomputer business. Due to these factors and the consolidation in the number of vendors in the market, hierarchical systems built with components designed for the broader commercial market are currently replacing homogeneous systems at the very high end of performance. Clusters built with off-the-shelf components also gain more and more attention.
1976-1985: The First Vector Computers
If one had to pick one person associated with supercomputing it would without doubt be Seymour Cray. Coming from Control Data Corporation (CDC), where he had designed the CDC 6600 series in the sixties, he started his own company 'Cray Research Inc.' in 1972. The delivery of the first Cray 1 vector computer in 1976 to the Los Alamos Scientific Laboratory marked the beginning of the modern era of 'Supercomputing'. The Cray 1 was characterized by a new architecture which gave it a performance advantage of more than an order of magnitude over scalar systems at that time. Beginning with this system, high-performance computers had a substantially different architecture from mainstream computers. Before the Cray 1, systems which were sometimes called 'supercomputers', like the CDC 7600, had still been scalar systems and did not differ in their architecture to this extent from competing mainstream systems. For more than a decade supercomputer was a synonym for vector computer. Only at the beginning of the nineties would the MPPs be able to challenge or outperform their MP vector competitors.
Cray 1
The architecture of the vector units of the Cray 1 was the basis for the complete family of Cray vector systems into the nineties, including the Cray 2, Cray X-MP, Y-MP, C-90, J-90 and T-90. A common feature was not only the usage of vector instructions and vector registers but especially the close coupling of the fast main memory with the CPU. The system did not have a separate scalar unit but integrated the scalar functions efficiently in the vector CPU, with the advantage of high scalar computing speed as well. One common remark about the Cray 1 was that it was not only the fastest vector system but also the fastest scalar system of its time. The Cray 1 was also a true load/store architecture, a concept which later entered mainstream computing with the RISC processors. In the X-MP and follow-on architectures, three simultaneous load/store operations per CPU were supported in parallel from main memory without using caches. This gave the systems exceptionally high memory-to-register bandwidth and greatly facilitated their ease of use.
The Cray 1 was well accepted in the scientific community and 65 systems were sold until the end of its production in 1984. In the US the initial acceptance was largely driven by government laboratories and classified sites for which raw performance was essential. Due to its potential the Cray 1 soon gained great popularity in general research laboratories and at universities.
Cyber 205
The main competitor for the Cray 1 was a vector computer from CDC, the Cyber 205. This system was based on the design of the Star 100, of which only 4 systems had been built after its first delivery in 1974. Neil Lincoln designed the Cyber 203 and Cyber 205 systems as memory-to-memory machines not using any registers for the vector units. The system also had separate scalar units. It used multiple pipelines to achieve high peak performance and a virtual memory, in contrast to Cray's direct memory. Due to the memory-to-memory operation the vector units had rather long startup phases, so that high performance could be achieved only on long vectors.
CDC had been the market leader for high-performance systems with its CDC 6600 and CDC 7600 models for many years, which gave the company the advantage of a broad existing customer base. The Cyber 205 was first delivered in 1982 and about 30 systems were sold altogether.
Japanese Vector Systems
At the end of the seventies the main Japanese computer manufacturers Fujitsu, Hitachi and NEC started to develop their own vector computer systems. The first models were delivered in late 1983 and mainly sold in Japan. Fujitsu had decided early to sell its vector systems in the USA and Europe through its mainframe distribution partners Amdahl and Siemens. This was the main reason that Fujitsu VP100 and VP200 systems could be found in decent numbers early on in Europe. NEC tried to market its SX1 and SX2 systems by itself and had a much harder time finding customers outside of Japan. From the beginning Hitachi had decided not to market the S810 system outside of Japan. Common features of the Japanese systems were separate scalar and vector units and the usage of large vector registers and multiple pipelines in the vector units. The scalar units were IBM 370 instruction compatible, which made the integration of these systems in existing computer centers easy. In Japan all these systems were well accepted and especially the smaller models were sold in reasonable numbers.
Vector Multi-Processor
At Cray Research the next steps to increase performance were not only to increase the performance and efficiency of the single processors but also to build systems with multiple processors. Due to diverging design ideas and emphasis, two design teams worked in parallel in Chippewa Falls.
Steve Chen designed the Cray X-MP system, first introduced with 2 processors in 1982. The enlarged model with 4 processors was available in 1984.
The systems were designed as symmetrical shared memory multiprocessor systems. The main emphasis of the development was the support of multiple processors. Great effort went into the design of an effective memory access subsystem which was able to support multiple processors with high bandwidth. While the multiple processors were mainly used to increase the throughput of computing centers, Cray was one of the first to offer a means for parallelization within a user's program, using features such as Macrotasking and later on Microtasking.
At the same time Seymour Cray focused at Cray Research on the development of the Cray 2. His main focus was on advanced chip technology and new concepts in cooling. The first model was delivered in 1985. With its four processors it promised a peak performance of almost 2 GFlop/s, more than twice as much as a 4-processor X-MP. As its memory was built with DRAM technology, the available real main memory reached the unprecedented amount of 2 GByte. This memory size allowed for long-running programs not feasible on any other system.
The Japanese vector computer manufacturers decided to follow a different technology path. They increased the performance of their single processors by using advanced chip technology and multiple pipelines. The multiprocessor models later announced by the Japanese manufacturers typically had 2 or at most 4 processors.
1985-1990: The Golden Age of Vector Computers
The class of symmetric multi-vector processor systems dominated the supercomputing arena in the eighties due to its commercial success.
Cray Y-MP - Success in Industry
The follow-up to the Cray X-MP, the Cray Y-MP, was a typical example of the sophistication of the memory access subsystems, which was one of the major reasons for the overall very high efficiency achievable on these systems. With this product line, later including the C-90 and T-90, Cray Research followed the very successful path to higher processor numbers, always trying to keep the usability and efficiency of its systems as high as possible. The Cray Y-MP, first delivered in 1988, had up to 8 processors, the C-90, first delivered in 1991, up to 16 processors, and the T-90, first delivered in 1995, up to 32 processors. All these systems were produced in ECL chip technology.
At the beginning of the eighties the acceptance of the Cray 1 systems was strongly helped by the easy integration into computing center environments of other vendors and by support for the standard programming language Fortran 77. After 1984 a standard UNIX operating system, UNICOS, was available for all Cray systems, which was quite an innovation for a mainframe at that time. With the availability of vectorizing compilers in the second half of the eighties, more independent software vendors started to port their key applications to Cray systems, which was an immense advantage in selling Cray systems to industrial customers. For these reasons Cray vector systems started to have success in industries such as the automotive and oil industries. Success in these markets ensured the dominance of Cray Research in the overall supercomputer market for more than a decade.
Looking at the Mannheim supercomputer statistics [3] in Table 1, which counted the worldwide installed vector systems, we see the dominance of Cray with a constant market share of 60%. This was confirmed by the first Top500 list from June 1993 [2], which included MPP systems as well. Cray had an overall share of 40% of all the installed systems, which was equivalent to 60% of the included vector systems.
Cray 3
Seymour Cray left Cray Research Inc. in 1989 to start Cray Computer and to build the follow-up to the Cray 2, the Cray 3. Again the idea was to use the most advanced chip technology to push single processor performance to its limits. The choice of GaAs technology was however ahead of its time and led to many development problems. In 1992 a single system was delivered. The announced Cray 4 system was never completed.
ETA
In 1983 CDC decided to spin off its supercomputer business into the subsidiary 'ETA Systems Inc.'. The ETA10 system was the follow-up to the Cyber 205, on which it was based. The CPUs had the same design and the systems had up to 8 processors with a hierarchical memory. This memory consisted of a global shared memory and local memories per processor, all of which were organized as virtual memory. CMOS was chosen as the basic chip technology. To achieve low cycle times the high-end models had a sophisticated cooling system using liquid nitrogen. The first systems were delivered in 1987. The largest model had a peak performance of 10 GFlop/s, well beyond the competing model of Cray Research.
ETA however seems to have overlooked the fact that raw performance was no longer the only or even the most important selling argument. In April 1989 CDC terminated ETA and closed its supercomputer business. One of the main failures of the company was that it overlooked the importance of a standard operating system and standard development environments, a mistake which brought down not only ETA but CDC itself.
Mini-Supercomputer
Due to the limited scalability of existing vector systems there was a gap in performance between traditional scalar mainframes and the vector systems of the Cray class. This market was targeted by some new companies which started in the early eighties to develop the so-called mini-supercomputers. The design goal was one third of the performance of the Cray-class supers at only one tenth of the price. The most successful of these companies was Convex, founded by Steve Wallach in 1982. They delivered the first single-processor system, the Convex C1, in 1985. In 1988 the multiprocessor system C2 followed. Due to the wide software support these systems were well accepted in industrial environments and Convex sold more than 500 of these systems worldwide.
MPP - Scalable Systems and the Killer Micros
In the second half of the eighties a new class of systems started to appear: parallel computers with distributed memory. Supported by the Strategic Computing Initiative of the US Defense Advanced Research Projects Agency (DARPA, 1983), a couple of companies started developing such systems early in the eighties. The basic idea was to create parallel systems without the obvious limitations in processor number shown by the shared memory designs of the vector multiprocessors.
The first models of such massively parallel systems (MPPs) were introduced to the market in 1985. At the beginning the architectures of the different MPPs were still quite diverse. The major exception was the connection network, as most vendors chose a hypercube topology. Thinking Machines Corporation (TMC) demonstrated its first SIMD system, the Connection Machine 1 (CM-1). Intel showed its iPSC/1 hypercube system using Intel 80286 CPUs and an Ethernet-based connection network. nCube produced the first nCube 10 hypercube system with scalar Vax-like custom processors. While these systems were still clearly in the stage of experimental machines, they formed the basis for broad research on all issues of massively parallel computing. Later generations of these systems were then able to compete with vector MP systems.
Due to the conceptual simplicity of the global architecture the number of companies building such machines grew very fast. This included the otherwise rare European efforts to produce supercomputer hardware. Companies which started to develop or produce MPP systems in the second half of the eighties include: TMC, Intel, nCube, FPS (Floating Point Systems), KSR (Kendall Square Research), Meiko, Parsytec, Telmat, Suprenum, MasPar, BBN, and others.
Thinking Machines
After demonstrating the CM-1 Connection Machine in 1985, TMC soon introduced the follow-on CM-2, designed by Danny Hillis, which became the first major MPP. In 1987 TMC started to install the CM-2 system. The Connection Machine models were single instruction, multiple data (SIMD) systems. Up to 64k single-bit processors connected in a hypercube network, together with 2048 Weitek floating-point units, could work together on a single problem under the control of a single front-end system. The CM-2 was the first MPP system which was not only successful in the market but which could also challenge the vector MP systems of its time (Cray Y-MP), at least for certain applications.
The success of the CM-2 was great enough that another company, MasPar, started producing SIMD systems as well. Its first system, the MasPar MP-1 using 4-bit processors, was first delivered in 1990. The follow-on model MP-2 with 8-bit processors was first installed in 1992.
The main disadvantage of all SIMD systems however proved to be the limited flexibility of the hardware, which limited the number of applications and programming models which could be supported. Consequently TMC decided to design its next major system, the CM-5, as a MIMD system. To satisfy the existing customer base this system could run data-parallel programs as well.
Early MPPs
Competing MPP manufacturers had decided from the start to produce MIMD systems. The more complex programming of these systems was more than compensated for by the much greater flexibility to support different programming paradigms efficiently.
Intel built systems based on the different generations of Intel microprocessors. The first system, the iPSC/1, was introduced in 1985 and used Intel 80286 processors with an Ethernet-based connection network. The second model, the iPSC/2, used the 80386 and already had a circuit-switched routing network. The iPSC/860, introduced in 1990, finally featured the i860 chip. For Intel, massively parallel meant up to 128 processors, which was the limit due to the maximum dimension of the connection network.
In contrast to using standard off-the-shelf microprocessors, nCube had designed its own custom processor as the basis for its nCube 10 system, also introduced in 1985. The design of the processor was similar to the good old Vax and therefore a typical CISC design. To compensate for the relatively low performance of this processor, the maximum number of processors possible was however quite high. The limitation was again the dimension of the hypercube network, 13, which would have allowed up to 8192 processors. The follow-up nCube 2, again using this custom processor, was introduced in 1990.
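The relation between the dimension of a hypercube network and the maximum machine size can be made concrete with a short sketch. A hypercube of dimension d connects 2^d nodes, each linked to the d nodes whose binary address differs in exactly one bit. The function names below are illustrative and do not correspond to any vendor's software.

```python
from typing import List


def hypercube_nodes(dimension: int) -> int:
    """Number of nodes in a hypercube of the given dimension (2**dimension)."""
    return 1 << dimension


def hypercube_neighbors(node: int, dimension: int) -> List[int]:
    """Direct neighbors of a node: flip each of the address bits in turn."""
    return [node ^ (1 << bit) for bit in range(dimension)]


def hop_distance(a: int, b: int) -> int:
    """Minimum number of hops between two nodes (Hamming distance of addresses)."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    print(hypercube_nodes(7))          # dimension 7, the 128-processor limit
    print(hypercube_nodes(13))         # dimension 13
    print(hypercube_neighbors(0, 3))   # neighbors of node 0 in a 3-cube
    print(hop_distance(0b000, 0b111))  # opposite corners of a 3-cube
```

Since each node needs one link per dimension, the number of ports available per node fixes the maximum dimension, and with it the largest possible configuration.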
Attack of the Killer Micros
At the beginning of the nineties one phrase showed up on the front pages of several magazines: "The Attack of the Killer Micros". Coined by Eugene Brooks from Livermore National Laboratory, this was the synonym for the greatly increased performance levels achievable with microprocessors after the RISC/superscalar design revolution. The performance of microprocessors seemed to have reached a level comparable to that of the much more expensive ECL custom mainframe computers. But not only did the traditional mainframes start to feel the heat; even the traditional supercomputers, the vector multiprocessors, came under attack.
Another slogan for this process was widely used when Intel introduced its answer to the RISC processors, the i860 chip: "Cray on a chip". Even though sustained performance values did not always come up to the PAP (peak advertised performance) values, the direction of the attack was clear. The new generation of microprocessors, manufactured relatively cheaply in great numbers for the workstation market, offered much better price/performance ratios. Together with the scalable architecture of MPPs, the same high performance levels as with vector multiprocessors could be achieved for a better price.
This contrast was greatly aggravated by the fact that most mainframe manufacturers had not seen the advantage of CMOS chip technology early enough and were still using the slightly faster but much more expensive ECL technology. Cray Research was no exception in this respect. Under great pressure all mainframe manufacturers started to switch from ECL to CMOS. At the same time they also started to produce their own MPP systems in order to have products competing with those of the up-and-coming new MPP vendors.
Hidden behind these slogans were actually two different trends working together, the effects of both of which can clearly be seen in the Top500 data. The replacement of ECL chip technology by CMOS is shown in Figure 2. The rapid decline of vector-based systems in the nineties can be seen in the Top500 in Figure 3. This change in the processor architecture is however not as generally accepted as the change from ECL to CMOS. The Japanese manufacturers together with SGI still continue to produce vector-based systems. The hopes of all the MPP manufacturers to grow and gain market share however did not come true. The overall market for HPC systems grew only slowly and mostly in directions not anticipated by the traditional MP vector or MPP manufacturers. The attack of the killer micros went into its next stage. This time the large-scale architecture of the MPPs seen before would be under attack.
One major side effect of the introduction of MPP systems into the supercomputer market was that the sharp performance gap between supercomputers and mainstream computers no longer existed. MPPs could, by definition, be scaled by one or two orders of magnitude, bridging the gap between high-end multiprocessor workstations and supercomputers. Homogeneous and monolithic architectures designed for the very high end of performance would have to compete with clustered concepts based on shared memory multiprocessor models from the traditional UNIX server manufacturers. These clusters offered another level of price/performance advantage due to the large supporting industrial and commercial business market in the background and due to the reduced design cost of no longer focusing on the very high end of performance. This change towards the usage of standard components widely used in the commercial marketplace actually widened the market scope for such MPPs in general. As a result the notion of a separate market for floating-point-intensive high-performance computing no longer holds. The HPC market is nowadays the upper end of a continuum of systems and usage in all kinds of applications.
Playground for Manufacturers
With the greatly increased number of MPP manufacturers it was evident that a "shake-out" of manufacturers was unavoidable. In Table 2 we list vendors which have been active at some point in the HPC market [6,1]. In Figure 4 we try to visualize the historic presence of companies in the HPC market. After 1993 we included only companies which had entries in the Top500 and were actively manufacturing.
Table 2: Commercial HPC Vendors
From the 14 major companies in the early nineties only 4 survived this decade on their own. These are the 3 Japanese vector manufacturers Fujitsu, Hitachi, and NEC, and IBM, which due to its marginal HPC presence at the beginning of the nineties could even be considered a newcomer. Four other companies entered the HPC market either by buying some companies or by developing their own products: Silicon Graphics, Hewlett-Packard, Sun, and Compaq. None of these is a former HPC manufacturer. All are typical workstation manufacturers which entered the HPC market, at least initially, from the lower end with high-end UNIX server models. Their presence and success already indicate the change in focus from the very high end to markets for medium-size HPC systems.
Kendall Square Research
In 1991 the first models of a quite new and innovative system were installed: the KSR1 from Kendall Square Research. The hardware was built like other MPPs with distributed memory but gave the user the view of a shared memory system. The custom-designed hardware and the operating system software were responsible for this virtual shared memory (VSM) appearance. This concept of virtual shared memory could later on be found in other systems such as the Convex SPP series and lately in the SGI Origin series. The KSR systems organized the complete memory on top of the VSM as cache-only memory. As a result the data had no fixed home in the machine and could freely roam to the location where they were needed. 'Management mistakes' brought the operations of KSR to an abrupt end in late 1994.
Intel
In 1992 Intel started to produce the Paragon XP series after delivering the Touchstone Delta system to Caltech in 1991. Still based on the i860 chips, the interconnection network was changed to a two-dimensional grid which now allowed up to 2048 processors. Several quite large systems were installed, and in June 1994 a system at Sandia National Laboratory which achieved 143 GFlop/s on the LINPACK benchmark was number one in the Top500.
Intel decided to stop its general HPC business in 1996 but still built the ASCI Red system afterwards.
Thinking Machines
Also in 1992 TMC started to deliver the CM-5, a MIMD system designed for the very high end. Theoretically, systems with up to 16k processors could be built. In practice the largest configuration reached 1056 processors at the Los Alamos National Laboratory. This system was at the number one spot of the very first Top500 list in June 1993, achieving almost 60 GFlop/s. A Sun SPARC processor was chosen as the basic node CPU. Each of these processors had 4 vector coprocessors to increase the floating-point performance of the system. The initial programming paradigm was the data-parallel model familiar from the CM-2 predecessor. The complexity of this node design however was more than the company or its customers could handle. Because the design point was the very high end, the smaller models also had problems competing with models from other companies which did not have to pay for the overhead of supporting such large systems in their design. The CM-5 would be the last TMC model before the company had to stop the production of hardware in 1994.
The raw potential of the CM-5 was demonstrated by the fact that in June 1993 positions 1 to 4 were all held by TMC CM-5 systems, ahead of the first MP vector system, an NEC SX-3/44R. The only other MPP system able to beat a Cray C-90 at that time was the Intel Delta system. The performance leadership however had started to change.
In June 1993, 5 of the first 10 systems were still MP vector systems. This number decreased fast, and the last MP vector system which managed to make the top 10 was an NEC SX-4 with 32 processors in June 1996 with 66.5 GFlop/s. Later only systems with distributed memory made the top 10 list. Japanese MPPs with vector CPUs managed to keep their spot in the top 10 for some time. In the November 1998 list however the top 10 positions were for the first time all taken by microprocessor-based ASCI or Cray T3E systems. The first system with vector CPUs was an NEC SX-4 with 128 processors at number 18.
IBM
In 1993 IBM finally joined the field of MPP producers by building the IBM SP1 based on its successful workstation series RS/6000. While this system was often mocked as being a workstation cluster and not an MPP, it laid the ground for the reentry of IBM into the supercomputer market. The follow-on SP2 with increased node and network performance was first delivered in 1994. Contrary to other MPP manufacturers, IBM was focusing on the market for small to medium-size machines, especially for the commercial UNIX server market. Over time this proved to be a very profitable strategy for IBM, which managed to sell models of the SP quite successfully as database servers. Due to the design of the SP2, IBM is able to constantly offer new nodes based on the latest RISC system available.
Cray Research
In 1993 Cray Research finally started to install its first MPP system, the Cray T3D. As indicated by the name, the network was a 3-dimensional torus. As CPU Cray had chosen the DEC Alpha processor. The design of the node was done completely by Cray itself and was substantially different from a typical workstation using the same processor. This had advantages and disadvantages. Due to the closely integrated custom network interface, the network latencies and the bandwidth reached values not seen before and allowed very efficient parallel processing. The computational node performance itself was however greatly affected by the missing second-level cache. The system was immediately well accepted at research laboratories and was even installed at some industrial customer sites. The largest configuration known is installed at a classified government site in the USA, with 1024 processors and just breaking the 100 GFlop/s barrier on the LINPACK benchmark.
Convex
In 1994 Convex introduced it's rst true MPP, the SPP1000 series. This series was also awaited with some curiosity a s i t w as after KSR the second commercial system featuring a virtual shared memory. The architecture of the system was hierarchical. Up to 8 HP microprocessors were connected to a shared memory with crossbar technology similar to the one used in the Convex vector series. Multiple of these SMP units would then be connected in a distributed memory fashion. The operating system and the connection hardware would provide the view of a non-uniform shared memory over the whole machine. In the following years a series of follow on models was introduced the SPP1200 in 1995, the SPP1600 in 1996, and the SPP2000 renamed by H P as Exemplar X-Class in 1997.
The role of MP vector systems during 1990 1995
Cray Research continued building their main MP vector system in traditional style. The Triton, known as T-90, was introduced in 1995 and build in ECL very much along the line of the Y-MP and C-90 series. The maximum number of processors was increased to 32. This gave a full system a peak performance of 58 GFlop s. Realizing that it needed a product for lower market segment Cray had bought the company Supercomputer which had developed Cray Y -MP compatible vector systems in CMOS technology. I t w as marketed by Cray starting in 1993. The next system in this series developed by Cray Research itself was the J-90 introduced in 1995 as well. With up to 32 processors it reached a peak performance of 6.4 GFlop s which w as well below the ECL systems from Cray and unfortunately not much a b o ve the performance of best microprocessor available.
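The gap between the ECL and CMOS lines becomes clearer when the quoted system peaks are broken down per processor. The following minimal sketch does this arithmetic using only the figures given in the text; the helper name is illustrative.

```python
def peak_per_processor(system_peak_gflops: float, processors: int) -> float:
    """Peak performance of one processor in GFlop/s, given the full-system peak."""
    return system_peak_gflops / processors


# Full-system peak (GFlop/s) and maximum processor count as quoted in the text.
systems = {
    "Cray T-90 (ECL)": (58.0, 32),
    "Cray J-90 (CMOS)": (6.4, 32),
}

if __name__ == "__main__":
    for name, (peak, cpus) in systems.items():
        print(f"{name}: {peak_per_processor(peak, cpus):.2f} GFlop/s per processor")
```

On these figures a single T-90 processor peaks at about 1.8 GFlop/s and a single J-90 processor at 0.2 GFlop/s, roughly a factor of nine apart.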
Convex introduced the C3 series in 1991 and the C4 in 1994, before the company was bought by Hewlett-Packard the same year. After this merger the unit focused on its MPP products.
In Japan, Fujitsu had introduced the single-processor VP2600 in 1990 and would market this series until the introduction of the CMOS-based VPP500 in 1994. NEC introduced the SX-3 series in 1990 as well. With up to four processors, this system reached a peak performance of 26 GFlop/s. NEC subsequently implemented its vector series in CMOS and introduced the NEC SX-4 in 1994. Up to 32 processors can be installed as a conventional shared-memory MP vector system. Beyond this, up to 16 of these units can be clustered in a distributed-memory fashion. The largest configurations known to be installed have 128 processors, with which they gained positions 29 and 30 in the June 1999 Top500 list. These are at present the largest vector-based systems of traditional design.
Fujitsu's MPP-Vector Approach
Fujitsu decided to go its own way into the world of MPPs. It built its commercial MPP system with distributed memory around the node and CPU of its successful VP2600 vector computer series. Fujitsu was, however, the first Japanese company to implement its vector design in CMOS, with the VPP500. A first `pre-prototype' was developed together with the National Aerospace Laboratories (NAL). The installation of this system, named the Numerical Wind Tunnel (NWT), started in 1993. Due to its size, this system managed to gain the number 1 position in the Top500 an unchallenged four times and the number 2 position three times from November 1993 to November 1996. Delivery of VPP500 systems started in 1993.
After 1995: Old and New Customers
The year 1995 saw some remarkable changes in the distribution of the systems in the Top500 for the different types of customer (academic sites, research labs, industrial/commercial users, vendor installations, and confidential sites); see Fig. 5.
Until June 1995, the major trend seen in the Top500 data was a steady decrease of industrial customers, matched by an increase in the number of government-funded research sites. This trend reflects the influence of the different governmental HPC programs that enabled research sites to buy parallel systems, especially systems with distributed memory. Industry was understandably reluctant to follow this step, since systems with distributed memory have often been far from mature or stable. Hence, industrial customers stayed with their older vector systems, which gradually dropped off the Top500 list because of low performance.
Beginning in 1994, however, companies such as SGI, Digital, and Sun started to sell symmetrical multiprocessor (SMP) models of their major workstation families. From the very beginning, these systems were popular with industrial customers because of the maturity of these architectures and their superior price/performance ratio. At the same time, IBM SP2 systems started to appear at a reasonable number of industrial sites. While the SP initially was sold for numerically intensive applications, the system began selling successfully to a larger market, including database applications, in the second half of 1995.
Subsequently, the number of industrial customers listed in the Top500 increased from 85, or 17%, in June 1995 to about 241, or 48.2%, in June 1999. We believe that this is a strong new trend for the following reasons.
The architectures installed at industrial sites changed from vector systems to a substantial number of MPP systems. This change reflects the fact that parallel systems are ready for commercial use and environments. The most successful companies (Sun, IBM and SGI) are selling well to industrial customers. Their success is built on the fact that they are using standard workstation technologies for their MPP nodes. This approach provides a smooth migration path for applications from workstations up to parallel machines. The maturity of these advanced systems and the availability of key applications for them make the systems appealing to commercial customers. Especially important are database applications, since these can use highly parallel systems with more than 128 processors. Figure 6 shows that the increase in the number of systems installed at indus- 
Architectures
The changing share of the different system architectures in the HPC market, as reflected in the Top500, is shown in Figure 7. Besides the fact that single-processor systems are no longer powerful enough to enter the Top500, the major trend is the growing number of MPP systems. The number of clustered systems is also growing, and for the last two years we see a number of PC or workstation based `Networks of Workstations' (NOWs) in the Top500. It is an interesting and open question which share of the Top500 such NOWs will capture in the future.
Vector based Systems
Cray Research introduced its last ECL-based vector system, the T-90 series, in 1995. Due to the unfavorable price/performance of this technology, the T-90 was not an economic success for Cray Research. One year later, in 1996, SGI bought Cray Research. After this acquisition the future of the Cray vector series was in doubt. The joint company announced plans to produce a joint macro architecture for its microprocessor and vector processor based MPPs. In mid 1998 the SGI SV1 was announced as the future vector system of SGI. The SV1 is the successor both to the CMOS-based Cray J-90 and to the ECL-based Cray T-90. The SV1 is CMOS based, which means that SGI is finally following the trend set by the Fujitsu VPP700 and NEC SX-4 a few years earlier. First user shipments are expected by the end of 1999. One has to wait and see whether the SV1 can compete with the advanced new generation of Japanese vector systems, especially the NEC SX-5 and the Fujitsu VPP5000.
Fujitsu continued along the line of the VPP system and introduced the VPP700 series, featuring increased single-node performance, in 1996. For the lower market segment the VPP300 was introduced, using the same nodes but a less expandable interconnect network. The recently announced VPP5000 is again a distributed-memory vector system, where 4 up to 128 (512 by special order) processors may be connected via a fully distributed crossbar. The theoretical peak performance ranges from 38.4 GFlop/s up to 1.229 TFlop/s, and in special configurations even 4.915 TFlop/s.
NEC had announced the SX-4 series in 1994 and continues along this system line. The SX-4 features shared memory up to a maximum of 32 processors. Larger configurations are built as clusters using a proprietary crossbar switch. In 1998 the follow-up model SX-5 was announced for first delivery in late 1998 and early 1999. In the June 1999 Top500 list the SX-5 was not yet represented. In contrast to its predecessor, the SX-4, the SX-5 is no longer offered with faster but more expensive SRAM memory. The SX-5 systems are exclusively manufactured with synchronous DRAM memory. The multiframe version of the SX-5 can host up to 512 processors with 8 GFlop/s peak performance each, resulting in a theoretical peak performance of 2 TFlop/s. More information about all these current architectures can be found in [7].
Traditional MPPs
Large-scale MPPs with homogeneous system architectures had matured during the nineties with respect to performance and usage. Cray finally took the leadership here as well with the T3E system series, introduced in 1996 just before the merger with SGI. The performance potential of the T3E can be seen from the fact that in June 1997, 6 of the top 10 positions in the Top500 were occupied by T3Es. At the end of 1998 the top 10 consisted only of ASCI systems and T3Es.
Hitachi was one of the few companies introducing large-scale homogeneous systems in the late nineties. It announced the SR2201 series in 1996 and tries to sell this system outside of Japan as well.
The first of the ASCI systems, the ASCI Red at Sandia National Laboratory, was delivered in 1997. It immediately took the first position in the Top500 in June 1997, being the first system to exceed 1 TFlop/s LINPACK performance. ASCI Red also ended a period of several years during which Japanese systems ranked as number one.
IBM followed the path of its SP series, introducing new nodes and faster interconnects as well. One major innovation here was the use of SMP nodes as building blocks, which further demonstrates the proximity of the SP architecture to clusters. This design with SMP nodes was also chosen for the ASCI Blue Pacific systems.
SMPs and their successors
SGI made a strong appearance in the Top500 in 1994 and 1995. Its PowerChallenge systems, introduced in 1994, sold very well in the industrial market for floating point intensive applications. Clusters built with these SMPs appeared in reasonable numbers at customer sites.
In 1996 the Origin2000 series was announced. With this system SGI took the step away from the bus-based SMP design of the Challenge series. The Origin series features a virtual memory system built with distributed-memory nodes of up to 128 processors. To achieve higher performance these systems can in turn be clustered. This is the basic design of the ASCI Blue Mountain system.
Digital was for a long time active as a producer of clustered systems for commercial customers. In 1997 the Alpha Server Cluster was introduced, which was targeted towards floating point intensive applications as well. One year later Compaq acquired Digital, giving Compaq, as the first PC manufacturer, an entry into the HPC market.
Hewlett-Packard continued producing systems along the line of the former Convex SPP systems, targeting mainly the midsize business market, where the company has good success. The very high end, which had never been a target for Convex or HP, still seems to be of minor interest to HP.
Sun was the latest company to enter the Top500. After the merger of SGI and Cray Research in 1996, Sun bought the former business server division of Cray, which had produced SPARC-based SMP systems for several years. In 1997 Sun introduced the HPC 10000 series. This SMP system is built around a new type of switched bus, which allows up to 64 processors to be integrated in an efficient way. Due to its wide customer base and good reputation in the commercial market, Sun was able to sell these SMPs very well, especially to commercial and industrial customers. For the very high end market, clusters built with these SMPs were introduced in 1998.
New Application Areas
For research sites or academic installations, it is often difficult, if not impossible, to specify a single dominant application. The situation is different for industrial installations, however, where systems are often dedicated to specialized tasks or even to single major application programs. Since the very beginning of the Top500 project, we have tried to record the major application area for the industrial systems in the list. We have managed to track the application area for almost 90% of the industrial systems over time.
Since June 1995 we see many systems involved in new application areas entering the list. Figure 8 shows the total number of all industrial systems, which is made up of three components: traditional floating point intensive engineering applications, new non floating point applications, and unknown application areas. In 1993, the applications in industry typically were numerically intensive applications, for example, geophysics and oil applications, automotive applications, chemical and pharmaceutical studies, aerospace studies, electronics, and other engineering including energy research, mechanical engineering etc.
The share of these areas from 1993 to 1996 remained fairly constant over time.
Recently, however, industrial systems in the Top500 have been used for new application areas. These include database applications, finance applications, and image processing.
The most dominant trend is the strong rise of database applications since November 1995. These applications include on-line transaction processing as well as data mining. The HPC systems being sold and installed for such applications are large enough to enter the first hundred positions, a clear sign of the maturity of these systems and their practicality for industrial usage.
It is also important to notice that industrial customers are buying not only systems with traditional architectures, such as the SGI PowerChallenge or Cray T-90, but also MPP systems with distributed memory, such as the IBM SP2. Distributed memory is no longer a hindrance to success in the commercial marketplace.
2000 and beyond
Two decades after the introduction of the Cray 1, the HPC market has changed its face quite a bit. It used to be a market for systems clearly different from any other computer systems. Nowadays the HPC market is no longer an isolated niche market for specialized systems. Vertically integrated companies produce systems of any size. Similar software environments are available on all of these systems. This was the basis for a broad acceptance by industrial and commercial customers.
The increasing market share of industrial and commercial installations had several very critical implications for the HPC market. In the market for small to medium size HPC systems, the manufacturers of supercomputers for numerical applications face strong competition from manufacturers selling their systems in the very lucrative commercial market. These systems tend to have better price/performance ratios due to the larger production numbers of systems accepted by commercial customers and the reduced design costs of medium size systems. The market for the very high end systems itself is relatively small and does not grow. It cannot easily support specialized niche market manufacturers. This forces the remaining manufacturers to change the design for the very high end away from homogeneous large-scale systems towards cluster concepts based on medium size `off-the-shelf' systems.
Currently the debate still goes on as to whether we need a new architecture for the very high end supercomputer, such as the multithreaded design of Tera. At the same time we see a new architecture appearing in growing numbers in the Top500: the `Network of Workstations' (NOW), including PC based systems. Depending on where one draws the line between a cluster of SMPs and a network of workstations, we have about 6 such NOWs in the June 1999 edition of the Top500. As there is not a single manufacturer who provides LINPACK measurements for such systems, we certainly underestimate the actual number of NOWs large enough to fit in the Top500. The potential of this class of HPC architecture, with excellent price/performance ratios, should not be underestimated, and we expect to see more in the future. Whether they will seriously challenge the clusters of large SMPs for a wide range of applications is, however, an open question.
Dynamic of the Market
The HPC market is by its very nature very dynamic. This is reflected not only by the coming and going of new manufacturers but especially by the need to update and replace systems quite often to keep pace with the general performance increase. This general dynamic of the HPC market is well reflected in the Top500. In Table 3 we show the number of systems which fall off the end of the list within 6 months due to the increase in the entry level performance. We see an average replacement rate of about 160 systems every half year, or more than half the list every year. This means that a system which is at position 100 at a given time will fall off the Top500 within 2-3 years.
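The arithmetic behind these replacement figures can be sketched in a few lines of Python. The 160 systems per list and the list size of 500 come from the text; the function names are hypothetical, and the drop-off estimate rests on the assumption that the entry-level performance doubles every year (the growth reported for position 500 in the performance discussion of this paper). It is a rough illustration, not the analysis used to produce Table 3.

```python
import math

LIST_SIZE = 500          # entries in the Top500
REPLACED_PER_LIST = 160  # average number of systems dropped per list (from the text)
LISTS_PER_YEAR = 2       # the list is published every half year


def annual_turnover_fraction(replaced_per_list=REPLACED_PER_LIST,
                             lists_per_year=LISTS_PER_YEAR,
                             list_size=LIST_SIZE):
    """Fraction of the list replaced per year at a constant replacement rate."""
    return replaced_per_list * lists_per_year / list_size


def years_until_dropoff(perf_ratio, entry_growth_per_year=2.0):
    """Years until a system falls below the entry level.

    perf_ratio is the system's performance divided by the current
    entry-level (position 500) performance. The annual entry-level
    growth factor is an assumption, not a measured value.
    """
    return math.log(perf_ratio) / math.log(entry_growth_per_year)


turnover = annual_turnover_fraction()  # 320 of 500 systems per year
```

Under the doubling assumption, a lifetime of 2-3 years on the list corresponds to a system that starts out with roughly 4 to 8 times the entry-level performance.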
Consumer and Producer
The dynamic of the HPC market is well reflected in the rapidly changing market shares of the chip or system technologies used, of manufacturers, customer types or application areas. If, however, we are interested in where these HPC systems are installed or produced, we see a different picture.
Plotting the number of systems installed in different geographical areas in Figure 9, we see a rather steady distribution. The number of systems installed in the US is slightly increasing over time, while the number of systems in Japan is slowly decreasing.
Looking at the producers of HPC systems in Figure 10, we see an even greater dominance of the US, which actually slowly increases over time. European manufacturers do not play any substantial role in the HPC market at all.
Table 3: The replacement rate in the Top500, defined as the number of systems omitted because of their performance being too small.
Government programs
The high end of the HPC market was always the target of government programs all over the world to influence the further development of new systems. In the USA there are currently several government projects under way to consolidate and advance the numerical HPC capabilities of US government laboratories and the US research community in general. The most prominent of these is the `Accelerated Strategic Computing Initiative' (ASCI). The goal of this program is "to create the leading-edge computational modeling and simulation capabilities that are essential for maintaining the safety, reliability, and performance of the U.S. nuclear stockpile and reducing the nuclear danger", with systems of the largest scale technically possible. Currently the system `ASCI Red' is installed at Sandia National Laboratory. This system is produced by Intel and currently has 9472 Intel Pentium Xeon processors. It was the first system to exceed the 1 TFlop/s mark on the LINPACK benchmark in 1997.
Since then ASCI Red has been number 1 on the Top500, currently with 2.1 TFlop/s. `ASCI Blue Mountain', installed at the Los Alamos National Laboratory, is a cluster of Origin2000 systems produced by SGI with a total of 6144 processors. It achieves 1.6 TFlop/s LINPACK performance. `ASCI Blue Pacific', installed at the Lawrence Livermore National Laboratory, is an IBM SP system with a total of 5856 processors. The future plans of the ASCI program call for a 100 TFlop/s system by April 2003.
The Japanese government decided to fund the development of an `Earth Simulator' to simulate and forecast the global environment. In 1998 NEC was awarded the contract to develop a 30 TFlop/s system to be installed by 2002.
Performance Growth
While many aspects of the HPC market change quite dynamically over time, the evolution of performance seems to follow some empirical laws quite well, such as Moore's Law mentioned at the beginning of this article. The Top500 provides an ideal data basis to verify an observation like this. Looking at the computing power of the individual machines present in the Top500 and the evolution of the total installed performance, we plot the performance of the systems at positions 1, 10, 100 and 500 in the list as well as the total accumulated performance. In Figure 11 the curve of position 500 shows on average an increase of a factor of two within one year. All other curves show a growth rate of 1.8 ± 0.07 per year.
To compare these growth rates with Moore's Law, we now separate the influence of the increasing processor performance from that of the increasing number of processors per system on the total accumulated performance. To get meaningful numbers we exclude the SIMD systems from this analysis, as they tend to have extremely high processor numbers and extremely low processor performance. In Figure 12 we plot the relative growth of the total processor number and of the average processor performance, defined as the quotient of the total accumulated performance and the total processor number. We find that these two factors contribute almost equally to the annual total performance growth factor of 1.82. The processor number grows on average by a factor of 1.30 per year and the processor performance by 1.40, compared with 1.58 for Moore's Law.
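The decomposition of the growth factor can be checked with a short Python sketch. The factors 1.30, 1.40 and 1.82 are taken from the text; reading Moore's Law as a doubling every 18 months is an assumption made here, and it gives an annual factor of about 1.59, close to the 1.58 quoted above. The variable names are illustrative only.

```python
import math

# Annual growth factors quoted in the text.
processor_count_growth = 1.30
processor_perf_growth = 1.40

# Total performance = processor count x average processor performance,
# so the annual growth factors multiply.
total_growth = processor_count_growth * processor_perf_growth  # about 1.82

# On a logarithmic scale the two contributions add up, which makes
# their relative shares easy to compare.
count_share = math.log(processor_count_growth) / math.log(total_growth)
perf_share = math.log(processor_perf_growth) / math.log(total_growth)

# Assumption: Moore's Law read as a doubling every 18 months.
moore_annual_factor = 2 ** (12 / 18)
```

On the logarithmic scale the processor count accounts for roughly 44% of the growth and the processor performance for roughly 56%, which is the sense in which the two factors contribute almost equally.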
The average growth in processor performance is lower than we expected. A possible explanation is that during the last years powerful vector processors were replaced in the Top500 by less powerful super-scalar RISC processors. This effect might be the reason why the Top500 currently does not reflect the full increase in RISC performance. Once the ratio of vector to super-scalar processors in the Top500 stabilizes, we should see the full increase in processor performance reflected in the Top500. The overall growth of system performance is, however, larger than expected from Moore's Law. This comes from growth in two dimensions: processor performance and the number of processors used. Based on the current Top500 data, which cover the last 6 years, and the assumption that the current performance development continues for some time to come, we can now extrapolate the observed performance and compare these values with the goals of the government programs mentioned. In Figure 13 we extrapolate the observed performance values using linear regression on the logarithmic scale. This means that we fit exponential growth to all levels of performance in the Top500. This simple fitting of the data shows surprisingly consistent results. Based on the extrapolation from these fits we can expect to have the first 100 TFlop/s system by 2005, which is about 1 to 2 years later than the ASCI path forward plans. By 2005 also no system smaller than
Projections
The only performance level whose growth rate deviates from all others is the performance of position 100. In our opinion this is actually a sign of a lower number of centers being able to afford a very high end system. This concentration of computing power in fewer centers can already be seen in the US for a couple of years in basically all government programs.
Looking even further into the future, we could speculate that, based on the current doubling of performance every year, the first Petaflop system should be available around 2009. Due to the rapid changes in the technologies used in HPC systems, there is however at this point in time no reasonable projection possible for the architecture of such a system at the end of the next decade. Even though the HPC market has changed its face quite substantially since the introduction of the Cray 1 three decades ago, there is no end in sight for these rapid cycles of re-definition. And we can still say that in the High-Performance Computing market "The Only Thing Constant Is Change".
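The kind of extrapolation used for these projections, a linear regression on the logarithmic scale, can be sketched as follows. The data series here is synthetic and is not the Top500 data: it is generated from two numbers quoted in this paper, the 2.1 TFlop/s of the number 1 system in 1999 and an annual growth factor of 1.82. The function names are hypothetical.

```python
import math


def fit_exponential(years, values):
    """Least-squares fit of log10(value) = a + b * year; returns (a, b)."""
    n = len(years)
    logs = [math.log10(v) for v in values]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def year_reaching(target, a, b):
    """Year at which the fitted exponential reaches the target value."""
    return (math.log10(target) - a) / b


# Synthetic series (an assumption, not measured data): performance in Flop/s
# of a number 1 system growing by a factor of 1.82 per year up to
# 2.1 TFlop/s in 1999.
years = [1993 + i for i in range(7)]
values = [2.1e12 / 1.82 ** (1999 - y) for y in years]

a, b = fit_exponential(years, values)
annual_factor = 10 ** b                    # recovers the assumed growth factor
year_100_tflops = year_reaching(1e14, a, b)
year_1_pflops = year_reaching(1e15, a, b)
```

With these assumed inputs the fitted line crosses 100 TFlop/s during 2005 and 1 PFlop/s during 2009, in line with the projections stated in the text. Real list data would scatter around the fitted line, so the crossing years would carry a corresponding uncertainty.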
