Parallel computers dedicated to lattice field theories are reviewed with emphasis on the three recent projects, the Teraflops project in the US, the CP-PACS project in Japan and the 0.5-Teraflops project in the US. Some new commercial parallel computers are also discussed. Recent development of semiconductor technologies is briefly surveyed in relation to possible approaches toward Teraflops computers.
Introduction
Numerical studies of lattice field theories have developed significantly in parallel with the development of computers during the past decade. Of particular importance in this regard has been the construction of dedicated QCD computers (see Table 1; for earlier reviews see Ref. [1]) and the move of commercial vendors toward parallel computers in recent years. Due to these developments we now have access to parallel computers which are capable of 5-10 Gflops of sustained speed.
However, a fully convincing numerical solution of many lattice field theory problems, in particular those of lattice QCD, requires much more speed. In fact the typical number of floating point operations required in these problems, such as full QCD hadron mass spectrum calculations, often exceeds 10^18, which translates to 115 days of computing time at a sustained speed of 100 Gflops. Under these circumstances we really need computers with a sustained speed exceeding 100 Gflops.
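The estimate above is simple arithmetic and can be checked with a short sketch. The function name and the second data point (10 Gflops, the upper end of the presently available sustained speed quoted in the Introduction) are illustrative choices, not part of the original talk.

```python
# Back-of-the-envelope check of the computing time quoted in the text.
SECONDS_PER_DAY = 24 * 60 * 60

def computing_days(total_flop: float, sustained_flops: float) -> float:
    """Days needed to execute total_flop operations at a given sustained speed."""
    return total_flop / sustained_flops / SECONDS_PER_DAY

# 10^18 operations at 100 Gflops sustained: roughly 115 days.
days_at_100_gflops = computing_days(1e18, 100e9)
# The same workload at 10 Gflops sustained takes ten times longer.
days_at_10_gflops = computing_days(1e18, 10e9)

print(f"{days_at_100_gflops:.1f} days at 100 Gflops")
print(f"{days_at_10_gflops:.0f} days at 10 Gflops")
```

At 10 Gflops the same calculation would take more than three years, which is why a sustained speed beyond 100 Gflops is the relevant target.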
In this talk I review the present status of efforts toward construction of dedicated parallel computers with a peak speed of 100-1000 Gflops. Of the six projects in this category (see Table 1), APE100 [2] is near completion and ACPMAPS upgraded [3] is running now. Because they have already been reviewed previously [1], we shall only describe their most recent status. The three recent projects, the Teraflops project [4,5] in the United States, the CP-PACS project [6] in Japan and the 0.5-Teraflops project in the United States, receive the main emphasis.
Review talk at Lattice 93, Dallas, USA.
A key ingredient in the fast progress of parallel computers in recent years is the development in semiconductor technologies. Understanding this aspect is important when one considers possible approaches toward a Teraflops of speed. I shall therefore start this review with a brief reminder of the development of vector and parallel computers and the technological reasons why parallel computers have recently exceeded vector computers in computing capability (Sec. 2). The status of APE100 and ACPMAPS upgraded is summarized in Sec. 3. The US Teraflops, CP-PACS and 0.5-Teraflops projects are described in Sec. 4. Powerful parallel computers are also available from commercial vendors. In Sec. 5 I shall discuss two new computers, the Fujitsu VPP500 and CRAY T3D. After these reviews I discuss several architectural issues for computers toward Teraflops in Sec. 6. A brief conclusion is given in Sec. 7.
Recent development of computers and semiconductor technology
In the upper part of Fig. 1 we show the progress of peak speed of vector and parallel computers over the years. Small symbols correspond to the first shipping date of computers made by commercial vendors, with open ones for vector and filled ones for parallel type. Parallel computers dedicated to lattice QCD are plotted by large symbols. We clearly observe that the rate of progress for parallel computers is roughly double that of vector computers and that a crossover in peak speed has taken place from vector to parallel computers around 1991.
The "linear fit" drawn in Fig. 1 for parallel computers can be extrapolated to the period prior to 1985. QCDPAX is the fifth generation computer in the PAX series [9] and there are four earlier computers starting in 1978. In the lower part of Fig. 1 the peak speeds of these computers are plotted in units of Mflops together with that of the Caltech computer described, for example, by Norman Christ at Fermilab in 1988 [1]. It is amusing to observe that the rapid increase of speed of parallel computers has been continuing for over a decade since the early days.
It is important to note that the first three PAX computers are limited to 8 bit arithmetic and the fourth one to 16 bit. We also recall that the first Columbia computer used 22 bit arithmetic. Thus not only the peak speed but also the precision of floating point numbers has increased significantly for parallel computers. Now 64 bit arithmetic is becoming standard.
To see more closely why the crossover happened, let us look at the development of semiconductor technology. In Fig. 2 we show how machine clocks have become faster in the case of ECL, which is utilized in vector-type supercomputers, as well as in the case of CMOS, which is used in personal computers and workstations. As we can see, the speed of CMOS is about ten times lower than that of ECL. However, its power consumption and heat output are much lower than those of ECL. Furthermore the speed of CMOS itself has become comparable to that of ECL of the late 1980's.
The machine cycle of one nanosecond is a kind of limit to reach. This is understandable because one nanosecond is the time in which light travels 30 cm. In this time interval one has to load data from memory to a floating point operation unit, make a calculation and store results to the memory. Even in the ideal case of pipelined operations, one nanosecond corresponds only to one Gflops. Usually a vector computer has a multiple operation unit which consists of, for example, 8 floating point operation units (FPUs). Because of this, the theoretical peak speed becomes 8 Gflops. Further, it has multiple sets of this kind of multiple FPUs; in the case of 4 sets the peak speed becomes 32 Gflops. This is how a vector computer gets a peak speed of order 10 Gflops. That is, recent vector computers are already parallel computers. However, it is rather difficult to proceed further in this approach because of the power consumption and the heat output.
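The numbers in this argument can be reproduced with a few lines. The sketch below assumes, as the text does, one floating point result per machine cycle per pipelined FPU; the function names are mine.

```python
C_LIGHT_M_PER_S = 299_792_458  # speed of light in vacuum

def light_distance_cm(cycle_ns: float) -> float:
    """Distance light travels during one machine cycle, in centimetres."""
    return C_LIGHT_M_PER_S * cycle_ns * 1e-9 * 100.0

def vector_peak_gflops(cycle_ns: float, fpus_per_unit: int, units: int) -> float:
    """Peak speed assuming one result per cycle from every pipelined FPU."""
    return fpus_per_unit * units / cycle_ns

distance = light_distance_cm(1.0)            # about 30 cm
one_unit = vector_peak_gflops(1.0, 8, 1)     # 8 FPUs -> 8 Gflops
four_units = vector_peak_gflops(1.0, 8, 4)   # 4 sets of 8 FPUs -> 32 Gflops

print(f"light travels {distance:.1f} cm in 1 ns")
print(f"peak: {one_unit:.0f} Gflops (1 set), {four_units:.0f} Gflops (4 sets)")
```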
On the other hand, the development of CMOS semiconductor technology, with its small size, high speed and low power consumption, has made it possible to construct a massively parallel computer which is composed of order of 1,000 nodes with a peak speed which exceeds that of vector-type supercomputers. This is the reason why the crossover occurred.
The speedup of CMOS has become possible due to the development of LSI technology. Figure 3 shows the development in terms of the minimum feature size or minimum spacing. Now the spacing has been reduced to 0.5 micron. This development has also led to a substantial increase of DRAM bit capacity which has recently reached the level of 16 Mbit. The speed of transistors has also increased with the decrease of minimum spacing because electrons have a shorter distance to move through. Figure 4 shows the development of the number of pins of LSI.
Due to these developments, it is now not a dream to construct a 1 Tflops computer with 64 bit arithmetic with reasonable size and reasonable power consumption.
Past and present of dedicated computers
The computers of the first group in Table 1, the three computers of Columbia [10], two versions of APE [11], QCDPAX [12], GF11 [13] and ACPMAPS [14], were constructed some years ago and have been producing physics results. The characteristics of these computers are given in Table 2. These computers are already familiar to the lattice community. Therefore I refer to earlier reviews [1] for details and just emphasize that a number of interesting physics results have been produced. This fact shows that there is real benefit in constructing dedicated computers.
The computers of the second group in Table 1, the 6 Gflops version of APE100 and ACPMAPS upgraded, have been recently completed. Both are now producing physics results, some of which have been reported at this conference. I list their characteristics in Table 3.
APE100
The architecture of APE100 [2] is a combination of SIMD and MIMD. The full machine consists of 2048 nodes with a peak speed of 100 Gflops. The network is a 3-dimensional torus. Each node has a custom-designed floating point chip called MAD. The chip contains a 32-bit adder and a multiplier with a 128-word register file. The memory size is 4 Mbytes/node with 80 ns access time 1M x 4 DRAM. The bandwidth between MAD and the memory is 50 Mbytes/sec, which corresponds to one word per 4 floating point operations. One board consists of 2 x 2 x 2 = 8 nodes with a commuter for data transfer. The communication rates on-node and inter-node are 50 Mbytes/sec and 12.5 Mbytes/sec, respectively. Each board has a controller which takes care of program flow control, address generation and memory control.
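The statement that 50 Mbytes/sec corresponds to one word per 4 floating point operations follows from the figures above. The sketch below assumes a 4-byte word (the MAD chip is 32-bit) and takes the node peak as the full-machine peak divided by the number of nodes; the variable names are illustrative.

```python
NODES = 2048
PEAK_GFLOPS = 100.0
WORD_BYTES = 4                      # assumption: one word = 32 bits
MEM_BW_MBYTES_PER_S = 50.0          # MAD <-> memory
INTERNODE_BW_MBYTES_PER_S = 12.5    # between nodes

node_peak_mflops = PEAK_GFLOPS * 1000.0 / NODES       # about 48.8 Mflops
mem_bw_mwords = MEM_BW_MBYTES_PER_S / WORD_BYTES      # 12.5 Mwords/sec
net_bw_mwords = INTERNODE_BW_MBYTES_PER_S / WORD_BYTES

# Floating point operations that can be issued per word moved.
flops_per_memory_word = node_peak_mflops / mem_bw_mwords
flops_per_network_word = node_peak_mflops / net_bw_mwords

print(f"node peak: {node_peak_mflops:.1f} Mflops")
print(f"{flops_per_memory_word:.1f} flops per word from memory")
print(f"{flops_per_network_word:.1f} flops per word from a neighbouring node")
```

The memory ratio comes out at about 3.9, i.e. the "one word per 4 operations" quoted in the text; the inter-node ratio is four times larger.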
The 6 Gflops version of APE100, which is called TUBE, is running and producing physics results. A TUBE is composed of 128 nodes making a 32 x 2 x 2 torus with periodic boundary conditions. The naming originates from its topological shape. The memory size is 512 Mbytes. The sustained speed of a TUBE for the link update is about 1.5 microsecond/link with the Metropolis algorithm with 5 hits. The time for multiplication of the Wilson operator is 0.8 microsecond per site. These rates roughly correspond to 2.5 Gflops to 3 Gflops, which represents 40-50% of the peak speed. These figures show good efficiency.
The physics subjects being studied on TUBE are hadron spectrum and heavy quark physics, the results of which have been reported at this conference.
A Tower which consists of 4 TUBEs with a peak speed of 25 Gflops is being assembled now and should be working in the late fall of 1993. The full machine which is composed of 4 Towers with a peak speed of 100 Gflops is expected to be completed by the first quarter of 1994.
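The TUBE, Tower and full-machine figures are mutually consistent, as the following sketch shows. It uses only numbers quoted above; the per-node peak is again taken as 100 Gflops / 2048, which is my reading rather than a quoted specification.

```python
NODES_FULL = 2048
PEAK_FULL_GFLOPS = 100.0
NODE_MEMORY_MBYTES = 4

node_peak_gflops = PEAK_FULL_GFLOPS / NODES_FULL

tube_nodes = 32 * 2 * 2                              # 128 nodes in a 32 x 2 x 2 torus
tube_peak_gflops = tube_nodes * node_peak_gflops     # the "6 Gflops version"
tube_memory_mbytes = tube_nodes * NODE_MEMORY_MBYTES

tower_peak_gflops = 4 * tube_peak_gflops             # 4 TUBEs per Tower
full_peak_gflops = 4 * tower_peak_gflops             # 4 Towers in the full machine

# Sustained 2.5-3 Gflops on a TUBE, as a fraction of its peak.
efficiency_low = 2.5 / tube_peak_gflops
efficiency_high = 3.0 / tube_peak_gflops

print(f"TUBE: {tube_peak_gflops} Gflops, {tube_memory_mbytes} Mbytes")
print(f"Tower: {tower_peak_gflops} Gflops, full machine: {full_peak_gflops} Gflops")
print(f"efficiency: {efficiency_low:.0%} to {efficiency_high:.0%}")
```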
ACPMAPS Upgraded
This is an upgrade of the ACPMAPS replacing the processor boards without changing the communication backbone [3]. The ACPMAPS is a MIMD machine with distributed memory. On each node there are two Intel i860 microprocessors with a peak speed of 80 Mflops. The memory size is 32 Mbytes of DRAM for each node. The full machine consists of 612 i860 with a peak speed of 50 Gflops and has 20 Gbytes of memory.
The network has a cluster structure: one crate consists of 16 boards with a 16-way crossbar. A board can be either a processor node or a Bus Switch Interface board. The 16-way crossbars are connected in a complicated way which makes a hyper-cube and other extra connections. The throughput between nodes is 20 Mbytes/sec.
ACPMAPS has a strong distributed I/O system: there are 32 Exabyte tape drives and 20 Gbytes of disk space. This mass I/O subsystem is one of the characteristics of ACPMAPS.
The software package CANOPY, which has been well described several times [14,3], is a very powerful tool for distributing physical variables to nodes without knowing the details of the hardware.
The ACPMAPS is running and doing calculations of the quenched hadron spectrum and heavy quark physics, the results of which have been reported at this conference.
The sustained speeds measured on a 32^3 x 48 lattice are as follows. One link update by a heat-bath method takes 0.64 microsecond per link. One cycle of conjugate gradient inversion of the Wilson operator by the red-black method takes about 0.64 microsecond per site. The L inversion together with the U back-inversion in the ILUMR method takes 2.23 microseconds per site. These figures for the sustained speeds are about 10-20% of the peak speed. Therefore the efficiency is not so good compared to TUBE. However, there are several good characteristics: it supports both 64 and 32 bit arithmetic operations, the network is very flexible, and the distributed I/O system is convenient for users.
Projects under way and proposed
The three projects of the third group in Table 1, the Teraflops project, the CP-PACS project and the 0.5-Teraflops project, are well under way. The basic design targets are listed in Table 4.
Teraflops project
The Teraflops project [4] has changed significantly since last year. The new plan (Multi- ...) The full machine consists of 512 nodes with a peak speed of at least 1.6 Tflops with 64 bit arithmetic. The sustained speed is expected to be more than 1 Tflops. A preliminary estimate for the cost of the full machine is $20-25M. This project is a collaboration of the QCD Teraflops Collaboration [15], the MIT Laboratory for Computer Science, Lincoln Laboratory and TMC. Funding for the project began in the fall of 1992 with startup funds provided by MIT. The proposal for the whole project will be submitted to NSF, DOE and ARPA this fall. The tentative schedule is to build a prototype node in 1994, a prototype system in 1995 and have the full system in operation in 1996.
CP-PACS project
The architecture is MIMD with a 3-dimensional hyper crossbar which will be explained later. The target of the peak speed is currently at least 300 Gflops with 64 bit arithmetic. We are making a proposal for additional funds to increase this peak speed. The memory size is planned to be more than 48 Gbytes.
The processor is based on a Hewlett-Packard PA-RISC processor. This is a super-scalar processor which can perform two operations concurrently. We enhance the processor to support efficient vector calculations. The peak speed of one processor is 200-300 Mflops. The enhancement will be described in detail later. For memory we use synchronous DRAM, pipelined by multi-interleaved banks and a storage controller. The memory bandwidth is one word per machine cycle.
Now let me explain the vector enhancement of the processor. As is well known, the high performance of usual RISC processors like those of Intel, IBM, HP and DEC depends heavily on the existence of cache. However, when the data size exceeds the cache size, the effectiveness of cache decreases. Figure 5 shows a typical example of the performance of a RISC processor. When the data size exceeds the size of the on-chip level-1 cache, the performance drops by about 50%. Furthermore, when it exceeds the size of the level-2 cache, the performance is of order 15% of the theoretical peak speed. This feature is very common to cache-based RISC processors.
To overcome this difficulty, our strategy is to increase the number of floating-point registers without serious changes in the instruction set architecture. This means upward compatibility. However, this is not straightforward because the register fields for instructions are limited: the number of registers is usually limited to 32. To resolve this problem we introduce slide windows as well as preload and poststore instructions [17]. We also pipeline the memory. Because of these features we are able to hide long memory access latency and perform vector calculations efficiently.
Figure 6 is a schematic illustration of how slide-windowed registers work. Arithmetic instructions use the registers in the active window which has 32 registers. The preload instruction can load data into registers of the next (or next-to-next) window and the poststore instruction stores data from registers of the previous window. The pitch for the window slide can be chosen by software. Due to the preload and poststore instructions we ... In the <Slide-Window> case, the number of slide-windowed floating-point registers is assumed to be 64. Except for #14 of the Livermore Fortran Kernels, the performance with slide windows is almost equal to that of the perfect cache case and it is about 6 times higher than the original one.
Figure 8 shows the efficiency of performance for the case of multiplication of the Wilson matrix. The dashed line corresponds to the efficiency in the case of code optimized by hand without considering memory bank conflicts. The solid line is the result of a simulation for the realistic case where the effect of memory bank conflicts and the buffer size effect are taken into account. This shows that if the number of registers is larger than 100 the efficiency is more than 75%. We will develop a compiler for the enhanced RISC processor, which will produce optimized code for the slide-window architecture.
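The slide-window mechanism described above can be illustrated with a toy software model. This is only a sketch under my own assumptions: the physical registers are treated as a ring, the class and method names are hypothetical, the pitch is fixed at construction, and memory latency is not modelled at all. It is not the CP-PACS instruction set, only a picture of how a 32-register logical window can address a larger physical register file.

```python
class SlideWindowRegisters:
    """Toy model: a ring of physical registers seen through a sliding window."""

    def __init__(self, n_physical=64, window_size=32, pitch=1):
        self.regs = [0.0] * n_physical   # physical floating-point registers
        self.window_size = window_size   # registers addressable by an instruction
        self.pitch = pitch               # how far the window moves per slide
        self.base = 0                    # physical index of logical register 0

    def _index(self, logical, window_offset=0):
        if not 0 <= logical < self.window_size:
            raise IndexError("logical register outside the window")
        return (self.base + window_offset * self.pitch + logical) % len(self.regs)

    def read(self, logical):
        return self.regs[self._index(logical)]

    def write(self, logical, value):
        self.regs[self._index(logical)] = value

    def preload(self, logical, value):
        """Load from memory into the NEXT window (ahead of the computation)."""
        self.regs[self._index(logical, +1)] = value

    def poststore(self, logical):
        """Store to memory from the PREVIOUS window (behind the computation)."""
        return self.regs[self._index(logical, -1)]

    def slide(self):
        self.base = (self.base + self.pitch) % len(self.regs)


def scale(xs, a):
    """Compute a * x for every x, streaming operands through the window."""
    rf = SlideWindowRegisters(n_physical=64, window_size=32, pitch=1)
    out = []
    if not xs:
        return out
    rf.write(0, xs[0])                    # prime the first operand
    for i in range(len(xs)):
        if i + 1 < len(xs):
            rf.preload(0, xs[i + 1])      # fetch the next operand early
        rf.write(0, a * rf.read(0))       # arithmetic uses the active window
        rf.slide()                        # the preloaded operand becomes active
        out.append(rf.poststore(0))       # result is stored from the previous window
    return out


print(scale([1.0, 2.0, 3.0], 2.0))
```

In this model the arithmetic instruction always names logical register 0, while the operand for the next iteration is already being loaded and the previous result is being stored, which is the overlap that hides memory latency in the real design.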
On each processing unit (PU), we place one enhanced PA-RISC processor, local storage (DRAM) and a storage controller (see Fig. 9). NIA stands for Network Interface Adapter and EX for exchanger. On an IO unit (IOU), in addition to the components on PU, we place an IO bus to which disks are connected through IO adapters.
The network is a 3-dimensional hyper crossbar as shown in Fig. 10. It consists of x-direction crossbars as well as y and z direction crossbars. This hyper crossbar network is very flexible: from any node to another node data can be transferred through at most three switches. The data transfer is made by message passing with wormhole routing. The latency is expected to be of order of a few microseconds. A block-strided transfer is supported. We also have a global synchronization in addition to the hyper crossbar network.
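The "at most three switches" property follows directly from the structure of the network: a crossbar along one direction can change one coordinate to any value in a single step. The sketch below demonstrates this on a small grid. The x-then-y-then-z ordering of the route and the 4 x 4 x 4 size are my assumptions for illustration, not a statement about the CP-PACS routing hardware.

```python
from itertools import product

def crossbar_hops(src, dst):
    """Crossbar switches traversed: one for every coordinate that differs."""
    return sum(1 for s, d in zip(src, dst) if s != d)

def route(src, dst):
    """Nodes visited when coordinates are corrected in x, y, z order (assumed)."""
    path = [tuple(src)]
    current = list(src)
    for axis in range(3):
        if current[axis] != dst[axis]:
            current[axis] = dst[axis]     # one crossbar reaches any node on the line
            path.append(tuple(current))
    return path

shape = (4, 4, 4)                          # illustrative machine size
nodes = list(product(*(range(n) for n in shape)))
max_hops = max(crossbar_hops(a, b) for a in nodes for b in nodes)

example_path = route((0, 0, 0), (3, 2, 1))
print("maximum number of switches:", max_hops)
print("example route:", example_path)
```

By contrast, on a 3-dimensional torus of the same size the number of hops grows with the linear extent of the machine.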
The system configuration of the CP-PACS with distributed disks is depicted in Fig. 10. The disk space is more than 500 Gbytes in total. We use RAID5 which has extra parity bits. In general, when the number of disks is large as in this case, the MTBF (mean time between failures) becomes of order of one month. With RAID there is no such problem, however. The number of nodes, not fixed yet, is from 1000 to 1500.
The host is a mainframe computer with modifications for massive data transfer between the CP-PACS and the external disk storage.
A prototype with the PA-RISC without enhancement, which will be used mainly for tests of network hardware, will be completed in early 1994 and the full scale machine with the newly developed processor is scheduled to be completed by spring 1996.
The project is being carried out in collaboration with Hitachi Ltd. A new center called "Center for Computational Physics" was established at University of Tsukuba for the development of CP-PACS. A new building for the center, where the new machine will be installed, was completed in the summer of 1993. The fund for the development of CP-PACS is about $14M.
0.5-Teraflops project
The node architecture is depicted in Fig. 11. The processor is a DSP (Digital Signal Processor) by Texas Instruments. A 32 bit addition and multiplication can be performed concurrently with a 40 ns machine cycle. This leads to 50 Mflops for each node. It executes one word read per machine cycle and one word write per two machine cycles. The DSP has 2 Kwords of memory on chip. The size is small (3.0 cm^2), the power consumption very low (less than 1 Watt) and the price is less than $50.
Each node has 2 Mbytes of DRAM. The maximum bandwidth between the processor and the memory is 25 Mwords/sec. The memory size is 32 Gbytes in total.
The node gate array (NGA) which is shown in Fig. 12 is to be newly developed. The design has been partly finished. It plays the roles of memory manager, network switch and specialized cache as a buffer. The buffer size is chosen in such a way that multiplications of 3 x 3 matrices on 3-vectors can be done efficiently.
The 4-dimensional network is connected by eight bi-directional lines of NGA. Because the data transfer is made by handshaking, the latency is not low. To hide this latency, there is a mode called "store and pass through". In the calculation of the inner product of two vectors which appears in the conjugate gradient method, the data transfer which takes 70% of the total time without this mode reduces to 28% with this mode. It supports a block-strided transfer.
The mechanical design of a mother board is shown in Fig. 13. On the mother board there are 2 x 2 x 4 x 4 = 64 daughter boards with the last 4 making a loop. Each node has a SCSI port to which peripheral tape and disk drives are connected. One of the 256 boards of the full machine is connected to the host. The disk space is 48 Gbytes in total. The data transfer from disk to tape or vice versa can be done concurrently with physics calculations.
Figure 11. Schematic diagram of one node of the 0.5-Teraflops machine.
The power consumption is expected to be about 50 kW, which is very low compared with other projects. The test board will be completed by summer 1994 and the full machine by summer 1995. Funding for a 128-node machine with a peak speed of 6.4 Gflops is provided by DOE. The proposal for the full machine will be submitted in spring 1994.
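The figures quoted for this machine fit together, as the following sketch shows. The total node count is inferred from 256 boards of 64 nodes each, and 1 Gbyte is taken as 1024 Mbytes; both are my reading of the numbers above rather than quoted specifications.

```python
CYCLE_NS = 40.0
FLOPS_PER_CYCLE = 2                 # one addition and one multiplication per cycle
NODE_MEMORY_MBYTES = 2
MEM_BW_MWORDS_PER_S = 25.0

node_peak_mflops = FLOPS_PER_CYCLE * 1000.0 / CYCLE_NS        # 50 Mflops

nodes_per_board = 2 * 2 * 4 * 4                               # 64 daughter boards
boards = 256
total_nodes = nodes_per_board * boards                        # inferred, see lead-in

full_peak_gflops = total_nodes * node_peak_mflops / 1000.0
total_memory_gbytes = total_nodes * NODE_MEMORY_MBYTES / 1024.0
funded_peak_gflops = 128 * node_peak_mflops / 1000.0          # the 128-node machine

# Memory words delivered per floating point operation at peak.
words_per_flop = MEM_BW_MWORDS_PER_S / node_peak_mflops

print(f"node peak: {node_peak_mflops} Mflops, {words_per_flop} words/flop")
print(f"{total_nodes} nodes: {full_peak_gflops} Gflops, {total_memory_gbytes} Gbytes")
print(f"128-node machine: {funded_peak_gflops} Gflops")
```

The full machine therefore has a peak of about 0.8 Tflops, and its memory bandwidth of one word per two floating point operations is the ratio the talk later takes as appropriate for QCD.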
APE1000
This is a successor of APE100 with a peak speed of 1 Tflops with 64 bit arithmetic [8]. The project will start by the end of 1994.
Commercial computers
I list the characteristics of the most powerful commercial computers in Table 5 and describe in some detail the two new ones below. For other computers I refer to the earlier reviews [1].
VPP500
This is the latest machine from Fujitsu. Each node is a vector processor with the same architecture as VP400 with a peak speed of 1.6 Gflops. Because of this, it is called a vector-parallel machine by Fujitsu. One node is a multi-chip module which consists of 121 LSIs, a part of which is composed of GaAs. Each node has 128 Kbytes of vector registers and 2 Kbytes of mask registers. The memory size is 256 Mbytes/node. The network is a complete crossbar connecting all nodes, which is very powerful for any application. The bandwidth for data transfer is 400 Mbytes/sec for each direction. The OS is UNIX and the language is Fortran plus directives for parallel procedures.
The maximum number of nodes is 222 with a peak speed of 355 Gflops. The power consumption is 6 kW/node. The power needed for the full machine is more than 1 MW.
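The peak speed and the power requirement of the maximum configuration follow from the per-node figures. The sketch below only multiplies the quoted numbers; the variable names are illustrative.

```python
MAX_NODES = 222
NODE_PEAK_GFLOPS = 1.6
NODE_POWER_KW = 6.0

full_peak_gflops = MAX_NODES * NODE_PEAK_GFLOPS      # about 355 Gflops
full_power_mw = MAX_NODES * NODE_POWER_KW / 1000.0   # about 1.3 MW

# Power needed per Gflops of peak speed, a rough figure of merit.
kw_per_gflops = NODE_POWER_KW / NODE_PEAK_GFLOPS

print(f"peak: {full_peak_gflops:.1f} Gflops")
print(f"power: {full_power_mw:.2f} MW ({kw_per_gflops:.2f} kW per Gflops)")
```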
A small VPP500 with 4 processors is scheduled to be installed at Aachen this December. Another one with 7 processors will be installed at the Institute of Space and Astronomical Laboratory of Japan next January.
T3D
This is the machine just announced by CRAY. The node processor is the DEC Alpha chip, which is one of the most powerful RISC chips on the market. The clock cycle is 6.7 ns and the peak speed of the chip is 150 Mflops. The memory size is 16 Mbytes for one node with 4 Mbit DRAM at present. It will be upgraded soon to 64 Mbytes with 16 Mbit DRAM. The memory is globally shared and physically distributed.
The network is a 3-dimensional torus. The bandwidth for data transfer is 300MB/sec for each direction. The latency of the communication is very low, less than 1 microsecond for hardware overhead.
It is a MIMD machine with a maximum peak speed of 300 Gflops when it is composed of 2048 nodes: the maximum number of nodes, which is 1024 at present, will be increased to 2048 soon.
The OS is Mach and the language is Cray Research Adaptive Fortran.
A machine with 32 nodes has already been installed at Pittsburgh Supercomputing Center. It will be upgraded to 512 nodes next spring.
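The chip and machine peak speeds quoted for the T3D can be related to the clock cycle as follows. That the chip delivers one floating point result per clock is my inference from the quoted 6.7 ns and 150 Mflops, and is marked as an assumption in the code.

```python
CLOCK_NS = 6.7
FLOPS_PER_CLOCK = 1                 # assumption inferred from the quoted figures
QUOTED_CHIP_PEAK_MFLOPS = 150.0

clock_mhz = 1000.0 / CLOCK_NS                      # about 149 MHz
chip_peak_mflops = clock_mhz * FLOPS_PER_CLOCK     # about 150 Mflops

def machine_peak_gflops(nodes: int, chip_mflops: float = QUOTED_CHIP_PEAK_MFLOPS) -> float:
    """Peak speed of a machine built from the given number of nodes."""
    return nodes * chip_mflops / 1000.0

peak_1024 = machine_peak_gflops(1024)   # present maximum configuration
peak_2048 = machine_peak_gflops(2048)   # planned maximum, "300 Gflops" in the text

print(f"clock: {clock_mhz:.1f} MHz, chip peak: {chip_peak_mflops:.1f} Mflops")
print(f"1024 nodes: {peak_1024:.1f} Gflops, 2048 nodes: {peak_2048:.1f} Gflops")
```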
Sustained speed of commercial parallel computers
The MILC collaboration has been running QCD codes on a number of commercial computers including the nCUBE2, the Intel iPSC/860, the Intel Paragon and the TMC CM5. They have results of benchmarks for the conjugate gradient matrix inversion with staggered quarks on these parallel computers [19]. The performances of the benchmarks are plotted in Figs. 14 and 15 in terms of Mflops/node and the ratio of the sustained speed to the theoretical peak speed, respectively. It should be noted that the benchmarks quoted for the CM5 and the Paragon are preliminary. In particular, the communication speed of the Paragon is expected to improve significantly as the operating system is upgraded.
The nCUBE2 is very stable and has nice software. Because the nCUBE2 is slow, it is not suitable for large QCD simulations, but it is convenient for software development. When the code is written in C, the efficiency is very low for the iPSC/860 and CM5 as is seen in the figures. Only when the codes are written in assembly language does the efficiency reach around 30%. A similar efficiency has also been reported at this conference by Rajan Gupta [20] for Wilson quarks in the case of the CM5.
Toward Teraflops computers
Three strategies
Roughly speaking, there are three strategies to get a 1 Tflops machine as shown in Table 6. The first is a vector-parallel approach taken by VPP500: 2 Gflops x 500 nodes = 1 Tflops. The second is the approach taken by T3D and CP-PACS, that is, to use the most advanced RISC processor with an enhanced mechanism for high throughput between memory and processor: 200-400 Mflops x 2500-5000 nodes = 1 Tflops. The approach taken by the Teraflops project is in between the first and the second in the sense that the peak speed of one FPU is 200-300 Mflops and that of one node is more than 1.6 Gflops. The third approach is to use well-established technology, as taken by CM5, Paragon, nCUBE and the 0.5-Teraflops project: 50-100 Mflops x 10,000-20,000 nodes = 1 Tflops.
Table 6. Towards 1 Teraflops machines
Speed of CPU (Mflops)   #CPU            type
2000                    500             VPP500
                                        Teraflops
200-400                 2,500-5,000     T3D, CP-PACS
50-100                  10,000-20,000   0.5-Teraflops, CM5, nCUBE, Paragon
In the first approach, the power consumption and the size will become problematical, although the number of nodes is small. In the second approach, the sustained speed of each node for arithmetic operations and that of the data transfer between nodes will be the key issue. In the third approach the packaging of the whole system and the reliability will be crucial. In spite of these potential obstacles, I believe that the rapid progress of technologies will enable all three approaches to reach 1 Tflops of theoretical peak speed in a few years. We should note, however, that achieving a high sustained speed with massively parallel computers and having flexibility for applications require additional considerations on the balance of speed of various components and other architectural issues. Let us make brief comments on these points.
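The node counts quoted for the three strategies are simply the 1 Tflops target divided by the peak speed of one node. A minimal sketch of this arithmetic, with an illustrative function name and labels of my own:

```python
import math

def nodes_needed(node_peak_mflops: float, target_tflops: float = 1.0) -> int:
    """Smallest number of nodes whose combined peak speed reaches the target."""
    return math.ceil(target_tflops * 1_000_000 / node_peak_mflops)

# (label, lower and upper node peak speed in Mflops) for the three strategies
strategies = [
    ("vector-parallel (VPP500)", 2000, 2000),
    ("advanced RISC (T3D, CP-PACS)", 200, 400),
    ("well-established technology (0.5-Teraflops, CM5, nCUBE, Paragon)", 50, 100),
]

for label, slow, fast in strategies:
    most, fewest = nodes_needed(slow), nodes_needed(fast)
    if most == fewest:
        print(f"{label}: {most} nodes")
    else:
        print(f"{label}: {fewest} to {most} nodes")
```

The factor of 40 between the node counts of the first and third strategies is what shifts the difficulty from power consumption per node to packaging and reliability of the whole system.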
Balance of speed
In Fig. 16 the memory-processor bandwidth, the inter-node communication bandwidth, and the memory size are compared against the processor speed for the computers we reviewed in some detail. The processor speed is normalized to unity, and other normalizations are chosen for the following reason. For QCD calculation it is probably appropriate that the bandwidth between CPU and the memory is one word for two floating point operations. It will also be enough that the bandwidth for inter-node communication is 0.1 words for one floating point operation. For the memory size, the normalization is arbitrary, and I chose 0.025 words of memory size for 1 flop/sec. We see that each machine has its own characteristics. Securing a high bandwidth between memory and processor and between nodes, sufficient to keep up with the processor speed, is one of the crucial factors for a high sustained speed. In dedicated computer projects these parameters can be tuned to specific applications (this in fact underlies the cost effectiveness of dedicated computers). For CP-PACS we have chosen the balance in such a way that it is optimized for lattice QCD. We should note, however, that the requirements on the bandwidths in lattice QCD are modest compared to many other applications. Higher bandwidths are probably preferred for general purpose computers as realized in the case of T3D.
There are other points which do not appear in the figure such as the number of floating point registers on each processor, the structure of memory (pipelined or not) and the latency of the communication. These features are also important for the performance of a massively parallel machine. For example, the memory-processor bandwidth relative to the speed of one node is small for VPP500, but it has 8 Kbytes of registers which probably compensates for it.
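The normalization used for the balance comparison can be written down explicitly. As a worked case, the sketch below applies it to the APE100 node figures quoted in Sec. 3, assuming 4-byte words and a node peak of roughly 50 Mflops; these inputs are my reading of the text, and the output is not claimed to reproduce the points plotted in Fig. 16.

```python
# Reference ratios chosen in the text (a value of 1.0 means "balanced").
REF_MEMORY_WORDS_PER_FLOP = 0.5        # one word per two operations
REF_NETWORK_WORDS_PER_FLOP = 0.1       # 0.1 words per operation
REF_MEMORY_WORDS_PER_FLOPS = 0.025     # words of memory per flop/sec

def balance(peak_mflops, mem_bw_mwords, net_bw_mwords, memory_mwords):
    """Bandwidths and memory size relative to processor speed, normalized."""
    return {
        "memory bandwidth": mem_bw_mwords / peak_mflops / REF_MEMORY_WORDS_PER_FLOP,
        "network bandwidth": net_bw_mwords / peak_mflops / REF_NETWORK_WORDS_PER_FLOP,
        "memory size": memory_mwords / peak_mflops / REF_MEMORY_WORDS_PER_FLOPS,
    }

# APE100 node, assuming 4-byte words: 50 Mbytes/sec -> 12.5 Mwords/sec,
# 12.5 Mbytes/sec -> 3.125 Mwords/sec, 4 Mbytes -> about 1 Mword.
ape100 = balance(peak_mflops=50.0, mem_bw_mwords=12.5,
                 net_bw_mwords=3.125, memory_mwords=1.0)

for name, value in ape100.items():
    print(f"{name}: {value:.3f}")
```

Values below unity indicate a component that is slower or smaller than the QCD reference ratio, so that it may limit the sustained speed unless registers or buffering compensate.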
Other issues of architecture
SIMD or MIMD
SIMD is simple and generally sufficient for QCD calculations. However, MIMD is more flexible and can accommodate more varieties of algorithms. An interesting question is whether there are efficient algorithms for inversion of quark matrices which require a MIMD architecture. Another point is that MIMD hardware is probably simpler than SIMD for a machine with a large number of processors since the clock skew problem will become serious for SIMD.
Topology of network
The 3d torus and 4d torus networks are simple and natural for lattice QCD. However, precision measurement of observables requires finite-size analyses for which we need simulations on a number of lattice sizes. For this purpose a more flexible network is preferable.
32bit or 64bit
In many cases of lattice QCD calculations it seems that 32 bit arithmetic is sufficient. However, for example, at the global accept/reject step of the Hybrid Monte Carlo algorithm on a large lattice, 32 bit precision is not sufficient. In general 64 bit precision is needed when the algorithm involves global variables.
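The loss of precision in global quantities can be seen in a small experiment that accumulates many equal terms in 32 bit and in 64 bit arithmetic. This is only an illustration of the accumulation effect, not a Hybrid Monte Carlo code; the term value 0.1 and the number of terms are arbitrary choices of mine, and 32 bit arithmetic is emulated by rounding after every addition.

```python
import struct

def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(term: float, n: int, single: bool) -> float:
    """Sum n copies of term, optionally rounding to 32 bits after each step."""
    t = to_float32(term) if single else term
    total = 0.0
    for _ in range(n):
        total += t
        if single:
            total = to_float32(total)
    return total

N_TERMS = 1_000_000          # e.g. one contribution per lattice site
TERM = 0.1
exact = N_TERMS * TERM

sum32 = accumulate(TERM, N_TERMS, single=True)
sum64 = accumulate(TERM, N_TERMS, single=False)
rel_err32 = abs(sum32 - exact) / exact
rel_err64 = abs(sum64 - exact) / exact

print(f"32 bit sum: {sum32:.2f}  relative error {rel_err32:.1e}")
print(f"64 bit sum: {sum64:.6f}  relative error {rel_err64:.1e}")
```

In 32 bit arithmetic the running sum soon becomes so large that each new term is rounded coarsely, and the error grows to the percent level, whereas an accept/reject decision depends on a small difference between two such global sums.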
Conclusions
In this review I have surveyed the development of parallel computers and the present status of dedicated computer projects toward Teraflops of speed. In the 1980's parallel computers were in their infancy and TMC was virtually the only company in the field. At that time there was no doubt that constructing dedicated parallel computers by physicists was a beneficial project. In fact dedicated computers which resulted from these projects have produced a number of interesting and important physics results on lattice field theories. The situation has become less clear-cut in recent years due to the higher technology needed to achieve faster speed on the one hand, and the emergence of powerful general purpose parallel computers from commercial vendors on the other.
Historically projects for dedicated computers have been carried out by a small group of lattice physicists, in some cases in collaboration with experimental physicists and computer scientists, but without involvement of large commercial companies. The 0.5-Teraflops project follows this spirit. Fully utilizing well-established microprocessor technologies and design aids which have become commercially available, the project aims to complete a computer precisely tuned to lattice QCD within a short period of time and at a low cost. It is very impressive to learn that this strategy is actually possible for computers approaching a Teraflops of speed. I believe that a vital factor in realizing this approach is the experience gained with the construction of three previous computers at Columbia.
Another possible approach is to depart from the traditional style and to seek a close collaboration with large companies from the start of the project. This strategy is the one taken by the US Teraflops project and the Japanese CP-PACS project. In the computers planned in these projects the most advanced processors are to be networked together with a large bandwidth. The 0.5-micron semiconductor technology, soon to become that of 0.3 micron, and the packaging technique needed for this type of architecture cannot be handled by physicists and computer scientists alone. The cost is necessarily higher and the construction period longer. There are, however, the advantages of a more flexible architecture, reliability of hardware, and a generally better software environment, which is very important for development of application programs and data analysis.
Regardless of the approaches, I think dedicated computer projects still represent an important avenue we should pursue for acquiring the computing power needed for advancement of lattice field theory studies. Hopefully all three computers will be completed in a few years' time and produce a variety of fruitful results with some unexpected surprises.
