developed with the attention, criticism, and help of my colleagues, especially Creon Levit and John Gilbert, but I am solely responsible for these views.
My conclusion is that the connection machine is a breakthrough in machine architecture. It clearly shows the possibility of achieving a major gain in sustained performance through the use of simple but highly replicated hardware. It is also a breakthrough in that the fundamental programming model, data parallelism, is much easier to use and think about than MIMD programming systems. TMC has made several nice extensions to the simplest SIMD programming model: parallel remote reference (called pref), parallel remote store (called pset) with combining operators, segmented parallel prefix operations (called scan), and nearest-neighbor grid communication (called news). These add significant power to the languages. Nevertheless, there are a number of serious weaknesses in the current TMC implementation (the CM-2) of the ideal connection machine. After discussing these, I will give my view of the unsolved and difficult problems of software and hardware that will need to be addressed over the next decade in order for the supercomputer user community to "cash in" on the connection machine breakthrough.
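To make two of these extensions concrete, the following is a minimal sequential sketch in Python of a segmented prefix operation and of a remote store with a combining operator. The function names, argument order, and the use of an identity element are assumptions made for illustration; they are not the TMC interfaces, and on the machine these operations run in parallel rather than in a loop.

```python
import operator

def segmented_scan(values, segment_starts, op=operator.add):
    """Inclusive prefix operation that restarts at every segment boundary.

    segment_starts[i] is True when element i begins a new segment.
    """
    out = []
    running = None
    for value, starts_segment in zip(values, segment_starts):
        if starts_segment or running is None:
            running = value          # a new segment: restart the running result
        else:
            running = op(running, value)
        out.append(running)
    return out

def combining_store(size, addresses, values, op=operator.add, identity=0):
    """Element i sends values[i] to slot addresses[i].

    Values that collide on the same slot are combined with op instead of
    overwriting one another. Slots that receive nothing keep the identity.
    """
    dest = [identity] * size
    for address, value in zip(addresses, values):
        dest[address] = op(dest[address], value)
    return dest

if __name__ == "__main__":
    print(segmented_scan([1, 2, 3, 4, 5], [True, False, True, False, False]))
    print(combining_store(3, [0, 2, 0, 2, 2], [1, 2, 3, 4, 5]))
```

The combining store is what makes irregular reductions (histograms, sparse accumulations) expressible as a single data-parallel step.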
Outline
The contents of the report are these.
Thus, more algorithm parallelism is needed to use a 1-bit machine than an 8-bit machine of the same cost; iv) Local indirect addressing is very difficult in 1-bit architectures, since a multiple-bit address (16-24 bits, typically) must be read from memory to support a 1-bit read; this may be a fatal problem;
v) In 1-bit architectures it is not useful to integrate a fast scalar unit tightly with the parallel array, since the time to extract data from memory over a 1-bit data path is prohibitive. Given wider paths to memory, there can be considerable value in this kind of hybrid system.
The tension between points (i) and (ii), which favor 1-bit, and (iii)-(v), which favor wider processors, makes the 4-16 bit range a very attractive design point in today's technologies: this allows 10^4 to 10^5 processors in machines with prices in the $1-$10 million range. In future designs, the capability for local addressing of local memory should be provided, perhaps at some reduced throughput. To do this on the current CM requires storage of data in a "slicewise" manner, in which a 32-bit word is spread over the 32 processors on 2 chips that share one Sprint-Weitek combination. With better hardware implementation of floating point and indirect addressing, this distinction should disappear in the future.
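The difference between the two storage layouts can be pictured as a bit-matrix transpose. The Python sketch below is a simplified model, not a description of the actual CM-2 memory system: it assumes a group of 32 one-bit processors, each contributing one bit per memory row, and the names fieldwise_rows, slicewise_rows, and row_to_word are hypothetical.

```python
WORD_BITS = 32  # one bit per processor in a 32-processor group

def fieldwise_rows(words):
    """Each processor p keeps all bits of its own word, one bit per memory row.

    rows[r][p] is bit r of words[p], so a single row read returns one bit
    from each of many different words.
    """
    return [[(w >> r) & 1 for w in words] for r in range(WORD_BITS)]

def slicewise_rows(words):
    """Each word is spread across the processors, one bit per processor.

    rows[w][p] is bit p of words[w], so a single row read returns one
    complete 32-bit word.
    """
    return [[(w >> p) & 1 for p in range(WORD_BITS)] for w in words]

def row_to_word(row):
    """Reassemble an integer from a row of bits (bit p comes from processor p)."""
    return sum(bit << p for p, bit in enumerate(row))

if __name__ == "__main__":
    words = list(range(100, 132))           # 32 sample words
    slicewise = slicewise_rows(words)
    print(row_to_word(slicewise[3]))        # one row read recovers words[3]
```

In this model a floating point unit that consumes whole words can be fed from a slicewise layout with one row read per operand, whereas the fieldwise layout needs 32 row reads and a transposition.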
The instruction set
Recent research into instruction set architectures has shown that simple instruction sets that provide direct control of the hardware without an intervening layer of microcode are the best approach. These architectures are called RISCs (Reduced Instruction Set Computers). The instruction set of the CM-2, Paris, implements a very high level machine model. It is virtualized. It is a memory-memory architecture. It has an enormous number of instructions because it is aimed at too high a level of abstraction. It is implemented by microcode, which makes it slower.
It hides important machine features, most notably the Weitek chip registers, the latency of the paths to memory, and the details of the router. It seems to me that Paris was designed before the virtues of the RISC approach to instruction sets were well known and understood. At this point, Paris is a significant liability for TMC. Future compilers should bypass Paris. It should be supported as a high-level language only for compatibility purposes. The fact that Paris is a memory-memory instruction set is particularly unfortunate, since this wastes the most valuable machine resource, memory bandwidth. In RISCs, memory is referenced only by load and store instructions; this allows memory traffic to be scheduled to hide the latency of memory and to avoid unnecessary loads and stores. But in memory-memory architectures, in which the processor registers (and the CM-2 has many of these) aren't visible to the programmer or compiler, every temporary value is stored to and loaded from memory, and can't be loaded until the processor already is waiting for it.
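The bandwidth cost of a memory-memory instruction set can be illustrated by counting memory references for a small expression such as x = (a*b + c) * d. The Python sketch below is a toy accounting model under stated assumptions: every operation has two source operands and one result, the load-store machine has enough registers to hold all temporaries, and the function names are hypothetical.

```python
def memory_memory_traffic(ops):
    """Memory references when every operation reads both operands from
    memory and writes its result back to memory."""
    return 3 * len(ops)

def load_store_traffic(ops, outputs):
    """Memory references when operands are loaded once into registers,
    temporaries stay in registers, and only final outputs are stored."""
    in_registers = set()
    traffic = 0
    for dest, left, right in ops:
        for source in (left, right):
            if source not in in_registers:
                in_registers.add(source)   # one load brings it into a register
                traffic += 1
        in_registers.add(dest)             # the result is produced in a register
    return traffic + len(outputs)          # one store per final output

if __name__ == "__main__":
    # x = (a*b + c) * d as three two-operand operations: (dest, left, right)
    ops = [("t1", "a", "b"), ("t2", "t1", "c"), ("x", "t2", "d")]
    print(memory_memory_traffic(ops))         # 9 references
    print(load_store_traffic(ops, ["x"]))     # 5 references
```

Under this model the two temporaries account for all of the extra traffic, and the gap widens as expressions get longer.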
Because of its complexity, Fortran probably won't use more than a fraction of Paris.
Because of the deficiencies of Paris, it is common for programmers to resort to even lower level coding, microcode in some form, to get the best possible performance out of the CM hardware for their applications. 
that the number of virtual processors for which any activity is required is in general less than the number of data elements per physical processor, and can act accordingly.
Not so the CM microcode. 
The implementation

The hardware implementation
The CM-2 is implemented with gate arrays. The CM uses a front-end for two distinct jobs: running the operating system and controlling the parallel array. This makes the front-end a bottleneck. In the future these roles should be separated, and a fast control processor that is tightly integrated with the parallel array should be developed. It should be capable of fast scalar operations.
The current CM avoids the complication of a memory hierarchy entirely. The third of these levels is the most appropriate for serious scientific computing and is, fortunately, soon to be available. The introduction of the virtual processor model by TMC was a very important step in the direction of useful programmability in SIMD parallel machines; even today, it is not available on the MIMD multicomputers. The router is very expensive and should be sparingly used. On later instances of the CM this should be made less so. This will allow the applications programmers greater freedom in their choice of algorithms and data structures and will make it possible to solve problems with irregular topological structures more easily. Finally, the peculiar characteristics of the CM-2 instruction set often require that the programmer get involved with coding at an unnecessarily low level. This ought not be true in the future.
Let me summarize my thoughts on the programming environment.
• Connection machine programming is essentially no more difficult than sequential machine programming.
• Very innovative algorithms are needed on connection machines because of parallelism (Amdahl) and because the data are distributed so no processor sees more than a small part of the problem.
• Fortran 90 is a very promising approach to the programming of many but not all parallel supercomputing situations.
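The reference to Amdahl in the second point can be made quantitative. The Python sketch below evaluates Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be done in parallel and n is the number of processors. The function name and the sample values of p are chosen for illustration only; 65,536 is used as the processor count of a large machine.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup when only parallel_fraction of the work
    can be spread over the given number of processors."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

if __name__ == "__main__":
    for p in (0.9, 0.99, 0.999, 0.9999):
        print(p, round(amdahl_speedup(p, 65536), 1))
```

With tens of thousands of processors the speedup is bounded by roughly 1 / (1 - p), so even one percent of serial work caps the gain below a factor of 100. This is why algorithms with almost no sequential component are required.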
The programming tools
For numerical computation, *lisp has little to offer. The Fortran now under development is better in these ways:
• It is quite close to the standard Fortran 90; it is divorced from Paris entirely;
• It allows multiple VP sets and arrays of any size without any fuss;
• It provides the most natural syntax for arrays and iteration, Fortran's traditional strengths;
• It provides dynamic storage allocation, correcting one of Fortran's traditional weaknesses.
• Its intrinsics are useful.
The current CM Fortran needs some additional extensions.
• There is no provision for pset with combining operators;
• There are no scans, segmented or otherwise;
• There is limited control over the layout of arrays in the CM;
• Nested where constructs should be allowed.
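The last point concerns masked array assignment. The intended meaning of a nested where is that the inner mask applies only to elements already selected by the outer mask. The Python sketch below emulates this with lists; it is not Fortran syntax, and the helper names where_assign and nested_mask are hypothetical. It also shows the workaround available without nesting, which is to combine the masks by hand.

```python
def where_assign(mask, target, values):
    """Masked assignment: take values[i] where mask[i] is True,
    otherwise leave target[i] unchanged."""
    return [v if m else t for m, t, v in zip(mask, target, values)]

def nested_mask(outer, inner):
    """Effective mask of an inner where placed inside an outer where:
    an element is active only if both conditions hold."""
    return [o and i for o, i in zip(outer, inner)]

if __name__ == "__main__":
    x = [-2, -1, 0, 1, 2, 3]
    y = [0] * len(x)
    positive = [xi > 0 for xi in x]          # outer condition
    even = [xi % 2 == 0 for xi in x]         # inner condition
    # Equivalent of: where (x > 0) ... where (x is even) y = 10 * x
    y = where_assign(nested_mask(positive, even), y, [10 * xi for xi in x])
    print(y)
```

Building the combined mask explicitly works, but it costs an extra temporary array and obscures the structure of the computation, which is the reason for wanting the nested form in the language.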
Unfortunately, the birth of CM Fortran has been slow and very painful. As of today, the implementation fails to support the full language. Array valued functions, a key feature, are not implemented. There is no interactive debugger, a feature that I don't find really important.
Today's (Cray, NEC, Fujitsu, Hitachi) supercomputers are MIMD multiprocessors that share a unified memory space. A number of highly parallel derivatives of these machines are now under development. To build such scaled up versions, a new memory architecture that employs many memory modules and a processor-memory switching network is necessary. This makes coherent caching at the processor difficult, so some software controlled caching is becoming popular. Latency for access to nonlocal memory is high on these machines: 5 µsecs is typical (whereas floating point arithmetic takes a few tens of nanoseconds, at most). The one advantage of these systems is that they can attempt to support the current programming model: simultaneous multiple users, many Unix processes, the illusion of a flat memory with equal access by all processes, and automatic compiler extraction of parallelism from sequential code.
