Introduction
Work has been concentrated in three areas: system design and applications, memory design, and transmitter design. Goals for each of these areas have been determined, and work has progressed to provide detailed simulations of the OPTIMUL (for Optically Interconnected Multiprocessor) system and to demonstrate its usefulness in a number of applications. A preliminary design for the system was completed and plans for fabrication of a prototype system were developed.
Professor Kowel attended the 1988 ACM International Conference on Supercomputing in July and presented the paper entitled "OPTIMUL: An Optical Interconnect for Multiprocessor Systems", included in the Appendix. One of the major invited talks, by Carl Ledbetter, President of ETA, dealt with the challenges of obtaining a factor of 10 improvement in supercomputer performance. He showed that fundamental physical constraints as well as practical fabrication problems rule out success by traditional technological paths. He mentioned two possible paths to significant improvements: software, and hybrid electronic/optical systems. The work done during this reporting period encourages us in our belief that we have a promising solution, based on both categories.
System Design and Applications
Both an optical read and an optical read/write system have been evaluated for their potential increase in speed in a multiprocessor environment. It may turn out that the potential increase in speed is greatest in loosely coupled systems. Database applications were studied extensively as an application which can benefit greatly from the use of the optical memory interconnect proposed in OPTIMUL. The use of the OPTIMUL system in projection, sort and join operations was studied. Select and project operations lend themselves easily to simple multiprocessor operation, and the optical interconnect will allow each processor to receive its partition of the task without contention and subsequent delay. Sorting and joining operations have also been studied, and algorithms utilizing optical interconnects were developed. The results of this portion of the work have been accepted for the Eighth Annual IEEE International Phoenix Conference on Computers and Communications, to be held in Scottsdale, AZ, in March 1989. A copy of this manuscript is included in the Appendix.
In addition to the work performed in the area of relational database applications for the OPTIMUL system, the following tasks were undertaken:
1. Identification of a particular computational problem which will benefit most greatly from the OPTIMUL technology. Possible problems include pattern recognition and classification, weather prediction, or expert system tasks.
2. Algorithm development for solution of the identified problem using an optically interconnected multiprocessor system.
3. Simulation of the system using various sized memories and transfer rates, and various system configurations.
Memory Design
A preliminary design of an Optically Writeable RAM Cell (OWRC) has been completed and simulations of the device performed. The device is essentially a fast static RAM cell with the capability of being written to optically. Electronic writes may also be performed, in which case the device acts as a standard memory device. The circuit uses reverse-biased photodiodes as optical detectors; these detectors are modeled as current sources which generate current in proportion to the amount of illumination. The most important feature of this design is that the device acts as a differential detector and can determine small differences between a reference beam and the information beam. In this way the memory information may be transmitted to the receiver without full modulation of the incident beam. As discussed in this report, simulations have shown that with a modulation level of less than 1%, the memory information can be received in 10 ns.
Basic Operation
The OWRC consists of three major functional blocks: 1) the input circuitry, 2) the differential amplifier/SRAM cell, and 3) the output circuitry. It has two data inputs and requires four controlling clocks. Two of these clocks require the inverse signal to drive the P-channel devices. The complements can be generated by adding inverters to the cell, but to maintain a minimum size they have been assumed to be provided. Thus there are a total of 8 inputs (6 clock signals and 2 data). The results of circuit simulations of the devices are shown in Figures 4 and 5.
Detailed Operation
The OWRC employs differential optical inputs to maximize resolution and to minimize the constraints on the optical system which will be supplying the input signals. The input circuit is shown in Figure 1a. The optical receivers are reverse-biased diodes, each of which conducts a current proportional to its illumination. Since the diodes may be under constant illumination, the nodes IN_POS and IN_NEG will normally be at 5 volts. To operate the circuit, the nodes POS_SAMP and NEG_SAMP must first be discharged to ground so that they will start charging from the same level. This is done by asserting SHRTCLK, which drives the gates of M11 and M21 to discharge POS_SAMP and NEG_SAMP. The nodes then charge while SAMPCLK is asserted; the final voltages are determined by the illumination and the duration of SAMPCLK. With a different amount of illumination on the two diodes (representative of a one in memory), the POS_SAMP and NEG_SAMP nodes will charge at different rates and will thus be at different final voltages when SAMPCLK is negated.
Differential Amplifier / Static Ram (SRAM) Cell
The differential amplifier, as illustrated in Figure 2, consists of two cross-coupled CMOS inverters with two additional FETs to allow the application and removal of power to the two inverters. Once the sample has been taken and SAMPCLK has been negated, the difference may be evaluated by asserting the evaluation clocks EVALCLK and NOTEVAL. As can be seen in Figure 2, EVALCLK drives an N-channel FET which connects the inverters to ground, and NOTEVAL drives the complementary P-channel FET which connects the inverters to power. When power and ground are applied to the inverters, they amplify the difference between the two sample nodes (the inputs to the inverters) and settle with the higher one at 5 volts and the lower one at ground. It is important that the input circuitry supply current such that these nodes charge to a level V0 which is constrained such that (Vdd - Vthp) > V0 > (Vss + Vthn). Failure to meet this condition will result in either the P- or N-channel devices in both inverters being off when evaluation starts, causing unpredictable results. EVALCLK is held high as long as it is desired to maintain the data in the RAM cell.
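The constraint on the sampled level and the idealized outcome of the evaluation can be sketched as follows. This is an illustration only: the threshold magnitudes of 1.0 V are hypothetical, the sketch takes separate P- and N-channel threshold magnitudes (setting them equal gives a single-threshold form), and the latch is reduced to the rule that the higher node settles at the supply and the lower node at ground.

```python
def evaluation_window_ok(v0, vdd=5.0, vss=0.0, vthp=1.0, vthn=1.0):
    """True if the sampled level v0 leaves both P- and N-channel devices able to conduct.

    vthp and vthn are threshold magnitudes; the 1.0 V defaults are assumed values.
    """
    return (vss + vthn) < v0 < (vdd - vthp)


def latch_outcome(v_pos, v_neg, vdd=5.0, vss=0.0):
    """Idealized result of evaluation: higher node goes to vdd, lower node to vss."""
    if v_pos == v_neg:
        raise ValueError("no difference to amplify; outcome is undefined in this sketch")
    return (vdd, vss) if v_pos > v_neg else (vss, vdd)


print(evaluation_window_ok(2.5))   # sampled level inside the window
print(evaluation_window_ok(0.5))   # sampled level too close to ground
print(latch_outcome(2.51, 2.50))   # small difference amplified to full logic levels
```

The sketch ignores settling time, noise, and capacitance mismatch, all of which limit the resolution of a real cell.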
Output Stage
It is important that the capacitances at the nodes POS_SAMP and NEG_SAMP be closely matched, since deviation from perfect matching will degrade the resolution of the difference amplification. With this in mind, the output of the OWRC is also differential, for capacitance matching purposes. The output stage consists of a simple CMOS pass gate, which is shown in the diagram of the complete circuit (Figure 3). Data may be read out of the RAM cell any time after it has settled by asserting ENLCLK and INVENL, thus connecting the cell to the differential outputs.
Figure 4. Circuit simulations performed using SPICE which demonstrate the response of the OWRC (lower simulation) to an optically induced current of approximately 4li amperes (top simulation).
Balanced Receiver Design
During the last period of this contract, the balanced receiver was analyzed as a possible input structure for the memory. The two photodiodes act as a differencing element for the optical signals received, as shown in Figure 5. By using this as an input to the memory circuit described previously, it should be possible to reduce the complexity, and thus the real estate requirements, of the receiver. The use of a coherent receiving system was also investigated, as shown in Figure 6. The advantage of this receiving system is that it does not require polarizing filters and has better theoretical sensitivity, but it requires more sophisticated optics to create the interference.
Figure 6. Balanced receiver used as a coherent detector. In this case no polarizing filters are necessary and the differential phase shift between the two pixels is measured. This configuration has the highest theoretical sensitivity but requires vibration-free optics and a coherent source.
Optical Design
Based on the preliminary design of the optical receiving elements, an optical budget can be estimated for the system. Figure 7 illustrates the incident power on a transmitting element and the subsequent propagation and losses in a system containing 8 receiving arrays. Detailed calculations have been made based on the specifications of available CCD devices as receivers and ferroelectric liquid crystals as the modulation coating. They reveal that 1 watt of input power is sufficient to drive 64 processors from one 64Kb shared memory, assuming a 'no repeater' architecture. This optical budget can certainly be provided by a modest gas laser, or by an incandescent source. For a thin solid film coating, the estimation is more difficult. Our current azo-dye etalons provide only 0.01% modulation, compared to nearly 100% for the liquid crystal films. Of course, we expect to make far better etalons with better dyes as the work continues. With an improvement by a factor of 100, we should be able to design electronics capable of discriminating the two switched levels.
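To give a feeling for the magnitudes involved, the following sketch divides the stated input power evenly over every receiver and every bit. The even split, the interpretation of 64Kb as 64 x 1024 bits, the function names, and the example transmission factor of 0.5 are assumptions made for illustration; the detailed budget in the report accounts for the actual optical path and is not reproduced here.

```python
def power_per_bit_per_receiver(input_watts, receivers, memory_bits, transmission=1.0):
    """Optical power reaching one bit position at one receiver, assuming an even split."""
    return input_watts * transmission / (receivers * memory_bits)


def improved_modulation(current_fraction, improvement_factor):
    """Modulation depth after a given multiplicative improvement."""
    return current_fraction * improvement_factor


bits = 64 * 1024  # 64Kb taken as 65536 bits (assumption)
ideal = power_per_bit_per_receiver(1.0, 64, bits)
lossy = power_per_bit_per_receiver(1.0, 64, bits, transmission=0.5)  # assumed 50% loss
print(ideal, lossy)

# 0.01% modulation improved by a factor of 100
print(improved_modulation(0.0001, 100))
```

Under these assumptions each bit position at each receiver sees a few hundred nanowatts, and a 100-fold improvement of a 0.01% etalon gives a modulation depth of about 1%.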
Figure 7. Optical budget for the system (diagram labels: incident beam, scattering losses 50%, collimating lens).
Optical Budget
Based on the design of a multiple image system using beam-splitters (as shown in the previous quarterly report), the total modulated optical power needed was analyzed as a function of the switching current per bit, giving the total optical input power required. For the simulations shown below, the modulation was considered to be 100% efficient. For modulation efficiencies less than 100%, the required optical power will increase linearly with the decrease in modulation efficiency. As illustrated in the graph, if switching currents on the order of 10 nA are sufficient for switching the memory bits (with a bit error rate of < 10^-11), the system will require less than 10 watts of optical input power, even with 32 processors. While it is possible to obtain lasers with this much continuous power, it will be more feasible to use filtered incandescent light as a source. Filtering the light from a broadband source will provide an inexpensive yet strong (>10 W) source of light. It is interesting to note that the Fabry-Perot etalons have an effective path length only on the order of 500 µm, and thus the coherence length of the light needs to be on the order of 1 mm. Since ordinary discharge lamps have coherence lengths on the order of several mm, it should be possible to obtain a powerful yet inexpensive light source for thin film modulators.
Conclusions
During the contract, work has progressed in all areas of the program. By performing simulations at both the system and circuit level, we are able to predict the overall performance of an OPTIMUL system and to develop a preliminary design. This design allows for implementation in the immediate future using available materials such as fast liquid crystals, but will be applicable to other technologies being developed, such as polymeric electro-optical thin film materials.
Concerning the tasks which lie ahead of us, the following should be mentioned as the most crucial ones: a. ...

We will present here a radically new interconnect method which will solve these problems, and have other advantages as well:

a) The new interconnect will be usable for both fine-grained and coarse-grained types of applications. Moreover, it could be applied to build systems which are equally effective on both of these application types, with no reconfiguration time. Such systems would then also work well for "medium-grained" applications, thus recognizing that the fine-grained and coarse-grained concepts are merely two extremal representatives in a broad range of problems having varying degrees of frequency of interprocessor communication.

b) It will solve the long-standing problems of contention for memory and for the processor-memory switch in TC systems. There will be absolutely no queueing delay for read access to shared memory.

c) In LC systems, it will enable a truly dramatic improvement in interprocessor communications bandwidth, and again totally eliminate contention for the interconnect switch.

d) Although there will still be physical limitations on the size of p, such limits should be far less constraining than those of existing systems with conventional nonoptical processor-memory interconnects.

e) Our approach should also be superior to other optical processor-memory interconnects which have been proposed, e.g. optical crossbars ... 256 has been anticipated, but even this would only allow 256 simultaneous bits to be transmitted; by contrast, under our approach, the entire contents of a chip can be transmitted simultaneously, i.e. thousands or even millions of bits can be sent in parallel. Note that this also implies that the pin-limitation problem is also eliminated.

... i.e. an entire chip can be written in one access. This optical channel is described in Section 2, and then MP system architectures utilizing it will be proposed in Section 3. Section 4 will then present some implementation details.

2. A New Optical Memory Access Channel

Consider devices D_1, ..., D_k which wish to read memory chip C, in which are stored bits B_1, ..., B_n. We will report here a technique in which the devices can read from C optically, bypassing the need for using the chip's pins, and which will allow this access to be simultaneous with respect to both devices D_i and bits B_j (Figure 1).

To achieve this, C will be coated with a thin polymeric film, using a Langmuir-Blodgett (LB) or other technique [Kowel et al, 1985; Kowel et al, 1987]. When C is illuminated, e.g. by a laser, the film will cause the reflected beam to be intensity-modulated by the electric fields at each position beneath the film in C. Thus the reflected beam will contain a complete bit map of the contents of C. The beam will be processed by optical apparatus for focusing onto the receivers D_i.

Demodulation of the beam back to storage as electric fields at the receivers is accomplished by the use of photosensitive technology. For example, one possibility is to use ordinary DRAM memories, which have a natural sensitivity to light. This means also that parts of C must be masked from the light, so that illumination of C does not change the contents of bits in C; e.g. only the output portion of a gate can be exposed. CCD or CID arrays are also possibilities for use as demodulators.

In this way, the values stored at all the bits B_j in C can be transmitted optically to the devices D_i, simultaneously over all subscripts i and j. Clearly, the simultaneity over i will have highly significant implications for the memory and interconnect contention problems which have plagued TC systems, while the simultaneity over j will have an equally profound impact on the network bandwidth limitation problem in LC systems. Note again that the classical bottleneck arising from limitations on the pins-to-stored-bits ratio is completely bypassed by the approach described here.

Writes to C can be accomplished by reversing the process.

The above two architectures are just two examples; many other configurations are possible. For example, purely LC systems can be formed, e.g. as ring networks. But again, instead of the serial interprocessor communication available in ordinary ring networks, the optical channels introduced here would provide exceedingly highly parallel communication.

Another possibility is to set up memory hierarchies. In this setting, motivated by a desire to conserve on optical apparatus, only some memory access would be optical, with the optically accessed memories serving in the role of cache front ends for much larger memories.

A large number of processors can be accommodated by introducing "fly's-eye" optics capable of imaging the shared memory contents onto a large number of processors, as depicted in Figure 4. CCD arrays are used as receiver/transmitters in that figure, but as mentioned before, DRAM or other technology is possible.

This configuration also allows for broadcast of a system clock from the shared memory, so that all processors can run in a synchronous mode if desired, although they may generate multiple phases or frequencies from the master clock for internal use.

4. Some Implementation Details

Implementation of OPTIMUL requires materials and components for the illumination, electro-optic conversion of data, and the subsequent conversion of the data back to an electrical signal. The illumination for the system is provided by a laser at a suitable wavelength and with suitable optical power, as determined by the other components and materials in the system. Appropriate optics would be used to focus the beam on the processor imaging arrays. The simplest implementation for OPTIMUL ...

Another approach would be to use a material with an electrochromic property, which would produce an intensity map of the electrodes directly through electrically induced absorption of light. The Stark effect has been used to characterize LB films [Blinov et al, 1984] and would be an alternative technique which would not require polarizers above the film. Even though such an interaction is likely to be slower than the electro-optic effect, it may be a feasible implementation since so large an amount of data is transferred simultaneously.

The deposition of these films should provide excellent topographic coverage, be physically and chemically robust, and be of very uniform thickness and optical quality.
Optical Transfer of Information from Main Memory to Local Memories
Illumination of memory chip C allows simultaneous transfer of bits B0-Bk to all receiving devices. For a memory size of 1K and a transfer time of 100 ns the data rate is greater than 10^10 bits/sec.
University of California at Davis
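The data rate quoted in the caption can be checked with a short calculation. The sketch assumes that a memory size of 1K means 1024 bits and that the whole chip is transferred in one 100 ns interval; the function name and the example receiver count are illustrative only.

```python
def parallel_data_rate(bits, transfer_time_s):
    """Bits per second delivered to one receiver when all bits move in one transfer."""
    return bits / transfer_time_s


rate = parallel_data_rate(1024, 100e-9)  # 1K bits taken as 1024 (assumption)
print(rate)                              # a little above 1e10 bits/sec

# Every receiver sees the full bit map at once, so the aggregate rate scales
# with the number of receivers (64 here is only an example value).
print(rate * 64)
```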
Introduction
Multiprocessor (MP) systems, consisting of p interconnected but independent processors, have the potential for a speedup factor of p in computational power. However, a long-standing problem has been that this potential has not been realizable, due to the overhead of processor-memory and/or processor-processor communication. This has been the case for both types of MP systems which are usually considered:
Tightly-Coupled (TC) Systems:
The very significant overhead in TC systems takes the form of contention for the shared central memory M, and for the processor-memory interconnect. The latter problem is exacerbated by the fact that the expense of full crossbar switches results in the use of other networks for which there is even more processor contention for the interconnect, e.g. f-nets [Hwang and Briggs, 1984]. For small values of p, use of caches can be effective, but the efficiency decreases with p [Wilson, 1987]. Furthermore, it has recently been discovered that access to interprocess synchronization variables in shared memory worsens this problem tremendously [Pfister and Norton, 1986].
Loosely-Coupled (LC) Systems:
In LC systems, there is no shared memory, but there is still communications overhead of another kind. The processors communicate with each other through a network. Bandwidth limitations on this interconnection network present very substantial overhead. For example, intercluster accesses in the Cm* machine were a factor of 8.7 times slower than accesses to local memory [Hwang and Briggs, 1984].
In both the TC and LC settings, another significant problem is the severe restrictions resulting from chip pin limitations. Even channels of very high bandwidth, such as those constructed from optical fibers, would not solve the problem arising from the fact that there are only a few data pins but thousands or even millions of bits in a memory chip.
The entire history of the development of MP technology has been dominated by the search for solutions to these problems [Siewiorek et al, 1982; Hwang and Briggs, 1984; Agrawal, 1986]. Essentially, no completely satisfactory solutions have been found. For example, after Cray Research, Inc. recently released the Cray X-MP, an MP version of the Cray-1 supercomputer, a number of investigations [Bailey, 1987; Cheung and Smith, 1986; Oed and Lange, 1986] quickly showed the system to suffer from slowdowns due to both contention for shared memory and contention for the network which connects the processors to that memory, just as with all the earlier MP systems.
Perhaps an even more dramatic example is the S-1, a TC MP system developed at Lawrence Livermore National Laboratories [Hwang and Briggs, 1984]. Throughout the period of development of this system, it was hailed as one of the most advanced MP projects in existence. However, the project was recently discontinued, in spite of all the favorable publicity and the very extensive funds expended [Bruner, 1987]. One of the primary reasons given for the discontinuation was that the project engineers had found that the contention for shared memory in the system would be much greater than they had anticipated. They are now beginning work on a completely new design. Such problems have been considered extremely difficult to solve, with some authors even going so far as to say that we possibly should resign ourselves to the problems not being solved, concentrating on software methods instead [Ledbetter, 1988].
However, in [Matloff, Kowel and Eldering, 1988] a radically new interconnect method was presented which will solve these problems, and have other advantages as well. The new interconnect will be usable for both fine-grained and coarse-grained types of applications; it will solve the long-standing problems of contention for memory and for the processor/memory switch in TC systems; in LC systems, it will enable a truly dramatic improvement in interprocessor communications bandwidth, and again totally eliminate contention for the interconnect switch; our approach should also be superior to other optical processor/memory interconnects which have been proposed, e.g. optical crossbars [Bell, 1986; Hutcheson et al, 1987]; and the pin-limitation problem, which remains a problem even in those architectures which have been proposed based on an optical fiber interconnect, is also eliminated.
Our name for this new interconnect is OPTIMUL, an acronym for Optical Multiprocessor Interconnect. The central feature is an optical processor-memory channel, which will allow simultaneous access of a memory chip, where the word "simultaneous" is meant both with respect to all bits in the chip, and with respect to all processors. In other words, all processors can simultaneously read the entire contents of a chip, with no interference at all. Write access is of course restricted to a single processor at a time, but it still is simultaneous across bits in the chip, i.e. an entire chip can be written in one access. This optical channel is described in Section 2, and then MP system architectures utilizing it will be proposed in Section 3. Sections 4 and 5 will present some performance analyses of these architectures.
A New Optical Memory Access Channel
Consider devices D_1, ..., D_k which wish to read a memory chip C, in which are stored bits B_1, ..., B_n. We will report here a technique in which the devices can read from C optically, bypassing the need for using the chip's pins, and which will allow this access to be simultaneous with respect to both devices D_i and bits B_j (Figure 1). To achieve this, C will be illuminated, and mechanisms will be used to cause the reflected light beam to be intensity modulated by the electric fields at each position in C. Thus the reflected beam will contain a complete bit map of the contents of C. The beam will be demodulated by optical apparatus for focusing onto the receivers D_i.
Some preliminary implementation details were given in [Matloff et al, 1988]. An update is given in the following. To achieve the desired modulation effect, we are pursuing two strategies, one based on advanced ferroelectric liquid crystals, and the other using thin solid film structures containing highly nonlinear dyes. Either material would be used to coat the surface of the chip C above (or a group of such chips).
The fields on the surface of typical ICs are of magnitude on the order of volts/µm, larger than the fields supplied by the electrodes in a typical liquid crystal display. This fact led to the demonstration of an electro-optical method for testing integrated circuits [Burns, 1979]. Problems such as long switching times (~10 ms) have recently been resolved, with switching times on the order of 100 ns, and even faster operation appears to be possible [Johnson et al, 1987].
We also have been examining the feasibility of using thin solid organic films as the coating material to be used to effect the light modulation. Such materials appear promising, and would offer a tradeoff of higher speed for lower image contrast [Kowel et al, 1987; Kowel, 1985]. We are investigating synthesis and deposition techniques, and are collecting electro-optical measurements to evaluate the potential of these films.
Demodulation of the beam back to storage as electric fields at the receivers is to be accomplished by the use of photosensitive technology. For example, one possibility is to use ordinary DRAM memories, which have a natural sensitivity to light. This means also that parts of C must be masked from the light, so that illumination of C does not change the contents of bits in C; e.g. only the output portion of a gate can be exposed.
However, in commercially produced chips, the photosensitivity of DRAM's may not be uniform enough for reliable use as demodulators, since the sensitivity is a byproduct, not a primary specification. Thus, we are taking other approaches instead, based on photodiodes. We have designed and simulated such a receiving device [Loving and Eldering, 1988]. In fact, other such memories have been proposed [Kosnocky, 1971; Ullman et al].
In this way, the values stored at all the bits B_j in C can be transmitted optically to the devices D_i, simultaneously over all subscripts i and j. Clearly, the simultaneity over i will have highly significant implications for the memory and interconnect contention problems which have plagued TC systems, while the simultaneity over j will have an equally profound impact on the network bandwidth limitation problem in LC systems. Note that the classical bottleneck arising from limitations on the pins-to-stored-bits ratio is completely bypassed in the approach described here. Writes to C can be accomplished by reversing the process.
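The pins-to-stored-bits bottleneck can be made concrete with a small comparison. The sketch counts the transfer cycles needed to move an entire chip through a limited number of data pins and compares it with the single access assumed for the optical channel. The chip size of 65536 bits and the 8 data pins are hypothetical example values, and the function names are illustrative.

```python
import math


def pin_limited_cycles(stored_bits, data_pins):
    """Transfer cycles needed to read every stored bit through the chip's data pins."""
    return math.ceil(stored_bits / data_pins)


def optical_cycles(stored_bits):
    """The optical channel is assumed to deliver the whole bit map in one access."""
    return 1


bits, pins = 65536, 8  # hypothetical chip size and pin count
print(pin_limited_cycles(bits, pins), optical_cycles(bits))
```

The ratio between the two counts grows with the chip size, which is why a faster conventional channel alone does not remove the bottleneck.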
System Architectures
The optical interconnect presented here can be used in a variety of configurations. Two of these were described in [Matloff, Kowel and Eldering, 1988], and will be summarized here:
Architecture I:
This configuration features optical memory reads, but uses electronic writes, the latter being motivated by a desire for simplicity in the first prototype to be constructed, and by the fact that the electronic bus, with a standard Test-and-Set cycle or similar mechanism, avoids the interprocess synchronization problem which must be solved in a purely optical system.
Architecture II:
This configuration features both optical reads and writes. It is intensive in the quantity of memory needed, with essentially separate memory modules being used for reads and writes. Interprocess synchronization is handled by message-passing techniques [Peterson and Silberschatz, 1985], implementations of which were given in examples in [Matloff, Kowel and Eldering, 1988].
Other Architectures:
The above two architectures are just two examples; many other configurations are possible. For example, purely LC systems can be formed, e.g. as tree or ring networks (see Section 5). But again, instead of the serial interprocessor communication available in ordinary ring networks, the optical channels introduced here would provide exceedingly highly parallel communication. This is currently being investigated [Matloff and Schubert, 1988]. Another possibility is to set up memory hierarchies. In this setting, motivated by a desire to conserve on optical apparatus, only some memory access would be optical, with the optically accessed memories serving in the role of cache front ends for much larger memories.
Performance Analysis: Simulation of a Continuum of Systems with Varying Degrees of Coupling
Numerous mathematical analyses of multiple access of memory systems have been presented (a nice collection of references appears in the introduction to Chapter 6 of [Agrawal, 1986]). However, for the present purpose, a simulation analysis was preferred, in the interests of (a) simplicity, and (b) modeling the ability of a processor in OPTIMUL to do a parallel access of a large data structure.
Specifically, we set up the following model, which can be considered as an abstraction representative of a number of architectures which could be developed using the optical interconnect introduced in [Matloff, Kowel and Eldering, 1988]. We will refer to the abstracted system here by the same name, OPTIMUL.
In this system we have p processors viewing a central shared memory of m modules. Consider the operation of one processor P. P will alternate between periods of memory access and nonaccess. We assume the nonaccess time (measured in units of memory cycles) has a geometric distribution with a specified mean, the mean nonaccess time.
The model then assumes that at the start of an access period, P will send to a memory controller a request for R_req consecutive words in the memory space, e.g. a request to read an entire array or subarray. R_req is assumed to have a geometric distribution with mean μ_req.
We are comparing OPTIMUL to a conventional MP system. There is extremely wide variation in "conventional" MP systems; the model cannot incorporate all of them. Instead the model has been designed so that variation of its parameters will allow modeling of a range of situations suitable for comparison to OPTIMUL; this will be seen below.
In the conventional system, it is assumed that the memory controller will satisfy the R_req word requests made by P in whatever order they become satisfiable, similar to the "C" organization [Hwang and Briggs, 1984; Kogge, 1981], with consecutive words stored in consecutive modules (mod m), i.e. using low-order interleaving. If one of the words requested by P encounters contention with a request from another processor, one of the processors must wait. If a requested module is free, it takes one unit of time to satisfy a request for one word of memory.
On the other hand, in modeling OPTIMUL, we are assuming that any request takes only one unit of time to service, for any value of R_req, i.e. OPTIMUL will access all R_req words in one time unit, due to OPTIMUL's ability to transfer the entire contents of a memory chip in parallel. (Since this ability would be most fully exploited if the M_i's (as in Architecture I) were contained within the processors, we are making such an assumption here.) On the other hand, in some ways our model is too conservative, i.e. it actually underestimates OPTIMUL's potential; this will be explained below.
The simulation actually measures the performance of our model of a conventional MP system, rather than OPTIMUL itself. The mean delay per memory access, D_C, is found for the conventional system. Under the model described here, the corresponding mean delay for OPTIMUL is exactly 1.0. Thus D_C may be used as a figure of merit for OPTIMUL, i.e. a measure of the speedup in memory access obtained.
We have noted that one of OPTIMUL's important advantages is that it can operate in both TC and LC modes. This is the motivation behind our model for the memory access of a conventional MP:
TC Mode.
We model the "typical" TC system as having a fairly small mean nonaccess time, 2.0. This reflects the fact that TC systems are appropriate for applications in which the processors must communicate with each other fairly often, and that they do so by accessing M. However, such accesses are usually for only one word, or a small number of words. To reflect this, we set μ_req to be fairly small in our simulator. To model TC systems which exist today, in which the number of processors is limited, we will set the number of processors p to be small in our TC simulations, specifically 16.
LC Mode.
Here we set p to be a larger number (64) in our simulator, reflecting the situation in many current LC systems (of course, many such systems are even larger than this). Also, since LC systems are set up for applications in which the processors communicate with each other less frequently, we have set the mean time between requests to be fairly large (100.0). However, when the processors of an LC system do communicate with each other, it tends to be with relatively large amounts of data; thus we have set the mean request size to be fairly large in our simulation, with a value of 100.0.
Both the TC and LC models include m = 16 memory modules in M.
Note that these models will severely underestimate OPTIMUL's potential, in a number of ways. For example, the TC model tacitly assumes that the processor/memory interconnect switch for the conventional MP machine is in the form of a crossbar, which is not typical in MP systems, and is actually infeasible for the larger ones. Thus the model for the conventional MP machine does not incorporate any queueing delay due to the interconnect switch; as mentioned above, such delay can be quite large, and thus this results in underestimating OPTIMUL's potential. Of course this built-in bias against OPTIMUL will be even worse in our LC model, since the interconnect queueing delay is much worse in that case; we are not allowing for network traffic delay at all in this simple analysis.
A large number of simulation runs were conducted, but instead of reporting all of them, we will concentrate on three representative examples:
Example A:
This is a TC model, with a mean request size of 1.0. This setting can be expected to give only a modest advantage to OPTIMUL over conventional machines, due to the abovementioned lack of interconnect queueing delay in our model. However, we still found that the figure of merit D_c was 1.34, i.e. even with this setting's bias against OPTIMUL, OPTIMUL has a 34% advantage.
Example B:
This too is a TC model, but with a mean request size of 10.0, representing a situation in which the processors are vector processors. This models a setting in which most memory accesses of a processor are for scalars, but occasionally a vector access is made. Here we found that D_c = 12.44, a 12-fold advantage for OPTIMUL.
Example C:
This is the LC model described above. Here OPTIMUL has a very dramatic advantage over a conventional system, with D_c = 277.35 (and, as mentioned above, this number is probably an underestimate of the true value).
In addition, one of OPTIMUL's most significant advantages is invisible in the simulation study, namely the feasibility of using a much larger number p of processors in a TC system. The limitations of crossbars (or their more sophisticated variations) on p imply that it would be infeasible to use TC systems in applications having a very high degree of inherent parallelism. The optical interconnect nature of OPTIMUL should make it much more feasible to build large TC systems, so that more highly parallel applications may be handled.
Performance Analysis: Case Study of a Sorting Application
In Section 4, we presented an analysis based on abstraction of memory access patterns in multiprocessor systems. This analysis showed the potential of OPTIMUL to be quite dramatic for some settings of the simulation parameters. However, additional understanding is gained by investigating the performance of OPTIMUL on a specific application, which is done in this section. The analysis here is basically a trace-driven simulation of the performance of our proposed system on sorting problems.
The analysis assumes that OPTIMUL's processors are of speed comparable to that of a VAX 8500. Single-processor computation times used below were obtained by using the Unix 'time' command to get processor run times for actual C code for the sort algorithm specified below.
The processors are assumed to be set up as an LC system in a ring topology. The OPTIMUL version of this system is assumed to have optical neighbor-to-neighbor links which use the technology described above, which has the capability of transferring millions of bits in hundreds of nanoseconds; interprocessor communication time is essentially negligible in this system. The non-OPTIMUL version of the system has "conventional" neighbor-to-neighbor links having transmission rates of 50 megabits/second. Links of this speed or better are beginning to appear, e.g. in the "semi-LC" VAX Cluster systems [Kronenberg, Levy and Strecker, 1988]; this rate is much faster than is typical among most LC systems to date, e.g. the Hypercube.
The sort algorithm used was Quickmerge [Quinn, 1987], which consists of three phases. During Phase I each processor sorts a subset of the array using Quicksort. These subsets must then be merged to complete the sort. Before the merge phase, Phase III, a search phase, Phase II, is added so that the merge task can be divided among all the processors. Processors search for dividers to partition each of the sorted subsets such that there is no value in partition i of any subset j which is greater than any value in partition i+1 of any subset k. During the merge phase, each processor joins together a set of partitions which share common dividers. Because of the divisions performed in Phase II, merged partition i precedes merged partition i+1.
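The three phases can be sketched sequentially in Python. This is a minimal illustration under stated assumptions, not the C code that was timed for this study: the p processors are simulated by a loop, Python's built-in sort stands in for Quicksort, and the rule for choosing dividers (evenly spaced samples from the first sorted subset) is an assumption made here for concreteness. The function name is hypothetical.

```python
import bisect
import heapq


def quickmerge(data, p):
    """Sequential sketch of a three-phase Quickmerge on p simulated processors."""
    n = len(data)

    # Phase I (sort): each processor sorts its own block of about n/p items.
    # Python's built-in sort is used here in place of Quicksort.
    bounds = [i * n // p for i in range(p + 1)]
    subsets = [sorted(data[bounds[i]:bounds[i + 1]]) for i in range(p)]

    # Phase II (search): pick p-1 dividers, then locate them in every subset.
    # Assumption: dividers are evenly spaced samples of the first sorted subset.
    first = subsets[0]
    dividers = [first[i * len(first) // p] for i in range(1, p)] if first else []
    cuts = [
        [0] + [bisect.bisect_right(s, d) for d in dividers] + [len(s)]
        for s in subsets
    ]

    # Phase III (merge): processor i merges partition i of every subset.
    # Every value in partition i is <= divider i < every value in partition i+1,
    # so concatenating the merged partitions yields the sorted array.
    merged = []
    for i in range(len(dividers) + 1):
        parts = [s[c[i]:c[i + 1]] for s, c in zip(subsets, cuts)]
        merged.extend(heapq.merge(*parts))
    return merged


if __name__ == "__main__":
    sample = [(i * 7919) % 1009 for i in range(2000)]
    assert quickmerge(sample, 8) == sorted(sample)
```

Sampling the dividers from a single subset keeps the sketch short, but it does not guarantee that the merged partitions are of equal size.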
On an LC system, the communication between phases is substantial:
(a) Before the initial (sort) phase, each processor must receive a subset of the array to sort. These subsets are sent by the lead processor, relayed from processor to processor along the ring until reaching the desired destination processor.
(b) Before the second (search) phase begins, each processor must receive the sort phase results from all other processors.
(c) Before the final (merge) phase, the partition dividers must be passed to each of the processors (note that the data is already present in each processor's private memory).
(d) Finally, the merged partitions must be returned to the lead processor for concatenation.
The entire array must be broadcast three times. As the total communications cost is dominated by this data movement, we won't consider the transfer time of the partition dividers.
Within an OPTIMUL ring configuration, memory would appear to be shared since information could be transferred continuously around the ring. As OPTIMUL allows a complete memory-memory transfer in one memory cycle, data can be transferred (broadcast) to all processors in p memory cycles, where p is the number of processors on the ring. Preliminary study suggests that we will be able to transfer the contents of one memory chip to another in less than 500 ns, and that this time can be reduced to less than 100 ns. Even given the slower speed, data could be broadcast to all members of a 64 member ring in about 32 μs (63 × 500 ns = 31,500 ns << 1 ms). This is a substantial savings over the alternatives discussed above, even ignoring the propagation delay around the ring.
Below are tables indicating approximate times for the Quickmerge algorithm were the algorithm executed on the OPTIMUL and non-OPTIMUL rings described above. The improvements look modest in comparison with that of Example C in the last section, but still are quite impressive, with speeds double and triple those of the conventional LC system. The largest improvement reported occurs for a 128 processor system sorting a 256k integer array. Here the OPTIMUL system would perform approximately three and a half times faster than a non-OPTIMUL system having the same number of processors. For larger problems and more processors, larger speedup factors might be observed. (On the other hand, it appears that additional tuning of the algorithm could be done for the non-OPTIMUL setting, and the gap in performance narrowed somewhat.) More detailed analyses, including the implementation details for such a ring configuration, are currently in progress [Matloff and Schubert, 1988].
The gains reported here are significant, but modest in comparison to the most extreme gains presented in Section 4. In that light, it must once again be pointed out that speedup factors are highly application-dependent. In particular, in the sorting application analyzed here, there is a fundamental obstacle to speedup, in terms of the relative size of computation and communication times:
Consider sorting n items on p processors, by partitioning into blocks of approximate size n/p each. The computation time is approximately C (n/p) log(n/p) for some C, assuming that all subproblems finish at roughly the same time; this is not a bad assumption, since the standard deviation of sort times is small compared to the mean [Gonnet, 1984]. (For simplicity, we are ignoring the merge phase in the analysis below.) The communications time is roughly D (n/p) p (an amount n/p of data being passed through p nodes) for some D.
Fix p and vary n. If the ratio n/p is too small, then very little data is being passed from node to node, not enough to fully exploit the highly parallel data transmission capability in OPTIMUL. On the other hand, as n grows, the computation time tends to dominate the communication time. In this setting, OPTIMUL's communications advantages will be quite substantial over non-OPTIMUL systems, but the advantages will not be important, since communications times will be a minor proportion of the total times anyway.
In other words, applications such as sorting, having computation times which are more than O(n), are poor candidates for studies whose aim is to investigate interprocessor communications costs. In such applications, inefficient interprocessor communication might not be penalized much. Searching applications, with O(n) or O(log n) computational times, should much more fully exploit OPTIMUL's huge communications bandwidth capabilities, and are currently under investigation.
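The broadcast arithmetic can be checked with a short Python sketch. The function names are hypothetical, and the one-megabit message used for the conventional link is an illustrative assumption chosen to match the "millions of bits" scale mentioned earlier; only the 500 ns and 100 ns transfer times, the 64 member ring, and the 50 megabit/second rate come from the text.

```python
def ring_broadcast_time_ns(members: int, transfer_ns: float) -> float:
    """Time for data to reach every other ring member, one memory-to-memory
    transfer per hop, ignoring propagation delay around the ring."""
    return (members - 1) * transfer_ns


def conventional_hop_time_s(bits: float, rate_bits_per_s: float) -> float:
    """Time to push `bits` across one conventional serial link."""
    return bits / rate_bits_per_s


if __name__ == "__main__":
    slow_ns = ring_broadcast_time_ns(64, 500)  # 63 hops at 500 ns each
    fast_ns = ring_broadcast_time_ns(64, 100)  # 63 hops at 100 ns each
    print(f"OPTIMUL ring, 500 ns per transfer: {slow_ns / 1000:.1f} microseconds")
    print(f"OPTIMUL ring, 100 ns per transfer: {fast_ns / 1000:.1f} microseconds")

    # Assumption: a 1,000,000-bit message over a single 50 megabit/second link.
    hop_s = conventional_hop_time_s(1_000_000, 50_000_000)
    print(f"Conventional link, one hop: {hop_s * 1000:.1f} milliseconds")
```

Under these assumptions a single conventional hop already takes far longer than the complete optical broadcast around the ring.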
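The trade-off can be made concrete with a small Python sketch of the two cost expressions. The constants C = D = 1 and the base-2 logarithm are placeholders chosen only for illustration (they are not measured values), and the function name is hypothetical. Dividing the two expressions gives a communication-to-computation ratio of D p / (C log(n/p)), which shrinks as n grows with p fixed.

```python
import math


def comm_fraction(n: int, p: int, C: float = 1.0, D: float = 1.0) -> float:
    """Fraction of total time spent communicating under the simple cost model.

    computation   ~ C * (n/p) * log(n/p)
    communication ~ D * (n/p) * p
    C and D are placeholder constants; requires n/p > 1.
    """
    block = n / p
    compute = C * block * math.log2(block)
    comm = D * block * p
    return comm / (compute + comm)


if __name__ == "__main__":
    p = 64
    for exponent in (4, 10, 16, 22):
        n = p * 2 ** exponent
        print(f"n/p = 2^{exponent}: communication fraction = {comm_fraction(n, p):.3f}")
```

Because the ratio depends on n only through log(n/p), the communication share falls slowly; a faster interconnect lowers D and therefore shrinks that share at every problem size.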
Acknowledgement
This work was supported in part by Rome Air Development Center (RADC). 
