In this paper we discuss the results of a feasibility study of an opto-electronic shared memory with concurrent read, concurrent write capability. Unlike previous work of this kind, we consider a true hardware shared memory rather than a simulation on a tightly, optically connected distributed memory computer. We describe a design that could be implemented using compact integrated semiconductor modules and propose ways to solve two major problems faced by such a device: optical system complexity and parallel word level write consistency. It is shown that, in principle, a memory with GBytes capacity and a latency of less than 1 ns, accessed by up to 10^5 processors, could be feasible. Using devices currently available as laboratory prototypes and taking into account energy and crosstalk considerations, a capacity of more than 1 MB and a latency of about 50 ns might be attained for up to 1000 processors.
Introduction
The concurrent read concurrent write shared memory (CRCW SM) is a powerful concept for parallel processing. Among other things, it is an essential component of the PRAM model of parallel computation [7]. Thus, much effort has been invested in its practical implementation. Unfortunately, attempts to build a uniform access time shared memory scalable beyond about 100 processors (PEs) have so far yielded unsatisfactory results. Efforts now concentrate on simulating shared memory on communication networks, using latency hiding and other techniques [10, 23].
In this paper we consider an alternative approach: using the natural parallelism provided by optical data storage. We present the results of an extended feasibility study to show that, despite being technologically challenging, an opto-electronic CRCW SM (OCRCW SM) is an interesting possibility that should become feasible in the near future. We exploit recent advances in micro-optics and opto-electronics to propose an architecture for the OCRCW SM. It relies on the recently developed ultra fast GaAs MQW optical modulators (e.g. [15]), VCSEL laser diode arrays (e.g. [14]), and the progress in the integration of micro-optical components [13]. We also propose ways to solve two major problems faced by the OCRCW SM: optical system complexity and parallel word level write consistency.
0-7803-2018-2/95/$4.00 © 1995 IEEE
We first give a short overview of related work in section 2. Then, in section 3, the principle of the OCRCW SM is described. From there we proceed to describe the overall system architecture in section 4 and discuss implementation issues in section 5. The problem of parallel word level write consistency is considered in section 6. Finally, performance estimates are given in section 7, followed by our conclusion and future work.
Related Work
So far, research on the use of optical technology in parallel computers has concentrated on the improvement of 
Principle of Operation
The idea behind an opto-electronic CRCW SM (OCRCW SM) is that information stored using optically controlled, variable light absorption of memory pixels can be accessed concurrently by distinct light beams.
Bit Storage
In opto-electronic memory, bits are stored in miniature light modulators (LMs). These are pixel-like, multistate devices that can modify some property of incident light in accordance with their current state. In the following we consider LMs with two states, light absorbing and light transmitting, that can be selected by applying an appropriate voltage. A simple, common LM of that kind is the liquid crystal cell that constitutes a single pixel of an LCD.
The state of an LM corresponds to a single bit of information. It can be read out by illuminating the LM with a beam of light and using a light detector (D) to check whether the light has been absorbed or transmitted. By illuminating an LM with multiple light beams and using multiple detectors, a concurrent read operation can be performed (Figure 1).
To allow an optical bit write, an electro-optic memory cell (MC) can be constructed by combining a voltage controlled LM with two photodetectors. When illuminated, one photodetector (set detector, SD) causes the modulator to be switched on, while the other one (clear detector, CD) switches it off (Figure 2). Thus a bit can be written by illuminating either the SD or the CD of the corresponding MC.
For a parallel write operation with some processors trying to set the bit and others trying to clear it the SD and CD can be wired to follow the majority. The bit is set depending on whether more light is incident on the SD or on the CD. 
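The majority rule for a single memory cell can be illustrated with a minimal Python sketch. The function name `write_bit` and the modelling of incident light as simple beam counts are assumptions made for illustration. The text does not say what happens on a tie, so the sketch assumes that the cell keeps its previous state in that case.

```python
def write_bit(current, set_beams, clear_beams):
    """Return the new state of one memory cell after a concurrent write.

    current     -- present state of the cell (0 or 1)
    set_beams   -- number of PEs illuminating the set detector (SD)
    clear_beams -- number of PEs illuminating the clear detector (CD)
    """
    if set_beams > clear_beams:
        return 1          # more light on the SD: the bit is set
    if clear_beams > set_beams:
        return 0          # more light on the CD: the bit is cleared
    return current        # assumption: a tie (or no light) leaves the cell unchanged


# Three PEs try to set the bit, one tries to clear it.
print(write_bit(0, set_beams=3, clear_beams=1))
```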
Opto-Electronic RAM
A random access memory based on the bit storage mechanism described above can be constructed using the following components:
1. an array of electro-optic MCs (memory plane, MP),
The basic principle is shown in Figure 9. The address of the desired memory location adr is translated by the RU into a light beam directed towards the corresponding location in the memory plane. After hitting the memory plane, the light beam is directed towards the detector by the OS. For a write operation the WU sends a light beam towards the set or clear detector of the MC corresponding to the selected address.
The readout process of an M bit memory can be portrayed as a 1 × M mapping performed by the RU followed by an M × 1 reduction done by the OS.
OCRCWSM
To allow parallel access the above setup must be extended by an additional RU, WU, and D for every accessing processor (Figure 4) . Since different light beams do not interact with each other, every processor can access the memory as if no others were present. Any PE can perform a read or write access on any memory location at any time realizing the full CRCW SM semantics.
For P processors the system performs P 1 × M mappings followed by an M × P mapping for parallel readout, and P 1 × 2M mappings for parallel write, with
$$1 \le adr^{i} \le M, \quad 1 \le i \le P, \quad u \in \{0, 1\} \qquad (3)$$
Design Consideration
The major concerns that have to be addressed by any practical OCRCW SM design are optical system complexity and parallel write consistency on word level.
System Complexity There are fundamental physical constraints on the amount of information that can be stored optically in a given area. The minimum radius of a memory pixel $r_{bit}$ must obey a lower bound (4) that depends on $\lambda$, the wavelength, $\alpha$, the maximum angle of incidence of a light beam, $r_{MP}$, the radius of the MP, and $d$, the distance between the RUs and the MP. This limits not only the density of the MP but also the number of independent communication channels that can be implemented in a given volume [26]. In practice, energy concerns and crosstalk due to misaligned components impose even stricter bounds (e.g. [6]).
As a consequence of the above we must: 1. limit the MP area to allow large memory density and compact integration and to prevent stability and alignment problems, and 2. limit the number of individual points that can be addressed by each PE to reduce the size and complexity of the RUs and WUs.
To achieve these goals without limiting the memory capacity or compromising the parallel access capability, a multistage, modular addressing mechanism is introduced in section 4.
Parallel Word Level Write Consistency A protocol which determines the value of a bit through a majority rule as described in 3.1 can lead to inconsistent values on word level. With several PEs writing a $b_w$ bit word, each one is likely to win only on a fraction of the bits. Thus, in general, the result of a parallel word write operation would be undefined. In contrast, many shared memory algorithms assume that one of the accessing PEs wins on word level and manages to write its value correctly. This problem is solved by a backoff synchronization strategy at memory module level. In section 6 we describe a fast backoff protocol that realizes a prioritized concurrent write model.
Overall Architecture
The memory is divided into m modules, each containing p pages of b bits. The modules are connected by a network that allows all PEs to access all modules independently, as described in 4.1. This gives a fixed, manageable size of the MP for arbitrary memory capacity while preserving the parallel access capability. On module level only whole pages are optically addressed by the RUs and WUs, reducing the complexity of the beam steering mechanism by a factor of b (section 4.2). The selection of individual bits from a page is performed electronically by a fast matrix addressing mechanism.
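The division into modules, pages and bits can be sketched as a simple address split. The function `decode_address` and the particular ordering of the fields (module in the most significant part, bit in the least significant part) are assumptions for illustration; the text only states that the memory consists of m modules with p pages of b bits each.

```python
def decode_address(adr, m, p, b):
    """Split a linear bit address into (module, page, bit).

    Assumed layout: consecutive addresses fill a page first,
    then the next page of the same module, then the next module.
    """
    if not 0 <= adr < m * p * b:
        raise ValueError("address outside the memory")
    module, rest = divmod(adr, p * b)   # selected through the module interconnect
    page, bit = divmod(rest, b)         # page: optical stage, bit: electronic stage
    return module, page, bit


# 4 modules, 8 pages per module, 16 bits per page
print(decode_address(130, m=4, p=8, b=16))
```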
Module Interconnection
The module interconnection network must allow parallel access of all PEs to all modules, realizing P independent 1 × m interconnects. Its complexity and cost should not exceed a small fraction of the total cost of the memory modules. In this section we show how a tree of multiplexers network fulfills these requirements. It can be constructed using a small number of modular components, each with the functionality of a single simplified memory module.
Tree of Multiplexers Network The processors are connected to the modules by a tree of multiplexers as shown in Figure 5. The tree has height h and degree D. Each multiplexer has one input and D outputs for each PE. The memory modules are located at the leaves and have P I/O channels, one for each PE. Thus each PE has an independent connection to each memory module.
Complexity In addition to the $m = D^h$ memory modules, the above architecture contains $n_{mult} = \sum_{i=0}^{h-1} D^i$ multiplexers. Each multiplexer performs P 1 × D mappings, directing the input of each PE onto one of D alternative outputs. This is essentially a simplified version of the functionality contained in a single module, which performs P independent 1 × p mappings with p ≫ D.
Cost Dividing the OCRCW SM into modules in the above manner increases the number of components in the system by at most a fraction $n_{mult}/m = (D^h - 1)/((D - 1) D^h) < 1/(D - 1)$.
The engineering effort does not increase, since the additional components are simplified versions of the memory modules. The multiplexer stages have no influence on the number of switching operations needed for a memory access, since they are part of the standard address decoding process.
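The component count of the tree can be checked numerically with a short sketch. The helper names are hypothetical, and the `route` function assumes that the module number is decoded digit by digit in base D from the root downwards, which is one natural reading of "part of the standard address decoding process" rather than something the text specifies.

```python
def multiplexer_count(D, h):
    """Number of multiplexers in a tree of degree D and height h (levels 0 .. h-1)."""
    return sum(D ** i for i in range(h))


def overhead(D, h):
    """Multiplexers per memory module; the modules are the D**h leaves."""
    return multiplexer_count(D, h) / D ** h


def route(module, D, h):
    """Output port taken at each tree level, root first (assumed base-D decoding)."""
    ports = []
    for _ in range(h):
        ports.append(module % D)
        module //= D
    return ports[::-1]


D, h = 4, 3
print(multiplexer_count(D, h), D ** h, overhead(D, h) < 1 / (D - 1))
print(route(27, D, h))
```

For D = 4 and h = 3 the tree serves 64 modules with 21 multiplexers, so the overhead stays below 1/(D - 1) = 1/3 of the module count.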
Module Structure
Each of the m modules is essentially an $M_{mod} = M/m$ bit OCRCW SM as described in 3.3, except that it is organized in p memory pages (MPGs) of b bits each. Data is accessed in two stages, one optical and one electrical.
Beam Deflection In the optical stage an MPG is selected by illuminating it with a beam of light. Thus a fast and compact light beam steering mechanism with a resolution of p angles is needed. The only practical device that is fast and compact enough is an array of p distinct, microscopic laser diodes (LDs). The LDs are combined with appropriate micro-optics to make sure that each emits a light beam at a unique angle. Since only one LD is active at any time, a simple, fast matrix addressing can be used for LD selection.
Reading a Page In the two stage addressing mechanism each PE must be equipped with an array of b detectors (DA) instead of a single detector. A memory page is read optically by projecting it on the DA. This is done by activating the appropriate LD. The optical system makes sure that the desired page is illuminated and the transmitted light is directed towards the DA of the reading PE (Figure 6). In the DA the required bits are selected by the matrix addressing mechanism.
Writing a Page To write a memory location, an appropriately set sample page is projected on the MPG containing the desired location (Figure 7). The sample page has two electrically controlled LMs for each MC of an MPG. One LM corresponds to the SD, the other one to the CD of the MC. To write a selected bit of a page, the LMs of the sample page corresponding to the other bits are all off, so those bits are unaffected by the write operation. Of the LMs corresponding to the selected bit, either the set LM or the clear LM is on. This determines whether the bit is to be set or cleared (Figure 8). The above scheme guarantees that multiple PEs can access different bits of a single page independently.
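The sample page mechanism can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the names `sample_page` and `apply_writes` are hypothetical, light is modelled as a count of beams per detector, and a bit that receives equal light on both detectors is assumed to keep its value.

```python
def sample_page(b, bit, value):
    """Return (set_lms, clear_lms) for writing `value` to position `bit` of a b bit page.

    Only one LM of the selected bit is on; all others stay off.
    """
    set_lms = [False] * b
    clear_lms = [False] * b
    if value:
        set_lms[bit] = True
    else:
        clear_lms[bit] = True
    return set_lms, clear_lms


def apply_writes(page, writes):
    """Project the sample pages of several PEs onto one memory page.

    page   -- list of 0/1 values, the current page content
    writes -- list of (bit, value) pairs, one per writing PE
    """
    b = len(page)
    set_light = [0] * b
    clear_light = [0] * b
    for bit, value in writes:
        set_lms, clear_lms = sample_page(b, bit, value)
        for i in range(b):
            set_light[i] += set_lms[i]
            clear_light[i] += clear_lms[i]
    result = []
    for i in range(b):
        if set_light[i] > clear_light[i]:
            result.append(1)
        elif clear_light[i] > set_light[i]:
            result.append(0)
        else:
            result.append(page[i])   # no light or a tie: bit unchanged (assumption)
    return result


# Two PEs write different bits of the same page at the same time.
print(apply_writes([0, 0, 1, 1], [(0, 1), (3, 0)]))
```

Bits 1 and 2 receive no light from either sample page, so they keep their previous values.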
Implementation Issues
The central problem in the implementation of the OCRCW SM described in this paper is the realization of a single memory module. Both the multiplexer modules and the connections between the modules are technologically much less challenging.
Optical System
The purpose of this section is to provide some insight into the optical implementation of an OCRCW SM module. To this end we describe in some detail a possible optical read setup.
A major difference between this system and the naive idea of projecting whole pages is the need for splitting the illuminating beam into b small beams, each focused on a single bit. This is due to energy concerns and interference problems. We consider a setup consisting of two types of components: lenses and holograms. It is a variant of the two-hologram canonical Fourier system [2] that is commonly used in optical interconnection systems.
Basic Components

A light beam is an electromagnetic wave characterized by a complex function $\psi(x(t), y(t), z(t))$.
An optical element like a lens, a hologram or an array of LMs is defined by the way it modifies a light beam incident on it. In many cases we are dealing with either a parallel light bundle (approximately a plane wave) defined by the propagation angle $(\alpha, \beta)$ with the z axis, or a light cone converging to (or emerging from) a given point $(x, y, z)$ (approximately a spherical wave). We will denote the former by $\phi(\alpha, \beta)$ and the latter by $\varphi(x, y, z)$.
Lens Functionality A lens (L) is an optical element that translates angle coded information into spatially coded information. All light rays incoming at a given angle are gathered in a single point $(x_f, y_f)$ in the focal plane at the distance f behind the lens.
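The angle to position translation performed by an ideal thin lens can be written down directly. The sketch below assumes that α and β are the inclinations of the beam towards the x and y axes measured from the optical (z) axis, which is an interpretation and not stated in the text; the function name `focal_point` is likewise hypothetical.

```python
import math


def focal_point(alpha, beta, f):
    """Point in the focal plane where an ideal lens of focal length f
    gathers a parallel bundle arriving at angles (alpha, beta), in radians."""
    return f * math.tan(alpha), f * math.tan(beta)


# A bundle tilted by 1 degree in x, focused by a lens with f = 10 mm.
x, y = focal_point(math.radians(1.0), 0.0, 10.0)
print(round(x, 4), y)
```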
A lens transforms a plane wave $\phi$ into a spherical wave $\varphi$, or in other words a parallel light bundle into a cone-like converging light bundle (and the other way around).
Hologram Functionality The function of a hologram can be described by multiplying the incoming light function $\psi_{in}$ by a complex transmission function $t_h(x, y)$. Thus a hologram is an extremely versatile device that, in theory, can be used to implement any possible optical transformation. For the OCRCW SM we need two kinds of holograms:
1. An H2 maps a sequence of plane waves incoming at angles differing by multiples of $\delta$ onto a single plane wave.
Reading a Page
Read Setup The read setup is sketched in the accompanying figure. Each RU consists of an array of LDs in the focal plane of a lens $L_{RU}$ followed by a holographic element H1 in the second focal plane of $L_{RU}$. A DA unit consists of a holographic element H2 in the focal plane of a lens $L_{DA}$ followed by the DA in the second focal plane of $L_{DA}$.
Read Process The read process by processor P starts with a light cone $\varphi(x(P, p), y(P, p), z_{RU})$ emitted by the LD corresponding to the page p. At the MP it becomes an array of b light cones, each focused on one bit of the desired page.
Optoelectronic Components
Three types of opto-electronic components are required to build an optical CRCW memory as proposed in this paper: light sources, light modulators and light detectors. Of interest are only devices that can be built and miniaturized in semiconductor technology and can potentially be operated at GHz frequencies at low driving energies. A comprehensive overview of such devices and their operating principles can be found in [25].
Lasers Much work has been performed in the area of semiconductor laser arrays. Reference [14] reports a GaAs MQW laser array with a device density of 2 × 10^6/cm² and 0.5 mW per laser. The experimental device had an active area of 15 mm² and was operated at 230 MHz. This type of device was used as a read out element for holographic page memory [21]. Recently, LD arrays of up to 256 elements, each 15 µm in diameter, with over 3 mW output and up to 1 GHz modulation frequency became commercially available [22].
Light Modulators Light modulators are the most critical opto-electronic components. Devices allowing MHz or GHz modulation frequencies at low energies became available only recently with the advance of the MQW (multiple quantum well) SEED (self electro-optic device) technology [20]. They exploit electrically induced shifts of the absorption peak of excitons (electron-hole pseudoatoms) confined in narrow potential wells by thin layers of different semiconductors. Theoretically, switching times of several picoseconds can be achieved in MQW SEED modulators.
Light Detectors A photodetector can be characterized by two factors: the dependence of the photocurrent on the absorbed optical energy (efficiency) and the statistical "noise" current. For reliable signal transmission, photocurrents an order of magnitude greater than the noise current must be generated, leading to a minimum reliably detectable energy. Silicon avalanche pin photodiodes are cited with bit rates of 1 GHz at about 100 nW optical power or 100 MHz at less than 10 nW.
Parallel Word Level Write Consistency
It is important that a CRCW SM allows parallel write on word level as well as on bit level. However, a majority decision on bit level can lead to inconsistent results on word level. When several PEs try to write different values w into the same memory word U, the result of the operation need not correspond to any of the original words.
An attempt to write $w_1 = 0101$, $w_2 = 0110$ and $w_3 = 0000$, for example, leads to $w = 0100 \ne w_1, w_2, w_3$ being written to U. To avoid this difficulty we propose an iterative backoff strategy on memory module level. It relies on the ability of a memory word to recognize a conflict, signal it, and use the $b_w$ bits of U to make all but at most a fraction $1/b_w$ of the PEs back off in every iteration.
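The four bit example above can be reproduced with a short sketch of the bitwise majority rule. The function name `majority_word` is hypothetical, words are modelled as bit strings of equal length, and a tie between set and clear requests is assumed to yield 0, which the text does not specify.

```python
def majority_word(words):
    """Combine concurrently written words by an independent majority vote per bit."""
    width = len(words[0])
    result = []
    for i in range(width):
        ones = sum(w[i] == "1" for w in words)
        zeros = len(words) - ones
        result.append("1" if ones > zeros else "0")   # tie resolved to 0 (assumption)
    return "".join(result)


written = ["0101", "0110", "0000"]
stored = majority_word(written)
print(stored, stored in written)
```

The stored word wins the vote on every bit position, yet it equals none of the words that were actually written.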
Backoff Mechanism
Memory Extensions To signal a parallel write conflict, an additional control bit (conflict bit, CB) is added to each memory word. In addition, the SDs and CDs of the bits of U must be wired in such a way that
1. a simultaneous signal occurring at the SD and CD of any bit (an attempt to write conflicting values) can be detected and signalled by setting the CB, and
2. there is a mode of operation, activated by a set CB, which unambiguously selects and sets a single bit out of the set of all bits of U with an active SD.
WU Extensions Every PE is assigned a unique number $N_{PE}$. This number is coded to a basis of $b_w$; for $b_w = 8$ octal representation would be used. Each digit of $N_{PE}$ is represented by $b_w$ bits, of which only the one corresponding to the value of this digit is set. The octal two digit number 13 would thus be represented by the digits $d_1$ = 01000000, $d_2$ = 00010000.
Backoff Protocol A write operation by a subset $\mathcal{P}$ of $P_w \le P$ processors on a memory word U is performed as described below:
1. The iteration number k is set to 0.
2. Each PE in $\mathcal{P}$ writes its value of w to U.
3. If a conflict occurs then:
(a) the CB bit of U is set,
(b) all other bits of U are cleared.
4. Each PE in $\mathcal{P}$ reads U to see if CB is set.
5. If CB is set then
(a) each PE in $\mathcal{P}$ writes the kth digit $d_k$ of its $N_{PE}$ (in the $b_w$ bit representation described above) to U,
(b) of all the $d_k$ written by the PEs in $\mathcal{P}$ one is selected by the priority selection mechanism and the corresponding bit is set, while all other bits of U remain off,
(c) each PE in $\mathcal{P}$ reads U, compares it to its $d_k$ and checks if CB is set,
(d) all PEs with $d_k \ne U$ leave $\mathcal{P}$,
(e) if CB is set then k is increased and all PEs in $\mathcal{P}$ go back to step 5a.
6. The last PE remaining in $\mathcal{P}$ writes w to U.
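The digit by digit elimination in the steps above can be sketched as a small simulation. This is a simplified model under stated assumptions, not the hardware protocol itself: the function names are hypothetical, digits are processed most significant first, the priority selection mechanism is assumed to pick the leftmost set bit, and the conflict bit signalling is replaced by simply iterating until one PE remains.

```python
def one_hot_digits(n_pe, base, n_digits):
    """Digits of n_pe in the given base, most significant first,
    each encoded as a bit string of length `base` with exactly one bit set."""
    digits = []
    for _ in range(n_digits):
        digits.append(n_pe % base)
        n_pe //= base
    digits.reverse()
    return ["".join("1" if i == d else "0" for i in range(base)) for d in digits]


def backoff_winner(pe_numbers, base, n_digits):
    """Return (winning PE number, iterations used) for a set of unique PE numbers."""
    remaining = list(pe_numbers)
    iterations = 0
    for k in range(n_digits):
        if len(remaining) <= 1:
            break
        # Step 5a: every remaining PE writes the one-hot code of its k-th digit.
        written = {pe: one_hot_digits(pe, base, n_digits)[k] for pe in remaining}
        # Step 5b: priority selection, assumed here to favour the leftmost set bit.
        selected = min(code.index("1") for code in written.values())
        # Step 5d: PEs whose digit was not selected back off.
        remaining = [pe for pe in remaining if written[pe].index("1") == selected]
        iterations += 1
    return remaining[0], iterations


print(one_hot_digits(0o13, 8, 2))
print(backoff_winner([0o13, 0o15, 0o27], 8, 2))
```

Because the PE numbers are unique, at most one PE can survive all digit positions, so the number of iterations never exceeds the number of digits.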
Cost
The backoff strategy makes sure that the number of PEs $P_k$ remaining in the race after iteration k is at most $P / b_w^{k}$. Thus the maximum number of iterations needed to resolve a write conflict is $\log_{b_w}(P)$. For $b_w = 32$ this gives a maximum of 3 iteration steps for up to 32768 PEs and 4 iteration steps for up to 1048576 PEs. Since the operation involves a single module rather than the whole memory, the actual impact of a parallel write conflict on the write access speed is even smaller than the number of backoff iterations suggests.
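The iteration counts quoted above follow from the logarithmic bound and can be checked with integer arithmetic. The helper name `max_backoff_iterations` is hypothetical; it returns the smallest k such that $b_w^k \ge P$.

```python
def max_backoff_iterations(P, b_w):
    """Smallest number of iterations k with b_w**k >= P (integer arithmetic only)."""
    iterations, reach = 0, 1
    while reach < P:
        reach *= b_w
        iterations += 1
    return iterations


for P in (32768, 1048576):
    print(P, max_backoff_iterations(P, 32))
```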
Performance, Scalability and Feasibility
In this section we discuss the limits on the performance of the OCRCW SM described in the previous sections. Our purpose is not to provide a strict, exact analysis, which is beyond the scope of this paper. Instead we want to show that a system large enough to be useful could be feasible in the near future.
Performance Limits
The performance of the memory is given by the density $\delta$, the degree of concurrency P, the access latency L and the data transfer rate R. It is determined by the optical limits on component density as well as the performance and fabrication density of the opto-electronic components.
Optical System Constraints The maximum density of the memory plane is given by (4). In theory the maximum degree of concurrency is given by the number of RUs and DAs that can be placed in an area about equal to the area of the MP. We assume the size of a single LD and a single D to be about the size of an MC. Thus the number of LDs or Ds that can be placed on an area equal to the area of an $M_{mod}$ bit MP is trivially $M_{mod}$.
For every PE we need p LDs and b Ds. Since $M_{mod} = p \cdot b$, at most $M_{mod}/p = b$ LD arrays and $M_{mod}/b = p$ detector arrays fit on that area, which gives P = min(b, p). Equations (4) and (9) represent the theoretical, fundamental limit on the size of the OCRCW SM.
A more practical limit can be obtained using a model presented in [2] to estimate the volume of a two-hologram Fourier optical interconnect system, taking into account energy, misalignment and crosstalk considerations. The volume is shown to be of the order of V(n, κ, F) = O(nbX+Ft), with κ being the number of independent holographic elements, n the number of sources and F the fanout. Although the OCRCW SM does not totally fit into this model, taking κ = p, n = P and F = b gives a valid estimate of the order of magnitude. For a concrete value of V the above term must be multiplied by 64sX9 and an additional constant factor $C_{opt}$.
Opto-Electronic Component Constraints
All the electronic components discussed in section 5
Combined Limits A summary of the results derived based on the components and concepts described in the previous sections appears in the table below. The theoretical numbers are based on the fundamental limits presented in (4) and (9) and a wavelength of 100 nm. The practical numbers take into account energy considerations and crosstalk problems using (10) as well as the characteristics of devices currently available at laboratory level.
[Table, practical row: capacity > 1 MB | P = 10^3 | latency < 50 ns | 100]
Feasibility
The components necessary for the implementation of the OCRCW SM described in this paper have all been demonstrated by different groups in laboratory experiments [14, 17]. Some are even commercially available [22]. They have been combined to form complex systems on an optical bench in various experiments [3, 24]. The integration of similar (less complex) systems in compact modules has also been reported [4]. However, the complexity of such systems is currently still below the complexity of an OCRCW SM large enough for practical purposes. Nevertheless, the required technology is emerging.
Conclusion
We have presented a design for a scalable OCRCW SM. An architecture was devised to overcome the optical system complexity problem through modularization and multistage addressing at acceptable additional cost. A priority based scheme for concurrent word write operation that results in negligible overhead even for a large number of PEs was proposed. Issues of optical implementation were discussed and an example system for the memory module read operation was described. Finally, performance limits were presented. We conclude that the OCRCW SM is an interesting alternative to the simulation of shared memory that should become technologically feasible in the near future. However, due to its disadvantages in terms of capacity and cost as compared with conventional memory, ways must be found to combine limited amounts of OCRCW SM with electronic memory.
Future Work
The work presented in this paper constitutes the result of a first design and feasibility study stage of our project. We are currently starting experimental work using arrays of SSEED MQW modulators and VCSEL laser chips to build an optical bench proof of principle model. Furthermore experiments to test the scalability and the performance limits of different critical system components are also planned. Based on our conclusion we are investigating different machine models equipped with a limited amount of OCRCW SM combined with a large amount of conventional electronic memory. We are working on simulations comparing the performance of machines using OCRCW-SM varying in size and performance to more conventional parallel machines.
