On the feasibility of a scalable opto-electronic CRCW shared memory by Lukowicz, Paul & Tichy, Walter F.
Early version published in: Proc. of the ICA3PP, 1995, Brisbane, Australia.
Discussion of the optical system to appear in the Proc. Topical Meeting On Optical Computing OC'96, Sendai Japan.
On the Feasibility of a Scalable Opto-Electronic CRCW Shared
Memory
Paul Lukowicz, Walter F. Tichy
Universitat Karlsruhe, Fakultat fur Informatik, Germany
email: lukowicz@ira.uka.de, tichy@ira.uka.de
Abstract
In this paper we discuss the results of a feasibility study of an
opto-electronic shared memory with concurrent read, concurrent write
capability. Unlike previous work of this kind, we consider a true
hardware shared memory rather than a simulation on a tightly, optically
connected distributed memory computer. We describe an architecture that
can be implemented using semiconductor based light modulator and VCSEL
laser diode arrays and microlenses. We show how to solve two major
problems faced by such a device: optical system complexity and parallel
word level write consistency. An analysis of the fundamental performance
constraints proves that in general the memory density decreases only
linearly with the degree of concurrency, as opposed to the quadratic
decrease in electronic parallel access memory. Under certain
circumstances the memory density can even be shown to be independent of
the number of accessing processors. We argue that, in principle, a
memory with GBytes of capacity and a latency of about 1 ns, accessed by
thousands of processors, is conceivable.
1 Introduction
The concurrent read concurrent write shared memory (CRCW SM) is a
powerful concept for parallel processing. It allows simple and efficient
implementation of a variety of parallel algorithms. It is also an
essential component of the PRAM model of parallel computation [14].
Thus, much effort has been invested in its practical implementation.
Unfortunately, attempts to build a uniform access time shared memory
scalable beyond a small number of processors (PEs) have so far yielded
unsatisfactory results. Efforts now concentrate on simulating shared
memory on communication networks, using latency hiding and other
techniques [17, 37]. Apart from the additional cost caused by the
network, this introduces network management delays and sequentialization
at the memory module level.
In this paper we consider an alternative approach: using the advantages
provided by optical communication and data storage to implement a
scalable hardware shared memory. We utilize the fact that in 3D free
space optical technology a network with a bisection width B can be
implemented on an area proportional to B [11]. In contrast, any
electronic implementation with an arbitrary but fixed number of layers
requires an area of $O(B^2)$. We present the results of an extended
feasibility study to show that, despite being technologically
challenging, an opto-electronic CRCW SM (OCRCW SM) is an interesting
possibility that should become feasible in the near future.
We first give a short overview of related work in section 2. Then the
principle of the OCRCW SM is described in section 3. From there we
proceed to describe the overall system architecture in section 4 and
discuss implementation issues in section 5. The problem of parallel
write word level consistency is considered in section 6. Finally,
performance estimates are given in section 7, followed by our
conclusions and future work.
2 Related Work
2.1 Electronic Multiport Memory
Multiport memory is an extension of common RAM that allows truly
parallel access by providing multiple ports, each with its own
addressing mechanism. Currently multiport memory modules with up to 4
ports are available. A larger number of ports has so far not been
implemented due to the cost of duplicating the addressing mechanism for
each additional port. A recent study [13] shows the chip area needed to
implement a P port memory module with a given capacity to be
proportional to $P^2$. It concludes that no more than 10 to 16 ports can
be expected to be feasible in the foreseeable future.
2.2 SM Simulation
A huge amount of research in the area of parallel processing is devoted
to the simulation of SM on distributed memory computers. According to
the level of simulation the approaches can be classified as software
(e.g. [42, 4, 18]), operating system (e.g. [16, 27, 2]) or hardware
(e.g. [1, 25, 39]) based. Although some promising results have been
achieved, a system with a performance comparable with the ideal SM for a
wide range of problems and a large number of processors is still not in
sight.
2.3 Optics in Parallel Computing
Optics has long been considered superior to electronics in communication
intensive applications. The use of optics for computing was pioneered,
among others, by Huang [20, 19] and Lohmann [28, 29, 33]. So far,
research on the use of optical technology in parallel computers has
concentrated on the improvement of interconnection networks
[36, 15, 43, 3]. Concepts have also been proposed for direct optical
connections between electrical [23] or optical [31] memory banks. It was
suggested that a fully connected network can realize a model of
computation that is closer to the PRAM than conventional parallel
computers [10]. Our work goes a step further by proposing a direct
implementation of the full functionality of a CRCW parallel random
access memory. Although the potential has been pointed out repeatedly
(e.g. [30], [6]), little research has so far been performed in that
area. In [5] a volume holographic concurrent memory was proposed and a
small demonstrator (5 processors and 5 × 5 bit pages) implemented. In
[24] architecture concepts for a parallel optical associative cache are
discussed.
3 Basics
3.1 Problem Overview
A sequential memory system with a capacity of M bits consists of M
memory elements and a 1 to M connection allowing the processor to access
each memory cell. An M bit, P processor shared memory system requires M
memory elements and a P to M connection. This is true for both an ideal
hardware SM system and a network based simulation. The difference
between the two approaches lies in the degree of parallelism allowed by
both the network and the memory elements. In an ideal SM system there is
no limit on the number of processors that can concurrently access any
single memory cell. The P to M interconnection network is equivalent to
a superposition of P of the 1 to M networks present in a sequential
memory system. It provides an independent channel between every
processor and every memory cell. Thus the delay of a memory access
operation performed by any given processor does not depend on the
activity of the other processors. The main difficulty in implementing an
ideal shared memory is the necessity for an independent physical
connection for every communication channel. The area needed for the
physical realization of the connections accounts for the $P^2$ bound on
the area of a P port VLSI memory mentioned in section 2.1.
3.2 Principle of Operation
The idea behind an opto-electronic CRCW SM (OCRCW SM) is that
information stored using the optically controlled, variable light
absorption of memory pixels can be accessed concurrently by distinct
light beams. The optical addressing is performed in free space by
directing a beam of light towards the desired memory cell. Thus no
physical connections are required, and multiple communication channels
can coexist in the same space. In principle the addressing mechanism of
a P processor, M bit shared memory can easily be constructed by
superposition of P sequential M bit addressing devices.
Bit Storage
In opto-electronic memory, bits are stored in miniature light modulators
(LMs). These are pixel like, multi-state devices that can modify some
property of incident light in accordance with their current state. In
the following we consider LMs with two states, light absorbing and light
transmitting, that can be selected by applying an appropriate voltage. A
simple, common LM of that kind is a liquid crystal cell that constitutes
a single pixel of an LCD display.
The state of an LM corresponds to a single bit of information. It can be
read out by illuminating the LM with a light beam and using a detector
to check whether the beam has been absorbed or transmitted. By
illuminating an LM with multiple light beams and using multiple
detectors a concurrent read operation can be performed (Figure 1).
Figure 1: Reading information stored in a light modulator
Figure 2: Writing an optical memory cell (MC) consisting of an
electrically controlled LM and two photodetectors: a set detector (SD)
and a clear detector (CD). Panels: setting a bit, clearing a bit.
To allow an optical bit write, an electro-optic memory cell (MC) can be
constructed by combining a voltage controlled LM with two
photodetectors. When illuminated, one photodetector (set detector, SD)
causes the modulator to be switched on, while the other one (clear
detector, CD) switches it off (Figure 2). Thus a bit can be written by
illuminating either the SD or the CD of the corresponding MC.
For a parallel write operation with some processors trying to set the
bit and others trying to clear it, the SD and CD can be wired to follow
the majority. The bit is set or cleared depending on whether more light
is incident on the SD or on the CD.
Opto-Electronic RAM
A random access memory based on the bit storage mechanism described
above can be constructed using the following components:
1. an array of electro-optic MCs (memory plane, MP),
2. a beam steering device for read access (read unit, RU),
3. a beam steering device for write access (write unit, WU),
4. a light detector (D), and
5. an optical system (OS) to direct the beams towards the detector.
The basic principle is shown in Figure 3. The address
of the desired memory location adr is translated by
the RU into a light beam directed towards the corre-
sponding location in the memory plane. After hitting
the memory plane the light beam is directed towards
the detector by the OS. For a write operation the WU
sends a light beam towards the set or clear detector
of the MC corresponding to the selected address.
OCRCW SM
To allow parallel access the above setup must be extended by an
additional RU, WU, and D for every accessing processor (Figure 4). Since
different light beams do not interact with each other, every processor
can access the memory as if no others were present. Any PE can perform a
read or write access on any memory location at any time, realizing the
full CRCW SM semantics.
For P processors the system performs P 1 to M mappings followed by an
M to P mapping for parallel readout, and P 1 to 2M mappings for parallel
write.
3.3 Design Considerations
The major concerns that have to be addressed by any OCRCW SM design are
optical system complexity and parallel write consistency at the word
level.
3.3.1 System Complexity
There are fundamental physical constraints on the amount of information
that can be stored optically in a given area and on the number of
optical communication channels in a given volume. Furthermore, practical
considerations impose a limit on the size of a single memory module.
Memory and Connection Density The memory density is limited by the
minimum radius r of a point that can be resolved using a given optical
system. For a wavelength $\lambda$ and a lens (or any other equivalent
optical system) with a diameter D and focal length f, the diffraction
limit gives
$r \approx 1.22\, \lambda f / D$ (1)
The maximum number of optical communication channels of a system depends
on the degree of space variance (see footnote 1) of the system.
[Equations (2) and (3), which bound the number of space invariant and
space variant channels that can be accommodated in a volume $V = a^3$ at
the wavelength $\lambda$, are missing in the source.]
In practice energy concerns and crosstalk due to misaligned components
impose even stricter bounds (e.g. [11, 7]).
Size The physical size of a memory plane is limited by lens quality
considerations, alignment and stability problems, and volume
utilization.
High quality lenses with the low f/D ratio required by formula (1) are
very difficult to manufacture for large D. Furthermore, high quality
imaging is much easier for small than for large fields of view. Larger
systems are also more difficult to align, package and stabilize. Due to
(2) and (3) the memory [...] gets worse as the memory plane diameter a
grows. The system not only gets more difficult to build but also
increasingly consists of empty 'useless' space.
As a consequence of the above we must limit the number of individual
points that can be addressed by each PE in order to reduce the degree of
space variance of the system. Furthermore, the restriction on the
physical size of the system means that an OCRCW-SM must be constructed
using small, constant size modules, in much the same way as electronic
memory is organized in chips and banks. This poses the problem of
designing a module interconnection network that does not compromise the
parallel access capability. To achieve these goals a multistage, modular
addressing mechanism is introduced in section 4.
Footnote 1: The space variance of an imaging system defines the
irregularity of the system. It can be described as the number of
permutations between the object and the image. In a lens based system
the degree of space variance determines the minimum [...]
Figure 3: A scheme of a RAM memory using a plane of memory cells (MC),
each consisting of a light modulator (LM) and two detectors (SD and CD),
light deflecting read and write units [...]
Figure 4: A scheme of the OCRCW-SM using a plane (MP) of memory cells
(MC) and an individual read unit (RU), write unit (WU) and detector (D)
for each PE (see 3.3).
3.3.2 Parallel Word Level Write Consistency
A protocol which determines the value of a bit through a majority rule
as described in 3.2 can lead to inconsistent values at the word level.
With several PEs writing a $b_w$ bit word, each one is likely to win
only on a fraction of the bits. For example, an attempt to concurrently
write $w_1 = 0101$, $w_2 = 0110$, and $w_3 = 0000$ to the memory word w
leads to
$w = 0100 \neq w_1, w_2, w_3.$
Thus in general the result of a parallel word write operation would be
undefined. In contrast, many shared memory algorithms assume that one of
the accessing PEs wins at the word level and manages to correctly write
its value.
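The example above can be reproduced with a minimal sketch of the bitwise majority rule. The function name `majority_write` and the tie rule (a bit without a strict majority of set requests is cleared) are assumptions made for illustration only; the text does not specify how ties are resolved.

```python
def majority_write(words, width):
    """Bitwise majority over concurrently written words.

    Assumption: a bit is set only if strictly more writers set it than
    clear it; ties clear the bit.
    """
    result = 0
    for bit in range(width):
        set_votes = sum((w >> bit) & 1 for w in words)
        clear_votes = len(words) - set_votes
        if set_votes > clear_votes:
            result |= 1 << bit
    return result


writes = [0b0101, 0b0110, 0b0000]   # w1, w2, w3 from the text
stored = majority_write(writes, width=4)
print(format(stored, "04b"))        # 0100
print(stored in writes)             # False: no writer wins at the word level
```

The stored word matches none of the three written values, which is exactly the inconsistency the backoff strategy has to remove.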
This problem is solved by a backoff synchronization strategy at the
memory module level, which is described in section 6.
Figure 5: The read setup for the OCRCW SM memory module divided into
pages.
4 Overall Architecture
The reduction of the space variance of the system is achieved by
restricting optical addressing to p pages of b = M/p bits each, instead
of allowing individual optical addressing of each of the M bits. The
selection of individual bits from a page is performed electronically.
Section 4.1 describes a page oriented memory system that does not
restrict the level of concurrency.
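The two-stage addressing can be illustrated with a short sketch that splits a bit address into an optically selected page and an electronically selected bit. The contiguous layout (page index = address divided by page size) and the name `split_address` are assumptions for illustration; the text does not fix how addresses are mapped to pages.

```python
def split_address(addr, M, p):
    """Split a bit address into (page index, bit offset within the page).

    Assumes M bits laid out contiguously over p pages of b = M / p bits.
    """
    if M % p != 0:
        raise ValueError("M must be a multiple of p")
    if not 0 <= addr < M:
        raise ValueError("address out of range")
    b = M // p                      # bits per page
    return divmod(addr, b)          # (optical stage, electronic stage)


page, bit = split_address(300, M=1024, p=8)
print(page, bit)                    # 2 44
```

With M = 1024 and p = 8 each page holds 128 bits, so address 300 falls into page 2 at offset 44.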
To allow the division of the memory into constant size modules without
sacrificing concurrency, a tree like network is proposed that allows
full parallel access of all processors to all memory modules. Section
4.2 discusses a network architecture that uses the same technology as
the memory modules and only marginally increases the cost and complexity
of the system (as compared with the same amount of memory implemented in
a single module).
4.1 Page Addressed Memory Modules
In the page oriented OCRCW-SM, memory is accessed in two stages: an
optical page selection stage and an electrical bit selection stage. The
beam steering for optical page selection is performed by miniature laser
diodes (LDs) combined with appropriate optics. The electrical stage can
be implemented with a fast matrix addressing mechanism.
Reading a Page Each PE must be equipped with an array of b detectors
(DA) instead of a single detector. A memory page is read optically by
projecting it on the DA. For that purpose each processor is equipped
with a separate miniature laser diode (LD) for each page. An appropriate
optical system makes sure that the desired page is illuminated when the
corresponding LD is activated and that the transmitted light is directed
towards the DA of the reading processor (Figure 5). In the DA the
required bits are then selected electronically.
Figure 7: The LMs of the sample (mask) page during a write operation:
top, for a bit that is not to be affected; middle, for a bit that is to
be cleared; bottom, for a bit that is to be set. Columns: sample page,
memory page before, memory page after.
Writing a Page To write a memory location an appropriately set mask page
is projected on the memory page containing the desired location (Figure
6). The mask page has two electrically controlled LMs for each memory
cell of a page. One LM corresponds to the SD, the other one to the CD.
To write a given bit of a page, first the LMs of the mask page
corresponding to the other bits are all turned off. Those bits are
unaffected by the write operation. Of the LMs corresponding to the
selected bit either the set LM or the clear LM is switched on. This
determines whether the bit is to be set or cleared (Figure 7). The above
scheme guarantees that multiple PEs can write different bits of a single
page independently.
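The mask page mechanism can be sketched in a few lines. The data layout (two boolean lists per mask, one for the set LMs and one for the clear LMs), the function names, and the rule that a cell receiving equal light on SD and CD keeps its value are all assumptions for illustration.

```python
def make_mask(page_size, bit, value):
    """Mask page with all LMs off except the set or clear LM of one bit."""
    set_lms = [False] * page_size
    clear_lms = [False] * page_size
    if value:
        set_lms[bit] = True
    else:
        clear_lms[bit] = True
    return set_lms, clear_lms


def write_page(page, masks):
    """Project all masks on the page; each cell follows the SD/CD majority."""
    new_page = list(page)
    for i in range(len(page)):
        sd_light = sum(1 for set_lms, _ in masks if set_lms[i])
        cd_light = sum(1 for _, clear_lms in masks if clear_lms[i])
        if sd_light > cd_light:
            new_page[i] = 1
        elif cd_light > sd_light:
            new_page[i] = 0
        # equal light (including none): the cell keeps its value (assumption)
    return new_page


page = [1, 0, 1, 0]
# Two PEs write concurrently: one sets bit 1, the other clears bit 2.
masks = [make_mask(4, 1, 1), make_mask(4, 2, 0)]
print(write_page(page, masks))      # [1, 1, 0, 0]
```

Bits 0 and 3 receive no light and are left untouched, while the two writes to different bits of the same page do not interfere.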
4.2 Module Interconnection
The module interconnection network must allow parallel access of all PEs
to all m modules. Thus a separate communication channel is needed
between every processor and every module. This is equivalent to a
superposition of P independent 1 to m interconnects. For a large number
of processors and memory modules the enormous number of communication
lines means that only an optical implementation is feasible. Various
optical interconnection schemes described in the literature could be
modified for that purpose. In this section we propose an
opto-electronic, Banyan like network that we will call a tree of
multiplexers network. The advantage of this approach is that this type
of network can be constructed with components similar in functionality
and architecture to the memory modules. Furthermore, the network
structure is modular and can easily be scaled by adding modules.
Figure 8: The OCRCW-SM divided into small memory modules connected by a
tree of multiplexers.
Tree of Multiplexers Network The processors are connected to the modules
by a tree of multiplexers as shown in Figure 8. The tree has the height
h and the degree D. At each node there is a multiplexer with one input
and D outputs for each PE. The memory modules are located at the leaves
and have P I/O channels, one for each PE. Thus each PE has an
independent connection to each memory module.
Complexity In addition to the $m = D^h$ memory modules the network
requires
$\sum_{i=0}^{h-1} D^i \le 2 D^{h-1} \le m$
multiplexers. Each multiplexer performs P 1 to D mappings, directing the
input of each PE onto one of D alternative outputs. This is essentially
a simplified version of the functionality contained in a single module,
which performs P independent 1 to p mappings with $p \gg D$.
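The component count of the tree can be checked numerically. The sketch below assumes one multiplexer per internal node of a complete tree of degree D and height h (so the count is the geometric series over the levels) and verifies, for a few sample values chosen for illustration, that the number of multiplexers never exceeds the number of memory modules. The function name `tree_size` is hypothetical.

```python
def tree_size(D, h):
    """Return (memory modules, multiplexers) for a tree of degree D, height h."""
    modules = D ** h                                # leaves
    multiplexers = sum(D ** i for i in range(h))    # internal nodes
    return modules, multiplexers


for D, h in [(2, 3), (4, 5), (8, 4)]:
    m, mux = tree_size(D, h)
    # closed form of the geometric series
    assert mux == (D ** h - 1) // (D - 1)
    # bound on the number of multiplexers (holds for D >= 2)
    assert mux <= 2 * D ** (h - 1) <= m
    print(D, h, m, mux)
```

For a quad tree of height 5, for example, 1024 memory modules are served by 341 multiplexers.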
Cost We assume that the cost of building a single multiplexer module is
at most equal to the cost of a memory module. Dividing the OCRCW SM into
modules in the above manner then increases the number of components in
the system by at most a factor of
$\mathrm{cost} \le \frac{m + m}{m} = 2.$
The engineering effort does not increase since the additional components
are simplified versions of the memory modules. The multiplexer stages
have no influence on the number of switching operations needed for a
memory access since they are part of the standard address decoding
process.
5 Implementation Issues
In this section we briefly sketch possible optical implementations for
the memory modules and the module interconnection network. We also
discuss the availability and performance of the opto-electronic
components required for the realization of the system. The purpose of
this section is to provide a basis for the performance analysis and
feasibility estimation. A detailed consideration of the best technical
solutions is beyond the scope of the feasibility study and is the
subject of current and future research.
5.1 Optical Memory Module System
For the sake of simplicity the following description will be restricted
to a 2 dimensional system. The number of pages p will be assumed to be
equal to the number of processors P.
An exemplary read and write system for 3 processors and 3 pages is shown
in figures 9 and 10. The imaging of the memory pages on the detectors in
the read system and of the mask pages on the memory pages in the write
system is performed by an array of $2p - 1$ imaging lenses
$I_1, \dots, I_{2p-1}$, each used at 2f. The image of the memory
produced by each lens in the detector plane is shifted vertically by a
distance proportional to the vertical offset of the lens from the center
of the memory plane. This can be used to make sure that for every pair
$1 \le i, j \le p = P$ there is an imaging lens $I_{k(i,j)}$ that images
page i on detector j (or mask page j on page i in the write system). To
illuminate a selected memory page (or the mask page) each processor has
a beam steering mechanism that makes sure that all light passes only
through the appropriate lens $I_{k(i,j)}$. The beam steering mechanism
of every processor consists of an array of p laser diodes (LDs) located
off axis in front of a collimating lens. To guarantee the correct
illumination angle in the read setup an additional pair of illuminating
lenses Il separated by $2 f_{Il}$ is necessary in front of each memory
page.
5.1.1 Basic Lens Functionality
The system described below is composed of convex lenses, each
characterized by its diameter d and focal length f. In the following
paragraphs we will refer to the horizontal axis of the lens as the
optical axis. We will also use a coordinate system (lens coordinates)
with the origin located at the center of the lens. The planes
perpendicular to the optical axis, located at the distance f behind and
in front of the lens, are called the focal planes (back focal plane and
front focal plane respectively). For each point (object point) located
at a given distance $s_1$ in front of the lens and $y_1$ above the
optical axis the lens performs a mapping onto an image point located at
the distance $s_2$ behind the lens and $-y_2$ from the optical axis. In
the lens coordinate system this mapping reads
$(-s_1, y_1) \mapsto (s_2, -y_2)$
All light emitted from the object point is collected in the image point.
The coordinates of the image point can be derived from the coordinates
of the object point using the thin lens relations
$\frac{1}{s_1} + \frac{1}{s_2} = \frac{1}{f}, \qquad y_2 = y_1 \frac{s_2}{s_1}$
Two special cases of the above mapping are particularly important for
our system:
1. Light rays emerging from a single point source located in the front
focal plane of a lens end up as a parallel beam on the other side of the
lens. This corresponds to the image point at infinity. The width of the
beam is given by
$w = f \tan\alpha$ (7)
with $\alpha$ being the angle between the uppermost and the lowermost
rays emitted from the object point. [... equation (8) missing in the
source ...] with respect to the focal plane.
2. An object located at the distance of 2f on one side of a lens
produces an identical but inverted image at the distance of 2f on the
other side of the lens. In the lens coordinate system this can be
written as
$(-2f, y) \mapsto (2f, -y)$ (9)
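The second special case can be checked with a minimal sketch of the lens mapping. It assumes the standard thin lens relations $1/s_1 + 1/s_2 = 1/f$ and $y_2 = y_1 s_2 / s_1$; the function name `image_point` is hypothetical.

```python
def image_point(s1, y1, f):
    """Thin lens image of an object point, in lens coordinates.

    s1: object distance in front of the lens, y1: height above the axis.
    Returns (s2, y_image) with y_image = -y2 (the image is inverted).
    """
    if s1 == f:
        raise ValueError("object in the front focal plane: image at infinity")
    s2 = s1 * f / (s1 - f)          # from 1/s1 + 1/s2 = 1/f
    y2 = y1 * s2 / s1               # lateral magnification
    return s2, -y2


f = 10.0
print(image_point(2 * f, 3.0, f))   # (20.0, -3.0)
```

An object at 2f is imaged at 2f with unit magnification and inverted sign, which is the property the shifted lens array relies on.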












5.1.2 Reading a Page
The read setup consists of an illumination and an imaging system. The
former has P read units $RU_1, \dots, RU_P$ at $x_{RU}$ followed by p
illumination lenses $Il_1, \dots, Il_p$ at $x_{Il}$ in front of the MP
at $x_{MP}$. The MP has p pages $p_1, \dots, p_p$, each $d_p$ wide and
spaced by $\Delta_p$. The page spacing is chosen to be equal to the size
of an individual page
$\Delta_p = d_p$ (10)
The imaging system behind the MP is composed of an array of $2p - 1$
imaging lenses $I_1, \dots, I_{2p-1}$ at $x_I$ and P detector arrays
$DA_1, \dots, DA_P$ at $x_{DA}$.
Imaging System The imaging lens array is located at
$x_I = x_{MP} + 2 f_I = x_{DA} - 2 f_I$ (11)
between the MP and the DAs. Each lens has a diameter $d_I = d_p$ equal
to the diameter of a single page and is located either directly opposite
a page or between two pages. The size $d_d$ and spacing $\Delta_d$ of
the DAs directly correspond to those of the MP with
$d_d = d_p = \Delta_d = \Delta_p.$ (12)
Thus the vertical coordinates of the centers of the pages $y^{p}_{i}$,
the Is $y^{I}$ [... equations (13) to (15) missing in the source ...]
$k + j = 2i + 1$ (16)
The above setup results in $I_{(k+j-1)/2}$ imaging page $p_k$ on the
detector array $DA_j$. Since there are twice as many lenses as pages or
DAs, equation (16) can be satisfied for any pair $1 \le i, j \le p = P$.
Processor $P_j$ can thus read the page $p_k$ by illuminating it in such
a way that all light has to pass through $I_{(k+j-1)/2}$. This is
accomplished by a parallel beam of light passing through the page
towards the corresponding DA [... equation (17) missing in the
source ...]
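The role of the shifted lens array can be illustrated with a small geometric sketch. It relies only on the 2f imaging property (the image is inverted about the axis of the lens, so the lens must sit midway between the page center and the detector center) and assumes, for illustration, that pages and detector arrays share the same vertical coordinates $k d$ and $j d$. The function names are hypothetical.

```python
def lens_center(y_page, y_detector):
    """Vertical position of the 2f lens imaging a page center on a detector center.

    With 2f imaging, y_image = 2 * y_lens - y_object, so the lens sits midway.
    """
    return (y_page + y_detector) / 2.0


def lens_positions(p, d):
    """Distinct lens positions needed to image any of p pages on any of p detectors."""
    pages = [k * d for k in range(1, p + 1)]
    detectors = [j * d for j in range(1, p + 1)]
    return sorted({lens_center(yp, yd) for yp in pages for yd in detectors})


positions = lens_positions(p=3, d=1.0)
print(positions)        # [1.0, 1.5, 2.0, 2.5, 3.0]
print(len(positions))   # 5, i.e. 2p - 1 lenses
```

Under these assumptions the lenses lie either directly opposite a page or halfway between two pages, and $2p - 1$ of them suffice for all $p^2$ page and detector pairs.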







Illumination System For every processor j and every page k the task of
the illumination system is to direct a parallel beam of light towards
$p_k$ in such a way that it satisfies condition (17). This is done in
two stages. First a parallel beam of light is directed by the read unit
$RU_j$ towards the focal plane of $Il_k$ at the angle
$180^\circ - \theta_{k,j}$. According to (8) a light beam at the
required angle can be generated by a laser diode [...]
$y_{jk} = \tan(180^\circ - \theta_{k,j}) \cdot f_{RU} =$ [...]
Figure 9: A possible optical read setup for a single OCRCW SM memory
module using a shifted lens array as described in 5.1. Positions marked
on the x axis: $x_{RU}$, $x_{Il}$, $x_{MP}$, $x_I$, $x_{DA}$.
For the light beam to illuminate the focal plane of $Il_k$ [...] must be
satisfied. Assuming each illumination unit to [...] together with (12)
leads to the distance between the read units and the front focal plane
of the illumination units being equal to the distance between the memory
plane and the imaging lenses:
$(x_{Il} - f_{Il}) = x_I - x_{MP} = x_{DA} - x_I = 2 f_I$ (23)
Each illumination unit has two identical lenses with the focal length
$f_{Il}$ separated by $2 f_{Il}$. The Ils reverse the vertical direction
of the incoming beam and direct it toward the memory page k located
$f_{Il}$ behind the second lens at
$x_{MP} = x_{Il} + 3 f_{Il}$ (24)
at the desired angle $\theta_{k,j}$.
5.1.3 Writing a Page
In a sense the write process requires the reverse functionality of the
read system. For read access each processor must be able to image any
given page onto its own detector array. This can be viewed as a p to 1
mapping. In the write system on the other hand every processor must be
able to project its particular mask page on any given page, resulting in
a 1 to p mapping. The write setup can be derived from the read setup by
replacing the detector arrays with mask pages and placing the
horizontally flipped read units at $f_{RU}$ behind them as shown in
Figure 10. The combination of the mask page and the read unit
constitutes a write unit WU. This way the imaging system is operated 'in
reverse'. Thus the same lens $I_{(k+j-1)/2}$ that images memory page
$p_k$ on the detector array $DA_j$ images the mask page $MP_j$ on the
memory page $p_k$ (optical paths are reversible). For the light to reach
$I_{(k+j-1)/2}$, $MP_j$ must be illuminated at the same angle
$\theta_{k,j}$ as $p_k$ during read access by processor $P_j$. According
to (??) light directed towards page $p_k$ by $RU_j$ leaves the read unit
at the angle $180^\circ - \theta_{k,j}$. As a consequence the above
requirement is automatically satisfied when horizontally flipped read
units are placed at the distance $f_{RU}$ behind the corresponding mask
pages.
Figure 10: A possible optical write setup for a single OCRCW SM memory
module using a shifted lens array (see 5.1). Positions marked on the x
axis: $x_{Il}$, $x_{MP}$, $x_I$.
Figure 11: A possible optical setup for a single multiplexer stage
described in 5.2.1. Labels: addressing units, focusing units, receiver
arrays.
5.2 Module Interconnection Network
This section shows how the intermodule network described in 4.2 can be
implemented based on the technology and concepts used for the memory
modules. It uses smart pixels with one detector and D LDs for address
decoding and a shifted lens system similar to the page illumination
setup of the memory modules for beam steering to the next stage. For the
geometrical layout a 3D H-tree topology is chosen.
5.2.1 Multiplexer Modules
We assume that data from the processor to the memory is transferred in
packets preceded by a header containing the address. Each stage decodes
part of the address and sends the data accompanied by the rest of the
address to the next stage. To perform the decoding each stage has a
combination of a detector, decoding circuits and D LDs for every
processor. The LDs are combined with an optical system that makes sure
that each LD sends the data to the corresponding detector of a different
next stage module.
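The stage by stage decoding can be sketched as follows. The sketch assumes, purely for illustration, that the header carries the index of the target memory module as h digits in base D, most significant digit first, and that each stage consumes one digit to pick one of its D laser diodes. The function name `route` is hypothetical.

```python
def route(module_index, D, h):
    """Output port (LD index) chosen at each of the h multiplexer stages."""
    if not 0 <= module_index < D ** h:
        raise ValueError("module index out of range")
    ports = []
    remaining = module_index
    for stage in range(h):
        weight = D ** (h - 1 - stage)       # value of the current digit
        port, remaining = divmod(remaining, weight)
        ports.append(port)                  # this stage fires LD number `port`
    return ports


print(route(27, D=4, h=3))                  # [1, 2, 3]
```

Each stage only needs to inspect its own digit and forward the remainder, so no stage requires knowledge of the full address.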
Multiplexer Optical System The multiplexer optical system shown in
Figure 11 is a variation of the setup for illuminating memory pages in
the read process described in the previous section. It is based on the
fact that laser diode $LD_{ij}$ used by $RU_i$ to illuminate [...] in
the back focal plane of that lens. The multiplexer optical system thus
consists of an array of P addressing units $AU_1, \dots, AU_P$ at
$x_{AU}$ followed by an array of D focusing lenses $FL_1, \dots, FL_D$
at $x_{FL}$ and an array of D next stage receiver arrays
$RA_1, \dots, RA_D$ at
$x_{RA} = x_{FL} + f_{FL}.$ (26)
Figure 12: Topology of an H-tree connection.
The AUs are identical with the RUs except for the number of LDs, which
is D instead of p. The distance between the AUs and the FLs is
determined by [... equation missing in the source ...]
Taking $f_{AU} = f_{RU}$, $f_{Il} = f_{FL}$ and $y^{AU}_i = y^{RU}_i$
makes the dimensions of the multiplexer modules and memory modules
equal.
At each stage the data going from the memory modules to the processors
arrives from one of the D modules of the next stage and has to be sent
to the preceding stage. This can be accomplished without any electronic
address decoding by a purely optical setup using beam splitters.
5.2.2 Connecting the Multiplexer Modules
The objectives of the design of the geometrical layout of the network
are to find a technically feasible way of implementing the huge number
of intermodule connections and to realize the highest possible memory
density. The proposed solution is based on an H-tree topology, which is
often used in VLSI design (see Figure 12). It can be implemented in 2D
for binary and in 3D for quad trees. At each node the connections to the
sons are perpendicular to the connection from the parent. The length of
the connections between the nodes is reduced by half with every stage.
To realize a 3D quad tree each multiplexer module is a cube with the
front face connected to the predecessor and the sides connected to the
next stage modules. Using simple orthogonal mirrors the system described
in the previous section can be transformed to fulfill this requirement.
The level dependent length of the connection lines can be implemented
using additional connection modules consisting of a single lens with a
focal length of 1/4 of the cube's side length.
The above scheme guarantees a very good utilization of the available
volume. Alignment is required between neighboring modules only. Since
the density of communication channels is much smaller than in the memory
modules, no critical alignment problems arise (see footnote 2).
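The figure quoted in footnote 2 can be checked with a short calculation. The sketch assumes that the channels of all processors share one square face of a cube with a side length of 1 cm and that the face is divided evenly among 1000 processors; the variable names are hypothetical.

```python
import math

side_um = 1e4                 # cube side length: 1 cm expressed in micrometres
processors = 1000             # one channel per processor on a single face

area_per_channel = side_um ** 2 / processors    # in square micrometres
pitch = math.sqrt(area_per_channel)             # side of a square channel cell

print(round(pitch))           # 316
```

A pitch of roughly 316 µm per channel agrees with the approximate 300 × 300 µm stated in the footnote.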
5.3 Optoelectronic Components
Three types of opto-electronic components are required
to build the optical CRCW memory proposed
in this paper: light sources, light modulators and light
detectors. Of interest are only devices that can be
built and miniaturized in semiconductor technology
and can potentially be operated at GHz frequencies
at low driving energies. A comprehensive overview
of such devices and their operating principles can be
found in [40].
Lasers Much work has been performed in the area of semiconductor laser
arrays. Reference [22] reports a GaAs MQW laser array with a device
density of $2 \times 10^6/\mathrm{cm}^2$ and 0.5 mW per laser. The
experimental device had an active area of 15 mm^2 and was operated at
230 MHz. This type of device was used as a read out element for a
holographic page memory [34]. Recently LD arrays of up to 256 elements,
each 15 µm in diameter, with over 3 mW output and up to 1 GHz modulation
frequency became commercially available [35].
Light Modulators Light modulators are the most
critical opto-electronic components. Devices allowing
MHz or GHz modulation frequencies at low energies
became available only recently with the advance of
the MQW (multiple quantum well) SEED (self elec-
tro optic device) technology [32]. They exploit elec-
trically induced shifts of the absorption peak of exci-
tons (electron-whole pseudoatoms) conned in narrow
potential wells by thin layers of dierent semiconduc-
tors. Theoretically switching times of several picosec-
ond can be achieved in MQW SEEDs modulators.
(Footnote 2: Assuming a cubicle side length of 1 cm and 1000
processors allows an area of approx. 300 x 300 µm per channel.)
Light Detectors. A photodetector can be characterized
by two factors: the dependence of the photocurrent
on the absorbed optical energy (efficiency)
and the statistical "noise" current. For reliable signal
transmission, photocurrents an order of magnitude
greater than the noise current must be generated, leading
to a minimum reliably detectable energy. Silicon
avalanche pin photodiodes are cited with bit rates of
1 GHz at about 100 nW optical power or 100 MHz at
less than 10 nW.
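As a rough consistency check (a back-of-the-envelope estimate, not a figure taken from the cited devices), dividing the quoted optical power by the bit rate gives the optical energy received per bit. The symbols E_bit, P_opt and f_bit are introduced here for illustration only. Both quoted operating points correspond to roughly the same energy per bit, which fits the notion of a minimum reliably detectable energy:

```latex
E_{\mathrm{bit}} = \frac{P_{\mathrm{opt}}}{f_{\mathrm{bit}}}, \qquad
\frac{100\,\mathrm{nW}}{1\,\mathrm{GHz}} = 10^{-16}\,\mathrm{J}, \qquad
\frac{10\,\mathrm{nW}}{100\,\mathrm{MHz}} = 10^{-16}\,\mathrm{J}
```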
6 Parallel Word Level Write Consistency
It is important that a CRCW SM allows parallel
writes on word level as well as on bit level. However, a
majority decision on bit level can lead to inconsistent
results on word level. If several processors each try
to write a different value into the same memory
word w, the result of the operation need not correspond
to any of the original words. To avoid this difficulty
we propose an iterative backoff strategy on memory
module level. It relies on the ability of a memory word
to recognize a conflict, signal it, and use the b_w bits
of w to make all but P/b_w processors back off in every
iteration.
6.1 Backoff Mechanism
Memory Extensions. To signal a parallel write conflict
an additional control bit (conflict bit, CB) is
added to each memory word. In addition the SDs
and CDs of the bits of w must be wired in such a way
that
1. a simultaneous signal occurring at the SD and
   CD of any bit (i.e. an attempt to write conflicting
   values) can be detected and signalled by setting
   the CB, and
2. there is a mode of operation, activated by a set
   CB, which unambiguously selects and sets a single
   bit out of the set of all bits of w with an active
   SD.
WU Extensions. Every processor is assigned a
unique number N_PE. This number is encoded in base
b_w. For b_w = 8, octal representation would be used.
Each digit of N_PE is represented by b_w bits, of which
only the one corresponding to the value of this digit is
set. The octal two-digit number 13 would thus be
represented by the digits d_1 = 01000000, d_2 = 00010000.
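The encoding can be illustrated with a short sketch. The function name and its parameters are hypothetical and chosen here for illustration; the sketch assumes that bit positions are counted from the left starting at 0, which is what the octal example above suggests.

```python
def one_hot_digits(n_pe: int, b_w: int, n_digits: int) -> list[str]:
    """Encode processor number n_pe as n_digits base-b_w digits, most
    significant first, each digit as a b_w-bit string with a single 1."""
    if not 0 <= n_pe < b_w ** n_digits:
        raise ValueError("n_pe does not fit into n_digits base-b_w digits")
    digits = []
    for position in range(n_digits - 1, -1, -1):
        value = (n_pe // b_w ** position) % b_w
        # Assumption: bit positions are counted from the left, starting at 0.
        digits.append("0" * value + "1" + "0" * (b_w - value - 1))
    return digits


print(one_hot_digits(0o13, b_w=8, n_digits=2))  # ['01000000', '00010000']
```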
Backoff Protocol. A write operation on a memory word w
by a subset 𝒫 of P_wr ≤ P processors is performed
as described below:
1. The iteration number k is set to 0.
2. Each processor in 𝒫 writes its value for w to w.
3. If a conflict occurs then:
   (a) the CB bit of w is set,
   (b) all other bits of w are cleared.
4. Each processor in 𝒫 reads w to see if CB is set.
5. If CB is set then
   (a) each processor in 𝒫 writes the kth digit d_k
       of its N_PE (in the b_w-bit representation
       described above) to w,
   (b) of all the d_k written by the processors in
       𝒫, one is selected by the priority selection
       mechanism and the corresponding bit is set,
       while all other bits of w remain off,
   (c) each processor in 𝒫 reads w, compares it to
       its d_k and checks if CB is set,
   (d) all processors with d_k ≠ w leave 𝒫,
   (e) if CB is set then k is increased and all
       processors in 𝒫 go back to step 5a.
6. The last processor remaining in the set writes its value for w to w.
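The protocol can be simulated in software with a minimal sketch. It rests on several assumptions that the text leaves open: the priority selection mechanism is modelled as "the lowest digit value written wins", digits are consumed most significant first, and the rounds continue until a single contender is left. All function and parameter names are hypothetical.

```python
def digit(n_pe: int, k: int, b_w: int, n_digits: int) -> int:
    """Value of the k-th base-b_w digit of n_pe (k = 0 is the most significant)."""
    return (n_pe // b_w ** (n_digits - 1 - k)) % b_w


def resolve_write_conflict(writes: dict[int, int], b_w: int, n_digits: int):
    """Simulate the backoff rounds for processors writing to one word.

    writes maps each contending processor number N_PE to the value it wants
    to store. Returns (winning processor, stored value, number of rounds).
    """
    if any(not 0 <= pe < b_w ** n_digits for pe in writes):
        raise ValueError("processor number does not fit into n_digits digits")
    contenders = set(writes)
    rounds = 0
    if len(set(writes.values())) > 1:  # conflicting values: CB is set
        while len(contenders) > 1:
            k = rounds
            # Steps 5a/5b: every contender writes its digit d_k; the assumed
            # priority rule keeps the lowest digit value that was written.
            selected = min(digit(pe, k, b_w, n_digits) for pe in contenders)
            # Steps 5c/5d: processors whose digit differs leave the set.
            contenders = {pe for pe in contenders
                          if digit(pe, k, b_w, n_digits) == selected}
            rounds += 1
    winner = min(contenders)
    return winner, writes[winner], rounds


# Octal processor numbers 13, 17 and 23 write different values to one word.
print(resolve_write_conflict({0o13: 111, 0o17: 222, 0o23: 333}, b_w=8, n_digits=2))
# (11, 111, 2)
```

Because the processor numbers are unique, at most n_digits rounds are needed in this model: any two contenders differ in at least one digit, so they cannot both survive all rounds.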
6.2 Cost
The backoff strategy makes sure that the number
of processors P_k remaining in the race after k
iterations satisfies

    P_k \le \frac{P}{b_w^{k}}

Thus the maximum number of iterations needed to
solve a write conflict is log_{b_w}(P). For b_w = 32 this
gives a maximum of 3 iteration steps for up to 32768
processors and 4 iteration steps for up to 1048576
processors. Since the operation involves a single module
rather than the whole memory, the actual impact of a
parallel write conflict on the write access speed is even
less than the number of backoff iterations suggests.
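The quoted iteration counts can be checked with a few lines of code. This is only a sketch: the function name is hypothetical, and the bound is computed as the smallest k with b_w^k >= P using integer arithmetic, which avoids rounding problems of floating-point logarithms.

```python
def max_backoff_iterations(n_processors: int, b_w: int) -> int:
    """Smallest k with b_w**k >= n_processors, i.e. ceil(log_{b_w}(P))."""
    k, reach = 0, 1
    while reach < n_processors:
        reach *= b_w
        k += 1
    return k


print(max_backoff_iterations(32768, 32))    # 3
print(max_backoff_iterations(1048576, 32))  # 4
```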
7 Evaluation
In this section we discuss the limits on the performance
of the CRCW-SM described in the previous
sections. Our purpose is not to provide a strict, exact
analysis, which is beyond the scope of this paper.
Instead we want to show that a system could be feasible
in the near future.
7.1 Performance Limits
The performance of the memory is given by the
memory density  M determined by the minimal al-
lowed bit radius rbit, the degree of concurrency P , the
maximal feasible capacity of the memory M and the
access latency L. The above are determined by the
optical limits on components density as well as the
performance and fabrication density of the opto elec-
tronic components.
Optical Memory Module System. The fundamental,
implementation-independent limits on the memory
density and concurrency are given by equations
(2) and (3). The degree of space variance of the system is
max(P, p), the maximum of the number of processors





with D_MP being the diameter of the memory plane.
The imaging of memory pages on the detectors is space








In the setup proposed in section 5 the maximum
memory density  M determined by the minimal radius
of a single bit rbit can be derived by applying equation





The above equation suggests no direct dependence of
the memory density on the number of accessing
processors. However, one has to keep in mind that for
every accessing processor 2 additional imaging lenses








Thus for a fixed size of the memory plane and constant
f_I the memory density decreases linearly with








To keep the memory density constant, either the memory
plane size must be kept proportional to \sqrt{P}
(meaning that M must increase linearly with the number
of processors) or the focal length f_I must be
decreased.









The problem with this approach is that it causes the
maximum angle of the light beams in the system maxij









Memory Module Interconnection Network. The
optical system of the memory module interconnection
network is directly derived from the memory module
system. Since D is much smaller than p, a multiplexer
module with a size not exceeding that of a memory
module can easily be implemented. In case of a
3D quad tree, a multiplexer module for 400 processors
would require a connection density p/D = P/D = 100
times smaller than that of a memory module.
Opto-Electronic Component Constraints. All
the opto-electronic components discussed in section 5.3
have been fabricated in arrays with densities of more
than 166 elements per mm^2 on semiconductor wafers.
They can be operated at frequencies of up to 10 GHz.
Current MQW devices (modulators and VCSEL
LDs) operate at a wavelength of λ ≈ 850 nm.
Achieving other wavelengths requires combining different
semiconductors, which poses some technological
problems. However, the first VCSELs emitting in
the ultraviolet with λ ≈ 200 nm were recently reported.
Overall Performance. The fundamental limit on the
density of an OCRCW SM arises from the limitation of
the optical system. Assuming a wavelength of 1 µm,
2 f_I / D_MP = 0.1, lenses with f_I / d_I = 2 and a
two-detector, one-modulator bit structure, about 100 KBit
could in theory be accommodated on one cm^2. According
to (33) about 400 processors could access such a
memory module concurrently. Using the 3D H-tree
multiplexer network, some 400 such modules
could be accommodated in a cube with 10 cm side
length. Reducing the wavelength to 500 nm (green
light) would result in around 400 KBit memory module
capacity or 1600 accessing processors. If it were possible
to use UV light, the optical system would allow
up to 1 GBit and thousands of processors. On module
level the latency and bit rate are limited by the performance
of the optoelectronic components rather than by
the fundamental limit on the space and time bandwidth
product. While energy and heat considerations
might make GHz systems extremely difficult to implement,
in theory module level latencies of 100 ps are
conceivable. Such high device speeds would allow
time multiplexing to be used to increase the capacity by
a factor of 10^2 or more while retaining acceptable module
access latency. The overall latency must take into account
the optical path length through the multiplexer
network. For a 10^3 cm^3 cubicle the path to the modules
is approx. 20 cm, leading to a delay of about 1 ns.
Including the delay caused by the multiplexing stages
(also assumed to work at up to the maximal rate of some
10 GHz) gives an overall latency of less than 2 ns:

    L < 2 ns    (35)
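The figures in this paragraph can be reproduced with a small sketch. It assumes that module capacity and concurrency both scale as 1/wavelength^2, which is what the quoted step from 1 µm to 500 nm implies, and it takes 200 nm as the UV wavelength and the vacuum speed of light for the path delay. All names are hypothetical.

```python
C_VACUUM_M_PER_S = 299_792_458  # speed of light in vacuum


def scale_with_wavelength(value_at_1um: float, wavelength_nm: float) -> float:
    """Scale a figure quoted for 1 um light, assuming it grows as
    1 / wavelength**2 (an assumption inferred from the quoted numbers)."""
    return value_at_1um * (1000.0 / wavelength_nm) ** 2


def path_delay_ns(path_length_cm: float) -> float:
    """Free-space propagation delay over the given optical path length."""
    return path_length_cm / 100.0 / C_VACUUM_M_PER_S * 1e9


module_bits_green = scale_with_wavelength(100e3, 500)    # bits per module at 500 nm
processors_green = scale_with_wavelength(400, 500)       # concurrent processors at 500 nm
total_bits_uv = 400 * scale_with_wavelength(100e3, 200)  # 400 modules at 200 nm
delay_ns = path_delay_ns(20)                             # 20 cm path to the modules

print(module_bits_green, processors_green, total_bits_uv, round(delay_ns, 2))
```

Under these assumptions the sketch gives 400 KBit and 1600 processors per module for green light, 1 GBit in total for 400 modules in the UV case, and a propagation delay of roughly 0.67 ns, in line with the "about 1 ns" quoted above.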
7.2 Feasibility
Looking at the feasibility of the OCRCW-SM proposed
in this paper we have to consider two aspects:
the difference between fundamental and practical
performance limits, and the fact that opto-electronics is
an emerging technology that in many areas is still far
from achieving its full potential.
Fundamental vs. Technical Limits. Important
practical aspects that were left out of the fundamental
limits analysis are crosstalk due to imperfections
of the optical systems, power considerations and heat
problems. The impact of the above factors depends on
the exact technical implementation of the system and
is difficult to judge. In [12] the authors estimate that
taking crosstalk into account reduces the maximum
interconnection density by a factor of about 10.
Limits of Current Technology. The components
necessary for the implementation of the OCRCW-SM
described in this paper have all been demonstrated
by different groups in laboratory experiments [22, 26].
Some are even commercially available [35]. They have
been combined to form complex systems on an optical
bench in various experiments [8, 38]. The integration
of similar (less complex) systems in compact
modules has also been reported [9, 21]. However,
the complexity of such systems is currently still below
the complexity of an OCRCW-SM large enough for
practical purposes. Nevertheless the required technology
is emerging, with a large community of scientists
working on demonstrating larger and more complex
systems.
7.3 Opto-Electronic vs. Electronic
Shared Memory
As discussed in section 2.1, the need for physically distinct
connections and the 2D nature of electronic VLSI
circuits lead to an O(P^2) memory density reduction
for concurrently accessible electronic memory devices.
For the OCRCW-SM proposed in this paper the cost
of concurrent access by P processors is O(P) for fixed
and O(1) for P-dependent maximal illumination angle.
Thus as far as concurrency is concerned the optical
memory proposed in this paper has a considerable
fundamental advantage over electronic memory. While it
is now widely acknowledged that electronic concurrent
memory is not suitable for more than a few processors,
the above suggests that OCRCW-SM might well be
useful for large scale parallel systems. Particularly
interesting is the possibility of using small amounts of
OCRCW-SM together with conventional memory as
parallel 'caches'.
8 Conclusion and Future Work
We have presented a design for a scalable OCRCW
SM that overcomes the optical system complexity
problem through modularization and multistage
addressing. An optical system based on LD arrays and
lenses was proposed to implement this architecture.
A synchronization scheme was discussed that realizes
word-consistent concurrent write operation with
negligible overhead even for large numbers of processors.
A study of the fundamental performance constraints
has shown that for the above memory system the cost
of concurrent access by P processors is O(P) for fixed
and O(1) for P-dependent maximal illumination angle,
as opposed to the O(P^2) cost in electronic multiport
memory. At the same time the capacity and latency of
the memory can in theory compete with conventional
electronic memory.
We conclude that the OCRCW SM is an interesting
alternative to the simulation of shared memory. However,
a lot of research must yet be conducted before even
a small practical device can be built. In particular we
must keep in mind that an implementation of a large
system must deal with aberrations, crosstalk, stability
and power consumption, aspects that were not considered
in this paper. Problems that have to be tackled
in the future include:
1. a detailed study of possible, technically feasible
   implementations of the optical system
2. implementation of proof-of-principle lab models
3. a study of the practical and technological performance
   limits that takes into account optical system
   aberrations as well as power and heat dissipation
4. a study of possible computer architectures using
   limited amounts of OCRCW-SM in combination
   with conventional memory
5. simulations comparing the performance of machines
   using OCRCW-SM varying in size and performance
   to more conventional parallel machines.
References
[1] Ferri Abolhassan, Reinhard Drefenstedt, Jörg
    Keller, Wolfgang J. Paul, and Dieter Scheerer. On
    the physical design of PRAMs. Computer Journal,
    36(8):756-762, December 1993.
[2] John K. Bennett, John B. Carter, and Willy
    Zwaenepoel. Munin: Shared memory for distributed
    memory multiprocessors. Technical Report
    Rice COMP TR89-91, Rice University, April 1989.
[3] K.H. Brenner and T.M. Merklein. Implementation of
    an optical crossbar network based on directional
    switches. Applied Optics, 31(14):2446, 1992.
[4] Siddhartha Chatterjee. Compiling nested data-parallel
    programs for shared memory multiprocessors.
    ACM Transactions on Programming Languages and
    Systems, 15(3):400-462, July 1993.
[5] Clare and Keith Jenkins. Shared-memory
    optical/electronic computer: Architecture and control.
    Applied Optics, 33(8):1559, January 1994.
[6] J.L. de Bougrenet de la Tocnaye and J.R. Brocklehurst.
    Parallel access read/write memory using
    an optically addressed ferroelectric spatial light
    modulator. Applied Optics, 30(2):179-180, January 1991.
[7] D. J. Drabik. Optoelectronic integrated systems
    based on free-space interconnects with an arbitrary
    degree of space variance. Proc. of the IEEE,
    82(11):1595, November 1994.
[8] F.B. McCormick et al. Five-stage free-space optical
    switching network with field-effect transistor
    self-electro-optic-effect-device smart-pixel arrays.
    Applied Optics, 33(8):1601, March 1994.
[9] W.S. Lacy et al. A fine grain, high-throughput
    architecture using through-wafer optical interconnect.
    In Proc. First International Workshop on
    Massively Parallel Processing Using Optical
    Interconnects, Cancun, Mexico, April 1994. IEEE
    Computer Society Press.
[10] D.G. Feitelson. Optical Computing. A Survey for
     Computer Scientists. The MIT Press, 1988.
[11] M.R. Feldman, T.J. Drabik, S.C. Esener, and
     C.C. Guest. Comparison between optical and
     electrical interconnects for fine grain processor
     arrays based on interconnect density capabilities.
     Applied Optics, 28(18):3820-3829, September 1989.
[12] W. Erhard and D. Fey. Parallele Digitale Optische
     Recheneinheiten. B.G. Teubner Stuttgart, 1994.
[13] Martti J. Forsell. Are multiport memories physically
     feasible? SIGARCH Notices, 1994.
[14] S. Fortune and J. Wyllie. Parallelism in random
     access machines. In ACM Symposium on Theory
     of Computation, pages 114-118, May 1-3 1978.
[15] E.E. Frietman, W. van Nifterick, L. Dekker, and
     T.J.M. Jongeling. Parallel optical interconnects:
     Implementation of optoelectronics in multiprocessor
     architectures. Applied Optics, 29(8):1160-1177,
     March 1990.
[16] Wolfgang K. Giloi. Bedeutung, Ziele und Ergebnisse
     des Parallelrechner-Entwicklungsprojekts
     MANNA. GMD-Spiegel, (3), 1993.
[17] T. J. Harris. A survey of PRAM simulation techniques.
     Technical Report CSR-23-92, University
     of Edinburgh, Department of Computer Science, 1992.
[18] Ernst A. Heinz and Michael Philippsen. Synchronization
     barrier elimination in synchronous
     FORALLs. Technical Report No. 13/93, University
     of Karlsruhe, Department of Informatics, April 1993.
[19] A. Huang. Architectural considerations in optical
     digital computers. Proc. of the IEEE, 72(7):780, 1984.
[20] A. Huang, Y. Tsunoda, J.W. Goodman, and S. Ishihara.
     Optical computing using residue arithmetic.
     Applied Optics, 18(2):149, 1979.
[21] J. Jahns. Planar packaging of free-space optical
     interconnections. Proc. of the IEEE, 82(11):1623,
     November 1994.
[22] J. Jewell, A. Scherer, S.L. McCall, Y.H. Lee, J.P.
     Harbison, and L.T. Florez. Low-threshold electrically
     pumped vertical-cavity surface-emitting microlasers.
     Electronics Letters, 25(17):1123-1124, 1989.
[23] M. Koyanagi, H. Takato, H. Mori, and J. Iba. Design
     of 4kbit x 4-layer optically coupled three dimensional
     common memory for parallel processor
     system. IEEE Journal of Solid State Circuits,
     25(1):109-116, February 1990.
[24] L. Cheng and A.A. Sawchuk. Considerations for
     optoelectronic shared cache parallel computers.
     In Proc. of the First International Workshop on
     Massively Parallel Processing Using Optical
     Interconnections, page 241. IEEE Computer Society
     Press, 1994.
[25] Daniel Lenoski, James Laudon, Kourosh Gharachorloo,
     Wolf-Dietrich Weber, Anoop Gupta,
     John Hennessy, Mark Horowitz, and Monica S.
     Lam. The Stanford DASH multiprocessor. IEEE
     Computer, 25(3):63-79, March 1992.
[26] A.L. Lentine, H.S. Hinton, D.A.B. Miller, J.E.
     Henry, J.E. Cunningham, and L.M.F. Chirovsky.
     Symmetric self-electrooptic effect device:
     Optical set-reset latch, differential logic gate, and
     differential modulator/detector. IEEE Journal of
     Quantum Electronics, 25(8):1925-1936, August 1989.
[27] Kai Li. IVY: A shared virtual memory system
     for parallel computing. Proceedings of the 1988
     International Conference on Parallel Processing,
     2:94-101, August 1988.
[28] A.W. Lohmann. What classical optics can do
     for the digital optical computer. Applied Optics,
     25(10):1543, 1986.
[29] A.W. Lohmann and A.S. Marathay. Globality and
     speed of optical parallel processors. Applied Optics,
     28(18):3838, 1989.
[30] A.D. McAulay. Optical Computer Architectures.
     John Wiley & Sons, Inc., 1991.
[31] A.D. McAulay, J. Wang, and X. Xu. Optical word
     parallel interconnections between optical random
     access memories. Proceedings of the SPIE, 1704, 1992.
[32] D.A.B. Miller. Quantum wells for optical
     information processing. Optical Engineering,
     26(5):369-373, May 1987.
[33] N. Streibl, K.H. Brenner, A. Huang, J. Jahns,
     J. Jewell, A.W. Lohmann, D.A.B. Miller,
     M.J. Murdocca, M.E. Prise, and T. Sizer. Digital
     optics. Proc. of the IEEE, 77(12):1954, 1989.
[34] E.G. Paek, J.R. Wullert, M.J.A. Von Lehmen,
     A. Scherer, J. Harbison, L.T. Florez, H.J. Yoo,
     and R. Martin. Compact and ultrafast holographic
     memory using a surface-emitting microlaser
     diode array. Optics Letters, 15(6):341-343,
     March 1990.
[35] Photonics Research, Inc. Product Information, 1993.
[36] T. M. Pinkston. The GLORI strategy for multiprocessors:
     Integrating optics into the interconnect
     architecture. Technical Report CSL-TR-92-552,
     Stanford University, Department of Computer
     Science, December 1992.
[37] Abhiram G. Ranade. How to emulate shared
     memory. In Proceedings of the Annual Symposium
     on Foundations of Computer Science, pages
     185-194. IEEE, 1987. Yale University.
[38] I. Redmond and E. Schenfeld. A distributed,
     reconfigurable free-space optical interconnection
     network for massively parallel processing
     architectures. In Proc. Optical Computing 1994
     OC'94, page 373, Edinburgh, Scotland, 1994.
[39] Jeffrey Kuskin, David Ofelt, Mark Heinrich, John
     Heinlein, Richard Simoni, Kourosh Gharachorloo,
     John Chapin, David Nakahira, Joel Baxter, Mark
     Horowitz, Anoop Gupta, Mendel Rosenblum, and
     John Hennessy. The Stanford FLASH multiprocessor.
     In Proceedings of the 21st International Symposium
     on Computer Architecture, 1994.
[40] B.E.A. Saleh and M.C. Teich. Fundamentals of
     Photonics. John Wiley & Sons, Inc., 1991.
[41] J. Schwinder, W. Stork, and R. Völkel. Possibilities
     and limitations of space-variant holographic
     optical elements for switching networks and general
     interconnects. Applied Optics, 31(35):7403-7410,
     December 1992.
[42] Thomas J. Sheffler, Robert Schreiber, John R.
     Gilbert, and Siddhartha Chatterjee. Aligning
     parallel arrays to reduce communication. In
     Frontiers '95: The 5th Symp. on the Frontiers of
     Massively Parallel Computation, pages 324-331,
     McLean, VA, February 6-9, 1995.
[43] W. Stork. Optische Kommunikationsnetzwerke.
     PhD thesis, Universität Erlangen-Nürnberg, 1989.
