Abstract. This paper introduces programmable arrays of optically interconnected electronic processors and compares them with conventional symbolic substitution (SS) systems. The comparison is made on the basis of computational efficiency, speed, size, energy utilization, programmability, and fault tolerance. The small grain size and space-invariant connections of SS lead to poor computational efficiency, difficult programming, and difficult incorporation of fault tolerance. Reliance on optical gates as its fundamental building elements is shown to give poor energy utilization. Programmable optoelectronic multiprocessor (POEM) systems, on the other hand, provide the architectural flexibility for good computational efficiency, use an energy-efficient combination of technologies, and support traditional programming methodologies and fault tolerance. Although the inherent clock speed of POEM systems is slower than that of SS systems, for most problems they will provide greater computational throughput. This comparison does not take into account the recent addition of crossover interconnect and space-variant masks to the SS architecture.
INTRODUCTION
The planar nature of electronic very large scale integration (VLSI) technology imposes limits on parallel electronic computing interconnect latency and area.1 Free-space optically interconnected processing elements (PEs) offer an opportunity to remove this limitation by providing interconnections in three dimensions.2 We describe here general-purpose computing systems currently under investigation at the University of California, San Diego, that integrate optoelectronic PEs and free-space programmable optical interconnects. These systems combine the advantages of efficient processing abilities of silicon technology and programmable global communication provided by optical interconnects. We call these systems programmable optoelectronic multiprocessor (POEM) systems.5 To place the characteristics of POEMs in context, we will compare them with an alternative general-purpose optical computing system based on symbolic substitution (SS) that has been presented by Huang et al.6,7 and Kozaitis.8 Both POEM and SS are being proposed for achieving high-performance, general-purpose, parallel computing. In this paper we examine the performance potentials and technological limits of these two systems.
The evaluation of these systems is based on their ability to implement various algorithms efficiently, the power and area requirements of existing and projected technologies to implement them, fault tolerance, and ease of programming.
Section 2 provides architectural descriptions as well as example implementations of POEM and SS systems. In Sec. 3 we establish the computational equivalence of SS systems to a 2-D mesh of VLSI processors. Technological considerations are discussed in Sec. 4, including system size, speed, and energy dissipation. In Sec. 5 the relative merits of POEM and SS systems are compared. Section 6 presents our conclusions.
SUMMARY DESCRIPTIONS OF POEM AND SS
In this section we describe briefly the architectures and fundamental features of POEM and SS. Specific characteristics important for the comparison of the systems are emphasized.
POEM architecture

Architecture description
POEM systems have a highly parallel architecture based on wafer-scale integration of optoelectronic PEs and reconfigurable free-space optical interconnects. The POEM machine can be realized with an integrated optoelectronic technology, such as silicon/PLZT9,10 for the PE arrays, and dichromated gelatin as the volume holographic storage medium for the interconnects. The POEM architecture can be extended to be reprogrammable or reconfigurable using a real-time volume holographic medium such as photorefractive crystals.
The POEM architecture uses electrical interconnects for local communication within a PE and holographic optical interconnects for global communication among PEs. As shown in Ref. 11, for interconnections longer than a certain break-even length, free-space holographic optical interconnects consume less energy and are faster than their electrical counterparts. Also, free-space interconnects are immune to the crossover constraints of planar electronic technology, allowing denser interconnection topologies. Furthermore, they release space in the processing planes used for interconnects, allowing more silicon circuitry on the wafer. The POEM machines use light modulators as optical transmitters. Compared with active light sources such as lasers or light-emitting diodes, light modulators are attractive because they may be easier to integrate with silicon and because they dissipate less power on-wafer since electrical-to-optical conversion power is dissipated off-wafer. This also allows on-wafer power dissipation to be independent of the fan-out of the processor communication network if electro-optic light modulators are used.
The POEM architecture can support any variation of the parameters commonly used to classify parallel architectures: granularity (fine, coarse, or large grain), synchrony [single instruction stream-multiple data stream (SIMD) or multiple instruction stream-multiple data stream (MIMD)], and topology. The strength of POEM machines comes from their efficient implementation of interconnections and the large degree of parallelism and connectivity that is inherent in free-space programmable global optical interconnections.
Implementation
As an example, we describe a fine-grain POEM machine [Fig. 1(a)] containing a very large number (100,000 or more) of simple one-bit silicon processors. An optoelectronic controller, connected to a sequential host computer, is used to optically broadcast the instruction stream and master clock through a computer-generated hologram to the PEs for SIMD processing. The global interprocessor communication in POEM is implemented by activating different interconnection holograms in a volume holographic material of large storage capacity, such as dichromated gelatin. Each interconnection hologram is recorded with a different random phase code. These holograms can be activated independently at speeds compatible with the system clock rate by displaying the appropriate random phase code on a small spatial light modulator. Therefore, unlike conventional parallel systems, there are no limitations from fixed interconnection topology among the processors. Instead, the programmable optical interconnects are determined by the optoelectronic controller. Therefore, the programmer can implement a topology that best matches the current algorithm. In addition, the interconnection storage capacity requirement on the holographic material can be reduced if real-time reprogrammable material requirements can be added. For example, one may envision using photorefractive crystals or other nonlinear optical materials to apply reprogrammable interconnects to the PEs. In this case, the user will be capable of reconfiguring the POEM in a very short time to match his algorithmic requirements.
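The selection of one stored interconnection pattern by its phase code can be illustrated numerically. The sketch below is a toy model rather than a description of the optical hardware: it assumes binary (+1/-1) phase codes, models each hologram's response as the correlation between the displayed code and the recorded code, and uses hypothetical function names (`make_codes`, `readout`) and hypothetical topology labels.

```python
import random

def make_codes(num_holograms, code_length, seed=0):
    """Draw one random binary phase code (+1/-1 entries) per hologram."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(code_length)]
            for _ in range(num_holograms)]

def readout(codes, displayed):
    """Correlate the displayed code with every recorded code.

    The matching hologram responds with strength code_length; the others
    respond only with the much weaker cross-correlation.
    """
    return [sum(a * b for a, b in zip(code, displayed)) for code in codes]

# Hypothetical topology labels, one per stored interconnection hologram.
topologies = ["mesh", "hypercube", "perfect shuffle", "binary tree"]
codes = make_codes(len(topologies), code_length=256)

# The controller "displays" the code of the topology it wants to activate.
wanted = topologies.index("hypercube")
response = readout(codes, codes[wanted])
selected = max(range(len(response)), key=lambda i: response[i])
print(topologies[selected], response)
```

In this model, switching topology amounts to displaying a different code, which is why the reconfiguration rate is tied to the modulator rather than to the processors.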
KIAMILEV, ESENER, PATURI, FAINMAN, MERGER, GUEST, LEE
The internal data paths of the PEs are implemented electrically as in a common electronic processor. Each PE has the capability to perform logic, conditional execution, data movement, and I/O operations [Fig. 1(b)]. Also, each PE has some local random-access memory (RAM) to support the conventional programming models. In general, the grain size of the PEs is governed by the break-even interconnection distance found by equating the energy required by the local and global interconnects, and by the computational and concurrency requirements imposed by a given application. For some applications, the amount of required memory governs the grain size of the PE, resulting in nonscalable systems. In POEM, the physical size of a PE may be governed by the size of the RAM even for a small number of storage cells. However, a RAM function is crucial for performing context switching, that is, for handling a number of processes larger than the number of PEs in the system. Optical memory systems that will support large memory bandwidth and large storage capacity will remove these limitations and increase the range of application of POEM systems.
The fine-grain POEM machine was designed to apply parallelism to a wide variety of algorithms. However, because of the programmability of optical interconnects and the large number of simple PEs, it is particularly effective for the rapid execution of symbolic information processing tasks and graph algorithms. The fine-grain POEM machine is expected to offer flexibility and high performance in the rapid execution of semantic networks, production systems, management of large knowledge bases, transportation and communication optimization problems, computer-aided design, VLSI circuit simulation, parallel databases, and game playing. For example, consider the implementation of a parallel knowledge-base system with POEM architecture. Theoretical work by Fahlman11 has shown that storing knowledge as a pattern of interconnections between many very simple PEs allows searches to be performed very quickly. The basic idea is to store the knowledge as a graph in which individual concepts are assigned to PEs and the interconnections between the PEs represent the relations between the concepts. Search operations are performed by marking specific node processors and then propagating these markers in parallel through the network. The set of conventions and processing algorithms for representing the knowledge in such a parallel network is called NETL. Fahlman has shown that NETL is capable of performing search operations on the knowledge base, simple deductions, learning, consistency checks, matching, and symbolic recognition tasks. The important and unique feature of NETL is that the time required to perform a search is essentially a constant, independent of the size of the knowledge base. This is to be contrasted with sequential AI systems, whose search time increases linearly with the size of the knowledge base. The NETL system can be directly mapped onto the POEM hardware12 by programming the optical interconnects.
The large number of PEs allows POEM NETL to be used in implementing large knowledge bases. The programmable interconnects of POEM remove the overhead and latency that would occur if NETL were mapped onto a machine with a fixed interconnection.
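The marker-propagation idea can be sketched in a few lines. This is a simplified illustration and not Fahlman's actual NETL conventions: the function name `propagate`, the single "is a" link type, and the toy knowledge base are all assumptions made for the example. Each pass of the loop stands for one step in which every marked node forwards its marker over all of its links at once.

```python
def propagate(links, start_nodes):
    """Spread a marker from start_nodes along directed links.

    Every newly marked node passes the marker to all of its neighbours in
    the same step, mimicking what the PEs would do in parallel.
    Returns the set of marked nodes and the number of parallel steps used.
    """
    marked = set(start_nodes)
    frontier = set(start_nodes)
    steps = 0
    while frontier:
        reached = {dst for src in frontier for dst in links.get(src, ())}
        frontier = reached - marked
        if frontier:
            marked |= frontier
            steps += 1
    return marked, steps

# Toy knowledge base: each concept points to the concepts it "is a" kind of.
is_a = {
    "Clyde": ["elephant"],
    "elephant": ["mammal"],
    "mammal": ["animal"],
    "Tweety": ["canary"],
    "canary": ["bird"],
    "bird": ["animal"],
}

marked, steps = propagate(is_a, ["Clyde"])
print("Clyde is an animal:", "animal" in marked, "after", steps, "steps")
```

In this sketch the step count follows the length of the chain being traversed, not the number of concepts stored, which is the sense in which a one-PE-per-concept machine can keep search time roughly independent of knowledge-base size.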
SS-based computing systems
In this section we give a brief review of SS-based computing systems and some of the proposed optical implementations.
Architecture description
The idea of SS is derived from cellular automata considered by von Neumann,13 in which locally interconnected cells evolve using certain transition rules. The motivation for considering such computational models is the desire to show that a collection of locally interconnected devices (cells) governed by simple transition rules can exhibit interesting computational properties. SS is an elaboration of the idea of cellular automata, suited for optical implementation.7 It is a pattern rewriting procedure that operates in a parallel and space-invariant fashion on a 2-D plane of binary pixels. Every occurrence of a given pattern is replaced by another pattern. Each such pair of patterns is called a substitution or a transition rule. A pattern is a k × k square of pixels in which certain pixels are required to have specific binary values. An example of a rule is shown in Fig. 2. All occurrences of the left-hand side (LHS) pattern are simultaneously replaced by the right-hand side (RHS) pattern. Since a pixel can be common to several shifted versions of the replacement pattern, the information in that pixel as a result of the replacement is a logical OR of the corresponding pixels.
Brenner, Huang, and Streibl7 have suggested a set of substitution rules that are adequate to perform logical operations, thus demonstrating that SS is a general-purpose computing system. Murdocca14 proposed a general-purpose SS system that consists of only one substitution rule (Fig. 3). The choice of substitution rules is determined by such criteria as universality, simplicity, ease of implementation, and efficiency. In particular, we show in Sec. 5.1.2 that the "complexity" of a rule influences the energy dissipation of the system.
A general-purpose computing system that employs SS has the following structure: The binary plane contains an encoding of the input data and control bits. The substitution rule is applied to this plane repeatedly for a predetermined number of cycles. We can think of the control bits as the program. If we have several different rules, these can be applied serially or in parallel. When they are applied in parallel, the resultant plane would be the OR of the resultant planes from the individual substitution rules.
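The structure just described (rules applied everywhere at once, overlapping replacements ORed, parallel rules ORed, a fixed number of cycles) can be sketched as a small software model. The representation is an assumption made for illustration: a plane is a list of rows of 0/1 pixels, an LHS is a dictionary from pixel offsets to required values, and an RHS is a list of offsets that become bright. The names `apply_rule` and `run` are hypothetical, and the two rules shown are not the rules of Refs. 7 or 14.

```python
def apply_rule(plane, lhs, rhs):
    """Apply one substitution rule at every position simultaneously.

    plane: list of rows of 0/1 pixels.
    lhs:   dict {(dr, dc): required value} giving the pattern to recognize.
    rhs:   list of (dr, dc) offsets that become bright at each match.
    Returns a new plane; overlapping replacements are ORed together.
    """
    rows, cols = len(plane), len(plane[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            match = all(
                0 <= r + dr < rows and 0 <= c + dc < cols
                and plane[r + dr][c + dc] == value
                for (dr, dc), value in lhs.items()
            )
            if match:
                for dr, dc in rhs:
                    if 0 <= r + dr < rows and 0 <= c + dc < cols:
                        out[r + dr][c + dc] = 1
    return out

def run(plane, rules, cycles):
    """Apply all rules in parallel (OR of their outputs) for a fixed number of cycles."""
    for _ in range(cycles):
        results = [apply_rule(plane, lhs, rhs) for lhs, rhs in rules]
        plane = [[int(any(res[r][c] for res in results))
                  for c in range(len(plane[0]))]
                 for r in range(len(plane))]
    return plane

# Two illustrative rules: move a bright pixel one place right, or one place down.
shift_right = ({(0, 0): 1}, [(0, 1)])
shift_down = ({(0, 0): 1}, [(1, 0)])

print(run([[1, 0, 0, 0]], [shift_right], cycles=2))
print(run([[1, 0], [0, 0]], [shift_right, shift_down], cycles=1))
```

The nested loops here visit positions one at a time only because this is a sequential model; in an SS system every position is processed in the same cycle.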
Optical implementations of SS systems
An optical system for performing SS must provide two basic operations: pattern recognition and pattern replacement. The most widely used approaches for both operations apply a thresholding operation to a composite of shifted replicas of the input image. Here we briefly review the ways in which optical systems can produce shifted image replicas and describe how this capability is combined with thresholding and logic-level restoration to provide cascadable building blocks for pattern recognition and pattern substitution.
One important choice to be made in specifying a pattern recognition module is whether it will recognize patterns of ones, patterns of zeros, or patterns consisting of ones and zeros. Recognition of patterns containing ones and zeros leads to system compactness and operational flexibility but also requires a more complex optical system.
Description of a simple recognition-substitution module
Implementation of SS is simplified if the pattern to be recognized consists of only bright pixels (ones) or only dark pixels (zeros). A bright pixel pattern recognizer is described here.
A replica of the input image is made for each bright pixel in the pattern to be recognized (the LHS pattern). Each replica image is shifted horizontally and vertically by an amount that brings a corresponding LHS bright pixel to the position of a designated origin pixel. All of the shifted replicas of the input image are superimposed, producing a composite image having pixels with different brightnesses. The brightest pixels in the composite image will occur at each position where the input image matches the LHS pattern. The composite image is incident on an array of thresholding optical gates whose output leaves only these brightest pixels (pattern matches) in the bright state.
Once bright pixels marking the locations of pattern matches have been obtained, the next step is to substitute the new RHS pattern at each location. For each bright pixel in the RHS pattern, a replica of the image at the output of the threshold array is made. The replica images are shifted by an amount corresponding to the position of the bright pixels in the RHS pattern. The shifted replicas are superimposed (ORed), with the result that the RHS pattern now appears in all locations that a recognition spot existed. For achieving cascadable modules, an array of gain and isolation devices is included.
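The replicate, shift, superimpose, and threshold sequence can be mimicked in software. The following is a minimal sketch under stated assumptions: images are lists of rows of 0/1 intensities, patterns are lists of bright-pixel offsets relative to the origin pixel, pixels shifted in from outside the image are dark, and the function names are hypothetical. Gain and isolation stages are not modeled.

```python
def shift(image, dr, dc):
    """Return a copy of image moved by (dr, dc); pixels shifted in from outside are dark."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                out[rr][cc] = image[r][c]
    return out

def superimpose(images):
    """Add intensities pixel by pixel, as overlapping optical replicas would."""
    return [[sum(img[r][c] for img in images) for c in range(len(images[0][0]))]
            for r in range(len(images[0]))]

def recognize(image, lhs):
    """Mark origin pixels at which every bright LHS pixel is present."""
    # Shifting by (-dr, -dc) brings the LHS pixel at offset (dr, dc) onto the origin.
    composite = superimpose([shift(image, -dr, -dc) for dr, dc in lhs])
    # Threshold: only pixels that received light from every replica stay bright.
    return [[int(v >= len(lhs)) for v in row] for row in composite]

def substitute(matches, rhs):
    """Write the RHS pattern at every recognition spot (OR of shifted replicas)."""
    composite = superimpose([shift(matches, dr, dc) for dr, dc in rhs])
    return [[int(v >= 1) for v in row] for row in composite]

# Illustrative rule: two horizontally adjacent bright pixels -> one bright pixel below the origin.
lhs = [(0, 0), (0, 1)]
rhs = [(1, 0)]
image = [[1, 1, 0],
         [0, 0, 0],
         [0, 1, 1],
         [0, 0, 0]]
matches = recognize(image, lhs)
print(matches)                   # bright at (0, 0) and (2, 1)
print(substitute(matches, rhs))  # bright at (1, 0) and (3, 1)
```

The threshold level equals the number of bright LHS pixels, so the number of replicas, and with it the optical power that must be split and recombined, grows with the size of the pattern.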
Implementation of image shifting and combining operations
Optical implementations of SS are all based on replicating, shifting, and recombining data page images. During pattern recognition, a shifted replica of the input image must be formed for each distinguished bit in the pattern to be recognized. For substitution, a shifted replica of the output of the threshold array must be produced for each bright pixel in the substituted pattern. Two approaches to replicating, shifting, and combining images for SS have been published in the literature7,15: geometrical optics using beamsplitters, mirrors, and prisms, and diffractive optics using holograms. We briefly review the merits and fundamental limitations of each in the following.
Several systems using geometrical optics components have been proposed for providing the image replication, shifting, and combining operations for SS. All of them are roughly equivalent to the beamsplitter configuration shown in Fig. 4. Although these implementations are very straightforward, the process is inherently power-inefficient. In principle, two images may be combined without power loss with the use of a polarization beamsplitter, but the output image, containing both polarizations, is not suitable for cascaded stages of lossless combinations. Since many rules require detection and substitution of patterns containing four or more shifted images, either a spatial light modulator must be used to regenerate an image with one linear polarization after each pair combination, or nonpolarized image combination must be used for each additional image combination. If this second approach is adopted to combine N images, at least [(N/2) - 1]/(N/2) of the input power is lost.
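As a quick numerical illustration of the loss bound quoted above, the following sketch evaluates [(N/2) - 1]/(N/2) for a few values of N. The function name `min_lost_fraction` is hypothetical, and the code does nothing more than evaluate the expression given in the text.

```python
def min_lost_fraction(n_images):
    """Lower bound on the fraction of input power lost when combining
    n_images with nonpolarized combination after the first lossless pair."""
    if n_images < 2:
        raise ValueError("at least two images are needed for a combination")
    half = n_images / 2
    return (half - 1) / half


# Evaluate the bound for a few rule sizes.
for n in (2, 4, 8, 16):
    print(n, min_lost_fraction(n))
```

For N = 4 the bound is one half, and it approaches unity as N grows, which is why rules with many distinguished pixels are costly with beamsplitter-based combination.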
The alternative to geometrical optics for image replication, shifting, and combining is the use of holograms. In contrast to geometrical optics, volume holograms can be used to combine many images with very little loss. A more subtle problem arises with the use of holographic optical elements (HOEs), however. Holograms do not delay wavefronts the way refractive optical components do. With holograms, all phase delays are modulo 2π. This means, for instance, that if a hologram performs the function of a lens, wavefronts passing through the center of the holographic lens will arrive at the image before those passing through the edge. Put another way, pulses of light will be stretched in time, placing a lower limit on the clock period for an optical system. As an example, a 2.5 cm diameter, f/1 holographic lens will lengthen all pulses of light passing through its full aperture by about 50 ps.
Data encoding schemes
Two approaches have emerged for recognizing patterns containing both ones and zeros. The first approach is dual-rail logic, or position encoding. With this method both the true and false states of a logic variable are represented by a bright spot in the optical array; ones are represented by a bright spot in a specified position, zeros by a bright spot in another position (e.g., see Fig. 2 ). Thus, the problem of detecting ones and zeros in a pattern has been translated into a requirement to detect just ones or just zeros. Processing can proceed as previously described for those operations.
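A minimal sketch of dual-rail (position) encoding is given below. The cell layout is an assumption made for illustration: each logic value occupies two adjacent positions, with a bright spot in the first position for a one and in the second for a zero. The actual spatial arrangement used in a given SS system may differ.

```python
def dual_rail_encode(bits):
    """Encode each bit as two cells with exactly one bright spot.

    Assumed layout: (1, 0) represents a one, (0, 1) represents a zero.
    """
    cells = []
    for b in bits:
        cells.extend((1, 0) if b else (0, 1))
    return cells


def dual_rail_decode(cells):
    """Recover the bits; every pair must contain exactly one bright cell."""
    bits = []
    for i in range(0, len(cells), 2):
        pair = (cells[i], cells[i + 1])
        if pair == (1, 0):
            bits.append(1)
        elif pair == (0, 1):
            bits.append(0)
        else:
            raise ValueError("invalid dual-rail pair at cell %d" % i)
    return bits


encoded = dual_rail_encode([1, 0, 1])
```

Because every logic value is now marked by a bright spot, a rule that must test for zeros can be expressed with the bright-pixel recognition procedure alone, at the cost of doubling the number of pixels.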
The other approach is to encode the binary states of a cell not with intensity but with orthogonal polarizations of light. 16 As with simple recognition, a replica of the data plane is produced for each distinguished cell in the LHS pattern, but in this case both true and false LHS cells may be specified. Replicas corresponding to zeros in the LHS pattern are passed through a halfwave plate, thereby inverting the logic value of their bits. Shifting now occurs on all replicas to bring the specified LHS cells to the origin. The resulting superposition passes through a polarizer aligned with the true state polarization in the data array. Wherever the data array matches the LHS pattern, all cells with the true state polarization and cells with a false state polarization that has been rotated 90° to the true state are superimposed. Thus, matches are noted by the brightest pixels after passing through the analyzer. From this point, the rest of the process follows that for simple recognition.
Both approaches roughly double the power consumed by the system. For dual-rail encoding this occurs because the number of pixels to represent each bit is doubled and the complexity of logic and data paths is correspondingly increased. For polarization-based encoding, a polarization analyzer is used prior to the optical gates, thereby discarding half of the power.
SS SYSTEMS AND 2-D VLSI MESH
In this section, we compare optical SS systems with a VLSI 2-D mesh of processors that operates in SIMD mode. We show that an SS system can be efficiently simulated by a very-fine-grain mesh of processors and that an SS rule can be simulated using only a small number of cycles that depends on the size of the SS rule. In fact, we specify measures to quantify the complexity of an SS rule. On the other hand, we show that an SS system is inefficient in simulating a mesh of electronic processors, where each processor has the ability to perform basic arithmetic and data movement operations on one-bit words. This simulation requires more space and time. We also give quantitative estimates of the resources needed to simulate an optical SS system using a mesh of VLSI processors. As in a mesh, in SS each instance of the rule works only on a small amount of nearby information.
Simulation of SS by a mesh
We first make the following assumptions about the mesh: Each processor is connected to its four nearest neighbors with bidirectional edges. The operation of each processor is synchronized by a global clock. Each processor has instructions for communicating with its four neighbors and for computing the logical operations AND, OR, and NOT.
To simulate an N x N optical SS system by an N x N mesh of electronic processors, we further assume, without loss of generality, an SS system based on a single rule, then extend our analysis later to handle the case of multiple rules. The basic idea is that each mesh processor (x,y) is responsible for the state of the pixel (x,y) in the binary plane of the SS system. We then simulate the transition rule on the mesh and update the states. In the following, we compute the cost of simulating a transition rule.
Consider a transition rule that replaces a k x k frame with another k x k replacement frame based on the existence of a certain search pattern in the frame. A search pattern is specified by requiring distinguished pixels to have certain states. Let m be the number of these pixels. The other ($k^2 - m$) pixels in the frame are "don't-care" pixels because their state does not affect the recognition of the pattern. Similarly, the replacement frame is specified by giving the set of distinguished pixels that are required to have the value 1. Let n be the number of those pixels. Other pixels in the replacement pattern have the value 0. Our aim is to capture the cost of the complexity of simulating the transition rule as a function of k, m, and n.
Consider how a pixel in the output plane of an SS system can possibly change its state after an application of a transition rule. Each pixel in the output plane depends on exactly n k x k frames. If at least one of these frames has the required search pattern, a 1 will be written in the pixel. The presence of a search pattern in a frame is determined by the m distinguished pixels. Hence, the new state of a cell is determined by a Boolean formula, which is an OR of n terms, each of which is an AND of m Boolean variables. We next show how this function can be computed for each of the pixels in parallel in time $O(k^2)$ and with a small (O(min(n,2k))) amount of hardware per processor in the mesh.
In the first phase, we compute the AND of the distinguished pixels for each possible k x k frame. For each frame, we designate a unique pixel to collect and AND together the states of the distinguished pixels corresponding to the search pattern. Note that each pixel appears distinguished in the search patterns of exactly m frames. Hence, each pixel has to send its state to m different recipients. This transmission can be accomplished in [...]
These bounds work in general. In many specific cases, one could exploit the regularity of the rule to derive more efficient simulations. For example, Murdocca's transition rule can be simulated in about eight communication cycles.
The simulation procedure described above does not handle the processors at the edges or the case of a system in which several rules are being applied simultaneously. The edge processors can be taken care of by deleting the appropriate product terms. We can simulate a system with several rules by considering the logical OR of the output binary planes that would result from applying the individual transition rules. The cost functions for this case would be the same as in the one-rule case with $k = \max_i(k_i)$, $m = \sum_i m_i$, and $n = \sum_i n_i$.
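The Boolean formula that determines the new state of a pixel (an OR of n terms, each an AND of m variables) can be written out directly. The sketch below is a sequential model of what each mesh processor would compute, not the parallel mesh algorithm itself. The names `new_state` and `apply_rules`, the offset-list representation of a rule, and the treatment of out-of-plane pixels as dark (which corresponds to deleting product terms at the edges) are assumptions made for illustration.

```python
def new_state(plane, x, y, lhs, rhs):
    """New value of pixel (x, y): OR over the n frames that could write here,
    each term being the AND of the m distinguished LHS pixels of that frame.

    plane is indexed plane[y][x]; lhs and rhs are lists of (dx, dy) offsets
    of distinguished pixels measured from the frame origin.
    """
    h, w = len(plane), len(plane[0])

    def bright(px, py):
        return 0 <= px < w and 0 <= py < h and plane[py][px] == 1

    for rdx, rdy in rhs:                      # n candidate frames
        ox, oy = x - rdx, y - rdy             # origin of the frame writing (x, y)
        if not (0 <= ox < w and 0 <= oy < h):
            continue                          # edge case: term deleted
        if all(bright(ox + ldx, oy + ldy) for ldx, ldy in lhs):  # AND of m
            return 1
    return 0


def apply_rules(plane, rules):
    """Several rules applied at once: logical OR of the individual outputs."""
    h, w = len(plane), len(plane[0])
    return [
        [1 if any(new_state(plane, x, y, lhs, rhs) for lhs, rhs in rules) else 0
         for x in range(w)]
        for y in range(h)
    ]


plane = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pair_rule = ([(0, 0), (1, 0)], [(0, 0), (0, 1)])   # horizontal pair -> vertical pair
copy_rule = ([(0, 0)], [(0, 0)])                    # every bright pixel stays bright
single = apply_rules(plane, [pair_rule])
combined = apply_rules(plane, [pair_rule, copy_rule])
```

Each call to `new_state` touches at most n times m pixels, which is the per-pixel work that the mesh simulation distributes over its communication cycles.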
Since there is a limitation on the size of electronic mesh that can be implemented at present, we should consider the problem of simulating an N x N SS system with a smaller M x M mesh, where M < N. We assume N/M to be some integer multiple of k, and we compute the time and space requirements to perform this simulation. The basic idea is to make each processor in the mesh responsible for a p x p window of pixels in an SS binary plane, where p is N/M.
KIAMILEV, ESENER, PATURI, FAINMAN, MERGER, GUEST, LEE
The simulation algorithm we use here is composed basically of a communication phase followed by a computation phase. In the communication phase, each processor sends pixel state information to its four nearest neighbors such that $4p(k-1) + 4(k-1)^2$ state bits are received at each processor. The idea is that each processor gathers a (k - 1)-wide window of states around it so that it has all of the necessary information to compute the new states of its pixels. The time for the computation is $O(p^2 \log(mn))$, and each processor needs $O(mn + p^2 + 4p(k-1) + 4(k-1)^2)$ switches. The overall time for simulating one application of a transition rule is $O(p^2 \log(mn) + 4p(k-1) + 4(k-1)^2)$. In particular, when $p \gg k$, the time is $O(p^2 \log(mn))$ and the hardware cost is $O(p^2)$.
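The count of state bits received in the communication phase can be checked with a short sketch. It assumes, as the text suggests, that the gathered states form a border of width k - 1 on all four sides of the p x p window. The function names are hypothetical.

```python
def halo_bits(p, k):
    """State bits received per processor, as given in the text."""
    return 4 * p * (k - 1) + 4 * (k - 1) ** 2


def halo_bits_by_area(p, k):
    """Same quantity computed as the area of a (k - 1)-wide border
    around a p x p window."""
    side = p + 2 * (k - 1)
    return side * side - p * p


# The two expressions agree for every window and frame size tried here.
agree = all(
    halo_bits(p, k) == halo_bits_by_area(p, k)
    for p in range(1, 50)
    for k in range(1, 10)
)
example = halo_bits(10, 3)
```

The border grows linearly in p while the computation grows as p squared, which is why the communication terms drop out of the time bound when p is much larger than k.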
Simulation of a mesh by SS
We now consider the simulation of a VLSI mesh with an SS system. We show that such simulation requires more space and processing cycles, even for a very simple mesh.
Consider a mesh of one-bit processors, each having three registers capable of performing logical and data movement operations. We also have instructions to transport the data between the neighboring processors. To simulate such a system, we make the following two generous assumptions about the capabilities of the SS system: (1) the system can have a large number of substitution rules operating in parallel and (2) the control bits in the input plane can be changed every cycle.
The basic idea of the simulation is to allocate a window of SS pixels for each processor. This window contains the space for the three registers and the control bits to specify the instruction in dual-rail logic. We use multiple SS rules (about 16) operating in parallel to implement the instruction set.
This scheme gives us the minimal area per processor and one cycle time to execute an instruction. Simple calculations show that the area required per processor would be at least 25 pixels. Thus, if we assume that the binary plane has 1000 x 1000 pixels, we can at best simulate a 200 x 200 mesh of one-bit processors, with each step of the mesh taking one clock cycle of the SS system.
If a larger-grain processor is used or if the above-mentioned assumptions are not feasible, in particular if we have to work with a single rule, then the corresponding simulation would be much more inefficient in terms of both time and area. This would imply that any realistic SS system can simulate only a small mesh (less than 100 processors), taking a large number of cycles to simulate a cycle of the mesh.
To summarize, we have shown that an SS system is no more powerful than a fine-grain mesh of processors of similar size. This means that any advantage that can be enjoyed by an SS system must come from technological considerations. In the next section, we look at the technological aspects.
SYSTEM AND TECHNOLOGICAL CONSIDERATIONS OF POEM AND SS
Here, we discuss the technological characteristics of both POEM and SS systems. In particular, we determine the energy dissipation and speed of these systems. To begin with, let us consider some fundamental characteristics associated with the optical gates of which these systems are composed.
Fundamental considerations for optical gate arrays
In the following, we analyze optical gate switching speed and array size in terms of thermal limitations, optical interconnect density, and efficiency of optical and electrical interconnects.
In general, a bound on the number (N x N) of gates in an array of area A can be found by requiring that heat dissipation cannot be larger than the heat removal per switching cycle. Thus, we have

$$N^2 \le \frac{P_{d\max} A}{P_c A_c} \qquad (1)$$

where $P_{d\max}$ is the maximum allowable power dissipation density, which is dependent on the thermal characteristics of the material and the heat removal technique applied to the device. $P_c$ is the power dissipation density of a single optical gate, and $A_c$ is its active area. In addition, the required space bandwidth product (SBP) of an optically interconnected system is

$$\mathrm{SBP} \ge \frac{A}{A_c} \qquad (2)$$

In general, A is limited by wafer size, and $A_c$ is limited by lithography or by the optical wavelength. Combining Eqs. (1) and (2), we obtain an upper limit on the size of an optical gate array imposed by thermal dissipation and optical interconnect density as

$$N^2 \le \frac{P_{d\max}}{P_c}\,\mathrm{SBP} \qquad (3)$$

For an optical gate, the power dissipation density is related to the switching energy density $E_c$ and the switching speed T by $P_c = E_c/T$. Using this relation in Eq. (3), we can show that the minimum switching speed T of the array is determined by

$$T \ge \frac{N^2 E_c}{P_{d\max}\,\mathrm{SBP}} \qquad (4)$$

Hence, for a given device and optical interconnect technology, the speed of an optical gate is limited by the array size. An important figure of merit for optical gate arrays, therefore, is the array throughput, given by

$$\frac{N^2}{T} \le \frac{P_{d\max}\,\mathrm{SBP}}{E_c} \qquad (5)$$
This equation puts an upper limit on the capabilities of any optical gate array implemented with a given technology. In the case of the optoelectronic PE arrays used in POEM, $A_c$ is the area of a single modulator in each PE and occupies only a small fraction of the total PE area. With the simplifying assumptions made in Sec. 5.1 for the worst-case calculations, Eq. (5) can also be used to estimate the computational throughput of POEM systems. Next, we develop models to evaluate the energy dissipation and the latency of POEM and SS systems.
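The thermal bounds of this subsection can be explored numerically with the sketch below. It assumes that the switching-time bound has the form obtained by substituting $P_c = E_c/T$ into the combination of Eqs. (1) and (2), that is, $T \ge N^2 E_c / (P_{d\max}\,\mathrm{SBP})$, and that array throughput means $N^2/T$. The function names and all numerical values are hypothetical and chosen only to show the scaling; they are not taken from the paper.

```python
import math


def min_switching_time(n_side, e_c, pd_max, sbp):
    """Lower bound on switching time T (s) for an n_side x n_side array.

    e_c:    switching energy density (J/cm^2)
    pd_max: maximum allowable power dissipation density (W/cm^2)
    sbp:    space bandwidth product (dimensionless)
    """
    return n_side ** 2 * e_c / (pd_max * sbp)


def max_throughput(e_c, pd_max, sbp):
    """Upper bound on array throughput N^2 / T (gate operations per second)."""
    return pd_max * sbp / e_c


# Illustrative (assumed) values only.
E_C = 1e-6      # J/cm^2
PD_MAX = 10.0   # W/cm^2
SBP = 1e6

t_min = min_switching_time(1000, E_C, PD_MAX, SBP)
throughput = max_throughput(E_C, PD_MAX, SBP)
```

Under these assumptions the throughput bound does not depend on N: enlarging the array forces a proportionally longer switching time, so the product of array size and clock rate is fixed by the technology.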
Energy dissipation and latency for POEM and SS
In this section we determine the energy dissipation and the speed of POEM and SS systems.
POEM
The POEM machine is composed of electronic PEs interconnected with holographic optical interconnects. Each PE is made of logic gates interconnected with electrical interconnects. The energy is dissipated essentially in the electrical interconnections and in silicon inverters. The maximum PE clock rate is fundamentally determined by the speed of the longest electrical interconnect in the PE, while the speed of interprocessor communication is determined by the longest holographic interconnect in the system.
First, we discuss the energy dissipation and the speed of a PE. The total energy dissipated per clock cycle within a PE is the sum of the energies spent in switching the electronic logic gates and driving the interconnects. The energy consumed in switching a logic gate with short connections is dominated by the gate input capacitance C. If V is the required voltage swing, then the switching energy is given by CV2 E` = . (6) When the connections are longer, the wire capacitance dominates the gate input capacitance and the switching energy becomes proportional to the length of the electrical wire.
The operating speed of the circuit is inversely proportional to the connection delay, which depends on the length of the wire. For short wires it is given by 17
Tshort wire = 2.718Tinven(KLwire) (7) where Lwíre is the connection length and K is a constant, typically between 0.1 and 0.2 µm -1. The inverter switching time TInv is a technological constant representing the logical gate switching speed. This logarithmic dependence of wire delay on wire length shows that for locally connected gates, the speed is essentially determined by Tiny. On the other hand, when the connections are long the wire delay is proportional to the wire length and is given by17
LwireV
Tlong wire c (8) where c is the speed of light and Er is a constant, typically about 4. Thus, long electrical connections decrease the speed of operation and increase the energy consumption. We now turn our attention to the energy dissipation and speed of holographic interprocessor connections. Figure 5 illustrates a free -space optical interconnect system. A biasing optical field is incident on only the modulators associated with each PE. The light, transmitted by a modulator that is turned on, is directed with holographic interconnects onto the desired detector(s). The energy required by such interconnects can be evaluated to beta Ea = 2VF(C + Cinv)I q + VJ + CMVM (9) where E0 is the required optical link energy, V is the inverter voltage swing, F is the fan-out, Cpd is the photodetector capacitance, CM is the modulator capacitance, and VM is the halfwave voltage of the modulator. The photon energy is represented by hv, and the electronic charge by q. The efficiency of the optical link is modeled by 1, which includes the efficiencies of the modulator, hologram, and detector. Compared with the energy requirements of electrical interconnects in Eq. (7), it can be shown that E. is less for long communication distances. The break -even communication length establishes the criteria for the appropriate use of electrical and optical interconnections. As an example, an optical link realized with a PLZT light modulator with 10 µm2 area and a fan-out of 1 using the 2.5 µm process will dissipate an energy of 50 pi, assuming 60% holographic Th=-e(11 +D -1 , (10) where e is the distance between the PE array and the hologram and D is the length of a side of the array. For fixed interconnects, this skew can be compensated for by the introduction of appropriate optical time -delay elements into different communication paths. 
However, in the case of programmable optical interconnects, this compensation technique cannot be used because of the time dependence of the relative delays. Nevertheless, the magnitude of the skew is presently less than the latency of the state -of-the -art MQW light modulators and therefore does not limit the communication speed. For example, for an optoelectronic PE array 15 cm on the side, the signal skew ranges from 20 ps to 200 ps as f is varied from 150 to 15 cm. Note that the magnitude of the skew is reduced by increasing f. However, the free -space optical propagation delay increases with a according to 1-ft. = e /c. Thus, for a given array dimension there exists an optimal distance e, minimizing the propagation delay and signal skew.
SS
Here, we compute the energy dissipation and the delay involved in a single application of an SS transition rule. Figure 6 is a KIAMILEV, ESENER, PATURI, FAINMAN, MERGER, GUEST, LEE munication is determined by the longest holographic interconnect in the system. First, we discuss the energy dissipation and the speed of a PE. The total energy dissipated per clock cycle within a PE is the sum of the energies spent in switching the electronic logic gates and driving the interconnects. The energy consumed in switching a logic gate with short connections is dominated by the gate input capacitance C. If V is the required voltage swing, then the switching energy is given by CV2 (6) When the connections are longer, the wire capacitance dominates the gate input capacitance and the switching energy becomes proportional to the length of the electrical wire.
The operating speed of the circuit is inversely proportional to the connection delay, which depends on the length of the wire. For short wires it is given by 17 Tshort wire = 2.718Tinv^n(KLwire) (7) where Lwire is the connection length and K is a constant, typically between 0.1 and 0.2 jjum" 1 . The inverter switching time Tinv is a technological constant representing the logical gate switching speed. This logarithmic dependence of wire delay on wire length shows that for locally connected gates, the speed is essentially determined by Tinv -On the other hand, when the connections are long the wire delay is proportional to the wire length and is given by where c is the speed of light and er is a constant, typically about 4. Thus, long electrical connections decrease the speed of operation and increase the energy consumption. We now turn our attention to the energy dissipation and speed of holographic interprocessor connections. Figure 5 illustrates a free-space optical interconnect system. A biasing optical field is incident on only the modulators associated with each PE. The light, transmitted by a modulator that is turned on, is directed with holographic interconnects onto the desired detector(s). The energy required by such interconnects can be evaluated to be °1
where E0 is the required optical link energy, V is the inverter voltage swing, F is the fan-out, Cpd is the photodetector capacitance, CM is the modulator capacitance, and VM is the halfwave voltage of the modulator. The photon energy is represented by hv, and the electronic charge by q. The efficiency of the optical link is modeled by T], which includes the efficiencies of the modulator, hologram, and detector. Compared with the energy requirements of electrical interconnects in Eq. (7), it can be shown that EO is less for long communication distances. The break-even communication length establishes the criteria for the appropriate use of electrical and optical interconnections. As an example, an optical link realized with a PLZT light modulator with 10 jim area and a fan-out of 1 using the 2.5 jxm process will dissipate an energy of 50 pJ, assuming 60% holographic diffraction efficiency and 90% modulator and detector efficiencies. Compared with the energy required for a typical electrical off-chip connection of about 1 nJ, the optical link consumes less energy.
The speed of operation of POEM systems can be limited by the latency of the global optical links, local electrical interconnect delay, or the inverter switching speed. The latency of the global optical links will be governed technologically by the light modulator speed and fundamentally by the skew introduced by holographic interconnects and by the free-space optical propagation delays. Typical achievable speeds with light modulators are 0.1 to 1 µs with Si/PLZT and 1 to 10 ns with multiple quantum well (MQW) technologies.19

For global holographic optical interconnections, relative time delays will be introduced among the PEs by the hologram. This skew can be expressed from simple geometrical considerations as

[Eq. (10): expression missing from the source text]

where $\ell$ is the distance between the PE array and the hologram and $D$ is the length of a side of the array. For fixed interconnects, this skew can be compensated for by the introduction of appropriate optical time-delay elements into different communication paths. However, in the case of programmable optical interconnects, this compensation technique cannot be used because of the time dependence of the relative delays. Nevertheless, the magnitude of the skew is presently less than the latency of the state-of-the-art MQW light modulators and therefore does not limit the communication speed. For example, for an optoelectronic PE array 15 cm on the side, the signal skew ranges from 20 ps to 200 ps as $\ell$ is varied from 150 to 15 cm. Note that the magnitude of the skew is reduced by increasing $\ell$. However, the free-space optical propagation delay increases with $\ell$ according to $T_{tr} = \ell/c$. Thus, for a given array dimension there exists an optimal distance $\ell$ that balances the propagation delay against the signal skew.
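The trade-off between skew and propagation delay can be illustrated with a simple path-difference model. This is only a sketch under an assumed geometry: the skew is taken as the difference between an oblique path that spans one array side $D$ and the straight path of length $\ell$, divided by $c$. This form is not taken from the source, but it gives values close to the 20 ps and 200 ps quoted above for a 15 cm array. The function names are hypothetical.

```python
import math

C_LIGHT = 3.0e8  # speed of light in vacuum, m/s


def skew_estimate(distance_m, side_m):
    """Assumed model: (sqrt(l^2 + D^2) - l) / c, longest minus shortest path."""
    return (math.hypot(distance_m, side_m) - distance_m) / C_LIGHT


def propagation_delay(distance_m):
    """Free-space transit time T_tr = l / c."""
    return distance_m / C_LIGHT


if __name__ == "__main__":
    side = 0.15  # array side D, m
    for distance_cm in (15, 50, 150):
        dist = distance_cm / 100.0
        skew_ps = skew_estimate(dist, side) * 1e12
        transit_ps = propagation_delay(dist) * 1e12
        print(f"l = {distance_cm:3d} cm: skew = {skew_ps:6.1f} ps, "
              f"transit = {transit_ps:7.1f} ps")
```

Increasing the distance reduces the skew but lengthens the transit time, so the choice of distance depends on which of the two limits the system clock.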
Here, we compute the energy dissipation and the delay involved in a single application of an SS transition rule. Figure 6 is a diagram of the system we use. We make the following assumptions about the system:
(i) The SS system uses a single rule of $(k,m,n)$ complexity operating on an $N \times N$ pixel image.
(ii) The input image contains $b$ bright pixels, each having an energy of $e_{in}$.
(iii) The input image contains $S$ occurrences of the search pattern.
(iv) The transition rule produces an $N \times N$ output image with $S'$ bright pixels, each having an energy of $e_{in}$.
(v) The optical operations of splitting, shifting, combining, and imaging are lossless.
(vi) The system contains two $N \times N$ arrays of optical gates: one for logic-level isolation and restoration (amplification) and the other one for thresholding.
(vii) Figure 7 shows the model of the three-terminal device used in the amplifier array. When light of energy $e_{in}$ enters the device, light of energy $G e_{in}$ leaves. In this case, conservation of energy requires that

$$e_b + e_{in} = G e_{in} + e_{ca}, \qquad (11)$$

where $e_b$ is the bias energy to the amplifier, $G$ is the input-output gain of the device, and $e_{ca}$ is the energy dissipated in switching the device. On the other hand, when no input light enters the device, the bias energy $e_b$ is dissipated by the device and no output is produced.
(viii) Figure 8 shows the model of the two-terminal device used in the threshold array. Such a device is characterized by its threshold energy and switching energy $e_{ct}$. When the input light is below the threshold energy, no output is produced and all of the input energy is dissipated at the device. But when the input light energy exceeds the threshold, output light of energy $(e_{in} - e_{ct})$ is produced and an energy of $e_{ct}$ is dissipated in switching the device.
(ix) The devices are memoryless; that is, at the end of each clock cycle the optical gate arrays are reset.
We now explain the system energy budget shown in Fig. 6. The bias energy of $N^2 e_b$ is used to power the amplifier array. Since the input has $b$ bright pixels, the output of the amplifier array also has $b$ bright pixels, each having an energy of $G e_{in}$. The total energy dissipated in the amplifier array is the sum of the energies dissipated by the $b$ switching cells, which had light incident on them ($b e_{ca}$), and the $N^2 - b$ cells, which had no light incident on them [$(N^2 - b)e_b$]. The amplifier array is followed by a lossless optical system that produces an image with $S$ pixels above the threshold energy of the threshold devices. This image is incident on the threshold array. The energy dissipation in the threshold array is the sum of the energy to switch the devices for the $S$ pixels above threshold ($S e_{ct}$) and all of the energy that is below threshold [$(b - S)G e_{in}$]. The output produced by the threshold array is an image with $S$ bright pixels, each having an energy of $(G e_{in} - e_{ct})$. This image is passed to the optical system for substitution, which generates a final output image with $S'$ bright pixels. Conservation of energy requires that the total energy entering the system be equal to the energy leaving the system plus the energy absorbed. Using this constraint we obtain

$$e_b \ge (n-1)e_{in} + e_{ca} + e_{ct}. \qquad (12)$$

This equation reveals that each pixel needs enough energy to create a full rule-substitution pattern and to energize one threshold device and one amplifier device. Now, we use Eq. (11) to eliminate the dependence on $e_{in}$ in Eq. (12) to obtain

$$e_b \ge \frac{G-1}{G-n}\,e_{ct} + e_{ca}. \qquad (13)$$

This equation indicates that the gain of each amplifying device must exceed $n$. The overall energy dissipation can now be computed by adding the energy dissipations of the amplifier and the thresholding arrays and using Eq. (13) for $e_b$. This gives

[Eq. (14): expression missing from the source text]

The first term in Eq. (14) is the energy required to bias an array of $N \times N$ optical devices such that the recognition-substitution operation can be carried out. This amount of energy is dissipated under all conditions. According to Eq. (14), the bias energy is quite large because it is $N^2(G-1)/(G-n)$ times the switching energy required per thresholding device plus $N^2$ times the switching energy required by the amplifying devices. The second term in Eq. (14) represents the energy losses associated with different fan-in and fan-out and can be made to vanish for $m = n$, i.e., for constant fan-in and fan-out. For example, assuming $G$ of 5 and using Murdocca's simplest rule where $(k,m,n) = (3,4,4)$, we require an energy dissipation of $(36e_{ct} + 9e_{ca})$ for each 3 × 3 window. Considering that the switching energy of an optical device is presently equal to the energy of an electronic transistor, Murdocca's SS rule requires the energy equivalent of 45 transistors. We discuss the computational value of such a recognition-substitution module in the next section.

The above argument can easily be extended to an SS system with $R$ parallel rules. Such a system is basically equivalent to $R$ one-rule SS systems operating in parallel. Assuming the same gain for the amplifying array, the energy dissipation will be essentially $R$ times larger and is given by

[Eq. (15): expression missing from the source text]
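The bias-energy example above can be checked with exact arithmetic. The sketch below assumes the bound $e_b \ge \frac{G-1}{G-n}e_{ct} + e_{ca}$, which follows from combining the amplifier energy balance $e_b + e_{in} = G e_{in} + e_{ca}$ with $e_b \ge (n-1)e_{in} + e_{ca} + e_{ct}$. The function names are hypothetical, and the energies are expressed in arbitrary units.

```python
from fractions import Fraction


def min_bias_energy(gain, n, e_ct, e_ca):
    """Smallest bias energy per device allowed by the bound; needs gain > n."""
    if gain <= n:
        raise ValueError("the amplifier gain G must exceed n")
    return Fraction(gain - 1, gain - n) * e_ct + e_ca


def window_bias_coefficients(gain, n, pixels):
    """Return (coefficient of e_ct, coefficient of e_ca) for `pixels` devices."""
    if gain <= n:
        raise ValueError("the amplifier gain G must exceed n")
    return Fraction(gain - 1, gain - n) * pixels, Fraction(pixels)


if __name__ == "__main__":
    # Murdocca's rule (k, m, n) = (3, 4, 4) with G = 5 on a 3 x 3 window.
    coeff_ct, coeff_ca = window_bias_coefficients(gain=5, n=4, pixels=9)
    print(coeff_ct, coeff_ca)  # 36 9

    # Consistency check with unit switching energies (an arbitrary choice):
    e_ct, e_ca = Fraction(1), Fraction(1)
    e_b = min_bias_energy(5, 4, e_ct, e_ca)
    e_in = (e_b - e_ca) / (5 - 1)                    # energy balance solved for e_in
    assert e_b == (4 - 1) * e_in + e_ca + e_ct       # the inequality holds with equality
```

With equal switching energies for the two device types, the 36 + 9 coefficients reproduce the 45-transistor equivalent quoted in the text.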
We now estimate the latency of a recognition-substitution module. The time required to perform one application of an SS rule is the sum of the time required to switch the optical gates and the transit time through the optical system. The speed of the optical gate is limited by the array size and is given in Eq. (4). The transit time is limited by the complexity of the rule and the imaging optics. For example, for a very simple SS rule such as the one proposed by Murdocca, the transit time $T_{\text{transit}}$ is proportional to the focal length:

$$T_{\text{transit}} = \frac{4f}{c}, \qquad (16)$$

where $f$ is the lens focal length. Using the expression for resolution of an Airy pattern, we can express the latency in terms of the SBP of the optical system; the f-number of the lenses, $F\#$; and the optical frequency $\nu$ as

$$T_{\text{transit}} \approx 10\,\nu^{-1}\sqrt{\text{SBP}}\,(F\#)^2, \qquad (17)$$

or, using Eq. (3),

[Eq. (18): expression missing from the source text]

Note that the latency of an SS system grows as the size of the array increases. For more complex systems, the optical transit time increases with the parameter $m$ of the rule and the number of rules $R$.
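The link between the 4f transit time and the SBP-based estimate $T_{\text{transit}} \approx 10\,\nu^{-1}\sqrt{\text{SBP}}\,(F\#)^2$ can be sketched as follows. The intermediate step is an assumption rather than something stated in the source: the resolvable spot is taken as the Airy disk diameter $2.44\,\lambda F\#$, the aperture holds $\sqrt{\text{SBP}}$ such spots across, and $f = F\# \times$ aperture. The function names and sample values are hypothetical.

```python
import math

C_LIGHT = 3.0e8  # speed of light in vacuum, m/s


def focal_length_from_sbp(sbp, f_number, wavelength_m):
    """Assumed: aperture = sqrt(SBP) * 2.44 * lambda * F#, and f = F# * aperture."""
    return 2.44 * wavelength_m * f_number ** 2 * math.sqrt(sbp)


def transit_time_4f(focal_length_m):
    """Transit time through a 4f imaging system: 4 f / c."""
    return 4.0 * focal_length_m / C_LIGHT


def transit_time_sbp(sbp, f_number, frequency_hz):
    """SBP form of the estimate: 10 * sqrt(SBP) * F#^2 / nu."""
    return 10.0 * math.sqrt(sbp) * f_number ** 2 / frequency_hz


if __name__ == "__main__":
    wavelength = 850e-9            # hypothetical operating wavelength, m
    nu = C_LIGHT / wavelength      # corresponding optical frequency, Hz
    for sbp in (1e4, 1e6, 1e8):
        f = focal_length_from_sbp(sbp, 2.0, wavelength)
        print(f"SBP = {sbp:.0e}: 4f/c = {transit_time_4f(f):.3e} s, "
              f"SBP form = {transit_time_sbp(sbp, 2.0, nu):.3e} s")
```

Under these assumptions the two forms differ only by the factor 9.76/10, and both grow as the square root of the SBP, i.e., linearly with the array side.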
RELATIVE MERITS OF POEM AND SS
Based on the previous analysis, we now compare quantitatively the performance potential of POEM and SS systems.
Computational efficiency
The computational power of SS systems lies in their ability to implement space-invariant transition rules very quickly. The communication involved in effecting these transition rules is done by replicating, shifting, and combining images. Such operations are easy to accomplish in optics.
But this capability of SS systems does not necessarily translate into computational efficiency. The computation involved is done by thresholding or clipping (a nonbinary operation) an analog signal back to binary form, resulting in inefficient energy utilization, especially for complex rules. The communication provided by the transition rules is very local and space invariant. However, many computations, including basic operations such as addition and multiplication, can be implemented more efficiently with space-variant communication. Hence, SS requires a large number of pixels and substitution cycles to implement operations such as logic functions, addition, and multiplication.
To provide a specific example, consider the implementation of a NAND gate using Murdocca's rule.14 The SS NAND gate takes 255 pixels of area and requires six applications of the transition rule. In contrast, an electronic NAND gate requires four inverters and takes about one inverter switching time when short wires are used. That is, a NAND gate fabricated with 1 µm CMOS lithography that has a fan-out of 2 takes 400 ps when the connection length is less than 1 mm. If the inverter switching energy and the optical gate switching energies are assumed to be the same, then Murdocca's NAND gate requires four orders of magnitude more energy than the electronic NAND gate. Another example of wasted space is shown in Fig. 9, where a flip-flop is implemented with 50 × 56 pixels using Murdocca's rule. Additional examples given by Cloonan20 and Goodman21
show that many other important Boolean logic modules require more area and time when implemented with SS.
In Sec. 3 we showed that it requires a large area to implement a basic processor capable of arithmetic and data movement operations. In Sec. 4 we derived the limitation in size and speed of optical gate arrays in terms of thermal considerations. In the following, we show how power considerations limit the size and speed of POEM and SS systems. In particular, we derive some estimates on speed and size based on the value $E_c$, the minimal device switching-energy density. We show that even the best possible values of $E_c$ cannot support a large and fast SS system.
Speed and size
The speed and size of both systems are governed by Eq. (5). With respect to SBP, POEM systems may enjoy an advantage of three orders of magnitude over SS, since the POEM machines use diffractive optics for global connections, while in SS all interconnects are implemented with refractive optics. Using multilevel-phase holograms, the SBP of diffractive interconnects can be as large as $10^{11}$. On the other hand, lens-based refractive interconnects have an SBP of at most $10^8$. The large SBP of holographic interconnects is used in the POEM architecture to achieve a larger ratio $A/A_c$ in Eq. (2) while retaining a high degree of concurrency and allowing reasonable area to implement electronic signal processing.
We now consider the information handling capacity of POEM machines for two different optoelectronic technologies: Si/PLZT9 and Si or GaAs IC integrated with MQW modulators.19 We assume that the processing element is a simple one-bit processor capable of performing logic, data movement, conditional execution, and communication instructions and has 128 bits of local memory. Such a processing element can be implemented with about $10^4$ transistors, resulting in a square area of $10^5$ µm² using 0.5 µm CMOS technology. This number is calculated based on the layout of a prototype PE designed with 2.5 µm minimum feature size in 1 mm² area. We also assume that the size of the processor plane is limited to 6 in. × 6 in. by the wafer size. Then, there will be about 250,000 PEs on the processor plane. To compute the operating speed, we assume that the maximum power dissipation density is 10 W/cm² and perform our calculation for the worst-case condition when all devices on the wafer dissipate the same switching energy as is required to drive the optical devices. Si/PLZT technology requires 1 pJ/µm² switching energy density for the PLZT modulators, and a typical modulator occupies a 10 µm² area.9 Using Eq. (5) we obtain a maximum $N^2T^{-1}$ of $10^{16}$ operations/s. If we use an area of $10^5$ µm² (= $10^4$ × 10 µm²) to host the PEs associated with every modulator, then the throughput is reduced to $10^{12}$ (= $10^{16}/10^4$) operations/s. Thus, assuming 100% yield on a 6 in. wafer, 250,000 globally interconnected PEs, each occupying an area of $10^5$ µm², can be operated at megahertz rates.
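The arithmetic behind the PE count and the megahertz operating rate can be retraced in a few lines. This is a sketch that takes the figures quoted in the text as given (a 6 in. × 6 in. plane, $10^5$ µm² per PE, a ceiling of $10^{16}$ operations/s, and a $10^4$ area ratio); it does not re-derive the ceiling from Eq. (5). The function names are hypothetical.

```python
UM_PER_INCH = 25_400  # micrometers per inch


def pe_count(plane_side_in, pe_area_um2):
    """Number of PEs of a given area that fit on a square processor plane."""
    side_um = plane_side_in * UM_PER_INCH
    return (side_um * side_um) // pe_area_um2


def per_pe_rate(total_ops_per_s, num_pes):
    """Operating rate per PE when the total throughput is shared equally."""
    return total_ops_per_s / num_pes


if __name__ == "__main__":
    pes = pe_count(6, 10**5)
    print(pes)                                # 232257, rounded to 250,000 in the text

    throughput = 10**16 // 10**4              # modulator-limited rate scaled by the area ratio
    print(throughput)                         # 1000000000000
    print(per_pe_rate(throughput, 250_000))   # 4000000.0, i.e., a few megahertz
```

The per-PE rate of about 4 MHz is what the text summarizes as operation at megahertz rates.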
Assuming Si/MQW or GaAs/MQW integration technology to be available, a similar calculation reveals that one can implement 250,000 PEs, all communicating with one another at a rate of 0.1 to 1 GHz, because $E_c$ = 10 fJ/µm², which is smaller than $E_c$ = 1 pJ/µm² for Si/PLZT. Note that the speed achievable with POEM systems is fundamentally limited to a few gigahertz by the skew associated with the optical transit time in the global holographic interconnects. Also, the PEs can perform local computations at rates higher than the communication speed.
Holographic interconnects cannot be used in the case of SS because of the pulse spreading they introduce at very high speed operation. Therefore, refractive interconnects must be used in SS, limiting SBP to $10^8$. As can be seen from Eq. (4), using MQW technology with $N^2T^{-1} < 10^{15}$, only 1 million optical devices will be allowed to operate at a maximum switching rate of 1 GHz. Devices under development that require $E_c$ = 1 fJ/µm² at a 10 ps switching rate will allow a maximum optical gate array size of 316 × 316. In addition to the device size limitations, systems using such high speed devices will be limited by the optical transit time as given by Eq. (18).
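The array sizes quoted for SS follow from dividing the device-rate ceiling $N^2T^{-1}$ by the switching rate. The sketch below uses the stated ceiling of $10^{15}$ operations/s for MQW devices. For the 1 fJ/µm² devices it assumes a ceiling of $10^{16}$ operations/s, a value inferred here from the 316 × 316 figure rather than given in the source. The function names are hypothetical.

```python
import math


def max_devices(ops_ceiling_per_s, switching_rate_hz):
    """Largest device count N^2 that satisfies N^2 / T <= ceiling."""
    return ops_ceiling_per_s // switching_rate_hz


def max_square_side(device_count):
    """Side N of the largest N x N array with at most `device_count` devices."""
    return math.isqrt(device_count)


if __name__ == "__main__":
    # MQW technology: ceiling of 1e15 operations/s at a 1 GHz switching rate.
    print(max_devices(10**15, 10**9))      # 1000000 devices

    # 10 ps devices (1e11 switchings/s) with an assumed ceiling of 1e16 operations/s.
    devices = max_devices(10**16, 10**11)
    print(max_square_side(devices))        # 316
```

Faster devices therefore shrink the usable array: a hundredfold increase in switching rate is only partly offset by the tenfold lower switching energy.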
Complex and multiple rules
One can alleviate some of the speed and size inefficiencies that accompany a single rule system such as Murdocca's by using complex and/or multiple rules operating in parallel.8'2 However, complex and multiple rules increase the energy consumption and decrease the speed of the system, as shown in Eq. (15). There are additional constraints on the complexity and the total number of rules in an SS system. For example, the $m$ parameter of an SS rule cannot exceed the dynamic range of the thresholding device. For proper recognition, the thresholding devices must be able to distinguish an input light intensity of $e_{in}$ from $[1 - (1/m)]e_{in}$. The $n$ parameter is limited by the gain of the amplifier device. As seen in Eq. (13), $n$ imposes a lower limit on $G$. Increasing the number of rules in the system increases the optical transit time and the size of the system and dissipates more energy. For a large number of rules $R$, the increase in energy dissipation is by a factor of $R$, as can be seen from Eq. (15). The optical transit time increases as we stack many images of the binary plane in free space. These multiple images also increase the total volume of the system. Based on the above discussions, complex/multiple rules are not favored in an SS system. As a consequence, we cannot easily eliminate the size and speed inefficiencies that accompany the implementation of basic operations by SS rules.
Architectural considerations
Since technological considerations do not favor highly complex rules, the communication in an SS system is essentially local. In Sec. 3 we showed that architecturally, an SS system is not more powerful than a mesh. In this section, we argue that the mesh architecture is not always an efficient network topology for parallel computation, even though it is easy to implement. We can map certain problems efficiently onto a mesh by using highly regular algorithms, consequently facilitating very fast communication. But these highly local interconnections limit the performance of many algorithms. Any algorithm whose output depends on almost all of the inputs requires at least $N$ time steps, the diameter of the mesh. On the other hand, networks such as the hypercube have $\log(N)$ diameter. Even though the communication in these cube-like architectures tends to be slower than that of the mesh, diameter consideration indicates that these networks will ultimately be more efficient. Table I shows that several important prototype problems do have more efficient algorithms on highly interconnected architectures. Hence, the real questions are whether better communication schemes can be developed for the cube-like architectures and at what point it is advantageous to have slower communicating but more globally connected processors.
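The diameter argument can be made concrete by comparing an $N \times N$ mesh with a hypercube that has the same number of nodes. This is an illustrative sketch using the standard definitions: corner-to-corner hop count for a mesh without wraparound, and one hop per address bit for a hypercube. The function names are hypothetical, and the exact constants are not taken from the source.

```python
def mesh_diameter(side):
    """Hops between opposite corners of a side x side mesh without wraparound."""
    return 2 * (side - 1)


def hypercube_diameter(nodes):
    """Hops between the two most distant nodes of a hypercube (nodes = 2^d)."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("a hypercube needs a power-of-two node count")
    return nodes.bit_length() - 1


if __name__ == "__main__":
    for side in (8, 32, 512):
        nodes = side * side
        print(f"{nodes:7d} PEs: mesh diameter = {mesh_diameter(side):5d}, "
              f"hypercube diameter = {hypercube_diameter(nodes):3d}")
```

For 512 × 512 processors the mesh needs over a thousand hops in the worst case while the hypercube needs 18, so each hypercube hop can be considerably slower before the mesh regains the advantage.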
POEM systems offer a potential solution because their architecture overcomes many of the limitations faced by highly [...]

Third, the topology of the interconnects in POEM is not restricted to being regular. Space-variant interconnection holograms24,25 allow arbitrary and irregular communication between processors. In fact, the need for such communication is supported by the theory of parallel algorithms, which shows that fast parallel algorithms require irregular communication among PEs.3,26

Finally, the POEM architecture allows for programmable interconnections. This reduces the silicon area required by the routers commonly used in electronic concurrent computers. In addition, such programmable interconnections are desirable since different algorithms dictate different interconnections for efficient implementation.
Other considerations
In the following subsections we compare local communication, programming methodologies, and the resistance to technological defects in POEM and in SS.
Local communication
In this subsection we show that "simulated wires" used in SS are much slower than electrical wires used in POEM systems. The idea of "simulated wires" is repeated application of a rule to move information across the plane. In fact, one application of a rule of complexity (k,m,n) can move information by a distance of at most (2k -1) pixels. If T is the time required to apply the rule and L is the length of the connection in pixels, then L Twire -T(2k -1) (19) In particular, for Murdocca's rule with k = 3, the movement of data across a distance equal to the length of several gates requires hundreds of applications of the rule, because even the simplest logic gates require large area when implemented in SS.
In contrast, the delay of a short electrical wire is basically determined by the inverter switching time and is given in Eq. (7). Moving information over a short distance or across many hundreds of logic gates takes essentially the same amount of time. This is due to the small size of electronic logic gates and the logarithmic dependence of delay on the length of the wire. To illustrate the above arguments, we compare the time delay in moving information in SS and in POEM. Assuming that the time to apply one transition rule is 1 ns and the area of a typical logic gate is 10 x 10 pixels, the time to move information across 10 gates is 100 ns. In contrast, a NAND gate, in 1 CMOS technology, with a 1 mm long output wire, has a delay of about 400 ps and can move information across many hundreds of gates.
In summary, the simulated wires of SS are inferior to their electronic counterparts in speed. This result indicates that the data and the control bits that operate on it must be placed close together in the plane.
RAM implementation and programming methodologies
The programming flexibility of digital electronic computing is closely associated with its ability to implement RAMs. In this subsection we show that SS does not provide an efficient means of RAM implementation, limiting its applications and increasing programming complexity.
In electronics, the speed of local interconnects enables the implementation of small size RAMs with fast access times. This enables POEM machines to perform space -time trade -offs and to handle large problems. In particular, it enables POEM machines to perform context switching for solving problems larger than the size of the machine. Therefore, POEM programming can be accomplished using the conventional stored program concept. The instructions and the data can be stored in the memory and executed by a processor.
On the other hand, SS has slow local communication, making the implementation of RAM difficult. A RAM has a requirement that any bit of storage is accessible in one clock cycle. Consider an S2 bit SS RAM. Assuming that the RAM is laid out as a 2 -D array of p x p pixel windows and each window stores one bit, the length of the side of the array is S *p. Thus, the longest simulated wire in this system is about S *p pixels long. Implementing even 100 bits of memory with Murdocca's rule or another similar rule would require unacceptable access time due to wire delays. Thus, the programming methodology in SS must be different from the processor -memory model used in POEM and at least for now appears to be more difficult and limited in flexibility.
To overcome the lack of efficient communication capability, lack of efficient memory, and complex logic implementations, researchers have proposed to lay out SS programs as circuits, with data and associated control bits placed in close proximity. This approach seems to be harder, at least for now, to use for programming because of the difficulty of laying out the computation and making sure the timing is properly arranged. Thus, it appears that SS would be more suited for highly structured, local, fine -grain, space -invariant problems. An application of SS to such a problem has yet to be demonstrated.
Fault tolerance
Any fabrication procedure has a certain yield factor. Therefore, the POEM and SS systems must have resistance to technological interconnected electronic architectures such as the hypercube. POEM architecture provides a flexible, fast, and parallel environment through its programmable global optical interconnects. First, POEM systems can handle half a million PEs on two highly interconnected wafers, as discussed in Sec. 5.1. This large number and very high density of interconnections is a direct result of the 3-D nature of the POEM architecture. Although the estimated number of PEs in POEM is already quite impressive, one can envision that the number can be further increased using more PE planes and interconnection holograms. Additionally, with multiple PE planes the processor grain size can be increased without reducing the overall number of processors in the system. Second, optical interconnects provide fast means of global communication with low energy requirements. As a result, POEM systems can fully use the advantage of highly interconnected architectures.
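The local-communication and RAM arguments above can be checked numerically with a short sketch. It evaluates Eq. (19) for a simulated wire and applies it to the longest wire of an S²-bit RAM. The function names and the sample values of S and p are assumptions made for illustration; the 1 ns rule time and the 10-pixel gate width are the figures quoted in the text. Note that the 100 ns figure quoted for ten gates corresponds to data advancing one pixel per rule application (k = 1 in Eq. (19)); with k = 3 the same formula gives a lower bound of 20 ns.

```python
def simulated_wire_delay(length_px: float, rule_time_ns: float, k: int) -> float:
    """Eq. (19): delay in ns of a simulated wire of length_px pixels when
    each rule application takes rule_time_ns and moves data at most
    (2k - 1) pixels."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return rule_time_ns * length_px / (2 * k - 1)


def ss_ram_access_delay(s: int, p: int, rule_time_ns: float, k: int) -> float:
    """Delay in ns along the longest simulated wire (about S*p pixels) of an
    S^2-bit RAM whose one-bit cells are p x p pixel windows."""
    return simulated_wire_delay(s * p, rule_time_ns, k)


if __name__ == "__main__":
    T_NS = 1.0  # rule application time quoted in the text

    # Ten gates of 10 pixels each, one pixel per rule application (k = 1).
    print(simulated_wire_delay(10 * 10, T_NS, k=1))  # 100.0

    # The same wire with k = 3, i.e. at most 5 pixels per application.
    print(simulated_wire_delay(10 * 10, T_NS, k=3))  # 20.0

    # 100-bit RAM (S = 10); the cell width p = 10 is an assumed value.
    print(ss_ram_access_delay(10, 10, T_NS, k=3))    # 20.0
```

Under these assumed values a 100-bit SS RAM needs about 20 rule applications per access, roughly 50 times the 400 ps quoted for the electrical wire, and the gap grows linearly with S.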
Fault tolerance
Any fabrication procedure has a certain yield factor. Therefore, POEM and SS systems must be resistant to technological defects. In POEM architecture the global interconnects are space-variant and programmable. Therefore, faulty processors can be easily bypassed.
Since the interconnections in SS are space invariant and nonprogrammable, it seems very difficult to implement any fault tolerance. One possible way of handling faults is to arrange the control and data bit placements to avoid defective cells. Although this is possible, it complicates the design of SS systems because it introduces additional constraints to the problem of laying out a computation in the SS plane.
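As a rough illustration of bypassing faulty processors with programmable interconnects, the sketch below remaps a logical interconnection pattern onto the working PEs only. The data model (integer PE indices, a set of faulty PEs, a list of logical links) is hypothetical and is not taken from the POEM design; it only shows why programmability turns fault avoidance into an update of the interconnection table.

```python
from typing import Dict, Iterable, List, Set, Tuple


def map_logical_to_physical(num_physical: int, faulty: Set[int],
                            num_logical: int) -> Dict[int, int]:
    """Assign each logical PE to the next working physical PE, in order."""
    working = [pe for pe in range(num_physical) if pe not in faulty]
    if len(working) < num_logical:
        raise ValueError("not enough working PEs")
    return {logical: working[logical] for logical in range(num_logical)}


def program_interconnect(links: Iterable[Tuple[int, int]],
                         placement: Dict[int, int]) -> List[Tuple[int, int]]:
    """Translate logical (source, destination) links into physical links."""
    return [(placement[src], placement[dst]) for src, dst in links]


if __name__ == "__main__":
    # 8 physical PEs, PEs 2 and 5 defective, 4-node logical ring.
    placement = map_logical_to_physical(8, {2, 5}, 4)
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
    print(program_interconnect(ring, placement))
    # [(0, 1), (1, 3), (3, 4), (4, 0)]
```

With space-invariant connections no such table exists, so avoiding a defective cell means changing the placement of the control and data bits themselves.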
CONCLUSIONS
Our intent in this paper has been to introduce a new optoelectronic parallel computing architecture called the programmable optoelectronic multiprocessor (POEM). The attractive features of this architecture have been established by comparing POEM to symbolic substitution (SS), a parallel optical computing system widely recognized in the research community. The comparison has included computational efficiency of the architectures, power dissipation and speed of the respective supporting technologies, ease of programming, and amenability to fault tolerance. A summary of the comparison appears in Table II. The POEM architecture is motivated by analyses indicating efficient and effective means of combining optics and electronics. Electronics possesses a very mature technology for switching devices, and electrical communication is more efficient than optical communication for short distances (less than 1 mm). Thus, small to medium grain electronic processors (about 1000 gates) form the core of POEM. For the greater distances of interprocessor communication, optical link efficiency compares so favorably with electrical link efficiency that the price paid (in power dissipation and delay) for optoelectronic conversions is overcome. In contrast, SS is at the extreme of fine-grained processing and pays a high price for having even its shortest links implemented optically. All-electronic multiprocessor systems usually represent the other extreme of coarse-grained processing and squander substantial power and time by driving long wires.
Also, POEM can incorporate complex global patterns of interprocessor communication, which is difficult for SS and all-electronic systems to achieve. The extremely high clock rate of SS systems prohibits the use of holographic connection elements, which in turn has required space-invariant communication using refractive optics. This reduces SS to the equivalent of a 2-D mesh-connected architecture, which is well known to be computationally inefficient for the solution of many problems. Though the use of multiple and complex substitution rules mitigates this limitation, such rules exact a heavy penalty in system power dissipation and speed. In this respect, our results agree with those of other researchers20 who show that space-invariant transition rules do not give efficient implementations of basic computational operations. Thus, we observe that some proponents of SS and similar systems have begun incorporating global interconnections into their designs.27,28 In particular, the architecture of SS was recently modified by the addition of global space-invariant crossover interconnection between the pixels and space-variant custom masks.29 While the hardware remains the same, the computational characteristics of the modified SS architecture are markedly different from those of conventional SS. The new SS architecture is no longer equivalent to a mesh in topology, because it uses the connectivity of optics. It has been applied to realize programmable logic arrays (PLAs),30 an optical RAM decoder circuit,31 and a two-channel sorting node for a wideband digital switch.39 Although the new SS architecture may overcome some interconnection problems, it does not directly address the following problems:
(1) The energy budget of the new SS system has not been worked out to determine the possible size and speed of the system.
(2) In the PLA designed with the new SS approach, all of the gates in the PLA dissipate energy, whereas in electronic PLAs only the gates that actually implement the function dissipate energy. Thus, making large optical PLAs is likely to be energy inefficient.
(3) Although the new SS has been applied to realize a PLA, it is questionable whether a PLA is the most efficient method to realize a logic function in terms of gate count and speed. In fact, the number of optical gates that are unused in computing a Boolean function increases with the complexity of the function.
(4) It is questionable whether using a regular interconnect to implement irregular functions is a computationally efficient method.
Based on these considerations, another comparison study of the new SS architecture with POEM is underway.
Although the technology supporting POEM is not as fast as that used in SS, its efficient combination of optics and electronics and its flexible use of global interconnects give POEM computational power greater than that of SS. Efficient local electronic connections used in POEM allow easy implementation of random access memory. This in turn facilitates traditional programming methodologies. In SS, data and programming information must be tightly interleaved because communication distances are limited. The space-variant and programmable optical interconnects of POEM allow interprocessor connection topologies that are more efficient than mesh connection and easily accommodate fault tolerance through bypassing defective processors. The space-invariant connections of SS make this difficult to do.
Although the computational performance of POEM using existing technology is already competitive with any other system, it can be expected to improve steadily. Current limitations to POEM performance are technological: the speed of electronic processors and optical modulators, both of which are being actively developed. SS, by relying heavily on high speed for computational power, already faces fundamental limits in device power dissipation and signal skew due to optical propagation.
POEM architecture is well suited for parallel processing with a variety of processor granularity, synchrony, and interconnection topology. It combines the power of parallel space-variant optical communication with the flexibility and efficiency of electronics. The fast, global, and programmable interconnections of POEM will significantly enhance the capabilities and application range of parallel computing.
