The parallel nature of optics and free-space propagation, together with its freedom from communication interference, makes it ideal for designing massively parallel computers. Our architecture is highly amenable to optical implementations and aims at data-parallel applications. We estimated the theoretical performance of the optical system and compared it with electronic SIMD array processors. Preliminary results show that the system provides greater computational throughput and efficiency than its electronic counterparts.
The model includes a hierarchical mapping technique that helps design the algorithms and maps them onto the proposed optical architecture.
Background
The tremendous progress in science and technology has increased the need to process large amounts of data in real time in a wide variety of scientific applications. For example, real-time computer vision requires processing images of 1,000 x 1,000 data elements within a time frame of 16.7 milliseconds. This time suggests processing rates of 10 to 1,000 GOPS ($10^9$ operations per second).

Several researchers suggest optics as a complementary technology for breaking major performance barriers faced by conventional electronic technology. Optics has many unique features that can be exploited for high-speed parallel processing. They include speed, parallelism, adequate communications, and architectural flexibility.

Optical systems are inherently multidimensional. Lenses, prisms, and mirrors can transfer planes comprising over a million data points simultaneously. This fact implies that a cost-effective parallel means of achieving I/O and multidimensional architectural topologies may be possible. The rate at which data moves through an optical processing system is essentially limited by the rate at which data enters the system and its detection at the output. The actual computation time consists mainly of light propagation through optical devices (provided that the switching rates of these active devices are comparable to optical signal propagation). Thus, we can obtain higher throughput and processing rates than we do with current systems.
Perhaps the most attractive feature of optics for massively parallel processing is communications. Transmission of information via photons requires no physical conducting material, but relies on low-loss dielectric material for waveguide propagation or free space. As a result, optics-based interconnections potentially offer a freedom from mutual effects not afforded by electronic interconnections. This advantage becomes more important as the bandwidth of the interconnections increases, for the effect of mutual coupling associated with electrical interconnections is proportional to the frequency of the signals propagating on the interconnect lines. Therefore, optics-based communication offers higher temporal and spatial bandwidths.
The noninterfering nature of optical interconnections offers extra flexibility in routing, which in turn offers more architectural flexibility. Since electrical interconnections cannot cross, they must be routed under one another. Optical interconnections can cross one another without negative effects. Moreover, since optics-based interconnections require no mechanical contacts, we simply change the directions of optical beams to reroute interconnections. Various sources provide more details on optical interconnections.

While the justifications for using optics for interconnections as well as mass storage are well established, the justification for using optics in digital processing remains in an embryonic stage, since digital optical device developments are in their infancy. However, if data must be converted to optical form to use an optics-based communication medium, using an optical computing engine might keep up with the rate of communications without resorting to signal conversions (electronic-optical-electronic). These conversions cause major performance degradation and increase power consumption.
The possibility of using optics for building new parallel computing systems tailored to the requirements of data-intensive applications has been an objective of several researchers. Recent technological advances in optical devices raised hopes for the practical realization of new parallel optical computers. These advances include the development of compounds in multiple quantum wells for high-efficiency injection lasers, the development of nonlinear materials for optical switching devices, and the development of optical logic devices capable of implementing logic functions and serving as memory storage.
The 3D optical architecture
The driving features of optical systems (the massive fine-grain parallelism and the high degree of communication flexibility) and our ability to move around large optical images of bright and dark spots with great ease using optical components suit many applications. Such applications require the processing of large amounts of structured data (multidimensional arrays) and favor the SIMD mode of computation. The attractiveness of these attributes is evidenced by the number of SIMD optical architectures that have been proposed in the past. The optical model we present also exploits optics' advantages for parallel processing.
The 3D optical architecture. Figure 1 depicts a block diagram of the basic components of the optical architecture. Unlike conventional computers that manipulate individual 0s and 1s as basic computational objects, the optical architecture manipulates bit planes as basic computational entities. Each bit plane $i$ corresponds to a weight factor $2^i$ in the binary representation, and up to three bit planes can be processed simultaneously. For images of $n \times n$ elements, up to $3n^2$ operations process concurrently.
The heart of the architecture is the parallel processor array. Locally, this array can be viewed as a bit-serial or a bit-slice processor, since it performs one logical operation on one, two, or three 1-bit operands. Globally, it can be viewed as a plane-parallel processor, since it simultaneously performs the same operation on a large set of operands encoded as bit planes.

This bit-serial processing allows flexible data formats and almost unlimited precision. Optical interconnections move the images around the system. We conceived the architecture as being built with optical hardware that manipulates entire images simultaneously, both at I/O and processing. In this way, the 2D parallelism is sustained throughout the various stages of the computation.
Processor array organization. The processor array operates in the SIMD mode of computation: the same operation applies to all data entries. Processing is based on optical symbolic substitution logic, or SSL, described in detail later. The processor array uses three fundamental operators: a logical Not, a logical And, and a full Add, as defined below, along with some other basic terms:
• Definition 1. We define a bit plane as a mapping $I \times I \rightarrow \{0,1\}$, where $I$ is a set of integers. Hence, we denote it as $A = [a_{ij}]$, where $i,j$ represent the Cartesian coordinates of the binary value $a_{ij} \in \{0,1\}$. For an $n \times n$ bit plane, $i,j = 1, \ldots, n$. We define a 0 plane as an $n \times n$ bit plane $A$ such that $a_{ij} = 0$ for all $i,j = 1, \ldots, n$. Similarly, we define a 1 plane as an $n \times n$ bit plane such that $a_{ij} = 1$ for all $i,j = 1, \ldots, n$.
• Definition 2. We define a data plane of length $q$ as a stack of $q$ bit planes, denoted by the boldface notation $\mathbf{A} = A_{q-1} A_{q-2} \ldots A_0$, where $A_{q-1}$ and $A_0$ are the most significant and the least significant bit planes, respectively. We will also denote $\mathbf{A} = [a_{ij}]$, where $a_{ij}$ is an integer number.
• Definition 3. We define a plane negation operator, denoted by P-Not(A), as one that takes a bit plane $A$ as input and produces an output bit plane $A'$ as follows: P-Not($A$) = $A'$, where $A' = [\bar{a}_{ij}]$ for $i,j = 1, \ldots, n$.
• Definition 4. We define a plane logical And operator, denoted by P-And, as one that takes two bit planes $A, B$ as arguments and produces an output bit plane $X$ as follows: P-And($A, B$) = $X$, such that $x_{ij} = a_{ij} \wedge b_{ij}$, where $\wedge$ is the conventional logical And applied to single bits.
• Definition 5. We define a plane full Add operator, denoted by P-Add, as one that adds three bit planes $A, B, C$ and produces two output planes $X$ and $Y$, defined as follows: P-Add($A, B, C$) = $X, Y$, where $x_{ij} = a_{ij} \oplus b_{ij} \oplus c_{ij}$ is the sum bit and $y_{ij} = (a_{ij} \wedge b_{ij}) \vee (a_{ij} \wedge c_{ij}) \vee (b_{ij} \wedge c_{ij})$ is the carry bit of a full adder.

The signs $\oplus$ and $\vee$ denote the conventional logical exclusive Or and the logical Or operations, respectively. The three fundamental operators constitute a complete logic and arithmetic set capable of computing any arithmetic or logic function using bit-serial algorithms.
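As a rough software illustration of the three fundamental operators, the following Python sketch models a bit plane as a nested list of 0/1 integers. The list representation, the function names, and the `p_or` helper are assumptions made for illustration only; the carry is taken to be the standard full-adder carry (the majority of the three input bits).

```python
# Bit planes are modelled as n x n lists of 0/1 integers (an assumption for
# illustration; in the architecture they are optical images).

def p_not(a):
    """Plane negation: complement every bit of plane a."""
    return [[1 - bit for bit in row] for row in a]

def p_and(a, b):
    """Plane And: And of the overlapping bits of planes a and b."""
    return [[x & y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def p_add(a, b, c):
    """Plane full Add: returns (sum plane, carry plane) of three bit planes."""
    total = [[x ^ y ^ z for x, y, z in zip(ra, rb, rc)]
             for ra, rb, rc in zip(a, b, c)]
    carry = [[(x & y) | (x & z) | (y & z) for x, y, z in zip(ra, rb, rc)]
             for ra, rb, rc in zip(a, b, c)]
    return total, carry

def p_or(a, b):
    """Or expressed with P-Not and P-And only, via De Morgan's theorem."""
    return p_not(p_and(p_not(a), p_not(b)))

A = [[0, 1], [1, 1]]
B = [[1, 1], [0, 1]]
C = [[1, 0], [1, 1]]
sum_plane, carry_plane = p_add(A, B, C)
```

Every entry of a plane is processed by the same operation, which mirrors the SIMD behavior of the processor array; the Python loops stand in for what the optics would do in a single pass.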
Data-routing functions. A distinctive feature of the bit plane architecture is that it provides parallel data movement along with the parallel processor array. We can load the binary images at the input in plane format, either from the data memory or from the external world, such as a television scanner or a remote-sensing device (Figure 1 again).

The data enters the processor array through three input planes A, B, and C, which are necessary for bit-serial arithmetic. Planes A and B hold the operands, while plane C holds the carry-bit plane required in bit-serial arithmetic. Depending on the primitive operator needed at a given computational step, the input combiner performs three data movement functions, as elaborated next.

For the logical P-Not operator, the input combiner latches the relevant input plane, with the data to be inverted, into the processor array without any change in the spatial position of the data. The logical P-And operator is applied to two bit planes, in which the logical And operation proceeds on overlapping bits from the two bit planes. The data movement function required in this case, the 2D perfect shuffle, performs the perfect shuffle function on the rows of the two relevant input planes, while leaving the column positions unchanged.
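A row shuffle of this kind can be sketched in a few lines of Python. The function name `row_shuffle` and the row order (row i of the first plane, then row i of the second, and so on) are assumptions made for illustration; called with three planes, the same helper gives the three-plane variant that a full Add over planes A, B, and C would need.

```python
def row_shuffle(*planes):
    """Interleave the rows of k equally sized planes into one (k*n) x n image."""
    n = len(planes[0])
    if any(len(p) != n for p in planes):
        raise ValueError("all planes must have the same number of rows")
    shuffled = []
    for i in range(n):
        for plane in planes:          # row i of A, then row i of B, ...
            shuffled.append(list(plane[i]))   # column positions are unchanged
    return shuffled

A = [[1, 1], [0, 0]]
B = [[0, 1], [1, 0]]
C = [[1, 0], [1, 1]]
two_way = row_shuffle(A, B)        # 4 x 2 image, as needed for P-And
three_way = row_shuffle(A, B, C)   # 6 x 2 image, as needed for P-Add
```

After the shuffle, bits that share the same coordinates in the input planes sit in adjacent rows, which is what lets a local pattern-matching rule see all operands of one bit position at once.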
Given two $n \times n$ input planes $A = [a_{ij}]$ and $B = [b_{ij}]$, the 2D perfect shuffle alternates the rows of A and B.

The P-Add operator adds the overlapping bits of three bit planes. The permutation function required for the data, the 2D 3-shuffle, is similar in function to the 2D perfect shuffle just described, except that the 2D 3-shuffle alternates rows of three bit planes. Given three input planes A, B, C of size $n \times n$, the resulting 3-shuffled image D measures $3n \times n$.

Similarly, all replacements proceed in parallel. Since the input image can be very large (say, 1,000 x 1,000 pixels), over ...

Figure 3. Light-intensity encoding of the binary values 0 and 1 (a); optical SSL rules for primitive operators: the full Add (b), the logical And (c), and the logical Not (d).
The logic value 0 is represented by two pixels, dark and bright, and the logic value 1 by the inverse pattern, bright and dark, as shown in Figure 3a. The dark and bright pixels represent increasing levels of light intensity. In this dual-rail coding scheme, the intensity of the bright pixel and its position represent a logic value, which has some implementation advantages.43 We refer to the optical encoding of the binary values 0 and 1 as the fundamental patterns.
We next implement the fundamental operators as SSL rules, specifying how to manipulate information represented by optical patterns. These optical patterns are combinations of the fundamental patterns.

We derive the SSL rules from the truth tables of the fundamental operators. The 2D 3-shuffle function described earlier groups bits of the same coordinates. Similarly, the logical P-And and P-Not give rise to four and two SSL rules, respectively, as shown in Figure 3c,d. Thus, we need a total of 14 SSL rules to implement the three fundamental operators.
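The rule count can be checked by enumerating truth tables. The Python sketch below generates one rule per truth-table row and encodes each bit in the dual-rail scheme; the pixel order in `ENCODE`, the tuple layout of a rule, and the helper names are assumptions for illustration, not the article's optical layout.

```python
from itertools import product

# Dual-rail encoding (assumed pixel order): 0 -> (dark, bright), 1 -> (bright, dark).
DARK, BRIGHT = 0, 1
ENCODE = {0: (DARK, BRIGHT), 1: (BRIGHT, DARK)}

def ssl_rules(n_inputs, func):
    """One rule per truth-table row: (input pattern, replacement pattern)."""
    rules = []
    for bits in product((0, 1), repeat=n_inputs):
        outputs = func(*bits)
        lhs = tuple(ENCODE[b] for b in bits)      # pattern to recognize
        rhs = tuple(ENCODE[b] for b in outputs)   # pattern to substitute
        rules.append((lhs, rhs))
    return rules

def full_add(a, b, c):
    """Sum and carry of a one-bit full adder."""
    return (a ^ b ^ c, (a & b) | (a & c) | (b & c))

add_rules = ssl_rules(3, full_add)                  # 2**3 = 8 rules
and_rules = ssl_rules(2, lambda a, b: (a & b,))     # 2**2 = 4 rules
not_rules = ssl_rules(1, lambda a: (1 - a,))        # 2**1 = 2 rules
total_rules = len(add_rules) + len(and_rules) + len(not_rules)
print(len(add_rules), len(and_rules), len(not_rules), total_rules)  # 8 4 2 14
```

The count 8 + 4 + 2 = 14 follows directly from the number of input combinations of a three-input, a two-input, and a one-input operator.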
Implementation of SSL. We briefly illustrate the implementation of one SSL rule ($r_3$ in Figure 3b) using an additive logic implementation method43 to assist the conceptual understanding of this technique. The required optics have two parts: pattern recognition followed by pattern replacement. ... replication, shift, combination, and masking from Figure 5 for clarity. Brenner et al.43 provide a detailed description of this particular method, including the optical setup.

Implementation of the processor array. Each of the three fundamental operators comprises several SSL rules that need to be fired simultaneously. To do so, we replicate the output of the input combiner a number of times equal to the number of SSL rules to be activated at a given stage of ... elements can replicate the input. However, we would need a binary treelike replication scheme to equalize the optical path for each copy.
Each copy moves to one of the eight SSL rules $r_1$ to $r_8$ of Figure 3b. After the necessary substitutions, we optically superimpose the outputs of every active SSL rule to form the processed result. The optical superimposition represents a logical Or of light patterns in which a bright pixel overwrites a dark pixel. Thus we can implement the processor array with three modules, namely, an Add, an And, and a Not.

Each module comprises the SSL rules of the corresponding operator, as illustrated in Figure 6. A dynamic beam-steering element (an acousto-optic or electro-optic deflector along with some mirrors) under program control deflects the input plane to the desired module. Within each module, a static beam-steering element directs the processed output to the output router, as shown in Figure 7. Recent work19,53 introduced an alternative dynamic method to implement the processor array that does not require a dynamic steering device.
Implementation of data-routing functions. The input combiner and output router assume only data movement functions; they do not require data processing. The input combiner assumes the three functions already described. Transmitting a bit plane to the processor array does not involve permutation of the data, and therefore we can use any imaging system.

The 2D perfect shuffle and 3-shuffle functions permute the row position of the input data. The literature proposes a wide variety of methods for realizing these functions.56-60 These ...
[Figure labels: Output of the input combiner; Adder module; And module]

The output-input optical feedback is a one-to-one mapping that does not require permutation of the data; furthermore, it does not need to be reconfigurable, which renders its optical implementation very simple. In fact, an imaging system with some control (polarization-based devices) can implement the optical feedback.

2D computing substructures. The optical architecture exploits spatial parallelism at the hardware level, which enables it to process an entire data plane at once. This capability is opposed to task (or function) parallelism, in which the data plane is decomposed into subplanes that are processed sequentially in a pipelined fashion.

To enforce this capability at the algorithm design level, we view the design and mapping process as a hierarchical structure, as shown in Figure 11. At the highest level of the hierarchy is the application we wish to solve (signal and image processing, vision, radar). The next level identifies the various algorithms we can use to compute these applications.
This level includes matrix algebra, numerical transforms, and solutions of partial differential equations, among others. A further analysis of these algorithms reveals that they share a common set of high-level operations, which we call computing substructures. These substructures can in turn be decomposed into a set of fundamental operators such as the P-Add, the logical P-And, and P-Not.

The rationale behind this mapping technique is that most of the data-parallel algorithms share common attributes such as regularity, localized and intensive computations, recursiveness, and matrix operations. So the mapping process starts by identifying a set of substructures that captures most of these features. We then must efficiently map these substructures onto the architecture and build parallel algorithms upon them. This makes the mapping process more systematic and hence efficient. In this article we concentrate on a representative set of these computing substructures to show the methodology.
IEEE Micro
We will denote data transfer by $A$ ($B$ or $C$) $\leftarrow X_k$, by which we mean the transfer of bit plane $X_k$ to the input plane A (B or C). Similarly, the expression $X \leftarrow$ ...

We use $C \leftarrow 0$ plane to clear all the entries of input plane C. Similarly, $C \leftarrow 1$ plane indicates the setting of all the entries of input plane C to 1, and $P \leftarrow 0$ plane denotes a transfer of a zero-bit plane to memory location P. Loop constructs such as $k := 0$ to $\log_2 n$, indices such as 0 and $\log_2 n$, and parameter calculations should be interpreted as control instructions that the control unit executes.
2D addition/subtraction. This substructure refers to the addition (or subtraction) of corresponding elements of two $n \times n$ data planes X and Y of integers. The result is a data plane $S = (s_{ij})$, whose elements are $s_{ij} = x_{ij} \pm y_{ij}$ for $i,j = 1, \ldots, n$. This step is similar to conventional matrix addition (subtraction). Let X be an $n \times n$ data plane of $q$ bit planes, $X = X_{q-1} X_{q-2} \ldots X_0$. Here $q$ is the precision of the operands, $X_0$ being the least significant and $X_{q-1}$ being the most significant bit plane, respectively. Similar considerations apply to the data plane Y.

The 2D addition substructure adds the corresponding elements of the data planes bit serially, starting from the least significant bit planes. For subtraction, we form the two's complement of Y and add it to X using the 2D addition procedure. We obtain the two's complement of data plane Y by first negating all the bit planes of Y ($Y_i \leftarrow Y_i'$ for $i = 0, \ldots, q-1$, using the P-Not operator). We then add it to a data plane whose least significant bit plane is a 1 plane; the remaining $q-1$ bit planes are all 0 planes.
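A minimal Python sketch of the bit-serial scheme follows, assuming a data plane is stored as a list of bit planes with index 0 as the least significant plane. The helper names (`to_bit_planes`, `from_bit_planes`, `const_plane`, `add_2d`, `sub_2d`) are hypothetical, and the final carry is simply discarded, so results are modulo $2^q$.

```python
def to_bit_planes(ints, q):
    """Split a plane of non-negative integers into q bit planes (index 0 = LSB)."""
    return [[[(v // 2 ** k) % 2 for v in row] for row in ints] for k in range(q)]

def from_bit_planes(planes):
    """Recombine bit planes into a plane of unsigned integers."""
    rows, cols = len(planes[0]), len(planes[0][0])
    return [[sum(planes[k][i][j] * 2 ** k for k in range(len(planes)))
             for j in range(cols)] for i in range(rows)]

def p_not(a):
    return [[1 - x for x in row] for row in a]

def p_add(a, b, c):
    """Full Add on three bit planes: returns (sum plane, carry plane)."""
    total = [[x ^ y ^ z for x, y, z in zip(ra, rb, rc)]
             for ra, rb, rc in zip(a, b, c)]
    carry = [[(x & y) | (x & z) | (y & z) for x, y, z in zip(ra, rb, rc)]
             for ra, rb, rc in zip(a, b, c)]
    return total, carry

def const_plane(like, bit):
    """A 0 plane or 1 plane with the same shape as the plane `like`."""
    return [[bit] * len(row) for row in like]

def add_2d(x, y):
    """Bit-serial 2D addition, least significant plane first."""
    carry = const_plane(x[0], 0)          # C <- 0 plane
    result = []
    for xk, yk in zip(x, y):
        sk, carry = p_add(xk, yk, carry)  # one plane-wide full Add per bit
        result.append(sk)
    return result                          # final carry discarded (mod 2**q)

def sub_2d(x, y):
    """X - Y as X plus the two's complement of Y."""
    negated = [p_not(yk) for yk in y]
    one = [const_plane(y[0], 1)] + [const_plane(y[0], 0)] * (len(y) - 1)
    return add_2d(x, add_2d(negated, one))

X = to_bit_planes([[5, 3], [7, 0]], 4)
Y = to_bit_planes([[2, 3], [1, 0]], 4)
print(from_bit_planes(add_2d(X, Y)))   # [[7, 6], [8, 0]]
print(from_bit_planes(sub_2d(X, Y)))   # [[3, 0], [6, 0]]
```

The loop runs once per bit plane regardless of the plane size, which is why the cost grows with the precision $q$ and not with the number of elements.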
2D multiplication. This operation refers to the multiplication of corresponding elements of two data planes. Let X and Y be $n \times n$ data planes of $q$ bit planes each. The product P forms as $2q$ bit planes, $P = P_{2q-1} P_{2q-2} \ldots P_0$, where $p_{ij} = x_{ij} \times y_{ij}$. As an example, let $q$ be equal to 3 (we assume the same precision for both data planes to simplify the example). With $X = X_2 X_1 X_0$ and $Y = Y_2 Y_1 Y_0$, the resulting product then becomes $P = P_5 P_4 P_3 P_2 P_1 P_0$. The multiplication process starts by clearing the product bit planes to zero:

$P_j^0 \leftarrow 0\ \text{plane}, \quad j = 0, \ldots, 5$

This step represents the initial partial product $P^0$ (the superscript 0 indicates the initial partial product). Next, we calculate the first partial product $P^1$:

$P_0^1 \leftarrow \text{P-And}(X_0, Y_0)$
$P_1^1 \leftarrow \text{P-And}(X_1, Y_0)$
$P_2^1 \leftarrow \text{P-And}(X_2, Y_0)$

The notation $P_j^i$ ($j = 0, \ldots, 5$) means the $j$th bit plane of the $i$th partial product. The second partial product $P^2$ is generated from $P^1$ in the following manner:

$T \leftarrow \text{P-And}(X_0, Y_1)$
$P_1^2, C \leftarrow \text{P-Add}(P_1^1, T, C)$
$T \leftarrow \text{P-And}(X_1, Y_1)$
$P_2^2, C \leftarrow \text{P-Add}(P_2^1, T, C)$
$T \leftarrow \text{P-And}(X_2, Y_1)$
$P_3^2, P_4^2 \leftarrow \text{P-Add}(P_3^1, T, C)$

The variable T here is a temporary bit plane. We obtain the final product after three iterations, $P = P^3$.
Note that, unlike the conventional shift-and-add multiplication algorithm, we did not need to shift the previous partial product to generate the current one. Instead, we start the addition at the bit plane corresponding to the amount of shift. Each of the $q$ partial products takes $q$ logical And operations and up to $q$ full additions, so generating the final product P takes on the order of $q^2$ plane operations. The time complexity of the 2D multiplication is therefore $O(q^2)$, independent of the number of pairs to be multiplied.
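The shift-free scheme can be sketched in Python as below, again assuming a data plane is a list of bit planes with index 0 as the least significant plane. The function and helper names are hypothetical, and the carry plane is cleared at the start of each partial product, which is an assumption of this sketch.

```python
def to_bit_planes(ints, q):
    """Split a plane of non-negative integers into q bit planes (index 0 = LSB)."""
    return [[[(v // 2 ** k) % 2 for v in row] for row in ints] for k in range(q)]

def from_bit_planes(planes):
    rows, cols = len(planes[0]), len(planes[0][0])
    return [[sum(planes[k][i][j] * 2 ** k for k in range(len(planes)))
             for j in range(cols)] for i in range(rows)]

def p_and(a, b):
    return [[x & y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def p_add(a, b, c):
    """Full Add on three bit planes: returns (sum plane, carry plane)."""
    total = [[x ^ y ^ z for x, y, z in zip(ra, rb, rc)]
             for ra, rb, rc in zip(a, b, c)]
    carry = [[(x & y) | (x & z) | (y & z) for x, y, z in zip(ra, rb, rc)]
             for ra, rb, rc in zip(a, b, c)]
    return total, carry

def mul_2d(x, y):
    """Elementwise product of two q-bit data planes, returned as 2q bit planes."""
    q = len(x)
    zero = [[0] * len(row) for row in x[0]]
    p = [zero] * (2 * q)                    # initial partial product: all 0 planes
    for i, yi in enumerate(y):              # one partial product per bit plane of Y
        carry = zero                        # assumed: C <- 0 plane
        for k, xk in enumerate(x):
            t = p_and(xk, yi)               # temporary plane T
            # Start at plane i, so no shift of the partial product is needed.
            p[i + k], carry = p_add(p[i + k], t, carry)
        p[i + q] = carry                    # last carry becomes the next plane
    return p

X = to_bit_planes([[5, 3], [7, 0]], 3)
Y = to_bit_planes([[6, 7], [7, 5]], 3)
P = mul_2d(X, Y)
print(from_bit_planes(P))   # [[30, 21], [49, 0]]
```

The two nested loops make the $q \times q$ structure of the plane operations explicit, while every plane operation still acts on all element pairs at once.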
2D data-shifting operations. We define two operations for shifting a data plane by a variable number of pixels in either direction. The logical shift involves columns (or rows) of 0s that enter from the side opposite to the shift direction.

Given a data plane P of $q$ bit planes $P_{q-1} P_{q-2} \ldots P_0$, we define a horizontal shift operation, denoted by $H_a(P)$, to be the data plane P shifted along the X axis by $a$ columns ($+a$ for a positive shift, and $-a$ for a negative shift). See Figure 12a.

The amount of shift is sequentially applied to every bit plane $P_k$ of the data plane P. The shifted plane can be stored either in itself or in a different data plane in memory. For the latter case, we introduce the notation $X \leftarrow H_a(P)$, by which we mean that the shifted plane P is stored in plane X. Similarly, we define two other operations, denoted by $V_a(P)$ and $X \leftarrow V_a(P)$, for vertical shifting. An illustration of vertical shift appears in Figure 12b.
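The logical shifts can be sketched in Python as follows. The sign convention used here (a positive amount moves data toward higher column or row indices) is an assumption of this sketch, as are the function names.

```python
def h_shift(plane, a):
    """Logical horizontal shift by a columns; zeros enter on the vacated side."""
    n = len(plane[0])
    if a >= 0:   # assumed convention: positive a moves data to the right
        return [[0] * min(a, n) + row[:max(n - a, 0)] for row in plane]
    return [row[min(-a, n):] + [0] * min(-a, n) for row in plane]

def v_shift(plane, a):
    """Logical vertical shift by a rows; zero rows enter on the vacated side."""
    n, width = len(plane), len(plane[0])
    if a >= 0:   # assumed convention: positive a moves data downward
        kept = [list(r) for r in plane[:max(n - a, 0)]]
        return [[0] * width for _ in range(min(a, n))] + kept
    kept = [list(r) for r in plane[min(-a, n):]]
    return kept + [[0] * width for _ in range(min(-a, n))]

def shift_data_plane(bit_planes, shift, a):
    """Apply the same shift to every bit plane of a data plane."""
    return [shift(p, a) for p in bit_planes]

P = [[1, 2], [3, 4]]            # entries labelled 1..4 to make movement visible
print(h_shift(P, 1))            # [[0, 1], [0, 3]]
print(v_shift(P, -1))           # [[3, 4], [0, 0]]
```

`shift_data_plane` reflects the statement that the shift is applied to every bit plane in turn, so shifting a $q$-bit data plane costs $q$ plane shifts.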
We now have an optical register-transfer language, comprising the 2D operations just described, with which we can describe parallel algorithms without referring to the machine hardware. Hereafter, the shorthand notations 2D add(X, Y), 2D sub(X, Y), 2D multiply(X, Y), $H_a(P)$, and $V_a(P)$ denote the 2D addition, 2D subtraction, and 2D multiplication of the two data planes X and Y, and the horizontal and vertical shift of a data plane P, respectively.
Mapping data-parallel algorithms
The key feature of data-parallel algorithms is that their parallelism comes from simultaneous operations across large sets of data, rather than from multiple threads of control. A large portion of scientific computing algorithms fall into this category because of the enormous amounts of structured data that need to be processed. Various sources propose SIMD machines as the most suitable class of computers for dealing with these algorithms. These machines include image processing systems such as the MPP,75 the Clip, and the DAP 610,78 as well as fine-grained parallel systems such as the Connection Machine.

[Figure 12. Horizontal (a) and vertical (b) shifts of a 4 x 4 plane P, including $V_{-2}(P)$.]

The SIMD optical architecture potentially offers a larger array size (larger number of processing elements) than existing counterparts. In addition, its unrestricted interconnections give it greater flexibility in handling data-parallel algorithms that require local as well as global communications. In the following, we show the mapping of several algorithms onto the architecture. We chose these algorithms to represent a broad range of complexity. They are also important key algorithms that occur as subproblems in larger programming tasks. Many more numerical algorithms have been mapped onto the optical architecture.55
Row/column accumulation. To accumulate a data plane rowwise, we sum all the elements of a particular row; the final sum resides in the first entry of that row. Similarly, for column accumulation, we sum all the elements of a particular column, and the final sum occupies the first entry of that column. For a given data plane S of $n \times n$ elements, we proceed as follows. We split the initial plane S horizontally using the vertical shift operation (or vertically, for rowwise accumulation, using the horizontal shift operation) into two planes X and Y. Each plane contains half the data entries of S. Next we add these planes using the 2D addition substructure.

We repeat this split-and-add process for $\log_2 n$ iterations, after which the first row (first column) of S holds the accumulated sums of each column (row).
Procedure Row-Sum/Column-Sum(S, X, Y)

We can combine the Row-Sum and Column-Sum substructures to compute the sum of all the elements of a data plane. To find the sum of the elements of a data plane, say S, we first apply the Row-Sum substructure to produce one column of accumulated sums. Next, we apply the Column-Sum substructure to accumulate the elements of that column. Figure 13 shows an example of computing the sum of the 16 elements stored in a 4 x 4 data plane S. We compute the sum after $2\log_2 16 = 8$ steps and store it in location $s_{11}$. Similarly, we can compute the product of all the elements of a data plane (chain multiplication) using the 2D multiplication and the vertical and horizontal shift substructures. The product of $n^2$ elements with precision $q$ each can be found in $O(q^2 \log_2 n)$ time.
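The split-and-add accumulation can be sketched in Python at the integer level. In this sketch, ordinary integer addition stands in for the bit-serial 2D addition, n is assumed to be a power of two, and the function names and shift directions are assumptions for illustration.

```python
def add_planes(x, y):
    """Elementwise addition; stands in for the bit-serial 2D addition."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def shift_up(plane, a):
    """Rows move toward the first row by a; zero rows enter at the bottom."""
    width = len(plane[0])
    return [list(r) for r in plane[a:]] + [[0] * width for _ in range(a)]

def shift_left(plane, a):
    """Columns move toward the first column by a; zeros enter on the right."""
    return [row[a:] + [0] * a for row in plane]

def row_sum(s):
    """After log2(n) split-and-add steps, the first column holds each row's sum."""
    half = len(s[0]) // 2
    while half >= 1:
        x = [row[:half] + [0] * (len(row) - half) for row in s]   # near half
        y = shift_left(s, half)                                   # far half, moved over
        s = add_planes(x, y)
        half //= 2
    return s

def column_sum(s):
    """After log2(n) split-and-add steps, the first row holds each column's sum."""
    half = len(s) // 2
    while half >= 1:
        x = [list(r) for r in s[:half]] + [[0] * len(r) for r in s[half:]]
        y = shift_up(s, half)
        s = add_planes(x, y)
        half //= 2
    return s

S = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
rows_done = row_sum(S)
total = column_sum(rows_done)[0][0]
print([r[0] for r in rows_done], total)   # [10, 26, 42, 58] 136
```

For the 4 x 4 example, two iterations in each direction suffice, and each iteration consists of a split (shift) followed by an addition.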
Matrix multiplication. Let X, Y be $n \times n$ matrices (assuming the same size for simplicity). Then their product $X * Y = Z$ is an $n \times n$ matrix (the asterisk denotes matrix multiplication) whose elements are given by:

$$z_{ij} = \sum_{k=1}^{n} x_{ik}\, y_{kj}, \qquad i, j = 1, \ldots, n.$$
[Figure 13. Computing the sum of the 16 elements of a 4 x 4 data plane S: (a) the initial plane S; step 1: split S into X and Y.]
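We assume matrix X is transposed and stored as $X = X'_1, X'_2, \ldots, X'_n$, where $X'_i$ is an $n \times n$ matrix formed by replicating the $i$th column of the transposed matrix $X^t$ $n$ times.

The basic approach for computing the $j$th row of the product matrix Z is to first generate the point-by-point multiplication of the elements of matrix Y by the elements of matrix $X'_j$, using the 2D multiplication substructure. Then we sum the columns of this matrix, using the Row-Sum substructure. As an example, consider matrices X and Y to be 3 x 3 matrices of integers:

$$X = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \end{bmatrix}$$

We assume matrix $X = X'_1 X'_2 X'_3$, where $X'_i$ is the $i$th row of matrix X transposed and replicated. For $i = 1$:

$$X'_1 = \begin{bmatrix} x_{11} & x_{11} & x_{11} \\ x_{12} & x_{12} & x_{12} \\ x_{13} & x_{13} & x_{13} \end{bmatrix}$$

The elementwise multiplication of $X'_1$ and Y using the 2D multiplication results in a matrix:

$$T = \begin{bmatrix} x_{11}y_{11} & x_{11}y_{12} & x_{11}y_{13} \\ x_{12}y_{21} & x_{12}y_{22} & x_{12}y_{23} \\ x_{13}y_{31} & x_{13}y_{32} & x_{13}y_{33} \end{bmatrix}$$

We accumulate the rows of matrix T using the Row-Sum substructure to generate a matrix $Z^1$ whose first row is the first row of the product matrix Z.

In a similar manner, we use $X'_2$ and $X'_3$ to generate two matrices $Z^2$ and $Z^3$, respectively.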
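The row-by-row procedure can be sketched in Python at the integer level. Here ordinary multiplication and a plain column sum stand in for the 2D multiplication and the accumulation substructure, and writing the accumulated row directly into row i of Z stands in for shifting each intermediate matrix into place and adding; these simplifications, and all function names, are assumptions of this sketch.

```python
def replicate_row_transposed(x, i):
    """X'_i: row i of X written as a column and replicated across n columns."""
    n = len(x)
    return [[x[i][k]] * n for k in range(n)]

def matmul_by_planes(x, y):
    """Build Z = X * Y one row at a time from elementwise plane operations."""
    n = len(x)
    z = [[0] * n for _ in range(n)]
    for i in range(n):
        xi = replicate_row_transposed(x, i)
        # Elementwise (point-by-point) product; stands in for 2D multiplication.
        t = [[xi[k][j] * y[k][j] for j in range(n)] for k in range(n)]
        # Accumulate down each column; the result is row i of the product.
        first_row = [sum(t[k][j] for k in range(n)) for j in range(n)]
        # Stands in for shifting the intermediate matrix down by i rows and adding.
        z[i] = first_row
    return z

X = [[1, 2], [3, 4]]
Y = [[5, 6], [7, 8]]
print(matmul_by_planes(X, Y))   # [[19, 22], [43, 50]]
```

Each pass of the loop uses one plane-wide multiplication and one accumulation, so an $n \times n$ product needs $n$ such passes.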
Note that the first row of $Z^2$ and the first row of $Z^3$ are the second and last rows of the product matrix Z, respectively. We generate the product matrix Z by shifting $Z^2$ by one row and $Z^3$ by two rows downward, and sequentially adding all three matrices $Z^1$, $Z^2$, $Z^3$ using the 2D addition substructure.

Let the set of data be represented by a data plane S of $n^2$ elements. The algorithm for finding the maximum value of S proceeds by folding S repeatedly in half and selecting the larger value of overlapping elements from each half at each step. Initially we fold S in half by storing its first $n/2$ rows in the first $n/2$ rows of a matrix, say $S_0$, and its last $n/2$ rows in the first $n/2$ rows of a second matrix $S_1$. We subtract $S_1$ from $S_0$ using the 2D subtraction substructure, retain the maximum value of each pair of elements, and store it back into S. The new S contains half the data points of the original set. We find the maximum value by repeating the folding and subtraction process for $2\log_2 n$ iterations. During the first $\log_2 n$ iterations, we fold the data plane S along the horizontal direction at each iteration. At the end of the first $\log_2 n$ iterations, each entry in the first row of S holds the maximum value of the corresponding column. Next, we fold S vertically and perform comparisons for another $\log_2 n$ iterations, after which the maximum value of the entire data plane S is located in the first entry $s_{11}$.
Procedure Maximum(S, $S_0$, $S_1$)

$S_0 \leftarrow V_{-\beta}(S)$; $V_{+\beta}(S_0)$; $S_1 \leftarrow V_{+\beta}(S)$; $T \leftarrow$ 2D sub($S_0$, $S_1$); if $T_{q-1} = 0$ then $S \leftarrow S_0$ else $S \leftarrow S_1$;

In this procedure, $T = T_{q-1} T_{q-2} \ldots T_0$ is a temporary data plane used to hold the subtraction result. $T_{q-1}$ is the most significant bit plane of T. Since we are using two's-complement subtraction, the most significant bit of the subtraction result indicates the relative magnitude of the operands. Each entry $T_{q-1}(ij)$ represents the sign bit of the subtraction operation $S_0(ij) - S_1(ij)$. Therefore, for $T_{q-1}(ij) = 0$, $S_0(ij)$ is greater than or equal to $S_1(ij)$; otherwise $S_1(ij)$ is greater than $S_0(ij)$. We achieve the selection of the largest value (denoted by the simple conditional statement: if $T_{q-1} = 0$ then $S \leftarrow S_0$ else $S \leftarrow S_1$) by the following Boolean expression:

$$S_k = \text{P-And}(\text{P-Not}(T_{q-1}), S_{0k}) \vee \text{P-And}(T_{q-1}, S_{1k}) \qquad (5)$$

where the sign $\vee$ is the logical Or, and $S_{0k}$, $S_{1k}$ refer to the $k$th bit plane of arrays $S_0$ and $S_1$, respectively. We can express Equation 5 using only logical P-Not and P-And operators (using De Morgan's theorem, $A \vee B = \overline{\bar{A} \wedge \bar{B}}$):

$$S_k = \text{P-Not}(\text{P-And}(\text{P-Not}(\text{P-And}(\text{P-Not}(T_{q-1}), S_{0k})), \text{P-Not}(\text{P-And}(T_{q-1}, S_{1k})))) \qquad (6)$$
Several iterations through the system carry out Equation 6. The time complexity of the algorithm is $O(q \log_2 n)$. This algorithm is representative of many neighborhood algorithms such as those that find the minimum, the average, the median, the sum of a data plane, histogramming, counting, and so on. All can be implemented in $O(q \log_2 n)$ time. For example, we can apply the same algorithm to find the minimum of a set. In this case, we retain the minimum value at each iteration.
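The folding scheme and the sign-controlled selection can be sketched in Python. The fold is done at the integer level, with the sign of an ordinary subtraction standing in for the most significant bit plane of the two's-complement result (valid as long as the subtraction does not overflow); n is assumed to be a power of two, and all names are hypothetical.

```python
def select_or_form(t_sign, s0_bit, s1_bit):
    """Selection in the style of Equation 5: pick s0 when the sign is 0, else s1."""
    return ((1 - t_sign) & s0_bit) | (t_sign & s1_bit)

def select_not_and_form(t_sign, s0_bit, s1_bit):
    """The same selection using only Not and And, via De Morgan's theorem."""
    inner0 = 1 - ((1 - t_sign) & s0_bit)
    inner1 = 1 - (t_sign & s1_bit)
    return 1 - (inner0 & inner1)

def fold_max_rows(s):
    """Fold the rows in half repeatedly; one row of column maxima remains."""
    half = len(s) // 2
    while half >= 1:
        s0, s1 = s[:half], s[half:2 * half]
        folded = []
        for r0, r1 in zip(s0, s1):
            row = []
            for a, b in zip(r0, r1):
                sign = 1 if b > a else 0      # sign bit of S0 - S1
                row.append(a if sign == 0 else b)
            folded.append(row)
        s = folded
        half //= 2
    return s

def maximum(s):
    """Column maxima first, then fold that row to a single value."""
    col_max = fold_max_rows(s)[0]
    return fold_max_rows([[v] for v in col_max])[0][0]

S = [[3, 9, 1, 4], [7, 2, 18, 6], [5, 0, 11, 10], [12, 13, 14, 15]]
print(fold_max_rows(S)[0], maximum(S))   # [12, 13, 18, 15] 18
```

Replacing the comparison `b > a` with `a > b` retains the smaller element at each fold and so yields the minimum, in line with the remark that the same algorithm finds the minimum of a set.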
Projected performance
We estimated the theoretical performance of the optical architecture by evaluating several performance measures and compared them with those of existing SIMD array processors. The optical implementation of SSL follows the method briefly described earlier.43,55 We used the following key parameters in the analysis:

Optical cycle time. The optical cycle time, denoted by $T_{\text{array}}$, equals the time elapsed between inputting the data in the input planes and outputting the result at the output router. This time includes formatting the data at the input combiner, processing the formatted data in the processor array, and routing it to the appropriate destination. Therefore: ...

Using the above parameters, we derive $T_{\text{proc}}$: ...
The numbers over the braces indicate the times needed to accomplish each subtask as enumerated above. The light propagation time can be in the range of 0.1 to 1 ns (light propagates at 1 ft/ns in free space). Presently, active optical devices are much less mature than the passive components and are the subject of intense research. The available optical switching devices have response times orders of magnitude higher than the propagation time (see Table 1). Thus, the dominant factor in ...

The diameter is the maximum number of communication cycles (or links) needed for any two PEs to communicate. For the optical case, this factor ...

OLEs are pulsed devices. In their operation, these devices require two wavelengths, one for the input signal and one for a bias signal (clock cycle). The two inputs, data and clock cycle, are separated in both time and wavelength. The modulated output signal has the same wavelength as the bias signal, and therefore the input and output signals have different wavelengths.11 Hinton reports research efforts under way to implement a device composed of two OLE devices interconnected in such a way that the second OLE ...

... example in Figure 14 shows the broadcasting of a value $x$ residing in the lower left-corner PE to all other PEs in $2\log_2 4 = 4$ steps.
In current implementations of the MPP and the Clip, I/O proceeds in column-parallel fashion, while the row-parallel DAP loads data into the processor array one column or one row at a time. In contrast, with the optical system, I/O activities proceed in a plane-parallel manner. This capability gives the optical system an I/O speedup of $n$, for an $n \times n$ input image, over the MPP, the Clip, and the DAP. It could be a tremendous speed advantage, considering the large potential value of $n$. We note that the Connection Machine can also handle plane-parallel data, loading through a very expensive I/O system called the data vault.

... on the proposed optical architecture.
* The speedup is the ratio of the execution time on one 1-bit processing element to the time taken on $n^2$ PEs.
** $q$ is the precision (or operand length).78

With the advances in optical storage and optical interconnections, optical information processing systems will have a major impact on overall system performance. If the data is already being stored and transmitted in optical form, optical computing processors might be the best alternative for processing the data rather than resorting to optical-to-electronic-to-optical conversions, which are major sources of performance degradation. Thus a more uniform technology for storage, communication, and processing of data in optical form will significantly impact the future of high-performance computing systems.
Digital optical information processing is the least developed at this time due to the immaturity of optical switching and logic devices. These devices are in their first generations.
Considerable research and development efforts presently under way will lead to optical devices with lower switching energy, higher switching speeds, and higher resolutions. As these devices mature, optical computing systems will be highly competitive with existing electronic systems.
Here, we contribute to the ongoing efforts in building the foundations of new optical computing systems. We introduced a 3D optical computing architecture based on symbolic substitution logic. Although we focused on architectural and algorithmic issues plus some performance projections, we provide an extensive and up-to-date reference list that covers all aspects of the field. We showed that with a few symbolic substitution rules one can build a massively parallel optical computer. After introducing a hierarchical mapping technique for mapping parallel algorithms onto the optical computing model, we mapped several parallel algorithms onto the architecture. We chose these algorithms to represent a wide range of computational complexity.

We have assessed the theoretical performance of the proposed optical system. Although the system is not competitive at the present time with electronic array processors, it is quite attractive. It can potentially deliver a throughput at least 100 times higher than that of its electronic counterparts (owing to its multidimensional nature and high speed). Moreover, the communication flexibility and the parallel I/O of the optical system seem to be unmatched by electronic array processors.

This preliminary performance analysis suggests that the proposed optical system is potentially a better alternative than current computing systems. The best applications are those that require the processing of large amounts of structured data, such as remote sensing, signal/image processing, vision, weather modeling, and seismic data processing.
