We consider the task of interconnecting processors to realize efficient parallel algorithms. We propose interconnecting processors using certain graphs called expander graphs, which can provide fast communication from any group of processors to the rest of the network. We show that these interconnections would result in a number of efficient parallel algorithms for sorting, routing, associative memory, and fault-tolerance networks. As the interconnections based on expander graphs are global and irregular, we reason that optical interconnections are preferred to electronic and propose implementation of these interconnections using the programmable optoelectronic multiprocessor architecture.
I. Introduction
To cope with the ever-increasing demand on computing power, it is not enough to rely on faster device technology. It is necessary to utilize parallel processing. Assuming that the communication overhead is small and that the algorithm can be fully parallelized, a task requiring $T$ sequential time steps can be performed in $T/p$ steps by distributing the task among $p$ processors. These two assumptions are the most important considerations in designing efficient parallel algorithms. We consider the design and implementation of such algorithms. More specifically, we investigate the interconnection properties of very large scale processor networks necessary to support efficient parallel algorithms. We find certain interconnection networks called expander graphs very useful for this purpose. Interconnections based on expander graphs can achieve global communication in constant time. This property of expander graphs is successfully exploited in the design of several efficient parallel algorithms. 1-10 However, it has been unclear how to construct and implement good expander graphs.
We take the position that the interconnection networks based on expander graphs are the key to implementing significantly efficient parallel algorithms.
To this end, we consider the design and the construction of expander graphs. We describe a probabilistic approach to construct and evaluate good expander graphs. We then try to convince the reader that expander graphs can indeed result in efficient algorithms in a variety of situations. We then discuss an optoelectronic implementation of these interconnection networks which combines the optical interconnection technology with very large scale integration (VLSI) technology, 11 thus overcoming the difficulties encountered with pure VLSI technology.
In Sec. II we explain the definition and theory of expander graphs. Using this theory, we present a probabilistic approach to the construction of expander graphs. In Sec. III we show that expander graphs give rise to efficient parallel algorithms in a number of application domains. In particular, we explain how expander graphs can be used to construct approximate halvers, the basic building block of an optimal sorting algorithm. We show how expanders can result in a lower delay in routing applications. We also describe applications in associative memory, object distribution, fault-tolerant networks, and error correcting codes. In Sec. IV, we discuss approaches to implementing irregular interconnections and propose an optoelectronic system. Finally, in Sec. V, we discuss our conclusions and suggest future research directions.
II. Expander Graphs
Efficient parallel algorithms rely on the fast transfer of information within the processor network. 12 Although a fully interconnected crossbar network can accomplish such communication in a single time step, the number and length of interconnections required ($n^2$ for $n$ nodes) make the implementation of such large-scale networks prohibitively expensive. Often we are limited to networks with a smaller number of interconnections per processor. An example of such a network is the hypercube with $\log_2 n$ interconnections per processor. Two processors are connected if the Hamming distance between their $\log_2 n$-bit addresses is 1. The hypercube and its variants, such as shuffle exchange and cube connected cycles, have become popular because of their relative ease of implementation and because a number of algorithms have been implemented on them with satisfactory performance. Such networks cannot give us optimal parallel algorithms, however, since they lack sufficient connectivity to facilitate fast parallel communication. For example, consider a communication task in which some small group, say 10%, of the processors have information which they need to transmit to the entire network. Assuming we have no control over the initial information distribution among the processors, any group of 10% of the processors can be considered. Clearly, a crossbar network can accomplish this task in one step but is technologically expensive. A hypercube can also accomplish this task but requires $O(\log n)$ steps, which scales with the size of the network (Fig. 1). We need a network which can interconnect an arbitrary number of processors in a constant number of steps.
Fig. 1. Communication distance on a hypercube with $n$ nodes. The nodes are denoted by $\log n$-digit binary addresses. For $n = 8$, the longest communication distance is between (000) and (111). If information needs to be transmitted from the group of nodes in the shaded region to the remaining nodes, it requires $\frac{1}{2}\log n - c(\log n)^{1/2}$ steps, which grows with $\log n$.
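To make the hypercube example concrete, the following minimal Python sketch counts how many nearest-neighbor steps are needed before information held by a group of nodes has reached every node. The function names and the choice of group (all addresses with few 1 bits) are illustrative assumptions, not taken from the paper.

```python
from collections import deque


def hypercube_neighbors(node, dim):
    """Nodes whose dim-bit address differs from `node` in exactly one bit."""
    return [node ^ (1 << b) for b in range(dim)]


def broadcast_steps(sources, dim):
    """Synchronous steps until every node has heard from some source.

    This is a multi-source breadth-first search over the dim-cube.
    """
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in hypercube_neighbors(u, dim):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def low_weight_group(dim, max_weight):
    """Hypothetical source group: addresses with at most max_weight 1 bits."""
    return [x for x in range(1 << dim) if bin(x).count("1") <= max_weight]


if __name__ == "__main__":
    for dim in (4, 8, 12):
        group = low_weight_group(dim, dim // 4)
        fraction = len(group) / (1 << dim)
        print(dim, round(fraction, 3), broadcast_steps(group, dim))
```

For this particular group the step count grows with the cube dimension, which is the behavior the text contrasts with constant-step global communication.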
Interconnection networks based on expander graphs can provide the solution. They can accomplish this task in a number of steps dependent only on the fractional size of the group and the graph's expansion factor. The number of steps is independent of network size. We call this property global communication with a constant number of steps. It has been shown that for any given expansion there exist expander graphs with a constant number of fanouts per node (see Theorem 1 of Sec. II.B). 13 This global communication ability is of fundamental importance for interconnection networks because a number of optimal algorithms can be developed based on such networks. In fact, there is some theoretical evidence that this property is indispensable for designing optimal algorithms. 12 One of these is an $O(\log n)$ routing algorithm which can be used as a basis for a general purpose parallel computer. 2
A. Definition of Expander Graphs
Expander graphs are defined in terms of their properties. For convenience, we describe them as bipartite graphs, i.e., graphs with connections between two disjoint sets of nodes. A bipartite graph $G = (I,O,E)$ has a set $I$ of input nodes and a set $O$ of output nodes with $E$ as the set of edges between input and output nodes. We consider only those bipartite graphs for which $|I| = |O|$ and define $|I| = |O| = n$. Edge $(i,j) \in E$ connects input node $i$ with output node $j$. For any subset $A$ of inputs, we define the neighborhood $\Gamma(A) = \{ j \in O : (i,j) \in E \text{ for some } i \in A \}$. The same applies to any subset of outputs with its neighborhood in $I$. We also define a bipartite graph $G$ to be $d$-regular if the degree, i.e., fanout, of every node in the graph $G$ equals $d$.
For $0 < \epsilon < 1$ and $\beta > 1$, a $d$-regular bipartite graph $G$ is a $[d, \epsilon, \beta]$ expander if, for every subset $A$ of inputs so that $|A| \le \epsilon n$, the neighborhood $\Gamma(A)$ of $A$ in $G$ is such that $|\Gamma(A)| \ge \beta |A|$. In other words, a graph expands if every subset of nodes up to a given size has a large neighborhood. We call $\beta$ the expansion factor of the graph. Typically, we use expander graphs with $\beta = (1 - \epsilon)/\epsilon$. In the following subsections, we look at the existence and the construction of expander graphs in greater detail.
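The definition can be checked directly on very small graphs. The sketch below is a brute-force test written for illustration only: it enumerates every input subset of size up to $\epsilon n$, so its cost is exponential in $n$. The names `neighborhood` and `is_expander` are hypothetical, and the size limit $|A| \le \epsilon n$ is assumed as the reading of the definition.

```python
import math
from itertools import combinations


def neighborhood(edges, subset):
    """Gamma(A): outputs j joined by an edge (i, j) to some input i in A."""
    members = set(subset)
    return {j for (i, j) in edges if i in members}


def is_expander(n, edges, eps, beta):
    """Return True if |Gamma(A)| >= beta * |A| for every nonempty
    input subset A with |A| <= eps * n (exhaustive check)."""
    max_size = math.floor(eps * n + 1e-9)  # small slack for float rounding
    for size in range(1, max_size + 1):
        for subset in combinations(range(n), size):
            if len(neighborhood(edges, subset)) < beta * size:
                return False
    return True


if __name__ == "__main__":
    # A 2-regular example on n = 4: input i is joined to outputs i and i + 1.
    n = 4
    edges = [(i, i) for i in range(n)] + [(i, (i + 1) % n) for i in range(n)]
    print(is_expander(n, edges, 0.5, 1.5), is_expander(n, edges, 0.5, 2.0))
```

In the small example, two adjacent inputs share an output, so their joint neighborhood has three nodes and the graph meets an expansion factor of 1.5 but not 2.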
B. Properties of Expander Graphs
In an expander graph, the size of the neighborhood of every set is larger than that set by a constant factor. This expanding property gives rise to a number of interesting computation and communication properties. For example, this expansion property gives rise to approximate halving in a constant number of steps using compare-and-exchange operations. It also offers the means of realizing a trade-off between storage capacity for the interconnection patterns and the degree of the network while retaining the error correction, exponential convergence, and robustness properties in associative memory. Furthermore, this property also ensures multiple paths between the processing elements, providing the necessary redundancy for fault-tolerant communication networks. These properties together with applications are presented in further detail in Sec. III.
The large neighborhood criterion is a strong requirement. Consequently, even the proof of existence of such graphs is nontrivial. Such a proof is in fact provided by a nonconstructive (probabilistic) argument. In this argument, we look at the set of all $d$-regular bipartite graphs with each side having $n$ nodes together with a uniform distribution on it. We then show that the fraction of these graphs that fails to have the given expanding property is $< 1$. To make this fraction small, we select a suitable $d$ as a function of the expansion. Since the fraction of graphs which does not have the required expanding property is $< 1$, certain graphs of degree $d$ must exist that meet the given expansion requirements. This result is stated more precisely in the following theorem.
where $H(x) = -x \log_2 x - (1 - x) \log_2 (1 - x)$ is the binary entropy function. Let $I$ and $O$ be two sets of vertices, $|I| = |O| = n$, and let $G$ be a random $d$-regular bipartite graph on the classes of vertices $I$ and $O$, obtained by choosing randomly $d$ permutations from $I$ to $O$. Then, with probability approaching 1 as $n$ tends to infinity, $G$ has the given expansion.
Notice that we allow multiple edges here. This theorem guarantees expander graphs of any given expansion whose degree is bounded as a function of $\epsilon$ as in the above equation. Notice that the degree bound is independent of $n$. The problem is that this probabilistic argument does not give us a clue as to how to construct such expander graphs explicitly. Also, it seems that the explicit construction of expander graphs is difficult due to our requirement that every subset of nodes have a large neighborhood. Although there are some explicit constructions of expander graphs, 14,15 none of these constructions offers high expansion with a small degree bound. On the other hand, the random construction shows that the degree need not be higher than $-\log\epsilon/\epsilon$. 3 Also, the probabilistic argument shows that for a suitably chosen $d$ and for all large $n$, almost all the $d$-regular bipartite graphs will have the required expansion. This suggests that we should use randomly generated $d$-regular graphs. Such an approach would be successful only when we have the means to determine if a given $d$-regular bipartite graph has the necessary expansion. Computing the expansion of a bipartite graph is a co-NP complete problem, 16 but estimating the lower bound of the expansion is not difficult, as shown by Tanner. 17 We only need to compute certain eigenvalues of the incidence matrix of the expander graph for this estimation. The precise result of Tanner is given below.
Let 
As the computation of eigenvalues of a matrix is relatively easy, this theorem provides an efficient tool for estimating the expansion. However, such a probabilistic approach gives rise to very irregular interconnection networks which create severe routing difficulties for VLSI implementation. Our investigation of a suitable implementation technology revealed free space optical interconnection as the preferred choice. In addition, optical interconnection is also superior for such global interconnection from both power and speed considerations. In Sec. IV we present the specific optical interconnection techniques to realize expander graph interconnections. In Sec. II.C we give our method for constructing expander graphs with a given expansion using the theoretical results mentioned here.
C. Probabilistic Construction
To generate an expander graph with a given expansion factor, the three primary tasks are (a) generating random $d$-regular graphs with a given degree $d$, (b) estimating the expansion of each graph by applying Theorem 2, and (c) selecting the graph with the best expansion over a large number of iterations. We present this algorithm below in a step-by-step approach, assuming that $n$ is the number of input or output nodes on each side of the bipartite graphs and $\beta = (1 - \epsilon)/\epsilon$ is the expansion:
Step 1:
We use a random number generator to first generate a random permutation of the first $n$ integers. This random permutation will be interpreted as a one-to-one connection between the input and output nodes. Then we select $d$ using Eq. (1). This $d$ is the minimum degree required to achieve the given expansion. We generate $d$ random permutations and construct the $n \times n$ incidence matrix of the corresponding $d$-regular graph. Theorem 1 guarantees that there exist expander graphs with the given expansion, provided that we select $d$ according to Eq. (1) for the desired expansion.
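A minimal Python sketch of Step 1 follows. It takes the degree $d$ as a parameter instead of deriving it from Eq. (1), and the function names are hypothetical. Because the $d$ permutations are drawn independently, multiple edges between the same pair of nodes can occur, as the text allows.

```python
import random


def random_permutations(n, d, rng):
    """Draw d independent random permutations of 0..n-1.

    perms[k][i] is the output node joined to input node i by permutation k.
    """
    perms = []
    for _ in range(d):
        p = list(range(n))
        rng.shuffle(p)
        perms.append(p)
    return perms


def incidence_matrix(perms, n):
    """n x n matrix with m[i][j] = number of edges from input i to output j."""
    m = [[0] * n for _ in range(n)]
    for p in perms:
        for i, j in enumerate(p):
            m[i][j] += 1
    return m


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    matrix = incidence_matrix(random_permutations(8, 3, rng), 8)
    for row in matrix:
        print(row)
```

Every row and every column of the resulting matrix sums to $d$, which is the $d$-regularity required by the definition.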
Step 2: Estimate the Expansion Using Theorem 2.
We use standard numerical routines to compute the eigenvalues of this matrix. These eigenvalues together with $d$ and $\epsilon$ are substituted into Eq. (2) to obtain a lower bound on the expansion of the graph.
Step 3: Selecting the Best Graph.
Steps 1 and 2 are repeated for many iterations. We select the graph with the largest expansion. Figure 2 graphically illustrates this algorithm. Using this algorithm, for various values of $n$ and $d$, we found the network with the least second largest eigenvalue and computed its expansion using Theorem 2. We then plotted the relationship between the expansion $\beta$ [for $\epsilon = 1/(1 + \beta)$] and the degree $d$ of the network for various values of the number of vertices $n$ (Fig. 3). From this figure, it is clear that the relationship between the expansion and the degree is largely independent of the size $n$ of the network for larger values of $n$ ($n > 128$). The discrepancy for smaller values of $n$ can be explained by the fact that the theoretical results 3 are asymptotic. Hence we demonstrated that, for a given [...]
Our initial experiments were conducted for $n$ up to 1024. The incidence matrix is [...] storage mode. This approach will work for $n$ up to 5000 using a Cray Y-MP with 32 million memory words. For larger values of $n$, $n \gg d$, the incidence matrix is sparse. We also recall that it is a real symmetric non-negative definite matrix. Consequently we can use skyline methods to cut down dramatically the storage requirement for large incidence matrices. 18
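The three steps can be sketched end to end as follows. Two points are assumptions rather than the paper's method. First, since Eq. (2) is not reproduced in this text, the code uses a commonly quoted form of Tanner's bound for $d$-regular bipartite graphs, $|\Gamma(A)| \ge d^2 |A| / [\alpha (d^2 - \lambda) + \lambda]$ with $|A| = \alpha n$ and $\lambda$ the second largest eigenvalue of $M M^T$. Second, a plain power iteration stands in for the standard numerical eigenvalue routines; it approaches $\lambda$ from below, so a production run should use a proper symmetric eigensolver. All function names are hypothetical.

```python
import math
import random


def random_permutations(n, d, rng):
    """d independent random permutations; perms[k][i] is the output of input i."""
    perms = []
    for _ in range(d):
        p = list(range(n))
        rng.shuffle(p)
        perms.append(p)
    return perms


def second_eigenvalue(perms, n, rng, iters=300):
    """Power-iteration estimate of the second largest eigenvalue of M M^T.

    The largest eigenvalue d^2 belongs to the all-ones vector, so the
    iteration is kept orthogonal to it by removing the mean. Requires n >= 2.
    """
    def center(v):
        mean = sum(v) / n
        return [x - mean for x in v]

    def apply_mmt(v):
        u = [0.0] * n                     # u = M^T v, indexed by outputs
        for p in perms:
            for i, j in enumerate(p):
                u[j] += v[i]
        w = [0.0] * n                     # w = M u, indexed by inputs
        for p in perms:
            for i, j in enumerate(p):
                w[i] += u[j]
        return w

    v = center([rng.random() for _ in range(n)])
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        w = center(apply_mmt(v))
        lam = sum(a * b for a, b in zip(v, w))   # Rayleigh quotient, |v| = 1
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:                         # second eigenvalue is ~0
            break
        v = [x / norm for x in w]
    return lam


def tanner_lower_bound(d, lam, eps):
    """Assumed form of Tanner's bound on |Gamma(A)| / |A| for |A| <= eps * n."""
    return d * d / (eps * (d * d - lam) + lam)


def best_graph(n, d, eps, trials, seed=0):
    """Steps 1-3: keep the random graph with the least second eigenvalue."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        perms = random_permutations(n, d, rng)
        lam = second_eigenvalue(perms, n, rng)
        if best is None or lam < best[1]:
            best = (perms, lam)
    perms, lam = best
    return perms, lam, tanner_lower_bound(d, lam, eps)


if __name__ == "__main__":
    _, lam, bound = best_graph(n=64, d=8, eps=0.1, trials=20)
    print("second eigenvalue:", lam, "expansion lower bound:", bound)
```

The assumed bound decreases as the set fraction grows, so evaluating it at $\epsilon$ covers every subset of size up to $\epsilon n$. The matrix is never stored explicitly here; only the $d$ permutations are kept, which is one simple way to exploit the sparsity noted above for $n \gg d$.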
III. Applications of Expander Graphs
In this section, we give a few applications where the connectivity of expander graphs is successfully exploited to yield fast parallel algorithms and efficient designs.
A. Parallel Sorting
It has been a long-standing problem to find an optimal sorting network with $O(\log n)$ stages. It is easy to see that we need at least $\log n$ stages of comparators with each stage performing $O(n)$ comparisons, since we have a lower bound of $n \log n$ on the total number of comparisons required for sorting. The credit for discovering an optimal algorithm goes to Ajtai, Komlos, and Szemeredi (AKS), who came up with an $O(\log n)$ stage sorting network, thereby matching the lower bound to within a constant factor. 4 The basic idea of the AKS algorithm is to halve recursively the given sequence of numbers. Such a recursive halving requires $O(\log n)$ stages. Unlike a naive recursive halving scheme in which it would take $O(\log^2 n)$ steps (since each exact halving requires $\log n$ time), the novelty of the AKS sorting network is that it uses only approximate halving instead of exact halving. They handle these approximately halved sequences using an efficient error-management scheme to obtain the sorted sequence. Such an approach can result in an $O(\log n)$ algorithm provided approximate halving can be done quickly. This is how expander graphs enter the picture. It turns out that we can do approximate halving in constant time using expander graphs. We discuss the relationship between approximate halving and expander graphs in greater detail. The idea is that we can use bipartite graphs to model the computation of an approximate halver network. The nodes in the bipartite graph represent the wires in the comparator network. We partition the wires into two groups of equal sizes with the node sets $I$ and $O$ corresponding to these two groups so that, at the end of the computation, most of the elements of the lower (higher) half of the inputs end up in the nodes of $I$ ($O$). Hence we can assume that compare-and-exchange operations are only made between the nodes from different parts of the graph. Each such compare-and-exchange [...] design will be insensitive to the order of these comparisons.
We now define an $\epsilon$-halver as a comparator network which takes the inputs $a_1, a_2, \ldots, a_{2n}$ and produces two blocks (lower and higher) of outputs of equal length $n$. The idea is that the lower block contains all but an $\epsilon$ fraction of the $n$ small elements of the input, and the higher block contains all but an $\epsilon$ fraction of the $n$ large elements of the input. In fact, an $\epsilon$-halver satisfies a stricter condition. An $\epsilon$-halver has the property that, for any inputs and $k \le n$, the number of elements from the $k$ smallest elements of the input which are output in the higher block is at most $\epsilon k$. Similarly, the number of elements from the $k$ largest elements of the input that are output in the lower block is at most $\epsilon k$. In other words, the elements that are output in the wrong block are distributed evenly in their respective halves.
It turns out that the bipartite graphs that can be used as an $\epsilon$-halver are expander graphs. The close connection between expander graphs and $\epsilon$-halvers is established by the following fact:
Fact 1: The comparator network whose compare-and-exchange operations are given by a $[d, \epsilon, (1 - \epsilon)/\epsilon]$ expander is an $\epsilon$-halver with depth $d$.
Since the depth in our comparator networks corresponds to parallel time, our goal would then be to design $\epsilon$-halvers with minimal depth for any given $\epsilon$ and $n$. From Fact 1, it can be seen that this goal is equivalent to finding expander graphs of a given expansion with the minimal degree. The significant fact is that Theorem 1 implies that, for any given $\epsilon$, we can find an $\epsilon$-halver network whose depth is independent of the size of the network. The depth is determined only by $\epsilon$. It is the existence of such bipartite graphs that makes efficient parallel algorithms possible.
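Fact 1 can be illustrated with a short Python sketch of a halver built from $d$ matchings. It assumes the graph is given as $d$ permutations, each used as one stage of compare-and-exchange operations between the lower and higher blocks, and that the input values are distinct. The function names are hypothetical.

```python
import random


def halve(values, perms):
    """Apply one compare-and-exchange stage per permutation (depth d).

    Wire i of the lower block is compared with wire perm[i] of the higher
    block; the smaller value is kept in the lower block.
    """
    n = len(values) // 2
    low, high = list(values[:n]), list(values[n:])
    for p in perms:
        for i, j in enumerate(p):
            if low[i] > high[j]:
                low[i], high[j] = high[j], low[i]
    return low, high


def misplaced(low, high):
    """Number of the n smallest values that ended in the higher block."""
    n = len(low)
    threshold = sorted(low + high)[n - 1]  # largest of the n smallest values
    return sum(1 for x in high if x <= threshold)


def empirical_eps(perms, n, trials, rng):
    """Worst observed fraction of misplaced values over random inputs."""
    worst = 0
    for _ in range(trials):
        values = list(range(2 * n))
        rng.shuffle(values)
        worst = max(worst, misplaced(*halve(values, perms)))
    return worst / n


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 64, 6
    perms = []
    for _ in range(d):
        p = list(range(n))
        rng.shuffle(p)
        perms.append(p)
    print("empirical epsilon:", empirical_eps(perms, n, 200, rng))
```

After a comparator acts on a pair of wires, the lower wire can only decrease and the higher wire can only increase, so every compared pair ends in the right order regardless of the order of the stages. When the matchings cover all pairs (a complete bipartite graph), the halving is therefore exact.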
In Sec. II.C we constructed several expander graphs of varying $n$ and $d$ and estimated their expansion $\beta$. The $\epsilon$ can be easily computed from the relation $\epsilon = 1/(1 + \beta)$ (see Fig. 4). In other words, when we use the expander graphs that we generated as approximate halvers, we can guarantee that the number of elements in the wrong half is at most $\epsilon n$ (see Fig. 5). This is the maximum error we have to consider when designing an error-management scheme to achieve the AKS sorting algorithm. Note that the expansions computed for Fig. 3 are lower bound estimates; therefore, the $\epsilon$ obtained would be upper bound estimates (see Fig. 6). To see how these expander graphs perform as $\epsilon$-halvers, we used them to halve 1000 random permutations of $n$ distinct integers and recorded the number of integers that ended up in the wrong halves. The highest error count was divided by $n$ to give an empirical estimate of $\epsilon$ (see Fig. 7). This empirical $\epsilon$ is significantly less than the upper bound estimated, confirming, together with Fig. 5, that $\epsilon$-halvers are very efficient for a large majority of the input. In addition to the sorting algorithm, researchers have developed optimal algorithms for other related problems, e.g., finding the maximum and median. 5,6 The problem with all these algorithms is that they are only optimal in an asymptotic sense. This means that these algorithms can perform better only when we consider problems of very large sizes, e.g., sorting billions of numbers.
For smaller problem sizes, other existing algorithms would be more efficient. Even if the AKS algorithm is not presently competitive for practical problem sizes, one can, with improved algorithm analysis techniques and advances in the theory of parallel algorithms, hope for the development of parallel algorithms that are optimal for practical problem sizes. 3 The technological feasibility of irregular interconnections would give impetus for the development of better parallel algorithms.
B. Routing
One of the principal and immediate applications of expander graphs is for constructing efficient routing networks. It has been shown by Valiant 19 that if a message is routed to a random destination and then to its real destination, the delay can be made proportional to the diameter of the network. One intuitive reason for smaller delays is that random destination routing would tend to distribute the packets evenly across all the edges in the network, thereby minimizing the traffic on each edge. It turns out that one can dispense with random destination routing and achieve the same effect by using expander graphs. This was shown by Upfal. 2 Recently, Leighton and Maggs showed that they can achieve significantly smaller delays and higher fault tolerance by augmenting a butterfly network with an expanding graph. In particular, they showed that such an augmented butterfly is better than even a dilated butterfly, which has the same amount of hardware. These results suggest that expander graphs will play a significant role in the development of parallel computers.
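The two-phase idea of routing through a random intermediate destination can be sketched in a few lines of Python. The hypercube topology and the bit-fixing rule used for each phase are assumptions chosen for illustration; the text above does not fix a topology, and the function names are hypothetical.

```python
import random


def bit_fixing_path(src, dst, dim):
    """Path that corrects differing address bits from lowest to highest."""
    path, cur = [src], src
    for b in range(dim):
        if (cur ^ dst) & (1 << b):
            cur ^= 1 << b
            path.append(cur)
    return path


def two_phase_path(src, dst, dim, rng):
    """Route src -> random intermediate node -> dst on a dim-cube."""
    mid = rng.randrange(1 << dim)
    first = bit_fixing_path(src, mid, dim)
    second = bit_fixing_path(mid, dst, dim)
    return first + second[1:]  # drop the duplicated intermediate node


if __name__ == "__main__":
    rng = random.Random(0)
    print(two_phase_path(5, 10, 4, rng))
```

Each phase uses at most one hop per address bit, so a two-phase path on a $\log_2 n$-dimensional cube has at most $2\log_2 n$ hops, i.e., a length proportional to the diameter.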
C. Associative Memory
Associative memory is the ability to recall data given partial information. One of the well understood models of associative memory is that of Hopfield. 20 This model assumes a fully interconnected network of neurons. Information is stored in this system by adjusting the weights of the interconnections. This model has a number of remarkable properties, which include error correction, exponential convergence, and robustness with respect to errors in the weights. However, in practice it is hard to implement such a network since the number of interconnections grows quadratically with the number of neurons. To retain many of the nice properties exhibited by the Hopfield model, Komlos and Paturi 7 have shown that one can use certain sparse networks, which should have global communication properties similar to those of expander graphs. Nonexpanding networks like the hypercube would lack the error-correction properties of the Hopfield model. In essence, expander graphs would give us the means to realize a trade-off between the storage capacity and the degree of the network while retaining the error correction, exponential convergence, and robustness properties. This shows that the connectivity provided by expander graph interconnections is versatile. We generated a 256-node expander graph with $d = 20$ and used it for interconnecting 256 neurons in a simple Hopfield network. Using fewer than 10% of the interconnections required by a crossbar, we stored two $16 \times 16$ binary images and demonstrated the error-correction property for 20% random error (Fig. 8).
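A sparsely connected Hopfield network of this kind can be sketched as follows. The Hebbian storage rule, the synchronous sign update, and the way random permutations are turned into symmetric neighbor lists are all assumptions made for illustration; the text does not specify these details, and the function names are hypothetical.

```python
import random


def random_symmetric_neighbors(n, d, rng):
    """Neighbor lists from d random permutations, made symmetric.

    Each edge (i, p[i]) is used in both directions; self-loops are dropped,
    so a node has at most 2 * d neighbors.
    """
    nbrs = [set() for _ in range(n)]
    for _ in range(d):
        p = list(range(n))
        rng.shuffle(p)
        for i, j in enumerate(p):
            if i != j:
                nbrs[i].add(j)
                nbrs[j].add(i)
    return [sorted(s) for s in nbrs]


def train(patterns, neighbors):
    """Hebbian weights, stored only for the connections that exist."""
    return [
        {j: sum(p[i] * p[j] for p in patterns) for j in neighbors[i]}
        for i in range(len(neighbors))
    ]


def recall(weights, state, steps=10):
    """Synchronous sign updates until a fixed point or the step limit."""
    state = list(state)
    for _ in range(steps):
        new = []
        for i, row in enumerate(weights):
            field = sum(w * state[j] for j, w in row.items())
            new.append(state[i] if field == 0 else (1 if field > 0 else -1))
        if new == state:
            break
        state = new
    return state


if __name__ == "__main__":
    rng = random.Random(0)
    n = 256
    neighbors = random_symmetric_neighbors(n, 10, rng)
    pattern = [rng.choice((-1, 1)) for _ in range(n)]
    weights = train([pattern], neighbors)
    noisy = [-x if rng.random() < 0.2 else x for x in pattern]
    restored = recall(weights, noisy)
    print("bits still wrong:", sum(a != b for a, b in zip(restored, pattern)))
```

With a single stored pattern, a neuron is restored in one update whenever fewer than half of its neighbors are corrupted, which is where a large, well-spread neighborhood helps.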
D. Object Redistribution
The object distribution algorithm is the central part of Cole and Vishkin's solutions to the $O(\log n)$ time task scheduling problem and the $O(1)$ time processor scheduling problem. 8 We have a set of objects representing the tasks to be performed in parallel. The goal is to divide this set of objects into collections of objects with approximately equal sizes so that these collections can be executed in a minimum number of parallel steps. This problem is encountered when the objects are distributed unevenly in the network, and no one processor has access to all the objects. Such an uneven distribution can be made more balanced if the objects are redistributed. The scheduling problems can be solved optimally if the redistribution can be done in constant time. The proposed solution uses an expander graph to interconnect these collections. As we shuffle the objects between pairs of interconnected collections to achieve local balance between them, the global communication property of expander graphs assures more even global object distribution in a constant number of steps.
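The local balancing step can be illustrated with a toy Python sketch. It is not Cole and Vishkin's algorithm: it only assumes that the collections sit on the two sides of a bipartite graph given as $d$ permutations, tracks object counts instead of objects, and splits the objects of each connected pair as evenly as possible. The function names are hypothetical.

```python
def balance_round(inputs, outputs, perm):
    """Balance each matched pair (input i, output perm[i]) in place."""
    for i, j in enumerate(perm):
        total = inputs[i] + outputs[j]
        inputs[i], outputs[j] = (total + 1) // 2, total // 2


def redistribute(inputs, outputs, perms):
    """One balancing round per permutation, i.e., d rounds in total."""
    inputs, outputs = list(inputs), list(outputs)
    for p in perms:
        balance_round(inputs, outputs, p)
    return inputs, outputs


if __name__ == "__main__":
    # All 8 objects start in one collection; two rounds spread them out.
    print(redistribute([8, 0], [0, 0], [[0, 1], [1, 0]]))
```

Each pairwise split conserves the total number of objects and never pushes a collection above the larger or below the smaller of the two counts involved, so the spread between the fullest and emptiest collections cannot grow from round to round.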
E. Fault-Tolerant Networks and Error-Correction Codes
Achieving consensus in the presence of faults is a basic problem in distributed computing. Hardware or software faults can prevent a processor from cooperating in the consensus process. In such a case, the goal is to obtain unanimity among the nonfaulty processors. The problem is that faulty processors can prevent communication among the nonfaulty ones. It is also possible that faulty processors can introduce misleading messages into the network. To achieve unanimity, $\Omega(t)$ connectivity is necessary where $t$ is the number of faults to be tolerated. This high connectivity requirement can be relaxed if we are willing to lose some nonfaulty processors and settle for cooperation among the vast majority of the nonfaulty processors. In such a case, one can use expander graphs to interconnect the processors. The communication properties of expander graphs guarantee that we lose only a few nonfaulty processors. Hence expander graphs can be used as good fault-tolerant networks. 9
Economic memory storage requires a refresh or restoration mechanism to counteract the accumulation of errors. Such a mechanism must rely on redundancy and voting for restoration. This added computational requirement increases the possibility for device error. Thus the problem of information storage in the presence of noise leads to the problem of computation in the presence of noise. This problem is similar to that of fault-tolerant computing. Here again global communication properties of expander graphs can be used to implement a voting mechanism economically. 10
IV. Optoelectronic Implementation
We have now described the irregular interconnection approach to parallel computation and discussed some of its advantages. In this section, we examine the implementation technology. We show that a system combining local electronic computation with global optical communications provides an excellent match to the system requirements. We describe the programmable optoelectronic multiprocessor (POEM) system being developed at UCSD and discuss how it can support expander graphs. Two implementations are discussed, one using fixed computer generated holographic optical interconnections, the other using reconfigurable volume (photorefractive) holographic interconnections.
A. Why Optoelectronics?
Electronic VLSI technology is well established, inexpensive, and reliable. It is excellent for logic operations and local communications, as in a single processing element. However, as the length and density of the communication links increase, the disadvantages of a purely electronic approach become significant. In particular, the irregular global communications described are catastrophic for VLSI. Time delay, energy dissipation, and potential clock skew all grow with increasing length, a problem for global communications. Electronic crosstalk and reliability considerations limit the allowable number of line crossings, making the layout of irregular interconnection links difficult. As a result, the problem of communications becomes critical in chip layout. Valuable silicon real estate is expended on connections, reducing the amount available for processing. For example, in most VLSI chips, 70% of the silicon area is devoted to communications and related tasks, although most of the chip layout time is spent trying to minimize this percentage.
The communications problem is basically topological. In VLSI electronics, all the processors lie in a 2-D plane. As long as the communications between them are also restricted to that plane, there is competition between communications and processing for the same limited area. By introducing free-space optical interconnection, communication links can be taken into the third dimension above the processing plane. There are costs in power, speed, and complexity in converting the electrical signals to optical. However, it has been shown 21 that for links longer than a (technology dependent) break-even length $l_c$ the optical link is more efficient in terms of both power and speed. The break-even length $l_c$ was calculated using realistic optical and electronic performance parameters and found to be as small as 1-2 mm. This means that optical links are preferred for wafer-scale integration implementation of parallel processors using global interconnections. Based only on power and speed considerations, the optical link is already preferred to the electronic wire for long distance communications, but there are other significant advantages offered by the optoelectronic combination. The area of the chip expended on the long distance wires and any associated electronics (amplifiers, signal boosters, etc.) is made available for processing. The VLSI layout is simplified. Problems arising from clock skew are reduced, since all long distance links have approximately the same length. In addition, there are two potential advantages which come from the physical separation of the interconnect technology from the processing plane: fault tolerance and reconfigurability. VLSI fabrication faults can be corrected after chip testing by selecting the connection links to replace faulty processors with working spares. This reduces production costs without sacrificing efficiency. As a consequence of the optical long distance communication, all processors are effectively adjacent.
If the connection can be changed during operation, the system becomes more versatile, efficient, and operation fault tolerant. The type and time scale of reconfiguration are technology dependent, but in general the better the connection pattern matches the problem requirements, the more efficiently the available processing power can be applied to a variety of problems.
B. Programmable Optoelectronic Multiprocessor
The POEM is a generalized system approach to parallel computing derived from these considerations. The POEM system was described in detail in Ref. 22. We briefly describe it here, then discuss its application to parallel processing with irregular interconnection. An idealized POEM system is shown in Fig. 9. The VLSI wafer is divided into optoelectronic processing elements (PEs) which perform computations and local communications electronically. Each PE also has one or more optical detectors and modulators with which it can receive data and control instructions and communicate to other processors. These optoelectronic PEs communicate among each other through electrooptic (EO) modulation of coherent light (generated off-chip). Modulators are preferred to integrated laser sources for reduced on-chip power dissipation, simplified fabrication, and increased reliability. The processing planes can be manufactured using, for example, silicon electronic processors fabricated on transparent EO PLZT substrates. 23 The system is controlled in single instruction multiple data (SIMD) fashion by a serial host computer, which distributes the clock signal, determines the tasks of the PEs on various wafers, and (for reconfigurable systems) controls the interconnection pattern. Data transfer is bit serial, but computations are made in parallel planes. Interconnections are made using holography (see Sec. IV.C) and may be fixed or reconfigurable depending on the technology and application. The POEM architecture is a generalized approach to optoelectronic processing, describing any system using holographic interconnection of electronic processing arrays. It is intended as a framework to be adapted into specific systems matching the application requirement.
The POEM system can have either an unfolded (Fig. 10) or folded (Fig. 11) geometry. The unfolded system can use fixed interconnects, which can be implemented with thin computer generated holograms (CGHs) or, for certain regular interconnection patterns, with refractive optics. A large number of processing planes are required to perform the computation. As a result, the hardware cost is placed on the processing electronics rather than the interconnection technology. A fixed-interconnection unfolded system can be efficient for some computations, but a more versatile computer will require reconfigurable interconnections. The folded system in Fig. 11 uses reconfigurable interconnections to perform general purpose processing with only two processing planes. Information is transferred back and forth between the planes. The connections can be bidirectional, as shown, or they can be different for the forward and return paths. The reconfigurable interconnects increase the computer's versatility and efficiency at a cost of increased optical system complexity. Using only two processing planes increases hardware utilization but does not support pipelined operations.
The speed and nature of reconfiguration determine which algorithms can be efficiently implemented. Clearly, a system which can update the connections in less than a single clock cycle is ideal, but some computations and algorithms need to update connections relatively infrequently after many clock cycles. We have found it convenient to categorize reconfigurable interconnection systems according to their range and speed of reconfiguration. A preprogrammed connection system can switch at high speed between a limited set of prerecorded patterns. These patterns must be chosen and stored before operation and can be updated slowly (compared to the computer's run time) if at all. A reprogrammable connection system is completely general; any desired connection pattern can be constructed and implemented at the reconfiguration rate. Finally, an adaptive connection system produces a continuous incremental change in the interconnection pattern in response to the algorithm's needs.
C. Algorithm Implementation on POEM
The technology of interconnection depends heavily on the needs of the algorithms to be implemented. Some highly regular interconnection patterns can be performed using space invariant refractive optics such as lenses, masks, and mirrors. For more general patterns, including the completely irregular connections discussed in this paper, space variance is needed. A promising approach to space variant interconnection is to use holography. Each connection can be stored as a single hologram. The input beam reads the hologram, reconstructing a wavefront propagating toward the desired destinations. Holographic storage is dense and distributed, storing large amounts of information in a defect-tolerant manner. Most important, any desired connection pattern with arbitrary fanout, fanin, and direction can in principle be stored. Holograms are divided into two major types, thick (volume) and thin, according to whether their thickness is large or small compared to the features of the recorded interference pattern (the grating wavelength). In Secs. IV.C.1 and IV.C.2 we describe two POEM systems implementing parallel irregularly connected algorithms. The first system is unfolded, using fixed thin holographic optical interconnects. The second is folded with preprogrammed volume holographic interconnections. We outline the procedure for approximate halving on each system to illustrate operational differences.
Fig. 12. POEM system using a fixed CGH. The EO modulator output from one processing plane is interconnected by the fixed CGH to the next processing plane. (Figure labels: mirrors, processing planes, modulator light input.)
Unfolded POEM with Fixed CGH Interconnects
Thin holograms can be used to perform fixed interconnections. They may be fabricated by recording optical interference patterns or computer generated masks in either phase or amplitude. A CGH with submicron features can be written by electron-beam lithography, then etched into glass plates (Ref. 24). Multilevel-phase CGHs can be designed to produce up to 100% diffraction efficiency, although transmission (amplitude modulation) CGHs are much less efficient (Ref. 25). Figure 12 shows a POEM system which uses a fixed CGH to interconnect a series of parallel optoelectronic processor arrays.
The system shown uses a double pass faceted CGH architecture (Ref. 26) with one facet devoted to each detector and modulator. Coherent light entering the modulators from below is polarization modulated and analyzed. This output is collimated and directed to one or more locations in the next plane by modulator facets. Detector facets focus the incident light into detectors. The area of modulators and detectors is minimized to reduce device capacitance and response time (Ref. 23). Data can be input electronically or in parallel using spatial light modulators (not shown) imaged onto detectors in the processing planes. Assuming a 5 × 5 cm diffraction-limited CGH with 0.5-μm features and 700-nm light, 128 × 128 processor arrays could be interconnected (Ref. 26). Optoelectronic Si/PLZT processing arrays of this size are certainly feasible. More sophisticated interconnection hologram design and fabrication techniques currently under investigation should be able to accommodate larger PE arrays.
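As a rough sense of scale for the figures quoted above, the following sketch divides the hologram evenly among the processing elements. The even split into one square cell per PE and the helper name `facet_budget` are illustrative assumptions, not taken from Ref. 26, and the sketch does not reproduce the diffraction-limited derivation behind the 128 × 128 estimate.

```python
def facet_budget(side_mm=50.0, array_dim=128, feature_um=0.5):
    """Return (cell pitch in micrometres, features across one cell).

    Assumption (not from the source): the square CGH is split evenly into
    array_dim x array_dim square cells, one cell per processing element.
    """
    pitch_um = side_mm * 1000.0 / array_dim   # cell width in micrometres
    features_across = pitch_um / feature_um   # minimum-size features per cell width
    return pitch_um, features_across


pitch_um, features = facet_budget()
print(f"cell pitch = {pitch_um:.1f} um, about {features:.0f} features across")
```

Under this assumption each PE is allotted a hologram cell a little under 0.4 mm wide, which must be shared among that PE's detector and modulator facets.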
To perform the approximate halver algorithm described in Sec. III.A, each processing element requires two detector inputs (the two values to be compared) and two modulator outputs. The planes are divided into two halves, higher and lower. One output from each PE connects directly to that PE's corresponding location in the next plane, while the other output connects to a quasirandom destination on the next plane's other half. Each PE receives and compares the two input values. The higher half of the processors passes the higher of the two values straight across and switches the lower. The lower half of the processors does the opposite. The input values are loaded from the left and propagate in parallel to the right, sorted more accurately in each step into the higher and lower halfplanes.
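The stage-by-stage behavior described above can be sketched in software. This is a simplified model under stated assumptions, not the optical system itself: each position holds one value, the cross links of each plane form a random 1-1 pairing between the two half-planes, and each PE forwards the value it keeps on both of its outputs, so the set of values is preserved. This simplifies the pass-and-switch rule in the text. The names `make_planes`, `propagate`, and `misplaced` are hypothetical.

```python
import random


def make_planes(n, stages, seed=0):
    """One fixed cross-link pattern per stage (standing in for a fixed CGH).

    planes[t][i] is the lower-half position paired with higher-half
    position n//2 + i; the pairing is a random 1-1 map (an assumption).
    """
    rng = random.Random(seed)
    half = n // 2
    planes = []
    for _ in range(stages):
        perm = list(range(half))
        rng.shuffle(perm)
        planes.append(perm)
    return planes


def propagate(values, planes):
    """Push values through the planes; return the state after every stage.

    Positions 0..n/2-1 form the lower half, n/2..n-1 the higher half.
    """
    half = len(values) // 2
    state = list(values)
    history = [list(state)]
    for perm in planes:
        nxt = list(state)
        for i in range(half):
            lo_pos, hi_pos = perm[i], half + i
            a, b = state[lo_pos], state[hi_pos]
            # Lower-half PE keeps the smaller value, higher-half PE the larger.
            nxt[lo_pos], nxt[hi_pos] = min(a, b), max(a, b)
        state = nxt
        history.append(list(state))
    return history


def misplaced(state):
    """Count values in the lower half that belong in the higher half."""
    half = len(state) // 2
    threshold = sorted(state)[half]  # smallest value of the true higher half
    return sum(1 for v in state[:half] if v >= threshold)


rng = random.Random(1)
values = rng.sample(range(1000), 64)          # 64 distinct input values
history = propagate(values, make_planes(64, stages=8))
print([misplaced(s) for s in history])        # misplaced count after each plane
```

In this model the count of misplaced values can never increase from one plane to the next, because a compare-exchange only moves a value into the lower half when it is the smaller of its pair.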
Folded POEM with Programmable Photorefractive Interconnects
Thick (volume) holograms are dramatically different from planar holograms in that they exhibit readout selectivity. When the readout beam mismatches the stored hologram in either the optical wavelength or the phase pattern, the diffraction efficiency decreases dramatically. The degree of selectivity depends on the thickness of the hologram; for a 1-mm thick hologram, an angular mismatch of 0.1° (Ref. 27) cuts diffraction to nearly zero. This behavior allows the superposition of multiple volume holograms, each coded with its own reference wavefront. When one or more of the reference wavefronts illuminates the hologram, the corresponding images are simultaneously recalled. Each of these volume holograms can in theory have high (approaching 100%) diffraction efficiency. The principal volume recording media are photorefractive crystals, which develop an index modulation (phase grating) in continuous response to incident light. Figure 13 shows a POEM system using multiplexed volume holographic interconnects recorded in a photorefractive crystal. The processing planes are similar to those of the preceding example, except that now, because the connection pattern is reconfigurable, the information is exchanged back and forth between a single pair of processing array planes, PA1 and PA2. The optical system works by retrieving interconnection patterns prestored as volume holograms superimposed on a photorefractive crystal. In Fig. 13, a recording source array is used to record each processor's interconnection pattern sequentially. Computer-controlled scanning directs the recording images to spatially discrete crystal subvolumes. Multiple interconnection patterns are superimposed on each subvolume using phase or wavelength multiplexing. After all the holograms are recorded, one complete interconnection pattern can be recalled in parallel using the input coded with the proper phase or frequency. The system is preprogrammed, that is, reconfigurable between prestored patterns.
Assuming diffraction-limited holograms and 10% diffraction efficiency, two 50 × 50 × 2.5 mm lithium niobate crystals could interconnect a 128 × 128 input array with ten prestored patterns. Again, more sophisticated approaches should increase the possible performance. In particular, prefabricated CGH patterns could be used to provide wavefronts for volume storage, decreasing programming time and possibly increasing array size.
To perform the approximate halving of n values, the input is arbitrarily divided into two halves. Each half is sent into one of two n/2 element processor arrays, PA1 and PA2. Each processor stores its value, then sends a copy to the other plane along a skewed 1-1 connection pattern. Both the forward and the reverse connection patterns are identical. In the next step, each processor compares the received value with the one it was originally given. Processors in PA1 store the higher value and send the lower to PA2 using a new irregular connection. Processors in PA2 perform the same operation in reverse, storing the lower of the two values and sending the higher to PA1. As the process continues, planes PA1 and PA2 hold in storage the higher and lower half, respectively, of the values with a steadily decreasing probability of error. In this folded implementation a total of only n processors is required, each with a single detector and modulator.
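The folded procedure can be sketched as follows, under stated assumptions. The prestored holographic patterns are replaced by a small set of random 1-1 pairings between PA1 and PA2 that are recalled in rotation, and in every round the paired processors exchange copies of their currently stored values before comparing. This exchange rule is a simplification chosen so that no value is duplicated or lost. The names `prestore_patterns`, `folded_halve`, and `errors` are hypothetical.

```python
import random


def prestore_patterns(half, count=10, seed=0):
    """Hypothetical stand-in for the prestored volume holograms.

    Returns `count` random 1-1 patterns; pattern[i] is the PA2 processor
    linked to PA1 processor i (same link in both directions).
    """
    rng = random.Random(seed)
    patterns = []
    for _ in range(count):
        p = list(range(half))
        rng.shuffle(p)
        patterns.append(p)
    return patterns


def folded_halve(values, patterns, rounds):
    """Return (PA1, PA2) after `rounds` exchange-and-compare steps."""
    half = len(values) // 2
    pa1, pa2 = list(values[:half]), list(values[half:])  # arbitrary split
    for r in range(rounds):
        pattern = patterns[r % len(patterns)]  # recall one prestored pattern
        for i, j in enumerate(pattern):
            a, b = pa1[i], pa2[j]              # copies exchanged over the link
            pa1[i], pa2[j] = max(a, b), min(a, b)  # PA1 keeps higher, PA2 lower
    return pa1, pa2


def errors(pa1, pa2):
    """Number of values stored in PA1 that belong in the lower half."""
    cutoff = sorted(pa1 + pa2)[len(pa2)]  # smallest value of the true higher half
    return sum(1 for v in pa1 if v < cutoff)


rng = random.Random(2)
data = rng.sample(range(10_000), 256)          # 256 distinct input values
patterns = prestore_patterns(128, count=10)    # ten patterns, as in the estimate above
for rounds in (0, 1, 2, 4, 8):
    pa1, pa2 = folded_halve(data, patterns, rounds)
    print(rounds, errors(pa1, pa2))
```

Only two arrays are ever held in memory here, mirroring the two processing planes; the cost is that successive rounds must be run one after another rather than pipelined through a series of planes.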
The two systems we have described are intended only to indicate the potential of optoelectronic processing for implementing irregularly interconnected parallel algorithms. Both optical and electronic components may be replaced as more advanced versions become available. For example, the correlation matrix-tensor multiplier system currently being investigated at UCSD may provide a more versatile reprogrammable interconnection system (Ref. 28). The Si/PLZT processor planes may be replaced with faster switching multiple quantum well modulators. Most important, we have shown that optoelectronics is a technology well suited to implementing these algorithms.
V. Conclusions and Further Work
We proposed an interconnection architecture based on expander graphs and have shown how these interconnections could lead to efficient parallel algorithms. We have also reasoned that such graphs cannot be implemented with existing VLSI technology but can be made practical with optoelectronic computing technology using free space optical interconnects.
Our further work will focus on experimentally demonstrating the feasibility of implementing expanders on optoelectronic computers and on finding new and more efficient ways of using irregular interconnections.
