INTRODUCTION
Although computers have increased in performance by an order of magnitude every five to ten years since their inception, many obstacles remain in applying them to real-world situations. Conventional computers must be programmed (an expensive proposition) and cannot deal well with new, unforeseen inputs. Further, these devices cannot easily recognize patterns (such as speech or handwriting), causing them to interface with humans in an unnatural manner. These inadequacies, together with our inability to sustain the rapid growth in performance of conventional computer hardware, have motivated the study of alternative methods.
Artificial Neural Networks (ANNs) represent an unusual, biologically-inspired strategy for computing. Although ANNs are not a new development, having been born in the work of McCulloch and Pitts in the 1940s and 1950s, only recently (1980s) have important breakthroughs allowed them to become an effective computational paradigm. These systems work well in many areas where traditional computers fail: ANNs are particularly promising in artificial perception tasks such as speech recognition and machine vision. They have found increased use in adaptive signal processing and adaptive control theory as well.
The response and characteristics of present models of artificial neural networks are primarily investigated by simulation on conventional computers or parallel processing machines. Figure 1 shows the performance obtainable with commercially available simulators in terms of interconnections (storage) and interconnections per second (speed). This should be compared with current and future application requirements. It becomes obvious that today's hardware capabilities are limiting the development of neural network research. The fundamental drawback of the simulators is that much of the spatial and temporal parallelism inherent to ANNs is lost, and that the computing time of the simulated net, especially for large associations of neurons, grows by such orders of magnitude that a timely response may not be achievable. While current research requirements necessarily conform to electronic (semiconductor) implementations, it is apparent that optical technology can open the doors for new research. As application requirements begin to outpace the capabilities of software simulation and electronic realizations, it makes sense to think about designing specialized optical hardware.
Figure 1. ANN Implementation Requirements
There currently exists no consensus on how to implement ANNs in hardware. Although optical technologies hold the most promise in the long term, electronics are currently favored because of the maturity of the technology and the level of integration. However, there remain many largely unaddressed questions with regard to electronic ANN implementation: How do we deal with the massive number of interconnections required (the connectivity problem)? Can we supply the large amount of storage required? Can we provide flexibility for learning and adaptation without sacrificing performance? What are the advantages and disadvantages of continuous vs. discrete time, analog vs. digital valued, and literal vs. virtual implementations?
One goal of this paper is to provide some perspective on the implementation of ANNs. This will be done by discussing issues and examples. Although it is assumed that the reader is familiar with the pertinent aspects of ANNs, they will be briefly reviewed in the next section.
NEURAL NETWORK FUNDAMENTALS
Neural Networks primarily operate in two modes: learning mode and recall (or relaxed) mode. These may both be running concurrently. Learning involves the update of weights to strengthen or weaken network connections, while recall involves processing inputs to produce outputs. As we propagate signals through the network we require two basic operations: processing and communication. We shall see that these may be mathematically represented as a matrix-vector multiplication.
The basic component of a neural network is the neuron model. A number of models have been constructed to describe the behavior of neurons in biological nervous systems. These have been inspired by different goals. Some describe the operation of neurons in great mathematical detail, usually to study them from a neuropsychology standpoint. Others assume some simplifications for ease of implementation, making them more amenable to performing computations. Since we are interested in performing computations, we will discuss the latter type of model. Figure 2 shows the general structure of most simplified neuron models. The neuron receives a set of input signals, $x_i$, which are multiplied by a corresponding set of weights, $w_i$. The weighted signals are added up to produce $n$, the complete input to the neuron. An activation function, $F$, operates on the sum of the weighted neuron inputs ($\sum_i w_i x_i$), transforming it into an output signal which can be transmitted to other neurons. The activation function is usually some type of nonlinearity such as a sigmoid ($(1 + e^{-n})^{-1}$), hyperbolic tangent, or signum function. The weight $w_0$ is the threshold value $\theta$, which biases the firing of the neuron. By convention the input $x_0$ is set to unity. The weighting factors play an important role in the neuron model. A larger weighting factor will serve to increase the sensitivity of the neuron to a particular input.
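The simplified neuron described above can be sketched in a few lines of Python. This is only an illustration under our own naming: `neuron_output`, `sigmoid`, and the argument names are hypothetical, and the threshold weight is passed separately to mirror the convention that its input is fixed at unity.

```python
import math

def sigmoid(n):
    """Logistic activation: 1 / (1 + e^-n)."""
    return 1.0 / (1.0 + math.exp(-n))

def neuron_output(inputs, weights, threshold_weight, activation=sigmoid):
    """Output of one simplified neuron.

    inputs           -- input signals x_1..x_k
    weights          -- weights w_1..w_k, one per input
    threshold_weight -- weight applied to the constant input of 1
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Net input n: threshold term plus the weighted sum of the inputs.
    n = threshold_weight * 1.0 + sum(w * x for w, x in zip(weights, inputs))
    # The activation function turns the net input into the output signal.
    return activation(n)
```

Any of the nonlinearities named in the text can be substituted, for example `neuron_output(x, w, w0, activation=math.tanh)`.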
The real power of a Neural Network emanates from its connections and its consequent ability to form associations. The vast majority of network models are variations on two principal topologies: feedback (recursive) and feedforward networks. The latter is shown in figure 3. This network has an input layer, two hidden layers, and an output layer. The number of neurons in each layer and the number of hidden layers are architectural decisions that are often application dependent. The feedforward network structure produces a nonlinear mapping between the input vector and the output vector, as described in figure 3 (F is a nonlinear activation function). In a recursive network the outputs are fed back to the inputs, and the network iterates toward a stable state. This operation is used to implement an associative memory; the network is supplied with a partial input pattern and after iterating it converges to the complete and closest pattern. The output of a recursive ANN is described by a trajectory of vectors over time rather than a single vector as in the feedforward case.
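As a rough sketch of the feedforward mapping, each layer can be treated as a matrix-vector product followed by the activation function. The function names and the tiny example network below are our own assumptions for illustration; they are not taken from figure 3.

```python
import math

def layer_forward(weight_rows, thresholds, x, activation=math.tanh):
    """One layer: each neuron applies F to its threshold plus weighted inputs."""
    return [
        activation(t + sum(w * xi for w, xi in zip(row, x)))
        for row, t in zip(weight_rows, thresholds)
    ]

def feedforward(layers, x, activation=math.tanh):
    """Propagate x through a list of (weight_rows, thresholds) layers."""
    for weight_rows, thresholds in layers:
        x = layer_forward(weight_rows, thresholds, x, activation)
    return x

# Hypothetical network: 2 inputs, one hidden layer of 2 neurons, 1 output.
example_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
output = feedforward(example_layers, [0.0, 0.0])
```

A recursive network would instead feed `output` back in as the next input and repeat until the vector stops changing.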
Regardless of the network topology, all neural networks are fundamentally vector mappers: input vectors are mapped to some output vector through some sequence of nonlinear transformations. The weights of a network define the nonlinear mappings or associations to be implemented. These weights are adjusted when the network is in a training mode by some learning law. There are a great many learning laws, and much attention is focused on this area. Two general classes of learning procedures exist. In unsupervised learning, network models are presented with an input vector from the set of possible network inputs. The network rule adjusts the weights so that input examples are grouped into classes based upon their statistical properties. In supervised learning the network is presented with a set of training examples (inputs and desired outputs). The output error is then minimized by modifying the weights. In general this is a long, arduous process, which has caused much of the skepticism among ANN critics. Efficient learning is one of the primary focuses of ANN research. As discussed below, the network propagation time has a direct effect on training.
DESIGN ISSUES

Literal vs. Virtual Implementation and the Connectivity Problem
One fundamental question in electronic ANN hardware design is whether to implement the network literally or not. In other words, how many virtual neurons should be hosted on each physical processor? The extreme cases are the serial implementation of an ANN algorithm on a general-purpose uniprocessor (full virtualization), and the fully parallel implementation of one processing element per algorithmic neuron (no virtualization). Literal realizations of larger networks may not be plausible because of the overwhelming number of connections and the amount of physical space they would require. The size of the network and the ability of the technology to accommodate this massive communication burden are key to the decision on literality.
The speed of a network is always an important consideration. It affects the propagation time (the time to pass signals through the network) in both the learning and relaxed modes. We compare the forward (relaxed mode) propagation time of a literal implementation, $\tau_{lit}$, with that of a virtual implementation, $\tau_{vir}$, for the multi-layer feedforward network of figure 3. We assume that the network has N neurons in P layers, and C connections. The virtual realization of this network contains P physical processors (analog or digital) connected by B hardware buses. For the literal case, the forward propagation time is determined by $\tau_{proc}$, the neuron's processing time, and $\tau_{com}$, the communication time representing the latency of connections between layers. The virtual net requires a forward propagation time determined by three delays: the neuron processing time, the transmission delay of the bus, and the transmission delay to send a signal over a local bus internal to a processor (this may be a memory access). We include $\gamma$ as a processor utilization factor. A further term symbolizes the overhead in communication from bus request collisions; it can be expressed in terms of the bus utilization, $U$.

We note that the overhead factors, $\gamma$ and $U$, are monotonically increasing functions of P and C; for simplification we shall assume both to be unity (best case for literal implementation). A virtual neuron may signal another neuron in either the same or a different host processor, denoted as $P_{send} = P_{rec}$ and $P_{send} \neq P_{rec}$, respectively. Assuming that virtual neurons are uniformly distributed across all the processors, the probabilities for these two cases are given by

$$p(P_{send} = P_{rec}) = \frac{1}{P}, \qquad p(P_{send} \neq P_{rec}) = \frac{P-1}{P}.$$

If the connections are spread uniformly over P physical processors, then the probability that a particular connection will not be directed to any one processor (causing it to go idle) is

$$p(miss/conn) = \frac{P-1}{P}$$

and the probability that none of the C connections made in a single broadcast step are implemented on this processor is

$$p(miss) = \left(\frac{P-1}{P}\right)^{C}.$$

This probability is essentially an inefficiency factor. For P = C this is approximately equal to 1/e, since

$$\lim_{C \to \infty} \left(1 - \frac{1}{C}\right)^{C} = \frac{1}{e}.$$

Thus, asymptotic performance (total useful computation within 37% of the maximum achievable by an infinite number of processors) is reached for a number of processors comparable to the number of connections out of each unit.
Far fewer than C processors, however, would imply significantly reduced computational throughput.
For electronic implementation, processors are relatively expensive; for large networks it is usually not practical to have a literal realization, nor even to have C processors. This reduces performance by degrading forward propagation time and efficiency, as shown above.
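The idle-processor argument can be checked numerically. The sketch below assumes, as in the text, that each of the C connections lands on any one of the P processors with equal probability; the function name `p_miss` is ours.

```python
import math

def p_miss(num_processors, num_connections):
    """Probability that none of the connections in one broadcast step
    lands on a given processor, assuming uniform placement."""
    if num_processors < 1:
        raise ValueError("need at least one processor")
    return ((num_processors - 1) / num_processors) ** num_connections

# With P = C the inefficiency factor approaches 1/e as C grows.
for c in (10, 100, 1000):
    print(c, p_miss(c, c), 1.0 / math.e)
```

Holding C fixed and lowering P makes `p_miss` shrink toward zero, so each processor is rarely idle but far fewer connections are served per step, which is the reduced throughput noted above.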
Digital vs. Analog Implementation
Historically, ANN simulation has favored digital devices while dedicated ANN implementations have used primarily analog devices. The motivation for these areas is slightly different. Simulation investigates ANN properties and must be done on a general-purpose digital computer. Dedicated ANNs, which work in one application area, have followed the most successful model known, the biological (analog) model. Both of these paradigms have particular advantages and, as a result, somewhat of a controversy has developed. Analog supporters point to the fact that analog devices can process more than one bit (of accuracy) per transistor, leading to greater levels of integration, which is key to the ability to support literal implementations. Analog devices perform the pervasive multiply-accumulate operations in parallel as a result of physics, making them potentially quicker. Those who advocate digital devices have pointed to the superior noise immunity, more sophisticated time multiplexing techniques (which better support virtualization), and better fabrication processes associated with digital VLSI. Further, digital devices are more easily programmed and more capable of supporting the large dynamic range that network weights require to implement gradient learning algorithms.
We may observe the slower execution times for digital networks by re-examining the forward propagation time discussed above. In particular we investigate the processing time, $\tau_{proc}$. For a single neuron with n inputs, as shown in figure 2, the computations required include n multiplications, (n-1) additions, and a nonlinear activation function approximation. We write the execution time for a single neuron as

$$\tau_{proc}(neuron) = n\,\tau_{mult} + (n-1)\,\tau_{add} + \tau_{act}.$$

Realizing that a fully connected feedforward network provides each neuron of layer i with $n_{i-1} + 1$ input connections (we account for the threshold weight), we can easily derive an expression for the processing time by summing this single-neuron time over all the neurons in all P layers, where P is the number of layers and $n_i$ denotes the number of nodes in layer i. It is clear that the processing time for the digital case increases quadratically ($O(n^2)$) with network size.
Analog implementations do not generally suffer from this problem because additions, multiplications, and even the activation function may result directly from the device and circuit characteristics.
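To make the quadratic growth of the digital case concrete, the following sketch sums the single-neuron execution time over a fully connected feedforward network evaluated serially. It is an illustration under our own assumptions: the function names are hypothetical, the first entry of `layer_sizes` is taken to be the number of network inputs (which do no computation), and the three timing parameters are arbitrary unit costs.

```python
def neuron_time(n_inputs, t_mult, t_add, t_act):
    """Serial execution time of one neuron with n_inputs inputs:
    n multiplications, n - 1 additions, one activation evaluation."""
    return n_inputs * t_mult + (n_inputs - 1) * t_add + t_act

def network_time(layer_sizes, t_mult, t_add, t_act):
    """Serial processing time of a fully connected feedforward network.

    Each neuron sees every node of the previous layer plus one
    threshold input, hence prev + 1 inputs.
    """
    total = 0.0
    for prev, cur in zip(layer_sizes, layer_sizes[1:]):
        total += cur * neuron_time(prev + 1, t_mult, t_add, t_act)
    return total

# Doubling the width of every layer roughly quadruples the serial time.
small = network_time([100, 100, 100], 1.0, 1.0, 1.0)
large = network_time([200, 200, 200], 1.0, 1.0, 1.0)
ratio = large / small
```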
Another consideration is cost. To investigate this we first reflect upon ANN structure; there are two basic operations, computation and communication. These can be dissected into three fundamental architectural tenets: processing, memory, and communication. One approach to a cost analysis would compare the price of performing these three functions in the analog and the digital case. We follow that approach.
The cost of interconnecting points in analog and digital technology is essentially the same, assuming the "wiring" process and material to be identical for both cases. It becomes a question of the amount of interconnection required and the space it consumes. We will assume that, after taking noise margins into account, a particular analog device provides us with $\beta$ bits of accuracy which, for analog implementations, is propagated over a single interconnect. For equivalent accuracy we will then require $\beta$ digital interconnects for each analog interconnect. We can establish a simple relationship between the relative costs as

$$C_{comm}(digital) = \beta \cdot C_{comm}(analog)$$

where $\beta = \mathrm{Int}(3.32d + 1)$ and d is the number of decimal digits required (one decimal digit corresponds to $\log_2 10 \approx 3.32$ bits; $\beta \geq 1$, and $\beta$ is assumed to be independent of network size). Following the same argument for the processing costs (the two cases differ in the quantity of processing devices required for equivalent accuracies), we can assert a linear relationship there as well:

$$C_{proc}(digital) = \beta' \cdot C_{proc}(analog)$$

where $\beta'$ is proportional to $\beta$; we shall assume $\beta' = \beta$.
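The relation between decimal accuracy and the number of digital interconnects can be tried out with a short sketch. It assumes the conversion factor is $\log_2 10 \approx 3.32$ bits per decimal digit; the function names are ours, and the cost figures are arbitrary units.

```python
import math

def bits_for_decimal_digits(d):
    """Bits needed to match d decimal digits: Int(d * log2(10) + 1)."""
    if d < 1:
        raise ValueError("d must be at least 1")
    return int(d * math.log2(10) + 1)

def digital_interconnect_cost(analog_cost, d):
    """Digital interconnect cost as beta times the analog cost,
    with beta derived from the required decimal digits d."""
    return bits_for_decimal_digits(d) * analog_cost

# Two decimal digits of accuracy need 7 bits, so 7 digital wires
# stand in for each analog wire under this model.
beta = bits_for_decimal_digits(2)
```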
We now compare the relative costs to store the weights in memory. Fortunately, an applicable analysis exists in Wiener [9], which we shall cite.
Normally we would expect a more accurate device to cost more than a less accurate one, and we assume, for the purposes of this argument, that the cost, $C_{memory}$, is proportional to accuracy. In fact, if one thinks of the device as a meter with a pointer that must finish in one of n well-defined regions, it is clear that having just one region would not convey any information and should not contribute to the cost. Therefore the cost of the device should be counted only for regions in excess of 1, so that $C_{memory}$ is proportional to $(n - 1)$. We observe that the cost function is minimized when each memory device has only two states (when $N = \beta$, where N is the number of memory devices used to store a weight). This implies that if information is to be stored at all, it can be done most cheaply in binary code with digital devices. The memory costs of the two cases follow by using N = 1 for the analog case and $N = \beta$ for the digital case. The total cost can now be found by summing the component costs:

$$C_{total} = C_{proc} + C_{comm} + C_{memory}.$$
We may assume that $C_{proc}(digital)$, $C_{comm}(digital)$ and $C_{memory}(digital)$ are all linearly related (differing only by a constant factor) and that they are independent of the signal accuracy $\beta$. For simplicity we shall assume these constant factors to be unity. It then becomes clear that, as our precision requirements increase, the cost ratio of analog memory over digital memory increases exponentially with $\beta$, where $\beta$ represents network accuracy requirements (as well as the number of bits for the digital case).
We conclude that electronic digital implementations are generally more cost-efficient for any given precision.
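The memory-cost argument can be illustrated with a small model. The formula used here is our own reading of the meter analogy and is an assumption, not a quotation: storing $\beta$ bits on N devices requires each device to resolve $2^{\beta/N}$ regions, and each device is charged one unit per region in excess of one. The function names are hypothetical.

```python
def memory_cost(beta, num_devices, unit_cost=1.0):
    """Cost of storing beta bits on num_devices devices.

    Assumed model: each device distinguishes 2**(beta / num_devices)
    regions and costs unit_cost per region in excess of one.
    """
    regions = 2.0 ** (beta / num_devices)
    return unit_cost * num_devices * (regions - 1.0)

def analog_to_digital_ratio(beta):
    """Analog (one device) over digital (beta two-state devices) cost."""
    return memory_cost(beta, 1) / memory_cost(beta, beta)

# Under this model the ratio is (2**beta - 1) / beta.
ratios = [analog_to_digital_ratio(b) for b in (1, 4, 8)]
```

Under these assumptions the analog device costs $2^{\beta} - 1$ units against $\beta$ units for the binary store, which is one way to see the exponential growth of the ratio.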
Continuous vs. Discrete Implementation
Literality and the number of quantization levels (as discussed above) are spatial decisions for ANN implementors. We have not yet considered the temporal aspect. The issue of whether to use asynchronous, continuous-time devices or synchronous, discrete implementations involves efficiency and optimization in time.
Although the question is often entangled with the issues discussed above, it need not be. It is true that analog implementations tend to operate in continuous time and digital devices are usually discrete but hybrids abound.
Asynchronous-digital ANNs (see [7]) and synchronous-analog realizations do exist.
Continuous operation is more biologically accurate and almost essential in many of the recursive networks often used as associative memories [3]. But discrete implementations may simulate continuous time by stochastically firing the neurons [12]. Discrete implementations tend to have a larger network delay time due to the overhead in synchronization mechanisms (these mechanisms also carry a penalty in terms of chip area). The tradeoff made here is between a smaller, faster, more biologically accurate implementation and slower, more flexible operation.
THE ROLE OF OPTICS
We have seen that we are presented with somewhat of a dilemma when we attempt to implement ANNs electronically. While digital implementation may be more flexible and perhaps more cost-efficient, it lacks the speed and simplicity of its analog counterpart. Further, we are pushed towards virtual realizations because of the connectivity problem and the ubiquitous multiply-accumulate operations. Optical computing technology, though in its infancy, clearly has the potential to alleviate some of these problems by providing much denser interconnection, quicker propagation, and a natural parallelism which may be tailored to ANN designs.
As mentioned above, the neural network paradigm involves two fundamental activities: processing and communication. Optics has a very natural and evident advantage in the area of communication due to the noninterfering property of light. Many photonic signals may be transmitted through each other and in the same space with no information lost upon signal recovery. In electronics, capacitance and electromagnetic interference limit the speed and degree of miniaturization attainable.
In order to perform the processing activity, optical signals must be able to carry information. In electronics we encode information as voltages, currents, or phase and frequency data. Similarly, in optics we may encode information as amplitudes, intensities, polarization, phase or frequency. But some of these encoding schemes may require more expensive, coherent (single frequency) light sources such as lasers. One of the simplest ways to encode information is to use transmittance (intensity). This allows one to use incoherent light sources.
Upon encoding the information we next must perform weighting and thresholding operations on signals. The weighting operation is usually a multiplication (although other operations may be used; an XOR is used in a design example below). Figure 5 below shows the design of a simple analog optical multiplier. The information on plate 1, $f_1(x,y)$, is Fourier transformed by the lens $L_1$ and then inverse Fourier transformed by $L_2$ so that the reverse image, $f_1(-x,-y)$, is projected upon the second plate containing $f_2(x,y)$. The transmittance of the combination at any arbitrary point is the product of the transmittances of the individual images at that point: $f_3(x,y) = f_1(-x,-y) \cdot f_2(x,y)$.

For matrix-vector multiplication, all the elements within a column of the image introduced at the input plane have identical transmittance values. The information of the input plane thus corresponds to that of a vector, say b, of n elements. These columns of light can be superimposed on a matrix mask located at the next plane. The information of the matrix mask corresponds to that of a matrix, A, of dimensions n x n. The product, C (C = b*A), of the stored matrix and the input vector is then Fourier transformed along only the horizontal direction by the cylindrical lens, and then inverse Fourier transformed by the biconvex lens. Consequently, at the output plane we obtain the column vector of the product C such that its mth row consists of the sum of products $\sum_i a_{m,i} b_i$.

Figure 7 shows a three-dimensional view of optical matrix-vector multiplication. A spatial light modulator (SLM) may be used to form a weight mask. The SLM may be thought of as a controllable transparency which can adjust the transmittance of each of its pixels. With proper focusing optics (cylindrical lenses, not shown) the input beams' transmittance is weighted by a particular row of the weight mask SLM to produce an output transmittance. We may note that this scheme performs all the $O(n^2)$ multiply-accumulate operations in parallel.
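The optical matrix-vector multiplication can be mimicked in software as a two-step process: pointwise attenuation by the mask, then summation along each row. The sketch below is only a numerical analogy under our own assumptions; `optical_matvec` and its arguments are hypothetical names, and the loops run serially whereas the optical system performs every product at once.

```python
def optical_matvec(mask, b):
    """Numerical analogy of the optical multiplier.

    mask[m][i] -- transmittance of mask pixel (m, i), i.e. element a_{m,i}
    b[i]       -- intensity of the i-th column of input light
    """
    n = len(b)
    if any(len(row) != n for row in mask):
        raise ValueError("each mask row must have len(b) pixels")
    # Step 1: light passing through a pixel is attenuated by its transmittance.
    transmitted = [[a * bi for a, bi in zip(row, b)] for row in mask]
    # Step 2: the lenses collect each row onto one point of the output plane.
    return [sum(row) for row in transmitted]

# Hypothetical 2 x 2 mask and 2-element input vector.
result = optical_matvec([[1, 0], [0.5, 0.5]], [2, 4])
```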
The capacity is limited only by the level of miniaturization which can be achieved. Upon completion of the weighting operation, thresholding must be accomplished. Thresholding involves the application of some nonlinear activation function. This can be divided into two broad categories: external thresholding and internal thresholding [13]. External thresholding is distinguished by using electronics to apply the nonlinearity. In internal thresholding, nonlinear optical devices must be used. Although electronic devices are clearly more adept at performing nonlinear transformations, their use requires a conversion process (optic to/from electronic). This is generally costly in terms of speed, power consumption and device size. In practice, external thresholding is more popular since nonlinear optical devices are relatively immature.
Design Example
Content-Addressable Network
The Content-Addressable Network (CAN) is an electro-optic implementation of a pattern-classifying neural network proposed by Brodsky, Mardsen, and Guest [6]. CAN represents a simple, efficient hybrid system which exploits the particular strengths of optics and electronics, using each technology to do what it does cheaply and well.
Binary classification would be expensive using a conventional continuous-valued network model on a digital machine. The backpropagation learning algorithm has been observed to require 8 or more bits at each interconnection. Without this level of precision the network may not converge to a learned solution. CAN avoids this issue by using binary values for each connection weight and node output. This reduction in accuracy requirements allows for the utilization of digital memories which, as argued earlier, are expected to be cheaper than analog memories. The activation function is also binary rather than a sigmoid or tanh function. Figure 8 shows a single layer of the CAN network. Each horizontal row of the weight matrix represents a node whose weights, $m_{i,j}$, are exclusive-ORed with the elements of the input vector, V. We should recall that the exclusive-OR operation will produce a 1 (high or true) output only if these two values are different. We shall abbreviate the exclusive-OR operation as "XOR" and denote it as "⊕". Figure 9 shows the structure of one such memory. In this way an input string is associated with a particular output.
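The XOR weighting of a single CAN layer can be sketched as follows. This is our own illustration, not the authors' implementation: `can_layer_sums` is a hypothetical name, and summing the XOR outputs along each row is an assumption about how the per-node result is formed. Under that assumption each sum counts the positions in which a node's weights differ from the input.

```python
def can_layer_sums(weight_matrix, v):
    """For each node (row), XOR its binary weights with the binary input
    vector and count the 1s, i.e. the number of mismatched positions."""
    if any(len(row) != len(v) for row in weight_matrix):
        raise ValueError("each weight row must match the input length")
    return [sum(m ^ x for m, x in zip(row, v)) for row in weight_matrix]

# Hypothetical 2-node layer with 3-bit weights.
weights = [[1, 0, 1],
           [0, 0, 0]]
sums = can_layer_sums(weights, [1, 0, 1])
```

A sum of zero means the stored row matches the input exactly, so small sums indicate close patterns.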
The summation vector, D, is computed optically using a technique similar to the matrix-vector multiplication mentioned above. We may express this as
