Large processor arrays are candidates for performing computations of neural network models at the speeds required for real-time applications, e.g. in pattern recognition. The paper gives a general model of an array of bit-serial processors and demonstrates the mapping of neural net models on such an array.
INTRODUCTION
Recent years have seen an enormous increase of interest in neural networks. It has been realized that massive parallelism is required for human-like performance in pattern recognition. Neural networks provide one technique to do this. Processor arrays are candidates for performing the computations efficiently. This paper studies the mapping of neural network computations on a regular array of a large number of simple processors. The computations are uniform and arithmetically simple. This suggests that simple processing elements are sufficient and that the SIMD type of architecture is appropriate. The number of interconnections in a neural network is often orders of magnitude greater than the number of processing units. This suggests that connectivity be stored partly in matrices.
We study the mapping of both feedforward (with back-propagation) and feedback neural nets. The characteristics of these models will be briefly outlined. Before that we introduce a generic architecture for a bit-serial array processor (BAP). We describe the algorithms in a parallel language (Pascal/L) which includes constructs directly implementable as elementary operations of a BAP. The computations are analysed, performance figures are given, and system implementation is discussed.
BIT-SERIAL ARRAY PROCESSORS
A Bit-serial Array Processor (BAP) is characterized by the following properties:
- It is organized as an SIMD processor, i.e. it consists of many processing elements (PEs) and one common control unit.
- The PEs treat data bit-serially and the data paths to and from each PE are only one bit wide.
- Activation of the PEs may be data driven ("associative process"), which means it is not the location or address of a PE that decides its action on an instruction from the control unit, but some property of the data in the memory or registers of that PE.
- An interconnection network defines a topological relationship between PEs.
CH2898-5/90/0000/0501 $01.00 © 1990 IEEE
A commonly used organization of the BAP is to place the interconnection network between the memory part and the logic part of the PE, as shown in Figure 1. This does not mean that memory and logic are physically apart; on the contrary, they are seen as a whole and should preferably be put on the same chip.
Figure 1. Organization of a Bit-serial Array Processor (Memory Modules, Interconnection Network, ALU Modules).
A BAP is defined by the characteristics of five parts: data storage, processing, data alignment, input/output, and control.
Data storage is organized as Memory Modules (MMs). One bit from each MM is accessible at a time. A number of such bit-slices, normally consecutive, form a field.
The Processing part is an ensemble of Arithmetic and Logic Units (ALUs) which implement functions on a set of one-bit arguments. The complexity of the ALU may vary from boolean functions of two variables through bit-serial multipliers to full bit-serial floating point units.
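As a rough illustration of the lower end of this range, the following Python sketch models an ALU built around a single full adder. It is not the authors' design: the restriction to unsigned operands, the shift-and-add scheme, and all function names are assumptions made for illustration. Operands are consumed one bit per cycle, and a multiplication of two b-bit numbers is composed of b serial additions of 2b bits each, so the cycle count grows with the square of b.

```python
def full_adder(a: int, b: int, carry: int):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    total = a + b + carry
    return total & 1, total >> 1


def serial_add(x: int, y: int, width: int):
    """Add two unsigned numbers one bit per cycle, least significant bit first.

    Returns (sum modulo 2**width, number of cycles used).
    """
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, width


def shift_and_add_multiply(x: int, y: int, b: int):
    """Multiply two unsigned b-bit numbers using only serial_add.

    A zero partial product is added when a multiplier bit is 0, so the
    cycle count does not depend on the data.
    Returns (product, total full-adder cycles).
    """
    product, cycles = 0, 0
    for i in range(b):
        partial = (x << i) if (y >> i) & 1 else 0
        product, used = serial_add(product, partial, 2 * b)
        cycles += used
    return product, cycles


product, cycles = shift_and_add_multiply(13, 11, 4)
print(product, cycles)
```

Doubling b in this model quadruples the cycle count, since b additions of 2b bits are performed.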
One of the registers in the ALU is the one-bit Activity Register, the contents of which determine whether or not the ALU takes part in the specified operation. To choose only one ALU for activity, a select first facility is included. In more elaborate models multi-bit registers may be used to determine one out of a set of actions to be performed.
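The role of the Activity Register and the select first facility can be sketched in Python as follows. This is a sequential simulation under stated assumptions: each list element stands for one PE, and the names `masked_apply` and `select_first` are hypothetical, chosen only for this example.

```python
from typing import Callable, List


def masked_apply(values: List[int], activity: List[bool],
                 op: Callable[[int], int]) -> List[int]:
    """Apply op in every PE whose activity bit is set; other PEs keep their value."""
    return [op(v) if active else v for v, active in zip(values, activity)]


def select_first(activity: List[bool]) -> List[bool]:
    """Return a mask in which only the first set activity bit remains set."""
    result = [False] * len(activity)
    for i, active in enumerate(activity):
        if active:
            result[i] = True
            break
    return result


# Data-driven activation: each PE decides from its own data whether to take part.
values = [3, 8, 5, 12]
activity = [v > 4 for v in values]
doubled = masked_apply(values, activity, lambda v: 2 * v)
first_only = select_first(activity)
print(doubled, first_only)
```

On a real BAP the masked operation happens in all PEs at once; the list comprehension only mimics that behaviour.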
The Data Alignment part consists of an interconnection structure that allows each ALU to receive data also from "neighbouring" modules. Common structures are the square grid, the linear array, the n-cube and the shuffle-exchange. Many separate structures may be implemented on the same processor.
The structure of the Input/Output part is strongly dependent on the demands of the application and may be varied in several ways. For example, in some cases a direct bit-slice wide interface to the data source may be motivated.
The total activity is mastered by the Control Unit, which takes instructions from an ordinary sequential processor.
Multiplication is a frequent operation in many application areas. Using ALUs with a complexity comparable to a full adder only, the multiplication time grows quadratically with the data length. Ohlsson [1] suggested a bit-serial multiplier in each ALU, giving a multiplication time that is no longer than the time required to read the operands and store the result. Figure 2 shows the design for multiplication of two 2's complement integers using a series of full adders (FA). The multiplicand is first shifted, most significant bit first, into the array of M flip-flops. The multiplier is then applied to the input, least significant bit first, and the product bits appear at the output, least significant bit first. The S flip-flops store the accumulated sum. A more detailed description is given in
NEURAL NETWORK ALGORITHMS
Several neural net models have been proposed. They are characterized by network topology, node characteristics, and training rules. Frequently used and discussed models are the multilayer feedforward networks with supervised learning by error back-propagation [3] and the feedback networks, either with symmetric connectivity and stochastic nodes (Boltzmann machines [4, 5]), symmetric connectivity and deterministic nodes (Hopfield net [6, 7, 8]), or nonsymmetric connectivity and deterministic nodes [9, 10]. In order to be as general as possible in the implementation studies we use a feedback algorithm without any assumption on symmetry of the weight matrix. Thus, for the Hopfield model and the Boltzmann machine shorter execution times than those reported below can be expected (both use symmetric matrices).
The back-propagation model is used as a pattern classifier or feature detector. The feedback models are used as auto-associative memories for tasks like pattern completion.

Feedforward networks with error back-propagation
A feedforward net (ff net) with four layers is shown in Figure 3. Each node (neuron) in a layer receives input from every node in the previous layer. Each node computes a weighted sum of all its inputs. Then it applies a nonlinear activation function to the sum, resulting in an activation value of the neuron. A sigmoid function, with a smooth threshold-like curve, is the most frequently used activation function in feedforward networks.
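The computation performed by one layer can be sketched in plain Python as below. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the logistic function is assumed as the sigmoid, and a bias (threshold) term per node is included as an assumption.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic sigmoid: a smooth, threshold-like activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(weights: List[List[float]], biases: List[float],
                  inputs: List[float]) -> List[float]:
    """Compute the activation values of one fully connected layer.

    weights[i][j] is the weight of the connection to node i in this layer
    from node j in the previous layer.
    """
    outputs = []
    for w_row, b in zip(weights, biases):
        net = sum(w * o for w, o in zip(w_row, inputs))  # weighted sum of inputs
        outputs.append(sigmoid(net + b))                 # nonlinear activation
    return outputs


print(layer_forward([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], [2.0, 2.0]))
```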
Figure 3. A four-layer feedforward network (input layer, hidden layers, output layer; layers l-1 and l are marked).
The back-propagation algorithm (also known as the generalized delta rule) [3] is used to train the network in our examples.
In the first phase the input to the network is provided and values propagate through the network to compute the output vector O. O is then compared with a target vector T provided by a teacher, resulting in an error vector E. In the second phase the values of the error vector are propagated back. The error signals for hidden units are thereby determined recursively: error values for layer $l$ are determined from a weighted sum of the values of layer $l+1$, again using the connection weights, now "backwards". The weighted sum is multiplied by the derivative of the activation function to give the error value $\delta$.
Finally, appropriate changes of weights and thresholds are made. The weight change in the connection to unit $i$ in layer $l$ from unit $j$ in layer $l-1$ is proportional to the product of the output value $o$ and the error value $\delta$: $\Delta w_{ij}^{(l)} = \eta \delta_i^{(l)} o_j^{(l-1)}$. The bias (threshold) value may be seen as the weight from a unit that is always on. The algorithm is summarized below.
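1. Apply input.
2. Compute output $o_i^{(l)} = f(net_i^{(l)} + b_i^{(l)})$, where $net_i^{(l)} = \sum_j w_{ij}^{(l)} o_j^{(l-1)}$, for each layer.
3. Determine error vector $E = T - O$.
4. Propagate error backwards. If node $j$ is an output node then the elements of the error value vector D are $\delta_j^{(l)} = o_j^{(l)} (1 - o_j^{(l)}) e_j$, else $\delta_j^{(l)} = o_j^{(l)} (1 - o_j^{(l)}) \sum_i \delta_i^{(l+1)} w_{ij}^{(l+1)}$. Here we have used the fact that the sigmoid function $f(x) = 1/(1 + e^{-x})$ has the derivative $f' = f(1 - f)$.
5. Adjust weights and thresholds: $\Delta w_{ij}^{(l)} = \eta \delta_i^{(l)} o_j^{(l-1)}$ and $\Delta b_i^{(l)} = \eta \delta_i^{(l)}$.
6. Repeat from 1.
Algorithm 1. Back-propagation training algorithm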
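One training step of the generalized delta rule can be sketched in plain Python as follows. This is an illustrative sketch under assumptions, not the authors' code: the logistic sigmoid is assumed (so the derivative is o(1 - o)), the learning rate is called `eta`, and the function names and data layout (one weight matrix and one bias vector per layer) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, biases, x):
    """Return the output vector of every layer; outputs[0] is the input."""
    outputs = [x]
    for W, b in zip(weights, biases):
        prev = outputs[-1]
        outputs.append([sigmoid(sum(w * o for w, o in zip(row, prev)) + bi)
                        for row, bi in zip(W, b)])
    return outputs


def train_step(weights, biases, x, target, eta):
    """One forward pass, backward pass and weight update (in place).

    Returns the squared error measured before the update.
    """
    outputs = forward(weights, biases, x)
    out = outputs[-1]
    # Output layer: error e = t - o times the sigmoid derivative o(1 - o).
    delta = [o * (1 - o) * (t - o) for o, t in zip(out, target)]
    for l in range(len(weights) - 1, -1, -1):
        prev = outputs[l]
        if l > 0:
            # Propagate the error backwards through the not yet updated weights.
            back = [sum(delta[i] * weights[l][i][j] for i in range(len(delta)))
                    for j in range(len(prev))]
            next_delta = [o * (1 - o) * s for o, s in zip(prev, back)]
        # Weight change: eta * delta_i * o_j; the bias acts as a weight
        # from a unit that is always on.
        for i, d in enumerate(delta):
            for j, o in enumerate(prev):
                weights[l][i][j] += eta * d * o
            biases[l][i] += eta * d
        if l > 0:
            delta = next_delta
    return sum((t - o) ** 2 for o, t in zip(out, target))
```

Calling `train_step` repeatedly on the same pattern with a moderate `eta` should drive the returned error towards zero; the weights and biases are modified in place.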
Feedback networks
A feedback network has a single set of completely interconnected nodes, see Figure 4. All nodes are both input and output nodes.
Each node computes a weighted sum of all its inputs and applies a nonlinear activation function to the sum. The resulting value is treated as input to the network in the next step. When the net has converged, i.e. when the output no longer changes, the pattern on the output of the nodes is the network response.
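The recall iteration just described can be sketched in Python as below. The sketch makes several assumptions that are not stated in the paper: all nodes are updated synchronously, a bias term per node is included, convergence is tested with a tolerance `tol`, and a step limit `max_steps` guards against networks that never settle. All names are hypothetical.

```python
def recall(weights, biases, pattern, f, max_steps=100, tol=0.0):
    """Iterate a fully interconnected network until its output stops changing.

    weights[i][j] is the weight of the connection to node i from node j.
    Returns (final activations, True if the net converged within max_steps).
    """
    a = list(pattern)
    for _ in range(max_steps):
        # Every node forms a weighted sum of all outputs and applies f.
        new_a = [f(sum(w * x for w, x in zip(row, a)) + b)
                 for row, b in zip(weights, biases)]
        if max(abs(n - o) for n, o in zip(new_a, a)) <= tol:
            return new_a, True
        a = new_a
    return a, False


def step(x):
    """Hard threshold activation, used here only to keep the example exact."""
    return 1.0 if x > 0 else 0.0


W = [[0.0, 1.0], [1.0, 0.0]]
b = [-0.5, -0.5]
print(recall(W, b, [1.0, 1.0], step))
print(recall(W, b, [1.0, 0.0], step, max_steps=10))
```

With synchronous updates the second call oscillates between two patterns, which is why the step limit is needed in this sketch.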
Training or learning can be done in supervised mode with the delta rule [11] or back-propagation [10], or unsupervised by a Hebbian rule [11]. The delta rule is more powerful than the Hebb rule, and more commonly used than back-propagation (for feedback nets). We only analyse the delta rule algorithm.
In the first phase of training the pattern is imposed on the net at time zero by forcing the output from the net to match the pattern. Following this initiation, the net iterates in discrete time steps using the given formula. When the net has converged, the activation $a_i$ is compared to the target $t_i$ and the error is calculated as $e_i = t_i - a_i$. The weights are changed in proportion to the product of the activation $a_j$ and the error $e_i$, i.e. $\Delta w_{ij} = \eta e_i a_j$.
1. Set activation values to external input values.
2. Calculate new activation values $a_j = f(net_j + b_j)$, where $net_j = \sum_i w_{ji} a_i$, until the network is stable.
3. Determine error vector $E = T - A$.
4. Adjust the weights $\Delta w_{ij} = \eta e_i a_j$ and bias $\Delta b_i = \eta e_i$.
5. Repeat from 1.
Algorithm 2. Delta learning algorithm for feedback networks
MAPPING NEURAL NETWORKS ON A BAP
It should be clear from the above descriptions of networks that the computations of both a feedforward network with error back-propagation and a feedback network involve mainly matrix-by-vector multiplications, where the matrices contain the connection weights and the vectors contain activation values or error values. Such a multiplication contains $N^2$ scalar multiplications and N computations of sums of N numbers.
The fastest possible way to compute this is to perform all $N^2$ multiplications in parallel, which requires $N^2$ PEs and unit time, and then form the sums by using trees of adders. The addition requires N(N-1) adders and O(log N) time. This is, however, an unrealistic method because of both the number of PEs required and the communication problems caused. Instead we take the approach of having as many PEs as neurons in a layer, N, and storing the connection weights in matrices, sized N by N, one for each layer. The PE with index j has access to row j of the matrix by accessing its own memory word.
Feedforward net with error back-propagation
Figure 5 shows the data storage for each layer. In the forward pass the net vector is computed by N successive, parallel multiply-and-add operations, each requiring access to a different output value from the previous layer. Thus the PEs must, one after the other, broadcast their output values to all PEs. The O-vector is computed by a parallel application of the activation function.
In the backward pass the computation of the error vector for each layer requires vertical addition. We suggest a bit-serial adder tree. The addition of each column can be overlapped with the multiplications for the next. On completion of this phase, weight changes are calculated. The O-vector of layer $l-1$ is first multiplied by a constant, $\eta$. Then, for each j, the j:th value of the result is broadcast to all other PEs, the E-vector is multiplied by this value and the result is added to the j:th column of the weight matrix. The threshold vector is changed in a similar way.
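The forward pass under this row-per-PE mapping can be simulated sequentially in Python as sketched below. This is only an illustration under assumptions: each list element stands for the local memory of one PE, the loop over `k` plays the role of selecting PEs one by one for broadcasting, and the function name `bap_forward` is hypothetical.

```python
def bap_forward(weight_rows, prev_outputs, biases, f):
    """Simulate the forward pass for one layer on an array of N PEs.

    weight_rows[j] is row j of the weight matrix, held in the memory of PE j.
    prev_outputs[k] is the output value of the previous layer held in PE k.
    Returns (new output vector, number of broadcast steps).
    """
    net = [0.0] * len(weight_rows)      # one accumulator field per PE
    steps = 0
    for k in range(len(prev_outputs)):  # PEs are selected in order, one by one
        broadcast = prev_outputs[k]     # PE k broadcasts its output value
        # On a BAP the following multiply-and-add is done by all PEs at once.
        net = [acc + row[k] * broadcast for acc, row in zip(net, weight_rows)]
        steps += 1
    # Parallel application of the activation function.
    return [f(n + b) for n, b in zip(net, biases)], steps


outputs, steps = bap_forward([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0],
                             [0.0, 0.0], lambda x: x)
print(outputs, steps)
```

The number of steps equals N regardless of the number of connections, since all N multiply-and-add operations for one broadcast value are performed in parallel.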
The Pascal/L language
Pascal/L is an extension of Pascal for parallel processing, developed in the LUCAS project [2]. In Pascal/L the parallelism of the architecture has a correspondence in the syntax of the language. Thus, constructs in the language are directly implementable as elementary operations of a BAP.
A selector defines a boolean vector over the MMs and is used to control the parallelism of operations. A parallel array has a fixed number of components, all of the same type and located in the MMs. An indexing scheme allows simultaneous access to a column or a subset of the column components of a two-dimensional array. For example: W[*,5] selects column 5 of W, W[SEL,5] selects a subset of column 5. A parallel array may be used without any index at all (and no brackets), in which case all components of the array are referenced.
To support data-driven processing a number of standard functions and procedures can be applied to selectors. The first function finds the first component of a selector with the value true and returns a new selector with only this element true. The next procedure assigns false to the first true element of the selector. This is useful when elements are to be processed sequentially. The some function returns true if there is at least one true element of the selector, otherwise it returns false.
COMPUTATION TIME
Feedforward networks with error back-propagation
The computations for one layer of the feedforward pass contain N operations of type multiply(by constant)-and-add followed by a few (maximum ten) multiply-and-add operations to compute the activation function (e.g. by a piecewise linear function which approximates the sigmoid function). Since N is large in the applications we consider, we can leave the latter operations out when we estimate the computation time.
During accumulation the sum will grow to a maximal length of b + log N bits. On the average the number of cycles for multiply-and-add will be 4b + log N - 1, using the bit-serial multiplier of Figure 2.
In the backward error computation phase the summation is made over the adder tree in b + log N cycles. A multiplication and a tree addition can be made simultaneously. The weight changing phase, finally, takes 4b cycles per column.
In total the computations for one layer consume [8b + log N - 1 + max(3b, b + log N)]N cycles during training and (4b + log N - 1)N cycles during recall.
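The cycle counts above are easy to evaluate for a given layer size and word length. The Python sketch below does this under assumptions that the text does not state explicitly: b is taken to be the data length in bits, log N is taken as the base-2 logarithm rounded up, and the clock frequency is a free parameter. The function names and the example values for N and b are hypothetical.

```python
def layer_cycles(n: int, b: int, training: bool) -> int:
    """Clock cycles for one layer with n neurons and b-bit data.

    Uses the per-layer totals quoted in the text; log N is assumed to be
    the base-2 logarithm rounded up to an integer.
    """
    log_n = (n - 1).bit_length()  # ceil(log2(n)) for n >= 1
    if training:
        return (8 * b + log_n - 1 + max(3 * b, b + log_n)) * n
    return (4 * b + log_n - 1) * n


def layer_time_seconds(n: int, b: int, training: bool, clock_hz: float) -> float:
    """Execution time of one layer at the given clock frequency."""
    return layer_cycles(n, b, training) / clock_hz


# Example values (assumed for illustration): 1024 neurons, 8-bit data, 10 MHz.
print(layer_cycles(1024, 8, training=False))
print(layer_cycles(1024, 8, training=True))
print(layer_time_seconds(1024, 8, False, 10e6))
```

Because b enters the formulas linearly, halving the word length in this model roughly halves the recall time, which is one way to see the trade-off between precision and speed.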
Assuming a clock frequency of 10 MHz (which is fairly conservative), the execution times are shown in Table 1 (with m-1 connection weight matrices stored). It should also be mentioned that there are methods proposed that include a momentum term $\alpha$ in the weight changing rule for the ff networks (refer to Algorithm 1):
$\Delta w_{ij}^{(l)}(n+1) = \eta \delta_i^{(l)} o_j^{(l-1)} + \alpha \Delta w_{ij}^{(l)}(n)$
Thus the past weight change affects the current direction to an amount determined by the constant $\alpha$. This is considered to allow a high learning rate without leading to oscillations. However, it requires that the weight changes be stored as well, which doubles the required memory space.
The implementation is software configurable to allow for "compilation" of a certain architecture to suit a specific application. Neural network computations constitute one such application area.
CONCLUSIONS
We have given a general model of a bit-serial array processor (BAP) and have shown how the computations of different neural network models can be performed on such a processor. A major advantage of the bit-serial working mode is that precision can be traded for speed. We have calculated execution times and memory requirements for feedforward and feedback networks of different sizes and with different numerical precision. Results show a large speed advantage over commercial neural net simulators and form the basis for the outline of a one-board implementation comprising 1024 processing elements.
An interesting result is that the computations do not require the processor array to have a very rich communication structure. The facilities needed are the ability to broadcast a single bit from any processor to all others, a means for selecting processors in order, one by one (a select first chain), and a bit-serial adder tree to add the values of a field.
The approach taken is to map one neuron on each processor; in the case of multilayer networks the same processors are used for all layers. If more processors are available, or if the processors are fewer than the neurons, the programs presented must be slightly changed. The speed will increase or degrade accordingly. Thus, the speed of a certain network can be adjusted by choosing the number of processors.
A critical operation in the computations is multiplication. We have shown how a very simple bit-serial multiplier structure using carry-save technique can equalize multiplication time relative to addition time.
It is seen that the different net models that we have studied put the same demands on the processing array. These models are representative of the neural networks area, implying that efficient execution of most kinds of neural networks on a BAP can be expected.
