This paper presents a mapping scheme for parallel pipelined execution of the Backpropagation Learning Algorithm on distributed memory multiprocessors (DMMs). The proposed implementation exhibits training set parallelism that involves batch updating. Simple algorithms are presented that allow the data transfer involved in both the forward and backward execution phases of the backpropagation algorithm to be carried out with a small communication overhead. The effectiveness of our mapping is illustrated by estimating the speedup of a proposed implementation on an array of T-805 transputers.
Introduction
Parallel implementation of multilayer neural nets can be broadly divided into two distinct categories: implementation on a set of simple processors having small local memory, and implementation on distributed memory multiprocessors (DMMs), which consist of more powerful processors with sufficient memory. In the first case, all the neurons and weights are distributed across the processors of the parallel architecture, and the synaptic operations of a single layer are carried out in parallel [6]. The parallelism achieved in this way is the finest level of parallelism that can be identified in the parallel implementation of a multilayer neural net and is known as synapse level parallelism [5]. Two distinct types of parallelism can be visualized in the parallel implementation of multilayer neural nets on DMMs. Basically, there are two methods of achieving training set parallelism. In the first method, different training pairs can be processed in multiple copies of a single neural net, each copy being assigned to one processor [10] or each copy distributed across multiple processors [11]. The second method, which is preferable for implementation on array processors, is to process multiple training pairs in a pipelined way in the rows of the processor array, each row executing a different layer of the neural net. The forward and backward passes of more than one layer may be simultaneously executed for multiple training pairs in the rows of the processor array.
In the present paper we investigate the pipelined execution of multiple training pairs in the rows of a two-dimensional processor array. After discussing the assignment of neurons and weights to the processors of the array architecture, we identify the data transfer required for parallel execution of different layers of the neural net in the rows of the processor array. Multiple data items heading for a common destination are sent in the form of one packet. This results in a reduction in the total communication overhead encountered in executing both the forward and backward passes of any layer for two different training patterns. Algorithms are presented separately for the forward and backward passes, and also for the pipelined execution of both execution phases for different training pairs. The performance of the proposed mapping technique has been analyzed for a T-805 based transputer array. Section 2, following the introduction, discusses the architecture of the multilayer perceptron and outlines its principle of operation. The backpropagation learning algorithm that is used for training this network is also discussed in that section. The mapping method is presented in Section 3, where algorithms are given for execution of the backpropagation algorithm on an array architecture. In Section 4, we analyze the performance of a proposed implementation on an array of T-805 transputers. Finally, Section 5 summarizes this paper and comments on the efficacy of the mapping method.
Multilayer Perceptron and Backpropagation Algorithm
The primary application of the multilayer perceptron (MLP) network is in pattern classification or, more simply, pattern matching [1]-[4]. Here, the network produces the correct output pattern for any pattern presented at its input. The learning algorithm followed to train the network to perform this task is the backpropagation algorithm devised by Rumelhart et al. in 1986 [4]. In this algorithm, a training set consisting of a number of input-target pattern pairs is repeatedly presented to the network, and the network weights are adjusted till the correct output pattern is obtained for every input pattern in the training set. Afterwards, the network continues to classify the input patterns correctly, irrespective of whether they belong to the training set or not. Below, we discuss the structure of the multilayer perceptron and the backpropagation algorithm in brief.
The multilayer perceptron consists of an input layer, an output layer and one or more hidden layers of neural elements or neurons. There is full connectivity between two adjacent layers, and the connections are weighted. Fig. 1 shows a 3 layer perceptron with 4 neurons in each layer. The function of the input layer neurons is only to distribute the input values to all the neurons of the first hidden layer. The backpropagation algorithm that is used to train this network consists of two phases: the recall phase or forward pass, and the learning phase or backward pass. The forward pass starts with each neuron computing a weighted sum of all its inputs and then determining its output by applying a non-linear activation function to this sum. The output is then supplied to all the neurons of the next higher layer, which then proceed to compute their own outputs. This process is continued till the network output, composed of the state values of all the output layer neurons, has been found.
Thereafter starts the learning phase or the backward pass, which begins with each neuron in the output layer finding its error value by comparing its output with the target value. The error values of the output layer are then sent backward through the network, calculating the error value for each lower layer neuron and adjusting its input weights. The process terminates after all the input weights of the first hidden layer have been adjusted. This completes execution of the backpropagation algorithm for a single training pair. After multiple presentations of the entire training set, the weights finally converge to definite values, which signals the end of the training phase. Now, the network can be used to classify arbitrary sets of input patterns.
Our assumptions regarding the multilayer perceptron and the notations used in the ensuing discussion are given below.
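The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the sigmoid activation, the function names and the weight layout `weights[l][i][j]` (weight from neuron `j` of one layer to neuron `i` of the next) are assumptions made for the example.

```python
import math


def sigmoid(x):
    # Assumed activation; the text only requires a non-linear function.
    return 1.0 / (1.0 + math.exp(-x))


def forward_pass(weights, inputs):
    """Return the activations of every layer, input layer first.

    weights[l][i][j] is the (hypothetical) weight from neuron j of layer l
    to neuron i of layer l + 1.
    """
    activations = [list(inputs)]  # the input layer only distributes its values
    for layer in weights:
        prev = activations[-1]
        # Each neuron forms a weighted sum of its inputs and squashes it.
        activations.append(
            [sigmoid(sum(w * a for w, a in zip(row, prev))) for row in layer]
        )
    return activations
```

The returned list keeps the activations of every layer because the backward pass needs them later to compute error values and weight changes.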
The MLP network considered in this paper is as tl denotes the jth target pattern.
T is the total number of training pairs in the training set. [1]. This function is preferred as it has a very simple derivative, given by:
Equations 2 and 3 are followed in the learning phase. Equation 2, used to compute the error values of a lower layer neuron from the error values of its higher layer, is:
Changes in input weights of layer 1 neurons are found by the equation:
In the above equation, $\eta$ is the learning rate, which usually lies between 0.25 and 0.75 [1].
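A minimal Python sketch of the learning phase with batch updating follows. It assumes the textbook sigmoid forms of the error and weight-change rules, namely $\delta_j(l) = a_j(l)(1 - a_j(l)) \sum_i w_{ij}(l+1)\,\delta_i(l+1)$ and $\Delta w_{jk}(l) = \eta\,\delta_j(l)\,a_k(l-1)$, which may differ in detail from Equations 2 and 3. The function names and the weight layout are hypothetical.

```python
def backward_pass(weights, activations, target, eta):
    """Return per-layer weight changes for one training pair.

    weights[l][i][j] : assumed weight from neuron j of layer l to neuron i of layer l + 1
    activations      : activations of every layer, input layer first
    The weights themselves are not modified (batch updating).
    """
    out = activations[-1]
    # Output-layer error, assuming sigmoid units: (t - a) * a * (1 - a).
    delta = [(t - a) * a * (1.0 - a) for t, a in zip(target, out)]
    changes = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        below = activations[l]
        # Assumed weight-change rule: eta * delta_i * a_j.
        changes[l] = [[eta * d * a for a in below] for d in delta]
        if l > 0:
            # Propagate the error to the layer below (assumed sigmoid derivative).
            delta = [
                a * (1.0 - a)
                * sum(weights[l][i][j] * delta[i] for i in range(len(delta)))
                for j, a in enumerate(below)
            ]
    return changes


def accumulate(total, changes):
    """Add one pair's weight changes into the running batch total, in place."""
    for total_layer, change_layer in zip(total, changes):
        for total_row, change_row in zip(total_layer, change_layer):
            for k, c in enumerate(change_row):
                total_row[k] += c
```

Keeping `backward_pass` free of side effects mirrors batch updating: the changes from all training pairs are summed with `accumulate` and applied to the weights once per presentation of the training set.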
The Mapping Method
We propose a mapping method for parallel pipelined execution of the backpropagation algorithm on a two-dimensional array of processors. Below, we discuss the target architecture, the initial data assignment and the data routing involved in this implementation.
Target Architecture
The target architecture consists of a host computer and 
Initial Data Assignment
The initial data assignment, which gives the distribution of the neural net data on the target parallel architecture, is quite simple. 
Inter-processor Communication
The inter-processor communication (IPC) in both phases of the backpropagation algorithm involves a set of nearest neighbour shifts. First, the data transfer involved in each of the two phases is discussed separately; then we present an algorithm to execute the two phases concurrently for multiple training pairs. The addition and subtraction operations on processor indices in these algorithms are assumed to be modulo P.
Forward Pass
The forward pass begins by the host computer communicating the input values to the corresponding neurons of the first hidden layer by sending $a_i(0)$ to $P_i(1)$, 
Backward Pass
The backward pass involves the calculation of lower layer error values, as well as the determination of changes in input weights of this layer. Algorithm-2 is followed to compute the error values of the lth layer neurons and to adjust their input weights.
Algorithm-2
1. In the processor $P_i(l+1)$, $1 \le i \le n$, initialize the accumulator $A_i(l) = w_{ii}(l+1)\,\delta_i(l+1)$.
2. Send $A_i(l)$ from $P_i(l+1)$ to $P_{i+1}(l+1)$.
3. In the processor $P_i(l+1)$, receive the data $A_j(l)$, $1 \le j \le n$, from $P_{i-1}(l+1)$ and do the following:
- Update $A_j(l) = A_j(l) + w_{ij}(l+1)\,\delta_i(l+1)$.
- Send $A_j(l)$ to $P_{i+1}(l+1)$.
4. Execute step 3 (P-1) times. Afterwards, processor $P_j(l+1)$, $1 \le j \le n$, has the accumulator $A_j(l) = \sum_{i=1}^{n} w_{ij}(l+1)\,\delta_i(l+1)$.
5. Transfer $A_j(l)$ from $P_j(l+1)$ to $P_j(l)$. Also, send $a_j(l-1)$ from $P_j(l-1)$ to $P_j(l)$.
6. Then, in (P-1) right shifts send $a_j(l-1)$ to all the processors of the lth row.
7. In processor $P_j(l)$, compute the error value $\delta_j(l)$ and the weight changes $\Delta w_{jk}(l)$, $1 \le k \le n$, using Eqns. 2 and 3.
The above algorithm needs a communication time of CP.
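The ring accumulation at the heart of Algorithm-2 can be simulated sequentially to check that every accumulator collects one contribution from each processor of the row. The sketch below is illustrative only: it assumes one neuron per processor (P = n), uses 0-based indices, and the function name is hypothetical.

```python
def ring_accumulate(w, delta):
    """Simulate the nearest-neighbour shifts of Algorithm-2 on one row.

    w[i][j]  : weight held by processor i (row i of the weight matrix of layer l + 1)
    delta[i] : error value held by processor i
    Returns a list whose entry j is A_j = sum over i of w[i][j] * delta[i].
    """
    P = len(delta)
    # packets[i] is the (j, A_j) pair currently residing on processor i.
    packets = [(i, w[i][i] * delta[i]) for i in range(P)]
    for _ in range(P - 1):
        # One shift: processor i receives the packet of processor i - 1 (mod P).
        packets = [packets[(i - 1) % P] for i in range(P)]
        # Each processor adds its own contribution to the accumulator it received.
        packets = [(j, a + w[i][j] * delta[i]) for i, (j, a) in enumerate(packets)]
    # Collect the accumulators by neuron index; where each packet physically
    # ends up on the ring is not modelled in this sketch.
    result = [0.0] * P
    for j, a in packets:
        result[j] = a
    return result
```

Because each packet visits every processor exactly once, no processor ever needs more than one row of the weight matrix, which is what keeps the communication to nearest-neighbour shifts.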

Pipelined Execution
The basic idea of the mapping is to execute different layers of the neural net concurrently in different rows of the processor array for multiple training pairs. Again, it is possible to execute both the forward and backward passes for two different training pairs in a single row of the processor array. Although it is not possible to parallelize the computations involved, as seen from algorithms 1 and 2, the data transfers in both the execution phases are similar and so can be combined as discussed in algorithm 
$A_i^{b}(l-1) = w_{ii}(l)\,\delta_i^{b}(l)$
and,
$A_i^{f}(l) = w_{ii}(l)\,a_i^{f}(l-1)$.
Here the superscripts $f$ and $b$ are used to distinguish the training pair in its forward pass from the training pair in its backward pass.
3. In $P_i(l)$, combine $a_i^{f}(l-1)$, $a_i^{b}(l-1)$ and $A_i^{b}(l-1)$ into a single packet and transfer the packet to $P_{i+1}(l)$.
4. In each of the next (P-1) steps do the following in processor $P_i(l)$:
- Receive a packet comprising the data items $a_j^{f}(l-1)$, $a_j^{b}(l-1)$ and $A_j^{b}(l-1)$, $1 \le j \le n$, from $P_{i-1}(l)$.
- Update own accumulator $A_i^{f}(l) = A_i^{f}(l) + w_{ij}(l)\,a_j^{f}(l-1)$.
- Update the accumulator $A_j^{b}(l-1) = A_j^{b}(l-1) + w_{ij}(l)\,\delta_i^{b}(l)$.
- Update the accumulated weight change.
- Recombine the three data items into one packet and send the packet to $P_{i+1}(l)$.
In the above discussion, it has been assumed that the number of processors in a row of the target architecture is the same as the number of neurons in a layer of the neural net, which is rarely the case. Also, the neural net is assumed to have a regular architecture, that is, the same number of neurons in each layer. The proposed mapping technique is equally valid when the number of neurons in a layer of the neural net is greater than the number of processors in a row and the neural net architecture is not regular. It is enough that the number of processors be a multiple of the number of layers in the neural net, so that the layers are mapped onto different processors. When the number of processors per row of the array is less than the number of neurons in a layer of the neural net, multiple neurons of one layer are processed in a single processor.
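The combined-packet circulation of the pipelined algorithm can be sketched as a sequential simulation of one row. This is an illustration under stated assumptions: one neuron per processor, 0-based indices, and a weight-change increment of the assumed form $\eta\,\delta_i\,a_j$. The function and argument names are hypothetical.

```python
def pipelined_row_step(w, a_fwd, a_bwd, delta_bwd, eta):
    """Simulate one row handling a forward and a backward pass together.

    w[i][j]      : weight held by processor i for input j of this layer
    a_fwd[j]     : lower-layer activation of the pair in its forward pass
    a_bwd[j]     : lower-layer activation of the pair in its backward pass
    delta_bwd[i] : error value of the backward-pass pair on processor i
    Returns (forward weighted sums, backward accumulators, weight changes).
    """
    P = len(w)
    fwd_sum = [w[i][i] * a_fwd[i] for i in range(P)]
    dw = [[0.0] * P for _ in range(P)]
    for i in range(P):
        dw[i][i] = eta * delta_bwd[i] * a_bwd[i]
    # One packet per processor: (j, forward activation, backward activation,
    # backward accumulator), all travelling together.
    packets = [(i, a_fwd[i], a_bwd[i], w[i][i] * delta_bwd[i]) for i in range(P)]
    for _ in range(P - 1):
        # A single shift moves all three data items at once.
        packets = [packets[(i - 1) % P] for i in range(P)]
        updated = []
        for i, (j, af, ab, acc) in enumerate(packets):
            fwd_sum[i] += w[i][j] * af          # forward accumulator of processor i
            acc += w[i][j] * delta_bwd[i]       # backward accumulator of neuron j
            dw[i][j] += eta * delta_bwd[i] * ab  # assumed weight-change rule
            updated.append((j, af, ab, acc))
        packets = updated
    back_acc = [0.0] * P
    for j, _af, _ab, acc in packets:
        back_acc[j] = acc
    return fwd_sum, back_acc, dw
```

Sending the three items as one packet means the row performs the same number of shifts as a single pass would need, with only the packet length growing.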
Now, processor
Analysis
We derive an expression for the time required to process a complete training set. Our assumptions in the following derivation are as follows. $t_o = 2(t_a + t_m)$. The total time taken by an L×P array to execute the backpropagation algorithm for the entire training set once is given by:
$T_{bp} = (T+1)(t_f + t_\delta + t_u) + 2(t_o + t_u) + (T+L+1)PC + (T-L+1)C$
Proof
The forward pass of the first training pair would take L processing steps, each of which needs $(t_f + CP)$ amount of time. Thus, the total time taken to process the forward pass for the first training pair is $L(t_f + CP)$. For the next (T-L+1) training pairs, each processor executes both the forward and backward passes for two different training pairs. Now, each processing step takes a time $T_p$ given by: $T_p = t_f + t_\delta + t_u + C(P+1)$. Thus, the total time for this is $(T-L+1)(t_f + t_\delta + t_u + C(P+1))$. Now, one training pair is completely processed at the end of every processing step. The last L training pairs cannot have their backward passes overlapped with any other operations. Hence, the backward passes for these patterns have to be executed separately, which takes a time of $L(t_\delta + t_u + CP)$.
Finally, it may be observed that execution of the backward pass for the output layer has to be performed separately for the first, as well as the last, training pair. For this, the additional time needed is $2(t_o + t_u)$. Summing these four contributions gives the total time stated above.
Once the training phase is over, the network is used as a pattern classifier, where it produces the correct output pattern for any pattern presented at the input of the network. Now, it has only to process the forward pass. The speedup achieved during this period is obtained by estimating the time for executing the forward pass for T input patterns. Theorem 2 may be used for this purpose. 
The corresponding uniprocessor time is: $T_{f1} = TPLt_f$ and the forward pass speedup is: $S_f = T_{f1}/T_{fp}$.
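The phase-by-phase accounting used in the proof of the training time can be checked numerically. The sketch below adds up the four contributions (pipeline fill, overlapped steps, pipeline drain, and the separate output-layer work) and compares the result with the closed form. The parameter names and the numeric values are hypothetical placeholders, not measurements from a transputer array.

```python
def training_time_by_phase(T, L, P, C, t_f, t_d, t_u, t_o):
    """Sum the four contributions listed in the proof.

    t_d stands for the error-computation time written t_delta in the text.
    """
    fill = L * (t_f + C * P)                                   # forward pass of the first pair
    overlapped = (T - L + 1) * (t_f + t_d + t_u + C * (P + 1))  # combined forward/backward steps
    drain = L * (t_d + t_u + C * P)                            # backward passes of the last L pairs
    output_layer = 2 * (t_o + t_u)                             # first and last pair, output layer
    return fill + overlapped + drain + output_layer


def training_time_closed_form(T, L, P, C, t_f, t_d, t_u, t_o):
    """Closed-form expression for the same total."""
    return ((T + 1) * (t_f + t_d + t_u) + 2 * (t_o + t_u)
            + (T + L + 1) * P * C + (T - L + 1) * C)


# Arbitrary integer time units, chosen only to exercise the formulas.
params = dict(T=100, L=4, P=8, C=2, t_f=5, t_d=7, t_u=3, t_o=11)
assert training_time_by_phase(**params) == training_time_closed_form(**params)
```

With these placeholder values both functions return 3417 time units, which confirms that the closed form is just the sum of the four phases.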
Summary and Conclusion
In this paper, we have proposed a mapping scheme for the parallel pipelined execution of the backpropagation learning algorithm on distributed memory multiprocessors. The layers of the neural net are assigned to the corresponding rows of the array and are executed simultaneously for different training pairs. Multiple data items heading for a common destination are assumed to be sent in the form of a single packet. This enables the communication in the forward and backward execution phases of two different training pairs to be combined, resulting in a reduced communication overhead. The weights are not modified after the processing of every individual training pair. Rather, changes due to multiple training pairs are accumulated and are used to modify the network weights once for each presentation of the entire training set. In each figure we have considered neural nets having 2, 4, and 8 layers (in these counts we do not include the input layer). It is clear that the mapping scheme works better for neural nets with a larger number of layers. The linear variation of speedup with increase in the number of processors proves the validity of our mapping technique. 
