In recent years, several neurocomputers have been developed, but general purpose computers remain an alternative to these special purpose machines. This paper describes a mapping of the backpropagation learning algorithm onto a large 2-D torus architecture. The parallel algorithm was implemented on a 512 processor AP1000 and evaluated using NETtalk and other applications. To obtain high speedup, we propose an approach that combines the multiple degrees of parallelism of the algorithm (training set parallelism, node parallelism and pipelining of the training patterns). Running the NETtalk network on 512 processors, we obtained a performance of 81 million weight updates per second. Our results show that to obtain the best performance on a large number of processors, a combination of multiple degrees of parallelism in the backpropagation algorithm ought to be considered.
INTRODUCTION
One of the most popular artificial neural networks (ANN) is the multilayer perceptron network (MLP), Rumelhart et al. [1]. Backpropagation (BP) is the most common algorithm for training an MLP. The algorithm is very computationally demanding, so parallel computation is mandatory to reduce the long training time. The topology of an MLP is not fixed. Every real ANN application requires a different number of neurons (nodes) and layers. It is difficult to know which parallel algorithm to implement, because the effectiveness of each approach varies according to the mapping of the neural network application onto the architecture of the target machine. Thus, mainly experimental results can give an idea of the best mapping strategy. The most important factors for increasing speedup are to minimize communication between processing elements (cells) and to avoid idle time in each cell. Since the number of weights is much larger than the number of nodes in an MLP, it is beneficial to store the weight matrices statically (i.e. to rotate the input elements between processors, instead of the weight matrices). In this paper a parallel BP algorithm is evaluated on the Fujitsu AP1000, Ishihata et al. [2], a message passing MIMD computer with a two-dimensional torus topology network. Figure 1 depicts the architecture. The AP1000 has distributed memory, and each cell consists of a Sparc CPU, an FPU, a message passing chip, 128 KB cache and 16 MB main memory. The system used in this research consists of 512 cells. Message routing between cells is done by worm-hole routing, so the path length has little effect on communication time.
In the following section we describe an algorithm that combines multiple degrees of BP parallelism. Then, we show how the performance varies for different MLPs and different numbers of cells. Finally, conclusions are given.
MAPPING OF BP NETWORK ONTO A LARGE 2D-TORUS MIMD COMPUTER
Many factors affect the design of a parallel BP algorithm. The most important issues are:
Weight updating strategy. Three different approaches:
Learning by pattern: update the weights after each training pattern has been presented. Learning , it was shown that to obtain the highest possible performance on a highly parallel computer, a combination of all degrees of BP parallelism ought to be considered. Below, the details of the proposed mapping are outlined.
Combined solution
To be able to exploit all degrees of parallelism, the network is partitioned as shown in Figure 2. Notice that all input training pattern elements are stored in each hidden layer processor to reduce communication. The output layer processors must rotate their input elements (hidden layer output elements) between themselves. Training set parallelism means that several copies of the network are made. The balance of computation load between the layers depends on the number of processors and the number of nodes in each layer. The output layer processors do most of the backward phase computation, but usually the hidden layer has a much larger number of nodes (i.e. a larger weight matrix). Thus, hidden layer processors require more time to compute outputs and accumulate weight change values than the output layer processors. This will be investigated in more detail in the results section.
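The rotation of input elements between output layer processors can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the AP1000 implementation: the names `split_blocks` and `ring_forward` are hypothetical, the cells are simulated by a loop rather than by message passing, and a contiguous block distribution of nodes and input elements is assumed. Each simulated cell keeps the weight rows of its own nodes fixed and only the input chunks move around the ring.

```python
def split_blocks(n, p):
    """Return p contiguous (start, stop) index ranges covering range(n)."""
    base, extra = divmod(n, p)
    bounds, start = [], 0
    for c in range(p):
        stop = start + base + (1 if c < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def ring_forward(weights, x, p):
    """Weighted input sum of every node, computed by p simulated ring cells.

    weights[j][i] is the weight from input element i to node j.
    Cell c owns one block of nodes (weights stay in place) and
    initially holds one chunk of the input vector x.
    """
    n_out, n_in = len(weights), len(x)
    node_blocks = split_blocks(n_out, p)
    in_blocks = split_blocks(n_in, p)
    held = list(range(p))  # held[c]: index of the input chunk at cell c
    sums = [0.0] * n_out
    for _ in range(p):
        for c in range(p):
            lo, hi = in_blocks[held[c]]
            for j in range(*node_blocks[c]):
                row = weights[j]
                # Partial dot product over the chunk currently held.
                sums[j] += sum(row[i] * x[i] for i in range(lo, hi))
        # Rotate: every cell passes its chunk to the next cell on the ring.
        held = held[-1:] + held[:-1]
    return sums


# Three nodes, five input elements, two simulated cells.
w = [[1, 2, 3, 4, 5], [0, 1, 0, 1, 0], [2, 2, 2, 2, 2]]
result = ring_forward(w, [1, 2, 3, 4, 5], p=2)
```

After p rotation steps every cell has seen every input chunk exactly once, so the result matches an ordinary matrix-vector product while only input elements, never weights, are moved.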
Optimizing the weight update
Since the convergence rate improves when the weights are updated frequently, it is of major importance to minimize the weight update time. To prevent weight updating from becoming a bottleneck in a large system, we use a log n step summing technique, shown in Figure 5a. The summing starts in the leftmost column. Hidden layer cells send their matrices to cells south of themselves, while output layer cells send to cells further north. We call the solution edge summing, since we sum the matrices in the northmost cell and the southmost cell for the output and hidden layer, respectively. In the given system, only 3 steps are required for summing the weight matrices. The result is then sent back to each hidden and output layer cell (Figure 5b). This is done in a similar way to the summing, but the data is sent in the opposite direction. The broadcasting part can be omitted if we make the communication in part a) bi-directional and duplicate the summing. However, this introduces more overhead, communication conflicts and redundant computation.
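The log n step summing can be sketched as a pairwise tree reduction. The code below is an illustrative simulation under assumptions, not the edge summing implementation itself: `tree_sum` is a hypothetical name, the matrices of all network copies sit in one Python list instead of in separate cells, and the north/south send directions on the torus are not modeled. It only shows why 8 copies need 3 summing steps.

```python
def tree_sum(copies):
    """Sum equally shaped matrices in ceil(log2(n)) pairwise steps.

    copies: list of matrices (lists of rows), one per network copy.
    Returns the summed matrix and the number of steps used.
    """
    mats = [[row[:] for row in m] for m in copies]  # leave inputs untouched
    n = len(mats)
    steps = 0
    stride = 1
    while stride < n:
        # In one step, all pairs (dst, dst + stride) are added concurrently
        # on a real machine; here they are simply processed in a loop.
        for dst in range(0, n - stride, 2 * stride):
            src = dst + stride
            for r, row in enumerate(mats[src]):
                for k, value in enumerate(row):
                    mats[dst][r][k] += value
        stride *= 2
        steps += 1
    return mats[0], steps


# Eight copies of a 1 x 2 weight change matrix.
changes = [[[i, 2 * i]] for i in range(8)]
total, steps = tree_sum(changes)
```

Sending the result back can follow the same pattern in reverse, with the stride halving instead of doubling, which again takes a logarithmic number of steps.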
RESULTS AND DISCUSSION
The performance on AP1000 is measured using the NETtalk application proposed by Sejnowski Figure 6 . Both algorithms have almost linear speedup on a system of many cells. When more than 64 processors are used, 3APC becomes faster than 2APC, reaching 81 MCUPS on 512 processors. This can partly be explained by the small computation grains for the 26 output neurons. Instead of using all the processors for this task, 3APC uses half of the processors for the computationally intensive hidden layer. Moreover, twice as many training patterns are computed on each processor between weight updates as for 2APC. The reason is that 3APC uses two rows for each network copy, while 2APC needs only one. In comparison, the NETtalk application on a Sparc 10 workstation reached a performance of 0.6 MCUPS, using learning by pattern.
On 512 processors, Figure 7 shows how the performance decreases when the weights are updated more frequently than once per 1024 patterns. 2APC updates the weights in less time than 3APC. This is due to the initial pipeline delay and unequal load balance when the hidden and output layer weights are updated concurrently for 3APC. Figure 8 shows the NETtalk performance for 60 and 120 hidden neurons. 3APC performs better than 2APC for 120 hidden neurons. When the number of hidden neurons is increased, the amount of forward computation increases for both the hidden and the output layer. The increase is of equal proportion (e.g. doubled for 120 neurons, compared to 60) for each of the layers. However, the backward phase requires more computation by the output layer processors. This is because the main part of the backward pass is done by the output processors. A large number of hidden neurons seems to be beneficial for the hidden-output load balance in our application. In general, a decrease in the number of input nodes or an increase in the number of hidden or output nodes moves computation load from the hidden to the output processors. Performance for a large neural network (e.g. an image recognition application) is given in Figure 9. The network consists of 1024 input, 512 hidden and 64 output neurons. 4096 training patterns were used and the weights were updated every 512 patterns. 2APC outpaces 3APC, which is understandable from the uneven number of nodes in each layer. For such a large network it would be interesting to see if it is possible to get high performance by using only node parallelism (i.e. both nodes within a layer and computation within each node run in parallel, Yukawa and Ishikawa [6]). For 64 processors the performance is equal to 2APC. For 256 processors the performance is much lower than for 2APC. This indicates that the computation grains become small, even for a relatively large network.
However, the algorithm may need fewer training iterations, since learning by pattern is used. It was impossible to run the node parallel implementation on the 512 cell configuration.
Comparing load balance between the hidden- and output-layer processors
An even load balance is very important for the 3APC algorithm to obtain good performance. In the following, we show the results of two performance analyses, given by an AP1000 performance analyser tool. Light gray areas mean that the processor is idle (i.e. waiting for data). The white areas at the beginning and end (for cell 0 and 8) indicate waiting to start and finished, respectively. The hidden processors are computing almost 100 % of the time. We see that after two patterns, the output layer has to wait for the hidden layer output to arrive (large light gray area). The picture corresponds to the explanation given earlier in Figure 3. The thin white columns in the upper part of the performance plot arise from the communication between the output layer processors. Figure 11 shows how the idle time of the output layer processors is reduced when the number of processors in the system increases. The output cells become more active, since they do more communication (error computation) than the hidden layer cells. Exchanging values between 8 processors takes more time than between 4. The hidden layer processors become idle while waiting for the error of the first pattern. Usually, there will be quite a large number of training patterns between weight updates, so this idle time will be a minor problem. In general, to obtain an equal load balance, an application on a large parallel system needs fewer hidden and output neurons (relative to the number of input neurons) than on a smaller system. Since both the number of processors and the number of neurons affect the load balance, it is difficult to tell when to use 3APC instead of 2APC. However, this can easily be decided by comparing the time of one training iteration.
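Choosing between the two mappings by the time of one training iteration can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the iteration times are invented placeholder values rather than measurements, biases are ignored when counting weights, and MCUPS is assumed to be counted as the number of weights times the number of patterns per second, in millions. The layer sizes and pattern count are those of the large network described above.

```python
def count_weights(layer_sizes):
    """Number of connections between consecutive layers (biases ignored)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def mcups(layer_sizes, patterns, seconds):
    """Million connection updates per second for one training iteration."""
    return count_weights(layer_sizes) * patterns / seconds / 1e6


def faster_mapping(iteration_seconds):
    """Name of the mapping with the shortest measured iteration time."""
    return min(iteration_seconds, key=iteration_seconds.get)


layers = [1024, 512, 64]                 # input, hidden, output neurons
timings = {"2APC": 40.0, "3APC": 50.0}   # hypothetical seconds per iteration
best = faster_mapping(timings)
rate = mcups(layers, 4096, timings[best])
```

In practice the two timings would come from running one training iteration of each mapping on the target configuration, since the outcome depends on both the network size and the number of processors.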
CONCLUSIONS
A description of how to exploit multiple degrees of BP parallelism on the AP1000 computer has been given. We have shown how the performance varies for different MLP network sizes and number of processors in the system. Earlier research on parallelizing BP exploits only one or two types of parallelism, while we have combined three types: training set parallelism, node parallelism and pipelining of the training patterns. Our results show that this is necessary to obtain the highest possible performance on a large number of processors. By combining the aspects of BP parallelism, computation granularity can be kept coarse even on a large number of processors. For the NETtalk application, we show that the load balance between hidden and output layer computation becomes more equal as the number of processors is increased. The results should be of interest both to other MIMD computers with 2D-torus network and other large general purpose computers.
