Abstract: The supervised training of feedforward neural networks is often based on the error backpropagation algorithm. Our main purpose is to consider the successive layers of a feedforward neural network as the stages of a pipeline, which is used to improve the efficiency of the parallel algorithm. A simple placement rule is presented in order to take advantage of the simultaneous execution of the calculations on each layer of the network. The analytic expressions show that the parallelization is efficient. Moreover, they indicate that the performance of this implementation is almost independent of the neural network architecture.
INTRODUCTION
The applications of artificial neural networks will be fully developed when the massive parallelism of the architecture is exploited in dedicated electronic microcircuits or in simulators which are based on general purpose parallel computers. The learning process of the backpropagation algorithm requires much time and needs high-performance machines for industrial applications. The neural network architecture on which this algorithm operates can possess numerous neurons and synaptic connections. It uses a training set which contains numerous examples. High speed computation can be achieved by partitioning the set of data and by making simultaneous runs on each subset. Thus, a learning algorithm can be parallelized in two ways: by partitioning the neural network or by partitioning the training set.
In the first case, the network may be partitioned by distributing the synaptic coefficients and the neurons throughout a processor network. Again, there are two variations of this technique:
The neural network may be embedded in a processor network, or the matrix products used in the learning algorithm may be parallelized.
In the first variation, the neurons are distributed among the available processors. When information is being transferred from one neuron to another and these two neurons are placed on different processors, data communication will occur. Since data are distributed in the network, the memory requirements are moderate. However, it is difficult to place the neurons in such a way as to produce efficient implementations, which require both an evenly distributed computational load and few data communications [1] [2].
The second variation is based on the fact that the computations in a neural network are basically matrix products. Matrix products are also the only operations inside the network which absolutely require data communications between the processors. The advantage of this approach is the fact that the amount of data communicated between processors is moderate and evenly distributed. Although the classical parallel matrix product algorithms are efficient, they are difficult to use effectively because the synaptic matrix of a feedforward neural network is lower triangular. The parallelization may work efficiently if the synaptic matrix is distributed appropriately [3] [4] [5] throughout the processor network. This approach has been considered for implementation onto SIMD machines [6] [12] , and for VLSI implementation [7] [8] [9] of neural networks because it makes the algorithms run by each elementary processor identical and simple and makes the communication between processors very regular.
The partitioning of the training set is yet a third parallelizing possibility [10] [11] [12]. The neural network is duplicated on every processor of the parallel machine, and each processor works with a subset of the training set. However, the behavior of the parallelized version of the learning algorithm is often different from its behavior on a sequential machine. This alteration sometimes renders the algorithm inappropriate. On the other hand, the distribution of the computation load is often excellent. Data can easily be organized in large-sized packets that allow full exploitation of the bandwidth of communication channels. However, the memory requirement per processor is proportional to the number of synaptic coefficients, which may be very large. Moreover, the data communication delays may be significant when the number of processors is large.
In a previous paper [3] , we parallelized the backpropagation algorithm by parallelizing matrix products on some "nearest neighbor" architectures. The analytic expressions of performance we set forth were confirmed by experimental measurements. The results showed that the efficiency of the parallel algorithm was strongly dependent upon the neural network architecture. They also showed that the implementation on a square two-dimensional torus is far more efficient than an implementation on a ring. Later, after applying the same strategy of parallelization on a torus while exploiting pipelined computation, we presented an efficient parallel implementation and devised experiments for the backpropagation learning algorithm.
This algorithm is suitable for a wide range of feedforward neural network architectures [4] .
In Section 2, matrix forms of three variations of the backpropagation are presented: the true gradient algorithm, the LMS stochastic gradient algorithm and the BLMS algorithm. The parallelization of these algorithms is described in Section 3. The successive layers of a feedforward neural network are considered as stages of a pipeline and the computation concerning each layer is parallelized. Expressions of time for parallel computation are also established. In Section 4, the theoretical analysis and the experimental results are compared for two very different types of neural architecture: the 3-layer neural network and the fully connected feedforward neural network. In addition, we show that theoretical expressions are helpful in predicting the performance of complex neural architectures. This is illustrated by an example of a network used for handwritten digit recognition in an industrial context, such as the one described by Y. Le Cun et al. [13].
THE ARTIFICIAL NEURAL NETWORKS

Modeling, Notations
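The evolution of the state σ i (t) of neuron i in a network of N neurons is described by the following expressions:
$$V_i(t) = \sum_{j=0}^{N-1} W_{ij}\,\sigma_j(t) + I_i(t), \qquad \sigma_i(t+\Delta t) = f\big(V_i(t)\big)$$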
The coefficients W ij associated with the N inputs of a neuron i are the "synaptic coefficients" or "synaptic weights." The neuron i is the postsynaptic neuron of the synapse associated with W ij, and neuron j is its presynaptic neuron. The state of neuron i at time t + ∆t is affected by the states σ j produced by the neurons j at time t and by an external input I i (t). If this external input is not used, it can be omitted. In this case, I i (t) = 0 regardless of the value of t. Rather than considering neurons that possess several external inputs, we have chosen neurons with only one non-weighted external input. In this case, the synaptic matrix W, which is composed of the synaptic coefficients W ij, is square. This property will be useful in the analytic study below.
The potential V i (t) is found by adding the external input contribution I i (t) to the sum, weighted by W ij, of the states σ j (t) presented to the inputs at time t. The new state σ i (t+∆t) is calculated from the potential through an activation function f. In this study, a sigmoidal function is used to ensure that the activation function is differentiable.
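The update described above can be sketched in a few lines of NumPy. This is only an illustration: the logistic function is assumed as the sigmoidal activation f, and the names `sigmoid`, `step`, `W`, `sigma` and `ext` are hypothetical.

```python
import numpy as np

def sigmoid(v):
    # Assumed activation: the logistic function (the text only says "sigmoidal").
    return 1.0 / (1.0 + np.exp(-v))

def step(W, sigma, ext):
    """One update of all neurons: V = W sigma + I, then sigma(t + dt) = f(V)."""
    V = W @ sigma + ext          # potential of every neuron
    return sigmoid(V)            # new state of every neuron

# Two-neuron toy network with hypothetical values.
W = np.array([[0.0, 0.5],
              [-1.0, 0.0]])
sigma = np.array([0.2, 0.8])     # states at time t
ext = np.array([0.0, 1.0])       # external inputs I(t)
new_sigma = step(W, sigma, ext)  # states at time t + dt
```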
Feedforward Neural Networks
Feedforward networks constitute an important variety of neural networks. For such networks, it is possible to index the neurons in such a way that the output of a neuron j is connected to an input of a neuron i only when j < i. The synaptic matrix of this kind of network is lower triangular. The active external input constitutes the input of the network, while the output of the network consists of a set of neurons whose state is communicated to its environment. In a feedforward neural network, the states σ i depend only on the synaptic weights and the example presented to the input of the neural network because there is no feedback. Thus, the expressions of V i and σ i are independent of time if the following condition is true: the state of a neuron i can be evaluated only if the state of each neuron j is determined, where j < i. So, the expressions of V i and σ i become
$$V_i(k) = \sum_{j<i} W_{ij}\,\sigma_j(k) + I_i(k) \qquad (1)$$
$$\sigma_i(k) = f\big(V_i(k)\big) \qquad (2)$$
where k is the index of the example presented to the input of the neural network.
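As an illustration of the evaluation order implied by the lower triangular synaptic matrix, the following sketch computes the states in increasing index order. The function names and the logistic activation are assumptions made for the example.

```python
import numpy as np

def sigmoid(v):
    # Assumed sigmoidal activation (logistic function).
    return 1.0 / (1.0 + np.exp(-v))

def propagate(W, ext):
    """Evaluate a feedforward network whose matrix W is strictly lower triangular.

    Neuron i only depends on neurons j < i, so one pass in index order suffices.
    """
    N = len(ext)
    sigma = np.zeros(N)
    for i in range(N):
        V_i = W[i, :i] @ sigma[:i] + ext[i]   # only presynaptic neurons j < i
        sigma[i] = sigmoid(V_i)
    return sigma
```

For a network of N neurons, the result is the same as iterating the time-dependent update N times, which is one way to check the sketch.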
In fact, the order in which the state of each neuron is evaluated is often a partial ordering, and it is useful to define the concept of layer in order to further explain this problem. Generally speaking, there are several ways of dividing a network into layers. In this paper, we derive the definition of the layer λ i of neuron i from the layers λ j of its presynaptic neurons j by the following rule:
The neurons that do not have synaptic inputs make up layer 0 (figure 1). Let L = max i (λ i ), where 0 ≤ i < N. Then, the number of layers of the network is L+1. Every neuron on layer L is an output neuron, but not all the output neurons are on layer L. Figure 1 illustrates the layer determination rule applied to a network of 13 neurons. On the right is the "adjacency matrix" of the network: the black squares represent non-zero coefficients, and white squares represent "zero" coefficients. Neuron 2 is on layer 0 because it does not have presynaptic connections; in another model, this neuron could be on layer 1. Every neuron of the network could be an input or an output unit. However, neurons 0 through 4 are input units of the network because they are on layer 0, while neurons 7, 10, 11, and 12 are necessarily output units because they have no postsynaptic connection. If another layer determination rule had been used, neuron 7 could be on layer 3, for instance.
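The layer rule itself is not reproduced in this text. The sketch below assumes the rule "λ i is one more than the largest layer among the presynaptic neurons of i, and 0 when there are none", which agrees with the statements that neurons without synaptic inputs form layer 0 and that a fully connected feedforward network of N neurons has N layers. The function name `layers` and the toy adjacency matrix are hypothetical.

```python
def layers(adj):
    """Return the layer index of every neuron.

    adj[i][j] is truthy when neuron j is presynaptic to neuron i (j < i).
    Assumed rule: lambda_i = 1 + max(lambda_j over presynaptic j), else 0.
    """
    lam = []
    for i, row in enumerate(adj):
        pre = [lam[j] for j in range(i) if row[j]]
        lam.append(1 + max(pre) if pre else 0)
    return lam

# Toy network of 5 neurons (hypothetical, not the 13-neuron network of figure 1).
adj = [
    [0, 0, 0, 0, 0],   # neuron 0: no synaptic input
    [0, 0, 0, 0, 0],   # neuron 1: no synaptic input
    [1, 0, 0, 0, 0],   # neuron 2: fed by neuron 0
    [0, 1, 1, 0, 0],   # neuron 3: fed by neurons 1 and 2
    [0, 0, 0, 0, 0],   # neuron 4: no synaptic input
]
lam = layers(adj)
number_of_layers = max(lam) + 1   # L + 1
```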
Training by the Backpropagation Rule
The error backpropagation algorithm [14] is still the most popular learning algorithm for multilayer networks. It uses a training set which contains X examples {i(k); k = 0,...,X-1}.
These examples are vectors presented to the input of the network, e.g., bitmap representations of handwritten digits. The training set also contains the corresponding desired output vectors {d(k); k = 0,...,X-1}, e.g., the binary code of a digit. Learning consists of computing the synaptic weights so that, if a picture i(k) is presented to the input units, the corresponding code d(k) appears on the output neurons. Thus, the network learns to "associate" the picture of the digit with the corresponding code. Once training has been completed, the network is expected to produce the correct output code each time a digit is presented to the input, even if it does not belong to the training set.
The training process is based on a cost function which is defined in the space of synaptic coefficients. This function expresses the discrepancy between the actual output computed by the network and the desired output vector. The cost function is minimized by a gradient method.
True Gradient Algorithm
The cost function is defined for the whole training set as follows:
where
The cost function is minimized by a gradient method. Each time the whole training set is presented to the network, the synaptic weights are altered according to this equation:
where the gradient step η is a positive real number. An error δ i (k) can be defined for each neuron and each example of the network such that:
Relations (1) and (2) give values σ j (k) as previously described in section 2.2. The computation of state σ j (k) is propagated from the first layer 0 to the last layer L. Then, δ i of neuron i is computed from the values δ n of the postsynaptic neurons of neuron i and from the value of the desired state d i (if i is an output neuron). Thus, the error is propagated from layer L to layer 0:
where f' is the derivative of the activation function. Therefore, f must be differentiable.
Usually, a sigmoidal activation function is used for classification problems.
During a training cycle, the whole training set is presented once in order to estimate the gradient of the cost function. When the cycle is complete, the weights are altered according to expression (5), and the training cycles are repeated until the value of the cost function is small enough.
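The following sketch illustrates one training cycle of the true gradient algorithm. Since expressions (3) to (7) are not reproduced in this text, it assumes the usual quadratic cost J = 1/2 Σ (d i − σ i)² over the output neurons, the logistic activation, and the standard backpropagation recursion; all function names are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def propagate(W, ext):
    # States are computed from layer 0 upward (index order, W lower triangular).
    sigma = np.zeros(len(ext))
    for i in range(len(ext)):
        sigma[i] = sigmoid(W[i, :i] @ sigma[:i] + ext[i])
    return sigma

def backpropagate(W, sigma, desired, is_output):
    # Errors are computed from the last neuron downward.
    N = len(sigma)
    delta = np.zeros(N)
    for i in reversed(range(N)):
        err = W[i + 1:, i] @ delta[i + 1:]          # from postsynaptic neurons
        if is_output[i]:
            err += desired[i] - sigma[i]            # from the desired state
        delta[i] = sigma[i] * (1.0 - sigma[i]) * err  # f'(V) for the logistic f
    return delta

def cost(W, ext, desired, is_output):
    sigma = propagate(W, ext)
    return 0.5 * np.sum((desired - sigma)[is_output] ** 2)

def true_gradient_step(W, mask, examples, is_output, eta):
    """One training cycle: accumulate over all examples, then update once."""
    dW = np.zeros_like(W)
    for ext, desired in examples:
        sigma = propagate(W, ext)
        delta = backpropagate(W, sigma, desired, is_output)
        dW += np.outer(delta, sigma)
    return W + eta * mask * dW     # mask keeps non-existing connections at zero
```

Under these assumptions, `-np.outer(delta, sigma)` is the gradient of the cost with respect to W for one example, which can be checked against finite differences.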
Relations (1), (2), (6), and (7) can be written in matrix form to show the parallelism of the computation inside any given layer. The following notations will be used. Let $W^{+}_{\lambda}$ be the rectangular block containing the synaptic coefficients W ij associated with the synapses whose postsynaptic neurons i belong to layer λ. Similarly, the block $W^{-}_{\lambda}$ contains all the coefficients W ij whose presynaptic neuron j belongs to layer λ. For each example, there is a propagation phase and a backpropagation phase. In the propagation phase, the state of the neurons is evaluated for every layer λ for the example k:
In the backpropagation phase, the error of every neuron is evaluated for every layer λ for example k:
The alteration matrix ∆W is computed after each training cycle :
LMS (Least Mean Square) Stochastic Gradient Algorithm
B. Widrow [15] has used a gradient algorithm which consists in computing an estimate ∇J(k) of the gradient of the quadratic cost J for each example k. This estimate is defined as follows:
In this case, the weights are updated after the presentation of each example:
Thus, the weights are altered X times during a training cycle, whereas the true gradient algorithm involves only one update per cycle. This algorithm is referred to as the "LMS stochastic gradient algorithm".
BLMS (Block Least Mean Square) Learning Algorithm
Learning by the LMS stochastic gradient algorithm needs more updatings of the synaptic matrix than by the true gradient algorithm because the estimate of the gradient of the cost function is not precise. A better estimate of the gradient of the cost function for all examples is achieved by considering a block of b examples instead of only one example [16] . For a block ξ, an estimate ∇J(ξ) of the gradient of J is defined as follows:
The training set is divided into blocks of b examples. After the presentation of each block ξ, the synaptic matrix is updated according to the following expression:
where 0 ≤ ξ < ⌈X/b⌉, and ⌈X/b⌉ is the ceiling of X/b. Now, the synaptic matrix is updated ⌈X/b⌉ times during a training cycle.
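The division of the training set into blocks can be sketched as follows. The function name `blocks` is hypothetical, and the last block is simply allowed to be shorter when b does not divide X, which is an assumption of this sketch.

```python
import math

def blocks(X, b):
    """Split example indices 0..X-1 into ceil(X / b) consecutive blocks."""
    n_blocks = math.ceil(X / b)
    return [range(xi * b, min((xi + 1) * b, X)) for xi in range(n_blocks)]

example_blocks = blocks(10, 4)   # 3 blocks: 0..3, 4..7, 8..9
```

With b = 1 the synaptic matrix is updated once per example, and with b = X it is updated once per training cycle.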
PARALLELIZATION OF THE LEARNING ALGORITHM
Introduction
The study of the parallelization of the backpropagation will be performed for the BLMS variation. The case of the LMS stochastic gradient is obtained for b = 1, whereas b must be equal to X for the true gradient algorithm. The vector expressions established above show an obvious kind of exploitable parallelism. The states for every neuron of each layer can be computed simultaneously during the propagation phase, and the same applies to the computation of the error δ i during the backpropagation phase. However, the performance of the parallel computation is strongly dependent upon the architecture of the neural network. For a fully connected feedforward network, the number of simultaneous operations that can be performed is on the order of N, whereas the number of simultaneous operations that can be performed for a single-layer network is on the order of N^2.
This drawback could be overcome if, at a given step of the state propagation and of the error backpropagation, each layer dealt with different examples from a given block of the training set. The idea is to consider each layer of the network as a stage of a pipeline. In this case, even when the network is fully connected, N^2 simultaneous operations can be performed as soon as the pipeline is initialized. In fact, we shall see below that the efficiency of the parallel computation does not depend strongly upon the architecture of the network. Moreover, the pipeline initialization has little effect, even in the event that the number of examples in a block is small.
Pipeline and Error Backpropagation Algorithms
The computation of expression (8) can be performed simultaneously on every layer only if there is no interdependency among the variables which are used in the parallel evaluation. A training cycle is divided into computation steps referred to as s. At each step, for a given layer λ, the state σ λ is evaluated from the states on all the lower layers. Then, the error δ λ is computed from the errors on all the upper layers. A relation exists between λ, the step s, and the index of the example related to layer λ. During step 0 of the propagation phase, the part of example 0 which is presented on layer 0 allows the computation of σ 1 (0). Consequently, σ 1 (s) is computed on layer 1 at step s, while layer λ, if λ ≤ L, yields the calculation of σ λ (s-λ+1).
During the backpropagation phase, the state corresponding to example 0 is presented to layer L at step L-1. At the same step, the error δ L (0) on this layer is determined and the error δ L-1 (0) is computed on layer L-1. Then, error δ L-1 (s-L+1) can be determined on layer L-1 at step s, while δ λ (s+2+λ-2L) is computed on layer λ. In the propagation phase, layer 0 constitutes a unique pipeline stage with layer 1 because no matrix products are required for computing the state σ 0 .
The same remarks apply in the backpropagation phase for layer L and layer L-1.
[Figure: Step s of the pipelined backpropagation algorithm in the particular case of a four-layer network.]
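The index relations stated above can be tabulated with a short sketch. It assumes that layer L computes its error at the same step as layer L-1 (they share a pipeline stage) and that an index outside 0..X-1 means the stage is idle; the function name `pipeline_step` is hypothetical.

```python
def pipeline_step(s, L, X):
    """Example index handled by each layer at step s, or None if idle.

    forward[lam]  : example whose state sigma_lam is computed (lam = 1..L)
    backward[lam] : example whose error delta_lam is computed (lam = 0..L)
    """
    def valid(k):
        return k if 0 <= k < X else None

    forward = {lam: valid(s - lam + 1) for lam in range(1, L + 1)}
    backward = {lam: valid(s + 2 + lam - 2 * L) for lam in range(L)}
    backward[L] = valid(s - L + 1)   # assumed: same stage as layer L-1
    return forward, backward

# Four-layer network (L = 3) at step s = L - 1 = 2.
forward, backward = pipeline_step(2, 3, 10)
```

In this example the states of examples 2, 1 and 0 are computed on layers 1, 2 and 3, while the error of example 0 is computed on layers 3 and 2; layers 1 and 0 have no error to compute yet.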
In the case of the true gradient algorithm, the matrix ∆W can be updated for a given example k when the backpropagation phase for this example is completed on layer 1. In fact, the optimization of the computation of ∆W requires that the computation of the error on layer 0 be started before updating ∆W. (The reasons will be stated below.) Consequently, ∆W can be updated for the example k = s-2L+2 at step s of the computation. The number of required steps N s must take into account the pipeline initialization so that the training cycle is complete. At every step, a new example is presented to the network; therefore, N s = X + 2L -2. The synaptic matrix is updated at the end of the cycle. The true gradient algorithm is not altered by its parallel implementation.
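As a quick consistency check of the two relations in this paragraph, the sketch below computes the number of steps and the example used to update ∆W at a given step. The function names are hypothetical.

```python
def steps_per_cycle(X, L):
    """Number of pipeline steps N_s needed to present X examples (L + 1 layers)."""
    return X + 2 * L - 2

def update_example(s, L):
    """Index k of the example used to update Delta W at step s."""
    return s - 2 * L + 2

X, L = 128, 3
n_steps = steps_per_cycle(X, L)                # 132 steps
first_update_step = 2 * L - 2                  # step at which example 0 is used
last_example = update_example(n_steps - 1, L)  # example used at the last step
```

The last step of the cycle updates ∆W with the last example, X - 1, which is why 2L - 2 extra steps are needed.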
In the case of the LMS stochastic gradient algorithm, the synaptic matrix is updated for example k = s-2L+2 for every step s. Evidently, the pipeline carries an important modification of the LMS stochastic gradient algorithm. This is clear because the alteration of W at a given step interferes with computations of the states σ i and errors δ i for examples taken at different steps in the network. Experiments have shown that the pipelined LMS stochastic gradient algorithm converges. However, systematic experiments and theoretical investigations of the convergence properties of this algorithm must be performed.
The behavior of the BLMS algorithm will be exactly the same for sequential implementations as for parallel pipelined implementations if the pipeline is reinitialized for each block of examples. The efficiency of the pipelined computation is not significantly diminished if the size of blocks is large enough to allow the computation to be well balanced within the processor network.
Memory Requirement
Some synaptic weights may be fixed during training. For the sake of simplicity, we assume that these fixed weights are always of value zero; they correspond to non-existing connections. N W is defined as the number of alterable weights of the synaptic matrix, where N W is less than N^2/2. We assume that the amount of memory required to store the synaptic matrix is close to the value N W. In addition, training uses X examples which consist of pairs of vectors (i(k), d(k)).
In the case of the true gradient, N W synaptic coefficients must be stored as well as the vectors σ(k) and δ(k) up to step k-2+2L. Therefore, (2L-2)N floating point numbers must be stored for both the states σ(k) and for the errors δ(k). These vectors allow the matrix ∆W, which is described by N W real numbers, to be updated. Therefore, the minimal requirement of memory µ T is given by the expression:
The operation of the pipeline requires much memory if there is a large number of layers. This is the cost for achieving a high level of efficiency. A variant of the algorithm could eliminate the need to store the 2L-2 vectors σ(k) and δ(k) in most processors, but that would involve more communication and would decrease the efficiency. For the BLMS algorithm, the minimal requirement of memory µ B is equal to µ T .
In the case of the LMS stochastic gradient algorithm, ∆W does not need to be stored:
Processor Network
In this section, we will present the criteria for choosing the architecture of the processor network. A placement strategy for the synaptic coefficients and vectors σ(k) and δ(k) will be proposed in order to balance the computation among all processors. Finally, theoretical expressions of performance will be derived. It will be shown that the efficiency of the parallel computation does not depend strongly upon the architecture of the network.
Architecture of the Processor Network and Placement of Tasks
We present here the implementation of learning algorithms in a loosely coupled parallel machine. According to the vector expressions (9), (10), (11), (13) , and (15) 
where Q is the number of processors per line or column and N is the number of neurons of the network; "\" is the "remainder" operator, and ⌊j/Q⌋ is the floor of j/Q. In our placement algorithm, each processor receives a two-dimensional regular sampling of matrices W and ∆W; the sampling step is equal to Q.
Let C be the adjacency matrix of the neural network. The sampled sub-matrices C ij (where 0 ≤ i, j < Q) will be nearly similar for all the processors if the sampling step Q is not too large.
Under this condition, the number of synaptic coefficients per processor is almost the same for all processors. Below we demonstrate that this guarantees a well balanced computation load. Figure 5 shows an example of permutation applied to an 8 × 8 synaptic matrix for a network of 2 × 2 processing elements. The components of the state vector and error vector must also be permuted according to the same model. 
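The regular sampling can be illustrated with NumPy slicing. The placement formula itself is not reproduced in this text, so the sketch assumes that coefficient W ij goes to processor (i mod Q, j mod Q) at local position (⌊i/Q⌋, ⌊j/Q⌋), which is what a sampling step of Q suggests. The names `place` and `owner` are hypothetical.

```python
import numpy as np

def place(W, Q):
    """Return {(p, q): local block} where PE_pq holds the sampling W[p::Q, q::Q]."""
    return {(p, q): W[p::Q, q::Q] for p in range(Q) for q in range(Q)}

def owner(i, j, Q):
    """Processor coordinates and local coordinates of coefficient W_ij."""
    return (i % Q, j % Q), (i // Q, j // Q)

# 8 x 8 strictly lower triangular adjacency matrix on a 2 x 2 processor network.
C = np.tril(np.ones((8, 8), dtype=int), k=-1)
local = place(C, 2)
counts = {pe: int(block.sum()) for pe, block in local.items()}
```

For this small fully connected example, the counts per processor are 6, 6, 10 and 6; the relative imbalance shrinks as N grows with respect to Q.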
Data Communications in a Torus of Processors
The product algorithms on a torus of processors, which we have used, involve broadcasts or accumulations of vectors along the lines or columns of the torus. Therefore, it is important to determine the duration of these communications. We denote by t c the time required to transmit a piece of data between two neighbors. In the case of transputers, a 32-bit real number is transmitted in at least 2.5 µs for communication links running at 20 Mbits per second. We denote by t s (startup) the delay between the beginning of the execution of a procedure transmitting a packet of data and the actual transmission of the first piece of data. The value of t s depends strongly on the complexity of the software controlling the data transmissions. No value can be given a priori.
Let an emitter processor and a receptor processor be separated by D-2 transmitter processors. A common practice for transmitting a vector of n components as quickly as possible from the emitter to the receptor is to divide the vector into packets. However, the startup (t s ) for the machine on which we measured the performances increased dramatically when packets were prepared before transmission. The data transmission by packets would only be advantageous for vectors of much larger dimensions than those we used in our experiments.
Thus, we have chosen to transmit data piece by piece. We estimate the communication time of a n-component vector through D processors, t com , as follows:
The duration of the broadcast of n items from a processor to all processors of a line or column of a Q × Q processor torus is equal to the duration of the communication of n data through Φ processors, where Φ is the maximum distance between two processors : t broad (n,Q) = t com (n,Φ).
• If broadcasts are mono-directional, Φ = Q.
• In the case of bi-directional broadcasts, Φ = Q/2.
• If the torus is embedded in a hypercube whose dimension is 2 log_2(Q), Φ = log_2(Q).
The duration of data accumulation is almost the same as the duration of a data broadcast. We have chosen to use mono-directional broadcasts or accumulations in order to keep the value of the startup (t s ) small. Bi-directional broadcasts were not efficient due to the small dimension of the processor network (16 processors) we employed.
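The three cases for the maximum distance Φ can be written as a small helper. The function name and the mode labels are hypothetical; only the three values of Φ come from the text.

```python
import math

def max_distance(Q, mode):
    """Maximum distance Phi for a broadcast along a line or column of a Q x Q torus."""
    if mode == "mono":        # mono-directional broadcast
        return Q
    if mode == "bi":          # bi-directional broadcast
        return Q / 2
    if mode == "hypercube":   # torus embedded in a hypercube of dimension 2 log2(Q)
        return math.log2(Q)
    raise ValueError(f"unknown broadcast mode: {mode}")

phi = {mode: max_distance(4, mode) for mode in ("mono", "bi", "hypercube")}
```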
Matrix/vector Products Algorithms on a Torus
Let t + and t × be the times for the floating point addition and the floating point multiplication, respectively. The numbers of additions and multiplications in matrix products are almost the same. For the sake of simplicity, we define t f, the mean time required for an elementary floating point operation (addition or multiplication): $t_f = (t_+ + t_\times)/2$. We also use a notation for the number of neurons belonging to the layers λ1 to λ2.
To take into account the quality of the distribution of the computation load on each processor of the network, we introduce the following notation:
where the maximum is taken on the whole network of processors.
Placement of the vectors
The vectors σ(k) and δ(k) are placed on the "diagonal" constituted by processors PE ii of the torus so that the communications during the execution of the products W σ(k) and W T δ(k) are as simple as possible; below we will see that there is no explicit transposition. We will refer to this diagonal as the "main diagonal." The vectors i(k) and d(k) for a given example k are placed on the main diagonal as well. 
Computation of the
Computation of the Product B := W^T δ:
The sub-vector δ λ+1 (k) placed on the main diagonal of the torus is broadcast along the lines. There is no need to broadcast δ λ+2..L (k), which was memorized during previous steps. The results of the local matrix products of δ λ+1..L (k) and the transpose of $W^{-}_{\lambda}$ are accumulated and transmitted along the columns leading to the main diagonal to obtain B λ (k). Thus, the duration t B (λ) of the execution of a product for the presentation of an example is given by this expression:
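The data flow of this product can be simulated sequentially. The sketch below is a simplification: it treats the whole vector δ at once instead of one layer per step, and it assumes the sampled placement in which processor PE pq holds W[p::Q, q::Q] and PE pp holds the components of δ whose index is congruent to p modulo Q. The function name is hypothetical.

```python
import numpy as np

def transposed_product_on_torus(W, delta, Q):
    """Simulate B = W^T delta on a Q x Q torus without transposing W."""
    N = W.shape[0]
    B = np.zeros(N)
    for q in range(Q):                         # column q of the torus
        acc = np.zeros(len(range(q, N, Q)))    # accumulator travelling down the column
        for p in range(Q):                     # line p of the torus
            local_W = W[p::Q, q::Q]            # block stored on PE_pq
            local_delta = delta[p::Q]          # broadcast along line p from PE_pp
            acc += local_delta @ local_W       # local product: vector times block
        B[q::Q] = acc                          # result gathered on PE_qq
    return B
```

Multiplying the broadcast sub-vector on the left of the local block yields the transposed product directly, so no processor ever needs the transposed matrix.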
Computation of the Product D := δ σ^T:
Obtaining this product requires the broadcast of both vectors δ and σ on all processors. In fact, these broadcasts were already performed during the computation of the matrix products previously described. Thus, only local matrix products are required to perform this operation. The duration t D of the product computation for the presentation of an example is given by :
where N W is the maximum number of alterable synaptic coefficients of a processing element.
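Because both vectors are already available on every processor, the product D reduces to independent local outer products. A minimal sketch, assuming the sampled placement with a step of Q (hypothetical function names):

```python
import numpy as np

def local_outer_products(delta, sigma, Q):
    """Each PE_pq computes the outer product of its sampled sub-vectors."""
    return {(p, q): np.outer(delta[p::Q], sigma[q::Q])
            for p in range(Q) for q in range(Q)}

def assemble(blocks, N, Q):
    """Rebuild the full N x N matrix from the local blocks (for checking only)."""
    D = np.zeros((N, N))
    for (p, q), block in blocks.items():
        D[p::Q, q::Q] = block
    return D
```

In an actual implementation only the entries that correspond to alterable synaptic coefficients would be computed; the sketch computes every entry for simplicity.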
Parallelization of the Backpropagation BLMS Algorithm
The duration of the computation of both the activation function and its derivative are denoted by t sig and t der , respectively.
The relations presented below are valid only when the pipeline is full. Otherwise, the computation load is lighter. However, the computation load per processor remains almost unchanged if the number of examples in a block is much greater than the square root Q of the number of processing elements.
Computation of State σ on Every Layer of the Network for a block of b examples :
At the end of step s, the state related to example s+1-λ is computed on layer λ. 
end //for
The duration t σ of the state computations for a whole neural network during a whole presentation of an example block is derived from relation (17) as follows:
Computation of Error δ on Every Layer for a block of b examples :
At the end of step s, the sub-vector δ λ on layer λ is related to the example s+2-2L+λ. 
If L = 1, there is no backpropagation :
Computation of ∆W for a block of b examples:
If L > 1 at the end of step s, both the sub-vector δ 1..L (s+3-2L) and the state σ(s+3-2L) are computed. Therefore, the matrix ∆W can be updated:
If L = 1 at the end of step s, the sub-vector δ 1 (s) is computed and the weights are updated for example s at step s.
The duration t ∆W of the product computations for a whole presentation of an example block (derived from relation (19)) is given by:
In the case of the pipelined LMS stochastic gradient algorithm, matrix W is altered at each step of the computation :
updating of weights
Thus, the duration of this computation is equal to t ∆W .
In order for a whole block of examples to be presented, the following conditions must be satisfied:
Speedup Analysis
At the end of a training cycle, the termination condition is checked. This involves the collection of data by a master processor which makes the decision and broadcasts it through the network. The communication time of this operation is proportional to Q t c , and this contribution is negligibly small in the duration of a cycle. The duration t par of the presentation of an example block for the pipelined BLMS algorithm is given by:
The speedup is given by the ratio S = t ser /t par, where t ser and t par are the sequential and the parallel computation times, respectively. Moreover, we define the efficiency of a parallel algorithm by the ratio S/P, where P is the number of processors in the network. In the case of a Q × Q torus, P = Q^2.
It can easily be shown that :
The expressions t σ , t δ and t ∆W show that the efficiency of the parallel implementation essentially depends on the distribution of the synaptic coefficients among the processors.
Having the same number of synaptic coefficients on every processor produces a speedup whose value approaches the number of processors Q^2 when N is large enough. This distribution of synaptic coefficients is not directly determined by the architecture of the neural network. It depends on the placement algorithm as well. Given the architecture of a neural network, the computation time t par reaches a minimum and then increases as the number of processors P increases. We now derive a mathematical condition which prevents the value of P from exceeding that which corresponds to the minimum value of t par. We assume that the synaptic weights are evenly distributed. Thus
The upper boundary shown above is reasonable because the equality of t par in the above expression is asymptotically approached when the neural network is fully connected. The implementation of the backpropagation algorithm on a Q × Q network of processors will be acceptable if the computation load is well balanced and if the derivative of t par with respect to Q is negative. There is a simple criterion that satisfies this condition: regardless of the placement parameter N W of the neural network and of the elementary computation times, the number of processors, P = Q × Q, should be less than or equal to N.
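The definitions of speedup and efficiency, together with the criterion P ≤ N, can be expressed as small helpers. The function names and the timing values are hypothetical and serve only as an illustration.

```python
import math

def speedup(t_ser, t_par):
    """S = t_ser / t_par."""
    return t_ser / t_par

def efficiency(t_ser, t_par, Q):
    """Efficiency S / P on a Q x Q torus, where P = Q * Q."""
    return speedup(t_ser, t_par) / (Q * Q)

def max_torus_side(N):
    """Largest Q such that P = Q * Q does not exceed the number of neurons N."""
    return math.isqrt(N)

# Hypothetical timings, in seconds, for a 4 x 4 torus.
S = speedup(16.0, 1.25)          # 12.8
E = efficiency(16.0, 1.25, 4)    # 0.8
Q_max = max_torus_side(200)      # 14, since 14 * 14 = 196 <= 200 < 225
```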
The best performance could be achieved with a larger processor network. However, the efficiency would decrease significantly and the cost of the machine would be prohibitive in view of its performance. In addition, the absolute maximum of the speedup strongly depends on the elementary times of operations and on the density of the synaptic matrix.
EXPERIMENTS
Transputers
The transputer was designed for efficient implementation of the high level OCCAM programming language [18]. OCCAM allows the implementation of sequential processes which can communicate by sending and receiving messages through logical channels, some of which correspond to physical links. A transputer can run several processes by time-slicing. The code generated by OCCAM is compact and efficient, making the use of machine language unnecessary.
Experimental Conditions
The purpose of these experiments was to measure the speedup, to check the validity of the analytic expressions presented above, and to derive the values of t f and (t c + t s ).
The experiments were performed on a bidimensional torus of sixteen T800 transputers (Q = 4) in order to measure the parallel execution time t par. Then, we obtained the measurement of the sequential execution time t ser from a single transputer configuration. The speedup is defined as the ratio t ser /t par.
The network communicates with the host computer through an additional transputer referred to as the "root transputer". This transputer is in charge of initializing the processor network and accessing the host resources (figure 4).
For these measurements, auto-associative learning performed by the BLMS algorithm was run on a set of random binary vectors. The synaptic coefficients were initialized at random. The measurements were performed for both the fully connected feedforward network and for a three-layer network in order to compare the performances obtained for these very dissimilar network architectures. The size of the blocks of examples was 128.
In the case of a feedforward network of N neurons, we chose N to be an integer multiple of Q in order to balance the computation load. In the case of a three-layer network, all the layers are fully interconnected and every layer has the same number of neurons; N was chosen to be an integer multiple of 3Q in order to balance the distribution of the synaptic coefficients.
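The divisibility constraints on N can be enforced by rounding a desired size up to the next admissible value. This is a minimal sketch; `balanced_size` and its `multiplier` parameter are hypothetical names, with `multiplier = 1` for the fully connected case (multiple of Q) and `multiplier = 3` for the three-layer case (multiple of 3Q).

```python
def balanced_size(n_min: int, q: int, multiplier: int = 1) -> int:
    """Smallest N >= n_min that is an integer multiple of multiplier * q."""
    if n_min < 1 or q < 1 or multiplier < 1:
        raise ValueError("all arguments must be positive")
    step = multiplier * q
    # Ceiling division written with integer arithmetic only.
    return -(-n_min // step) * step
```

Values of N that already satisfy the constraint are returned unchanged.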
In our experimental neural network, all units were sigmoidal.
Speedup Measurements
The time measurements for our two kinds of architecture were performed for networks of 
Estimates of Elementary Times t f , t sig , t der and t c + t s
Training of a Fully Connected Network
The fully connected feedforward network is composed of N neurons. Such a network has N layers, each containing only one neuron. Thus, given that N is an integer multiple of Q:
(from relation (25)) t par is deduced from relations (20), (21), (22) and (23) :
Training of Networks of Three Layers
All layers are fully interconnected and the number of neurons per layer is the same for every layer. We assume that N is divisible by Q(L+1):
The broadcasts through the network of processors are mono-directional. Thus Φ = Q.
We derive t par as follows:
t par and t ser were measured for each kind of architecture. The transputer timer gave the total time for the execution of the local matrix products, which depends directly on t f . It also gave the total communication time, which allowed the easy calculation of t c + t s according to the analytic expressions of computation time. Finally it gave the total amount of time for the computation of the activation function and its derivative. These values permitted the calculation of t sig and t der . Figures 8 and 9 
Discussion
Predicting the Performance of the Parallel Learning Algorithm
This section has two purposes. First, we evaluate the quality of the placement algorithm described in section 3. Second, we show that performance can be evaluated even for complex network architectures, under the hypothesis that memory access and storage times are almost independent of the neural architecture. This hypothesis rests on the design of data structures optimized with respect to density and to the speed at which the synaptic coefficients are accessed in memory.
The parallel learning algorithm presented above may deal with large networks that possess up to several thousand neurons. The synaptic matrices, which contain only several thousand non-zero coefficients, are very sparse. Let us consider a network (similar to the one described in [13] ) that features 1257 neurons distributed among five layers (figure 10). The simulation of such a network requires hardware and software environments which are presently unavailable for our parallel machine. However, performance prediction is possible because of the simplicity of the analytic expressions.
The expressions of the parallel computation time use N W and N W-1.. 3 . This requires a count of synaptic coefficients for each processor of the torus. This count is presented in figure 11.
Values of for all the processors of a 4 × 4 torus
However, this ideal efficiency, E sup , is an upper limit. The heuristic we used gives satisfactory results with respect to this limit.
CONCLUSION
The parallelization of the matrix products involved in the backpropagation learning algorithm leads to favorable performances on loosely-coupled parallel machines such as transputer networks because the data communications are kept moderate and well balanced. Moreover, the exploitation of the pipeline formed by the successive layers of the neural network permits an increase in the computation load per processor and enables the computation to be distributed more evenly throughout the network, even when the number of neurons per layer is small.
Thanks to the pipelined computation, the theoretical expressions of performance are simple and permit the prediction of speedups for complex neural architectures. Theoretical assessments of performance were validated by experiments. We have shown in Sections 3.6 and 4.4 that performance is almost independent of the neural network architecture, even in the two extreme cases presented in this paper: the fully connected feedforward neural network and the three-layer network. The greater the size of the neural network and training set, the better the speedup and the efficiency of the parallel algorithm.
An inappropriate placement of the synaptic matrix may detract from the performance. We have used a placement consisting of regular sampling of the synaptic matrix. The quality of this placement has been satisfactory.
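To illustrate how a regular sampling of the synaptic matrix spreads the load, the sketch below counts the coefficients assigned to each processor of a Q × Q torus. It assumes that regular sampling means a cyclic mapping in which coefficient (i, j) goes to processor (i mod Q, j mod Q); this is our reading for illustration and not necessarily the exact rule of section 3. It also assumes that, in the fully connected feedforward case, neuron i receives a connection from every neuron j < i. All function names are hypothetical.

```python
from collections import Counter


def place(i: int, j: int, q: int) -> tuple:
    """Assumed cyclic placement of coefficient (i, j) on a Q x Q torus."""
    return (i % q, j % q)


def load_per_processor(connections, q: int) -> Counter:
    """Count the synaptic coefficients held by each processor."""
    # Start every processor at zero so that idle processors are visible.
    counts = Counter({(r, c): 0 for r in range(q) for c in range(q)})
    for i, j in connections:
        counts[place(i, j, q)] += 1
    return counts


def feedforward_connections(n: int) -> list:
    """Assumed fully connected feedforward net: i receives from all j < i."""
    return [(i, j) for i in range(n) for j in range(i)]


if __name__ == "__main__":
    loads = load_per_processor(feedforward_connections(8), q=4)
    print(min(loads.values()), max(loads.values()))
```

The spread between the least and most loaded processors gives a quick indication of how well a given placement balances the synaptic coefficients.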
We are studying the efficiency of the pipelined LMS stochastic gradient algorithm relative to that of the non-pipelined algorithm. Preliminary experiments suggest that the pipelined algorithm converges in the same way as the non-pipelined algorithm.
ACKNOWLEDGEMENT
The authors also wish to thank S. Jennings for reading the manuscript.
