In this paper, we present a new technique for mapping the backpropagation algorithm on hypercubes and related architectures. A key component of this technique is a network partitioning scheme called checkerboarding. Checkerboarding allows us to replace the all-to-all broadcast operation performed by the commonly used vertical network partitioning scheme with operations that are much faster on hypercubes and related architectures. Checkerboarding can be combined with the pattern partitioning technique to form a hybrid scheme that performs better than either scheme alone. Theoretical analysis and experimental results on nCUBE2 and CM5 show that our scheme performs better than the other schemes, for both uniform and non-uniform networks.
Introduction
The Backpropagation algorithm (BP) [1] is one of the most popular neural network learning algorithms. It has been used in a large number of applications [2, 3, 4, 5]. This algorithm is computation-intensive, and as a result there has been great interest in developing parallel formulations of it for a variety of parallel computers.
BP can be parallelized either by network partitioning or by pattern partitioning. In network partitioning schemes, the nodes and weights of the neural network are partitioned among different processors, and thus the computations of node activations, node errors and weight changes are parallelized [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. In pattern partitioning, individual weight changes due to various learning patterns are computed concurrently [17, 18, 19, 20]. Pattern partitioning and network partitioning can also be combined [6, 11, 12, 14, 20, 21] to form hybrid schemes. Several machine architectures, including linear arrays, meshes and hypercubes, have been explored to implement parallel BP.
In this paper, we present a new technique for mapping the backpropagation algorithm on hypercubes and related architectures. A key component of this technique is a network partitioning scheme called checkerboarding. The major communication-intensive operation in the commonly used vertical sectioning scheme [6, 9, 10, 11, 12, 13, 14, 15, 17, 22] is the all-to-all broadcast.
In this operation, each of the P processors has to broadcast its local m units of information to all the other processors. This takes $\Theta(mP)$ time on a linear array, on a mesh, and on a hypercube. Hence, the vertical sectioning scheme performs equally well (or equally poorly) on all these architectures.
2 Backpropagation Learning Algorithm
BP trains a given feedforward neural network for a given set of learning patterns. The trained network is able to map the input portion of each learning pattern to the output portion of the same learning pattern. The training of a neural network can be viewed as discovering values for its weights.
To simplify the discussion, we focus on a uniform multilayered feedforward neural network, in which every layer has the same number of nodes. We discuss the case of non-uniform networks in section 6. Let J be the number of layers, and I be the number of nodes per layer. Figure 1 shows a uniform network with 3 layers (J = 3) and 3 nodes per layer (I = 3). The layers are labeled from 1 through J. The input layer is labeled 0, and is not counted in J. Each node i in layer j is associated with an activation value $a_i^j$ and an error value $\delta_i^j$. The activation values associated with the nodes in a layer form an activation vector. Z1, Z2 and Z3 represent the activation vectors for the different layers for a given pattern in Figure 1. Similarly, the error vectors associated with the different layers are represented by E1, E2 and E3. A node i in layer j is connected to all nodes in layers $j - 1$ and $j + 1$. A weight $w_{i,k}^{j,j+1}$ is associated with the connection from node i in layer j to node k in layer $j + 1$. The weight matrix is shown in the right half of Figure 1. There are $J I^2$ non-zero weights in the network. The non-zero entries of the weight matrix are grouped as submatrices U1, U2 and U3. Each submatrix represents the weights on the edges between the nodes in adjacent layers. For example, U1 represents the weights on edges from the nodes in the input layer to the nodes in layer 1. U2 and U3 are defined similarly.
BP has two phases: the forward phase and the backward phase. In the forward phase, the input portion of a learning pattern is fed to the network. It is propagated through the layers to compute the activations of the nodes in each layer. The difference between the activations of the nodes in the highest layer and the output portion of the learning pattern defines the error for the nodes in the highest layer. In the backward phase, the error is propagated backwards from the highest layer to the lower layers to compute the errors assignable to nodes in the various layers. This computation is accompanied by the computation of weight changes that will reduce the total error at the nodes in the highest layer.
The forward and backward phases are carried out for every learning pattern. The error at the nodes in the highest layer is accumulated over all the patterns. The computed weight change for each weight is also accumulated over all patterns, and each weight in the network is changed by its accumulated weight change. This completes one iteration over all patterns. If the accumulated error for the nodes in the highest layer is above a given target, the iteration is repeated. BP terminates when the accumulated error at the nodes in the highest layer falls below the target. The details of BP are shown in Figure 6 in Appendix A.
The bulk of the computations done by BP can be viewed as a sequence of matrix-vector multiplications, as shown in Figure 1. The input portion Z0 of a learning pattern is presented to the nodes in layer 0. The activation vector Z1 is computed by multiplying Z0 with the weight matrix U1. This process is repeated for computing Z2 (= product[U2, Z1]) and Z3 (= product[U3, Z2]), thus completing the computation of activation vectors for each layer. This is followed by error computation for the highest layer (layer 3) and the backward phase. The error vector E3 is computed as the vector difference between the output portion of the learning pattern and Z3. The error vector E2 is then computed as product[E3, U3], and this process is repeated to compute the error vectors for the other layers, i.e. E1 = product[E2, U2]. The backward phase also computes changes in the weights to reduce errors. The weight changes are computed by evaluating the cross products of activation and error vectors. For example, the weight changes for the elements of U3 are computed via ΔU3 = cross product[Z2, E3]. The cross product of two vectors is defined in the following manner: ΔU3[i, j] = Z2(i) * E3(j), where the subscripts i and j refer to the node numbers given in Figure 1. The weight changes for the other layers are computed in a similar manner, e.g. ΔU2 = cross product[Z1, E2] and ΔU1 = cross product[Z0, E1].
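The matrix-vector view described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the indexing convention `U[i][k]` (weight from node i to node k of the next layer), and the use of a sigmoid in the forward product are assumptions made for the example; the authors' detailed algorithm is the one in Figure 6 of Appendix A.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_product(U, z):
    """product[U, Z]: activations of the next layer (sigmoid assumed)."""
    n_out = len(U[0])
    return [sigmoid(sum(U[i][k] * z[i] for i in range(len(z))))
            for k in range(n_out)]

def backward_product(e, U, z):
    """product[E, U], scaled element-wise by (1 - Z(i)) * Z(i)."""
    return [(1.0 - z[i]) * z[i] * sum(U[i][k] * e[k] for k in range(len(e)))
            for i in range(len(z))]

def cross_product(z, e):
    """cross product[Z, E]: dU[i][k] = Z(i) * E(k)."""
    return [[zi * ek for ek in e] for zi in z]

def one_pattern(weights, z0, target):
    """Forward and backward phase for one pattern.

    weights[j] plays the role of U(j+1); returns (activations, errors,
    weight changes), where errors[j] is E(j+1) and changes[j] is dU(j+1).
    """
    zs = [z0]
    for U in weights:                      # forward phase: Z1, Z2, ...
        zs.append(forward_product(U, zs[-1]))
    errors = [None] * len(weights)
    errors[-1] = [t - z for t, z in zip(target, zs[-1])]   # top-layer error
    for j in range(len(weights) - 1, 0, -1):               # backward phase
        errors[j - 1] = backward_product(errors[j], weights[j], zs[j])
    changes = [cross_product(zs[j], errors[j]) for j in range(len(weights))]
    return zs, errors, changes
```

Each of the three helpers touches every entry of one I x I weight submatrix once, which is the source of the $I^2$ cost per layer and per pattern.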
The dominant computation per iteration is represented by the sequence of matrix-vector operations described above. The overall runtime for each iteration of BP can be approximated as $L J I^2 t_c$, where $I^2 t_c$ is the computation cost of all matrix-vector operations associated with each layer for one pattern, and L represents the number of learning patterns. This cost model can be used to analyze and compare the parallel formulations of backpropagation in spite of the following simplifications. BP does more computation than the sequence of matrix-vector operations depicted in Figure 1. For example, it computes sigmoidal functions and the magnitude of the error vector for the highest layer. It adjusts weights after examining the entire set of learning patterns in each iteration. Lastly, the number of matrix-vector products in the forward phase is one more than the number of matrix-vector products during the backward phase.
Note: more precisely, E2(i) = (1 - Z2(i)) * Z2(i) * product[E3, U3](i), as shown in Figure 6. That is, the i-th element of the vector product[E3, U3] is multiplied by a function of the activation of the node (i.e., by (1 - Z2(i)) * Z2(i)).
Training Regimes
In the set-training regime described in Appendix A, the weight changes for all the learning patterns are accumulated before changing the weights. Many neural network software packages (e.g. [24]) support an alternative training regime, called per-pattern training. In this training regime, the set of patterns is permuted to create an ordering among the learning patterns. The patterns are then examined in sequence. Weight changes are computed for a pattern, and the weights are changed before examining the next pattern. These semantics prevent parallelization of weight adjustments due to different patterns.
Pattern-partitioning schemes to parallelize BP are not applicable to the per-pattern training regime. Network partitioning schemes can, however, be used for both the per-pattern training regime and the set-training regime. We will focus on the set-training regime in this paper.
Application Domains of Backpropagation
The applications of neural network learning algorithms can broadly be divided into two groups: recognition and generalization [3]. In recognition problems, a set of learning patterns is used to train the network. The trained network is expected to recall the output part of a previously seen learning pattern when the input portion of the same learning pattern is presented. Pattern recognition applications [25] are examples of recognition problems. On the other hand, in generalization problems neural networks are tested with input portions of new testing patterns. The network is expected to predict the output portions of the testing patterns, which have not been shown to the network during the learning phase. Generalization problems include assignment of bond ratings [4] and economic prediction [5].
Usually a large neural network is used in recognition problems. Furthermore, recognition problems have a small number of learning samples. For example, in the protein-folding problem, L = 14 and I = 1000 [25]. Recognition problems can be characterized by $L \ll I^2$. In contrast, generalization problems are characterized by a large number of learning samples. The number of learning samples should be an order of magnitude larger than the number of weights to be learned. This is required for reasonable confidence in the learned weights for generalization over unseen testing patterns. These problems can be characterized by $L > 10 I^2 J$ [3, 4].
Parallel Formulations of Backpropagation
Here we survey existing schemes to parallelize BP for fully connected multi-layer networks. In these networks, adjacent layers are completely connected. For schemes that primarily deal with randomly sparse networks, see [7, 26, 27, 28, 29].
Parallelization schemes for BP can be classified into three broad categories: network partitioning schemes, pattern partitioning schemes, and hybrid schemes. The network partitioning schemes take advantage of the parallelism in the computation of node activations and node errors by distributing both the nodes and the weights of the network among different processors. Nodes can be partitioned in different ways. Complete partitioning assigns one node per processor [6, 8]. Vertical sectioning divides the nodes in a layer among the processors, and each processor gets some nodes from each layer [6, 9, 10, 11, 12, 13, 14, 15, 17, 22].
The weights can be partitioned in four different ways: complete partitioning, inset grouping, outset grouping and checkerboarding. Complete partitioning [8, 16] allocates one processor per weight in the network. It allows maximum concurrency in the computation of the various terms of the node activation and node error. However, it requires communication to accumulate the terms. Inset and outset grouping schemes are often used in conjunction with the vertical-sectioning scheme. Inset grouping schemes form sets, each of which is the collection of weights on the incoming edges of a specific node. The inset of a node is allocated to the processor possessing the node. Inset grouping reduces the need for communication in computing node activations during forward propagation. It has been used in [6, 11, 12, 14]. Outset grouping schemes form sets, each of which is the collection of weights on the outgoing edges of a specific node. The outset of a node is allocated to the processor possessing the node. Outset grouping eliminates the need for communication to compute error vectors during the backward phase. It has been used in [15]. Inset and outset grouping can be used together to improve the efficiency of both the forward and backward phases [9, 10, 13]. However, this scheme duplicates each weight on two processors, increasing the work during weight updates. It can also be shown that the communication cost for inset grouping is the same as the communication cost incurred by the replication scheme, through the use of the multiply-accumulate-rotate operation [11]. Checkerboarding partitions the weights by grouping the rows and columns of the weight matrix. It has been used for systolic arrays [16], and for transputers [21] connected in a mesh configuration. Given enough processors, checkerboarding reduces to a complete partitioning of weights.
Pattern partitioning replicates the network nodes and weights at each processor. It divides the pattern set equally among all processors. Each processor carries out the forward and the backward phase for its local set of patterns. Each processor also accumulates the weight changes due to its local patterns. Then, the processors communicate to accumulate the weight changes for updating weights. This scheme is preferred for problems with a large set of learning patterns on machines supporting an efficient broadcast operation. This scheme has been used in [17, 18, 19, 20]. The computation in different layers of BP can be pipelined [6, 14]; i.e., while one pattern is being processed for some layer, a different pattern can be processed for a preceding layer. Even though pipelining partitions the network horizontally, it is more appropriate to view it as pattern partitioning.
Hybrid schemes combine pattern partitioning with network partitioning. For example, pipelining can be combined with vertical sectioning [6, 14]. Other examples include the combination of vertical sectioning with simple pattern partitioning [12]. We can view the forward and backward phase computations for a pattern as a sequence of weight-matrix to activation/error-vector products. The hybrid schemes that process multiple patterns simultaneously can be viewed as a sequence of matrix-matrix products. This allows the possibility of using parallel algorithms for computing a sequence of matrix operations [21, 30].
An orthogonal way of classifying the approaches to parallel BP is based on the architectures on which they are implemented. Complete node partitioning and inset weight partitioning with pipelining have been explored for linear arrays in [14]. Vertical sectioning of nodes with duplicate inset/outset weight partitioning is used for mesh-connected transputers in [9, 10]. Mesh-based machines have been explored by many researchers. A matrix-matrix multiplication-based formulation is given in [21] for mesh architectures. Other schemes for meshes include those proposed in [6, 13, 16]. Techniques for hypercubes and related architectures have been explored in [8, 11, 12, 15, 18, 19].
Hypercubes and Backpropagation
In a hypercube-connected parallel computer, each processor is directly connected to $\log P$ other processors whose addresses differ by exactly one bit. The time to deliver a message of length m words between two adjacent processors is given by $(t_s + t_w m)$. Here, $t_s$ is the message startup time and $t_w$ is the per-word communication time. A message containing m words can be broadcast from a processor to all other processors in time $(t_s + t_w m) \log P$, as a binary tree of depth $\log P$ can be mapped onto a hypercube containing P processors [31].
We first consider the following two important parallel formulations of BP before describing our new scheme:
- Partitioning of patterns
- Partitioning of neural networks by vertical sectioning
To simplify the discussion, we defer the discussion of hybrid schemes using both pattern partitioning and network partitioning to section 5.
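The hypercube properties used here can be illustrated with a short sketch. The function names are hypothetical, and the broadcast schedule shown (send across one address bit per step) is one common way of mapping a broadcast tree onto a hypercube; it is assumed for illustration rather than taken from [31].

```python
def hypercube_neighbors(rank, dim):
    """Ranks directly connected to `rank` in a dim-dimensional hypercube."""
    return [rank ^ (1 << bit) for bit in range(dim)]

def broadcast_schedule(root, dim):
    """One-to-all broadcast as a list of steps of (sender, receiver) pairs.

    In step `bit`, every processor that already holds the message sends it
    to the neighbor whose address differs in that bit.
    """
    have = [root]
    steps = []
    for bit in range(dim):
        pairs = [(src, src ^ (1 << bit)) for src in have]
        steps.append(pairs)
        have += [dst for _, dst in pairs]
    return steps

def broadcast_time(m, P, t_s, t_w):
    """(t_s + t_w * m) * log2(P) for P a power of 2."""
    dim = P.bit_length() - 1
    return (t_s + t_w * m) * dim
```

For P = 8, the schedule has 3 steps and every message travels between processors whose addresses differ in exactly one bit.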
Pattern Partitioning
In BP, under the set-training regime (see section 2.1), the computation of weight changes for different patterns can be carried out in parallel without requiring any communication. However, the step of accumulating the weight changes requires communication among the processors.
In this scheme, the whole network is replicated on each of the P processors. Each processor then performs learning using $L/P$ patterns. For each iteration of BP, the following steps are performed on each processor:
1. Each processor carries out the forward and backward phases to compute the sum of the weight changes due to its $L/P$ local learning patterns. This step takes approximately $\frac{L}{P} J I^2 t_c$ time, where $t_c$ is the computation cost per weight per pattern for each iteration of the set-training regime.
2. The average weight change over all L patterns is computed for each of the $J I^2$ weights. To do this, a binary tree is imposed on the P processors. The root of this tree accumulates and computes the average weight changes. This step takes $(t_s + J I^2 t_w) \log P$ communication time.
3. The average weight changes computed by the root processor are sent back to all processors to compute the new weight matrix at each processor. This step takes $(t_s + J I^2 t_w) \log P$ communication time.
Hence, the parallel runtime of BP for one iteration over L patterns is given by:
Parallel Runtime = $\frac{L}{P} J I^2 t_c + 2 (t_s + J I^2 t_w) \log P$
The main features of this scheme are as follows:
- It is usable only for the set-training regime and is not applicable to a per-pattern training regime.
- The maximum number of processors that can theoretically be used by this mapping scheme is equal to the number of learning patterns (L).
- As P approaches L, the computational component ($\frac{L}{P} J I^2 t_c$) can become much smaller than the communication time ($2 (t_s + J I^2 t_w) \log P$). Hence, the speedup will usually peak for P < L.
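The pattern-partitioning cost expression can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, and the problem size (L = 256, J = 2, I = 64) is an assumed example. The constants are the nCUBE2 values reported in the experimental section of this paper (in microseconds).

```python
import math

def serial_time(L, J, I, t_c):
    """Approximate serial time of one set-training iteration: L * J * I^2 * t_c."""
    return L * J * I**2 * t_c

def pattern_partitioning_time(L, J, I, P, t_c, t_s, t_w):
    """Parallel runtime of pattern partitioning on a P-processor hypercube."""
    computation = (L / P) * J * I**2 * t_c
    communication = 2 * (t_s + J * I**2 * t_w) * math.log2(P)
    return computation + communication

# Assumed example: L = 256 patterns, J = 2 layers, I = 64 nodes per layer,
# t_c = 10, t_s = 180, t_w = 3 (microseconds).
for P in (1, 16, 256):
    t = pattern_partitioning_time(256, 2, 64, P, 10, 180, 3)
    print(P, round(serial_time(256, 2, 64, 10) / t, 2))
```

Under these assumed parameters the communication term at P = L is several times larger than the computation term, which illustrates why adding processors yields diminishing returns as P approaches L.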
Network Partitioning using Vertical Sectioning
In a vertical-sectioning scheme, the nodes in the different layers of a uniform network are divided in an identical manner by allocating an equal number of nodes per layer to each processor. The processors also contain the incoming weights associated with the nodes that reside in them. Each processor computes node activations, node errors, and weight changes for one pattern at a time.
As an example, consider the scheme given in [11]. A similar scheme is given in [9, 10]. The accompanying figure shows the mapping of a network with I = 4, J = 2 onto 4 processors. In the forward phase, to compute the activation value of any node i in layer j, it is necessary to have the activation values of all nodes in layer $(j - 1)$. Thus, an all-to-all broadcast operation [33], in which each processor sends its $I/P$ activations to all other processors, needs to be performed. This operation takes $P \left(t_s + \frac{I}{P} t_w\right)$ time for each of the J layers. Note that the runtime of this operation remains the same for a linear array, mesh, systolic array or hypercube (see footnote 6). Similar computation and communication take place during the backward phase. Hence, the parallel runtime for one iteration and L patterns is:
Parallel Runtime = $\frac{L}{P} J I^2 t_c + 2 L J P \left(t_s + \frac{I}{P} t_w\right)$
6 All-to-all broadcast can be carried out in $\Theta\left(\frac{P}{\log P}\left[t_s + \frac{I}{P} t_w\right]\right)$ time on a hypercube if simultaneous communication along all channels of the hypercube is allowed [31].
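To make the linear growth of the all-to-all broadcast with P concrete, the sketch below simulates it on a ring, where every processor forwards one block to its right neighbor per step, and then evaluates the vertical-sectioning runtime expression. The ring topology and all names are assumptions chosen for illustration; the cited schemes may organize the exchange differently.

```python
def ring_all_to_all_broadcast(blocks):
    """Simulate all-to-all broadcast of one block per processor on a ring.

    Returns (gathered, steps): gathered[p] maps source rank -> block held by
    processor p at the end, and steps is the number of communication steps.
    """
    P = len(blocks)
    gathered = [{p: blocks[p]} for p in range(P)]
    in_flight = list(range(P))      # source rank of the block each processor forwards
    steps = 0
    for _ in range(P - 1):
        # all processors send simultaneously to their right neighbor
        in_flight = [in_flight[(p - 1) % P] for p in range(P)]
        for p in range(P):
            gathered[p][in_flight[p]] = blocks[in_flight[p]]
        steps += 1
    return gathered, steps

def vertical_sectioning_time(L, J, I, P, t_c, t_s, t_w):
    """L/P * J * I^2 * t_c + 2 * L * J * P * (t_s + I/P * t_w)."""
    return (L / P) * J * I**2 * t_c + 2 * L * J * P * (t_s + (I / P) * t_w)
```

The simulation needs P - 1 steps, each costing one startup plus the transfer of $I/P$ words, which is where the factor P in the communication term comes from.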
Our Parallel Formulation
In the vertical-sectioning schemes, one processor performs the computation related to one or more nodes in each layer. In contrast, in our scheme, a set of processors is used to perform these operations. To do this, the weights related to a node are divided among a set of processors. Furthermore, the processors processing the incoming and outgoing weights for a node are arranged in a special manner to minimize the communication costs incurred in the computation of node activations and errors. This increases the degree of concurrency of our scheme over the vertical-sectioning scheme. Our formulation is essentially an adaptation of the matrix-vector multiplication scheme for hypercubes presented in [31]. The partitioning of nodes and weights in our scheme is depicted in Figure 4(a). We partition each $I \times I$ weight submatrix (e.g. U1, U2, U3 in Figure 1) onto a $\sqrt{P} \times \sqrt{P}$ grid of processors, which is embedded on the P-processor hypercube as described above. Thus each processor contains $I^2/P$ weights from each layer. The I nodes of each layer are partitioned among the $\sqrt{P}$ processors that are on the diagonal of the $\sqrt{P} \times \sqrt{P}$ grid. Hence, each diagonal processor contains $I/\sqrt{P}$ nodes. Each row of $\sqrt{P}$ processors together contains the I incoming weights for each of the $I/\sqrt{P}$ nodes (per layer) in its diagonal element. Similarly, each column of processors together contains the I outgoing weights for each of the $I/\sqrt{P}$ nodes (per layer) in its diagonal element. We call our scheme checkerboarding due to the manner in which the weights are partitioned.
The communication needed to compute the activation vectors, error vectors and weight changes is now described. In the forward phase, we need to move the node activation values from the diagonal processor to each row, as shown in Figure 4. The parallel runtime for one iteration and L patterns is:
Parallel Runtime = $L J \frac{I^2}{P} t_c + 2 L J \log P \, t_s + 2 L J I \frac{\log P}{\sqrt{P}} t_w$
A formal description of the mapping and steps in our scheme is provided in Appendix E, to explain implementation details.
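As a rough illustration of the data movement in a checkerboard partition, the sketch below simulates one matrix-vector product on a q x q grid (q standing for $\sqrt{P}$) in plain Python. It assumes `U[k][i]` is the weight on the edge into node k from node i, that the vector blocks start on the diagonal processors, and that each block is broadcast along its grid column before a row-wise sum. This is one reading of the matrix-vector scheme, not the formal mapping of Appendix E, and the function name is hypothetical.

```python
def checkerboard_matvec(U, z, q):
    """Simulate y[k] = sum_i U[k][i] * z[i] on a q x q processor grid."""
    n = len(z)
    b = n // q                       # nodes per diagonal processor, I / sqrt(P)
    # diagonal processor (c, c) owns block c of the input vector
    diag = [z[c * b:(c + 1) * b] for c in range(q)]
    # step 1: diagonal processor (c, c) broadcasts its block along column c
    local_z = [[diag[c] for c in range(q)] for r in range(q)]
    # step 2: processor (r, c) multiplies its b x b weight block by that block
    partial = [[[sum(U[r * b + k][c * b + i] * local_z[r][c][i] for i in range(b))
                 for k in range(b)]
                for c in range(q)]
               for r in range(q)]
    # step 3: partial sums are added along each row and land on the diagonal
    y = []
    for r in range(q):
        y += [sum(partial[r][c][k] for c in range(q)) for k in range(b)]
    return y
```

Each processor works on only $I^2/P$ weights, and both communication steps involve $\sqrt{P}$ processors exchanging blocks of $I/\sqrt{P}$ values, matching the shape of the $t_w$ term in the runtime expression.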
Performance Comparison
4.1 Framework for Comparison
Different network-partitioning schemes can be compared equally well on either a per-pattern training iteration or a set-training iteration. On the other hand, per-pattern training iterations cannot be used to compare network-partitioning schemes with pattern-partitioning schemes to parallelize BP. However, the set-training iteration can be used as a benchmark task to compare alternative schemes to parallelize BP, since this task provides an equal amount of computation for all parallel formulations. Hence, in this paper the set-training iteration has been used as the benchmark to compare pattern partitioning and network partitioning of BP, as is done by [12, 21] among other papers. One full iteration of the set-training regime is used as the benchmark task to compare the alternative parallel formulations. A full iteration of the set-training regime of BP consists of the forward phase, the backward phase, and the weight change computations for all patterns, followed by the weight adjustments.
Because of our focus on hypercube-connected multicomputers, benchmark problems are chosen such that the cardinality of the set of learning patterns, as well as the number of nodes in any layer of the neural network, is a power of 2. This choice is made to simplify interpretation of the experimental results. If the size (number of nodes) of a layer is not a power of 2, then the network partitioning schemes (i.e. vertical sectioning and our scheme) increase the size of the layer to the next larger power of 2 by adding dummy nodes. For example, the NETtalk network (203-60-26) is approximated by the network with the configuration (256-64-32) for network partitioning schemes. Pattern partitioning on a hypercube is not affected by the network architecture, but the number of learning patterns (L) is rounded up to the next larger power of 2.
Parallel runtime and speedup are used as the metrics of performance for the different parallel formulations. The parallel time for a parallel formulation refers to the execution time of the benchmark on P processors, where P is the number of processors used. Speedup is the ratio of serial runtime to parallel runtime. The serial time refers to the execution time of the benchmark task on a single processor. We provide a translation of the speedup measurements into weight updates per second and connections per second to be able to compare our results to related results [12, 22] in the literature.
Parallel formulations speed up the execution of BP by accelerating the execution of each iteration. Another way to speed up the backpropagation algorithm is based on increasing the rate of convergence (i.e. reducing the total number of iterations needed for convergence) by tuning the parameters (e.g. learning rate and momentum), or by selecting an appropriate training regime. Measurements of improvements due to changes in the rate of convergence are outside the scope of this paper, and are not captured in our speedup metric.
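The padding rule can be written as a one-line helper. The name `next_power_of_two` is hypothetical; the sketch assumes that a size which is already a power of 2 is left unchanged.

```python
def next_power_of_two(n):
    """Smallest power of 2 that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

# NETtalk layer sizes from the text
print([next_power_of_two(n) for n in (203, 60, 26)])
```

Applied to the NETtalk layer sizes 203, 60 and 26, this gives the padded configuration 256-64-32 quoted above.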
Theoretical Comparison
In this section, we compare the cost of the checkerboarding scheme, the vertical network sectioning scheme and the pure pattern-partitioning scheme. We first show that the checkerboarding scheme is almost always strictly better than the vertical-sectioning scheme. We then compare the checkerboarding scheme to the pattern-partitioning scheme for some interesting cases. Table 1 shows the coefficients of $t_c$, $t_s$ and $t_w$ in the runtime for one full iteration of BP, for vertical sectioning and checkerboarding. The expressions given in Table 1 assume $P < I$ for vertical sectioning and $P < I^2$ for the checkerboarding scheme.
Vertical Sectioning vs. Checkerboarding Scheme
Note that while the coefficients for $t_c$ in both schemes are the same, the coefficient for $t_s$ is smaller in the checkerboarding scheme. Also, the coefficient for $t_w$ is smaller in the checkerboarding scheme when $P > 16$ (see footnote 7), since then the ratio $\frac{\log P}{\sqrt{P}} < 1$. Hence, given a hypercube with more than 16 processors, our scheme will always outperform the vertical-sectioning scheme. Furthermore, our scheme can use up to $I^2$ processors, whereas vertical sectioning can use only up to I processors.
7 Recall that if communication along all channels of a hypercube is permitted, then the communication cost for vertical sectioning is $\frac{P}{\log P}\left[t_s + \frac{I}{P} t_w\right]$. In this case, P must be greater than 256 (for $\frac{\log^2 P}{\sqrt{P}} < 1$) for checkerboarding to perform strictly better on the $t_w$ terms.
Table 1: Coefficients of $t_c$, $t_s$ and $t_w$ in the runtime for the vertical-sectioning and checkerboarding schemes, for 1 pattern and 1 iteration of BP
| Scheme | $t_c$ | $t_s$ | $t_w$ |
|---|---|---|---|
| Vertical Sectioning | $J \frac{I^2}{P}$ | $2JP$ | $2JI$ |
| Checkerboarding | $J \frac{I^2}{P}$ | $2J \log P$ | $2JI \frac{\log P}{\sqrt{P}}$ |
Comparison of Pattern Partitioning and Checkerboarding
Table 2 shows the coefficients for $t_s$, $t_w$ and $t_c$ in the runtime for the pattern-partitioning and the checkerboarding network partitioning schemes. It assumes $P < L$ for the pattern-partitioning scheme and $P < I^2$ for the checkerboarding scheme. The coefficient of $t_w$ for pattern partitioning becomes too large as $I^2$ increases. On the other hand, the coefficient of $t_w$ for the checkerboarding scheme becomes too large as L increases. The coefficient of $t_s$ for the checkerboarding scheme also increases with L; however, it can be made independent of L by sending and processing all the learning patterns together.
Table 2: Coefficients of $t_c$, $t_s$ and $t_w$ in the runtime for pattern partitioning and checkerboarding, for L patterns and 1 iteration
| Scheme | $t_c$ | $t_s$ | $t_w$ |
|---|---|---|---|
| Pattern Partitioning | $J I^2 \frac{L}{P}$ | $2 \log P$ | $2 J I^2 \log P$ |
| Checkerboarding | $J I^2 \frac{L}{P}$ | $2LJ \log P$ | $2LJI \frac{\log P}{\sqrt{P}}$ |
Clearly, neither of the schemes dominates the other in all cases. Next, we consider different classes of problems and discuss the relative superiority of the different schemes.
Case 1: Per-Pattern Training Regime: Pattern partitioning cannot be used to speed up BP in this case. However, the checkerboarding scheme can be used to speed up the computation of weight changes due to each pattern.
Case 2: Recognition Problems: For recognition problems (see section 2.2), L tends to be smaller than $I^2$. In particular, if $L < I$, then the coefficient of $t_w$ is always smaller for the checkerboarding scheme. Even if $I^2 > L > I$, the coefficient of $t_w$ is smaller for checkerboarding for any P larger than some threshold. The threshold value of P depends upon the value of $I^2/L$ and on the value of $t_s$.
Case 3: Generalization Problems: Generalization problems (see section 2.2) have a large number of learning patterns, L, which is larger than the number of weights, $J I^2$, in the neural network. In this case, pattern partitioning is strictly better than the checkerboarding scheme and can be shown to be better than any network partitioning scheme for hypercubes, meshes and linear arrays. Even for these problems, a hybrid of the checkerboarding scheme with the pattern-partitioning scheme can perform better than pure pattern partitioning. This is discussed further in section 5.
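The coefficient comparisons in Tables 1 and 2 and in the cases above can be checked numerically with a small sketch. The function names and the sample values of J, I, L and P are assumptions for illustration (L = 14, I = 1000 is the protein-folding example mentioned earlier); each function returns the coefficients of $(t_c, t_s, t_w)$.

```python
import math

def vertical_coeffs(J, I, P):
    """Vertical sectioning, 1 pattern: (J*I^2/P, 2*J*P, 2*J*I)."""
    return (J * I**2 / P, 2 * J * P, 2 * J * I)

def checkerboard_coeffs(J, I, P, L=1):
    """Checkerboarding, L patterns."""
    log_p = math.log2(P)
    return (L * J * I**2 / P,
            2 * L * J * log_p,
            2 * L * J * I * log_p / math.sqrt(P))

def pattern_coeffs(J, I, P, L):
    """Pattern partitioning, L patterns."""
    log_p = math.log2(P)
    return (L * J * I**2 / P, 2 * log_p, 2 * J * I**2 * log_p)

# Recognition-style problem (L < I): checkerboarding has the smaller t_w term.
print(checkerboard_coeffs(2, 1000, 64, L=14)[2], pattern_coeffs(2, 1000, 64, 14)[2])
# Generalization-style problem (L = 10 * J * I^2): pattern partitioning wins on t_w.
print(checkerboard_coeffs(2, 64, 64, L=81920)[2], pattern_coeffs(2, 64, 64, 81920)[2])
```

With these expressions, the $t_w$ coefficients of vertical sectioning and checkerboarding coincide at P = 16 and favor checkerboarding for larger P.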
Experimental Evaluation and Comparison
In this section, we present the performance of the checkerboarding scheme on uniform networks. To test the performance, we implemented BP on the nCUBE2 hypercube. nCUBE2 is a second-generation hypercube containing up to 8192 processors. Each node of nCUBE2 is rated at approximately 8 MIPS / 2.5 MFlops (peak). In our experiments, we observed the interprocessor communication startup time $t_s$ to be around 180 µs, and the transmission time for a 4-byte message, $t_w$, to be about 3 µs. The value of $t_c$ (for our unoptimized code) was observed to be around 10 µs. BP was tested for uniform networks with an input layer and two other layers (i.e., J = 2), and a full iteration was performed. Table 3 shows the speedups obtained on nCUBE2 for checkerboarding for different values of P and for different network sizes, in carrying out one full iteration of the set-training regime of BP. Speedup on small networks is poor, but it improves significantly for large networks.
Table 3: Actual speedups obtained by checkerboarding
We were unable to get experimental speedup figures for I > 256, due to insufficient per-processor memory on nCUBE2 for determining the serial execution time. To study the impact of larger I and P, we theoretically computed the parallel runtime and speedup for checkerboarding using a cost model described in Appendix B, which is slightly more detailed than the simplified model used for the theoretical comparison. The constants used in this model were computed using the actual experimental values of $t_s$, $t_w$ and $t_c$ for nCUBE2. These values are given in Table 4.
Table 4: Analytically computed speedups for checkerboarding
Table 5 shows the speedups obtained on nCUBE2 for checkerboarding, vertical sectioning and pattern partitioning, for I equal to 64 and 256, and for P equal to 4, 16, 64 and 256. The number of training samples, L, was kept equal to P in each case. The entry for P = 256 and I = 64 for the vertical-sectioning scheme is empty, as P > I. From this table, it can be seen that the checkerboarding scheme is substantially faster than the vertical-sectioning scheme for $P \geq 16$.
Table 5: Speedup results for Checkerboarding, Vertical Partitioning and Pattern Partitioning
Note that the checkerboarding scheme is faster than the pattern-partitioning scheme for networks with a large number of nodes. The pattern-partitioning scheme is better than the checkerboarding scheme for networks with a smaller number of nodes. This is consistent with our theoretical analysis. We note that the vertical-sectioning scheme can use a maximum of I processors, but due to communication overhead, the speedup peaks at fewer than I processors. Similarly, the checkerboarding scheme can use a maximum of $I^2$ processors, but due to communication overhead, the speedup will peak at fewer than $I^2$ processors. Readers can verify that the speedup will peak at $\Theta\left(\frac{I^2}{K \log I}\right)$ processors, where K is determined by $t_c$, $t_s$ and $t_w$.
Hybrid of Checkerboarding with Pattern Partitioning
The checkerboarding network partitioning scheme can be combined with pattern partitioning in a natural manner. This hybrid scheme divides a given set of processors into a collection of clusters. The set of learning patterns are divided equally among the clusters. Network nodes and weights are replicated across clusters. Each cluster divides the network nodes and edges in a checkerboarding manner. Proper mapping of clusters on a hypercube makes it possible to eliminate communication con icts within clusters for forward propagation and back propagation. It also eliminates communication con icts, among nodes responsible for a given weight, across clusters.
We divide the $P$ ($= P_1 P_c$) processors of a hypercube into $P_1$ clusters of $P_c$ processors each, where $P_1$ and $P_c$ are powers of 2; i.e., $P_1 = 2^a$ and $P_c = 2^b$ for some integers $a$ and $b$. Let us number the processors by $P(i, j)$, where $i$ is the cluster number and $j$ is the rank of the processor within its cluster. This division of processors is such that the $P_c$ processors in a cluster (i.e., $P(i, j)$ for fixed $i$) form a hypercube. Furthermore, the processors across clusters with a common rank (i.e., $P(i, j)$ for fixed $j$) also form a hypercube.
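The subcube property described above can be checked with a small sketch. The mapping used here (cluster number in the high bits of the hypercube node label, rank in the low bits) is an assumption made for illustration; the paper does not state the exact labelling, and `node_id` and `is_subcube` are hypothetical helper names.

```python
def node_id(i: int, j: int, b: int) -> int:
    """Hypothetical labelling: cluster number i in the high bits, rank j in the low b bits."""
    return (i << b) | j


def is_subcube(ids: list[int]) -> bool:
    """Return True if the labels are exactly the nodes of a subcube.

    A set of 2^k distinct labels is a k-dimensional subcube when the labels
    differ in exactly k bit positions (and agree on all other bits).
    """
    common_and = ids[0]
    common_or = ids[0]
    for x in ids:
        common_and &= x
        common_or |= x
    free_mask = common_and ^ common_or      # bit positions that vary within the set
    k = bin(free_mask).count("1")
    return len(set(ids)) == len(ids) == 2 ** k


a, b = 2, 3   # illustrative sizes: P_1 = 4 clusters of P_c = 8 processors
clusters_ok = all(
    is_subcube([node_id(i, j, b) for j in range(2 ** b)]) for i in range(2 ** a)
)
ranks_ok = all(
    is_subcube([node_id(i, j, b) for i in range(2 ** a)]) for j in range(2 ** b)
)
print(clusters_ok, ranks_ok)
```

Under this labelling, fixing $i$ leaves the low $b$ bits free and fixing $j$ leaves the high $a$ bits free, which is why both families of processors form subcubes.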
The learning patterns are divided into $P_1$ equal groups; i.e., the number of patterns in each cluster is $L_c = L/P_1$. Each cluster of $P_c$ processors divides the network in an identical manner, via the checkerboarding scheme, over a $\sqrt{P_c} \times \sqrt{P_c}$ grid imposed over the cluster. Each cluster carries out the forward and backward phases for the $L_c$ patterns local to the cluster, accumulating the weight changes for all the $JI^2$ weights in the network. The network is divided among these $\sqrt{P_c} \times \sqrt{P_c}$ processors exactly as in the checkerboarding scheme; hence each processor in a cluster keeps $\frac{JI^2}{P_c}$ weights and the corresponding weight changes.

The communication time for the forward phase is $J L_c \log(P_c) \left[ t_s + \frac{I}{\sqrt{P_c}} t_w \right]$. The communication time for the backward phase is the same. Therefore, the total communication cost for the forward and backward phases is $C_1 = 2 J L_c \log(P_c) \left[ t_s + \frac{I}{\sqrt{P_c}} t_w \right]$.

This phase is followed by the accumulation of weight changes across clusters. Each processor $P(i, j)$ communicates with the processors of the same rank, $j$, in the other clusters to accumulate the changes for its $\frac{JI^2}{P_c}$ weights. The changes are accumulated at one processor (say, in cluster 0) and then broadcast back to the respective processors, which then independently recompute the weight matrix for the next iteration. The communication time for this step is $C_2 = 2 \log(P_1) \left[ t_s + \frac{JI^2}{P_c} t_w \right]$. The total parallel computation time taken by each processor is $C_3 = \frac{LJI^2}{P} t_c$. Therefore, the total runtime for the hybrid scheme can be approximated as $C_1 + C_2 + C_3$.

The cost model for the hybrid scheme reduces to the cost model of pure pattern partitioning for $[P_1 = P, P_c = 1]$, since $\log(P_c) = 0$ and $C_1$ vanishes. It reduces to the cost model of pure checkerboarding for $[P_1 = 1, P_c = P]$, since $\log(P_1) = 0$ and $C_2$ vanishes. The runtime can be changed by varying the parameters $P_1$ and $P_c$ such that $P = P_1 P_c$, with $P_1 = 2^a$ and $P_c = 2^b$.
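As a rough illustration, the three cost components $C_1$, $C_2$ and $C_3$ can be evaluated for different splits of $P$ into $P_1 \times P_c$. This is a sketch only: the function name `hybrid_cost`, the base-2 logarithm, and the numeric constants are assumptions chosen for illustration, not measured nCUBE2 or CM5 values.

```python
import math


def hybrid_cost(L, J, I, P1, Pc, t_c, t_s, t_w):
    """Return (C1, C2, C3) for one iteration of the hybrid scheme.

    P1 is the number of clusters and Pc the number of processors per cluster.
    Logarithms are taken base 2 (an assumption matching hypercube dimensions).
    """
    P = P1 * Pc
    Lc = L / P1                                   # patterns per cluster
    C1 = 2 * J * Lc * math.log2(Pc) * (t_s + I / math.sqrt(Pc) * t_w)
    C2 = 2 * math.log2(P1) * (t_s + J * I**2 / Pc * t_w)
    C3 = L * J * I**2 / P * t_c
    return C1, C2, C3


# Hypothetical machine constants in arbitrary time units.
T_C, T_S, T_W = 1.0, 50.0, 2.0

for b in (0, 2, 4, 6, 8):                         # log2(Pc); log2(P1) = 8 - b
    P1, Pc = 2 ** (8 - b), 2 ** b
    c1, c2, c3 = hybrid_cost(256, 2, 64, P1, Pc, T_C, T_S, T_W)
    print(f"P1={P1:3d} Pc={Pc:3d}  C1={c1:10.1f}  C2={c2:10.1f}  C3={c3:8.1f}")
```

With $P_c = 1$ the factor $\log(P_c)$ is zero and $C_1$ vanishes; with $P_1 = 1$ the factor $\log(P_1)$ is zero and $C_2$ vanishes. $C_3$ is the same for every split of a given $P$.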
The hybrid scheme can use up to $LI^2$ processors, which is more than the number of processors that can be used by either the pattern-partitioning or the checkerboarding scheme. In practice, to ensure efficient use of the parallel computer, the number of processors should be no more than $\frac{L}{\log L} \left( \frac{I}{\log I} \right)^2$. If more processors are used, the increased communication overhead will nullify the gains due to the extra parallelism.
The hybrid scheme can mimic pure checkerboarding by choosing $P_1 = 1$, and can mimic pure pattern partitioning by choosing $P_c = 1$. Thus, the hybrid scheme can perform at least as well as either pure scheme. In addition, the hybrid scheme with $P \neq P_1$ and $P \neq P_c$ can outperform the pure schemes for a large class of problems. We now show these cases via algebraic and experimental methods. We summarize the terms for $t_c$, $t_s$ and $t_w$ in Table 6.

We note from Table 6 that the coefficient of $t_c$ is identical for the three schemes. The coefficient of $t_s$ is smallest for the pure pattern-partitioning scheme and largest for the pure checkerboarding scheme. Furthermore, note that the coefficients of $t_w$ are larger than the corresponding coefficients of $t_s$ for each scheme. The $t_w$ terms will dominate the $t_s$ terms if $JI^2 > \frac{I}{\sqrt{P}} > \frac{t_s}{t_w}$ and $\frac{JI^2}{P_c} > \frac{I}{\sqrt{P_c}} > \frac{t_s}{t_w}$.
The $t_w$ term will dominate the communication cost for large networks and for a large number of learning samples. To simplify the analysis, we compare only the $t_w$ terms of the alternative schemes in the following algebraic analysis. We now provide independent conditions under which the $t_w$ term for the hybrid scheme is smaller than that for the pure pattern-partitioning and pure checkerboarding schemes. The intersection of these conditions represents the region of dominance of the hybrid scheme over both pure schemes. The derivation of these conditions is provided in Appendix C.
The $t_w$ term of the hybrid scheme is smaller than the $t_w$ term of pattern partitioning if $L < I P_1 \sqrt{P_c}$.
This condition is met for recognition problems. It is also met for many generalization problems for a large value of $P$, or for large networks, or for small learning samples. Thus, we expect the hybrid scheme to outperform the pure pattern-partitioning scheme if a large number of processors is available, or if the network is large, or if the number of training patterns is small.

The $t_w$ term of the hybrid scheme is smaller than the $t_w$ term of pure checkerboarding if $L > I \frac{\sqrt{P_1}}{\sqrt{P_c}}$. This condition is met for generalization problems (e.g., with $P_1 = P_c$) and for some recognition problems.
Thus, we expect the hybrid scheme to outperform checkerboarding if a large number of processors is available, or if the network is small, or if the learning sample set is large. Finally, combining both inequalities, we get

$$I \frac{\sqrt{P_1}}{\sqrt{P_c}} < L < I P_1 \sqrt{P_c},$$

which is the region of dominance for the hybrid scheme over both the pure pattern-partitioning and the checkerboarding schemes. The condition holds for some values of $P_1$, $P_c$, and $P = P_1 P_c$, for given values of $L$ and $I$. For example, the hybrid scheme can outperform the pure schemes for problems with $L > I$ using $P_1^{3/2} = P^{3/4} > \frac{L}{I}$ with $P_1 = P_c = \sqrt{P}$. Note that this region is not the only region in which the hybrid scheme outperforms the other two schemes.
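The region of dominance can be tested numerically for a given configuration. The sketch below reads the two conditions above as $L < I P_1 \sqrt{P_c}$ and $L > I \sqrt{P_1} / \sqrt{P_c}$; the function name `hybrid_dominates` is hypothetical, and only the $t_w$ terms are compared, as in the algebraic analysis.

```python
import math


def hybrid_dominates(L: float, I: float, P1: int, Pc: int) -> bool:
    """True if the t_w term of the hybrid scheme is smaller than that of
    both pure schemes, according to the two conditions stated above."""
    lower = I * math.sqrt(P1) / math.sqrt(Pc)     # bound from pure checkerboarding
    upper = I * P1 * math.sqrt(Pc)                # bound from pure pattern partitioning
    return lower < L < upper


# Illustrative check: I = 64, L = 256, P = 256 split as 16 clusters of 16 processors.
print(hybrid_dominates(256, 64, 16, 16))
```

For $P_1 = P_c = 16$ and $I = 64$ the bounds are 64 and 4096, so any $L$ strictly between them falls in the region.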
Experimental Evaluation of Hybrid Scheme
Table 7 shows the speedups obtained for the pattern-partitioning, hybrid and checkerboarding schemes.
The results were obtained using a 256-processor nCUBE2 for two networks: $2 \times 64$ (i.e., $J = 2$, $I = 64$) and $2 \times 256$ (i.e., $J = 2$, $I = 256$). The second column shows the values of $\log P_1$ and $\log P_c$ in the format $(\log P_1, \log P_c)$. As explained before, $\log P_c = 0$ for pure pattern partitioning, and $\log P_1 = 0$ for pure checkerboarding. For the hybrid scheme, three combinations of $(\log P_1, \log P_c)$ were used: $(6, 2)$, $(4, 4)$ and $(2, 6)$. Note that all these hybrid combinations perform at least as well as the pattern-partitioning and the checkerboarding schemes, with some combinations doing significantly better.

Other network partitioning schemes can be combined with the pattern-partitioning scheme to form alternative hybrid schemes. In particular, vertical sectioning can be combined with pure pattern partitioning. The hybrid of checkerboarding with pure pattern partitioning can be shown to be better than the hybrid of vertical sectioning with pattern partitioning by extending the theoretical comparison discussed in section 4.2.

provides the highest speedups but poor processor utilization. On the other hand, choosing $\sqrt{P}$ based on $I_{\min} = \min[I_1, I_2, \ldots, I_J]$ provides high processor utilization but relatively smaller speedups. This issue arises when extending any network partitioning scheme to non-uniform networks. For example, the choice of the size ($P$) of the linear array in the vertical-sectioning scheme faces a similar problem. Choosing $P = I_{\max}$ improves speedups but yields poor processor utilization.
The actual choice of $P$ depends on the number of processors available and on the network size. For example, consider the mapping of a network of size $1024 \times 64 \times 4$ using our checkerboarding scheme. If $P = 4 \times 4$, then all the processors will do useful work, as the weight matrices of size $1024 \times 64$ and $64 \times 4$ can be easily mapped onto $4 \times 4$ processors. If $P = 8 \times 8$, then the weight matrix between the hidden layer and the output layer ($64 \times 4$) has to be padded with redundant entries to form a $64 \times 8$ matrix, causing redundant computation in the parallel formulation. In this case, the total amount of redundant computation is $\frac{64 \times 4}{1024 \times 64 + 64 \times 4} \approx 0.4\%$. If $P = 64 \times 64$, then the amount of redundant computation is $\frac{64 \times 60}{1024 \times 64 + 64 \times 4} \approx 6\%$.

Detailed algebraic analysis for non-uniform networks is given in Appendix D. As shown there, the relative performance of the vertical-sectioning, checkerboarding, pattern-partitioning and hybrid schemes remains the same. Checkerboarding can perform better than pattern partitioning if the network size (i.e., the total number of weights in the network) is sufficiently large. The hybrid of checkerboarding with pattern partitioning can perform better than either one.
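The padding arithmetic in the $1024 \times 64 \times 4$ example can be reproduced with a short sketch. It assumes that each weight matrix is padded so that both of its dimensions become multiples of the grid side $\sqrt{P}$; this padding rule and the function name `redundant_fraction` are assumptions made for illustration.

```python
import math


def redundant_fraction(layers: list[int], grid_side: int) -> float:
    """Fraction of redundant weight entries introduced by padding.

    layers    : number of nodes per layer, e.g. [1024, 64, 4]
    grid_side : sqrt(P), the side of the square processor grid
    The result is (padded entries - useful entries) / useful entries.
    """
    useful = 0
    padded = 0
    for rows, cols in zip(layers, layers[1:]):
        useful += rows * cols
        padded_rows = math.ceil(rows / grid_side) * grid_side
        padded_cols = math.ceil(cols / grid_side) * grid_side
        padded += padded_rows * padded_cols
    return (padded - useful) / useful


for side in (4, 8, 64):
    frac = redundant_fraction([1024, 64, 4], side)
    print(f"P = {side} x {side}: {100 * frac:.1f}% redundant computation")
```

For grid sides 8 and 64 this gives the same ratios as the two fractions in the example above, and no redundancy for a $4 \times 4$ grid.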
Experimental Evaluation
We experimentally tested the performance of our hybrid scheme for non-uniform networks on the nCUBE2 and CM5 parallel computers. CM5 contains up to 16384 processors connected via a pseudo fat-tree. Since the hypercube topology maps naturally onto a fat-tree (and vice versa), the checkerboarding and the hybrid schemes can be easily implemented on CM5.
The speedups obtained on nCUBE2 and CM5 for non-uniform networks are shown in Table 8.
These results were obtained for a network with $J = 2$ and $L = 256$. The first column of this table shows the network size in the form $I_0 \times I_1 \times I_2$, where $I_0$, $I_1$ and $I_2$ are the numbers of nodes in the input, first and output layers, respectively. As the hybrid scheme can give different speedups depending on the values of $\log P_1$ and $\log P_c$, we present in the fourth and sixth columns the combinations that gave the best speedups. The actual values of $\log P_1$ and $\log P_c$ are shown in square brackets in the form $[\log P_1, \log P_c]$. Again, note that the hybrid scheme performs at least as well as checkerboarding, and in some cases much better.
Scalability Analysis and Optimality Considerations
The scalability of an algorithm on a parallel architecture is determined by its capability to effectively utilize an increasing number of processors. For all four parallel BP schemes discussed, near-linear speedups can be obtained on an arbitrarily large number of processors, provided the network is large enough (for vertical sectioning, checkerboarding, and hybrid) or the number of patterns is large enough (for pattern partitioning and hybrid). In that sense, all the schemes are scalable. But the degrees of scalability of these schemes are actually quite different from each other. A number of metrics have been developed for characterizing the scalability of different parallel algorithms and architectures [35]. For fixed processor utilization, the rate of change of problem size as a function of the number of processors is a good characterization of the scalability of an algorithm [36, 37, 38, 39]. An algorithm that requires a smaller change in problem size to maintain a fixed efficiency is considered more scalable.
The problem size (i.e., serial run time) for each iteration of BP is $LJI^2 t_c$. For pattern partitioning, the problem size can be increased as $O(L)$; for checkerboarding and vertical sectioning, it can be increased as $O(I^2)$; and for the hybrid scheme, it can be increased as $O(LI^2)$.
The efficiency $E$ of a parallel algorithm is defined to be $E = \frac{S}{P} = \frac{\text{Serial Run Time}}{P \times \text{Parallel Run Time}}$, where $S$ is the speedup. By plugging the expressions for the parallel run time of the different algorithms into this expression, we can derive the following statements for the different schemes.
For vertical sectioning, fixed efficiency can be maintained if $I$ is increased as $O(P)$; i.e., if $I^2$ is increased as $O(P^2)$. For checkerboarding, fixed efficiency can be maintained if $I$ is increased as $O(\sqrt{P} \log P)$; i.e., if $I^2$ is increased as $O(P (\log P)^2)$. Clearly, checkerboarding is the more scalable of the two, as $O(P^2)$ grows much faster than $O(P (\log P)^2)$. For pattern partitioning, fixed efficiency can be maintained if $L$ is increased as $O(P \log P)$. Hence this scheme is quite scalable, provided the number of patterns is large. Note that the scalability expressions of checkerboarding (or vertical sectioning) and pattern partitioning cannot be compared directly, as the problem size is increased in a different way in the two cases.
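The growth rates above can be illustrated numerically. The sketch below evaluates the efficiency from the simple cost models for uniform networks, assuming base-2 logarithms and, for vertical sectioning, that $P$ divides $I$; the function names and machine constants are hypothetical and chosen only for illustration. $L$ and $J$ cancel in the efficiency ratio and therefore do not appear.

```python
import math


def eff_checkerboard(I, P, t_c, t_s, t_w):
    """Efficiency for checkerboarding: serial time / (P * parallel time)."""
    overhead = (2 * P * math.log2(P) * t_s / (I**2 * t_c)
                + 2 * math.sqrt(P) * math.log2(P) * t_w / (I * t_c))
    return 1.0 / (1.0 + overhead)


def eff_vertical(I, P, t_c, t_s, t_w):
    """Efficiency for vertical sectioning (assumes P divides I)."""
    overhead = 2 * P**2 * t_s / (I**2 * t_c) + 2 * P * t_w / (I * t_c)
    return 1.0 / (1.0 + overhead)


consts = dict(t_c=1.0, t_s=50.0, t_w=2.0)         # hypothetical, arbitrary units

for P in (2**4, 2**8, 2**12):
    I = math.sqrt(P) * math.log2(P)               # grow I as sqrt(P) * log P
    print(P,
          round(eff_checkerboard(I, P, **consts), 4),
          round(eff_vertical(I, P, **consts), 4))
```

When $I$ grows as $\sqrt{P} \log P$, the checkerboarding efficiency does not fall as $P$ increases, whereas the vertical-sectioning efficiency keeps dropping; the latter stays constant only if $I$ grows linearly in $P$.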
For the hybrid scheme, fixed efficiency can be maintained if $I$ is increased as $O(\sqrt{P} \log P)$ or if $L$ is increased as $O(P \log P)$. This scheme is more scalable than pattern partitioning and checkerboarding. To illustrate this, consider the special case in which $I = L$ and $J$ is a constant, so that the problem size is $O(I^3)$. For this case, $I$ should be increased as $O(\sqrt{P} \log P)$ for checkerboarding, as $O(P \log P)$ for pattern partitioning, and as $O(P^{1/3} \log P)$ for the hybrid scheme.

We also note that the checkerboarding scheme and its hybrid with pattern partitioning are optimal parallel formulations of backpropagation in the following sense. These schemes provide a parallel time close to the asymptotic lower bound on the parallel time for computing each iteration of BP on parallel machines in which the atomic operations of each processor are limited to binary arithmetic operations. In such an environment, the lower bound on the parallel computation time for evaluating an arithmetic expression with $N$ atomic operations is $\lceil \log_2(N) \rceil$ for any number of processors [40]. The number of arithmetic operations in the expression computed at the output nodes of a feed-forward neural network at the end of forward propagation is $\Theta(LI^2)$, assuming $J$ to be a constant. Thus, the lower bound on carrying out each iteration of BP in the set-training regime (see 2.1) is $\Omega(\log(L) + \log(I))$. Given enough processors ($P = LI^2$), a hybrid of checkerboarding with pattern partitioning ($P_c = I^2$, $P_1 = L$) can carry out each iteration of BP in $\Theta(\log(L) + \log(I))$ time. Using a similar argument, the checkerboarding scheme can be shown to be an optimal network-partitioning scheme for the per-pattern training regime.
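The growth rates quoted for the special case $I = L$ can be recovered with a short heuristic calculation. This is only a sketch: it ignores constant factors and treats $\log P_1$, $\log P_c$ and $\log P$ as being of the same order, which is an assumption rather than part of the original analysis.

```latex
\begin{align*}
\text{checkerboarding:} \quad & I = O\!\left(\sqrt{P}\,\log P\right), \\
\text{pattern partitioning:} \quad & L = I = O\!\left(P \log P\right), \\
\text{hybrid } (P = P_1 P_c): \quad
  & I = O\!\left(\sqrt{P_c}\,\log P_c\right)
    \;\Rightarrow\; P_c \approx \frac{I^2}{\log^2 P_c}, \\
  & L = I = O\!\left(P_1 \log P_1\right)
    \;\Rightarrow\; P_1 \approx \frac{I}{\log P_1}, \\
  & P = P_1 P_c \approx \frac{I^3}{\log^3 P}
    \;\Rightarrow\; I = O\!\left(P^{1/3} \log P\right).
\end{align*}
```

Because the hybrid scheme spreads the $O(I^3)$ work over both the pattern dimension and the two network dimensions, $I$ needs to grow only as roughly the cube root of $P$.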
Concluding Remarks
Our analytical and experimental results show that the checkerboarding network partitioning scheme is better than the commonly used vertical-sectioning scheme. It is also better than pattern partitioning for a large class of useful problems and a large number of processors. The checkerboarding scheme can be combined with pattern partitioning to provide very high overall performance on existing commercial parallel computers. For example, the observed value of $t_c$ on the currently available scalar processors of CM5 (in our non-optimized code) is 4.3 µs. The processing time per node for only the forward phase is around 1.38 µs. Hence, the non-optimized version of our hybrid parallel formulation on a 256-processor CM5 (with scalar processors) performs over 50 million weight changes or 160 million connections per second for the $1024 \times 256 \times 64$ network. With the upgrade of the processors to vector units, $t_c$ as well as the overall performance may improve by one or two orders of magnitude. The proposed schemes scale well with problem size and achieve asymptotically optimal parallel execution time for the backpropagation algorithm on uniform networks.
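As a sanity check on the quoted rates, the ideal peak throughput of 256 processors can be computed from the two per-operation times. This sketch assumes the quoted times are in microseconds and that each is the cost of one weight change (respectively, one forward connection) on one processor; both readings are assumptions.

```python
processors = 256
t_c = 4.3e-6          # seconds per weight change (value quoted above, read as microseconds)
t_forward = 1.38e-6   # seconds per connection, forward phase only (same assumption)

# Upper bounds at perfect (100%) efficiency: every processor busy all the time.
peak_weight_changes = processors / t_c
peak_connections = processors / t_forward

print(f"{peak_weight_changes / 1e6:.1f} million weight changes per second (ideal)")
print(f"{peak_connections / 1e6:.1f} million connections per second (ideal)")
```

The ideal figures come out at roughly 60 million weight changes and 186 million connections per second, so the reported rates of over 50 million and 160 million are consistent with a high, but not perfect, parallel efficiency.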
The parallelization schemes for non-uniform networks need further exploration. We plan to explore the trade-off between processor utilization and speedup in our future work. We would also like to generalize our schemes by utilizing different lengths and breadths of processor grids for different layers of the network. It would also be interesting to compare the parallelization techniques for randomly sparse networks [7, 26, 27, 28, 29] with our scheme for non-uniform networks.
Appendix A Backpropagation Algorithm
The details of BP are shown in Figure 6. The symbol $f$ denotes the non-linear sigmoidal function used to compute the activation of each node. The thresholds associated with the nodes of the neural network have been ignored in this description of BP. Furthermore, the algorithm describes the set-training regime. Per-pattern training regimes would require the weight adjustments in step 3 to be inside the foreach loop.

Weighted summation over a tree: The checkerboarding scheme imposes balanced trees over the processors in each row and each column of the processor grid. These trees are used to compute the weighted sums needed for the activation and error vectors. The processors at the leaves of the tree do not participate in this computation. In contrast, the diagonal processor (i.e., the root of the tree) has to carry out additions for the $\frac{I}{\sqrt{P}}$ nodes in each of the $\log(\sqrt{P})$ levels of the tree. Let $t_{c3}$ denote the time for per-node computation related to summation. The parallel run time to compute the summation is $LJI \frac{\log P}{\sqrt{P}} t_{c3}$ for the forward and backward phases.
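The level count in the tree summation can be illustrated with a small serial simulation. This is a sketch under stated assumptions: it models the $\sqrt{P}$ processors of one grid row as a list of partial-sum vectors and combines them pairwise; the function name `tree_sum` is hypothetical, and the pairing order need not match the actual hypercube implementation.

```python
def tree_sum(partials):
    """Combine per-processor partial-sum vectors pairwise over a balanced tree.

    partials : list of equal-length vectors, one per processor in a row;
               the number of processors must be a power of two.
    Returns (total vector, number of tree levels used).
    """
    vals = [list(v) for v in partials]
    levels = 0
    while len(vals) > 1:
        # At each level, adjacent pairs are added element-wise.
        vals = [
            [x + y for x, y in zip(vals[k], vals[k + 1])]
            for k in range(0, len(vals), 2)
        ]
        levels += 1
    return vals[0], levels


# Four processors in a row (sqrt(P) = 4), each holding two partial sums.
total, levels = tree_sum([[1, 2], [3, 4], [5, 6], [7, 8]])
print(total, levels)
```

With $\sqrt{P} = 4$ processors in a row the reduction takes $\log_2(\sqrt{P}) = 2$ levels, and at every level the surviving processors add vectors of length $I/\sqrt{P}$ (2 in this toy case).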
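The revised parallel time of the checkerboarding scheme, $T^R_{par}$, is

$$T^R_{par} = \left( LJ \frac{I^2}{P} t_c + LJ \frac{I}{\sqrt{P}} t_{c2} + LJI \frac{\log P}{\sqrt{P}} t_{c3} \right) + 2LJ \log P \, t_s + 2LJI \frac{\log P}{\sqrt{P}} t_w$$

and the speedup is

$$S = \frac{LJI^2 t_c}{T^R_{par}}.$$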
In our implementation of checkerboarding on nCUBE2, $t_{c2}$ is approximately 60 microseconds and $t_{c3}$ is approximately 0.4 microseconds. With the addition of floating-point processors, these numbers will become smaller. For large networks, the detailed model reduces to the simple model used for the theoretical analysis of checkerboarding, for the following reasons. The coefficient of $t_{c2}$ becomes insignificant with respect to that of $t_c$ as the network size ($I$) increases. This term can be ignored for a large network without compromising accuracy. In addition, the coefficient of $t_{c3}$ is proportional to the coefficient of $t_w$. Thus, the cost associated with $t_{c3}$ can be merged with the variable cost of sending messages.
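The detailed model can be evaluated directly. In the sketch below, $t_{c2}$ and $t_{c3}$ use the approximate nCUBE2 values quoted above, while the values of $t_c$, $t_s$ and $t_w$ are hypothetical placeholders, since the measured constants are not reproduced in this section. Base-2 logarithms and the function names are also assumptions.

```python
import math


def revised_parallel_time(L, J, I, P, t_c, t_c2, t_c3, t_s, t_w):
    """Parallel time of checkerboarding under the detailed cost model."""
    log_p = math.log2(P)
    sqrt_p = math.sqrt(P)
    compute = (L * J * I**2 / P * t_c            # weight-level computation
               + L * J * I / sqrt_p * t_c2       # per-node computation
               + L * J * I * log_p / sqrt_p * t_c3)  # tree summation
    comm = 2 * L * J * log_p * t_s + 2 * L * J * I * log_p / sqrt_p * t_w
    return compute + comm


def revised_speedup(L, J, I, P, t_c, t_c2, t_c3, t_s, t_w):
    """Speedup = serial time (L*J*I^2*t_c) / revised parallel time."""
    serial = L * J * I**2 * t_c
    return serial / revised_parallel_time(L, J, I, P, t_c, t_c2, t_c3, t_s, t_w)


params = dict(L=256, J=2, P=256,
              t_c=5e-6, t_s=200e-6, t_w=2e-6,    # hypothetical placeholders (seconds)
              t_c2=60e-6, t_c3=0.4e-6)           # approximate values quoted above

for I in (64, 256, 1024, 4096):
    print(I, round(revised_speedup(I=I, **params), 1))
```

As $I$ grows, the $t_{c2}$ term shrinks relative to the $t_c$ term and the speedup approaches $P$, in line with the argument that the detailed model reduces to the simple one for large networks.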
Appendix D Theoretical Analysis of Non-uniform Networks
To simplify the algebraic analysis, we compare the speedups of the alternative network partitioning schemes under the constraint of high processor utilization. High processor utilization refers to the level of processor utilization attained by these schemes for the case of uniform networks. The algebraic analysis can be generalized to other cases, with added complexity for considering the trade-off between speedup and processor utilization.

The total cost of one iteration of checkerboarding can be approximated as $\frac{LW}{P} t_c + 2JL \log(P) t_s + 2RL \log(P) t_w$, where $W = I_0 I_1 + I_1 I_2 + \cdots + I_{J-1} I_J$ and $R = \lceil \frac{I_1}{\sqrt{P}} \rceil + \cdots + \lceil \frac{I_J}{\sqrt{P}} \rceil$. The cost model for an iteration of vertical-sectioning training can similarly be generalized to $\frac{LW}{P} t_c + 2JLP t_s + 2SL t_w$, where $S = P \left[ \lceil \frac{I_1}{P} \rceil + \cdots + \lceil \frac{I_J}{P} \rceil \right]$. Note that the $t_s$ and $t_w$ terms are smaller for checkerboarding. The $t_w$ term for checkerboarding is better for $\frac{R}{S} \log(P) < 1$. This condition is true if $\frac{\log(P)}{\sqrt{P}} < 1$; i.e., $P > 16$. Thus, all cost components of checkerboarding are strictly better than those of vertical sectioning for a large number of processors. Overall, checkerboarding will perform strictly better than the vertical-sectioning scheme.

The hybrid checkerboarding and pattern-partitioning schemes can be extended to non-uniform networks. The pattern-partitioning scheme needs no modifications for a non-uniform network. The hybrid scheme needs modifications in its component related to checkerboarding. The costs for the various schemes are provided in Table 9.
Scheme | Cost
--- | ---
Vertical Sectioning | $\frac{LW}{P} t_c + 2JLP t_s + 2SL t_w$
Checkerboarding | $\frac{LW}{P} t_c + 2JL \log(P) t_s + 2LR \log(P) t_w$
Pattern Partitioning | $\frac{LW}{P} t_c + 2 \log(P) t_s + 2W \log(P) t_w$
Hybrid | $\frac{LW}{P} t_c + t_s \left[ 2JL_c \log(P_c) + 2 \log(P_1) \right] + t_w \left[ 2L_c R \log(P_c) + 2W \log(P_1) \right]$
where $W$ = number of weights in the network can be used to simplify these expressions. Checkerboarding will be preferred if $L \leq \frac{W}{R} = \frac{\#\text{weights}}{\#\text{nodes}} \sqrt{P}$, which will be the case in recognition problems. The hybrid scheme will be better than pure pattern partitioning for $W > \frac{L}{P_1} R$. The hybrid scheme will be better than the checkerboarding scheme for $L > \frac{W}{R P_c}$, or for $W < L R P_c$. This can be derived via algebraic manipulations similar to those used in Appendix C. These conditions reduce to the corresponding conditions for a uniform network by substituting $W = JI^2$ and $R = \frac{JI}{\sqrt{P}}$.
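The quantities $W$, $R$ and $S$ and the pure-scheme costs can be computed for a concrete layer list. The sketch below reads $R$ as a sum of $\lceil I_j / \sqrt{P} \rceil$ terms (consistent with the uniform-network substitution $R = JI/\sqrt{P}$) and $S$ as $P$ times a sum of $\lceil I_j / P \rceil$ terms. It assumes $P$ is a perfect square and uses base-2 logarithms; the function names and machine constants are hypothetical.

```python
import math


def ceil_div(n: int, d: int) -> int:
    """Integer ceiling of n / d."""
    return -(-n // d)


def network_terms(layers, P):
    """Return (J, W, R, S) for a layer list [I_0, I_1, ..., I_J]; P must be a perfect square."""
    J = len(layers) - 1
    W = sum(a * b for a, b in zip(layers, layers[1:]))     # total number of weights
    side = math.isqrt(P)                                   # sqrt(P)
    R = sum(ceil_div(n, side) for n in layers[1:])
    S = P * sum(ceil_div(n, P) for n in layers[1:])
    return J, W, R, S


def pure_scheme_costs(layers, L, P, t_c, t_s, t_w):
    """Per-iteration cost of the three pure schemes, following the cost table above."""
    J, W, R, S = network_terms(layers, P)
    log_p = math.log2(P)
    compute = L * W / P * t_c
    return {
        "vertical sectioning": compute + 2 * J * L * P * t_s + 2 * S * L * t_w,
        "checkerboarding": compute + 2 * J * L * log_p * t_s + 2 * L * R * log_p * t_w,
        "pattern partitioning": compute + 2 * log_p * t_s + 2 * W * log_p * t_w,
    }


print(network_terms([1024, 256, 64], 64))
print(pure_scheme_costs([1024, 256, 64], L=256, P=64, t_c=1.0, t_s=50.0, t_w=2.0))
```

For the $1024 \times 256 \times 64$ network on 64 processors, $\frac{R}{S} \log(P) = \frac{40}{320} \times 6 = 0.75 < 1$, so the $t_w$ term of checkerboarding is smaller than that of vertical sectioning, as the analysis predicts for $P > 16$.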
diagonal processors apply sigmoidal functions to the weighted sum. By repeating these steps for each layer, forward propagation is completed. The Backpropagation of error is carried out in a similar fashion. Deltas are computed on all processes in parallel. Each process has the relevant activation vector elements and error vector elements to compute the delta for the local weights. The deltas are accumulated over patterns in the outermost sequential for loop over patterns. Weights are adjusted at the end of the iteration. 
