Abstract. The development of a parallel algorithm for batch pattern training of a multilayer perceptron with the back propagation algorithm and the research of its efficiency on a general-purpose parallel computer are presented in this paper. The multilayer perceptron model and the usual sequential batch pattern training algorithm are theoretically described. An algorithmic description of the parallel version of the batch pattern training method is introduced. The efficiency of the developed parallel algorithm is investigated by progressively increasing the dimension of the parallelized problem on a general-purpose parallel computer NEC TX-7. A minimal architecture for the multilayer perceptron and its training parameters for an efficient parallelization are given.
Introduction
Artificial neural networks (NNs) have excellent abilities to model difficult nonlinear systems. They represent a very good alternative to traditional methods for solving complex problems in many fields, including image processing, prediction, pattern recognition, robotics and optimization [1]. However, most NN models impose a high computational load, especially in the training phase, which can take days or even weeks. This is the main obstacle to the efficient use of NNs in real-world applications. Taking into account the parallel nature of NNs, many researchers have already focused their attention on their parallelization [2] [3] [4]. Most of the existing parallelization approaches are based on specialized computing hardware and transputers, which can perform the specific neural operations more quickly than general-purpose parallel and high performance computers. However, computational clusters and Grids have gained tremendous popularity in computational science during the last decade [5]. Computational Grids are considered heterogeneous systems, which may include high performance computers with parallel architecture and computational clusters based on standard PCs. Therefore, existing solutions for NN parallelization on transputer architectures should be redesigned, and their parallelization efficiency should be explored on general-purpose parallel and high performance computers in order to provide efficient usage within computational Grid systems.
Many researchers have already developed parallel algorithms for NN training on the weight (connection), neuron (node), training set (pattern) and modular levels [6] [7] [8] [9] [10]. The first two levels are fine-grain parallelism and the last two are coarse-grain parallelism. Connection parallelism (parallel execution on sets of weights) and node parallelism (parallel execution of operations on sets of neurons) are not efficient on a general-purpose high performance computer because of the high synchronization and communication overhead among parallel processors [10]. Therefore, the coarse-grain approaches of pattern and modular parallelism should be used to parallelize NN training on general-purpose parallel computers and computational Grids [9]. For example, one of the existing implementations of the batch pattern back propagation (BP) training algorithm [6] reaches an efficiency of 80% on 10 processors of the TMB08 transputer. However, the efficiency of this algorithm on general-purpose high performance computers has not been researched yet.
The goal of this paper is to research the parallelization efficiency of the parallel batch pattern BP training algorithm on a general-purpose parallel computer in order to form recommendations for the further usage of this algorithm on heterogeneous Grid systems.
Architecture of Multilayer Perceptron and Batch Pattern Training Algorithm
It is expedient to research the parallelization of the multilayer perceptron (MLP) because this kind of NN has the advantage of being simple and provides good generalizing properties. Therefore it is often used for many practical tasks including prediction, recognition, optimization and control [1]. However, parallelization of an MLP with the standard sequential BP training algorithm is not efficient because of the high synchronization and communication overhead among parallel processors [10]. Therefore it is expedient to use the batch pattern training algorithm, which updates the neurons' weights and thresholds at the end of each training epoch, i.e. after the presentation of all the input and output training patterns, instead of updating weights and thresholds after the presentation of each pattern as in the usual sequential training mode. The output value of a three-layer perceptron (Fig. 1) can be formulated as:

$$y = F_3\left(\sum_{j=1}^{N} w_{j3}\, F_2\left(\sum_{i} w_{ij} x_i - T_j\right) - T\right),$$

where $N$ is the number of neurons in the hidden layer, $w_{j3}$ is the weight of the synapse from neuron $j$ of the hidden layer to the output neuron, $w_{ij}$ are the weights from the input neurons to neuron $j$ in the hidden layer, $x_i$ are the input values, $T_j$ are the thresholds of the neurons of the hidden layer and $T$ is the threshold of the output neuron [1, 11]. In this study the logistic activation function $F(x) = 1/(1 + e^{-x})$ is used for the neurons of the hidden ($F_2$) and output ($F_3$) layers, but in the general case these activation functions could be different.
The batch pattern BP training algorithm consists of the following steps [11]:
4. Repeat step 3 above for each training pattern $pt$, where $pt \in \{1, \dots, PT\}$ and $PT$ is the size of the training set;
5. Update the weights and thresholds of neurons using the sums $s\Delta w_{ij}$ and $s\Delta T_j$ accumulated over all training patterns.
Parallel Batch Pattern Back Propagation Training Algorithm
It follows from the analysis of the batch pattern BP training algorithm in Section 2 above that the sequential execution of points 3.1-3.5 for all training patterns in the training set can be parallelized, because the sum operations $s\Delta w_{ij}$ and $s\Delta T_j$ are independent of each other. To develop the parallel algorithm, it is necessary to divide all the computational work between the Master processor (which executes both assigning functions and calculations) and the Slave processors (which execute only calculations).
The algorithms of the Master and Slave processors are depicted in Fig. 2. The Master starts by defining (i) the number of patterns PT in the training data set and (ii) the number of processors p used for the parallel execution of the training algorithm. The Master divides all patterns into equal parts corresponding to the number of Slaves and assigns one part of the patterns to itself. Then the Master sends to the Slaves the numbers of the patterns they should train on.
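A possible way to compute the share of patterns for each processor is sketched below in C. The function name `pattern_slice` and the rule for distributing leftover patterns when PT is not divisible by p are assumptions for illustration; the text above only states that the patterns are divided into equal parts.

```c
#include <stddef.h>

/*
 * Compute the contiguous range of patterns handled by processor `rank`
 * (0 for the Master, 1..p-1 for the Slaves) out of `p` processors.
 * Assumption of this sketch: when PT is not divisible by p, the first
 * PT % p processors receive one extra pattern each.
 */
void pattern_slice(size_t PT, size_t p, size_t rank,
                   size_t *first, size_t *count) {
    size_t base = PT / p;
    size_t extra = PT % p;  /* leftover patterns to spread out */
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

For example, 794 patterns on 4 processors yield shares of 199, 199, 198 and 198 patterns, so the load differs by at most one pattern between processors.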
Each Slave executes the following operations for each pattern pt among the PT/p patterns assigned to it:
• calculate the points 3.1-3. according to the point 5 of the algorithm. These updated weights and thresholds will be used in the next iteration of the training algorithm. As the summed value of $E(t)$ is also received as a result of the reduce operation, the Master decides whether to continue the training or not.
The software routine is developed using the C programming language with the standard MPI library. The parallel part of the algorithm starts with the call of the MPI_Init() function. The parallel processors use the synchronization point MPI_Barrier(). 
Experimental Researches
Our experiments were carried out on the parallel supercomputer NEC TX-7, located in the Center of Excellence of High Performance Computing, University of Calabria, Italy (www.hpcc.unical.it). The NEC TX-7 consists of 4 identical units. Each unit has 4 GB of RAM and four 64-bit Intel Itanium2 processors with a clock rate of 1 GHz. This 16-processor computer with 64 GB of total RAM has a peak performance of 64 GFLOPS. The NEC TX-7 runs the Linux operating system.
As shown in [12], the parallelization efficiency of the parallel batch pattern BP algorithm for the MLP does not depend on the number of training epochs. The parallelization efficiencies of this algorithm are 95%, 84% and 63% on 2, 4 and 8 processors, respectively, of the general-purpose NEC TX-7 parallel computer for a 5-10-1 MLP with 794 training patterns and a number of training epochs increasing from $10^4$ to $10^6$. As shown in [7], parameters such as the number of training patterns and the number of adjustable connections of the NN (the number of weights and thresholds) define the computational complexity of the training algorithm and, therefore, influence its parallelization efficiency. The efficiency research scenarios should therefore be based on these parameters. The purpose of our experimental research is thus to answer the question: what is the minimal sufficient number of MLP connections and what is the minimal sufficient number of training patterns in the input data set for the parallelization of the batch pattern BP training algorithm to be efficient on a general-purpose high performance computer?
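The efficiency figures quoted above can be related to execution times with a short C sketch. The text does not spell out its definition of parallelization efficiency, so the sketch assumes the conventional one: speedup is the sequential time divided by the parallel time, and efficiency is the speedup divided by the number of processors. The function names are hypothetical.

```c
/* Speedup: sequential execution time divided by parallel execution time. */
double speedup(double t_seq, double t_par) {
    return t_seq / t_par;
}

/* Efficiency as a fraction in (0, 1]: speedup divided by the number of
 * processors p. This conventional definition is assumed for the sketch. */
double efficiency(double t_seq, double t_par, int p) {
    return speedup(t_seq, t_par) / p;
}
```

Under this definition, an efficiency of 63% on 8 processors would correspond to a speedup of roughly 5 over the sequential run.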
The following architectures of MLP are researched in order to provide the analysis of efficiency: 3-3-1 (3 input neurons × 3 hidden neurons = 9 weights between the input and the hidden layer + 3 weights between the hidden and the output layer (Fig. 3 or Fig. 4 or Fig. 5), (ii) then to choose the curve that characterizes the necessary number of the perceptron's connections, and (iii) then to read the value of parallelization efficiency from the ordinate axis that corresponds to the necessary number of training patterns on the abscissa axis. For example, the parallelization efficiency of the MLP 5-5-1 (36 connections) is 65% with 500 training patterns on 4 processors of the NEC TX-7 (see Fig. 4). The presented curves are therefore approximate characteristics of the parallelization efficiency of a given MLP architecture on a given number of processors of a general-purpose parallel computer.
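The number of connections used to label the scenarios can be computed with a one-line C function. The function name `mlp_connections` is hypothetical, and the count assumes, as stated in this paper, that connections include both weights and thresholds of a three-layer perceptron.

```c
/*
 * Number of adjustable connections (weights plus thresholds) of a
 * three-layer perceptron with the given layer sizes:
 *   inputs * hidden   weights between the input and the hidden layer
 *   hidden * outputs  weights between the hidden and the output layer
 *   hidden + outputs  thresholds of the hidden and output neurons
 */
unsigned mlp_connections(unsigned inputs, unsigned hidden, unsigned outputs) {
    return inputs * hidden + hidden * outputs + hidden + outputs;
}
```

This reproduces the 36 connections quoted for the 5-5-1 model, gives 16 for the 3-3-1 model, and gives 71 for the 5-10-1 model mentioned earlier.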
As seen from Figs. 3-5, the parallelization efficiency increases when the number of connections and the number of training patterns increase. However, for the same scenario the parallelization efficiency decreases as the number of parallel processors grows from 2 to 8. The analysis of Figs. 3-5 allows us to define the minimum number of training patterns necessary for efficient parallelization of the batch pattern training algorithm at a given number of MLP connections (Table 1). For example, Table 1 shows that the number of training patterns should be 100 or more (100+) for efficient parallelization of an MLP with more than 16 and at most 36 connections. As seen from Table 1, more training patterns are needed in the case of small MLP architectures. The minimum number of training patterns increases when parallelizing on a larger number of processors.
Table 1. Minimum number of training patterns for efficient parallelization on NEC TX-7.
Conclusions
The parallel batch pattern back propagation training algorithm of the multilayer perceptron is developed in this paper. The analysis of parallelization efficiency is done for 7 scenarios with an increasing number of the perceptron's connections (number of weights and thresholds), in particular 16, 36, 71, 121, 181, 256 and 441, and an increasing number of training patterns, in particular 25, 50, 75, 100, 200, 400, 600 and 800. The presented results can be used to estimate the parallelization efficiency of a concrete perceptron model with a concrete number of training patterns on a given number of parallel processors of a general-purpose parallel computer. The experimental research shows that the parallelization efficiency of the batch pattern back propagation training algorithm (i) increases with the number of connections and the number of training patterns and (ii) decreases for the same scenario when the number of parallel processors grows from 2 to 8. The analysis of the minimum number of training patterns for efficient parallelization of this algorithm shows that (i) more training patterns are needed in the case of small multilayer perceptron architectures and (ii) the minimum number of training patterns should be increased when parallelizing on a larger number of parallel processors. The achieved level of parallelization efficiency is sufficient for using this parallel algorithm in a Grid environment on general-purpose parallel and high performance computers. For future research it is expedient to estimate the factors that decrease the parallelization efficiency of the batch pattern back propagation training algorithm at a small number of training patterns and a small number of adjustable connections of the multilayer perceptron.
