The objective of this work is speaker-independent recognition of the vowels of British English. Backpropagation (BP) is one of the simplest and most widely used methods for supervised training of multilayer neural networks. In this paper we use a parallel implementation of BP on a Master-Slave architecture to recognize eleven speaker-independent steady-state vowels of British English. We perform the recognition task with both the sequential and the parallel implementation. The performance parameters (speed-up, optimal number of processors and processing time) are evaluated for both implementations.
INTRODUCTION
Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format. The first serious speech recognizer was developed in 1952 by Davis, Biddulph, and Balashek of Bell Labs. The machine takes a human utterance as input and returns a string of words, phrases or continuous speech in the form of text as output. The conventional methods of speech recognition consist of representing each word by its feature vector and pattern matching against the statistically available vectors using an HMM or a neural network [1], [2]. In this work we develop a neural network model that learns in the least possible time and that can recognize speaker-independent utterances even when the utterance is noisy.
The Connectionist Bench (Vowel Recognition - Deterding Data) data set from the UCI Machine Learning Repository is used to test our recognition system, which is based on a Backpropagation Neural Network implemented in parallel on a Master-Slave architecture. The dataset consists of a three-dimensional array: vowel data [speaker, vowel, input]. The speakers are indexed by integers 0-89. (Actually, there are fifteen individual speakers, each saying each vowel six times.) The vowels are indexed by integers 0-10. For each utterance, there are ten floating-point input values, with array indices 0-9. The data from "speakers" 0-47 is used to train the network and the data from the remaining "speakers" 48-89 is used for testing [3], [4].
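The speaker-based split described above can be sketched as follows. This is a minimal illustration, not the code used in the paper: the nested-list layout `vowel_data[speaker][vowel]`, the function name `split_by_speaker` and the zero-filled placeholder data are assumptions made for the example.

```python
# Assumed layout: vowel_data[speaker][vowel] is a list of 10 floats.
N_SPEAKERS, N_VOWELS, N_INPUTS = 90, 11, 10
TRAIN_SPEAKERS = range(0, 48)    # "speakers" 0-47
TEST_SPEAKERS = range(48, 90)    # "speakers" 48-89


def split_by_speaker(vowel_data):
    """Flatten the [speaker, vowel, input] array into (features, label) pairs."""
    train, test = [], []
    for speaker in range(N_SPEAKERS):
        target = train if speaker in TRAIN_SPEAKERS else test
        for vowel in range(N_VOWELS):
            target.append((list(vowel_data[speaker][vowel]), vowel))
    return train, test


# Placeholder zeros stand in for the real input values.
dummy = [[[0.0] * N_INPUTS for _ in range(N_VOWELS)] for _ in range(N_SPEAKERS)]
train_set, test_set = split_by_speaker(dummy)
print(len(train_set), len(test_set))  # prints 528 462
```

With 48 training "speakers" and 42 test "speakers", each contributing 11 vowels, the split yields 528 training patterns and 462 test patterns.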
Backpropagation (BP) is one of the most widely used ANN models, although it takes a long time to learn some applications [5]. A lot of research is being carried out to reduce the learning time of BP. One approach to minimizing the learning time is to implement the algorithm in parallel on multiple systems [6]. Implementations on commercially available dedicated parallel machines have been developed. The Backpropagation Network (BPN) can be parallelized by partitioning the patterns, by partitioning the network, or by a combination of both. Several researchers have developed parallel algorithms for different types of parallel computers, such as transputers, systolic arrays, multiple bus systems and hypercubes, for a diverse range of applications [7] [8] [9] [10]. As the cost of dedicated parallel machines is high, existing equipment in standard computer labs is an economical alternative for a wide range of engineering applications. The most important feature is that these resources can be shared with the other applications that require them.
In our model we have mapped each neuron of the network to a processor so that the parallel computer becomes a physical model of the network. The proposed parallel implementation avoids re-computation of weights and requires fewer communication cycles per pattern. The amount of data communicated among the processors in the computing network is also smaller. We obtain performance parameters such as speed-up, optimal number of processors and processing time for both the sequential implementation and the parallel implementation on the Master-Slave architecture.
MATHEMATICAL MODEL
In this paper, we consider a BPN with three layers ($l = 0, 1, 2$) having $N_l$ neurons in layer $l$. We trained the network using both the sequential and the parallel algorithm; the respective processing times are discussed in the following sections.
Sequential Backpropagation Algorithm
The BP algorithm [5] is a supervised learning algorithm used to find suitable weights such that, for a given input pattern ($X^0$), the network output ($y_i^2$) matches the target output ($t_i$). The algorithm is divided into two phases, namely the forward phase and the error backpropagation phase. The error backpropagation phase also includes the weight updates. The details of each of these phases and the time taken to process them are discussed below.
Step 0: Initialize weights (set to small random values)
Step 1: While stopping condition is false do steps 2-9
Step 2: For each training pair do steps 3-8
International Journal of Computer Applications (0975 -888) Volume 48-No.3, June 2012
Feed forward
Step 3: Each input unit $x_k = y_k^0$ ($k = 1, \dots, N_0$) receives the input signal $x_k$ and broadcasts this signal to all units in the layer above (the hidden units).
Step 4: Each hidden unit $y_j^1$ ($j = 1, \dots, N_1$) sums its weighted input signals and context signals, then applies the activation function to compute its output signal, $y_j^1 = f\left(\sum_{k=0}^{N_0} w_{jk}^1 y_k^0\right)$, where $y_0^0 = 1$ and $w_{j0}^1$ is the bias, and sends this signal to all units in the layer above.
Step 5: Each output unit $y_i^2$ ($i = 1, \dots, N_2$) sums its weighted input signals and applies the activation function to compute its output signal, $y_i^2 = f\left(\sum_{j=0}^{N_1} w_{ij}^2 y_j^1\right)$, where $y_0^1 = 1$ and $w_{i0}^2$ is the bias.
Back propagation
Step 6: Each output unit $y_i^2$ ($i = 1, \dots, N_2$) receives a target pattern corresponding to the input training pattern and computes the error $e_i$ and the error information term $\delta_i^2$, where $e_i = t_i - y_i^2$ and $\delta_i^2 = e_i f'\left(\sum_{j=0}^{N_1} w_{ij}^2 y_j^1\right)$. It calculates its weight correction terms $\Delta w_{ij}^2 = \eta \delta_i^2 y_j^1$, $j = 1, \dots, N_1$, and sends $\delta_i^2$ ($i = 1, \dots, N_2$) to the units in the layer below.
Step 7: Each hidden unit $y_j^1$ ($j = 1, \dots, N_1$) sums its delta inputs from the units in the layer above to find the corresponding error information term $\delta_j^1 = f'\left(\sum_{k=0}^{N_0} w_{jk}^1 y_k^0\right) \sum_{i=1}^{N_2} \delta_i^2 w_{ij}^2$, and calculates its weight correction terms $\Delta w_{jk}^1 = \eta \delta_j^1 y_k^0$, where $j = 1, \dots, N_1$ and $k = 1, \dots, N_0$.
Step 8: Each output unit $y_i^2$ ($i = 1, \dots, N_2$) updates its weights, $w_{ij}^2 = w_{ij}^2 + \eta \delta_i^2 y_j^1$, $j = 1, \dots, N_1$. Similarly, each hidden unit $y_j^1$ ($j = 1, \dots, N_1$) updates its weights, $w_{jk}^1 = w_{jk}^1 + \eta \delta_j^1 y_k^0$, $k = 1, \dots, N_0$.
Step 9: Test the stopping condition.
Let $t_m$, $t_a$ and $t_{ac}$ be the time taken for one floating-point multiplication, one addition and one calculation of the activation value, respectively. The time taken to complete the forward phase ($T_1$) is given by $T_1 = N_1 (N_0 + N_2) M + (N_1 + N_2) t_{ac}$, where $M = t_a + t_m$.
The time taken to complete the error backpropagation phase is represented by $T_2$ and is calculated as $T_2 = (1 + N_1) N_2 M + (N_1 + N_2) t_{ac}$.
The time taken to update the weight matrix between the three layers is represented by T3 and it is equal to 
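The sequential algorithm above can be sketched for a single training pattern as follows. This is an illustrative sketch, not the authors' implementation. It assumes a logistic sigmoid activation, so that $f'(net) = y(1 - y)$, and it takes the error as target minus output so that adding $\eta \delta y$ to a weight reduces the error. The helper names (`init_weights`, `forward`, `train_step`), the weight range and the random test pattern are hypothetical; the layer sizes and learning rate follow the values reported in the experimental section.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(n_out, n_in, rng):
    # Column 0 of every row holds the bias weight.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(w1, w2, x):
    y0 = [1.0] + list(x)  # constant 1 feeds the hidden-layer bias
    y1 = [1.0] + [sigmoid(sum(w * y for w, y in zip(row, y0))) for row in w1]
    y2 = [sigmoid(sum(w * y for w, y in zip(row, y1))) for row in w2]
    return y0, y1, y2


def train_step(w1, w2, x, t, eta):
    """One forward pass, one backward pass and one weight update.
    Returns the squared error measured before the update."""
    y0, y1, y2 = forward(w1, w2, x)
    # Output-layer error information terms.
    d2 = [(ti - yi) * yi * (1.0 - yi) for ti, yi in zip(t, y2)]
    # Hidden-layer terms, computed with the old output weights.
    d1 = [y1[j] * (1.0 - y1[j]) * sum(d2[i] * w2[i][j] for i in range(len(w2)))
          for j in range(1, len(y1))]
    for i, row in enumerate(w2):
        for j in range(len(row)):
            row[j] += eta * d2[i] * y1[j]
    for j, row in enumerate(w1):
        for k in range(len(row)):
            row[k] += eta * d1[j] * y0[k]
    return 0.5 * sum((ti - yi) ** 2 for ti, yi in zip(t, y2))


rng = random.Random(0)
w1 = init_weights(8, 10, rng)   # 10 inputs -> 8 hidden units
w2 = init_weights(3, 8, rng)    # 8 hidden units -> 3 outputs
x = [rng.uniform(-1.0, 1.0) for _ in range(10)]  # made-up input pattern
t = [1.0, 0.0, 0.0]                              # made-up target pattern
errors = [train_step(w1, w2, x, t, eta=0.2) for _ in range(200)]
print(errors[0], errors[-1])
```

Repeating the step on one pattern should drive the error down; a full training run would loop over all training pairs until the stopping condition holds.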
Parallel Implementation
In the parallel implementation, the hidden layer is partitioned using vertical parallelism and the weight connections are partitioned on the basis of synaptic-level parallelism. The Master-Slave architecture of the MLP network used in the proposed scheme is shown in Fig. 1. In this architecture there is one front-end processor (FEP) and $m$ slave processors. The output layer is placed on the front-end processor; the hidden layer is partitioned into blocks of $N_1 / m$ neurons, and the partitions are placed on the slave processors. The input neurons are placed on all the slave processors [6].
In this architecture, communication takes place only between the FEP and the $m$ slave processors. Each slave processor, together with the FEP, executes the three phases of the BP training algorithm. The parallel execution of the three phases and the corresponding processing time for each phase are calculated.
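The partitioning described above can be illustrated with a forward pass simulated in a single process. This is a sketch under stated assumptions, not the authors' implementation: it assumes a sigmoid activation, that $m$ divides $N_1$, and that each slave holds the output-layer weights attached to its own hidden neurons and sends partial sums to the FEP. The function names (`partition`, `slave_forward`, `fep_forward`) are hypothetical, and the loop over blocks stands in for the slave processors.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def partition(n_hidden, m):
    """Split hidden indices 0..n_hidden-1 into m equal contiguous blocks."""
    size = n_hidden // m
    return [range(s * size, (s + 1) * size) for s in range(m)]


def slave_forward(w1_part, w2_part, x):
    """Work of one slave: activations of its hidden neurons and their
    contribution (partial sum) to every output unit."""
    y0 = [1.0] + list(x)
    hidden = [sigmoid(sum(w * y for w, y in zip(row, y0))) for row in w1_part]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2_part]


def fep_forward(bias2, partial_sums):
    """Work of the FEP: add the partial sums and apply the activation."""
    return [sigmoid(b + sum(p[i] for p in partial_sums))
            for i, b in enumerate(bias2)]


rng = random.Random(1)
N0, N1, N2, m = 10, 8, 3, 4
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(N0 + 1)] for _ in range(N1)]
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(N1)] for _ in range(N2)]
bias2 = [rng.uniform(-0.5, 0.5) for _ in range(N2)]
x = [rng.uniform(-1.0, 1.0) for _ in range(N0)]

partials = []
for block in partition(N1, m):  # each iteration stands in for one slave
    w1_part = [w1[j] for j in block]
    w2_part = [[w2[i][j] for j in block] for i in range(N2)]
    partials.append(slave_forward(w1_part, w2_part, x))
parallel_out = fep_forward(bias2, partials)

# Reference: the same network evaluated without partitioning.
sequential_out = fep_forward(bias2, [slave_forward(w1, w2, x)])
print(all(math.isclose(a, b, rel_tol=1e-9)
          for a, b in zip(parallel_out, sequential_out)))
```

Only the partial sums, one value per output unit, travel from each slave to the FEP in this sketch, which is one way to keep the per-pattern communication small.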
Fig 1: Master-Slave architecture, showing the FEP and the slave processors
Algorithm for Front End Processor (FEP):
Step 0: Download all the input vectors onto all the processors. Set $E_{old}$ = LARGE_VALUE.
Step 1: While stopping condition is false do steps 2-7
Step 2: For each training pair with old weights do steps 3-5
Step 3: Receive input partial sums calculated at processors. Store the hidden layer outputs in the context layer.
Step 4: Compute the output vector O of neurons in the output layer, error $E_{new}$ and error information terms $\delta_i^1$ and $\delta_i^2$
Step 5: Send this information to nodes.
Step 6: If (E
Algorithm for Slave Processors:
Step 0: Initialize the weights (set to small random values). Receive all input vectors from FEP.
Step 1: While stop signal is not received from FEP do steps 2-9
Feed forward:
Step 3: Each input unit ($x_i = y_i^0$, $i = 1, 2, \dots, N_0$) receives input signal $x_i$ and broadcasts this signal to all units in the layer above.
Step 4: Each hidden unit ($y_i^1$, $i = 1, \dots, N_1$) present on this processor sums its weighted input signals, applies its activation function, and sends this signal to all units in the layer above.
Step 7: $w_{ij}$
Analytical Performance Comparison
Speed up analysis
The speed-up for an $m$-processor system is the ratio of the time taken by a uniprocessor to the time taken by the parallel algorithm on an $m$-processor network, that is, $T_{seq} / T_P$. The speed-up stays below $m$ because of the extra computation required in the weight update phase and the extra communication needed to exchange the activation values of the hidden neurons. If the network size is much larger than the number of processors $m$, the speed-up ratio approaches $m$.
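The behaviour of the speed-up ratio can be illustrated numerically. The sketch below is only a toy model: it assumes that the parallel time is the sequential time divided by $m$ plus a fixed overhead, which is not the expression derived in the paper, and the names `speedup` and `toy_parallel_time` as well as the numbers are made up for the example.

```python
def speedup(t_seq, t_par):
    """Ratio of uniprocessor time to parallel time."""
    return t_seq / t_par


def toy_parallel_time(t_seq, m, overhead):
    # Assumed model: perfectly divided computation plus a fixed overhead.
    return t_seq / m + overhead


m, overhead = 4, 1.0
for t_seq in (10.0, 100.0, 10000.0):
    s = speedup(t_seq, toy_parallel_time(t_seq, m, overhead))
    print(t_seq, round(s, 3))
```

As the sequential workload grows relative to the overhead, the printed ratio moves toward $m = 4$ without reaching it.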
Optimal number of processors
We observe from $T_P$ that increasing the number of processors increases the communication time and decreases the computation time. The total processing time therefore decreases at first and then increases beyond a certain number of processors. So there exists an optimal number of processors $m^*$ for which the processing time is minimum.
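The existence of such a minimum can be seen in a small numerical sketch. The cost model used here, a computation term that shrinks as $1/m$ plus a communication term that grows linearly with $m$, is an assumption chosen for illustration and is not the paper's expression for $T_P$; the constants and the function names are hypothetical.

```python
def parallel_time(m, comp, comm):
    # Assumed model: computation shrinks with m, communication grows with m.
    return comp / m + comm * m


def optimal_processors(comp, comm, max_m):
    """Return the processor count in 1..max_m with the smallest modelled time."""
    return min(range(1, max_m + 1), key=lambda m: parallel_time(m, comp, comm))


print(optimal_processors(100.0, 4.0, 16))  # prints 5
```

For this model the continuous minimum lies at $\sqrt{comp / comm}$, which is 5 for the constants above; a measured $T_P$ would replace the assumed formula in practice.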
Difference between processing times
From the expressions for $T_{seq}$ and $T_P$, the difference in processing time is calculated as follows.
EXPERIMENTAL & STUDY RESULTS
The eleven steady-state vowels of British English in the dataset are i, I, E, A, a:, Y, O, C:, U, u: and 3:. The speech signals were low-pass filtered at 4.7 kHz and then digitized to 12 bits with a 10 kHz sampling rate. Twelfth-order linear predictive analysis was carried out on six 512-sample Hamming-windowed segments from the steady part of the vowel. The reflection coefficients were used to calculate 10 log area parameters, giving a 10-dimensional input space. We trained the BPN on the vowel recognition dataset with 10 input neurons, 8 hidden neurons and 3 output neurons. There was a substantial decrease in the learning time of the network in the parallel implementation. The results are shown graphically in Fig-2.
Fig-2 Learning Rate:0.2
We have also compared the learning performance while varying the number of processors. We observed that beyond a certain point there is no advantage in increasing the number of processors. The optimal number of processors in our example is 9 as shown in 
CONCLUSIONS
In this paper, we implemented a parallel BP algorithm on a cluster of computers connected over an Ethernet LAN to recognize the vowel dataset presented. The neural network is mapped onto a Master-Slave architecture. The analytical performance of the proposed algorithm is compared with that of its sequential counterpart. Using the hybrid partitioning scheme, re-computation of weights is avoided and the communication time is reduced. As one hidden layer is adequate for a large number of applications, the algorithm in the present work is developed for neural networks with one hidden layer.
