Abstract-The Local Cluster Neural Network (LCNN) is a feedforward RBF-like network that has been implemented in an analogue neural net chip. The LCNN chip can be trained by chip-in-the-loop training, and this training method has been demonstrated to work efficiently. In order to increase the functionality of the LCNN chip, we propose on-chip training for it. In this paper we describe two training algorithms, Gradient Descent and Probabilistic Random Weight Change, which are used in LCNN on-chip training simulations. We also present the experimental results from the simulations in multi-dimensional function approximation. The training convergence is investigated and analysed. The circuit signal flow charts for these two algorithms are designed.
I. INTRODUCTION
The Local Cluster Neural Network (LCNN) is a special kind of feedforward neural network proposed by Geva and Sitte [1]. It is a multilayer perceptron (MLP) in which sigmoidal neurons combine in clusters that have a localised response in input space, like radial basis functions (RBF). The LCNN has been demonstrated to perform well on function approximation in digital computer simulations, and it has been implemented in analogue VLSI hardware in the LCX chip [2][3][4]. Chip-in-the-loop training has been successful on the LCX chip for function approximation using the Probabilistic Random Weight Change (PRWC) algorithm [9]. Although chip-in-the-loop training is effective for training the analogue chip, it needs a computer and software external to the analogue chip. Based on this previous research we propose on-chip training for the LCNN. Two training algorithms are utilised for the on-chip training and they are realised in computer simulations. In this paper we briefly describe the architecture of the Local Cluster Neural Network (LCNN) in section II. Sections III and IV present the training methods, the training algorithms (Gradient Descent (GD) and Probabilistic Random Weight Change (PRWC)) and the on-chip training strategies. Sections V and VI show the experimental results from the simulations and the block circuit diagrams. The two training methods are compared in section VII. The conclusion is given in section VIII.
II. LOCAL CLUSTER NEURAL NETWORK AND ITS ANALOG HARDWARE IMPLEMENTATION
The Local Cluster Neural Network (LCNN) is defined by equation (1). Figure 1 shows the signal flow diagram for a segment of two clusters of an LCNN. The LCNN uses sigmoidal neurons in two hidden layers to form functions localised in input space, similar to radial basis functions (RBF) but capable of representing a wider range of localised function shapes [3]. Each neuron in the second hidden layer outputs such a local response function. The LCNN output is a linear combination of localised scalar functions in n-dimensional input space:
where v_μ is the output weight, W_μ is the weight matrix that determines the output shape, the weight r_μ determines the position of the localised output function, k is the sigmoid slope and x is the n-dimensional input. The operation of the LCNN is as follows. The first layer: (i) Subtraction of the position vector r of the local function centre from the input vector x and computation of two dot products as follows:
(ii) Calculation of sigmoid functions:
(iii) Subtraction of the two sigmoid functions to get the ridge function (5):
The second layer: (i) Summation of the ridge functions:
(ii) Application of the sigmoidal windowing function to obtain the cluster output; the output sigmoid is calculated as in equation (7). The constant b allows shifting of the window with respect to the function. Figure 2 shows a local function L in two dimensions.
The output of the LCNN is the weighted summation of all cluster outputs.
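To make the signal flow concrete, the following Python sketch traces one forward pass through the clusters, following the two-layer steps above and the weighted summation y(x) = Σ_μ v_μ L_μ(x) of the cluster outputs. The function and parameter names (sigmoid, lcnn_forward, k_out) and the exact placement of the slope k and window shift b are our own assumptions for illustration, not the form of equations (1)-(7) or the chip's circuitry.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def lcnn_forward(x, W, r, v, k=1.0, k_out=1.0, b=0.0):
    """Forward pass through an LCNN with M clusters on an n-dimensional input x.

    W : array (M, n_ridges, n), ridge weight matrix of each cluster
    r : array (M, n), centre position of each local function
    v : array (M,), output weights
    k, k_out, b : assumed sigmoid slopes and window shift (placement is illustrative)
    """
    y = 0.0
    for mu in range(len(v)):
        d = x - r[mu]                       # (i) shift the input to the cluster centre
        a = W[mu] @ d                       # dot products along the ridge directions
        # (ii)-(iii) ridge functions as the difference of two shifted sigmoids
        ridges = sigmoid(k * (a + 1.0)) - sigmoid(k * (a - 1.0))
        # second layer: sum the ridges and apply the sigmoidal window
        L = sigmoid(k_out * (np.sum(ridges) - b))
        y += v[mu] * L                      # weighted summation of the cluster outputs
    return y
```

The sketch is only intended for checking the signal flow in simulation; on the chip the same structure is realised with analogue multipliers and sigmoid circuits.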
The LCNN has been implemented in an analogue chip with 6 inputs, one output and 8 clusters [4], which can be trained by the chip-in-the-loop training scheme [9].
III. TRAINING OF LOCAL CLUSTER NEURAL NETWORK
Neural network training is a parameter optimisation process. Training neural networks in digital computing simulations is easy, but it is hard to realise the training in analogue neural network hardware with the same methods that are utilised in the digital simulations.
There are three schemes for training neural net hardware [7]. Off-chip training computes the network weights in a separate computer simulation and then downloads the weights onto the chip. This method is inaccurate, because the fluctuations and deviations in the analogue circuits are unknown and cannot be accounted for in the simulation [4]; therefore the weights obtained by training the mathematical model (LCNN) in a software simulation will produce a different function on the chip. The chip-in-the-loop training scheme overcomes this problem by calculating the weights on a separate computer using the output of the analogue chip, so the effects caused by fluctuations and deviations are directly taken into account [9]. The ideal way of training an analogue chip is on-chip training. The on-chip training method has the training function inside the analogue chip and does not need an attached computer.
As the LCNN analogue chip does not have on-chip training circuits, in-the-loop training is applied to the LCNN chip. We formulated the Probabilistic Random Weight Change (PRWC) algorithm [9], and it has been used successfully for in-the-loop training of the LCNN analogue chip. Following the research achievements of the LCNN analogue hardware implementation and its in-the-loop training, we propose on-chip training for the next version of the LCNN analogue chip.
IV. TRAINING ALGORITHMS FOR ON-CHIP TRAINING
We have considered two different training algorithms for on-chip training: Gradient Descent (GD) and Probabilistic Random Weight Change (PRWC).
A. Gradient Descent (GD)
Gradient Descent (GD) has been used in the LCNN digital computing simulations with batch training, as in equations (8) and (9), and it has proven to perform very efficiently.
where p represents the training sample number, q is a parameter of the neural network architecture (the LCNN has three kinds of parameters: w, r and v), y is the neural network output and y* is the desired output. In each training epoch, the mean square error is calculated over a set of training sample points and the weight change is calculated from the sum of the derivatives over that set of sample points (figure 3). The learning rates are adjusted according to the mean square error changes (figure 4).
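As an illustration of the batch mode, the minimal Python sketch below accumulates the gradient over all sample points and applies a single update per epoch. The names forward and dydq are placeholders we introduce for the LCNN output and its parameter derivatives, and the learning-rate adaptation rules of figure 4 are not reproduced.

```python
def gd_batch_epoch(params, samples, forward, dydq, eta=0.01):
    """One batch-mode GD epoch (cf. equations (8) and (9)).

    params  : dict of LCNN parameters (e.g. the w, r and v values)
    samples : list of (x, y_star) training pairs
    forward : callable(params, x) -> network output y (placeholder)
    dydq    : callable(params, x, q) -> derivative of y w.r.t. parameter q (placeholder)
    The learning-rate adaptation rules of figure 4 are omitted here.
    """
    grad = {q: 0.0 for q in params}
    sq_err = 0.0
    for x, y_star in samples:
        y = forward(params, x)
        e = y - y_star
        sq_err += e ** 2
        for q in params:
            grad[q] += e * dydq(params, x, q)   # sum of derivatives over the batch
    for q in params:
        params[q] -= eta * grad[q]              # a single weight change per epoch
    return params, sq_err / len(samples)        # mean square error drives the rate rules
```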
This batch training can easily be realised in software, but it is hard to realise in analogue hardware, as there are many rules in the training process and much memory is needed for the rules and for the batch calculations. Hence batch training is not suitable for an analogue design. Instead of batch training, we use an on-line (pattern-mode) strategy for the on-chip training design (figure 5).
The on-line training algorithm takes the squared error of each training sample
to update the weights, instead of calculating the mean square error over a batch. The learning rate is a fixed small value during training instead of being adapted by rules. The weights are adjusted by the derivatives (equations 19-21) at each training sample point, and the weight changes are determined by the learning rate η and the error e (equations 16-18). Thus no rules are needed during training. This strategy simplifies the analogue training circuit compared to batch training. The Gradient Descent (GD) algorithm in the on-line training strategy is defined as in equation (11),
where η is the learning rate and q is a parameter of the neural network.
The LCNN includes three kinds of parameters (weights); they are updated as in equations (13)-(15).
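A minimal sketch of the on-line (pattern-mode) update for a single training sample follows, using the same placeholder names forward and dydq as in the batch sketch above; the fixed learning rate and the per-sample error follow the description in this section, while the function names are ours.

```python
def gd_online_step(params, x, y_star, forward, dydq, eta=0.01):
    """On-line (pattern-mode) GD: update every parameter after a single sample.

    forward : callable(params, x) -> network output y (placeholder)
    dydq    : callable(params, x, q) -> derivative of y w.r.t. parameter q (placeholder)
    The learning rate eta stays fixed, so no adaptation rules are needed.
    """
    y = forward(params, x)
    e = y - y_star                            # per-sample error (cf. equation (10))
    for q in params:                          # q runs over the w, r and v parameters
        params[q] -= eta * e * dydq(params, x, q)
    return params, e ** 2                     # squared error, for monitoring only
```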
B. Probabilistic Random Weight Change (PRWC)
The Probabilistic Random Weight Change (PRWC) algorithm is an alternative method that we propose for the on-chip analogue design. The PRWC training algorithm is "model-free", like Random Weight Change (RWC) [5], Weight Perturbation (WP) [6] and Simulated Annealing (SA) [8]. Compared with the on-line GD algorithm, the PRWC algorithm results in a further simplification of the circuit design, as it does not require intermediate network outputs. PRWC has been applied successfully to LCNN analogue chip in-the-loop training with a batch strategy. Here we propose an on-line strategy for our on-chip training design.
The PRWC is defined as follows:
where n is the training sample number and i is the weight index. At each training sample point, we have the original weight set w_i and the new weight set w_i', which give the training errors e and e' for each training sample from equation (10). The weight change Δw_i is:
where Lr is the learning rate, rand is a positive random value and k is the remainder of the random number divided by m. The weight w_i and the weight change Δw_i for the next training sample are decided by equations (25) and (26).
In each training sample, w_i is updated by equation (22), and the weight change Δw_i is determined by k as in equation (23).
Thus the weight w_i is randomly changed only when k = 0 or k = 1, i.e. w_i is changed with probability 2/m, which is set by the choice of m. If the error decreases, the weight change Δw_i for the next training sample keeps the same value as in the last training sample; otherwise, if the error increases, Δw_i is set anew by equation (23). Figure 6 shows the block diagram of PRWC on-chip training.
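As a rough illustration of this logic, the Python sketch below performs one PRWC step on a weight vector. The sign convention for k = 0 and k = 1, the ordering of the error comparison, and all function names are our assumptions for illustration, not the circuit of figure 6.

```python
import random

def prwc_step(w, dw, e_prev, x, y_star, forward, Lr=0.05, m=10):
    """One PRWC on-line step, sketched from the description of equations (22)-(26).

    w      : list of weights
    dw     : weight changes used at the previous training sample
    e_prev : squared error at the previous training sample
    forward: callable(w, x) -> network output (placeholder)
    """
    e = (forward(w, x) - y_star) ** 2            # squared error for this sample
    if e >= e_prev:                              # error did not decrease: re-draw changes
        dw = []
        for _ in w:
            k = random.randrange(m)              # remainder of a random number modulo m
            if k == 0:
                dw.append(+Lr * random.random()) # positive change, probability 1/m
            elif k == 1:
                dw.append(-Lr * random.random()) # negative change, probability 1/m
            else:
                dw.append(0.0)                   # weight left unchanged this time
    # if the error decreased, dw keeps the same values as at the previous sample
    w = [wi + dwi for wi, dwi in zip(w, dw)]     # apply the changes to all weights
    return w, dw, e
```

Only additions, comparisons and a random source are needed here, which is what makes the method attractive for an analogue implementation.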
V. EXPERIMENT RESULTS IN SIMULATIONS
We have tested both GD on-line training and PRWC on-line training in simulations. In this section the results of these two simulations are presented.
The Mexican hat function (27), the subtraction of two Gaussian functions (28) and the sine function (29) are used as our test functions for the multi-dimensional function approximations.
where a, b and c are the parameters that determine the output shape.
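Since equations (27)-(29) are not reproduced above, the sketch below gives common forms of the three target families purely for illustration; the exact expressions used in the experiments, and the precise roles of a, b and c, may differ.

```python
import numpy as np

def mexhat(x, a=1.0, b=1.0):
    """A common Mexican-hat form (the paper's exact equation (27) may differ)."""
    r2 = np.sum(np.atleast_1d(x) ** 2, axis=-1)   # squared radius in input space
    return a * (1.0 - b * r2) * np.exp(-b * r2 / 2.0)

def gauss_diff(x, a=1.0, b=0.5, c=2.0):
    """Subtraction of two Gaussians of different widths (cf. equation (28))."""
    r2 = np.sum(np.atleast_1d(x) ** 2, axis=-1)
    return a * (np.exp(-r2 / (2.0 * b ** 2)) - np.exp(-r2 / (2.0 * c ** 2)))

def sine_target(t, a=1.0, b=1.0, c=0.0):
    """A simple sine target for the time-series data (cf. equation (29))."""
    return a * np.sin(b * t + c)
```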
We performed statistical testing for the two training methods. Tables I and II present the average training errors and their standard deviations for GD and PRWC training on multi-dimensional mexhat functions. Tables III and IV present the average training errors and their standard deviations for GD and PRWC training on multi-dimensional sine time series data. The one-dimensional mexhat function approximation trained with 4 clusters of the LCNN using GD is displayed in figures 7 and 8. After 10000 GD training epochs, the final training error is 0.035, as shown in figure 7. The plot in figure 8 shows the desired output (solid line) and the GD training output (dashed line). For comparison, figures 9 and 10 display the PRWC training on the same training samples and with the same number of clusters. After 10000 PRWC training epochs, the minimum training error is 0.029. The plot in figure 10 shows the desired output (solid line) and the PRWC training output (dashed line). Figure 12 shows the desired 2D mexhat output (left plot) and the GD training output for the 2D mexhat function approximation (right plot). The same two-dimensional mexhat training sample is tested by PRWC training with 8 clusters of the LCNN. Figure 13 shows the PRWC training error and figure 14 shows the desired output and the PRWC training output for the 2D mexhat function approximation.
The results show that the GD training error decreases smoothly over the training epochs. In PRWC training the error does not decrease as smoothly as in GD training, because the weights are adjusted randomly, but it shows the same decreasing trend as GD. In addition, the PRWC training simulation is faster than the GD training simulation because PRWC involves less computation than GD. In the final hardware realisation the speed difference will be smaller and instead there will be a saving in circuit area.
VI. ON-CHIP TRAINING BLOCK CIRCUIT DESIGN
Figure 15 shows the Gradient Descent (GD) on-chip training block circuit diagram. Derivative calculations and many multipliers are used to complete the training procedure of the flow chart in figure 5. Figure 16 shows the Probabilistic Random Weight Change (PRWC) on-chip training block circuit diagram. Two weight stores are needed to keep the original weights and the updated weights, and fewer multipliers are needed to complete the PRWC training procedure described in figure 6.
VII. COMPARISON OF THE TWO TRAINING METHODS
Analysing the experimental results in section V and the block circuit diagram design in section VI, we find the following characteristics of PRWC and GD training:
• Approximation error: In the multi-dimensional time series experiment, the two methods perform equally well. In the other multi-dimensional function approximations, the final error is affected by the type of target function and its dimensionality. For complicated target functions, GD training is more stable than PRWC training.
• Convergence speed: PRWC and GD have similar convergence speed, measured by the number of training epochs required to converge to the smallest error.
• Complexity: Comparing the PRWC and GD block circuit diagrams, GD needs much more computation than PRWC during training, so GD is slower than PRWC in training and more complicated than PRWC to fabricate.
VIII. CONCLUSIONS
The two training methods, GD and PRWC, were successfully used in analogue on-chip training simulations. PRWC is a "model-free" training method. Both algorithms are applied with an on-line training strategy. Both methods performed well, statistically, in final error and convergence speed. Although the two methods have different advantages in different cases, overall the PRWC training method has more potential for an analogue hardware on-chip training implementation, as its simplicity is suitable for achieving a high degree of parallelism on the chip. The analogue LCNN with on-chip training is intended for control applications such as brushless DC motor control.
