Hardware implementations of neural networks usually have high computational complexity that increases exponentially with the size of the circuit, leading to more uncertain and unreliable circuit performance. This letter presents a novel Radial Basis Function (RBF) neural network based on parallel fault-tolerant stochastic computing, in which numbers are converted from the deterministic domain to the probabilistic domain. The Gaussian RBF for the middle-layer neurons is implemented using a stochastic structure that reduces the hardware resources significantly. Our experimental results from two pattern recognition tests (the Thomas gestures and the MIT faces) show that the stochastic design is capable of maintaining equivalent performance when the stream length is set to 10 Kbits. The stochastic hidden neuron uses only 1.2% of the hardware resources required by the CORDIC algorithm. Furthermore, the proposed algorithm is very flexible in the design tradeoff between computing accuracy, power consumption and chip area.
Introduction
Parallel computing is receiving increasing attention due to its efficient data processing ability. Computation on stochastic bit streams has been an attractive solution in parallel computing environments for many years [1], [2]. It represents a value m in the unit interval (i.e., 0 ≤ m ≤ 1) by a bit stream of length L in which the probability of a one is m [3]. Conventional methods generally use multipliers and adders to solve matrix operation problems. As the matrix size increases, the number of multipliers and adders multiplies and ultimately exceeds hardware capabilities [4]. When probabilistic values are used, complicated operations can be approximated through simple arithmetic units (e.g., absolute subtraction can be implemented with only a single XNOR or XOR gate), resulting in a simplified circuit structure and minimized system design cost. Meanwhile, most data processing systems that boast high accuracy work well only under ideal conditions [5]. When environmental noise or manufacturing parameter deviation corrupts the system, traditional methods often add redundancy to improve noise immunity, thereby increasing the scale and cost of the system. An effective solution for reducing area and noise sensitivity is to move from deterministic values toward probabilistic values. Some stochastic data processing designs have already been introduced [6]-[8]. For example, Brown and Card [9], [10] showed that for complex operations, such as the exponentiation and sine functions, stochastic computing consumes less energy than binary radix computing, and the latency problem can be resolved using parallel processing or a higher operating frequency owing to its simple circuit structure. Also, in [11] Qian et al. described a stochastic image processing architecture that is fault tolerant.
More recently, some studies have presented stochastic logic structures for the Back Propagation (BP) neural network that use many adders to realize the sigmoid function. However, the results are impractical because scaled addition introduces precision loss. Moreover, we found that the generalization capability of the RBF neural network is superior to that of the BP neural network [12].
Building on the discussion above, we extend previous work on this algorithm. We address the scaled addition problem and make the best use of the advantages of the Two-Dimensional (2D) Finite State Machine (FSM) through a case study of the RBF neural network. The overall architecture is designed in stochastic logic from the input and output samples of the pattern. The results show that the presented computing method is effective, offering high error tolerance, high recognition accuracy and better power efficiency.
Stochastic Computing
Stochastic computing uses the probability of "1" (or "0") appearing in a stochastic bit stream to represent a deterministic value [1], [2]. It has two coding formats: unipolar and bipolar. Both formats can coexist in a single system. Operations at the logic level can be easily parallelized in space to reduce energy and total delay. That is to say, instead of using a single bit stream of length L to represent a real value m, we increase the number of logic gates by a factor of N and use N streams of length L/N. Because each stream is only L/N bits long, the computation time is greatly reduced. This can be done with little overhead because the gate area is typically low. Figure 1 shows how a value of 0.5 can be represented by 2 stochastic bit streams of length 4.
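The unipolar representation and its spatial parallelization can be sketched in a few lines of Python. This is a minimal software illustration, not the hardware design of this letter: the function names are hypothetical, and Python's pseudo-random generator stands in for the hardware random source.

```python
import random

def encode(value, length, rng):
    """Unipolar encoding: each bit is 1 with probability `value`."""
    return [1 if rng.random() < value else 0 for _ in range(length)]

def encode_parallel(value, total_length, n_streams, rng):
    """Replace one stream of `total_length` bits by `n_streams` shorter streams."""
    per_stream = total_length // n_streams
    return [encode(value, per_stream, rng) for _ in range(n_streams)]

def decode_parallel(streams):
    """Estimate the value as the fraction of ones over all streams."""
    ones = sum(sum(s) for s in streams)
    bits = sum(len(s) for s in streams)
    return ones / bits

# Two streams of length 4 that together represent 0.5 (4 ones in 8 bits).
figure_like = decode_parallel([[0, 1, 1, 0], [1, 0, 0, 1]])

# A longer example: 8000 bits split over 2 parallel streams.
rng = random.Random(0)
streams = encode_parallel(0.5, 8000, 2, rng)
estimate = decode_parallel(streams)
```

Each of the parallel streams needs only L/N clock cycles, at the price of N copies of the logic.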
Such streams can be generated with pseudo-random constructs such as a Linear Feedback Shift Register (LFSR). As shown in Fig. 2, we use an LFSR and a comparator [9] to explain the operation of the proposed stochastic computing. The LFSR generates a different random number in each clock cycle. Assume that we convert a value m from its binary radix encoding to an L-bit stochastic encoding. The comparator produces a one if the random number is less than the input number, and a zero otherwise.
Copyright © 2017 The Institute of Electronics, Information and Communication Engineers
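The LFSR-plus-comparator generator can be modeled in software as follows. This is a sketch under assumptions: the 16-bit register width, the tap positions (16, 14, 13, 11, a well-known maximal-length choice) and the seed are illustrative and are not taken from this letter.

```python
def lfsr16(seed=0xACE1):
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (period 65535)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        yield state
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)

def sng(value, length, seed=0xACE1):
    """Comparator: emit 1 when the LFSR number is less than the input number."""
    threshold = int(round(value * 65536))  # input value as a 16-bit integer
    source = lfsr16(seed)
    return [1 if next(source) < threshold else 0 for _ in range(length)]

# Over one full period the LFSR visits every non-zero 16-bit state once,
# so exactly threshold - 1 states fall below the threshold.
bits = sng(0.5, 65535)
ones = sum(bits)
```

Over a full period the fraction of ones is 32767/65535, which is within one part in 65535 of the encoded value 0.5.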
After the real number is converted to a random stream, mathematical operations such as addition, subtraction and division can be implemented using simple digital logic. Thus, this approach enables simple hardware and massively parallel processing. However, $2^m$ bits are needed for the stochastic encoding of an m-bit binary number, which requires too many clock cycles to finish. This motivates us to implement complex functions with a finite-state machine running at a faster clock frequency.
Here we provide a two-dimensional state array [11], shown in Fig. 3, which has two inputs (the input stream and the modulation stream) and a total of M×N states ($S_0$ to $S_{MN-1}$). The numbers on the arrows represent conditions that must be satisfied to proceed along the transition. According to Markov chain theory, after sufficiently many state transitions the state occupancy settles into a probability distribution that is independent of the initial state. Consequently, based on the probability of the stochastic stream, some complex functions can be approximated with parallel processing.
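Two standard unipolar constructions illustrate how single gates replace arithmetic units: an AND gate multiplies two independent streams, and a multiplexer with a 0.5-probability select stream computes the scaled sum (a + b)/2. The sketch below assumes ideal independent random bits and uses hypothetical helper names.

```python
import random

def bernoulli_stream(p, length, rng):
    """Independent bits that are 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def value(bits):
    """Decode a unipolar stream as the fraction of ones."""
    return sum(bits) / len(bits)

rng = random.Random(1)
L = 20000
a = bernoulli_stream(0.6, L, rng)
b = bernoulli_stream(0.5, L, rng)
sel = bernoulli_stream(0.5, L, rng)

# AND gate: P(a AND b) = 0.6 * 0.5 = 0.30 for independent streams.
product = [x & y for x, y in zip(a, b)]

# Multiplexer: picks a or b with equal probability, giving (0.6 + 0.5) / 2 = 0.55.
scaled_sum = [x if s else y for x, y, s in zip(a, b, sel)]
```

The factor of 1/2 in the multiplexer output is the scaling that costs precision when many additions are chained.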
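The independence from the initial state can be checked numerically on a simplified model. The sketch below assumes a one-dimensional saturating counter driven by a single Bernoulli input, which is much simpler than the M×N array of Fig. 3; it is meant only to illustrate the Markov chain argument.

```python
def step(dist, p):
    """One transition of an n-state saturating counter with input probability p."""
    n = len(dist)
    nxt = [0.0] * n
    for i, mass in enumerate(dist):
        nxt[min(i + 1, n - 1)] += mass * p          # input bit 1: move right
        nxt[max(i - 1, 0)] += mass * (1.0 - p)      # input bit 0: move left
    return nxt

def settle(p, n, start, iterations=2000):
    """Propagate a distribution that starts entirely in state `start`."""
    dist = [0.0] * n
    dist[start] = 1.0
    for _ in range(iterations):
        dist = step(dist, p)
    return dist

p, n = 0.7, 8
from_left = settle(p, n, start=0)
from_right = settle(p, n, start=n - 1)

# Closed form from detailed balance: pi_i is proportional to (p / (1 - p)) ** i.
ratio = p / (1.0 - p)
closed = [ratio ** i for i in range(n)]
total = sum(closed)
closed = [c / total for c in closed]
```

Both starting points reach the same distribution, which depends only on the input probability p. This is what allows the FSM output to encode a function of p.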
Stochastic Logic Applied to RBF Neural Network
The proposed case study of the RBF neural network is a feed-forward network organized in three layers: input, middle and output layers [12]. The input layer distributes the input variables $x = [x_1, x_2, \ldots, x_n]^T$ to all neurons in the middle layer, where the vector is processed and further transmitted to the output layer to give a linear combination weighted by the output weights. For the RBF responses in the hidden layer, an RBF network can utilize several kernel functions. In this work, the most commonly used Gaussian function is chosen, and we expand and rewrite it as follows:
where σ is a positive real constant, the kernel radius, which measures the smoothness of the fitting function, and $|x - c_{ij}|$ denotes the Euclidean distance. The centroids $c_{ij}$ and the constant σ are determined from the training data. The middle layer operation can be seen clearly on the right side of Fig. 4, which presents the detailed architecture of the stochastic RBF neural network. The input value x and the cluster center value $c_{ij}$ are subtracted through XOR gates. The Gaussian function is synthesized with the 2D-FSM topology using the modulation streams k and q. They are connected by shift registers. The middle layer outputs are obtained by ANDing all FSM outputs as described above.
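The reason the per-input FSM outputs can simply be ANDed follows from the factorization of the Gaussian kernel. The form below is a sketch under assumptions: the $1/(2\sigma^2)$ normalization and the index convention (hidden neuron i, input dimension j, n inputs) are common choices and may differ from the expression used in this letter.

```latex
y_i = \exp\!\left(-\frac{\lVert x - c_i \rVert^2}{2\sigma^2}\right)
    = \prod_{j=1}^{n} \exp\!\left(-\frac{(x_j - c_{ij})^2}{2\sigma^2}\right)
```

Each factor depends on a single distance $|x_j - c_{ij}|$, and the product of unipolar probabilities carried by independent streams is obtained with an AND gate.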
The left side of Fig. 4 shows the stochastic units applied to the output layer of the neural network. We use a scaling deterministic-to-stochastic converter to scale the value of ω into the range [−1, 1]. The output of the middle neuron y and the weight of the output neuron ω are multiplied through XOR gates. Note that stochastic addition would reduce accuracy, so we convert the multiplication results back into deterministic values by summing them in a binary adder, and retrieve the real data $z_k$.
Correlations among the stochastic numbers often lead to inaccuracies, which implies the need for many random sources. There have been some efforts to develop techniques for generating multiple uncorrelated pseudorandom sources [6], [17]. However, the random sources still dominate the area cost of the architecture. The solution described in this paper is to make full use of the LFSR. The middle layer inputs must be correlated, which is ensured by assigning them the same random source. For the modulation streams k and q, we use the same random source and obtain shifted versions of the main sequence through tapped shift registers. All those registers are integrated into the control system. As for the output center vector ω, the computation is processed in the binary adder, which requires only one LFSR. The area cost of these resources is minor since they are small, which gives an acceptably small overhead for a neural network chip.
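For reference, the deterministic computation that the stochastic network approximates can be written as a short forward pass. This is a minimal sketch with hypothetical names; it assumes the $1/(2\sigma^2)$ Gaussian normalization and a single shared σ, neither of which is confirmed by the text.

```python
import math

def rbf_forward(x, centers, sigma, weights):
    """Deterministic RBF network: Gaussian hidden layer, linear output layer."""
    hidden = [
        math.exp(-sum((xj - cj) ** 2 for xj, cj in zip(x, c)) / (2.0 * sigma ** 2))
        for c in centers
    ]
    # Each output z_k is a weighted sum of the hidden responses y_i.
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

centers = [[0.0, 0.0], [1.0, 1.0]]   # one center per hidden neuron
weights = [[1.0, 0.0], [0.0, 1.0]]   # one row of weights per output neuron
z = rbf_forward([0.0, 0.0], centers, sigma=1.0, weights=weights)
```

With the input placed on the first center, the first hidden neuron responds with 1 and the second with exp(−1).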
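The role of the shared random source can be illustrated with a small simulation. When two unipolar streams are generated by comparing against the same random numbers, an XOR gate yields |x − c|. With independent sources the same gate yields x + c − 2xc instead. The sketch assumes ideal uniform random numbers in place of the LFSR, and the helper name is hypothetical.

```python
import random

def compare_stream(value, randoms):
    """Comparator output: 1 when the random number is below the value."""
    return [1 if r < value else 0 for r in randoms]

rng = random.Random(2)
L = 20000
shared = [rng.random() for _ in range(L)]
other = [rng.random() for _ in range(L)]

x, c = 0.8, 0.3
x_bits = compare_stream(x, shared)
c_shared = compare_stream(c, shared)   # same source as x: fully correlated
c_indep = compare_stream(c, other)     # separate source: independent

xor_shared = sum(a ^ b for a, b in zip(x_bits, c_shared)) / L   # about |0.8 - 0.3| = 0.5
xor_indep = sum(a ^ b for a, b in zip(x_bits, c_indep)) / L     # about 0.8 + 0.3 - 0.48 = 0.62
```

Only the correlated pair gives the distance needed by the Gaussian kernel, which is why the input and center streams share one source.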
Experimental Results and Discussions

Gesture Recognition
The Thomas data set contains 40 samples for each of 25 kinds of hand gesture, and each gesture image is 248×256 pixels. A digital image contains a large number of pixels, and the input is somewhat redundant because the values of adjacent pixels in an image are highly correlated. Meanwhile, the recognition process would be inefficient if the network were trained on every pixel. Therefore, we first apply principal component analysis to approximate the input with a much lower-dimensional one while incurring very little error.
We randomly divided the data set into training samples and testing samples. Each set covers the 25 gestures (20 images for each gesture). The RBF neural network is set to 39 input neurons, 15 hidden neurons and 25 output neurons. The experiments are carried out in MATLAB, and the network is trained with the orthogonal least squares algorithm. The proposed stochastic logic and the deterministic logic are simulated in the same environment. In deterministic logic, the recognition rate reaches 96.80% and the Mean Square Error (MSE) is 0.0159.
There are several interesting results in Fig. 5. The most important point is that the stochastic bit stream length has a considerable impact on network performance, in both the recognition rate and the MSE. When the stream length is 100 bits, the recognition rate reaches 60.33% and the MSE is 0.0609. When the stream length reaches 5 Kbits, the results change significantly, with a 58% improvement in the recognition rate and 72% in the MSE. In addition, when the stream length is greater than 50 Kbits, the recognition rate and the MSE almost converge to those of the deterministic network.
Stochastic logic can be used to implement the neural network with simple circuits, so it can be integrated as an independent hardware component on a silicon microdisplay. The performance can remain almost perfect even if the circuit components are noisy and uncertain [15], [16]. In general, the input data is more easily corrupted by noise than the internal structure of the chip. Therefore, the input-referred noise could be very high; in our experimental architecture, we model it as noise at the input interface. Here we inject different levels of noise into the stochastic bit stream in the input layer. The stream length is set to 10 Kbits. As shown in Fig. 6, the recognition rate reaches 96.10% and the MSE is 0.0165 when the noise is set to 5%. When we increase the noise rate to 15%, the recognition rate is still as high as 94.20%. The MSE increases correspondingly as the noise level increases. Thus, the architecture of the stochastic RBF neural network has good error immunity when the uniformly distributed noise is less than 15%. The precision of the results depends mostly on the statistics of the bit streams, and so the computation can tolerate errors gracefully.
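The dependence on stream length can be made concrete with a short calculation. Assuming the bits of a stream are independent Bernoulli trials with probability p (an idealization, since LFSR streams are only pseudo-random), the value $\hat{p}$ decoded from L bits has the standard deviation

```latex
\operatorname{std}(\hat{p}) = \sqrt{\frac{p(1-p)}{L}}, \qquad
p = 0.5:\quad
L = 100 \;\Rightarrow\; 0.05, \quad
L = 5000 \;\Rightarrow\; \approx 0.0071, \quad
L = 50000 \;\Rightarrow\; \approx 0.0022
```

Under this assumption the representation error shrinks only as $1/\sqrt{L}$, which agrees in trend with the large gain from 100 bits to 5 Kbits and the much smaller gain beyond that.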
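The graceful degradation can be illustrated by flipping bits of a unipolar stream. The sketch assumes that each bit is inverted independently with probability e, which is one possible reading of the noise model used here. Under that assumption the decoded value moves from p to p + e(1 − 2p).

```python
import random

def inject_noise(bits, error_rate, rng):
    """Invert each bit independently with probability `error_rate`."""
    return [b ^ 1 if rng.random() < error_rate else b for b in bits]

rng = random.Random(3)
L = 10000
p = 0.8
error_rate = 0.05

clean = [1 if rng.random() < p else 0 for _ in range(L)]
noisy = inject_noise(clean, error_rate, rng)

clean_value = sum(clean) / L
noisy_value = sum(noisy) / L
expected = p + error_rate * (1 - 2 * p)   # 0.8 - 0.03 = 0.77
```

Every bit carries the same weight, so a 5% flip rate shifts the value by at most 0.05. In a binary radix word, a single flipped most significant bit changes the value by half of the full range.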
Face Recognition
In this experiment, we similarly separate the MIT faces database into two parts: a face library and a non-face library. A total of 100 pictures are randomly chosen for each part, all having a resolution of 20×20 pixels.
A stochastic RBF neural network with 400 input neurons, 20 hidden neurons and 2 output neurons is simulated to perform the facial recognition. The output values '01' and '10' represent the face and non-face results, respectively. In deterministic logic, the recognition rate reaches 95.0% and the MSE reaches 0.0722. For the stochastic logic, a 2D finite state machine with eight states is used to fit the Gaussian function. The deviation between the outputs of the stochastic network and the deterministic network is characterized by the MSE. The stochastic network is run with different stream lengths, and each run is repeated one thousand times to reduce the random variance.
As shown in Table 1, when the stream length increases from 1 Kbits to 10 Kbits, the results do not change significantly, with 4-5% improvements in the recognition rate and 3-4% in the MSE. When the stream length is greater than 50 Kbits, the results gradually converge to those of the deterministic network. The main drawback of the stochastic computing technique is its long computation time. The general solution is to use a faster clock frequency and/or parallel computing in pattern processing. Table 2 illustrates the noise tolerance of the stochastic RBF neural network in facial recognition. The error injected into the stochastic bit stream of the input layer is up to 30%, and the stream length is set to 10 Kbits. Clearly, the stochastic RBF neural network shows satisfactory robustness to input noise. Compared with the gesture recognition test, a much shorter stream length is needed to obtain the same recognition rate level, mainly because there are more inputs and fewer outputs in the facial recognition test.
Table 1 Facial recognition rate and MSE for stochastic computing RBF neural network and deterministic network.
Table 2 The error injection test result for the stochastic computing RBF neural network in the facial recognition experiment.
Hardware Comparison and Application Forecast
To validate the hardware cost of the proposed algorithm, we implement the stochastic logic RBF network on an Altera Cyclone III FPGA (EP3C80F780C8) with 81264 logic elements and 430 pins. The network parameters c, k, q and ω are stored in nonvolatile memory and serially shifted into the stochastic network. The data width is equivalent to the bit length of the stochastic logic. Three different deterministic algorithms with the same network structure are chosen, and the device utilization is given in Table 3 [3]. The proposed approach uses only 22 logic elements, which is far fewer than the other methods. A Look-Up Table (LUT) is an array that replaces runtime computation with a simpler array indexing operation. It takes only one clock cycle to complete the operation, but consumes a large amount of memory and area. Pre-computation with interpolation can shrink the LUT size but may increase the computation time. The Coordinate Rotation Digital Computer (CORDIC) uses only simple shift-and-add operations, which makes it suitable for hardware implementation of complex functions. We also compare the designs using three cost-performance normalizations (M0, M1 and M2) for each major system component. A higher normalized score means better performance of the whole system. In general, the 2D-FSM is the best choice in terms of circuit power, which is determined by area and speed. Compared with the LUT with interpolation architecture, the stochastic 2D-FSM requires 98.81% less area and has higher area efficiency. The vector M1, which emphasizes circuit area, implies that the 2D-FSM architecture performs better than the typical implementation when a 12-bit data width is assigned. According to the vector M2, the area efficiency of the 12-bit 2D-FSM is competitive with the CORDIC algorithm. We conclude that our stochastic architecture occupies less area than traditional hardware implementations.
As a matter of fact, current research mainly focuses on the theoretical framework and software realization of deep learning neural networks [13], [14], while methods for hardware implementation are still lacking. Furthermore, under the constraints of traditional computing methods, each neuron consumes an undesirable amount of resources, which limits a single chip to only tens to hundreds of neurons. Therefore, according to the analyses and discussions above, we conclude that stochastic computing is an acceptable way to ease the hardware constraints of deep learning, and a large number of applications with high hardware resource costs can benefit from this parallel algorithm.
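The LUT with interpolation baseline can be illustrated in software. The sketch below is an assumption-laden model, not the FPGA design compared in Table 3: the table size of 17 entries, the function exp(−d²) and the input range [0, 1] are chosen only for illustration.

```python
import math

def build_lut(func, lo, hi, entries):
    """Pre-compute `entries` equally spaced samples of func on [lo, hi]."""
    step = (hi - lo) / (entries - 1)
    return [func(lo + i * step) for i in range(entries)], step

def lut_interp(table, step, lo, x):
    """Linear interpolation between neighboring entries (requires lo <= x <= hi)."""
    pos = (x - lo) / step
    i = min(int(pos), len(table) - 2)
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])

def gauss(d):
    return math.exp(-d * d)

table, step = build_lut(gauss, 0.0, 1.0, 17)
approx = lut_interp(table, step, 0.0, 0.37)
exact = gauss(0.37)
```

Interpolation keeps the table small, but each lookup then needs a subtraction, a multiplication and an addition. This is the time-for-area trade mentioned above.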
Conclusion
In this paper, pattern recognition experiments with stochastic computing are simulated and analyzed as a good starting example for deep learning, especially for the design of realizable, simple, inexpensive and robust systems with promising application prospects. Under the same network structure, by changing the length of the stream we can obtain different computation precision without changing the hardware structure of the network, which gives system designers more flexibility to weigh pattern accuracy, power consumption and calculation speed. Moreover, the noise injection results show that the stochastic logic remarkably improves fault tolerance for the given error rates. Within the tested error rates, device faults have little effect on the performance and accuracy of the entire circuit. This will open up more opportunities for the application and promotion of new types of micro-mechanical manufacturing.
