This paper presents the training algorithm of a hardware-implemented neuro-adaptive inference system based on the Sugeno architecture. The block diagram of the hardware that computes the inference system output is discussed, and the implementation of real-time parameter tuning in a reconfigurable circuit is presented. The functionality of the proposed system is demonstrated by measurements. The resulting architecture has a very high processing speed, and parameter adaptation works in parallel with output processing. The proposed architecture can also be used to develop different training algorithms.
General information
The background of the project is the pipelined hardware implementation of an adaptive neuro-fuzzy inference system [1]. The system was implemented in two phases. In the first phase, output processing was implemented in the reconfigurable part of the SoC circuit. During development we divided the algorithm into functional subunits and implemented them individually.
In this paper the hardware unit that tunes the parameters in real time is presented. The result of the hardware implementation of the algorithm is an IP core connected to the bus system of the ARM processor of the SoC circuit. As hardware support, a Zynq System-on-Chip development board was used, with a dual-core ARM Cortex-A9 hard processor and a programmable logic part [2].
Hardware implementation of the neuro fuzzy inference system

A. Structure of the hardware implemented neuro fuzzy system
The pipelined model was built with two methods: as a modular structure in the System Generator environment and using high-level synthesis. With both methods, an IP core was developed from the adaptive neuro-fuzzy system and connected to the bus system of the processor. In this paper the System Generator based, pipelined implementation of the neuro-fuzzy system is presented. As a hardware tool we used a SoC development board containing a reconfigurable unit.
The block diagram of the model proposed for pipelined hardware implementation is presented in Figure 1 and contains the following main units:
Control unit: synchronizes the subunits of the IP core. After the inputs have been loaded into the corresponding registers, commands can be sent to the control unit through the control register to start the processing cycle.
Input/output registers: the input registers store the predefined x, y values.
Control register: configures the functioning of the system.
Output registers: the error register and the register storing the output of the inference system.
In the test phase the values of the input registers are programmed from the processor. In a specific application the inputs can be written to the input registers directly from the sensors of the real system. The required-values register stores the required output for the current inputs, because the training algorithm implemented in the FPGA circuit needs the required output corresponding to each input value pair.
The operating mode of the IP core is set with the control register(s) (e.g., whether training is performed on the processor or in the module on the circuit). The control module is also managed through the control register. The state of the hardware-implemented neuro-fuzzy module can be read through the status register.
Parameter memory: stores the parameters of the inference unit. Based on the results obtained with previously built hardware-implemented neural networks, the training algorithm can be processed in parallel with the output processing if dual-port BRAM memories are used.
Memory interface: when the training algorithm runs on the processor, the recalculated parameters are written to the parameter memory through this interface.
Fuzzification units: the membership functions are implemented in these modules. Because overlap exists only between neighboring functions, at most two membership functions are activated for every input.
As a result of the fuzzification we get the indices of the active membership functions and the activation degrees (membership values).
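The behaviour described above can be illustrated with a minimal software sketch. It assumes triangular membership functions whose feet lie on the neighbouring centres, so that exactly two functions overlap at any point; the shape, the function name `fuzzify` and the `centers` argument are illustrative assumptions, not taken from the hardware design.

```python
def fuzzify(x, centers):
    """Return [(index, degree), ...] for the membership functions active at x.

    Assumption: triangular functions centred on the sorted values in
    `centers`, each falling to zero at the centres of its neighbours.
    """
    if x <= centers[0]:
        return [(0, 1.0)]
    if x >= centers[-1]:
        return [(len(centers) - 1, 1.0)]
    # Locate the interval [centers[i], centers[i + 1]) that contains x.
    for i in range(len(centers) - 1):
        left, right = centers[i], centers[i + 1]
        if left <= x < right:
            mu_right = (x - left) / (right - left)
            # Only the two neighbouring functions can be active here.
            return [(i, 1.0 - mu_right), (i + 1, mu_right)]
```

For example, `fuzzify(0.25, [0.0, 1.0, 2.0])` yields the indices 0 and 1 with degrees 0.75 and 0.25.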
B. Fuzzification subunit hardware implementation
The pipeline structure of the fuzzification is represented in the following figure. The membership functions were implemented with lookup tables: one memory stores the parameters of the membership functions, while another stores their values [ICCC]. The main phases of the pipeline structure during fuzzification are addressing the membership functions and reading the values of the membership functions. Based on the parameters read, if the membership function is active, the membership value is read from the membership value memory.
In the last phase of the pipeline structure the activation values of the active membership functions and the indices of the membership functions are stored in the output registers.
The subunits of the fuzzification are synchronized by the fuzzification controller. The end of fuzzification is signaled on the RD output. There are several possible methods for implementing the membership functions (as related in previous sections). As a first, simpler solution we chose to store the membership values in BRAM memories. This method is easy to implement, but its disadvantage is that the membership function parameters cannot be tuned in real time. When a training algorithm running on the processor is used, the values of the membership functions have to be recalculated with the new parameters and reloaded into the parameter memory of the IP core fuzzification unit.
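The trade-off of the lookup-table approach can be sketched in software as follows. This is only an illustration under stated assumptions: a triangular shape, an 8-bit table address and an input range of [0, 1] are hypothetical choices, and `build_lut` and `lookup` are illustrative names rather than parts of the IP core.

```python
def build_lut(center, half_width, address_bits=8, x_min=0.0, x_max=1.0):
    """Precompute a triangular membership function into a table.

    Stands in for the step in which the processor recalculates the
    membership values before reloading them into the memory.
    """
    size = 1 << address_bits
    step = (x_max - x_min) / (size - 1)
    table = []
    for addr in range(size):
        x = x_min + addr * step
        table.append(max(0.0, 1.0 - abs(x - center) / half_width))
    return table


def lookup(table, x, x_min=0.0, x_max=1.0):
    """Read the membership value stored at the address nearest to x."""
    addr = round((x - x_min) / (x_max - x_min) * (len(table) - 1))
    addr = min(max(addr, 0), len(table) - 1)  # clamp to the table range
    return table[addr]
```

Reading a value costs a single memory access, but changing `center` or `half_width` requires rebuilding the whole table, which is why this variant cannot be tuned in real time.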
Hardware implemented parameter tuning
A. Training algorithm
In the ANFIS inference system both the precondition parameters (parameters of the first layer) and the consequent parameters (parameters of the third layer) are tunable. The parameters can be adapted in different ways [3], [4]: gradient-based training, reinforcement learning [5], tuning the parameters with a genetic algorithm [6], [7], [8], [9], particle swarm optimization, and the hybrid learning algorithm [3]. In the ANFIS hardware implementation we required that the parameters be tunable from the embedded processor, so that different training algorithms can be tested. The output of the hardware-implemented ANFIS can be processed fast.
We designed a hardware module for training the parameters of the inference unit, which makes it possible to tune the parameters according to the actual error in parallel with the output processing. The output of the inference system can be defined by the following equation:
Tuning of output parameters by applying the gradient method can be made based on the following equation:
where η is the training coefficient, E is the actual quadratic error computed from the difference d - y, d is the expected output for the given inputs, and y is the calculated output.
Making the calculations for p, q, r parameters we get the following equations:
where x_l, y_l are the l-th inputs of the training set, for which the error was calculated, and w_i is the normalized activation degree.
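For a first-order Sugeno system with two inputs, the relations described above are conventionally written as sketched below. The cost definition E = ½(d − y_out)², the rule index i and the iteration index t are assumptions made here for illustration, and the output is written y_out to keep it distinct from the input y; the scaling used in the actual design may differ.

```latex
\begin{align}
y_{\mathrm{out}} &= \sum_i \bar{w}_i \,(p_i x + q_i y + r_i) \\
p_i(t+1) &= p_i(t) - \eta \frac{\partial E}{\partial p_i},
  \qquad E = \tfrac{1}{2}\,(d - y_{\mathrm{out}})^2 \\
p_i(t+1) &= p_i(t) + \eta\,(d - y_{\mathrm{out}})\,\bar{w}_i\,x_l \\
q_i(t+1) &= q_i(t) + \eta\,(d - y_{\mathrm{out}})\,\bar{w}_i\,y_l \\
r_i(t+1) &= r_i(t) + \eta\,(d - y_{\mathrm{out}})\,\bar{w}_i
\end{align}
```

The last three lines follow from the second because the derivative of y_out with respect to p_i, q_i and r_i is w̄_i x, w̄_i y and w̄_i respectively.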
B. Training algorithm hardware implementation
The pipelined hardware module for tuning the parameters of the consequent part of the ANFIS was designed according to the equations presented above. The block diagram of the parameter adaptation mechanism for hardware implementation is presented in the following figure. The new parameter values are calculated in parallel with the inference system output: while the output is processed for inputs x_k, y_k, training is performed for the parameters activated by inputs x_{k-1}, y_{k-1}.
The last operation in output processing is the calculation of the error between the reference value and the inference system output. If the parameters were to be updated in the same cycle, the tuning would have to be carried out after the error calculation as well.
At the start of a new calculation cycle, the contents x_k, y_k of the input registers of the fuzzification submodules are copied into the input registers of the parameter adaptation module.
Simultaneously, the error value corresponding to inputs x_k, y_k is copied from the error register at the end of the output processing module to the error register of the parameter adaptation module, and the learning factor is copied to the parameter register. The input registers of the fuzzification units receive the new values x_{k+1} and y_{k+1} from the input registers of the interface module.
In the current cycle the outputs are calculated from the input values x_{k+1} and y_{k+1}, whereas in the parameter update module the parameters are tuned based on the input values x_k and y_k.
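The one-cycle offset between output processing and training can be modelled in software as below. This is a behavioural sketch, not the hardware: it assumes a first-order Sugeno consequent and the usual gradient step, takes the normalized weights as given instead of computing them by fuzzification, and uses the hypothetical names `infer`, `update` and `run`.

```python
def infer(params, weights, x, y):
    """First-order Sugeno output; weights are normalized activation degrees."""
    return sum(w * (p * x + q * y + r) for w, (p, q, r) in zip(weights, params))


def update(params, weights, x, y, error, eta):
    """Gradient step on the consequent parameters for one latched sample."""
    return [
        (p + eta * error * w * x, q + eta * error * w * y, r + eta * error * w)
        for w, (p, q, r) in zip(weights, params)
    ]


def run(samples, params, eta):
    """samples: iterable of (x, y, weights, d). Returns (outputs, params)."""
    outputs = []
    pending = None  # (weights, x, y, error) latched in the previous cycle
    for x, y, weights, d in samples:
        # Output processing for the current inputs uses the parameters
        # as they were read before the write-back of this cycle.
        out = infer(params, weights, x, y)
        if pending is not None:
            # Training for the inputs of the previous cycle.
            params = update(params, *pending, eta)
        pending = (weights, x, y, d - out)
        outputs.append(out)
    return outputs, params
```

With a single rule, zero initial parameters, `eta = 0.5` and the sample `(1.0, 0.0, [1.0], 1.0)` presented three times, the outputs are 0.0, 0.0 and 1.0: the effect of the first error becomes visible only in the third cycle.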
The consequent parameters are recalculated in the submodules P, Q and R, which are implemented as pipeline structures.
In the k-th processing cycle, while the inference system output is calculated for the inputs x_k and y_k, the activated parameters read from the parameter memory are shifted into the delay chains of subunits P, Q and R, the weight values are shifted into the subunit denoted W, and the parameter memory addresses are shifted into the address delay chain.
During processing cycle k + 1, the data (parameters, weights) stored in the parameter update module in cycle k are shifted on to the operation units as new data enter the delay chains.
At the input, four new parameters are added to the chain, while at the module output the new parameter values are delivered over four clock cycles.
The parameter addresses in the address delay chain are shifted in parallel with the parameters, so that when a new parameter is ready at the outputs of modules P, Q, R, the memory address corresponding to that parameter appears at the output of the address delay chain. The new parameter values are written back to that address in the parameter memory. It is essential that every parameter value be written back to the same location in the parameter memory from which it was read.
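The role of the address delay chain can be illustrated with a small behavioural model. The chain depth of 4, the constant correction `delta` standing in for the computed gradient term, and the names `DelayChain` and `tune` are all hypothetical; the sketch only shows that values and addresses shifted in lock-step arrive at the write-back stage together.

```python
from collections import deque


class DelayChain:
    """Fixed-length shift register: one item in, the oldest item out."""

    def __init__(self, depth):
        self.cells = deque([None] * depth, maxlen=depth)

    def shift(self, item):
        oldest = self.cells[0]
        self.cells.append(item)  # maxlen discards the oldest cell
        return oldest


def tune(memory, reads, delta, depth=4):
    """Read parameters at `reads`, write corrected values back in place."""
    values, addresses = DelayChain(depth), DelayChain(depth)
    # `depth` extra idle shifts flush the chains after the last read.
    for addr in list(reads) + [None] * depth:
        value = memory[addr] if addr is not None else None
        old_value = values.shift(value)
        old_addr = addresses.shift(addr)
        if old_addr is not None:
            # The address leaves its chain together with its parameter,
            # so the write goes to the location that was read.
            memory[old_addr] = old_value + delta
    return memory
```

Calling `tune([10, 20, 30, 40, 50], [3, 1], delta=1)` changes only the entries at addresses 3 and 1, giving `[10, 21, 30, 41, 50]`.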
One simple way to handle the parameters during the tuning process is to use dual-port BRAM memories for parameter storage.
However, to also give the program running on the embedded processor access to the parameter values, a three-port memory would be necessary (one port for reading the parameters during output calculation, one for the parameter update, and one for access from the embedded processor).
In the end, a dual-port BRAM memory was used to store the consequent parameters of the inference system: one port gives the embedded processor access to the data, while the other port is shared by output processing and parameter update.
The new parameter values can be written back to the parameter memory only after the parameter reading phase is finished.
Output calculation and parameter update cannot access the memory simultaneously, because during output processing an input may activate a different membership function than the one that was active in the previous cycle, for which the parameters are being updated. In that case the parameters would be read from the wrong memory address during output processing.
The resulting hardware-implemented parameter tuning makes it possible to update the consequent parameters of the inference system in parallel with output processing.
The module works at high speed: output processing and parameter update are completed in 60 clock cycles.
The system architecture was constructed so that different training algorithms running on an embedded processor core can also be tested.
C. System architecture
The IP core was generated from the model implemented and simulated in the System Generator tool, and was connected to the AXI bus of the embedded processor using the Xilinx EDK Platform Studio (XPS) tool.
As basic hardware, a Zynq SoC circuit (xc7z020-1clg484) with a reconfigurable part was used.
The hardware-implemented neuro-fuzzy inference system was assembled in the XPS tool: the IP core was connected and the configuration .bit file was generated.
The XPS results were exported to the Software Development Kit, where the test program running on the embedded processor was developed.
D. Measurements and system testing
The program prepared to test the inference system supports parameter initialization, membership function definition, reading back the current parameters, reading back the membership functions, and starting a learning cycle.
The results of the test program are sent over the serial UART port to the host computer. The result is a Matlab script containing the parameter arrays and the commands that create the control surface.
The inference system has been tested for surface approximation. Several types of reference surfaces were generated and approximated with the adaptive neuro fuzzy inference system.
The expected outputs of the various control surfaces and the normalized square error were calculated during the training process.
For all measurement series the reference surface (training set), the resulting control surfaces before and after the training phase, and the evolution of the mean squared error were plotted.
For the following series of measurements, Figure 4 shows the reference surface, the initial control surface without learning, and the control surfaces after the first 6 and after the 24th measurement with learning enabled. For the last measurement series the training coefficient was set much too high and the training algorithm did not converge, as can be seen on the control surface. In this measurement series the learning coefficient (UFix_13_10) was changed several times: it was initially programmed to 10/1024 (first part), reduced to 1/1024 (second part), increased to 100/1024 (third part) and then to 200/1024 (fourth part). In the last measurement the learning coefficient was increased to 1000/1024, and the training algorithm no longer converged, as the control surface shows.
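The fractions quoted above follow from the UFix_13_10 format, an unsigned 13-bit word with 10 fractional bits, in which one least significant bit equals 1/1024. A minimal sketch of the conversion is given below; the function names are illustrative, and saturating on overflow is an assumption, since the overflow mode used in the design is not stated.

```python
def to_ufix(value, word_bits=13, frac_bits=10):
    """Quantize a non-negative real number to an unsigned fixed-point raw word."""
    raw = round(value * (1 << frac_bits))
    max_raw = (1 << word_bits) - 1
    return min(max(raw, 0), max_raw)  # assumption: saturate instead of wrapping


def from_ufix(raw, frac_bits=10):
    """Convert a raw fixed-point word back to its real value."""
    return raw / (1 << frac_bits)
```

In this format the learning coefficients 10/1024 and 1000/1024 correspond to the raw register values 10 and 1000, and the largest representable coefficient is 8191/1024, slightly below 8.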
Conclusion
In this research, the output processing of the inference system and the parameter update for the consequent part were successfully implemented in hardware using a pipeline-based architecture.
Several Xilinx design tools were used in developing the application. The model was mainly implemented and tested using the System Generator tool. Some of the subunits (fuzzification controller and system controller) were implemented in VHDL and tested using the ISim simulator.
The embedded processor-based design was created with Xilinx Platform Studio and the software part with the Software Development Kit, both part of the Embedded Development Kit. Design errors were identified during simulation tests performed in System Generator.
We developed a very fast hardware device. In the pipeline structure, most of the clock cycles are taken by the division operation.
The developed architecture makes it possible to test various training algorithms and to use the system in real application tasks.
The hardware-implemented weight adaptation module operates in parallel with the output calculation, so the system calculates the inference system output for one input pattern while training is performed for the previous one.
