The purpose of this work is to present the design flow and the implementation of a neuro-fuzzy controller Intellectual Property (IP) core using a High-Level Synthesis (HLS) tool. The IP core is designed for FPGA-based embedded system architectures. The implemented control algorithm is an Adaptive Neuro-Fuzzy Inference System (ANFIS) based on the Sugeno model. The optimization possibilities offered by the HLS tool and the design of the interfaces of the IP core are also presented.
Introduction
In control engineering, neuro-fuzzy systems have many favorable features. First of all, they combine the advantages of neural network structures and fuzzy inference systems. The neural network structure makes the system adaptive, and the fuzzy logic part makes it possible to describe the ideal behavior of the controlled system with a linguistic rule base. Besides these advantages, real-time implementation is difficult, because the algorithm requires complex operations under time constraints. In most cases sequential implementations cannot work in real time, but a parallelized structure of the operations can be a solution. Many related works deal with the implementation of neuro-fuzzy models using different techniques [1], [2], [3]. A well-suited platform for embedded systems with parallel architectures is the Field Programmable Gate Array (FPGA). The flexibility of these devices makes it possible to create an efficient hardware-implemented neuro-fuzzy controller. FPGA devices can be programmed using Hardware Description Languages (HDL). This is an efficient way to create embedded hardware, but it also requires a deep understanding of the circuit itself. Nowadays it is also possible to create FPGA-based hardware by describing the behavior of the circuit in a high-level programming language such as C, C++ or SystemC. Such code can be synthesized to HDL with HLS tools like Xilinx Vivado HLS [4]. This design approach has many advantages: shorter design time, easier debugging, and no need for detailed knowledge of the Register Transfer Level (RTL) circuit. Until a few years ago, HLS tools were not advanced enough to create efficient hardware architectures, but with the recent improvements and the new high-capacity FPGA boards this has become a reasonable option [5], [12], [13].
Sugeno model based ANFIS structure
The Sugeno fuzzy model has many similarities with the Mamdani fuzzy model; the main difference between them is the defuzzification part. While the Mamdani model uses an output space covered with membership functions, the Sugeno model uses a linear function for each rule, and a weighted average is calculated from the values of the active rule functions. Because Mamdani defuzzification requires calculating the area and the center of gravity of the active output membership functions, it demands many more resources than Sugeno defuzzification. The general structure of the Sugeno model based ANFIS can be seen in Figure 1. The neural network has two inputs, one output and five layers. The first layer is the fuzzification layer (1), (2); its nodes are the input membership functions. Each membership function node is connected to only one input, and in this case there are five functions for each input. Each input value activates at most two membership functions.
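The weighted average used in Sugeno defuzzification can be written compactly. The notation below is an assumption made for illustration ($w_i$ for the firing strength of rule $i$, $f_i$ for its linear rule function with hypothetical coefficients $p_i$, $q_i$, $r_i$); it is the usual first-order Sugeno form and is not copied from the paper's numbered equations.

```latex
f_i(x, y) = p_i x + q_i y + r_i, \qquad
U = \frac{\sum_{i} w_i \, f_i(x, y)}{\sum_{i} w_i}
```

Only the active rules contribute to the sums, which is why at most two active membership functions per input keep the computation small.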
In the second layer the active membership functions of the two inputs are combined: a product is calculated from each pair of membership values (3).
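A minimal sketch of this pairwise product in integer arithmetic is shown below. The function name `rule_products` and the fixed-point scale (a membership value of 1.0 stored as 1024) are assumptions for illustration, not details taken from the paper.

```c
#include <stdint.h>

#define MF_SCALE_SHIFT 10  /* assumption: membership value 1.0 is stored as 1 << 10 */
#define ACTIVE_MF 2        /* at most two active membership functions per input */

/* Layer 2: combine every active membership value of X with every active
 * membership value of Y. The result is a 2 x 2 table of firing strengths. */
void rule_products(const uint32_t mu_x[ACTIVE_MF],
                   const uint32_t mu_y[ACTIVE_MF],
                   uint32_t w[ACTIVE_MF][ACTIVE_MF])
{
    for (int i = 0; i < ACTIVE_MF; i++) {
        for (int j = 0; j < ACTIVE_MF; j++) {
            /* the product of two scaled values carries the scale twice;
             * shift once to return to the original scale */
            w[i][j] = (mu_x[i] * mu_y[j]) >> MF_SCALE_SHIFT;
        }
    }
}
```

With two active functions per input, only four products are needed per sample, regardless of the total number of rules.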
In the standard structure the third layer normalizes the values calculated in the previous layer, but to save resources this operation is performed only after the summation of the active output values. In this implementation the third layer is therefore part of the defuzzification: its nodes contain the functions corresponding to the active rules (4).
The fourth and fifth layers perform the summation of the active rule function values and the normalization of the weights (5), (6).
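The effect of postponing the normalization can be sketched as follows: instead of dividing each weight by the weight sum, the weighted rule values and the weights are accumulated separately and a single division is performed at the end. The function name `sugeno_output` and the zero-weight fallback are assumptions for illustration.

```c
#include <stdint.h>

#define N_ACTIVE_RULES 4  /* 2 active functions per input give 4 active rules */

/* Layers 3 to 5 with deferred normalization: one division in total.
 * w: unnormalized firing strengths, f: values of the active rule functions. */
int32_t sugeno_output(const uint32_t w[N_ACTIVE_RULES],
                      const int32_t f[N_ACTIVE_RULES])
{
    int64_t weighted_sum = 0;  /* sum of w[k] * f[k] */
    int64_t weight_sum = 0;    /* sum of w[k] */

    for (int k = 0; k < N_ACTIVE_RULES; k++) {
        weighted_sum += (int64_t)w[k] * f[k];
        weight_sum += w[k];
    }
    if (weight_sum == 0) {
        return 0;  /* assumption: no rule fired, return a neutral output */
    }
    return (int32_t)(weighted_sum / weight_sum);
}
```

A hardware divider is expensive, so reducing four normalizing divisions to one is a plausible reason for this ordering.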
Implementation of the algorithm
The target device for this architecture is a Zynq System on Chip development board with a dual-core ARM Cortex-A9 hard processor and a programmable logic part. The ANFIS IP core is implemented in the programmable logic, and the learning algorithm runs on the processor. The ANFIS algorithm was implemented in C using the Xilinx Vivado High Level Synthesis tool. During synthesis the HLS tool generates an FSM-based hardware architecture from the C code, taking into account the technical constraints of the target device, the timing constraints, and the directives assigned to variables or program structures. Only integer operations are used, because floating-point operations require many more resources.
The program has three main parts: the ANFIS function, the Fuzzification function and the Bell_Curve function. The ANFIS function is the main function. Its inputs are the two inputs of the controller (X, Y) and the parameters of the third layer (L3_par); its outputs are the calculated control signal (U) and the Ret parameter structure, which contains data for the learning algorithm. The Fuzzification function (L1) calculates the membership values for each input value and returns to the main function the indices of the active membership functions and the corresponding membership values.
As membership functions we use bell curves given by formula (7), but because only integer operations can be used, we rescaled the ... The Bell_Curve function is used by the Fuzzification function and calculates the membership value from the input value and the parameters of the respective membership function. The membership functions of one input space can be seen in Figure 3. The [0, 1024] interval of each of the two input spaces is covered by five membership functions.
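One way to evaluate a bell-shaped membership function with integer operations only is sketched below. This is an illustration under stated assumptions, not the paper's Bell_Curve function: it assumes a generalized bell curve with exponent 1, an output range of 0 to 1024, and hypothetical parameter names `c` (center) and `a` (half-width).

```c
#include <stdint.h>

#define MF_ONE 1024  /* assumption: membership value 1.0 is represented as 1024 */

/* Integer bell curve (assumed form):
 *   mu(x) = MF_ONE * a^2 / (a^2 + (x - c)^2)
 * c is the center; a is the distance from c at which mu drops to MF_ONE / 2. */
uint32_t bell_curve_int(int32_t x, int32_t c, int32_t a)
{
    int64_t d = (int64_t)x - c;
    int64_t a2 = (int64_t)a * a;

    if (a2 == 0) {
        return (d == 0) ? MF_ONE : 0;  /* degenerate width: treat as a spike */
    }
    return (uint32_t)((MF_ONE * a2) / (a2 + d * d));
}
```

For example, with `c = 512` and `a = 128` the function returns 1024 at the center and 512 at a distance of 128 from it. Keeping the output on the same 0 to 1024 scale as the inputs lets later stages rescale with shifts instead of divisions.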
To optimize the code, optimization directives are applied to some elements of the program: PIPELINE and UNROLL directives for some loops, and DATA_PACK and ARRAY_MAP directives to save resources. With this optimization, synthesis resulted in a latency of 68 clock cycles at 100 MHz, which means 680 ns for the calculation of one output. Table 1 shows that only 21% of the available DSPs, 6% of the available flip-flops and 10% of the available LUTs were used to generate the ANFIS IP core. There are many training algorithms for this kind of system [6], [7], [10]. The learning algorithm, also implemented in C, runs on the processor. The Ret structure contains the indices of the active rules and the weights of the second layer (w) for each active rule. The processor's software updates the parameters with the following formulas:
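The sketch below illustrates how loop directives of this kind are typically attached to C code for Vivado HLS, together with the latency arithmetic quoted above. The function names are hypothetical and the loops are not the paper's code; an ordinary C compiler ignores the `#pragma HLS` lines, which only take effect during synthesis.

```c
#include <stdint.h>

#define N_ACTIVE_RULES 4

/* Fully unrolled accumulation: all four multiplications can be scheduled in parallel. */
int64_t weighted_sum(const uint32_t w[N_ACTIVE_RULES],
                     const int32_t f[N_ACTIVE_RULES])
{
    int64_t acc = 0;
    for (int k = 0; k < N_ACTIVE_RULES; k++) {
#pragma HLS UNROLL
        acc += (int64_t)w[k] * f[k];
    }
    return acc;
}

/* Pipelined accumulation: a new iteration is started every clock cycle if possible. */
uint32_t sum_weights(const uint32_t w[N_ACTIVE_RULES])
{
    uint32_t acc = 0;
    for (int k = 0; k < N_ACTIVE_RULES; k++) {
#pragma HLS PIPELINE II=1
        acc += w[k];
    }
    return acc;
}

/* Latency in nanoseconds for a given cycle count and clock frequency in MHz. */
uint32_t latency_ns(uint32_t cycles, uint32_t clock_mhz)
{
    if (clock_mhz == 0) {
        return 0;  /* guard against division by zero */
    }
    return cycles * 1000u / clock_mhz;
}
```

Calling `latency_ns(68, 100)` reproduces the 680 ns figure: one cycle at 100 MHz lasts 10 ns, and 68 cycles are needed per output.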
Here indexX and indexY contain the indices of the active membership functions on the two inputs, μ is the learning coefficient, and w̄ is the normalized weight of the second layer.
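Since the update formulas themselves are not reproduced here, the following is only a generic delta-rule sketch built from the quantities named above, not the paper's method. It assumes one parameter per rule, a hypothetical `error` term (reference output minus ANFIS output), and floating-point arithmetic on the processor side; the function name `update_rule_param` is also hypothetical.

```c
#define N_MF 5  /* five membership functions per input, so 5 x 5 rule parameters */

/* Assumed delta-rule step for one active rule:
 *   par[index_x][index_y] += mu * w_norm * error
 * mu: learning coefficient, w_norm: normalized second-layer weight of the rule,
 * error: hypothetical difference between reference and ANFIS output. */
void update_rule_param(float par[N_MF][N_MF],
                       int index_x, int index_y,
                       float mu, float w_norm, float error)
{
    if (index_x < 0 || index_x >= N_MF || index_y < 0 || index_y >= N_MF) {
        return;  /* ignore out-of-range indices */
    }
    par[index_x][index_y] += mu * w_norm * error;
}
```

Only the parameters of the active rules are touched in each step, so at most four entries change per sample. Values computed this way would still have to be converted to the integer format used by the IP core before being written back.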
Interfaces
The ANFIS IP core has two types of interfaces: an AXI4-Lite slave interface [8] and a BRAM interface. The whole architecture, containing the ANFIS IP core and the processor, can be seen in Figure 4. The X and Y inputs, the U control signal and the Ret data structure output of the ANFIS IP core use the AXI4-Lite interface. Each of these four inputs and outputs has a handshake-type protocol assigned, and each can be accessed through a register. The parameters of the third layer are stored in a dual-port BRAM. The dual-port BRAM has the advantage that its contents can be read through port A and written through port B at the same time. The ANFIS core thus reads the parameters for the current calculation while the processor calculates the adapted parameters for the previous inputs and writes them into the BRAM. The BRAM is connected to the processor via an AXI BRAM controller. The AXI channel of the ANFIS IP core and the AXI BRAM controller are connected to an AXI Interconnect module, which provides the connection to the processor.
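In Vivado HLS, interfaces of this kind are usually requested with `INTERFACE` pragmas on the arguments of the top-level function. The sketch below is an assumption-laden illustration: the function name `anfis_top`, the packed `Ret` word (standing in for the Ret structure), and the placeholder body are hypothetical, and the selection of the handshake protocol is omitted. Compiled with an ordinary C compiler, the pragmas are ignored and the function behaves like plain software.

```c
#include <stdint.h>

#define N_MF 5          /* five membership functions per input */
#define INPUT_MAX 1024  /* inputs cover the [0, 1024] interval */

/* Placeholder only: clamps an input and maps it to an index 0..N_MF-1.
 * It stands in for the real fuzzification and is not the ANFIS algorithm. */
static int coarse_index(int32_t v)
{
    if (v < 0) v = 0;
    if (v > INPUT_MAX) v = INPUT_MAX;
    return (int)(v * (N_MF - 1) / INPUT_MAX);
}

void anfis_top(int32_t X, int32_t Y,
               const int32_t L3_par[N_MF * N_MF],
               int32_t *U, uint32_t *Ret)
{
/* scalar inputs and outputs become registers of an AXI4-Lite slave */
#pragma HLS INTERFACE s_axilite port=X
#pragma HLS INTERFACE s_axilite port=Y
#pragma HLS INTERFACE s_axilite port=U
#pragma HLS INTERFACE s_axilite port=Ret
#pragma HLS INTERFACE s_axilite port=return
/* the parameter array is read through a BRAM port */
#pragma HLS INTERFACE bram port=L3_par

    int ix = coarse_index(X);
    int iy = coarse_index(Y);

    *U = L3_par[ix * N_MF + iy];                 /* placeholder output */
    *Ret = ((uint32_t)ix << 8) | (uint32_t)iy;   /* packed rule indices */
}
```

Keeping the parameter array on its own BRAM port is what allows the processor to write new parameters through the second port while the core reads through the first.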
Simulation results
For the training simulation we chose a reference surface (Figure 5), which was the result of a simulation with a fuzzy controller. The surface is generated ... Figure 6 shows the surface generated from the simulations made with the ANFIS IP core, and the evolution of the surface during the training cycles. The plots were made after 5, 50, 75 and 100 training cycles. Five membership functions were used for each input space, with the parameters shown in Table 2.
Figure 6: Output surface at different numbers of training cycles
Conclusion
We have presented the hardware implementation of a neuro-fuzzy system with a high-level synthesis tool and demonstrated that such tools can be used to create efficient hardware architectures. Hardware implemented with HLS requires less design time than other techniques, offers many optimization possibilities and provides a simple method to create interfaces. Based on the low latency, the acceptable resource utilization and the measurement results, we can conclude that the ANFIS IP core can be used in real-time control applications.
