Abstract-The hardware implementation of an artificial neural network (ANN) using field-programmable gate arrays (FPGAs) is a research field that has attracted much interest and attention. With the developments made, the programmer is now forced to face various challenges, such as the need to master various complex hardware-software development platforms, hardware description languages, and advanced ANN knowledge. Moreover, such an implementation is very time consuming. To address these challenges, this paper presents a novel neural design methodology using a holistic modeling approach. Based on the end-user programming concept, the presented solution empowers end users by means of abstracting the low-level hardware functionalities, streamlining the FPGA design process and supporting rapid ANN prototyping. A case study of an ANN as a pattern recognition module of an artificial olfaction system trained to identify four coffee brands is presented. The recognition rate versus training data features and data representation was analyzed extensively.
I. INTRODUCTION
HARDWARE implementation of an artificial neural network (ANN) using field-programmable gate arrays (FPGAs) has been an active research field with applications in various domains since the early 1990s. At the beginning, the only generally accepted method was to design the application by means of hardware description languages (HDLs) for very large-scale integration circuits, in particular VHDL (Very High Speed Integrated Circuit Hardware Description Language) or Verilog. Nowadays, engineers use modern electronic design automation tools to create, simulate, and verify a design and, without committing to hardware, can quickly evaluate complex systems with high confidence in the "right first time" correct operation of the final product. The FPGA reconfiguration capability and its parallel processing power are "hot topics," recognized in many papers focused on industrial applications: hardware-implemented polar decoders [1], an FPGA-embedded controller of an n-level dc-dc-ac inverter [2], or hardware implementation of predictive control algorithms for power converters [3]. With the newly emerged development environments for all programmable systems-on-chip and multiprocessor systems-on-chip, complex algorithms are now implemented in FPGA-embedded processors [4]: an FPGA/digital signal processor (DSP)-based digital controller with self-reconfiguration property for power quality compensation [5], and FPGA-embedded multiprocessor programmable logic controllers (PLCs) that provide high execution speed and multiprocessing programming [6]. Although ANNs have been implemented in hardware for more than 25 years [7], they remain in the center of attention for many researchers, and a variety of methods to develop hardware-implemented ANNs have been reported in the literature in the past decade [8], [9]. An overview of these achievements is given in [10], where the ANN theory and its hardware implementation are analyzed.
The main advantage of the above methods is that the functional description of the design (the mathematical model) and its hardware implementation have been brought closer, but the gap between them still exists. The pressing need to master different environments calls for a holistic approach in which the mathematical description and the electronic design implementation are simultaneously addressed in a single environment. According to [11], the benefit of the holistic modeling approach is the possibility to evaluate increased system complexity at an early design stage in a single platform. The time to market will be shortened, the use of automatic processes for implementing the ANNs in hardware will be facilitated, and therefore, investigating different system topologies (ANN topologies) will be eased. Combining the above-enumerated holistic modeling advantages with HDLs and FPGA capabilities, more complex ANNs can be modeled, simulated, and implemented with increased resource efficiency [12]. In this sense, an interesting approach is taken in [13], where the VHDL code of a multilayer perceptron ANN topology is generated by means of a graphical user interface (GUI) designed in MATLAB. The tool lifts the VHDL design burden from the user's shoulders, making the CAD environment more user-centered. Similar methods are reported in the literature, where automatic tools are developed to help the designer exploit the dynamic partial reconfiguration of FPGA circuits [14] or to generate the VHDL code of complex fuzzy-logic systems [15]. This paper takes these steps further and presents a methodology based on the end-user programming (EUP) [16] concept, where end users (EUs) are shielded from the need to know low-level technical HDLs.
This is achieved by providing different layers of abstraction to represent the application functionality in hardware, such that EUs are empowered by simply manipulating the abstractions via an intuitive and interactive GUI to support rapid prototyping. The system was tested as a pattern recognition module of an artificial olfaction system for identifying different coffee brands. An extended analysis of the recognition rates (RRs) versus data representation has been performed. This paper is structured as follows. Section II presents the EUP approach to ANN design. Section III describes the neural library design. Section IV presents the ANN abstraction and EUP. Section V provides the application and its analysis. Conclusions are given in Section VI.
II. EUP ON ANN DESIGN APPROACH
EUP is characterized by the use of techniques that allow EUs of an application to create "programs" themselves without needing to write any code [16]. A common way to achieve this goal is to create proprietary types of "scripting languages," abstracting conventional programming algorithms into some form of representation (e.g., graphical objects), and then to provide a platform for the users to manipulate these representations as the basis of learning how to create a program. Earlier work in this area was primarily focused on single desktop computing, allowing EUs to create programs by manipulating abstract graphical objects. Recent developments have moved away from desktop computing systems to technology-rich ubiquitous environments where the EUP approach is no longer restricted to a single PC but leverages objects as a means to interact with the system [17]. Consequently, while some approaches still employ traditional GUIs on a single PC [18], others target mobile devices [19]. The technique adopted in this paper follows earlier published work on pervasive interactive programming [20] that employs a show-me-by-example approach via natural interactions. The method further extends the use of modern electronic design automation (EDA) tools for the design, simulation, and hardware implementation of an ANN, aiming to change the way in which user applications are defined. Instead of a classical solution, in which the application is defined using HDLs, it is more efficient (in terms of performance versus hardware resource utilization) and user friendly (the user does not need to know the neural algorithm or how to implement it in hardware) to create a pattern recognition system, in our case an ANN, by providing layers of abstraction to represent configurable modules, which are grouped into specific libraries that interface with the hardware.
The abstractions are presented on an intuitive and interactive configurable GUI, which EUs can interact with and easily manipulate. The manipulations can be achieved by simple gestures such as pressing, tapping, and "drag and drop." The GUI "listens" and "learns" the user's interactions on the screen and composes the program at the same time based on the desired requirements. The immediate benefits of this approach are as follows. 1) Speed up the early phases of different ANNs' design and development process. 2) Allow the EUs, who may not be familiar with the technologies, to create their programs without enduring a steep learning curve (see Fig. 1 ). The outcome is a configurable neural library embedded into a design environment that allows considering simultaneously all the aspects of the system design. In this way, high processing speed (PS) at minimum hardware resource utilization can be achieved.
III. NEURAL LIBRARY DESIGN
The ANN performance is heavily influenced by the topology chosen, and matching the topology to the application remains crucial. In this paper, the feedforward network with the back-propagation learning algorithm (FFBP) was chosen to be modeled. As the main FFBP features, such as the network topology, are selected by repeated modifications, simulations, and implementations of the project code, the availability of a hardware-ready, implementable ANN library would help in the rapid design of reliable pattern recognition systems in hardware.
The created ANN library, described in the next sections, contains extendable modules that comply with a generic FFBP architecture. It consists of processing units (neurons) organized in successive layers: 1) one input layer; 2) one or more intermediate hidden layers; and 3) one output layer. The network is fully connected, i.e., all the outputs of a layer are connected by synapses to all inputs of the following layer. Only the hidden and the output layers include processing units, whereas the input layer is used just for data feeding. The network uses the feedforward algorithm to push information forward from one layer to the next one and the backpropagation training algorithm for determining its weights: an iterative algorithm that seeks the minimum of the error function using the derivative of the sum-of-squares error with respect to the weights. The proposed software-hardware platform, underpinned by the ANN library and user interface, represents a viable way of designing FFBP topologies with on-chip learning and implementing them on FPGAs, as demonstrated in Section V.
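The structure described above (a pass-through input layer, fully connected hidden and output layers, and gradient-based weight updates) can be illustrated with a minimal software sketch. It is not the hardware library itself: the function names, the omission of bias terms, the sigmoid firing function, and the learning rate `eta` are assumptions made for illustration only.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # one row of synaptic weights per neuron


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(weights: Matrix, inputs: List[float]) -> List[float]:
    """Multiply-accumulate each neuron's inputs, then apply the firing function."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(v: Matrix, w: Matrix, x: List[float]) -> Tuple[List[float], List[float]]:
    """The input layer only feeds data; hidden (v) and output (w) layers compute."""
    y = layer_forward(v, x)
    o = layer_forward(w, y)
    return y, o


def train_step(v: Matrix, w: Matrix, x: List[float], d: List[float],
               eta: float = 0.5) -> float:
    """One back-propagation update (in place). Returns the error before the update."""
    y, o = forward(v, w, x)
    error = 0.5 * sum((dk - ok) ** 2 for dk, ok in zip(d, o))
    # Error signals of the output layer, then of the hidden layer.
    delta_o = [(dk - ok) * ok * (1.0 - ok) for dk, ok in zip(d, o)]
    delta_y = [yj * (1.0 - yj) * sum(delta_o[k] * w[k][j] for k in range(len(w)))
               for j, yj in enumerate(y)]
    for k, row in enumerate(w):          # output-layer weights
        for j in range(len(row)):
            row[j] += eta * delta_o[k] * y[j]
    for j, row in enumerate(v):          # hidden-layer weights
        for i in range(len(row)):
            row[i] += eta * delta_y[j] * x[i]
    return error
```

Calling `train_step` repeatedly on the same pattern should drive the returned error down, which mirrors the stop-below-threshold criterion used for on-chip learning.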
A. FFBP Neural Network Algorithm
The neural algorithm that emulates the FFBP ANN behavior is described through (1)-(4). It starts with the computation of the output vector in (1): a multiply and accumulate (MAC) of all inputs with their corresponding weights, the result of which is fired through an activation function f. The goal of the algorithm is to minimize the error function calculated in (2) by means of weight adjustment. For this, the partial derivative of the error with respect to each neuron's net output value is computed in (3). Next, an update stage follows in (4), in which all the weights, hidden and output ones, are adjusted.
The FFBP network repeats these computation steps until the calculated error is less than a given threshold value. In (1)-(4), the following notation is used: net is the fired neuron output; $w_k$ is the weight vector of neuron k from the output layer; i, j, and k are the neuron indexes; I, J, and K are the numbers of neurons of the input, hidden, and output layers, respectively; E is the error function; o is the output vector of the output layer; y is the output vector of the hidden layer; $\delta_{ok}$ and $\delta_{yj}$ are the gradients of the error signals of neuron k of the output layer and of neuron j of the hidden layer, respectively; and $v_j$ is the weight vector of neuron j from a hidden layer.
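For orientation, the four computation steps can be written in a conventional form using the symbols listed above. This is a sketch of the standard FFBP formulation and may differ in detail from the authors' own expressions. The input $x_i$, the desired output $d_k$, the learning rate $\eta$, and the scalar weights $w_{kj}$ and $v_{ji}$ (components of $w_k$ and $v_j$) are notation assumed here, and net is taken as the MAC result to which f is applied.

```latex
% Forward pass: MAC followed by the firing function
net_j = \sum_{i=1}^{I} v_{ji}\, x_i, \quad y_j = f(net_j), \qquad
net_k = \sum_{j=1}^{J} w_{kj}\, y_j, \quad o_k = f(net_k)

% Sum-of-squares error
E = \frac{1}{2} \sum_{k=1}^{K} (d_k - o_k)^2

% Error signals of the output and hidden layers
\delta_{ok} = -\frac{\partial E}{\partial net_k} = (d_k - o_k)\, f'(net_k), \qquad
\delta_{yj} = -\frac{\partial E}{\partial net_j} = f'(net_j) \sum_{k=1}^{K} \delta_{ok}\, w_{kj}

% Weight updates
w_{kj} \leftarrow w_{kj} + \eta\, \delta_{ok}\, y_j, \qquad
v_{ji} \leftarrow v_{ji} + \eta\, \delta_{yj}\, x_i
```

For a sigmoid firing function, $f'(net) = f(net)\,(1 - f(net))$, so the error signals need only the already-computed neuron outputs.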
B. FFBP Neural Library Design
Designing the neuron of a multilayer FFBP with on-chip learning must consider not only the computations in the propagation phase, when the neural network is already trained and performs the recognition task (1), but also the learning phase, when the neural weights are updated according to the error minimization, as in (2)-(4). The neuron designed by the authors is built using Xilinx System Generator library blocks and consists of a MAC unit, a RAM memory module, a multiplexer and a register for bias value initialization, and a firing function block. When designing the MAC unit, two approaches may be adopted (see Fig. 2): 1) using distributed resources; or 2) using dedicated modules such as XtremeDSP or BRAM blocks. The number of dedicated modules (which ensure the best neuronal processing performance) differs from one FPGA family to another. Therefore, to find the best neuron architecture for the hardware resources available in the targeted FPGA, four possible optimization scenarios were considered. 1) DL_AO: minimize the occupied area with multiplications done using distributed logic resources. 2) DL_SO: maximize processing speed with multiplications done using distributed logic resources. 3) DSP_AO: minimize the occupied area with multiplications done using dedicated resources. 4) DSP_SO: maximize processing speed with multiplications done using dedicated resources. The maximum processing frequency and the hardware resource utilization obtained after synthesis were generated with the Xilinx ISE report generator tool and are presented in Table I. The analysis of the results shows that the neuron based on the XtremeDSP block has the highest processing frequency and uses the fewest hardware resources in terms of slices or LUTs (as expected). Nevertheless, as the XtremeDSP blocks are limited (128 for 4VSX35), distributed logic can be used instead to extend the number of neurons implemented.
Another component of the neuronal library is the activation function. Its role is to map the neuron output values to the range of the function chosen as the firing function, in this case the sigmoid function

$$f(x) = \frac{1}{1 + e^{-x}}. \tag{5}$$
Implementing the sigmoid function in hardware requires advanced HDL knowledge. Moreover, once implemented, it acts as a bottleneck for the neuron speed performance while at the same time demanding considerable hardware resources. In order to reduce the hardware cost, different approximations of the sigmoid function can be adopted. The main classical methods are lookup tables and truncation of the Taylor series expansion. Taylor expansion can further be implemented in various ways: 1) sum-of-steps; 2) piecewise linear; 3) a combination of the previous; or 4) others. The best results reported in the literature show errors of 8% to 13.1% for sum-of-steps approximations and ±2.45% to ±1.14% for piecewise linear approximations. There are also approximations with smaller errors, but they use floating-point multiplications; thus, practical implementation becomes too complex [21].
The firing function library, created by the authors, consists of hardware-ready, implementable modules of functions chosen to approximate the sigmoid function: 1) A-law (F1); 2) Alippi (F2); 3) PLAN (F3); and 4) Zhang (F4) [21]. Their mathematical and hardware implementations are summarized in Table II and their hardware resource utilization in Table III. As shown, the approximation functions were implemented using a minimum of hardware resources. Table IV shows the resources utilized by the entire neuron using different approximation functions, revealing that each of them has drawbacks and strengths in terms of PS and hardware utilization.
It can be concluded that the best approximation method, in terms of resources utilized and errors introduced, is the PLAN function when the number of neurons that use the sigmoid function is larger than the number of BRAM blocks available in the FPGA circuit. When the number of neurons is lower than the total number of BRAM blocks available in the FPGA circuit, the best way to approximate the sigmoid function is the lookup table method. The resolution used was (3, 10), where 3 bits were allocated for the integer part and 10 bits for the fractional part. The errors introduced by the implemented functions are summarized in Table V.
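To make the piecewise linear idea concrete, the sketch below evaluates the PLAN approximation in the form in which it is commonly cited: four linear segments whose slopes are powers of two, so that multiplications reduce to shifts in hardware. The breakpoints and coefficients are the commonly cited ones and are assumed here; they may differ in detail from the entries of Table II. The helper names are hypothetical.

```python
import math


def reference_sigmoid(x: float) -> float:
    """Exact sigmoid, used only as the comparison baseline."""
    return 1.0 / (1.0 + math.exp(-x))


def plan_sigmoid(x: float) -> float:
    """Piecewise linear (PLAN-style) sigmoid approximation.

    Slopes 0.25, 0.125 and 0.03125 are powers of two (shift operations in
    hardware). Negative inputs use the symmetry f(-x) = 1 - f(x).
    """
    ax = abs(x)
    if ax >= 5.0:
        y = 1.0
    elif ax >= 2.375:
        y = 0.03125 * ax + 0.84375
    elif ax >= 1.0:
        y = 0.125 * ax + 0.625
    else:
        y = 0.25 * ax + 0.5
    return y if x >= 0.0 else 1.0 - y


def max_abs_error(lo: float = -8.0, hi: float = 8.0, steps: int = 1600) -> float:
    """Largest absolute deviation from the exact sigmoid on a uniform grid."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(abs(plan_sigmoid(x) - reference_sigmoid(x)) for x in grid)
```

Evaluating `max_abs_error()` gives a quick software estimate of the approximation error before committing to a hardware implementation; the same harness could be pointed at any of the other approximation functions.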
C. Control Neural Library With On-Chip Learning
The control of the neuronal processing components is done through specialized blocks designed to accommodate the on-chip BP learning algorithm and the parallelism at the neuron level, i.e., all neurons within the same layer are controlled at the same time (in parallel), taking advantage of the massive parallel processing supported by FPGAs.
The blocks that control the ANN processing units consist of: 1) a general counter, used to provide the time base for the entire neural network according to the ANN's phase, either propagation (when the network is already trained and performs recognition) or learning (when the network is trained on-chip); and 2) ANN layer-specific command signal generator blocks (two in the example given: one for each neuronal layer) (see Fig. 3).
The general counter block calculates, as a function of the neuron architecture, network topology, and processing phase (propagation or learning), the counter's maximum value and generates tth layer of network and ceil is the MATLAB function that rounds a real number up to the next integer. The following equation also gives the ANN PS in the learning (PL = 1) or propagation (PL = 0) phase:
The command signal generator block generates the control signals for all the processing elements of the neurons at specific moments (counter values). For this, the block calculates the counter values at which commands have to be given using (7), according to the neuron's architecture

$$
\begin{aligned}
\mathrm{sra} &= (t-1)(n+6)+2\\
\mathrm{ea\_start} &= (t-1)(n+6)+4\\
\mathrm{ea\_stop} &= (t-1)(n+6)+4+n-1\\
\mathrm{p\_start} &= (t-1)(n+6)\\
\mathrm{p\_stop} &= (t-1)(n+6)+n-1\\
\mathrm{ul\_start} &= t(n+6)+12\\
\mathrm{ul\_stop} &= t(n+6)+12+n-1
\end{aligned} \tag{7}
$$

where t is the layer number, n is the number of neurons in layer t, sra is the counter value at which the accumulator's reset signal is set, ea_start is the counter value at which the accumulator enable signal is set, ea_stop is the counter value at which the accumulator enable signal is reset, p_start is the counter value at which the neuron's propagation phase starts, p_stop is the counter value at which the propagation phase stops, ul_start is the counter value at which the weights of the tth layer start to be updated, and ul_stop is the counter value at which the updating of the weights of the tth layer stops.
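The counter values in (7) are simple enough to tabulate in software when checking a layer's timing by hand. The sketch below evaluates them directly; the function name and the dictionary layout are assumptions made for illustration, while the expressions are those of (7).

```python
from typing import Dict


def command_schedule(t: int, n: int) -> Dict[str, int]:
    """Counter values at which the command signals of layer t are issued.

    t: layer number (1-based), n: number of neurons in layer t.
    """
    if t < 1 or n < 1:
        raise ValueError("t and n must be positive integers")
    base = (t - 1) * (n + 6)  # offset of layer t within the counter cycle
    return {
        "sra": base + 2,                      # accumulator reset
        "ea_start": base + 4,                 # accumulator enable set
        "ea_stop": base + 4 + n - 1,          # accumulator enable reset
        "p_start": base,                      # propagation phase starts
        "p_stop": base + n - 1,               # propagation phase stops
        "ul_start": t * (n + 6) + 12,         # weight update starts
        "ul_stop": t * (n + 6) + 12 + n - 1,  # weight update stops
    }
```

For example, `command_schedule(1, 7)` lists the schedule of a seven-neuron first processing layer; the accumulator-enable window spans exactly n counter values, one per accumulated input.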
The blocks are described in VHDL and implemented using Black Box modules, a block type that converts a VHDL design into a System Generator block. The computational tasks that describe the algorithms for updating the ANN weights on-chip [see (2)-(4)] have been implemented with three computing blocks: 1) the errors computing block; 2) the output layer weights computing block; and 3) the hidden layer weights computing block. The errors and layer weights computing blocks calculate the accumulated error, the gradient of the error (δ), and the value by which the weights should be changed (Δw) to decrease the accumulated error (see Fig. 4). The weights of the hidden layer are calculated last (due to the error back-propagation algorithm). For this, processing blocks that calculate the new weights according to (3) were designed (see Fig. 6). The processing time, expressed in clock cycles, is given in (8) and is used to delay the weight updating task (the delay permits the calculation of the new weights to be completed)

$$\mathrm{Delay} = n_2 \cdot \mathrm{ceil}\!\left(\frac{n_1 + n_2 + 6t + 12}{n_2}\right). \tag{8}$$
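As a quick check of (8), the delay can be evaluated with integer arithmetic only. In the sketch below, n1 and n2 are assumed to be the neuron counts that appear as $n_1$ and $n_2$ in (8), and t the layer number; the function name is hypothetical.

```python
def update_delay(n1: int, n2: int, t: int) -> int:
    """Delay in clock cycles, following (8): n2 * ceil((n1 + n2 + 6t + 12) / n2)."""
    if n1 < 0 or n2 <= 0 or t < 0:
        raise ValueError("n1 and t must be non-negative and n2 positive")
    total = n1 + n2 + 6 * t + 12
    # Integer ceiling division avoids floating-point rounding.
    return n2 * -(-total // n2)
```

The result is always the smallest multiple of n2 that is not less than n1 + n2 + 6t + 12, which is the effect of the ceil operation in (8).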
An overall view of the main blocks involved in designing an ANN architecture with a 7-7-4 topology (seven neurons in the input layer, seven neurons in the hidden layer, and four neurons in the output layer) is shown in Fig. 5 . 
IV. ANN ABSTRACTION AND EUP
The idea behind the ANN abstraction is to shield the EUs from the complexity while allowing them to create their own desired ANN program without incurring a steep learning curve, thus promoting rapid prototyping. To achieve this goal, we employed a rule-based technique of the kind often found in AI systems. Rules are first created according to the design and requirements of each of the ANN blocks (see Fig. 7) and stored in the "EU Programs and Semantics" component (i.e., the knowledge space) (see Fig. 1). These rules are then used as the basis of the ANN programming abstraction, presented as graphical representations. The relationships between the rules are described semantically. The EUs then create their own ANN programs by simply manipulating the graphical "Rules" representations and develop their own new "Rules" via an interactive GUI (see Fig. 7). The GUI is implemented using an event-based architecture.
Using the JavaScript API, various UI events (such as drag, drop, move, and click) have been developed that map the rule requirements stored in the main knowledge space (and can thus "trigger" other events based on those rules) in order to "listen" to the user's activities on this panel. To understand the user interactions, a learning feature is implemented such that each activity is captured and either interpreted (inferred) according to the rules stored in the rule base or learned as a new rule. Based on the rules, an "expert system" is implemented to guide the EU in creating an ANN program via a series of dialogs. The user is able to configure a "rule" by simply clicking its graphical representation. The newly created "Rules" are recorded as instances and stored back in the knowledge space, from which they can be retrieved and amended later. The contributions of this paper in terms of software design are: 1) the semantic rules based on ANN design; 2) the abstractions and representations; 3) the expert system including the learning feature; and 4) the GUI event-based architecture. The EUP GUI is implemented using Python and JavaScript, together with a preinstalled MATLAB Engine that enables Simulink functions to be called through the provided APIs.
V. APPLICATION AND ANALYSIS
The developed FFBP neural network library was used to create a pattern recognition module for an artificial olfactory system trained to recognize different types of coffee. The olfactory system consists of seven gas sensors chosen to react to a wide spectrum of odours (TGS842, TGS826_1, TGS826_2, TGS2600, TGS2601, TGS2602, TGS2620), a temperature sensor (LM35), and a humidity sensor (SY-HS-230), all mounted in a gas test chamber; three gas pumps; circuits for sensor conditioning and pump command; a data acquisition board; a pattern recognition module implemented in hardware on an FPGA (Virtex-4 SX 4VSX35); and a user interface.
A. Data Acquisition and Processing
The data acquisition module was customized to control the gas pumps (used to transport the smell to and from the test chamber), acquire the data generated by all nine sensors, and preprocess the acquired signals (filtering, drift cancellation). The data have been extracted by measuring, over a defined absorption/desorption time, the voltage drop across the sensor resistance when the enriched odour is applied/removed. The acquired data constitute the fingerprint of the smell, and dimensional reduction techniques are applied to process it. In most cases, this is performed by extracting a single parameter (e.g., steady-state, final, or maximum response) from each sensor, disregarding the initial transient response, which may be affected by the dynamics of the odour delivery system. In some situations, transient analysis may significantly improve the performance of gas sensor arrays and should be taken into consideration. Considering the feature extraction methods reported in the literature [22], a heuristic method has been adopted with the following selected features: average value (A1), maximum value (A2), function integral (A3), integral over the absorption time (A4), maximum slope of the absorption function (A5), maximum slope of the desorption function (A6), time at which the maximum slope of the absorption function occurs (A7), and time at which the maximum slope of the desorption function occurs (A8).
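The eight features can be computed from a sampled sensor response with a few lines of code. The sketch below is one plausible reading of A1-A8, not the authors' implementation: it assumes uniformly spaced samples with period `dt`, a known index `split` at which the odour is removed (end of absorption, start of desorption), trapezoidal integration, finite-difference slopes, and the steepest negative slope as the "maximum slope" of the desorption.

```python
from typing import Dict, List, Sequence


def trapezoid(samples: Sequence[float], dt: float) -> float:
    """Trapezoidal-rule integral of uniformly sampled data."""
    return sum((a + b) * 0.5 * dt for a, b in zip(samples, samples[1:]))


def slopes(samples: Sequence[float], dt: float) -> List[float]:
    """Finite-difference slope between consecutive samples."""
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]


def extract_features(samples: Sequence[float], dt: float, split: int) -> Dict[str, float]:
    """Compute features A1-A8 for one sensor response.

    split: index of the sample at which the odour is removed (assumed known).
    """
    if not 0 < split < len(samples) - 1:
        raise ValueError("split must leave at least two samples on each side")
    absorption = samples[:split + 1]
    desorption = samples[split:]
    abs_slopes = slopes(absorption, dt)
    des_slopes = slopes(desorption, dt)
    i_abs = max(range(len(abs_slopes)), key=lambda i: abs_slopes[i])
    i_des = min(range(len(des_slopes)), key=lambda i: des_slopes[i])
    return {
        "A1": sum(samples) / len(samples),   # average value
        "A2": max(samples),                  # maximum value
        "A3": trapezoid(samples, dt),        # integral of the whole response
        "A4": trapezoid(absorption, dt),     # integral over the absorption time
        "A5": abs_slopes[i_abs],             # maximum slope, absorption
        "A6": des_slopes[i_des],             # steepest slope, desorption (negative)
        "A7": i_abs * dt,                    # time of maximum absorption slope
        "A8": (split + i_des) * dt,          # time of steepest desorption slope
    }
```

Applying `extract_features` to each of the seven gas sensors and concatenating the results yields a 56-component input vector (eight features per sensor), while restricting the output to selected keys yields reduced training sets such as (A1, A2, A3).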
B. ANN Performance Analysis
To determine the best FFBP network implementable with a minimum of resources, a series of different FFBP NN topologies have been tested. In addition, for each topology, fixed-point binary representations with different resolutions have been investigated. Fig. 8 shows the RR versus data representation for a topology of 56-56-4 neurons, which processes an input vector with 56 components: eight features per sensor (A1-A8) and seven sensors. The RR varies from 100% for the (16, 16) bits representation (16 bits for the integer part and 16 bits for the fractional part) to 50% for (7, 8), and remains constant over a major drop in data resolution, (16, 16) → (8, 8). These observations may be very useful when choosing the data representation resolution. Figs. 9 and 10 are plotted in order to highlight the influence of the data representation resolution on the RR for a given training set: a training set with features (A1, A2, A3) is shown in Fig. 9 and one with feature (A2) in Fig. 10. It can be concluded that there is no perfect FFBP network topology for every purpose, but the topology can be adapted to fulfill the most important requirements of a given application. For example, if the chip area occupation is an important issue, then a 21-21-4 FFBP network with a (5, 5) bits representation and a theoretical RR of 90% could be more than acceptable. However, for obtaining a higher RR, a 56-56-4 FFBP network with a (16, 16) bits representation might be a better option. Consequently, as demonstrated in the above discussion, the accuracy of the ANN is strongly determined by the data representation adopted. Similar findings are reported in [10].
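The effect of an (m, n) fixed-point representation can be explored in software before synthesis. The sketch below rounds a real value to the nearest representable number with m integer bits and n fractional bits and saturates at the range limits. It assumes a signed format whose sign bit is not counted among the m integer bits; the text does not specify this, and the function names are hypothetical.

```python
from typing import Iterable


def quantize(value: float, int_bits: int, frac_bits: int) -> float:
    """Round to the nearest multiple of 2**-frac_bits, saturating at the range limits.

    Assumed range: [-2**int_bits, 2**int_bits - 2**-frac_bits].
    """
    step = 2.0 ** -frac_bits
    limit = 2.0 ** int_bits
    rounded = round(value / step) * step
    return max(-limit, min(limit - step, rounded))


def worst_rounding_error(values: Iterable[float], int_bits: int, frac_bits: int) -> float:
    """Largest absolute difference between the values and their quantized forms."""
    return max(abs(v - quantize(v, int_bits, frac_bits)) for v in values)
```

For in-range values, the rounding error is bounded by half a step, that is, by 2^-(n+1); halving the number of fractional bits therefore coarsens weights and features far more than it shrinks the datapath, which is one way to reason about where the RR starts to fall.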
C. ANN Hardware Implementation Results
Implementing the above ANN topologies in an FPGA requires specific hardware resources, which can be calculated a priori. Having a formula to estimate the hardware resources needed for implementing a specific ANN topology lets the user choose the right ANN size and FPGA circuit.
By analyzing the hardware implementation reports presented in Table VI, where HL denotes the hidden layer and OL the output layer, it can be concluded that each neuron added to the hidden layer increases the overall resource utilization by 32 LUTs and one multiplier, and each neuron added to the output layer increases the output neuron weights computation block by 40 LUTs and four multipliers and the hidden neuron weights computation block by 49 LUTs and one multiplier. Based on the reports presented, three equations have been generated to estimate the hardware resources utilized to implement a given FFBP topology, prior to an actual hardware implementation [see (9)-(11)]. These permit choosing the right FPGA circuit for a given ANN topology/size in the very early ANN design stages, saving time and costs
where $N_o$ is the number of neurons in the output layer and $N_h$ is the number of neurons in the hidden layer. Applying (9)-(11) to the Virtex-4 targeted in this paper (15,360 slices, 30,720 LUTs, 192 BRAMs, 192 DSPs), the maximum number of neurons that can be implemented using strictly the dedicated BRAMs and XtremeDSP blocks (for ensuring the maximum PS) is 60, organized as 45 in the hidden layer and 15 in the output layer. However, using the distributed multipliers and BRAMs available in the circuit, 26 more neurons, 20 in the hidden layer and 6 in the output layer, can be implemented. These will utilize 6878 LUTs and 76 BRAMs, leaving 22657 LUTs unused. The unused LUTs can be further converted into 20 neurons in the hidden layer and 10 in the output layer. Therefore, the maximum number of neurons that can be implemented in hardware (at the expense of the PS) is approximately 120 (double the number of neurons that use only dedicated BRAMs and XtremeDSP blocks).
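The per-neuron increments reported from Table VI can be turned into a small helper for comparing candidate topologies. The sketch below computes only the marginal cost of adding neurons; it is not a reconstruction of (9)-(11), whose constant terms are not used here, and the names are hypothetical.

```python
from typing import NamedTuple


class Resources(NamedTuple):
    luts: int
    multipliers: int


# Increments quoted in the text (per added neuron).
HIDDEN_NEURON = Resources(luts=32, multipliers=1)
OUTPUT_WEIGHTS_BLOCK = Resources(luts=40, multipliers=4)  # per added output neuron
HIDDEN_WEIGHTS_BLOCK = Resources(luts=49, multipliers=1)  # per added output neuron


def marginal_cost(extra_hidden: int, extra_output: int) -> Resources:
    """Additional LUTs and multipliers when neurons are added to a topology."""
    if extra_hidden < 0 or extra_output < 0:
        raise ValueError("neuron counts must be non-negative")
    per_output_luts = OUTPUT_WEIGHTS_BLOCK.luts + HIDDEN_WEIGHTS_BLOCK.luts
    per_output_mults = OUTPUT_WEIGHTS_BLOCK.multipliers + HIDDEN_WEIGHTS_BLOCK.multipliers
    return Resources(
        luts=extra_hidden * HIDDEN_NEURON.luts + extra_output * per_output_luts,
        multipliers=extra_hidden * HIDDEN_NEURON.multipliers
        + extra_output * per_output_mults,
    )
```

Under these increments, an added output neuron costs more than an added hidden neuron in both LUTs and multipliers, because it enlarges both weight computation blocks.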
To illustrate the FPGA implementation performance, a report in terms of hardware resources utilization and maximum processing frequency is presented in Table VII .
D. ANN Performance Comparisons
A direct comparison of the data presented in Table VII with others reported in the literature is not always relevant due to the lack of a common reference for reporting hardware resources against ANN performance. These depend on the type of resources available in the FPGA (four- or six-input LUTs, multipliers or XtremeDSPs, etc.), the depth of the ANN parallelism adopted (synapse, neuron, or layer), the firing function (sigmoid, hardlim, etc.), PS, data representation, use of dedicated or distributed resources, on- or off-chip learning, and number of hidden layers, to name the most important ones. In [10], for implementing the 10-3-1 FFBP topology with synaptic parallelism, 70 DSPs and 8043 LUTs were used. In [11], the hardware utilization is reported per neuron, with 1299 LUTs/neuron. In [23], for a 2-5-1 topology, 11 DSPs and 6384 LUTs were consumed. In this paper, for a similar topology of 7-2-4, 43 DSPs and 412 LUTs were used.
As shown above, the hardware utilization depends on factors which vary from one ANN topology, and FPGA, to another but they are all reflected in the RR and PS supported by the chosen FPGA. Hence, reporting RR and PS, along with the hardware utilization, would indicate better the level of success in using a particular ANN topology in a specific FPGA circuit.
Choosing the right FPGA circuit for a given ANN or the ANN size for a given FPGA circuit is not straightforward. As shown in [10], for selecting the right FPGA circuit, the designer is forced to implement the design first and then interpret the hardware resources used versus the ANN topology. Therefore, being able to estimate the hardware resources needed for implementing an ANN before an actual implementation would shorten the development time and consequently save costs. This is addressed for a given FPGA family by (9)-(11).
VI. CONCLUSION AND FUTURE WORK
A novel neural design strategy has been developed, which benefits from reduced design time over classical field orientation approaches, leading to a low-complexity and easy-to-implement pattern recognition module. A particular application of the pattern recognition system for an olfactory system is investigated, and the results presented show efficient hardware implementation in the FPGA circuit. The achievement presented in this paper refers to a holistic modeling/design method, using modules created in a hardware-software co-design environment (MATLAB-System Generator-ISE) and grouped in a specific NN library. These modules emulate in hardware the behavior of any FFBP network topology, giving the opportunity to design hardware-implementable FFBP neural networks at a higher level via an intuitive and interactive EUP interface.
The proposed methodology takes advantage of the FPGA parallel processing power, preparing the ground for an autoadaptive reconfigurable device ready to respond (read: autoreconfigure) to any pattern recognition challenge. It is hoped that, through the proposed method, it would be possible to make steps toward a "more like brain" computational machine, in terms of adaptability and quick response, a system that makes its own choices (upon an implemented algorithm), i.e., intelligence.
As the components are entirely designed using System Generator blocks, the created library is dependent on the software used. To increase portability, future work will consider having the blocks designed using HDLs generated from System Generator.
In conclusion, this paper shows that any FFBP topology may be built using predefined neural blocks with the following characteristics: 1) holistic modeling and optimization; 2) behavioral analysis; and 3) easy hardware prototyping on an FPGA development platform via an intuitive EUP interface. In addition, a set of equations has been developed to estimate: 1) the hardware resources needed to implement an FFBP ANN with on-chip learning in a given FPGA circuit [see (9)-(11)]; and 2) the PS of the implemented ANN topology [see (6)]. Moreover, design concepts introduced in [20] and [24] are brought further with contributions in developing an ANN design platform based on semantic rules, abstractions and representations, expert system, and GUI event-based architecture.
