Abstract-Marine diesel engines operate in highly dynamic and uncertain environments, hence they require robust and accurate speed controllers that can handle the encountered uncertainties. Type-2 Fuzzy Logic Controllers (FLCs) have been shown to handle such uncertainties and to give a superior performance to the existing commercial controllers. However, a number of computational bottlenecks pose significant barriers to the widespread deployment of type-2 FLCs in commercial embedded control systems. This paper explores the use of parallel hardware implementations of the interval type-2 FLC as a means to remove these barriers, producing bespoke co-processors for a soft core implementation of an FPGA-based 32-bit RISC microprocessor. These co-processors perform functions such as fuzzification and type reduction and are currently utilised as part of a larger embedded interval Type-2 Fuzzy Engine Management System (T2FEMS). Numerous timing comparisons were undertaken between the co-processors and their sequential counterparts, in which the type-2 co-processors significantly reduced the computational cycles required by the type-2 FLC. This reduction in computational cycles allowed the T2FEMS to produce faster control responses whilst offering a superior control performance to the commercial engine management systems. Thus the proposed co-processors enable us to fully explore the potential of interval and possibly general type-2 FLCs in commercial embedded applications.
I. INTRODUCTION
Marine diesel engines are large engines that, due to their vast sizes and large power outputs, require accurate and robust speed control/governing that can handle the many uncertainties present in the engines' environments. It has been shown recently that type-2 FLCs can handle such uncertainties to provide superior control performance to the existing commercial controllers [1]. The current commercial engine management systems attempt to address these uncertainties by averaging the sensor inputs and using gain scheduled control algorithms, such as the gain scheduled PID controller with numerous non-linear gain functions embedded in the Viking 25 industrial controller illustrated in figure 1. Despite the additional complexity of these supplementary functions, the Viking 25 is still computationally efficient, requiring about ten thousand clock cycles to perform all of its speed control functions and using the remaining clock cycles to perform other engine management features such as signal conditioning, communications, alarms and monitoring. On the other hand, a type-2 FLC requires an equivalent number of clock cycles to perform type-reduction alone [1]. Thus despite any performance improvements the type-2 FLC may offer, these computational bottlenecks remain a barrier to its deployment in commercial embedded control systems. As a result, an alternative solution which exploits the high level of parallelism offered by the type-2 FLC was required.
Currently there are only two other hardware implementations of the type-2 FLC available. The first, presented in [2], is a VLSI implementation in which the type-2 FLC was designed at the transistor level on a single chip for a dual input, single output controller supporting up to 64 rules. This approach, whilst offering a tailored solution, offers neither the flexibility nor the re-programmability of a microprocessor based solution. Alternatively, Melgarejo et al. [3] designed a type-2 FLC for an adaptive filter with a rule base of nine rules using the Wu-Mendel approximation. This implementation was embedded on a Field Programmable Gate Array (FPGA), which is a single chip programmable logic device. The approach is a highly optimised and pipelined solution offering a type reduced set in 9 clock cycles at the expense of being highly memory intensive: it makes use of memory based fuzzification, reciprocal division and distributed arithmetic, each of which requires large amounts of on-chip memory. Although this approach is applicable to the higher end FPGAs, it is unsuitable for larger rule bases on lower cost FPGAs with less memory. This paper presents parallel hardware implementations of the interval type-2 FLC for the purpose of industrial control which can accommodate much larger rule bases (supporting up to 64 fired rules) and use cheaper hardware than the previous hardware implementations mentioned above. This will create bespoke co-processors that can perform functions such as fuzzification and type reduction. The proposed system is currently utilised as part of a larger embedded interval Type-2 Fuzzy Engine Management System (T2FEMS). We will show how the proposed type-2 co-processors significantly reduce the computational cycles required by the type-2 FLC.
This reduction in computational cycles allows the T2FEMS to produce control responses that are ten times quicker than the Viking 25 while also offering a superior control performance [1]. This paper begins by introducing the hardware and the engine testing platforms in Section II. Section III discusses the hardware co-processor design. Section IV presents the experiments and results, followed by the conclusions in Section V.
II. THE USED HARDWARE AND THE ENGINE TESTING PLATFORMS
A. The Hardware Platform
The T2FEMS is embedded on an FPGA and exploits a soft core implementation of a 32-bit RISC Harvard architecture (the MicroBlaze) with 32 general purpose registers, an Arithmetic and Logic Unit (ALU) and a rich instruction set optimised for embedded applications.
The MicroBlaze is a soft core developed by Xilinx (a manufacturer of FPGAs) which is implemented using the logic primitives of the FPGA, with the key benefits of easy integration with the FPGA fabric while avoiding obsolescence [4]. The soft core based solution has several advantages over a pure hard core based design. Firstly, it supports a purely software based application development, allowing portability between FPGAs. Secondly, the soft core features do not require any silicon area when they are not needed, while the hard processor based approach always consumes the same area [4]. Of particular interest in soft core solutions is the prospect of identifying bottlenecks in the embedded algorithms and replacing them with a customised soft core co-processor, thus accelerating the performance of certain functional blocks [4]. These customised soft core co-processors communicate with the MicroBlaze via a number of bus interfaces, two of which are widely exploited in the T2FEMS design: the On-chip Peripheral Bus (OPB) and the Fast Simplex Link (FSL). The OPB is an IBM CoreConnect standard bus used for less time critical communications. The FSL, by contrast, is a dedicated point-to-point data streaming interface providing a low latency interface to the processor pipeline. The OPB and the FSL allow the processor's execution unit to be extended with custom hardware accelerators (co-processors) [4].
The MicroBlaze also supports development in C code using the Xilinx Embedded Development Kit (EDK), whilst the co-processors are developed using Xilinx ISE Foundation in VHDL (VHSIC Hardware Description Language, where VHSIC stands for Very High Speed Integrated Circuit). The functional/timing simulations were performed in Mentor Graphics ModelSim.
All of the aforementioned co-processors and the MicroBlaze were embedded in a Spartan 3E FPGA (shown in figure 2(a)), which is one of the lower cost per gate devices Xilinx produce. It is also targeted and rated for automotive applications (including engine control) with respect to temperature, packaging etc. Currently the functionality of both the co-processors and the MicroBlaze is verified in simulation with ModelSim and additionally through the use of a hardware development platform with an embedded Spartan 3E (XC3S500E).
Figure 2. (a) The Spartan 3E FPGA hardware platform; (b) the engine testing platform.
B. The Engine Testing Platform
The embedded control algorithms are tested and verified on the engine testing platform shown in figure 2(b), which is designed to realistically reflect the characteristics and operating conditions of marine diesel engines, with the ability to alter speed, load, inertia and torque. The platform uses the same noisy sensors and actuators used on the engine, with the ability to introduce the same uncertainty levels encountered on the real engines.
Although we have developed various co-processors for type-2 FLCs, space restrictions mean that this paper reports in detail only on the type reduction co-processors, as discussed in the next section.
III. HARDWARE CO-PROCESSORS
A. Fuzzification
Numerous fuzzification strategies have been exploited in the design of FPGA and VLSI based fuzzy controllers. The most common method is memory based fuzzification [3], where any arbitrarily shaped Membership Function (MF) can be represented in memory by discretising its universe of discourse and storing the resultant degrees of membership, thus providing very fast fuzzification. The disadvantage of this method relates to the required resolution and level of discretisation, which can result in very large MF tables.
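To illustrate why the table size is the limiting factor in memory based fuzzification, the following is a minimal Python sketch, not the implementation of [3]. The function names, the Gaussian MF, the 8-bit address width and the 16-bit stored value are all assumptions chosen for illustration.

```python
import math


def build_mf_table(mf, x_min, x_max, address_bits, value_bits=16):
    """Discretise the universe [x_min, x_max] into 2**address_bits entries.

    Each entry stores the membership degree scaled to an unsigned integer of
    value_bits bits (hypothetical widths, chosen for illustration).
    """
    size = 1 << address_bits
    scale = (1 << value_bits) - 1
    step = (x_max - x_min) / (size - 1)
    return [round(mf(x_min + i * step) * scale) for i in range(size)]


def lookup(table, x, x_min, x_max):
    """Fuzzify x with a single table read (inputs outside the universe are clamped)."""
    size = len(table)
    index = round((x - x_min) / (x_max - x_min) * (size - 1))
    index = max(0, min(size - 1, index))
    return table[index]


def gaussian(x, centre=0.5, sigma=0.1):
    """Example arbitrarily shaped MF (an assumption, not taken from the paper)."""
    return math.exp(-0.5 * ((x - centre) / sigma) ** 2)


table = build_mf_table(gaussian, 0.0, 1.0, address_bits=8)
memory_bits = len(table) * 16  # storage for one MF at this resolution
```

Each additional address bit doubles the storage needed per MF, and an interval type-2 MF needs two such tables (upper and lower), which is the cost that the text above refers to.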
Mathematical approximations of MFs [5], [6] are also widely used but any errors produced by crude approximations can be problematic for adaptive systems as a continuous function approximation may not prove to be continuous in all segments.
Analogue fuzzifiers have been developed by [7], [8] and despite being very fast and power efficient, analogue implementations are prone to temperature related drift and inaccuracies related to component tolerances.
Linear interpolation is applicable for both trapezoidal and triangular MF fuzzification offering fast fuzzification with minimal resources.
We have implemented type-2 fuzzifiers employing trapezoidal and triangular MFs. The type-2 fuzzifier was implemented as a single FSL co-processor. In this implementation, the triangular MFs were considered a special case of the trapezoidal MFs. This co-processor makes use of linear interpolation, producing both upper and lower fuzzified values in 12 clock cycles. Type-2 Gaussian fuzzifiers were also developed in C code for the MicroBlaze.
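The linear interpolation described above can be sketched in a few lines of Python. This is a floating point illustration under stated assumptions, not the 12-cycle VHDL co-processor: the function names and the four-point (a, b, c, d) parameterisation of each trapezoid are hypothetical, and the lower MF is assumed to lie inside the upper MF.

```python
def trapezoid(x, a, b, c, d):
    """Membership of x in a trapezoid with feet a, d and shoulders b, c.

    A triangular MF is the special case b == c.
    """
    if x < a or x > d:
        return 0.0
    if x < b:
        # Rising edge: interpolate between (a, 0) and (b, 1).
        return (x - a) / (b - a)
    if x <= c:
        # Plateau.
        return 1.0
    # Falling edge: interpolate between (c, 1) and (d, 0).
    return (d - x) / (d - c)


def fuzzify_type2(x, upper, lower):
    """Return the (lower, upper) membership bounds of x for an interval type-2 MF."""
    return trapezoid(x, *lower), trapezoid(x, *upper)


# Hypothetical MF: the footprint of uncertainty lies between the two trapezoids.
UPPER = (0.0, 2.0, 4.0, 6.0)
LOWER = (1.0, 2.5, 3.5, 5.0)
lo, hi = fuzzify_type2(1.5, UPPER, LOWER)
```

For the input 1.5 the upper bound is 0.75 and the lower bound is one third, so a single crisp input yields the interval that is later combined into a firing interval.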
B. Rule Base
FPGA based rule bases are typically a binary pattern used to reference the antecedents, consequents and connective operator of each rule. The hardware realisation of the rules involves a number of multiplexers and memory elements, as in [3]. The T2FEMS rule base will be defined in C code within the MicroBlaze processor. Hard-coded VHDL implementations were previously designed but proved a less flexible and less maintainable solution. The rule base will also form part of an adaptive type-2 FLC planned for future development, making further use of the MicroBlaze processor core. The size of the rule base supported by the T2FEMS is limited only by the number of rules the type reduction co-processors can handle, i.e. currently up to 64 fired rules. Since only a subset of the rules fires for any given input, the rule base itself can be much larger than 64 rules.
C. Inference
A number of t-norm and t-conorm inference operators exist. One of the most common t-norms for embedded systems is the minimum t-norm, due to its ease of implementation and the fewer resources it requires compared to the product t-norm. This is because the product t-norm requires a multiplier while the minimum t-norm needs only a simple magnitude comparator.
However, the minimum operator is not necessarily the correct choice for embedded control systems with more than two inputs [9], [10], as it creates nonlinearities in the control surface. Additionally, the MicroBlaze ALU is already defined in the FPGA resources and can perform a product operation in a single clock cycle; thus the product t-norm will be used throughout this paper.
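The difference between the two t-norms can be illustrated with a short Python sketch. It assumes each antecedent has already been fuzzified into a (lower, upper) membership pair; the function names and sample values are hypothetical and the code is not taken from the T2FEMS.

```python
import math


def min_tnorm(values):
    """Minimum t-norm: needs only magnitude comparisons."""
    return min(values)


def product_tnorm(values):
    """Product t-norm: needs one multiplication per additional antecedent."""
    return math.prod(values)


def firing_interval(antecedents, tnorm):
    """Combine (lower, upper) membership pairs into one firing interval.

    The t-norm is applied separately to the lower and to the upper bounds.
    """
    lower = tnorm([lo for lo, _ in antecedents])
    upper = tnorm([hi for _, hi in antecedents])
    return lower, upper


# Two antecedents of one rule, as (lower, upper) membership bounds.
rule_inputs = [(0.5, 0.8), (0.25, 0.5)]
f_min = firing_interval(rule_inputs, min_tnorm)
f_prod = firing_interval(rule_inputs, product_tnorm)
```

With the minimum t-norm the first antecedent has no influence on the result here, whereas with the product t-norm every antecedent contributes to the firing interval.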
D. Type-Reduction
The approach taken in this paper is to minimise the logic used by the co-processor, sacrificing speed in favour of a reduced gate count and lower cost FPGAs. As with the fuzzifiers, the type-reduction co-processor will interface directly to the MicroBlaze via the FSL bus. The co-processor will support up to 64 fired rules with a resolution of 16 bits for each of the firing interval bounds $\underline{f}^i$, $\overline{f}^i$ and the consequent centroid bounds $w_l^i$, $w_r^i$. Where possible single adders will be used for large summations (i.e. where $\sum_{i=1}^{M} f^i$ is a single adder), thus requiring M clock cycles for the summation but utilising fewer FPGA resources than a completely parallel implementation requiring M adders and a single clock cycle.
Currently both the iterative Karnik-Mendel (KM) procedure for type-reduction [11] (employing the centre of sets type-reduction) and the Wu-Mendel (WM) Uncertainty Bounds method [12] (approximating type-reduction) are supported and designed using VHDL as co-processors to the MicroBlaze, denoted KM-CP and WM-CP respectively. Both approaches will now be defined in a manner more applicable to an FPGA based implementation.
1) KM Co-Processor
The KM procedure is typically defined as a four step iterative procedure [11]. This procedure will now be redefined with the removal of step 2, thus not requiring the L and R index variables used in [11]. The modified procedure is shown below for $y_l$.
Without loss of generality assume the $w_l^i$ are arranged in ascending order: $w_l^1 \le w_l^2 \le \dots \le w_l^M$.
The procedure for $y_r$ can also be modified in a similar manner. This 3 step procedure for both $y_l$ and $y_r$ was analysed and segmented into a number of parallel processes and memory elements, as shown in figure 3. Each column of the processes operates in parallel, i.e. P1 to P6 (which require M clock cycles) function in parallel and their outputs are used as inputs to another column of parallel processes, P7 to P10 (which require a single clock cycle).
The process P11 (division) is included twice and represents a shared process i.e. the same hardware is used for both divisions. P11 is a VHDL implementation of a pipelined radix-2 non-restoring signed integer divider requiring an initial 20 clock cycles for the first division and 1 additional clock cycle thereafter thus both P11 processes will require a combined total of 21 clock cycles to complete.
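For readers unfamiliar with the technique, the following Python sketch shows radix-2 non-restoring division, which produces one quotient bit per step. It is a sequential software model under stated assumptions and not the pipelined VHDL divider: the function names are hypothetical, and handling the signs by dividing magnitudes and then fixing up the result is an assumption about one possible approach.

```python
def nonrestoring_divide(dividend, divisor, width):
    """Unsigned radix-2 non-restoring division, one quotient bit per iteration."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the given width")
    remainder = 0
    quotient = 0
    for bit in range(width - 1, -1, -1):
        next_bit = (dividend >> bit) & 1
        # Shift in the next dividend bit, then subtract the divisor if the
        # partial remainder is non-negative, or add it back if it is negative.
        if remainder >= 0:
            remainder = 2 * remainder + next_bit - divisor
        else:
            remainder = 2 * remainder + next_bit + divisor
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        # Single correction step at the end (no per-bit restore is needed).
        remainder += divisor
    return quotient, remainder


def signed_divide(dividend, divisor, width=16):
    """Signed wrapper: divide the magnitudes, then apply the signs.

    The quotient truncates toward zero and the remainder takes the sign of
    the dividend (an illustrative convention, not taken from the paper).
    """
    quotient, remainder = nonrestoring_divide(abs(dividend), abs(divisor), width)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    if dividend < 0:
        remainder = -remainder
    return quotient, remainder
```

Because the partial remainder is never restored, each step needs only one addition or subtraction, which is what makes the scheme attractive for a pipelined hardware divider.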
The final column contains the comparative processes P12 and P13, representing step three of the modified KM procedure, which compare $y_l''$ to $y_l'$ and $y_r''$ to $y_r'$ respectively. If either comparison is false then $y'$ is set equal to $y''$ and is fed back to the inputs of P1-P6; this requires only a single clock cycle. However, when both P12 and P13 are true, type reduction is complete and $y_l$ and $y_r$ are passed back to the MicroBlaze processor via the FSL bus.
The additional signal "initialise" relates to the first iteration through the modified KM procedure, and is activated by the first FSL write to the KM-CP, i.e. FSL Init. The remaining initial FSL writes access the memory elements MEM1 and MEM2, and concurrently these writes are also used by processes P1 to P6. This reduces the first iteration through the KM-CP by M clock cycles, i.e. the initialise iteration requires 4M+2 clock cycles to perform all FSL writes to MEM1 and MEM2 and complete the arithmetic processes P1 to P6. During this initialise iteration, processes P2, P3 and P5 are disabled (thus outputting 0), all $f^i$ are set equal to $(\underline{f}^i + \overline{f}^i)/2$, $y_l'$ is set to a predefined maximum value and $y_r'$ is set to a predefined minimum value. Finally, the comparative processes P12 and P13 are also disabled, simply setting $y''$ equal to $y'$ and returning the value to the inputs of P1-P6. After this first iteration the "initialise" signal changes state and the processes operate as normal.
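The iterate-compare-feed-back loop described above can be sketched in software as follows. This Python sketch is a floating point illustration of KM type-reduction and not the KM-CP itself: selecting each firing strength by comparing its centroid directly with the previous estimate is one way of avoiding an explicit switch-point search, and whether it matches the modified three-step procedure exactly is an assumption. The function names and test data are hypothetical.

```python
def km_bound(f_lower, f_upper, w, left, max_iter=100):
    """Iteratively compute one end point of the type-reduced set.

    left=True gives the left end point (y_l), left=False the right (y_r).
    Assumes at least one non-zero firing strength.
    """
    # Initialise with the midpoint of each firing interval.
    f = [(lo + hi) / 2 for lo, hi in zip(f_lower, f_upper)]
    y = sum(fi * wi for fi, wi in zip(f, w)) / sum(f)
    for _ in range(max_iter):
        if left:
            # Minimise: weight small centroids with upper firing strengths.
            f = [hi if wi <= y else lo
                 for lo, hi, wi in zip(f_lower, f_upper, w)]
        else:
            # Maximise: weight large centroids with upper firing strengths.
            f = [lo if wi <= y else hi
                 for lo, hi, wi in zip(f_lower, f_upper, w)]
        y_new = sum(fi * wi for fi, wi in zip(f, w)) / sum(f)
        if y_new == y:
            # Same selection as the previous pass: converged.
            return y_new
        y = y_new  # feed the new estimate back for another pass
    return y


f_lower = [0.2, 0.3, 0.1]
f_upper = [0.6, 0.7, 0.5]
centroids = [1.0, 2.0, 3.0]
y_l = km_bound(f_lower, f_upper, centroids, left=True)
y_r = km_bound(f_lower, f_upper, centroids, left=False)
```

For this small example the left end point converges to 1.5 and the right end point to 2.3; the number of passes through the loop depends on the data, which is why the cycle count of an iterative type reducer is hard to predict in advance.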
Finally, when both comparisons (P12 and P13) are true, $y_l$ and $y_r$ are combined into a single 32 bit value and returned via the FSL bus to the MicroBlaze, requiring a further two clock cycles (FSL Read). Table I defines the complete number of clock cycles required for each parallel column of figure 3. The total number of clock cycles required by the KM-CP is given in Equation (2); it is greatly influenced by the number of fired rules and the number of iterations required to complete type-reduction:

$$4M + 27 + (M + 23) \cdot N \qquad (2)$$

where N is the number of iterations required by the KM iterative procedure (not including the initialisation iteration). In the following section a similar analysis is carried out for the WM-CP.
2) WM Co-Processor
The Wu-Mendel Boundary equations provide mathematical formulas for the inner and outer bounds of the type-reduced sets [12]. Analysis of these equations reveals that the WM approach makes use of a larger number of arithmetic operators than the KM-CP, thus requiring more FPGA resources. However, the WM approach has the added advantage of not being an iterative process, and thus does not require any large local memory elements as in the KM-CP. Figure 4 depicts a graphical representation of the final VHDL implementation of the WM-CP in FPGA. Again each column of processes operates in parallel, where P1 to P4 are multiply and summation, P5 and P6 are summations, P7 is a summation and a subtraction, and finally P8 to P11 are multiply and summation with a subtraction.
The WM-CP does not require the MEM elements as in the KM-CP. Thus if 20 rules fired, the WM-CP would require 113 clock cycles to compute the values required by the MicroBlaze for defuzzification, whilst the KM-CP would require 107 + 43N clock cycles. If the KM procedure required only one iteration (not including the first iteration) it would be 37 clock cycles slower than the WM-CP; for two iterations it would be 80 clock cycles slower, for three iterations 123 clock cycles slower, and so on.
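The figures quoted above follow directly from Equation (2). The short Python check below reproduces them; the function and constant names are hypothetical, and the WM-CP figure is simply the 113 cycles quoted in the text for 20 fired rules, since no general expression for it is given here.

```python
def km_cp_cycles(fired_rules, iterations):
    """Total KM-CP clock cycles from Equation (2): 4M + 27 + (M + 23) * N."""
    return 4 * fired_rules + 27 + (fired_rules + 23) * iterations


# WM-CP cycle count for 20 fired rules, as quoted in the text.
WM_CP_CYCLES_20_RULES = 113

# Extra cycles the KM-CP needs over the WM-CP for 1, 2 and 3 iterations.
extra_cycles = [km_cp_cycles(20, n) - WM_CP_CYCLES_20_RULES for n in (1, 2, 3)]
```

Each additional KM iteration adds M + 23 = 43 cycles for 20 fired rules, so the gap to the WM-CP grows linearly with the number of iterations.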
E. Defuzzification
Whilst defuzzification could also be performed in the type reduction co-processors, this function is currently performed in the MicroBlaze, where defuzzification for both the KM and WM methods is achieved by multiple binary shift operations (a single binary shift right is equivalent to a division by 2), requiring minimal computational effort.
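As an illustration of shift based defuzzification, the Python sketch below averages the two end points of a type-reduced interval with a single right shift. It covers only the interval-midpoint case; the layout of the 32 bit word (left end point in the upper half), the unsigned 16 bit values and the function names are assumptions, not details taken from the T2FEMS.

```python
def pack(y_l, y_r):
    """Combine two unsigned 16 bit end points into one 32 bit word."""
    return ((y_l & 0xFFFF) << 16) | (y_r & 0xFFFF)


def unpack(word):
    """Split a 32 bit word into its (y_l, y_r) halves using shifts and masks."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def defuzzify(word):
    """Crisp output as the interval midpoint: (y_l + y_r) / 2 via a right shift."""
    y_l, y_r = unpack(word)
    return (y_l + y_r) >> 1


crisp = defuzzify(pack(1000, 3000))
```

The shift replaces a division, so the crisp output costs only a handful of ALU operations; the result is rounded down when the sum of the end points is odd.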
IV. EXPERIMENTS AND RESULTS

A. Computational Comparisons
In this subsection, we compare the computational times of the KM-CP, the WM-CP and their sequential counterparts. A type-2 FLC was coded in C on the MicroBlaze processor, including Gaussian type-2 fuzzification and the rule base previously used in a similar timing analysis in [1]. The firing strengths and centroids of the rule consequents were calculated in the MicroBlaze before being passed to the co-processors via the FSL bus. The WM-CP's consistency and predictability allow the MicroBlaze to transfer all data to the WM-CP and then continue executing other code, rather than waiting for the WM-CP to return the type reduced sets. Conversely, the KM-CP is dominated by the number of fired rules and the number of iterations required to complete type reduction. Thus it is difficult to predict in advance the total clock cycles the KM-CP will need, as the number of required iterations is unknown. It can also be clearly seen that the WM-CP requires far fewer clock cycles than the KM-CP. Figure 5(b) illustrates a sequential floating point implementation of the KM and WM type reducers implemented in C on the MicroBlaze (executed from external memory), subjected to the same test data as the co-processors. It can be clearly seen from figure 5 that both the WM-CP and KM-CP achieved a computational reduction of approximately 100% relative to their sequential counterparts. Also notice the spread of the WM sequential implementation in figure 5(b), which is related to additional execution branches such as the WM Max and Min operations. These issues are resolved in the WM-CP by concurrent operations, resulting in no spread.
In fact the complete T2FEMS with additional co-processors is able to produce a crisp output in less than 1000 clock cycles, which is about ten times quicker than the Viking 25 commercial controller, while the T2FEMS offers a superior control performance [1]. Thus the hardware acceleration offered by the co-processors removes any significant bottlenecks from the type-2 FLC and makes it even faster than the sequential type-1 FLC and the commercial controllers, whilst the type-2 FLC offers a superior control performance. Hence, the proposed co-processors enable us to fully explore the potential of interval and possibly general type-2 FLCs in commercial embedded applications.
B. Co-Processors FPGA Resources Comparison
The synthesised VHDL implementation of the co-processors revealed similarities in both the KM-CP and WM-CP with regard to the total number of FPGA resources (expressed in slices) required i.e. 1121 and 1159 slices respectively, with 559 slices used by the divider alone. To put this in perspective, the final hardware implementation of T2FEMS is targeted for the XC3S1600E with 14,752 slices available.
Additionally both the KM-CP and the WM-CP achieve similar maximum operational frequencies, i.e. 76 MHz and 65 MHz respectively. This maximum operational frequency is important for the MicroBlaze as the FSL writes occur using the processor clock (currently 50 MHz).
V. CONCLUSIONS
In this paper, we presented parallel implementations of both the WM and the KM approaches to type reduction. Both implementations were defined in VHDL and operate as co-processors to a 32 bit soft core microprocessor. The co-processors communicate with the MicroBlaze over the FSL bus and achieved a computational reduction of approximately 100% when compared to the sequential implementation. In addition, the co-processors' ability to support up to 64 fired rules allows the use of much larger rule bases using cheaper hardware than the previous hardware implementations. Timing analysis compared the WM-CP and KM-CP, where the WM-CP required an average of 45% fewer clock cycles than the KM-CP. Furthermore, the WM-CP offers predictable timing, giving the MicroBlaze a fixed window within which it can execute other tasks.
Unfortunately the restricted length of the paper did not allow us to present a detailed analysis of the complete T2FEMS and additional co-processors, which is able to produce a crisp output ten times quicker than the commercial controller. Indeed the improved computational speed, coupled with control results from previous work [1], means the type-2 FLC outperforms its type-1 and commercial controller counterparts in speed and performance. This will finally allow the type-2 FLC to be considered as a viable option for embedded control systems. In addition, this work reveals new prospects for the commercial application of general type-2 FLCs and presents an exciting future for applied embedded type-2 systems.
