Abstract
Introduction
In recent years, Fuzzy Logic has been used in numerous industrial and research applications. In particular, fuzzy processors find most of their applications in control logic [1]. Once a user has written down a set of fuzzy rules describing a particular problem, a corresponding fuzzy algorithm can be created. In general, this fuzzy algorithm can be implemented on either SW or HW platforms.
Nevertheless, even though Fuzzy Logic is also evolving to reduce processing time, the HW/SW implementation of a fuzzy algorithm on commercial fuzzy processors may not give good results in terms of speed. Consequently, dedicated fuzzy processors are required, mainly for high-speed applications. As far as VLSI implementations [2] are concerned, many researchers have improved the performance of HW processors by means of analog [3], [4], digital [5], [6], [7], [8], [9] or mixed solutions. The different approaches to the design of VLSI fuzzy processor architectures aim at a trade-off between speed, flexibility and layout area. These are exactly the features we investigated while designing the architecture of the chip presented here. In particular, chip dimensions, high speed and membership function shapes have been taken into account during the design.
As far as application fields are concerned, the Fuzzy Processor has been designed for future applications in High Energy Physics experiments, where high processing rates are part of the global constraints. In this field, a very fast 2-input, 1-output fuzzy processor may find many applications in particle trajectory recognition problems.
Fuzzy Processor Architecture
The main processor features are summarized first: two 7-bit digital fuzzy input variables; one 7-bit digital fuzzy output variable; 4-bit input-variable and premise degrees of truth (see below); 49 9-bit fuzzy rules; minimum as the conjunction (AND) operator; Sugeno [10] inference and defuzzification methods; 50 MHz clock frequency; one new input data set every 80 ns; 270 ns input-output delay time; 10 mm² layout area; 0.7 µm CMOS digital technology.
The architecture has been divided into several steps, summarized here and shown in figure 1. First, the fuzzification process takes place: the input variable values (A0 and A1) are associated with the relevant fuzzy sets according to the input fuzzy set distribution. This is done by using the input values directly as addresses for the A0 and A1 MF Memory look-up tables, where the input membership function shapes are stored; this process returns the potential input degrees of truth, herein named αs. We say potential because some of them may be rejected if the related active rules (see below) do not involve them. While the input degrees of truth αs are being read, the processor selects the active fuzzy rules stored in the Rule Memory. These are the only rules, among all the stored fuzzy rules, that can give a non-null contribution. Each fuzzy rule may involve both input variables, just one of them, or neither. This feature relies on a premise code which is also stored in the fuzzy rule: using one bit for each of the two input variables, it is possible to select or reject the related input degree of truth. In addition, the fuzzy rule contains the output Sugeno value, herein named Z, which is required by the inference process. Each fuzzy rule is therefore composed of 7 bits for Z plus 2 bits for the premise code, for a total of 9 bits. Once the input degrees of truth have been read from the membership function look-up tables and selected by means of the premise code, the premise degree of truth must be computed. This is done by a minimum operator that implements the fuzzy AND (conjunction); in other words, the minimum value among the selected αs is the premise degree of truth, herein named Θ.
The rule inference process then performs the multiplication Θ*Z, which is the contribution of a given fuzzy rule to the final result, named Zout. The defuzzification process then takes place in the defuzzifier block. It consists of two additions and one division. The two additions, which build the numerator and the denominator of the final weighted sum, are carried out by adding the Θ*Z and Θ values to the previous partial sums ∑Θ*Z and ∑Θ, respectively. Finally, the division ∑Θ*Z/∑Θ can start. It should be noted that only this last process is off pipeline, while all the previous ones make up the pipeline stages (see figure 2).
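The inference and defuzzification steps described above can be sketched in software as follows. This is only an illustrative model, not the chip's implementation: the function names are hypothetical, the premise-code bit assignment (most significant bit for A0, least significant bit for A1) is inferred from the 11/10/00 examples given in the text, and the truncating integer division and the zero result for an empty denominator are assumptions. The text does not say which Θ is used when the premise code is 00, so the sketch rejects that case rather than guess.

```python
def premise_degree(alpha_a0, alpha_a1, premise_code):
    """Return Theta: the minimum of the alphas selected by the 2-bit premise code.

    Assumed bit assignment: bit 1 selects A0, bit 0 selects A1.
    """
    selected = []
    if premise_code & 0b10:
        selected.append(alpha_a0)
    if premise_code & 0b01:
        selected.append(alpha_a1)
    if not selected:
        # The behaviour for code 00 is not specified in the text.
        raise ValueError("premise code 00 is not modelled in this sketch")
    return min(selected)


def sugeno_output(active_rules):
    """Compute Zout = sum(Theta * Z) / sum(Theta) over the active rules.

    Each rule is a tuple (alpha_a0, alpha_a1, premise_code, z).
    """
    numerator = 0    # running sum of Theta * Z
    denominator = 0  # running sum of Theta
    for alpha_a0, alpha_a1, premise_code, z in active_rules:
        theta = premise_degree(alpha_a0, alpha_a1, premise_code)
        numerator += theta * z
        denominator += theta
    if denominator == 0:
        return 0  # assumption: no rule fires, output forced to zero
    return numerator // denominator  # assumption: truncating division


# Four active rules built from alpha0=12, alpha1=3 (A0) and alpha2=5, alpha3=10 (A1).
rules = [
    (12, 5, 0b11, 100),
    (12, 10, 0b11, 40),
    (3, 5, 0b10, 20),
    (3, 10, 0b01, 60),
]
zout = sugeno_output(rules)
```

In this example the four premise degrees are 5, 10, 3 and 10, so the numerator is 1560, the denominator is 28, and the truncated result is 55.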
Coming back to the fuzzy rule selection: in the case of 2 input variables, 7 membership functions for each variable and an overlap of no more than 2 fuzzy sets, there are 7² = 49 possible fuzzy rules, although the number of active rules, that is, those that can give a non-null contribution, is much smaller. In fact, the number of active rules is reduced to 2² = 4. Using an active rule selector therefore strongly reduces the number of rules to be processed. The two inputs A0 and A1, coded as 7-bit numbers, enter the active rule selector, a circuit dedicated to generating the 6-bit addresses of the fuzzy rule memory. In more detail, the two input variable values are used to find the intervals involved. If the input variable domain is divided into intervals according to the fuzzy set distribution, a particular input value belongs to two consecutive fuzzy sets. This is done by means of the Active Interval Selector, which compares the input variable values with interval points stored in the MF Interval Memory and sends the interval codes (A0 and A1 Intervals) to the Address Generator. Once this step has been carried out and the Rule Memory has been read, the minimum operation is performed by selecting the two α values related to the active rule being processed. It should be noted that all the possible input degrees of truth (αs) are always processed in the same order, so that it is known which αs correspond to each Rule Memory word. Thus the rule memory output, which contains a premise code as previously mentioned, selects the right degrees of truth. The Rule Memory is dimensioned to contain all the possible 7² combinations of the input variables and fuzzy sets. The fuzzy rules are loaded into the Rule Memory starting, for example, from the one that involves the lowest fuzzy sets for both input variables up to the one that involves the highest corresponding fuzzy sets.
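A minimal software sketch of the active rule selection follows. Several details are assumptions made for illustration and are not given in the text: the interval points are hypothetical, each of the 6 intervals is assumed to activate fuzzy sets i and i+1, and the rule address is assumed to be laid out as (A0 fuzzy set index) * 7 + (A1 fuzzy set index), which is one ordering compatible with loading rules from the lowest to the highest fuzzy sets.

```python
from bisect import bisect_right

N_SETS = 7  # fuzzy sets per input variable

# Hypothetical interval points for a 7-bit input (0..127): 5 points, 6 intervals.
INTERVAL_POINTS = [21, 42, 64, 85, 106]


def active_interval(value, interval_points):
    """Return the interval code i (0..5); sets i and i+1 are assumed active."""
    return bisect_right(interval_points, value)


def active_rule_addresses(a0, a1, points_a0=INTERVAL_POINTS, points_a1=INTERVAL_POINTS):
    """Return the 4 rule memory addresses (0..48) of the active rules."""
    i0 = active_interval(a0, points_a0)
    i1 = active_interval(a1, points_a1)
    return [
        set_a0 * N_SETS + set_a1  # assumed address layout
        for set_a0 in (i0, i0 + 1)
        for set_a1 in (i1, i1 + 1)
    ]


addresses = active_rule_addresses(50, 10)
```

With these interval points, A0 = 50 falls in interval 2 and A1 = 10 in interval 0, so the four addresses are 14, 15, 21 and 22. The largest address this scheme can produce is 48, which fits in the 6 bits mentioned in the text.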
Consequently, for a given address it is known in advance which fuzzy rule is being considered. The Rule Memory can thus be organized as 49 words of 9 bits, where each word contains only Z, a 7-bit word representing the rule consequent code, and the rule premise code, a 2-bit word that indicates which variables are present in the rule. For example, when the rule premise code is 11 both A0 and A1 are present, when it is 10 only A0 is present, and when it is 00 neither A0 nor A1 is present. The input variables A0 and A1 are also used as addresses for the two membership function memory banks: these two memories, each of 128 words of 8 bits, store the membership function values αs, coded as 4-bit numbers (two per word). In this way, for any possible value of the input variables, these memories provide as output the alpha values α0 and α1 for A0 and α2 and α3 for A1.
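The two memory word formats can be illustrated with a short packing sketch. The field widths (7-bit Z plus 2-bit premise code, and two 4-bit αs per 8-bit word) come from the text; the position of each field inside the word and the function names are assumptions chosen only for this example.

```python
def pack_rule(z, premise_code):
    """Pack a 9-bit rule word: assumed layout is Z in bits 8..2, premise code in bits 1..0."""
    if not 0 <= z < 128 or not 0 <= premise_code < 4:
        raise ValueError("Z must fit in 7 bits and the premise code in 2 bits")
    return (z << 2) | premise_code


def unpack_rule(word):
    """Return (z, premise_code) from a 9-bit rule word."""
    return word >> 2, word & 0b11


def pack_alphas(alpha_first, alpha_second):
    """Pack two 4-bit degrees of truth into one 8-bit MF memory word (assumed order)."""
    if not 0 <= alpha_first < 16 or not 0 <= alpha_second < 16:
        raise ValueError("each alpha must fit in 4 bits")
    return (alpha_first << 4) | alpha_second


def unpack_alphas(word):
    """Return the two 4-bit degrees of truth stored in an 8-bit word."""
    return word >> 4, word & 0x0F


rule_word = pack_rule(100, 0b10)  # Z = 100, only A0 present
mf_word = pack_alphas(12, 3)      # e.g. alpha0 = 12, alpha1 = 3 for one A0 value
```

Under this layout a full Rule Memory image is simply a list of 49 such rule words, and each membership function bank is a list of 128 such 8-bit words indexed by the input value.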
Software Development Tool
For High Energy Physics applications we need a fuzzy processor able to make a decision in less than 500 ns. The solution is therefore to design an ASIC fuzzy chip. Since our goal is a high processing rate, we decided to process only the active rules of a fuzzy system in which all the rules are present. A SW application has therefore been developed that converts any general fuzzy system into a new one in which all the possible rules are present. A first important choice has to be made as far as the input
On the other hand, as previously mentioned, each general problem may be described by a set of fuzzy rules. The number of rules is bounded by the number of input variables, the number of fuzzy sets for each input and the degree of overlap between fuzzy sets. In our case we have at most 49 fuzzy rules. However, one could use just a subset of them, with some fuzzy rules involving just one input variable.
The Software Development Tool has thus been designed precisely to convert this partial set of fuzzy rules into the equivalent complete one. By equivalent we mean that the two sets of fuzzy rules must give exactly the same result after being processed.
With this tool, the Fuzzy Processor is always set up with the same number of fuzzy rules, so that we can foresee exactly when the output result will be ready to be used.
Pipeline Subdivision
The overall architecture of the fuzzy processor is pipelined as shown in figure 2, which displays the data flow for every pipeline stage. Note that the 9 pipeline stages reported in the figure consist of the 5 actual pipeline stages into which the fuzzy architecture has been divided, plus 4 pipeline stages due to the number of active rules. Since 4 active rules are always processed, this time is counted as pipeline time. From the moment a new data set enters the processor, nine pipeline stages are required for the fuzzification and inference processes. In the first clock period two processes are performed in parallel: the four αs are read from the membership function memory. In the following period the address generator produces the first rule address, and one period later the first rule codes are available for the minimum circuit. The first Θ is produced in the 4th pipeline stage, whereas the first Θ*Z is valid one period later. After 9 periods both sums ∑Θ and ∑(Θ*Z) have been computed, so that the final division, which requires 70 ns, can start. What is really remarkable in this pipelined structure is that a new input data set can enter the system after only four clock periods, since at this stage all four rule addresses have already been generated and the first logic blocks can accept new data. As for the overall delay from the input data set loading cycle to the generation of the corresponding output data, four contributions have to be considered.
First, there is the startup time due to the input synchronization, which requires one clock period of 20 ns. Second, there is the time due to the actual number of pipeline stages, which is 5 as reported above, that is 5 x 20 ns = 100 ns. Third, there is the time due to the number of active fuzzy rules, which is 4 x 20 ns = 80 ns. Finally, there is the division time, which takes nearly 70 ns.
Altogether these delays give a global processing time of 270 ns when a 50 MHz clock rate is used.
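The delay budget above can be checked with a few lines of arithmetic. The sketch below simply restates the four contributions from the text; the function name and the returned dictionary keys are illustrative, and the division time is treated as a fixed 70 ns independent of the clock, which is an assumption.

```python
def timing_budget(clock_mhz=50, pipeline_stages=5, active_rules=4, division_ns=70):
    """Return the clock period, total latency and input interval in nanoseconds."""
    period_ns = 1000 / clock_mhz             # 20 ns at 50 MHz
    sync_ns = 1 * period_ns                  # input synchronization
    stages_ns = pipeline_stages * period_ns  # actual pipeline stages
    rules_ns = active_rules * period_ns      # one period per active rule
    return {
        "period_ns": period_ns,
        "latency_ns": sync_ns + stages_ns + rules_ns + division_ns,
        # A new data set can enter once all rule addresses have been generated.
        "input_interval_ns": active_rules * period_ns,
    }


budget = timing_budget()
```

At 50 MHz this gives 20 + 100 + 80 + 70 = 270 ns of latency and a new input data set every 80 ns, which matches the figures quoted in the feature summary.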
Layout Design Guidelines
A first version of the layout is reported here. Figures 3 and 4 show the main block dimensions and the I/O pads. The silicon area is nearly 10 mm² in 0.7 µm ES2 digital technology. The dashed line represents the clock net, designed with a tree-shaped routing style. As can be seen, the memory blocks do not require a large silicon area in comparison with the whole layout. This justifies the look-up table choice previously mentioned. In addition, thanks to the small layout size, both the net parasitic capacitance and, consequently, the timing delays are considerably reduced. This simplifies the layout design in terms of power supply and clock net dimensioning.
This is what we obtained after synthesizing the VHDL code and running proper pre-layout simulations of the schematic view. As for the layout design, the processor components have been divided into the logic blocks shown in figure 3. This division has been made taking into consideration the logic function each component was designed for. Moreover, the input/output pad distribution has been chosen as a trade-off solution in order to minimize the net connectivity. The placement and routing phases have thus been carried out after setting some constraints in terms of integrated circuit design parameters. In particular, some global nets, such as the power supply nets VDD and GND and the clock and reset nets, have been routed manually. Since these nets may connect hundreds of points, it can be very dangerous to let them be routed automatically: the result could be very far from what one would expect. This does not apply to short nets that connect only a few components. Another important point during the routing phase is the net priorities. This parameter concerns net overlapping. When two nets cross each other, it must be decided which one is left on a single metal layer and which one is routed on another metal layer by means of a layer contact. This point, especially for global nets such as those previously mentioned, may significantly affect the connectivity and the functionality of the chip, since different layers have different performance in terms of resistance, maximum current and parasitic parameters.
In our design, the clock net can be routed by means of a large net trunk from the clock pad and many small net branches from the trunk to the standard cell clock pins. This allows the clock edges to be distributed quite evenly among the cell rows. It is also important to leave a sufficient margin between the clock pad fan-out and the global net fan-in. Moreover, the standard cells that are not connected to the clock signal, or rather the logic blocks that do not have to be synchronized, may stay off the clock net. In this way, the synchronous part of the design can be grouped together and divided into two main standard cell row blocks (see figure 3); the clock net can then be easily routed in the middle of the blocks. Figure 4 shows only the power supply net distribution, in order to emphasize the interdigitated power trunks. As shown, each component block has its own VDD and GND net trunk. This avoids overlapping between power and ground nets while each component is correctly supplied. The power net width must be dimensioned taking into account the clock frequency, the number of components and their average power consumption, the estimated number of components involved in each clock cycle and the percentage of them that actually switch. This is because, even though all the logic gates are always connected, the signal propagating from the input pads does not always pass through all of them. Moreover, the gates it does pass through may not switch at all.
Of course, the power net dimensioning can only be done approximately and, for this reason, the resulting net width can be multiplied by 2 or 3 as a safety margin.
Conclusions
One of the main aims of this paper is to explore the possibility of applying Fuzzy Logic and, in particular, fuzzy processors to application fields other than control, pattern recognition, etc. As an example, we have reported the high-speed investigation carried out in order to apply fuzzy processors to High Energy Physics experiments. This can be done by means of a fast pipeline-parallel fuzzy architecture that, for a small number of input variables, gives rise to a small fuzzy chip. Moreover, the design of this small chip prototype, which is going to be submitted for fabrication, follows the same rules we have used and successfully applied in previous chip designs. For this reason we expect that the prototype will work properly.
