This paper presents a novel architecture for implementing general-purpose fuzzy chips. It allows fully parallel rule processing with a reduced number of mixed-signal computing blocks and minimum-sized digital memories. The resulting fuzzy processor can interact directly with continuous sensors and actuators as well as with a subsequent digital processing system.
A fuzzy system is inherently parallel in the $u$ input variables, the $R$ rules, and the $N$ elements into which the output space is discretized. In hardware, this implies a trade-off between high inference speed (parallel processing) and low silicon area (sequential processing). Fuzzy systems that employ singleton values, $c_r$, to define the consequents, $B_r$, are usually chosen for hardware realization since they eliminate parallelism in $N$ [1] [2] [3]. Fully digital approaches reported in the literature reduce parallelism in $R$ ($= L^u$, where $L$ is the number of membership functions per input) by processing only the $\alpha^u$ simultaneously active rules (where $\alpha$ is the overlapping degree of the input membership functions) [1, 4]. To process these rules in parallel, the proposal in [4] employs $\alpha^u$ copies of the rule memory and multibit computing operators, which consume a large area. Combining analogue and digital circuitry is more attractive: digital circuitry eases programmability of the fuzzy processor and compatibility with subsequent digital systems, while analogue circuitry offers parallel computing with fewer hardware resources. However, the digitally programmable analogue realizations reported previously implement a small number of rules and do not optimise the digital part, because the programmable parameters are stored in digital registers and selected by extensive matrices of switches (multi-port-like digital memories), which occupy a large area [2] [3].
The architecture presented in this letter allows both the analogue and the digital part of a fully parallel fuzzy processor to be optimised. The analogue core is optimised by using an active-rule driven scheme implemented with current-mode computing blocks. The digital part is optimised by a memory organisation that makes it possible to retrieve all the required parameters in parallel without replicated or costly multi-port memories.
PROPOSED DESIGN
Singleton fuzzy systems compute the output $y = \sum_r h_r c_r / \sum_r h_r$, where $h_r$ is the activation degree of the $r$-th rule and $c_r$ is its singleton consequent [1] [2] [3]. The architecture we propose to implement them is illustrated in Figure 1 for the case of two inputs, $u=2$, and a maximum overlapping degree of two, $\alpha=2$. The parameters that define the antecedents and consequents are stored in conventional RAMs (the $X_i$-Mem and Y-Mem sets) so that the fuzzy processor can be programmed for a given application.
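As a purely behavioural illustration of this formula (not a model of the circuits), the following minimal Python sketch evaluates a singleton fuzzy system with trapezoidal membership functions and MIN as the antecedent connective. The function names, the breakpoint encoding `(a, b, c, d)`, and the two-set example are assumptions made for illustration only.

```python
from itertools import product


def trapezoid(x, a, b, c, d):
    """Membership degree of a trapezoid with breakpoints a <= b <= c <= d."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


def singleton_inference(inputs, mfs, consequents):
    """Return y = sum(h_r * c_r) / sum(h_r) over all rules r."""
    num = den = 0.0
    for labels in product(*(range(len(m)) for m in mfs)):
        # MIN of the antecedent membership degrees gives the activation h_r.
        h = min(trapezoid(x, *mfs[i][k])
                for i, (x, k) in enumerate(zip(inputs, labels)))
        num += h * consequents[labels]
        den += h
    return num / den if den > 0 else 0.0


# Hypothetical two-input system with two fuzzy sets (LOW, HIGH) per input.
LOW, HIGH = (0.0, 0.0, 0.0, 1.0), (0.0, 1.0, 1.0, 1.0)
mfs = [[LOW, HIGH], [LOW, HIGH]]
consequents = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
y = singleton_inference((0.25, 0.75), mfs, consequents)
print(y)
```

The sketch loops over every rule for clarity; the architecture described here evaluates only the simultaneously active rules, and does so in parallel.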
The membership degrees, $I_\mu$, of each input variable are obtained from the transfer functions of $\alpha$ membership function circuits (MFCs). The MFCs described in [5] (Figure 2), which are based on digitally programmable current mirrors (D/A), have been selected. They accept analogue input signals and provide trapezoidal functions defined by 4 digital words. Hence, the size of the global memory, X-Mem, that stores the membership function parameters of each input is $4 \cdot L$ words. Each X-Mem memory is divided into $\alpha$ parts (conventional RAMs), each of which stores the parameters of membership functions that are never active simultaneously. This is illustrated in Figure 3 for the simple case of $L=4$ (in this example, the $4 \cdot 2$ words associated with the membership functions NB and PS are stored in the $M_{11}$ part of $X_1$-Mem, while the other $4 \cdot 2$ words are stored in the $M_{12}$ part). For a given input $x_i$, one set of parameters (4 words) of each memory part is addressed by a code of $n$ bits, $\{b_{1x_i}, \ldots, b_{nx_i}\}$, where $n$ is the smallest integer greater than or equal to $\log_2(L+1-\alpha)$ and $L+1-\alpha$ is the number of possible combinations of active input fuzzy sets. Hence, the membership degrees are calculated in parallel, since the $\alpha$ sets of required parameters per input are retrieved with one access to the X-Mem global memory. Each code $\{b_{1x_i}, \ldots, b_{nx_i}\}$ is obtained by comparing the input $x_i$ with the centres of the membership functions that cover the $i$-th input space, as illustrated in Figure 3. This comparison can be done in parallel by using $L-\alpha$ current comparators and a maximum of $L-2$ programmable current mirrors. An alternative is a binary-tree comparison scheme. In this case, shown in Figure 1, the operation takes more time ($n$ clock phases governed by the signals $\{R_1, \ldots, R_n\}$), but no additional current comparators or programmable mirrors are required because the input stage of the MFCs (shown within a dashed box in Figure 2) is reused.
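The memory organisation and address generation described above can be sketched in software as follows. This is an illustrative model under stated assumptions: membership functions are taken to be ordered along the input axis and assigned to parts by their index modulo $\alpha$ (which reproduces the NB/PS grouping of the example), the labels NS and PB are assumed names for the two functions the text does not name, and the address of a word inside a part is a hypothetical encoding, not the one used in the chip.

```python
def address_bits(L, alpha):
    """Smallest n with 2**n >= L + 1 - alpha (number of input intervals)."""
    return (L - alpha).bit_length()


def partition_xmem(labels, alpha):
    """Part j holds functions j, j + alpha, ...; these never overlap."""
    return [labels[j::alpha] for j in range(alpha)]


def interval_code(x, centres, alpha):
    """Binary-tree search for the interval index k of input x.

    Functions k .. k + alpha - 1 are the active ones. Only the
    L - alpha interior centres act as comparison thresholds.
    """
    thresholds = centres[1:len(centres) - alpha + 1]
    lo, hi = 0, len(thresholds)
    while lo < hi:  # at most address_bits(L, alpha) comparisons
        mid = (lo + hi) // 2
        if x >= thresholds[mid]:
            lo = mid + 1
        else:
            hi = mid
    return lo


def fetch_active(parts, k, alpha):
    """Read one entry from every part: the function active in interval k."""
    active = []
    for j, part in enumerate(parts):
        m = k + (j - k) % alpha  # active function index stored in part j
        active.append(part[m // alpha])
    return active


labels = ["NB", "NS", "PS", "PB"]   # NS and PB are assumed names
centres = [0.0, 1.0, 2.0, 3.0]      # hypothetical centres
parts = partition_xmem(labels, alpha=2)
k = interval_code(1.5, centres, alpha=2)
print(address_bits(4, 2), parts, k, fetch_active(parts, k, alpha=2))
```

Each part is read exactly once per input, which is why single-port RAMs suffice in this model.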
The $i$-th MUX block after the MFCs replicates the currents and identifies which MFC output goes to each MIN circuit by using the least significant bits of the set $\{b_{1x_i}, \ldots, b_{nx_i}\}$. The rules' activation degrees are computed in parallel by the $\alpha^u$ multi-input analogue MIN circuits whose structure is described in [5]. The MUX blocks after the MIN circuits identify the corresponding CONS block by using the least significant bits of the whole set $\{b_{1x_1}, \ldots, b_{nx_1}, \ldots, b_{1x_u}, \ldots, b_{nx_u}\}$.
The CONS blocks are digitally programmable current mirrors (Figure 2) that weight each rule's activation degree by its corresponding singleton value. The $L^u$ digital words that define all the singleton values are stored in the Y-Mem global memory. This memory is divided into $\alpha^u$ parts (conventional RAMs), each of which stores consequent values that are never active simultaneously. For the case illustrated in Figure 3 ($\alpha^u=4$), each of the 4 parts stores 4 digital words ($M_1$, for instance, stores $c_1$, $c_3$, $c_9$, and $c_{11}$). Given an input $\{x_1, \ldots, x_u\}$, one word of each memory part is addressed by the whole code $\{b_{1x_1}, \ldots, b_{nx_1}, \ldots, b_{1x_u}, \ldots, b_{nx_u}\}$, so that the $\alpha^u$ required consequents are retrieved in just one access to the Y-Mem memory. The sums $\sum_r h_r c_r$ and $\sum_r h_r$ are implemented simply by wiring the corresponding outputs together, since the signals are currents. Hence, all the active rules are processed in parallel.
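The partition of Y-Mem can be checked with a short sketch. It assumes that rules are numbered from 1 in row-major order over the membership-function indices of the inputs and that a rule is assigned to a part according to those indices modulo $\alpha$; both are assumptions for illustration, although with $L=4$, $u=2$, $\alpha=2$ they reproduce the grouping $c_1$, $c_3$, $c_9$, $c_{11}$ quoted above.

```python
from itertools import product


def rule_number(idx, L):
    """Row-major rule number (1-based) for membership-function indices idx."""
    r = 0
    for i in idx:
        r = r * L + i
    return r + 1


def partition_ymem(L, u, alpha):
    """Group rule numbers into alpha**u parts keyed by (index mod alpha)."""
    parts = {}
    for idx in product(range(L), repeat=u):
        key = tuple(i % alpha for i in idx)
        parts.setdefault(key, []).append(rule_number(idx, L))
    return parts


def active_rules(codes, L, alpha):
    """Rules activated when input i lies in interval codes[i]."""
    ranges = [range(k, k + alpha) for k in codes]
    return [(tuple(i % alpha for i in idx), rule_number(idx, L))
            for idx in product(*ranges)]


parts = partition_ymem(L=4, u=2, alpha=2)
print(parts[(0, 0)])

# For every combination of interval codes, the active rules fall in
# distinct parts, so one read per part retrieves all of them.
for codes in product(range(3), repeat=2):
    keys = {key for key, _ in active_rules(codes, L=4, alpha=2)}
    assert len(keys) == 4
```

The final loop is the property that matters: no two simultaneously active consequents ever share a memory part.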
The DIV block implements the final division $\sum_r h_r c_r / \sum_r h_r$ with a successive-approximation technique, so that the output is provided in both digital and analogue formats (Figure 2). Division can be performed in parallel by using a flash A/D converter at the cost of silicon area. A good speed/area trade-off is achieved by a divider based on continuous-time algorithmic data converters, like that described in [3]. In this case, the time spent on division increases with the output resolution.
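The idea of dividing by successive approximation can be illustrated with a generic bit-by-bit search. This is a behavioural sketch only, not the circuit of [3]: it assumes the output is normalised to the range $[0, 1)$ of full scale and that each trial code scales the denominator for comparison against the numerator; the function name and arguments are hypothetical.

```python
def sar_divide(num, den, bits):
    """Approximate num / den (expected in [0, 1)) with a `bits`-bit code.

    One bit is decided per step, from MSB to LSB, so the number of
    steps grows linearly with the output resolution.
    """
    code = 0
    for b in reversed(range(bits)):
        trial = code | (1 << b)
        # Keep the bit if the scaled denominator does not exceed num.
        if trial * den / (1 << bits) <= num:
            code = trial
    return code, code / (1 << bits)


# Example with a 5-bit output: sum(h*c) = 1.0 and sum(h) = 3.0.
digital, analogue = sar_divide(1.0, 3.0, bits=5)
print(digital, analogue)
```

The pair returned mirrors the two output formats mentioned above: the accumulated code is the digital result and its scaled value is the analogue one.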
From previous designs integrated in a 2.4-µm CMOS process [3, 5], we can estimate the following features for a typical two-input fuzzy processor with $\alpha=2$ implemented with the proposed architecture. Its analogue core occupies a silicon area of about 1 mm$^2$ (considering 8- and 4-bit words to program the antecedents and consequents, respectively, and 5-bit resolution for the output) and consumes less than about 20 mW from a 5-V power supply. Its response time is less than about 2 µs. These features change only slightly when the total number of rules increases (for instance, 16 rules if $L=4$, or 81 rules if $L=9$), because the number of simultaneously active rules, $\alpha^u$, remains the same. The area and power consumption of the digital part are also optimised, since the number of words stored is the minimum needed to define the system.
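The scaling argument can be made concrete with a small count of rules and stored words, using only the figures given in the text ($4 \cdot L$ words per input in X-Mem, $L^u$ words in Y-Mem, and $\alpha^u$ simultaneously active rules). The function name is hypothetical, and address decoding and control overhead are deliberately ignored.

```python
def resources(L, u=2, alpha=2, antecedent_bits=8, consequent_bits=4):
    """Count rules and stored words for L membership functions per input."""
    x_words = 4 * L * u      # 4 words per trapezoid, L trapezoids, u inputs
    y_words = L ** u         # one singleton per rule
    return {
        "rules": L ** u,
        "active_rules": alpha ** u,  # sets the number of MIN/CONS blocks
        "x_mem_words": x_words,
        "y_mem_words": y_words,
        "memory_bits": antecedent_bits * x_words + consequent_bits * y_words,
    }


for L in (4, 9):
    print(L, resources(L))
```

In this count only the memory contents grow with $L$; the number of active rules, and with it the number of analogue MIN and CONS blocks, stays at $\alpha^u = 4$.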
CONCLUSIONS
A novel architecture for implementing fuzzy processors has been presented. Area and power consumption are very small because parallel computing is performed in the current-mode analogue domain using an active-rule driven scheme. A suitable organisation of the digitally programmable parameters makes it possible to retrieve them in parallel from conventional RAMs. Hence, many rules can be processed at high inference speed with very few hardware resources.
