This Paper presents a mixed-signal neuro-fuzzy controller chip which, in terms of power consumption, input-output delay and precision, performs as a fully analog implementation. However, it has much larger complexity than its purely analog counterparts. This combination of performance and complexity is achieved through a mixed-signal architecture consisting of a programmable analog core of reduced complexity, together with a strategy, and the associated mixed-signal circuitry, to cover the whole input space through the dynamic programming of this core [1]. Since errors and delays are proportional to the number of fuzzy rules included in the analog core, which is small, they are much smaller than in the case where the whole rule set is implemented by analog circuitry. Also, the area and the power consumption of the new architecture are smaller than those of its purely analog counterparts, simply because most rules are implemented through programming. The Paper presents a set of building blocks associated with this architecture, and gives results for an exemplary prototype. This prototype, called MFCON, has been realized in a standard 0.7µm CMOS technology. It has two inputs, implements 64 rules and features an input-to-output delay of 500ns with a power consumption of 16mW. Results from the chip in a control application with a DC motor are also provided.
I. INTRODUCTION

Associative Memory Networks, such as the CMAC network, the B-spline network, the Radial Basis Function network or Fuzzy Systems [2], partition the input space and generate the output from data related to small areas around the input vector. This locality provides network transparency and allows the introduction of structured knowledge, as in Fuzzy Systems, which has become a major advantage for designing control systems quickly. On the other hand, the number of basis functions (rules in a fuzzy system) grows exponentially with the input space dimension in Associative Memory Networks, which is their main disadvantage.
Implementations of large control algorithms with many variables are usually carried out in software on powerful computers [3] [4]. This works for systems where only mechanical or thermal processes are involved, with time constants above one second, so the input-output delay of the control action is not very demanding. However, if faster processes are to be handled, such as those in motion control and power systems, special purpose hardware may be necessary or may perform better than the software approach. To cope with complex control tasks in the range of milliseconds, general purpose microprocessors or DSPs, or those with a special set of instructions, are the best choice [5] [6]. However, to manage delays from a few milliseconds down to the microsecond and even nanosecond range, special purpose ASICs are required [7] [8] [9].
Digital ASICs [10] [11] are robust because they work with digital signals, thus they can handle more complex tasks. On the other hand, they need an outer shell of analog circuitry to build the interface with sensors and actuators. Analog circuits [12] [13] [14] [15] [16] are considered good candidates to implement neural networks, despite their sensitivity to errors and noise, because the precision requirements are assumed to be low. However, this assumption holds less well for Associative Memory Networks than for networks with high redundancy such as multilayer perceptrons, because just a few nodes and parameters determine the output, so errors in these nodes modify the output and are not compensated or corrected by other nodes. Thus, as the input dimension grows, and with it the system complexity, errors are difficult to keep bounded. The most complex pure monolithic analog fuzzy controllers implement around 15 rules (basis functions) [13] [14]. Nevertheless, since analog circuits provide the best efficiency in terms of area and power/speed ratio, it would be desirable to increase their complexity to manage problems such as motor control, where complexities above 25 rules are common [5]. Their direct interface to the plant and faster operation allow them to achieve better results in terms of overshoot, settling time, oscillations, and ripple voltages or currents [7] [9].
In order to build larger analog circuits with bounded errors, we could exploit the inherent tuning capabilities of learning procedures. However, this is only useful if the learning algorithm is implemented on-chip or with the chip-in-a-loop [17]. The first approach increases the hardware requirements, because we need precise circuits for the learning part as well as finely tunable nodes in the remaining architecture [18]. The second approach still needs tunable nodes but takes advantage of an external computer to perform precise computations. The main drawback here is that the resulting controller is more expensive because of the need for tuning. It is also less flexible, and the process to obtain the controller becomes longer and begins to lose its main appeal. In addition, the resulting controller is less autonomous for use in embedded applications. Thus, learning algorithms are often used to obtain the programming parameters on a conventional computer, and the result is then put on silicon without further tuning [9].
Under such conditions, how could we increase the complexity of the networks while processing in analog mode to keep a small delay, power consumption and area? This paper shows an implementation of the strategy in Fig.1, which exploits the local nature of Associative Memory Networks to preserve the advantages of analog implementations [1]. Since these networks provide the output from just a small set of nodes, we implement these nodes in an analog core and make it dynamically programmable to compute the output for any input vector. The basis function identifier performs the input space partition with a set of A/D converters. Note that these converters do not convert the input for further computing, but just perform a coarse clustering to get the set of basis functions that determine the output. This means we usually need only simple A/D converters, with as few as 3 bits, for every input dimension. The output of the converters is used to address a database which stores the programming data for the analog core. Once the programming data are on the programming bus, the controller is able to provide the output because the core processes the input with analog circuits.
Since the input-to-output signal path is entirely analog, the analog performance is preserved as long as the programming time is just a fraction of the analog core delay. In addition, small analog cores can be carefully designed to bound the error at the output and be multiplexed dynamically to implement a large number of rules. This solves the problem of facing complex tasks while preserving a small input-output delay, and even good performance in terms of area and power consumption. Section II briefly describes this strategy and the resulting mixed-signal high-complexity fuzzy architecture; Sections III and IV describe the implementation of the high level building blocks in this architecture; and Section V provides experimental results of the MFCON prototype, which is based on this approach and has been implemented in a standard technology, as well as results from a control application example. Finally, conclusions are collected in Section VI.
II. MIXED-SIGNAL HIGH-COMPLEXITY FUZZY CONTROLLER ARCHITECTURE
The proposed controller is based on a zero-order Takagi-Sugeno fuzzy system whose rules have singletons in the consequents. The surface response is interpolated from the singletons as given in (1), where the multi-dimensional basis functions are evaluated by extracting the minimum from the values of the one-dimensional membership functions associated with the k-th rule, and are chosen to generate a lattice partition of the input space [19] [20] (see Fig.2(a) for an illustration of lattice partitions).
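As a reading aid, a zero-order Takagi-Sugeno system with minimum inference and normalization is commonly written in the form below. This is a sketch of the standard textbook form, not necessarily the authors' exact notation: the symbols $y_k^{*}$ (singleton of the k-th rule), $w_k$ (basis function), $s_{ki}$ (one-dimensional membership function), $N$ (number of rules) and $M$ (number of inputs) are assumptions introduced here.

```latex
y(\mathbf{x}) \;=\; \frac{\sum_{k=1}^{N} y_k^{*}\, w_k(\mathbf{x})}{\sum_{k=1}^{N} w_k(\mathbf{x})},
\qquad
w_k(\mathbf{x}) \;=\; \min\bigl( s_{k1}(x_1),\, s_{k2}(x_2),\, \dots,\, s_{kM}(x_M) \bigr)
```

With a lattice partition, each $s_{ki}$ is one of the labels defined on input $x_i$, and every combination of labels across the $M$ inputs defines one rule.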
This type of inference with lattice partitions has been employed in different implementations. Its advantages are simplicity, generality and ease of programmability [21] [22]. On the downside, the number of rules needed to perform a good approximation becomes prohibitively large as the number of inputs increases [20]. Since in fully-analog implementations the errors and parasitic capacitances at the global computation nodes grow with the number of rules, these errors become very large, thus degrading the global accuracy and input-output delay.
The architecture used herein overcomes this problem by using the decomposition property presented in [23] [24]. The input space is split into subspaces defined by the lattice partition; see Fig.2(a) for illustration. Within each subspace the corresponding piece of the surface response (see the drawing at the right in Fig.2(a)) is captured by the simplest fuzzy system within this subspace, a system of just two rules per input [23] [24]. Interestingly, the structure of this simplest fuzzy system remains the same for all subspaces; only some parameters must be tuned in order to fit the surface response within each subspace. Thus, the strategy adopted here consists of: a) implementing a programmable analog fuzzy core for the simplest fuzzy system; b) locating the subspace corresponding to the applied inputs; c) mapping the actual subspace location onto a set of corresponding programming signals for the analog fuzzy core. Outputs are then computed by the analog core driven by the inputs and the subspace programming signals. Since errors and input-output delay are basically determined by the simple analog core, they can be kept bounded even in very complex controllers.
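The three steps a) to c) can be checked numerically with a behavioural model. The following is a minimal sketch, not the chip's circuitry: it assumes triangular membership functions that overlap pairwise, a two-input lattice, and hypothetical function names (`full_output`, `locate_interval`, `core_output`, `multiplexed_output`). It compares a controller that evaluates every rule with one that programs a fixed four-rule core for the interval containing the input.

```python
import bisect


def tri_memberships(x, centers):
    """Degrees of membership of x in triangular labels peaking at `centers`.

    Neighbouring labels overlap pairwise, so at most two degrees are nonzero.
    """
    mu = [0.0] * len(centers)
    if x <= centers[0]:
        mu[0] = 1.0
    elif x >= centers[-1]:
        mu[-1] = 1.0
    else:
        i = bisect.bisect_right(centers, x) - 1
        t = (x - centers[i]) / (centers[i + 1] - centers[i])
        mu[i], mu[i + 1] = 1.0 - t, t
    return mu


def full_output(x1, x2, c1, c2, y):
    """Reference: evaluate every rule of the lattice (min inference, normalized)."""
    mu1, mu2 = tri_memberships(x1, c1), tri_memberships(x2, c2)
    num = den = 0.0
    for i, a in enumerate(mu1):
        for j, b in enumerate(mu2):
            w = min(a, b)
            num += w * y[i][j]
            den += w
    return num / den


def locate_interval(x, centers):
    """Step b): index m of the interval [centers[m], centers[m+1]] containing x."""
    m = bisect.bisect_right(centers, x) - 1
    return max(0, min(m, len(centers) - 2))


def core_output(x1, x2, bounds1, bounds2, y4):
    """Step a): fixed four-rule core, programmed with interval bounds and singletons."""
    def low_high(x, bounds):
        lo, hi = bounds
        t = min(1.0, max(0.0, (x - lo) / (hi - lo)))
        return (1.0 - t, t)

    mu1, mu2 = low_high(x1, bounds1), low_high(x2, bounds2)
    num = den = 0.0
    for i in (0, 1):
        for j in (0, 1):
            w = min(mu1[i], mu2[j])
            num += w * y4[i][j]
            den += w
    return num / den


def multiplexed_output(x1, x2, c1, c2, y):
    """Steps b) and c): locate the interval, fetch its programming data, run the core."""
    m, n = locate_interval(x1, c1), locate_interval(x2, c2)
    y4 = [[y[m][n], y[m][n + 1]], [y[m + 1][n], y[m + 1][n + 1]]]
    return core_output(x1, x2, (c1[m], c1[m + 1]), (c2[n], c2[n + 1]), y4)


if __name__ == "__main__":
    centers = [float(k) for k in range(8)]  # assumed: 8 labels per input, 64 rules
    table = [[(i - 3) * (j + 1) for j in range(8)] for i in range(8)]
    for x1, x2 in [(0.3, 6.9), (3.5, 3.5), (7.0, 0.0), (2.0, 5.25)]:
        print(x1, x2,
              full_output(x1, x2, centers, centers, table),
              multiplexed_output(x1, x2, centers, centers, table))
```

In this model both controllers agree because, with pairwise-overlapping labels and minimum inference, every rule outside the current interval has zero weight. Only the four rules of the interval contribute, which is what allows a small core to stand in for the whole rule set.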
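From now on, the generic subspace, labelled $C_{mn}$ in Fig.2(a), will be called the interpolation interval.

The architecture in Fig.2(b) surrounds the Analog Core with the Interval Selector, the Rule Antecedent Programmer, which provides a set of analog programming values, and the Rule Consequent Programmer, which provides a set of digital programming values. These are used to program the antecedent and the consequent blocks of the Analog Core, respectively.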
This system operates in asynchronous and continuous-time mode, so that its input/output delay is bounded only by the intrinsic circuit response time. In order to preserve the analog performance, the multiplexing blocks must be designed to minimize their effects on global parameters, such as the input/output delay and errors. In the following, we describe the implementation of the blocks in Fig.2(b). This description includes considerations pertaining to the general case and details pertaining to the two-dimensional case implemented in the MFCON controller prototype.
III. ANALOG CORE
The analog core of Fig.2(b) implements the Takagi-Sugeno algorithm in (1) in a controller with two inputs, therefore four rules, which has a generic interpolation interval as the input space − see Fig.3 (a). The circuitry is based on that previously reported by the authors in [14] , although some important modifications have been made to incorporate programmability, as well as to save area and power. 
The core is composed of instances of two main building blocks, namely the Input Block and the Rule Block, which are described in Sections III.A and III.B.

Note that rules $R_1$ and $R_3$ in Fig.3(a) are rules $R_2$ and $R_4$, respectively, in the interval located to the left of that depicted in the figure. This means that a rearrangement of the rules is required when the interpolation interval changes and the core is programmed. This rearrangement is realized by analog multiplexors in other programmable architectures [13] [25]; here, however, the core architecture is fixed, and the rearrangement is realized by digital multiplexors in the programming interface (details are found in Section IV.E). On the one hand, this strategy is much more robust; on the other, it yields a significant reduction of errors, delays and interferences.
A. Input Block Circuitry
The Input Block circuitry is shown in Fig.4, and Table 1 gives some expressions related to circuit design and performance. The transconductance parameter in Table 1 is the large-signal transconductance factor of the $M_i$ transistor in Fig.4.

Bias voltages are generated to keep $M_{ET}$ and $M_{EB}$ in the top mirror, as well as $M_{QT}$ and $M_{QB}$ in the bottom mirror, in the saturation region. These voltages are obtained from the bias circuits in Fig.4(b). A design parameter is chosen to obtain the desired smoothness, which improves for smaller values of this parameter (see Fig.5). A bias current is added to the differential pair output in Fig.4(b) to prevent the transistors in the top mirror from entering weak inversion, which would degrade the dynamic response.

B. Rule Block and Output

Fig.6 shows the Rule Block CMOS implementation in the MFCON chip prototype, and Table 2 gives its main design equations. First, as already said above, the minimum circuit output cell provides the rule antecedent output in Fig.6, together with the element needed to perform the complement at the output required by De Morgan's law, which also provides a path to discharge the internal node. Next, the normalization is performed in every rule block by the normalization circuit unit cell in Fig.6 [14] to obtain the normalized rule output. The bias current source and the sink transistor are shared by all rule blocks in the analog core. Every normalization circuit output is reflected by a PMOS current mirror and weighted by the singleton value, represented by the binary code in Fig.6; an offset current is also added to improve the dynamic performance. This weighting is carried out by a digitally programmable current mirror.

However, the dynamic programming of the mirror can cause large current spikes at the output unless a special design is made. This design introduces the top branches in Fig.6, whose control signals make them drive some current when their associated switch in the output branch is ON, and vice versa. This guarantees that the transistors in the current mirror are always in the saturation region, and never in the ohmic region. Since transitions from the ohmic to the saturation region were found to be the cause of the large spikes at the output, the latter are reduced drastically with the proposed design.
Since the singleton weighting circuit provides a current as output, and the final processing step of the algorithm is the addition of these currents, we just wire up the rule block outputs, as Fig.3(b) illustrates, to exploit KCL and obtain a current which enters the global output node. However, we still reflect this output current with a current mirror to get a current that leaves the controller. The output current range corresponds to 150µA FSO (Full Scale Output) for the chip of this Paper. Fig.7 shows the output of the analog core for an interpolation interval of MFCON as measured in the laboratory.
With respect to errors, note first that those of a systematic nature are minimized by using symmetrical structures and cascode transistors. Hence, the most important errors are of a random nature, due to transistor mismatches. The error bound at a given fuzzy set core [20] or interpolation point is determined by terms found in Table 2. Since many parameters have to be set to fit the estimated error within a specified bound, a software mathematical assistant is required. The design of this Paper was made to keep 3σ below 10% FSO.
For the offset current in Fig.6, smaller values than those chosen in this chip achieve a considerable reduction of the error, while the dynamic response is not much affected.
IV. MULTIPLEXING BLOCK SET

C. Interval Selector
This block comprises a battery of A/D converter blocks, as Fig.8(a) shows. A simple and fast flash A/D converter (see Fig.8(b)) is the best choice since high-speed asynchronous operation is required and low resolution is enough; note that the number of labels L is seldom higher than seven in most control applications. In addition, the converter comparators are designed to have hysteresis, and a priority encoder is used to convert the thermometer code into a Gray code.
It is important to note that these converters are not in the signal path, but in the control path; also, they do not encode the input signal to be digitally processed, but just cluster the input space into regions. Therefore, the proposed fast flash A/D converters can readily accomplish the resolution requirements, and the input-output delay of the overall controller is hardly affected by the programming circuitry around the analog core which processes the input signal. Specifically, the estimated delay for this block in the controller presented in this Paper is 60 ns, for an input overdrive of 50mV from the reference voltage. This delay is around 10% of the measured global controller input-output delay.
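The behaviour of such an interval selector can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the MFCON circuit: the reference voltages, the hysteresis width and the function names (`comparator_bank`, `priority_encode`, `binary_to_gray`) are hypothetical, and the hysteresis is modelled as a symmetric shift of each comparator threshold.

```python
def comparator_bank(vin, refs, previous, hysteresis):
    """Update a bank of comparators with hysteresis; returns a thermometer code.

    A comparator that was high trips at ref - hysteresis/2, and one that was
    low trips at ref + hysteresis/2, so noise near a reference does not toggle it.
    """
    outputs = []
    for ref, was_high in zip(refs, previous):
        threshold = ref - hysteresis / 2 if was_high else ref + hysteresis / 2
        outputs.append(1 if vin > threshold else 0)
    return outputs


def priority_encode(thermometer):
    """Return the number of the highest active comparator (0 if none is active)."""
    for k in range(len(thermometer) - 1, -1, -1):
        if thermometer[k]:
            return k + 1
    return 0


def binary_to_gray(n):
    """Gray code of n: consecutive interval indices differ in exactly one bit."""
    return n ^ (n >> 1)


if __name__ == "__main__":
    refs = [0.5 * k for k in range(1, 8)]  # assumed: 7 references, 3-bit index
    state = [0] * len(refs)
    for vin in (1.02, 1.06, 0.98, 0.94):   # input wandering around the 1.0 reference
        state = comparator_bank(vin, refs, state, hysteresis=0.1)
        index = priority_encode(state)
        print(vin, index, format(binary_to_gray(index), "03b"))
```

Two design points are visible in the model. The hysteresis keeps the selected interval stable while the input hovers near a boundary, which avoids reprogramming the core repeatedly. The Gray code ensures that a move to an adjacent interval changes a single address bit, which is a common way to limit spurious intermediate addresses in asynchronous logic.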
There are three basic elements which make up the Interval Selector block in Fig.8; among them, the digital encoding logic is built with CMOS logic gates of minimum size.
On the other hand, Fig.9(a) shows the simple three-stage comparator with hysteresis that has been implemented in MFCON, whose schematic is depicted in Fig.9(c).

[Fig.9: comparator with hysteresis, including a table of transistor sizes. Sizes listed (W/L): 30µm/2µm, 60µm/2µm, 20µm/2µm, 20µm/2µm, 10µm/2µm, 20µm/5µm, 4µm/0.8µm, 15µm/3µm, 60µm/3µm, 15µm/3µm, 60µm/3µm.]
The switches in the second stage allow us to either connect or disconnect the current sources $I_{PH}$ to the input inverter node $v_A$; this either adds or subtracts current at this node, thereby shifting the reference voltage of the comparator to one of two values, respectively (see Fig.9(b)).
First-order calculations give an expression for the hysteresis in terms of $I_{PH}$ and $g_m$, where $g_m$ is the small-signal transconductance of the differential stage. The approximation holds under conditions relating the output resistances of the circuitry which implements the current sources $I_{PH}$ (see the right part of Fig.9(c)) and the ON resistances of the NMOS and PMOS switches. The expression shows that the hysteresis can be controlled by the current source $I_{PH}$, which is derived from an external bias current $I_{POLH}$.
The Table in Fig.9 shows the sizes of the transistors in Fig.9 (c) as used in MFCON. They have been designed to cope with the requirements of gain, common mode range, power consumption and errors, from the simulations and analytical expressions reported in [27] .
D. Rule Antecedent Programmer
The main requirements for this block are high-speed operation, as well as design simplicity, reliability and compactness. Because every antecedent programming value must be selected from the whole set of programming values in Fig.8(b), this block comprises a battery of digitally controlled analog multiplexor cells, as Fig.10(a) shows. Fig.10(b) shows the internal structure of an analog multiplexor cell, which is composed of analog switches (CMOS transmission gates) and a Gray decoder which provides the digital control signals for the switches. Both elements are well-known CMOS building blocks and their design will not be explained here.
E. Rule Consequents Programmer
Every consequent programming value is an S-bit digital value, which must be selected from the whole set of singleton values of the rule set; hence this block must store and address efficiently one S-bit word per rule in a digital memory. Besides, it must implement two ways to address these data: one for writing or reading data for external programming (i.e. to program the controller with the proper rule set), and another one for internal accesses, to get the programming set from the address provided by the Interval Selector. This internal read interface must be asynchronous and fast enough to cope with the speed and continuous-time requirements of the controller. It must also provide the whole programming set in one step and in the proper order.
Because of the input space lattice partition, every generic fuzzy rule belongs to different adjacent interpolation intervals, thus the corresponding singleton value can belong to different programming sets (see the example in Fig.11(a)).

[Figure caption: Rule order and programming bus example for two adjacent intervals.]

The design of this block is based on the generic conceptual architecture for internal accesses depicted in Fig.12(a). The figure shows how the data are distributed into different blocks of Memory Cells, each containing a subset of singleton values which will never be addressed simultaneously. For a given address, the Row Selector selects one row per Memory Cell block simultaneously, thus all needed singleton values are ready to be accessed. At the same time, the Column Selector controls the multiplexor to place every singleton value properly on the programming bus. Fig.12(b) illustrates the timing of these accesses.

The basic building block of the Memory Cell blocks in MFCON is shown in Fig.14(a). It comprises four one-bit basic memory cells, associated with two different Memory Cell blocks in Fig.12(a). Specifically, the rows in Fig.14(a) correspond to indices n and n+1. Thus, the whole block is configured to be accessed as a two-row, two-column conventional memory block, and the memory is a conventional RAM for external accesses. Fig.14(b) shows the elements in one column of the basic building block in Fig.14(a). The one-bit basic memory cell used in MFCON (enclosed in the dashed square in Fig.14(b)) is a single-ended bitline static CMOS cell [28], which is common in register files and multiport memories. Because of the flexibility in changing its configuration, and its robustness [29], this cell is very suitable for the reconfiguration requirements of this application. Fig.14(b) also shows the switches for reconfiguration and the latches for regenerating the logic levels. The transistor sizes in the memory cell have been determined from the design recommendations in [28].
The Row Selector activates the proper row selection lines after decoding the access type (read/write) and origin (internal/external) and the corresponding subset of address lines (see Fig.13). For external accesses the block works as a conventional binary decoder, while for internal accesses it works as a Gray decoder which simultaneously activates several row selection lines per access, one per Memory Cell block considered. Fig.15(a) illustrates the conceptual architecture and interfaces, while Fig.15(b) illustrates the basic building block of the Row Selector as implemented in MFCON.
Because internal accesses are always for reading, the Column Selector for Internal Accesses comprises a battery of properly sized multiplexors with shared control lines, as Fig.15(c) illustrates, where the conceptual architecture and interfaces of this block in MFCON are shown.
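One way to picture the internal access path is the behavioural sketch below. It is an illustration under assumptions, not the MFCON memory organization: the singleton table is split into four blocks by the parity of the lattice indices (one possible way to guarantee that values read together never share a block), and the names `build_memory_blocks` and `read_programming_set` are hypothetical.

```python
def build_memory_blocks(singletons):
    """Distribute a two-input singleton table into four blocks by index parity.

    The four singletons of any interpolation interval have four distinct
    parity pairs, so they always sit in four different blocks.
    """
    blocks = {(p, q): {} for p in (0, 1) for q in (0, 1)}
    for i, row in enumerate(singletons):
        for j, value in enumerate(row):
            blocks[(i % 2, j % 2)][(i // 2, j // 2)] = value
    return blocks


def read_programming_set(blocks, m, n):
    """Internal read for interval (m, n), returned in fixed core-slot order.

    Slot (0, 0) is the rule at the low corner of the interval on both inputs,
    slot (1, 1) the rule at the high corner.
    """
    bus = {}
    for di in (0, 1):
        for dj in (0, 1):
            i, j = m + di, n + dj
            # Row selection: one row is read from each of the four blocks.
            block = blocks[(i % 2, j % 2)]
            # Column selection: the value is routed to its fixed core slot.
            bus[(di, dj)] = block[(i // 2, j // 2)]
    return bus


if __name__ == "__main__":
    table = [[10 * i + j for j in range(8)] for i in range(8)]
    blocks = build_memory_blocks(table)
    print(read_programming_set(blocks, 2, 3))
    print(read_programming_set(blocks, 3, 3))
```

Moving from interval (2, 3) to the adjacent interval (3, 3) shows the rearrangement discussed for the fixed core: the singleton of lattice point (3, 3) stays in the same memory block but is routed to a different core slot.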
Because column data lines are properly wired to the multiplexor inputs, the control block can be kept simple.

V. EXPERIMENTAL RESULTS

In the MFCON layout, the analog part has been placed far away from the digital one, with large isolation guard rings in between (see footnote 1). Also, in order to further reduce interferences between these parts, separate pins and lines have been employed for the analog supply, the digital supply, and the ground [30].

1. The area occupation of this chip, around 5mm², is much larger than needed. Since 5mm² is the minimum area for MPW projects, largely conservative layout floorplanning strategies have been adopted regarding the separation of analog and digital parts.

In addition, Fig.20 shows results from an example application with the chip in a control loop.
The task is the start-up of a DC motor controlled with a PWM DC-DC switching converter at 100kHz. Fig.20(a) shows the control surface, while Fig.20(b) and Fig.20(c) show the motor speed (top) and armature current (bottom) for the direct start and the controlled soft-start, respectively. Note that the speed rise time is similar in both the direct and controlled cases, while the initial current spike is not present in the controlled case.
VI. CONCLUSIONS
The mixed-signal fuzzy controller chip presented in this Paper attains the performance levels of fully analog controllers while overcoming their inherent limitations in terms of programmability and complexity. This is achieved by employing the multiplexing strategy and architecture presented by the authors in [1]. The data in Table 3 are intended to compare the MFCON chip to other analog fuzzy controller chips. From Table 3 it is seen that the MFCON chip implements many more rules than the others. Note also that this increased number of rules is accompanied neither by a significant increase in power consumption nor by a drop in operation speed. Actually, regarding power consumption, only the I m = 800mA prototype in [14] consumes less power than the MFCON chip, although it realizes four times fewer rules. Regarding speed, the prototype in [13] is faster, although its power consumption is much larger and its complexity much smaller.
The data in Table 3 confirm that the proposed strategy actually overcomes the limitations of purely analog controllers while keeping their performance advantages. For instance, a fully analog controller designed by the authors [14] using similar circuitry yields 470ns and 8.6mW for 16 rules, while the MFCON chip yields 500ns and 16mW for 64 rules. Advantages of the proposed architecture become more evident as the number of rules and inputs increases [1]. Furthermore, the proposed architecture is very well suited for the modular generation of complex fuzzy controllers.
Since it is based on the dynamic programming of an analog core, whose size ($2^M$ rules) depends just on the number of inputs M, we could have a reduced set of well-designed analog cores (one input, two inputs, three inputs...) as cells. Every cell is valid for building controllers with a different number of rules, while their performance should remain quite similar in terms of errors, power consumption and input-output delay. These cells could even be integrated in conventional microcontrollers, which would provide very good control performance.
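The figures quoted above can be put on a per-rule basis with a few lines of arithmetic. The sketch below uses only the numbers stated in the text (470ns and 8.6mW for 16 rules, 500ns and 16mW for 64 rules). The helper names are hypothetical, and the choice of 8 labels per input is an assumption made here because it is consistent with 64 rules on two inputs.

```python
CHIPS = {
    "fully analog [14]": {"rules": 16, "delay_ns": 470.0, "power_mw": 8.6},
    "MFCON": {"rules": 64, "delay_ns": 500.0, "power_mw": 16.0},
}


def power_per_rule_mw(chip):
    """Power consumption divided by the number of implemented rules."""
    return chip["power_mw"] / chip["rules"]


def lattice_rules(labels_per_input, num_inputs):
    """Rules in a full lattice partition: one per combination of labels."""
    return labels_per_input ** num_inputs


def core_rules(num_inputs):
    """Rules held in the analog core: two overlapping labels per input."""
    return 2 ** num_inputs


if __name__ == "__main__":
    for name, chip in CHIPS.items():
        print(name, chip["rules"], "rules,",
              round(power_per_rule_mw(chip), 4), "mW per rule")
    ratio = CHIPS["MFCON"]["delay_ns"] / CHIPS["fully analog [14]"]["delay_ns"]
    print("delay ratio:", round(ratio, 3))
    for m in (1, 2, 3):
        print(m, "inputs:", lattice_rules(8, m), "lattice rules,",
              core_rules(m), "core rules")
```

On these numbers the rule count grows by a factor of four while the delay grows by about 6% and the power per rule roughly halves. The core size grows as $2^M$, while the lattice grows as the number of labels raised to the power M.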
