A CMOS GENERAL-PURPOSE SAMPLED-
Introduction
Although analogue integrated circuits can offer advantages over their digital counterparts in terms of speed, power dissipation and silicon area, digital circuits are often the preferred solution where programmability is required. In particular, digital signal processors and microprocessors offer a large degree of flexibility, as the functionality of the system can be determined solely through software development. Recent research in the field of programmable analogue circuits has resulted in the development of reconfigurable devices [1-3], which can generally be thought of as the analogue equivalent of digital FPGAs. Algorithmically programmable analogue chips, based on the Cellular Neural Network (CNN) operation, have also been introduced [4]. In this paper we describe the analogue microprocessor (AµP), which executes software programs and can be thought of as an analogue equivalent of a digital microprocessor.
The idea of an instruction-level programmable analogue processor has previously been described by Masuda et al. [5]. The architecture presented there was based on charge-domain operations, and the circuitry required high-gain amplifiers, capacitors and analogue switches. However, this discrete-component implementation, although providing proof of concept, exhibited rather poor performance. Recent advances in analogue sampled-data signal processing techniques, and in CMOS technology, allow for the efficient implementation of the analogue microprocessor as a switched-current (SI) circuit. Our results show that the AµP can achieve savings in silicon area and power dissipation when compared with digital processors. These savings, especially when combined with parallel processing techniques, can enable the design of low-cost, high-performance systems.
One of the application areas where it will be advantageous to use AµPs is low-level image processing [6]. Massively parallel SIMD (Single Instruction Multiple Data) arrays of mesh-connected digital processing elements have long been known to be efficient in executing early-vision algorithms [7]. The area-efficient implementation of a processing element is of primary importance, as it enables the integration of thousands of processors onto a single die, and thereby fully exploits the inherent fine-grain parallelism of early-vision tasks by realising pixel-per-processor correspondence [8]. The AµP described in this paper exhibits high performance/area and performance/power ratios and is therefore well suited for use as a processing element in a massively parallel array.
Analogue microprocessor architecture
The block diagram of the generic AµP architecture is presented in Figure 1. The AµP consists of a register file (each register is an analogue memory cell, capable of storing a sample of data), an analogue ALU (Arithmetic Logic Unit), and an analogue I/O port.
All the building blocks are interconnected via an analogue data bus. The processing of information is performed entirely on analogue values. However, in a way akin to a digital microprocessor, the AµP executes a software program, performing consecutive instructions issued by a digital controller. These instructions may include register transfer operations, which move the analogue samples of data between registers of the AµP, I/O operations, which move the data to and from I/O ports, arithmetic operations, which modify the analogue data, and comparison operations, which allow for conditional branching. The program is stored in the local memory of the controller, which is a purely digital device. The complete processor is therefore a mixed-mode system, with an analogue data-path and a digital control-path.
The register file
The schematic diagram of a simplified AµP is presented in Figure 2. The depicted register file comprises four registers, each of which is a basic SI memory cell consisting of a memory transistor MX (X = A, B, C, D), a current source, and two switches WX and SX.
Consider a simple register-transfer operation, which can be denoted as A ← B. To execute this instruction the switches WA, SA and SB are closed, while the remaining ones are open. Therefore, the only non-zero currents entering the analogue bus node will be the current iB, which is read out from register B, and the current iA, which is being written to register A. Since the currents entering the node must sum to zero, it is also true that:
$$ i_A = -i_B \qquad (1) $$
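Since WA is closed, the transistor MA is diode-connected and therefore its gate-source voltage, VgsA, will set itself to the value corresponding to the drain current, as described by the saturation-region equation:
$$ I_{dsA} = I_{REF} - i_A = K\,(V_{gsA} - V_T)^2 \qquad (2) $$
where K is the transconductance factor and V_T is the threshold voltage of MA.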
If the switch WA is now opened, a quantity of charge will be stored on the gate capacitance of the transistor MA, and the gate-source voltage VgsA will hold its value (for the purpose of this analysis we disregard any error effects). As long as the switch WA remains open, each time the switch SA is closed the drain current of MA will be set by the gate-source voltage VgsA (it is assumed that the memory transistors are in saturation when their corresponding S-switches are closed) and will be equal to IdsA = IREF - iA, where iA is the value of the input current at the time the switch WA was opened. In this way, the current iA = -iB is stored in register A. (By default, all switches revert to the open position after an instruction has been executed. For correct operation, it is also necessary to ensure that W-switches always open before S-switches.)
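The write and read phases described above can be sketched numerically. The following is a minimal behavioural model, assuming the ideal square-law relation and ignoring all error effects; the class name `SIRegister`, its methods and the device constants `K`, `V_T` and `I_REF` are illustrative assumptions, not values taken from the test chip.

```python
import math

# Illustrative device constants (assumed, not taken from the paper).
K = 50e-6      # transconductance factor, A/V^2
V_T = 0.8      # threshold voltage, V
I_REF = 10e-6  # current of the cell's current source, A


class SIRegister:
    """Idealised SI memory cell: no charge injection, no noise."""

    def __init__(self):
        self.v_gs = V_T  # gate-source voltage held on the gate capacitance

    def write(self, i_from_others):
        """W and S closed: store the current forced by the rest of the bus."""
        i_own = -i_from_others   # currents entering the bus node sum to zero
        i_ds = I_REF - i_own     # drain current of the diode-connected transistor
        if i_ds < 0:
            raise ValueError("bus current outside the range set by I_REF")
        self.v_gs = V_T + math.sqrt(i_ds / K)  # square law solved for V_gs

    def read(self):
        """W open, S closed: current delivered by the cell to the bus."""
        i_ds = K * (self.v_gs - V_T) ** 2
        return I_REF - i_ds


a, b = SIRegister(), SIRegister()
b.write(-3e-6)      # preload B so that it reads back about +3 uA
a.write(b.read())   # A <- B: switches WA, SA and SB closed
print(a.read())     # close to -3e-6, i.e. the inverted contents of B
```

The stored quantity is a gate voltage, yet the value seen by the rest of the processor is always a current, which is why the inversion appears in every transfer.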
The analogue ALU
The analogue ALU is required to provide the basic arithmetic operations of addition, inversion and multiplication/division. However, as can be seen from (1), inversion is inherent in the basic current-transfer operation. Moreover, addition in a current-mode system is performed on the analogue bus with no area overhead, using current summation. For example, to execute the instruction D ← B + C, the switches WD, SD, SB and SC are closed, the remaining ones are open, and the current stored in register D is equal to:
$$ i_D = -(i_B + i_C) \qquad (3) $$
Therefore only a multiplier/divider needs to be physically implemented. The multiplier is constructed as a set of current mirrors, with binary-scaled transistors MM1, MM2 and MM3. This is enough to realise the multiplication of an analogue value by a digital constant. The current iM is stored in the multiplier, just as in any other register, by closing switches WM and SM1. We get:
(4)
The current read out from the multiplier, iM', depends on the multiplication factor k. This is a binary word which selects the appropriate mirrors using switches SM1, SM2 and SM3:
A simple CMOS buffer is used as a current comparator. The output voltage VCOMP is determined by the current charging or discharging its high-impedance input node. This voltage provides the controller with the comparison result, allowing for conditional branches.
An input/output port is simply realised by an analogue switch SIO, which connects the port node to the analogue bus.
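The bus-level behaviour of transfer, addition and comparison can be illustrated with a short behavioural sketch. It assumes ideal, error-free cells and expresses currents in microamps; the class `IdealAuP`, its method names and the comparator polarity (output true for a positive net current) are assumptions made for illustration. The multiplier is left out because its scaling is not modelled here.

```python
class IdealAuP:
    """Behavioural model: registers hold currents, the bus sums and inverts."""

    def __init__(self, names=("A", "B", "C", "D")):
        # Register contents in microamps; all cells start empty.
        self.reg = {name: 0.0 for name in names}

    def transfer(self, dst, *srcs):
        """dst <- sum(srcs): the written current is minus the summed read-out."""
        self.reg[dst] = -sum(self.reg[s] for s in srcs)

    def compare(self, *srcs):
        """Comparator result: True if the net current on the bus is positive
        (the polarity is an assumption of this sketch)."""
        return sum(self.reg[s] for s in srcs) > 0


p = IdealAuP()
p.reg["B"] = 2.0
p.reg["C"] = 3.0
p.transfer("D", "B", "C")   # D <- B + C
print(p.reg["D"])           # -5.0
print(p.compare("D"))       # False
```

Because the sum is formed on the bus itself, the addition involves no dedicated adder; only the destination register and the selected sources take part.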
Program execution
All of the switches within the AµP are operated in response to logic-level voltages set by a digital controller. The complete set of these voltages, controlling all switches, forms the instruction-code word (ICW). The sequence of the ICWs issued by the controller constitutes a machine-level software program. This program dictates the way the samples of data are transferred and manipulated within the processor, allowing the software implementation of the required processing algorithm.
To further illustrate the operation of the AµP, consider the example program presented in Table I. The high-level description, the resulting machine-level code, the ICWs and the resulting current-transfer equations are shown.
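One way to picture an ICW is as a bit vector with one bit per switch. The sketch below assumes a hypothetical switch ordering and bit assignment (the actual ICW layout is not specified in the text); only the sets of closed switches for A ← B and D ← B + C are taken from the descriptions above.

```python
# Hypothetical switch ordering; bit i of the ICW drives SWITCHES[i].
SWITCHES = ["WA", "SA", "WB", "SB", "WC", "SC", "WD", "SD",
            "WM", "SM1", "SM2", "SM3", "SIO"]
BIT = {name: 1 << i for i, name in enumerate(SWITCHES)}


def icw(*closed):
    """Build an instruction-code word with the named switches closed."""
    word = 0
    for name in closed:
        word |= BIT[name]
    return word


def decode(word):
    """List the switches that a given ICW closes."""
    return [name for name in SWITCHES if word & BIT[name]]


# A machine-level program is simply a sequence of ICWs.
program = [
    icw("WA", "SA", "SB"),         # A <- B
    icw("WD", "SD", "SB", "SC"),   # D <- B + C
]

for word in program:
    print(format(word, "013b"), decode(word))
```

With this representation the controller needs no instruction decoder: each stored word drives the switches directly.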
Accuracy
An important issue with analogue circuits is the accuracy of processing. Apart from noise, the major error sources in SI circuits are charge injection effects in analogue switches [10], voltage coupling through the gate-source capacitance of transistors, and the finite output conductance of the transistor. The errors in SI memory cells will cause the AµP instructions to be performed with limited accuracy. Consider the transfer instruction A ← B. In the non-ideal case, the current transfer is performed with an error, which consists of a systematic part ΔS(iB) and a random noise ΔN(*). The systematic error can be split into the signal-independent part ΔSI and the signal-dependent part ΔSD(iB).
Many methods have been proposed to reduce the error effects in SI circuits [8]; however, the more sophisticated methods of error compensation in SI cells require more complex circuitry, and therefore the design of an AµP will involve trade-offs between accuracy, speed, area and power. A particularly good compromise between cell area and accuracy can be achieved using the S²I technique, which offers significant signal-dependent error reduction [11].
Signal-independent error cancellation can be easily achieved by appropriate sequencing of the instructions. For example, consider variable assignment from A to B. The basic transfer instruction of the AµP performs inversion, so the assignment will be performed in two steps, using an auxiliary register C: first the transfer C ← A, followed by B ← C. Now, assume that each transfer instruction is performed with a constant signal-independent error ΔSI (i.e. neglecting signal-dependent errors, noise and errors arising from device mismatches). For the first transfer we get:
$$ i_C = -i_A + \Delta_{SI} \qquad (8) $$
And as a result of the second transfer instruction we get:
$$ i_B = -i_C + \Delta_{SI} = i_A - \Delta_{SI} + \Delta_{SI} = i_A \qquad (9) $$
The errors cancel out and an assignment operation that is free of signal-independent errors is achieved. Similarly, as shown in Table II, complete cancellation of the signal-independent error can be achieved for the addition, subtraction and multiplication operations.
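The cancellation in the two-step assignment can be checked numerically. The sketch below assumes the simplified error model used in the derivation (a constant offset added to every inverting transfer, with no signal-dependent error or noise); the function name `transfer` and the offset value `DELTA_SI` are illustrative, with currents expressed in microamps.

```python
DELTA_SI = 0.125  # assumed signal-independent error per transfer, in uA


def transfer(i_src, delta=DELTA_SI):
    """One basic transfer: inversion of the source current plus a constant error."""
    return -i_src + delta


i_a = 3.0
i_c = transfer(i_a)   # C <- A: the offset appears once
i_b = transfer(i_c)   # B <- C: the inverted offset is cancelled by the new one

print(i_c)  # -2.875
print(i_b)  # 3.0
```

The intermediate value in C carries the full offset, so the cancellation relies on the same error being added in both steps; any mismatch between the two transfers would leave a residual.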
Test Chip
Our implementation of the AµP is targeted at a processor array intended for low-level image processing. For this application, the primary consideration is the silicon area occupied by a single processing cell. The power consumption must also be kept within certain limits. On the other hand, our analysis shows that for the majority of low-level image processing tasks a moderate level of accuracy, equivalent to 5 or 6 bits, is adequate. The use of six to eight registers and a multiplier resolution of 3 to 4 bits should also be sufficient.
To aid the evaluation of different design trade-offs, we have designed and fabricated an integrated circuit containing 15 AµPs, using various error-cancellation methods and different transistor sizes. The silicon area occupied by a basic register cell therefore varies from 17 µm × 39 µm to 54 µm × 57 µm. The chips were fabricated through EUROPRACTICE using the standard 0.8 µm CMOS process from AMS. The AµP circuits operate with a 3.3 V power supply and were tested, performing various algorithms, using a laboratory data generator as an external controller. For different processor designs we have obtained magnitudes of the signal-dependent error of a single instruction from 0.2 % to 3.5 %, with processors operating at speeds from 70 kHz to 4 MHz.
Consider one of our AµPs, built using the S²I error compensation technique. The processor works satisfactorily with a clock frequency of up to 2.5 MHz. The nominal reference current level is equal to 10 µA. The total power dissipation within the processor is less than 100 µW. (To reduce power consumption, only the current sources required by a particular instruction are enabled.) The effective area occupied by the processor, comprising six registers and a 3-bit multiplier, is equal to 11200 µm². As the typical assignment or arithmetic operation will take two clock cycles, the performance/area ratio for this processor is equal to 0.11 GOPS/mm² (Giga Operations Per Second per mm²). The performance/power ratio is equal to 12.5 GOPS/W. As can be seen from Table III [...] Figure 3. In this example the image was processed serially on a single processor clocked at 2.5 MHz. Pixel values were fed to the processor using a D/A converter, and the result was read out using an A/D converter. The processing speed was therefore relatively low.
However, small cell size and low power dissipation are the key features that enable massive parallelism. A very high performance system could be built by integrating a large number of processors. An SIMD array of 128 × 128 such AµPs could feasibly be accommodated on a single die and, when clocked at 2.5 MHz, perform algorithms with a speed of over 20 GOPS while dissipating less than 2 W of power.
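The quoted performance ratios and array estimates follow from simple arithmetic on the measured figures, as the sketch below reproduces. It assumes two clock cycles per operation and takes the 100 µW power figure as an upper bound; the variable names are illustrative, and the array area counts processor cells only, ignoring routing and peripheral circuitry.

```python
clock_hz = 2.5e6          # clock frequency of the example processor
cycles_per_op = 2         # typical assignment or arithmetic operation
area_mm2 = 11200 * 1e-6   # 11200 um^2 expressed in mm^2
power_w = 100e-6          # upper bound on power dissipation per processor

ops_per_s = clock_hz / cycles_per_op          # operations per second
gops_per_mm2 = ops_per_s / area_mm2 / 1e9     # performance/area ratio
gops_per_w = ops_per_s / power_w / 1e9        # performance/power ratio

n_proc = 128 * 128                            # processors in the SIMD array
array_gops = n_proc * ops_per_s / 1e9
array_power_w = n_proc * power_w
array_area_mm2 = n_proc * area_mm2            # cells only, no routing

print(f"{gops_per_mm2:.2f} GOPS/mm^2, {gops_per_w:.1f} GOPS/W")
print(f"array: {array_gops:.2f} GOPS, {array_power_w:.2f} W, "
      f"{array_area_mm2:.0f} mm^2")
```

Both the power and the throughput scale linearly with the number of processors, so the performance/power ratio of the array equals that of a single cell.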
Conclusion
We have presented a general-purpose analogue microprocessor whose architecture is analogous to that of its digital counterpart. The AµP executes software programs while operating on analogue data values. The AµP paradigm will find application in areas that can benefit from employing analogue signal processing techniques, but where the flexibility of a software-programmable device is nevertheless required. As an example we have considered a massively parallel processor array, targeted at image processing applications. The high performance/area and performance/power ratios exhibited by the switched-current AµP will allow a great number of processors to be integrated onto a single chip, resulting in the development of low-cost, high-performance systems.
