Unfortunately, the many ideas underlying the design of this chip cannot be covered in a single paper; hence, this paper focuses on, first, placing the ACE16k in the ACE chip roadmap and, then, discussing the most significant modifications of ACE16k versus its predecessors in the family.
I. INTRODUCTION
VISION involves extremely complex computational tasks [1]-[8]. These tasks are so complex that, despite the huge set of applications and potential uses, no artificial vision system has so far reached the efficiency of natural vision systems. Indeed, the performance of currently available artificial vision systems is far below that of the smallest insect, despite the use of the most sophisticated latest-generation computing devices. Is this paradox due to a lack of industrial or commercial interest? Clearly not, since the number of applications of artificial vision systems is enormous. What, then, is the reason underlying the gap between natural and artificial vision systems? Probably, the reason is that conventional signal processing architectures are not the best suited for vision. In these architectures, there is a clear separation between signal acquisition and signal processing, with the role of analog processing restricted to the front-end functions, namely transduction, signal conditioning, and data encoding. The problem is that images contain a huge amount of data, much of it redundant, i.e., not carrying any information. Hence, does it make any sense to consume resources in handling, i.e., converting and processing, these data? Nature gives us some hints. In natural vision systems, the front-end device, the retina, not only acquires but also preprocesses the visual information [9], [10], such that the amount of data transmitted through the optic nerve to the brain is compressed by a factor of around 150.¹ A similar compression of information occurs in any vision processing chain. As the signal climbs through consecutive levels in the processing path, its dimensionality shrinks whereas its abstraction increases. Thus, although serial digital signal processing is advisable at the upper levels of the hierarchy, it may not be adequate for early processing.
Operating with images at the bottom level of the processing hierarchy implies intensive memory accesses and imposes important constraints on the bandwidth of the communications between memory and processor. Also, having one chip to sense the visual information (imager) and another to process it (processor) requires high-speed data conversions and transfers to achieve high frame rates. Using the conventional Imager-Memory-DSP architecture, it is possible to reach 30 FPS, even for high-resolution images. However, high-speed industrial applications requiring ultrafast frame rates² may become unfeasible.
¹ The human eye contains about 150 million photoreceptors, whilst the optic nerve contains about 1 million fibers.
² On the order of 1000 FPS.
1057-7122/04$20.00 © 2004 IEEE
ACE chips render ultrafast operation feasible by using massively parallel analog processing at the early stages, as natural retinas do. Some reasons supporting this choice are as follows [11]-[13].
1) The accuracy required for early processing is moderate or even low. Actually, the perceptual quality of the images does not drop significantly in the presence of perturbations (noise, spatial variances, nonlinearities, etc.), even if these perturbations are as large as 5% of the full scale.³
2) The speed versus power efficiency of moderate-to-low resolution analog circuits is much larger than that of their digital counterparts. This is relevant since very high speed is needed to achieve high frame rates for moderately large images.
3) The area efficiency of analog circuits for moderate-to-low resolution applications is better than that of their digital counterparts.
The chip described in this paper represents the third generation of ACE chips and has been designed to overcome some limitations of its predecessors, particularly those of the so-called ACE4k chip [5]. Major improvements of ACE16k include the following.
• Incorporation of digital buses for grayscale data: ACE16k embeds per-column data converters (arranged in analog-to-digital (A/D) and digital-to-analog (D/A) re-configurable pairs) for fully digital interfacing.
• Exact control of the timing for input/output (I/O) access: To that purpose, ACE16k does not include the possibility of individual cell selection; instead it incorporates an autonomous addressing scheme. Also, it employs a hand-shaking protocol to eliminate timing constraints.
• Better internal organization of the processing cells: ACE16k incorporates the so-called ACE-BUS to allow any functional block within the cell to communicate with any other.
• Use of nonconventional logic blocks: Particularly, the four local logic memories (LLMs) of ACE4k have been replaced by local analog memories (LAMs), and the local logic unit (LLU) has been designed to operate within reduced analog-compatible voltage ranges, instead of within complete digital voltage ones. Also, dynamic, instead of static, digital memories are used to store template masks. Finally, dedicated logic inverters with peak current limitation have been used instead of conventional ones.
• Improvement of the optical interface: ACE16k incorporates a re-configurable optical input module with the following features:
• User-defined photo-sensing device: The user can select among a P-Diffusion/N-Well photo-diode, a N-Well/P-Substrate photo-diode or a P-N-P vertical photo-transistor.
• User-defined sensing scheme: The user is allowed to select between normal linear integration modes or logarithmic compression sensing.
• Incorporation of an address event detection scheme: to simplify the extraction of information from black and
³ The exact number is obviously application dependent.
ACE chips consist of an array of identical processing elements (PEs) which execute the same instructions at the same time. Instructions are executed on data which are locally defined, i.e., at the PE level, while the sequence of instructions is controlled and timed by a digital controller which is shared by all the PEs. Typically, for implementation purposes, communications between PEs are restricted to the nearest neighbors. However, despite such an architectural limitation, ACE chips are able to implement most early-vision processing tasks [4]-[6], [13]. Adding the capability of sensing the visual information in a one-to-one pixel-to-PE correspondence makes these systems very well suited to implement the front-end stage of VSoCs. Obviously, processing images whose resolution is larger than the array size (necessarily limited due to the incorporation of programmable processing circuitry at the pixel level) requires windowing and time multiplexing.
Regarding ACE chip architectures, different questions arise, which relate to: 1) the functions to be incorporated within the PE; 2) the complexity of the control unit; and 3) the interfacing with other hardware and/or equipment. The answers to these questions are largely dependent on the intended application. However, due to the size, design complexity, and fabrication costs of these chips, the design of special-purpose devices is only advisable if a market niche absorbing mass production is ensured. Otherwise, the architecture of the PE must be flexible enough to guarantee the execution of the largest possible number of vision algorithms under real-life illumination conditions. Thus, taking into account that most early vision processes consist of the application of convolution masks, and the combination (either by Boolean operations in the case of B/W images, or by a local analog arithmetic operator) of their results in a bifurcated-flow algorithm, the following operators should be included at the PE level: 1) multipliers and adders, for the convolution operation; 2) analog registers, to allow for the storage of previous results at the local level; 3) an arithmetic operator and/or binary operator, to combine previously obtained results; 4) local masks, to allow for the conditional execution of certain operations at the PE level depending on some locally defined value; and 5) a wide dynamic range optical input, to provide the light-sensing capability and, hence, to avoid the bottleneck existing in data transmission from the sensory to the processing plane in conventional non-massively-parallel solutions. To cover the largest possible set of applications, all the functions above must be programmable, including reliable setting of analog parameters, reconfiguration of topologies, and control of internal data flows.
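The PE-level operators listed above (weighted sums over nearest neighbors, local storage, and mask-conditioned execution) can be illustrated with a minimal software sketch. This is only a behavioral analogy under stated assumptions: the function names, the list-of-lists image format, and the zero padding at the array border are hypothetical and are not taken from the chip.

```python
# Illustrative model only: names, zero padding at the array border, and the
# list-of-lists image format are assumptions, not taken from the chip.

def neighborhood_sum(image, template):
    """Weighted 3x3 nearest-neighbor sum computed identically at every PE."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Neighbors outside the array contribute nothing (assumed).
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += template[dr + 1][dc + 1] * image[rr][cc]
            out[r][c] = acc
    return out


def masked_update(previous, candidate, mask):
    """Each PE keeps its previous value unless its local mask bit is set."""
    return [
        [new if bit else old for old, new, bit in zip(p_row, c_row, m_row)]
        for p_row, c_row, m_row in zip(previous, candidate, mask)
    ]
```

In this sketch, `neighborhood_sum` plays the role of the multipliers and adders, the returned arrays stand in for the analog registers, and `masked_update` mimics the local masks that enable or disable an operation at each PE.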
Regarding the control unit, its roles are: 1) controlling the sequence of operations to be executed on the array; 2) storing the machine code of the algorithms to be implemented; 3) storing the data which define the internal analog parameters of the array; 4) interfacing with the external world using standard protocols; and 5) performing high-level signal processing tasks. Based on, first, the convenience of making the interfacing completely standard and, second, the need to guarantee robustness in the control of the analog parameters, the control unit should be fully digital, with the obvious exception of the blocks which interchange information (both data and commands) with the array.
ACE chips have been designed with these guidelines in mind. Specifically, this is the case of ACE16k [6], whose conceptual architecture is depicted in Fig. 1. As already mentioned, ACE16k represents the third generation of ACE chips. Fig. 2 depicts the evolution of these chips, where a bifurcation appears at the time when ACE16k was released. This bifurcation is related to the different nature of the behaviors addressed by the chips belonging to each of the branches. On the one hand, ACEXX chips are basically conceived to perform spatial image processing on temporal image flows. On the other hand, CACEXX chips are designed to emulate the spatial-temporal dynamic evolutions observed in mammalian retinas [14]. Table I summarizes some main features of the three different generations of ACEXX chips. It highlights a continuous improvement over time. ACE400, the first member of this family, was designed in 1996 using a standard 0.8-µm technology [4]. It was conceived to operate only on B/W image flows, and included reduced programming capability. Special attention was paid to the optical interface in order to achieve high-speed capturing through the incorporation of Darlington-based photocurrent amplification.
Four years later, in 2000, the ACE4k chip was released [5]. Together with a tenfold increase in spatial resolution, this chip incorporated much larger programming capabilities. Despite the increased complexity and its capability to handle grayscale images, this chip featured significantly larger PE density and lower power consumption while keeping the time constant basically unaltered. These gains were mainly the consequence of major architectural and circuit improvements, and only marginally due to the scaling down of the fabrication technology, from 0.8 µm to 0.5 µm.
By the end of 2002, the first version of the ACE16k chip was made available from the foundry [6]. Improvements of ACE16k versus ACE4k have already been mentioned in the Introduction and are summarized in Table II. Details about the architectural and circuit techniques employed to achieve such significant enhancements can be found in [13]. In Section III, we discuss mainly the modifications affecting the PE itself. Below, we give some hints regarding the programming memory and the I/O interface, whose circuit-level details are presented in [6].
Regarding the programming memory of ACE16k, it is similar to that of ACE4k. However, three main differences exist.
• The instruction memory has been arranged into two blocks with 64 words of 32 bits each. This division aims to separate addresses from the definition of operations (something like defining operations and operators separately). Thus, and thanks to the use of separate control buses, ACE16k has a programming memory of 64 × 64 words of 32 bits, instead of simply 64 words of 48 bits as in ACE4k. Such an increase in the memory gives the user the possibility of programming and testing more complex algorithms.
• ACE16k uses a memory control circuitry which includes a voltage-controlled oscillator to generate all the timing signals required for memory management.
• Finally, ACE16k uses hand-shaking protocols, instead of strobing signals, to control the access to the programming memory. This greatly simplifies control.
A related major modification of ACE16k consists of the incorporation of self-calibration stages into the analog buffers which drive weights and analog references to the cell array. Although ACE16k uses the same distributed buffer strategy as ACE4k, the topology of the buffer includes extra circuitry for calibration purposes. Fig. 3 shows a simplified block diagram of the weight generation circuitry in ACE16k, including the RAM block in which coefficients are digitally stored, the 8-bit D/A converter (DAC), the two-level buffer structure, and the calibration circuitry.
Regarding I/O, ACE16k incorporates a fully digital port for image transfers. Fig. 3 shows a simplified block diagram of the I/O block in ACE16k. It includes a bank of 128 8-bit A/D converters (ADCs) and DACs. Since the data bus is 32 bits wide, each word transmitted to/from the chip contains information about four adjacent cells (same row, consecutive columns). Thus, from the point of view of writing/reading images, the array can be divided into 32 identical blocks of four adjacent columns.
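The word organization described above (four 8-bit pixels per 32-bit word, 32 words per 128-pixel row) can be sketched as follows. The byte order inside each word is not stated in the text; placing the lowest column in the least significant byte is an assumption made here purely for illustration, and the function names are hypothetical.

```python
# Sketch of the row-to-word mapping. ASSUMPTION: within each 32-bit word the
# lowest-numbered column occupies the least significant byte.

PIXELS_PER_ROW = 128
PIXELS_PER_WORD = 4  # 32-bit bus / 8-bit pixels


def pack_row(pixels):
    """Pack one row of 128 8-bit pixels into 32 words of 32 bits."""
    if len(pixels) != PIXELS_PER_ROW:
        raise ValueError("a row must contain exactly 128 pixels")
    words = []
    for block in range(PIXELS_PER_ROW // PIXELS_PER_WORD):
        word = 0
        for i in range(PIXELS_PER_WORD):
            value = pixels[block * PIXELS_PER_WORD + i]
            if not 0 <= value <= 255:
                raise ValueError("pixel values must fit in 8 bits")
            word |= value << (8 * i)
        words.append(word)
    return words


def unpack_row(words):
    """Inverse of pack_row: recover the 128 pixel values from 32 words."""
    return [(w >> (8 * i)) & 0xFF for w in words for i in range(PIXELS_PER_WORD)]
```

Each entry of the list returned by `pack_row` corresponds to one of the 32 blocks of four adjacent columns mentioned in the text.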
Data transfer uses a two-stage pipeline architecture. In the input mode, data are sent to an input register of 8 × 128 bits (see Fig. 4). Once filled, the content of this register is transferred in parallel to an internal 8 × 128 register whose outputs (in blocks of 8) are permanently connected to a bank of 128 DACs which operate in parallel. At the same time, the external register is again being filled with the information about the next row to be written, thus avoiding idle periods. At the end of the conversion, the first module of a double bank of 2 × 128 sample-and-hold (S&H) circuits acquires the converted data and sends it to the selected row of cells. While the first module of the S&H bank sends the analog values to the array, the second module of this bank captures the next row of data, which is being converted by the DACs.
During an output process, the first row is acquired by the first module of the S&H bank. In the next step, these data are held and converted while the second module of the S&H bank captures the second row. At the end of this step, the digital information (the result of the conversion of the first row) is sent to the external register where it is ready to be downloaded during the third step. In the third step, the content of the first row is read at the output of the external register, the content of the second row is being converted and the third row is being captured by the first module of the S&H bank. 
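The overlap between capture, conversion, and readout during an output process can be made explicit with a small scheduling sketch. It only tabulates which row occupies each stage at each step, as described in the paragraph above; the function name, the zero-based row indices, and the dictionary layout are illustrative assumptions.

```python
# Tabulates the three overlapping activities of the output process.
# Rows and S&H modules are numbered from 0 here (an illustrative choice).

def output_schedule(num_rows):
    """Return one dict per step with the row handled by each stage.

    'capture'   : row being sampled by the S&H bank (None when idle)
    'sh_module' : which of the two S&H modules is capturing (0 or 1)
    'convert'   : row being converted by the ADCs
    'read'      : row available at the external register
    """
    schedule = []
    for step in range(num_rows + 2):  # two extra steps drain the pipeline
        capture = step if step < num_rows else None
        convert = step - 1 if 0 <= step - 1 < num_rows else None
        read = step - 2 if 0 <= step - 2 < num_rows else None
        sh_module = None if capture is None else capture % 2
        schedule.append(
            {"capture": capture, "sh_module": sh_module,
             "convert": convert, "read": read}
        )
    return schedule
```

With this numbering, the third step (index 2) shows row 0 being read, row 1 being converted, and row 2 being captured by the first S&H module, which matches the description in the text.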
III. ACE16K VERSUS ACE4K: MODIFICATIONS IN PE

A. ACE-BUS
The PE of ACE4k is designed to allow direct communication only between closely related blocks. In contrast, ACE16k uses a new communication scheme, the ACE-BUS. The ACE-BUS is basically a node of the processing unit (PU) to which every functional block connects its input and output ports. Communications between blocks always happen in the same way: one functional block (the data source) is configured in output mode, while a second one (the destination) is configured in input mode. This greatly simplifies the definition of operations and data movements, and allows for rapid checking of conflicting switch configurations. Fig. 5 shows the block diagram of the PU in ACE16k. Synapses take their inputs from the ACE-BUS. They can be initialized by using either the LAM module contents, the result of LLU operations, the result of an optical acquisition, or the result of a passive diffusion realized by using a resistive grid embedded in the chip. The analog processing core steers the processed input current (the input current after eliminating all the offset contributions) to the ACE-BUS; this current can then be routed either to the state capacitor or to any of the LAM modules.
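The source/destination convention of the ACE-BUS lends itself to a simple consistency check, sketched below. This is a hypothetical software model: the block names, the mode labels, and the rule that at least one destination must be present are assumptions for illustration, since the chip encodes these choices as bits of the digital instruction.

```python
# Hypothetical model: block names and the "input"/"output"/"idle" mode labels
# are illustrative; the chip encodes these choices as instruction bits.

def check_transfer(modes):
    """Validate one ACE-BUS transfer described as {block_name: mode}.

    Returns (source, destinations) or raises ValueError on a conflict.
    """
    allowed = {"input", "output", "idle"}
    for block, mode in modes.items():
        if mode not in allowed:
            raise ValueError(f"unknown mode {mode!r} for block {block!r}")
    sources = [b for b, m in modes.items() if m == "output"]
    destinations = [b for b, m in modes.items() if m == "input"]
    # Two blocks driving the shared node at once is the conflict to catch.
    if len(sources) != 1:
        raise ValueError(f"expected exactly one source, found {sources}")
    if not destinations:
        raise ValueError("no destination block is in input mode")
    return sources[0], destinations
```

For example, `check_transfer({"LAM1": "output", "LLU": "input", "sensor": "idle"})` accepts the transfer, whereas configuring both `"LAM1"` and `"sensor"` as outputs raises an error.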
B. Image-Processing Kernel
The synaptic analog multipliers are designed by using the same one-transistor technique as in ACE4k [13]. They are driven by voltages at both the signal and the scaling inputs and deliver a current at the output. The bank of multipliers, depicted at the conceptual level in Fig. 6, is driven by three different pixel values, so that the current which flows into the PE is given by (1), which combines the convolution product of the template and the pixel value matrix with the offset term generated by the one-transistor multipliers. This offset term is eliminated by using a high-accuracy current memory [13], [15]. Fig. 7 shows a conceptual schematic of the PE input block, including the S3I current memory used for offset cancellation, based on [15]. The resulting current, given by (2), is steered either to the ACE-BUS or to the input of a capacitive-input current comparator [16] whose output is connected to the ACE-BUS through an analog switch. Then, two situations may occur.
• A voltage encoding the sign of the resulting current (i.e., the sign of the outcome of the convolution operation) is delivered to the ACE-BUS, as expressed by (3). In this case, the output is a B/W pixel value.
• The analog current is routed to one of the capacitors associated with the pixels, and the output is a grayscale pixel value.
In any case, the specific pixel capacitor to which the output of the input block is routed is selected by the user through the activation of some bits in the digital instruction. By so doing, the evolution of the PU is described by a state equation whose actual expression depends on the selected integrating capacitor. Therefore, different kinds of processing kernels are available.
• Consider, for instance, that you want to execute a Sobel operator [8]. The convolution matrix is then defined in the template matrix; the image is loaded into the corresponding pixel input; the remaining values are set accordingly; and the signal current is routed to the selected capacitor. Hence, the equivalent state equation obtained for each PU is (4), whose steady state corresponds to the desired convolution output.
• Consider now that the input current is received by a different capacitor. Then, the cells are dynamically coupled and CNN spatio-temporal operations are realized.
• Consider finally that the current is routed to the selected capacitor, that all but the central entry of the template matrix are null, and that this central entry is set to a given value. A corresponding steady-state solution is then obtained.
[...] drives the weight to the cell. After some calculations, an expression (6) can be obtained which involves the aspect ratio of the synapse transistors and the width of the metal lines driving the weights. Under the applicable conditions, (6) becomes (7). Since the ratio involved is almost invariant from technology to technology (in the ideal case, both quantities scale as the technology scaling factor does), the aspect ratio of the synapse transistor in ACE16k must be reduced by a factor of four in order to keep the same voltage drop as in ACE4k. However, because the number of multipliers is two times smaller in ACE16k, the aspect ratio is reduced only by a factor of two. The reason for reducing the width is that it has practically no effect on the time constant. The drawback is a degradation of matching, which is attenuated by hardware.
C. Increasing the Cell Density
Ultimately, the PE size is determined by the lines which carry the weights and control signals: their number, their width, and the minimum separation between them. Obviously, having five metal layers (ACE16k, 0.35-µm technology) instead of three (ACE4k, 0.5-µm technology) gives some room for decreasing the cell size. However, the following hold.
• The top metal layer, metal 5, should be used only for power supply and ground. On the one hand, this layer has the largest separation between adjacent lines. On the other hand, it has the greatest conductivity and hence the maximum current driving capability.
• ACE16k has a much larger number of PE-embedded functions than ACE4k (50 versus 35). Obviously, this increases the number of control lines.
To meet the target of having cell densities larger than 150 cells/mm², ACE16k employs an interaction pattern among cells different from that of ACE4k. As Fig. 6 shows, each PE contains 12 analog multipliers. Eight of them connect the cell to its neighbors; the other four provide additional inputs to the processing block. The multipliers marked with a star in Fig. 6 are double; they consist of the parallel aggregation of two multipliers. The purpose of this "double strength" is to increase the robustness in certain operations. From [17], it can be seen that, in most cases, the central element of the template matrices is larger than the noncentral elements. At the electrical level, this means that the corresponding multipliers have to be driven by quite different voltages, thus increasing mismatch-induced errors. By increasing the strength of the central multipliers, the difference between weight voltages is reduced and, consequently, the overall robustness increases.
D. Digital Modules
The PE of ACE4k embeds conventional digital circuitry. This is not convenient because of the following.
• Level adapters are needed to transform the logic voltage levels, corresponding to full-scale swings, into levels compatible with the electrical operation of the PE analog circuitry.
• Protective measures must be taken to attenuate the impact of the large-power switching noise on the analog circuitry [18].
Ultimately, this means greater area and penalizes cell density. In ACE16k, the following measures have been taken to overcome these drawbacks.
• The four LLMs of ACE4k have been replaced by LAMs. On the one hand, this eliminates the digital switching noise introduced by the LLMs. On the other hand, the impact on the silicon area is not very large because the readout amplifier is shared with the other LAMs. Finally, voltage level adapters between the LLMs and multipliers are no longer needed. In addition, having eight instead of four LAM modules significantly increases the algorithmic capabilities of the chip.
• The LLU has been conceived to operate as an independent module which gets its inputs from the ACE-BUS and which also drives its output to the same ACE-BUS. This is a significant difference with respect to ACE4k, where the LLU was intrinsically tied to the LLMs since its inputs were always taken from two fixed LLMs. In addition, although the LLU works as an intrinsically logic device, its inputs and outputs are provided via the ACE-BUS and hence have analog voltage levels.
Fig. 8 shows the LLU in ACE16k. Its two inputs, OP0 and OP1, are acquired from the ACE-BUS by using instruction bits WOP0 and WOP1, while the result of the LLU operation is written to the ACE-BUS when the bit RLLU is activated. Logic inverters in the LLU (as well as any other inverter in the cell) are not conventional CMOS inverters but current-peak-limited inverters. They have been designed using an NMOS transistor connected to a PMOS resistive load, as depicted in Fig. 9. The resistive load is biased by a common biasing circuit shared by all the inverters in the cell. It sets the quiescent point of the inverter around the middle of the voltage range for pixels.
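The way the LLU exchanges operands and results with the ACE-BUS can be mimicked with a short behavioral sketch. Only the signal names WOP0, WOP1, and RLLU come from the text; the class name, the programmable truth table, the voltage levels, and the mid-range decision threshold are assumptions chosen for illustration.

```python
# Behavioral sketch only. ASSUMPTIONS: a programmable two-input truth table,
# a decision threshold at mid-range, and normalized voltage levels.

class LocalLogicUnit:
    def __init__(self, truth_table, v_low=0.0, v_high=1.0):
        # truth_table maps (op0, op1) booleans to a boolean result.
        self.truth_table = truth_table
        self.v_low = v_low
        self.v_high = v_high
        self.threshold = (v_low + v_high) / 2.0
        self.op0 = False
        self.op1 = False

    def step(self, bus_voltage, wop0=False, wop1=False, rllu=False):
        """Apply one instruction; return the voltage left on the bus."""
        if wop0:  # latch operand 0 from the bus
            self.op0 = bus_voltage > self.threshold
        if wop1:  # latch operand 1 from the bus
            self.op1 = bus_voltage > self.threshold
        if rllu:  # drive the logic result back onto the bus
            result = self.truth_table[(self.op0, self.op1)]
            return self.v_high if result else self.v_low
        return bus_voltage
```

A logic AND, for instance, would be described by a table that maps only `(True, True)` to `True`; the operands are then loaded with two `step` calls and the result is read back with a third.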
E. Multimode Optical Sensor
Light sensing in ACE4k is realized by a parasitic diffusion-to-substrate diode of the LAM access switches. Thus, sensitivity is rather low, and cross-talk among the LAM modules arises. ACE16k incorporates a multimode optical sensor which has been conceived to be flexible enough to operate under very different illumination conditions. Fig. 10 shows its conceptual schematic, including three main blocks.
• The first one, a tri-state readout buffer, controls the communications between the sensor and other blocks in the PU. Sensor accesses are controlled by the global programming signal ROPT.
• The second one is devoted to transforming the photo-generated charges into a voltage. The user has the possibility of selecting the photo-transduction mechanism by means of signals LOG1, LOG2, PCH.
• The third block includes the optical sensor itself and two configuration switches used to select one of the three available photosensors. The selection of the sensor is carried out by signals DW and WS.
The optical sensor can be configured to operate in three different linear integration modes [Fig. 11(a)-(c)] and three different logarithmic compression modes [Fig. 11(d)-(f)]. In the integration modes, the sensing procedure is always carried out in the same way. First, the devices not involved in the integration are turned off by setting their control signals accordingly. Afterwards, the precharge switch precharges the internal node to a user-definable voltage VPCH. Finally, this switch is turned off and the photogenerated current charges or discharges (depending on the selected photosensor) the pixel capacitor. Further details about the ACE16k sensor operation can be found elsewhere [13], [19].
Fig. 12 shows the global block diagram of the ACE16k PU. There, the different building blocks can be identified. Control and configuration signals from the programming memory are in bold. A detailed description can be found in [13].
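The linear integration procedure (precharge to VPCH, then let the photocurrent charge or discharge the pixel capacitor) follows the ideal capacitor relation V = VPCH ± I·t/C. The sketch below evaluates it under simplifying assumptions: a constant photocurrent, an ideal capacitor, and no clipping at the supply rails. The function name and all numerical values are hypothetical and are not taken from the chip.

```python
# Ideal integration model. ASSUMPTIONS: constant photocurrent, ideal
# capacitor, no saturation; all example values are hypothetical.

def integrate_pixel(v_pch, i_photo, c_pixel, t_exposure, discharging=True):
    """Pixel voltage after integrating i_photo for t_exposure seconds.

    v_pch      : precharge voltage in volts (VPCH in the text)
    i_photo    : photogenerated current in amperes
    c_pixel    : pixel capacitance in farads
    t_exposure : integration time in seconds
    discharging: True if the selected photosensor discharges the node
    """
    delta_v = i_photo * t_exposure / c_pixel
    return v_pch - delta_v if discharging else v_pch + delta_v
```

For instance, with a 2.0-V precharge, a 1-pA photocurrent, a 100-fF capacitor, and a 10-ms exposure, the node moves by about 0.1 V in either direction depending on the selected photosensor.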
F. Cell Layout and Metal Distribution
The layout of the PU in ACE16k differs from that in ACE4k in various points.
• Metal 1 and metal 2 are used for internal routing, instead of just metal 1. This helps to increase cell density.
• As already mentioned, the last metal layer, metal 5, is employed for power and ground distribution. Therefore, power and ground lines can be as wide as almost half the cell height. This increases the quality of these signals: better uniformity across the array, less noise, lower probability of error during fabrication, etc.
• The existence of the ACE-BUS allows for a more organized layout. Generally speaking, the more involved the schematic, the more difficult the layout and the lower the cell density. Hence, the ACE-BUS concept also contributes to increasing the cell density.
• A problem in the physical design of the ACE16k PU is related to the necessity to make a hole (in all the metal planes) to allow the light to reach the sensing area. This hole, located in the middle of the cell, just on top of the sensing area of the photosensor, reduces by three the number of available minimum width lines in each plane.
• As in ACE4k, digital instructions are sent to the cell by using a horizontal bus (metal 3) while weights are communicated by a vertical bus of metal 4 lines. Weights that are connected to double-strength synapses are communicated through double-width lines. 
IV. DISCUSSION
Applications of the ACE16k chip can be found in other papers of this special issue. Overall, these applications demonstrate that the chip is capable of operating with grayscale images at frame rates larger than 1000 FPS under room illumination conditions. This is a significant improvement over other ACEXX chips, which in turn have been shown to outperform other vision chips and architectures [13]. Further insight into the improvements yielded by ACE16k is provided by the data in Table III, where we have employed an equation which combines the number of operations (additions and products), the time constant of the process, and the number of time constants needed to keep settling errors below a given limit. In the particular case of linear convolutions, this equation, (8), involves the number of elements in the array (128 × 128 in our case), the number of additions (8 in 3 × 3 linear convolutions), the number of products (9 in 3 × 3 linear convolutions), the resolution for the settling error in an equivalent number of bits, and the time constant of the process in (4), about 135 ns for the largest allowed setting.
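The ingredients listed above can be combined into a rough throughput estimate. The sketch below is not (8) itself; it assumes a first-order exponential settling, for which the residual error after a time t is exp(-t/τ), so that reaching an error of 2^(-r) takes r·ln 2 time constants. The function names and the choice of 8 bits in the usage note are illustrative assumptions.

```python
import math

# Rough estimate, NOT the paper's equation (8). ASSUMPTION: first-order
# exponential settling, so an r-bit error bound needs r*ln(2) time constants.


def settling_time(tau, bits):
    """Time needed for the settling error to fall below 2**(-bits)."""
    return tau * bits * math.log(2)


def operations_per_second(n_cells, n_additions, n_products, tau, bits):
    """Array-wide operations completed per second for one convolution."""
    total_operations = n_cells * (n_additions + n_products)
    return total_operations / settling_time(tau, bits)
```

Calling `operations_per_second(128 * 128, 8, 9, 135e-9, 8)` uses the array size, operation counts, and time constant quoted in the text together with an assumed 8-bit settling resolution; the result should be read only as an order-of-magnitude figure under these assumptions.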
In summary, ACE chips, and specifically, the ACE16k prototype, are practical demonstration vehicles for the following statements.
• Sensory-processing concurrence is feasible with mixed-signal standard CMOS circuitry.
• Flexibility and programmability features can be incorporated by the smart synergy of analog and digital circuits.
• Robustness can be achieved through proper analog design, and the use of calibration and error-correction techniques.
• Standard interfacing is a must, and it can be incorporated through embedded A/D and D/A converters.
These chips demonstrate that flexible analog early vision can be implemented in practice, and represent the first step toward the development of VSoCs. However, significant design challenges still have to be confronted to make true VSoCs capable of handling 10 000 frames/s with moderate power consumption (below 1 W) and a large enough spatial resolution.
