Abstract A high-speed analog VLSI image acquisition and low-level image processing system is presented. The architecture of the chip is based on a dynamically reconfigurable SIMD processor array. The chip features a massively parallel architecture enabling the computation of programmable mask-based image processing in each pixel. Each pixel includes a photodiode, an amplifier, two storage capacitors, and an analog arithmetic unit based on a four-quadrant multiplier architecture. A 64 × 64 pixel proof-of-concept chip was fabricated in a 0.35 µm standard CMOS process, with a pixel size of 35 µm × 35 µm. The chip can capture raw images at up to 10,000 fps and runs low-level image processing at a framerate of 2,000-5,000 fps.
Introduction
Today, improvements in the growing digital imaging world continue to be made with two main image sensor technologies: charge-coupled devices (CCD) and CMOS sensors. The continuous advances in CMOS technology for processors and DRAMs have made CMOS sensor arrays a viable alternative to the popular CCD sensors. This led to the adoption of CMOS image sensors in several high-volume products, such as webcams, mobile phones, and PDAs. New technologies provide the potential for integrating a significant amount of VLSI electronics into a single chip, greatly reducing the cost, power consumption, and size of the camera [1] [2] [3] [4]. By exploiting these advantages, innovative CMOS sensors have been developed and have demonstrated fabrication cost reduction, low power consumption, and size reduction of the camera [5] [6] [7].
The main advantage of CMOS image sensors is the flexibility to integrate processing down to the pixel level. As CMOS image sensor technologies scale to 0.18 µm processes and below, processing units can be realized at chip level (system-on-chip approach), at column level by dedicating processing elements to one or more columns, or at pixel level by integrating a specific unit in each pixel or local group of neighboring pixels. Most research deals with chip-level and column-level processing [8] [9] [10] [11]. Indeed, pixel-level processing is generally dismissed because pixel sizes are often too large to be of practical use. However, as CMOS scales, integrating a processing element at each pixel or group of neighboring pixels becomes feasible. This offers the opportunity to improve image quality, for example in terms of resolution and noise, by integrating specific processing functions such as correlated double sampling [12], anti-blooming [13], high dynamic range [14], and even all basic camera functions (color processing functions, color correction, white balance adjustment, gamma correction) onto the same camera-on-chip [15]. Furthermore, employing a processing element per pixel offers the ability to exploit the high-speed imaging capabilities of the CMOS technology by achieving massively parallel computations [16] [17] [18] [19] [20] [21] [22]. Komuro et al. [16] describe a new vision chip architecture for high-speed target tracking based on hardware implementation of bit-serial and cumulative summation circuits. Rodriguez-Vasquez et al. [17, 18] path in a fully-parallel manner. Lindgren et al. [19] present a multiresolution general-purpose high-speed machine vision sensor with on-chip image processing capabilities dedicated to high-speed multisense imaging. Sugiyama et al. [20] have developed a specific imager performing both target tracking within a 512 × 512-pixel entire image area and acquisition of partial images simultaneously and independently. Dudek and Hicks [21] describe a smart-sensor VLSI circuit suitable for focal-plane low-level image processing applications, which is characterized by small cell area, low power dissipation and the ability to execute a variety of image processing algorithms in real time. Miao et al. [22] present a programmable vision chip for real-time vision applications based on a pixel processing element array and row-parallel processors, able to implement mathematical morphology algorithms such as erosion and dilation.
In this paper, we discuss hardware implementation issues of a high-speed CMOS imaging system with per-pixel image processing. Embedding low-level tasks at the focal plane is attractive in several respects. First, the key feature is the capability to operate in accordance with the principles of single instruction multiple data (SIMD) computing architectures [17]. This enables massively parallel computations with processing times independent of the resolution of the sensor. This leads to high framerates, up to thousands of images per second, with rather low power consumption [23] [24] [25]. Secondly, embedding hardware processing operators along with the sensor array removes the classical input/output bottleneck between the sensor and the external processors in charge of processing the pixel values. This can benefit the implementation of new complex applications at standard rates and can also improve the performance of existing video applications such as motion vector estimation [26, 27], multiple capture with dynamic range [28, 29], and pattern recognition [30].
To sum up, we designed, fabricated, and tested a proof-of-concept 64 × 64 pixel CMOS analog sensor with a per-pixel programmable processing element in a standard 0.35 µm double-poly quadruple-metal CMOS technology. The analog processing operators are fully programmable by dynamic reconfiguration. They can be viewed as a software-programmable image processor dedicated to low-level image processing. The main objectives of our design are: (1) to evaluate the potential for high-speed snapshot imaging and, in particular, to reach a 10,000 fps rate, (2) to demonstrate a versatile and reconfigurable processing unit at pixel level, and (3) to provide an original platform for experimenting with low-level image processing algorithms that exploit high-speed imaging.
The rest of the paper is organized as follows. The main characteristics of the sensor architecture are described in Sect. 2. Section 3 covers the design of the circuit, with a full description of the photodiode structure, the embedded analog memories, and the arithmetic unit. In Sect. 4, we describe the test hardware platform and the chip characterization results. Finally, some experimental results of high-speed image acquisition with pixel-level processing are described in the last section of this paper.
This paper is an extended and complementary version of a preliminary paper [31] dedicated to fundamental theoretical aspects and specificities of our image sensor. In this new paper, the focus is on image processing and on the development and implementation of various low-level image processing applications on the chip.
Description of the architecture
The proof-of-concept chip presented in this paper is depicted in Fig. 1. The core includes a two-dimensional array of 64 × 64 identical processing elements (PE). Each PE follows the SIMD computing paradigm and is able to convolve the pixel value issued from the photodiode by applying a set of mask coefficients to the image pixel values located in a small neighborhood. The key idea is that a global control unit can dynamically reconfigure the convolution kernel masks and thereby implement most low-level image processing algorithms [17, 18]. This confers the functionality of programmable processing devices to the PEs embedded in the circuit. Each individual PE includes the following elements:
- a photodiode dedicated to the optical acquisition of the visual information and the light-to-voltage transduction,
- two analog memory, amplifier and multiplexer structures called [AM]², which serve as intelligent pixel memories and are able to dissociate the acquisition of the current frame in the first memory and the processing of the previous frames in the second memory,
- an analog arithmetic unit named A²U based on four analog multipliers, which performs the linear combination of the four adjacent pixels using a 2 × 2 convolution kernel.
In brief, each PE includes 38 transistors integrating all the analog circuitry dedicated to the image processing algorithms. The global size of the PE is 35 µm × 35 µm (1,225 µm²). The active area of the photodiode is 300 µm², giving a fill factor of 25%. The chip has been realized in a standard 0.35 µm double-poly quadruple-metal CMOS technology and contains about 160,000 transistors on a 3.67 mm × 3.77 mm die (13.83 mm²). The chip also contains test structures on the bottom left of the chip. These structures are used for detailed characterization of the photodiodes and processing units.
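The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the numbers given in the text; the constant and function names are illustrative and not part of the design.

```python
# Consistency check of the geometry quoted in the text (names are illustrative).
PIXEL_PITCH_UM = 35.0         # side of one processing element, from the text
PHOTODIODE_AREA_UM2 = 300.0   # active photodiode area, from the text
ARRAY_SIDE = 64               # 64 x 64 processing elements
TRANSISTORS_PER_PE = 38       # from the text
DIE_W_MM, DIE_H_MM = 3.67, 3.77


def pe_area_um2() -> float:
    """Area of one processing element in square micrometres."""
    return PIXEL_PITCH_UM ** 2


def fill_factor() -> float:
    """Ratio of the photodiode active area to the PE area."""
    return PHOTODIODE_AREA_UM2 / pe_area_um2()


def array_transistors() -> int:
    """Transistors contributed by the PE array alone."""
    return ARRAY_SIDE * ARRAY_SIDE * TRANSISTORS_PER_PE


def die_area_mm2() -> float:
    """Die area in square millimetres."""
    return DIE_W_MM * DIE_H_MM


if __name__ == "__main__":
    print(f"PE area: {pe_area_um2():.0f} um^2")
    print(f"Fill factor: {fill_factor():.1%}")
    print(f"Transistors in the array: {array_transistors()}")
    print(f"Die area: {die_area_mm2():.2f} mm^2")
```

The array alone accounts for 155,648 transistors, which is consistent with the "about 160,000" total once peripheral circuitry and test structures are included.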
Circuit design

Pixel structure
Each pixel in the CMOS image sensor array includes a photodiode and a processing unit dedicated to low-level image processing based on neighborhoods. In our chip, the photodiodes are of one of the simplest types of photo element in CMOS image sensor technology, i.e. N-type photodiodes based on an n⁺-type diffusion in a p-type silicon substrate. In order to achieve good performance, the photodiodes have been designed carefully to optimize critical parameters such as the dark current and the spectral response [32]. Moreover, the shape and the layout of the photodiode have a significant influence on the performance of the whole imager [33, 34]. The active area of the photodiode absorbs the illumination energy and turns that energy into charge carriers. This active area must be as large as possible in order to absorb a maximum of photons. At the same time, the control circuitry required for the readout of the collected charges and the inter-element isolation area must be as small as possible in order to obtain the best fill factor. We have theoretically analyzed, designed and benchmarked different photodiode shapes [31], and finally, an octagonal shape based on 45° structures was chosen (see Fig. 1).
The second part of the pixel is the analog processing unit, dedicated to the implementation of various in situ image processing operations using local neighborhoods. This forces a rethinking of the spatial distribution of the processing resources, so that each computational unit can easily use a programmable neighborhood of pixels. For this purpose, the pixels are mirrored about the horizontal and the vertical axes in order to share the different analog arithmetic units (A²U). As an example, a block of 2 × 2 pixels is depicted in Fig. 1. Such a distribution optimizes the compactness of the metal interconnections between pixels, giving a better fill factor.
Analog memory, amplifier and multiplexer [AM]²
In order to increase the algorithmic possibilities of the architecture, one potential solution is the separation of the acquisition of the light inside the photodiode and the readout of the stored value at pixel-level [35] . One of the main advantages of such structures is that the capture sequence can be made in the first memory in parallel with a readout sequence and/or processing sequence of the previous image stored in the second memory, as shown in Fig. 2 . Such a strategy has several advantages:
1. The framerate can be increased (up to 2×) without reducing the exposure time,
2. The image acquisition is time-decorrelated from image processing, implying that the architecture performance is always the highest, and the processing framerate is maximum,
3. A new image is always available without spending any integration time.
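The "up to 2×" figure in the first item can be illustrated with a simplified timing model. The sketch below assumes that, with a single memory, integration and readout/processing strictly alternate, whereas with two memories they fully overlap; reset and transfer overheads are neglected, and the function names are hypothetical.

```python
def frame_period_single(t_int: float, t_read: float) -> float:
    """One memory: integration and readout/processing must alternate."""
    return t_int + t_read


def frame_period_double(t_int: float, t_read: float) -> float:
    """Two memories: the next integration overlaps the readout of the previous frame."""
    return max(t_int, t_read)


def speedup(t_int: float, t_read: float) -> float:
    """Framerate gain of the two-memory scheme over the single-memory scheme."""
    return frame_period_single(t_int, t_read) / frame_period_double(t_int, t_read)


if __name__ == "__main__":
    # Balanced case: integration and readout take the same time.
    print(speedup(100e-6, 100e-6))
    # Unbalanced case: readout is twice as fast as integration.
    print(speedup(100e-6, 50e-6))
```

Under these assumptions the gain reaches its maximum of 2 when integration and readout take the same time, and falls toward 1 as one of the two phases dominates.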
So, for each pixel, we have designed and implemented two specific circuits called analog memory, amplifier, and multiplexer [AM]², as shown in Fig. 3. The system has four successive operation modes: acquisition, storage, amplification, and readout. All these phases are externally controlled by global signals common to the full array of pixels. In each pixel, the photosensor is associated with a PMOS reset transistor. This switch resets the pixel to the fixed voltage V_dd. The pixel array is held in the reset mode until the init signal rises, turning the PMOS transistor off. One of the two [AM]² is selected when the corresponding st_i signal is turned on. The associated analog switch then allows the corresponding capacitor C_i to charge to a voltage level reflecting the integrated photocurrent. Consequently, the capacitors are able to store the pixel value during the frame capture from one of the two switches. The capacitors are implemented with double polysilicon. They are made as large as the fill-factor and pixel-size requirements allow, with values of about 40 fF. They are able to store the pixel value for 20 ms with an error lower than 4%. Behind the storage subcircuit, a basic CMOS inverter is integrated. This inverter serves as a linear high-gain amplifier around V_dd/2 with a gain of 12. Finally, the last phase consists in the readout of the values stored in the capacitors C_i, through one of the two switches, controlled by the r_i signals.
Analog arithmetic unit (A²U)
Our analog arithmetic unit (A²U) is able to perform convolution of the pixels with a 2 × 2 dynamic kernel. This unit is based on four-quadrant analog multipliers [36, 37] named M1, M2, M3, and M4, as illustrated in Fig. 4. Each A²U includes only 22 transistors, leading to a relatively small area and a simple design. Each multiplier M_i (with i = 1, ..., 4) takes two analog signals V_i1 and V_i2 and produces an output V_iS which is their product. The outputs of the multipliers are all interconnected with a diode-connected transistor employed as load. Consequently, the global operation result at the V_S point is a linear combination of the four products V_iS. Image processing operations such as spatial convolution can be easily performed by connecting the inputs V_i1 to the kernel coefficients and the inputs V_i2 to the corresponding pixel values.
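The behavior described above can be sketched with an idealized software model. The code below assumes perfect multipliers and a unit-gain summation, ignoring the offsets, gain factor and non-linearity of the real analog circuit; `a2u` and `apply_mask_2x2` are hypothetical names introduced only for illustration.

```python
from typing import List, Sequence

Image = List[List[float]]


def a2u(coefficients: Sequence[float], pixels: Sequence[float]) -> float:
    """Idealized A2U: sum of the four products (coefficient x pixel value)."""
    return sum(k * p for k, p in zip(coefficients, pixels))


def apply_mask_2x2(image: Image, mask: Sequence[Sequence[float]]) -> Image:
    """Apply the same 2 x 2 mask at every 2 x 2 neighborhood (SIMD style).

    The result has one row and one column fewer than the input, because
    border neighborhoods are not modeled here.
    """
    coefficients = (mask[0][0], mask[0][1], mask[1][0], mask[1][1])
    rows, cols = len(image), len(image[0])
    return [
        [
            a2u(coefficients,
                (image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1]))
            for c in range(cols - 1)
        ]
        for r in range(rows - 1)
    ]


if __name__ == "__main__":
    frame = [[1.0, 2.0], [3.0, 4.0]]
    diagonal_difference = ((1.0, 0.0), (0.0, -1.0))
    print(apply_mask_2x2(frame, diagonal_difference))
```

Reconfiguring the sensor then amounts, in this model, to calling `apply_mask_2x2` again with a different mask on the same stored frame.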
Considering the MOS transistors operating in subthreshold region, the output node V_iS of a multiplier can be expressed as a function of the two inputs V_i1 and V_i2 as follows:
where k_r represents the transconductance factor, and V_ThN and V_ThP are the threshold voltages of the NMOS and PMOS transistors. Around the operating point (V_dd/2), the variations of the output node mainly depend on the product V_i1 V_i2. So, Eq. 1 can be simplified and, finally, the output node V_iS can be expressed as a simple first-order function of the two input voltages V_i1 and V_i2.
The value of the coefficient M gives predominant weight to the term V_i1 V_i2 in Eq. 1, limiting the impact of second-order products. Consequently, the output V_iS mainly depends on the input values V_i1 and V_i2 around the operating point V_dd/2. This leads to a good linearity of our multiplier design, which integrates only five transistors.
Chip characterization
An experimental 64 × 64 pixel image sensor has been developed in a 0.35 µm, 3.3 V, standard CMOS process with poly-poly capacitors. Its functional testing and its characterization were performed using a specific hardware platform. The hardware part of the imaging system contains a one-million-gate Spartan-3 FPGA board with 32 MB of embedded SDRAM. This FPGA board is the XSA-3S1000 from XESS Corporation. An interface acquisition circuit includes three ADCs from Analog Devices (AD9048), high-speed LM6171 amplifiers and other elements such as the motor lens. Figure 5 shows the schematic and some pictures of the experimental platform.
Electrical characterization
The sensor was quantitatively tested for conversion gain, sensitivity, fixed pattern noise, thermal reset noise, output levels disparities, voltage gain of the amplifier stage, linear flux, and dynamic range. Table 1 summarizes these imaging sensor characterization results.
To determine these values, the sensor includes specific test pixels in which some internal node voltages can be directly read. The test equipment hardware is based on a light generator with a wavelength range of 400-1100 nm. The sensor conversion gain was evaluated at 54 µV/e⁻ RMS with a sensitivity of 0.15 V/(lux·s), thanks to the octagonal shape of the photodiode and the fill factor of 25%. At 10,000 fps, the measured non-linearity is 0.12% over a 2 V range. These performances are similar to those of the sensor described in [25]. According to the experimental results, the voltage gain of the amplifier stage of the two [AM]² is A_v = 12 and the disparities on the output levels are about 4.3%.
Fixed pattern noise
Image sensors always suffer from technology-related non-idealities that can limit the performance of the vision system. Among them, fixed pattern noise (FPN) is the variation in output pixel values, under uniform illumination, due to device and interconnect mismatches across the image sensor. FPN can be reduced by implementing correlated double sampling (CDS). To implement CDS, each pixel output needs to be read twice, after reset and at the end of the integration time. The correct pixel signal is obtained by subtracting the two values. CDS can be easily implemented in our chip. For this purpose, the first analog memory stores the pixel value just after the reset signal and the second memory stores the value at the end of integration. Then, at the end of the image acquisition, the two values can be transferred to the FPGA, which is responsible for producing the difference. In Fig. 6, the two images show fixed pattern noise with and without CDS using a 1-ms integration time. In the left image, the FPN (225 µV RMS) is mainly due to the random variations in the offset voltages of the pixel-level analog structures. In the right image, the FPN has been reduced by a factor of 34, to 6.6 µV, after an analog CDS performed as described above.
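The subtraction performed on the FPGA side can be sketched as follows. This is an idealized illustration, assuming a purely additive per-pixel offset that is identical in both samples; the 2 × 2 frame, the offset values and the 0.5 V discharge are made-up numbers, and the sign convention (reset sample minus end-of-integration sample) is an assumption.

```python
from statistics import pstdev
from typing import List

Frame = List[List[float]]


def cds(reset_frame: Frame, signal_frame: Frame) -> Frame:
    """Per-pixel difference between the post-reset and end-of-integration samples."""
    return [
        [r - s for r, s in zip(reset_row, signal_row)]
        for reset_row, signal_row in zip(reset_frame, signal_frame)
    ]


def fpn_rms(frame: Frame) -> float:
    """Spread of the pixel values around their mean under uniform illumination."""
    return pstdev([value for row in frame for value in row])


# Hypothetical 2 x 2 example: uniform discharge plus a fixed offset per pixel.
V_RESET = 3.3                                # reset level (V_dd of the 3.3 V process)
DISCHARGE = 0.5                              # assumed uniform signal, in volts
OFFSETS = [[0.010, -0.020], [0.000, 0.030]]  # assumed per-pixel offsets, in volts

reset_frame = [[V_RESET + o for o in row] for row in OFFSETS]
signal_frame = [[V_RESET - DISCHARGE + o for o in row] for row in OFFSETS]
corrected = cds(reset_frame, signal_frame)

if __name__ == "__main__":
    print(f"FPN without CDS: {fpn_rms(signal_frame) * 1e3:.2f} mV RMS")
    print(f"FPN with CDS:    {fpn_rms(corrected) * 1e3:.2f} mV RMS")
```

In this idealized model the offsets cancel completely; on the real chip the residual reported in the text (6.6 µV from 225 µV, i.e. a factor of about 34) reflects the part of the noise that is not common to both samples.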
High-speed image processing applications
Sample images
The prototype chip was used for acquisition of raw images. First, sample raw images of stationary scenes were captured at different framerates, as shown in Fig. 7. In the three views, no image processing is performed on the video stream, except for amplification of the photodiode signals. From left to right, we can see a human face obtained at 1,000 fps, a static electric fan at 5,000 fps, and an electronic chip at 10,000 fps. With the 64 × 64-pixel sensor, the high-framerate capability is exploited with a simple sequencer in charge of the transfer of analog pixel values to external ADCs. For bigger sensors, it could be achieved with the integration of a dedicated output module able to cope with a gigapixel-per-second bandwidth. Another possible solution is to assemble 64 × 64-pixel modules with a dedicated output bus for each of them. Figure 8 represents different frames of a moving object, namely, a milk drop splashing sequence. In order to capture the details of such a rapidly moving scene, the sensor operates at 2,500 fps and stores a sequence of 50 images. Frames 1, 5, 10, 15, 20, 25, 30 and 40 are shown in the figure.
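The output bandwidth discussed above follows directly from resolution and framerate. The short sketch below computes the raw pixel rate for the 64 × 64 prototype and, as an illustration only, for a CIF-sized array (a size mentioned in the perspectives of this paper); the function names are hypothetical.

```python
def pixel_rate(width: int, height: int, fps: int) -> int:
    """Raw pixel values to transfer per second for a full-frame readout."""
    return width * height * fps


def sequence_duration_s(n_frames: int, fps: int) -> float:
    """Real-time duration covered by a stored sequence of frames."""
    return n_frames / fps


if __name__ == "__main__":
    print(f"64 x 64 at 10,000 fps:   {pixel_rate(64, 64, 10_000) / 1e6:.2f} Mpixel/s")
    print(f"352 x 288 at 10,000 fps: {pixel_rate(352, 288, 10_000) / 1e9:.2f} Gpixel/s")
    print(f"50 frames at 2,500 fps:  {sequence_duration_s(50, 2_500) * 1e3:.0f} ms")
```

The prototype therefore needs about 41 Mpixel/s at its maximum rate, while a CIF array at the same framerate would already reach the gigapixel-per-second range.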
Sobel operator
The Sobel operator estimates the gradient of a 2D image. It is used for edge detection in the preprocessing stage of computer vision systems. The classical algorithm is based on a pair of 3 × 3 convolution kernels (see Eq. 3), one to detect changes along the vertical axis (h_1) and another to detect horizontal contrast (h_2). For this purpose, the algorithm slides the convolution mask over the image and convolves it with the image. It manipulates 9 pixels for each value to produce. The value corresponds to an approximation of the gradient centered on the processed image area.
Fig. 7 Various raw images acquisition at 1,000, 5,000 and 10,000 fps
The structure of our architecture is well-adapted to the evaluation of the Sobel algorithm. It leads to the result directly centered on the photo-sensor and directed along the natural axes of the image. The gradient is computed in each pixel of the image by performing successive linear combinations of the four adjacent pixels. For this purpose, each 3 × 3 kernel mask is decomposed into two 2 × 2 masks that successively operate on the whole image. For the kernel h_1, the corresponding 2 × 2 masks are:
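As a sketch, assuming the standard Sobel kernels and a row-major numbering of the photodiodes ph_1 to ph_9 (both are assumptions), one pair of 2 × 2 masks consistent with the relations V_12 = -V_21 and V_15 = -V_24 used later in this section is:

```latex
h_1 = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix},
\qquad
h_2 = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix},
\qquad
m_1 = \begin{pmatrix} -1 & 0 \\ -1 & 0 \end{pmatrix},
\qquad
m_2 = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}
```

With this choice, m_1 produces the negated sum of the left column of each 2 × 2 neighborhood and m_2 the sum of its right column, so that combining the results of neighboring processing elements yields the weights of h_1.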
Fig. 9 represents the 3 × 3 mask centered on the pixel ph_5. Each octagonal photodiode ph_i (i = 1, ..., 9) is associated with a processing element PE_i, represented by a circle in the figure. Each PE_i is positioned on the bottom right of its photodiode, as in the real layout of the circuit (see Fig. 1). The first mask m_1 contributes to evaluate the following series of operations for the four PE_i:
Fig. 8 A 2,500 fps video sequence of a milk drop splashing
Analog Integr Circ Sig Process (2010) 65:389-398
and the second mask m_2 computes:
where V_ij is the result provided by the processing element PE_j (j = 1, 2, ..., 9) with the mask m_i (i = 1, 2), and V_phk (k = 1, 2, ..., 9) are the voltages representing the incident illumination on each photodiode ph_k. Then, the gradient at the center of the mask can be computed by summing the different values on the external FPGA. Note that V_12 = -V_21 and V_15 = -V_24. So, the final sum can be simplified and written as V_h1 = V_11 + V_22 + V_25 + V_14. If we define a retina cycle as the time spent on the configuration of the kernel coefficients and the preprocessing of the image, the evaluation of the gradient along the vertical direction takes only one frame acquisition and two retina cycles. By generalization, the estimation of the complete gradient along the two axes takes four cycles because it involves four dynamic configurations.
In short, the dynamic assignment of coefficient values from the external processor gives the system some interesting dynamic properties. The system can be easily reconfigured by changing the internal coefficients of the masks between two successive computations. First, this makes it possible to dynamically change the image processing algorithms embedded in the sensor. Secondly, this enables the evaluation of some complex pixel-level algorithms involving several successive convolutions. The images can be captured at higher framerates than the standard framerate, processed by exploiting the analog memories and the reconfigurable processing elements, and output at a lower framerate that depends on the number of dynamic reconfigurations. Moreover, the analog arithmetic units implementing these pixel-level convolutions drastically decrease the number of single operations such as additions and multiplications executed by an external processor (an FPGA in our case) as shown in Fig. 5.
Indeed, in the case of our experimental 64 × 64 pixel sensor, the peak performance is equivalent to four parallel signed multiplications per pixel at 10,000 fps, i.e. more than 160 million multiplications per second. With a VGA resolution (640 × 480), the performance level would increase by a factor of 75, leading to about 12 billion multiplications per second. Processing this data flow with external processors would require substantial hardware resources in order to cope with the temporal constraints.
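These throughput figures can be reproduced with a short calculation. The sketch below simply multiplies the quantities quoted in the text; the function name is illustrative.

```python
def multiplications_per_second(width: int, height: int, fps: int,
                               multiplications_per_pixel: int = 4) -> int:
    """Peak number of analog multiplications performed by the array each second."""
    return width * height * fps * multiplications_per_pixel


if __name__ == "__main__":
    prototype = multiplications_per_second(64, 64, 10_000)
    vga = multiplications_per_second(640, 480, 10_000)
    print(f"64 x 64:   {prototype / 1e6:.2f} million multiplications per second")
    print(f"640 x 480: {vga / 1e9:.3f} billion multiplications per second")
    print(f"Scaling factor: {vga / prototype:.0f}")
```

Since every pixel computes in parallel, the frame time does not change with resolution in this model; only the aggregate operation count grows, in proportion to the number of pixels.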
As an illustration of the Sobel algorithm, Fig. 10 shows an example sequence of 16 images of a moving object, namely, an electric fan. Two specific white markers are placed on the fan, i.e. a small circle near the rotor and a painted blade. The rotation speed of the fan is 3,750 rpm. In order to capture such a rapidly moving object, a short integration time (100 µs) was used for the frame acquisition. The Sobel algorithm makes it possible to clearly distinguish the two white markers even at a high framerate.
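The two-pass evaluation of the Sobel gradient along one axis can be checked numerically. The sketch below assumes the standard kernel h_1, the masks m_1 = [[-1, 0], [-1, 0]] and m_2 = [[0, 1], [0, 1]], and a row-major numbering of the photodiodes ph_1 to ph_9 with PE_j at the bottom right of ph_j; these masks are assumptions chosen to match the relations V_12 = -V_21, V_15 = -V_24 and V_h1 = V_11 + V_22 + V_25 + V_14 stated in the text.

```python
H1 = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))  # assumed standard Sobel kernel h_1
M1 = ((-1, 0), (-1, 0))                    # assumed first 2 x 2 mask m_1
M2 = ((0, 1), (0, 1))                      # assumed second 2 x 2 mask m_2


def pe_output(patch, mask, row, col):
    """Result of the PE whose 2 x 2 neighborhood has its top-left pixel at (row, col)."""
    return sum(mask[i][j] * patch[row + i][col + j]
               for i in range(2) for j in range(2))


def sobel_h1_two_pass(patch):
    """V_h1 = V_11 + V_22 + V_25 + V_14 on a 3 x 3 patch ph_1..ph_9 (row-major)."""
    v11 = pe_output(patch, M1, 0, 0)  # PE_1 with mask m_1
    v14 = pe_output(patch, M1, 1, 0)  # PE_4 with mask m_1
    v22 = pe_output(patch, M2, 0, 1)  # PE_2 with mask m_2
    v25 = pe_output(patch, M2, 1, 1)  # PE_5 with mask m_2
    return v11 + v22 + v25 + v14


def sobel_h1_direct(patch):
    """Reference result: weighted sum of the 3 x 3 patch with the full kernel h_1."""
    return sum(H1[i][j] * patch[i][j] for i in range(3) for j in range(3))


if __name__ == "__main__":
    ramp = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(sobel_h1_two_pass(ramp), sobel_h1_direct(ramp))
```

Under these assumptions, the four PE outputs collected over two retina cycles give the same value as the full 3 × 3 kernel, and the outputs dropped from the sum cancel in pairs, as the text notes.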
Conclusion and perspectives
An experimental pixel sensor implemented in a standard digital 0.35 µm CMOS process has been described in this paper. The architecture of the chip is based on a dynamically reconfigurable SIMD processor array, featuring a massively parallel architecture dedicated to programmable low-level image processing. Each 35 µm × 35 µm pixel contains 38 transistors implementing a circuit with photocurrent integration, two [AM]² and an A²U. A 64 × 64 pixel proof-of-concept chip was fabricated. Experimental results reveal that raw image acquisition at 10,000 fps can be easily achieved using the parallel A²U implemented at pixel level. With basic image processing, the maximal framerate slows down to about 5,000 fps. The potential for dynamic reconfiguration of the sensor was also demonstrated in the case of the Sobel operator.
The next step in our research will be the design of a similar circuit in a modern 130 nm CMOS technology with a pixel size of less than 10 µm × 10 µm. In order to evaluate this future chip in realistic conditions, we would like to design a CIF sensor (352 × 288 pixels), which leads to a 3.2 mm × 2.4 mm sensor area in a 130 nm technology. At the same time, we will focus on the development of a fast analog-to-digital converter (ADC). The integration of this ADC on future chips will allow us to provide new and sophisticated vision systems on chip (ViSOC) dedicated to digital embedded image processing at thousands of frames per second.
