Abstract-This paper presents a proof-of-concept chip for Gaussian pyramid generation for image feature detectors. Gaussian filtering and image resizing are performed with a switched-capacitor (SC) network. The chip is conceived as the mapping of a CMOS-3D architecture for feature detectors onto a conventional technology. The mapping removes some functionality and incurs an area overhead with respect to a CMOS-3D architecture, but it preserves massively parallel Correlated Double Sampling (CDS) and A/D conversion. The chip has been fabricated on a 5×5 mm² die in 0.18 µm CMOS technology, with an array of 176×120 sensing elements (pixels). The pixels are grouped into Processing Elements (PEs). Every PE comprises four photodiodes, four SC nodes, one CDS circuit, and local circuitry for one ADC, and occupies an area of 44×44 µm². The chip senses an image and computes the Gaussian pyramid with an average power consumption lower than 75 nW/pixel at 30 frames/s.
I. INTRODUCTION
Advances in computational power over recent years allow tasks such as traffic or citizen control, surveillance, robot guidance or augmented reality to be performed in reasonable computing time. Feature detectors have become commonplace in these applications. The Scale-Invariant Feature Transform (SIFT) [1] is a state-of-the-art feature detector. Like any other feature detector, SIFT comprises low- and intermediate-level processing stages. The first stage is the generation of a Gaussian pyramid or scale-space, defined as a set of successively Gaussian-filtered versions of the incoming image. The incoming image is filtered with increasing widths (σ in the Gaussian filter), also called scales, to provide one set (octave) of S filtered images. This process is repeated O times to obtain O octaves with S scales each. Each new octave starts from a half-sized version of the previous octave.
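As an illustration of the pyramid structure described above, the following minimal sketch lists the resolution of every image in a pyramid of O octaves with S scales each. The function name and the values O = 3 and S = 4 are assumptions chosen for the example; only the 176×120 input size and the halving between octaves come from the text.

```python
def pyramid_shapes(width, height, octaves, scales):
    """List (octave, scale, width, height) for every image in the pyramid."""
    shapes = []
    w, h = width, height
    for o in range(octaves):
        for s in range(scales):
            # All S scales of one octave share the same resolution.
            shapes.append((o, s, w, h))
        # The next octave starts from a half-sized version of this one.
        w, h = w // 2, h // 2
    return shapes


# Hypothetical configuration: 3 octaves of 4 scales on a 176x120 image.
shapes = pyramid_shapes(176, 120, octaves=3, scales=4)
print(len(shapes))   # number of images in the pyramid (O * S)
print(shapes[-1])    # last scale of the last octave
```

The sketch only tracks image sizes; the filtering itself is left out on purpose, since the widths used per scale depend on the detector configuration.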
Low-level image tasks, such as the convolution-type operations behind Gaussian pyramids, are well suited to a Single Instruction Multiple Data (SIMD) architecture with a Processing Element (PE) per pixel. Gaussian filtering is naturally performed by either a resistor-capacitor (RC) grid or a switched-capacitor (SC) network [2], [3]. These solutions outperform other paradigms such as those based on cellular non-linear networks [4].
Intermediate-level stages work on a reduced set of pixels. In SIFT, this stage works on the extrema obtained from the Gaussian pyramid. The extrema amount to 1% of the pixels in the image [1]. In this case another kind of parallelism arises, and the digital domain emerges as a better solution for more complex functions such as feature description.
This paper presents a proof-of-concept chip conceived as the mapping of a CMOS-3D architecture for feature detectors onto a conventional CMOS technology. The result is a chip with some functionality removed and with an area overhead with respect to a CMOS-3D architecture. The chip, however, preserves the massive parallelism for CDS and A/D conversion, with an assignment of 4 sensing elements to 4 nodes of an SC network for the Gaussian pyramid, one CDS stage, and the local circuitry for A/D conversion.
II. VISION CHIPS FOR FEATURE DETECTORS ON CMOS-3D ARCHITECTURES
In the realm of feature detectors, CMOS-3D technology emerges as a possible way to embed the whole algorithm onto a single die, thanks to the distribution of functionality across different tiers interconnected with Through-Silicon Vias (TSVs) [5]. The spread of functionality shown in Fig. 1(a) might allow functions to be optimized across different image processing levels, or within a given level, without degrading fill factor or footprint. For instance, if several tiers were available, one tier could be reserved for sensing in order to enhance characteristics such as dynamic range or spectral response. A second tier could hold PEs that include the Gaussian pyramid, CDS or in-pixel ADC, a third tier could store a digital image in a frame buffer with circuitry for extrema detection, and even a fourth tier for higher-level processing could be incorporated. Such a CMOS-3D architecture would provide more parallelism than a conventional CMOS imager, which usually relies on per-column CDS and ADC circuits.
III. VISION CHIP FOR FEATURE DETECTORS ON CONVENTIONAL CMOS TECHNOLOGY
The chip presented in this paper contains an array of 176 × 120 pixels in a conventional 2D UMC 0.18 µm CMOS technology. The layout of the chip is displayed in Fig. 1(b).
The chip is manufactured on a 5×5 mm² die. Every PE in the array comprises 4 photodiodes, 4 nodes of an SC network for Gaussian filtering, one CDS circuit and one comparator, the latter being part of an 8-bit single-slope ADC. Furthermore, the area constraint forces circuitry to be reused across Gaussian filtering, CDS and A/D conversion. The registers that complete the ADCs are laid out outside the array, and labeled as 1/2 frame buffer in Fig. 1(b). The analog ramp for the ADCs and additional biasing circuits are placed outside the array too. This floorplan resembles an approach with two tiers in a CMOS-3D technology, with the upper tier for sensing and Gaussian filtering, and the bottom tier for the frame buffer of the ADC, both connected with TSVs. The key difference between the chip presented in this paper and a chip on a CMOS-3D stack lies in the routing overhead between the PE array and the frame buffer: in CMOS-3D technology, one TSV per PE would connect the comparator to the frame buffer to complete the ADC, lowering the fan-out.
A. Processing Element (PE)
The PE architecture comprises four main blocks: i) four 3T-APS (3-transistor active pixel sensor) structures, ii) four capacitors C_pi with an inverter as gain stage, which work as memories and perform the CDS, iii) a switched-diffusion network for Gaussian filtering that communicates with the neighbors along the four cardinal directions, and iv) a capacitor C for CDS that is reused together with a comparator for A/D conversion. Fig. 2 shows the PE circuit, while the sizes of transistors and photosensors are listed in Table I. Every PE occupies an area of 44×44 µm². The main functions performed by the PE are: i) image acquisition and CDS, ii) Gaussian filtering, and iii) A/D conversion.
1) Image acquisition and CDS:
Image acquisition in a PE is done by 4 photodiodes with 4 source followers and 4 select transistors, which share a single current source biased at 1 µA (Fig. 2). The sensing element is an n-well diode, chosen to enhance the spectral response at longer wavelengths. The source follower is designed to achieve the largest operating range through the use of low-threshold-voltage transistors. It provides a gain spread of less than 0.4% over an operating range of 1 V, with an average power consumption of 2.5 nW per PE at 30 frames/s (13.2 µW for the whole array).
The inverter has been designed with a double-cascode configuration in order to achieve a high nominal gain of 65 dB, required for low linearity errors. The schematic of the inverter with its transistor sizes in microns is displayed in Fig. 3(b). The bias voltages for the cascode inverter are vbp = 1.2 V, vcp = 0.95 V and vcn = 0.65 V, providing a bias current I = 1 µA. The additional transistors enable and enable_n cut the static power consumption down to zero during standby periods (leakage currents neglected). As seen below, the gain stages in the comparator for A/D conversion use the same design. The double-cascode inverter has an average power consumption of 15.4 nW at 30 frames/s, because it is switched off during 85% of the computing time.
2) Gaussian Filtering: Gaussian filtering and octave generation are implemented by a switched-capacitor network [3]. The switched-capacitor network minimizes the non-linearity of a conventional RC network implemented with MOS transistors [2], [3]. It also permits a more accurate control of the Gaussian width σ through the number of switching cycles. Eq. (2) represents the behavior of one node of the network in a given diffusion cycle n. An extra capacitor C_pi = 130 fF (see Fig. 2), implemented as a MOS capacitor, has been added in parallel with the MIM capacitors in order to reduce switching and leakage errors, so that the capacitance associated with an SC node ij in the array becomes C_ij = C_pij + C_pij. The exchange capacitor C_E, also implemented with a MOS transistor, has a value of 28.5 fF.
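To make the idea of a diffusion cycle concrete, the sketch below implements a generic, idealized 4-neighbour charge-exchange step in software. It is not the node equation of the chip: the exchange fraction k is a hypothetical parameter that stands in for whatever capacitor ratio governs the real network, and the double Euler phasing is not modelled. Each node moves a fraction k of the voltage difference to each of its neighbors, so the total charge is conserved.

```python
def diffusion_step(img, k):
    """One idealized 4-neighbour diffusion cycle.

    k is a hypothetical exchange fraction (0 < k <= 0.25), not a value
    taken from the chip. Border nodes simply have fewer neighbours.
    """
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    # Exchange proportional to the voltage difference.
                    out[i][j] += k * (img[ni][nj] - img[i][j])
    return out


# Impulse response of a single cycle on a 5x5 grid.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
blurred = diffusion_step(impulse, k=0.1)
total = sum(sum(row) for row in blurred)
```

Repeating the step widens the response, which is how the number of cycles sets the effective Gaussian width in this kind of network.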
From Eq. (2), the width σ_0 per iteration is given by Eq. (3).
The diffusion network is implemented with a double Euler configuration [6], highlighted at the bottom left of Fig. 2. The Gaussian width per Gaussian filtering or diffusion cycle is σ_0 = 0.48. Fig. 4 illustrates 9 diffusion cycles for a 16 × 16 image from simulated PEs, with an asymptotic RMSE = 0.6 LSB.
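The relation between the number of diffusion cycles and the resulting width can be sketched as follows. The sketch assumes ideal Gaussian behavior, where the variances of successive filtering passes add, so that n cycles of width σ_0 give σ_0·√n. The function names are illustrative, and a real network with mismatch and finite settling would deviate from these figures.

```python
import math

SIGMA_0 = 0.48  # Gaussian width per diffusion cycle, as quoted in the text


def sigma_after(cycles, sigma_0=SIGMA_0):
    """Cumulative width after a number of cycles (ideal case: variances add)."""
    return sigma_0 * math.sqrt(cycles)


def cycles_for(target_sigma, sigma_0=SIGMA_0):
    """Smallest number of cycles whose cumulative width reaches target_sigma."""
    return math.ceil((target_sigma / sigma_0) ** 2)


print(round(sigma_after(9), 2))  # width after the 9 cycles of the example
print(cycles_for(1.6))           # cycles needed for a hypothetical target of 1.6
```

Under this assumption, the 9 cycles mentioned above correspond to a cumulative width of about 1.44.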
3) A/D Conversion: As mentioned before, the in-PE single-slope ADC is distributed inside and outside the pixel array. The comparator is located within every PE, while the frame buffer is outside the pixel array. Global circuitry for the generation of the single-slope analog ramp, along with biasing circuitry, is also needed to complete the ADC. The function of the ADC is to digitize either the input image or the scales in order to perform extrema detection.
The comparator is shown in Fig. 5. It is an offset-compensated topology with two gain stages (−K) implemented with a double-cascode structure with the same biasing and transistor sizes as those of the amplifier used for CDS. The capacitor C used for offset compensation is shared with the CDS stage (see Fig. 3). Signals comp_rst and comp_rst_d are used to apply the bottom sampling technique to cancel offset, leading to the output of the first inverter given by:
with V_Q being the quiescent point of the first inverter, and V_pix and V_ramp being the signal acquired by the photodiode or a given scale S, and the ramp of the 8-bit single-slope ADC, respectively.
The A/D conversion finishes when the signal EoC goes down to logic 0. This happens when the output of the first inverter (inv1) switches from 0 to 1, which drives the enable input of the first inverter to 0 through the feedback loop from the second inverter. The complementary enable input of the second inverter is also tied to logic 1 through inv1. The feedback loop reinforces the logic states of both inverters after the zero crossing between V_pix and V_ramp, and it also cuts the static power consumption of the two inverters down to zero. Finally, with the signal comp at 0, the NAND gate holds EoC at zero, which prevents writing into the frame buffer while no comparison is taking place. The comparator is the most power-hungry stage in the PE, with an average consumption of 265 nW/PE at 30 frames/s.
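The per-block figures quoted in this section can be combined into a rough in-PE power budget. The sketch below assumes that the three quoted averages (source follower, CDS inverter, comparator) are all per-PE values and that they make up the in-array consumption; global circuitry such as the ramp generator and its buffers is not included. The dictionary keys are illustrative names.

```python
PIXELS = 176 * 120
PIXELS_PER_PE = 4
NUM_PES = PIXELS // PIXELS_PER_PE

# Average power per PE at 30 frames/s, in nW, as quoted in the text.
# Treating all three as per-PE contributions is an assumption.
POWER_NW_PER_PE = {
    "source_follower": 2.5,
    "cds_inverter": 15.4,
    "comparator": 265.0,
}

per_pe_nw = sum(POWER_NW_PER_PE.values())
per_pixel_nw = per_pe_nw / PIXELS_PER_PE
array_mw = per_pe_nw * NUM_PES * 1e-6  # nW -> mW

print(NUM_PES, round(per_pe_nw, 1), round(per_pixel_nw, 1), round(array_mw, 2))
```

Under these assumptions the budget is about 283 nW per PE, or roughly 71 nW per pixel, which is consistent with the bound of 75 nW/pixel stated in the abstract.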
The frame buffer stores the scales or the input image generated in the pixel array. The signal EoC provided by the comparator ends the loading of the registers, latching the 8-bit word that results from the conversion. The 8-bit word is generated by an 8-bit counter implemented with D-type flip-flops.
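A behavioral sketch of the single-slope conversion may help here. It models a counter that advances together with the ramp, and a comparator that latches the counter value into a register when the ramp reaches the pixel voltage. The rising ramp starting at 0 V and the function name are assumptions of the sketch; the 3.6 mV step is the ramp LSB quoted for the DAC in this section.

```python
LSB_V = 0.0036  # ramp step per code (3.6 mV), as quoted for the ramp generator
BITS = 8


def single_slope_convert(v_pix, lsb=LSB_V, bits=BITS):
    """Return the counter value latched when the ramp first reaches v_pix.

    Assumes a rising ramp that starts at 0 V; the polarity and offset of
    the real ramp are not modelled.
    """
    for code in range(2 ** bits):
        v_ramp = code * lsb
        if v_ramp >= v_pix:
            # Comparator trips: end of conversion, the register keeps this code.
            return code
    # Input above full scale: the register holds the last counter value.
    return 2 ** bits - 1


print(single_slope_convert(0.5))
```

Because every PE compares against the same global ramp and counter, all comparators in a burst convert in parallel and only the latching instant differs from PE to PE.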
The frame buffer is laid out on the top and bottom sides of the die, so that the upper 60 pixel rows drive the top frame buffer, while the lower 60 pixel rows are read out by the bottom frame buffer. This floorplan minimizes routing. Also, as seen in Fig. 6, every 1/2 frame buffer (see Fig. 2) comprises 352 × 15 8-bit registers. In turn, every 1/2 frame buffer is split into two regions of 176 × 15 registers each. As there is only one comparator, and hence one ADC, per 4 pixels or photodiodes, 4 bursts are needed to complete the reading of the array. The split of the 1/2 frame buffers into two regions allows 176 × 15 pixels to be written into the registers at the same time as 176 × 15 pixels are read out of the chip.
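The register counts above can be cross-checked with a few lines of arithmetic. The sketch assumes that each region of a 1/2 frame buffer holds one burst, that is, one converted pixel from every PE in that half of the array; this reading is an interpretation of the text, and the variable names are illustrative.

```python
COLS, ROWS = 176, 120
PIXELS_PER_PE = 4
HALVES = 2
REGS_PER_HALF = 352 * 15
REGIONS_PER_HALF = 2

pixels_per_half = COLS * (ROWS // HALVES)
pes_per_half = pixels_per_half // PIXELS_PER_PE
regs_per_region = REGS_PER_HALF // REGIONS_PER_HALF
bursts = PIXELS_PER_PE  # one comparator per 4 pixels -> 4 conversions per PE

print(pixels_per_half, pes_per_half, regs_per_region, bursts)
```

With these numbers, one region (176 × 15 registers) matches the number of PEs in half of the array, and 4 bursts of that size cover all the pixels of the half.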
The ADC is completed by an 8-bit Digital-to-Analog Converter (DAC) and its corresponding buffers, which provide the single-slope ramp to the comparator. Fig. 7 shows the schematic of the analog ramp generator. It is implemented with a current-steering thermometer DAC driven by a counter, with the current sources implemented with cascode stages biased at 2 µA. The currents are converted into voltages in an external resistor of R = 1.8 kΩ, providing a voltage of 3.6 mV per LSB. The DAC includes a circuit to calibrate offset with 5 bits of resolution, while the gain errors can be minimized by tuning the external resistor. At simulation level the non-linearity errors were INL = ±0.19 LSB and DNL = ±0.006 LSB. The ramp generated by the DAC is buffered to the pixel array with two stages of folded-cascode OTAs, labeled '1', and '21' and '22' in Fig. 7, and biased at 50 µA and 600 µA, respectively.
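The LSB figure follows directly from the unit current and the external resistor, as the short sketch below shows. It models an ideal thermometer DAC in which code k switches k unit sources onto the resistor; offset calibration, buffer gain and non-linearity are ignored, and the function name is illustrative.

```python
I_UNIT_A = 2e-6   # unit current source (2 uA)
R_EXT_OHM = 1.8e3 # external resistor (1.8 kOhm)
BITS = 8


def ramp_voltage(code, i_unit=I_UNIT_A, r_ext=R_EXT_OHM):
    """Ideal ramp voltage for a thermometer code: code unit currents into R."""
    return code * i_unit * r_ext


lsb_v = ramp_voltage(1)
full_scale_v = ramp_voltage(2 ** BITS - 1)
print(round(lsb_v * 1e3, 3), "mV per LSB")
print(round(full_scale_v, 3), "V at the last code")
```

In this ideal model the last code corresponds to about 0.918 V, and tuning the external resistor scales every step by the same factor, which is why it can be used to trim gain errors.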
B. Chip Comparison
Although there are SIFT implementations on programmable hardware such as FPGAs or GPUs, we focus on the most recent ASICs for comparison with the chip described in this paper. Table II summarizes the most relevant performance metrics. It should be noted that the chips in [7] and [8] are implemented in the digital domain, and that both use additional strategies to reduce power consumption and computation time. For instance, the chip in [8] uses a visual attention algorithm to run SIFT only on the image regions with relevant information. It splits an HD 720p image into 3600 tiles of 16 × 16 pixels each, usually processing less than 1/3 of the pixels in the image, which enables video frame rate. The data in the first column of Table II, however, refer to a case where all the pixels of an HD 720p image are processed, to allow a fairer comparison with our approach. The available data for the chip in [7] refer to the detection of only one feature. It is not clear how long it would take to detect all the features of an image. In both chips there would be an additional step of image acquisition and A/D conversion. A/D conversion clearly worsens the performance metrics of our chip. Even so, and bearing in mind that our data are simulated results, the data collected in Table II show that the expected performance is comparable to that of state-of-the-art custom chips for SIFT.
IV. CONCLUSION
This paper has presented a vision chip in 0.18 µm CMOS technology with an array of 176 × 120 pixels for computing the Gaussian pyramid on an SC network, with an assignment of 4 photodiodes and 4 SC nodes to one CDS circuit and one ADC. The chip is the mapping of a CMOS-3D architecture for feature detection, with reduced functionality and an area overhead due to the lack of TSVs. The data collected from this implementation, however, should ease a future design in CMOS-3D technology. The performance metrics from the extracted layouts are comparable to those of state-of-the-art chips that include SIFT for object detection and recognition.
