Abstract-This paper introduces a CMOS vision sensor chip in a standard 0.18 µm CMOS technology for Gaussian pyramid extraction. The Gaussian pyramid provides computer vision algorithms with scale invariance, which permits having the same response regardless of the distance of the scene to the camera. The chip comprises 176 × 120 photosensors arranged into 88 × 60 processing elements (PEs). The Gaussian pyramid is generated with a double-Euler switched capacitor (SC) network. Every PE comprises four photodiodes, one 8 b single-slope analog-to-digital converter, one correlated double sampling circuit, and four state capacitors with their corresponding switches to implement the double-Euler SC network. Every PE occupies 44 × 44 µm². Measurements from the chip are presented to assess the accuracy of the generated Gaussian pyramid for visual tracking applications. Error levels are below 2% full-scale output, thus making the chip feasible for these applications. Also, energy cost is 26.5 nJ/px at 2.64 Mpx/s, thus outperforming conventional solutions of imager plus microprocessor unit.
data, and data transmission and storage consume significant energy and area. Also, preprocessing and reduced data transmission result in increased throughput. Actually, preprocessing is smartly implemented in natural vision systems [1] , [2] , a fact that has motivated Lee and Hsieh [3] , Fernández-Berni et al. [4] , Carey et al. [5] , Park et al. [6] , and Rodríguez-Vázquez et al. [7] to explore architectures for CMOS imaging frontends with per-pixel processing circuitry. These systems are recently making the transition from academic proof-of-concept prototypes to industrial products [8] .
Sensory-processing front-end chips with per-pixel processors typically operate as single instruction multiple data (SIMD) processors, namely, all processors run concurrently the same operation on the data captured by the pixel photosensors, thus accelerating computation. Also, mixed-signal per-pixel processors provide speed advantages with large energy efficiency [9] , [10] . As a result, image sensors with embedded mixed-signal processors emerge as suitable candidates for the frontend of vision systems with optimum size, weight, and power (SWaP) figures and large throughput. Throughout this paper, we will use the term CMOS vision sensors (CVISs) to refer to image front-end devices with embedded analysis capability, and we will retain the term CMOS image sensors (CISs) for conventional image frontends conceived to deliver just images.
Major points hampering further development of the CVIS-SIMD are as follows.
1) Their outcome may not be compatible with computer vision software tools, thus limiting their acceptance by system engineers and integrators.
2) Reduced fill factor when realized in standard 2-D technologies.
3) Large pitch and hence smaller resolution than the CIS for a given form factor, again in standard 2-D technologies.
Nevertheless, the loss of resolution and the image quality of CVIS-SIMD are not insurmountable barriers for vision. Nature also teaches lessons in this regard; for instance, patients with retinitis pigmentosa see with a small fraction of their photoreceptors alive [11], which suggests that large pixel counts may not be a must. Indeed, resolutions as low as 32 × 32 px suffice to get the gist of complex scenes [12] and have been demonstrated for indoor elderly care [13]. Also, commercial sensors with low pixel counts (QCIF: 176 × 144) are produced for machine vision applications [14] and have been demonstrated for adaptive laser welding [15], among other applications. Also, reduced fill factor may be overcome with controlled illumination, as actually happens in many machine vision applications [16]. Furthermore, many computer vision algorithms cope with inaccuracies arising during processing [17], [18], thus easing the use of mixed-signal CVIS-SIMD. As an example, the chip in [19], which runs the earliest stages for face detection using the algorithm in [17], tolerates processing errors close to 10%. As shown in Section IV-D, the chip measurements in this paper show that inaccuracies in the Gaussian pyramid are low enough not to be a concern for visual tracking.
0018-9200 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Regarding compatibility with computer vision tools, it can be met by aligning the conception of CVIS-SIMD to standard computer vision procedures [20] , particularly by focusing on the embedding of preprocessing functions customarily used by computer vision system engineers. This is actually the case of image pyramids, such as the Gaussian pyramid [21] . Image pyramids are found at the initial stages of the processing vision chain for a large variety of computer vision applications and algorithms such as the scale invariant feature transform (SIFT) and variations thereof. Their calculation is resource intensive because it involves repetitive operations with the whole set of image data. As a consequence, the potential benefit of calculating them with the CVIS-SIMD is huge. The CVIS-SIMD may represent a first step toward embedding complete computer vision on a single die with vision capabilities into SWaP-sensitive systems, such as vision-enabled wireless sensor networks [22] and unmanned aerial vehicles [23] .
From now on, we will use the term processing element (PE) for the elementary cell of CVIS-SIMD image front-end chips. This paper reports a 0.18 μm CMOS sensory processing chip to extract the Gaussian pyramid with per-pixel processing circuitry, the ADC, and correlated double sampling (CDS). It contains 176 × 120 3T APSs arranged into 88 × 60 PEs, i.e., four photosensing points per PE. Gaussian filtering is realized using a diffusive, double-Euler, and switched capacitor (SC) grid. The chip operates at 2.64 Mpx/s with an energy consumption of 26.5 nJ/px (0.6 μJ/frame), thus outperforming conventional architectures of imager and the MPU by several orders of magnitude. Measurements show errors below 2% full-scale output (FSO) versus Gaussian pyramid computed by software [24] ; these errors are tolerated by vision applications.
II. GAUSSIAN PYRAMID EXTRACTION

A. Basic Concepts
The scale space enables computer vision algorithms to give the same response regardless of the distance between camera and object. A common function for scale-space generation is the Gaussian filter [25], [26]. The scale space is a function L(x, y, σ) resulting from the convolution of a variable-width Gaussian function G(x, y, σ) with an input image I(x, y)
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \tag{1}$$
where * is the convolution operator, σ is the width of the Gaussian function, and x and y are the spatial coordinates of the image. The Gaussian pyramid illustrated in Fig. 1 consists of several scale spaces arranged into octaves. Starting from the bottom of Fig. 1 , images within each new octave have all one quarter the resolution of those in the previous octave. Subsampling is hence made in the transition from each octave to the next one. Regarding images contained within each octave, these images are scales obtained through Gaussian filtering with increasing widths. The width of each new scale is k times larger than that of the previous one. The range of scale widths is the same for all octaves, namely, from σ 0 to 2σ 0 . The width σ 0 is application dependent, and as such, it could be selected by the user. Usually, three octaves with six scales each suffice [21] . At hardware level, the issue is to provide accurate widths σ i of the Gaussian function.
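The pyramid layout described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the choice k = 2^(1/5) is an assumption made so that six scales span exactly σ0 to 2σ0; the text does not state the value of k.

```python
def scale_widths(sigma0, scales_per_octave=6):
    """Widths sigma_i = sigma0 * k**i spanning sigma0..2*sigma0 in one octave."""
    # Assumption: k is chosen so that the last scale lands exactly on 2*sigma0.
    k = 2.0 ** (1.0 / (scales_per_octave - 1))
    return [sigma0 * k ** i for i in range(scales_per_octave)]


def octave_resolutions(width, height, octaves=3):
    """Image size per octave; each octave keeps one quarter of the pixels."""
    sizes = []
    for _ in range(octaves):
        sizes.append((width, height))
        width, height = width // 2, height // 2
    return sizes


if __name__ == "__main__":
    print([round(s, 3) for s in scale_widths(1.0)])
    print(octave_resolutions(176, 120))
```

For a 176 × 120 input, the second function returns the three resolutions at which the octaves are computed.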
B. Hardware Implementation
The Gaussian function gives the value I_ij of each pixel as the solution of a first-order differential equation under the driving force of the values of the four neighboring pixels along the cardinal directions, namely
$$\frac{dI_{ij}}{dt} = D\left(I_{i+1,j} + I_{i-1,j} + I_{i,j+1} + I_{i,j-1} - 4I_{ij}\right) \tag{2}$$
which is actually the continuous-time heat differential equation [27], with D being the diffusion coefficient, usually a constant value common to all the pixels in the image space. In the case of the Gaussian pyramid, D determines the degree of blurring through the expression σ = √(2Dt), where variable t is the time. In our case, pixel values are voltages V_ij held at state capacitors of capacitance C, and pixels are connected to the four neighbors through resistive links with resistance R. In such a case, (2) transforms into
$$C\frac{dV_{ij}}{dt} = \frac{1}{R}\left(V_{i+1,j} + V_{i-1,j} + V_{i,j+1} + V_{i,j-1} - 4V_{ij}\right) \tag{3}$$
from where D = 1/RC and the filter width σ_RC = √(2t/RC). Resistance R can be implemented either through MOS transistors operating in the ohmic region, giving rise to RC networks, or through SC networks. Fig. 2 illustrates both implementation styles. The former are inherently more nonlinear than the latter. Also, RC networks need sampling mechanisms to stop the transient evolution of the network and thereby set the width [4]. The nonlinearity of active resistive links and the time uncertainty of sampling mechanisms degrade the accuracy of the diffusion process in RC networks. These problems can be overcome by emulating resistive links through SCs, giving rise to the so-called diffusive SC networks.
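The link between diffusion and Gaussian width, σ = √(2Dt), can be checked numerically. The sketch below is a 1-D simplification (two neighbors instead of four) integrated with a forward-Euler step; the grid size, D, and the time step are arbitrary illustration values, not chip parameters.

```python
import math


def diffuse_1d(values, d_coeff, dt, steps):
    """Forward-Euler integration of dV_i/dt = D * (V_{i+1} + V_{i-1} - 2 V_i)."""
    v = list(values)
    a = d_coeff * dt  # must stay below 0.5 for numerical stability
    for _ in range(steps):
        nxt = v[:]
        for i in range(1, len(v) - 1):
            nxt[i] = v[i] + a * (v[i + 1] + v[i - 1] - 2.0 * v[i])
        v = nxt
    return v


def spatial_std(values):
    """Standard deviation of position, weighting each index by its value."""
    total = sum(values)
    mean = sum(i * x for i, x in enumerate(values)) / total
    var = sum((i - mean) ** 2 * x for i, x in enumerate(values)) / total
    return math.sqrt(var)


if __name__ == "__main__":
    impulse = [0.0] * 201
    impulse[100] = 1.0
    D, DT, STEPS = 1.0, 0.1, 40
    blurred = diffuse_1d(impulse, D, DT, STEPS)
    # Measured spread of the impulse response versus sqrt(2 * D * t).
    print(spatial_std(blurred), math.sqrt(2.0 * D * DT * STEPS))
```

Starting from a single impulse, the spread of the diffused profile matches √(2Dt) as long as the profile does not reach the grid boundaries.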
There are many different SC topologies to run Gaussian filters [28]. Fig. 2(b) and (c) displays simple- and double-Euler SC networks in 1-D. In both cases, an exchange capacitor C_E is sampled by two switches driven by two nonoverlapping clock signals φ_1 and φ_2 [Fig. 2(d)]. The Gaussian pyramid provided by the double-Euler configuration yields better figures of merit than those of the simple-Euler SC topology when included in the SIFT algorithm [29]. Hence, the double-Euler is the SC network implemented on the CVIS-SIMD presented in this paper.
Assuming, as in any SC circuit, that transients associated with the ON resistances of the switches are neglected, that all state capacitors have the same capacitance C, and that C_E1 = C_E2 = C_E, the equivalent impedance of the double-Euler SC topology is R = T_clk/nC_E, where n is the number of clock cycles and T_clk is the clock period. The resultant σ_SC, the Gaussian width of the double-Euler SC topology across the number of clock cycles, becomes
$$\sigma_{SC} = \sqrt{\frac{2 n C_E}{C}} \tag{4}$$
Equation (4) can be used to set the Gaussian width by design. However, deviations may be observed during fabrication, which depend on the actual device employed to implement C_E and C. It is hence convenient to extract the on-chip σ_SC value through measurements. Extracted values might be used for calibration if needed. The extraction procedure of on-chip σ_SC for our chip will be addressed in Section IV-B.
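As a quick numerical illustration of setting the width by design, the sketch below assumes the width follows σ_SC = √(2nC_E/C), a form that reproduces the 0.48√n figure quoted in Section III-C for the first octave. The function names are hypothetical, and the capacitor values are the first-octave design values given in Section III-C.

```python
import math


def sigma_sc(n_cycles, c_exchange, c_state):
    """Gaussian width after n clock cycles, assuming sqrt(2 * n * C_E / C)."""
    return math.sqrt(2.0 * n_cycles * c_exchange / c_state)


def cycles_for_sigma(sigma_target, c_exchange, c_state):
    """Smallest whole number of clock cycles whose width reaches sigma_target."""
    return math.ceil(sigma_target ** 2 * c_state / (2.0 * c_exchange))


if __name__ == "__main__":
    C_STATE = 330e-15      # first-octave state capacitor, design value
    C_EXCHANGE = 38.5e-15  # first-octave exchange capacitor, design value
    print(round(sigma_sc(1, C_EXCHANGE, C_STATE), 2))
    print(cycles_for_sigma(1.6, C_EXCHANGE, C_STATE))
```

Because the width grows with √n, doubling σ requires four times as many clock cycles; the target of 1.6 used above is an arbitrary example.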
III. CHIP DESIGN
A. Chip Floorplan and Processing Elements
The micrograph on the left of Fig. 3 shows the chip floorplan, consisting of a core array of PEs surrounded by a split frame buffer. The core array includes 88 × 60 PEs. Each PE comprises the following: 1) four 3T-APS pixels, so that the spatial resolution regarding image acquisition is 176 × 120; 2) a comparator for in-PE A/D conversion; and 3) four state capacitors and a CDS circuit, which is also used as part of the local analog memories (LAMs). Per-PE ADC and per-PE CDS, instead of the conventional per-column approach, increase parallelism. Also, this strategy is favored by the retargeting of the architecture proposed herein to vertical technologies, leading to better performance metrics [30], [31]. Circuit sharing, i.e., the use of the same devices for different functions over time, partly compensates for the per-PE ADC and CDS area overhead. The longer routing required by the per-PE ADC and CDS is alleviated by laying down the frame buffer that stores the results from the A/D conversion in two halves at the top and bottom of the PE array, which in turn diminishes power consumption.
B. PE Array Configuration
The PE array changes its configuration according to the function realized by the chip. The input image and the scales in the first octave are stored at state capacitors (C pi j _O1 ). As seen in Fig. 4 (a) and (b), as there is only one ADC and CDS circuit per four pixels and four state capacitors, image acquisition and scales read out are performed for four cycles. State capacitors are shunted across octaves. Fig. 4 (c) shows the configuration during the second octave. In this case, the state capacitors of a PE are combined into only one to perform downscaling, which leads to one-to-one state capacitor per CDS and A/D circuit in the PE array. In the third octave, the state capacitors of four PEs are merged, and again, there is a one-to-one state capacitor per CDS and A/D circuit. The read out of the input image and the 18 scales resultant from three octaves and six scales each amounts to 40 A/D conversions of the PE array for the whole Gaussian pyramid. Table I lists the sizes of the transistors in Fig. 5 . Switches are implemented with nMOS transistors with minimum dimensions. Circuit sharing is performed with amplifier A1 and capacitors C and C pi j . Every 3T-APS pixel has its corresponding capacitor C pi j . This is shown in Fig. 5 with the same gray shade. Capacitor C runs CDS and offset-compensation comparison during A/D conversion. Amplifier A1 and capacitors C pi j are part of LAMs and CDS circuits. The latter are also part of the state capacitors C pi j _Ok in the SC network.
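The count of 40 A/D conversions follows from the sharing of one ADC among four pixels. A minimal sketch of that bookkeeping, with a hypothetical function name and the octave and scale counts taken from the text:

```python
def adc_conversions(octaves=3, scales_per_octave=6, pixels_per_pe=4):
    """Count PE-array A/D conversion passes needed to read out one pyramid."""
    # The input image and the first-octave scales hold one value per pixel,
    # so the single ADC in each PE needs one pass per pixel it serves.
    passes = pixels_per_pe                        # input image
    passes += scales_per_octave * pixels_per_pe   # first octave
    # From the second octave on, there is at most one state capacitor per ADC.
    passes += (octaves - 1) * scales_per_octave
    return passes


if __name__ == "__main__":
    print(adc_conversions())
```

The input image and the six first-octave scales account for 28 of the passes, which is why most of the read-out effort goes to the first octave.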
C. Circuit Implementation
The gain stages in the PE are double-cascode topologies. Only one amplifier is included for CDS and image storing in the LAMs, while two are required in the comparator of the ADC. The amplifier can be configured in two modes of operation, namely, I A and I B, shown in Fig. 6 (a) and (b), respectively. In both cases, the current can be cut off through enable ports. Switches driven by enable ports increase their output impedance close to the end of the operating range of the amplifier, increasing the gain too. Configuration I B consumes 
1) Image Acquisition:
The photodiode is an n-well over p-substrate structure in order to enhance the spectral response at longer wavelengths. The bias current of the source follower of the 3T-APS is set to 1 μA by M4 through a transconductance circuit with an external resistor. CDS is included to diminish reset noise and FPN from mismatch [33] . The nominal working range for the output voltage of the CDS circuit is defined by amplifier A1 in Fig. 5 , namely, [0.4, 1.3] V. These are the lower and upper bounds for the voltages at the state capacitors of the double-Euler SC network. Fig. 7 shows the CDS topology with its control signals. A similar implementation has been used in [34] . For a given pixel i j, signal φ rw_ pi j is high during the whole acquisition time. Reset and signal voltages for CDS are sampled at time instants t 0 and t 1 with signal φ acq high. The CDS output is stored in C pi j as well as in C pi j and the four exchange capacitors C E are connected to the node n i j . Signals φ 1_O1_ pi j and φ 1_ pi j set the initial values in the exchange capacitors used for intra-PE and inter-PE connections in the double-Euler SC network, respectively.
The CDS is implemented with amplifier A1 in I A mode to support a wide input voltage range. Enable signal φ en_inv1 allows switching off amplifier A1 between the two samples at t 0 and t 1 . By assuming large enough gain A1, the CDS output voltage is given by
where V ref = 400 mV. 
2) Local Analog Memories:
The LAMs store both the image after CDS and the scales across the Gaussian pyramid. The LAMs are implemented with amplifier A1, capacitors C_pij, and switches φ_writep, φ_rdm, and φ_write0 (see Fig. 5). Scales across the Gaussian pyramid are stored and read out in two phases with signal φ_rw_pij high and φ_vref_cds low. Both phases are shown in Fig. 8. During the first phase, voltage V_nij − V_Q is held in capacitor C_pij with signal φ_rdm high and φ_writep and φ_write0 low. The read out is performed during the second phase with φ_rdm low and φ_write0 and φ_writep high, leaving V_outij = V_nij, where V_nij is the voltage at node n_ij.
3) Comparison for in-PE ADC:
Our chip embeds an 8 b single-slope in-PE ADC. Fig. 9 shows the single-input offset-compensated comparator of the in-PE ADC. Offset compensation makes the comparator less sensitive to manufacturing variability. Switches are implemented with nMOS transistors. Their sizes are collected in Table II. Label M15 denotes the four transistors in the NAND gate of the comparator, which is implemented with complementary logic. Amplifier A2 is configured as I A, while A3 is in mode I B to cut power consumption, which is further decreased with the feedback loop between both gain stages. The bottom sampling technique is run with different delays between signals (Delay1 - Delay3 in Fig. 9).
The comparator works in two phases: reset and comparison. During reset, both the first input signal and the quiescent point 
V outij -V ramp crossing triggers the signal end of conversion (EoC) to low, enabling the writing of a digital word given by an 8 b counter into the frame buffer assigned. The EoC occurs with V out2 low [see Fig. 9 and (6)], which in turn cuts off current in the first gain stage through a positive feedback loop. The feedback loop also reinforces logic levels. Voltage and current waveforms in the first amplifier of the comparator (V out1 in Fig. 9 ) with and without feedback loop plotted in Fig. 10 (a) confirm this statement. Fig. 10(b) and (c) illustrates power savings from the feedback loop for two input voltages, corresponding to ADC output codes 250 and 40, close to the lower and upper parts of the falling ramp. Blue and pink lines are the currents integrated along the whole ramp in the first and second amplifiers of the comparator. The comparator without feedback loop consumes 1.65 and 1.7 μW for codes 250 and 40, respectively, and the feedback loop leads to 75 nW and 1.65 μW, resulting in large power savings for the largest ADC output codes.
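The single-slope conversion principle can be illustrated with a behavioral model: a global counter advances while a ramp falls, and each PE latches the counter value when its comparator trips. The sketch below is an idealization, not the chip's circuit. It assumes a linear ramp spanning the nominal CDS output range of [0.4, 1.3] V and an ideal comparator; the actual ramp limits are set by the DAC and its external resistor.

```python
def single_slope_adc(v_in, v_top=1.3, v_bottom=0.4, bits=8):
    """Return the counter value latched when a falling ramp crosses v_in."""
    max_code = (1 << bits) - 1
    lsb = (v_top - v_bottom) / max_code
    for count in range(max_code + 1):
        ramp = v_top - count * lsb
        if ramp <= v_in:       # comparator trips: end of conversion
            return count
    return max_code            # input below the ramp range: saturate


if __name__ == "__main__":
    for v in (1.3, 0.85, 0.4):
        print(v, single_slope_adc(v))
```

In this model, low input voltages produce high codes, since the falling ramp reaches them late. This is consistent with code 250 lying near the lower part of the ramp, and it also indicates why cutting the first-stage current at EoC can save power for a large part of the conversion time in some PEs.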
4) Gaussian Pyramid Construction:
Our double-Euler SC network with NEWS connectivity yields the Gaussian pyramid. Intra-and inter-PE connections are shown in different gray shades in Fig. 5 . Fig. 11 gives a complete view of both intra-and inter-PE connections.
Downscaling across octaves in the Gaussian pyramid leads to three types of switching blocks in the SC network, labeled SC_A, SC_B, and SC_C in Fig. 11, all of them implemented as nMOS transistors with minimum dimensions. In addition, one of the four PEs has a slightly different structure from the other three. Such a PE is shaded and marked with β in Fig. 11. PEs of α type comprise switching blocks SC_A and SC_B. PEs of β type contain switching blocks SC_A and SC_C. The scales are provided by capacitors C_pij_Ok. C_pij_O1 means any of the 176 × 120 state capacitors in the first octave. Similarly, C_pij_O2 and C_pij_O3 mean any state capacitor in the second and third octaves, where the resolution is downscaled to 88 × 60 and 44 × 30, respectively. State capacitors C_pij_O1 in the first octave are the combination of MiM structures of M5-M6 metal layers C_pij with capacitors realized with transistors C_pij in order to keep dynamic errors low, leading to C_pij_O1 = 330 fF. Capacitors C_pij are isolated from the SC network during LAMs' read out through signal φ_read_net, leaving C_pij = 200 fF for these functions (see Fig. 5). Exchange capacitors in the first octave are set to C_E = 38.5 fF and realized with transistors. According to (4), the ratio of state to exchange capacitors yields σ_SC_O1 = 0.48√n for the scales in the first octave, with n being the number of clock cycles. Such scales are built with blocks SC_A, SC_B, and SC_C. Blocks SC_A run two terms of the Gaussian kernel with NEWS connectivity through the switches that connect state capacitors within a given PE. The other two terms of the Gaussian kernel are executed with blocks SC_B or SC_C, correspondingly providing inter-PE connectivity of a given state capacitor with its neighbors. As an example, and as seen in Fig. 5, the state capacitor that results from merging C_pij with C_pij into C_pij_O1 within the first octave is connected to its eastern and southern neighbors through SC_A within the PE, while its northern and western connections comprise blocks SC_B in PEs of α type and blocks SC_C in PEs of β type. Finally, signals φ_1 and φ_2 in the basic cell of the double-Euler SC network of Fig. 2 are implemented with signals φ_1_O1_pij and φ_2_O1 in blocks SC_A, φ_1_pij and φ_2_O1O2 in blocks SC_B, and φ_1_pij and φ_2_O1O2 in SC_C. φ_1_O1_pij, φ_1_pij, and φ_1_pij are turned on to initialize the C_E and C_pij capacitors during image acquisition through CDS in every PE with signal φ_read_net high, as seen in Figs. 5 and 7.
The one quarter downscaling from the first to the second octave occurs by shunting the four state capacitors C pi j _O1 of the first octave with the eight intra-PE exchange capacitors C E , giving rise to larger state capacitors throughout the second octave as C pi j _O2 = 4C pi j _O1 + 8C E for a given PE. In so doing, signals φ 1_O1_ pi j and φ 2_O1 in blocks SC A are always high in the second octave. Signals φ rw_ pi j , φ rw_ pi j +1 , φ rw_ pi+1 j , and φ rw_ pi+1 j +1 are also high to shunt capacitors C pi j in the PE (see Fig. 5 ). Signals φ 1 and φ 2 in the basic cell of the double-Euler SC network of Fig. 2 are now given by the pairs φ 1_ pi j and φ 2_O1O2 , and φ 1_ pi j and φ 2_O1O2 in blocks SC B and SC C , respectively. Signals φ 1_ pi j and φ 1_ pi j are used to initialize exchange capacitors for the second octave with blocks SC B and SC C . Also, as seen in Fig. 11 , the NEWS connectivity for PEs of α type is given by two SC B blocks along each direction. Similarly, two SC C blocks along each cardinal direction are used for PEs of β type. This means that now the exchange capacitors for the second octave become 2C E , which all in all leads to σ SC_O2 = 0.23 √ n. Finally, the one quarter downscaling from the second to the third octave is carried out in two phases. During the first step, the four state capacitors C pi j _O2 of four PEs are shunted together through signals φ 1_ pi j and φ 2_O1O2 high in blocks SC B . Subsequently, these signals turn low, disconnecting PEs of β type from those of α type in every group of four PEs. As a consequence, the scales in the third octave are performed among PEs of β type through blocks SC C , where φ 1_ pi j and φ 2_O3 play the role of control signals φ 1 and φ 2 in the basic cell of the double-Euler SC network of Fig. 2 . Initialization of state capacitors is carried out with φ 2_O1O2 high. In this scheme, both exchange and state capacitors remain the same as in the second octave, so that σ SC_O3 = σ SC_O2 .
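Downscaling by shunting capacitors relies on charge conservation: the merged node settles at the capacitance-weighted mean of the voltages it absorbs. The sketch below illustrates this with the first-octave design values. The function name and all voltages are hypothetical, including the assumption that the exchange capacitors hold a common voltage at the moment of merging.

```python
def merge_capacitors(caps):
    """Shunt capacitors given as (capacitance, voltage) pairs.

    Charge is conserved, so the merged node settles at the
    capacitance-weighted mean of the initial voltages.
    """
    c_total = sum(c for c, _ in caps)
    charge = sum(c * v for c, v in caps)
    return c_total, charge / c_total


if __name__ == "__main__":
    C_STATE, C_EX = 330e-15, 38.5e-15
    # Hypothetical pixel voltages held by the four first-octave state capacitors.
    state = [(C_STATE, v) for v in (0.5, 0.7, 0.9, 1.1)]
    # Hypothetical: the eight intra-PE exchange capacitors start at 0.8 V each.
    exchange = [(C_EX, 0.8)] * 8
    c_o2, v_o2 = merge_capacitors(state + exchange)
    print(c_o2 * 1e15, v_o2)
```

With these values, the merged capacitance equals 4C_pij_O1 + 8C_E, as stated for the second octave.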
D. Peripheral Circuits
1) Gaussian Pyramid Read-Out:
The Gaussian pyramid is read out through two frame buffers laid down at the top and bottom of the PE array, and labeled "1/2 frame buffer" in Fig. 3 . Every register bank is assigned to the corresponding half of the PE array. The frame buffer split into two halves diminishes routing area. Fig. 13(a) shows the half frame buffer. Every PE has two 8 b registers assigned in the frame buffer, allowing the read out and A/D conversion of two pixels at the same time. Such registers are named A and B in Fig. 13(b) . Every frame buffer Fig. 13(b) . As an example of read-out procedure, for the first column of PEs of the bottomhalf array-PEs across the 30th to the 59th row-the PEs from the 30th to the 44th row are A/D converted in column 0 in the register bank, while the PEs from the 45th to the 59th row are A/D converted in column 2 [both of them in reg. A in Fig. 13(b) ]. At the same time, the data converted in the previous cycle are read out of the chip in columns 1, 3 . . . [reg. B in Fig. 13(b) ]. Signal Reg_select allows selecting one of the two 8 b registers, either A or B, yielding the A/D conversion. Finally, the 4 and a 9 b row and column decoders are NOR MOS decoders with pull-up transistors.
The signal EoC from the in-PE comparator enables writing of the digital word generated by a global counter into the registers, which are implemented with an nMOS transistor at the input and a pMOS transistor in their feedback loop [ Fig. 13(c) ]. The 8 b register of a word includes a tristate at the output as shown in Fig. 13 . The row decoder enables these tristates in a full row and all write the stored word in a per column vertical bus. Another tristate placed at the end of each column selects the column that must be read. The column tristate writes the data in the bus that drives the digital word to a buffer. This buffer reinforces and drives the 8 b word to the output paths digou and digod (digital output up/down).
2) Analog Ramp and Voltage Bias Generation:
The analog ramp for the 8 b single-slope ADC is produced with an 8 b current steering DAC [35] . The DAC is laid down at the left of the PE array in Fig. 3 . The unity current for the DAC is set to 2 μA. The current from the DAC is converted to voltage in an external resistor. The DAC also comprises a 5 b current steering to set up the offset of the ramp. Finally, the bias voltage generators of the gain amplifiers in the PE are implemented with wide-swing transconductance amplifiers included on the left side of the die, within the block labeled "Ana. Ramp" in Fig. 3 .
IV. EXPERIMENTAL RESULTS
A. Camera Module Prototype
Fig. 14 shows a camera module prototype composed of three interconnected boards. The first of them (carrier board) hosts the sensor chip (FPGP). The second board encloses an FPGA DE0-Nano [36] to control the chip. The last one is a microPC (Raspberry Pi [37]) for visualization purposes. The optics is a C-mount type 35 mm @ f1/4 lens. The system is powered at 5 V through a jack/μUSB plug.
B. On-Chip Gaussian Pyramid
The chip operation depends on the value of the emulated Gaussian filter width, σ SC . This is set during design through capacitors C and C E with (4), where n stands for the number of clock cycles. Nevertheless, σ SC may change during physical realization. Fig. 15 displays changes measured from the chip. The black line shows the designed σ SC as a function of the number of clock cycles n. The blue line shows the σ SC values of the scale space extracted by iteratively comparing the outcome of the chip across the number of cycles n with an ideal scale space L(x, y, σ ) on the image acquired by the chip through RMSE minimization. The red line is a polynomial fitted to the measured values. This experimental curve fits (4) using exchange capacitor values of C E ≈ 28 fF and C E ≈ 26.5 fF for the first and second octaves, instead of the designed ones, i.e., C E = 38.5 fF, due to tolerances and parasitics, which do not destroy chip functionality. It should be noted that both the exchange capacitors C E and part of the state capacitors C pi j are implemented with transistors, while part of the state capacitors C pi j are MiM devices (see Fig. 5 ). Deviations among the experimental scales and scales designed with (4) are below 1% of the full scale, as it is illustrated by the right vertical axis in Fig. 15 , where it is seen that the rms error (RMSE) saturates around 2.5 in a scale of 255 (1% of FSO). Finally, Fig. 16 further illustrates the outcome of Gaussian filters realized by the chip by showing different scales obtained within the first octave.
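The width extraction by RMSE minimization can be sketched as a grid search: blur the acquired image with candidate widths and keep the one closest to the chip output. The example below is a 1-D, pure-Python stand-in with synthetic data. The edge signal, the candidate grid, and the function names are assumptions for illustration, and the "measured" signal is generated in software rather than taken from the chip.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about four sigma."""
    radius = max(1, int(math.ceil(4.0 * sigma)))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


def blur(signal, sigma):
    """1-D Gaussian blur with edge replication."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)
            acc += w * signal[j]
        out.append(acc)
    return out


def rmse(a, b):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def fit_sigma(reference, measured, candidates):
    """Return the candidate width whose ideal blur is closest to `measured`."""
    return min(candidates, key=lambda s: rmse(blur(reference, s), measured))


if __name__ == "__main__":
    scene = [0.0] * 32 + [255.0] * 32      # synthetic edge, not chip data
    measured = blur(scene, 1.77)           # software stand-in for a chip scale
    grid = [0.5 + 0.01 * i for i in range(251)]
    print(round(fit_sigma(scene, measured, grid), 2))
```

On real data, the residual RMSE at the best-fitting width would not vanish; it is that residual, expressed against the 255-level full scale, that the right axis of Fig. 15 reports.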
C. Implementation Comparison
The chip generates a Gaussian pyramid of three octaves with six scales each in 8 ms. The time required for A/D conversion is included in this number. Thus, the chip can provide 125 digitally encoded pyramids per second. Each data conversion takes 200 μs, and the clock cycle for the double-Euler SC network is 150 ns. The relative energy consumption and throughput of our chip are 26.5 nJ/px at 2.64 Mpx/s. Table III compares these metrics versus those provided by systems where Gaussian pyramids are obtained through digital signal processing following sensor read out. Since some of these systems do not embed image sensors, energy data for conventional CMOS imagers [38] scaled to the image resolution of the corresponding processor have been added for proper comparison.
Fig. 16. Image acquisition and different snapshots of the on-chip Gaussian pyramid across the first octave. The upper left image is the input scene; the rest of the images, from left to right and top to bottom, correspond to σ = 1.77 (clock cycles n = 19), σ = 2.17 (n = 29), and σ = 2.51 (n = 39).
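The timing and throughput figures quoted in this subsection can be cross-checked with simple arithmetic. The sketch below uses hypothetical function names. It assumes that the 40 A/D conversions dominate the pyramid time, with the SC clock cycles at 150 ns each contributing comparatively little. The average power it computes is only what the quoted energy and throughput imply; it is not a figure stated in the text.

```python
def pyramid_timing(conversions=40, t_conversion=200e-6, width=176, height=120):
    """Pyramid time, pyramid rate, and pixel throughput from the quoted figures."""
    t_pyramid = conversions * t_conversion   # seconds per pyramid
    rate = 1.0 / t_pyramid                   # pyramids per second
    throughput = rate * width * height       # pixels per second
    return t_pyramid, rate, throughput


def average_power(energy_per_px, throughput):
    """Power implied by an energy-per-pixel figure at a given throughput."""
    return energy_per_px * throughput


if __name__ == "__main__":
    t_pyr, rate, mpx = pyramid_timing()
    print(t_pyr, rate, mpx)
    print(average_power(26.5e-9, mpx))
```

The function reproduces the 8 ms pyramid time, the 125 pyramids per second, and the 2.64 Mpx/s throughput, where pixels are counted once per pyramid at the 176 × 120 input resolution.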
Energy data in Table III do not include external memory accesses as they largely depend on the camera system. Their forecast would hence be inaccurate and similar for all the Gaussian pyramid sensory-processing subsystems, including ours. Our chip is up to four orders of magnitude better than conventional and low-power MPUs in computer performance (Mpx/J), while the throughput is similar to that of the most efficient competitor. Table IV further illustrates the performance of the chip versus other highly efficient sensory-processing CVIS chips with per-pixel circuitry. The chip in [6] performs 2-D optic flow estimation. The PE array evaluates temporal contrast change by subtracting two frames whose gains are set by a programmable gain amplifier. The chip in [42] runs 3×3 convolutions. The chip in [43] performs general-purpose low-level image processing. Finally, the chip in [44] performs background subtraction. These functions are simpler than the generation of a Gaussian pyramid with three octaves @ six scales performed by the herein reported chip.
Still, the chips in [42] and [43] might compute Gaussian filters, as these are weighted convolutions. The metrics in Table IV correspond to isolated pairs of convolutions, such as Roberts or Prewitt edge detectors, and to real-time edge detection at 25 frames/s, respectively. The evaluation of the Gaussian pyramid with these chips would certainly give different metric values, and it would require additional hardware to switch between octaves. The chip in [44] performs background subtraction with two digitally programmable SC low-pass filters per pixel. The energy overhead on our chip compared with that on the chips in Table IV is partly explained by the higher complexity of the function that it runs. Differences in fill factor and pixel pitch are also due to the larger complexity of our PE. In particular, our chip and that in [6] embed an 8 b single-slope ADC. While [6] follows a per-column ADC architecture, our chip follows a per-pixel one to achieve full parallelism and hence high speed.
D. Application Assessment
The accuracy of the on-chip Gaussian pyramid has been assessed by incorporating hardware errors into the interactive tool reported in [45] . This tool employs the SIFT feature detector to perform visual tracking of six 2-D textures on VGA resolution videos. Visual tracking metrics are calculated along the application of homography, defined as the matrix that captures the transformation of the 2-D textures from one frame to the next one, e.g., rotation.
Repeatability (RP) is the metric that we have calculated to assess the quality of visual tracking with the on-chip Gaussian pyramid [45]. As defined in [45] and formulated in (7), RP is the set of interest points S_{j−1} and S_{j−2} at frames j − 1 and j − 2 such that the geometrical distance between them after applying the corresponding homographies (H_{j−1} and H_{j−2}) from frames j − 1 and j − 2 to frame j is below a certain threshold, normalized to the total number of interest points S_{j−1} or S_{j−2}. RP gives an estimate of the percentage of interest points whose location in successive frames is successfully forecast with the extracted homography
The RMSE values measured from the chip have been expressed as per-pixel local errors by finding the standard deviation of the normal distribution that corresponds to the given RMSE level. The normal distribution conveys the variability from chip manufacturing. These errors have been added to every scale of the Gaussian pyramid. Fig. 17 displays RP versus RMSE for RMSEs of 0%, 1%, 2.5%, and 5%. Our on-chip RMSE levels are below 1.2% of FSO. RP is the average over the aforementioned six 2-D textures throughout all the frames of the corresponding videos with three different image transformations, namely, rotation, zoom, and perspective distortion. The error bars, calculated as the standard deviation throughout the averaged data, report RP degradations that are tolerable for most applications. In fact, as reported in [45], the temporal distance between consecutive frames has a larger impact on RP. In this regard, the large Gaussian pyramid calculation throughput of our chip becomes an important asset, as it enables reducing the baseline distance between consecutive frames.
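The repeatability metric described in this subsection can be sketched as follows. This is an interpretation of the verbal definition, not the tool of [45]. The function names are hypothetical, homographies are plain 3 × 3 nested lists, and normalizing by the size of the first point set is an assumption, since the text allows either set.

```python
import math


def apply_homography(h, point):
    """Map a 2-D point through a 3x3 homography given as nested lists."""
    x, y = point
    xs = h[0][0] * x + h[0][1] * y + h[0][2]
    ys = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return xs / w, ys / w


def repeatability(points_a, h_a, points_b, h_b, threshold):
    """Fraction of points in `points_a` that land within `threshold` of some
    point from `points_b` once both sets are warped into the common frame."""
    if not points_a:
        return 0.0
    warped_a = [apply_homography(h_a, p) for p in points_a]
    warped_b = [apply_homography(h_b, p) for p in points_b]
    hits = 0
    for ax, ay in warped_a:
        if any(math.hypot(ax - bx, ay - by) < threshold for bx, by in warped_b):
            hits += 1
    return hits / len(points_a)


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    shift_left = [[1, 0, -1], [0, 1, 0], [0, 0, 1]]   # hypothetical homography
    pts_a = [(0, 0), (10, 0), (5, 5), (20, 20)]
    pts_b = [(1, 0), (11, 0), (6, 5)]
    print(repeatability(pts_a, identity, pts_b, shift_left, threshold=0.5))
```

In the usage example, three of the four points in the first set find a partner after warping, so the function returns 0.75.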
V. CONCLUSION
This paper presents a proof-of-concept CVIS of 176 × 120 pixels for the parallel computation of the Gaussian pyramid with double-Euler SC networks. Cutting PE area through smaller state capacitors of the SC network might be the most straightforward way to upscale our architecture while keeping performance metrics. Eventually, a given resolution might not be attainable with a double-Euler SC network. In that case, resorting to a simple-Euler network might be a solution if the loss of accuracy is affordable for the targeted application framework. Measurements from our chip demonstrate that sensory-processing architectures with per-pixel mixed-signal processors outperform conventional architectures consisting of an imager and an MPU in terms of both energy consumption and throughput. Our results also show that unavoidable errors of the analog circuitry do not result in infeasible Gaussian pyramids, as has been verified by visual tracking metrics with a publicly available image data set. The main limitations posed by the type of SIMD-CVIS reported in this paper are direct consequences of the use of per-pixel circuitry and standard, planar technologies, namely: 1) enlarged pixel pitch and 2) reduced fill factors. The former might constrain the use of this type of chip to applications where the object of interest is at a short distance from the camera. The latter calls mainly for applications with controlled illumination conditions. However, these limitations can be overcome by retargeting our architecture to 3-D vertically integrated technologies, a task for which the circuits and methods reported in this paper can be reused.
Currently, he is an Analog Designer with Atomos GmbH, Montabaur, Germany. He is the first inventor of a U.S. patent.
Dr. Suárez was a recipient of the third Best Student Paper Award at ECCTD 2013.
