Abstract-This paper compares architectures for the hardware implementation of Gaussian image pyramids. The main differences between the architectural choices lie in the sensor front-end. On one side are architectures consisting of a conventional sensor that delivers digital images, followed by digital processors. On the other side are architectures employing a non-conventional sensor with per-pixel embedded preprocessing structures for Gaussian spatial filtering. This latter choice belongs to the general category of "artificial retina" sensors, which have long been claimed to be potentially advantageous for enhancing throughput and reducing the energy consumption of vision systems. These advantages are very important in the Internet of Things context, where imaging systems are constantly exchanging information. This paper attempts to quantify these potential advantages within a design space in which the degrees of freedom are the number and type of ADCs (single-slope, SAR, cyclic, ΣΔ, and pipeline) and the number of digital processors. Results show that the speed and energy advantages of preprocessing sensors are not granted by default and are only realized through proper architectural design. The methodology presented for the comparison between focal-plane and digital approaches is a useful tool for imager design, allowing for the assessment of focal-plane processing advantages.
Indeed, these days image and vision sensors are flooding all application territories, their usage and markets are increasing at exponential pace and they are expected to play important roles within Internet of Things domains [1] , [2] .
One major obstacle faced by image and vision sensor architects is the huge amount of data required for image coding. These data stress intermediate storage resources and communication channels. Also, their handling requires large energy budget particularly if on-line reaction is targeted. Different paths are being explored in the quest of overcoming these difficulties, covering from innovative sensor front-ends to enhanced multi-core back-end processors. Dynamic vision sensors [3] and computational image sensors [4] [5] [6] are relevant examples of advances regarding front-ends. In both cases sensors are meant to extract information, instead of just data, from the scene. Thus, reduced sets of abstract data, as opposed to raw data, are downloaded from the sensor for processing, hence de-stressing the system. This paper deals with information-centric computational image sensors. Particularly a 6-T active pixel is proposed to expedite the calculation of the Gaussian Pyramid (GP) [7] . This is relevant because image pyramids, and in particular the GP, constitute the first stage of many computer vision processing pipelines [8] [9] [10] [11] . Also, their computation mobilises significant computational resources and results into large delay and energy consumption. However, the functional primitive underlying GP calculation is rather simple -just diffusions across the scene plane are needed. Diffusions can be implemented by embedding simple mixed-signal circuits at pixel level, as done for instance in [12] and [13] . Actually, results in [13] demonstrate orders of magnitude of improvement in throughput and energy consumption when compared to architectures using conventional, data-centric, sensors. But these advantages have the counterpart of much larger pixel pitch. The sensor hereby described aims at overcoming this drawback by using only two extra transistors per pixel; i.e. by employing a 6-T APS instead of the standard 4-T APS used in conventional image sensors [14] .
The main asset of the 6-T GP pixel proposed in this paper comes from the parallel implementation of diffusions. However, data interchange requirements are still significant, which means that the potential advantages of the non-conventional architecture are not granted by default. Exploring the conditions under which these advantages really occur, and benchmarking them, is the main purpose of this paper. In other words, we perform comparative throughput and energy analyses of a non-conventional architecture based on a 6-T GP pixel, on the one hand, and a conventional architecture, on the other hand. In this latter architecture all processing for the GP takes place in the digital domain, with no preprocessing at all performed in the sensor. We consider different kinds of data readout and various architectural choices regarding the number of analog-to-digital converters (ADCs) embedded in the sensor readout channel, the type of ADC, and the number of processors in the digital back-end. Results presented in this paper show that the potential advantages of the non-conventional architecture depend largely on the choice of these high-level architectural degrees of freedom.
The paper is organized as follows: Sec. II defines the concept of Gaussian Pyramid and sets the context of its hardware realization; Sec. III addresses a pixel implementation where only 6 transistors are required to perform GP processing; time and energy numerical comparisons are presented in Secs. IV and V, respectively; a case study is described in Sec. VI in order to validate the proposed implementation; finally some concluding remarks are presented in Sec. VII.
II. GAUSSIAN PYRAMID IN COMPUTER VISION
Object detection is the starting point for most computer vision pipelines. Once a particular object of interest is detected, it can be segmented, tracked, recognized etc. A major challenge for the implementation of this early vision task is that the scale of targeted objects is not known a priori. Objects can enter the surveyed scene at different distances from the image sensor. Those appearing at distant locations will require higher resolution to be detected than close-up objects for which most of the pixels will contain redundant information. The concept of pyramid representation [7] thus arises as a multi-resolution scene representation where each frame making up an image flow is progressively filtered and subsampled in order to efficiently deal with the search of objects at different scales. An example of pyramid is shown in Fig. 1(a) . The images with no subsampling are depicted in Fig. 1(b) for better visualization of the applied filtering. Formally, filtering followed by subsampling is defined by the reduce operation given by:
where K is the filtering kernel and f l is the image of the pyramid at level l [7] . The canonical way to construct a pyramid representation is based on Gaussian filtering [10] . This filter ensures that no artifacts are generated when going from finer to coarser scales. Indeed, Gaussian Image Pyramid is one of the predefined vision functions included in the industrial standard OpenVX [15] . We make use of this standard definition in our analysis.
Concerning hardware realization, a conventional approach to generate a GP is that of Fig. 2(a) . The image sensed by an M×N pixel array is converted into digital and stored in memory. A prescribed number of Processing Elements (PEs) then access memory in order to process the image just captured and generate the corresponding pyramid. PEs can operate in parallel. This approach will constitute our reference realization for comparison.
Due to the significance of the GP as a fundamental processing primitive in computer vision, numerous non-conventional approaches aiming at boosting its hardware performance have also been reported [12], [13], [16]-[20]. Among them, mixed-signal focal-plane sensing-processing [12], [13], [16], [17] stands out in terms of parallelization and energy efficiency. Additional circuitry is incorporated per pixel, usually connected to its counterpart at neighboring pixels, in order to concurrently process the image sensed by photo-sensitive circuit elements. Unfortunately, this approach typically suffers from large pixel pitch, which has a negative impact on key image sensing parameters such as sensitivity, resolution and noise.
III. PROPOSED PIXEL IMPLEMENTATION
In order to address this drawback of focal-plane sensing-processing realizations, we thoroughly analyze a focal-plane realization of Gaussian filtering requiring only two extra transistors per pixel. A basic block diagram of this realization is depicted in Fig. 2(b). The proposed circuit implementation is shown in Fig. 3. Each switch can be implemented with an n-channel transistor, resulting in a pixel with six transistors [21]. Pixel operation starts by resetting the floating diffusion nodes. After the integration time, the charge accumulated at the photodiode cathode is transferred (under control of TX) to the floating diffusion node. When the switches close, charge redistribution is performed among the parasitic capacitors at the corresponding floating diffusion nodes. The average voltage after charge redistribution represents the mean luminance in the sub-matrix where the pixels were connected. This operation is lossy: once pixels are interconnected in a sub-matrix, all parasitic capacitors end up holding the same voltage level. This loss of the original information does not prevent GP generation.
As an example, consider the 8×8 matrix in Fig. 4(a), where the pixel values encode an original image. The first step consists in connecting the pixels into 2×2 blocks to perform an average operation inside each block, as shown in Fig. 4(b). If we sample one pixel inside each block, the resulting image has half the number of rows and half the number of columns of the original image. This first step is necessary to perform convolution in the proposed way, but it reduces the resolution of the image. Sub-sampled image pixel positions are written in $p_{i,j}$ format in the middle of each block in Fig. 4(b). All subsequent steps perform Gaussian filtering on this sub-sampled matrix.

In the first subsequent step, we change the grid and, again, group the pixels into 2×2 blocks. This grid change and the result of the new charge redistribution step are shown in Fig. 4(c). After the charge redistribution, each new block holds the average of the four neighboring sub-sampled values that it spans. If we change the grid again, back to the first grid, as shown in Fig. 4(d), we perform the same filtering for a second time. Changing the grid back to the first one thus corresponds to filtering the sub-sampled image once more with the same kernel.

According to the example in Fig. 4 we conclude that, for every grid change, the sub-sampled image is filtered with the 2×2 binomial kernel. The size of the targeted kernel determines the number of times that the grid must be shifted and charge redistribution enabled. The kernels that can be implemented with the proposed hardware are cascades of 2×2 binomial kernels. Figure 5 presents the steps required for the generation of a three-level pyramid according to the GP definition of the OpenVX standard.
Step (2) of Fig. 5 is required for changing the image resolution. To generate Level 0, which is the GP starting level, we sample one pixel inside each 2×2 block of the image generated after this charge redistribution. This image is then filtered through steps (4) to (7), resulting in the image that is subsampled to generate Level 1. To compute Level 2, we connect the pixels into 4×4 blocks, with the same goal as in step (2), thus reducing the resolution. As in the calculation of Level 1, four charge redistribution operations are performed to filter the image, in steps (10) to (13). At the end of these operations the result is subsampled, generating Level 2. To create a pyramid with four levels, the pixels are then connected into 8×8 blocks. The maximum number of levels that can be generated by the proposed hardware mainly depends on the leakage current of the fabrication technology and on the floating diffusion node capacitance.
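The grid-shifting scheme of Fig. 4 can be checked numerically. The following is a minimal sketch in plain Python, under the assumption that charge redistribution is an ideal average over equal capacitances; the function name `block_average`, the test image and the handling of border blocks are illustrative choices, not taken from the paper.

```python
def block_average(img, offset):
    """Average img over 2x2 blocks whose grid starts at row/column -offset.

    offset=0 is the original grid, offset=1 is the shifted grid. Blocks cut
    by the image border average only the pixels that exist (an assumption).
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r0 in range(-offset, h, 2):
        rows = [r for r in (r0, r0 + 1) if 0 <= r < h]
        for c0 in range(-offset, w, 2):
            cols = [c for c in (c0, c0 + 1) if 0 <= c < w]
            mean = sum(img[r][c] for r in rows for c in cols) / (len(rows) * len(cols))
            for r in rows:
                for c in cols:
                    out[r][c] = mean
    return out


# Hypothetical 8x8 test image (values are arbitrary).
image = [[(3 * r + 5 * c) % 17 for c in range(8)] for r in range(8)]

step1 = block_average(image, 0)        # connect pixels into 2x2 blocks
p = [row[::2] for row in step1[::2]]   # sub-sampled 4x4 image p[i][j]
step2 = block_average(step1, 1)        # shift the grid, redistribute again

# An interior shifted block holds the mean of four neighbouring p values,
# i.e. the sub-sampled image filtered with the 2x2 binomial kernel.
i, j = 1, 2
expected = (p[i][j] + p[i][j + 1] + p[i + 1][j] + p[i + 1][j + 1]) / 4
print(step2[2 * i + 1][2 * j + 1], expected)
```

Repeating `block_average` with alternating offsets applies the 2×2 kernel once per grid change, so four grid changes would approximate the four filtering steps described for each pyramid level.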
IV. TIME ANALYSIS COMPARISON
The main goal of this paper is to compare our reference digital implementation, depicted in Fig. 2(a), to the focal-plane approach just described, sketched in Fig. 2(b). Note that for the focal-plane realization the resolution of the Level-0 filtered image (M×N) is a quarter of the resolution of the captured image (2M×2N).
In the digital processor, the convolution is based on sliding a binomial kernel across the image. At every location, the image pixels inside the kernel window are multiplied by the kernel elements, and the multiplication results are summed. For efficiency, the digital processor has a multiply-and-accumulate (MAC) unit, formed by one or more PEs. The binomial kernel only requires addition and division by four, so the MAC unit is realized by simple digital circuitry (logic adders and shift registers) placed outside the pixel array. Filtering with a 2×2 kernel requires four pixel values for each kernel window, but two of these values are kept from the previous window operation, so only two memory-read accesses are required per window. Likewise, one MAC operation per window can be spared if we reuse a partial result from the previous window. After each window computation the memory is accessed for writing the result.
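The reuse of pixel values and partial results between consecutive windows can be sketched as follows. This is one possible organization, assumed for illustration only: the function name `binomial_2x2`, the column-sum reuse and the read counter are not taken from the paper.

```python
def binomial_2x2(img):
    """2x2 binomial filter with column-sum reuse.

    Returns the filtered image and the number of memory reads performed.
    Pixel values are non-negative integers, so ">> 2" is the division by four.
    """
    h, w = len(img), len(img[0])
    out, reads = [], 0
    for r in range(h - 1):
        # First column sum of this row pair: nothing to reuse yet.
        prev = img[r][0] + img[r + 1][0]
        reads += 2
        row = []
        for c in range(1, w):
            cur = img[r][c] + img[r + 1][c]   # two new reads per window
            reads += 2
            row.append((prev + cur) >> 2)     # add partial sums, shift by two
            prev = cur                        # partial result reused next window
        out.append(row)
    return out, reads


# Hypothetical 4x6 test image with 8-bit values.
image = [[(7 * r + 3 * c) % 256 for c in range(6)] for r in range(4)]
filtered, reads = binomial_2x2(image)
windows = len(filtered) * len(filtered[0])
print(windows, reads)
```

Apart from the first column of each row pair, every window costs two reads, two additions and one shift, which matches the per-window counts described above.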
For a numerical comparison, the flows of both architectures are broken into tasks, which are analyzed considering processing time and energy consumption. In the time analysis, each task is related to a variable τ that represents the time to perform a given task once. We then compute the number of times the task is executed. Overall time is equal to τ multiplied by the number of executions of that task. After finding the processing time expressions for both approaches as functions of τ , each τ is associated with the clock period, τ Clk , which leads to expressions with a single global variable. The time needed for image capture is approximately the same for both approaches, so it is not considered in the time comparison. The same idea applies to the data output transmission.
A. Focal-Plane Approach Time Analysis
The focal-plane approach steps are inferred from Figs. 5 and 2(b). Aside from capture and transmission, there are two main steps:

1) Gaussian Pyramid generation: the time it takes to generate the GP equals the number of charge redistribution operations multiplied by the time it takes for a single charge redistribution. Image size does not affect the GP generation time, because this operation runs concurrently across the matrix. Kernel size determines the number of charge redistributions per level: we need $n_k - 1$ charge redistribution operations to implement an $n_k \times n_k$ kernel. From Fig. 5 we see that this filtering is repeated at every level, except the last one. Finally, we add the charge redistribution operations that take place when the pyramid level changes, one per change in Fig. 5. The overall number of charge redistribution operations is

$$N_{CR} = (n_k - 1)(N_{Lev} - 1) + (N_{Lev} - 1),$$

where $N_{Lev}$ is the number of pyramid levels. Multiplying $N_{CR}$ by the time required for performing one charge redistribution, $\tau_{CR}$, we have the overall processing time

$$\tau_{FPProc} = N_{CR} \cdot \tau_{CR}.$$

2) Analog-to-digital conversion: after each computation at the focal plane, pixel values are read out and sent to an analog-to-digital conversion stage, which comprises one or more ADCs. The time required for performing one sample conversion by one ADC is $\tau_{ADC}$. Overall data conversion time depends on the number of ADCs, $N_{ADC}$, and on the amount of data converted, $N_{conv}$.

To compute $N_{conv}$, we note that for every pyramid level the image size is reduced by a factor of 4:

$$N_{conv} = M \cdot N \cdot \sum_{l=0}^{N_{Lev}-1} \frac{1}{4^{l}}.$$

Overall conversion time is thus:

$$\tau_{ADCTotal} = \frac{N_{conv} \cdot \tau_{ADC}}{N_{ADC}}.$$

Overall focal-plane processing time is obtained by adding up $\tau_{FPProc}$ and $\tau_{ADCTotal}$:

$$\tau_{FPTotal} = N_{CR} \cdot \tau_{CR} + \frac{N_{conv} \cdot \tau_{ADC}}{N_{ADC}}. \tag{3}$$
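The focal-plane timing model can be evaluated with a few lines of Python. This is a sketch under stated assumptions: the count of level-change operations ($N_{Lev} - 1$) is read off the step sequence described for Fig. 5, times are expressed in global clock cycles, and the function names are hypothetical.

```python
def n_charge_redistributions(n_k, n_lev):
    """(n_k - 1) filtering steps per level except the last, plus one
    block-connection step per level change (assumed: n_lev - 1 changes)."""
    return (n_k - 1) * (n_lev - 1) + (n_lev - 1)


def n_conversions(m, n, n_lev):
    """Samples converted for an M x N Level-0 image; each level is 4x smaller."""
    return sum((m * n) // 4 ** level for level in range(n_lev))


def focal_plane_cycles(m, n, n_lev, n_k, n_adc, adc_cycles, cr_cycles=1):
    """Total focal-plane time in global clock cycles."""
    t_proc = n_charge_redistributions(n_k, n_lev) * cr_cycles
    t_adc = n_conversions(m, n, n_lev) * adc_cycles / n_adc
    return t_proc + t_adc


# VGA Level 0, four levels, 5x5 kernel (four filtering steps per level),
# one ADC per column, 256 clock cycles per conversion.
cycles = focal_plane_cycles(640, 480, n_lev=4, n_k=5, n_adc=640, adc_cycles=256)
print(n_conversions(640, 480, 4), cycles)
```

With these inputs the conversion term dominates by several orders of magnitude, which illustrates why the ADC stage sets the pace of the focal-plane approach.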
B. Digital Implementation Time Analysis
The digital approach requires more steps than the focal-plane approach, as can be seen in Fig. 2(a):

1) Analog-to-digital conversion: the captured image is immediately converted to digital. This is the only data conversion required by this approach. The size of the converted data is equal to the pixel array size, M×N. Thus, the overall conversion time is $M \cdot N \cdot \tau_{ADC} / N_{ADC}$.

The processing stage then reads the input values for the current pyramid level from a memory, performs multiply-and-accumulate operations and writes the result back into the memory. The number of times this operation is performed depends on image size and on the number of times the image is filtered by the binomial kernel inside each level. Image size changes at every level according to a series similar to the one given for the number of conversions, $N_{conv}$, except for the fact that we do not perform convolutions at the highest level.
The number of operations is equal to: 
C. ADC Architectures Comparison
Before using the above equations to compare focal-plane and digital approaches, it is important to remember that it is common to work with the ADC at a clock period different from the one used for the other parts of the circuit. In our case, we define τ Clk as the period of the clock signal that controls the pixel array, memory, and digital circuitry. The ADC clock period, on the other hand, is K ADC · τ Clk , where K ADC depends on ADC type.
We consider five ADCs commonly used in CMOS image sensors: ramp, successive approximation register (SAR), sigma-delta (ΣΔ), cyclic and pipeline [22]. To compare ADC types and find the appropriate clock period in each case, we use reported imagers in which the performance figures of the embedded ADCs are included [23]-[58]. ADCs have already been compared by Leñero-Bardallo and Rodríguez-Vázquez [22] and Murmann [59]. The present comparison focuses exclusively on ADCs designed for image sensors, in the context of comparative time and energy analysis, including recently published works.
The ramp converter, a linear approximation converter with a simple architecture requiring low area and low power consumption [60], is probably the most used converter in image sensor applications [36]-[46]. It is suitable for working with high clock frequencies. We thus use it as a reference for the other converter types: the ramp ADC clock period is equal to the global clock period, $\tau_{Clk,Ramp} = \tau_{Clk}$.
The data converters in the comparison were designed for different resolutions. For a fair comparison, we normalize the conversion rates and energies to the same number of bits, which is set as $N_{bits} = 8$. Although imagers with a higher number of bits are common, eight bits per pixel is more typical [61]. The conversion rate normalization depends on the number of clock cycles per bit that each converter architecture requires. A single-slope ramp converter, for example, requires at most $2^{N_{bits}} \cdot \tau_{Clk,Ramp}$ for a conversion. The conversion rate normalized to eight bits is $f_s' = 2^{N_{bits}} \cdot f_s / 2^{8}$, where $f_s$ and $N_{bits}$ are the reported conversion rate and resolution. For the SAR and cyclic converters, the conversion time is $N_{bits} \cdot \tau_{Clk,SAR,Cyclic}$, so the normalization is $f_s' = N_{bits} \cdot f_s / 8$. For the ΣΔ converter, the conversion time depends on the oversampling rate (OSR). For second-order incremental converters, the number of bits is $N_{bits} = \log_2[\mathrm{OSR} \cdot (\mathrm{OSR} + 1)] - 1$, where OSR is the reported oversampling rate. We consider an oversampling rate equal to 25, which yields a resolution equal to 8.3 bits. The normalization is $f_s' = \mathrm{OSR} \cdot f_s / 25$. The pipeline converter conversion time is one $\tau_{Clk,Pipeline}$, with some latency, and does not depend on the number of bits, i.e. normalization is not required. Pipeline converters are not as common in image sensors as the other converter types (simulation results have been reported, as well as experimental results from ADC chips working together with imaging chips), but they are included in the comparison because of their higher speed.
To normalize energy figures, we assume that the power consumption doubles for every bit added [59]: $E = 2^{8} \cdot P / (f_s \cdot 2^{N_{bits}})$. Walden's figure of merit for ADCs [62] uses the effective number of bits (ENOB) instead of the resolution. The normalized energy values in Fig. 6 are based on the resolution because some of the references do not report ENOB. Figure 6 shows the normalized energy versus normalized conversion rate for the five ADC types considered. The median conversion rate and energy (black markers in the figure) are taken as representative values for each ADC type. Summarizing, we defined $K_{Ramp} = 1$, since this converter is used as reference, and, using reported figures, found $K_{SAR,Cyclic} = 16$, $K_{\Sigma\Delta} = 8$ and $K_{Pipeline} = 2$. These constants define the ratio between the ADC clock period and the clock period $\tau_{Clk}$ used for the other stages of the circuit.
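The normalization rules and clock ratios of this section can be collected in a short Python sketch. The function names and dictionary keys are hypothetical; the formulas and the K constants are the ones stated in the text, and the input figures used in any call would have to come from the cited imagers.

```python
import math

# Ratio between ADC clock period and global clock period, as summarized above.
K_ADC = {"ramp": 1, "sar": 16, "cyclic": 16, "sigma_delta": 8, "pipeline": 2}


def incremental_bits(osr):
    """Resolution of a second-order incremental converter for a given OSR."""
    return math.log2(osr * (osr + 1)) - 1


def normalized_rate(kind, f_s, n_bits=None, osr=None, target_bits=8, target_osr=25):
    """Conversion rate rescaled to the common 8-bit (or OSR = 25) reference."""
    if kind == "ramp":
        return 2 ** n_bits * f_s / 2 ** target_bits
    if kind in ("sar", "cyclic"):
        return n_bits * f_s / target_bits
    if kind == "sigma_delta":
        return osr * f_s / target_osr
    return f_s  # pipeline: one sample per clock period, no normalization


def normalized_energy(power, f_s, n_bits, target_bits=8):
    """Energy per sample, assuming power doubles for every extra bit."""
    return 2 ** target_bits * power / (f_s * 2 ** n_bits)


def conversion_cycles(kind, n_bits=8, osr=25):
    """Global clock cycles needed for one conversion (tau_ADC / tau_Clk)."""
    steps = {"ramp": 2 ** n_bits, "sar": n_bits, "cyclic": n_bits,
             "sigma_delta": osr, "pipeline": 1}[kind]
    return steps * K_ADC[kind]


for kind in K_ADC:
    print(kind, conversion_cycles(kind))
print(incremental_bits(25))
```

Expressing every converter in global clock cycles per conversion is what allows a single variable, $\tau_{Clk}$, to be used in the time comparison.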
D. Time Comparison Results
We now establish some default values for the parameters in Eqs. (3) and (5), and associate the overall times to a global clock period. As explained in Sec. IV-C, τ Clk is the period of the clock signal that controls the pixel array, memory, and digital circuitry and K ADC · τ Clk is the ADC clock period.
Assuming that charge redistribution is practically instantaneous, it is clear from Eq. (3) that the bottleneck of the focal-plane approach is at the ADC, because of the amount of data to be converted. The digital approach bottleneck, on the other hand, is either at the ADC or at the processing stage, which depends on ADC type. For both approaches, we explore different ADC types and N ADC values. For the digital approach, we explore several N P E . We thus do not define default values for τ ADC , N ADC , and N P E . The maximum N ADC value is set to the number of columns at pyramid Level 0, since image sensors with one ADC per column are commonly found [63] . Although stacking technologies allow for the integration of one ADC per pixel [23] , this is still an upcoming technology with high fabrication costs.
We use VGA (video graphics array, 640×480 pixels) standard for the pyramid Level 0 image size. Consequently, the pixel array size in the focal-plane approach is 1280×960. The time analysis does not change significantly if the resolution increases, but the bandwidth for the transmission of the generated data increases. Increasing the resolution and using one ADC per column also increases power consumption. The pyramid size can not be too large, because computation accuracy is limited by leakage currents. The operations can be performed as long as the capacitance voltages are not affected by these currents. We set N Lev = 4. To achieve a reasonable compromise between the circuit complexity and speed, we set N bus Mem = 4. Choosing N bus Mem = 1 would impair digital circuit performance, but increasing the number of simultaneous memory accesses increases digital circuit size and complexity.
Charge redistribution, memory access and MAC operation times ($\tau_{CR}$, $\tau_{mem}$ and $\tau_{op}$) are written as functions of the clock period $\tau_{Clk}$. Charge redistribution itself is practically instantaneous, but the time it takes to drive the charge redistribution switches is considered, so $\tau_{CR} = 1\,\tau_{Clk}$. The time to access the memory, $\tau_{mem}$, was defined as $2\,\tau_{Clk}$, considering that one clock period is necessary to define the position of the memory access and another to actually access that position. The time to perform a MAC operation, $\tau_{op}$, was also defined as $2\,\tau_{Clk}$, since two clock cycles are necessary to perform the division by four, whereas the sum is performed with combinational logic, which does not depend on the clock. Table I summarizes the established parameter values.
Applying the parameter values in Eqs. (3) and (5) yields:
and
Equations (6) and (7) allow different $N_{ADC}$ values for the focal-plane and digital approaches. Charge redistribution time is not taken into account, because of its negligible contribution to Eq. (6). The ratio between the expressions in Eqs. (7) and (6) is:
Using the $K_{ADC}$ constants defined in Sec. IV-C, we replace $\tau_{ADC}$ in Eq. (8) by an appropriate function of $\tau_{Clk}$, which depends on the converter architecture. For the ramp converter we have $\tau_{ADC} = 2^{8} \cdot K_{Ramp} \cdot \tau_{Clk} = 256\,\tau_{Clk}$. Considering that both the focal-plane and digital approaches use the ramp converter, the maximum advantage of the focal-plane approach occurs when $N_{ADC,Dig} = 1$, $N_{PE} = 1$ and $N_{ADC,FP} = N_{ADC,Max} = 640$. The focal-plane approach is then 600 times faster than the digital approach. If $N_{ADC,Dig} = N_{ADC,FP} = N_{ADC,Max} = 640$, the focal-plane approach is 120 times faster. For ramp converters, the effect of increasing the number of PEs is shown by the dash-dotted line in Fig. 7, which plots the ratio between the digital and focal-plane total operation times. With only four PEs, the focal-plane approach is 31 times faster, so the advantage shrinks quickly as PEs are added.
For the SAR or cyclic converters, we have $\tau_{ADC} = N_{bits} \cdot K_{SAR,Cyclic} \cdot \tau_{Clk} = 128 \cdot \tau_{Clk}$. These converters require fewer clock cycles to perform one conversion, but their operation frequency is limited, hence resulting in performance comparable to that of the ramp ADC. The maximum advantage the focal plane achieves with SAR or cyclic converters corresponds to 700 times faster. The dashed line in Fig. 7 shows the evaluation of Eq. (8) for the SAR converter when $N_{ADC,Dig} = N_{ADC,FP} = N_{ADC,Max} = 640$. To reduce the advantage of the focal plane to less than two orders of magnitude, three PEs are necessary. With ten PEs, the focal-plane approach is 28 times faster.

For the ΣΔ converter, the conversion time depends on the OSR, which is equal to 25, as explained in Sec. IV-C: $\tau_{ADC} = \mathrm{OSR} \cdot K_{\Sigma\Delta} \cdot \tau_{Clk} = 200 \cdot \tau_{Clk}$. The dotted line in Fig. 7 shows the comparison between focal-plane and digital approaches when the ΣΔ converter is used. The result lies between those of the ramp and SAR converters: only two PEs are necessary to reduce the advantage of the focal plane to less than two orders of magnitude.

For the pipeline converter analysis, we assume that it is not possible to integrate 640 converters inside the chip, because an imager with one pipeline converter per column has not been reported, to the best of our knowledge. For this converter, $\tau_{ADC} = K_{Pipeline} \cdot \tau_{Clk} = 2 \cdot \tau_{Clk}$. The solid lines in Fig. 7 correspond to results considering different numbers of pipeline ADCs. The focal-plane approach is highly advantageous when the number of ADCs is higher than 64. In this case, 18 PEs are necessary to drop the focal-plane advantage to less than two orders of magnitude.
The speed of the digital processor may be increased by using double data rate (DDR), which allows for memory access and shift operation (division by four) to be carried out in a single clock period. In order to perform timing comparisons between the focal-plane approach and generic digital circuits not having additional power or area requirements, we do not take the DDR into account in the analysis. Nevertheless, if τ mem = τ op = τ Clk , the processing time ratios presented in Fig. 7 halve.
While focal-plane processing is being performed it is not possible to capture a new frame, which limits the frame rate. Even so, the frame rate stays well above 30 frames/s for VGA resolution. If we consider a 100 MHz global clock and one ramp converter per column, then approximately 1600 μs are necessary for generating the GP. Assuming that image capture requires an additional 400 μs, then 2000 μs are necessary for image capture and GP generation, which yields a frame rate of around 500 fps. If the image resolution is increased to 6400×4800 (a factor of 100 in pixel count), it is still possible to achieve 60 fps under the same conditions, namely one ramp converter per column and a global clock frequency of 100 MHz.

Fig. 7. Ratio between digital and focal-plane processing times as a function of the number of PEs. Ramp, SAR, ΣΔ, and pipeline ADCs are shown, respectively, in dash-dotted, dashed, dotted, and solid lines. For better visualization, a zoom of the curves is presented in the top right of the figure.
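The frame-rate figures quoted above can be reproduced with a short calculation. The sketch below assumes that the GP generation time is dominated by data conversion (charge redistribution neglected), that each level holds a quarter of the samples of the previous one, and that there is one ramp ADC per Level-0 column; the function name `frame_rate` is hypothetical.

```python
def frame_rate(m, n, n_lev=4, adc_cycles=256, f_clk=100e6, t_capture=400e-6):
    """Frames per second with one ramp ADC per Level-0 column (N_ADC = m)."""
    n_conv = sum((m * n) // 4 ** level for level in range(n_lev))
    t_gp = n_conv * adc_cycles / m / f_clk   # charge redistribution neglected
    return 1.0 / (t_gp + t_capture)


print(round(frame_rate(640, 480)), round(frame_rate(6400, 4800)))
```

Under these assumptions the script gives about 492 fps for a VGA Level 0 and about 60 fps for 6400×4800, in line with the rounded values in the text. The GP generation time grows only tenfold for a hundredfold increase in pixel count because the number of column ADCs grows with the number of columns.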
V. ENERGY ANALYSIS COMPARISON
The energy analysis is more complicated than the time analysis: it is highly dependent on the architecture, technology parameters are of major importance, and there is no global parameter playing the role that the clock period played in the time analysis. Also, aside from the stages necessary for GP generation in each approach, both architectures must include controlling circuits outside the pixel matrix, which are responsible for the interface between the stages shown in Fig. 2. Although these circuits play an important part in the energy consumption, a proper energy analysis of the controlling circuitry requires a careful design of this stage, which is outside the scope of this paper, so these circuits are not considered.
For the ADC stage, the energy consumption depends on the type of converter and architecture. A general empirical analysis of the energy efficiency of ADC architectures can be found in [64]. That work defines a lower bound for the energy consumption per sample equal to $2^{2(\mathrm{ENOB}-9)}$ and states that lowering the resolution below nine bits results in minor advantages. Taking the nine-bit value of this bound, the minimum energy per sample in our eight-bit case would thus be about 1 pJ/Sa. Although it is important to have this lower bound, it is also interesting to consider converters that have actually been used in image sensors. As mentioned in Sec. IV-C, several references were used for finding representative values of conversion rate and energy consumption for each ADC architecture. The median energy consumption per sample for each type of ADC, which can be seen in Fig. 6, is used in this section.
Aside from the ADC, the other sources of energy consumption can be divided into: DC consumption, $E_{DC}$, when there is a constant current flowing, usually for biasing circuits; dynamic consumption, $E_{Dynamic}$, as a result of the circuit activity, which requires charging and discharging capacitive nodes of the circuit; static consumption, $E_{Static}$, which is the energy that a transistor consumes even when it is off, depending on the leakage current $I_{leak}$; and short-circuit consumption, $E_{Shortcircuit}$, which is another source of dynamic energy and occurs when the inputs of a logic gate switch, at a moment when both n-channel and p-channel transistors are on, thus allowing a short-circuit current to flow. The short-circuit current can be minimized by matching the rise/fall times of the input and output signals, and it reaches at most 15% of the total dynamic consumption [65]. $E_{Shortcircuit}$ is computed as this portion of the dynamic energy: $E_{Shortcircuit} = 15\,(E_{Dynamic} + E_{Shortcircuit})/100 \rightarrow E_{Shortcircuit} = 15\,E_{Dynamic}/85$. In the following equations, $C_n$ is the node capacitance, $V_{ddM}$ is the pixel matrix voltage supply and $V_{dd}$ is the voltage supply outside the pixel matrix.
The dynamic power consumed by a digital circuit can be estimated by

$$P_{dynamic} = N_d \cdot C_n \cdot V_{dd}^{2} \cdot f_{0 \rightarrow 1},$$

where $N_d$ is the number of nodes and $f_{0 \rightarrow 1}$ is the switching frequency of the nodes from 0 to 1 [65]. This equation is found considering that every node in the digital circuit is capacitive and that the energy necessary to charge a capacitive node is equal to $C_n \cdot V_{dd}^{2}$. The switching frequency can be written as a function of the clock frequency: $f_{0 \rightarrow 1} = \alpha f_{clk} = \alpha / \tau_{Clk}$, where $\alpha$ is called the switching activity factor and represents the probability of a node switching from 0 to 1, resulting in $P_{dynamic} = \alpha \cdot N_d \cdot C_n \cdot V_{dd}^{2} / \tau_{Clk}$. The energy is given by $P_{dynamic}$ multiplied by the time during which the circuit operates:

$$E_{dynamic} = \alpha \cdot N_d \cdot C_n \cdot V_{dd}^{2} \cdot \frac{\tau_{total}}{\tau_{Clk}}.$$

In our case, $\tau_{total}$ can be computed according to the time analysis presented in Sec. IV.
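The dynamic and short-circuit energy estimates can be written as two small Python functions. This is a sketch only: the function names are hypothetical, the operating time is given as a number of clock cycles ($\tau_{total}/\tau_{Clk}$), and the numeric inputs in the usage lines are made-up values chosen purely for illustration.

```python
def dynamic_energy(alpha, n_nodes, c_node, v_dd, total_cycles):
    """E_dynamic = alpha * N_d * C_n * V_dd^2 * (tau_total / tau_Clk)."""
    return alpha * n_nodes * c_node * v_dd ** 2 * total_cycles


def short_circuit_energy(e_dynamic, share=0.15):
    """Short-circuit energy when it makes up `share` of the total switching energy."""
    return share * e_dynamic / (1.0 - share)


# Hypothetical numbers, for illustration only.
e_dyn = dynamic_energy(alpha=0.5, n_nodes=1000, c_node=2e-15, v_dd=1.8,
                       total_cycles=163200)
e_sc = short_circuit_energy(e_dyn)
print(e_dyn, e_sc, e_sc / (e_dyn + e_sc))
```

The last printed value is the short-circuit share of the total switching energy, which equals the assumed 15% by construction and corresponds to the 15/85 factor applied to the dynamic energy.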
An SRAM memory is considered for the energy analysis of the digital circuit. The schematic diagram of a one-bit cell of this memory is shown in Fig. 8. The memory has the same size as the Level 0 image in the pyramid, M×N, and each pixel is represented with $N_{bits}$ bits. To read a value from the memory, we select the memory row using the switch WL and read the result on the BL bus. Writing requires selecting a memory cell through the WL switches and setting Write to zero, which closes transistor $M_1$ or $M_2$, depending on the bit that is being written, $W_{bit}$. If $W_{bit}$ is logical zero, transistor $M_2$ closes and the bias current generated by $V_{bias}$ discharges the bitline BL. If $W_{bit}$ is logical one, transistor $M_1$ closes and the bias current discharges the complementary bitline $\overline{BL}$, thus charging BL.
A. Focal Plane
Except for the A/D conversion stage, which was explained at the beginning of the section, the steps considered for the energy consumption estimation are described next. As opposed to the time analysis, here we have to consider the image capture and readout steps because the pixel matrix size has an influence on the consumption.
1) Image capture: this operation involves, for each pixel, charging the floating diffusion node and operating the Reset and TX switches, shown in Fig. 3. Dynamic: the energy for capturing a single pixel can be estimated as the energy necessary for charging three capacitances, $E_{pixCapture} = (C_{FD} + C_{Rst} + C_{TX}) \cdot V_{ddM}^2$. Since this operation happens for every pixel of the matrix, $E_{capture} = 2M \cdot 2N \cdot E_{pixCapture}$. The capacitances $C_{FD}$, $C_{Rst}$ and $C_{TX}$ can be replaced by the node capacitance $C_n$, thus $E_{capture} = 12 \cdot M \cdot N \cdot C_n \cdot V_{ddM}^2$.
2) Charge redistribution: this operation is passive, but energy is necessary to close the switches that connect the floating diffusion nodes. Dynamic: the energy needed to control two switches per pixel, $2C_n \cdot V_{ddM}^2$, must be multiplied by the number of times the charge redistribution is performed (from Sec. IV-A), which depends on the filter size $n_k$, and by the size of the pixel matrix, since the operation is performed throughout the entire matrix.
3) Image readout: reading a pixel requires closing the row select switch and enabling the current source that biases the source follower. This current flows for the time necessary to charge the pixel matrix column capacitance. Dynamic: the gate of transistor $M_4$, from Fig. 3, is connected to a bus shared with every other select transistor of the same row of the matrix, so the equivalent capacitance is estimated as $2M \cdot C_n$. The pixel matrix column capacitance, on the other hand, depends on the number of rows and is estimated as $2N \cdot C_n$. The dynamic energy is thus $E_{pixelReadDynamic} = (2M + 2N) \cdot C_n \cdot V_{ddM}^2$. The pixel matrix column capacitances are charged whenever a pixel is read. The number of times a pixel is read is equal to $N_{conv}$, defined in Sec. IV-A. The row select switch is activated every time the image is read, once for each row, thus $N_{conv}/M$ times. The total readout energy adds up these contributions.
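As a minimal sketch of the capture and readout estimates given for the focal-plane case, the expressions can be evaluated as below. The function names and the example values for $M$, $N$, $C_n$ and $V_{ddM}$ are assumptions made for illustration; they are not the values of Tab. II.

```python
def pixel_capture_energy(c_n, v_dd_m):
    """Energy to charge C_FD, C_Rst and C_TX, each approximated by C_n."""
    return 3.0 * c_n * v_dd_m ** 2


def focal_plane_capture_energy(m, n, c_n, v_dd_m):
    """E_capture = 2M * 2N * E_pixCapture for the 2M x 2N pixel matrix."""
    return (2 * m) * (2 * n) * pixel_capture_energy(c_n, v_dd_m)


def pixel_read_dynamic_energy(m, n, c_n, v_dd_m):
    """E_pixelReadDynamic = (2M + 2N) * C_n * V_ddM^2 (row bus plus column bus)."""
    return (2 * m + 2 * n) * c_n * v_dd_m ** 2


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    m, n, c_n, v_dd_m = 320, 240, 1e-15, 3.3
    print(f"E_capture          = {focal_plane_capture_energy(m, n, c_n, v_dd_m):.3e} J")
    print(f"E_pixelReadDynamic = {pixel_read_dynamic_energy(m, n, c_n, v_dd_m):.3e} J")
```

Passing the matrix dimensions as parameters makes it easy to see how the capture term scales with the pixel count while the per-pixel readout term scales with the bus lengths.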
B. Digital
For the digital approach, we have the following steps:
1) Image capture: following the same analysis as in the focal-plane case, but changing the image size, yields
2) Image readout: also very similar to the focal plane, but the bus capacitance changes and the image is read only once,
3) MAC operation: the digital processor considered is a MAC unit formed by a logic adder and a shift register. Dynamic: the energy consumed by a digital circuit was derived at the beginning of this section. In the case of the MAC operation, the time during which the circuit operates is $N_{op} \cdot 3\tau_{op}$ (according to Sec. IV-B), which takes the place of $\tau_{total}$ in that derivation. Static: the static consumption depends on the overall number of transistors inside the digital gates, considering that half of the transistors inside a common logic gate are off.
4) Memory read: Dynamic: the memory is accessed for reading $2 \cdot N_{op}$ times, according to Sec. IV-B. The activity factor $\alpha$ is only necessary for the BL bus and accounts for the cases where the bus voltage does not change when closing WL. The WL switch remains closed while the reading is performed and opens right after, so there is no activity factor in this case. Static: from Fig. 8, inside a one-bit memory cell, each inverter has one n-channel transistor and one p-channel transistor. Regardless of the state of the memory, one p-channel transistor and one n-channel transistor are off. Besides, the WL switches can each be formed by one n-channel transistor, which is off most of the time. Thus, $E_{readStatic} = 4 \cdot V_{dd} \cdot I_{leak} \cdot \tau_{Digital}$. Short-circuit: $E_{readShortcircuit} = 15E_{readDyn}/85$.
5) Memory write: writing a single value in the memory requires more energy than reading a single position because the bias current is activated and the write control circuits are used. Dynamic: the capacitances involved are $C_{Wbit}$, the capacitance of the input $W_{bit}$ of the control circuit; $C_{Write}$, the capacitance of the node Write; and $C_n$, the gate capacitance of either $M_1$ or $M_2$, which are complementary nodes, so only one capacitance is considered. The number of times the memory is accessed for writing is $N_{MemWrite} = M \cdot N + N_{op}$, from Sec. IV-B. Static: the static power consumption is only due to the contribution of the write control circuit, because the cell circuit contribution was provided in item (4). Transistors $M_1$ and $M_2$ are on only when a bit is written, so we assume that they contribute to the static consumption during the time in which they are off. DC: the bias current, which is activated whenever a bit must be swapped in the desired writing position, flows only for the time necessary to discharge the bus capacitance, $E_{MemDC} = \alpha \cdot N_{MemWrite} \cdot V_{dd} \cdot I_{biasMem} \cdot \tau_{Clk}/10$, where $\tau_{Clk}$ is divided by ten to model the capacitance discharge time, which is significantly shorter than the clock period. The activity factor is necessary to account for the cases where the cell bit that is being written does not change.
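The memory terms that are stated in closed form above (the number of write accesses, the DC write energy and the static read energy) can be sketched as follows. The function names and the example values are assumptions introduced for illustration only, and the remaining memory terms are not modelled here.

```python
def memory_write_count(m, n, n_op):
    """N_MemWrite = M * N + N_op (image storage plus one write per operation)."""
    return m * n + n_op


def memory_write_dc_energy(alpha, n_mem_write, v_dd, i_bias_mem, tau_clk):
    """E_MemDC = alpha * N_MemWrite * V_dd * I_biasMem * tau_Clk / 10."""
    return alpha * n_mem_write * v_dd * i_bias_mem * tau_clk / 10.0


def memory_read_static_energy(v_dd, i_leak, tau_digital):
    """E_readStatic = 4 * V_dd * I_leak * tau_Digital (four off transistors)."""
    return 4.0 * v_dd * i_leak * tau_digital


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    n_w = memory_write_count(m=320, n=240, n_op=100_000)
    e_dc = memory_write_dc_energy(alpha=0.2, n_mem_write=n_w, v_dd=1.2,
                                  i_bias_mem=10e-6, tau_clk=1e-8)
    print(f"N_MemWrite = {n_w}, E_MemDC = {e_dc:.3e} J")
```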
C. Energy Comparison
To compare the focal-plane and digital approaches, we use the values shown in Tab. II. Node capacitance, voltage supply, leakage and memory bias current were established by means of simulations with a 110 nm CMOS technology. The clock frequency determines the static energy consumption: 100 MHz is arbitrarily chosen, in line with clock frequencies reported in some papers. The activity factor is $0 < \alpha \leq 1$ [65]. Two values were chosen for $\alpha$ to give an idea of how the energy changes with it. An activity factor closer to one benefits the focal-plane approach. The converter energies are the median energy consumption values from Fig. 6.
Aside from the values defined in the table, it is also necessary to estimate the number of nodes of the MAC unit circuit. An example of a two-bit adder with carry and an eight-bit shift register is shown in Fig. 9. From the figures, we deduce that an $N_{bits}$-bit adder requires at least $4 + 7 \cdot (N_{bits} - 1)$ nodes and the $N_{bits}$-bit shift register at least $N_{bits}$ nodes. Thus, a single PE of our MAC unit can be implemented with $(8 \cdot N_{bits} - 3)$ nodes. The flip-flop from Fig. 9 actually requires more nodes, but we assume $N_{bits}$ nodes as an optimistic estimate, which benefits the digital approach.
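The node count of a single PE follows directly from the two lower bounds given above. The short sketch below, whose function names are hypothetical, simply adds the adder and shift-register estimates and checks that the sum equals $8 \cdot N_{bits} - 3$.

```python
def adder_nodes(n_bits):
    """Lower bound on nodes of an N_bits-bit adder: 4 + 7 * (N_bits - 1)."""
    return 4 + 7 * (n_bits - 1)


def shift_register_nodes(n_bits):
    """Optimistic estimate for the shift register: one node per bit."""
    return n_bits


def mac_pe_nodes(n_bits):
    """Nodes of one PE of the MAC unit (adder plus shift register)."""
    return adder_nodes(n_bits) + shift_register_nodes(n_bits)


if __name__ == "__main__":
    for bits in (2, 8, 16):
        print(bits, mac_pe_nodes(bits), 8 * bits - 3)
```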
Determining the memory node capacitances is also necessary for the comparison. The capacitance of the node Write, $C_{Write}$, is equal to $2C_n$, since Write is connected to two logic gate inputs. For the bit capacitance, considering that it is connected to a column bus, $C_{Wbit} = N \cdot C_n$. The bitline capacitance also depends on the number of rows, $C_{BL} = N \cdot C_n$. The wordline capacitance depends on the number of memory matrix columns: $2 \cdot N_{bits} \cdot M \cdot C_n$.
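These four capacitance estimates can be collected in one helper, as in the sketch below. The function name, the dictionary keys (in particular `C_WL` for the wordline, which is not named in the text) and the example dimensions are assumptions made for illustration.

```python
def memory_node_capacitances(m, n, n_bits, c_n):
    """Return the memory node capacitances as multiples of the node capacitance C_n."""
    return {
        "C_Write": 2 * c_n,            # Write drives two logic gate inputs
        "C_Wbit": n * c_n,             # W_bit is connected to a column bus
        "C_BL": n * c_n,               # bitline spans the N rows
        "C_WL": 2 * n_bits * m * c_n,  # wordline spans all memory columns
    }


if __name__ == "__main__":
    # Hypothetical dimensions; c_n = 1 expresses results in units of C_n.
    for name, value in memory_node_capacitances(m=320, n=240, n_bits=8, c_n=1).items():
        print(f"{name} = {value} C_n")
```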
Considering the values from Tab. II, $\alpha = 0.2$ and 640 converters for both approaches, the focal-plane approach requires 33 times less energy than the digital approach when the ramp converter is used. For the SAR, cyclic and ΣΔ converters, the focal plane is around 52 times more energy-efficient. For the pipeline converter, the focal-plane approach is 24 times more energy-efficient. With $\alpha = 0.8$, there is a modest increase in the advantage of the focal plane: it is 34, 54 and 25 times more energy-efficient for the ramp, SAR (also cyclic and ΣΔ) and pipeline converters, respectively. It is interesting to see the effect of a capacitance increase on the result. Since most of the nodes considered for the analysis are connected to metal input or output lines, the metal parasitic effects would probably result in capacitances higher than the ones considered. Figure 10 shows how the ratio between digital energy consumption and focal-plane energy consumption varies as the $C_n$ of the nodes connected to metal lines increases. The activity factor used in this plot is 0.2.
Let us consider, for example, that we use the ADC presented in [30]. This is a column-parallel SAR ADC that, normalized to eight bits, consumes 14.6 pJ per sample, with an ADC clock frequency of $1/\tau_{ClkSAR} = 5.6$ MHz. Under these conditions, the focal-plane approach takes 911 μs to generate the GP. If we use 10 PEs in the digital approach, then the focal plane is 26 times faster. The energy consumed with the focal-plane approach is around 23 μJ, which makes it 49 times more energy-efficient than the digital approach.
VI. CASE STUDY: SIFT ALGORITHM
The first step of the scale invariant feature transform (SIFT), which is an object recognition algorithm, is multiple-scale image representation [66]. First, the image is filtered n times with Gaussian kernels, thus creating the first octave. The image from the middle of the octave is then copied and subsampled. The resulting image is filtered with the same kernels of the first octave, thus generating the second octave. The procedure is repeated until the target number of octaves is obtained. A difference of Gaussian (DoG) is performed afterwards in order to create a scale-normalized Laplacian of Gaussian ($\sigma^2 \nabla^2 G$) representation of the image. Points of interest are then searched throughout the scales of the Laplacian scale-space pyramid representation.
With the proposed hardware, it is possible to generate a scale space that can be used by the SIFT without a significant performance drop [21]. First, we capture the image and group the pixels into 2×2 pixel blocks. After sampling and quantization, the result is the first image from the first octave of the scale space. We then change the grid and obtain the second scale-space image. This kernel is a good approximation of the Gaussian kernel with standard deviation $\sigma_{filter} = \sigma_1 = 0.5$. By changing the grid again, we perform a second filtering operation, which results in the third image from the scale space. The resulting standard deviation is $\sigma_2 = \sqrt{\sigma_1^2 + \sigma_{filter}^2} = 0.707$. The ratio of the standard deviations of adjacent scale-space filters must be kept constant [21], $k = \sigma_2/\sigma_1 = \sqrt{2}$. Consequently, the next image must be the result of filtering with a kernel with standard deviation equal to $k \cdot \sigma_2 = 1$. This is achieved by using the binomial kernel twice: $\sqrt{\sigma_2^2 + \sigma_{filter}^2 + \sigma_{filter}^2} = 1$, which leads to the fourth image from the scale space. The next octave is computed after all the images from the previous octave are generated, by grouping the pixels into 4×4 blocks and repeating the filtering procedure.
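The standard deviations quoted above follow from the fact that cascading Gaussian filters adds their variances. The sketch below reproduces the three values of the first octave; the function name is hypothetical and the filter is idealized as an exact Gaussian with $\sigma_{filter} = 0.5$, whereas the hardware kernel is only an approximation of it.

```python
import math


def cascade_sigma(sigma_prev, sigma_filter, passes=1):
    """Standard deviation after applying `passes` extra filters: variances add."""
    return math.sqrt(sigma_prev ** 2 + passes * sigma_filter ** 2)


if __name__ == "__main__":
    sigma_filter = 0.5
    sigma_1 = sigma_filter                               # second scale-space image
    sigma_2 = cascade_sigma(sigma_1, sigma_filter)       # third image, one more pass
    sigma_3 = cascade_sigma(sigma_2, sigma_filter, 2)    # fourth image, two more passes
    print(f"sigma_1 = {sigma_1:.3f}, sigma_2 = {sigma_2:.3f}, sigma_3 = {sigma_3:.3f}")
    print(f"k = sigma_2 / sigma_1 = {sigma_2 / sigma_1:.3f}")
```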
System-level simulations show that the results achieved with the proposed hardware implementation are similar to those obtained with the original approach. These simulations were run using the database from [67] and OpenCV SIFT libraries. By computing original image keypoints and comparing them with transformed image keypoints, we evaluate whether the proposed keypoint method is robust to those transformations. This evaluation measure is denoted as repeatability. Table III shows repeatability results for the original, fully digital, method and for the proposed, focal-plane, method. The original method parameters are: three octaves, six scales per octave, 0.04 for the contrast threshold (which is used for removing weak features), and 10 for the edge threshold (which is used for filtering edge-like features). For the focal-plane method, we also have three octaves, but four scales per octave, 0.05 for the contrast threshold (more selective), and the same edge threshold. As can be seen in Tab. III, the systems yield similar results, which validates the focal-plane hardware scale-space implementation for SIFT.
The same time and energy analysis carried out in Secs. IV and V can be extended to scale-space generation. In this case, the image does not change resolution after each filtering operation (more convolutions are performed at the focal plane) and some specific images must be sampled. The conclusions remain the same: the scenario in which the focal-plane approach shows the greatest advantage is the one in which fast converters are used, with one data converter per column. The time equations obtained from the scale-space analysis using the ideas presented in Sec. IV are:
where the number of scales is $N_{scales}$ (greater than or equal to 2), and the number of octaves is $N_{oct}$. Within each octave, the number of charge redistribution operations is $2^{N_{scales}-2}$.
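The count of charge redistribution operations per octave, read as $2^{N_{scales}-2}$, can be checked against the filtering procedure of this section with the sketch below. It assumes an idealized filter with $\sigma_{filter} = 0.5$, a first image obtained without filtering, and a constant ratio $k = \sqrt{2}$ between adjacent scales; the function name is hypothetical.

```python
def redistributions_per_octave(n_scales, sigma_filter=0.5):
    """Count filter passes needed so that adjacent scales differ by k = sqrt(2)."""
    if n_scales < 2:
        raise ValueError("n_scales must be greater than or equal to 2")
    var_filter = sigma_filter ** 2
    variance = var_filter   # variance of the second image, after one pass
    total = 1               # the pass that produces the second image
    for _ in range(n_scales - 2):
        target = 2.0 * variance  # scaling sigma by sqrt(2) doubles the variance
        total += round((target - variance) / var_filter)
        variance = target
    return total


if __name__ == "__main__":
    for n_scales in range(2, 7):
        print(n_scales, redistributions_per_octave(n_scales), 2 ** (n_scales - 2))
```

For the four scales used in the focal-plane method this gives one, one and two passes between consecutive images, matching the procedure described for the first octave.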
VII. CONCLUSION
Sensors with embedded per-pixel processors have long been advocated as critical for increasing speed and decreasing energy consumption of vision hardware. These claims rely on two conceptual pillars: on the one hand, analog processing is known to have larger energy efficiency than digital for applications with moderate SNR requirements; on the other hand, sensor pre-processing features data compression at the sensor, thus relaxing bandwidth and storage requirements. The analyses carried out in this paper show that these potential advantages are case-specific. These analyses are carried out for a vision primitive which is commonly employed in computer vision, namely the image pyramid. The computation of GPs can be accelerated by employing a non-conventional sensor front-end with extra per-pixel circuitry to perform spatial filtering. When comparing this approach with the use of a conventional sensor, without embedded pre-processing, followed by a conventional processor, a bottleneck of the former is found in the required number of analog-to-digital conversions. Different image sensor ADCs are considered in the paper with the goal of finding values for conversion rate and energy consumption that can be used for comparison purposes, taking into account each ADC type. Regarding processing time, results show that the non-conventional sensor architecture requires fast ADCs, ideally one ADC per column, to report significant advantages. Regarding energy savings, the non-conventional architecture yields the best results with SAR, cyclic or ΣΔ topologies. To reach that conclusion, we consider state-of-the-art experimental median figures regarding ADC energy consumption. Considering specific cases, the best case for energy savings is when the single-slope converter from [36] is used.
By way of example, analysis using a column parallel SAR ADC with 14.6 pJ/sample shows that the architecture with pre-processing sensor can be 26 times faster and 49 times more energy-efficient than the digital approach with 10 PEs. The methodology presented in this paper allows for a quantitative estimation of the advantages that focal-plane processing might bring about. This is an interesting tool for imager designers to understand, before implementation, the strengths of the proposed focal-plane processing techniques. 
