Abstract: This paper presents a massively parallel processing array designed for the 0.13-μm 1.5-V standard CMOS base process of a commercial 3-D through-silicon via stack. The array, which will constitute one of the fundamental blocks of a smart CMOS imager currently under design, implements isotropic Gaussian filtering by means of a MOS-based RC network. Alternatively, this filtering can be made anisotropic by a very simple voltage comparator between neighboring nodes whose output controls the gate of the elementary MOS resistor. Anisotropic diffusion enables image enhancement by removing noise and small local variations while preserving edges. A binary edge image can also be obtained by combining the outputs of the voltage comparators. In addition to these processing capabilities, simulations have confirmed the robustness of the array against process variations and mismatch. The power consumption extrapolated for a VGA-resolution array processing images at 30 fps is 570 μW.
I. INTRODUCTION
THREE-DIMENSIONAL IC technologies [1], [2] are changing the way in which circuit design has traditionally been approached. The development of new techniques, from the transistor level up to system architecture, that take advantage of vertical across-chip interconnections will dramatically boost the performance of the targeted functionality. In particular, with regard to smart CMOS imagers based on focal-plane sensing-processing [3], the availability of through-silicon vias (TSVs) removes the tradeoff arising when it comes to allocating silicon area for sensors and processors on the same plane. A top sensor layer can now be integrated and vertically interconnected with other layers exclusively dedicated to processing. Consequently, smart imagers with fill factors close to 100% and very high resolution can be achieved. When compared with planar implementations [4], [5], the absence of photosensitive devices at the processing layers releases a significant amount of area to be occupied by processing circuitry. However, this circuitry represents a major source of power consumption, particularly for early vision tasks, where the lattice of processing elements must ideally have a similar resolution to that of the sensor layer. The design of ultralow-power building blocks is therefore mandatory to exploit the additional computational power provided by the vertical integration without sharply increasing the power consumption. This paper focuses on this crucial issue. We present a massively parallel processing array for image enhancement and edge detection designed for the 0.13-μm 1.5-V standard CMOS base process of the TSV stack commercialized by Tezzaron Semiconductor. The backbone of the array is a MOS-based RC network carrying out isotropic diffusion [6]. On this, we incorporate a low-power time-controlled voltage comparator between neighboring nodes.
The output of each comparator enables diffusion only between those nodes whose difference is less than a programmable threshold. Anisotropic filtering is thus performed, preserving large pixel differences (that is, edges) while suppressing noise and small local variations. In addition, the outputs of the comparators are combined in order to deliver a binary edge image. The array features no static consumption apart from leakage currents. A dynamic consumption of 570 μW for VGA resolution at 30 fps is extrapolated from simulation. Finally, the proposed circuitry, which will be part of a smart CMOS imager providing additional early vision capabilities, can be reprogrammed to accommodate global process variations and is also robust against mismatch.
II. RELATED WORK
The primary drawback of multiscale image description based on isotropic diffusion was clearly spelled out in [7]. Gaussian filtering does not distinguish between natural boundaries of objects and essentially flat regions containing only noise or textures. As a result, the edges are shifted as coarser scales are generated, preventing them from being accurately located. In order to solve this issue, a new definition of the scale-space representation is proposed in [7]. It is also based on the diffusion equation, but the diffusion coefficient is now chosen to vary spatially at any scale, that is,
$$\frac{\partial V(x, y, t)}{\partial t} = \nabla \cdot \left[ D(x, y, t)\, \nabla V(x, y, t) \right] \quad (1)$$
where V(x, y, t) is a brightness function defined over a continuous plane and D(x, y, t) represents the spatially variant diffusion coefficient. The point is to tune this coefficient across the plane in such a way that intraregion filtering is given priority over filtering across region boundaries.
Resistive networks [8] constitute the basis for most of the VLSI implementations of this content-aware multiscale representation. These networks render a distribution of either currents or voltages that is equivalent to applying a diffusion process during a certain time interval over the input sources. In order to achieve anisotropic filtering, resistive fuses are introduced. These circuit elements behave as resistances of value R only when the voltage difference across their terminals is less than a certain threshold V_off. Otherwise, they behave as open circuits. By adjusting V_off, edges are prevented from being filtered while small brightness variations undergo smoothing. A spatially variant diffusion coefficient is thus emulated. There are, however, two significant problems associated with the first practical implementations of resistive fuses reported [9]-[11]. First, they rely on subthreshold operation. This makes them very dependent on the characteristics of the process, parameter variations, and mismatch. Second, this dependence causes, in turn, a very restricted range for R and V_off, greatly constraining the set of filters attainable. These aspects are solved differently in [12] and [13]. Discrete-time switched-capacitor techniques are applied in [12] in order to implement the horizontal resistors of a resistive network. The control of the amount of charge exchanged between neighboring nodes enables tunable filtering. V_off also features a wider voltage swing. In [13], the image pixels are represented by means of currents. The current difference between neighboring nodes is compared with a programmable threshold. The binary output of this comparison controls the gate of a single transistor acting as the horizontal resistor of a resistive network. The amount of smoothing realized can be adjusted through the common-mode input level of the pixels.
Despite these successful implementations in terms of functionality, resistive networks present a major drawback: their static consumption. The input sources must continuously inject current into the grid in order to get the filtering done. Alternatively, Gaussian filtering can also be achieved by an RC network like that of Fig. 1. A real diffusion process takes place within the network. An uneven charge distribution at the capacitors is diffused across the network over time at a pace determined by the time constant τ = RC. We demonstrated in [6] that an accurate approximation of an ideal RC network can be obtained by substituting every resistor with a MOS transistor biased in the ohmic region. Moreover, the value of the pixels can be easily mapped to the initial conditions of the capacitors and, without any additional energy contribution, the network will carry out isotropic filtering. To make this filtering anisotropic, the MOS resistors must be turned into resistive fuses. To accomplish this, we propose a time-controlled voltage comparator based on a differential pair. The inputs of this comparator correspond to the neighboring nodes linked by a MOS resistor. Its binary output, together with a global diffusion enable signal, determines whether that resistor is activated or turned off. A time-controlled elementary resistive fuse is thus emulated. We must remark at this point that the MOS-based RC network for anisotropic filtering proposed in [13] works quite differently. It makes intensive use of current-mode operation, leading again to high static consumption.
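The behavior of an RC grid whose resistors act as time-controlled resistive fuses can be sketched numerically. The following Python fragment is only an idealized illustration, not the circuit itself: it assumes linear resistors, explicit-Euler integration, a normalized time constant, and fuse states decided once from the initial node voltages before diffusion is enabled. The function name `diffuse` and all numeric values are hypothetical.

```python
def diffuse(v, tau, t_total, v_off=None, dt=None):
    """Explicit-Euler integration of a 4-connected RC grid (idealized sketch).

    v       : list of rows of node voltages (pixel values as initial conditions)
    tau     : RC time constant of one resistor/capacitor pair
    t_total : interval during which diffusion is enabled
    v_off   : fuse threshold; None means every resistor conducts (isotropic)
    """
    rows, cols = len(v), len(v[0])
    dt = dt or tau / 20.0                      # small step keeps Euler stable
    steps = max(1, round(t_total / dt))
    dt = t_total / steps

    def conducts(a, b):
        # Assumption: a fuse conducts only if the initial difference is below v_off.
        return v_off is None or abs(a - b) < v_off

    east = [[conducts(v[i][j], v[i][j + 1]) for j in range(cols - 1)]
            for i in range(rows)]
    south = [[conducts(v[i][j], v[i + 1][j]) for j in range(cols)]
             for i in range(rows - 1)]

    v = [row[:] for row in v]                  # do not modify the caller's image
    for _ in range(steps):
        dv = [[0.0] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols - 1):
                if east[i][j]:
                    flow = (v[i][j + 1] - v[i][j]) * dt / tau
                    dv[i][j] += flow           # charge leaving one node
                    dv[i][j + 1] -= flow       # enters its neighbor
        for i in range(rows - 1):
            for j in range(cols):
                if south[i][j]:
                    flow = (v[i + 1][j] - v[i][j]) * dt / tau
                    dv[i][j] += flow
                    dv[i + 1][j] -= flow
        for i in range(rows):
            for j in range(cols):
                v[i][j] += dv[i][j]
    return v


# Two flat regions separated by a vertical edge, with small local variations.
img = [[0.80, 0.82, 1.30, 1.28],
       [0.81, 0.80, 1.29, 1.30]]
aniso = diffuse(img, tau=1.0, t_total=0.4, v_off=0.25)   # edge preserved
iso = diffuse(img, tau=1.0, t_total=0.4)                 # edge blurred
```

Because each resistor only moves charge between its two capacitors, the sum of the node voltages is unchanged by the update, which mirrors the absence of any external energy contribution during diffusion.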
III. ELEMENTARY CELL
Consider Fig. 2. It corresponds to the elementary cell of the network in Fig. 1 after substituting the elementary resistors interconnecting neighboring cells with MOS transistors biased in the ohmic region and incorporating additional circuitry to achieve anisotropic diffusion. Notice that the south and east connections suffice to make up the entire grid. Cells located at the bottom and rightmost edges will not include south and east connections, respectively. The key component of this elementary cell is the voltage comparator. It outputs a logic "1" when the absolute value of the difference between its input voltages exceeds a certain programmable threshold V_off. This turns the corresponding transistor off, regardless of the logic value of the global active-low diffusion enable signal DIFF_EN. If the output of the comparator is "0," diffusion between the neighboring nodes will take place during the time interval t in which DIFF_EN is active, that is, set to "0." The outputs of the comparators are also combined in order to obtain a binary edge image represented by V_ij^be. We propose to implement the voltage comparator as shown in Fig. 3(a). It is based on a differential pair where the current-to-voltage conversion is not carried out as usual, that is, with resistive or MOS loads. In order to reduce the power consumption, the conversion takes place on the capacitors C_p. They are first precharged by turning on the pMOS switches connected to V_DD for T_p seconds. Subsequently, they are discharged during a certain time interval T_d. The rate of discharge of each capacitor is determined by the current flowing through each branch of the differential pair. Small differences between V_in1 and V_in2 will imply small differences between V_p1 and V_p2 at the end of the discharge, whereas markedly different input voltages will cause large differences between V_p1 and V_p2.
By adequately adjusting T_d through the control signal ctrl, both V_p1 and V_p2 can be kept above the input threshold voltage of the XOR gate for differences between the input voltages less than V_off. In such a case, the output of the XOR gate is set to "0." For differences larger than V_off, either V_p1 or V_p2 will remain above the gate input threshold, whereas the other one will fall below it, setting the output to "1." Notice that this output is latched in order to avoid the effects of leakage over
Let us analyze how V_off can be adjusted. Consider a signal range of [0.75 V, 1.5 V] for the pixel values. This range is chosen according to the criteria described in [6]. Bear in mind that the pixels are represented by the voltages V_ij, which, in turn, constitute the inputs of the voltage comparators across the network. Suppose a voltage comparator where V_in1 = 0.75 V and V_in2 = 0.8 V. The corresponding voltage difference, 0.05 V, is considered too small to belong to an edge. Consequently, the comparator must output a logic "0." However, for the same V_in1 but V_in2 = 1 V, the difference reaches 0.25 V. The existence of an edge is assumed now, and the comparator must therefore output a logic "1." Fig. 4 shows the precharge of the capacitors to V_DD for 30 ns and the subsequent discharge for different time intervals, namely, 40, 70, and 120 ns, in the two cases proposed earlier. In these simulations, V_DD is 1.5 V, C_p is 200 fF, and V_bias is 0.6 V, which translates into a bias current of 3.5 μA. We have used the transistor models in HSPICE provided by Tezzaron/Globalfoundries as well as standard cells of the technology. For the sake of reference, the straight line crossing each diagram corresponds to the input threshold voltage of the standard XOR gate. It can be seen that, for V_in1 = 0.75 V and V_in2 = 0.8 V, intervals of T_d = 40 ns and T_d = 70 ns still keep V_p1 and V_p2 above the gate input threshold, as targeted. However, an interval of T_d = 120 ns is too long since V_p2 reaches a voltage below that threshold, setting the output to "1." For V_in1 = 0.75 V and V_in2 = 1 V, a discharge of 40 ns is too short to highlight the pixel difference from the point of view of the XOR gate. As a result, the output is set to "0." The adequate output is attained for discharges of 70 and 120 ns. It can therefore be concluded that T_d = 70 ns achieves the targeted behavior for both cases.
Notice that, because of the intrinsic structure of the comparator, any V_in2 greater than 1 V (keeping V_in1 = 0.75 V and T_d = 70 ns) will make V_p2 go below the gate input threshold. As V_p1 always stays above it under these conditions, no matter the value of V_in2 considered, any pixel difference greater than 0.25 V will be processed as an edge. In order to demonstrate the programmability of V_off, we have set V_in1 = 0.75 V and then swept V_in2 from 0.8 up to 1.5 V in steps of 0.05 V. This sweep has been simulated for different discharge intervals, registering for each the smallest pixel difference from which the voltage comparator starts to output a logic "1." In other words, we have obtained the voltage V_off featured by the elementary MOS resistive fuse of the RC network for each interval. The result is that V_off ranges from 0.6 to 0.05 V for discharges between 50 and 120 ns, respectively. The effect of the parasitics over these intervals is negligible, according to the simulations of the extracted layout shown in Fig. 3(b). Due to the nonlinearity of the transistors, these values of V_off will undergo distortion when considering other possible combinations of input voltages, that is, other possible sweeps. However, this distortion can be compensated via calibration of V_off, as described in Section IV.
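The sweep used above to extract V_off can be expressed as a short procedure. The Python sketch below is hypothetical throughout: `xor_output` only models the logic decision of the XOR gate for a given pair of capacitor voltages, `measure_v_off` reproduces the sweep of V_in2 in 0.05-V steps, and the comparator passed to it is a stand-in with an ideal 0.25-V threshold, not a model of the differential pair. The 0.75-V gate threshold in the usage lines is likewise an assumed value.

```python
def xor_output(v_p1, v_p2, v_gate_th):
    """Logic value of the XOR gate: 1 when exactly one node is above threshold."""
    return int((v_p1 > v_gate_th) != (v_p2 > v_gate_th))


def measure_v_off(comparator, v_in1=0.75, v_start=0.80, v_stop=1.50, step=0.05):
    """Sweep V_in2 upward and return the smallest difference that yields a 1.

    comparator : callable (v_in1, v_in2) -> 0 or 1, e.g. a circuit simulation
    Returns None if the comparator never outputs 1 within the sweep.
    """
    n = round((v_stop - v_start) / step)
    for k in range(n + 1):
        v_in2 = v_start + k * step
        if comparator(v_in1, v_in2) == 1:
            return round(v_in2 - v_in1, 6)
    return None


# Stand-in comparator with an ideal threshold (assumption, for illustration only).
def ideal_comparator(v_in1, v_in2, v_off=0.25):
    return int(abs(v_in1 - v_in2) >= v_off - 1e-9)


v_off_found = measure_v_off(ideal_comparator)

# Gate decision for two illustrative end-of-discharge voltage pairs.
no_edge = xor_output(1.00, 0.90, v_gate_th=0.75)   # both above threshold
edge = xor_output(1.00, 0.50, v_gate_th=0.75)      # only one above threshold
```

In practice the `comparator` argument would wrap a transient simulation run at a given T_d, so that repeating the sweep for several discharge intervals yields the V_off versus T_d relation.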
Finally, we comment on the power consumption and area usage of the elementary cell just described. The efficiency of the MOS-based RC network was previously mentioned and experimentally proved in [6]. The energy cost of the digital cells included, on the order of tens of nanowatts per megahertz at most, is also very small for the typical frame rates of most vision applications. The main source of power consumption is the precharge of the capacitors at the voltage comparators. It must be noticed that the bias current simply makes the charge stored at these capacitors flow through the differential pair. No further energy contribution is involved in its operation. All in all, according to the simulations performed and the datasheet of the standard cells used, a pessimistic estimation of power consumption would be 1.85 nW per elementary cell at 30 fps. Extrapolating this figure to a VGA array, the total power consumption would be 570 μW at 30 fps. This calculation exclusively considers the power consumption associated with the processing capabilities of the array. We have therefore left out the energy cost of mapping the pixel values to the initial conditions of the capacitors in the RC network. Concerning the area usage, the capacitors finally implemented in the targeted CMOS smart imager are going to be smaller than those considered in this paper. These elements, with their current values, require the largest area by far. For example, in Fig. 3(b), each capacitor of 200 fF has been realized by using four MOS capacitors of dimensions 4.06 × 8.14 μm², leading to a total comparator area of 19.72 × 16.62 μm². Smaller capacitors, apart from possible mismatch considerations, simply mean changes in the timing of the control signals, making the processing dynamics faster.
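The extrapolation to VGA resolution is a direct product of the per-cell estimate and the number of cells. The short Python check below assumes the standard VGA format of 640 × 480 pixels with one elementary cell per pixel; the function name is illustrative.

```python
CELL_POWER_W = 1.85e-9          # pessimistic per-cell estimate at 30 fps


def array_power(rows, cols, cell_power_w=CELL_POWER_W):
    """Total power of a rows x cols array, assuming one elementary cell per pixel."""
    return rows * cols * cell_power_w


total_w = array_power(480, 640)             # VGA: 307 200 cells
print(f"{total_w * 1e6:.1f} uW")            # prints "568.3 uW"
```

The result, about 568 μW, rounds to the 570 μW quoted in the text.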
IV. SIMULATION OF A 32 × 32 ARRAY
In order to corroborate the results obtained for a single elementary cell, we have built a 32 × 32 array in HSPICE. A larger array was not possible due to the heavy memory and computational requirements of the simulations. In any case, as the binary edge image only depends on the immediate neighborhood of each pixel, a 256 × 256-px image was divided into 32 × 32-px subimages. Each subimage was mapped into the array, to which we added an additional row and column at the bottom and rightmost sides, respectively. This allows for taking into account the neighbors of the pixels at the edges of every subimage. The outcome can be seen in Fig. 5. The discharge interval was T_d = 70 ns. Notice that the binary edge image is available just after this discharge has finished, independent of a possible subsequent anisotropic diffusion. According to the simulations described in the previous section, a first estimation of V_off for such an interval would be 0.25 V. This leads to a percentage of false/missing edge locations of 2.52% with respect to an ideal array extracting edges without error. However, it is still possible to find the voltage V_off better emulated by the network. To this end, we have swept V_off from 0 up to 0.3 V in the ideal array. For each value of V_off, the output image is compared with the image provided by our nonideal array. The minimum error percentage, 0.68%, is achieved for V_off = 0.15 V. That is to say, the edge threshold implemented by our array, whose first approximation was 0.25 V, is really 0.15 V. Under mismatch conditions, making use of the MOSFET statistical models provided by the manufacturer, the minimum error is 2.78% for V_off = 0.14 V. Notice that this calibration of V_off can be easily performed off-chip as a step prior to further analysis of a scene. Once the edge detection functionality is confirmed, let us move on to the other one: image enhancement. The dynamics of the entire RC network is now involved in the final output.
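The off-chip calibration of V_off described above amounts to a one-dimensional search. The Python sketch below is a minimal illustration under stated assumptions: the ideal array marks a pixel as an edge when the difference with its east or south neighbor strictly exceeds V_off (the text does not specify how the comparator outputs are combined, so the OR is assumed), and the error is the percentage of pixels whose edge value differs. All function names and the small test image are hypothetical.

```python
def ideal_edges(img, v_off):
    """Binary edge map of an error-free array (east/south comparators OR-ed)."""
    rows, cols = len(img), len(img[0])
    edges = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            east = j + 1 < cols and abs(img[i][j] - img[i][j + 1]) > v_off
            south = i + 1 < rows and abs(img[i][j] - img[i + 1][j]) > v_off
            edges[i][j] = int(east or south)
    return edges


def error_percentage(a, b):
    """Percentage of pixels at which two binary edge maps disagree."""
    total = sum(len(row) for row in a)
    wrong = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return 100.0 * wrong / total


def calibrate_v_off(img, measured_edges, candidates):
    """Return (minimum error, V_off) over the candidate thresholds."""
    return min((error_percentage(ideal_edges(img, v), measured_edges), v)
               for v in candidates)


# Toy example: the "measured" map is generated with a 0.15-V threshold,
# while the first estimate of the threshold is 0.25 V.
img = [[0.8, 0.8, 1.0],
       [0.8, 0.8, 1.3]]
measured = ideal_edges(img, 0.15)
first_guess_error = error_percentage(ideal_edges(img, 0.25), measured)
best_error, best_v = calibrate_v_off(img, measured, [k * 0.05 for k in range(7)])
```

With a real output image in place of `measured`, the minimum would generally be nonzero, as with the 0.68% reported for V_off = 0.15 V.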
This prevents us from making use of images of higher resolution in the same way as for edge detection. The original 32 × 32-px noisy Lena image mapped into the array is first shown in Fig. 6. The second image corresponds to the output represented by the voltages V_ij after enabling anisotropic diffusion for t = 50 ns. The value of the elementary capacitor of the RC network simulated is 1 pF, whereas the dimensions of the elementary MOS resistor are 0.15/1. According to the design methodology described in [6], these elements render a time constant τ = 118 ns. Consequently, the width of the Gaussian filter applied is σ = √(2t/τ) = 0.92. It can be seen that this filter is adequate to remove the noise while preserving the contrast of the image. For the sake of comparison, we also show the resulting image after applying the same filter via isotropic diffusion. The noise is removed but at the cost of clearly worsening the contrast. Finally, we must remark that the proposed circuitry can easily accommodate global process variations by adjusting both t and T_d. The value of t will be determined by the equivalent resistance of the elementary MOS resistor at the point of the design space considered [6]. The adjustment of T_d at that same point will yield an estimation of V_off. We have tuned these parameters for the different corners of the technology in order to carry out the same processing as that of Fig. 6. The maximum rmse of any of the resulting images with respect to their counterpart for typical conditions is only 2.7%, for the "FF" corner, with t = 44 ns and T_d = 28 ns at this corner. Concerning mismatch variations, we have performed ten Monte Carlo simulations of the array. Again, we have compared the output images of each simulation with those of Fig. 6, finding a maximum rmse of 3.7%. These results demonstrate the flexibility and robustness of the design when it comes to addressing the unavoidable nonidealities of the manufacturing process.
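The filter width quoted above follows from the diffusion time and the time constant, and the corner and Monte Carlo comparisons rely on an rmse between images. The Python sketch below computes both; the relation σ = √(2t/τ) is the standard Gaussian width of a diffusion process expressed in grid units, and the normalization of the rmse by the 0.75-V signal range is an assumption, since the text does not state how the percentage is defined. Function names are illustrative.

```python
import math


def gaussian_sigma(t, tau):
    """Width (in pixels) of the Gaussian applied after diffusing for time t."""
    return math.sqrt(2.0 * t / tau)


def rmse_percent(img, ref, full_scale):
    """Root-mean-square error between two images, as a percentage of full scale."""
    n = sum(len(row) for row in ref)
    sq = sum((a - b) ** 2 for ra, rb in zip(img, ref) for a, b in zip(ra, rb))
    return 100.0 * math.sqrt(sq / n) / full_scale


sigma = gaussian_sigma(50e-9, 118e-9)       # t = 50 ns, tau = 118 ns
print(round(sigma, 2))                      # prints 0.92

# Assumed normalization: the [0.75 V, 1.5 V] signal range, i.e. 0.75 V full scale.
ref = [[0.80, 1.20], [1.00, 1.40]]
out = [[0.81, 1.19], [1.00, 1.42]]
err = rmse_percent(out, ref, full_scale=0.75)
```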
Finally, Table I summarizes the main reported features of other implementations of anisotropic filtering. It can be seen that the circuitry described in this paper presents a power consumption four orders of magnitude below the lowest consumption previously reported in the literature. However, it is not very competitive in terms of area usage in its current version. As mentioned previously, a much more area-efficient implementation will be finally incorporated in the targeted smart CMOS imager.
V. CONCLUSION
The implementation of sensing-processing vision chips in 3-D TSV technologies demands low-power building blocks to make the most of the additional computational power available without dramatically raising the power consumption. Focusing on this issue, we have presented an ultralow-power massively parallel processing array for image enhancement and edge detection. Its operation is based on anisotropic diffusion. In addition to its energy efficiency, the array stands out for its programmability and robustness against process variations.
