Abstract While conventional CMOS active pixel sensors embed only the circuitry required for photo-detection, pixel addressing and voltage buffering, smart pixels also incorporate circuitry for data processing, data storage and control of data interchange. This additional circuitry enables data processing to be realized concurrently with the acquisition of images, which is instrumental in reducing the amount of data needed to carry the information contained in images. This way, more efficient vision systems can be built at the cost of larger pixel pitch. Vertically-integrated 3D technologies make it possible to keep the advantages of smart pixels while improving their form factor.
I. INTRODUCTION
CMOS Image Sensors (CIS) are suited to embed sensing devices, signal conditioning, data conversion, communication and processing circuitry on a common semiconductor substrate [1] . This way, sensing, signal conditioning and processing happen concurrently at the sensor, thus enabling camera functions to be incorporated already at the image capture front-end and thereby paving the way towards camera systems with improved SWaP (Size, Weight and Power) factors [2] [3] . For illustration purposes, Fig.1 shows the block diagram of a smart CIS with on-chip, sensor-embedded image correction [4] . The embedded core handles digitized full frames to complete on-chip operations intended either to deliver corrected images (FPN-correction, PRNU-calibration, etc.) or to analyse images (filtering, convolutions, morphology, etc.). In any case, the goal is to reduce the requirements on, or even the need for, off-chip resources, thus simplifying the implementation of camera systems.
Circuit embedding can actually be made at different levels, namely: per-pixel [5] - [9] , per-column [10] , and per-chip [11] . The intensity of embedding at each level depends on the targeted applications. Thus, for instance, consumer applications seek the minimum possible in-pixel circuitry [12]- [14] , while high-end machine vision applications may call for larger amounts of in-pixel circuitry to increase the speed of extracting image features and of reacting to them [15] [16] . As a general rule of thumb, the incorporation of circuitry at pixel level enables images to be processed as they are acquired, thus increasing speed and reducing power consumption in the realization of vision tasks [17] - [21] . Sensors composed of these smart pixels mark the next evolutionary step of CMOS pixels, following passive pixels and active pixels [1] , by embedding within the pixel resources for mixed-signal processing, memory, and the programming and control of information flows.
Smart pixels are typically arranged according to the paradigm of Single Instruction Multiple Data (SIMD) computer architectures, and the sensors formed by them may be called CVIS (CMOS VIsion Sensors) to stress the fact that they not only acquire images (like CIS) but also complete early-vision tasks [22] to reduce the amount of data sent for subsequent processing. Cameras employing CVISs at the front-end are capable of analysing images at rates of thousands of frames per second and with very low power consumption. However, their pixels have large pitch (in the range of tens of µm) and reduced fill factor. These features give CVISs limited sensitivity and modest spatial resolution, thereby constraining their usage to applications with limited field-of-view and active illumination. These limitations can be overcome by resorting to 3-D integration technologies [23] and the subsequent improvement of the form factors and footprints of the functional structures embedded at the pixels and at the whole chip.
II. ARCHITECTURES FOR SMART-PIXEL CVISS
Camera architectures with conventional, frame-based, front-end sensors and a per-chip centralized processor are not the most efficient regarding the speed of the decision-making process:
• On the one hand, they must transfer all pixel data from the sensor to the processor through the readout channel, regardless of whether these data carry relevant information (such as pixels at object borders) or irrelevant information (such as pixels within a uniform background). This need to transfer all data creates communication bottlenecks.
• On the other hand, they require a large power budget owing to the need to handle huge amounts of data.
Both features constrain or even preclude the usage of these architectures whenever either on-line analysis or portability is required. For instance, vision-enabled wireless sensor networks employing this kind of architecture use multiple-of-ten thus greatly constraining portability [18] .
For increased efficiency, alternative architectures consisting of distributed, multi-core processor arrays are worth considering because they better fit the peculiarities of data processing along the processing chain of vision [15] . The quest for these alternative architectures becomes particularly necessary for early vision tasks due to the huge amount and large redundancy of the data involved in these tasks [22] . This fact has been highlighted in [16] , which states that brute force pattern matching, the conventional approach adopted by many vision system developers, is not the right tool in many applications. Instead, to quote [16] , "a majority of smart camera applications can be solved using only a small number of image processing algorithms that can be learned quickly and used very effectively". Interestingly enough, these simple algorithms (thresholds, blob analysis, edge detection, average intensity, binary operators, …) can be mapped onto dedicated processor architectures composed of simple processors with mostly local interactions [19] .
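To make the idea of simple algorithms with mostly local interactions concrete, the following Python sketch applies a threshold, a binary edge detector and an average-intensity measure to a small frame. It is only an illustration under stated assumptions: the function names, the 4-connected neighbourhood, the replicated borders and the test frame are hypothetical and are not taken from [16] or [19] .

```python
# Hypothetical sketch: every pixel runs the same instruction (SIMD style)
# and reads only its own value plus its 4-connected neighbours.

def threshold(img, level):
    """Binarize: 1 where the pixel exceeds `level`, else 0."""
    return [[1 if p > level else 0 for p in row] for row in img]

def neighbours(img, r, c):
    """Values of the 4-connected neighbours, replicating the border."""
    rows, cols = len(img), len(img[0])
    return [
        img[max(r - 1, 0)][c],
        img[min(r + 1, rows - 1)][c],
        img[r][max(c - 1, 0)],
        img[r][min(c + 1, cols - 1)],
    ]

def binary_edges(binary):
    """A pixel is an edge if it is set and at least one neighbour is not."""
    return [
        [1 if binary[r][c] and min(neighbours(binary, r, c)) == 0 else 0
         for c in range(len(binary[0]))]
        for r in range(len(binary))
    ]

def average_intensity(img):
    """Global mean of all pixel values."""
    return sum(sum(row) for row in img) / (len(img) * len(img[0]))

# Assumed test frame: a bright 3x3 square on a dark background.
frame = [
    [10, 10, 10, 10, 10],
    [10, 90, 90, 90, 10],
    [10, 90, 90, 90, 10],
    [10, 90, 90, 90, 10],
    [10, 10, 10, 10, 10],
]
binary = threshold(frame, 50)
edges = binary_edges(binary)
```

Each output pixel depends only on its own value and its immediate neighbours, which is the property that would allow one simple processor per pixel to execute the same operation in parallel.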
The convenience of devising architectural concepts better suited to images is illustrated with the help of Fig.2 , which shows the reduction of the amount of data as information progresses through the processing chain of vision [15] . It corresponds to an application where the target is detecting defective parts as they move on a conveyor belt. Images are acquired asynchronously and analysed on-line to extract a number of features, on the basis of which parts are classified as either defective or correct and a corresponding trigger signal is generated. The progressive data reduction highlighted by this example calls for correspondingly progressive processing architectures.
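The progressive reduction from pixels to features to a single decision can be sketched in a few lines of Python. This is a hypothetical chain, not the one implemented in [15] : the blob-area feature, the acceptance rule, the thresholds and the two test frames are all assumptions chosen for illustration.

```python
def label_blobs(binary):
    """Group set pixels into 4-connected blobs; each blob is a list of (row, col)."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not binary[r0][c0] or seen[r0][c0]:
                continue
            stack, blob = [(r0, c0)], []
            seen[r0][c0] = True
            while stack:
                r, c = stack.pop()
                blob.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and binary[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        stack.append((nr, nc))
            blobs.append(blob)
    return blobs

def inspect_part(frame, level, min_area, max_area):
    """Assumed rule: a part is correct if it forms one blob of acceptable area."""
    binary = [[1 if p > level else 0 for p in row] for row in frame]
    areas = [len(blob) for blob in label_blobs(binary)]
    defective = len(areas) != 1 or not (min_area <= areas[0] <= max_area)
    sizes = {
        "pixels": len(frame) * len(frame[0]),  # values entering the chain
        "features": len(areas),                # one area value per blob
        "decision": 1,                         # a single trigger flag
    }
    return defective, sizes

# Assumed test frames: an intact part and a part split by a crack.
good = [[10] * 5] + [[10, 90, 90, 90, 10] for _ in range(3)] + [[10] * 5]
cracked = [[10] * 5] + [[10, 90, 10, 90, 10] for _ in range(3)] + [[10] * 5]

good_result = inspect_part(good, 50, 8, 10)
cracked_result = inspect_part(cracked, 50, 8, 10)
```

Even in this toy case, 25 pixel values collapse into one or two feature values and then into a single flag, which is the kind of reduction that motivates placing the early stages close to the sensor.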
The conception of architectures suitable for progressive processing benefits from the usage of concepts such as multiscale representations, feature extraction, sub-sampling, and the like. The most efficient architectures employ mixed-signal smart-pixels for parallel completion of the computationally intensive early vision tasks, followed by sub-sampled topographic processor arrays (typically digital), processors-per-column and scalar processors. Unfortunately, there is not yet a standard, universally accepted set of functions to be incorporated into smart-pixels, and most contributions are ad hoc for specific tasks and requirements. Examples of state-of-the-art advanced smart-pixels are presented in the sections below.
A. Smart-pixel HDR CIS for 145dB Intra-Frame Capture
Fig.3 corresponds to a smart-pixel employed in a CIS conceived to capture High Dynamic Range (HDR) images using a content-aware compression law [24] . It employs a tone mapping algorithm [25] built into the pixel to achieve more than 151dB DR. Fig.3 depicts the pixel schematics, floorplan (where the architectural feature of heterogeneity is noticeable) and representative HDR image captures.
B. Smart-pixel CVIS for Programmable Spatial-Temporal Processing
Fig.4 shows the pixel schematics and illustrative processing for a smart-CIS which employs mixed-signal MFPS to realize programmable spatial-temporal filtering [9] . It employs 22nJ/cycle to complete binomial filtering operations at nsec rate, which makes it very well suited for the front-end of portable vision-enabled wireless sensors. The target of the processing illustrated in the figure is segmenting the zones of the image with the largest changes of intensity; that is, the relative values among the blocks of the scene representation are the key point here. This is done by the chip in [9] through in-pixel energy computation that guides subsequent foveation.
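As a rough software analogue of the binomial filtering and energy-guided block selection described for the chip in [9] , the Python sketch below smooths an image with a separable [1, 2, 1]/4 kernel and accumulates a per-block energy. The energy definition (squared difference between a pixel and its smoothed value), the block size and the function names are assumptions made for illustration; they are not the circuit-level operations of [9] .

```python
def binomial_1d(values):
    """One pass of the [1, 2, 1] / 4 binomial kernel with replicated borders."""
    n = len(values)
    return [
        (values[max(i - 1, 0)] + 2 * values[i] + values[min(i + 1, n - 1)]) / 4
        for i in range(n)
    ]

def binomial_2d(img, passes=1):
    """Separable binomial smoothing: filter rows, then columns, `passes` times."""
    for _ in range(passes):
        rows = [binomial_1d(row) for row in img]
        cols = [binomial_1d(list(col)) for col in zip(*rows)]
        img = [list(row) for row in zip(*cols)]
    return img

def block_energy(img, block):
    """Assumed energy: sum of squared (pixel - smoothed pixel) per block x block tile."""
    smooth = binomial_2d(img)
    energy = {}
    for r in range(len(img)):
        for c in range(len(img[0])):
            key = (r // block, c // block)
            energy[key] = energy.get(key, 0.0) + (img[r][c] - smooth[r][c]) ** 2
    return energy

def most_active_block(img, block):
    """Index (block_row, block_col) of the tile with the largest energy."""
    energy = block_energy(img, block)
    return max(energy, key=energy.get)

# Assumed test scene: flat background with one bright pixel in the last tile.
scene = [[0] * 4 for _ in range(4)]
scene[3][3] = 16
target = most_active_block(scene, 2)
```

Only the relative energies of the tiles matter for choosing where to look next, which mirrors the remark that the relative values among blocks are the key point.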

C. High-Speed Vision System-on-Chip
Fig.5 shows the functional diagram of a smart-pixel employed at the front-end of an industrial CVIS that combines on-chip per-pixel and per-chip (32-bit RISC digital processor for post-processing) circuitry to go from image acquisition to decision making at a rate of 1,000F/s, requiring around 60nW per pixel.
For instance, the processing chain of Fig.3 is entirely realized at the sensory plane and the central processor is provided with just 10 bytes of information indicating whether pieces are defective or correct.
III. MAPPING SMART-PIXEL CVIS ONTO 3-D ARCHITECTURES
The schematics and/or block diagrams of the smart-pixels in Figs. 3, 4 and 5 show significant overhead of non-sensing circuitry that penalizes sensor spatial resolution and photo-responsiveness. This is clearly highlighted by the pixel layout of the HDR smart-pixel depicted in Fig.3 , where the photosensor occupies for a pixel pitch of . Vertical stacking of the different functions embedded within pixels can improve the form factor of smart-pixels, thus keeping their advantages while overcoming the drawbacks of large pitch. The form factor improvement yielded by 3D technologies is particularly notable when complex processing algorithms such as object detection and recognition [26] , image retrieval, image registration, or tracking must be implemented. Since these algorithms rely on local properties, diffusions and multiscale representations play a prominent role in their implementation. This is actually the very domain of analog and mixed-signal circuits [8] [27] . 3D sensor architectures for the extraction of features and the calculation of interest points within imagers have been previously proposed by the authors in [27] and [28] . Both architectures rely on the implementation of Gaussian filtering at the sensory plane and the calculation of Gaussian pyramids thereof. For illustration purposes, Fig.6(a) shows the functional splitting among consecutive layers for the detection of salient points in an image. Also, Fig.6(b) shows an architecture for interest point detection where sensors lie in a dedicated tier as in reference [27] , so that different active pixels are possible. The last tier in this architecture is a DRAM memory block, which is actually the implementation choice in [29] . These architectures represent a first move towards a general architectural solution where each layer captures a given data abstraction within the hierarchical processing chain of vision.
The proposal of such a general architectural solution still remains an open problem.
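The Gaussian filtering and Gaussian pyramids on which the architectures in [27] and [28] rely can be illustrated with a short Python sketch. It is a software approximation under stated assumptions: the binomial kernel stands in for the Gaussian, the difference-of-Gaussians response is one common choice for locating interest points and is not claimed to be the detector used in [27] or [28] , and all function names are hypothetical.

```python
def smooth(img):
    """Separable [1, 2, 1] / 4 binomial filter, used here as a Gaussian approximation."""
    def pass_1d(v):
        n = len(v)
        return [(v[max(i - 1, 0)] + 2 * v[i] + v[min(i + 1, n - 1)]) / 4
                for i in range(n)]
    rows = [pass_1d(row) for row in img]
    cols = [pass_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def gaussian_pyramid(img, levels):
    """Level 0 is the input; each further level is smoothed, then subsampled by 2."""
    pyramid = [img]
    for _ in range(levels - 1):
        blurred = smooth(pyramid[-1])
        pyramid.append([row[::2] for row in blurred[::2]])
    return pyramid

def difference_of_gaussians(img):
    """Band-pass response: one smoothing pass minus two smoothing passes."""
    once = smooth(img)
    twice = smooth(once)
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(once, twice)]

def strongest_response(img):
    """Coordinates (row, col) of the largest absolute band-pass response."""
    dog = difference_of_gaussians(img)
    coords = ((r, c) for r in range(len(dog)) for c in range(len(dog[0])))
    return max(coords, key=lambda rc: abs(dog[rc[0]][rc[1]]))

# Assumed test inputs: a flat 8x8 image and a 7x7 image with one bright spot.
flat = [[1] * 8 for _ in range(8)]
pyramid = gaussian_pyramid(flat, 3)

spot = [[0] * 7 for _ in range(7)]
spot[3][3] = 256
peak = strongest_response(spot)
```

In a stacked implementation, each pyramid level or processing step in this sketch would loosely correspond to a function assigned to one tier, with the sensing tier supplying level 0.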
