Abstract. Monolithic integration of photodetectors, analog-to-digital converters, data storage, and digital processing can improve both the performance and the efficiency of future portable image products. However, digitizing and processing a pixel at the detection site presents the design challenge to deliver a system with the required performance at the lowest cost, not just a system with the highest performance. This paper analyzes the area-time efficiency, the area efficiency, and the energy efficiency of a mixed-signal, SIMD focal plane processing architecture that executes front-end image applications with neighborhood processing. Implementations of the focal plane architecture achieve up to 81x higher area efficiency and up to 11x higher energy efficiency when compared to traditional TI DSP chips. Higher efficiency ratings are required to maintain portability while addressing technology limitations such as interconnect wiring density, heat extraction, and battery life. Systems can be implemented with a less expensive fabrication technology by increasing the number of pixels per processing element (PPE).
Introduction
The demand for mobile productivity has led to incorporating embedded systems into handheld devices for real-time applications. A wide range of products, such as cellular phones, video and still cameras, and portable data assistants (PDAs), deliver multimedia processing by including inexpensive imaging chips. However, this processing creates a challenging design problem [1] and requires a change in paradigm to accommodate processing requirements [2]. Future portable imaging products will benefit from the monolithic integration of photodetectors, analog-to-digital converters, digital processing, and data storage to improve their performance, efficiency, and cost. A typical system-on-a-chip digital signal processing (DSP) architecture, shown in Fig. 1(a), assigns an entire image to a single processing core. This architecture is designed to span all stages of image processing. However, the DSP architecture uses a significant amount of area and energy resources to perform basic image enhancement and image analysis applications. For example, the digital image signal multiprocessor [3] reports 53% of its execution time for preprocessing and 47% for feature extraction. Preprocessing includes common tasks, such as noise reduction, smoothing, and segmentation, which are characterized by high spatial locality. Processing in the focal plane can more efficiently handle these basic tasks, leaving the more complex applications to the DSP architecture. A focal plane architecture, shown in Fig. 1(b), can be built by integrating analog-to-digital conversion (ADC) and digital processing at each detector site in the focal plane array.
* Currently affiliated with the Department of Electrical Engineering and Computer Science at Vanderbilt University.
Incorporating analog and digital components to process pixels at the detection site creates a challenging design problem [4] [5] [6]. The goal is not strictly building the system with the highest performance, but delivering the system with the required performance at the lowest cost. The architecture must consider key performance and cost metrics, such as execution time, system throughput, chip area, power consumption, and area-time efficiency. However, in the system-on-a-chip design, two figures of merit are also important, namely area efficiency and energy efficiency [7]. Characterizing the targeted application workload leads to reasonable design choices for register file size, datapath width, and functional units to efficiently utilize silicon area and power consumption [8, 9]. In addition, higher ratings in area efficiency (system throughput per unit area) and energy efficiency (system throughput per unit power) are required because of technology limitations such as interconnect wiring density and heat extraction. The choice of both fabrication technology and pixels per processing element impacts the system design.
This paper presents the area-time efficiency, the area efficiency, and the energy efficiency from analyzing a system-on-a-chip mixed-signal focal plane processing architecture. These results help quantify the system-level design issues to monolithically integrate photodetectors, analog-to-digital converters, data storage, and digital processing into focal plane architectures. Different implementations are used that vary the number of pixels per processing element (PPE). A focal plane architectural simulator provides the execution time and sustained throughput for the application suite. Component area models project the chip size for the design. A Technology Scenario Analyzer (TeSA) projects power consumption using different fabrication technologies. Because excessive die sizes would limit the realization of these systems, guidelines for affordable manufacturing are used from ITRS specifications [10, 11]. Battery life is another vital concern. Therefore, maximum power for each technology is constrained to ITRS specifications for portable battery operation, which is generally less than 3 Watts [10, 11]. Area-time efficiency, area efficiency, and energy efficiency are calculated using the appropriate projected values.
Despite a significant difference in clock frequency, implementations of the focal plane architecture perform well compared to a traditional DSP architecture. The sustained throughput achieved by the focal plane architecture implementations operating at 10 MHz exceeds the values reported in TI DSP chip specifications (Table 1) . For area-time efficiency, the 16 PPE implementation is comparable to the TI TMS320C6211B chip, while the 1 PPE and the 4 PPE implementations show some improvement versus the TI TMS320C6411 chip.
Implementations of the focal plane architecture have higher energy efficiency and area efficiency when compared to traditional TI DSP chips. Energy efficiency increases by an average factor of 11x compared to the TMS320C6211B and by an average factor of 2.9x compared to the TMS320C6411. The energy efficiency demonstrates the potential for extended battery life in a portable device. The area efficiency is dramatically improved as well using the focal plane architecture (Table 2). The area efficiency for the focal plane architecture implementations includes the data acquisition and analog-to-digital conversion circuitry, which is not available with the DSP chips. Although the 1 PPE implementation provides the most efficient architecture, it generally exceeds the constraints for chip size and power consumed. A 4 PPE or 16 PPE implementation with a less aggressive technology may provide a more cost-effective solution. In addition, detector area dominates as feature size shrinks. Therefore, denser detector technology could substantially reduce chip area. Another potential technique to reduce the analog component area is to share the analog-to-digital conversion circuitry [12].
The organization of this paper is as follows. Section 2 provides the background for architectures used by image processing systems. Section 3 describes the technique to analyze the performance of the focal plane processor array on the targeted application suite. Section 4 discusses the area models for the processing element components and presents the projected area for the focal plane processor array. Section 5 presents the projected power consumption for the focal plane processor array. Section 6 presents the analysis of resource efficiency for design implementations and provides a comparison to a traditional DSP architecture. Section 7 concludes the discussion on efficiency analysis for a mixed-signal focal plane architecture. Finally, Section 8 presents directions for future development.
Background
Numerous architectures have been developed for image processing systems [13, 14] . Efficient handling of the two-dimensional image data is a common issue among these designs. A natural solution to this issue is to process the spatially parallel data within the image plane [15] . Processing capability can be integrated directly into the focal plane. A logical approach for focal plane processing incorporates analog computational ability into the sensor device. Analog design methods, such as the Silicon Retina, have been presented to implement focal plane arrays [16] . This has led to the development of neuromorphic vision sensors for early image processing [17] . Another technique includes utilizing cellular neural networks for parallel processing [18] . This led to the development of the CNN Universal Chip [19] . However, the advantages of analog techniques, such as low power and small area, break down as CMOS technology scales [20] . In addition, analog architectures are typically non-programmable, requiring multiple designs to implement different functionality.
A programmable digital pixel is formed by monolithically incorporating the data acquisition device, analog-to-digital conversion (ADC) circuitry, data storage, and digital processing circuitry within a processing element (PE). A block diagram of this processing tile [20] is shown in Fig. 2. This tile element is replicated to form a focal plane array imager with integrated analog-to-digital conversion, nearest-neighbor communication, and SIMD processing. This section discusses the functionality of data acquisition, analog-to-digital conversion, data storage, and digital processing.
Data Acquisition
The first stage performed by the digital pixel is the acquisition of the analog light intensity of an image. The relative response of the human eye correlates to the major color bands of the visible spectrum, which span from approximately 0.4 µm to 0.7 µm [21] . Therefore, photodetector devices used in the digital pixel should provide reasonable performance across these wavelengths. Photodetectors absorb photons with energy greater than the material bandgap to generate a current proportional to the number of optically generated electron-hole pairs [22] . The quantum efficiency of the detector is the number of electron-hole pairs generated per incident photon for a given wavelength [22] . The spectral response, or responsivity, is the quantum efficiency over a range of light wavelengths [23] . Photodetectors made from silicon can detect light wavelengths up to approximately 1.1 µm due to its bandgap energy of 1.12eV and provide reasonable performance over the visible spectrum (Fig. 3) .
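As a quick check of the cutoff wavelength quoted above, the photon energy relation can be evaluated for the silicon bandgap. The constant hc ≈ 1.24 eV·µm is the standard value and is not taken from this paper:

$$\lambda_c = \frac{hc}{E_g} \approx \frac{1.24\ \text{eV}\cdot\mu\text{m}}{1.12\ \text{eV}} \approx 1.11\ \mu\text{m}$$

Photons with wavelengths longer than this carry less energy than the bandgap and are not absorbed, which is consistent with the stated limit of approximately 1.1 µm.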
Implementing photodetectors in a standard CMOS process enables the integration of other processing components at the detection site. Research efforts have been made to develop and evaluate these CMOS image sensors [24] [25] [26] . A noted development is the active pixel sensor (APS) [27] , which has been implemented into a digital camera on a chip with the functionality of analog-to-digital conversion [28] .
Analog-to-Digital Conversion
The next stage is to convert the acquired image signal into a digital value. The interface between analog signals and their digital representation requires three tasks: (1) anti-alias filtering, (2) sampling, and (3) quantization [29]. These tasks can be implemented using an oversampling technique. Oversampling methods use sampling rates far above the Nyquist rate, which is the minimum rate for reconstructing a signal without aliasing [30]. Figure 4 shows a block diagram for analog-to-digital conversion [29]. The oversampled analog signal is passed through one or more integrators, represented by transfer function block H(z), before quantization. Using a feedback loop, the quantization noise is transformed (or shaped) into a high-pass response. The signal is then sent to a decimator, which combines the low-pass filtering operation and rate reduction. This removes the high-pass quantization noise and other signal information above the maximum input frequency of interest. A candidate implementation of analog-to-digital conversion with noise shaping is the delta-sigma (or sigma-delta) converter [31]. These converters have become popular because they avoid some of the difficulties of conventional A/D conversion, such as the need for high-precision analog circuitry and vulnerability to noise or interference [31]. The performance of these circuits has been studied with standard CMOS implementation [32] [33] [34]. Sigma-delta converters have been successfully integrated with CMOS image sensors [35] [36] [37].
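The integrator, quantizer, feedback loop, and decimator described above can be illustrated with a small behavioral model. This is a minimal sketch of a generic first-order sigma-delta modulator, not the circuit used in the paper. The function names are hypothetical, the input is assumed to be normalized to the range -1 to +1, and the decimator is assumed to be a plain block average rather than a proper low-pass filter.

```python
def sigma_delta_first_order(samples):
    """Behavioral first-order sigma-delta modulator.

    Takes oversampled inputs in [-1, 1] and returns a +1/-1 bitstream.
    """
    integrator = 0.0
    feedback = 0.0  # previous quantizer output fed back to the input
    bits = []
    for x in samples:
        integrator += x - feedback            # integrate the error signal
        bit = 1 if integrator >= 0.0 else -1  # 1-bit quantizer
        bits.append(bit)
        feedback = float(bit)
    return bits


def decimate(bits, osr):
    """Average each full block of `osr` bits (crude filtering plus rate reduction)."""
    return [sum(bits[i:i + osr]) / osr
            for i in range(0, len(bits) - osr + 1, osr)]


if __name__ == "__main__":
    stream = sigma_delta_first_order([0.5] * 1024)
    print(decimate(stream, 64))  # values cluster around the DC input of 0.5
```

The density of +1 outputs tracks the input level, so averaging over a longer window recovers the input more precisely, which is the trade between oversampling ratio and resolution.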
Data Storage
Data storage implementation, the most transistor-consuming function for a pixel-level processing element, affects most other aspects of the design [38]. The architecture must provide the storage capacity that would normally be found in a frame buffer for conventional store-and-process systems. Each application executed on the architecture has a minimum data storage threshold, typically consisting of the input image, the output image, and processing overhead. The data storage is also constrained by the minimum bit-precision required for accuracy. The architecture can utilize the six-transistor SRAM cell as the foundation of memory in the processing element. The instruction set architecture of the focal plane array imager requires a three-ported register file organization [39, 40]. In addition, models have been developed to predict silicon area usage for register file configurations [41, 42].
Digital Processing
Once the pixel information is represented and stored digitally, the programmable processing core can implement various image applications. Image processing applications are categorized into (1) point operations, (2) local operations, and (3) global operations [43] . A point operation such as thresholding occurs at the individual pixel level. A local operation such as smoothing requires knowledge of an individual pixel and its immediate neighbors. A global operation such as histogramming uses all pixel data from the image. These operations form the basis of applications found in imaging systems.
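The three categories can be made concrete with one small example of each. The sketch below is a sequential reference in plain Python, not the SIMD code used on the focal plane array. The function names are hypothetical, and replicating border pixels is an assumption made only so the local operation is defined at the image edge.

```python
def threshold(image, level):
    """Point operation: each output pixel depends only on the same input pixel."""
    return [[255 if p >= level else 0 for p in row] for row in image]


def smooth3x3(image):
    """Local operation: integer mean of each pixel and its 8 neighbors.

    Border pixels are replicated (an assumption for this sketch).
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)
                    cc = min(max(c + dc, 0), w - 1)
                    total += image[rr][cc]
            out[r][c] = total // 9
    return out


def histogram(image, bins=256):
    """Global operation: every pixel in the image contributes to the result."""
    counts = [0] * bins
    for row in image:
        for p in row:
            counts[p] += 1
    return counts
```

Only the point and local operations map directly onto nearest-neighbor communication between processing elements, which is why they are the natural candidates for focal plane processing.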
To illustrate the processing sequence required for a typical imaging system of (N × N ) pixels, [45] . The first three stages represent a significant proportion of the computational workload. For example, the digital image signal multiprocessor [3] reports 53% of its execution time for preprocessing, which includes tasks from the first three stages such as noise reduction, smoothing, and segmentation. These stages are candidates for processing at the pixel level.
Several architectures, such as the Near Sensor Image Processing (NSIP) [46], the Programmable and Versatile Large Size Artificial Retina (PVLSAR) [38], and the Simple and Smart Sensory Processing Elements (S³PE) [47], follow the pixel-level model to perform early image processing in the focal plane. However, these architectures utilize bit-serial processing techniques with limited memory, which can either restrict processing to binary images or require multiple cycles to perform a single instruction on data words. This may prevent the implementation of some early image processing algorithms. However, the integration of additional functional units at the pixel level enables the versatility to execute a broader set of applications.
Application Suite Analysis
This section describes the technique to characterize the performance of the focal plane architecture on an image application suite. A focal plane architectural simulator is used to implement the various algorithms. A description is given for each application. The performance is calculated and hardware constraints are determined for the application suite.
Focal Plane Architectural Simulator
Applications for the focal plane architecture can be programmed using the SIMD Pixel Processor (SIMPil) Simulator [48] . This software tool is a windows-based instruction level simulator, running on a PC platform. The SIMPil Simulator allows editing, assembling, executing, and debugging parallel image applications in a single integrated workbench. The current version of the SIMPil Simulator is available on the download page [48] .
An image processing application is executed on the architectural simulator to provide a dynamic workload. Each application is implemented using an instruction set architecture (ISA) that corresponds to the available functional units within the digital pixel. The number of pixels per processing element is assigned in the application. The simulator is instrumented to measure the operand resolution for each instruction, the distribution for instructions issued in parallel, the concurrency level of the processing elements, and the number of processor cycles required for execution.
Description of Applications
The grayscale (8-bit) front-end applications, shown in Table 3, provide workload characteristics for image enhancement and image analysis. The enhancement and analysis applications of median filtering, convolution, and morphological processing represent the typical early image processing sequence of: (1) noise removal, (2) smoothing, and (3) segmentation. The workload characteristics for these applications are used to make efficient architectural choices for processing in the focal plane. This section briefly discusses each application.
Median Filtering.
Median filtering (MED) is useful to remove impulse noise from an image while preserving spatial resolution. The algorithm is a rank-order filter [49] that replaces each pixel in the image with the median value in the window. Generally, a window size is selected to generate a rank-order filter with odd length. A larger window size increases the severity of the median filtering effect [50]. This implementation performs a 2-D nonseparable rank and selects the median value for a 3 × 3 window.
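A 3 × 3 median filter of the kind described for the MED application can be sketched as follows. This is a sequential reference version, not the SIMPil implementation. The function name is hypothetical, and replicating border pixels is an assumption, since the border handling is not stated in the text.

```python
def median3x3(image):
    """Replace each pixel with the median of its 3x3 window.

    The nine window values are ranked together (a 2-D nonseparable rank)
    and the fifth value is selected. Border pixels are replicated.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(h):
        for c in range(w):
            window = sorted(
                image[min(max(r + dr, 0), h - 1)][min(max(c + dc, 0), w - 1)]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            )
            out[r][c] = window[4]  # middle of nine ranked values
    return out
```

A single impulse surrounded by uniform pixels never reaches the middle of the ranking, so it is removed without blurring its neighbors.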
Convolution.
Convolution-based filtering (CONV) has been implemented to perform different filtering operations, such as shadowing, smoothing, and edge detection [50] . The filter mask elements are broadcast one at a time to every PE. All calculations requiring the mask element are performed before the next element is broadcast. Each PE multiplies the mask element by the corresponding pixel value from the original image and accumulates the result. Values are accumulated in a spiral pattern that places the final result in the center pixel of the filter mask. A 3 × 3 filter mask is used for this implementation.
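The broadcast-and-accumulate scheme can be sketched in sequential form. In the sketch below, the outer loops stand in for the broadcast of one mask element at a time, and the inner loops stand in for all PEs performing the same multiply-accumulate step in parallel. The function name is hypothetical. The mask is applied without flipping, border pixels are replicated, and the elements are visited in raster order rather than the spiral order used in the paper; all three are simplifying assumptions.

```python
def convolve_broadcast(image, mask):
    """Apply a 3x3 mask by broadcasting one mask element at a time."""
    h, w = len(image), len(image[0])
    acc = [[0] * w for _ in range(h)]
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            m = mask[dr + 1][dc + 1]      # the mask element being broadcast
            for r in range(h):            # every PE performs the same step
                for c in range(w):
                    rr = min(max(r + dr, 0), h - 1)
                    cc = min(max(c + dc, 0), w - 1)
                    acc[r][c] += m * image[rr][cc]
    return acc
```

Because each PE uses the same mask element in the same step, a single instruction stream drives the whole array, which matches the SIMD execution model.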
Morphological Processing.
Morphological image processing refers to the study of the topology or structure of objects from their 2D spatial representation [51] . Binary or grayscale images are morphologically transformed by passing a structuring element over the image in a process similar to convolution. At each pixel position, a specified logical function is performed between the structuring element and the underlying image. For grayscale images, erosion is the minimum pixel value in the structuring element, and dilation is the maximum pixel value in the structuring element [50] . Depending upon the size and content of the structuring element, different effects such as inside edge detection (IED) can be produced from erosion and dilation operations. A 3 × 3 structuring element has been used to implement the morphological operations.
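Grayscale erosion and dilation with a flat 3 × 3 structuring element reduce to a neighborhood minimum and maximum. The sketch below is a sequential reference with hypothetical function names and replicated borders. The inside edge is computed as the image minus its erosion, which is a common definition and an assumption here, since the text does not give the exact IED formula.

```python
def _window(image, r, c):
    """Collect the 3x3 neighborhood of (r, c), replicating border pixels."""
    h, w = len(image), len(image[0])
    return [image[min(max(r + dr, 0), h - 1)][min(max(c + dc, 0), w - 1)]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def erode(image):
    """Grayscale erosion: minimum value under the structuring element."""
    return [[min(_window(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]


def dilate(image):
    """Grayscale dilation: maximum value under the structuring element."""
    return [[max(_window(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]


def inside_edge(image):
    """Inside edge detection, assumed here to be image minus erosion."""
    eroded = erode(image)
    return [[p - e for p, e in zip(row, erow)]
            for row, erow in zip(image, eroded)]
```

For a bright square on a dark background, erosion shrinks the square by one pixel on each side, so the difference leaves only the ring of pixels just inside its boundary.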
Application Performance
The execution time was calculated for the application suite described in Section 3.2. The execution time is determined with reference to a 10 MHz target platform. This clock frequency addresses both the speed of analog components in each processing element and power density limitations for the high utilization factor of PEs in the array. Simulations were run with a system resolution equal to Quarter-CIF (QCIF, 176 × 144 pixels). This resolution is one specification of the H.261 and H.263 video codec standards of the International Telecommunication Union (ITU) [52]. However, the execution time of the focal plane architecture is independent of the image dimension due to the parallel data processing.
In Table 4 , the total execution times of the early image processing sequence generally follow the increase of the PPE factor with a baseline of the 1 PPE case. Median filtering is the dominant component of the processing sequence. The median filtering application for 16 PPE is not as efficient as the versions for 1 PPE and 4 PPE; therefore it does not follow the trend of the PPE factor. The total execution time for each PPE is within real-time constraints of 30 frames/sec (33.3 ms per frame).
Using the execution time T_exec, the sustained throughput Throughput_sust, measured in billion operations per second, is determined for the focal plane architecture as follows:

$$\text{Throughput}_{sust} = \frac{IC \cdot U \cdot N_{PE}}{T_{exec}} \qquad (1)$$

where IC is the number of parallel instructions issued to the PE array during the application. The system utilization U is calculated as the average fraction of processing elements that are active, determined from the simulator's concurrency meter. N_PE is the number of processing elements in the PE array and is determined from the following formula:

$$N_{PE} = \frac{176 \times 144}{PPE} \qquad (2)$$

where PPE is the number of pixels in each processing element (1, 4, or 16). For PPE > 1, pixels are arranged in a square (i.e. 2 × 2 or 4 × 4). The target system resolution is QCIF (176 × 144 pixels). The values for N_PE are shown in Table 5.
The sustained throughput has been calculated for the focal plane architecture with 1 PPE, 4 PPE, and 16 PPE implementations (Table 6). The maximum number of PE instructions executed equals (IC · N_PE). The utilization is derived from the weighted average of executing each application. The sustained throughput achieved by the focal plane architecture implementations exceeds the values achieved by a traditional DSP architecture despite a significant difference in clock frequency [53]. Although operating at 10 MHz, the focal plane architecture implementations for 1 PPE, 4 PPE, and 16 PPE increase sustained throughput by factors of 65x, 16x, and 5x respectively compared to the TI TMS320C6411 chip. These factors are doubled when comparing against the TI TMS320C6211B chip.
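The throughput calculation can be sketched in a few lines. The sketch assumes that sustained throughput is the maximum number of PE instructions (IC · N_PE) scaled by the utilization, taken as a fraction between 0 and 1, and divided by the execution time. The function names are hypothetical, and the instruction count, utilization, and execution time in the usage example are placeholders, not values from Tables 4 to 6.

```python
QCIF_WIDTH = 176
QCIF_HEIGHT = 144


def num_processing_elements(ppe):
    """Number of PEs needed to cover a QCIF image with `ppe` pixels per PE."""
    return (QCIF_WIDTH * QCIF_HEIGHT) // ppe


def sustained_throughput_gops(ic, utilization, n_pe, exec_time_s):
    """Sustained throughput in billions of operations per second.

    ic          -- parallel instructions issued to the PE array
    utilization -- average fraction of active PEs, between 0 and 1 (assumed)
    n_pe        -- number of processing elements
    exec_time_s -- execution time in seconds
    """
    return ic * utilization * n_pe / exec_time_s / 1e9


if __name__ == "__main__":
    for ppe in (1, 4, 16):
        print(ppe, num_processing_elements(ppe))
    # Placeholder inputs, chosen only to exercise the formula.
    print(sustained_throughput_gops(1000, 0.5, 1584, 0.001))
```

The PE counts of 25344, 6336, and 1584 match the array sizes quoted later for the 1 PPE, 4 PPE, and 16 PPE implementations.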
Application Hardware Constraints
The hardware must satisfy both the data storage and the data precision constraints to execute the algorithms. Table 7 shows the architectural design parameters for each processing element determined by the application suite. 
Projected Area
Because of limited chip resources in a focal plane architecture, silicon area usage within an integrated digital pixel is a critical design factor. A pixel design tool provides a common context for component area models within an integrated pixel-processing array [54] . Area models based upon implementations of detector array circuitry and CMOS functional units are used to project silicon area for the architecture using different fabrication technologies.
Component Area Models
Area models are developed from fabricated analog and digital components. Area projections of the photodiode and analog-to-digital conversion are based upon the CMOS focal plane array [5] . This 8 × 8 array of Si CMOS detectors, fabricated in 0.8 µm technology, incorporates a current input first-order sigma-delta analog-to-digital converter at each pixel. The transistor circuitry is scaled to feature sizes ranging from 250 nm to 100 nm. Area projections for memory and digital circuitry are based upon the SIMPil16 focal plane architecture [55] that maps 16 pixels to each processing element. SIMPil16 is a 16-bit implementation fabricated on MOSIS HP 0.8 µm (1.0 µm drawn) CMOS technology. Selected functional units of a SIMPil16 PE used by the pixel-level processing architecture include: (1) ALU, (2) register file, (3) decoder, (4) bus driver, (5) sleep unit, and (6) communication unit. The original areas for these functional units are scaled to feature sizes ranging from 250 nm to 100 nm for area projections of the focal plane processor. Bit slicing is used to adjust both the ALU area and the register file area for a reduced datapath width. In addition, the register file area is adjusted to correspond to the number of words in the design. 
Processing Element Area
Silicon area allocation is a significant issue because in single-level VLSI, the photodiode, the analog signal conditioning, the analog-to-digital converter, the memory, and the digital processing core compete for silicon area in a small replicated processing element (Fig. 2). A P-N photodiode is used in the CMOS focal plane array [5]. As a first-order approximation, the area of the sampling capacitor represents the analog signal conditioning because it dominates that component. The ADC is a first-order sigma-delta circuit. Memory is a register file with the number of registers based upon the formula required to execute the full application suite (Table 7). The digital processor contains the functional units mentioned in the previous section. Processing element area has been projected using various fabrication technologies for 1 PPE (Table 8), 4 PPE (Table 9), and 16 PPE (Table 10). By observing the component area trends for 1 PPE (Fig. 6), 4 PPE (Fig. 7), and 16 PPE (Fig. 8), the digital components (digital processor, memory, ADC) benefit from technology scaling. However, the analog components (photodiode and analog signal conditioning) do not see the same benefit [25, 56]. The photodiode has a fixed area requirement to ensure acquisition of light. The analog signal conditioning (sampling capacitor) also has a fixed area cost. As the PPE increases, a larger percentage of the silicon area within the processing element is used for data acquisition and data storage.
Focal Plane Processor Array Area
Portable devices with an embedded focal plane architecture must reduce their silicon area cost for a compact design. Excessive die sizes would limit the realization of these systems. The ITRS provides guidelines for affordable chip sizes for each technology [10, 11]. Using the array dimensions for different PPE values (Table 5) and the processing element areas (Tables 8, 9, and 10), the focal plane processor array area has been projected for various technologies (Fig. 9). Using the ITRS guideline for manufacturing affordability, the 1 PPE implementation does not project to the die size target. The 4 PPE implementation exceeds the guideline by approximately 20% starting with the 150 nm technology and reaches the ITRS guideline using 100 nm technology. The 16 PPE implementation meets the guideline for all the selected technologies except 250 nm, allowing it to be implemented using a less aggressive and less expensive fabrication technology. Because detector area dominates as feature size shrinks, denser detector technology could substantially reduce chip area, making the chip more affordable. Another potential technique to reduce the analog component area is to share the analog-to-digital conversion circuitry [12].
Projected Power
Portable devices with an embedded focal plane architecture must provide a meaningful battery life. Setting the minimum time between battery charges (MTBC) to a desired value translates into a limit for maximum power consumed during operation for a fixed battery energy. Typical double barrel, NiCd AA sized batteries (3.6 V at 700 mA·hours) have an energy capacity of about 10 Watt·hours. Therefore for MTBC ≥ 3 hours, the power should not exceed 3 Watts.
A Technology Scenario Analyzer (TeSA) was developed to project the power consumption of the digital components within the focal plane architecture using different fabrication technologies [57] . The power consumption has been projected for fabrication technologies ranging from 250 nm to 100 nm using a 10 MHz clock (Fig. 10) . The maximum power limit for each technology is used from ITRS specifications for portable battery operation [10, 11] . The 1 PPE implementation consumes the most power because of the large number of processing elements required to provide QCIF resolution (176 × 144 = 25344). It requires technology generations of 130nm or smaller to satisfy the ITRS power constraint. However, it may be too close to the limit to account for the analog components. The 4 PPE implementation satisfies the constraint for all technologies except 250 nm and 180 nm by reducing the required number of processing elements (88 × 72 = 6336). The power consumed is reduced by a factor of 3.9 (i.e. 74%) when compared to the 1 PPE implementation. The 16 PPE implementation, which uses the least number of processing elements (44 × 36 = 1584), consumes the least power and satisfies the constraint for all the chosen technologies. The overall reduction factor is 18 (i.e. 94%) when compared to the 1 PPE implementation.
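The arithmetic behind the power budget and the quoted reductions can be checked with a short sketch. The function names are hypothetical. The battery energy of 10 Watt·hours and the 3 hour MTBC are taken from the text as stated, and the reduction percentages are derived only from the quoted factors of 3.9 and 18.

```python
def max_power_watts(battery_wh, mtbc_hours):
    """Largest average power that still lasts `mtbc_hours` on one charge."""
    return battery_wh / mtbc_hours


def percent_reduction(factor):
    """Convert a reduction factor (e.g. 3.9x lower) into a percentage saved."""
    return (1.0 - 1.0 / factor) * 100.0


if __name__ == "__main__":
    # 10 Wh over 3 hours gives roughly 3.3 W, which the text rounds to 3 W.
    print(round(max_power_watts(10.0, 3.0), 2))
    print(round(percent_reduction(3.9)))   # 4 PPE versus 1 PPE
    print(round(percent_reduction(18.0)))  # 16 PPE versus 1 PPE
```

The power reduction factors are close to, though not exactly equal to, the 4x and 16x reductions in the number of processing elements.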
Resource Efficiency Analysis
Higher ratings in area-time efficiency, area efficiency, and energy efficiency metrics are desired for image processing systems because of technology limitations such as interconnect wiring density and heat extraction [58] . This section examines the area-time efficiency, area efficiency, and energy efficiency of the focal plane architecture implementations. Comparisons for TI DSP chips are based upon application benchmarks and chip specifications [53] .
Area-Time Efficiency
The area-time efficiency is defined as:

$$(A \cdot T)^{-1} = \frac{1}{A \cdot T} \qquad (3)$$

where the reciprocal is taken for the product of the area A for each system and the execution time T for the image processing sequence (Table 4). Because a system should execute in the shortest time using the smallest area, the optimal solution is determined by maximizing the reciprocal product. The (A · T)^{-1} efficiency of focal plane architecture implementations using 1 PPE, 4 PPE, and 16 PPE has been calculated for different fabrication technologies (Fig. 11). The area includes the data acquisition and analog-to-digital conversion circuitry. Two TI DSP chips, the TMS320C6211B and TMS320C6411, are shown for comparison. The 16 PPE implementation is comparable to the TMS320C6211B operating at 150 MHz. The 1 PPE and the 4 PPE implementations provide a 3.5× and 1.8× factor increase respectively when compared to the TMS320C6411 operating at 300 MHz. This demonstrates that a focal plane architecture can be utilized as an embedded component to deliver efficient processing for common image applications. These applications can consume over half of the processing for an image.
Area Efficiency
Area efficiency is defined as the number of operations executed per second per unit area:

$$\text{Area efficiency} = \frac{\text{Throughput}_{sust}}{A} \qquad (4)$$
Area efficiency has been established as an important figure of merit for system-on-a-chip architecture design [7] . Because of limited chip resources in a focal plane architecture, silicon area usage within an integrated digital pixel is a critical design factor. Higher area efficiency implies better component utilization within the architecture.
The area efficiency of the focal plane architecture for implementations using 1 PPE, 4 PPE, and 16 PPE has been calculated using different fabrication technologies (Fig. 12). Generally, an increase in PPE decreases the area efficiency of the focal plane architecture. However, there is a significant gain when compared to a DSP architecture in the same fabrication technology. For the TMS320C6211B, which uses 180 nm technology, the area efficiency increases by factors of 81×, 47×, and 20× for 1 PPE, 4 PPE, and 16 PPE respectively. For the TMS320C6411, which uses 120 nm technology, the increase factors are 56×, 27×, and 10× for 1 PPE, 4 PPE, and 16 PPE respectively. This result is significant because the gain in efficiency is for common image enhancement and image analysis tasks that can represent a significant portion of the total operations within an image processing sequence. In addition, the area efficiency for implementations of the focal plane architecture includes the data acquisition and analog-to-digital conversion circuitry, which is not available with the DSP architecture. This does not suggest that the architecture replaces a DSP but can handle certain common image enhancement and image analysis tasks more efficiently through specialization.
Energy Efficiency
Energy efficiency is defined as the number of operations executed per unit energy:

$$\text{Energy efficiency} = \frac{\text{Throughput}_{sust}}{P} \qquad (5)$$

where P is the power consumed. Previous work [59, 60] has illustrated the validity of energy efficiency for fixed throughput computation in uniprocessor systems. The validity is extended to massively parallel embedded focal plane architectures by introducing system utilization in (1) for calculating the throughput. Increasing energy efficiency implies enhancing the sustainable battery life in portable devices. Minimizing power dissipation translates into minimizing energy per operation. The energy efficiency of a focal plane architecture using 1 PPE, 4 PPE, and 16 PPE has been calculated for the digital components using different fabrication technologies (Fig. 13). The 16 PPE implementation delivers the highest energy efficiency. The 4 PPE implementation is slightly lower than the 1 PPE implementation. This may indicate that the application suite optimizes slightly better for using a single pixel in a processing element. Two TI DSP chips, the TMS320C6211B and TMS320C6411, are also shown for comparison. For the TMS320C6211B with 180 nm technology, the energy efficiency increases by an average factor of 11× when using implementations of the focal plane architecture. For the TMS320C6411 with 120 nm technology, the average increase is 2.9×. This demonstrates the potential to extend battery life in portable devices.
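The three figures of merit used in this section can be collected into one small sketch. It assumes area-time efficiency is the reciprocal of area times execution time, area efficiency is sustained throughput per unit area, and energy efficiency is sustained throughput per unit power, as described in the text. The function names and units are hypothetical choices, and the numbers in the usage example are placeholders, not results from the paper.

```python
def area_time_efficiency(area_mm2, exec_time_s):
    """Reciprocal of the area-time product; larger is better."""
    return 1.0 / (area_mm2 * exec_time_s)


def area_efficiency(throughput_gops, area_mm2):
    """Sustained throughput per unit area (Gops per mm^2)."""
    return throughput_gops / area_mm2


def energy_efficiency(throughput_gops, power_w):
    """Sustained throughput per unit power (Gops per watt).

    Operations per second divided by joules per second gives operations
    per joule, so this is also billions of operations per joule.
    """
    return throughput_gops / power_w


if __name__ == "__main__":
    # Placeholder design point, chosen only to exercise the formulas.
    area, time_s, gops, power = 100.0, 0.01, 50.0, 2.0
    print(area_time_efficiency(area, time_s))
    print(area_efficiency(gops, area))
    print(energy_efficiency(gops, power))
```

Comparing two designs then reduces to taking the ratio of the same metric for each, which is how the improvement factors against the DSP chips are expressed.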
Conclusion
Recently, a new dynamic has emerged as users seek mobile productivity, with image processing functionality encapsulated within portable devices containing embedded hardware. Portable imaging products will benefit from the monolithic integration of photodetectors, analog-to-digital converters, digital processing, and data storage to perform image acquisition and computation. Processing on the focal plane exploits the data parallelism naturally found in image applications. Technological advances in device fabrication and integration are enabling the development of these monolithic focal plane architectures with the potential for improved performance, efficiency, and cost versus traditional imaging architectures.
This paper presents the analysis of a mixed-signal focal plane processing architecture. Key performance and cost metrics for focal plane architectures include execution time, system throughput, chip area, power consumption, area-time efficiency, area efficiency, and energy efficiency. Although area-time efficiency has traditionally been used, it does not provide guidance for addressing energy consumption. Area efficiency and energy efficiency are important figures of merit when evaluating a focal plane architecture. The choice of both fabrication technology and pixels per processing element impacts the system design, particularly for portable devices. Die size, which drives manufacturing cost, and power consumption, which drives battery life, should follow ITRS guidelines. In addition, higher ratings in area efficiency and energy efficiency are required because of technology limitations such as interconnect wiring density and heat extraction.
The performance, area, and power were projected for focal plane architecture implementations using 1 PPE, 4 PPE, and 16 PPE. This architecture is more area efficient and more energy efficient than traditional TI DSP chips. However, the design problem is not strictly to build the system with the highest performance, but to deliver the required performance at the lowest cost. Therefore, a 4 PPE or 16 PPE implementation in a less aggressive technology may provide a better solution.
Future Research Directions
Future work includes analyzing the design of the CMOS image sensor circuitry to quantify the performance versus both the area and energy consumed. Although the analog components do not receive dramatic benefits from technology scaling, improvements in optoelectronic fabrication can reduce the size requirements of the photodiode and the sampling capacitor. In addition, the analog components may require a significant amount of power when compared to the digital components. The components of the digital pixel will be implemented in VLSI layout to extract circuit parameters for analysis. The layout may reveal additional opportunities for shared components, such as power or signal lines. This layout would enable a prototype of the architecture to be fabricated and tested, as well as provide feedback to improve the area models for the photodetectors, analog-to-digital converters, data storage, and digital processing circuitry.
Improved architectural developments must also address data storage and data communication issues found in focal plane systems. Data storage and data communication are tightly coupled within the focal plane system. Because data storage is the most transistor-consuming function for a pixel-processing element, implementation choices affect most other aspects of the design [38]. Data communication among processing elements follows directly from the method of storage. Pixel level processing could benefit from a distributed or reconfigurable processing architecture that allows data storage and data transfer to merge into a communication network.
