"Image acquisition system using on sensor compressed sampling technique," J. Electron.

Abstract. Advances in CMOS technology have made high-resolution image sensors possible. These image sensors pose significant challenges in terms of the amount of raw data generated, energy efficiency, and frame rate. This paper presents a design methodology for an imaging system and a simplified image sensor pixel design to be used in the system so that the compressed sensing (CS) technique can be implemented easily at the sensor level. This results in significant energy savings as it not only cuts the raw data rate but also reduces transistor count per pixel; decreases pixel size; increases fill factor; simplifies analog-to-digital converter, JPEG encoder, and JPEG decoder design; decreases wiring; and reduces the decoder size by half. Thus, CS has the potential to increase the resolution of image sensors for a given technology and die size while significantly decreasing the power consumption and design complexity. We show that it has the potential to reduce power consumption by about 23% to 65%.
Introduction
In recent years, the resolution of image sensors has increased at a remarkable rate. For example, smartphones with 41-megapixel cameras are available on the market. It is becoming increasingly difficult to handle the amount of data generated by such sensors in portable devices such as smartphones and cameras in terms of power requirements. If we use one byte per color channel (which is modest) to store the color of a pixel in RGB format, we have 3 MB of raw data per image for a 1-megapixel camera. For a 41-megapixel camera, we have a massive 123 MB of raw data to process in hundreds of milliseconds. This poses a huge challenge given the power constraints of mobile devices and the numerous snapshots and amount of data that users are generating today in the multimedia-centric world. While we have huge secondary storage these days, e.g., 128-GB SD/micro-SD cards, the challenge is to handle the raw data generated at the sensor. Some energy-efficient modification of the traditional image acquisition system is therefore needed to handle this amount of data. If the compression is done at the sensor itself, we can avoid the wide bus wires, decrease the clock rate, and reduce the register widths. This will result in significant power savings as the I/O readout will be reduced proportionately.
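The raw data figures quoted above follow from simple arithmetic. The short sketch below reproduces them, assuming one byte per color channel, three channels per pixel, and decimal megabytes; the function name `raw_frame_bytes` is purely illustrative.

```python
def raw_frame_bytes(megapixels: float, bytes_per_channel: int = 1, channels: int = 3) -> int:
    """Size in bytes of one uncompressed frame (assumes full RGB at every pixel)."""
    pixels = int(megapixels * 1_000_000)
    return pixels * bytes_per_channel * channels


for mp in (1, 41):
    size_mb = raw_frame_bytes(mp) / 1_000_000  # decimal megabytes, as in the text
    print(f"{mp}-megapixel sensor: {size_mb:.0f} MB of raw RGB data per frame")
```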
In recent years, a lot of research has been conducted on compressively sampling natural images. According to CS theory, if a signal is sparse in some domain, it can be recovered faithfully from a small number of linear combinations of the signal values, provided that the matrix representing the linear combinations is incoherent with the sparse-domain basis vectors. However, the traditional methods of CS make matters worse when it comes to acquisition effort per bit and storage effort per bit. Oike and El Gamal (Ref. 1) applied CS at the analog-to-digital conversion level. The biggest issue with that approach was that the sampled image loses image-like properties, and hence image compression techniques such as JPEG do not work well, resulting in an increase of storage effort per bit. Also, each pixel is read out multiple times, which wastes energy and acquisition time. It also uses a pseudorandom generator, which consumes additional energy. The design presented by Dadkhah et al. (Ref. 2) does CS at the sensor level, but it wires the output of a pseudorandom generator to each block. In addition to the problems associated with the design presented by Oike and El Gamal (Ref. 1), it also consumes significant wiring area in the pixel and decreases the active area in the pixel. This will result in poor peak signal-to-noise ratio (PSNR) performance of the pixel. Katic et al. (Ref. 3) also presented a design along similar lines. Their design also contains a random number generator that needs to be routed to pixels, consuming wiring area and power. The goal of this paper is a very simplified implementation of CS that results in power savings, a reduction in raw data rate, the ability to apply standard image compression techniques such as JPEG after CS, and simplification of hardware design while achieving optimal performance.
To achieve this, the paper presents a system design methodology for an imaging system and a simplified pixel design to be used in the system so that the CS technique can be implemented easily at the sensor level. We show that pixels in our design flow are even simpler than the normal ones. In our paper, we have circumvented the need for a pseudorandom generator by employing the CS superresolution technique presented by Sen and Darabi 4 with modifications. We present results from both the binary permuted block diagonal sampling matrix as mentioned in Refs. 5 and 6 as well as our nonbinary block diagonal sampling matrix. These are easy to implement in hardware and help us to perform on-sensor image compression. These matrices preserve image-like properties, so JPEG can be applied to compressively sampled images, unlike traditional designs. We show that our design methodology has the potential to achieve 23% to 65% power savings.
Background and Motivation
This section introduces the background concepts and motivation behind our system design methodology as well as the pixel design.
Compressed Sensing Theory
Suppose we have a signal X with N samples, such that $X \in \mathbb{R}^{N \times 1}$, and we want to recover X from Y, where

$$Y = \Phi X, \tag{1}$$

such that Φ is an M × N matrix and M ≪ N. The number of unknowns is significantly larger than the number of observations, so it is difficult to recover X from Y because Eq. (1) has infinitely many possible solutions. However, if X is sufficiently sparse, exact recovery is possible. This is compressed sensing (CS) (Ref. 7). A popular choice for Φ, i.e., the measurement basis, is a randomly generated matrix. In this work, we also assume that Φ is orthonormal, i.e.,

$$\Phi \Phi^{T} = I. \tag{2}$$
Since A is M × N and M ≪ N, recovery of the original signal is difficult because the system of equations represented by Eq. (6) has infinitely many solutions. This is where CS comes to the rescue. If the sensing matrix A satisfies the restricted isometry property (RIP) (Ref. 8) stated below for any S-sparse vector $v$,

$$1 - \epsilon \le \frac{\|A v\|_2}{\|v\|_2} \le 1 + \epsilon \tag{7}$$

for some ϵ > 0, then perfect reconstruction is guaranteed with very high probability. To reconstruct the signal, we solve the following problem using linear programming techniques:

$$\min \|T\|_1 \quad \text{subject to} \quad Y = A T. \tag{8}$$
Another condition related to the RIP is that the sparsity basis should be incoherent with the sampling basis (Ref. 9). The coherence between the two is calculated as follows:

$$\mu(\Phi, \Psi) = \sqrt{N} \max_{k,j} \left| \langle \phi_k, \psi_j \rangle \right|, \tag{9}$$

where ϕ and ψ are the basis vectors in sampling basis Φ and sparsity basis Ψ, respectively. The coherence ranges from 1 to $\sqrt{N}$. If μ is close to 1, then the matrices are incoherent, and vice versa. While the requirement of incoherence is implicit in Eq. (7), it is explicit in another sufficient condition for recovery of compressively sampled signals. Select M measurements uniformly at random in the Φ domain. Then if

$$M > C \mu^{2}(\Phi, \Psi)\, S \log N \tag{10}$$
for some positive constant C and S-sparse signal (i.e., only S coefficients of the signal are nonzero), recovery via Eq. (8) is guaranteed with very high probability (Ref. 10). Equation (10) also indicates that, if the two bases are less incoherent (larger μ), we need more samples to reconstruct the original signal with high probability (Ref. 9). The above discussion applies to a strictly sparse signal, which means that the signal has many exactly zero values when represented in the sparse domain. However, such signals are rarely found in nature. Images represented in matrix form are no exception. Many natural signals are only approximately sparse, which means that most of the coefficients are very small in magnitude. In such cases, small coefficients can be discarded without much loss of perceptual quality. Let the signal X be approximately sparse, with transform coefficients T. Let $T_S$ be T with all but the S largest coefficients set to zero, and let $X_S$ be the corresponding signal. Because Ψ is an orthonormal basis,

$$\|X - X_S\|_2 = \|T - T_S\|_2. \tag{11}$$

Thus, if T can be classified as sparse or compressible, meaning that the sorted magnitudes of T decay quickly, then X can be approximated by $X_S$ and, therefore, the error $\|X - X_S\|_2$ is small (Ref. 10). This means that we can discard a significant fraction of coefficients without much loss of quality. This is why CS works well with natural images.
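As a worked illustration of the coherence measure discussed above, the sketch below evaluates μ for two textbook pairs of orthonormal bases: spikes against Fourier (the maximally incoherent case) and spikes against themselves (the maximally coherent case). The basis choices and the name `coherence` are illustrative assumptions and are not the matrices used in this paper.

```python
import numpy as np


def coherence(phi: np.ndarray, psi: np.ndarray) -> float:
    """mu = sqrt(N) * max |<phi_k, psi_j>| for two orthonormal bases.

    Rows of `phi` are sampling vectors; columns of `psi` are sparsity vectors.
    """
    n = phi.shape[1]
    inner = phi.conj() @ psi  # all pairwise inner products
    return float(np.sqrt(n) * np.max(np.abs(inner)))


N = 64
spikes = np.eye(N)                            # canonical (spike) basis
fourier = np.fft.fft(np.eye(N)) / np.sqrt(N)  # orthonormal DFT basis

mu_min = coherence(spikes, fourier)  # lower end of the range
mu_max = coherence(spikes, spikes)   # upper end of the range, sqrt(N)

# The sample bound in Eq. (10) scales with mu**2, so the coherent pair
# needs (mu_max / mu_min)**2 times as many measurements.
print(mu_min, mu_max, (mu_max / mu_min) ** 2)
```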
For images, popular sparsity bases are wavelet, Fourier, or gradient. The measurement matrices that satisfy incoherence requirements broadly fall into four categories: random or Gaussian random matrices (Ref. 11), scrambled Fourier matrices (Ref. 12), partial noiselets (Ref. 9), and scrambled block Hadamard matrices (Refs. 5 and 13). Unfortunately, these matrices are very expensive and challenging to implement in hardware. Any attempt to implement these matrices negates the advantage gained by CS in terms of sampling effort per bit. To make matters worse, storage of the sampled image becomes even more challenging.
For images, the sampling matrix can be quite huge, i.e., of the order of one million. Storing or generating a matrix of such size is not feasible in a camera or a portable device. To solve this problem, block-based CS is used; it is explained in next section.
Block-Based CS
In block-based CS sampling, the image is divided into B × B blocks. The sampling is done using a block-diagonal sampling matrix

$$\Phi = \begin{bmatrix} \Phi_B & 0 & \cdots & 0 \\ 0 & \Phi_B & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Phi_B \end{bmatrix}, \tag{12}$$

where off-diagonal elements are all zeros. For block-based CS, the image has to be vectorized into one dimension by either using a raster scan or just reshaping the matrix. There is a trade-off between memory and reconstruction performance in the selection of the block dimension. Small B means less memory but poor reconstruction performance, while large B means more memory but superior reconstruction performance. Here, we used an even more simplified version of block CS. We have not vectorized the image into one dimension. Instead, we keep the image as it is and use an $(M/N \times B) \times B$ sampling matrix. This leads to an even simpler implementation. In our case, the block size does not have any effect on reconstruction performance in our simulation. An explanation of this is provided in Sec. 3. Hence, we choose the smallest possible block size (i.e., 2 × 4) for simplicity.
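The non-vectorized block sampling described above can be sketched in a few lines. The example assumes a 2 × 4 binary block that sums pairs of neighboring rows (M/N = 1/2) and a random 16 × 16 array standing in for an 8-bit image; `block_diagonal` is a hypothetical helper name. It also checks numerically that writing the same rule as a 4 × 8 block yields identical measurements.

```python
import numpy as np


def block_diagonal(phi_b: np.ndarray, n_blocks: int) -> np.ndarray:
    """Place n_blocks copies of phi_b along the main diagonal, zeros elsewhere."""
    return np.kron(np.eye(n_blocks, dtype=phi_b.dtype), phi_b)


# Assumed 2 x 4 block that adds pairs of neighboring rows (M/N = 1/2, B = 4).
phi_2x4 = np.array([[1, 1, 0, 0],
                    [0, 0, 1, 1]])
# The same rule written as a single 4 x 8 block.
phi_4x8 = block_diagonal(phi_2x4, 2)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 16))  # stand-in for an 8-bit image

# The image is not vectorized: the sampling matrix acts on the image rows directly.
y_small = block_diagonal(phi_2x4, 16 // 4) @ image
y_large = block_diagonal(phi_4x8, 16 // 8) @ image

print(y_small.shape, np.array_equal(y_small, y_large))
```

Under these assumptions both block sizes build the same full sampling matrix, so the measurements are the pairwise row sums of the image in either case.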
The next section introduces the transform domain in which natural images are sparse, a key requirement for CS/reconstruction.
Directional Transforms for Sparse Representation
There are many transforms that can be used to represent an image as a sparse or approximately sparse signal. A popular one is the discrete wavelet transform (DWT). However, the DWT lacks important properties such as shift invariance and directional selectivity. Many modifications to the DWT have been studied extensively to preserve a much higher degree of directional representation. One of them is the dual tree discrete wavelet transform (DDWT) (Ref. 15). The DDWT has an advantage over the DWT as it provides an efficient representation of directional features such as edges and contours. It has a redundancy of $2^m : 1$ for m-dimensional signals. Hence, for two-dimensional images, the redundancy is 4:1. It consists of both real and imaginary parts, but either the real or the imaginary part alone guarantees perfect reconstruction; hence, it can be used as a standalone transform (Ref. 16). While the DWT is ambiguous in directionality, mixing the +45° and −45° orientations together, the DDWT has unique wavelets in each direction. They are oriented at ±75°, ±15°, and ±45°. The wavelets are shown in Fig. 1.
The next section introduces the reconstruction algorithms for images sampled using the CS technique.
Reconstruction Algorithm
A major problem associated with block-based CS is blocking artifacts. A solution to this problem was presented by Gan (Ref. 14) by incorporating Wiener filtering into the basic projected Landweber (PL) framework. This filtering helps to impose smoothness in addition to the sparsity inherent in the PL algorithm. Algorithm 1 (Ref. 17) is given as
In the above algorithm, Wiener() represents pixelwise adaptive Wiener filtering using a neighborhood of 3 × 3. The initial value is given as

$$x^{(0)} = \Phi^{T} y, \tag{13}$$

and the termination criterion is as follows:

$$\left| D^{(i+1)} - D^{(i)} \right| < 10^{-4}, \tag{14}$$

where

$$D^{(i)} = \frac{1}{\sqrt{N}} \left\| x^{(i)} - \hat{\hat{x}}^{(i-1)} \right\|_2. \tag{15}$$
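A minimal sketch of a smoothed projected Landweber iteration of this kind is given below. It is not the authors' implementation, and every specific choice is an assumption made for illustration: an orthonormal DCT stands in for the directional transform Ψ, the threshold follows the common universal-threshold rule with a median-based noise estimate, and the Wiener filter is a simple NumPy version of the 3 × 3 adaptive filter. The sampling matrix acts on image rows and is assumed to satisfy ΦΦᵀ = I, and the image is assumed square.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix (assumed stand-in for the transform Psi)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c


def wiener3(x: np.ndarray) -> np.ndarray:
    """Pixelwise adaptive Wiener filter over a 3 x 3 neighborhood."""
    p = np.pad(x, 1, mode="edge")
    h, w = x.shape
    win = np.stack([p[r:r + h, c:c + w] for r in range(3) for c in range(3)])
    mean, var = win.mean(axis=0), win.var(axis=0)
    noise = var.mean()  # noise power estimated as the mean local variance
    gain = np.maximum(var - noise, 0.0) / np.maximum(var, max(noise, 1e-12))
    return mean + gain * (x - mean)


def project(x: np.ndarray, y: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Landweber step onto {x : phi @ x = y}; exact when phi @ phi.T = I."""
    return x + phi.T @ (y - phi @ x)


def spl(y, phi, lam=6.0, max_iter=200, tol=1e-4):
    """Smoothed projected Landweber reconstruction of a square image."""
    c = dct_matrix(phi.shape[1])
    x = phi.T @ y  # initial estimate
    d_prev = None
    for _ in range(max_iter):
        x_hat = project(wiener3(x), y, phi)   # smooth, then re-impose the data
        coef = c @ x_hat @ c.T                # move to the transform domain
        sigma = np.median(np.abs(coef)) / 0.6745
        tau = lam * sigma * np.sqrt(2.0 * np.log(coef.size))
        coef[np.abs(coef) < tau] = 0.0        # hard thresholding
        x_new = project(c.T @ coef @ c, y, phi)
        d = np.linalg.norm(x_new - x_hat) / np.sqrt(x.size)
        x = x_new
        if d_prev is not None and abs(d - d_prev) < tol:
            break
        d_prev = d
    return x


n = 32
phi = np.kron(np.eye(n // 2), np.array([[1.0, 1.0]])) / np.sqrt(2.0)
u = np.linspace(0.0, 1.0, n)
image = 255.0 * np.outer(u, u)  # smooth synthetic test image
y = phi @ image
x_rec = spl(y, phi)
print(x_rec.shape, np.allclose(phi @ x_rec, y))
```

Because each iteration ends with a projection, the returned estimate always reproduces the measurements; the quality of the estimate itself depends on the transform and threshold, which are only placeholders here.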
The above sections were about CS and reconstruction. The next section introduces a popular image storage technique, which is a key component in our system flow.
JPEG Theory
JPEG stands for Joint Photographic Experts Group. It is a very widely used image compression technique. It can perform both lossless and lossy compression, though lossy compression is the most widely used mode. Lossy compression relies on the fact that most of the image information is contained in very few coefficients in the discrete cosine transform (DCT) domain. Thus, a vast majority of insignificant coefficients can be discarded without much loss in perceptual quality, resulting in large compression ratios.

function $x^{(i+1)} = \mathrm{SPL}(x^{(i)}, y, \Phi_B, \Psi, \lambda)$

Journal of Electronic Imaging 013019-3 Jan/Feb 2018 • Vol. 27(1)

JPEG first divides the image into 8 × 8 pixel blocks and then calculates the DCT of each block. A quantizer rounds off the resulting DCT coefficients according to the quantization matrix, which controls the amount of compression. This step represents the "lossy" part of JPEG but allows for large compression ratios. After quantization, the data are compressed further by variable length encoding of these coefficients. While JPEG has been applied previously to CS sampled images (Ref. 18), compression performance has not been reported. Li et al. (Ref. 18) also used a Gaussian random matrix to compressively sample the image. When we sample an image with a Gaussian random matrix, the sampled image has a Gaussian distribution and the image-like properties are lost. This results in very poor JPEG compression performance, which will significantly increase the effort and energy required to store the image.
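The block DCT and quantization steps described above can be illustrated with a short sketch. It assumes a single uniform quantization step `q_step` in place of a real JPEG quantization table and omits the entropy coding stage, so it only shows why coarser quantization leaves fewer nonzero coefficients to store; `quantize_block` is a hypothetical name.

```python
import numpy as np


def dct_matrix(n: int = 8) -> np.ndarray:
    """Orthonormal DCT-II matrix C, so that C @ C.T is the identity."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c


def quantize_block(block: np.ndarray, q_step: float):
    """DCT an 8 x 8 block, quantize it uniformly, and reconstruct it."""
    c = dct_matrix(8)
    coef = c @ (block - 128.0) @ c.T  # level shift, then 2-D DCT
    levels = np.round(coef / q_step)  # the lossy step
    rec = c.T @ (levels * q_step) @ c + 128.0
    return levels, rec


row = np.linspace(50.0, 200.0, 8)
block = np.tile(row, (8, 1))  # smooth horizontal ramp, typical of natural images

fine_levels, fine_rec = quantize_block(block, q_step=1.0)
coarse_levels, coarse_rec = quantize_block(block, q_step=40.0)
print(np.count_nonzero(fine_levels), np.count_nonzero(coarse_levels))
```

In this simplified model each coefficient is off by at most half a step, so the reconstruction error grows with `q_step` while the number of coefficients that must be stored shrinks.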
Deterministic CS and Superresolution
Traditionally, the projection or sampling matrix Φ is chosen as a Gaussian random matrix as it possesses good RIP and is highly incoherent with most sparsifying bases. However, hardware implementation of a Gaussian random matrix is infeasible. A deterministic construction of a sampling matrix can result in considerable simplification of hardware implementation. A method for deterministic construction of matrices was first introduced in detail in Ref. 19. The author used finite fields to construct cyclic matrices that satisfy RIP. This is popularly known as deterministic CS. Other methods for deterministic construction have also been proposed, such as the one in Ref. 20, in which the authors used Euler square-based binary CS matrices, which outperformed their Gaussian counterparts.
Superresolution (SR) implies construction of high-resolution images from one or more low resolution images. Traditionally, SR had been done using a set of low-resolution images. The idea is to enforce the constraint of sparsity in the transform domain such as wavelet to reconstruct the image. However, using CS for SR means that the sampling matrix is no longer random but deterministic. The sampling or projection matrix for SR is guided by an imaging model. SR sampling matrix L can be viewed as the product of two matrices as follows (see Ref. 21 ):
$$L = R L_p, \tag{16}$$

where R is the decimation operator or downsampler and $L_p$ is the low pass filter. Since there is a low pass filter involved in the construction of L, it will have a frequency discriminative nature. It will filter out high frequency components but preserve low frequency components. By contrast, a Gaussian random matrix will preserve all frequencies. This means that L exhibits good RIP characteristics for a class of signals that contain low frequency information only, but a Gaussian random matrix has good characteristics for any class of signals (see Ref. 21). However, in the case of natural images, most of the energy is concentrated in the low frequency components. Hence, if the cutoff frequency for $L_p$ is appropriately set, the loss might not be too large, resulting in reasonable reconstruction. Lossy image compression algorithms also weed out or reduce the high frequency components during the process of compression. Sen and Darabi performed SR CS reconstruction (Ref. 4) using a filtered and point downsampled image. In our work, we present an image sensor design (see Sec. 4) for filtering and downsampling the image in the CMOS image sensor itself without additional hardware, resulting in significant power savings. An advantage is that, because we use filtering and downsampling, we do not need randomization of the sampling matrix. This also results in significant savings in terms of hardware and power consumption as there is no need for a random generator and associated wiring.
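To make the factorization of the SR sampling matrix concrete, the sketch below builds one possible downsampler and low pass filter and multiplies them. The specific choices (an unnormalized two-tap circulant sum filter, decimation by two, a length-4 signal) are assumptions for illustration only, as are the helper names.

```python
import numpy as np


def circulant_average(n: int, taps: int = 2) -> np.ndarray:
    """n x n circulant filter: row i sums samples i, i+1, ..., i+taps-1 (mod n)."""
    first = np.zeros(n, dtype=int)
    first[:taps] = 1
    return np.stack([np.roll(first, i) for i in range(n)])


def downsampler(n: int, factor: int = 2) -> np.ndarray:
    """Selection matrix that keeps every `factor`-th sample."""
    return np.eye(n, dtype=int)[::factor]


n = 4
L_p = circulant_average(n)  # low pass (moving sum) filter
R = downsampler(n)          # decimation by two
L = R @ L_p                 # SR sampling matrix as a product of the two

print(L)
```

With these choices the product is a 2 × 4 matrix whose rows each add one pair of neighboring samples, which is the kind of deterministic, hardware-friendly sampling operator discussed above.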
The next sections will introduce the hardware aspect of image sensors.
Photodetectors
There are mainly three types of photosensing elements: photogates, phototransistors, and photodiodes. In this work, we used photodiodes. There are different types of photodiodes, too. We used a simple p-n junction, although a more sophisticated p-i-n junction can be used to improve the efficiency of an image sensor. As the name implies, the p-i-n junction consists of an intrinsic region between the p and n regions. The p-i-n junction device reduces dark current and charge-transfer noise (Ref. 22). Since these benefits concern efficiency and noise rather than function, using a p-n junction instead of a p-i-n junction does not affect the demonstration of the main functionality of our system design methodology.
There are various types of p-n junction photodiodes also. They are n+/p-sub, n-well/p-sub, and p+/n-well/p-sub. Murari et al. 23 list the parameters and advantages of various photodiodes. We use n+/p-sub because of the large fill factor, low dark current per unit area values, and ease of implementation to demonstrate our concept. Its schematic diagram is shown in Fig. 2 .
Image Sensors
In the past decade, extensive research has been done on CMOS sensors. An image pixel can be broadly divided into two parts: a photodetector element and a sensing circuit. Depending on the sensing circuit, there are two main families of image pixels: the active pixel sensor and the passive pixel sensor. A passive pixel sensor reads out the charge of the photodetector, which is amplified later. An active pixel sensor has a photodetector and an active amplifier. Passive pixel sensors have mostly been implemented with charge coupled device technology, while active pixel sensors are implemented using CMOS technology. The decreasing size and cost of CMOS elements have made CMOS image sensors viable and the technology of choice (Ref. 25). The ever decreasing size of transistors has made high-resolution image sensors possible. The most popular active pixel sensor designs are 3T, 4T, and capacitive transimpedance amplifier (CTIA) pixels. CTIA is mostly used in scientific applications, while 3T and 4T are mostly used in commercial systems. We will not discuss CTIA, but the results presented can be applied to the CTIA pixel as well. The schematic diagrams for the 3T and 4T pixels are shown in Fig. 3.
The 3T pixel is very compact but has lower sensitivity and an unstable bias voltage across the photodiode. This pixel architecture consists of a photodiode and three transistors: a reset transistor (M_R), a source follower (M_SF), and a row select transistor (M_RS). In 3T pixel operation, the photodiode is first reset using the reset transistor. Charge then collects on the photodiode in proportion to the light signal and the exposure time. After a set integration time, the row select transistor is turned on to read out the signal using external readout circuitry.
The 4T (four transistor) pixel architecture is shown in Fig. 3 (Ref. 26). It has two additional elements compared with the 3T architecture, namely, the transfer gate (TX) and the floating diffusion node (FD). It uses either a pinned photodiode (PPD) or a normal photodiode (PD), depending on the design shown in Fig. 3. As long as TX is off, charge is accumulated in the PPD or PD. When TX is turned on after a set integration time period, the charge is transferred to the floating diffusion node. We used the 4T pixel design with a PD for our implementation as we did not have a PPD model to perform the simulation. It is expected that the result will be similar with a PPD, as explained in an earlier subsection.
Because the charge collection area and readout area are separated in the 4T pixel via a M_Tx transistor, it offers some key advantages. While the 3T design can only implement a rolling shutter, the 4T design can implement both rolling as well as global shutters. A global shutter is very important for the high speed imaging application. The 4T pixel also allows low noise operation through the use of the correlated double sampling (CDS) technique. The reset noise or kTC noise is the main source of noise resulting from the resetting operation of FD node through the resistive channel of the reset transistor. Thus, the CDS technique can be employed to sample the floating diffusion node before and after M_Tx is turned on within a short time interval, thereby eliminating kTC noise. This operation is shown in Fig. 4 .
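The effect of CDS can be illustrated with an idealized numerical model. The sketch below assumes that the pixel offset and the kTC noise sampled at reset are identical in both samples of a frame and ignores any noise that is uncorrelated between the two samples; all voltage values and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels = 1000

signal = rng.uniform(0.1, 0.8, n_pixels)  # voltage drop due to photo-charge
offset = rng.normal(0.0, 0.05, n_pixels)  # per-pixel offset (fixed pattern)
ktc = rng.normal(0.0, 0.01, n_pixels)     # reset noise frozen on FD for this frame

v_reset = 2.0 + offset + ktc              # sample 1: FD after reset, before TX
v_signal = v_reset - signal               # sample 2: FD after TX transfers charge

cds_output = v_reset - v_signal           # correlated double sampling
single_sample = 2.0 - v_signal            # readout without the reset sample

print(np.max(np.abs(cds_output - signal)), np.std(single_sample - signal))
```

In this model the difference of the two samples recovers the signal exactly, whereas a single sample still carries the offset and reset noise.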
Transfer transistor or M_Tx makes the bias voltage across photodiode very stable. It also helps us to increase sensitivity because the integration capacitor can be kept small. CTIA has around eight transistors, but it has the highest sensitivity among all of them and a stable photodiode voltage. Because of the large pixel size, it is not used much in commercial systems. It is mostly used in scientific applications.
Nonidealities in Image Sensors
Nonidealities can be broadly classified into two major groups: pixel-level nonidealities and readout-level nonidealities (Ref. 3). Both of them present challenges to image sensor designers. Major pixel-level nonidealities are dark signal nonuniformity, offset fixed pattern noise, photoresponse nonuniformity, pixel response nonlinearity, and pixel temporal noise. Major readout-level nonidealities are offset column fixed pattern noise, gain error column fixed pattern noise, readout nonlinearity, readout temporal noise, readout output voltage range, and quantization.
In this paper, we will not deal with temporal noise, but we will consider the fixed pattern noise and application of CS to overcome the challenges posed by fixed pattern noise. We are also not dealing with readout output voltage range and quantization noise as it is a research problem by itself and has been included in the future course of our work. Offset fixed pattern noise can be easily dealt with using the CDS technique, but photoresponse nonuniformity, pixel response nonlinearity, and gain error fixed pattern noise require sophisticated circuitry to deal with. A simple way to deal with this problem is discussed in Sec. 3 of the paper. An example image with column and pixel-level fixed pattern noise is shown in Fig. 5 .
Simulation
Our entire system design methodology can be described using the block diagram in Fig. 6 . The input to the system is an image. The image gets sampled using a compressed sampling technique using either of the sampling matrices. This sampling function is implemented in the image sensor itself. The design of the image sensor is discussed in Sec. 4.
Depending on the quality desired, a specific number of bits are truncated while sampling. From here, the image goes to the image processor of the camera system. Here, the image is compressed using the JPEG technique as mentioned earlier.
JPEG encoding can also be done using an ASIC or an FPGA. We used different levels of compression in JPEG to study the performance of our system, ranging from various levels of lossy compression to lossless compression. After compression, the image may be transmitted over a communication medium. The compressed image is then decompressed. If the compression was lossy, there will be some loss of information. The decompressed image is then reconstructed using the SPL algorithm mentioned previously. The reconstruction performance is measured using the PSNR metric. Now, we demonstrate the reconstruction results of our proposed system flow. We use both binary and nonbinary block diagonal matrices to compressively sample the image. The binary block diagonal ($\Phi_B$) and nonbinary block diagonal ($\Phi_{NB}$) sampling matrices are given below.

$$\Phi_B = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \tag{17}$$

$$\Phi_{NB} = \begin{bmatrix} 9 & 7 & 0 & 0 \\ 0 & 0 & 9 & 7 \end{bmatrix}. \tag{18}$$
Since matrix $\Phi_B$ adds two neighboring pixels, it does not significantly alter the statistical distribution of the image and hence preserves the image-like properties. However, matrix $\Phi_{NB}$ performs weighted addition, so it does alter the distribution but still preserves some image-like properties. It does not alter the image as significantly as a random Gaussian sampling matrix, which makes the distribution of the resulting sampled image Gaussian. We arrived at $\Phi_{NB}$ empirically, and it was found to be optimal. One pixel is weighted ∼1.3 times relative to the other in $\Phi_{NB}$. One can also try higher relative weights, but they will be difficult to implement in hardware due to large capacitor requirements (as per our design presented in the next section). Since we have fixed-bitwidth analog-to-digital converters (ADCs), we can only use integer weights to sample the image; otherwise, we will lose the information contained in the fractional part. The image sampled using $\Phi_{NB}$ requires 12 bits to store each pixel of the resulting image. For $\Phi_B$, 9 bits are required for the same.
Because of the way our sampling matrix is constructed, block size will not have any effect on reconstruction performance. We can see this from two matrices of different block sizes presented below.
$$\Phi_{2 \times 4} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \tag{19}$$

$$\Phi_{4 \times 8} = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \end{bmatrix}. \tag{20}$$

We can see that both of the above block matrices, when used as the sampling matrix, actually perform the same function of adding two rows. The 4 × 8 matrix is actually two 2 × 4 matrices along the main diagonal of the sampling matrix presented in Eq. (12). Thus, the 4 × 8 matrix can be expressed in terms of the 2 × 4 matrix as follows:

$$\Phi_{4 \times 8} = \begin{bmatrix} \Phi_{2 \times 4} & 0 \\ 0 & \Phi_{2 \times 4} \end{bmatrix}. \tag{21}$$
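Thus, both of them lead to the same sampling matrix presented in Eq. (12). Hence, there will not be any effect on performance. The same reasoning applies to the nonbinary sampling matrix also. According to Eq. (16), Eq. (19) can be viewed as the product of a downsampler (R) and a circulant averaging filter ($L_p$). These matrices are as follows:

$$R = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \tag{22}$$

$$L_p = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}. \tag{23}$$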
The sampling matrix in Eqs. (17) and (18) is sparse, leading to less incoherence with sparse transforms such as the wavelet transform, which is used as the sparse basis (Ref. 5). However, it still works well for CS because, according to Eq. (10) and Ref. 5, reduced incoherence can be compensated for by taking more samples to reconstruct the signal with high probability. Since we use 50% compression in CS, this matrix works well, as shown by our experimental results. Our sampling matrix in Eqs. (17) and (18) does not satisfy Eq. (2). We call our sampling matrix in Eqs. (17) and (18) a front-end sampling matrix. We have to perform a transformation on the front-end sampling matrix so that it satisfies Eq. (2). The matrix resulting from the transformation is known as the back-end sampling matrix. We use a front-end sampling matrix because it is very easy to implement at the sensor level. The transformation from front-end to back-end is very simple. We multiply the front-end sampling matrix by a normalization constant. The normalization constant is simply the reciprocal of the square root of the sum of squares of all the elements in a row of the matrix.
$$\Phi_{\text{back-end}} = \frac{1}{\sqrt{N}}\, \Phi_{\text{front-end}}, \tag{24}$$

where N is the sum of squares of the row elements of the matrix. The back-end sampling matrix generated from Eq. (24) will satisfy Eq. (2), and this transformation can be implemented in the reconstruction algorithm itself. Multiplication of this transformation constant with the compressively sampled image (using the front-end sampling matrix) is equivalent to sampling the image using a back-end sampling matrix, which is what is desired. Thus, we use a back-end sampling matrix as the sampling matrix in the reconstruction algorithm. Using the transformation, we calculated our back-end sampling matrices as follows:

$$\Phi_{B,\text{back-end}} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \tag{25}$$

$$\Phi_{NB,\text{back-end}} = \frac{1}{\sqrt{130}} \begin{bmatrix} 9 & 7 & 0 & 0 \\ 0 & 0 & 9 & 7 \end{bmatrix}. \tag{26}$$
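The front-end to back-end normalization can be checked numerically. The sketch below assumes the 2 × 4 binary pair-sum matrix and the nonbinary matrix with weights 9 and 7 as front-end matrices, and a random column of 8-bit values as the signal; `to_back_end` is a hypothetical helper name.

```python
import numpy as np


def to_back_end(phi_front: np.ndarray) -> np.ndarray:
    """Divide each row by the square root of its sum of squares."""
    row_energy = np.sum(phi_front.astype(float) ** 2, axis=1, keepdims=True)
    return phi_front / np.sqrt(row_energy)


phi_b = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1]])   # binary front-end matrix (pair sum)
phi_nb = np.array([[9, 7, 0, 0],
                   [0, 0, 9, 7]])  # nonbinary front-end matrix

for name, phi in (("binary", phi_b), ("nonbinary", phi_nb)):
    back = to_back_end(phi)
    print(name, np.allclose(back @ back.T, np.eye(2)))  # orthonormal rows

# Scaling the front-end measurements is equivalent to back-end sampling.
rng = np.random.default_rng(2)
x = rng.integers(0, 256, size=4)
y_front = phi_nb @ x                   # what the sensor would output
y_back = y_front / np.sqrt(9**2 + 7**2)  # rescaled inside the reconstruction
print(np.allclose(y_back, to_back_end(phi_nb) @ x))
```

This is why the sensor can keep integer weights: the scaling needed for orthonormal rows can be deferred to the reconstruction software.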
The level of lossy compression in JPEG is controlled using the quality parameter of the MATLAB® function. We measured the size of this compressed image. The baseline for image size is taken as the size of the JPEG image (quality = 75, bits = 8), i.e., the raw image stored as JPEG with quality = 75 and bitdepth = 8. The size of a compressively sampled image is reported as a percentage of this baseline. The baseline image for PSNR measurement is the raw image. The metrics for the baseline are shown in Table 1.
We used a set of 30 images to perform the above simulation. These images are shown in Fig. 7 . All the images are in grayscale 512 × 512 format. In Fig. 8, we show how the average of the normalized size of the raw image stored in JPEG format scales with the quality factor. Similarly, in Fig. 9 , we show how reconstruction performance of JPEG (measured as the average of PSNR of 30 images) varies with the quality factor of JPEG. Table 2 lists the results for the system shown in Fig. 6 for the binary sampling matrix with input parameters such as quality factor of JPEG and bitdepth of the image. The output values are normalized size, PSNR for reconstruction, and on-chip compression.
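For reference, a PSNR computation of the kind used as the reconstruction metric can be sketched as follows. The peak value of 255 (8-bit grayscale) and the averaging of per-image PSNR values are assumptions about the measurement, and the arrays are synthetic placeholders rather than the test images.

```python
import numpy as np


def psnr(reference: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((reference.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))


rng = np.random.default_rng(3)
references = [rng.integers(0, 255, size=(64, 64)) for _ in range(3)]
reconstructions = [ref + 1 for ref in references]  # every pixel off by one level

scores = [psnr(r, t) for r, t in zip(references, reconstructions)]
average_psnr = float(np.mean(scores))
print(average_psnr)
```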
$$\text{On-chip compression} = \frac{16 - \text{bitdepth}}{16}, \tag{27}$$
where bitdepth is the number of bits required to represent each pixel of the compressively sampled image. Since each raw-pixel is 8 bits and we add 2 pixels while doing CS, we calculate on-chip compression relative to 16 bits in Eq. (27) . Similarly, Table 3 shows the same for the nonbinary matrix. The best case and worst case image reconstruction for the nonbinary CS followed by lossless JPEG is shown in Fig. 10 .
Since we add two 8 bit pixels for the binary sampling matrix, we need 9 bits to represent the addition perfectly. For the nonbinary matrix, we use weights of 9 and 7 for each pixel, so the max value of weighted pixels is 16 × 255. Hence, we need 12 bits to represent the weighted addition perfectly.
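The bit widths and the on-chip compression figures follow directly from the weights and Eq. (27). The sketch below works through the arithmetic, assuming 8-bit raw pixels with a maximum value of 255; the function names are illustrative.

```python
def bits_needed(max_value: int) -> int:
    """Smallest number of bits that can represent every value in 0..max_value."""
    return max_value.bit_length()


def on_chip_compression(bitdepth: int, reference_bits: int = 16) -> float:
    """Eq. (27): fraction of raw bits saved relative to two 8-bit pixels."""
    return (reference_bits - bitdepth) / reference_bits


binary_max = (1 + 1) * 255     # two pixels added with unit weights
nonbinary_max = (9 + 7) * 255  # two pixels added with weights 9 and 7

for name, peak in (("binary", binary_max), ("nonbinary", nonbinary_max)):
    depth = bits_needed(peak)
    print(f"{name}: max {peak}, {depth} bits, "
          f"on-chip compression {100 * on_chip_compression(depth):.2f}%")
```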
We can see from Tables 2 and 3 that the performance of the binary and nonbinary matrix for CS with lossless JPEG and without bit-truncation is almost the same. This is in agreement with the results stated in Ref. 5 . We can also see that the storage size is very high for CS with lossless JPEG. This has the potential to degrade the performance of the imaging system when it comes to storage, and we will need a much more complicated JPEG decoder. To decrease the size, we can either decrease quality or truncate LSBs or both. By truncating LSBs, we not only decrease the size of the image but also significantly simplify the ADC design as well as the JPEG encoder and decoder design. This simplified decoder will also consume less energy because of reduced switching activity resulting from reduced bitwidth. Similarly, we can also decrease the quality factor to decrease the size. For example, if we use a default quality factor, i.e., 75, we see that performance loss is not much but size is much smaller.
In general, for a given quality factor, the nonbinary matrix performs noticeably better than the binary matrix. This is because its larger bitwidth preserves much more information, which makes it more resilient to degradation during the JPEG quantization step. This is also evident in the graph shown in Fig. 11, where none of the LSBs have been truncated. The better PSNR for the nonbinary sampling matrix comes at the cost of increased image size. A comparison between the normalized image sizes resulting from the binary and nonbinary sampling matrices for bitdepth = 9 and bitdepth = 12, respectively, is shown in Fig. 12. By pruning some LSBs, we can decrease the image size at the cost of the PSNR of the reconstructed image. Thus, the nonbinary sampling matrix offers more control over image quality than the binary sampling matrix. We can also see from Tables 2 and 3 that, for a given quality factor, as we truncate the LSBs of the CS sampled image in the nonbinary sampling method, the result approaches that of the binary sampling method, i.e., the performance of the nonbinary matrix almost equals that of the binary matrix for the same bitdepth. For the maximum performance case, i.e., CS with lossless JPEG, the performance of both sampling matrices is the same when each is used at its full bitdepth. While the result for the maximum performance case for CS is roughly 2 dB less than the baseline JPEG case of Table 1, the former provides roughly 43% raw data compression, but the latter provides none. Reduction in raw data rate will significantly simplify our system design. This is discussed in the next section. These were the simulations for grayscale images. For colored images, the procedure is straightforward. In the case of RGB images, the three color planes can be treated as three separate images and CS can be applied to each of them.
The reconstruction performance for the colored Lenna image is given in Table 4.
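The per-plane sampling and LSB truncation described above can be sketched as follows. This is a minimal illustration, not the simulation code used for Tables 2 to 4: it assumes that the sampling combines horizontally adjacent pixel pairs, and the function names and the dictionary layout for an RGB image are hypothetical.

```python
def sample_row(row, weights=(1, 1), drop_lsbs=0):
    """Combine adjacent pixel pairs into weighted sums, then drop LSBs.

    A trailing unpaired pixel (odd row length) is ignored in this sketch.
    """
    w0, w1 = weights
    return [(w0 * row[i] + w1 * row[i + 1]) >> drop_lsbs
            for i in range(0, len(row) - 1, 2)]


def sample_plane(plane, weights=(1, 1), drop_lsbs=0):
    """Apply the pairwise sampling to every row of one grayscale plane."""
    return [sample_row(row, weights, drop_lsbs) for row in plane]


def sample_rgb(image, weights=(1, 1), drop_lsbs=0):
    """Treat each color plane as an independent grayscale image."""
    return {name: sample_plane(plane, weights, drop_lsbs)
            for name, plane in image.items()}


# Toy 2 x 4 image; real inputs would be 512 x 512 planes.
toy = {
    "R": [[10, 20, 30, 40], [255, 255, 0, 1]],
    "G": [[0, 0, 0, 0], [1, 2, 3, 4]],
    "B": [[255, 255, 255, 255], [5, 5, 5, 5]],
}
binary = sample_rgb(toy)                                   # 9-bit sums
nonbinary = sample_rgb(toy, weights=(9, 7), drop_lsbs=3)   # 12 bits cut to 9
print(binary["R"], nonbinary["R"])
```

Each sampled plane has half as many columns as the input, and dropping three LSBs from the 12-bit nonbinary sums brings them to the same 9-bit range as the binary sums.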
The next section will discuss the implementation of a front-end sampling matrix on the image sensor level.
Design
This section discusses the sensor level design to implement the front-end sampling matrix presented in the previous section. It also briefly discusses the ADC and JPEG encoder.
When it comes to hardware implementation, the binary block diagonal matrix means an addition of the row or column pixels. The number of pixels to be added is the number of ones in the row of the sampling matrix. For our binary sampling matrix, we can simply implement this using double-sized pixels. We can choose any pixel design, i.e., 3T or 4T. Large pixels have better SNR values because dark current decreases much faster than sensitivity as area increases. 24 Even if noise is larger in smaller pixels, it is taken care of using the CDS technique, so the higher noise level of smaller pixels is not much of an issue. If we use a large photodiode to implement the binary sampling matrix, the fill factor of the pixel increases. If the fill factor for a given pixel design is f, then using a double-sized photodiode will give a fill factor of roughly 2f/(1 + f). For f = 0.7, the new fill factor is approximately 0.82. This increased fill factor can compensate for the loss due to the reconstruction algorithm.
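One way to arrive at the 2f/(1 + f) estimate is sketched below. It assumes that the double-sized pixel keeps the same non-photosensitive circuit area as a single original pixel; this assumption, along with the symbols A (original pixel area), A_pd, and A_ckt, is introduced here for illustration and is not stated explicitly in the text.

```latex
\begin{align*}
A_{\text{pd}} &= fA, \qquad A_{\text{ckt}} = (1 - f)A \\
f_{\text{new}} &= \frac{2A_{\text{pd}}}{2A_{\text{pd}} + A_{\text{ckt}}}
               = \frac{2fA}{2fA + (1 - f)A}
               = \frac{2f}{1 + f} \\
f = 0.7 \;&\Rightarrow\; f_{\text{new}} = \frac{1.4}{1.7} \approx 0.82
\end{align*}
```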
The nonbinary block diagonal matrix has to perform weighted addition. This can be done using our design shown in Fig. 13, which is inspired by the 4T design. We used a very simple technique to perform weighted addition. We used a small capacitance (the gate capacitance of a MOS transistor) to decrease the response of one of the photodiodes by placing it before the shutter or Tx transistor. This MOS is labeled as cap in Fig. 13. This effectively decreases the sensitivity of the photodiode, and it generates less output (the output of a photodiode is actually a decrease in the output voltage w.r.t. the reset voltage level of the photodiode because photocurrent flows to discharge the junction capacitance of the photodiode) as compared with the other photodiode without the additional capacitance. Thus, if the same amount of light falls on both photodiodes, one photodiode will generate less output voltage than the other. When the shutter MOS transistors (i.e., Tx_1 and Tx_2) open, current drains from the floating diffusion node to the photodiodes. Since one photodiode has less voltage than the other, it will draw less current than the other. This is because our circuit is operated in the transient state rather than the steady state. The shutter open time is set such that the circuit remains in the transient state. Since the two currents are unequal, the resulting voltage at the floating diffusion node, i.e., FD, behaves like the weighted addition of two equal signals. For the nonbinary sampling matrix, we used weights of 9 and 7, so the relative weight of one pixel w.r.t. the other is ∼1.3 (9/7). The circuit shown in Fig. 13 also achieves approximately the same weight. Since we can get good images even after truncating some LSBs, the weighted addition does not have to be very exact, as the errors will get truncated, too. The Spectre simulation results for the circuit are stated in Table 5. The weight in the table was calculated with the CDS technique in mind.
The weight was calculated by curve fitting for 100 different points. For generating these points, the photocurrent in each photodiode was varied from 100 to 1000 fA in steps of 100 fA. This generated 10 points for each photodiode. Then, all possible permutations of these two sets (one set for each photodiode) of 10 points were taken to generate 100 different points. Figure 14 shows the sweep analysis performed for these 100 points (offset voltage has been removed). Figure 15 shows a plot to demonstrate weighted addition of photodiode outputs. The curves in the plot represent the output voltage values for the proposed pixel circuit for two different cases. In each case, the photocurrent of one of the photodiodes is fixed at 100 fA, and the other one is varied from 100 to 1000 fA in steps of 100 fA. Thus, for a given current value in the x-axis of the plot, the total charge generated in the pixel will be the same, but the output of the pixel will be different for both cases because of the weighted addition of photodiode output. The addition of a capacitor in one of the photodiodes results in a decrease in sensitivity. In traditional designs, a decrease of sensitivity implies a loss of resolution, but in our design reconstruction algorithms help us recover this information.
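The curve-fitting step can be illustrated with a small least-squares sketch. The Spectre sweep data are not available here, so the 100 output values are synthesized from the fitted coefficients reported in Eq. (29); the helper names `solve` and `fit_plane` and the linear model output = c0 + c1·p1 + c2·p2 are assumptions made for this illustration.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_plane(points):
    """Least-squares fit of output = c0 + c1*p1 + c2*p2 (normal equations)."""
    rows = [(1.0, p1, p2) for p1, p2, _ in points]
    ys = [y for _, _, y in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve(ata, aty)


# Stand-in for the simulated sweep: every pairing of 10 currents (in fA)
# per photodiode, with outputs generated from the Eq. (29) coefficients.
currents = range(100, 1001, 100)
points = [(p1, p2, 1.037 - 4.065e-5 * p1 - 3.324e-5 * p2)
          for p1 in currents for p2 in currents]

c0, c1, c2 = fit_plane(points)
weight = c1 / c2   # ratio of the p1 and p2 coefficients
print(len(points), round(weight, 2))
```

On this synthetic data the fit recovers the coefficient ratio 4.065/3.324, i.e., a weight of about 1.22; with real sweep data the fit would also absorb the pixel nonlinearity as residual error.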
If we truncate the bits, we significantly simplify the ADC design, too. Bit truncation in the simulation can be implemented in hardware by decreasing the ADC resolution. This will result in a simpler and more power-efficient ADC. Since noise and linearity requirements are relaxed at lower resolutions, voltage scaling can help us achieve an exponential reduction in power consumption. 27 Since the ADC is responsible for a major share of the power consumption during raw image acquisition, 1,28 our technique will have a significant impact in reducing the power consumption.
We designed our pixel for both frontside illumination (FSI) and backside illumination (BSI). 29 The FSI layout for the circuit of Fig. 13 is shown in Fig. 16. In the FSI layout, light enters from the frontside of the sensor, whereas in BSI it enters from the backside. This means that, in BSI, we can draw metal lines over the photodiode and increase the fill factor. There are two different technologies in BSI, which are shown in Fig. 17. They are conventional BSI and stacked BSI. 29 In conventional BSI, the logic circuit and the pixel circuit are in the same plane. Metal wiring can be drawn over the pixel circuit as light enters from the backside. This results in an increase of the fill factor. In stacked BSI, the logic circuit and pixels are in different planes. This means that the fill factor is almost 100% for stacked BSI. The layouts for conventional BSI and stacked BSI for our pixel circuit are given in Figs. 18 and 19. We used the TSMC 200-nm technology library and Cadence Design tools to implement our design. The advantages associated with on-chip implementation of CS do not depend on the technology of choice; the approach works equally well in any technology.
Table 5 Calculated weight using curve fitting (100 points) = 1.22
The junction capacitance, responsivity, and dark current for the photodiode used in our pixel were estimated using the data and graphs presented in Refs. 24 and 23. The formula for junction capacitance is given as Eq. (28),
where $C_{J0}$ and $C_{J0sw}$ represent zero-bias capacitance at the bottom and sidewall components, respectively, $V_d$ is the voltage applied to the photodiode, $v_j$ and $v_{jsw}$ stand for the built-in potential of the bottom and the sidewall, respectively, $m$ and $m_{jsw}$ are the grading coefficients of the bottom and the sidewalls, respectively, $A_D$ is the photodiode area in m², and $P_D$ represents the photodiode perimeter in m. These parameters are given in Table 6 for our design. The table also lists the fill factor for our layout.
A problem with our pixel design is that it is nonlinear. The switching transistor, source follower, and active capacitances all contribute to nonlinearity. This nonlinearity can be removed by curve fitting. In an actual system, a lookup table can be used to simplify implementation. The equation obtained after curve fitting is reproduced as
$$\text{output} = 1.037 - (3.324 \times 10^{-5})\,p2 - (4.065 \times 10^{-5})\,p1, \qquad (29)$$
where p1 and p2 refer to the photocurrents in photodiode 1 and photodiode 2, respectively, in fA, and output refers to the output voltage in V of the pixel shown in Fig. 13. The weight for weighted addition has been calculated by taking the ratio of the coefficients of p1 and p2.
Fig. 15 Plot showing weighted addition of photodiode outputs. For each curve, photocurrent in one of the photodiodes is fixed at 100 fA while the other one is varied from 100 to 1000 fA in steps of 100 fA. Each point in the x-axis represents the same amount of charge generated in the pixel, but output is different due to weighted addition of photodiode outputs.
Our system design methodology simplifies the JPEG encoder and decoder design as well. JPEG generally takes the DCT of image blocks of size 8 × 8. Since we combine two pixels into one, we effectively reduce the number of blocks by half. This will cut energy spent during encoding by half. Since the encoder design is mostly pipelined, it will also reduce encoding latency by half. Since workload is reduced, one can reduce the voltage and frequency of operation of the JPEG encoder to maintain the same latency. This will result in an exponential decrease in energy consumption during the encoding and decoding processes. If we truncate the LSBs of the image, this will lead to additional simplification of encoder and decoder design and power savings. It leads to a proportional decrease in switching activity and hence dynamic power. It reduces the register as well as arithmetic unit bitwidth. Reduction in bitwidth of the arithmetic unit can lead to a direct reduction in latency of such units and the chip floor-area.
Other implementations for both the binary and nonbinary sampling matrices are also possible. For implementing the binary sampling matrix, we can also use the design presented in Fig. 13. We can remove the cap MOS from the circuit to do so.
This design will be especially useful when we have two photodiodes separated, as in the colored image sensor implemented using the popular Bayer pattern. 30 This is shown in Fig. 20. One can see that each photodiode representing a color has to be separated. It might not be possible to make a large single-color photodiode because of resolution reasons. Hence, we can use the design presented in Fig. 13 for both the nonbinary matrix and the binary matrix (i.e., without cap). While the theoretical assumption is that all source followers or photodiodes provide the same gain/response, this is hardly the case in practice. Because of manufacturing inconsistencies, the two photodiodes will have different responses and parasitics. Thus, the binary matrix will become nonbinary in actual implementation. This will not pose any problem because a nonbinary matrix works equally well. By incorporating such inconsistencies further into the sampling matrix, we can solve the problems posed by fixed pattern noise during the reconstruction. There are multiple sources of fixed pattern noise, such as photoresponse nonuniformity and source follower mismatch. 31 We can handle this mismatch by incorporating it in the sampling matrix. If all source followers or photodiodes provide different gains/responses, then we can use different weights in our sampling matrix for each of them.
Yet another way to implement a binary or nonbinary matrix is to sum the pixels at the ADC level, similar to what was presented in Ref. 1. This has the advantage of offering a choice between CS and non-CS modes of operation. However, one has to pass an address for each pixel. This will slow down the frame rate and increase power consumption. There are certain additional disadvantages associated with it, which are discussed below.
The advantage of CS implemented at the sensor level is not limited to a reduction of data rate. Our implementation shown in Fig. 13 for the nonbinary matrix requires only six transistors (excluding floating diffusion nodes and capacitance) per 2 pixels, i.e., three transistors per pixel with global shutter. This means an improvement in fill factor and a reduction in the size of pixels. This also means less power consumption. A simple analysis of power consumption for image acquisition can be performed by looking at the data given in Refs. 1 and 28. The two papers use different designs, technologies, and specifications. Hence, the power consumption is different for each of them. However, the relative breakdown of power spent in I/O, ADC, pixel, and other operations is approximately the same. Roughly 90% of the power is spent in I/O and ADC during image acquisition. Our proposed CS implementation cuts the I/O and ADC operations exactly by the on-chip compression ratio, i.e., by 25% to 68.75%. We used the data for the normal mode of operation at 120 frames/s in our work. The data are reproduced in Table 7. The table also lists the estimated power if our system design methodology is implemented in the normal image sensor described in the paper. We also incorporated the power spent during JPEG compression in the same table using the design presented in Ref. 32 as reference. Note that we only performed a rough approximation of JPEG power consumption based on switching activity. We can see from Table 7 that one can achieve ∼23.5% to 65% power savings for compression ratios of 25% to 68.75%, respectively. In this work, we only consider the energy spent during the image acquisition and compression process.
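A coarse version of this estimate can be sketched as follows. It assumes that exactly 90% of the acquisition power sits in I/O and ADC and scales linearly with the on-chip compression ratio, and that the remaining power is unchanged; the function name is hypothetical, and the bitdepth of 5 is only the value implied by Eq. (27) for 68.75% compression. It does not reproduce the per-component data of Table 7 or the JPEG contribution.

```python
def estimated_power_saving(compression, io_adc_share=0.9):
    """Fraction of acquisition power saved if only the I/O and ADC share
    scales with the on-chip compression ratio (coarse assumption)."""
    return io_adc_share * compression


for bitdepth in (12, 5):
    compression = (16 - bitdepth) / 16          # Eq. (27)
    saving = estimated_power_saving(compression)
    print(bitdepth, compression, round(saving, 3))
```

This gives roughly 22.5% and 62% savings for the two compression ratios, in the same range as but not identical to the Table 7 estimates, which rely on the detailed breakdown and include JPEG encoding.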
Our proposed CS technique also reduces the wiring area in the die, as we combine two rows/columns into one. We cut the amount of wiring required by half. We need half the number of row or column select, reset, and transmit lines for global shutter. We also need an address decoder of half the size, leading to a reduction in power and chip area. Because of the reduced size of the pixel and reduced wiring area, we can fit more pixels in the same die area using existing technology. It is not possible to exploit these advantages if we implement CS at the ADC level as mentioned previously and in Ref. 1.
Conclusions
We have discussed a system design methodology for an imaging system that significantly cuts down power and simplifies hardware design. Using the simple deterministic matrices presented in this paper, CS can be used for on-chip compression of the raw image to save power. These matrices also help us to use JPEG in conjunction with CS for additional compression. We also presented a pixel design implementing such matrices and results for on-chip compression ranging from 25% to 68.75%. This leads to a significant reduction in power spent at I/O, ADC, and JPEG. We can significantly simplify the ADC design as we require less resolution and less speed. Similarly, we can simplify the JPEG encoder design because there are half the number of 8 × 8 image blocks and reduced bitwidth per pixel. When we do not use voltage-frequency scaling, we save power by approximately the same amount as the on-chip compression. In such cases, estimated power savings are around 23.5% to 65% for on-chip compression ratios of 25% to 68.75%, respectively. If we use scaling, we can achieve an exponential reduction as well. We also require fewer transistors to implement the same number of pixels. For the design proposed in Fig. 13, we need three transistors per pixel to implement global shutter, which in normal cases demands four. This not only increases the fill factor of pixels but also decreases the pixel size. We also reduce the amount of wiring required by half, which means a significant reduction in crosstalk and pixel size. We need only half the number of row/column wiring, power supply, reset, and global shutter control wiring. We also simplify the row/column address decoder significantly, as we need a decoder of only half the size. All this means that we can fit more pixels in a given die area while maintaining power efficiency and can more than compensate for the loss due to the reconstruction algorithm. Since the amount of raw data is reduced, we can also increase the frame rate.
Thus, our system design methodology has a huge potential to increase the power efficiency of the CMOS image sensor design while increasing resolution and significantly simplifying circuit design. This aspect should be explored further by testing prototypes.
Future Work
This paper presents a methodology for an image acquisition system. The performance of this methodology depends on the performance of each component of the system. Each component can be redesigned to suit the needs of the system. In particular, the image sensor presented in this paper can be fabricated and tested to get more accurate results. Another significant component of the system is the reconstruction algorithm. An improvement in the performance of the reconstruction algorithm can significantly improve the performance of the system. After fabrication, the noise characteristics of the system should also be studied and improved upon.
