Abstract: Exploiting a property of the human visual system, this paper presents a priority-based selective bit dropping strategy that reduces DRAM (Dynamic Random Access Memory) and SRAM (Static Random Access Memory) power consumption, and explores the tradeoff between power consumption and output quality. In the image processing data flow, the original image data are first processed with the proposed strategy, which reduces the number of bit-'1' in the lower part of each pixel. The approximate data are then pushed into the DRAM and SRAM for further computation. The refresh power of the DRAM is reduced because each datum contains fewer bit-'1', and the write power of the SRAM is reduced because of the lower switch probability in write operations. The proposed strategy has been realized with digital logic circuits, and the approximate image data are processed by the Discrete Cosine Transform (DCT) in simulation. The results show an average refresh power reduction of 27.7% for DRAM and a write power reduction of 21.7% for SRAM, with negligible overhead. The final output quality after DCT processing loses only 1.01 dB in Peak Signal-to-Noise Ratio (PSNR), about 3% lower than accurate processing.
Introduction
With the increasing prevalence of mobile devices, various high-speed and large-capacity multimedia applications have been integrated into smartphones and are growing rapidly. In embedded systems, memory, including on-chip SRAM and off-chip DRAM, is a key concern for image processing and accounts for more than 90% of the workload [1] in practical applications. In a typical image/video processing system, as shown in Fig. 1, the captured image data are first pushed into the off-chip DRAM, and the on-chip SRAM then loads part of the image data for the on-chip computation blocks. For the DRAM bitcell shown in Fig. 2(a) [2], the overall power consumption mainly comes from the refresh operation on the capacitor, since the charge stored in the capacitor when bit-'1' is written into the cell leaks away after a short period.
Thus, as pointed out in [3], the refresh power consumption in DRAM is linear in the total number of written bit-'1'. In other words, when the original image data contain more bit-'0', the refresh power consumption is reduced significantly.
As for the on-chip SRAM, high-capacity and high-speed operation also leads to large power dissipation in practice, which strongly limits the flexibility of mobile devices. It is therefore also meaningful to reduce SRAM power consumption. The circuit of one SRAM bitcell is shown in Fig. 2(b); it consists of two cross-coupled inverters with positive feedback. The overall SRAM power dissipation can be divided into three parts: leakage power, read power and write power. As pointed out in [4, 5], leakage power accounts for a small part of the total dissipation, and write power is 3.3X larger than read power. Most power dissipation occurs when the transistors of the bitcell switch between the on and off states, and this switching power, which largely comes from write operations to the bitcell, is proportional to several parameters as shown in Eq. 1:
$$P_{switch} = \alpha \cdot C \cdot V_{DD}^{2} \cdot f \qquad (1)$$
where α is the activity factor (switches between bit-'1' and bit-'0'), C is the effective capacitance, VDD is the supply voltage and f is the operating frequency [6]. Considering the large refresh power of DRAM and write power of SRAM, various strategies have been proposed in previous research. For DRAM, the data encoding in [7] is an efficient way to reduce the number of stored bit-'1'; however, additional information or flag bits must be inserted, which degrades bandwidth utilization. The embedded compression algorithms in [8] reduce the overall amount of image data, but the implementation is complicated and incurs a large overhead. As for SRAM, based on Eq. 1, a low power design can be achieved by decreasing one or more of the above four parameters. The voltage scaling technique in [9] yields a low power design because of the quadratic effect of VDD in Eq. 1; however, performance must be lowered to guarantee that no timing errors occur during read and write operations. The same problem exists for frequency scaling. In [10], the bitcell size is adjusted to match the capacitance, which also requires considerable effort because of the special design.
Unlike the accurate operation on all bitcells above, in image processing, errors in different bit positions of a pixel lead to different output quality of the processed image. Because the human visual system is more sensitive to the most significant bits (MSBs) of a pixel and less sensitive to the least significant bits (LSBs) [11], completely accurate storage is unnecessary in practical applications, especially for image/video sensor networks, where an output quality acceptable to human vision is sufficient [12, 13].
Hence, for approximate storage in DRAM, in [14] the original image data are first partitioned into a critical part and a non-critical part; the refresh rate for critical data remains unchanged while the non-critical data are refreshed at a lower rate. Thus, the refresh power is reduced at the cost of some output quality. However, the control flow of this approach is complicated and its overhead cannot be ignored. As for SRAM, in [4] the MSBs are deliberately stored in robust bitcells, such as the 8T bitcells in [15], while the LSBs are given less or no protection in exchange for power reduction. For example, the LSBs are supplied with a lower voltage, which introduces a small amount of error into the original pixel data. With these approximate data, the final output quality after processing the images, for example through the Discrete Cosine Transform (DCT), declines within a tolerable range. However, allocating different voltage rails on the chip is not simple, and the overhead of balancing the drive between different voltage banks cannot be ignored. Since the overall capacitance of the SRAM depends strongly on the data volume, which is relatively fixed in practical applications, reducing the α value in Eq. 1 is considered an efficient approach to low power design.
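To make the role of α concrete, the following minimal sketch evaluates the standard dynamic switching power relation P = α·C·VDD²·f, which is assumed here to be the form of Eq. 1 since it matches the four parameters defined above. The function name and the operating point (capacitance, voltage, frequency, activity factors) are hypothetical values chosen only for illustration, not figures from this paper.

```python
def switching_power(alpha, c_eff, vdd, freq):
    """Dynamic switching power in watts: P = alpha * C * VDD^2 * f."""
    return alpha * c_eff * vdd ** 2 * freq


# Hypothetical operating point, for illustration only (not from the paper).
C_EFF = 10e-12   # effective capacitance in farads
VDD = 1.0        # supply voltage in volts
FREQ = 100e6     # operating frequency in hertz

baseline = switching_power(0.50, C_EFF, VDD, FREQ)
reduced = switching_power(0.40, C_EFF, VDD, FREQ)

# With C, VDD and f fixed, power scales linearly with the activity factor.
saving = 1.0 - reduced / baseline
print(f"relative saving from lowering alpha: {saving:.1%}")
```

Because C, VDD and f are held constant, the relative saving depends only on the ratio of the two activity factors, which is why lowering α needs no change to the memory itself.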
In this paper, a priority-based selective bit dropping strategy that jointly reduces DRAM and SRAM power consumption is presented. Unlike previous approaches, which focus on either DRAM or SRAM alone, we provide a general and efficient strategy that reduces the number of bit-'1' for DRAM and, at the same time, the switch probability for SRAM. The LSBs of the original image data are selectively dropped while the MSBs are kept correct. Moreover, the proposed approach is orthogonal to previous works and can be used jointly with previous techniques to reduce the power consumption of DRAM and SRAM. Using the Peak Signal-to-Noise Ratio (PSNR) as the metric for the output quality of image processing through DCT, the proposed strategy offers different degrees of power reduction for DRAM and SRAM depending on the expected output quality. Thus, a tradeoff between power savings and output quality is established. Extensive simulations show that an average refresh power reduction of 27.26% for DRAM and a write power reduction of 21.79% for SRAM can be achieved with negligible power overhead. Furthermore, a design methodology that balances power savings against output quality is presented for practical applications. The final output quality after DCT processing loses only 1.01 dB in PSNR, which is about 3% lower than accurate processing.
The rest of this paper is organized as follows. Section 2 presents our selective bit dropping strategy, establishes a power consumption model and gives the circuit that realizes the proposed strategy. Section 3 shows the simulation results and Section 4 concludes the paper.
2 Proposed strategy and power consumption model
2.1 Selective bit dropping strategy and design methodology
As analyzed in Section 1, when images are processed in a digital circuit system, the original image data are first pushed into DRAM, and part of the image data is then loaded into SRAM in sequence for on-chip computation. The refresh power of the DRAM is reduced when the number of stored bit-'1' decreases. At the same time, the switch probability for SRAM also becomes smaller, which yields power savings for SRAM write operations as well. Our proposed strategy operates on the original image data and exploits a certain amount of approximation to realize this idea.
The proposed strategy is as follows. We divide a pixel value, which contains 8 bits named bit 8, bit 7, ..., bit 1 from high order to low order, into two parts. The higher part takes the bits from bit 8 to bit (k + 1), where k is an integer ranging from 0 to 8 (k = 8 means that no bits are placed in the higher part), while the lower part contains the bits from bit k to bit 1 (k = 0 means that no bits are placed in the lower part). Before the captured images are pushed into the DRAM, we ensure that the higher part (from bit 8 to bit (k + 1)) of each pixel keeps its original value. In the lower part (from bit k to bit 1), the first bit-'1' that appears is always preserved, while all other bit-'1' are forced to bit-'0'. As illustrated in Fig. 3, suppose the original values of a series of pixels, listed on the left, are about to be written into the DRAM. Let k be 4. During the writing process, the bits in the higher part, bit 8 to bit 5, keep their original values; in the lower part of each pixel, the first bit-'1' that appears is preserved, while the other bit-'1' are changed into bit-'0'. This procedure is repeated for every pixel in the captured images. The basic motivation of our strategy is to reduce the overall number of stored bit-'1' in the LSBs while protecting all the information stored in the MSBs. Under this strategy, when k is 1 there is no approximation and all bits of the original images are preserved. When k is larger than 1, the original bits in the lower part may change, and at most one bit of the lower part is '1'. Thus, approximate image data are generated, and these data are then pushed into DRAM and SRAM for computation.
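The rule described above can be sketched in a few lines of Python. This is a minimal model, not the authors' Matlab or C/C++ code; the function name `selective_bit_drop` is hypothetical, and it assumes that the "first" bit-'1' of the lower part is the one found when scanning from bit k down to bit 1, that is, the most significant '1' of the lower part.

```python
def selective_bit_drop(pixel, k):
    """Keep bits 8..k+1 unchanged; in bits k..1 keep only the highest '1'.

    Assumption: the "first appeared" bit-'1' is the most significant '1'
    of the lower part (scanning from bit k down to bit 1).
    """
    if not 0 <= pixel <= 0xFF:
        raise ValueError("pixel must be an 8-bit value")
    if not 0 <= k <= 8:
        raise ValueError("k must be in the range 0..8")
    low_mask = (1 << k) - 1            # selects bit k .. bit 1
    high = pixel & ~low_mask & 0xFF    # higher part, kept exactly
    low = pixel & low_mask             # lower part, to be approximated
    if low:
        low = 1 << (low.bit_length() - 1)  # isolate the highest set bit
    return high | low


# With k = 4 the lower part 0111 becomes 0100, the higher part is untouched.
print(format(selective_bit_drop(0b10110111, 4), "08b"))  # 10110100
```

With k = 0 or k = 1 the function returns the pixel unchanged, which matches the statement that k = 1 introduces no approximation.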
Since the number of bit-'1' is reduced, the refresh power of the DRAM decreases. The same property also lowers the switch probability for SRAM write operations, since at most two bits of the lower part switch between consecutive writes. When k is close to 8, more of the critical information stored in the MSBs is lost, which means lower output quality and larger power savings. Thus, a tradeoff between output quality and power savings can be achieved with various k values. The whole design flow is shown in Fig. 4. For a specific image processing application with a given output quality requirement, each pixel of the original image data is processed by the proposed selective bit dropping strategy, which can be modeled in Matlab or C/C++. As analyzed above, when the k value increases, more power savings are obtained while the output quality declines correspondingly. Once the output quality requirement is met at a certain k value, the process finishes, and the overall power consumption and savings are obtained through subsequent experiments with hardware simulation tools. The whole design methodology can also be summarized as Eq. 2:
Based on the analysis above, our proposed strategy for trading off output quality against power savings is simpler and more efficient than previous works. First, there is no complicated control flow and no change to the internal structure of the DRAM and SRAM, as we only pre-process the original image data at the beginning of the processing. Second, the proposed approach is orthogonal to previous works such as voltage scaling, which means that it can be used jointly with other techniques for lower power design. Finally, not only can the DRAM achieve power savings with the proposed approach, but the write power of the SRAM is also reduced because the approximate part of the data has a lower switch probability (this will be seen more clearly in the simulations), which is a distinct advantage over previous designs that focus on either DRAM or SRAM.
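One possible reading of the design flow in Fig. 4 is a search for the largest k whose quality loss stays within the application's budget. The sketch below illustrates that reading only; `choose_k`, its parameters and the `toy_quality` model are hypothetical, and the toy quality values are placeholders rather than measured PSNR data. It also assumes that quality does not improve as k grows, so the search can stop at the first k that violates the budget.

```python
def choose_k(quality_at, max_loss_fraction, k_max=8):
    """Return the largest k whose relative quality loss is within budget.

    quality_at(k) is any callable returning a quality score (e.g. PSNR in dB)
    for a given k; k = 1 is the accurate baseline. Assumes quality does not
    improve as k grows, so the search stops at the first violation.
    """
    baseline = quality_at(1)
    chosen = 1
    for k in range(2, k_max + 1):
        loss = (baseline - quality_at(k)) / baseline
        if loss > max_loss_fraction:
            break
        chosen = k
    return chosen


def toy_quality(k):
    # Toy stand-in: one unit of quality lost per extra approximated bit.
    # These are NOT measured results; replace with a real evaluation.
    return 33.0 - (k - 1)


print(choose_k(toy_quality, 0.05))
```

In a real flow, `quality_at` would run the image through the approximation and the DCT-IDCT kernel before measuring PSNR; power figures would then be obtained separately with hardware simulation tools.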
return D_{t,approximate}[8:1];
Circuit implementation of the proposed strategy
To realize the proposed strategy of selectively dropping bits in the lower part, a specific circuit must be implemented. Taking k = 5 as an illustration, suppose the original image data before being pushed into the DRAM at time t is D_t[8:1] and the data produced by our proposed strategy at time t is D_{t,approximate}[8:1]; the whole logic computation is then shown in Circuit-1. Only a few AND gates and inverters are used for the computation. With this circuit block for each pixel, the whole circuit is shown in Fig. 5. Suppose the input data width is 64 bits, so that 8 pixels of the image can be written into the DRAM simultaneously. Note that the Circuit-1 block needs only the current pixel data and no previous information, so no D flip-flops are needed anywhere in the circuit. The circuit in Fig. 5 is just a port through which the data pass before being pushed into the DRAM. It can be easily coded in Verilog and synthesized with DesignCompiler. The delay and power consumption of this circuit, which constitute the overhead of the strategy, are given in the simulation section.
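As a software illustration of such a combinational block, the sketch below models the per-pixel logic bit by bit using only AND and NOT operations, and applies it to the eight pixels packed in a 64-bit word. It is not the authors' Circuit-1 listing or Verilog; the names `circuit1` and `process_word`, the lane ordering inside the word, and the reading of the "first" bit-'1' as the most significant '1' of the lower part are assumptions.

```python
def circuit1(pixel, k=5):
    """Bit-level model of one per-pixel block (bit 1 is the LSB)."""
    out = pixel & ~((1 << k) - 1) & 0xFF   # bits 8..k+1 pass through unchanged
    none_above = 1                         # 1 while no higher lower-part bit is '1'
    for pos in range(k - 1, -1, -1):       # walk from bit k down to bit 1
        bit = (pixel >> pos) & 1
        out |= (bit & none_above) << pos   # AND gate: only the first '1' survives
        none_above &= bit ^ 1              # inverter feeding the next AND gate
    return out


def process_word(word, k=5):
    """Apply circuit1 independently to the 8 pixels of a 64-bit word."""
    result = 0
    for lane in range(8):                  # assumed layout: lane 0 in the low byte
        pixel = (word >> (8 * lane)) & 0xFF
        result |= circuit1(pixel, k) << (8 * lane)
    return result


print(hex(process_word(0xFFFFFFFFFFFFFFFF, 5)))  # 0xf0f0f0f0f0f0f0f0
```

Each output bit depends only on bits of the same pixel in the same word, which is consistent with the observation that no state elements are required.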
Simulations
In this section, the proposed method is modeled in Matlab following the design methodology in Fig. 4 and the circuit in Fig. 5. For different k values, the number of bit-'1' stored in DRAM and the switch probability for SRAM are simulated and evaluated. To evaluate the output quality of the data processed with our strategy, all the approximate image data derived from [17] are pushed into the DCT-IDCT kernel in Matlab. The PSNR of each image after DCT-IDCT is used for the final output quality evaluation.
3.1 Simulation for the number of bit-'1' for storage in DRAM and the switch probability in SRAM
To calculate the average number of bit-'1' per pixel in DRAM under our proposed strategy, real images from [17] are used for the simulation. The data from DRAM are then pushed into the on-chip SRAM for further computation, where the switch probability and the PSNR for output quality are evaluated. The whole procedure is shown in Fig. 6, where the size of the SRAM is 32 kbit, a typical value for practical applications. All simulation results are shown in Table I, where "DRAM bits per pixel" is the average number of bit-'1' per pixel in DRAM and "SRAM switch probability" is the switch probability in SRAM. As analyzed in Section 2.1, k = 1 means that no approximation is introduced into the original data; this case can be treated as accurate processing, achieves no power savings, and serves as the baseline for comparison with other k values. The values of DRAM bits per pixel and SRAM switch probability become smaller as k rises. Thus, the corresponding savings in DRAM refresh power and SRAM write power also increase, as analyzed in Section 2.1. However, these power savings come at the cost of output quality, as presented in the following part.
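The two metrics reported in Table I can be computed along the following lines. This is a sketch under stated assumptions: the paper does not spell out how the switch probability is measured, so it is modeled here as the fraction of bit positions that toggle when each pixel overwrites the previous one in the same SRAM location. The function names and the three sample pixels are hypothetical; the approximate values were derived by hand for k = 4.

```python
def ones_per_pixel(pixels):
    """Average number of bit-'1' per 8-bit pixel."""
    return sum(bin(p).count("1") for p in pixels) / len(pixels)


def switch_probability(pixels, bits=8):
    """Fraction of bits that toggle between consecutive writes.

    Assumption: each pixel overwrites the previous one in the same cells,
    so a toggle is a bit position where consecutive pixels differ.
    """
    toggles = sum(bin(a ^ b).count("1") for a, b in zip(pixels, pixels[1:]))
    return toggles / (bits * (len(pixels) - 1))


# Hypothetical pixel stream and its k = 4 approximation (worked by hand).
original = [0b10110111, 0b10111010, 0b10110101]
approximate = [0b10110100, 0b10111000, 0b10110100]

print(ones_per_pixel(original), ones_per_pixel(approximate))
print(switch_probability(original), switch_probability(approximate))
```

On this tiny stream both metrics drop after approximation, which is the qualitative trend the table reports; the actual percentages depend on the image content.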
Evaluation of output quality with the proposed approach
As shown in Fig. 6, the approximate image data for different k values are processed by the DCT-IDCT kernel in Matlab to evaluate the output quality of the approximate images, using the average PSNR value shown in Fig. 7. The output quality under our proposed strategy decreases as more approximation (in other words, a larger k value) is introduced into the original image data. Thus, as a main contribution of this paper, the strategy enables dynamic management of image processing according to application needs. When the required output quality increases, the power consumption increases quickly, which agrees with the intuition that better output requires more effort. When power saving is the main purpose, a relatively large k can be adopted to achieve higher power efficiency. In this paper, a 5% quality loss is accepted as described in [6]; thus, the operating point between power consumption and PSNR is selected at k = 4, where the saving in DRAM refresh power is 27.7% and the saving in SRAM write power is 21.7%, while the PSNR loss is only 1.01 dB (a 3.1% reduction relative to accurate processing), which is acceptable in many image processing applications. This tradeoff between power efficiency and output quality is worthwhile. One of the processed images with different k values is shown in Fig. 8. As for the complete dropping strategy in Fig. 7, in which all the lower k bits of a pixel are forced to '0', the output quality decreases more sharply than with our proposed method. When k = 4, the corresponding PSNR is 27.39 dB, a 14.3% reduction relative to accurate processing, and this reduction is intolerable in practical applications as pointed out in [6].
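For reference, PSNR for 8-bit data follows the usual definition PSNR = 10·log10(255² / MSE). The sketch below computes it directly between two pixel sequences; this simplifies the paper's setup, where PSNR is measured after the DCT-IDCT kernel, and the function name `psnr` and its parameters are hypothetical.

```python
import math


def psnr(reference, test, peak=255):
    """Peak Signal-to-Noise Ratio in dB between two equal-length sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf                    # identical data: no noise at all
    return 10.0 * math.log10(peak ** 2 / mse)


# A mean squared error of 1 on 8-bit data gives roughly 48.13 dB.
print(round(psnr([10, 20], [11, 19]), 2))
```

A relative quality loss such as the 5% budget mentioned above can then be expressed as the PSNR difference divided by the PSNR of the accurate baseline.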
Discussion for the power and area overhead
The logic circuit shown in Circuit-1, which realizes the proposed strategy, is coded in Verilog and synthesized with DesignCompiler; the resulting gate-level netlist is passed to the PrimeTime tool [16] to obtain the delay, power consumption and area for the overhead discussion. The delay, power and area values for different k are listed in Table II (the k values start from 2 since k = 1 means that no approximation is introduced). Suppose the input data width of the DRAM is 64 bits and each pixel of the image data contains 8 bits; the power and area overhead of our proposed strategy then comprises the power consumption and area of eight Circuit-1 blocks in the final implementation. The power overhead of the Circuit-1 blocks is negligible since it stays at the µW level, compared with the large power consumption of the DRAM (at the W level in [5]) and the SRAM (about 400 µW in [4] at 100 MHz). As for the area cost, the Circuit-1 blocks can also be ignored, as the areas of a standard DRAM and a 32 kbit SRAM are far larger than the µm² order of magnitude.
For comparison with the power consumption of the complete dropping strategy, the average number of bit-'1' per pixel in DRAM and the switch probability for SRAM under complete dropping are evaluated with the same simulation process as in Section 3.1 and presented in Table III. In this table, k = 0 means that no lower bits of a pixel are set to '0', which can be treated as the accurate baseline. Note that k = 8 is not included in the table, since k = 8 means that all bits of every pixel are set to '0', a situation that should not occur in practical processing. When k = 4, the complete dropping method achieves more power savings than the proposed strategy in Table I; however, as demonstrated in Section 3.2, the output quality reduction of the complete dropping strategy at this point is unacceptable in practical applications. Thus, under the design methodology in Eq. 2, where satisfying the output quality has top priority, the proposed approximate strategy is more efficient than the complete dropping method.
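The qualitative difference between the two strategies can be seen with a small enumeration. The sketch below compares complete dropping and selective dropping at k = 4 over all 256 pixel values, which amounts to assuming a uniform pixel distribution; real images, as used for Tables I and III, will give different figures. The function names are hypothetical, and the selective rule again assumes that the preserved bit is the most significant '1' of the lower part.

```python
def complete_drop(pixel, k):
    """Force all of bit k .. bit 1 to '0'."""
    return pixel & ~((1 << k) - 1) & 0xFF


def selective_drop(pixel, k):
    """Keep the higher part and only the highest '1' of the lower part."""
    low = pixel & ((1 << k) - 1)
    kept = 1 << (low.bit_length() - 1) if low else 0
    return complete_drop(pixel, k) | kept


def max_error(approx, k):
    """Largest absolute pixel error over all 8-bit values."""
    return max(p - approx(p, k) for p in range(256))


def mean_ones(approx, k):
    """Average bit-'1' count over all 8-bit values (uniform assumption)."""
    return sum(bin(approx(p, k)).count("1") for p in range(256)) / 256


for name, fn in (("complete", complete_drop), ("selective", selective_drop)):
    print(name, max_error(fn, 4), mean_ones(fn, 4))
```

Under this uniform assumption complete dropping stores fewer bit-'1' but more than doubles the worst-case pixel error, which mirrors the power versus quality contrast discussed above.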
Conclusion
In this paper, we presented a selective bit dropping strategy to reduce DRAM and SRAM power for image processing. Based on the property that human eyes are more sensitive to errors in the high order bits of a pixel and less sensitive to errors in the LSBs, our strategy decreases the number of bit-'1' in the original image data. We divide a pixel into two parts. The strategy ensures the normal switching of bitcells in the higher part and allows the highest switched bit in the lower part to operate correctly. The critical information stored in the MSBs is preserved and proper protection is provided to the LSBs. As a result, various levels of power savings can be achieved. In the future, we will apply our strategy to other applications, such as machine learning algorithms, to verify its effect; we believe the proposed strategy can also provide a fine tradeoff between power efficiency and output quality in these applications.
