The aggressive CMOS technology scaling in the sub-100-nm regime leads to highly challenging VLSI design due to the presence of unreliable components. Delay failures in arithmetic units are increasing rapidly due to the increased effect of process variation (PV) in scaled technology. This paper introduces a novel process-tolerant low-power adder (Prot-LA) architecture for error-tolerant applications. The proposed Prot-LA architecture segments the operands into two parts and computes the addition of the upper parts in a carry-propagate manner, whereas it computes the lower parts in a carry-free manner. In the Prot-LA, the number of bits in the carry-propagate and carry-free additions can be reconfigured based on the amount of PV. An on-chip PV detector is embedded to determine the PV severity. Because of this reconfigurability, the proposed adder completes the carry propagation with minimum error even under severe process variation. The simulation results show that the proposed Prot-LA provides 19.9% reduced power consumption over the state-of-the-art approximate adder. The efficacy of the proposed adder is demonstrated in a real application by designing an image scaling processor (ISP). The simulation results show that the Prot-LA-embedded ISP consumes 7.75% less energy with 2.43 dB higher PSNR than the ISP embedded with the existing approximate adder.
Introduction
The rapid development of VLSI technology has led to an exponential growth of portable electronic devices that run several applications to satisfy user demand. Low power is an inherent requirement of all portable devices in order to extend the battery lifetime, improve reliability, and reduce the cost associated with heat removal. Scaling is no longer the solution to these challenges due to the increased effect of process variation (PV). The intra-die PV exhibits random behavior and may increase or decrease the delay. In a die where many computations contribute to the final result, these random variations average out, so their net effect on the overall delay may not be significant. Therefore, most of the existing techniques try to mitigate inter-die PV, which exhibits systematic behavior. The PV-induced delay spread is almost 30% at 70-nm technology and rises to more than 50% at 45-nm technology [1, 2], which reduces the yield of the chip significantly. The logic required to mitigate the effect of PV is becoming more costly in terms of area, power, and performance. Therefore, there is a growing need for designs that jointly address the process variation effect and the low-power requirement.
There are numerous applications that do not demand exact computations while processing the input data, i.e., these applications can accept a certain amount of error. The relaxation of computational accuracy in these applications is due to the limited perception of human vision, the existence of redundancy in the input data, and the nonexistence of an accurate output. These applications include image/video processing, artificial neural networks, etc., and are commonly known as error-tolerant applications. The conventional design approaches that perform accurate computations are inefficient for these applications. Approximate designs reduce power consumption while exhibiting an acceptable error in the output such that the overall quality is maintained. Since addition is the fundamental operation used in these computations, the design of a process-tolerant low-power adder is a critical requirement. This paper presents a novel process-tolerant low-power adder architecture that jointly addresses the technology issue and provides an improved speed-power-accuracy-area trade-off. The major contributions of the paper are:
1. A novel algorithm and architecture of reconfigurable carry-propagate adder (RCPA) is proposed that can dynamically change the number of bits to be added with/without carry propagation.
2. A novel full adder circuit is proposed that can compute an accurate/approximate sum based on the control signal.
3. A novel process-tolerant adder architecture is proposed that can detect the process severity and configure the number of bits that will be added accurately without timing failure.
4. Finally, the paper demonstrates improved performance of the proposed adder over the state-of-the-art adder architectures in the image scaling processor.
The rest of the paper is organized as follows: Section 2 explores various approximate adder architectures that exhibit performance-quality trade-off. A novel process-tolerant adder architecture is presented in Section 3 while the analysis of PV severity is presented in Section 4. Section 5 first evaluates and compares the performance of the proposed adder over the existing architectures as a standalone arithmetic unit and then evaluates its efficacy in the image processing application. Finally, the work is concluded in Section 6.
Literature review
The last few decades have seen healthy competition among researchers to develop designs for energy/power-efficient VLSI chips. All these designs exhibit a trade-off between different design metrics such as area, power, and delay. To further improve the performance of these designs, the concept of approximate computing is utilized, which provides acceptable results despite some underlying incorrect computations. Thus, accuracy acts as a new trade-off parameter for these applications and provides improved performance/energy-efficiency at the cost of accuracy/quality [3, 4]. The conventional approaches exploit truncation, overclocking, and voltage overscaling to achieve improved performance in different error-resilient applications [5].
Along with the design of high-performance signal processing units via approximate architectures and/or algorithms, low-power high-performance approximate arithmetic units, such as adders and multipliers, have also been proposed via the concept of shortening the critical path. In the direction of designing adders with a reduced carry propagation path, Zhu et al. [6] [7] [8] [9] introduced four approximate error-tolerant adders (ETA), namely ETA-I, ETA-II, ETA-IIM, and ETA-IV. In ETA-I [6], the input operands are segmented into an accurate part, which consists of a few most significant bits (MSB), and an inaccurate part, which consists of the remaining least significant bits (LSB), to achieve approximate results. The ETA-II [8] design cuts the carry propagation to increase the speed of the adder at the cost of a large error at the MSBs. ETA-IIM is the modified version of ETA-II where more MSBs are connected through the carry chain to reduce the error. ETA-IV overcomes the shortcomings of ETA-II and ETA-IIM by using a high-speed carry-select adder for the MSB addition. Further, Shin et al. [10] introduced a data path redesign technique for various adders where the critical path of the carry chain is curtailed. A logic complexity reduction approach at the transistor level is reported in [11] to achieve various imprecise full adders, and the efficacy of these approximate adders is demonstrated in image/video compression algorithms. Kai et al. [12] used a new function speculation technique for designing latency speculative adders. Mazahir et al. presented an area-efficient error correction circuit for approximate designs [13]. Recently, an approximate reverse carry-propagate adder architecture has been presented in which the carry propagates in a counter-flow manner [14]. This reverse carry-propagate adder reduces the large errors induced by timing variation at the cost of a fixed design-specific error.
Along with approximate designs/architectures, different approximate design methodologies/synthesis approaches have also been introduced [15] [16] [17]. Huang et al. [15] presented a general methodology for fidelity-efficiency design space exploration and implemented imprecise adders in the CORDIC algorithm at 130-nm technology. A design methodology for reducing power consumption using approximate arithmetic unit cells is presented in [16]. Further, Lee et al. [17] presented a high-level synthesis tool that exploits arithmetic unit precision and voltage scaling to reduce energy consumption while maintaining the quality constraint. Finally, Rahimi et al. [18] presented a variability-aware OpenMP environment using shared, variation-tolerant, and accuracy-reconfigurable floating point units.
Although the fixed-accuracy design techniques discussed above provide improved performance, their scope is limited because different applications have different accuracy requirements. To achieve the desired accuracy, Kahng et al. [19] demonstrated an accuracy configurable adder (ACA) where the accuracy of the results can be configured during runtime to achieve the desired throughput and power efficiency. In the generic accuracy configurable (GeAr) adder [20], error correction is implemented by changing the inputs to the subadders and reevaluating the partial sums. An iterative accuracy programmable adder that can reconfigure the probability of getting the correct output was proposed in [21]. Recently, a reconfigurable carry-look-ahead adder (RCLA) [22] has been presented in which the longest carry propagation path is truncated when operated in approximate mode. Since the RCLA exhibits only two operating modes (accurate/approximate), a design with more approximate modes offering different accuracy-performance trade-offs is needed for wider applicability.
All these aforementioned techniques exploit error tolerance of the application and compute the approximate output. In the nano-scaled design, the effect of process variation is intolerable and cannot be ignored. Therefore, an arithmetic unit that provides process variation tolerance along with the trade-off between quality and performance is highly desirable. The adder architecture proposed in the next section provides acceptable output even under severe process environment.
Proposed process-tolerant low-power adder (Prot-LA) architecture
This section begins with the architecture of the proposed Prot-LA followed by the algorithm and microarchitectures of its various parts.
Prot-LA architecture
In the worst case, the PV-induced delay variation may cause the delay of the adder to be twice the original delay. In this case, only half of the carry propagation will be completed correctly. If the delay variation is less than the worst case, more carry propagation can be considered to compute the sum. Therefore, in the proposed addition approach, the operands are divided into two parts: the upper part, which consists of some most significant bits (MSB), and the lower part, which consists of the remaining least significant bits (LSB). The architecture of the Prot-LA, as shown in Figure 1, consists of a conventional accurate adder for the addition of the MSBs, a reconfigurable adder for carry-propagate/carry-free addition of the LSBs, a controller, and PV detection logic. In the proposed adder, the PV detection logic determines the PV severity, which is given to the controller to generate a control signal. The Prot-LA reconfigures the lower adder to compute the sum either in a carry-propagate (accurate) or in a carry-free (approximate) manner based on the value of the control signal. As the PV severity increases, the control signal reduces the number of carry-propagate bits and correspondingly increases the number of carry-free bits used to compute the sum, which reduces the large error at the MSBs. The next subsection details the architecture and working of the reconfigurable lower-part adder, i.e., the reconfigurable carry-propagate adder.
Reconfigurable carry propagation adder: algorithm and architecture
In the Prot-LA, the addition of the least significant bits is done by reconfigurable carry propagation adder which computes addition accurately (carry-propagate)/approximately (carry-free) based on the given control signal.
Algorithm 1 provides the steps to compute the approximate sum of operands A and B for the given control signal (CTL). The control signal has at most one bit at logic '1', and the other bits are set to logic '0'. All bits of the control signal at logic '0' reflect no PV effect. Further, the higher the bit position of the CTL that is set to '1', the larger the PV it reflects. The algorithm first determines the operand part for which the accurate sum is to be calculated for a given CTL. For the remaining lower bits, the corresponding sum bits are approximated to 1's. If all the bits of the CTL are '0', the result of the sum will be accurate.
The architecture of the RCPA, as shown in Figure 2, consists of a sum generator and a controller. The sum generator consists of modified full adders (MFAs) connected in a ripple-carry fashion, whereas the controller consists of cascaded OR gates. The carry-free/carry-propagate addition using the RCPA can be understood via the example shown in Figure 3. For the sake of simplicity, two 8-bit numbers, A = 01100100 and B = 00110100, are considered along with an 8-bit control signal CTL = 00001000. Since the 3rd bit of the control signal in Figure 3a is at logic '1', the addition of the operand bits (A and B) to the left of the 3rd bit is done in a carry-propagate manner, while the sum bits at the 3rd bit and to its right are approximated to logic '1'. Thus, in the given example, the carry will propagate from the 5-bit position to the 7-bit position only. In a similar manner, as shown in Figure 3b, the sum for the same input with a different control signal illustrates more approximation when a higher control bit is at logic '1'. In order to achieve the approximate addition within the RCPA, a full adder with an additional control signal is required. The MFA computes the accurate sum if CTL = 0, whereas it sets the sum value to 1 whenever CTL = 1. Furthermore, the carry out in the approximate part is set to logic '1'. The circuit diagram that achieves this functionality is shown in Figure 4, where it can be seen that it requires four additional transistors (two transistors in the sum path and two in the C_o path) over the accurate 28-transistor full adder circuit.
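The behavior described above can be illustrated with a short bit-level model. This is only a sketch under stated assumptions, not the authors' Algorithm 1 or circuit: the function names (`expand_control`, `mfa`, `rcpa_add`) are hypothetical, bit positions are counted from 0 at the LSB, and the cascaded OR controller is assumed to spread the single '1' of the CTL to every lower bit position. As stated in the text, an approximated MFA stage outputs a sum of '1' and a carry out of '1'.

```python
def expand_control(ctl, width):
    """Model of the cascaded OR controller (assumed behavior): every bit at or
    below the single '1' of the one-hot CTL word receives a '1'."""
    mask = 0
    seen = 0
    for i in reversed(range(width)):      # walk from MSB down to LSB
        seen |= (ctl >> i) & 1            # OR of this CTL bit with all higher ones
        mask |= seen << i
    return mask


def mfa(a, b, cin, ctl):
    """Modified full adder: exact when ctl == 0; sum and carry out forced to 1
    when ctl == 1, as the text describes for the approximate part."""
    if ctl:
        return 1, 1
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def rcpa_add(a, b, ctl, width=8):
    """Ripple the MFAs from LSB to MSB and return the (width + 1)-bit result."""
    mask = expand_control(ctl, width)
    carry = 0
    total = 0
    for i in range(width):
        s, carry = mfa((a >> i) & 1, (b >> i) & 1, carry, (mask >> i) & 1)
        total |= s << i
    return total | (carry << width)


if __name__ == "__main__":
    a, b = 0b01100100, 0b00110100
    print(format(rcpa_add(a, b, 0b00000000), "09b"))  # no approximation
    print(format(rcpa_add(a, b, 0b00001000), "09b"))  # bits 3..0 approximated
```

With CTL = 00000000 the model reduces to an ordinary ripple-carry adder, so it can be checked against `a + b` directly. Whether the figures in the paper use exactly this carry convention at the boundary cannot be confirmed from the text alone.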
Process variation tolerance
As mentioned in Section 2, the process variation is becoming an increasingly serious issue with each new technology, and it is a very challenging task to address the process variations and low power dissipation simultaneously. This section first investigates the process variations severity on the conventional adder and then shows how the proposed adder reduces the severity of the process variation.
Adders under process variation
The worst-case carry propagation which occurs rarely in the RCA determines the operating frequency. In the normal mode of operation, the sum at the output is latched correctly as the carry completes its path. Under process variation, the delays of adder may vary depending on the process corner where the chip lies [23, 24] .
Even at the nominal supply voltage, the process variations may increase the critical path delay of the adders that are at the slower process corners. The increased delay due to process variation produces a large error because the failure occurs at the MSBs, as shown in Figure 5. Further, the application of supply voltage scaling to reduce power consumption may increase the critical path delay. Thus, the increased delay in the conventional RCA due to process variation and/or supply scaling may result in a large error, as illustrated in Figure 5.
The severity of the error reduces significantly if the probability of failure is moved from the MSBs to the LSBs. To compute the process severity, a delay monitoring approach is exploited. As the delay of an inverter from a low threshold voltage (Vth) die is lower than that of an inverter from a high Vth die, a counter is used to count the number of clock cycles in a specific time interval given by a calibrate signal, as shown in Figure 6. If the delay of the inverter is increased/decreased due to PV, the number of clock pulses counted by the counter will decrease/increase.
The proposed adder detects process corner and based on its value it starts the carry propagation from LSB to MSB such that the carry propagation is completed without any failure. Therefore, the large errors at the MSB are avoided at the cost of small errors at the LSBs as shown in Figure 5 . In the analysis, area and power overhead due to PV detection logic are not taken into consideration as this circuit is utilized for all the designs available on the chip. 
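To make the link between the delay monitor and the control signal concrete, the following sketch maps a pulse count to a one-hot CTL word. Everything here is an assumption for illustration: the function names, the oscillator model (period equal to twice the number of stages times the inverter delay), and the proportional mapping, which simply applies the earlier observation that a doubled delay lets only half of the carry chain complete. The actual detector and controller of the paper are those in Figures 1 and 6.

```python
def pulses_counted(window_ps, stage_delay_ps, stages=5):
    """Pulses produced by an inverter-based oscillator inside the calibrate
    window. Assumed model: period = 2 * stages * per-inverter delay."""
    period_ps = 2 * stages * stage_delay_ps
    return window_ps // period_ps


def select_ctl(count, nominal_count, width=8):
    """Map a pulse count to a one-hot control word (0 means fully accurate).

    Assumed mapping: the fraction of the carry chain that completes within
    the clock period scales with count / nominal_count.
    """
    # Cap at the full width for dies at or faster than the nominal corner.
    accurate_bits = min(width, (width * count) // nominal_count)
    approx_bits = width - accurate_bits
    if approx_bits == 0:
        return 0
    return 1 << (approx_bits - 1)         # highest approximated bit position


if __name__ == "__main__":
    nominal = pulses_counted(10_000, 10)  # nominal die, hypothetical numbers
    slow = pulses_counted(10_000, 20)     # inverter delay doubled by PV
    print(nominal, slow, format(select_ctl(slow, nominal), "08b"))
```

In this model a die whose inverter delay is doubled yields half the nominal count, so half of an 8-bit chain is approximated and the control word has its '1' at bit 3.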
Simulation environment and result discussion
To evaluate the efficacy of the proposed adder, we implemented the proposed adder along with the existing approximate adders in MATLAB and simulated them with 1 million random input patterns. The designs are compared on the basis of the extracted error metrics. The efficacy of the proposed adder is also evaluated in an image processing application by implementing image scaling processors (ISPs) embedded with the proposed and existing adders [25]. These ISPs are simulated with benchmark input images such as Lena, Mandrill, and Cameraman, and are compared based on the simulation results. Further, we implemented the adders and ISPs in Verilog and synthesized them using Synopsys' Custom Designer to obtain the Verilog netlist. The Spice netlist is extracted from the Verilog netlist and simulated with 1000 patterns using a 45-nm PDK.
In the following subsections, different quality metrics are first explained followed by comparative analysis of the proposed adder over the state-of-the-art adder architectures as a standalone and in-the-application arithmetic unit.
Quality metrics
As the concept of approximate computing has attracted significant interest in recent years, apart from different approximate designs, researchers have investigated imprecise/approximate design metrics. Along with the commonly used error metrics such as mean error ( µ ) and mean square error (MSE), the following parameters are utilized to evaluate the efficacy of the proposed design.
Mean error distance (MED) [26]
It is computed by averaging all error distances (EDs) as expressed by:
$$\mathrm{MED} = \frac{1}{|S|} \sum_{i=1}^{|S|} ED_i$$
where ED is the difference between the correct and erroneous output, S is the sample space, and ED_i is the error distance of the i-th value. Although it is easy to evaluate, its value increases with the size of the design. Thus, it does not effectively evaluate the approximation technique but rather evaluates the particular design.
Normalized error distance (NED) [26]
It is the MED normalized to the maximum error and thus characterizes approximate techniques efficiently. Its value for a given approximate technique is the same, independent of the design size.
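The two error-distance metrics defined above can be computed as in the sketch below. The function names are hypothetical, and the maximum error used for normalization is passed in explicitly because the text does not state how it is obtained for each adder.

```python
def mean_error_distance(exact, approx):
    """MED: average of |exact - approximate| over all samples."""
    error_distances = [abs(e - a) for e, a in zip(exact, approx)]
    return sum(error_distances) / len(error_distances)


def normalized_error_distance(exact, approx, max_error):
    """NED: MED divided by the maximum error (supplied by the caller,
    since its derivation is design-specific and assumed here)."""
    return mean_error_distance(exact, approx) / max_error


if __name__ == "__main__":
    exact = [152, 10, 7]          # hypothetical exact sums
    approx = [159, 10, 15]        # hypothetical approximate sums
    print(mean_error_distance(exact, approx))
    print(normalized_error_distance(exact, approx, max_error=15))
```

In an evaluation like the one described in this paper, `exact` and `approx` would hold the outputs of the accurate and approximate adders for the same random input patterns.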
Peak signal to noise ratio (PSNR)
The most commonly used metric to evaluate the error in image/video processing applications is the PSNR.
It is the ratio of the maximum signal power to the associated noise power. The PSNR in dB is given by:
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{I_{max}^{2}}{\mathrm{MSE}}\right)$$
where I_max and MSE are the maximum value of the input signal and the mean square error, respectively.
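As a small illustration of the PSNR definition above, the sketch below computes it for two equally sized pixel sequences. The function name and the default I_max = 255 (8-bit pixels) are assumptions made for the example.

```python
import math


def psnr_db(reference, processed, i_max=255):
    """PSNR in dB between two equally long pixel sequences.

    i_max is the maximum signal value; 255 assumes 8-bit pixels.
    """
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, processed)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf                    # identical images: no noise power
    return 10 * math.log10(i_max ** 2 / mse)


if __name__ == "__main__":
    reference = [100, 100, 100, 100]       # hypothetical pixel values
    processed = [110, 90, 100, 100]
    print(round(psnr_db(reference, processed), 2))
```

A two-dimensional image can be flattened row by row before calling the function, since the MSE is an average over all pixels.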
Structural similarity index metric (SSIM) [27]
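It provides a structure-based rather than pixel-based analysis and represents the structure of objects in the scene, independent of the average luminance and contrast. In its standard form [27], it is given by:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
where \mu_x and \mu_y are the means, \sigma_x^2 and \sigma_y^2 the variances, and \sigma_{xy} the covariance of the two images x and y, and C_1 and C_2 are small constants that stabilize the division.
Although SSIM, which quantifies the structural similarity between the original and reconstructed images, has gained considerable attention, the most frequently used quality parameter for image processing is the PSNR.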
Quality/accuracy analysis
The above-mentioned quality metrics were obtained for the proposed and existing adder architectures by implementing them in MATLAB and simulating with 1 million random input patterns. We divided the operands equally into two parts while designing the ETA-I adder, and equally into three parts while designing the ETA-II and ACA adder architectures. In ETA-IIM and ETA-IV, the two most significant subadders were combined to achieve longer carry propagation in the MSB part. Furthermore, no error correction was done and a single approximate mode was considered while evaluating the error metrics of ACA. Table 1 reports the error metrics for approximate adders of various bit-widths. Adders ETA-IIM and ETA-IV have the same error characteristics; therefore, only the simulation results of ETA-IIM are presented.
It can be observed that the proposed ProtLA yields small values of error metrics such as MED and NED along with a high value of PSNR. Furthermore, the error metrics of the ETA-I and ProtLA reduce with increasing adder size due to more carry propagation in the accurate part. The error metrics of the 32-bit ETA-II and ACA adders are poorer than those of the 16-bit adders because a similar subadder size, i.e., the same carry propagation length, was used so that they exhibit similar delay characteristics. Furthermore, it can be observed that the error metrics of RCLA are better than those of the ACA for the 8-bit adder but become very poor for larger bit-width designs. Finally, the proposed ProtLA provides superior error metrics even for the larger bit-width adders.
Image scalers are now widely used in many fields of image/video processing, from consumer electronics to medical imaging, to make image capturing and output display devices independent of each other. The efficacy of the proposed adder is therefore evaluated in the image scaling processor (ISP). Among the several image scaling algorithms, bilinear interpolation, which is the most computation-efficient, is widely used for designing image scalers [25]. The intermediate pixel (T), as shown in Figure 7, is computed in the bilinear interpolation algorithm using Eqs. (5-7), where in(i,j), in(i+1,j), in(i,j+1), and in(i+1,j+1) are the four neighboring pixels, and y_d and x_d are the distances from the original pixel to the target pixel in the vertical and horizontal directions, respectively.
Figure 7. Illustration of bilinear interpolation.
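The interpolation step described above can be sketched in its textbook form as follows. This is not a transcription of the paper's Eqs. (5-7); the function name is hypothetical, and it is assumed that i indexes rows (vertical direction), j indexes columns (horizontal direction), and that x_d and y_d are normalized to the range 0 to 1.

```python
def bilinear_pixel(in_i_j, in_i_j1, in_i1_j, in_i1_j1, x_d, y_d):
    """Target pixel T from four neighbors by standard bilinear interpolation.

    Assumed orientation: in_i_j is in(i,j), in_i_j1 is in(i,j+1) (same row,
    next column), in_i1_j is in(i+1,j), and in_i1_j1 is in(i+1,j+1).
    """
    # Horizontal interpolation along the upper and lower rows.
    upper = in_i_j * (1 - x_d) + in_i_j1 * x_d
    lower = in_i1_j * (1 - x_d) + in_i1_j1 * x_d
    # Vertical interpolation between the two intermediate values.
    return upper * (1 - y_d) + lower * y_d


if __name__ == "__main__":
    # Hypothetical neighbor values; the target sits at the center of the cell.
    print(bilinear_pixel(10, 20, 30, 40, x_d=0.5, y_d=0.5))
```

In a hardware ISP, each multiply-accumulate in this computation ends in additions, which is where an approximate adder such as the one proposed here would be substituted.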
We implemented image scaling processors embedded with different approximate adders. These ISPs are simulated with various benchmark images such as Lena, Baboon, and Cameraman. The simulation results of the different image scaling processors embedded with the proposed and existing adders are illustrated in Table 2. The PSNR of the proposed ISP embedded with ProtLA is 2.43 dB, 8.49 dB, 5.27 dB, 11.7 dB, and 13.85 dB higher than those of the ETA-I [6]-, ETA-II [7]-, ETA-IIM [8]-, ACA [19]-, and RCLA [22]-embedded ISPs, respectively. The SSIM of the ISP embedded with the proposed adder is also higher than those of the existing approximate adders, which reflects that the proposed ProtLA provides higher-quality scaled images than the existing adders. Figure 8 illustrates the original and scaled images using ISPs embedded with different approximate adders. Figure 8a is the original image, while Figures 8b-8h represent the images scaled via the image scaler embedded with the accurate, ETA-I, ETA-II, ETA-IIM, ACA, RCLA, and the proposed ProtLA adder architectures, respectively. It can be observed that the image scaled via the ISP embedded with RCLA shows easily observable black spots, whereas the image scaled using the proposed adder shows negligible error.
Design metrics of Prot-LA on FPGA
To evaluate the performance of the proposed adder over the existing adders, all the adders were coded in Verilog and simulated using ModelSim. The synthesis was carried out using the Xilinx ISE tool-chain and the designs were finally implemented on a Xilinx Spartan-6 FPGA (XC6SLX45). The results in Table 3 show that the proposed adder requires 11.1%, 33.3%, 33.3%, 77.7%, 55.5%, and 122.2% reduced area over ETA-I [6], ETA-II [8], ETA-IIM [8], ETA-IV [9], ACA [19], and RCLA [22], respectively. Moreover, the results show that the worst-case combinational delay is lowest for the RCLA and highest for the ETA-I adder.
Furthermore, ISPs with various approximate adders are implemented in Verilog to evaluate the efficacy of the proposed adder in the application. Table 4 illustrates the area and maximum combinational delay of the ISPs embedded with different approximate adders. The simulation results show that the ProtLA-embedded ISP requires 6.1%, 10.7%, 10.7%, 20%, 16.9%, and 41.5% reduced area over the ISPs embedded with ETA-I [6], ETA-II [8], ETA-IIM [8], ETA-IV [9], ACA [19], and RCLA [22], respectively. Although the delay of ProtLA is larger than that of ETA-II and ACA, it is smaller than that of ETA-I. Although the delay of RCLA is the minimum among all the designs, its large area and power consumption reduce its effectiveness. Furthermore, full custom analysis is also done using the Synopsys tool-chain, which is discussed in the next subsection.
ASIC implementation
In order to do full custom analysis, the designs are implemented in Verilog and synthesized using Synopsys' Design Compiler with 45-nm PDK. The design metrics of synthesized netlists for different bit-width approximate adders are illustrated in Table 5 .
The simulation results show that the proposed ProtLA occupies a smaller area than the existing approximate adders. Furthermore, the 8-bit ProtLA requires 19.9%, 92.38%, 95.7%, 144.0%, 133.5%, and 220.6% reduced power consumption over the ETA-I [6], ETA-II [8], ETA-IIM [8], ETA-IV [9], ACA [19], and RCLA [22], respectively. Figure 9 compares the power delay product (PDP) of the proposed adder with that of the existing adders for various bit-widths. It can be seen from the figure that the proposed ProtLA requires less energy than the state-of-the-art adders. Furthermore, the energy saving increases with increasing bit-width.
The power-quality trade-off for the various image scaling processors embedded with different approximate adders is illustrated in Figure 10. The ACA consumes large power with a smaller PSNR as it was designed without considering the effect of process variation, whereas the ISP embedded with the proposed Prot-LA provides a higher PSNR at a lower power consumption than the existing architectures.
The layout of the image scaling processor embedded with the proposed and existing approximate adders is extracted from Synopsys' IC Compiler with a 45-nm PDK. The postlayout simulation results, as illustrated in Table 6, show that the proposed ProtLA-embedded ISP requires the minimum area among the compared designs. Furthermore, the delay and power of the proposed IS-ProtLA are 6.6% and 8.67% smaller than those of the IS-ACA. The PDP, which reflects the energy consumption, of the ProtLA-embedded ISP is 7.75%, 11%, 16.01%, 36.31%, 15.6%, and 44.44% smaller than those of the ISPs embedded with ETA-I [6], ETA-II [8], ETA-IIM [8], ETA-IV [9], ACA [19], and RCLA [22], respectively. Thus, the proposed adder provides significantly improved design metrics along with acceptable quality metrics, which suggests that it can be effectively utilized in image processing applications. The layout of the proposed ISP embedded with ProtLA is shown in Figure 11. The area of this layout is 15425 µm².
Figure 11. Layout of the image scaling processor embedded with ProtLA.
Conclusion
In this paper, we have presented a process-tolerant low-power adder that reduces the large errors occurring due to process variation by reconfiguring the number of bits in the carry-propagate and carry-free paths. The proposed adder reconfigures the number of bits in the carry-propagate path such that a large error due to timing failure does not occur. The efficacy of the proposed adder was evaluated both as a standalone adder and in an application, and compared with the state-of-the-art adders. The simulation results showed that the proposed adder achieves at least 19.9% reduced power consumption compared to the best-known approximate adder with a 45-nm PDK. Furthermore, we demonstrated that the ProtLA-embedded image scaling processor exhibits at least 7.75% reduced energy consumption with higher image quality (2.43 dB higher PSNR) compared to the state-of-the-art adder architecture. The simulation results showed that the proposed adder can be effectively utilized in error-tolerant applications requiring energy-efficient signal processing.
