This paper presents approximate multipliers that are efficiently deployed on Field Programmable Gate Arrays (FPGAs) by using newly proposed approximate logic compressors at different levels of accuracy. Our approximate multiplier designs offer higher power-delay-area product (PDAP) gains than state-of-the-art works at comparable accuracies. Furthermore, in terms of delay, occupied area, and dynamic power dissipation, our designs are much better than the Lookup Table based multiplier Intellectual Properties available on an FPGA. In particular, our proposed 8-, 16-, and 32-bit multipliers deliver PDAP gains of up to 7.1×, 8.3×, and 5.0×, respectively. The effectiveness and applicability of our designs are also demonstrated by image processing applications such as image multiplication and sharpening. The experiments show that, for image sharpening, our 8 × 8 multipliers deliver a peak signal-to-noise ratio (PSNR) of 46.81 dB, a structural similarity index metric (SSIM) of 0.9989, and a dynamic power saving of up to 36.7% with respect to the exact multiplier. For image multiplication, the approximate 16 × 16 multipliers offer a PSNR of 80.25 dB, an SSIM of 1.0, and a dynamic power saving of up to 58.15%. These demonstrations indicate that the proposed multipliers are suitable for high-performance and low-power error-resilient applications.
I. INTRODUCTION
In many error-resilient applications such as multimedia, data mining, image processing, and machine learning, precise computations are not always necessary [1]-[4]. Computation results with some degradation of accuracy can be acceptable and meaningful enough for these applications [5]. By taking advantage of this property, we can trade off the accuracy of a circuit against its electrical performance. That is, we can accept some loss of accuracy in exchange for gains in power dissipation, occupied area, and delay.
Voltage over-scaling is a solution to reduce the power dissipation of a circuit [6]-[8]. However, when a circuit operates below the nominal voltage level, timing-induced failures can lead to large and unpredictable errors due to the failures of the most significant bits. Approximate computing is one of the efficient paradigms to lower the power consumption and enhance the performance of an embedded system. Some errors are allowed at the outputs of a complex circuit in order to simplify logic expressions, which in turn reduces the logic count. Eventually, savings in area and dynamic power dissipation, and shorter circuit delays, can be achieved. In the past few years, intensive research has focused on approximation techniques for arithmetic units such as adders and multipliers [2], [5], [9]-[12]. Approximations can be implemented at the transistor level as well as the logic gate level [4], [13]. Due to the high complexity of multipliers, applying approximate computing to these circuits can offer benefits in power and performance. In principle, approximate techniques can be applied to any stage of a multiplier. However, most of the hardware resources and computation time are occupied by the partial product accumulation. Therefore, optimizing the partial product accumulation can achieve larger gains.
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
The removal of partial products from the partial product matrix (PPM) leads to reductions in circuit area and delay. Truncation is one such method, which ignores the least significant partial products in the partial product accumulation [14]. In [15], the authors proposed a broken-array multiplier whose omitted carry-save adder cells were specified by horizontal break and vertical break parameters. These methods cause biased errors; that is, the approximate multipliers can produce zero values for non-zero inputs. Zervakis et al. [16] proposed skipping the generation of arbitrary partial product rows in the PPM, which was called partial product perforation. Nevertheless, the more partial products are removed, the more errors are produced.
OR-gate based logic compression has been applied to reduce the partial product tree. In [17], the authors used a 2-input OR gate to combine two partial products in the same column into a new one, and this procedure was applied to all columns in the PPM. In [18], the authors proposed combining two or more partial products in the same column at the first reduction stage by using OR gates with two or more inputs, taking the bit position significance into account. Then, exact full adders were employed for the next reduction stages. However, these methods produced high errors in the final results. To recover the accuracy, error compensation vectors were intentionally generated and added to the final stage or the first reduction stage [17]-[19]. The generation of error vectors incurs area and power dissipation overheads. Moreover, the vectors also increase the height of the partial product tree, which could cost some extra logic compressions. Another limitation of the work in [18] is that it used full adders to compress the partial product tree at the reduction stages following the first stage. This limits the area and power reductions since the logic compression ratio is only 1.5 (= 3/2).
To obtain larger area savings and power reductions, a higher logic compression ratio has been used to reduce the partial product tree; approximate 4-2 compressors are typically utilized at the cost of some accuracy [5], [14], [19], [20]. Even higher compression ratios can be used, such as approximate 5-2 (5 inputs and 2 outputs) and 6-2 compressors [20]. However, they showed higher errors in the final products. The authors in [5] proposed approximate 4-2 compressors with simple logic equations to reduce the logic count. However, their approximate compressors still occupied substantial hardware resources when synthesized on an FPGA. Besides, their proposed multipliers utilizing that compressor produced non-zero values for zero-valued inputs, which costs extra logic to detect this case. Similarly, the authors in [21] used both their proposed approximate compressors and exact compressors to form the so-called configurable dual-quality multipliers. Nevertheless, their method cannot be deployed on FPGAs since existing commercial FPGAs do not feature power gating. Besides, their proposed 4-2 compressors still consume substantial hardware resources and power with moderate error rates when synthesized on FPGAs.
As discussed above, employing the presented approximation techniques to design different types of multipliers can achieve performance gains, energy savings, and area reductions. Generally, FPGA-based and application-specific integrated circuit (ASIC)-based design architectures are fundamentally different from each other. For ASIC-based designs, we can fully optimize and customize our circuits to reduce the logic count. In contrast, for FPGA-based designs, the logic functions are implemented by prefabricated look-up tables (LUTs), which have a fixed number of inputs and a fixed drive strength. It is not easy to fully control their utilization. Thus, the approximation techniques that have been efficiently deployed in ASIC-based designs might offer limited benefits when ported to FPGA-based designs.
Xilinx and Intel FPGAs provide fast DSP-based multipliers for digital signal processing applications demanding low power. However, using these multiplier Intellectual Properties (IPs) can incur a large routing delay since they are available only at certain locations on an FPGA. Meanwhile, LUTs are distributed all over the device, which can make the routing delay much shorter. Note that in modern devices, the routing delay becomes dominant in the data-path delay. Besides, if an application occupies all the dedicated IPs, there are no IPs left for other applications that require low-power and high-speed computations. Consequently, LUT-based multiplier IPs are still necessary [22]. Ullah et al. [23], [24] recently proposed optimizing their multipliers by truncating a least significant partial product of a 4 × 2 multiplier to save an LUT. Then, larger operand size multipliers (e.g., 8 × 8, 16 × 16) were built by using this building block. Their approach provided only a limited benefit in area and power savings.
To tackle the aforementioned problems, our work designs and analyzes a compact approximate 3-2 compressor and several approximate 4-2 compressors. Then, different approximate multipliers are proposed and realized by using those logic compressors at different levels of accuracy. In this work, our goal is to optimize the approximate multipliers so that they can be efficiently implemented on FPGAs with high electrical performance (i.e., low power consumption, small area, low latency) and low accuracy losses. Thus, these multipliers can be suitable for digital signal processing applications such as image processing and multimedia. The efficiency of our multiplier designs is analyzed and assessed by comparison with state-of-the-art works in terms of delay, power, area, power-delay product (PDP), and PDAP. Furthermore, their accuracy losses are also analyzed and compared to those of prior works. Our work is summarized as follows:
1) Approximate multipliers are proposed to be efficiently implemented on FPGAs at low losses of accuracy by using the compact approximate 3-2 compressor and approximate 4-2 compressors.
2) Design-time accuracy configurations can be achieved by using different proposed multiplier designs.
3) Quality metrics (QM) such as mean error distance (MED), mean relative error distance (MRED), and normalized mean error distance (NMED) are used to rigorously analyze the quality of the proposed approximate multipliers.
4) Exhaustive simulations are carried out to obtain exact QMs for approximate 8 × 8 and 16 × 16 multipliers.
5) Formulas are derived for estimating the important metrics (MED and NMED) for large operand size multipliers being scaled up from smaller operand size ones.
6) Our proposed approximate multipliers are prototyped on an FPGA to measure electrical performances such as delays, areas, and dynamic power dissipation.
7) The trade-offs between the accuracies and PDAPs are also presented.
8) Image processing applications such as image multiplication and image sharpening are implemented and evaluated on an FPGA board to demonstrate the applicability and effectiveness of our proposed multipliers.
Compared to our previous study [25], this work contains several significant extensions: (i) a newly designed approximate 4-2 compressor (CP2) to improve power consumption and area, (ii) a formulation of the error analysis of large operand size multipliers scaled up from smaller ones, (iii) the development of large size multipliers (16 × 16, 32 × 32), (iv) an extension of the proposed approach to approximate signed multipliers, and (v) an image sharpening application to evaluate the effectiveness of the approximate multipliers.
This paper is structured as follows. The overview of exact and approximate compressors and error metrics is presented in Section II. The error analysis for large operand size multipliers is formulated in Section III. The approximate compressors are designed and analyzed in Section IV. Our proposed approximate multipliers are designed and implemented in Section V. Error analysis and performance evaluation are shown in Section VI. The effectiveness and applicability of our approximate multipliers in image applications are demonstrated in Section VII. Finally, Section VIII concludes our work.
II. PRELIMINARIES
A. EXACT AND APPROXIMATE 4-2 COMPRESSORS
A conventional exact 4-2 compressor consists of two full adders as shown in Fig. 1(a) . It has 5 inputs and 3 outputs. The output sum has the same weight of 1 as the inputs while the outputs cout and carry have a weight of 2. The reduction tree of an 8 × 8 multiplier is illustrated in Fig. 2 in which the dashed shapes represent half and full adders and the solid shapes are the exact 4-2 compressors.
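The two-full-adder structure described above can be modelled at the bit level as follows. This is a minimal behavioral sketch, not the circuit of Fig. 1(a) itself; the function names and the exact wiring of the inputs to the two full adders are assumptions made for illustration.

```python
def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def exact_compressor_4_2(x1, x2, x3, x4, cin):
    """Exact 4-2 compressor modelled as two cascaded full adders.

    Returns (sum, carry, cout): sum has weight 1, carry and cout have weight 2.
    The input-to-adder assignment is an assumed, conventional wiring.
    """
    s1, cout = full_adder(x1, x2, x3)        # first full adder
    total, carry = full_adder(s1, x4, cin)   # second full adder
    return total, carry, cout
```

For every input combination, `sum + 2 * (carry + cout)` equals the number of 1's among the five inputs, which is the property the exact compressor must preserve.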
An approximate 4-2 compressor has two outputs that approximately count the 1's at its four inputs. The block diagram of an approximate 4-2 compressor is illustrated in Fig. 1(b). The two outputs of the approximate 4-2 compressor can have the same or different weights. Compared to the conventional exact 4-2 compressor, the approximate one has neither the carry input (received from the preceding compressor) nor the carry output (passed to the next compressor). This shortens the delay and reduces the occupied area and power consumption.
B. ERROR METRICS
Let us consider an N × N multiplier. Error metrics such as MED, MRED, and NMED are utilized to evaluate the quality of the multiplier. Let $ED_i$ (error distance) be the arithmetic distance between the i-th accurate and approximate products, and $S_i$ be the i-th accurate product. The MED, MRED, and NMED are defined as (1), (2), and (3), respectively [21], [26].
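As an illustration of how these metrics can be evaluated exhaustively, the sketch below uses the definitions commonly found in the literature: MED is the mean of $ED_i$ over all $2^{2N}$ input pairs, MRED is the mean of $ED_i / S_i$, and NMED is MED divided by the maximum product $(2^N - 1)^2$. These forms are assumed to match (1) to (3). Skipping pairs with $S_i = 0$ in the MRED sum is an assumption of this sketch, and `truncated_mul` is a hypothetical stand-in for an approximate multiplier, not one of the proposed designs.

```python
from fractions import Fraction


def error_metrics(approx_mul, n_bits):
    """Exhaustively compute (MED, MRED, NMED) for an unsigned n_bits x n_bits multiplier."""
    count = 1 << (2 * n_bits)
    ed_sum = 0
    red_sum = Fraction(0)
    for a in range(1 << n_bits):
        for b in range(1 << n_bits):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))
            ed_sum += ed
            if exact != 0:  # assumption: pairs with S_i = 0 are skipped
                red_sum += Fraction(ed, exact)
    med = Fraction(ed_sum, count)
    mred = red_sum / count
    nmed = med / ((1 << n_bits) - 1) ** 2
    return med, mred, nmed


def truncated_mul(a, b, dropped_bits=2):
    """Hypothetical approximate multiplier: clears the lowest product bits."""
    return ((a * b) >> dropped_bits) << dropped_bits


med, mred, nmed = error_metrics(truncated_mul, 4)
print(float(med), float(mred), float(nmed))
```

Exact rational arithmetic is used so that the metrics can be compared without floating-point rounding; for 16 × 16 operands the same loop visits $2^{32}$ pairs and would need a faster implementation.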
III. A FORMULATION OF ERROR ANALYSIS
Let us consider an N × N multiplier built from m × m multipliers (N = 2 × m) as illustrated in Fig. 3. The mean error distance of the N × N multiplier ($MED_N$) is given as (4).
The error distance of the N × N multiplier is obtained as (5), where e and a denote exact and approximate terms, respectively. (AH, BH) and (AL, BL) are the high and low halves of the operands A and B.
From (5), we have the following inequality, where $ED_{HHi}$, $ED_{HLi}$, $ED_{LHi}$, and $ED_{LLi}$ are the i-th error distances caused by the four approximate m × m multipliers.
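The numbered expressions are not reproduced in this text. As a sketch consistent with the definitions above (the symbol $\Delta_{XY,i}$ for the signed error of each m × m block is introduced here for illustration and is not the authors' notation), the decomposition and the resulting bound can be written as:

```latex
\begin{aligned}
A \times B &= 2^{2m}\,(AH \times BH) + 2^{m}\,(AH \times BL + AL \times BH) + AL \times BL, \\
\Delta_{XY,i} &= (X \times Y)_e - (X \times Y)_a, \qquad ED_{XY,i} = \lvert \Delta_{XY,i} \rvert, \\
ED_i &= \bigl\lvert\, 2^{2m}\,\Delta_{HH,i} + 2^{m}\,(\Delta_{HL,i} + \Delta_{LH,i}) + \Delta_{LL,i} \,\bigr\rvert \\
     &\le 2^{2m}\,ED_{HH,i} + 2^{m}\,(ED_{HL,i} + ED_{LH,i}) + ED_{LL,i}.
\end{aligned}
```

The last step is the triangle inequality, so the bound is tight exactly when the four signed errors do not have opposite signs.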
The equality in (6) occurs when the error distances of all approximate m × m multipliers are greater than or equal to zero, or all of them are less than or equal to zero. Substituting $ED_i$ from (6) into (4), we have:
It is important to note that we can swap the values of AH, AL, BH, and BL so that their occurrences are uniform over the range $[0, 2^{2m} - 1]$ as illustrated in Fig. 4. Swapping the values of AH, BH, AL, and BL does not affect the final summation of the error distances $ED_i$. Additionally, $\sum_{i=k \cdot 2^{2m}+1}^{(k+1) \cdot 2^{2m}} ED_{XYi}$ (k is an integer, k = 0 → N/m + 1; X and Y can be H or L) is a constant over each interval $[0, 2^{2m} - 1]$. Therefore, we have:
Finally, the MED of an N × N multiplier is a summation of weighted MED of m × m multipliers as expressed in (10) .
If the m × m multipliers have the same $MED_m$, the $MED_N$ of an N × N multiplier can be given as:
The normalized mean error distance of the N × N multiplier ($NMED_N$) can be calculated from (10) and (3). Moreover, if the m × m multipliers have the same $MED_m$, the $NMED_N$ can be approximated as (12).
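The scaling of the MED in the equality case can be checked numerically. The sketch below builds an 8 × 8 multiplier from four identical 4 × 4 blocks and compares the exhaustively measured MED with the position-weighted sum $(2^{2m} + 2 \cdot 2^{m} + 1) \cdot MED_m = (2^m + 1)^2 \cdot MED_m$. This closed form is derived here from the block weights and may be written differently in (11) and (12). The building block `approx_mul_m`, which simply drops low-order partial products so that its result never exceeds the exact product, is a hypothetical example and not one of the proposed designs.

```python
from fractions import Fraction


def approx_mul_m(a, b, m=4, dropped_cols=3):
    """Hypothetical m x m approximate multiplier.

    Drops every partial product a_i * b_j with i + j < dropped_cols,
    so the result is always <= the exact product (equality case).
    """
    total = 0
    for i in range(m):
        for j in range(m):
            if i + j >= dropped_cols:
                total += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    return total


def compose_2m(mul_m, a, b, m=4):
    """Build a 2m x 2m product from four m x m products."""
    mask = (1 << m) - 1
    ah, al = a >> m, a & mask
    bh, bl = b >> m, b & mask
    return ((mul_m(ah, bh) << (2 * m))
            + ((mul_m(ah, bl) + mul_m(al, bh)) << m)
            + mul_m(al, bl))


def med(mul, n_bits):
    """Exhaustive mean error distance as an exact fraction."""
    total = sum(abs(a * b - mul(a, b))
                for a in range(1 << n_bits) for b in range(1 << n_bits))
    return Fraction(total, 1 << (2 * n_bits))


m = 4
med_m = med(approx_mul_m, m)
med_n = med(lambda a, b: compose_2m(approx_mul_m, a, b, m), 2 * m)
print(med_n == (2 ** m + 1) ** 2 * med_m)
```

Because $2^{2m} - 1 = (2^m - 1)(2^m + 1)$, dividing both sides by the squared maximum product shows that, for this example, the NMED of the composed multiplier equals the NMED of its building block.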
The strict inequality in (6) holds when the error distances of some approximate m × m multipliers are greater than or equal to zero while others are less than zero. In this case, the expression (6) can be rewritten as (13).
Therefore, the worst case (upper bound) of the mean error distance ($MED_N$) can be obtained as (14).
IV. DESIGN AND ANALYSIS OF APPROXIMATE COMPRESSORS
In this section, we present two kinds of compressor that are efficiently implemented on an FPGA: an approximate 3-2 compressor and approximate 4-2 compressors. We also analyze the error probabilities that occur in these compressors.
A. APPROXIMATE 3-2 COMPRESSOR
An exact 3-2 compressor converts three inputs into a value (the number of 1's at inputs) represented in two outputs.
To cover all the possible combinations of inputs, one output has a weight of 1, the same as the inputs, while the other output has a weight of 2. However, by accepting some errors in the final counting results, its two outputs can be given the same weight as the inputs, which forms an approximate 3-2 compressor. The Karnaugh map is shown in Fig. 5. The error cases are marked by circles and highlighted in red. The outputs y1 and y2 are simplified as (15) and (16).
If the compressor is applied to reduction stage 1 (see Fig. 2), it occupies two 6-input LUTs (including the partial product generation circuit). The probability of a partial product bit ($a_i b_j$) being 0 is 3/4, since $p(a_i b_j = 0) = (1/2)(1/2) + (1/2)(1/2) + (1/2)(1/2)$, while the probability of it being 1 is 1/4, since $p(a_i b_j = 1) = (1/2)(1/2)$. From Fig. 5, the error probability $P_1(Err)$ is given as (17):
If the compressor is applied to reduction stage n (n > 1), it uses just 3 pins of a 6-input LUT for y1 and y2 (compression ratio = 1.5). As the occurrence probabilities of the inputs are redistributed through the reduction stages, the error probability $P_n(Err)$ is given as (18), where $p_n(x3, x2, x1)$ is the occurrence probability of a specific pattern (x3, x2, x1) at reduction stage n (n > 1).
$$P_n(Err) = p_n(0, 1, 1) + p_n(1, 1, 1) \tag{18}$$
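The error analysis above can be reproduced with a small enumeration. The logic in `approx_compressor_3_2` is an assumption: it is one simple choice of two weight-1 outputs whose error patterns coincide with the two patterns in (18), and it is not taken from (15) and (16). The stage-1 probability is computed under the further assumption that the three partial product bits are independent, each being 1 with probability 1/4.

```python
from fractions import Fraction
from itertools import product

P_ONE = Fraction(1, 4)  # probability that a partial product bit a_i * b_j is 1


def approx_compressor_3_2(x3, x2, x1):
    """Hypothetical approximate 3-2 compressor; both outputs have weight 1."""
    y1 = x1 | x2
    y2 = x3
    return y1, y2


def stage1_error_probability(compressor, n_inputs):
    """Sum the probabilities of all input patterns that are counted wrongly."""
    p_err = Fraction(0)
    for bits in product((0, 1), repeat=n_inputs):
        if sum(compressor(*bits)) != sum(bits):
            p = Fraction(1)
            for bit in bits:
                p *= P_ONE if bit else 1 - P_ONE
            p_err += p
    return p_err


# Patterns are listed as (x3, x2, x1), matching the notation of (18).
error_patterns = [bits for bits in product((0, 1), repeat=3)
                  if sum(approx_compressor_3_2(*bits)) != sum(bits)]
print(error_patterns, stage1_error_probability(approx_compressor_3_2, 3))
```

Under these assumptions the stage-1 error probability is $(3/4)(1/4)^2 + (1/4)^3 = 1/16$.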
B. APPROXIMATE 4-2 COMPRESSORS
In this section, we design and analyze two classes of approximate 4-2 compressors. In the first class, the approximate compressors CP1 and CP2 have outputs with different weights: one output has a weight of 1 whereas the other has a weight of 2. In the second class, the outputs of the approximate compressors CP3 and CP4 have the same weight as their inputs.
1) APPROXIMATE 4-2 COMPRESSOR 1 (CP1) -CLASS 1
As observed in [27] , two outputs of this compressor have different weights, which can represent the number of 1's for most possible combinations of inputs. An error occurs when its inputs get all 1's. The Karnaugh map of this compressor is illustrated in Fig. 6 . The outputs C and S of this compressor are expressed as (19) and (20) .
The error probability $P_1(Err)$ is 1/256 (= $(1/4)^4$) when this compressor is applied to reduction stage 1. On the other hand, if this compressor is used for reduction stage n (n > 1), the error probability $P_n(Err)$ is $p_n(1, 1, 1, 1)$. In that case, each variable in (19) and (20) can be a single variable. Thus, it can be fitted into one 6-input LUT as illustrated in Fig. 7(b). The delay of this compressor is the delay of one 6-input LUT. However, if this compressor is used to reduce the PPM at the first stage, it costs three 6-input LUTs (including the partial product bit generation circuitry).
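A behavioral model makes the single error case explicit. The sketch below does not reproduce the logic expressions (19) and (20); it only encodes the described behavior, where the outputs C (weight 2) and S (weight 1) represent the input count whenever it fits in two bits. Saturating the all-ones case to 3 is an assumption, since the text states only that an error occurs there.

```python
from fractions import Fraction
from itertools import product


def cp1_behavioral(x4, x3, x2, x1):
    """Behavioral model of a class-1 compressor: returns (C, S), value = 2*C + S.

    Assumption: the count of 1's saturates at 3 when all inputs are 1.
    """
    count = min(x4 + x3 + x2 + x1, 3)
    return count >> 1, count & 1


def output_value(bits):
    c, s = cp1_behavioral(*bits)
    return 2 * c + s


errors = [bits for bits in product((0, 1), repeat=4)
          if output_value(bits) != sum(bits)]

# Stage-1 error probability with independent inputs, each 1 with probability 1/4.
p1_err = Fraction(1, 4) ** len(errors[0]) if errors else Fraction(0)
print(errors, p1_err)
```

Only the pattern (1, 1, 1, 1) is miscounted, which gives the stage-1 error probability of 1/256 quoted above.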
2) APPROXIMATE 4-2 COMPRESSOR 2 (CP2) -CLASS 1
A new approximate 4-2 compressor is proposed by modifying the Karnaugh map in Fig. 6 (i.e., introduce some more errors to its outputs). The Karnaugh map of this compressor is illustrated in Fig. 8 . The main target of this compressor is to reduce the switching activities as well as glitches when its inputs change. Therefore, the reduction in power consumption can be achieved. The logic expressions of the outputs C and S are described in (21) and (22) .
If this compressor is used to implement the logic compression at reduction stage n (n > 1), it consumes one 6-input LUT. The error probability is given by (23), where E is the set of errors shown in Fig. 8, and $p_n(x4, x3, x2, x1)$ is the occurrence probability of a specific pattern (x4, x3, x2, x1) in the set E.
3) APPROXIMATE 4-2 COMPRESSOR 3 (CP3) -CLASS 2
The two outputs of this compressor have the same weight and are used to represent the number of 1's at the inputs. The Karnaugh maps are described in Fig. 10. The outputs y1 and y2 are expressed as (24) and (25), respectively, which are OR-gate based compressions as introduced in [18].
If it is used to compress the PPM height at the first reduction stage, the error probability is given as (26), where E is the set of error patterns in Fig. 10 and (x4, x3, x2, x1) ∈ E. To optimize the area and power dissipation, the FPGA synthesis tool normally combines the partial product compressor and the partial product generation circuitry. Thus, when this compressor is used to compress the PPM height at the first reduction stage, it consumes only two 6-input LUTs as shown in Fig. 11 (including the partial product generation circuit). One remaining input (I_0) of each 6-input LUT can be used for other purposes. The circuit delay of this compressor is equivalent to the delay of one 6-input LUT. Note that the maximum error distance of this compressor is 2. Therefore, the compressors belonging to class 2 cause large errors if two or more of them are cascaded in series.
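The effect of OR-gate based compression on the error set can be illustrated by enumeration. The pairing of inputs below (x1 with x2, x3 with x4) is an assumption; the actual expressions are those of (24) and (25), which are not reproduced here. The probability is computed for independent inputs that are each 1 with probability 1/4, as at the first reduction stage.

```python
from fractions import Fraction
from itertools import product

P_ONE = Fraction(1, 4)  # probability that a partial product bit is 1


def or_compressor_4_2(x4, x3, x2, x1):
    """Hypothetical OR-gate based 4-2 compressor; both outputs have weight 1."""
    return x1 | x2, x3 | x4


p_err = Fraction(0)
max_ed = 0
for bits in product((0, 1), repeat=4):
    ed = sum(bits) - sum(or_compressor_4_2(*bits))  # error distance of this pattern
    if ed:
        weight = Fraction(1)
        for bit in bits:
            weight *= P_ONE if bit else 1 - P_ONE
        p_err += weight
        max_ed = max(max_ed, ed)

print(p_err, max_ed)
```

For this pairing, an error occurs whenever both inputs of an OR gate are 1, and the maximum error distance of 2 is reached when all four inputs are 1, which matches the maximum error distance stated for this class.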
4) APPROXIMATE 4-2 COMPRESSOR 4 (CP4) -CLASS 2
This compressor is proposed to improve the accuracy with moderate timing and very small area overheads. The corresponding Karnaugh map is shown in Fig. 12 . The logic expressions of its outputs y1 and y2 are represented in (27) and (28) .
If this compressor is applied to the first reduction stage, the total error probability $P_1(Err)$ is given by (29), where (x4, x3, x2, x1) ∈ E.
When the compressor CP4 is used to compress the PPM height at the first reduction stage, it still consumes two 6-input LUTs as illustrated in Fig. 13 (including the partial product generation circuitry). Compared to the compressor CP3, this compressor fully consumes the hardware resources of two 6-input LUTs. Its circuit delay is about the delay of two 6-input LUTs.
C. COMPARISON OF APPROXIMATE 4-2 COMPRESSORS
This section aims at comparing approximate 4-2 compressors in terms of power dissipation, occupied area, and error rate. Based on the comparison, we find out which compressor is suitable for which stage of the reduction tree.
As mentioned above, to optimize the power dissipation and the number of occupied LUTs, the 4-2 compressors are normally merged with the partial product generation circuits when they are used to compress the partial product tree at the first reduction stage. So, they occupy two 6-input LUTs. Table 1 shows the performance comparison between the presented approximate 4-2 compressors and the ones in previous works. As we can see, the compressors CP3 and CP4 show higher maximum error distances (i.e., 2) than those of the other compressors; however, they can save up to 50% of the power dissipation and approximately 33.3% of the occupied area. Therefore, the compressors CP3 and CP4 are suitable for reduction stage 1.
Table 2 shows the performance comparison between the presented approximate 4-2 compressors and the ones in previous works when they are utilized to compress the PPM height at reduction stage n (n > 1). That is, their inputs are not partial products ($p_{ij}$). Hence, they occupy only one LUT on an FPGA. The compressors CP2, CP3, and CP4 are much better than the others in terms of power dissipation. The maximum error distances of the compressors CP3 and CP4 are higher than those of the compressors CP1 and CP2. As pointed out in Section IV-B.3, we should not cascade the compressors CP3 and CP4 in consecutive reduction stages since they produce high error rates and large errors. So, the compressor CP2 should be selected to implement the logic compression at reduction stage n. Moreover, the compressor CP1 can also be used for higher accuracy multipliers.
V. PROPOSED APPROXIMATE MULTIPLIERS
A. APPROXIMATE 8 × 8 MULTIPLIERS
Our proposed 8 × 8 approximate multiplier implementations consist of several steps: (i) partial product generation, (ii) PPM reduction at the first stage by using approximate 4-2 compressors in class 2, (iii) PPM reduction at the following stages n (n > 1) by using approximate 4-2 compressors in class 1, and (iv) generation of the final result by using a ripple carry adder (RCA). Note that the partial product generation and the PPM reduction at the first stage are combined (using approximate 4-2 compressors in class 2) to save hardware resources and thereby power consumption. This is a considerable difference between our FPGA-based implementations and ASIC-based ones. The benefit of this implementation was analyzed in Section IV-B.
A dot diagram of the proposed 8 × 8 approximate multiplier is illustrated in Fig. 14. We apply different kinds of approximate compressors to compress the PPM at different reduction stages. In terms of hardware resources and accuracy, each kind of approximate compressor should be matched to the reduction stage it suits. For example, the approximate 4-2 compressors in class 2 are efficiently applied to compress the PPM at the first stage since they cost few hardware resources and dissipate less power. The approximate 4-2 compressors in class 1 are appropriate for reducing the PPM height at stages n (n > 1) because they show higher accuracies and consume less dynamic power. This is analyzed in detail in the compressor implementations in Sections IV-B and IV-C.
Moreover, the proposed approximate multiplier is clustered into two parts with different accuracies. The most significant partial products are accumulated by using accurate compressors and adders while the least significant PPs are accumulated by utilizing presented approximate compressors. In this work, we implement different configurations of approximate multipliers with various accuracies. For example, in Fig. 14, the approximate multiplier is implemented with 5 leftmost columns of the partial products compressed by using accurate compressors and adders. An RCA is used to add two last rows of the final stage to generate the final result since the dedicated RCA is known as one of the fastest adders on FPGAs.
The proposed multi-level approximate architecture improves the accuracy, shortens the delay, and reduces the power dissipation and the occupied area of the approximate multiplier. Finally, all the proposed approximate 8 × 8 multipliers with different accuracy configurations are summarized in Table 3 . The M8_CP13_k group includes approximate 8 × 8 multipliers using the approximate 4-2 compressor CP3 at the first reduction stage and the compressor CP1 at the second reduction stage. k is the number of leftmost partial product columns (the most significant PP is not counted) compressed by accurate compressors, full-, and half-adders.
In this work, the approximate 4-2 compressor CP2 is not used to construct approximate 8 × 8 multipliers. Since only a small number of approximate 4-2 compressors are used for reduction stage 2, as in Fig. 14, the reduction in dynamic power dissipation is small while the loss of accuracy could be considerable. The effectiveness of the approximate 4-2 compressor CP2 on the dynamic power reduction is evaluated on larger operand size multipliers (i.e., 16 × 16 and 32 × 32 multipliers).
B. APPROXIMATE 16 × 16 MULTIPLIERS
In FPGAs, carry chains (each CARRY4 consists of 4 carry cells as described in [28]) are used to implement the fast RCA that generates the final result in the approximate 8 × 8 multipliers. There are some extra delay costs at the inputs and outputs of the carry chains (at the boundary of the carry chains). These delays are mainly incurred by the LUTs used to generate control signals for the carry chains. This introduces some delay overhead to 16-bit multipliers if they are built from 8 × 8 ones as shown in Fig. 3. Importantly, in order to further optimize the performance of the 16-bit multipliers, all the proposed 4-2 compressors are employed. Thus, the 16 × 16 multipliers are built in the same way as the 8 × 8 ones. Table 4 shows all the 16 × 16 multiplier configurations presented in this work; k is the number of leftmost partial product columns (the most significant PP is not counted) compressed by accurate compressors, full adders, and half adders.
C. APPROXIMATE 2M × 2M MULTIPLIERS
It is beneficial to design 32 × 32 multipliers by using 16 × 16 ones since the boundary delays of the dedicated carry chains are small compared to the delay of 32 × 32 multipliers. Another reason is that it is impractical to run a functional simulation of 32 × 32 multipliers that covers all possible combinations of input values because the simulation time is extremely long. Thus, we cannot guarantee their quality by simulation. In contrast, if we build large operand size multipliers from smaller ones, we can assess their quality through the error metrics of those building blocks as shown in Section III. Fig. 3 shows how to build the larger operand size multipliers from the smaller ones. In this way, we can build many approximate 32 × 32 multiplier configurations. However, we show only some typical configurations as described in Table 5. Note that the M32_CP13_h_m is built from the M16_CP13_h for the higher part while the middle and lower parts are built from the M16_CP13_m.
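The hybrid construction, with one building block for the higher part and another for the middle and lower parts, can be sketched as follows. The function names and the toy blocks `exact_mul` and `coarse_mul` are hypothetical and only illustrate the composition; they do not model the M16_CP13_k designs.

```python
def compose_hybrid(mul_high, mul_mid, mul_low, a, b, m):
    """Build a 2m x 2m product from m x m blocks.

    mul_high handles AH x BH, mul_mid handles AH x BL and AL x BH,
    and mul_low handles AL x BL.
    """
    mask = (1 << m) - 1
    ah, al = a >> m, a & mask
    bh, bl = b >> m, b & mask
    return ((mul_high(ah, bh) << (2 * m))
            + ((mul_mid(ah, bl) + mul_mid(al, bh)) << m)
            + mul_low(al, bl))


def exact_mul(a, b):
    """Stand-in for a high-accuracy block."""
    return a * b


def coarse_mul(a, b):
    """Hypothetical low-accuracy block: clears the two lowest product bits."""
    return (a * b) & ~0b11


# Example: an 8 x 8 hybrid with an exact high part and coarse middle/low parts.
print(compose_hybrid(exact_mul, coarse_mul, coarse_mul, 200, 123, 4), 200 * 123)
```

Because the high block is weighted by $2^{2m}$, its accuracy dominates the overall error, which is why a more accurate block is the natural choice for the higher part.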
VI. EXPERIMENTAL RESULTS
In this section, we present the accuracy analysis of the proposed approximate 8-, 16-, 32-bit multipliers. Then, we also describe the experimental conditions for evaluating and measuring the circuit delay, occupied area, and dynamic power dissipation of the proposed approximate multipliers.
The accuracy metrics of the proposed approximate 8 × 8, 16 × 16, 32 × 32 multipliers, and the prior works are presented in Table 6 . The approximate multiplier M8 [14] has the lowest quality while the multiplier M8-Ca [23] shows the highest accuracy. Among our proposed multipliers, the M8_CP14_6 has the highest quality, and its accuracy is comparable with that of the M8_Ca [23] . The proposed multipliers using the compressors CP3 have lower qualities than the multipliers with the compressors CP4 since the compressor CP4 is more accurate than the compressor CP3. The accuracy is increased with respect to the increase of the number of most significant partial product columns (k) being compressed by using the accurate compressors and adders. In terms of MED and NMED, the highest quality multipliers (M8_CP14_6 and M8_CP13_6) are over 7 × more accurate than the lowest quality multipliers (M8_CP14_2 and M8_CP13_2).
For 16 × 16 multipliers, all approximate 4-2 compressors (CP1 to CP4) are employed to develop different approximate multipliers with different accuracies. The proposed multipliers using the compressors CP1 and CP4 (M16_CP14_k, where k is the index) have the best quality while the quality of the multipliers M16_CP23_k is the lowest. In the multiplier M16 [14], the twelve leftmost partial product columns are compressed by exact compressors and the eight least significant partial product columns are truncated. As a result, our multiplier M16_CP14_12 and the M16 [14] show comparable quality. Our proposed multipliers outperform the other existing works in terms of quality (except for the M16_CPxy_4; x = 1, 2; y = 3, 4). In general, the multipliers M16_CP24_k have better accuracies than the multipliers M16_CP13_k (except for M16_CPxy_4; x = 1, 2; y = 3, 4). As shown in Table 6, our multipliers are classified into 4 groups. In each group, the highest quality multiplier is nearly 100× more precise than the lowest one (for MED and NMED). The accuracy increases with the number of most significant partial product columns (k) compressed by accurate compressors. Among the existing multipliers, the M16-Ca [23] demonstrates the best quality (except for the M16 [14]). Nevertheless, its accuracy is better only than those of our multipliers M16_CPxy_4 (x = 1, 2; y = 3, 4), and lower than those of all our remaining multipliers.
Since the accuracies of the approximate 32 × 32 multipliers are estimated by using the formulas in Section III, we first validate this estimation method. Fig. 15 plots the NMED errors between the simulated and estimated results of approximate 16 × 16 multipliers that are built from 8 × 8 multipliers. The actual NMEDs are bounded by the estimated ones. The errors between the estimated and simulated results are extremely small: at most 0.6%, for the multiplier M16 in [30]. For our multipliers, there is no error in NMED since the approximate products are always less than or equal to the exact results, which means that the equality in (6) is satisfied.
As the proposed 32 × 32 multipliers are built by using approximate 16 × 16 ones, their accuracies have the same trend as those of 16-bit multipliers. Besides, hybrid 32-bit multipliers are also built by using different 16-bit multipliers with different accuracies as shown at the bottom of Table 6 . The accuracies of the multipliers M32_CPxy_12_4 (x = 1, 2; y = 3, 4; see Table 5 ) are strongly dependent on those of the 16-bit multipliers used to build the higher part (as in Fig. 3 ).
2) SIGNED MULTIPLIERS
Booth encoding is one of the algorithms used to implement signed multipliers. It reduces the height of the partial product matrix at the expense of more complex partial product generation (encoding). Section V implemented unsigned approximate multipliers. Our proposed approach can be extended to signed multipliers by using the Baugh-Wooley algorithm [31], as shown in Fig. 16. In the diagram, the partial product matrix differs only slightly from that of the unsigned multiplier.
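To illustrate why the Baugh-Wooley partial product matrix differs only slightly from the unsigned one, the sketch below uses the standard modified Baugh-Wooley formulation: partial product bits that involve exactly one sign bit are complemented, and two constant 1s are added at columns n and 2n − 1. Whether Fig. 16 arranges the matrix in exactly this way is an assumption; the function name and structure are illustrative only, and the accumulation here is exact rather than approximate.

```python
def baugh_wooley_multiply(a, b, n=8):
    """Multiply two n-bit two's complement integers using a Baugh-Wooley partial product matrix."""
    ua = a & ((1 << n) - 1)  # raw n-bit pattern of a
    ub = b & ((1 << n) - 1)  # raw n-bit pattern of b
    acc = 0
    for i in range(n):
        for j in range(n):
            pp = ((ua >> i) & 1) & ((ub >> j) & 1)
            # Bits that involve exactly one of the two sign bits are complemented.
            if (i == n - 1) != (j == n - 1):
                pp ^= 1
            acc += pp << (i + j)
    # Constant correction bits at columns n and 2n - 1.
    acc += (1 << n) + (1 << (2 * n - 1))
    acc &= (1 << (2 * n)) - 1  # keep 2n result bits
    # Reinterpret the 2n-bit pattern as a two's complement value.
    if acc >> (2 * n - 1):
        acc -= 1 << (2 * n)
    return acc


if __name__ == "__main__":
    print(baugh_wooley_multiply(-3, 5, n=8))
```

Apart from the complemented bits and the two constants, every entry of the matrix is an ordinary AND of operand bits, so the same column compression structure used for the unsigned matrix can be applied.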
The approximate signed 8 × 8 and 16 × 16 multipliers are implemented by using the Baugh-Wooley algorithm. Their accuracy metrics are exhaustively analyzed and compared with prior works, as shown in Tables 7 and 8. Note that p is the number of least significant partial product columns that are approximated. Our proposed approximate signed 8 × 8 multipliers are named M8s_CPxy_k (x = 1; y = 3, 4; k = 2 to 6). The approximate signed 16 × 16 multipliers are named M16s_CPxy_k (x = 1, 2; y = 3, 4; k = 2 to 12).
As we can see in Table 7, the MREDs of the proposed approximate signed 8 × 8 multipliers M8s_CP13_k and M8s_CP14_k (k = 2 to 6) are lower than those of the multipliers R4ABM2 (except for M8s_CP13_6), and higher than those of the R4ABM1. The NMEDs of M8s_CP14_k (k = 2 to 6) are comparable to those of the R4ABM2, and the NMEDs of M8s_CP13_k (k = 2 to 6) are higher than those of the multipliers R4ABM2. However, our multipliers show better power savings, as demonstrated in Section VII-B.2 (case study). The multipliers M8s_CP14_k show significant advantages in accuracy compared to M8s_CP13_k (k = 2 to 6). For 16-bit operands, the multipliers M16s_CP14_k and M16s_CP24_k are more accurate than M16s_CP13_k and M16s_CP23_k, respectively (k = 2, 4, 6, 8, 10, 12). In terms of MRED, our signed 16-bit multipliers are more accurate than the signed multipliers R4ARBM1 and R4ARBM2 proposed by Liu et al. [33], as shown in Table 8.
Previous studies [21], [32], [33] assessed the quality of 32-bit multipliers by simulations with randomized input values, since it is impossible to cover all possible input combinations. This approach does not guarantee the quality of the developed approximate multipliers. Therefore, we build approximate signed 32-bit multipliers from smaller ones. However, it is complicated to develop such multipliers with 2's complement input operands. Instead, we use small-size approximate unsigned multipliers to implement large-size signed multipliers (with some modifications to Fig. 3), as shown in Fig. 17. The most significant bits (MSBs) of the operands are the sign bits. The remaining bits are encoded as unsigned values; that is, we do not use 2's complement values. For example, the signed 17-bit value −65535 (decimal) is encoded as 1_FFFF (hexadecimal), in which the MSB is the sign bit. For an N-bit number, the maximum negative value −2^(N−1) is rounded off to −2^(N−1) + 1, since we can only represent values within [−2^(N−1) + 1, +2^(N−1) − 1]. The input value encoding is performed at the application level. Thus, the signed multiplication is transformed into the multiplication of two unsigned values. The sign of the final product is determined by an XOR operation on the signs of the input operands. The final result (P_f) is obtained by inverting the unsigned product (P) if the sign bit (S) is 1, which is given as (30).
FIGURE 17. An implementation of an approximate signed N-bit multiplier by using smaller ones. Operands A and B are encoded as unsigned values (e.g., the signed 17-bit value −65535 (decimal) is encoded as 1_FFFF (hexadecimal); the MSB is the sign bit). The final product can be kept in the format of operands A and B or in 2's complement, which is controlled by the signal 'mode'.
Compared to the approximate unsigned multipliers, the approximate signed ones produce one more error. In addition, an error also occurs for the maximum negative value. However, these errors are very small. With this implementation, the accuracy metrics (e.g., MED, MRED, and NMED) of the approximate signed multipliers are nearly equal to those of their unsigned counterparts. That is, the accuracy of a signed 33-bit multiplier is nearly equal to that of the unsigned 32-bit multiplier. We take signed 17-bit multiplication as an example to validate our approach, as shown in Table 9. The differences between the qualities of signed 17-bit and unsigned 16-bit multiplications are very small, on the order of 10^−7. The area overhead of the approximate signed multipliers is caused by a sign bit detector (an XOR gate) and 63 multiplexers (2-to-1), which together occupy approximately 33 LUTs. Compared to the area of the 32-bit multipliers, this overhead is negligible.
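The sign-magnitude scheme of Fig. 17 can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware implementation: the function names are hypothetical, the unsigned core defaults to an exact product in place of an approximate multiplier, and "inverting" the product is read here as a bitwise inversion (which differs from an exact 2's complement negation by one LSB, in line with the small additional error mentioned above).

```python
def encode_sign_magnitude(value, n_bits):
    """Return the n_bits-wide word: MSB is the sign bit, remaining bits hold the unsigned magnitude."""
    max_mag = (1 << (n_bits - 1)) - 1
    magnitude = min(abs(value), max_mag)  # -2^(N-1) is rounded off to -2^(N-1) + 1
    sign = 1 if value < 0 else 0
    return (sign << (n_bits - 1)) | magnitude


def signed_multiply(word_a, word_b, n_bits, unsigned_mul=lambda x, y: x * y, mode=0):
    """Multiply two sign-magnitude words using an unsigned multiplier core."""
    mag_mask = (1 << (n_bits - 1)) - 1
    sign = (word_a >> (n_bits - 1)) ^ (word_b >> (n_bits - 1))  # XOR of the operand signs
    product = unsigned_mul(word_a & mag_mask, word_b & mag_mask)
    if mode == 0:
        # Keep the operand format: sign bit placed above the 2*(n_bits-1)-bit magnitude.
        return (sign << (2 * (n_bits - 1))) | product
    # mode 1, ASSUMPTION: a negative result is formed by bitwise inversion of the product.
    return ~product if sign else product


if __name__ == "__main__":
    a = encode_sign_magnitude(-65535, 17)
    print(hex(a))  # 0x1ffff, matching the encoding example in the text
    b = encode_sign_magnitude(5, 17)
    print(signed_multiply(a, b, 17, mode=1))
```

Any unsigned multiplier model, exact or approximate, can be passed as `unsigned_mul`, which mirrors the idea that the signed design reuses the unsigned core unchanged.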
B. DELAY, POWER AND AREA RESULTS
To evaluate the efficiency of our designs, the proposed multipliers are implemented with different accuracies by varying the number of leftmost partial product columns compressed by exact compressors, full adders, and half adders. These proposed multipliers are compared with the exact LUT-based multipliers, the DSP-based multipliers, and the prior works in terms of delay, area, power consumption, PDP, and PDAP. Our multipliers and the prior works are implemented by using Verilog-HDL and evaluated on a Xilinx Spartan-6 FPGA (XC6SLX9-2TQG144).
For the power consumption measurements, these circuits operate at a relaxed frequency of 100 MHz (for 8 × 8 and 16 × 16 multipliers) and 50 MHz (for 32 × 32 multipliers). Fig. 18 shows the setup model, where a high-precision multimeter is used to measure the current drawn by the multipliers, which is more precise and faster than power estimation using the PowerAnalyzer in the ISE software. The dynamic power dissipation of the 8 × 8 and 16 × 16 multipliers is evaluated with all possible combinations of input data. It is impossible to cover all cases of 32-bit input data for 32 × 32 multipliers, so their dynamic power dissipation is measured by feeding random input patterns (generated by a pseudo-random number generator) [34].
Table 10 summarizes the comparison between our 8 × 8 multipliers and the prior works in terms of electrical performances (i.e., delay, power, and area), PDP, and PDAP. The electrical performances of the proposed multipliers using approximate 4-2 compressors CP1 and CP3 are better than those of the multipliers built with the approximate 4-2 compressors CP1 and CP4. Our proposed 8 × 8 multipliers outperform the prior works, the exact LUT-based multiplier, and the DSP-based multiplier. Compared to the DSP-based multiplier, the proposed multipliers M8_CP13_2 and M8_CP14_2 achieve delay improvements of 40.8% and 37.1%, respectively. Compared to the exact LUT-based multiplier, the multipliers M8_CP13_2 and M8_CP14_2 have shorter delays, occupy smaller areas (LUTs and CARRY4s), and consume less dynamic power. They save (28.2%, 23.7%) in delay, (54.3%, 47.8%) in area, and (67.8%, 64.4%) in dynamic power dissipation, respectively. They offer PDP gains of up to (4.3 × and 3.7 ×) and PDAP gains of up to (9.5 × and 7.1 ×), respectively, with regard to the exact multiplier. The M8-Ca and M8-Cc in [23] perform worse than the exact multiplier. The reason is that these multipliers were built from customized components that cannot be optimized by the synthesis tools. Fig. 19 plots the trade-offs between the accuracies and PDAPs of the presented approximate multipliers. Our proposed multipliers offer a good range of PDAP benefits at different levels of accuracy, bounded by the two ends (M8_CP14_6 and M8_CP13_2). This enables a search space exploration to meet user-defined constraints on both power dissipation and accuracy.
The comparison of electrical performances, PDPs, and PDAPs between our approximate 16 × 16 multipliers and previous works is summarized in Table 11. All four approximate 4-2 compressors (CP1 to CP4) are deployed to develop four classes of approximate multipliers. Among them, the multipliers M16_CP23_k deliver better performances than our other multipliers. The multiplier M16_CP23_4 dominates the M16_CP14_4 in terms of electrical performances. Generally, the multipliers M16_CP24_k deliver better electrical performances than the multipliers M16_CP13_k. Compared to the exact LUT-based multiplier, the M16_CP23_4 saves 27.9% in delay, 70.7% in dynamic power dissipation, and 43.2% in area. In particular, it improves the PDP and PDAP by 4.7 × and 8.3 ×, respectively, with respect to the exact LUT-based multiplier. The prior works show worse PDAPs than the exact LUT-based multiplier, except for the 16-bit multipliers in [17] and [14]. Even though the multipliers M16 in [17] and [14] are better than the exact LUT-based multiplier, their electrical performances are not better than those of our highest-accuracy multiplier M16_CP14_12. Compared to the DSP-based multiplier, our multipliers show better electrical performances (delay, power consumption, and PDP), except for the multipliers M16_CP14_k (k = 4 to 12), M16_CP13_12, and M16_CP24_12. In terms of delay (latency), the proposed multipliers M16_CPxy_k (x = 1, 2; y = 3, 4; k = 4 to 10) are better than the DSP-based multiplier. The multiplier M16_CP23_4 improves the latency by up to 16%. Similar to the 8-bit case, Fig. 20 depicts the trade-offs between the accuracies and PDAPs of our proposed 16-bit multipliers as well as the prior works.
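The reported PDP and PDAP gains can be cross-checked against the individual savings. The sketch below assumes that PDP is the product of power and delay, that PDAP is the product of power, delay, and area, and that a gain is the ratio of the exact multiplier's figure to the approximate multiplier's figure; the helper name is illustrative only.

```python
def gains_from_savings(delay_saving, power_saving, area_saving):
    """Convert fractional savings (relative to the exact multiplier) into PDP and PDAP gains."""
    delay_ratio = 1.0 - delay_saving  # approximate delay / exact delay
    power_ratio = 1.0 - power_saving  # approximate power / exact power
    area_ratio = 1.0 - area_saving    # approximate area / exact area
    pdp_gain = 1.0 / (power_ratio * delay_ratio)
    pdap_gain = 1.0 / (power_ratio * delay_ratio * area_ratio)
    return pdp_gain, pdap_gain


if __name__ == "__main__":
    # Savings quoted in the text, relative to the exact LUT-based multiplier.
    designs = {
        "M8_CP13_2": (0.282, 0.678, 0.543),
        "M8_CP14_2": (0.237, 0.644, 0.478),
        "M16_CP23_4": (0.279, 0.707, 0.432),
    }
    for name, (d, p, a) in designs.items():
        pdp, pdap = gains_from_savings(d, p, a)
        print(f"{name}: PDP gain = {pdp:.1f}x, PDAP gain = {pdap:.1f}x")
```

Under these assumptions, the computed values round to the gains stated in the text (4.3 × and 9.5 × for M8_CP13_2, 3.7 × and 7.1 × for M8_CP14_2, and 4.7 × and 8.3 × for M16_CP23_4).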
As explained above, our 32 × 32 multipliers are built by using approximate 16 × 16 multipliers. Their electrical performances are summarized in Table 12. Compared to the prior works and the exact LUT-based multiplier, all of our proposed multipliers have shorter delays. They consume less dynamic power, except for the multipliers M32_CPxy_12 (x = 1, 2; y = 3, 4). They require fewer hardware resources than the prior works and the exact LUT-based multiplier, except for the M32_CP14_12 (which is slightly larger than the M32 in [30]). In terms of PDAP, most of our multipliers are better than the existing works and the exact LUT-based multiplier (except for M32_CP14_12 and M32_CP24_12). The multipliers M32_CP23_4 and M32_CP23_12_4 offer good PDP and PDAP benefits. They improve PDPs and PDAPs by up to (2.9 ×, 2.0 ×) and (5.0 ×, 3.2 ×), respectively, compared to the exact LUT-based multiplier. Our multipliers M32_CP13_4, M32_CP24_4, M32_CP23_4, M32_CP23_12_4, and M32_CP24_12_4 demonstrate better performances than the DSP-based multiplier. In particular, the delays of all the proposed multipliers are almost half of the delay of the DSP-based multiplier. Note that the 32-bit DSP-based multiplier is built from 18-bit DSP-based multipliers, since the FPGA used only supports the 18-bit multiplier. The procedure to build a large-size multiplier from smaller ones is described in Fig. 3.
Liu et al. [33] stated that large-size approximate multipliers can be used in applications demanding high dynamic range computation. As an example, 32-bit approximate multipliers were applied to high dynamic range (HDR) image processing [33]. They can also be applied to machine learning applications, as reported in [35].
VII. IMAGE PROCESSING APPLICATIONS: CASE STUDIES
In this section, two image processing applications, image sharpening and image multiplication, are developed by using the proposed approximate multipliers. The dynamic power dissipation of the approximate multipliers and the quality of the output images are measured and assessed.
A. IMAGE SHARPENING
We apply the proposed approximate multipliers to implement image sharpening, which has been widely used to evaluate the effectiveness and quality of approximate multipliers. The sharpened image is obtained by (31). In this design example, an 8-bit image is convolved with the filter kernel M as described in (32).
The 7 × 7 mask M [36] is given in (32).
Peak signal-to-noise ratio (PSNR) is one of the metrics used to assess the quality of approximate images. Additionally, the structural similarity index metric (SSIM), which is consistent with the human perception of an image, is used to assess the quality of the processed image. This metric measures the structural similarity between the exact and approximate images.
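As a reminder of how PSNR is obtained, the sketch below uses the standard definition, PSNR = 10 log10(MAX^2 / MSE), where MSE is the mean squared error between the exact and approximate images and MAX is the largest representable pixel value. Images are modeled here as nested lists of integers purely for illustration; the function name and data layout are assumptions and not the evaluation flow used for the reported results.

```python
import math


def psnr(reference, test, max_value=255):
    """PSNR in dB between two equally sized images given as nested lists of pixel values."""
    squared_error = 0
    pixel_count = 0
    for ref_row, test_row in zip(reference, test):
        for ref_pixel, test_pixel in zip(ref_row, test_row):
            squared_error += (ref_pixel - test_pixel) ** 2
            pixel_count += 1
    if squared_error == 0:
        return math.inf  # identical images: PSNR is unbounded
    mse = squared_error / pixel_count
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    exact = [[10, 20], [30, 40]]
    approx = [[10, 20], [30, 41]]
    print(f"{psnr(exact, approx):.2f} dB")
```

For 16-bit images, `max_value` would be set to 65535 instead of 255.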
All the designs are implemented in Verilog-HDL on the Xilinx Spartan-6 FPGA (XC6SLX9-2TQG144). The FPGA is powered with a supply voltage of 1.2 V and runs at a clock frequency of 100 MHz to measure the dynamic power consumption of the multipliers. Table 13 shows the comparison between the proposed approximate multipliers and the prior works in terms of quality (PSNR and SSIM) and power savings. As we can see, the prior works show limited benefits in power dissipation compared to the exact LUT-based multiplier, and some of them consume more power than the exact LUT-based multiplier. These results confirm that approximation methods that are successfully applied to ASIC-based designs present a limited benefit (4.2%) or even no benefit (−48.3%) when directly applied to FPGA-based designs. Our proposed design M8_CP13_6 shows a good PSNR (45.87 dB) and SSIM (0.9987), and offers a power saving of 18.3% with respect to the exact LUT-based multiplier. The design M8_CP13_2 shows a high quality with a PSNR of 39.1 dB and an SSIM of 0.9978, together with a high power saving of 36.7%. Compared to the proposed multipliers using the 4-2 compressor CP3, the multipliers employing the 4-2 compressor CP4 offer smaller power savings but higher quality. Our multipliers M8_CP13_k (k = 2 to 5) offer better power savings than the exact DSP-based multiplier. Though the other proposed multipliers are not better than the DSP-based multiplier in terms of power consumption, they deliver good latencies. Fig. 21 shows some of the output images, their qualities (PSNR, SSIM), and the percentages of power savings.
B. IMAGE MULTIPLICATION
1) UNSIGNED MULTIPLIERS
In this design example, an image multiplication is developed by using the proposed approximate 16 × 16 multipliers. Two 16-bit images are multiplied with each other pixel by pixel, which blends the two images into a single output image. The block diagram of the image multiplication is illustrated in Fig. 22. Note that the 16-bit Lena image is converted from the original 8-bit image by using Matlab. Input images are read into input registers and then processed by the approximate multipliers. The output values are then stored in output registers. The metrics PSNR and SSIM are used to evaluate the quality of the output images.
Table 14 shows the comparison between the proposed approximate multipliers and the prior works in terms of quality (PSNR and SSIM) and power savings. The previous works offer a limited benefit in power saving (9.85% for M16 [30]) or even no power benefit (−38.46% for M16-Ca [23]). Our multipliers provide the best quality, with a PSNR of up to 80.25 dB and an SSIM of up to 1.0. The qualities of the proposed multipliers M16_CP23_k (k = 4 to 12) are lower than those of the multipliers M16_CP14_k, but they save more dynamic power. Their PSNRs are greater than 40 dB, which implies that the quality of the output images is good enough for human perception, as demonstrated in Fig. 23. Compared to the exact LUT-based multiplier, the M16_CP23_4 offers an impressive power saving of up to 58.15%. Compared to the exact DSP-based multiplier, our multipliers M16_CP24_4, M16_CP23_4, and M16_CP23_6 offer comparable power savings and better latencies. The other proposed multipliers can be used when the DSP-based multipliers are exhausted by an application, as reported in [22], or when the timing constraint of the design is critical.
2) SIGNED MULTIPLIERS
[...] R4ABM1 and R4ABM2 are higher than those reported by Liu et al. [32], since the source image used in the experiments may be different. The PSNRs of the output images processed by our approximate signed multipliers M8s_CP14_2 and M8s_CP13_2 are lower than those of the R4ABM1 and R4ABM2 (p = 12). However, the SSIMs of our output images are higher than those of the images processed by the multipliers R4ABM1 and R4ABM2 (p = 12). Importantly, our approximate signed multipliers offer much higher power savings. The approximate multipliers R4ABM1 and R4ABM2 in [32] consume high power because they still use exact compressors to accumulate partial products. Furthermore, our proposed multipliers show higher benefits in dynamic power dissipation compared to the DSP multiplier (except for M8s_CP1y_6 (y = 3, 4)). Fig. 24 shows the output images processed by our multipliers.

VIII. CONCLUSION
In this work, we proposed approximate 8 × 8, 16 × 16, and 32 × 32 multipliers with different accuracies, which were efficiently implemented on FPGAs by using the proposed approximate compressors.
Our proposed approximate multipliers outperform the prior works and the exact LUT-based multiplier in terms of electrical performances (delay, power, area), PDPs, and PDAPs. We also presented the trade-offs between the accuracies and PDAPs, which guide designers in selecting an appropriate multiplier to meet the design requirements. The expressions for the accuracy analysis of large operand size multipliers were also formulated. In addition, our proposed multipliers outperform the DSP-based multipliers in terms of delay (latency). One more disadvantage of the DSP-based multiplier is that it still uses 4 DSP primitives for arbitrary operand sizes from 19 bits to 32 bits (which causes higher routing delays, as shown for the 32-bit multiplier in Table 12), while our proposed method can optimize these multipliers more easily.
Our proposed approximate multipliers were evaluated at both the circuit and application levels to demonstrate their effectiveness and applicability. Two image processing applications, image sharpening and image multiplication, were implemented to measure the dynamic power savings. The quality metrics (PSNR, SSIM) of the output images were also evaluated. To the best of our knowledge, this is the first work to implement approximate multipliers and measure their dynamic power consumption on an FPGA board. Finally, our proposed multipliers are suitable for high-performance and low-power error-resilient applications.
