Abstract-Approximate computing has received significant attention as a promising strategy to decrease power consumption of inherently error-tolerant applications. In this paper, we focus on hardware-level approximation by introducing the partial product perforation technique for designing approximate multiplication circuits. We prove in a mathematically rigorous manner that in partial product perforation, the imposed errors are bounded and predictable, depending only on the input distribution. Through extensive experimental evaluation, we apply the partial product perforation method on different multiplier architectures and expose the optimal architecture-perforation configuration pairs for different error constraints. We show that, compared with the respective exact design, the partial product perforation delivers reductions of up to 50% in power consumption, 45% in area, and 35% in critical delay. In addition, the product perforation method is compared with the state-of-the-art approximation techniques, i.e., truncation, voltage overscaling, and logic approximation, showing that it outperforms them in terms of power dissipation and error.
Index Terms-Approximate arithmetic circuits, approximate computing, approximate multiplier, error analysis, low power.
I. INTRODUCTION

IN MODERN embedded electronic devices, power consumption is a first-class design concern. Considering that a large number of application domains are inherently tolerant to imprecise calculations, e.g., digital signal processing (DSP), data analytics, and data mining [1], approximate computing appears as a promising solution to reduce their power dissipation. Such applications process large redundant data sets or noisy input data derived from the real world, do not have a golden result, perform statistical/probabilistic computations, and/or demand human interaction; thus, their exactness is relaxed due to limited human perception [2], [3]. Approximate computing can be applied at both software and hardware levels.
Hardware-level approximation mainly targets arithmetic units, such as adders and multipliers, widely used in portable devices to implement multimedia algorithms, e.g., image and video processing. The most commonly used techniques for the generation of approximate arithmetic circuits are truncation [4] , [5] , voltage overscaling (VOS) [2] , [6] , and simplification of logic complexity (i.e., alteration of the truth table) [7] - [9] . Extensive research has been conducted on approximate adders [6] , [7] , [10] , [11] , providing significant gains in terms of area and power while exposing small error. However, research activities on approximate multipliers are limited. Efficient approximate multipliers introduced in [8] , [9] , [12] , and [13] target the approximation of the partial product accumulation but do not examine approximations on the partial product generation.
Approximate hardware circuits, contrary to software approximations, offer transistor reduction, lower dynamic and leakage power, lower circuit delay, and opportunity for downsizing. Motivated by the limited research on approximate multipliers, compared with the extensive research on approximate adders, and explicitly the lack of approximate techniques targeting the partial product generation, we introduce the partial product perforation method for creating approximate multipliers. Inspired by [14], we omit the generation of some partial products. This reduces the number of partial products that have to be accumulated and, in turn, decreases the area, power, and depth of the accumulation tree. The major contributions of this paper are summarized as follows.
1) We adopt and apply, for the first time, the software-based perforation technique [14] on the design of hardware circuits, obtaining the optimized design solutions regarding the power-area-error tradeoffs.
2) We analyze in a mathematically rigorous manner the arithmetic accuracy of partial product perforation and prove that it delivers a bounded and predictable output error. Our error analysis is not bound to a specific multiplier architecture and can be applied with error guarantees to every multiplication circuit regardless of its architecture. Such a rigorous analysis enables precise error estimation over input data distributions.
3) We explore and characterize the efficiency of the product perforation method on several multiplier schemes, exposing its power-area impact on different architectures. This is the first time that such an exploratory analysis over different approximate multiplier architectures is offered to the designer, enabling also the selection of the optimum architecture-perforation configuration for given error constraints.
4) We show that the partial product perforation outperforms the related state-of-the-art works in terms of power consumption and error, as well as output quality, when applied to image processing and data analytics algorithms.

More specifically, we apply the partial product perforation on 16 different multiplier architectures using industrial-strength tools, i.e., Synopsys Design Compiler and PrimeTime. Through extensive experimental evaluation, we present the optimal approximate multiplier configurations for various error constraints. We show that, compared with the accurate multiplier, the product perforation offers reductions of up to 50% in power consumption, 45% in area, and 35% in critical delay for 0.1% normalized mean error distance (NMED) [15].
Moreover, it is compared with the state-of-the-art approximate computing works that use either VOS [6] , logic approximation [9] , or truncation [4] , outperforming them significantly in terms of power dissipation and error. Finally, we examine the scalability of our technique by applying it on different bit-width multipliers and show that the delivered savings increase with the width increase.
The rest of this paper is organized as follows. In Section II, we discuss the related literature with an emphasis on circuit-level approximation. Section III introduces the partial product perforation technique, providing the corresponding error analysis and error correction methods. In Section IV, we examine the product perforation on different multiplier architectures, exposing the optimal architecture-perforation configuration pairs under differing error constraints. Section V evaluates the product perforation method by comparing it with the related state-of-the-art works. Finally, the conclusion is drawn in Section VI.
II. RELATED WORK
In this section, the related research in the field of hardware approximate computing is discussed. Both general-purpose approximation techniques [4], [6], [16] applied to any arithmetic circuit and circuit-specific approximation either to adder [7], [10], [11] or multiplier designs [8], [9], [13], [17], [18] have been presented.
Regarding the general approximation techniques, VOS [2], [6] and truncation [4], [5], [12] have been proposed. VOS is applied in any circuit by lowering the supply voltage below its nominal value. Decreasing the supply voltage reduces the circuit's power consumption, but produces errors caused by the number of paths that fail to meet the delay constraints [2]. Banescu et al. [12] proposed an automated generation of large precision floating-point multipliers in field-programmable gate arrays using sophisticated truncation over underutilized DSPs. In [5], a truncated multiplier with a constant correction term is proposed, significantly decreasing the error imposed by typical truncation. King and Swartzlander [4] proposed a truncated multiplier with variable correction that outperforms [5] in terms of error. Probabilistic pruning and logic minimization techniques have been presented in [16] using a greedy approach to generate approximate circuits. These techniques systematically eliminate circuit components and simplify logic complexity according to the circuit's activity profile and output significance. Both techniques heavily depend on the application's characteristics, and in addition, the induced approximation error is not rigorously bounded.
Extensive research has been conducted targeting the implementation of approximate adders [7], [10], [11]. Verma et al. [11] developed a probability proof, estimating that the longest carry chain in an n-bit adder is log n, and produced a fast inexact adder limiting the carry propagation. In [10], approximation is performed by decomposing the addition circuit into an accurate part and an approximate part. Gupta et al. [7] build imprecise full adder cells, requiring fewer transistors, by approximating their logic function and then use them to build imprecise adders. Although it is proposed to use such adders to build approximate multipliers, it is not clear how they can be used in different tree architectures and how their error scales in the case of multioperand addition. Targeting the creation of approximate multipliers, Kulkarni et al. [8] proposed a simplified imprecise 2 × 2 multiplier cell used as the basic block for constructing larger multiplier architectures. Momeni et al. [9] presented two approximate 4:2 compressors by modifying the respective accurate truth table, which were then used to build two approximate multipliers outperforming [8]. The approximate compressors of [9] are used in a Dadda tree with 4:2 reduction. However, different multiplier architectures were not explored. Based on an approximate adder that limits the carry propagation, Liu et al. [13] presented a fast and low-power multiplier scheme with higher error than [9]. However, in all the aforementioned approaches, the imposed error cannot be predicted, as it depends on carry propagation and the circuits' implementation, and requires simulations over all possible inputs in order to be calculated.
Recently, Narayanamoorthy et al. [17] and Hashemi et al. [18] proposed the use of m × m multipliers to perform an n × n multiplication (with m < n). Narayanamoorthy et al. [17] statically split the multiplicand in three m-bit segments and perform the multiplication utilizing the segment containing the most significant 1 (leading one). However, as stated in [18], m needs to be at least n/2 to attain acceptable accuracy, thus limiting the energy savings and the scalability of this approach. Hashemi et al. [18] extended the idea of leading-one segments to enable dynamic range multiplication and added a correction term. Although [18] delivers higher accuracy designs than [17] using smaller values for m, its approach requires the allocation of extra complex circuitry, i.e., two leading-one detectors, two complex multiplexers for segment selection, one log(n)-bit comparator, a log(n)-bit adder, and one 2n-bit barrel shifter. These extra components are expected to highly increase the circuit's complexity, introducing nontrivial delay, area, and energy overheads that may considerably decrease the approximation benefits [17]. This is expected to be more evident in designs targeting very small error values, for which larger m values are required.
In this paper, we target the design of power-error efficient multiplication circuits. We differ from the previous works by exploring approximation on the generation of the partial products. The proposed method can be easily applied in any multiplier architecture without the need for a special design, in contrast to related works. In addition, the error imposed by perforation depends only on the configuration parameters and, in contrast to existing work, can be analytically calculated without the need for exhaustive simulations. The latter is critical, as, given the application's inputs, a precise estimation of the output quality can be extracted. Finally, the knowledge of the induced error permits the selection of the configuration that maximizes the power savings for a specific error bound.
III. ANALYZING PARTIAL PRODUCT PERFORATION
A. Method Analysis
In this section, the partial product perforation method for the design of approximate hardware multipliers is described. Consider two n-bit numbers A and B. The result of their multiplication A × B is obtained after summing all the partial products $A b_i$, where $b_i$ is the $i$th bit of B. Thus

$$A \times B = \sum_{i=0}^{n-1} A\, b_i\, 2^i, \quad b_i \in \{0, 1\}.$$
The partial product perforation technique omits the generation of k successive partial products starting from the $j$th one. A perforated partial product is not inserted in the accumulation tree, and hence n full adders can be eliminated. Applying the product perforation with j and k configuration values on the multiplication A × B produces the approximate result

$$P' = \sum_{\substack{i=0 \\ i \notin [j,\, j+k)}}^{n-1} A\, b_i\, 2^i.$$
Note that
Similarly, when modified Booth encoding (MBE) [19] is used for generating the partial products, the result of the approximate multiplication is given by the corresponding sum over the retained MBE partial products.

Fig. 1 shows an example of applying the partial product perforation method on different 8-bit multipliers with j = 2 and k = 2 configuration values. For each architecture, the dot diagrams [19] of the accurate and the respective perforated tree are presented. The dots represent the bits of the partial products that have to be accumulated, while the stages represent the delay of the reduction process followed by each tree. The dashed boxes with four dots are 4:2 compressors, those with three are full adders, and those with two are either full- or half-adders. The proposed approximation technique decreases the power, area, and delay of the multiplication circuit at the cost of an imprecise computation. The higher the order of a perforated partial product, the greater the error imposed on the final result. In addition, since addition is an associative and commutative operation, when more than one partial product is perforated, the total error is the sum of the errors produced by perforating each partial product separately.
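The SPP case can be mimicked in software to make the definition concrete. The following Python sketch is illustrative only: the function names `perforated_mul` and `error_distance` are hypothetical, and the code models the arithmetic result of perforation for unsigned operands, not the circuit itself. It clears the k bits of B whose partial products are skipped, which is equivalent to leaving those partial products out of the sum.

```python
def perforated_mul(a: int, b: int, j: int, k: int) -> int:
    """Approximate a*b by skipping partial products j .. j+k-1 of b (SPP model)."""
    # Bits of b whose partial products are never generated.
    mask = ((1 << k) - 1) << j
    # Clearing those bits removes exactly the terms a*b_i*2^i for i in [j, j+k).
    return a * (b & ~mask)


def error_distance(a: int, b: int, j: int, k: int) -> int:
    """Distance between the exact product and the perforated one."""
    return a * b - perforated_mul(a, b, j, k)


if __name__ == "__main__":
    a, b = 200, 0b10110110  # arbitrary 8-bit operands
    print(perforated_mul(a, b, j=2, k=2), error_distance(a, b, j=2, k=2))
    # Additivity: perforating rows 2 and 3 together equals the sum of
    # perforating row 2 alone and row 3 alone.
    print(error_distance(a, b, 2, 2)
          == error_distance(a, b, 2, 1) + error_distance(a, b, 3, 1))
```

In this model the error is never negative, because only non-negative terms are dropped from the sum.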
We use the notation D[j,k,c] to label the different approximate multiplier architectural configurations. The parameter D refers to the tree architecture, j is the order of the first perforated partial product, and k is the number of the perforated partial products. If no j and k are specified, the respective notation refers to the exact design. Finally, c corresponds to the partial product generation technique and takes the value s for simple partial products (SPPs) or m for MBE. For example, Dadda4:2[1,5,s], a configuration examined in Section V, denotes a Dadda 4:2 tree with SPP generation and perforation parameters j = 1 and k = 5.

The partial product perforation should not be confused with the truncation technique. Truncation eliminates the circuit that produces specific least significant bits (LSBs) of the accumulation tree, while the perforation skips the generation of partial products and thus decreases the number of operands to be accumulated. For example, in an 8-bit array multiplier, perforating a partial product removes eight full adders from the accumulation tree and reduces its delay. In order to attain similar circuit reduction using truncation, 6 LSB have to be truncated. However, truncating 6 LSB does not offer any delay reduction. Moreover, in this example, the truncation delivers, in all the cases, incorrect results, whereas the outputs of perforation are 50% correct. Finally, perforating one partial product (out of eight) results in a 12.5% loss of information, while truncating 6 LSB (out of 16) results in a 37.5% information loss. In Section V, the perforation and truncation techniques are quantitatively compared in greater detail regarding error and power metrics, in order to further expose their differences.
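For the MBE option (c = m in the D[j,k,c] notation), perforation can be sketched in the same spirit. The Python code below is a hedged illustration, not the paper's formulation: it assumes standard radix-4 Booth recoding of an unsigned n-bit operand (n even, zero-extended, giving n/2 + 1 signed digits), and it assumes that j and k index Booth digits rather than bits. The helper names `booth_digits` and `perforated_mul_mbe` are hypothetical.

```python
def booth_digits(b: int, n: int) -> list[int]:
    """Radix-4 Booth digits of an unsigned n-bit integer b (n even), lowest first.

    Digit i is b[2i-1] + b[2i] - 2*b[2i+1], with b[-1] = 0 and zero extension
    above bit n-1, so each digit lies in {-2, -1, 0, 1, 2}.
    """
    def bit(i: int) -> int:
        return (b >> i) & 1 if i >= 0 else 0

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(n // 2 + 1)]


def perforated_mul_mbe(a: int, b: int, n: int, j: int, k: int) -> int:
    """Approximate a*b by skipping Booth partial products j .. j+k-1."""
    return sum(a * d * 4 ** i
               for i, d in enumerate(booth_digits(b, n))
               if not (j <= i < j + k))


if __name__ == "__main__":
    a, b, n = 200, 0b10110110, 8
    print(booth_digits(b, n))
    print(a * b, perforated_mul_mbe(a, b, n, j=1, k=2))
```

Because Booth digits are signed, the perforated result in this model can be larger or smaller than the exact product, unlike the SPP case.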
B. Error Analysis
A critical issue for approximate computing is the error imposed during computations and how it affects the final result. In this section, an error evaluation analysis of the partial product perforation technique is presented. We evaluate the error metrics proposed in [15], i.e., the error distance (ED), the mean ED (MED), and the NMED, as effective metrics for quantifying the accuracy of approximate arithmetic circuits. ED is defined as the absolute distance between the fully accurate product $P$ and the approximate one $P'$, $ED = |P - P'|$. The MED is the average of EDs for all inputs, and $NMED = MED/P_{max}$, where $P_{max} = (2^n - 1)^2$ in the case of an n-bit multiplier [13]. The relative error distance (RED) is defined as $RED = ED/P$, and the mean RED (MRED) is similarly obtained [13].
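For small bit-widths, these metrics can be evaluated by exhaustive enumeration, which is a useful cross-check on any analytical expression. The Python sketch below is an illustration under stated assumptions: it uses a hypothetical software model of SPP perforation (`perforated_mul`), assumes uniformly distributed unsigned inputs, and skips input pairs with a zero exact product when averaging RED, since RED is undefined there. None of these names come from the source.

```python
from itertools import product


def perforated_mul(a: int, b: int, j: int, k: int) -> int:
    """Software model of SPP perforation: drop partial products j .. j+k-1 of b."""
    mask = ((1 << k) - 1) << j
    return a * (b & ~mask)


def error_metrics(n: int, j: int, k: int) -> dict[str, float]:
    """Exhaustive MED, NMED, and MRED over all pairs of unsigned n-bit inputs."""
    p_max = (2 ** n - 1) ** 2
    eds, reds = [], []
    for a, b in product(range(2 ** n), repeat=2):
        exact = a * b
        ed = abs(exact - perforated_mul(a, b, j, k))
        eds.append(ed)
        if exact != 0:  # RED = ED / P is undefined for P = 0 (assumption: skip)
            reds.append(ed / exact)
    med = sum(eds) / len(eds)
    return {"MED": med, "NMED": med / p_max, "MRED": sum(reds) / len(reds)}


if __name__ == "__main__":
    print(error_metrics(n=8, j=1, k=2))
```

The cost grows as 4^n, so this approach is practical only for small n; for 16-bit operands a sampled estimate or a closed-form expression is preferable.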
1) Error Evaluation:
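When applying the product perforation on an n-bit multiplier using SPP generation, the ED of multiplying two numbers A and B is calculated as follows:

$$ED = \sum_{i=j}^{j+k-1} A\, b_i\, 2^i = A\, x_B\, 2^j \quad (4)$$

where $x_B \in [0, 2^k)$ and

$$x_B = \sum_{i=j}^{j+k-1} b_i\, 2^{i-j}.$$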
If p A and p B are the probability density functions (PDFs) of A and B, respectively, then the MED is calculated from
Without loss of generality, the rest of our analysis considers a uniform distribution over the overall n-bit numbers, i.e., (A, B)
Assuming that ED A is the sum of EDs ∀B for a given A, we have
and the sum of all EDs is
Using (9), (7) equals
Thus
Similarly
and
The previous analysis provides rigorous expressions of error metrics, enabling a fast error analysis of differing product perforation configurations. As shown in Section IV, these analytical error expressions are used in an exploration loop for deriving optimized approximate design solutions. The analytical equations (11) and (13) consider a uniform distribution; thus, in the case of differing distributions, they should be adjusted according to the new PDFs, since the power-error efficiency of approximate designs highly depends on the distribution of the multiplier's operands. In most applications, e.g., multimedia, the inputs are highly correlated [16]. As an intuitive example, Fig. 2(a) shows the power-NMED Pareto graph for a 16-bit Dadda 4:2 multiplier when A and B follow the uniform distribution over the full range of n-bit numbers, while Fig. 2(b) shows the same graph with inputs derived from the GSM 06.10 audio benchmark [20]. As shown, increasing the k-values results in lower power consumption but increased error values, while the selection of the j-value mostly depends on the input distribution. Intuitively, for a uniform distribution over all possible n-bit numbers [Fig. 2(a)], where all the bits have equal probability of being one or zero, j should be kept small to minimize the error. This is also confirmed by Fig. 2(a), where 58% of the Pareto configurations feature j = 0 and 42% feature j = 1. However, as shown in Fig. 2(b), when the inputs are correlated and do not follow a uniform distribution, the Pareto front is formed by configurations featuring many different j values, i.e., 0, 2, 6, and 15. This example shows that there is no golden value for j and k; their selection highly depends on the error constraints and the inputs' PDF.
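For the uniform case without error correction, closed forms for the MED and NMED follow directly from the definition of the error distance. The derivation below is a reconstruction from the definitions in this section, offered as a sketch; it is not a verbatim copy of the paper's numbered equations. It assumes SPP generation, independent operands A and B uniform over $[0, 2^n)$, and $j + k \le n$, and it writes $x_B$ for the value of the k perforated bits of B.

```latex
\begin{align*}
x_B &= \sum_{i=j}^{j+k-1} b_i\, 2^{i-j}, \qquad ED(A,B) = A\, x_B\, 2^j, \\
\sum_{B=0}^{2^n-1} x_B &= 2^{n-k} \sum_{x=0}^{2^k-1} x = 2^{n-1}\,(2^k - 1), \\
\sum_{A,B} ED(A,B) &= 2^j \Big( \sum_{A=0}^{2^n-1} A \Big)\, 2^{n-1}\,(2^k - 1)
                    = 2^{2n+j-2}\,(2^n - 1)(2^k - 1), \\
MED &= \frac{1}{2^{2n}} \sum_{A,B} ED(A,B) = 2^{j-2}\,(2^n - 1)(2^k - 1), \\
NMED &= \frac{MED}{(2^n - 1)^2} = \frac{2^{j-2}\,(2^k - 1)}{2^n - 1}.
\end{align*}
```

Under these assumptions, the NMED doubles with each increment of j and roughly doubles with each increment of k, which agrees with the observation that j should be kept small for uniformly distributed inputs.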
2) Error Correction Methods: In this section, we introduce two methods to decrease the error induced by the application of partial product perforation. They are implemented as extra components complementing the multiplication circuit; thus, their area, power, and delay overheads, as well as the error reduction they offer, do not depend on the architecture of the multiplier. Although multiplication is commutative, i.e., A × B = B × A, this does not hold for perforated multipliers. From (4), when multiplying A × B, the imposed error is proportional to the multiplicand A and the term $x_B$, and thus decreasing either of these quantities decreases the error delivered to the output. As a result, comparing A and B, or $x_A$ and $x_B$, before the multiplication and swapping A and B accordingly can reduce the error.
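The operand-swapping idea can be sketched in Python as follows. This is an interpretation under assumptions, not the paper's circuit: Method 1 is taken to compare the k-bit fields $x_A$ and $x_B$ and perforate the operand with the smaller field, while Method 2 is taken to compare A and B and use the smaller one as the multiplicand. The swap directions and all function names (`perforated_mul`, `field`, `mul_method1`, `mul_method2`, `total_ed`) are hypothetical.

```python
def perforated_mul(a: int, b: int, j: int, k: int) -> int:
    """Software model of SPP perforation: drop partial products j .. j+k-1 of b."""
    mask = ((1 << k) - 1) << j
    return a * (b & ~mask)


def field(v: int, j: int, k: int) -> int:
    """Value x_v of the k bits of v starting at bit j."""
    return (v >> j) & ((1 << k) - 1)


def mul_method1(a: int, b: int, j: int, k: int) -> int:
    """Assumed Method 1: k-bit comparison; perforate the operand with smaller x."""
    if field(a, j, k) < field(b, j, k):
        a, b = b, a
    return perforated_mul(a, b, j, k)


def mul_method2(a: int, b: int, j: int, k: int) -> int:
    """Assumed Method 2: n-bit comparison; the smaller operand is the multiplicand."""
    if a > b:
        a, b = b, a
    return perforated_mul(a, b, j, k)


def total_ed(mul, n: int, j: int, k: int) -> int:
    """Sum of error distances over all pairs of unsigned n-bit inputs."""
    return sum(a * b - mul(a, b, j, k)
               for a in range(2 ** n) for b in range(2 ** n))


if __name__ == "__main__":
    for name, mul in [("none", perforated_mul),
                      ("method 1", mul_method1),
                      ("method 2", mul_method2)]:
        print(name, total_ed(mul, n=8, j=1, k=3))
```

A swap does not reduce the error for every individual input pair; the benefit appears on average over the input distribution.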
If A and B follow the uniform distribution in $[0, 2^n)$, (14) equals:
Every number A can be written in the form:
The sum $S_1(y)$ of all numbers A that have $x_A = y$, where y is a constant and $y \in [0, 2^k)$, is given by
Supposing that B is fixed and x B = z, we get that
and ∀A:
By evaluating (17) for all B, we obtain 2 ∀A,B:
By evaluating (18) for all B, we obtain ∀A,B:
Using (19) and (20), (15) is equal to
The sum of all REDs is given by
Denoting C I = 2 k − 1 and using that ∀A,B:
and ∀A,B:
and MRED is calculated as a relation of j and k from MRED = 2 j 2 n+k 
and MRED = 2 j 2 2n
Fig. 3 shows the error improvement achieved by Methods 1 and 2 for a 16-bit (n = 16) multiplier and all the product perforation configurations (j, k). Fig. 3(a) shows the NMED reduction attained by the correction methods with respect to the NMED of product perforation without an error correction method. Fig. 3(b) shows the respective graph for the MRED metric. The proposed corrective methods offer both NMED and MRED reduction. Method 1 offers higher NMED reduction, while Method 2 achieves higher MRED reduction. On average, Method 1 offers 30% NMED reduction and 24% MRED reduction, while Method 2 offers 26% and 50%, respectively. As a result, the selection of a corrective method depends on the application in which the perforated multiplier will be used. If the magnitude of the error relative to the accurate result is more important than its absolute distance from it, then Method 2 should be preferred; if not, then Method 1 should be selected. However, the implementation of Method 1 requires a k-bit comparator, while Method 2 requires an n-bit one, and thus Method 1 induces smaller area and power overheads. As a result, since both methods offer significant NMED and MRED reductions and Method 1 induces less power overhead, Method 1 should be preferred when the application is unknown.
Methods 1 and 2 decrease the error metrics, but their implementation requires an additional comparator. Fig. 4 shows the impact of correction Method 1 or Method 2 on the delay, power, and area on the Dadda 4:2 multiplier, with respect to the accurate design. Since the complexity of the comparator is mainly affected by the perforation variable k, Fig. 4 shows the perforation configurations that feature j = 1 and k = 1 to 8 (similar results are obtained for other j and for MBE designs). As expected, using Method 1 with perforation induces 13% overhead on critical delay, but also retains 26% and 20%, on average, power saving and area saving, respectively. The respective values for Method 2 are 20%, 26%, and 17%.
The NMED and MRED analytical relations show that the error imposed by the product perforation method is bounded and predictable. Therefore, when the application's input data set is determined, it can be used to calculate the optimal combination of j and k that produces an error less than a desired upper bound.
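The selection of j and k under an error bound can be sketched as a simple search. The Python code below is a minimal illustration under several assumptions that are not taken from the source: inputs are uniform, no error correction is applied, the NMED follows the closed form $2^{j-2}(2^k - 1)/(2^n - 1)$ obtained from ED = A x_B 2^j, and the number of perforated partial products k is used as a stand-in for power savings (the text only states that larger k lowers power). A real flow would rank the feasible configurations by synthesized power instead. The names `nmed_uniform` and `best_config` are hypothetical.

```python
from typing import Optional


def nmed_uniform(n: int, j: int, k: int) -> float:
    """Closed-form NMED for uniform inputs, SPP generation, no correction (assumed)."""
    return 2.0 ** (j - 2) * (2 ** k - 1) / (2 ** n - 1)


def best_config(n: int, nmed_bound: float) -> Optional[tuple[int, int]]:
    """Return (j, k) with the largest k whose NMED stays within the bound.

    Ties on k are broken in favor of the smallest j, i.e., the lowest error.
    Returns None when no configuration satisfies the bound.
    """
    feasible = [(j, k)
                for j in range(n)
                for k in range(1, n - j + 1)
                if nmed_uniform(n, j, k) <= nmed_bound]
    if not feasible:
        return None
    return max(feasible, key=lambda jk: (jk[1], -jk[0]))


if __name__ == "__main__":
    for bound in (1e-4, 5e-4, 1e-3):
        print(bound, best_config(16, bound))
```

For a non-uniform input data set, `nmed_uniform` would be replaced by an estimate computed from the actual input PDFs, and the search would then typically return configurations with larger j values as well.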
IV. EXPLORING THE EFFICIENCY OF PARTIAL PRODUCT PERFORATION
In this section, the partial product perforation method is applied to various multiplier architectures in order to explore how their power consumption, area, delay, and accuracy behave, considering the perforation configuration variables j and k. This analysis aims to expose the optimal architecture-configuration pair for given error values regarding both power dissipation and area complexity. This is critical, since different configurations may not have the same impact on a multiplier architecture, e.g., an architecture may be the power-optimal one when accurate calculations are performed, but suboptimal when partial product perforation is applied.
Both the SPP and MBE techniques are considered in our analysis. Regarding the accumulation tree, the most common architectures are used: 1) array; 2) balanced delay; 3) compressor 4:2; 4) counter 7:3; 5) Dadda; 6) Dadda with 4:2 compressors; 7) redundant binary; and 8) Wallace [19], [21], [22]. The array is the simplest way to accumulate the partial products. It consists of successive carry-save adders (CSAs) and has the least complexity but the highest delay. The Wallace tree reduces the number of partial products in each layer as much as possible and is theoretically the fastest multioperand adder. However, it has very complex interconnections that do not permit practical implementations. The balanced delay tree provides a more regular routing and minimizes the number of wiring tracks. The compressor 4:2 tree also has a regular structure and sums the partial products as a binary tree does, using 4:2 compressors instead of CSAs. Unlike the Wallace tree, Dadda makes the fewest reductions needed in each layer and can achieve similar overall delay, but requires fewer gates. The Dadda tree is based on 3:2 counters (full adders) but also uses 2:2 counters (half-adders) to reduce the hardware complexity. The Dadda 4:2 and counter 7:3 trees use the same reduction strategy as the Dadda tree, using, however, 4:2 and 7:3 compressors, respectively. In the redundant binary tree, the partial products are in a redundant representation, and the addition is performed by redundant binary adders [23] in the form of a binary tree. A carry look-ahead adder is used as the final adder in all multipliers. Fig. 1 shows some typical reduction schemes of the aforementioned tree architectures and the respective perforated trees with configuration j = k = 2. Using the unit gate model [24] (in which the area/delay of a full adder is 7 au/4 tu, of a half adder 3 au/2 tu, and of a 4:2 compressor 14 au/6 tu), the area of the array is
Exploration and Analysis: The flow used for our evaluation is summarized in Fig. 5. For our analysis, 16-bit unsigned multiplier architectures are considered. They are implemented in structural Verilog and synthesized using Synopsys Design Compiler and the TSMC 65-nm standard cell library. We simulate the designs using ModelSim and calculate their power consumption with Synopsys PrimeTime using the average mode of calculation. All the possible combinations of j and k are explored, and 1376 architectural configurations are examined in total. The metrics measured for each design are the NMED, the MRED, the minimum delay and, at the relaxed period of 2 ns, the power consumption and area complexity. In [25], a detailed power, area, and delay characterization and analysis of the examined perforated multiplier architectures has been performed, showing that the aforementioned metrics scale gracefully, i.e., average slope −0.16%, −242%, and −0.03%, respectively, for increased values of k.
Since power, area, and delay metrics scale differently for each multiplier architecture when different error values are considered, we illustrate in Fig. 6 the power-area Pareto curves for different NMED values in order to distinguish the optimal designs. We consider the NMED values of $10^{-4}$, $5 \times 10^{-4}$, and $10^{-3}$, which enclose a large set of different partial product perforation configurations while keeping the error small. The optimal accurate design is the Dadda[m]. Moreover, the Dadda4:2[m] architecture appears in all curves but with different product perforation configurations (i.e., different j and k values), depending on the NMED bound. With respect to the accurate design, the perforation achieves up to 50% power, 45% area, and 35% delay reductions for only 0.1% error (i.e., NMED < $10^{-3}$).
Aiming to elucidate the impact of partial product perforation on each multiplier architecture, we examine their power variation (i.e., the range of power values) for a bounded error. Fig. 7 shows the box plot diagram for all the architectures with regard to power, considering all the product perforation configurations that result in NMED < $5 \times 10^{-4}$. The MBE-based architectures exhibit smaller variation and lower median than the respective SPP-based ones. The lowest median and variation values are observed for the counter 7:3[m] architecture. Thus, its power consumption for various perforation configurations is concentrated in a smaller range, making its power behavior more predictable. The same conclusion is confirmed in Fig. 6, where the counter 7:3[m] for NMED values $5 \times 10^{-4}$ and $10^{-3}$ is the Pareto optimal point with the lowest power.
V. EXPERIMENTAL EVALUATION
A. Comparative Study on Circuit Level
In this section, we extensively evaluate the efficiency of partial product perforation in terms of power, area, and error, and we compare it with the state-of-the-art approximation techniques, which apply truncation [4], logic approximation [9], or the VOS technique [6]. Using the two inexact 4:2 compressors of [9] at the 16 LSB columns, two approximate 16-bit multipliers ACM1 and ACM2 are implemented in structural Verilog and synthesized at 2 ns using Synopsys Design Compiler and PrimeTime. Error metrics calculation is performed through exhaustive MATLAB simulation. In order to compare the partial product perforation with the VOS technique, we use the Synopsys composite current source (CCS) model [26]. CCS models are proven to deliver signoff-level accuracy to within 2% of the HSPICE simulation, are designed to be scalable for voltage, temperature, and process, and offer better accuracy than the nonlinear delay and power models [26]. For the exact multiplier architectures of Section IV, we scale the supply voltage from 1 V (nominal) to 0.80 V and measure their power consumption and error metrics using $10^5$ randomly generated inputs. Regarding truncation, two truncated multipliers with variable correction [4] that use the Dadda 4:2 tree to accumulate the partial products are implemented. In the first one (TR10), 10 LSB are truncated, while in the second (TR16), 16 LSB are truncated. For the perforated multipliers, the error correction Method 1 (Section III-B2) is used. Fig. 8 shows comparative results on the power, area, NMED, and MRED metrics after applying the four different partial product perforation configurations, the approximate compressors according to the technique presented in [9] (ACM1 and ACM2), the VOS technique, and the truncation (TR10 and TR16) on a 16-bit Dadda 4:2 multiplier using SPP [Fig. 8(a)] and MBE [Fig. 8(b)].

The proposed partial product perforation for the SPP-based designs, included in Fig. 8(a), delivers power savings of up to 49% and area reduction of up to 40% compared with the respective accurate design, while the NMED value is $6.5 \times 10^{-4}$ at most and the MRED one goes up to $1.1 \times 10^{-2}$. The respective values for MBE-based configurations [Fig. 8(b)] are 47% power savings, 38% area reduction, NMED $1.8 \times 10^{-3}$, and MRED $2.5 \times 10^{-2}$. The approximate compressor multipliers ACM1 and ACM2 [9] with SPP [Fig. 8(a)] have 15% and 20% power savings and 15% and 18% area savings, respectively, over the accurate Dadda 4:2 multiplier. Their NMED values are $2 \times 10^{-5}$ and $1.5 \times 10^{-5}$, while their MRED ones are $5.3 \times 10^{-3}$ and $5.6 \times 10^{-3}$, respectively. For the MBE [Fig. 8(b)], ACM1 and ACM2 have 16% and 23% power savings and 8% and 11% area reductions, respectively, over the accurate Dadda 4:2 multiplier. Their NMED values are $2.4 \times 10^{-4}$ and $1.6 \times 10^{-4}$, while their MRED ones are 17 and 24, respectively. Regarding the MBE-based designs, [9] is less efficient, since fewer partial products are accumulated in the tree compared with the SPP technique, and an error occurring in one column has a greater impact on the output. VOS does not deliver any area reduction, although it offers significant power savings compared with the accurate design. When decreasing the supply voltage of the SPP-based design to 0.80 V [Fig. 8(a)], the power consumption is 1.06 mW (i.e., 37.9% less than the accurate one). Similarly, the power consumption of the MBE-based design [Fig. 8(b)] is 0.94 mW (i.e., 37.7% less than the precise design). However, even for small power savings (10% at 0.95 V), the NMED and MRED values of VOS are too large, more than 0.65 and 10, respectively, as VOS errors mainly impact the MSBs, resulting in large ED. The truncated multipliers TR10 and TR16 [4], when SPP is used, offer 14% and 46% power savings and 18% and 44% area reductions for 1.

On average, the partial product perforation configurations, shown in Fig. 8, exhibit lower MRED values than ACM2, but higher NMED. The large NMED value of partial product perforation implies that it may produce large ED. However, the small value of MRED shows that such large ED is insignificant compared with the accurate result. The aforementioned points can be further explained based on the error analysis in Section III-B. As shown, the ED is proportional to the inputs, and thus it can be as large as the input numbers. However, $RED = x_B 2^j / B$, and since only a few partial products are removed, the numerator is much smaller than B, resulting in small relative error values. On the other hand, [9] produces smaller ED, but its errors are of greater significance compared with the exact results. This behavior is also captured in Fig. 9, where the PDF of the ED and RED for ACM2 and Dadda4:2[1,5,s] is presented. ACM2 exhibits lower NMED but higher MRED compared with Dadda4:2[1,5,s]. Fig. 9(a) shows the PDF of the ED for the aforementioned multipliers. ACM2 has a significantly greater error probability, but its probable error values are concentrated in a smaller range. In contrast, the Dadda4:2[1,5,s] errors are spread over a wider range and have almost equal, but very low, probability to appear. Fig. 9(b) shows the same graph for the RED metric. As shown in Fig. 9(b

To summarize, the partial product perforation technique shows significant gains compared with the accurate design and the state-of-the-art approximate techniques. On average, compared with VOS, the partial product perforation configurations attain 3% lower power consumption and 96% lower MRED when SPP is used, and 9% and 99%, respectively, when MBE is used. Compared with [9] for SPP schemes, their power [4], the perforated multipliers of Fig. 8 deliver on average 3% higher power for 99% lower MRED, while for MBE, the respective values are 4% lower power and two orders of magnitude lower MRED.
Finally, Table I offers a more straightforward comparison among the examined approximation schemes, by ranking them according to their savings and error metrics. The examined designs have been grouped in four subgroups, each one with designs exposing similar power and/or error characteristics. In each subgroup, the perforated multipliers deliver the lowest power and MRED values and, in most cases, the lowest NMED and area as well.
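The error metrics compared above can be estimated in software for a generic perforation. The following Python sketch is only an illustration and not the evaluation flow used in this section. It assumes that perforation omits the k partial products generated by bits j to j + k − 1 of operand B, that no correction method is applied, that NMED is the mean ED normalized by the maximum product $(2^n - 1)^2$, and that MRED is the mean of ED over the exact product, taken over nonzero products. The names `perforated_multiply` and `error_metrics` are hypothetical.

```python
import random


def perforated_multiply(a, b, j, k):
    """Multiply a by b while skipping the k partial products of bits j..j+k-1 of b.

    Assumption: unsigned operands and simple partial products (a * b_i * 2^i).
    """
    mask = ((1 << k) - 1) << j          # bits of b whose partial products are dropped
    return a * (b & ~mask)


def error_metrics(n, j, k, samples=10_000, seed=0):
    """Monte Carlo estimate of (NMED, MRED) for uniformly random n-bit operands."""
    rng = random.Random(seed)
    max_product = ((1 << n) - 1) ** 2
    ed_sum = 0
    red_sum = 0.0
    nonzero = 0
    for _ in range(samples):
        a = rng.randrange(1 << n)
        b = rng.randrange(1 << n)
        exact = a * b
        ed = exact - perforated_multiply(a, b, j, k)   # error distance, never negative
        ed_sum += ed
        if exact:
            red_sum += ed / exact                      # relative error distance
            nonzero += 1
    nmed = ed_sum / samples / max_product
    mred = red_sum / nonzero
    return nmed, mred


if __name__ == "__main__":
    nmed, mred = error_metrics(n=16, j=0, k=2)
    print(f"NMED = {nmed:.3e}, MRED = {mred:.3e}")
```

Because the omitted partial products sum to a multiple of operand A, the ED in this sketch never exceeds $A (2^k - 1) 2^j$, which is consistent with the bounded error claimed for perforation.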
B. Comparative Study on Real-Life Applications
In this section, we evaluate the efficiency of the proposed technique on real-life use cases from the image processing and data analytics domains. For our analysis, we consider the Canny edge detection [27] and geometric mean filters from the image processing domain and the K-means clustering [28] from the data analytics domain. All the examined algorithms are implemented in C++, and for the image processing ones, the OpenCV library is used.
The geometric mean filter removes noise from images, offering better results than the arithmetic mean filter for Gaussian-type noise. The geometric mean filter with parameter r filters an image by replacing each pixel's value with the geometric mean of the values of all the neighboring pixels that are inside a (2r + 1) × (2r + 1) block centered on that pixel. For our evaluation, the r parameter is set to 3. We approximate the geometric mean by replacing the multiplication between the pixels with an approximate 16 × 16 multiplier. As input, we use the 16-bit (16 bits/pixel) grayscale image shown in Fig. 10(a). To evaluate the accuracy of the output images of the geometric mean filter, we use the peak signal-to-noise ratio (PSNR).
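For reference, the PSNR quality metric can be computed as sketched below. This is a minimal Python illustration that assumes the common definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ with MAX = 65535 for 16 bits/pixel and images given as flat sequences of pixel values. The function name `psnr` is hypothetical, and the exact PSNR variant used in the evaluation is not specified in the text.

```python
import math


def psnr(reference, test, max_value=65535):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error between the accurate and the approximate output.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf                 # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Here `reference` would hold the output of the filter with the accurate multiplier and `test` the output with the approximate one.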
The Canny edge detection [27] filter is considered to be an optimal edge detector. In particular, it smooths the image by applying a Gaussian filter to remove the noise, calculates the gradient of the image to find the edge strength, applies nonmaximum suppression to keep only the local maxima, determines the potential edges by thresholding, and tracks edges by hysteresis, i.e., it suppresses all the edges that are weak and not connected to strong edges. The Gaussian kernel has a size of 7 × 7 and a standard deviation of 1.1, and it uses 16-bit fixed-point arithmetic. We approximate Canny edge detection by replacing the multiplication in the Gaussian filter with an approximate 16 × 16 multiplier. As input, we use the 16-bit grayscale image shown in Fig. 11(a). The percentage of the edges detected using the approximate multiplier over those detected using the accurate one is used as our quality metric.
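The Gaussian smoothing step, which is the only part of the Canny pipeline that uses the approximate multiplier, can be sketched as follows. This Python example is an assumption-laden illustration rather than the C++/OpenCV implementation used here: the unsigned fixed-point format with 16 fractional bits, the rounding of the coefficients, and the names `gaussian_kernel_fixed` and `filter_pixel` are all hypothetical, and only the 7 × 7 size and the standard deviation of 1.1 come from the text.

```python
import math


def gaussian_kernel_fixed(size=7, sigma=1.1, frac_bits=16):
    """Return a normalized size x size Gaussian kernel quantized to fixed point."""
    half = size // 2
    weights = [
        [math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(map(sum, weights))
    scale = 1 << frac_bits
    # Each coefficient is below 1, so its quantized value fits in 16 bits.
    return [[round(w / total * scale) for w in row] for row in weights]


def filter_pixel(window, kernel, multiply, frac_bits=16):
    """Filter one pixel: accumulate pixel * coefficient products, then rescale.

    `multiply` is the 16 x 16 multiplier under test (exact or approximate).
    """
    acc = 0
    for pixel_row, coeff_row in zip(window, kernel):
        for pixel, coeff in zip(pixel_row, coeff_row):
            acc += multiply(pixel, coeff)
    return acc >> frac_bits
```

Passing `lambda p, c: p * c` as `multiply` gives the accurate filter, while passing a model of an approximate multiplier gives the approximate one, so both outputs can be compared edge by edge.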
K-means is a popular algorithm for clustering data points from a multidimensional space into k clusters. It uses a two-phase iterative method and aims to partition the data points into sets so as to minimize the within-cluster sum of the distances of each point in the cluster to the cluster center. We use the Euclidean distance as the distance function. We approximate the K-means algorithm by replacing the multiplications in the calculation of the Euclidean distance with an approximate 16 × 16 multiplier. We use a randomly generated input data set of 100 000 4-D points with 16 bits/dimension. The input data set is clustered into 100 clusters. To evaluate the accuracy of the K-means algorithm, we use the average relative L2-norm, i.e., $\|x_{\mathrm{acc}} - x_{\mathrm{approx}}\|_2 / \|x_{\mathrm{acc}}\|_2$.
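The distance computation and the quality metric described above can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: the squared Euclidean distance is used for the nearest-center comparison, the relative L2-norm is averaged over corresponding pairs of accurate and approximate vectors (for example, cluster centers), and the names `squared_distance`, `relative_l2_error`, and `average_relative_l2` are hypothetical.

```python
import math


def squared_distance(p, q, multiply):
    """Squared Euclidean distance; every product goes through `multiply`."""
    total = 0
    for pi, qi in zip(p, q):
        d = abs(pi - qi)
        total += multiply(d, d)     # both operands are equal: the multiplier acts as a squarer
    return total


def relative_l2_error(accurate, approximate):
    """||x_acc - x_approx||_2 / ||x_acc||_2 for one pair of vectors."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(accurate, approximate)))
    den = math.sqrt(sum(a * a for a in accurate))
    return num / den


def average_relative_l2(accurate_vectors, approximate_vectors):
    """Average of the relative L2 error over corresponding vector pairs."""
    errors = [relative_l2_error(a, b)
              for a, b in zip(accurate_vectors, approximate_vectors)]
    return sum(errors) / len(errors)
```

With 16 bits per dimension, each coordinate difference fits in 16 bits, so every product maps onto a single 16 × 16 multiplication.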
Similar to [9] and [10], the approximate multiplier is considered as part of a general processing system that implements the aforementioned algorithms. The remaining hardware components are considered to deliver accurate results, and thus any application inaccuracy and energy savings result from the use of the approximate multiplier. The energy values of each multiplication operation are obtained from postsynthesis simulations of the approximate multipliers on the input data traces extracted from the applications' execution. Note that in the Canny edge detection and geometric mean algorithms, the number of multiplications depends only on the image size, and thus it is the same for the accurate and the approximate version of the algorithm. On the other hand, the number of iterations performed by the K-means algorithm is not constant, and as a result, the number of multiplications in the accurate version may differ from that in the approximate version.
Fig. 10 shows both the input image and the output image of the geometric mean filter when using the accurate multiplier Dadda4:2[s], the perforated multipliers Dadda4:2[1,5,s] and Dadda4:2[3,4,s] with and without any correction method, and the approximate multiplier ACM2. Fig. 11 shows the same images for the Canny edge detection. Table II summarizes the values of the energy savings and quality metrics of each application when using the aforementioned multipliers. 16.3% higher energy reduction, detects 0.5% more edges, and has 2.8% higher PSNR.
Regarding the K-means algorithm, using a correction method with product perforation does not deliver any quality improvement. This is explained by the fact that in the Euclidean distance, the multiplier is used as a squarer, and as a result, swapping the multiplicands does not decrease the multiplication's error. Moreover, we observe that using ACM2[s] in the K-means algorithm does not offer any energy reduction. The implementation of the K-means algorithm with ACM2[s] fails to converge and exits after reaching the maximum number of allowed iterations. As a result, although ACM2[s] has lower power consumption than the accurate multiplier, the increased number of multiplications results in an energy increase for the K-means algorithm.
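The interaction between per-multiplication energy and iteration count can be made concrete with a small accounting sketch. The Python example below assumes that every iteration compares each of the 100 000 points with each of the 100 cluster centers using one multiplication per dimension. The iteration counts and per-multiplication energies in the usage lines are purely hypothetical placeholders and are not measurements from this work, and the function name `kmeans_multiplication_energy` is likewise hypothetical.

```python
def kmeans_multiplication_energy(iterations, energy_per_mult_pj,
                                 points=100_000, clusters=100, dims=4):
    """Total multiplier energy in pJ: (number of multiplications) x (energy each)."""
    multiplications = iterations * points * clusters * dims
    return multiplications * energy_per_mult_pj


# Hypothetical scenario: the approximate multiplier is cheaper per operation,
# but the run hits the iteration cap instead of converging early.
accurate_energy = kmeans_multiplication_energy(iterations=20, energy_per_mult_pj=1.0)
approximate_energy = kmeans_multiplication_energy(iterations=50, energy_per_mult_pj=0.8)
print(approximate_energy > accurate_energy)
```

Under these placeholder values, the lower per-operation energy is outweighed by the larger number of multiplications, which is the mechanism described above for ACM2[s].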
C. Impact of Bit-Width Scaling
In this section, we examine the scalability of the proposed technique as the multiplier's bit width increases. More specifically, we study the impact of scaled bit widths, i.e., from 16 up to 128 bits, on the proposed perforation technique, focusing on the delivered accuracy (NMED and MRED) and the power and area gains. We consider the Dadda 4:2 as our driver architecture and NMED ≤ $10^{-4}$ as our quality constraint. Fig. 12(a) shows, for each of the examined bit widths, the power and area reduction delivered by the perforated Dadda 4:2 solutions with respect to their accurate designs. In a complementary manner and for the same scaled bit widths, Fig. 12(b) shows the NMED and MRED values when targeting 50% power reduction. In particular, for NMED ≤ $10^{-4}$, the power and area gains for the 16-bit design are 21% and 31%, respectively. For the 128-bit design, the respective gains scale up to 74% in power and 91% in area. Similarly, Fig. 12(b) shows that for the same relative power gain, i.e., 50%, the 16-bit solution delivers NMED and MRED values of $1.95 \times 10^{-3}$ and $2.61 \times 10^{-2}$, respectively. For the 128-bit solution, NMED and MRED reduce to $1.73 \times 10^{-18}$ and $2.05 \times 10^{-16}$, respectively. Thus, the partial product perforation offers better results as the multiplier's bit width increases, i.e., higher power and area reduction for the same error constraints or lower error values for the same power savings.
This good scaling behavior for increased bit widths can also be confirmed theoretically using the error analysis in Section III-B. Let us assume two multipliers $M_1$ and $M_2$ with different bit widths $n_1$ and $n_2$, with $n_1 < n_2$, having the same j-value for the partial product perforation. For both multipliers to achieve the same NMED, the following relation should hold, according to (11):
Given that $n_1 < n_2 \Rightarrow k_1 < k_2$. High k-values imply the perforation of more partial products. Thus, for two approximate multipliers with the same NMED but different bit widths, the higher the multiplier's bit width, the higher the number of partial products that should be perforated, and thus the higher the power gains achieved with respect to their accurate counterparts.
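The scaling argument can be illustrated numerically. The Python sketch below does not reproduce (11). Instead, it assumes uniformly distributed, independent unsigned operands, perforation of the k partial products generated by bits j to j + k − 1 of one operand, and no correction method. Under these assumptions the expected ED factorizes, which gives $\mathrm{NMED} = (2^k - 1) 2^j / (4 (2^n - 1))$. The function names are hypothetical, and the closed form is checked against exhaustive enumeration for a small bit width.

```python
from fractions import Fraction


def expected_nmed(n, j, k):
    """Closed-form NMED under the uniform-input assumption (requires j + k <= n)."""
    return Fraction(((1 << k) - 1) << j, 4 * ((1 << n) - 1))


def exhaustive_nmed(n, j, k):
    """NMED by enumerating all operand pairs; practical only for small n."""
    mask = ((1 << k) - 1) << j
    total_ed = sum(a * (b & mask) for a in range(1 << n) for b in range(1 << n))
    return Fraction(total_ed, (1 << (2 * n)) * ((1 << n) - 1) ** 2)


def max_perforated(n, j, nmed_limit):
    """Largest k such that the closed-form NMED stays within nmed_limit."""
    k = 0
    while j + k + 1 <= n and expected_nmed(n, j, k + 1) <= nmed_limit:
        k += 1
    return k


if __name__ == "__main__":
    limit = Fraction(1, 10 ** 4)
    for n in (16, 32, 64, 128):
        print(n, max_perforated(n, j=0, nmed_limit=limit))
```

For a fixed j and a fixed NMED limit, the number of partial products that can be omitted in this model grows with n, which agrees with the conclusion that $k_1 < k_2$ when $n_1 < n_2$. The resulting k-values describe this simplified model only and are not the configurations evaluated in Fig. 12.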
VI. CONCLUSION
In this paper, we proposed the partial product perforation technique for producing approximate hardware multipliers. The proposed technique omits a number of partial products, enabling high area and power savings while retaining high accuracy. Through a rigorous error analysis, we analytically characterized the induced error metrics, proving that the error is bounded and predictable, and we proposed two error correction methods that trade a small increase in power for high error reduction. We explored product perforation on a large set of multiplier architectures, evaluating its impact on different architectures and error bounds. In comparison to the state-of-the-art approximation techniques, we showed that the proposed approach achieves significant gains in power, area, and quality metrics of image processing and data analytics algorithms. Finally, we showed that our technique is scalable, offering better results as the multiplier's bit width increases.
Georgios Zervakis received the Diploma degree from the Department of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece, in 2012, where he is currently pursuing the Ph.D. degree in digital and microprocessor system design.
His current research interests include approximate computing, VLSI arithmetic circuits, low-power design, and cryptography.
Kostas Tsoumanis (S'12) received the Diploma degree from the Department of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece, in 2010, where he is currently pursuing the Ph.D. degree.
He has co-authored research papers in international conferences. His current research interests include hardware-efficient implementation of arithmetic operations and low-power design of digital signal processing algorithms.
