Taking advantage of the error resilience in many applications as well as the perceptual limitations of humans, numerous approximate arithmetic circuits have been proposed that trade off accuracy for higher speed or lower power in emerging applications that exploit approximate computing. However, characterizing the various approximate designs for a specific application under certain performance constraints becomes a new challenge. In this paper, approximate adders and multipliers are evaluated and compared for a better understanding of their characteristics when the implementations are optimized for performance or power. Although simple truncation can effectively reduce the hardware of an arithmetic circuit, it is shown that some other designs perform better in speed, power and power-delay product. For instance, many approximate adders have a higher performance than a truncated adder. A truncated multiplier is faster but consumes a higher power than most approximate designs for achieving a similar mean error magnitude. The logarithmic multipliers are very fast and power-efficient at a lower accuracy. Approximate multipliers can also be generated by an automated process to be very efficient while ensuring a sufficiently high accuracy.
INTRODUCTION
As a potential technique for implementing complex computations with reasonable speed and power consumption, approximate computing has been explored actively from circuits to programming languages. Among circuit designs, adders and multipliers have been a focus because they play a pivotal role in determining the performance and power dissipation of many compute-intensive applications [8]. Therefore, a large variety of approximate adders and multipliers have been proposed. However, these designs usually seek tradeoffs among accuracy, performance and power consumption; for example, a design may be very fast but have a low accuracy or high power, or a very power-efficient design may have a significantly low speed or accuracy. Also, different designs in the literature have been evaluated using different synthesis tools and technologies. These differences have made it difficult to choose a suitable approximate design for a specific application with designated purposes.
An evaluation and comparison of the accuracy and circuit characteristics was provided in [11] for approximate adders, multipliers and dividers. However, the circuit measurements were obtained without considering any design constraints and, hence, the generic power-delay product (PDP) metric was mainly used for hardware evaluation. At a similar PDP, it was concluded that a truncated arithmetic circuit results in a smaller error magnitude (in the mean relative error distance (MRED)) than most of the other approximate designs.
In this paper, rather than considering the overall measurement, high performance and power efficiency are pursued as independent design objectives. For example, an approximate design with a high speed is useful for coping with aging-induced timing errors in adders and multipliers [1]. Also, high-performance arithmetic circuits are preferred in real-time machine learning systems [10]. For mobile and embedded devices, power-efficient arithmetic circuits are key to extended use given the limited battery life. In this work, therefore, the approximate circuits are optimized either for maximizing performance (through delay) or for minimizing power (through area). Compared to previous studies, several recent approximate arithmetic designs are included for comparison and to provide new insights into approximate arithmetic circuit design.
In the remainder of this paper, Section 2 introduces the simulation methodology. Sections 3 and 4 review and comparatively evaluate approximate adders and multipliers, respectively. Section 5 concludes the paper.
SIMULATION METHODOLOGY
The accuracy of the approximate designs is evaluated through Monte Carlo simulations. The error distance ($\mathrm{ED} = |M' - M|$) and the relative error distance ($\mathrm{RED} = |\mathrm{ED}/M|$) are then calculated, where $M'$ and $M$ are the approximate and accurate results, respectively [18]. The mean error distance (MED) is calculated by averaging all of the obtained EDs. The error characteristics of the approximate designs are assessed by the error rate (ER, the probability of producing an incorrect result), the normalized MED (NMED, the normalization of MED by the maximum output of the accurate design) and the MRED (the average value of all of the obtained REDs).
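The metrics defined above can be estimated with a short Monte Carlo loop. The following is a minimal sketch, not the simulation setup used in this paper: the function names `error_metrics` and `truncated_add4`, the uniform random inputs, the sample count, and the choice to skip samples whose accurate result is zero when averaging REDs are all assumptions made for illustration.

```python
import random


def error_metrics(approx, exact, bits, samples=10000, seed=0):
    """Estimate ER, MED, NMED and MRED for a two-operand unsigned design.

    `approx` and `exact` are callables taking two integers (hypothetical
    behavioural models, not synthesized circuits).
    """
    rng = random.Random(seed)
    max_in = (1 << bits) - 1
    max_out = exact(max_in, max_in)  # maximum output of the accurate design
    wrong = 0
    ed_sum = 0
    red_sum = 0.0
    red_count = 0
    for _ in range(samples):
        a = rng.randint(0, max_in)
        b = rng.randint(0, max_in)
        m = exact(a, b)
        ed = abs(approx(a, b) - m)  # ED = |M' - M|
        wrong += ed != 0
        ed_sum += ed
        if m != 0:  # assumption: RED is skipped when the accurate result is 0
            red_sum += ed / m  # RED = |ED / M|
            red_count += 1
    med = ed_sum / samples
    return {
        "ER": wrong / samples,
        "MED": med,
        "NMED": med / max_out,
        "MRED": red_sum / red_count if red_count else 0.0,
    }


def truncated_add4(a, b):
    """Hypothetical example design: an adder with 4 LSBs truncated."""
    return ((a >> 4) + (b >> 4)) << 4
```

For example, `error_metrics(truncated_add4, lambda a, b: a + b, bits=16)` returns a dictionary with the four estimated metrics for a 16-bit truncated adder.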
To assess the circuit characteristics, the approximate designs are implemented in VHDL and synthesized using the Synopsys Design Compiler (DC) in STMicro's 28-nm CMOS technology, with a supply voltage of 1.0 V at a temperature of 25 °C. For a fair comparison, all designs use the same process, voltage and temperature with the same optimization option. To compare speed and power, the approximate circuits are synthesized under different constraints. The critical path delay of a design is constrained to the smallest value without a timing violation for the delay-optimized synthesis, whereas the area is minimized for the area-optimized synthesis. The DesignWare library and "ultra compile" are used in the synthesis for optimization. The critical path delay and area are reported by the Synopsys DC. Power dissipation is measured by the PrimeTime-PX tool with 10 million random input combinations.
APPROXIMATE ADDERS
Two basic adders for binary addition are the ripple-carry adder (RCA) and the carry lookahead adder (CLA). For an $n$-bit RCA, the delay and circuit complexity increase proportionally with $n$ (denoted by $O(n)$). The delay of an $n$-bit CLA is logarithmic in $n$ (or $O(\log(n))$), thus significantly shorter than that of an $n$-bit RCA. However, a CLA requires a larger circuit area (in $O(n\log(n))$), which results in a higher power dissipation. To reduce the critical path delay of an accurate adder, many approximation methodologies have been proposed, including speculative operation, segmentation, carry selection and the use of approximate full adders.
Review
In this section approximate adders are briefly introduced. Please refer to [11] for a detailed introduction to these designs.
Speculative Adders.
In an $n$-bit speculative adder, $k$ ($k < n$) least significant bits (LSBs) are used to predict the carry input for each sum bit [22]. Thus, the critical path delay is reduced to $O(\log(k))$ (for a parallel implementation such as the CLA; the same holds throughout this section unless otherwise noted). The almost correct adder (ACA) is designed to reduce the hardware overhead by sharing part of the carry-generation circuits [32].
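The speculation idea can be illustrated with a bit-level behavioural model. This is a simplified sketch under stated assumptions, not the circuit of [22] or [32]: the function name `speculative_add`, the default widths, and the way the carry-out is predicted from the top `k` bits are choices made here for illustration.

```python
def speculative_add(a, b, n=16, k=4):
    """Behavioural sketch of an n-bit speculative adder.

    Each sum bit i is computed from bit i and the k bits below it, so a
    carry generated further down the chain is ignored (assumed model).
    """
    result = 0
    for i in range(n):
        lo = max(0, i - k)          # lowest bit of the speculation window
        width = i - lo + 1
        mask = (1 << width) - 1
        window_sum = ((a >> lo) & mask) + ((b >> lo) & mask)
        result |= ((window_sum >> (i - lo)) & 1) << i
    # Assumption: the carry-out is predicted from the top k bits only.
    lo = max(0, n - k)
    mask = (1 << (n - lo)) - 1
    carry_out = (((a >> lo) & mask) + ((b >> lo) & mask)) >> (n - lo)
    return result | (carry_out << n)
```

With `k` equal to `n` the model degenerates into an exact adder; with a small `k` it is exact whenever no carry has to propagate across more than `k` bit positions.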
Segmented Adders.
A segmented adder is implemented by using several smaller adders operating in parallel. Such designs include the equal segmentation adder (ESA) [26], the error-tolerant adder type II (ETAII) [36], the accuracy-configurable approximate adder (ACAA) [13] and the dithering adder [24]. The delays of the segmented adders grow with $O(\log(k))$ and the circuit complexities grow with $O(n\log(k))$ for ESA and ETAII, and with $O((n-k)\log(k))$ for ACAA [11].
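A behavioural sketch of segmentation is given below. It assumes that ESA uses a carry-in of zero for every sub-adder and that an ETAII-style design feeds each sub-adder with the carry generated by the previous segment alone (without that segment's own carry-in); the function name `segmented_add` and its `predict_carry` switch are hypothetical and introduced only for this illustration.

```python
def segmented_add(a, b, n=16, k=4, predict_carry=False):
    """Behavioural sketch of an n-bit adder split into k-bit sub-adders.

    predict_carry=False: every sub-adder gets carry-in 0 (ESA-like).
    predict_carry=True: the carry-in is the carry generated by the previous
    segment's operands with carry-in 0 (ETAII-like, assumed model).
    """
    result = 0
    carry_in = 0
    for lo in range(0, n, k):
        width = min(k, n - lo)
        mask = (1 << width) - 1
        x = (a >> lo) & mask
        y = (b >> lo) & mask
        s = x + y + carry_in
        result |= (s & mask) << lo
        if lo + width >= n:
            result |= (s >> width) << n   # keep the carry-out of the top segment
        carry_in = ((x + y) >> width) if predict_carry else 0
    return result
```

In this model, `segmented_add(0x000F, 0x0001)` loses the carry between the two lowest segments, whereas the same call with `predict_carry=True` recovers it; a carry that must cross two segment boundaries is lost in both variants.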
Carry Select Adders.
An $n$-bit carry select adder consists of $m = \lceil n/k \rceil$ blocks, and the carry input for each block is selected using different schemes in different designs. In the speculative carry select adder (SCSA), each block is made of two $k$-bit adders: adder0 with carry-in "0" and adder1 with carry-in "1"; the sum is selected by a multiplexer according to the carry-out of adder0 in the previous block [6]. SCSA and ETAII achieve the same accuracy for the same value of $k$ due to the same carry prediction function. For the carry skip adder (CSA), each block consists of a sub-carry generator and a sub-adder [14]. The carry-in of the $(i+1)$th sub-adder is determined by the propagate signals of the $i$th block, which enhances the carry prediction accuracy. In the carry speculative adder (CSPA), each block contains one sum generator, two internal carry generators and one carry predictor; fewer than $k$ input bits are used in a carry predictor [20]. In the consistent carry approximate adder (CCA), the carry prediction depends not only on its LSBs, but also on the higher bits [17]. The generate signals-exploited carry speculation adder (GCSA) has a structure similar to that of CSA and uses the generate signals for carry speculation [9]. In the gracefully-degrading accuracy-configurable adder (GDA), control signals are used to configure the accuracy by selecting an accurate or approximate carry-in signal using a multiplexer for each sub-adder [34]. In the carry cut-back adder (CCBA), the full carry propagation is prevented by a controlled multiplexer or an OR gate for a high-speed operation [5]. Although the critical path delays of the carry select adders vary, they generally grow with $O(\log(k))$, where $k$ is the size of the sub-adder.
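The SCSA selection scheme described above can be sketched behaviourally as follows. The function name `scsa_add` and the default widths are assumptions for illustration; the model only mirrors the textual description (two sums per block, selected by the carry-out of adder0 in the previous block) and is not the circuit of [6].

```python
def scsa_add(a, b, n=16, k=4):
    """Behavioural sketch of a speculative carry select adder (SCSA)."""
    result = 0
    prev_carry0 = 0  # carry-out of adder0 in the previous block
    for lo in range(0, n, k):
        width = min(k, n - lo)
        mask = (1 << width) - 1
        x = (a >> lo) & mask
        y = (b >> lo) & mask
        sum0 = x + y        # adder0: carry-in "0"
        sum1 = x + y + 1    # adder1: carry-in "1"
        selected = sum1 if prev_carry0 else sum0  # multiplexer
        result |= (selected & mask) << lo
        if lo + width >= n:
            result |= (selected >> width) << n    # carry-out of the top block
        prev_carry0 = sum0 >> width
    return result
```

Because the select signal comes from adder0 only, a carry that ripples through an entire block is not seen by the block after it, which is the source of the approximation in this model.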
Approximate Full Adders.
In this class of designs, the LSBs are implemented by approximate full adders. These adders include the simple use of OR gates (and one AND gate for carry propagation) in the lower-part-OR adder (LOA) [23], the approximate designs of the mirror adder [7] and the approximate XOR/XNOR-based full adders [33]. For the LOA, the critical path is approximately $O(\log(n-l))$, where $l$ is the number of approximate LSBs. Finally, an adder whose LSBs are truncated is referred to as a truncated adder (TruA); it works with a lower precision and is considered as a baseline design.
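The LOA and the truncated baseline are simple enough to model in a few lines. The sketch below assumes that the single AND gate of the LOA takes the most significant bits of the two lower parts as its inputs and drives the carry-in of the accurate upper part; the function names `loa_add` and `truncated_add` are hypothetical.

```python
def loa_add(a, b, l=4):
    """Behavioural sketch of a lower-part-OR adder with l approximate LSBs."""
    if l == 0:
        return a + b
    low_mask = (1 << l) - 1
    low = (a | b) & low_mask                          # OR gates replace the lower adder
    carry_in = (a >> (l - 1)) & (b >> (l - 1)) & 1    # one AND gate (assumed placement)
    high = (a >> l) + (b >> l) + carry_in             # accurate upper part
    return (high << l) | low


def truncated_add(a, b, l=4):
    """Behavioural sketch of a truncated adder: the l LSBs are dropped."""
    return ((a >> l) + (b >> l)) << l
```

As a usage note, `loa_add(0x0105, 0x020A)` happens to be exact because the lower parts share no set bits, whereas `truncated_add` on the same operands discards the lower four bits of the result.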
Evaluation
In this evaluation, 16-bit approximate adders are considered. For CSPA, the size of the carry predictor is $\lceil k/2 \rceil$. The global speculative carry for CCA is "0". All sub-adders are implemented as CLAs because most approximate adders are designed based on the CLA.
The simulation results show similar performance trends for the MRED and NMED of the approximate adders, so only the MRED is considered in the comparison. Figs. 1, 2 and 3 show the comparison of MRED, ER, delay (for delay-optimized synthesis), power (for area-optimized synthesis) and PDP. Fig. 1 shows that, among the adders with small MREDs, LOA and ETAII are faster than the other designs, whereas CCA is the slowest followed by CSA. CSA-6 is not shown in the figures because it is accurate due to the precise carry generated for every block, so the ER and MRED of CSA-6 are 0. For a high MRED, ESA and CSPA are faster. When the same ER is considered, ETAII, SCSA and ACAA are among the fastest designs. Fig. 2 shows that in terms of power consumption, CCBA, LOA and TruA are the most efficient designs, while ACA is very power consuming. However, CCBA, LOA and TruA have very high ERs. CSA has a rather low ER. For a similar ER, ETAII and ACAA are very power-efficient, while ACA consumes relatively high power. As shown in Fig. 3, [...] efficiency. In particular, they are useful in applications in which ER is not important. Consequently, truncation can be used to design a hardware-efficient approximate adder (albeit with a high ER). On the other hand, the carry select scheme is more effective for an approximate adder design to achieve a high accuracy. Table 1 summarizes the different approximate adders, with their advantages and disadvantages highlighted (i.e., the metrics with moderate values are not shown). The MRED and NMED of approximate adders are represented by ED in the table.
APPROXIMATE MULTIPLIERS
Multiplication is implemented by partial product generation, accumulation and a final addition. A partial product is usually generated by an AND gate. The most common partial product accumulation structures are the Wallace and Dadda trees and the carry-save adder array. In a Wallace tree, the full adders (FAs) (or half adders (HAs)) in each layer operate in parallel without carry propagation. Thus, for an $n \times n$ multiplier, the delay of the Wallace tree grows with $O(\log(n))$. Also, the FAs in a Wallace tree can be considered to be (3:2) compressors and can be replaced by other counters or compressors (e.g., a (4:2) compressor) to further reduce the delay. The Dadda tree has a structure similar to that of the Wallace tree, but it uses as few adders as possible. In a carry-save adder array, the carry and sum signals generated by the FAs (or HAs) in a row are passed on to the FAs in the next row. FAs in a column operate in series. Hence, the delay of an $n \times n$ carry-save adder array is approximately $O(n)$, longer than that of a Wallace tree. Five main methodologies are used for approximating an unsigned multiplier: approximation in generating the partial products, approximation (including truncation) in the partial product tree, using approximate adders, counters or compressors in the partial product tree, using logarithmic approximation, and using an automated process such as a genetic programming method.
Review

Approximation in Generating Partial Products.
In [15], the accurate 2 × 2 multiplication result "1001" is simplified to "111" to save one output bit when both the inputs are "11", which results in an approximate 2 × 2 multiplier. Larger approximate multipliers are then designed by using the approximate 2 × 2 multiplier; the resulting design is referred to as the underdesigned multiplier (UDM). UDM introduces an error when generating the partial products; however, the accumulation remains accurate.
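The UDM construction can be sketched recursively. The block below assumes that a larger multiplier is assembled from four half-width multipliers whose results are shifted and added exactly, which is one plausible reading of "designed by using the approximate 2 × 2 multiplier"; the function names `udm_2x2` and `udm` are hypothetical, and `n` is assumed to be a power of two.

```python
def udm_2x2(a, b):
    """Approximate 2 x 2 multiplier: "11" x "11" yields "111" (7), not "1001" (9)."""
    return 7 if (a == 3 and b == 3) else a * b


def udm(a, b, n):
    """Behavioural sketch of an n x n UDM built from 2 x 2 blocks.

    Assumption: n is a power of two (n >= 2) and the sub-products are
    accumulated with exact shifts and additions.
    """
    if n == 2:
        return udm_2x2(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a_hi, a_lo = a >> h, a & mask
    b_hi, b_lo = b >> h, b & mask
    return (
        (udm(a_hi, b_hi, h) << n)
        + ((udm(a_hi, b_lo, h) + udm(a_lo, b_hi, h)) << h)
        + udm(a_lo, b_lo, h)
    )
```

In this model an error appears only when some pair of 2-bit operand slices are both "11", and the result never exceeds the exact product.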
Approximation in the Partial Product Tree.
The broken-array multiplier (BAM) omits some carry-save adders in an array multiplier in both the horizontal and vertical directions [23]. A more straightforward approach is to truncate some LSBs of the input operands so that a smaller multiplier is used to process the remaining most significant bits (MSBs). This truncated multiplier (TruM) will be considered as a baseline design. In [35], partial product perforation is applied to different multiplier structures (PPAM), which ignores several consecutive rows of partial products (not necessarily starting from the LSB). The error tolerant multiplier (ETM) consists of a multiplication section and a non-multiplication section [16]. A control block is used to decide if the multiplication section is to be activated to multiply the LSBs or the MSBs. In the static segment multiplier (SSM), no approximation is applied to the LSBs; either the MSBs or the LSBs of each operand are accurately multiplied depending on whether its MSBs are all zeros [30]. In [4], a bit-width aware approximate multiplication and a carry-in prediction scheme are used to construct an approximate Wallace tree multiplier (AWTM).
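Two of the schemes above, operand truncation and static segmentation, can be modelled in a few lines. This is a simplified sketch under stated assumptions: `truncated_mul`, `ssm_segment` and `ssm_mul` are hypothetical names, the segment width `m` is assumed to be at least half of the operand width `n`, and the selection rule (use the low `m` bits when all higher bits are zero, otherwise the top `m` bits) is an interpretation of the description of [30].

```python
def truncated_mul(a, b, t=4):
    """Behavioural sketch of a truncated multiplier: t LSBs of each operand are dropped."""
    return ((a >> t) * (b >> t)) << (2 * t)


def ssm_segment(x, n, m):
    """Pick an m-bit segment of an n-bit operand and the shift that restores its weight."""
    if (x >> m) == 0:          # all higher bits are zero: use the low m bits exactly
        return x, 0
    return x >> (n - m), n - m  # otherwise use the top m bits


def ssm_mul(a, b, n=16, m=8):
    """Behavioural sketch of a static segment multiplier (assumed model)."""
    seg_a, shift_a = ssm_segment(a, n, m)
    seg_b, shift_b = ssm_segment(b, n, m)
    return (seg_a * seg_b) << (shift_a + shift_b)
```

In this model, small operands are multiplied exactly by `ssm_mul`, while `truncated_mul` always discards the low bits regardless of operand magnitude.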
Using Approximate Counters or Compressors in the Partial Product Tree.
An approximate (4:2) counter is proposed for an inaccurate 4 × 4 Wallace multiplier [19]. The carry and sum are approximated as "10" for "100" in the approximate counter when all input signals are "1". The inaccurate 4 × 4 multiplier is then used to construct a larger multiplier that is referred to as ICM. In [27], two approximate (4:2) compressor designs are used in a Dadda multiplier with four different schemes. In this paper, the more accurate schemes 3 and 4 of the approximate compressor based multiplier (referred to as ACM-3 and ACM-4) are considered for comparison. An approximate (4:2) compressor with encoded inputs and improved accuracy is proposed for 4 × 4 multipliers that are used to build larger multipliers [3] (referred to as M16). A novel approximate adder is proposed to accumulate the partial products for the approximate multiplier with configurable error recovery [12]. Two approximate error accumulation schemes are proposed to compensate for the error generated by the approximate adder. In scheme 1 (AM1), errors are accumulated by using OR gates, while both OR gates and the approximate adders are used in scheme 2 (AM2). The truncation of half of the LSBs in the partial products in AM1 and AM2 results in TAM1 and TAM2, respectively. In [31], approximate half adders, full adders and (4:2) compressors are utilized to accumulate the altered partial products; it is denoted as the USask design.
Using Logarithmic Approximation.
By using the logarithm and anti-logarithm approximations of a binary number, [25] proposes a logarithmic multiplier (LM) implemented by shifting and addition; it is the baseline design for LMs. The accuracy of an LM is improved in [21] by using a set-one-adder (SOA) and in [2] by using an improved algorithm with exact and approximate adders (ILM-EA and ILM-AA).
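The shift-and-add nature of a logarithmic multiplier can be illustrated with the classic piecewise-linear approximation $\log_2(2^k(1+x)) \approx k + x$ for $0 \le x < 1$. The sketch below assumes that this is the approximation underlying the baseline LM; the function name `log_mul` and the fixed-point scaling are choices made here for illustration.

```python
def log_mul(a, b):
    """Behavioural sketch of a logarithmic multiplier using log2(1 + x) ~ x.

    Write a = 2**k1 * (1 + x1) and b = 2**k2 * (1 + x2). The approximate
    product is 2**(k1 + k2) * (1 + x1 + x2) when x1 + x2 < 1, and
    2**(k1 + k2 + 1) * (x1 + x2) otherwise. Only shifts and additions are used.
    """
    if a == 0 or b == 0:
        return 0
    k1 = a.bit_length() - 1            # position of the leading one
    k2 = b.bit_length() - 1
    scale = 1 << (k1 + k2)             # 2**(k1 + k2)
    f1 = (a - (1 << k1)) << k2         # x1 scaled by 2**(k1 + k2)
    f2 = (b - (1 << k2)) << k1         # x2 scaled by 2**(k1 + k2)
    frac_sum = f1 + f2
    if frac_sum < scale:               # x1 + x2 < 1
        return scale + frac_sum
    return frac_sum << 1               # x1 + x2 >= 1
```

Under this approximation the result never exceeds the exact product, and the relative error is at most 1/9 (about 11.1%), reached for example by `log_mul(3, 3)`, which returns 8 instead of 9.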
Using an Automated Process.
In [28], 471 8 × 8 approximate multipliers are evolved by a multi-objective Cartesian genetic programming method, which are then used to construct 16 × 16 approximate multipliers [29] (i.e., BrnoA1, BrnoA2, BrnoA3 and BrnoA4 for different construction methods).
Evaluation
Monte Carlo simulation results for 16 × 16 multipliers show that most of the approximate designs result in large ERs close to 100%. However, ICM has a low ER of 5.45%; some configurations of BrnoA1, BrnoA2, BrnoA3 and BrnoA4 also show lower ERs than the other designs. Thus, MRED, NMED, delay (for delay-optimized synthesis), power (for area-optimized synthesis) and PDP (for both delay- and area-optimized syntheses) are jointly considered to evaluate the approximate multipliers, as shown in Figs. 4, 5 and 6. Fig. 4 shows that BrnoA1 is the most accurate design with very small values of MRED and NMED because only one approximate 8 × 8 multiplier is used to construct a 16 × 16 BrnoA1. ICM and the truncated Wallace multiplier (TruMW) are faster than the other designs for a low MRED and NMED, whereas TAM1, ILM-AA, SOA and PPAM are the fastest when the MRED and NMED are higher. AM1 is also very fast; however, BAM, SSM, ETM and the truncated array multiplier (TruMA) are relatively slow. As shown in Fig. 5, BAM is the most power-efficient followed by TruMA and BrnoA3, while UDM, AM1, AM2, M16 and BrnoA4 are relatively power-hungry. The power dissipation of TAM1 and TAM2 is in the medium range.
As shown in the delay-optimized results in Fig. 6(a), TAM1, BrnoA2 and BrnoA3 have smaller PDPs than other approximate designs (including TruMW) for a similar MRED. For the area-optimized results in Fig. 6(b), ICM, TAM1, TAM2 and BrnoA3 show smaller PDPs than other designs (including TruMA).
A summary of the accuracy and circuit characteristics for the approximate multipliers is shown in Table 2 . Truncation is an effective scheme for a high-speed operation with a relatively low PDP. By using the logarithmic approximation, a multiplier tends to be hardware-efficient but with relatively low accuracy. The approximate multipliers designed using the automated process perform well when a high accuracy is required. 
CONCLUSION
In this paper, designs of approximate adders and multipliers are evaluated in terms of accuracy and circuit characteristics. Specifically, the performance and power consumption are compared by obtaining the synthesis results under delay and area constraints. The simulation results show that truncation is an effective scheme to improve the hardware efficiency of an arithmetic circuit. Most of the approximate adders in the literature are designed for a high speed and a low ER (e.g., 0.02% for CSA-5) by cutting off the carry propagation chain. A truncated adder has a high ER close to 100%; however, it has a lower power dissipation and a lower PDP than most approximate designs at a similar MRED (except for LOA and CCBA). The performance (or speed) of a truncated adder is not as high as that of some approximate adders.
Unlike the adder, a truncated Wallace multiplier is faster than most approximate multipliers (except for ICM and TAM1), but with a higher power at a similar MRED. For a similar accuracy, BAM and BrnoA3 consume less power than a truncated multiplier. In terms of PDP, TAM1, TAM2, ICM, BrnoA2 and BrnoA3 have smaller values than a truncated multiplier for a given MRED.
