Area Efficient High Speed Approximate Multiplier with Carry Predictor  by Sunny, Anju et al.
 Procedia Technology  24 ( 2016 )  1170 – 1177 
2212-0173 © 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license 
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the organizing committee of ICETEST – 2015
doi: 10.1016/j.protcy.2016.05.072 
 
International Conference on Emerging Trends in Engineering, Science and Technology 
(ICETEST - 2015) 
 
Area Efficient High Speed Approximate Multiplier with Carry 
Predictor 
 
Anju Sunny^a,*, Binu K. Mathew^b, Dhanusha P.B^c

^a,b,c Dept. of Electronics and Communication, SAINTGITS College of Engineering, Kottayam, India, 686532
 
Abstract 
 
Multimedia and image processing applications may tolerate errors in calculations and still generate meaningful and
useful results. This work deals with a high-speed approximate multiplier built from a TDM tree and a carry prediction
circuit. The modified multiplier uses an optimised TDM carry-save tree, which reduces device utilization on the FPGA
as well as the combinational path delay and power consumption. The proposed design is analyzed using simulation and
implementation results on the Xilinx Spartan 3E family.
 
Keywords: Approximate Carry Adder;  Three Dimensional Reduction method;  Approximate Multiplier 
 
 
1. Introduction 
 
Exact and precise models and algorithms are not always required for multimedia and image processing
operations. Approximate computing relaxes the requirement of fully exact and completely deterministic building
blocks in order to design energy-efficient systems. In digital designs, integer multiplication is one of the
fundamental building blocks, and it strongly affects microprocessor and DSP performance. A faster digital circuit
can be obtained by adopting a speculative (prediction-based) approach.
Speculative digital circuits achieve faster operation by employing a speculative functional unit, an arithmetic
unit that predicts the carry of one or more cells in the circuit without waiting for the actual carry propagation
to take place. This is similar to a predictor in a microprocessor.
Here we consider a speculative multiplier built around a predictive carry-save reduction tree that works in
three steps: partial product recoding, partial product partitioning and speculative compression. The speculative tree
uses (m:2) counters, which are faster than traditional compressors based on half adders and full adders. The tree is
complemented by a fast carry-propagate adder and an error rectification circuit. Speculative multipliers have
higher speed than their conventional counterparts.

* Corresponding author: Anju Sunny, 8086212742, anjus.sun@gmail.com

This paper is organized into five sections. Section 2 discusses previous work related to approximate
multipliers. The approximate multiplier and the proposed design with optimized TDM are discussed in Section 3. The
simulation results and the implementation details are given in Section 4. Section 5 concludes the work and outlines
the scope of future work.
 
2. Previous Works 
 
Reference [10] shows that approximate circuits have higher performance than precise logic circuits. Many inexact
multipliers have been proposed in the literature [4] [6] [7] [13]. These designs employ a truncated multiplication
method. In [6], an inexact array multiplier is obtained by ignoring selected least significant bits in the partial
products. An inexact multiplier with a correction constant has been proposed in [13].
A variable correction constant inexact multiplier is proposed in [4]. This method modifies the correction term
according to column n-k-1: if the partial products in column n-k-1 are one, the correction factor is increased, and
if all partial products in that column are zero, the correction factor is decreased. In [7], a basic 2x2 multiplier
block is suggested for constructing larger multiplier arrays. In all these designs the area was found to be very high.
In [11], another approximate multiplier based on two approximate [4:2] compressors has been proposed. This
multiplier requires less area than multipliers using the truncation technique; however, the error percentage
was found to be very high.
Reference [12] describes another approximate multiplier design, which uses prediction units for the carry signal and
has a lower error percentage than [11]. SFUs (Speculative Functional Units) are prediction circuits that can
be considered as black-box entities which are faster than their non-speculative counterparts, independently of the
particular implementation [8]. Hence approximate multipliers using SFUs also aim to achieve delay improvements
while introducing little power and area overhead. This multiplier uses a Carry Save Adder (CSA) tree
[14] for partial product reduction, wherein the carry outputs are preserved rather than propagated, thereby
reducing the delay.
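
The carry-save principle can be illustrated with a short behavioural sketch. The function name `carry_save_add` and the operand values are illustrative assumptions, not part of the cited designs; the sketch only shows that three operands can be compressed into a sum word and a carry word without any carry rippling between bit positions.

```python
def carry_save_add(x, y, z):
    """Compress three integers into a (sum, carry) pair, bit position by bit position.

    Each bit position acts as an independent full adder, so no carry
    travels between positions inside this step.
    """
    partial_sum = x ^ y ^ z                          # sum output of every full adder
    saved_carry = ((x & y) | (x & z) | (y & z)) << 1  # carry outputs, weighted by two
    return partial_sum, saved_carry


s, c = carry_save_add(13, 7, 9)
# A single carry-propagate addition is needed only at the very end.
total = s + c
print(s, c, total)  # 3 26 29
```

Only the final `s + c` needs a carry-propagate adder, which is why CSA trees shorten the critical path of partial product reduction.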
Popular CSA schemes include the Wallace tree and the Dadda multiplier. The Wallace tree [1] [9] results in long and
irregular wires along the columns to connect to the CSA. The wire capacitance in turn increases the delay and energy
of the multiplier, and the wires are difficult to lay out. Dadda refined Wallace's method by introducing a counter
placement strategy that requires fewer counters in the reduction stage, but at the cost of a larger Carry Propagate
Adder (CPA) [2] [9]. The delay from an input to an output of a full adder is not the same for every input, and it
also depends on the particular transition (0-to-1 or 1-to-0). It is therefore possible to devise different
realizations of a full adder in which a specific signal path is favored with respect to the others, so that signal
propagation along this path takes a minimal amount of time [3]. The CSA scheme that takes these delay differences
into account is the Three Dimensional Method (TDM) [3], where the partial product array is represented in space and
time. This is followed by a speculative adder [5].
 
3. Approximate Multiplier without optimized TDM Design 
 
3.1. Architecture 
 
The approximate multiplier without optimized TDM contains the modules shown in Fig. 1: partial product
recoding, partial product partitioning, speculative compression, the TDM tree and the speculative adder. If an error
signal is generated, a correction word is applied to the output of the speculative compressor, which is
then followed by the TDM tree and a Carry Look Ahead Adder.
 
Fig. 1. Architecture of Approximate Multiplier with Carry Speculation Compressor 
 
i) Partial Product Recoding  
Consider two partial products $a_i b_j$ and $a_j b_i$ in column $i+j$ of the partial product matrix (PPM). We
define two modified partial products:

$A_{i,j} = a_i b_j \text{ AND } a_j b_i$ (1)
$O_{i,j} = a_i b_j \text{ OR } a_j b_i$ (2)

Thus the pair of partial products $a_i b_j$ and $a_j b_i$ can be replaced with the modified partial products
$A_{i,j}$ and $O_{i,j}$. The advantage of this recoding is that it introduces lower-probability terms in the PPM.
The probability that $A_{i,j}$ is high is $(0.25)^2 = 0.0625$, much lower than that of an original partial product
(0.25), whereas the probability that $O_{i,j}$ is high is 7/16. The speculative carry tree therefore uses the
lower-probability terms, to minimize the probability of misprediction. The recoding does not modify the total
number of partial products, but introduces a very small additional delay for the recoded partial products. Fig. 2
shows a 16 x 16 PPM after recoding.
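
The probabilities quoted above can be checked by exhaustive enumeration. The sketch below assumes that the four operand bits are independent and equally likely to be 0 or 1 (the assumption behind the 0.25 figure) and that i differs from j; the function names are illustrative. It also confirms that recoding leaves the arithmetic sum of the two partial products unchanged.

```python
from fractions import Fraction
from itertools import product


def recode(p, q):
    """Recode two partial-product bits of the same column into (A, O)."""
    return p & q, p | q


def recoding_stats():
    """Return (P[A = 1], P[O = 1]) over all values of a_i, b_j, a_j, b_i."""
    count_a = count_o = total = 0
    for ai, bj, aj, bi in product((0, 1), repeat=4):
        p, q = ai & bj, aj & bi           # the two original partial products
        a_term, o_term = recode(p, q)
        assert a_term + o_term == p + q   # the column sum is preserved
        count_a += a_term
        count_o += o_term
        total += 1
    return Fraction(count_a, total), Fraction(count_o, total)


prob_a, prob_o = recoding_stats()
print(prob_a, prob_o)  # 1/16 7/16
```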
 
Fig. 2. 16 X 16 PPM with Recoded Terms 
 
ii) Partial Product Partitioning  
Only the lower-probability terms $A_{i,j}$ are added in the speculative carry-save tree. Partial products that
belong to the largest columns of the PPM are singly recoded. In Fig. 2, the partial products in
columns 11, 12, ..., 22 are recoded.
iii) Speculative Compression  
Although the probability of $A_{i,j}$ is lower than that of the original partial products, simply removing the
$A_{i,j}$ terms would lead to a large misprediction probability. Thus, instead of omitting these terms, we sum
them in an approximate manner using speculative compressors. An (m:2) speculative counter has m inputs
($x_0, \dots, x_{m-1}$) and only two outputs, Sum (S) and Carry (C). The speculative counter counts the number of
high input bits and determines the output bits on the supposition that not more than three inputs are high.
Analogously to full adders and half adders, the output C has twice the weight of S, so that

$2C + S = x_0 + x_1 + \dots + x_{m-1}$ for $x_0 + x_1 + \dots + x_{m-1} \le 3$ (3)

For m = 2 and m = 3, the speculative counter always produces the correct number of high input bits and hence
corresponds to a half adder (m = 2) or a full adder (m = 3). For m > 3, it is not possible to represent the sum
$x_0 + x_1 + \dots + x_{m-1}$ using only the C and S signals for all possible input configurations. The speculative
counter computes its outputs on the supposition that not more than three inputs are high; if this criterion is not
met, an error occurs: the multiplication result is wrong and must be corrected.
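
A small behavioural model makes the validity range of the counter concrete. The sketch below uses the output definitions given for the (m:2) counter in this section (S as the XOR of the inputs, C high when at least two inputs are high); the function names are illustrative assumptions. It checks that 2C + S equals the input count whenever at most three inputs are high, and counts the input patterns that would be mispredicted.

```python
from itertools import product


def speculative_counter(bits):
    """Return (C, S) of an (m:2) speculative counter for a tuple of input bits."""
    s = 0
    for x in bits:
        s ^= x                        # S is the parity of the inputs
    c = 1 if sum(bits) >= 2 else 0    # C is high when at least two inputs are high
    return c, s


def count_mispredictions(m):
    """Count the input patterns of an (m:2) counter for which 2C + S is wrong."""
    wrong = 0
    for bits in product((0, 1), repeat=m):
        c, s = speculative_counter(bits)
        exact = sum(bits)
        if exact <= 3:
            assert 2 * c + s == exact  # exact within the speculated range
        elif 2 * c + s != exact:
            wrong += 1
    return wrong


for m in (2, 3, 4, 5):
    print(m, count_mispredictions(m))
```

For m = 2 and m = 3 no pattern is mispredicted, which matches the half adder and full adder cases; for m = 5 the six patterns with four or five high inputs are the ones that need correction.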
 
 
 
 
 
 
 
 
 
 
 
 
 
Fig. 3. (5:2) Speculative Compressor 

In an (m:2) speculative counter, the output S is obtained by XOR-ing the inputs. The carry signal C is harder to
compute: it corresponds to the function $f_{\ge 2}(x_0, \dots, x_{m-1})$, which is high when at least two inputs are
high. Partitioning the inputs into k groups, it can be decomposed as

$f_{\ge 2}(x_0, \dots, x_{m-1}) = f_{\ge 2}(x_0, \dots, x_{m_1-1}) + \dots + f_{\ge 2}(x_{m_{k-1}}, \dots, x_{m-1}) + f_{\ge 2}(x_0 + \dots + x_{m_1-1}, \dots, x_{m_{k-1}} + \dots + x_{m-1})$ (4)

where + denotes the logical OR.
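
The decomposition in (4) can be verified exhaustively for small counters. The sketch below reads + in (4) as logical OR and treats the group sizes as free parameters; the function names and the chosen partitions are illustrative assumptions. It states that at least two inputs are high exactly when some group contains two high inputs or at least two groups each contain a high input.

```python
from itertools import product


def at_least_two(bits):
    """f>=2: 1 when at least two of the given bits are high."""
    return int(sum(bits) >= 2)


def at_least_two_split(bits, sizes):
    """Evaluate f>=2 through the group decomposition of Eq. (4)."""
    groups, start = [], 0
    for size in sizes:
        groups.append(bits[start:start + size])
        start += size
    within = any(at_least_two(g) for g in groups)   # two high bits in one group
    group_or = [int(any(g)) for g in groups]        # OR of each group
    across = at_least_two(group_or)                 # high bits in two different groups
    return int(within or across)


def decomposition_holds(sizes):
    """Check Eq. (4) for every input pattern of a counter with sum(sizes) inputs."""
    m = sum(sizes)
    return all(
        at_least_two(bits) == at_least_two_split(bits, sizes)
        for bits in product((0, 1), repeat=m)
    )


print(decomposition_holds((2, 3)), decomposition_holds((2, 2, 2)))  # True True
```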
 
iv) TDM CSA Tree 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Fig. 4. Input to TDM CSA tree 
 
Fig. 4 shows the input to the TDM CSA tree. The TDM takes into account the different arrival times of its
inputs and makes suitable connections to the full adders so that the delay along each path is roughly the
same. Thus, the late-arriving outputs of the speculative counters are connected to the shortest-delay routes in the
TDM. Fig. 5 shows a TDM carry-save tree with eight inputs: five partial products (p0, p1,
p2, p3, p4) and three intermediate carries (c0, c1, c2) from the previous vertical compressor slice.
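
The effect of arrival-time-aware wiring can be illustrated with a simple timing model. This is only a sketch and not the TDM algorithm of [3]: it assumes a uniform full adder delay of one unit (`FA_DELAY` is a hypothetical value), partial products arriving at time 0 and carries at time 1, and it ignores the carries sent to the next column. It compares wiring the inputs in their listed order with always combining the three earliest-arriving signals.

```python
import heapq

FA_DELAY = 1  # assumed unit delay of a full adder (illustrative value)


def reduce_in_order(arrivals):
    """Reduce one column to two bits, taking inputs three at a time in listed order."""
    pending = list(arrivals)
    while len(pending) > 2:
        group, pending = pending[:3], pending[3:]
        pending.append(max(group) + FA_DELAY)   # sum bit is ready after the latest input
    return max(pending)


def reduce_earliest_first(arrivals):
    """Reduce one column to two bits, always combining the three earliest signals."""
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 2:
        group = [heapq.heappop(heap) for _ in range(3)]
        heapq.heappush(heap, max(group) + FA_DELAY)
    return max(heap)


# Order p0, p1, c0, p2, p3, c1, p4, c2: partial products at t = 0, carries at t = 1.
column = [0, 0, 1, 0, 0, 1, 0, 1]
print(reduce_in_order(column), reduce_earliest_first(column))  # 3 2
```

In this model, mixing late carries into the first adders costs one extra unit of delay, while keeping the first adders on early signals lets the carries join later without lengthening the path.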
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Fig. 5. TDM Tree with Eight Inputs 
 
v) Error Correction Block
The individual error signals are ORed to generate the error output; this operation is outside the critical path. The
function realized by the error correction block can be described using half adders and full adders. For example, a
(4:2) correction circuit must produce a single-bit error word EW with weight two, which is high only if all the
inputs are 1. The output of the error correction block is given to the TDM CSA tree, which is followed by a fast
adder such as a Carry Look Ahead Adder [12], and hence a non-speculative result is generated.
vi) Approximate Carry Adder (ACA)
The first stage of the ACA has a parallel prefix structure. Propagate/Generate (P/G) signals for single digits are
computed first. In a k-bit ACA, each output bit depends on the previous k bits, so for each position i we need the
signals $P_{(i-1)-(i-k)}$ and $G_{(i-1)-(i-k)}$. They can be built with the scheme depicted in Fig. 6 (where k = 4).
The sum output bits $S_i$ are then easily obtained from the P/G signals. An error occurs when there is a chain of
more than k propagates in the addends. To check for the presence of an error, we must consider all chains of length
k + 1 and check whether any of them contains solely propagates. The sum and the error signal are obtained from the
equations given below:

$S_i = P_i \oplus G_{(i-1)-(i-k)}$ (5)
E = σ (i– 1) – (i– k) (6) 
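
The behaviour of a windowed speculative adder can be sketched in a few lines. This is a bit-serial behavioural model, not the parallel prefix circuit described above, and the function name `aca_add` is an assumption. Each result bit uses a carry computed only from the k bits below it. The error flag here is raised whenever those k bits are all propagates and lower bits exist; this is a conservative criterion chosen for the sketch so that an unflagged result is always exact, and it may differ in detail from the detection circuit of the cited design.

```python
def aca_add(a, b, n, k):
    """Speculatively add two n-bit integers with a k-bit carry window.

    Returns (result, error), where result has n + 1 bits and error is True
    when the speculated carries cannot be guaranteed to be correct.
    """
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)] + [0]  # propagate bits, P_n = 0
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]        # generate bits
    result, error = 0, False
    for i in range(n + 1):
        lo = max(0, i - k)
        carry = 0                         # the carry into the window is assumed to be 0
        for j in range(lo, i):
            carry = g[j] | (p[j] & carry)
        result |= (p[i] ^ carry) << i     # S_i = P_i xor G over the window
        # A truncated window made only of propagates may hide a real carry.
        if i - k >= 1 and all(p[j] for j in range(lo, i)):
            error = True
    return result, error


print(aca_add(5, 3, 8, 3))   # (8, True) is not expected here: prints (8, False)
print(aca_add(15, 1, 8, 3))  # (0, True): the carry chain is longer than the window
```

In the second call the exact sum is 16, so the flagged result must be corrected before use.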
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Fig. 6. Generation of P/G Signal 
 
3.2. Modified Approximate Multiplier with optimised  TDM 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Fig. 7. Modified TDM tree 
 
In Fig. 5, the TDM is fed with eight inputs: five partial products (p0, p1, p2, p3, p4) and three intermediate
carries (c0, c1, c2) from the previous tree. The TDM design is modified to reduce the delay. Since c0, c1 and c2
are generated after one unit delay, they are fed to the counters as shown in Fig. 7. Unlike the previous case, where
both the partial products and the carries are routed to the first stage, here the horizontal carries coming from the
previous vertical compressor slices are given to the second stage of half adders and full adders. In the previous
case, the counters in the first stage have to wait for the arrival of these horizontal carries, which increases the
delay of the multiplier; the proposed design avoids this wait, resulting in faster operation. Applying this
principle throughout, a faster multiplier can be constructed. Moreover, the use of more half adders reduces the
resource utilisation.
 
4. Synthesis and Simulation Results 
 
To assess the effectiveness of the modified approximate design, it is synthesized on the Xilinx Spartan 3E
family. Table 1 shows that in the modified design the delay is reduced by 45.4% (from 95.8 ns to 52.27 ns) and the
logic utilisation is reduced. The estimated power consumption also drops by 7 mW (from 102 mW to 95 mW).
 
Table 1. Synthesis Results 
 
Parameter                       Approximate multiplier       Modified design
                                without optimised TDM
Delay (ns)                      95.8                         52.27
No. of slices                   696                          660
No. of 4-input LUTs             1205                         1166
Total estimated power (mW)      102                          95
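
The relative improvements implied by Table 1 can be recomputed directly from its entries. The dictionary keys and the helper name below are illustrative; the numbers are the table values.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# (without optimised TDM, modified design), taken from Table 1
table1 = {
    "delay_ns": (95.8, 52.27),
    "slices": (696, 660),
    "lut4": (1205, 1166),
    "power_mw": (102, 95),
}

for name, (before, after) in table1.items():
    print(f"{name}: {reduction_percent(before, after):.1f}% lower")
```

The delay entry reproduces the 45.4% figure quoted in the text; the area and power reductions are smaller, in the range of a few percent.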
 
4.1. Simulation Results  
Behavioral simulation is done prior to FPGA implementation to check the functionality of the circuit. After the
different phases of implementation, namely translate, map, and place and route, post-route simulation is done to
observe the exact performance of the architecture.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Fig. 8. Approximate Multiplier without Optimisation 
 
Fig.9. Modified Approximate Multiplier with Optimisation 
 
5. Conclusion 
 
A high-speed approximate multiplier design has been proposed. The proposed design uses an optimised TDM
tree. The circuit recodes some of the partial products and uses a speculative compression tree to sum the recoded
partial products. A speculative adder is used for the final carry-propagate addition. The design's functionality has
been verified using Xilinx ISE Design Suite 14.5 (web edition). A comparison of the proposed design with the
conventional approximate multiplier showed that it operates faster. The synthesis and simulation results
showed that the proposed multiplier gives a 45.4% improvement in delay, lower resource utilization and lower
power consumption compared to the multiplier without optimisation. In cases where multiplier speed is not critical,
the use of speculative units remains unjustified. The performance of the approximate multiplier can be improved
further by considering don't-care conditions and by using a variable latency adder instead of the almost correct
adder.
 
References 
 
[1]. C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Comput., vol. EC-13, Feb. 1964, pp. 14–17.
[2]. L. Dadda, "Some schemes for parallel multipliers," Alta Frequenza, vol. 34, Mar. 1965.
[3]. V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A method for speed optimized partial product reduction and generation of fast parallel
multipliers using an algorithmic approach," IEEE Trans. Comput., vol. 45, no. 3, Mar. 1996, pp. 294–306.
[4]. E. J. King and E. E. Swartzlander, Jr., "Data dependent truncated scheme for parallel multiplication," in Proceedings of the Thirty
First Asilomar Conference on Signals, Circuits and Systems, 1998, pp. 1178–1182.
[5]. A. Cilardo, "A new speculative addition architecture suitable for two's complement operations," in Proc. Design, Autom., Test Eur.
Conf. Exhib., Apr. 2009, pp. 664–669.
[6]. H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, C. Lucas, "Bio-Inspired Imprecise Computational Blocks for Efficient VLSI
Implementation of Soft-Computing Applications," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 57, no. 4, April
2010, pp. 850–862.
[7]. P. Kulkarni, P. Gupta, and M. D. Ercegovac, "Trading accuracy for power in a multiplier architecture," Journal of Low Power
Electronics, vol. 7, no. 4, 2011, pp. 490–501.
[8]. A. A. Del Barrio, S. O. Memik, M. C. Molina, J. M. Mendias, and R. Hermida, "A distributed controller for managing speculative
functional units in high level synthesis," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 3, Mar. 2011, pp. 350–363.
[9]. Vasudev G. and Rajendra Hegadi, "Design and Development of 8-Bits Fast Multiplier for Low Power Applications," in Proceedings
of IEEE IACSIT International Journal of Engineering and Technology, vol. 4, no. 6, December 2012.
[10]. J. Liang, J. Han, F. Lombardi, "New Metrics for the Reliability of Approximate and Probabilistic Adders," IEEE Transactions on
Computers, vol. 63, no. 9, 2013, pp. 1760–1771.
[11]. A. Momeni, J. Han, P. Montuschi and F. Lombardi, "Design and analysis of approximate compressors for multiplication," IEEE
Transactions on Computers, 2013.
[12]. A. Cilardo, D. De Caro, N. Petra, F. Caserta, N. Mazzocca, E. Napoli, A. G. M. Strollo, "High Speed Speculative Multipliers Based
on Speculative Carry-Save Tree," IEEE Transactions on Circuits and Systems, vol. 61, no. 12, December 2014.
[13]. M. J. Schulte and E. E. Swartzlander, Jr., "Truncated multiplication with correction constant," VLSI Signal Processing VI, 1993,
pp. 388–396.
[14]. N. H. E. Weste, D. M. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, fourth edition.
 
