A Fine-Grained Arithmetic Optimization Technique for High-Performance/Low-Power Data Path Synthesis
Junhyung Um, Taewhan Kim, C. L. Liu
Dept. of Computer Science & Electrical Engineering and Adv. Information Technology Research Center, Korea Adv. Institute of Science & Technology, Taejon, Korea
Dept. of Computer Science, National Tsing Hua Univ., Hsinchu, Taiwan R.O.C.
Abstract: The Wallace-tree compressor style has been widely recognized as one of the most effective implementation schemes for arithmetic computations in VLSI design. However, the scheme has been applied only in a rather restrictive way, that is, for implementing fast multipliers and for generating fixed structures without considering the characteristics of the input signals. The contributions of our work are (1) to extend the applicability of the Wallace scheme to any arithmetic circuit which consists of additions/subtractions/multiplications globally (instead of applying it to each operation) to produce a globally efficient architecture of the circuit; (2) to optimize the timing of the circuit for uneven signal arrival profiles (specifically, we present an efficient algorithm for generating a delay-optimal (bit-level) carry-save addition structure of an arithmetic circuit); and (3) to provide a comprehensive analysis of the switching activity of a (bit-level) carry-save addition structure, based on which we derive an effective algorithm for synthesizing low power circuits. Putting these arithmetic optimization solutions together, a circuit designer will be able to fully understand the synthesis of arithmetic circuits based on the bit-level carry-save addition.
1 Introduction
There are many design objectives to be optimized in the vari-
ous phases of the synthesis process. Among them, timing and
power consumption of the circuit are two of the most impor-
tant design objectives.
Timing optimization: In behavioral synthesis[1], timing refers to two factors, namely, the number of control steps (i.e., latency) and the cycle time. The cycle time is usually specified by the designer, and the latency is determined by scheduling the operations under the given cycle time. In RTL synthesis, the timing of the design is optimized to meet the given cycle time. There are a number of ways to carry out timing optimization at the operation level. The tasks are (1) binding operations to module implementations and (2) balancing operation trees. In RTL synthesis, these tasks are tried repeatedly to meet the requirement of the specified cycle time. In the logic synthesis phase, gate-level optimizations are performed to satisfy the cycle time if the timing requirement was not met through RTL synthesis.
Power optimization: In behavioral synthesis, if several modules with a range of power/area/delay costs are available for the implementation of an operation, an appropriate allocation and binding of operations to modules can lead to a design with reduced power consumption[2]. The total switching capacitance of a data path depends highly on how operations are bound to functional modules and how variables are bound to registers. In RTL synthesis, estimating and analyzing the power consumption of a design, for exploration during behavioral synthesis, are major research issues. In logic synthesis, various optimization techniques, such as path balancing, clock gating, encoding and retiming, have been employed to reduce switching activities.
The outcome of gate-level optimization in logic synthesis
depends heavily on the outcome of the operation-level opti-
mization in RTL synthesis. However, it is not always true that
using the fastest or the lowest power consuming implemen-
tation for each operation in RTL synthesis will produce the
fastest or the lowest power consuming circuit in logic synthe-
sis. This is mainly because ‘boolean’ optimization is not able
to properly reﬂect the internal structures of the arithmetic im-
plementations.
In contrast to the conventional two-step optimization in
which individual operations are implemented in a way to min-
imize the timing/power of the design and subsequently, the
boolean equations of the logic structure are optimized, we
propose a one-step approach which fully exploits the structure
of the arithmetic computation to minimize timing/power of
the resultant design. We accomplish this by using the concept
of the bit-level carry-save addition of the Wallace scheme[3].
However, unlike the conventional application of the Wallace
scheme which assumes equal signal arrival times and signal
switching activities of input bits, our algorithm generalizes
the scheme to make use of nonuniform arrival times and sig-
nal switching activities. In a strict sense, a full/half adder
(FA/HA) is not a logic primitive. Rather, it is a composi-
tion of logic gates. However, since it has a bit-level function-
ality with a simple composition, its implementation is much
closer to a gate rather than to a word-level operation. In that
context, our approach is a one-step data path synthesis procedure which combines RTL and logic synthesis, since the entire arithmetic expression is transformed into a circuit consisting of FAs/HAs.

Permission to make digital/hardcopy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAC 2000, Los Angeles, California
(c) 2000 ACM 1-58113-188-7/00/0006..$5.00

The key difference between our optimization algorithm
and the Wallace scheme is that our algorithm is driven by the
arrival times of the input signals if minimization of timing de-
lay is the primary objective, and is driven by the signal tran-
sition activity if minimization of power is the primary objec-
tive. Consequently, applying our algorithm to all arithmetic
expressions in a circuit iteratively will improve the overall
performance/power-consumption of the circuit. Speciﬁcally,
in this paper, we provide a polynomial time algorithm for
generating a delay-optimal FA allocation for arithmetic ex-
pressions. Furthermore, we analyze the relations between FA
allocation and its power consumption, and establish a set of
important properties, based on which we derive an efﬁcient
FA allocation algorithm for low power.
2 Preliminaries
2.1 FA Allocation: Definitions
Our algorithm for optimizing arithmetic circuits is based on the (bit-level) carry-save addition of the Wallace scheme, which uses FAs as the basic units. We first examine the structure of an FA and how it is utilized in the implementation of arithmetic expressions. An FA is a combinational circuit that sums three input bits of the same weight, say $2^i$, and produces two outputs, one with weight $2^i$ (sum-bit) and the other with weight $2^{i+1}$ (carryout-bit). Consequently, an iterative application of FAs to a matrix of bit addends will reduce the bit addends in each column of the matrix to two bit addends (see footnote 1). The reduced bit addends of the columns form two operands which are then added together by a normal (carry-propagating) adder to produce the final sum. We refer to the structure of FAs for reducing addends to produce the final sum as the FA-tree of the implementation, and the normal adder placed at the root of the FA-tree is called the final adder of the FA-tree. Figure 1 shows an example of FA-tree allocation for implementing the expression F = X+Y+Z+W where X = x1x0, Y = y1y0, Z = z0 and W = w1w0 (see footnote 2). First, the expression is translated into an addend matrix with 4 rows as shown in Figure 1(a). Addends x0, y0 and z0 in bit column 0 are added together to produce sum-bit S(x0,y0,z0) and carryout-bit C(x0,y0,z0). Also, addends x1, y1 and w1 in column 1 are added to produce S(x1,y1,w1) and C(x1,y1,w1), as shown in the reduced matrix with two rows in Figure 1(b). Consequently, two FAs are allocated in the FA-tree as shown in Figure 1(c), in which the two operands (S(x1,y1,w1) w0) and (C(x1,y1,w1) C(x0,y0,z0) S(x0,y0,z0)) are summed up by a final adder. Note that, depending on the arrival times or switching activities of the addends, an optimal allocation of FAs will minimize the timing delay or power consumption of the resultant FA-tree.

Figure 1: An example of FA allocation for F = X+Y+Z+W
(a) Addend matrix with 4 rows:
        col-1   col-0
  X      x1      x0
  Y      y1      y0
  Z              z0
  W      w1      w0
(b) Reduced matrix with 2 rows:
        col-2          col-1          col-0
                       S(x1,y1,w1)    w0
        C(x1,y1,w1)    C(x0,y0,z0)    S(x0,y0,z0)
(c) FA-tree: two FAs (one for col-0, one for col-1) whose outputs feed the final adder.

Footnote 1: One HA, if needed, shall be allocated at the end of an FA-tree for the addends in each column of the matrix to ensure that exactly two addends remain at the end.
Footnote 2: For convenience of explanation, we use an unsigned type of vector representation in this paper.
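The reduction of Figure 1 can be checked at the bit level with a short sketch. It is illustrative only: the function names `full_adder` and `figure1_reduction` are hypothetical and not part of the paper, and the final adder is modeled simply as integer addition.

```python
def full_adder(x, y, z):
    """Sum three bits of equal weight; return (sum_bit, carryout_bit)."""
    s = x ^ y ^ z                      # sum-bit, weight 2^i
    c = (x & y) | (y & z) | (x & z)    # carryout-bit, weight 2^(i+1)
    return s, c


def figure1_reduction(X, Y, Z, W):
    """Reduce the 4-row addend matrix of F = X+Y+Z+W to two rows, then add them."""
    x0, x1 = X & 1, (X >> 1) & 1
    y0, y1 = Y & 1, (Y >> 1) & 1
    z0 = Z & 1
    w0, w1 = W & 1, (W >> 1) & 1
    s0, c0 = full_adder(x0, y0, z0)    # FA allocated in col-0
    s1, c1 = full_adder(x1, y1, w1)    # FA allocated in col-1
    row_a = (s1 << 1) | w0                  # operand S(x1,y1,w1) w0
    row_b = (c1 << 2) | (c0 << 1) | s0      # operand C(x1,y1,w1) C(x0,y0,z0) S(x0,y0,z0)
    return row_a + row_b               # final (carry-propagating) adder


print(figure1_reduction(3, 2, 1, 3))
```

For every 2-bit X, Y, W and 1-bit Z, the two reduced operands add up to the same value as X+Y+Z+W, which is what makes the choice of FA inputs a pure timing/power decision.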
3 FA Allocation for Timing
3.1 Delay Model for FA
Since an FA has two outputs, there are two constant internal
delay parameters Dc and Ds where Dc represents the delay
from the three input ports of the FA to the carryout port and
Ds represents the delay from the three input ports to the sum
port. Note that the values of Dc and Ds vary depending on
the target technology.
3.2 Motivational Examples
Figure 2 shows the effect of different selections of inputs to the FAs on the timing of the final tree. The input operands to be added are X = x1x0, Y = y1y0, Z = z0 and W = w1w0, and their (bit-level) arrival times are given in the parentheses next to the addends, as shown in Figure 2(a). In the example, we assume that Ds=2 and Dc=1. Figures 2(a)-(c) show three possible FA-tree allocations. Figure 2(a) shows an FA-tree with a fixed selection of addends as inputs to FAs, as the Wallace scheme does, where the three input addends x0, y0 and z0 in col-0, and the three input addends x1, y1 and w1 in col-1, are used as inputs to FAs. Consequently, there are two critical paths, shown as dotted arrows in the figure, where the time delay is t(x0) + Ds = t(x1) + Ds = 9 (see footnote 3). On the other hand, Figure 2(b) shows the FA-tree generated by selecting, for each of col-0 and col-1, the three addends with the earliest arrival times among the 'input' addends as inputs to the FAs. The time delay is still 9, but there is only one critical path. Finally, Figure 2(c) shows an FA-tree generated by selecting the three addends with the earliest arrival times among the intermediate as well as the input addends as inputs to FAs. Consequently, the FA in col-1 uses y1, w1 and C(y0,z0,w0) as inputs. The time delay is reduced to 8.
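The three delays quoted above (9, 9 and 8) can be reproduced with a small sketch. The arrival times in `T` are an assumption: they are one reading of the damaged Figure 2(a) that is consistent with the delays stated in the text (only the latest input of each FA matters), and all function names are hypothetical.

```python
DS, DC = 2, 1  # FA delays to the sum and carryout ports, as assumed in the example

# Assumed bit-level arrival times (illustrative reading of Figure 2(a)).
T = {"x0": 7, "y0": 5, "z0": 4, "w0": 2, "x1": 7, "y1": 2, "w1": 3}


def fa(arrivals):
    """Arrival times (sum, carryout) of an FA fed by the given three addends."""
    latest = max(arrivals)
    return latest + DS, latest + DC


def split_earliest_three(arrivals):
    """Return (the three earliest arrival times, the remaining ones)."""
    ordered = sorted(arrivals)
    return ordered[:3], ordered[3:]


def delay_wallace():
    # Fixed selection: x0,y0,z0 in col-0 and x1,y1,w1 in col-1.
    s0, c0 = fa([T["x0"], T["y0"], T["z0"]])
    s1, c1 = fa([T["x1"], T["y1"], T["w1"]])
    return max(s0, T["w0"], s1, c0, c1)


def delay_column_isolation():
    # Earliest three among the 'input' addends of each column only.
    picked0, rest0 = split_earliest_three([T["x0"], T["y0"], T["z0"], T["w0"]])
    s0, c0 = fa(picked0)
    s1, c1 = fa([T["x1"], T["y1"], T["w1"]])
    return max(rest0 + [s0, s1, c0, c1])


def delay_column_interaction():
    # The carryout of col-0 competes with the input addends of col-1.
    picked0, rest0 = split_earliest_three([T["x0"], T["y0"], T["z0"], T["w0"]])
    s0, c0 = fa(picked0)
    picked1, rest1 = split_earliest_three([T["x1"], T["y1"], T["w1"], c0])
    s1, c1 = fa(picked1)
    return max(rest0 + [s0] + rest1 + [s1, c1])


print(delay_wallace(), delay_column_isolation(), delay_column_interaction())
```

Each function returns the latest arrival time among the signals entering the final adder, so the comparison isolates the effect of which addends are routed to the FAs.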
From the FA-tree allocations in Figure 2, we propose an FA-tree allocation procedure which leads to a minimum time delay. It must satisfy the following two conditions:
Condition 1: allocate FAs for the addends in the rightmost bit-column first until two addends in the column remain. This step is repeated for the addends in the next bit-column together with the addends that have been generated from the previous column until the leftmost column is processed;
Condition 2: assign addends with the earliest arrival times to the inputs of the FAs created at the beginning phase of the bottom-up (i.e., from leaf to root) FA allocation in each bit-column.
Footnote 3: t(x) denotes the arrival time of signal x.

Figure 2: Effects of selection of signals to FAs on timing. Each panel shows the FA-tree for the col-0 addends x0, y0, z0, w0 and the col-1 addends x1, y1, w1, with arrival times in parentheses.
(a) by Wallace's: FA outputs s0 (9), c0 (8), s1(x1,y1,w1) (9), c1(x1,y1,w1) (8).
(b) by Column-Isolation: FA outputs s0(w0,y0,z0) (7), c0(w0,y0,z0) (6), s1(x1,y1,w1) (9), c1(x1,y1,w1) (8).
(c) by Column-Interaction: FA outputs s0(y0,z0,w0) (7), c0(y0,z0,w0) (6), s1(y1,w1,c0) (8), c1(y1,w1,c0) (7).
3.3 Algorithm for Timing-Driven FA-tree Allocation
The FA-tree allocation problem is stated as follows:
Problem 1: Given an addition expression of
$$F = X_1 + X_2 + \cdots + X_m \qquad (1)$$
where $X_i$ ($i = 1, 2, \cdots, m$) is an $n_i$-bit ($= x_{i,n_i-1} \cdots x_{i,1} x_{i,0}$) addend with arrival time $t(x_{i,j})$ for each bit-signal $x_{i,j}$, $j = 0, 1, \cdots, n_i - 1$, determine an allocation of FA-tree $T$ with a minimum of $t(F)$.
Since the final adder of the FA-tree can be implemented with any of several types of modules, we slightly modify Problem 1 such that we want to minimize the maximum value among the arrival times of the signals that are inputs to the final adder of the FA-tree. (Note that under this modified objective, we derive strong results which in turn enable Problem 1 to be solved optimally in polynomial time.)
Let us first consider a simple case in which the addend expression of Eq. (1) is constrained as
$$F = X_1 + X_2 + \cdots + X_m \qquad (2)$$
where bit width$(X_i) = 1$, $i = 1, 2, \cdots, m$.
The corresponding initial addend matrix is a single bit-column with $m$ rows. For example, when $m$ is 6, the initial and the reduced final matrices are shown in Figures 3(a) and (b), respectively. Here, the objective of our FA-tree allocation is to generate a final matrix with a minimum of $\max\{t(r_{11}), t(r_{12}), t(r_{21}), t(r_{22})\}$. Based on Condition 2 in Sec. 3.2, we propose a simple constructive procedure for allocating FAs to reduce the single column with $m$ addends to one with two addends:
Algorithm SC T(M0):
FA Allocation for Single Column for Timing (i.e., F in Eq. (2))
• M0 = {xi,0}, 1 ≤ i ≤ m /* addend set in the column */
while (|M0| ≥ 3) {
    if (|M0| > 3) {
        • Select three addends with the earliest arrival times from M0;
        • Create a new FA and connect them as inputs;
    }
    else { /* |M0| = 3 */
        • Select two addends with the earliest arrival times from M0;
        • Create a new HA and connect them as inputs;
    }
    • Delete the three/two addends used from M0;
    • Insert the addend coming from the sum port of the FA/HA to M0;
}
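The single-column procedure above can be sketched in Python with a min-heap keyed on arrival time. This is a minimal sketch, not the authors' implementation: the name `sc_t` and its return convention are hypothetical, and it assumes an HA has the same port delays Ds and Dc as an FA, which the text does not specify.

```python
import heapq


def sc_t(arrivals, ds=2, dc=1):
    """Reduce one column of addends (given by arrival times) to at most two.

    Returns (arrival times left in the column, arrival times of the carryouts
    that move to the next column).
    """
    heap = list(arrivals)
    heapq.heapify(heap)
    carries = []
    while len(heap) >= 3:
        # FA (three inputs) while more than three addends remain, else an HA (two inputs).
        k = 3 if len(heap) > 3 else 2
        latest = max(heapq.heappop(heap) for _ in range(k))
        heapq.heappush(heap, latest + ds)   # sum stays in this column
        carries.append(latest + dc)         # carryout goes to the next column
    return sorted(heap), sorted(carries)


print(sc_t([7, 5, 4, 2]))   # one FA on the three earliest addends
print(sc_t([0] * 6))        # six simultaneous addends, as in the m = 6 case
```

Using a heap keeps each selection of the earliest addends at logarithmic cost, so the whole column is reduced in O(m log m) time under these assumptions.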
Figure 3: Matrix reduction when bit width(Xi) = 1, i = 1, 2, ···, m and m = 6
(a) Initial matrix: a single column col-0 with addends x11, x21, x31, x41, x51, x61.
(b) Reduced final matrix with 2 rows:
        col-1   col-0
         r21     r11
         r22     r12
Note that algorithm SC T is similar to Huffman's algorithm[4] for constructing a binary tree with minimum path length. However, our problem is more complicated in that, in addition to minimizing the arrival times of the sum signals of FAs, we also minimize the arrival times of the carryout signals, because they directly affect the arrival times of the signals reduced in the columns with the higher weights of the matrix. Nevertheless, the following lemma tells us that algorithm SC T is indeed strong enough to satisfy the additional consideration of minimizing the arrival times of carryout signals.
Lemma 1 Given an addend matrix for Eq. (2), let $s_1$ and $s_2$ be the two sum signals ($t(s_1) \le t(s_2)$) and $c_1, \cdots, c_l$ ($t(c_1) \le \cdots \le t(c_l)$) be the carryout signals produced by the application of algorithm SC T to the (single) column of the matrix. Similarly, let $s'_1$ and $s'_2$ ($t(s'_1) \le t(s'_2)$) be the two sum signals and $c'_1, \cdots, c'_l$ ($t(c'_1) \le \cdots \le t(c'_l)$) be the carryout signals produced by the application of any algorithm. Then, $t(s_1) \le t(s'_1)$, $t(s_2) \le t(s'_2)$, $t(c_1) \le t(c'_1)$, $\cdots$, and $t(c_l) \le t(c'_l)$.
Based on Condition 1 in Sec. 3.2 and Lemma 1, to generate an FA-tree with minimum time delay for the addend expression in Eq. (2), we apply algorithm SC T repeatedly to every column generated. In other words, SC T is applied to the single column of the initial addend matrix, reducing it into a column with 2 addends by allocating FAs, and at the same time, creating a new column composed of the carryout addends produced by the FA allocation. This process then repeats with the next new column until the column has only 2 addends. We have the following results.
Lemma 2 Given an addend matrix for Eq. (2), let $r_{i1}$ and $r_{i2}$ ($t(r_{i1}) \le t(r_{i2})$), $i = 1, 2, \cdots, n$ be the two signals in column-$i$ of the matrix finally reduced by the repeated applications of algorithm SC T, where $n$ is the total number of columns. Similarly, let $r'_{i1}$ and $r'_{i2}$ ($t(r'_{i1}) \le t(r'_{i2})$), $i = 1, 2, \cdots, n$ be the two signals in column-$i$ of the matrix finally reduced by the application of any algorithm. Then, $t(r_{i1}) \le t(r'_{i1})$ and $t(r_{i2}) \le t(r'_{i2})$, $i = 1, 2, \cdots, n$.
Lemma 2 indicates that the repeated applications of algorithm SC T to every column generated, starting from the single column of the addend matrix for Eq. (2), produce an FA-tree with optimal timing of every signal in the reduced final matrix. This leads to the following observation:
Observation 1: Lemma 2 claims that there is no FA-tree allocation algorithm in which the arrival time of any signal in the reduced final matrix (even at the expense of the times of other signals in the reduced matrix) is smaller than that of the corresponding signal produced by our FA-tree allocation algorithm. Consequently, using the signals produced by our proposed algorithm as the input (vector) addends to the final adder will produce an output of optimal time delay (i.e., an optimal value of t(F) in Eq. (1) of Problem 1).
Now, we complete our algorithm for solving Problem 1 with Eq. (1). The proposed FA-tree allocation algorithm is based on Observation 1 and the procedure used for solving the addend matrix for Eq. (2). The algorithm (called FA AOT) works in the following way: Given an initial addend matrix with multiple columns, apply algorithm SC T to the column with weight $2^0$ (i.e., the rightmost one). In the next iteration, apply SC T to the column with weight $2^1$, and so on. The iteration stops when the leftmost column is processed.
Theorem 1 Algorithm FA AOT allocates an FA-tree for F in Eq. (1) with an optimal time delay for F.
Theorem 1 claims that our algorithm produces an FA-tree with optimal timing. We now summarize the flow of the complete algorithm in the following:
Algorithm FA AOT(F):
FA-tree Allocation for Optimal-Timing for F in Eq. (1):
• n = max{ nk | 1 ≤ k ≤ m } /* nk: bit-width(Xk) */
• Mj = {xi,j}, 0 ≤ j ≤ n − 1
  /* addend set in column j of addend matrix M */
• Set j = 0; /* the rightmost (the lowest weight) column */
repeat {
    • Call SC T(Mj);
    • Insert all addends of the carryouts of FAs (and an HA if created) produced to Mj+1;
      (If Mj+1 does not exist, create an empty Mj+1 and insert them to the set.)
    • Set j = j + 1;
} until (|Ms| ≤ 2, for each s = 1, 2, ···)
/* Here, we make sure matrix M has two rows */
• Create a final adder with its input bit-widths being the numbers of addends in the rows;
• Connect the two operands, one for each row, as inputs to the adder;
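The column-by-column flow of the algorithm can be sketched as follows. This is an illustrative sketch only: `fa_aot` and `sc_t` are hypothetical names, columns are represented just by the arrival times of their addends, an HA is assumed to have the same delays as an FA, and the example input uses assumed arrival times that are consistent with the delays quoted for Figure 2.

```python
import heapq


def sc_t(arrivals, ds=2, dc=1):
    """Single-column reduction: earliest addends first (FA, or HA when three remain)."""
    heap = list(arrivals)
    heapq.heapify(heap)
    carries = []
    while len(heap) >= 3:
        k = 3 if len(heap) > 3 else 2
        latest = max(heapq.heappop(heap) for _ in range(k))
        heapq.heappush(heap, latest + ds)
        carries.append(latest + dc)
    return sorted(heap), carries


def fa_aot(columns, ds=2, dc=1):
    """columns[j] holds the arrival times of the addends of weight 2**j.

    Returns the reduced matrix: one list of at most two arrival times per column.
    """
    cols = [list(c) for c in columns]
    reduced = []
    j = 0
    while j < len(cols):
        remaining, carries = sc_t(cols[j], ds, dc)
        reduced.append(remaining)
        if carries:
            if j + 1 == len(cols):
                cols.append([])           # create column j+1 if it does not exist
            cols[j + 1].extend(carries)   # carryouts join the next column
        j += 1
    return reduced


# Assumed arrival times: col-0 = x0,y0,z0,w0 and col-1 = x1,y1,w1.
matrix = fa_aot([[7, 5, 4, 2], [7, 2, 3]])
print(matrix, max(max(col) for col in matrix))
```

The largest value in the reduced matrix is the latest arrival time at the inputs of the final adder, which is the quantity the modified Problem 1 minimizes.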
4 FA Allocation for Low Power
4.1 Power Model for FA
In a CMOS gate, most of the power consumption occurs during output transitions when the outputs are charging and discharging. Let Ws and Wc denote the amount of power consumed by an FA when value transitions occur at the two outputs, sum s and carryout c, respectively.
We use a stochastic process model for the sequence of sig-
nal values at inputs and internal nodes of a circuit. We assume
the signals are random variables with p(x) denoting the prob-
ability that the value of x is 1. We employ the zero gate-delay
model and ignore signal transitions due to glitches.
4.2 Problem Formulation
Under the model in Sec. 4.1, the average switching activity of signal $x$ becomes $E_{switching}(x) = p(x) \cdot (1 - p(x))$. Consequently, the problem we want to solve is to allocate an FA-tree $T$ for an arithmetic expression which minimizes the sum of the values of $E_{switching}(x)$ for all output signals of FAs in $T$. That is, the power consumption of FA-tree $T$ is measured by
$$E_{switching}(T) = \sum_{v \in V(T)} \{ W_s \cdot p(v_s)(1 - p(v_s)) + W_c \cdot p(v_c)(1 - p(v_c)) \}$$
where $V(T)$ is the set of FAs in $T$, and $v_s$ and $v_c$ represent the sum and carryout signals of a $v$ in $V(T)$, respectively. Thus, our optimization problem is expressed as:
Problem 2: Given an addition expression of
$$F = X_1 + X_2 + \cdots + X_m \qquad (1)$$
where $X_i$ ($i = 1, 2, \cdots, m$) is an $n_i$-bit ($= x_{i,n_i-1} \cdots x_{i,1} x_{i,0}$) addend with signal probabilities $p(x_{i,j})$, $j = 0, 1, \cdots, n_i - 1$, and with the power model in Sec. 4.1, determine an allocation of FA-tree $T$ for $F$ which minimizes $E_{switching}(T)$.
FA-tree allocation for Problem 2 is tightly related to the signal probabilities of the sum and carryout of an FA. It can be easily shown that, for an FA in which $x$, $y$ and $z$ are its input signals with signal probabilities $p(x)$, $p(y)$ and $p(z)$, and if we set $q(v) = p(v) - 0.5$, the values of $q(s)$ and $q(c)$ become
$$q(s) = 4 \cdot q(x) \cdot q(y) \cdot q(z),$$
$$q(c) = 0.5 \cdot (q(x) + q(y) + q(z)) - 2 \cdot q(x) \cdot q(y) \cdot q(z).$$
Note that since $p(v)(1 - p(v)) = -(q(v))^2 + 0.25$, minimizing $\sum p(v)(1 - p(v))$ is equivalent to minimizing $-\sum (q(v))^2$. The basic framework of our FA-tree allocation algorithm for low power is the same as the one for optimizing time delay proposed in Sec. 3.3. The only difference is how to select the three signals of addends to be assigned to each of the FAs created during the process of FA-tree allocation to reduce the power consumed by the resultant FA-tree. Sec. 4.3 proposes a solution to the problem.
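The two closed forms for q(s) and q(c) can be checked numerically against a direct enumeration of the eight input combinations of an FA. The sketch below assumes the three inputs are statistically independent; the function names are hypothetical.

```python
from itertools import product


def enumerate_fa_probs(px, py, pz):
    """Exact P(sum = 1) and P(carryout = 1) for independent inputs, by enumeration."""
    ps = pc = 0.0
    for x, y, z in product((0, 1), repeat=3):
        weight = (px if x else 1 - px) * (py if y else 1 - py) * (pz if z else 1 - pz)
        ps += weight * (x ^ y ^ z)
        pc += weight * ((x & y) | (y & z) | (x & z))
    return ps, pc


def closed_form_q(qx, qy, qz):
    """q(s) and q(c) from the formulas in the text, with q(v) = p(v) - 0.5."""
    qs = 4 * qx * qy * qz
    qc = 0.5 * (qx + qy + qz) - 2 * qx * qy * qz
    return qs, qc


for probs in [(0.1, 0.2, 0.3), (0.9, 0.5, 0.7), (0.0, 1.0, 0.25)]:
    ps, pc = enumerate_fa_probs(*probs)
    qs, qc = closed_form_q(*(p - 0.5 for p in probs))
    print(probs, abs(ps - 0.5 - qs) < 1e-12, abs(pc - 0.5 - qc) < 1e-12)
```

The enumeration and the closed forms agree up to floating-point rounding, which is why the allocation algorithm can work entirely with the q values.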
4.3 Algorithm for Power-Driven FA-tree Allocation
Let us begin with an example in Figure 4 to illustrate the effect of the selection of addends to be used as inputs to an FA on the quantity of $E_{switching}(T)$. We assume $W_c = W_s = 1$. Figure 4 shows two different FA-trees $T_1$ and $T_2$ for 4 single-bit addends $x_1$, $x_2$, $x_3$ and $x_4$ with signal probabilities $p(x_i)$ ($q(x_i) = p(x_i) - 0.5$). Since $T_1$ and $T_2$ select different addends as inputs to the FA, they result in different amounts of total switching. That is, $E_{switching}(T_1) = W_s \cdot \{-(q(s_1))^2 + 0.25\} + W_c \cdot \{-(q(c_1))^2 + 0.25\} = -(q(s_1))^2 - (q(c_1))^2 + 0.5 = -0.089 + 0.5 = 0.411$, and $E_{switching}(T_2) = -(q(s_2))^2 - (q(c_2))^2 + 0.5 = -0.100 + 0.5 = 0.400$. This means that tree $T_2$ in Figure 4(b) consumes less power than tree $T_1$ in Figure 4(a).

Figure 4: Example showing the effect of signal selection to FA on power. (a) FA-tree T1 with FA outputs s1 and c1; (b) FA-tree T2 with FA outputs s2 and c2. Input signal probabilities: p(x1) = 0.1, p(x2) = 0.2, p(x3) = 0.3, p(x4) = 0.4, i.e., q(x1) = -0.4, q(x2) = -0.3, q(x3) = -0.2, q(x4) = -0.1.

Clearly, different selections of the addends to be used as inputs to FAs generate internal structures of the FA-tree with different values of $E_{switching}(T)$, which is a weighted sum of two factors, one (i.e., $q(s)$) for the sum signal and the other (i.e., $q(c)$) for the carryout signal. From the example of selecting addends and the formulations of $q(s)$ and $q(c)$ we observe the following fact:
Observation 2: To minimize the power dissipated by the switchings of the sum signal of an FA we should maximize $(q(s))^2$, which is $\{4 q(x) \cdot q(y) \cdot q(z)\}^2$. To achieve this, it is desirable to select the three input addends with the largest values of $|q|$ as inputs to an FA. Note that such a selection is likely to increase the value of the $(q(c))^2$ of the same FA as well. For example, it can be easily shown that by the selection of the addends, the value of the corresponding $(q(c))^2$, which is $\{0.5 (q(x) + q(y) + q(z)) - 2 q(x) \cdot q(y) \cdot q(z)\}^2$, is maximized if $q(x)$, $q(y)$ and $q(z)$ are either all positive numbers or all negative numbers.
Based on Observation 2, we propose ﬁrst an FA-tree alloca-
tion algorithm for F in Eq. (2) (i.e., a single-column matrix)
as follows:
Algorithm SC LP(M0):
FA Allocation for Single Column for Low Power (F in Eq. (2))
• M0 = {xi,0}, 1 ≤ i ≤ m /* addend set in the column */
• If |M0| is an odd number, include a logic value 0 (called x) in M0 /* for allocating a HA */
while (|M0| ≥ 3) {
    • Select three addends with the largest values of |p(x) − 0.5| (= |q(x)|) from M0; (statement a)
    • Create a new FA and connect them as inputs;
      (when the selected addends include the logic-value-0 addend x, create a HA)
    • Delete the three addends used from M0;
    • Insert the addend coming from the sum port of the FA/HA to M0;
}
The flow of algorithm SC LP is the same as that of SC T in Sec. 3 except for the way of selecting three addends in each iteration (i.e., statement a). The inclusion of logic value 0 ($p(0) = 0$) in M0 in SC LP is intended to model the allocation of a HA so as to satisfy the constraint that the column of M0 is finally reduced to one with only two addends.
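The low-power single-column procedure can be sketched in the same style as the timing-driven one. This is a minimal sketch under the zero-delay, independent-input model of Sec. 4.1; the name `sc_lp`, its return values, and the example probabilities are hypothetical. The constant logic value 0 is modeled as an addend with q = -0.5, so an FA that receives it behaves as an HA.

```python
def fa_q(qx, qy, qz):
    """q(s), q(c) of an FA from the q values of its inputs (q = p - 0.5)."""
    qs = 4 * qx * qy * qz
    qc = 0.5 * (qx + qy + qz) - 2 * qx * qy * qz
    return qs, qc


def sc_lp(probs, ws=1.0, wc=1.0):
    """Reduce one column, always feeding an FA the three addends with largest |q|.

    Returns (signal probabilities left in the column,
             signal probabilities of the carryouts,
             E_switching accumulated over the allocated FAs/HAs).
    """
    q = [p - 0.5 for p in probs]
    if len(q) % 2 == 1:
        q.append(-0.5)            # logic value 0 (p = 0): models the HA
    energy = 0.0
    carries = []
    while len(q) >= 3:
        q.sort(key=abs, reverse=True)
        picked, q = q[:3], q[3:]  # statement a: largest |q| first
        qs, qc = fa_q(*picked)
        energy += ws * (0.25 - qs ** 2) + wc * (0.25 - qc ** 2)
        q.append(qs)              # sum stays in this column
        carries.append(qc + 0.5)  # carryout moves to the next column
    return [v + 0.5 for v in q], carries, energy


remaining, carries, energy = sc_lp([0.9, 0.8, 0.6, 0.5])
print(remaining, carries, energy)
```

Because the number of addends is made even up front and every iteration removes two of them, the loop always stops with exactly two addends in the column.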
For an FA-tree $T$, the values of
$$P_{switching}(T) = -\sum_{v \in V(T)} \{ \sqrt{W_s} \cdot |q(v_s)| + \sqrt{W_c} \cdot |q(v_c)| \}$$
and $E_{switching}(T)$ are highly related, namely, as the value of $P_{switching}(T)$ decreases the value of $E_{switching}(T)$ tends to decrease. Property 1 claims that SC LP produces an FA-tree which minimizes $P_{switching}(T)$ under special conditions:
Property 1 Given an addend matrix for Eq. (2), algorithm SC LP produces an FA-tree $T$ which minimizes $P_{switching}(T)$ if $2 \cdot \sqrt{W_s} \ge \sqrt{W_c}$ and the signal probabilities of all input addends are either all between 0 and 0.5 or all between 0.5 and 1.
Property 2 claims that if we consider only the amount of the power dissipated by the switching of the sum signals of FAs, the FA-tree generated by SC LP consumes minimal power.
Property 2 Given an addend matrix for Eq. (2), algorithm SC LP produces an FA-tree $T$ which minimizes $E_{switching}(T)$ when $W_c = 0$.
However, for a general distribution of the signal probabilities and the values of Ws and Wc, SC LP does not guarantee an optimal solution. This is mainly because it is complicated to analyze how the signal probabilities of the carryouts produced by the FA allocation of the current column affect the FA allocation of the next column. More generally, the complication arises from the following facts: Our optimization problem can be viewed as an "FA decomposition" for low power. Contrary to the problem of primitive gate decompositions for low power, such as AND-gate[5] and XOR-gate[6] decompositions, in which the logic expressions are rather simple to analyze, the logic expressions for an FA are much more complicated, and further, the switching activities of two output signals rather than one are involved in the analysis of power consumption. However, Observation 2 and Property 3 support the claim that algorithm SC LP is able to generate an FA-tree with low power consumption, even though not always the lowest, and to significantly reduce the risk of allocating an FA-tree with extremely high power consumption caused by the switching of the carryouts of FAs.
Property 3 When a column of the addend matrix is reduced to a single-element column, the sum of the signal probabilities of all carryouts of the FAs (and of an HA, if one exists) produced is constant no matter what FA allocation method is used.
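Property 3 can be illustrated numerically for a column of five independent addends reduced to a single element by two FAs. The sketch below is not a proof: the function names and probabilities are hypothetical, and each ordering of the addends simply stands for one possible FA allocation.

```python
from itertools import permutations


def fa_probs(px, py, pz):
    """Signal probabilities (sum, carryout) of an FA with independent inputs."""
    qx, qy, qz = px - 0.5, py - 0.5, pz - 0.5
    ps = 0.5 + 4 * qx * qy * qz
    pc = 0.5 + 0.5 * (qx + qy + qz) - 2 * qx * qy * qz
    return ps, pc


def carry_probability_sum(order):
    """Chain FAs over the addends in the given order until one addend is left."""
    column = list(order)
    total = 0.0
    while len(column) >= 3:
        ps, pc = fa_probs(column[0], column[1], column[2])
        column = column[3:] + [ps]   # the sum stays in the column
        total += pc                  # the carryout leaves the column
    return total


probs = [0.1, 0.2, 0.3, 0.4, 0.7]
totals = [carry_probability_sum(order) for order in permutations(probs)]
print(min(totals), max(totals))
```

The totals coincide up to rounding because the column value equals the final sum bit plus twice the number of carryouts that are 1, so by linearity of expectation the carryout probabilities add up to half of (sum of input probabilities minus the probability of the final sum bit), and the final sum bit is the parity of all inputs regardless of the allocation.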
Finally, based on Observation 2 and Properties 1, 2 and 3, we extend algorithm SC LP to handle the low power FA-tree allocation for the addend matrix with multiple columns. The extended algorithm, called FA ALP (FA-tree Allocation for Low Power), has exactly the same flow as that of FA AOT in Sec. 3.3. The difference is that FA ALP uses subroutine SC LP while FA AOT uses SC T. Specifically, FA AOT selects the three signals to be used as inputs to an FA according to the arrival times of the signals (i.e., for timing), and if there are ties, the selection priority is given according to the quantity of |q| (i.e., for low power). Conversely, algorithm FA ALP selects the signals according to the quantity of |q| and breaks ties according to the arrival times.
5 Experimentation
We tested our algorithms, FA AOT (for timing) and FA ALP (for power), on a large number of arithmetic computations that are typically found in industry. Our program accepts an arithmetic expression (together with input characteristics, i.e., bit-width, arrival time and signal probability) as input and generates the netlist of a functionally equivalent FA-tree with optimal-timing/low-power in Verilog HDL. The netlist is then used as input to the Synopsys Design Compiler[7] for logic optimization. We also implemented the delay-optimal algorithm CSA OPT[8] of CSA allocation for comparison purposes. In the experiments, we used the lcbg10pv (0.35µ) technology[9].
Timing optimization: Table 1 shows the comparisons of the designs produced by the conventional RTL and logic optimization, CSA OPT, and FA AOT. In the first column, the non-zero arrival times of inputs and the bit-widths of each design are specified. IIR is the arithmetic part of the 2nd-order IIR filter design and Kalman is the state vector computation part of the Kalman filter design. Complex is the arithmetic part of a complex number calculation. Serial-Adapter is a 3-port serial adapter which is regularly used in many ladder digital filter structures. Since FA AOT is optimal in timing, its timing is always shorter than or equal to that of CSA OPT; the comparisons serve only as a reference to show how effective our algorithm is. In fact, the results indicate that our algorithm performs well, not only for the polynomial expressions but also for the filter designs. Overall, FA AOT reduced the timing by 37.8% and 23.5% on average, with much less circuit area, compared with the conventional one and CSA OPT, respectively. Note that CSA OPT produces a result for the Serial-Adapter design comparable with ours. This is mainly due to the nature of the regular CSA structure, which might allow more efficient logic optimization. However, the improvement is very small and only in the circuit area.
Table 1: Comparisons of designs optimized for timing
Design | Convent. (time, area) | [8] (time, area) | FA AOT (time, area) | Impr. wrt Convent. (time, area) | Impr. wrt [8] (time, area)
X^2 (X: 3-bit) | 1.33 ns, 545 units | 1.06 ns, 275 units | 0.33 ns, 160 units | 75.2%, 80.7% | 69.0%, 42.8%
X^3 (X: 4-bit) | 3.54 ns, 2345 units | 3.24 ns, 1670 units | 2.01 ns, 825 units | 43.2%, 64.8% | 37.9%, 50.6%
X^2 + X + Y (X,Y: 8-bit, X: 0.7 ns) | 4.63 ns, 5534 units | 3.84 ns, 3789 units | 3.18 ns, 3111 units | 31.3%, 43.7% | 17.2%, 17.8%
x^2 + 2x·y + y^2 + 2x + 2y + 1 (x,y: 8-bit, 1.0 ns) | 5.26 ns, 9138 units | 4.63 ns, 8134 units | 4.01 ns, 6458 units | 23.8%, 29.3% | 13.4%, 20.6%
x + y − z + x·y − y·z + 10 (x,y,z: 8-bit) | 5.16 ns, 7568 units | 3.77 ns, 6297 units | 3.61 ns, 5916 units | 30.0%, 21.8% | 4.2%, 6.0%
IIR (16-bit output) | 6.57 ns, 13362 units | 4.75 ns, 11202 units | 3.68 ns, 8349 units | 43.9%, 37.5% | 22.5%, 25.5%
Kalman (32-bit output) | 6.09 ns, 31073 units | 4.50 ns, 25713 units | 3.69 ns, 21542 units | 39.4%, 30.7% | 18.0%, 16.2%
IDCT (32-bit output) | 11.51 ns, 85364 units | 6.38 ns, 77052 units | 4.45 ns, 60307 units | 61.3%, 29.3% | 30.2%, 21.7%
Complex (32-bit output) | 5.22 ns, 53879 units | 4.51 ns, 50083 units | 3.70 ns, 38343 units | 29.1%, 28.8% | 17.9%, 23.4%
Serial-Adapter (16-bit output) | 6.46 ns, 6593 units | 6.00 ns, 5608 units | 5.72 ns, 5631 units | 11.5%, 4.7% | 4.7%, -0.4%
Power optimization: We ran two FA-tree allocation algorithms: FA random, which randomly selects the signals to be used as inputs to each FA, and FA ALP. We measured the total power consumption (in terms of Eswitching(T) in Sec. 4.2) caused by the signal transitions of the outputs of the FAs in the FA-trees produced by the algorithms, and summarized the results in Table 2. We used random signal probabilities for the inputs of the designs, and used the Synopsys Design Power[10] for measuring the power (i.e., Ws and Wc) consumed by single signal transitions of the sum and carryout of an FA, with the lcbg10pv (0.35µ) target library and a global operating voltage of 3.3 V. The comparisons indicate that the selection of signals to be used as inputs to FAs proposed in FA ALP is quite effective in reducing the power consumption of the FA-tree. Moreover, the power reduction is consistent, reflecting a very low risk of generating an FA-tree with high power consumption by our algorithm.
Table 2: Comparisons of designs optimized for power
Design | FA random | FA ALP | Impr.
IIR | 257 mW | 240 mW | 6.6%
Kalman | 316 mW | 281 mW | 11.0%
IDCT | 1406 mW | 1324 mW | 5.8%
Complex | 330 mW | 299 mW | 6.6%
Serial-Adapter | 324 mW | 240 mW | 25.9%
Average | | | 11.8%

6 Conclusions
In this paper, we presented a new algorithm for synthesizing arithmetic circuits based on the (bit-level) carry-save addition of the Wallace scheme. Unlike the previous approaches, in which the application of the bit-addend compression technique is confined to individual multiplications only, with the assumption of equal arrival times of input signals, (1) we extended the application to any arithmetic expression composed of addition/subtraction/multiplication operations which can be translated into addition expressions, and proposed two bit-addend compression algorithms, one for optimizing timing and another for optimizing power. (2) The former algorithm is able to optimize the timing delay of the circuit for uneven signal arrival profiles; specifically, we presented an efficient algorithm for generating a delay-optimal carry-save addition structure of the arithmetic circuit, while (3) the latter is designed based on a comprehensive analysis of the effect of constructing a carry-save addition structure on its switching activity. From the experiments, we showed that our algorithms allow an extensive utilization of (bit-level) carry-save additions over the arithmetic circuit, which leads to significant reductions in timing and power consumption.
Acknowledgment
This work was supported by the Korea Science and Engineer-
ing Foundation (KOSEF) through the Advanced Information
Technology Research Center (AITrc).
References
[1] D. D. Gajski, N. Dutt, A. Wu, and S. Lin, High-Level Synthesis - Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
[2] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen, “Optimizing Power using Transformations”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 14, No. 1, pp. 12-31, January 1995.
[3] D. D. Gajski, “Parallel Compressors”, IEEE Transactions on Computers, Vol. C-29, No. 5, pp. 393-398, May 1980.
[4] D. Huffman, “A method for the construction of minimum redundancy codes”, Proc. of the IRE, Vol. 40, pp. 1098-1101, 1952.
[5] H. Zhou and D. F. Wong, “An Exact Gate Decomposition Algorithm for Low-Power Technology Mapping”, Proc. of International Conference on Computer-Aided Design, pp. 575-580, 1997.
[6] U. Narayanan and C. L. Liu, “Low Power Logic-Synthesis for XOR Based Circuits”, Proc. of International Conference on Computer-Aided Design, pp. 570-574, 1997.
[7] Synopsys Inc., Design Compiler Reference Manual, 1998.
[8] J. Um, T. Kim and C. L. Liu, “Optimal Allocation of Carry-Save-Adders in Arithmetic Optimization”, Proc. of International Conference on Computer-Aided Design, November 1999.
[9] LSI Logic Inc., G10-p Cell-Based ASIC Products Databook, 1996.
[10] Synopsys Inc., Power Compiler Reference Manual, 1998.