Multiple Constant Multiplication for Digit-Serial Implementation of Low Power FIR Filters by Kenny Johansson et al.
Multiple Constant Multiplication for Digit-Serial
Implementation of Low Power FIR Filters
KENNY JOHANSSON, OSCAR GUSTAFSSON, and LARS WANHAMMAR
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, SWEDEN
{kennyj, oscarg, larsw}@isy.liu.se, http://www.es.isy.liu.se/
Abstract: - Multiple constant multiplication (MCM) is an efficient way of implementing several constant
multiplications with the same input data. The coefficients are expressed using shifts, adders, and subtracters.
By utilizing redundancy between the coefficients the number of adders and subtracters is reduced, resulting
in a low complexity implementation. However, for digit-serial arithmetic a shift requires a flip-flop, and, hence,
the number of shifts should be taken into consideration as well. In this work we investigate the area, speed, and
power trade-offs for implementation of FIR filters using MCM and digit-serial arithmetic. We also introduce an
algorithm for reducing both the number of adders and subtracters and the number of shifts.
Key-Words: - Multiple constant multiplication, Multiplier block, Multiplierless, Digit-serial arithmetic, FIR filter,
Low power, Adder graph, Shift-and-add multiplication, Adder depth
1  Introduction
Multiplication with a constant is commonly used in
digital signal processing (DSP) circuits, such as digital
filters. The transfer function of an Nth-order finite-length
impulse response (FIR) filter can be written as
(1)
In a transposed direct form FIR filter, as shown in
Fig. 1, one input is multiplied with multiple coeffi-
cients [1],[2]. This is often referred to as the multiple-
constant multiplication (MCM) problem, which can
be realized using a multiplier block as illustrated by
the dashed box in Fig. 1.
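To make the role of the multiplier block concrete, the following Python sketch simulates a transposed direct form FIR filter on integer samples. It is only an illustration, assuming integer coefficients and unquantized arithmetic; the function name `fir_transposed` is hypothetical and does not come from the original work.

```python
def fir_transposed(x, h):
    """Filter the sample list x with coefficients h = [h(0), ..., h(N)]."""
    state = [0] * (len(h) - 1)   # contents of the N delay elements (T)
    y = []
    for xn in x:
        # Multiplier block: the same input sample times every coefficient.
        products = [hk * xn for hk in h]
        # Output: h(0) x(n) plus the delayed partial sum.
        out = products[0] + (state[0] if state else 0)
        # Structural adders: update the delay line from the output side,
        # so state[i + 1] still holds the value from the previous sample.
        for i in range(len(state)):
            nxt = state[i + 1] if i + 1 < len(state) else 0
            state[i] = products[i + 1] + nxt
        y.append(out)
    return y


# The impulse response reproduces the coefficients.
impulse_response = fir_transposed([1, 0, 0, 0], [1, 2, 3])
```

The point of the structure is visible in the `products` line: all multiplications share the input sample, which is what allows partial results to be shared inside a multiplier block.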
Constant multiplication can be efficiently imple-
mented using shifts, adders, and subtracters. As the
complexity is similar for adders and subtracters we
will refer to both as adders, and the number of adders
and subtracters as adder cost.
Each multiplier in a multiplier block can be imple-
mented separately, e.g. using the canonic signed-digit
(CSD) representation [1],[3]. However, it is possible
to utilize redundant partial results to reduce the
number of adders required to realize multiple-constant
multiplication [4]–[9].
Most existing work on MCM has focused on minimizing
the number of adders, as the shift operations
can be hardwired in a bit-parallel architecture. However,
in bit- and digit-serial arithmetic the shift operations
require flip-flops, and, hence, they have to be
considered as well. In [10] an algorithm that minimizes
the number of shifts while keeping the adder cost
low was proposed.
Most work on implementation of digit-serial FIR
filters has focused on implementation in FPGAs and
without using multiplier blocks [11]–[13]. However,
in [14] the digit-size trade-off in implementation of
digit-serial transposed direct form FIR filters using
multiplier blocks was studied. One of the best MCM
algorithms in terms of number of adders, referred to as
RAG-n [5], and the algorithm proposed in [10], referred
to as RSAG-n, were used in the comparison.
The conclusion in [14] was that an algorithm that
minimizes the number of adders, while keeping the
number of shifts low, would be preferable for most
cases.
Besides adders and shifts, another factor of
importance when multiplier blocks are designed is
adder depth, i.e., the number of cascaded adders. An
algorithm, referred to as C1, that aims at minimizing
the adder depth was proposed in [15].
In this work we propose an algorithm that firstly
aims to minimize the number of adders and secondly
the number of shifts. We investigate how large savings
can be achieved compared with RAG-n and
RSAG-n, respectively. The algorithms are compared
in terms of complexity and adder depth. Furthermore,
we provide two example implementations and compare
the power consumption of the three algorithms
with the algorithms in [7] and [15] and with separate
multipliers based on CSD coefficients.
$$H(z) = \sum_{k=0}^{N} h(k)\, z^{-k} \tag{1}$$
Figure 1. Transposed direct form Nth-order FIR filter.
2  Digit-Serial FIR Filters
In this section implementation aspects of FIR filters
using digit-serial arithmetic are discussed.
2.1 Digit-Serial Arithmetic
In digit-serial arithmetic, the words are divided into
digits of d bits that are processed one digit at a time
[16],[17]. The integer number d is usually denoted the
digit-size. This provides a trade-off between area,
speed, and power consumption [17],[18]. For the spe-
cial case where d equals the data wordlength we have
bit-parallel processing and when d equals one we have
bit-serial processing.
Digit-serial operators can be derived either by unfolding
bit-serial operators [19] or by folding bit-parallel
operators [20]. In Fig. 2, a digit-serial adder,
subtracter, and shift operation are shown, respectively.
In Fig. 3 (a) a digit-serial implementation with
d = 1 of a constant multiplication is illustrated. Here,
the constant 45 is obtained as $5 \cdot 9 = (1 + 2^2)(1 + 2^3)$,
i.e., two and three shifts are required in the first and
second adder stage, respectively. Note that a CSD
multiplier would require three adders as the CSD
representation of 45 is $10\bar{1}0\bar{1}01$, where a bar is used to
represent a negative digit. In Fig. 3 (b) the corresponding
graph representation of the multiplier is shown.
Here, edges correspond to shifts and marked nodes to
additions.
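The constant 45 example can be checked numerically. The sketch below, with hypothetical helper names `times_45` and `csd_digits`, multiplies by 45 using the two-adder factorization described above and derives the CSD digits with a standard recoding loop; it assumes plain Python integers rather than digit-serial hardware.

```python
def times_45(x):
    """Multiply x by 45 = (1 + 2**2) * (1 + 2**3) using two adders."""
    t5 = x + (x << 2)        # first adder stage: 5x, two shifts
    return t5 + (t5 << 3)    # second adder stage: 45x, three shifts


def csd_digits(n):
    """Return the CSD digits of a positive integer n, most significant first."""
    digits = []
    while n != 0:
        if n % 2 == 0:
            digits.append(0)
        else:
            d = 2 - (n % 4)  # +1 if n % 4 == 1, -1 if n % 4 == 3
            digits.append(d)
            n -= d
        n //= 2
    return digits[::-1]


digits_45 = csd_digits(45)                           # 64 - 16 - 4 + 1
csd_adders_45 = sum(d != 0 for d in digits_45) - 1   # nonzero digits minus one
```

The four nonzero CSD digits give three adders for a stand-alone CSD multiplier, whereas the factorized form needs only two.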
Considering the processing elements it is clear that
the area of an FIR filter using digit-serial arithmetic
will increase for larger digit-size. How speed and
power consumption are affected is not obvious.
2.2 Implementation Aspects
The transposed direct form FIR filter is mapped to a
hardware structure using a direct mapping.
In our implementation we select the output wordlength
as an integer multiple of the digit-size. It is possible
to use an arbitrary wordlength, but this requires a
more complex structure of each processing element
[21]. Furthermore, the partial results are not quantized,
as this would lead to higher complexity of the processing
elements. On the other hand, it may lead to delay
elements with shorter wordlength.
Assuming an input data wordlength of Wd bits and
that the maximum number of fractional bits of the filter
coefficients, h(k), is Wf, the total wordlength, WT, is
(2)
As a consequence, We extra bits are required in some
cases, where
(3)
These extra bits are used as guard bits to further reduce
the risk of overflow. However, the filter coefficients
are assumed to be properly scaled. The number of
clock cycles between each input sample is WT/d.
Hence, the input word should be sign-extended with
WT – Wd bits.
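As a small worked check of the wordlength relations, the sketch below rounds Wd + Wf up to the nearest multiple of the digit-size and derives the extra bits and the number of clock cycles per sample. The function names are hypothetical; the values Wd = 16 and Wf = 13 and the list of digit-sizes are those used for Example 1 in Section 6.1.

```python
def total_wordlength(w_d, w_f, d):
    """Smallest multiple of the digit-size d that holds w_d + w_f bits."""
    return ((w_d + w_f + d - 1) // d) * d


def extra_bits(w_d, w_f, d):
    """Guard bits W_e = W_T - W_d - W_f."""
    return total_wordlength(w_d, w_f, d) - w_d - w_f


def cycles_per_sample(w_d, w_f, d):
    """Clock cycles between input samples, W_T / d."""
    return total_wordlength(w_d, w_f, d) // d


digit_sizes = [1, 2, 3, 4, 5, 6, 8, 10, 15]
w_t = [total_wordlength(16, 13, d) for d in digit_sizes]
w_e = [extra_bits(16, 13, d) for d in digit_sizes]
```

For these values `w_t` reproduces the total wordlengths listed for Example 1, and `w_e` shows that up to three guard bits appear (for d = 4 and d = 8).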
3  Proposed Algorithm
In [5] the n-dimensional Reduced Adder Graph
(RAG-n) algorithm was introduced. This algorithm is
known to be one of the best MCM algorithms in terms
of number of adders. Based on this algorithm an n-di-
mensional Reduced Shift and Add Graph (RSAG-n)
algorithm has been developed [10], that not only tries
to minimize the adder cost, but also the number of
shifts. However, this algorithm has an increased adder
cost, which will be dominating for larger digit-sizes
[14].
Here, an n-dimensional Reduced Add and Shift
Graph (RASG-n) algorithm is proposed. The new algorithm
is a hybrid of the RAG-n [5] and RSAG-n
[10] algorithms. RASG-n works with odd coefficients,
like RAG-n, and only realizes one coefficient in each
iteration, like RSAG-n. When it is possible to realize
more than one coefficient, RASG-n selects the one that
requires the lowest number of additional shifts. This
makes it possible for RASG-n to minimize both the
number of adders and shifts in an effective way.
These algorithms are graph based. Node values are
referred to as fundamentals. Realized coefficients are
removed from the coefficient set and added to an
interconnection table that specifies how the value is
obtained. The termination condition of the algorithm is
that the coefficient set is empty. The steps in the
RASG-n algorithm are as follows:
Figure 2. Digit-serial (a) adder, (b) subtracter, and
(c) left shift.
Figure 3. (a) Multiplication with the constant 45
implemented using digit-serial arithmetic with d = 1
and (b) the corresponding graph representation.
$$W_T = \left\lceil \frac{W_d + W_f}{d} \right\rceil d \tag{2}$$
$$W_e = W_T - W_d - W_f \tag{3}$$
1. Divide even coefficients by two until odd, and save
the number of times each coefficient is divided.
These shifts at the outputs can be considered to be
free when other coefficients are synthesised.
2. Remove zeros, ones (i.e., coefficients that correspond
to a power of two), and repeated coefficients
from the coefficient set.
3. Compute the single-coefficient adder cost for each
coefficient by using a look-up table.
4. Compute a sum matrix based on power-of-two
multiples of the fundamental values included in
the interconnection table. At start this matrix contains
the sums of the power-of-two multiples ±1, ±2, ±4, …
of the fundamental 1, and it is then extended when new
fundamentals are added. If any required coefficients
are found in the matrix, compute the required number
of shifts. Find the coefficients which require the lowest
number of additional shifts, and select the smallest
of those. Add this coefficient to the interconnection
table and remove it from the coefficient set.
5. Repeat step 4 until no required coefficient is found
in the sum matrix.
6. For each remaining coefficient, check if it can be
obtained by the strategies illustrated in Fig. 4. For
both cases two new adders are required. If any
coefficients are found, select the smallest coefficient
of those which require the lowest number of
additional shifts. Add this coefficient and the extra
fundamental to the interconnection table. Remove
the coefficient from the coefficient set.
7. Repeat steps 5 and 6 until no required coefficient is
found.
8. Choose the smallest coefficient with lowest single-coefficient
adder cost. Different sets of fundamentals
that can be used to realize the coefficient are
obtained from a look-up table. For each set,
remove fundamentals that are already included in
the interconnection table and compute the required
number of shifts. Find the sets which require the
lowest number of additional shifts, and of those,
select the set with smallest sum. Add this set and
the coefficient to the interconnection table.
Remove the coefficient from the coefficient set.
9. Repeat steps 5, 6, 7, and 8 until the coefficient set is
empty.
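The first steps of the list above can be sketched in a few lines of Python. This is a simplified illustration, not the complete algorithm: it covers the odd normalization of steps 1 and 2 and a brute-force version of the sum-matrix search in step 4, and it ignores the shift-count tie-breaking, the look-up tables, and steps 6 to 8. The function names and the `max_shift` parameter are assumptions made for the example; the coefficient set is the one from Example 1 in Section 6.1.

```python
def to_odd(c):
    """Step 1: strip factors of two; return (odd part, number of divisions)."""
    c = abs(c)
    shifts = 0
    while c and c % 2 == 0:
        c //= 2
        shifts += 1
    return c, shifts


def preprocess(coeffs):
    """Steps 1-2: odd parts with zeros, ones, and repeats removed."""
    odd_parts = {to_odd(c)[0] for c in coeffs}
    return sorted(odd_parts - {0, 1})


def one_adder_candidates(fundamentals, targets, max_shift):
    """Step 4, simplified: targets that are sums of two signed,
    shifted fundamentals (the entries of the sum matrix)."""
    terms = [s * (f << k)
             for f in fundamentals
             for k in range(max_shift + 1)
             for s in (1, -1)]
    wanted = set(targets)
    found = set()
    for a in terms:
        for b in terms:
            v = to_odd(a + b)[0]   # output shifts are free (step 1)
            if v in wanted:
                found.add(v)
    return found


example_1 = [4, 18, 45, 73, 72, 6, -132, -286, -334, -139,
             363, 1092, 1824, 2284]
odd_set = preprocess(example_1)
first_round = one_adder_candidates([1], odd_set, max_shift=12)
```

Here `odd_set` contains 12 distinct odd values, so at least 12 adders are needed for this set, and `first_round` lists the coefficients reachable with one adder from the initial fundamental 1. The full algorithm would add only one of them per iteration, chosen by the number of additional shifts, and then extend the matrix.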
The basic ideas for the RAG-n [5], RSAG-n [10],
and RASG-n algorithms are similar, but the resulting
difference is significant. The main difference between
the first two algorithms is that RAG-n chooses to realize
coefficients by using extra fundamentals of minimum
value, while RSAG-n chooses fundamentals that
require a minimum number of shifts. The result of
these two different strategies is that RAG-n is more
likely to reuse fundamentals, due to the selection of
smaller fundamental values, and thereby reduce the
adder cost, while RSAG-n is more likely to reduce the
number of shifts. As the proposed algorithm, RASG-n,
is a hybrid of these strategies, realizations with both
few adders and few shifts are obtained.
It is worth noting that if all coefficients are realized
before step 6 of the algorithm, the corresponding im-
plementation has optimal adder cost [5].
4  Complexity
In this section the complexity, including adders and
shifts, of the three algorithms is compared. Average
results are shown for 100 random coefficient sets.
4.1 Coefficient Wordlength Effects
The different algorithms were used to design multipli-
er blocks with coefficient sets of varying wordlength.
The setsize is fixed to 25 coefficients.
In Fig. 5 (a) the average number of additional
adders for each coefficient using the RASG-n algorithm
is shown. Coefficients that can be realized with
no adders include zeros, powers of two, and repeated
coefficients. Most coefficients can be realized with
only one additional adder. The number of adders is optimal
for all coefficient sets of wordlengths up to 8 bits
as shown in Fig. 5 (b). However, when larger coefficients
are included in the sets, it is more likely that
some coefficients require two additional adders, as can
be seen in Fig. 5 (a), and the optimality is unknown.
Figure 4. The coefficient c is obtained from (a) two
existing fundamentals or (b) three existing
fundamentals.
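Initial sum matrix used in step 4 (rows and columns are power-of-two multiples of the fundamental 1):
$$\begin{array}{c|cccccc}
   & 1 & -1 & 2 & -2 & 4 & \cdots \\ \hline
 1 & 2 & 0 & 3 & -1 & 5 & \cdots \\
-1 & 0 & -2 & 1 & -3 & 3 & \cdots \\
 2 & 3 & 1 & 4 & 0 & 6 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{array}$$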
[Figure 5 plots: (a) adder probability [%] (no adders, 1 adder, 2 adders, ≥ 3 adders) and (b) optimal probability [%] versus coefficient bits.]
Figure 5. Statistics from realization of multiplier
blocks using the RASG-n algorithm. (a) Average
number of additional adders for each coefficient.
(b) The probability of proven optimal adder cost.
Corresponding statistics for the other two algorithms
would look similar.
The average number of adders for the three algorithms
is shown in Fig. 6 (a). It is clear that the number of
adders is higher for RSAG-n. The average number of
shifts is lower for RASG-n than for RAG-n, while
RSAG-n has the lowest number of shifts as shown in
Fig. 6 (b).
In Fig. 7 (a) a histogram for the required number of
adders using 10-bit coefficients is shown. RASG-n
and RAG-n only have a different number of adders in
one out of the 100 cases. As can be seen in Fig. 7 (b)
RASG-n has on average more than 11 fewer shifts than
RAG-n. RSAG-n has the highest number of adders
and the lowest number of shifts.
4.2 Coefficient Setsize Effects
With the coefficient wordlength fixed to 10 bits, the
different algorithms were used to design multiplier
blocks of varying setsize.
The average number of additional adders is shown
in Fig. 8 (a) for the RASG-n algorithm. For a small
setsize many of the coefficients will require two additional
adders, which results in a low probability of
optimality as shown in Fig. 8 (b). For a large setsize most
coefficients can be realized with only one additional
adder, and the probability that the total number of
adders is optimal is high. The algorithm performs better
for sets containing many coefficients. The reason is
that it is more likely to find required coefficients in the
matrix used in step 4 of the algorithm when a large set
is considered.
In Fig. 9 (a) the average number of adders for the
three algorithms is shown. Again, the number of
adders for RAG-n and RASG-n is similar. All algorithms
are likely to have an optimal number of adders
for a large setsize, and the difference is naturally small
for a small setsize. Hence, the difference between
RSAG-n and the other two algorithms has a maximum,
which occurs for setsize 20.
The difference in number of shifts increases
for larger setsize as shown in Fig. 9 (b). RSAG-n takes
full advantage of the fact that coefficients are more
likely to be obtained without additional shifts when
more values are available, and of course has the lowest
number of shifts. The average number of shifts is lower
for RASG-n than for RAG-n.
In Fig. 10 (a) a histogram for the required number
of adders using sets of 40 coefficients is shown. It can
be seen that RASG-n and RAG-n have the same
number of adders in all 100 cases. However, RASG-n
has on average almost 18 fewer shifts than RAG-n as
illustrated in Fig. 10 (b).
5  Adder Depth
In [22] and [23] methods to predict the number of transitions
in multiplier blocks were introduced. These
methods are based on the fact that high adder depth results
in more transitions, and consequently higher power
consumption.
[Figure 6 plots: number of adders and number of shifts versus coefficient bits for RAG-n, RASG-n, and RSAG-n.]
Figure 6. Average number of (a) adders and
(b) shifts for sets of 25 coefficients.
[Figure 7 plots: frequency versus (a) number of adders and (b) number of shifts for RAG-n, RASG-n, and RSAG-n.]
Figure 7. Frequency of the number of (a) adders
and (b) shifts for the three different algorithms
using 10-bit coefficients.
[Figure 8 plots: (a) adder probability [%] (no adders, 1 adder, 2 adders, ≥ 3 adders) and (b) optimal probability [%] versus number of coefficients.]
Figure 8. (a) Average number of additional adders
for each coefficient and (b) probability of proven
optimal adder cost for the RASG-n algorithm.
[Figure 9 plots: number of adders and number of shifts versus number of coefficients for RAG-n, RASG-n, and RSAG-n.]
Figure 9. Average number of (a) adders and
(b) shifts for 10-bit coefficients.
The characteristics of the three algorithms considering
adder depth are shown in Fig. 11. The same coefficient
sets as in Section 4 were used. It is clear that
RAG-n has the lowest adder depth among the three algorithms.
For comparison, the lower bound [24] is
also included. In Figs. 11 (a) and (b) it can be seen that
RAG-n is close to the lower bound for small coeffi-
cients. Furthermore, the adder depth does not increase
for larger coefficient sets for RAG-n and therefore ap-
proaches the lower bound, which is illustrated in
Figs. 11 (c) and (d).
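As a rough illustration of how a per-coefficient lower bound on adder depth can be computed, the sketch below uses the bound ⌈log2 S⌉, where S is the number of nonzero CSD digits of the coefficient: an adder tree of depth D can combine at most 2^D signed power-of-two terms. That this is exactly the bound taken from [24] is an assumption here, since the text only cites the reference; the function names are hypothetical.

```python
def nonzero_csd_digits(c):
    """Number of nonzero digits in the CSD representation of |c|."""
    n = abs(c)
    count = 0
    while n != 0:
        if n % 2:
            n -= 2 - (n % 4)   # subtract the CSD digit (+1 or -1)
            count += 1
        n //= 2
    return count


def adder_depth_lower_bound(c):
    """ceil(log2(S)) for S nonzero CSD digits; 0 for zero or a power of two."""
    s = nonzero_csd_digits(c)
    return (s - 1).bit_length() if s > 0 else 0


bounds = {c: adder_depth_lower_bound(c) for c in (4, 45, 363)}
```

For the constant 45 the bound is 2, which the two-stage realization in Fig. 3 attains.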
6  Implementation Examples
The power consumption is studied by the use of two
example filters implemented by logic synthesis of
VHDL code using a 0.35 µm CMOS standard cell library.
The filters are implemented using the transposed
direct form structure shown in Fig. 1. Only the
arithmetic parts are considered here, i.e., the multiplier
block and the structural adders, which refer to the
adders that are not included in the dashed box in
Fig. 1.
6.1 Example 1
A 27th-order lowpass linear-phase FIR filter with
passband edge 0.15π rad and stopband edge 0.4π rad
is used. The maximum passband ripple is 0.01, while
the stopband attenuation is 80 dB.
The filter has symmetric coefficients {4, 18, 45, 73,
72, 6, –132, –286, –334, –139, 363, 1092, 1824,
2284}/2^13, which have been optimized for a minimum
number of signed-powers-of-two (SPT) terms. The
number of fractional bits of the coefficients, Wf, is 13
bits. Nine different values of the digit-size, d = {1, 2,
3, 4, 5, 6, 8, 10, 15}, are considered. The input data
wordlength, Wd, is selected as 16 bits. The total wordlength,
WT, is computed for each digit-size from (2) as
WT = {29, 30, 30, 32, 30, 30, 32, 30, 30}.
In Fig. 12 the multiplier block realizations for the
three different algorithms are illustrated using graph
representation. The required number of adders and
shifts is given in Table 1.
[Figure 10 plots: frequency versus (a) number of adders and (b) number of shifts for RAG-n, RASG-n, and RSAG-n.]
Figure 10. Frequency of the number of (a) adders
and (b) shifts for the three different algorithms
using sets of 40 coefficients.
[Figure 11 plots: average and maximum adder depth versus coefficient bits, (a) and (b), and versus number of coefficients, (c) and (d), for RAG-n, RASG-n, RSAG-n, and LB.]
Figure 11. Average and maximum adder depth.
(a), (b) Sets of 25 coefficients and (c), (d) 10-bit
coefficients. LB denotes the lower bound.
Table 1. Arithmetic complexity for the first filter.
Algorithm      Adders   Internal shifts   External shifts   Total shifts
RAG-n [5]      12       20                10                30
RASG-n         12       14                9                 23
RSAG-n [10]    14       18                1                 19
Pasko [7]      15       27                12                39
CSD [3]        28       78                20                98
Figure 12. Multiplier block realizations using the
(a) RAG-n, (b) RASG-n, and (c) RSAG-n
algorithm. Coefficient values are in bold.
The RAG-n and RASG-n algorithms require 12 adders, which is optimal for this
coefficient set. Internal and external shifts refer to the
shifts within the multiplier block and shifts between
the multiplier block and the structural adders, respectively.
The RSAG-n algorithm requires the lowest total
number of shifts, and has few external shifts as it
maintains even coefficients in the design algorithm.
Also included are implementations using separate
CSD multipliers and based on the algorithm in [7].
The smallest area is obtained for RSAG-n for small
digit-sizes, while for larger digit-sizes RASG-n is the
best.
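The adder count of the separate CSD multipliers in Table 1 can be reproduced with a short calculation, assuming that each distinct coefficient is realized by its own CSD multiplier and that such a multiplier needs one adder less than its number of nonzero digits. The helper name `csd_adder_cost` is hypothetical.

```python
def csd_adder_cost(c):
    """Adders for a stand-alone CSD multiplier: nonzero digits minus one."""
    n, nonzero = abs(c), 0
    while n != 0:
        if n % 2:
            n -= 2 - (n % 4)   # subtract the CSD digit (+1 or -1)
            nonzero += 1
        n //= 2
    return max(nonzero - 1, 0)


# Distinct coefficient magnitudes of the first example filter (scaled by 2**13).
example_1 = [4, 18, 45, 73, 72, 6, -132, -286, -334, -139,
             363, 1092, 1824, 2284]
costs = [csd_adder_cost(c) for c in example_1]
total_csd_adders = sum(costs)
```

The total agrees with the 28 adders listed for CSD in Table 1, more than twice the 12 adders of the RAG-n and RASG-n multiplier blocks.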
The maximum clock frequency and corresponding
maximum sample frequency are shown in Fig. 13. Here
it is seen that the CSD implementations have the highest
sample frequency. This is because for CSD multipliers
there are at least two shifts between each adder,
and hence, the critical path is short. Furthermore, CSD
multipliers have low adder depth. The slowest implementations
are the ones based on RASG-n. This can be
explained by the fact that many adders are cascaded without
any shifts in between for the RASG-n case.
The power consumption was obtained using NanoSim
with 100 random input samples. As can be
seen in Fig. 14 (a) the energy per sample for the shifts
in the multiplier block is smallest for RSAG-n and
largest for CSD. The energy per sample for the adders
in the multiplier block is shown in Fig. 14 (b). RAG-n
consumes the least energy for any digit-size. By adding the
energy for the adders and the shifts, the energy for the
multiplier block is obtained, as shown in Fig. 14 (c).
RSAG-n consumes the least energy for digit-sizes one
and two and RAG-n for larger digit-sizes. Note that
the energy consumption corresponding to shifts and
adders dominates for small and large values of the digit-size,
respectively. In Fig. 14 (d) the normalized energy
per sample is shown. From this it can be seen that
the optimal digit-size for RASG-n and RSAG-n is
three, while for the other three algorithms it is six.
The energy per sample consumed for the structural
adders is shown in Fig. 14 (e), while the total energy
for all arithmetic operations is shown in Fig. 14 (f).
The power for the structural adders is only affected by
the glitches from the multiplier block. It can be seen
that the glitches are significantly higher for RASG-n
and RSAG-n. For RSAG-n the reason is that the
number of external shifts, which provide glitch reduction
between the multiplier block and the structural
adders, is small. For RASG-n the increased number of
glitches due to high adder depth in the multiplier block
is propagated to the structural adders.
A surprising result is that the energy consumed by
the adders is larger for RASG-n than RSAG-n, al-
though the number of adders is smaller. The reason for
this will be discussed in the following.
From Fig. 12 the adder depth for each coefficient in
the example filter can be found, which is shown in
Fig. 15. It is clear that RASG-n has larger adder depth
than RSAG-n, which explains the higher power consumption.
RAG-n has the lowest adder depth.
The fact that adder depth is highly correlated with
power consumption is established when the energy
consumed in each adder is investigated. This is shown
in Figs. 16 (a) and (b) for digit-size one and five, re-
spectively. Note that the RSAG-n implementation in-
cludes two extra adders, hence, the total energy is
larger than illustrated in Fig. 16.
However, the energy consumption also depends on
other factors. Consider, for example, the coefficients
363 and 2284 (4⋅571). Although both have adder depth
six in the RASG-n realization, the adder that generates
the output corresponding to the coefficient 363 consumes
more than twice as much energy for the implementation
with digit-size five.
[Figure 13 plots: (a) clock frequency [MHz] and (b) sample frequency [MHz] versus digit-size for RAG-n, RASG-n, RSAG-n, Pasko, and CSD.]
Figure 13. (a) Maximum clock frequency and
(b) maximum sample frequency.
[Figure 14 plots: energy per sample [nJ] versus digit-size, panels (a) to (f), for RAG-n, RASG-n, RSAG-n, Pasko, and CSD.]
Figure 14. Consumed energy per sample for
(a) shifts, (b) adders, (c) the total multiplier block,
(d) normalized for the total multiplier block,
(e) structural adders, and (f) all arithmetic parts.
The explanation can be found in Fig. 12 (b). One of the inputs to the 571 adder
is directly connected to the input, i.e., it is glitch free,
and the glitches at the other input are reduced by two
shifts. For the 363 adder there is a path from the input
through all adders in the critical path without any
shifts at all, i.e., generated glitches are propagated
without any reduction.
6.2 Example 2
As adder depth was shown to highly affect the energy
consumption, an example where larger difference in
adder depth can be expected for the different algorithms
is now considered. The three previously studied
algorithms are here compared to the C1 algorithm
[15]. For the simple coefficient set used in Example 1
the C1 algorithm gives exactly the same result as
RAG-n, and was therefore not included.
The 24th-order linear-phase FIR filter used for the
example in [15] is considered. The filter has symmetric
coefficients {–710, 327, 505, 582, 398, –35, –499,
–662, –266, 699, 1943, 2987, 3395}/2^14. The data
wordlength, Wd, is 20 bits and the considered digit-sizes
are d = {1, 2, 3, 4, 5, 6, 7, 9, 12, 17}.
The obtained number of adders and shifts for the
different algorithms is presented in Table 2. As expected,
RAG-n has the lowest number of adders and
RSAG-n has the lowest number of shifts.
The average and maximum adder depth is also given
in Table 2 and illustrated in Fig. 17 for each filter
coefficient. The C1 algorithm has a significantly lower
adder depth than any of the other three algorithms.
For bit-serial arithmetic the multiplier block consumes
the least energy using the RSAG-n algorithm, as
shown in Fig. 18 (c). For a digit-size larger than one,
C1 performs better than the other algorithms. Furthermore,
the structural adders consume less energy using
C1 for any digit-size, as illustrated in Fig. 18 (e).
When all arithmetic parts are considered the RASG-n
and RSAG-n algorithms both have a minimum for digit-size
two. RAG-n and C1 also have a minimum, but
for d = 4.
As expected, the C1 algorithm consumes less ener-
gy due to low adder depth. However, the difference is
not large, and it should be possible to obtain improved
results using an algorithm that combines the good
qualities of different algorithms.
7  Conclusions
In this paper implementation of low power digit-serial
FIR filters using multiple constant multiplication
(MCM) techniques has been considered. Some con-
clusions regarding design guidelines for low power
digit-serial multiplier blocks can be deduced. The ac-
tual complexity in terms of adder cost and number of
shifts is not the main factor determining the power
consumption. Instead the adder depth, as for parallel
arithmetic, is a main contributor. Hence, an algorithm
with low adder depth should be used. Furthermore, the
shifts prevent glitch propagation through subsequent
adders. For even coefficients the shifts can be placed
either before or after the final additions. Hence, a heuristic
for placing the shifts would also be useful.
Figure 15. Adder depth for each coefficient in
Example 1 using three different algorithms.
[Figure 15 plot: adder depth for each coefficient (4, 18, 45, 73, 72, 6, −132, −286, −334, −139, 363, 1092, 1824, 2284) for RAG-n, RASG-n, and RSAG-n.]
Figure 16. Energy per sample for the adders in
Example 1. Digit-size (a) one and (b) five.
[Figure 16 plots: energy per sample [nJ] for each coefficient (4, 18, 45, 73, 72, 6, −132, −286, −334, −139, 363, 1092, 1824, 2284) for RAG-n, RASG-n, and RSAG-n.]
Table 2. Arithmetic complexity for the second filter.
Algorithm      Adders   Shifts   Average adder depth   Max adder depth
RAG-n [5]      17       35       4.68                  9
RASG-n         19       19       5.96                  11
RSAG-n [10]    20       18       4.84                  9
C1 [15]        19       33       2.80                  5
[Figure 17 plot: adder depth for each coefficient (−710, 327, 505, 582, 398, −35, −499, −662, −266, 699, 1943, 2987, 3395) for RAG-n, RASG-n, RSAG-n, and C1.]
Figure 17. Adder depth for each coefficient in
Example 2 using four different algorithms.
References:
[1] L. Wanhammar, DSP Integrated Circuits,
Academic Press, 1999.
[2] L. Wanhammar and H. Johansson, Digital Filters,
Linköping University, 2002.
[3] M. Vesterbacka, K. Palmkvist, and L.
Wanhammar, Realization of serial/parallel
multipliers with fixed coefficients, in Proc.
National Conf. Radio Science, 1993, pp. 209–212.
[4] D. R. Bull and D. H. Horrocks, Primitive operator
digital filters, IEE Proc. G, Vol. 138, No. 3, 1991,
pp. 401–412.
[5] A. G. Dempster and M. D. Macleod, Use of
minimum-adder multiplier blocks in FIR digital
filters, IEEE Trans. Circuits Syst. II, Vol. 42, No. 9,
1995, pp. 569–577.
[6] R. I. Hartley, Subexpression sharing in filters using
canonic signed digit multipliers, IEEE Trans.
Circuits Syst. II, Vol. 43, 1996, pp. 677–688.
[7] R. Pasko, P. Schaumont, V. Derudder, S. Vernalde,
and D. Durackova, A new algorithm for elimination
of common subexpressions, IEEE Trans.
Computer-Aided Design Integrated Circuits, Vol.
18, No. 1, 1999, pp. 58–68.
[8] O. Gustafsson, H. Ohlsson, and L. Wanhammar,
Improved multiple constant multiplication using
minimum spanning trees, in Proc. Asilomar Conf.
Signals, Syst., Comp., 2004, pp. 63–66.
[9] Y. Voronenko and M. Püschel, Multiplierless
multiple constant multiplication, ACM Trans.
Algorithms, 2006.
[10] K. Johansson, O. Gustafsson, A. G. Dempster, and
L. Wanhammar, Algorithm to reduce the number of
shifts and additions in multiplier blocks using serial
arithmetic, in Proc. IEEE Melecon, 2004, pp. 197–200.
[11] S. He and M. Torkelson, FPGA implementation of
FIR filters using pipelined bit-serial canonical
signed digit multipliers, in Proc. IEEE Custom
Integrated Circuits Conf., 1994, pp. 81–84.
[12] J. Valls, M. M. Peiro, T. Sansaloni, and E. Boemo,
Design and FPGA implementation of digit-serial
FIR filters, in Proc. IEEE Int. Conf. Electronics,
Circuits, Syst., 1998, Vol. 2, pp. 191–194.
[13] H. Lee and G. E. Sobelman, FPGA-based FIR
filters using digit-serial arithmetic, in Proc. IEEE
Int. ASIC Conf., 1997, pp. 225–228.
[14] K. Johansson, O. Gustafsson, and L. Wanhammar,
Implementation of low-complexity FIR filters
using serial arithmetic, in Proc. IEEE Int. Symp.
Circuits Syst., 2005, pp. 1449–1452.
[15] A. G. Dempster, S. S. Demirsoy, and I. Kale,
Designing multiplier blocks with low logic depth,
in Proc. IEEE Int. Symp. Circuits Syst., 2002, Vol.
5, pp. 773–776.
[16] S. G. Smith and P. B. Denyer, Serial-Data
Computation, Kluwer, 1988.
[17] R. I. Hartley and K. K. Parhi, Digit-Serial
Computation, Kluwer, 1995.
[18] H. Suzuki, Y.-N. Chang, and K. K. Parhi,
Performance tradeoffs in digit-serial DSP systems,
in Proc. Asilomar Conf. Signals, Syst., Computers,
1998, Vol. 2, pp. 1225–1229.
[19] K. K. Parhi, A systematic approach for design of
digit-serial signal processing architectures, IEEE
Trans. Circuits Syst., Vol. 38, No. 4, 1991,
pp. 358–375.
[20] K. K. Parhi, C.-Y. Wang, and A. P. Brown,
Synthesis of control circuits in folded pipelined
DSP architectures, IEEE J. Solid-State Circuits,
Vol. 27, 1992, pp. 29–43.
[21] K. K. Parhi, VLSI Digital Signal Processing
Systems: Design and Implementation, Wiley, 1998.
[22] S. S. Demirsoy, A. G. Dempster, and I. Kale,
Transition analysis on FPGA for multiplier-block
based FIR filter structures, in Proc. IEEE Int. Conf.
Elect. Circuits Syst., 2000, Vol. 2, pp. 862–865.
[23] S. S. Demirsoy, A. G. Dempster, and I. Kale, Power
analysis of multiplier blocks, in Proc. IEEE Int.
Symp. Circuits Syst., 2002, Vol. 1, pp. 297–300.
[24] O. Gustafsson, A. G. Dempster, K. Johansson, M.
D. Macleod, and L. Wanhammar, Simplified
design of constant coefficient multipliers, Circuits,
Syst. Signal Processing, Vol. 25, No. 2, 2006, pp.
225–251.
Figure 18. Consumed energy per sample for
(a) shifts, (b) adders, (c) the total multiplier block,
(d) normalized for the total multiplier block,
(e) structural adders, and (f) all arithmetic parts.
[Figure 18 plots: energy per sample [nJ] versus digit-size, panels (a) to (f), for RAG-n, RASG-n, RSAG-n, and C1.]