Architectural trade-offs in the design of low power
FIR ﬁltering cores
A.T. Erdogan, E. Zwyssig and T. Arslan
Abstract: There is a continuous drive for methodologies and approaches of low power design. This
is mainly driven by the surge in portable computing. On the other hand, the design of low power
systems for different portable applications is not a simple task. This is because of the number of
constraints that inﬂuence the power consumption of a device. In addition to issues of performance
and functionality, there is a need to satisfy strict test coverage constraints. The authors investigate
the impact of DSP architectural realisation, multiplier type, and the choice of number
representation on the overall power consumption of DSP devices. Work in the literature so far
has concentrated on the effect of these on a part or a section of a DSP system. Furthermore the
effect of DfT circuits on the overall performance is studied. A hearing aid device is considered as an
example of a system with strict power/area constraints. It is shown that the choice of multiplier
architecture and number representation should be carefully considered when speciﬁc DSP
architectural choices are made. The results are demonstrated with a number of specially designed
DSP architectures for the implementation of FIR ﬁltering algorithms on hearing aid devices.
1 Introduction
FIR ﬁltering algorithms such as subband decomposition,
noise reduction and echo cancellation are executed
repetitively in DSP systems such as hearing aids. Therefore
an effective ﬁltering architecture is a key part of a DSP
system. The basic FIR ﬁlter is represented by the following
equation:
\[ y_n = \sum_{m=0}^{M-1} h_m \, x_{n-m} \qquad (1) \]
From this equation a number of key components/opera-
tions can be identiﬁed. These are as follows:
  a multiply-accumulator (MAC),
  several forms of memory such as those required for the
sample values x_{n-m} (RAM) and the coefficient values h_m
(ROM),
  a storage cell for the filter output value y_n,
  a controller which schedules the different components.
Figure 1 illustrates the principal data flow between the
above components in a block diagram. Among these
components the MAC is the most critical, which in turn
accommodates the multiplier.
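As a rough illustration of how (1) maps onto a multiply-accumulate loop, the following Python sketch evaluates the filter with one MAC operation per tap. It is a behavioural model only: the function name `fir_direct_form` and the assumption that samples before the start of the input are zero are introduced here for illustration and are not taken from the designs described in this paper.

```python
def fir_direct_form(h, x):
    """Evaluate y[n] = sum_{m=0}^{M-1} h[m] * x[n-m] for every n.

    Assumption: x[k] is treated as 0 for k < 0 (empty data memory at start).
    """
    M = len(h)
    y = []
    for n in range(len(x)):
        acc = 0  # accumulator is cleared for each new output sample
        for m in range(M):
            if n - m >= 0:
                acc += h[m] * x[n - m]  # one multiply-accumulate per tap
        y.append(acc)  # output storage cell
    return y


if __name__ == "__main__":
    # The impulse response of the filter is its coefficient set.
    print(fir_direct_form([1, 2, 3], [1, 0, 0, 0]))
```

The inner loop corresponds to the MAC, the lists `h` and `x` to the coefficient and data memories, and the loop bounds to the controller schedule.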
Different multipliers have been explored in the past and
their architectures and layouts analysed with regard to
power dissipation, area consumption and circuit delay [1].
Meier et al. [2] compared array multipliers with Wallace tree
multipliers. Keane et al. [3] veriﬁed the impact of data
characteristics and different multiplier architectures for low
power dissipation. A power minimisation scheme for
multipliers is presented by Fried [4] and a power-efﬁcient
multiply–accumulate design for ﬁlters can be found in
Farag et al. [5]. A hearing aid speciﬁc low power multiply–
accumulate scheme, particularly for FFT butterﬂy and ﬁlter
structures, has been proposed by Moller et al. [6]. Nielsen
et al. [7] have investigated a number of design issues relating
to the implementation of low power asynchronous FIR
ﬁlter circuits including those targeting hearing aid applica-
tions. In [8] the authors demonstrate that low power FIR
ﬁlters can be constructed from the concurrent multiplier-
accumulator circuits.
All of the above works consider performance issues
regarding the MAC or multiplier alone without analysing
the overall power consumption of a system. However, for a
true performance evaluation the overall system architecture
together with the design of the key components must be
considered. For example, a linear phase FIR ﬁlter may be
implemented utilising a direct form (DF) or a folded direct
form (FDF) structure. The coefﬁcients for such a ﬁlter are
symmetric around the midpoint of the impulse response. A
Fig. 1 Filter block diagram (blocks: coefficient memory, data memory, multiply-accumulate, output storage, controller)
A.T. Erdogan and T. Arslan are with the Department of Electronics &
Engineering, The University of Edinburgh, The King’s Buildings, Mayﬁeld
Road, Edinburgh EH9 3JL, Scotland, UK
E. Zwyssig is with Wolfson Microelectronics Ltd., 20 Bernard Terrace,
Edinburgh EH8 9NX, Scotland, UK
© IEE, 2004
IEE Proceedings online no. 20040227
doi:10.1049/ip-cds:20040227
Paper ﬁrst received 23rd April 2002 and in revised form 25th February 2003
IEE Proc.-Circuits Devices Syst., Vol. 151, No. 1, February 2004
linear-phase filter can thus be implemented efficiently by a
FDF structure, where two data samples are added before
being multiplied with the corresponding coefﬁcient. There-
fore, using a FDF structure will reduce the number of
multiplications by half at the expense of additional
hardware. A dual-port memory and an additional adder
will be required to access two data samples at a time and to
add them together prior to the multiplication stage. In
addition, the use of different number representations will
introduce some overheads. This is mainly due to the need
for data converters before and after the multiplication
process.
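The saving in multiplications can be checked with a small behavioural sketch. The Python code below compares a direct form and a folded direct form evaluation of the same symmetric filter and counts the multiplications in each. The helper names (`sample`, `fir_df`, `fir_fdf`), the even filter length, and the zero initial data memory are assumptions made for this illustration.

```python
def sample(x, k):
    """Read x[k]; indices outside the recorded input are assumed to be zero."""
    return x[k] if 0 <= k < len(x) else 0


def fir_df(h, x):
    """Direct form: one multiplication per tap. Returns (outputs, multiplications)."""
    M = len(h)
    out, mults = [], 0
    for n in range(len(x)):
        acc = 0
        for m in range(M):
            acc += h[m] * sample(x, n - m)
            mults += 1
        out.append(acc)
    return out, mults


def fir_fdf(h, x):
    """Folded direct form for symmetric h of even length.

    Two samples that share a coefficient are added first (pre-adder),
    then multiplied once.
    """
    M = len(h)
    assert M % 2 == 0 and all(h[m] == h[M - 1 - m] for m in range(M // 2))
    out, mults = [], 0
    for n in range(len(x)):
        acc = 0
        for m in range(M // 2):
            pre = sample(x, n - m) + sample(x, n - (M - 1 - m))  # two reads
            acc += h[m] * pre
            mults += 1
        out.append(acc)
    return out, mults


if __name__ == "__main__":
    h = [1, 2, 3, 3, 2, 1]
    x = [5, -1, 4, 0, 2, 7, -3]
    print(fir_df(h, x))
    print(fir_fdf(h, x))
```

Both functions return the same output sequence, while the folded version performs half as many multiplications. The pre-added operand `pre` needs one more bit than a single sample, which is the wordlength overhead of the folded structure.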
Furthermore, for an FIR ﬁlter core to become a viable
product it must have high fault coverage and hence must
incorporate adequate design-for-testability (DfT) circuitry.
For example, a scan path can be used for small blocks such
as counters and built-in self-test (BIST) for more complex
ones such as memory and MAC. However, DfT circuits
introduce an overhead in terms of the additional circuitry.
Traditionally, high fault coverage and fast test time have
been the main objectives of DfT designs. While these
objectives still remain, a new design objective, namely low
power dissipation, is becoming especially important. For
these reasons, the test circuitry is crucial to the power and
area performance of an FIR ﬁlter core.
In this paper, we investigate the impact of DSP
architectural realisation, multiplier type, and the choice of
number representation on the overall power consumption
of DSP devices. We consider a hearing aid device as an
example of a system with strict power/area constraints.
Furthermore we study the effect of DfT circuits on the
overall performance, using typical compact BIST circuitry
used in area/power critical applications such as hearing aids.
We show that the choice of multiplier architecture and
number representation should be carefully considered when
speciﬁc DSP architectural choices are made. This type of
investigation is not presented in the literature so far. Our
results are demonstrated with a number of specially
designed DSP architectures for the implementation of
FIR ﬁltering algorithms in hearing aid devices.
2 Implementation
2.1 Multiplier cores
Although multiplication is now available from synthesis
tools it is still worthwhile to design multipliers from scratch.
There are several reasons for this. First of all, designing a
multiplier with a technology independent HDL makes it
available for all future projects. No dependence on a tool
vendor exists and the resulting design is known in detail.
Low power modiﬁcations and operand size changes can be
performed easily, i.e. a maximum level of maintainability is
achieved. Secondly, different multiplication schemes such as
Dadda [9], Wallace [10], Booth [11], modified Booth [12],
and pre-add Booth [6] can be implemented and compared
which is not possible with a multiplier provided from an
EDA vendor. Finally, DfT features can be included in the
design.
Availability of a wide variety of multiplication schemes
and limitations of automatic synthesis tools make it difﬁcult
to select a good multiplier for a design. It is therefore
necessary to design and compare a series of different
multipliers. The following parallel multipliers have been
designed and evaluated in terms of area, power consump-
tion, and fault coverage:
  Wallace–Dadda multiplier
  Booth multiplier
  reduced glitch Booth multiplier
  redundant binary Booth multiplier
  pre-add Booth multiplier
The rest of this Section brieﬂy describes the multiplier
cores to clearly distinguish the details of the architectures
designed and utilised in this work.
Wallace–Dadda multiplier: Wallace [10] and Dadda [9]
have shown that a parallel multiplication can be executed
more efficiently than in an array multiplier. Wallace
suggested the use of carry–save adders (csa) to reduce the
partial products with a compression factor of 3 to 2. Once
only two partial products are left the final sum is generated
with a carry-look-ahead adder. Dadda suggested a
modification of the Wallace algorithm so that the number
of carry–save adders and the delay of the partial product
reduction tree can be reduced further.
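The 3-to-2 compression step can be illustrated at word level. The Python sketch below is a simplified model, not the column-wise bit-level tree of an actual Wallace or Dadda multiplier: it reduces the rows of an unsigned partial product array three at a time until two rows remain, then performs a single carry-propagating addition. The function names and the row-wise grouping are assumptions made for this illustration.

```python
def carry_save_add(a, b, c):
    """3-to-2 compression: returns (s, cy) such that a + b + c == s + cy."""
    s = a ^ b ^ c                               # bitwise sum without carries
    cy = ((a & b) | (a & c) | (b & c)) << 1     # carries, shifted to their weight
    return s, cy


def multiply_csa_tree(x, y, bits=16):
    """Unsigned multiply using repeated 3-to-2 reduction.

    Returns (product, number of reduction levels).
    """
    assert 0 <= x < (1 << bits) and 0 <= y < (1 << bits)
    # One partial product row per multiplier bit.
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(bits)]
    levels = 0
    while len(rows) > 2:
        nxt = []
        for i in range(0, len(rows) - 2, 3):
            nxt.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        nxt.extend(rows[len(rows) - len(rows) % 3:])  # leftover rows pass through
        rows = nxt
        levels += 1
    # Final carry-propagating addition (a carry-look-ahead adder in hardware).
    return rows[0] + rows[1], levels


if __name__ == "__main__":
    print(multiply_csa_tree(1234, 567))
```

In this model no carry propagates during the reduction levels; carry propagation is deferred to the final addition of the two remaining rows.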
Booth multiplier: Booth [11] suggested a higher radix
multiplication algorithm to design faster multipliers. In
his approach, multiple bits of the multiplier input are
scanned to generate fewer partial products of the
multiplicand. For example, three bits {y_{i+1}, y_i, y_{i-1}} of the
multiplier input are scanned simultaneously
if radix-4 Booth encoding is used. A positive or
negative multiple of the multiplicand (B, 2B, 0, −2B, or −B) is then added
to the partial product reduction tree, depending on the
scanned triple.
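The radix-4 recoding can be sketched in a few lines of Python. Each overlapping triple {y_{i+1}, y_i, y_{i-1}} is mapped to a digit in {−2, −1, 0, 1, 2} by the standard rule d = −2·y_{i+1} + y_i + y_{i-1}, with y_{-1} = 0. The function names and the default 16-bit width are assumptions for this illustration; the sketch models the arithmetic only, not the encoder circuit.

```python
def booth_radix4_digits(y, bits=16):
    """Recode a signed two's-complement multiplier into bits/2 radix-4 digits."""
    assert bits % 2 == 0 and -(1 << (bits - 1)) <= y < (1 << (bits - 1))
    u = y & ((1 << bits) - 1)  # two's-complement bit pattern of y

    def bit(i):
        return 0 if i < 0 else (u >> i) & 1  # y_{-1} is defined as 0

    # Scan overlapping triples (y_{i+1}, y_i, y_{i-1}) for i = 0, 2, 4, ...
    return [-2 * bit(i + 1) + bit(i) + bit(i - 1) for i in range(0, bits, 2)]


def booth_multiply(b, y, bits=16):
    """Sum the selected multiples of the multiplicand b, weighted by 4**k."""
    return sum(d * b * 4 ** k for k, d in enumerate(booth_radix4_digits(y, bits)))


if __name__ == "__main__":
    print(booth_radix4_digits(6, bits=4))
    print(booth_multiply(123, -45))
```

A 16-bit multiplier yields eight digits, so only eight partial products enter the reduction tree instead of sixteen.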
Reduced glitch Booth (RG_Booth) multiplier: Fried [4]
suggested a two-gate-delay implementation of the Booth
encoder and partial product generator in order to balance
the gate delays in the Booth multiplier and therefore reduce
the glitches and hence the power consumption.
Redundant binary Booth (RB_Booth) multiplier: Carry–save
adders are usually used to reduce the partial
products with a compression factor of 3 to 2. However,
better results can be achieved using a compressor with a
compression factor of 4 to 2 or more. Compressors with
higher compression factors can be implemented using
redundant signed-digit number representation. In this work,
a radix-2 signed digit set, defined as {−1, 0, 1}, is used for
the implementation of the redundant binary multiplier
suggested in [12].
Pre-add Booth (P_Booth) multiplier: Moller et al. [6]
suggested using a pre-add Booth multiplier, a Booth
multiplier that has a pre-adder integrated in the Booth
encoder, for ﬁlters with symmetric coefﬁcients. The
P_Booth encoding requires an adder and some logic gates
for the carry generation, in addition to the Booth encoder.
The P_Booth encoding is complex and has an increased
delay compared to the normal Booth encoding due to the
carry propagate logic. The P_Booth encoder is also liable to
glitches due to the carry propagation delay through
several Booth encoders. The authors in [6] have
examined ﬁve different partial product addition structures
in terms of number of nodes, gate delays, and the average
number of node transitions per multiplication. It has been
shown that the number of transitions could be minimised
by building a mixed adder tree with 3 to 2 and 4 to 2
compressors.
2.2 Multiply-and-accumulate
The MAC is the core cell of a digital signal processor which
calculates the sum of an accumulated intermediate result
and a product. The block diagrams of the basic MAC and
the pre-add MAC units that have been implemented
are shown in Fig. 2. The basic MAC cell contains a
multiplier, an adder/subtractor, a register (accumulator),
and a multiplexer. The functionality of the MAC is defined
as follows:
\[ s = \begin{cases} s + (-1)^{neg}\, b\, g, & \text{if } acc \\ a + (-1)^{neg}\, b\, g, & \text{otherwise} \end{cases} \]
The implementation architecture of the MAC is depicted in
Fig. 3. The mult_16x16 determines the partial product of
two 16-bit operands x and y. The actual product of x and y
is the sum of the mult_16x16 outputs out1 and out2
(x·y = out1 + out2). The control signal, neg, determines
whether the multiplier output is added to or subtracted
from the multiplexer output which in turn is determined by
the control signal acc. The multiplexer is used for switching
between a new input value, a, or the stored accumulator
value, s. The subtraction is realised by exclusive-oring the
mult_16x16 outputs with the neg signal and adding the neg
to the resulting values through the carry-in inputs of the
csa_36 and cla_36 circuits, effectively two’s complementing
the mult_16x16 outputs out1 and out2. To reduce the
likelihood of an overflow during the accumulation process
the accumulator employs 4 guard bits, making it 36 bits
wide. Therefore, the mult_16x16 outputs are sign extended
to 36 bits after they are exclusive-ored. A 36-bit carry–save
adder (csa_36) is used to reduce the three inputs to two after
which they are added through a 36-bit carry-look-ahead
adder (cla_36). The final output is then stored into the
accumulator.
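The subtraction mechanism can be checked with a bit-level behavioural model. The Python sketch below follows the description above under stated assumptions: the two multiplier outputs are taken as 36-bit two's-complement words, each is exclusive-ored with neg, and neg is added once per carry-in, so that both words are two's complemented when neg = 1. The names `mac_step` and `to_signed` are introduced for this illustration, and the model does not distinguish the csa_36 and cla_36 stages.

```python
WIDTH = 36                  # 32-bit product plus 4 guard bits
MASK = (1 << WIDTH) - 1


def to_signed(v, width=WIDTH):
    """Interpret the low `width` bits of v as a two's-complement number."""
    v &= (1 << width) - 1
    return v - (1 << width) if v >> (width - 1) else v


def mac_step(s, a, out1, out2, acc, neg):
    """One accumulator update; out1 + out2 is the product from the multiplier."""
    base = s if acc else a                # multiplexer controlled by acc
    flip = MASK if neg else 0
    p1 = (out1 & MASK) ^ flip             # sign-extended and exclusive-ored with neg
    p2 = (out2 & MASK) ^ flip
    # Each carry-in contributes +neg, completing two two's complements.
    total = (base + p1 + p2 + 2 * neg) & MASK
    return to_signed(total)


if __name__ == "__main__":
    x, y = 1234, -567
    out1, out2 = 1000, x * y - 1000       # any pair that sums to the product
    print(mac_step(10, 99, out1, out2, acc=1, neg=0))
    print(mac_step(10, 99, out1, out2, acc=0, neg=1))
```

With acc = 1 the product is added to the stored value s, and with acc = 0 it is combined with the new input a; neg selects addition or subtraction in both cases.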
This MAC architecture is also suitable for performing the
multiplications using sign–magnitude (SM) number repre-
sentations. It is well known that two’s-complement (2’sC)
number representation has a much higher switching activity
than the SM representation [13]. However, SM addition
and subtraction are complex operations to implement. For
this reason SM is only used during the multiplication
process in this work. Prior to a multiplication the data at
both multiplier inputs are converted to SM. Then the
multiplication is performed using the unsigned magnitudes,
where the sign bits of the two operands are exclusive-ored to
determine the control signal neg.
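The conversion and the sign handling can be sketched as follows. The Python code below converts two's-complement operands to sign and magnitude, multiplies the magnitudes, and derives neg from the exclusive-or of the sign bits. It also counts bit transitions on a 16-bit bus for a short sequence of small values that alternate in sign, which is the situation where the two representations differ most. All function names and the example sequence are assumptions made for this illustration; the toggle count is a crude proxy and not the power analysis used in this paper.

```python
def to_sign_magnitude(v, bits=16):
    """Return (sign, magnitude). The most negative value has no SM encoding."""
    assert -(1 << (bits - 1)) < v < (1 << (bits - 1))
    return (1 if v < 0 else 0), abs(v)


def sm_multiply(x, y, bits=16):
    """Multiply magnitudes; neg is the exclusive-or of the two sign bits."""
    sx, mx = to_sign_magnitude(x, bits)
    sy, my = to_sign_magnitude(y, bits)
    return sx ^ sy, mx * my


def bus_patterns(samples, bits=16):
    """Bit patterns of the samples in two's complement and in sign-magnitude."""
    mask = (1 << bits) - 1
    twos = [v & mask for v in samples]
    sm = [(to_sign_magnitude(v, bits)[0] << (bits - 1)) | abs(v) for v in samples]
    return twos, sm


def toggles(words):
    """Number of bit flips between consecutive words on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


if __name__ == "__main__":
    neg, mag = sm_multiply(-7, 5)
    print(neg, mag)
    twos, sm = bus_patterns([3, -2, 1, -3, 2, -1])
    print(toggles(twos), toggles(sm))
```

For small values that change sign, the two's-complement patterns flip most of the upper bits at every sample, whereas the sign-magnitude patterns change only the sign bit and a few low-order bits.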
2.3 Digital ﬁlter implementation
This Section describes the FIR ﬁlter architectures imple-
mented in this work. The generic architecture of a DF FIR
ﬁlter is illustrated in Fig. 4. The coefﬁcient memory consists
of a ROM look-up table, an address counter and a data
multiplexer. The data memory (RAM) is built around an
array of latch banks (24 × 16 bits). In addition, it has a 1-to-24
address demultiplexer, a 24-to-1 data multiplexer, and
BIST circuitry. The dual port RAM used for the FDF
Fig. 2 Block diagram for the MAC units
a Basic MAC
b Pre-add MAC
Fig. 3 Architecture for the MAC unit (blocks: mult_16x16, xor_32, csa_36, cla_36; control signals: neg, acc, clk)
implementations has an additional data multiplexer for the
second output port. The MAC module contains a multiply–
adder and an accumulator. The system controller is not
shown here. Figure 5 shows the generic architecture of a
FDF FIR ﬁlter. This looks very similar to the DF ﬁlter.
However, two read ports for the x input sample memory
and two read address counters and a pre-adder are required
for the FDF FIR ﬁlter. Note that one input of the
multiplier (g in our example) requires a one bit increase in
operand size to maintain the full precision.
2.4 Design-for-testability
Different strategies could be used to achieve the required
fault coverage for a given design. In this work, the required
fault coverage (>90%) of an FIR filter core dictated by the
overall hearing aid DSP is achieved by using scan path for
controller related cells and BIST for memory and MAC
units. The BIST scheme employed in this work encloses the
device under test (DUT) like a gauge and isolates it from
the environment for the oncoming test, as shown in Fig. 6.
A pattern generator applies speciﬁc test patterns to the
DUT (e.g. MAC, RAM) in order to get maximum fault
coverage with minimum test time. An output compressor
takes the response from the DUT and minimises the
amount of data to be analysed. Both the BIST pattern
generator and the BIST output compressor are controlled
by the BIST controller. Multiplexer circuits are used to
isolate DUT while BIST is running. As an example, the
BIST block diagram for the MAC is shown in Fig. 7. An
effective BIST algorithm for fast multiplier cores has been
presented in [14]. We have modiﬁed this scheme for a MAC
circuit and achieved a fault coverage of >97% with only
1024 test patterns. A wide variety of memory test
algorithms exist such as chess pattern, butterﬂy, MATS,
MATS+, March C and 6-n algorithm [15]. In this work, we
have implemented the 6-n algorithm, because of its small
area consumption, short test time and excellent fault
coverage compared to others. This has been demonstrated
by previous research work in the literature [15]. These
features are ideal for a device such as a hearing aid with
strict performance constraints.
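The structure of a march-type memory test with six operations per cell can be sketched in Python. The text above does not spell out the 6-n algorithm, so the sketch assumes the MATS++ march sequence from the memory testing literature, which performs exactly six read or write operations per cell; whether this is the exact sequence used in the implemented BIST is an assumption. The `BitMemory` class, with its optional stuck-at fault injection, is a hypothetical test harness introduced here.

```python
class BitMemory:
    """Single-bit-per-address memory model with optional stuck-at faults."""

    def __init__(self, n, stuck_at=None):
        self.cells = [0] * n
        self.stuck_at = stuck_at or {}   # {address: forced value}
        self.ops = 0                     # counts read and write operations

    def write(self, addr, value):
        self.ops += 1
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        self.ops += 1
        return self.stuck_at.get(addr, self.cells[addr])


def march_6n(mem, n):
    """Assumed MATS++ sequence; returns True when no fault is observed."""
    for addr in range(n):                # element 1: ascending, write 0
        mem.write(addr, 0)
    for addr in range(n):                # element 2: ascending, read 0 then write 1
        if mem.read(addr) != 0:
            return False
        mem.write(addr, 1)
    for addr in reversed(range(n)):      # element 3: descending, read 1, write 0, read 0
        if mem.read(addr) != 1:
            return False
        mem.write(addr, 0)
        if mem.read(addr) != 0:
            return False
    return True


if __name__ == "__main__":
    good = BitMemory(24)
    print(march_6n(good, 24), good.ops)
    print(march_6n(BitMemory(24, stuck_at={5: 1}), 24))
```

For a fault-free memory of n cells the test issues 6n operations, so the test time grows linearly with the memory size.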
3 Circuit synthesis and power analysis
A number of FIR ﬁltering cores, each realising a 24-tap
linear-phase low-pass ﬁlter, were developed. These ﬁlter
cores vary in their realisation architecture, the type of
multiplier circuits employed, and the number representation
used, as shown in Table 1. Typical ﬁlter data (distorted sine
wave) were used for the veriﬁcation of the different ﬁlter
cores. The data was generated by adding two sine wave
signals, one representing the carrier and the other the
distortion, with the following characteristics: f_carrier = f_sample/9,
f_distortion = f_sample/3, and signal-to-noise ratio = 1.
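A test signal with these characteristics could be generated along the following lines. The Python sketch assumes that a signal-to-noise ratio of 1 means equal amplitudes for the two sine waves, that both start at zero phase, and that the sum is scaled to fit a 16-bit two's-complement word; none of these details are stated above, and the function name is hypothetical.

```python
import math


def distorted_sine(num_samples, bits=16):
    """Carrier at f_sample/9 plus distortion at f_sample/3, equal amplitudes."""
    full_scale = (1 << (bits - 1)) - 1
    amp = full_scale / 2  # each component gets half of full scale, so the sum fits
    out = []
    for n in range(num_samples):
        carrier = math.sin(2 * math.pi * n / 9)
        distortion = math.sin(2 * math.pi * n / 3)
        out.append(int(round(amp * (carrier + distortion))))
    return out


if __name__ == "__main__":
    print(distorted_sine(9))
```

Because the carrier period is nine samples and the distortion period is three samples, the combined signal repeats every nine samples.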
The different FIR cores have been analysed with regard
to fault coverage, area usage, and power consumption. The
Fig. 4 DF FIR filter block diagram (data memory, coefficient ROM, multiplier, adder, accu; total 24 cycles)
Fig. 5 FDF FIR filter block diagram (data memory with two read ports, coefficient ROM, pre-adder, multiplier, adder, 36 bit accu; total 12 cycles)
Fig. 6 Basic built-in self-test block diagram (DUT, BIST pattern generator, BIST output compressor, BIST controller)
Fig. 7 Built-in self-test block diagram for MAC (MAC pattern generator, MAC, MAC output compressor)
cores were designed using Verilog HDL and then synthesised
using Ambit BuildGates targeting a 0.35 μm standard
cell CMOS library. The synthesis constraints were
identical for all the cores. This was necessary to allow
consistent comparisons of power consumption and area usage.
A maximum circuit delay of 35 ns was specified for all
the cores. A layout for each core was generated using the
Envisia Silicon Ensemble place-and-route software. This
was followed by extracting RC information and then
performing RC back-annotated post-layout gate-level netlist
simulations using the Verilog-XL simulator. The resulting
data, including the switching activity of the circuit nets and the
capacitive load information extracted from the layouts, were
then used by the Synopsys DesignPower tool to compute
power consumption figures for the different FIR cores. In
all of the above stages a clock rate of 10 MHz and a supply
voltage of 3 V were used. The results obtained are given
in Tables 2–4. The following can be concluded by analysing
the tables:
  Area consumption: Area usage for the different filter cores is
given in Tables 2 and 3. Clearly, fir_df_wd occupies the
least area, followed by ﬁr_df_booth. However, the variation
in area among the different cores is less than 10%. Table 4
provides comparisons between different ﬁlter implementa-
tions using 2’sC and SM representations. An area increase
of up to 3% is incurred when SM is used instead of 2’sC.
On the other hand, a FDF implementation of a FIR ﬁlter
consumes 4–9% more area compared to a DF implementa-
tion. This is mainly due to an area increase in the RAM
(a dual port RAM is used instead of a single port RAM)
and the additional adder circuitry used to add the two data
samples obtained from the dual port RAM. The reduction
in the number of multiplications (due to FDF) does not
have much effect on the overall ﬁlter area, since both DF
and FDF ﬁlters are single multiplier implementations.
  Fault coverage: Fault coverage of all the ﬁlters is above
90% (between 91 and 96%). This indicates that the DfT
features, which have been built into these filter cores, are
effective. Only 1024 test vectors were needed to reach
these results. It can be concluded that a fault coverage of 97
to 99% can be achieved using additional test vectors,
especially since most of the undetected faults have been
reported outside the DUT in the BIST circuitry.
  Power consumption: The results for DF and FDF ﬁlter
architectures are illustrated in Tables 2 and 3. Comparing
different DF ﬁlter core implementations using 2’sC number
representation, ﬁr_df_rg_booth achieves the best result with
an overall reduction of 13% compared to ﬁr_df_booth
Table 1: Designed FIR filters
Filter              | Filter architecture | Multiplier type | Multiplier size | Partial product reduction tree | Number representation for multiplication
fir_df_booth        | DF  | Booth         | 16x16 | 3to2        | 2’sC
fir_df_rb_booth     | DF  | RB_Booth      | 16x16 | 4to2        | 2’sC
fir_df_rg_booth     | DF  | RG_Booth      | 16x16 | 3to2        | 2’sC
fir_df_wd           | DF  | Wallace–Dadda | 16x16 | 3to2        | 2’sC
fir_df_booth_sm     | DF  | Booth         | 16x16 | 3to2        | SM
fir_df_rb_booth_sm  | DF  | RB_Booth      | 16x16 | 4to2        | SM
fir_df_rg_booth_sm  | DF  | RG_Booth      | 16x16 | 3to2        | SM
fir_df_wd_sm        | DF  | Wallace–Dadda | 16x16 | 3to2        | SM
fir_fdf_booth       | FDF | Booth         | 17x16 | 3to2        | 2’sC
fir_fdf_rb_booth    | FDF | RB_Booth      | 17x16 | 4to2        | 2’sC
fir_fdf_rg_booth    | FDF | RG_Booth      | 17x16 | 3to2        | 2’sC
fir_fdf_p_booth     | FDF | P_Booth       | 17x16 | 3to2 & 4to2 | 2’sC
fir_fdf_wd          | FDF | Wallace–Dadda | 17x16 | 3to2        | 2’sC
fir_fdf_booth_sm    | FDF | Booth         | 17x16 | 3to2        | SM
fir_fdf_rb_booth_sm | FDF | RB_Booth      | 17x16 | 4to2        | SM
fir_fdf_rg_booth_sm | FDF | RG_Booth      | 17x16 | 3to2        | SM
fir_fdf_wd_sm       | FDF | Wallace–Dadda | 17x16 | 3to2        | SM
Table 2: DF filter architecture analysis results
Filter             | Fault coverage, % | Area, mm^2 | Reduction, % | Power, nW/sample | Reduction, %
fir_df_booth       | 96 | 0.66 | -  | 21.9 | -
fir_df_rb_booth    | 96 | 0.68 | -3 | 20.2 | 8
fir_df_rg_booth    | 96 | 0.71 | -8 | 19.1 | 13
fir_df_wd          | 96 | 0.65 | 2  | 19.4 | 11
fir_df_booth_sm    | 94 | 0.67 | -  | 22.4 | -
fir_df_rb_booth_sm | 94 | 0.70 | -4 | 21.7 | 3
fir_df_rg_booth_sm | 94 | 0.72 | -7 | 21.0 | 6
fir_df_wd_sm       | 93 | 0.66 | 1  | 16.4 | 27
implementation. This is followed by fir_df_wd and
ﬁr_df_rb_booth achieving reductions of 11% and 8%
respectively. However, when SM representation is used,
ﬁr_df_wd_sm provides the best result with a power
reduction of 27%, followed by ﬁr_df_rg_booth_sm (6%)
and ﬁr_df_rb_booth_sm (3%), compared to ﬁr_df_
booth_sm. On the other hand, when ﬁlter cores using SM
representation are compared to their counterparts using
2’sC representation, the performance deteriorates for ﬁlter
cores employing a Booth based multiplier. Although SM
representation reduces the switching activity at data and
coefﬁcient inputs of the multiplier by 10% and 27%
respectively, the overall power consumption increases by
2%, 7%, and 10% for ﬁr_df_booth_sm, ﬁr_df_rb_booth_sm,
and ﬁr_df_rg_booth_sm ﬁlter cores respectively, see Table 4.
This is in sharp contrast to cases where a Wallace–Dadda
multiplier is employed in the cores, resulting in a 15%
improvement in the overall power reduction. The difference
in this power proﬁle can be explained by examination of the
power reductions in the multiplier section of the ﬁlter cores.
For example, in the case of ﬁr_df_wd, a power reduction of
52% is achieved in the multiplier compared to only 11%
reduction for ﬁr_df_rg_booth. Therefore, the power reduc-
tion in the multiplier achieved with SM representation is not
sufﬁcient to compensate for the added overheads for Booth-
based ﬁlter cores. When FDF ﬁlter cores are considered,
ﬁr_fdf_p_booth results in 27% more power consumption
compared to ﬁr_fdf_booth, see Table 3. This is mainly due
to an increase in the number of glitches in the encoder
section of the P_Booth multiplier. Similar to DF ﬁlter cores,
the best result for FDF is obtained using ﬁr_fdf_wd_sm,
achieving a 24% reduction. Figure 8 shows that power
savings between 36% and 48% can be achieved using FDF
instead of DF architecture with an area increase of less than
10%. In general, our results indicate that SM representation
outperforms 2’sC when used in Wallace–Dadda based ﬁlter
core implementations. However, a Wallace–Dadda multi-
plier is slower compared to Booth-based multipliers.
Therefore, if the required speed cannot be met by the
Wallace–Dadda multiplier then either a faster multiplier
(such as a Booth-based multiplier) or some speed-up
techniques (such as pipelining and/or use of multiple
multipliers; note that in this work these cases were not
considered since they increase area usage significantly) have
to be considered. If a Booth-based multiplier is chosen in a
ﬁlter core our results indicate that 2’sC representation will
lead to less overall power consumption compared to SM
representation.
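The percentages quoted above can be reproduced from the tabulated power figures. The Python sketch below copies the power values (nW/sample) from Tables 2 and 3 and computes the saving of each FDF core relative to its DF counterpart, together with the change caused by switching from 2’sC to SM. The dictionary and function names are introduced for this illustration only.

```python
# Power in nW/sample, copied from Tables 2 and 3.
DF_POWER = {
    "booth": 21.9, "rb_booth": 20.2, "rg_booth": 19.1, "wd": 19.4,
    "booth_sm": 22.4, "rb_booth_sm": 21.7, "rg_booth_sm": 21.0, "wd_sm": 16.4,
}
FDF_POWER = {
    "booth": 11.4, "rb_booth": 12.7, "rg_booth": 11.6, "wd": 10.9,
    "booth_sm": 12.3, "rb_booth_sm": 13.8, "rg_booth_sm": 12.6, "wd_sm": 9.4,
}


def reduction_percent(reference, new):
    """Positive when `new` consumes less power than `reference`."""
    return 100.0 * (reference - new) / reference


def fdf_savings():
    """Saving of each FDF core relative to the DF core with the same multiplier."""
    return {k: round(reduction_percent(DF_POWER[k], FDF_POWER[k])) for k in DF_POWER}


def sm_reduction(table, multiplier):
    """Reduction obtained by replacing 2'sC with SM for one multiplier type."""
    return round(reduction_percent(table[multiplier], table[multiplier + "_sm"]))


if __name__ == "__main__":
    print(fdf_savings())
    print({m: sm_reduction(DF_POWER, m) for m in ("booth", "rb_booth", "rg_booth", "wd")})
```

The FDF savings computed this way lie between 36% and 48%, and the SM figures agree with the percentage columns of Table 4: negative for the Booth-based cores and positive for the Wallace–Dadda cores.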
To analyse the contribution of the constituent
components of the filtering cores, one DF and one FDF
filter core were examined in detail: fir_df_rg_booth and
fir_fdf_rg_booth. Considering the power and
the area performance of ﬁr_df_rg_booth ﬁlter core, 62% of
the area and 44% of the power in the MAC is used by the
multiplier, as shown in Fig. 9. Although the carry-look-
ahead (cla) adder consumes only 11% of the area, it is
responsible for 28% of the MAC power. The cla is therefore
another critical building block in addition to the multiplier.
A pipeline stage before the cla could be used to
reduce its power consumption. The BIST in the MAC unit
Table 3: FDF filter architecture analysis results
Filter              | Fault coverage, % | Area, mm^2 | Reduction, % | Power, nW/sample | Reduction, %
fir_fdf_booth       | 93 | 0.71 | -  | 11.4 | -
fir_fdf_rb_booth    | 94 | 0.71 | 0  | 12.7 | -11
fir_fdf_rg_booth    | 95 | 0.76 | -7 | 11.6 | -2
fir_fdf_p_booth     | 94 | 0.74 | -4 | 14.5 | -27
fir_fdf_wd          | 95 | 0.71 | 0  | 10.9 | 4
fir_fdf_booth_sm    | 91 | 0.73 | -  | 12.3 | -
fir_fdf_rb_booth_sm | 92 | 0.73 | 0  | 13.8 | -12
fir_fdf_rg_booth_sm | 94 | 0.78 | -7 | 12.6 | -2
fir_fdf_wd_sm       | 93 | 0.71 | 3  | 9.4  | 24
Table 4: Comparison between 2’sC and SM implementations
Filter           | Area, mm^2: 2’sC | SM   | Reduction, % | Power, nW/sample: 2’sC | SM   | Reduction, %
fir_df_booth     | 0.66 | 0.67 | -2 | 21.9 | 22.4 | -2
fir_df_rb_booth  | 0.68 | 0.70 | -3 | 20.2 | 21.7 | -7
fir_df_rg_booth  | 0.71 | 0.72 | -1 | 19.1 | 21.0 | -10
fir_df_wd        | 0.65 | 0.66 | -2 | 19.4 | 16.4 | 15
fir_fdf_booth    | 0.71 | 0.73 | -3 | 11.4 | 12.3 | -8
fir_fdf_rb_booth | 0.71 | 0.73 | -3 | 12.7 | 13.8 | -9
fir_fdf_rg_booth | 0.76 | 0.78 | -3 | 11.6 | 12.6 | -9
fir_fdf_p_booth  | 0.74 | -    | -  | 14.5 | -    | -
fir_fdf_wd       | 0.71 | 0.71 | 0  | 10.9 | 9.4  | 14
Fig. 8 Power consumption comparison between DF and FDF
implementations (vertical axis: power, nW/sample, 0 to 25; DF and FDF bars for fir_booth, fir_rb_booth, fir_rg_booth, fir_wd, fir_booth_sm, fir_rb_booth_sm, fir_rg_booth_sm, fir_wd_sm)
consumes 18% of the area and 20% of the power. When
the whole ﬁlter core is examined the MAC unit consumes
49% of the area and 81% of the power, as shown in Fig. 10.
Although the RAM occupies 36% of the ﬁlter area, it
consumes only 5% of the total power. This proves that the
MAC is the most critical circuit and that the latch-based
memory used in this work is very efﬁcient in terms of power
consumption.
Similarly, our results show that for the FDF ﬁlter core
(ﬁr_fdf_rg_booth), the multiplier consumes 75% of the
MAC area and 55% of the MAC power. Approximately
30% of the MAC power is consumed in the cla circuit. A
small increase in power consumption could be measured in
the multiplier of the FDF compared to the DF ﬁlter core.
This is due to an increase in the switching activity and the
wordlength of the data, due to the addition stage before the
multiplier. The BIST circuitry in the MAC consumes less
than 20% of the total MAC area and power. The MAC
unit consumes approximately 50% of the area and 70% of
the power of the whole ﬁlter core. Although the RAM
occupies 40% of the ﬁlter area, it consumes less than 10%
of the total power.
4 Conclusions
The impact of different ﬁlter realisation structures, multi-
plier architectures and number representations on the
overall power and area performance of a number of FIR
ﬁlter cores has been studied within the context of a hearing
aid application. The study includes the effect of DfT circuits
on the overall performance, using typical compact BIST
circuitry used in hearing aid applications. In general, power
savings of up to 48% can be achieved using FDF filter cores
over the DF ones at the expense of a less than 10% increase
in area. The best power performance was obtained using a
filter core with a Wallace–Dadda multiplier employing a
sign–magnitude number representation. However, for
high-speed applications where a Booth multiplier will be
required the best power performance can be obtained using
a core with a Booth multiplier and two’s-complement data
representation.
5 Acknowledgment
The authors would like to thank Bernafon Ltd., Switzer-
land, and the U.K. Engineering and Physical Sciences
Research Council (grant no: GR/N08322) for their support.
6 References
1 Lapsley, P.D., Bier, J.C., Shoham, A., and Lee, E.A.: ‘DSP processor
fundamentals: architectures and features’ (IEEE Press, New York,
1997)
2 Meier, P., Rutenbar, R.A., and Carley, L.R.: ‘Exploring multiplier
architecture and layout for low power’. Proc. IEEE Custom
Integrated Circuits Conf. CICC’96, pp. 513–516
3 Keane, G., Spanier, J., and Woods, R.: ‘The impact of data
characteristics and hardware topology on hardware selection for low
power DSP’. Proc. IEEE Symp. on Low power electronics and design,
ISLPED’98, Monterey, CA, USA, pp. 94–96
4 Fried, R.: ‘Minimizing energy dissipation in high-speed multipliers’.
Proc. IEEE Symp. on Low power electronics and design, ISLPED’97,
pp. 214–219
5 Farag, E., Yan, R.-H., and Elmasry, M.I.: ‘Power-efﬁcient multiplier-
accumulator design for FIR ﬁlters’. Proc. IEEE Canadian Conf. on
Electrical and computer engineering, CCECE’97, pp. 27–30
6 Moller, F., Bisgaard, N., and Melanson, J.: ‘Algorithm and
architecture of a 1V low power hearing aid instrument DSP’. Proc.
IEEE Symp. on Low power electronics and design, ISLPED’99, San
Diego, CA, USA, pp. 7–11
Fig. 9 MAC
a Area usage: mult 62%, cla 11%, csa 5%, xor 4%, BIST 18%
b Power usage: mult 44%, cla 28%, csa 5%, xor 3%, BIST 20%
Fig. 10 FIR
a Area usage: MAC 49%, RAM 36%, ctrl 5%, beta_reg 2%, gamma_reg 2%, ROM 1%, others 5%
b Power usage: MAC 81%, ctrl 6%, RAM 5%, beta_reg 4%, gamma_reg 4%, ROM 0%
7 Nielsen, L.S., and Sparsø, J.: ‘Designing asynchronous circuits for low
power: An IFIR ﬁlter bank for a digital hearing aid’, Proc. IEEE,
1999, 87, (2), pp. 268–281
8 Bartlett, V.A., and Grass, E.: ‘A low-power concurrent multiplier-
accumulator using conditional evaluation’. Proc. 6th IEEE Int. Conf.
on Electronics, circuits and systems, Pafos, Cyprus, 1999, pp. 629–633
9 Dadda, L.: ‘Some schemes for parallel multipliers’, Alta Freq., 1965,
34, pp. 349–356
10 Wallace, C.S.: ‘A suggestion for a fast multiplier’, IEEE Trans.
Comput., 1964, EC-13, pp. 14–17
11 Booth, A.D.: ‘A signed binary multiplication technique’, Q. J. Mech.
Appl. Math., 1951, 4, pp. 236–240
12 Huang, X., Liu, W.-J., and Wei, B.W.Y.: ‘A high-performance CMOS
redundant binary multiplication-and-accumulation (MAC) unit’,
IEEE Trans. Circuits Syst., 1994, 41, (1), pp. 33–39
13 Chandrakasan, A.P., and Brodersen, R.W.: ‘Minimizing power
consumption in digital CMOS circuits’, Proc. IEEE, 1995, 83,
pp. 498–523
14 Paschalis, A., Gizopoulos, D., Kranitis, N., Psarakis, M., and Zorian,
Y.: ‘An effective BIST architecture for fast multiplier cores’. Proc.
IEEE Conf. Design, Automation and Test in Europe, DATE’99,
pp. 117–121
15 Van De Goor, A.J.: ‘Testing semiconductor memory’ (John Wiley &
Sons, 1991)