Studies on Implementation of . . . High Throughput and Low Power Consumption by Henrik Ohlsson
Linköping Studies in Science and Technology
Thesis No. 1031
STUDIES ON IMPLEMENTATION OF
DIGITAL FILTERS WITH
HIGH THROUGHPUT AND
LOW POWER CONSUMPTION
Henrik Ohlsson
LiU-Tek-Lic-2003:30
Department of Electrical Engineering
Linköpings universitet, SE-581 83 Linköping, Sweden
Linköping, June 2003
Studies on Implementation of Digital Filters with
High Throughput and Low Power Consumption
Copyright © 2003 Henrik Ohlsson
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping
Sweden
ISBN 91-7373-694-5 ISSN 0280-7971
ABSTRACT
In this thesis we discuss design and implementation of frequency selec-
tive digital ﬁlters with high throughput and low power consumption. The
thesis includes proposed arithmetic transformations of lattice wave digital
filters that aim at increasing the throughput and reducing the power
consumption of the filter implementation. The thesis also includes two case
studies where digital ﬁlters with high throughput and low power con-
sumption are required.
A method for obtaining high throughput as well as reduced power con-
sumption of digital ﬁlters is arithmetic transformation of the ﬁlter struc-
ture. In this thesis arithmetic transformations of ﬁrst- and second-order
Richards’ allpass sections composed by symmetric two-port adaptors and
implemented using carry-save arithmetic are proposed. Such ﬁlter sec-
tions can be used for implementation of lattice wave digital ﬁlters and
bireciprocal lattice wave digital ﬁlters. The latter structures are efﬁcient
for implementation of interpolators and decimators by factors of two. The
proposed transformations increase the throughput of the ﬁlter implemen-
tation. The increased throughput can be traded for reduced power con-
sumption through power supply voltage scaling.
In the thesis two typical applications for digital ﬁlters with high
throughput and low power consumption are studied, a digital down con-
verter for a multiple antenna radar system and a combined interpolation
and decimation ﬁlter for oversampled ADCs and DACs in an OFDM sys-
tem. For both these cases several different ﬁlter structures have been con-
sidered and evaluated with respect to arithmetic complexity and
throughput. The purpose of these evaluations was to find the most
power efficient implementations.
For the digital down converter, three different ﬁlter structures, combin-
ing FIR ﬁlters and wave digital ﬁlters, have been implemented in VHDL
and mapped to a standard cell design using a cell library in a 0.18 µm
CMOS process. For the combined interpolator and decimator, four different
novel filter structures were considered. One of these structures was
implemented using a standard cell library in a 0.35 µm CMOS process.
The functionality of the implementation has been verified and the power
consumption of the filter chip has been measured.
ACKNOWLEDGEMENTS
First of all I would like to thank my supervisor, Professor Lars Wanham-
mar, for his support and guidance as well as for giving me the opportunity
to do this work.
I have had many fruitful discussions with my colleague Tek. Lic.
Oscar Gustafsson during this work. I would also like to thank Dr. Håkan
Johansson for guiding me in the area of ﬁlter design.
I would also like to thank the rest of the staff at Electronics Systems
for their support and friendship.
This work was ﬁnancially supported by the Swedish Foundation for
Strategic Research (SSF).
TABLE OF CONTENTS
1 Introduction
1.1 Motivation
1.2 FIR Filters
1.2.1 FIR Filter Structures
1.2.2 Linear-Phase FIR Filter Structures
1.3 IIR Filters
1.3.1 Iteration Period Bound for Recursive Filters
1.4 Wave Digital Filters
1.4.1 Lattice Wave Digital Filters
1.4.2 Bireciprocal Lattice Wave Digital Filters
1.4.3 Iteration Period Bound for LWDFs
1.5 Digital Filters for Interpolation and Decimation
1.5.1 Interpolation
1.5.2 Decimation
1.5.3 Polyphase Decomposition
1.6 Arithmetic
1.6.1 Bit-Serial Arithmetic
1.6.2 Digit-Serial Arithmetic
1.6.3 Bit-Parallel Arithmetic
1.6.4 Carry-Save Arithmetic
1.7 Implementation of Constant Multipliers
1.7.1 Shift-and-Add Multipliers
1.7.2 Minimum-Adder Multipliers
1.7.3 Multiple-Constant Multiplication
1.8 Carry-Save Lattice Wave Digital Filters
1.8.1 Mapping of LWDFs to Carry-Save Adders
1.8.2 Stability of Carry-Save LWDFs
1.8.3 Iteration Period Bound for Carry-Save LWDFs
1.9 Power Consumption in Digital CMOS Circuits
1.9.1 Dynamic Power Consumption
1.9.2 Power Supply Voltage Scaling
1.9.3 Power Reduction at the System Level
1.9.4 Power Reduction at the Algorithm Level
1.9.5 Power Reduction at the Logic Level
1.9.6 Power Reduction at the Technology Level
1.10 A Design Flow for Digital Filters
1.10.1 Filter Design
1.10.2 Mapping to Hardware
1.10.3 Module Generators
1.11 Outline of the Thesis
2 Arithmetic Transformations of Lattice Wave Digital Filters
2.1 Arithmetic Transformation of First-Order Richards’ Allpass Sections
2.1.1 Richards’ Structure
2.1.2 Proposed Transformation
2.1.3 Mapping of the Transformed Filter Section to Carry-Save Arithmetic
2.1.4 Evaluation of the Transformation
2.1.5 Example
2.2 Arithmetic Transformation of Second-Order Richards’ Allpass Sections
2.2.1 Richards’ Structure
2.2.2 Proposed Transformations
2.2.3 Mapping of the Transformed Filter Section to Carry-Save Arithmetic
2.2.4 Evaluation of the Transformation
2.2.5 Example
3 A Digital Down Converter for a Radar Receiver
3.1 Conventional Receiver Structures
3.2 Considered Radar Receiver Structure
3.3 Receiver Specification
3.4 The Digital Down Converter
3.4.1 Hilbert Transform
3.4.2 Highpass Filter
3.5 Design of the Digital Down Converter
3.5.1 Mapping of the DDC to Hardware
3.5.2 FIR-FIR Solution
3.5.3 FIR-BLWDF Solution
3.5.4 BLWDF-BLWDF Solution
3.6 Evaluation of the DDC Structures
3.6.1 Number of Arithmetic Operations
3.6.2 Number of Adders and Memory Elements
3.7 Implementation of the DDC
4 A Combined Interpolator and Decimator for an OFDM System
4.1 Design of the Digital Filters
4.1.1 Narrow-band Frequency Masking Filters
4.1.2 Efficient Implementation of Cascaded Multirate Filters
4.2 Considered Filter Structures
4.3 Evaluation of the Filter Structures
4.3.1 Arithmetic Complexity and Throughput
4.3.2 Internal Data Wordlengths and Scaling
4.3.3 Summary of the Evaluation
4.3.4 Combining Interpolation and Decimation Filters
4.4 Implementation of the Combined Interpolator and Decimator Structure
4.5 Comparison Between a WDF and an FIR Implementation
4.6 Arithmetic Transformation of the Combined Interpolator and Decimator Structure
5 Conclusions
1 INTRODUCTION
In this chapter we discuss the design and implementation of frequency
selective digital ﬁlters for applications where high throughput and low
power consumption are required.
1.1 Motivation
In this thesis we discuss the design and implementation of ﬁxed function,
frequency selective digital ﬁlters, using nonrecursive as well as recursive
ﬁlter algorithms. Frequency selective digital ﬁlters are required in most
DSP (digital signal processing) applications. A typical example is mobile
communications where hand-held, battery-powered devices, such as
cellular phones, are used. To obtain a long operating time between recharges
of the battery, low power consumption is required. Due to
the requirements on high data rates in many communication systems, the
corresponding subsystems and circuits must have a high throughput as
well. Since significant parts of such communication systems are consumer
products that are produced in large quantities and sold at low prices,
efficient, fast, and reliable design methods as well as low cost circuit
implementations are required.
The possibility of integrating an entire system, or parts of a system, on
a single chip also requires subsystems with low power consumption. For
such integrated systems, where analog and digital circuits may be
implemented on the same chip, the heat dissipation and the cooling of the chip
become a problem. Low power consumption is therefore a key design
constraint.
To obtain high throughput as well as low power consumption, a fixed
algorithm and a co-optimized implementation of the digital signal
processing parts are preferable, whenever possible. This co-optimization reduces
the power consumption of the circuit by at least one order of magnitude
compared to a flexible implementation using, for example, digital signal
processors. Hence, algorithm–hardware co-design and trading off
flexibility are very important in this respect.
1.2 FIR Filters
FIR ﬁlters constitute a class of digital ﬁlters having a ﬁnite length impulse
response. An FIR ﬁlter can be realized using nonrecursive as well as
recursive algorithms. However, the latter is not recommended due to
potential stability problems while nonrecursive FIR ﬁlters are always sta-
ble [75]. Hence, nonrecursive FIR ﬁlter algorithms are preferable for
implementation.
An FIR ﬁlter can be described by the difference equation
$$y(n) = \sum_{k=0}^{N} h_k\, x(n-k) \qquad (1.1)$$
where $y(n)$ is the filter output, $x(n)$ is the filter input, and $h_k$ are constants,
determined by the filter specification.
1.2.1 FIR Filter Structures
A nonrecursive FIR filter can be realized using many different structures.
Here, only two FIR filter structures are considered: the direct form and
the transposed direct form. Examples of other structures that are suitable
for implementation of FIR filters are the parallel FIR structure [56] [59]
and the differential coefficient method [66].
Direct Form FIR Filter Structure
The direct form FIR ﬁlter structure, shown in Fig. 1.1, is easily derived
from Eq. (1.1). An Nth-order direct form structure is composed of N
memory elements (registers) holding the input value for N sample peri-
ods, N+1 multipliers, corresponding to the constants in Eq. (1.1), and N
additions for adding the results of the multiplications.
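The direct form structure can be sketched in software as follows. This is only an illustrative model, not the thesis implementation: the function name `fir_direct_form` and the list-based delay line are assumptions made for the example.

```python
def fir_direct_form(h, x):
    """Model of an Nth-order direct form FIR filter, y(n) = sum_k h[k] * x(n - k).

    h holds the N + 1 coefficients h_0 ... h_N; x is the input sequence.
    """
    delay_line = [0] * (len(h) - 1)      # N registers holding x(n-1) ... x(n-N)
    y = []
    for sample in x:
        taps = [sample] + delay_line     # x(n), x(n-1), ..., x(n-N)
        acc = 0
        for hk, xk in zip(h, taps):      # N + 1 constant multiplications
            acc += hk * xk               # products are accumulated into one output
        y.append(acc)
        delay_line = taps[:-1]           # shift the delay line by one sample
    return y
```

Feeding a unit impulse through the model returns the coefficients themselves, which is a quick way to check a coefficient set.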
Transposed Direct Form FIR Filter Structure
The transposed direct form FIR ﬁlter structure, shown in Fig. 1.2, is
derived from the direct form structure using the transposition theorem.
This theorem states that by interchanging the input and the output and
reversing all signal flows in a signal-flow graph of a single input single
output (SISO) system, such as the direct form FIR filter, the transfer function
of the filter remains unchanged [75].
Figure 1.1: Nth-order direct form FIR filter.
Figure 1.2: Nth-order transposed direct form FIR filter.
For the transposed direct form structure all multiplications are per-
formed on the current input value. This yields a large fan-out of the gate
driving the multipliers, which may be costly to implement.
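A small software model can make the difference to the direct form concrete: every coefficient multiplies the current input sample, and the registers hold partial sums instead of delayed inputs. The sketch below is an assumption-laden illustration (the names `fir_transposed` and `fir_reference` are hypothetical), not the hardware structure itself.

```python
def fir_reference(h, x):
    """Reference output computed directly from the difference equation."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_transposed(h, x):
    """Model of an Nth-order transposed direct form FIR filter."""
    order = len(h) - 1
    state = [0] * order                         # N registers holding partial sums
    y = []
    for sample in x:
        products = [hk * sample for hk in h]    # all multipliers share x(n)
        y.append(products[0] + (state[0] if order else 0))
        # Each register is loaded with its product plus the next register's value.
        state = [products[i + 1] + (state[i + 1] if i + 1 < order else 0)
                 for i in range(order)]
    return y
```

Both functions produce the same output sequence for the same coefficients, in line with the transposition theorem.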
1.2.2 Linear-Phase FIR Filter Structures
A property of FIR ﬁlters is that they can be implemented with an exact
linear phase response. To obtain this, the FIR ﬁlter must have a symmetric
or antisymmetric impulse response. The impulse response of a linear
phase FIR ﬁlter is either symmetric around n = N/2
$$h(n) = h(N-n), \quad n = 0, 1, \ldots, N \qquad (1.2)$$
or antisymmetric around n = N/2
$$h(n) = -h(N-n), \quad n = 0, 1, \ldots, N \qquad (1.3)$$
where N is the filter order. For a linear-phase FIR filter the number of
multiplications required can be reduced by exploiting the symmetry of the
impulse response, as shown in Fig. 1.3. From the figure it can also be seen
that the number of additions remains the same while the number of
multiplications is halved, compared to the corresponding direct form
implementation.
Figure 1.3: Example of a linear phase FIR filter structure.
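How the symmetry saves multiplications can be illustrated with a short model for the symmetric case of Eq. (1.2): the two input samples that share a coefficient are added first and then multiplied once. The function name `fir_symmetric` and the argument layout (only the first half of the impulse response is passed) are assumptions made for this sketch.

```python
def fir_symmetric(h_half, order, x):
    """Model of a symmetric linear-phase FIR filter with h(n) = h(N - n).

    h_half holds h_0 ... h_{N // 2}; the remaining coefficients follow
    from the symmetry and are never stored or multiplied separately.
    """
    hist = [0] * (order + 1)                      # x(n), x(n-1), ..., x(n-N)
    y = []
    for sample in x:
        hist = [sample] + hist[:-1]
        acc = 0
        for k in range((order + 1) // 2):
            # Pre-add the two samples sharing coefficient h_k, then multiply once.
            acc += h_half[k] * (hist[k] + hist[order - k])
        if order % 2 == 0:
            acc += h_half[order // 2] * hist[order // 2]   # unpaired middle tap
        y.append(acc)
    return y
```

For an odd filter order every coefficient is shared by two taps, so the multiplier count is exactly half of the direct form count; for an even order the middle tap is left unpaired.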
1.3 IIR Filters
Digital ﬁlters that have inﬁnite length impulse responses are called IIR ﬁl-
ters. The difference equation describing an IIR ﬁlter can be written
$$y(n) = \sum_{k=1}^{M} b_k\, y(n-k) + \sum_{k=0}^{N} a_k\, x(n-k) \qquad (1.4)$$
where $y(n)$ is the filter output, $x(n)$ is the filter input, and $a_k$ and $b_k$ are
constants. As opposed to the nonrecursive FIR filter, the filter output
depends not only on the input sequence but on previous outputs as well.
Hence, a recursive filter algorithm is required for realization of an IIR
filter.
Recursive filters can be realized by several different filter structures.
However, for several of these the stability of the filter implementation is a
problem. A class of recursive filters that can be implemented with
guaranteed stability is wave digital filters [13]. This is also the only class of
recursive filter structures that will be considered in this thesis.
1.3.1 Iteration Period Bound for Recursive Filters
The recursive structure of an IIR filter limits the maximal sample rate
[63]. This bound is determined by the loops of the filter structure and is
given by
$$T_{\min} = \frac{1}{f_{s,\max}} = \max_i \left\{ \frac{T_i}{N_i} \right\} \qquad (1.5)$$
where $T_{\min}$ is the iteration period bound, $f_{s,\max}$ is the maximal sample
frequency, $T_i$ is the total latency of the operations in loop i, and $N_i$ is the
number of delay elements in loop i. The latency of an operation is defined
as the time it takes to generate an output value from the corresponding
input value [76]. The loop yielding $f_{s,\max}$ for a filter implementation is
called the critical loop.
For example, consider the recursive structure shown in Fig. 1.4 which
includes two loops, one consisting of two additions, one multiplication
and one delay element and one consisting of two additions, one multipli-
cation and two delay elements. The iteration period bound is determined
by the loop with only one delay element, as shown in Eq. (1.6), assuming
that the latencies are the same for the additions and multiplications.
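The bound of Eq. (1.5) for these two loops can also be evaluated numerically. The sketch below is illustrative only: the latency values `t_add` and `t_mult` are hypothetical numbers in arbitrary time units, and `iteration_period_bound` is a name introduced for the example.

```python
from fractions import Fraction


def iteration_period_bound(loops):
    """Evaluate T_min = max_i (T_i / N_i) for a list of (T_i, N_i) pairs."""
    return max(Fraction(latency, delays) for latency, delays in loops)


# Hypothetical operation latencies (integers, arbitrary time units).
t_add, t_mult = 1, 4

# Both loops contain two additions and one multiplication;
# they differ only in the number of delay elements.
loop_latency = t_mult + 2 * t_add
loops = [(loop_latency, 1), (loop_latency, 2)]

t_min = iteration_period_bound(loops)   # set by the loop with one delay element
```

With equal loop latencies the loop with the fewest delay elements is always the critical one, which matches the statement above.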
$$T_{\min} = \max \left\{ \frac{T_{\mathrm{mult}} + 2T_{\mathrm{add}}}{1}, \frac{T_{\mathrm{mult}} + 2T_{\mathrm{add}}}{2} \right\} = T_{\mathrm{mult}} + 2T_{\mathrm{add}} \qquad (1.6)$$
The maximal sample frequency for a filter structure can always be
obtained for an implementation. However, it may require that the
operations are scheduled over several sampling periods [18] [54] [73] [76].
Figure 1.4: A recursive filter structure with two loops.
1.4 Wave Digital Filters
Wave digital filters (WDFs) constitute a wide class of IIR digital filters
that are well suited for implementation. WDFs are derived from analog
reference filters from which they inherit several fundamental properties.
If the reference filter has a low sensitivity to variation in element values,
which is the case for certain RLC filters, this low sensitivity can be
inherited by the digital filter. Another property inherited from the reference
filter is the stability of the filter implementation. A passive LC filter
attenuates parasitic oscillations due to losses in the nonideal circuit
elements. By imitating these losses in the WDF implementation, parasitic
oscillations can be suppressed for these filters as well.
Examples of WDF structures, suitable for implementation, are
Richards’ structures and ladder structures. The former is derived from
cascaded, lossless commensurate-length transmission line filters. However, a
class of WDFs that is even more suitable for VLSI implementation is
lattice WDFs (LWDFs). These filter structures are derived from analog
lattice filters. LWDFs are the only class of WDFs considered in this thesis.
1.4.1 Lattice Wave Digital Filters
From the reference filter, the analog lattice filter, the low passband
sensitivity and high stopband sensitivity are inherited by the LWDF structure.
The latter is not a problem for a digital filter implementation since the
filter coefficients are constant. An LWDF can be designed from the
reference filter [75] as well as from explicit formulas [16].
The LWDF structure is highly modular yielding a high degree of par-
allelism. This makes it well suited for VLSI implementation.
In Fig. 1.5 an example of a ninth-order LWDF is shown. In this thesis
only lattice structures, composed of two allpass sections that are
connected in parallel, are considered. The allpass filters are composed of
cascaded first- and second-order Richards’ structures implemented using
symmetric two-port adaptors. The signal-flow graph of the symmetric
two-port adaptor is shown in Fig. 1.6.
1.4.2 Bireciprocal Lattice Wave Digital Filters
A subclass of LWDFs is bireciprocal LWDFs (BLWDFs). The
magnitude function of a BLWDF is antisymmetric around π/2. Hence, only
special specifications can be realized using BLWDFs. This limits the number
of applications that such filters can be used for.
For a BLWDF more than half of the coefficients are zero, compared
to an LWDF of the same filter order. This reduces the arithmetic
complexity of a BLWDF implementation and increases the throughput,
compared to an LWDF implementation. In Fig. 1.7 a ninth-order BLWDF
is shown.
The transfer function of a BLWDF is
$$H(z) = H_0(z^2) + z^{-1} H_1(z^2) \qquad (1.7)$$
where $H_0(z^2)$ and $H_1(z^2)$ are the transfer functions for the two allpass
sections, respectively.
1.4.3 Iteration Period Bound for LWDFs
The ﬁrst- and second-order Richards’ allpass sections form the only
recursive parts of the LWDFs considered here and, hence, determine the
iteration period bound. For a ﬁrst-order allpass section the critical loop, as
shown in Fig. 1.8, yields an iteration period bound of
$$T_{\min} = 2T_{\mathrm{add}} + T_{a_0} \qquad (1.8)$$
where $T_{\mathrm{add}}$ is the latency of an addition and $T_{a_0}$ is the latency of a
multiplication with the coefficient $a_0$.
For the second-order Richards’ allpass section the iteration period
bound is determined by the critical loop as shown in Fig. 1.9. This
iteration period bound is
$$T_{\min} = 4T_{\mathrm{add}} + T_{a_1} + T_{a_2} \qquad (1.9)$$
where $T_{a_1}$ and $T_{a_2}$ are the latencies of the two multiplications with the
coefficients $a_1$ and $a_2$, respectively.
Figure 1.5: A ninth-order LWDF.
Figure 1.6: A symmetric two-port adaptor.
Figure 1.7: A ninth-order BLWDF.
Figure 1.8: Critical loop for a first-order Richards’ allpass section.
Second-order allpass sections can also be implemented using three-
port adaptors [1] [30]. This may yield a lower iteration period bound,
depending on the coefficient wordlengths required. However, filter
sections composed of three-port adaptors are not discussed in this thesis.
1.5 Digital Filters for Interpolation and Decimation
Multirate techniques are used in many different digital signal processing
applications. In a multirate algorithm several different sample rates are
used [72]. Hence, sample rate conversions are required, for both increas-
ing and reducing the sample rate. An increase of the sample rate is called
interpolation and a reduction of the sample rate is called decimation. For
the implementation of interpolators and decimators, digital ﬁlters are
required.
The computational workload can be reduced by using several sample
rates. A typical application of multirate techniques is oversampled
analog-to-digital and digital-to-analog converters, where the converters use a
sample rate higher than the Nyquist rate. Hence, interpolators and
decimators are required between the converters and the digital parts. The
result is that the performance of the conversions is improved.
Figure 1.9: Critical loop for a second-order Richards’ allpass section.
1.5.1 Interpolation
Interpolation by an integer factor is performed by introduction of zero
valued samples into the sample sequence so that the required sample rate
is obtained. However, the zero sample insertion will introduce repeated
images of the original signal spectrum. Lowpass ﬁltering is required after
the zero sample insertion stage, to remove these images. An interpolation
structure, including the zero insertion and the digital lowpass ﬁltering, is
shown in Fig. 1.10.
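The two stages of the interpolation structure can be modelled in a few lines. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the three-tap filter used in the usage note is just a simple linear interpolator chosen for illustration, not a filter from the thesis.

```python
def insert_zeros(x, m):
    """Raise the sample rate by m: put m - 1 zero-valued samples after each input."""
    out = []
    for sample in x:
        out.append(sample)
        out.extend([0] * (m - 1))
    return out


def fir(h, x):
    """Nonrecursive FIR filtering, y(n) = sum_k h[k] * x(n - k)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def interpolate(x, m, h):
    """Zero insertion followed by lowpass filtering at the higher sample rate."""
    return fir(h, insert_zeros(x, m))
```

For example, `interpolate([2, 4, 6], 2, [0.5, 1, 0.5])` fills each inserted zero with the mean of its two neighbours. Note that the filter runs at the higher sample rate even though most of its input samples are zero.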
1.5.2 Decimation
Decimation by an integer factor is performed by removal of samples from
the sequence until the required sample rate is obtained. To avoid aliasing
after decimation, the input signal to the decimator must be band limited.
This is solved by a lowpass anti-aliasing ﬁlter before the sample removal.
A decimation structure, including the anti-aliasing ﬁlter and the sample
removal, is shown in Fig. 1.11.
1.5.3 Polyphase Decomposition
A drawback with the interpolation and decimation structures discussed
above is that the digital ﬁltering is performed at the higher sample rate.
Figure 1.10: Structure for interpolation by a factor M.
Figure 1.11: Structure for decimation by a factor M.
This can be avoided by using polyphase decomposition of the ﬁlters [71].
An M-component polyphase decomposition of a digital ﬁlter is performed
by rewriting the transfer function of the ﬁlter as shown in Eq. (1.10)
$$H(z) = \sum_{k=0}^{M-1} z^{-k} H_k(z^M) \qquad (1.10)$$
where $H_k(z)$ are the polyphase components of the filter $H(z)$. For the case
when M = 2 we obtain
$$H(z) = H_0(z^2) + z^{-1} H_1(z^2) \qquad (1.11)$$
After polyphase decomposition of the digital filters, all filtering can be
performed at the lower sample rate. Polyphase decomposition can be used
for decimation as well as for interpolation as illustrated in Fig. 1.12 and
Fig. 1.13, respectively. This will improve the efficiency of an
implementation significantly, with respect to power consumption.
Figure 1.12: Polyphase decimation structure.
Figure 1.13: Polyphase interpolation structure.
If Eq. (1.7) and Eq. (1.11) are compared it can be seen that the transfer
function of the BLWDF structure is equivalent to the transfer function of
a polyphase structure with M = 2. Hence, a BLWDF structure is well
suited for implementation of interpolators and decimators for sample rate
changes by a factor of two. By cascading BLWDFs, sample rate changes
by factors that are powers of two are possible as well [75]. An example of a
polyphase decomposed BLWDF for decimation by a factor of two is
shown in Fig. 1.14.
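The equivalence behind the polyphase decimator can be checked with a small model for M = 2. For simplicity the sketch uses an FIR filter rather than a BLWDF, so it illustrates only the decomposition of Eq. (1.11); the function names are hypothetical.

```python
def fir(h, x):
    """Nonrecursive FIR filtering, y(n) = sum_k h[k] * x(n - k)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def decimate_direct(h, x):
    """Filter at the high sample rate, then keep every second output sample."""
    return fir(h, x)[::2]


def decimate_polyphase(h, x):
    """Split H(z) into H0(z^2) + z^-1 H1(z^2) and filter at the low rate."""
    h0, h1 = h[0::2], h[1::2]          # polyphase components of the filter
    x_even = x[0::2]                   # x(2n)
    x_odd = [0] + x[1::2]              # x(2n - 1): odd samples, one low-rate delay
    y0 = fir(h0, x_even)
    y1 = fir(h1, x_odd[:len(x_even)])
    return [a + b for a, b in zip(y0, y1)]
```

Both functions return the same sequence, but the polyphase version never computes the output samples that the direct version discards.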
Figure 1.14: Polyphase decomposition of a ninth-order BLWDF for decimation
by a factor of two.
1.6 Arithmetic
The mapping of a filter structure to hardware includes selection of
suitable arithmetic. Here we briefly discuss bit-serial, digit-serial, and
bit-parallel arithmetic. A more thorough discussion about bit-parallel, carry-save
arithmetic is included since it will be used extensively throughout this
thesis.
1.6.1 Bit-Serial Arithmetic
In bit-serial arithmetic only one bit of the data word is processed during
each clock cycle. This can be done by using either the most significant bit
(MSB) or the least significant bit (LSB) first. Here, only LSB first
bit-serial operations will be discussed. A major advantage with bit-serial
arithmetic is that it is area efficient since the processing elements are
small. Bit-serial arithmetic also yields low routing complexity since
communication between bit-serial operators requires only a single wire. A
drawback with bit-serial arithmetic is that a high clock frequency is
required to obtain a high throughput. An example of a bit-serial
processing element, a bit-serial adder, is shown in Fig. 1.15.
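The behaviour of an LSB first bit-serial adder can be modelled as one full-adder evaluation per clock cycle with the carry stored between cycles. The sketch below is a behavioural illustration only; the helper names `to_bits` and `from_bits` are assumptions, and the result is truncated to the input wordlength.

```python
def bit_serial_add(a_bits, b_bits):
    """Add two equally long bit sequences, least significant bit first."""
    carry = 0                      # models the flip-flop holding the carry
    s_bits = []
    for a, b in zip(a_bits, b_bits):             # one loop pass = one clock cycle
        s_bits.append(a ^ b ^ carry)             # full-adder sum output
        carry = (a & b) | (a & carry) | (b & carry)   # full-adder carry output
    return s_bits


def to_bits(value, width):
    """Unsigned integer to a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """List of bits (least significant first) back to an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(bits))
```

A W-bit addition takes W passes through the loop, which mirrors why a high clock frequency is needed to reach a high throughput.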
1.6.2 Digit-Serial Arithmetic
An alternative to bit-serial arithmetic is digit-serial arithmetic. Instead of
processing only one bit in each clock cycle, two, or more, bits are proc-
essed during each clock cycle. The number of bits processed during a
clock cycle is deﬁned as the digit size of the operation. In Fig. 1.16 an
example of a digit-serial adder with a digit size equal to four is shown. A
digit-serial adder can be derived from a bit-serial adder through unfolding
[56].
Compared to bit-serial arithmetic, the requirement on the clock fre-
quency for obtaining a given throughput is reduced. This is obtained at
the expense of an increased routing complexity and a longer carry propa-
gation path. These factors also depend on the digit size of the operations.
1.6.3 Bit-Parallel Arithmetic
A special case of digit-serial arithmetic is bit-parallel arithmetic. A digit-
serial operation for which the input data wordlength and the digit size are
equal is in fact a bit-parallel operation. Bit-parallel circuits yield high
throughput, relative to the clock frequency, at the expense of a larger area
required, compared to bit-serial and digit-serial circuits.
A drawback with bit-parallel arithmetic is the carry propagation
required in an addition. The MSB of an addition result depends on the
carry, which is propagated through the adder from the LSB. Hence, the
latency of a bit-parallel adder is large. It is also dependent on the data
wordlength. The latency can be reduced by using, for example, different
carry acceleration schemes [76]. These methods have the drawback of
increased chip area.
Figure 1.15: A bit-serial adder.
1.6.4 Carry-Save Arithmetic
Bit-parallel, carry-save arithmetic is based on a redundant number
representation which has been shown to be efficient for implementation of high
throughput DSP algorithms [42]. In carry-save arithmetic a binary
number is represented by two data vectors, a sum and a carry vector.
Conversion from carry-save representation to two’s-complement
representation is performed by a vector merging adder (VMA). A VMA can be
realized by using, for example, a ripple-carry adder (RCA) or a
carry-lookahead adder (CLA).
Figure 1.16: A digit-serial adder with a digit size of four.
Figure 1.17: A bit-parallel ripple-carry adder.
The Carry-Save Adder
A carry-save adder (CSA) takes three operands as input and yields the
result as two operands, one sum and one carry vector, as illustrated in Fig.
1.18. The CSA is realized using a number of full-adders that operate
concurrently, i.e., independently. The latency of a CSA operation is equal to
the latency of one full-adder operation, independent of the data
wordlength of the input operands. This can be compared to the latency of the
carry-propagation adders used in conventional two’s-complement
arithmetic, such as the RCA and the CLA, which have latencies of the order
O(Wd) and O(log2(Wd)), respectively [76].
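A carry-save addition is easy to model with bitwise operations, since every bit position is handled by its own full-adder without any carry propagation. The sketch below assumes non-negative integer operands and uses ordinary integer addition in place of the vector merging adder; the function name is hypothetical.

```python
def carry_save_add(a, b, d):
    """Reduce three operands to a sum vector and a carry vector.

    Every bit position is computed independently of the others.
    """
    sum_vector = a ^ b ^ d                              # full-adder sum outputs
    carry_vector = ((a & b) | (a & d) | (b & d)) << 1   # carries, weight shifted up
    return sum_vector, carry_vector


s, c = carry_save_add(13, 7, 9)
merged = s + c    # stands in for the vector merging adder: 13 + 7 + 9
```

The carry propagation is postponed to the single final merge, which is why chains of carry-save additions can reach a high throughput.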
Carry Overﬂows in Carry-Save Structures
A carry-save addition requires an increase of the data wordlength of the
carry vector of the adder by one bit due to the carry bit generated in the
most signiﬁcant full-adder. If the carry bit is discarded, as in conven-
tional, two’s-complement arithmetic, a so called carry overﬂow can occur.
Such overﬂows may occur when, for example, a small number is repre-
sented by a large positive and a large negative number. If, for such cases,
the carry bit is discarded, the result from the addition may be incorrect.
Carry overﬂows can be avoided by modiﬁcation of the full-adder in
the most signiﬁcant bit position in the CSA. This full-adder should be
modiﬁed according to Eq. (1.12) and Eq. (1.13)
$$s_{W_d-1,\mathrm{corr}} = s_{W_d-1} \oplus c_{W_d} \oplus c_{W_d-1} \qquad (1.12)$$
$$c_{W_d-1,\mathrm{corr}} = c_{W_d} \qquad (1.13)$$
where $s_{W_d-1,\mathrm{corr}}$ and $c_{W_d-1,\mathrm{corr}}$ are the most significant sum and carry
bits after correction, respectively, $s_{W_d-1}$ is the most significant sum bit,
and $c_{W_d}$ and $c_{W_d-1}$ are the most and second most significant carry bits
from the CSA, respectively [42].
Figure 1.18: A carry-save adder.
1.7 Implementation of Constant Multipliers
A constant multiplication is a fundamental operation for implementation
of fixed function FIR filters as well as LWDFs. In this section some
different approaches for implementation of constant multipliers yielding
multipliers with low power consumption as well as high throughput are
discussed. Several of these methods can be applied to bit-serial,
digit-serial and bit-parallel arithmetic. However, in this thesis the focus is on
bit-parallel arithmetic.
1.7.1 Shift-and-Add Multipliers
A constant multiplication can be implemented using shift-and-add
operations [35]. Each nonzero bit in a two’s-complement coefficient
corresponds to one partial product of the coefficient. These are generated using
fixed shift operations. For bit-parallel arithmetic these can be hardwired
and, hence, require no extra gates. These partial products can then be
added together using, for example, an adder tree or an adder array.
An adder structure for adding several operands using bit-parallel
arithmetic, yielding a low latency, is the Wallace adder tree [74]. The Wallace
tree is formed by connecting the carry-save adders in a tree-like fashion,
as illustrated in Fig. 1.19. From this example it can be seen that the
latency of a Wallace tree is equal to the tree height, in terms of CSAs.
The arithmetic complexity of the shift-and-add multiplier can be
reduced by using canonic signed-digit code (CSDC) for representation of
the coefficients [3]. The CSDC representation has three digits, –1, 0, +1,
as opposed to the two’s-complement representation which has only two
digits, 0 and +1. A property of the CSDC representation is that no two
consecutive digits in a CSDC number are both nonzero. The average number
of nonzero digits in a CSDC number is approximately Wd/3 while a
two’s-complement number has on average Wd/2 nonzero bits. Hence, for a
coefficient represented in CSDC, the number of additions required is lower
than for the corresponding two’s-complement coefﬁcient, when a shift-
and-add structure is considered for realization of a constant multiplier.
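The CSDC properties mentioned above can be checked with a small conversion routine. The sketch below uses the standard non-adjacent-form recoding for non-negative integers; the function names to_csd and csd_value are illustrative assumptions and do not refer to any tool described in this thesis.

```python
def to_csd(n):
    """Return the CSD digits of a non-negative integer, MSB first."""
    digits = []
    while n != 0:
        if n % 2 == 0:
            digits.append(0)
        else:
            # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4); this choice makes
            # the next digit zero, so no two adjacent digits are nonzero.
            d = 2 - (n % 4)
            digits.append(d)
            n -= d
        n //= 2
    return digits[::-1]


def csd_value(digits):
    """Evaluate a list of signed digits (MSB first) back to an integer."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


seven = to_csd(7)          # binary 111 has three nonzero bits
nonzero = sum(1 for d in seven if d != 0)
```

For the constant 7 the recoding replaces three nonzero bits by two nonzero digits, which corresponds to one addition instead of two in a shift-and-add multiplier.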
1.7.2 Minimum-Adder Multipliers
A method for implementation of constant multipliers yielding further
reduction of the arithmetic complexity compared to CSDC based shift-
and-add multipliers, is the minimum-adder multiplier technique [11] [21].
Minimum-adder multipliers can be described using the graph representation
introduced in [4]. The nodes of the graph correspond to additions
while the edges correspond to shift operations.
To illustrate this method we consider the coefficient
$\alpha = 45_{10} = 10\bar{1}0\bar{1}01_{\mathrm{CSDC}}$, where $\bar{1}$
corresponds to –1. This coefficient is represented with four nonzero
digits in CSDC representation and the corresponding shift-and-add
multiplier requires three additions. Another possible representation is
$y = \alpha x = 45x = (8+1)(4+1)x = (2^3+1)(2^2+1)x$. This
multiplier can be realized using two additions only. In Fig. 1.20 multiplier
graphs corresponding to both the CSDC and the minimum-adder realization
for a constant multiplication with the coefficient α = 45 are shown.
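The two realizations of the multiplication by 45 can be compared directly in software. This is a minimal sketch with assumed function names, in which shifts model the hardwired shifts and every addition or subtraction is counted as one adder.

```python
def mul45_csd(x):
    """45x = x - 4x - 16x + 64x, following the CSDC digits."""
    acc = x - (x << 2)       # -3x   (adder 1)
    acc = acc - (x << 4)     # -19x  (adder 2)
    acc = acc + (x << 6)     # 45x   (adder 3)
    return acc, 3


def mul45_min_adder(x):
    """45x = (8 + 1)(4 + 1)x, the factored realization."""
    t = (x << 3) + x         # 9x    (adder 1)
    y = (t << 2) + t         # 45x   (adder 2)
    return y, 2
```

Both functions return the product together with the number of adders used, so the saving of one adder is visible in the return values.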
1.7.3 Multiple-Constant Multiplication
In the transposed direct form FIR ﬁlter structure, one data value is to be
multiplied with several constant coefficients. This makes it possible to use
multiple-constant multiplication methods, in which resources may be shared
between the different multipliers.
Figure 1.19: A Wallace tree with six inputs.
One multiple-constant method is the multiplier block technique [4]. A
multiplier block is composed by a number of minimum-adder multipliers
where common subgraphs can be shared between the coefﬁcients. For
example, if we consider the coefficients $\alpha_1 = 45_{10} = (8+1)(4+1)$ and
$\alpha_2 = 27_{10} = (8+1)(2+1)$, the (8+1) part can be shared between the two
graphs as illustrated in Fig. 1.21. In [4] a heuristic algorithm for the
design of multiplier blocks was proposed. An improved algorithm was
proposed in [12] which may reduce the arithmetic complexity for the
multiplier block further.
Another method for implementation of multiple-constant multipliers
is the subexpression sharing technique [24] [25] [62]. By considering all
shift-and-add operations required for a multiple-constant problem, com-
mon subexpressions can be identiﬁed. These subexpressions can then be
computed only once and be reused for realization of several multipliers.
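The sharing in the multiplier block example with the coefficients 45 and 27 can be sketched as follows. The function name multiplier_block is an assumption made for this illustration.

```python
def multiplier_block(x):
    """Compute 45x and 27x with a shared 9x subgraph (three adders)."""
    nine_x = (x << 3) + x            # shared (8 + 1)x
    y45 = (nine_x << 2) + nine_x     # (4 + 1) * 9x
    y27 = (nine_x << 1) + nine_x     # (2 + 1) * 9x
    return y45, y27
```

Realizing the two factored multipliers separately would need two adders each, i.e., four in total, whereas the shared structure needs three.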
Figure 1.20: Two possible graph representations for α = 45.
Figure 1.21: Multiplier block implementation of α1 = 45 and α2 = 27.
1.8 Carry-Save Lattice Wave Digital Filters
Implementation of ﬁxed function LWDFs requires only two arithmetic
operations, addition and constant multiplication. Here an isomorphic
mapping of the ﬁlter algorithm to the hardware architecture is considered.
To obtain a high throughput realization of LWDFs, the arithmetic
operations should be implemented using circuits yielding low latencies. As
discussed in Section 1.6.4, carry-save arithmetic yields low-latency
arithmetic operations. It has also been shown that carry-save arithmetic is
suitable for implementation of high throughput LWDFs [32] [55] [61].
1.8.1 Mapping of LWDFs to Carry-Save Adders
Two types of basic arithmetic operations are required for implementation
of ﬁxed function LWDFs, additions and constant multiplications. An
addition is easily mapped to carry-save arithmetic while a constant
multiplication requires some special considerations, compared to a
conventional constant multiplication, as discussed in Section 1.7.
The constant coefﬁcient multiplications required in the LWDF can be
implemented using the minimum-adder multiplier method in order to
reduce the arithmetic complexity. This will, however, not always yield a
multiplier with minimal latency. It has also been shown that for imple-
mentation of constant coefﬁcient multipliers using carry-save arithmetic,
the CSDC multiplier is as efﬁcient, with respect to the number of carry-
save adders required, as a minimum-adder multiplier for constants with
wordlengths of up to 10 bits [19]. Hence, a Wallace tree implementation
of a CSDC multiplier using carry-save arithmetic is efﬁcient with respect
to arithmetic complexity as well as throughput.
A drawback with the Wallace tree structure is that it yields an irregular
wire routing of the adder tree. An alternative to the Wallace tree with a
similar tree height for the same number of inputs and a regular structure,
more suitable for layout, is the overturned stair adder tree [39].
Normally a VMA is applied on the output of the adder tree of the mul-
tiplier to obtain two’s-complement representation. However, by keeping
the result of the multiplier in the ﬁlter section in carry-save representa-
tion, no VMA is required in the critical loop. This improves the iteration
period bound for a filter realization further.
1.8.2 Stability of Carry-Save LWDFs
Using carry-save representation in the loops of LWDFs requires a special
saturation scheme, compared to conventional two’s-complement repre-
sentation, to guarantee the stability of the ﬁlter. To suppress parasitic
oscillations, overﬂows of the signal range must be detected at the adaptor
outputs. For a conventional implementation this is possible by sign
extension of the input signal to the filter section by one bit and a simple
comparison between the extended bit and the sign bit [75]. If these two bits
differ, an overflow has occurred and the signal should be saturated.
Figure 1.22: A carry-save, first-order Richards' allpass section with
two's-complement input and output.
Figure 1.23: A carry-save, first-order Richards' allpass section with
carry-save input and output.
To implement this saturation function in a carry-save structure, the
value of the signal at the adaptor outputs must be computed by adding the
sum and carry vectors. This can be performed by VMAs, placed at the
adaptor outputs. This will, however, increase the iteration period signiﬁ-
cantly. Instead a saturation scheme that can be applied directly on the
carry-save number should be used. By performing a partial carry propaga-
tion on the most signiﬁcant parts of the results only, it is possible to detect
if the signal is within a certain number range. The degree of uncertainty
of whether an overﬂow has occurred or not depends on the number of bits
that are added together. For an LWDF implementation the stability can be
guaranteed using saturation logic with a two bit carry propagation only
for a carry-save implementation [33] [34].
A positive overflow is detected according to Eq. (1.14) and a negative
overflow is detected according to Eq. (1.15)

$$ \mathrm{POVL} = (c_{W_d-1} + s_{W_d-1}) \cdot (c_{W_d-2} + s_{W_d-2}) \tag{1.14} $$

$$ \mathrm{NOVL} = (c_{W_d-1} \cdot s_{W_d-1}) \cdot (c_{W_d-2} \cdot s_{W_d-2}) \tag{1.15} $$

where $s_{W_d-1}$ and $s_{W_d-2}$ are the most and second most significant
bits from the sum vector and $c_{W_d-1}$ and $c_{W_d-2}$ are the most and
second most significant bits from the carry vector.
At the output of a filter section two's-complement representation may
be required. Then a VMA and conventional saturation logic can be used.
Since this VMA is not situated in the recursive part, it does not affect the
iteration period bound of the filter implementation.
1.8.3 Iteration Period Bound for Carry-Save LWDFs
The iteration period bound for a first-order Richards' allpass section was
given in Eq. (1.8) as Tmin = 2Tadd + Tmult,α. For the second-order Richards'
allpass section the iteration period bound was given in Eq. (1.9) as
Tmin = 4Tadd + Tmult,α1 + Tmult,α2.
When mapping the adaptor to carry-save arithmetic the iteration
period bound for the first-order section is
$$ T_{\min} = 2T_{\mathrm{CSA}} + T_{\mathrm{mult},\alpha} \tag{1.16} $$

where TCSA is the latency of one carry-save adder. This expression is valid
if a two's-complement input to the filter section is used. If, instead, the
filter section has a carry-save input, the iteration period bound is
increased by 2TCSA to

$$ T_{\min} = 4T_{\mathrm{CSA}} + T_{\mathrm{mult},\alpha} \tag{1.17} $$

For a second-order allpass section realized with carry-save arithmetic,
the corresponding iteration period bounds are

$$ T_{\min} = 6T_{\mathrm{CSA}} + T_{\mathrm{mult},\alpha_1} + T_{\mathrm{mult},\alpha_2} \tag{1.18} $$

and

$$ T_{\min} = 8T_{\mathrm{CSA}} + T_{\mathrm{mult},\alpha_1} + T_{\mathrm{mult},\alpha_2} \tag{1.19} $$

where Eq. (1.18) corresponds to the case with a two's-complement input
and Eq. (1.19) corresponds to the case with a carry-save input to the filter
section.
The latency of a bit-parallel, constant coefficient multiplier depends on
the number of nonzero bits in the coefficients, when mapped to a tree
structure. Since the input to the multiplier in the filter section is in
carry-save representation, each nonzero bit of the coefficient yields two
inputs to the tree. The number of adder levels in the adder tree increases
when the number of inputs to the tree increases. Each new level introduced
in the tree increases the latency of the multiplier by TCSA.
Since the number of inputs available on the top level of the tree increases
when the tree height increases, the latency of the multiplier increases
more slowly for coefficients with a larger number of nonzero bits.
As an example a first-order Richards' structure with a coefficient
$\alpha = 0.40625_{10} = 0.10\bar{1}01_{\mathrm{CSDC}}$ is considered. This
coefficient has three nonzero bits, yielding six partial product terms as
input to the adder tree. This requires a tree of height three, such as the
Wallace tree shown in Fig. 1.19. This tree has a latency of 3TCSA. Hence,
the iteration period bound for the first-order Richards' structure of our
example becomes 5TCSA if a two's-complement input is considered and 7TCSA
if a carry-save input is considered.
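The numbers in this example can be reproduced with a short calculation. The sketch below assumes that every level of the tree groups its operands in threes and passes leftover operands on unchanged, which is one common way of building a Wallace tree; the function names are illustrative assumptions. The bounds are expressed in units of TCSA according to Eq. (1.16) and Eq. (1.17).

```python
def wallace_height(num_inputs):
    """Number of CSA levels needed to reduce the operands to two."""
    levels = 0
    n = num_inputs
    while n > 2:
        n -= n // 3          # each CSA turns three operands into two
        levels += 1
    return levels


def first_order_bound(nonzero_digits, carry_save_input):
    """Iteration period bound of a first-order section, in units of TCSA."""
    # The multiplier input is in carry-save form: two operands per digit.
    t_mult = wallace_height(2 * nonzero_digits)
    return (4 if carry_save_input else 2) + t_mult
```

With three nonzero digits the tree has six inputs and height three, so the two bounds evaluate to five and seven carry-save adder latencies.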
1.9 Power Consumption in Digital CMOS
Circuits
The power consumption of a digital CMOS circuit is

$$ P = P_{\mathrm{dyn}} + P_{\mathrm{short}} + P_{\mathrm{leak}} \tag{1.20} $$

where P is the total power consumption of the circuit, Pdyn is the dynamic
power consumption, Pshort is the short circuit power consumption, and
Pleak is the power consumption due to leakage currents. Among these the
dynamic power consumption, due to charging and discharging of wire
and transistor capacitances in the circuit, dominates.
The short circuit power consumption, which is due to the current flowing
through the gate during the switching transitions, is typically less than
10% of the dynamic power consumption [8]. The leakage power consumption
is due to leakage currents through the transistors. These currents are
very small if the power supply voltage is large compared to the
threshold voltage. In this section examples of methods for reduction of
the dynamic power consumption are discussed.
There are two trade-offs that are important for the design and
implementation of digital CMOS circuits for low power consumption. First
there is power consumption vs. throughput. Often a reduction of the power
consumption of a circuit is possible at the expense of a reduction of the
throughput. Since there typically are requirements on the throughput, this
will limit the potential reduction of the power consumption. The second
trade-off is between power consumption and flexibility.
Algorithm-specific designs are known to be significantly more power
efficient than more flexible designs.
1.9.1 Dynamic Power Consumption
The dynamic power dissipation for a digital CMOS circuit can be
approximated by the well known formula

$$ P_{\mathrm{dyn}} = \alpha f_c C_L V_{DD}^2 \tag{1.21} $$

where α is the switching activity of the circuit, fc is the clock frequency
of the circuit, CL is the total load capacitance of the circuit, and VDD is
the power supply voltage. The switching activity and the load capacitance
are often combined into one factor, the switched capacitance Cα. Then the
dynamic power consumption is

$$ P_{\mathrm{dyn}} = f_c C_{\alpha} V_{DD}^2 \tag{1.22} $$

To obtain a low power consumption all these factors should be considered
at all levels of the design, from the system design down to the
technology level [22].
1.9.2 Power Supply Voltage Scaling
An efficient method for reducing the power consumption of CMOS circuits
is power supply voltage scaling [7] [8]. This method can be applied
at all levels of the design. Basically it means that any excess speed in a
design can be traded for reduced power consumption by reducing the
power supply voltage as far as possible with respect to the requirements
on throughput.
The propagation delay of a CMOS gate is approximately

$$ t_d = \frac{C_L V_{DD}}{\beta (V_{DD} - V_T)^{\alpha}} \tag{1.23} $$

where β is the transconductance, VT is the threshold voltage, and α is a
process parameter. For long channel devices α = 2, while for short channel
devices α < 2 [26] [43].
From Eq. (1.23) it can be seen that, for VDD well above VT and α close to
2, the delay of a CMOS gate is approximately inversely proportional to the
power supply voltage, while Eq. (1.21) shows that the power consumption
scales with the square of the power supply voltage. Hence, for a circuit
with a maximal sample rate which is larger than required, the power supply
voltage can be reduced to obtain a lower power consumption while still
meeting the throughput requirements.
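The trade between excess speed and supply voltage can be sketched numerically from Eq. (1.21) and Eq. (1.23). All names and numerical values in the sketch below (a 1.8 V nominal supply, a 0.5 V threshold voltage, and a circuit that is twice as fast as required) are illustrative assumptions and are not taken from this thesis. The bisection relies on the delay decreasing monotonically with VDD, which holds for α ≥ 1.

```python
def dynamic_power(activity, f_clk, c_load, v_dd):
    """Eq. (1.21): switching activity * f_c * C_L * VDD^2."""
    return activity * f_clk * c_load * v_dd ** 2


def gate_delay(v_dd, v_t, c_load=1.0, beta=1.0, alpha=2.0):
    """Eq. (1.23), valid for v_dd > v_t."""
    return c_load * v_dd / (beta * (v_dd - v_t) ** alpha)


def min_supply_voltage(v_nom, v_t, slack, alpha=2.0, tol=1e-9):
    """Lowest VDD whose delay is at most slack * delay at v_nom."""
    target = slack * gate_delay(v_nom, v_t, alpha=alpha)
    lo, hi = v_t + tol, v_nom          # delay(lo) is huge, delay(hi) is ok
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gate_delay(mid, v_t, alpha=alpha) > target:
            lo = mid                   # too slow, raise the voltage
        else:
            hi = mid                   # fast enough, try a lower voltage
    return hi


# Assumed example: the circuit is twice as fast as required (slack = 2).
v_low = min_supply_voltage(v_nom=1.8, v_t=0.5, slack=2.0)
# Power ratio at unchanged clock frequency and switched capacitance.
ratio = dynamic_power(0.5, 100e6, 1e-12, v_low) / dynamic_power(0.5, 100e6, 1e-12, 1.8)
```

Because the clock frequency and the switched capacitance are kept constant, the ratio equals the square of the voltage ratio.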
1.9.3 Power Reduction at the System Level
There are several methods that can be applied on the system level to
reduce the power consumption of a design. Examples of such methods are
dynamic power supply voltage scaling and power down techniques.
Dynamic power supply voltage scaling can be applied on systems
where the workload changes with time. The power consumption can be
reduced by increasing the power supply voltage when a high throughput
is required and reducing it when the system requires a low throughput, or
is in idle mode [5].
The power consumption can also be reduced by shutting down the sys-
tem, or parts of it, when it is idle. An example of a power down technique
is gating of the clock signal. This will not only shut down the circuit that
is idle, it will also reduce the switching activity on the clock net which
will reduce the power consumption further.
1.9.4 Power Reduction at the Algorithm Level
It is often possible to implement a digital signal processing task, for
example a digital ﬁlter, using different algorithms and still meet the given
speciﬁcation. As previously discussed a wide variety of algorithms can be
used for realization of a digital ﬁlter, such as FIR ﬁlters and WDFs [75].
These algorithms have different properties with respect to power con-
sumption as well as throughput of the ﬁlter realization.
The selected algorithm can also often be transformed and modiﬁed,
without changing the functionality, to reduce the power consumption [6]
[54] [57] [58] [76]. Such transformations and modiﬁcations of an algo-
rithm can be aimed at increasing the throughput. The increased through-
put can then be traded for reduced power consumption through power
supply voltage scaling. In this thesis we will present arithmetic transfor-
mations of ﬁrst- and second-order Richards’ allpass sections that increase
the throughput of LWDFs, implemented in carry-save arithmetic. The
proposed transformations will be further discussed in Chapter 2.
Another method for increasing the throughput of an algorithm is
pipelining, i.e. propagation of delay elements into nonrecursive parts of
the signal-ﬂow graph. This shortens the critical path and, hence, increases
the throughput, as shown in Fig. 1.24. However, when pipelining is
introduced the latency of the algorithm at the sequence level increases,
while the latency in terms of actual physical time does not.
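The effect of pipelining on the critical path can be illustrated with a small model of a nonrecursive chain of processing elements. The function name and the normalized latencies are assumptions made for this sketch.

```python
def critical_path(pe_latencies, registers_after=()):
    """Longest register-to-register delay in a chain of processing elements.

    registers_after holds the indices of the elements that are followed
    by a pipeline register.
    """
    longest = 0
    current = 0
    for index, latency in enumerate(pe_latencies):
        current += latency
        longest = max(longest, current)
        if index in registers_after:
            current = 0              # a register starts a new path
    return longest


t_cp_plain = critical_path([1, 1])          # two elements, no register
t_cp_piped = critical_path([1, 1], {0})     # register between them
```

With equal element latencies TPE, the unpipelined chain has a critical path of 2TPE and the pipelined chain a critical path of TPE.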
Another method for increasing the throughput is to exploit the paral-
lelism in highly parallel algorithms, by using pipelining and/or interleav-
ing the computations between the processing elements [27]. In [9] this
was combined with power supply voltage scaling and it was shown to be
efﬁcient for reduction of the power consumption of digital CMOS cir-
cuits.
1.9.5 Power Reduction at the Logic Level
At the logic level arithmetic operations are realized by gates, latches, and
ﬂip-ﬂops. To utilize power supply voltage scaling for lowering the power
consumption, these circuits should be functional at a low power supply
voltage.
Another issue at the logical level is the unwanted transition activity,
i.e., glitches. This occurs when there are paths between inputs and outputs
of a logical net with different propagation delays. This may have a large
impact on the power consumption of the circuit.
One method for reduction of the glitches is to equalize the delays in
the logical nets, i.e., to reduce the differences in propagation delay
between the logical paths. Examples of how this can be done are the
introduction of buffers in the nets [65] and resizing of transistors [78].
Another method for
reduction of the glitches is to introduce registers in the logical nets to
obtain shorter paths [36] [68].
Figure 1.24: A signal-flow graph without (a) and with (b) pipelining. The
critical path is TCP = 2TPE in (a) and TCP = TPE in (b).
1.9.6 Power Reduction at the Technology Level
Power supply voltage scaling is, as discussed above, a very efﬁcient
method for reduction of the power consumption in CMOS circuits. How-
ever, in deep submicron technologies the power supply voltage is reduced
in each generation, but the threshold voltage is not reduced proportion-
ally. This makes power supply voltage scaling less efﬁcient since the
delay of a gate increases as the power supply voltage approaches the
threshold voltage, as can be seen from Eq. (1.23). Hence, the margin for
power supply scaling for lowering the power consumption is reduced.
By using techniques for reducing the threshold voltage, the margin for
power supply voltage scaling can be increased. A reduced threshold volt-
age will improve the speed of the transistor at low power supply voltage.
However, reducing the threshold voltage results in increased leakage cur-
rents, increasing the power consumed due to leakage. One solution to this
problem is to use a multiple-threshold voltage CMOS process [40]. For
such a process, low-threshold transistors, which are fast and have large
leakage currents, are used for time critical parts and slower transistors,
with a higher threshold voltage, are used for non time critical parts.
1.10 A Design Flow for Digital Filters
Here an overview of our design environment for digital ﬁlters is given.
The design environment includes all design steps required, from the ﬁlter
design down to a physical implementation [51].
1.10.1 Filter Design
The ﬁrst step of the design ﬂow, the ﬁlter design, can be divided into two
different parts, design of FIR ﬁlters and design of LWDFs.
FIR Filters
Hardware efﬁcient FIR ﬁlters can be designed using optimization meth-
ods to ﬁnd simple coefﬁcients. Depending on the constraints on the ﬁlter,
different optimization methods may be used. The constraints for the
design of an FIR ﬁlter can typically be the passband and stopband limits
and the passband and stopband attenuations. It can also be factors that
affect the filter implementation or, more specifically, the power
consumption of the implementation. Examples of such constraints are the number
of nonzero bits in the set of coefﬁcients or the coefﬁcient wordlengths.
We use an FIR filter design method based on mixed integer linear
programming (MILP) [20] [23]. This method allows constraints to be applied
on the number of nonzero bits required in the constant coefficients, for a
given ﬁlter speciﬁcation. By ﬁnding a set of coefﬁcients with a low
number of nonzero bits, the implementation cost can be minimized.
Lattice Wave Digital Filters
The lattice wave digital ﬁlter is derived from an analog lattice structure.
From this reference structure, real coefﬁcients for the digital ﬁlter can be
derived [16] [75]. These coefficients must then be quantized to a
fixed-point representation.
Finding an optimal set of ﬁxed coefﬁcients for an LWDF is an optimi-
zation problem with many degrees of freedom. Such problems can be
solved using, for example, simulated annealing.
1.10.2 Mapping to Hardware
The next step in the design ﬂow is to map the ﬁlter design to hardware. To
improve the design efﬁciency, the mapping to hardware is performed
using code and layout generators to as high degree as possible.
VHDL Code Generators
We have developed, and are developing, several VHDL code generators
for digital ﬁlters. The VHDL code generated by these tools is structural
and well suited for logic synthesis. Hence, by using commercial logic
synthesis tools and standard cells libraries, a physical layout is obtained.
Also, since the VHDL code is technology independent, a high degree of
reusability is possible when the technology is changed.
Here follow some examples of generators currently available or in the
ﬁnal stages of development:
• Bit-parallel FIR ﬁlters – This generator is based on bit-parallel, carry
save arithmetic. It supports several FIR ﬁlter structures such as the
direct form, polyphase structures, and differential FIR ﬁlters.
• Bit-serial LWDFs – This generator provides maximally fast bit-serial
first- and second-order Richards' allpass sections.
• Bit-parallel LWDFs – This generator is based on bit-parallel, carry-
save arithmetic. The tool generates VHDL code for first- and second-order
Richards' allpass sections that are isomorphically mapped to carry-save
arithmetic. The multipliers within the adaptors are mapped to
Wallace adder tree structures.
• Bit-parallel adders – From this generator several different bit-parallel
adder structures can be generated. One such adder structure is the
basic ripple-carry adder, with or without pipelining.
1.10.3 Module Generators
To improve the energy efﬁciency of the implementation further, uncon-
strained layouts are used for critical blocks. To increase the design efﬁ-
ciency of such blocks we use module generators. A module generator is a
parameterized building block which is based on unconstrained layout
cells. A typical module generator could be a constant multiplier, where
the input parameters are the coefﬁcient and the data wordlengths. Since
digital ﬁlters are implemented using a small set of operations, for exam-
ple addition and multiplication, module generators can be used.
Examples of module generators that have been developed, or are in the
ﬁnal stages of development are:
• Bit-parallel adder tree – Based on the overturned stair adder tree. Suit-
able for implementation of high speed, bit-parallel multipliers.
• RAM generator – A low power RAM memory generator with parame-
terized size and data word-length.
• Digit-serial two-port adaptor – A symmetric two-port adaptor imple-
mented using digit-serial arithmetic. The parameters are the data
word-length, the coefﬁcient, and the digit size.
These generators are developed for a 0.18 µm CMOS process from
STMicroelectronics. Also, several module generators have been developed
in older CMOS processes. The main difference between generators
developed in different technologies is the building blocks used at the
lowest level. When a new technology is introduced the basic cells can be
redesigned while the rest of the generator can be reused. Hence, this
methodology simplifies future technology changes.
1.11 Outline of the Thesis
In Chapter 2 some proposed arithmetic transformations of LWDFs are
presented. The purpose of these transformations is to increase the
throughput as well as to reduce the power consumption. The content of
this chapter has previously been published in:
• H. Ohlsson and L. Wanhammar, “Implementation of bit-parallel lattice
wave digital ﬁlters,” in Proc. Swedish System-on-Chip Conf.,
SSoCC’01, Arild, Sweden, March 20–21, 2001.
• H. Ohlsson, O. Gustafsson, and L. Wanhammar, “Arithmetic transfor-
mations for increased maximal sample rate of bit-parallel bireciprocal
lattice wave digital ﬁlters,” in Proc. IEEE Int. Symp. on Circuits Sys-
tems, ISCAS’01, Sydney, Australia, May 6-9, 2001, pp. 825–828.
• H. Ohlsson, O. Gustafsson, H. Johansson and L. Wanhammar, “Imple-
mentation of lattice wave digital ﬁlters with increased maximal sample
rate,” in Proc. IEEE Int. Conf. on Elec. Circuits Systems, Malta, Sep
6–9, 2001, pp. 71–74.
In Chapter 3 implementation of a digital down converter for a wide-
band radar receiver is discussed. Three different ﬁlter structures used for
realizing the DDC, combining LWDFs and FIR ﬁlters, are presented.
These structures are compared with respect to throughput and arithmetic
complexity. This chapter is based on the following publications:
• H. Ohlsson, H. Johansson, and L. Wanhammar, “Design of a digital
down converter using high speed digital ﬁlters,” in Proc. Symposium
on Gigahertz Electronics, GHz2000, Gothenburg, Sweden, March 13–
14, 2000, pp. 309–312.
• A. Gustafsson, K. Folkesson, and H. Ohlsson, “A Simulation Environ-
ment for Integrated Frequency and Time Domain Simulations of a
Radar Receiver,” in Proc. Symposium on Gigahertz Electronics,
GHz2001, Lund, Sweden, Nov. 26–27, 2001.
• H. Ohlsson and L. Wanhammar, “A digital down converter for a wide-
band radar receiver,” in Proc. National Conf. Radio Science, RVK’02,
Kista, Sweden, June 10–13, 2002, pp. 478–481.
In Chapter 4 the implementation of a combined interpolator and deci-
mator for an OFDM system is discussed. Four different ﬁlter structures
are evaluated with respect to throughput and arithmetic complexity. Also,
the implementation of a circuit using a novel filter structure is presented.
The chapter is based on the following publications:
• H. Ohlsson, H. Johansson, and L. Wanhammar, “Implementation of a
combined high-speed interpolation and decimation wave digital ﬁlter,”
in Proc. IEEE Int. Conf. on Elec. Circuits Systems, ICECS’99, Paphos,
Cyprus, Sept. 5–8, 1999, pp. 721–724.
• H. Ohlsson, H. Johansson, and L. Wanhammar, “Implementation of a
combined interpolator and decimator for an OFDM system demon-
strator,” in Proc. NorChip Conf 2000, Turku, Finland, Nov. 6–7, 2000,
pp. 47–52.
2 ARITHMETIC TRANSFORMATIONS OF LATTICE WAVE DIGITAL FILTERS
In this chapter arithmetic transformations of ﬁrst- and second-order Rich-
ards’ allpass sections aimed at increasing the throughput as well as reduc-
ing the power consumption of LWDF implementations are discussed. The
proposed transformations are applied on LWDFs, implemented using bit-
parallel, carry-save arithmetic. Similar arithmetic transformations have
previously been proposed for LWDFs implemented using bit-serial arith-
metic [53] and for implementation of LWDFs on digital signal processors
[14] [15].
The proposed arithmetic transformations yield a reduced iteration
period bound, Tmin, for LWDFs composed by ﬁrst- and second-order
Richards’ allpass sections. The reduction of the iteration period bound
can be traded for reduced power consumption through power supply volt-
age scaling.
The proposed arithmetic transformations yield an increase of the
arithmetic complexity required by an allpass section. However, since only
transformations of the critical loop are required, the total arithmetic
complexity of a filter implementation may not increase significantly.
The work presented in this chapter has previously been published in
[47] [48] [49].
2.1 Arithmetic Transformation of First-Order
Richards’ Allpass Sections
The ﬁrst-order Richards’ allpass sections considered here are composed
by a symmetric two-port adaptor, as shown in Fig. 2.1. The ﬁrst-order all-
pass section is the only allpass section required for implementation of
BLWDFs used for decimation and interpolation. Hence, ﬁrst-order allpass
sections determine the iteration period bound for such filters. Such filter
sections can also be used in LWDF implementations. However, for such
filters the iteration period bound is normally not determined by a
first-order allpass section.
2.1.1 Richards’ Structure
The signal-ﬂow graph for the conventional symmetric two-port adaptor
forming the ﬁrst-order Richards’ allpass section, as shown in Fig. 2.1, is
given by Eq. (2.1) and Eq. (2.2)

$$ B_1 = \alpha_0 (A_2 - A_1) + A_2 \tag{2.1} $$

$$ B_2 = \alpha_0 (A_2 - A_1) + A_1 \tag{2.2} $$

where B1 and B2 are the adaptor outputs, A1 and A2 are the adaptor inputs,
and α0 is the adaptor coefficient. The placement of these ports on the
adaptor is shown in Fig. 2.2.
Figure 2.1: A first-order Richards' allpass section.
The iteration period bound of the first-order Richards' allpass section
was given in Eq. (1.8) as 2Tadd + Tα.
2.1.2 Proposed Transformation
The proposed arithmetic transformation can be derived from the signal-
ﬂow graph of the symmetric two-port adaptor directly, as shown in Fig.
2.3. In the ﬁrst step, the multiplication is propagated backwards, past the
subtraction. In the next step, the ﬁnal addition in the loop is propagated
backwards and placed after the multiplication at the adaptor input. As a
result of this transformation, a subtraction is required at the output of the
filter section. Finally, the multiplication with –α and the addition
following the multiplication are merged into one multiplication with the
coefficient 1–α.
The signal-ﬂow graph of the transformed ﬁlter section can be
described by Eq. (2.3) and Eq. (2.4).

$$ B_1 = (1 + \alpha) A_2 - \alpha A_1 \tag{2.3} $$

$$ B_2 = (1 - \alpha) A_1 + \alpha A_2 \tag{2.4} $$

These equations are numerically equivalent to Eq. (2.1) and Eq. (2.2) if
the quantizations are placed as shown in Fig. 2.3. Hence, the methods for
guaranteed stability of the conventional LWDF structure can be applied
on the transformed structure as well.
Figure 2.2: Port definitions for the symmetric two-port adaptor.
The transformed ﬁrst-order allpass section requires two multiplica-
tions, compared to the single multiplication required in the conventional
structure, while the number of additions is equal between the two struc-
tures. However, as for the conventional structure, only one multiplication
is placed in the critical loop of the ﬁlter section. Also, there is only one
addition in the loop. Hence, the iteration period bound is reduced, from
2Tadd + Tmult,a to Tadd + Tmult,a for the transformed structure, compared
to the conventional structure.
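
The equivalence of Eq. (2.1) and Eq. (2.2) with Eq. (2.3) and Eq. (2.4), and the effect on the loop latency, can be checked with a short script. The following is a minimal sketch only: the function names are hypothetical, exact rational arithmetic stands in for the quantized arithmetic of a real implementation, and the latencies passed to `iteration_bounds` are arbitrary example values rather than measured ones.

```python
from fractions import Fraction

def conventional_adaptor(a, a1_in, a2_in):
    """Symmetric two-port adaptor, Eq. (2.1) and Eq. (2.2). Returns (B1, B2)."""
    product = a * (a2_in - a1_in)
    return product + a2_in, product + a1_in

def transformed_adaptor(a, a1_in, a2_in):
    """Transformed adaptor, Eq. (2.3) and Eq. (2.4). Returns (B1, B2)."""
    b1 = (1 + a) * a2_in - a * a1_in
    b2 = (1 - a) * a1_in + a * a2_in
    return b1, b2

def iteration_bounds(t_add, t_mult):
    """Loop latency (conventional, transformed) for one delay in the loop."""
    return 2 * t_add + t_mult, t_add + t_mult

# Compare both forms over a small grid of exact rational inputs.
a = Fraction(5, 8)  # example coefficient, chosen arbitrarily
grid = [Fraction(n, 16) for n in range(-4, 5)]
all_equal = all(
    conventional_adaptor(a, x, y) == transformed_adaptor(a, x, y)
    for x in grid
    for y in grid
)
print(all_equal)
print(iteration_bounds(t_add=1, t_mult=3))
```

With exact arithmetic the two forms agree for every input pair. In a fixed-point implementation the agreement additionally relies on the quantizations being placed as in Fig. 2.3.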
Figure 2.3: Transformation steps for transformation of the first-order allpass
section.
2.1.3 Mapping of the Transformed Filter Section to
Carry-Save Arithmetic
In Fig. 2.4 the mapping of the ﬁrst-order Richards’ allpass section to
carry-save arithmetic is illustrated. The ﬁgure shows two possible map-
pings, with and without a VMA at the output of the multiplication of the
adaptor input.
For a carry-save implementation of the transformed allpass section
further reduction of the iteration period bound may be possible, compared
to the conventional structure. When mapping the adaptor multiplication to
a Wallace tree in the transformed structure there may be, depending on
the number of nonzero bits in the coefﬁcient, unused inputs at the top
level of the adder tree. These inputs can be used to merge the addition
required in the loop with the Wallace tree, without increasing the height
of the adder tree. Hence, for such cases the addition can be executed in
parallel with the multiplication, and a further reduction of the iteration
period bound, to Tmin = Tmult, is obtained.
If there are two, or more, inputs available in the Wallace tree, the result
from the multiplication with 1–a can be included in the tree in carry-save
representation. This case corresponds to the implementation shown in
Fig. 2.4 (a). The operations that can be merged into one adder tree are
marked in the ﬁgure. If there is only one input available in the tree it is
still possible to include the addition in the tree by applying a VMA on the
result from the 1–a multiplication. This is shown in Fig. 2.4 (b), with the
operations that can be merged into one tree marked. Since the input VMA
is placed outside the loop, it does not increase the iteration period bound.
Since the arithmetic complexity of a constant multiplier depends on
the coefﬁcient value, the arithmetic complexity of the transformed ﬁlter
section depends on this as well. Also, the increase depends on whether
two’s-complement or carry-save representation is used at the adaptor
input.
2.1.4 Evaluation of the Transformation
The proposed arithmetic transformations have been evaluated with
respect to throughput and arithmetic complexity, compared to a conven-
tional implementation. The former has been determined by computation
of Tmin in terms of TCSA, both with and without an input VMA. These
computations have been performed for coefﬁcients with up to ten nonzero
bits in CSDC representation. The latter has been determined by computa-
tion of the number of carry-save adders required for coefﬁcients with up
to ten nonzero CSDC bits. These evaluations have been performed for all-
pass sections with two’s-complement as well as carry-save inputs.
Iteration Period Bound of the Transformed Structure
The iteration period bound for the conventional and the transformed
structure has been estimated in terms of TCSA, i.e., the latency of one
carry-save adder, for different numbers of nonzero bits in the coefﬁcient.
The iteration period bound is also affected by the data representation on
the input of the ﬁlter section, two’s-complement or carry-save, and the use
of an input VMA. The results of these computations are given in Table
2.1.
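
The entries for the transformed structure in Table 2.1 can be reproduced by counting the levels of a Wallace tree. The sketch below is illustrative and rests on an assumption that is not stated explicitly in the text: the tree in the loop is taken to have two operands per nonzero coefficient bit (the fed-back value in carry-save form) plus the result of the 1–a multiplication, which contributes two operands without an input VMA and one operand with an input VMA. The function names are hypothetical. The special case a = 0.5 without an input VMA, listed as 1 in Table 2.1, is not modelled.

```python
def wallace_height(num_operands):
    """Levels of carry-save adders (3:2 counters) needed to reduce
    num_operands operands to a carry-save pair."""
    levels = 0
    while num_operands > 2:
        # Each full group of three operands is reduced to two.
        num_operands = 2 * (num_operands // 3) + num_operands % 3
        levels += 1
    return levels

def transformed_tmin(nonzero_bits, input_vma):
    """Tmin in units of TCSA under the operand-count assumption above."""
    extra = 1 if input_vma else 2  # operands from the (1 - a) product
    return wallace_height(2 * nonzero_bits + extra)

no_vma = [transformed_tmin(k, input_vma=False) for k in range(1, 11)]
with_vma = [transformed_tmin(k, input_vma=True) for k in range(1, 11)]
print(no_vma)    # [2, 3, 4, 5, 5, 6, 6, 6, 7, 7]
print(with_vma)  # [1, 3, 4, 4, 5, 5, 6, 6, 6, 7]
```

Both lists agree with the corresponding columns of Table 2.1 for one to ten nonzero bits.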
In Fig. 2.5 the reduction of Tmin for a two’s-complement input is
summarized, while Fig. 2.6 shows a summary of the results for the case
with a carry-save input.
Figure 2.4: Mapping of a first-order allpass section to carry-save arithmetic (a)
without an input VMA and (b) with an input VMA.
From the table it can be seen that the largest reduction of the iteration
period bound that can be expected is 50% when a two’s-complement
input is considered. This is obtained when there is only one nonzero bit in
the coefficient and an input VMA is used. However, without the input
VMA, a = 0.5 is the only coefficient with only one nonzero bit that will
result in a reduced Tmin. If an adaptor with a coefficient with two nonzero
bits is considered, a reduction of Tmin from 4TCSA to 3TCSA, i.e., a
reduction by 25%, can be expected, with or without an input VMA.
For the case with a carry-save input a larger reduction of the iteration
period bound is obtained. This is due to the fact that the carry-save input
increases the iteration period bound by 2TCSA for the conventional
filter section, as shown in Eq. (1.16) and Eq. (1.17). The transformed
structure has the same iteration period bound, independently of
the number representation of the input. Hence, a carry-save input yields a
larger reduction of the iteration period bound after arithmetic
transformation of the adaptor.
Nonzero  Conv.       Conv.        Transformed         Transformed
bits     (2C input)  (CSA input)  (no input VMA)      (input VMA)
 1        2           4           1 (for a = 0.5)/2   1
 2        4           6           3                   3
 3        5           7           4                   4
 4        6           8           5                   4
 5        7           9           5                   5
 6        7           9           6                   5
 7        8          10           6                   6
 8        8          10           6                   6
 9        8          10           7                   6
10        9          11           7                   7
Table 2.1: Tmin for a first-order section expressed in TCSA for different
numbers of nonzero bits.
Figure 2.5: Reduction of Tmin for a first-order allpass section with
a two’s-complement input, with and without input VMA.
(Plot: reduction of Tmin in % versus number of nonzero bits in the coefficient.)
Figure 2.6: Reduction of Tmin for a first-order allpass section with
a carry-save input, with and without input VMA.
(Plot: reduction of Tmin in % versus number of nonzero bits in the coefficient.)
For example, for the same coefficient value as discussed above, with
two nonzero bits, a reduction of Tmin from 6TCSA to 3TCSA is obtained.
Hence, a 50% reduction of the iteration period bound is obtained when a
carry-save input is considered. This should be compared to the reduction
by 25% when a two’s-complement input was considered.
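
The percentages quoted above follow directly from Table 2.1. As a minimal sketch, with list names chosen here for illustration and the values copied from the table, the reduction for every number of nonzero bits can be tabulated as follows. The entry for one nonzero bit without an input VMA uses the general value 2, not the special case a = 0.5.

```python
# Tmin in units of TCSA, copied from Table 2.1; index 0 is one nonzero bit.
CONV_2C = [2, 4, 5, 6, 7, 7, 8, 8, 8, 9]
CONV_CS = [4, 6, 7, 8, 9, 9, 10, 10, 10, 11]
TRANS_NO_VMA = [2, 3, 4, 5, 5, 6, 6, 6, 7, 7]
TRANS_VMA = [1, 3, 4, 4, 5, 5, 6, 6, 6, 7]

def reduction_percent(conventional, transformed):
    """Relative reduction of Tmin, in percent, per number of nonzero bits."""
    return [100.0 * (c - t) / c for c, t in zip(conventional, transformed)]

red_2c_vma = reduction_percent(CONV_2C, TRANS_VMA)
red_2c_no_vma = reduction_percent(CONV_2C, TRANS_NO_VMA)
red_cs_no_vma = reduction_percent(CONV_CS, TRANS_NO_VMA)

for bits in range(1, 11):
    print(bits,
          round(red_2c_no_vma[bits - 1], 1),
          round(red_2c_vma[bits - 1], 1),
          round(red_cs_no_vma[bits - 1], 1))
```

For two nonzero bits this gives 25% for a two’s-complement input and 50% for a carry-save input, as stated in the text.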
Hardware Requirements of the Transformed Structure
The additional multiplier required in the transformed structure, compared
to the conventional structure, yields an increased arithmetic complexity.
The increase depends on the number of nonzero bits in the coefficient as
well as on the value of the adaptor coefficient when the multiplications
are mapped to Wallace trees. If the value of a is between 2/3 and 1, and a
two’s-complement input is considered, the 1–a multiplier requires one
carry-save adder less than the a multiplier. On the other hand, if the value
of a is between –2/3 and 1/3 one extra carry-save adder is required for the
1–a multiplier. For the remaining cases, with coefficient values between
1/3 and 2/3 and between –1 and –2/3, the number of carry-save adders
required for the two multipliers is equal.

Nonzero  Conv.  Transformed
bits            2/3 < a < 1   –1 ≤ a < –2/3,   –2/3 < a < 1/3
                              1/3 < a < 2/3
 1        4     -              4                5
 2        6     6              7                8
 3        8     9             10               11
 4       10    12             13               14
 5       12    15             16               17
 6       14    18             19               20
 7       16    21             22               23
 8       18    24             25               26
 9       20    27             28               29
10       22    30             31               32
Table 2.2: Number of carry-save adders required for a first-order allpass
section with a two’s-complement input.
Table 2.2 shows the number of carry-save adders required for the con-
ventional structure and the corresponding transformed structure with
two’s-complement input. Table 2.3 shows the number of carry-save
adders required for a ﬁlter section with a carry-save input. As before,
coefﬁcients with up to 10 nonzero bits in CSDC representation are con-
sidered.
The increase of the number of carry-save adders required for the two
cases is summarized in Fig. 2.7 and Fig. 2.8.
Nonzero  Conv.  Transformed
bits            2/3 < a < 1   –1 ≤ a < –2/3,   –2/3 < a < 1/3
                              1/3 < a < 2/3
 1        6     -              6                8
 2        8     8             10               12
 3       10    12             14               16
 4       12    16             18               20
 5       14    20             22               24
 6       16    24             26               28
 7       18    28             30               32
 8       20    32             34               36
 9       22    36             38               40
10       24    40             42               44
Table 2.3: Number of carry-save adders required for a first-order allpass
section with a carry-save input.
Figure 2.7: Increase of the number of carry-save adders for different
coefficient values for a first-order allpass section with
a two’s-complement input.
(Plot: increased number of CSAs in % versus number of nonzero bits in the
coefficient, for –1 ≤ a < –2/3 and 1/3 < a < 2/3, for –2/3 < a < 1/3, and for
2/3 < a < 1.)
Figure 2.8: Increase of the number of carry-save adders for different
coefficient values for a first-order allpass section with
a carry-save input.
(Plot: increased number of CSAs in % versus number of nonzero bits in the
coefficient, for the same three coefficient ranges.)
From the table it can be seen that for coefficients between 2/3 and 1,
the increase of the number of carry-save adders required is quite small
(20% or less) for coefficients with up to 4 nonzero bits. This is for the
case with a two’s-complement input. On the other hand, for coefficients
between –2/3 and 1/3 the cost is about 30% independently of the number
of nonzero bits in the coefficient. When a carry-save input is considered,
the increase of the number of carry-save adders required is somewhat
larger. Compared to the two’s-complement case the increase is about
twice as large.
Here it should be noted that the increased arithmetic complexity only
affects the transformed filter sections. It may be sufficient to transform
only one filter section of a filter realization to obtain the required
throughput. Hence, the total number of carry-save adders required for a
filter implementation may not increase significantly, and the overall
increase of the area required will be small while the reduction of the
iteration period bound may be significant.
2.1.5 Example
The proposed arithmetic transformations are illustrated with an example.
The considered ﬁlter is a carry-save implementation of a ninth-order
BLWDF meeting the speciﬁcation given in Eq. (2.5)

$$\begin{aligned}
\omega_c T &= 0.41\pi\ \text{rad} \\
\omega_s T &= 0.59\pi\ \text{rad} \\
A_{\min} &= 55\ \text{dB}
\end{aligned} \tag{2.5}$$

where ωcT is the cutoff frequency, ωsT is the stopband frequency, and
Amin is the stopband attenuation. In Fig. 2.9 the magnitude function of the
example filter is shown.
The filter structure is shown in Fig. 2.10 and the filter coefficients are
given in Table 2.4, as real numbers as well as in CSDC representation.
Since the example filter is a BLWDF, the passband ripple depends on the
stopband attenuation. For a stopband attenuation of 55 dB, as in the
example filter specification, the passband ripple becomes very small, well
below any typical specifications and has, hence, not been included in the
filter specification.
For the example filter, first-order allpass sections are the only recursive
components. Since the considered filter is a BLWDF, each loop includes
two delay elements, as opposed to the one delay element of the filter
section evaluated above. The same transformations can, however, be applied
to such filter sections. The difference is that the iteration period bound is
halved for the example BLWDF, compared to the results given in the
evaluation in Section 2.1.4, for a certain coefficient value.

Figure 2.9: Magnitude function of the example filter.
Figure 2.10: A ninth-order BLWDF.
The iteration period bound of the example filter, before transformation,
is derived from Table 2.1 and the result is shown in Eq. (2.6). Here
two’s-complement inputs and outputs of the adaptors are considered. The
iteration period bound, 7/2 TCSA, is yielded by the two filter sections with
coefficients with five nonzero bits.

$$T_{\min} = \max\left\{ \tfrac{5}{2} T_{CSA},\ \tfrac{5}{2} T_{CSA},\ \tfrac{7}{2} T_{CSA},\ \tfrac{7}{2} T_{CSA} \right\} = \tfrac{7}{2} T_{CSA} \tag{2.6}$$

Now, if the proposed arithmetic transformations are applied to the two
filter sections yielding Tmin, a new iteration period bound of 5/2 TCSA for
the filter is obtained, as shown in Eq. (2.7).

$$T_{\min} = \max\left\{ \tfrac{5}{2} T_{CSA},\ \tfrac{5}{2} T_{CSA},\ \tfrac{5}{2} T_{CSA},\ \tfrac{5}{2} T_{CSA} \right\} = \tfrac{5}{2} T_{CSA} \tag{2.7}$$

Now the two transformed filter sections yield the same iteration period
bound as the two conventional structures. Hence, no further reduction of
Tmin is possible using the proposed arithmetic transformations. The result
is that the iteration period bound has been reduced by TCSA, from 7/2 TCSA
to 5/2 TCSA, using the proposed arithmetic transformations. This
corresponds to a reduction of the iteration period bound of about 29%.
The increased throughput comes at the cost of an increased number of
carry-save adders. From Table 2.2 the number of carry-save adders for the
original as well as the transformed structures can be computed. The
original structure requires 40 carry-save adders, while the transformed
structure requires 49. Hence, the transformed structure requires 9
carry-save adders more than the original structure, corresponding to a
23% increase of the arithmetic complexity.

Coefficient  Decimal          CSDC
a1           –0.091796875     0.001010001
a3           –0.31640625      0.01010001
a5           –0.583984375     0.101010101
a7           –0.8544921875    1.0010010101
Table 2.4: Coefficients for the example filter.
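
The steps of this example can be retraced numerically. The following sketch is illustrative only: it counts nonzero digits with a standard canonic signed-digit recoding (the helper `csd_nonzero_digits` is hypothetical and not taken from the thesis), looks up Tmin in the columns of Table 2.1 for a two’s-complement input and no input VMA, and halves the result because each BLWDF loop contains two delay elements.

```python
from fractions import Fraction

# Columns of Table 2.1 in units of TCSA; index 0 is one nonzero bit.
CONV_2C = [2, 4, 5, 6, 7, 7, 8, 8, 8, 9]
TRANS_NO_VMA = [2, 3, 4, 5, 5, 6, 6, 6, 7, 7]

# Decimal coefficient values from Table 2.4.
COEFFS = {
    "a1": "-0.091796875",
    "a3": "-0.31640625",
    "a5": "-0.583984375",
    "a7": "-0.8544921875",
}

def csd_nonzero_digits(value, frac_bits=10):
    """Number of nonzero digits in the canonic signed-digit form of value,
    assuming value * 2**frac_bits is an integer."""
    scaled = Fraction(value) * 2**frac_bits
    assert scaled.denominator == 1, "increase frac_bits"
    n = int(scaled)
    count = 0
    while n != 0:
        if n % 2:
            digit = 2 - (n % 4)  # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= digit
            count += 1
        n //= 2
    return count

bits = {name: csd_nonzero_digits(value) for name, value in COEFFS.items()}

# Two delay elements per loop halve the bound of each section.
before = {name: Fraction(CONV_2C[b - 1], 2) for name, b in bits.items()}
tmin_before = max(before.values())

# Transform only the sections that determine Tmin.
after = {
    name: Fraction(TRANS_NO_VMA[bits[name] - 1], 2) if t == tmin_before else t
    for name, t in before.items()
}
tmin_after = max(after.values())

print(bits)
print(tmin_before, tmin_after)
```

The digit counts are three for a1 and a3 and five for a5 and a7, and the bounds before and after transformation are 7/2 and 5/2 TCSA, in agreement with Eq. (2.6) and Eq. (2.7).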
2.2 Arithmetic Transformation of Second-Order
Richards’ Allpass Sections
In Section 1.4 it was shown that lattice wave digital ﬁlters can be imple-
mented using ﬁrst- and second-order Richards’ allpass sections. For such
LWDF implementations the iteration period bound is normally deter-
mined by the second-order sections.
Second-order Richards’ allpass sections can be implemented using
symmetric two-port adaptors. Hence, similar arithmetic transformations
to the ones derived in Section 2.1 for the ﬁrst-order section are possible.
2.2.1 Richards’ Structure
The signal-flow graph for a Richards’ second-order allpass section,
composed of conventional symmetric two-port adaptors, is shown in Fig.
2.11. The signal-flow graph can be described by Eq. (2.8) through Eq.
(2.11),

$$B_{11} = a_1 (A_{12} - A_{11}) + A_{12} \tag{2.8}$$
$$B_{12} = a_1 (A_{12} - A_{11}) + A_{11} \tag{2.9}$$
$$B_{21} = a_2 (A_{22} - A_{21}) + A_{22} \tag{2.10}$$
$$B_{22} = a_2 (A_{22} - A_{21}) + A_{21} \tag{2.11}$$

where the ports named A11 and A12 are the inputs and B11 and B12 are the
outputs of the lower two-port adaptor, while the inputs A21 and A22 and
the outputs B21 and B22 belong to the upper two-port adaptor, as shown in
Fig. 2.11. The iteration period bound for the second-order allpass section
is 4Tadd + Ta1 + Ta2, as shown in Eq. (1.9).

Figure 2.11: Conventional second-order allpass section.

2.2.2 Proposed Transformations
The two-port adaptors in the second-order allpass section can be modified
using the same arithmetic transformations as derived for the first-order
section. By modifying Eq. (2.8) through Eq. (2.11) according to the
transformation scheme described in Section 2.1.2, the transformed adaptor
equations Eq. (2.12) through Eq. (2.15) are obtained.

$$B_{11} = (1 + a_1) A_{12} - a_1 A_{11} \tag{2.12}$$
$$B_{12} = (1 - a_1) A_{11} + a_1 A_{12} \tag{2.13}$$
$$B_{21} = (1 + a_2) A_{22} - a_2 A_{21} \tag{2.14}$$
$$B_{22} = (1 - a_2) A_{21} + a_2 A_{22} \tag{2.15}$$

As for the first-order section, these transformations yield a numerically
equivalent structure as long as the quantizations are performed at the
adaptor outputs. Hence, the methods for guaranteed stability of a
conventional implementation can be applied to the transformed second-order
allpass section as well.
The signal-flow graph of a second-order section corresponding to Eq.
(2.12) through Eq. (2.15) is shown in Fig. 2.12. The critical loop, marked
in the figure, of the transformed second-order allpass section is
determined by Eq. (2.13) and Eq. (2.15). The iteration period bound for the
transformed second-order section is 2Tadd + Tmult,a1 + Tmult,–a2. The
latencies of the a2-multiplication and the –a2-multiplication are about the
same for CSDC coefficients. Hence, a reduction of the iteration period
bound by 2Tadd is obtained for the transformed structure, compared to the
conventional structure.

Figure 2.12: Transformed second-order allpass section.

2.2.3 Mapping of the Transformed Filter Section to
Carry-Save Arithmetic
The mapping of a second-order allpass section to carry-save arithmetic is
performed in a similar fashion to the case with a first-order allpass
section. As for the first-order section, depending on the number of nonzero
bits in the coefficient, the additions in the loop may be included in the
adder trees forming the multiplier without increasing the latency of the
tree.
In Fig. 2.13 a transformed second-order section with two’s-complement
input and output, mapped to carry-save arithmetic, is shown. This
filter section has no input VMA but, as was the case with the first-order
section, a VMA can be used to reduce the number of inputs to the adder
tree forming the a1-multiplication in the critical loop. For the upper
adaptor no VMA can be applied without having a significant impact on Tmin
since both the multiplications are placed in loops.

Figure 2.13: Transformed second-order allpass section mapped to carry-save
arithmetic.
2.2.4 Evaluation of the Transformation
As for the ﬁrst-order section the throughput as well as the arithmetic com-
plexity of a transformed ﬁlter section have been evaluated. The through-
put is estimated using Tmin in terms of number of TCSA and the area is
estimated as the required number of carry-save adders, NCSA.
The iteration period bound and the arithmetic complexity depend on
the values of the two coefficients. Since these coefficients can have
different numbers of nonzero bits in an implementation, the results of the
evaluation are presented for each adaptor separately.
Iteration Period Bound for the Transformed Filter Section
In Table 2.5 the contribution to the total iteration period bound, in terms
of TCSA, from the adaptor with the coefficient a1 is shown for the
conventional as well as for the transformed filter section. The table covers
two’s-complement as well as carry-save inputs, with and without an
input VMA. Table 2.6 shows the corresponding results for the adaptor
with the coefficient a2. As for the first-order filter section, the data
representation at the input only affects the iteration period bound of the
conventional structure.
To compute the iteration period bound for a second-order allpass
section, the number of nonzero bits in each of the two coefficients in CSDC
representation must be determined. For example, if a1 has three nonzero
bits and a2 has two nonzero bits and a two’s-complement input to the
filter section is considered, the iteration period bound can be reduced from
5TCSA + 6TCSA = 11TCSA to 4TCSA + 3TCSA = 7TCSA by applying the
proposed transformations. If instead a carry-save input is considered, the
conventional structure yields Tmin = 7TCSA + 6TCSA = 13TCSA while the
transformed structure still yields Tmin = 7TCSA. Hence, a reduction of the
iteration period bound by almost 50% is obtained.
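
The lookup described above can be written down compactly. The sketch below is a minimal illustration with hypothetical names. The lists hold the adaptor contributions from Table 2.5 and Table 2.6 in units of TCSA, and the transformed values are those without an input VMA.

```python
# Contributions in units of TCSA; index 0 is one nonzero bit.
ADAPTOR1_CONV_2C = [2, 4, 5, 6, 7, 7, 8, 8, 8, 9]     # Table 2.5, 2C input
ADAPTOR1_CONV_CS = [4, 6, 7, 8, 9, 9, 10, 10, 10, 11]  # Table 2.5, CSA input
ADAPTOR1_TRANS = [2, 3, 4, 5, 5, 6, 6, 6, 7, 7]        # Table 2.5, no input VMA
ADAPTOR2_CONV = [4, 6, 7, 8, 9, 9, 10, 10, 10, 11]     # Table 2.6
ADAPTOR2_TRANS = [2, 3, 4, 5, 5, 6, 6, 6, 7, 7]        # Table 2.6

def second_order_tmin(bits_a1, bits_a2, carry_save_input=False, transformed=False):
    """Tmin of a second-order section as the sum of both adaptor contributions."""
    if transformed:
        # The input representation does not affect the transformed structure.
        return ADAPTOR1_TRANS[bits_a1 - 1] + ADAPTOR2_TRANS[bits_a2 - 1]
    first = ADAPTOR1_CONV_CS if carry_save_input else ADAPTOR1_CONV_2C
    return first[bits_a1 - 1] + ADAPTOR2_CONV[bits_a2 - 1]

conv_2c = second_order_tmin(3, 2)
conv_cs = second_order_tmin(3, 2, carry_save_input=True)
trans = second_order_tmin(3, 2, transformed=True)
print(conv_2c, conv_cs, trans)
print(round(100.0 * (conv_cs - trans) / conv_cs, 1))
```

For three and two nonzero bits this yields 11, 13, and 7 TCSA, i.e., a reduction of about 46% for the carry-save input.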
Nonzero bits    Conv.     Transformed            Transformed
in coefficient  2C/CSA    (without input VMA)    (with input VMA)
 1              2/4       1 (for a = 0.5)/2      1
 2              4/6       3                      3
 3              5/7       4                      4
 4              6/8       5                      4
 5              7/9       5                      5
 6              7/9       6                      5
 7              8/10      6                      6
 8              8/10      6                      6
 9              8/10      7                      6
10              9/11      7                      7
Table 2.5: Contribution to the iteration period bound from adaptor 1.

Nonzero bits in coefficient  Conventional  Transformed
 1                            4            2
 2                            6            3
 3                            7            4
 4                            8            5
 5                            9            5
 6                            9            6
 7                           10            6
 8                           10            6
 9                           10            7
10                           11            7
Table 2.6: Contribution to the iteration period bound from adaptor 2.

Hardware Requirements for the Transformed Filter Section
As for the first-order section the hardware requirements will increase,
compared to the conventional structure, after transformation. In Table 2.7
and Table 2.8, the number of carry-save adders required for the a1
adaptor with a two’s-complement input as well as a carry-save input are
shown. Table 2.9 shows the number of carry-save adders required for the
a2 adaptor.
Nonzero  Conv.  Transformed
bits            2/3 < a < 1   –1 ≤ a < –2/3,   –2/3 < a < 1/3
                              1/3 < a < 2/3
 1        4     -              4                5
 2        6     6              7                8
 3        8     9             10               11
 4       10    12             13               14
 5       12    15             16               17
 6       14    18             19               20
 7       16    21             22               23
 8       18    24             25               26
 9       20    27             28               29
10       22    30             31               32
Table 2.7: Number of carry-save adders required for adaptor 1 with a
two’s-complement input.
Nonzero  Conv.  Transformed
bits            2/3 < a < 1   –1 ≤ a < –2/3,   –2/3 < a < 1/3
                              1/3 < a < 2/3
 1        6     -              6                8
 2        8     8             10               12
 3       10    12             14               16
 4       12    16             18               20
 5       14    20             22               24
 6       16    24             26               28
 7       18    28             30               32
 8       20    32             34               36
 9       22    36             38               40
10       24    40             42               44
Table 2.8: Number of carry-save adders required for adaptor 1 with a
carry-save input.
For example, if a second-order section with the coefficients
a1 = 0.65625 (decimal) = 0.10101 (CSDC) and a2 = –0.75 (decimal) =
1.01 (CSDC) and a two’s-complement input is considered, the number of
carry-save adders required for the conventional implementation is
8NCSA + 8NCSA = 16NCSA, according to Table 2.7 and Table 2.9. From the
same tables it can be seen that the transformed structure requires
10NCSA + 10NCSA = 20NCSA. For this example the iteration period bound is
reduced by 4TCSA, from 11TCSA to 7TCSA.
Nonzero  Conv.  Transformed
bits            –1 ≤ a < –2/3   2/3 < a < 1,       –1/3 < a < 2/3
                                –2/3 < a < –1/3
 1        6      6               6                  8
 2        8      8              10                 12
 3       10     12              14                 16
 4       12     16              18                 20
 5       14     20              22                 24
 6       16     24              26                 28
 7       18     28              30                 32
 8       20     32              34                 36
 9       22     36              38                 40
10       24     40              42                 44
Table 2.9: Number of carry-save adders required for adaptor 2.

2.2.5 Example
To illustrate the proposed arithmetic transformations of the second-order
allpass section, a ninth-order LWDF is considered. The signal-flow graph
of the example filter is shown in Fig. 2.14. The example filter meets the
specification given in Eq. (2.16) and the magnitude function of the filter is
shown in Fig. 2.15.

$$\begin{aligned}
\omega_c T &= 0.30\pi\ \text{rad} \\
\omega_s T &= 0.35\pi\ \text{rad} \\
A_{\min} &= 60\ \text{dB} \\
A_{\max} &= 0.1\ \text{dB}
\end{aligned} \tag{2.16}$$

The filter coefficients are given in Table 2.10, as real numbers as well as
in CSDC representation. The filter is composed of five allpass sections,
one first-order and four second-order sections. Hence, the iteration
period bound is determined by one of the loops in these filter sections.
The iteration period bounds for each of these allpass sections, before
transformation, are given in Eq. (2.17) through Eq. (2.21)

$$T_{\min,a_0} = 5T_{CSA} \tag{2.17}$$
$$T_{\min,a_1,a_2} = 5T_{CSA} + 8T_{CSA} = 13T_{CSA} \tag{2.18}$$
$$T_{\min,a_3,a_4} = 6T_{CSA} + 8T_{CSA} = 14T_{CSA} \tag{2.19}$$
$$T_{\min,a_5,a_6} = 6T_{CSA} + 8T_{CSA} = 14T_{CSA} \tag{2.20}$$
$$T_{\min,a_7,a_8} = 5T_{CSA} + 7T_{CSA} = 12T_{CSA} \tag{2.21}$$

These equations yield a Tmin for the entire filter of

$$T_{\min} = \max\{5T_{CSA},\ 13T_{CSA},\ 14T_{CSA},\ 14T_{CSA},\ 12T_{CSA}\} = 14T_{CSA} \tag{2.22}$$

If each of the allpass sections is transformed, iteration period
bounds according to Eq. (2.23) through Eq. (2.27) are obtained for each
of the allpass sections. Here no input VMAs are considered in the
transformation.

Figure 2.14: A ninth-order LWDF.
Figure 2.15: Magnitude function of the example LWDF.

Coefficient  Decimal         CSDC
a0            0.609375       0.101001
a1           -0.47265625     0.1000101
a2            0.810546875    1.010100001
a3           -0.671875       1.010101
a4            0.6865234375   1.0101000001
a5           -0.83984375     1.00101001
a6            0.60546875     0.10100101
a7           -0.953125       1.000101
a8            0.5703125      0.1001001
Table 2.10: Coefficients for the example filter.

$$T_{\min,a_0} = 4T_{CSA} \tag{2.23}$$
$$T_{\min,a_1,a_2} = 4T_{CSA} + 5T_{CSA} = 9T_{CSA} \tag{2.24}$$
$$T_{\min,a_3,a_4} = 5T_{CSA} + 5T_{CSA} = 10T_{CSA} \tag{2.25}$$
$$T_{\min,a_5,a_6} = 5T_{CSA} + 5T_{CSA} = 10T_{CSA} \tag{2.26}$$
$$T_{\min,a_7,a_8} = 4T_{CSA} + 4T_{CSA} = 8T_{CSA} \tag{2.27}$$

By comparison of the iteration period bounds for the allpass sections
before and after transformation it can be seen that it is not necessary to
transform the first-order section. However, to obtain the lowest iteration
period possible using the proposed arithmetic transformations, the four
second-order sections must be transformed. After transformation of the
second-order sections, a new Tmin is obtained as

$$T_{\min} = \max\{5T_{CSA},\ 9T_{CSA},\ 10T_{CSA},\ 10T_{CSA},\ 8T_{CSA}\} = 10T_{CSA} \tag{2.28}$$

Now the transformed structure yields Tmin = 10TCSA, while for the
conventional structure Tmin = 14TCSA. Hence, a reduction of the iteration
period bound of about 29% is obtained.
The conventional structure requires 90 carry-save adders while the
transformed structure requires 127. Hence, for the example filter, the
transformed structure requires 37 more carry-save adders, corresponding
to a 41% increase of the arithmetic complexity.
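
The selection of sections to transform in this example can be expressed as a small computation. The sketch below is illustrative, with hypothetical function and variable names. The per-section bounds are copied from Eq. (2.17) through Eq. (2.21) and Eq. (2.23) through Eq. (2.27), and a section is transformed only if its conventional bound exceeds the lowest overall bound that the transformation can reach.

```python
# Loop bounds per allpass section in units of TCSA.
CONVENTIONAL = {"a0": 5, "a1,a2": 13, "a3,a4": 14, "a5,a6": 14, "a7,a8": 12}
TRANSFORMED = {"a0": 4, "a1,a2": 9, "a3,a4": 10, "a5,a6": 10, "a7,a8": 8}

def sections_to_transform(conventional, transformed):
    """Sections whose conventional bound exceeds the best reachable Tmin."""
    target = max(transformed.values())
    return [name for name, t in conventional.items() if t > target]

def filter_tmin(conventional, transformed, chosen):
    """Tmin of the filter when only the chosen sections are transformed."""
    return max(
        transformed[name] if name in chosen else conventional[name]
        for name in conventional
    )

chosen = sections_to_transform(CONVENTIONAL, TRANSFORMED)
tmin_before = max(CONVENTIONAL.values())
tmin_after = filter_tmin(CONVENTIONAL, TRANSFORMED, chosen)
reduction = 100.0 * (tmin_before - tmin_after) / tmin_before

print(chosen)
print(tmin_before, tmin_after, round(reduction, 1))
```

Only the four second-order sections are selected, the first-order section keeps its bound of 5TCSA, and the filter bound drops from 14TCSA to 10TCSA.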
3 A DIGITAL DOWN CONVERTER FOR A RADAR RECEIVER
In this chapter we discuss the design and implementation of a digital
down converter (DDC) for a multiple antenna radar receiver. A funda-
mental operation in many receiver topologies is the I/Q demodulation of
the incoming signal, i.e., translation of the input to a lowpass complex
signal. In the receiver structure considered here, the I/Q demodulation is
performed in the digital domain. Performing the I/Q demodulation in the
digital domain improves the receiver performance, compared to a receiver
where the I/Q demodulation is performed in the analog domain. The DDC
considered here is composed of a Hilbert transform for I/Q demodulation
and a highpass ﬁlter for decimation of the sample rate.
Digital I/Q demodulation requires a DDC with high throughput. Also,
the power consumption of the DDC should be low. Hence, filter structures
that yield high throughput as well as low power consumption have been
considered for the DDC implementation.
A software model of the radar receiver has been developed [17] [70].
Also, a DDC based on FIR filters has previously been implemented
using unconstrained layout, but with more relaxed requirements
on the DDC [31] [69]. The work presented in this chapter has previously
been published in [17] [45] [50].
3.1 Conventional Receiver Structures
In a conventional radar receiver the separation of the signal into inphase
(I) and quadrature (Q) components is performed in the analog domain. An
example of such a structure is shown in Fig. 3.1. For the example receiver,
two modulation stages are required, working at the frequencies LO1 and
LO2, respectively. The first stage, LO1, modulates the signal from fRF
down to fIF. The second stage, LO2, performs the I/Q demodulation. This
stage consists of two mixers working at the same frequency but with a
phase difference of 90 degrees.
The I- and Q-channels should have a phase difference of exactly 90
degrees. Also, the gains of the two paths should be matched. If these two
requirements are not met for all frequencies within the signal bandwidth,
the performance of the receiver is reduced. Another drawback with this
type of receiver structure is that two ADCs are required, one in each chan-
nel. Mismatch between the ADCs will also have an impact on the receiver
performance.
3.2 Considered Radar Receiver Structure
The increased performance of digital signal processing in modern CMOS
processes makes it possible to push the digital signal processing towards
the antenna in the receiver structure. The I/Q demodulation can then be
performed in the digital domain. This will reduce the gain and phase
mismatch between the channels, compared to conventional structures. By
performing the analog to digital conversion at fIF, which is larger than the
Nyquist frequency of the ADC, i.e., utilizing bandpass sampling of the
signal [77], the number of analog modulation stages required can be
reduced to one. An example of such a structure can be seen in Fig. 3.2. This
approach will increase the requirement on the ADC, compared to a
conventional structure, for the same signal bandwidth.

Figure 3.1: Conventional receiver structure with I/Q demodulation.
By selecting the fIF according to

$$f_{IF} = \frac{2N - 1}{4} f_s \tag{3.1}$$

where N is a positive integer and fs is the sample frequency of the ADC,
the wanted frequency band around fIF is modulated down to a frequency
band around fs/4. The final modulation of the signal band down to
baseband is then performed in the DDC.
A multiple antenna radar receiver based on this concept has previously
been designed and implemented at FOI (Totalförsvarets forskningsinstitut)
in Linköping in cooperation with Linköping University [31] [60].

Figure 3.2: Receiver structure with digital I/Q demodulation.

3.3 Receiver Specification
The receiver considered in this thesis works at the X band, i.e., with an
fRF between 8 and 12 GHz. The fIF is 360 MHz and the fs of the ADC is
160 MHz. The relation between the selected fIF and fs corresponds to
2N – 1 = 9, i.e., N = 5, in Eq. (3.1). Hence, the bandpass sampling
requirement is met. The signal bandwidth, B, of the receiver is 36 MHz.
The input sample rate of the DDC is the same as the sample rate of the
ADC, 160 MHz, and the output sample rate of the DDC is 40 MHz in each
channel.
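
The bandpass sampling condition of Eq. (3.1) can be checked numerically for the stated frequencies. The following is a minimal sketch with hypothetical function names. It solves Eq. (3.1) for the odd factor 2N – 1 and computes the frequency to which a real tone at fIF folds after sampling at fs.

```python
FS_HZ = 160_000_000    # ADC sample frequency fs
F_IF_HZ = 360_000_000  # intermediate frequency fIF

def if_frequency(n, fs):
    """Right-hand side of Eq. (3.1)."""
    return (2 * n - 1) * fs / 4

def alias_frequency(f, fs):
    """Frequency in [0, fs/2] to which a real tone at f folds when sampled at fs."""
    remainder = f % fs
    return min(remainder, fs - remainder)

odd_factor = 4 * F_IF_HZ // FS_HZ  # 2N - 1 in Eq. (3.1)
n = (odd_factor + 1) // 2

print(odd_factor, n)
print(if_frequency(n, FS_HZ) == F_IF_HZ)
print(alias_frequency(F_IF_HZ, FS_HZ), FS_HZ // 4)
```

For fs = 160 MHz and fIF = 360 MHz the odd factor 2N – 1 evaluates to 9, i.e., N = 5, and the band around fIF folds to 40 MHz, which equals fs/4.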
3.4 The Digital Down Converter
The structure of the DDC is shown in Fig. 3.3. The DDC is composed of
three parts: a Hilbert transform, a highpass filter, and a decimation by a
factor of two.
The Hilbert transform is used for the I/Q demodulation [64]. The
Hilbert transform introduces a 90 degree phase shift [28] [38]. The
inphase component is obtained by applying a delay on the same signal.
This delay is matched to the delay of the Hilbert transform. As can be
seen in Fig. 3.3 only every second sample is used in the I- and Q-chan-
nels, respectively. The result is that a decimation of the sample rate by a
factor two is obtained in each channel.
For the modulation of the signal down to baseband, highpass filtering and
decimation by another factor of two are performed. Hence, the total
decimation factor of the DDC is four and the output from the DDC is a
lowpass complex signal.

Figure 3.3: Structure of the DDC.

3.4.1 Hilbert Transform
The I/Q demodulation requires that the real input signal is translated to a
complex representation. This can be done using a Hilbert transform. The
Hilbert transform has a frequency response of

$$H(e^{j\omega T}) = \begin{cases} -j, & 0 \le \omega T < \pi \\ j, & -\pi \le \omega T < 0 \end{cases} \tag{3.2}$$

This transfer function is applied to every odd sample of the real input
signal to obtain the Q part of the signal [38] [52]. The real part of the I/Q
signal is obtained by applying a delay, matched to the delay of the Hilbert
transform, to every even sample of the input signal.
A digital filter meeting the Hilbert specification can be designed using
a lowpass filter which is shifted π/2 radians in the frequency domain. This
lowpass filter can be implemented efficiently using halfband FIR filters as
well as BLWDFs. The specification for the lowpass filter forming the
Hilbert transform in the DDC is given in Eq. (3.3)

$$\begin{aligned}
\omega_c T &= 0.225\pi\ \text{rad} \\
\omega_s T &= 0.775\pi\ \text{rad} \\
A_{\min} &= 60\ \text{dB}
\end{aligned} \tag{3.3}$$

where ωcT is the cutoff frequency, ωsT is the stopband frequency, and
Amin is the stopband attenuation. For a halfband filter, as considered here,
the passband ripple depends on the stopband attenuation. For a stopband
attenuation of 60 dB the passband attenuation of the halfband FIR filter as
well as the BLWDF becomes very small, well below the requirements of
the receiver.
Since the Hilbert transform only processes every second sample, it
works at half the input sample rate, i.e., fsHil = 80 MHz.
3.4.2 Highpass Filter
The highpass filter and the decimation by two of the DDC modulate the
signal into a lowpass signal. The specification of the highpass filter is

$$\begin{aligned}
\omega_c T &= 0.45\pi\ \text{rad} \\
\omega_s T &= 0.55\pi\ \text{rad} \\
A_{\min} &= 60\ \text{dB}
\end{aligned} \tag{3.4}$$

As for the Hilbert transform, the highpass ﬁlter can be implemented using
halfband FIR ﬁlters as well as BLWDFs, yielding a very small passband
ripple.
Combining the highpass ﬁlter and the decimation in the implementa-
tion using polyphase structures as discussed in Section 1.5.3 yields that
the required sample rate of this stage is fsHP = 40 MHz.
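The idea from Section 3.4.1 of obtaining the Hilbert transform from a halfband lowpass filter shifted by π/2 can be illustrated with a small sketch. It assumes a Hamming-windowed ideal halfband lowpass filter of an arbitrary short length, which is not the design method or filter order used in the thesis, and it ignores the even/odd sample split of the DDC. All function names are hypothetical. Multiplying tap n by exp(jπn/2) leaves a single real tap (the matched delay) and a purely imaginary, antisymmetric set of taps (the Hilbert transform).

```python
import cmath
import math


def halfband_lowpass(half_len: int) -> list[float]:
    """Hamming-windowed ideal halfband lowpass with taps n = -half_len..half_len."""
    taps = []
    for n in range(-half_len, half_len + 1):
        ideal = 0.5 if n == 0 else math.sin(math.pi * n / 2) / (math.pi * n)
        window = 0.54 + 0.46 * math.cos(math.pi * n / half_len)
        taps.append(ideal * window)
    return taps


def shift_by_quarter_rate(taps: list[float], half_len: int) -> list[complex]:
    """Multiply tap n by exp(j*pi*n/2) = j**n, moving the passband centre to pi/2."""
    rotations = (1 + 0j, 1j, -1 + 0j, -1j)
    return [h * rotations[(k - half_len) % 4] for k, h in enumerate(taps)]


def response(taps: list[complex], half_len: int, omega: float) -> complex:
    """Frequency response sum(h[n] * exp(-j*omega*n)) of the zero-centred filter."""
    return sum(h * cmath.exp(-1j * omega * (k - half_len)) for k, h in enumerate(taps))


HALF_LEN = 7  # illustrative length only
analytic = shift_by_quarter_rate(halfband_lowpass(HALF_LEN), HALF_LEN)
delay_branch = [h.real for h in analytic]    # single centre tap: the matched delay
hilbert_branch = [h.imag for h in analytic]  # antisymmetric taps: the Hilbert part

print(abs(response(analytic, HALF_LEN, math.pi / 2)))   # close to 1 (positive frequencies pass)
print(abs(response(analytic, HALF_LEN, -math.pi / 2)))  # close to 0 (negative frequencies stop)
```

Because only the centre tap survives in the real branch, the I channel reduces to a pure delay, which is consistent with the structure described above.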
3.5 Design of the Digital Down Converter
Three different filter structures were considered for realization of the
DDC. These three cases have been evaluated with respect to their
implementation costs. The purpose of the evaluation is to identify the most
power efficient implementation of the DDC.
For the ﬁrst case, FIR ﬁlters are considered for realization of the
Hilbert transform as well as for the highpass ﬁlter. For the second case a
BLWDF is used for the highpass ﬁlter while the same FIR ﬁlter as in the
ﬁrst case is used for the Hilbert transform. Finally, for the third case,
BLWDFs are used for the Hilbert transform as well as for the highpass ﬁl-
ter.
3.5.1 Mapping of the DDC to Hardware
The data wordlength of the ADC is of the order of 12–14 bits. With a sample
rate of 160 MHz, an input data rate of about 2 Gbit/s is obtained. This is a
high data rate for bit-serial as well as digit-serial arithmetic. Hence,
bit-parallel arithmetic has been considered. To improve the throughput
further, and thereby make power supply voltage scaling possible, carry-save
arithmetic, as discussed in Section 1.6.4, was considered.
Polyphase decomposition has been used to make it possible to perform
all the ﬁltering at a lower sample rate. Hence, the clock frequency
required for the highpass ﬁlter implementation has been reduced which
has a large impact on the power consumption.
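The effect of the polyphase decomposition can be sketched as follows. The example uses a generic FIR filter with arbitrary integer coefficients as a stand-in (an assumption; the thesis filters are halfband FIR filters and BLWDFs) and shows that filtering followed by decimation by two gives the same output as two branch filters that both run at the lower sample rate. The function names are hypothetical.

```python
def fir_filter(x: list[int], h: list[int]) -> list[int]:
    """Causal FIR filtering, y[n] = sum_k h[k] * x[n - k], with zero initial state."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def decimate_direct(x: list[int], h: list[int]) -> list[int]:
    """Filter at the high sample rate, then keep every second output sample."""
    return fir_filter(x, h)[::2]


def decimate_polyphase(x: list[int], h: list[int]) -> list[int]:
    """Same output, but both branch filters operate at the low (output) rate."""
    h_even, h_odd = h[0::2], h[1::2]   # polyphase components of the filter
    x_even = x[0::2]                   # samples x[2m]
    x_odd = [0] + x[1::2]              # samples x[2m - 1], with x[-1] = 0
    y_even = fir_filter(x_even, h_even)
    y_odd = fir_filter(x_odd, h_odd)
    return [a + b for a, b in zip(y_even, y_odd)]


x = list(range(1, 13))     # arbitrary test input
h = [3, -1, 4, 1, -5]      # arbitrary test coefficients
print(decimate_direct(x, h) == decimate_polyphase(x, h))
```

In the direct form every multiplication is performed at the input rate and half of the results are discarded; in the polyphase form the same multiplications are performed once per output sample.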
The multiplications within the FIR filters are implemented using
shift-and-add multipliers. The adders required for these multiplications have
been merged into one large Wallace tree. This yields an FIR filter
implementation which is easy to pipeline and has a short critical path, making it
suitable for high throughput applications.
3.5.2 FIR-FIR Solution
For the ﬁrst DDC case a 16th-order FIR ﬁlter is required for the Hilbert
transform and a 66th-order FIR ﬁlter is required for the highpass ﬁlter.
The ﬁlter coefﬁcients for the two FIR ﬁlters have been designed to obtain
a minimal total number of nonzero bits in the coefﬁcient set using the
method presented in [20]. This design method yields an FIR ﬁlter imple-
mentation with a low arithmetic complexity and a high throughput for ﬁl-
ter implementations using shift-and-add multipliers.
The filter coefficients for the Hilbert transform and the highpass filter
are shown in Table 3.1 and Table 3.2, respectively. These coefficients are
given as real numbers as well as in CSDC representation. Since both filters
are halfband FIR filters, every odd coefficient apart from the centre
coefficient is zero. The zero-valued coefficients have not been included in
the tables. Also, the tables show that the coefficients are symmetric
(highpass filter) or antisymmetric (Hilbert transform) around the centre
coefficient, as is the case for linear-phase FIR filters. This property has
been exploited for reducing the arithmetic complexity of the DDC implementation.
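The number of nonzero digits in a canonic signed-digit representation, which determines the number of adders in a shift-and-add multiplier, can be computed from the decimal coefficient values. The sketch below uses the standard non-adjacent-form recoding; the function name `csd_terms` and the choice of 9 fractional bits are assumptions made for the example, and the coefficient values are the decimal values listed for the FIR Hilbert transform in Table 3.1.

```python
def csd_terms(value: float, frac_bits: int) -> list[tuple[int, int]]:
    """Return (digit, power) pairs, digit in {-1, +1}, with value == sum(digit * 2**power)."""
    n = round(value * 2**frac_bits)
    assert n == value * 2**frac_bits, "value needs more fractional bits"
    terms = []
    power = -frac_bits
    while n != 0:
        if n % 2:                    # odd: emit a nonzero digit
            digit = 2 - (n % 4)      # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            terms.append((digit, power))
            n -= digit               # remainder is now divisible by 4
        n //= 2
        power += 1
    return terms


# Decimal coefficient values as listed in Table 3.1.
hilbert_fir = {
    "a0": -0.001953125,
    "a2": 0.015625,
    "a4": -0.06640625,
    "a6": 0.3046875,
    "a7": 0.50390625,
}
nonzero_digits = {name: len(csd_terms(v, 9)) for name, v in hilbert_fir.items()}
print(nonzero_digits)
print(csd_terms(0.3046875, 9))   # 2**-2 + 2**-4 - 2**-7
```

A coefficient with k nonzero digits needs k - 1 adders or subtractors in a shift-and-add multiplier, which is why minimizing the total number of nonzero digits reduces the arithmetic complexity.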
Coefficient    Decimal          CSDC
a0 = –a14      –0.001953125     0.000000001
a2 = –a12       0.015625        0.000001
a4 = –a10      –0.06640625      0.00010001
a6 = –a8        0.3046875       0.0101001
a7              0.50390625      0.10000001
Table 3.1: Coefficients for the Hilbert transform implemented using an FIR filter.

3.5.3 FIR-BLWDF Solution
For the second case considered for the DDC, the highpass filter is
implemented using an 11th-order BLWDF instead of an FIR filter, while the
Hilbert transform is implemented using the same FIR filter as in the first
case. The highpass filter has been designed using the method given in [16]
and the coefficients have been rounded to 11 bits to meet the given
specification. The coefficients of the highpass BLWDF are shown in Table 3.3,
both as real numbers and in CSDC representation. Here every even coefficient
is zero since a BLWDF is used.
Coefficient    Decimal              CSDC
a0 = a64        0.00048828125       0.00000000001
a2 = a62       –0.0009765625        0.0000000001
a4 = a60        0.00146484375       0.00000000101
a6 = a58       –0.002197265625      0.000000001001
a8 = a56        0.0030517578125     0.0000000101001
a10 = a54      –0.00439453125       0.00000001001
a12 = a52       0.0059814453125     0.0000001010001
a14 = a50      –0.0078125           0.0000001
a16 = a48       0.0106201171875     0.0000010101001
a18 = a46      –0.013671875         0.00000100
a20 = a44       0.017822265625      0.000001001001
a22 = a42      –0.0233154296875     0.0000101000001
a24 = a40       0.031005859375      0.000010000001
a26 = a38      –0.04248046875       0.00010101001
a28 = a36       0.0625              0.0001
a30 = a36      –0.107421875         0.001001001
a32 = a34       0.32763671875       0.01010100001
a33            –0.515625            0.100001
Table 3.2: Coefficients for the highpass FIR filter.

3.5.4 BLWDF-BLWDF Solution
For the third DDC structure, BLWDFs are used for both filter stages. The
Hilbert transform requires a seventh-order filter while the highpass filter
is the same 11th-order filter as used in case two. The Hilbert transform
has been designed using the same design method as used for the highpass
filter. Here the coefficients have been rounded to 9 bits to meet the
specification. The coefficients for the BLWDF Hilbert transform are shown in
Table 3.4.
3.6 Evaluation of the DDC Structures
From the ﬁlter design, the hardware requirements of the three different
DDC realizations can be estimated.
3.6.1 Number of Arithmetic Operations
One estimate of the hardware requirements is the number of high level
arithmetic operations, i.e., additions and multiplications, required for
each implementation. Here the arithmetic of the filter implementations
is not considered. Instead the number of operations required is computed
from the signal-flow graph. This gives a rough estimate of the relative
power consumption between the three cases.

Coefficient    Decimal            CSDC
a1             –0.0830078125      0.0001010101
a3             –0.28564453125     0.01001001
a5             –0.51953125        0.10000101
a7             –0.72900390625     1.010010101
a9             –0.91015625        1.00101001
Table 3.3: Coefficients for the highpass BLWDF.

Coefficient    Decimal            CSDC
a1             –0.06640625        0.00010001
a3             –0.2734375         0.0100101
a5             –0.673828125       1.010101
Table 3.4: Coefficients for the Hilbert transform implemented using a BLWDF.
The number of additions and multiplications required for each of the
three cases is shown in Table 3.5. These results imply that the FIR-FIR
solution yields a larger arithmetic complexity, in terms of additions and
multiplications, than the other two cases. This is due to the high
filter order required for the highpass FIR filter, compared to the BLWDF.
The highpass filter has a narrow transition band, which increases the
complexity of the FIR filter compared to a BLWDF realization.
The number of arithmetic operations required is, however, only a
rough estimate of the arithmetic complexity. As discussed in Chapter 1,
the complexity of a constant multiplication depends on the coefficient
value as well as on the hardware architecture that the operation is mapped
to.
A better estimate of the hardware complexity of a filter realization
would include the hardware required for the constant multipliers. As
can be seen from the filter coefficients given in the tables, several of the
FIR filter coefficients require only one or two nonzero bits. Constant
multipliers for such coefficients can be implemented efficiently using, for
example, shift-and-add multipliers. Such effects are not included in the
comparison shown in Table 3.5.
Another parameter that affects the complexity of the filter implementation
is the data storage required. The memory requirements differ
significantly between an FIR filter and a BLWDF. Hence, this will have an
impact on the power consumption of the DDC implementation as well.

Case              Additions    Multiplications
FIR-FIR           41           73
FIR-BLWDF         15           39
BLWDF-BLWDF       13           41
Table 3.5: Number of arithmetic operations required for the three cases.
3.6.2 Number of Adders and Memory Elements
To obtain a more accurate estimate of the hardware requirements for a
fixed implementation of each of the three cases, the structure of the
arithmetic operations is considered. As discussed in Section 3.5.1, carry-save
arithmetic has been considered for all three cases. Also, in all three cases
two's-complement inputs and outputs are required and, hence, VMAs are
required as well.
In Table 3.6 the number of CSAs, CPAs, and memory elements, i.e.,
registers, required for implementation of each of the three cases is
shown. For the FIR filters, CPAs have been used at the inputs of the
Wallace tree to utilize the symmetry of the coefficients in order to reduce the
complexity of the adder tree. Also, CPAs, used as VMAs, have been
applied at the outputs of the FIR filters as well as at the outputs of the
allpass sections of the BLWDFs.
The FIR-FIR case requires a somewhat lower number of carry-save
adders than the two cases including BLWDFs. However, the
number of CPAs and memory elements required is significantly larger.
Between the two cases including BLWDFs, the most significant difference
is that the BLWDF-BLWDF solution requires less than 50% of the
number of memory elements required for the FIR-BLWDF solution.
Case              CSA     CPA     Memory Elements
FIR-FIR           84      39      108
FIR-BLWDF         102     15      30
BLWDF-BLWDF       122     13      13
Table 3.6: Number of adders and memory elements.

The energy consumed during one switching of a full-adder and of a D
flip-flop in a CMOS standard cell design is approximately the same [2]. If
the data wordlengths of the CSAs, CPAs, and the memory elements are
assumed to be equal, the power consumption of an adder and of a memory
element can be assumed to be equal, provided that the memory elements are
implemented using D flip-flops. Hence, to estimate the relative power
consumption between the three cases the total number of CSAs, CPAs, and
memory elements can be used. The result is that the cases require a total of
231, 147, and 148 adders and registers for the FIR-FIR,
FIR-BLWDF, and BLWDF-BLWDF cases, respectively. These numbers
imply that the FIR-FIR case consumes significantly more power
than the other two cases.
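The totals quoted above follow directly from Table 3.6. The short sketch below reproduces them and expresses each case relative to the FIR-FIR case, under the stated assumption that an adder and a memory element of equal wordlength consume the same power. The dictionary layout and variable names are illustrative only.

```python
# Counts taken from Table 3.6.
COUNTS = {
    "FIR-FIR": {"CSA": 84, "CPA": 39, "memory": 108},
    "FIR-BLWDF": {"CSA": 102, "CPA": 15, "memory": 30},
    "BLWDF-BLWDF": {"CSA": 122, "CPA": 13, "memory": 13},
}

# Total number of adders and registers per case.
totals = {case: sum(parts.values()) for case, parts in COUNTS.items()}

# First-order relative power estimate, normalized to the FIR-FIR case.
relative = {case: total / totals["FIR-FIR"] for case, total in totals.items()}

# Memory elements of the BLWDF-BLWDF case relative to the FIR-BLWDF case.
memory_ratio = COUNTS["BLWDF-BLWDF"]["memory"] / COUNTS["FIR-BLWDF"]["memory"]

print(totals)
print({case: round(value, 2) for case, value in relative.items()})
print(round(memory_ratio, 2))
```

Under this first-order model the two cases that include a BLWDF come out at roughly two thirds of the FIR-FIR estimate.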
These results are, however, still rough estimates since the data
wordlengths will vary within the structures. Also, the CPAs may require
additional gates, apart from the full-adders, for carry acceleration to obtain the
required throughput. Another factor that is not considered here is the
power consumed for charging and discharging the wires of the final
implementation.
Hence, it is difficult to identify a hardware-efficient, low-power solution
for the DDC from high-level design considerations alone. Detailed knowledge
of the final implementation is required. Therefore, for further evaluation of
the three cases, implementations on silicon are considered.
3.7 Implementation of the DDC
The three different DDC cases have been implemented in structural
VHDL which has been mapped to hardware using logic synthesis. To
obtain the required throughput, pipelining has been introduced in the FIR
filters. Also, pipeline registers have been introduced in the BLWDF to
reach the iteration period bound. These pipeline registers were not
included in the evaluations given above but will, of course, affect the
performance of the DDC implementations.
The VHDL code for each of the three DDC structures has been
mapped to a standard cell library in a 0.18 µm CMOS process, provided
by STMicroelectronics, through logic synthesis using Design Compiler
from Synopsys. The area requirement and the maximal sample frequency
for each of the DDC structures are given in Table 3.7. These results are
from the logic synthesis and the maximal sample frequencies are given
for a power supply voltage of 1.55 V.
The requirements on the sample rates of the two filter stages were given
in Section 3.4 as fsHil,Req = 80 MHz and fsHP,Req = 40 MHz. From Table
3.7 it can be seen that all three cases meet the sample rate requirements.
It can also be seen that the FIR-FIR and the FIR-BLWDF cases have large
margins for power supply voltage scaling that can be used to reduce the
power consumption.

Case              Area (mm2)    fsHil,max (MHz)    fsHP,max (MHz)
FIR-FIR           0.44          128                81
FIR-BLWDF         0.33          128                75
BLWDF-BLWDF       0.40          86                 75
Table 3.7: Implementation data for the DDC structures.

The most area and speed efficient solution among the three considered
DDC structures is the FIR-BLWDF case. This structure yields the lowest
chip area as well as a large margin for power supply voltage scaling.
Hence, to obtain a low power implementation of a DDC meeting the
given specification, the FIR-BLWDF structure was identified as the most
suitable candidate.
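The margin available for power supply voltage scaling can be read from Table 3.7 as the ratio between the maximal and the required sample frequency of each stage. The sketch below computes this ratio for both stages and takes the smaller one as the margin of the whole DDC. Using the smaller ratio as a single figure of merit is an assumption made for this example, not a measure defined in the thesis, and the names are hypothetical.

```python
# Required sample rates of the two stages (MHz).
REQUIRED_MHZ = {"hilbert": 80.0, "highpass": 40.0}

# Maximal sample rates from Table 3.7 (MHz), at a supply voltage of 1.55 V.
MAX_MHZ = {
    "FIR-FIR": {"hilbert": 128.0, "highpass": 81.0},
    "FIR-BLWDF": {"hilbert": 128.0, "highpass": 75.0},
    "BLWDF-BLWDF": {"hilbert": 86.0, "highpass": 75.0},
}


def speed_margin(case: str) -> float:
    """Smallest ratio of maximal to required sample rate over the two stages."""
    return min(MAX_MHZ[case][stage] / REQUIRED_MHZ[stage] for stage in REQUIRED_MHZ)


margins = {case: speed_margin(case) for case in MAX_MHZ}
print(margins)
```

With these numbers the Hilbert stage limits all three cases, and the BLWDF-BLWDF case has only a few percent of speed to trade for a lower supply voltage.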
4 A COMBINED INTERPOLATOR AND DECIMATOR FOR AN OFDM SYSTEM
In this chapter the design and implementation of a combined interpolator
and decimator for use in an OFDM system is discussed. OFDM is a
multicarrier based communication method which is, for example, used in the
WLAN standards HIPERLAN and IEEE 802.11a and g. Such systems are
often aimed at handheld, battery driven devices. Hence, low power con-
sumption as well as high throughput are required.
The communication system considered here is an OFDM based radio
modem with a capacity of 20 Mbit/s. To improve the performance of the
ADCs and the DACs of the system, the converters work at a sample rate
higher than the Nyquist rate. In this case an oversampling factor of 2 is
used. Hence, interpolation and decimation by a factor of two are required
in the transmitter as well as in the receiver. The interpolator and the
decimator are implemented on the same chip and the functionality of the chip
is selected by an external control signal.
For the implementation of the combined interpolator and decimator,
novel filter structures have been proposed and evaluated with respect to
arithmetic complexity and throughput. One of the considered filter
structures was selected and has been implemented in a
0.35 µm CMOS process.
The work presented in this chapter has previously been published in
[44] [46].
4.1 Design of the Digital Filters
The specification of the interpolation and decimation filters is given in Eq.
(4.1),

$$\begin{aligned} \omega_c T &= 0.41\pi \text{ rad} \\ \omega_s T &= 0.48\pi \text{ rad} \\ A_{max} &< 1 \text{ dB} \\ A_{min} &> 65 \text{ dB} \\ f_{s,lower} &= 25.6 \text{ MHz} \\ f_{s,higher} &= 51.2 \text{ MHz} \end{aligned} \qquad (4.1)$$

where ωcT is the cutoff frequency, ωsT is the stopband frequency, Amax is
the maximal passband attenuation, Amin is the minimal stopband attenuation,
and fs,higher and fs,lower are the higher and lower sample frequencies,
respectively.
Interpolation and decimation by a factor of two can be performed
efficiently using halfband filters, as previously discussed. However, halfband
filters can only be used if the cutoff and stopband frequencies are
placed symmetrically around π/2. This is not the case for the filter
specification given in Eq. (4.1). Hence, other filter structures must be
considered.
4.1.1 Narrow-band Frequency Masking Filters
A method for implementing hardware-efficient digital filters
is the frequency masking technique. This technique can be divided into
narrow-band and wide-band frequency masking. The former is used for
filters with a cutoff frequency less than 0.5π, while the latter is used for
filters with a cutoff frequency larger than 0.5π. The combined interpolator
and decimator has a cutoff frequency of 0.41π. Hence, narrow-band
frequency masking techniques can be used.
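The two conditions used in this reasoning, symmetry of the band edges around π/2 for halfband filters and a cutoff frequency below 0.5π for narrow-band frequency masking, can be checked with a few lines of code. Frequencies are given as fractions of π; the function names and the tolerance are assumptions made for this example.

```python
def is_halfband_compatible(wc_over_pi: float, ws_over_pi: float, tol: float = 1e-9) -> bool:
    """True if cutoff and stopband edges are placed symmetrically around pi/2."""
    return abs((wc_over_pi + ws_over_pi) - 1.0) < tol


def masking_type(wc_over_pi: float) -> str:
    """Classify by cutoff frequency relative to 0.5*pi, as described in the text."""
    return "narrow-band" if wc_over_pi < 0.5 else "wide-band"


print(is_halfband_compatible(0.225, 0.775))  # lowpass prototype of the Hilbert transform, Eq. (3.3)
print(is_halfband_compatible(0.45, 0.55))    # highpass filter specification, Eq. (3.4)
print(is_halfband_compatible(0.41, 0.48))    # interpolator/decimator specification, Eq. (4.1)
print(masking_type(0.41))
```

The two DDC specifications from Chapter 3 satisfy the symmetry condition, whereas the band edges 0.41π and 0.48π are centred at 0.445π and therefore do not.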
Frequency masking was originally proposed for reduction of the
arithmetic complexity of FIR filters [41] [67]. It has also been proposed as a
method for increasing the throughput of recursive filters. In [10] a
recursive filter was used as model filter while the masking filters were
FIR filters. The resulting filter structure yields a reduced iteration period
bound, compared to a direct, recursive implementation. In [30] it was
shown that it is possible to implement the masking filters using recursive
structures as well and still obtain a reduced iteration period bound.
A narrow-band frequency masking filter is composed of a periodic
model filter, G(z^M), and one or several masking filters, F(z). The transfer
function for such a filter is given as

$$H(z) = G(z^M)F(z) \qquad (4.2)$$

and the corresponding filter structure is shown in Fig. 4.1. For the filter
considered here we will only use recursive filters, more specifically
LWDFs and BLWDFs, for the model filter and for the masking filter.
An alternative to using the frequency masking technique for
implementing the filter considered here would be to use an LWDF meeting the
given specification. However, then all filtering takes place at the higher
sample frequency for interpolation as well as for decimation. This is not
an efficient solution with respect to power consumption. Instead, if a
frequency masking filter is used, a BLWDF can be used as masking filter,
making it possible to perform all filtering at the lower sample frequency
using polyphase decomposition.

Figure 4.1: Narrow-band frequency masking filter structure. [Block diagram: x(n) passes through G(z^M) and then F(z) to give y(n).]
4.1.2 Efficient Implementation of Cascaded Multirate Filters
If several LWDFs are cascaded, the arithmetic complexity of the
implementation as well as the iteration period bound can be reduced, compared to a
straightforward filter implementation [75].
For implementation of decimation and interpolation filters for sample
rate changes by factors of two, halfband IIR filters, such as the BLWDF,
have been shown to be efficient. However, when such filters are cascaded
in order to reduce coefficient wordlengths, not all filtering can be
performed at the lower sample rate, which reduces the efficiency of the filter
implementation. A solution to this was proposed in [29]: a novel filter
structure for implementation of cascaded half-band IIR filters in which
all filtering can be performed at the lower sample rate. This
method also yields a reduction of the hardware required.
The novel, cascaded filter structure is derived by restating the transfer
function of several identical half-band filters in cascade into one
polyphase form. If we consider M cascaded identical half-band filters we
obtain the transfer function

$$H_{casc}(z) = \left[A_0(z^2) + z^{-1}A_1(z^2)\right]^M = H_0(z^2) + z^{-1}H_1(z^2) \qquad (4.3)$$

where A0(z) and A1(z) are the allpass filters from the original half-band
filter and H0(z) and H1(z) are the branch filters forming the new polyphase
filter, Hcasc(z). The new branch filters H0(z) and H1(z) are derived from
Eq. (4.3) as

$$H_0(z) = \sum_{i=0}^{K_0} c_{2i}\, z^{-i} \left[A_1(z)\right]^{2i} \left[A_0(z)\right]^{M-2i} \qquad (4.4)$$

$$H_1(z) = \sum_{i=0}^{K_1} c_{2i+1}\, z^{-i} \left[A_1(z)\right]^{2i+1} \left[A_0(z)\right]^{M-1-2i} \qquad (4.5)$$

where

$$c_i = \binom{M}{i}, \quad 0 \le i \le M \qquad (4.6)$$

and

$$K_0 = K_1 = \frac{M-1}{2} \qquad (4.7)$$

for odd M and

$$K_0 = \frac{M}{2} \qquad (4.8)$$

$$K_1 = K_0 - 1 \qquad (4.9)$$

for even M.
For M = 2, corresponding to two cascaded filter stages, the branch filters
forming the overall transfer function become

$$H_0(z) = A_0^2(z) + z^{-1}A_1^2(z) \qquad (4.10)$$

$$H_1(z) = 2A_0(z)A_1(z) \qquad (4.11)$$

The signal-flow graph of the resulting filter structure, in this case an
interpolation structure, for M = 2 is shown in Fig. 4.2.

Figure 4.2: Signal-flow graph for two cascaded BLWDFs. [Signal-flow graph: input x(n), allpass sections A0(z) and A1(z), a delay element T, and output y(m) after interpolation by 2.]
The resulting interpolator structure yields a reduced arithmetic com-
plexity of the ﬁlter implementation since resources can be shared between
the allpass sections. Also, a reduced iteration period bound is obtained
due to the shorter coefﬁcient wordlengths required for the cascaded ﬁl-
ters, compared to a straightforward implementation.
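The restatement in Eq. (4.3) to Eq. (4.9) is a binomial expansion and can be checked with polynomial arithmetic. Because the identity is purely algebraic, the sketch below uses short integer polynomials in z^-1 as stand-ins for the allpass functions A0(z) and A1(z); this is an assumption made so that the check is exact, and all function names are hypothetical.

```python
from math import comb


def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def poly_add(a, b):
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) for i in range(n)]


def poly_pow(a, k):
    out = [1]
    for _ in range(k):
        out = poly_mul(out, a)
    return out


def upsample2(a):
    """Coefficients of A(z^2) given the coefficients of A(z)."""
    out = [0] * (2 * len(a) - 1)
    out[::2] = a
    return out


def delay(a, d):
    """Multiply by z^-d."""
    return [0] * d + list(a)


def trim(a):
    a = list(a)
    while len(a) > 1 and a[-1] == 0:
        a.pop()
    return a


def cascade_direct(a0, a1, m):
    """[A0(z^2) + z^-1 A1(z^2)]^M, the left-hand side of Eq. (4.3)."""
    return poly_pow(poly_add(upsample2(a0), delay(upsample2(a1), 1)), m)


def cascade_polyphase(a0, a1, m):
    """H0(z) and H1(z) according to Eq. (4.4) to Eq. (4.9)."""
    h0, h1 = [0], [0]
    for k in range(m + 1):
        term = poly_mul(poly_pow(a1, k), poly_pow(a0, m - k))
        term = delay([comb(m, k) * t for t in term], k // 2)   # z^-i with i = k // 2
        if k % 2 == 0:
            h0 = poly_add(h0, term)   # even k = 2i contributes to H0
        else:
            h1 = poly_add(h1, term)   # odd k = 2i + 1 contributes to H1
    return trim(h0), trim(h1)


a0, a1 = [1, 2], [3, -1, 2]   # arbitrary stand-ins for A0(z) and A1(z)
h0, h1 = cascade_polyphase(a0, a1, 2)
recombined = poly_add(upsample2(h0), delay(upsample2(h1), 1))
print(trim(recombined) == trim(cascade_direct(a0, a1, 2)))
```

For M = 2 the two branches reduce to Eq. (4.10) and Eq. (4.11), so only products of A0(z) and A1(z) evaluated at the lower sample rate are needed.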
4.2 Considered Filter Structures
Four ﬁlters meeting the ﬁlter speciﬁcation given in Section 4.1 were
designed using recursive ﬁlters and the narrow-band frequency masking
technique. As model ﬁlter an LWDF is used while for the masking ﬁlter a
BLWDF is used as proposed in [30].
The first filter structure considered, shown in Fig. 4.3, is composed of
a fifth-order LWDF for the model filter and a ninth-order BLWDF for the
masking filter. For the second structure, shown in Fig. 4.4, the model filter
is composed of two third-order LWDFs in cascade. The masking filter is
the same as for the ﬁrst structure. For the third and fourth structures, the
same model ﬁlter as for the second structure is used while the masking ﬁl-
ter is exchanged for two and three BLWDFs in cascade. These two cases
are shown in Fig. 4.5 and Fig. 4.6, respectively. The masking ﬁlters of
case three and four have been reordered using the method discussed in
Section 4.1.2.
The coefﬁcients for the four ﬁlter structures are shown in Table 4.1 as
real numbers and Table 4.2 shows the coefﬁcients in CSDC representa-
tion.
4.3 Evaluation of the Filter Structures
For the comparison of the structures the throughput and the arithmetic
complexity of the four cases have been evaluated. Only the interpolation
structure has been considered in the evaluation since the corresponding
decimation structure has a similar arithmetic complexity. To meet the
requirement on throughput, bit-parallel arithmetic was considered for the
evaluation.
The purpose of the evaluation is to find a structure that meets the
requirements on throughput and yields a low power consumption.
To obtain a low power consumption, a filter structure yielding a low
arithmetic complexity as well as a low iteration period bound should be
selected.

Figure 4.3: Filter structure considered for case 1. [Signal-flow graph with adaptor coefficients a0–a8, delay elements T, input x(n), and output y(m).]
Figure 4.4: Filter structure considered for case 2. [Signal-flow graph with two identical sections with adaptor coefficients a0–a2, adaptor coefficients a3–a6, delay elements T, input x(n), and output y(m).]
4.3.1 Arithmetic Complexity and Throughput
In Table 4.3 the arithmetic complexity as well as the iteration period
bound, expressed in TCPA, for the four structures when mapped to
bit-parallel, carry-propagation adders are shown. The number of carry-propagation
adders required includes the adders required for the constant
multipliers in the adaptors.

Figure 4.5: Filter structure considered for case 3. [Signal-flow graph with two identical sections with adaptor coefficients a0–a2, adaptor coefficients a3 and a4, delay elements T, input x(n), and output y(m).]
Figure 4.6: Filter structure considered for case 4. [Signal-flow graph with two identical sections with adaptor coefficients a0–a2, adaptor coefficients a3 and a4, delay elements T, input x(n), and output y(m).]
As can be seen, the number of CPAs required is of the same order for
all four cases, with case three requiring the lowest number of CPAs. Also,
the iteration period bound is the same for cases two, three, and four. For
these three cases the iteration period bound is determined by the
second-order allpass sections of the model filter.

Coeffs    Case 1              Case 2              Case 3      Case 4
a0        –0.0498046875       –0.4375             –0.4375     –0.4375
a1        –0.6044921875       –0.75               –0.75       –0.75
a2        –0.6884765625       –0.875              –0.875      –0.875
a3        –0.9189453125       –0.096923828125     –0.28125    –0.25
a4        –0.8310546875       –0.330810546875     –0.75       –0.75
a5        –0.091552734375     –0.600341796875
a6        –0.307373046875     –0.862060546875
a7        –0.562744140625
a8        –0.840576171875
Table 4.1: Adaptor coefficients for the four cases.

Coeffs    Case 1            Case 2            Case 3     Case 4
a0        0.0001010101      0.1001            0.1001     0.1001
a1        0.1010010101      1.01              1.01       1.01
a2        1.0101000001      1.001             1.001      1.001
a3        1.0001010101      0.001010001001    0.01001    0.01
a4        1.01010100101     0.010100010101    1.01       1.01
a5        0.001010001001    0.100100000001
a6        0.010100010101    1.001010010101
a7        0.100100000001
a8        1.001010010101
Table 4.2: Adaptor coefficients in CSDC representation for the four cases.
As previously discussed, carry-propagation based operations are not
suitable for implementation of recursive filters that require a high
throughput. For such applications bit-parallel, carry-save arithmetic can
be used instead. For the CMOS process considered here, a 0.35 µm process,
a bit-parallel implementation based on carry-propagation adders did
not meet the requirements on the throughput. Hence, a carry-save
implementation was considered.
In Table 4.4 the number of carry-save adders required for the four
cases is shown. The table also shows the iteration period bound expressed
in TCSA. The difference in the number of adders required between the four
cases is larger for the carry-save implementations than for the
carry-propagation implementations. The reason for this is that the multipliers in the
carry-save implementation require a larger number of adders than the
multipliers in the carry-propagation structure. As before, the iteration
period bound is the same for cases two, three, and four, and is determined
by the second-order allpass sections of the model filter.
For the carry-save implementation, cases three and four yield the
lowest number of CSAs as well as the lowest iteration period bound. Hence,
these two structures are the most suitable for implementation.

Case    NCPA    Tmin expressed in TCPA
1       61      10
2       52      6
3       47      6
4       53      6
Table 4.3: Design data for carry-propagation implementation of the four cases.

Case    NCSA    Tmin expressed in TCSA
1       114     16
2       92      10
3       70      10
4       74      10
Table 4.4: Design data for carry-save implementation of the four cases.
4.3.2 Internal Data Wordlengths and Scaling
Another factor that affects the complexity of the ﬁlter implementation is
the internal quantization noise. The amount of quantization noise gener-
ated within the ﬁlter depends on the internal data wordlength. If the inter-
nal data wordlength is increased, the quantization noise is reduced, while
a reduced data wordlength yields increased quantization noise.
As a reference level for the amount of internal quantization noise that
can be tolerated, the quantization noise in the input signal is used. The
internally generated quantization noise is allowed to be at most as large
as this reference level.
To avoid overflows of the number range, the signal levels in the filter
must be scaled [75]. There are several strategies for computing the scaling
coefficients for a filter implementation, such as safe scaling and Lp-norm
scaling. Safe scaling guarantees that no overflows occur for any
kind of signal. This is, however, not always an efficient scaling method.
By selecting an Lp-norm which is suitable for the input signal of the
considered application, a more efficient scaling is obtained. However, with
Lp-norm scaling overflows are not completely avoided and parasitic
oscillations may occur in recursive structures. WDFs have the property that they
can suppress parasitic oscillations, which makes them suitable for scaling
using Lp-norms.
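The difference between the two scaling strategies can be illustrated on a single node. In the common textbook formulation, safe scaling bounds the signal at a node by the sum of the absolute values of the impulse response from the input to that node, while L2-norm scaling uses the square root of the sum of the squared values. The sketch below applies both to a first-order recursive section with a hypothetical coefficient of 0.75; neither the section nor the coefficient is taken from the filters in this chapter, and the function names are illustrative.

```python
import math


def first_order_impulse_response(a: float, length: int) -> list[float]:
    """Impulse response h[n] = a**n of the recursion y[n] = x[n] + a * y[n - 1]."""
    return [a**n for n in range(length)]


def safe_scaling_norm(h: list[float]) -> float:
    """Sum of absolute values: worst-case gain for any bounded input."""
    return sum(abs(v) for v in h)


def l2_norm(h: list[float]) -> float:
    """Root of the sum of squares: gain in the L2 sense."""
    return math.sqrt(sum(v * v for v in h))


h = first_order_impulse_response(0.75, 200)   # long enough for the tail to vanish
worst_case = safe_scaling_norm(h)             # approaches 1 / (1 - 0.75)
l2 = l2_norm(h)                               # approaches sqrt(1 / (1 - 0.75**2))

print(round(worst_case, 6), round(l2, 6))
print(round(1 / worst_case, 6), round(1 / l2, 6))   # corresponding scaling factors
```

Since the L2 norm never exceeds the worst-case norm, L2-norm scaling attenuates the signal less and leaves more of the wordlength for the signal, at the price of possible overflows.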
In Table 4.5 the required internal data wordlengths for the four cases
are shown. Cases two, three, and four require the same number of extra
internal data bits while case one requires a slightly larger number of bits,
when safe scaling is considered. Since case three and case four have a
lower arithmetic complexity, these two cases were evaluated using
L2-norm scaling as well. As can be seen, L2-norm scaling yields a further
reduction of the required internal data wordlength.
4.3.3 Summary of the Evaluation
The result of the evaluation is that the ﬁlter structure considered in case
three requires the lowest number of CSAs, the shortest internal data word-
length, and the lowest iteration period bound. Hence, case three was
selected for implementation.
4.3.4 Combining Interpolation and Decimation Filters
By applying the transposition theorem on the interpolation structure, a
decimation structure meeting the same ﬁlter speciﬁcation is obtained
[75]. The transposed interpolation structure, i.e., the corresponding deci-
mation structure, is shown in Fig. 4.7. By comparing Fig. 4.5 and Fig. 4.7
it can be seen that the allpass sections required for the interpolation ﬁlter
and for the decimation ﬁlter are the same. Hence, these resources can be
shared between the two structures and both ﬁlters can be combined on the
same chip. The ﬁlter chip can then be conﬁgured as either an interpolation
ﬁlter or a decimation ﬁlter using an external control signal.
Case                Extra Internal Bits
1, safe scaling     8
2, safe scaling     7
3, safe scaling     7
3, L2 scaling       5
4, safe scaling     7
4, L2 scaling       5
Table 4.5: Required data wordlengths.

4.4 Implementation of the Combined Interpolator and Decimator Structure
The filter was implemented in structural VHDL and mapped to a physical
layout using logic synthesis. The standard cell library used was a five
metal layer, 0.35 µm CMOS process from Alcatel Mietec. The filter
implementation is shown in Fig. 4.8.
The functionality of the implemented filter has been verified and the
requirement on throughput was met at the specified power supply voltage.
The power consumption of the implementation using random input data
at the specified sample rate was measured to be about 130 mW at a power
supply voltage of 3 V. The results of the implementation are summarized
in Table 4.6.

Figure 4.7: Decimation structure derived using the transposition theorem. [Signal-flow graph with two identical sections with adaptor coefficients a0–a2, adaptor coefficients a3 and a4, delay elements T, decimation by 2, input x(m), and output y(m).]
Figure 4.8: Chip photo of the implemented filter.
4.5 Comparison Between a WDF and an FIR Implementation
In [37] two FIR filters, a transposed direct form structure and a polyphase
structure, meeting the same filter specification as the WDFs discussed
above were designed and implemented using logic synthesis from a
VHDL description. These filters were mapped to a standard cell library in
a three metal layer 0.8 µm CMOS process. Hence, the FIR filter
implementations cannot be compared directly with the WDF implementation
discussed above. In Table 4.7, the area requirement and the power
consumption of each of the three filter implementations are shown.
4.6 Arithmetic Transformation of the Combined Interpolator and Decimator Structure
The throughput of the combined interpolator and decimator can be
increased using the arithmetic transformations discussed in Chapter 2. In
this section the proposed arithmetic transformations are applied to the
implemented filter structure.
Supply Voltage                            3 V
Input/Output Data Wordlength              12 bit
Internal Data Wordlength                  17 bit
Sample Frequency (fs,higher, fs,lower)    51.2 MHz, 25.6 MHz
Number of Standard Cells                  8096
Core Area                                 1.49 mm²
Power Consumption                         130 mW
Table 4.6: Implementation data.
For the combined interpolator and decimator implementation the critical
loop belongs to the second-order allpass sections of the model filter.
The two second-order sections are identical and have the adaptor
coefficients $a_1 = -0.75_{10} = \bar{1}.01_{\mathrm{CSDC}}$ and
$a_2 = -0.875 = \bar{1}.001_{\mathrm{CSDC}}$, where $\bar{1}$ denotes the
digit $-1$. Both coefficients have two nonzero bits and they contribute
to the sample period bound with $4T_{\mathrm{CSA}}$ and
$6T_{\mathrm{CSA}}$, respectively, according to Table 2.5 and Table 2.6.
Hence, the resulting iteration period bound for the implemented filter is
$T_{\min} = 4T_{\mathrm{CSA}} + 6T_{\mathrm{CSA}} = 10T_{\mathrm{CSA}}$.
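The coefficient values and the bound above can be checked numerically.
The following Python sketch decodes a signed-digit string into its value
and sums the per-adaptor contributions. The helper name csd_value and the
digit-list encoding are assumptions made for illustration. The leading
digit of each coefficient is read as -1, since this is the reading that
reproduces the stated values -0.75 and -0.875, and the contributions of 4
and 6 carry-save adder delays are copied from the text (Table 2.5 and
Table 2.6), not derived.

```python
from fractions import Fraction


def csd_value(digits, int_digits=1):
    """Return the exact value of a signed-digit number.

    digits: sequence of -1, 0 or 1, most significant digit first.
    int_digits: number of leading digits left of the binary point.
    """
    value = Fraction(0)
    for i, d in enumerate(digits):
        weight = Fraction(2) ** (int_digits - 1 - i)  # 2**0, 2**-1, 2**-2, ...
        value += d * weight
    return value


# Digit strings from the text, leading digit taken as -1 (assumption).
A1_DIGITS = [-1, 0, 1]      # -1 + 1/4
A2_DIGITS = [-1, 0, 0, 1]   # -1 + 1/8

a1 = csd_value(A1_DIGITS)
a2 = csd_value(A2_DIGITS)

# Number of nonzero digits in each coefficient.
nonzero_a1 = sum(d != 0 for d in A1_DIGITS)
nonzero_a2 = sum(d != 0 for d in A2_DIGITS)

# Contributions to the bound in units of T_CSA, as quoted in the text.
CONTRIBUTION = {"a1": 4, "a2": 6}
t_min = sum(CONTRIBUTION.values())

print(float(a1), float(a2))    # -0.75 -0.875
print(nonzero_a1, nonzero_a2)  # 2 2
print(t_min)                   # 10
```

Exact rational arithmetic is used so that the decoded values can be
compared with the stated coefficients without rounding concerns.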
The proposed transformation was applied to the two second-order
sections. According to Table 2.5 and Table 2.6, the iteration period bound
of the transformed section is
$T_{\min} = 3T_{\mathrm{CSA}} + 3T_{\mathrm{CSA}} = 6T_{\mathrm{CSA}}$,
where the two $3T_{\mathrm{CSA}}$ terms are the contributions to the
sample period bound from the a1-adaptor and the a2-adaptor, respectively.
The next step of the transformation process is to identify the new
critical loop. For this filter the second-order sections still yield
$T_{\min}$, so no further transformations are required. Hence, a
reduction of $T_{\min}$ of 40%, from $10T_{\mathrm{CSA}}$ to
$6T_{\mathrm{CSA}}$, can be expected for the transformed structure
compared to the conventional implementation.
The transformation results in an increase in the number of carry-save
adders. According to Table 4.4 the original implementation requires
$70N_{\mathrm{CSA}}$. After transformation the number of carry-save
adders required for each second-order section is increased by one, from
fourteen to fifteen. Hence the total number of carry-save adders is
increased by $2N_{\mathrm{CSA}}$, from $70N_{\mathrm{CSA}}$ to
$72N_{\mathrm{CSA}}$.
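The trade-off described above, a shorter iteration period bound bought
with a few extra adders, can be summarized in a few lines of arithmetic.
The Python sketch below uses only the figures quoted in the text. The
variable names are illustrative, and the relative adder overhead is a
derived quantity that the text does not state explicitly.

```python
from fractions import Fraction

# Values quoted in the text, in units of T_CSA and N_CSA.
T_MIN_CONVENTIONAL = 4 + 6
T_MIN_TRANSFORMED = 3 + 3
CSA_CONVENTIONAL = 70
SECOND_ORDER_SECTIONS = 2
CSA_PER_SECTION_BEFORE = 14
CSA_PER_SECTION_AFTER = 15

# Relative reduction of the iteration period bound.
t_min_reduction = Fraction(
    T_MIN_CONVENTIONAL - T_MIN_TRANSFORMED, T_MIN_CONVENTIONAL
)

# Each second-order section gains one carry-save adder.
extra_csa = SECOND_ORDER_SECTIONS * (
    CSA_PER_SECTION_AFTER - CSA_PER_SECTION_BEFORE
)
csa_transformed = CSA_CONVENTIONAL + extra_csa
csa_overhead = Fraction(extra_csa, CSA_CONVENTIONAL)

print(f"T_min reduction: {float(t_min_reduction):.0%}")               # 40%
print(f"carry-save adders: {CSA_CONVENTIONAL} -> {csa_transformed}")  # 70 -> 72
print(f"adder overhead: {float(csa_overhead):.1%}")                   # 2.9%
```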
                            LWDF       Polyphase FIR    Transposed FIR
                                       Filter [37]      Filter [37]
Technology                  0.35 µm    0.8 µm           0.8 µm
Area [mm²]                  1.49       15.79            15.67
Power Supply Voltage [V]    3          3.3              3.3
Power Consumption [mW]      130        789              1571
Table 4.7: Comparison between LWDF and FIR filter implementations.

To verify the result a transformed filter was implemented in structural
VHDL. The conventional as well as the transformed filter structures were
mapped to a three-metal-layer 0.35 µm CMOS standard cell library. The
results of the implementations are shown in Table 4.8. The maximal sample
rate has been increased by 46% after transformation of the structure.
This corresponds well to the estimations. It can also be seen that after
transformation the area is increased by about 9%, from 1.67 mm² to
1.82 mm².

Filter Structure    Maximal Sample Rate [MHz]    Chip Area [mm²]
Conventional        77                           1.67
Transformed         113                          1.82
Table 4.8: Comparison between the original filter and the transformed
filter.
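The percentages quoted for the synthesized filters follow directly from
the entries of Table 4.8. As a minimal sketch, the Python lines below
recompute them from the rounded table values. The function name
relative_increase is an assumption for illustration, and because the
table entries are rounded, the result lands at roughly 46 to 47% for the
sample rate and about 9% for the area.

```python
# Entries of Table 4.8 (rounded values as printed).
RATE_MHZ = {"conventional": 77, "transformed": 113}
AREA_MM2 = {"conventional": 1.67, "transformed": 1.82}


def relative_increase(before, after):
    """Relative change from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


rate_gain = relative_increase(RATE_MHZ["conventional"], RATE_MHZ["transformed"])
area_gain = relative_increase(AREA_MM2["conventional"], AREA_MM2["transformed"])

print(f"sample rate increase: {rate_gain:.1%}")
print(f"area increase: {area_gain:.1%}")
```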
5 CONCLUSIONS
In this thesis we discussed the design and implementation of frequency
selective digital ﬁlters with high throughput and low power consumption.
Methods for increasing the throughput and reducing the power consump-
tion have been discussed and several digital ﬁlter structures have been
evaluated with respect to their arithmetic complexity.
In the thesis we have proposed arithmetic transformations of LWDFs
that reduce the iteration period bound for LWDFs composed of first- and
second-order Richards’ structures, implemented using carry-save
arithmetic. The increased throughput can be traded for reduced power
consumption through power supply voltage scaling. We have evaluated the
transformations and found that the increased throughput is obtained at the
expense of an increased arithmetic complexity. However, large reductions
of the iteration period bound are possible for coefficients with a low
number of nonzero bits, yielding small increases in the required
hardware.
One application for high throughput and low power digital filters
considered in the thesis was a digital down converter for a multiple
antenna radar receiver. For the digital down converter three different
filter structures were designed and implemented: an FIR-FIR structure, an
FIR-BLWDF structure, and a BLWDF-BLWDF structure. The three designs
were mapped to hardware and the hardware complexity of each of these
was evaluated. The comparison between the three cases indicates that the
FIR-BLWDF case yields the lowest hardware complexity.
The second application considered was a combined interpolator and
decimator for oversampled ADCs and DACs in an OFDM system. Novel
filter structures were evaluated with respect to throughput and arithmetic
complexity. The most hardware efficient structure was identified and
implemented in a 0.35 µm CMOS process using a standard cell library.
The implemented filter has been verified and the power consumption has
been measured.
REFERENCES
[1] M. S. Anderson, S. Summerfield, and S. S. Lawson, “Realization of
lattice wave digital filters using three-port adaptors,” Electronics
Letters, vol. 31, no. 8, pp. 628–629, April 1995.
[2] Austrian Micro Systems, 0.35 µm CMOS Digital Standard Cell
Databook, 2001.
[3] A. Avizienis, “Signed-digit number representation for fast parallel arith-
metic,” IRE Trans. Elec. Comp., vol. 10, pp. 389–400, 1961.
[4] D. R. Bull and D. H. Horrocks, “Primitive operator digital filters,” IEE
Proc. G, vol. 138, no. 3, pp. 401–412, June 1991.
[5] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen, “A
dynamic voltage scaled microprocessor system,” IEEE J. Solid-State
Circuits, vol. 35, no. 11, pp. 1571–1579, Nov. 2000.
[6] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W.
Brodersen, “Optimizing power using transformations,” IEEE Trans.
Computer-Aided Design, vol. 14, no. 1, pp. 12–31, Jan. 1995.
[7] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS
Design, Kluwer, Boston, 1995.
[8] A. P. Chandrakasan and R. W. Brodersen, “Minimizing power consump-
tion in digital CMOS circuits,” Proc. IEEE, vol. 83, no. 4, pp. 498–523,
April 1995.
[9] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-power
CMOS digital design,” IEEE J. Solid-State Circuits, vol. 27, no. 4, pp.
473–484, April 1992.
[10] J. G. Chung and K. K. Parhi, “Pipelined wave digital ﬁlter design for
narrow-band sharp-transition digital ﬁlters,” in Proc. IEEE Workshop
VLSI Signal Processing, La Jolla, USA, Oct. 1994, pp. 501–510.
[11] A. G. Dempster and M. D. Macleod, “Constant integer multiplication
using minimum-adders,” IEE Proc. Circuits Devices Systems, vol. 141,
no. 6, pp. 407–413, Oct. 1994.
[12] A. G. Dempster and M. D. Macleod, “Use of minimum-adder multiplier
blocks in FIR digital ﬁlters,” IEEE Trans. Circuits Systems–II, vol. 42,
no. 9, pp. 569–577, Sep. 1995.
[13] A. Fettweis, “Wave digital filters: Theory and practice,” Proc. IEEE,
vol. 74, no. 2, pp. 270–327, Feb. 1986.
[14] A. Fettweis, “Modified wave digital filters for improved implementation
by commercial digital signal processors,” Signal Processing, vol. 16,
no. 3, pp. 193–207, March 1989.
[15] A. Fettweis and T. Leickel, “On ﬂoating-point implementation of modi-
ﬁed wave digital ﬁlters,” in Proc. IEEE Int. Symp. on Circuits Systems,
San Diego, USA, May 10–13, 1992, pp. 1812–1815.
[16] L. Gazsi, “Explicit formulas for lattice wave digital filters,” IEEE
Trans. Circuits Systems, vol. 32, no.1, pp. 68–88, Jan. 1985.
[17] A. Gustafsson, K. Folkesson, and H. Ohlsson, “A simulation environ-
ment for integrated frequency and time domain simulations of a radar
receiver,” in Proc. Symp. on Gigahertz Electronics, Lund, Sweden,
Nov. 26–27, 2001.
[18] O. Gustafsson, On mapping of Digital Filter Algorithms to Hardware,
Thesis no. 838, Linköping University, 2000.
[19] O. Gustafsson, H. Ohlsson, and L. Wanhammar, “Minimum-adder inte-
ger multipliers using carry-save adders,” in Proc. IEEE Int. Symp. on
Circuits Systems, Sydney, Australia, May 6–9, 2001, pp. 709–712.
[20] O. Gustafsson, H. Johansson, and L. Wanhammar, “An MILP approach
for the design of linear-phase FIR ﬁlters with minimum number of
signed-power-of-two terms,” in Proc. European Conf. on Circuit Theory
and Design, Espoo, Finland, Aug. 28–31, 2001.
[21] O. Gustafsson, A. G. Dempster, and L. Wanhammar, “Extended results
for minimum-adder constant integer multipliers,” in Proc. IEEE Int.
Symp. on Circuits Systems, Scottsdale, USA, May 26–29, 2002, pp. 73–
76.
[22] O. Gustafsson and L. Wanhammar, “Some issues in low power arithme-
tic for ﬁxed-function DSP,” in Proc. National Conf. Radio Science,
Stockholm, Sweden, June 10–13, 2002, pp. 473–477.
[23] O. Gustafsson and L. Wanhammar, “Design of linear-phase FIR filters
combining subexpression sharing with MILP,” in Proc. IEEE Midwest
Symp. Circuits Systems, Tulsa, OK, Aug. 4–7, 2002.
[24] O. Gustafsson and L. Wanhammar, “ILP modelling of the common sub-
expression sharing problem,” in Proc. IEEE Int. Conf. Elec. Circuits
Systems, Dubrovnik, Croatia, Sept. 15–18, 2002, pp. 1171–1174.
[25] R. I. Hartley, “Subexpression sharing in ﬁlters using canonic signed
digit multipliers,” IEEE Trans. Circuits Syst.–II, vol. 43, pp. 667–688,
Oct. 1996.
[26] C. Hu, “Low-voltage CMOS device scaling,” in Proc. IEEE Int. Solid-
State Circuits Conf., 1994, pp. 86–87.
[27] Z. Jiang and A. N. Willson, “Efﬁcient digital ﬁltering architectures
using pipelining/interleaving,” IEEE Trans. Circuits Systems–II, vol. 44,
pp. 110–118, Feb. 1997.
[28] H. Johansson and L. Wanhammar, “Digital Hilbert transformers com-
posed of identical allpass subﬁlters,” in Proc. IEEE Int. Symp. on Cir-
cuits Systems, Monterey, USA, May 31–June 3, 1998, pp. 437–440.
[29] H. Johansson and L. Wanhammar, “High-speed recursive ﬁlter struc-
tures composed of identical all-pass subﬁlters for interpolation, decima-
tion, and QMF banks with perfect magnitude reconstruction,” IEEE
Trans. Circuits Systems–II, vol. 46, pp. 16–28, Jan. 1999.
[30] H. Johansson and L. Wanhammar, “Wave digital ﬁlter structures for
high-speed narrow-band and wide-band ﬁltering,” IEEE Trans. Circuits
Systems–II, vol. 46, pp. 726–741, June 1999.
[31] M. Karlsson, Distributed Arithmetic: Design and Applications, Thesis
No. 696, Linköping University, Sweden, 1998.
[32] U. Kleine and M. Böhner, “A high-speed wave digital filter using
carry-save arithmetic,” in Proc. ESSCIRC’87, Bad-Soden, Jan. 16–18, 1987,
pp. 43–46.
[33] U. Kleine and T. G. Noll, “Wave digital ﬁlters using carry-save arithme-
tic,” in Proc. IEEE Int. Symp. on Circuits Systems, Espoo, Finland,
1988, pp. 1757–1762.
[34] U. Kleine and T. G. Noll, “On the forced response stability of wave dig-
ital ﬁlters using carry-save arithmetic,” Arc. Elektr. Übertragungstech-
nik, vol. 41, no. 6, pp. 321–324, Nov/Dec. 1987.
[35] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, New Jersey,
1993.
[36] J. Leijten, J. van Meerbergen, and J. Jess, “Analysis and reduction of
glitches in synchronous networks,” in Proc. European Design and Test
Conf., Paris, France, March 6–9, 1995, pp. 393–404.
[37] M. Martinez-Peiro and L. Wanhammar, “High-speed, low-complexity
FIR ﬁlter using multiplier block reduction and polyphase decomposi-
tion,” in Proc. Int. Symp. on Circuits Systems, Geneva, Switzerland,
May 28–31, 2000, pp. 367–370.
[38] S. K. Mitra and J. F. Kaiser, Handbook for Digital Signal Processing,
John Wiley and Sons, New York, 1993.
[39] Z. Mou and F. Jutand, “‘Overturned stairs’ adder trees and multiplier
design,” IEEE Trans. on Computers, vol. 41, pp. 940–948, Aug. 1992.
[40] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J.
Yamada, “1-V power supply high-speed digital circuit technology with
multithreshold-voltage CMOS,” IEEE J. Solid-State Circuits, vol. 30,
no. 8, pp. 847–854, Aug. 1995.
[41] Y. Neuvo, D. Cheng-Yu, and S. K. Mitra, “Interpolated ﬁnite impulse
response ﬁlters,” IEEE Trans. on Acoust. Speech, Signal Processing,
vol. 32, pp. 563–570, June 1984.
[42] T. G. Noll, “Carry save architectures for high-speed digital signal
processing,” J. VLSI Signal Processing, vol. 3, pp. 121–140, 1991.
[43] T. Njølstad and E. J. Aas, “Validation of an accurate and simple delay
model and its application to voltage scaling,” in Proc. IEEE Int. Symp.
on Circuits Systems, Monterey, USA, May 31–June 3, 1998, pp. 101–
104.
[44] H. Ohlsson, H. Johansson, and L. Wanhammar, “Implementation of a
combined high-speed interpolation and decimation wave digital ﬁlter,”
in Proc. IEEE Int. Conf. on Elec. Circuits Systems, Paphos, Cyprus, Sep.
5–8, 1999, pp. 721–724.
[45] H. Ohlsson, H. Johansson, and L. Wanhammar, “Design of a digital
down converter using high speed digital ﬁlters,” in Proc. Symp. on Giga-
hertz Electronics, Gothenburg, Sweden, March 13–14, 2000, pp. 309–
312.
[46] H. Ohlsson, H. Johansson, and L. Wanhammar, “Implementation of a
combined interpolator and decimator for an OFDM system demonstra-
tor,” in Proc. NORCHIP Conf., Turku, Finland, Nov. 6–7, 2000, pp. 47–
52.
[47] H. Ohlsson and L. Wanhammar, “Implementation of bit-parallel lattice
wave digital ﬁlters,” in Proc. Swedish System-on-Chip Conf., Arild,
Sweden, March 20–21, 2001.
[48] H. Ohlsson, O. Gustafsson, and L. Wanhammar, “Arithmetic
transformations for increased maximal sample rate of bit-parallel
bireciprocal lattice wave digital filters,” in Proc. IEEE Int. Symp.
Circuits Systems, Sydney, Australia, May 6–9, 2001, pp. 825–828.
[49] H. Ohlsson, O. Gustafsson, H. Johansson, and L. Wanhammar, “Imple-
mentation of bit-parallel lattice wave digital ﬁlters with increased maxi-
mal sample rate,” in Proc. IEEE Int. Conf. Elec. Circuits Systems, St.
Julian, Malta, Sept. 2–5, 2001, pp. 71–74.
[50] H. Ohlsson and L. Wanhammar, “A digital down converter for a wide-
band radar receiver,” in Proc. National Conf. Radio Science, Stockholm,
Sweden, June 10–13, 2002, pp. 478–481.
[51] H. Ohlsson, O. Gustafsson, W. Li, and L. Wanhammar, “An environment
for design and implementation of energy efficient digital filters,”
in Proc. Swedish System-on-Chip Conf., Eskilstuna, Sweden, April 8–9,
2003.
[52] A.V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing,
Prentice Hall, Englewood Cliffs, NJ, 1989.
[53] K. Palmkvist, M. Vesterbacka, and L. Wanhammar, “Arithmetic trans-
formations for fast bit-serial VLSI implementations of recursive algo-
rithms,” in Proc. Nordic Signal Processing Symp., Espoo, Finland, Sept.
24–27, 1996, pp. 391–394.
[54] K. Palmkvist, Studies on the Design and Implementation of Digital Fil-
ters, Diss. no. 583, Linköping University, Sweden, 1999.
[55] J. Pandel and U. Kleine, “Design of bireciprocal wave digital filters for
high sampling rate applications,” Frequenz, vol. 40, pp. 300–308, 1986.
[56] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and
Implementation, Wiley, New York, 1998.
[57] K. K. Parhi and D. G. Messerschmitt, “Pipeline interleaving and paral-
lelism in recursive digital filters – Part 1: pipelining using scattered
look-ahead and decomposition,” IEEE Trans. Acoust. Speech, Signal
Processing, vol. 37, no. 7, pp. 1099–1117, July 1989.
[58] K. K. Parhi and D. G. Messerschmitt, “Pipeline interleaving and paral-
lelism in recursive digital filters – Part 2: pipeline incremental block fil-
tering,” IEEE Trans. Acoust. Speech, Signal Processing, vol. 37, no. 7,
pp. 1118–1134, July 1989.
[59] D. A. Parker and K. K. Parhi, “Area-efﬁcient parallel FIR digital ﬁlter
implementations,” in Proc. Application Speciﬁc Systems, Architectures,
and Processors, Chicago, USA, Aug. 1996, pp. 93–111.
[60] L. Petersson, M. Danestig, and U. Sjöström, “An experimental S-band
digital beamforming antenna,” in Proc. Symp. Phased Array Systems
and Tech., Boston, USA, Oct. 1996, pp. 93–98.
[61] J. Pihl, “Design automation with the TSPC circuit technique: a
high-performance wave digital filter,” IEEE Trans. VLSI Systems, vol. 8,
no. 4, pp. 456–460, Aug. 2000.
[62] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, “Multiple con-
stant multiplications: efﬁcient and versatile framework and algorithms
for exploring common subexpression eliminations,” IEEE Trans. Com-
puter-Aided Design, vol. 15, pp. 151–161, Feb. 1996.
[63] M. Renfors and Y. Neuvo, “The maximum sampling rate of digital ﬁl-
ters under hardware speed constraints,” IEEE Trans. Circuits Systems–
II, vol. 28, pp. 196–202, 1981.
[64] D. W. Rice and K. H. Wu, “Quadrature sampling with high dynamic
range,” IEEE Trans. Aerosp. Elec. Systems, vol. 18, pp. 736–739, 1982.
[65] T. Sakuta, W. Lee, and P. T. Balsara, “Delay balanced multipliers for
low power/low voltage DSP core,” in Proc. IEEE Symp. on Low Power
Elec., San Jose, USA, Oct. 9–11, 1995, pp. 36–37.
[66] N. Sankarayya and K. Roy, “Algorithms for low power and high speed
FIR ﬁlter realization using differential coefﬁcients,” IEEE Trans. Cir-
cuits Systems–II, vol. 44, pp. 488–497, June 1997.
[67] T. Saramäki, Y. Neuvo, and S. K. Mitra, “Design of computationally
efﬁcient interpolated FIR ﬁlters,” IEEE Trans. Circuits Systems, vol. 35,
pp. 70–88, Jan. 1988.
[68] C. V. Schimpfle, S. Simon, and J. A. Nossek, “Optimal placement of
registers in data paths for low power design,” in Proc. IEEE Int. Symp. on
Circuits Systems, Hong Kong, June 9–12, 1997, pp. 2160–2163.
[69] U. Sjöström, M. Karlsson, and M. Hörlin, “A digital down converter
chip,” in Proc. European Signal Processing Conf., Trieste, Italy, Sept.,
1996, pp. 284–287.
[70] R. Standert, “Software model of a radar receiver,” Master Thesis, IR-
SB-EX-0202, Royal Institute of Technology, Feb., 2002.
[71] P. P. Vaidyanathan, “Multirate digital ﬁlters, ﬁlter banks, polyphase net-
works, and applications, a tutorial,” Proc. IEEE, vol. 78, no. 1, pp. 56–
93, Jan. 1990.
[72] P. P. Vaidyanathan, Multirate systems and Filter Banks, Prentice Hall,
Englewood Cliffs, NJ, 1993.
[73] M. Vesterbacka, On Implementation of Maximally Fast Wave Digital
Filters, Diss. no. 487, Linköping University, Sweden, 1997.
[74] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Elec.
Comp., vol. 13, pp. 114–117, Feb. 1964.
[75] L. Wanhammar and H. Johansson, Digital Filters, Linköping University,
2002.
[76] L. Wanhammar, DSP Integrated Circuits, Academic Press, San Diego,
1999.
[77] W. M. Waters and B. R. Jarret, “Bandpass signal sampling and coherent
detection,” IEEE Trans. Aerosp. Elec. Systems, vol. 18, pp. 731–736,
1982.
[78] A. Wróblewski, C. V. Schimpﬂe, and J. A. Nossek, “Automated transis-
tor sizing algorithm for minimizing spurious switching activities in
CMOS circuits,” in Proc. IEEE Int. Symp. on Circuits Systems, Geneva,
Switzerland, May 28–31, 2000, vol. 3, pp. 291–294.