Low-Power Audio Input Enhancement for Portable Devices by Yoo, Heejong








of the Requirements for the Degree
Doctor of Philosophy in Electrical Engineering
School of Electrical and Computer Engineering
Georgia Institute of Technology
January 2005
Copyright© 2005 by Heejong Yoo
LOW-POWER AUDIO INPUT ENHANCEMENTS FOR
PORTABLE DEVICES
Approved by:
Dr. David V. Anderson, Advisor
School of Electrical & Computer Engineering
Georgia Institute of Technology
Dr. Douglas B. Williams
School of Electrical & Computer Engineering
Georgia Institute of Technology
Dr. Paul E. Hasler
School of Electrical & Computer Engineering
Georgia Institute of Technology
Dr. W. Marshall Leach Jr.
School of Electrical & Computer Engineering
Georgia Institute of Technology
Dr. Brani Vidakovic
School of Industrial and Systems Engineering
Georgia Institute of Technology
Date Approved: January 2005
ACKNOWLEDGEMENTS
I would like to express my deepest gratitude to my advisor, Dr. David Anderson, for his
guidance, support, endurance, and encouragement during my stay at Georgia Tech. He
has always been supportive of me, not only when I made progress but also when I made
mistakes. I am also deeply impressed with his dedication to research and students. I am
happy to say that this research is only a small part of what I’ve learned from him. I
would also like to thank Dr. Paul Hasler for his extensive guidance on my continuous-time
audio signal processing research. I am deeply impressed by the depth of his knowledge in
both analog signal processing and neuromorphic signal processing. This research would
not have been possible without the leadership and the support of Dr. Anderson and Dr.
Hasler.
I was fortunate to be a member of the CADSP research group. In particular, I’ve
enjoyed working with David Graham and Rich Ellis and would like to thank them for their
constructive feedback, outstanding achievements on VLSI circuit layout, measurement,
and testing, and especially their willingness to share invaluable experimental results with
me. I also had an opportunity to work closely with Daniel Allred, Venketash Khrishnan,
and Walter Huang. I would like to thank them for the rewarding research experience that
we shared.
I am grateful to my parents and other family members for their love and support. Most
of all, I thank my lovely wife, Jaeeun, and beloved son, Taehwan, for their endless love,
support, and perseverance during my time at Georgia Tech.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY
CHAPTER 1 INTRODUCTION
1.1 Cooperative Analog and Digital Signal Processing
1.2 Analog Computing Elements: The Floating-Gate Approach
1.3 Acoustic Noise Suppression (ANS)
1.4 Acoustic Echo Cancellation (AEC)
1.4.1 Adaptive Filters
1.4.2 Double-Talk Detector
CHAPTER 2 DISTRIBUTED ARITHMETIC FIR FILTER
2.1 LUT-Based DA FIR Filter
2.2 Adder-Based DA FIR Filter
2.3 DA for High-Order FIR Filter
2.4 DA-Offset Binary Coding (DA-OBC)
2.5 High-Speed DA
CHAPTER 3 PROGRAMMABLE CONTINUOUS-TIME ANS
3.1 Background of ANS
3.2 Continuous-Time ANS
3.3 Analog VLSI Implementation
3.3.1 Frequency Decomposition
3.3.2 Signal and Noise Level Estimation
3.3.3 SNR Calculation and Normalization
3.3.4 Programmable Non-Linear Gain Function
3.4 Simulation and Implementation Results
CHAPTER 4 CONTINUOUS-TIME DELAY FILTER
4.1 Property of The Delay Filter
4.2 Low-Pass-To-Band-Pass Transformation
4.2.1 Geometrically Symmetrical Transformation
4.2.2 Arithmetically Symmetrical Transformation
4.3 Subband Delay Structure
4.3.1 Delay Network with Low-Pass Filters and Modulation
4.3.2 Delay Network with Band-Pass Filters
CHAPTER 5 HYBRID DA FIR FILTER
5.1 Hybrid DA-OBC Architecture
5.2 Hybrid DA Architecture
5.3 Hardware Cost Analysis
5.4 FPGA Implementation
5.5 Reusable DA Block
CHAPTER 6 DISCRETE-TIME AEC
6.1 Traditional Hardware Implementation
6.2 DA LMS Adaptive Filter
6.2.1 DA-A-LUT Update
6.2.2 DA-F-LUT Update
6.2.3 DAAF for High-Order Filters
6.2.4 Implementation Results
CHAPTER 7 HYBRID DA LMS ADAPTIVE FILTER
7.1 Hybrid DA LMS
7.2 Performance Analysis
CHAPTER 8 CONCLUSION AND FUTURE RESEARCH
8.1 Summary of Contributions
8.2 Directions for Future Research
REFERENCES
LIST OF TABLES
Table 1 LMS algorithm
Table 2 NLMS algorithm
Table 3 Original LUT contents of DA-OBC
Table 4 Bessel polynomials, Qn(s), in unfactored form
Table 5 Transistor counts for digital logic functions
Table 6 Transistor count comparison of various base units for the k-tap FIR filter (register version)
Table 7 Transistor count comparison of various base units for the k-tap FIR filter (hardwired version)
Table 8 Summary of the FPGA implementation for FIR filters ranging from 4 taps to 1024 taps
Table 9 Summary of the FPGA implementation for FIR filters ranging from 8 taps to 2048 taps
Table 10 Rotation of address lines for the DA-A-LUT when K = 4
Table 11 Memory requirements in KB for various K and k
Table 12 Power consumption estimates in mW for various K and k
Table 13 Summary of the FPGA implementation for LMS adaptive filters ranging from 4 to 1024 taps
Table 14 Summary of the FPGA implementation for LMS adaptive filters ranging from 8 to 1024 taps
Table 15 Summary of the FPGA implementation for LMS adaptive filters ranging from 16 to 1024 taps
LIST OF FIGURES
Figure 1 Block diagram of the low-power audio input enhancement system
Figure 2 Gene’s law
Figure 3 Block diagram of CADSP approach
Figure 4 Layout, cross section, and circuit diagram of a floating-gate pFET
Figure 5 Simplified Schroeder’s continuous-time noise suppression system
Figure 6 Diethorn’s discrete-time noise suppression system
Figure 7 Acoustic echo cancellation (AEC) system for a teleconferencing application
Figure 8 Block diagram of subband AEC (SAEC) for a teleconferencing application
Figure 9 Block diagram of a 4-tap DA FIR filter
Figure 10 Block diagram of a 4-tap DA FIR filter when m = 2 and k = 2
Figure 11 Block diagram of a 4-tap DA-OBC FIR filter
Figure 12 Block diagram of a 2-tap DA FIR filter with 2BAAT access to the LUT
Figure 13 Block diagram of the continuous-time ANS system
Figure 14 Detailed view of the subband gain calculation block
Figure 15 Schematic of capacitively-coupled current conveyer (C4) and C4 SOS
Figure 16 Magnitude responses of C4 and C4 SOS
Figure 17 Architecture used for programming arrays of floating-gate devices
Figure 18 Magnitude responses of filter bank of 32 programmable vanilla C4s
Figure 19 Peak detector for estimation of the noisy signal and its step responses for various time constants
Figure 20 Minimum detector for estimation of the noise envelope
Figure 21 Translinear division circuit for SNR calculation
Figure 22 Averaged noisy signal envelope and normalized SNR
Figure 23 Comparison of Wiener, bi-linear, and sigmoid gain functions
Figure 24 Combined schematic of the multiplication and gain circuit
Figure 25 Simulated and measured noise suppressed waveform
Figure 26 Block diagram of the tapped-delay line and the parallel-delay line
Figure 27 Group delay of fourth-order analog low-pass filters
Figure 28 Group delay of the second-order band-pass filter for various Qs
Figure 29 Comparison of geometrically symmetrical transformation and arithmetically symmetrical transformation
Figure 30 Simulation results of the group delay of the band-pass filter transformed using the geometrically symmetrical transformation
Figure 31 Simulation results of the group delay of the band-pass filter transformed using the arithmetically symmetrical transformation
Figure 32 Group delay of the Bessel low-pass and band-pass filters
Figure 33 Delay network with low-pass delay filters
Figure 34 Two examples of delay lines for a delay network with low-pass delay filters
Figure 35 Delay network with the band-pass delay filters for a single subband
Figure 36 Multi-level delay network with band-pass delay filters
Figure 37 Block diagram of a 4-tap DA-OBC FIR filter with 2^2-size LUT
Figure 38 Block diagram of a 4-tap DA-OBC FIR filter with 2^1-size LUT
Figure 39 Block diagram of the LUT-less DA-OBC for a 4-tap FIR filter
Figure 40 Block diagram of the LUT-less hybrid DA-OBC for a 4-tap FIR filter
Figure 41 Hybrid DA architecture for a 4-tap FIR filter with 2^3-size LUT
Figure 42 Hybrid DA architectures with 2^2-size LUT and 2^1-size LUT for a 4-tap FIR filter
Figure 43 LUT-less hybrid DA architecture for a 4-tap FIR filter
Figure 44 Transistor count estimation comparison of various base units for different filter sizes with Bc = 18 (register version)
Figure 45 Transistor count estimation comparison of various base units for different filter sizes with Bc = 18 (hardwired version)
Figure 46 Direct implementation of 8-point DCT with DA
Figure 47 DA architecture for the even-odd decomposition of the 8-point DCT
Figure 48 DA architecture for second recursive even-odd decomposition of 8-point DCT
Figure 49 Block diagram of the reusable DA base unit and the reusable DA
Figure 50 Multiplexed architecture for the 8-point DCT with reusable DA
Figure 51 Block diagram of the DAAF for a single LUT-based implementation
Figure 52 Simplified flowchart of the DAAF
Figure 53 Update of the DA-A-LUT entries from time n − 1 to time n
Figure 54 Block diagram for the DA-A-LUT address rotation
Figure 55 Learning curve comparison between original LMS and QE-LMS
Figure 56 Block diagram of the area-optimized DAAF for high-order filters
Figure 57 Detailed view of the DA base filtering and adaptation unit (DA-BFAU(k))
Figure 58 Throughput comparison between a microprocessor and the DAAF
Figure 59 Throughput comparison of the DAAF and the MMAF for various filter sizes, K
Figure 60 Number of LEs for various K and k
Figure 61 Memory usage comparison between a microprocessor and the DAAF
Figure 62 Learning curve for a 512-tap LMS adaptive filter implemented with the DAAF(4,128)
Figure 63 Example of a block diagram of a 4-tap base unit for the hybrid DA LMS architecture
Figure 64 Filtering operation for a 4-tap base unit with kL = 2 and kA = 2
Figure 65 Update of the DA-A-LUT and input registers for a 4-tap base unit with kL = 2 and kA = 2
Figure 66 Update of the filter coefficients stored in the DA-F-LUT for a 4-tap base unit with kL = 2 and kA = 2
Figure 67 Update of filter coefficients stored in registers for 4-tap base unit with kL = 2 and kA = 2
Figure 68 FPGA resource usage rate of the hybrid DA LMS architecture over the DAAF architecture for the same throughput
SUMMARY
With the development of VLSI and wireless communication technology, portable devices
such as personal digital assistants (PDAs), pocket PCs, and mobile phones have become
widely popular. Many such devices incorporate a speech recognition engine, enabling
users to interact with the devices using voice-driven commands and text-to-speech
(TTS) synthesis. It is well known that the speech recognition rate drops when
the signal-to-noise ratio (SNR) of the input is low, unless the speech recognition engine is
designed to be robust to high levels of environmental noise.
The power consumption of DSP microprocessors has been consistently decreasing by
half about every 18 months, following Gene’s law. The capacity of signal processing,
however, is still significantly constrained by the limited power budget of these portable
devices. In addition, analog-to-digital (A/D) converters can also limit the signal processing
of portable devices. Many systems require very high-resolution and high-performance A/D
converters, which often consume a large fraction of the limited power budget of portable
devices.
The proposed research develops a low-power audio signal enhancement system that
combines programmable analog signal processing and traditional digital signal process-
ing. By utilizing analog signal processing based on floating-gate transistor technology, the
power consumption of the overall system as well as the complexity of the A/D converters
can be reduced significantly. The system can serve as a front end for portable devices
in which the quality of the audio signal is critical to automatic speech
recognition. The proposed system performs background audio
noise suppression (ANS) in a continuous-time domain using analog computing elements

CHAPTER 1 INTRODUCTION





With the development of VLSI and wireless communication technology, portable devices
such as personal digital assistants (PDAs), pocket PCs, and mobile phones have become
widely popular. Many such devices incorporate a speech recognition engine that enables users
to interact with devices using voice-driven commands and text-to-speech (TTS) synthesis.
A microphone, whether external or internal, often picks up a disturbing noise signal as
well as a speech signal. A low signal-to-noise ratio (SNR) at the input degrades
the speech recognition rate of the system unless the speech recognition
engine is designed to be robust to high noise variance. In general, a microphone signal
contains noise with different characteristics, such as acoustic echo, background noise, and
babble noise, and each type of noise requires a specific enhancement approach.
In this thesis, two common audio-enhancement techniques, acoustic noise suppression
(ANS) and acoustic echo cancellation (AEC), are discussed. Figure 1 shows a system
diagram of the research. We propose to implement ANS in the continuous-time domain to
benefit from the low-power parallel processing of analog computing elements, and AEC in
the discrete-time domain, using a reconfigurable custom IC, to gain the high throughput of
digital computing elements.
Research in ANS and AEC techniques has been popular for the last few decades and
continues to be an active area in the speech signal enhancement field. The
two speech-enhancement problems are normally solved separately and the solutions are then
combined for better performance; some exceptions merge the two algorithms into one problem
to find a globally optimal solution [57, 76]. However, most previous attempts have been
made in a purely discrete-time domain, and the computational complexity of the existing
algorithms tends to increase when seeking better performance. The increase in the computa-












Figure 1. Proposed audio input enhancement system for portable devices. For portable PDA devices
with speech recognition functionality, a continuous-time ANS IC is employed before the A/D converter.
For portable communication devices, a continuous-time ANS and a discrete-time AEC IC are con-
nected for low-power and real-time processing, respectively.
more millions of instructions per second (MIPS), causing a real-time processing problem
in systems such as portable devices, in which higher MIPS are hard to attain because of the
power constraint.
Power consumption in DSP microprocessors has been consistently decreasing by half
about every 18 months, following Gene’s law [37], as shown in Figure 2. The capacity of
signal processing, however, is still significantly constrained by the limited power budget
of these portable devices. In addition, analog-to-digital (A/D) converters can also limit
the signal processing of portable devices. Many systems require very high-resolution and
high-performance A/D converters, but these A/D converters often consume a large portion
of the limited power budget of portable devices.
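
The halving trend quoted above can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: it assumes an ideal exponential trend with an 18-month halving period, and the function names are hypothetical, not taken from the thesis.

```python
import math


def relative_power_per_mips(years, halving_period_years=1.5):
    """Power per MIPS relative to year 0, assuming it halves every 18 months."""
    return 0.5 ** (years / halving_period_years)


def equivalent_years(power_reduction_factor, halving_period_years=1.5):
    """Years of trend progress matched by a one-off power reduction factor."""
    return halving_period_years * math.log2(power_reduction_factor)


if __name__ == "__main__":
    # Under this idealized trend, 20 years corresponds to a reduction of about 10^4.
    factor = 1.0 / relative_power_per_mips(20.0)
    print(f"20 years on the trend: power per MIPS reduced by ~{factor:.0f}x")
    print(f"A 100x reduction equals ~{equivalent_years(100.0):.1f} years of trend")
```

Read this way, a technique that cuts power per MIPS by several orders of magnitude is equivalent to jumping many years ahead on the trend line.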
1.1 Cooperative Analog and Digital Signal Processing
Cooperative analog and digital signal processing (CADSP) can be defined as a method of
intelligently combining programmable analog signal processing and digital signal process-
ing techniques to achieve low-power and real-time signal processing. Figure 3 shows the
difference between the traditional DSP approach and the CADSP approach. All signals in

































Figure 2. DSP power consumption per MIPS has decreased over the past few decades, following
Gene’s law [37]. The trend can be improved dramatically by shifting some of the processing from the
digital domain to the analog domain. VLSI chips fabricated using cooperative analog and digital signal
processing (CADSP) techniques show the equivalent of up to a 20-year improvement in power per MIPS.
the real world are analog, while most modern signal processing and communications occur
in the digital domain. Current practice converts real-world analog signals into the digital
domain immediately using an A/D converter, so the majority of the system is implemented in
the digital domain. Finally, the outputs of the digital system are converted back to analog
signals using a digital-to-analog (D/A) converter. One drawback of this approach is its high
power consumption, which is unacceptable, particularly for portable devices in which power
consumption and battery life are critical. CADSP allows more freedom in partitioning the
computation between the analog and digital domains. CADSP performs some of the processing
in the analog domain prior to the A/D converter, reducing the computational load of the
digital domain.
Many analog techniques are orders of magnitude more efficient than their digital
counterparts in terms of speed and power dissipation [68]. The low power consumption of
analog computing is made possible by floating-gate technology, which is explained in the
following section. Rather than a literal map of the incoming signal, the output of the analog
computing system provides more refined information, such as Fourier coefficients [60] and
the cepstrum [88].


















Figure 3. In traditional DSP systems, the A/D converter is placed as close to the real world as
possible. However, significant power savings can be achieved by moving some of the signal processing
functionality into the analog domain (prior to the A/D converter).
Since this refined information requires much lower resolution when converted to the digital
domain, much simpler and smaller A/D converters can be used instead of traditional A/D
converters.
Although analog signal processing is capable of several important functions, the effects
of resolution on these analog computing systems remain unknown. The computational
cost generally involves chip area, power consumption, design time, and development
cost. While the cost of digital computation increases linearly with the number of bits of
required resolution, the cost of analog computation increases exponentially.
Sarpeshkar [81] showed that analog computation has significant advantages when
the resolution of the input signal is no more than 10 to 12 bits.
1.2 Analog Computing Elements: The Floating-Gate Approach
Floating-gate transistors such as EPROMs, EEPROMs, and flash memories have been used
for some time now. Since the late 1980s, considerable research introducing new ways of




















Figure 4. Layout, cross section, and circuit diagram of the floating-gate pFET in a standard double-
poly, n-well MOSIS process: The cross section corresponds to the horizontal line slicing through the
layout view. The pFET transistor is a standard pFET transistor in the n-well process. The gate input
capacitively couples to the floating-gate by either a poly-poly capacitor, a diffused linear capacitor, or
an MOS capacitor, as seen in the circuit diagram. Between V_tun and the floating-gate is our symbol for
a tunneling junction: a capacitor with an added arrow designating the charge flow.
number of solutions. As a result, floating-gate devices are not just for memory anymore,
but are circuit elements with analog memory and important time-domain dynamics [45,
69]. Floating-gate devices and circuits can be divided into three major categories: analog
memory elements, capacitive-based circuits, and adaptive circuit elements.
As shown in Figure 4, a floating gate is a polysilicon layer that has no contact with other
layers and thus has no DC path to a fixed potential. The polysilicon gate of an MOS
transistor is completely wrapped in silicon dioxide, storing a charge almost permanently.
Charge on the gate can be modified by three processes: UV photo injection, electron
tunneling, and hot-electron injection [48]. The last two are the primary means of programming
floating-gate circuits. For the tunneling process [51], a very large voltage is placed on the
tunneling capacitor, as shown in Figure 4. As this large tunneling voltage increases, the effective width
of the barrier decreases. This allows some electrons to breach the gap without adversely
affecting the insulator. Through this process, electrons can be removed from the gate in a
controlled manner. Hot-electron injection [48] has two requirements for changing the gate
charge. First, an appreciable amount of current must flow through the device. Second, the
source-to-drain voltage must be large. When both of these criteria are met, holes in a pFET
that are flowing through the channel can build up enough energy for the impact ionization
of an electron-hole pair. The electron can have enough energy to pass through the insulator
and onto the floating gate. Therefore, hot-electron injection puts electrons onto the floating
gate in a controlled manner.
1.3 Acoustic Noise Suppression (ANS)
ANS has been employed in many telecommunication systems to increase the intelligibility
of the audio signal and/or reduce adverse artifacts. Speech communication and speech
recognition systems can be degraded by various sources of background noise, which can
vary from stationary noises, such as heater, air conditioner, and computer fan noise,
to nonstationary noises, such as transportation, roadway, and babble noise.
Various noise suppression algorithms have been proposed. Such algorithms include
the short-time Fourier transform method (spectral modification) [38], the hidden Markov
model (HMM) [33, 80], Wiener filtering [52], and parameter estimation methods using,
for instance, maximum likelihood (ML) estimation, maximum a posteriori (MAP)
estimation [94], or minimum mean-square error (MMSE) estimation [66]. Short-time Wiener
filters are presented in [32, 66].
One of the popular short-time Fourier transform methods was originally developed by
Boll [12], who coined the term “spectral subtraction.” The spectral subtraction method,
however, creates a significant amount of high-frequency noise with a very short duration
when used without any modification. This noise is referred to as “musical noise,” and many























Figure 5. Simplified Schroeder’s continuous-time noise suppression system. Schroeder’s system was a
purely analog implementation of a speech-enhancement system. A bank of band-pass filters separates
the noisy signal into M different subbands, each with a bandwidth of about 300 Hz. Each subband signal is
rectified and averaged to estimate a short-time noisy speech envelope.
with subband processing [34, 86] to alleviate the musical noise problem.
The first use of spectral gain modification in speech enhancement dates back to Schroeder’s
noise reduction systems [82, 83] in the 1960s. Figure 5 shows the block diagram of
Schroeder’s subband noise reduction algorithm based on subband gain modification. Schroeder’s
system was a purely analog implementation of a speech-enhancement system. A bank of
band-pass filters separates the noisy signal into M different subbands, each with a bandwidth
of about 300 Hz. Each subband signal is rectified and averaged to estimate a short-time
noisy speech envelope. An estimate of the noise envelope is then subtracted from the
noisy speech envelope. The subtracted result is rectified at another rectifier and multiplied by
the subband signal. Finally, the subband signals are recombined to form a full-band estimate
of the speech signal [42].
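
The per-subband processing described above can be sketched in a few lines of discrete-time Python. This is only a rough illustration of a system that was originally analog: it assumes the noise-envelope estimate is subtracted from the noisy-speech envelope (so the difference is positive when speech is present), uses a one-pole smoother as the averaging stage, and takes the band-pass filter bank as given. All function names and the smoothing constant `alpha` are hypothetical.

```python
def short_time_envelope(x, alpha=0.99):
    """Full-wave rectify x and smooth it with a one-pole averager (assumed form)."""
    env, out = 0.0, []
    for sample in x:
        env = alpha * env + (1.0 - alpha) * abs(sample)
        out.append(env)
    return out


def suppress_subband(subband, noise_envelope, alpha=0.99):
    """Scale one subband by the rectified (envelope - noise estimate) difference."""
    envelope = short_time_envelope(subband, alpha)
    # max(..., 0.0) plays the role of the second rectifier.
    return [max(e - noise_envelope, 0.0) * s for e, s in zip(envelope, subband)]


def reconstruct(subbands):
    """Sum the processed subbands sample by sample into a full-band signal."""
    return [sum(samples) for samples in zip(*subbands)]


if __name__ == "__main__":
    quiet = [0.01 * (-1) ** n for n in range(200)]   # below the noise floor
    loud = [1.0 * (-1) ** n for n in range(200)]     # well above the noise floor
    out = reconstruct([suppress_subband(quiet, 0.05, 0.9),
                       suppress_subband(loud, 0.05, 0.9)])
    print(out[-1])
```

A subband whose envelope stays below the noise estimate is muted entirely, while a subband dominated by speech passes through with a gain that grows with its envelope.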
Diethorn [26] also proposed a subband noise suppression method based on an a posteriori
SNR voice activity detector (VAD). Diethorn’s system, as shown in Figure 6, imple-



























Figure 6. Diethorn’s discrete-time noise suppression system. A VAD and a single-pole recursive envelope
estimator with different attack and decay time constants are used. A bi-linear gain function is used
with a noise suppression threshold, γ, specifying the certainty of speech. The bi-linear gain function,
however, suffers a severe loss of signal magnitude at low SNRs, especially when a large γ is used for more
noise suppression.
with different attack and decay time constants for estimating both the noise envelope and the
noisy signal envelope [42]. The bi-linear gain function is used with a noise suppression
threshold, γ, specifying the certainty of speech. The bi-linear gain function, however, suffers
a severe loss of signal magnitude at low SNRs, especially when a large γ is used for more
noise suppression.
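
A single-pole recursive envelope estimator with separate attack and decay behavior can be sketched as follows. This is a generic textbook form, not necessarily the exact recursion used in Diethorn’s system; the function name and the coefficient values are assumptions chosen for illustration.

```python
def track_envelope(x, attack_coeff, decay_coeff):
    """One-pole envelope follower with different rise and fall smoothing.

    A coefficient close to 1 means slow tracking; close to 0 means fast tracking.
    """
    env, out = 0.0, []
    for sample in x:
        magnitude = abs(sample)
        # Use the attack coefficient while the input exceeds the current estimate,
        # and the decay coefficient otherwise.
        coeff = attack_coeff if magnitude > env else decay_coeff
        env = coeff * env + (1.0 - coeff) * magnitude
        out.append(env)
    return out


if __name__ == "__main__":
    burst = [1.0] * 50 + [0.0] * 50
    # Fast attack and slow decay follow the noisy-signal envelope.
    signal_env = track_envelope(burst, attack_coeff=0.5, decay_coeff=0.99)
    # Slow attack and fast decay stay near the minima, as a noise-floor tracker would.
    noise_env = track_envelope(burst, attack_coeff=0.99, decay_coeff=0.5)
    print(signal_env[49], signal_env[99], noise_env[49], noise_env[99])
```

Swapping the two coefficients turns the same recursion from a signal-envelope tracker into a noise-floor tracker, which is why one estimator structure can serve both purposes.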
1.4 Acoustic Echo Cancellation (AEC)
Sondhi [38, 90, 91] developed an echo canceler in the 1960s to address the network
(electrical) echo problem, which occurs as a result of the impedance mismatch between a
four-wire long-distance connection and a two-wire local loop of the communication circuits.
Network echo cancellation is a relatively easy problem compared to AEC because the network
echo is very short and stationary, while the acoustic echo is long and time varying. For a
typical office room, the room impulse response can last up to 250 ms. At an
















Figure 7. AEC system for a teleconferencing application. The far-end speech signal, x[n], is played
through a loudspeaker in a receiving room and its echo, y[n], is picked up by a microphone for a single
channel system. The microphone also picks up the near-end speech signal, v[n], as well as background
noise, u[n]. An adaptive filter estimates the room impulse response, w[n], and then the echo replica,
ŷ[n], is subtracted from the microphone signal.
than 2000 for higher echo return loss enhancement (ERLE).
Figure 7 shows an AEC system for a teleconferencing application. The far-end speech
signal, x[n], is played through a loudspeaker in a receiving room, and its echo, y[n], is
picked up by a single microphone for a single channel system. The microphone also picks
up a near-end speech signal, v[n], as well as background noise, u[n]. The room impulse
response, h[n], may change rapidly during the connection as the position and the direction of
either the loudspeaker or the microphone vary. An adaptive filter in an AEC system continuously
estimates the room impulse response as w[n], and the echo replica, ŷ[n], is subtracted from
the microphone signal before the error signal, e[n], is transmitted back to the far-end side
over a communication channel.
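
The signal flow just described can be written out directly. The sketch below only models the bookkeeping (microphone signal, error signal, and an ERLE figure); it assumes one common definition of ERLE, the ratio of microphone power to residual power in dB measured while only the echo is present, and the function names are hypothetical.

```python
import math


def microphone_signal(echo, near_end, noise):
    """Microphone input: echo y[n] plus near-end speech v[n] plus noise u[n]."""
    return [y + v + u for y, v, u in zip(echo, near_end, noise)]


def error_signal(mic, echo_replica):
    """e[n]: microphone signal minus the adaptive filter's echo replica."""
    return [d - y_hat for d, y_hat in zip(mic, echo_replica)]


def erle_db(mic, err):
    """ERLE in dB, assumed here as 10*log10(mean(mic^2) / mean(err^2))."""
    p_mic = sum(d * d for d in mic) / len(mic)
    p_err = sum(e * e for e in err) / len(err)
    return 10.0 * math.log10(p_mic / p_err)


if __name__ == "__main__":
    echo = [1.0, -1.0, 1.0, -1.0]
    silence = [0.0] * 4
    mic = microphone_signal(echo, silence, silence)
    # A replica that captures 90% of the echo amplitude leaves 10% residual.
    err = error_signal(mic, [0.9 * y for y in echo])
    print(f"ERLE = {erle_db(mic, err):.1f} dB")
```

A replica that matches the echo more closely leaves a smaller residual and therefore a higher ERLE.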
1.4.1 Adaptive Filters
Adaptive filters have been used in many areas such as system identification, equalization,
signal detection, noise cancellation, and echo cancellation [38]. In particular, finite impulse
response (FIR) adaptive filters have gained considerable popularity over infinite impulse
response (IIR) adaptive filters for several reasons. First, FIR adaptive filters are more stable
than IIR adaptive filters since their filter coefficients are more robust to quantization noise.
Second, FIR adaptive filters have a simple weight-update form. Finally, the performance
of these algorithms is well understood in terms of their convergence and stability.
The goal of an FIR adaptive filter in the mean-square error sense is to find the vector,
w[n], at time n that minimizes the quadratic function
\[ \xi[n] = E\{|e[n]|^2\}, \tag{1} \]
where e[n] = d[n] − y[n], d[n] is the desired signal, and y[n] is the filter output. Equation (1)
can be solved in many ways, but the steepest descent algorithm is the most common iterative
procedure. The steepest descent algorithm is a method that finds the extrema of
non-linear functions. It estimates the solution vector at every time index by adding a correction
term to the estimate of the previous time index such that the current estimate
is closer to the optimal solution than the previous estimate.
In the realization of an FIR adaptive filter, the most extensively used algorithm
is the least mean-square (LMS) algorithm, summarized in Table 1. The LMS algorithm,
which was first introduced by Widrow and Hoff [97], is simply an approximate realization
of the steepest descent algorithm. The LMS algorithm uses a single-sample (instantaneous)
estimate as the approximation of the cross-correlation between the error signal and the input
vector, E{e[n]x[n]}. Despite this crude approximation, the adaptation characteristic of the LMS
algorithm is good enough in most applications. The computational complexity of the LMS
algorithm is O(K), where K is the number of filter taps.
One drawback of the LMS algorithm is its slow convergence rate for colored input
signals. The LMS algorithm converges in the mean to the vicinity of its optimal value
if and only if 0 < µ < 2/λmax, where λmax is the maximum eigenvalue of
the correlation matrix of the input samples. For colored input signals (a speech signal is the
typical example of a colored signal), λmax is much greater than 1 and consequently the
upper limit of the step size, µ, is severely restricted, limiting the convergence speed of the
LMS algorithm.
Table 1. The LMS algorithm has three equations: the filtering operation, the error calculation, and the
weight update operation. The computational complexity of the LMS algorithm is O(K), where K is the
number of filter taps.
The LMS algorithm
Filtering operation:  $y[n] = \sum_{k=0}^{K-1} w_k\, x[n-k]$
Error calculation:    $e[n] = d[n] - y[n]$
Weight adaptation:    $w_k[n+1] = w_k[n] + \mu\, e[n]\, x[n-k] \quad (0 \le k \le K-1)$
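The three equations of Table 1 can be exercised with a small system-identification experiment. The following is a minimal sketch rather than the implementation used in this work: the function names `fir_filter` and `lms`, the 4-tap "room" response, the white uniform input, and the step size are all illustrative assumptions.

```python
import random


def fir_filter(h, x):
    """Direct-form FIR filter y[n] = sum_k h[k] * x[n-k] with zero initial state."""
    taps = [0.0] * len(h)                 # taps[k] holds x[n-k]
    out = []
    for sample in x:
        taps = [sample] + taps[:-1]       # shift the newest sample in
        out.append(sum(hk * xk for hk, xk in zip(h, taps)))
    return out


def lms(x, d, num_taps, mu):
    """Apply the filtering, error, and weight-update equations of Table 1."""
    w = [0.0] * num_taps
    taps = [0.0] * num_taps
    errors = []
    for sample, desired in zip(x, d):
        taps = [sample] + taps[:-1]
        y = sum(wk * xk for wk, xk in zip(w, taps))          # filtering operation
        e = desired - y                                       # error calculation
        w = [wk + mu * e * xk for wk, xk in zip(w, taps)]     # weight adaptation
        errors.append(e)
    return w, errors


if __name__ == "__main__":
    random.seed(0)
    room = [0.5, -0.3, 0.2, 0.1]          # hypothetical echo path, not from the text
    x = [random.uniform(-1.0, 1.0) for _ in range(5000)]
    d = fir_filter(room, x)               # noise-free "microphone" signal
    w, _ = lms(x, d, num_taps=4, mu=0.1)
    print([round(wk, 4) for wk in w])
```

With a white, bounded input and no near-end signal, the weights should approach the assumed echo path; a colored input such as speech would be expected to converge more slowly for the same step size.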
In the standard LMS algorithm, the weight correction term is directly proportional to
the magnitude of the input samples. Therefore, when the magnitude of x[n−k] is large, the
LMS algorithm suffers from a “gradient noise amplification” [52, 53] problem. To overcome this
problem, the normalized LMS (NLMS) algorithm, shown in Table 2, was introduced. The NLMS
algorithm is a special implementation of the LMS algorithm that is more stable and
converges faster. The NLMS algorithm takes into account any variation in the signal
level of the input in the selection of the step size. The convergence of the NLMS
algorithm in the first and second moments is guaranteed for a stationary process when the
normalized step size, β, satisfies 0 < β < 2. Many variants of the LMS, including SS-LMS,
SD-LMS, SE-LMS [52, 53], LMS-Newton [35], and PNLMS [42], are also available in the
literature.
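The effect of the normalization can be illustrated with a short sketch. It uses the common textbook form of NLMS, in which the update is divided by the energy of the current input vector; this is an assumption made for illustration and differs from the recursive power estimate listed in Table 2. The function name `nlms`, the regularization constant `eps`, and the test signal are hypothetical.

```python
import random


def nlms(x, d, num_taps, beta, eps=1e-8):
    """Textbook NLMS: w <- w + beta * e * x_vec / (eps + ||x_vec||^2)."""
    w = [0.0] * num_taps
    taps = [0.0] * num_taps               # taps[k] holds x[n-k]
    errors = []
    for sample, desired in zip(x, d):
        taps = [sample] + taps[:-1]
        e = desired - sum(wk * xk for wk, xk in zip(w, taps))
        energy = sum(xk * xk for xk in taps)
        gain = beta * e / (eps + energy)  # step size scaled by the input level
        w = [wk + gain * xk for wk, xk in zip(w, taps)]
        errors.append(e)
    return w, errors


if __name__ == "__main__":
    random.seed(0)
    path = [0.4, -0.2, 0.1]               # hypothetical echo path
    # A deliberately large input level: the normalized step keeps the update stable.
    x = [100.0 * random.uniform(-1.0, 1.0) for _ in range(2000)]
    d = [sum(path[k] * x[n - k] for k in range(len(path)) if n - k >= 0)
         for n in range(len(x))]
    w, _ = nlms(x, d, num_taps=3, beta=0.5)
    print([round(wk, 4) for wk in w])
```

Because the correction is divided by the input energy, the same β can be used whether the input is scaled by 1 or by 100, which is the property the paragraph above describes.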
Table 2. The NLMS algorithm is a special implementation of the LMS algorithm that is more stable
and converges faster. It takes into account the variation in the signal level of the input signals
in the selection of the normalized step size, β. The convergence of the NLMS algorithm is guaranteed for a
stationary process when 0 < β < 2.
The NLMS algorithm
Filtering operation:  $y[n] = \sum_{k=0}^{K-1} w_k\, x[n-k]$
Error calculation:    $e[n] = d[n] - y[n]$
Weight adaptation:    $w_k[n+1] = w_k[n] + \beta\, \frac{x[n-k]}{\varepsilon + \hat{P}_x[n]}\, e[n] \quad (0 \le k \le K-1)$
Power estimation:     $\hat{P}_x[n+1] = \beta\, \hat{P}_x[n] + (1-\beta)\, x[n]^2$
Narayan et al. [73] proposed a transform-domain LMS algorithm to enhance the convergence
speed of the LMS algorithm by using an orthogonal transform to decorrelate
the input signal. A transform-domain adaptive filter converts a time-domain input signal
into a transform-domain input signal using an orthogonal data-independent transform
followed by a power normalization. The Karhunen-Loève transform (KLT) is the optimal
transform [9] in the sense that it performs an ideal decorrelation of the input data by
projecting them onto the eigenvectors of their autocorrelation matrix [35]. However, in
practice, the KLT is not the best choice since it is signal dependent, and the input statistics
are usually unavailable in advance. Many suboptimal orthogonal transforms such
as the discrete Fourier transform (DFT), discrete cosine transform (DCT), discrete Hartley
transform (DHT), and Walsh-Hadamard transform (WHT) have been proposed in the
literature [9, 35, 62, 73]. An O(K)-complexity transform-domain LMS using a sliding FFT
algorithm is also presented in [36].
The recursive least-squares (RLS) algorithm [35, 53] is also a popular adaptive filter
that recursively minimizes the error in the least-squares sense. The convergence speed of
the RLS is generally known to be faster than that of the LMS-based algorithms, but its
high computational complexity, O(K^2), is generally too expensive for a real-world system.
Numerous fast versions of the RLS algorithm [35] were developed, but they still have
instability problems resulting from multiple simplifications.
Ozeki and Umeda [75] developed the affine projection algorithm (APA), a generalization
of the NLMS algorithm, using affine subspace projections [42]. The APA updates its
weights based on the previous P input vectors, where P is referred to as the projection order,
while the NLMS algorithm updates its weights based on the current input vector only. The
complexity of the APA is 2KP + O(P^2), and the fast version of the APA, which is known as fast





















Figure 8. Block diagram of subband AEC (SAEC) for a teleconferencing application. The SAEC can
result in faster convergence than the LMS-based algorithms since each subband signal has a smaller
eigenvalue spread than the original wideband signal due to band-partitioning property of the analysis
filter bank.
lower computational complexity than the APA and a faster convergence speed than that of
the NLMS algorithm. Different versions of FAPs [27, 28, 63, 93] are also available in the
literature, depending on the way the inverse of the autocorrelation matrix is estimated.
When the required number of adaptive filter taps,K, is extremely large, as in the AEC
system, even a low complexity algorithm such as LMS can cause a problem for the real-
time processing of hands-free devices or a teleconferencing system. For a multi-channel
AEC, real-time processing becomes even more challenging. Among the many different
adaptive filters, subband AEC (SAEC) is very attractive for real-time processing applica-
tions. Figure 8 shows the synthesis-independent structure of the subband adaptive filter.
In SAEC, a number of subband adaptive filters can independently estimate subband echo
using subband far-end and microphone signals. By analyzing input statistics, different
adaptive filter algorithms can be adopted for different subbands to achieve a faster conver-
gence rate. Subband echoes are merged into a wideband echo using a synthesis filter bank,
and the wideband echo is transmitted to a far-end side. For applications in which a delay in
the signal path is critical, other subband AEC structures, referred to as synthesis-dependent
structures, can also be used [35, 72].
Subband processing ensures faster convergence for the LMS-based algorithms since it
reduces the eigenvalue spread of the autocorrelation function of the input process by band-
partitioning a wideband signal into a subband signal. Subband processing can result in
low-power computing by optimizing parallelism and employing dedicated hardware for re-
peated data transform. Furthermore, a significant amount of computational savings can be
achieved because of the reduced sampling rate, which is related to the number of subbands
and the properties of the analysis and synthesis filter banks. DFT filter banks are commonly
used for the efficient realization of the analysis and synthesis filter banks. Polyphase struc-
ture [42] and weighted-overlap-add (WOA) structures [22, 23, 35] have been used as DFT
filter banks for critical-sampling and over-sampling systems, respectively.
1.4.2 Double-Talk Detector
In real applications, the performance of the adaptive filter suffers significant degradation
in double-talk situations. Double talk occurs when the near-end and far-end talkers
speak simultaneously. In Figure 7, the near-end speech
signal, v[n], becomes dominant in the microphone signal, d[n], when a double-talk situation
occurs. This causes the adaptive filter to diverge since the correlation between the echo and
the microphone signal drops significantly. One common solution for this problem is to use
a double-talk detector (DTD). A DTD, in general, monitors the relationship among the various
signals of Figure 7. Once double talk is detected, the update of the adaptive filter can be either
completely stopped or at least slowed down, depending on the correlation between the echo
and the microphone signal.
The general procedure of most DTDs is described by the following [42]:
1. A detection statistic, ξ, is calculated using available signals, e.g., x[n], d[n], e[n],
and the estimated filter coefficients, w[n].
2. The detection statistic, ξ, is compared to a threshold, T, and double talk is declared
if ξ < T.
3. Once double talk is declared, the filter update is disabled for a minimum period of
time, Thold.
4. If ξ > T holds continuously over Thold, the filter resumes adaptation and the comparison
of ξ to T continues until ξ < T again.
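The four steps above amount to a threshold test with a hold timer. The following is a minimal sketch of that control logic only; how the detection statistic ξ is computed is left open, and the function name `dtd_adaptation_flags`, the threshold, and the hold length in samples are hypothetical.

```python
def dtd_adaptation_flags(statistics, threshold, hold):
    """Return, per sample, whether the adaptive filter is allowed to update.

    Double talk is declared when the statistic falls below the threshold.
    Adaptation then stays disabled until the statistic has remained at or
    above the threshold for `hold` samples, counting from the detection.
    """
    remaining = 0                 # samples left before adaptation may resume
    flags = []
    for xi in statistics:
        if xi < threshold:
            remaining = hold      # (re)start the hold period
        elif remaining > 0:
            remaining -= 1        # statistic is healthy, count the hold down
        flags.append(remaining == 0)
    return flags


if __name__ == "__main__":
    xi = [1.0, 0.2, 0.9, 0.9, 0.9, 0.9]
    print(dtd_adaptation_flags(xi, threshold=0.5, hold=2))
```

In this sketch a single low value of ξ freezes the update for two samples; any further detection during the hold period restarts the timer, so adaptation resumes only after ξ has stayed above T for the whole hold time.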
A number of DTDs have been developed, and an objective method of evaluating various
DTDs was proposed in [19]. The Geigel algorithm [29] was successful for a line echo; however,
it has not always been successful for an acoustic echo. The two-path model DTD [74]
is based on the combination of an adaptive background filter and a fixed foreground filter.
The foreground filter constantly models the acoustic echo. Whenever the background
filter performs better than the foreground filter, its filter coefficients are copied to the
foreground filter. Other methods based on cross-correlations [10, 77, 98] and coherence [39] appear
to be more popular for AEC applications. A few examples of the most widely used cross-correlations
are the cross-correlation coefficient between x[n] and d[n], ρxd[n], and the cross-correlation
coefficient between x[n] and e[n], ρxe[n]. The cross-correlations ρxd[n] and









where Px[n] is the power of the far-end input signal, Pe[n] is the power of the error signal,
Pd[n] is the power of the near-end microphone signal, Pxe[n] is the cross-power between
the far-end input and the error signal, and Pxd[n] is the cross-power between the far-end
input and the near-end microphone signal. The detection statistic, ξ, is then calculated from
these cross-correlation coefficients and compared to a threshold. The fundamental problem of these
methods is that these cross-correlations are not well normalized; thus, setting a universal
threshold for reliable performance in various situations is difficult. To address this problem,
“normalized cross-correlation,” a special case of the “generalized cross-correlation”
technique, is presented in [42].
CHAPTER 2
DISTRIBUTED ARITHMETIC FIR FILTER






where w_k could be fixed coefficients or time-varying coefficients. Multiply-and-accumulate
(MAC) is widely used for such filtering, but it is well known that MAC is expensive to implement
in hardware because of its complex logic and large area. Alternatively,
MAC operations may be replaced by a series of look-up-table (LUT) accesses and summations.
This implementation of the filter, known as distributed arithmetic (DA) [24, 78], achieves
higher throughput and lower logic complexity at the cost of increased memory usage. DA
provides a multiplier-less implementation of FIR filters through a bit-serial computation,
utilizing all possible combination sums of the filter coefficients [96].
2.1 LUT-Based DA FIR Filter
When input samples are represented as two’s-complement binary numbers scaled such that
|x[n − k]| < 1, then x[n − k] can be written as
\[ x[n-k] = -b_{k,0} + \sum_{l=1}^{B-1} b_{k,l}\, 2^{-l}, \tag{5} \]
where b_{k,l} ∈ {0, 1}, b_{k,0} is the sign bit, and b_{k,B−1} is the least significant bit (LSB). Now, y[n] of



































It is noted that C_l only takes one of 2^K possible combination sums of the filter coefficients
since b_{k,l} ∈ {0, 1}. These values can be pre-computed and stored in an LUT if the
filter coefficients are fixed. The filtering operation can then be carried out with B look-up,
shift, and accumulate operations.
One example of a typical DA implementation for a 4-tap (K = 4) FIR filter is shown
in Figure 9. Most DA architectures can be divided into three small units: the shift
register unit, the DA base unit, and the adder/shifter unit. For the traditional LUT-based
DA architecture of fixed filters, the DA base unit consists of only one LUT of 2^K words. As one
can see in Chapters 5 and 7, the DA base unit can be more complicated than just a single
LUT, but both the shift register unit and the adder/shifter unit are common to different
DA architectures. A ROM is often used as a realization of the LUT for fixed filters, and
a RAM is used for adaptive filters [4]. The shift register unit stores the K most recent input
samples, and its contents are shifted right at every system clock cycle. The concatenation
of the outputs of the shift register unit (the rightmost bits of the shift registers) becomes
the address of the LUT inside the DA base unit. The LSBs of the input samples are used first as
LUT addresses, and the LUT output is accumulated B consecutive times,
where B is the precision of the input data, each time being added to the previous accumulator
value shifted right by one bit. For the most significant bits (MSBs) of the input samples, the sign control, S1,
is set to 1, and thus the output of the LUT is subtracted from the right-shifted accumulator.
Since DA can complete the FIR filtering operation in B clock cycles regardless of K, DA
has been widely used for high-speed filter implementation, especially when K >> B.
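The look-up, shift, and accumulate sequence described above can be checked against a direct multiply-and-accumulate. The sketch below assumes the usual two's-complement DA decomposition, y[n] = −C_0 + Σ_{l=1}^{B−1} C_l 2^{−l} with C_l = Σ_k w_k b_{k,l}, and LUT addresses in which tap k drives address bit k. The function names `build_lut` and `da_fir_output` and the integer sample format are illustrative assumptions.

```python
def build_lut(w):
    """All 2^K combination sums of the coefficients; address bit k selects w[k]."""
    num_taps = len(w)
    return [sum(w[k] for k in range(num_taps) if (addr >> k) & 1)
            for addr in range(1 << num_taps)]


def da_fir_output(lut, samples, num_bits):
    """Bit-serial DA evaluation of sum_k w[k] * x[n-k].

    samples[k] is the integer two's-complement code of x[n-k], in the range
    [-2**(B-1), 2**(B-1) - 1], representing the value samples[k] / 2**(B-1).
    """
    mask = (1 << num_bits) - 1
    codes = [q & mask for q in samples]          # B-bit patterns
    acc = 0.0
    for shift in range(num_bits):                # LSBs first, sign bits last
        addr = 0
        for k, code in enumerate(codes):
            addr |= ((code >> shift) & 1) << k   # one bit from every tap
        if shift == num_bits - 1:
            acc = acc / 2 - lut[addr]            # sign bits: subtract (S1 = 1)
        else:
            acc = acc / 2 + lut[addr]            # shift right, then accumulate
    return acc


if __name__ == "__main__":
    w = [0.5, -0.25, 0.125, 0.75]                # hypothetical coefficients
    samples = [100, -128, 37, -1]                # 8-bit codes of x[n], ..., x[n-3]
    lut = build_lut(w)
    direct = sum(wk * q / 128 for wk, q in zip(w, samples))
    print(da_fir_output(lut, samples, 8), direct)
```

The loop runs B times regardless of the number of taps, while the LUT holds 2^K words, which is the throughput-for-memory trade described in the text.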
b3 b2 b1 b0 data
0  0  0  0   0
0  0  0  1   w0
0  0  1  0   w1
0  0  1  1   w0+w1
0  1  0  0   w2
0  1  0  1   w0+w2
0  1  1  0   w1+w2
0  1  1  1   w0+w1+w2
1  0  0  0   w3
1  0  0  1   w0+w3
1  0  1  0   w1+w3
1  0  1  1   w0+w1+w3
1  1  0  0   w2+w3
1  1  0  1   w0+w2+w3
1  1  1  0   w1+w2+w3
1  1  1  1   w0+w1+w2+w3
Figure 9. Block diagram of a 4-tap (K = 4) DA FIR filter. The DA block consists of three small units: the
shift register unit, the DA base unit, and the adder/shifter unit. Each coefficient has B bits of precision.
As one can see in Chapters 5 and 7, the DA base unit can be more complicated than just a single LUT,
but both the shift register unit and the adder/shifter unit are common to different DA architectures.
A ROM is often used as a realization of the LUT for fixed filters, and a RAM is used for adaptive
filters.
2.2 Adder-Based DA FIR Filter
As an alternative implementation of a memory-intensive LUT-based DA, an adder-based
DA was first proposed in [18] for a one-dimensional IDCT processor and further expanded
to a two-dimensional IDCT processor [14]. The adder-based DA architecture was also
attempted in [15] for LUT-based FPGAs with “vertical subexpression sharing” to minimize
area.
The conventional LUT-based DA architecture decomposes the input signal into bit-serial
form (Equation (5)) and stores all possible combination sums of the filter coefficients
in the LUT. In contrast, the adder-based DA architecture proposed in [14, 15, 18]
decomposes the filter coefficients into bit-serial form. The filter coefficients, w_k, can be written as





where a_{k,l} ∈ {0, 1}, a_{k,0} is the sign bit, B_c is the word length of the filter coefficients, and


































The main difference between LUT-based and adder-based DA architectures is that the adder-based
DA uses an adder network instead of a pre-computed LUT to implement A_l. Unlike
C_l, in which b_{k,l} cannot be known in advance, A_l can be constructed using an adder network
since the bit representations of the filter coefficients, a_{k,l}, are fixed values for all k and l.
Adder-based DA, as shown in [14, 15, 18], can be more area efficient than LUT-based DA for
the following reasons. First, the adder-based DA architecture constructs adders only for non-zero
bits and thus saves more area when the filter coefficients contain fewer non-zero bits, whereas the size
of LUT-based DA is not influenced by the number of non-zero bits of the filter coefficients.
Second, the area of the adder-based DA architecture can be further minimized
by bit-level common subexpression sharing [16, 17, 47]. For instance, word-level common
subexpression sharing can be explained with the following example. Consider the filtering operation
\[ y[n] = x[n] + (x[n] \ll 1) + (x[n] \ll 3) + x[n-1] + (x[n-1] \ll 2) + x[n-2] + (x[n-2] \ll 1), \tag{11} \]
where ‘<< b’ denotes a b-digit shift-left operation. If we define w[n] as
\[ w[n] = (x[n] \ll 1) + x[n-1], \tag{12} \]
then Equation (11) can be represented as
\[ y[n] = w[n] + (w[n-1] \ll 1) + x[n] + (x[n] \ll 3) + x[n-2]. \tag{13} \]
Direct implementation of Equation (11) requires six adders, while the implementation of
Equation (13) requires only four adders in addition to the single adder that forms w[n] in
Equation (12). The additional intermediate delay problem [47]
was addressed by the introduction of vertical subexpression sharing [15], ensuring bit-level
computations and communications suitable for fine-grained (bit-level) FPGAs.
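The equivalence of Equations (11) and (13) can be confirmed numerically. The following sketch evaluates both forms on integer samples; the function names `direct_form` and `shared_form` are illustrative, and the shifts are parenthesized explicitly because `<<` binds more loosely than `+` in Python.

```python
def direct_form(x, n):
    """Equation (11): seven terms, six additions."""
    return (x[n] + (x[n] << 1) + (x[n] << 3)
            + x[n - 1] + (x[n - 1] << 2)
            + x[n - 2] + (x[n - 2] << 1))


def shared_form(x, n):
    """Equations (12) and (13): the subexpression w[n] is formed once and reused."""
    def w(i):
        return (x[i] << 1) + x[i - 1]            # Equation (12)
    return w(n) + (w(n - 1) << 1) + x[n] + (x[n] << 3) + x[n - 2]


if __name__ == "__main__":
    x = [3, -7, 12, 5, -1, 8]
    for n in range(2, len(x)):                   # n >= 2 so that x[n-2] exists
        print(n, direct_form(x, n), shared_form(x, n))
```

Both forms reduce to 11 x[n] + 5 x[n−1] + 3 x[n−2]; in hardware, w[n−1] would be obtained from a delay register rather than recomputed as it is in this sketch.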
2.3 DA for High-Order FIR Filter
It must be noted that, as filter size increases, the memory requirements of LUT-based DA
architecture grow exponentially. This problem may be alleviated by breaking a larger DA
base unit into smaller DA base units that require tractable memory sizes and then summing
the outputs of these units. This technique has been referred to as a “multiple memory bank,”
or the “partial sum technique” [103].
The summation in the square brackets in Equation (7) may be split so that the K-tap filter
is divided into m smaller filters, each with a k-tap DA base unit (K = m × k). Here it is





















 2−l . (14)
The terms in parentheses in Equation (14) may be implemented using m DA base units,
each implementing the expression in brackets. In Figure 10, the DA architecture based on
Equation (14) is shown for a 4-tap FIR filter when m = 2 and k = 2. In addition to the three
units of the original DA architecture, an adder tree unit for the m partial sums is inserted into
the K-tap filter consisting of m k-tap DA base units. The area-efficient DA architecture for
the partial sum technique should satisfy the following conditions: first, that the
control of the adder/shifter unit be independent of the input samples stored in the shift
register unit; second, that the control of the adder/shifter unit be independent of the filter
coefficients; third, that a global adder/shifter unit and a global adder/shifter be employed;
fourth, that each k-tap DA base unit, storing a unique set of k filter coefficients, be
independent of the remaining DA base units; and finally, that the adder tree unit, consisting











b1 b0 data
0  0   0
0  1   w0
1  0   w1
1  1   w0+w1
b3 b2 data
0  0   0
0  1   w2
1  0   w3
1  1   w2+w3
Figure 10. Block diagram of a 4-tap DA FIR filter when m = 2 and k = 2. The DA block, which uses the
partial sum technique, consists of four small units: the shift register unit, the DA base unit, the adder
tree unit, and the adder/shifter unit.
For the K-tap filter divided into m units of k-tap DA base units (K = m × k), the total
memory requirement would be m × 2^k memory words. The total number of clock cycles
required for this implementation would be B + ⌈log2(m)⌉; the additional second term is the
number of clock cycles required for the adder tree unit to calculate the sum of the DA base
unit outputs. Thus, the decrease in throughput of this implementation is marginal. For instance,
if K = 128, then instead of 2^128 memory words in a full LUT implementation, one can choose k = 4
and m = 32, which would require only 32 × 2^4 = 512 memory words. With B = 16, the number of
clock cycles required for this implementation would be 16 + ⌈log2(32)⌉ = 21 clock cycles compared
to 16 clock cycles for a single LUT implementation.
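The memory and latency figures quoted above follow directly from the two expressions m × 2^k and B + ⌈log2(m)⌉. A small sketch makes the trade easy to explore for other partitions; the function name `partial_sum_cost` is an illustrative assumption.

```python
def partial_sum_cost(num_taps, taps_per_unit, input_bits):
    """Memory words and clock cycles for a K-tap DA filter split into k-tap units."""
    if num_taps % taps_per_unit:
        raise ValueError("K must be a multiple of k")
    m = num_taps // taps_per_unit                 # number of DA base units
    words = m * 2 ** taps_per_unit                # m LUTs of 2^k words each
    adder_tree_cycles = (m - 1).bit_length()      # equals ceil(log2(m)) for m >= 1
    return words, input_bits + adder_tree_cycles


if __name__ == "__main__":
    print(partial_sum_cost(128, 4, 16))           # the K = 128, k = 4 example
    print(partial_sum_cost(128, 128, 16))         # a single full LUT
```

For K = 128 and B = 16, the partition k = 4 gives 512 words and 21 cycles, against 2^128 words and 16 cycles for the single LUT, matching the numbers in the text.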
2.4 DA-Offset Binary Coding (DA-OBC)
The LUT size can be reduced by interpreting input samples as the offset binary code,
{1, −1}. This method is called “DA-offset binary coding (DA-OBC)” [20, 55, 96]. In DA-OBC,
x[n − k] is written as
\[ x[n-k] = \frac{1}{2}\bigl[\, x[n-k] - (-x[n-k]) \,\bigr], \tag{15} \]
and the negative of x[n − k] is written as
\[ -x[n-k] = -\bar{b}_{k,0} + \sum_{l=1}^{B-1} \bar{b}_{k,l}\, 2^{-l} + 2^{-(B-1)}, \tag{16} \]
where the overscore indicates the one’s complement of a binary bit. Plugging Equation (16)
into Equation (15) yields
\[ x[n-k] = \frac{1}{2}\Bigl[ -(b_{k,0} - \bar{b}_{k,0}) + \sum_{l=1}^{B-1} (b_{k,l} - \bar{b}_{k,l})\, 2^{-l} - 2^{-(B-1)} \Bigr]. \tag{17} \]






























































−l + Dinit2−(B−1). (18)
Now, D_l also can be pre-computed and stored in an LUT since D_l only takes one of 2^K
possible combination sums of filter coefficients with d_{k,l} ∈ {−1, 1}. Direct LUT implementation
of D_l is shown in Table 3. The size of Table 3 can be reduced by half due to the fact
that the lower half of the LUT is a mirror image of the upper half of the LUT, but with
reversed signs.
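The mirror property of Table 3 can be verified by generating the table and reading it back through a half-size LUT. The sketch below assumes the mapping visible in Table 3, in which address bit k equal to 0 contributes −w_k/2 and address bit k equal to 1 contributes +w_k/2; the function names `obc_lut` and `obc_lookup` are hypothetical, and the XOR-based address folding is one plausible way to exploit the symmetry.

```python
def obc_lut(w):
    """Full 2^K-word DA-OBC table: entry = (1/2) * sum_k w[k] * d_k, d_k in {-1, +1}."""
    num_taps = len(w)
    return [sum(wk * (1 if (addr >> k) & 1 else -1) for k, wk in enumerate(w)) / 2
            for addr in range(1 << num_taps)]


def obc_lookup(half_lut, addr, num_taps):
    """Read the full table through its upper half (addresses with the top bit 0)."""
    low_mask = (1 << (num_taps - 1)) - 1
    low = addr & low_mask
    if (addr >> (num_taps - 1)) & 1:
        # Top bit set: complement the remaining bits (one XOR per bit) and negate.
        return -half_lut[low ^ low_mask]
    return half_lut[low]


if __name__ == "__main__":
    w = [1.0, 2.0, 4.0, 8.0]                      # hypothetical w0..w3
    full = obc_lut(w)
    half = full[:8]                               # 2^(K-1) words
    print(all(obc_lookup(half, a, 4) == full[a] for a in range(16)))
```

Each entry equals the negative of the entry at the bit-complemented address, so only 2^(K−1) words need to be stored and the sign is restored outside the LUT.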
Figure 11 illustrates the block diagram of DA-OBC for a 4-tap FIR filter. The LUT
size of DA-OBC is half of the LUT size of DA, obtained by exploiting the mirror image of Table 3
at the cost of an additional four (K) XORs. The control signal, S2, is only set to 1 for the LSBs
of the input samples to select D_init. Unlike the original LUT-based DA architecture, the
adder/shifter unit of DA-OBC is directly controlled by an input signal through an XOR and
the sum of the weight coefficients through a 2x1 MUX. Thus, the DA-OBC architecture is not
suitable for an area-efficient modular approach such as the partial sum technique without
modification.
Table 3. Original LUT contents of DA-OBC. The LUT size can be reduced by half with additional
circuitry using the observation that the lower half of the LUT is a mirror image of the upper half of the
LUT, but with reversed signs.
Addresses
b3,n b2,n b1,n b0,n Memory Contents
0 0 0 0 −(w3 + w2 + w1 + w0)/2
0 0 0 1 −(w3 + w2 + w1 − w0)/2
0 0 1 0 −(w3 + w2 − w1 + w0)/2
0 0 1 1 −(w3 + w2 − w1 − w0)/2
0 1 0 0 −(w3 − w2 + w1 + w0)/2
0 1 0 1 −(w3 − w2 + w1 − w0)/2
0 1 1 0 −(w3 − w2 − w1 + w0)/2
0 1 1 1 −(w3 − w2 − w1 − w0)/2
1 0 0 0 (w3 − w2 − w1 − w0)/2
1 0 0 1 (w3 − w2 − w1 + w0)/2
1 0 1 0 (w3 − w2 + w1 − w0)/2
1 0 1 1 (w3 − w2 + w1 + w0)/2
1 1 0 0 (w3 + w2 − w1 − w0)/2
1 1 0 1 (w3 + w2 − w1 + w0)/2
1 1 1 0 (w3 + w2 + w1 − w0)/2
1 1 1 1 (w3 + w2 + w1 + w0)/2
c2 c1 c0 data
0  0  0   -(w3 + w2 + w1 + w0)/2
0  0  1   -(w3 + w2 + w1 - w0)/2
0  1  0   -(w3 + w2 - w1 + w0)/2
0  1  1   -(w3 + w2 - w1 - w0)/2
1  0  0   -(w3 - w2 + w1 + w0)/2
1  0  1   -(w3 - w2 + w1 - w0)/2
1  1  0   -(w3 - w2 - w1 + w0)/2
1  1  1   -(w3 - w2 - w1 - w0)/2
Dinit = -(w3 + w2 + w1 + w0)/2
2^3-word LUT of DA-OBC
Figure 11. Block diagram of a 4-tap (K = 4) DA-OBC FIR filter. The LUT size of DA-OBC is half of
the LUT size of DA using the mirror image of Table 3 at the cost of an additional four (K) XORs. The
control signal, S2, is only set to 1 for the LSBs of the input samples to select Dinit.
2.5 High-Speed DA
Additional speed-up of the DA implementation is possible by using n bit-at-a-time (BAAT)
access to the LUT with n > 1. The number of clock cycles required for a filtering operation can be
reduced from B to ⌈B/n⌉. However, because the memory size grows even faster than in
1BAAT DA, to the best of our knowledge, no hardware implementation
of nBAAT DA (n > 1) has been reported in the literature for relatively high-order digital
filters. The LUT size of nBAAT DA for a K-tap FIR filter increases substantially
from 2^K to 2^{nK} words with a single LUT-based DA, and from m × 2^k to m × 2^{nk} words with the
partial sum technique, having m k-tap DA base units (K = m × k). For instance,
the size of a single LUT for a 4-tap FIR filter increases from 2^4 words for 1BAAT access












b3 b2 b1 b0 data
0  0  0  0   0
0  0  0  1   w0
0  0  1  0   2w0
0  0  1  1   3w0
0  1  0  0   w1
0  1  0  1   w1+w0
0  1  1  0   w1+2w0
0  1  1  1   w1+3w0
1  0  0  0   2w1
1  0  0  1   2w1+w0
1  0  1  0   2w1+2w0
1  0  1  1   2w1+3w0
1  1  0  0   3w1
1  1  0  1   3w1+w0
1  1  1  0   3w1+2w0
1  1  1  1   3w1+3w0
2^4-word LUT of 2BAAT DA
Figure 12. Block diagram of a 2-tap DA FIR filter with 2BAAT access to the LUT. The filtering operation
can terminate in ⌈B/n⌉ clock cycles at the expense of exponentially increasing memory usage.
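The growth in LUT size with n, and the contents of the 2BAAT table in Figure 12, can be reproduced with a short sketch. It covers only the unsigned digit values stored in the LUT; the handling of the digit that contains the sign bit is not modelled. The function names `baat_cost` and `build_baat_lut` are hypothetical.

```python
def baat_cost(num_taps, input_bits, n):
    """LUT words and clock cycles for a single-LUT DA filter with n bits at a time."""
    words = 2 ** (n * num_taps)
    cycles = -(-input_bits // n)                  # ceil(B / n)
    return words, cycles


def build_baat_lut(w, n):
    """Entry = sum_k w[k] * digit_k, where digit_k is the n-bit field of tap k."""
    num_taps = len(w)
    field_mask = (1 << n) - 1
    return [sum(w[k] * ((addr >> (n * k)) & field_mask) for k in range(num_taps))
            for addr in range(1 << (n * num_taps))]


if __name__ == "__main__":
    print(baat_cost(4, 16, 1), baat_cost(4, 16, 2))
    lut = build_baat_lut([1.0, 10.0], 2)          # hypothetical w0 = 1, w1 = 10
    print(lut[0b0111])                            # address 0111 holds w1 + 3*w0
```

For a 2-tap filter with n = 2 the table has 2^4 words, as in Figure 12, while a 4-tap filter already needs 2^8 words.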
filter is shown in Figure 12. More architectures that combine the DA-OBC and nBAAT




3.1 Background of ANS
While most noise suppression methods focus on discrete-time audio signal processing,
a continuous-time ANS based on the CADSP framework is introduced in this section.
The system, which is for single-channel background audio noise suppression, is unique
in that it is implemented in programmable analog VLSI circuits operating at a subthreshold
range [68] for low-power consumption. By performing a significant portion of the process-
ing in low-power analog circuits [50], the overall functionality of an entire system can be
enhanced by utilizing analog/digital computation in a mutually beneficial way.
3.2 Continuous-Time ANS
A common model for a noisy signal, x(t), is a signal, s(t), plus additive noise, n(t), that is
uncorrelated with the signal:
\[ x(t) = s(t) + n(t). \tag{19} \]
The goal of the research is to design a real-time, low-power system that generates some
optimal estimate, ŝ(t), of s(t) from x(t). Additive background noise is assumed to be uncorrelated
with the signal and stationary over a long period of time, relative to the short-term stationary
patterns of normal speech. The signal estimate, ŝ(t), can be found in the frequency domain
by spectral subtraction [12, 34, 86] or by applying a Wiener filter gain [32, 66]. The Wiener





where Γ^2(ω) = Φ_s(ω)/Φ_n(ω), with Φ_s(ω) and Φ_n(ω) representing the power spectral densities
(PSDs) of the signal and noise, respectively.
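As an illustration of how the ratio Γ^2(ω) drives the gain, the sketch below evaluates the standard textbook Wiener gain, Φ_s/(Φ_s + Φ_n) = Γ^2/(1 + Γ^2), at a single frequency. Using this particular form here is an assumption, and the function name `wiener_gain` is hypothetical.

```python
def wiener_gain(signal_psd, noise_psd):
    """Textbook Wiener gain at one frequency: Gamma^2 / (1 + Gamma^2)."""
    if noise_psd <= 0:
        raise ValueError("noise PSD must be positive")
    snr = signal_psd / noise_psd                  # Gamma^2 = Phi_s / Phi_n
    return snr / (1.0 + snr)


if __name__ == "__main__":
    for snr_db in (-20, 0, 20):
        phi_s = 10 ** (snr_db / 10)               # signal PSD for a unit noise PSD
        print(snr_db, round(wiener_gain(phi_s, 1.0), 4))
```

The gain approaches 0 where noise dominates and 1 where the signal dominates, which is the behaviour a per-band gain function needs to approximate.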


























Figure 13. Block diagram of the continuous-time ANS system. At each subband, gain is calculated
from the non-linear gain function. The gain is then multiplied by a subband signal and the results are
summed to build a full-band signal estimate.
spectral modification techniques using a continuous-time speech signal. The method used
in the research [7, 30] is based on Schroeder’s [82, 83] and Diethorn’s [26] noise suppression
systems, but differs in its filter bank structure, type of gain function, method of
estimating the noise, and method of estimating the subband gain.
Frequency-domain processing is accomplished using a one-third octave filter bank. In
each subband, a noisy signal envelope is detected and smoothed. From the smoothed sub-
band signal envelope, the noise envelope is estimated in each subband independently. The
SNR in each band is estimated from the noisy signal and noise envelopes. A non-linear
(sigmoid) gain function is used to approximate the Wiener gain. Finally, the original band-
limited signal in each band is multiplied by its corresponding gain, and the result is summed
to construct the full-band “clean” signal estimate. The overall structure of the system is
shown in Figure 13, with a more detailed view of the gain calculation block of a single

























Figure 14. Detailed view of the subband gain calculation block. Within each frequency band, the noisy
signal envelope is estimated using a peak detector. Based on the voltage output of the peak detector, the
noise level is estimated using a minimum detector. A translinear division circuit outputs a current that
represents the estimated SNR. A nonlinear function is applied to the SNR current, and the resulting
gain factor is multiplied with the band-limited noisy signal to produce a band-limited “clean” signal.
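The chain of Figure 14 (peak detector, minimum detector, division, non-linear gain) can be mimicked in discrete time for a single band. The following is only a behavioural sketch under stated assumptions: the actual system is a continuous-time analog circuit, and the function name `subband_gain`, the attack, decay, and rise constants, and the logistic form and parameters of the sigmoid are all hypothetical.

```python
import math


def subband_gain(band, attack=0.5, decay=0.001, noise_rise=0.0005,
                 slope=4.0, midpoint=2.0):
    """Per-sample gain for one band: envelope -> noise floor -> SNR -> sigmoid."""
    envelope = 0.0
    noise = None
    gains = []
    for sample in band:
        magnitude = abs(sample)
        # Peak detector: rise quickly towards peaks, leak down slowly.
        if magnitude > envelope:
            envelope += attack * (magnitude - envelope)
        else:
            envelope -= decay * envelope
        # Minimum detector: follow the envelope down at once, creep up slowly.
        if noise is None or envelope < noise:
            noise = envelope
        else:
            noise += noise_rise * (envelope - noise)
        snr = envelope / (noise + 1e-12)          # division stage
        gains.append(1.0 / (1.0 + math.exp(-slope * (snr - midpoint))))
    return gains


if __name__ == "__main__":
    steady_noise = [0.1 * (-1) ** i for i in range(5000)]
    loud_burst = [1.0 * (-1) ** i for i in range(200)]
    gains = subband_gain(steady_noise + loud_burst)
    print(round(gains[4999], 3), round(gains[-1], 3))
```

While the band holds only a steady level, the noise estimate catches up with the envelope and the gain stays low; a sudden louder burst raises the estimated SNR and opens the gain, after which the band signal would be multiplied by this gain and summed with the other bands.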
3.3 Analog VLSI Implementation
A continuous-time ANS algorithm has been fabricated on a 0.5 µm CMOS VLSI chip. Since
each subband operates independently of the others in the array, we have 32 parallel signal
processors operating simultaneously on 32 band-limited signals. The following sections
elaborate on the algorithm and relate the circuits that perform the underlying functions.
3.3.1 Frequency Decomposition
The filter bank in Figure 13 separates the noisy signal into 32 subbands whose center frequencies
are logarithmically spaced, similar to the human auditory system [6], resulting in
approximately one-third octave spacing in the center frequencies of the filter bank. With
one-third octave filters, any frequency distortion whose bandwidth is on the order of
the bandwidth of a band also lies within a single critical band, which minimizes its
perceptual impact [6].
3.3.1.1 Band-pass filter element
The implementation of the band-pass filter used to separate a wideband signal into subband
signals is based on the capacitively-coupled current conveyer (C4) presented in [49], char-
acterized in [44, 87], and shown in Figure 15(a). The C4 is a capacitively-based band-pass
























































Figure 15. (a) The C4 takes on the form of a band-pass filter within the region of interest with ±20 dB/decade
slopes outside the pass band. The mid-band gain is −C1/C2. (b) The schematic of the C4 second-order
section (C4 SOS). Tuning of the bias current sources is accomplished by programming floating-gate
transistors to the desired current.





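\[ \frac{V_{out}}{V_{in}} = -\frac{C_1}{C_2}\, \frac{s\tau_l\,(1 - s\tau_f)}{s^2 \tau_h \tau_l + s\left(\tau_l + \tau_f\left(\frac{C_O}{\kappa C_2} - 1\right)\right) + 1}, \tag{21} \]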













and the total capacitance, C_T, and the output capacitance, C_O, are defined as C_T = C_1 +
C_2 + C_W and C_O = C_2 + C_L, respectively. The currents I_τl and I_τh are the currents through
M2 and M3, respectively, as shown in Figure 15(a). With normal usage, τ_f is so fast that the
zero it produces lies far outside of the operating range. Hence, the C4 takes on the form of
a band-pass filter within the region of interest with ±20 dB/decade slopes outside the pass
band. The mid-band gain is −C_1/C_2.
Removing the feedback capacitor C_2, a configuration called a vanilla C4, transforms
the C4 into a high-gain filter with a much larger Q peak value [87]. The C4 second-order
section (C4 SOS) [31, 45], shown in Figure 15(b), is simply a cascade of two vanilla C4s.
By tuning each of the vanilla C4s so that they have identical time
constants, the overall response of the C4 SOS has ±40 dB/decade slopes outside the pass
band and a potentially high Q peak. A high Q peak is useful in many applications because
it helps isolate the respective center frequency. More information about the details of the
C4 SOS is available in [45]. Tuning of the bias currents is accomplished by programming
floating-gate transistors to the desired current, shown in Figure 15(b) as current sources.
Figure 16 shows the magnitude responses of the C4 and the C4 SOS. The C4 has ±20 dB/decade
slopes outside the pass band, while the C4 SOS has ±40 dB/decade slopes outside the pass band.
Figure 16. Magnitude responses of C4 and C4 SOS. C4 takes on the form of a band-pass filter within
the region of interest with ±20 dB/decade slopes outside the pass band. The mid-band gain is −C1/C2.
By tuning each of the vanilla C4s so that they have identical time constants, the overall response of the
C4 SOS has ±40 dB/decade slopes outside the pass band and a potentially high Q peak.
3.3.1.2 Programming of the filter bank
Programming a large number of floating-gate devices, as would be required for a programmable
filter bank, requires that the floating-gate transistors be arranged in an array for ease
of programming, as shown in Figure 17 [48, 89]. Tunnelling can be used to program
currents accurately, but selectivity is not completely controllable. When one element is
tunnelled, the charge of all the other devices in the array will be altered. As a result, the
tunnelling operation is reserved for “erasing” a charge that is already on the floating gate
and not for very accurate programming.
Figure 17. Architecture used for programming the arrays of floating-gate devices. Two conditions
must exist for injection to occur: (1) a high source-to-drain field to make the electrons “hot” and (2)
a channel for device current to flow. Because these conditions can be created orthogonally to each
other with source-to-drain voltage and gate voltage (to modulate the channel), a single element can be
selected for injection or measurement.
Hot-electron injection, in contrast, allows complete selectivity of an individual element.
Hence, injection is used for precise and accurate programming of floating-gate arrays [89].
The method of programming by injection, depicted in Figure 17, is explained as follows.
After a specific floating-gate device is selected, the gate lines of all the columns not
containing that device are connected to VDD. Then the drains of all the rows not containing
the selected element are also connected to VDD. Therefore, the gates, the drains, or both
of all the non-selected elements are connected to VDD, ensuring that no appreciable current
flows in any of the other devices, so they cannot be injected and their floating-gate
charge cannot be changed. A voltage can be supplied to the input of the selected element,
allowing current to flow. Finally, the drain of the selected element is pulsed down so that
the source-to-drain voltage is temporarily large. The two criteria for injection are then
met, so electrons are added to the floating gate of the selected element. Any element in
the array can be chosen and programmed in this way, and when all the currents have been
set to the desired values, the terminals of the transistors are connected to the rest of the
circuit, in which they operate as fully functional transistors.
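
The selection rule described above can be sketched in a few lines of Python. This is only an illustrative model and not the programming software used in this work: the function names, the string labels for the line states, and the simplification that a line tied to VDD disables the corresponding injection condition are assumptions made for the example.

```python
def line_states(n_rows, n_cols, sel_row, sel_col):
    """Return (gate line state per column, drain line state per row)."""
    # Gate lines of all columns except the selected one are tied to VDD.
    gates = ["input" if c == sel_col else "VDD" for c in range(n_cols)]
    # Drain lines of all rows except the selected one are tied to VDD.
    drains = ["pulsed_low" if r == sel_row else "VDD" for r in range(n_rows)]
    return gates, drains


def injected_elements(n_rows, n_cols, sel_row, sel_col):
    """List the (row, col) elements that satisfy both injection conditions."""
    gates, drains = line_states(n_rows, n_cols, sel_row, sel_col)
    hits = []
    for r in range(n_rows):
        for c in range(n_cols):
            has_channel = gates[c] != "VDD"  # gate at VDD: no appreciable current
            has_field = drains[r] != "VDD"   # drain at VDD: no large source-to-drain field
            if has_channel and has_field:
                hits.append((r, c))
    return hits
```

Under these assumptions, only the element at the intersection of the selected row and column meets both conditions, which is the orthogonality that Figure 17 relies on.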
A filter bank of 32 C4 SOSs was fabricated in a 0.5 µm process available through
MOSIS [46]. While the band-pass filters can be programmed to any desired center-frequency
spacing and bandwidth, they were programmed with exponentially spaced center frequencies,
narrow bandwidths, and moderate Qs, as this type of configuration is highly advantageous
in audio signal processing. This configuration, which closely models the biology of the
human cochlea, allows frequency subbands to be manipulated independently because the
frequency decomposition occurs in real time.
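
As a small illustration of exponential spacing, the sketch below computes center frequencies with a constant ratio between neighboring bands. The function name and the 100 Hz to 10 kHz range are hypothetical values chosen for the example; the actual programmed frequencies are those shown in Figure 18.

```python
def exponential_center_frequencies(f_low, f_high, n_bands):
    """Return n_bands center frequencies spaced by a constant ratio."""
    if n_bands < 2:
        return [float(f_low)]
    ratio = (f_high / f_low) ** (1.0 / (n_bands - 1))
    return [f_low * ratio ** k for k in range(n_bands)]


# Hypothetical range for a 32-band audio filter bank.
centers = exponential_center_frequencies(100.0, 10000.0, 32)
```

Because each frequency is a fixed multiple of the previous one, the bands are equally spaced on a logarithmic frequency axis.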
Figure 18 shows the frequency responses of each of the 32 filter taps. Only the output
from the first stage (a single C4) is shown. The filter taps are programmed so that the Iτl
and Iτh currents of the taps are exponentially spaced, with 95% programming accuracy.
Figure 18(a) shows the magnitude response of the original filter bank [45], which was
programmed using large resistive lines to bias the transistors, and Figure 18(b) shows the
enhanced magnitude response of the filter bank whose corner frequencies are programmed
using floating-gate transistors. Figure 18(b) illustrates a fundamental design issue with
analog circuits: regardless of how accurately biases can be set, circuit performance is affected
by mismatches that occur during the fabrication process. The programmable band-pass filter
bank is monotonically spaced, and both the monotonicity and the spacing are set simply by
programming the floating-gate biases. Figure 18(b) shows that the corner frequencies are not
perfectly spaced, but that is of no concern because the errors due to the mismatch of
transistor and capacitor sizes can be programmed out with the floating gates.
3.3.2 Signal and Noise Level Estimation
After the incoming noisy signal is band limited by the C4 SOS filter bank, a gain factor
is calculated based on the characteristics of the noisy signal in each subband. The
first step in the gain calculation is to find the noisy signal envelope estimate, êx(k, t), when











































Figure 18. Magnitude responses of filter bank of 32 programmable vanilla C4s. (a) Magnitude re-
sponse of the original filter bank [45], which is programmed using large resistive lines to bias the
transistors. (b) Magnitude response of the enhanced filter bank in which the time constants are set by
floating-gate transistors. Corner frequencies are programmed to be exponentially spaced with 95%
programming accuracy. This programmed filter bank shows a marked improvement over the original,
non-programmable filter bank.
vx(k, t), a band-limited signal having a relatively flat magnitude and ex(k, t), the envelope
variation over time. The noise envelope estimate, ên(k, t), is then calculated using a
technique closely related to the minimum statistics method [65], whereby the noise is found
from the envelope of the noisy signal. The SNR of each subband is estimated from both










where êx(k, t) is represented as the sum of the signal envelope estimate, ês(k, t), and the
noise envelope estimate, ên(k, t).
3.3.2.1 Noisy signal envelope estimate using a peak detector
A peak detector is used to estimate the envelope of each noisy subband signal [31]. When
a speech signal is active in the input, the peak detector follows the envelope of the signal,
rising rapidly with increasing signal amplitude and decaying slowly enough to result in a
smooth envelope. The peak detector also follows the level of the additive noise, particularly
in times when the signal is absent. Each peak detector has a programmable time constant



























Figure 19. (a) The peak detector circuit shown in this schematic tracks the incoming band-limited
noisy signal, yielding the noisy signal envelope estimate. The two outputs of this circuit are the current
Isignal, which goes into the division circuit, and the voltage Vsignal, which is the input to the minimum
detector. The bias voltage Vτd sets the time constant at which the output decays after a peak. (b)
Step responses of the peak detector for various time constants.
so that an appropriate smoothed envelope can be determined for each of the bands [100].
When an initial noisy signal envelope estimate, êx(k, t), is available, it is averaged
via low-pass filtering. Other averaging methods are analyzed in [42]. Averaging is
beneficial to the noise envelope estimate, which is assumed to be stationary. The averaged noisy
signal estimate, ēx(k, t), which is used to obtain a smoothed SNR, is critical to
making the ANS system robust to artifacts such as musical noise and missing
consonants. The circuit diagram of the peak detector block [31] and its step responses for various
time constants are shown in Figure 19.
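
The behavior of the peak detector (fast rise, slow decay) can be mimicked in discrete time with a one-pole follower that switches between two time constants. The sketch below is a behavioral approximation only, not a model of the circuit in Figure 19; the function name, the sampling rate, and the two time-constant parameters are assumptions made for illustration.

```python
import math


def peak_detector(samples, fs, tau_attack, tau_decay):
    """Track the envelope of `samples` with a fast attack and a slow decay."""
    a_att = math.exp(-1.0 / (fs * tau_attack))  # per-sample pole while rising
    a_dec = math.exp(-1.0 / (fs * tau_decay))   # per-sample pole while decaying
    env = 0.0
    out = []
    for x in samples:
        level = abs(x)
        a = a_att if level > env else a_dec
        env = a * env + (1.0 - a) * level
        out.append(env)
    return out
```

A larger `tau_decay` gives a smoother envelope at the cost of a slower response when the signal level drops, which is the trade-off that the programmable decay time constant controls.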
3.3.2.2 Noise envelope estimate using a minimum detector
One effective method of noise-level estimation is the minimum statistics approach [65].
An approximation of this approach is to use an inverted peak detector, or minimum detector,
with a long time constant on the averaged noisy signal envelope estimate, ēx(k, t) [8]. The
inverted peak detector operates on ēx(k, t), maintaining an estimate of the minimum value,
which is assumed to be the noise floor. A “minimum detector” is therefore used to estimate
the averaged noise envelope, ēn(k, t), by following the minimum values of ēx(k, t). Figure 20
Figure 20. The minimum detector is biased with Vτa to have a much slower time constant than the peak
detector so that it follows the slowly changing curve at the bottom of the noisy signal envelope.
The output Inoise is an estimate of the noise envelope.
shows the circuit diagram of the minimum detector [31]. A bias voltage sets the attack time
constant to run more slowly than that of the peak detector so that the slow changes found
in the amplitude of relatively stationary noise can be followed. When the signal is present,
the output will maintain a slow-rising level; when the signal is not present, the minimum
detector will track the noise level.
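
A discrete-time counterpart of the minimum detector can be sketched by reversing the roles of the two time constants: the estimate rises slowly and falls quickly. As before, this is a behavioral approximation under assumed parameter names and values, not a model of the circuit in Figure 20.

```python
import math


def minimum_detector(envelope, fs, tau_rise, tau_fall):
    """Follow the minima of `envelope`: slow to rise, quick to fall."""
    a_rise = math.exp(-1.0 / (fs * tau_rise))  # slow pole when the input is above the estimate
    a_fall = math.exp(-1.0 / (fs * tau_fall))  # fast pole when the input is below the estimate
    est = envelope[0] if envelope else 0.0
    out = []
    for e in envelope:
        a = a_rise if e > est else a_fall
        est = a * est + (1.0 - a) * e
        out.append(est)
    return out
```

During a short burst of speech the estimate creeps upward only slightly, and it returns to the noise floor quickly once the burst ends.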
3.3.3 SNR Calculation and Normalization
After ēx(k, t) and ēn(k, t) are obtained, multiplication and division operations can be per-
formed using the translinear principle [84] for the SNR calculation. Figure 21 is the circuit






The current Iscale, set by the bias voltage Vscale, is used to put the output current into the




Figure 21. The translinear circuit implements the division of Inoise into Isignal, yielding a current
representing the SNR, ISNR. Iscale is subtracted from the output current to yield the function described in
Equation (22). The voltage Vscale sets the bias current Iscale, which is used to put ISNR into the proper
range for the gain function circuit.
Diethorn [26] estimated êx(k, t) and ên(k, t) in parallel directly from a band-limited
signal. This approach, however, when implemented in the continuous-time domain with
relatively simple, sample-by-sample analog circuit estimators, causes many false alarms,
as shown in Figure 22(b), and degrades the overall intelligibility. To decrease the number
of false alarms, ēx(k, t) is estimated from x(k, t), and ēn(k, t) is estimated not from x(k, t) but
from ēx(k, t), as shown in both Figures 13 and 14. This approach, also called the
“normalized SNR method,” keeps the overall SNR at or very close to unity when the signal
is assumed to be absent. Because the normalized SNR method forces the SNR of the noise-only
period to unity, independent of the input noise variance, the suppression rate
stays consistent. This normalized SNR is used as the input of the sigmoid
gain function, which is explained in the next section.
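
The normalizing effect can be illustrated with a short sketch in which the noise estimate is derived from the averaged envelope itself and the SNR is formed as their ratio. The function name and the two update coefficients are hypothetical, and the simple tracker below merely stands in for the minimum detector; it is not the circuit implementation.

```python
def normalized_snr(avg_envelope, rise=0.001, fall=0.5):
    """Ratio of the averaged envelope to a noise floor tracked from that same envelope."""
    noise = avg_envelope[0]
    out = []
    for e in avg_envelope:
        step = rise if e > noise else fall  # slow to rise, quick to fall
        noise += step * (e - noise)
        out.append(e / noise if noise > 0 else 1.0)
    return out
```

For a noise-only (constant) envelope the ratio is unity whatever the noise level, whereas a burst of signal pushes the ratio well above unity because the noise estimate rises only slowly.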
3.3.4 Programmable Non-Linear Gain Function
The final elements of the gain calculation algorithm are the gain function and multiplier.
Several different gain functions may be used, but all of them have the general characteristics
of low gain for low SNR and high gain (at or near unity) for high SNR, with varying

























Figure 22. (a) Averaged noisy signal envelope, ēx(k, t). The data are obtained from the computer
simulation explained in Section 3.4. (b) SNR obtained from the algorithm of [26] and normalized SNR
obtained from the simulation of the proposed continuous-time ANS. The normalized SNR method
generates an SNR at or very close to unity when the signal is assumed to be absent.
smoothing between these two regions. The transition between low gain and high gain must
be smooth to avoid adverse auditory effects, yet it must quickly follow the SNR changes to
ensure an acceptable amount of noise suppression.
A few examples of well-known gain functions are as follows. The Wiener gain,
Equation (20), is widely used in discrete-time ANS since it minimizes error energy when the
input process is non-stationary. The bi-linear gain function [26], Hbi(t), is also frequently used









where γ is the threshold that specifies the certainty of the presence of speech. The bi-linear
gain function, however, suffers severe loss of signal magnitude at low SNRs, particularly
when a large γ is used for more noise suppression. To minimize this signal energy reduction
at low SNRs, we proposed a sigmoid gain function [100], Hsig(t), which is suitable for
continuous-time circuit implementation because its transfer function is similar to that of














Figure 23. Wiener, bi-linear, and sigmoid gain functions plotted against the a posteriori SNR. Curves
shown: bi-linear (γ = 10), bi-linear (γ = 5), sigmoid (γ = 10, α = 0.5, δ = 1/2), sigmoid (γ = 10, α = 1,
δ = 1/2), Wiener gain (oversubtraction = 1), and Wiener gain (oversubtraction = 10). Wiener gain
minimizes error energy when the input process is non-stationary. Bi-linear gain function is easy to
program but suffers severe signal magnitude reduction at low SNRs, especially when a large γ is used for
more noise suppression. Sigmoid gain function has three control parameters and thus can achieve a low
gain at low SNRs and a high gain at high SNRs.















where α, φ, and δ are the parameters for slope, horizontal shift, and vertical shift, respectively.
Figure 24(a) shows the combined circuit schematic of the gain function and the
multiplier. The input to the multiplier is the band-limited noisy signal, i.e., the output of the
C4 SOS band-pass filter. The transistors at the top of the schematic form the differential pair
where the actual multiplication takes place; the multiplier will be described shortly. The
transistors below the differential pair create the behavior of the gain function. The output
of the gain function circuit, Igain, can be approximated by
\[
I_{gain} = \frac{(I_{SNR}^2 + I_{min1})\,I_{max}}{(I_{SNR}^2 + I_{min1}) + I_{max}}. \tag{26}
\]
Equation (26) has almost the exact form of the Wiener gain, with the addition of a bias





















































































Figure 24. (a) The schematic of the multiplication and gain function circuit. The output current,
Iout1 − Iout2, is the product of the band-limited input signal, Vin1 − Vin2, and the gain factor, Igain. Each
voltage bias sets a current that forms the overall behavior of the gain function circuit. (b) The gain
function of the circuit depicted in (a) plotted versus ISNR. At the extremes of the SNR, either a low gain
factor or a unity gain factor results. Multiple curves show the effect of adjusting the bias voltages.
voltage biases Vmax and Vmin1 and effectively create the upper and lower bounds of the gain
output. The circuit is biased to operate in the linear range of the following equation:






where UT is the thermal voltage of the transistors. Figure 24(b) shows the measured data
of the gain functions with different parameters.
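
The limiting behavior of the approximation in Equation (26) can be checked numerically. In the sketch below the currents are treated as dimensionless, normalized quantities, and the function name and example values are assumptions made for illustration rather than measured bias points.

```python
def gain_current(i_snr, i_min1, i_max):
    """Evaluate the approximate gain-function output of Equation (26)."""
    x = i_snr ** 2 + i_min1
    return x * i_max / (x + i_max)


# Hypothetical normalized biases: a floor of about 0.01 and a ceiling of 1.
low = gain_current(0.0, 0.01, 1.0)     # close to i_min1 when i_min1 is much smaller than i_max
high = gain_current(1000.0, 0.01, 1.0)  # approaches i_max for large SNR
```

The output is bounded below by roughly Imin1 and above by Imax, which is how the two biases set the lower and upper limits of the gain.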
3.4 Simulation and Implementation Results
The initial continuous-time ANS algorithm was functionally simulated in MATLAB. The
simulated system contained the same functional blocks as the analog system, shown in both
Figures 13 and 14, with the continuous-time transfer functions converted to discrete-time
functions using the bi-linear transform [5]. The sampling rate in the simulator was chosen
to be at least four times the required Nyquist rate to avoid the frequency warping that
occurs near the Nyquist frequency with the bi-linear transform. Multirate signal processing
was used to improve efficiency, but the high over-sampling rule was always followed. The
simulation gain function parameters were chosen so that the attenuation was limited to less
than 40 dB, leaving some residual background noise. The noise-suppressed output from the
simulation, overlaid on the original waveform, is shown in Figure 25(a). The perceived
quality of the noise-suppressed signal is remarkably free from artifacts normally associated
with Wiener filtering and spectral subtraction, which can be attributed to the method of
frequency decomposition that matches human auditory critical bands and the proportional
bandwidths of the subband envelopes.
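
The reason for the over-sampling rule can be seen from the frequency mapping of the bi-linear transform, under which an analog frequency f_a lands at the digital frequency f_d = (fs/π)·arctan(π·f_a/fs) when no pre-warping is applied. The sketch below evaluates this mapping; the 8 kHz band edge and the two sampling rates are illustrative assumptions, not the settings of the MATLAB simulator.

```python
import math


def warped_frequency(f_analog, fs):
    """Digital frequency (Hz) to which the bi-linear transform maps f_analog (Hz)."""
    return (fs / math.pi) * math.atan(math.pi * f_analog / fs)


# Hypothetical 8 kHz band edge, sampled at the Nyquist rate and at four times that rate.
at_nyquist_rate = warped_frequency(8000.0, 16000.0)
at_four_times = warped_frequency(8000.0, 64000.0)
```

At the Nyquist rate the 8 kHz feature is displaced by almost 3 kHz, whereas at four times the Nyquist rate the displacement falls below 5%.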
Figure 25(b) shows an actual measured subband noisy speech signal and a noise-suppressed
signal processed by the components in our system. The system is effective at
adaptively reducing the amplitude of noise-only portions of the signal while leaving the
desired portions relatively intact. The power consumption of the noise suppression system
is remarkably small because of its subthreshold operation. The core circuitry, assuming 32
frequency bands and a 3.3 V power supply, consumes less than 50 µW. Therefore, integrating
this circuit into an existing system is unlikely to have a measurable impact on battery
performance. Any noise or distortion created by the gain calculation circuits minimally
affects the output signal because these circuits are not directly in the signal path. While the
band-pass filters and the multipliers inject a certain amount of noise into each frequency
band, this noise will be averaged out by the summation of the signals at the output of the
system. Distortion in the signal path will arise from the band-pass filters and the
multipliers. In the band-pass filter array, the C4 SOSs are not cascaded as they are in the
cochlea models [68]. Therefore, there is no distortion or noise accumulation. The distortion
level in each frequency band for a 30 mV single-ended input signal is second-harmonic limited
at −30 dB at peak. A differential filter bank will eliminate second-harmonic distortion and
reduce the third-harmonic level to −40 dB at peak. The distortion introduced by the
multiplier is dependent on the output levels of the band-pass filters. If the signal is near 30 mV,
third-harmonic distortion will be near −20 dB; however, if the signal is near 7.5 mV,
the third-harmonic distortion will approach −46 dB.































Figure 25. (a) Time-domain waveform of original noisy signal (gray) and noise-suppressed signal (dark)
from our functional circuit simulation. (b) Noise suppression from the analog VLSI circuit in one
frequency band. The light gray waveform is the subband noisy speech input signal and the black




Adaptive filters have gained wide popularity in high-quality audio signal processing
applications such as adaptive ANS and AEC. The increase in the order of these adaptive
filters caused by high sampling frequencies has made real-time, low-power computing in
the digital domain challenging. Analog circuit engineers have made considerable efforts
to address the real-time and low-power constraints of the digital-domain approach by
implementing adaptive FIR filters with analog circuitry.
The adaptive FIR filter consists of weight adaptation, a tapped-delay line, and
multiplication-and-addition. The multiplication-and-addition is relatively straightforward to
implement with analog circuitry. Weight adaptation in the continuous-time domain requires
solving a differential equation [67], which calls for basic circuit primitives such as adders,
multipliers, and integrators that can be efficiently implemented with analog VLSI. A fully
differential current-mode implementation of the weight update using the multiple-input
translinear element (MITE) [70] is explained in [67]. Above all, the implementation of
tapped-delay lines with analog circuitry is challenging because it requires ideal delay elements.
Sampled-mode (discrete-mode) delay elements [64] using a charge-coupled device (CCD) or
a switched capacitor were also investigated in [61]. However, the necessity of clocking and
the occurrence of aliasing effects remain as drawbacks. Several attempts [11, 13] have
been made to implement delay elements in a purely continuous-time domain. Delay
implementation in the continuous-time domain, however, can suffer from accumulated noise
as the number of cascaded delays increases. Figure 26(a) shows a typical tapped-delay
line commonly used in standard FIR filtering structures. The accumulated noise
problem can be significantly alleviated by a parallel-delay line, shown in Figure 26(b),
when the following conditions are satisfied:
• The filter order for multiple delay should be the same as the filter order for unity
delay, or at least it should not increase significantly for a large delay.
• Maximum delay should be achieved by a single delay element.
• The performance of the adaptive filter should not be significantly affected by the
non-integer fractional delay caused by imperfections of the analog devices.

Figure 26. (a) The tapped-delay line, where unit delay elements are positioned in tandem. When
implemented with analog circuitry, the performance of the tapped-delay line is not robust to accumulated
noise. (b) The parallel-delay line, which requires a multiple delay filter, is very robust to accumulated
noise when implemented with analog circuitry.
4.1 Property of The Delay Filter
Most analog delay elements in a continuous-time domain have been implemented with
first-order low-pass (Gamma) filters [58], first-order all-pass filters [13], or a combination
of first-order low-pass and all-pass (Laguerre) filters [58] for their constant group delay
property within their pass band and small area size. The constant group delay of low-pass
and all-pass filters is important in many applications in which preserving a waveform is
highly desired.
















Equations (28) and (29) show that both first-order low-pass and all-pass filters have constant
group delays of Dl(w) ≈ 1/ωo and Da(w) ≈ 2/ωo when w ≪ ωo. The constant group delay
property of low-pass and all-pass filters is desirable in many applications. One drawback
of these delay filters is that their group delays are inversely proportional to the corner
frequency [101], limiting the maximum group delay of the delay filters to be small for a
wideband audio signal [101, 102]. For example, when the input signal is band limited to
8 kHz (ωo > 8000), the maximum group delays of first-order low-pass and all-pass
delay filters are limited to 0.125 msec and 0.25 msec, respectively.
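
These limits can be reproduced with a short numerical sketch. It assumes the standard first-order forms ωo/(s + ωo) for the low-pass filter and (ωo − s)/(ωo + s) for the all-pass filter, whose group delays are ωo/(ωo² + w²) and 2ωo/(ωo² + w²); the function names are ours, and the value ωo = 8000 simply follows the numerical example in the text.

```python
def lowpass_group_delay(w, w_o):
    """Group delay of the first-order low-pass filter w_o / (s + w_o)."""
    return w_o / (w_o ** 2 + w ** 2)


def allpass_group_delay(w, w_o):
    """Group delay of the first-order all-pass filter (w_o - s) / (w_o + s)."""
    return 2.0 * w_o / (w_o ** 2 + w ** 2)


# Delays well below the corner frequency, using w_o = 8000 as in the example above.
d_lp = lowpass_group_delay(0.0, 8000.0)  # 1 / w_o
d_ap = allpass_group_delay(0.0, 8000.0)  # 2 / w_o
```

Both delays are largest at w = 0 and fall off as w approaches ωo, so raising the corner frequency to pass a wider band necessarily shortens the achievable delay.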
Among the many available analog filters, the Bessel filter, or Bessel-Thomson
filter, is best known for its maximally flat group delay property. The Bessel filter is less
well known than other analog filters such as the Butterworth, Chebyshev, and Elliptic
filters because its magnitude response is not superior to theirs. The
transfer function of the Bessel filter can be derived in many ways [5]. The first method is
the “recursive fraction method,” which is based on the fact that the output of the delay filter,
xo(t), ideally should be a delayed version of the input signal, xi(t). When unity delay is
considered, xo(t) becomes
xo(t) = xi(t − 1). (30)
















where K is a constant and the denominator, Q(s), satisfies
Q(s) = E(s) + O(s), (33)
where E(s) is an even polynomial and O(s) is an odd polynomial. From Equation (31),
Q(s) = e^s, and e^s can be expressed in terms of even and odd functions as
e^s = cosh(s) + sinh(s). (34)
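Now, cosh(s) and sinh(s) can be approximated by the even and odd series
\[
\cosh(s) = 1 + \frac{s^2}{2!} + \frac{s^4}{4!} + \cdots, \qquad
\sinh(s) = s + \frac{s^3}{3!} + \frac{s^5}{5!} + \cdots. \tag{35}
\]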























The transfer function of the Bessel low-pass filter can be determined if we truncate
Equation (36) to the required order and then assemble Q(s) using Equation (33). Then we derive
\[
H(s) = \frac{K}{s + 1} \tag{37}
\]
and
\[
H(s) = \frac{K}{s^2 + 3s + 3}, \tag{38}
\]
for n = 1 and n = 2, respectively. The other method is to systematically evaluate Q(s),
whose members are also referred to as “Bessel polynomials” or “Thomson polynomials,” using the
following recursion:
Qn(s) = (2n − 1)Qn−1(s) + s^2 Qn−2(s), (39)
where Q0(s) = 1 and Q1(s) = s + 1. Table 4 shows the Bessel polynomials, Qn(s), in
unfactored form for 0 ≤ n ≤ 7.
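
The recursion in Equation (39) is straightforward to evaluate programmatically. The following sketch, in which the function name and the coefficient-list representation are our own choices, returns the coefficients of Qn(s) with the highest power first.

```python
def bessel_polynomial(n):
    """Coefficients of Q_n(s), highest power first, from the recursion in Equation (39)."""
    if n == 0:
        return [1]
    # Coefficients are stored constant term first while iterating.
    q_prev, q_curr = [1], [1, 1]  # Q0 = 1, Q1 = s + 1
    for k in range(2, n + 1):
        scaled = [(2 * k - 1) * c for c in q_curr] + [0]  # (2k - 1) * Q_{k-1}
        shifted = [0, 0] + q_prev                          # s^2 * Q_{k-2}
        q_prev, q_curr = q_curr, [a + b for a, b in zip(scaled, shifted)]
    return q_curr[::-1]
```

For example, `bessel_polynomial(3)` returns `[1, 6, 15, 15]`, corresponding to s^3 + 6s^2 + 15s + 15.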
Figure 27 shows the group delays of fourth-order Butterworth, Chebyshev 1, Elliptic,
and Bessel low-pass filters with the corner frequency normalized to 1 Hz. The Elliptic filter,
which is often preferred among analog IIR filters because of its equi-ripple property in
both the pass band and the stop band, shows the highest peak group delay around the
corner frequency. Figure 27 makes clear that the Bessel low-pass filter has the
flattest group delay among these filters.
Table 4. Bessel polynomials, Qn(s), in unfactored form for 0 ≤ n ≤ 7. The transfer function of the Bessel
low-pass filter of order n has Qn(s) as its denominator.

n   Qn(s)
0   1
1   s + 1
2   s^2 + 3s + 3
3   s^3 + 6s^2 + 15s + 15
4   s^4 + 10s^3 + 45s^2 + 105s + 105
5   s^5 + 15s^4 + 105s^3 + 420s^2 + 945s + 945
6   s^6 + 21s^5 + 210s^4 + 1260s^3 + 4725s^2 + 10395s + 10395
7   s^7 + 28s^6 + 378s^5 + 3150s^4 + 17325s^3 + 62370s^2 + 135135s + 135135

The general transfer function for the second-order band-pass filter is
\[
H(s) = \frac{K\left(\frac{w_o}{Q}\right)s}{s^2 + \left(\frac{w_o}{Q}\right)s + w_o^2}, \tag{40}
\]
where wo is the center frequency, K is the gain, and Q is the quality factor defined as wo/BW.
The group delay of this filter is
\[
D(w) = \frac{\left(\frac{w_o}{Q}\right)\left(w_o^2 + w^2\right)}{\left(w_o^2 - w^2\right)^2 + \left(\frac{w_o}{Q}\right)^2 w^2}. \tag{41}
\]
Group delays of second-order band-pass filters for different Qs are shown in Figure 28. It
follows from Equation (41) that the maximum delay of D(w) is 2/BW at w ≈ wo. It should
be noted that the maximum delay of the band-pass delay filter, 2/BW, is independent of the
center frequency and is a function only of the bandwidth. As Q increases, the group delay
around the center frequency increases, approaching 2/BW, and the group delay at w ≪ wo
decreases.
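
The 2/BW result can be verified numerically. The sketch below assumes the standard second-order band-pass form with denominator s^2 + (wo/Q)s + wo^2, for which the group delay is (wo/Q)(wo^2 + w^2) divided by the denominator shown in Equation (41); the function name and the test frequencies are chosen only for illustration.

```python
import math


def bandpass_group_delay(w, w_o, q):
    """Group delay of a second-order band-pass filter with center w_o and quality factor q."""
    a = w_o / q  # bandwidth BW in the same units as w_o
    return a * (w_o ** 2 + w ** 2) / ((w_o ** 2 - w ** 2) ** 2 + (a * w) ** 2)


w_o = 2.0 * math.pi * 3000.0  # hypothetical center frequency in rad/s
for q in (1.0, 5.0, 20.0):
    bw = w_o / q
    # At the center frequency the delay equals 2 / BW regardless of w_o.
    assert math.isclose(bandpass_group_delay(w_o, w_o, q), 2.0 / bw)
```

Well below the center frequency the same expression tends to 1/(Q·wo), which is why the low-frequency delay shrinks as Q increases.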
4.2 Low-Pass-To-Band-Pass Transformation
The design procedure for analog filters such as high-pass, band-pass, band-stop, or all-pass
filters generally begins with the design of a low-pass filter with the desired specifications
and then transforms that low-pass filter into the desired filter type.






















Figure 27. Group delay of analog filters. The filter order for all low-pass filters is 4, and the corner
frequency is normalized to 1 Hz. The Elliptic filter has the highest group delay peak while the Bessel
filter has almost a constant group delay.
4.2.1 Geometrically Symmetrical Transformation
A widely known normalized low-pass-to-band-pass transformation, referred to as the
“geometrically symmetrical transformation,” is [25]
\[
\tilde{\omega} = \frac{\omega}{\omega_H - \omega_L} - \frac{\omega_H\,\omega_L}{(\omega_H - \omega_L)\,\omega}, \tag{42}
\]
where ω̃ is the frequency variable of the original low-pass filter, and ω is the frequency
variable of the band-pass filter. Equation (42) transforms the low-pass frequencies ω̃ = −1, 0, +1
into the band-pass frequencies ω = ωL, √(ωLωH), ωH. The transfer functions of the band-pass
filter, h(jω), and the low-pass filter, H(jω̃), are related by
h(jω) = H(jω̃)|ω̃=ω̃(ω), (43)
where ω̃(ω) is given by Equation (42) for the geometrically symmetrical transformation,
and the phase responses are related by





























Figure 28. Group delay of the second-order band-pass filter for different Qs with wo = 3000 Hz. When
Q is low, the group delay of the band-pass filter is constant for frequencies below the lower
cut-off frequency. As Q increases, the maximum group delay at the center frequency
approaches 2/BW, and the group delay in the constant group delay region, where the frequency is
below the lower corner frequency, decreases.
When D(ω̃) is the group delay function of the low-pass filter, and d(ω) is the group delay






























The term inside the brackets contributes to the distortion of the group delay of the
band-pass filter. Even if D(ω̃) is constant within its pass band, as is the case with the Bessel
low-pass filter, d(ω) is not constant because of this distortion. For frequencies close to the
center frequency (ω^2 ≈ ωLωH), the distortion is quite small. However, as the frequency,
ω, deviates from the center frequency, √(ωLωH), significant group delay distortion is
unavoidable. Figure 29(a) illustrates the geometrically symmetrical transformation from the
low-pass variable, ω̃, to the band-pass variable, ω. Since this transformation is
geometrically symmetric, exhibiting symmetry on a logarithmic frequency scale, the linear-phase
property of low-pass filters is no longer preserved in the transformed band-pass filter.
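
The mapping of Equation (42) and the source of the group delay distortion can be examined numerically. In the sketch below the function names and the band edges are assumptions made for illustration; the second function is simply the derivative of the mapping with respect to ω, which scales the low-pass group delay when it is carried over to the band-pass filter.

```python
import math


def geometric_lp_frequency(w, w_low, w_high):
    """Low-pass frequency that Equation (42) assigns to the band-pass frequency w."""
    bw = w_high - w_low
    return w / bw - (w_high * w_low) / (bw * w)


def geometric_slope(w, w_low, w_high):
    """Derivative of the mapping with respect to w; it varies across the pass band."""
    bw = w_high - w_low
    return 1.0 / bw + (w_high * w_low) / (bw * w * w)


w_low, w_high = 1500.0, 2500.0  # hypothetical band edges
center = math.sqrt(w_low * w_high)
```

With these band edges the slope at the lower edge is about 1.7 times the slope at the upper edge, so even a perfectly constant low-pass group delay becomes frequency dependent after the transformation.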
Figure 30 shows the distortions of the group delay of band-pass filters transformed
from a normalized low-pass filter using the geometrically symmetrical transformation. The
corner frequency of every low-pass filter is normalized to 1 Hz, and the center frequency,
ωo, and bandwidth, BW, of the band-pass filter are also normalized to 1 Hz. Figure 30(h)
shows that the group delay of the Bessel band-pass filter, which is transformed
using Equation (42), fails to maintain the constant group delay property. It can also be noted
from Figures 30(e)–(g) that the group delay of the Butterworth, Chebyshev 1, and Elliptic


















Figure 29. Comparison of the geometrically symmetrical transformation and the arithmetically
symmetrical transformation. (a) Low-pass frequencies ω̃ = −1, 0, +1 are transformed into band-pass
frequencies ω = ωL, √(ωLωH), ωH or ω = −ωH, −√(ωLωH), −ωL. Since this transformation is
geometrically symmetric, the linear-phase property of low-pass filters is no longer preserved in the transformed
band-pass filter. (b) Low-pass frequencies ω̃ = −1, 0, +1 are transformed to band-pass frequencies
ω = ωL, (ωL + ωH)/2, ωH or ω = −ωH, −(ωL + ωH)/2, −ωL. This transform does not distort the linearity
of the group delay of the low-pass filter; moreover, if the center frequency shift ±ωo is sufficiently large,
the group delay distortion contributed by each component of Equation (51) will be negligible [11].
center frequency.
4.2.2 Arithmetically Symmetrical Transformation
Geffe [43] and Szentirmai [92] introduced another low-pass-to-band-pass filter transform,
which is referred to as the “arithmetically symmetrical transformation.” Unlike the
geometrically symmetrical transformation, it maps frequency linearly:
\[
\tilde{\omega} = \frac{2}{\omega_H - \omega_L}\,(\omega - \omega_o), \tag{47}
\]
where ω̃ is the frequency variable of the original low-pass filter, ω is the frequency variable
of the band-pass filter, and the center frequency, ωo, is equal to (ωH + ωL)/2. It transforms
the low-pass frequencies ω̃ = −1, 0, +1 to the band-pass frequencies ω = ωL, (ωL + ωH)/2, ωH.
Since Equation (47) only yields the pass-band region ωL ≤ ω ≤ ωH, a transformation that
generates a negative-frequency pass band should also be introduced [25]. The complete
arithmetically symmetrical transformation using both transfer functions can be written as
h(jω) = H(jω̃+)H(jω̃−), (48)



























































Figure 30. Simulation results of the group delay of the band-pass filter transformed from a normalized
low-pass filter using the geometrically symmetrical transformation (Equation (42)). The filter orders
of the low-pass and band-pass filters are 4 and 8, respectively.
where
ω̃+ = 2(ω − ωo)/(ωH − ωL), (49)
ω̃− = 2(ω + ωo)/(ωH − ωL). (50)
The group delay of the band-pass filter, d(ω), is
\[
d(\omega) = \frac{2}{\omega_H - \omega_L}\left\{ D\!\left[\frac{2(\omega - \omega_o)}{\omega_H - \omega_L}\right] + D\!\left[\frac{2(\omega + \omega_o)}{\omega_H - \omega_L}\right] \right\}, \tag{51}
\]
where D(·) is the group delay function of the normalized low-pass filter. If the center
frequency shift ±ωo is sufficiently large, then the influence between the two D[·]s of
Equation (51) can be neglected [11]. Figure 31 shows the simulation results for the group delay
of band-pass filters transformed using the arithmetically symmetrical transformation. The
results show that the group delay maintains symmetry and that the constant group delay
property of the Bessel low-pass filter is well preserved over a large bandwidth, compared
to that shown in Figure 30.
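
Because Equations (49) and (50) are linear in ω, the slope of the mapping is the constant 2/(ωH − ωL), and differentiating the phase of the product in Equation (48) gives a band-pass group delay equal to that constant times the sum of the two shifted low-pass group delays. The sketch below illustrates this with an idealized low-pass delay that is constant in the pass band and zero elsewhere; the function names, the band edges, and the idealized delay are assumptions made for the example.

```python
def arithmetic_lp_frequencies(w, w_low, w_high):
    """Return the two low-pass frequencies of Equations (49) and (50)."""
    w_o = (w_high + w_low) / 2.0
    bw = w_high - w_low
    return 2.0 * (w - w_o) / bw, 2.0 * (w + w_o) / bw


def arithmetic_bandpass_delay(w, w_low, w_high, lowpass_delay):
    """Band-pass group delay built from a low-pass group delay function."""
    plus, minus = arithmetic_lp_frequencies(w, w_low, w_high)
    return (2.0 / (w_high - w_low)) * (lowpass_delay(plus) + lowpass_delay(minus))


def ideal_delay(x):
    """Idealized low-pass delay: unity inside |x| <= 1 and zero outside."""
    return 1.0 if abs(x) <= 1.0 else 0.0
```

With band edges of 1500 and 2500, for instance, the delay returned for any frequency inside the pass band is the same value, 2/(ωH − ωL), which reflects the preservation of a constant low-pass group delay.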
Figure 32 shows the group delay of the third-order Bessel low-pass filter and the sixth-order
Bessel band-pass filter. The center frequency of the band-pass filter is 2 kHz, and the
bandwidth is 1 kHz. The solid curve in Figure 32(b) is the group delay of the band-pass filter
transformed using the arithmetically symmetrical transformation, whereas the dotted curve
is the group delay of the band-pass filter transformed using the geometrically symmetrical
transformation. This clearly shows that the arithmetically symmetrical transformation
does not distort the linearity of the group delay of the Bessel low-pass filter. It also should
be noted that the difference between the two transformations is noticeable when the filter
order of the band-pass filter is no less than third order.
4.3 Subband Delay Structure
We have seen the group delay properties of general low-pass and band-pass filters, the






























































Figure 31. Simulation results of group delay of the band-pass filter transformed from a normalized
low-pass filter using the arithmetically symmetrical transformation. The filter orders of the low-pass
and band-pass filters are 4 and 8, respectively.






























Figure 32. Group delay of the Bessel low-pass and band-pass filters. The filter orders of the low-pass
and band-pass filters are 3 and 6, respectively. The corner frequency of the low-pass filter is normalized
to 1 Hz, and the center frequency and the bandwidth of the band-pass filter are denormalized to 2 kHz
and 1 kHz, respectively. The arithmetically symmetrical transformation does not distort the linearity of
the group delay of the Bessel low-pass filter.
on this information, two different delay structures for band-limited input signals are presented
in this section. The former is best suited for applications in which the linearity of the
group delay is very critical, while the latter is for applications in which the maximum group
delay of a delay element is even more important than the linearity of the group delay [101].
4.3.1 Delay Network with Low-Pass Filters and Modulation
The first delay network, shown in Figure 33, consists of low-pass delay elements. This
structure is suited to situations in which the linearity of the group delay is more
crucial than the maximum group delay that the delay network can create. Figure 33 is very
similar to the complex modulator model of the DFT filter bank structure in the digital domain,
except that it is a continuous-time implementation, so no sampling rate change is
needed.
In the beginning of each subband, the band-limited signal is modulated to a baseband














Figure 33. Delay network with low-pass delay filters. This structure can be applicable to situations
where the linearity of the group delay is more crucial than the maximum group delay that the delay
network can create. The delay element in each subband can be implemented with the low-pass filter
because each band is multiplied by a complex number to modulate the band-limited signal into a base
band. Modulation to a baseband is advantageous in many ways. First, it enables a delay network to
create a larger group delay by decreasing the highest frequency of the band-limited input signal to a
low-frequency baseband. Second, it allows the delay network to have the same time-constant (corner
frequency) distribution of delay elements throughout different subbands, simplifying programming
complexity of the corner frequency.
advantageous in many ways. First, it enables a delay network to create a larger group delay
by decreasing the highest frequency of the band-limited input signal to a low-frequency
baseband. As shown in Equation (28), the group delay of the low-pass filter is inversely
proportional to the corner frequency. In fact, the lowest corner frequency of the low-pass
delay filter can’t be smaller than the highest frequency of the band-limited input signal,
to avoid distorting the spectral magnitude of the input signal. This can be a serious problem for
higher-frequency subbands that have to create the same amount of group delay as lower-
frequency subbands, because the dynamic range of the corner frequency of higher subbands
is much smaller than that of lower subbands. Second, it allows the delay network to have the
same time-constant (corner frequency) distribution of delay elements throughout different
subbands, simplifying the programming complexity of the corner frequency.
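The first advantage can be made concrete with a small calculation. The sketch below assumes a delay-normalized low-pass element whose DC group delay is exactly 1/ω_c, with the corner frequency placed at the highest signal frequency; the unity proportionality constant and the function names are assumptions for illustration, not values taken from Equation (28).

```python
import math


def max_element_delay(highest_freq_hz):
    """Largest DC group delay (s) of one element, assuming delay = 1 / corner
    and a corner frequency equal to the highest signal frequency."""
    return 1.0 / (2.0 * math.pi * highest_freq_hz)


def subband_delay_limits(center_hz, bandwidth_hz):
    """Delay limit without modulation and after modulation to a baseband."""
    direct = max_element_delay(center_hz + bandwidth_hz / 2.0)
    baseband = max_element_delay(bandwidth_hz / 2.0)
    return direct, baseband


if __name__ == "__main__":
    direct, baseband = subband_delay_limits(center_hz=3500.0, bandwidth_hz=1000.0)
    print(f"without modulation: {direct * 1e6:.1f} us per element")
    print(f"after modulation:   {baseband * 1e6:.1f} us per element")
```

For a 1 kHz wide subband centered at 3.5 kHz, the highest frequency drops from 4 kHz to 0.5 kHz after modulation, so under this assumption each element can provide eight times more delay.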
Figure 34 shows two examples of delay lines for each subband. The Bessel low-pass
filter is best suited for each delay filter because of its maximally flat group delay, explained
in Sec. 4.1. Figure 34(a) shows the tapped-delay line, which consists of low-pass delay






Figure 34. Two examples of delay lines for a delay network with low-pass delay filters. (a) Tapped-
delay line consisting of low-pass delay filters with a uniform corner frequency distribution. Each delay
element generates the same delay and the number of cascades can be significantly restricted by the
amount of accumulated noise at each delay filter. (b) Parallel-delay line consisting of low-pass delay
filters with a non-uniform corner frequency distribution. Each delay element is programmed to have a
larger bandwidth for a smaller delay or a smaller bandwidth for a larger delay.
filters with a uniform corner frequency distribution. Each delay element generates the
same delay and the number of cascades can be significantly restricted by the amount of
accumulated noise at each delay filter. Figure 34(b) shows the parallel-delay line, which
consists of low-pass delay filters with a non-uniform corner frequency distribution. Each
delay element is programmed to have a larger bandwidth for a smaller delay or a smaller
bandwidth for a larger delay. The accumulated noise problem can be mitigated with this
structure, but the corner frequency mismatch between the other subbands can limit the
performance of this structure.
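The trade-off between the two delay lines can be sketched as follows. The code assumes, purely for illustration, that each element's delay equals the reciprocal of its corner frequency; the function names are hypothetical.

```python
def tapped_line_delays(unit_delay, n_elements):
    """Delay seen at each tap of a cascade of identical delay elements."""
    return [unit_delay * (i + 1) for i in range(n_elements)]


def parallel_line_corners(unit_delay, n_elements):
    """Corner frequency (rad/s) of each parallel element, assuming delay = 1 / corner."""
    return [1.0 / (unit_delay * (i + 1)) for i in range(n_elements)]


def elements_in_path(structure, tap_index):
    """Number of delay filters the signal passes through to reach a tap (0-based)."""
    return tap_index + 1 if structure == "tapped" else 1


if __name__ == "__main__":
    print(tapped_line_delays(100e-6, 4))
    print(parallel_line_corners(100e-6, 4))
```

In the tapped line the signal at the last tap has passed through every element, which is where the noise accumulates; in the parallel line each output passes through one element, but every element needs its own corner frequency.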
4.3.2 Delay Network with Band-Pass Filters
A delay network with band-pass delay elements can be used for applications in which cre-


























Figure 35. Delay network with the band-pass delay filters for a single subband. A high-Q band-pass
filter can be used as an element of the analysis filter bank to increase frequency selectivity and thus
increase the maximum group delay of the delay filter. The bandwidth of the delay element should be
decreased to increase the group delay.
delay element, the constant group delay of the band-pass delay filter within its pass band
can be obtained only when the band-pass delay filters are transformed using the arithmeti-
cally symmetrical transformation from Bessel low-pass filters, explained in Sec. 4.2.2.
Figure 35 shows the delay network with band-pass delay elements. A parallel-delay
line is adopted instead of a tapped-delay line since the band-pass delay element is known
to be less robust to the accumulated noise than the low-pass delay element. A wideband
input signal is transformed into subband signals through the analysis filter bank at the
beginning of each subband. High-Q band-pass filters can be used as the analysis filter
bank to increase frequency selectivity and thus increase the maximum group delay of the
delay filter. However, band-pass delay elements don’t need to be high-Q filters as long as
their bandwidth is wider than that of the band-pass filter used in the analysis filter bank.
Figure 35 shows that the bandwidth of the delay element is programmed to be narrower for
larger group delay.
Figure 36 presents another example of a delay network, intended for applications
in which the maximum group delay of a single delay element is too small to avoid
cascading altogether. This occurs when the filter order of the delay filter is rather small, or in
applications such as AEC that require an extremely long delay line. The first row in

































Figure 36. Multi-level delay network with the band-pass delay filters for a single subband. The first row
is designed to create a larger delay by programming a smaller bandwidth while the remaining rows are
designed to provide a finer delay resolution smaller than the delay of the first row.
the remaining rows are designed to provide a finer delay resolution smaller than the delay
of the first row. Other variants of Figure 36 can be made by allowing a different number
of rows for different columns of delay elements in the first row. Since a typical room
impulse response carries most of its energy at the first decaying part rather than at the long
reverberation part, more depth at the beginning columns and less depth at the latter columns
can be advantageous for continuous-time adaptive filter implementation.
CHAPTER 5
HYBRID DA FIR FILTER
In this chapter, a new DA architecture based on an alternative implementation of an LUT
is presented. We refer to this type of DA as “hybrid architecture” since it can use both an
LUT and adders simultaneously in contrast to LUT-based or adder-based DA. The hybrid
architecture is originally based on the LUT size reduction technique for DA-OBC [20], and
we further develop a new memory reduction technique for the original DA [99].
In the original LUT-based DA, the LUT size is fixed to 2^k for a k-tap DA base unit or 2^(k−1)
for a k-tap DA-OBC base unit. The hybrid architecture, however, provides more flexibility
than the LUT-based or adder-based DA by allowing the arbitrary selection of LUT size. In
hybrid architectures, the LUT size of a k-tap DA base unit can vary from as low as 0 to 2^k,
and the LUT size of a k-tap DA-OBC base unit can vary from as low as 0 to 2^(k−1). It should
be noted that LUT-based DA and adder-based DA are only two extreme cases of hybrid
DA.
5.1 Hybrid DA-OBC Architecture
One memory reduction technique for DA-OBC was proposed in [20]. The memory size
of DA-OBC can decrease exponentially at the expense of linearly increased control logic.
The memory reduced DA-OBC architecture, however, is still not suitable for high-order
FIR filter implementation based on the partial sum technique, since the adder/shift unit
is directly controlled by filter coefficients and input samples. To address this problem, I
propose “hybrid DA-OBC architecture” for the area-efficient implementation of high-order
FIR filters. The hybrid DA-OBC architecture features the standard adder/shifter unit of
DA, which is not controlled by sum of the filter coefficients and input samples.
The memory reduction technique for DA-OBC can be best understood with examples.
Figure 11 is a block diagram of a 4-tap (K = 4) original DA-OBC FIR filter. The LUT
size of DA-OBC can be decreased to 2^(K−1) at the cost of an additional K XORs and a 2x1 MUX by using
the mirrored property of Table 3. The further memory reduction technique proposed in [20]
consists of the following two steps:
1. Memory reduction by a factor of 2.
2. Control logic (XOR) optimization.
The first step is based on the observation that if the −w3/2 term can be located outside the LUT,
then the lower half of the LUT is a mirror image of the upper half of the LUT with the signs
reversed. Thus, if we move the −w3/2 term outside the LUT, only half of the LUT
of DA-OBC is needed, at the cost of a small amount of additional control logic. Figure 37(a) shows
the new DA-OBC architecture. The LUT size is reduced by a factor of 2 over the original
DA-OBC with additional control logic such as XORs, a full adder with subtraction, and
a register for storing −w3/2. The second step is to reduce the increased number of XORs
using Boolean algebra simplification. After the second step, the total number of XORs
inside the base unit is reduced to K. The final DA-OBC architecture for a 4-tap FIR filter
with a 2^2-word LUT is shown in Figure 37(b).
The memory reduction technique can also be recursively applied to the system shown in
Figure 37(b). Figure 38(a) shows the system after the first step, and Figure 38(b) shows the
system after the second step. Finally, one more memory reduction of Figure 38(b) results
in LUT-less DA-OBC architecture. Figure 39 illustrates the block diagram of the LUT-less
DA-OBC for a 4-tap FIR filter.
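The recursive reduction can be sketched in software. This is a minimal behavioral model, not the circuit of Figures 37 to 39. It assumes the sign convention suggested by the LUT entries in Figure 37, in which tap i contributes +w_i/2 when its address bit is 1 and −w_i/2 when it is 0; `obc_entry` and `obc_reduced` are hypothetical names.

```python
def obc_entry(weights, address):
    """Full DA-OBC table content: tap i adds +w_i/2 if address bit i is 1, else -w_i/2."""
    return sum(w if (address >> i) & 1 else -w for i, w in enumerate(weights)) / 2


def obc_reduced(weights, address, lut_bits):
    """Same value using a LUT of only 2**lut_bits words (0 <= lut_bits <= len(weights) - 1)."""
    taps_in_lut = lut_bits + 1
    mask = (1 << lut_bits) - 1
    # Stored half: entries whose top address bit (bit number lut_bits) is 0.
    half_lut = [obc_entry(weights[:taps_in_lut], a) for a in range(1 << lut_bits)]
    top = (address >> lut_bits) & 1
    low = address & mask
    if top:
        low ^= mask                      # XOR gates fold the address onto the stored half
    value = -half_lut[low] if top else half_lut[low]   # adder/subtractor selects the sign
    for i in range(taps_in_lut, len(weights)):
        w_half = weights[i] / 2          # +/- w_i/2 term that was moved outside the LUT
        value += w_half if (address >> i) & 1 else -w_half
    return value


if __name__ == "__main__":
    w = [3, -5, 7, 11]                   # illustrative integer coefficients w0..w3
    for bits in (3, 2, 1, 0):
        ok = all(obc_reduced(w, a, bits) == obc_entry(w, a) for a in range(16))
        print(f"{2 ** bits}-word LUT reproduces the full table: {ok}")
```

With `lut_bits = 0` the stored table degenerates to a single word, which corresponds to the LUT-less case in this model.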
Figure 39, however, poses several problems for high-order FIR filters. As explained in
the previous section, DA architecture can’t be used for high-order filters without the partial
sum technique because of the exponentially increasing memory size. To be area efficient
for high-order filters, DA architecture, as explained in Sec. 2.3, should have a global shift
register unit, a global adder/shifter unit, m k-tap base units, and an adder tree

















[Figure 37 panel labels: 2^2-word LUT of DA-OBC, with entries −(w2 + w1 + w0)/2 at address 00, −(w2 + w1 − w0)/2 at address 01, and −(w2 − w1 + w0)/2 at address 10; −w3/2; Dinit = −(w3 + w2 + w1 + w0)/2.]
Figure 37. Block diagram of a 4-tap DA-OBC FIR filter with a 2^2-word LUT. The LUT size is reduced by
a factor of 2 over the original DA-OBC. (a) DA-OBC architecture for a 4-tap FIR filter with a 2^2-word
LUT after the first step of the memory reduction technique. (b) The overall additional control logics

















[Figure 38 panel labels: 2^1-word LUT of DA-OBC, with entry −(w1 + w0)/2 at address 0; −w3/2 and −w2/2; Dinit = −(w3 + w2 + w1 + w0)/2.]
Figure 38. Block diagram of a 4-tap DA-OBC FIR filter with a 2^1-word LUT. The LUT size is reduced by
a factor of 2 over the original DA-OBC. (a) DA-OBC architecture for a 4-tap FIR filter with a 2^1-word
LUT after the first step of the memory reduction technique. (b) The overall additional control logics



































[Figure 39 block labels: Shift Register Unit, DA-OBC Base Unit, Adder/Shifter Unit.]
Figure 39. Block diagram of the LUT-less DA-OBC for a 4-tap FIR filter. The adder/shifter unit is
controlled by the input sample, x[n − K], and the filter coefficients, Dinit.
global adder/shifter unit for the following two reasons.
• The adder in the adder/shifter unit is controlled by the bit streams from x[n − K].
• The adder in the adder/shifter unit is multiplexed with Dinit, which is a function of
the filter coefficients.
The proposed architecture, shown in Figure 40, prevents the input sample, x[n − K], and
the filter coefficients, Dinit, from controlling the adder/shifter unit. This architecture can be
derived in the following three steps:
1. The sign control of the adder located inside the adder/shifter unit is disconnected
from the input samples by adding an inverter and a 2x1 MUX inside the base unit.
The select line of the 2x1 MUX is connected to the input sample, x[n − K], coming
from the shift register unit. The 2x1 MUX selects the positive/negative sums of the K − 1
adders located inside the base unit, depending on the 0/1 values of the x[n − K]
bitstream.
2. Dinit and the 2x1 MUX are moved into the base unit with an additional full adder. The
select line of this 2x1 MUX is unchanged (“1” for LSB and “0” for others) but it
selects 0/Dinit, depending on the 0/1 values of the select line.
3. The 2x1 MUX controlled by x[n − K], an inverter, and a full adder are replaced by a
full adder with a subtraction selector.
The overall cost for converting the LUT-less DA-OBC architecture into the architec-
ture shown in Figure 40 is a single full adder with a subtraction selector. By following the
same three steps, other DA-OBC architectures shown in this section can be transformed
into hybrid architectures, which we refer to as “hybrid DA-OBC,” at the cost of a single
full adder with a subtraction selector. Although the base unit of hybrid DA-OBC archi-
tecture requires one more full adder with a subtraction selector than the base unit of its








































Figure 40. Block diagram of the LUT-less hybrid DA-OBC for a 4-tap FIR filter. The adder/shifter
unit has a standard structure in which the control of the adder/shifter unit is independent of the input
sample and filter coefficients.
global adder/shifter unit. In addition, adders used inside the hybrid DA-OBC base unit can
be configured to have a tree architecture instead of a linear connection to reduce the critical
data path delay, shown in Figure 40.
5.2 Hybrid DA Architecture
The hybrid DA architecture presented here is based on the memory reduction technique of
DA, which minimizes the redundancy of LUT contents of DA architecture. The memory
reduction technique for DA can also be best understood with examples. The LUT of the
original LUT-based K-tap DA architecture has the following property:
$$\mathrm{LUT}(1, b_{K-2}, \cdots, b_1, b_0) = \mathrm{LUT}(0, b_{K-2}, \cdots, b_1, b_0) + w_{K-1}, \qquad (52)$$
where $\mathrm{LUT}(b_{K-1}, b_{K-2}, \cdots, b_1, b_0)$ is the LUT content addressed by $b_{K-1}, b_{K-2}, \cdots, b_1, b_0$.
Figure 9 shows that the lower half of the LUT (locations whose addresses have a 1 in the MSB)
is equal to the upper half of the LUT (locations whose addresses have a 0 in the MSB) plus the
w3 term. Hence, the LUT size can be reduced by a factor of 2 with an additional 2x1 MUX
and a full adder, as shown in Figure 41. The 2x1 MUX selects 0/w3, depending on the 0/1
values of x[n − K]. Inspection of the LUT of Figure 41 shows that the symmetry property


















b2 b1 b0   data
0  0  0    0
0  0  1    w0
0  1  0    w1
0  1  1    w1 + w0
1  0  0    w2
1  0  1    w2 + w0
1  1  0    w2 + w1
1  1  1    w2 + w1 + w0
2^3-word LUT of DA
Figure 41. Hybrid DA architecture for a 4-tap FIR filter with a 2^3-word LUT. The LUT size was reduced by a
factor of 2 with an additional 2x1 MUX and a full adder using the observation that the lower half of
the LUT (locations whose addresses have a 1 in the MSB) of Figure 9 is equal to the upper half of the LUT
(locations whose addresses have a 0 in the MSB) plus the w3 term.
hybrid DA architectures with a 2^2-word LUT and a 2^1-word LUT. After several iterations of the
memory reduction, one can finally obtain the LUT-less hybrid DA architecture shown in
Figure 43.
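The memory reduction based on Equation (52) can be modeled in a few lines. The following is a behavioral sketch only: `build_lut` and `hybrid_lookup` are hypothetical names, and the MUX and adder of each reduction step are represented by an ordinary conditional add.

```python
def build_lut(weights):
    """Full DA LUT: the word at address a is the sum of weights[i] over the set bits i of a."""
    return [sum(w for i, w in enumerate(weights) if (a >> i) & 1)
            for a in range(1 << len(weights))]


def hybrid_lookup(weights, address, lut_bits):
    """DA partial sum using a LUT over only the lowest `lut_bits` taps.

    Every remaining tap contributes through a 2x1 MUX (0 or w_i) and an adder.
    """
    small_lut = build_lut(weights[:lut_bits])            # 2**lut_bits words
    total = small_lut[address & ((1 << lut_bits) - 1)]
    for i in range(lut_bits, len(weights)):
        total += weights[i] if (address >> i) & 1 else 0  # MUX followed by an adder
    return total


if __name__ == "__main__":
    w = [3, -5, 7, 11]                                   # illustrative coefficients w0..w3
    full = build_lut(w)
    for bits in (4, 3, 2, 1, 0):
        ok = all(hybrid_lookup(w, a, bits) == full[a] for a in range(16))
        print(f"LUT of {2 ** bits if bits else 0} words plus {4 - bits} MUX/adder pairs: {ok}")
```

Setting `lut_bits` to the number of taps gives the original LUT-based DA, and setting it to 0 gives the LUT-less case, in which the one-word table holds only the constant 0.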
5.3 Hardware Cost Analysis
The hardware complexities of the LUT-based DA, DA-OBC, hybrid DA-OBC, and hybrid DA architectures
are compared in terms of a transistor count estimate and FPGA resource utilization.
Hybrid DA-OBC and hybrid DA architectures generally describe DA-OBC and DA
architectures with various LUT sizes including the LUT-less architecture. For simplicity,
we consider only the LUT-less hybrid DA-OBC and the LUT-less hybrid DA architectures
in this section. In addition, only base units are considered for transistor count compar-
ison since both the shift register unit and the adder/shifter unit are common for all four
architectures.
Estimating transistor count is a frequently used technique for comparing custom VLSI
chip size among different architectures. Digital logic functions, however, can be implemented
in many different ways, depending on the targeted design constraints such as



















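[Figure 42 panel labels: (a) 2^2-word LUT of DA, with entries 0 at address 00, w0 at address 01, and w1 at address 10; (b) 2^1-word LUT of DA, with entry 0 at address 0.]
Figure 42. (a) Hybrid DA architecture for a 4-tap FIR filter with a 2^2-word LUT. (b) Hybrid DA architec-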






































Proposed DA Base Unit
Figure 43. LUT-less hybrid DA architecture for a 4-tap FIR filter. Adders inside the base unit are
configured to have a tree architecture instead of a linear connection, to reduce the critical data path
delay.
Table 5. Transistor counts for digital logic functions. Estimating transistor count is a frequently used
technique for comparing custom VLSI chip size among different architectures. Digital logic functions,
however, can be implemented in many different ways, depending on their targeting design constraints
such as silicon area, power consumption, and speed. Transistor count is estimated based on the static
CMOS implementation of digital logic or the assumptions made in [20].
Logic                                               Transistor count
INV (1 bit)                                         2
XOR (1 bit)                                         8
2x1 MUX (1 bit)                                     6
Adder (fixed Cin), Cin = 0 (Bc/2 bit)               12(Bc/2)
Adder (fixed Cin), Cin = 1 (Bc/2 bit)               14(Bc/2)
Adder (Bc bit)                                      30Bc
Adder/Subtractor (fixed Cin), Cin = 0 (Bc/2 bit)    (12 + 8)(Bc/2)
Adder/Subtractor (fixed Cin), Cin = 1 (Bc/2 bit)    (14 + 8)(Bc/2)
Adder/Subtractor (Bc bit)                           30Bc + 8Bc
2^k × Bc ROM, decoder                               C(1, k)
2^k × Bc ROM, data                                  D(1, k, Bc)
Register (1 bit)                                    16
with four 2x1 NAND gates, which are equivalent to 16 transistors, or with full static CMOS
logics, which are equivalent to 8 transistors [59, 79]. Table 5 lists the transistor count as-
sumed in the analysis for each logic function. The transistor count estimate in Table 5 is
based on static CMOS implementation of digital logic or assumptions made in [20]. Cost
functions in Table 5 are defined as





C(a, b) = ( … (b − i) + 2^(b−a+1) )    (53)
D(a, b, c) = 2^(b−a+1) × c.    (54)
When one of the three inputs of a full adder has a fixed value (we assume the fixed input
is the carry, Cin, for convenience), 12 and 14 transistors can be used for Cin = 0 and Cin = 1,
respectively, rather than the 30 transistors of a regular full adder [79]. The occurrences of Cin = 0
and Cin = 1 are statistically assumed to be the same, with Bc/2 bits for Cin = 0 and the
remaining Bc/2 bits for Cin = 1. The adder with a subtraction selector requires 8 more
transistors for an extra 2x1 MUX and an inverter in addition to the original full adder [79].
When the transistor counts for various DA architectures are estimated, two versions–the
Table 6. Transistor count comparison of various base units for the k-tap FIR filter. The filter coefficients
of the hybrid DA-OBC and hybrid DA architectures are stored in registers to allow the programming
of filter coefficients prior to the filtering operation.
Logic functions      LUT-based DA   DA-OBC        LUT-less hybrid DA-OBC   LUT-less hybrid DA
                                                  (register version)       (register version)
ROM decoder          C(1, k)        C(2, k)       0                        0
ROM data             D(1, k, Bc)    D(2, k, Bc)   0                        0
XOR                  0              8k            8(k − 1)                 0
2x1 MUX              0              6Bc           6Bc                      6k × Bc
Register             0              16Bc          80Bc                     16k × (Bc − ⌈log2 k⌉)
Adder                0              0             0                        (k − 1) × 30Bc
Adder/Sub. Cin = 0   0              0             0                        0
Adder/Sub. Cin = 1   0              0             0                        0
Adder/Sub.           0              0             k × 38Bc                 0
register version and the hardwired coefficient version–are considered, depending on how
filter coefficients are implemented. These two versions differ only in the hybrid DA-OBC
and the hybrid DA architectures since the original LUT-based DA and DA-OBC store their
coefficients only inside the LUT. In the register version, each coefficient is assumed to be
stored in an explicitly created register. This case assumes manual load and clear operations
of filter coefficients stored in the registers. This version is suitable for general FIR filtering
processors whose coefficients can be programmed prior to the actual filtering operation.
Table 6 shows the transistor count comparison of various base units of the k-tap FIR filter
for the register version.
Figure 44 compares the silicon area of hybrid architectures, such as the hybrid DA-
OBC and the hybrid DA, whose filter coefficients are stored in registers, with the original
DA and DA-OBC in terms of transistor count when Bc = 18. Figure 44 is re-created from
Table 6 and thus considers only the base unit. Figure 44 shows that hybrid architectures
require a smaller area than LUT-based DA and DA-OBC for high-order filter


































Figure 44. Transistor count estimation comparison of various base units for different filter sizes with
Bc = 18. Hybrid architectures require a smaller area than LUT-based DA and DA-OBC for high-order
filter implementation; in particular, the hybrid DA architecture requires the smallest transistor counts
among the four different base units throughout all filter sizes.
implementation. Originally, hybrid architectures were derived to reduce memory usage exponentially
at a cost of linearly increased control logic complexity, but Figure 44 shows
that hybrid architectures are not only memory efficient but also area efficient for high-order
filters. Furthermore, it should be noted that the hybrid DA architecture requires the smallest
transistor counts among the four different base units throughout all filter sizes.
The other version is the hardwired version, in which filter coefficients of the hybrid
DA-OBC and the hybrid DA are assumed to be hardwired to either VCC or GND. The
disadvantage of the hardwired version is a lack of flexibility and programmability. Once
the VLSI chip is fabricated, the filter coefficients can’t be modified. The hardwired version,
however, is smaller than the register version since it is maximally area optimized for pre-
chosen specific coefficients. The hardwired version is useful for well-known transform
operations such as DCT, inverse DCT (IDCT), DFT, or inverse DFT (IDFT) processors.
These processors have been extensively investigated for VLSI implementation [14, 15, 103]
Table 7. Transistor count comparison of various base units for the k-tap FIR filter. The filter coefficients
of the hybrid DA-OBC and hybrid DA architectures are hardwired, assuming they are not changed.
Due to this assumption, the original adder can be replaced with the smaller adder with fixed carry, Cin.
Logic functions      LUT-based DA   DA-OBC        LUT-less hybrid DA-OBC   LUT-less hybrid DA
                                                  (hardwired version)      (hardwired version)
ROM decoder          C(1, k)        C(2, k)       0                        0
ROM data             D(1, k, Bc)    D(2, k, Bc)   0                        0
XOR                  0              8k            8(k − 1)                 0
2x1 MUX              0              6Bc           6Bc                      6k × Bc
Register             0              16Bc          16Bc                     0
Adder                0              0             0                        (k − 1) × 30Bc
Adder/Sub. Cin = 0   0              0             (k − 1) × 10Bc           0
Adder/Sub. Cin = 1   0              0             (k − 1) × 11Bc           0
Adder/Sub.           0              0             38Bc                     0
and their coefficients do not need to be updated once the size and type of transform are
determined. Table 7 shows the transistor count comparison of various base units of the
k-tap FIR filter for the hardwired version.
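The per-word adder costs of Table 5 and the hybrid DA columns of Tables 6 and 7 can be evaluated directly. The sketch below only restates those table entries as arithmetic; the function names are hypothetical, an even Bc is assumed so that Bc/2 is an integer, and the LUT-based columns are omitted because they depend on the decoder cost function C.

```python
ADDER = 30        # regular full adder, transistors per bit (Table 5)
ADDER_CIN0 = 12   # full adder with the carry input fixed to 0, per bit
ADDER_CIN1 = 14   # full adder with the carry input fixed to 1, per bit
SUB_SELECT = 8    # extra transistors per bit for the subtraction selector
MUX = 6           # 2x1 MUX, per bit
REGISTER = 16     # register, per bit


def fixed_carry_adder(bc):
    """Bc-bit adder with fixed carry: half the bits assume Cin = 0, half Cin = 1."""
    return ADDER_CIN0 * (bc // 2) + ADDER_CIN1 * (bc // 2)


def fixed_carry_adder_sub(bc):
    """Bc-bit fixed-carry adder with a subtraction selector."""
    return (ADDER_CIN0 + SUB_SELECT) * (bc // 2) + (ADDER_CIN1 + SUB_SELECT) * (bc // 2)


def hybrid_da_register(k, bc):
    """LUT-less hybrid DA base unit, register version (Table 6)."""
    ceil_log2_k = (k - 1).bit_length()
    return MUX * k * bc + REGISTER * k * (bc - ceil_log2_k) + (k - 1) * ADDER * bc


def hybrid_da_hardwired(k, bc):
    """LUT-less hybrid DA base unit, hardwired version (Table 7)."""
    return MUX * k * bc + (k - 1) * ADDER * bc


if __name__ == "__main__":
    print(fixed_carry_adder(18), fixed_carry_adder_sub(18), ADDER * 18)
    print(hybrid_da_register(4, 18), hybrid_da_hardwired(4, 18))
```

For Bc = 18, a fixed-carry adder costs 234 transistors against 540 for a regular adder, which is the saving that the hardwired hybrid DA-OBC exploits.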
Figure 45 compares the silicon area of hybrid architectures, whose filter coefficients are
hardwired, with the original DA and DA-OBC architectures in terms of transistor count
when Bc = 18. As in Figure 44, Figure 45 shows that hybrid architectures
require a smaller area than LUT-based DA and DA-OBC architectures for high-order filters.
In particular, the hybrid DA-OBC requires the smallest transistor count among the four
different base units for high-order filters because it can be implemented with fixed-carry
adders, which are much smaller than regular adders.
5.4 FPGA Implementation
To illustrate the merits of hybrid architectures, LUT-based DA, the hybrid DA-OBC, and
the hybrid DA are physically implemented on an Altera Stratix EP1S80F1508C6 FPGA
chip. Implementation results of the FPGA synthesis are provided by the compilation reports
of Altera’s Quartus II 4.0 software.


































Figure 45. Transistor count estimation comparison of various base units for different filter sizes with
Bc = 18. Filter coefficients of the hybrid DA-OBC and hybrid DA architectures are hardwired, as-
suming they are not changed. Hybrid architectures require a smaller area than LUT-based DA and
DA-OBC architectures for high-order filters; in particular, the hybrid DA-OBC architecture requires
the smallest transistor counts among the four different base units for high-order filters because it can
be implemented with fixed-carry adders, which are much smaller than regular adders.
Table 8 shows the implementation results for randomly generated FIR filters, ranging
from 4 to 1024 taps, when the word length of the LUT, Bc, is 18 and the base unit size, k, is
4. One can conclude from Table 8 that the hybrid DA architecture is hardware-efficient,
requiring fewer logic elements (LEs) and less memory than the original LUT-based DA, while
the hybrid DA-OBC architecture is memory-efficient, requiring more LEs and less memory
than the original LUT-based DA. The hybrid DA architecture is more promising than
the hybrid DA-OBC for FPGA implementation since it requires fewer LEs and the same
memory. For instance, a 1024-tap hybrid DA architecture implemented with a 4-tap base
unit uses only 48% of the LEs and 16% of the memory of the LUT-based DA architecture at
a cost of two additional clock cycles (assuming the adders inside the k-tap base unit are
pipelined). Likewise, a 1024-tap hybrid DA architecture requires only 37% of the LEs and the
same memory as the hybrid DA-OBC.
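The percentages quoted above follow from the 1024-tap column of Table 8. A short check, with the table values copied into a hypothetical dictionary:

```python
# LE and memory usage for the 1024-tap filters, copied from Table 8.
TABLE8_1024 = {
    "LUT-based DA": {"LE": 22862, "memory": 88064},
    "hybrid DA-OBC": {"LE": 30286, "memory": 14336},
    "hybrid DA": {"LE": 11086, "memory": 14336},
}


def percent(value, reference):
    """Ratio of value to reference, rounded to a whole percent."""
    return round(100.0 * value / reference)


if __name__ == "__main__":
    ref = TABLE8_1024["LUT-based DA"]
    for name in ("hybrid DA-OBC", "hybrid DA"):
        row = TABLE8_1024[name]
        print(name, percent(row["LE"], ref["LE"]), percent(row["memory"], ref["memory"]))
    print(percent(TABLE8_1024["hybrid DA"]["LE"], TABLE8_1024["hybrid DA-OBC"]["LE"]))
```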
Other implementation results for randomly generated FIR filters whose sizes vary from
8 to 2048 taps are presented in Table 9. The word length of the LUT, Bc, is 18, and the
base unit size, k, is 8. Table 9 shows that the LUT-based DA architecture can synthesize
only up to 512-tap FIR filters, while the hybrid DA-OBC can synthesize up to 1024-tap FIR filters
and the hybrid DA can synthesize 2048-tap or higher-order FIR filters on the same FPGA chip.
In particular, the hybrid DA architecture uses only 17% of the LEs and less than 1% of the
memory of an Altera Stratix EP1S80F1508C6 FPGA chip.
The implementation results of the hybrid DA architecture are outstanding in terms of
saving hardware resources of an FPGA for several reasons. First, the hybrid DA architec-
ture has a more compact base unit, which consists of three adders, four 2x1 MUXs, and
four registers, than the base unit of the hybrid DA-OBC, which consists of three XORs,
four adders with a subtraction selector, one 2x1 MUX, and five registers. Second, even
with the same compilation options and the same manner of defining the filter coefficients,
Altera’s Quartus II software optimizes the hybrid DA architecture more than the hybrid
DA-OBC architecture, automatically hardwiring the filter coefficients of the hybrid DA
architecture [1].
Table 8. Summary of the FPGA implementation results (Altera Stratix EP1S80F1508C6 FPGA chip)
for randomly generated FIR filters, ranging from 4 taps to 1024 taps, when Bc = 18 and k = 4. Both
hybrid architectures require less memory than the original LUT-based DA, and the hybrid DA also requires fewer LEs.
                                    Filter size (K)
Architecture             Resource   4     16     64     128     256     512     1024
LUT-based DA             LE         272   551    1639   3056    5890    11547   22862 (100%)
                         Memory     344   1376   5504   11008   22016   44032   88064 (100%)
LUT-less hybrid DA-OBC   LE         300   667    2104   3984    7746    15259   30286 (132%)
                         Memory     56    224    896    1792    3584    7168    14336 (16%)
LUT-less hybrid DA       LE         210   367    887    1569    2946    5659    11086 (48%)
                         Memory     56    224    896    1792    3584    7168    14336 (16%)
Table 9. Summary of the FPGA implementation results (Altera Stratix EP1S80F1508C6 FPGA chip)
for randomly generated FIR filters, ranging from 8 taps to 2048 taps, when Bc = 18 and k = 8. The
LUT-based DA architecture can synthesize only up to 512-tap FIR filters, while the hybrid DA-OBC
can synthesize up to 1024-tap FIR filters and the hybrid DA can synthesize 2048-tap or higher-order FIR filters.
In particular, the hybrid DA architecture uses only 17% of the LEs and less than 1% of the memory of the
Altera FPGA chip.
                                    Filter size (K)
Architecture             Resource   8      16     64      128     256      512      1024    2048
LUT-based DA             LE         327    481    1359    2521    4841     9474     x       x
                         Memory     4976   9952   39808   79616   159232   318464   x       x
LUT-less hybrid DA-OBC   LE         395    609    1862    3538    6849     13468    26700   x
                         Memory     112    224    896     1792    3584     7168     14336   x
LUT-less hybrid DA       LE         228    304    638     1073    1938     3676     7116    13992
                         Memory     112    224    896     1792    3584     7168     14336   28672
5.5 Reusable DA Block
The DCT has been widely used in image and audio signal processing, for its properties are













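where x(n) is the input sequence, c(0) = 1/√2, and c(k) = 1 otherwise, for k = 0, 1, 2, ···, N − 1. The constant
scaling factor, √(2/N), can be neglected without loss of generality. Then Equation (55)
can be rewritten in matrix form as
$$X = T(N)\,x, \qquad (56)$$
where X = [X(0), X(1), ···, X(N − 1)]^T, x = [x(0), x(1), ···, x(N − 1)]^T, and T(N) is an
N × N matrix whose (k, n)-th component is
$$T(N)(k, n) = c(k) \cos\left(\frac{(2n+1)k\pi}{2N}\right). \qquad (57)$$
For N = 8, the transform matrix is
$$T(8) = \begin{bmatrix}
A & A & A & A & A & A & A & A \\
D & E & F & G & -G & -F & -E & -D \\
B & C & -C & -B & -B & -C & C & B \\
E & -G & -D & -F & F & D & G & -E \\
A & -A & -A & A & A & -A & -A & A \\
F & -D & G & E & -E & -G & D & -F \\
C & -B & B & -C & -C & B & -B & C \\
G & -F & E & -D & D & -E & F & -G
\end{bmatrix}, \qquad (58)$$
where A = cos(π/4), B = cos(π/8), C = cos(3π/8), D = cos(π/16), E = cos(3π/16), F = cos(5π/16), and G = cos(7π/16). In terms of its row vectors,
$$T(8) = \begin{bmatrix}
\leftarrow r_0 \rightarrow \\
\leftarrow r_1 \rightarrow \\
\leftarrow r_2 \rightarrow \\
\leftarrow r_3 \rightarrow \\
\leftarrow r_4 \rightarrow \\
\leftarrow r_5 \rightarrow \\
\leftarrow r_6 \rightarrow \\
\leftarrow r_7 \rightarrow
\end{bmatrix}, \qquad (59)$$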







































































Figure 46. Direct implementation of the 8-point DCT with DA. Each DA unit stores all the possible sums
of the corresponding row vector in the LUT. The shift register unit, which is not drawn, generates 8-bit
address lines for the 2^8-word LUT.
where each r_k is the k-th row vector.
Figure 46 shows the direct implementation of the 8-point DCT using DA. The single
shift register unit, which is not drawn here, creates 8-bit address lines for eight DA units.
The DA units generate X(0), X(1), ···, X(7), respectively. The LUT in each DA unit stores
all the possible sums of one row vector (r_0, r_1, ···, r_7) of the transform matrix, T(8). The direct implementation of the
8-point DCT requires 8-bit address lines for each 2^8-word LUT.
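A single DA unit of the direct implementation can be modeled as follows. This is a simplified sketch: it assumes unsigned 8-bit input samples so that no sign-bit correction is needed, uses the standard DCT-II row definition with c(0) = 1/√2 and the √(2/N) factor dropped, and the names `dct_row`, `build_lut`, and `da_dot` are hypothetical.

```python
import math

N = 8


def dct_row(k):
    """Row k of the 8-point DCT matrix: c(k) * cos((2n + 1) k pi / 16)."""
    c = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
    return [c * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]


def build_lut(row):
    """2**8-word LUT: the word at address a is the sum of row[n] over the set bits n of a."""
    return [sum(r for n, r in enumerate(row) if (a >> n) & 1)
            for a in range(1 << len(row))]


def da_dot(lut, samples, bits=8):
    """Bit-serial inner product of the LUT's row with unsigned `bits`-bit samples."""
    acc = 0.0
    for b in range(bits):
        # Bit b of every sample forms one 8-bit LUT address.
        address = sum(((x >> b) & 1) << n for n, x in enumerate(samples))
        acc += lut[address] * (1 << b)   # shift-and-add of the looked-up partial sum
    return acc


if __name__ == "__main__":
    x = [12, 200, 37, 255, 0, 91, 128, 5]
    for k in range(N):
        row = dct_row(k)
        direct = sum(r * xi for r, xi in zip(row, x))
        print(k, round(da_dot(build_lut(row), x), 6), round(direct, 6))
```

Each of the eight DA units needs its own 256-word table in this form, which is the cost that the decompositions discussed next try to reduce.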
The DCT architecture shown in Figure 46 can be enhanced, for example, by using
the symmetric or periodic properties of the DCT matrix, as explained in Equation (58).
An example of an enhanced 8-point DCT architecture using the even-odd decomposition
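method [103] is shown in Figure 47. The even-odd decomposition method [14, 15, 103] splits the DCT matrix into an even part and an odd part:
$$\begin{bmatrix} X(0) \\ X(2) \\ X(4) \\ X(6) \end{bmatrix} =
\begin{bmatrix} A & A & A & A \\ B & C & -C & -B \\ A & -A & -A & A \\ C & -B & B & -C \end{bmatrix}
\begin{bmatrix} x(0) + x(7) \\ x(1) + x(6) \\ x(2) + x(5) \\ x(3) + x(4) \end{bmatrix}, \qquad (60)$$
$$\begin{bmatrix} X(1) \\ X(3) \\ X(5) \\ X(7) \end{bmatrix} =
\begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix}
\begin{bmatrix} x(0) - x(7) \\ x(1) - x(6) \\ x(2) - x(5) \\ x(3) - x(4) \end{bmatrix}. \qquad (61)$$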
Although this architecture still requires eight DA units, it requires only 4-bit address lines
for each 2^4-word LUT instead of 8-bit address lines for a 2^8-word LUT. The extra hardware required
for this architecture amounts to four adders and four subtractors. The DCT architecture
of Figure 47 requires significantly less silicon area since the LUT size is decreased expo-



















































Figure 47. DA architecture for the even-odd decomposition of the 8-point DCT. Although this architecture
still requires eight DA units, each DA unit has 4-bit address lines for a 2^4-word LUT instead of
8-bit address lines for a 2^8-word LUT. The extra hardware required for this architecture increases to four






















































Figure 48. DA architecture for the second recursive even-odd decomposition of the 8-point DCT.
Equation (60) is further decomposed into even and odd matrices of a smaller size, exploiting the
symmetric and periodic properties.
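Further recursive decomposition of the DCT matrix can lead to more area savings.
Equation (61), unfortunately, cannot be easily decomposed into smaller matrices, but
Equation (60) can be further decomposed into even and odd matrices of a smaller size:
\[
\begin{bmatrix} X(0) \\ X(4) \end{bmatrix} =
\begin{bmatrix} A & A \\ A & -A \end{bmatrix}
\begin{bmatrix}
(x(0) + x(7)) + (x(3) + x(4)) \\
(x(1) + x(6)) + (x(2) + x(5))
\end{bmatrix},
\tag{62}
\]
\[
\begin{bmatrix} X(2) \\ X(6) \end{bmatrix} =
\begin{bmatrix} B & C \\ C & -B \end{bmatrix}
\begin{bmatrix}
(x(0) + x(7)) - (x(3) + x(4)) \\
(x(1) + x(6)) - (x(2) + x(5))
\end{bmatrix}.
\tag{63}
\]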
Figure 48 shows the DA architecture for the 8-point DCT, which is simplified using
Equations (62) and (63). For the smaller DA units, only 2-bit address lines are required for a
LUT of size 2^2. Further recursive decomposition of the even matrix is still possible
to optimize area usage, as proposed in [103].
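The two levels of even-odd decomposition can be checked numerically against the full 8 × 8 matrix product. The sketch below is an illustration only: the function names are hypothetical, and the matrix is taken without any overall normalization factor, which is an assumption about the form of the transform matrix.

```python
import math


def c(num, den):
    """cos(num * pi / den), used for the constants A to G defined in the text."""
    return math.cos(num * math.pi / den)


A, B, C = c(1, 4), c(1, 8), c(3, 8)
D, E, F, G = c(1, 16), c(3, 16), c(5, 16), c(7, 16)


def dct8_direct(x):
    """Full matrix-vector product with the 8 x 8 matrix (row 0 scaled to A)."""
    out = []
    for k in range(8):
        scale = A if k == 0 else 1.0
        out.append(sum(scale * math.cos((2 * i + 1) * k * math.pi / 16) * x[i]
                       for i in range(8)))
    return out


def dct8_even_odd(x):
    """8-point DCT via the first and second even-odd decompositions."""
    s = [x[i] + x[7 - i] for i in range(4)]   # four adders
    d = [x[i] - x[7 - i] for i in range(4)]   # four subtractors
    p0, p1 = s[0] + s[3], s[1] + s[2]         # inputs of the 2 x 2 even-even part
    q0, q1 = s[0] - s[3], s[1] - s[2]         # inputs of the 2 x 2 even-odd part
    X = [0.0] * 8
    X[0], X[4] = A * p0 + A * p1, A * p0 - A * p1
    X[2], X[6] = B * q0 + C * q1, C * q0 - B * q1
    odd = [[D, E, F, G], [E, -G, -D, -F], [F, -D, G, E], [G, -F, E, -D]]
    for row, k in zip(odd, (1, 3, 5, 7)):
        X[k] = sum(coef * v for coef, v in zip(row, d))
    return X


x = [3.0, -1.5, 4.0, 1.0, -5.0, 9.0, 2.5, -6.0]
X_full = dct8_direct(x)
X_split = dct8_even_odd(x)
```

Both functions return the same coefficients up to rounding. In LUT terms, the direct form needs eight tables of 2^8 entries, while the decomposed form needs four tables of 2^4 entries for the odd part and four tables of 2^2 entries for the even part.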
From these DCT implementations, one can observe that LUT size could be reduced or




































Figure 49. Block diagram of the reusable DA base unit and the reusable DA. The LUT-less hybrid DA
architecture can lead to a new architecture, called “reusable DA block,” by moving the registers that
store the filter coefficients outside the DA base unit. Since the DA base unit has a generic architecture
independent of the filter coefficients, it can be reused for different filter coefficients.
the LUT itself could be replaced with optimized adder networks using common subexpres-
sion sharing approaches. However, the number of DA units can’t be changed since each DA
unit stores its own unique filter coefficients in the LUT, even after the application of vari-
ous area saving techniques such as recursive matrix decomposition [103] and adder-based
DA [14, 15].
To eliminate this drawback of the DA architecture for DCT, we propose a new DA ar-
chitecture that enables the recycling of the DA unit. Among the hybrid DA architectures
introduced in the previous section, the LUT-less hybrid DA architecture can lead to a new
architecture that we name “reusable DA block (RDA),” shown in Figure 49. The RDA con-
sists of a regular adder/shifter unit and a “reusable DA base unit,” whose filter coefficients
are not located inside DA base unit, so the DA base unit has a generic architecture inde-
pendent of the filter coefficients. A new 8-point DCT architecture using the RDA is shown
in Figure 50. Two clockwise commutators are used to feed the row vectors (r_0, r_1, ..., r_7)
of the original DCT matrix, T(8), into the RDA, and pass the filtered output of the RDA to

























Figure 50. Multiplexed architecture for the 8-point DCT with reusable DA. The RDA consists of a
regular adder/shifter unit and a “reusable DA base unit,” whose filter coefficients are not located inside
of the DA base unit so the DA base unit has a generic architecture independent of filter coefficients.
Two commutators, rotating clockwise, are used to feed the row vectors of the original DCT matrix,
T(8), into the RDA, and pass the filtered output of the RDA to the eight DCT coefficients sequentially.
the eight DCT coefficients (X(0), X(1), ..., X(7)) sequentially. The RDA is a highly
area-optimized architecture for matrix-vector multiplications such as the DCT and DFT. Even
though the RDA requires more clock cycles than the typical DA, this architecture is very




In Chapter 2, it was shown that the DA implementation of FIR filters enables the filtering
operation to end at a fixed number of clock cycles, which is equal to the bit precision of
the input samples, regardless of the filter size. Thus, the DA architecture is popular for the
high-speed implementation of high-order FIR filters. DA can also be extended to fixed-
coefficient IIR filters without loss of generality. However, the application of LUT-based
DA architecture to adaptive filters has not been very successful. Several past attempts have
been made to implement adaptive filters using DA [21, 95], but the approximations made
to standard adaptation algorithms may be unsuitable for practical applications.
The application of the traditional LUT-based DA architecture to adaptive filters poses
several challenges. First, RAMs, instead of ROMs, are required for the realization of the LUT
since filter coefficients need to be updated at every sample clock. Many different types
of RAMs are available in the market, but the silicon area of RAMs is generally larger
than that of ROMs because ROMs use as little as a single transistor to store a single bit
of data. The larger area of RAMs is a particularly serious problem for high-order adaptive
filters. Second, the implementation of the LMS adaptive filter using DA demands that
the entries of the LUT, which contain all possible combination sums of the filter coefficients,
be recalculated and updated on a sample-by-sample basis. A conventional implementation
method that updates each filter coefficient individually and reconstructs the contents of
the LUT with the new coefficients is computationally expensive and time consuming, causing
a significant reduction in the filter throughput. A few attempts have been made [21, 95],
but updating a large LUT still poses a major problem in DA-based adaptive filters. For
instance, a conventional update of the LUT could take approximately 1000 clock cycles for
a 128-tap FIR filter [4].
In this chapter, a new architecture for a DA-based LMS adaptive filter (DAAF) is proposed.
The DAAF architecture is a hardware implementation of the LMS adaptive filter
with higher throughput than traditional hardware implementations, particularly for
high-order filters. Since the DAAF architecture is targeted at the high-speed implementation
of high-order adaptive filters used in a discrete-time AEC, we will attempt to implement
up to 1024 taps. Furthermore, the throughput is almost independent of the filter size and
depends largely on the bit precision of the input signal. An area-optimized DAAF
architecture for high-order filters is also presented in this chapter.
6.1 Traditional Hardware Implementation
A typical implementation of the LMS adaptive filter on a hardware system with a single
multiply-and-accumulate (MAC) unit will require K MAC operations to perform the
filtering and K MAC operations to perform the weight adaptation shown in Table 1. For
real-time implementation of high-order filters, the system clock must be much faster than
the input signal sampling rate.
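The MAC count above can be made concrete with a small reference model of one LMS iteration. This is a sketch assuming the standard LMS recursion (filter output, error, and weight update with step size µ); the function name `lms_step` and the explicit MAC counter are illustrative additions.

```python
def lms_step(w, x_hist, d, mu):
    """One LMS iteration on a single-MAC machine.

    w      : list of K weights, updated in place
    x_hist : K most recent input samples, x_hist[k] = x[n - k]
    d      : desired sample d[n]
    mu     : step size
    Returns (y, e, macs), where macs counts multiply-accumulate operations.
    """
    macs = 0
    y = 0.0
    for wk, xk in zip(w, x_hist):       # filtering: K MACs
        y += wk * xk
        macs += 1
    e = d - y
    g = mu * e                          # common factor of the weight update term
    for k, xk in enumerate(x_hist):     # adaptation: K more MACs
        w[k] += g * xk
        macs += 1
    return y, e, macs


w = [0.0] * 4
y, e, macs = lms_step(w, [1.0, 2.0, 3.0, 4.0], d=1.0, mu=0.1)
```

For this 4-tap example the step costs 8 MACs, so a single-MAC system needs on the order of 2K clock cycles per input sample.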
Alternatively, multiple MAC units may be employed to parallelize the adaptive filter
implementation [2]. In the multiple MAC-based LMS adaptive filter (MMAF) system de-
scribed in [2], the filtering and the weight adaptation are done using one or more custom
hardware MAC units. The implementation on the MMAF system is similar to the imple-
mentation of the adaptive filter on a DSP processor. In many modern DSP processors, up
to four MAC units process input samples simultaneously. The throughput of the MMAF
system depends on the filter size and the number of MAC units. As the number of MAC
units increases, higher throughput can be achieved. However, the required silicon area and


































Figure 51. Block diagram of the DAAF for a single LUT-based implementation. The DA-A-LUT
contains all possible combination sums of the K most recent input samples, and the DA-F-LUT contains all
possible combination sums of the filter coefficients. The DA-A-LUT in the DA auxiliary module is used
for the fast update of the DA-F-LUT of the DA filter module.
6.2 DA LMS Adaptive Filter
In this section, a novel filter coefficient update structure that requires fewer clock cycles
than the conventional LUT recalculation is described. The conventional update method
of the LUT must update each filter coefficient, recalculate all possible combination
sums of the new coefficients, and then store the new sums in the LUT. This is
computationally expensive and time consuming, causing a significant reduction in filter throughput.
The proposed method applies the LMS algorithm directly to the contents of the LUT by
introducing an additional auxiliary LUT.
The overall system diagram of the DAAF is shown in Figure 51. It has a “DA filter
module” for the filtering operation, a “DA auxiliary table module” for an efficient filter
coefficient update, and a “DA filter update control module” for controlling the previous two
modules. The DA filtering LUT (DA-F-LUT) in the DA filter module contains all possible
combination sums of the coefficients. The DA-F-LUT is identical to the LUT used
for fixed-coefficient FIR filters. The entries of the DA-F-LUT must be recalculated and
updated at every sample clock, once the weight update term of Table 1, µe[n]x[n − k], is
available. The DA auxiliary table module contains an auxiliary LUT, denoted as the DA
auxiliary LUT (DA-A-LUT), which contains all possible combination sums of the K most
recent input samples. The DA-A-LUT is introduced mainly for the speed-efficient update
of the DA-F-LUT. The table size and architecture of the DA-A-LUT are identical to those of
the DA-F-LUT. It is this analogous architecture that allows the adaptation scheme described
below to work efficiently. The DA filter update control module generates addresses and
control signals for the DA-A-LUT and DA-F-LUT. It also calculates the error signal, e[n],
and generates µe[n]x[n − k] for the LMS adaptive filter. The address for updating the
DA-A-LUT is rotated to the left at every sample clock for an efficient update mechanism of the
DA-A-LUT, which is explained later.
A single filtering and adaptation step of the DAAF algorithm at the time instance n
is described in Figure 52. The notations DA-A-LUT[n] and DA-F-LUT[n] are used to
refer to the DA-A-LUT and DA-F-LUT at the time instance n. The filtering operation on
x[n], x[n − 1], ..., x[n − K + 1] is performed using the DA-F-LUT[n]. While the filtering
operation is in progress, the DA-A-LUT is updated from DA-A-LUT[n − 1] to
DA-A-LUT[n]. When both the filtering and the update of the DA-A-LUT have been completed,
the DA-F-LUT is updated from DA-F-LUT[n] to DA-F-LUT[n + 1]. Once DA-F-LUT[n + 1]
is updated, the filtering and adaptation step at time n is complete. The DAAF then awaits
the arrival of a new sample x[n + 1], and the entire algorithm of Figure 52 is repeated.
The updates of the DA-A-LUT and DA-F-LUT, upon the arrival of the sample x[n], are
described in the next section.
6.2.1 DA-A-LUT Update
The simple and slow update method for the DA-A-LUT would be to recalculate all possible
combination sums of the K most recent input samples and

















Figure 52. Simplified flowchart of the DAAF. While the filtering operation is in progress, the
DA-A-LUT is updated from DA-A-LUT[n − 1] to DA-A-LUT[n]. When both the filtering and the update of
the DA-A-LUT have been completed, the DA-F-LUT is updated from DA-F-LUT[n] to DA-F-LUT[n + 1].
























































Figure 53. Update of the DA-A-LUT entries from time n − 1 to time n. The entries of the DA-A-LUT
addressed from 0000 to 0111 at time n − 1 can be reused by mapping these values to the even address
locations of the table at time n. The values in the odd address locations at time n can be updated by
adding the newest sample x[n] to the preceding even-addressed data.
store the results in the DA-A-LUT. A more efficient update method can be devised by
considering how the DA-A-LUT changes as a new sample arrives and the oldest sample is
discarded [54].
Figure 53 shows the update of DA-A-LUT[n] from DA-A-LUT[n − 1] for K = 4. It may
be observed that the contents of the even-addressed locations (locations whose addresses have
a 0 in the LSB) of the DA-A-LUT[n] are the contents of the lower half (locations whose
addresses have a 0 in the MSB) of the DA-A-LUT[n − 1]. It may also be observed that the
contents of the odd-addressed locations (locations whose addresses have a 1 in the LSB) of
the DA-A-LUT[n] can be obtained from the even-addressed locations of the DA-A-LUT[n]
according to
\[
\text{DA-A-LUT}^{(2l+1)}[n] = \text{DA-A-LUT}^{(2l)}[n] + x[n], \qquad l = 0, \ldots, 2^{K-1} - 1.
\tag{64}
\]
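The two observations above translate into a simple table update. The following sketch assumes that the address LSB corresponds to the newest sample and the MSB to the oldest one; the helper names `combo_sums` and `update_aux_lut` are hypothetical. It builds the new table as a separate list for clarity, whereas the hardware avoids moving data.

```python
def combo_sums(samples):
    """Brute-force table: entry a = sum of samples[j] over the set bits j of a.

    samples[0] is the newest sample (address LSB), samples[-1] the oldest (MSB).
    """
    return [sum(s for j, s in enumerate(samples) if (a >> j) & 1)
            for a in range(1 << len(samples))]


def update_aux_lut(prev, x_new):
    """Derive DA-A-LUT[n] from DA-A-LUT[n-1] and the newest sample x[n]."""
    half = len(prev) // 2
    new = [0] * len(prev)
    for l in range(half):
        new[2 * l] = prev[l]                  # lower half maps to even addresses
        new[2 * l + 1] = new[2 * l] + x_new   # Equation (64)
    return new


prev = combo_sums([3, 5, 7, 11])      # x[n-1], x[n-2], x[n-3], x[n-4]
new = update_aux_lut(prev, 2)         # x[n] = 2
```

Only 2^(K−1) additions are needed per sample, compared with recomputing all 2^K combination sums from scratch.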































































Figure 54. Block diagram for the DA-A-LUT address rotation. The lower half of the DA-A-LUT[n − 1]
is re-mapped to even-addressed locations of the DA-A-LUT[n]. Instead of physically moving the
contents of the DA-A-LUT, this re-mapping operation can be performed by a simple left rotation of the
K address lines of the DA-A-LUT. Address rotation allows the physical contents of memory to remain
the same, even as the external logic sees the table as re-mapped.
Table 10. Rotation of address lines for the DA-A-LUT when K = 4. The addresses in the table are
illustrated in Figure 54. External addresses (a0, a1, a2, and a3) are left rotated at every sample clock cycle.

          Internal Address
Select    int_addr0   int_addr1   int_addr2   int_addr3
00        a0          a1          a2          a3
01        a1          a2          a3          a0
10        a2          a3          a0          a1
11        a3          a0          a1          a2

• The lower half of the DA-A-LUT[n − 1] is re-mapped to even-addressed locations
of the DA-A-LUT[n], as shown by the arrows in Figure 53. Instead of physically
moving the contents of the DA-A-LUT, this re-mapping operation can be performed
by a simple left rotation of the K address lines of the DA-A-LUT. Address rotation
allows the physical contents of memory to remain the same, even as the external
logic sees the table as re-mapped. The address rotation can be achieved using the
circuit shown in Figure 54. The relationship between the internal and the external
addresses for K = 4 is shown in Table 10. The term “int_addr” in Table 10 refers
to the physical addresses of the DA-A-LUT, and “ext_addr” refers to the address as
seen by the external logic. It can be observed that the external address referring to
a given internal address at time n is the left-rotated version of the external address
referring to the same internal address at time n − 1. Therefore, the effect of address
rotation can be accomplished by connecting “ext_addr” and “int_addr” via K
multiplexers with K inputs each (K-to-1), as shown in Figure 54. The log2(K) select lines of
each of the K multiplexers are connected to the log2(K) bits of a counter, which is
incremented with the sample clock. Thus, by address rotation, the entire mapping of the
DA-A-LUT can be done instantaneously at the arrival of the new sample x[n].
• It must be noted that address rotation maps the upper half of the DA-A-LUT[n − 1],
which contains sums involving the oldest sample x[n − 4], to the odd-addressed
locations at time n. The entries in these odd-addressed locations of the DA-A-LUT[n]
are overwritten by values obtained according to Equation (64). In other words, the
contents of the odd-addressed locations of the DA-A-LUT[n] are obtained by reading
the contents of the corresponding preceding even-addressed locations, adding the
newest sample x[n], and then storing the result back in the odd-addressed locations.
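The two steps above can be modeled in software with a table whose physical contents never move. This is a minimal sketch, assuming that a0 is the address LSB, that the select counter counts modulo K, and that the table starts from an all-zero input history; the class name `AuxLut` and its methods are hypothetical. Following Table 10, internal address bit i is driven by external address bit (i + select) mod K, which is a right rotation of the external address value by `select` positions.

```python
K = 4
MASK = (1 << K) - 1


def rotr(addr, s):
    """Rotate a K-bit address value right by s positions (0 <= s < K)."""
    return ((addr >> s) | (addr << (K - s))) & MASK


class AuxLut:
    """DA-A-LUT model: the data stay in place, only the address mapping rotates."""

    def __init__(self):
        self.mem = [0] * (1 << K)   # physical memory (all-zero input history)
        self.select = 0             # counter feeding the multiplexer select lines

    def phys(self, ext_addr):
        """Map an external (logical) address to the internal (physical) one."""
        return rotr(ext_addr, self.select)

    def push(self, x_new):
        """Accept the newest sample: rotate the mapping, then rewrite odd entries."""
        self.select = (self.select + 1) % K
        for l in range(1 << (K - 1)):
            even, odd = self.phys(2 * l), self.phys(2 * l + 1)
            self.mem[odd] = self.mem[even] + x_new   # Equation (64)

    def read(self, ext_addr):
        return self.mem[self.phys(ext_addr)]


lut = AuxLut()
for sample in [2, -3, 5, 7, 11]:
    lut.push(sample)
newest_two = lut.read(0b0011)   # x[n] + x[n-1] = 11 + 7
```

After each `push`, reading through the rotated mapping returns the combination sums of the K most recent samples, while only the 2^(K−1) odd-addressed entries were rewritten.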
6.2.2 DA-F-LUT Update
The DAAF updates the contents of the DA-F-LUT directly, instead of updating the filter
coefficients and then storing the sums of the filter coefficients in the DA-F-LUT. Once the
update of the DA-A-LUT[n] as well as the filtering operation are done, an update of the
DA-F-LUT[n + 1] is performed. The update from time n to n + 1 of the r-th entry of the
DA-F-LUT is given by
\[
\text{DA-F-LUT}^{(r)}[n+1] = \text{DA-F-LUT}^{(r)}[n] + \mu e[n]\, \text{DA-A-LUT}^{(r)}[n].
\tag{65}
\]
The DA-F-LUT[n + 1] is updated by reading the same memory location in both the
DA-F-LUT[n] and the DA-A-LUT[n], multiplying the output of the DA-A-LUT[n] by µe[n],
adding this quantity to the output of the DA-F-LUT[n], and finally storing the result back in
the same memory location of the DA-F-LUT[n + 1]. This process is repeated from address
1 to address 2^K − 1. The entry in address 0 is not updated since it always has a zero value.
The addresses and the control signals for the DA-F-LUT update are provided by the DA
filter update controller module.
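Because every LUT entry is a sum of coefficients, applying Equation (65) to each entry gives the same table as updating the weights first and rebuilding the LUT. The sketch below checks this equivalence for a small example; the helper names are hypothetical and the values are chosen so that the arithmetic is exact in floating point.

```python
def combo_sums(values):
    """Entry a = sum of values[j] over the set bits j of address a."""
    return [sum(v for j, v in enumerate(values) if (a >> j) & 1)
            for a in range(1 << len(values))]


def update_filter_lut(f_lut, a_lut, mu_e):
    """Apply Equation (65) entry by entry; address 0 always stays zero."""
    for r in range(1, len(f_lut)):
        f_lut[r] += mu_e * a_lut[r]
    return f_lut


w = [1.0, -2.0, 3.0, 4.0]          # current weights
x = [2.0, 3.0, 5.0, 7.0]           # x[n], x[n-1], x[n-2], x[n-3]
mu_e = 0.5                         # value of mu * e[n] for this step

direct = update_filter_lut(combo_sums(w), combo_sums(x), mu_e)
rebuilt = combo_sums([wk + mu_e * xk for wk, xk in zip(w, x)])
```

Here `direct` equals `rebuilt`, which shows why the filter coefficients themselves never need to be stored or updated separately.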
The multiplication of the entries of the DA-A-LUT[n] by µe[n] can be accomplished
by using a custom hardware multiplier. To reduce the number of hardware multipliers
required to compute µe[n]DA-A-LUT^(r)[n], µ is often set to a power of two, replacing
multiplication with a right shift. Further savings in hardware multipliers can be achieved by using
the well-known sign error (SE)-LMS. However, it is well known that the performance of
SE-LMS is worse than the performance of the LMS algorithm, especially when the
absolute value of the error is large at the beginning of the convergence [35, 52]. Another
method would be quantized error (QE)-LMS with µ fixed to some power of two and the























Figure 55. MATLAB simulation of the learning curve for LMS implementations: Case (i) the actual
µe[n] × DA-A-LUT[n] is used, and Case (ii) µe[n] is quantized to one of L = 8 values (powers of 2) using a
round function. The e[n]’s used in the plot for both of the above cases are obtained by averaging 150
independent trials of the respective MATLAB simulations.
error signal, e[n], quantized to the closest power of two, essentially quantizing the product
µe[n] to a power of two. This enables us to minimize the on-chip area usage by replacing
the hardware multiplier with a simple barrel shifter. In other words, the product of the
contents of the DA-A-LUT[n] with µe[n] is approximated by a right shift of the contents of the
DA-A-LUT[n].
Depending on the quantization method of the error signal, e[n], the performance of the
learning curve can suffer degradation. This degradation of the QE-LMS, however, can be
minimized, as shown in Figure 55, by employing the rounding function, which gives the
power-of-two value closest to the error signal. In both cases, a white Gaussian random
noise signal with zero mean and unit variance was used as the input, and the desired signal,
d[n], was generated by filtering the input with a 256-tap low-pass FIR filter. The e[n]’s used
in Figure 55 for both of the above cases are obtained by averaging 150 independent
trials of MATLAB simulations.
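The power-of-two quantization of µe[n] can be sketched as follows. Several details here are assumptions made for illustration only: rounding is done in the log2 domain, the L = 8 allowed values are taken to be 2^−1 down to 2^−8, and the function names are hypothetical.

```python
import math


def quantize_pow2(v, min_exp=-8, max_exp=-1):
    """Return (sign, shift) such that v is approximated by sign * 2**(-shift).

    Rounding is performed on log2(|v|); the exponent range gives L = 8 levels.
    """
    if v == 0:
        return 0, 0
    exp = round(math.log2(abs(v)))
    exp = max(min_exp, min(max_exp, exp))
    return (1 if v > 0 else -1), -exp


def scaled_by_shift(entry, sign, shift):
    """Approximate mu*e[n]*entry for an integer LUT entry with a right shift."""
    return sign * (entry >> shift)


sign, shift = quantize_pow2(0.3)            # 0.3 is closest to 2**-2
update = scaled_by_shift(1024, sign, shift)
```

In this example the shift distance is 2, so the LUT entry 1024 contributes an update of 256 instead of the exact 307.2. The sign is handled separately, which corresponds to an adder with sign control after the barrel shifter.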
6.2.3 DAAF for High-Order Filters
For large values of K, the DAAF can be implemented using multiple smaller DA base
filtering and adaptation units (DA-BFAUs). The idea of breaking up the DAAF into multiple
DA-BFAUs is similar to the partial sum technique for fixed FIR filters. The filter output,
y[n], is generated by summing the filtered outputs of all DA-BFAUs using an adder
tree. Figure 56 shows the K-tap DAAF architecture implemented with m DA-BFAUs, each
of size k. For simplicity, a k-tap DA-BFAU will be referred to as DA-BFAU(k). A K-tap
DAAF architecture having m DA-BFAU(k) units, where K = m × k, will be referred to as
DAAF(k, m). For example, when 4-tap DA-BFAUs are used to implement a 128-tap
adaptive LMS filter, the DAAF architecture will be denoted as DAAF(4,32).
A detailed view of the DA-BFAU for k = 4 and B = 16 is shown in Figure 57. Each
DA-BFAU contains a system similar to the one shown in Figure 51. However, the following
optimizations are performed to make the DAAF more area and memory efficient:
• Since the addresses and control signals for all the DA-BFAUs are essentially the
same, a single DA filter update controller module that fans its control signals and
addresses to all the DA-BFAUs is used.
• Instead of using one accumulator and shift register in each DA-BFAU, a single DA
accumulator and shift register are used at the output of the adder tree. Thus, the
outputs of the DA-F-LUTs of the DA-BFAUs are directly connected to the adder tree
unit, and the resulting sum is accumulated and shifted B times to generate the output,
y[n].
• A single shift register unit containing the bits of the input samples
(x[n], x[n − 1], ..., x[n − K + 1]) is used. The addresses for accessing the contents of the
DA-F-LUTs of the DA-BFAUs are derived from this shift register.
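The filtering path of this partitioned structure can be sketched in software: m small LUTs, one sum over the units per bit plane (the adder tree), and a single shared accumulator. The sketch assumes integer weights and B-bit two's-complement inputs so that the result can be compared exactly with a direct inner product; the function names are hypothetical.

```python
def combo_sums(values):
    """Entry a = sum of values[j] over the set bits j of address a."""
    return [sum(v for j, v in enumerate(values) if (a >> j) & 1)
            for a in range(1 << len(values))]


def daaf_filter(w, x, k, nbits=16):
    """Inner product of w and x using K/k base units of k taps each."""
    assert len(w) == len(x) and len(w) % k == 0
    # One DA-F-LUT of 2**k entries per base unit.
    luts = [combo_sums(w[i:i + k]) for i in range(0, len(w), k)]
    acc = 0
    for j in range(nbits):                      # one bit plane per clock cycle
        plane = 0
        for u, lut in enumerate(luts):
            addr = 0
            for i in range(k):
                addr |= ((x[u * k + i] >> j) & 1) << i
            plane += lut[addr]                  # adder tree over all base units
        # Single shared accumulator; the sign bit plane has negative weight.
        acc += -(plane << j) if j == nbits - 1 else (plane << j)
    return acc


w = [3, -1, 4, 1, -5, 9, 2, -6]
x = [100, -200, 300, -400, 500, -600, 700, -32768]
y_k4 = daaf_filter(w, x, k=4)   # two 4-tap base units, 16-entry LUTs
y_k2 = daaf_filter(w, x, k=2)   # four 2-tap base units, 4-entry LUTs
```

Both partitionings give the same output as the direct sum of products. The total LUT storage is (K/k) · 2^k entries, which is why smaller base units save memory at the cost of a larger adder tree.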
The “adder1” in Figure 57 is a half-adder for the update of the DA-A-LUT, and it adds































































Figure 56. Block diagram of the area-optimized DAAF for high-order filters. The shift register unit, the
adder/shift unit, and the adder tree unit are the same as those for fixed FIR filters. A K-tap DAAF architecture








































































Figure 57. Detailed view of the DA base filtering and adaptation unit, DA-BFAU(4), for µ = 6 and
B = 16. Each DA-BFAU contains a system similar to Figure 51, but it is optimized further so that the
DAAF implementation is more area and memory efficient. For the filtering operation, the 2x1 MUX
chooses address lines from the shift register unit. For the weight adaptation operation, it chooses
address lines generated sequentially by the counter, running from 1 to 2^k − 1.
a 1 in the LSB) of the DA-A-LUT according to Equation (64). The “adder2” is a
full-adder with sign control for the update of the DA-F-LUT, according to Equation (65). A “barrel
shifter” is used instead of a hardware multiplier for further area optimization. With µ fixed
to some power of two, the barrel shifter shifts its input, µDA-A-LUT, to the right by
“distance,” which is calculated in the DA filter and update control module. A “2x1 MUX”
selects the address lines for the DA-F-LUT. For the filtering operation, the 2x1 MUX chooses
the address lines from the shift register unit, which are labeled “filt_addr_filter” in Figure 57. For
the weight adaptation operation, it chooses the address lines from the DA filter and update
control module, which are labeled “filt_addr_update” in Figure 57. The “filt_addr_update”
address is generated sequentially by the counter, running from 1 to 2^k − 1. The two address lines
for the DA-A-LUT, “aux_wr_addr” and “aux_rd_addr,” are actually the “int_addr” lines of Figure 54
and Table 10. The “aux_rd_addr” corresponds to the even external addresses, and the
“aux_wr_addr” corresponds to the odd external address that follows “aux_rd_addr.”
6.2.4 Implementation Results
In this section, implementation results of the DAAF architecture (Figure 56) are compared
in terms of four metrics to quantify the design trade-offs: throughput, memory
requirements, logic complexity, and power consumption estimate [3]. These metrics are used to
compare the performance of the DAAF for different values of k and m.
Throughput: The throughput is defined as the number of signal samples processed by
an adaptive filter per second. If “T” is the number of clock cycles required for filtering and





For a K-tap DA adaptive FIR filter implemented with m DA-BFAU(k) units, the update of the
DA-A-LUT can be done in 2^(k−1) clock cycles, as described in Section 6.2.1. This can be done
in parallel with the filtering operation, which takes B clock cycles. Thus, the total number of
clock cycles for the filtering and the update of the DA-A-LUT is max(B, 2^(k−1)). The updated





























Figure 58. Throughput versus the number of filter taps in the base unit for 64-, 256-, and 1024-tap
adaptive filters implemented on a microprocessor and the DAAF. The microprocessor is assumed not to
have multiple hardware multipliers. The throughput of the DAAF does not vary significantly with the
overall filter size, K, but varies significantly with the number of filter taps of the DA-BFAU, k.
The figure clearly shows that the throughput improvement of the DAAF over a microprocessor becomes larger
as K increases. A maximum of two orders of magnitude of throughput improvement over a microprocessor can be
achieved for a 1024-tap LMS adaptive filter implemented with the DAAF(2,512) or the DAAF(4,256).
DA-A-LUT is then used to update the DA-F-LUT, requiring 2^k clock cycles. Finally, the
adder tree unit requires ⌈log2(m)⌉ clock cycles. Thus, the overall K-tap DAAF requires
2^k + max(B, 2^(k−1)) + ⌈log2(m)⌉ clock cycles. The throughput for the DAAF is given by
\[
\text{Throughput} = \frac{\text{clock rate}}{2^{k} + \max(B, 2^{k-1}) + \lceil \log_2(m) \rceil}.
\tag{67}
\]
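Equation (67) is easy to evaluate for candidate configurations. The sketch below is illustrative: the function names are hypothetical, and the example values (a 50 MHz clock, B = 16, and a 1024-tap filter split into 4-tap or 2-tap base units) are taken from figures quoted elsewhere in this chapter.

```python
def daaf_cycles(k, m, nbits):
    """Clock cycles per sample: 2**k + max(B, 2**(k-1)) + ceil(log2(m))."""
    adder_tree = (m - 1).bit_length()       # integer form of ceil(log2(m)), m >= 1
    return 2 ** k + max(nbits, 2 ** (k - 1)) + adder_tree


def daaf_throughput(clock_hz, k, m, nbits):
    """Samples per second according to Equation (67)."""
    return clock_hz / daaf_cycles(k, m, nbits)


cycles_k4 = daaf_cycles(k=4, m=256, nbits=16)        # 1024 taps, 4-tap base units
cycles_k2 = daaf_cycles(k=2, m=512, nbits=16)        # 1024 taps, 2-tap base units
rate_k4 = daaf_throughput(50e6, k=4, m=256, nbits=16)
```

For these values the model gives 40 and 29 clock cycles per sample, respectively. The overall filter size K enters only through the logarithmic adder-tree term, while k enters exponentially, which matches the trend described for Figure 58.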
The throughputs of the DAAF and a microprocessor for 64-, 256-, and 1024-tap adaptive
filters are shown in Figure 58. The microprocessor is assumed not to have multiple hardware
multipliers. A 50 MHz system clock is used for both the microprocessor and the DAAF. It can
be observed that the throughput of the DAAF does not vary significantly with the
overall filter size, K, but varies significantly with the number of filter taps of the
DA-BFAU, k. Figure 58 clearly shows that the throughput improvement of the DAAF over
a microprocessor becomes larger as the overall filter size, K, increases. A maximum of two

































Figure 59. Throughput comparison of the DAAF and the MMAF for various filter sizes, K. For K = 32,
the throughput of the MMAF with 4 MACs is almost the same as that of the DAAF with k = 2. However,
as the filter tap size, K, increases, the throughput of the MMAF (note that the throughput is drawn in
log scale) decreases exponentially while the DAAF maintains almost the same throughput.
orders of magnitude of throughput improvement over a microprocessor can be achieved for a 1024-tap
LMS adaptive filter implemented with the DAAF(2,512) or the DAAF(4,256).
As explained in Section 6.1, most modern DSP processors have multiple hardware
multipliers. Since different DSP processors have their own specialized hardware
logic for fast MAC operation, comparing their performance is challenging. Instead, we
synthesized a general MMAF processor with multiple hardware multipliers in order to compare
its throughput objectively with that of the DAAF. Figure 59 shows the throughput comparison
between the MMAF with 2 or 4 MACs and the DAAF with k = 2 or k = 4. For K = 32,
the throughput of the MMAF with 4 MACs is almost the same as that of the DAAF with
k = 2. However, as the filter tap size, K, increases, the throughput of the MMAF (note that
the throughput is drawn in log scale) decreases exponentially while the DAAF maintains
almost the same throughput.























Figure 60. Number of LEs for various adaptive filter sizes, K, and base unit sizes, k. When K is fixed,
a DA-BFAU with larger k is more area efficient, requiring fewer LEs.
Number of LEs: In programmable logic systems, the size of the logic design is
measured by the number of logic elements (LEs). For Altera’s Stratix architecture, the platform used in
the implementation, the LE is the smallest unit for implementing logic functions. Each LE contains
a four-input LUT, a programmable register, and a carry chain with carry select capability.
Further details of the LE can be found in [1]. Ten LEs are grouped into one logic array
block (LAB), and the LABs are interconnected through a row- and column-based network.
In this work, the number of LEs, instead of the number of LABs, is considered as the metric
for area usage since it conveys more detailed information about the actual chip area used in
the implementation.
Figure 60 shows the number of LEs for various adaptive filter sizes, K, and base unit
sizes, k. When K is fixed, a DA-BFAU with larger k is more area efficient, requiring fewer
LEs. This is expected because a DA-BFAU with smaller k requires more full adders,
which take up a relatively large silicon area, in the adder tree unit. It should be noted from
Figures 58 and 60 that there are trade-offs between the number of LEs and throughput. For































Figure 61. Memory usage versus the number of filter taps in the base unit for 64-, 256-, and 1024-tap
filters implemented on a microprocessor and the DAAF.
area-optimized implementation, a DA-BFAU with larger k can be a good choice, while for
throughput-optimized implementation, a DA-BFAU with smaller k is advantageous.
Memory: The DA filtering method, due to its nature of using LUTs, is known to be
more memory intensive than DSP processor implementation. Altera’s Stratix architecture
provides three types of RAM blocks consisting of M512, M4K, and M-RAM blocks [1].
However, we compare memory usage in KB, which is available in the compilation reports
of FPGA synthesis software, as a metric rather than the number of used M512, M4K, and
M-RAM blocks for more general comparison with other hardware platforms.
Figure 61 illustrates the memory usage of 64-, 256-, and 1024-tap adaptive filters
implemented on the DAAF and a microprocessor. Figure 61 shows that the memory
requirements for the DAAF grow exponentially as the base unit size increases, and a DA-BFAU
with smaller k is more memory efficient. Although the DAAF requires at least four times
more memory than a microprocessor-based implementation, since memory is considered
Table 11. Memory requirements in KB for various K and k. Memory requirements for the DAAF grow
exponentially as the base unit size increases, and a DA-BFAU with smaller k is more memory efficient.

       Overall Filter Size (K)
k      16      32      64      128     256     512     1024
2      0.16    0.33    0.66    1.31    2.63    5.25    10.51
4      0.31    0.63    1.25    2.5     5       10      20
8      2.41    4.81    9.63    19.25   38.50   77      154

cheaper than the specialized logic for a microprocessor or a DSP core, this marginal
memory increase might be implemented without additional cost. Moreover, as shown in
Table 11, the overall memory requirement for a 1024-tap LMS algorithm implemented
with DAAF(4,256) is no more than 20 KB.
Power Consumption Estimates: The power consumption estimates are obtained from
the “Simulation Report” generated by using the PowerGauge(TM) feature of Altera’s
Quartus II software. According to [1], PowerGauge(TM) estimates power through a software
simulation of the hardware design.
Table 12 shows the power consumption comparison of the DAAF for various K and
k. It can be seen that, when the overall filter size (K) is fixed, a DA-BFAU with larger k is more
power efficient than a DA-BFAU with smaller k. The power consumption of the DAAF
architecture can be compared with other methods of implementation in conjunction with the
throughput advantage of the DAAF architecture. For instance, two orders of magnitude of throughput
improvement of the 1024-tap LMS adaptive filter implemented with the DAAF(2,512) or
the DAAF(4,256) over a microprocessor means that the DAAF can be clocked up to
100 times more slowly than a microprocessor while delivering the same performance. It is
well known that the overall power consumption of the system decreases as the clock speed
decreases.
Finally, to verify the convergence property of the DAAF implemented on an Altera
(Footnote 1: These values are extrapolated since it is not possible to fit a 1024-tap DAAF with 2-tap
base units on the Stratix EP1S80F1508C6 FPGA.)
Table 12. Power consumption estimates in mW for various K and k. The power estimates are obtained
using the PowerGauge(TM) feature of Altera’s Quartus II software. A DA-BFAU with larger k is more
power efficient than a DA-BFAU with smaller k.

       Overall Filter Size (K)
k      16      32      64      128     256     512     1024
2      14.43   24.54   34.54   54.01   90.53   132.5   -
4      15.83   19.82   37.15   53.91   84.77   115     144.8
8      11.87   14.94   24.95   31.16   37.14   48.17   72.87

Stratix EP1S80F1508C6 FPGA system, the learning curve for a 512-tap LMS adaptive
filter implemented with the DAAF(4,128) is presented in Figure 62. Zero-mean white Gaussian
noise is used as the input for this measurement.
























Figure 62. Learning curve for a 512-tap LMS adaptive filter implemented with the DAAF(4,128),
measured on an Altera Stratix EP1S80F1508C6 FPGA system. Zero-mean white Gaussian noise is
used as the input to verify the implementation.
CHAPTER 7
HYBRID DA LMS ADAPTIVE FILTER
Two novel hybrid architectures, the hybrid DA-OBC architecture and the hybrid DA architecture,
introduced in Chapter 5, enable efficient hardware implementation of fixed-coefficient
filters. Hybrid architectures provide more flexibility than LUT-based DA or adder-based DA
by allowing arbitrary selection of the LUT size, ranging from 2^k to as low as 0 for a k-tap
DA base unit, or from 2^(k−1) to as low as 0 for a k-tap DA-OBC base unit. It should
be noted that LUT-based DA and adder-based DA are only the two extreme cases of hybrid
architectures. FPGA implementation results confirm that hybrid architectures can
implement higher-order FIR filters because of their flexibility in the trade-off between LEs and
memory.
In addition, a new DA architecture for the LMS adaptive filter (DAAF) was presented
in Chapter 6. The DAAF structure is based on a LUT and provides high-speed hardware
implementation of the LMS adaptive filter by introducing an additional auxiliary LUT for fast
update of the original filtering LUT. The DAAF, however, still requires intensive memory
usage as the base unit size, k, increases since it is a LUT-based architecture.
7.1 Hybrid DA LMS
To overcome the growing memory requirement of the DAAF for high-order base units, hybrid
architectures are extended to the LMS adaptive filter in this section. The new structure,
referred to as “hybrid DA LMS,” has features of both the hybrid DA architecture and the
DAAF architecture. It allows a smaller LUT size, ranging from $2^k$ to as low as 0, for a
k-tap DA base unit. The LUT update scheme of the DAAF, which makes the DAAF architecture
attractive for high-speed filtering applications, can again be applied to the smaller LUT of
the hybrid DA LMS. In the hybrid DA LMS architecture, some of the filter coefficients are
stored in the DA-F-LUT and the others are stored in the registers. Let $k_L$ be the number
of filter coefficients stored in the DA-F-LUT and $k_A$ be the number of filter coefficients
stored in the registers. Then the following is satisfied:

$$k = k_L + k_A, \tag{68}$$

where k is the number of base unit taps of the hybrid DA LMS.

Figure 63. Example of a block diagram of a 4-tap base unit for the hybrid DA LMS architecture. Two
filter coefficients ($k_L = 2$) are stored in a LUT and the remaining two coefficients ($k_A = 2$) are
stored in the registers. The hybrid DA LMS has features of both the hybrid DA architecture and the
DAAF architecture. The LUT update scheme of the DAAF, which makes the DAAF architecture attractive
for high-speed filtering applications, can again be applied to the smaller LUT of the hybrid DA LMS.
Details of the hybrid DA LMS are presented with an example. Figure 63 shows an
example of the block diagram of a 4-tap base unit for the hybrid DA LMS architecture. The
shift register unit, the adder tree unit, and the adder/shifter unit are identical to those of the
hybrid DA architecture. The timing control logic of the DA filter update controller module
should be changed from that of the DAAF architecture since the base unit structure is changed.
Two filter coefficients ($k_L = 2$) are stored in the DA-F-LUT and the remaining two
coefficients ($k_A = 2$) are stored in the registers. Four address bits ($b_3 b_2 b_1 b_0$) from
the shift register unit are connected to the base unit, but only two bits ($b_1 b_0$) are used as
addresses to the DA-F-LUT; the remaining two bits are used as selectors for the 2x1 MUXs,
which select either the filter coefficients or zeros. It can be seen from Figure 63 that the size
of both LUTs is reduced from $2^k$ to $2^{k_L}$. The adders labelled “Adder1” and “Adder2” are
the same as the adders used inside the base unit of the DAAF (DA-BFAU), and the remaining
two adders are newly inserted, as they were in the hybrid DA architecture.

Figure 64. Filtering operation for a 4-tap base unit with $k_L = 2$ and $k_A = 2$. Dark lines (red) are
active for the filtering operation of the hybrid DA LMS architecture. When filtering is performed, the
active structures are the same as those of the hybrid DA architecture.
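The split between LUT-addressed and MUX-gated coefficients can be checked with a small bit-serial model. The sketch below is a software illustration only: it assumes integer coefficients, two's-complement input samples of `num_bits` bits, and LSB-first processing, none of which are stated for the hardware, and the function names are hypothetical.

```python
def build_lut(coeffs):
    """LUT[addr] = sum of coeffs[i] for every bit i that is set in addr."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]


def hybrid_da_filter(weights, samples, k_l, num_bits):
    """Bit-serial inner product of integer weights and two's-complement samples.

    The first k_l weights are read from a LUT; the remaining k_a = k - k_l
    weights sit in registers and are gated by 2x1 MUXs (coefficient or zero).
    """
    lut = build_lut(weights[:k_l])
    reg_weights = weights[k_l:]
    mask = (1 << num_bits) - 1
    words = [s & mask for s in samples]        # two's-complement bit patterns
    acc = 0
    for j in range(num_bits):
        bits = [(w >> j) & 1 for w in words]   # one bit slice across all taps
        addr = sum(b << i for i, b in enumerate(bits[:k_l]))
        partial = lut[addr]                    # LUT part (k_l address bits)
        for w, b in zip(reg_weights, bits[k_l:]):
            partial += w if b else 0           # MUX part: coefficient or zero
        # The sign bit (MSB) carries weight -2^(B-1) in two's complement.
        if j == num_bits - 1:
            acc -= partial << j
        else:
            acc += partial << j
    return acc


if __name__ == "__main__":
    w = [3, -5, 7, 2]
    x = [12, -7, 0, -128]
    for k_l in range(len(w) + 1):
        print(k_l, hybrid_da_filter(w, x, k_l, num_bits=8))
```

Every choice of `k_l` from 0 (adders only) to k (LUT only) yields the same inner product; only the LUT size, `2**k_l` entries, and the number of MUX-gated additions change.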
Figure 64 illustrates the active logic for the filtering operation. Dark lines (red) are
active for the filtering operation of the hybrid DA LMS architecture. When the filtering
operation is in progress, the active logic is exactly the same as that of the hybrid DA
architecture.

Figure 65. Update of the DA-A-LUT and input registers for a 4-tap base unit with $k_L = 2$ and $k_A = 2$.
Dark lines (red) are active for the DA-A-LUT and input registers update operation. This operation
takes place at the same time as the filtering operation. The same update scheme of the DAAF can be
used for the hybrid DA LMS architecture; the only difference is that the update clock cycles decrease
from $2^{k-1}$ to $2^{k_L-1}$.

Figure 66. Update of the filter coefficients stored in the DA-F-LUT for a 4-tap base unit with $k_L = 2$ and
$k_A = 2$. Dark lines (red) are active for the DA-F-LUT update operation.
Figure 65 illustrates the active logic for the update operation of the DA-A-LUT and the
input registers. This operation takes place at the same time as the filtering operation shown
in Figure 64. For the DAAF architecture, DA-A-LUT[n + 1] was updated using address
rotation and data reuse of DA-A-LUT[n] to lower the required clock cycles to $2^{k-1}$. The
same update scheme can be used for the hybrid DA LMS architecture; the only difference
is that the update clock cycles decrease from $2^{k-1}$ to $2^{k_L-1}$. Once the update of the
DA-A-LUT is over, a new input sample, x[n + 1], at time index n + 1 is loaded into the input
registers shown in the lower-left corner of Figure 65. The input registers store k previous
input samples and are used for the update of the filter coefficients stored in the registers.
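One way to see why address rotation and data reuse halve the work is to model the DA-A-LUT as a table of partial sums of the stored input samples. The sketch below is a functional model under that assumption, not the hardware's actual access ordering; `build_aux_lut` and `update_aux_lut` are hypothetical names.

```python
def build_aux_lut(samples):
    """DA-A-LUT[r] = sum of samples[i] for every bit i that is set in address r.

    samples[0] is the newest input x[n]; samples[i] is x[n-i].
    """
    return [sum(x for i, x in enumerate(samples) if (r >> i) & 1)
            for r in range(1 << len(samples))]


def update_aux_lut(old_lut, x_new):
    """Derive DA-A-LUT[n+1] from DA-A-LUT[n] and the new sample x[n+1].

    After the shift, x[n-i] moves from bit position i to i+1 and the oldest
    sample drops out, so only the lower half of the old table is reused.
    """
    half = len(old_lut) // 2
    new_lut = [0] * len(old_lut)
    additions = 0
    for r in range(half):
        new_lut[2 * r] = old_lut[r]              # address rotation, no arithmetic
        new_lut[2 * r + 1] = old_lut[r] + x_new  # one addition per old entry
        additions += 1
    return new_lut, additions


if __name__ == "__main__":
    lut = build_aux_lut([5, -3, 8, 1])           # k = 4 stored samples
    new_lut, additions = update_aux_lut(lut, 7)
    print(new_lut == build_aux_lut([7, 5, -3, 8]), additions)
```

In this model a table with `2**k` entries needs `2**(k-1)` additions per new sample, which matches the cycle count quoted for the update; with only `k_L` address bits the same argument gives `2**(k_L-1)`.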
In the hybrid DA LMS architecture, $k_L$ filter coefficients are stored in the DA-F-LUT,
and $k_A$ filter coefficients are stored in the registers. These two sets of filter
coefficients are updated one after the other. The filter coefficients in the DA-F-LUT are
updated as follows:

$$\text{DA-F-LUT}(r)[n+1] = \text{DA-F-LUT}(r)[n] + \mu e[n]\,\text{DA-A-LUT}(r)[n], \quad 0 \le r \le 2^{k_L} - 1, \tag{69}$$

and then the filter coefficients in the registers are updated as follows:

$$w_i[n+1] = w_i[n] + \mu e[n]\,x[n-i], \quad k_L \le i \le k - 1, \tag{70}$$

where k is the base unit size of the hybrid DA LMS architecture.
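The two update equations can be exercised together in a few lines. The sketch below assumes that each LUT entry is the sum of the coefficients (DA-F-LUT) or input samples (DA-A-LUT) selected by the address bits, and it uses exact rational arithmetic in place of the barrel-shifter approximation of µe[n]; the helper names are hypothetical.

```python
from fractions import Fraction


def build_lut(values):
    """LUT[r] = sum of values[i] for every bit i that is set in address r."""
    return [sum(v for i, v in enumerate(values) if (r >> i) & 1)
            for r in range(1 << len(values))]


def hybrid_lms_update(f_lut, a_lut, reg_weights, reg_samples, mu, error):
    """Apply Equations (69) and (70) once and return the updated state."""
    step = mu * error
    # Eq. (69): each DA-F-LUT entry moves by mu*e[n] times the DA-A-LUT entry.
    new_f_lut = [f + step * a for f, a in zip(f_lut, a_lut)]
    # Eq. (70): register coefficients w_i, k_L <= i <= k-1, use x[n-i] directly.
    new_regs = [w + step * x for w, x in zip(reg_weights, reg_samples)]
    return new_f_lut, new_regs


if __name__ == "__main__":
    k_l = 2
    w = [Fraction(1, 2), Fraction(-1, 4), Fraction(3, 8), Fraction(1, 8)]
    x = [3, -2, 5, 1]
    mu, e = Fraction(1, 16), Fraction(-3, 2)
    f_lut, regs = hybrid_lms_update(build_lut(w[:k_l]), build_lut(x[:k_l]),
                                    w[k_l:], x[k_l:], mu, e)
    updated = [wi + mu * e * xi for wi, xi in zip(w, x)]
    print(f_lut == build_lut(updated[:k_l]), regs == updated[k_l:])
```

Because both tables are linear in their contents, updating the DA-F-LUT entry by entry gives the same table as updating the first $k_L$ coefficients with the ordinary LMS rule and rebuilding the LUT from scratch.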
Figure 66 illustrates the update of the filter coefficients stored in the DA-F-LUT, as
explained in Equation (69), for a 4-tap base unit when $k_L = 2$ and $k_A = 2$. The hardware
logic involved in the DA-F-LUT update is fundamentally the same as the DA-F-LUT
update logic of the DAAF architecture. The DA-F-LUT[n + 1] is updated by reading
the same memory location in both the DA-F-LUT[n] and the DA-A-LUT[n], multiplying
the output of the DA-A-LUT[n] by µe[n], adding this quantity to the output of the
DA-F-LUT[n], and finally storing the result back in the same memory location of the
DA-F-LUT[n + 1]. This process is repeated from address 1 to address $2^{k_L} - 1$. The product of
the contents of the DA-A-LUT[n] with µe[n] is approximated using a barrel shifter. The
“sel A” control signal is set to 0 for the DA-F-LUT update operation. The addresses of
both LUTs are connected to the counter, which is incremented from 1 to $2^{k_L} - 1$.
Figure 67 illustrates the update of the filter coefficients stored in the registers, as
explained in Equation (70), for a 4-tap base unit when $k_L = 2$ and $k_A = 2$. While this
operation is in progress, “sel A” is set to 1 and “sel B” is increased from 0 to $k_A - 1$ for the
sequential update of the filter coefficients stored in the registers. The size of the MUXs and
the DEMUX, controlled by “sel B,” is $k_A$x1.
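A barrel shifter can stand in for the multiplication by µe[n] if that factor is rounded to a signed power of two. How the rounding is done in the hardware is not described here, so the sketch below simply assumes rounding of the exponent to the nearest integer; both function names are hypothetical.

```python
import math


def nearest_power_of_two_shift(step):
    """Return (sign, shift) such that sign * 2**(-shift) approximates step."""
    if step == 0:
        return 0, 0
    sign = 1 if step > 0 else -1
    shift = round(-math.log2(abs(step)))
    return sign, shift


def shift_multiply(value, sign, shift):
    """Multiply an integer by sign * 2**(-shift) using shifts only."""
    if sign == 0:
        return 0
    # A right shift divides by a power of two; a negative shift multiplies.
    shifted = value >> shift if shift >= 0 else value << -shift
    return sign * shifted


if __name__ == "__main__":
    sign, shift = nearest_power_of_two_shift(0.03)   # 0.03 is close to 2**-5
    print(sign, shift, shift_multiply(1024, sign, shift))
```

The price of this approximation is a quantized step size, which is one reason the text later mentions replacing the barrel shifter with a hardware multiplier when more precision is needed.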
Figure 67. Update of the filter coefficients stored in the registers for a 4-tap base unit with $k_L = 2$ and
$k_A = 2$. Dark lines (red) are active for this operation of the hybrid DA LMS architecture. The size of
the MUXs and the DEMUX, controlled by “sel B,” is $k_A$x1.

7.2 Performance Analysis
Throughput: Section 6.2.4 showed the number of clock cycles (T) that the DAAF(k,m)
architecture requires for the K-tap LMS adaptive filter (K = k × m).
In the hybrid DA LMS architecture, the number of clock cycles ($T_2$) is

$$T_2 = \begin{cases} 2^{k_L} + k_A + \max(B, 2^{k_L-1}) + \lceil \log_2(m) \rceil & \text{if } k_L \neq 0 \\ k + B + \lceil \log_2(m) \rceil & \text{if } k_L = 0 \end{cases} \tag{72}$$
The adder tree unit summing up the filtered outputs of the base units requires
$\lceil \log_2(m) \rceil$ clock cycles for both the DAAF and the hybrid DA LMS, since these two
architectures differ only inside the base unit. The filtering operation requires
$\max(B, 2^{k_L-1})$ clock cycles for $k_L \neq 0$ and B clock cycles for $k_L = 0$. Neither the
DA-F-LUT nor the DA-A-LUT is required for $k_L = 0$. This structure is called the “LUT-less
hybrid DA LMS” and is a special case of the hybrid DA LMS architecture, where $k_L$ is 0 and
$k_A$ is k. This architecture minimizes memory usage and maximizes the use of the LEs of an
FPGA. If we increase $k_L$ gradually from 0 to k, the number of LEs can be decreased at the
expense of increased memory usage.
The major difference between the clock cycles of the DAAF and the hybrid DA LMS
results from the way the filter coefficients are updated. It takes $2^k$ clock cycles to update
the DA-F-LUT of size $2^k$ for the DAAF, while it takes $2^{k_L} + k_A$ clock cycles for the hybrid
DA LMS architecture: $2^{k_L}$ clock cycles for the update of the filter coefficients stored in
the DA-F-LUT, and $k_A$ for the update of the filter coefficients stored in the registers.
Some additional overhead is necessary in the hybrid DA LMS architecture due to the
MUXs and the DEMUX in the critical data path. However, for the LUT-less hybrid DA
LMS architecture, it takes only $k_A$ (= k) clock cycles for the filter update since all the filter
coefficients are stored in the registers. This clearly shows that the LUT-less hybrid DA
LMS is superior to the DAAF in terms of required clock cycles. For instance, to update the
filter coefficients of a 4-tap base unit, the LUT-less hybrid DA LMS requires only 4 clock
cycles plus a small overhead, while the DAAF requires 16 clock cycles.
This indicates that the base unit size of the LUT-less hybrid DA LMS can increase to 16 and
still achieve the same throughput as the 4-tap DAAF. In the DAAF, an increase of the base
unit size causes a decrease in throughput, as shown in Figure 58. Increasing the base unit size
without additional clock cycles has the following two advantages. First, the number of
adders inside the adder tree unit decreases as the number of base units, m, decreases. This
can save additional clock cycles for the adder tree unit. Second, the barrel shifter required
for the filter update equation inside the base unit can be replaced by a hardware multiplier
for more precision. If the base unit size is small for a large K, the number of base units,
m, becomes large, and it is no longer possible to give each of the m base units a dedicated
multiplier because of the limited number of hardware multipliers.
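Equation (72) and the update-cycle counts quoted above are easy to tabulate. The sketch below assumes an input word length of B = 16 bits purely for illustration and ignores the MUX/DEMUX overhead mentioned in the text; treating $k_L = k$ as the DAAF-like extreme is an interpretation, not the thesis's own expression for T.

```python
def ceil_log2(m):
    """Integer-exact ceil(log2(m)) for m >= 1."""
    return (m - 1).bit_length()


def filter_update_cycles(k, k_l):
    """Cycles to update the coefficients of one base unit: 2^k_L + k_A."""
    k_a = k - k_l
    return (2 ** k_l if k_l else 0) + k_a


def hybrid_da_lms_cycles(k, k_l, m, num_bits):
    """Clock cycles per sample from Equation (72), with k_A = k - k_L."""
    if k_l == 0:
        return k + num_bits + ceil_log2(m)
    k_a = k - k_l
    return 2 ** k_l + k_a + max(num_bits, 2 ** (k_l - 1)) + ceil_log2(m)


if __name__ == "__main__":
    B = 16                       # assumed input word length
    K = 1024
    for k, k_l in [(4, 4), (4, 2), (4, 0), (16, 0)]:
        m = K // k
        print(f"k={k:2d} k_L={k_l} m={m:3d} "
              f"update={filter_update_cycles(k, k_l):2d} "
              f"total={hybrid_da_lms_cycles(k, k_l, m, B)}")
```

With these assumptions the LUT-less 16-tap base unit spends 16 cycles on the coefficient update, the same as a 4-tap base unit whose LUT holds all four coefficients, while needing a shallower adder tree because m is smaller.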
Resource Usage: As an example of the hybrid DA LMS architecture, the LUT-less
hybrid DA LMS was implemented on an Altera Stratix EP1S80F1508C6 FPGA chip. In
addition, the LUT-less hybrid DA-OBC architecture was implemented on the same
type of FPGA chip for comparison with the hybrid DA LMS architecture. It should be
noted that the general hybrid DA-OBC architecture is harder to apply to adaptive filters
since an efficient update method for a LUT of DA-OBC is not yet known. However, the
LUT-less hybrid DA-OBC can be easily extended to adaptive filters since no LUT is required
inside the base unit.
Table 13. Summary of the FPGA implementation results (Altera Stratix EP1S80F1508C6 FPGA chip)
for LMS adaptive filters, ranging from 4 to 1024 taps, when $B_c$ = 18 and k = 4.

                                                Filter size (K)
Architecture              Resource        4      16      64     128     256     512    1024
DAAF                      LE            543     916    2432    4474    8554   16705       x
                          Memory        632    2528   10112   20224   40448   80896       x
LUT-less hybrid DA-OBC    LE            601    1595    5539   10781   21264   42096       x
                          Memory         92     368    1472    2944    5888   11776       x
LUT-less hybrid DA        LE            571    1434    4882    9500   18702   37097   74132
                          Memory         96     384    1536    3072    6144   12288   24576
Tables 13 to 15 show the FPGA implementation results of various LMS adaptive filter
sizes for 4-tap, 8-tap, and 16-tap base units, respectively. The bit precision of the filter
coefficients, $B_c$, is 18. In Table 13, one can see that only the hybrid DA LMS architecture is
capable of implementing a 1024-tap LMS adaptive filter on this FPGA chip. The hybrid
DA-OBC fails to implement a 1024-tap LMS adaptive filter since its LE count exceeds the
maximum LE limit (79040) of the FPGA chip. Interestingly, the DAAF also fails to implement
a 1024-tap LMS adaptive filter although its expected number of required LEs and memory
usage are only about 42% and 1% of the full capacity of the FPGA, respectively.
Table 14. Summary of the FPGA implementation results (Altera Stratix EP1S80F1508C6 FPGA chip)
for LMS adaptive filters, ranging from 8 to 1024 taps, when $B_c$ = 18 and k = 8.

                                                   Filter size (K)
Architecture              Resource         8       16       64      128      256      512     1024
DAAF                      LE             652      796     1630     2704     4895     9267    18024
                          Memory        9840    19680    78720   157440   314880   629760  1259520
LUT-less hybrid DA-OBC    LE             871     1461     4963     9622    18967    37558    74920
                          Memory         184      368     1472     2944     5888    11776    23552
LUT-less hybrid DA        LE             791     1299     4321     8356    16407    32495    64674
                          Memory         192      384     1536     3072     6144    12288    24576
Table 14 shows the implementation results for various LMS adaptive filter sizes with
an 8-tap base unit. Unlike in Table 13, all three structures can successfully synthesize
a 1024-tap LMS adaptive filter. The hybrid DA LMS architecture for a 1024-tap LMS
adaptive filter requires about three and a half times as many LEs as the DAAF but only
about 1.95% of the memory of the DAAF. This, however, is not a fair comparison in the sense that
the required clock cycles for the 8-tap DAAF and the 8-tap hybrid DA LMS are different.
A silicon area comparison between different architectures should be made at the same number
of clock cycles, that is, at the same throughput. Figure 68 shows the FPGA resource usage of
the hybrid DA LMS architecture relative to the DAAF at the same throughput. It can be seen
that the hybrid DA LMS requires 25% to 65% more LEs and 80% less memory than the DAAF
when compared at the same throughput.
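The equal-base-unit comparison for the 1024-tap filter can be recomputed directly from the table entries. The sketch below copies the 1024-tap column of Table 14 for the DAAF and the LUT-less hybrid DA rows; the dictionary layout and function name are illustrative only.

```python
# Values copied from the 1024-tap column of Table 14 (k = 8).
DAAF = {"le": 18024, "memory": 1259520}
LUTLESS_HYBRID_DA = {"le": 64674, "memory": 24576}


def resource_ratios(candidate, baseline):
    """Return candidate/baseline for every resource listed in the baseline."""
    return {key: candidate[key] / baseline[key] for key in baseline}


if __name__ == "__main__":
    ratios = resource_ratios(LUTLESS_HYBRID_DA, DAAF)
    print(f"LE ratio:     {ratios['le']:.2f}x")
    print(f"memory ratio: {ratios['memory'] * 100:.2f}%")
```

These ratios describe the same-k comparison only; the equal-throughput comparison of Figure 68 pairs different base unit sizes and therefore gives different percentages.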
Table 15 shows the implementation results for various LMS adaptive filter sizes with
a 16-tap base unit. The DAAF architecture fails to synthesize a 16-tap base unit for all
filter sizes attempted, while both hybrid LMS architectures can synthesize a 16-tap base
unit.

Figure 68. FPGA resource usage of the hybrid DA LMS architecture relative to the DAAF architecture
at the same throughput. The hybrid DA LMS requires 25% to 65% more LEs and 80% less
memory than the DAAF at the same throughput.
Table 15. Summary of the FPGA implementation results (Altera Stratix EP1S80F1508C6 FPGA chip)
for LMS adaptive filters, ranging from 16 to 1024 taps, when $B_c$ = 18 and k = 16. The DAAF
architecture fails to synthesize a 16-tap base unit for all filter sizes attempted, while both hybrid LMS
architectures can synthesize a 16-tap base unit.

                                                Filter size (K)
Architecture              Resource       16      32      64     128     256     512    1024
DAAF                      LE              x       x       x       x       x       x       x
                          Memory          x       x       x       x       x       x       x
LUT-less hybrid DA-OBC    LE           1366    2438    4576    8842   17372   34422   68519
                          Memory        368     736    1472    2944    5888   11776   23552
LUT-less hybrid DA        LE           1174    2034    3770    7232   14221   28102   55864
                          Memory        384     768    1536    3072    6144   12288   24576
CHAPTER 8
CONCLUSION AND FUTURE RESEARCH
New architectures for two audio enhancement algorithms, ANS and AEC, are proposed,
implemented, and tested for applications in which the power budget is limited. In this
chapter, we summarize our work by listing the contributions of this thesis and addressing
future research directions as well as remaining work.
8.1 Summary of Contributions
The primary contributions of the thesis are summarized as follows:
• In Chapter 3, a continuous-time ANS system, based on the spectral gain modification
algorithm, was developed. This architecture provides the framework for advanced
analog audio signal processing by allowing programmability for each computation
block. The continuous-time ANS system was fabricated on a 0.5µm CMOS VLSI
chip using programmable floating-gate computing elements for ultra low-power com-
putation.
• In Chapter 3, a noisy signal envelope and a noise envelope were estimated using a
programmable peak detector and a minimum detector, respectively. The selection
of the decay time constant of the peak detector and the attack time constant of the
minimum detector is crucial for the overall performance of the continuous-time ANS
system.
• In Chapter 3, a noise signal envelope was estimated not from the band-limited input
signal, but from the averaged noisy signal envelope estimate. This approach, also
called the “normalized SNR method,” helps the overall SNR to be unity or very close
to unity when the signal is assumed to be absent. The normalized SNR method forces
the SNR of the noise-only period to unity, independent of the input noise variance, and
accordingly, the suppression rate is consistently the same.
• In Chapter 3, programmable non-linear gain functions were described. The non-
linear gain function should be chosen carefully since it is crucial not only for the
overall noise suppression rate but also for the intelligibility of the noise-suppressed
output signal. Wiener gain functions with or without over-subtraction, bi-linear gain
functions with different thresholds, and sigmoid gain functions with various
parameters were compared.
• In Chapter 4, a tapped-delay line and a parallel delay line as well as a low-pass delay
filter and a band-pass delay filter were described and compared. In addition, a
low-pass-to-band-pass transformation rule, with an emphasis on preserving the group
delay of low-pass filters, was introduced.
• In Chapter 4, two subband delay networks were proposed. The first one is a delay
network with low-pass delay elements and modulation, which is desirable for an
application in which the linearity of the group delay is important. The other one is a
delay network with band-pass delay elements, which is desirable for creating larger
group delay.
• In Chapter 5, a hybrid DA-OBC architecture was proposed. The hybrid DA-OBC,
which was based on the memory reduction of DA-OBC [20], was re-structured so
that it could be applied to high-order FIR filters in conjunction with the “partial sum
technique.”
• In Chapter 5, a hybrid DA architecture was proposed. Unlike the conventional LUT-
based DA structure and the adder-based DA structure, the hybrid DA architecture
allows the use of both the LUT and adders simultaneously for the filtering operation.
It was shown that the hybrid DA architecture is not only memory efficient but also
silicon-area efficient.
• In Chapter 5, a reusable DA base unit and a reusable DA (RDA) block were proposed.
The RDA can significantly reduce the custom VLSI silicon size for matrix-vector
multiplication such as DCT and DFT by multiplexing a reusable DA base unit. Even
though the RDA requires more clock cycles than the typical DA, this architecture is
still very attractive in terms of throughput compared to non-DA approaches.
• In Chapter 6, the LUT-based DA architecture was extended to the LMS adaptive
filter algorithm. This structure efficiently updates the DA filtering LUT (DA-F-LUT)
by introducing an auxiliary LUT (DA-A-LUT) and using an address rotation scheme.
LMS adaptive filters whose sizes vary from 16 to 1024 taps were implemented and
the results were analyzed.
• In Chapter 7, a hybrid DA LMS adaptive filter architecture was proposed.
Implementation results of the LUT-less hybrid DA LMS show that the hybrid DA LMS
architecture is more suitable than the DAAF architecture not only for higher-order
adaptive filters but also for larger base units.
8.2 Directions for Future Research
In this section, the remaining issues related to the current structure and future research
directions are addressed.
• Continuous-time ANS using analog computing elements operating in the subthreshold
range was proposed for low-power computation. Each programmable analog
computation block for the continuous-time ANS was fabricated, programmed, tested,
and measured individually. Integrating all 32 subbands into a single VLSI chip
is still challenging because of the limited silicon area. More effort should be made to
minimize the size of the computation blocks, to increase the precision of the
programming, and to reduce the circuit noise caused by imperfect properties and
mismatches of the analog circuitry.
• Continuous-time low-pass and band-pass delay elements were investigated, and de-
lay networks based on these delay elements were presented. It is worthwhile to
characterize the group delay properties of subthreshold continuous-time filters such
as voltage-based C4 circuits and current-based MITE circuits.
• For adaptive filter applications, LUT-based LMS architectures and LUT-less LMS
architectures have been implemented in this research. Hybrid LMS architectures are
generalizations of the DA LMS architecture, enabling simultaneous usage of LUT
and adders; these two implementations are only special cases of hybrid LMS
architectures, maximizing memory usage or LE usage, respectively. Future research
should include the implementation of hybrid LMS architectures that use both LUT
and adders at the same time for the balanced usage of memory and LEs, and the
estimation of the optimal $k_L$ and $k_A$ values that maximize the implementable number
of filter taps for various FPGAs.
• LMS adaptive filters with long filter taps were implemented on a reconfigurable ar-
chitecture for high-speed implementation targeting AEC applications. In the real
world, the actual performance of the LMS adaptive filter is severely limited by the
double-talk situation. For more realistic system implementation, implementing a
double-talk detector is also highly desirable.
• It is well known that the NLMS adaptive filter works better than the LMS adaptive
filter in terms of convergence speed and stability. An LMS adaptive filter implementation
based on the efficient LUT update method using an auxiliary LUT for high-speed
applications was proposed in this research. Therefore, the development of new DA
architectures for the NLMS adaptive filter, which can make convergence faster than
that of the LMS adaptive filter while maintaining the throughput advantage of the DAAF or
the hybrid DA LMS, would be a very interesting research direction.
• The architecture for a highly area-optimized matrix-vector multiplication proces-
sor, e.g., the DCT processor, based on a reusable DA (RDA) was proposed. Since
this architecture comes with area-throughput trade-offs, I believe that future research
should include the theoretical analysis of the area-throughput trade-offs and the fab-
rication of custom VLSI processors.
• Transform-domain adaptive filters show better performance than the same type of
adaptive filter implemented in the time-domain. Most commonly used transforms
include DCT, DFT, or DWT, and they can be implemented using RDA architecture
in a very area-efficient manner, as explained in Chapter 5.5.
• Subband AEC (SAEC) can work better than wideband AEC as long as the corner
frequencies of the prototype filters for the analysis filter bank and the synthesis filter bank
are carefully chosen to avoid the bad conditioning caused by the low excitation
levels near the band edges [56, 71]. These prototype low-pass filters normally
have a few hundred filter taps to achieve extremely high stop-band attenuation. As the
number of subbands increases, a number of high-order prototype filters need to be
implemented. A fully parallel implementation of these prototype filters is not feasible
because of the rapidly increasing area. I believe that the RDA architecture could




REFERENCES
[1] “Stratix device family data sheet.” Altera Corporation,
http://www.altera.com/literature/lit-index.html.
[2] Allred, D. J., Krishnan, V., Huang, W., and Anderson, D., “Implementation of an
LMS adaptive filter on an FPGA employing multiplexed multiplier architecture,” in
Proceedings of the Asilomar Conference on Signals, Systems, and Computers, vol. 1,
(Pacific Grove, CA), pp. 918–921, Nov. 2003.
[3] Allred, D. J., Yoo, H., Krishnan, V., Huang, W., and Anderson, D. V., “LMS adaptive
filters using distributed arithmetic for high throughput,” IEEE Transactions on
Circuits and Systems I. Resubmitted with minor changes.
[4] Allred, D. J., Yoo, H., Krishnan, V., Huang, W., and Anderson, D. V., “A novel high
performance distributed arithmetic adaptive filter implementation on an FPGA,” in
Proceedings of the IEEE ICASSP, vol. 5, (Montreal, Canada), pp. 161–164, May
2004.
[5] Ambardar, A., Analog and digital signal processing. Boston, MA: PWS Publishing
Company, 1995.
[6] Anderson, D. V., “Model based development of a hearing aid,” Master’s thesis,
Brigham Young University, Provo, Utah, 1994.
[7] Anderson, D. V., Hasler, P., Ellis, R., Yoo, H., Graham, D., and Hans, M., “A
low-power system for audio noise suppression: A cooperative analog-digital signal
processing approach,” in Proceedings of the 10th IEEE DSP workshop, vol. 2, (Pine
Mountain, GA), pp. 728–731, Oct. 2002.
[8] Arslan, L., McCree, A., and Viswanathan, V., “New methods for adaptive noise
suppression,” in Proceedings of the IEEE ICASSP, vol. 1, May 1995.
[9] Beaufays, F., “Transform-domain adaptive filters: An analytical approach,” IEEE
Transactions on Signal Processing, vol. 43, pp. 422–431, Feb. 1995.
[10] Benesty, J., Morgan, D. R., and Cho, J. H., “A new class of doubletalk detectors
based on cross-correlation,” IEEE Transactions on Speech and Audio Processing,
vol. 8, pp. 168–172, Mar. 2000.
[11] Blinchikoff, H., “A note on wide-band group delay,” IEEE Trans. on Circuit Theory,
pp. 577–578, Sept. 1971.
[12] Boll, S. F., “Suppression of acoustic noise in speech using spectral subtraction,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29,
pp. 113–120, Apr. 1979.
[13] B, K. and W, H., “A CMOS analog continuous-time delay line with
adaptive delay-time control,”IEEE Journal of Solid-State Circuits, vol. 23, no. 3,
pp. 759–766, 1988.
[14] C, T.-S., C, C., andJ, C.-W., “New distributed arithmetic algorithm and
its application to IDCT,”IEE Proceedings Circuits, Devices and Systems, vol. 146,
pp. 159–163, Aug. 1999.
[15] C, T.-S.andJ, C.-W., “Hardware-efficient implementations for discrete func-
tion transforms using LUT-based FPGAs,”IEE Proceedings Circuits, Devices and
Systems, vol. 146, pp. 309–315, Nov. 1999.
[16] C, A. andR, R. K., “An architectural transformation program for opti-
mization of digital systems by multi-level decomposition,” inProceedings of 30th
ACM/IEEE Digital Automation Conference, pp. 343–348, 1993.
[17] C, A., R, R. K., and’A , M. A., “Greedy hardware optimization for
linear digital circuits using number splitting and refactorization,”IEEE Transactions
on Very Large Scale Integration Systems, vol. 1, pp. 423–431, Dec. 1993.
[18] C, C., C, T.-S., andJ, C.-W., “The IDCT processor on the adder-based
distributed arithmetic,” in1996 Symposium on VLSI Circuits Digest of Technical
Papers, pp. 36–37, 1996.
[19] C, J. H., M, D. R., andB, J., “An objective technique for evaluating
doubletalk detectors in acoustic cancelers,”IEEE Transactions on Speech and Audio
Processing, vol. 7, pp. 718–724, Nov. 1999.
[20] C, J., S, S., andC, J., “Efficient ROM size reduction for distributed arith-
metic,” inProceedings of the IEEE ISCAS, vol. 2, (Geneva, Switzerland), pp. 61–64,
May 2000.
[21] C, C. F. N.andM, J., “New digital-adaptive filter implementation using
distributed-arithmetic techniques,”IEE Proceedings, vol. 128, Pt. F, pp. 225–230,
Aug. 1981.
[22] C, R. E., “A weighted overlap-add method of shoft-time fourier analy-
sis/synthesis,”IEEE Transactions on Acoustics, Speech, and Signal Processing,
vol. ASSP-28, pp. 99–102, Feb. 1980.
[23] C, R. E.andR, L. R., Multirate Digital Signal Processing. Engle-
wood Cliffs, NJ: Prentice Hall, 1983.
[24] C, A., E, D. J., L, M. E., andR, V., “Digital filter for PCM
encoded signals.” U.S. Patent No. 3,777,130, issued Apr, 1973.
[25] D, R. W., Approximation Methods for Electronic Filter Design. New York,
NY: McGraw-Hill, 1974.
119
[26] D, E. J., “A subband noise-reduction method for enhancing speech in tele-
phony & teleconferencing,” inIEEE Workshop on Applications of Signal Processing
to Audio and Acoustics, (Mohonk Mountain House, New Paltz, NY), pp. 19–22, Oct.
1997.
[27] D, H., “A stable fast affine projection adaptation algorithm suitable for low-cost
processors,” inProceedings of the IEEE ICASSP, vol. 1, pp. 360–363, June 2000.
[28] D, S. C., “Efficient approximate implementations of the fast affine projection
algorithm using orthogonal transforms,” inProceedings of the IEEE ICASSP, vol. 3,
(Atlanta, GA), pp. 1656–1659, May 1996.
[29] D, D. L., “A twelve-channel digital echo canceler,”IEEE Transactions on
Communications, vol. 26, pp. 647–653, May 1978.
[30] E, R., Y, H., G, D., H, P., andA, D. V., “Analog audio
signal enhancement system using a noise suppression algorithm.” U.S. Patent Serial
No. 394783, filed Mar. 21, 2003.
[31] E, R., Y, H., G, D., H, P., andA, D. V., “A continuous-
time speech enhancement front-end for microphone inputs,” inProceedings of the
IEEE ISCAS, vol. 2, (Phoenix, AZ), pp. 728–731, May 2002.
[32] E, Y. andM, D., “Speech enhancement using a minimum mean-square
error short-time spectral amplitude estimator,”IEEE Transactions on Acoustics,
Speech, and Signal Processing, vol. ASSP-32, pp. 1109–1121, Dec. 1984.
[33] E, Y., M, D., andJ, B. H., “On the application of hidden Markov
models for enhancing noisy speech,”IEEE Transactions on Acoustics, Speech, and
Signal Processing, vol. ASSP-37, pp. 1846–1856, Dec. 1989.
[34] E, W. andM, G. S., “Noise reduction by noise adaptive spectral magni-
tude expansion,”Journal of the Audio Engineering Society, vol. 42, May 1994.
[35] F-B, B., Adaptive Filters: Theory and Applications. Chichester,
England: John Wiley and Sons, 1998.
[36] F-B, B., L, Y., andK, C. C., “Sliding transforms for efficient
implementation of transform domain adaptive filters,”Signal Processing, vol. 52,
pp. 83–96, 1996.
[37] F, G., “Digital signal processor trends,”IEEE Micro, pp. 52–59, Nov. 2000.
[38] F, S.andS, M. M., Advances in speech signal processing. New York, NY:
Marcel Dekker, 1991.
[39] G̈, T., H, M., I, C. J., andS, G., “A double-talk
detector based on coherence,”IEEE Transactions on Communications, vol. 44,
pp. 1421–1427, Nov. 1996.
120
[40] G, S. L., “A fast converging, low complexity adaptive filtering algorithm,” inPro-
ceedings of 3rd International Workshop on Acoustic Echo Control, pp. 223–226,
Sept. 1993.
[41] G, S. L., “The fast affine projection algorithm,” inProceedings of the IEEE
ICASSP, pp. 3023–3026, 1995.
[42] G, S. L. and B, J., Acoustic Signal Processing for Telecommunication.
Boston, MA: Kluwer Academic Publishers, 2000.
[43] G, P. R., “On the approximation problem for band-pass delay lines,” inProceed-
ing of IRE, pp. 1986–1987, Sept. 1962.
[44] G, D., Continuous–time bandpass second–order sections and their applica-
tions in cochlea modeling. Atlanta, GA: MS thesis, Georgia Institute of Technology,
2003.
[45] G, D. andH, P., “Capacitively-coupled current conveyer second-order
section for continuous-time bandpass filtering and cochlea modeling,” inProceed-
ings of the IEEE ISCAS, vol. 5, (Phoenix, AZ), pp. 485–488, May 2002.
[46] G, D., S, P., E, R., C, R., andH, P., “A programmable
bandpass array using floating gates,” inProceedings of the IEEE ISCAS, vol. 1, (Van-
couver, British Columbia), pp. 97–100, May 2004.
[47] H, R. I., “Subexpression sharing in filters using canonic signed digit multi-
pliers,” IEEE Transactions on Circuits and Systems II, vol. 43, pp. 677–688, Oct.
1996.
[48] H, P., D, C., M, B. A., andM, C., Advances in Neural Information
Processing Systems 7, ch. Single transistor learning synapses, pp. 817–824. Cam-
bridge, MA: MIT Press, 1995.
[49] H, P., K, M., andM, B. A., “A transistor–only circuit model of the au-
tozeroing floating–gate amplifier,” inProceedings of the IEEE Midwest Symposium
on Circuits and Systems, (Las Cruces), pp. 157–160, 1999.
[50] H, P. andL, T. S., “Overview of floating-gate devices, circuits, and sys-
tems,”IEEE Transactions on Circuits and Systems II, vol. 48, pp. 1–3, Jan. 2001.
[51] H, P., M, B. A., D, J., andD, C., “Adaptive circuits and synapses
using pFET floating-gate devices,” inLearning in Silicon(C, G., ed.),
pp. 33–65, Kluwer Academic, 1999.
[52] H, M. H., Statistical digital signal processing and modeling. New York, NY:
John Wiley and Sons, 1996.
[53] H, S., Adaptive Filter Theory. Upper Saddle River, NJ: Prentice Hall, 1996.
[54] H, W., K, V., A, D. J., andY, H., “Design analysis of a dis-
tributed arithmetic adaptive FIR filter on an FPGA,” inProceedings of the Asilo-
mar Conference on Signals, Systems, and Computers, vol. 1, (Pacific Grove, CA),
pp. 926–930, Nov. 2003.
[55] H, S., H, G., K, S., andK, J., “New distributed arithmetic algorithm
for low-power FIR filter implementation,”IEEE Signal Processing Letters, vol. 11,
pp. 463–466, May 2004.
[56] II, P. L. D. L. and E, D. M., “Experimental results with increased bandwidth
analysis filters in oversampled, subband acoustic echo cancellers,” IEEE Signal
Processing Letters, vol. 2, no. 1, pp. 1–3, 1995.
[57] J̀, R. L. B., S, P., F, G., and. B, “Combined noise and
echo reduction in hands-free systems: A survey,” IEEE Transactions on Speech and
Audio Processing, vol. 9, pp. 808–820, Nov. 2001.
[58] J, J., H, J. G., andP, J. C., “Analog hardware implementation of
adaptive filter structures,” inProceedings of the International Joint Conference on
Neural Networks, vol. 2, pp. 916–921, June 1997.
[59] K, U., B, P. T., and L, W., “Low-power design techniques for high-
performance CMOS adders,”IEEE Transactions on Very Large Scale Integration
Systems, vol. 3, pp. 327–333, June 1995.
[60] K, M., L, A., H, P., andN, J., “A programmable continuous-time
analog fourier processor,”IEEE Transactions on Circuits and Systems II, vol. 48,
pp. 90–99, Jan. 2001.
[61] L, K. R., G, A., andF, P. E., “Design and implementation of
cascaded switched-capacitor delay equalizers,”IEEE Transactions on Circuits and
Systems I, vol. 32, pp. 700–711, July 1985.
[62] L, J. C. and U, C. K., “Performance of transform domain LMS adaptive
algorithms,” IEEE Transactions on Acoustics, Speech, and Signal Processing,
vol. ASSP-34, pp. 499–510, June 1986.
[63] L, Q. G., C, B., andH, K. C., “On the use of a modified fast affine
projection algorithm in subbands for acoustic echo cancellation,” inProceedings of
the 10th IEEE DSP workshop, p. 354–357, 1996.
[64] L, S.andM, C. A., “Continuous-time adaptive delay system,”IEEE Transac-
tions on Circuits and Systems II, vol. 43, pp. 744–751, Nov. 1996.
[65] M, R., “Noise power spectral density estimation based on optimal smoothing
and minimum statistics,” IEEE Transactions on Speech and Audio Processing, vol. 9,
pp. 504–512, July 2001.
[66] MA, R. J.and M, M. L., “Speech enhancement using a soft-decision
noise suppression filter,”IEEE Transactions on Acoustics, Speech, and Signal
Processing, vol. 28, pp. 137–145, Apr. 1980.
[67] M, E. J.andM, B. A., “Synthesis of a translinear analog adaptive filter,”
in Proceedings of the IEEE ICASSP, (Orlando, FL), pp. 321–324, May 2002.
[68] M, C. A., Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley,
1989.
[69] M, B. A., “Multiple-input translinear-element log-domain filters,”IEEE Trans-
actions on Circuits and Systems II, vol. 48, pp. 29–36, Jan. 2001.
[70] M, P. H. B. A.andD, C., “An autozeroing floating-gate bandpass filter,” in
Proceedings of the IEEE ISCAS, (Monterey, CA), pp. 131–134, 1998.
[71] M, D. R., “Slow asymptotic convergence of LMS acoustic echo cancellers,”
IEEE Transactions on Speech and Audio Processing, vol. 3, no. 2, pp. 126–136,
1995.
[72] M, D. R. and T, J. C., “A delayless subband adaptive filter architecture,”
IEEE Transactions on Signal Processing, vol. 43, pp. 1819–1830, Aug. 1995.
[73] N, S. S., P, A. M., and N, M. J., “Transform domain
LMS algorithm,” IEEE Transactions on Acoustics, Speech, and Signal Processing,
vol. ASSP-31, pp. 609–615, June 1983.
[74] O, K., A, T., andO, T., “Echo canceler with two echo path mod-
els,” IEEE Transactions on Communications, vol. COM-25, pp. 589–595, 1977.
[75] O, K. andU, T., “An adaptive filtering algorithm using an orthogonal pro-
jection to an affine subspace and its propoerties,”Electronics and Communications
in Japan, vol. 67-A, no. 5, pp. 19–27, 1984.
[76] P, S. J., C, C. G., L, C., andY, D. H., “Integrated echo and noise canceler
for hands-free applications,”IEEE Transactions on Circuits and Systems II, vol. 49,
pp. 188–194, Mar. 2002.
[77] P, S. J., C, C. G., L, C., andY, D. H., “Integrated echo and noise canceler
for hands-free applications,”IEEE Transactions on Circuits and Systems II, vol. 49,
pp. 188–195, Mar. 2002.
[78] P, A. andL, B., “A new hardware realization of digital filters,”IEEE Trans-
actions on Acoustics, Speech, and Signal Processing, vol. 22, pp. 456–462, Dec.
1974.
[79] R, J. M., C, A., andN, B., Digital Integrated Circuits: A
Design Perspective. Upper Saddle River, NJ: Prentice Hall, 2002.
[80] S, H., S, H., andD, L., “HMM-based strategies for enhancement
of speech signals embedded in nonstationary noise,”IEEE Transactions on Speech
and Audio Processing, vol. 6, pp. 445–455, Sept. 1998.
[81] S, R., Efficient precise computation with noisy components: extrapolat-
ing from an electronic cochlea to the brain. Pasadena, CA: PhD thesis, California
Institute of Technology, 1997.
[82] S, M. R., “Apparatus for suppressing noise and distortion in communica-
tion signals.” U.S. Patent No. 3,180,936, issued Apr. 27, 1965.
[83] S, M. R., “Processing of communications signals to reduce eff cts of
noise.” U.S. Patent No. 3,403,224, issued Sept. 24, 1968.
[84] S-G, T., L-B, B., andA, A. G., “A general
translinear principle for subthreshold MOS transistors,”IEEE Transactions on Cir-
cuits and Systems I, vol. 46, pp. 607–616, May 1999.
[85] S, S., “The modulated lapped transform, its time-varying forms, and its applica-
tion to audio coding standards,”IEEE Transactions on Speech and Audio Processing,
vol. 5, no. 4, pp. 359–366, 1997.
[86] S, B. L., T, Y. C., andC, J. S., “A parametric formulation of the general-
ized spectral subtraction method,”IEEE Transactions on Speech and Audio Process-
ing, vol. 6, pp. 328–337, July 1998.
[87] S, P., C, R., G, D., andH, P., “A five–transistor bandpass filter
element,” inProceedings of the IEEE ISCAS, vol. 1, (Vancouver, British Columbia),
pp. 861–864, May 2004.
[88] S, P., K, M., E, R., H, P., andA, D. V., “Mel-frequency
cepstrum encoding in analog floating-gate circuitry,” inProceedings of the IEEE
ISCAS, vol. IV, (Phoeniz, AZ), pp. 671–674, May 2002.
[89] S, P., K, M., andH, P., “Accurate programming of analog floating–gate
arrays,” inProceedings of the IEEE ISCAS, vol. 5, (Scottsdale, AZ), pp. 489–492,
May 2002.
[90] S, M. M., “An adaptive echo canceler,”Bell System Technical Journal, vol. 46,
pp. 497–511, 1967.
[91] S, M. M. andP, A. J., “A self-adaptive echo canceler,”Bell System Tech-
nical Journal, vol. 45, pp. 1851–1854, 1966.
[92] S, G., “The design of arithmetically symmetrical band-pass filters,”IEEE
Trans. on Circuit Theory, pp. 367–375, Sept. 1963.
[93] T, M., M, S., andK, J., “A block exact fast affine projection algo-
rithm,” IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 79–86, Jan.
1999.
[94] T, H. L. V., Detection, Estimation, and Modulation Theory, part I. New York,
NY: Wiley, 1968.
[95] W, C. H.andL, J. J., “Multi memory block structure for implementing a digital
adaptive filter using distributed arithmetic,”IEE Proceedings, vol. 133, Pt. G, pp. 19–
26, Feb. 1986.
[96] W, S. A., “Applications of distributed arithmetic to digital signal processing: A
tutorial review,” IEEE ASSP Magazine, vol. 6, pp. 4–19, July 1989.
[97] W, B. and S, S. D., Adaptive Signal Processing. Englewood Cliffs, NJ:
Prentice-Hall Inc., 1985.
[98] Y, H. andW, B. X., “A new double-talk detection algorithm based on the orthog-
onality theorem,”IEEE Transactions on Communications, vol. 39, pp. 1542–1545,
Nov. 1991.
[99] Y, H. andA, D. V., “Hardware-efficient distributed arithmetic architecture
for high-order digital filters,” inProceedings of the IEEE ICASSP, (Philadelphia,
PA), May 2005. to be published.
[100] Y, H., A, D. V., andH, P., “Continuous-time audio noise suppression
and a custom low-power IC implementation,” inProceedings of the IEEE ICASSP,
(Orlando, FL), pp. 3980–3983, May 2002.
[101] Y, H., A, D. V., andH, P., “On delay structures for the analog adap-
tive filters with long filter taps,” inProceedings of the Asilomar Conference on Sig-
nals, Systems, and Computers, vol. 2, (Pacific Grove, CA), pp. 2021–2025, Nov.
2003.
[102] Y, H., G, D., A, D. V., andH, P., “C4 band-pass delay filter for
continuous-time subband adaptive tapped-delay filter,” inProceedings of the IEEE
ISCAS, vol. 5, (Vancouver, Canada), pp. 792–795, May 2004.
[103] Y, S.andS, E. E., “DCT implementation with distributed arithmetic,”
IEEE Transactions on Computers, vol. 50, pp. 985–991, Sept. 2001.
