VLSI low-power digital signal processing by Farag, Emad N.
VLSI Low-Power Digital Signal Processing
Emad N. Farag 
A thesis
presented to the University of Waterloo 
in fulfilment of the 
thesis requirement for the degree of 
Doctor of Philosophy 
in 
Electrical Engineering





The author has granted a non- 
exclusive licence allowing the 
National Library of Canada to 
reproduce, loan, distribute or sell
copies of this thesis in microform,
paper or electronic formats. 
The author retains ownership of the 
copyright in this thesis. Neither the 
thesis nor substantial extracts from it
may be printed or otherwise
reproduced without the author's 
permission. 
The University of Waterloo requires the signatures of all persons using or
photocopying this thesis. Please sign below, and give address and date.
Abstract 
This thesis reports on new high-level low-power design techniques for digital signal
processing for wireless portable systems. Through proper choice and optimization
of an algorithm or an architecture, significant savings in power dissipation are achieved:
up to an order of magnitude, with little or no degradation in speed or SNR performance.
At the heart of these techniques is the minimization of the computational complexity
by the elimination of redundant and irrelevant computations. Redundant
computations are extra computations that can be eliminated by applying appro- 
priate transformations to an architecture or an algorithm without changing its 
functionality. Irrelevant computations are unnecessary computations that can be 
eliminated by optimizing the datapath width. 
The elimination of redundant computations has been applied to the design of a 
division algorithm. The division algorithm generates the quotient in the minimum
signed-digit representation. Hence, the number of addition/subtraction operations
is minimized.
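To illustrate why a minimum signed-digit quotient reduces the number of addition/subtraction operations, the following Python sketch recodes an integer into non-adjacent form (NAF), a well-known signed-digit representation with the fewest possible nonzero digits. This is only an illustration under stated assumptions: the function names to_naf, from_digits and nonzero_count are hypothetical, and the sketch is not the division algorithm developed in this thesis.

```python
def to_naf(n):
    """Recode a non-negative integer into non-adjacent form (NAF).

    Returns digits in {-1, 0, 1}, least significant digit first.
    Hypothetical helper for illustration only.
    """
    digits = []
    while n > 0:
        if n & 1:
            # Choose +1 when n = 1 (mod 4) and -1 when n = 3 (mod 4),
            # so that the next digit is guaranteed to be zero.
            d = 2 - (n % 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits


def from_digits(digits):
    """Evaluate a signed-digit string (least significant digit first)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def nonzero_count(digits):
    """Number of nonzero digits, i.e. add/subtract steps if each costs one."""
    return sum(1 for d in digits if d != 0)


if __name__ == "__main__":
    for value in (7, 23, 119):
        naf = to_naf(value)
        print(value, bin(value).count("1"), nonzero_count(naf), naf)
```

For example, 7 is 111 in binary (three nonzero digits) but 8 - 1 in signed-digit form (two nonzero digits), so a recurrence that performs one addition or subtraction per nonzero digit does less work.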
A subband coding image compression algorithm with a simplified filtering struc- 
ture that requires only addition and subtraction operations has been developed. 
This simplified filtering structure reduces the power dissipation by 23 times. A
new vector quantization algorithm, having a simplified decoding structure, has also
been developed for this subband coding algorithm. 
The increased flexibility and functionality of signal processing in the digital 
domain is pushing digital signal processing more and more into the arena of high-speed
analog signals. To make this possible, high-speed, high-resolution analog-to-digital
converters are required. Sigma-Delta A/D converters are known for
their high-resolution capabilities using low-precision components.
Fourfold parallelism of analog signal processors is applied to the design of a bandpass
Sigma-Delta modulator. The speed of the modulator is increased without
increasing the speed requirement of the individual building blocks. 
The elimination of redundant and irrelevant computations has been employed
in the design of the decimation filter. The decimation filter consists of two parts, 
the Sinc decimator and a lowpass decimation filter. In the Sinc decimator, the 
computational redundancy is minimized. The datapath width of the Sinc decimator 
is optimized to eliminate irrelevant computations.
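One widely used way to build a Sinc decimator without multipliers is the cascaded integrator-comb (CIC) arrangement, in which N running sums operate at the input rate and N first differences operate at the decimated rate. The Python sketch below checks that this arrangement gives the same output as N cascaded length-M moving sums followed by decimation by M. It is an illustrative assumption, not one of the Sinc decimator architectures compared in Chapter 5; the function names are hypothetical, and M = 8, N = 3 are example parameters.

```python
def sinc_direct(x, M, N):
    """Reference: N cascaded length-M moving sums, then keep every M-th sample."""
    y = list(x)
    for _ in range(N):
        y = [sum(y[max(0, i - M + 1): i + 1]) for i in range(len(y))]
    return y[M - 1::M]


def sinc_cic(x, M, N):
    """CIC form: N integrators at the input rate, decimate, N combs at the output rate."""
    v = list(x)
    for _ in range(N):              # integrator stages (running sums)
        acc, out = 0, []
        for s in v:
            acc += s
            out.append(acc)
        v = out
    v = v[M - 1::M]                 # decimation by M
    for _ in range(N):              # comb stages (first differences)
        prev, out = 0, []
        for s in v:
            out.append(s - prev)
            prev = s
        v = out
    return v


if __name__ == "__main__":
    # A 1-bit style input stream of +1/-1 values (arbitrary example data).
    bits = [1 if (i * 7) % 5 < 3 else -1 for i in range(64)]
    print(sinc_cic(bits, 8, 3))
    print(sinc_cic(bits, 8, 3) == sinc_direct(bits, 8, 3))
```

For a constant input of 1 the steady-state output is M^N, which gives a feel for how much the word length grows through the stages and why the datapath width is worth optimizing.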
The lowpass decimation filter employs multiplication minimization and operation
interleaving to reduce the power dissipation. Furthermore, the lowpass decimation
filter is designed to be resolution-programmable, allowing the deactivation
of the blocks corresponding to the least significant bits when a lower resolution is
sufficient. The decimation filter has been designed in a 0.5 µm, 3.3 Volt CMOS
technology.
Eliminating the pre-filter multiplier substantially reduces the power dissipation
of a digital channel selection algorithm. The pre-filter multiplier has been replaced
by less computationally complex operators, such as multiplexers and XOR
gates. The frequency spectrum is divided into four overlapping frequency bands.
This reduces the filter sharpness requirements, and hence contributes to the power 
saving. This algorithm achieves up to an order of magnitude saving in power dis- 
sipation. 
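A standard case in which a frequency-shifting multiplier reduces to trivial logic is a shift by one quarter of the sampling rate: the mixing sequence exp(-jπn/2) only takes the values 1, -j, -1 and j, so each product is a swap and/or sign change of the real and imaginary parts. The Python sketch below compares that shortcut with an explicit complex multiplication. The quarter-rate shift and the function names are assumptions chosen for illustration; the shifters actually used in this thesis are described in Chapter 7.

```python
import cmath


def shift_quarter_rate(samples):
    """Multiply x[n] by exp(-j*pi*n/2) using only swaps and negations."""
    out = []
    for n, x in enumerate(samples):
        re, im = x.real, x.imag
        phase = n % 4
        if phase == 0:              # factor +1
            y = complex(re, im)
        elif phase == 1:            # factor -j: (re + j*im)(-j) = im - j*re
            y = complex(im, -re)
        elif phase == 2:            # factor -1
            y = complex(-re, -im)
        else:                       # factor +j: (re + j*im)(j) = -im + j*re
            y = complex(-im, re)
        out.append(y)
    return out


def shift_reference(samples):
    """Same shift computed with a full complex multiplication per sample."""
    return [x * cmath.exp(-1j * cmath.pi * n / 2) for n, x in enumerate(samples)]


if __name__ == "__main__":
    data = [complex(3, -2), complex(-1, 4), complex(0, 5), complex(7, 7)]
    fast = shift_quarter_rate(data)
    slow = shift_reference(data)
    print(all(abs(a - b) < 1e-9 for a, b in zip(fast, slow)))
```

In hardware terms, the swap corresponds to a multiplexer and the conditional negation to an XOR-based sign inversion, which is the kind of substitution the paragraph above refers to.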
Acknowledgements 
It is by the grace and power of God the Almighty that I was able to complete this
work. All things were made by Him; and without Him was not anything made that 
was made. 
Looking back at the past three years I realize that this work could not have
been successfully completed without the supervision, guidance, assistance, encouragement
and support of others. First, and foremost, I would like to thank my supervisor
Professor Mohamed I. Elmasry, for his valuable suggestions, for his guidance,
encouragement and support throughout the program. His assistance has been of 
great value and is greatly appreciated. 
I would also like to express my thanks and gratitude to the members of the
Wireless Circuits and Systems department at Bell Laboratories, Lucent Technologies.
Particularly, I would like to thank Dr. Ran-Hong Yan, the department head,
for his valuable consultations, and for his support during my internship program
there. I would also like to thank Peng-Wen Ong, Eric H. Westerwick, and
Donald D. Shugard for their consultations and assistance with the CAD tools. 
I would also like to thank Professor M. Anwarul Hasan for his valuable consultations
and his encouragement. Thanks are also due to Phil Regier and Anil Rana,
the system administrators at the University of Waterloo and at Bell Laboratories
respectively, for their valuable computer assistance and support.
I am also deeply indebted to all the members of my family, for their encouragement
and genuine support throughout my educational life. Especially, I would like
to thank my father, my mother, my brothers, Naguib and Maged, and my fiancée,
Irene. None of this could have been achieved without their enthusiasm.
The help of all my friends and colleagues both past and current is gratefully
acknowledged. Special thanks are due to Essam S. Tony of the University of Waterloo,
and Stephan ten Brink of the University of Stuttgart, Germany, for being good
friends and for their valuable discussions. I would also like to thank my colleagues
at the VLSI Research Group for their useful discussions and for their inspiring
ideas.
This research is supported in part by MICRONET and by an Ontario Graduate 
Scholarship (OGS). This support is greatly appreciated. 
But most of all, and above all, thanks to The LORD who makes all things 
possible. 
This work is dedicated to the memory of my grandparents 
Dr. Naguib Farag 
Dr. Zaki Iskander 
Mrs. Fayqua Farag, and 
Mrs. Mary Guirguis 
Contents 
1 Introduction
1.1 Thesis Contributions
1.2 Thesis Outline
2 Wireless Communication Systems
2.1 Land Mobile Wireless Systems Standards
2.2 Wireless Transceiver Architectures
2.3 Software Radio
2.4 Chapter Summary
3 Low-Power Design Techniques
3.1 Introduction
3.2 Sources of Power Dissipation
3.3 Estimating the Power Dissipation
3.4 Low-Power Examples of Portable Systems
3.5 Reducing the Power Dissipation at the Device and Circuit Levels
3.6 Low-Voltage Low-Power Operation
3.6.1 Pipelining and Parallelism at the Architecture Level
3.6.2 Carry Save Adder
3.7 Pipelining and Parallelism of the Discrete Cosine Transform
3.7.1 Three Alternative Architectures
3.7.2 Reducing Power Through Pipelining and Parallelism
3.7.3 Performance Evaluation
3.8 Effect of the Number System on the Switching Activity
3.9 Reducing the Number of Iterations
3.9.1 Higher Radix Division Algorithms
3.9.2 Minimizing Add/Sub Operations in Division
3.10 Reducing the Computational Complexity: Vector Quantization Example
3.11 Chapter Summary
4 Subband Coding: A Low-Power Design
4.1 Introduction
4.2 Video Compression Algorithms
4.3 Subband Coding for Image Compression
4.4 Performance-Power Tradeoff for Subband Coding
4.4.1 The Analysis/Synthesis Filter
4.4.2 Statistical Properties of the Subband Coded Signal
4.4.3 The Vector Quantization Algorithm
4.5 Performance of the Subband Coding Algorithm
4.6 Chapter Summary
5 A/D Converter for Software Radio
5.1 Introduction
5.2 The Resolution Requirement
5.3 A Parallel Bandpass Sigma-Delta Modulator
5.4 The Performance of the Parallel Sigma-Delta Modulator
5.5 Switched-Capacitor Architecture
5.6 The Decimation Filter Architecture
5.7 Power Efficient Sinc Decimator Architecture
5.7.1 First Architecture
5.7.2 Second Architecture
5.7.3 Third Architecture
5.7.4 Fourth Architecture
5.7.5 Comparison of the Sinc Decimator Architectures
5.8 Sinc Decimator Numerical Accuracy
5.9 Power Efficient Lowpass Filter Design
5.9.1 Variable Resolution Lowpass Architecture
5.10 VLSI Implementation of the Decimation Filter
5.11 Chapter Summary
6 Low-Power Multiplier-Accumulator Array
6.1 Introduction
6.2 The Modified Booth Algorithm Multiplier
6.2.1 Sign Extension
6.2.2 Partial Product Generation
6.3 The Multiplier-Accumulator Array
6.3.1 Analysis of the Computational Efficiency of the MAC Array
6.4 Programmable MAC Array
6.5 VLSI Implementation of the Programmable MAC Array
6.6 Chapter Summary
7 Digital Channel Selection
7.1 Introduction
7.2 The Conventional Channel Selection Algorithm
7.3 Pre-Filter Multiplier Elimination
7.3.1 The Channel Selection Algorithm
7.4 Filter Sharpness Relaxation
7.4.1 The Channel Selection Algorithm
7.4.2 Algorithm Implementation
7.5 Digital Channel Selection Algorithms: Comparison
7.6 Chapter Summary
8 Summary, Conclusions and Future Directions
8.1 Summary
8.2 Conclusions
8.3 Future Directions
8.3.1 Low-Power Digital Radio
8.3.2 Low-Power Multimedia
8.3.3 Low-Power CAD
8.3.4 New Low-Power Techniques
A The Simulation of the Sigma-Delta Modulator
B Sinc Decimator Analysis
Bibliography
List of Tables 
2.1 Comparison of analog and digital communication techniques
2.2 Major analog cellular standards
2.3 Digital cellular standards
2.4 Wireless data network standards
3.1 DCT multiplier implementation delay and power dissipation
3.2 DCT pure ROM architecture delay and power dissipation
3.3 DCT mixed ROM architecture delay and power dissipation
3.4 Pure ROM implementation: Pipelining and parallelism
3.5 Mixed ROM implementation: Pipelining and parallelism
3.6 Gray code representation: Switching activity
3.7 Parameters of the division algorithm radix-dependent blocks
3.8 Power dissipation of the division algorithm blocks
3.9 Delay of the division algorithm blocks
3.10 Power dissipation and delay of the radix 2 division algorithm
3.11 Proposed division algorithm: Performance comparison
3.12 Vector quantization: Computation and memory requirements
4.1 Variance for the two-level subband image compression system
4.2 Variance and bit allocation for a one-level subband system
4.3 Variance and bit allocation for a two-level subband system
4.4 Power dissipation of the proposed VQ and that of a memory-based VQ
5.1 Dynamic range versus OSR for a second and a third order LPSD
5.2 Sinc decimators computational requirements
5.3 Sinc decimator I: Effect of fraction bit elimination
5.4 Sinc decimator IV: Effect of fraction bit elimination
5.5 Lowpass filter blocks: Power dissipation
5.6 Decimation filter: Power dissipation savings
5.7 Decimation filter: Number of transistors
6.1 Partial product generation
6.2 Multiplier accumulator: Basic components
6.3 MAC array blocks: Area, delay and power dissipation
6.4 W/L for the different CMOS technologies
6.5 MAC array power coefficients
7.1 Conventional digital channel selection algorithm: Power dissipation
7.2 Cascaded decimation filter: Power dissipation
7.3 Frequency-shift-multiplier output
7.4 Novel digital channel selection algorithm: Power dissipation
B.1 Sinc decimator impulse response
List of Figures 
2.1 A single-stage up conversion transmitter
2.2 A two-stage superheterodyne receiver
2.3 Direct conversion (homodyne) receiver
2.4 The software radio architecture
3.1 A simplified system consisting of two building blocks and the interconnection busses
3.2 The effect of reducing the supply voltage in CMOS circuits on delay and power dissipation
3.3 Unbalanced pipeline example
3.4 A two-datapath parallel system
3.5 Combining parallelism with pipelining to balance pipestage delays
3.6 Multiplier architecture for an 8-point 1D-DCT
3.7 Pure ROM architecture for an 8-point 1D-DCT
3.8 Mixed ROM architecture for an 8-point 1D-DCT
3.9 The effect of pipelining on reducing the power dissipation
3.10 The effect of parallelism on reducing the power dissipation
3.11 Combining parallelism with pipelining
3.12 Combining parallelism with pipelining
3.13 Conditional probability distribution between successive samples for the Gray code representation
3.14 Relative switching activity of the Gray code representations
3.15 Equal power dissipation curves for the Gray code and the unsigned binary adders
3.16 Block diagram of a digit-recurrence division algorithm
3.17 The effect of the radix on the power dissipation and throughput of the division algorithm
3.18 Minimum Add/Sub Division Algorithm
3.19 Minimum add/sub division algorithm using a limited precision QDS and a CSA
3.20 Tree-Structured Vector Quantization
4.1 General block diagram of an image/video compression algorithm
4.2 A 2D, 4-band, 1-level analysis/synthesis system
4.3 Frequency partitioning among the different bands
4.4 Block diagram of a one-level 2D subband analysis filter bank
4.5 A 16-subband 2-level analysis/synthesis system
4.6 A 7-subband 2-level analysis/synthesis system
4.7 Frequency spectrum of the simplified filters
4.8 Simplified subband coding algorithm
4.9 Aeroplane: The image used in subband coding
4.10 Statistical distribution of level one subband signals
4.11 Statistical distribution of level two subband signals
4.12 The architecture of the proposed VQ decoding algorithm
4.13 Aeroplane: The effect of the proposed two-level subband coding image compression algorithm
5.1 Bandpass Sigma-Delta A/D converter
5.2 Digital IF receiver architecture
5.3 A/D resolution requirement
5.4 Valid bandpass sampling rate regions
5.5 Using four lowpass Sigma-Delta modulators to implement a bandpass Sigma-Delta modulator
5.6 Second-order bandpass Sigma-Delta modulator
5.7 Second-order bandpass Sigma-Delta modulator with two cross-coupled branches
5.8 Fourth-order bandpass Sigma-Delta modulator with two cross-coupled branches
5.9 Mismatch in integrators
5.10 Degradation in SNR due to mismatch in the value of G_int
5.11 The effect of non-unity in G_int on the SNR of the conventional Sigma-Delta modulator
5.12 The effect of mismatch and non-unity in G_int on the SNR performance of the bandpass Sigma-Delta modulator of Figure 5.7.b
5.13 The effect of mismatch and non-unity in G_int on the SNR performance of the bandpass Sigma-Delta modulator of Figure 5.7.a
5.14 Change in SNR due to mismatch in the value of G_int
5.15 Distortion in the output of the parallel bandpass Sigma-Delta modulator
5.16 Distortion in the output of the parallel bandpass Sigma-Delta modulator
5.17 Frequency spectrum of the Sigma-Delta modulators of Figure 5.7 having G_int/G_int = 1.0/1.0
5.18 Frequency spectrum of the Sigma-Delta modulators of Figure 5.7 having G_int/G_int = 0.99/0.99
5.19 Frequency spectrum of the Sigma-Delta modulators of Figure 5.7 having G_int/G_int = 0.98/0.98
5.20 Frequency spectrum of the Sigma-Delta modulator of Figure 5.7.b having G_int/G_int = 0.99/0.98
5.21 Frequency spectrum of the Sigma-Delta modulator of Figure 5.7.a having G_int/G_int = 0.99/0.98
5.22 The effect of G_int of the first stage on the fourth-order bandpass Σ-Δ modulator
5.23 The effect of G_int of the second stage on the fourth-order bandpass Σ-Δ modulator
5.24 The effect of G_int of the first stage on the fourth-order bandpass Σ-Δ modulator
5.25 The effect of G_int of the second stage on the fourth-order bandpass Σ-Δ modulator
5.26 Frequency spectrum of a single-channel fourth-order bandpass Sigma-Delta modulator, having G_int/G_int = 1.0/1.0 for the first and second stages
5.27 Frequency spectrum of a single-channel fourth-order bandpass Sigma-Delta modulator, having the integrator placed after the subtractor in the first stage, and having G_int/G_int = 0.99/0.98 for the first stage, and G_int/G_int = 1.0/1.0 for the second stage
5.28 Frequency spectrum of a single-channel fourth-order bandpass Sigma-Delta modulator, having the integrator placed before the branch splitting in the first stage, and having G_int/G_int = 0.99/0.98 for the first stage, and G_int/G_int = 1.0/1.0 for the second stage
5.29 Frequency spectrum of a single-channel fourth-order bandpass Sigma-Delta modulator, having the integrator placed after the subtractor in the second stage, and having G_int/G_int = 1.0/1.0 for the first stage, and G_int/G_int = 0.99/0.98 for the second stage
5.30 Frequency spectrum of a single-channel fourth-order bandpass Sigma-Delta modulator, having the integrator placed before the branch splitting in the second stage, and having G_int/G_int = 1.0/1.0 for the first stage, and G_int/G_int = 0.99/0.98 for the second stage
5.31 Switched-Capacitor Integrator
5.32 A modified single-channel second-order bandpass Sigma-Delta modulator with two cross-coupled branches
5.33 The switched capacitor implementation of the proposed second-order bandpass Σ-Δ modulator
5.34 A modified single-channel fourth-order bandpass Sigma-Delta modulator with two cross-coupled branches, having non-causal blocks
5.35 A modified single-channel fourth-order bandpass Sigma-Delta modulator with two cross-coupled branches, having no non-causal blocks
5.36 The switched capacitor implementation of the proposed second-order bandpass Σ-Δ modulator
5.37 The decimation filter
5.38 The transfer function of a Sinc decimator having M = 8 and N = 3
5.39 First architecture of a Sinc decimator
5.40 Second architecture of a Sinc decimator
5.41 Third architecture of a Sinc decimator
5.42 Sinc decimator III: Sinc3(16)
5.43 Fourth architecture of a Sinc decimator
5.44 Third order Sinc computational requirements
5.45 Fourth order Sinc computational requirements
5.46 Sinc decimator I: Frequency spectrum at full resolution
5.47 Sinc decimator I: Frequency spectrum with approximation after the first stage
5.48 Sinc decimator I: Frequency spectrum with approximation after the second stage
5.49 Sinc decimator I: Frequency spectrum with 7 bit output resolution
5.50 Sinc decimator IV: Eight bit resolution output
5.51 Sinc decimator IV: Eight bit, nine bit resolution frequency spectrum
5.52 Conventional multiply-accumulate filter architecture
5.53 Lowpass filter timing diagram
5.54 Synchronization Block
5.55 Synchronization block timing diagram: Case I
5.56 Synchronization block timing diagram: Case II
5.57 The modified filter architecture
5.58 Relative power dissipation of a LPDF using block deactivation a = 0.4
5.59 Relative power dissipation of a LPDF using block deactivation a = 0.6
5.60 Relative power dissipation for a three parallel unit lowpass decimation filter
5.61 Relative power dissipation for a three parallel unit lowpass decimation filter
5.62 Block diagram of the designed decimation filter
5.63 VLSI layout of the decimation filter
6.1 Sign extension
6.2 The row-decoder
6.3 The Modified Booth Algorithm Array Multiplier
6.4 The array cells of the array multiplier given in Figure 6.3
6.5 Partial product bit generator
6.6 Block diagram of a multiplier-accumulator (MAC)
6.7 The Multiplier-Accumulator Array
6.8 The array cells of the MAC array given in Figure 6.7
6.9 10 x 10 bit multiplier-accumulator delay-power relationship
6.10 20 x 20 bit multiplier-accumulator delay-power relationship
6.11 MAC array power dissipation performance
6.12 MAC array speed performance
6.13 MAC array area performance
6.14 Adding 5 bits using half adders only
6.15 Adding 5 bits using full adders
6.16 Merging three half adders into a single full adder
6.17 Binary number addition using half adders and full adders
6.18 The bypass path for a deactivated cell
6.19 The blocking modules for the recoded multiplier signals
6.20 Programmable MAC array relative power dissipation
6.21 Programmable MAC array relative power dissipation
6.22 The VLSI layout of a programmable MAC array
7.1 The frequency spectrum of a baseband signal consisting of eight channels
7.2 The conventional channel selection algorithm
7.3 Number of filter taps vs. adjacent channel rejection, for the digital selection of 1 out of 32 channels
7.4 Channel selection example using lowpass and highpass filters only
7.5 Filter used in each frequency band
7.6 Channel selection example using lowpass, highpass and bandpass filters
7.7 Implementation of HP, BPN and BPP filters
7.8 Proposed channel selection algorithm for the selection of 1 out of 32 channels
7.9 Implementation of a {0, ±1, 2} 4 frequency shifter followed by a lowpass filter
7.10 Implementation of a ±{1, 3} frequency shifter followed by a lowpass filter
7.11 Computational power dissipation for digital channel selection algorithms
7.12 Relative computational power dissipation between conventional digital channel selection algorithms and proposed algorithm
8.1 Pie-chart of the distribution of the power dissipation in portable terminals
A.1 A second-order bandpass Sigma-Delta modulator modeled in SPW
A.2 SPW model of a single-channel second-order bandpass Sigma-Delta modulator, with two cross-coupled branches, and common filtering done after subtraction
A.3 SPW model of a single-channel second-order bandpass Sigma-Delta modulator, with two cross-coupled branches, and common filtering done before branch splitting
A.4 SPW model of a single-channel fourth-order bandpass Sigma-Delta modulator, with two cross-coupled branches, and common filtering done after subtraction in the first and second stages
A.5 SPW model of a single-channel fourth-order bandpass Sigma-Delta modulator, with two cross-coupled branches, and common filtering done after subtraction in the first stage and before branch splitting in the second stage
A.6 SPW model of a single-channel fourth-order bandpass Sigma-Delta modulator, with two cross-coupled branches, and common filtering done before branch splitting in the first stage and after subtraction in the second stage
A.7 SPW model of a single-channel fourth-order bandpass Sigma-Delta modulator, with two cross-coupled branches, and common filtering done before branch splitting in the first and second stages
A.8 SPW model of the block int1 of Figure A.1
A.9 SPW model of the block int2 of Figures A.2 - A.7
B.1 The progression of an input sample through a Sinc decimator having k = 1, m = 3 and n = 3
Chapter 1 
Introduction 
The state-of-the-art in modern telecommunications is most fascinating and
intriguing. Technical specialists are competing to digest modern techniques in
order to be capable of providing due service to the inspired and ambitious users,
to whom technology offers new areas and fields yet to be ventured for the service
of mankind.
The last few years witnessed the widespread adoption of portable equipment, from
cellular phones to multimedia portable terminals. However, this mobile equipment
is constrained in computational capability due to battery and size limitations
[1]. Over the last 30 years battery capacity has increased by a factor of 2 to 4,
while the computational power of digital ICs has increased by more than 4 orders
of magnitude [2]. The energy density of the NiCd batteries used in portable
terminals is 20 Watt-Hour/Pound [3]. Battery capacity isn't expected to increase
dramatically over the next few years. New battery technology such as
Nickel-Metal-Hydride is expected to have a capacity of no more than 30-35
Watt-Hour/Pound.
With the increase in market demand for new capabilities and functionality in
mobile equipment, new approaches are required to reduce the power dissipation
and hence prevent the battery size from growing in tandem with computational
complexity.
Until recently power consumption was not a high priority issue in the design
of VLSI systems. Performance (speed) and cost (area) were the two metrics that
governed the design of VLSI systems [4]. However, the need for longer battery life
in future portable terminals has added power dissipation to the metrics that should
be considered when designing a VLSI system.
Mobile applications are not the only factor driving the need for lower power
dissipation. Power dissipation in cutting-edge immobile equipment has reached a
limit where any further increase will lead to a significant increase in the cost
of packaging and the cooling system. The addition of a heat sink could increase
the component cost by $5-$10 [5]. In addition, large power dissipation leads to
lower component reliability. Every 10°C increase in temperature doubles the
component failure rate [2].
Finally, there are the economic and environmental advantages of reducing
the power dissipation. A 1993 study [6] showed that the 60 million personal
computers in the USA consumed $2 billion worth of electricity per year, and that
they indirectly produced as much CO2 as 5 million cars. In 1993 personal computers
accounted for 5% of the commercial electricity demand; this is expected to increase
to 10% by the year 2000.
Low-power design is finding its way into numerous applications, from portable
communication products such as cellular phones, cordless phones and pagers, to
portable consumer products such as camcorders and portable CD players, to laptop
and notebook computers, to sub-GHz processors for high performance workstations.
Design for low power is essential in all these applications.
Reducing the power dissipation can be done at the various levels of the design
process, starting at the algorithmic and architectural levels and going down to the
circuit and device levels [5]. The power minimization problem at each one of these
levels has different characteristics and meets different challenges. At the higher
design levels, the designer faces alternative choices with little information about
the design parameters of the lower layers. At the lower design levels, the number
of parameters is limited, making the low-power design problem easier.
However, the low-level low-power design techniques require greater investment
and a longer time to implement than the high-level low-power design techniques.
Consider, for example, process scaling as a technique to reduce power
dissipation. This requires a greater investment and a longer time to implement
than changing the algorithm as a means of reducing the power dissipation.
Despite their great potential for reducing the power dissipation, high-level
techniques are the least investigated techniques. Selecting the suitable algorithm
and mapping it to the appropriate architecture can have a great influence on the
minimization of power dissipation [7] [8]. Eliminating redundant and irrelevant
computations has a substantial effect on the reduction of the power dissipation.
Future portable terminals are required to handle multimedia information -
speech, video and data [8]. Because of the limited bandwidth allocated to mobile
systems, compression/decompression of information is required in mobile terminals.
Compression algorithms, and in particular video compression algorithms, demand
large computation capability [9], which in turn leads to high power dissipation.
The desire to have multimedia portable equipment has motivated work towards
low-power implementations of video compression algorithms.
Increased public demand for higher performance, better quality of service, and
system interoperability has motivated the idea of software radio [10]. In software
radio, the digitization of the received/transmitted radio signal is performed as
electrically close to the antenna as possible. The signal processing is done
digitally after that, using general purpose programmable hardware.
Software radios require wideband (high-speed) high-resolution analog-to-digital
converters [11]. They also require high DSP horsepower (up to 10 GFLOPS) [12].
The desire to have a portable software radio has motivated work towards lowering
the power of wideband high-resolution A/D converters. It has also motivated work
towards the development of DSP algorithms with lower computational complexity
to be used in the software radio.
The objective of this thesis is to investigate, develop, design and implement
low-power techniques for portable wireless terminals at the architectural and the
algorithmic levels. The low-power techniques are applied to video compression
algorithms used in multimedia portable terminals. Low-power algorithms are also
developed to lower the power dissipation in software radios.
1.1 Thesis Contributions 
1. Analysis of high-level low-power design tradeoffs. Three examples of such
analysis are given in sections 3.6.2, 3.8 and 3.9.1 of chapter 3. These are:
the use of carry-save adders in FIR filters, the use of the Gray code number
system, and the use of higher-order radix in the division algorithm.
2. A new division algorithm that minimizes the number of addition/subtraction
operations required to generate the quotient. This algorithm is presented in
section 3.9.2 of chapter 3.
3. A new low-power subband coding image compression algorithm developed in
chapter 4. The filtering structure for the proposed subband coding algorithm
requires only addition/subtraction operations; this significantly reduces the
power dissipation. A novel vector quantization coding algorithm having a
simplified decoding architecture has been developed in chapter 4.
4. A novel bandpass Sigma-Delta modulator, along with its switched-capacitor
implementation, presented in chapter 5. Parallelism by 4x of analog signal
processors is applied to the design of the bandpass Sigma-Delta modulator.
This increases the speed of the modulator without increasing the speed
requirement of the individual building blocks. A switched-capacitor circuit with
a minimum number of operational amplifiers is also given for the proposed
modulator architecture.
5. The design of a decimation filter incorporating several low-power design
techniques such as operation minimization, multiplier elimination, and block
deactivation. The decimation filter is resolution-programmable, allowing the
deactivation of the blocks corresponding to the least significant bits when a
lower resolution is sufficient. The design of the decimation filter is given in
chapter 5.
6. The design of a resolution-programmable multiplier-accumulator (MAC) array.
The interleaving of the adder in the multiplier array reduces the power
dissipation. The resolution of the MAC array is programmable, allowing the
deactivation of the blocks corresponding to the least significant bits when
a lower resolution is sufficient. The design of the MAC array is given in
chapter 6.
7. A novel digital channel selection algorithm with no pre-filter multiplier. The
channel selection algorithm uses lowpass, highpass and bandpass filters. The
basic filter is the lowpass filter. Other filters are implemented using the
lowpass filter and simple logic gates such as multiplexers and XOR gates.
The channel selection is done in stages. The elimination of the pre-filter
multiplier reduces the power dissipation. The design of the digital channel
selection algorithm is given in chapter 7.
1.2 Thesis Outline 
The thesis consists of eight chapters and two appendices. Chapters 2 and 3 are a 
survey of wireless architectures and standards, and low-power design techniques. 
Chapters 4 - 7 present the main contributions of this thesis for the high-level 
low-power design of multimedia wireless terminals. A person interested in power- 
efficient design of multimedia terminals can proceed directly to chapter 4. 
After the introduction, which provides the motivation and a brief description
of the thesis, chapter 2 deals with wireless communication systems. It discusses
the different standards for voice and data wireless communications. The transceiver
architecture is reviewed in this chapter. The emerging software radio architecture
is also presented in this chapter.
In chapter 3, low-power design techniques are explored. In this chapter, the
sources of power dissipation are investigated, and power estimation methods are
considered. Low-power design has recently captured the attention of many
researchers. A survey of low-power techniques employed in the design of portable
equipment is presented in this chapter. Also in this chapter, the application of
low-power techniques to the design of certain algorithms and architectures is
investigated.
In chapter 4, a new low-power subband coding image compression algorithm is
presented. Subband coding is a technique in which the video signal is divided into
subbands and each subband is allocated a number of bits according to the
information it carries and its spectral importance. Tradeoffs between the
computational complexity (power dissipation) and the signal-to-noise ratio (SNR)
performance of the subband coding algorithm are considered, and an algorithm with
low computational complexity is presented. Finally, the performance of this
algorithm is evaluated.
In chapter 5, a high-speed high-resolution A/D converter is presented. In the
first part of this chapter, a novel bandpass Sigma-Delta modulator architecture
is developed. In this architecture, parallelism by 4x of analog signal processors
is applied to the design of the bandpass Sigma-Delta modulator. The
switched-capacitor implementation of the proposed architecture is also presented
in this chapter. In the second part of the chapter, several low-power design
techniques are applied to the design of the decimation filter. These techniques
include operation minimization, multiplier elimination and block deactivation.
The design of the decimation filter in a 0.5 µm, 3.3 Volt CMOS technology is also
presented in this chapter.
In chapter 6, the design of a new resolution-programmable multiplier-accumulator
(MAC) array is presented. The multiplier of the MAC array is based on the
modified Booth algorithm. The accumulator's input and output are in the sum-carry
representation. The effect of interleaving the adder in the multiplier array on
reducing the power dissipation is discussed in this chapter. To further reduce the
power dissipation a block deactivation architecture is developed, where the cells
corresponding to the least significant bits are deactivated when a smaller
resolution is sufficient. The design of the MAC array in a 0.5 µm, 3.3 Volt CMOS
technology is also presented in this chapter.
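As a rough illustration of the modified Booth algorithm mentioned above, the
following Python sketch recodes a two's complement multiplier into radix-4
digits in {-2, -1, 0, 1, 2}, so that an 8-bit multiplier needs only four partial
products. This is the textbook recoding, not the MAC array design of chapter 6;
the function names and the 8-bit default width are assumptions made for
illustration.

```python
def booth_radix4_digits(multiplier, bits=8):
    """Recode a two's complement multiplier into radix-4 Booth digits.

    Returns digits d[0], d[1], ... (least significant first), each in
    {-2, -1, 0, 1, 2}, such that sum(d[i] * 4**i) == multiplier.
    `bits` (an assumed parameter) must be even and wide enough.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even")
    if not -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1)):
        raise ValueError("multiplier does not fit in the given width")
    pattern = multiplier & ((1 << bits) - 1)  # two's complement bit pattern

    def bit(i):
        # The bit to the right of the LSB, b[-1], is defined as 0.
        return 0 if i < 0 else (pattern >> i) & 1

    # Each digit looks at three overlapping bits: b[2i+1], b[2i], b[2i-1].
    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range(bits // 2)]


def booth_multiply(multiplicand, multiplier, bits=8):
    """Multiply by summing one shifted partial product per Booth digit."""
    digits = booth_radix4_digits(multiplier, bits)
    return sum(d * multiplicand * 4 ** i for i, d in enumerate(digits))


print(booth_radix4_digits(6))   # digits for +6, least significant first
print(booth_multiply(13, -7))   # product formed from four partial products
```

Each digit only requires selecting 0, ±1 or ±2 times the multiplicand, which in
hardware amounts to a shift and an optional negation.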
In chapter 7, the design of a novel power-efficient digital channel selection
algorithm is presented. In software radio, a block of channels is digitized and
the channel selection is performed in the digital domain. Conventionally, channel
selection is done by a multiplier followed by a lowpass filter. The multiplier
operates at a high sampling rate and hence dissipates a large amount of power. A
novel digital channel selection algorithm is developed that eliminates the
pre-filter multiplier; this can reduce power dissipation by up to an order of
magnitude.
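The conventional scheme described above (a multiplier followed by a lowpass
filter) can be sketched in Python as follows. This is only a minimal model of the
conventional approach, not the algorithm proposed in chapter 7; the function
name, the 64 kHz sampling rate, the tone frequencies and the 8-tap moving-average
filter are all assumptions chosen for illustration.

```python
import cmath
import math


def select_channel(samples, channel_freq, sample_rate, taps):
    """Conventional digital channel selection: mix to baseband, then lowpass.

    Every input sample costs one complex multiplication in the mixer,
    so the pre-filter multiplier runs at the full input sampling rate.
    """
    # Pre-filter multiplier: shift the wanted channel down to 0 Hz.
    mixed = [x * cmath.exp(-2j * math.pi * channel_freq * n / sample_rate)
             for n, x in enumerate(samples)]
    # Direct-form FIR lowpass filter to reject the other channels.
    out = []
    for n in range(len(mixed)):
        acc = 0j
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * mixed[n - k]
        out.append(acc)
    return out


# Assumed test signal: a wanted tone at 8 kHz plus an unwanted tone at 24 kHz.
FS = 64000
signal = [math.cos(2 * math.pi * 8000 * n / FS)
          + math.cos(2 * math.pi * 24000 * n / FS) for n in range(64)]
selected = select_channel(signal, 8000, FS, [1 / 8] * 8)
print(abs(selected[-1]))  # magnitude of the selected channel after settling
```

With these assumed values the moving-average filter has nulls at multiples of
8 kHz, so once the filter has filled, only the wanted tone (now at 0 Hz, with
amplitude 0.5) remains.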
Chapter 8 contains the summary of the research, along with the major
contributions of this dissertation. Also contained in this chapter are the
conclusions reached after conducting this research. Finally, future directions in
research for power minimization at the algorithmic and architectural levels for
future portable wireless terminals are discussed.
Appendix A presents the SPW™ simulation model of the bandpass Sigma-Delta
modulator. Appendix B presents an analysis for the Sinc decimator.
Chapter 2 
Wireless Communication Systems
The last decade witnessed an explosion in the development and the
commercialization of wireless communication products, such as cellular phones,
cordless phones, pagers, wireless LAN and WAN terminals, etc. This development was
fueled by the acceptance of the wireless communication standards, the advancement
in wireless circuit design techniques and high-speed monolithic IC technology, as
well as the development of new wireless system architectures.
New services and features are now being envisioned for future mobile commu- 
nication systems. These systems will allow users to have access to information 
databases and to communicate in any form of media - voice, video, images or 
data - at any time and in any place [1] [13]. Increased public demand for better
performance, higher quality of service and lower costs has led to the development of 
new wireless communication techniques and new wireless transceiver architectures. 
In the last few years, there has been a shift from analog to digital communication
techniques. This shift led to enhanced performance and lower cost. Table 2.1
compares the analog communication techniques to the digital ones. In the future,
this digitization trend is expected to continue moving into the front-end of the
transceiver, eventually leading to a true software radio.

Table 2.1: Comparison of analog and digital communication techniques.

                         Analog Communication    Digital Communication
                         Techniques              Techniques
                         Discrete/Hybrid         Monolithic
                         FM                      QPSK, GMSK ...
                         FDMA                    TDMA, CDMA
(Spectral efficiency)    Low                     High
                         Low                     High

There exist numerous standards for wireless communication systems throughout
the world. These standards regulate the spectrum usage and define the key
parameters of the different wireless systems, such as the cellular, cordless and
wireless data systems. In section 2.1, we review the standards used in analog and
digital cellular systems, as well as wireless data systems.
Traditionally, the superheterodyne principle has been used in the design of
wireless receivers since its discovery by Armstrong in the 20's [14] [15] [16].
With consumers demanding more functionality and enhanced performance and with the
advancement of the IC technology, interest has been growing in direct conversion
receivers [17], as well as software radios [10] [18]. These architectures are
examined in sections 2.2 and 2.3.
Table 2.2: Major analog cellular standards. 
Standard Down Link Channel 














2.1 Land Mobile Wireless Systems Standards 
Cellular system design was pioneered by Bell Laboratories in the 70's [19]. The
first generation of cellular systems used analog frequency modulation. Frequency
Division Multiple Access was used to divide the spectrum between the different
users. Table 2.2 gives the salient features of the major analog cellular standards
[19].
The desire for larger system capacity and better performance, coupled with the
advancement of digital integrated circuit design and low bit rate speech coding
algorithms, led to the emergence of the second generation cellular systems, which
use digital modulation techniques [20]. Digital cellular systems are more efficient
in their spectral usage than analog cellular systems. There exist different digital
cellular standards; Table 2.3 gives the salient features of these standards [21].

Table 2.3: Digital cellular standards.
Standards listed: GSM, IS-136 (DAMPS), PDC.
Frequency bands (MHz): 824 - 849; 890 - 915; 824 - 849; 940 - 956*.
Other rows: Number of channels; Multiple access; Modulation (π/4 DQPSK);
Channel bit rate (42 Kb/s).
* More spectrum is allocated around 1.5 GHz.

Each standard defines the control signals, and the minimum performance
requirements for the mobile terminal and the base station. For the DAMPS standard,
the mobile terminal is required to satisfy the following minimum requirements [22]:
Adjacent channel selectivity:
Assigned channel = -107 dBm
Adjacent channel = -94 dBm
Error rate = 3%
Alternate channel selectivity:
Assigned channel = -107 dBm
Second adjacent channel = -65 dBm
Error rate = 3%
Intermodulation Spurious Response Attenuation:
Assigned channel = -107 dBm
Unmodulated RF @ ± 120 KHz = -45 dBm, OR
Modulated RF @ ± 240 KHz = -45 dBm
Error rate = 3%
Spurious Response Interference:
Assigned Channel = -107 dBm
Undesired RF = -52 dBm
Undesired RF modulated in cellular band, unmodulated elsewhere.
Error rate = 3%
Except within 90 KHz of the assigned channel.
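To make the levels listed above easier to interpret, the short Python sketch
below converts dBm to watts and computes how far each interferer sits above the
wanted -107 dBm signal. The helper names and the dictionary layout are
assumptions for illustration; the numbers are the ones quoted in the list.

```python
def dbm_to_watts(level_dbm):
    """Convert a power level in dBm to watts (0 dBm = 1 mW)."""
    return 10 ** (level_dbm / 10) / 1000


def interferer_margin_db(interferer_dbm, wanted_dbm):
    """How many dB the interferer is above the wanted signal."""
    return interferer_dbm - wanted_dbm


# (interferer level, wanted signal level) in dBm, taken from the list above.
requirements = {
    "adjacent channel": (-94, -107),
    "alternate channel": (-65, -107),
    "intermodulation": (-45, -107),
    "spurious response": (-52, -107),
}

margins = {name: interferer_margin_db(interferer, wanted)
           for name, (interferer, wanted) in requirements.items()}

for name, margin in margins.items():
    print(f"{name}: interferer is {margin} dB above the wanted signal")
print(f"wanted signal power: {dbm_to_watts(-107):.3e} W")
```

The receiver therefore has to hold a 3% error rate while an interferer is
between 13 dB (adjacent channel) and 62 dB (intermodulation test) stronger than
the wanted signal.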
In addition to the cellular standards for wireless voice networks, there exist
standards for wireless data networks [19] [23]. Wireless data networks are
classified into wide-area mobile data networks and wireless local area networks
(WLAN). Wide area mobile data networks are low-speed networks. Several standards
exist for wide area mobile data networks, such as Advanced Radio Data Information
Service (ARDIS), Mobitex, and Cellular Digital Packet Data (CDPD). The salient
features of these standards are given in Table 2.4.
WLANs are high-speed networks but they have a limited coverage area. WLANs
use spread spectrum in the unlicensed ISM bands [24]. WLAN standards include
IEEE 802.11 in North America and HIPERLAN in Europe.
Table 2.4
Frequency bands (MHz): 806 - 824; 896 - 901; 824 - 849.
Other rows: Number of channels; Modulation; Channel bit rate.

2.2 Wireless Transceiver Architectures 
A wireless transceiver consists of a transmitter and a receiver. The duplexer is used 
for directing each of the transmit/receive signals to its intended path. The incoming 
RF signal from the antenna is directed to the receiver, while the transmitted signal 
is directed from the transmitter to the antenna with no coupling to the receiver.
The transmitter converts the baseband signal to the RF carrier frequency. It can
do so using a single-stage quadrature modulator [25]. This is shown in Figure 2.1.
A filter is needed after the power amplifier. This filter removes any out of band
frequency components due to the nonlinearity of the power amplifier or spurious
frequencies from the oscillator. A transmitter can also have multi-stage mixing.
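The single-stage quadrature modulator mentioned above can be modelled, in a
simplified discrete-time form, as the in-phase samples multiplied by a cosine
carrier minus the quadrature samples multiplied by a sine carrier. The Python
sketch below assumes this common sign convention; the function name and the
parameter names are hypothetical, and the power amplifier and output filter are
not modelled.

```python
import math


def quadrature_modulate(i_samples, q_samples, carrier_freq, sample_rate):
    """Up-convert baseband I/Q samples in a single mixing stage.

    Computes s[n] = I[n] * cos(w n) - Q[n] * sin(w n), where
    w = 2 * pi * carrier_freq / sample_rate (assumed sign convention).
    """
    out = []
    for n, (i, q) in enumerate(zip(i_samples, q_samples)):
        phase = 2 * math.pi * carrier_freq * n / sample_rate
        out.append(i * math.cos(phase) - q * math.sin(phase))
    return out


# A constant baseband symbol (I, Q) = (cos(theta), sin(theta)) produces a
# carrier whose phase is advanced by theta.
theta = math.pi / 4
rf = quadrature_modulate([math.cos(theta)] * 8, [math.sin(theta)] * 8,
                         carrier_freq=1.0, sample_rate=8.0)
print(rf)
```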
Figure 2.2 shows the block diagram of the conventional wireless receiver. This
receiver is a two-stage superheterodyne receiver [26] [27]. Typically the first IF
stage down converts the frequency by an order of magnitude. The first IF stage
is used for image rejection. The second IF stage is used for channel selectivity.
The first IF stage frequency is about 90 MHz. The second IF stage frequency is
455 KHz [17].

Figure 2.2: A two-stage superheterodyne receiver.

In the superheterodyne receiver, the IF frequency is fixed. The design of the
channel selection filter is easier in the IF band than in the RF band [17], because
the center frequency of the IF filter is fixed and the relative filter bandwidth with
respect to the center frequency is larger in the IF band than in the RF band.
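A quick calculation illustrates the relative-bandwidth argument. The Python
sketch below compares the ratio of center frequency to channel bandwidth (a
rough measure of the filter Q needed) at RF and at the two IF frequencies quoted
above. The 30 kHz channel bandwidth and the 836.5 MHz RF center (the middle of
the 824 - 849 MHz band) are assumptions chosen for illustration.

```python
def required_q(center_freq_hz, channel_bw_hz):
    """Ratio of center frequency to bandwidth, a rough filter Q figure."""
    return center_freq_hz / channel_bw_hz


CHANNEL_BW = 30e3  # assumed channel bandwidth in Hz

stages = [
    ("RF (836.5 MHz, assumed)", 836.5e6),
    ("first IF (90 MHz)", 90e6),
    ("second IF (455 kHz)", 455e3),
]

for label, center in stages:
    q = required_q(center, CHANNEL_BW)
    print(f"{label}: relative bandwidth {1 / q:.2e}, required Q about {q:.0f}")
```

Under these assumptions the Q needed falls from tens of thousands at RF to
about 15 at the 455 kHz IF, which is why channel selection is left to the
second IF stage.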
The complexity of having a two-stage IF superheterodyne receiver can be
eliminated by using the direct conversion receiver [28] - [34]. The direct
conversion receiver is also referred to as the homodyne receiver when the local
oscillator is synchronized in phase with the incoming RF signal carrier [17]. The
homodyne receiver converts the RF signal to baseband directly, eliminating the
need for intermediate IF stages. Figure 2.3 shows the block diagram of the direct
conversion (homodyne) receiver. Direct conversion receivers enable the highest
level of integration and
require the least amount of tuning [XI. 
Figure 2.3: Direct conversion (homodyne) receiver.
An alternative direct conversion receiver architecture [36] uses an external
bandpass filter with a 90° phase splitter. The same VCO signal is fed to both
mixers in this case.
Despite their simplicity, direct conversion receivers have several drawbacks [17]
[34] compared to the superheterodyne receiver. These include the possibility of
mismatch between the I and Q paths, carrier leakage and DC feed-through,
sensitivity to the 1/f noise, limited dynamic range, and finally, back radiation
of the receiver's local oscillator signal.
Having different standards creates a need for a multistandard transceiver
architecture. In [37], the authors consider using a zero-IF receiver and an image
rejection receiver to achieve multistandard operation for DECT, GSM/DCS1800
and INMARSAT M. Another way to achieve a multistandard receiver is to use
software radio. Software radio is introduced in the next section.
2.3 Software Radio 
In software radio [10] [18], the received/transmitted radio signal is digitized as
electrically close to the antenna as possible. The signal processing is done digitally
after that, using general purpose programmable hardware. Performing the radio, IF
and baseband functions in programmable digital hardware increases the flexibility
of the transceiver. Although software radios use digital techniques, digital radios
are generally not software radios. The key difference is the total programmability
of software radios, including programmable RF bands, multiple access modes, and
modulation schemes.
Software radio was conceived in the 70's. However, technology limitations
prevented it from being implemented. The first operational digital high frequency
communications system was built in 1980 [38]. It was used by the military. That
system occupied many racks, dissipated a large amount of power, and only
had a bandwidth, for simultaneous coverage, of 750 KHz, with a dynamic range of
60 dB.
Today the US military is in phase II of developing its software radio - Speakeasy
[18] [39]. Speakeasy is a programmable multi-band multi-mode radio (MBMMR)
that operates in the HF to the UHF bands, from 2 MHz to 2 GHz. Speakeasy
emulates 15 existing military radios. It supports 9 modulation schemes, and 4
digital audio coding algorithms. It also supports multiple internetworking
protocols, multiple interfaces, multiple forward error correction codes and
multiple information security (INFOSEC) algorithms.
For civil applications, software radio is used in cutting-edge base stations.
The design of portable terminals is a compromise between low-power and
high-performance; this involves a tradeoff between analog ICs, low-power ASICs, DSP
cores and embedded microprocessors [10]. However, as low-power techniques and
design methodologies emerge, digital signal processing will gradually replace analog
signal processing in the wireless portable terminal.
There are numerous advantages to increasing the portion of the radio that is
implemented digitally. These include relaxing the analog component requirements.
Digital implementations tend to be compact and inexpensive for large volume
production. One of the most important advantages is the ability to program digital
structures to meet the communication needs of different networks using a single
hardware platform.
The access, modulation and coding schemes used in a software radio are
programmable, making it possible to reprogram the transceiver if any of these
schemes change. Channel selection, propagation channel characterization, antenna
steering and power level adjustment are all done under software control [10]. In
the transmit mode, the software radio characterizes the available channels, steers
the transmit beam in the right direction, selects the appropriate power level and
then transmits the signal. In the receive mode, the software radio analyzes the
received spectrum, in frequency, time and space. It identifies the interferers and
nulls them. It estimates the multi-path propagation channel model and adaptively
equalizes the received signal. The signal is then demodulated and decoded.
Software radio is characterized by its modular, open architecture allowing
constant upgrades as the technology advances. The software radio architecture, as
shown in Figure 2.4, consists of three subsystems. The real-time channel processing
subsystem is where all the radio functions are performed. This subsystem
must have isochronous performance, which means that the input samples must
be processed during a limited time duration. The environment management
subsystem constantly characterizes the radio environment. This information is used
by the channel processing subsystem for better transmission and reception. The
environment management subsystem has near real-time operation. The software
tools subsystem provides incremental service enhancements. This subsystem allows
defining, prototyping, testing and delivering these service enhancements.
Figure 2.4: The software radio architecture.

To implement software radio the entire spectrum of a particular standard should
be digitized. This is 25 MHz for GSM, IS-136 and IS-95, as given in Table 2.3. To
digitize a 25 MHz bandpass signal, bandpass sampling is used [11]. To satisfy
the Nyquist sampling criterion, the sampling frequency should be at least twice the
bandwidth. Assuming the sampling frequency is 2.5 times the bandwidth, a
sampling frequency of 62.5 MSa/s is required. To meet the requirements of the
different wireless standards the A/D converter is required to have over 20 bits
resolution (this will be shown later in Chapter 5). This is the first bottleneck
facing software radio, i.e. high-speed high-resolution analog-to-digital conversion.
In chapter 5, the design of an A/D converter that has a sampling frequency of 1.25
MHz and a programmable resolution up to 20 bits is examined.
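The sampling-rate figure quoted above follows from a one-line calculation,
sketched here in Python. The function name is hypothetical, and the sketch only
checks the bandwidth criterion; it ignores the additional constraints that
bandpass sampling places on where the band edges fall relative to the sampling
frequency. The raw bit rate at the end is a derived illustration assuming 20
bits per sample.

```python
def bandpass_sampling_rate(bandwidth_hz, oversampling_factor=2.5):
    """Sampling rate for a bandpass signal of the given bandwidth.

    The factor must be at least 2 to satisfy the Nyquist criterion
    with respect to the signal bandwidth.
    """
    if oversampling_factor < 2:
        raise ValueError("sampling rate must be at least twice the bandwidth")
    return oversampling_factor * bandwidth_hz


bandwidth = 25e6                      # 25 MHz block of channels
fs = bandpass_sampling_rate(bandwidth)
raw_bit_rate = fs * 20                # assuming 20 bits per sample

print(f"sampling rate: {fs / 1e6} MSa/s")
print(f"raw A/D output rate: {raw_bit_rate / 1e9} Gb/s")
```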
In addition to the high-speed, high-resolution A/D converter, software radio
also requires high DSP horsepower. Typically software radio requires up to 10
GFLOPS [12]. Such high processing power is beyond the capabilities of today's
DSPs. This is the second bottleneck facing software radio. In chapter 7, a digital
channel selection algorithm that can be employed in software radios to reduce the
computational complexity required for digital channel selection is presented.
2.4 Chapter Summary 
In this chapter, the salient features of analog and digital cellular standards, as
well as the wireless data standards, were presented. The superheterodyne principle
has been commonly used in conventional transceivers. The advancement of the IC
technology and the demand for enhanced performance have led to the emergence of
new architectures, such as the homodyne architecture and the software radio.
Software radio, which allows more functionality and programmability, is
currently being used for military applications and in cutting-edge base stations.
For the mobile terminals, new low-power techniques need to be developed to make
software radio a power-efficient architecture that can compete with the
superheterodyne and homodyne radio architectures.
Chapter 3 
Low-Power Design Techniques 
3.1 Introduction 
In this chapter, the techniques used to lower the power dissipation at the architec- 
tural and algorithmic design levels are investigated. These are the higher design 
levels, as opposed to the device and circuit levels, the lower design levels. 
The organization of this chapter is as follows. Section 3.2 discusses the sources
of power dissipation in CMOS circuits and the parameters they depend on. In
section 3.3, the model used in the estimation of the power dissipation is presented.
In section 3.4, low-power examples for wireless portable systems, digital signal
processors, video compression algorithms and microprocessors are presented.
Low-power techniques, used at the device and circuit levels, are presented in
section 3.5. The quadratic dependency of the power dissipation on the voltage makes
voltage reduction an effective way to reduce the power dissipation. This is
examined in section 3.6. However, reducing the voltage leads to longer delays.
Techniques used to maintain a constant throughput with voltage scaling are also
considered in section 3.6. Section 3.7 demonstrates the effect of pipelining and
parallelism, by applying these techniques to three different architectures of the
discrete cosine transform.
Reducing the switching activity is another degree of freedom in reducing the
power dissipation. In section 3.8, the effect of the Gray code number system on
reducing the switching activity is considered. In section 3.9, the effect of reducing
the number of block iterations on the power dissipation of the division algorithm
is considered. Two examples demonstrate this. First, a higher order radix is used.
Second, a division algorithm is developed, requiring a minimum number of add/sub
operations. In section 3.10, reducing the computational complexity of the vector
quantization algorithm is considered.
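As a small illustration of why the Gray code can lower switching activity, the
Python sketch below counts the bit transitions in a 4-bit counting sequence
coded in plain binary and in Gray code. The helper names and the 4-bit width are
assumptions; the comparison applies to sequential (highly correlated) data such
as counters or addresses, and the advantage need not hold for uncorrelated data.

```python
def to_gray(n):
    """Convert a non-negative integer to its reflected Gray code."""
    return n ^ (n >> 1)


def count_transitions(sequence):
    """Total number of bit flips between consecutive words of a sequence."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


BITS = 4  # assumed word length
binary_sequence = list(range(2 ** BITS))
gray_sequence = [to_gray(n) for n in binary_sequence]

binary_flips = count_transitions(binary_sequence)
gray_flips = count_transitions(gray_sequence)

print(f"binary counter: {binary_flips} bit transitions")
print(f"Gray counter:   {gray_flips} bit transitions")
```

Consecutive Gray code words differ in exactly one bit, so the counting sequence
produces one transition per step, whereas the binary counter flips several bits
whenever a carry ripples.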
3.2 Sources of Power Dissipation 
The power dissipated in an electronic system depends on the implementation tech- 
nology and the circuit style used. Current mode BJT and NMOS have DC (static) 
power dissipation, while CMOS has almost no DC power dissipation, making its
power dissipation lower than that of the two former technologies. The CMOS style
is the most commonly used style for the implementation of VLSI systems.
In CMOS circuits, there are three sources of power dissipation [40]: 
1. Switching power dissipation. 
2. Short-circuit-current power dissipation. 
3. Leakage-current power dissipation.
The most dominant of these is the switching power dissipation, which is given
by:
$$P_{sw} = \alpha_{0\to1} \, C_L \, V_{DD}^2 \, f_{clk}$$
The switching power dissipation, as seen from the previous equation, depends
on four parameters: the switching activity factor $\alpha_{0\to1}$, the load
capacitance $C_L$, the supply voltage $V_{DD}$ and the clock frequency $f_{clk}$. Of
these, the supply voltage has the greatest effect on the switching power dissipation
because of the quadratic dependence. In a well designed CMOS circuit, the switching
power dissipation accounts for 90% of the power dissipation.
The switching activity factor $\alpha_{0\to1}$, the probability of a zero-to-one
transition, depends on:
1. Logic function. For example, a NAND gate with equi-probable and independent
inputs has
$$\alpha_{0\to1} = P(0)\,P(1) = \frac{1}{4} \cdot \frac{3}{4} = \frac{3}{16}$$
while an XOR gate, with equi-probable and independent inputs, has
$$\alpha_{0\to1} = P(0)\,P(1) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$$
2. Logic style. Dynamic logic has higher switching activity than static logic,
because the output is precharged at the end of each cycle. However, dynamic
logic is glitch free. The logic style also influences the capacitances.
3. Signal statistics. The higher the correlation between successive samples,
the lower the switching activity.
4. Circuit topology, e.g. chain structure versus tree structure. A chain structure
has lower switching activity, but higher glitching power.
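The dependence of the switching activity on the logic function can be checked by enumerating a truth table. The following is a minimal Python sketch, assuming a static gate with independent, equi-probable inputs, so that the zero-to-one transition probability is the probability of a 0 output in one cycle times the probability of a 1 output in the next. The names `activity_factor`, `nand2` and `xor2` are illustrative only.

```python
from fractions import Fraction
from itertools import product


def activity_factor(gate, n_inputs):
    """Return alpha_0->1 = P(out = 0) * P(out = 1) for a static gate.

    Assumes independent, equi-probable inputs, so every truth-table row
    is equally likely.
    """
    rows = list(product((0, 1), repeat=n_inputs))
    ones = sum(gate(*row) for row in rows)   # rows where the output is 1
    p1 = Fraction(ones, len(rows))
    return (1 - p1) * p1


def nand2(a, b):
    return 1 - (a & b)


def xor2(a, b):
    return a ^ b


print(activity_factor(nand2, 2))  # 3/16
print(activity_factor(xor2, 2))   # 1/4
```

Under these assumptions the XOR gate switches more often than the NAND gate, because its output is 1 for half of the input combinations.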
The short-circuit-current power dissipation, unlike the switching power 
dissipation, depends on the rise and fall times of the input signal. To minimize the 
effect of the short-circuit power dissipation it is desirable to have equal input and
output edge times [42]. In this case, the power dissipation is less than 10% of the 
total dynamic power dissipation. 
The leakage-current power dissipation is due to:
1. Reverse-bias diode leakage current. This is in the order of 25 µA for a 1 million
transistor chip. Hence, it represents a negligible component of the power
dissipation [41].
2. Subthreshold current. Associated with this is the subthreshold slope $S_{th}$,
which is the voltage required to reduce the subthreshold current by an order
of magnitude [43]. The absolute minimum of $S_{th}$ is about 60 mV/decade of
current at room temperature. This can be achieved using Silicon-On-Insulator
(SOI) technology [44]. Lowering the threshold voltage increases this component.
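The definition of the subthreshold slope gives a quick first-order estimate of how the subthreshold current grows when the threshold voltage is lowered. The sketch below assumes a purely exponential subthreshold characteristic, one decade of current per slope's worth of voltage. The function name and the default slope value are illustrative choices.

```python
def leakage_ratio(delta_vt_mv, slope_mv_per_decade=60.0):
    """Factor by which the subthreshold current grows when the threshold
    voltage is lowered by delta_vt_mv (first-order exponential model)."""
    return 10.0 ** (delta_vt_mv / slope_mv_per_decade)


# Lowering the threshold voltage by one slope's worth of voltage costs one
# decade of leakage; two slopes' worth costs two decades.
for delta in (60.0, 120.0):
    print(delta, leakage_ratio(delta))
```

This is why a lower threshold voltage, although attractive for speed at low supply voltage, has to be traded against standby current.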
Of these three power dissipation components, the switching power dissipation 
component is the most dominant [41]. Hence, this is the component we usually 
seek to minimize, especially at the architectural and algorithmic levels.
3.3 Estimating the Power Dissipation 
Power estimation can be a complex task. Not only does it require knowledge about 
the technological parameters of the system under consideration such as the oper- 
ating voltage, the physical capacitance, the circuit style, etc., but it also requires 
detailed knowledge of the signal statistics such as the data activity and the signal
correlations.
Figure 3.1: A simplified system consisting of two building blocks and the
interconnection busses.
The aim of power estimation is to find the average power dissipated in a system
based on a certain model. Power estimation becomes more inaccurate as the degree 
of model abstraction increases. Hence, the most accurate power estimators are the 
circuit simulators [45]. However, circuit simulators are slow and require complete 
and specific information about the inputs [46]. 
Gate-level probabilistic techniques have been proposed, ranging from simple 
techniques [47] which assume a zero-delay gate model and thus don't calculate the 
glitching power which can be as high as 70% [48], to more elaborate techniques [49] 
[50] that not only consider the effect of glitching, but also take into account
the effect of temporal and spatial correlations [46]. 
The research done on power estimation at the higher abstraction levels is still 
limited [51] [52]. At the architecture level, the system is described in terms of 
interconnected operators (adders, multipliers, etc.) and memory blocks (registers, 
ROMs, etc.). These building blocks, as they will be called from now on, are inter-
connected by busses. Figure 3.1 shows a simplified architecture consisting of two
building blocks, one interconnecting bus (bus b), one input bus (bus a), and one 
output bus (bus c). 
The total power dissipated in such an architecture is the sum of the power 
dissipated in the building blocks and the power dissipated in the busses. The 
power dissipated by a building block depends on: 
1. Block activity factor ($\beta$) (number of executions per second).
2. Output signal activity factor ($\alpha$).
3. Normalized block energy $E_n$. The normalized energy is the energy dissipated
by the building block per execution when the signal switching activity factor
is one.
The total power dissipated by the building blocks is given by:
$$P_{blocks} = \sum_{r \in R} \beta_r \, \alpha_r \, E_{n,r}$$
where R is the set of all building blocks. The power dissipated by a bus depends
on:
1. Signal activity factor ($\alpha$).
2. Length of bus ($l$).
3. Capacitance per unit length ($C$).
The total power dissipated by the busses is given by:
$$P_{busses} = K \sum_{n \in N} \alpha_n \, l_n \, C_n$$
where N is the set of all busses and K is a constant that depends on the
operating voltage. When determining $l_n$, the dimensions of the building blocks
should be taken into consideration.
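The architecture-level model described above can be sketched as a short Python routine. This is an illustration under one assumption: that each block contributes the product of its three listed factors, and each bus the product of its three listed factors scaled by the constant K. The class names, field names and numeric values are hypothetical and are not taken from any measured design.

```python
from dataclasses import dataclass


@dataclass
class Block:
    executions_per_s: float      # block activity factor
    signal_activity: float       # output signal activity factor
    normalized_energy_j: float   # energy per execution at activity factor 1


@dataclass
class Bus:
    signal_activity: float       # signal activity factor
    length: float                # bus length
    cap_per_unit_length: float   # capacitance per unit length


def block_power(blocks):
    """Sum of activity-weighted energy per second over all building blocks."""
    return sum(b.executions_per_s * b.signal_activity * b.normalized_energy_j
               for b in blocks)


def bus_power(busses, k):
    """Sum of activity-weighted bus capacitance, scaled by the constant k."""
    return k * sum(n.signal_activity * n.length * n.cap_per_unit_length
                   for n in busses)


def total_power(blocks, busses, k):
    return block_power(blocks) + bus_power(busses, k)


# Hypothetical numbers for a system shaped like Figure 3.1:
# two building blocks and three busses (a, b, c).
blocks = [Block(1e6, 0.5, 2e-12), Block(2e6, 0.25, 4e-12)]
busses = [Bus(0.5, 1.0, 1e-13), Bus(0.25, 2.0, 1e-13), Bus(0.5, 1.0, 1e-13)]
print(total_power(blocks, busses, k=1e7))
```

Keeping the block and bus contributions separate makes it easy to see which of the two dominates when an architecture is transformed.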
3.4 Low-Power Examples of Portable Systems 
There are numerous low-power techniques used by researchers and designers to 
lower the power dissipation of portable systems. Some of these techniques, used for 
the design of wireless portable systems, digital signal processors, video compression
algorithms and microprocessors are presented in this section. 
The first low voltage, very low current integrated circuits were developed about 
25 years ago for the watch [53]. However, for other electronic systems power 
dissipation was only an afterthought. During the last decade, this has begun to
change. There has been great interest in the implementation of low-power, small
size portable communicators for voice, video, images and data information as well
as low-power note-book and lap-top computers [8] [54] [55] [56]. 
In cellular systems, a considerable fraction of battery energy is used for trans-
mission. Reducing the cell size not only increases the spectrum efficiency through
frequency reuse but it also allows operation at lower transmission power levels. This
in turn leads to longer battery life. Currently mobile phones operate in a cell of
several hundred meters radius, and transmit power in the order of 0.1-1 Watt [57]. 
The Viterbi decoder, used in CDMA cellular applications, presented in [58], 
employs various low-power techniques. The squared Euclidean measure has been
substituted by a non-squared Euclidean measure. This reduces the complexity of
the branch metric unit and the word-length of the path metric unit. The Viterbi
decoder presented uses minimum sized processing units. To meet the throughput 
requirement, parallelism and pipelining are employed. To reduce spurious transi- 
tions on high-capacitance busses, gated control signals are used for controlling the 
multiplexers connected to these busses. 
Surviving-path memory management [59] is one of the operations required in the
Viterbi decoder. There are two techniques for surviving-path memory management:
exchange register and trace back. In [60], the effect of hybrid techniques on reducing
the power dissipation is considered. 
In a receiver, the matched filter is positioned between the RF section and the 
baseband section. Hence, it can be implemented in digital or in analog technology.
The effect of each implementation on the power dissipation is considered in [61].
It turns out that for slow matched filters, with a large number of taps and high
precision, the digital implementation is more power efficient than the analog one.
By lowering the supply voltage from 5 volts to 1.5 volts, the power dissipation of
different digital filters has been lowered by 8-11 times [62]. Architectural
transformations such as parallelism, associativity, distributivity, commutativity,
operation substitution and bit width optimization were used to maintain a constant
throughput, lower the glitching activity, and reduce the interconnect capacitance.
In [63], a low-voltage low-power DSP is designed. The operating speed of the
DSP is 63 MHz at 1 Volt. The power dissipation at this voltage and speed is
17.0 mW. During active operation of the DSP, power saving is realized by the use
of locally gated clocks. Global gating is also available and it is controlled by three
power-down instructions. The memory is divided into 8 arrays; only one array is
activated during each memory access. A multi-level threshold voltage, VT, is used.
High VT is used in the 6 transistors of the memory cells to lower the standby current.
Low VT is used in the peripheral circuitry to allow high-speed operation at 1 Volt. 
To lower the power dissipation in the lowpass interpolation and decimation
filters, the filter order is adapted according to the input and output signal
characteristics [64]. This avoids the use of higher order filters when a lower order is sufficient.
Powering-down control is used in the ALU when a lower order is sufficient. The 
saving in power dissipation achieved for the decimation and interpolation filters is 
42% and 21% respectively. 
A variable threshold voltage scheme is used in [65], to lower the standby power
dissipation in a low VT CMOS technology. It also mitigates the effect of fluctuations
in VT on the system delay. The threshold voltage is controlled by changing the
substrate voltage VBB. For the NMOS, in the active mode VBB = -0.5 Volt, and
VT = 0.1 Volt. In the standby mode, VBB = -3.3 Volt and VT = 0.5 Volt.
A low-power subband video compression algorithm decoder is presented in [66]. 
Parallelism of the subband algorithm has been exploited to achieve an excess
throughput that can be traded for lower power by reducing the supply voltage.
Off chip memory is avoided to eliminate the high power consumption of external
memory access. An asymmetric wavelet filter is used in the lowpass and highpass
filters; the filter uses 3-2 adders in its implementation. For the high-frequency
bands, the data is zero run-length encoded to reduce the number of external in- 
puts by almost a factor of 4. At 1.0 Volt, the decoder is capable of operating at a 
3.2 MHz real time video rate, and dissipates 1.2 mW. 
Unlike standard Vector Quantization (VQ) decoders which require codebook 
storage, the Pyramid Vector Quantization (PVQ) decoder relies on intensive
arithmetic computations [67]. Several low-power techniques have been used in the
implementation of that PVQ decoder. The architecture is divided into four independent
processing blocks, with the blocks separated by FIFOs. Each processing block
operates as long as it has data to process and its output FIFO is not full; otherwise, it
enters into the standby mode by gating its clock. The critical path of the vector de- 
coder is optimized to improve throughput and hence allow lower voltage operation. 
The FIFO uses a pointer based scheme for better energy-efficiency. 
With microprocessors' speed approaching 300 MHz, for the DEC Alpha 21164, 
the power dissipation can reach 50 Watts [3] [68]. Various low-power techniques 
have been employed to lower the power dissipation in microprocessors. In [69], 
a low-power RISC processor that dissipates less than 2 Watts is presented. The 
processor uses a 64-bit common bus for floating point as well as integer instruction
execution; this reduces the number of functional elements. The number of
instruction cache accesses is reduced by half. The number of I/O transactions is minimized.
Data and instruction caches are partitioned into four banks with only one bank
active at a time. The dynamic nodes are charged to VDD - VT, rather than to
VDD. Through software programming, the system clock can be reduced to 25% of
its value. 
In [70], low-power techniques were applied to the Alpha 21064 microprocessor
[71]. These techniques were able to reduce the power dissipation by over 50 times.
Such low-power techniques include: lowering the internal power supply to 1.5
Volts; reduced functionality, where the floating point unit and the branch history
table were eliminated; and process scaling from 0.75 µm to 0.35 µm, which reduced
the total switched capacitance. The microprocessor dissipates 450 mW. It has two
power down modes. During the idle mode the internal clock is stopped and the power
dissipation drops to 20 mW. During the sleep mode the internal power supply is
switched off and the current drops to 50 µA.
In [72], another low-power microprocessor is presented that dissipates less than 
two watts. This processor features several low-power modes. In the "halt" mode,
the processor stops its internal clock. The system can also change the input
frequency. During the shutdown state, the system disconnects the processor from
VDD; the register contents are saved in the memory. At power up, the system
returns the registers to their previous state.
In [73], another low-power microprocessor is presented that dissipates 3 Watts. 
The processor has dynamic as well as static power management modes. The dy- 
namic power management disables blocks that are not required to operate during 
a cycle. Dynamic power management can give up to 30% power saving. There are 
three static power management modes: Doze, Nap and Sleep. This processor uses
an H-Tree clock distribution network over a set of distributed buffers to minimize 
active power. 
3.5 Reducing the Power Dissipation at the Device and Circuit Levels
There are several techniques, during each level of the design process, to reduce the 
power dissipation. At the device level, the following techniques lead to lower power 
dissipation: 
• Silicon-On-Insulator (SOI) technology. This leads to lower leakage currents
and lower parasitic capacitances [74] [75].
• Place and route optimization. Assign signals with high switching activities
to short wires. Also, assign global signals, such as the clock, to layers
with low capacitance per unit length.
• Transistor sizing. Increasing (W/L) decreases the transistor delay, which
allows a decrease in voltage while maintaining a constant throughput. However,
increasing the transistor size increases the capacitance and hence the power
dissipation. Hence, an optimum transistor size for minimum power dissipation
exists. Let the interconnect capacitance be Cp and the transistor input
capacitance be Ci. It is found that, for small Cp/Ci, the optimum transistor
size is the minimum size; otherwise there is an optimum size that gives
minimum power dissipation [41].
• Using submicron devices. This reduces the parasitic capacitances and
allows the use of lower supply voltage, with minimum effect on the delay, for
a velocity-saturated device [76].
• Reducing the threshold voltage. This allows a reduction in the operating
voltage, with minimum effect on the delay. But this leads to larger
subthreshold currents. Hence, a compromise is required. Some designs use a
multi-threshold voltage technology [77], while others use a variable threshold
voltage [65].
At the circuit and logic levels, the following techniques can be used to reduce
the power dissipation:
• Reduced gate capacitance. For example, complementary pass-transistor logic
has a lower input capacitance than conventional CMOS logic [78].
• Reduced logic swing [41], by making VH = VDD - VT. However, this has
two disadvantages:
1. Low noise-margin-high (NMH).
2. The following gate can dissipate static power.
• Low-power support circuitry:
- Level converting circuit [41].
- High efficiency low-voltage DC/DC converter [79].
• Logic level power-down. Modifying the circuits to allow power-down of
unused logic blocks. This adds some overhead but can be beneficial if there
are certain blocks that are not used for a large portion of the time [41] [80].
• Multi-threshold circuit technology. This allows the optimization of
low-voltage circuits for high-speed and low-power [63] [81].
• Scaled multi-buffer stages. This trades off speed and power for gates
driving large capacitive loads [3] [82].
3.6 Low-Voltage Low-Power Operation 
The switching power is proportional to the square of the voltage, thus a quadratic 
reduction in power dissipation is achieved by lowering the supply voltage. However, 
the delay increases with the reduction of the supply voltage [1]. There are certain
techniques used to keep the throughput constant despite the longer delay of the 
various building blocks. In this section, some of these techniques are investigated. 
Figure 3.2.a shows the relative increase in delay as the supply voltage is scaled 
down. Figure 3.2.b shows the relative decrease in power dissipation as the voltage 
is scaled down. Both figures were obtained for a CMOS inverter gate loaded by a 
1 pF load and using 0.8 µm BiCMOS technology.
From Figure 3.2, it can be noticed that at low voltage the rate of increase of the
delay exceeds the rate of decrease of the power dissipation. This usually places a
limit on the extent of using voltage scaling techniques.
In this section, two methods which allow the use of lower supply voltage without
reducing the throughput are presented. These are:
1. Pipelining and parallelism.
2. Using carry-save adders.
Figure 3.2: The effect of reducing the supply voltage in CMOS circuits, for a 0.8 µm
BiCMOS technology:
(a) Relative delay versus supply voltage.
(b) Relative power dissipation versus supply voltage.
3.6.1 Pipelining and Parallelism at the Architecture Level
The architecture level is the level at which operators (functional units) act on
sets of logic values grouped into words. The manner in which these operators are
interconnected or sequenced in time can have an influence on the performance of
the architecture in terms of throughput, power dissipation and/or area, without
affecting the actual functionality of the architecture.
For example, consider an architecture consisting of two cascaded operators as
shown in Figure 3.3.a. Pipelining [83] this architecture, as shown in Figure 3.3.b,
gives an alternative architecture with the same functionality as the original
architecture, but with different performance. It is possible to operate the pipelined
architecture at a lower supply voltage and at the same throughput as the original
architecture. Thus through pipelining, it is possible to preserve the functionality
and the throughput of the system but lower its power dissipation [1], at the cost of
latency and extra overhead.
Figure 3.3: Two cascaded operators:
(a) Nonpipelined architecture.
(b) Two-stage pipelined architecture.
It is also possible, through parallelism, to decrease the power dissipation of the 
system [1]. In parallelism, the datapath is repeated N times, where N is the degree
of parallelism. Like pipelining, the reduction of power dissipation in parallelism is
due to the reduction in voltage, while the throughput is kept constant. Figure 3.4
illustrates parallelism.
The merits of pipelining over parallelism are: 
2. Reduced logic depth. Hence, less power due to glitches.
On the other hand, the disadvantage of pipelining over parallelism is the
unbalanced pipe-stage delay problem [83]. To overcome this, we can combine parallelism
and pipelining together as illustrated in the following example.
Figure 3.4: A two-datapath parallel system. [Block labels: DeMUX, Datapath 1,
Datapath 2, MUX.]
Example: Combining Pipelining and Parallelism
Consider two cascaded operators A and B as shown in Figure 3.3.a. Assume that:
D(A) = 1 unit
D(B) = 2 units
D(AB) = 3 units
where D(X) means the delay of operator X.
Pipeline this architecture as shown in Figure 3.3.b. Neglect the register delay 
with respect to that of the operator. The effective delay of the pipelined system 
is determined by the delay of the slowest stage which is 2 units in this case. It is 
possible now to reduce the voltage of the pipelined system to make the through- 
put of that system equal to the throughput of the non-pipelined one. The power 
dissipation, from Figure 3.2, is reduced by about 56%. 
Figure 3.5: Combining parallelism with pipelining to balance pipestage delays.
While the power was reduced by more than half, we did not make
full use of pipelining due to the unbalanced pipestage delays. It is possible, by
combining parallelism with pipelining, to balance the pipestage delays and hence
achieve a larger reduction in power dissipation. Figure 3.5 shows a system that
combines parallelism with pipelining. In this case, the effective system delay is 1
unit, allowing an 89% reduction in power dissipation, while maintaining the same
throughput as the non-pipelined system.
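The arithmetic of this example can be sketched in a few lines of Python. The sketch does not read values from Figure 3.2. Instead it assumes the first-order model in which delay is inversely proportional to the supply voltage and power is proportional to the square of the voltage, and it assumes that the balanced system uses two parallel copies of operator B. The function and variable names are illustrative.

```python
def normalized_power(effective_delay, original_delay):
    """Power relative to the original system at equal throughput.

    Assumes delay ~ 1/V and power ~ V**2, so the supply voltage can be
    scaled by effective_delay / original_delay without losing throughput.
    """
    v_scale = effective_delay / original_delay
    return v_scale ** 2


d_a, d_b = 1.0, 2.0
original = d_a + d_b              # D(AB) = 3 units, no pipelining
pipelined = max(d_a, d_b)         # the slowest stage sets the rate: 2 units
balanced = max(d_a, d_b / 2)      # two parallel copies of B: 1 unit

print(round(100 * (1 - normalized_power(pipelined, original))))  # 56
print(round(100 * (1 - normalized_power(balanced, original))))   # 89
```

Under these assumptions the sketch reproduces the 56% and 89% reductions quoted in the example.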
There is a limit to pipelining and parallelism beyond which no improvement in
power dissipation is possible. This is determined by:
1. The extra overhead required for pipelining and parallelism. This is
represented by the pipeline registers for pipelining, and by the multiplexers,
demultiplexers and the extra wiring capacitance for parallelism.
2. At low-voltage, the rate of increase of delay exceeds the rate of decrease of 
power dissipation. 
The concept of parallelism can also be applied to memory accesses, where several
bytes are accessed in parallel instead of accessing them sequentially. Parallelism
in memory access is possible only if the data access pattern is sequential in nature
[41].
3.6.2 Carry Save Adder 
Addition is the most frequent operation performed by a digital signal processor,
whether explicitly or within other operations such as multiplication. Hence, if we
are able to use faster adders, the delay of the critical path can be reduced, which
allows the use of a lower supply voltage.
The delay of the adder is mainly due to the propagation of the carry from
the least significant bit position to the most significant bit position. Consider an
N-bit adder. The delay of the ripple-carry adder [84], which is the most
power-efficient adder [85] compared to other adders at the same voltage, is proportional to
N. Adders such as the carry-lookahead adder and the conditional sum adder have a
delay proportional to log(N) [84]; hence it is possible to lower the supply voltage of
these adders and get a delay equal to that of the ripple-carry adder. The lowering
of the supply voltage in these adders leads to a lower power dissipation than that
of the ripple-carry adder [86].
However, it would be much better if the carry propagation could be eliminated
altogether. This is especially useful when we have a cascade of adders, which is
usually the case in a finite impulse response (FIR) filter. In this case, by using a
carry-save adder [84] for all adders and using a ripple-carry adder (or any fast
adder) for adding the sum and carry words of the last carry-save adder, the delay
can be significantly reduced; hence a lower supply voltage can be used and lower
power dissipation is achieved.
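The behaviour of a chain of carry-save adders can be illustrated with a short bit-level model in Python. This is a sketch for non-negative integers only and says nothing about delay or power. It only shows that each carry-save stage produces a sum word and a carry word without propagating carries, and that a single carry-propagating addition is needed at the end of the chain. The function names are illustrative.

```python
def carry_save_add(a, b, c):
    """3:2 compression of three words into a sum word and a carry word.

    Each bit position is handled independently (no carry propagation):
    the sum bit is the XOR of the three inputs and the carry bit is
    their majority, shifted one position to the left.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def csa_chain_sum(operands):
    """Accumulate non-negative integers through a chain of carry-save stages."""
    s, c = 0, 0
    for x in operands:
        s, c = carry_save_add(s, c, x)
    return s + c   # the only carry-propagating addition in the chain


taps = [3, 5, 7, 9, 250, 1023]
print(csa_chain_sum(taps), sum(taps))
```

In hardware, the final addition is the only place where the carry has to ripple across the word, which is why the delay of the chain no longer grows with one full carry propagation per adder.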
The choice of the adder depends on the following parameters:
1. The number of adders in the adder chain M.
2. The number of bits per word N. 
3. The ratio between the sum delay Ta and the carry delay Tc. 
4. The ratio between the adder power PA and the load power PL. 
5. The relation between delay and power versus voltage (Figure 3.2). 
The power-optimum adder is not necessarily the fastest adder. In some cases
[87], the carry-save adder architecture is faster than the ripple-carry adder
architecture, but the latter is more power efficient even after voltage scaling to maintain a
constant throughput. This is due to the extra hardware required for the carry-save
adder architecture [87]. In general, optimization for high throughput is different
from optimization for low-power [51]. 
3.7 Pipelining and Parallelism of the Discrete 
Cosine Transform 
The discrete cosine transform (DCT) is frequently used in video compression [9] [88].
In this section, three different architectures for the DCT are considered. The effect of
pipelining and parallelism on reducing the power dissipation of each architecture is
also considered.
The mathematical formula for the one dimensional DCT (1D-DCT) is given by:
$$X(k) = \sqrt{\frac{2}{N}} \, c(k) \sum_{n=0}^{N-1} x(n) \cos\left(\frac{(2n+1) k \pi}{2N}\right)$$
for k = 0, 1, ..., N - 1
where,
$$c(k) = \begin{cases} 1/\sqrt{2} & k = 0 \\ 1 & \text{otherwise} \end{cases}$$
Direct implementation of this algorithm requires $N^2$ multiplications and
N(N - 1) additions; this not only increases the power dissipation, but the area
and/or delay as well.
Figure 3.6: Multiplier architecture for an 8-point 1D-DCT.
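The operation count of the direct implementation can be confirmed with a small Python sketch that evaluates the 1D-DCT by brute force and counts the multiply and add operations inside the summation. The orthonormal scaling used here, a factor sqrt(2/N) with c(0) = 1/sqrt(2), is an assumption of the sketch. The scaling is applied once per output and is not counted, since it can be folded into the cosine coefficients. The function name is illustrative.

```python
import math


def dct_1d_direct(x):
    """Brute-force 1D-DCT; returns (outputs, multiplications, additions)."""
    n_pts = len(x)
    mults = adds = 0
    out = []
    for k in range(n_pts):
        ck = 1 / math.sqrt(2) if k == 0 else 1.0
        acc = 0.0
        for n in range(n_pts):
            term = x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
            mults += 1                 # one multiplication per (k, n) pair
            if n == 0:
                acc = term             # first term needs no addition
            else:
                acc += term
                adds += 1
        out.append(math.sqrt(2 / n_pts) * ck * acc)
    return out, mults, adds


outputs, mults, adds = dct_1d_direct([1.0] * 8)
print(mults, adds)  # 64 56
```

For N = 8 the counts are 64 multiplications and 56 additions, which is the baseline that the three architectures of this section try to improve on.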
3.7.1 Three Alternative Architectures 
The Multiplier Architecture 
While the direct implementation of an 8-point 1D-DCT requires 64 multiplications
and 56 additions, various algorithms have been proposed that require fewer
additions and multiplications. As an example, the algorithm given in [89]
and shown in Figure 3.6 requires only 29 add/sub blocks and 13 multipliers.
Figure 3.7: Pure ROM architecture for an 8-point 1D-DCT.
The Pure ROM Architecture 
The idea of this implementation is to replace the multiplication operation with
addition and a look-up ROM table. This process is known as distributed arithmetic [9]
[90].
We can use distributed arithmetic to implement the 8-point 1D-DCT. The block
diagram for this architecture is shown in Figure 3.7. Eight 256-word ROMs are
required for this architecture. 
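The principle of distributed arithmetic can be sketched in Python for a single inner product with fixed coefficients. To keep the sketch exact and short, it assumes unsigned inputs and integer (quantized) coefficients. A real DCT datapath works on two's complement data, where the sign bit needs separate treatment. The function names and the coefficient values are hypothetical.

```python
def build_rom(coeffs):
    """ROM word at address a holds the sum of coeffs[j] for every set bit j of a."""
    n = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << n)]


def da_inner_product(rom, xs, bits):
    """Compute sum(coeffs[j] * xs[j]) by bit-serial ROM look-up and shift-add.

    xs are unsigned integers smaller than 2**bits.
    """
    acc = 0
    for b in range(bits):
        addr = 0
        for j, x in enumerate(xs):
            addr |= ((x >> b) & 1) << j   # bit b of every input forms the address
        acc += rom[addr] * (1 << b)       # weight the ROM word by 2**b
    return acc


coeffs = [3, -1, 4, 1, -5, 9, 2, -6]      # hypothetical quantized coefficients
xs = [10, 0, 255, 7, 33, 128, 64, 1]
rom = build_rom(coeffs)
print(len(rom))
print(da_inner_product(rom, xs, bits=8), sum(c * x for c, x in zip(coeffs, xs)))
```

With eight inputs the ROM has 2^8 = 256 words, and one such ROM per output coefficient gives the eight 256-word ROMs of the pure ROM architecture. No multiplier appears in the loop, only look-ups, shifts and additions.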
The Mixed ROM Architecture 
The 8-point 1D-DCT can be expressed as the product of an 8 x 8 matrix by an
eight element column vector. However, through algebraic manipulation [9] [89],
this matrix can be broken down into two 4 x 4 matrices, as given by the following
equations:
where,
In this case, the number of words per ROM is only 16 words (16 times less than
the pure ROM architecture), but some overhead has been incurred in the adders 
required to calculate the address of the ROMs. Figure 3.8 shows a block diagram 
for this architecture. 
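The split of the 8 x 8 transform into two 4 x 4 products can be checked numerically. The Python sketch below uses the standard even/odd decomposition, in which the even-indexed outputs depend only on the sums x(n) + x(7 - n) and the odd-indexed outputs only on the differences x(n) - x(7 - n). It is an illustration of the idea and is not claimed to match the exact matrices or scaling of the original equations. The orthonormal scaling and all function names are assumptions.

```python
import math


def c(k):
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct8_direct(x):
    """8-point 1D-DCT as a full 8 x 8 matrix-vector product."""
    return [0.5 * c(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16)
                             for n in range(8))
            for k in range(8)]


def dct8_even_odd(x):
    """Same transform using two 4 x 4 products on sums and differences."""
    sums = [x[n] + x[7 - n] for n in range(4)]    # four adders
    diffs = [x[n] - x[7 - n] for n in range(4)]   # four subtractors
    out = []
    for k in range(8):
        v = sums if k % 2 == 0 else diffs
        out.append(0.5 * c(k) * sum(v[n] * math.cos((2 * n + 1) * k * math.pi / 16)
                                    for n in range(4)))
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(max(abs(a - b) for a, b in zip(dct8_direct(x), dct8_even_odd(x))))
```

Each output is now a four-term inner product, so a distributed arithmetic ROM needs only 2^4 = 16 words, at the cost of the adders and subtractors that form the sums and differences used as ROM addresses.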
3.7.2 Reducing Power Through Pipelining and Parallelism 
Pipelining 
As the voltage decreases, to decrease the power dissipation, the overall delay of
the datapath increases (an undesirable side effect). To counteract this increase in
delay, we can use pipelining [1]. The rationale of using pipelining here is that the
increase in delay accompanying the decrease in voltage is balanced by dividing the
datapath into smaller pipestages and keeping the maximum delay of any pipestage,
at the lower voltage, equal to the overall delay of the datapath without pipelining,
at the higher voltage.
Figure 3.8: Mixed ROM architecture for an 8-point 1D-DCT.
As a first order approximation, make the following assumptions:
• Neglect the effect of overhead caused by the pipeline registers, whether in
terms of increased capacitance or increased delay.
• Assume that the pipeline can be perfectly balanced. All pipestages have the
same delay.
• The delay of any stage is inversely proportional to the applied voltage.
To maintain the same maximum throughput when dividing the datapath into
N pipestages, each pipestage operates at a voltage V/N. The power dissipation in
this case is approximately given by:
$$P_{pl} = \frac{P_o}{N^2} \qquad (3.7)$$
where $P_o$ is the power dissipation before pipelining, and $P_{pl}$ is the power
dissipation after pipelining and reducing the voltage.
Parallelism can also be used to keep the throughput of the system constant as the 
voltage decreases [1]. To compensate for the increase in datapath delay as the voltage
decreases, we replicate the datapath N times. The input samples are split among 
the N datapaths. The outputs of the N datapaths are then multiplexed onto a 
single output stream. This allows the system to maintain its throughput, while 
each datapath is operating at a lower rate and a lower voltage. 
Unlike pipelining, parallelism greatly increases the area. Making approximations
similar to those made in the analysis of pipelining, we can show that the power
dissipation of a system consisting of N parallel datapaths and having the same
throughput rate is approximately given by:
$$P_{p} = \frac{P_o}{N^2} \qquad (3.8)$$
where $P_o$ is the power dissipation for a single datapath system, and $P_p$ is the
power dissipation for a system consisting of N parallel datapaths and having the
same maximum throughput as the single datapath system.
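The first-order scaling behind both approximations can be written out factor by factor. The Python sketch below assumes, as in the assumptions listed for pipelining, that overhead is neglected, that delay is inversely proportional to the voltage, and that switching power is proportional to capacitance times frequency times the square of the voltage. The function names are illustrative.

```python
def pipelined_power(p0, n_stages):
    """N balanced pipestages: same capacitance and clock rate, voltage V/N."""
    capacitance = 1.0
    frequency = 1.0
    voltage_sq = (1.0 / n_stages) ** 2
    return p0 * capacitance * frequency * voltage_sq


def parallel_power(p0, n_paths):
    """N parallel datapaths: N times the capacitance, each copy clocked at
    1/N of the rate, which allows a voltage of V/N."""
    capacitance = float(n_paths)
    frequency = 1.0 / n_paths
    voltage_sq = (1.0 / n_paths) ** 2
    return p0 * capacitance * frequency * voltage_sq


for n in range(1, 6):
    print(n, pipelined_power(1.0, n), parallel_power(1.0, n))
```

In this idealized model both techniques give the same normalized power for a given N. The difference between them shows up in what the model neglects: register overhead and unbalanced stages for pipelining, and area, multiplexing and wiring capacitance for parallelism.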
3.7.3 Performance Evaluation 
The Multiplier Architecture 
Table 3.1: Delay and power dissipation at 5 MHz for the multiplier implementation
with no pipelining.
Table 3.1 gives the Spice simulation results of the overall delay and power dissipation
of the multiplier architecture of Figure 3.6 in a 0.8 µm BiCMOS technology. The
power dissipation is evaluated at a frequency of 5 MHz. Notice that the power
dissipation is reduced as the voltage decreases, but this is at the expense of the
increased delay (decreased throughput). To maintain the same throughput rate at





Figure 3.9 shows the approximate relation between the power dissipation and 
the number of pipestages, given by Equation 3.7. Also shown are the simulation 
results obtained from pipelining the multiplier architecture. Everything has been 
normalized to the case of a single stage system operating at the same throughput. 
It is also possible to maintain the same throughput while reducing the voltage 
by applying parallelism. Figure 3.10 shows the approximate relation between the 
power dissipation and the number of datapaths, given by Equation 3.8. Also shown 
are the simulation results obtained from increasing the degree of parallelism of the
multiplier architecture. Everything has been normalized to the case of a single 









Figure 3.9: The effect of pipelining on reducing the power dissipation, while
maintaining a constant throughput. [Plot of pipelined power versus N; simulation
points for 2 pipestages and 3 pipestages.]
Figure 3.10: The effect of parallelism on reducing the power dissipation, while
maintaining a constant throughput. [Plot of parallel power versus N; simulation
points for a 2 path parallel system at 5V and a 3 path parallel system at 4V.]
Table 3.2: Total delay and power dissipation at 5 MHz for the pure ROM
architecture.
The ROM Architectures 
Two different types of ROM architectures were considered. First, the pure ROM 






ulation results of the overall delay and power dissipation for this architecture are
given in Table 3.2. All eight ROMs have the same address, hence a single ROM 









The second type of ROM architecture considered is the mixed ROM architecture.
This architecture requires eight 16-word ROMs, but it also requires eight extra
adder/subtractor units. The simulation results of the overall delay and power
dissipation for this architecture are given in Table 3.3. The eight ROMs can be divided
into two groups; the ROMs of each group have the same address. Hence, two ROM
address decoders were used.
Three architectural alternatives, in addition to the single stage alternative, were
considered for each of the pure and mixed ROM implementations:
• Two stage pipeline.
• Two stage pipeline with two parallel adders per path, Figure 3.11.
• Two stage pipeline with three parallel adders per path, Figure 3.12.
Table 3.3: Total delay and power dissipation at 5 MHz for the mixed ROM architecture.
Tables 3.4 and 3.5 show the effect of pipelining and parallelism on reducing 
the power dissipation in the pure ROM and mixed ROM implementations. The 





In terms of power dissipation, the multiplier architecture has the lowest power
dissipation, while the pure ROM and the mixed ROM architectures have higher
power dissipations. In terms of speed, the multiplier architecture is the slowest and





Pipelining the ROM architectures provides only a modest reduction in power
dissipation, or a modest increase in throughput if the voltage is kept constant. This
is because of the unbalanced pipestage delays: the delay of the second pipestage is
2-2.5 times the delay of the first pipestage. Parallelism in the second pipestage is used
to alleviate this problem, as shown in Figure 3.11 and Figure 3.12.
Using a two stage pipeline, and three times parallelism in the second stage of
the pure ROM architecture, reduces the power dissipation by 52% and increases
the throughput by 38%, when lowering the voltage from 5 Volts to 3.3 Volts. For
the mixed ROM architecture, a two stage pipeline, with three times parallelism in
the second stage, achieves a 47% saving in the power dissipation, and increases the
throughput by 36%.
Figure 3.12: Using three parallel adders in the second pipestage.
Table 3.4: Reducing the power dissipation by pipelining and parallelism in the pure
ROM implementation.
Architecture          Power (mW)
Single stage          30.4 (@ 5V)
2 stage 1 add/path    21.9 (@ 4V)
2 stage 1 add/path    36.1 (@ 5V)
2 stage 2 add/path    14.5 (@ 3.3V)
2 stage 2 add/path    24.6 (@ 4V)
2 stage 3 add/path    14.5 (@ 3.3V)
2 stage 2 add/path    40.6 (@ 5V)
2 stage 3 add/path    24.9 (@ 4V)
Delay (ns): 85 - 90, 65 - 70, 52 - 57, 42 - 46
Table 3.5: Reducing the power dissipation by pipelining and parallelism in the
mixed ROM implementation.
Architecture          Power (mW)
2 stage 1 add/path    35.7 (@ 5V)
2 stage 2 add/path    15.0 (@ 3.3V)
2 stage 2 add/path    24.4 (@ 4V)
2 stage 3 add/path    16.0 (@ 3.3V)
2 stage 2 add/path    40.2 (@ 5V)
2 stage 3 add/path    25.8 (@ 4V)
Delay (ns): 7 -, 65 - 70, 50 - 55, 41 - 43
3.8 Effect of the Number System on the Switching Activity
In this section, I demonstrate how the choice of the number system can reduce
the switching activity. Two number systems are considered: the two's complement
number system and the Gray code number system. The Gray code number system
has the advantage that any two adjacent numbers differ in one bit only, so
by coding correlated samples in Gray code we can effectively reduce the switching
activity [91].
Assume positive samples, so that instead of the two's complement representation,
the unsigned binary representation is considered. Each sample is represented
by an N bit binary word, and each binary word is assigned an integer n. The range of
n is:
$n = 0, \ldots, 2^N - 1$
The binary word is the unsigned binary representation of n, or the Gray code
of n, depending on which representation is used. Let i be the integer representing
the current sample and j be the integer representing the previous sample. Assume
that the conditional probability distribution is given by:
$P(i \mid j) = \begin{cases} \dfrac{M + 1 - |i - j|}{(M+1)^2} & |i - j| \le M \\ 0 & \text{otherwise} \end{cases}$ (3.9)
M is a factor which describes the correlation between the successive samples.
The larger the value of M, the less correlated the samples are. Figure 3.13 shows
the conditional probability distribution of $x_n$ given $x_{n-1}$, for different values of M.
Using the probability distribution given in Equation 3.9, we can derive the
switching activity for different values of M. This is given in Table 3.6. $\alpha_g$ is the
switching activity for the Gray code representation, while $\alpha_u$ is the switching
activity for the unsigned binary representation.
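The switching activity under the distribution of Equation 3.9 can also be estimated numerically. The sketch below is an illustration only, not the derivation used for Table 3.6. It assumes that the previous sample j is uniformly distributed over 0 to 2^N - 1, that the triangular distribution is truncated and renormalized at the two ends of the range, and that switching activity means the average number of bit transitions per sample. All function names are hypothetical.

```python
def unsigned_binary(n):
    """Unsigned binary code word of n (the integer itself)."""
    return n


def gray_code(n):
    """Reflected Gray code word of n."""
    return n ^ (n >> 1)


def hamming(a, b):
    """Number of bit positions in which two code words differ."""
    return bin(a ^ b).count("1")


def switching_activity(n_bits, m, encode):
    """Average bit transitions per sample for correlated samples.

    Assumptions (not taken from the thesis): j is uniform over the range,
    and the weights M + 1 - |i - j| are renormalized near the range ends.
    """
    size = 1 << n_bits
    total = 0.0
    for j in range(size):
        lo, hi = max(0, j - m), min(size - 1, j + m)
        weights = {i: m + 1 - abs(i - j) for i in range(lo, hi + 1)}
        norm = sum(weights.values())
        flips = sum(w * hamming(encode(i), encode(j)) for i, w in weights.items())
        total += flips / norm
    return total / size


if __name__ == "__main__":
    for m in (1, 4, 16, 64):
        g = switching_activity(8, m, gray_code)
        u = switching_activity(8, m, unsigned_binary)
        print(f"M={m:3d}  gray={g:.3f}  binary={u:.3f}  ratio={u / g:.2f}")
```

Running the loop for increasing M shows how the advantage of the Gray code shrinks as the samples become less correlated, which is the trend discussed next.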
Notice that as the correlation between the successive samples is reduced, the
effectiveness of the Gray code in reducing the switching activity becomes less.
Another factor in determining the power-optimum number system is the complexity
of the operator and hence the energy dissipated per single execution of the
operator. It is quite possible that the choice of a number system to lower the switching
activity will lead to higher operator energy. Hence, a compromise is required in
choosing the power-optimum number system.
Table 3.6: Switching activity of the Gray code and the unsigned binary representations
for correlated samples.
Figure 3.13: Conditional probability distribution between successive samples for
different values of M. $x_{n-1}$ is the value of the previous sample. The x-axis represents
As an example, consider the addition operator. The adder operates repeatedly
on correlated successive samples. Two number systems are considered:
1. The unsigned binary number system.
2. The Gray code number system.
Four factors are considered in determining the optimum number system:
1. The correlation factor M.
2. The number of bits per word N.
3. The operator energy ratio.
4. The relative load capacitance.
The effect of the correlation factor and the number of bits per word on the
switching activity is given by Table 3.6 and shown in Figure 3.14. The operator
energy ratio depends on the details of the circuits and the implementing technology
for each number system adder. Hence, this ratio can vary from one implementation
to the other. Yet, a first-order estimate is required. This was obtained by
writing a VHDL description for each adder, and then synthesizing this design and
estimating its power dissipation using the COMPASS tools. For a single operator
execution, the energy dissipated in an unloaded Gray code adder is about double
that dissipated in an unsigned binary representation adder.
Let $\alpha_u$ be the switching activity of the unsigned binary number representation.
Let $\alpha_g$ be the switching activity of the Gray code number representation. Let $P_u$
be the normalized power dissipated by the unsigned binary adder. Let $P_g$ be the
normalized power dissipated by the Gray code adder, and let $P_l$ be the normalized
power dissipated by the adder load. The total power dissipated by each adder and
its load is given by:
$P_U = \alpha_u (P_u + P_l)$ (3.10)
$P_G = \alpha_g (P_g + P_l)$ (3.11)
for the unsigned binary adder and the Gray code adder respectively.
Figure 3.14: The ratio between the switching activity of the unsigned binary
representation and the Gray code representation versus the correlation factor M, for
different word length N. (Vertical axis: relative switching activity.)
The objective is to find the boundary at which the two adders dissipate the same
amount of power. On one side of this boundary, the unsigned binary representation
has lower power dissipation, while on the other side, the Gray code representation
has lower power dissipation. The equation of this boundary is given by:
where,
S.= z. 
Figure 3.15 shows the relation between $R_G$ and $R_L$ for different values of N
and M. N is the number of bits per word, while M is a factor related to the
correlation between the successive samples. For a certain $R_L$, a value of $R_G$ higher
than that given by the curves means that the unsigned binary representation is
more power-efficient than the Gray code representation, and vice versa.
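The equal-power boundary can be worked out by equating the total powers of Equations 3.10 and 3.11. The derivation below is a sketch under one stated assumption: that the relative Gray code adder power is $R_G = P_g / P_u$ and the relative load power is $R_L = P_l / P_u$, which agrees with the axis descriptions in the caption of Figure 3.15. Here $\alpha_u$ and $\alpha_g$ are the switching activities, and $P_u$, $P_g$ and $P_l$ are the normalized adder and load powers.

```latex
\alpha_u (P_u + P_l) = \alpha_g (P_g + P_l)
\quad\Longrightarrow\quad
\alpha_u (1 + R_L) = \alpha_g (R_G + R_L)
\quad\Longrightarrow\quad
R_G = \frac{\alpha_u}{\alpha_g}\,(1 + R_L) - R_L
```

Under this assumption, a Gray code adder whose relative power exceeds the right-hand side dissipates more total power than the unsigned binary adder, which matches the reading of the curves given above.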
3.9 Reducing the Number of Iterations 
In this section, the effect of reducing the number of block iterations per output on 
the power dissipation is examined. Two examples are given. In the first example, 
the number of block iterations per output for a division algorithm is reduced. This 
is done by using a higher-order radix. The effect of this on power dissipation is
illustrated.
In the second example, a division algorithm is developed which reduces the 
number of add/sub operations required per output [92]. The effect of this on 
reducing the power dissipation is illustrated. 
Figure 3.15: The relation between the relative Gray code adder power and the
relative load power, for equal power dissipation in the Gray code and unsigned
binary adders. N is the number of bits per word. M is a factor related to the
correlation between successive samples. The larger the value of M, the less the
correlation between successive samples.
3.9.1 Higher Radix Division Algorithms
In the division process [93] the dividend X is divided by the divisor D to generate
the quotient Q. The division operation is expressed as:
In Equation 3.13, r is the radix order. For the purpose of this discussion, r is
a power of 2, $r = 2^m$, where m is a positive integer. $q_i \in \{0, 1, \ldots, r - 1\}$. The
purpose of the division algorithm is to find the quotient digits $q_1, q_2, \ldots, q_n$. In the
digit-recurrence division algorithm, the quotient digits $q_1, q_2, \ldots, q_n$ are obtained
sequentially starting with $q_1$.
In the process of obtaining the quotient in the two's complement representation,
an intermediate quotient in a redundant signed-digit representation is first obtained.
For the intermediate quotient, $q_j \in \{-a, \ldots, -1, 0, 1, \ldots, a\}$, where a is an
integer and a < r.
During iteration j + 1, the quotient digit $q_{j+1}$ is generated by the Quotient Digit
Selection (QDS) unit:
$q_{j+1} = \mathrm{QDS}(rP(j), D)$, (3.14)
A new partial remainder is generated according to the following equation,
$P(j + 1) = rP(j) - D q_{j+1}$, (3.15)
with
where
P(j) is the partial remainder after j iterations.
$j = 0, \ldots, (n - 1)$.
r is the radix of the number system.
D is the divisor.
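The recurrence of Equation 3.15 can be illustrated with a short behavioural model. The sketch below is not the hardware algorithm of the thesis: the quotient digit selection is an idealized exact rounding of rP(j)/D clamped to the digit set, rather than a limited-precision QDS; the initial remainder is taken as P(0) = X; and the dividend is assumed to satisfy |X| <= a/(r - 1) * D so that the remainder stays bounded. The function name and parameters are hypothetical.

```python
from fractions import Fraction


def digit_recurrence_divide(x, d, r=4, a=2, n=6):
    """Behavioural model of P(j+1) = r*P(j) - D*q(j+1).

    Returns the signed quotient digits, the quotient they represent,
    and the final partial remainder P(n).
    """
    x, d = Fraction(x), Fraction(d)
    rho = Fraction(a, r - 1)          # redundancy factor of the digit set
    assert abs(x) <= rho * d, "dividend outside the convergence range"
    p = x                             # assumed initial condition P(0) = X
    digits = []
    for _ in range(n):
        shifted = r * p               # shifted partial remainder rP(j)
        q = round(shifted / d)        # idealized quotient digit selection
        q = max(-a, min(a, q))        # keep the digit inside {-a, ..., a}
        p = shifted - d * q           # Equation 3.15
        digits.append(q)
    quotient = sum(Fraction(q, r ** (j + 1)) for j, q in enumerate(digits))
    return digits, quotient, p


if __name__ == "__main__":
    digits, quotient, rem = digit_recurrence_divide(Fraction(5, 16), Fraction(11, 16))
    print(digits, float(quotient), float(rem))
```

With these assumptions the identity X = D*Q + P(n) * r^(-n) holds exactly, so the error of the n-digit quotient is bounded by the final remainder scaled by r^(-n).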
Figure 3.16 shows the block diagram of the digit-recurrence division algorithm.
Several points need to be clarified about this block diagram. The adder used is
a carry save adder (CSA), which generates two outputs: the sum and the carry of the
partial remainder. The QDS requires the partial remainder in two's complement
format. However, when using a redundant quotient-digit set, the accuracy of the
partial remainder required by the QDS is limited. Hence, the carry propagate adder
(CPA) is of limited accuracy, adding only a few of the most significant bits of the
sum and carry of the partial remainder. The quotient is generated in signed-digit
format [94] (a redundant quotient-digit set), hence a module is required to convert
it to the two's complement format. This is done by the On-The-Fly (OTF) module
[95].
For the same accuracy, the number of iterations n depends on the radix used. 
Assume that n is 12 for a radix 2 algorithm, then n is 6 for a radix 4 algorithm, 4 
for a radix 8 algorithm, 3 for a radix 16 algorithm, and so on. However, reducing
the block activity factor, i.e. the number of iterations per second, does not necessarily
lead to lower power dissipation. Higher radix blocks are more complex and hence
dissipate more power.
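The iteration counts quoted above follow from the fact that a radix $2^k$ iteration retires k quotient bits. A minimal sketch of this arithmetic, with a hypothetical function name and the 12-iteration radix 2 case from the text as the default:

```python
import math


def iterations_for_radix(radix, radix2_iterations=12):
    """Iterations needed at a given power-of-two radix for the same accuracy."""
    if radix < 2 or radix & (radix - 1):
        raise ValueError("radix must be a power of two")
    bits_per_iteration = radix.bit_length() - 1   # log2(radix), exact for powers of two
    return math.ceil(radix2_iterations / bits_per_iteration)


if __name__ == "__main__":
    for radix in (2, 4, 8, 16):
        print(radix, iterations_for_radix(radix))
```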
For low-power design, the objective is to minimize the power dissipation after
meeting the throughput requirement. There are two conflicting factors that need
to be taken into consideration:
• The block activity factor. This decreases as the radix order increases.
• The normalized energy per block. This increases as the radix order increases.
In addition, the effect of voltage scaling, if permissible by the technology, has
to be taken into account.
Figure 3.16: Block diagram of a digit-recurrence division algorithm.
Table 3.7 shows the parameters of each radix-dependent block. For the QDS,
the numbers shown determine the number of integer bits and the required fractional
accuracy for the shifted partial remainder rP and the divisor D. The actual number
of bits the QDS requires from D is one less than the accuracy of D, because D is
always in the range [0.5, 1). The range of the signed-digit quotient digits is [-a, a].
a is chosen to minimize the computation done in the Divisor Multiples module.
Table 3.7: Parameters of the radix-dependent blocks of Figure 3.16.
Using the data of Table 3.7, the relative speed and relative power dissipation
of each block in Figure 3.16 can be estimated. The word relative means relative to
radix 2. Tables 3.8 and 3.9 show this data. For the CPA, the power dissipation
is proportional to the number of bits added. The delay is also proportional to
the number of added bits. The QDS is a combinational logic block; its power
dissipation is estimated according to [96]:
where,
n and m are the number of inputs and outputs respectively.
$H_i$ and $H_o$ are the entropies of the input and output respectively.
Assuming that all inputs and outputs are equally probable,
$P \propto \dfrac{2^{n+1}}{3n(n + m)} \, m(n + 2m)$
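The equal-probability estimate can be evaluated directly to compare QDS blocks of different sizes. The sketch below assumes the proportionality reads P ∝ 2^(n+1) m (n + 2m) / (3n(n + m)), with n inputs and m outputs; this is a reading of the expression above, not a confirmed transcription of [96]. The function names and the input and output counts in the example are hypothetical and are not taken from Table 3.7.

```python
from fractions import Fraction


def qds_power_estimate(n, m):
    """Entropy-based power figure (arbitrary units) for n inputs and m outputs,
    assuming all inputs and outputs are equally probable."""
    return Fraction(2 ** (n + 1) * m * (n + 2 * m), 3 * n * (n + m))


def relative_qds_power(n, m, n_ref, m_ref):
    """Power of an (n, m) block relative to a reference (n_ref, m_ref) block."""
    return qds_power_estimate(n, m) / qds_power_estimate(n_ref, m_ref)


if __name__ == "__main__":
    # Hypothetical sizes: a 4-input/2-output block versus a 7-input/3-output block.
    print(float(relative_qds_power(7, 3, 4, 2)))
```

Because the estimate grows as 2^n, even a few extra QDS input bits at a higher radix raise the estimated power sharply.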
The delay of the QDS is proportional to the radix. For the Divisor Multiples 
(DM), this module generates one or two outputs (in case of radix 8 and 16) from 
the divisor D through shifting. In the case of radix 8 and 16, these two outputs 
should be added, hence a two-level CSA is required. 
Table 3.8: Relative power dissipation for the different blocks of Figure 3.16.
Table 3.9: Relative delay for the different blocks of Figure 3.16.
The OTF consists of two parts: two registers of N bits each, and a combinational
logic block. The relative power dissipation between these two blocks has to be
determined, so when comparing the relative power dissipation and delay for the







To evaluate the performance of each radix implementation, the relative power
dissipation and the relative delay of each building block for the radix 2 division
algorithm have to be known. This was done through computer simulation. The
results obtained are shown in Table 3.10.
Using the data given in Tables 3.8, 3.9 and 3.10, the power dissipation and the
throughput of any radix division algorithm relative to the radix 2 division algorithm
can be calculated. If it is also possible to scale the voltage to get equal throughput,



























1 + 1  
1 + 2  
1 + 3  
1 + 4  
Table 3.10: Relative power dissipation and delay for the different blocks of
Figure 3.16 for a radix 2 division algorithm.
SCR CPA QDS CSA
Power 7.2 1 0.17 4.9 1
Delay / 1.7 / 1 1 1.2 / 0.85
algorithm can be calculated with the help of Figure 3.2. Figure 3.17 shows the relative
throughput and power dissipation with and without voltage scaling.
Notice from Figure 3.17.a that as the radix order increases the power initially
decreases. This is because the reduction in the block activity is greater than the
increase in the normalized block power. As the radix order increases further, the
increase in the normalized block power exceeds the decrease in the block activity
factor. Hence, the overall power dissipation increases.
Another factor that can be taken into consideration is the increase in throughput
as the radix order increases. This allows a reduction in voltage to equalize
the throughputs, reducing the power dissipation of the higher radix systems. Even
after taking into account the effect of voltage scaling, radix 16 has higher power
dissipation than radix 2. However, with voltage scaling the minimum-power radix
has shifted from 4 to 8.
The reason for the high power dissipation of radix 16 is the complexity of the
QDS module. This is because a is limited to 10, so as to limit the number of output
words from the Divisor Multiples unit to 2, which only requires one extra CSA level.
Allowing a to go up to 14 greatly reduces the complexity and power dissipation of
the QDS, but increases the power dissipation in the Divisor Multiples unit and
in the CSA. The choice of one alternative over the other depends on the relative
power dissipation between the various modules.
Figure 3.17: The effect of the radix on the power dissipation and throughput of the
division algorithm.
(a) Relative power dissipation with and without voltage scaling.
(b) Relative throughput.
(Horizontal axis: radix, 2 to 16.)
3.9.2 Minimizing Add/Sub Operations in Division
In the multiplication process, the minimum number of add/sub operations occurs
when the multiplier is in the minimal signed-digit (SD) representation (having the
minimum number of non-zero digits). One can thus expect the minimum number
of add/sub operations required in the division operation to occur when the resultant
quotient has the minimal SD representation. In the following discussion, we
concentrate on how to generate this minimal SD quotient.
Consider the division operation:
$Q = \dfrac{X}{D}$
where X < D. Based on the value of X, we can proceed to calculate the partial
remainder as follows [97] [98]:
1. If $X < \frac{1}{2}D$, then
$\dfrac{X}{D} = 0 + y$
where $0 \le y < \frac{1}{2}$.
To get the shifted partial remainder, we only have to multiply X by 2.
2. If $\frac{1}{2}D \le X < \frac{3}{4}D$, then
$\dfrac{X}{D} = \dfrac{1}{2} + y$
where $0 \le y < \frac{1}{4}$.
1/2 requires only one digit for its representation. To get the shifted partial
remainder, we subtract D/2 from X and multiply the remainder by four.
3. If $\frac{3}{4}D \le X < D$, then
$\dfrac{X}{D} = \dfrac{3}{4} + y$
where $0 \le y < \frac{1}{4}$. Alternatively,
$\dfrac{X}{D} = 1 - r$
where $0 < r \le \frac{1}{4}$.
The second representation is preferred over the first one, because 1 can be
represented by one digit, while 3/4 needs two digits (i.e. $0.75_{10} = 0.11_2$). In this
case, to get the shifted partial remainder, we subtract D from X and multiply
the remainder by four.
It is clear from the above discussion that X is first compared with $\frac{1}{2}D$ and $\frac{3}{4}D$.
Depending on the comparison result, one of the following two actions is taken:
• Either a subtraction operation is performed and the resulting remainder is
multiplied by 4 (shift to the left by 2) to get the new partial remainder.
• Or no subtraction is required and the partial remainder is multiplied by 2
(shift to the left by 1) to get the new partial remainder.
Now a division algorithm can be formulated as follows:
If < 0 1 2  
As can be seen in the above algorithm, each iteration is divided into two steps,
each of which can be performed in a clock cycle, when implemented in VLSI. In the
first step, the comparison is performed. The addition (subtraction), if required, is
performed in the second step. As a result, a one step iteration produces one quotient
digit, while a two step iteration produces two quotient digits. This technique has
two advantages. First, it allows the use of a shorter clock cycle. Second, the division
period is independent of the number of non-zero quotient digits. Figure 3.18 shows
a block diagram of the proposed division algorithm.
Now we want to show that the obtained SD quotient contains the minimum
number of nonzero digits. For a minimal SD quotient in the canonical form, any
two adjacent digits should contain at least one zero digit [84]. However, for the
quotient representation obtained by the algorithm presented here, it is possible
to get two adjacent 1's or two adjacent $\bar{1}$'s. Consider the case of two adjacent
1's. This is obtained when we have a partial remainder with a value in the range
$[\frac{1}{2}D, \frac{3}{4}D)$, and the following partial remainder is in the range $[\frac{3}{4}D, D)$. In this case,
the resultant sequence of digits is 0 1 1 0. Changing this to canonical form, it
becomes $1\,0\,\bar{1}\,0$, which also contains two nonzero digits. Thus, the SD representation
obtained for the quotient is a minimal SD representation. The average number of
zeros in the quotient is 66.7%.
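The compare-then-subtract procedure described in this section can be illustrated with a behavioural sketch. This is a reconstruction from the prose, not the algorithm listing of the thesis. It works with exact fractions instead of a limited-precision comparison, and it assumes that negative partial remainders are handled by symmetry, producing digits of -1. The function name, the returned tuple, and the operation counter are hypothetical.

```python
from fractions import Fraction


def min_addsub_divide(x, d, n):
    """Signed-digit division that skips the add/sub step when |P| < D/2.

    digits[k] has weight 2**-k. Invariant (for this sketch):
    X/D = sum(digits[k] * 2**-k) + (P/D) * 2**-(pos - 1).
    """
    x, d = Fraction(x), Fraction(d)
    assert abs(x) < d, "the sketch assumes |X| < D"
    digits = [0] * (n + 3)
    p, pos, addsubs = x, 1, 0
    while pos <= n:
        s = 1 if p >= 0 else -1
        mag = abs(p)
        if mag < d / 2:
            # No subtraction: quotient digit 0, shift left by 1.
            p, pos = 2 * p, pos + 1
        elif mag < 3 * d / 4:
            # Subtract D/2: digits (s, 0), shift left by 2.
            digits[pos] = s
            p, pos = 4 * (p - s * d / 2), pos + 2
            addsubs += 1
        else:
            # Subtract D: the digit s goes one position earlier, shift left by 2.
            digits[pos - 1] = s
            p, pos = 4 * (p - s * d), pos + 2
            addsubs += 1
    quotient = sum(Fraction(q, 2 ** k) for k, q in enumerate(digits))
    return digits, quotient, p, pos, addsubs


if __name__ == "__main__":
    digits, quotient, rem, pos, addsubs = min_addsub_divide(Fraction(18, 25), 1, 8)
    print(digits[:9], float(quotient), addsubs)
```

For X/D = 0.72 the first remainder lies in [D/2, 3D/4) and the next one in [3D/4, D), so the leading digits come out as 0 1 1 0, the adjacent-ones case discussed above. Counting `addsubs` over many random inputs is one way to estimate the fraction of zero digits under these assumptions.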
Figure 3.18: Minimum Add/Sub Division Algorithm.
By exploiting the redundancy of the signed-digit representation it is possible to
use a limited precision QDS, which has a lower complexity, and to use a CSA adder.
This leads to lower power dissipation and to faster operation [92]. Figure 3.19 shows
the block diagram of the modified algorithm.
Figure 3.19: Minimum add/sub division algorithm using a limited precision QDS
and a CSA.
The updating of the P and P+ registers proceeds as follows:
1. If an addition operation occurred in the last cycle, the P and P+ registers
are loaded with the values produced by the CPA.
2. If no addition operation occurred in the last cycle, P and P+ are updated as
where,
S and C are the sum and carry registers respectively.
Subscripts n and o denote the new value and the old value of the registers
respectively.
The superscript is the bit position with respect to the partial remainder.
Because it is difficult to find the average number of zeros in the quotient
analytically or even numerically, we had to resort to simulation.
The simulation was performed on half a million randomly generated
numbers. The average percentage of zeros in the quotient was found to be in
the order of 65%, which is quite close to the previous percentage of 66.7%.
The power dissipation of the proposed division algorithm is compared to that of
Radix 2 and Radix 4 division algorithms [93]. Figure 3.16 shows the block diagram
of the Radix 2 and Radix 4 division algorithms [93] used in the comparisons. The
Radix 2 division algorithm generates one bit per iteration. The number of iterations
is equal to the number of digits in the quotient. Each iteration requires one addition
and one QDS operation.
The Radix 4 division algorithm generates two digits per iteration. The number
of iterations is thus half the number of iterations required for the Radix 2 division
algorithm, reducing the number of additions and QDS operations by 50%. However,
the reduction in power dissipation is less than 50%, because of the increase in
complexity as explained previously.
The proposed division algorithm reduces the number of addition/subtraction
operations to 33% of the number required by the Radix 2 division algorithm.
Compared to the Radix 4 division algorithm it is reduced by 33%. However, there is an
increase in the number of QDS operations for the proposed algorithm over that of
the Radix 4 algorithm.
The power, speed and area performance of the different division algorithms have
been compared using computer simulation. A VHDL description has been written
for each module of these algorithms. The VHDL files have then been synthesized
in a 0.8 µm 5 Volt Standard Cell CMOS technology. Through simulation, the
performance of the synthesized algorithms has been measured. Table 3.11 gives the
simulation results for the division algorithms considered.
Table 3.11: Performance comparisons of the different division algorithms. The
power is measured at a speed of 15 M division operations per second.
Features             Radix 2 Algorithm   Radix 4 Algorithm   Proposed Algorithm
Power (mW)           13.2                8.5                 7.2
Area (µm)            1600                1900                2000
Speed (M words/s)    1.10                1.45                1.40
The power saving in the proposed division algorithm over that of the Radix 4
division algorithm is less than 33%. This is because some units, which operate
during the first clock cycle, operate for 67% of the time. It has been found through
computer simulation that the power saving for the proposed algorithm over the
Radix 4 algorithm is 15%.
When the power dissipation of the proposed division algorithm is compared to
that of the Radix 2 division algorithm, the power saving is found to be 45%.
3.10 Reducing the Computational Complexity:
Vector Quantization Example
Vector quantization [88] is a compression technique. It exploits the correlation
that exists between successive samples by quantizing a group of successive samples
together. The encoder of a vector quantizer finds the representation vector closest
to the quantized vector. The index of this vector is transmitted to the decoder.
The decoder finds the value of the representation vector using a look-up table or
through computation.
The search process performed by the encoder requires large computational
capabilities to be performed exactly. However, approximate search algorithms exist,
which greatly reduce the computational complexity with only a small degradation
in performance. In this section, Full-Search Vector Quantization (FSVQ) and
Tree-Structured Vector Quantization (TSVQ) [88] are considered.
Consider a vector X consisting of 8 samples. Each sample consists of 6 bits.
This vector is to be approximated by the closest representation vector in a set $\{C_i\}$
of 64 representation vectors. That is, the compression ratio is 8:1. The closest
representation vector to vector X is the one having the minimum square error.
Full-Search Vector Quantization requires the calculation of the square error for
every representation vector, then comparing the square errors to find the index of
the representation vector with minimum square error. This always finds the nearest
neighbour; however, it requires great computational capability, as can be seen from
Table 3.12.
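The full search described above can be sketched in a few lines. This is an illustrative software model under simple assumptions, not the hardware architecture evaluated in the thesis: the square error is taken as the sum of squared sample differences, vectors are plain Python lists, and the function names are hypothetical.

```python
def squared_error(x, c):
    """Sum of squared differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def full_search(x, codebook):
    """Return (index, error) of the representation vector closest to x.

    Every entry of the codebook is examined, so the cost is one
    squared-error evaluation per representation vector.
    """
    best_index, best_error = 0, squared_error(x, codebook[0])
    for index in range(1, len(codebook)):
        error = squared_error(x, codebook[index])
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


if __name__ == "__main__":
    # Tiny hypothetical codebook of eight-sample vectors with 6-bit samples.
    codebook = [[0] * 8, [10] * 8, [63] * 8]
    print(full_search([9, 11, 10, 10, 8, 12, 10, 10], codebook))
```

With 64 representation vectors of 8 samples each, this search performs 64 squared-error evaluations per input vector, which is the cost that the tree-structured search reduces.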
In Tree-Structured Vector Quantization, the search is performed in stages [88].
During each stage, a subset of the representation vectors is eliminated from
consideration by a relatively small number of operations. In general, consider a tree
n stages deep and having a branching factor m (the number of branches leaving a
node) as shown in Figure 3.20. Each node in the tree has m vectors, one for
each of its m branches; the branch whose vector gives the minimum square
error is chosen. The representation vectors corresponding to all the other branches
are eliminated.
Figure 3.20: Tree-Structured Vector Quantization.
The total number of representation vectors is given by
$N = m^n$
The computational complexity of this algorithm is proportional to $m \log_m(N)$ [88].
Minimum computational complexity is achieved at m = e. But m has to be an
integer and preferably a power of two. Hence, m = 2 or 4 is an optimum choice
for minimum computational complexity. However, higher values of m give better
performance but at the expense of greater computational complexity [88] [99].
Table 3.12 compares the computational requirements for various TSVQ algorithms
and that of the FSVQ algorithm.
Table 3.12: Computational complexity and memory requirement of VQ encoding
algorithms. The VQ algorithm encodes a 64-level eight-sample vector into one of
64 representation vectors.
Algorithm
FSVQ
TSVQ m = 2
TSVQ m = 4
TSVQ m = 8
Notice from Table 3.12 that while the computational complexity decreases, the
ROM size increases, which can lead to an increase in the power dissipation.
Another factor in choosing the vector quantization algorithm is the performance of
the algorithm; generally, the lower the computational complexity, the lower the
performance. However, for TSVQ the degradation in performance can be quite small
[88].
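The m log_m(N) complexity of the tree search can be checked with a small calculation. The sketch below counts squared-error evaluations per input vector, assuming a full tree with N = m^n leaves and m evaluations at each of the n stages. This count is an illustrative measure only and is not the set of figures reported in Table 3.12; the function name is hypothetical.

```python
def tsvq_distance_count(num_vectors, m):
    """Squared-error evaluations per input vector for a full m-ary tree.

    Requires num_vectors to be an exact power of m (N = m**n); the search
    then visits n stages and evaluates m candidate vectors at each stage.
    """
    depth, leaves = 0, 1
    while leaves < num_vectors:
        leaves *= m
        depth += 1
    if leaves != num_vectors:
        raise ValueError("num_vectors must be an exact power of m")
    return m * depth


if __name__ == "__main__":
    for m in (2, 4, 8, 64):
        print(m, tsvq_distance_count(64, m))
```

For N = 64, branching factors of 2 and 4 both need 12 evaluations and a branching factor of 8 needs 16, against 64 for a single-stage (full) search, which illustrates why m = 2 or 4 minimizes the computation.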
3.11 Chapter Summary 





Reducing the power dissipation of CMOS circuits is a necessity for the design
of portable systems. In this chapter, the sources of power dissipation in CMOS
circuits, the methods of estimating the power dissipation, and examples of
low-power electronic systems were considered.















Reduction of the power dissipation can be achieved at the various design levels.
Some of the techniques used to lower the power dissipation at the device and circuit
levels were presented in this chapter. At the architectural and algorithmic levels,
there is great opportunity to further lower the power dissipation.
For example, through architectural changes such as pipelining and parallelism
or the use of fast adders, it is possible to increase the speed of the architecture and
hence reduce the voltage to maintain the same throughput and lower the power
dissipation. The effects of pipelining, parallelism, and a combination of both were
considered for three different architectures of the discrete cosine transform (DCT).
One of these architectures uses a fast DCT algorithm [89]. The other two depend
on the use of distributed arithmetic [90].
The choice of the number system can influence the power dissipation. The
Gray code representation and its power dissipation relative to the unsigned binary
representation were considered. Reducing the block activity factor is another way to
reduce the power dissipation. This can be done by the choice of a higher radix, or
by the choice of a number representation that minimizes the number of operations
required per output. Finally, the effect of using approximate search algorithms for
vector quantization, such as the tree search vector quantization algorithm [88], was
considered.
Chapter 4 
Subband Coding: A Low-Power 
Design 
4.1 Introduction 
The increase in demand for mobile telecommunication systems and the limited
bandwidth allocated to these systems have driven research into innovative techniques
to increase the spectral efficiency of mobile systems. Some of these techniques are
related to the architecture of the mobile network [100] [101], while others are based
on the compression of the user information transmitted across the mobile network.
Power efficiency is of utmost importance when designing compression algorithms
for mobile terminals. Compression algorithms, and in particular video compression
algorithms, demand great computational capability [9], which in turn leads
to greater power dissipation. The desire to have multimedia portable equipment
has motivated work towards low-power implementations of video compression
algorithms [102].
CHAPTER 4. SUBBAND CODING: A LOW-POWER DESIGN 
In this chapter, the design of a low-power subband coding image compression
algorithm is investigated. Section 4.2 is a brief overview of video/image compression
algorithms. In section 4.3, the basics of the subband image compression algorithm
are reviewed.
In section 4.4, the effect of the performance-power tradeoff for subband coding is
considered. The structure of the analysis/synthesis filter system used in the
low-power subband coding image compression algorithm is developed. A filtering
structure with a small number of taps is used. The statistical properties of each subband
signal are obtained. These are required to calculate the number of bits allocated to each
subband. A power-efficient vector quantization algorithm, along with the architecture
used to implement it, is also developed in this section.
Finally, in section 4.5, the performance of the new subband coding algorithm
and its power dissipation are evaluated and compared to those of conventional
subband coding image compression algorithms.
4.2 Video Compression Algorithms 
The information transmitted over the mobile network can be divided into three
categories: data, audio, and video (and image). Each type of information has its
own characteristics, and the corresponding compression algorithm should satisfy
certain requirements. For data, it is important that the compression algorithm
introduces no errors, so a lossless compression algorithm must be used in this case.
For audio and video, some noise is tolerable. The amount and spectral content of
this noise depend on the characteristics of the human auditory and visual systems.
The compression algorithm needs to take into account the type of correlation
(redundancy) in the signal to be compressed. Audio signals have one-dimensional
correlation (temporal redundancy). Images have two-dimensional correlation
(spatial redundancy), while video signals have both spatial and
temporal redundancy.
The target of image/video compression is to reduce the bit-rate required to 
transmit the signal, while maintaining its quality. A digital picture at TV resolution 
requires about one million bytes without compression. Hence direct use of digital 
transmission or storage will not be efficient. Current image/video compression
standards offer compression ratios from 1/10 to 1/50 without affecting the image
quality.
The following values give the relation between the amount of compression and
the quality of the video signal [103]:
• 0.25-0.5 bpp (bits per pixel): moderate to good quality. Adequate for some
applications.
• 0.5-0.75 bpp: good to very good quality. Adequate for many applications.
• 0.75-1.5 bpp: excellent quality. Adequate for most applications.
• 1.5-2.0 bpp: usually indistinguishable from the original. Adequate for the most
demanding applications.
Any video compression algorithm can be divided into three parts [104], as shown
in Figure 4.1:
1. Signal processing. This is required to prepare the signal for quantization so
that it can give better performance. For example:
• DCT for JPEG [103].
• Motion Compensated DCT (MC-DCT) for H.261, MPEG-1 and MPEG-2.
• Subband filtering for subband coding of images [107].
2. Quantization. This is where all the lossy compression occurs. The quantization
can be:
• Scalar quantization.
• Vector quantization [88]. The complexity of the vector quantization
algorithm used is a function of the required SNR, the required compression
ratio and the allowed system complexity, which is determined by factors
such as cost and power dissipation.
3. Lossless coding. This is where lossless compression occurs. Examples of this
type of coding include zero-runlength coding and Huffman coding.
Figure 4.1: General block diagram of an image/video compression algorithm.
(Block labels: Video Signal, Quantization, Lossless Coding, Compressed Bit Stream.)
In the signal processing part, we convert two correlated random variables into
two new random variables with little or no correlation between them. This can be
done by:
1. Linear prediction [88].
2. Orthogonal transformation, e.g. DCT [103].
3. Subband filtering and wavelet transform [108] [109].
Discrete cosine transform (DCT) based compression algorithms such as JPEG
and MPEG are computationally intensive. A two-dimensional DCT algorithm requires
on the order of $N^2 \log_2 N$ multiplications [9] [110]. Distributed arithmetic [90]
can also be used in the implementation of the DCT [111] [112]. Distributed arithmetic
architectures dissipate more power but have a higher throughput [113].
Subband coding has the potential of having a low computational complexity 
and hence low computational power dissipation. This is at the expense of some 
degradation in the performance of the algorithm. In this chapter, subband coding 
image compression algorithms are investigated. An implementation with lower
complexity and hence lower power dissipation is presented. 
4.3 Subband Coding for Image Compression 
Subband coding was originally introduced for digital speech coding in [114]. One of
the operations required to transmit speech digitally is quantization. Direct quanti- 
zation introduces noise which is spread equally over most of the speech spectrum. 
However, the quantization noise is not equally detectable at all frequencies. Di- 
viding the signal into subbands and quantizing each one of these subbands inde- 
pendently offers greater control over the spectrum of the quantization noise. Fre- 
quency bands with higher subjective importance are coded with higher resolution 
than other bands. Subband coding was considered for video and image applications 
in [109] [107] [115].
The basic idea of subband coding is to decompose the image into several subbands
using a filter bank and to quantize and code the subbands instead of the
original image. The quantization characteristics of the various bands can be made
to match the psycho-visual characteristics of the human visual system, such that
the higher spatial frequency components are quantized with a larger quantization
step size than the lower ones.
Subband coding, unlike DCT-based coding techniques, doesn't introduce blocking
artifacts. This is because DCT-based systems process adjacent blocks
separately, while subband systems process overlapping blocks of the signal. Subband
coding allows bit allocation according to the spectral importance of each band.
Subband coding systems consist of two parts: 
1. The analysis/synthesis filter banks. 
2. The coding system which determines how the subbands are quantized and 
coded. Examples of coding systems include: 
• Predictive coding (DPCM) [107].
• VQ within subbands.
• VQ across subbands.
• Predictive VQ.
Analysis/synthesis filter banks [115] - [120] divide the signal into subbands
and then reconstruct the signal from its subband components. To be useful for
image applications, the filters have to be two-dimensional. A 4-band 2D filter bank
is shown in Figure 4.2. The output of each bank corresponds to a certain part of
the 2D spectrum, as shown in Figure 4.3. After filtering, decimation by a factor of
two in each dimension is required; this makes each subband signal a fullband one
at the lower sample rate.
Figure 4.3: Frequency partitioning among the different bands. (Axis label: vertical frequency.)
Figure 4.4: Block diagram of a one-level 2D subband analysis filter bank. (Stage labels: horizontal, vertical.)
2D filter banks can be implemented using separable filters [115], which have the
advantage of computational simplicity but lack the directional capability of
nonseparable 2D filter banks. Separable 2D filter banks are implemented using a
two-level 1D filter bank, as shown in Figure 4.4 [109].
It is possible to cascade the filter banks and continue the frequency band division
process to any desired degree. It is common to apply the frequency band division to
the low frequency band and leave the high frequency bands undivided [109], because
the low frequency subband is correlated and contains most of the energy.
The filter banks used in subband image coding have to satisfy the following
criteria:
1. Alias-free operation. This is not guaranteed due to the impossibility of
implementing ideal lowpass or highpass filters.
2. Perfect reconstruction. This involves:
• Avoiding amplitude distortion.
• Avoiding phase distortion.
It has been shown [116] [121] that the following equations are necessary and
sufficient conditions to satisfy the above criteria:
1. To remove alias distortion: 




go(4 = -(-qn9i(n) 
2. To remove amplitude and phase distortions: 
~ ~ ( e ~ " ) G ~ ( e j " )  - ~ ~ ( - e j ~ ) ~ ~ ( - e " )  = e-jw6 
If we take, 
where $H_0(e^{j\omega})$ and $H_1(e^{j\omega})$ are the frequency responses of the lowpass and
highpass filters on the analysis side, respectively, while $G_0(e^{j\omega})$ and $G_1(e^{j\omega})$ are
the frequency responses of the lowpass and highpass filters on the synthesis side,
respectively.
Consider now the case of a symmetric FIR filter with N taps, 
N must be even, and the distortion free condition becomes, 
An analysis/synthesis filter bank satisfying Equations 4.1, 4.3, 4.6 and 4.9 pro- 
duces an output which is an exact replica of the input except for a delay. 
The other part of the subband image coding system is the quantization part
[107] [109] [122]. To quantize the subbands efficiently, the statistical characteristics
of the subband signals have to be investigated. The image signal exhibits a
great deal of correlation in both the vertical and horizontal directions. In fact, the
autocorrelation function (acf) [123] can be approximated by a separable negative
exponential function [109].
The low frequency subband acf can be fitted to this negative exponential distribution,
while the degree of correlation in the higher frequency subbands becomes
weaker [109] [122]. Hence, it has been suggested [122] [124] to use DPCM for the
lower frequency subband and to use PCM for the other subbands.
The probability distribution function of the prediction error for the lower frequency
subband and of the actual samples for the other subbands was found to
follow the Generalized Gaussian pdf [lOS]. This pdf is given by
Where,
$\Gamma(\cdot)$ is the Gamma function. The value of $\gamma$ depends on the subband. For the
prediction error of the low frequency subband, $\gamma = 0.75$. For the other subbands,
$\gamma = 0.5$. Knowing the variance and probability distribution of each subband, we can
allocate bits to the subbands [125] accordingly and determine the quantization
intervals using the Max-Lloyd algorithm [88] [126].
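As a rough numerical illustration of the role of the shape parameter, the sketch below evaluates a zero-mean generalized Gaussian density in its commonly used textbook parameterization. The function names, the argument names `sigma` and `gamma`, and the simple midpoint integrator are assumptions made for this example; the normalization shown may differ in notation from the expression used in the thesis.

```python
import math


def generalized_gaussian_pdf(x, sigma, gamma):
    """Zero-mean generalized Gaussian density (textbook form, assumed here).

    sigma is the standard deviation and gamma the shape parameter:
    gamma = 2 gives a Gaussian, gamma = 1 a Laplacian.
    """
    # alpha is chosen so that the variance of the density equals sigma**2.
    alpha = math.sqrt(math.gamma(3.0 / gamma) / math.gamma(1.0 / gamma)) / sigma
    norm = gamma * alpha / (2.0 * math.gamma(1.0 / gamma))
    return norm * math.exp(-((alpha * abs(x)) ** gamma))


def integrate(f, lo, hi, steps):
    """Midpoint-rule integration, good enough for a sanity check."""
    h = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * h) for k in range(steps)) * h


if __name__ == "__main__":
    for gamma in (0.75, 0.5):  # shape values quoted in the text
        area = integrate(lambda x: generalized_gaussian_pdf(x, 1.0, gamma),
                         -60.0, 60.0, 240000)
        peak = generalized_gaussian_pdf(0.0, 1.0, gamma)
        print(f"gamma={gamma}: peak={peak:.3f}, area={area:.4f}")
```

For a fixed variance, smaller values of the shape parameter give a sharper peak at zero and heavier tails, which is the behaviour a quantizer for the upper subbands would need to account for.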
Knowing the behaviour of the signal statistically allows the investigation of
tradeoffs in computational complexity for the sake of lower power dissipation with
minimum effect on performance. Furthermore, knowing the statistics of the signal
allows an estimation of the switching activity and hence an estimation of the power
dissipation.
4.4 Performance-Power Tradeoff for Subband 
Coding 
Any subband image compression algorithm consists of two subsystems: the analysis/synthesis
subsystem and the coding subsystem. The complexity of the analysis/synthesis
subsystem depends on the number of filter bank levels and on the
length (number of taps) of the filters used.
Figure 4.5: A 16-subband 2-level analysis/synthesis system.
Figure 4.6: A 7-subband 2-level analysis/synthesis system.
To lower the complexity and hence the power dissipation, it is desirable to
decrease the number of filter bank levels used. In [126], the optimum SNR was
found to be for a two-level system consisting of 16 subbands, shown in Figure 4.5.
A two-level system consisting of only 7 subbands, with the lower frequency subband
of the first level being the only one divided into smaller subbands, as shown in
Figure 4.6, has a 1 dB degradation relative to the 16-subband system [126].
The hardware complexity of the analysis/synthesis system shown in Figure 4.6
is 60% lower than that of Figure 4.5. To compare the computational complexity,
we have to consider the rate at which each filter bank operates. The filter banks
in the second level operate at a quarter of the speed of the filter banks in the first
level due to decimation. Hence, the reduction in computational complexity for the
7-subband system over that of the 16-subband system is 37.5%.
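The two percentages above follow from counting 2D filter banks per level and weighting each level by its sample rate. The sketch below reproduces that arithmetic; the function name and the list encoding of a filter bank tree are assumptions of this example, and every 2D filter bank is assumed to have the same cost.

```python
def filter_bank_cost(banks_per_level):
    """Return (hardware, computation) for a cascade of 2D filter banks.

    banks_per_level[k] is the number of 2D filter banks at level k
    (level 0 is the first level). Each level runs at one quarter of the
    rate of the previous one because of decimation by two in each dimension.
    """
    hardware = sum(banks_per_level)
    computation = sum(count * 0.25 ** level
                      for level, count in enumerate(banks_per_level))
    return hardware, computation


# 16 subbands: all four first-level subbands are split again.
hw16, ops16 = filter_bank_cost([1, 4])
# 7 subbands: only the low frequency subband is split again.
hw7, ops7 = filter_bank_cost([1, 1])

hardware_saving = 1 - hw7 / hw16      # fraction of filter banks removed
computation_saving = 1 - ops7 / ops16  # fraction of operations removed

if __name__ == "__main__":
    print(f"hardware saving: {hardware_saving:.1%}")
    print(f"computation saving: {computation_saving:.1%}")
```

The hardware count drops from five filter banks to two, while the rate-weighted operation count drops from 2 to 1.25 first-level equivalents.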
The length of the filters also determines the complexity of the analysis/synthesis
subsystem. In [126], it was found that the improvement in the SNR for FIR filters
with more than 8 taps doesn't exceed 1 dB, and the improvement in SNR for filters
with more than 12 taps doesn't exceed 0.2 dB. This indicates that it is possible to
achieve good SNR performance with filters of reasonable length.
There are several ways to do coding in the subband image compression system.
Using scalar quantization [107] [122] is the least complex scheme and hence it is
expected to have the least power dissipation. Vector quantization coding [7] [127]
[128] [129] has greater complexity but superior performance.
4.4.1 The Analysis/Synthesis Filter
For video applications, the analysis/synthesis filter banks have to be two-dimensional.
A two-dimensional filter bank can be decomposed into two levels of
one-dimensional filter banks, as shown in Figure 4.4. For perfect reconstruction,
Equations 4.1, 4.3, 4.6, and 4.9 have to be satisfied. To reduce the design complexity
and hence the computational power dissipation, the filters should have the smallest
number of taps. Take M, the number of taps, to be two. The set of filters satisfying
the perfect reconstruction conditions [116] [117] is given by:
These filters simply find the sum and difference between two successive samples.
Using the average and half the difference at the receiver side, it is possible to
reconstruct the original signal perfectly in the absence of quantization error. What
makes this filter intuitively pleasing is that video signals are highly correlated, and
hence most of the energy is compacted into the average component, making that of
the half-difference component quite small.
Another advantage of these filters, from the power dissipation point of view, is
that they need no multipliers. Multiplication by half is just a shift right operation,
which can be accomplished by the reordering of the datapath.
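The sum/difference idea can be sketched in a few lines for a one-dimensional integer signal. This is an illustration only: the function names are hypothetical, and the placement of the factor of one half (entirely on the synthesis side, as a right shift) is an assumption of this sketch rather than the scaling used in the thesis.

```python
def analyze(samples):
    """Split an even-length integer sequence into sum and difference subbands."""
    sums = [samples[k] + samples[k + 1] for k in range(0, len(samples), 2)]
    diffs = [samples[k] - samples[k + 1] for k in range(0, len(samples), 2)]
    return sums, diffs


def synthesize(sums, diffs):
    """Rebuild the sequence using additions, subtractions and right shifts only."""
    out = []
    for s, d in zip(sums, diffs):
        # s + d = 2 * first sample and s - d = 2 * second sample,
        # so both right shifts are exact (no rounding takes place).
        out.append((s + d) >> 1)
        out.append((s - d) >> 1)
    return out


if __name__ == "__main__":
    signal = [100, 102, 101, 99, 30, 31, 200, 198]
    sums, diffs = analyze(signal)
    print("sums :", sums)
    print("diffs:", diffs)
    print("reconstructed:", synthesize(sums, diffs))
```

For a correlated input such as the one above, the difference subband holds only small values, which is the energy compaction property the text relies on.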
Figure 4.7 shows the frequency spectrum of both the lowpass and the highpass
filters. Notice that the oversimplified structure has led to a poor frequency
response. However, it remains to be seen whether the use of efficient coding
can compensate for the poor frequency response. Remember that the filter banks
in themselves introduce no distortion. The distortion is actually produced during
quantization.
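To make the remark about the poor frequency response concrete, the sketch below evaluates the magnitude response of a two-tap sum filter and a two-tap difference filter. The tap values (1, 1) and (1, -1) are assumed here without any scaling, and the names `LOWPASS`, `HIGHPASS` and `magnitude` are hypothetical.

```python
import cmath
import math

LOWPASS = (1, 1)    # sum of two successive samples (scaling assumed)
HIGHPASS = (1, -1)  # difference of two successive samples


def magnitude(taps, omega):
    """Magnitude of the FIR frequency response at angular frequency omega."""
    return abs(sum(h * cmath.exp(-1j * omega * n) for n, h in enumerate(taps)))


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        w = fraction * math.pi
        print(f"{fraction:.2f}*pi  low={magnitude(LOWPASS, w):.3f}  "
              f"high={magnitude(HIGHPASS, w):.3f}")
```

With these taps the lowpass magnitude is 2|cos(ω/2)| and the highpass magnitude is 2|sin(ω/2)|. At the band edge ω = π/2 each response has fallen only to about 0.707 of its peak, so the two bands overlap heavily.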
Figure 4.7: Frequency spectrum of the simplified filters.
Figure 4.8 shows the simplified subband coding algorithm. Initially, the image
is divided up into partitions, each containing 4 samples S0 ... S3. These samples are
transformed into the variables A, B1, B2 and B3 according to the following set of
equations:
The samples corresponding to the variable A are partitioned into groups of 4.
These samples are then transformed into the variables C, D1, D2 and D3 according
to the following set of equations:
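One way such a 2x2 block transform could look is sketched below, assuming the two-tap sum/difference filters are applied separably along rows and columns. The sample layout (S0 and S1 on the top row, S2 and S3 on the bottom row), the assignment of B1, B2 and B3 to the horizontal, vertical and diagonal differences, and the absence of any scaling on the analysis side are all assumptions of this sketch and are not taken from the text.

```python
def split_2x2(s0, s1, s2, s3):
    """Map one 2x2 block to (A, B1, B2, B3) with additions and subtractions only."""
    a = s0 + s1 + s2 + s3    # low-low component
    b1 = s0 - s1 + s2 - s3   # horizontal difference (assumed naming)
    b2 = s0 + s1 - s2 - s3   # vertical difference (assumed naming)
    b3 = s0 - s1 - s2 + s3   # diagonal difference (assumed naming)
    return a, b1, b2, b3


def merge_2x2(a, b1, b2, b3):
    """Invert split_2x2; every sum is a multiple of 4, so the shifts are exact."""
    s0 = (a + b1 + b2 + b3) >> 2
    s1 = (a - b1 + b2 - b3) >> 2
    s2 = (a + b1 - b2 - b3) >> 2
    s3 = (a - b1 - b2 + b3) >> 2
    return s0, s1, s2, s3


if __name__ == "__main__":
    block = (120, 122, 119, 121)
    coeffs = split_2x2(*block)
    print("coefficients:", coeffs)
    print("reconstructed:", merge_2x2(*coeffs))
```

Applying the same function again to groups of four A values would give a second level in the same spirit, again under the assumptions stated above.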
4.4.2 Statistical Properties of the Subband Coded Signal 
The image signal is a highly correlated signal. Hence, the energy
content in the low frequency part of the spectrum is expected to be much higher than that in the
high frequency part. Figure 4.10 shows the statistical distribution of the first-level
subband signals of the aeroplane image shown in Figure 4.9. The low frequency subband
(LL1) is further decomposed into four subband signals, the statistical distributions
of which are shown in Figure 4.11.
Table 4.1 gives the variance for each subband of the two-level decomposed image. 
For LL2, the variance of the successive sample difference is given. For that subband,
the adjacent samples are still correlated; hence DPCM is used to encode it [122]. For
the other subbands, PCM or vector quantization is used depending on the number 
of bits allocated to that subband. The vector quantization algorithm is explained 
in the next section. 
Figure 4.8: Simplified subband coding algorithm.
Figure 4.9: Aeroplane: The image used in subband coding.
Table 4.1: Variance for the two-level subband image compression system.





Figure 4.11: Statistical distribution of level two subband signals.
(a) LL2. (b) HL2. (c) LH2. (d) HH2.
4.4.3 The Vector Quantization Algorithm
Vector quantization groups samples together; codes are assigned to the most likely
patterns in the sequence of samples in such a way that the mean square error (MSE)
is minimized. The vector quantization algorithm needs to be designed to minimize
the power dissipation during decoding. The decoding of most VQ algorithms requires
a memory lookup table [88]. However, memory access has large power dissipation
[7]. Hence, to be power-efficient, the decoding of the VQ algorithm should
be done in a computational way, with the least amount of computation.
The algorithm considered here is a simplified modification of the pyramid vector 
quantization algorithm (PVQ) [7]. In the case of subband coding, the samples of 
the upper frequency subbands are usually close to zero except near the edges.
For a vector of length N, assume that at most two samples are nonzero. Each
nonzero sample is encoded using two bits. The total number of possible ways in
which two and only two samples out of N can be nonzero is given by:
$$\binom{N}{2} = \frac{N(N-1)}{2}$$
In addition, there are N + 1 alternatives in which only one sample or no samples
are nonzero. Assuming that $N = 2^n$, it can be shown that the total number of bits
required to encode one vector is 2n + 3. Hence, the number of bits per sample is
given by:
$$\frac{2n + 3}{2^n}$$
Clearly, increasing n leads to higher compression. In the following, it will be
explained how the information of one vector is encoded into the 2n + 3 bits, in such
a way that its decoding requires minimum computation.
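The claim that 2n + 3 bits suffice can be checked by counting codewords: every choice of nonzero positions is combined with every choice of 2-bit sample values. The helper names below are hypothetical, and the count assumes that each 2-bit code maps to a distinct nonzero value.

```python
from math import comb


def codeword_count(n, bits_per_value=2):
    """Number of distinct vectors of length N = 2**n with at most two nonzero samples."""
    N = 1 << n
    values = 1 << bits_per_value           # possible values of one nonzero sample
    two_nonzero = comb(N, 2) * values ** 2
    one_nonzero = N * values
    all_zero = 1
    return two_nonzero + one_nonzero + all_zero


def bits_per_vector(n):
    """Smallest number of bits that can index every codeword."""
    return (codeword_count(n) - 1).bit_length()


if __name__ == "__main__":
    for n in range(2, 7):
        bits = bits_per_vector(n)
        print(f"n={n}  N={1 << n}  bits/vector={bits}  bits/sample={bits / (1 << n):.4f}")
```

The total comes to 2^(2n+3) - 2^(n+2) + 1 codewords, which is just under 2^(2n+3), so the 2n + 3 bit budget is almost fully used.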
Assume that two samples are nonzero; let i be the most significant of these
samples and let j be the least significant. Most significant and least significant
refer to the position of the sample in the vector. Divide the 2n + 3 bits required
for each vector into three parts. The first part is n - 1 bits long; assume these to be
the most significant bits, and denote them by L1. The second part is n bits long,
and is denoted by L2. The third and final part is 4 bits long and is denoted by L3.
If
$i = k$
then j can range from 0 to k - 1, that is, j can assume any one of k values. If
$i = 2^n - k$
then j can range from 0 to $2^n - k - 1$, that is, j can assume any one of $2^n - k$ values.
It is thus evident that the two previous values of i are complementary in the sense
that the total number of values j can take in both cases is $2^n$. Hence, it is reasonable to group
these two values together. The value of i is determined by L1. In this case:
Where OC is the one's complement operation assuming n bits. To determine
which value of i to choose, we have to look at L2. If,




The field L3 determines the values of the two nonzero samples, two bits per
sample.
The above equations are valid for L1 = 0, 1, ..., $2^{n-1} - 2$. However, if L1 =
$2^{n-1} - 1$, the following set of equations is used instead:
1. If the most significant bit of L2 is zero.
and
2. If the most significant bit of L2 is one. In other words, the most significant
n bits of the decoded word are one. Then there is at most one nonzero
sample.
• If the second most significant bit of L2 is one, all the N samples are
zeros.
• If the second most significant bit of L2 is zero, one and only one sample
of the vector is nonzero. The address of this sample is determined by
the n - 2 least significant bits of L2 and the two most significant bits of
L3. In this case, the two least significant bits of L3 determine the value
of the nonzero sample.
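The two special cases in item 2 are described completely enough to be sketched in code. The function below follows the bit-field description in the list above. It is a sketch under stated assumptions: the function name is hypothetical, `value_rom` is a hypothetical 4-entry table standing in for the sample-value lookup, and the order in which the address bits are concatenated (bits from L2 first, then bits from L3) is assumed. The two-nonzero-sample case is deliberately left out.

```python
def decode_sparse_vector(word, n, value_rom):
    """Decode the 'all zero' and 'one nonzero sample' cases of a (2n+3)-bit word.

    Returns a list of N = 2**n samples, or None when the word belongs to the
    two-nonzero-sample case, which this sketch does not cover.
    """
    if n < 2:
        raise ValueError("the field layout needs n >= 2")
    total_bits = 2 * n + 3
    l3 = word & 0xF                         # last 4 bits
    l2 = (word >> 4) & ((1 << n) - 1)       # middle n bits
    top_n = word >> (total_bits - n)        # L1 plus the MSB of L2
    if top_n != (1 << n) - 1:
        return None                         # two nonzero samples
    samples = [0] * (1 << n)
    if (l2 >> (n - 2)) & 1:                 # second MSB of L2 set: all zeros
        return samples
    # Address: n-2 LSBs of L2 followed by the two MSBs of L3 (order assumed).
    address = ((l2 & ((1 << (n - 2)) - 1)) << 2) | (l3 >> 2)
    samples[address] = value_rom[l3 & 0x3]  # value from the two LSBs of L3
    return samples


if __name__ == "__main__":
    rom = [-2, -1, 1, 2]                    # hypothetical sample values
    print(decode_sparse_vector(0b111010110, 3, rom))
    print(decode_sparse_vector(0b111100000, 3, rom))
```

No addition is needed in either of these cases; the address is formed purely by selecting bits, which is consistent with the emphasis on decoding with minimum computation.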
Now that the algorithm has been described, it remains to be seen how it can be
mapped into a power-efficient architecture. Assume that the word to be decoded,
of length 2n + 3 bits, is stored in register R, which has three fields R1, R2 and R3,
corresponding to L1, L2 and L3 respectively. Let RN be the vector corresponding
to the N samples. Initially, the samples of this vector are set to zero.
Let A0 be the ANDing of the n + 2 most significant bits of R, and let A1 be
the ANDing of the n + 1 most significant bits of R. If A0 is high, the N samples of
the vector are all zeros, which is the value already contained in RN. All the other
functional units are deactivated in this case.
If A1 is high while A0 is low, then one and only one sample is nonzero. The
address of this sample is determined by the n - 2 least significant bits of R2 and the
two most significant bits of R3, while the value of the nonzero sample is determined
by the two least significant bits of R3. In addition to the AND operation, one
ROM access is required to determine the value of the nonzero sample. One MUX
operation is required to determine the address of the nonzero sample. Finally, one
write operation into a bank of N registers is required.
Finally, if A1 and A0 are both zeros, there will be two nonzero samples and
their addresses have to be calculated. This process requires two ROM accesses,
two ADD operations, one INV operation, three MUX operations, and two write
operations into a bank of N registers. The architecture required to implement the
proposed vector quantization decoding algorithm is shown in Figure 4.12.
4.5 Performance of the Subband Coding Algorithm
Two cases of the subband algorithm are considered. The first is the one-level
subband algorithm. The second is the two-level subband algorithm, in which the second
level is the decomposition of the low-frequency subband of the first level.
Table 4.2 gives the variance of each subband, the theoretical number of bits that
should be allocated to each subband [88], and the actual number of bits allocated
to each subband for a one-level subband system. For the low frequency subband,
the variance shown is that of the differential signal. The bit rate is 1.025 bits/pixel.
The overall peak signal-to-noise ratio using the aeroplane image was found to be 24 dB.
This is about 8 - 9 dB lower than subband coding systems using an FIR filter having
more than 8 taps [126].
For the one-level subband coding image compression algorithm, DPCM is used
to encode the signal of subband LL1. The proposed VQ algorithm with n = 4 is 
used to encode subband HL1. The proposed VQ algorithm with n = 5 is used to 
encode subband LH1. 
Figure 4.12: The architecture of the proposed VQ decoding algorithm.
Table 4.2: Variance and bit allocation for a one-level subband system.
Table 4.3 gives the variance of each subband, the theoretical number of bits that
should be allocated to each subband [88], and the actual number of bits allocated
to each subband for a two-level subband system. For the low-frequency subband,
the variance shown is that of the differential signal. The bit rate is 1.016 bits/pixel.
The overall peak signal-to-noise ratio using the aeroplane image was found to be 28 dB.
Note that the use of a two-level subband system gave a 4 dB improvement in the SNR over
the one-level subband system. However, this is still 4 - 5 dB lower than subband coding
systems using an FIR filter having more than 8 taps.
For the two-level subband coding image compression algorithm, DPCM is used 
to encode the signal of subband LL2. PCM is used to encode the signals of subbands 
HL2 and LH2. The proposed VQ algorithm with n = 3 is used to encode the signal 
of subband HL1. The proposed VQ algorithm with n = 4 is used to encode the 
signal of subband LH1. 
Figure 4.13 shows the aeroplane after passing through a two-level subband
compression/decompression system. The quality of the signal is lower than that of other
subband image compression algorithms. But a large reduction has been achieved in
[...]
operation for each 1D filter, while a subband system using 8-tap FIR filters usually
requires 7 addition/subtraction operations and 4 multiplication operations (assuming
a symmetric filter) for each 1D filter. Assuming that the 2D filter is separable,
then each 2D filter consists of six 1D filters, as shown in Figure 4.4.
Figure 4.13: Aeroplane: The effect of the proposed two-level subband coding image
compression algorithm.
Table 4.3: Variance and bit allocation for a two-level subband system.
(Columns: Subband, Variance, Bits Required, Bits Used.)
For the one-level subband coding system, the proposed filtering structure requires
2 ADD/SUB operations per sample, while a filtering structure based on an
8-tap FIR filter requires 14 ADD/SUB and 8 MULT operations per sample. For
a two-level subband coding system, the proposed filtering structure requires 2.5
ADD/SUB operations per sample, while a filtering structure based on an 8-tap
FIR filter requires 17.5 ADD/SUB and 10 MULT operations per sample. Assuming
that the multiplier dissipates 4 times the adder power, the proposed filtering
structure dissipates 23 times less power than a filtering structure using an 8-tap
FIR filter [126].
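The factor of 23 follows directly from the operation counts quoted above once a multiplication is weighted as four additions. The short calculation below reproduces it; the function name is hypothetical and power is expressed in units of one adder operation.

```python
def relative_power(add_ops, mult_ops, mult_cost=4.0):
    """Power per sample in units of one ADD/SUB operation."""
    return add_ops + mult_cost * mult_ops


# Operation counts per sample taken from the text.
one_level_ratio = relative_power(14, 8) / relative_power(2, 0)
two_level_ratio = relative_power(17.5, 10) / relative_power(2.5, 0)

if __name__ == "__main__":
    print(f"one-level power ratio: {one_level_ratio:.1f}")
    print(f"two-level power ratio: {two_level_ratio:.1f}")
```

Both the one-level and the two-level systems give the same ratio, because the second level scales the operation counts of both filtering structures by the same factor of 1.25.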
The simulation results of the vector quantization algorithm, for the two-level
subband coding image compression algorithm, show that 40% - 50% of the vectors
were zero, 15% - 25% had only one nonzero sample and 30% - 40% had two
nonzero samples. The decoding algorithm thus requires 2 AND operations, 0.7
ADD operations, 0.35 INV operations, 1.25 MUX operations, 0.9 ROM accesses and 0.9
register write operations per vector. It should be noted that the number of operations
per vector is independent of the vector dimension. In PVQ, the number of operations
per vector is proportional to the vector dimension, and it turns out to be much
larger than that of the new algorithm.
Table 4.4 compares the power dissipation of the proposed VQ decoding algorithm
to that of a memory-based VQ decoding algorithm. The power dissipation
of the proposed algorithm varies slightly as the vector size grows larger. This is
because the number of operations required is independent of the vector size, but the
datapath width of the operators increases logarithmically with the vector size. On
the other hand, the size of the memory in a memory-based VQ decoding algorithm
increases approximately with the cube of the vector size.
Table 4.4: The power dissipation of the proposed VQ decoding algorithm and that
of a memory-based VQ algorithm. Both have the same compression factor and are
designed in a 0.5 µm, 3.3 Volt CMOS technology. Each sample is 4 bits. The power
dissipation is calculated at a speed of 1 vector per µs.
Samples per vector | Compression factor | Memory based (size) | Memory based (power) | Proposed algorithm (power) | Power reduction
8  | 32/9   | 0.5 K x 32 | -       | -      | 7.3 times
16 | 64/11  | -          | -       | -      | 23 times
32 | 128/13 | 8 K x 128  | 17.6 mW | 112 µW | 156 times
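The per-vector operation counts and the memory sizes quoted above can be cross-checked with a short calculation. The case probabilities used below (0.45, 0.20, 0.35) are the midpoints of the quoted ranges and are an assumption of this sketch, as are the function names. The lookup-table model assumes a table indexed directly by the 2n + 3 bit code, with one word holding all N samples of 4 bits each.

```python
# Operations per decoded vector for each case, as listed in the text.
OPS = {
    "zero": {"AND": 2},
    "one": {"AND": 2, "ROM": 1, "MUX": 1, "WRITE": 1},
    "two": {"AND": 2, "ROM": 2, "ADD": 2, "INV": 1, "MUX": 3, "WRITE": 2},
}


def expected_ops(probabilities):
    """Average number of each operation per vector for given case probabilities."""
    totals = {}
    for case, p in probabilities.items():
        for op, count in OPS[case].items():
            totals[op] = totals.get(op, 0.0) + p * count
    return totals


def lookup_table_size(n, sample_bits=4):
    """(number of words, word width in bits) of a table indexed by the 2n+3 bit code."""
    return 1 << (2 * n + 3), (1 << n) * sample_bits


if __name__ == "__main__":
    mix = {"zero": 0.45, "one": 0.20, "two": 0.35}  # assumed midpoints
    for op, value in sorted(expected_ops(mix).items()):
        print(f"{op:5s} {value:.2f}")
    for n in (3, 4, 5):
        words, width = lookup_table_size(n)
        print(f"N={1 << n}: {words} words x {width} bits")
```

The word count grows with the square of N and the word width grows linearly with N, so the total number of memory bits grows with the cube of the vector size, in line with the statement above.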
4.6 Chapter Summary 
A subband coding image compression algorithm with low computational complexity
has been developed in this chapter. The chapter starts with an overview of image
compression algorithms and in particular subband coding image compression.
The use of a simplified filtering structure is one of the distinct features of the
new subband coding algorithm. The analysis/synthesis filter system is a two-level
system, with the lower frequency subband of the first level being the only one
divided into smaller subbands. Addition and subtraction are the only operations
used in the filter; no multiplication is required.
The statistical properties of each subband are evaluated to determine the num- 
ber of bits allocated to each subband. A vector quantization algorithm which 
avoids the need of large look-up tables for subband decoding was developed. The 
filtering structure used reduces the computational power dissipation by 23 times. 
The reduction in computational complexity is achieved at the expense of a 4-5 dB
degradation in the SNR performance, for a subband image compression algorithm
employing a two-level analysis/synthesis system.
Chapter 5 
A/D Converter for Software 
Radio 
5.1 Introduction 
Intricate signal processing of real world analog signals often requires signal conversion
into the digital domain. Conversion makes feasible the use of either conventional
digital computers or special purpose digital signal processors. This increases
the system's flexibility and programmability.
Software radios require high-speed high-resolution A/D converters. To achieve
a resolution of up to 20 bits, a Sigma-Delta A/D converter [130] [131] [132] is used.
The Sigma-Delta A/D converter is composed of a Sigma-Delta modulator, followed
by a decimation filter [133] [134] [135], which digitally transforms a low-resolution
oversampled signal into a high-resolution Nyquist-rate sampled signal. Figure 5.1
shows the block diagram of a bandpass Sigma-Delta A/D converter.
CHAPTER 5. A/D CONVERTER FOR SOFTWARE RADIO
Figure 5.1: Bandpass Sigma-Delta A/D converter.
Figure 5.2 shows a typical receiver where the digitization is done after the first
IF (Intermediate Frequency) stage [136]. The bottleneck of this architecture is the
A/D (analog-to-digital) converter. Not only does this A/D converter operate at a high speed
(in the MHz range), but it requires high resolution as well (12 - 20 bits).
Figure 5.2: Digital IF receiver architecture.
Parallelism by 4x of analog signal processors is applied to the design of a bandpass
Sigma-Delta modulator. The speed of the modulator is increased without
increasing the speed requirement of the individual building blocks. Several architectures
are considered in terms of their resilience to implementation details such
as mismatch and gain errors. A switched-capacitor circuit is also given for the
proposed modulator.
Several high-level low-power design techniques have been incorporated in the
design of the decimation filter. These include operation minimization, multiplier
elimination, operation interleaving and block deactivation. Analysis and simulation
results indicate that these techniques can achieve a 4 times reduction in power dissipation.
A novel memory access algorithm is employed in the design of the lowpass
filter. An interleaved multiplier-accumulator array is used in the lowpass filter. In
this chapter, the effect of optimizing the datapath width of the Sinc decimator on
its numerical accuracy is also considered. The decimation filter designed has a programmable
resolution that varies from 12 to 20 bits. The entire decimation filter
has been designed in a 0.5 µm, 3.3 Volt CMOS technology.
The organization of this chapter is as follows. In the next section, the resolution
requirement of the A/D converter is determined. In section 5.3, a novel architecture
that applies parallelism by 4x of analog signal processors to the design of a bandpass
Sigma-Delta modulator is presented. In section 5.4, the performance of the proposed
Sigma-Delta architecture is evaluated for different configurations, in terms
of their resilience to implementation details such as mismatch and gain errors. In
section 5.5, a switched-capacitor implementation of the proposed architecture with
a minimum number of operational amplifiers is presented. In section 5.6, the decimation
filter architecture is presented. The decimation filter designed is composed of
a Sinc decimator and a lowpass decimation filter (LPDF). In section 5.7, the design
of the Sinc decimator is investigated, and operation minimization is applied to minimize
the power dissipation by eliminating redundant computations. In section 5.8,
the effect of optimizing the datapath width on the numerical accuracy and power
dissipation of the Sinc decimator is considered; this eliminates irrelevant computations.
In section 5.9, the low-power design of the LPDF filter is investigated. The
VLSI design of the decimation filter in a 0.5 µm, 3.3 Volt CMOS technology is given
in section 5.10.
5.2 The Resolution Requirement 
The resolution requirement of the A/D converter, and hence the resolution requirement
of the decimation filter, varies according to the strength of the received signal
as well as the background noise and interference. Simulation results for a system
based on the DAMPS standard [22] indicate that for a digital receiver digitizing
an IF signal of bandwidth 0.96 MHz (32 TDMA channels), the maximum required
dynamic range for the A/D converter is 20 bits. Simulation results also indicate
that when the input signal is strong enough a dynamic range of 10 bits is sufficient.
The total dynamic range required can be expressed as:
$$DR_{total} = DR_{AGC} + DR_{int} \qquad (5.1)$$
$DR_{AGC}$ is the required dynamic range due to the variation in the strength of the
received input signal. The strength of the input signal can vary by up to 120 dB.
This necessitates the use of automatic gain control (AGC) in the conventional
receiver. $DR_{int}$ is the dynamic range required so that the digital stages following
the A/D converter can distinguish the desired channel from any interference. The
digitized signal includes more than one channel, and the desired channel is then selected
digitally, so the dynamic range of the A/D converter should be sufficient to
perform this channel selection in the following digital stages. The DAMPS standard
[22] specifies that the receiver should operate properly when an interference 55 dB
greater than the desired signal exists 90 kHz or more from the desired signal.
Assume that the receiver consists of an AGC amplifier followed by the A/D 
converter as shown in Figure 5.3.a. The dynamic range of the A/D converter in 
this case is 55 dB (≈ 10 bits). The AGC amplifier is required to have a gain
variation of 120 - 55 = 65 dB. 
Without the AGC amplifier, Figure 5.3.b, the required dynamic range of the
A/D converter is 120 dB (≈ 20 bits). However, the high resolution is not required
in all cases. In fact, if the input signal is large enough, which corresponds to the
AGC amplifier of Figure 5.3.a having a gain $G_{min}$, a 12 bit A/D converter is all
that is required. If the input signal is weak, the AGC amplifier of Figure 5.3.a will
have a gain $G_{max}$ (= $G_{min}$ + 65 dB), which corresponds to a 20 bit A/D converter
in Figure 5.3.b.
Figure 5.3: A/D resolution requirement: (a) With AGC amplifier. (b) Without
AGC amplifier.
The reason for having an A/D converter with variable resolution is to save power
when the lower resolution is sufficient. This leads to the concept of Automatic
Resolution Control (ARC), where the resolution of the A/D converter is varied, as
opposed to AGC, where the gain of the amplifier is controlled by the level of the
input signal. The required resolution is determined by the digital stages following
the A/D converter. In section 5.6 of this chapter, the design of a decimation filter,
to be used with the Sigma-Delta modulator with resolution varying between 12 and
20 bits, is examined.
5.3 A Parallel Bandpass Sigma-Delta Modulator 
Sigma-Delta modulation has been commonly used in high resolution analog-to-digital
converters because of its ability to shape noise away from the desired band.
Moreover, Sigma-Delta modulators require only a two-level quantizer to achieve a
high-resolution Nyquist-rate sampled stream.
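As a rough illustration of how a two-level quantizer inside a feedback loop can represent a finely resolved value, the following sketch simulates a generic first-order lowpass Sigma-Delta loop. It is not the bandpass architecture developed in this chapter; the function name `sigma_delta_first_order` and the constant test input are assumptions made only for this example.

```python
def sigma_delta_first_order(samples):
    """Generic first-order lowpass Sigma-Delta loop (illustrative only).

    samples: iterable of floats in [-1, 1].
    Returns a list of two-level outputs (+1.0 or -1.0).
    """
    integrator = 0.0   # loop-filter state
    feedback = 0.0     # previous quantizer output (0 before the first decision)
    bits = []
    for x in samples:
        integrator += x - feedback                      # integrate the error
        feedback = 1.0 if integrator >= 0.0 else -1.0   # two-level quantizer
        bits.append(feedback)
    return bits


if __name__ == "__main__":
    n = 4096
    bits = sigma_delta_first_order([0.5] * n)
    # The average of the two-level stream tracks the input level.
    print(sum(bits) / n)
```

Averaging (lowpass filtering and decimating) the two-level stream is what recovers the high-resolution value, which is the role played by the decimation filter later in this chapter.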
Sigma-Delta modulation has commonly been used for lowpass signals [132] [137].
However, the signal digitized at the IF stage is a bandpass signal. The signal can
be subsampled with no loss of information due to aliasing. In this case, a bandpass
Sigma-Delta modulator (BPSD) [138] - [142] is used instead of a lowpass Sigma-Delta
modulator (LPSD).
The bandpass Sigma-Delta A/D modulators presented in the literature so far
have been one channel A/D modulators, with bandwidth 30 kHz (for DAMPS
systems) [141], or bandwidth 200 kHz (for GSM systems) [142]. As the digitized
signal bandwidth increases, more than one channel is digitized and then the desired
channel is filtered out digitally. This increases the dynamic range required
to meet the standard's interference rejection criteria. To operate at a high sampling
rate, parallelism by 4x of analog signal processors is applied to the design of the
bandpass Sigma-Delta modulator. This increases the overall speed of the modulator
without increasing the speed requirement of the individual building blocks.
The dynamic range of the Sigma-Delta A/D converter depends on the order of
the Sigma-Delta modulator, as well as the oversampling ratio (OSR). Table 5.1 gives
the output dynamic range of the Sigma-Delta modulator for different oversampling
ratios, for second-order and third-order lowpass Sigma-Delta modulators. A fourth-order
BPSD is equivalent, in its dynamic range performance, to a second-order
LPSD. A sixth-order BPSD is equivalent to a third-order LPSD.
Table 5.1: Dynamic range versus OSR for a second and a third order LPSD.
OSR     2nd Order LPSD       3rd Order LPSD
32      62 dB (10 bits)      84 dB (14 bits)
64      77 dB (12 bits)      105 dB (17 bits)
128     92 dB (15 bits)      126 dB (21 bits)
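The bit counts quoted in Table 5.1 appear consistent with the usual ideal-quantizer rule DR = 6.02 b + 1.76 dB, and the steps between rows match the ideal growth of (2L + 1) x 3.01 dB per doubling of the OSR for a modulator of lowpass order L. The thesis does not state which rule was used, so both formulas in the sketch below are assumptions, and the helper names are hypothetical.

```python
import math


def effective_bits(dynamic_range_db):
    """Equivalent resolution from the ideal-quantizer rule DR = 6.02*b + 1.76 dB."""
    return (dynamic_range_db - 1.76) / 6.02


def gain_per_osr_doubling_db(lowpass_order):
    """Ideal dynamic-range gain when the OSR doubles: 10*log10(2**(2L + 1))."""
    return 10.0 * math.log10(2.0 ** (2 * lowpass_order + 1))


if __name__ == "__main__":
    # OSR -> (2nd-order DR, 3rd-order DR) in dB, copied from Table 5.1
    table_5_1 = {32: (62, 84), 64: (77, 105), 128: (92, 126)}
    for osr, (dr2, dr3) in table_5_1.items():
        print(osr, round(effective_bits(dr2)), round(effective_bits(dr3)))
    print(gain_per_osr_doubling_db(2), gain_per_osr_doubling_db(3))
```

Under these assumptions the second-order column grows by about 15 dB per row and the third-order column by about 21 dB per row, which matches the tabulated values.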
To subsample a bandpass signal and ensure that spectrum overlap doesn't occur,
the sampling frequency has to satisfy the following inequality [11]:
where,
$f_s$ is the sampling frequency.
$f_h$ is the highest frequency in the bandpass signal.
$f_l$ is the lowest frequency in the bandpass signal.
k is an integer satisfying the following inequality:
$f_c$ is the band center frequency,
BW is the bandwidth,
BW = $f_h - f_l$
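The inequalities themselves did not survive in this copy. As a hedged illustration only, the sketch below uses the widely quoted textbook condition for alias-free uniform bandpass sampling, 2 f_h / n <= f_s <= 2 f_l / (n - 1) with integer 1 <= n <= floor(f_h / BW). The indexing convention may differ from that of Inequality 5.2, and the function name `valid_bandpass_rates` is an assumption.

```python
import math


def valid_bandpass_rates(f_low, f_high):
    """Return [(n, fs_min, fs_max)] for alias-free uniform bandpass sampling.

    Uses the textbook condition 2*f_high/n <= fs <= 2*f_low/(n - 1)
    for integer 1 <= n <= floor(f_high / (f_high - f_low)).
    For n = 1 (ordinary Nyquist sampling) fs_max is infinite.
    """
    bandwidth = f_high - f_low
    n_max = math.floor(f_high / bandwidth)
    ranges = []
    for n in range(1, n_max + 1):
        fs_min = 2.0 * f_high / n
        fs_max = math.inf if n == 1 else 2.0 * f_low / (n - 1)
        ranges.append((n, fs_min, fs_max))
    return ranges


if __name__ == "__main__":
    # Hypothetical band from 20 to 25 (arbitrary frequency units).
    for n, lo, hi in valid_bandpass_rates(20.0, 25.0):
        print(n, lo, hi)
```

Each tuple corresponds to one wedge-shaped region of the kind plotted in charts of valid bandpass sampling rates; the largest n gives the lowest usable sampling frequency.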
According to Inequality 5.2, the sampling frequency depends on both the bandwidth
and the band position of the bandpass signal. Figure 5.4 [143] shows the
valid sampling frequency as a function of the bandwidth BW and the center frequency
$f_c$ of the bandpass signal. Generally, the sampling frequency is given by
[136]:





Figure 5.4: Valid bandpass sampling rate regions. 
It is clear that the samples can be divided into four groups: G0, G1, G2 and
G3, equally spaced in time. Groups G0 and G2 sample the in-phase channel,
while groups G1 and G3 sample the quadrature-phase channel. The division of the
samples into four groups suggests that we can replace the bandpass Sigma-Delta
modulator with four lowpass Sigma-Delta modulators as shown in Figure 5.5.b.
Each one of these lowpass modulators would operate at a quarter of the speed of the
bandpass modulator, relaxing the speed requirements of the circuits. It would, however,
suffer the impact of component mismatch and a loss in the dynamic range. This
loss in dynamic range is 12 dB for a fourth-order bandpass Sigma-Delta modulator,
and could be avoided using a cross-coupled architecture.
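The grouping can be checked numerically. The sketch below assumes the simplest case, a sampling frequency of exactly four times the center frequency, and holds the in-phase and quadrature components constant; both choices, and the function names, are assumptions made for illustration rather than the thesis's general sampling expression.

```python
import math


def sample_bandpass(i_value, q_value, num_samples):
    """Sample x(t) = I*cos(2*pi*fc*t) - Q*sin(2*pi*fc*t) at fs = 4*fc.

    I and Q are held constant here purely for illustration.
    """
    out = []
    for n in range(num_samples):
        phase = 0.5 * math.pi * n          # 2*pi*fc*n/fs with fs = 4*fc
        out.append(i_value * math.cos(phase) - q_value * math.sin(phase))
    return out


def split_into_groups(samples):
    """Deal consecutive samples into groups G0..G3 (every fourth sample)."""
    return [samples[g::4] for g in range(4)]


if __name__ == "__main__":
    g0, g1, g2, g3 = split_into_groups(sample_bandpass(0.7, -0.2, 16))
    print(g0[0], g1[0], g2[0], g3[0])   # approximately I, -Q, -I, +Q
```

Groups G0 and G2 carry only the in-phase value (with alternating sign) and groups G1 and G3 carry only the quadrature value, which is the property that allows four quarter-rate modulators to share the work.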
The conventional second-order bandpass Sigma-Delta modulator, shown in Figure 5.6.a,
can be split into two branches, one for the in-phase channel and the other
for the quadrature-phase channel. This is shown in Figure 5.6.b. Consider one of
the branches of Figure 5.6.b. It is desirable to split that branch into two branches,
each operating at half rate, while maintaining the overall transfer function and
hence the signal-to-noise ratio.
Figure 5.5: Using four lowpass Sigma-Delta modulators to implement a bandpass
Sigma-Delta modulator.
H(z) can be expressed as a sum of an even function and an odd one:
$$H(z) = H_e(z) + H_o(z)$$
Notice that $H_e(z)$ (the even function) and $H_o(z)$ (the odd one) are similar
with the exception of a delay. This means that the common filtering can be done
before the split or after the subtraction. The two possible solutions are shown
in Figure 5.7 [Ml [Ml. Note that both figures show a single channel (I or Q)
bandpass Sigma-Delta modulator.
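As a worked illustration of the even/odd split, consider a generic accumulator. This particular H(z) is an assumption chosen for simplicity; the branch transfer function of Figure 5.6.b may differ from it in sign or delay.

```latex
\begin{align*}
H(z) &= \frac{1}{1 - z^{-1}}
      = \frac{1 + z^{-1}}{1 - z^{-2}} \\
     &= \underbrace{\frac{1}{1 - z^{-2}}}_{H_e(z)}
      + \underbrace{\frac{z^{-1}}{1 - z^{-2}}}_{H_o(z)},
\qquad H_o(z) = z^{-1} H_e(z).
\end{align*}
```

In this example both parts share the factor $1/(1 - z^{-2})$, which depends only on $z^{-2}$ and can therefore run at half rate, and they differ only by a single delay.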
The same concept can be extended to higher order bandpass Sigma-Delta modulators.
Figure 5.8 shows the extension of this analog parallelism to a single channel
fourth-order bandpass Sigma-Delta modulator [145]. Without the cross-coupling
shown in Figure 5.7 and Figure 5.8, the SNR of the Sigma-Delta modulator is degraded
by 6N dB, where N is the order of the equivalent lowpass Sigma-Delta
modulator.
Figure 5.6: Second-order bandpass Sigma-Delta modulator (a) conventional (b)
with separate IQ branches.
Figure 5.7: A single-channel second-order bandpass Sigma-Delta modulator with
two cross-coupled branches and common filtering done: (a) before splitting (b)
after subtraction.
Figure 5.8: A single-channel fourth-order bandpass Sigma-Delta modulator with
two cross-coupled branches.
5.4 The Performance of the Parallel Sigma-Delta
Modulator
The two split-branch architectures for the single-channel second-order bandpass
Sigma-Delta modulator, shown in Figure 5.7, have the same linearized transfer function
as the conventional single-channel second-order bandpass Sigma-Delta modulator.
However, in the presence of mismatch, the response of each modulator
becomes different.
The errors considered are assumed to occur in the even/odd-sample integrator
block, which ideally should have a transfer function:
The mismatch is modeled by the elements $G_{leak}$ and $G_{ext}$ as shown in Figure 5.9.b.
Ideally, both elements should be 1. In practice, however, $G_{leak}$ can be
slightly less than 1, while $G_{ext}$ can be slightly greater than or less than 1. $G_{leak}$
is the leakage of the integrator, caused by the finite gain, A, of the operational
amplifier in the integrator [130]:
The $G_{leak} G_{ext}$ product is the gain of the integrator.
Figure 5.9: (a) An ideal integrator (b) Integrator with mismatch.
The simulation of the proposed Sigma-Delta modulator in different configurations,
along with the simulation of the conventional Sigma-Delta modulator, was
performed using SPW(TM). Details of the simulation model are given in appendix A.
In this section, the obtained results are presented.
Notice that $G_{ext}$ has no effect on the performance of the Sigma-Delta modulator
of Figure 5.7.b, because the gain doesn't affect the signal's polarity and hence the
operation of the comparator. However, this is only true in Figure 5.7.a if $G_{ext}$ of
the even and odd branches are equal. If there is a discrepancy between them,
the performance is degraded. Figure 5.10 shows the degradation in SNR due
to discrepancy in the value of $G_{ext}$ between the even and the odd branches of the
Sigma-Delta modulator given in Figure 5.7.a. A 2% difference in $G_{ext}$ would lead
to a 1.5 dB degradation in SNR.
Figure 5.10: Degradation in SNR due to mismatch in the value of $G_{ext}$ between the
even and the odd branches of the Sigma-Delta modulator shown in Figure 5.7.a.
Figure 5.11 shows the effect of non-unity in $G_{leak}$ on the conventional Sigma-Delta
modulator of Figure 5.6.
Figure 5.12 shows the effect of mismatch and non-unity in $G_{leak}$ on the SNR performance
of the bandpass Sigma-Delta modulator of Figure 5.7.b. Notice that the
SNR performance is very close to that of the conventional Sigma-Delta modulator
given in Figure 5.11.
Figure 5.13 shows the effect of mismatch and non-unity in $G_{leak}$ on the SNR
performance of the bandpass Sigma-Delta modulator of Figure 5.7.a. Notice the
substantial degradation in performance with mismatch.
Figure 5.11: The effect of non-unity in $G_{leak}$ on the SNR of the conventional Sigma-Delta
modulator (SNR versus input signal in dB).
Figure 5.12: The effect of mismatch and non-unity in $G_{leak}$ on the SNR performance
of the bandpass Sigma-Delta modulator of Figure 5.7.b (SNR versus input signal in dB).
Figure 5.13: The effect of mismatch and non-unity in $G_{leak}$ on the SNR performance
of the bandpass Sigma-Delta modulator of Figure 5.7.a.
Figure 5.14: Change in SNR due to mismatch in the value of $G_{leak}$ for the conventional
single-channel second-order bandpass Sigma-Delta modulator, and for the
bandpass Sigma-Delta modulator of Figure 5.7.b.
Figure 5.14 shows the effect of mismatch and non-unity in $G_{leak}$ on the conventional
bandpass Sigma-Delta modulator (Figure 5.6) and the bandpass Sigma-Delta
modulator of Figure 5.7.b. A 10% degradation in $G_{leak}$ could lead to a degradation
of up to 4.5 dB in SNR.
Discrepancy in the value of $G_{leak}$ between the even and the odd branches of the
bandpass Sigma-Delta modulator given in Figure 5.7.a can cause severe distortion
to the signal as shown in Figures 5.15 and 5.16.
The results presented so far demonstrate that, although the Sigma-Delta modulators of
Figures 5.7.a and 5.7.b behave identically under ideal
conditions, their performance is affected differently by parameter mismatch.
Figure 5.15: Distortion in the output signal of the bandpass Sigma-Delta modulator
shown in Figure 5.7.a due to $G_{leak}$ = 0.99 in one branch and $G_{leak}$ = 0.95 in the other.
Figure 5.16: Distortion in the output signal of the bandpass Sigma-Delta modulator
shown in Figure 5.7.a due to $G_{leak}$ = 1.0 in one branch and $G_{leak}$ = 0.95 in the other.
Figure 5.17: Frequency spectrum of the Sigma-Delta modulators of Figure 5.7
having $G_{leak,e}/G_{leak,o}$ = 1.0/1.0.
Mismatch can cause the architecture of Figure 5.7.a to become unstable, and it
causes unacceptable signal distortion. On the other hand, the effect of mismatch
on the architecture given in Figure 5.7.b is quite small, and its SNR performance is
comparable to that of the conventional second-order bandpass Sigma-Delta modulator.
The frequency spectrum at the output of the proposed Sigma-Delta modulators
of Figure 5.7, for different values of $G_{leak,e}/G_{leak,o}$, is shown in Figures 5.17 -
5.21. The injected sinusoidal signal has an amplitude of 0.5 and a frequency of
0.0031. The sampling frequency is 1.0. Notice that the mismatch increases the
low-frequency quantization noise substantially for the modulator of Figure 5.7.a, while
it has a negligible effect on the low-frequency quantization noise of the modulator
of Figure 5.7.b.
Similar analysis for the fourth-order bandpass Sigma-Delta modulator shows that
placing the integrator after the subtractor (as shown in Figure 5.8) significantly
reduces the degradation in SNR due to mismatch.
Figure 5.18: Frequency spectrum of the Sigma-Delta modulators of Figure 5.7
Figure 5.19: Frequency spectrum of the Sigma-Delta modulators of Figure 5.7
Figure 5.20: Frequency spectrum of the Sigma-Delta modulator of Figure 5.7.b
having $G_{leak,e}/G_{leak,o}$ = 0.99/0.98.
Figure 5.21: Frequency spectrum of the Sigma-Delta modulator of Figure 5.7.a
having $G_{leak,e}/G_{leak,o}$ = 0.99/0.98.
Figure 5.22 shows the effect of mismatch and non-unity in $G_{leak}$ of the first
stage on the SNR performance of a fourth-order bandpass Sigma-Delta modulator,
having the integrator after the subtractor in the first stage. Notice the negligible
degradation in SNR performance due to mismatch.
Figure 5.23 shows the effect of mismatch and non-unity in $G_{leak}$ of the second
stage on the SNR performance of a fourth-order bandpass Sigma-Delta modulator,
having the integrator after the subtractor in the second stage. Notice the negligible
degradation in SNR performance due to mismatch.
Figure 5.24 shows the effect of mismatch and non-unity in $G_{leak}$ of the first stage
on the SNR performance of a fourth-order bandpass Sigma-Delta modulator, having
the integrator before the branch splitting in the first stage. Notice the substantial
degradation in SNR performance due to mismatch.
Figure 5.25 shows the effect of mismatch and non-unity in $G_{leak}$ of the second
stage on the SNR performance of a fourth-order bandpass Sigma-Delta modulator,
having the integrator before the branch splitting in the second stage. Notice the
slight degradation in SNR performance due to mismatch.
Figure 5.22: The effect of mismatch and non-unity in $G_{leak}$ of the first stage on the
SNR performance of a single-channel fourth-order bandpass Sigma-Delta modulator,
having the integrator after the subtractor in the first stage.
Figure 5.23: The effect of mismatch and non-unity in $G_{leak}$ of the second stage
on the SNR performance of a single-channel fourth-order bandpass Sigma-Delta
modulator, having the integrator after the subtractor in the second stage.
Figure 5.24: The effect of mismatch and non-unity in $G_{leak}$ of the first stage on the
SNR performance of a single-channel fourth-order bandpass Sigma-Delta modulator,
having the integrator before the branch splitting in the first stage.
Figure 5.25: The effect of mismatch and non-unity in $G_{leak}$ of the second stage
on the SNR performance of a single-channel fourth-order bandpass Sigma-Delta
modulator, having the integrator before the branch splitting in the second stage.
Figure 5.26: Frequency spectrum of a single-channel fourth-order bandpass Sigma-Delta
modulator, having $G_{leak,e}/G_{leak,o}$ = 1.0/1.0 for the first and second stages.
Figures 5.22 - 5.25 indicate that, if the gain of the operational amplifier of the
integrator is 37 dB with a ±3 dB mismatch, the degradation in the SNR with the
integrator after the subtractor (Figure 5.8) is about 1 dB, while the degradation in
the SNR with the integrator before the branch splitting is about 25 dB.
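The expression relating the integrator leakage to the operational amplifier gain A is not legible in this copy. A common first-order approximation, assumed here and ignoring capacitor ratios, is leakage ≈ 1 - 1/A. Under that assumption, the sketch below relates the 37 dB ± 3 dB gain figures to leakage values; the function names are hypothetical.

```python
def db_to_linear(gain_db):
    """Convert a voltage gain in dB to a linear ratio."""
    return 10.0 ** (gain_db / 20.0)


def leakage_from_opamp_gain(gain_db):
    """Assumed first-order model: leakage = 1 - 1/A, with A the linear op-amp gain."""
    return 1.0 - 1.0 / db_to_linear(gain_db)


if __name__ == "__main__":
    for gain_db in (34.0, 37.0, 40.0):
        print(gain_db, round(leakage_from_opamp_gain(gain_db), 4))
```

With this model, gains of 40 dB and 34 dB map to leakage values of about 0.99 and 0.98, which is in line with the 0.99/0.98 pair used in the spectrum figures, although the thesis's exact expression may include additional factors.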
The frequency spectrum at the output of the proposed Sigma-Delta modulator
of Figure 5.8 and its variants, for different values of $G_{leak,e}/G_{leak,o}$, is shown in
Figures 5.26 - 5.30. The injected sinusoidal signal has an amplitude of 0.5 and
a frequency of 0.0031. The sampling frequency is 1.0. Notice that the mismatch
increases the low-frequency quantization noise substantially for the architectures
having the integrator placed before the branch splitting. Notice also that the mismatch
in $G_{leak}$ of the first stage causes a greater increase in the noise than the
mismatch in $G_{leak}$ of the second stage. Mismatch has a negligible effect on the
low-frequency quantization noise of the architectures having the integrator placed
after the subtractor.
Another advantage of placing the integrator after the subtractor is that it is possible,
in a switched-capacitor implementation, to use the same operational amplifier
for integration and addition (subtraction), thus reducing the number of operational
amplifiers required to implement the modulators. The switched-capacitor implementation
is explained in the next section.
Figure 5.27: Frequency spectrum of a single-channel fourth-order bandpass Sigma-Delta
modulator, having the integrator placed after the subtractor in the first stage,
and having $G_{leak,e}/G_{leak,o}$ = 0.99/0.98 for the first stage, and $G_{leak,e}/G_{leak,o}$ = 1.0/1.0
for the second stage.
Figure 5.28: Frequency spectrum of a single-channel fourth-order bandpass Sigma-Delta
modulator, having the integrator placed before the branch splitting in the first
stage, and having $G_{leak,e}/G_{leak,o}$ = 0.99/0.98 for the first stage, and $G_{leak,e}/G_{leak,o}$ =
1.0/1.0 for the second stage.
Figure 5.29: Frequency spectrum of a single-channel fourth-order bandpass Sigma-Delta
modulator, having the integrator placed after the subtractor in the second
stage, and having $G_{leak,e}/G_{leak,o}$ = 1.0/1.0 for the first stage, and $G_{leak,e}/G_{leak,o}$ =
0.99/0.98 for the second stage.
Figure 5.30: Frequency spectrum of a single-channel fourth-order bandpass Sigma-Delta
modulator, having the integrator placed before the branch splitting in
the second stage, and having $G_{leak,e}/G_{leak,o}$ = 1.0/1.0 for the first stage, and
$G_{leak,e}/G_{leak,o}$ = 0.99/0.98 for the second stage.
Figure 5.31: Switched-Capacitor Integrator.
5.5 Switched-Capacitor Architecture 
The switched-capacitor implementation for the Sigma-Delta modulators of Figures
5.7.b and 5.8 is developed in this section [145]. The switched-capacitor integrator
used in these implementations is given in Figure 5.31 [142]. Notice that this
architecture introduces a half cycle delay between the input and the output.
In this section the word cycle refers to the duration between two consecutive
even samples (or odd samples). In this case, a delay of one cycle is $z^{-2}$ in the
z-domain, and a delay of half a cycle is $z^{-1}$. The even and odd samples at the input
as well as those at the output of the comparator are held for an entire cycle. The
even and odd samples are staggered by half a clock cycle. In the switched-capacitor
implementations given in this section the delays are implemented by proper timing
of the switches.
First, consider the implementation of the single-channel second-order bandpass
Sigma-Delta modulator given in Figure 5.7.b. Since the integrator introduces a
delay of half a cycle ($z^{-1}$), this delay should be included with each integrator.
Also, the four adders are combined into two adders. Figure 5.32 shows the modified
modulator.
The modified architecture of Figure 5.32 requires only two operational amplifiers
for its implementation. The delays are implemented by proper timing of the
switches. Figure 5.33 shows the switched-capacitor implementation for the
single-channel second-order bandpass Sigma-Delta modulator.
Figure 5.32: A modified single-channel second-order bandpass Sigma-Delta modulator
with two cross-coupled branches.
For the single-channel fourth-order bandpass Sigma-Delta modulator, the architecture of
Figure 5.8 is modified to include a delay ($z^{-1}$) with each integrator. However, the
modified architecture contains non-causal blocks as shown in Figure 5.34.
It is possible to rearrange the blocks to retain the causality of each
block. The new modified architecture is shown in Figure 5.35. This modified
architecture requires four operational amplifiers for its implementation. The delays
are obtained by proper timing of the switches. Figure 5.36 shows the switched-capacitor
implementation for the single-channel fourth-order bandpass Sigma-Delta
modulator.
Figure 5.33: The switched-capacitor implementation of the single-channel second-order
bandpass Sigma-Delta modulator with two cross-coupled branches.
Figure 5.34: A modified single-channel fourth-order bandpass Sigma-Delta modulator
with two cross-coupled branches, having non-causal blocks.
Figure 5.35: A modified single-channel fourth-order bandpass Sigma-Delta modulator
with two cross-coupled branches, having no non-causal blocks.
5.6 The Decimation Filter Architecture 
The decimation filter (Figure 5.37) consists of two parts: the Sinc decimator and
the lowpass decimation filter (LPDF). The Sinc decimator is characterized by its
simple structure, requiring only addition operations, which makes it a power-efficient
structure. The order of the Sinc decimator used depends on the order of the Sigma-Delta
modulator, and is given by [135]:
Order of Sinc = Order of LPSD + 1
For a Sinc decimator with order N and a decimation factor of M, the transfer
function (before down-sampling) is given by:
$$H(z) = \left( \frac{1 - z^{-M}}{1 - z^{-1}} \right)^{N}$$
Figure 5.36: The switched-capacitor implementation of the single-channel fourth-order
bandpass Sigma-Delta modulator with two cross-coupled branches.
Figure 5.37: The decimation filter.
Figure 5.38: The transfer function of a Sinc decimator having M = 8 and N = 3.
Figure 5.38 shows the transfer function for a Sinc decimator having M = 8
and N = 3. Notice the zeros of the Sinc decimator at frequencies $\pm f_s/8$, $\pm f_s/4$,
$\pm 3f_s/8$, and $\pm f_s/2$. When the frequency spectrum is folded three times, around
$f_s/4$, $f_s/8$, and $f_s/16$, the zeros of the folded spectrum fall on the spectrum at
f = 0. This minimizes the out-of-band noise added to the low frequency spectrum
due to folding.
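The location of the zeros can be checked numerically. The sketch below evaluates the magnitude of an unnormalized Sinc filter of order N as the N-th power of an M-tap moving sum; whether the thesis normalizes the response by M is not visible in this copy, so the unnormalized form and the function name are assumptions.

```python
import cmath
import math


def sinc_response(freq, m_factor=8, order=3):
    """Magnitude of an unnormalized Sinc^N filter at `freq` (in units of fs).

    Computed as |sum_{i=0}^{M-1} exp(-j*2*pi*freq*i)| ** N.
    """
    w = 2.0 * math.pi * freq
    moving_sum = sum(cmath.exp(-1j * w * i) for i in range(m_factor))
    return abs(moving_sum) ** order


if __name__ == "__main__":
    print(sinc_response(0.0))                     # DC gain, M**N
    for f in (1 / 8, 1 / 4, 3 / 8, 1 / 2):
        print(f, sinc_response(f))                # numerically zero
```

The zeros sit at multiples of fs/M, which are exactly the frequencies that alias onto DC after decimation by M.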
Due to the variable-resolution requirement of the A/D converter, the order of
the Sinc, N, and the decimation factor, M, can vary. N can be 3 or 4 for a
fourth-order or a sixth-order BPSD respectively. M can be 8, 16 or 32.
Due to its gradual transition from the passband to the stopband, the Sinc decimator
can't be used in the entire decimation process. The last stage of decimation
is done using an LPDF, which does decimation by a factor of four [133]. The LPDF
is built as a two stage LPDF, each stage doing decimation by a factor of two. Due to
the variable-resolution requirement of the A/D converter, this filter is designed to
operate with variable resolution, and hence reduce power dissipation when operating
at the lower resolution [146] [147], by eliminating irrelevant computations. A
novel memory access algorithm is employed in the LPDF. An interleaved multiplier-accumulator
array is also used in the LPDF [Ml.
5.7 Power Efficient Sinc Decimator Architecture 
Several architectures were considered for the implementation of the Sinc decimator
[Mg]. In this section, the power dissipation of four architectures that implement an
nth order Sinc decimator that does decimation by a factor of $2^m$ is compared. The
metric for the power dissipation comparison is the number of operations required
to generate a single output. The most power-efficient Sinc decimator is the one
that requires the least number of computations to generate a single output, and
thus eliminates redundant computations.
5.7.1 First Architecture 
In the first implementation, the Sinc decimator is divided into m stages. Each stage
is an nth order Sinc decimator that does decimation by a factor of 2. Figure 5.39
shows the implementation of such a decimator. Assuming that k is the resolution of
the input to the Sinc decimator, the resolution at the output of the Sinc decimator is
k + mn bits. Notice that in this case, the resolution after each Sinc stage increases
by n bits. Also notice that each Sinc stage operates at double the speed of the
following Sinc stage because of down-sampling by 2.
Figure 5.39: First architecture of a Sinc decimator (m stages, total decimation $2^m$).
Assuming that n = 3, each Sinc stage in Figure 5.39, having i bits at its input,
requires a total of 4i + 2 additions to generate a single output. Hence, the total
number of one-bit additions per output for a third order Sinc is:
One-bit additions per output = $(4k + 14)(2^m - 1) - 12m$ (5.12)
Assuming that n = 4, each Sinc stage in Figure 5.39, having i bits at its input,
requires a total of 5i + 4 additions to generate a single output. Hence, the total
number of one-bit additions per output for a fourth order Sinc is:
One-bit additions per output = $(5k + 24)(2^m - 1) - 20m$ (5.13)
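The closed forms can be cross-checked by summing the per-stage cost directly. In the sketch below, stage j (counting from zero) is taken to see an input of k + nj bits and to produce 2^(m-1-j) outputs for every final output, which follows from the bit growth and rate halving described in the text; the function names are hypothetical.

```python
def additions_first_arch(k, m, n):
    """Sum the per-stage cost of the cascaded decimate-by-2 architecture.

    Per-stage cost for an i-bit input, as stated in the text:
    4*i + 2 for n = 3 and 5*i + 4 for n = 4.
    Stage j (0-based) sees k + n*j bits and produces 2**(m-1-j)
    outputs for every final output.
    """
    per_stage = {3: lambda i: 4 * i + 2, 4: lambda i: 5 * i + 4}[n]
    return sum(per_stage(k + n * j) * 2 ** (m - 1 - j) for j in range(m))


def closed_form_third_order(k, m):
    return (4 * k + 14) * (2 ** m - 1) - 12 * m      # Equation 5.12


def closed_form_fourth_order(k, m):
    return (5 * k + 24) * (2 ** m - 1) - 20 * m      # Equation 5.13


if __name__ == "__main__":
    print(additions_first_arch(1, 3, 3), closed_form_third_order(1, 3))
    print(additions_first_arch(1, 3, 4), closed_form_fourth_order(1, 3))
```

The stage-by-stage sum and the closed forms agree for every k and m tried, which supports the reading of Equations 5.12 and 5.13 given above.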
5.7.2 Second Architecture 
In this implementation, the Sinc decimator is divided into n stages. Each stage
is a first-order Sinc decimator having $2^m$ taps. The decimation by a
factor of $2^m$ is done after the last stage. Figure 5.40.a shows a block diagram of this
implementation.
A first-order Sinc decimator having $2^m$ taps is simply the moving
sum (or average) of the last $2^m$ samples. The new output can be obtained from the
previous output by adding sample x(k) and subtracting sample x(k - $2^m$). This is
shown in Figure 5.40.b. Since the last Sinc stage is followed by a down-sampler, the
last stage is simply an accumulate and dump. It accumulates $2^m$ samples, then the
output is cleared and the accumulation starts again. This is shown in Figure 5.40.c.
Figure 5.40: Second architecture of a Sinc decimator. (a) Block diagram. (b) First
n - 1 Sinc stages. (c) Last Sinc stage.
The resolution at the input is k bits, and the resolution after each stage increases by
m bits. Hence, the resolution at the output is k + mn bits. The number of one-bit
additions per output for this architecture is given by:
One-bit additions per output = $[(2n - 1)k + m((n - 1)^2 + n)]2^m$ (5.14)
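The two kinds of stage can be sketched in a few lines. The code below is a behavioural illustration only, with hypothetical function names: `moving_sum` uses the add-newest, subtract-oldest recursion, and `accumulate_and_dump` sums non-overlapping blocks, which is what a moving sum followed by a down-sampler reduces to.

```python
from collections import deque


def moving_sum(samples, taps):
    """Running sum of the last `taps` samples: y[k] = y[k-1] + x[k] - x[k-taps]."""
    window = deque([0] * taps, maxlen=taps)  # delay line, initially zero
    total = 0
    out = []
    for x in samples:
        total += x - window[0]   # add the new sample, drop the oldest one
        window.append(x)         # maxlen discards window[0] automatically
        out.append(total)
    return out


def accumulate_and_dump(samples, taps):
    """Sum each block of `taps` samples, emit it, then restart from zero."""
    out = []
    total = 0
    for index, x in enumerate(samples, start=1):
        total += x
        if index % taps == 0:
            out.append(total)
            total = 0
    return out


if __name__ == "__main__":
    x = list(range(1, 17))
    print(moving_sum(x, 4)[3::4])        # moving sum, then keep every 4th output
    print(accumulate_and_dump(x, 4))     # same values with one addition per input
```

The recursion costs two additions per input sample regardless of the number of taps, while the accumulate-and-dump stage costs one, which is the source of the 2(n - 1) + 1 = 2n - 1 factor multiplying k in the operation count.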
5.7.3 Third Architecture 
The transfer function of the Sinc decimator [$Sinc^n(2^m)$] can be expressed as:
$$H(z) = \left( \frac{1 - z^{-M}}{1 - z^{-1}} \right)^{n} = \left( \frac{1 - z^{-M}}{1 - z^{-2}} \right)^{n} \left( 1 + z^{-1} \right)^{n} \qquad (5.15)$$
where M = $2^m$. The even and odd samples are split into two branches. Each
branch is filtered using a $Sinc^n(2^{m-1})$ filter; the filtered signals are then merged and
filtered using a $Sinc^n(2)$ filter. The block diagram of this implementation is shown
in Figure 5.41.
Figure 5.41: Third architecture of a Sinc decimator.
The decimation is distributed throughout the blocks of Figure 5.41. To show
how this can be achieved, consider the implementation of the Sinc decimator
[$Sinc^3(16)$]. The implementation of this filter is shown in Figure 5.42.a. The output
rate is 1/16 the input rate. The filter $Sinc^3(2)$ is a four tap filter. It requires four
inputs (two from the even branch and two from the odd branch) to generate a single
output. To avoid unnecessary computations, the filters $Sinc^3(8)$ of the even and
odd branches should generate two outputs for every 16 inputs. This is why there
are two outputs for each filter.
Figure 5.42.b shows a computationally efficient method to generate the two
outputs of the $Sinc^3(8)$ filter. Each output is at 1/8 the input rate.
Figure 5.42: (a) Block diagram of $Sinc^3(16)$ filter. (b) Block diagram of $Sinc^3(8)$
filter generating two outputs every 8 inputs.
The number of one-bit additions required per output, for a third-order Sinc decimator,
is given by:
One-bit additions per output = $(5k + 7(m - 1))2^m + 7k + 21m - 14$ (5.16)
For a fourth-order Sinc decimator, the number of one-bit additions required per
output is given by:
5.7.4 Fourth Architecture 
The transfer function for the Sinc decimator [$\mathrm{Sinc}^n(2^m)$] is given by:
where $M = 2^m$. According to Equation 5.18, the Sinc decimator can be implemented as a cascade of $n$ integrators, followed by a $2^m$ down-sampler, followed by $n$ differentiators [133]. The block diagram of such an architecture is given in Figure 5.43.a. Figure 5.43.b gives the implementation of the integrator stage, while Figure 5.43.c gives the implementation of the differentiator stage. To prevent overflow, the datapath width of the integrators and the differentiators has to be:
$$k + mn \text{ bits} \quad (5.19)$$
Figure 5.43: Fourth architecture of a Sinc decimator. (a) Block diagram. (b) Integrator stage. (c) Differentiator stage.
The number of one-bit addition operations required per output is given by:
$$\text{One-bit additions per output} = (2^m + 1)(k + nm)\,n \quad (5.20)$$
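The integrator, down-sampler and differentiator cascade can be sketched in a few lines. This is a minimal illustration rather than the thesis implementation: it assumes each integrator is a running sum, each differentiator is a first difference at the low rate, and all registers are two's-complement words of width $k + nm$ that are allowed to wrap. The function names are hypothetical. The result is compared against a direct evaluation of the unnormalized $n$-fold running-sum filter.

```python
def wrap(value, bits):
    """Interpret value modulo 2**bits as a signed two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def cic_decimate(x, n, m, k):
    """n integrators, down-sample by 2**m, then n differentiators."""
    width = k + n * m                 # datapath width of Equation 5.19
    factor = 1 << m
    integrators = [0] * n
    previous = [0] * n                # one delay register per differentiator
    out = []
    for i, sample in enumerate(x):
        v = sample
        for j in range(n):            # integrators run at the input rate
            integrators[j] = wrap(integrators[j] + v, width)
            v = integrators[j]
        if (i + 1) % factor == 0:     # differentiators run at the output rate
            for j in range(n):
                v, previous[j] = wrap(v - previous[j], width), v
            out.append(v)
    return out

def direct_sinc_decimate(x, n, m):
    """Reference: convolve with n cascaded 2**m-tap running sums, then decimate."""
    factor = 1 << m
    h = [1]
    for _ in range(n):
        h = [sum(h[i - j] for j in range(factor) if 0 <= i - j < len(h))
             for i in range(len(h) + factor - 1)]
    return [sum(h[t] * x[i - t] for t in range(len(h)) if i - t >= 0)
            for i in range(factor - 1, len(x), factor)]

# One-bit input coded as +1/-1, third order, decimation by 8 (k = 1, n = 3, m = 3).
x = [1 if i % 5 < 3 else -1 for i in range(400)]
print(cic_decimate(x, 3, 3, 1) == direct_sinc_decimate(x, 3, 3))
```

The integrators overflow many times on this input, yet the outputs still agree, because the wraparound cancels in the differentiators as long as the true output fits in the chosen width.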
This architecture has the advantage that the down-sampling factor need not be a power of two; it can be any integer. In Figure 5.43, the down-sampling factor was chosen to be a power of two for the sake of comparison with the other Sinc architectures.
5.7.5 Comparison of the Sinc Decimator Architectures
Table 5.2 gives the number of one-bit additions required to generate a single output for each Sinc architecture. Figures 5.44 and 5.45 show the number of one-bit addition operations per Sinc output, for each of the four Sinc architectures, for a third-order and a fourth-order Sinc respectively, with $k = 1$.
Table 5.2
Architecture | Order | Number of additions per output
Figure 5.39 | 3 | $(5k + 24)(2^m - 1) - 20m$
Figure 5.40 | $n$ | $[(2n - 1)k + m((n - 1)^2 + n)]\,2^m$
Figure 5.41 | 3 | $(5k + 7(m - 1))\,2^m + 7k + 21m - 14$
Figure 5.41 | 4 | $(7k + 13(m - 1))\,2^m + 9k + 36m - 32$
Figure 5.43 | $n$ | $(2^m + 1)(k + nm)\,n$
Figure 5.44: The number of one-bit addition operations required to generate a single output for the different implementations of a third-order Sinc decimator. $2^m$ is the decimation factor of the Sinc decimator.
Figure 5.45: The number of one-bit addition operations required to generate a single output for the different implementations of a fourth-order Sinc decimator. $2^m$ is the decimation factor of the Sinc decimator.
From Figures 5.44 and 5.45, it can be seen that the first architecture requires 3 to 5 times fewer one-bit addition operations than the other three architectures. Architecture 3, in which the even and odd samples are filtered separately and then merged together and filtered again, is more computationally efficient, especially for higher decimation factors, than a decimator which filters all the samples together (architecture 2); this is because it removes some of the redundant computations. In terms of decimation-factor programmability, the fourth architecture is the most easily programmable. However, this architecture is the least computationally efficient. Notice that the integrators of this architecture operate at a high rate and at a high resolution. For architecture 1, the decimation factor must be a power of 2; however, this architecture is the most computationally efficient, hence it was the implementation used in realizing the Sinc decimator.
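The closed-form counts for the second, third and fourth architectures can be evaluated directly to see how they scale with the decimation factor. The sketch below only transcribes those expressions; the function names are hypothetical, and the second-architecture expression assumes the squared term $(n - 1)^2$.

```python
def additions_arch2(k, m, n):
    """Second architecture: n - 1 Sinc stages plus an accumulate-and-dump."""
    return ((2 * n - 1) * k + m * ((n - 1) ** 2 + n)) * 2 ** m

def additions_arch3_order3(k, m):
    """Third architecture (even/odd split), third order, Equation 5.16."""
    return (5 * k + 7 * (m - 1)) * 2 ** m + 7 * k + 21 * m - 14

def additions_arch3_order4(k, m):
    """Third architecture (even/odd split), fourth order."""
    return (7 * k + 13 * (m - 1)) * 2 ** m + 9 * k + 36 * m - 32

def additions_arch4(k, m, n):
    """Fourth architecture (integrators and differentiators), Equation 5.20."""
    return (2 ** m + 1) * (k + n * m) * n

k, n = 1, 3
for m in (3, 4, 5):
    print(2 ** m, additions_arch2(k, m, n),
          additions_arch3_order3(k, m), additions_arch4(k, m, n))
```

For a one-bit input and a third-order filter, the second and third architectures tie at a decimation factor of 8, and the even/odd split pulls ahead as the decimation factor grows, in line with the comparison above.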
5.8 Sinc Decimator Numerical Accuracy 
Optimizing the datapath width without degrading the output numerical accuracy plays a central role in achieving a power-efficient architecture. In the previous section, four possible architectures for a Sinc decimator [$\mathrm{Sinc}^n(2^m)$] were considered. In this section, the effect of reducing the datapath width of the internal operators, and the corresponding reduction in computational complexity, on the numerical accuracy (signal-to-noise ratio) at the output of the Sinc decimator is considered [149].
The architectures considered in this analysis are the first and the fourth Sinc architectures given in the previous section. The first architecture was found to be the most computationally efficient architecture, while the fourth architecture is the most flexible in terms of programmability. The Sinc decimator considered in this analysis has the following parameters: $k = 1$, $m = 3$ and $n = 3$.
The resolution at the output should be 10 bits. However, this resolution is more than sufficient. A third-order Sinc decimator is used to decimate the oversampled output of a second-order (lowpass equivalent) Sigma-Delta modulator. Decimating the oversampled signal by a factor of 8 ($m = 3$) achieves an SNR of 32 dB. This is equivalent to 5-6 bits of resolution. The output resolution can therefore be lowered from 10 bits with little impact on the output numerical accuracy, thus eliminating irrelevant computations.
Figure 5.46 shows the spectrum at the output of a third-order Sinc decimator based on the first architecture given in Figure 5.39. The Sinc decimator is connected to the output of a second-order lowpass Sigma-Delta modulator. The sine wave injected into the Sigma-Delta modulator has an amplitude of 0.5 and a frequency of 0.0031 relative to the sampling frequency at the input of the Sigma-Delta modulator. The output sampling frequency of the Sinc decimator is $f_s' = f_s/8$. The SNR at the output of the Sinc decimator, where the noise is limited to the frequency band $[-f_s'/2, f_s'/2]$, is 40.2 dB. Theoretically, this SNR should have been 41 dB.
The output of the Sinc decimator can have any value between $-1.00000000_2$ and $1.00000000_2$. This requires 1 integer bit, 1 sign bit and 8 fraction bits. The integer bit is required only to represent $1.00000000_2$. If we can eliminate this value by approximating it to $0.11111111_2$, we will require only 9 bits: 1 sign bit and 8 fraction bits.
Figure 5.46: The output frequency spectrum of the Sinc decimator given in Figure 5.39, having k = 1, m = 3, and n = 3, operating at full resolution as shown in Figure 5.39. The input is the output of a second-order lowpass Sigma-Delta modulator, having an input sine wave of amplitude 0.5 and a frequency 0.0031 the sampling frequency.
The question now is where to do this approximation. We can do it after the first stage by approximating $1.00_2$ to $0.11_2$. In this case, the maximum output of the Sinc decimator is $0.11000000_2$. Or we can do it after the second stage by approximating $1.00000_2$ to $0.11111_2$. In this case, the maximum output of the Sinc decimator is $0.11111000_2$. Finally, we can do the approximation at the output of the third and final stage.
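The three maximum-output values quoted above can be reproduced with exact rational arithmetic. The sketch assumes, as an illustration only, that the first architecture is a cascade of three unity-DC-gain $\mathrm{Sinc}^3(2)$ stages, each followed by down-sampling by two, so that the stage outputs carry 2, 5 and 8 fraction bits. The function names are hypothetical.

```python
from fractions import Fraction

TAPS = (1, 3, 3, 1)  # coefficients of (1 + z^-1)^3, DC gain 8

def sinc3_by2(x):
    """Unity-DC-gain Sinc^3(2) stage followed by down-sampling by two."""
    padded = [x[0]] * 3 + list(x)  # hold the first sample to skip the start-up transient
    return [sum(c * padded[i + 3 - t] for t, c in enumerate(TAPS)) / 8
            for i in range(1, len(x), 2)]

def clamp_top(x, frac_bits):
    """Replace +1.0 by the largest value below it that has frac_bits fraction bits."""
    limit = 1 - Fraction(1, 2 ** frac_bits)
    return [min(s, limit) for s in x]

def max_output(clamp_after_stage, stages=3):
    """Largest decimator output when the integer bit is removed after one stage."""
    x = [Fraction(1)] * 64                     # worst case: a long run of +1 inputs
    for stage in range(1, stages + 1):
        x = sinc3_by2(x)
        if stage == clamp_after_stage:
            x = clamp_top(x, 3 * stage - 1)    # 2, 5 or 8 fraction bits
    return max(x)

for stage in (1, 2, 3):
    print(stage, max_output(stage))
```

Clamping after the first stage caps the output at 3/4, which is $0.11000000_2$; clamping after the second stage caps it at 31/32, which is $0.11111000_2$; clamping at the output caps it at 255/256.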
The advantage of doing this approximation in an earlier stage is a reduced datapath width in the following stages. The disadvantage is a lower SNR at the Sinc output. Figure 5.47 shows the frequency spectrum at the output of the Sinc decimator, under the same conditions as previously explained, with the approximation done after the first stage. Figure 5.48 shows the same frequency spectrum but with the approximation done after the second stage.
Figure 5.47: The output frequency spectrum of the Sinc decimator given in Figure 5.39, having k = 1, m = 3, and n = 3; the sampled signal is approximated after the first stage by eliminating the integer bit. The input is the output of a second-order lowpass Sigma-Delta modulator, having an input sine wave of amplitude 0.5 and a frequency 0.0031 the sampling frequency.
Figure 5.48: The output frequency spectrum of the Sinc decimator given in Figure 5.39, having k = 1, m = 3, and n = 3; the sampled signal is approximated after the second stage by eliminating the integer bit. The input is the output of a second-order lowpass Sigma-Delta modulator, having an input sine wave of amplitude 0.5 and a frequency 0.0031 the sampling frequency.
The reduction in computational complexity resulting from approximating the sampled signal after the second stage, by the elimination of the integer bit, is 4.5%. The degradation in the SNR at the output of the Sinc decimator, evident from comparing Figure 5.48 to Figure 5.46, is negligible. The reduction in computational complexity resulting from approximating the sampled signal after the first stage, by the elimination of the integer bit, is 13.5%. However, in this case, the degradation in the SNR at the output of the Sinc decimator, evident from comparing Figure 5.47 to Figure 5.46, is substantial.
To further reduce the datapath width at the output of the Sinc decimator, it is possible to eliminate the least significant fraction bits. This elimination can be done after any Sinc stage. The earlier it is performed, the greater the reduction in computational complexity and the lower the SNR performance at the output of the Sinc decimator. Notice that when the least significant bit is eliminated after the last stage, there is no saving in the computational complexity of the Sinc decimator. However, the computational complexity of the following stage, which is the LPDF, is reduced due to the lower datapath width.
Two methods for fraction bit elimination were considered. The first is truncation. The second is alternate up/down rounding, where the sample is rounded up for one sample and rounded down for the next. Table 5.3 gives the SNR performance at the output of the Sinc decimator for the two bit-elimination methods, with rounding performed after the first, second and third Sinc stages. The noise calculated at the output of the Sinc decimator is limited to the frequency band $[-f_s'/2, f_s'/2]$, where $f_s'$ is the output sampling frequency of the Sinc decimator.
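The difference between the two bit-elimination methods can be illustrated on integer samples. The sketch below is a simplified model and not the simulation behind Table 5.3: samples are plain integers, one least significant bit is dropped, and the test data are uniformly distributed random values from a seeded generator. The function names are hypothetical.

```python
import random

def truncate(samples, drop_bits):
    """Clear the lowest drop_bits bits (always rounds toward minus infinity)."""
    return [(s >> drop_bits) << drop_bits for s in samples]

def alternate_round(samples, drop_bits):
    """Round up on even-indexed samples and down on odd-indexed samples."""
    step = 1 << drop_bits
    out = []
    for i, s in enumerate(samples):
        down = (s >> drop_bits) << drop_bits
        up = down if down == s else down + step
        out.append(up if i % 2 == 0 else down)
    return out

def mean_error(original, reduced):
    """Average of (reduced - original), in units of the original LSB."""
    return sum(r - o for o, r in zip(original, reduced)) / len(original)

rng = random.Random(0)
samples = [rng.randrange(-256, 256) for _ in range(10000)]
print("truncation:", mean_error(samples, truncate(samples, 1)))
print("alternate :", mean_error(samples, alternate_round(samples, 1)))
```

Truncation leaves a systematic offset of about half of the dropped bit's weight, which accumulates in the following stages, whereas the alternating scheme has an error that averages out to roughly zero. This sketch does not handle the case where rounding up exceeds the word range.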
Notice that when performing fraction bit elimination after the first or second stage, alternate up/down rounding has substantially better performance than truncation. Notice also that fraction bit elimination after the first stage leads to up to 7 dB degradation in the SNR when alternate up/down rounding is used, while bit elimination after the second stage leads to only a 3 dB degradation in the SNR, which is equivalent to half a bit. Hence, the latter is more appropriate to use. If we perform fraction bit elimination after the second and the third stages, and integer bit elimination after the second stage, the SNR at the output of the Sinc decimator is 36.4 dB. This is equivalent to a 3.8 dB (just over half a bit) reduction from the SNR of the full resolution case, while the resolution at the output has dropped by 3 bits, from 10 to 7 bits. The computational complexity of the Sinc decimator is reduced by 9% in this case. The frequency spectrum at the output of the Sinc decimator when the output resolution is 7 bits is shown in Figure 5.49.
Table 5.3: The effect of fraction bit elimination on the SNR at the output of a Sinc decimator based on the first architecture. The input signal is the output of a second-order Sigma-Delta modulator having an input sine wave of amplitude 0.5 and a frequency 0.0031 the sampling frequency.
Bit elimination method | Bit elimination stage | SNR | Saving in CC*
Truncation
* CC = computational complexity
Figure 5.49: The output frequency spectrum of the Sinc decimator given in Figure 5.39, having k = 1, m = 3, and n = 3, and having a resolution of 4 bits after the first stage, 5 bits after the second stage and 7 bits at the output of the Sinc decimator. The input to the Sinc decimator is the output of a second-order lowpass Sigma-Delta modulator, having an input sine wave of amplitude 0.5 and a frequency 0.0031 the sampling frequency.
Now consider the effect of reducing the numerical accuracy on the output of a Sinc decimator based on the fourth architecture shown in Figure 5.43. When the output is at full resolution and each integrator and differentiator is operating at the full resolution (10 bits), the output spectrum is identical to that of the first architecture shown in Figure 5.46. The output sample word has 1 integer bit, 1 sign bit and 8 fraction bits. If we eliminate the integer bit by using 9-bit integrators and differentiators, the output of the Sinc decimator is not affected, except if the output should have been $+1.00000000_2$, which is interpreted as $-1.00000000_2$ (a full-scale error). For this output to occur, an input pattern consisting of 22 consecutive ones is required (see Appendix B). This input pattern rarely occurs, and if we slightly limit the amplitude of the input signal it will never occur.
Figure 5.50: The output of a Sinc decimator based on the fourth architecture, having k = 1, m = 3, and n = 3, and having integrators and differentiators with a datapath width of 8 bits. The input to the Sinc decimator is the output of a second-order lowpass Sigma-Delta modulator, having an input sine wave of amplitude 0.5 and a frequency 0.0031 the sampling frequency.
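The full-scale error for the 9-bit datapath follows directly from two's-complement wraparound. A minimal sketch, assuming the output word is held in units of $2^{-8}$ so that $+1.00000000_2$ is the integer 256 (the helper name is hypothetical):

```python
def to_twos_complement(value, bits):
    """Value that a bits-wide two's-complement register actually holds."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

FRACTION_BITS = 8
plus_one = 1 << FRACTION_BITS        # +1.00000000 in units of 2^-8

for bits in (10, 9):
    stored = to_twos_complement(plus_one, bits)
    print(bits, "bits:", stored / 2 ** FRACTION_BITS)
```

With 10 bits the value is stored as +1.0; with 9 bits it reads back as -1.0, while every other output value is unchanged.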
If we try to further reduce the datapath width of the integrators and differentiators to 8 bits by removing a most significant bit, the output signal will be distorted, as shown in Figure 5.50.
If the resolution of the first integrator is 8 bits and the remainder of the datapath is 9 bits, the output signal will be substantially distorted. This is evident from the frequency spectrum shown in Figure 5.51.
Figure 5.51: The output frequency spectrum of a Sinc decimator based on the fourth architecture, having k = 1, m = 3, and n = 3, and having the first integrator of resolution 8 bits and the remainder of the datapath with resolution 9 bits. The input of the Sinc decimator is the output of a second-order lowpass Sigma-Delta modulator, having an input sine wave of amplitude 0.5 and a frequency 0.0031 the sampling frequency.
The resolution of any stage of a Sinc decimator based on the fourth architecture cannot fall below 9 bits (by removing most significant bits) without substantially degrading the output performance. So far we tried to reduce the resolution by the elimination of the most significant bits. However, if we try to reduce the resolution by the elimination of the least significant bits, the degradation in the output performance is more graceful. The results of these simulations are shown in Table 5.4.
Table 5.4: The effect of fraction bit elimination on the SNR at the output of a Sinc decimator based on the fourth architecture. The input signal to the Sinc decimator is the output of a second-order lowpass Sigma-Delta modulator having an input sine wave of amplitude 0.5 and a frequency 0.0031 the sampling frequency. The differentiators have the same resolution as that of the last integrator stage.
Bits per integrator stage 1
* CC = computational complexity. This is relative to the full resolution case where each integrator has 10 bits.
The simulation results presented for a Sinc decimator based on the fourth architecture indicate that eliminating the most significant bit reduces the computational complexity by 10%, with a negligible degradation in the SNR performance at the output of the Sinc decimator. The reduction of the computational complexity in this architecture is greater than the corresponding reduction in computational complexity of a Sinc decimator based on the first architecture when the most significant bit is eliminated after the second Sinc stage.
When the output has 8 bits of resolution (eliminating the least significant bit in addition to the most significant bit), and with the first and second integrators having a 9 bit resolution, the output SNR is 37.1 dB. The reduction in the computational complexity due to the elimination of the least significant bit only is 4%. Note that this case is comparable to that of a Sinc decimator based on the first architecture where the approximation is performed after the second stage. In that case, the output SNR is 37.3 dB, and the reduction in computational complexity is 4.5%.
If we further reduce the numerical accuracy of the third integrator stage to 7 bits, the output SNR is reduced by 5 dB (almost 1 bit) to 32.1 dB. In this case, the degradation in performance is too large to make it a practical solution. However, if we choose the resolution of the third integrator stage to be 8 bits, and perform the approximation from 8 bits to 7 bits at the output of the Sinc decimator by dropping the least significant bit, the SNR at the output of the Sinc decimator is 35.8 dB. The overall reduction in computational complexity is 14%. The reduction in the numerical accuracy from the full resolution case is 4.5 dB (0.75 bits), while the datapath width at the output of the Sinc decimator has been reduced from 10 bits to 7 bits.
Even though the reduction in computational complexity achieved in a Sinc decimator based on the fourth architecture (14%) is greater than that achieved in a Sinc decimator based on the first architecture (9%), the computational complexity (number of one-bit addition operations required per output) of the first architecture is still 2.8 times lower than that of the fourth architecture operating at the lower numerical accuracy.
5.9 Power Efficient Lowpass Filter Design 
The lowpass decimation filter (LPDF) is based on the multiply-accumulate architecture. This architecture is shown in Figure 5.52. The filter designed is a linear-phase FIR filter, so it has symmetric coefficients. The filter architecture has been modified [147] to take into account the symmetry of the LPDF coefficients, by reading two values from the memory, adding them together, and then multiplying the sum by the desired coefficient. This reduces the number of multiplications by 50%. However, the number of RAM reads remains unchanged. By simulation of the different blocks it was found that the RAM dissipates 35% of the total LPF power dissipation. Hence, the power saving achieved by eliminating multiplications due to filter symmetry is 33%.
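The folding of symmetric coefficients can be sketched as follows. This is an illustrative model rather than the hardware datapath: the function names are hypothetical, the window is a plain list of the most recent samples, and the example coefficients are made up. The 33% figure is consistent with halving the non-RAM share of the power, 0.5 × (1 − 0.35) ≈ 0.33, under the assumption that the non-RAM power scales with the number of multiplications.

```python
def fir_direct(coeffs, window):
    """One output sample using one multiplication per tap."""
    return sum(c * s for c, s in zip(coeffs, window))

def fir_folded(coeffs, window):
    """Same output for symmetric coeffs: add the mirrored samples, then multiply."""
    n = len(coeffs)
    acc = 0
    for i in range(n // 2):
        acc += coeffs[i] * (window[i] + window[n - 1 - i])   # two reads, one multiply
    if n % 2:
        acc += coeffs[n // 2] * window[n // 2]               # centre tap has no mirror
    return acc

def folded_multiplies(n):
    """Multiplications per output after folding an n-tap symmetric filter."""
    return (n + 1) // 2

coeffs = [1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]       # symmetric, 15 taps
window = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8, 9, -7, 9]
print(fir_direct(coeffs, window), fir_folded(coeffs, window), folded_multiplies(15))
```

Both functions read every sample in the window, which is why the RAM share of the power is unchanged while the multiplier work is roughly halved.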
The number of multiplications is further reduced by using a halfband filter [120]. Halfband filters have half of their coefficients equal to zero. However, this does not reduce the number of multiplications by 50%, because halfband filters are usually longer (have more delay units and hence more taps) than non-halfband filters that provide the same out-of-band attenuation.
To give the same stopband attenuation, halfband filters are designed 25%-30% longer than their corresponding non-halfband filters. Hence, the designed halfband filters require 30%-35% fewer multiplications (and RAM reads) than the non-halfband filters, and the power saving achieved by halfband filters is 30%-35%.
Figure 5.52: Conventional multiply-accumulate filter architecture.
Figure 5.53 is the timing diagram for the LPDF. Every two consecutive RAM reads (one from the A stream and the other from the B stream) are added and multiplied by a single ROM coefficient. During the RAM's read state, the output of the LPDF is being accumulated. At the end of the read state, the LPDF output is generated, the RAM changes to the write state, and two samples are written into the RAM. The RAM then changes back to the read state. Every read cycle, the samples read are advanced two memory locations. The fact that two samples are written into the RAM every time a single sample is generated at the output means that the LPDF performs down-sampling by a factor of two.
Figure 5.53: Lowpass filter timing diagram.
The samples are written into the RAM at non-uniform periods (two samples are written every read/write cycle), while the input samples to the LPDF arrive at uniform periods. To achieve synchronization, the synchronization block shown in Figure 5.54 is used [147]. This synchronization block allows the memory to operate between two quasi-synchronous blocks. It eliminates the need for an interrupt every time an input sample is available. This simplifies the design of the control unit of the LPDF.
Figure 5.54: Synchronization block.
Figure 5.55 and Figure 5.56 show the timing diagrams for the synchronization block for two cases. In the first case, Figure 5.55, clkI does not change state during the RAM write state (when R/W = 0). Notice that when w = 1, the output to the RAM is always the earliest sample found in the two registers R1 and R2. When w = 0, the output to the RAM is the latest sample found in the two registers R1 and R2.
Figure 5.55: Synchronization block timing diagram.
In the second case, Figure 5.56, clkI changes state during the RAM write state. In this case, the value of clkI at the moment R/W changes from 1 to 0 (read to write state) is stored in clkI'. This is necessary to avoid losing the sample stored in R2 when clkI changes from 0 to 1 while R/W = 0 (the case shown in Figure 5.56), or the sample stored in R1 when clkI changes from 1 to 0 while R/W = 0.
Figure 5.56: Synchronization block timing diagram.
The multiplier-accumulator (MAC) used in the LPDF has been designed such that the adder is interleaved in the multiplier array. Figures 6.7 and 6.8 show the architecture used to implement the MAC [148]. For a 20 × 20 multiplier-accumulator, this structure achieves a 17% reduction in power dissipation.
5.9.1 Variable Resolution Lowpass Architecture 
To reduce the power dissipation when a lower resolution is sufficient, the datapath is divided into parallel units as shown in Figure 5.57 [146] [147]. The blocks corresponding to the least significant bits are deactivated when a lower resolution is sufficient. In Figure 5.57, the datapath is shown consisting of 3 parallel units, making its resolution one of three values: 12, 16 or 20 bits.
Figure 5.57: The modified filter architecture.
The division of the datapath into parallel units increases the overhead in terms of power dissipation (as well as area). The purpose of this section is to analyze this block deactivation technique, to determine the amount of power saving that it can achieve, and to determine the optimum number of parallel units into which the datapath should be divided for a given application and technology.
The datapath of the LPF consists of functional blocks as shown in Figure 5.57. The power dissipation of each functional block $F$ can be divided into two components: the first is a constant independent of the datapath width, while the second is proportional to the datapath width $N$:
$$P_F = P_{0F} + P_{1F} N \quad (5.21)$$
Equation 5.21 is valid for a multiplier unit only if the width of the multiplicand is variable, while the width of the multiplier is fixed.
Table 5.5 gives the power components for each functional block in the datapath, in addition to the power components of the entire datapath. These numbers are for a 0.5 µm, 3.3 V CMOS technology [150]. For the multiplier unit, it is assumed that the multiplier has a fixed width of 20 bits. The total power dissipation is for the architecture given in Figure 5.52, which has 1 RAM, 1 multiplier, 2 adders and 3 registers. Assume that the total power components are $P_{0T}$ and $P_{1T}$ for the fixed and width-dependent power components respectively. Thus, if the datapath is not divided into parallel units, the total power dissipation is given by:
$$P_T = P_{0T} + P_{1T} N$$
Table 5.5: Power components for the individual functional blocks and the entire datapath of the architecture shown in Figure 5.52.
Functional Unit    $P_{1F}$
RAM
Adder
Register
If the datapath is divided into $M$ parallel units, the power dissipation when the full accuracy of the datapath is required is:
$$P_T = M P_{0T} + P_{1T} N$$
Notice that the power dissipation in this case has increased by $(M - 1)P_{0T}$. In the low resolution case, the power dissipation is given by:
$$P_T = P_{0T} + P_{1T} N_L$$
In this case, the power dissipation is reduced by $(N - N_L)P_{1T}$. $N_L$ is the datapath width in the low resolution case. The amount of power saving depends on the percentage of time the datapath is required to operate at each resolution, as well as on the ratio between the two power components.
Suppose that the datapath can take one of $M$ resolutions. The lowest resolution, $N_L$ bits, has a probability $\alpha$. The remaining resolutions:
$$N_L + i\,\frac{N - N_L}{M - 1}, \qquad i = 1, \ldots, M - 1$$
are equi-probable and have a probability:
$$\frac{1 - \alpha}{M - 1}$$
Using this probability distribution it is possible to find the average power dissipation:
$$P_{avg} = P_{0T}\left[1 + \frac{M}{2}(1 - \alpha)\right] + P_{1T}\left[N_L + \frac{M}{2(M - 1)}(N - N_L)(1 - \alpha)\right] \quad (5.27)$$
Define $P_r$, the relative power, to be the ratio between the power dissipation of the system with parallel datapath units and that dissipated in a single-datapath system. $\rho$ is defined as:
$$\rho = \frac{P_{0T}}{P_{1T}}$$
Therefore,
$$P_r = \frac{\rho}{\rho + N}\left[1 + \frac{M}{2}(1 - \alpha)\right] + \frac{1}{\rho + N}\left[N_L + \frac{M}{2(M - 1)}(N - N_L)(1 - \alpha)\right] \quad (5.29)$$
To minimize $P_r$, the optimum number of units into which the datapath should be divided is given by:
$$M_{opt} = 1 + \sqrt{\frac{N - N_L}{\rho}}$$
Figure 5.58: Relative power dissipation, $P_r$, of a lowpass filter using block deactivation versus $M$, for $\alpha = 0.4$, $N = 20$ and $N_L = 12$. $M$ is the number of parallel datapath units.
For the design parameters of the lowpass decimation filter being designed, $M_{opt}$ is 3. Figures 5.58 and 5.59 show the relationship between the relative power dissipation and $M$, the number of parallel units into which the datapath is divided, for different values of $\alpha$ and $\rho$. $N$ was taken to be 20 and $N_L$ to be 12. Figure 5.60 shows the relative power dissipation, for a three-parallel-unit datapath, versus $\gamma$ and $\delta$. $\gamma$ is the probability that a 12 bit resolution is sufficient. $\delta$ is the probability that a 16 bit resolution is sufficient.
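The relative power of Equation 5.29 is easy to tabulate against the number of parallel units. The sketch below assumes $\rho = P_{0T}/P_{1T}$ and the evenly spaced, equi-probable resolutions described above; setting the derivative of Equation 5.29 with respect to $M$ to zero then gives $M = 1 + \sqrt{(N - N_L)/\rho}$, which is what `optimum_units` evaluates. The function names are hypothetical.

```python
import math

def relative_power(M, alpha, rho, N, NL):
    """Equation 5.29: average power with M units relative to a single datapath."""
    fixed = rho * (1 + 0.5 * M * (1 - alpha))
    width = NL + M / (2 * (M - 1)) * (N - NL) * (1 - alpha)
    return (fixed + width) / (rho + N)

def optimum_units(rho, N, NL):
    """Continuous minimiser of relative_power with respect to M."""
    return 1 + math.sqrt((N - NL) / rho)

rho, N, NL = 2.29, 20, 12
for M in range(2, 7):
    print(M, round(relative_power(M, 0.4, rho, N, NL), 4))
print("continuous optimum:", round(optimum_units(rho, N, NL), 2))
```

With $\rho = 2.29$, $N = 20$ and $N_L = 12$, the continuous optimum is about 2.9 and the best integer choice is three units. When the lowest resolution is always sufficient ($\alpha = 1$), the relative power is $(\rho + N_L)/(\rho + N) \approx 0.64$, close to the 35% maximum saving reported for this design.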
From the results shown in Figures 5.58 to 5.60, we can make the following conclusions:
• The optimum value of $M$ (which makes $P_r$ minimum) depends on $\rho$. As $\rho$ decreases, $M_{opt}$ increases.
• For the total power components $P_{0T}$ and $P_{1T}$ given in Table 5.5, $\rho = 2.29$, and we find that $M_{opt}$ is 3, which is the value used in the design of the LPF decimation filter.
• The maximum reduction in power dissipation that can be achieved using this algorithm is 35%. This is achieved when a 12 bit resolution is sufficient for most of the time.
Figure 5.59: Relative power dissipation, $P_r$, of a lowpass filter using block deactivation versus $M$, for $\alpha = 0.6$, $N = 20$ and $N_L = 12$. $M$ is the number of parallel datapath units.
Figure 5.60: Relative power dissipation, $P_r$, of a lowpass decimation filter consisting of a three-unit parallel datapath versus $\gamma$ and $\delta$, for $M = 3$ and $\rho = 2.29$. The datapath resolution can be 12, 16, or 20 bits. $\gamma$ is the probability that a 12 bit resolution is sufficient. $\delta$ is the probability that a 16 bit resolution is sufficient.
The analysis of the block deactivation technique presented so far assumes that the multiplier unit has a fixed multiplier width of 20 bits. If the width of the multiplier as well as the width of the multiplicand are allowed to vary, which is the case in the multiplier-accumulator array designed in Chapter 6, the power saving that can be achieved with the block deactivation technique presented in this section is even greater.
Figure 5.61 shows the relative power dissipation, for a three-unit parallel datapath, versus $\gamma$ and $\delta$. $\gamma$ is the probability that a 12 bit resolution is sufficient. $\delta$ is the probability that a 16 bit resolution is sufficient. The maximum reduction in power dissipation that can be achieved is 40%. This is achieved when a 12 bit resolution is sufficient for most of the time.
In this section, several low-power techniques have been employed in the design of the lowpass decimation filter. The power saving each technique can achieve is shown in Table 5.6. The total reduction in power dissipation is about 4 times.
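The total saving quoted in Table 5.6 is consistent with the individual savings compounding multiplicatively. The following sketch makes that assumption explicit; the function name is hypothetical and the inputs are the per-technique percentages from the table.

```python
def combined_saving(savings):
    """Overall saving when each technique scales the remaining power in turn."""
    remaining = 1.0
    for s in savings:
        remaining *= 1.0 - s
    return 1.0 - remaining

# Symmetric coefficients, halfband filter, interleaved MAC array, datapath division.
low = combined_saving([0.33, 0.30, 0.17, 0.10])
high = combined_saving([0.33, 0.35, 0.17, 0.40])
print(f"{low:.1%} to {high:.1%}")
```

Under this assumption, the lower and upper ends of the individual ranges give totals of roughly 65% and 78%.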
S. 10 VLSI Implementation of the Decimat ion 
Filt er 
The decimation filter is designed to generate a 12 - 20 bits resolution sampled 
signal fiom a 1 - 2 bits resolution oversampled signal. The first stage of the deci- 
CHAPTER 5. A/D CONVERTER FOR SOFTWARE RADIO 
Figure 5.61: Relative power dissipation, Pr, of a lowpass decimation filter consist- 
ing of a three-unit pardel datapath, versus 7 and 6, for M = 3. The lowpass 
decimation filter uses the programmable multiplier-accumulator array designed in 
chapter 6. The power dissipation of the other components are as given in Table 5.5. 
The datapath resolution can be: 12, 16, or 20 bits. 7 is the probability that a 12 
bit resolution is sufficient. 6 is the probability that a 16 bit resolution is sufficient. 
CHAPTER 5. A/D CONVERTER FOR SOFTWARE RADIO 
Table 5.6: Power savings for high-level low-power techniques used in the design of 
the lowpass decimation filter. 
Low power technique 
Symmetric filter coefficients 
Halfband fdter 
mation filter is the Sinc decimator. The order of the Sinc decimator, as well as its 
decimation factor, can be programmed. 
Power saving 
33% 
30 - 35% 
Interleaved MAC array 
D at apat h division 
TOTAL 
The Sinc decimator can be a third-order Sinc decimator. This accepts a one 
bit sampled signal fiom a fourth-or der bandpass Sigma-Delt a modulat or. The Sinc 
decimator can also be a fourth-order Sinc decimator. This accepts a two-bit sampled 
signal from a sixth-order bandpass Sigma-Delta modulator. The decimation factor 
of the Sinc decimator can be 8, 16 or 32. 
- - 
17% 
10 - 40% 
65 - 78% 
The second stage of the decimation filter is the lowpass decimation filter. The
lowpass decimation filter consists of two stages in cascade, each of which
decimates by a factor of two. Each lowpass stage is a linear phase stage. Each
lowpass stage can be programmed to be a halfband filter or a non-halfband filter.
The output resolution of each lowpass stage can be programmed to be 12, 16 or
20 bits. The first LPF stage requires 15 taps, while the second LPF stage
requires 47 taps.
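Because each lowpass stage is linear phase, its coefficients are symmetric, so each pair of equal coefficients can share one multiplication. The sketch below illustrates this folding for a decimate-by-two FIR stage and checks it against the direct form. It assumes an odd number of taps with h[j] = h[N-1-j] and zero-valued samples before the start of the input; the function names and the example coefficients are hypothetical.

```python
def fir_decimate_by_2_direct(x, h):
    """Direct form: one multiplication per tap, only kept outputs computed."""
    def sample(i):
        return x[i] if 0 <= i < len(x) else 0
    return [sum(h[j] * sample(t - j) for j in range(len(h)))
            for t in range(0, len(x), 2)]


def fir_decimate_by_2_folded(x, h):
    """Folded form: samples sharing a coefficient are added before multiplying."""
    n_taps = len(h)
    if n_taps % 2 == 0 or list(h) != list(reversed(h)):
        raise ValueError("expects an odd-length symmetric impulse response")
    half = n_taps // 2

    def sample(i):
        return x[i] if 0 <= i < len(x) else 0

    out = []
    for t in range(0, len(x), 2):
        acc = h[half] * sample(t - half)          # centre tap
        for j in range(half):                     # one multiply per tap pair
            acc += h[j] * (sample(t - j) + sample(t - (n_taps - 1 - j)))
        out.append(acc)
    return out


def multiplications_per_output(n_taps):
    """Multiplications per output sample for the folded symmetric form."""
    return (n_taps + 1) // 2


h = [1, 2, 3, 4, 3, 2, 1]          # illustrative symmetric coefficients
x = list(range(20))
print(fir_decimate_by_2_folded(x, h) == fir_decimate_by_2_direct(x, h))
print(multiplications_per_output(15), multiplications_per_output(47))
```

With this folding, the 15-tap and 47-tap stages need 8 and 24 multiplications per output sample instead of 15 and 47.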
Figure 5.62 shows a block diagram of the decimation filter. The decimation
filter was designed in a 0.5 µm, 3.3 Volt CMOS technology. The total number of
transistors required for the decimation filter is 187 K. Table 5.7 shows the number
of transistors required for each stage of the decimation filter. The total area of the
decimation filter is 7.4 mm x 6.4 mm. The VLSI layout of the decimation filter is
shown in Figure 5.63.

Figure 5.62: Block diagram of the designed decimation filter.

Table 5.7: The number of transistors required for each stage of the lowpass
decimation filter.
Figure 5.63: VLSI layout of the decimation filter.

5.11 Chapter Summary
Parallelism by 4x of analog signal processors is applied to the design of a bandpass
Sigma-Delta modulator. The speed of the modulator is increased without increasing
the speed requirement of the individual building blocks. Several architectures
were considered in terms of their resilience to implementation details such as
mismatch and gain errors. A switched-capacitor circuit has also been designed for the
proposed modulator.
Several high-level low-power design techniques have been incorporated into the
design of the decimation filter. These include operation minimization, multiplier
elimination and block deactivation. These techniques lower the computational
complexity by eliminating redundant and irrelevant computations. In the case of the
Sinc decimator, eliminating redundant computations involves using architectures
that avoid unnecessary computations due to decimation. Reducing the datapath
width of the Sinc decimator eliminates irrelevant computations. These are the
computations that have little effect on the output numerical accuracy, because the
numerical accuracy is limited by some other block in the system. In this case, the
other block is the Sigma-Delta modulator.
The lowpass decimation filter also eliminates irrelevant computations, by using
a block deactivation technique that avoids unnecessary computations when a
lower resolution is sufficient. The other low-power techniques used in the lowpass
decimation filter include operation interleaving and multiplier elimination. The
decimation filter, with a programmable resolution that varies from 12 to 20 bits,
has been designed in a 0.5 µm, 3.3 Volt CMOS technology.
Chapter 6
Low-Power Multiplier-Accumulator Array
Several low-power design techniques have been applied to the design of a
power-efficient multiplier-accumulator (MAC) array. The addition operation has been
interleaved into the multiplier array. This can achieve up to 27% saving in power
dissipation. The MAC array is designed to have a programmable resolution so that
the blocks corresponding to the least significant bits can be deactivated when a lower
resolution is sufficient. This achieves up to 50% saving in power dissipation. The
multiplier-accumulator has been designed in a 0.5 µm, 3.3 Volt CMOS technology.
6.1 Introduction 
The rapid development in integrated circuit technology has led to the emergence of
powerful, faster and smaller digital signal processors. Many of the functions that
were performed in the analog domain are now being performed digitally. Digital
processing not only improves quality, but it enhances the performance as well by
allowing more programmability.
There is a trend to extend digital processing to higher speed analog
signals. To be able to do this, power-efficient DSP architectures are required. At
the heart of a digital signal processor is the multiply-accumulate operation.
In this chapter, the design of a power-efficient multiplier accumulator (MAC) is
investigated. Several low-power techniques have been incorporated into the design
of this MAC. The addition operation has been interleaved into the multiplier array;
this achieves a power saving of up to 27% for a 10 x 10 array. The MAC is
designed to have a programmable resolution so that the blocks corresponding to the
least significant bits can be deactivated when a lower resolution is sufficient. This
achieves a power saving of up to 50%.
The multiplier of the MAC array is based on the modified Booth algorithm. 
The accumulator's input and output are in the sum-carry representation. 
In section 6.2, the modified Booth algorithm multiplier is presented. The sign
extension algorithm and the multiplier array architecture are developed in that
section as well. In section 6.3, the multiplier-accumulator array is developed. The
resolution-programmable MAC array is developed in section 6.4. The MAC array
has been designed in a 0.5 µm, 3.3 Volt CMOS technology; this is discussed in
section 6.5.
6.2 The Modified Booth Algorithm Multiplier 
There are three types of multipliers [84]. Parallel multipliers [151] generate the
partial products concurrently and sum them using a multi-operand adder. Sequential
multipliers generate the partial products sequentially and accumulate them to the
previously summed partial products. Array multipliers [152] - [155] are made up
of an array of identical cells that generate the partial product and do the
summation. Array multipliers have a regular structure making them suitable for VLSI
implementation.
The modified Booth algorithm multiplier [156] [157] is used for multiplying
two's complement operands. The number of partial products generated is half
the number of multiplier bits. The multiplier is divided into overlapping
three-bit groups. Each three-bit group generates a single partial product; this is done
according to Table 6.1 [3]. The partial product can be -2, -1, 0, 1, or 2 times the
multiplicand. The multiplication factor is determined by calculating [84]:
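As an illustration of the recoding, the following minimal sketch uses the standard radix-4 (modified Booth) factor -2·MR(i+1) + MR(i) + MR(i-1), with MR(-1) = 0, for each overlapping three-bit group. That expression is the textbook form and is assumed here; the function names are hypothetical, and an even number of multiplier bits is assumed.

```python
def booth_digits(multiplier, m):
    """Recode an m-bit two's complement multiplier into radix-4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2}; digit j has weight 4**j.
    Assumes m is even (an assumption of this sketch).
    """
    if m % 2:
        raise ValueError("this sketch assumes an even number of multiplier bits")
    # Python's >> on negative integers is arithmetic, so this yields the
    # two's complement bits of the multiplier.
    bits = [(multiplier >> i) & 1 for i in range(m)]
    digits = []
    previous = 0                                  # MR(-1) = 0
    for i in range(0, m, 2):                      # overlapping three-bit groups
        digits.append(-2 * bits[i + 1] + bits[i] + previous)
        previous = bits[i + 1]
    return digits


def booth_multiply(multiplicand, multiplier, m):
    """Sum the m/2 partial products, each shifted two bits from the previous."""
    return sum(digit * multiplicand * 4 ** j
               for j, digit in enumerate(booth_digits(multiplier, m)))


print(booth_digits(-3, 4))        # [1, -1], i.e. 1 - 4 = -3
print(booth_multiply(7, -3, 4))   # -21
```

Only m/2 partial products are generated, which is the halving mentioned above.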
Each partial product is shifted two bits from the previous partial product. This
shifting process requires sign extension [3]. In the following we examine one way in
which sign extension is done. Let the first partial product be
where $A_n$ is the sign bit. Therefore, the value of A is given by:
Table 6.1: Partial product generation. PP is the partial product, MD is the
multiplicand, and $MR_i$ is the ith bit of the multiplier.
Let the second partial product be
In this case, $B_{n+2}$ is the sign bit. Therefore, the value of B is given by:
To add these two numbers taking into account the sign of each, the following
modifications are done to A and B. For A, complement the sign bit
($\bar{A}_n = 1 - A_n$) and add $2^n$. The number obtained, NA, which is in
unsigned binary representation, is related to A by:
Note that this number has the same value as A to a resolution of n + 1 bits.
For B, complement the bits Bn+2 and B,+t, add 2"+' The nurnber ob- 
tained BN, which is in unsigned binary representation, is related to B by: 
Adding NA and NB as unsigned binary numbers, we get:
which equals A + B to a resolution of n + 3 bits. This sign extension process
is shown in Figure 6.1. The sign extension for the remaining partial products is
identical to that of B. The sum of the partial products using this sign extension is
given by:
where m is the number of multiplier bits and n is the number of multiplicand bits.
Thus, the product obtained is accurate to m + n - 1 bits. To make the product
accurate to m + n bits, the (m + n)th bit is complemented. This is equivalent to
subtracting $2^{m+n-1}$.
6.2.2 Partial Product Generation 
The multiplier is divided into overlapping three-bit groups. Each group is encoded
into three bits x2, x1, and s, as given in Table 6.1. Figure 6.2 shows the logic
circuit of the row-decoder used for encoding each three-bit group. If x2 = 1, the
magnitude of the partial product is twice the magnitude of the multiplicand. If
x1 = 1, the magnitude of the partial product equals that of the multiplicand. If
s = 1, the partial product and the multiplicand have different signs. In terms of
x2, x1 and s, the partial product is expressed as:

$$PP = (-1 \cdot s + 1 \cdot \bar{s})(2 \cdot x_2 + 1 \cdot x_1)\, MD \qquad (6.8)$$

Figure 6.1: Sign extension. (a) Signed-digit representation. (b) Equivalent unsigned
binary representation.

Figure 6.2: The row-decoder.
The ith bit of the partial product, $PP_i$, is generated according to the following
equation:
for i = 0 ... n, where $MD_i$ is the ith bit of the multiplicand, $MD_{-1} = 0$ and
$MD_n = MD_{n-1}$. n is the number of multiplicand bits. To correctly generate the
two's complement of the multiplicand when s = 1, s should be added to the least
significant bit of the partial product. Figure 6.3 shows the Booth multiplier array
based on Equation 6.9 and the sign extension algorithm given in Figure 6.1 [158].
The array cells of Figure 6.3 are given in Figure 6.4. The logic circuit diagram of
the row-decoder (RD) of Figure 6.3 is given in Figure 6.2. The logic circuit diagram
of the partial product generator (PPGen) of Figure 6.4, described by Equation 6.9,
is given in Figure 6.5.
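The bit-level generation can be sketched as follows. The sketch assumes the common form PP_i = (x1·MD_i + x2·MD_(i-1)) XOR s, which is consistent with the boundary conditions MD_(-1) = 0 and MD_n = MD_(n-1) and with the rule of adding s at the least significant bit; it is an assumed form for illustration, and the function name is hypothetical.

```python
def partial_product(md, n, x2, x1, s):
    """Generate the (n+1)-bit partial product of an n-bit multiplicand.

    Assumed bit equation: PP_i = (x1 & MD_i | x2 & MD_(i-1)) ^ s, i = 0..n.
    Returns the result as an unsigned value modulo 2**(n+1).
    """
    def md_bit(i):
        if i < 0:
            return 0                          # MD_(-1) = 0
        return (md >> min(i, n - 1)) & 1      # MD_n = MD_(n-1), sign extension

    pp = 0
    for i in range(n + 1):
        bit = ((x1 & md_bit(i)) | (x2 & md_bit(i - 1))) ^ s
        pp |= bit << i
    # Adding s at the least significant bit completes the two's complement
    # negation when s = 1.
    return (pp + s) % (1 << (n + 1))


# x2 = 1, s = 1 selects -2 times the multiplicand.
print(partial_product(3, 4, x2=1, x1=0, s=1) == (-2 * 3) % 32)
```

The result is compared modulo 2^(n+1) because the partial product is held in n + 1 bits.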
The array multiplier of Figure 6.3 generates the product in the sum-carry rep- 
resentation. To get the two's complement product, the sum product word (PS) and 
the carry product word (PC) need to be added. 
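The sum-carry representation can be illustrated with a single carry-save step: three words are reduced to a sum word and a carry word without propagating carries, and only the final addition of the two words requires a carry-propagate adder. The sketch below is a generic illustration for non-negative integers; the function name is hypothetical, while PS and PC follow the naming used in the text.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative words to a (sum word, carry word) pair.

    Each bit position acts as an independent full adder: the sum bit stays in
    place and the carry bit moves one position to the left.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


ps, pc = carry_save_add(0b1011, 0b0110, 0b1101)
print(ps + pc == 0b1011 + 0b0110 + 0b1101)   # True
```

Keeping the result as the pair (PS, PC) postpones the carry propagation until a two's complement result is actually needed.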
The resolution of the product is the sum of the resolution of the multiplier, m,
and the multiplicand, n. Assume that both the multiplier and the multiplicand are
integers; then the value of the multiplier falls in the range
$[-2^{m-1}, 2^{m-1} - 1]$, while the value of the multiplicand falls in the range
$[-2^{n-1}, 2^{n-1} - 1]$. Thus the product falls in the range:
With the exception of $2^{m+n-2}$, the other values can be represented using
m + n - 1 bits. The product $2^{m+n-2}$ is obtained when MR = $-2^{m-1}$ and
MD = $-2^{n-1}$. If this case is avoided, then m + n - 1 bits will be sufficient to
represent the product. Thus, the carry from the leftmost cell in the last row of
Figure 6.3 can be neglected.

Figure 6.3: The Modified Booth Algorithm Array Multiplier.
The array multiplier of Figure 6.3 is the basis of the multiplier accumulator 
array (MAC array) and the programmable MAC array developed in the following 
sections. 
6.3 The Multiplier-Accumulator Array
The multiplier-accumulator (MAC) accepts three operands; two operands are
multiplied and the product is accumulated (added) to the third operand. Figure 6.6
shows a block diagram for the MAC. In the MAC array [159] [160], the adder is
interleaved into the array multiplier. The MAC array presented here differs from
other implementations in that [158]:
1. The accumulator input and output are in the sum-carry representation.
2. The array multiplier used is based on the modified Booth algorithm, and it
uses the sign extension algorithm developed in section 6.2.1.
Figure 6.7, a modification of Figure 6.3, is the multiplier accumulator array.
The multiplier and the multiplicand are in the two's complement representation;
the accumulator input (AI) and output (AO) are in sum-carry representation. The
array cells of Figure 6.7 are given in Figure 6.8.

Figure 6.4: The array cells of the array multiplier given in Figure 6.3.

Figure 6.5: Partial product bit generator.

Figure 6.6: Block diagram of a multiplier-accumulator (MAC).

Figure 6.7: The Multiplier-Accumulator Array.





Figure 6.8: The array cells of the MAC array given in Figure 6.7.
The area, speed and power dissipation of the MAC array, given in Figure 6.7, are
compared to those of a MAC with a separate multiplier/adder as shown in Figure 6.6.
The basic components of each architecture are the row-decoder (RD), the partial
product generator (PPGen), the full adder (FA) and the half adder (HA). The
multiplier has m bits, the multiplicand has n bits and the accumulator input and
output have m + n - 1 bits. Table 6.2 gives the number of basic components
required by the MAC array and the separate multiplier/adder MAC.
Notice that the MAC array has no half adders. One of the full adders
could have been a half adder, but it was chosen to be a full adder to make the array
regular.
For the separate multiplier/adder MAC, the adder has to add four operands:
the sum and carry of the accumulator input and the sum and carry generated by
the multiplier. A two-level carry save adder (CSA) is used to do this. One of the
full adders of the separate multiplier/adder MAC could have been a half adder, but it

Table 6.2: Number of basic components for the MAC array and the separate
multiplier/adder MAC.

Component    MAC array              Separate multiplier/adder MAC
FA           (m/2)(n+1) + (m-1)     (n-1)(m/2-2) + 2(m+n-1) - 1
HA           0                      m + n - 2

Comparing the number of components used in the MAC array versus that used
in the separate multiplier/adder MAC, it turns out that the latter uses m + n - 2
extra half adders.
D(X) is the delay of the basic component X. X can be RD, PPGen, FA or
HA. For the separate multiplier/adder MAC, the array multiplier of Figure 6.3 has
a critical path delay of:
The CSA has a delay of 2D(FA). Hence, the entire critical path delay of the 
separate multiplier/adder MAC is: 
which is greater than the delay of the MAC array by 2D(HA). 
Table 6.3 gives the area, delay and power dissipation/MHz of the basic components
used in building the MAC. These values are based on CMOS standard cell
libraries [150] [161] - [164]. The delay and power dissipation of the basic
components designed in a 0.5 µm, 3.3 Volt CMOS technology were obtained through
simulations. For the other CMOS technologies, the delay and power dissipation
given in Table 6.3 are based on the data sheet parameters of each technology.
Table 6.4 gives the W/L for the N-transistor and P-transistor of each CMOS
technology.
The delay-power of the MAC array and the separate multiplier/adder MAC
using different CMOS technologies is shown in Figure 6.9 and Figure 6.10 for a
10 x 10 bit and 20 x 20 bit multiplier respectively. The following observations can
be made:
1. The MAC array has a lower power-delay product than the separate
multiplier/adder MAC of the same technology.
2. The lower the resolution of the MAC, the greater the reduction of the
delay-power product of the MAC array over the separate multiplier/adder MAC.
The MAC array reduces the power dissipation, delay and area relative to the
separate multiplier/adder MAC. The relative reduction of the power dissipation,
delay and area for a 0.5 µm, 3.3 Volt CMOS technology is shown in Figures 6.11,
6.12 and 6.13 respectively.
Table 6.3: Area in µm², delay in ns and power in µW/MHz for the basic components
of the MAC architecture in different CMOS standard cell technologies.

Library             Component    Area (µm²)    Delay (ns)
0.6 µm, 5 Volt      RD           3873          1.41
                    PPGen        4401          2.48
                    FA           3345          1.99
                    HA           1936          0.80
0.6 µm, 3.3 Volt    RD           3873          2.11
                    PPGen        4401          3.53
                    FA           3345          2.84
                    HA           1936          1.19
0.5 µm, 5 Volt      RD           1728          1.13
                    PPGen        1964          1.69
                    FA           1414          1.58
                    HA           864           0.68
0.5 µm, 3.3 Volt    RD           1728          1.14
                    PPGen        1964          1.68
                    FA           1414          1.57
                    HA           864           0.72
0.35 µm, 5 Volt     RD           1093          0.77
                    PPGen        1243          1.07
                    FA           1044          1.08
                    HA           497           0.49

                    RD           1088          0.74
Table 6.4: W/L for the different CMOS technologies.

CMOS Technology      N-Transistor W/L
0.6 µm, 5 Volt
0.6 µm, 3.3 Volt
0.5 µm, 5 Volt
0.5 µm, 3.3 Volt
0.35 µm, 5 Volt
0.35 µm, 3.3 Volt
Figure 6.9: The delay-power relationship of the MAC array and the separate
multiplier/adder MAC having a 10 x 10 bit multiplier implemented in different CMOS
technologies.

Figure 6.10: The delay-power relationship of the MAC array and the separate
multiplier/adder MAC having a 20 x 20 bit multiplier implemented in different
CMOS technologies.
For a 0.5 µm, 3.3 Volt CMOS technology, at a resolution of 20 bits, the saving in
power dissipation when using the MAC array is about 3.6%, the saving in area is
about 4.5%, and the increase in speed is about 7.7%. The increase in speed can
lead to further reduction in the power dissipation [1] by reducing the voltage to
maintain the same throughput. To maintain the same throughput the voltage is
reduced by 7.2% [150]. This corresponds to a further 13.9% reduction in power
dissipation. Hence, at a 20 bit resolution, the total reduction in power dissipation
achieved using the MAC array is 17%.
At a resolution of 10 bits, the saving in power dissipation when using the MAC
array is about 6.5%, the saving in area is about 8.2%, and the increase in
speed is about 13.5%. To maintain the same throughput the voltage is reduced by
11.9% [150]. This corresponds to a further 22.4% reduction in power dissipation.
Hence, at a 10 bit resolution, the total reduction in power dissipation achieved by
using the MAC array is 27.4%.
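The way the two savings combine can be checked with a short calculation. The sketch below assumes that the dynamic power scales with the square of the supply voltage at a fixed throughput, which reproduces the quoted 13.9% and 22.4% figures; the function name is hypothetical.

```python
def total_power_saving(direct_saving, voltage_reduction):
    """Combine the direct power saving with the saving from voltage scaling.

    Assumes dynamic power proportional to V**2 at a fixed clock rate.
    Both arguments and the result are fractions (0.036 means 3.6%).
    """
    voltage_factor = (1.0 - voltage_reduction) ** 2
    return 1.0 - (1.0 - direct_saving) * voltage_factor


print(f"{total_power_saving(0.036, 0.072):.1%}")   # 20 bit resolution: 17.0%
print(f"{total_power_saving(0.065, 0.119):.1%}")   # 10 bit resolution: 27.4%
```

The savings multiply rather than add: at 20 bits, 1 - (1 - 0.036)(1 - 0.139) is about 0.17.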
6.3.1 Analysis of the Computational Efficiency of the MAC Array
The difference between the MAC array and the separate multiplier/adder MAC
is that the addition operation has been interleaved into the multiplier array of
the former. This provides an opportunity to merge some blocks together. The
multiplier array, Figures 6.3 and 6.4, uses half adders in some of its cells, whereas
the MAC array, Figures 6.7 and 6.8, uses only full adders. This is where the
elimination of redundant computations occurs.
Figure 6.11: The relative decrease in the power dissipation of the MAC array of
Figure 6.7 over that of the separate multiplier/adder MAC. Both designed in a
0.5 µm, 3.3 Volt CMOS technology.

Figure 6.12: The relative decrease in the delay of the MAC array of Figure 6.7 over
that of the separate multiplier/adder MAC. Both designed in a 0.5 µm, 3.3 Volt
CMOS technology.

Figure 6.13: The relative decrease in the area of the MAC array of Figure 6.7 over
that of the separate multiplier/adder MAC. Both designed in a 0.5 µm, 3.3 Volt
CMOS technology.
Figure 6.14: Adding 5 bits using half adders only. This requires 6 half adders. 
Half adders have two inputs and two outputs. Hence the number of bits that
remain to be added after the half adder remains the same. Consider the scenario
shown in Figure 6.14, where 5 bits need to be added. After the application of the
first HA, the number of bits that remain to be added is still 5. However, there is
redundancy in this case (Figure 6.14.b), because a and b cannot be 1 simultaneously.
After the second HA, there are still 5 bits to be added. Again, there is redundancy
in that c and d cannot be 1 simultaneously. After the third HA, there are only 4 bits.
The carry out, f, of this HA is always zero.
To reduce the number of bits from 5 to 4 it took three half adders. This
could have been accomplished using one full adder as shown in Figure 6.15.
Similarly, to go from 4 bits, as shown in Figure 6.14.d, to 3 bits, as shown in
Figure 6.14.g, three half adders are required.
It takes three half adders to add three bits together, an operation that can be
done using one full adder. Figure 6.16 shows the implementation of the full adder
using three half adders.

Figure 6.15: Adding 5 bits using full adders. This requires 2 full adders.
Consider now the scenario shown in Figure 6.17, where the bits x and z are
to be added together. Two implementations are considered. In the first, the bits x
are added using half adders. The resulting bits y are then added to the bits z using
full adders and half adders. The output is in the sum-carry representation. This
implementation requires a total of 9 full adders and 6 half adders.
In the second implementation, the x bits and the z bits are added directly
using full adders and half adders to give the same sum-carry representation as that
generated by the first implementation. This implementation requires only 9 full
adders and 2 half adders, thus achieving a saving of 4 half adders.
When the adder is interleaved into the multiplier array, the half adders are
merged into the full adders required for the addition process. This eliminates
redundant computations and achieves a saving in the power dissipation.
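The redundancy in the half adder chain can be checked exhaustively. The sketch below builds a full adder from three half adders, in the spirit of Figure 6.16, and confirms that the carry output of the third half adder is never set. The function names are hypothetical, and the exact wiring of the figure is assumed to follow this standard construction.

```python
def half_adder(a, b):
    """Return (sum, carry) of two bits."""
    return a ^ b, a & b


def full_adder_from_half_adders(a, b, c):
    """Add three bits with three half adders.

    Returns (sum, carry, unused); `unused` is the carry output of the third
    half adder.
    """
    s1, c1 = half_adder(a, b)
    total, c2 = half_adder(s1, c)
    carry, unused = half_adder(c1, c2)   # c1 and c2 are never both 1
    return total, carry, unused


for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            total, carry, unused = full_adder_from_half_adders(a, b, c)
            print(a, b, c, "->", total, carry, unused)
```

Since the third carry output is always zero, part of the third half adder does no useful work, which is the redundancy removed by merging the three half adders into one full adder.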
Figure 6.16: Merging three half adders into a single full adder.

Figure 6.17: Binary number addition using half adders and full adders.
6.4 Programmable MAC Array 
To reduce the power dissipation of the MAC when a lower resolution is sufficient, the
unused cells of the MAC array (those corresponding to the least significant bits)
are deactivated, making the resolution of the MAC array programmable [158]. This
eliminates irrelevant computations. Irrelevant computations are those that affect
the output of the unit under consideration, the MAC array in this case, but whose
elimination doesn't affect the overall SNR performance of the system because the
SNR is limited by some other system considerations, or by another unit in the
system.
For the MAC array of Figure 6.7, the multiplicand comes from the bottom, and
the multiplier comes from the left side. When the resolution of the multiplicand
is reduced, the cells of the right-hand-side columns are deactivated. When the
resolution of the multiplier is reduced, the cells of the top rows are deactivated.
When a certain cell is deactivated, the inputs are bypassed to the output. This is
done through the use of the bypass logic. The bypass logic is an overhead in the
programmable MAC array.
The carry output of each cell shifts one column only, while the sum output shifts
two columns. This fact is taken into account when designing the bypass logic. The
dotted line in Figure 6.18 represents the bypass path for a deactivated cell.
Extra logic is required for the bypass. This extra logic leads to larger area. It
also increases the power dissipation when the full resolution of the MAC is used. At
the lower resolution the power dissipation decreases. The amount of power saving
depends on the percentage of time the MAC spends at each resolution.
In addition to the bypass logic, extra logic is also required to prevent any signal
changes on the input lines from reaching the deactivated cells. The blocking logic
for the recoded multiplier signals is shown in Figure 6.19. The resolution of the
multiplicand determines the value of the control signals $C_1, C_2, \ldots, C_L$.
When the multiplicand has minimum resolution, $n_{\min}$, the recoded multiplier
bits are blocked at the leftmost blocking module. When the multiplicand has maximum
resolution, $n_{\max}$, the recoded multiplier bits pass through all the blocking
modules. Similar blocking logic is required for the multiplicand.

Figure 6.18: The bypass path for a deactivated cell.

Figure 6.19: The blocking modules for the recoded multiplier signals.

Assume the number of multiplicand bits is given by:
$$n = n_{\min} + \frac{l}{L}\,(n_{\max} - n_{\min})$$
where,
L + 1 is the number of resolutions the multiplicand can take.
$n_{\max}$ is the maximum resolution of the multiplicand.
$n_{\min}$ is the minimum resolution of the multiplicand.
l = 0, 1, ..., L.
Similarly, the number of multiplier bits is given by:
$$m = m_{\min} + \frac{k}{K}\,(m_{\max} - m_{\min})$$
where,
K + 1 is the number of resolutions the multiplier can take.
$m_{\max}$ is the maximum resolution of the multiplier.
$m_{\min}$ is the minimum resolution of the multiplier.
k = 0, 1, ..., K.
In general, the power dissipation of the programmable MAC array was found
to be given by:

$$\text{Power} = P_2 n m + P_1 m + P_0 + P_3 \left[ m_{\max}(3 + n_{\max}) - m_{\min}(3 + n_{\min}) \right] + P_4 n k + P_5 m l \qquad (6.12)$$

where $P_0$ - $P_5$ are technology dependent coefficients.

Table 6.5: Power coefficients, in µW/MHz, of the programmable MAC used in
Equation 6.12, for a 0.5 µm, 3.3 Volt CMOS technology.

The first three terms of Equation 6.12 are the power dissipation terms of a
regular MAC array. The fourth term is due to the bypass logic. The last two
terms are due to the input blocking logic. The values of $P_0$ - $P_5$ for a 0.5 µm,
3.3 Volt CMOS technology are given in Table 6.5. Equation 6.12 along with the power
coefficients of Table 6.5 were initially derived analytically. Later they were verified
through simulations.
The amount of power saving is dependent on the percentage of time the
multiplicand and the multiplier are required to operate at each resolution. The
multiplier can take one of K + 1 resolutions. Assume the lowest resolution
$m_{\min}$ has a probability $\alpha$. The remaining resolutions:
are equi-probable and have a probability of:
Similarly, the multiplicand can take one of L + 1 resolutions. Assume the lowest
resolution $n_{\min}$ has a probability $\beta$. The remaining resolutions:
are equi-probable and have a probability of:
The MAC array designed is required to have a maximum resolution of 21 bits
for the multiplicand and 20 bits for the multiplier. The power dissipation of the
non-programmable MAC array, implemented in a 0.5 µm, 3.3 Volt CMOS technology,
is 1320 µW/MHz. To reduce the power dissipation when a lower resolution is
sufficient, the MAC array is designed to have a programmable resolution for both
the multiplicand and the multiplier. The resolution of the multiplicand can be 21,
17, or 13 bits. The resolution of the multiplier can be 20, 16, or 12 bits.
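The allowed resolutions and the assumed probability model can be sketched as follows. The sketch takes the resolutions to be evenly spaced between the minimum and the maximum, and gives each of the remaining resolutions the probability (1 - p)/steps, which is what "equi-probable" implies when the lowest resolution has probability p. The function names are hypothetical.

```python
def allowed_resolutions(lowest, highest, steps):
    """Evenly spaced resolutions: lowest + i * (highest - lowest) / steps."""
    span = highest - lowest
    if span % steps:
        raise ValueError("resolution range must divide evenly into steps")
    return [lowest + i * span // steps for i in range(steps + 1)]


def resolution_probabilities(p_lowest, steps):
    """Probability of each resolution, lowest first.

    The lowest resolution has probability p_lowest; the remaining `steps`
    resolutions share the rest equally (an assumed reading of the text).
    """
    return [p_lowest] + [(1.0 - p_lowest) / steps] * steps


print(allowed_resolutions(13, 21, 2))     # multiplicand: [13, 17, 21]
print(allowed_resolutions(12, 20, 2))     # multiplier:   [12, 16, 20]
print(resolution_probabilities(0.5, 2))   # [0.5, 0.25, 0.25]
```

With L = K = 2 this reproduces the three multiplicand resolutions and the three multiplier resolutions listed above.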
The power dissipation ratio between that of the programmable MAC array and
that of the non-programmable MAC array depends on the probability distribution
parameters $\alpha$ and $\beta$. This power dissipation ratio is shown in
Figure 6.20. A power saving of up to 50% can be achieved.
To further investigate the effect of the probability distribution on the power
saving that can be achieved using the programmable MAC array, assume the following
probability distributions:

$$\text{Prob(Multiplicand is 13 bits)} = \text{Prob(Multiplier is 12 bits)} = \gamma \qquad (6.17)$$

$$\text{Prob(Multiplicand is 17 bits)} = \text{Prob(Multiplier is 16 bits)} = \delta \qquad (6.18)$$

$$\text{Prob(Multiplicand is 21 bits)} = \text{Prob(Multiplier is 20 bits)} = 1 - \gamma - \delta \qquad (6.19)$$

The relative power dissipation for different values of $\gamma$ and $\delta$, as
defined by Equations 6.17 and 6.18, is given in Figure 6.21. This power dissipation
ratio varies between 0.5 (when $\gamma = 1$ and $\delta = 0$) and 1.1 (when
$\gamma = \delta = 0$).

Figure 6.20: The power dissipation ratio between the programmable MAC array
having K = 2, L = 2 and the non-programmable MAC array. $\alpha$ and $\beta$ are
the resolution probability distribution parameters, from Equations 6.13, 6.14, 6.15
and 6.16.

Figure 6.21: The power dissipation ratio between the programmable MAC array
having K = 2, L = 2 and the non-programmable MAC array. $\gamma$ and $\delta$ are
the resolution probability distribution parameters, from Equations 6.17, 6.18 and
6.19.
6.5 VLSI Implementation of the Programmable MAC Array
The programmable MAC array has been designed with L = 2 and K = 2. The
allowed resolutions for the multiplier are 20, 16, and 12 bits. The allowed resolutions
for the multiplicand are 21, 17 and 13 bits.
The programmable MAC array has been designed in a 0.5 µm, 3.3 Volt CMOS
technology. It has over 17000 transistors. It occupies an area of 1815 µm x 1542 µm.
Simulation results show that the programmable MAC array can run at a speed
of up to 42 MHz. Simulation results also show that the average power dissipation of
the programmable MAC array, assuming that all resolutions are equi-probable, is
1.0 mW/MHz. This is 24% lower than the power dissipation of the corresponding
non-programmable MAC array. Figure 6.22 shows the VLSI layout of the
programmable MAC array.
6.6 Chapter Summary 
A multiplier-accumulator array has been designed in a 0.5 µm, 3.3 Volt CMOS
technology. The multiplier of the MAC array is based on the modified Booth
algorithm. The accumulator's input and output are in the sum-carry representation.
To reduce the power dissipation, the adder was interleaved in the multiplier
array. This eliminates redundant computations. The MAC array achieved up to
17% saving in power dissipation for a 20 x 20 array, and up to 27.4% saving in
power dissipation for a 10 x 10 array.
To further reduce the power dissipation a block deactivation architecture was
developed, where the cells corresponding to the least significant bits are deactivated
when a smaller resolution is sufficient. This eliminates irrelevant computations.
The power dissipation saving achieved using this architecture can be up to 50%.
However, a more conservative estimate, when all the resolutions are equi-probable,
puts the power dissipation saving at 24%.

Figure 6.22: The VLSI layout of a programmable MAC array designed in a 0.5 µm,
3.3 Volt CMOS technology. The total area of the MAC array is 1815 µm x 1542 µm.
Chapter 7 
Digital Channel Selection 
7.1 Introduction 
In software radio, the IF as well as the baseband functions are done digitally. These
functions require high DSP horsepower, which can be as high as 10 GFLOPS/s [12].
This in turn leads to high power dissipation.
In software radio, the digitized signal represents a block of channels. The
channel to be selected is then filtered digitally. In this chapter, I examine ways to
reduce the computational complexity and hence the power dissipation of the digital
channel selection algorithm. The salient low-power features of this digital channel
selection algorithm are [165]:
1. The pre-filter multiplier has been eliminated.
2. The channel selection filtering is performed in stages. During each stage the
sampled signal is decimated by a factor of 2.
3. The multipliers in the filters have been replaced by adders.
In this chapter, the effect each of these has on the power dissipation of the
digital channel selection algorithm is demonstrated.
In section 7.2, the conventional channel selection algorithm is considered, and
its computational power dissipation is computed. The effect of performing channel
selection in stages on the computational power dissipation is also considered in this
section.
In section 7.3, an algorithm for eliminating the pre-filter multiplier is
considered. However, this algorithm requires sharp filtering stages. In section 7.4, the
algorithm of section 7.3 is further developed to relax the filter sharpness
requirement. The implementation of this algorithm is considered, and its computational
power dissipation is computed.
Finally, in section 7.5, the computational power dissipation of the novel digital
channel selection algorithm is compared to that of other digital channel selection
algorithms.
7.2 The Conventional Channel Selection Algorithm
In software radio, the digitized signal represents a block of adjacent channels, having
a spectrum as shown in Figure 7.1. This signal is called the composite digital signal
and is denoted by C. C is a complex discrete-time baseband signal, given by:
$$C(nT_s) = I(nT_s) + jQ(nT_s) \qquad (7.1)$$
$T_s$ is the sampling period. C is composed of $M = 2^m$ frequency division
multiplexed channels, each denoted by $S_n$. Each of these channels occupies a frequency
band $\Delta f$. The center frequency of channel $S_n$ is given by:
where $n = -\frac{M}{2}, \ldots, \frac{M}{2} - 1$. $n$ can be represented in a two's complement binary
format. Figure 7.1 shows the binary channel-numbering for the case M = 8.
Figure 7.1: The frequency spectrum of a baseband signal consisting of eight channels.
Figure 7.2: The conventional channel selection algorithm.
To select channel $S_n$, the composite signal is multiplied by a sinusoidal signal
of frequency $-f_n$. This shifts channel $S_n$ to be centered around the zero frequency.
A lowpass decimation filter (LPDF) selects channel $S_n$ and rejects the remaining
channels. This channel selection process is shown in Figure 7.2.
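The mix-then-filter idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `select_channel` and `tone` are hypothetical, frequencies are given in cycles per sample, and a plain block average stands in for the sharp LPDF of Figure 7.2 (a real design would use a many-tap filter).

```python
import cmath

def select_channel(samples, centre, decimation):
    """Shift `centre` (cycles per sample) down to zero frequency, then
    average blocks of `decimation` samples (a crude stand-in for the LPDF)."""
    mixed = [s * cmath.exp(-2j * cmath.pi * centre * n)
             for n, s in enumerate(samples)]
    return [sum(mixed[k:k + decimation]) / decimation
            for k in range(0, len(mixed) - decimation + 1, decimation)]

def tone(freq, length):
    """Unit-amplitude complex tone at `freq` cycles per sample."""
    return [cmath.exp(2j * cmath.pi * freq * n) for n in range(length)]

# A tone at the selected centre frequency survives; a tone one
# decimated-band away falls on a null of the block average.
wanted = select_channel(tone(0.3125, 64), centre=0.3125, decimation=8)
other = select_channel(tone(0.4375, 64), centre=0.3125, decimation=8)
```

Every input sample passes through a complex multiplication before any decimation takes place, which is why the pre-filter multiplier runs at the high sampling rate.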
The algorithm of Figure 7.2 requires:
1. A frequency synthesizer, to generate the selected channel frequency.
2. A pre-filter multiplier. This multiplier operates at the high sampling rate,
and its operands have high resolution.
3. A sharp LPDF which has a large number of taps to provide sufficient adjacent
channel rejection. The number of taps required to achieve a certain channel
rejection ratio is shown in Figure 7.3. However, using a polyphase filter
structure [120], the LPDF operates at the lower output rate.
Figure 7.3: Number of filter taps vs. adjacent channel rejection, for the digital
selection of 1 out of 32 channels.
In the DAMPS standard, the adjacent channel selectivity is 13 dB [22]. DQPSK
is the modulation scheme used in IS-136. To achieve the required bit error rate, the
signal-to-noise ratio should be 6 dB [166]. Assuming that 50% of the noise is due to
adjacent channel interference, the filter is required to achieve a 22 dB adjacent
channel rejection (13 dB + 6 dB + 3 dB).
From Figure 7.3, this requires a 215 tap filter to digitally select one out of 32
channels. Through simulation, it was found that the filter coefficients should have
a resolution of 16 bits to achieve the sufficient channel rejection required for
non-adjacent channels [22].
The computational power dissipation for a conventional digital channel selection
algorithm used in the selection of one out of 32 channels, each 30 KHz wide, as
specified by the DAMPS standard [22], is given in Table 7.1. The sample rate of the
composite digital signal is 960 KHz (complex sample rate). The sample rate at the output
of the LPDF of Figure 7.2 is 30 KHz (complex sample rate). The values given in
Table 7.1 are for a 0.5 µm, 3.3 Volt CMOS technology [150]. The lowpass decimation
filter is a linear phase filter; this reduces the number of multiplications by half.
Table 7.1: Computational power dissipation for the conventional digital channel
selection algorithm shown in Figure 7.2, and designed in a 0.5 µm, 3.3 Volt CMOS
technology.
Criteria   | Pre-Filter Multiplier | LP Decimation Filter | Total
Speed      | 960 KHz               | 30 KHz               | 30 KHz
Operations | 4 MULT(20 x 20)       | 216 MULT(20 x 16)    | 128 MULT(20 x 20), 216 MULT(20 x 16)
To reduce the lowpass filter sharpness and hence reduce the number of taps and
lower the filter computational power dissipation, the lowpass decimation filter is
implemented as a cascade of lowpass decimation filters. Each filter decimates the
sampled signal by a factor of 2. The number of filtering stages required is $\log_2 M$,
for a filter selecting one out of M channels.
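The operation counts quoted for the conventional algorithm, and the stage rates of the cascade, can be cross-checked with a little arithmetic. The sketch below is one reading that reproduces the 128 MULT and 216 MULT entries of Table 7.1; it assumes that a complex-by-complex product costs four real multiplications, that the linear phase filter shares one multiplier per symmetric tap pair, and that the filter is applied to both the I and Q streams. The function names are hypothetical.

```python
def conventional_ops(taps, decimation):
    """Real multiplications per output sample for the scheme of Figure 7.2."""
    prefilter = 4 * decimation        # complex x complex = 4 real products per input
    lpdf = 2 * ((taps + 1) // 2)      # symmetric taps share a multiplier; I and Q
    return prefilter, lpdf

def stage_rates(input_rate_khz, channels):
    """Output sample rate of each decimate-by-2 stage; log2(channels) stages."""
    stages = channels.bit_length() - 1    # log2 for a power of two
    return [input_rate_khz // 2 ** (k + 1) for k in range(stages)]

prefilter_ops, lpdf_ops = conventional_ops(taps=215, decimation=32)
rates = stage_rates(960, 32)
```

For M = 32 this gives five stages running at 480, 240, 120, 60 and 30 KHz, so only the first stage sees a rate close to that of the composite signal.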
Assuming M = 32, 5 cascaded filtering stages are required. The first three
stages don't require sharp filters, hence a Sinc decimator is used. For the last
two stages, an LPF is used. Table 7.2 gives the sufficient image channel rejection
each filtering stage is required to achieve in order to obtain the required BER for
the DAMPS standard. Also given in Table 7.2 is the order of the Sinc decimator
and the number of the LPF taps necessary to achieve the required image channel
rejection.
The number of operations required per stage per output, power dissipation per
stage and the total computational power dissipation including that of the pre-filter
multiplier are also given in Table 7.2. The values given in this table are for a 0.5 µm,
3.3 Volt CMOS technology.
The total computational power dissipation given in Table 7.2 is that of the
filter and the pre-filter multiplier. The power saving achieved is 45% over the
conventional channel selection algorithm of Table 7.1. However, the dominant
source of computational power dissipation is now the pre-filter multiplier, which
dissipates 5.09 mW. This represents 75% of the total computational power dissipation. In
the next section, a digital channel selection algorithm that eliminates the pre-filter
multiplier is examined.
Table 7.2: Computational power dissipation for a lowpass decimation filter
implemented as cascaded stages in a 0.5 µm, 3.3 Volt CMOS technology.
Criteria          | 1st stage | 2nd stage
Image rejection   | 64 dB     |
Order/# of taps   | 3         |
Speed             | 480 KHz   | 240 KHz
Power dissipation | 0.192 mW  | 0.120 mW
7.3 Pre-Filter Multiplier Elimination
According to Table 7.2, the elimination of the pre-filter multiplier achieves a 75% saving
in the computational power dissipation. To achieve this, filtering is performed
in stages. During each filtering stage half of the channels are eliminated. Thus, the








The channel selection in this algorithm [165] depends on the use of highpass
or lowpass filters to select the desired channel. Selecting one channel out of $2^m$
channels requires $m$ filtering stages. The first $m - 1$ stages are composed of a
highpass/lowpass filter and a factor two down-sampler. In the last stage we do
frequency shifting followed by lowpass filtering.
Table 7.2 (fragment): 5th stage; 14 MULT(20 x 12); 51 dB; 13; 60 KHz
The selection of a highpass filter or a lowpass filter depends on the location of
the channel to be selected. If that channel falls in one of the bands
$[-\frac{f_s}{2}, -\frac{f_s}{4}]$ or $[\frac{f_s}{4}, \frac{f_s}{2}]$, a highpass filter is used to select the
desired channel. If the channel to be selected falls in the band
$[-\frac{f_s}{4}, \frac{f_s}{4}]$, a lowpass filter is used.
Channel Selection Example
Figure 7.4 is an example showing the selection of one channel out of 16 channels.
The channel to be selected is channel 6 (0110). Since the number of channels
is halved after each filtering stage, the number of bits required in numbering the
channels is reduced by one bit after each stage.
The selection process is performed as follows. Channel 6 (the desired channel)
falls in the highpass region. A highpass filter is used, which selects
channels -8, -7, -6, -5, 4, 5, 6, and 7. The highpass filter is followed by a factor 2
down-sampler, leading to the second frequency spectrum in Figure 7.4. Notice that
only three bits are required to number the remaining channels shown in the second
frequency spectrum of Figure 7.4. The most significant bit has been removed.
Furthermore, notice that because of the highpass filtering channel 6 has moved
from the positive frequency region to the negative one.
Channel 6 now lies in the lowpass region $[-\frac{f_s}{4}, \frac{f_s}{4}]$. A lowpass filter is used
for selecting channels -8, -7, 6 and 7. The lowpass filter is followed by a factor 2
down-sampler. Again the most significant bit is removed, leaving only two bits. A
highpass filter followed by a factor 2 down-sampler is used to select channels -7 and
6.
Channel -7 is the image of channel 6. The final selection process involves a
frequency shift and lowpass filtering. The frequency shift is performed by multiplying
the received signal by $e^{-j\frac{\pi}{2}n}$, where $n$ is the discrete time.
Figure 7.4: Channel selection example using lowpass and highpass filters only.
7.3.1 The Channel Selection Algorithm
The number of stages required to select one channel out of $2^m$ channels is $m$ stages.
During the first $m - 1$ stages a lowpass filter or a highpass filter is used followed by a
factor 2 down-sampler. The last stage is a frequency shift by $\frac{f_s}{4}$ or $-\frac{f_s}{4}$, followed by
a lowpass filter and a factor 2 down-sampler. The required operation is determined
from the binary representation of the desired channel. Assume the desired channel
has the binary representation:
$$b_{m-1} b_{m-2} \cdots b_1 b_0$$
Define $x_i$ to be:
If $x_i = 1$ use a highpass filter during stage $i$. Otherwise, use a lowpass filter.
In stage $m$, the sign of the frequency shift is determined by the least significant
bit $b_0$. If $b_0 = 1$, the frequency shift is $\frac{f_s}{4}$. Otherwise, it is $-\frac{f_s}{4}$.
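The stage-by-stage decisions can be sketched in Python. The sketch assumes $x_i = b_{m-i} \oplus b_{m-i-1}$ for $i = 1, \ldots, m - 1$; this rule is inferred here from the band edges and from the channel 6 example (highpass, lowpass, highpass, then a shift of $-\frac{f_s}{4}$), and is not quoted from the definition of $x_i$. The function name `filter_plan` is hypothetical.

```python
def filter_plan(channel, m):
    """Filters for stages 1..m-1 and the sign of the final fs/4 shift.

    `channel` is a signed channel number in -2**(m-1) .. 2**(m-1) - 1.
    Assumes x_i = b_(m-i) XOR b_(m-i-1), inferred from the worked example.
    """
    bits = channel & ((1 << m) - 1)          # two's complement, m bits wide

    def b(k):
        return (bits >> k) & 1

    stages = ["HP" if b(m - i) ^ b(m - i - 1) else "LP" for i in range(1, m)]
    shift = "+fs/4" if b(0) else "-fs/4"     # stage m: decided by b_0
    return stages, shift

plan_for_6 = filter_plan(6, 4)       # channel 6 = 0110
plan_for_image = filter_plan(-7, 4)  # channel -7 = 1001, the image of channel 6
```

Under this assumption channel 6 and its image, channel -7, pass through the same three filters and differ only in the sign of the final frequency shift, which agrees with the example above.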
The frequency spectrum is divided into two non-overlapping bands: the lowpass
band and the highpass band. The fact that the two bands are non-overlapping
necessitates the use of sharp lowpass or highpass filters. This in turn leads to filters
having a large number of taps, and hence an adverse effect on the computational
power dissipation. In the next section, an algorithm that has overlapping selection
bands is presented. This relaxes the sharpness requirement of the filters, which in
turn leads to filters with fewer taps and hence lower power dissipation.
7.4 Filter Sharpness Relaxation
The algorithm of the previous section is modified such that the frequency spectrum,
which extends from $-\frac{f_s}{2}$ to $\frac{f_s}{2}$, is divided into four overlapping frequency bands [165].
Each frequency band has a bandwidth of $\frac{f_s}{2}$. Any channel lies in two frequency
bands. In one band it will be closer to its center, in the other band it will be closer
to its edge. The channel is selected by the band in which it is closer to its center,
see Figure 7.5. The advantage of doing this is to relax the sharpness requirement
of the filters used, and hence lower the computational power dissipation.
The channel selection algorithm requires the use of lowpass and bandpass filters.
A highpass filter is only required during the first channel selection stage. The
bandpass filter is centered around $\frac{f_s}{4}$ or $-\frac{f_s}{4}$ and it has a bandwidth of $\frac{f_s}{2}$. The
bandpass filter is implemented as a frequency shift of $-\frac{f_s}{4}$ or $\frac{f_s}{4}$ followed by a lowpass
filter. The highpass filter is implemented as a frequency shift of $\frac{f_s}{2}$ followed by a
lowpass filter.
The type of filter used is determined by the frequency band in which the signal
lies. This is shown in Figure 7.5. BPN is the negative frequency bandpass filter
centered around $-\frac{f_s}{4}$, while BPP is the positive frequency bandpass filter centered
around $\frac{f_s}{4}$.
Channel Selection Example
Figure 7.6 is an example showing the selection of one channel out of 16 channels.
The channel to be selected is channel 6 (0110). The channel selection process
proceeds as follows. Channel 6 lies near the center of the highpass band. A highpass
filter is used, which selects channels -8, -7, -6, -5, 4, 5, 6, and 7. The
highpass filter is followed by a factor two down-sampler, leading to the second
frequency spectrum in Figure 7.6. Notice that only three bits are required to
number the remaining channels. The most significant bit is removed.
Channel 6 now lies near the center of the negative frequency bandpass band. A
negative frequency bandpass filter is used, selecting channels 4, 5, 6, and 7. The
negative frequency bandpass filter is followed by a factor 2 down-sampler. The
most significant bit is removed and the second most significant bit is inverted in
this case.
Channel 6 now lies in the lowpass band. If we were to select channel 6 together with
channel 5 using a lowpass filter, then this lowpass filter would be sharp and would
require many taps. Instead, the spectrum is shifted by $-\frac{f_s}{8}$, and a lowpass filter is used for
selecting channel 6 in two stages. An implementation with reduced computational
complexity, for this shifting process, is presented in section 7.4.2.
Figure 7.5: Filter used in each frequency band.
Figure 7.6: Channel selection example using lowpass, highpass and bandpass filters.
7.4.1 The Channel Selection Algorithm
The channel selection process, previously described for the selection of channel 6,
is formulated in the following algorithm for the selection of any channel [165]:
For the first stage
Use a highpass filter.
Use a lowpass filter.
Use a negative frequency bandpass filter.
Use a positive frequency bandpass filter.
Use a negative frequency bandpass filter.
Use a positive frequency bandpass filter.
Where, i = 2 . . . m - 2
For stage m - 1
Shift the spectrum by a.
Use a lowpass filter.
else if al bo = 1
Shift the spectrum by B.
Use a lowpass filter.
else if ül go = 1
Shift the spectrum by -f$
Use a lowpass filter.
else if bo = 1
Shift the spectrum by -8.
Use a lowpass filter.
For stage m
Always use a lowpass filter.
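The band choice at each stage can also be sketched directly from the band geometry of section 7.4. The sketch below is an assumption-laden reconstruction, not the bit-level rules of [165]: it assumes channel $n$ of an $N$-channel spectrum is centered at $(n + \frac{1}{2})\frac{f_s}{N}$, picks the band whose center is nearest to the desired channel, applies the corresponding frequency shift, and halves the number of channels. The names `BAND_NAMES` and `plan_overlapping` are hypothetical.

```python
from fractions import Fraction

# Band chosen for each nearest band centre, in units of fs/4.
BAND_NAMES = {0: "LP", 1: "BPP", -1: "BPN", 2: "HP", -2: "HP"}

def plan_overlapping(channel, m):
    """Return (filters for stages 1..m-2, stage m-1 shift in units of fs/8)."""
    n = 1 << m            # channels in the current spectrum
    c = channel           # signed index, -n/2 .. n/2 - 1
    filters = []
    for _ in range(m - 2):
        centre = Fraction(2 * c + 1, 2 * n)   # channel centre as a fraction of fs
        q = round(centre * 4)                 # nearest band centre, units of fs/4
        filters.append(BAND_NAMES[q])
        c -= q * n // 4                       # the frequency shift moves that band to 0
        n //= 2                               # lowpass filter + decimation by 2
    return filters, -(2 * c + 1)              # four channels left: centre is (2c+1) fs/8

plan = plan_overlapping(6, 4)
all_plans = [plan_overlapping(ch, 5) for ch in range(-16, 16)]
```

For channel 6 out of 16 this gives a highpass stage, a negative frequency bandpass stage and a shift of $-\frac{f_s}{8}$, matching the example of Figure 7.6. In this model the highpass band is only ever chosen in the first stage, and the stage $m - 1$ shift is always one of $\pm\{1, 3\}\frac{f_s}{8}$.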
7.4.2 Algorithm Implementation
The algorithm just described requires four types of filters: a lowpass filter, a
highpass filter, a negative frequency bandpass filter, and a positive frequency bandpass
filter. Each one of these filters has a bandwidth of $\frac{f_s}{2}$. The four filters can be
implemented using a lowpass filter preceded by a multiplier that does the appropriate
frequency shifting.
For the highpass filter, the frequency shift is $\frac{f_s}{2}$. This is performed by
multiplying the incoming samples by a sequence of $1, -1, 1, -1, 1, \ldots$ This is shown in
Figure 7.7.a.
For the negative frequency bandpass filter, the frequency shift is $\frac{f_s}{4}$. This is
performed by multiplying the incoming samples by a sequence of $1, j, -1, -j, 1, j, \ldots$
This is shown in Figure 7.7.b.
For the positive frequency bandpass filter, the frequency shift is $-\frac{f_s}{4}$. This is
performed by multiplying the incoming samples by a sequence of $1, -j, -1, j, 1, -j, \ldots$
This is shown in Figure 7.7.c.
The multiplier coefficients in these implementations are 1, -1, j, and -j. This
doesn't require the use of a full multiplier; all that is required is multiplexers and
simple logic gates. This achieves a dramatic saving in the power dissipation.
Figure 7.7: Implementation of: (a) HP filter (b) BPN filter (c) BPP filter using a
multiplier and a LP filter.
Figure 7.8: The proposed digital channel selection algorithm for the selection of
one out of 32 channels. Stage A is a $\{0, \pm 1, 2\}\frac{f_s}{4}$ frequency shifter followed by a
lowpass filter. Stage B is a $\pm\{1, 3\}\frac{f_s}{8}$ frequency shifter followed by a lowpass filter.
$b_4 b_3 b_2 b_1 b_0$ is the binary representation of the selected channel number.
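A short sketch shows why no multiplier is needed for these shifts: multiplying $I + jQ$ by $1$, $j$, $-1$ or $-j$ only swaps and negates the two components. The helper names `rotate` and `shift` are hypothetical, and the sample values are arbitrary integers chosen for illustration.

```python
def rotate(i, q, quarter_turns):
    """Multiply I + jQ by j**quarter_turns using only swaps and negations."""
    k = quarter_turns % 4
    if k == 0:
        return (i, q)
    if k == 1:
        return (-q, i)      # (I + jQ) * j  = -Q + jI
    if k == 2:
        return (-i, -q)     # (I + jQ) * -1
    return (q, -i)          # (I + jQ) * -j =  Q - jI

def shift(samples, step):
    """step = 2: sequence 1, -1, ...; step = 1: 1, j, -1, -j; step = -1: 1, -j, -1, j."""
    return [rotate(i, q, step * n) for n, (i, q) in enumerate(samples)]

samples = [(3, 1), (-2, 5), (4, -4), (0, 7), (6, 2)]
highpass_shift = shift(samples, 2)
bpn_shift = shift(samples, 1)
bpp_shift = shift(samples, -1)
```

In hardware terms the swap corresponds to a multiplexer and the negation to a conditional sign change, in line with the statement above that multiplexers and simple logic gates suffice.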
The implementation of the digital channel selection algorithm for the selection 
of one channel out of 32 channels is shown in Figure 7.8 [165]. This algorithm has 
five stages. The first three stages have a block diagram as shown in Figure 7.9. 
The fourth stage has a block diagram as shown in Figure 7.10. The fifth stage is a 
lowpass decimation filter. 
In stage $m - 1$ a frequency shift of $\pm\{1, 3\}\frac{f_s}{8}$ is required. To eliminate the use
of a multiplier for this frequency shift operation, we exploit the fact that the filter
following the frequency shifter is a decimation filter, that decimates the signal by
a factor of 2. This filter is implemented as a polyphase filter [120] that separates
the input sample stream into even and odd sample streams. The even samples are
always multiplied by $\pm 1$ or $\pm j$, while the odd samples are multiplied by
$\frac{1}{\sqrt{2}}[\pm 1 \pm j]$. The multiplier is replaced by adders, multiplexers and other logic gates. The
multiplier coefficient ($\frac{1}{\sqrt{2}}$) is hidden in the filter coefficients. The implementation
of the frequency shifter followed by the filter is shown in Figure 7.10. Details of
this implementation are explained below.
Figure 7.9: Implementation of a $\{0, \pm 1, 2\}\frac{f_s}{4}$ frequency shifter followed by a lowpass
filter. The control bits select the filter type:
C1 C2 | Filter
0  0  | LP
0  1  | HP
1  0  | BPN
1  1  | BPP
C1 = e1 ⊕ e0
C2 = e2 ⊕ (e1 + e0)
Assume a multiplier is used to provide the desired frequency shift; the multiplier
coefficients are:
$$e^{\pm j\{1,3\}\frac{\pi}{4}n}$$
where $n$ is the discrete time. These coefficients are periodic with a period N = 8.
Table 7.3 gives eight consecutive samples of the multiplier output for the four
different frequency shifts. The input sample is $I_i + jQ_i$.
Table 7.3: Frequency-shift-multiplier output.
Notice from Table 7.3 that the even samples require only multiplication by $\pm 1$,
in addition to $I_i/Q_i$ interchange, which is equivalent to multiplying by $\pm j$. The
odd samples always require multiplication by $1/\sqrt{2}$ in addition to subtraction or
addition operations. The multiplication operation is hidden in the coefficients of
the odd sample branch of the polyphase filter [120]. Hence, no extra multiplications
are required to implement the frequency shift.
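The even/odd structure of these coefficients can be checked numerically. The sketch below evaluates $e^{jk\frac{\pi}{4}n}$ for $k \in \{\pm 1, \pm 3\}$ over one period and labels each coefficient; the helper names `coefficient` and `classify` and the tolerance are assumptions made for illustration.

```python
import cmath
import math

def coefficient(k, n):
    """e^{j k (pi/4) n}: shifter coefficient for a shift of k * fs/8 at time n."""
    return cmath.exp(1j * k * math.pi / 4 * n)

def classify(z, tol=1e-12):
    """'trivial' for +-1 or +-j, 'diagonal' for (+-1 +- j) / sqrt(2)."""
    if min(abs(z.real), abs(z.imag)) < tol:
        return "trivial"
    if abs(abs(z.real) - abs(z.imag)) < tol:
        return "diagonal"
    return "other"

pattern = {k: [classify(coefficient(k, n)) for n in range(8)]
           for k in (1, -1, 3, -3)}
```

For a diagonal coefficient such as $\frac{1}{\sqrt{2}}(1 + j)$, the product with $I_i + jQ_i$ is $\frac{1}{\sqrt{2}}[(I_i - Q_i) + j(I_i + Q_i)]$: one subtraction and one addition, with the common factor $\frac{1}{\sqrt{2}}$ left to be absorbed into the odd-branch filter coefficients.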
Now that the sharpness of the filters is not critical, Sinc decimators are used in the
initial filtering stages, eliminating the need for multipliers. However, in the last
two stages a sharper lowpass filter is used. Table 7.4 gives the sufficient image
channel rejection required to achieve the required BER for a DAMPS system. Also
given in this table is the Sinc decimator order and the number of LPF taps required
to satisfy this requirement.
Figure 7.10: Implementation of a $\pm\{1, 3\}\frac{f_s}{8}$ frequency shifter followed by a lowpass
filter.
The number of operations required per stage per output, the power dissipation
per stage, and the total power dissipation are also shown in Table 7.4¹. The values
given in this table are for a 0.5 µm, 3.3 Volt CMOS technology.
For a digital channel selection algorithm that selects one out of 32 channels,
eliminating the pre-filter multiplier achieves an 81% saving in power dissipation
when compared to the conventional channel selection algorithm of Table 7.1, and a
65% saving in power dissipation when compared to the channel selection algorithm
of Table 7.2.
Table 7.4: Computational power dissipation for the novel digital channel selection
algorithm shown in Figure 7.8, and designed in a 0.5 µm, 3.3 Volt CMOS technology.
Criteria                | 1st stage | 2nd stage | 3rd stage | 4th stage | 5th stage
Image rejection         |           |           |           |           |
Order/# of taps         |           |           |           |           |
Speed                   | 480 KHz   | 240 KHz   | 120 KHz   | 60 KHz    | 30 KHz
Operations              |           |           |           |           |
Power dissipation       |           |           |           |           |
Total power dissipation |           |           |           |           |
¹An eighth order Sinc filter with a decimation factor of two has the transfer function:
Direct implementation of Equation 7.5 requires 18 addition operations. Using the properties of
arithmetic operators, such as commutativity, associativity and common factoring, Equation 7.5
can be reformatted as:
Now the Sinc filter, according to Equation 7.6, requires only 10 addition/subtraction operations.
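The claim that Sinc decimators need no multipliers can be illustrated with a small sketch. It assumes the common form of an order-$k$ Sinc filter for decimation by two, a cascade of $k$ sections $(1 + z^{-1})$ with the gain of $2^k$ left unnormalized. This straightforward cascade is not the reduced-addition form described in the footnote to Table 7.4, and the function name is hypothetical.

```python
def sinc_decimate_by_2(samples, order):
    """Order-`order` Sinc decimator by 2 built from additions only.

    Assumes H(z) = (1 + z^-1)**order, unnormalised (DC gain 2**order).
    """
    stream = list(samples)
    for _ in range(order):
        # One (1 + z^-1) section: each output is the sum of an input
        # sample and the previous input sample.
        stream = [a + b for a, b in zip(stream, [0] + stream[:-1])]
    return stream[::2]     # keep every second sample

impulse_response = sinc_decimate_by_2([1, 0, 0, 0, 0, 0], order=3)
step_response = sinc_decimate_by_2([1] * 8, order=2)
```

Computing every intermediate sample and then discarding half of them is wasteful, which is the redundancy that a reduced-addition form is meant to avoid.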
7.5 Digital Channel Selection Algorithms: Comparison
In this section, the power dissipation of three channel selection algorithms is
compared.
Architecture A. This is the architecture given in Figure 7.2. In this architecture
a pre-filter multiplier is used to shift the channel to be selected to be centered
around the zero frequency. A single stage polyphase filter is then used to select the
desired channel.
Architecture B. This architecture is also given in Figure 7.2. In this architecture
a pre-filter multiplier is used to shift the channel to be selected to be centered
around the zero frequency. A multi-stage filter is then used to select the desired
channel. Each stage does decimation by a factor 2.
Architecture C. This architecture is described in section 7.4. This architecture
eliminates the pre-filter multiplier. Channel selection is achieved through an
algorithm that determines the type of filter (LP, HP, BPN, or BPP) based on the
channel to be selected, see section 7.4.1.
Figure 7.11 gives the computational power dissipation versus the number of
channels, for the three different architectures of the digital channel selection
algorithm. Notice that, as the number of channels from which one channel is selected
increases, the computational power dissipation of architecture A increases faster
than that of architecture B, which in turn increases faster than that of architecture C.
Figure 7.11: Computational power dissipation for digital channel selection
algorithms. (Axes: power dissipation in mW versus number of channels, 16 to 256.)
Figure 7.12 gives the computational power dissipation of architecture A
and architecture B relative to that of architecture C. Again, as the number of channels
from which one channel is selected increases, the efficiency of architecture C in
saving power over architectures A and B becomes more apparent. When selecting
one out of 256 channels, architecture C can achieve an order of magnitude saving
in computational power dissipation over that of architecture A. Architecture C also
lowers the computational power dissipation by a factor of 4.5 over that of architecture B.
Figure 7.12: Relative computational power dissipation between conventional digital
channel selection algorithms and proposed algorithm.
7.6 Chapter Summary
A novel digital channel selection algorithm with no pre-filter multiplier has been
developed in this chapter. The algorithm performs channel selection in stages.
During each stage, half the channels are rejected, and the other half is selected.
The algorithm employs a basic lowpass filter, which can also perform bandpass and
highpass filtering functions by using simple logic gates.
The use of overlapping frequency bands to perform the channel selection relaxes
the filter sharpness requirement, thus permitting the use of Sinc filters in the initial
stages. Sinc decimators require only addition operations. This greatly reduces the
computational power dissipation.
The computational power dissipation of the novel digital channel selection
algorithm is compared to that of other channel selection algorithms. The reduction
in computational power dissipation can be up to an order of magnitude.
Chapter 8
Summary, Conclusions and Future
Directions
8.1 Summary
The research reported in this dissertation is a study of high-level (algorithmic and
architectural) techniques to lower the power dissipation in portable wireless
terminals.
The first part of this dissertation is a survey of wireless communication
systems, architectures and standards, presented in chapter 2, and of low-power
techniques, presented in the first half of chapter 3.
In the second part of this dissertation, low-power techniques for portable wireless
terminals at the architectural and the algorithmic levels were investigated,
developed, designed and evaluated. The low-power techniques are applied to video
compression algorithms used in multimedia portable terminals. Low-power
algorithms are also developed to lower the power dissipation in software radios.
CHAPTER 8. SUMMARY. CONCLUSIONS AND FUTURE DIRECTIONS 241 
The first contribution of this dissertation is the analysis of high-level low-power
design techniques, to show the limit to which some of these design techniques can
be applied effectively to reduce the power dissipation. Three examples of such
analysis are given in chapter 3. These are: the use of carry-save adders in FIR
filters, the use of the Gray code number system, and the use of higher-order radix
for the division operation.
The second contribution of this dissertation is a new division algorithm that 
minimizes the number of addition/subtraction operations required to generate the 
quotient. This algorithm is presented in chapter 3. 
The third contribution of this dissertation is a new low-power subband coding
image compression algorithm developed in chapter 4. Subband coding is a technique
in which the video signal is divided into subbands and each subband is allocated
a number of bits according to the information it carries and its spectral importance.
The filtering structure for the proposed subband coding image compression
algorithm requires only addition/subtraction operations; this greatly reduces the
power dissipation. A novel vector quantization coding algorithm having a simplified
decoding architecture has been developed in chapter 4.
The fourth contribution of this dissertation is a novel bandpass Sigma-Delta
modulator, along with its switched capacitor architecture, presented in chapter 5.
Parallelism by 4x of analog signal processors is applied to the design of the bandpass
Sigma-Delta modulator. This increases the speed of the modulator without
increasing the speed requirement of the individual building blocks. A switched-capacitor
circuit with a minimum number of operational amplifiers is also given for
the proposed modulator architecture.
The fifth contribution of this dissertation is the design of a decimation filter
incorporating several low-power design techniques such as operation minimization,
multiplier elimination and block deactivation. The decimation filter is resolution-programmable,
allowing the deactivation of the blocks corresponding to the least
significant bits when a lower resolution is sufficient. The design of the decimation
filter is given in chapter 5.
The sixth contribution of this dissertation is the design of a resolution-programmable
multiplier-accumulator (MAC) array. The interleaving of the adder in
the multiplier array reduces the power dissipation. The resolution of the MAC
array is programmable, allowing the deactivation of the blocks corresponding to the
least significant bits when a lower resolution is sufficient. The design of the MAC
array is given in chapter 6.
The seventh contribution of this dissertation is a novel digital channel selection
algorithm with no pre-filter multiplier. The channel selection algorithm uses lowpass,
highpass and bandpass filters. The basic filter is the lowpass filter. Other
filters are implemented using the lowpass filter and simple logic gates such as
multiplexers and XOR gates. The channel selection is done in stages. The elimination
of the pre-filter multiplier reduces the power dissipation by up to an order of
magnitude. The design of the digital channel selection algorithm is given in chapter 7.
8.2 Conclusions
The widespread use of portable wireless terminals, and the increase in consumer
demand for more capabilities and functionality, coupled with the limited energy
supply of batteries, is driving research into the low-power design arena.
The promising results obtained from this dissertation indicate that the proper
choice of an algorithm or architecture can achieve up to an order of magnitude,
or even more in some cases, of savings in the power dissipation. High-level low-power
design techniques are easier and faster to implement than low-level low-power
design techniques.
The high-level techniques presented in this dissertation to reduce the power
dissipation include multiplier elimination, operation minimization, operation
interleaving and block deactivation.
At the heart of these techniques is the minimization of the computational complexity
by the elimination of redundant and irrelevant computations. Redundant
computations are those whose elimination has no effect on the operation of the
architecture. This is generally achieved by proper encoding of the input and output
data, or by modifying the architecture. For example, multiplication by 7 requires
two addition operations:
$$7x = 4x + 2x + x$$
However, encoding 7 in sign-digit representation, $7 = 100\bar{1}$, eliminates one addition
operation and substitutes the other one by subtraction:
$$7x = 8x - x$$
This is an example of redundant computations elimination. Redundant computations
also occur in decimation filters, where there is redundancy in the intermediate
stages, due to the decimation at the output of the filter. Through proper
choice of an architecture, the redundancy in the intermediate stages is minimized.
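The multiplication-by-7 example generalizes to any constant. The sketch below uses the non-adjacent form (NAF), one standard minimal sign-digit recoding; whether the dissertation uses exactly this recoding is an assumption, and the function names are hypothetical.

```python
def naf(n):
    """Sign-digit (non-adjacent form) digits of n > 0, least significant first."""
    digits = []
    while n > 0:
        if n & 1:
            d = 2 - (n & 3)    # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def multiply_by_constant(x, constant):
    """Multiply using shifts and add/subtract only, one term per non-zero digit."""
    return sum(d * (x << k) for k, d in enumerate(naf(constant)) if d)

def add_sub_count(digits):
    """Additions/subtractions needed to combine the non-zero terms."""
    return sum(1 for d in digits if d) - 1

digits_of_7 = naf(7)                       # 8 - 1, i.e. 100(-1) read MSB first
binary_adds = bin(7).count("1") - 1        # plain binary: 4x + 2x + x
naf_ops = add_sub_count(digits_of_7)       # sign-digit: 8x - x
```

For the constant 7 the plain binary form needs two additions while the sign-digit form needs a single subtraction, as stated above.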
Irrelevant computations are those whose elimination generally affects the operation
of the architecture. But under certain circumstances, this effect is irrelevant.
For example, reducing the datapath width of an operator when the SNR is limited
by some other system considerations, or by another unit in the system, is an
example of irrelevant computations elimination.
In chapter 7, a novel digital channel selection algorithm was presented that
eliminates the pre-filter multiplier. The multiplier was replaced by less computationally
complex operators, such as adders, multiplexers and XOR gates. This algorithm
filters the desired channel in stages. During each stage, the signal is decimated by
a factor of 2. To further reduce the computational complexity, multiplication is
substituted by addition in the first filtering stages. This reduces the computational
power dissipation by up to an order of magnitude. This is an example of redundant
computations elimination.
The lowpass decimation filter presented in chapter 5, was a halfband filter with 
symmetric coefficients. This reduced the number of multiplications required per 
output sample, which in turn led to a power saving of over 50%. 
The efFect of operation minimization on reducing the power dissipation was 
demonstrated in section 3.9.2 of chapter 3. A novel division algonthm was presented 
in that section that produces the quotient in a minimal sign-digit representation. 
Simulation results indicate that the new algorithm achieves a 15% saving in power 
dissipation when compared to a radix 4 division algorithm. This is an example of 
redundant computations elimination. 
Operation minimization was also considered in chapter 5, for the design of the 
Sinc decimator. The fact that the output of the Sinc filter is decimated, offers 
opportunity to minimize the number of one-bit addition operations required per 
Sinc decimator output. Of the different Sinc decimator architectures considered
in chapter 5, the most computationally efficient implementation reduces the number
of one-bit addition operations by up to five times when compared to other
Sinc decimator architectures. This is also an example of redundant computations 
elimination. 
Besides multiplier elimination and operation minimization, operation interleav-
ing is another technique to lower the power dissipation. Operation interleaving al-
lows two (or more) operators to be interleaved into a single operator. In chapter 6,
the effect of interleaving an adder and a multiplier in a single array was presented. 
For a 10 x 10 multiplier-accumulator array, operation interleaving achieves a 27.4% 
saving in power dissipation. 
Wireless portable terminals are required to have very high sensitivity and se-
lectivity, in order to operate in severe radio environments. In software radio, this
leads to high resolution A/D converters and high resolution digital signal proces- 
sors in the digital IF portion of the receiver. However, in many cases the level of 
received signal is strong enough to obviate the need for such high resolution. In 
this case, block deactivation is used to deactivate the blocks corresponding to the 
least significant bits. 
Block deactivation was applied to the design of the lowpass decimation filter 
of chapter 5, and also to the design of the multiplier-accumulator (MAC) array of 
chapter 6. The block deactivation technique presented can achieve a power saving 
of up to 50% for the MAC array and up to 40% for the lowpass decimation filter.
This is an example of irrelevant computations elimination.
Reducing the datapath width without degrading the output numerical accuracy
plays a central role in achieving a power-efficient architecture. A Sinc decimator of 
order N + 1 is used to decimate the output signal of the Sigma-Delta modulator 
of order N. Each octave of oversampling in the Sigma-Delta modulator increases
the resolution by N + 0.5 bits. However, each octave of decimation by the Sinc
decimator increases the datapath width by N + 1 bits, which is half a bit more
than what is required.
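The bookkeeping behind this argument can be written out as a small sketch. The function names are hypothetical, and the 1-bit decimator input used in the example is an assumption, chosen to match the k = 1 case analysed in Appendix B.

```python
def sinc_output_width(input_bits, sinc_order, octaves):
    """Datapath width at the Sinc decimator output.

    Each octave of decimation adds `sinc_order` bits to the datapath.
    """
    return input_bits + sinc_order * octaves


def resolution_gain_bits(modulator_order, octaves):
    """Resolution gained by a Sigma-Delta modulator of the given order.

    Each octave of oversampling adds (order + 0.5) bits of resolution.
    """
    return (modulator_order + 0.5) * octaves


if __name__ == "__main__":
    n = 2        # modulator order N (assumed for illustration)
    octaves = 3  # decimation factor of 8 = 2**3
    width = sinc_output_width(1, n + 1, octaves)
    gain = resolution_gain_bits(n, octaves)
    growth = (n + 1) * octaves
    print(width, gain, growth - gain)
```

Under these assumptions a third-order Sinc decimator with a decimation factor of 8 produces a 10-bit output, while the datapath grows 1.5 bits more than the resolution does. This surplus is the headroom that datapath width reduction tries to reclaim.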
For a third-order Sinc decimator having a decimation factor of 8, it is possible 
to reduce the output datapath width from 10 bits to 7 bits, at the expense of a
3-4 dB degradation in the SNR (which is equivalent to just over half a bit). The
saving in computational complexity is 9-14%. This is an example of irrelevant
computations elimination.
The choice of an algorithm can have the greatest impact on the power dissipa- 
tion and the system performance. In many cases, it is possible to trade a slight 
degradation in algorithm performance, for a larger reduction in power dissipation. 
The subband image compression algorithm presented in chapter 4 has low compu-
tational complexity. The filtering structure reduces the computational complexity
23 times. The vector quantization algorithm used in the decoding of the high fre-
quency subbands achieves a large reduction in computational complexity over
other vector quantization algorithms such as table-lookup vector quantization de-
coding algorithms. This reduction in computational complexity is achieved at the
expense of a 4-5 dB reduction in the SNR.
The choice of a particular low-power technique depends on the design parame- 
ters of the system being designed. Using an algorithm or an architecture to reduce one
component of the power dissipation may lead to an increase in another component
of the power dissipation. In chapter 3, several such examples were considered, one
of which was the Gray code number system. The Gray code number system reduces 
the switching activity for successively correlated samples, when compared to the 
two's complement number system. However, the energy per operation is higher 
for the Gray code number system. The effect of the Gray code number system on 
reducing the power dissipation depends on several factors, such as the correlation
coefficient, the load capacitance, and the energy per operation. In chapter 3, the 
limits of applying the Gray code to reduce the power dissipation were presented. 
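The switching-activity side of this tradeoff can be illustrated with a short sketch. It assumes the binary-reflected Gray code applied to the two's complement bit pattern, which may differ in detail from the Gray code number system of chapter 3, and it counts only bit toggles between successive words, not the energy per operation.

```python
def to_twos_complement(value, width):
    """Bit pattern of a signed integer in two's complement, `width` bits wide."""
    return value & ((1 << width) - 1)


def to_gray(value, width):
    """Binary-reflected Gray code of the two's complement bit pattern."""
    pattern = to_twos_complement(value, width)
    return pattern ^ (pattern >> 1)


def count_toggles(samples, width, encode):
    """Total number of bit transitions between successive encoded samples."""
    words = [encode(s, width) for s in samples]
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


if __name__ == "__main__":
    # A slowly varying (highly correlated) sequence crossing zero.
    samples = [-2, -1, 0, 1, 2]
    print(count_toggles(samples, 8, to_twos_complement))
    print(count_toggles(samples, 8, to_gray))
```

For this hypothetical 8-bit sequence the two's complement words toggle 12 bits in total, most of them at the transition from -1 to 0, while the Gray-coded words toggle 4. Whether this translates into a net power saving depends on the higher energy per operation noted above.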
In chapter 5, a parallel bandpass Sigma-Delta modulator is proposed. The basic
advantage of the proposed architecture is that each block operates at a lower speed 
(because of the parallelism) than that of the conventional bandpass Sigma-Delta 
modulator for the same oversampling ratio. This makes it suitable for high-speed 
Sigma-Delta modulators, such as those used in RF applications, for high-speed 
analog-to-digital conversion.
Another advantage of the switched capacitor architecture, given in chapter 5, is 
in the reduced number of operational amplifiers required. A fourth-order bandpass 
Sigma-Delta modulator typically requires four operational amplifiers [140] [141].
The bandpass Sigma-Delta modulator presented in chapter 5 uses a factor of four 
parallelism (for both I and Q channels), while the number of operational amplifiers 
required is eight, only increased by a factor of two. 
8.3 Future Directions 
8.3.1 Low-Power Digital Radio
Future mobile wireless terminals will be able to perform, in the digital domain,
many of the functions that are currently being done in the analog domain. An 
example of this is digital channel selection which was considered in chapter 7. New 
power-efficient algorithms for these operations, suitable for digital implementations,
need to be developed. 
Furthermore, software radios allow new features which are not available in cur-
rent transceivers, such as the ability to change coding scheme, modulation scheme,
and even multi-access scheme, based on the radio channel environment. This topic 
needs further investigation. New algorithms need to be developed to determine 
which coding/modulation/multi-access schemes can lead to optimum system per-
formance for a certain radio channel environment.
We have presented an A/D converter architecture with a sampling rate of 
1.25 MS/s. This captures only part of the 25 MHz cellular band. To capture 
the entire cellular band an A/D converter with a sampling rate of 62.5 MS/s (1.25 
times the Nyquist rate) and a resolution of more than 20 bits is required. More research
is required in this area to achieve power-efficient high-resolution A/D converters
operating at such a rate. 
8.3.2 Low-Power Multimedia 
New applications are likely to emerge as wireless multimedia terminals become
popular. New power-efficient algorithms and architectures need to be developed for 
these applications. For example, to enable video conferencing, a duplex video link 
needs to be established. The vector quantization algorithm presented in chapter 4, 
is power-efficient for video decompression. For video compression a power-efficient 
search algorithm is required. Thus, a multimedia terminal has to be able to perform 
video compression as well as video decompression in a power-efficient manner.
There are enormous opportunities for future research to reduce the power dissi-
pation in wireless multimedia terminals at the system level. Future wireless termi-
nals are required to handle different types of information such as speech, video and
data. Each type of information has its own requirements in terms of the acceptable 
delay and the probability of error. For example, voice packets must have a low 
delay, while delay is not critical for data packets. On the other hand, voice packets 
can tolerate a small amount of transmission errors, while data packets cannot tol-
erate such errors. Based on this, a strategy for the power-efficient transmission of
packets needs to be established. In an unfavourable radio environment, transmit-
ted packets are likely to suffer transmission errors. For data packets, these errors
lead to retransmission requests and hence to a high loss of valuable power. To
minimize data packet retransmission, data packets can be transmitted at a higher
power level or with a more elaborate error correction code. Both options lead to
higher power dissipation per packet, but can still yield a power-efficient solution
in the sense that they avoid data retransmissions. The optimum error coding
algorithm and the optimum transmission power level for the most power-efficient
transmission strategy of information packets need to be determined for different
radio channel environments.
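The tradeoff between energy per attempt and retransmissions can be sketched with a simple model. The model is an assumption made here for illustration and is not taken from this thesis: packet errors are independent, a failed packet is retransmitted until it is delivered, and the energy values and error rates are hypothetical.

```python
def expected_energy_per_delivered_packet(energy_per_attempt, packet_error_rate):
    """Expected transmit energy until a packet is delivered.

    Assumes independent packet errors and unlimited retransmissions, so
    the expected number of attempts is 1 / (1 - packet_error_rate).
    """
    if not 0.0 <= packet_error_rate < 1.0:
        raise ValueError("packet_error_rate must be in [0, 1)")
    return energy_per_attempt / (1.0 - packet_error_rate)


if __name__ == "__main__":
    # Hypothetical operating points in an unfavourable radio environment.
    low_power = expected_energy_per_delivered_packet(1.0, 0.6)
    high_power = expected_energy_per_delivered_packet(1.5, 0.1)
    print(low_power, high_power)
```

In this hypothetical case, spending 50% more energy per attempt lowers the expected energy per delivered packet from about 2.5 to about 1.67 units, because far fewer retransmissions are needed. A stronger error correction code would enter the same model through a larger energy per attempt and a smaller packet error rate.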
8.3.3 Low-Power CAD 
Traditionally, high-level synthesis systems have been associated with the optimiza-
tion of area and speed [167] [168]. Recently, high-level synthesis systems have
started to address the optimization of power dissipation as well [51] [169] [170].
However, automated tools for synthesizing power optimum algorithms are still not
available. Further research needs to be carried out in this area. 
An integral part of a high-level low-power synthesis system is fast but accu-
rate high-level power estimation. The majority of the literature deals with power
estimation at the transistor, switch or gate levels [45] [47] [49]. Power estimation 
at the register transfer level (RTL) and the architectural level is starting to cap- 
ture the attention of researchers [52] [96] [171] [172]. However, further research is 
still required for power estimation at the architectural level. Power estimation at
the algorithmic level is the least researched segment of the power estimation pro-
cess. This area has the greatest research potential because of the dramatic impact
algorithm selection has on the minimization of the power dissipation. 
There is a need for a low-power CAD system, that explores the algorithmic 
design space to determine the most power-efficient algorithm, for a certain imple-
mentation technology, and under specific constraints. It is envisioned that such a
CAD system will consist of three parts: the input part, the processing part and the
output part. The input part consists of all the information that needs to be known 
before processing can start. This information consists of: 
• Basic building block parameters. This is a library that contains information
such as delay, power dissipation and area about the basic building blocks.
The basic building blocks include adders, multipliers, multiplexers, memory
elements, etc.
• Design constraints. The design constraints include throughput, SNR, com-
pression ratio and total area.
• Low-power techniques knowledge-based system. This is a database of the
low-power techniques and transformations, and the extent to which they can
reduce the power dissipation.
• Algorithm knowledge-based system. For example, in a software radio CAD
system, this is a database of the software radio algorithms and the tradeoffs
that can reduce the computational complexity and the effect of that on the
performance of the algorithm.
The processing part, which is the set of design rules, makes use of all of this
input information and the built-in knowledge-based systems to determine the
algorithm and its corresponding architecture that minimize the power dissipation.
Finally, the output part is the power optimum design. 
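The input, processing and output parts described above can be sketched as a toy selection step. Everything in this sketch is hypothetical: the class names, the candidate designs and their figures are placeholders, and a real system would derive the power, throughput, SNR and area estimates from the building block library and the knowledge-based systems rather than take them as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One algorithm/architecture pair with its estimated figures."""
    name: str
    power_mw: float
    throughput_msps: float
    snr_db: float
    area_mm2: float


@dataclass(frozen=True)
class Constraints:
    """Design constraints supplied in the input part."""
    min_throughput_msps: float
    min_snr_db: float
    max_area_mm2: float


def select_min_power(candidates, constraints):
    """Processing part: keep feasible candidates, return the lowest-power one."""
    feasible = [
        c for c in candidates
        if c.throughput_msps >= constraints.min_throughput_msps
        and c.snr_db >= constraints.min_snr_db
        and c.area_mm2 <= constraints.max_area_mm2
    ]
    if not feasible:
        return None  # no design meets the constraints
    return min(feasible, key=lambda c: c.power_mw)


if __name__ == "__main__":
    # Hypothetical candidates; the numbers are placeholders only.
    candidates = [
        Candidate("A", power_mw=12.0, throughput_msps=2.0, snr_db=60.0, area_mm2=4.0),
        Candidate("B", power_mw=8.0, throughput_msps=1.0, snr_db=61.0, area_mm2=3.0),
        Candidate("C", power_mw=10.0, throughput_msps=1.5, snr_db=62.0, area_mm2=5.0),
    ]
    limits = Constraints(min_throughput_msps=1.25, min_snr_db=58.0, max_area_mm2=6.0)
    print(select_min_power(candidates, limits))
```

Here candidate B has the lowest power but fails the throughput constraint, so the sketch returns candidate C. The exhaustive filter stands in for the design rules; exploring a realistic algorithmic design space would need the transformations and estimation models discussed in this section.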
Most of the low-power techniques presented in this dissertation rely on the min- 
imization of the computational complexity by eliminating the redundant computa- 
tions. But what is the minimum computation required for a certain architecture? 
And how can we reach this limit? These questions can be answered for each ar-
chitecture individually. However, there is no automated design methodology that
can transform any general architecture into an architecture with minimum compu-
tations. This topic needs further research.
8.3.4 New Low-Power Techniques 
The breakdown of power dissipation in digital ICs used in portable terminals is given
by the pie-chart of Figure 8.1¹. It is interesting to note that the power dissipated
by the clock and its associated circuits is about 50% of the total power dissipation.
Asynchronous circuits do not require a clock and thus have the potential of achieving
a saving in power dissipation of up to 50%.
Whereas synchronous circuits have been around for many years and are fully
understood, the application of asynchronous circuits in low-power design has just started
recently [173] [174]. Further research is needed in this area. It is envisioned that
future systems might be some type of hybrid system using both asynchronous and
synchronous circuits.
¹From the presentation of Dr. Deo S. Singhat at the University of Waterloo, on Tuesday the
1st of April 1997.
[Pie chart; labelled segment: Clock]
Figure 8.1: Pie-chart of the distribution of the power dissipation in portable termi-
nals.
Low-power design is vital for the survival of the telecommunication and com-
puter industries. As the functionality of portable terminals increases and as tradi-
tional low-power techniques are exhausted, new low-power techniques must be developed
and integrated into future portable terminals. 
Appendix A 
The Simulation of the 
Sigma-Delta Modulator 
The simulation of the Sigma-Delta modulator was performed using SPW™ (Signal
Processing WorkSystem®). The Signal Processing WorkSystem is an integrated
framework for developing discrete-time signal processing systems and communica-
tion protocols [175]. SPW was used to model and simulate different Sigma-Delta
modulator architectures with the purpose of verifying the functionality and deter-
mining the effect of implementation details such as mismatch and gain errors on
the different architectures. 
SPW consists of several modules. The main modules used for modeling and 
simulating the Sigma-Delta modulator are [175]: 
1. The Block Diagram Editor also called Designer BDE. This is where the 
discrete-time signal processing system is created, edited and wired together. 
2. The Signal Flow Simulator. This is the tool that simulates the operation of
the discrete-time signal processing system designed using the SPW block diagram
editor.
3. The Signal Calculator™. This is used for creating input signals and ana-
lyzing output signals.
Figure A.1: A second-order bandpass Sigma-Delta modulator modeled in SPW.
Figure A.1 shows the block diagram of a single channel (I or Q) bandpass Sigma-
Delta modulator. The input signal generated by SIGNAL GEN is a sinusoidal wave
with a low frequency. The bandpass sampler samples the bandpass signal at
double the carrier frequency for the I or Q channels. Therefore, to get a sampled
stream identical to that at the output of the bandpass sampler, the low frequency
sinusoidal wave is multiplied by a 1, -1, 1, -1, ... stream.
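The test stimulus described above can be reproduced outside SPW with a short sketch. The function names, frequency and sample rate are arbitrary assumptions for illustration. Multiplying a low frequency sinusoid by the alternating +1, -1 stream is the same as shifting it in frequency by half the sample rate, which the second function makes explicit.

```python
import math


def bandpass_sampled_stream(freq_hz, sample_rate_hz, num_samples):
    """Low frequency sinusoid multiplied by the stream 1, -1, 1, -1, ..."""
    return [
        ((-1) ** k) * math.sin(2.0 * math.pi * freq_hz * k / sample_rate_hz)
        for k in range(num_samples)
    ]


def shifted_sinusoid(freq_hz, sample_rate_hz, num_samples):
    """The same sinusoid shifted up in frequency by half the sample rate."""
    shifted = freq_hz + sample_rate_hz / 2.0
    return [
        math.sin(2.0 * math.pi * shifted * k / sample_rate_hz)
        for k in range(num_samples)
    ]


if __name__ == "__main__":
    a = bandpass_sampled_stream(100.0, 8000.0, 64)
    b = shifted_sinusoid(100.0, 8000.0, 64)
    print(max(abs(x - y) for x, y in zip(a, b)))
```

The two sequences agree to within floating-point rounding, since sin(θ + πk) = (-1)^k sin(θ) for integer k.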
Figures A.2-A.7 show the SPW model of the novel parallel bandpass Sigma-
Delta modulator in different configurations. The block diagrams of the block nint1
of Figure A.1, and of the block int2 of Figures A.2-A.7, are shown in Figures A.8
and A.9 respectively.
Figure A.2: SPW model of a single-channel second-order bandpass Sigma-Delta
modulator, with two cross-coupled branches, and common filtering done after sub-
traction.
Figure A.3: SPW model of a single-channel second-order bandpass Sigma-Delta
modulator, with two cross-coupled branches, and common filtering done before
branch splitting.
Figure A.4: SPW model of a single-channel fourth-order bandpass Sigma-Delta
modulator, with two cross-coupled branches, and common filtering done after sub-
traction in the first and second stages.
Figure A.5: SPW model of a single-channel fourth-order bandpass Sigma-Delta
modulator, with two cross-coupled branches, and common filtering done after sub-
traction in the first stage and before branch splitting in the second stage.
Figure A.6: SPW model of a single-channel fourth-order bandpass Sigma-Delta
modulator, with two cross-coupled branches, and common filtering done before
branch splitting in the first stage and after subtraction in the second stage.
Figure A.7: SPW model of a single-channel fourth-order bandpass Sigma-Delta
modulator, with two cross-coupled branches, and common filtering done before
branch splitting in the first and second stages.
Figure A.8: SPW model of the block nint1 of Figure A.1 (parameters shown: G Loop 1.0, G Ext 1.0).
Figure A.9: SPW model of the block int2 of Figures A.2-A.7 (parameters shown: G Loop 1.0, G Ext 1.0).
Appendix B 
Sinc Decimator Analysis
The transfer function for the Sinc decimator [$\mathrm{Sinc}^n(2^m)$] is given by:
\[ H(z) = \left( \frac{1}{M} \cdot \frac{1 - z^{-M}}{1 - z^{-1}} \right)^{n} \tag{B.1} \]
where $M = 2^m$. According to Equation B.1, the Sinc decimator can be im-
plemented as a cascade of n integrators followed by a $2^m$ down sampler and then
followed by n differentiators [176]. This architecture has been previously explained
in section 5.7.4, and is shown in Figure 5.43.
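The integrator, down sampler and differentiator cascade can be checked numerically with a minimal sketch. It makes two simplifying assumptions that are not taken from Figure 5.43: the integrators are delay-free with zero initial state, and the down sampler keeps samples M-1, 2M-1, and so on. The division by 2^{mn} of Equation B.1 is left out so that all values stay integers.

```python
def cic_decimate(samples, order, factor):
    """Sinc decimator as integrators, a down sampler and differentiators.

    Delay-free integrators with zero initial state are assumed; the
    output is not yet scaled by 1 / factor**order.
    """
    signal = list(samples)
    for _ in range(order):          # cascade of integrators at the input rate
        acc = 0
        integrated = []
        for value in signal:
            acc += value
            integrated.append(acc)
        signal = integrated
    signal = signal[factor - 1::factor]   # keep every factor-th sample
    for _ in range(order):          # cascade of differentiators at the output rate
        prev = 0
        differentiated = []
        for value in signal:
            differentiated.append(value - prev)
            prev = value
        signal = differentiated
    return signal


if __name__ == "__main__":
    # Constant input of ones into a third-order decimator with M = 8.
    print(cic_decimate([1] * 48, order=3, factor=8))
```

For a constant input of ones the unscaled output settles at 512 = 8^3 after two start-up samples, even though the integrator outputs themselves keep growing. Dividing by 2^{mn} = 512 gives the unity DC gain expected from Equation B.1.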
The purpose of the following analysis is to find the effect of each input sample $i_b$
on an output sample $O_b$. This analysis shows that even though the Sinc decimator
consists of integrators, which are infinite impulse response blocks, the response of
the entire decimator of Figure 5.43.a has finite impulse response. Furthermore, we
determine the maximum Sinc decimator output and the input sequence necessary 
to give this maximum output. 
Before proceeding with the analysis, the format of the input and output samples
needs to be defined. The input samples, which can be represented by k bits, are
bipolar samples having values:
$i_s$: $2^k - 1,\ 2^k - 3,\ \ldots,\ -(2^k - 3),\ -(2^k - 1)$
These samples are reinterpreted as unsigned binary values, through the following
mapping:
The output samples, $O_b$, of the Sinc decimator are unsigned binary integers with
resolution $k + mn$ bits. When we divide this by $2^{mn}$, as required in Equation B.1,
the output samples, $O_b$, become unsigned binary numbers with k integer bits and
mn fraction bits. To obtain the actual output samples $O_s$ from $O_b$, the following
mapping is used:
$O_s$ is a signed binary number with one sign bit, k integer bits and $mn - 1$
fraction bits.
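The two mappings referred to above do not appear in this copy of the text. One pair of linear mappings that is consistent with the stated value ranges and bit formats is given below as an assumption, not as the author's exact equations.

```latex
% Assumed input mapping: bipolar i_s in {-(2^k - 1), ..., 2^k - 1} (steps of 2)
% to unsigned i_b in {0, ..., 2^k - 1}.
i_b = \frac{i_s + (2^k - 1)}{2}

% Assumed output mapping: unsigned O_b (k integer bits, mn fraction bits)
% to signed O_s (one sign bit, k integer bits, mn - 1 fraction bits).
O_s = 2\,O_b - (2^k - 1)
```

For k = 1 this maps the input values +1 and -1 to 1 and 0, and maps an unsigned output between 0 and 1 to a signed output between -1 and +1. The output mapping is simply the inverse of the input mapping, which is what the unity DC gain of Equation B.1 requires.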
The objective of this analysis is to find the relation between $i_b$ and $O_s$, for a
Sinc decimator having k = 1, m = 3, and n = 3. Figure B.1 shows the progression
of an input sample at discrete time T = 1 through the Sinc decimator, to its output.
Each integrator has a delay of one unit. The down-sampler decimates the signal at
discrete times d, d + 8, d + 16, ..., where $4 \le d \le 11$.
Figure B.1: The progression of an input sample through a Sinc decimator having
k = 1, m = 3 and n = 3.
For an output sample at discrete time T, the latest input sample that can
contribute to this output sample is at discrete time T - 3. The earliest input
sample that can contribute to this output sample is at discrete time T - 24. Notice
the finite impulse response of the Sinc decimator, despite the use of infinite impulse 
response integrators. Furthermore, the contribution of each input sample to the
output sample is given in Table B.1. If we add all these contributions together, 
which means we have 22 successive ones at the input, the output will be 512. This 
is the maximum output. 
Because any decimation filter is generally a linear time-variant system, having
22 successive ones at the input of the Sinc decimator is not a sufficient condition to
obtain an output of 512. The latest of these 22 successive ones must be 3 discrete
time units earlier than the sampling time of the down-sampler.
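The individual contributions can be regenerated with a short sketch. It assumes that the contribution of the input sample at discrete time T - 3 - i equals the i-th coefficient of the n-fold convolution of a length-M block of ones, which follows from Equation B.1 once the scaling is left out. The function name is hypothetical.

```python
def sinc_contributions(order, factor):
    """Unscaled contributions of successive input samples to one output.

    Entry i is the contribution of the input at T - order - i, assuming
    one unit of delay per integrator.
    """
    taps = [1]
    for _ in range(order):
        widened = [0] * (len(taps) + factor - 1)
        for i, tap in enumerate(taps):
            for j in range(factor):
                widened[i + j] += tap   # convolve with a block of ones
        taps = widened
    return taps


if __name__ == "__main__":
    taps = sinc_contributions(order=3, factor=8)
    for i, tap in enumerate(taps):
        print(f"T-{3 + i}: {tap}")
    print(len(taps), sum(taps))
```

For k = 1, m = 3 and n = 3 this gives 22 contributions, for inputs at T - 3 through T - 24, that sum to 512. The contribution of the input at T - 13 is 48.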
Table B.1: The effect of an input sample to the Sinc decimator at discrete time
T - n on the output of the Sinc decimator at discrete time T. The Sinc decimator
is $\mathrm{Sinc}^3(8)$.
Discrete time | Contribution to output at T | Discrete time | Contribution to output at T
              | 0                           | T-13          | 48
Bibliography 
[1] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS
digital design," IEEE Journal of Solid State Circuits, vol. SC-27, pp. 473-
484, Apr. 1992.
[2] J. M. Rabaey and M. Pedram, Low Power Design Methodologies. Boston:
Kluwer Academic Publishers, 1996.
[3] A. Bellaouar and M. I. Elmasry, Low-Power Digital VLSI Design: Circuits
and Systems. Boston: Kluwer Academic Publishers, 1995.
[4] P. Michel, U. Lauther, and P. Duzy, The Synthesis Approach to Digital System
Design. Boston: Kluwer Academic Publishers, 1992.
[5] D. Singh, J. M. Rabaey, M. Pedram, F. Catthoor, S. Rajgopal, N. Sehgal, 
and T. J. Mozdzen, "Power conscious CAD tools and methodologies: A per- 
spective," Proceedings of the IEEE, vol. 83, pp. 570-594, Apr. 1995. 
[6] B. Nadel, "The green machine," PC Magazine, vol. 12, pp. 110-145, May 25
1993.
[7] T. H. Meng, B. M. Gordon, E. K. Tsern, and A. C. Hung, "Portable video-
on-demand in wireless communication," Proceedings of the IEEE, vol. 83,
pp. 659-679, Apr. 1995. 
[8] S. Sheng, A. Chandrakasan, and R. W. Brodersen, "A portable multimedia 
terminal," IEEE Communications Magazine, pp. 64-75, Dec. 1992. 
[9] P. Pirsch, N. Demassieux, and W. Gehrke, "VLSI architectures for video
compression - A survey," Proceedings of the IEEE, vol. 83, pp. 220-246,
Feb. 1995.
[10] J. Mitola, "The software radio architecture," IEEE Communications Maga-
zine, pp. 26-38, May 1995.
[11] J. A. Wepman, "Analog-to-digital converters and their applications in radio
receivers," IEEE Communications Magazine, pp. 39-45, May 1995.
[12] R. Baines, "The DSP bottleneck," IEEE Communications Magazine, pp. 46-
54, Mar. 1995.
[13] G. H. Heilmeier, "Personal communications: Quo vadis," in IEEE Interna- 
tional Solid-State Circuits Conference, pp. 24-26, 1992. 
[14] E. H. Armstrong, "A new system of short wave amplifications," Proceedings
of the Institute of Radio Engineers, vol. 9, pp. 3-27, 1921.
[15] E. H. Armstrong, "The superheterodyne - its origin, development and some
recent improvements," Proceedings of the Institute of Radio Engineers, vol. 12,
pp. 539-552, 1924.
[16] W. Gosling, Radio Receivers. Cambridge, England: Peter Peregrinus Ltd.,
1986.
[17] A. A. Abidi, "Low-power radio-frequency IC's for portable communications,"
Proceedings of the IEEE, vol. 83, pp. 544-569, Apr. 1995.
[18] R. J. Lackey and D. W. Upmal, "Speakeasy: The military software radio,"
IEEE Communications Magazine, pp. 56-61, May 1995.
[19] J. E. Padgett, C. G. Gunther, and T. Hattori, "Overview of wireless personal
communications," IEEE Communications Magazine, pp. 28-41, Jan. 1995.
[20] D. J. Goodman, "Second generation wireless information networks," IEEE
Transactions on Vehicular Technology, vol. 40, pp. 366-374, 1991.
[21] L. E. Larson, RF and Microwave Circuit Design for Wireless Communications.
Boston: Artech House, 1996.
[22] IS-137, "800 MHz TDMA cellular - radio interface - minimum performance 
standards for mobile stations," 1994. 
[23] K. Pahlavan and A. H. Levesque, "Wireless data communications," Proceed-
ings of the IEEE, vol. 82, pp. 1398-1430, Sept. 1994.
[24] M. J. Marcus, "Recent U.S. regulatory decisions on civil uses of spread spec- 
trum," in GLOBECOM, vol. 1, (New Orleans, Louisiana), pp. 16.611-3, Dec. 
1985. 
[25] T. Tsukahara, M. Ishikawa, and M. Muraguchi, "A 2V 2GHz Si-bipolar
direct conversion quadrature modulator," in IEEE International Solid-State
Circuits Conference, (San Francisco, California), pp. 40-41, Feb. 1994.
[26] F. E. Terman, Radio Engineers' Handbook. New York: McGraw-Hill, 1943.
[27] T. Okanobu, H. Tomiyama, and H. Arimoto, "Advanced low voltage single 
chip radio IC," IEEE Transactions on Consumer Electronics, vol. 38, pp. 465- 
475, Aug. 1992. 
[28] H. Tsurumi and T. Maeda, "Design study on direct conversion receiver front-
end for 280 MHz, 900 MHz, and 2.6 GHz band radio communication systems,"
in IEEE Vehicular Technology Conference, (St. Louis, Missouri), pp. 457-462,
May 1991.
[29] J. Sevenhans, A. Vanwelsenaers, J. Wenin, and J. Baro, "An integrated Si
bipolar RF transceiver for a zero IF 900 MHz GSM digital radio front-end
of a hand portable phone," in Custom Integrated Circuits Conference, (San
Diego, California), pp. 7.711-4, May 1991.
[30] G. Schultes, A. L. Scholtz, E. Bonek, and P. Veith, "A new incoherent direct
conversion receiver," in IEEE Vehicular Technology Conference, (Orlando,
Florida), pp. 668-674, May 1990.
[31] G. Schultes, E. Bonek, P. Weger, and W. Herzog, "Basic performance of a
direct conversion DECT receiver," Electronics Letters, vol. 26, pp. 1746-1748,
1990.
[32] A. Bateman and D. Haines, "Direct conversion transceiver design for compact
low-cost portable mobile radio terminals," in IEEE Vehicular Technology
Conference, vol. 1, pp. 57-62, 1989.
[33] P. Estabrook and B. B. Lusignan, "The design of a mobile radio receiver using
a direct conversion architecture," in IEEE Vehicular Technology Conference,
vol. 1, (San Francisco, California), pp. 63-72, May 1989.
[34] R. W. A. Bateman, D.M. Haines, "Linear transceiver architectures," in IEEE
Vehicular Technology Conference, (Philadelphia, Pennsylvania), pp. 478-
484, June 1988.
[35] F. Piazza and Q. Huang, "A 170MHz RF front-end for ERMES pager applica-
tions," in IEEE International Solid-State Circuits Conference, (San Francisco,
California), pp. 324-325, Feb. 1995.
[36] C. Takahashi et al., "A 1.9GHz Si direct conversion receiver IC for QPSK 
modulation systems," in IEEE International Solid-State Circuits Conference, 
(San Francisco, California), pp. 138-139, Feb. 1995. 
[37] A. Fernandez-Duran et al., "Zero-IF receiver architecture for multistandard
compatible radio systems," in IEEE Vehicular Technology Conference, vol. 2,
(Atlanta, Georgia), pp. 1052-1056, Apr. 1996.
[38] J. Kennedy and M. C. Sullivan, "Direction finding and "smart antennas"
using software radio architectures," IEEE Communications Magazine, pp. 62-
68, May 1995.
[39] "SPEAKeasy Home, 
http://troi.web.d.af.mil:8001/Technology/Demos/SPEAKEASY/." 
[40] N. H. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Sys-
tems Perspective. Reading, Massachusetts: Addison-Wesley Publishing Com-
pany, 1993. 
[41] A. P. Chandrakasan and R. W. Brodersen, "Minimizing power consumption 
in digital CMOS circuits," Proceedings of the IEEE, vol. 83, pp. 498-523, 
Apr. 1995. 
[42] H. J. M. Veendrick, "Short-circuit dissipation of static CMOS circuitry and its 
impact on the design of buffer circuits," IEEE Journal of Solid State Circuits, 
vol. SC-19, pp. 468-473, Aug. 1984. 
[43] E. S. Yang, Micro-Electronic Devices. New York: McGraw Hill, 1988.
[44] J.-P. Colinge, Silicon-on-Insulator Technology: Materials to VLSI. Boston:
Kluwer Academic Publishers, 1991.
[45] S. M. Kang, "Accurate simulation of power dissipation in VLSI circuits,"
IEEE Journal of Solid State Circuits, vol. SC-21, pp. 889-891, Oct. 1986.
[46] F. N. Najm, "A survey of power estimation techniques in VLSI circuits,"
IEEE Transactions on VLSI Systems, vol. 2, pp. 446-455, Dec. 1994.
[47] M. A. Cirit, "Estimating dynamic power consumption of CMOS circuits,"
in IEEE International Conference on Computer-Aided Design, (Santa Clara,
California), pp. 534-537, Nov. 1987. 
[48] A. Shen, A. Ghosh, S. Devadas, and K. Keutzer, "On average power dissipa- 
tion and random pattern testability of CMOS combinational logic networks," 
in IEEE/ACM International Conference on Computer-Aided Design, (Santa
Clara, California), pp. 402-407, Nov. 1992.
[49] F. N. Najm, "Transition density: A new measure of activity in digital cir-
cuits," IEEE Transactions on Computer-Aided Design, vol. 12, pp. 310-323,
Feb. 1993.
[50] A. Ghosh, S. Devadas, K. Keutzer, and J. White, "Estimation of average
switching activity in combinational and sequential circuits," in IEEE/ACM
International Conference on Computer-Aided Design, (Santa Clara, Califor-
nia), pp. 253-259, Nov. 1992.
[51] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W.
Brodersen, "Optimizing power using transformations," IEEE Transactions
on Computer-Aided Design, vol. 14, pp. 12-31, Jan. 1995.
[52] P. E. Landman and J. M. Rabaey, "Power estimation for high level synthesis,"
in Proceedings of the European Conference on Design Automation, (Paris,
France), pp. 304-308, Feb. 1993. 
[53] E. A. Vittoz, "Low-power design: Ways to approach the limits," in IEEE 
International Solid-State Circuits Conference, (San Francisco, California), 
pp. 14-18, Feb. 1994. 
[54] D. C. Cox, "Universal digital portable radio communications," Proceedings of 
the IEEE, vol. 75, pp. 436-477, Apr. 1987. 
[55] A. Matsuzawa, "Low-voltage and low-power circuit design for mixed ana- 
log/digital systems in portable equipments," IEEE Journal of Solid State 
Circuits, vol. 29, pp. 470-480, Apr. 1994. 
[56] A. P. Chandrakasan, A. Burstein, and R. W. Brodersen, "A low-power chipset 
for a portable multimedia I/O terminal," IEEE Journal of Solid State Cir-
cuits, vol. 29, pp. 1415-1428, Dec. 1994.
[57] T. Barber, P. Carvey, and A. Chandrakasan, "Designing for wireless LAN
communications," Circuits & Devices Magazine, pp. 29-33, July 1996.
[58] I. Kang and A. N. Willson, "A low-power state-sequential Viterbi decoder
for CDMA digital cellular applications," in IEEE International Circuits and
Systems Conference, vol. 4, (Atlanta, Georgia), pp. 272-276, May 1996.
[59] R. Cypher and C. B. Shung, "Generalized traceback techniques for survivor
memory management in Viterbi decoder," in GLOBECOM, vol. 2, (Houston, 
Texas), pp. 1318-1322, Dec. 1993. 
[60] E. Boutillon and N. Demassieux, "High speed low power architecture for 
memory management in a Viterbi decoder," in IEEE International Circuits 
and Systems Conference, vol. 4, (Atlanta, Georgia), pp. 284-287, May 1996. 
[61] M. D. Hahm, E. G. Friedman, and E. L. Titlebaum, "Analog vs. digital:
A comparison of circuit implementations for low-power matched filters," in
IEEE International Circuits and Systems Conference, vol. 4, (Atlanta, Geor-
gia), pp. 280-283, May 1996.
[62] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, and R. W. Brodersen,
"HYPER-LP: A system for power minimization using architectural trans-
formations," in IEEE International Conference on Computer-Aided Design,
(Santa Clara, California), pp. 300-303, Nov. 1992. 
[63] W. Lee et al., "A 1V DSP for wireless communications," in IEEE Interna- 
tional Solid-State Circuits Conference, (San Francisco, California), pp. 92-93, 
Feb. 1997. 
[64] C. J. Pan, 'LA low-power digital filter for decimation and interpolation using 
approximate processing," in IEEE International Solid-State Circuits Confer- 
ence, (San Francisco, California), pp. 102-103, Feb. 1997. 
[65] T. Kuroda et al., "A 0.9V 150kECz lOmW 4mm2 2-D discrete cosine transform 
core processor with variable-threshold-voltage scheme," in IEEE Interna- 
tional Solid-State Circuits Conference, (San Francisco, California), pp. 166- 
167, Feb. 1996. 
(66) B. M. Gordon, T. H. Meng, and N. Chaddha, "A 1.2mW video-rate 2D color 
subband decoder," in IEEE International Solid-State Circuits Conference, 
(San Francisco, California), pp. 290-291, Feb. 1995. 
[67] E. K. Tsern and T. H. Meng, "A low-power video-rate pyramid VQ decoder,"
in IEEE International Solid-State Circuits Conference, (San Francisco, Cali- 
fornia), pp. 162-163, Feb. 1996. 
[68] W. J. Bowhill et al., "A 300MHz 64b quad-issue CMOS RISC microproces- 
sor," in IEEE International Solid-State Circuits Conference, (San Francisco, 
California), pp. 182-183, Feb. 1995. 
[69] N. K. Yeung et al., "The design of a 55SPECint92 RISC processor under
2 W," in IEEE International Solid-State Circuits Conference, (San Francisco, 
California), pp. 206-207, Feb. 1994. 
[70] J. Montanaro et al., "A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor,"
IEEE Journal of Solid State Circuits, pp. 1703-1714, Nov. 1996. 
[71] D. W. Dobberpuhl et al., "A 200-MHz 64-b dual-issue CMOS microproces- 
sor," IEEE Journal of Solid State Circuits, vol. 27, pp. 1555-1567, Nov. 1992. 
[72] R. Bechade et al., "A 32b 66MHz 1.8W microprocessor," in IEEE
International Solid-State Circuits Conference, (San Francisco, California),
pp. 208-209, Feb. 1994.
[73] D. Pham et al., "A 3.0W 75SPECint92 85SPECfp92 superscalar RISC
microprocessor," in IEEE International Solid-State Circuits Conference, (San
Francisco, California), pp. 212-213, Feb. 1994.
[74] J. M. C. Stork, "Technology leverage for ultra-low power information
systems," Proceedings of the IEEE, vol. 83, pp. 607-618, Apr. 1995.
[75] M. Ino et al., "0.25µm CMOS/SIMOX gate array LSI," in IEEE International
Solid-State Circuits Conference, (San Francisco, California), pp. 86-87, Feb. 
1996. 
[76] M. Kakumu and M. Kinugawa, "Power-supply voltage impact on circuit per- 
formance for half and lower submicrometer CMOS LSI," IEEE Transactions 
on Electron Devices, vol. 37, pp. 1902-1908, Aug. 1990. 
[77] S. Mutoh et al., "1 V high-speed digital circuit technology with 0.5-µm
multi-threshold CMOS," in IEEE International ASIC Conference and Ex- 
hibit, (Rochester, New York), pp. 186-189, Sept. 1993. 
[78] K. Yano et al., "A 3.8-ns CMOS 16 x 16-b multiplier using complementary 
pass-transistor logic," IEEE Journal of Solid State Circuits, vol. 25, pp. 388- 
395, Apr. 1990. 
[79] A. J. Stratakos, S. R. Sanders, and R. W. Brodersen, "A low-voltage CMOS
DC-DC converter for a portable battery operated system," in IEEE Power
Electronics Specialists Conference, pp. 619-626, 1994.
[80] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou,
"Precomputation-based sequential logic optimization for low power," in In- 
ternational Workshop in Low Power Design, 1994. 
[81] T. Douseki et al., "A 0.5V SIMOX-MTCMOS circuit with 200ps logic gate,"
in IEEE International Solid-State Circuits Conference, (San Francisco,
California), pp. 84-85, Feb. 1997.
[82] S. R. Vemuru and A. R. Thorbjornsen, "Variable-taper CMOS buffer," IEEE
Journal of Solid State Circuits, vol. 26, pp. 1265-1269, Sept. 1991.
[83] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative
Approach. San Mateo, California: Morgan Kaufmann Publishers, Inc., 1990.
[84] I. Koren, Computer Arithmetic Algorithms. Englewood Cliffs: Prentice Hall,
1993.
[85] T. K. Callaway and E. E. Swartzlander, "Estimating the power consumption
of CMOS adders," in IEEE Symposium on Computer Arithmetic, (Windsor, 
Ontario, Canada), pp. 210-216, June 1993. 
[86] T. K. Callaway and E. E. Swartzlander, "Optimizing arithmetic elements for
signal processing," in VLSI Signal Processing Workshop, pp. 91-100, 1992.
[87] E. N. Farag and M. I. Elmasry, "Using carry save adders to reduce power
dissipation," in Eighth International Conference on Microelectronics, (Cairo,
Egypt), pp. 173-176, Dec. 1996.
[88] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression.
Kluwer Academic Publishers, 1992. 
[89] B. G. Lee, "A new algorithm to compute the discrete cosine transform," 
IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP- 
32, pp. 1243-1245, Dec. 1984. 
[90] S. A. White, "Applications of distributed arithmetic to digital signal
processing: A tutorial review," IEEE ASSP Magazine, pp. 4-19, July 1989.
[91] C.-L. Su, C.-Y. Tsui, and A. M. Despain, "Low power architecture design and
compilation techniques for high-performance processors," in Compcon, (San 
Francisco, California), pp. 489-498, Mar. 1994. 
[92] E. N. Farag and M. I. Elmasry, "Low-power subband coding algorithm," in
IEEE International Conference of Acoustics, Speech and Signal Processing,
vol. 4, (Atlanta, Georgia), pp. 2116-2119, May 1996.
[93] M. D. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence
Algorithms and Implementations. Kluwer Academic Publishers, 1994.
[94] A. Avizienis, "Signed digit number representation for fast parallel
arithmetic," IRE Trans. Electron. Comput., vol. EC-10, pp. 389-400, Sept. 1961.
[95] M. D. Ercegovac and T. Lang, "On-the-fly conversion of redundant into
conventional representation," IEEE Transactions on Computers, vol. C-36,
pp. 895-897, July 1987.
[96] F. N. Najm, "Towards a high-level power estimation capability," in
Proceedings of 1995 International Symposium on Low Power Design, pp. 87-92,
Apr. 1995.
[97] E. N. Farag, M. A. Hasan, and M. I. Elmasry, "Low-power radix 2 division
algorithm with minimum add/sub operations," in SPIE Advanced Signal
Processing Algorithms, Architectures and Implementations VI, vol. 2846,
(Denver, Colorado), pp. 39-50, Aug. 1996.
[98] E. N. Farag, M. A. Hasan, and M. I. Elmasry, "Minimizing add/sub
operations in a radix 2 division algorithm," IEEE Transactions on Computers.
[99] D. Wong, B. H. Juang, and A. H. Gray, "An 800 bit/s vector quantization
LPC vocoder," IEEE Transactions on Acoustics, Speech and Signal Processing,
vol. ASSP-30, pp. 770-779, Oct. 1982.
[100] E. Farag, M. Saleh, N. Elnady, and M. Elmasry, "Structure and network
control of a hierarchical mobile network architecture," in Phoenix Conference
on Computers and Communications, pp. 671-677, 1995.
[101] E. N. Farag, M. I. Elmasry, M. N. Saleh, and N. M. Elnady, "A two-level
hierarchical mobile network: Structure and network control," International
Journal of Reliability, Quality and Safety Engineering, vol. 3, pp. 325-351,
Dec. 1996.
[102] B. M. Gordon and T. H. Meng, "A low power subband video decoder
architecture," in IEEE International Conference on Acoustics, Speech and Signal
Processing, pp. II-409-412, Apr. 1994.
[103] G. Wallace, "The JPEG still picture compression standard," Communications
of the ACM, vol. 34, pp. 30-44, Apr. 1991.
[104] W. Li and Y.-Q. Zhang, "Vector-based signal processing and quantization for
image and video compression," Proceedings of the IEEE, vol. 83, pp. 317-335, 
Feb. 1995. 
[105] M. Liou, "Overview of the px64 kbit/s video coding standard,"
Communications of the ACM, vol. 34, pp. 59-63, Apr. 1991.
[106] D. LeGall, "MPEG: A video compression standard for multimedia applica- 
tions," Communications of the ACM, vol. 34, pp. 46-58, Apr. 1991. 
[107] J. W. Woods and S. O'Neil, "Subband coding of images," IEEE Transactions
on Acoustics, Speech and Signal Processing, vol. 34, pp. 1278-1288, Oct. 1986.
[108] W. Li and Y.-Q. Zhang, "A study of non-separable subband filters for video
coding," in SPIE Visual Communications and Image Processing, vol. 1818, 
Part 1, (Boston, Massachusetts), pp. 233-240, Nov. 1992. 
[109] J. Woods, Subband Image Coding. Boston: Kluwer Academic Publishers,
1991. 
[110] N. I. Cho and S. U. Lee, "Fast algorithm and implementation of 2-D
discrete cosine transform," IEEE Transactions on Circuits and Systems, vol. 38,
pp. 297-305, Mar. 1991.
[111] M.-T. Sun, T.-C. Chen, and A. M. Gottlieb, "VLSI implementation of a 16 x
16 discrete cosine transform," IEEE Transactions on Circuits and Systems, 
vol. 36, pp. 610-617, Apr. 1989. 
[112] Y.-H. Chan and W.-C. Siu, "On the realization of discrete cosine transform
using the distributed arithmetic," IEEE Transactions on Circuits and Systems,
vol. 39, pp. 705-712, Sept. 1992.
[113] E. N. Farag and M. I. Elmasry, "Low-power implementation of discrete cosine
transform," in Sixth Great Lakes Symposium on VLSI, (Ames, Iowa), pp. 174- 
177, Mar. 1996. 
[114] R. E. Crochiere, S. A. Webber, and J. L. Flanagan, "Digital coding of speech 
in sub-bands," The Bell System Technical Journal, vol. 55, pp. 1069-1085, 
Oct. 1976. 
[115] M. Vetterli, "Multidimensional subband coding: Some theory and
algorithms," Signal Processing, vol. 6, pp. 97-112, Apr. 1984.
[116] P. P. Vaidyanathan, "Quadrature mirror filter banks, M-Bank extensions and 
perfect-reconstruction techniques," IEEE ASSP Magazine, vol. 3, pp. 4-20,
July 1987. 
[117] M. Vetterli and D. LeGall, "Perfect reconstruction FIR filter banks: Some
properties and factorizations," IEEE Transactions on Acoustics, Speech and 
Signal Processing, vol. 37, pp. 1057-1071, July 1989. 
[118] P. P. Vaidyanathan, "Multirate digital filters, filter banks, polyphase
networks, and applications: A tutorial," Proceedings of the IEEE, vol. 77, Dec.
1989.
[119] G. Karlsson and M. Vetterli, "Theory of two-dimensional multirate filter
banks," IEEE Transactions on Acoustics, Speech and Signal Processing, 
vol. 37, pp. 925-937, June 1990. 
[120] P. P. Vaidyanathan, Multirate Systems and Filter Banks. Englewood Cliffs,
N.J.: Prentice Hall, 1993.
[121] R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing. En- 
glewood Cliffs, NJ: Prentice Hall, 1983. 
[122] H. Gharavi and A. Tabatabai, "Sub-band coding of digital images using
two-dimensional quadrature mirror filtering," IEEE Transactions on Circuits and
Systems, vol. 35, pp. 207-214, Feb. 1988.
[123] N. S. Jayant and P. Noll, Digital Coding of Waveforms. Englewood Cliffs, 
NJ: Prentice Hall, 1984. 
[124] A. K. Al-Asmari and R. E. Ahmed, "VLSI architecture for HDTV sub-band
coding using GQMFs' filterbanks," IEEE Transactions on Consumer
Electronics, vol. 41, pp. 1-11, Feb. 1995.
[125] P. H. Westerink, J. Biemond, and D. E. Boekee, "An optimal bit
allocation algorithm for sub-band coding," in IEEE International Conference on
Acoustics, Speech and Signal Processing, pp. 757-760, Apr. 1988.
[126] P. H. Westerink, J. Biemond, and D. E. Boekee, "Evaluation of image
sub-band coding schemes," in Proceedings of the European Signal Processing
Conference, pp. 1149-1152, Sept. 1988.
[127] P. H. Westerink, J. Biemond, and D. E. Boekee, "Sub-band coding of images
using predictive vector quantization," in IEEE International Conference on
Acoustics, Speech and Signal Processing, pp. 1378-1381, Apr. 1987.
[128] P. H. Westerink, D. E. Boekee, J. Biemond, and J. W. Woods, "Subband
coding of images using vector quantization," IEEE Transactions on
Communications, vol. COM-36, pp. 713-719, June 1988.
[129] W. Li and Y.-Q. Zhang, "A study of vector transform coding of
subband-decomposed images," IEEE Transactions on Circuits and Systems for Video
Technology, vol. 4, pp. 383-391, Aug. 1994.
[130] P. M. Aziz, H. V. Sorensen, and J. V. D. Spiegel, "An overview of sigma-delta
converters," IEEE Signal Processing Magazine, pp. 61-84, Jan. 1996.
[131] A. M. Thurston, "Sigma-delta IF A-D converters for digital radios," GEC
Journal of Research, vol. 12, pp. 76-85, Feb. 1995. 
[132] J. C. Candy, B. A. Wooley, and O. J. Benjamin, "A voiceband codec with
digital filtering," IEEE Transactions on Communications, vol. COM-29,
pp. 815-830, June 1981.
[133] B. P. Brandt and B. A. Wooley, "A low-power area efficient digital filter for
decimation and interpolation," IEEE Journal of Solid State Circuits, vol. 29,
pp. 679-687, June 1994.
[134] R. Schreier and W. M. Snelgrove, "Decimation for bandpass sigma-delta
analog-to-digital conversion," in IEEE International Conference on Circuits
and Systems, vol. 3, pp. 1801-1804, May 1990.
[135] J. C. Candy, "Decimation for sigma delta modulation," IEEE Transactions
on Communications, vol. COM-34, pp. 72-76, Jan. 1986. 
[136] H. Meyr and R. Subramanian, "Advanced digital receiver principles and
technologies for PCS," IEEE Communications Magazine, pp. 68-78, Jan. 1995.
[137] M. Rebeschini, N. R. V. Bavel, P. Rakers, R. Greene, J. Caldwell, and J. R.
Haug, "A 16-b 160-kHz CMOS A/D converter using sigma-delta modulation,"
IEEE Journal of Solid State Circuits, vol. 25, pp. 431-440, Apr. 1990.
[138] R. Schreier and M. Snelgrove, "Bandpass sigma-delta modulation,"
Electronics Letters, vol. 25, pp. 1560-1561, Nov. 1989.
[139] S. Jantzi, R. Schreier, and M. Snelgrove, "Bandpass sigma-delta analog-to- 
digital conversion," IEEE Transactions on Circuits and Systems, vol. 38, 
pp. 1406-1409, Nov. 1991.
[140] S. A. Jantzi, W. M. Snelgrove, and P. F. Ferguson, "A fourth-order
bandpass sigma-delta modulator," IEEE Journal of Solid State Circuits, vol. 28,
pp. 282-291, Mar. 1993.
[141] L. Longo and B.-R. Horng, "A 15b 30kHz bandpass sigma-delta modulator," 
in IEEE International Solid-State Circuits Conference, (San Francisco, Cali- 
fornia), pp. 226-227, Feb. 1993. 
[142] F. W. Singor and W. M. Snelgrove, "Switched-capacitor bandpass delta-sigma
A/D modulation at 10.7 MHz," IEEE Journal of Solid State Circuits, vol. 30, 
pp. 184-192, Mar. 1995. 
[143] A. J. Coulson, "A generalization of nonuniform bandpass sampling," IEEE
Transactions on Signal Processing, vol. 43, pp. 694-704, Mar. 1995. 
[144] E. N. Farag, R.-H. Yan, and M. I. Elmasry, "A novel parallel architecture for
a switched-capacitor bandpass Σ-Δ modulator," in Midwest Symposium on
Circuits and Systems, (Sacramento, California), Aug. 1997.
[145] E. N. Farag, R.-H. Yan, and M. I. Elmasry, "A parallel bandpass Σ-Δ
modulator: Architecture and performance," IEEE Transactions on Circuits
and Systems.
[146] E. N. Farag, R.-H. Yan, and M. I. Elmasry, "A programmable power-efficient
decimation filter for software radios," in International Symposium on Low 
Power Electronics and Design, (Monterey, California), Aug. 1997. 
[147] E. N. Farag, R.-H. Yan, and M. I. Elmasry, "Decimation filters for software
radios: A power efficient design," IEEE Transactions on Circuits and
Systems.
[148] E. N. Farag, R.-H. Yan, and M. I. Elmasry, "Power-efficient
multiplier-accumulator design for FIR filters," in IEEE Canadian Conference on
Electrical and Computer Engineering, vol. 1, (St. John's, Newfoundland, Canada),
pp. 27-30, May 1997.
[149] E. N. Farag, R.-H. Yan, and M. I. Elmasry, "Decimation filters: Low-power
design and optimization," in IEEE Pacific Rim Conference on
Communications, Computers and Signal Processing, (Victoria, British Columbia,
Canada), Aug. 1997.
[150] A. Microelectronics, "HL400C 3 volt 0.5µm CMOS standard-cell library,"
1995. 
[151] C. S. Wallace, "A suggestion for parallel multipliers," IEEE Transactions on
Electronic Computers, vol. EC-13, pp. 14-17, Feb. 1964. 
[152] K. Hwang, Computer Arithmetic: Principles, Architecture and Design. John 
Wiley & Sons, 1979. 
[153] J. C. Majithia and R. Kitai, "An iterative array for multiplication of signed
binary numbers," IEEE Transactions on Computers, vol. C-20, pp. 214-216,
Feb. 1971.
[154] C. R. Baugh and B. A. Wooley, "A two's complement parallel array
multiplier," IEEE Transactions on Computers, vol. C-22, pp. 1045-1047, Dec.
1973.
[155] A. K. Kwentus, K.-T. Hung, and A. N. Willson, "An architecture for high- 
performance/small area multipliers for use in digital filtering," IEEE Journal 
of Solid State Circuits, vol. 29, pp. 117-121, Feb. 1994. 
[156] A. D. Booth, "A signed binary multiplication technique," Quart. J. Mech. 
Appl. Math., vol. 4, part 2, pp. 236-241, 1951.
[157] N. Ling, "A high-speed parallel array multiplier," International Journal of
Mini and Microcomputers, vol. 14, no. 3, pp. 139-146, 1992.
[158] E. N. Farag, R.-H. Yan, and M. I. Elmasry, "A programmable power-efficient
multiplier accumulator array," IEEE Transactions on Circuits and Systems.
[159] D. Poornaiah, R. Haribabu, and M. Ahmad, "Design and VLSI
implementation of a novel concurrent 16-bit multiplier-accumulator for DSP
applications," in IEEE International Conference on Acoustics, Speech and Signal
Processing, vol. 1, pp. 385-388, 1993.
[160] M. Ahmad and D. Poornaiah, "Design of an efficient VLSI inner-product
processor for real-time DSP applications," IEEE Transactions on Circuits
and Systems, vol. 36, pp. 324-329, Feb. 1989.
[161] A. Microelectronics, "HS600C 5 volt lp600c 3 volt CMOS standard-cell
libraries," 1994.
[162] A. Microelectronics, "HS500C 5 volt 0.5µm CMOS standard-cell library,"
1995. 
[163] L. T. Microelectronics group, "HS350C 5 volt 0.35µm CMOS standard-cell
library," 1996.
[164] L. T. Microelectronics group, "HL350C 3 volt 0.35µm CMOS standard-cell
library," 1996.
[165] E. N. Farag, R.-H. Yan, and M. I. Elmasry, "A novel digital channel selection
algorithm with no pre-filter multiplier," IEEE Transactions on Acoustics,
Speech and Signal Processing.
[166] K. Feher, Advanced Digital Communications Systems and Signal Processing
Techniques. Englewood Cliffs, New Jersey: Prentice Hall, 1987.
[167] B. S. Haroun and M. I. Elmasry, "Architectural synthesis for DSP silicon
compilers," IEEE Transactions on Computer-Aided Design, vol. 8,
pp. 431-447, 1989.
[168] J. Rabaey, C. Chu, P. Hoang, and M. Potkonjak, "Fast prototyping of
datapath-intensive architectures," IEEE Design & Test of Computers,
pp. 40-51, June 1991.
[169] S. Wuytack, F. V. Catthoor, and H. J. D. Man, "Transforming set data types
to power optimal structures," IEEE Transactions on Computer-Aided Design,
vol. 15, pp. 619-629, June 1996.
[170] L. Goodby, A. Orailoglu, and P. M. Chau, "Microarchitectural synthesis of
performance-constrained, low-power VLSI designs," in IEEE International
Conference on Computer Design, pp. 323-326, 1994.
[171] P. E. Landman and J. M. Rabaey, "Activity-sensitive architectural power
analysis," IEEE Transactions on Computer-Aided Design, vol. 15, pp. 571- 
587, June 1996. 
[172] D. Marculescu, R. Marculescu, and M. Pedram, "Information theoretic
measures for power analysis," IEEE Transactions on Computer-Aided Design,
vol. 15, pp. 599-610, June 1996.
[173] K. van Berkel et al., "A fully asynchronous low-power error corrector for the
DCC player," IEEE Journal of Solid State Circuits, vol. 29, pp. 1429-1439,
Dec. 1994.
[174] S.-J. Jou and I.-Y. Chung, "Low-power self-timed circuit design technique,"
Electronics Letters, vol. 33, pp. 110-111, 16th Jan. 1997.
[175] Alta Group of Cadence Design Systems, Inc., Signal Processing WorkSystem
Designer/BDE User's Guide. 1996.
[176] S.-S. Hang and R. Jain, "Decimation filter compiler for oversampling A/D
applications," in IEEE International Conference on Acoustics, Speech and
Signal Processing, vol. 5, pp. 537-540, 1992.
Publications and Patents Resulting from this
Research
Published 
1. E. N. Farag, R.-H. Yan, and M. I. Elmasry, "Power-efficient
multiplier-accumulator design for FIR filters," in IEEE Canadian Conference on
Electrical and Computer Engineering, vol. 1, (St. John's, Newfoundland, Canada),
pp. 27-30, May 1997.
2. E. N. Farag and M. I. Elmasry, "Using carry save adders to reduce power
dissipation," in Eighth International Conference on Microelectronics, (Cairo,
Egypt), pp. 173-176, Dec. 1996.
3. E. N. Farag and M. I. Elmasry, "Low-power subband coding algorithm," in
IEEE International Conference of Acoustics, Speech and Signal Processing, 
vol. 4, (Atlanta, Georgia), pp. 2116-2119, May 1996. 
4. E. N. Farag, M. A. Hasan, and M. I. Elmasry, "Low-power radix 2 division
algorithm with minimum add/sub operations," in SPIE Advanced Signal
Processing Algorithms, Architectures and Implementations VI, vol. 2846,
(Denver, Colorado), pp. 39-50, Aug. 1996.
5. E. N. Farag and M. I. Elmasry, "Low-power implementation of discrete cosine
transform," in Sixth Great Lakes Symposium on VLSI, (Ames, Iowa),
pp. 174-177, Mar. 1996.
Accepted 
6. E. N. Farag, R.-H. Yan, and M. I. Elmasry, "A novel parallel architecture for
a switched-capacitor bandpass Σ-Δ modulator," in Midwest Symposium on
Circuits and Systems, (Sacramento, California), Aug. 1997.
7. E. N. Farag, R.-H. Yan, and M. I. Elmasry, "A programmable power-efficient
decimation filter for software radios," in International Symposium on Low
Power Electronics and Design, (Monterey, California), Aug. 1997.
8. E. N. Farag, R.-H. Yan, and M. I. Elmasry, "Decimation filters: Low-power
design and optimization," in IEEE Pacific Rim Conference on
Communications, Computers and Signal Processing, (Victoria, British Columbia,
Canada), Aug. 1997.
Submitted 
9. E. N. Farag, M. A. Hasan, and M. I. Elmasry, "Minimizing add/sub
operations in a radix 2 division algorithm," submitted to IEEE Transactions on
Computers.
10. E. N. Farag, R.-H. Yan, and M. I. Elmasry, "A parallel bandpass Σ-Δ
modulator: Architecture and performance," submitted to IEEE Transactions
on Circuits and Systems.
11. E. N. Farag, R.-H. Yan, and M. I. Elmasry, "Decimation filters for software
radios: A power efficient design," submitted to IEEE Transactions on Circuits
and Systems.
12. E. N. Farag, R.-H. Yan, and M. I. Elmasry, "A programmable power-efficient
multiplier accumulator array," submitted to IEEE Transactions on Circuits
and Systems.
13. E. N. Farag, R.-H. Yan, and M. I. Elmasry, "A novel digital channel selection
algorithm with no pre-filter multiplier," submitted to IEEE Transactions on
Acoustics, Speech and Signal Processing.
