Low power techniques and architectures for multicarrier wireless receivers by Hasan, Mohd.
Low Power Techniques and Architectures for 
Multicarrier Wireless Receivers 
Mohd. Hasan 
A thesis submitted for the degree of Doctor of Philosophy. 
The University of Edinburgh. 
August 2003 
Abstract 
Power consumption is a critical issue in portable wireless communication. Multicarrier code 
division multiple access (MC-CDMA) has a significant potential to be included as a standard 
in the next generation of mobile communication. This thesis investigates new low power
architectures for a MC-CDMA receiver. The FFT processor is one of the major power consuming
blocks in multicarrier systems based on orthogonal frequency division multiplexing (OFDM),
such as MC-CDMA and wireless LANs. Three low power schemes are presented for reducing
the power consumption in FFT processors, namely order based processing, coefficient memory
reduction and simplified coefficient addressing.
The order based processing scheme is based on a novel concept of using either the normal or 
two's complement form for only the real part of the coefficients selectively to minimise the 
Hamming distance between successive coefficients fed to the multipliers. This significantly 
reduces the switching activity at the coefficient input of the multiplier and hence the power 
consumption. The coefficient memory reduction scheme exploits the relationship among the 
coefficient values to reduce the coefficient memory size from N/2 locations to (N/8 + 1)
locations for an N-point FFT, thereby saving both area and power for long FFTs. The proposed 
coefficient addressing scheme implements the complete coefficient addressing for all stages of 
a radix-2 FFT processor by using a simple multiplexer instead of a cascade of barrel shifters.
Low power single butterfly radix-2 FFT processor and radix-4 ordered pipelined FFT processor
architectures based on the novel order based processing scheme are also proposed. 
The ordered low power radix-4 FFT processor is combined with the combiner to realise a low
power MC-CDMA receiver. The power consumption in a MC-CDMA receiver can be further 
reduced by introducing the concept of dynamically altering the complexity of the receiver in 
real time as per the changing channel parameters such as the delay spread, maximum Doppler 
frequency, transmission rate and signal to noise ratio instead of using a receiver designed for 
the worst case scenario. The FFT size in multicarrier systems like MC-CDMA varies from 
16-point to 1024-points depending upon the channel parameters. This thesis has proposed 
a reconfigurable 256-point FFT processor architecture that can be configured in real time to
act as a 64-point or 16-point FFT processor to prove the concept. The power reduction is
significant in moving from a fixed 256-point FFT to a reconfigurable 256-point FFT provided
that the FFT size varies over a large range, which is indeed the case for a MC-CDMA
receiver. This power reduction is achieved by using an appropriate (shorter) FFT size, obtained by
disabling the clocks of the higher stages in real time. A reconfigurable pipelined MC-CDMA 
receiver architecture is also proposed that can be configured in real time to process 256 or 64 
sub-carriers on the basis of the channel parameters. The power saving is obtained by disabling 
the first stage and the last ordering stage of the FFT processor and also by disabling the unused
equaliser memory in the Combiner by switching from 256 to 64 sub-carriers. 
An FIR filter is also an important block in wireless receivers. A number of novel low power FIR 
filter cores based on different low power algorithms and their hybrids are also presented.
Declaration of originality 
I hereby declare that the research recorded in this thesis and the thesis itself was composed and 




Acknowledgements
I would like to thank Dr Tughrul Arslan and Dr John Thompson for their excellent support and
guidance during the research period.
Grateful thanks to Dr Ahmet Erdogan for his help throughout the last three years. Thanks to 
Dr. Emad Al-susa, Mr. Robert Thompson, Dr. Nizamettin Aydin, Dr. Peter Hillman and all my 
lab colleagues for their help throughout my research work. 
Special thanks to the Association of Commonwealth Universities and the British Council for 
funding my research work. 
Heartfelt thanks to my wife for her patience, encouragement and support throughout my re-
search work. Thanks to my son Imaad for making my life so wonderful. 
A very special thanks to my parents who have guided and supported me throughout my life. 
Contents 
Declaration of originality
Acknowledgements
Contents
List of figures
List of tables
Acronyms and abbreviations
Nomenclature
1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Structure
1.4 Summary
2 Low power techniques and architectures for multicarrier wireless receivers
2.1 Introduction
2.2 Sources of power consumption in CMOS technology
2.3 General low power techniques for reducing the switched capacitance
2.3.1 Clock gating
2.3.2 Operation minimisation
2.3.3 Operation substitution
2.3.4 Input and constant coefficient ordering
2.3.5 Reducing glitching activity
2.3.6 Precomputation
2.3.7 Data representation
2.3.8 Bus encoding
2.3.9 Memory partitioning
2.3.10 State assignment
2.3.11 Scheduling and resource binding
2.3.12 Selection of appropriate gate level implementation
2.3.13 Technology decomposition and mapping
2.3.14 Wordlength reduction
2.3.15 Physical capacitance reduction
2.4 Low power techniques and architectures for FFT processor
2.4.1 Cache based architecture
2.4.2 Partial product ordering
2.4.3 Wordlength optimisation
2.4.4 Operation substitution
2.4.5 High radix architecture
2.4.6 Reduced precision redundancy
2.4.8 Data representation
2.5 Low power techniques and architectures for the FIR filter
2.5.1 Coefficient ordering
2.5.2 Coefficient segmentation
2.5.3 Block processing
2.5.4 Approximate processing
2.5.5 Multirate architectures
2.5.6 Coefficient scaling and optimisation
2.5.7 Filter realisation through differential coefficient
2.5.8 Reduced two's complement data representation
2.5.9 Sharing multiplication
2.6 Low power architecture of a MC-CDMA receiver
2.7 Power characteristics of commonly used multipliers
2.8 Summary
3 The Discrete and the fast Fourier transform
3.1 Overview of DFT
3.2 The Fast Fourier transform (FFT)
3.3 A unified approach to the FFT
3.3.1 Radix-4 decimation-in-time FFT algorithm
3.4 Summary
4 Low power schemes for FFT processor
4.1 Order based coefficient processing scheme
4.1.1 Results
4.2 Coefficient memory reduction scheme
4.2.1 Memory implementation
4.2.2 Design flow
4.2.3 Results
4.3 Coefficient addressing scheme
4.3.1 Detailed design
4.3.2 Results
4.4 Summary
5 Low power single butterfly FFT processor architecture
5.1 Conventional radix-2 single butterfly FFT processor architecture
5.1.1 Memory organisation in a single butterfly FFT processor
5.1.2 Architecture of the FFT processor
5.2 Low power radix-2 FFT processor architecture
5.2.1 Ordered butterfly module
5.2.2 Multiplication module
5.3 Results
5.4 Summary
6 Low power radix-4 pipelined FFT processor architecture
6.1 Need of pipelined FFT processor architectures
6.2 Conventional radix-4 pipelined FFT processor architecture
6.2.1 Bi and Jones algorithm for DFT decomposition
6.2.2 Hardware implementation
6.3 Low power ordered pipelined radix-4 FFT processor architecture
6.3.1 Order based processing of coefficients in a radix-4 pipelined FFT processor
6.3.2 Stage 1 Commutator design for a 16-point FFT Processor
6.3.3 Low power butterfly
6.3.4 Ordered complex multiplier
6.4 Results
6.5 Summary
7 Low power MC-CDMA receiver architecture
7.1 Overview of CDMA
7.2 Overview of MC-CDMA
7.3 Modeling of MC-CDMA transmitter and receiver in Matlab
7.4 MC-CDMA receiver architecture
7.4.1 Low power Combiner architecture
7.4.2 Results
7.5 Summary
8 Low power reconfigurable MC-CDMA receiver architecture
8.1 Motivation
8.2 Dependence of FFT size on channel parameters
8.3 Reconfigurable FFT processor architecture
8.4 Reconfigurable MC-CDMA receiver architecture
8.4.1 Architecture of the 256-point reconfigurable FFT processor used in the reconfigurable receiver
8.4.2 Reconfigurable Combiner architecture
8.5 Results
8.6 Summary
9 Low power FIR filter architectures
9.1 Overview of the direct form FIR filter
9.2 Conventional DF FIR filter architecture
9.3 Coefficient ordering based FIR filter architecture
9.4 Coefficient segmentation based FIR filter architecture
9.5 Block processing based FIR filter architecture
9.6 Combination of block processing and coefficient segmentation based filter architecture
9.7 Results
9.8 Summary
10 Summary and Conclusions
10.1 Introduction
10.2 Summary
10.3 Conclusions
10.4 Achievements
10.5 Future work
References
A Publications
A.1 Refereed Journals
A.2 Refereed Conferences
B C-code for the order based processing algorithm
C MATLAB code for the MC-CDMA transceiver
C.1 Main section of code
C.2 Function used to model the MC-CDMA transmitter
C.3 Function used to model the MC-CDMA receiver
D Verilog code for the MC-CDMA receiver
D.1 Verilog code for the FFT
D.2 Verilog code for the Combiner
List of figures 
3.1 Flow graph of an 8-point FFT calculated using two N/2-point DFTs
3.2 Flow graph of an 8-point radix-2 decimation-in-time FFT
3.3 Signal flow representation of a radix-2 decimation-in-time butterfly
3.4 Simplified representation of a radix-2 decimation-in-time butterfly
3.5 Simplified flow graph of an 8-point radix-2 decimation-in-time FFT algorithm
3.6 Row wise arrangement of data array
3.7 Column wise arrangement of data array
3.8 Flow graph of a 16-point radix-4 decimation-in-time FFT algorithm
3.9 Radix-4 decimation-in-time butterfly
4.1 Flow chart of the order based processing scheme
4.2 Signal flow graph of a 16-point FFT processor
4.3 Hardware implementation of Our coefficient memory reduction scheme
4.4 Flow chart depicting the design flow
4.5 Architecture of a single butterfly based radix-2 FFT processor. The coefficient address generator is enclosed by a dotted rectangle
4.6 Comparison of the two schemes in terms of power
4.7 Comparison of the two schemes in terms of area. The area is expressed in equivalent nand gates
4.8 Percentage reduction in power and area of Our scheme over Cohen scheme
5.1 Signal flow graph of a 16-point FFT. 'M' indicates the memory location
5.2 Parallel memory block organisation on the basis of parity bit
5.3 Conventional single butterfly radix-2 FFT processor architecture
5.4 Conventional butterfly architecture
5.5 Low power single butterfly ordered radix-2 FFT processor architecture
5.6 Low power butterfly ordered architecture
5.7 Architecture of the multiplication module for the low power butterfly
6.1 Signal flow graph of a radix-4 16-point FFT
6.2 Conventional N-point radix-4 pipelined FFT processor architecture
6.3 Commutator architecture for the conventional radix-4 pipelined FFT processor architecture
6.4 Timing diagram of commutator outputs for the first stage of a 16-point radix-4 pipelined FFT processor
6.5 (a) Timing diagram showing the input and output values of a FIFO of length equal to four, (b) The contents of the four location dual port RAM based FIFO, its Input and Output after every clock cycle for six consecutive clock cycles (+ means just after)
6.6 Commutator architecture for the DM based FIFO
6.7 Conventional butterfly architecture
6.8 Conventional complex multiplier
6.9 Signal flow graph of the ordered 16-point radix-4 pipelined FFT
6.10 Ordered 16-point radix-4 pipelined FFT processor architecture
6.11 Input and output data sequence of ADM
6.12 Timing diagram of normal and ordered commutator outputs for the first stage of a 16-point radix-4 pipelined FFT processor
6.13 First stage commutator architecture for the ordered 16-point radix-4 FFT processor
6.14 Low power butterfly architecture
6.15 Ordered complex multiplier
7.1 Illustration of the spreading principle with the help of (a) Data signal b(t), (b) CDMA code sequence c(t) and (c) Spread signal s(t), the processing gain is assumed as eight
7.2 (a) CDMA transmitter and (b) Power spectrum of the transmitted wideband CDMA signal
7.3 (a) Data signal, (b) CDMA code sequence, (c) Illustration of the one chip/carrier MC-CDMA scheme
7.4 Power spectrum of the MC-CDMA signal depicting eight sub-carriers with their overlapping spectra
7.5 MC-CDMA transmitter
7.6 MC-CDMA receiver
7.7 MC-CDMA transmission frame assumed for 64 sub-carriers
7.8 Estimation and demodulation phases in a MC-CDMA receiver
7.9 Block diagram of the conventional MC-CDMA receiver
7.10 Block diagram of our low power MC-CDMA receiver
7.11 Combiner architecture
7.12 Multiplication and accumulation module
7.13 Division module
7.14 Layout of the Conventional MC-CDMA receiver core, Power = 148.23 mW, Area = 2.22 mm²
7.15 Layout of Our low power MC-CDMA receiver core, Power = 129.55 mW, Area = 2.27 mm²
8.1 MC-CDMA receiver
8.2 Optimum number of sub-carriers as a function of normalised delay spread for different maximum Doppler frequencies
8.3 Architecture of the radix-4 256-point reconfigurable FFT processor
8.4 Architecture of a basic stage of a radix-4 256-point reconfigurable FFT processor
8.5 Block diagram of RECEIVER-I
8.6 Block diagram of RECEIVER-II
8.7 Architecture of the 256-point reconfigurable FFT processor used in the reconfigurable 256 sub-carrier MC-CDMA receiver
8.8 Architecture of the reordering stage in a pipelined FFT processor
8.9 Architecture of the reconfigurable Combiner
8.10 MC-CDMA frame assumed for 256/64 sub-carriers
8.11 Finite state machines module for the reconfigurable receiver
	
9.1 Direct form FIR filter structure
9.2 Generic direct form FIR core
9.3 Conventional arithmetic unit
9.4 Coefficient ordering based FIR filter architecture
9.5 Arithmetic unit for coefficient segmentation
9.6 Illustration of block processing with a block size of two
9.7 Arithmetic unit for block processing
9.8 Arithmetic unit for the combination
9.9 Distribution of cell power for the overall FIR core
9.10 Distribution of cell power for the AUs
List of tables 
2.1 Power consumption analysis for different multipliers
4.1 Listing of the ordered coefficient sets obtained by different order based processing schemes for a 32-point FFT processor as an example
4.2 Switching activity comparison of different schemes for different lengths of the radix-2 FFT processor
4.3 Switching activity comparison of different schemes for different wordlengths for a 128-point radix-2 FFT processor. The coefficient set is obtained by rounding to the nearest integer
4.4 Switching activity comparison of different schemes for different wordlengths for a 128-point radix-2 FFT processor. The coefficient set is obtained by rounding up to the nearest integer
4.5 Description of the various memory organisation schemes
4.6 Comparison of the various schemes in terms of power
4.7 Comparison of the various schemes in terms of area. The area is expressed in equivalent nand gates[n]
4.8 Power and area saving of Our scheme as compared to Others scheme for different coefficient wordlengths for a 2K-point radix-2 FFT processor
5.1 Power consumption comparison of FFT cores
5.2 Power consumption of the different cells along with net switching power of a 64-point FFT core
6.1 Comparison of the different pipelined FFT architectures. * indicates digit serial
6.2 Control signals for the different values of rnj (0=addition, 1=subtraction)
6.3 Ordered and conventional coefficient sequences for a 16-point radix-4 FFT
6.4 Contents of ROM for generating addressing and control signals for the commutator
6.5 Comparative power consumption for the Ordered and conventional FFT processors
6.6 Comparative major cells power consumption for the Ordered and conventional FFT processor for the 64-point FFT processor with csa multiplier
7.1 Power consumption comparison of the MC-CDMA receivers for different multiplier types
7.2 Power consumption comparison of the major blocks of the MC-CDMA receiver for csa multiplier
7.3 Power consumption in the various blocks of the combiner for csa multiplier
7.4 Area comparison of the MC-CDMA receiver for csa multiplier
8.1 Power comparison between the fixed and the reconfigurable FFT processor architectures
8.2 Aggregate Energy saving of RFFT-I and RFFT-II architectures with respect to a fixed 256-point FFT processor for three data sets corresponding to three different FFT size requirements over three different time durations
8.3 Power consumed by the major blocks of the fixed and reconfigurable 256-point FFT processors
8.4 Power comparison between the fixed and the reconfigurable receiver architectures
8.5 Power consumed by the major blocks of the fixed and the reconfigurable receivers
9.1 FIR Cell power
9.2 Arithmetic unit Cell power
9.3 Power consumption analysis for different multipliers
9.4 Power consumption analysis for FIR cores
9.5 Area comparison for FIR cores
9.6 Delay analysis of AU's using different multipliers
Acronyms and abbreviations 
ACC Accumulator 
ADC Analog to digital converter 
ADD Adder 
ADM Additional dual port random access memory 
ASIC Application specific integrated circuit 
AU Arithmetic unit 
BP Block processing 
BPSK Binary phase shift keying 
CA Coefficient address 
CDMA Code division multiple access 
CLACC Clearing accumulator logic 
CMOS Complementary metal oxide semiconductor technology 
COMB Combination of coefficient segmentation and Block processing 
COMP Inversion 
CONV Conventional 
CSEG Coefficient segmentation 
CSA Carry save array multiplier 
DAC Digital to analog converter 
DF Direct form 
DFT Discrete Fourier transform 
DM Dual port random access memory 
DS-CDMA Direct sequence code division multiple access 
DSP Digital signal processing 
FDMA Frequency division multiple access 
FFT Fast Fourier transform 
FIFO First in first out memory 
FIR Finite impulse response filter 
FSM Finite state machine 
IFFT Inverse fast Fourier transform
I/O Input-output 
ISI Intersymbol interference 
LAN Local area network 
LNS Logarithmic number system 
LSB Least significant bit 
LUT Look-up-table 
MC-CDMA Multicarrier code division multiple access 
MMSE Minimum mean square error 
MSB Most significant bit 
NBW Non-Booth coded Wallace tree multiplier 
OFDM Orthogonal frequency division multiplexing 
ORD Ordering block in a reconfigurable MC-CDMA receiver 
ORDER Coefficient ordering 
RAM Random access memory 
RAME Even parity memory bank 
RAMO Odd parity memory bank 
RF Radio frequency 
RISC Reduced instruction set computer 
ROM Read only memory 
RTL Register transfer language 
SAIF Switching activity interchange format 
SDF Standard delay format 
SNR Signal to noise ratio 
SOC Silicon on chip 
SR Shift register 
SUB Subtractor 
SUM Summer 
TCMP Two's complementer module 
TDF Transpose direct form 
TDMA Time division multiple access 
TM Triple port random access memory 
VLSI Very large scale integration 
WALL Booth-coded Wallace tree multiplier 
Nomenclature 
clk 	Clock signal 
C_load 	Load capacitance
f 	Frequency of operation 
fd 	Maximum Doppler frequency 
A 	Number of active users dependent parameter 
Micron 
N 	FFT size in points 
Nf 	FIR filter order 
Nfd Normalised maximum Doppler frequency 
P 	Processing gain 
PSL 	Switching power consumption 
R 	Transmission rate 
S 	Switching activity 
Delay spread 
Complement 
x(n) Input sampled sequence 
X(k) DFT of the input sequence x(n) 





1.1 Motivation
The design of portable devices requires critical consideration of the time averaged power
consumption, which is directly proportional to the battery weight and volume required to operate
circuits for a given amount of time. The battery life depends on both the power consumption
of the system and the battery capacity. Battery technology has improved considerably
with the advent of portable systems but it is not expected to offer significant advances in the
near future [1, 2]. Most portable applications demand high speed computation, complex
functionality and often real-time processing capabilities with low power consumption. The
explosive growth of portable wireless devices such as cellular phones, pagers, wireless modems
and laptops, along with the limitations of battery technology, has made power consumption
one of the most critical design requirements [3].
Moreover, there is a strong need to reduce the power consumption in high performance
microprocessors to limit the cost of packaging and cooling [4-6]. Also, high power
systems are more prone to several silicon failure mechanisms. Every 10 °C rise in operating
temperature roughly doubles a component's failure rate [7]. Hence, power consumption has
now become an important design criterion, just like speed and silicon area.
Multicarrier systems perform much better than single carrier systems in the hostile wireless 
environment [8]. MC-CDMA, a combination of OFDM and CDMA, has a lot of potential to 
be included as a standard in the future generations of mobile communications [9] and therefore 
power consumption is an important issue in multicarrier systems. This thesis is motivated by 
the desire to reduce the power consumption of a MC-CDMA receiver. 
1.2 Contribution 
This thesis investigates low power techniques and architectures for the important signal processing
and telecommunication blocks used in MC-CDMA and other multicarrier based receivers,
such as the FFT processors, Combiner and FIR filters, by reducing their switched capacitances.
It also introduces a novel concept of dynamically altering the hardware complexity of a MC-
CDMA receiver in real time as per the channel parameters instead of using a fixed receiver 
designed for the worst case channel conditions. 
The FFT processor is one of the most critical power consuming blocks in all multicarrier based 
wireless communication systems [10]. This thesis proposes both single butterfly and pipelined 
low power FFT processor architectures. The novel low power architectures are based on a 
new order based processing scheme. The conventional order based processing scheme is based
on ordering the coefficients so as to minimise the Hamming distance (bit transitions) between
successive coefficients fed to the multiplier [11, 12]. This reduces the switching
activity at the coefficient input of the power consuming multiplier and hence its power
consumption. It has been demonstrated that the conventional order based processing scheme does
not yield much power saving in the case of an FFT processor, because the reduction in switching
activity is small and is offset by the hardware overhead required to realise the scheme.
The modified order based processing scheme is also based on Hamming distance minimisation 
between successive coefficients fed to the butterfly with the difference that either the real part 
of the coefficient or its two's complemented value is used for the minimisation of switching ac-
tivity. It has been demonstrated that this procedure results in significant reduction in switching 
activity of the order of 53% compared to only 27% with conventional ordering approach in case 
of FFT processors. This reduction in switching activity leads to power savings in the range of 
25% to 1% for 16-point to 512-point FFT processor cores respectively. 
The order based processing scheme has been extended to the domain of radix-4 pipelined FFT 
processor architecture [13]. It has been possible to apply order based processing to the first 
stage of a 16-point or second stage of a 64-point radix-4 FFT processor. The results have 
shown around 14-37% power reduction as compared to the conventional architecture for the 
16-point radix-4 FFT processor depending upon the multiplier type. This technique is limited 
only to the first stage of a 16-point FFT to contain the hardware overhead in the form of an 
additional memory required for restoring the data order after every stage. The power saving is 
around 4-30% for the 64-point FFT processor depending upon the type of multiplier and the 
FIFO implementation. 
This thesis also proposes a coefficient memory reduction scheme. According to this scheme, 
the size of the ROM required to store the N/2 coefficients of an N-point FFT can be reduced to
N/8+1 locations rather than N/4 [14]. This reduction in memory leads to both power and area
savings for long FFTs.
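One way such a reduction can arise is from the octant symmetry of the twiddle factors: every coefficient W_N^k = exp(-j2πk/N) for 0 ≤ k < N/2 can be rebuilt from cosine and sine values in the first octant by swapping and negating parts. The Python sketch below illustrates that symmetry only; it is an assumption about the underlying idea rather than the scheme of [14], and the names `build_table` and `twiddle` are hypothetical. N is assumed to be a multiple of 8.

```python
import math

def build_table(n):
    """Store (cos, sin) of 2*pi*m/n for m = 0 .. n/8, i.e. n/8 + 1 locations."""
    return [(math.cos(2 * math.pi * m / n), math.sin(2 * math.pi * m / n))
            for m in range(n // 8 + 1)]

def twiddle(k, n, table):
    """Rebuild W_N^k = exp(-j*2*pi*k/N) for 0 <= k < N/2 from the reduced table."""
    if k <= n // 8:                 # first octant: read directly
        c, s = table[k]
        return complex(c, -s)
    if k <= n // 4:                 # second octant: swap cos and sin
        c, s = table[n // 4 - k]
        return complex(s, -c)
    if k <= 3 * n // 8:             # third octant: swap and negate the real part
        c, s = table[k - n // 4]
        return complex(-s, -c)
    c, s = table[n // 2 - k]        # fourth octant: negate the real part
    return complex(-c, -s)

n = 64
table = build_table(n)
max_error = max(
    abs(twiddle(k, n, table)
        - complex(math.cos(2 * math.pi * k / n), -math.sin(2 * math.pi * k / n)))
    for k in range(n // 2)
)
print(len(table), max_error)
```

In hardware the swap and negation would be a small amount of multiplexing and sign logic on the ROM output, which is the overhead traded against the smaller memory.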
This thesis also presents a novel coefficient addressing scheme for the single butterfly radix-2 
FFT processor. The coefficient addressing scheme is based on a single multiplexer rather than 
a cascade of Barrel shifters [15]. This saves both area and power. 
This thesis also proposes a low power MC-CDMA receiver by combining a low power FFT 
processor architecture with a low power Combiner for binary phase shift keying (BPSK). 
In MC-CDMA, the FFT size directly depends upon the channel parameters like the delay 
spread, maximum Doppler frequency and the transmission rate [16]. The FFT size varies from 
16-point to 1024-point depending upon the channel parameters, which in turn depend on the 
location of the user. Two reconfigurable 256 point radix-4 pipelined FFT processor architec-
tures namely RFFT-I and RFFT-II are proposed which can also be dynamically configured to 
act as a 16-point or 64-point FFT processor depending upon the channel parameters to prove 
the concept. The difference between the two architectures lies in the extent of clock gating to 
disable the unused modules. This approach will significantly reduce the power consumption 
because only the optimum FFT size is used at all times instead of a large fixed FFT for the 
worst case channel conditions. 
Two reconfigurable MC-CDMA receiver architectures namely RECEIVER-I and RECEIVER-II 
are also proposed by combining the respective 256-point reconfigurable pipelined FFT proces-
sor architectures with the reconfigurable Combiner. The receiver can be reconfigured in real 
time to operate either on 256 or 64 sub-carriers depending upon the channel parameters. This 
reconfigurability leads to power reduction in the receiver by switching from 256 to 64 sub-carriers
as compared to a fixed 256 sub-carrier receiver. This reduction is achieved by shutting 
down the first stage and partly disabling the last ordering stage of the 256-point reconfigurable 
FFT processor, and also the unused RAM for storing the equaliser coefficients in the Combiner
for 64 sub-carriers in RECEIVER-I. The shutdown of the unused modules is partial in 
RECEIVER-II.
The FIR filter is also an important block in wireless receivers and other telecommunication 
systems. The power consumption in case of direct form FIR filters can be reduced by applying
techniques such as coefficient ordering, coefficient segmentation and block processing. 
Most of the work in the literature deals with the effect of these techniques on the power con-
suming multiplier block of the FIR filter [12, 17, 18]. Each of these techniques has hardware 
overhead and therefore it is very important to know the power reduction obtained by these tech-
niques on the overall FIR filter. This work also proposes low power architectures for the direct 
form FIR filter based on the previously mentioned techniques and their combination. Each 
architecture has been investigated for the different types of multiplier. It has been found that 
the combination of block processing and coefficient segmentation algorithms yields the best power 
reduction results at a slightly higher expense in area. Any low power architecture can be chosen 
depending upon the power-area budget. 
1.3 Structure 
The structure of this thesis is as follows: 
• Chapter 2 presents a review of the research work in the area of low power techniques and 
architectures for the multicarrier based wireless receiver. 
• Chapter 3 presents an overview of the Discrete Fourier transform and the fast Fourier 
transform. It mainly covers radix-2 and radix-4 FFT algorithms. 
• Chapter 4 presents new techniques for reducing the power consumption in an FFT pro-
cessor. A novel form of order based processing, tailored for FFT processors, is proposed. 
A new technique to reduce the coefficient memory from N/4 locations to N/8+1 loca-
tions for an N-point FFT processor is also described. The chapter concludes with the 
presentation of a new circuit for coefficient address generation. 
• Chapter 5 describes a low power architecture of a single butterfly FFT processor based 
on a novel order based processing algorithm. Power results are presented to demonstrate 
the effectiveness of the technique. 
• Chapter 6 presents a low power pipelined radix-4 FFT processor based on a novel order 
based processing algorithm. The first stage of a 16-point radix-4 FFT processor is mod-
ified to incorporate the corresponding data ordering as per the new coefficient ordering. 
Power results are also given to demonstrate the effectiveness of the scheme. 
• Chapter 7 presents a pipelined architecture for a MC-CDMA receiver. This architecture 
is based on a low power 64-point radix-4 pipelined FFT processor. The combiner module 
architecture is also tailored for low power by disabling the unused hardware blocks. 
• Chapter 8 describes reconfigurable architectures for both the pipelined FFT processor 
and the pipelined MC-CDMA receiver. In multicarrier wireless systems, FFT size varies 
from 16-point to 1024-point depending upon the channel parameters. Two reconfigurable 
256-point FFT processor architectures that can be configured as a 64-point or a 16-point 
are proposed to prove the concept. The difference between the two architectures lies in 
the extent of applied clock gating. Two reconfigurable MC-CDMA receivers are also pro-
posed which can be configured to act either as a 256 sub-carrier system or a 64 sub-carrier 
system depending upon the channel parameters in real time. The difference between the 
two receivers again lies in the extent of clock gating. Power saving results, in switching 
from 256 to 64 sub-carriers as compared to a fixed 256 sub-carrier receiver, for both the 
architectures are presented. Similarly, power saving results in switching from 256-point 
FFT processor to a 64-point or a 16-point FFT processor for both the FFT architectures 
with respect to a fixed 256-point FFT processor, are also given. 
• Chapter 9 presents low power FIR filter architectures for the different low power algo-
rithms namely coefficient ordering, coefficient segmentation, block processing and their 
combination. The post layout power and area results are presented for the whole FIR 
filter. 
• Chapter 10 presents conclusions and a summary of the thesis and suggests topics for 
future research. 
• Appendix A lists the publications arising from this thesis. 
• Appendix B lists the C code for the order based processing scheme. 
• Appendix C lists the MATLAB code for the MC-CDMA transceiver. 
• Appendix D lists the Verilog code for the 64 sub-carrier MC-CDMA receiver. 
1.4 Summary 
Low power design is a crucial research area not only for portable systems but also for high per- 
formance systems like microprocessors. Multicarrier systems are very attractive for all portable 
wireless applications. This thesis presents techniques and architectures to reduce the power 
consumption in the blocks of a MC-CDMA receiver like the FFT, Combiner and FIR filter as 
well as for the whole receiver. The next chapter describes the existing low power techniques 
and architectures for the multicarrier based receivers. 
Chapter 2 
Low power techniques and 
architectures for multicarrier wireless 
receivers 
2.1 Introduction 
This chapter starts with the identification of the sources of power consumption in CMOS tech-
nology. This is followed by the review of the general algorithmic and architectural level low 
power techniques for reducing the switched capacitance. Then the low power techniques and 
architectures targeted for the basic blocks of the multicarrier based wireless receiver namely 
the FFT and the FIR filter are described. After that, a low power technique for reducing the 
power consumption of the Combiner block and the whole MC-CDMA receiver is presented. In 
the end, the power characteristics of the three commonly used multiplier types for different low 
power algorithms for a 73-tap FIR filter are presented. 
2.2 Sources of power consumption in CMOS technology 
There are three significant sources of power consumption in CMOS circuits: switching power 
$P_{sw}$, short-circuit power and leakage power [5, 19]. $P_{sw}$ is the power consumed in charging 
and discharging the load capacitance $C_{load}$. The switching power, given by the following expression,
accounts for around 80% [20] of the total power consumption in CMOS. 
$$P_{sw} = \frac{1}{2} S_w C_{load} V_{dd}^2 f$$
where $V_{dd}$ is the supply voltage, $f$ is the clock frequency, $C_{load}$ is the load capacitance of the 
gate and $S_w$ is the switching activity factor which is defined as the average number of times that 
the gate makes a logic transition ($1 \to 0$ or $0 \to 1$) in each clock cycle. The product $S_w C_{load}$ 
is defined as the switched capacitance. 
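As a small numerical illustration of the expression, the Python fragment below evaluates the switching power of a single node and shows the linear dependence on switching activity and the quadratic dependence on supply voltage. The component values (50 fF load, 3.3 V supply, 20 MHz clock, activity 0.5) are hypothetical and chosen only for the example.

```python
def switching_power(s_w, c_load, v_dd, f_clk):
    """Return 0.5 * S_w * C_load * V_dd^2 * f in watts."""
    return 0.5 * s_w * c_load * v_dd ** 2 * f_clk

# Hypothetical node: S_w = 0.5, C_load = 50 fF, V_dd = 3.3 V, f = 20 MHz
p_ref = switching_power(0.5, 50e-15, 3.3, 20e6)            # about 2.7 microwatts
p_low_activity = switching_power(0.25, 50e-15, 3.3, 20e6)  # half the activity
p_low_voltage = switching_power(0.5, 50e-15, 1.65, 20e6)   # half the supply voltage

print(p_ref, p_low_activity / p_ref, p_low_voltage / p_ref)
```

Halving the switched capacitance halves the power, whereas halving the supply voltage reduces it to a quarter.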
The short-circuit power is due to the conduction path established between the supply rails on 
account of finite signal rise/fall times. The leakage power [21] is due to the flow of subthreshold
and reverse-biased diode currents. This work deals with the reduction of the dominant 
switching power component of the total power consumption in CMOS. 
The switching power is a strong function of the supply voltage and therefore any technique 
of scaling down the supply voltage significantly brings down the power consumption. The 
switching power can also be reduced by decreasing the switched capacitance. The switched 
capacitance can be reduced by lowering the load capacitance $C_{load}$, the switching 
activity $S_w$, or both. The switched capacitance based power reduction has been studied at various levels 
of the design abstraction starting from algorithmic level down to the technology level [5, 19]. 
Among these levels, power consumption can be significantly reduced at the algorithmic and 
architectural level [22] especially in memory and mathematically intensive digital signal pro-
cessing algorithms by reducing the switched capacitance. 
This thesis explores techniques at algorithmic and architectural levels for reducing the switched 
capacitance thereby saving power in multicarrier based wireless receivers. 
2.3 General low power techniques for reducing the switched 
capacitance 
The major power consumption in CMOS occurs only during switching and therefore the switch-
ing activity has to be reduced to the minimal level required to perform the computation. The 
following sections explore the system level approaches to minimise the switched capacitance 
at the algorithmic, architectural, logic and physical design levels. 
2.3.1 Clock gating 
Clock switching has a considerable impact on power. It can be as much as twice the logic 
power [19]. Clock gating is used to disable unused modules of the system. It saves power by 
both preventing unnecessary switching activity in the logic modules as well as by eliminating 
power dissipation in the clock distribution network [23,24]. Clock gating must be carefully 
applied by taking into account the physical design aspect as well. The optimal solution points 
to a partial approach in which power saving due to gating is balanced by efficient wiring. 
2.3.2 Operation minimisation 
The switched capacitance can be effectively reduced by minimising the number of operations. 
This is accomplished by transforming the algorithm such that it can be realised by using less 
hardware in terms of multiplier, memory and other power consuming modules [25]. 
2.3.3 Operation substitution 
The multiplication with constants is a commonly used operation in most of the signal processing 
algorithms. This multiplication can also be realised by using shift-add operation. The power 
consumed by the execution unit of an 11-tap FIR filter is reduced to one-eighth of the original 
power by replacing multiplication by shift and add operations [25]. The shifter and adder 
consume much less power as compared to the multiplier. 
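The substitution can be illustrated with a short Python sketch that multiplies a sample by a constant using only shifts and additions, one addition per non-zero bit of the constant. The function name `shift_add_multiply` is hypothetical, and the constant is assumed to be a non-negative integer.

```python
def shift_add_multiply(x, coeff):
    """Multiply integer x by a non-negative integer constant using shifts and adds."""
    acc = 0
    shift = 0
    while coeff:
        if coeff & 1:            # this bit of the constant is set
            acc += x << shift    # add the correspondingly shifted input
        coeff >>= 1
        shift += 1
    return acc

# 10 = 0b1010, so x * 10 = (x << 3) + (x << 1): two shifted terms, one adder
print(shift_add_multiply(13, 10))
```

In a fixed-coefficient datapath the shifts are hard-wired, so only the adders remain, which is why constants with few non-zero bits are cheapest.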
The multiplication can also be performed by a ROM, shift registers and adders using Distributed 
arithmetic [26,27]. This concept is quite attractive for low throughput applications. 
2.3.4 Input and constant coefficient ordering 
The switched capacitance can also be reduced by reordering the inputs in a chain of operations 
such that the higher activity inputs enter the chain at a later stage [25]. 
The multiplication with a constant coefficient is a common operation in signal processing algo-
rithms. It is possible to save around 15% power by carrying out the multiplication after ordering 
the coefficients as per Hamming distance in single functional unit architectures [11]. 
2.3.5 Reducing glitching activity 
Glitching refers to spurious transitions due to finite propagation delays from one logic block 
to the next [5]. It arises when paths with unbalanced propagation delays converge at the same 
point in the circuit. A node can undergo multiple power consuming transitions in a single clock 
cycle before settling to the correct logic level. An average of 15-20% of the total power is 
consumed in glitching [3]. It can be minimised by balancing all signal paths along with the 
reduction of logic depth [5]. A retiming approach for glitch reduction was proposed in [28, 29]
by placing registers to compensate for different path delays. Several other techniques like 
restructuring multiplexer networks and clocking control signals have also been proposed [30]. 
2.3.6 Precomputation 
Precomputation is based on selectively precomputing the output logic values of the circuit one clock cycle 
before they are required, and using the precomputed values to reduce the internal switching 
activity in the succeeding clock cycle [31, 32]. The precomputation logic determines the output 
values for a subset of input conditions. The original circuit can be turned off in the subse-
quent clock cycle resulting in reduced switching activity. The size of the precomputation logic 
determines the power dissipation reduction and area increase relative to the original circuit. 
2.3.7 Data representation 
The choice of data representation also influences the switching activity. In two's complement 
representation, the signal transition from positive to negative or vice-versa causes the MSB 
sign-bits to switch resulting in high switching activity. The switching activity increases signif-
icantly when the signals being processed switch frequently around zero and when they do not 
utilise the entire bit-width. On the other hand, in sign magnitude representation, the switching 
around zero results in slight increase in switching activity due to only sign bit toggling [5]. 
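The effect can be seen by counting bit toggles for a short sequence of small samples that alternates in sign. The Python sketch below is only an illustration; the 16-bit width and the sample values are assumptions.

```python
def twos_complement(value, width):
    """Encode a signed integer as a width-bit two's complement word."""
    return value & ((1 << width) - 1)

def sign_magnitude(value, width):
    """Encode a signed integer as a sign bit followed by its magnitude."""
    sign = 1 << (width - 1) if value < 0 else 0
    return sign | abs(value)

def toggles(words):
    """Total number of bit transitions between successive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

samples = [1, -1, 2, -2, 1, -1]   # small values switching around zero
width = 16
tc = toggles([twos_complement(v, width) for v in samples])
sm = toggles([sign_magnitude(v, width) for v in samples])
print(tc, sm)
```

For this sequence the two's complement words toggle 75 bits in total while the sign magnitude words toggle 9, because sign extension flips nearly every upper bit at each sign change.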
In [33], floating point and logarithmic number systems are compared with fixed point in terms of 
accuracy and switched capacitance for a speech coding application. For little or no increase in 
distortion, capacitance reduction factors of two to three are achieved by trading precision for 
dynamic range. 
2.3.8 Bus encoding 
Gray coding is normally employed when the data to be transmitted on the bus is sequential and 
highly correlated. Gray-coded instruction addressing results in around 30-50% reduction in 
switching activity in a RISC processor compared to normal binary coded addressing [34, 35]. 
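The benefit for sequential addresses can be checked with a short Python sketch that counts the bit transitions for an 8-bit address counter in binary and in Gray code. The purely sequential access pattern is an assumption; real instruction streams contain branches, so the saving is smaller in practice.

```python
def gray(n):
    """Convert a binary number to its reflected Gray code."""
    return n ^ (n >> 1)

def toggles(words):
    """Total number of bit transitions between successive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

addresses = list(range(256))                       # purely sequential 8-bit addresses
binary_toggles = toggles(addresses)
gray_toggles = toggles([gray(a) for a in addresses])
print(binary_toggles, gray_toggles)
```

Successive Gray code words differ in exactly one bit, so the 255 address increments cost 255 transitions, compared with 502 for plain binary counting.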
Bus-invert coding reduces I/O bus activity leading to reduction in I/O peak power by 50% and 
average I/O power consumption by 25% [36]. It uses an extra control bit called invert. This 
coding scheme is realised by first computing the Hamming distance between the present bus 
value and next data value. If the Hamming distance is greater than half the number of bits, set 
invert to high and make the next bus value equal to the inverted next data value. Otherwise, the 
invert bit is set to low and the next bus value equals the next data value. The receiver extracts 
the correct bus value by conditionally inverting the bus value according to the invert bit. 
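The encoding and decoding steps described above can be sketched in Python as follows. The function names are hypothetical, and the bus is assumed to start at the all-zero value.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def bus_invert_encode(data, width):
    """Return a list of (bus_value, invert_bit) pairs for the data words."""
    mask = (1 << width) - 1
    bus = 0                                   # assumed initial bus state
    encoded = []
    for word in data:
        if 2 * hamming(bus, word) > width:    # more than half the bits would toggle
            bus, invert = (~word) & mask, 1   # send the inverted word
        else:
            bus, invert = word, 0             # send the word unchanged
        encoded.append((bus, invert))
    return encoded

def bus_invert_decode(encoded, width):
    """Recover the data words by conditionally inverting the bus values."""
    mask = (1 << width) - 1
    return [(~bus) & mask if invert else bus for bus, invert in encoded]

data = [0x00, 0xFF, 0x0F, 0xF0, 0xAA]
encoded = bus_invert_encode(data, 8)
decoded = bus_invert_decode(encoded, 8)
print(encoded, decoded == data)
```

With this rule no transfer toggles more than half of the data lines, which is the source of the peak power reduction.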
2.3.9 Memory partitioning 
In many embedded applications, highly accessed locations can fit into a relatively small mem-
ory space. The memory is partitioned such that the highly accessed locations are mapped to a 
small memory [34,37-39]. The average power in accessing the memory is decreased because 
a large fraction of accesses is concentrated on a small power efficient memory. Moreover, other 
memory banks that are not accessed in a given cycle are disabled through their chip select 
inputs. 
2.3.10 State assignment 
A state assignment procedure has been presented in [40] for reducing the switching activity of 
the state variables by minimising the number of bit changes during state transitions. The proba-
bility of state transitions in a FSM is calculated from the given input switching probability. This 
information is then used to find out an encoding that minimises the switching probability of the 
state variables. The authors in [41] have addressed both FSM and combinational logic synthesis 
problems to minimise the average number of transitions. The synthesis process comprises 
two parts: state assignment, which determines the combinational logic function and multilevel 
optimization of the combinational logic, which tries to minimise area while at the same time 
trying to reduce the circuit activity at the internal nodes of the circuit. 
2.3.11 Scheduling and resource binding 
Scheduling and resource binding algorithms are proposed in [42] to minimise the number of 
transitions on the signals feeding the functional units (adders, multipliers etc.) and registers, 
which effectively minimises the switched capacitance. This is achieved by scheduling the can-
didate nodes in control steps as close as possible and binding them to the same resource. The 
candidate nodes are selected such that there is no change of values in the input operands be-
tween consecutive operations of the same functional unit. 
2.3.12 Selection of appropriate gate level implementation 
The switching activity also depends upon the gate level topology of different implementations 
for the same function. In [43], six different adders are constructed with inverters and two 
to four-input AND and OR gates. Extensive simulations are used to evaluate their switching 
characteristics and the results of those simulations are used to rank the adders on speed, size 
and the number of logic transitions. This approach is adopted for the different multipliers 
as well [44,45]. Hence, it is important to choose the appropriate gate level implementation 
depending upon speed, power and area. 
2.3.13 Technology decomposition and mapping 
For the same gate-level implementation, there are many possible circuit level implementations 
and each of these implementations has a different switching activity. Therefore, technology 
decomposition and mapping techniques have been introduced in the literature for minimising 
switching activities of such networks [46,47]. According to this concept, a given Boolean 
network is decomposed such that the switching activity of the network is minimised. Further 
power reductions are then achieved by hiding high activity nodes inside more complex CMOS 
gates. Complex gates tend to exhibit an overall lower capacitance since more signals are con-
fined to internal nodes rather than to the more heavily loaded output nodes. This results in the 
mapping of signals with high switching activities to low capacitance internal nodes. 
2.3.14 Wordlength reduction 
The number of bits used strongly affects all key parameters of a design including speed, area 
and power [25, 48]. It is desirable to minimise the number of bits during power optimisation 
because fewer bits result in fewer switching events and therefore lower switched capacitance. 
Moreover, fewer bits not only reduce the number of transfer lines, but also decrease the average 
interconnect length and capacitance. 
2.3.15 Physical capacitance reduction 
The switching power depends linearly on the physical capacitance being switched. The physi- 
cal capacitance at the output of a CMOS gate is the sum of the input capacitances of N driven 
gates, parasitic output capacitance of the driver gate and the interconnect capacitance. The par-
asitic capacitances in CMOS technology are of either parallel plate type or voltage dependent 
diode junction type. These capacitances can be minimised by using less logic, smaller devices 
and shorter wires [3]. The area of the circuit can be reduced by logic minimisation and gate 
re-sizing [49]. The interconnect capacitance can be reduced by appropriate placement [50], 
partitioning [51] and wire sizing [52]. 
2.4 Low power techniques and architectures for FFT processor 
Here is a brief description of the low power techniques and architectures for the FFT processor 
which is one of the most power consuming blocks of a multicarrier based wireless receiver. 
2.4.1 Cache based architecture 
The basic FFT has poor locality because each of the N outputs of an N-point FFT depends 
on each of the N inputs. The author in [53, 54] has proposed a low power cache based 1024-point
FFT processor built on an FFT algorithm that offers good locality over 
large portions of the computation. The cache based architecture is similar to the single memory 
architecture except that a small cache memory resides between the processor and the main 
memory. The cache memory architecture is energy efficient because small memories require 
lower energy per access. 
2.4.2 Partial product ordering 
The aim of reordering is the minimization of the sum of Hamming distances between succes-
sive pairs of partial products on the busses connecting the data and coefficient storage elements 
to the functional units. The authors in [55-57] have proposed a general technique to reorder the 
sequence of evaluation of partial products that constitute the basic computation. The Hamming 
distance between the coefficient part of the partial product can be directly evaluated since coeffi-
cients are known before realisation. The evaluation of the exact Hamming distance between two 
data samples is impossible since data are not known before run-time. An approximate average 
Hamming distance between the data terms of the partial product is determined by simulation 
of typical data as per the application. In [55, 57], the authors have shown that the coefficient 
switching activity is reduced by 19-25% but the data switching activity varies slightly in both 
directions for the different FFTs after reordering. This indicates that data ordering is not useful. 
2.4.3 Wordlength optimisation 
The size of the pipelined FFT processor is dominated by the FIFOs. It is, therefore, important to 
minimize the sizes of the FIFOs. Since the FIFO size is large in the early stages, it is desirable 
to have as short a wordlength as possible in the early stages. The authors in [58] have proposed 
a wordlength optimisation scheme for the pipelined FFT processor. The optimisation of the 
wordlength depends on the desired value of signal-to-noise ratio (SNR) for a given input and 
output wordlengths. The desired SNR value is achieved by altering the internal wordlengths. 
The authors have shown that increasing the wordlength progressively from the input wordlength 
of 12 bits to 16-bits towards the output before rounding it back to 12 bits for the output, gives 
much better performance in terms of SNR and memory size than a fixed wordlength system. 
It is possible to achieve 10dB increase in SNR with less than 5% increase in memory size by 
using progressive wordlength instead of fixed wordlength of 12 bits. 
2.4.4 Operation substitution 
The authors in [59, 60] have proposed an FFT algorithm in which non-trivial multiplication by 
$\sqrt{2}/2$ is performed by shifters and adders for reducing the power consumption. The algorithm 
is not very useful because it is not as regular as radix-2 or radix-4. 
2.4.5 High radix architecture 
High radix FFT algorithms have fewer multiplications and fewer stages than radix-2,
but their butterfly is complex, having a large number of complex multipliers and adders [61]. 
Radix-4 is a judicious choice and a compromise between the number of stages, multiplications 
and the complexity of the butterfly for pipelined FFTs [61, 62]. The radix-4 algorithm can be very 
efficiently realised in a pipelined architecture by using only one complex multiplier just like 
radix-2 [13,63]. All the pipelined FFT processor architectures in this work are based on this 
low power radix-4 pipelined architecture. 
2.4.6 Reduced precision redundancy 
In reduced precision redundancy, a reduced precision replica operates in parallel with the main 
system in order to detect and correct errors. The authors in [64,65] have proposed an algorith-
mic noise-tolerance technique referred to as reduced precision redundancy for compensating 
the degradation in the SNR at an FFT output due to voltage overscaling (Scaling the supply 
voltage beyond the critical voltage required for correct operation). The soft errors due to volt-
age overscaling in the main system appear first in the most significant bits as the arithmetic 
units are assumed to use least significant bit first computation. This results in large magnitude 
error thereby degrading SNR severely. This large error can be detected by finding out the difference
between the main system and reduced precision replica outputs. If the difference is 
more than a threshold, the output of the reduced precision system is taken as the actual output. 
It has been assumed that the reduced precision system does not suffer from soft errors. This 
assumption is valid provided the critical path delay of the reduced precision system is smaller 
than the clock period. This is quite true for array adder and multiplier architectures where the 
delay decreases linearly with precision. The proposed technique for the butterfly multipliers in 
radix-2 FFT processor requires hardware overhead of 40%. The power saving achieved through 
voltage overscaling more than compensates for this overhead. 
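The output selection rule can be sketched in a few lines of Python. The function name `rpr_select`, the threshold and the numerical values are hypothetical; the example models a soft error as a single flipped high-order bit in the main output.

```python
def rpr_select(main_out, replica_out, threshold):
    """Use the replica output when the main output deviates from it too much."""
    if abs(main_out - replica_out) > threshold:
        return replica_out      # large difference: main output assumed corrupted
    return main_out             # small difference: keep the full precision output

correct = 1000
replica = 992                       # reduced precision estimate of the same value
corrupted = correct ^ (1 << 14)     # soft error flips a high-order bit

print(rpr_select(correct, replica, 64), rpr_select(corrupted, replica, 64))
```

The threshold has to exceed the largest quantisation difference expected between the two systems, otherwise correct outputs would be replaced unnecessarily.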
2.4.7 Asynchronous implementation 
Asynchronous techniques are appealing because of the power saving obtained in the clock dis-
tribution network. The authors in [66, 67] proposed a low power asynchronous FFT processor 
architecture. The novelty of the architecture lies in its high localisation of components and 
pipelining with no need to share a global memory. High throughput is attained using large 
number of small, local components working in parallel. 
2.4.8 Data representation 
The complexity of the log-FFT depends on the size of the look-up table which is determined 
by the bit-width of the logarithmic number system (LNS). The authors in [68, 69] proposed a low power FFT based on 
LNS. In coded OFDM, simulation results have shown that there is no degradation in bit error 
rate performance when only two fractional bits are used for LNS. The look-up table can be 
easily implemented for such small bit-width. The power consumption in the butterfly module 
is reduced by 60% as compared to the fixed point representation. 
2.5 Low power techniques and architectures for the FIR filter 
There are many techniques for reducing power consumption in FIR filters. Here is a brief 
description of the commonly used techniques employed to bring down the switched capacitance 
in FIR filter architectures. 
2.5.1 Coefficient ordering 
It is possible to reduce the power consumption in an FIR filter by reordering the filter coeffi-
cients so as to reduce the number of logic transitions between those filter coefficients used in 
successive multiplication operations by adopting criteria like minimum Hamming distance [11, 12].
This reduces the number of transitions at the coefficient input of the multiplier resulting 
in power saving. The choice of an ordered coefficient set in which successive coefficients are 
highly correlated is both computationally complex and NP-complete for practical size filters. 
The identification of such an order will require a heuristic search or Genetic algorithms. 
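A simple heuristic of this kind is a greedy nearest-neighbour search, sketched below in Python: starting from the first coefficient, the next coefficient is always the unused one with the smallest Hamming distance to the current one. This is an illustration only and not necessarily the search used in [11, 12]; the 16-bit two's complement width and the coefficient values are assumptions.

```python
def hamming(a, b, width=16):
    """Hamming distance between two signed integers as width-bit two's complement."""
    mask = (1 << width) - 1
    return bin((a ^ b) & mask).count("1")

def total_transitions(seq, width=16):
    """Sum of Hamming distances between successive coefficients."""
    return sum(hamming(a, b, width) for a, b in zip(seq, seq[1:]))

def greedy_order(coeffs, width=16):
    """Greedy nearest-neighbour ordering by Hamming distance (a heuristic)."""
    remaining = list(coeffs)
    order = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda c: hamming(order[-1], c, width))
        remaining.remove(nxt)
        order.append(nxt)
    return order

coeffs = [3, -4, 5, -6, 7, -8]
ordered = greedy_order(coeffs)
print(ordered, total_transitions(coeffs), total_transitions(ordered))
```

For this small set the transitions at the coefficient input fall from 77 to 18. The data samples must be fetched in the matching order, which is the addressing overhead of the technique.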
2.5.2 Coefficient segmentation 
This algorithm decomposes individual coefficients into two primitive sub-components [17]. 
The decomposition, performed using a heuristic approach, divides a given coefficient such that 
a part is produced which can be implemented using a single shift operation leaving another part 
with a reduced word length to be applied to the coefficient input of the hardware multiplier. This 
results in a significant reduction in the amount of switched capacitance and consequently power 
consumption. The algorithm has been used with a number of practical FIR filter examples 
achieving up to 63% power saving at the multiplier level. 
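The idea can be illustrated with a much simplified Python sketch that splits a positive integer coefficient into a power of two, realised by a single shift, and a residual with the shortest possible word length, which is the part fed to the multiplier. The selection rule used here is an assumption for illustration and is not the heuristic of [17]; the function name `segment` is hypothetical.

```python
def segment(coeff):
    """Split a positive coefficient into 2**k + residual with a short residual.

    Returns (k, residual) such that coeff == (1 << k) + residual.
    """
    best = None
    for k in range(coeff.bit_length() + 1):
        residual = coeff - (1 << k)
        cost = abs(residual).bit_length()     # magnitude bits left for the multiplier
        if best is None or cost < best[0]:
            best = (cost, k, residual)
    _, k, residual = best
    return k, residual

k, residual = segment(115)                    # 115 = 128 - 13
x = 21
product = (x << k) + x * residual             # shift part plus reduced multiplication
print(k, residual, product)
```

Here a 7-bit coefficient is replaced by a shift of seven places and a residual of magnitude 13, so the multiplier sees a 4-bit magnitude instead of a 7-bit one.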
2.5.3 Block processing 
In the direct form realisation of the FIR filter, a new data sample and the corresponding coef-
ficient are multiplied at each clock cycle followed by accumulation in a conventional multiply 
and accumulate unit. This leads to high switching activity because both the inputs of the mul-
tiplier receive new data at every clock cycle. Any technique, which leads to a reduction in 
this switching activity, will directly reduce the power consumption. Another source of power
consumption in DSP is the activity on data and address buses. Each time a new data sample is
to be multiplied with a new coefficient, both data and address buses experience high
switching activity. This activity is responsible for large power consumption since bus
capacitances are usually several orders of magnitude higher than those of internal gates of the
circuit. Consequently, a considerable amount of power can be saved by reducing the number of
memory accesses. The power consumption improves considerably if the filter outputs are
processed in blocks [18] rather than individually. It is possible to retain the coefficients and data
at the input of the multipliers for more than one cycle leading to considerable power saving.
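The effect of block processing on the coefficient input can be sketched with a simple loop
interchange. The code below is an illustrative model, not the architecture of [18]: the function
names and the `loads` counter (how often a new coefficient is presented to the multiplier) are
assumptions introduced for this example.

```python
def fir_direct(x, h):
    """Direct form: every multiplication sees a new coefficient."""
    outputs, loads = [], 0
    for n in range(len(h) - 1, len(x)):
        acc = 0
        for k, c in enumerate(h):
            loads += 1                      # coefficient input changes
            acc += c * x[n - k]
        outputs.append(acc)
    return outputs, loads


def fir_block(x, h, block):
    """Block form: each coefficient is held while a block of outputs is updated."""
    first = len(h) - 1
    outputs = [0] * (len(x) - first)
    loads = 0
    for start in range(0, len(outputs), block):
        stop = min(start + block, len(outputs))
        for k, c in enumerate(h):
            loads += 1                      # one coefficient load per block
            for i in range(start, stop):
                outputs[i] += c * x[first + i - k]
    return outputs, loads


if __name__ == "__main__":
    x, h = list(range(10)), [1, 2, 3]
    print(fir_direct(x, h))
    print(fir_block(x, h, block=4))
```

Both functions produce the same outputs; only the number of coefficient changes at the multiplier
input differs, shrinking roughly by the block size.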
2.5.4 Approximate processing 
Adaptive filtering algorithms have been generally used to dynamically change the values of the 
filter coefficients, while maintaining a fixed filter order. This algorithm involves the dynamic 
adjustment of the filter order in accordance with the stopband energy of the input signal [70]. 
This approach leads to filtering solutions in which the stopband energy in the filter output 
may be kept below a specified threshold while using as small a filter order as possible. Since 
power consumption is proportional to filter order, the approach achieves power reduction with 
respect to a fixed order filter whose output is similarly guaranteed to have stopband energy 
below the specified threshold. The overhead associated with the update process is much less 
as compared to the power saving achieved by using a lower order filter. The filter order is dy-
namically adjusted by observing the strength of the stopband component of the input signal. 
When the strength of the stopband component of the signal increases, it is desirable to increase 
the stopband attenuation of the filter. This can be accomplished by using a higher-order fil-
ter. Conversely, the filter order may be lowered when the energy of the stopband component 
decreases. 
2.5.5 Multirate architectures 
The multirate architectures enable computationally efficient implementations of FIR filters [71]. 
Multirate architectures involve implementing the FIR filter in terms of its decimated sub-filters. 
These filters are derived using Winograd's algorithms for reducing computational complexity 
of polynomial multiplications. For an N-tap filter, the direct form FIR structure requires N 
multiplications and N-1 additions per output whereas the multirate architecture requires 3N/4
multiplications and (3N+2)/4 additions per output. This reduced computational complexity of 
the multirate architectures enables reducing the frequency and the supply voltage for the same 
throughput as the conventional direct form FIR filter, thus significantly reducing the power 
dissipation. 
2.5.6 Coefficient scaling and optimisation 
Scaling the coefficients preserves the filter characteristics in terms of passband ripple and stop-
band attenuation, but results in an overall magnitude gain equal to the scaling factor. The 
coefficients are scaled in the first stage such that the total Hamming distance between succes-
sive scaled coefficients is least. It is followed by slightly modifying the scaled coefficients in 
the second stage so as to reduce the total Hamming distance while still satisfying the filter char-
acteristics. This modification is an iterative process and continues till no further reduction in 
total Hamming distance is possible [72]. 
2.5.7 Filter realisation through differential coefficient 
Most realisations of FIR filters use the coefficients directly to compute the convolution with the
input data. This technique instead uses various orders of differences between coefficients,
along with stored intermediate results, for computing the convolution [73]. The memory
requirement and the number of memory accesses are higher than in the conventional
implementation on account of the storage of intermediate results, but the net number of
computations per convolution is lower than when the coefficients are used directly. This results in
a net power reduction at the multiplier because the differences are small compared to the
coefficients. This algorithm is useful only if the range of the differences is small compared to
the coefficients. In the multiplication operation, a long multiplier is traded for a short one along
with overheads in the form of extra memory requirement. If the power saving in multiplication
is greater than the net cost due to overheads then a net saving in power is obtained.
2.5.8 Reduced two's complement data representation 
A reduced representation for 2's complement numbers has been proposed to avoid sign-extensi-
on and the switching of the sign-extended bits [74]. The maximum magnitude of a 2's comple-
ment number is detected and its reduced representation is dynamically generated to represent 
the signal. A constant error is introduced by the reduced representation and this error is also 
compensated. The proposed signal representation is more useful in filters with slowly varying 
coefficients having small magnitude. 
2.5.9 Sharing multiplication 
This technique is based on computation sharing multiplier, which targets the reduction of re-
dundant computations in FIR filtering [75]. In vector scaling operations, a set of small bit 
sequences are selected so that the actual multiplication result can be obtained by only add and 
shift operations. These bit sequences are chosen such that they cover all the coefficients. In 
TDF filter implementation, a precomputer block contains a set of multipliers for multiplying 
the data input with the short length bit sequences. This precomputer block is shared by all the 
coefficients. All the coefficient multipliers can be replaced by just shifters and adders in the 
presence of this common precomputer block. This sharing of the precomputer block leads to 
power savings in FIR filters. 
2.6 Low power architecture of a MC-CDMA receiver 
The low power MC-CDMA receiver architecture is based on a low power algorithm which 
processes the received symbols in blocks rather than individually [76, 77]. This reduces power
by holding one input to the multiplier circuit constant over a number of clock cycles resulting 
in a power reduction of 50% in the Combiner circuit. This algorithm is also extended to the 
FFT and results in a power reduction of 13% for the whole receiver. 
The processing of symbols in block form is applicable provided that the channel fading is 
sufficiently slow so as to allow the use of the same channel equaliser coefficients for the entire 
block length of symbols. The assumption of the slow fading channel is satisfied at pedestrian 
speed but it may not be true at vehicular speeds. The problem with this architecture is that the 
memory size in FFT increases linearly with the block size. This means that the reduction in 
switching activity or power consumption, obtained by increasing the block size, is neutralised 
by the power consumed in the additional memory required in the FFT processor to support 
the bigger block size. Moreover, the length of the block size depends upon the channel and 
therefore will be different at pedestrian and vehicular speeds. Hence, there is a need of some 
generic techniques for reducing the power consumption of a MC-CDMA receiver. This thesis
explores novel techniques for reducing the power consumption in the individual blocks of a
MC-CDMA receiver as well as for the whole receiver.

Scheme        csa               nbw               wall
          (mW)     %        (mW)     %        (mW)     %
CONV      0.682    -        0.566    -        0.687    -
ORDER     0.556    18       0.425    25       0.373    46
CSEG      0.331    51       0.325    43       0.597    13
BP        0.466    32       0.388    31       0.438    36
COMB      0.207    70       0.227    60       0.349    49
Table 2.1: Power consumption analysis for different multipliers.
2.7 Power characteristics of commonly used multipliers 
This thesis proposes power saving schemes in all the blocks of a MC-CDMA receiver by re-
ducing their switched capacitances. Most of the low power schemes for the signal processing 
blocks in a MC-CDMA receiver save power by reducing the power consumed by the multiplier. 
This section investigates the power saving potential of the commonly used multiplier types after 
the application of low power schemes to a 73-tap FIR filter. Table 2.1 lists the power consumed 
by the three commonly used multiplier types namely Carry save array (csa), Non-Booth coded 
Wallace tree (nbw) and Booth coded Wallace tree (wall) after the application of the low power 
schemes like Coefficient ordering (ORDER) [11, 12], Coefficient segmentation (CSEG) [17], 
Block processing (BP) [18] and the combination of Block processing and Coefficient segmen-
tation (COMB) to a 73-tap band pass FIR filter. The conventional FIR filter is referred to as 
CONV. The following conclusions can be drawn:
• The multiplier nbw consumes the least power for the CONV architecture. Therefore, the nbw
multiplier is a logical choice for conventional architectures which do not
incorporate any low power scheme. The power profile of the multipliers changes
significantly after the application of low power schemes. For instance, in Table 2.1, nbw
consumes the least power in the CONV architecture but after the application of ordering,
the wall multiplier has the lowest power consumption. It is interesting to note that csa
consumes the least power for the COMB architecture. The power consumption
results for the various multipliers are filter dependent and therefore cannot be generalised.
• The ORDER scheme is most effective for the wall multiplier.
• The CSEG scheme is least effective for the wall multiplier as compared to other multipliers
due to the reduction of switching activity only in the most significant bits of the filter
coefficients after CSEG rather than in all the bits. The wall multiplier saves maximum
power if the switching activity reduction takes place in all the bits. The CSEG scheme is
most effective for the csa multiplier type for the filter considered.
• The wall multiplier also performs best for BP scheme because this scheme also tries to 
reduce the switching activity at the inputs of the multiplier like the ORDER scheme for 
all the bits. 
• The power saving in csa multiplier is maximum for the combination of CSEG and BP 
because of its good power saving potential for both CSEG and BP as compared to other 
multipliers. 
It can be further concluded from Table 2.1 that the power consumption in the multiplier can be 
significantly reduced to different levels by employing the above mentioned low power schemes. 
The power saving potential of most of the low power schemes has been investigated only up to the
multiplier level, without taking into consideration the hardware overhead associated with each of
these algorithms. This thesis also investigates the power saving potential of each of these low
power schemes on the overall FIR core rather than on only the multiplier for a fair comparison.
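The percentage columns of Table 2.1 follow from the power figures by taking each scheme's
saving relative to the CONV row for the same multiplier. The short sketch below recomputes them
from the tabulated values; the dictionary layout and function names are assumptions made for this
illustration.

```python
# Power in mW, copied from Table 2.1.
POWER_MW = {
    "CONV":  {"csa": 0.682, "nbw": 0.566, "wall": 0.687},
    "ORDER": {"csa": 0.556, "nbw": 0.425, "wall": 0.373},
    "CSEG":  {"csa": 0.331, "nbw": 0.325, "wall": 0.597},
    "BP":    {"csa": 0.466, "nbw": 0.388, "wall": 0.438},
    "COMB":  {"csa": 0.207, "nbw": 0.227, "wall": 0.349},
}


def saving_percent(scheme, multiplier):
    """Power saving of a scheme relative to CONV, rounded to a whole percent."""
    base = POWER_MW["CONV"][multiplier]
    return round(100 * (base - POWER_MW[scheme][multiplier]) / base)


def best_multiplier(scheme):
    """Multiplier type with the lowest absolute power for a given scheme."""
    row = POWER_MW[scheme]
    return min(row, key=row.get)


if __name__ == "__main__":
    for scheme in ("ORDER", "CSEG", "BP", "COMB"):
        savings = {m: saving_percent(scheme, m) for m in ("csa", "nbw", "wall")}
        print(scheme, savings, "lowest power:", best_multiplier(scheme))
```

The lowest-power multiplier changes from nbw (CONV) to wall (ORDER) to csa (COMB), which is the
shift in power profile noted in the conclusions above.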
2.8 Summary 
This chapter described general and block specific low power techniques for reducing the switch-
ed capacitance and therefore the power consumption of the multicarrier based wireless receiver. 
Most of the techniques exploit signal correlations, reduce hardware size through transformations,
or shut down unused portions of the architecture in real time. The FFT is one
of the most power consuming blocks of the multicarrier receiver. This thesis first explores
low power architectures for the FFT processor. The next chapter gives an overview of the
discrete and the fast Fourier transform (FFT).
Chapter 3 
The Discrete and the fast Fourier 
transform 
This chapter begins with the review of the discrete Fourier Transform. The remainder of the 
chapter focuses on the introduction to a collection of algorithms used to efficiently compute 
the DFT. These algorithms are known as FFT algorithms. This chapter concentrates on only 
radix-2 and radix-4 FFT algorithms. Many books exist on the topic of FFT [61, 78, 79].
3.1 Overview of DFT
The DFT is one of the most widely used digital signal processing algorithms. The DFT is always
calculated with the help of the FFT, which comprises a collection of algorithms that efficiently
calculate the DFT of a sequence. The DFT operates on an N-point sequence of numbers x(n).
This sequence is usually obtained by uniform sampling of a finite period of some continuous
function. The DFT of x(n) is also an N-point sequence X(k), and is defined by equation 3.1.
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-i 2\pi nk/N}, \qquad k = 0, 1, \ldots, N-1 \tag{3.1}$$
This can also be written as follows.
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1, \tag{3.2}$$
where $W_N$ is defined as
$$W_N = e^{-i 2\pi/N}. \tag{3.3}$$
It is easily seen that $W_N^{k}$ is periodic with period N. The inverse DFT (IDFT), which transforms
X(k) back into x(n), is as follows.
$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \qquad n = 0, 1, \ldots, N-1 \tag{3.4}$$
Both x(n) and X(k) are defined only in the interval 0 to N - 1. However, since x(n) and X(k)
are periodic in N, they also exist for all n and k respectively. It is quite clear from equation 3.2
that for each value of k, direct computation of X(k) involves N complex multiplications and
N - 1 complex additions. Consequently, to compute all N values of the DFT requires $N^2$
complex multiplications and N(N - 1) complex additions.
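A minimal sketch of the direct evaluation of equations 3.2 and 3.4 is given below, using only the
Python standard library. The function names `dft`, `idft` and `direct_cost` are illustrative
choices, and the code makes no attempt at efficiency; it simply mirrors the double loop implied by
the definition.

```python
import cmath


def dft(x):
    """Direct N-point DFT: X(k) = sum_n x(n) * W_N^(n*k), with W_N = exp(-i*2*pi/N)."""
    N = len(x)
    X = []
    for k in range(N):
        acc = 0j
        for n in range(N):
            acc += x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
        X.append(acc)
    return X


def idft(X):
    """Inverse DFT: x(n) = (1/N) * sum_k X(k) * W_N^(-n*k)."""
    N = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
        for n in range(N)
    ]


def direct_cost(N):
    """Complex multiplications and additions needed by the direct method."""
    return N * N, N * (N - 1)


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    print(dft(x))
    print(idft(dft(x)))
    print(direct_cost(1024))
```

The quadratic growth reported by `direct_cost` is what motivates the FFT algorithms discussed next.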
3.2 The Fast Fourier transform (FFT)
The fast Fourier transform is a class of efficient algorithms for computing the DFT. The term
FFT was originally used by Cooley and Tukey in their landmark paper [80]. For simplicity, N
is chosen to be a power of 2 ($N = 2^m$), where m is a positive integer. With this assumption,
it is possible to break x(n) of length N into two sequences of length N/2. The first sequence
$x_{even}(m)$ contains all even samples of x(n) and the second sequence $x_{odd}(m)$ contains all
odd samples. Equation 3.2 can now be written as follows.
$$X(k) = \sum_{n\ \mathrm{even}} x(n)\, W_N^{nk} + \sum_{n\ \mathrm{odd}} x(n)\, W_N^{nk} \tag{3.5}$$
If 2m and 2m + 1 are substituted for n in the even and odd summations respectively, then
equation 3.5 can be written as follows.
$$X(k) = \sum_{m=0}^{N/2-1} x(2m)\,(W_N^{2})^{mk} + \sum_{m=0}^{N/2-1} x(2m+1)\,(W_N^{2})^{mk}\, W_N^{k} \tag{3.6}$$
But $W_N^{2} = W_{N/2}$ and hence,
$$X(k) = \sum_{m=0}^{N/2-1} x(2m)\, W_{N/2}^{mk} + W_N^{k} \sum_{m=0}^{N/2-1} x(2m+1)\, W_{N/2}^{mk} \tag{3.7}$$
Equation 3.7 can also be written in terms of even and odd DFTs as follows.
$$X(k) = F_{even}(k) + W_N^{k}\, F_{odd}(k) \tag{3.8}$$
The terms on the right hand side of equation 3.7 correspond to N/2-point DFTs of the even
($F_{even}(k)$) and odd ($F_{odd}(k)$) parts of x(n). The direct computation of $F_{even}(k)$ requires
$(N/2)^2$ complex multiplications, and the same holds for $F_{odd}(k)$. There are also additional
complex multiplications required to compute $W_N^{k} F_{odd}(k)$. Hence the computation of
X(k) requires $2(N/2)^2 + N/2$ complex multiplications. This first step has resulted in a reduction
of the number of multiplications from $N^2$ to $N^2/2 + N/2$, a reduction of about 50% for large
N. It is possible to reduce the N multiplications by $W_N^{k}$ to half by exploiting the relationship
$W_N^{k+N/2} = -W_N^{k}$. Figure 3.1 shows the data flow of this algorithm for N = 8 in a graphical
format. The vertical axis represents memory locations. A total of N memory locations are
needed for storing the sequence x(n). The horizontal axis represents stages of the computation.
Data processing starts from the input sequence x(n) on the left side and progresses to the right
until the output X(k) is obtained. The calculations after the $W_N^{k}$ multiplications are 2-point
DFTs.
Figure 3.1: Flow graph of an 8-point FFT calculated using two N/2-point DFTs.
The number of FFT points N is chosen to be a power of 2, so if N is greater than 2 then $x_{even}(m)$
and $x_{odd}(m)$ also have an even number of points. It means that they too can be decimated further
into their even and odd sequences, and computed from N/4-point DFTs. This decimation can
be applied recursively until an even and odd separation results in sequences with two members.
This decimation procedure can be applied $\log_2(N) - 1$ times, producing $\log_2(N)$ stages. From
equation 3.2, the final 2-point DFTs are very easy to calculate and require no multiplication.
Figure 3.2 shows the dataflow diagram of the 8-point FFT. It is clear from the dataflow diagram
that the number of multiplications per stage equals N/2 and therefore the total number of
multiplications for computing the N-point FFT reduces to $(N/2)\log_2(N)$.
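The recursive decimation described above can be sketched in a few lines of Python. This is a
textbook-style recursive formulation written for illustration, not the hardware dataflow of the
thesis; the function name `fft_radix2` and the multiplication counter are assumptions of this
example. The counter charges one coefficient multiplication per butterfly, including the trivial
ones by $W_N^{0}$, which matches the count of N/2 multiplications per stage.

```python
import cmath


def fft_radix2(x):
    """Radix-2 decimation-in-time FFT. Returns (X, number of coefficient multiplications)."""
    N = len(x)
    if N == 1:
        return list(x), 0
    if N % 2:
        raise ValueError("length must be a power of 2")
    f_even, m_even = fft_radix2(x[0::2])    # N/2-point DFT of even samples
    f_odd, m_odd = fft_radix2(x[1::2])      # N/2-point DFT of odd samples
    X = [0j] * N
    mults = m_even + m_odd
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * f_odd[k]   # W_N^k * F_odd(k)
        mults += 1
        X[k] = f_even[k] + t                # butterfly output X = A + BW
        X[k + N // 2] = f_even[k] - t       # butterfly output Y = A - BW
    return X, mults


if __name__ == "__main__":
    X, mults = fft_radix2([1, 2, 3, 4, 5, 6, 7, 8])
    print(X)
    print(mults)
```

The second butterfly output uses the relationship $W_N^{k+N/2} = -W_N^{k}$, so each coefficient
product is computed once and reused with opposite sign.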





Figure 3.2: Flow graph of an 8-point radix-2 decimation-in-time FFT 
[Butterfly diagram: inputs A and B, coefficient W, outputs X = A + BW and Y = A - BW.]
Figure 3.3: Signal flow representation of a radix-2 decimation-in-time butterfly.
Figure 3.4: Simplified representation of a radix-2 decimation-in-time butterfly.
The FFT algorithm described here is called the decimation-in-time algorithm since at each stage
the input sequence (time sequence) is divided into smaller sequences. Since at each stage the
DFT is broken into two smaller DFTs, this FFT is called a radix-2 FFT.
The basic operation of the radix-2 decimation-in-time algorithm is the butterfly, shown in signal
flow and simplified representations in Figure 3.3 and Figure 3.4 respectively. The
butterfly accepts two inputs A and B and generates two outputs X and Y. The multiplication
factors $W_N^{k}$ introduced in the butterfly by the combining operation are called twiddle factors.
Twiddle factors are referred to as coefficients in this thesis. The simplified flow graph of an 8-point
radix-2 decimation-in-time FFT algorithm is shown in Figure 3.5. The coefficients ($W_N^{k}$) in the
simplified flow graph are represented by their corresponding 'k' values for simplicity.
Figure 3.5: Simplified flow graph of an 8-point radix-2 decimation-in-time FFT algorithm.
A second algorithm, namely the radix-2 decimation-in-frequency algorithm, is obtained by
decimating the frequency values X(k) during each stage. The details of this algorithm can be found in
the references [61, 78, 79].
It is clear from the flow graph, shown in Figure 3.5, that the basic operations in the FFT are
multiplication of the complex data inputs with the FFT coefficients at each stage of the flow
graph followed by their addition or subtraction. The multiplication and addition/subtraction
operations are performed by the butterfly module in the FFT processor. A RAM is required to
store the data inputs and intermediate outputs after every stage of the signal flow graph. A
ROM is needed to store the fixed coefficients. An FSM is also required to properly generate
the coefficient and data addresses for providing appropriate data and coefficients to the butterfly
module in successive stages of the FFT.
3.3 A unified approach to the FFT 
All the different FFT algorithms can be derived by successively representing a one-dimensional
array as a two-dimensional array. Consider an N-point DFT such that N can be represented
as a product of two integers, namely L and M,
N = LM.

    0              1                2                ...    M-1
    X(0)           X(1)             X(2)             ...    X(M-1)
    X(M)           X(M+1)           X(M+2)           ...    X(2M-1)
    X(2M)          X(2M+1)          X(2M+2)          ...    X(3M-1)
    ...
    X((L-1)M)      X((L-1)M+1)      X((L-1)M+2)      ...    X(LM-1)
Figure 3.6: Row wise arrangement of data array.
The sequence x(n), 0 ≤ n ≤ N - 1, is a one-dimensional array indexed by n. The same sequence
can also be stored in a two-dimensional array indexed by l and m, where 0 ≤ l ≤ L - 1 and
0 ≤ m ≤ M - 1. Thus the sequence x(n) can be stored in a rectangular array in a number of
ways depending upon the mapping of index n to the row index l and column index m. Let us
start with the row-wise mapping,
n = Ml + m.
This results in an arrangement in which the first row contains the first M elements of x(n),
the second row contains the next M elements of x(n) and so on as shown in Figure 3.6. The
column-wise mapping, shown in Figure 3.7, is given by,
n = l + mL.
This results in an arrangement in which the first L elements of x(n) are stored in the first column,
the next L elements in the second column and so on. The DFT values are also mapped along
similar lines. In the case of the DFT, the mapping is from an index k to a row index p and a column
index q. The row-wise mapping is given by,
k = Mp + q,
and the column-wise mapping by,
k = qL + p.
In row-wise mapping, the first row consists of the first M elements of the DFT X(k), the second
row consists of the next M elements of X(k) and so on.

    0           1            2            ...    M-1
    X(0)        X(L)         X(2L)        ...    X((M-1)L)
    X(1)        X(L+1)       X(2L+1)      ...    X((M-1)L+1)
    X(2)        X(L+2)       X(2L+2)      ...    X((M-1)L+2)
    ...
    X(L-1)      X(2L-1)      X(3L-1)      ...    X(LM-1)
Figure 3.7: Column wise arrangement of data array.
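If x(n) is mapped column-wise to x(l, m) and X(k) is mapped row-wise to X(p, q), then the DFT
can be expressed as a double sum over the elements of the rectangular array multiplied by the
corresponding phase factors given by equation 3.9.
$$X(p, q) = \sum_{m=0}^{M-1} \sum_{l=0}^{L-1} x(l, m)\, W_N^{(Mp+q)(mL+l)} \tag{3.9}$$
The phase factor expands as $W_N^{(Mp+q)(mL+l)} = W_N^{MLmp}\, W_N^{mLq}\, W_N^{Mpl}\, W_N^{lq}$, where
$W_N^{MLmp} = W_N^{Nmp} = 1$, $W_N^{mLq} = W_{N/L}^{mq} = W_M^{mq}$ and $W_N^{Mpl} = W_{N/M}^{pl} = W_L^{pl}$.
Equation 3.9 can now be simplified as,
$$X(p, q) = \sum_{l=0}^{L-1} \left\{ W_N^{lq} \left[ \sum_{m=0}^{M-1} x(l, m)\, W_M^{mq} \right] \right\} W_L^{lp} \tag{3.10}$$
The arrangement of the input sequence as a two dimensional array involves the computation
of the DFT in the following steps:
• Compute the M-point DFTs $G(l, q) = \sum_{m=0}^{M-1} x(l, m)\, W_M^{mq}$, 0 ≤ q ≤ M - 1, for each
of the rows l = 0, 1, ..., L - 1.
• The M-point DFTs G(l, q) are transformed into a new array H(l, q) by multiplying them
with the coefficients $W_N^{lq}$.
$$H(l, q) = W_N^{lq}\, G(l, q), \qquad 0 \le q \le M-1,\ 0 \le l \le L-1$$
• Compute the L-point DFTs for each column q = 0, 1, ..., M - 1 of the array H(l, q).
$$X(p, q) = \sum_{l=0}^{L-1} H(l, q)\, W_L^{lp}$$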
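The three steps can be checked numerically with a small sketch. The code below assumes the
column-wise input mapping n = l + mL and the row-wise output mapping k = Mp + q used in this
section; the function names are illustrative, and the inner DFTs are evaluated directly rather
than decomposed further.

```python
import cmath


def w(base, exponent):
    """Phase factor W_base^exponent = exp(-i * 2*pi * exponent / base)."""
    return cmath.exp(-2j * cmath.pi * exponent / base)


def dft_direct(x):
    """Reference N-point DFT evaluated straight from the definition."""
    N = len(x)
    return [sum(x[n] * w(N, n * k) for n in range(N)) for k in range(N)]


def dft_two_dimensional(x, L, M):
    """N = L*M point DFT via M-point row DFTs, coefficient scaling, L-point column DFTs."""
    N = L * M
    if len(x) != N:
        raise ValueError("len(x) must equal L * M")
    # Step 1: M-point DFT of each row l, with x(l, m) = x[l + m*L].
    G = [[sum(x[l + m * L] * w(M, m * q) for m in range(M)) for q in range(M)]
         for l in range(L)]
    # Step 2: multiply by the coefficients W_N^(l*q).
    H = [[w(N, l * q) * G[l][q] for q in range(M)] for l in range(L)]
    # Step 3: L-point DFT of each column q, stored at k = M*p + q.
    X = [0j] * N
    for p in range(L):
        for q in range(M):
            X[M * p + q] = sum(H[l][q] * w(L, l * p) for l in range(L))
    return X


if __name__ == "__main__":
    x = list(range(12))
    print(dft_two_dimensional(x, L=3, M=4))
    print(dft_direct(x))
```

Choosing L = 2 and M = N/2 in this sketch corresponds to one stage of the radix-2
decimation-in-time split.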
The decomposition of a larger DFT into smaller DFTs always leads to a reduction in computational
complexity as already discussed. When N is a highly composite number such that it can
be factored into a product of prime numbers $N = r_1 r_2 \cdots r_v$, then the above decomposition can
be carried out (v - 1) more times, leading to a more computationally efficient algorithm. Further
decomposition of the rectangular array involves segmentation of each row and column of the
rectangular array into smaller rectangular arrays. The decomposition ends when N is factored
into its prime factors.
The already discussed radix-2 decimation-in-time algorithm is obtained by selecting L = 2 and
M = N/2 in the unified approach for each of the $\log_2 N$ stages. This leads to n = 2m + l and
k = (N/2)p + q.
3.3.1 Radix-4 decimation-in-time FFT algorithm
When the number of data points N in the DFT is a power of 4 ($N = 4^v$), then it is more
efficient computationally to employ a radix-4 algorithm instead of a radix-2 algorithm. A radix-4
decimation-in-time FFT algorithm is obtained by splitting the N-point input sequence x(n)
into four subsequences x(4n), x(4n + 1), x(4n + 2) and x(4n + 3). The radix-4 decimation-in-time
FFT algorithm is obtained by selecting L = 4 and M = N/4 in the unified approach. This
leads to n = 4m + l and k = (N/4)p + q. The radix-4 algorithm is obtained by following the
decomposition procedure outlined in the previous section v times recursively. The signal flow
graph of a 16-point radix-4 decimation-in-time algorithm is shown in Figure 3.8. The details
can be found in [79].







































Figure 3.9: Radix-4 decimation-in-time butterfly. 
The radix-4 butterfly, shown in Figure 3.9, is constructed by merging a 4-point DFT with the
associated coefficients between DFT stages. The four outputs of the radix-4 butterfly, namely
BO1, BO2, BO3 and BO4, are expressed in terms of its inputs BI1, BI2, BI3 and BI4 as follows:
$$B_{O1} = B_{I1} + B_{I2} W_1 + B_{I3} W_2 + B_{I4} W_3 \tag{3.11}$$
$$B_{O2} = B_{I1} - i B_{I2} W_1 - B_{I3} W_2 + i B_{I4} W_3 \tag{3.12}$$
$$B_{O3} = B_{I1} - B_{I2} W_1 + B_{I3} W_2 - B_{I4} W_3 \tag{3.13}$$
$$B_{O4} = B_{I1} + i B_{I2} W_1 - B_{I3} W_2 - i B_{I4} W_3 \tag{3.14}$$
The radix-4 butterfly requires three complex multiplications. The multiplication with 'i' is 
accomplished by negation and swapping of the real and imaginary parts. Radix-4 has a compu-
tational advantage over radix-2 because radix-4 butterfly does the work of four radix-2 butter-
flies using three multipliers instead of four multipliers in four radix-2 butterflies [62]. On the 
negative side, a radix-4 butterfly is more complicated to implement than a radix-2 butterfly. 
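The butterfly equations 3.11 to 3.14 translate directly into code. The sketch below is only an
arithmetic illustration, with hypothetical function names; it forms the three coefficient products
once and realises the multiplication by 'i' as a swap of real and imaginary parts with one
negation, as described above.

```python
def times_i(z):
    """Multiply by i without a multiplier: swap the parts and negate the new real part."""
    return complex(-z.imag, z.real)


def radix4_butterfly(bi1, bi2, bi3, bi4, w1, w2, w3):
    """Radix-4 decimation-in-time butterfly using three complex multiplications."""
    t2 = bi2 * w1
    t3 = bi3 * w2
    t4 = bi4 * w3
    bo1 = bi1 + t2 + t3 + t4
    bo2 = bi1 - times_i(t2) - t3 + times_i(t4)
    bo3 = bi1 - t2 + t3 - t4
    bo4 = bi1 + times_i(t2) - t3 - times_i(t4)
    return bo1, bo2, bo3, bo4


if __name__ == "__main__":
    # With unit coefficients the butterfly reduces to a plain 4-point DFT.
    print(radix4_butterfly(1, 2, 3, 4, 1, 1, 1))
```

With all three coefficients equal to one, the outputs are the 4-point DFT of the inputs, which is
a convenient sanity check for the signs in the equations.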
While radix-2 and radix-4 FFTs are certainly the most widely known algorithms, it is also 
possible to design FFTs with even higher radix butterflies. They are not often used because 
the control and dataflow of their butterflies are more complicated and the additional efficiency 
gained diminishes rapidly for radices greater than four [62]. 
3.4 Summary 
This chapter introduced DFT and FFT algorithms for efficiently computing the DFT. The radix-
2 decimation-in-time FFT algorithm is covered in detail whereas the overview of the radix-
4 decimation-in-time algorithm is given. It has been shown that radix-4 algorithm is more 
computationally efficient than the radix-2 algorithm. The next chapter covers the proposed low
power techniques for the FFT processor.
Chapter 4 
Low power schemes for FFT processor 
This chapter proposes three different schemes for reducing the power consumption in an FFT
processor. The first scheme, namely order based processing of coefficients, modifies the
traditional coefficient ordering for the FFT processor to achieve a much higher reduction in
switching activity and hence in power consumption. The second scheme, namely the Coefficient memory
reduction scheme, reduces the size of the coefficient memory from N/2 to (N/8 + 1) locations
in an N-point FFT processor, thereby saving both area and power for long FFTs. The third
scheme, namely the Coefficient addressing scheme, simplifies the coefficient address generation in
a single butterfly FFT processor, thereby saving both area and power.
This chapter is organised into three sections. The first section describes the Order based pro-
cessing scheme and the results to show its effectiveness in reducing the switching activity as 
compared to the traditional order based processing scheme. The architectures of the low power 
FFT processors based on the Order based processing scheme will be described in Chapter 5 
and Chapter 6. The second section proposes a Coefficient memory reduction scheme and its 
hardware implementation. The third section introduces a Coefficient addressing scheme and 
its hardware implementation. It is important to note that 16-bit two's complement fixed point 
number representation is used in this thesis. 
4.1 Order based coefficient processing scheme 
The multiplier is one of the most power consuming blocks in an FFT processor. Order based
processing is a very effective way of reducing the switching activity between the fixed successive
coefficient inputs to the multiplier. The basic idea is to order the multiplication operations such
that the switching activity is minimum between successive coefficients. For an N-point FFT,
the number of possible ways to arrange the coefficients equals N! but the number of distinct
coefficient orders equals (N-1)!/2. Thus the choice of the highly correlated coefficient set is a
computationally complex NP-complete problem requiring some heuristic search algorithm. In
the conventional order based processing approach, the nearest neighbour selection heuristic is
used for ordering the coefficient set so as to minimise the Hamming distance (bit transitions or
switching activity) between successive coefficients fed to the multiplier blocks of the butterfly
module. This work is reported in [81].
In an FFT processor, the positions of the real and imaginary parts of the coefficient set cannot be 
changed independently (Every real part has a corresponding imaginary part and vice versa). The 
reduction in switching activity of the coefficient set by the conventional order based processing 
approach is insignificant because the switching activities of the real and imaginary parts of 
the coefficient set are almost complementary. For instance, if one changes the order of only 
the imaginary parts of the coefficient set for minimum switching activity, the new order of 
the real parts will be such that the net activity of the overall coefficient set will not reduce 
substantially and vice-versa. The proposed scheme addresses the above mentioned problem by 
either using the real part of the coefficient or its two's complement on the basis of minimum 
Hamming distance with the preceding real part. The scheme can be best illustrated with the 
help of the flowchart shown in Figure 4.1. Let $X_R$ and $X_I$ be the N/2-element
real and imaginary part arrays respectively for an N-point FFT. Also assume the presence of
the following functions for understanding the flowchart:
Ham(i, j) - Hamming distance between array elements $X_I(i)$ and $X_I(j)$.
HamN(i) - Hamming distance between elements $X_R(i)$ and $X_R(i - 1)$.
HamC(i) - Hamming distance between elements $-X_R(i)$ and $X_R(i - 1)$.
A($X_I(i)$) - Number of 1's in array element $X_I(i)$.
The flow chart is divided into four sections. Section I selects the first imaginary part of the
ordered coefficient set on the basis of minimum number of 1's. Section II arranges the remaining
imaginary parts of the coefficient set on the basis of minimum Hamming distance between
successive imaginary coefficients. After sections I and II, the imaginary parts of the coefficient set
are ordered and are stored in array $X_I$. Section III deals with the selection of the first real part
corresponding to the already ordered first imaginary part of the coefficient set. Either $X_R(0)$ or
its two's complemented value $-X_R(0)$ is selected on the basis of the lesser number of 1's. A flag
bit is asserted in case the complemented value is selected. Section IV chooses subsequent real
parts or their two's complemented values corresponding to their imaginary parts on the basis of
minimum Hamming distance with the already ordered preceding real part. Table 4.1 depicts the
coefficient sets obtained by application of our Order based processing scheme (Our) and the
conventional order based processing schemes (Others) to a 32-point FFT processor as an
example. The Others ordered set is obtained by using the minimum Hamming distance approach
on successive coefficients. Our ordered set is obtained by following the procedure outlined in
Figure 4.1. A flag bit is stored along with the ordered coefficients to indicate the form of the real
part of the coefficient. Our scheme is most effective in the last stage of the signal flowgraph of
the radix-2 decimation-in-time FFT algorithm where the coefficient switching activity is most
intense (the coefficients are different for each butterfly unlike other FFT stages as shown in
Figure 4.2).

Figure 4.1: Flow chart of the order based processing scheme.

Index   Coefficient      Unordered        Others           Our
        (real, imag)     (real, imag)     (real, imag)     (flag, real, imag)
0000     1.0,   0.0      7fff,0000        0000,8000        1,8001,0000
0001     0.98, -0.19     7d89,e706        7641,cf04        0,0000,8000
0010     0.92, -0.38     7641,cf04        7d89,e706        0,7641,cf04
0011     0.83, -0.55     6a6d,b8e3        89be,cf04        1,7642,cf04
0100     0.71, -0.71     5a82,a57d        8276,e706        0,7d89,e706
0101     0.55, -0.83     471c,9592        e706,8276        1,7d8a,e706
0110     0.38, -0.92     30fb,89be        cf04,89be        1,e707,8276
0111     0.19, -0.98     18f9,8276        471c,9592        0,e706,8276
1000     0.0,  -1.0      0000,8000        7fff,0000        1,cf05,89be
1001    -0.19, -0.98     e706,8276        18f9,8276        0,cf04,89be
1010    -0.38, -0.92     cf04,89be        30fb,89be        1,471d,9592
1011    -0.55, -0.83     b8e3,9592        b8e3,9592        0,471c,9592
1100    -0.71, -0.71     a57d,a57d        9592,b8e3        1,6a6e,b8e3
1101    -0.83, -0.55     9592,b8e3        6a6d,b8e3        0,6a6d,b8e3
1110    -0.92, -0.38     89be,cf04        a57d,a57d        0,a57d,a57d
1111    -0.98, -0.19     8276,e706        5a82,a57d        1,a57e,a57d
Table 4.1: Listing of the ordered coefficient sets obtained by different order based processing
schemes for a 32-point FFT processor as an example.
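The four sections of the flow chart can be sketched in Python as follows. This is an illustrative
reading of the description above, not the thesis implementation: the function names, the
tie-breaking rule (the first candidate in index order wins, and the normal form is kept on a tie)
and the cyclic definition of switching activity are assumptions of this example. The three data
sets are the 16-bit hexadecimal words of Table 4.1.

```python
UNORDERED = [
    (0x7FFF, 0x0000), (0x7D89, 0xE706), (0x7641, 0xCF04), (0x6A6D, 0xB8E3),
    (0x5A82, 0xA57D), (0x471C, 0x9592), (0x30FB, 0x89BE), (0x18F9, 0x8276),
    (0x0000, 0x8000), (0xE706, 0x8276), (0xCF04, 0x89BE), (0xB8E3, 0x9592),
    (0xA57D, 0xA57D), (0x9592, 0xB8E3), (0x89BE, 0xCF04), (0x8276, 0xE706),
]
OTHERS = [
    (0x0000, 0x8000), (0x7641, 0xCF04), (0x7D89, 0xE706), (0x89BE, 0xCF04),
    (0x8276, 0xE706), (0xE706, 0x8276), (0xCF04, 0x89BE), (0x471C, 0x9592),
    (0x7FFF, 0x0000), (0x18F9, 0x8276), (0x30FB, 0x89BE), (0xB8E3, 0x9592),
    (0x9592, 0xB8E3), (0x6A6D, 0xB8E3), (0xA57D, 0xA57D), (0x5A82, 0xA57D),
]
OUR = [  # flag bits of Table 4.1 omitted
    (0x8001, 0x0000), (0x0000, 0x8000), (0x7641, 0xCF04), (0x7642, 0xCF04),
    (0x7D89, 0xE706), (0x7D8A, 0xE706), (0xE707, 0x8276), (0xE706, 0x8276),
    (0xCF05, 0x89BE), (0xCF04, 0x89BE), (0x471D, 0x9592), (0x471C, 0x9592),
    (0x6A6E, 0xB8E3), (0x6A6D, 0xB8E3), (0xA57D, 0xA57D), (0xA57E, 0xA57D),
]


def hamming(a, b):
    """Number of differing bits between two 16-bit words."""
    return bin((a ^ b) & 0xFFFF).count("1")


def switching(coeffs):
    """Bit transitions over one pass of the set, wrapping from the last entry to the first."""
    return sum(hamming(re, coeffs[i - 1][0]) + hamming(im, coeffs[i - 1][1])
               for i, (re, im) in enumerate(coeffs))


def order_based(coeffs):
    """Return (flag, real, imag) triples following Sections I to IV of the flow chart."""
    remaining = list(range(len(coeffs)))
    # Section I: first imaginary part has the fewest 1's.
    order = [min(remaining, key=lambda k: hamming(coeffs[k][1], 0))]
    remaining.remove(order[0])
    # Section II: nearest neighbour ordering of the remaining imaginary parts.
    while remaining:
        prev_im = coeffs[order[-1]][1]
        nxt = min(remaining, key=lambda k: hamming(coeffs[k][1], prev_im))
        order.append(nxt)
        remaining.remove(nxt)
    # Sections III and IV: normal or two's complement form of each real part.
    result, prev_re = [], 0   # comparing with 0 counts 1's for the first entry
    for k in order:
        re, im = coeffs[k]
        neg = (-re) & 0xFFFF
        flag = 1 if hamming(neg, prev_re) < hamming(re, prev_re) else 0
        prev_re = neg if flag else re
        result.append((flag, prev_re, im))
    return result


if __name__ == "__main__":
    print(switching(UNORDERED), switching(OTHERS), switching(OUR))
    ours = order_based(UNORDERED)
    print([(f, hex(r), hex(i)) for f, r, i in ours])
    print(switching([(r, i) for _, r, i in ours]))
```

Because ties between equally distant coefficients can be broken in different ways, the order
produced by this sketch need not match the Our column of Table 4.1 row for row.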
4.1.1 Results 
Table 4.2 lists the switching activity reductions obtained by the Others and Our order
based processing schemes for different lengths of the FFT processor. It is clear from the table
that the switching activity reduction is close to 50% for Our scheme compared to only around





FFT      Total       Others scheme            Our scheme
length   switching   Switching   [%]          Switching   [%]
         activity    activity    reduction    activity    reduction
16       126         120         5            68          46
32       240         204         15           126         48
64       476         368         23           222         53
128      828         672         19           424         49
256      1520        1196        21           780         49
512      2874        2106        27           1492        48
1K       5250        3934        25           2786        47
Table 4.2: Switching activity comparison of different schemes for different lengths of the
radix-2 FFT processor.
20% for the Others scheme for all FFT lengths. Table 4.3 lists the reduction in switching activity
obtained by the Others and Our schemes for different coefficient wordlengths for a 128-point FFT
processor. It is clear that the switching activity reduction remains almost the same for different
Wordlength   Total       Others scheme            Our scheme
             switching   Switching   [%]          Switching   [%]
             activity    activity    reduction    activity    reduction
8            328         266         19           222         32
10           448         352         21           244         46
12           552         444         20           296         46
14           696         566         19           372         47
16           828         672         19           424         49
18           970         806         17           510         48
20           1084        884         18           560         48
Table 4.3: Switching activity comparison of different schemes for different wordlengths for a
128-point radix-2 FFT processor. The coefficient set is obtained by rounding to the
nearest integer.
Wordlength   Total       Others scheme            Our scheme
             switching   Switching   [%]          Switching   [%]
             activity    activity    reduction    activity    reduction
8            334         258         23           146         56
10           450         352         22           220         51
12           598         446         25           244         59
14           722         568         21           316         56
16           868         670         23           366         58
18           942         788         17           448         52
20           1106        890         20           486         56
Table 4.4: Switching activity comparison of different schemes for different wordlengths for a
128-point radix-2 FFT processor. The coefficient set is obtained by rounding up to
the nearest integer.
wordlengths. Table 4.4 lists the coefficient switching activity when the coefficient set is
obtained by rounding up to the nearest integer instead of the traditional rounding to the
nearest integer. The switching activity reduction is much better with the latter method
(rounding up) because the quantised representation of a negative number and its generation
through two's complement become exactly identical. In the former method (rounding to the
nearest integer) there is always a difference of unity between the stored representation and the
value generated through two's complement, which leads to inferior switching activity reductions.
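As a quick arithmetic check, the reduction columns can be recomputed from the switching activity counts. The sketch below assumes that each percentage is taken relative to the total switching activity of the unordered coefficient set and rounded to the nearest integer; the figures are those of Table 4.2, and the helper name percent_reduction is illustrative.

```python
# FFT length -> (total, Others, Our) switching activity, copied from Table 4.2.
TABLE_4_2 = {
    16: (126, 120, 68),
    32: (240, 204, 126),
    64: (476, 368, 222),
    128: (828, 672, 424),
    256: (1520, 1196, 780),
    512: (2874, 2106, 1492),
    1024: (5250, 3934, 2786),
}


def percent_reduction(total, reduced):
    """Percentage reduction, rounded half up using integer arithmetic only."""
    return (200 * (total - reduced) + total) // (2 * total)


if __name__ == "__main__":
    for length, (total, others, ours) in TABLE_4_2.items():
        print(length, percent_reduction(total, others), percent_reduction(total, ours))
```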
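'g'   Address   Coefficient set   Our scheme   Others scheme
                (Real, Imag)
0     0000      1.0, 0.0          Block I      Block A
1     0001      .98, -.19         Block I      Block A
2     0010      .92, -.38         Block I      Block A
3     0011      .83, -.55         Block I      Block A
4     0100      .71, -.71         Block I      Block A
0     0101      .55, -.83         Block II     Block A
1     0110      .38, -.92         Block II     Block A
2     0111      .19, -.98         Block II     Block A
0     1000      0.0, -1.0         Block III    Block B
1     1001      -.19, -.98        Block III    Block B
2     1010      -.38, -.92        Block III    Block B
3     1011      -.55, -.83        Block III    Block B
4     1100      -.71, -.71        Block III    Block B
0     1101      -.83, -.55        Block IV     Block B
1     1110      -.92, -.38        Block IV     Block B
2     1111      -.98, -.19        Block IV     Block B
Table 4.5: Description of the various memory organisation schemes.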
4.2 Coefficient memory reduction scheme
The power consumption in an FFT processor can also be reduced by decreasing the coefficient
memory size required for its implementation. The proposed scheme reduces the size of the
coefficient memory from the existing N/4 locations [14, 82] to ((N/8) + 1) locations for an
N-point FFT processor. The power and area saving is significant as compared to the existing
scheme for long FFTs.
The radix-2 FFT coefficients are expressed as follows:
$$W_N^k = e^{-j 2\pi k / N} = \cos(2\pi k/N) - j\,\sin(2\pi k/N)$$
Where k varies from 0 to N/2 - 1, giving rise to N/2 coefficients for an N-point FFT, where 'N'
indicates the number of data points or the length of the FFT. Storing these coefficients thus
requires N/2 memory locations in the conventional (Conv) implementation. Each coefficient memory
location stores the real and the imaginary parts of the coefficient. The values of the coefficients
for a 32-point FFT in two's complement form are given in Table 4.5. The partitioning of
coefficients into blocks (as shown in Table 4.5) is applicable to all FFT lengths. Others proposed
to divide the memory into two identical blocks, namely A and B [14, 82]. It is clear that the
coefficient values in block B can be generated from those in block A by interchanging the
real and imaginary parts of the coefficients and by complementing the real part before its
assignment to the imaginary part corresponding to block B. Hence, only block A needs to be
stored. In Our scheme, the memory is partitioned into four blocks (Block I to Block IV) rather
than two, as shown in Table 4.5. The memory size in Our scheme is reduced to ((N/8) + 1)
locations (only Block I is needed) from the N/4 locations proposed by the Others schemes (only
Block A is needed). Using Our scheme, only the coefficient values in Block I need to be stored;
the coefficient values in the other blocks and their corresponding first block addresses can be
generated by following the general procedure given below. This procedure can be explained with
the help of the 32-point FFT example given in Table 4.5. Let the complex coefficient values in
terms of the real and imaginary parts be represented as in equation 4.1.
	
$$Z_{bg} = R_{bg} + jI_{bg} \qquad (4.1)$$
Where 'b' indicates the memory block number and 'g' is an index which points to the coefficient 
values within individual blocks. The first coefficient value in each block has an index 'g' equal 
to zero. The first block coefficient values are obtained from equation 4.1 by replacing 'b' with 
'1' as shown in equation 4.2. 
$$Z_{1g} = R_{1g} + jI_{1g}, \qquad 0 \le g \le N/8 \qquad (4.2)$$
Let the coefficient memory address generated and the actual block address be represented by
an n-bit array $A_m$ and an (n-1)-bit array $A_{bg}$ respectively. The coefficient memory block
address is always one bit shorter than the conventional memory address because the block size is
limited to ((N/8) + 1) instead of N/2. The corresponding addresses of the coefficient values in
the first block are given by the following equation.
$$A_{1g}[n-2:0] = A_m[n-2:0], \qquad n = \log_2(N/2),\; 0 \le g \le N/8,\; 0 \le m \le N/8$$
The second block coefficient values can be obtained in terms of the first block coefficient values
by performing the following substitution in the right-hand side of equation 4.2:
$R_{1g} \rightarrow \sim I_{1g}$, $I_{1g} \rightarrow \sim R_{1g}$ and $g \rightarrow (N/8 - 1 - g)$.
The symbol '$\sim$' here corresponds to a complement operation. This can also be verified from
Table 4.5. The resulting equation is as follows:
$$Z_{2g} = [\sim I_{1(N/8-1-g)}] + j[\sim R_{1(N/8-1-g)}], \qquad 0 \le g \le ((N/8) - 2)$$
When the coefficient memory address generator proceeds to generate the addresses in the second
block, the corresponding addresses in the first block (only Block I is stored) are obtained by
taking the two's complement of the coefficient memory address as follows:
$$A_{2g} = \sim A_m[n-2:0] + 1, \qquad 0 \le g \le ((N/8) - 2),\; ((N/8) + 1) \le m \le ((N/4) - 1)$$
Similarly, the coefficient values in the third block are obtained from the first block coefficient
values using equation 4.2 as follows:
$$Z_{3g} = [I_{1g}] + j[\sim R_{1g}], \qquad 0 \le g \le N/8$$
When the coefficient memory address generator proceeds to generate the addresses in the third
block, the corresponding first block addresses are obtained as follows:
$$A_{3g} = A_m[n-2:0], \qquad 0 \le g \le N/8,\; N/4 \le m \le 3N/8$$
Similarly, the fourth block coefficient values are obtained in terms of the first block coefficient
values again using equation 4.2 as follows:
$$Z_{4g} = [\sim R_{1(N/8-1-g)}] + j[I_{1(N/8-1-g)}], \qquad 0 \le g \le ((N/8) - 2)$$
Similarly for the fourth block, the corresponding addresses in the first block are given by the
following equation.
$$A_{4g} = \sim A_m[n-2:0] + 1, \qquad 0 \le g \le ((N/8) - 2),\; ((3N/8) + 1) \le m \le ((N/2) - 1)$$
Let us consider an example to understand the above scheme. In Table 4.5, the real and imaginary
coefficient values for a 32-point FFT (n equals 4) corresponding to index '0' of the second
block are .55 and -.83 respectively. The coefficient address $A_5[2:0]$ for these values is 101.
These coefficient values can be generated by moving to index 3 of the first block and then
complementing and interchanging the real and imaginary values stored at this address. The
appropriate first block address $A_{20}$ (011) corresponding to index 3 is generated by taking the
two's complement of the coefficient memory address $A_5[2:0]$. The blocks can be easily identified
with the help of the higher two bits along with the all zero combination of the remaining bits.
Hence, only block I needs to be stored and the remaining blocks can be generated by designing
Figure 4.3: Hardware implementation of Our coefficient memory reduction scheme.
additional hardware for implementing the described procedure. The delay introduced by
the additional block is not of much consequence because the coefficient delay in a typical FFT
implementation is half that of the data delay [83]. Moreover, the access time of the memory in our
case is much lower on account of its smaller size. This work is reported in [84, 85].
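The block mapping described above can be checked numerically with a short behavioural sketch. It is only a software model under stated assumptions: coefficients are kept as floating-point complex numbers, the complement operation is modelled as arithmetic negation, and the function names full_coefficients and lookup_from_block_one are illustrative rather than taken from the thesis.

```python
import cmath


def full_coefficients(n_points):
    """All N/2 radix-2 coefficients W_N^k = exp(-j*2*pi*k/N)."""
    return [cmath.exp(-2j * cmath.pi * k / n_points) for k in range(n_points // 2)]


def lookup_from_block_one(m, rom, n_points):
    """Return W_N^m using only the ((N/8) + 1) entries of Block I held in rom."""
    eighth = n_points // 8
    addr_bits = (n_points // 2).bit_length() - 2   # n - 1 bits, with n = log2(N/2)
    mask = (1 << addr_bits) - 1
    low = m & mask                                 # A_m[n-2:0]
    mirrored = (-low) & mask                       # two's complement of A_m[n-2:0]
    if m <= eighth:                                # Block I: stored directly
        return rom[low]
    if m < 2 * eighth:                             # Block II: swap and negate both parts
        z = rom[mirrored]
        return complex(-z.imag, -z.real)
    if m <= 3 * eighth:                            # Block III: swap, negate new imaginary part
        z = rom[low]
        return complex(z.imag, -z.real)
    z = rom[mirrored]                              # Block IV: negate the real part only
    return complex(-z.real, z.imag)


if __name__ == "__main__":
    n_points = 32
    full = full_coefficients(n_points)
    rom = full[: n_points // 8 + 1]                # only Block I is stored
    worst = max(abs(lookup_from_block_one(m, rom, n_points) - full[m])
                for m in range(n_points // 2))
    print("stored locations:", len(rom), "largest error:", worst)
```

For the 32-point case, address 5 (binary 101) maps to first block address 3 (binary 011), which is the example worked through in the text.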
4.2.1 Memory implementation 
The block diagram of the hardware required to implement the memory in Our scheme is shown 
in Figure 4.3. The major hardware modules are as follows: 
• Coefficient ROM of size ((N/8) + 1) locations.
• Address generator for generating the appropriate address corresponding to all the
partitioned blocks of the original coefficient memory.
• Data output generator for modifying the output of the coefficient ROM as per the
partitioned blocks of the original coefficient ROM.
• Control module, which provides select lines to the multiplexers as per the original
coefficient ROM address generated by an external counter.
The address generator module comprises a multiplexer (DA) to feed the count value to the
two's complementor module (TCMP) only during the execution of blocks II and IV of Table 4.5.
The module DA keeps all the inputs to the TCMP block at '0' for blocks I and III, thereby reducing
the switching activity. The multiplexer (ADR) chooses the address or its two's complement
depending upon the block in Table 4.5.
The data output generator contains two multiplexers, namely (DI) and (DR), to feed the
imaginary and real parts of the coefficient memory output for inversion only when required as
per Table 4.5. This approach again reduces the switching activity at the input of the inverter.
The multiplexers (R) and (I) are used to select the appropriate real and imaginary part of the
output of the coefficient ROM depending upon the blocks of Table 4.5 to generate the final real
(Ro) and imaginary (Io) memory outputs.
The control module is responsible for generating the select lines for all the multiplexers as per 
the address count. The select lines for the address and data output generator depend upon the 
block in which the current address count is located. The select lines vary from one block to the 
other. 
4.2.2 Design flow 
The discussion of the design flow is important at this stage because it helps in understanding
the results presented in the present and subsequent chapters of this thesis. The basic design
flow is illustrated with the help of the flowchart shown in Figure 4.4. The conventional and
low power architectures are defined at the register transfer level (RTL) using the Verilog hardware
description language. The functionally correct designs are then synthesised using SYNOPSYS
DesignCompiler (dc-shell) or Cadence BuildGates to convert the RTL description into a gate
level netlist. The synthesis tool also generates a delay file in standard delay format (SDF)
for more accurate gate level timing simulation. The DesignCompiler also provides a timing
constraint file in SDF format for the layout tool (Silicon Ensemble). The functionality and
timing of the designs are again verified at the gate level using the Verilog-XL simulator. In case
of any problem, the RTL code and/or the synthesis timing constraints have to be modified.
The power consumption for some designs is computed at the gate level. The power consumption
is estimated with the help of SYNOPSYS DesignPower. It uses a gate level netlist along with
the net switching activity obtained after gate level simulation to compute the dynamic power.
The switching activity is computed by defining the whole design as the toggle region in the
testbench. The toggling of each and every net is recorded in a switching activity file in
SAIF (switching activity interchange format). The accuracy of the results will improve if the
Figure 4.4: Flow chart depicting the design flow.
FFT size   Conv scheme   Others scheme   Our scheme   % reduction
64         183           228             283          -24
128        222           261             355          -36
256        286           322             406          -26
512        372           383             474          -24
1K         714           485             482          +.62
2K         867           934             623          +33
4K         3403          1121            797          +29
8K         5296          3331            1353         +59
Table 4.6: Comparison of the various schemes in terms of power.
power estimation is carried out after layout. Cadence Silicon Ensemble is used to convert the
gate level netlist into a layout with the help of the timing constraint file. The modified netlist,
the post layout SDF and the file containing the extracted net capacitances are then used to verify
the post layout functionality of the designs and also to estimate the power consumed by the designs
using DesignPower. The dynamic power comprises two components, namely internal power
and switching power. The internal power is the power consumed within the boundary of a
cell. It is basically a combination of short circuit power and the power needed for charging and
discharging the capacitances inside the cell. The switching power is the power consumed in
charging and discharging the load capacitance at the output of the cell.
4.2.3 Results 
The scheme has been implemented in register transfer level Verilog hardware description
language for different FFT lengths and then synthesized using Cadence BuildGates with the 0.35µm
Alcatel MTC 45000 CMOS technology library. Power evaluation was carried out using Synopsys
DesignPower for the circuit netlist. Gate level simulations were carried out for one million
clock cycles using a supply voltage of 3.3V and a clock frequency of 10MHz. The same procedure
was followed for the Conventional (Conv) and Ma's and Parhi's (Others) approaches. The
comparative results in terms of power and area for the different FFT lengths are given for all
the schemes in Table 4.6 and Table 4.7 respectively. It is clear from Table 4.6 and Table 4.7
that Our scheme proves to be beneficial from 1K points onwards both in terms of area and power.
The saving in area ranges from 48% to 54% whereas the power saving varies from 0.62% to
59% for longer FFTs (1K points onwards). The saving in area goes on increasing with FFT
FFT size   Conv scheme   Others scheme   Our scheme   % reduction
64         259           212             179          16
128        520           343             256          25
256        1012          605             383          37
512        2222          1070            652          39
1K         4452          2232            1168         48
2K         8729          4449            2272         49
4K         19986         8526            4391         49
8K         34890         18366           8378         54
Table 4.7: Comparison of the various schemes in terms of area. The area is expressed in
equivalent nand gates [n].
Wordlength   Conv scheme      Others scheme    Our scheme       [%] saving /Others
             area    power    area    power    area    power    area    power
             [n]     [µW]     [n]     [µW]     [n]     [µW]
8            4149    510      1538    378      912     262      41      31
12           6700    640      3501    847      1575    428      55      49
16           8729    867      4449    934      2272    623      49      33
Table 4.8: Power and area saving of Our scheme as compared to Others scheme for different
coefficient wordlengths for a 2K-point radix-2 FFT processor.
length because the additional hardware required to implement Our scheme remains almost
fixed. Table 4.8 lists the power and area consumed by the different schemes for three different
wordlengths for a 2K-point FFT processor. It is clear from this table that the maximum power
saving of 49% is obtained for an intermediate wordlength of 12 bits. This indicates that this
scheme is more useful for an intermediate wordlength of 12 bits than for the extreme wordlengths
of 8 or 16 bits.
4.3 Coefficient addressing scheme 
The coefficient memory addressing is important for enhancing the performance of FFT
processors [14]. The power consumption is reduced by minimising the hardware required to carry
out coefficient address generation in a radix-2 single butterfly FFT processor. The most
popular and computationally efficient method of coefficient address generation was proposed by
Cohen [15]. According to this method, address generation occurs through the application of
variable shifts to address lines through a cascade of barrel shifters. The work in [14] modified
the data generation scheme proposed by Cohen while retaining the coefficient address
generation method. This section presents a novel scheme for coefficient address generation in an
FFT processor. The proposed addressing scheme involves manipulation of the address lines, taking
into consideration the coefficient addresses required at the various FFT stages. It is
demonstrated in this section that the scheme can be implemented more efficiently and with much
less hardware than Cohen's scheme, leading to a more power and area efficient realisation of FFT
processors.
4.3.1 Detailed design 
The radix-2 FFT, shown in Figure 4.2 for N equal to 16, is an efficient way to compute an
N-point DFT. It has been assumed that the data inputs are arranged in bit-reversed order and
the outputs are produced in normal order. The basic operations in the FFT are multiplication of
the complex data inputs by the FFT coefficients at each stage in the signal flowgraph, followed
by their summation or subtraction, and the associated data and coefficient address generation. The
coefficient and data address generation logic are required to be fast, area and power efficient
in order to realise fast, miniaturised and low power FFT processors which could be integrated
into complex VLSI systems. The coefficient address generation in an FFT is accomplished
by partitioning a p-bit counter into two sections as shown in Figure 4.5. The more significant
counter section comprises the ($C_{p-1}$, ..., $C_b$) bits whereas the lower section has the
($C_{b-1}$, ..., $C_0$) bits. Let the more and less significant counter section bits be represented
by a (p - b)-bit array x and a b-bit array y respectively. The coefficient address (CA) is then
expressed as follows:
$$CA = F(x, y) \qquad (4.3)$$
Figure 4.5: Architecture of a single butterfly based radix-2 FFT processor. The coefficient
address generator is enclosed by a dotted rectangle.
Let us also assume that i is the decimal equivalent of x, which is always greater than or equal to
zero. The coefficient address is then given by
$$F(x, y) = I_i$$
Where $I_i$ is a b-bit array, which is expressed in terms of the array y according to the following
set of equations:
$$I_0 = (0 \ldots 0)$$
$$I_1 = (y[b-1]\, 0 \ldots 0)$$
$$I_2 = (y[b-1]\, y[b-2]\, 0 \ldots 0)$$
$$\vdots$$
$$I_{b-1} = (y[b-1]\, y[b-2] \ldots y[1]\, 0)$$
$$I_b = y$$
$$I_{b+1} = (X \ldots X)$$
$$\vdots$$
$$I_{2^{(p-b)}-1} = (X \ldots X)$$
Where N = FFT size, $b = \log_2(N/2)$, X = don't care bit and $p = b + \log_2(\log_2(N))$
(rounded up to the nearest integer).
The set of operations described by equation 4.4 could be realised in hardware by using a single
multiplexer based coefficient address generation logic as shown in Figure 4.5. The input
channels of the multiplexer are set according to $I_i$ and are selected as per x. In order to
illustrate this scheme, consider a 16-point radix-2 FFT signal flowgraph, shown in Figure 4.2, as
an example. In the first stage, only the first coefficient, with an address value 0 ($W^0$), is
required for all the eight butterflies. In the second stage, the $W^0$ and $W^4$ coefficients are
required. In the third stage, the $W^0$, $W^2$, $W^4$ and $W^6$ coefficients are needed, whereas in
the last stage all the coefficients from $W^0$ to $W^7$ are required for the butterfly operations.
In this example, the coefficient memory has eight locations for storing coefficients ($W^0$ to
$W^7$) and hence the lower counter section should comprise only three bits $C_2$, $C_1$ and $C_0$.
The more significant counter section, which tracks the four stages of the 16-point FFT, has only
two bits, namely $C_4$ and $C_3$. It is clear from Figure 4.2 that in the first stage the
coefficient address remains equal to '000' (only $W^0$) irrespective of the lower section counter
value. The first stage is represented by the '00' combination of the higher section bits and hence
the input channel of the multiplexer corresponding to this bit combination must always remain at
'000'. In the second stage, the address '000' ($W^0$) is required for the first four computed
butterflies and the address '100' ($W^4$) is required for the next four butterflies. The coefficient
address in the second stage '01' of the FFT therefore depends only on the status of the $C_2$ bit of
the counter. This means that the second channel input should be set to '$C_2$00'.
The third stage '10' of the FFT requires the generation of four different addresses: '000' ($W^0$),
'010' ($W^2$), '100' ($W^4$) and '110' ($W^6$). It is clear that these addresses differ only in the
$C_2C_1$ combination and that $C_0$ always remains equal to zero. It means that the third channel of
the multiplexer should be connected to '$C_2C_1$0' to accomplish this task. The last stage '11' needs
all the coefficients and hence the last channel must be connected directly to '$C_2C_1C_0$'. The same
technique can be very easily extended to any FFT size by following the formulations described
earlier. This work is reported in [86, 87].
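The multiplexer behaviour described above can be mimicked in a few lines of software. The sketch below is a behavioural model only, assuming the counter is split into a stage field x (upper bits) and a butterfly field y (lower b bits); the function name coefficient_address is illustrative, and the don't care channels are arbitrarily modelled as passing y through unchanged.

```python
def coefficient_address(counter, n_points):
    """Coefficient address selected by the stage bits x from the butterfly bits y.

    Channel i keeps the i most significant bits of y and forces the rest to zero,
    which is what the multiplexer inputs I_0 ... I_b provide.
    """
    stages = n_points.bit_length() - 1       # log2(N)
    b = stages - 1                           # log2(N/2), width of y
    x = counter >> b                         # more significant counter section
    y = counter & ((1 << b) - 1)             # less significant counter section
    keep = min(x, b)                         # don't care channels (x > b): pass y (assumption)
    mask = ((1 << keep) - 1) << (b - keep)   # ones in the top `keep` bit positions
    return y & mask


if __name__ == "__main__":
    # 16-point example: four stages of eight butterflies each.
    for stage in range(4):
        addresses = [coefficient_address((stage << 3) | j, 16) for j in range(8)]
        print("stage", stage, addresses)
```

No shifter is involved: each stage only selects which of the counter bits are passed through and which are tied to zero.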
4.3.2 Results 
The new coefficient addressing scheme (Our) has been synthesized for different FFT lengths
from the RTL level Verilog description using Cadence BuildGates with the 0.35µm Alcatel MTC
45000 CMOS technology library. The power based gate level simulations were carried out at
a clock frequency of 100MHz and at a supply voltage of 3.3V for one million clock cycles
using Synopsys DesignPower. The same procedure was followed for Cohen's scheme and
the comparative results in terms of power and area for the different FFT lengths are given in
Figure 4.7: Comparison of the two schemes in terms of area. The area is expressed in
equivalent nand gates.
Figure 4.6 and Figure 4.7 respectively. The power consumption, given in Figure 4.6, is the
average power consumed by the respective circuit per clock cycle. It is evident from Figure 4.6
that the power consumption of the coefficient address generation logic depends on the switching
activity of the inputs and outputs and not on the FFT size. In Figure 4.7, the area of the
coefficient address generation logic in the Cohen scheme is slightly smaller for a 2K-point FFT
than for a 1K-point FFT on account of the optimisation of the subtractor with one fixed input by
the synthesis tool. It is clear from Figure 4.8 that Our scheme results in power and area savings
in the range of 80% to 90% as compared to the Cohen scheme for almost all FFT lengths.
Figure 4.8: Percentage reduction in power and area of Our scheme over Cohen scheme.
4.4 Summary
This chapter described three schemes, namely the order based processing scheme, the coefficient
memory reduction scheme and the coefficient addressing scheme, for reducing the power
consumption in an FFT processor. The order based processing scheme reduced the switching
activity by more than 50% as compared to only around 20% using conventional order based
processing. This significant reduction in switching activity will lead to considerable power
savings in both the single butterfly and the pipelined FFT processors described in chapters 5 and 6
respectively. The single butterfly and the pipelined FFT processors described in the subsequent
chapters are based on this order based processing scheme.
The remaining two schemes are based on hardware size reduction of an FFT processor, resulting
in power saving. The coefficient memory reduction scheme required only ((N/8) + 1)
memory locations instead of N/4 for storing the coefficient set of an N-point FFT processor.
This scheme has resulted in both power and area savings for long (more than 1K-point)
FFT processors used mainly in OFDM applications. The coefficient addressing scheme
implemented the coefficient address generation logic with the help of a single multiplexer instead
of a cascade of barrel shifters. This led to around 80% power and area savings with respect to
the Cohen scheme in FFT processors of all lengths. The next chapter describes a low power
architecture of a single butterfly radix-2 FFT processor to demonstrate the effectiveness of the
order based processing scheme.
Chapter 5
Low power single butterfly FFT
processor architecture
The basic radix-2 FFT processor can be realised with the help of a single butterfly architecture
for low throughput, chip area limited applications. This chapter presents a low power radix-2
single butterfly FFT processor architecture which is based on the proposed order based
processing scheme described in the previous chapter.
This chapter is organised into two main sections. The first section starts with the introduction
to a conventional radix-2 single butterfly FFT processor architecture proposed by Cohen [15].
This single butterfly architecture is chosen because its control logic for coefficient and data
address generation is very simple [14, 83]. The second section describes the proposed low
power radix-2 single butterfly FFT processor architecture. The conventional radix-2 single
butterfly FFT processor architecture is modified to support the order based processing in the
proposed low power radix-2 single butterfly FFT processor architecture.
5.1 Conventional radix-2 single butterfly FFT processor
architecture
The basic operations of a 16-point radix-2 FFT processor are given by its signal flowgraph shown
in Figure 5.1. The complex inputs (X) to the FFT are in bit-reversed order but its complex
outputs (FFTO) are in normal order. The signal flowgraph for a 16-point FFT is divided into
four stages. The first stage is represented by i=0 and the last stage by i=3. The butterfly
operation is the most important operation in the FFT. Let < s, t > represent the butterfly
operation; then the inputs and outputs of the butterfly are related as follows:
$$XO(s) = X(s) + W \cdot X(t) \qquad (5.1)$$
$$XO(t) = X(s) - W \cdot X(t) \qquad (5.2)$$
Figure 5.1: Signal flow graph of a 16-point FFT. 'M' indicates the memory location.
A butterfly accepts two complex inputs X(s) and X(t) and generates two complex outputs
XO(s) and XO(t). The FFT coefficients are represented by W. The butterfly operation in the
signal flowgraph can be identified by a number (j) above a line intersection in every stage of
the signal flowgraph. There are eight butterflies in every stage and hence 'j' varies from 0 to 7.
The number on the left hand side of the intersection indicates the coefficient value used for that
butterfly (0 means $W^0$ and so on).
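A butterfly can be written down directly in software. The sketch below assumes the usual radix-2 decimation-in-time form, in which the first output is the sum and the second output is the difference of X(s) and the weighted X(t); the function names butterfly and twiddle are illustrative.

```python
import cmath


def twiddle(k, n_points):
    """FFT coefficient W_N^k = exp(-j*2*pi*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / n_points)


def butterfly(x_s, x_t, w):
    """Radix-2 butterfly: returns (XO(s), XO(t)) for inputs X(s), X(t) and coefficient W."""
    product = w * x_t              # the single complex multiplication per butterfly
    return x_s + product, x_s - product


if __name__ == "__main__":
    # With W^0 = 1 the butterfly is a 2-point DFT of its inputs.
    print(butterfly(3 + 0j, 1 + 0j, twiddle(0, 16)))
```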
5.1.1 Memory organisation in a single butterfly FFT processor 
A single butterfly 16-point FFT processor architecture comprises a butterfly unit to realise
equations 5.1 and 5.2, sixteen memory locations to store the inputs and intermediate outputs
after every stage of a 16-point FFT, and associated address generation logic to feed the proper
data and coefficient inputs to the butterfly. All the butterflies of the first stage must be computed
before the computation of the second stage butterflies. All the inputs in the memory are
replaced by their corresponding intermediate outputs after the first stage. This procedure has
to be repeated for the subsequent stages till the final output is obtained. The detailed sequence
of butterfly execution from the signal flowgraph shown in Figure 5.1 according to the butterfly
numbering (j) is as follows:
Stage1: <0,1> <2,3> <4,5> <6,7> <8,9> <10,11> <12,13> <14,15>
Stage2: <0,2> <1,3> <4,6> <5,7> <8,10> <9,11> <12,14> <13,15>
Stage3: <0,4> <1,5> <2,6> <3,7> <8,12> <9,13> <10,14> <11,15>
Stage4: <0,8> <1,9> <2,10> <3,11> <4,12> <5,13> <6,14> <7,15>
The actual order followed for low power, to reduce the switching activity of the coefficient
during the execution of successive butterflies in stages 2 and 3, is as follows:
Stage1: <0,1> <2,3> <4,5> <6,7> <8,9> <10,11> <12,13> <14,15>
Stage2: <0,2> <4,6> <8,10> <12,14> <1,3> <5,7> <9,11> <13,15>
Stage3: <0,4> <8,12> <1,5> <9,13> <2,6> <10,14> <3,7> <11,15>
Stage4: <0,8> <1,9> <2,10> <3,11> <4,12> <5,13> <6,14> <7,15>
One more reason for the modified order of butterfly execution is that the j-th butterfly in the
i-th iteration is < s, t > where:
$$s = \mathrm{ROTATE}_n(2j, i), \qquad n = \log_2 N \qquad (5.3)$$
$$t = \mathrm{ROTATE}_n(2j+1, i), \qquad i = 0, 1, \ldots, (n-1),\; j = 0, 1, 2, \ldots, (N/2 - 1) \qquad (5.4)$$
Where $\mathrm{ROTATE}_n(X, m)$ is the value of X rotated left by 'm' bits within 'n' bits, for
instance $\mathrm{ROTATE}_4(13, 3) = 14$.
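The rotation based schedule of equations 5.3 and 5.4 can be reproduced with a short sketch. The function names rotate_left and butterfly_pairs are illustrative; the stage index runs from 0 (Stage1 in the listing above) to n - 1.

```python
def rotate_left(value, shift, width):
    """Rotate `value` left by `shift` bits within a `width`-bit word."""
    shift %= width
    mask = (1 << width) - 1
    return ((value << shift) | (value >> (width - shift))) & mask


def butterfly_pairs(n_points, stage):
    """Operand addresses <s, t> of butterflies j = 0 .. N/2 - 1 in the given stage."""
    width = n_points.bit_length() - 1                 # n = log2(N)
    return [(rotate_left(2 * j, stage, width),        # s: a '0' appended to j, then rotated
             rotate_left(2 * j + 1, stage, width))    # t: a '1' appended to j, then rotated
            for j in range(n_points // 2)]


if __name__ == "__main__":
    for stage in range(4):
        print("Stage%d:" % (stage + 1), butterfly_pairs(16, stage))
```

For N = 16 this generates the low power execution order listed above, including the interleaved sequences of stages 2 and 3.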
The author in [88] demonstrated that for every butterfly the two indices (s and t) differ in their
parity. The parity of X is defined as zero if the number of 1's in its binary representation is
even and one otherwise. This observation can be exploited by organising the N-point memory
into two banks according to the parity of the addresses. During any clock cycle, the
two points X(s) and X(t) can be accessed in parallel from the two memory banks because 's' and
't' differ in parity and the two points are therefore always stored in different banks. The
parallel memory block organisation is shown in Figure 5.2. The memory is partitioned into two
Figure 5.2: Parallel memory block organisation on the basis of parity bit.
banks, namely RAME and RAMO. Dual port RAMs are used in the design to read the data for
each butterfly operation and to write back the butterfly outputs to the same memory location in
the same clock cycle. The data interchange blocks CDI and GDO and the address interchange
block CDA are controlled by the PARITY-BIT. These blocks direct the data and the address to the
appropriate memory bank depending upon the parity. The PARITY-BIT is generated by the
PARITY block. The GDI block receives data from the input multiplexer block MUXIN.
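The parity property that makes the two-bank organisation possible can be confirmed exhaustively for small transforms. This is a minimal sketch with illustrative function names; it only checks the addressing property and does not model the dual port RAMs or the interchange blocks.

```python
def parity(value):
    """0 if the binary representation has an even number of 1's, otherwise 1."""
    return bin(value).count("1") & 1


def rotate_left(value, shift, width):
    """Rotate `value` left by `shift` bits within a `width`-bit word."""
    shift %= width
    mask = (1 << width) - 1
    return ((value << shift) | (value >> (width - shift))) & mask


def split_by_parity(n_points):
    """Addresses held by the even-parity and odd-parity memory banks."""
    even_bank = [a for a in range(n_points) if parity(a) == 0]
    odd_bank = [a for a in range(n_points) if parity(a) == 1]
    return even_bank, odd_bank


def all_pairs_hit_both_banks(n_points):
    """True if s and t differ in parity for every butterfly of every stage."""
    width = n_points.bit_length() - 1
    for stage in range(width):
        for j in range(n_points // 2):
            s = rotate_left(2 * j, stage, width)
            t = rotate_left(2 * j + 1, stage, width)
            if parity(s) == parity(t):
                return False
    return True


if __name__ == "__main__":
    print(split_by_parity(16))
    print(all_pairs_hit_both_banks(16))
```

The check succeeds because 2j and 2j+1 differ in exactly one bit and a rotation does not change the number of 1's in a word.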
5.1.2 Architecture of the FFT processor 
The architecture of the conventional single butterfly radix-2 N-point FFT processor is shown
in Figure 5.3. The in-place strategy (inputs and intermediate outputs are stored in the same
memory location after every stage) is employed to reduce the memory size for an N-point
transform to only 'N' locations. The conventional radix-2 FFT processor consists of a parallel
memory block having N memory locations which are organised in the form of two memory
banks for parallel access, a butterfly (BUTTERFLY) block to implement the basic FFT operations,
an input multiplexer MUXIN to control the loading of the memory, an FSM for proper sequencing
of operation, a ROM (CROM) for storing the coefficients and the address generation block for
generating the addresses of coefficient and data.
The data address generation logic is based on two rotation blocks namely ROT0 and ROT1 to
realise equations 5.3 and 5.4. A least significant bit having value '0' is inserted at the input of
the ROT0 module and a '1' is inserted at the input of ROT1 to satisfy the shifting and bit insertion
requirements of equations 5.3 and 5.4 respectively prior to the rotation operation.

Figure 5.3: Conventional single butterfly radix-2 FFT processor architecture.

The parity generation block (PARITY) makes its decision on the basis of the inputs to the
rotation block ROT0 rather than on its output. This is made possible because the parity remains the
same after the rotate operation. This approach avoids the unnecessary delay in parity generation
after rotation. The coefficient address generation logic is based on the proposed single
multiplexer (MUXC) implementation for all the FFT stages as described in the previous chapter. The
fixed FFT coefficients are stored in the ROM (CROM).
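
The address and parity behaviour described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: equations 5.3 and 5.4 are not reproduced in this section, so the rotate-left-by-i convention, the function names and the mapping of even parity to RAME and odd parity to RAMO are assumptions made for illustration. The two properties it exercises (the two addresses always differ in parity, and the parity can be taken before the rotation) do not depend on the rotation amount.

```python
def rotl(value, shift, width):
    """Rotate a width-bit value left by `shift` positions."""
    shift %= width
    mask = (1 << width) - 1
    return ((value << shift) | (value >> (width - shift))) & mask


def parity(value):
    """0 if the number of 1 bits is even, 1 otherwise."""
    return bin(value).count("1") & 1


def butterfly_addresses(i, j, width):
    """Data addresses (s, t) of butterfly j in stage i.

    Assumption: a '0' or '1' LSB is appended to j and the result is
    rotated left by i positions, so s and t differ only in one bit.
    """
    s = rotl((j << 1) | 0, i, width)  # ROT0 path: LSB '0' inserted
    t = rotl((j << 1) | 1, i, width)  # ROT1 path: LSB '1' inserted
    return s, t


def parity_bit(j):
    """PARITY-BIT taken from the ROT0 input, i.e. before rotation."""
    return parity(j << 1)


N = 16
WIDTH = N.bit_length() - 1  # 4 address bits for N = 16
for i in range(WIDTH):
    for j in range(N // 2):
        s, t = butterfly_addresses(i, j, WIDTH)
        # Assumed naming: RAME holds even-parity, RAMO odd-parity addresses.
        bank_s = "RAMO" if parity(s) else "RAME"
        bank_t = "RAMO" if parity(t) else "RAME"
        print(f"stage {i} butterfly {j}: s={s:2d} ({bank_s}) t={t:2d} ({bank_t})")
```

Because s and t differ in exactly one bit, one of them always has even parity and the other odd parity, which is what allows both operands to be fetched in the same clock cycle.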
The overall operation of the FFT is controlled by the finite state machine (FSM). The FSM
comprises a counter which is partitioned into two sections. The higher counter section HS,
denoted by 'i', keeps track of the various FFT stages whereas the lower counter section LS,
denoted by 'j', keeps track of the butterflies within every stage. The FSM operates in
two phases. In the first or input loading phase, the FSM loads bit-reversed input data DATA-IN
into the memory banks as well as output data values corresponding to the previous input.
Two data values are stored in the memory (one in each memory bank) in every clock cycle. A
multiplexer (MUXIN) is needed to select the loading of external input (DATA-IN) or butterfly
output (XO(s), XO(t)) to the memory. The select line SEL of MUXIN is controlled by the FSM
and selects DATA-IN in the input loading phase. In the second
or FFT execution phase, the FSM clears the counter and then initiates the FFT computation by
incrementing it. MUXIN selects the butterfly output in this phase. These intermediate butterfly
outputs are loaded into the memory banks. The FFT execution phase ends after the execution
of all the butterflies in all the FFT stages. For instance, the execution phase in a 16-point FFT
processor ends after 32 butterfly operations (eight butterflies in each of the four stages), which
corresponds to 32 clock cycles. The overall
counter value has to be monitored in the two phases to keep track of their completion. The
counter should be cleared after the completion of the FFT execution phase. The two phases are
to be executed in a cyclic fashion till all the FFT blocks are processed.

Figure 5.4: Conventional butterfly architecture.
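
The two-section counter of the execution phase can be sketched in a few lines of Python. This is a behavioural illustration only: the function name is hypothetical, and the assumption that HS is simply the upper bits and LS the lower log2(N/2) bits of one binary counter is ours, not a statement of the actual FSM implementation.

```python
def fft_execution_schedule(n_points):
    """Return the (stage i, butterfly j) pair visited in each clock cycle.

    Assumes n_points is a power of two and that one butterfly is
    executed per clock cycle, as in the single butterfly architecture.
    """
    stages = n_points.bit_length() - 1       # log2(N) FFT stages
    ls_bits = stages - 1                     # LS counts the N/2 butterflies
    total_cycles = stages * (n_points // 2)  # (N/2) * log2(N)
    schedule = []
    for count in range(total_cycles):
        i = count >> ls_bits                 # higher section HS: stage
        j = count & ((1 << ls_bits) - 1)     # lower section LS: butterfly
        schedule.append((i, j))
    return schedule


schedule = fft_execution_schedule(16)
print(len(schedule), schedule[0], schedule[-1])
```

For N = 16 the schedule has 32 entries, in line with the 32 clock cycles quoted in the text.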
The block diagram of the conventional butterfly is shown in Figure 5.4. A butterfly of a
radix-2 decimation-in-time FFT has two complex inputs, namely $X(s) = X_r(s) + jX_i(s)$ and
$X(t) = X_r(t) + jX_i(t)$, and two complex outputs, $XO(s) = XO_r(s) + jXO_i(s)$ and
$XO(t) = XO_r(t) + jXO_i(t)$. The complex FFT coefficients are represented by $W = W_r + jW_i$.
All the inputs are 16 bits wide. The butterfly outputs and inputs are related by the following
equations:

$XO_r(s) = X_r(s) + (X_r(t) W_r - X_i(t) W_i)$
$XO_i(s) = X_i(s) + (X_i(t) W_r + X_r(t) W_i)$
$XO_r(t) = X_r(s) - (X_r(t) W_r - X_i(t) W_i)$
$XO_i(t) = X_i(s) - (X_i(t) W_r + X_r(t) W_i)$

It is important to note that only 20 bits of each multiplier output are applied to the second stage
adder and subtractor instead of 32 bits to limit the hardware size. Since the outputs are 16 bits
wide, 20 bits of intermediate accuracy has been found to be a judicious choice between the
noise tolerated and the hardware size [64].
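
As a quick cross-check of the four real-valued butterfly equations, the following Python sketch evaluates them on (real, imaginary) pairs and compares the result with ordinary complex arithmetic. It works in floating point and deliberately ignores the 16-bit and 20-bit word lengths of the hardware; the function name is hypothetical.

```python
def butterfly(xs, xt, w):
    """Radix-2 DIT butterfly on (real, imag) tuples.

    Uses four real multiplications, as in the conventional butterfly.
    """
    xs_r, xs_i = xs
    xt_r, xt_i = xt
    w_r, w_i = w
    prod_r = xt_r * w_r - xt_i * w_i   # real part of X(t) * W
    prod_i = xt_i * w_r + xt_r * w_i   # imaginary part of X(t) * W
    xo_s = (xs_r + prod_r, xs_i + prod_i)
    xo_t = (xs_r - prod_r, xs_i - prod_i)
    return xo_s, xo_t


# Compare against complex arithmetic: XO(s) = X(s) + W X(t), XO(t) = X(s) - W X(t)
xs, xt, w = (1.0, 2.0), (3.0, -1.0), (0.0, -1.0)
xo_s, xo_t = butterfly(xs, xt, w)
ref_s = complex(*xs) + complex(*w) * complex(*xt)
ref_t = complex(*xs) - complex(*w) * complex(*xt)
print(xo_s, ref_s)
print(xo_t, ref_t)
```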
5.2 Low power radix-2 FFT processor architecture 
The conventional FFT processor architecture, proposed by Cohen [15], has been modified to
support the order based processing scheme, which has been discussed
in detail in the previous chapter. The FFT processor based on the order based processing scheme
is called the ordered radix-2 FFT processor. An ordered radix-2 FFT processor core, shown in
Figure 5.5, comprises the following components:
• Ordered butterfly module.
• Two dual port memories for holding data in each FFT stage.
• ROM for holding the fixed coefficients.
• Address generation logic for generating addresses for both the coefficients and data.
• Control logic in the form of a finite state machine (FSM).
• Ordering logic in the form of a look up table (LUT) and a multiplexer to support order
based processing.
An ordered butterfly module accepts complex data at every clock cycle in order to produce
complex outputs during the same cycle. Each memory location is 32 bits wide for storing both
the 16-bit real and imaginary parts of data. The memory is organised into two banks based on
the parity of the address bits in order to generate the addresses of data required at successive FFT
stages. The inputs are read from the RAMs (both RAME and RAMO) into the butterfly in the
same cycle as the outputs are written back to the RAMs, minimising the number of clock
cycles [15].

Figure 5.5: Low power single butterfly ordered radix-2 FFT processor architecture.

The order based processing block (Ordering logic) consists of an LUT (RROM) and a
multiplexer RMUX. RROM stores the addresses of the ordered coefficient set. These addresses
are active only in the last stage of the FFT, where order based processing is
incorporated. RMUX is used to select between the ordered addresses for the coefficient and data
and the conventional addresses. The ordered addresses are used only in the last stage whereas
the conventional addresses are used in all the previous FFT stages. The select signal (ASEL)
for RMUX can be generated by the FSM using the bits of the more significant counter section.
The order based processing scheme is most effective in the last stage of the signal flowgraph of
the radix-2 decimation-in-time FFT algorithm where the coefficient switching activity is most
intense (the coefficients are different for each butterfly, unlike other FFT stages, as shown in
Figure 5.1). The investigations revealed that the power saving is maximum if the order based
processing scheme is limited only to the last FFT stage. The hardware overhead to support
order based processing in the earlier stages outweighs the power saving obtained in these
stages. This work is reported in [81].

Figure 5.6: Low power butterfly ordered architecture.
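
To make the role of RROM more concrete, the Python sketch below builds one possible ordered address sequence for the last-stage coefficients of a 32-point FFT. It is not the thesis' ordering procedure, which is detailed in the previous chapter: the greedy nearest-neighbour search, the Q15 quantisation, the choice of coefficient 0 as the starting point and all function names are assumptions made for illustration. It only follows the general idea of ordering on the imaginary part first and then choosing the real part or its two's complement, with a flag bit marking the latter.

```python
import math

WIDTH = 16
MASK = (1 << WIDTH) - 1


def to_word(x):
    """Quantise x in [-1, 1] to a 16-bit two's complement word (assumed Q15)."""
    q = round(x * (1 << 15))
    q = max(-(1 << 15), min((1 << 15) - 1, q))
    return q & MASK


def hamming(a, b):
    """Number of bit positions in which two 16-bit words differ."""
    return bin((a ^ b) & MASK).count("1")


def negate(word):
    """Two's complement of a 16-bit word."""
    return (-word) & MASK


def order_coefficients(coeffs):
    """Greedy ordering of (real, imag) words; returns (address, real, imag, flag)."""
    remaining = list(range(1, len(coeffs)))
    ordered = [(0, coeffs[0][0], coeffs[0][1], 0)]
    while remaining:
        _, prev_real, prev_imag, _ = ordered[-1]
        # Pick the coefficient whose imaginary part is closest to the previous one.
        nxt = min(remaining, key=lambda k: hamming(coeffs[k][1], prev_imag))
        remaining.remove(nxt)
        real, imag = coeffs[nxt]
        # Use the two's complement of the real part if that toggles fewer bits.
        if hamming(negate(real), prev_real) < hamming(real, prev_real):
            ordered.append((nxt, negate(real), imag, 1))
        else:
            ordered.append((nxt, real, imag, 0))
    return ordered


def switching_activity(words):
    """Total Hamming distance between successive words."""
    return sum(hamming(a, b) for a, b in zip(words, words[1:]))


N = 32
coeffs = [(to_word(math.cos(2 * math.pi * k / N)),
           to_word(-math.sin(2 * math.pi * k / N))) for k in range(N // 2)]
ordered = order_coefficients(coeffs)
natural = (switching_activity([r for r, _ in coeffs])
           + switching_activity([i for _, i in coeffs]))
reordered = (switching_activity([r for _, r, _, _ in ordered])
             + switching_activity([i for _, _, i, _ in ordered]))
print("address sequence for RROM:", [k for k, _, _, _ in ordered])
print("switching activity (natural, ordered):", natural, reordered)
```

The address sequence printed first corresponds to what a look up table such as RROM would hold, and the flag is the bit that would be stored alongside each real coefficient.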
The hardware overhead required to support the order based processing scheme is in the form of
an additional ROM (RROM) having N/2 locations with a word width of log2(N/2) bits, a 2:1
channel multiplexer (RMUX) and an array of 20 Ex-OR gates in the modified multiplier module
of the ordered butterfly for the real coefficients only. It is clear that only the ROM size and, to a
lesser extent, its word width increase with the FFT size. The ROM block consumes very little
power and hence introduces only a small overhead as compared to the conventional approach. The
ROM and the multiplexer blocks are also required in the Others ordering approaches and hence
the hardware overhead of Our approach with respect to Others is only the insignificant
Ex-OR gates within the multiplier module. The ordered butterfly module for the low power
radix-2 FFT processor architecture is discussed in the next section.
5.2.1 Ordered butterfly module 
The block diagram of the ordered butterfly hardware for the low power architecture is shown in
Figure 5.6. The low power order based processing scheme requires selective complementation:
the correct product is obtained by selectively complementing the multiplier output
corresponding to the real coefficients. The ordered butterfly therefore needs a dedicated
multiplication module, instead of a plain two's complement multiplier, for the real part of the
coefficient (Wr) to support Our order based processing scheme. This module is discussed in the
next section.

Figure 5.7: Architecture of the multiplication module for the low power butterfly.
5.2.2 Multiplication module 
A simple two's complement multiplier cannot be used for the real coefficients because the real
coefficients are used in two's complement form as well. A flag bit is therefore stored along with
each real coefficient to indicate the form of the real coefficient. The multiplication module
corresponding to the real coefficients is obtained by adding 20 controlled inverters (Ex-ORs) to
the output of a two's complement multiplier, as shown in Figure 5.7. The flag bit controls the
Ex-ORs and therefore the final output of the multiplication module (FO). FO is the complement
of the 20-bit multiplier output if the flag bit is '1'; otherwise it is the same as the
multiplier output. The multiplier output is limited to only 20 bits instead of 32 bits. There is
a difference of unity at the LSB position between FO and the actual multiplier's 20-bit output
in the conventional approach. This difference arises only in the rare cases where all the lower 11
bits of the actual multiplier's 32-bit output are zero and the real coefficient is represented
in its two's complement form. This is because only the upper 20-bit output of the multiplier
is complemented instead of two's complementing the whole 32-bit output. This is done
to avoid the use of a 32-bit adder to generate the two's complement of the multiplier output,
thereby saving power. Moreover, it does not lead to any error because the 16-bit output of
the butterfly at each FFT stage is halved to avoid overflow. This approach has been verified
extensively by comparing the outputs of different length FFTs with random data.
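
The claim about the unit difference at the LSB can be checked numerically. The Python sketch below models the multiplication module as a two's complement multiplier followed by 20 flag-controlled Ex-OR gates on bits [30:11] of the product, and compares it with a conventional multiplier fed with the uncomplemented coefficient. The function names and the choice of bits [30:11] as the retained 20 bits (taken from the figure labels) are assumptions of this sketch.

```python
import random

MASK20 = (1 << 20) - 1


def upper20(product):
    """Bits [30:11] of a 32-bit two's complement product."""
    return (product >> 11) & MASK20


def multiplication_module(x, w_stored, flag):
    """Two's complement multiplier followed by 20 flag-controlled Ex-ORs."""
    bits = upper20(x * w_stored)
    return bits ^ MASK20 if flag else bits


def conventional(x, w):
    """Retained 20 bits of the conventional multiplier output."""
    return upper20(x * w)


random.seed(0)
unit_differences = 0
trials = 100000
for _ in range(trials):
    x = random.randint(-32768, 32767)
    w = random.randint(-32767, 32767)
    # The coefficient is stored in two's complement form (-w) with flag = 1.
    fo = multiplication_module(x, -w, flag=1)
    ref = conventional(x, w)
    diff = (ref - fo) & MASK20
    low_bits_zero = ((x * -w) & 0x7FF) == 0
    # The outputs differ by one LSB only when the lower 11 product bits are zero.
    assert diff == (1 if low_bits_zero else 0)
    unit_differences += diff
print(f"{unit_differences} unit differences in {trials} random products")
```

The assertion inside the loop encodes the statement in the text: complementing only the retained bits equals the exact two's complement unless the discarded lower 11 bits are all zero.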
5.3 Results 
A number of FFT cores of varying sizes have been designed at the register transfer level (RTL)
using the Verilog hardware description language. The cores were synthesised using Synopsys
DesignCompiler with the 0.35µ Alcatel MTC45000 CMOS technology library. In order to evaluate
the performance of the synthesised cores, gate level netlist simulations were performed using the
Cadence Verilog-XL simulator for 1000 FFT blocks of uniformly distributed random input data
samples. The switching activity information obtained from the gate level netlist simulations
was then fed into Synopsys DesignPower for power analysis. The power analysis was
performed at a clock frequency of 10 MHz and a supply voltage of 3.3 V. The same procedure is
adopted for the Conv and Others cores for a fair comparison. Table 5.1 depicts the power
consumption comparison for the different FFT cores obtained by following the conventional (Conv),
Others and Our approaches for the Non-Booth-coded Wallace tree type multiplier. It is evident
from the table that the power saving of Our scheme ranges from 1% to 25% for 512-point to
16-point FFT processors respectively over the Others approaches. The percentage power saving
continues to reduce for longer FFTs because Our scheme is directed at reducing power
by lowering the switching activity at the coefficient inputs of the multipliers in the butterfly
structure. The butterfly complexity in an FFT remains fixed with FFT length whereas the RAM
size goes on increasing. This means that the power consumed in the butterfly increases only slightly
with FFT length as compared to the power consumed in the RAM blocks. This results in a
lowering of the percentage savings in power for longer FFTs. The Others scheme leads to no power
savings in most cases because the switching activity reduction is small compared to the
hardware overhead required to support the scheme. This is not true for Our scheme due to the
significant reduction in the switching activity of the ordered coefficient set. Table 5.2 lists the
power consumed by the various blocks of the 64-point FFT processor core. It can be concluded
from the table that the butterfly and the RAM contribute most to the power consumption. The
ROM and other modules consume less power. Our scheme leads to power savings both in the
internal cells as well as on the nets. It has been found that the relative performance of Our
approach remains the same for different data sets.

FFT length   Conv     Others   Our      Power saving (%)
16           67.60    68.44    51.49    25
32           84.55    86.82    75.68    13
64           104.93   105.88   95.72    10
128          213.28   210.34   198.89   7
256          355.27   354.74   343.80   3
512          864.35   863.80   855.21   1
Table 5.1: Power consumption comparison of FFT cores.

The area overhead of Our 64-point single butterfly FFT is 0.3% and 0.01% with respect to the Conv
and Others FFTs respectively. This means that the area overhead is negligible as compared to
the power saving.

Block                         Conv     Others   Our
BUTTERFLY                     22.84    22.82    18.25
RAME                          7.77     7.82     7.84
RAMO                          7.77     7.80     7.82
MUXIN                         1.22     1.25     1.20
CDI                           0.40     0.41     0.41
CDO                           0.38     0.38     0.38
FSM                           0.05     0.05     0.05
RROM                          -        0.05     0.05
CROM                          0.03     0.04     0.03
ROT1                          0.01     0.01     0.01
ROT0                          0.01     0.01     0.01
CAI                           0.02     0.02     0.02
MUXC                          0.002    0.002    0.002
RMUX                          -        0.007    0.007
PARITY                        0.008    0.009    0.009
Internal cell power (IP)      40.51    40.68    36.08
Net switching power (NP)      64.42    65.20    59.64
Total FFT power (TP)=NP+IP    104.93   105.88   95.72
Table 5.2: Power consumption of the different cells along with net switching power of a
64-point FFT core.
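
The figures of Table 5.2 can be cross-checked with a few lines of Python. The sketch below only re-adds the tabulated values (in the units of the table) and derives the shares discussed in the text; the dictionary layout is ours, and the "-" entries of the Conv column are treated as zero.

```python
# Cell powers from Table 5.2 as (Conv, Others, Our); "-" entries taken as 0.
cells = {
    "BUTTERFLY": (22.84, 22.82, 18.25),
    "RAME": (7.77, 7.82, 7.84),
    "RAMO": (7.77, 7.80, 7.82),
    "MUXIN": (1.22, 1.25, 1.20),
    "CDI": (0.40, 0.41, 0.41),
    "CDO": (0.38, 0.38, 0.38),
    "FSM": (0.05, 0.05, 0.05),
    "RROM": (0.0, 0.05, 0.05),
    "CROM": (0.03, 0.04, 0.03),
    "ROT1": (0.01, 0.01, 0.01),
    "ROT0": (0.01, 0.01, 0.01),
    "CAI": (0.02, 0.02, 0.02),
    "MUXC": (0.002, 0.002, 0.002),
    "RMUX": (0.0, 0.007, 0.007),
    "PARITY": (0.008, 0.009, 0.009),
}
internal = (40.51, 40.68, 36.08)   # IP row
net = (64.42, 65.20, 59.64)        # NP row
total = (104.93, 105.88, 95.72)    # TP row

for col, name in enumerate(("Conv", "Others", "Our")):
    cell_sum = sum(values[col] for values in cells.values())
    share = 100 * cells["BUTTERFLY"][col] / internal[col]
    print(f"{name}: cells sum to {cell_sum:.2f} (IP {internal[col]}), "
          f"IP+NP = {internal[col] + net[col]:.2f} (TP {total[col]}), "
          f"butterfly share of IP = {share:.0f}%")

saving_vs_others = 100 * (total[1] - total[2]) / total[1]
print(f"total power saving of Our over Others: {saving_vs_others:.1f}%")
```

The cell powers add up to the internal cell power in each column to within rounding, and the total saving over Others agrees with the 64-point entry of Table 5.1.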
5.4 Summary 
This chapter described a low power radix-2 FFT processor architecture. The FFT processor is
based on the order based processing scheme. The power saving is higher for short FFTs because
the order based processing scheme is directed at saving power in the butterfly and not in the
memory. The memory size goes on increasing with the FFT size and therefore the power share
of the butterfly module goes on reducing with FFT size. The order based processing scheme
can also be successfully applied to a pipelined FFT processor. The next chapter introduces a
low power radix-4 ordered pipelined FFT processor architecture.




Low power radix-4 pipelined FFT 
processor architecture 
Pipelined FFT processors are commonly used in all multicarrier applications requiring real
time processing. The multicarrier receiver needs a low power pipelined FFT processor. A low
power radix-4 pipelined FFT processor architecture is chosen as the conventional pipelined FFT
processor [13]. This radix-4 pipelined architecture is considered because its butterfly can be
implemented by using only one complex multiplier, just like radix-2 FFT processors, while
retaining all the other attributes of the radix-4 FFT processor for sequential input processing.
A novel low power radix-4 ordered pipelined FFT processor architecture is proposed by
incorporating the order based processing scheme into the conventional radix-4 pipelined FFT processor.
This chapter is organised into four sections. The first section describes the need for pipelined
FFT processors. The second section explains the algorithm and the architecture of the
conventional radix-4 pipelined FFT processor. The third section proposes the ordered radix-4 pipelined
FFT processor architecture by altering the penultimate stage of the conventional radix-4 FFT
processor to support the order based processing scheme. The chapter concludes with a section
on the comparison of the ordered and conventional radix-4 FFT processor architectures in terms
of power and area.
6.1 Need for pipelined FFT processor architectures
The single butterfly architectures suffer from speed limitations for long FFTs. The N-point
single butterfly FFT processor needs 'N' complex words of memory for the in-place algorithm. If
this memory is not partitioned, unlike in the previous chapter, the number of read/write accesses
needed to perform the FFT creates a bottleneck: an N-point FFT requires (N/r) log_r N butterfly
computations, each reading and writing r words, which implies 2N log_r N read/write accesses.
Performing an 8K-point FFT in 1 ms using a radix-2 approach therefore requires a read/write
access to the internal RAM every 4.7 ns. This
value is difficult to achieve. The high frequency of operation considerably increases the power
consumption as well.
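
The 4.7 ns figure follows directly from the access count, as the short Python calculation below shows. The function name is hypothetical, and the sketch assumes that N is an exact power of the radix.

```python
def ram_access_period_ns(n_points, radix, transform_time_s):
    """Time available per RAM access for an unpartitioned in-place FFT."""
    stages, size = 0, 1
    while size < n_points:          # integer log_r(N)
        size *= radix
        stages += 1
    accesses = 2 * n_points * stages  # 2 N log_r N reads and writes
    return transform_time_s / accesses * 1e9


# 8K-point radix-2 FFT in 1 ms: 2 * 8192 * 13 = 212992 accesses
print(round(ram_access_period_ns(8192, 2, 1e-3), 1))
```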
This problem can be solved either by using a higher radix to reduce the number of
butterflies or by partitioning the memory into r banks as in the previous chapter. The higher radix
approach leads to complex butterflies and the memory partitioning requires complex
addressing and higher area. Moreover, in all single butterfly architectures, a clock having a frequency
much higher than the input data sampling rate is required for all multicarrier applications. The
presence of multiple clocks complicates the multicarrier receiver design.
The real advantage of the pipelined architecture is that all the hardware blocks operate at the
input data sampling rate. The pipelined hardware is obtained by projecting the
FFT signal flow graph perpendicularly to the data flow. The hardware consists of several butterfly
units (one per FFT stage) with associated complex multipliers, separated by delay commutators.
The speed of the pipelined architecture can be very easily enhanced by increasing the level of
pipelining in the arithmetic units. The only disadvantage is that the butterfly unit has to be
replicated log_r N times compared to the single butterfly implementation.
6.2 Conventional radix-4 pipelined FFT processor architecture 
The pipelined FFT processor architectures are most commonly employed in multi-carrier
receivers because these architectures can be directly interfaced to input data operating at the
sampling rate for real time applications [89-91]. The radix-4 architectures are preferred over
radix-2 for low power applications because of the reduced number of multiplications in radix-4
as compared to radix-2 [62]. The conventional radix-4 butterfly comprises three complex
multipliers and requires the availability of all the four butterfly inputs at the same time as
explained in chapter 3 [61, 92]. In general, the data from the A/D converter always arrives in
word sequential format with one input at a time. This data/processor mismatch can be bridged
by providing input buffers, which consume a lot of power, and by operating the pipeline at
(1/4)th of the input rate. Bi and Jones proposed a radix-4 pipelined architecture in which only
one complex multiplier is used in its butterfly [13]. Moreover, this architecture can be directly
interfaced to word sequential data input without the need for input buffering. The hardware
comparison of the Bi and Jones architecture with respect to the traditional Gold and Bially [92]
and the digit serial Hui [93] architectures in terms of multipliers, adders and memory is listed in
Table 6.1.

              Gold and Bially   Bi and Jones   Hui
Memory        3.25N             2.75N          2.5N
Multiplier    3 log4 N          log4 N         3 log4 N*
Adder         8 log4 N          3 log4 N       12 log4 N*
Table 6.1: Comparison of the different pipelined FFT architectures. * indicates digit serial.

The memory requirement of the Hui architecture is slightly less than that of Bi and Jones but
the power consumed in its memory is more because of deep memory fragmentation [91]. Deep
memory fragmentation prevents the realisation of FIFOs using dual port RAMs. The traditional
shift register based realisation of the fragmented memory increases the switching activity and
hence the power consumption of Hui's architecture. The Bi and Jones pipelined architecture is
implemented by Bidet as a first single chip 8K point FFT for multi-carrier orthogonal frequency
division multiplexing applications [63]. The Bi and Jones architecture is selected because it has
the lowest value of FFTs per energy among all the pipelined architectures [53]. Moreover, it is
also a very popular architecture for implementing multi-carrier receivers [63, 89, 91]. The
conventional radix-4 architecture is based on the Bi and Jones architecture. The conventional radix-4
FFT processor architecture is now derived from the Bi and Jones algorithm.
6.2.1 Bi and Jones algorithm for DFT decomposition 
The N-point DFT of a finite duration sequence x(n) is defined again here by equation 6.1.

$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1$    (6.1)

where $W_N$ is defined as

$W_N = e^{-j2\pi/N}$    (6.2)

Let N be a composite number of v integers so that $N = r_1 r_2 \ldots r_v$, and define

$N_t = N / (r_1 r_2 \ldots r_t), \quad 1 \le t \le v-1$    (6.3)

where 't' is the stage number of the decomposed DFT and $r_t$ its radix. Using the recursive
property of equation 6.3 for radix $r_1$, equation 6.1 becomes

$X(k) = \sum_{p=0}^{r_1-1} x(N_1 p) W_N^{N_1 p k} + \sum_{p=0}^{r_1-1} x(N_1 p + 1) W_N^{(N_1 p + 1)k} + \cdots + \sum_{p=0}^{r_1-1} x(N_1 p + N_1 - 1) W_N^{(N_1 p + N_1 - 1)k}$    (6.4)

Equation 6.4 can be modified by using the relationship $W_N^{N_1} = W_{r_1}$ as follows:

$X(k) = \sum_{q_1=0}^{N_1-1} W_N^{q_1 k} \sum_{p=0}^{r_1-1} x(N_1 p + q_1) W_{r_1}^{pk}$    (6.5)

Now defining indexes $k_1$ and $m_1$ by $k = r_1 k_1 + m_1$ where $0 \le k_1 \le N_1 - 1$ and
$0 \le m_1 \le r_1 - 1$, equation 6.5 becomes

$X(r_1 k_1 + m_1) = \sum_{q_1=0}^{N_1-1} x_1(q_1, m_1) W_{N_1}^{q_1 k_1}$    (6.6)

where

$x_1(q_1, m_1) = W_N^{q_1 m_1} \sum_{p=0}^{r_1-1} x(N_1 p + q_1) W_{r_1}^{p m_1}$    (6.7)

Equation 6.7 defines the computation for the first stage. Continuing the decomposition process
of equation 6.6 for radix numbers other than $r_1$, the complete N-point DFT can be decomposed
into v-1 further stages of computation. The final stage is defined by equation 6.8 as follows [13]:

$X(r_1 r_2 \ldots r_{v-1} m_v + \cdots + r_1 m_2 + m_1) = \sum_{q_{v-1}=0}^{r_v-1} x_{v-1}(q_{v-1}, m_{v-1}) W_{r_v}^{q_{v-1} m_v}$    (6.8)

whereas the intermediate stages 't' are given by the following equation:

$x_t(q_t, m_t) = W_{N_{t-1}}^{q_t m_t} \sum_{p=0}^{r_t-1} x_{t-1}(N_t p + q_t, m_{t-1}) W_{r_t}^{p m_t}$    (6.9)

where $2 \le t \le v-1$, $0 \le m_i \le r_i - 1$, $0 \le q_i \le N_i - 1$ and $2 \le i \le v$.
Each summation in these equations represents an $r_t$-point DFT. Since the coefficient is outside
the summation, the decomposition corresponds to a decimation-in-frequency computation.

Figure 6.1: Signal flow graph of a radix-4 16-point FFT

Since radix numbers can be any positive integer, equations 6.7, 6.8 and 6.9 can be used for
either mixed radix computation or uniform radix computation. For $r_t = 4$, the flowgraph of
a 16-point FFT is shown in Figure 6.1 and the computation reduces to:

$X(4 m_2 + m_1) = \sum_{q_1=0}^{3} x_1(q_1, m_1) W_4^{q_1 m_2}$    (6.10)

$x_1(q_1, m_1) = W_{16}^{q_1 m_1} \sum_{p=0}^{3} x(4p + q_1) W_4^{p m_1}, \quad 0 \le m_1, m_2 \le 3$    (6.11)

In Figure 6.1, each open circle represents the summation while the dots define the stage
boundaries. The number inside the open circle is the value of $m_1$ (for stage 1) or $m_2$ (for stage 2).
The number outside the open circle is the FFT coefficient applied.
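
The two-stage decomposition for the 16-point case can be verified numerically against a direct evaluation of the DFT. The Python sketch below is only a floating point check of the algebra, written with hypothetical function names; it says nothing about the pipelined hardware or its word lengths.

```python
import cmath


def twiddle(n, exponent):
    """W_n^exponent with W_n = exp(-j 2 pi / n)."""
    return cmath.exp(-2j * cmath.pi * exponent / n)


def dft(x):
    """Direct N-point DFT (equation 6.1)."""
    n = len(x)
    return [sum(x[t] * twiddle(n, t * k) for t in range(n)) for k in range(n)]


def radix4_fft16(x):
    """16-point DFT computed as two radix-4 stages."""
    # Stage 1: x1(q1, m1) = W_16^(q1 m1) * sum_p x(4p + q1) W_4^(p m1)
    x1 = [[twiddle(16, q1 * m1)
           * sum(x[4 * p + q1] * twiddle(4, p * m1) for p in range(4))
           for m1 in range(4)] for q1 in range(4)]
    # Stage 2: X(4 m2 + m1) = sum_q1 x1(q1, m1) W_4^(q1 m2)
    result = [0j] * 16
    for m1 in range(4):
        for m2 in range(4):
            result[4 * m2 + m1] = sum(x1[q1][m1] * twiddle(4, q1 * m2)
                                      for q1 in range(4))
    return result


samples = [complex(n, (n * n) % 7) for n in range(16)]
error = max(abs(a - b) for a, b in zip(radix4_fft16(samples), dft(samples)))
print(f"largest deviation from the direct DFT: {error:.2e}")
```

Note that the coefficient multiplication by $W_{16}^{q_1 m_1}$ follows the first summation, which is the decimation-in-frequency structure mentioned in the text.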
6.2.2 Hardware implementation 
A pipelined N-point radix-4 FFT processor based on the previously described algorithm, shown
in Figure 6.2, has log4 N stages. Each stage produces one output within each word cycle. Each
stage contains a commutator, a butterfly element (for summation) and a complex multiplier.
The sequential outputs at each stage must be ordered in accordance with the value of $m_t$. For
instance, from Figure 6.1 at stage 1, the outputs associated with $m_1 = 0$ are produced in the
first four word cycles, then those associated with $m_1 = 1$ in the next four cycles and so on. It is
clear from equation 6.9 that the input data for each summation at stage 't' are separated in time
by $N_t$ words. The required commutator comprises six shift registers, each providing an $N_t$ word
delay, along with three multiplexers and is shown in Figure 6.3. The control signals C1, C2 and
C3 select the appropriate data using 2:1 multiplexers according to the value of $m_t$. The timing
diagram of the commutator for stage 1 of a 16-point FFT is shown in Figure 6.4. In Figure 6.4,
t' is the instant when the first input word arrives. Each input word occupies a word slot of
duration 'T' and is numbered according to its appearance in time. The four complex outputs
from the commutator are connected to its associated butterfly. The commutator supplies the
same set of data for $N_t$ word cycles.

Figure 6.2: [pipelined radix-4 FFT processor; blocks labelled Commutator, Butterfly, Coefficient and Output, with v = log(N)]

Figure 6.3: Commutator architecture for the conventional radix-4 pipelined FFT processor
architecture.

Figure 6.4: Timing diagram of commutator outputs for the first stage of a 16-point radix-4
pipelined FFT processor.

The FIFO block of the commutator can also be realised by using either a dual port RAM (DM)
or a triple port RAM (TM) for low power implementation. A DM can be converted into a
FIFO by maintaining a difference of unity between its read pointer (r) and write pointer (w).
Let us assume that the read and the write pointers are incremented after every clock cycle of
duration 'T'. The read and the write pointers in a FIFO are arranged in such a way that the write
pointer always writes to the memory location read during the previous clock cycle. Figure 6.5
illustrates the principle of operation of a DM based FIFO having length equal to four with
synchronous write and asynchronous read capabilities. The FIFO block receives an input after
every clock cycle. It is clear from Figure 6.5 that the output (O) of the DM is exactly identical to
the output obtained from a traditional shift register based FIFO of length equal to four. The
advantage of the DM based FIFO is that there is no need to move all the stored data values
after every new data entry into the FIFO, contrary to a shift register based FIFO. This translates
into lower switching activity and therefore lower power consumption. The block diagram of the
commutator with a DM based FIFO is shown in Figure 6.6.
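
The pointer arrangement can be illustrated with a small behavioural model in Python. It is a sketch under stated assumptions rather than a description of the actual RAM: the write is modelled as happening at the clock edge and the asynchronous read just after it, the write pointer trails the read pointer by one location, and `None` stands for the don't-care value x. The function names are hypothetical.

```python
def ram_fifo(inputs, depth=4):
    """Dual port RAM used as a FIFO: one word written per clock cycle."""
    mem = [None] * depth
    outputs = []
    for cycle, value in enumerate(inputs):
        w = cycle % depth        # write pointer
        r = (w + 1) % depth      # read pointer, one location ahead of w
        mem[w] = value           # synchronous write at the clock edge
        outputs.append(mem[r])   # asynchronous read just after the edge
    return outputs


def shift_register_fifo(inputs, depth=4):
    """Shift register FIFO: every stored word moves in every clock cycle."""
    regs = [None] * depth
    outputs = []
    for value in inputs:
        regs = [value] + regs[:-1]
        outputs.append(regs[-1])
    return outputs


stream = list(range(6))
print(ram_fifo(stream))
print(shift_register_fifo(stream))
```

Both models produce the same output sequence, but the RAM version changes a single stored word per clock cycle whereas the shift register rewrites all four, which is the source of the lower switching activity.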
The TM used here has two asynchronous read ports and one synchronous write port and
therefore it is equivalent to a DM with one additional read port. This additional read pointer in a TM
based FIFO can be independently positioned to read any stored FIFO value, thereby making it
ideal for handling complex data ordering. A novel architecture of the commutator based on a TM
based FIFO will be proposed in the low power ordered pipelined radix-4 FFT processor section
of this chapter. The design of the butterfly and the complex multiplier are illustrated in the next
sections.

Figure 6.5: (a) Timing diagram showing the input and output values of a FIFO of length equal
to four, (b) The contents of the four location dual port RAM based FIFO, its Input
and Output after every clock cycle for six consecutive clock cycles (+ means just
after). Values recoverable from panel (b): at t' = 0+, T+ and 2T+ the input I is 0, 1 and 2 and
the output O is x (don't care); at t' = 3T+, 4T+ and 5T+ the input I is 3, 4 and 5 and the
output O is 0, 1 and 2.

Figure 6.6: Commutator architecture for the DM based FIFO.
-- C4 C5 C6 - 
0 0 0 0 0 
1 1 0 1 1 
2 0 1 1 0 
3 1 1 0 
Table 6.2: Control signals for the different values of $m_t$ (0 = addition, 1 = subtraction).
6.2.2.1 Butterfly 
The radix-4 butterfly unit implements the summations of equations 6.8 and 6.9 and is shown
in Figure 6.7. The complex multiplication with ±j in the summations can be replaced by a
combination of addition, subtraction and swapping between the real and imaginary parts for the
radix-4 FFT processor. The butterfly architecture is based on three complex adder/subtractors
instead of eight complex adders. The control signals C4, C5 and C6 select the appropriate data
and the function (addition or subtraction) in accordance with the value of $m_t$. The values of the
control signals for the different values of $m_t$ are listed in Table 6.2.
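
The replacement of the ±j multiplications by swaps and sign changes can be sketched in Python as follows. The grouping into three complex add/subtract operations per output mirrors the description above, but the selection logic here is a plain `if` chain on m and is not meant to reproduce the C4 to C6 encoding of Table 6.2; all function names are hypothetical.

```python
def add(u, v):
    return (u[0] + v[0], u[1] + v[1])


def sub(u, v):
    return (u[0] - v[0], u[1] - v[1])


def times_minus_j(z):
    """(a + jb)(-j) = b - ja: swap the parts and negate the new imaginary part."""
    return (z[1], -z[0])


def times_plus_j(z):
    """(a + jb)(+j) = -b + ja: swap the parts and negate the new real part."""
    return (-z[1], z[0])


def radix4_butterfly(x0, x1, x2, x3, m):
    """sum_p x_p W_4^(p m) using three complex add/subtract operations."""
    if m == 0:
        return add(add(x0, x2), add(x1, x3))
    if m == 1:
        return add(sub(x0, x2), times_minus_j(sub(x1, x3)))
    if m == 2:
        return sub(add(x0, x2), add(x1, x3))
    return add(sub(x0, x2), times_plus_j(sub(x1, x3)))


W4 = [1, -1j, -1, 1j]  # W_4^0 .. W_4^3
data = [(1, 2), (3, -4), (-5, 6), (7, 8)]
for m in range(4):
    reference = sum(complex(*data[p]) * W4[(p * m) % 4] for p in range(4))
    print(m, radix4_butterfly(*data, m), reference)
```

No multiplier is involved in any of the four cases; only the choice between addition and subtraction and the optional swap change with m.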
Figure 6.7: Conventional butterfly architecture.

6.2.2.2 Conventional complex multiplier
A conventional complex multiplier accepts two complex inputs, namely data ($X_r + jX_i$) and
coefficient ($W_r + jW_i$), and produces a complex output ($XO_r + jXO_i$). It is constructed by
using four real multipliers, an adder and a subtractor. The inputs and outputs of the
complex multiplier are related as follows:

$XO_r = X_r W_r - X_i W_i$
$XO_i = X_i W_r + X_r W_i$

The complex multiplier is shown in Figure 6.8. Only 20 bits of each multiplier output are used
by the adder and subtractor to reduce the hardware without introducing significant errors.
Figure 6.8: Conventional complex multiplier.

6.3 Low power ordered pipelined radix-4 FFT processor architecture
This section proposes a low power ordered radix-4 FFT processor architecture by modifying
the operation sequence of the conventional low power radix-4 pipelined FFT processor
architecture. The complex multiplier within the butterfly processing unit is one of the most power
consuming blocks in a pipelined FFT processor. The switching activity between successive
coefficients fed to the complex multiplier, and hence its power consumption, can be significantly
reduced by order based processing. The order based processing of fixed coefficients requires
corresponding data sequencing as per the new coefficient ordering in a pipelined FFT processor.
Data sequencing is performed by a commutator in pipelined FFT processors. Hence, a novel
commutator architecture is proposed to handle the new data sequencing for stage 1 of a
16-point FFT processor. The data sequencing for stage 2 is restored by using a dual port RAM
(DM) along with a ROM for its address generation. This ordering technique is suitable only for
stage 1 of a 16-point radix-4 FFT processor due to the need to restore the data ordering for the
following stage. This in turn requires only a small hardware overhead in the form of a six word
additional DM (ADM) following stage 1 of a 16-point FFT, in contrast to the much larger ADM
needed after stage 1 of a 64-point FFT. The large ADM is required because stage 1 of a 64-point
FFT handles 64 data samples compared to only 16 for stage 1 of a 16-point FFT. Thus our order
based processing scheme is limited only to stage 1 of a 16-point FFT processor or stage 2 of a
64-point FFT processor and so on. Moreover, the commutator design for incorporating order
based processing becomes very difficult for the bigger stage 1 of a 64-point FFT processor.
6.3.1 Order based processing of coefficients in a radix-4 pipelined FFT processor 
This section proposes an altered operation sequencing for stage 1 of a 16-point radix-4 pipelined
FFT processor based on its signal flowgraph shown in Figure 6.9. Normally, the fixed coefficients
are fed to the complex multiplier in the order given in Figure 6.2, starting from m1 = 0
and ending with m1 = 3 for stage 1 of a 16-point FFT processor. Our approach involves






























Figure 6.9: Signal flow graph of the ordered 16-point radix-4 pipelined FFT
reordering the coefficients fed to the multiplier for stage 1 of a 16-point FFT or stage 2 of a
64-point FFT as listed in Table 6.3 and also shown in Figure 6.9. The coefficients are ordered so
as to minimise the switching activity between successive coefficients by minimising the Hamming
distance between them. The ordered coefficient set is obtained by first arranging only the
imaginary parts of the coefficient set on the basis of Hamming distance. Each corresponding real
part is then taken either as it is or in its two's complement form, whichever has the smaller
Hamming distance with respect to the previously arranged real part. A flag bit is asserted to
indicate that the real part is in two's complement form. This flag bit is also used to selectively
complement the multiplier output as discussed in chapter 4. The switching activity decreases
from 192 to just 78, a reduction of 59%, by following our ordering approach. The order based
processing of coefficients requires corresponding data ordering. The data ordering is performed
by a novel design of the commutator for stage 1 of a 16-point radix-4 FFT processor. The ordered
data sequence at the output of the complex multiplier for stage 1 of the 16-point FFT
processor has to be converted back into a normal data sequence for its stage 2. This data
sequence conversion is accomplished by the ADM together with a ROM (ROM0)
for its addressing. The new architecture of the 16-point ordered pipelined FFT processor is
shown in Figure 6.10. The input and output sequences of the ADM, namely DI and DO respectively,
are shown in Figure 6.11. It is clear that DO is in normal order and can be fed directly to the
stage 2 commutator. The stage 2 commutator is the same as in the conventional architecture.
The stage 1 commutator design, to support the ordering scheme, is discussed in detail now.
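The choice between a real part and its two's complement can be sketched as follows. This is a minimal illustration, assuming 16-bit coefficient words and a simple comparison against the previously chosen real word; the function names are hypothetical and are not taken from the thesis.

```python
MASK16 = 0xFFFF  # coefficients are assumed to be 16-bit words


def hamming(a, b):
    """Number of bit positions in which two 16-bit words differ."""
    return bin((a ^ b) & MASK16).count("1")


def pick_real_form(prev_real, real):
    """Return (flag, word); flag is 1 when the two's complement form is chosen."""
    negated = (-real) & MASK16
    if hamming(prev_real, negated) < hamming(prev_real, real):
        return 1, negated
    return 0, real


# Real part of W2 (0x5a82) followed by the real part of W6 (0xa57d)
flag, word = pick_real_form(0x5A82, 0xA57D)
print(flag, hex(word))
```

For this pair the two's complement form differs from the previous word in a single bit, whereas the normal form differs in all sixteen, which is why the flag is set for W6 in Table 6.3.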
Coefficient sequence after stage 1

Conventional           Ordered
(real, imag)           (flag, real, imag)
W0  7fff,0000          W0  1,8001,0000
W0  7fff,0000          W0  1,8001,0000
W0  7fff,0000          W0  1,8001,0000
W0  7fff,0000          W0  1,8001,0000
W0  7fff,0000          W0  1,8001,0000
W1  7641,cf04          W0  1,8001,0000
W2  5a82,a57d          W0  1,8001,0000
W3  30fb,89be          W4  0,0000,8000
W0  7fff,0000          W1  0,7641,cf04
W2  5a82,a57d          W3  1,cf05,89be
W4  0000,8000          W3  1,cf05,89be
W6  a57d,a57d          W2  0,5a82,a57d
W0  7fff,0000          W2  0,5a82,a57d
W3  30fb,89be          W6  1,5a83,a57d
W6  a57d,a57d          W6  1,5a83,a57d
W9  89be,30fb          W9  1,7642,30fb

Normal switching activity = 192        Switching activity after ordering = 78
Table 6.3: Ordered and conventional coefficient sequences for a 16-point radix-4 FFT
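The two switching activity figures can be checked by summing the Hamming distances between successive coefficient words. The sketch below is an illustration rather than the thesis's own script. It assumes that the 16-bit real and imaginary words are counted, that the flag bit is excluded, and that the sequence wraps around from the last coefficient to the first; under these assumptions the values of Table 6.3 give 192 and 78.

```python
def hamming(a, b):
    """Number of differing bits between two words."""
    return bin(a ^ b).count("1")


def switching_activity(seq):
    """Sum of bit toggles between successive (real, imag) words, cyclically."""
    total = 0
    for i, (re, im) in enumerate(seq):
        prev_re, prev_im = seq[i - 1]  # i = 0 wraps around to the last entry
        total += hamming(prev_re, re) + hamming(prev_im, im)
    return total


# (real, imag) words of the conventional sequence in Table 6.3
W0, W1, W2, W3 = (0x7FFF, 0x0000), (0x7641, 0xCF04), (0x5A82, 0xA57D), (0x30FB, 0x89BE)
W4, W6, W9 = (0x0000, 0x8000), (0xA57D, 0xA57D), (0x89BE, 0x30FB)
normal = [W0, W0, W0, W0, W0, W1, W2, W3, W0, W2, W4, W6, W0, W3, W6, W9]

# Entries whose real part is stored in two's complement form (flag = 1)
N0, N3, N6, N9 = (0x8001, 0x0000), (0xCF05, 0x89BE), (0x5A83, 0xA57D), (0x7642, 0x30FB)
ordered = [N0] * 7 + [W4, W1, N3, N3, W2, W2, N6, N6, N9]

print(switching_activity(normal), switching_activity(ordered))
```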
Figure 6.10: Ordered 16-point radix-4 pipelined FFT processor architecture. 
6.3.2 Stage 1 Commutator design for a 16-point FFT Processor 
As seen in equation 6.9 and Figure 6.1, the input data for each summation at stage 1 of a 
16-point FFT are separated in time by four words. The timing of the ordered data sequence 
corresponding to the ordered coefficient sequence and the normal sequence as a function of 
75 
1 	015141312110987654 321 	01514. 
1312 1110 	9 	8 7 6 5 4 3 	2 	1 	0 151413121110 
4 
98 7 65432 1 	0 15 14 13 12 11109876 
4 
5 43 2 	1 	0 15 14 13 12 11 10 	9 	87 6 5 4 3 2 
-I 




1312 1110 	7 	52 9 	3 	1 6 8 4 0 15 14 13 12 
- 
11 10 
9 8 7 	6 	3 	1 14515 13 2 	4 0 12 11 10 	9 	8 7 6 
4— 
5 4 3 	2 1513 101 	11 	9 14 0 12 8 7 6 5 4 
- 
3 2 













Stage 1 of 
16-point FFT 




110 11511411311211111019 18 17 16 15 14 13 12 1 1 1PJ 15 1 14 
1 0 h5 14 11 9 16 13 7 5 10 12 8 4 13 2 1 0 11514 
10 918 7 6 514 3 2 1 10 15 14 13 112 11 10 918 7 
Figure 6.11: Input and output data sequence of ADM. 
Figure 6.12: Timing diagram of normal and ordered commutator outputs for the first stage of 
a 16-point radix-4 pipelined FFT processor. 
time is shown in Figure 6.12, where t' is the instant when the first input word arrives. Each
input word occupies a word slot of duration T and is numbered according to its appearance in
time. This ordered data sequence can be generated with the help of a commutator. It is difficult
to generate the ordered data sequence with the help of a conventional FIFO based on shift
registers (SRs) or DMs. In order to achieve flexibility, the commutator is constructed by using
three double size (eight word) triple port RAM (TM) based FIFOs rather than six four word DM
based FIFOs. The additional read port in a TM greatly helps in generating the ordered sequence. The
Figure 6.13: First stage commutator architecture for the ordered 16-point radix-4 FFT 
processor 
commutator comprises three TMs (two read ports and one write port each), a finite state machine
(FSM), a ROM (ROM1) and four multiplexers of variable size as shown in Figure 6.13. A TM
acts as a FIFO with two possible outputs for flexibility. The three TMs have six possible outputs
but only four outputs are chosen at any time by the four multiplexers depending upon the desired
ordered sequence. Each TM has a depth of eight words for stage 1 of a 16-point FFT processor.
The three bit sequential address a1 is generated by the FSM of the CONTROL BLOCK for the
write ports of TM0 and TM1 and the B port of TM0. The ROM1 of the CONTROL
BLOCK provides the addresses of all the other read and write ports along with the multiplexer
controls to generate the ordered data sequence. It also helps to hold the unused outputs of the
TMs at their previous values. The ROM1 contents are listed in Table 6.4. The lower byte of the
31-bit wide ROM1 controls the four multiplexers (m[7:0]). The next more significant 18 bits
generate the write address (aw) and the read addresses (adrf,...,adra) of the TMs. The next higher
3 bits control the stage 2 butterfly (c[2:0]) and the most significant bit controls the chip select of
TM2 (cs). TM2 is selectively disabled for writing by a logic high on its chip select input. This
is done to avoid unnecessary writing of TM2, thereby reducing its power consumption. It is clear
from the highlighted nibbles of Table 6.4 that the lower three bits (adra) of these nibbles remain






Address   Contents
          {cs, c[2:0], aw, adrf, adre, adrd, adrc, adra, m[7:0]}
0000      2e8804,29
0001      1b4184,e4
0010      2fac86,29
0011      16a34d,89
0100      17a7df,89
0101      38534d,e5
0110      346595,9a
0111      2c1a6d,a9
1000      2c3efd,a9
1001      386595,e5
1010      3877dd,e5
1011      000000,41
1100      008041,41
1101      010082,41
1110      0180c3,41
1111      160104,41
Table 6.4: Contents of ROM1 for generating addressing and control signals for the
commutator
unchanged over successive entries, so there is no switching activity on port A for almost half of
the time duration. This addressing approach reduces the switching activity on the unused ports
and therefore the power consumption. The same kind of addressing is also employed for ports C, E
and F. This work is reported in [94, 95].
The conventional butterfly structure is also modified for low power by replacing the programmable
adder/subtractors with a combination of control inverters and a summer, as explained in the
next section.
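A ROM1 word can be unpacked into its control fields as sketched below. The field order follows the header of Table 6.4. The three bit width of each address field is an assumption based on the eight word depth of the TMs, and with it the listed fields occupy 30 bits of the stored word. The function name is hypothetical.

```python
# Field names and widths, least significant field first (widths are assumed)
FIELDS = [("m", 8), ("adra", 3), ("adrc", 3), ("adrd", 3),
          ("adre", 3), ("adrf", 3), ("aw", 3), ("c", 3), ("cs", 1)]


def decode_rom1(upper24, lower8):
    """Split a Table 6.4 entry, written there as 'upper24,lower8' in hex."""
    word = (upper24 << 8) | lower8
    fields = {}
    for name, width in FIELDS:
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


# Last entry of Table 6.4 (address 1111)
print(decode_rom1(0x160104, 0x41))
```

Decoding the entries at addresses 0101 to 1010 in this way gives the same adra value for each of them, which is consistent with the statement that port A sees no address switching for almost half of the time.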
6.3.3 Low power butterfly 
The low power butterfly, shown in Figure 6.14, is obtained by using six control inverters (CI1
to CI6) to generate the normal or the complement form as per the control signals C5, C6 and
C7. The C4 signal controls the select lines of the four multiplexers (M1 to M4) for directing
the appropriate data to the inputs of the summers. Two four input summers (SUMr and SUMi)
are needed to generate the real (XOr) and the imaginary (XOi) components of the output.
Two inputs of each summer arrive directly from the preceding commutator whereas the remaining
two inputs come from the multiplexers. The four complex inputs to the butterfly are Xr0 + jXi0
to Xr3 + jXi3. This architecture consumes less power due to the realisation of subtraction























Figure 6.14: Low power butterfly architecture. 
through control inversion. Realising two's complement through inversion alone is an approximation,
since it omits the addition of one, and therefore introduces an error of one least significant bit
in the butterfly operation. This error is insignificant for short FFTs.
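The size of this error can be seen by comparing bitwise inversion with true two's complement negation. The sketch assumes 16-bit operands; the helper names are hypothetical.

```python
BITS = 16
MASK = (1 << BITS) - 1


def to_signed(word):
    """Interpret a BITS-wide word as a two's complement integer."""
    return word - (1 << BITS) if word & (1 << (BITS - 1)) else word


def control_invert(word, control):
    """XOR every bit with the control signal, as a bank of XOR gates would."""
    return word ^ (MASK if control else 0)


x = 0x30FB
inverted = to_signed(control_invert(x, 1))   # what the control inverter produces
negated = to_signed((-x) & MASK)             # exact two's complement negation
print(inverted, negated, negated - inverted)
```

The inverted value always falls one below the exact negation, so each subtraction realised in this way carries a fixed offset of one least significant bit.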
The complex multiplier for the ordered pipelined FFT processor architecture is obtained by
modifying the conventional complex multiplier to support the order based processing scheme
for the first stage of a 16-point pipelined FFT processor. The ordered complex multiplier is
discussed in the next section.
6.3.4 Ordered complex multiplier 
The ordered complex multiplier is obtained from the conventional complex multiplier by
replacing the two's complement multiplier with the multiplier module for only the real part of the
coefficient as discussed in chapter 4. The multiplier module comprises a two's complement
multiplier along with a bank of 20 Ex-OR gates for control inversion depending upon the flag
bit. The ordered complex multiplier is shown in Figure 6.15.
Figure 6.15: Ordered complex multiplier. 
6.4 Results 
The conventional architecture of a 64-point FFT has been implemented in three different ways
depending upon the FIFO implementation style. The conventional SR and DM architectures
are obtained by using shift register based or DM based FIFOs respectively in both stage 1 and
stage 2. The DM-SR architecture is obtained by using DM based FIFOs for the bigger stage
1 and SR based FIFOs for the smaller stage 2. Our Ordered low power architecture is based on
TM based FIFOs with the order based processing scheme incorporated into its penultimate stage.
All the cores have been designed for three different multiplier types, namely the carry save
array type (csa), the non-Booth coded Wallace tree type (nbw) and the Booth coded Wallace
tree type (wall).
The conventional and Ordered pipelined FFT processor architectures have been designed at
the register transfer level using the Verilog hardware description language. The architectures
were then synthesised using a 0.18 µm CMOS technology library. Power evaluation was then carried
out on the gate level netlist with SDF using a supply voltage of 1.8 V and a clock frequency of
100 MHz for 1000 FFT blocks of uniformly distributed random data samples. The switching
activity decreases from 192 to 78, a reduction of 59%, as per Table 6.3. The comparative results
in terms of power for different FFT lengths, FIFO implementation styles and three common
multiplier types are given in Table 6.5. It is clear from Table 6.5 that our Ordered architecture
gives power savings for all three multiplier types and all FIFO architectures for the 16-point
and 64-point FFT processors. The percentage power saving is maximum for the wall
multiplier. The percentage power saving of our Ordered approach is less for the nbw multiplier
FFT       Multiplier  SR based  DM based  DM-SR based  Ordered  % saving  % saving  % saving
size      type        in mW     in mW     in mW        in mW    /SR       /DM       /DM-SR
16-point  csa         125.32    135.42    -            96.48    23        29        -
          nbw         93.77     114.28    -            80.19    14        30        -
          wall        106.83    128.67    -            81.43    24        37        -
64-point  csa         351.60    289.14    276.73       252.73   28        13        9
          nbw         296.18    238.06    225.18       217.18   27        9         4
          wall        318.21    262.11    247.96       224.46   30        14        10
Table 6.5: Comparative power consumption for the Ordered and conventional FFT processors.
type in most cases but the nbw multiplier based architecture consumes less power than the
ones based on csa and wall multipliers. Moreover, the DM based approach is better for stage 1 of
a 64-point FFT whereas the SR based approach outperforms DM for the smaller FIFO required
in stage 1 of a 16-point FFT processor. Hence, the DM-SR architecture for the 64-point FFT is
more effective than the SR-SR and DM-DM architectures. Our Ordered approach gives power
savings of the order of 23% and 29% with respect to SR and DM respectively for the 16-point
FFT processor using the csa multiplier. The power saving of our Ordered approach is 28%, 13%
and 9% with respect to SR, DM and DM-SR respectively for the 64-point FFT processor using
the csa multiplier. The percentage power saving with respect to the overall power consumption
will go down further for longer FFTs because the Ordering approach is restricted to stage
1 of a 16-point FFT or stage 2 of a 64-point FFT processor. This restriction is imposed in
view of the large ADM requirement and commutator design complexities for the initial stages
of longer FFTs. This large ADM would more than offset any power saving due to ordering in the
complex multipliers. Table 6.6 lists the power consumed by the major cells of the pipelined FFT
processor for different FIFO implementations. It is clear from Table 6.6 that the SR based FIFO
architecture is inferior to the other architectures for the 64-point FFT due to the high power
consumption in the large FIFO blocks of the stage 1 commutator. This high power consumption is
attributed to the shifting (switching) of all data samples after every clock cycle in the
traditional FIFO based on SR. It is also evident from Table 6.6 that the power saving in the
Ordered approach takes place not only in the multiplier but also in the novel stage 2 commutator.
The novel stage 2 commutator architecture comprises three double size FIFOs based on
TMs rather than the traditional six DM or SR based FIFOs. The new commutator architecture
consumes much less power due to less data movement and therefore less switching activity
among the three FIFOs as compared to the traditional six FIFOs. The power consumption in
Cell                  SR      DM      DM-SR   Ordered
Stage 1 Commutator    91.01   53.87   53.84   53.77
Stage 2 Commutator    23.27   33.35   22.77   15.95
Stage 3 Commutator    9.64    9.29    9.37    15.17
Stage 1 Butterfly     1.82    1.63    1.63    1.78
Stage 2 Butterfly     2.01    1.81    1.59    3.04
Stage 3 Butterfly     6.15    6.10    6.10    4.40
Stage 1 multiplier    26.32   26.50   26.35   23.89
Stage 2 multiplier    25.27   25.81   25.81   18.31
Table 6.6: Comparative major cells power consumption for the Ordered and conventional FFT
processor for the 64-point FFT processor with csa multiplier
the additional read ports of the TMs is reduced by holding the outputs of the unused read ports
at their previous values by addressing these ports through ROM1. The stage 3 commutator
consumes more power in the Ordered approach as compared to the other approaches because
its power consumption also includes the power consumed by the ADM. The stage 2 butterfly in the
Ordered approach consumes more power than the other approaches mainly due to the different
stage 2 commutator architecture. The stage 3 butterfly in our Ordered approach consumes
much less power than the other approaches because it has been designed using XOR gates
(control inverters) and a summer rather than the traditional programmable adders/subtractors.
The stage 1 and stage 2 butterflies in our Ordered approach consume more power than the
other approaches due to the different input/output conditions in the form of a new stage 2
commutator architecture based on larger size TMs. The stage 1 multiplier consumes less power
in our Ordered approach due to different input/output conditions in the form of different stage
1 butterfly and stage 2 commutator architectures. The stage 2 multiplier consumes substantially
less power in our Ordered approach because of the significant reduction in the switching activity
at its coefficient input by ordering.
The area overhead of our 64-point Ordered FFT processor architecture is 3%, 9% and 28% with
respect to the DM, DM-SR and SR FFT processor architectures respectively for the wall multiplier
type.
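The percentage savings of Table 6.5 follow directly from the measured powers. As a check, the sketch below recomputes the csa rows, assuming that the saving is defined as the reduction relative to the conventional architecture and rounded to the nearest integer; the function name is hypothetical.

```python
def saving_percent(conventional_mw, ordered_mw):
    """Power saving of the Ordered design relative to a conventional one, in %."""
    return round(100 * (conventional_mw - ordered_mw) / conventional_mw)


# csa rows of Table 6.5 (power in mW)
csa_16 = {"SR": 125.32, "DM": 135.42}
csa_64 = {"SR": 351.60, "DM": 289.14, "DM-SR": 276.73}

print({k: saving_percent(v, 96.48) for k, v in csa_16.items()})
print({k: saving_percent(v, 252.73) for k, v in csa_64.items()})
```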
6.5 Summary 
This chapter has presented a low power ordered radix-4 pipelined FFT processor architecture
obtained by incorporating the order based processing scheme into the first stage of an existing
low power 16-point pipelined FFT processor architecture. The order based processing scheme
significantly reduces the power consumption of the complex multiplier. Moreover, power saving
also occurs in the TM based novel commutator architecture for stage 1 of a 16-point FFT. This
reduces the overall power consumption by between 14% and 37% for the 16-point FFT and by between
4% and 30% for the 64-point FFT, depending on the type of multiplier and FIFO. The order based
processing scheme is restricted to the first stage of a 16-point FFT or the second stage of a
64-point FFT due to the hardware overhead in the form of a dual port RAM for restoring the data
order for the subsequent stage. This approach is very attractive for orthogonal frequency division
multiplexing (OFDM) based wireless LAN (IEEE 802.11) requiring short FFTs but it can also be
applied to the penultimate stage of a long FFT. The next chapter presents a low power
architecture of a MC-CDMA receiver which is obtained by combining the low power 64-point radix-4
pipelined FFT processor architecture discussed here with a low power Combiner architecture.
Chapter 7 
Low power MC-CDMA receiver 
architecture 
MC-CDMA [96-105] is a spread spectrum technology which combines the advantages of 
OFDM (Orthogonal frequency division multiplexing) [106] and CDMA (Code division mul-
tiple access) to produce a spectrally efficient multi-user access system. This access system may 
be utilised in future mobile wireless systems, and therefore power consumption is important. A 
frequency domain processing MC-CDMA receiver contains two main system blocks, namely 
an FFT block to demodulate the OFDM signals and a Combiner block which equalises the sig-
nal and separates out the coded users. The Combiner in Our architecture is based on MMSE 
(minimum mean squared error) detection. This chapter deals with the low power architecture 
of a MC-CDMA receiver which is obtained by reducing the switching activity and also by 
shutting down blocks through clock gating. The FFT processor of the receiver is based on a 
novel coefficient ordered architecture for low power which has been explained in detail in the 
previous chapter. The blocks within the Combiner module are also clock gated to bring down 
their power consumption. 
This chapter is divided into five sections. The first two sections provide the overview of CDMA 
and MC-CDMA respectively. The third section describes the modeling of a MC-CDMA trans-
mitter and receiver in Matlab. The fourth section describes the MC-CDMA receiver architec-
tures. The chapter concludes with a section on the comparison of power and area between the 
conventional and the low power MC-CDMA receiver architectures. 
7.1 Overview of CDMA 
CDMA is based on spread spectrum communications [8,107-109]. Spread spectrum is a means 
of transmission in which the signal occupies a bandwidth in excess of the minimum necessary 
to send the information. The spreading is performed by a code which is independent of the 
data at the transmitter. The same code is used at the receiver for despreading and subsequent 
data recovery. The receiver has to be synchronised with the transmitter. Despreading is done
at the receiver by correlating the received signal with the synchronised replica of the code. The
processing gain of the spread spectrum system is expressed as the ratio of the bandwidth of the
spread spectrum waveform to that of the data.
Spread spectrum provides multiple user random access communications with selective address-
ing capability. Multiple access refers to the sharing of a fixed communication channel, such as 
a wireless channel, by a group of users. The objective of the multiple access scheme is to 
allow users to share a channel without creating unmanageable interference with each other. 
The traditional techniques are frequency division multiple access (FDMA) and time division 
multiple access (TDMA). In FDMA, all users transmit simultaneously but use non-overlapping 
frequency bands. In TDMA, all users occupy the same bandwidth but transmit sequentially in 
time. CDMA provides the capability of separating signals transmitted simultaneously in time 
and which occupy the same bandwidth as well. This is accomplished by assigning each user a 
unique spreading code that is orthogonal with the spreading codes of all the other users. CDMA 
is often implemented using direct sequence spread-spectrum techniques (DS-CDMA). According
to this technique, the data signal b(t) is modulated by a unique high frequency spreading
CDMA code c(t) which comprises +1/-1 chips as shown in Figure 7.1. CDMA code sequences
have statistical properties similar to sampled white noise and are generated by a linear
feedback shift register. The processing gain (P) in DS-CDMA is defined as the ratio of the CDMA
code frequency (1/Tc) to the data frequency (1/T). The processing gain in Figure 7.1 is
assumed to be eight for the sake of illustration. The spread signal s(t) has no resemblance
to the data signal b(t) and therefore it is impossible to recover the data signal without
knowledge of the CDMA code sequence c(t) used. It may be noted that the bandwidth of the
spread spectrum signal s(t) is P times the bandwidth of the data signal b(t), hence the
name spread spectrum. The processing gain represents the amount of interference protection
provided by the CDMA code because it is a measure of the amount of spreading of data power
over a bandwidth. The spread signal s(t) is modulated by a high frequency carrier (fcar) to
obtain the spread spectrum transmitted signal Str(t). The DS-CDMA transmitter and the power
spectrum of the wideband transmitted signal Str(t) are shown in Figure 7.2.
The data signal is recovered at the receiver by correlating the received signal with the
synchronised replica of the same CDMA chips. Since each user has been assigned a unique orthogonal
spreading sequence (low cross correlation with the codes of the other users), the
Figure 7.1: Illustration of the spreading principle with the help of (a) Data signal b(t), 
(b) CDMA code sequence c(t) and (c) Spread signal s(t), the processing gain 
is assumed as eight. 
signals corresponding to other users are suppressed at the time of correlation. Unfortunately,
each additional user increases the overall noise level, thus degrading the quality for all users.
The reception of a DS-CDMA signal becomes problematic in the presence of a multipath fading
channel. The composite received signal is the combination of several signals with different
delays in the time domain. The ability of the receiver to recover the transmitted signal is
determined by the auto correlation characteristic of the spreading codes.
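The spreading and despreading operations described in this section can be illustrated with a short baseband sketch. It assumes a processing gain of eight, two synchronised users with orthogonal codes chosen here purely for illustration, and an ideal channel without noise or multipath.

```python
def spread(bits, code):
    """Multiply every data bit (+1/-1) by each chip of the spreading code."""
    return [b * c for b in bits for c in code]


def despread(signal, code):
    """Correlate each block of chips with the code and take the sign."""
    p = len(code)
    out = []
    for i in range(0, len(signal), p):
        corr = sum(s * c for s, c in zip(signal[i:i + p], code))
        out.append(1 if corr >= 0 else -1)
    return out


code_a = [+1, +1, +1, +1, -1, -1, -1, -1]
code_b = [+1, -1, +1, -1, +1, -1, +1, -1]   # orthogonal to code_a
bits_a, bits_b = [+1, -1, -1], [-1, -1, +1]

# Both users transmit simultaneously in the same band; the channel adds them
channel = [a + b for a, b in zip(spread(bits_a, code_a), spread(bits_b, code_b))]
print(despread(channel, code_a), despread(channel, code_b))
```

Because the two codes have zero cross correlation, the contribution of the other user cancels in each correlation and both bit streams are recovered. With multipath delays the codes are no longer aligned and this cancellation is only partial.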
7.2 Overview of MC-CDMA 
Channel dispersion results in the reception of multiple resolvable paths in case of direct se-
quence CDMA (DS-CDMA). The RAKE receiver used for DS-CDMA has multiple correla-
tors, each synchronised to a different resolvable path. The problem here is that no code exists 
to ensure that the partial correlation of a slightly delayed path is orthogonal to the dominant 
line-of-sight path. In DS-CDMA each data symbol requires the entire available spectrum as 
shown in Figure 7.2. Due to the bursty nature of the channel, several adjacent symbols might 
Figure 7.2: (a) CDMA transmitter and (b) Power spectrum of the transmitted wideband CDMA 
signal. 
be destroyed during a deep fade. 
The main advantage of MC-CDMA is its capacity to spread the signal bandwidth without in-
creasing the adverse effect of delay spread. The chip duration can be equal to the symbol 
duration in MC-CDMA contrary to the wideband waveforms used in case of DS-CDMA. In 
MC-CDMA, the user signal is not multiplied by a high speed orthogonal code sequence, but 
the same bit is transmitted on multiple sub-carriers as shown in Figure 7.3. The power spectrum
of the transmitted signal Str(t) is shown in Figure 7.4. The sub-carrier spacing is equal to the
reciprocal of the data symbol duration (1/T). The CDMA chip duration is assumed to be equal
to the data symbol duration. The processing gain P here is assumed to be eight for the sake
of illustration in Figure 7.3. The processing gain depends upon the length of the CDMA code.
The number of sub-carriers is chosen to be more than or equal to the processing gain so as
to prevent frequency selective fading. A cyclic prefix having a duration longer than the channel
delay spread is appended to remove ISI between successive bits. For each user, the sub-carriers
are shifted with a 0 or π phase offset. The set of sub-carrier phase offsets follows a
signature code sequence to distinguish different users. The resulting signal has a coded structure in
the frequency domain and multiple access is possible using the orthogonality of the different 
codes. If the number of sub-carriers is appropriately chosen, then it is highly unlikely that all 
the sub-carriers will be located in deep fade and therefore frequency diversity is achieved. If 
the processing gain is equal to the number of sub-carriers then this system modulates all the 
sub-carriers with the same data bit, but with a phase shift on each sub-carrier determined by 
Figure 7.3: (a) Data signal, (b) CDMA code sequence, (c) Illustration of the one chip/carrier 
MC-CDMA scheme. 
Figure 7.4: Power spectrum of the transmitted MC-CDMA signal (power against frequency,
sub-carriers f0 to f7).
Figure 7.5: MC-CDMA transmitter. 
the spreading code. MC-CDMA spreads the signal in the frequency domain. This multi-carrier
modulation can also be implemented using an inverse FFT.
If the kth chip of the spreading code for user u is defined as c(k,u) ∈ {-1,+1} then the
transmitted baseband signal for the mth data symbol b(m) is:
\[ x(n) = \sum_{k=0}^{N-1} b(m)\, c(k,u)\, e^{j 2\pi k n / N} \qquad (7.1) \]
The baseband signal is then cyclically extended by more than the channel delay spread to
remove ISI. The resulting symbol is then passed through a DAC prior to upconversion to the high
frequency RF carrier. The block diagram of a MC-CDMA transmitter is shown in Figure 7.5.
It is very important to have frequency non-selective fading over each sub-carrier to limit the
amount of dispersion in MC-CDMA. As described above, MC-CDMA converts a high data
rate signal to a low data rate with the help of a serial to parallel converter before spreading
over the frequency domain in order to prevent frequency selective fading.
By using a guard interval, the receiver selects the portion of the signal that is free from ISI.
This is then processed by the FFT block to demodulate the sub-carriers. The channel effect of a
multipath channel h(n) at the output of the FFT is narrowband for each sub-carrier, H(k), and
therefore equalisation and de-spreading can be incorporated into a single combining operation
Figure 7.6: MC-CDMA receiver. 
to estimate the transmitted data bit. If the output of the FFT block at frequency bin k is defined
as Y(k) then the combining operation can be represented by
\[ b_{rec}(m) = \mathrm{sign}\left( \sum_{k=0}^{N-1} \Re\{ c(k,u)\, A(k)\, Y(k) \} \right) \qquad (7.2) \]
The entire receiver structure is shown in Figure 7.6. The Combiner block can be implemented
by setting A(k) by equation 7.3 for the minimum mean square error (MMSE) solution, where λ
is a parameter dependent upon the signal to noise level and the number of users. The sign bit of
the value obtained after accumulation of the real part of the product of the equaliser coefficient
A(k), the FFT output Y(k) and the corresponding CDMA chip c(k,u) over the code length of
64 is a measure of the received data estimate.
\[ A(k) = \frac{H^{*}(k)}{|H(k)|^{2} + \lambda} \qquad (7.3) \]
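As a brief check of equation 7.3, assume a noise-free received bin Y(k) = H(k)X(k), where X(k) denotes the value transmitted on sub-carrier k (a symbol introduced here only for illustration) and λ is the parameter of equation 7.3. The equalised bin is then

```latex
A(k)\,Y(k) = \frac{H^{*}(k)\,H(k)}{|H(k)|^{2} + \lambda}\,X(k)
           = \frac{|H(k)|^{2}}{|H(k)|^{2} + \lambda}\,X(k)
```

so that λ = 0 inverts the channel exactly, while a positive λ attenuates sub-carriers that lie in a deep fade instead of amplifying the noise on them.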
7.3 Modeling of MC-CDMA transmitter and receiver in Matlab 
The MC-CDMA transmission frame assumed here is shown in Figure 7.7. Each frame
comprises 32 symbols. The first symbol within each frame acts as a pilot or training symbol. The
known pilot symbol is used by the receiver to compute the impulse response of the channel.
This channel transfer function is assumed to remain fixed between pilot symbols. This means
that a slow fading channel is assumed here. The channel transfer function is then used to
estimate the data symbols at the receiver. The processing gain here is assumed to be 64, which is
also equal to the number of sub-carriers. Hence, the same data bit is transmitted on all the 64
sub-carriers. The Walsh-Hadamard code having zero cross-correlation is used for spreading the
data in the synchronised downlink here. The known pilot bit is fed to the inverse fast Fourier
transform (IFFT). The IFFT output is a 64-valued complex symbol to which a cyclic extension is
appended.





















Figure 7.7: MC-CDMA transmission frame assumed for 64 sub-carriers. 
The data bits for 'N' users are then spread by multiplying them by the corresponding
Walsh-Hadamard code matrix before summing them. The resultant values are then
modulated with the help of the inverse fast Fourier transform. The cyclic prefix is also
appended to the output of the inverse fast Fourier transform. The same transmission procedure
has to be repeated for all the 31 data bits. The next frame again commences with a known
pilot symbol. The number of users is assumed to be 32. The Matlab code for modelling of a
MC-CDMA transceiver is listed in Appendix B.
The fading channel is modelled by a five tap FIR filter. The filter tap gains are modelled by
complex random numbers whose average power decreases exponentially, with exponents from
0 to -4 in steps of one. This results in a random, exponentially decreasing impulse response for
the channel. Noise is added to the above channel representation as per the required signal
to noise ratio [76].
The combining operation in the receiver can be divided into channel estimation and demodu-
lation phases. In the channel estimation phase, the Combiner extracts the channel information 
from the training symbol whereas in the demodulation phase, it uses the estimated phase in-
formation to recover the data. The receiver first extracts the cyclic prefix from the received 
symbols before the start of these phases. The symbols after cyclic prefix removal are fed to the 
FFT block to restore the signal back into the frequency domain. The data symbols are then de-
spread by multiplying the received signal by the synchronised replica of the Walsh-Hadamard 
sequence before the combining operation. 
The objective of the channel estimation phase is the computation of the equaliser coefficient
A(k) given by equation 7.3. If the known input pilot bit is assumed to be unity, then each of the
64 FFT output values corresponding to the pilot symbol is a measure of the channel transfer
function at that sub-carrier frequency. The reciprocal of the channel transfer function is
obtained by taking its complex conjugate and dividing by the square of its magnitude. This
reciprocal is the equaliser coefficient for λ = 0 as per equation 7.3. The channel estimation
phase ends after the generation and storage of the equaliser coefficients. These coefficients
are computed at the beginning of each frame and remain fixed for the rest of the frame. A slow
fading channel is assumed here. For a fast fading channel, pilot symbols must be sent more often,
so fewer data symbols can be placed between two pilot symbols.
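The estimation step can be sketched in a few lines of Python. The form conj(H) / (|H|^2 + λ) is
inferred from the description of the division module in Section 7.4.1.3 and should be checked
against equation 7.3. The function and parameter names are hypothetical.

```python
def equaliser_coefficients(pilot_fft, lam=0.0):
    """Return conj(H) / (|H|^2 + lam) per sub-carrier, assuming a unity pilot bit."""
    coeffs = []
    for h in pilot_fft:
        h_sq = h.real * h.real + h.imag * h.imag          # |H(k)|^2
        coeffs.append(complex(h.real, -h.imag) / (h_sq + lam))
    return coeffs
```

With lam = 0 each coefficient is the exact reciprocal of the channel estimate, so multiplying a
coefficient by its channel value gives 1.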
The demodulation phase commences with the multiplication of the FFT outputs corresponding to
the data symbols by their respective equaliser coefficients, followed by summation over all 64
sub-carriers as per equation 7.2. The sign bit of the accumulated output corresponds to the
received data bit. The same procedure is repeated for all 31 data symbols before the start of a
fresh channel estimation phase. The estimation and demodulation phases are illustrated by the
flowchart in Figure 7.8 [76].
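The two phases can be put together for one frame as in the sketch below. It is a behavioural
Python model under simplifying assumptions: a single user, a unity pilot, λ = 0, and a mapping of
a non-negative accumulated value to +1. None of these details is taken from the flowchart itself.

```python
def demodulate_symbol(fft_out, chips, eq):
    """Despread, equalise and accumulate one data symbol; return the bit as +1 or -1."""
    acc = 0.0
    for x, c, g in zip(fft_out, chips, eq):
        x = c * x                                     # despreading by the +1/-1 chip
        acc += x.real * g.real - x.imag * g.imag      # real part of x * g
    return 1 if acc >= 0 else -1                      # sign of the accumulated output


def demodulate_frame(frame_fft, chips):
    """frame_fft[0] is the pilot symbol; the remaining symbols carry data."""
    pilot = frame_fft[0]
    eq = [h.conjugate() / abs(h) ** 2 for h in pilot]     # lambda = 0 case
    return [demodulate_symbol(sym, chips, eq) for sym in frame_fft[1:]]
```

In the frame described in the text, frame_fft would hold 32 symbols of 64 complex values each,
giving 31 decoded bits.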
7.4 MC-CDMA receiver architecture 
An MC-CDMA receiver consists of an FFT processor to extract the frequency contents of the
received signal and a Combiner for despreading and equalisation. This section describes a
conventional (Conv) and our low power (Our) MC-CDMA receiver architecture. A conventional
64 sub-carrier MC-CDMA receiver, shown in Figure 7.9, consists of a conventional 64-point
radix-4 pipelined FFT processor and a conventional Combiner. The conventional 64-point
radix-4 pipelined FFT processor has already been discussed in the previous chapter. The only
difference between the conventional and the low power Combiner lies in the disabling of the
[Flowchart: Start. Channel estimation phase: compute the equaliser coefficients for each
sub-carrier until sub-carrier == 64, then increment symbol and set sub-carrier = 0. Data
demodulation phase: multiply the real part of the data by the real part of the equaliser
coefficient (A); multiply the imaginary part of the data by the imaginary part of the equaliser
coefficient (B); compute (A - B) and accumulate; increment sub-carrier until sub-carrier == 64;
the data estimate is then ready. Increment symbol and repeat until symbol == 32.]
Figure 7.8: Estimation and demodulation phases in a MC-CDMA receiver. 
Figure 7.9: Block diagram of the conventional MC-CDMA receiver.
Figure 7.10: Block diagram of our low power MC-CDMA receiver.
FIFO and the divider modules in the data demodulation phase through clock gating. Hence,
only the low power Combiner block is described in this section. A low power MC-CDMA receiver,
shown in Figure 7.10, consists of a low power ordered 64-point radix-4 pipelined FFT
processor and a low power Combiner. The ordered pipelined FFT processor architecture is
based on altering the order of the coefficients fed to the multiplier of the second stage of a
64-point FFT processor so as to minimise the switching activity between successive coefficients,
which reduces power as discussed in the previous chapter. The low power Combiner is
also pipelined, resulting in a fully pipelined MC-CDMA receiver architecture. Some of the
modules of the low power Combiner are clock gated to reduce its power consumption. This
work will be reported in [110].
Since the radix-4 FFT processor architectures have already been covered in detail in the
previous chapter, and there is little difference between the conventional and the low power
Combiner, this section explains only our low power Combiner architecture.
7.4.1 Low power Combiner architecture 
A Combiner block performs de-spreading, channel estimation and data demodulation to recover
the transmitted bits. It first estimates the channel transfer function with the help of known
symbols called pilot symbols. The transmitted data is then recovered by dividing the real part
of the received signal by the estimated channel transfer function. An MC-CDMA data frame
comprises pilot and data symbols. The pilot symbol helps in estimating the channel transfer
Figure 7.11: Combiner architecture. 
function. This transfer function is assumed to remain fixed between two pilot symbols. A
64 sub-carrier system is assumed and the data is divided into blocks of 32 symbols, with the first
symbol in each block being used as a pilot for channel estimation. The equaliser coefficients
for all the sub-carriers are computed in the estimation phase, stored in the memory and used in
the demodulation phase, during which they are assumed to be fixed. The Combiner architecture is
shown in Figure 7.11. It comprises the following modules:
• De-spreading module.
• Multiplication and accumulation module. 
• Division module. 
• Memory module. 
• Finite state machine. 
7.4.1.1 De-spreading Module
The main purpose of this module is to multiply the input (FFT outputs) by either +1 or -1
depending upon the corresponding chip value. This is accomplished by selectively complementing
the input for a chip value of -1 and letting it through for +1. This task can be performed




Figure 7.12: Multiplication and accumulation module. 
easily by a set of XOR gates, each with one input connected to the chip value. The chip
values are stored in a 64-bit ROM.
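The XOR-based selective complement can be modelled as below. The 16-bit word length and the
encoding of a chip value of -1 as the bit 1 are assumptions made only for this sketch.

```python
WIDTH = 16                       # word length: an assumption for illustration
MASK = (1 << WIDTH) - 1


def to_signed(word):
    """Interpret a WIDTH-bit word as a two's complement integer."""
    return word - (1 << WIDTH) if word & (1 << (WIDTH - 1)) else word


def despread_xor(word, chip_bit):
    """XOR every bit of the word with chip_bit (1 is assumed to encode a chip of -1)."""
    return word ^ (MASK if chip_bit else 0)
```

A bank of XOR gates on its own produces the one's complement, which in two's complement
arithmetic equals -x - 1 rather than -x. For example, despreading 5 with a chip of -1 gives -6.
How the remaining one-LSB offset is treated in the hardware is not described in this section.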
7.4.1.2 Multiplication and accumulation module 
This module comprises two multipliers, which are shared between the channel estimation phase
and the demodulation phase to reduce the power consumption, as shown in Figure 7.12. In the
channel estimation phase, both inputs of Mult I are fed with x_r and both inputs of Mult II are
fed with x_i in order to compute the squared magnitude of the channel transfer function,
|H(k)|^2. This is accomplished by controlling the select inputs S11 and S12 of MUX A and MUX
B with the help of an FSM. The control input S13 is set to '0' in the estimation phase so that
the summer (SUM) adds the two multiplier outputs. A summer based accumulator is used instead
of a programmable adder/subtractor to reduce the power consumption. The select input S14
always selects '0' in the estimation phase because no accumulation is needed during channel
estimation. The output H_sq (the squared magnitude of the channel transfer function) from this
module is used by the division module to compute the equaliser coefficients A(k) given by
equation 7.3. Registers are placed at the inputs of the multipliers to reduce glitching at these
inputs and therefore the power consumption. Moreover, the registers between the SUM block and
the multipliers shorten the critical path, which lowers the power consumption on account of
less buffering. The demodulation phase requires the calculation of the real part of the product
of the data and the equaliser coefficient (x_r.eq_r - x_i.eq_i) for each sub-carrier followed by their
Figure 7.13: Division module. 
summation over all the sub-carriers as required by equation 7.2. The multiplier Mult I is fed
with x_r and eq_r, and Mult II with x_i and eq_i. This is again controlled by the select lines
S11 and S12 in Figure 7.12. The XOR control input S13 is set to '1' in the demodulation phase so
that the SUM block computes the difference between the two multiplier outputs. The select input
S14 in the demodulation phase selects a zero input for the first data sub-carrier and the
accumulator output (ACC) for the rest of the data sub-carriers. The MSB or sign bit of the
output (FO) accumulated over 64 sub-carriers is a measure of the transmitted bit. The outputs
xd_r and xd_i are fed to the division module to reduce the size of the FIFO by one register
level.
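The sharing of the two multipliers between the phases can be summarised by a small behavioural
model. This Python sketch ignores the pipeline registers and the word lengths. The boolean
arguments estimation and first are hypothetical stand-ins for the combined effect of the select
signals S11 to S14.

```python
def mac_step(x_r, x_i, eq_r, eq_i, acc, estimation, first):
    """One step of the shared datapath; returns (sum_out, new_acc).

    estimation=True : both multipliers square their input and SUM adds (S13 = 0).
    estimation=False: Mult I = x_r*eq_r, Mult II = x_i*eq_i and SUM subtracts (S13 = 1).
    first=True feeds zero instead of the accumulator output (the role of S14).
    """
    a = x_r * (x_r if estimation else eq_r)       # Mult I, second operand chosen by MUX A
    b = x_i * (x_i if estimation else eq_i)       # Mult II, second operand chosen by MUX B
    total = a + b if estimation else a - b        # SUM block, mode set by S13
    feedback = 0 if (estimation or first) else acc
    return total, total + feedback
```

In the estimation phase the first returned value plays the role of H_sq. In the demodulation
phase the second value is carried from one sub-carrier to the next and its sign is read after
the last sub-carrier.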
7.4.1.3 Division Module 
The division module, shown in Figure 7.13, computes the real and imaginary parts of the
equaliser coefficients, m_r and m_i respectively, from the input pilot symbol and the H_sq
output of the accumulation module as required by equation 7.3. The real part m_r is obtained by
dividing the real part of the pilot symbol by the sum of H_sq and the factor λ. The imaginary
part m_i is obtained by dividing the complement of the imaginary part of the pilot symbol by
the same divisor. The imaginary part is negated so that the equaliser coefficient tends towards
the reciprocal of the channel transfer function for λ = 0 as per equation 7.3. This is required
to remove the channel effects later by multiplying the incoming data block by the equaliser
coefficients.
The two-word FIFO block and the register R are enabled only during the channel estimation
phase by the enF and enR signals. This reduces the power consumption by reducing the switching
activity at the divider inputs and inside the FIFO block. The FIFO is needed to synchronise the
numerator and the denominator of the dividers. A 1's complementer block (COMP) is also needed
to generate the complement of the imaginary part of the FIFO output. The division module does
not consume much power because it is active only during the short channel estimation phases.
The division module is based on two dividers, Divider I and Divider II, with 'Num' indicating
the numerator and 'Den' the denominator.
The difference between the low power and the conventional Combiner lies in the provision
of the gated control signals enF and enR. These gated signals are absent from the conventional
Combiner architecture because it does not support clock gating. This means that the dividers
and the FIFO consume power even during the data demodulation phase in the conventional
Combiner.
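The effect of the enable signals on switching activity can be illustrated with a toy model. The
Python sketch below counts bit flips at the output of a register with and without an enable.
The toggle count is only a crude proxy for dynamic power, and the data values are arbitrary
examples rather than simulation data from the thesis.

```python
def toggles(values):
    """Total number of bit flips between successive values of a register."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


def gated_register(inputs, enables, reset=0):
    """Register output sequence: the register loads a new value only when enable is 1."""
    out, held = [], reset
    for value, en in zip(inputs, enables):
        if en:
            held = value
        out.append(held)
    return out


inputs = [3, 12, 5, 10, 15, 0, 9, 6]        # arbitrary example words
enables = [1, 1, 0, 0, 0, 0, 0, 0]          # enabled only for the first two cycles
gated = gated_register(inputs, enables)
```

Here the ungated sequence produces 22 bit flips whereas the gated register output produces 4,
because the register holds its value once the enable is removed.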
7.4.1.4 Memory Module 
The memory module stores the equaliser coefficients. A 64 sub-carrier system requires a memory
of 64 words. A dual port RAM is used because the channel estimation and demodulation phases
overlap slightly in the pipelined receiver, so the memory must be read and written at the same
time.
7.4.1.5 Finite State Machine Module 
The finite state machine in Figure 7.11 generates 23 control signals. The control signals S11,
S12, S13 and S14 control the blocks in the Multiplication and Accumulation module depending
upon the receiver's phase, whereas the enable signals enF and enR control the selective enabling
of blocks in the Division module. For instance, during the channel estimation phase, S11 and
S12 select the FFT outputs x_r and x_i, whereas during the data demodulation phase these signals
select the equaliser coefficients eq_r and eq_i respectively. This module also generates the read
address (ra) and the write address (wa) for the dual port RAM.
Multiplier type    Conv (mW)    Our (mW)    % Power saving
csa                148.23       129.55      13
nbw                140.73       126.74      10
wall               143.48       126.65      12
Table 7.1: Power consumption comparison of the MC-CDMA receivers for different multiplier
types.

Major receiver block    Conv (mW)    Our (mW)    % Block power saving
FFT                     100.35       92.76       9.3
Combiner                39.80        28.86       27
Table 7.2: Power consumption comparison of the major blocks of the MC-CDMA receiver for
csa multiplier.
7.4.2 Results 
The Conv and Our MC-CDMA receiver cores have been designed at the register transfer level
(RTL) using the Verilog hardware description language. The cores were then synthesised using
Synopsys Design Compiler with the UMC 0.18µm standard cell CMOS library. Layouts of
the cores were generated using the Envisia Silicon Ensemble place-and-route software. RC
information was then extracted and RC back-annotated post-layout gate-level netlist simulations
were performed for 4000 received data samples using the Verilog-XL simulator. The resulting
switching activity of the circuit nets and the capacitive load information extracted from the
layout were then used by Synopsys DesignPower to compute the power consumption of the
receiver cores. All the simulations were carried out at a clock frequency of 50MHz. The input
data for the FFT is obtained by modelling the transmitter and receiver in Matlab for 32 users
with a signal-to-noise ratio equal to 40. The receiver hardware has also been verified for other
numbers of users and other signal-to-noise ratios. The power consumption results are almost
independent of the number of users and the signal-to-noise ratio. The simulation results are
listed in Tables 7.1, 7.2 and 7.3. It is clear from Table 7.1 that the total power saving of Our
is maximum for the csa multiplier and minimum for nbw. Table 7.2 lists the power consumed by
the two major blocks of the receiver, namely the FFT and the Combiner. The power saving in the
FFT module is only 9.3% whereas the power saving in the Combiner module is 27% for the csa
multiplier type. The power saving in the Combiner block will decrease for







Combiner block    Conv (mW)    Our (mW)
memory            20.45        19.65
FSM               0.234        0.245
mult              4.79         4.54
sum               0.30         0.28
acc               0.29         0.32
divider           10.30        0.595
fifo              1.17         0.741
Table 7.3: Power consumption in the various blocks of the Combiner for csa multiplier.







Table 7.4: Area comparison of the MC-CDMA receiver for csa multiplier 
fast fading channels because the required frequency of channel estimation goes up for these
channels. Comparing Tables 7.1 and 7.2 shows that the total power consumed is higher than the
sum of the individual block powers. This difference is due to the switching power consumed in
the input/output nets of the top level modules of the receiver. Table 7.3 lists the power
consumed in the various blocks of the Combiner for both the Conv and Our receiver architectures.
The memory used to store the equaliser coefficients consumes most of the power in both
architectures. The divider modules consume much more power in the Conv architecture than in Our
because the inputs to the divider modules are kept fixed during the demodulation phase in Our
architecture. The divider module is needed only in the channel estimation phase, so it can be
disabled by holding its inputs fixed during the demodulation phase. This fixing of the divider
inputs (zero switching activity) is achieved by clock gating both the FIFO and the register.
The FIFO holds the dividend whereas the divisor is stored in the register R. This technique
reduces the divider power significantly because the demodulation phase is much longer than the
channel estimation phase. The area overhead of Our architecture is 2.25% as per Table 7.4. The
layouts of the Conv and Our architectures for the csa multiplier type are shown in
Figures 7.14 and 7.15 respectively.
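The percentage savings quoted in Table 7.1 follow directly from the two power columns, as the
short Python check below shows. The dictionary simply copies the values from Table 7.1. The
function name is hypothetical.

```python
TABLE_7_1 = {            # multiplier type: (Conv mW, Our mW), copied from Table 7.1
    "csa": (148.23, 129.55),
    "nbw": (140.73, 126.74),
    "wall": (143.48, 126.65),
}


def saving_percent(conv_mw, our_mw):
    """Relative power saving of Our with respect to Conv, in percent."""
    return 100.0 * (conv_mw - our_mw) / conv_mw


savings = {name: saving_percent(*pair) for name, pair in TABLE_7_1.items()}
```

Rounded to the nearest integer, the computed savings are 13% for csa, 10% for nbw and 12% for
wall, in agreement with the last column of Table 7.1.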
7.5 Summary 
This chapter has presented a low power MC-CDMA receiver architecture for a 64 sub-carrier
system. The same architecture could be extended to any number of sub-carriers. The low power
receiver architecture is based on the 64-point radix-4 ordered FFT processor. The Combiner
architecture also employs extensive clock gating to reduce its power consumption. Power
comparison results with respect to the conventional receiver architecture are presented to
demonstrate the effectiveness of the techniques used. The next chapter proposes a reconfigurable
MC-CDMA receiver architecture which is based on a reconfigurable radix-4 FFT processor and a
reconfigurable Combiner. The reconfigurability is exploited to design an adaptive MC-CDMA
receiver whose architecture varies with the channel parameters instead of being designed for the
worst case channel conditions. This reconfigurability saves power when switching from longer
FFTs to shorter FFTs.
Figure 7.14: Layout of the Conventional MC-CDMA receiver core,
Power = 148.23mW, Area = 2.227mm².
Figure 7.15: Layout of Our low power MC-CDMA receiver core,
Power = 129.55mW, Area = 2.27mm².
Chapter 8 
Low power reconfigurable MC-CDMA 
receiver architecture 
This chapter proposes a novel concept of adjusting the receiver hardware size in real time as
per the channel parameters in multicarrier wireless receivers. The FFT is one of the most
power consuming blocks in multicarrier receivers. The FFT size in an OFDM/MC-CDMA based
wireless receiver varies from 1024-point to 16-point depending upon the channel parameters.
To prove the concept, a low power reconfigurable radix-4 256-point FFT processor architecture
is proposed here that can also be configured as a 64-point or 16-point processor as per the
channel parameters. By tailoring the clocks of the higher FFT stages, which are needed only for
longer FFTs, significant power saving is achieved when switching from longer to shorter FFTs.
This channel parameter driven approach is also applied to the Combiner module of the receiver.
It can also be applied to other blocks of the receiver such as the Viterbi decoder.
This chapter is organised into five sections. The chapter starts with a section on the motivation 
behind this approach. The second section establishes the relationship between the FFT size 
and the channel parameters. The third section describes two 256-point reconfigurable FFT 
processor architectures that can also be reconfigured as 64-point or 16-point. The fourth section 
proposes two 256 sub-carrier MC-CDMA receiver architectures that can also be configured for 
64 sub-carriers. The chapter concludes with the power saving results for both the reconfigurable 
FFTs and the reconfigurable MC-CDMA receivers. 
8.1 Motivation 
The critical design issue for future wireless receivers is the combined requirement of high
performance, low power and flexibility. Wireless systems have diverse application requirements
in the form of changing data rate and bit error rate along with changing bandwidth and other
channel parameters like the delay spread. It is desirable for wireless receivers to adapt their
operation to these conditions instead of being designed for the worst case scenario.
Figure 8.1: MC-CDMA receiver 
In multi-carrier systems like multi-carrier code division multiple access (MC-CDMA) or
orthogonal frequency division multiplexing (OFDM), the two most power consuming blocks in
the receiver are the FFT and the Viterbi decoder [90], as shown in Figure 8.1. Researchers have
already investigated low power architectures for these two important blocks [53, 94, 111]. The
way forward to reduce the power consumption further is to reduce the complexity of the receiver
architecture dynamically in real time as per the changing channel requirements such as the
delay spread, signal to noise ratio (SNR), bandwidth and bit error rate. In [112], the
researchers have shown the potential of saving power in a Viterbi decoder by dynamically varying
its architecture according to real-time changes in system characteristics.
8.2 Dependence of FFT size on channel parameters 
In a basic OFDM system, a guard interval, popularly known as a cyclic prefix, is inserted in
every symbol to overcome the effect of inter-symbol interference (ISI). The guard interval needs
to be longer than the delay spread of the channel. The OFDM symbol duration is chosen to be
about five times longer than the guard interval in the interests of transmission efficiency [90].
The number of sub-carriers (FFT size) is given by:
(OFDM symbol duration - guard interval) * bandwidth.
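The relation above can be turned into a small sizing helper. In the Python sketch below, the
ratio of guard interval to delay spread (guard_factor), the rounding up to a power of two and
the 20MHz bandwidth used in the example are assumptions chosen only for illustration. The
factor of five between symbol duration and guard interval follows the rule of thumb quoted
above.

```python
def fft_size(delay_spread_s, bandwidth_hz, guard_factor=2.0, symbol_factor=5.0):
    """Sub-carrier count from (symbol duration - guard interval) * bandwidth.

    guard_factor and the rounding up to a power of two are assumptions of this sketch.
    """
    guard = guard_factor * delay_spread_s        # guard interval longer than the delay spread
    symbol = symbol_factor * guard               # symbol about five times the guard interval
    n = (symbol - guard) * bandwidth_hz
    size = 1
    while size < n - 1e-9:                       # small tolerance against rounding error
        size *= 2
    return size


indoor_long = fft_size(370e-9, 20e6)     # longest indoor delay spread quoted in this chapter
indoor_short = fft_size(30e-9, 20e6)     # shortest indoor delay spread quoted in this chapter
```

Under these assumptions the required size grows in proportion to the delay spread, which is the
dependence that the reconfigurable receiver exploits.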
In MC-CDMA, the number of sub-carriers depends upon the delay spread (τ), the maximum
Doppler frequency (fd) and the transmission rate (R) [16]. Figure 8.2 shows the variation
of the optimum number of sub-carriers as a function of the normalised delay spread
(τ/(R * P), where P is the processing gain) for different values of the normalised maximum
Doppler frequency Nfd (fd/(R * P)). It is clear from Figure 8.2 that, for the intermediate range
of delay spread, fd and R, the optimum number of sub-carriers varies from 16 to 1024. Only
in extreme cases does the optimum number of sub-carriers go beyond this range. The bottom plot













[Figure 8.2 plot: x-axis Normalised Delay Spread, logarithmic scale from 0.001 to 100.]
Figure 8.2: Optimum number of sub-carriers as a function of normalised delay spread for 
different maximum Doppler frequencies. 
It has been clearly established that in both MC-CDMA and basic OFDM, the number of
sub-carriers (FFT size) is a strong function of the delay spread. Since the indoor delay spread
ranges from 30ns to 370ns depending on the building size [90] and the outdoor delay spread is
much longer, it is desirable to design a reconfigurable FFT processor whose size can be tailored
in real time to channel parameters such as the delay spread, the maximum Doppler frequency and
the transmission rate. The basic idea is to design the receiver for the maximum number of
sub-carriers (FFT size) and then clock gate the unused blocks for the smaller sizes depending
upon the delay spread. The Combiner architecture can also be made reconfigurable by clocking
down the unused segments of the segmented memory that stores the equaliser coefficients when
fewer sub-carriers are used. This approach of adjusting the hardware size in real time on the
basis of changes in the channel parameters can be extended to other blocks of the receiver like
the Viterbi decoder to save even more power. This work is reported in [113].
The receiver will switch to the appropriate FFT size, Combiner and Viterbi decoder architectures
automatically after reading channel parameters such as the delay spread, SNR and the bit error
rate in real time. This reading operation is carried out at a much lower frequency than other
operations and therefore the power overhead is minimal.
The FFT stages can be constrained at the synthesis stage in such a way that the smaller stages
can operate at a higher frequency to support the higher bit rate demand for shorter




Figure 8.3: Architecture of the radix-4 256-point reconfigurable FFT processor. 
8.3 Reconfigurable FFT processor architecture 
This section proposes to adjust the FFT size in real time as per the channel parameters instead
of using a fixed large FFT based receiver designed for the worst case channel parameters like
the delay spread, transmission rate and the maximum Doppler frequency. The FFT size in
MC-CDMA varies from 16-point to 1024-point [16] depending upon the channel parameters.
Significant power saving is achieved by using the most appropriate FFT size instead
of a fixed large FFT size for worst case channel conditions. This is achieved by monitoring the
channel parameters in real time. To prove the concept, a shorter 256-point reconfigurable
radix-4 FFT processor architecture is proposed in this work that can also be configured as a
64-point or 16-point processor by tailoring the clocks of the higher stages in real time. The
hardware overhead in the form of logic for monitoring the delay spread and other channel
parameters is minimal because this logic will operate at a much lower frequency.
The reconfigurable FFT processors are based on the Bi and Jones radix-4 pipelined
architecture [13]. It is better than other pipelined architectures in terms of computational
efficiency and hardware savings in complex multipliers, adders and data stores. It consumes less
power than the other radix-4 pipelined architectures because it requires less hardware [89]. A
reconfigurable 256-point radix-4 pipelined architecture is proposed here to reduce the power
consumption as compared to a radix-2 architecture.
A reconfigurable 256-point radix-4 pipelined FFT processor comprises four radix-4 stages as
shown in Figure 8.3. Reconfigurability is achieved by inserting two multiplexers, MUX I and
MUX II, between the higher stages for routing the input data directly to stage 2 or stage
3 depending upon the required FFT size. The reconfigurable FFT processor can act as a
64-point processor by feeding the input data directly into stage 2 and clocking down the first
stage. This is accomplished by selecting the input data rather than the output of stage 1 by the
[Block diagram: Stage input → Commutator → Butterfly → complex multiplier (fed with the
Coefficient) → Stage output]
Figure 8.4: Architecture of a basic stage of a radix-4 256-point reconfigurable FFT processor. 
external select input S256 of MUX I. Moreover, the gated clock input (G-ck1) to stage 1 is
disabled by the FFT finite state machine (FFSM). Similarly, it can act as a 16-point processor
by controlling the select input S64 of MUX II to feed the input data directly into stage 3 and
disabling stage 1 and stage 2 with the help of the gated clocks G-ck1 and G-ck2.
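The reconfiguration rules described above can be collected in a small lookup. The Python sketch
below models only the routing and clock-enable decisions. The dictionary keys are hypothetical
names, and the polarity of the select inputs S256 and S64 is deliberately left out because it is
not stated in the text.

```python
def fft_config(size):
    """Routing and clock enables of the four radix-4 stages for a given FFT size."""
    if size not in (256, 64, 16):
        raise ValueError("supported sizes: 256, 64 and 16 points")
    return {
        "input_enters_stage": {256: 1, 64: 2, 16: 3}[size],   # set through MUX I / MUX II
        "stage1_clock_on": size == 256,                       # gated clock G-ck1
        "stage2_clock_on": size >= 64,                        # gated clock G-ck2
        "active_stages": {256: 4, 64: 3, 16: 2}[size],
    }
```

Since every radix-4 stage contributes a factor of four, the FFT size always equals 4 raised to
the number of active stages.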
The basic stage of the FFT processor comprises a commutator, a butterfly and a complex
multiplier as shown in Figure 8.4. The last stage contains just the commutator (COM4) and
the butterfly (BUT4) as per the architecture described in Chapter 6. The architectures of the
commutator, the butterfly and the complex multiplier have been discussed in detail in Chapter 6.
The commutator contains six FIFOs which are realised using dual port RAMs for low power
consumption. The FFT finite state machine (FFSM) is responsible for generating all the control
signals for all the FFT stages. It is a combination of four different finite state machines (FSM),
i.e. one FSM per stage. Each FSM generates seven control signals C1..7 for its stage. The
select lines of MUX I and MUX II are activated by the external inputs S256 and S64 as per the
FFT size.
Two different reconfigurable FFT processor architectures are proposed depending upon the
extent of clock gating. In the first architecture, RFFT-I, clock gating is applied to all the
blocks of the commutator including the dual port RAM based FIFOs for stages 1 and 2 of the FFT
processor. The second FFT processor architecture, RFFT-II, is obtained by limiting
clock gating to the registers and FSMs of stage 1 and stage 2 only. This selective clock gating
approach reduces the power overhead of the reconfigurable FFT over the fixed FFT of the same
size. On the other hand, it reduces the power saving in switching from longer to shorter FFTs.
This reduction in power saving is due to the power consumed in the dual port RAMs of the stage 1
and stage 2 commutators on account of the clock signal activity in these stages even for shorter
FFTs.
[Block diagram: Input → FFT-I → COMBINER-I → FO]
Figure 8.5: Block diagram of RECEIVER-I.
[Block diagram: Input → FFT-II → COMBINER-II → FO]
Figure 8.6: Block diagram of RECEIVER-II.
8.4 Reconfigurable MC-CDMA receiver architecture 
This thesis proposes two different reconfigurable architectures for the 256 sub-carrier
MC-CDMA receiver, namely RECEIVER-I and RECEIVER-II. The difference between the RECEIVER-I
and RECEIVER-II architectures lies in the extent of clock gating within their FFT and Combiner
blocks. Clock gating is applied to most of the unused blocks including the dual port RAMs
of the FFT and the Combiner in RECEIVER-I, whereas it is selectively applied to only the
registers and FSMs of RECEIVER-II. The block diagrams of RECEIVER-I and RECEIVER-II are
shown in Figure 8.5 and Figure 8.6 respectively.
RECEIVER-I comprises a 256-point reconfigurable FFT processor (FFT-I) tailored for the
MC-CDMA receiver and a reconfigurable Combiner block (COMBINER-I) with a partitioned
equaliser memory that can be partly disabled for a 64 sub-carrier receiver. The reconfigurable
FFT-I processor contains an ordering block (ORD) for restoring the digit reversed FFT output
into normal order for onward processing. The clock gating in the FFT-I processor is applied to
all the blocks of stage 1. It is important to note that the external select input S256 is used for
disabling the blocks of both RECEIVER-I and RECEIVER-II.
RECEIVER-II consists of a 256-point reconfigurable FFT processor (FFT-II) tailored for the
MC-CDMA receiver and a reconfigurable Combiner block (COMBINER-II) with clock gating
limited to only its FSM for the 256 sub-carriers. The reconfigurable FFT-II processor also
contains an ORD block like FFT-I. The clock gating in the FFT-II processor is limited to the
registers and FSM of its stage 1.




Figure 8.7: Architecture of the 256-point reconfigurable FFT processor used in the 
reconfigurable 256 sub-carrier MC-CDMA receiver. 
The selective application of gated clocks reduces the power overhead of the 256 sub-carrier
reconfigurable receiver as compared to a fixed 256 sub-carrier receiver, but it also reduces the
power saving in going from a 256 sub-carrier to a 64 sub-carrier receiver. The next subsection
describes the proposed reconfigurable FFT processor architectures, namely FFT-I and FFT-II, for
the MC-CDMA receivers.
8.4.1 Architecture of the 256-point reconfigurable FFT processor used in the 
reconfigurable receiver 
The architectures of the reconfigurable FFT processors, namely FFT-I and FFT-II, are tailored
to the requirements of the reconfigurable MC-CDMA receiver. Since the reconfigurable MC-CDMA
receiver assumed here supports only 256 and 64 sub-carriers, only 256-point and 64-point FFT
sizes are required. Moreover, the Combiner of the 256 sub-carrier MC-CDMA receiver needs the
FFT outputs in normal order. This means that the reconfigurable FFT processors FFT-I and FFT-II
must have an additional ordering stage (ORD) to convert the digit reversed FFT output order into
normal order. The architecture of the reconfigurable FFT processor employed in an MC-CDMA
receiver is shown in Figure 8.7. Both FFT-I and FFT-II have the same basic architecture. The
difference again lies in the extent of clock gating. The architecture of FFT-I is similar to
RFFT-I, in which clock gating is applied to all the unused modules of the stage 1 commutator
including its dual port RAM. The architecture of FFT-II is similar to RFFT-II, with clock gating
confined to the registers and FSM of stage 1. The ordering block (ORD), shown in Figure 8.8,
restores the FFT output into normal order for carrying out the combining operation for a 256
sub-carrier receiver. It comprises two RAMs, namely RAM0 and RAM1, each having 256 locations to
store the 256 words of the FFT. The two RAMs are needed for
Figure 8.8: Architecture of the reordering stage in a pipelined FFT processor 
uninterrupted operation of the pipelined receiver. The first RAM stores the digit-reversed output 
of the FFT whereas the second RAM outputs the previous block of the FFT output in normal 
order to the Combiner for further processing. The roles of the two RAMs are reversed 
after every 256 clock cycles for real-time processing. The address pointer to the RAM that 
stores the FFT output is generated by a ROM based on the actual location of the digit-reversed 
data, whereas the address pointer to the RAM outputting data to the Combiner is generated by 
a counter. Two multiplexers, namely MUXA0 and MUXA1, are used to select the appropriate 
address for RAM0 and RAM1 respectively. A multiplexer (MUXD) is used to select the RAM 
outputs. An FSM controls the ordering operation by flipping between the RAMs for reading and 
writing operations. It controls the select lines of the multiplexers and also has a counter for 
generating the address. 
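The ping-pong operation of the ordering block can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: a radix-4 FFT is assumed (four base-4 digits for 256 points), and the names `digit_reverse`, `ADDRESS_ROM` and `OrderingBlock` are hypothetical and do not come from the thesis RTL.

```python
N = 256  # number of FFT output words per block


def digit_reverse(index, digits=4, bits_per_digit=2):
    """Reverse the base-4 digits of index (assumes a radix-4, 256-point FFT)."""
    mask = (1 << bits_per_digit) - 1
    reversed_index = 0
    for _ in range(digits):
        reversed_index = (reversed_index << bits_per_digit) | (index & mask)
        index >>= bits_per_digit
    return reversed_index


# Plays the role of the ROM: write address for the count-th FFT output word.
ADDRESS_ROM = [digit_reverse(i) for i in range(N)]


class OrderingBlock:
    """Two RAMs: one is written in digit-reversed order, the other is read in normal order."""

    def __init__(self):
        self.rams = [[0] * N, [0] * N]
        self.sel = 0      # index of the RAM currently being written
        self.count = 0    # counter kept by the FSM

    def clock(self, data_in):
        write_ram = self.rams[self.sel]
        read_ram = self.rams[1 - self.sel]
        data_out = read_ram[self.count]               # counter address: normal order
        write_ram[ADDRESS_ROM[self.count]] = data_in  # ROM address: undo digit reversal
        self.count += 1
        if self.count == N:                           # swap RAM roles every 256 cycles
            self.count = 0
            self.sel ^= 1
        return data_out
```

Feeding one block in digit-reversed order and then clocking a further 256 cycles returns that block in normal order, which mirrors the one-block latency of the two-RAM scheme described above.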
This ordering block is partially disabled in the case of a 64 sub-carrier receiver by keeping its input 
fixed with the help of a multiplexer (IMUX) and also by disabling the clock of its FSM in both the 
FFT-I and FFT-II architectures. This is possible because the order of the 64 outputs of the 
FFT is immaterial for the Combiner in the 64 sub-carrier receiver assumed here. The output 
multiplexer OMUX selects either the output of the ORD block or the digit-reversed FFT output 
Figure 8.9: Architecture of the reconfigurable Combiner 
depending upon the number of sub-carriers. The ORD block is bypassed for the 64 sub-carrier 
system using OMUX. 
8.4.2 Reconfigurable Combiner architecture 
The designed reconfigurable Combiner can handle both 256 and 64 sub-carriers. The basic 
architecture of the reconfigurable Combiner is quite similar to the dedicated Combiner archi-
tecture discussed in the previous chapter and is shown in Figure 8.9. It consists of a despread 
module, a multiplication and an accumulation module, a partitioned memory module to store 
the equaliser coefficients, a divider module used in the channel estimation phase and finite state 
machines module for controlling the reconfigurable Combiner. The memory module and the 
finite state machines module are different from the 64 sub-carrier Combiner discussed in the 
previous chapter. The frame for the 256 sub-carrier system is shown in Figure 8.10. It is assumed 
to comprise 32 symbols. The code length for this system is also assumed to be 64 and 
hence four bits are accommodated in one symbol. The first symbol is reserved for the pilot. The 
remaining 31 symbols represent data bits. The 256 sub-carrier Combiner also has two phases 
namely the channel estimation phase and the data demodulation phase. In the channel estima-
tion phase, all the 256 equaliser coefficients corresponding to every sub-carrier are computed 
and stored in a 256 word RAM. The data demodulation phase estimates the transmitted bit by 
extracting the real part of the product of received data symbols and the equaliser coefficients 
followed by its accumulation over the code length of 64 corresponding to each transmitted bit. 





































Figure 8.10: MC-CDMA frame assumed for 256/64 sub-carriers. The frame spans (1 + 31) = 32 
symbols along the time axis, of which 31 are data symbols, each comprising four data bits. 
tion phase for the next data bit. This demodulation phase continues till all the 124 data bits in 
the 31 symbols are recovered by the Combiner. The next frame then commences again with a 
fresh channel estimation phase. 
The finite state machines module of the reconfigurable Combiner is shown in Figure 8.11. It 
consists of two FSMs and a multiplexer. The FSMs namely FSM256 and FSM64 are needed 
to realise a reconfigurable Combiner architecture that can handle both 256 and 64 sub-carriers 
respectively. A multiplexer (MUX23) is used to select the appropriate FSM depending upon 
the number of sub-carriers as per the S256 control signal. 
Two reconfigurable Combiner architectures, namely COMBINER-I and COMBINER-II, are 
proposed for RECEIVER-I and RECEIVER-II respectively. The reconfigurable Combiner memory 
Figure 8.11: Finite state machines module for the reconfigurable receiver 
module has a partitioned memory such that it can operate as either a 64-word or a 256-word RAM. 
Only 64 words corresponding to 64 equaliser coefficients are needed for a 64 sub-carrier 
system, as opposed to 256 in a 256 sub-carrier system. The remaining 192 words of RAM and 
FSM256 are disabled by clock gating for the 64 sub-carrier system in COMBINER-I, whereas only 
FSM256 is disabled through clock gating in COMBINER-II. This means that COMBINER-II employs 
limited clock gating unlike COMBINER-I. 
It is clear that power saving occurs in both the FFT and the Combiner in switching from 256 
to 64 sub-carriers. The first FFT stage can be disabled partially or fully to significantly reduce 
its power consumption in switching from a 256 to a 64 sub-carrier receiver because this stage is 
not required in a 64 sub-carrier receiver. Similarly, the ordering stage can be partially disabled 
by keeping its inputs fixed for 64 sub-carriers. Moreover, the 192-word RAM and/or FSM256 
can also be disabled in the Combiner of a 64 sub-carrier receiver, thereby saving considerable 
power over a fixed 256 sub-carrier receiver. 
8.5 Results 
The reconfigurable 256 sub-carrier MC-CDMA receiver cores, namely RECEIVER-I and 
RECEIVER-II, the fixed 256 sub-carrier MC-CDMA receiver core, the 256-point reconfigurable 
FFT processor cores, namely RFFT-I and RFFT-II, and the fixed 256-point FFT processor core 
have been designed at the register transfer level (RTL) using the Verilog hardware description 
language. The cores were synthesized using SYNOPSYS Design Compiler with a UMC 0.18 µm 
standard cell CMOS library. Layouts of the cores were generated using Envisia Silicon Ensemble 
place-and-route software. This was followed by extracting RC information and then performing 
RC back-annotated post-layout gate-level netlist simulations for 4000 uniformly distributed 
random input data samples. The resulting switching activity of the circuit nets was then used by 
SYNOPSYS DesignPower to compute the power consumption for the different cores. All the 
FFT type                 FFT size (points)   Power (mW)   % Power saving
Fixed                    256                 123.01       -
Reconfigurable RFFT-I    256                 154.62       -26
                         64                  55.03        55
                         16                  18.69        85
Reconfigurable RFFT-II   256                 124.31       -1
                         64                  91.48        26
                         16                  70.44        43
Table 8.1: Power comparison between the fixed and the reconfigurable FFT processor 
architectures. 
simulations were carried out at a supply voltage of 1.8 V and a clock frequency of 25 MHz. 
The input data was obtained by modelling the transmitter, receiver and the channel in Matlab 
for 32 users and a signal-to-noise ratio equal to 40. The receiver hardware has been verified for 
other values of the number of users and signal-to-noise ratio with insignificant change in the 
power results. The input data wordlength is assumed to be 16 bits, which is typically used in 
wireless LAN applications. The results are listed in Tables 8.1 to 8.3. Table 8.1 lists the power 
saving of the reconfigurable FFTs RFFT-I and RFFT-II as compared to a fixed 256-point FFT. The 
power overhead is 26% for the 256-point reconfigurable architecture RFFT-I as compared to 
a fixed 256-point FFT processor. This power overhead is primarily due to the clock gating 
of the dual port RAMs of the commutators in both stage 1 and stage 2 of the reconfigurable 
FFT processor. This extensive clock gating in RFFT-I gives power savings of 55% and 85% in 
switching to 64-point and 16-point operation respectively, as compared to a fixed 256-point FFT 
processor. The RFFT-I architecture is ideally suited to channel conditions where the required FFT 
size is mostly less than the largest size of 256 points. 
The second reconfigurable architecture, RFFT-II, introduces a power overhead of just 1% as 
compared to the fixed 256-point FFT processor due to selective clock gating. The clock gating 
is applied only to the registers and FSM of stage 1 and stage 2 but not to the dual port RAM 
based FIFOs. This means that the power overhead of the clock gates used for gating the large 
memory is absent. Power savings of 26% and 43% are obtained in switching to 64-point 
and 16-point FFTs respectively as compared to a fixed 256-point FFT. The power saving in going 
down to a shorter FFT is much smaller for RFFT-II than for RFFT-I due to the selective 
clock gating. Hence, the RFFT-II architecture is more suitable for channel conditions where the 
probability of the maximum FFT size (256) is high. 
Data  Time          RFFT-I aggregate           RFFT-II aggregate          % Energy saving  % Energy saving
set   division      energy contribution (mJ)   energy contribution (mJ)   RFFT-I           RFFT-II
1     80% as 256    0.8*154.62 = 123.70        0.8*124.31 = 99.45
      10% as 64     0.1*55.03 = 5.50           0.1*91.48 = 9.15
      10% as 16     0.1*18.69 = 1.87           0.1*70.44 = 7.04
      Total Energy  131.07                     115.64                     -6.6             +6
2     10% as 256    0.1*154.62 = 15.46         0.1*124.31 = 12.43
      10% as 64     0.1*55.03 = 5.50           0.1*91.48 = 9.15
      80% as 16     0.8*18.69 = 14.95          0.8*70.44 = 56.35
      Total Energy  22                         77.93                      +82              +37
3     50% as 256    0.5*154.62 = 77.31         0.5*124.31 = 62.16
      30% as 64     0.3*55.03 = 16.51          0.3*91.48 = 27.44
      20% as 16     0.2*18.69 = 3.74           0.2*70.44 = 14.09
      Total Energy  97.56                      103.69                     +21              +16
Table 8.2: Aggregate energy saving of RFFT-I and RFFT-II architectures with respect to a 
fixed 256-point FFT processor for three data sets corresponding to three different 
FFT size requirements over three different time durations. 
Table 8.2 lists the aggregate energy saving of the RFFT-I and RFFT-II architectures with respect 
to a fixed 256-point FFT processor for three different FFT size requirements over three different 
time durations. An overall time slot of unity is assumed and the percentage refers to the 
fraction of that time over which the given FFT size is active. It is clear from Table 8.2 that 
for Data set 1, the RFFT-II architecture is better than the RFFT-I architecture. This is primarily due to 
the long fractional time slot of 80% allocated to the 256-point size and the large power overhead of 
the 256-point RFFT-I architecture with respect to the fixed 256-point FFT processor. It means that 
RFFT-II is better than RFFT-I in situations where the maximum FFT size (256 points) is required 
most of the time. On the other hand, for Data set 2 the RFFT-I architecture saves 82% energy as 
compared to only 37% for RFFT-II. This is because for Data set 2, the minimum FFT size (16 points) 
is required most of the time. Even for a relatively longer fractional time slot of 50% for the 
maximum FFT size corresponding to Data set 3, the energy saving of RFFT-I 
is more than that of RFFT-II. This indicates that for most cases where the largest FFT size 
is required for 50% of the time or less, the RFFT-I architecture gives more energy saving than RFFT-II. 
Either of the two architectures can be chosen depending upon the application's requirements. 
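The time-weighted energy comparison can be reproduced with a few lines of code. The sketch below uses the power figures of Table 8.1 and assumes, as the text does, a unit overall time slot, so that each contribution is simply the time fraction multiplied by the power. The names `POWER_MW`, `aggregate_energy` and `saving_percent` are illustrative only.

```python
# Power figures (mW) taken from Table 8.1.
POWER_MW = {
    "fixed": {256: 123.01},
    "RFFT-I": {256: 154.62, 64: 55.03, 16: 18.69},
    "RFFT-II": {256: 124.31, 64: 91.48, 16: 70.44},
}


def aggregate_energy(arch, time_fractions):
    """Sum of (time fraction x power) over the FFT sizes used, for a unit time slot."""
    return sum(frac * POWER_MW[arch][size] for size, frac in time_fractions.items())


def saving_percent(arch, time_fractions):
    """Energy saving relative to the fixed 256-point FFT, which runs for the whole slot."""
    fixed = POWER_MW["fixed"][256]
    return 100.0 * (fixed - aggregate_energy(arch, time_fractions)) / fixed


data_set_1 = {256: 0.8, 64: 0.1, 16: 0.1}
data_set_3 = {256: 0.5, 64: 0.3, 16: 0.2}

for name, fractions in (("Data set 1", data_set_1), ("Data set 3", data_set_3)):
    for arch in ("RFFT-I", "RFFT-II"):
        print(name, arch,
              round(aggregate_energy(arch, fractions), 2),
              round(saving_percent(arch, fractions), 1))
```

For Data sets 1 and 3 this agrees with the totals and savings listed in Table 8.2. For Data set 2 (10%, 10% and 80%) the same function returns about 35.9 for RFFT-I, which differs from the total of 22 printed in the table, so that entry may merit rechecking.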
It is evident from Table 8.3 that stage 1 of the FFT processor contributes most to the power 
consumption followed by stage 2, stage 3 and then the last stage. The power consumed by 
stage 1 commutator is maximum for RFFT-I architecture because the clock gating is applied 
FFT stage   Major block   Fixed FFT     RFFT-I        RFFT-II
                          power (mW)    power (mW)    power (mW)
Stage 1     COM1          62.90         93.38         63.45
            BUT1          0.94          0.92          0.94
            MULT1         6.11          6.11          6.11
Stage 2     COM2          18.79         28.05         19.01
            BUT2          0.91          0.89          0.94
            MULT2         6.15          6.14          6.13
Stage 3     COM3          5.71          5.78          5.79
            BUT3          0.94          0.98          0.97
            MULT3         6.19          6.17          6.17
Stage 4     COM4          1.31          1.30          1.31
            BUT4          1.50          1.44          1.51
-           MUX-I         -             0.1           0.12
-           MUX-II        -             0.05          0.05
Table 8.3: Power consumed by the major blocks of the fixed and reconfigurable 256-point FFT 
processors. 
to all the modules of its commutator. The stage 1 commutator power overhead is negligible 
for RFFT-II as compared to RFFT-I with respect to the fixed FFT because clock gating in 
RFFT-II architecture is limited to the FSM of its commutator and is not applied to its dual 
port RAMs. The same power consumption trend continues for stage 2 as well. Stages 3 and 4 
consume almost identical power for all the architectures due to the absence of clock gating in 
these stages. The additional logic for reconfiguration in the form of two multiplexers, namely 
MUX-I and MUX-II, does not consume much power. The real strength of the reconfigurable 
FFT architecture lies in using the optimum FFT length in real time thereby saving power. 
Table 8.4 lists the power consumed by the fixed and the reconfigurable receiver architectures. 
RECEIVER-I, in the 256 sub-carrier mode, consumes the most power, just like RFFT-I, as 
compared to both RECEIVER-II and the fixed receiver. This is because of extensive clock 
gating in both the FFT processor and the Combiner. The power overhead of RECEIVER-I is 
16% as compared to only 0.5% for RECEIVER-II. The power overhead of RECEIVER-II is 
much lower than that of RECEIVER-I because the clock gating is limited to only the registers 
and FSMs of the FFT and the Combiner. The power saving in going down to 64 sub-carriers is 
47% in RECEIVER-I because the power saving occurs in the whole of the commutator, the equaliser 
RAM and the ordering block. The power saving in switching to 64 sub-carriers is only 19% for 
RECEIVER-II due to the limited power reduction in the commutator and ordering block and the small 
Receiver type   Receiver size (sub-carriers)   Power (mW)   % Power saving
Fixed           256                            243.66       -
RECEIVER-I      256                            285.85       -16
                64                             129.45       +47
RECEIVER-II     256                            244.79       -0.5
                64                             196.60       +19
Table 8.4: Power consumed by the fixed and the reconfigurable receiver architectures. 



















Table 8.5: Power consumed by the major blocks of the fixed and the reconfigurable receivers. 
power saving in the Combiner by disabling only FSM256. 
Table 8.5 lists the power consumed by the major blocks of the receiver. It is clear that the FFT 
block consumes most of the power in an MC-CDMA receiver. The power consumed by the FFT 
in the receiver has gone up considerably after the inclusion of the ordering block ORD. 
8.6 Summary 
This chapter has presented a novel concept of real time adjustment of the MC-CDMA receiver 
hardware as per the channel requirements. Two 256 sub-carrier reconfigurable MC-CDMA 
receiver architectures have been presented which can also be configured for 64 sub-carriers in 
real time. These architectures can be very easily modified to support any other combination of 
sub-carriers. The appropriate reconfigurable architecture is chosen depending upon the channel 
parameters. The power saving in going down to smaller number of sub-carriers has been clearly 
established. This concept of hardware size adjustment on the basis of changing channel parameters 
can also be extended to other receiver blocks like the Viterbi decoder. Two reconfigurable 
pipelined FFT processor architectures are also presented that can be configured as a 64-point or 
16-point FFT in real time as per the channel parameters. 
This thesis investigated low power techniques and architectures for the FFT and the Combiner 
blocks of the MC-CDMA receiver as well as techniques for saving power in the whole receiver 
by tuning its hardware as per the channel parameters. The FIR filter is also an important block 




Low power FIR filter architectures 
The finite impulse response (FIR) filter is one of the most commonly used blocks in signal processing 
and telecommunication systems. It is also used in multicarrier receivers and therefore its power 
consumption is important. The low power FIR filter architectures presented in this chapter are 
based on the existing low power algorithms. The power saving potential of most of the low 
power algorithms considered in this chapter has so far been evaluated only up to the multiplier 
level [11, 12, 17, 18]. It is very important to investigate the power saving potential of all these 
algorithms on the overall FIR filter in the presence of hardware overhead associated with each 
of these algorithms. All these power saving algorithms are mostly applicable to single multiply 
accumulate unit based FIR filter architectures. Therefore, this chapter presents the power saving 
potential of each of these algorithms and their hybrid on the single multiply accumulate unit 
based direct form FIR filter architecture. 
This chapter is organised into seven sections. The first section gives an overview of the direct 
form FIR filter. The second section describes a conventional direct form FIR filter architecture. 
In the third section, the conventional filter architecture is modified to support low power coeffi-
cient ordering scheme. The fourth section proposes the low power FIR filter architecture for the 
coefficient segmentation algorithm. The fifth section introduces a block processing algorithm 
based FIR filter architecture. The sixth section describes a low power hybrid FIR filter architec-
ture obtained by the combination of coefficient segmentation and block processing algorithms. 
The chapter concludes with a section on the comparison of power and area for the different 
low power FIR cores with respect to the conventional FIR core for the three commonly used 
multiplier types. 
9.1 Overview of the direct form FIR filter 
A filter is a system that selectively alters the characteristics of a signal in a specified way. 
Filters are commonly used to remove or minimise noise from a signal. They are also used to 





Figure 9.1: Direct form FIR filter structure. 
separate two or more signals combined together for efficient use of a communication channel 
or to extract information from signals. 
A digital filter is a mathematical algorithm that produces a digital output from a digital input 
for achieving a filtering objective. It can be implemented in hardware or software depending 
upon the application. 
A finite impulse response (FIR) filter is represented by the following equation: 
$$ y(n) = \sum_{k=0}^{N_f - 1} h(k)\, x(n - k) \qquad (9.1) $$
where x(n) and y(n) are the input and output samples of the filter respectively, h(k) is the k-th 
impulse response coefficient and $N_f$ is the filter order. Equation 9.1 clearly indicates that the 
filter's output response to any input x(n) that eventually goes to zero will eventually go to zero. 
The structure of the Direct form FIR filter (DF) is shown in Figure 9.1. It is nonrecursive which 
means that the present filter output depends only on the present and past inputs and not on the 
past output values. This characteristic is responsible for the stability and popularity of FIR 
filters. 
The DF filter structure consists of a series of delay elements ($z^{-1}$). The output of each of these 
delay elements along with the input sample are called the taps of the filter. An $N_f$-tap filter has 
$N_f - 1$ delayed samples. The filter order depends upon the number of taps. The filter response 
tends towards the ideal response as the number of taps increases. Each data tap is multiplied by 














Figure 9.2: Generic direct form FIR core. 
its respective coefficient and then all the product terms are summed together to generate the final 
output at that point of time. The delay elements are realised by clocked registers. The output 
y(n) at any time depends upon the current input x(n) and the $N_f - 1$ previous inputs for an $N_f$-tap 
filter. The DF structure can be realised in a parallel fashion by replacing each multiplication 
operation by a hardware multiplier and each delay element by a register, with a summer adding 
the outputs of all the multipliers. This thesis explores single multiply-accumulate unit based low 
power FIR filter architectures. The filtering operation in this architecture has to be carried out in 
a sequential fashion over $N_f$ clock cycles to generate a single output. This filter implementation 
saves area but is much slower than the parallel implementation. 
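The sequential operation of a single multiply-accumulate unit can be sketched in software as follows. This is a behavioural illustration of Equation 9.1 only, not the hardware described in this chapter; the function name `fir_single_mac` and the cycle counter are hypothetical, and samples before the start of the input are assumed to be zero.

```python
def fir_single_mac(coeffs, samples):
    """Direct form FIR using one multiply-accumulate operation per clock cycle.

    Returns the list of outputs and the number of MAC cycles spent, which is
    N_f cycles per output sample.
    """
    n_f = len(coeffs)
    history = [0] * n_f          # history[k] holds x(n - k); zero initial state assumed
    outputs = []
    cycles = 0
    for x in samples:
        history = [x] + history[:-1]   # shift in the newest sample
        acc = 0                        # accumulator is cleared for every new output
        for k in range(n_f):
            acc += coeffs[k] * history[k]   # one MAC operation per clock cycle
            cycles += 1
        outputs.append(acc)
    return outputs, cycles


outputs, cycles = fir_single_mac([1, 2, 3], [1, 0, 0, 0])
print(outputs, cycles)  # impulse response [1, 2, 3, 0] after 4 * 3 = 12 cycles
```

A parallel realisation would produce one output per clock cycle at the cost of N_f multipliers, which is the area versus speed trade-off noted above.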
9.2 Conventional DF FIR filter architecture 
The block diagram of a generic DF FIR core is shown in Figure 9.2. It consists of two memory 
blocks for storing the coefficients (HROM) and input data (XRAM), two registers for holding 
the coefficient and input data namely HREG and XREG respectively, and the FSM along with 
the main arithmetic unit (AU). The XRAM is realised in the form of a latch based circular buffer 
for reducing its power consumption. The FSM is responsible for applying the appropriate 
coefficients and input data to AU. The AU architecture along with that of the FSM mainly 
varies from one algorithm to the other. 
Figure 9.3: Conventional arithmetic unit. 
In the conventional implementation of an FIR filter, the AU consists of a multiplier (mult), an 
adder (add), an accumulator (acc) and a clearing logic block (clacc) in the form of a multiplexer 
as shown in Figure 9.3. The clacc performs the dual function of feeding the accumulated 
values back to the adder and clearing the accumulator when the valid input to the AU is 
asserted high. The valid input is asserted high after the generation of a filter output. The mult 
has three different implementations, namely a carry-save array type (csa), a non-Booth-coded 
Wallace tree type (nbw) and a Booth-coded Wallace tree type (wall). The power evaluation of 
the filter architectures was carried out for all three multipliers. In each clock cycle, the AU 
multiplies a coefficient h(k) with an input data sample x(k) and adds the previously stored 
accumulator register value to the product. Data samples and the filter 
coefficients are represented as 16-bit two's complement numbers whereas the filter output is 
32 bits wide. 
A conventional implementation of a direct form FIR filter is executed such that at each clock 
cycle a new data sample, x(k), and the corresponding filter coefficient, h(k), are fetched from 
XRAM and HROM simultaneously and stored in the respective registers namely XREG and 
HREG. The register outputs are directly connected to the multiplier. Therefore, for each multi-
plication both inputs of the multiplier receive new data. Due to this continuous change at both 
the inputs, there will be a high level of switching activity within the multiplier leading to higher 
power consumption. 
9.3 Coefficient ordering based FIR filter architecture 
The multiplier is a major bottleneck governing the performance of a DSP algorithm. In addi-
tion, the power dissipated within the multiplier represents a significant proportion of the overall 
power dissipated by the DSP device [114]. A reduction in the switching activity within the 
multiplier block can be achieved by implementing the filter such that respective data samples 
are multiplied with filter coefficients in a non-conventional order [115]. This order can be ob-
tained by minimising the Hamming distance between those filter coefficients used in successive 
multiplication operations. It must be noted that the ordering of coefficients is performed only 
once prior to the commencement of filtering. Subsequent use of the filter will utilise the same 
order of coefficients. For this reason, coefficient ordering has no implications on the speed of 
the filtering process. 
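The idea of ordering coefficients by Hamming distance can be illustrated with a simple greedy nearest-neighbour search. This is a stand-in heuristic chosen for brevity and is not necessarily the ordering algorithm of [115]; the function names and the example coefficient values are hypothetical.

```python
def hamming(a, b, width=16):
    """Number of differing bits between two width-bit two's complement words."""
    mask = (1 << width) - 1
    return bin((a ^ b) & mask).count("1")


def total_switching(coeffs, order):
    """Sum of Hamming distances between successively applied coefficients."""
    return sum(hamming(coeffs[i], coeffs[j]) for i, j in zip(order, order[1:]))


def greedy_order(coeffs):
    """Start at coefficient 0 and repeatedly pick the closest unused coefficient."""
    remaining = list(range(1, len(coeffs)))
    order = [0]
    while remaining:
        last = coeffs[order[-1]]
        nearest = min(remaining, key=lambda i: hamming(last, coeffs[i]))
        remaining.remove(nearest)
        order.append(nearest)
    return order


coeffs = [0x0000, 0xFFFF, 0x0001, 0xFFFE]   # illustrative values only
natural = list(range(len(coeffs)))
ordered = greedy_order(coeffs)
print(total_switching(coeffs, natural), total_switching(coeffs, ordered))
```

For this small example the number of bit toggles at the coefficient input drops from 47 in natural order to 17 in the greedy order. A greedy search does not guarantee the minimum in general, and since the ordering is computed once offline, a more exhaustive search is also affordable.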
Filtering commences by fetching a coefficient and the corresponding data sample. These are 
then presented to the multiplier inputs and the result is added to the accumulator. The rest of the 
coefficients are processed in a similar manner. Once all the coefficients are processed the filter 
output is obtained from the accumulator. For the next filter output, the accumulator is cleared, 
a new data sample is read into the data memory (replacing the oldest data in the memory), and 
the above steps are repeated. 
A block diagram of the ordered FIR core is illustrated in Figure 9.4. It consists of HROM, 
XRAM, AU and a FSM containing an address generation logic (ADDR-GEN) block to support 
ordering. The ADDR-GEN block governs the non-conventional processing of coefficients. It 
consists of a look-up-table (LUT) and an adder. The XRAM is realised in the form of a circu-
lar buffer for reducing its power consumption. The XRAM position to be currently written is 
tracked by the Write-pointer generated by the FSM. The Write-pointer always points towards 
the most recently entered data value x(0). In a conventional implementation, the access of the 
first coefficient h(0) in HROM and the data sample x(0) in XRAM must only be aligned and 
all other combinations of coefficients and data will automatically fall in place. In an ordered 
implementation, the coefficients in the HROM can be in any order depending upon the ordering 
algorithm and the original coefficient set. Let us assume that the first HROM location stores 
h(5) instead of h(0) as per the ordered list. In order to generate the Read-pointer or address 
of the correct data corresponding to h(5), the ADDR-GEN block should be able to generate 
the address of the corresponding data x(5) which is always located five positions away from 
x(0) in an XRAM. The x(0) position is always pointed at by the Write-pointer. Hence, the first 
Figure 9.4: Coefficient ordering based FIR filter architecture. 
offset in the LUT, for the Write-pointer corresponding to the first entry of HROM, must be 5. 
The remaining entries of the LUT can be obtained by storing appropriate offsets with respect 
to the Write-pointer's position by examining the corresponding coefficients in the HROM. The 
HROM and LUT are both addressed by the same counter. The final XRAM Read-pointer for a 
given counter value is obtained by adding the corresponding offset to the Write-pointer. The 
Write-pointer is always decremented by one after the generation of every output as the XRAM 
receives a new data sample value only after every output generation. This work is reported 
in [116]. 
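The address generation described above can be sketched as a behavioural model. It assumes a circular buffer with as many locations as taps and a zero initial state; the class name `OrderedFir` and its attributes are hypothetical and only mirror the HROM, LUT, XRAM and Write-pointer roles named in the text.

```python
class OrderedFir:
    """Single-MAC FIR whose coefficients are stored in a reordered sequence."""

    def __init__(self, coeffs, order):
        self.n = len(coeffs)
        self.hrom = [coeffs[k] for k in order]  # coefficients in ordered sequence
        self.lut = list(order)                  # offset of the matching sample from the write pointer
        self.xram = [0] * self.n                # circular data buffer
        self.write_ptr = 0

    def step(self, x_new):
        self.xram[self.write_ptr] = x_new       # write pointer marks the newest sample x(0)
        acc = 0
        for count in range(self.n):             # one counter addresses both HROM and LUT
            read_ptr = (self.write_ptr + self.lut[count]) % self.n
            acc += self.hrom[count] * self.xram[read_ptr]
        self.write_ptr = (self.write_ptr - 1) % self.n  # decrement after every output
        return acc


fir = OrderedFir([1, 2, 3, 4], order=[2, 0, 3, 1])
print([fir.step(x) for x in [5, -1, 2, 7, 3, 0]])
```

Because the offset stored for each HROM entry equals the index of the coefficient held there, the output is the same as for natural-order processing; only the sequence in which the products are formed changes.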
9.4 Coefficient segmentation based FIR filter architecture 
According to this technique [17], the fixed coefficients are divided into two components such 
that one of the components can be realised by using a single shift operation. The second compo-
nent with a reduced wordlength is applied to the coefficient input of the multiplier. This leads 
to significant reduction in the switching activity at the coefficient input of the multiplier and 
hence the power consumption. It is important to note that a shifter consumes much less power 
as compared to the multiplier and consequently the first component does not consume much 
power. In order to reduce the switched capacitance at the coefficient input of the multiplier, the 
segmentation must be done in such a way that the consecutive values of the second component 
Figure 9.5: Arithmetic unit for coefficient segmentation. 
should have the same polarity. This approach will also minimise the effective wordlength of 
the second component which is applied to the multiplier. 
Here a 16-bit coefficient h(k) is segmented into two numbers namely a 16-bit decomposed 
coefficient m(k) and a 5-bit shift value s(k). The MSB of s(k) acts as a sign bit and the remaining 
four bits are a measure of shift. The input data sample x(k) and the number m(k) are applied 
to the multiplier while a shift operation of x(k) will be performed according to the shift value 
s(k). One more 5-bit register has to be included in the main architecture to store the s(k) 
value. The result of the multiplication and the shifted input data are added to the previous value 
stored in the accumulator register. The AU for the coefficient segmentation algorithm is shown 
in Figure 9.5. It consists of a multiplier (mult), an adder (add), a logarithmic shifter (shift) 
implemented using arrays of 2-to-1 multiplexers, a conditional two's complementor (xconv), a 
multiplexer (mux) to load and clear the shifter and a clearing block (clacc) identical to the one 
in the conventional FIR filtering block. The MSB of the shift value s(k) determines if a negative 
shift has to be performed and therefore controls the conversion unit xconv. The output of xconv 
is the two's complement of the data only if the MSB of s(k) is one, otherwise the output is equal 
to the input data. When h(k) is zero (m(k)=0, s(k)=0) or one (m(k)=1, s(k)=0), the shift value 
Figure 9.6: Illustration of block processing with a block size of two. 
will be zero. In these cases, the output of the shifter must be zero as well. In order to guarantee 
this behaviour, a multiplexer is needed between the conversion unit and the shifter that applies 
a zero vector when s(k) equals zero. Since three values (multiplier, shifter and accumulator 
outputs) are to be added, a single multi-input adder carries out this addition. 
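The datapath of the segmented arithmetic unit can be checked with a short numerical sketch. The splitting rule in `segment` below (take the largest power of two not exceeding the coefficient magnitude as the shifted part) is a hypothetical choice made only so that the example is self-contained; it is not claimed to be the segmentation algorithm of [17]. A shift value of zero is taken to mean that there is no shifted term, matching the multiplexer behaviour described above.

```python
def segment(h):
    """Split h into (m, negative, shift) so that h = m + sign * 2**shift.

    Hypothetical rule: for |h| < 2 no shifted term is used (shift = 0).
    """
    magnitude = abs(h)
    if magnitude < 2:
        return h, 0, 0
    shift = magnitude.bit_length() - 1      # 1..15 for 16-bit coefficients
    negative = 1 if h < 0 else 0
    power = -(1 << shift) if negative else (1 << shift)
    return h - power, negative, shift


def au_step(acc, x, m, negative, shift):
    """One accumulation step: multiplier output + shifter output + accumulator."""
    x_conv = -x if negative else x                     # conditional two's complement (xconv)
    shifted = 0 if shift == 0 else (x_conv << shift)   # mux forces zero when shift is 0
    return acc + m * x + shifted                       # multi-input adder


m, negative, shift = segment(1000)
print(m, negative, shift, au_step(0, 3, m, negative, shift))  # 488 0 9 3000
```

With this rule the residual m always has a smaller magnitude than h and keeps its polarity, so fewer bits toggle at the coefficient input of the multiplier, while the shifted term restores the exact product.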
9.5 Block processing based FIR filter architecture 
In the direct form realisation of the filter, a new data sample x(k) and the corresponding 
coefficient h(k) are multiplied at each clock cycle, followed by accumulation in a conventional AU. 
This leads to high switching activity because both inputs of the multiplier receive new data at 
every clock cycle. Another source of power consumption in DSPs is the activity on the data and 
address buses. Since a new data sample is multiplied with a new coefficient each time, 
both data and address buses experience high switching activity. According to the block 
processing technique [18], data samples are processed in blocks. By processing multiple data samples 
at the same time rather than one, it is possible to reduce the switching activity not only at the 
multiplier inputs but also on the address and data buses, thereby achieving considerable power 
saving. As per earlier results [18], a block size of two yields the best results because both inputs 
to the multiplier are held constant for two clock cycles, rather than only one input for block sizes 
greater than two. The block operation for a block size of two is shown in Figure 9.6. It is 
clear from this figure that if one starts the filtering process from the first entry of block B0 and 
moves vertically down by one level, then the filter coefficient does not change within this block. 
Moreover, if one moves diagonally from the second entry in block B0 to the first entry in block 
B1, then the data does not change either. This pattern is repeated for all the blocks up to B3. 
Figure 9.7: Arithmetic unit for block processing. 
This characteristic is exploited to retain the data and coefficient in the filtering block registers for 
two clock cycles rather than one, thereby reducing switching activity at the multiplier inputs 
and on the address and data buses, resulting in power saving. 
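The effect of the processing order on the multiplier inputs can be illustrated with a short Python sketch. It is a behavioural model under stated assumptions, not the hardware: the function names are hypothetical, operand changes are counted on the indices of the data sample and coefficient presented to the multiplier, and the block version walks the coefficients from the last tap to the first so that each data sample is reused across adjacent blocks.

```python
def fir_direct(x, h):
    """Direct form: one output at a time. Returns the outputs and the
    sequence of (data index, coefficient index) pairs fed to the multiplier."""
    taps = len(h)
    outputs, operands = [], []
    for n in range(taps - 1, len(x)):
        acc = 0
        for k in range(taps):
            operands.append((n - k, k))
            acc += x[n - k] * h[k]
        outputs.append(acc)
    return outputs, operands


def fir_block2(x, h):
    """Block processing with a block size of two: outputs y[n] and y[n+1]
    are accumulated together in acc0 and acc1. Only complete pairs are produced."""
    taps = len(h)
    outputs, operands = [], []
    for n in range(taps - 1, len(x) - 1, 2):
        acc0 = acc1 = 0
        for k in range(taps - 1, -1, -1):
            operands.append((n - k, k))        # coefficient h[k] is new, data is reused
            acc0 += x[n - k] * h[k]
            operands.append((n + 1 - k, k))    # same coefficient, next data sample
            acc1 += x[n + 1 - k] * h[k]
        outputs += [acc0, acc1]
    return outputs, operands


def count_changes(operands):
    """Number of cycles in which the data input and the coefficient input change."""
    pairs = list(zip(operands, operands[1:]))
    data_changes = sum(1 for a, b in pairs if a[0] != b[0])
    coeff_changes = sum(1 for a, b in pairs if a[1] != b[1])
    return data_changes, coeff_changes


if __name__ == "__main__":
    x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5]
    h = [2, -7, 1, 8]
    y_direct, ops_direct = fir_direct(x, h)
    y_block, ops_block = fir_block2(x, h)
    print(y_direct == y_block)
    print(count_changes(ops_direct), count_changes(ops_block))
```

In this model both filters produce identical outputs, while the block version presents a new data sample or a new coefficient to the multiplier roughly every second cycle instead of every cycle.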
The AU, shown in Figure 9.7, is designed for a block size of two. It consists of a multiplier 
(mult), an adder (add), two accumulators for storing the results for two outputs, namely acc0 
and acc1, a multiplexer (mux) to select the proper output of the accumulators to be fed back to 
the adder and also a clearing logic (clacc). The clacc initialises the accumulators in response 
to the active high valid signal which goes up at the time of switching from one block of outputs 
(y0 and y1) to the next (y2 and y3). The appropriate accumulators are selected by the controls 
generated by the FSM. Once the block processing for the first two outputs (y0 and y1) is over, 
the FSM directs the loading of HREG and XREG with the new set of coefficient and data values 
corresponding to the outputs y2 and y3. This sequence has to be repeated for all future outputs. 
The accumulators acc0 and acc1 are clocked by complementary clocks, namely clk0 and clk1 




Figure 9.8: Arithmetic unit for the combination. 
9.6 Combination of block processing and coefficient segmentation based filter architecture 
The architectures corresponding to coefficient segmentation and block processing can be com-
bined together to yield even more significant reduction in power with a slight area overhead. 
The architecture of the AU for the combination of block processing and coefficient segmenta-
tion is shown in Figure 9.8. It is basically the combination of the architectures described in 
Figure 9.5 and Figure 9.7. The power saving occurs not only by segmenting the coefficients but 
also by holding the segmented coefficients and data values at the input of the multiplier for two 
clock cycles rather than one. The coefficient segmentation, block processing and their hybrid 
architectures are reported in [117]. 











Block    CONV     ORDER    CSEG     BP       COMB
AU       0.896    0.756    0.631    0.712    0.504
XRAM     0.122    0.141    0.121    0.065    0.065
HROM     0.026    0.020    0.026    0.013    0.012
FSM      0.105    0.159    0.111    0.1      0.108
XREG     0.034    0.035    0.036    0.026    0.027
HREG     0.031    0.023    0.025    0.020    0.018
OREG     0.007    0.007    0.007    0.008    0.008
Total    1.221    1.141    0.957    0.944    0.742
Table 9.1: FIR Cell power (mW). 
9.7 Results 
The different FIR cores for the three algorithms, namely coefficient ordering (ORDER), 
coefficient segmentation (CSEG) and block processing (BP), and for the combination of CSEG and BP 
(COMB) have been analysed with regard to area usage and power consumption with respect to 
the conventional architecture (CONV). The cores were designed using Verilog HDL and then 
synthesized using Ambit BuildGates targeting the UMC 0.18 µm standard cell CMOS library. The 
synthesis constraints were identical for all the cores. This was necessary in order to 
allow consistent comparisons of power consumption and area usage. A maximum circuit 
delay of 35 ns has been defined for all the cores. A layout for each core was generated using 
the Cadence Silicon Ensemble place-and-route software. This was followed by extracting RC 
information and then performing RC back-annotated post-layout gate-level netlist simulations 
with the Verilog-XL simulator for a set of 1000 uniformly distributed random input data 
samples. The resulting data, including the switching activity of the circuit nets and the capacitive load 
information extracted from the layouts, was then used by the Synopsys DesignPower tool to 
compute power consumption figures for the different FIR cores. In all of the above stages, a 
clock rate of 10 MHz and a supply voltage of 1.8 V were used. The power profile of the filter 
has been verified up to 100 MHz. 
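How switching activity and capacitive load combine into a power figure can be illustrated with the standard expression for net switching power, P = 0.5 · C · Vdd² · (toggle rate). The Python sketch below is a simplified illustration of that expression only; it does not reproduce the models used by the power estimation tool, which also account for power dissipated inside the cells. The net names, capacitances and toggle counts are hypothetical.

```python
def net_switching_power(cap_farads, toggles, sim_time_s, vdd=1.8):
    """Switching power of one net in watts: 0.5 * C * Vdd^2 * toggle rate."""
    toggle_rate = toggles / sim_time_s     # transitions per second
    return 0.5 * cap_farads * vdd ** 2 * toggle_rate


def total_switching_power(nets, sim_time_s, vdd=1.8):
    """Sum over nets; `nets` maps a net name to (capacitance in F, toggle count)."""
    return sum(net_switching_power(c, t, sim_time_s, vdd) for c, t in nets.values())


if __name__ == "__main__":
    # 1000 input samples at a 10 MHz clock rate give 100 microseconds of simulated time.
    sim_time = 1000 / 10e6
    nets = {
        "mult_coeff_in[0]": (10e-15, 1000),   # hypothetical: 10 fF, toggles every cycle
        "mult_coeff_in[1]": (10e-15, 400),    # hypothetical: lower activity
    }
    print(total_switching_power(nets, sim_time))
```

The expression makes the motivation for the algorithms explicit: at a fixed supply voltage and clock rate, fewer toggles on a net lower its switching power in direct proportion.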
The results are shown in Tables 9.1 to 9.6 and Figures 9.9 and 9.10 for a 73-tap band-pass filter. 
The power saving in this filter was gauged for three different multiplier types from Synopsys 
DesignWare, namely csa, nbw and wall. Table 9.1 and Table 9.2 list the power consumption 
for the different blocks of the overall filter and the AU respectively using a csa multiplier type. 
According to Table 9.1 and Figure 9.9, the power consumption of the AU is in the range of 
Figure 9.9: Distribution of cell power for the overall FIR core. 











Block    CONV     ORDER    CSEG     BP       COMB
mult     0.682    0.556    0.331    0.466    0.207
add      0.126    0.117    0.157    0.103    0.127
acc0     0.077    0.072    0.071    0.055    0.051
acc1     -        -        -        0.055    0.051
clacc    0.012    0.011    0.012    0.015    0.016
xconv    -        -        0.013    -        0.008
mux0     -        -        0.006    -        0.004
mux1     -        -        -        0.017    0.018
shift    -        -        0.041    -        0.024
Total    0.896    0.756    0.631    0.712    0.504
Table 9.2: Arithmetic unit Cell power (mW). 
Algorithm    csa            nbw            wall
             (mW)    %      (mW)    %      (mW)    %
CONV         0.682   -      0.566   -      0.687   -
ORDER        0.556   18     0.425   25     0.373   46
CSEG         0.331   51     0.325   43     0.597   13
BP           0.466   32     0.388   31     0.438   36
COMB         0.207   70     0.227   60     0.349   49
Table 9.3: Power consumption analysis for different multipliers. 
(65-75%) of the overall FIR core power. It is also evident from Table 9.2 and Figure 9.10 that 
the multiplier contributes most to the power consumption (40-76%) in the AU and therefore 
power saving efforts must be directed at reducing its power consumption. Table 9.3 shows that 
the percentage saving in power at the multiplier is substantially higher for CSEG as compared 
to ORDER and BP for the csa and nbw multiplier types whereas the reverse is true for the wall 
multiplier. The higher savings in power at the multiplier in CSEG are primarily attributed 
to the significant reduction in the switching activity and effective wordlength of its coefficient 
input. The wall multiplier performs very well for the ORDER and BP algorithms, where 
adjacent coefficient inputs have less switching activity. The wall multiplier saves 
46% and 36% power with ORDER and BP respectively as compared to only 18% and 32% for 
csa and 25% and 31% for nbw multipliers. The power saving at the multiplier is 
greatest for COMB. The performance of the csa multiplier type is best in terms of 
power reduction for CSEG and COMB whereas the worst performance is obtained for the wall 
multiplier. 









Figure 9.10: Distribution of cell power for the AUs. 
Algorithm    csa            nbw            wall
             (mW)    %      (mW)    %      (mW)    %
CONV         1.221   -      1.082   -      1.219   -
ORDER        1.141   7      0.987   9      0.942   23
CSEG         0.957   22     0.958   11     1.263   -4
BP           0.944   23     0.844   22     0.898   26
COMB         0.742   39     0.765   29     0.901   26
Table 9.4: Power consumption analysis for FIR cores. 
Algorithm    csa            nbw            wall
             [mm²]   %      [mm²]   %      [mm²]   %
CONV         0.184   -      0.186   -      0.186   -
ORDER        0.187   -2     0.189   -2     0.189   -2
CSEG         0.191   -4     0.193   -4     0.194   -4
BP           0.190   -3     0.192   -3     0.192   -3
COMB         0.197   -7     0.199   -7     0.199   -7
Table 9.5: Area comparison for FIR cores. 
The overall power reduction of the different filter cores is listed in Table 9.4. It is clear from 
this table that the overall power reduction is greater for BP than for ORDER and CSEG for 
all the multiplier types. This is because in BP power saving occurs not only in the multiplier 
but also on the address and data buses for both data and coefficient memories on account of 
fewer memory accesses. This is evident from the power consumption of the two memory blocks, 
namely HROM and XRAM, in Table 9.1. Moreover, it is clear from Table 9.5 that the extra 
hardware introduced to support CSEG is around 4% as compared to only 3% for BP. This 
additional hardware in the form of a shifter, two's complementer, three-input adder etc. (listed 
in Table 9.2) is also responsible for the inferior power saving in CSEG compared with BP. The overall 
power consumption of the wall based FIR core for CSEG has gone up over CONV because the 
power saving in the multiplier is just 13%. This is not enough to offset 
the power consumed by the additional hardware provided to support the algorithm. The best 
power savings of 39% and 29% are obtained for COMB with the csa and nbw multiplier 
types respectively. This is not true for the wall multiplier type because of the poor performance of 
CSEG with this type of multiplier. The best performance with the wall multiplier is obtained by 
BP both in terms of power and area. The ORDER algorithm also saves 23% power for the wall 
multiplier with an area overhead of just 2%. 
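The percentage columns of Table 9.4 follow directly from the power figures, as the short Python sketch below shows. The dictionary simply transcribes the mW values from the table; the function names are chosen for this example, and rounding to the nearest integer is an assumption about how the percentages were derived.

```python
# Overall FIR core power in mW, transcribed from Table 9.4.
CORE_POWER_MW = {
    "csa":  {"CONV": 1.221, "ORDER": 1.141, "CSEG": 0.957, "BP": 0.944, "COMB": 0.742},
    "nbw":  {"CONV": 1.082, "ORDER": 0.987, "CSEG": 0.958, "BP": 0.844, "COMB": 0.765},
    "wall": {"CONV": 1.219, "ORDER": 0.942, "CSEG": 1.263, "BP": 0.898, "COMB": 0.901},
}


def saving_percent(baseline_mw, value_mw):
    """Power saving relative to the conventional core, rounded to a whole percent.
    A negative result means the core consumes more than the baseline."""
    return round(100.0 * (baseline_mw - value_mw) / baseline_mw)


def savings_table(power):
    """Savings of every non-conventional core for every multiplier type."""
    return {
        mult: {name: saving_percent(cores["CONV"], mw)
               for name, mw in cores.items() if name != "CONV"}
        for mult, cores in power.items()
    }


if __name__ == "__main__":
    for mult, row in savings_table(CORE_POWER_MW).items():
        print(mult, row)
```

The negative entry for CSEG with the wall multiplier corresponds to the increase in overall power discussed above.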
Algorithm    Delay (ns)
             csa      nbw      wall
CONV         10.01    9.02     9.61
ORDER        10.01    9.02     9.61
CSEG         9.94     9.70     9.70
BP           10.03    9.05     9.64
COMB         9.98     932      9.72
Table 9.6: Delay analysis of AUs using different multipliers. 
The area overhead as per Table 9.5 is around 2% for ORDER, 3% for BP and 4% for CSEG 
for all multiplier types. The results show that these algorithms are capable of reducing 
power consumption in the range of 7-39% with a small increment in area. It is finally inferred 
from the tables that the best performance is obtained with COMB using csa and nbw multipliers 
at the expense of a 7% increase in area whereas BP gives the best result using the wall multiplier at 
the expense of only a 3% increase in area. The ORDER algorithm performs best with the wall 
multiplier, saving 23% power at the expense of only 2% in area. Any of these cores can be used 
for low power depending on the area overhead budget. The results obtained have also been 
verified with other filter lengths and data samples in addition to the filter considered here. 
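The choice of core under an area budget can be expressed as a small lookup over the percentages of Tables 9.4 and 9.5. The Python sketch below is only an illustration of that trade-off: the function name best_core, the tie-breaking rule (prefer the smaller area overhead) and the fallback to the conventional core are assumptions made for the example.

```python
# Area overhead (%) from Table 9.5; it is the same for all three multiplier types.
AREA_OVERHEAD = {"ORDER": 2, "BP": 3, "CSEG": 4, "COMB": 7}

# Overall power saving (%) from Table 9.4.
POWER_SAVING = {
    "csa":  {"ORDER": 7,  "CSEG": 22, "BP": 23, "COMB": 39},
    "nbw":  {"ORDER": 9,  "CSEG": 11, "BP": 22, "COMB": 29},
    "wall": {"ORDER": 23, "CSEG": -4, "BP": 26, "COMB": 26},
}


def best_core(multiplier, area_budget_percent):
    """Core with the largest power saving whose area overhead fits the budget.
    Ties go to the smaller area overhead; returns "CONV" if nothing helps."""
    candidates = [
        (saving, -AREA_OVERHEAD[name], name)
        for name, saving in POWER_SAVING[multiplier].items()
        if AREA_OVERHEAD[name] <= area_budget_percent and saving > 0
    ]
    if not candidates:
        return "CONV"
    return max(candidates)[2]


if __name__ == "__main__":
    for mult in ("csa", "nbw", "wall"):
        print(mult, [best_core(mult, budget) for budget in (1, 2, 3, 4, 7)])
```

With a 7% budget this selects COMB for the csa and nbw multipliers and BP for the wall multiplier, matching the inference drawn from the tables.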
The delay analysis for the AUs using different multipliers is given in Table 9.6. It is clear from the 
table that the difference in delays is negligible for the different algorithms and the multipliers. 
For the CSEG and COMB algorithms, the critical path with the faster nbw and wall multipliers 
runs through the shifter-adder combination rather than the multiplier-adder combination. This is 
due to the reduction in the delay of the multiplier on account of its smaller size in these algorithms. 
Therefore, the critical path for these algorithms remains the same for nbw and wall multipliers. The 
critical path for these cases is slightly longer compared to the CONV or ORDER algorithms due to 
the additional delay introduced by a three-input adder. The critical path in case of the slower csa 
multiplier for the CSEG and COMB algorithms runs through the multiplier-adder combination and 
hence it is longer than for the other multipliers. The critical path for the CONV, ORDER, CSEG and 
COMB algorithms remains almost the same for the csa multiplier because the reduction in the delay of 
the smaller multiplier is partially compensated by the delay introduced by the three-input adder. 
In the case of BP, the critical path is almost the same as that of CONV for all the multipliers. 
The slight increase of critical path in BP is attributed to the higher fanout of the adder in the 
form of two accumulators rather than one. The critical path is the same for CONV and ORDER 
for all types of multipliers. 
9.8 Summary 
This chapter has presented low power direct form FIR filter architectures for the existing low 
power algorithms. The performance of each of these algorithms has been evaluated only at the 
multiplier level until now. It is very important to examine the effect of these algorithms on the 
overall FIR core, especially in the presence of hardware overhead. The power reduction 
capability of the low power algorithms is multiplier dependent and hence the results are presented 
for three commonly used multiplier types. The area overhead of an algorithm is also an 
important measure of its effectiveness, so the area overhead of all the algorithms is evaluated. 
For instance, the coefficient segmentation based FIR core provides up to 22% power reduction with 
an area overhead of 4%. The block processing based core, on the other hand, results in a power 
reduction of up to 26% at the expense of only 3% in area. Coefficient ordering provides power 
saving up to 23% with only 2% increase in area. However, if more power saving is desired 
then the combination of coefficient segmentation and block processing based core can be used 
providing up to 39% power reduction, but at the cost of 7% increase in area. Any architecture 
can be considered depending upon the power and area constraints. 
This thesis has proposed low power algorithms and architectures for the important blocks of 
the multi-carrier receiver. It also proposes a concept of power reduction through real time 
adjustment of receiver hardware on the basis of channel parameters. The next and the last 
chapter concludes the thesis and describes the roadmap for the future. 
Chapter 10 
Summary and Conclusions 
10.1 Introduction 
The aim of this thesis is to investigate low power architectures for the basic blocks of a MC-
CDMA and other multicarrier receivers as well as the techniques which can be used for reducing 
the power of the whole receiver. The key receiver blocks investigated for low power in this 
thesis are the FFT, Combiner and a direct form FIR filter. The power consumption of the whole 
receiver is reduced by introducing the concept of an adaptive receiver instead of a fixed receiver 
designed for the worst case channel conditions. 
This chapter is organised into four sections. The first section presents the summary of the work 
presented in each chapter. The second section provides the conclusions drawn from the results 
obtained in each chapter. The third section summarises the achievements and the last section 
outlines areas for future investigation. 
10.2 Summary 
The power consumption in portable devices has to be reduced to extend the battery life. Mul-
ticarrier mode of transmission and reception has become very popular in the wireless world. 
MC-CDMA has significant potential to be included as a standard in future generations of mo-
bile communications and so power consumption is important. The theme of this thesis has been 
to investigate low power architectures for the individual blocks of a MC-CDMA receiver as well 
as for the whole receiver. The receiver blocks namely the FFT, Combiner and FIR filter have 
been investigated for low power. Three different low power techniques namely, the coefficient 
addressing scheme, the coefficient memory reduction scheme and the order based processing 
scheme have been proposed for the most power consuming FFT block of the receiver. A low 
power radix-2 single butterfly FFT processor architecture and a pipelined radix-4 FFT processor 
architecture have also been proposed based on the order based processing scheme. The radix-4 
pipelined FFT processor has been combined with the low power combiner to realise a low power 
MC-CDMA receiver. 
This thesis has presented low power FIR filter architectures for the existing low power algo-
rithms. Most of these algorithms have been tested only up to the multiplier level. It is important 
to test the effect of each of these algorithms on the power of the overall filter especially in the 
presence of hardware overhead. Any proposed low power filter architecture can be chosen 
depending upon the area/power budget. 
This thesis has also introduced a novel concept of adjusting the receiver hardware dynamically 
as per the changing channel parameters like the delay spread, transmission rate, signal to noise 
ratio and the Doppler frequency. This adaptive receiver consumes much less power as compared 
to a receiver designed for the worst case channel conditions. Two 256 sub-carrier reconfigurable 
receiver architectures have been proposed that can also be configured for 64 sub-carriers instead 
of using a fixed 256 sub-carrier receiver. 
Chapter 2 provided a description of general and block specific low power techniques for 
reducing the switched capacitance and therefore the power consumption. Most of the techniques 
exploited signal correlations, reduced the hardware size through transformations, or shut 
down unused portions of the architecture in real time. The previous work on MC-CDMA 
receiver architecture involved processing symbols in blocks. This technique was suitable only 
for the combiner but not for the FFT due to the presence of memory overhead. This memory 
overhead was needed to store the whole block of symbols rather than one symbol after every 
FFT stage. 
The Discrete Fourier transform (DFT) and the FFT for efficiently computing the DFT were 
introduced in Chapter 3. The radix-2 decimation-in-time FFT algorithm was covered in detail 
whereas the overview of the radix-4 decimation-in-time algorithm was given. The comparison 
of radix-4 and radix-2 FFT algorithms in terms of butterfly complexity and number of stages 
was also discussed. 
Chapter 4 presented three schemes for reducing the power consumption in FFT processors 
namely the order based processing scheme, the coefficient memory reduction scheme and the 
coefficient addressing scheme. The order based processing scheme was based on the novel 
concept of using either the real part of the coefficient or its two's complement as per the Ham-
ming distance with the preceding real part. A flag bit was stored along with the coefficients to 
indicate the form of the real part of the coefficient. The coefficient memory reduction scheme 
exploited the relationship among the contents of the coefficient memory to reduce the memory 
size from N/4 to N/8+1 locations. This resulted in both area and power saving for long FFTs. 
The coefficient addressing scheme for the single butterfly FFT processor was realised using a 
simple multiplexer instead of a cascade of barrel shifters. This saved both area and power for all 
FFT lengths. 
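The idea behind the order based processing scheme, choosing between the real part of a coefficient and its two's complement according to the Hamming distance from the preceding word, can be illustrated with a minimal Python sketch. It is not the implementation from Chapter 4: the 8-bit wordlength, the greedy word-by-word selection and the function names are assumptions made for the example, and the use of the flag bit further down the datapath is not modelled.

```python
WIDTH = 8                     # assumed wordlength for this illustration
MASK = (1 << WIDTH) - 1


def hamming(a, b):
    """Number of differing bits between two WIDTH-bit words."""
    return bin((a ^ b) & MASK).count("1")


def order_based(real_parts):
    """For each real part, emit either the value or its two's complement,
    whichever is closer in Hamming distance to the previously emitted word.
    Returns the emitted words and a flag per word (1 = two's complement form)."""
    words, flags = [], []
    for r in real_parts:
        normal = r & MASK
        negated = (-r) & MASK
        if words and hamming(words[-1], negated) < hamming(words[-1], normal):
            words.append(negated)
            flags.append(1)
        else:
            words.append(normal)
            flags.append(0)
    return words, flags


def switching(words):
    """Total bit transitions between successive words."""
    return sum(hamming(a, b) for a, b in zip(words, words[1:]))


if __name__ == "__main__":
    reals = [3, -3, 3, -3]
    plain = [r & MASK for r in reals]
    words, flags = order_based(reals)
    print(switching(plain), switching(words), flags)
```

Coefficient sequences that alternate in sign are the favourable case: the two's complement of a small negative number differs from its positive counterpart in almost every bit, so selecting the form per word removes most of the transitions.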
The architecture of the radix-2 single butterfly FFT processor was presented in Chapter 5. This 
low power architecture was based on the order based processing scheme. The order based 
processing scheme was only applied to the last stage of the FFT processor where the switching 
activity between successive coefficients was the highest. The hardware overhead was in the 
form of a ROM, a multiplexer and a set of Ex-OR gates. 
Chapter 6 presented architectures for the 16-point and 64-point low power radix-4 pipelined 
FFT processors. The power consumption of the low power radix-4 FFT processor architecture, 
proposed by Bi and Jones, was further reduced by incorporating the order based processing scheme 
into its penultimate stage. The coefficient ordering required corresponding data ordering. A novel 
triple port RAM based commutator architecture was proposed for the first stage of a 16-point 
FFT processor to handle data ordering. This commutator architecture saved power compared to 
the traditional dual port RAM architecture due to the use of only three triple port RAMs instead 
of six dual port RAMs. 
Chapter 7 presented an architecture for a 64 sub-carrier MC-CDMA receiver. This architecture 
was obtained by combining the low power ordered 64-point FFT processor with the Combiner. 
The divider modules were needed only in the channel estimation phase and not in the data 
demodulation phase. Therefore, the divider modules could be disabled in the data demodulation 
phase by keeping their inputs fixed during this phase. 
Chapter 8 introduced a novel concept of designing an adaptive receiver which could be con-
figured as per the channel parameters like the delay spread, signal to noise ratio, transmission 
rate and Doppler frequency instead of a fixed receiver designed for the worst case channel 
conditions. Two 256-point reconfigurable FFT processor architectures, differing in the extent 
of clock gating, have been proposed; each could also be configured as a 64-point or a 
16-point processor as per the channel requirements in real time. The power saving in going down to 
64-point or 16-point as compared to the fixed 256-point FFT processor was 55% and 85% for 
RFFT-I and 26% and 43% for RFFT-II respectively. The power overhead of 256-point RFFT-I 
and 256-point RFFT-II with respect to the fixed 256-point FFT was 26% and only 1% respectively. 
Two reconfigurable 256 sub-carrier MC-CDMA receiver architectures namely, RECEIVER-I 
and RECEIVER-II have also been proposed depending upon the extent of clock gating in their 
modules. These architectures could also be configured for 64 sub-carriers. This approach led to 
power saving of 47% and 19% for RECEIVER-I and RECEIVER-II respectively as compared 
to a fixed 256 sub-carrier receiver designed for the worst case channel conditions. 
Various direct form low power FIR filter architectures were proposed in Chapter 9. These 
architectures were based on the existing low power algorithms like coefficient ordering, block 
processing, coefficient segmentation and their combination. Most of the low power algorithms 
were only tested at the multiplier level. It was important to evaluate the effect of these algo-
rithms on the power consumption of the overall filter in the presence of associated hardware 
overhead. Each of the above mentioned algorithms was evaluated in terms of area and power 
with respect to a conventional 73-tap FIR filter. 
10.3 Conclusions 
This thesis proposed novel low power techniques and architectures for the various blocks of 
the MC-CDMA receiver along with the receiver itself. It can be concluded from Chapter 4 
that the order based processing scheme resulted in switching activity reduction of close to 50% 
as compared to only 20% by the traditional ordering scheme based on Hamming distance for 
all FFT lengths. This significant reduction in switching activity was responsible for the power 
savings obtained in both radix-2 single butterfly and radix-4 pipelined FFT processors. 
It can be concluded from Chapter 4 that the memory reduction scheme was more effective both 
in terms of area and power saving for long FFTs (1024-points onwards). The savings in area 
ranged from 48% to 54% whereas the power savings varied from 0.62% to 59% for long FFTs. 
It has been found that the scheme was most effective for an intermediate wordlength of 12 bits 
rather than 8 or 16 bits. This scheme was more suitable for OFDM applications requiring long 
FFTs. The scheme would lead to the design of more power and area efficient long FFT 
processors. 
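That N/8+1 stored locations are sufficient can be illustrated with the well-known octant symmetry of sine and cosine. The Python sketch below is an illustration under that assumption; the storage layout and reconstruction logic of the scheme in Chapter 4 may differ, and the function names are hypothetical. Each table entry holds the cosine and sine of one angle in the first octant, and every twiddle factor is recovered by swapping the two parts and changing signs.

```python
import cmath
import math


def build_table(n):
    """Store (cos, sin) of 2*pi*i/n for i = 0 .. n/8 only, i.e. n/8 + 1 locations."""
    assert n % 8 == 0
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n // 8 + 1)]


def twiddle(k, n, table):
    """Reconstruct W_n^k = exp(-j*2*pi*k/n) from the first-octant table."""
    k %= n
    quadrant, r = divmod(k, n // 4)
    if r <= n // 8:
        c, s = table[r]                 # angle lies in the first octant
    else:
        s, c = table[n // 4 - r]        # second octant: cos and sin swap roles
    if quadrant == 1:                   # add 90 degrees
        c, s = -s, c
    elif quadrant == 2:                 # add 180 degrees
        c, s = -c, -s
    elif quadrant == 3:                 # add 270 degrees
        c, s = s, -c
    return complex(c, -s)


if __name__ == "__main__":
    n = 1024
    table = build_table(n)
    worst = max(abs(twiddle(k, n, table) - cmath.exp(-2j * math.pi * k / n))
                for k in range(n))
    print(len(table), worst)
```

The saving in memory locations is paid for with a small amount of swap and sign logic, which is why such a scheme becomes more attractive as the FFT length grows.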
It has been shown in Chapter 4 that the proposed coefficient addressing scheme maintained an 
improvement of more than 80% over the existing addressing schemes both in terms of area and 
power for almost all FFT lengths. The scheme would lead to the design of faster and more 
power and area efficient systems with embedded radix-2 single butterfly FFT processors. 
The results in Chapter 5 showed that the percentage power saving continued to reduce for 
longer FFTs because Our order based processing scheme was directed only at reducing 
power by lowering the switching activity at the coefficient input of the multipliers in the 
butterfly structure. The butterfly complexity in an FFT remains fixed with FFT length whereas the 
RAM size goes on increasing. This means that the power consumed in the butterfly increases 
only slightly with FFT length as compared to the power consumed in the RAM blocks. This resulted 
in lowering of the percentage savings in power for longer FFTs. The Others ordering scheme 
led to no power savings in most cases because the switching activity reduction was much less 
compared to the hardware overhead required to support that scheme. This was not true for Our 
scheme due to the significant reduction in the switching activity of the ordered coefficient set. 
The power saving of Our order based processing scheme ranged from 1% to 25% for 512-point 
to 16-point single butterfly FFT processors respectively over the Others approaches. The power 
saving results would improve considerably using customised dual port RAMs rather than the 
Synopsys DesignWare RAMs. The customised RAM consumed much less power than the 
DesignWare RAM. Moreover, more power saving would be obtained for shorter wordlengths 
due to proportionally less power consumed in the dual port RAMs as compared to the butterfly 
in the radix-2 single butterfly FFT processor architecture. 
It was experimentally found that the order based processing scheme gave the best power results 
when applied to the last stage of the radix-2 FFT processor (the last stage has high switching 
activity because of its different coefficients). It has been found that the saving in switching activity 
was smaller for the inner FFT stages as compared to the hardware overhead needed to support 
the scheme in these stages. This was responsible for the inferior power performance of the FFT 
with ordering incorporated into the inner stages. Hence, the ordering was limited only to the 
last stage of the FFT signal flowgraph. 
It can be concluded from Chapter 6 that the order based processing scheme was effective only 
in the penultimate stage of the FFT processor due to the need for corresponding data ordering 
followed by restoration to the normal data order for the subsequent stage. This approach was not 
feasible for the higher stages due to the complexity of the commutator and also due to the need 
of data order restoration hardware overhead for the subsequent stage. This FFT architecture 
was attractive for wireless LAN applications requiring short FFTs. It could also be used in 
the penultimate stages of long FFTs. The switching activity of the successive coefficients has 
been reduced by 54% in the penultimate stage of the pipelined FFT processor after order based 
processing. The power saving was in the range of 4-30% for the 64-point FFT and 14-37% for 
the 16-point FFT for the ordered FFT processor depending upon the multiplier type and FIFO 
implementation. 
From the results of Chapter 6, it can be concluded that the dual port RAM (ADM) in the last 
commutator stage of the ordered pipelined FFT processor was the hardware overhead needed 
for incorporating order based processing. The power consumed by this hardware overhead 
would be reduced significantly by using a customised dual port RAM instead of a Synopsys 
DesignWare RAM. Moreover, the power consumption in the hardware overhead would be 
considerably lower for shorter wordlengths. This means that the power saving percentage would 
significantly increase by using a customised dual port RAM, and for wordlengths shorter than 
16. 
Chapter 7 presented a 64 sub-carrier MC-CDMA receiver architecture. The power saving was 
obtained both in the FFT and the Combiner. The power saving in the Combiner was made significant 
by keeping the inputs to the divider fixed during the long data demodulation phase. 
It can be concluded from Chapter 8 that significant power saving could be obtained by using 
an adaptive receiver tuned to the changing channel parameters instead of using a fixed receiver 
designed for the worst case channel conditions. This work has presented a novel concept of 
real time adjustment of the FFT size in a wireless receiver as per the channel requirements. 
Two reconfigurable 256-point FFT processor architectures, namely RFFT-I and RFFT-II, have 
been proposed which could be configured for 64-point or 16-point as well. The difference 
between RFFT-I and RFFT-II lay in the extent of clock gating. The clock gating was applied 
to all the modules of stage 1 and stage 2 in case of RFFT-I whereas it was limited to the FSM and 
registers in case of RFFT-II. The selective clock gating helped in reducing the power overhead 
of RFFT-II to just 1% as compared to 26% for RFFT-I with respect to the fixed 256-point FFT 
processor. On the other hand, the power saving in going down to a smaller FFT size was much 
less for RFFT-II due to limited clock gating. The RFFT-I architecture is recommended over 
RFFT-II when the probability of the largest FFT size (256 here) is less than 50% due to the power 
overhead. Two reconfigurable 256 sub-carrier receivers, namely RECEIVER-I and 
RECEIVER-II, have also been proposed depending upon the extent of clock gating. These receivers could 
also be configured for 64 sub-carriers. The power saving results for the receiver were similar to 
those of the reconfigurable FFT processor. The power saving in going down to a smaller number 
of sub-carriers has been clearly established. This flexible architecture is ideal under variable channel 
conditions. 
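The recommendation on when to prefer RFFT-I can be reproduced approximately from the percentages quoted in this thesis. The Python sketch below rests on assumptions: the quoted overheads and savings are all taken relative to the fixed 256-point FFT processor (normalised to 1.0), the processor switches between only two sizes, and the function names are hypothetical.

```python
# Power relative to the fixed 256-point FFT processor (= 1.0), derived from the
# quoted overheads (26% and 1%) and savings (55%/85% and 26%/43%).
RELATIVE_POWER = {
    "RFFT-I":  {256: 1.26, 64: 0.45, 16: 0.15},
    "RFFT-II": {256: 1.01, 64: 0.74, 16: 0.57},
}


def expected_power(arch, p_256, small_size=64):
    """Average relative power when the 256-point mode is used with probability
    p_256 and the smaller mode is used the rest of the time."""
    table = RELATIVE_POWER[arch]
    return p_256 * table[256] + (1.0 - p_256) * table[small_size]


def break_even(small_size=64):
    """Probability of the 256-point mode at which both architectures consume
    the same average power; below it RFFT-I is the better choice."""
    a, b = RELATIVE_POWER["RFFT-I"], RELATIVE_POWER["RFFT-II"]
    return (b[small_size] - a[small_size]) / (
        (a[256] - a[small_size]) - (b[256] - b[small_size]))


if __name__ == "__main__":
    print(break_even(64), break_even(16))
    for p in (0.2, 0.5, 0.8):
        print(p, expected_power("RFFT-I", p), expected_power("RFFT-II", p))
```

Under these assumptions the break-even probability for switching between 256 and 64 points lies a little above one half, which is broadly in line with the 50% guideline stated above.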
In Chapter 9, novel FIR filter cores have been developed based on a number of low power 
algorithms. The algorithms could be used alone or in combination depending upon the power 
and area constraints. For instance, coefficient segmentation based FIR core provided up to 
22% power reduction with an area overhead of 4%. Block processing based core, on the other 
hand, resulted in a power reduction of up to 26% at an expense of only 3% in area. Order 
based processing led to 23% power reduction at an expense of 2% in silicon area. However, 
if more power saving was desired then the combination of coefficient segmentation and block 
processing based core could be used providing up to 39% power reduction, but at the cost of 
7% increase in area. Any of the algorithms could be chosen depending upon the area/power budget. 
10.4 Achievements 
• A novel order based processing scheme is proposed which is much better than the 
conventional coefficient ordering scheme. 
• A low power radix-2 single butterfly FFT processor architecture is proposed based on the 
order based processing scheme. 
• A low power radix-4 pipelined FFT processor architecture is proposed based on the 
order based processing scheme. 
• A low power 64 sub-carrier MC-CDMA receiver architecture is proposed. 
• A novel concept of adaptive receiver is introduced. According to this concept, the 
hardware size of the receiver is dynamically adjusted as per the channel requirements instead 
of designing it for the worst case channel conditions. 
• A novel coefficient memory reduction scheme is proposed that reduces the memory size 
to N/8+1 locations instead of N/4 locations in an N-point FFT processor. 
• A novel coefficient addressing scheme for generating the coefficient address using only 
one multiplexer instead of two barrel shifters in cascade is also proposed for an N-point 
FFT processor. 
• Various novel low power FIR cores for the existing low power algorithms are proposed. 
10.5 Future work 
This thesis has tried to provide a thorough investigation into the research proposal outlined in 
section 10.1. However, a number of additional issues need to be explored which might further 
add to the knowledge gained from the research presented. The additional issues are highlighted 
as follows: 
• In all FFT processors, a fixed point 2's complement number representation is employed. 
The dynamic range can be extended by using block floating point representation. 
• The wordlength is assumed to be fixed for all stages of the pipelined FFT processors. Power 
consumption could be reduced by using the shortest wordlength in the most complex (highest, 
or first) FFT stage and gradually increasing it for the lower stages on the basis of 
the desired SNR. 
• The architecture of a MC-CDMA receiver contains only the basic blocks in the form 
of FFT and combiner for BPSK reception. Additional blocks such as a de-interleaver, 
signal mapper and Viterbi decoder can be added to support more complex modulation 
schemes for better performance in terms of bit error rate. 
• Further aspects, such as hardware for synchronisation in the presence of frequency offset, 
have to be explored for the MC-CDMA receiver. 
• A scheme for altering the receiver complexity in real time needs to be formulated. This 
scheme should clearly establish the relationship between the receiver hardware size and 
the channel conditions for all the important blocks of the receiver. Channel monitoring 
hardware will also have to be designed for a real time adaptive receiver. Moreover, 
the frequency of hardware size adjustment on the basis of channel conditions has to be 
studied for minimising power consumption. 
• This concept of hardware size adjustment on the basis of changing channel parameters 
can be extended to other receiver blocks like the Viterbi decoder. The ultimate objective 
is the realisation of a complete adaptive receiver. 
• Some of the cores are realised in 0.35 µm technology and implemented up to the gate level. 
It would be better to realise all the cores up to the layout level in 0.18 µm technology, 
although it has been found that the comparative power results at the gate level using wire 
load models with SDF (standard delay format) do not differ greatly from those obtained 
after layout. 
• Only the direct form of the FIR filter is explored in this work. This work can be extended 
to other filter structures as well. Moreover, the performance of an architecture derived 
from the combination of coefficient segmentation and ordering needs to be explored. 
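The block floating point representation mentioned in the first item above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a design from this work: the function name `bfp_normalise` is hypothetical, the block is held in 32-bit integers, the wordlength is assumed to lie between 2 and 31 bits, and scaling is done by halving with truncation towards zero.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scale a whole block of samples by a common power of two until every
 * sample fits into a signed word of `wordlength` bits.
 * Returns the number of halvings applied; the caller adds this value
 * to the exponent shared by the block.
 */
int bfp_normalise(int32_t *blk, size_t n, int wordlength)
{
    const int32_t max_val = (int32_t)(((int32_t)1 << (wordlength - 1)) - 1);
    const int32_t min_val = -max_val - 1;
    int shifts = 0;

    for (;;) {
        int fits = 1;
        for (size_t i = 0; i < n; i++) {
            if (blk[i] > max_val || blk[i] < min_val) {
                fits = 0;
                break;
            }
        }
        if (fits)
            return shifts;
        /* One sample overflows: halve every sample in the block. */
        for (size_t i = 0; i < n; i++)
            blk[i] /= 2;
        shifts++;
    }
}
```

In an FFT processor such a step would typically be applied to the outputs of each stage, so that one exponent per block extends the dynamic range while the datapath stays fixed point.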
Appendix A 
Publications 
A.1 Refereed Journals 
M. Hasan, T. Arslan and J. Thompson, "A novel coefficient ordering based low power 
pipelined radix-4 FFT processor for wireless LAN applications", IEEE Transactions on 
Consumer Electronics, vol. 49, no. 1, pp. 128-134, February, 2003. 
M. Hasan and T. Arslan, "Implementation of low power FFT processor cores using a 
novel order based processing scheme", IEE Proceedings on Circuits, Devices and Systems, 
vol. 150, no. 3, pp. 149-154, June 2003. 
A.T. Erdogan, M. Hasan and T. Arslan, "Algorithmic low power FIR cores", IEE Proceedings 
on Circuits, Devices and Systems, vol. 150, no. 3, pp. 155-160, June 2003. 
M. Hasan and T. Arslan, "Scheme for reducing size of the coefficient memory in FFT 
processor", Electronics Letters, vol. 38, no. 4, pp. 163-164, February, 2002. 
M. Hasan and T. Arslan, "Coefficient memory addressing scheme for high performance 
FFT processors", Electronics Letters, vol. 37, no. 22, pp. 1322-1324, October, 2001. 
A.2 Refereed Conferences 
A.T. Erdogan, M. Hasan and T. Arslan, "A low power FIR filtering core", IEEE International 
Conference on ASIC/SOC, pp. 271-275, September, 2001. 
M. Hasan and T. Arslan, "A coefficient memory addressing scheme for VLSI implementation 
of FFT processors", International Symposium on Circuits and Systems, Volume 
4, pp. 850-853, May, 2002. 
M. Hasan and T. Arslan, "FFT coefficient memory reduction technique for OFDM applications", 
International Conference on Acoustics, Speech and Signal Processing, Volume 
1, pp. 1085-1088, May, 2002. 
M. Hasan and T. Arslan, "A Triple Port RAM Based Low Power Commutator Architecture 
for a Pipelined FFT Processor", International Symposium on Circuits and Systems, 
Volume 5, pp. 353-356, May, 2003. 
M. Hasan, T. Arslan and J. Thompson, "A Novel Low Power Pipelined Architecture for 
a MC-CDMA receiver", 3rd International Symposium on Image and Signal Processing 
and Analysis, September, 2003 in Rome, Italy. To appear. 
M. Hasan, T. Arslan and J. Thompson, "A Delay spread based low power reconfigurable 
FFT processor architecture for wireless receivers", International Symposium on System 
on Chip, November, 2003, in Tampere, Finland. To appear. 
Appendix B 
C-code for the order based processing 
algorithm 
/* This programme reads in a coefficient set, performs quantisation 
   based on a specified wordlength and arranges the coefficients as 
   per our order based processing scheme. */ 
#include <stdlib.h> 
#include <stdio.h> 
#include <time.h> 
#include <string.h> 
#include <math.h> 
#define max_wordlength 32 
#define max_coeff_number 4096 
int wordlength, count1, count2, count, hold_bit_number, exchange_index, 
    coeff_number, rcoef_min[max_coeff_number][2], 
    icoef_min[max_coeff_number][2], 
    dsignal[max_coeff_number], rcoef_norm[max_coeff_number], 
    icoef_norm[max_coeff_number], signal[max_coeff_number], 
    nrcoef_min[max_coeff_number][2], nicoef_min[max_coeff_number][2], 
    rcoef_comp[max_coeff_number]; 
double imag_coef[max_coeff_number], real_coef[max_coeff_number], 
       m_factor, tempr, tempi; 
/* Loop indices, scratch variables and transition counters used below */ 
int i, j, k, p, r, temp, xor, trhnorm, trhmin; 
FILE *coefi, *rcoef, *report1, *report2; 
int main(int argc, char *argv[]) 
{ 
    /* Read in command line arguments */ 
    for(i=0; i<argc; i++) 
        if(strcmp(argv[i], "-w") == 0) { 
            wordlength = atoi(argv[i+1]); 
        } 
    for(i=0; i<argc; i++) 
        if(strcmp(argv[i], "-f1") == 0) { 
            coefi = fopen(argv[i+1], "r+"); 
            if(coefi==NULL) printf("Can't open coef\n"); 
        } 
    for(i=0; i<argc; i++) 
        if(strcmp(argv[i], "-f2") == 0) { 
            rcoef = fopen(argv[i+1], "r+"); 
            if(rcoef==NULL) printf("Can't open coef\n"); 
        } 
    /* Step 1 - Read in the imaginary and real parts of the 
       coefficients in decimal from the files 
       coefi and rcoef respectively */ 
    /* Read in the imaginary part */ 
    coeff_number = 0; 
    while(!feof(coefi)) { 
        fscanf(coefi, "%lf", &imag_coef[coeff_number]); 
        coeff_number++; 
    } 
    coeff_number--; 
    fclose(coefi); 
    printf("\n\n\ncoeff_number is %d\n\n\n", coeff_number); 
    /* Read in the real part */ 
    coeff_number = 0; 
    while(!feof(rcoef)) { 
        fscanf(rcoef, "%lf", &real_coef[coeff_number]); 
        coeff_number++; 
    } 
    coeff_number--; 
    fclose(rcoef); 
    printf("\n\n\ncoeff_number is %d\n\n\n", coeff_number); 
    /* Step 2 - Quantise coefficients as per wordlength */ 
    m_factor = pow(2,wordlength-1) - 1; 
    for(i=0; i<max_coeff_number; i++) { 
        tempr = real_coef[i] * m_factor; 
        tempi = imag_coef[i] * m_factor; 
        if(tempr >= 0) { 
            rcoef_norm[i]   = (int)(tempr + 0.9); 
            rcoef_min[i][1] = (int)(tempr + 0.9); 
            rcoef_min[i][0] = i; 
        } 
        else { 
            rcoef_norm[i]   = (int)(tempr + 0.9 + (pow(2,wordlength)-1)); 
            rcoef_min[i][1] = (int)(tempr + 0.9 + (pow(2,wordlength)-1)); 
            rcoef_min[i][0] = i; 
        } 
        if(tempi >= 0) { 
            icoef_norm[i]   = (int)(tempi + 0.9); 
            icoef_min[i][1] = (int)(tempi + 0.9); 
            icoef_min[i][0] = i; 
        } 
        else { 
            icoef_norm[i]   = (int)(tempi + 0.9 + (pow(2,wordlength)-1)); 
            icoef_min[i][1] = (int)(tempi + 0.9 + (pow(2,wordlength)-1)); 
            icoef_min[i][0] = i; 
        } 
        if(rcoef_norm[i] == 0) 
            rcoef_comp[i] = rcoef_norm[i]; 
        else 
            rcoef_comp[i] = (int)(pow(2,wordlength)-rcoef_norm[i]); 
    } /* for */ 
    /* Step 3 - Order coefficients for minimum Hamming distance */ 
    /* Count the number of transitions in the initial 
       imaginary coefficient set */ 
    trhnorm = 0; 
    for(j=0; j<coeff_number-1; j++) 
    { 
        xor = icoef_norm[j] ^ icoef_norm[j+1]; 
        for(k=0; k<wordlength; k++) 
        { 
            if(xor & 01) trhnorm++; 
            xor >>= 1; 
        } 
    } 
    /* Count the number of transitions between the last and the 
       first imaginary coefficient */ 
    xor = icoef_norm[coeff_number-1] ^ icoef_norm[0]; 
    for(k=0; k<wordlength; k++) 
    { 
        if(xor & 01) trhnorm++; 
        xor >>= 1; 
    } 
    /* Count the number of transitions in the initial real 
       coefficient set */ 
    for(j=0; j<coeff_number-1; j++) 
    { 
        xor = rcoef_norm[j] ^ rcoef_norm[j+1]; 
        for(k=0; k<wordlength; k++) 
        { 
            if(xor & 01) trhnorm++; 
            xor >>= 1; 
        } 
    } 
    /* Count the number of transitions between the last and the 
       first real coefficient */ 
    xor = rcoef_norm[coeff_number-1] ^ rcoef_norm[0]; 
    for(k=0; k<wordlength; k++) 
    { 
        if(xor & 01) trhnorm++; 
        xor >>= 1; 
    } 
    /* Display the normal switching activity of the coefficient set */ 
    printf("Number of transitions for icoef_norm vector and " 
           "rcoef_norm vector = %4d\n", trhnorm); 
    /* Sort the imaginary parts of the input coefficient vector 
       for the minimum transition icoef_min */ 
    /* Find an imaginary coefficient with minimum 1's in it and 
       place it at the top */ 
    hold_bit_number = wordlength; 
    for(j=0; j<coeff_number; j++) 
    { 
        count = 0; 
        xor = icoef_min[j][1]; 
        for(k=0; k<wordlength; k++) 
        { 
            if(xor & 01) count++; 
            xor >>= 1; 
        } 
        if(count < hold_bit_number) 
        { 
            hold_bit_number = count; 
            exchange_index = j; 
        } 
    } 
    if(hold_bit_number != wordlength) 
    { 
        r = icoef_min[0][0]; 
        temp = icoef_min[0][1]; 
        icoef_min[0][0] = icoef_min[exchange_index][0]; 
        icoef_min[0][1] = icoef_min[exchange_index][1]; 
        icoef_min[exchange_index][0] = r; 
        icoef_min[exchange_index][1] = temp; 
    } 
    /* Now sort the rest of the imaginary coefficients */ 
    for(j=0; j<coeff_number-1; j++) 
    { 
        hold_bit_number = wordlength; 
        for(k=j+1; k<coeff_number; k++) 
        { 
            count = 0; 
            xor = icoef_min[j][1] ^ icoef_min[k][1]; 
            for(i=0; i<wordlength; i++) 
            { 
                if(xor & 01) count++; 
                xor >>= 1; 
            } 
            if(count < hold_bit_number) 
            { 
                hold_bit_number = count; 
                exchange_index = k; 
            } 
        } 
        if(hold_bit_number != wordlength) 
        { 
            r = icoef_min[j+1][0]; 
            temp = icoef_min[j+1][1]; 
            icoef_min[j+1][0] = icoef_min[exchange_index][0]; 
            icoef_min[j+1][1] = icoef_min[exchange_index][1]; 
            icoef_min[exchange_index][0] = r; 
            icoef_min[exchange_index][1] = temp; 
        } 
    } 
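    /* The imaginary coefficients have already been sorted 
       on the basis of minimum Hamming distance; now start 
       arranging the real part */ 
    /* Choose the first normal real part or its complement 
       corresponding to the already aligned imaginary part */ 
    j = icoef_min[0][0]; 
    count1 = 0; 
    xor = rcoef_norm[j]; 
    for(k=0; k<wordlength; k++) 
    { 
        if(xor & 01) count1++; 
        xor >>= 1; 
    } 
    count2 = 0; 
    xor = rcoef_comp[j]; 
    for(k=0; k<wordlength; k++) 
    { 
        if(xor & 01) count2++; 
        xor >>= 1; 
    } 
    if(count1 < count2) 
    { 
        rcoef_min[0][1] = rcoef_norm[j]; 
        rcoef_min[0][0] = j; 
        signal[0] = 0; 
    } 
    else 
    { 
        rcoef_min[0][1] = rcoef_comp[j]; 
        rcoef_min[0][0] = j; 
        signal[0] = 1; 
    } 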
    /* Choose the normal real part or its complement 
       depending upon the Hamming distance with the 
       preceding real coefficient */ 
    for(j=0; j<coeff_number-1; j++) 
    { 
        p = icoef_min[j+1][0]; 
        count1 = 0; 
        xor = rcoef_min[j][1] ^ rcoef_norm[p]; 
        for(i=0; i<wordlength; i++) 
        { 
            if(xor & 01) count1++; 
            xor >>= 1; 
        } 
        count2 = 0; 
        xor = rcoef_min[j][1] ^ rcoef_comp[p]; 
        for(k=0; k<wordlength; k++) 
        { 
            if(xor & 01) count2++; 
            xor >>= 1; 
        } 
        if(count1 < count2) 
        { 
            rcoef_min[j+1][1] = rcoef_norm[p]; 
            rcoef_min[j+1][0] = p; 
            signal[j+1] = 0; 
        } 
        else 
        { 
            rcoef_min[j+1][1] = rcoef_comp[p]; 
            rcoef_min[j+1][0] = p; 
            signal[j+1] = 1; 
        } 
    } 
    /* Count the number of transitions in the ordered 
       imaginary vector */ 
    trhmin = 0; 
    for(j=0; j<coeff_number-1; j++) 
    { 
        xor = icoef_min[j][1] ^ icoef_min[j+1][1]; 
        for(k=0; k<wordlength; k++) 
        { 
            if(xor & 01) trhmin++; 
            xor >>= 1; 
        } 
    } 
    xor = icoef_min[coeff_number-1][1] ^ icoef_min[0][1]; 
    for(k=0; k<wordlength; k++) 
    { 
        if(xor & 01) trhmin++; 
        xor >>= 1; 
    } 
    /* Count the number of transitions in the ordered 
       real vector */ 
    for(j=0; j<coeff_number-1; j++) 
    { 
        xor = rcoef_min[j][1] ^ rcoef_min[j+1][1]; 
        for(k=0; k<wordlength; k++) 
        { 
            if(xor & 01) trhmin++; 
            xor >>= 1; 
        } 
    } 
    xor = rcoef_min[coeff_number-1][1] ^ rcoef_min[0][1]; 
    for(k=0; k<wordlength; k++) 
    { 
        if(xor & 01) trhmin++; 
        xor >>= 1; 
    } 
/* Display the switching activity of the ordered
coefficient set */
printf("Number of transitions for icoef_min vector and rcoef_min vector = %4d\n\n",trhmin);
/* Store the old and new order along with the
indices in file coef_order.dat */
report1 = fopen("coef_order.dat", "w");
if(report1==NULL) printf("Can't open coef_order.dat\n");
for(i=0; i<coeff_number; i++) {
fprintf(report1, "h[%2d] = %6d (%0x) %6d (%0x) ",i,
icoef_norm[i],icoef_norm[i],rcoef_norm[i],rcoef_norm[i]);
fprintf(report1, "%6d (%0x) %6d %6d (%0x) %6d %d\n",




fprintf(report1, "\nNumber of transitions for (r+i)coef_norm = %3d\n",trhnorm);
fprintf(report1, "Number of transitions for (r+i)coef_min = %3d\n",trhmin);
/* Store the modified set in original order in a specific
format and print it into file mem.dat */
report2 = fopen("mem.dat", "w");
if(report2==NULL) printf("Can't open mem.dat\n");
for(i=0; i<coeff_number; i++)
{








fprintf(report2, "7'h %02x: out = 33'h %d%04x%04x;\n",
i,dsignal[i],nrcoef_min[i][1],nicoef_min[i][1]);
}
fclose(report1);
fclose(report2);
return 0; }
Appendix C 
MATLAB code for the MC-CDMA transceiver
The MATLAB code has a main section that calls the transmitter and receiver functions and also
defines the mobile channel. The code takes as inputs the simulation length (the number of data
bits to be transmitted), the number of users and the signal-to-noise ratio. This appendix starts
with the main section of the code, followed by the code for the MC-CDMA transmitter, and ends
with the code for the MC-CDMA receiver.
C.1 Main section of code
The main section of the code takes as inputs the simulation length (the number of data bits to
be transmitted), the number of active users and the signal-to-noise ratio. It models the channel
as a five-tap FIR filter with exponentially decreasing random coefficients. The exponential
decay reflects the fact that the delayed signals have a lower average power than the direct
path signal.
% Simulation length 
L=1000; 
% No. of users 
N=32; 
% Signal to noise ratio 
snr = 20; 
% Generation of random coefficients for the five-tap FIR filter
% used to model the channel
b = (randn(1,5)+j*randn(1,5)).*exp(-(0:4));
b = b/(norm(b));
% Function used to model the mc-cdma transmitter
[y] = mccdmatransmitter(L,64,N,4,31);
% Complex noise generation
n=1/sqrt(2)*(randn(length(y),1)+i*randn(length(y),1));
% Signal at the input of receiver
y1=filter(b,1,y) + 10^(-snr/20)*n;
% Function used to model the receiver
[d1] = mcreceiver(L,64,N,4,31,y1,ts);
C.2 Function used to model the MC-CDMA transmitter 
% Function used to model the mc-cdma transmitter 
function [y] = mccdmatransmitter(n,L,N,x,ti) 
% n - no of data bits 
% L - code length 
% N - number of users 
% x - cyclic-extend length 
% ti - training interval 
% Code Definition (Normalised) 
C = hadamard(L)/sqrt(L); 
% Setup output vector (Column vector) 
y=zeros(n*(L+x)+floor(n/ti)*(L+x),1);
% Create training symbol
trs = sign(randn(L,1));
ts=ifft(trs);
ts=[ts(L-x+1:L);ts];
% Create random data
% (Each row contains n-bits corresponding to a user)
d=sign(randn(N,n));
index=1; 
for k = 1:floor(n/ti) 
% Transmit training symbol 
y((k-1)*(ti+1)*(L+x)+(1:L+x)) = ts;
% Transmit ti data symbols
% C(:,(1:N)) - first N columns of C out of L
% d(:,1) - first column of data






% normalise y to average power of 1
y=y*sqrt(L);
C.3 Function used to model the MC-CDMA receiver
% Function used to model the receiver 
function [d1] = mcreceiver(n,L,N,x,ti,y,ts)
% n - no of data bits
% L - code length
% N - number of users
% x - cyclic-extend/guard interval
% ti - training interval
% y - received signal
% ts - training sequence
% normalise power of y
y = y/sqrt(L);
% Code Definition (Normalised)
C = hadamard(L)/sqrt(L);
d1=zeros(N,n);
indexr=1; 
for k = 1:floor(n/ti)
% Get channel estimate from training symbol
trposition =(k-1)*(ti+1)*(L+x)+x; rts=fft(y(trposition+(1:L))); ce=rts./ts;
eq=conj(ce)./abs(ce).^2;
% Data Reception
for l = 1:ti
dint= eq.*fft(y((k-1)*(ti+1)*(L+x)+(1:L)+l*(L+x)+x));






Appendix D
Verilog code for the MC-CDMA receiver
The two key blocks of the MC-CDMA receiver are the FFT and the Combiner.
D.1 Verilog code for the FFT 
// Top level FFT module
`timescale 1ns/1ps
`define dwidth 32
`define width 16
`define depth1 16
`define depth2
`define cwidth 5
`define swidth 4
module FFT(clk,in,Xfro,Xfio,reset,count1);
parameter dwidth = `dwidth;
parameter depth1 = `depth1;
parameter depth2 = `depth2;
parameter swidth = `swidth;
parameter cwidth = `cwidth;
parameter width = `width;
input clk,reset;
input [dwidth-1:0] in;
output [width-1:0] Xfro,Xfio;
output [cwidth:0] count1;
wire c42,c52,c62,cm2,c41,c51,c61,cm1,c43,c53,c63,cm3;
wire [dwidth-1:0] o11,o21,o31,o41,o12,o22,o32,o42,o13,o23,o33,o43,w1,wo1,w2,wo2;
wire [width-1:0] Xtro1,Xtio1,Xmro1,Xmio1,Xfmro1,Xfmio1,Xtro2,Xtio2,Xmro2,Xmio2,Xfmro2,Xfmio2;
// ROM for storing the coefficient for stage 1
rom1 romb1(.out(wo1),.addr(count1));
// ROM for storing the coefficient for stage 2
rom2 romb2(.out(wo2),.addr(count1[3:0]));
// Registers for storing the current coefficient inputs
reg0 regr1(.clk(clk), .in(wo1), .out(w1));
regm regr2(.clk(clk), .in(wo2), .out(w2));
// First stage commutator
comu1 comutator1(.clk(clk), .in(in), .o11(o11), .o21(o21), .o31(o31), .o41(o41), .reset(reset), .c41(c41), .c51(c51),
.c61(c61), .cm1(cm1), .count1(count1));
// First stage butterfly
butter butter1(.Xro(Xtro1), .Xio(Xtio1), .xr0(o11[31:16]), .xi0(o11[15:0]), .xr1(o21[31:16]), .xi1(o21[15:0]),
.xr2(o31[31:16]), .xi2(o31[15:0]), .xr3(o41[31:16]), .xi3(o41[15:0]), .c4(c41), .c5(c51), .c6(c61),
.cm(cm1));
// Intermediate register
reg0 regs1(.clk(clk), .in({Xtro1,Xtio1}), .out({Xmro1,Xmio1}));
// First stage complex multiplier
cmult multi1(.xr(Xmro1), .xi(Xmio1), .wr(w1[31:16]), .wi(w1[15:0]), .xro(Xfmro1), .xio(Xfmio1));
// Second stage commutator
comu2 comutator2(.clk(clk), .in({Xfmro1,Xfmio1}), .o12(o12), .o22(o22), .o32(o32), .o42(o42), .reset(reset),
.c42(c42), .c52(c52), .c62(c62), .cm2(cm2), .count2(count1[3:0]));
// Second stage butterfly
butter butter2(.Xro(Xtro2), .Xio(Xtio2), .xr0(o12[31:16]), .xi0(o12[15:0]), .xr1(o22[31:16]), .xi1(o22[15:0]),
.xr2(o32[31:16]), .xi2(o32[15:0]), .xr3(o42[31:16]), .xi3(o42[15:0]), .c4(c42), .c5(c52), .c6(c62),
.cm(cm2));
// Intermediate register
reg0 regs2(.clk(clk), .in({Xtro2,Xtio2}), .out({Xmro2,Xmio2}));
// Ordered complex multiplier
cmulto multi2(.xr(Xmro2), .xi(Xmio2), .wr(w2[31:16]), .wi(w2[15:0]), .xro(Xfmro2), .xio(Xfmio2), .si(w2[32]));
// Third stage commutator
comu3 comutator3(.clk(clk), .in({Xfmro2,Xfmio2}), .o13(o13), .o23(o23), .o33(o33), .o43(o43), .reset(reset),
.c43(c43), .c53(c53), .c63(c63), .cm3(cm3), .count3(count1[4:0]));
// Third stage butterfly
butter butter3(.Xro(Xfro), .Xio(Xfio), .xr0(o13[31:16]), .xi0(o13[15:0]), .xr1(o23[31:16]), .xi1(o23[15:0]),
.xr2(o33[31:16]), .xi2(o33[15:0]), .xr3(o43[31:16]), .xi3(o43[15:0]), .c4(c43), .c5(c53),
.c6(c63), .cm(cm3));
endmodule
// Butterfly module
module butter(Xro,Xio,xr0,xi0,xr1,xi1,xr2,xi2,xr3,xi3,c4,c5,c6,cm);
input [width-1:0] xr0,xi0,xr1,xi1,xr2,xi2,xr3,xi3;
output [width-1:0] Xro,Xio;
wire [width-1:0] m0,m1,m2,m3,xmr0,xmr1,xmr3,xmi1,xmi2,xmi3;
input c4,c5,c6,cm;
addon ex1(.A(xr1), .SI(c5), .B(xmr1));
addon ex2(.A(xr3), .SI(c6), .B(xmr3));
addon ex3(.A(xi1), .SI(c5), .B(xmi1));
addon ex4(.A(xi3), .SI(c6), .B(xmi3));
addon ex5(.A(xi2), .SI(cm), .B(xmi2));
addon ex6(.A(xr0), .SI(cm), .B(xmr0));
sum sumr(.INPUT({m0,m2,xmr1,xmr3}), .SUM(Xro));






// Control inversion module for the butterfly
module addon(A, SI, B);
input [width-1 : 0] A;
input SI;
output [width-1 : 0] B;
reg [width-1 : 0] B;








// Summer module for the low power butterfly
module sum(INPUT, SUM);
input [swidth*width-1 : 0] INPUT;
output [width-1 : 0] SUM;
// Instance of DW02_sum
DW02_sum #(swidth, width)
U1 (.INPUT(INPUT), .SUM(SUM));
endmodule
// 16-bit 2:1 multiplexer for the first stage butterfly
module mux16(out,in0,in1,cont);
output [width-1:0] out;
input [width-1:0] in0,in1;
input cont;
reg [width-1:0] out;








// Complex multiplier for the first stage of FFT processor
module cmult(xr,xi,wr,wi,xro,xio);
input [width-1:0] xr,xi,wr,wi;
output [width-1:0] xro,xio;





add add1(.A(m2[width+width-2 : width-1]), .B(m3[width+width-2 : width-1]), .CI(1'b0), .S(xio));
sub sub1(.A(m1[width+width-2 : width-1]), .B(m4[width+width-2 : width-1]), .CI(1'b0), .DIFF(xro));
endmodule // cmult
// Adder module definition
module add(A,B,CI,S,CO);
input [width-1:0] A,B;
output [width-1:0] S;
input CI;
output CO;
DW01_add #(width) add(.A(A), .B(B), .CI(CI), .SUM(S), .CO(CO));
endmodule
// Subtractor module definition
module sub(A, B, CI, DIFF, CO);
input [width-1 : 0] A,B;
input CI;
output [width-1 : 0] DIFF;
output CO;
// Instance of DW01_sub
DW01_sub #(width)
sub(.A(A), .B(B), .CI(CI), .DIFF(DIFF), .CO(CO));
endmodule
// Two's complement multiplier module definition
module mult(A, B, TC, PROD);
input [width-1 : 0] A,B;
input TC;
output [width+width-1 : 0] PROD;
// Instance of DW02_mult
DW02_mult #(width, width)
mult (.A(A), .B(B), .TC(TC), .PRODUCT(PROD));
endmodule
// Ordered complex multiplier
module cmulto(xr,xi,wr,wi,xro,xio,si);
input [width-1:0] xr,xi,wr,wi;
input si;
output [width-1:0] xro,xio;
wire [width+width-1:0] m2,m4;
wire [width-1:0] b1,b3;
block block1(.A(xr), .B(wr), .SI(si), .PRODUCT(b1));




add add1(.A(m2[width+width-2 : width-1]), .B(b3), .CI(1'b0), .S(xio));
sub sub1(.A(b1), .B(m4[width+width-2 : width-1]), .CI(1'b0), .DIFF(xro));
endmodule // cmulto
// Combination of a two's complement multiplier module and a control inversion addon module to
// implement the order based processing scheme in the second stage complex multiplier
module block(A, B, SI, PRODUCT);
input SI;
input [width-1 : 0] A,B;
wire [width+width-1 : 0] PRO;
output [width-1 : 0] PRODUCT;
mult mult1(.A(A), .B(B), .TC(1'b1), .PROD(PRO));
addon mux(.A(PRO[width+width-2 : width-1]), .SI(SI), .B(PRODUCT));
endmodule
// First stage commutator of the FFT processor
module comu1(clk,in,o11,o21,o31,o41,reset,c41,c51,c61,cm1,count1);
input clk,reset;
input [dwidth-1:0] in;
output [dwidth-1:0] o11,o21,o31,o41;
output c41,c51,c61,cm1;
output [counter_width-1:0] count1;
wire [dwidth-1:0] out0,out1,out2,out3,out4,out5;
wire [swidth-1:0] addr1,addr2;
assign o11=out2;
// Dual port RAMs are used as FIFOs (six FIFOs)
duramc1 ram0(.clk(clk), .rst_n(1'b1), .cs_n(1'b0), .wr_n(1'b0), .rd_addr(addr2),
.wr_addr(addr1), .data_in(in), .data_out(out0));
duramc1 ram1(.clk(clk), .rst_n(1'b1), .cs_n(1'b0), .wr_n(1'b0), .rd_addr(addr2),
.wr_addr(addr1), .data_in(out0), .data_out(out1));
duramc1 ram2(.clk(clk), .rst_n(1'b1), .cs_n(1'b0), .wr_n(1'b0), .rd_addr(addr2),
.wr_addr(addr1), .data_in(out1), .data_out(out2));
duramc1 ram3(.clk(clk), .rst_n(1'b1), .cs_n(1'b0), .wr_n(1'b0), .rd_addr(addr2),
.wr_addr(addr1), .data_in(out2), .data_out(out3));
duramc1 ram4(.clk(clk), .rst_n(1'b1), .cs_n(1'b0), .wr_n(1'b0), .rd_addr(addr2),
.wr_addr(addr1), .data_in(out3), .data_out(out4));
duramc1 ram5(.clk(clk), .rst_n(1'b1), .cs_n(1'b0), .wr_n(1'b0), .rd_addr(addr2),
.wr_addr(addr1), .data_in(out4), .data_out(out5));
mux32 mux1(.out(o21), .in0(out3), .in1(in), .cont(c11));
mux32 mux2(.out(o31), .in0(out4), .in1(out0), .cont(c21));
mux32 mux3(.out(o41), .in0(out5), .in1(out1), .cont(c31));
// FSM of first stage commutator
fsmc1 fsm1(.c11(c11), .c21(c21), .c31(c31), .addr1(addr1), .addr2(addr2), .clk(clk), .reset(reset),
.c41(c41), .c51(c51), .c61(c61), .cm1(cm1), .count1(count1));
endmodule
// Dual port RAM for the first stage commutator
module duramc1(clk, rst_n, cs_n, wr_n, rd_addr, wr_addr, data_in, data_out);





input [swidth-1 : 0] rd_addr,wr_addr;
input [dwidth-1 : 0] data_in;
output [dwidth-1 : 0] data_out;
// Instance of DW_ram_r_w_s_dff
DW_ram_r_w_s_dff #(dwidth, width, rst_mode)
duramc1 (.clk(clk), .rst_n(rst_n), .cs_n(cs_n), .wr_n(wr_n), .rd_addr(rd_addr),
.wr_addr(wr_addr), .data_in(data_in), .data_out(data_out));
endmodule
// FSM for the first stage of the FFT processor
module fsmc1(c11,c21,c31,clk,addr1,addr2,reset,c41,c51,c61,cm1,count1);
output c11,c21,c31,c41,c51,c61,cm1;
output [swidth-1:0] addr1,addr2;
reg [swidth-1:0] addr1,addr2;
input clk,reset;
output [count_width-1:0] count1;
reg [count_width-1:0] count1;
reg c11,c21,c31,c41,c51,c61;
assign cm1 = c51 ^ c61;





































c51=1;
c61=0;
end 














// 32-bit multiplexer for the first stage commutator
module mux32(out,in0,in1,cont);
output [dwidth-1:0] out;
input [dwidth-1:0] in0,in1;
input cont;
reg [dwidth-1:0] out;
always @(in0 or in1 or cont)
begin
ivj,1 







// 32-bit register for the first stage commutator
module reg0(out,in,clk);
output [dwidth-1:0] out;
input [dwidth-1:0] in;
input clk;
reg [dwidth-1:0] out;





// Second stage commutator for the 64-point FFT processor
module comu2(clk,in,o12,o22,o32,o42,reset,c42,c52,c62,cm2,count2);
input clk,reset;
input [cwidth-2:0] count2;
input [dwidth-1:0] in;
output c42,c52,c62,cm2;
output [dwidth-1:0] o12,o22,o32,o42;
wire [dwidth-1:0] out0,A,B,C,D,E,F;
wire [cwidth-3:0] addr1,addr2;
wire [rwidth-1:0] ro;
assign cm2 = ro[27] ^ ro[26];
assign c42 = ro[28];
assign c52 = ro[27];
assign c62 = ro[26];
// Triple port RAMs as FIFOs (three of double size instead of six)
duramc2 ram0(.clk(clk), .rst_n(1'b1), .cs_n(1'b0), .wr_n(1'b0), .rd1_addr(addr2),
.rd2_addr(ro[10:8]), .wr_addr(addr1), .data_in(in), .data_rd1_out(B), .data_rd2_out(A));
duramc2 ram1(.clk(clk), .rst_n(1'b1), .cs_n(1'b0), .wr_n(1'b0), .rd1_addr(ro[16:14]),
.rd2_addr(ro[13:11]), .wr_addr(addr1), .data_in(B), .data_rd1_out(D), .data_rd2_out(C));
duramc2 ram2(.clk(clk), .rst_n(1'b1), .cs_n(1'b0), .wr_n(1'b0), .rd1_addr(ro[19:17]),
.rd2_addr(ro[22:20]), .wr_addr(ro[25:23]), .data_in(D), .data_rd1_out(E), .data_rd2_out(F));
rmux32 mux0c1(.out(o12), .in0(A), .in1(D), .in2(F), .sel(ro[1:0]));
rmux32 mux1c1(.out(o22), .in0(in), .in1(C), .in2(E), .sel(ro[3:2]));
rmux32 mux2c1(.out(o32), .in0(A), .in1(D), .in2(F), .sel(ro[5:4]));
muxc2 mux3c1(.out(o42), .in0(A), .in1(B), .in2(C), .in3(E), .sel(ro[7:6]));
// ROM for addressing the triple port RAMs
romc2 rom2(.out(ro), .addr(count2));
// FSM for stage 2 commutator
fsmc2 fsm12(.addr1(addr1), .addr2(addr2), .clk(clk), .reset(reset));
endmodule
// Triple port RAM module used in the second stage commutator (two read ports and one write port)
module duramc2(clk, rst_n, cs_n, wr_n, rd1_addr,
rd2_addr, wr_addr, data_in, data_rd1_out, data_rd2_out);
input clk, rst_n, cs_n, wr_n;
input [cwidth-3 : 0] rd1_addr,rd2_addr,wr_addr;
input [dwidth-1 : 0] data_in;
output [dwidth-1 : 0] data_rd1_out,data_rd2_out;
// Instance of DW_ram_2r_w_s_dff
DW_ram_2r_w_s_dff #(dwidth, depth, rst_mode)
U1 (.clk(clk), .rst_n(rst_n), .cs_n(cs_n), .wr_n(wr_n), .rd1_addr(rd1_addr),
.rd2_addr(rd2_addr), .wr_addr(wr_addr), .data_in(data_in), .data_rd1_out(data_rd1_out),
.data_rd2_out(data_rd2_out));
endmodule
// Four channel multiplexer for the second stage commutator
module muxc2(out, in0, in1, in2, in3, sel);
input [cwidth-4:0] sel;
output [dwidth-1:0] out;
input [dwidth-1:0] in0,in1,in2,in3;
reg [dwidth-1:0] out;
always @(sel or in0 or in1 or in2 or in3)
begin case (sel)
2'h 0: out = in0;
2'h 1: out = in1;
2'h 2: out = in2;
2'h 3: out = in3;
default : out = 32'h x;
endcase // case(sel)
end
endmodule
// Three channel multiplexer for the second stage commutator
module rmux32(out, in0, in1, in2, sel);
input [cwidth-4:0] sel;
output [dwidth-1:0] out;
input [dwidth-1:0] in0,in1,in2;
reg [dwidth-1:0] out;
always @(sel or in0 or in1 or in2)
begin
case (sel)
2'h 0: out = in0;
2'h 1: out = in1;
2'h 2: out = in2;
default : out = 4'h x;
endcase // case(sel)
end
endmodule
// 33-bit ROM for storing the second stage coefficient along with a flag bit.
module romc2(out,addr);
input [cwidth-2:0] addr;
output [rwidth-1:0] out;
reg [rwidth-1:0] out;
always @(addr)
case (addr)
4'h 0: out = 29'h 0192cb41;
4'h 1: out = 29'h 02130c41;
4'h 2: out = 29'h 16934d49;
4'h 3: out = 29'h 0f1a4d29;
4'h 4: out = 29'h 1bd3cde4;
4'h 5: out = 29'h 0c3ecf29;
4'h 6: out = 29'h 14359689;
4'h 7: out = 29'h 17382089;
4'h 8: out = 29'h 189595e5;
4'h 9: out = 29'h 14f7dd9a;
4'h a: out = 29'h 0ca0b5a9;
4'h b: out = 29'h 0ccd05a9;
4'h c: out = 29'h 18f7dde5;
4'h d: out = 29'h 18e820e5;
4'h e: out = 29'h 00924941;
4'h f: out = 29'h 01128a41;
default : out = 29'h x;
endcase // case(addr)
endmodule // ROM
// FSM for the second stage of FFT
module fsmc2(clk,addr1,addr2,reset);
output [cwidth-3:0] addr1,addr2;
reg [cwidth-3:0] addr1,addr2;




addr2=3'b001;
end 








// 33-bit register for holding the second stage coefficient and a flag bit
module regm(out,in,clk);
output [dwidth:0] out;
input [dwidth:0] in;
input clk;
reg [dwidth:0] out;





// Commutator for the third stage of FFT
module comu3(clk,in,o13,o23,o33,o43,reset,c43,c53,c63,cm3,count3);
input clk,reset;
input [dwidth-1:0] in;
output c43,c53,c63,cm3;
output [dwidth-1:0] o13,o23,o33,o43;
input [cwidth-1:0] count3;
wire [cwidth-3:0] addrx,addry;
wire [dwidth-1:0] out1,out2,out3,out4,out5,out6,A;
wire c13,c23,c33; wire [cwidth-1:0] count3;
assign o13 = out3;
// Dual port RAM is used for restoring the data order
duramc3 ADM(.clk(clk), .rst_n(1'b1), .cs_n(1'b0), .wr_n(1'b0), .rd_addr(addrx),
.wr_addr(addry), .data_in(in), .data_out(A));
// ROM for addressing the dual port RAM (ADM)
romc3 (.out({addrx,addry}), .addr(count3));
// Six FIFOs having unity length
reg0 reg1(.clk(clk), .in(A), .out(out1));
reg0 reg2(.clk(clk), .in(out1), .out(out2));
reg0 reg3(.clk(clk), .in(out2), .out(out3));
reg0 reg4(.clk(clk), .in(out3), .out(out4));
reg0 reg5(.clk(clk), .in(out4), .out(out5));








// Dual port RAM definition for the third stage
module duramc3(clk, rst_n, cs_n, wr_n, rd_addr, wr_addr, data_in, data_out);
parameter rst_mode = 0;
input clk, rst_n, cs_n, wr_n;
input [swidth-2 : 0] rd_addr,wr_addr;
input [dwidth-1 : 0] data_in;
output [dwidth-1 : 0] data_out;
// Instance of DW_ram_r_w_s_dff
DW_ram_r_w_s_dff #(dwidth, depth2, rst_mode)
duramc3 (.clk(clk), .rst_n(rst_n), .cs_n(cs_n), .wr_n(wr_n), .rd_addr(rd_addr),
.wr_addr(wr_addr), .data_in(data_in), .data_out(data_out));
endmodule
// FSM for the third stage of FFT
module fsmc3(c13,c23,c33,clk,reset,c43,c53,c63,cm3,cnt);
input clk,reset;
input [swidth-3:0] cnt;
output c13,c23,c33,c43,c53,c63,cm3;
reg c13,c23,c33,c43,c53,c63;
assign cm3 = c53 ^ c63;

















































// 16-bit multiplexer for the third stage
module muxc3(out,in0,in1,cont);
output [dwidth-1:0] out;
input [dwidth-1:0] in0,in1;
input cont;
reg [dwidth-1:0] out;








endmodule
// ROM contents for generating the read and write addresses for ADM in the third stage
// of the 64-point FFT processor.
module romc3(out, addr);
input [cwidth-1:0] addr;
output [cwidth:0] out;
reg [cwidth:0] out;
always @(addr)
case (addr)
5'h 0: out = 6'h 21;
5'h 1: out = 6'h 12;
5'h 2: out = 6'h 33;
5'h 3: out = 6'h 2c;
5'h 4: out = 6'h 3d;
5'h 5: out = 6'h 06;
5'h 6: out = 6'h 0f;
5'h 7: out = 6'h 10;
5'h 8: out = 6'h 19;
5'h 9: out = 6'h 22;
5'h a: out = 6'h 03;
5'h b: out = 6'h 1c;
5'h c: out = 6'h 08;
5'h d: out = 6'h 29;
5'h e: out = 6'h 23;
5'h f: out = 6'h 3c;
5'h 10: out = 6'h 05;
5'h 11: out = 6'h 36;
5'h 12: out = 6'h 17;
5'h 13: out = 6'h 08;
5'h 14: out = 6'h 19;
5'h 15: out = 6'h 22;
5'h 16: out = 6'h 2b;
5'h 17: out = 6'h 34;
5'h 18: out = 6'h 3d;
5'h 19: out = 6'h 06;
5'h 1a: out = 6'h 27;
5'h 1b: out = 6'h 38;
5'h 1c: out = 6'h 2c;
5'h 1d: out = 6'h 0d;
5'h 1e: out = 6'h 07;
5'h 1f: out = 6'h 18;
default : out = 6'h x;
endcase // case(addr)
endmodule // ROM
// ROM for storing the first stage coefficient
module rom1(out, addr);
input [counter_width-1:0] addr;
output [dwidth-1:0] out;
reg [dwidth-1:0] out;
always @(addr)
case (addr)
6'h 0: out = 32'h 7fff0000;
6'h 1: out = 32'h 7fff0000;
6'h 2: out = 32'h 7f61f373;
6'h 3: out = 32'h 7d89e706;
6'h 4: out = 32'h 7a7cdad7;
6'h 5: out = 32'h 7641cf04;
6'h 6: out = 32'h 70e2c3a9;
6'h 7: out = 32'h 6a6db8e3;
6'h 8: out = 32'h 62f1aecc;
6'h 9: out = 32'h 5a82a57d;
6'h a: out = 32'h 51339d0e;
6'h b: out = 32'h 471c9592;
6'h c: out = 32'h 3c56801d;
6'h d: out = 32'h 30fb89be;
6'h e: out = 32'h 25288583;
6'h f: out = 32'h 18f98276;
6'h 10: out = 32'h 0c8c809e;
6'h 11: out = 32'h 7fff0000;
6'h 12: out = 32'h 7d89e706;
6'h 13: out = 32'h 7641cf04;
6'h 14: out = 32'h 6a6db8e3;
6'h 15: out = 32'h 5a82a57d;
6'h 16: out = 32'h 471c9592;
6'h 17: out = 32'h 30fb89be;
6'h 18: out = 32'h 18f98276;
6'h 19: out = 32'h 00008000;
6'h 1a: out = 32'h e7068276;
6'h 1b: out = 32'h cf0489be;
6'h 1c: out = 32'h 58e39592;
6'h 1d: out = 32'h a57da57d;
6'h 1e: out = 32'h 9592b8e3;
6'h 1f: out = 32'h 89becf04;
6'h 20: out = 32'h 8276e706;
6'h 21: out = 32'h 7fff0000;
6'h 22: out = 32'h 7a7cdad7;
6'h 23: out = 32'h 6a6db8e3;
6'h 24: out = 32'h 51339d0e;
6'h 25: out = 32'h 30fb89be;
6'h 26: out = 32'h 0c8c809e;
6'h 27: out = 32'h e7068276;
6'h 28: out = 32'h c3a98f1d;
6'h 29: out = 32'h a57da57d;
6'h 2a: out = 32'h 801dc3a9;
6'h 2b: out = 32'h 8276e706;
6'h 2c: out = 32'h 809e0c8c;
6'h 2d: out = 32'h 89be30fb;
6'h 2e: out = 32'h 9d0e5133;
6'h 2f: out = 32'h 58e36a6d;
6'h 30: out = 32'h dad77a7c;
6'h 31: out = 32'h 7fff0000;
6'h 32: out = 32'h 7fff0000;
6'h 33: out = 32'h 7fff0000;
6'h 34: out = 32'h 7fff0000;
6'h 35: out = 32'h 7fff0000;
6'h 36: out = 32'h 7fff0000;
6'h 37: out = 32'h 7fff0000;
6'h 38: out = 32'h 7fff0000;
6'h 39: out = 32'h 7fff0000;
6'h 3a: out = 32'h 7fff0000;
6'h 3b: out = 32'h 7fff0000;
6'h 3c: out = 32'h 7fff0000;
6'h 3d: out = 32'h 7fff0000;
6'h 3e: out = 32'h 7ff00000;
6'h 3f: out = 32'h 7fff0000;
default : out = 32'h xxxxxxxx;
endcase // case(addr)
endmodule // ROM
// ROM for storing the second stage coefficient
`define dwidth 32
`define swidth 4 // counter word size
module rom2(out, addr);
parameter dwidth = `dwidth;
parameter swidth = `swidth;
input [swidth-1:0] addr;
output [dwidth:0] out;
reg [dwidth:0] out;
always @(addr)
case (addr)
4'h 0: out = 33'h 180010000;
4'h 1: out = 33'h 180010000;
4'h 2: out = 33'h 180010000;
4'h 3: out = 33'h 180010000;
4'h 4: out = 33'h 180010000;
4'h 5: out = 33'h 000008000;
4'h 6: out = 33'h 07641cf04;
4'h 7: out = 33'h 1cf0589be;
4'h 8: out = 33'h 1cf0589be;
4'h 9: out = 33'h 05a82a57d;
4'h a: out = 33'h 05a82a57d;
4'h b: out = 33'h 15a83a57d;
4'h c: out = 33'h 15a83a57d;
4'h d: out = 33'h 1764230fb;
4'h e: out = 33'h 180010000;
4'h f: out = 33'h 180010000;
default : out = 33'h xxxxxxxxx;
endcase // case(addr)
endmodule // ROM
D.2 Verilog code for the Combiner 
// Top level Combiner module
`define width 16
`define depth3 64
`define c_inputs 3
`define cwidth 5
module Combiner(clk,in,acc,reset,lambda);
parameter depth3 = `depth3;
parameter cwidth = `cwidth;
parameter width = `width;
parameter c_inputs = `c_inputs;
input clk,reset;
input [width+width-1:0] in;
output [width-1:0] acc;
input [width-1:0] lambda;
wire c11,c12,c13,c21,enf,enr,csram;
wire [width-1:0] xdr,xdi,emr,emi,eqr,eqi,acco,mor,moi;
wire [cwidth:0] waddr,raddr;
wire x;
// Multiplication module
block1 blocka(.mor(mor), .moi(moi), .xdr(xdr), .xdi(xdi), .xr(in[31:16]), .xi(in[15:0]), .emr(emr), .emi(emi),
.c11(c11), .c12(c12), .c13(c13), .clk(clk));
// Accumulation and summing module
block2 blockb(.mor(mor), .moi(moi), .accso(acco), .c21(c21), .clk(clk), .acco(acc));
// Division module
block3 blockc(.eqr(eqr), .eqi(eqi), .xdr(xdr), .xdi(xdi), .acci(acco), .lambda(lambda), .clk(clk), .enf(enf),
.enr(enr));
// Memory for storing the equaliser coefficients
cduram mem(.clk(clk), .rst_n(1'b1), .cs_n(csram), .wr_n(1'b0), .rd_addr(raddr), .wr_addr(waddr),
.data_in({eqr,eqi}), .data_out({emr,emi}));
// FSM for the Combiner
cfsm fsm1(.clk(clk), .c11(c11), .c12(c12), .c13(c13), .c21(c21), .enf(enf), .enr(enr), .csram(csram),
.waddr(waddr), .raddr(raddr), .reset(reset));
endmodule
// FSM for the Combiner
module cfsm(clk, c11, c12, c13, c21, enf, enr, csram, waddr, raddr, reset);
output [cwidth:0] raddr,waddr;
output c11,c12,c13,c21,enf,enr,csram;
reg c11,c12,c13,c21,enf,enr,csram;
input clk,reset;
parameter pilot = 1'h0;
parameter data  
reg ps,ns;
reg [cwidth:0] carrier_count, next_carrier_count;
reg [cwidth-1:0] symbol_count, next_symbol_count;
assign waddr = carrier_count;
assign raddr = carrier_count+3;










carrier_count=next_carrier_count;
symbol_count=next_symbol_count;
end
end
always @(ps or carrier_count or symbol_count) begin







enf = 1'b1;
enr = 1'b1;
next_carrier_count = carrier_count + 1;








if (carrier_count < 63)
begin 














next_carrier_count = carrier_count+1;
if(symbol_count ==






if (carrier_count < 1) 
c13 = 1'b0;
else 

























if (carrier_count < 63)
next_symbol_count = symbol_count;
else 





// Dual port RAM for storing the coefficients 
module cduram(clk, rst_n, cs_n, wr_n, rd_addr, wr_addr, data_in, data_out);
input clk,rst_n,cs_n,wr_n;
input [cwidth : 0] rd_addr,wr_addr;
input [width+width-1 : 0] data_in;
output [width+width-1 : 0] data_out;
// Instance of DW_ram_r_w_s_dff
DW_ram_r_w_s_dff #(width+width, depth3, rst_mode)
cduram (.clk(clk), .rst_n(rst_n), .cs_n(cs_n), .wr_n(wr_n), .rd_addr(rd_addr), .wr_addr(wr_addr),
.data_in(data_in), .data_out(data_out));
endmodule
// Verilog code for Block1 of Combiner
module block1(mor,moi,xdr,xdi,xr,xi,emr,emi,c11,c12,c13,clk);
output [width-1:0] mor,moi,xdr,xdi;
input [width-1:0] xr,xi,emr,emi;
input clk,c11,c12,c13;
wire [width-1:0] m1,m2,m3,m4,xro,xio,PI;
wire [width+width-1:0] PRODUCTR,PRODUCTI;
assign mor = PRODUCTR[width+width-2:width-1];
assign PI = PRODUCTI[width+width-2:width-1];
assign xdr = m1;







mult multr(.A(m1), .B(m2), .TC(1'b1), .PROD(PRODUCTR));
mult multi(.A(m3), .B(m4), .TC(1'b1), .PROD(PRODUCTI));
caddon compli(.A(PI), .SI(c13), .B(moi));
endmodule
// Control inversion module for Block 1 of Combiner
module caddon(A, SI, B);
input [width-1 : 0] A;
input SI;
output [width-1 : 0] B;
reg [width-1 : 0] B;





B = A; 
end 
endmodule 
// 16-bit multiplexer for Block1 of Combiner
module cmux16(out,in0,in1,cont);
output [width-1:0] out;
input [width-1:0] in0,in1;
input cont;
reg [width-1:0] out;








// 16-bit register for Block1 of Combiner
module creg16(out,in,clk);
output [width-1:0] out;
input [width-1:0] in;
input clk;
reg [width-1:0] out;
always @(posedge clk)
begin




// Verilog code for Block2 of the Combiner
module block2(mor,moi,accso,acco,c21,clk);
input [width-1:0] mor,moi;
output [width-1:0] accso,acco;
input clk,c21;
wire [width-1:0] sr,si,mo,accio;
assign accso = {4'h0,accio[width-1:4]};
creg16 reg21(.out(sr), .in(mor), .clk(clk));
creg16 reg22(.out(si), .in(moi), .clk(clk));




// Summer for Block2 of the Combiner
module csum(INPUT, SUM);
input [c_inputs*width-1 : 0] INPUT;
output [width-1 : 0] SUM;
// Instance of DW02_sum
DW02_sum #(c_inputs, width)
U1 (.INPUT(INPUT), .SUM(SUM));
endmodule
// Verilog code for Block3 of the Combiner
module block3(eqr,eqi,xdr,xdi,acci,lambda,clk,enf,enr);
output [width-1:0] eqr,eqi;
input [width-1:0] xdr,xdi,acci,lambda;
input clk,enf,enr;
wire [width-1:0] deno,den,fr,fi,fmi,ieqr,ieqi;
assign eqr={ieqr[width-6:0],5'h0};
assign eqi={ieqi[width-6:0],5'h0};




divide dividerr(.A(fr), .B(deno), .TC(1'b1), .QUOTIENT(ieqr));
divide divideri(.A(fmi), .B(deno), .TC(1'b1), .QUOTIENT(ieqi));
endmodule
// FIFO module for Block3 of Combiner
module fifo(out,in,clk,enf);
output [width+width-1:0] out;
input [width+width-1:0] in;
input clk,enf;
wire enclk;
wire [width+width-1:0] i1;
creg32 regf0(.out(i1), .in(in), .clk(clk), .enf(enf));
creg32 regf1(.out(out), .in(i1), .clk(clk), .enf(enf));
endmodule
// Register with enable for Block3 of Combiner
module reg16E(out,in,clk,enr);
output [width-1:0] out;
input [width-1:0] in;
input clk,enr;
reg [width-1:0] out;







// Adder for Block3 of Combiner
module cadd(A,B,CI,S,CO);
input [width-1:0] A,B;
output [width-1:0] S;
input CI;
output CO;
DW01_add #(width) cadd(.A(A), .B(B), .CI(CI), .SUM(S), .CO(CO));
endmodule
// 32-bit register for FIFO module of Block3
module creg32(out,in,clk);
output [width+width-1:0] out;
input [width+width-1:0] in;
input clk;
reg [width+width-1:0] out;





// Divider module for Block3 of Combiner
module divide(A, B, TC, DIVIDE_BY_0, QUOTIENT);
parameter TC_mode = 1;
input [width-1:0] A,B;
input TC;
output DIVIDE_BY_0;
output [width-1:0] QUOTIENT;
// Instance of DW02_divide
DW02_divide #(width, width, TC_mode)
U1 (.A(A), .B(B), .TC(TC), .DIVIDE_BY_0(DIVIDE_BY_0), .QUOTIENT(QUOTIENT));
endmodule
