Multicarrier Demultiplexing and VLSI Implementation for Satellite Communications Systems. by Qi, Ronggang.
Multicarrier Demultiplexing and VLSI 
Implementation for Satellite 
Communications Systems
by
Ronggang QI
Thesis submitted to the University of Surrey 
for the degree of 
Doctor of Philosophy
Centre for Satellite Engineering Research, 
University of Surrey,
Guildford, Surrey,
United Kingdom.
May, 1996
ProQuest Number: 27726994
Published by ProQuest LLC (2019). Copyright of the Dissertation is held by the Author.
Acknowledgments
I would like to express my gratitude to Professor B. G. Evans and Ms. F. P. Coakley for 
supervising me throughout my years at Surrey. Their guidance, encouragement, and support 
have been an important factor in my completing the Ph.D. programme. I am particularly 
grateful to Ms. F. P. Coakley who carefully reviewed the draft, pointed out corrections, and 
made valuable comments.
I also would like to express my gratitude to Dr. W. Yim for his suggestions and comments 
through our interesting discussions in the early phase of this work. I wish to thank many of 
my colleagues and friends at Surrey for their assistance and helpful discussions.
I wish to thank the Committee of Vice-Chancellors and Principals (CVCP), U.K. for the ORS 
awards that have supported this research work.
I would like to dedicate this thesis to my parents, my wife, and all the family members who 
constantly give me their love and encouragement throughout the period of my study 
overseas.
Abstract
This thesis focuses on the derivation and VLSI implementation of low-complexity multicarrier 
demultiplexers (DEMUXs) for on-board processing (OBP) satellites. Three major 
contributions are presented:
1. A systematic approach to multirate system optimization, based on the multirate signal 
flow graph (MSFG) representation and its transforms, is proposed. A major advantage of 
this approach is that it provides clearer structural information and avoids tedious and 
ad hoc mathematical manipulations in multirate network simplification. Many computationally 
efficient multicarrier DEMUX structures can be derived using the MSFG approach. A 
number of identities and transforms of MSFG are identified and summarized.
2. A simple multirate VLSI modeling method for efficient mapping from a computational 
structure to VLSI architecture is also proposed. With this method, the complexity, power 
consumption, and throughput of a VLSI architecture can be considered jointly, and 
trade-offs between them can be made by imposing different constraints. For given system 
parameters and a DEMUX structure, the search for an optimal VLSI architecture becomes a 
matter of determining the configuration of the basic components. The 
proposed method provides technology-independent estimation for VLSI complexity and 
power consumption. Examples illustrate the usefulness of the proposed method for 
comparative study of alternative VLSI architectures.
3. A low-complexity binary tree architecture, known as the TM2-tree, is proposed for 
efficient VLSI implementation. In a TM2-tree, not only is the inner product (IP) 
time-shared amongst stage channels, but the IP itself is also realized by time-sharing a common 
complex arithmetic processor for both the lowpass and the highpass filtering processes at 
a lower hierarchy level. A gate array ASIC design for an 8-channel tree DEMUX based 
on the TM2-tree architecture is also presented.
Additionally, issues on bandpass sampling, FDM signal channel stacking, efficient complex 
filter structures, flexible DEMUX structures, bit-serial arithmetic and techniques, systolic 
DFT and FIR architectures, etc. are also discussed and some novel structures and interesting 
results are provided in the relevant chapters.
Contents
Acknowledgments
Abstract
Table of Contents
List of Figures
List of Tables
Acronyms
Notation
Chapter 1 Introduction.................................................................... 1
1.1 On-board processing satellite........................................................... 1
1.2 Multicarrier demultiplexer and demodulator.............................................. 3
1.3 Mapping MCDD function into VLSI......................................................... 4
1.4 Outlines of chapters.................................................................... 5
Chapter 2 Bandpass Signals, Filtering, and Multirate Signal Processing.................... 8
2.1 Band-limited signals.................................................................... 8
2.1.1 Lowpass signals....................................................................... 8
2.1.2 Analytic signals and Hilbert transform................................................ 9
2.1.3 Frequency translations and bandpass signals........................................... 9
2.1.4 Frequency translation (rotation) invariant property.................................. 11
2.2 Bandpass sampling theorem.............................................................. 13
2.2.1 First-order bandpass sampling theorem................................................ 13
2.2.2 Practical considerations............................................................. 14
2.3 Complex filters........................................................................ 16
2.3.1 Direct transversal structure......................................................... 17
2.3.2 Rotation Invariant Complex FIR (RICF) network........................................ 17
2.3.3 Down-Real Filtering-Up (DRFU) network................................................ 18
2.4 Frequency division multiplexed signals................................................. 19
2.4.1 FDM signals.......................................................................... 19
2.4.2 Sampling of FDM signals................................................................................................. 20
2.4.3 Channel stacking.............................................................................................................. 22
2.4.3.1 Stacking of analog FDM...........................................................................................22
2.4.3.2 Channel stacking of sampled FDM signals...............................................................22
2.4.3.3 Frequency translation and channel stacking.............................................................23
2.5 Multirate systems and signal processing................................................................................. 24
2.5.1 Discrete signal representations.........................................................................................24
2.5.1.1 Polyphase representation...........................................................................................24
2.5.1.2 Modulation representation........................................................................................24
2.5.2 Sampling rate alteration....................................................................................................25
2.5.2.1 Downsampling and decimation................................................................................. 25
2.5.2.2 Upsampling and interpolation................................................................................... 26
2.5.3 Rational sampling rate conversion................................................................................... 26
2.5.3.1 Rational sampling rate conversion............................................................................27
2.5.3.2 Polyphase analysis and polyphase synthesis of signals............................................ 28
2.5.3.3 Polyphase decomposition of decimation and interpolation filters...........................30
2.6 Frequency demultiplexing........................................................................................................32
2.7 Summary................................................................................................................................... 32
Chapter 3 Multirate Filter Banks and Frequency Demultiplexing............................... 34
3.1 Transmultiplexing and demultiplexing.....................................................................................34
3.2 Demultiplexing function of MCDDs.......................................................................................35
3.2.1 Real and complex FDM....................................................................................................35
3.2.2 Rational sampling rate decimation....................................................................................36
3.2.3 Channel stacking requirement.......................................................................................... 37
3.2.3.1 Complex FDMs......................................................................................................... 37
3.2.3.2 Real FDMs.................................................................................................................37
3.2.3.3 Odd-to-even stacking conversion............................................................ 37
3.2.4 Demultiplexing function...................................................................................................39
3.2.4.1 Lowpass model.........................................  39
3.2.4.2 Bandpass model......................................................................................................... 40
3.3 Demultiplexing approaches for OBP applications.................................................................. 42
3.3.1 Per-channel approach.......................................................................................................42
3.3.2 Block processing approaches........................................................................................... 42
3.3.2.1 Maximally decimated Polyphase DFT filter banks................................................... 43
3.3.2.2 Polyphase Matrix DFT filter banks...........................................................................44
3.3.2.3 Modified PMDFT filter banks...................................................................................46
3.3.3 Frequency domain block processing...............................................................................49
3.3.4 Analysis-synthesis method............................................................................................... 50
3.3.5 Multistage approach: tree structures............................................................................... 52
3.4 Flexible multicarrier DEMUX architectures...........................................................................54
3.4.1 Per channel-polyphase FFT programmable architecture..................................................54
3.4.2 Reconfigurable tree structures.........................................................................................55
3.4.3 Single-stage reconfigurable architecture..........................................................................56
3.5 Summary................................................................................................................................... 56
Chapter 4 Multirate Signal Flow Graph Approach to Multirate Network Optimization............ 58
4.1 Introduction..............................................................................................................................58
4.2 MSFG........................................  59
4.3 Basic multirate operators and MSFG node functions...................................................   60
4.4 MSFG transformation..............................................................................................................62
4.4.1 The noble identities........................................................................................................... 63
4.4.2 Cascade of samplers......................................................................................................... 63
4.4.3 Commutator decomposition............................................................................................. 64
4.4.4 Commutator cascade........................................................................................................ 65
4.4.5 Complex filter identities.............................................................................................  66
4.4.6 Polyphase decomposition transforms.............................................................................. 67
4.4.7 Modulation polyphase decomposition transforms...........................................................67
4.4.8 Modulation identities........................................................................................................ 69
4.4.9 Commutator-modulator cascades.....................................................................................70
4.4.10 Commutator-filter cascades........................................................................................... 72
4.4.11 Commutator-sampler commutability............................................................................. 72
4.4.12 Identities associated with composite rate-changing nodes........................................... 74
4.5 Examples..........................................................   74
4.5.1 Five-channel DEMUX for an MF-TDMA system...........................................................74
4.5.2 The optimal complex BSF structure for binary tree DEMUX....................................... 78
4.5.3 The optimal real BSF structure........................................................................................ 82
4.6 Summary................................................................................... 82
Chapter 5 Complexity and Power Consumption Analysis and VLSI Architecture Optimization...... 84
5.1 Introduction............................................................................................................................ 84
5.2 Multirate VLSI complexity model.......................................................................................... 86
5.2.1 Complexity and power consumption in CMOS VLSI.................................................... 87
5.2.1.1 VLSI complexity....................................................................................................... 87
5.2.1.2 CMOS VLSI power consumption.........................................................   87
5.2.2 Complexity model of multirate VLSI.............................................................................. 88
5.2.2.1 Single-rate systems....................................................................................................88
5.2.2.2 Multirate VLSI.......................................................................................................... 90
5.2.2.3 The configuration matrix A .......................................................................................91
5.2.2.4 On the A-domain.......................................................................................................91
5.2.3 Complexity and power consumption in multirate VLSI.................................................91
5.2.3.1 Complexity estimation for multirate systems...........................................................91
5.2.3.2 Power consumption for multirate systems.............................................................92
5.2.4 Bit-parallel vs. bit-serial technique.................................................................................. 93
5.2.5 Throughput of multirate DSP systems............................................................................. 93
5.3 VLSI architecture for multirate system via optimization....................................................... 94
5.3.1 Objective functions of optimal multirate VLSI architectures......................................... 94
5.3.2 Sub-optimal criteria in minimum distance sense..............................................................95
5.3.3 Weighted evaluation function approach..........................................................................95
5.3.4 Strong preferences: partial optimization......................................................................... 96
5.3.5 Constrained optimization problems................................................................................. 96
5.3.6 The mapping between p  and A........................................................................................97
5.4 Examples.................................................................................................................................. 98
5.4.1 PMDFT and TM-PMDFT filter banks............................................................................ 98
5.4.2 PMDFT and TM-PMDFT complexity............................................................................. 99
5.4.3 VLSI architecture optimization................................   100
5.4.3.1 Optimal PMDFT architectures................................................................................100
5.4.3.2 Optimal TM-PMDFT architectures........................................................................ 103
5.5 Conclusions............................................................................................................................. 106
Chapter 6 Bit-Serial Techniques and Systolic Architectures...........................................107
6.1 Bit-serial arithmetic for digital signal processing..................................................................107
6.1.1 Bit-serial adder/subtractor..............................................................................................108
6.1.2 Serial-parallel multiplier.................................................................................................. 109
6.1.3 Bit-serial 2’s complement ..............................................................................................113
6.2 Distributed arithmetic techniques...........................................................................................113
6.2.1 DA approach for inner product generation....................................................................114
6.2.2 Long length FIR using DA techniques.......................................................................... 115
6.3 Systolic Stream DFT structures.............................................................................................116
6.3.1 Stream DFT.................................................................................................................... 116
6.3.2 Two SS-DFT structures.................................................................................................116
6.3.2.1 Kung’s model...........................................................................................................117
6.3.2.2 Chang’s model..........................................................................................................118
6.3.3 Comparison between SS-DFT and FFT ........................................................................121
6.4 Systolic FIR ............................................................................................................................ 122
6.4.1 Why systolic FIR?...........................................................................................................122
6.4.2 A new semi-systolic architecture for FIR......................................................................122
6.4.3 A novel pure systolic FIR architecture.......................................................................... 125
6.4.4 Pure systolic halfband FIR...................................................................................... 131
6.5 Summary.......................................................  132
Chapter 7 VLSI Architecture for Binary Tree DEMUX.................................................. 134
7.1 Binary tree DEMUX..............................................................................................................134
7.1.1 Real binary tree............................................................................................................... 135
7.1.2 Complex binary tree ....................................................................................................... 136
7.2 Time-multiplexed tree ............................................................................................................138
7.3 Time-multiplexed BSF............................................................................................................140
7.3.1 Decomposition of BSF................................................................................................... 140
7.3.2 The time-multiplexed BSF..............................................................................................140
7.3.3 The DA inner-product.................................................................................................... 142
7.4 The two-path input buffer.......................................................................   143
7.5 VLSI architecture for TM2-tree.............................................................................................144
7.6 An 8-channel DEMUX ASIC based on TM2-tree................................................................ 147
7.7 Comparisons with other TM-trees................................................................................... 151
7.7.1 Direct tree vs. TM-tree...............  151
7.7.2 Other BSF structures...................................................................................................... 152
7.7.3 TM-trees using other BSF structures............................................................................ 154
7.7.4 Complexity of TM-trees................................................................................................. 155
7.7.5 Power consumption of TM-trees................................................................................... 157
7.7.6 Throughput of TM-trees................................................................................................. 158
7.8 Summary..................................................................................................................................159
Chapter 8 Conclusion and Future Work...................   161
Appendices................................................................................................................................. 167
Appendix A Practical minimum sampling frequency for first-order bandpass sampling 167
Appendix B Homogenous stacking of uniformly sampled FDM signals................................... 170
Appendix C Derivation of polyphase-matrix-DFT filter banks.................................................. 171
Appendix D Simplification of DFT-Frequency Shift-IDFT network........................................ 174
Appendix E Upsampler-SPC and PSC-Downsampler identities................................................ 175
Proof of the upsampler-SPC Identity:...............................................  176
Proof of the SPC-downsampler Identity:................................................................................178
Appendix F The rule of “Cut and Insert” ....................................................................................180
Appendix G SOS technology and MA9000A Sea of Gates ......................................................182
Appendix H Circuits diagrams of a 16-channel 6-bit DEMUX................................................. 184
References ................................................................................................................................. 191
Publications................................................................................................................................ 197
List of Figures
Figure 1.1 OBP system concept: multibeam with on-board switching.......................................2
Figure 1.2 MCDD function............................................................................................................ 3
Figure 1.3 From function model to VLSI................................   5
Figure 2.1 Lowpass signals............................................................................................................ 9
Figure 2.2 Bandpass signals..............................................................   11
Figure 2.3 Bandpass signal protected with guard bands............................................................. 13
Figure 2.4 Minimum and acceptable sampling frequencies for bandpass signals.................. 13
Figure 2.5 Transversal FIR structures ........................................................................................ 17
Figure 2.6 Transversal complex FIR.......................................................................................... 18
Figure 2.7 DRFU complex filter..................................................................................................19
Figure 2.8 SSB FDM signals ......................................................................................................20
Figure 2.9 Sampling schemes of FDM signals............................................................................21
Figure 2.10 Non-homogenous channel stacking (K=3)..............................................................22
Figure 2.11 Spectra relation for integer rate alteration ..............................................................26
Figure 2.12 Rational sampling rate conversion...........................................................................28
Figure 2.13 Polyphase analysis of signal......................................................................................29
Figure 2.14 Polyphase synthesis of signal....................................................................................30
Figure 2.15 Transposition of LTI polyphase network................................................................. 30
Figure 2.16 Polyphase decimation filter....................................................................................... 31
Figure 2.17 Polyphase interpolation filter....................................................................................31
Figure 2.18 Polyphase-matrix rational sampling rate converter.................................................32
Figure 2.19 Frequency demultiplexing......................................................................................... 32
Figure 3.1 Channel stacking for real and complex FDM signals...............................................38
Figure 3.2 Lowpass uniform DEMUX function ........................................................................39
Figure 3.3 Derivation for bandpass DEMUX model................................................................. 40
Figure 3.4 Bandpass DEMUX function.......................................................................................40
Figure 3.5 Bandpass model for maximally decimated polyphase DFT filter bank................. 43
Figure 3.6 Maximally decimated polyphase DFT filter bank....................................................44
Figure 3.7 PMDFT filter bank structure..................................................................................... 45
Figure 3.8 Interpretation for the inner summation of Eq.(3.15)............................................... 47
Figure 3.9 Interpretation for Eq.(3.15)........................................................................................48
Figure 3.10 Pipeline PMDFT structure.......................  48
Figure 3.11 Fast convolution non-uniform DEMUX architectures........................................... 50
Figure 3.12 Partial reconstruction analysis-synthesis demultiplexer........................................ 51
Figure 3.13 Tree structure  .........................................................   52
Figure 3.14 Binary tree DEMUX..................................................................................................53
Figure 4.1 Noble identities........................................................................................................... 63
Figure 4.2 Sampler-cascade identities.........................................................................................64
Figure 4.3 Commutation decomposition.....................................................................................65
Figure 4.4 Commutator cascades.................................................................................................66
Figure 4.5 Complex filtering identities........................................................................................66
Figure 4.6 Polyphase decomposition transforms (PDT)........................................   67
Figure 4.7 Two types of MPDT structures in conventional SFG..............................................68
Figure 4.8 MSFGs of type 1 and type 2 MPDT..........................................................................69
Figure 4.9 MPDT for complex modulation................................................................................ 69
Figure 4.10 Modulation identities  .................................................................................. 70
Figure 4.11 Commutator-SPC and SPC-commutator identities.................................................72
Figure 4.12 Commutator-filter Identities......................................................................................72
Figure 4.13 Sampler-commutator identities................................................................................ 73
Figure 4.14 Identities associated with composite rate-changing nodes......................................74
Figure 4.15 Five-channel MF-TDMA signal and the demultiplexing channel filter................75
Figure 4.16 DEMUX function (lowpass model)..........................................................................75
Figure 4.17 Step-by-step simplification for demultiplexing the k-th channel........................... 77
Figure 4.18 Polyphase DFT DEMUX structure for the 5-channel MF-TDMA traffic.............78
Figure 4.19 Direct complex BSF structure...................................................................................78
Figure 4.20 Derivation for the optimal complex BSF................................................................. 81
Figure 4.21 The optimal real BSF structure
Figure 5.1 Single-rate system representation
Figure 5.2 A multirate system representation
Figure 5.3 Time-multiplexed implementation of PMDFT
Figure 5.4 VLSI complexities of a PMDFT filter bank
Figure 5.5 VLSI complexities of a TM-PMDFT filter bank
Figure 6.1 Bit-serial add & subtract module
Figure 6.2 A 2’s complement, 4x4-bit, bit-parallel pipeline multiplier
Figure 6.3 Step-by-step illustration of parallel/serial full-precision multiplication
Figure 6.4 The novel parallel/serial full-precision multiplier
Figure 6.5 A simple bit-serial 2’s complement circuit
Figure 6.6 DA approach to long length FIR implementation
Figure 6.7 Derivation steps for Kung’s model
Figure 6.8 Kung’s SS-DFT
Figure 6.9 Derivation steps for Chang’s model
Figure 6.10 Chang’s SS-DFT
Figure 6.11 Hardware complexity comparison between SS-DFT and FFT
Figure 6.12 A new systolic architecture for symmetric FIR convolutions
Figure 6.13 Modular expansion for large N
Figure 6.14 Two-path FIR structures using symmetry property
Figure 6.15 A new 1D array for linear phase FIR
Figure 6.16 Pipelined 1D FIR array
Figure 6.17 Novel pure systolic arrays for linear phase FIR
Figure 6.18 Pure systolic architecture for halfband FIR
Figure 7.1 Binary tree DEMUX
Figure 7.2 Real binary tree spectra and real BSF
Figure 7.3 Complex binary tree spectra and complex BSF
Figure 7.4 The direct tree vs. the TM-tree
Figure 7.5 The time-multiplexed BSF (N=7)
Figure 7.6 The DA real inner-product processor architecture
Figure 7.7 Input buffers: (a) linear FIFO (b) two-path buffer (c) operation of (b)
Figure 7.8 VLSI architecture for an 8-channel DEMUX based on TM2-tree
Figure 7.9 Control signals and the data flow of the TM-tree architecture
Figure 7.10 Design hierarchy of the DEMUX ASIC
Figure 7.11 Eight-channel TM-tree DEMUX circuits diagram (top hierarchy)
Figure 7.12 Gate Count Estimation for BSF Structures
Figure 7.13 TM-tree architectures for different types of BSFs
Figure 7.14 Complexity estimation for TM-tree DEMUXs with 7-tap BSFs
Figure 7.15 Power consumption estimation for TM-trees
List of Tables
Table 2.1 Complexity of complex FIR structures
Table 3.1 Odd-to-even conversion with trivial frequency shift
Table 3.2 Trivial frequency shift: even-stacking (D=2)
Table 3.3 Trivial frequency shift: odd-stacking (D=2)
Table 4.1 MSFG Nodes
Table 5.1 Operation count for direct implementation of PMDFT
Table 5.2 Operation count for TM-PMDFT filter banks
Table 5.3 Multiplication and addition counts of optimized radix-2 complex FFTs
Table 5.4 Optimization results for the PMDFT structure
Table 5.5 Optimization results for the TM-PMDFT structure
Table 7.1 Complexity and power consumption of DEMUX ASICs
Table 7.2 Complexity requirement for complex BSF structures
Table 7.3 Gate count of some basic arithmetic operators and delay element
Table 7.4 Gate count estimation for BSF schemes
Acronyms
1D one-dimensional
ADC analog-to-digital converter
ASIC application specific integrated circuits
BC basic component
BER bit error rate
BO basic operation
BSF band-splitting filter
CAD computer aided design
CDMA code division multiple access
CMOS complementary metal-oxide semiconductor
DA distributed arithmetic
DC direct current
DEMUX demultiplexer
DFT discrete Fourier transform
DRFU down-real filtering-up
DS downsampler
DSB double sideband
DSP digital signal processing (processor)
EIRP effective isotropic radiated power
FDM frequency division multiplexing
FDMA frequency division multiple access
FFT fast Fourier transform
FIFO first-in-first-out
FIR finite impulse response
G/T gain-to-noise temperature ratio
GDFT generalized discrete Fourier transform
HPA high powered amplifier
IDFT inverse discrete Fourier transform
IDU integral & dump node
IF intermediate frequency
IFFT inverse fast Fourier transform
IIR infinite impulse response
I/O input and output
IP inner product
IS implementing structure
ISI intersymbol interference
LPTV linear periodically time-varying
LSB least significant bit
LTI linear time invariant
MAC multiply & accumulate
MCDD multicarrier demultiplexer and demodulator
MF-CDMA multi-frequency CDMA 
MF-TDMA multi-frequency TDMA 
MOD modulation node
MPDT modulation polyphase decomposition transform
MSB most significant bit
MSFG multirate signal flow graph
MSM microwave switch matrix
OBP on-board processing
OD ordinary node
PE processing element
PMDFT polyphase matrix DFT
PR perfect reconstruction
PSC parallel-to-serial commutator (converter)
QMF quadrature mirror filter
QPSK quaternary phase shift keying
RAM random access memory
RF radio frequency
RICF rotation invariant complex FIR
ROM read-only memory
SA sampling node (without hold)
SAW surface acoustic wave
SCPC single channel per carrier
SDR signal-to-distortion noise ratio
SFG signal flow graph
SH sampling & hold node
SOS silicon-on-sapphire
SPC serial-to-parallel commutator (converter)
SPE super processing element
SS-DFT systolic stream DFT
SSB single sideband
TDM time division multiplexing
TDMA time division multiple access
TM-BSF time-multiplexed BSF
TM-PMDFT time-multiplexed PMDFT 
TM-tree time-multiplexed tree
TM2-tree time-multiplexed tree with time-multiplexed BSF
TMUX transmultiplexer
US upsampler
USH upsampling & hold node
VLSI very large scale integration
VSAT very small aperture terminal
Notation
$\lfloor(\cdot)\rfloor$ takes the largest integer that is not greater than (·)
$\langle(\cdot)\rangle_m$ (·) modulo m
a a  matrix
a roll-off factor;
the average percentage of active gates of a circuit
«„ the a  value of BC z at sampling frequency^
$a_x(t)$ analytic signal of x(t): $a_x(t) = x(t) + \hat{x}(t)$
$A_x(\omega)$ $A_x(\omega) = \mathcal{F}\{a_x(t)\}$
B bandpass signal bandwidth
C complexity
c . the effective complexity
A configuration matrix
D the number of samples per symbol
Ac carrier frequency offset
8. the configuration index of BC / at sampling frequency jf.
4(A) normalized objective function on x
4 4 Fourier transform
/o centre (carrier) frequency of a bandpass signal
$f_{clock}$ clock frequency
/, centre frequency of a FDM signal
t the normalized switching frequency of BC i at sampling frequency^.
Fk K-point DFT transform matrix
Fm» the maximum switching frequency for given VLSI technology
/ . output sampling rate
f. (input) sampling frequency
y(m in ) the minimum sampling frequency
fj sw switching frequency
f sampling frequency vector
G gate count
G gate count matrix
gcd(integers) the greatest common divisor of the integers
G ,
the gate count of BC i at sampling frequency^
g » time-varying impulse response of a linear system
•4 4 Hilbert transform
K M ) bandpass filter
V ” ) BSF’s highpass filter
K(n) BSF’s lowpass filter
K M ) lowpass filter
K M polyphase sub-filter of h(n): h(nLM+pM+qL)
K M polyphase sub-filter of h(n): h(nLMK’+rLM+pM+qL)
i the maximum wedge order for bandpass sampling;
J
the number of basic component types in a multirate system 
the number of sampling frequencies in a multirate system
K the number of FDM channels
K ’ the number of FDM channels including guard channels
K kQWQ is the centre frequency of the first FDM channel
L interpolation factor
A internal word length of BC i
M decimation factor
m the number of tree stages
M(.) mapping function from the A matrix to an integer (p)
max(integers) takes the maximum out of the integers
N BC count matrix
N FIR filter length
n half-band FIR filter order
Nb signal word length
the number of BC i at sampling frequency f.
the wedge order
OBJx(A) objective function on x
P central frequency-to-channel spacing ratio (f/W) of a real FDM;
P
an integer representation for the A matrix 
power consumption
Pc the relative precision of carrier frequency
Pk a A>bit integer
Ps
power consumption factor (mW/MHz/active gate) 
the relative precision of sampling frequency
p analog FDM channel stacking index,
R symbol rate
r the number of -Is in the A matrix
taking the real part of the signal
A i//„
A the propagation delay of a adder
A r = i / / c
Tm the propagation delay of a multiplier
vectorization function for A matrix
w channel spacing
W0 frequency resolution
the discrete sampling function
w*
s|*
II
$x^{*}(t)$ complex conjugate of x(t)
$\hat{x}(t)$ $\hat{x}(t) = \mathcal{H}\{x(t)\}$
j— k
the k-th modulated component of x(n): x(n)e M
the 1-th polyphase component of x(n): x{nM+X)
Chapter 1 Introduction
Chapter 1 
Introduction
Communications satellite networks have enjoyed rapid growth in recent years due to their
wideband capabilities, transmission methods that are insensitive to distance and
terrain, and a substantial reduction in circuit costs. To meet the requirements for high
capacity and low transmit power per channel, and to make satellite channels cost
competitive with optical cables, future generations of communication satellites will need
multiple spot beams with high effective isotropic radiated power (EIRP) and gain-to-noise
temperature ratio (G/T), together with on-board processing (OBP) [Har90].
With multibeam, on-board processing, and baseband switching capability, future systems will
no longer be simple “bent-pipe” repeaters; instead, they will be regenerative, with on-board
demodulation/decoding and re-modulation/re-coding that separate the up-link and down-link.
Hence the up-link and the down-link can be optimized separately, resulting in a reduction of
the required $E_b/N_0$ for both links (typically 3 dB) [Mar93]. More importantly, they will
provide full connectivity among all beams and allow small earth terminals due to high
antenna gain [Cam90b].
1.1 On-board processing satellite
On-board regeneration allows small, inexpensive earth stations with reduced antenna size and
high powered amplifier (HPA) power (in fact, the evolution of the OBP concept and its
applications is in line with the trend towards smaller and less expensive earth stations in
dispersed networks). In a multiple spot-beam satellite, on-board switching is necessary in
order to maintain the needed connectivity between beams. This switching function can be
realized either by an RF unit, e.g., a microwave switch matrix (MSM) which re-routes signals
at microwave frequencies (such as 4 GHz), or by a baseband unit which requires on-board
demultiplexing and demodulation. Baseband switching has the additional advantage of
Qi, Multicarrier DEMUX and VLSI Implementation 1
decoupling the up-link and the down-link, thus enabling rate and format conversions, as well 
as improving the link performance [Say92].
Figure 1.1 OBP system concept: multibeam with on-board switching; on-board regeneration
enabling FDMA/TDM conversion (diagram labels: Inter-Satellite Link, TDM, FDMA/TDM, VSATs,
Mobiles)
Figure 1.1 illustrates the on-board processing concept applied to a multibeam mobile satellite
communication system. It is a typical OBP system with single channel per carrier/frequency
division multiple access (SCPC/FDMA) on the up-link and TDM on the down-link. This
scheme allows mobile terminals to transmit a narrowband, low-power signal, which eases
their limited power budget. The highest satellite efficiency and interconnectivity are
obtained by remodulating the regenerated baseband signals onto a single time division
multiplexing (TDM) carrier [Nus87]. The scheme can thus provide higher EIRP because a
single continuous TDM carrier on the down-link is less affected by the non-linearity of the
satellite HPAs [Ela86, Yim88]. It requires on-board regeneration and baseband switching to
provide high-efficiency interconnection of up-link and down-link beams.
The separation of up-link and down-link is the most prominent feature of OBP regeneration
and has the following advantages [Nus87, Eva87]:
* The up-link noise and interference powers are not added to the down-link; however,
up-link errors and down-link demodulation errors are cumulative,
* The up-link and down-link can be designed separately giving significant flexibility and 
scope for economy, and
* Modulation formats and transmission rates can be changed to provide even more 
significant system benefits.
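The first advantage can be illustrated numerically. The following is a minimal sketch, not taken from the thesis, that assumes coherent Gray-coded QPSK on ideal AWGN links with a linear transponder; the function names are hypothetical. In a bent-pipe repeater the noise-to-signal ratios of the two links add, whereas in a regenerative repeater only the bit errors accumulate.

```python
import math

def q_func(x: float) -> float:
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def qpsk_ber(ebn0_db: float) -> float:
    """Bit error rate of coherent Gray-coded QPSK on an AWGN channel."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return q_func(math.sqrt(2.0 * ebn0))

def bent_pipe_ber(up_db: float, down_db: float) -> float:
    """Transparent repeater: up-link noise is retransmitted, so the
    noise-to-signal ratios of the two links add (assumed linear transponder)."""
    up = 10.0 ** (up_db / 10.0)
    down = 10.0 ** (down_db / 10.0)
    total = 1.0 / (1.0 / up + 1.0 / down)
    return qpsk_ber(10.0 * math.log10(total))

def regenerative_ber(up_db: float, down_db: float) -> float:
    """Regenerative repeater: a bit is wrong at the receiver if exactly one
    of the two links inverted it, so the link error rates accumulate."""
    pu, pd = qpsk_ber(up_db), qpsk_ber(down_db)
    return pu * (1.0 - pd) + pd * (1.0 - pu)

# Compare the two repeater types for identical up-link and down-link Eb/N0.
for db in (6.0, 8.0, 10.0):
    print(db, bent_pipe_ber(db, db), regenerative_ber(db, db))
```

With identical links the bent-pipe case behaves like a single link with 3 dB less $E_b/N_0$, while the regenerative error rate is roughly twice the single-link error rate, which is in line with the saving of typically 3 dB quoted in the introduction.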
1.2 Multicarrier demultiplexer and demodulator
In a regenerative OBP satellite such as that shown in Figure 1.1, a baseband processor known
as a multicarrier demultiplexer and demodulator (MCDD) is needed to perform the on-board
regeneration (an exception is direct regeneration OBP systems, in which the regeneration is
undertaken at RF rather than at baseband [Miz93]). An MCDD separates frequency
multiplexed up-link carriers (e.g., SCPC, MF-TDMA, etc.) into individual channels and then
demodulates each channel so that on-board baseband switching can be performed for
multibeam systems [Bjo93].
Figure 1.2 MCDD function (diagram labels: A/D, sampling frequency $f_s = KW$, channel
spacing W, DEMOD blocks)
Figure 1.2 illustrates the MCDD function. The input signal to the MCDD is a frequency
multiplex of an SCPC/FDMA up-link signal. Other formats, such as a multi-frequency (MF)
TDMA or an MF-CDMA up-link signal, are also possible and will generally not affect the
multicarrier demultiplexing function (although the sampling rate conversion may change
accordingly). The demodulators, however, will differ for different up-link signal
formats. The input frequency multiplex, consisting of a group of K FDM channels, is sampled
at a rate that is sufficient for the group bandwidth and, in the case of bandpass sampling,
carefully chosen. The multicarrier demultiplexer (DEMUX), which accounts for most of the
computation in an MCDD, separates the group into K individual channels. Since each output of
the DEMUX has a bandwidth 1/K that of the input and the demodulators require a sampling rate
that is an integer multiple of the channel symbol rate R (e.g., 2R, 4R, etc.), the sampling rate
conversion can be included in the multicarrier DEMUX. An alternative is to use a separate
sampling rate conversion stage before the demodulation. The demodulators, operating at an
integer multiple of the symbol rate, recover the baseband bits of each channel.
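The channel separation described above can be sketched with a toy example. The code below is an illustration under strong assumptions and is not the DEMUX developed in this thesis: the input is a complex (lowpass model) FDM signal sampled at $f_s = KW$, each channel carries only an unmodulated carrier at k/K cycles per sample, and the channel filter is a length-K moving average, so that the whole bank reduces to a K-point DFT per block of K samples. The function names are hypothetical.

```python
import cmath

def make_fdm(amplitudes, num_samples):
    """Complex FDM test signal: channel k is a carrier of complex amplitude
    amplitudes[k] at k/K cycles per sample (sampling rate fs = K*W)."""
    K = len(amplitudes)
    return [
        sum(a * cmath.exp(2j * cmath.pi * k * n / K) for k, a in enumerate(amplitudes))
        for n in range(num_samples)
    ]

def dft_demux(x, K):
    """Split x into K channels, each decimated by K.

    Output sample m of channel k is DFT bin k of block m, i.e. a
    down-conversion by k/K cycles per sample followed by a length-K
    moving-average filter and K-fold decimation."""
    channels = [[] for _ in range(K)]
    for start in range(0, len(x) - K + 1, K):
        block = x[start:start + K]
        for k in range(K):
            acc = sum(block[n] * cmath.exp(-2j * cmath.pi * k * n / K) for n in range(K))
            channels[k].append(acc / K)
    return channels

amps = [1.0, 0.5j, -2.0, 0.25]          # one complex amplitude per channel
out = dft_demux(make_fdm(amps, 16), K=4)
print([ch[0] for ch in out])            # each channel returns its own amplitude
```

Each output runs at $f_s/K = W$ samples per second. A practical DEMUX replaces the moving average with a designed prototype filter and chooses the output rate as an integer multiple of R, as described in the text.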
This thesis mainly concerns the multicarrier demultiplexer of the MCDD, with emphasis on
DEMUX architectures of low complexity and low power consumption. The main function
of the multicarrier DEMUX is channel separation (channelization), which is similar to that of
the transmultiplexers that translate directly between FDM and TDM formats in terrestrial
systems; most OBP demultiplexing approaches are therefore a natural evolution of
terrestrial transmultiplexing methods. Recent developments in multirate filter bank theory have
brought new insights into frequency demultiplexing/transmultiplexing [Fli94]. Our
previous studies show that various frequency demultiplexing/transmultiplexing approaches
can be unified under the framework of a generalized polyphase DFT filter bank [Yim92a,
Yim92b]. The work presented in this thesis can be considered an extension of that
work towards practical DEMUX filter bank design, complexity analysis, and implementation
in very large scale integration (VLSI).
1.3 Mapping MCDD function into VLSI
There are three basic requirements for an OBP (or any other) payload in the space environment:
low mass, low power consumption, and high reliability. Implementing the MCDD digitally is a
significant step in this direction. Since the MCDD, and in particular the multicarrier
DEMUX, is the major driver of the mass and power consumption of the OBP payload, an efficient
implementation of the MCDD is essential. Although general purpose digital signal processing
(DSP) processors (e.g., the TMS series) could be used to realize an MCDD, this approach is
generally useful only for test-bed and experiment/demonstration purposes, and is not feasible
for future operational OBP satellite systems in which thousands of mobile channels will be
involved. For practical use, the MCDD has to be implemented in VLSI using several
application specific integrated circuits (ASICs) in order to minimize mass and power.
VLSI implementations require efficient mapping from computational (DSP) models to
hardware architectures. This mapping can be carried out in three steps:
* at the algorithmic level, design a computationally efficient algorithm that performs the
desired function, specified by a mathematical model, with fewer computations,
* map the algorithm into a general hardware implementing structure which consists only of
fundamental building blocks that perform the basic operations. The implementing
structure can be either the direct form, obtained by a one-to-one mapping between
the basic operations and the hardware building blocks, or a form that involves
time-sharing (time-multiplexing) of some of the functional blocks, and
* map the implementing structure onto an efficient VLSI architecture that requires low
complexity and power and allows high throughput.
This mapping procedure is illustrated in Figure 1.3.
Figure 1.3 From function model to VLSI (diagram blocks: Function Model, Optimal Structure,
VLSI Architecture)
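The difference between a direct-form implementing structure and a time-multiplexed one can be sketched in software. The following is a minimal illustration, not a structure from this thesis, using a small FIR filter as the function model; the function names and the `mac_cycles` counter are hypothetical. The direct form uses one multiplier per tap at the sample rate, whereas the time-multiplexed form reuses a single multiply & accumulate (MAC) unit N times per sample.

```python
def fir_direct(x, h):
    """Direct form: one multiplier per tap, all active in every sample period."""
    N = len(h)
    state = [0] * N                     # delay line, newest sample first
    y = []
    for sample in x:
        state = [sample] + state[:-1]
        y.append(sum(c * s for c, s in zip(h, state)))   # N multipliers in parallel
    return y

def fir_time_multiplexed(x, h):
    """Time-multiplexed form: a single MAC unit reused N times per sample,
    which in hardware means a clock N times faster than the sample rate."""
    N = len(h)
    state = [0] * N
    y = []
    mac_cycles = 0
    for sample in x:
        state = [sample] + state[:-1]
        acc = 0
        for tap in range(N):            # one MAC operation per fast-clock cycle
            acc += h[tap] * state[tap]
            mac_cycles += 1
        y.append(acc)
    return y, mac_cycles

print(fir_direct([1, 2, 3, 4], [1, 0, -1]))
print(fir_time_multiplexed([1, 2, 3, 4], [1, 0, -1]))
```

Both forms compute the same output; they trade hardware count (N multipliers) against clock rate (N cycles per sample), which is the kind of choice the second mapping step has to make.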
1.4 Outlines of chapters
In chapter 2, the fundamentals of signal analysis, complex filtering, and multirate signal
processing are described. The frequency translation (rotation) invariant property of complex
signals is identified, which can be used for efficient complex filtering. The first order
uniform bandpass sampling theorem is modified to take into account the instability of the
sampling frequency source and variations of the signal centre (carrier) frequency. As a result,
a robust bandpass sampling scheme, which requires the carrier frequency to be on the 1/4 or
-1/4 sampling frequency grid, is proposed. Two useful complex filter structures are introduced:
the rotation invariant complex (RIC) FIR and the down-real filtering-up (DRFU) complex
filter. The former reduces the number of multiplications by a factor of two by exploiting the
frequency translation invariant property of conjugate symmetric complex filters. The latter
realizes a complex filter with two identical real lowpass filters (FIR or IIR) and is very useful in
simplifying complex (modulated) filter banks. FDM signal sampling schemes are
summarized. The stacking of FDM signals plays an important role in uniform filter bank
design. To manipulate the signal stacking, the concept of homogeneous channel stacking is
introduced and its properties are given. It is found that, in general, sampled FDM signals with
even channel stacking are advantageous for low complexity realization of uniform filter banks.
The conversion between odd- and even-stacking can be made trivial using π or π/2 frequency
shifts. Multirate systems and multirate digital signal processing principles and techniques are
briefly summarized. Finally, a generic bandpass filter bank model for frequency
demultiplexing problems is presented.
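Why π and π/2 frequency shifts are called trivial can be shown with a short sketch. It assumes the shift is applied as a multiplication of x(n) by exp(jπn) or exp(jπn/2); the function names are hypothetical. Both sequences take only the values 1, j, -1 and -j, so the shift needs sign changes and real/imaginary swaps but no multiplications.

```python
import cmath

def shift_quarter_rate(x):
    """Multiply x[n] by exp(j*pi*n/2) = j**n using only swaps and sign changes."""
    y = []
    for n, v in enumerate(x):
        v = complex(v)
        r = n % 4
        if r == 0:
            y.append(v)                          # times  1
        elif r == 1:
            y.append(complex(-v.imag, v.real))   # times  j
        elif r == 2:
            y.append(-v)                         # times -1
        else:
            y.append(complex(v.imag, -v.real))   # times -j
    return y

def shift_half_rate(x):
    """Multiply x[n] by exp(j*pi*n) = (-1)**n: sign alternation only."""
    return [v if n % 2 == 0 else -v for n, v in enumerate(x)]

x = [1 + 2j, 3 - 1j, -2 + 0.5j, 4j, 1, -1j, 2 + 2j, -3]
print(shift_quarter_rate(x))
print(shift_half_rate(x))
```

A shift by π/2 corresponds to a quarter of the sampling frequency, which is consistent with placing the carrier on the 1/4 or -1/4 sampling frequency grid mentioned above.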
Chapter 3 addresses and summarizes frequency demultiplexing approaches and structures 
from the viewpoint of multirate filter banks. Similarities and differences between the 
terrestrial transmultiplexers and the multicarrier demultiplexers are discussed. The effect of 
the channel stacking on DEMUX structure is discussed. In general, an even-stacking FDM
signal is preferred for low complexity DEMUX filter banks. Conditions for trivial
odd-to-even stacking conversion are given. Two complex-modulated functional models, the lowpass
and the bandpass generic DEMUX filter banks, are introduced. The latter seems more 
convenient to use in deriving optimal DEMUX filter bank structures than the former. 
Demultiplexing approaches for OBP applications are classified and reviewed. By modifying 
the conventional polyphase matrix DFT (PMDFT) filter bank, a pipeline PMDFT structure is 
proposed. It has merits of reduced DFT size, being suitable for VLSI implementation due to 
good modularity, and fast processing speed due to its semi-systolic and pipeline property. 
Flexible DEMUX architectures are reviewed and studied.
Chapter 4 describes a new approach to multirate filter bank optimization using the multirate
signal flow graph (MSFG) representation and transforms. Commonly used approaches in
multirate filter bank design are based on mathematical derivation and manipulation. The
mathematical approach, however, may obscure important structural information which can be
vital for obtaining efficient multirate network structures. This chapter introduces a systematic
approach to the optimization of multirate filter banks based on the MSFG representation and
transforms. It has the advantages of presenting clear structural information and of being free of
tedious mathematical manipulations. The MSFG is an extension of the conventional signal
flow graph, which is only valid for linear time-invariant (LTI) systems. By including
rate-changing and some sampling node functions, the MSFG can be used to describe linear
periodically time-varying (LPTV) multirate networks. A number of MSFG identities and
transforms are identified and proved; they are useful and convenient in multirate network
transformation (simplification). As an example, an optimal complex BSF structure for the
binary tree structure is derived using MSFG transformation. This BSF structure has the
minimum computation rate as a result of polyphase decomposition of the prototype filter and
exploitation of the filter’s symmetry property. With its sampling node functions, the MSFG can
even describe some digital network functions, such as sampling, switching, sampling and hold,
integral and dump, interleave, etc. Hence it has the potential to map DSP algorithms directly
onto digital networks.
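The computational saving obtained from polyphase decomposition can be sketched for the simplest case, an M-fold decimating FIR filter. This is a generic textbook illustration with hypothetical function names, not the MSFG derivation or the BSF structure of Chapter 4. The reference path filters at the input rate and discards M-1 of every M outputs; the polyphase path splits h(n) into M sub-filters h(rM+p) and performs every multiplication at the output rate.

```python
def filter_then_decimate(x, h, M):
    """Reference: full-rate FIR convolution, then keep every M-th output."""
    y = []
    for n in range(0, len(x), M):
        y.append(sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k))
    return y

def polyphase_decimate(x, h, M):
    """Polyphase form: branch p filters the samples x[mM - p] with the
    sub-filter e_p[r] = h[rM + p], so all products run at the output rate."""
    num_out = (len(x) + M - 1) // M
    y = [0] * num_out
    for p in range(M):
        sub_filter = h[p::M]                     # e_p[r] = h[rM + p]
        for m in range(num_out):
            for r, coeff in enumerate(sub_filter):
                idx = (m - r) * M - p            # input index mM - (rM + p)
                if 0 <= idx < len(x):
                    y[m] += coeff * x[idx]
    return y

x = list(range(1, 11))
h = [1, 2, 3, 4, 5]
print(filter_then_decimate(x, h, 3))
print(polyphase_decimate(x, h, 3))
```

The two functions return identical sequences; the polyphase form simply never computes the outputs that the decimator would throw away.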
Chapter 5 describes a systematic approach to mapping computationally efficient multirate DSP
algorithms into optimized VLSI architectures subject to constraints on complexity, power,
and system throughput. A modeling method for multirate systems in VLSI is proposed. With
this simple VLSI model, a system’s complexity, power consumption, and throughput
are represented in the form of objective functions, and the estimates are technology
independent. The optimized VLSI architectures are obtained via optimization techniques. It
is shown that a filter bank’s VLSI architecture can be optimized by configuring basic
components into either bit-serial or bit-parallel architectures for given constraints on
complexity, power, and throughput. As an example, the proposed approach is used to
optimize the VLSI architectures, and to estimate the complexity and power consumption,
of a PMDFT filter bank and its time-multiplexed version, TM-PMDFT.
Chapter 6 introduces some useful bit-serial techniques and systolic DFT and FIR
architectures which have been or could be used in MCDD and flexible (reconfigurable)
DEMUX architectures. The aim of using bit-serial techniques is to reduce the complexity.
Systolic architectures are used to improve system throughput (speed) and to optimize the
utilization of the silicon area. Three bit-serial arithmetic units are presented: a bit-serial
adder/subtractor, a simple bit-serial 2’s complement unit, and a novel full-precision bit-serial
multiplier structure. The distributed arithmetic (DA) approach is proposed for inner-product
computation in an FIR filter. For a long FIR filter, simple segmentation of the filter is proposed
so that the processing can be distributed over several smaller DAs in parallel, each of which
uses an efficient DA structure with a small memory. To improve processing speed and to
suit VLSI implementation, systolic architectures are proposed for stream DFTs and FIRs, which
account for most of the computation in a multicarrier DEMUX. To derive the two stream
DFTs (Kung’s and Chang’s models), the relationship between the stream DFT and the
parallel DFT is established via the MSFG representation. Systolic architectures of the two
stream DFTs, known as systolic stream DFTs (SS-DFTs), are derived from this relationship
via MSFG transforms. Two new systolic FIR architectures are presented in this chapter: a
low complexity semi-systolic FIR and a novel, pure systolic FIR architecture which makes
use of the symmetry property of linear phase FIR filters. The proposed pure systolic
architecture significantly improves the speed performance without increasing the latency.
Finally, a pure systolic halfband FIR architecture is also proposed. This halfband FIR
architecture is considered optimal in terms of low complexity, low latency and high throughput.
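The principle of distributed arithmetic can be sketched as follows. This is a generic illustration with hypothetical function names, assuming integer coefficients and 2’s complement integer samples; it is not the DA processor designed in this thesis. The multipliers are replaced by a look-up table of coefficient sums, addressed one bit-plane of the input samples at a time, followed by shift-and-add accumulation.

```python
def build_da_table(coeffs):
    """ROM contents: entry `addr` holds the sum of the coefficients whose
    address bit is 1 (2**N words for N coefficients)."""
    N = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << N)
    ]

def da_inner_product(samples, table, word_length):
    """Inner product of integer `samples` (2's complement, word_length bits)
    with the coefficients stored in `table`, using look-ups, shifts and adds."""
    acc = 0
    for bit in range(word_length):                 # LSB first, one bit-plane per cycle
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> bit) & 1) << k          # bit `bit` of every sample forms the address
        partial = table[addr] << bit
        if bit == word_length - 1:
            acc -= partial                         # sign bit carries weight -2**(B-1)
        else:
            acc += partial
    return acc

coeffs = [3, -1, 4, 2]
samples = [5, -7, 2, -1]                           # each fits in 4-bit 2's complement
table = build_da_table(coeffs)
print(da_inner_product(samples, table, word_length=4))
```

The table size grows as 2 to the power of the number of taps, which is why a long filter is better split into several short segments, each with its own small table, as proposed in the text.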
Chapter 7 presents a low-complexity VLSI architecture and the associated ASIC design for
an 8-channel tree type DEMUX. Firstly, the principle of the real and complex binary tree
DEMUX is described. Then the concepts of the time-multiplexed complex tree (TM-tree) and
the time-multiplexed band-splitting filter (TM-BSF) are introduced, leading to a novel low
complexity binary tree architecture, called the TM2-tree. In a TM2-tree, the band-splitting
filter (BSF) is realized by cascading a complex input buffer and a complex inner product (IP) 
processor which produces both the lowpass and the highpass samples on a time-sharing basis. 
To minimize the hardware complexity, DA inner-product architecture is employed to 
implement the complex IP and an efficient low-complexity input buffer structure is proposed. 
As a design example, an ASIC (gate array) design for an 8-channel DEMUX based on TM2- 
tree architecture is presented. Finally, comparisons in complexity, power consumption and 
processing speed with other implementation schemes of binary tree are given.
In chapter 8, conclusions about this research work are made. Major achievements are 
summarized. Finally, possible directions for future research work are pointed out.
Chapter 2  Bandpass Signals, Filtering, and Multirate Signal Processing
Chapter 2 
Bandpass Signals, Filtering, and 
Multirate Signal Processing
Signals encountered in the real world are always band-limited. Digital processing of bandpass
signals requires sampling rate alteration in order to reduce the computation. A system that
operates at more than one sampling rate is called a multirate system; such systems have
attracted great attention during the last two decades. Multirate DSP techniques are
particularly useful in digital telecommunication systems for handling multiple data
transmission rates. A typical example of a multirate DSP system is the digital multicarrier
demultiplexer in an MCDD, which channelizes frequency division multiplexed signals in FDMA,
MF-TDMA, or MF-CDMA systems. This chapter briefly introduces the fundamentals of multirate
digital processing, the properties of bandpass and FDM signals, and the concepts of bandpass
sampling and efficient complex filtering.
2.1 Band-limited signals
2.1.1 Lowpass signals
Signals in transmission systems are always band-limited with finite bandwidth. A signal is
said to be lowpass if its significant spectral content is centred around zero frequency, as
illustrated by Figures 2.1(a) and 2.1(b). For a real lowpass signal, the magnitude spectrum must
be even symmetric and the phase must be odd symmetric about f = 0, as shown in Figure 2.1(a).
For complex signals (also referred to as complex envelopes), shown in Figures 2.1(b) and 2.1(c),
these symmetry properties of the spectrum do not hold. Bandpass signals in the real world can be
interpreted as being translated from their equivalent lowpass signals, which are often referred
to as baseband signals. Efficient signal processing tends to move the bulk of the processing load
to baseband.
Figure 2.1 Lowpass signals (panels: (a) |X(f)|, (b) |X(f)|, (c) |A(f)|)
2.1.2 Analytic signals and Hilbert transform
A signal is said to be analytic if, and only if, its Fourier transform is zero for negative
frequencies, as for those illustrated by Figures 2.2(c) and 2.2(d). Given a real signal x(t) with
Fourier transform $\mathcal{F}\{x(t)\} = X(\omega)$, its analytic signal $a_x(t)$ is defined, in
the frequency domain, by

$$
A_x(\omega) = 2X(\omega)u(\omega) = X(\omega)[1 + \mathrm{sgn}(\omega)]
= \begin{cases} 2X^{+}(\omega), & \text{for } \omega > 0 \\ 0, & \text{for } \omega < 0 \end{cases} \tag{2.1}
$$

where $u(\omega)$ is the unit step function and $X^{+}(\omega)$ is the right (positive) sideband
of $X(\omega)$. The non-symmetric property of $A_x(\omega)$ determines that an analytic signal
must be complex.
Signals depicted in Figures 2.1(c) and 2.2(d) are analytic with a single sideband while the 
complex signal of Figure 2.2(c) can be considered as double-sidebanded analytic with respect 
to the real bandpass signal of Figure 2.2(a).
Eq. (2.1) shows that the analytic signal of a real signal is obtained by adding the signal

$$\hat{X}(\omega) = X(\omega)\,\mathrm{sgn}(\omega) \tag{2.2}$$

to the original real signal. Eq. (2.2) is known as the Hilbert transform of x(t) and is denoted
by

$$\mathcal{H}\{x(t)\} = \hat{x}(t) \tag{2.3}$$

Hence the analytic signal and the real signal are related by the Hilbert transform:

$$a_x(t) = x(t) + \hat{x}(t) \tag{2.4}$$
We sometimes refer to the real and complex lowpass signals respectively as the real baseband 
and the complex baseband signals with respect to their modulated (frequency shifted) 
bandpass associates as shown in Figure 2.2.
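As a rough numerical illustration of Eq. (2.1), the sketch below builds a discrete-time analytic signal by zeroing the negative-frequency DFT bins and doubling the positive ones. It is a minimal pure-Python sketch and not part of the original derivation: the helper names (`dft`, `idft`, `analytic_signal`) and the test tone are assumptions chosen for illustration, and the DFT stands in for the continuous Fourier transform.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (illustrative helper)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse DFT matching dft() above."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def analytic_signal(x):
    """Discrete counterpart of Eq. (2.1): A(k) = X(k) * [1 + sgn(k)]."""
    n = len(x)
    X = dft(x)
    A = []
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            A.append(X[k])        # DC and Nyquist bins: sgn taken as 0
        elif k < n / 2:
            A.append(2 * X[k])    # positive frequencies are doubled
        else:
            A.append(0j)          # negative frequencies are removed
    return idft(A)


# Hypothetical test signal: a real cosine occupying DFT bins +3 and -3.
N = 16
x = [math.cos(2 * math.pi * 3 * t / N) for t in range(N)]
a = analytic_signal(x)  # expected to approximate exp(j*2*pi*3*t/N)
```

The real part of `a` reproduces `x`, while its imaginary part is the sine counterpart, in line with Eq. (2.4).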
2.1.3 Frequency translations and bandpass signals
Frequency translation shifts the signal spectrum from one frequency band to another without
changing its shape. A frequency shift in the continuous time domain is simply a multiplication
by a continuous sinusoid, either real or complex. For an arbitrary complex signal
$x(t) = x_r(t) + jx_i(t)$ with Fourier transform $X(\omega)$, we define the following three types of
frequency translations:
$$x(t)e^{j\omega_c t} \longleftrightarrow X(\omega - \omega_c) \tag{2.5}$$
$$x(t)\cos(\omega_c t) \longleftrightarrow \tfrac{1}{2}X(\omega + \omega_c) + \tfrac{1}{2}X(\omega - \omega_c) \tag{2.6}$$
$$\Re\{x(t)e^{j\omega_c t}\} \longleftrightarrow \tfrac{1}{2}X^{*}(-\omega - \omega_c) + \tfrac{1}{2}X(\omega - \omega_c) \tag{2.7}$$
The transform pair of Eq. (2.5) is called one-sided frequency translation as the signal 
spectrum is moved only in one direction. The complex bandpass signal of Figure 2.2(c) is the
direct result of a one-sided translation of the real lowpass signal of Figure 2.1(a). Generally,
signals after one-sided translations become complex. Eq. (2.6) is a two-sided frequency
translation since the signal spectrum is moved in both positive and negative frequency 
directions. The signal shown in Figure 2.2(a) is the result of a two-sided translation of the 
signal in Fig. 2.1(a). The two-sided frequency shift of Eq. (2.6) belongs to the class of 
double-sideband (DSB) suppressed carrier modulation. Quadrature modulation as described 
by Eq. (2.7) can be considered as another kind of two-sided frequency translation. Unlike the 
double-sideband suppressed carrier modulation where the spectrum at the negative frequency 
side is exactly the same as that at the positive frequency side except the difference in the
centre frequencies, the negative sideband of a quadrature modulated signal is the complex
conjugate of the positive sideband due to the real nature of quadrature modulation in time 
domain. The two two-sided frequency shifts become equivalent when dealing with real 
signals.
A bandpass signal can be regarded as one that is frequency translated with carrier frequency
$\omega_c = 2\pi f_c$ from its lowpass equivalent, as shown in Figure 2.2 for $f_c > B$, where $B$ is the
one-sided signal bandwidth. The bandpass signal is said to be narrow band if $f_c \gg B$ (as often
encountered in RF/IF systems).
Another kind of real bandpass signal is the single-sideband (SSB) modulated signal defined by
$$x_{SSB}(t) = x(t)\cos(\omega_c t) \mp \hat{x}(t)\sin(\omega_c t) \tag{2.8}$$
where $x(t)$ is the real baseband (lowpass) signal (Figure 2.1(a)) and $\hat{x}(t)$ is the Hilbert
transform of $x(t)$. When the sign in the above equation is negative, the transmitted SSB signal is
the upper sideband (Figure 2.2(b)); otherwise it is the lower sideband (Figure 2.2(e)).
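The three translations of Eqs. (2.5) to (2.7) can be compared numerically on a single complex tone. The following is a minimal sketch under stated assumptions: the DFT replaces the Fourier transform, and the tone amplitude, the bin numbers and the helper names (`dft`, `peaks`) are hypothetical choices for illustration only.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (illustrative helper)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


N = 32    # number of samples (assumed)
KC = 8    # carrier placed at DFT bin 8 (assumed)

# Complex baseband tone at bin 2 with complex amplitude 1 + 0.5j.
x = [(1 + 0.5j) * cmath.exp(2j * math.pi * 2 * n / N) for n in range(N)]

# Eq. (2.5): one-sided translation, multiplication by a complex sinusoid.
one_sided = [x[n] * cmath.exp(2j * math.pi * KC * n / N) for n in range(N)]
# Eq. (2.6): two-sided translation, multiplication by a real cosine.
two_sided = [x[n] * math.cos(2 * math.pi * KC * n / N) for n in range(N)]
# Eq. (2.7): quadrature modulation, real part of the one-sided translation.
quadrature = [(x[n] * cmath.exp(2j * math.pi * KC * n / N)).real for n in range(N)]


def peaks(sig):
    """Map each non-negligible DFT bin to its normalized complex amplitude."""
    return {k: v / N for k, v in enumerate(dft(sig)) if abs(v) > 1e-6 * N}


p_one = peaks(one_sided)    # single line at bin 10
p_two = peaks(two_sided)    # lines at bins 10 and 26 (bin -6), same amplitude
p_quad = peaks(quadrature)  # lines at bins 10 and 22 (bin -10), conjugate amplitudes
```

In this example the cosine mixer copies the same amplitude to both sidebands, whereas the quadrature modulator places the complex conjugate amplitude on the mirrored negative-frequency bin, as the text describes.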
Figure 2.2 Bandpass signals (magnitude spectra |X(f)|, panels (a) to (e))
2.1.4 Frequency translation (rotation) invariant property
In discrete time and frequency domains, assuming uniform sampling with sampling period $T$,
the complex envelope $x(t) = x_r(t) + jx_i(t)$ becomes $x(nT) = x_r(nT) + jx_i(nT)$, or simply,
$x(n) = x_r(n) + jx_i(n)$, while the complex sinusoid $e^{j\omega_c t} = e^{j\frac{2\pi}{T_c}t}$ is sampled uniformly on the
unit circle in the z-plane with $N = T_c/T$ ($N$ integer) samples per period, which gives the discrete
sinusoid $e^{j\frac{2\pi}{N}n}$. We use the notation $W_N^{-n} = e^{j\frac{2\pi}{N}n}$ for the discrete complex sinusoid, as is
frequently done in the DSP literature. Since the complex sinusoid is periodic, frequency
translation of discrete signals is also referred to as signal rotation in the discrete time domain.
Frequency translation (rotation) invariant property
Given a complex envelope
$$x(t) = x_r(t) + jx_i(t) \tag{2.9}$$
which is complex conjugate symmetric,
$$x^{*}(t) = x(-t) \tag{2.10}$$
then its one-sided frequency translation
$$y(t) = x(t)e^{j\omega_c t} \tag{2.11}$$
preserves the conjugate symmetry property, that is,
$$y^{*}(t) = y(-t) \tag{2.12}$$
The proof of Eq. (2.12) is straightforward: taking the conjugate of Eq. (2.11) and applying
Eq. (2.10) results in Eq. (2.12). This property can be extended to the discrete complex envelope
$$x(n) = x_r(n) + jx_i(n). \tag{2.13}$$
If $x(n)$ is conjugate symmetric,
$$x^{*}(n) = x(-n) \tag{2.14}$$
then the rotated signal
$$y(n) = x(n)W_P^{-Qn} \tag{2.15}$$
is also conjugate symmetric, having
$$y^{*}(n) = y(-n) \tag{2.16}$$
where the positive integers $P$ and $Q$ are assumed co-prime and $P > Q > 1$.
Eq. (2.14) describes non-causal signals, which do not exist in the real world. Causal signals
must be of finite duration to have any kind of symmetry (otherwise they would have to extend to
$n<0$, which contradicts causality). Assuming that the discrete complex envelope of
Eq. (2.13) is causal and has duration $N$, complex conjugate symmetry means
$$x^{*}(n) = x(N-1-n). \tag{2.17}$$
The complex conjugate of the rotated signal $y(n) = x(n)W_P^{-Qn}$ is
$$y^{*}(n) = x(N-1-n)W_P^{Qn} = y(N-1-n)W_P^{Q(N-1)} \tag{2.18}$$
which is not strictly conjugate symmetric due to the added constant phase shift. However, if
we introduce an initial phase shift to the rotated signal, the resulting signal of Eq. (2.19) will
be strictly conjugate symmetric, as shown by Eq. (2.20).
$$y'(n) = W_P^{Q(N-1)/2} \cdot \left(x(n)W_P^{-Qn}\right) \tag{2.19}$$
$$y'^{*}(n) = y'(N-1-n) \tag{2.20}$$
The proof is straightforward and is omitted for the sake of length.
We can thus formally state the frequency translation (rotation) invariant property as follows:
The complex conjugate symmetry property is preserved for any bandpass signal
(system) if its lowpass equivalent is conjugate symmetric; if it is causal, an initial
phase shift is necessary for the rotated (bandpass) signal to preserve the symmetry
property.
Although we start with the assumption that the equivalent lowpass signals are conjugate
symmetric, it is not necessary to confine signals to be of lowpass type before frequency
translation. It is more general and more convenient to state the property as:
A frequency translated signal (system) is complex conjugate symmetric if and only
if it is complex conjugate symmetric before the translation.
This property will be used to derive an efficient complex filter structure that saves the 
computation by approximately a factor of two.
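The role of the initial phase shift can be checked numerically. The sketch below is illustrative only: it assumes the convention $W_P = e^{-j2\pi/P}$, and the sequence values, $P = 16$ and $Q = 3$ are hypothetical. It rotates a causal conjugate-symmetric sequence with and without the phase term $W_P^{Q(N-1)/2}$ and tests the symmetry condition $y^{*}(n) = y(N-1-n)$.

```python
import cmath
import math


def W(P, e):
    """W_P^e with the assumed convention W_P = exp(-j*2*pi/P)."""
    return cmath.exp(-2j * math.pi * e / P)


def is_conj_symmetric(y, tol=1e-12):
    """True if y*(n) == y(N-1-n) for every n of a causal length-N sequence."""
    N = len(y)
    return all(abs(y[n].conjugate() - y[N - 1 - n]) < tol for n in range(N))


# Hypothetical causal, conjugate-symmetric complex envelope of odd length N = 7.
half = [1 + 2j, -0.5 + 0.3j, 0.7 - 1.1j]
x = half + [2.0 + 0j] + [c.conjugate() for c in reversed(half)]
N = len(x)
P, Q = 16, 3

# Plain rotation x(n) * W_P^(-Qn): symmetric only up to a constant phase.
y = [x[n] * W(P, -Q * n) for n in range(N)]
# Rotation with the initial phase shift W_P^(Q(N-1)/2) applied as well.
y_shifted = [W(P, Q * (N - 1) / 2) * y[n] for n in range(N)]
```

With these values `x` and `y_shifted` pass the symmetry test while `y` does not, because $W_P^{Q(N-1)} \neq 1$ for the chosen $P$, $Q$ and $N$.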
2.2 Bandpass sampling theorem
2.2.1 First-order bandpass sampling theorem
The signal under consideration is assumed to be a real SSB modulated bandpass signal centred
at $f_c$ with one-sided bandwidth $B$, as shown in Figure 2.3. According to the Shannon sampling
theorem [Bel84], a signal whose highest spectral component is at $f_m$ is completely determined
by the set of its values at regularly spaced intervals of period $T = (2f_m)^{-1}$. A bandpass signal is,
however, always oversampled at the Nyquist sampling rate of $f_s = 2f_m$. The classical
bandpass theorem for uniform sampling shows that the signal can be reconstructed if the
sampling frequency is at least [Fel49, Gre77]
$$f_s^{(min)} = \frac{2f_m}{I} \tag{2.21}$$
where $I$ is the largest integer within $f_m/B = f_c/B + 0.5$, that is,
$$I = \left\lfloor \frac{f_c}{B} + 0.5 \right\rfloor. \tag{2.22}$$
The minimum sampling rates are shown graphically in Figure 2.4 by the upper edges of
the dark shaded area.
Figure 2.3 Bandpass signal protected with guard bands
Figure 2.4 Minimum and acceptable sampling frequencies for bandpass signals
The interpretation of the theoretical minimum for analog bandpass signals can be misleading,
as one might think, from the figure, that no aliasing will occur as long as $f_s > 4B$, which is not
true. In reference [Gas78], Gaskell gave conditions for acceptable uniform sampling rates at
which no spectral overlapping would occur:
$$\frac{2f_c + B}{n_w} \le f_s \le \frac{2f_c - B}{n_w - 1} \tag{2.23}$$
where $1 \le n_w \le I$. The $n_w = 1$ case is obviously the Nyquist sampling rate whilst $n_w = I$
corresponds to the minimum sampling rate. Thus there are $I-1$ disallowed regions between
the minimum sampling rate and the Nyquist rate, as shown by the light shaded areas in the figure.
It shows that the signal can be critically sampled without aliasing at the Nyquist rate $f_s = 2B$ of
its lowpass counterpart if it is centred at $f_c = (i - 0.5)B$, where $i$ is an integer not less than 1.
(This case is also referred to as integer band positioning because the lower and upper edges of the
passband are positioned on the integer grid $kB$, $k = 0, \pm 1, \pm 2, \ldots$ [Vau91].) The first-order
bandpass sampling theorem can be formally stated as:
If a continuous analog signal $x(t)$ is band-limited to $B$ Hz and centred at
$f_c$ ($f_c \ge B/2$), then $x(t)$ can be exactly recovered from the set of its sampled
values at regularly spaced intervals of period $T = 1/f_s$, provided that the
sampling frequency $f_s$ satisfies
$$\frac{2f_c + B}{n_w} \le f_s \le \frac{2f_c - B}{n_w - 1}$$
where $1 \le n_w \le \lfloor f_c/B + 0.5 \rfloor$.
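The allowed sampling-rate wedges of the theorem can be enumerated directly. The following is a minimal sketch, assuming the inequality in the form $(2f_c+B)/n_w \le f_s \le (2f_c-B)/(n_w-1)$ with no upper bound for $n_w = 1$; the function names and the numerical values (frequencies in arbitrary consistent units) are hypothetical.

```python
import math


def acceptable_rates(fc, B):
    """Return (n_w, fs_low, fs_high) for every allowed wedge order n_w."""
    I = math.floor(fc / B + 0.5)           # largest usable wedge order
    ranges = []
    for nw in range(1, I + 1):
        low = (2 * fc + B) / nw
        # n_w = 1 is ordinary Nyquist sampling: no upper bound on f_s.
        high = math.inf if nw == 1 else (2 * fc - B) / (nw - 1)
        ranges.append((nw, low, high))
    return ranges


def is_alias_free(fs, fc, B):
    """True if fs falls inside one of the allowed wedges."""
    return any(low <= fs <= high for _, low, high in acceptable_rates(fc, B))


# Hypothetical example: fc = 20, B = 5, giving I = 4 wedges.
wedges = acceptable_rates(20.0, 5.0)
# Integer band position fc = (i - 0.5) * B with i = 4: the last wedge
# collapses to the single point fs = 2B.
integer_band = acceptable_rates(17.5, 5.0)
```

For `fc = 20` and `B = 5` the narrowest wedge spans 11.25 to about 11.67, which shows how little margin the theoretical minimum rate leaves.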
2.2.2 Practical considerations
Although ideally a bandpass signal can be sampled at 2B Hz which is the Nyquist rate for the 
equivalent lowpass signal if integer band condition is met, any imperfection (disturbance) of 
the sampling frequency oscillator or variations of the centre frequency would pull the 
operating point away from the wedge tips of the allowed regions. Hence the theoretical 
minimum sampling rates for bandpass signals given by Eq. (2.21) are not achievable in 
practice. It has been shown (see Appendix A) that the practical minimum sampling rates, in
which the relative precision $p_s$ of the sampling frequency and the variations of the carrier
frequency are taken into account, can be described by
$$f_s^{(min)} = \frac{2(1+p_c)f_c + B}{I'(1-p_s)} \tag{2.24}$$
$$I' = \left\lfloor \frac{(1+p_s)\left(2(1+p_c)f_c + B\right)}{4(p_c+p_s)f_c + 2B} \right\rfloor \tag{2.25}$$
if the tolerances of the sampling frequency and the carrier frequency are both expressed in relative
precision ($p_s$ and $p_c$, as in situations where the carrier frequency varies due to Doppler effects), or by
$$f_s^{(min)} = \frac{2f_c + (2\upsilon+1)B}{I'(1-p_s)} \tag{2.26}$$
$$I' = \left\lfloor \frac{(1+p_s)\left(2f_c + (2\upsilon+1)B\right)}{4p_s f_c + 2(2\upsilon+1)B} \right\rfloor \tag{2.27}$$
if the sampling frequency is confined by relative precision whereas the carrier frequency is
allowed to vary within pre-defined guard bands whose bandwidths are equal to
$\upsilon B$ ($0 < \upsilon < 1$).
There are cases where tolerance to carrier frequency variations is of primary concern. It has
been found that, for maximum tolerance to carrier frequency variations and assuming $p_s \ll 1$,
the sampling frequency and the carrier frequency are related by
$$f_c = \left(\frac{n_w - 1}{2} + \frac{1}{4}\right) f_s \tag{2.28a}$$
for $n_w$ odd, and
$$f_c = \left(\frac{n_w}{2} - \frac{1}{4}\right) f_s \tag{2.28b}$$
for $n_w$ even, where $1 \le n_w \le I$ and $I$ is defined by Eq. (2.22). Or, in a unified formula,
$$f_s = \frac{4}{2n_w - 1} f_c \tag{2.28c}$$
The allowed maximum carrier frequency deviation is given as
$$\Delta f_c^{(max)} = \frac{1}{4}\left(1 - (2n_w - 1)p_s\right) f_s - \frac{1}{2}B \tag{2.29}$$
When $p_s = 0$, Eq. (2.29) becomes
$$\Delta f_c^{(max)} = \frac{1}{4} f_s - \frac{1}{2}B \tag{2.30}$$
Eqs. (2.28a) and (2.28b) are referred to as $\frac{1}{4}f_s$ and $-\frac{1}{4}f_s$ stacking of the carrier respectively. This
particular carrier arrangement ensures aliasing-free bandpass sampling without the need to
check the validity of the sampling frequency against Inequality (2.23). Apart from the main
advantage of being tolerant to carrier frequency uncertainty, the $\frac{1}{4}f_s$ ($-\frac{1}{4}f_s$) stacking can
also simplify digital processing loads in quadrature bandpass sampling, allowing simpler
sampler structures [Ric82, Rad84, Con83]. The feature of the $\frac{1}{4}f_s$ stacking arrangement has been
exploited for FDM signal sampling in [Can94] and will be discussed in a later section.
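The quarter-rate carrier placement can be tabulated for each wedge order. The sketch below is an illustration under stated assumptions rather than the procedure of Appendix A: it takes the unified relation $f_s = 4f_c/(2n_w - 1)$ of Eq. (2.28c), assumes an ideal sampling clock ($p_s = 0$), and takes the tolerable carrier deviation as $f_s/4 - B/2$, which follows from requiring the band edges to stay inside the wedge of Inequality (2.23). Function names and numbers are hypothetical.

```python
import math


def in_wedge(fs, fc, B, nw, tol=1e-9):
    """Inequality (2.23) for one wedge order n_w (small tolerance on the edges)."""
    low_ok = fs * nw >= 2 * fc + B - tol
    high_ok = nw == 1 or fs * (nw - 1) <= 2 * fc - B + tol
    return low_ok and high_ok


def robust_rates(fc, B):
    """List (n_w, f_s, max carrier deviation) for quarter-rate stacking, p_s = 0."""
    I = math.floor(fc / B + 0.5)
    options = []
    for nw in range(1, I + 1):
        fs = 4.0 * fc / (2 * nw - 1)   # unified relation, Eq. (2.28c)
        dev = fs / 4.0 - B / 2.0       # assumed p_s = 0 deviation limit
        if dev >= 0:
            options.append((nw, fs, dev))
    return options


# Hypothetical example in arbitrary consistent units: fc = 20, B = 5.
options = robust_rates(20.0, 5.0)
```

Each returned rate sits in the middle of its wedge, so the carrier may drift by the listed deviation in either direction before the band edge touches a wedge boundary.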
It should be pointed out that the noise level of bandpass sampled signals is higher than that of 
the equivalent baseband sampled signals due to folding over of the out-of-band noise 
[Vau91]. It is therefore necessary to introduce analog bandpass anti-aliasing filtering before 
bandpass sampling to reduce signal-to-noise ratio degradation.
2.3 Complex filters
In telecommunication systems sampled lowpass or bandpass signals are often processed with 
appropriate filters either to shape the signal impulse or to remove unwanted spectral 
components. When both sidebands of a real bandpass signal are of interest, a real bandpass 
filter can be used, which can be obtained by applying a two-sided frequency translation of Eq. 
(2.6) to the equivalent lowpass impulse response. Efficient filter structures can be used since
the symmetry property of the equivalent lowpass filter is preserved. Complex bandpass filters
are used if only one of the sidebands is to be processed and retrieved. The symmetry property of
the equivalent lowpass filter is destroyed by the one-sided frequency translation.
Consequently, the implementation tends to be more complicated than that of real bandpass
filters. Although digital filters can be of infinite impulse response (IIR) or finite impulse
response (FIR) type, the latter is dominant in practice due to its guaranteed stability and linear
phase, which is important to many communication systems. In this section, we focus solely
on the structures and transformations of FIR filters. Two types of efficient complex
filter structure are proposed in this section.
2.3.1 Direct transversal structure
A commonly used method of complex bandpass filter design is to design an equivalent
lowpass filter $h_{LP}(n)$ followed by a one-sided frequency translation to move $h_{LP}(n)$ to the
desired frequency location (hence complex). That is,
$$h_{BP}(n) = h_{LP}(n)W_P^{-Qn} \tag{2.31}$$
where the positive integers $P$ and $Q$ are related to the passband centre frequency $f_c$ and the
sampling frequency $F_s$ ($0 < f_c < F_s$) by
$$\frac{Q}{P} = \frac{f_c}{F_s} \tag{2.32}$$
If the equivalent lowpass filter is an $N$-tap linear phase real FIR with the symmetry property
$$h_{LP}(n) = h_{LP}(N-1-n) \tag{2.33}$$
the multiplications can be reduced by a factor of two using the transversal symmetric FIR
structure shown in Figure 2.5(a) [Had91]. The complex bandpass filter of Eq. (2.31) is not
symmetric in any sense, so a direct implementation leads to the transversal structure shown in
Figure 2.5(b). For a complex FIR filter of length $N$, which is assumed odd hereafter for
simplicity, the structure of Figure 2.5(b) requires $4N$ real multiplications and $4N-2$ real
additions, where an adder tree is assumed wherever applicable.
Figure 2.5 Transversal FIR structures: (a) symmetric real FIR (real signal path), h's are real
coefficients; (b) direct complex FIR (complex signal path), c's are complex coefficients
2.3.2 Rotation Invariant Complex FIR (RICF) network
The main idea of reducing computations in a complex filter is to exploit the symmetry
property of the equivalent lowpass FIR, which is destroyed by the one-sided frequency
translation except for filters centred at some particular frequency locations. Narasimha and
Peterson [Nar79] showed that the complex bandpass filter of Eq. (2.31) preserves
complex conjugate symmetry if, and only if, the centre frequency satisfies
$$\frac{f_c}{F_s} = \frac{Q}{P} = \frac{i}{2(N-1)} \tag{2.34}$$
where $i$ is an arbitrary integer, which gives
$$h_{BP}^{*}(n) = (-1)^{i} h_{BP}(N-1-n). \tag{2.35}$$
For an arbitrary centre frequency $f_c$ which does not satisfy Eq. (2.34), the complex bandpass
filter is no longer complex conjugate symmetric. Eq. (2.34) shows that $f_c$ is strongly
dependent on the filter length $N$ if the symmetry property is to be preserved. The restriction on $f_c$
can be dropped and the symmetry property of $h_{LP}(n)$ can still be utilized according to the
frequency translation invariance discussed in section 2.1.4. Referring to Eq. (2.19), this can
be achieved by introducing a constant phase rotation $W_P^{Q(N-1)/2} = e^{-j\omega_c(N-1)/2}$, with
$\omega_c = 2\pi f_c/F_s$, to Eq. (2.31), that is
$$h'_{BP}(n) = h_{LP}(n)W_P^{-Qn}W_P^{Q(N-1)/2} \tag{2.36}$$
The resulting complex filter thus becomes complex conjugate symmetric,
$$h'^{*}_{BP}(n) = h'_{BP}(N-1-n) \tag{2.37}$$
The complex FIR filter of Eq. (2.36) can be realized by the network shown in Figure 2.6(a),
which resembles the structure of Figure 2.5(a) but with complex data paths and operators
instead. The operator 'cross in circle' in the figure performs the conventional complex or real
multiplication depending on the input signal type, whereas the operator 'cross in rectangle'
performs a more complicated complex operation as defined in Figure 2.6(b),
$$y = x_1 h + x_2 h^{*} = [(x_{1r} + x_{2r})h_r - (x_{1i} - x_{2i})h_i] + j[(x_{1i} + x_{2i})h_r + (x_{1r} - x_{2r})h_i] \tag{2.38}$$
The required number of real multiplications for this operation is four, instead of eight for two
complex multiplications, saving multiplications by a factor of two. The number of real
additions remains the same as that of two complex multiplications. One possible
implementation structure for the operator is also shown in the figure. With the structure of
Figure 2.6(a), the required numbers of real multipliers and adders for an $N$-tap ($N$ assumed
odd) complex bandpass filter are $N_{mul} = 2N+6$ and $N_{add} = (7N+17)/2$ respectively, where an
adder tree is assumed for the accumulator. The multiplications are approximately halved
compared to the direct structure, which does not exploit the symmetry property. The introduced
constant phase shift can be compensated either before or after the complex FIR network.
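The saving offered by the paired-tap operator can be illustrated in software. The sketch below is not the hardware structure of Figure 2.6; it is a minimal model assuming the operator computes $y = x_1 h + x_2 h^{*}$ with four real multiplications, and that the filter taps satisfy $h(N-1-i) = h^{*}(i)$ with $N$ odd. All names and numerical values are hypothetical.

```python
import cmath
import math


def cross_in_rectangle(x1, x2, h):
    """y = x1*h + x2*conj(h) using four real multiplications."""
    hr, hi = h.real, h.imag
    sum_r, sum_i = x1.real + x2.real, x1.imag + x2.imag
    dif_r, dif_i = x1.real - x2.real, x1.imag - x2.imag
    return complex(sum_r * hr - dif_i * hi, sum_i * hr + dif_r * hi)


def ricf_filter(x, h):
    """Convolve x with a conjugate-symmetric h by pairing taps i and N-1-i."""
    N = len(h)
    mid = (N - 1) // 2

    def sample(k):
        return x[k] if 0 <= k < len(x) else 0j

    out = []
    for n in range(len(x) + N - 1):
        acc = h[mid].real * sample(n - mid)   # the centre tap is purely real
        for i in range(mid):
            acc += cross_in_rectangle(sample(n - i), sample(n - (N - 1 - i)), h[i])
        out.append(acc)
    return out


def direct_filter(x, h):
    """Reference: plain complex convolution, one complex multiply per tap."""
    N = len(h)
    return [sum(h[i] * x[n - i] for i in range(N) if 0 <= n - i < len(x))
            for n in range(len(x) + N - 1)]


# Hypothetical symmetric real lowpass prototype, rotated with the phase offset.
h_lp = [1.0, 3.0, 5.0, 3.0, 1.0]
P, Q = 16, 3
N = len(h_lp)
h_ricf = [h_lp[n] * cmath.exp(2j * math.pi * Q * (n - (N - 1) / 2) / P) for n in range(N)]
x = [1 + 1j, 2 - 1j, -0.5 + 0.25j, 3j, -2.0 + 0j, 0.75 - 0.5j]

y_ricf = ricf_filter(x, h_ricf)
y_direct = direct_filter(x, h_ricf)
```

Both functions produce the same output sequence; the paired version spends four real multiplications per tap pair where the direct version spends eight.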
Figure 2.6 Transversal complex FIR: (a) RICF structure; (b) "cross in
rectangle" operator
2.3.3 Down-Real Filtering-Up (DRFU) network
The bandpass filter of Eq. (2.31) performs the complex discrete convolution
$$y(n) = h_{BP}(n) * x(n) = \sum_{i=0}^{N-1} h_{LP}(i)W_P^{-Qi} \cdot x(n-i) \tag{2.39}$$
By pure mathematical manipulation, Eq. (2.39) can be rewritten as
$$y(n) = W_P^{-Qn}\sum_{i=0}^{N-1} h_{LP}(i) \cdot x(n-i)W_P^{Q(n-i)} = \left\{h_{LP}(n) * \left[x(n)W_P^{Qn}\right]\right\}W_P^{-Qn} \tag{2.40}$$
which indicates that the bandpass filtering can be realized by filtering the down-converted
input signal with the equivalent (real) lowpass filter followed by an up-conversion. We call
the structure specified by Eq. (2.40) the Down-Real Filtering-Up (DRFU) bandpass filter
network and depict it in Figure 2.7(a). That is, filtering the passband signal with $h_{BP}(n)$, which
is up-shifted from $h_{LP}(n)$, is equivalent to filtering the baseband signal, which is down-converted
from the passband, with $h_{LP}(n)$ and then up-converting the output back to the centre
frequency. The filtering process is illustrated in Figures 2.7(b) through 2.7(e).
Figure 2.7 DRFU complex filter (panels (a) to (e))
The advantage of the DRFU structure lies in that the computation of complex bandpass filtering
is reduced by using two, instead of four, real filters at the expense of two digital mixers,
which can be implemented with two complex multipliers and a shared ROM in which the
sinusoid values are stored. Hence for an $N$-tap linear phase equivalent lowpass FIR, the
DRFU structure requires $N+9$ real multipliers and $2N+2$ real adders, saving approximately a
factor of four in multiplications and a factor of two in additions compared to the direct
structure. The DRFU structure is also used to derive efficient multirate filter banks that will
be addressed in chapters 3 and 4. The complexities of the three types of complex FIR structure
are summarized in Table 2.1.
Table 2.1 Complexity of complex FIR structures
                    direct    RICF         DRFU
real multipliers    4N        2N+6         N+9
real adders         4N-2      (7N+17)/2    2N+2
ROMs                2(N+1)    N+1          (N+1)/2+2(P-1)
control logic       simple    simple       complicated
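The equivalence stated by Eq. (2.40) can be verified with a few lines of code. This is a minimal sketch assuming $W_P = e^{-j2\pi/P}$ and causal signals that are zero before $n = 0$; the function names, the prototype filter and the input samples are hypothetical.

```python
import cmath
import math


def W(P, e):
    """W_P^e with the assumed convention W_P = exp(-j*2*pi/P)."""
    return cmath.exp(-2j * math.pi * e / P)


def fir(h, x):
    """Causal FIR filter: y(n) = sum_i h(i) x(n-i) for n = 0 .. len(x)-1."""
    return [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(x))]


def direct_bandpass(h_lp, x, P, Q):
    """Filter with the complex bandpass response h_LP(n) * W_P^(-Qn)."""
    h_bp = [h_lp[i] * W(P, -Q * i) for i in range(len(h_lp))]
    return fir(h_bp, x)


def drfu(h_lp, x, P, Q):
    """Down-convert, filter with the real lowpass prototype, up-convert."""
    down = [x[n] * W(P, Q * n) for n in range(len(x))]
    # The real filter is applied separately to the real and imaginary parts,
    # i.e. two real filters instead of one complex filter.
    re = fir(h_lp, [d.real for d in down])
    im = fir(h_lp, [d.imag for d in down])
    return [complex(re[n], im[n]) * W(P, -Q * n) for n in range(len(x))]


# Hypothetical data: 5-tap symmetric prototype, centre frequency Q/P = 3/16.
h_lp = [0.1, 0.25, 0.3, 0.25, 0.1]
x = [1 + 0.5j, -0.3 + 2j, 0.8 - 1.2j, 2.0 + 0j, -1.5j, 0.4 + 0.4j, -0.9 + 0.1j, 1.1 - 0.7j]
y_direct = direct_bandpass(h_lp, x, 16, 3)
y_drfu = drfu(h_lp, x, 16, 3)
```

The two outputs agree to rounding error, which is the property the DRFU network relies on.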
2.4 Frequency division multiplexed signals
Another type of band-limited signal is an FDM signal, which consists of a group of passband
signals (channels) centred at different frequencies.
2.4.1 FDM signals
If channel signals are evenly spaced in frequency with identical channel spacing in between, 
the FDM signal is said to be uniform, as seen in Figure 2.8(a). Otherwise it is nonuniform. 
Since most computationally efficient DEMUX filter banks require that the FDM signals be 
uniform, the discussions on the design of FDM signals and the corresponding DEMUX filter 
banks will be focused mainly on the uniform cases throughout this thesis. It is however 
possible to handle certain types of nonuniform signals characterized by octave channel 
spacing (and hence, octave channel bandwidths), as illustrated in Figure 2.8(b), using 
multistage filterbank techniques [Cro83, Gra96].
A uniform FDM signal can be described as
$$x(t) = \Re\left\{\sum_{k=0}^{K-1} x_k(t)\, e^{j2\pi\left[f_I + \left(k - \frac{K-1}{2}\right)W\right]t}\right\} \tag{2.41}$$
where $x_k(t)$, $k = 0, 1, \ldots, K-1$, are $K$ baseband signals centred at $f = 0$, $f_I$ is the intermediate
centre frequency of the FDM signal and $W$ is the channel spacing. The channel signals $x_k(t)$
can be real or complex depending on whether they are DSB or SSB. In MCDD applications, however,
SSB channel signals are normally assumed since they provide high spectral efficiency. An SSB
FDM signal is illustrated in Figure 2.8(a).
Figure 2.8 SSB FDM signals: (a) uniform FDM; (b) octave FDM
2.4.2 Sampling of FDM signals
To sample a continuous FDM signal at baseband one needs to move the passband close to the
frequency origin $f = 0$ so as to reduce the sampling frequency. Alternatively, the bandpass
FDM signal can be directly sampled at IF using the bandpass sampling techniques discussed in
section 2.2. In some applications, such as radar and sonar systems, complex (quadrature)
sequences are required. One way to obtain complex samples is through the use of
quadrature sampling, which is carried out at baseband [Ric82]. In applications where
real samples are required, the sampling can be carried out at either baseband or IF/RF bands.
Therefore, we have basically three sampling schemes for FDM signals, namely, the baseband
real, bandpass real, and baseband quadrature approaches, as shown in Figures 2.9(a), 2.9(b),
and 2.9(c) respectively. The advantages of bandpass sampling include reduced complexity on
the analog side and freedom from the mismatch problems between the I- and Q-channels which are
inherent to quadrature sampling. However, as pointed out in section 2.2, bandpass sampling
schemes are sensitive to band position and have poorer noise performance than
baseband sampling. The quadrature output can also be obtained by digital means after the
bandpass sampling, thus eliminating mismatch problems and retaining the merits of
bandpass sampling. This scheme is illustrated in Figure 2.9(d).
Figure 2.9 Sampling schemes of FDM signals: (a) baseband real sampling;
(b) passband real sampling; (c) baseband quadrature sampling; (d) (b) with
quadrature output; (e) baseband real sampling based on Weaver structure
For the baseband complex sampling of Figure 2.9(c), the structure is valid as long as the
intermediate carrier frequency and the FDM bandwidth satisfy $f_I > \frac{1}{2}KW$ (which is always
true). In the case of baseband real sampling as depicted in Figure 2.9(a), by contrast, the
structure requires the FDM signal to satisfy $f_I > KW$ in order to avoid aliasing after
mixing. Thus for wide band FDM (wide relative to the intermediate centre frequency $f_I$),
where $\frac{1}{2}KW < f_I < KW$, a single-sideband modulator (also known as a Weaver modulator) can be
employed to replace the simple mixer in Figure 2.9(a) for frequency translation [Dar70,
Cro83]. This Weaver modulation real sampling scheme is shown in Figure 2.9(e).
Since only one of the sidebands is sampled in complex sampling schemes, the required
sampling frequency is halved compared to that of real sampling schemes, where both
sidebands are protected. The price paid for the reduced sampling frequency is an additional
sample-and-hold circuit and analog-to-digital converter (ADC). In addition to the added
complexity of the complex sampling scheme of Figure 2.9(c), the problem of matching the gain
and phase responses of the I- and Q-channels also needs to be resolved in
practice.
2.4.3 Channel stacking
2.4.3.1 Stacking of analog FDM
The arrangement of FDM channels is important in deriving efficient digital demultiplexing
structures. We introduce the concept of channel stacking beginning with analog FDM signals,
though it plays no significant role in analog demultiplexing filter banks because analog channel
(slot) filters are not sharable. For the analog FDM signal represented by Eq. (2.41), the centre
frequency of the first channel can be represented by $f_0 = (k_0 + p)W$, where $-0.5 \le p < 0.5$ and
$k_0 \ge 1$ is an integer. Thus the spectra of all the channels are centred on the $(k+p)W$ grid ($k$ a
positive integer) for $f > 0$ and on the $-(k+p)W$ grid for $f < 0$. The signal is said to be in odd
stacking if $p = -0.5$ and in even stacking if $p = 0$. For other values of $p$, the signal is said to be
in skewed stacking. Clearly, the spectral stacking on the positive frequency side is different from
that on the negative frequency side for a skewed stacking real FDM signal. This can be seen in
Figure 2.10(a). We call consistent stacking in both frequency directions homogenous stacking.
Hence odd and even stacking are homogenous.
Figure 2.10 Non-homogenous channel stacking (K=3): (a) analog FDM,
k=1, p=0.25; (b) bandpass sampled FDM, p=12.25, n_w=4
2.4.3.2 Channel stacking of sampled FDM signals
Channel stacking arrangement for analog signals is of little practical interest. In the literature,
channel stacking problems are discussed most often in the context of sampled FDM. The
periodic nature of sampled signal spectra may change the channel stacking of analog
FDM signals. Consider the uniform sampling of the analog FDM signal described by Eq. (2.41)
with sampling frequency $f_s$; the sampled FDM signal can be represented as
$$x(n) = \Re\left\{\sum_{k=0}^{K-1} x_k(n)\, e^{j2\pi\left[f_I + \left(k - \frac{K-1}{2}\right)W\right]n/f_s}\right\} \tag{2.42}$$
Assuming robust bandpass sampling for maximum tolerance to carrier variation, $f_s$ and $f_I$
should have the relation $f_s = 4f_I/(2n_w - 1)$, where the integer $n_w$ is the wedge order (Eq. (2.28c)).
Assuming
$$f_I = pW \tag{2.43}$$
where $p > 0$, and considering the relation between $f_s$ and $f_I$, Eq. (2.42) becomes
$$x(n) = \Re\left\{\sum_{k=0}^{K-1} x_k(n)\, e^{j2\pi\frac{(2n_w - 1)(2k - K + 1 + 2p)}{8p}n}\right\} \tag{2.44}$$
Similar to the analog case, homogenous stacking for sampled FDM signals refers to
consistent stacking in both frequency directions, i.e., in the normalized frequency intervals
$[0, \pi]$ and $[\pi, 2\pi]$. It has been shown in Appendix B that the sampled FDM signal of Eq. (2.44)
has homogenous stacking if the conditions
$$\left(\left(-(2n_w-1)(K-1) + 2(2n_w-1)p\right)_{8p}\right)_{2(2n_w-1)} = \left(\left((2n_w-1)(K-1) - 2(2n_w-1)p\right)_{8p}\right)_{2(2n_w-1)} \tag{2.45}$$
and
$$p > (n_w - 0.5)K \tag{2.46}$$
are satisfied, where $(\cdot)_m$ denotes reduction modulo $m$.
Example:
Consider the case of $n_w = 4$ and $K = 3$. To have robust bandpass sampling and homogeneous
channel stacking, we need to choose the centre and sampling frequencies carefully. From Eq.
(2.45) we have $((-14+14p)_{8p})_{14} = ((14-14p)_{8p})_{14}$. Since $p > 21/2$ (Eq. (2.46)), we have $8p > 14$, and
the above equation becomes $(-14+6p)_{14} = (14+2p)_{14}$. Hence, $(6p)_{14} = (2p)_{14}$. That is, $4p = 14i$, or
$p = 7i/2$, where $i$ is any integer. Therefore, $f_I = 7iW/2$ and $f_s = 4f_I/7 = 2iW$ (Eq. (2.28c)), where $i > 3$
to satisfy $p > 21/2$. The resulting sampled FDM signal will have the positive sideband channels
positioned at $2\pi(k-1+3i/2)/(2i)$ and the negative ones at $2\pi(-k+1+i/2)/(2i)$. Obviously, an odd
$i$ gives odd stacking and an even $i$ leads to even stacking for this particular case.
Figure 2.10(b) illustrates a sampled non-homogenous stacking FDM signal with $p = 12.25$, $K = 3$, and
$n_w = 4$.
2.4.3.3 Frequency translation and channel stacking
In practice, homogenous stacking is wanted because of its computational advantages. We have
identified two properties of homogenous stacking:
• the homogeneity does not change with frequency translations; and,
• shifting a sampled FDM spectrum which is in $p$-stacking by half of the sampling rate
(i.e., multiplying by $e^{j\pi n}$) gives rise to a $(1-p)$-stacking spectrum.
We can take advantage of these properties to change channel stacking by performing one-sided
frequency translations. In particular, the stacking remains unchanged after a $\pi$-frequency
translation if the signal is in odd or even stacking. It can be shown that for symmetric (about
$\pi$) homogenous channel stacking a trivial $\pi/2$-frequency shift will result in even stacking
for odd $K$ and odd stacking for even $K$. These features are useful in frequency DEMUX
filter bank designs, which will be discussed in chapters 3 and 4.
2.5 Multirate systems and signal processing
Multirate digital signal processing provides a very efficient means for bandpass
signal processing and multirate system optimization. It minimizes computations in a
multirate system by using the most appropriate sampling rate, commensurate with the
signal bandwidth [Cro83, Vet87, Fli94, Gra96]. This technique is particularly relevant to
narrow band signal processing, e.g., in transmultiplexing/demultiplexing problems.
2.5.1 Discrete signal representations
2.5.1.1 Polyphase representation
Given an integer $M$, exactly $M$ discrete signals can be obtained from a signal $x(n)$, each of
which is downsampled from $x(n)$ with a different phase offset and is given as
$$x_\lambda^{(p)}(n) = x(nM + \lambda), \quad \lambda = 0, 1, \ldots, M-1, \tag{2.47}$$
where the $x_\lambda^{(p)}(n)$ are known as the polyphase components of $x(n)$. Consequently, the signal can be
represented by the polyphase components in the following form:
$$x(n) = \sum_{\lambda=0}^{M-1} x(n)\,w_M(n-\lambda) = \sum_{\lambda=0}^{M-1} x_\lambda^{(p)}\!\left(\frac{n-\lambda}{M}\right) w_M(n-\lambda) \tag{2.48}$$
where $w_M(n)$ is the discrete sampling function, which is defined as
$$w_M(n) = \frac{1}{M}\sum_{k=0}^{M-1} W_M^{kn} = \begin{cases} 1, & n = mM,\ m \text{ integer} \\ 0, & \text{otherwise} \end{cases} \tag{2.49}$$
It is easy to show that the polyphase representation of a discrete signal in the z-domain is
[Fli94]:
$$X(z) = \sum_{\lambda=0}^{M-1} z^{-\lambda} X_\lambda^{(p)}(z^M) \tag{2.50}$$
where the z-transform of the polyphase components is given as
$$X_\lambda^{(p)}(z) = \sum_{n=-\infty}^{\infty} x_\lambda^{(p)}(n) z^{-n} \tag{2.51}$$
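The polyphase split and its z-domain recombination can be checked on a short sequence. This is a minimal sketch assuming a finite causal signal, so that the z-transform is a finite sum; the function names, the sample values and the evaluation point `z` are hypothetical.

```python
import cmath


def polyphase(x, M):
    """Polyphase components x_lambda(n) = x(n*M + lambda), lambda = 0..M-1."""
    return [x[lam::M] for lam in range(M)]


def z_transform(x, z):
    """Evaluate X(z) = sum_n x(n) z^(-n) for a finite causal sequence."""
    return sum(v * z ** (-n) for n, v in enumerate(x))


def from_polyphase(components, z):
    """Recombine in the z-domain: sum over lambda of z^(-lambda) X_lambda(z^M)."""
    M = len(components)
    return sum(z ** (-lam) * z_transform(c, z ** M)
               for lam, c in enumerate(components))


x = [float(v) for v in range(1, 11)]   # hypothetical signal 1, 2, ..., 10
M = 3
parts = polyphase(x, M)                # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]

z = 0.9 * cmath.exp(0.7j)              # arbitrary evaluation point
direct = z_transform(x, z)
recombined = from_polyphase(parts, z)
```

Evaluating both sides at an arbitrary `z` gives the same value, as expected from the z-domain polyphase representation.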
2.5.1.2 Modulation representation
We define a set of complex exponential modulated signals $x_k^{(m)}(n)$, $k = 0, 1, \ldots, M-1$, as
$$x_k^{(m)}(n) = x(n)e^{j\frac{2\pi}{M}kn} = x(n)W_M^{-kn} \tag{2.52}$$
Their z-transforms are
$$X_k^{(m)}(z) = X(zW_M^{k}) \tag{2.53}$$
It is also simple to show, by considering Equations (2.48) and (2.49), that the z-transform of
$x(n)$ can be represented by its modulated signals, i.e.,
$$X(z) = \sum_{\lambda=0}^{M-1}\left(\frac{1}{M}\sum_{k=0}^{M-1} X_k^{(m)}(z)W_M^{\lambda k}\right) \tag{2.54}$$
Comparing with Eq. (2.50), we have the following relationship between the polyphase
components and the modulation components:
$$z^{-\lambda}X_\lambda^{(p)}(z^M) = \frac{1}{M}\sum_{k=0}^{M-1} X_k^{(m)}(z)W_M^{\lambda k} \tag{2.55}$$
That is, the polyphase components can be obtained by performing an IDFT on the modulation
components.
2.5.2.1 Downsampling and decimation
Wherever the sampling rate is considerably higher than the signal's Nyquist rate, it may be
possible to reduce the sampling rate and hence the computational load. The process
of downsampling is also called sampling rate decimation. In most cases, it consists of two
stages: an anti-aliasing filter followed by a downsampler, as shown in Figure 2.11(a). The
purpose of the anti-aliasing filter is to remove spectral fold-over (aliasing) into the band of
interest after downsampling. For decimation by M, the downsampler simply takes every M-th
sample of the input sequence and discards the rest. Hence, for the decimated signal
depicted in Figure 2.11(a), we have
$$y(m) = \sum_{k=-\infty}^{\infty} h(k)\, x(mM - k) \tag{2.56}$$
A prominent property of downsampling is that the spectrum of the downsampled signal is the
sum of equally spaced, shifted replicas of the original signal spectrum. This property can be
seen in the z-transform of the downsampled signal,
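$$Y(z) = \frac{1}{M}\sum_{k=0}^{M-1} V\!\left(z^{1/M} W_M^{k}\right) \tag{2.57}$$
With the substitutions of $z = e^{j\omega}$ and $V(z) = H(z)X(z)$, we have the Fourier relation between
the decimated signal and the original signal,
$$Y(\omega) = \frac{1}{M}\sum_{k=0}^{M-1} H\!\left(\omega - \frac{2\pi k}{M}\right) X\!\left(\omega - \frac{2\pi k}{M}\right) \tag{2.58a}$$
or, in terms of the new frequency $\omega' = M\omega$,
$$Y(\omega') = \frac{1}{M}\sum_{k=0}^{M-1} H\!\left(\frac{\omega' - 2\pi k}{M}\right) X\!\left(\frac{\omega' - 2\pi k}{M}\right) \tag{2.58b}$$
The spectral relation (ignoring scaling factors for the amplitudes) is shown in Figures 2.11(b)
to 2.11(e).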
Figure 2.11 Spectral relations for integer rate alteration: (a) to (e) for integer
rate decimation (M=3); (f) to (j) for integer rate interpolation (L=3)
2.5.2.2 Upsampling and interpolation
If several narrow-band signals are to be combined to form a wide-band signal (as in FDM
signal generation), their sampling rates must be increased before combining them.
The process of sampling rate increase is called interpolation. Sampling rate interpolation
consists of upsampling followed by anti-imaging filtering, as shown in Figure 2.11(f).
The latter is necessary because the upsampled sequence v(m) contains L-1 images of the
baseband spectrum at harmonics of the original sampling frequency, $2\pi k/L$, $k = 1, 2, \ldots, L-1$.
The interpolated signal of Figure 2.11(f) can be represented as
$$y(m) = \sum_{k=-\infty}^{\infty} h(k)\, v(m - k) \tag{2.59}$$
where
$$v(m) = \begin{cases} x(m/L) & \text{for } m = nL,\ n \text{ integer}, \\ 0 & \text{otherwise.} \end{cases} \tag{2.60}$$
The z-transform of the upsampled signal is
$$V(z) = X(z^{L}) \tag{2.61}$$
Therefore, its Fourier transform is
$$V(\omega) = X(L\omega) = X(\omega') \tag{2.62}$$
These relations are illustrated in Figures 2.11(g) through 2.11(j).
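The aliasing and imaging relations above have exact finite-length counterparts in terms of the DFT, which makes them easy to verify numerically. The sketch below is illustrative only: the helper names are hypothetical, and the DFT is used as a stand-in for the Fourier transform of Eqs. (2.57) and (2.62).

```python
import numpy as np

def downsample(v, M):
    """Keep every M-th sample: y(m) = v(mM)."""
    return np.asarray(v)[::M]

def upsample(x, L):
    """Insert L-1 zeros between consecutive samples, as in Eq. (2.60)."""
    x = np.asarray(x)
    v = np.zeros(len(x) * L, dtype=x.dtype)
    v[::L] = x
    return v

rng = np.random.default_rng(1)
M = L = 3
v = rng.standard_normal(12)          # length chosen as a multiple of M

# Downsampling: the DFT of y is the average of M equally spaced shifted copies
# of the DFT of v (the finite-length analogue of Eq. (2.57)).
Y = np.fft.fft(downsample(v, M))
V = np.fft.fft(v)
aliased = V.reshape(M, -1).sum(axis=0) / M
assert np.allclose(Y, aliased)

# Upsampling: the DFT of the zero-stuffed signal is the DFT of x repeated
# L times, i.e. the baseband spectrum plus L-1 images (cf. Eq. (2.62)).
x = rng.standard_normal(4)
assert np.allclose(np.fft.fft(upsample(x, L)), np.tile(np.fft.fft(x), L))
```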
2.5.3 Rational sampling rate conversion
Sampling rate decimation or interpolation is the decrease or increase of the sampling rate by an
integer factor. In many cases, rational sampling rate conversions are required, as in frequency
demultiplexing filter banks. In this section, we show that rational sampling rate conversion can be
realized by a linear periodically time-varying (LPTV) filter. Alternatively, the LPTV filter
can be replaced by a polyphase network in which all the subfilters are linear time invariant
(LTI) and the LPTV operations are decoupled from the filtering by the use of SPC and PSC
commutators.
2.5.3.1 Rational sampling rate conversion
It has been shown that rational sampling rate conversion of an LTI discrete system leads to an
LPTV system with input-to-output relation given as [Cro83],
(2.63)
where the time-varying impulse response $g_m(n)$ is defined as
(2.64)
where h(t) is the impulse response of the corresponding continuous LTI system and 1/T' is
the sampling frequency for h(t).
The time-varying nature of Eq. (2.63) is apparent. Since $g_m(n)$ is periodic in m with period L, the system is
linear periodically time-varying. Therefore, sampling rate alteration of an LTI system leads to
an LPTV system provided that the rate change is rational [Cro83]. As the multirate filter bank
theory is under the assumption of rational rate conversion, multirate filter banks thus belong
to the class of LPTV systems.
For non-rational sampling rate alteration, if we imagine that the rate ratio could be expressed
by a ratio of two very large integers which approach infinity, then the periodic nature of $g_m(n)$
no longer exists due to infinite L. Therefore, for non-rational sampling alteration, the LTI
system becomes in general linear time-varying. Figure 2.12(a) shows the rational sampling 
rate conversion using LPTV filter.
Figure 2.12 Rational sampling rate conversion: (a) using an LPTV filter;
(b) using an LTI filter
An alternative structure to Figure 2.12(a) is to use interpolation and decimation filters in 
tandem. As a result, the impulse responses of the anti-aliasing filter in the decimator and the 
anti-imaging filter in the interpolator can be combined into a single LTI filter h(n). The 
resultant rational sampling converter is shown in Figure 2.12(b). Ideally, for lowpass signals,
the combined filter should have the frequency response
$$H(\omega) = \begin{cases} L & \text{for } |\omega| < \min\!\left(\dfrac{\pi}{L}, \dfrac{\pi}{M}\right), \\ 0 & \text{otherwise} \end{cases} \tag{2.65}$$
Compared to the LPTV filter approach, the symmetry property of the LTI filter can be
exploited in this method, saving complexity by approximately a factor of two. Despite the
advantages of avoiding time-varying filtering and being able to exploit the symmetry property,
the structure suffers from poor computational efficiency since the filtering is performed at the
highest sampling rate of the system. As will be shown later, a polyphase decimator and
interpolator can greatly improve the computational efficiency by using a set of LTI sub-filters
operating at a low sampling rate. The sub-filters, however, lose the symmetry because of the
decomposition of the filter.
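The direct structure of Figure 2.12(b) can be sketched as follows. This is a minimal illustration under stated assumptions: the prototype is a Hamming-windowed sinc chosen here only as a convenient approximation to the ideal response of Eq. (2.65), and the function names and the `half_len` parameter are hypothetical. Note that the convolution runs at the high (upsampled) rate, which is exactly the inefficiency discussed above.

```python
import numpy as np

def lowpass_prototype(L, M, half_len=10):
    """Hamming-windowed sinc approximating Eq. (2.65): gain L, cutoff min(pi/L, pi/M)."""
    cutoff = min(1.0 / L, 1.0 / M)                    # cutoff as a fraction of pi
    span = half_len * max(L, M)
    n = np.arange(-span, span + 1)
    h = cutoff * np.sinc(cutoff * n) * np.hamming(len(n))
    return L * h / h.sum()                            # normalise the DC gain to L

def resample_direct(x, L, M, h):
    """Upsample by L, filter with h at the high rate, then downsample by M."""
    v = np.zeros(len(x) * L)
    v[::L] = x
    return np.convolve(v, h)[::M]

L, M = 3, 2
h = lowpass_prototype(L, M)
y = resample_direct(np.ones(40), L, M, h)

# Away from the start-up and tail transients, a constant input should come out
# close to the same constant, because the passband gain L compensates for the
# L-1 zeros inserted by the upsampler.
assert np.allclose(y[35:55], 1.0, atol=0.02)
```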
2.5.3.2 Polyphase analysis and polyphase synthesis of signals
Before introducing the concept of polyphase decomposition of decimation/interpolation 
filters, let us first look at two basic multirate operations, the polyphase analysis and polyphase 
synthesis of signals.
Polyphase components of signal:
For a discrete signal x(n) (real or complex), its N polyphase components are defined as
$$x_i(m) = x(mN + i), \quad i = 0, 1, \ldots, N-1 \ \text{and}\ m \in \mathbb{Z} \tag{2.66}$$
where N is an arbitrary integer. Clearly, a polyphase signal is a decimated signal with a time
shift, as illustrated in Figure 2.13(a). Though polyphase signals can also be defined as
$x_i(m) = x(mN - i)$, as in some of the literature, Eq. (2.66) will be used as the sole
definition for polyphase signals hereafter to avoid confusion.
Polyphase analysis:
From the polyphase definition of Eq. (2.66), the polyphase components can be obtained
through the structure shown in Figure 2.13(b). The commutation process in the structure is
known as polyphase analysis and the structure is called a serial-to-parallel commutator
(converter) (SPC). Because time advances are used in the structure, it is non-causal and not
realizable in practice. The time advances can be factored out of the structure, transforming
the non-causal SPC structure into an equivalent structure consisting of a time advance and a
causal SPC (shaded area) as shown in Figure 2.13(c).
Figure 2.13 Polyphase analysis of signal: (a) down-commutation (M=3);
(b) non-causal SPC; (c) causal SPC
Polyphase synthesis:
The dual process of polyphase analysis is called polyphase synthesis of a signal, which 
interleaves (combines) the polyphase components back into the original signal. The process is 
also governed by the relations of Eq. (2.66). The structure that synthesizes the original signal 
from the polyphase components should be the dual of Figure 2.13(b), which can be obtained
by applying the generalized transposition [Cro83] to the figure, giving rise to Figure 2.14. The
structure is called a parallel-to-serial commutator (converter) (PSC).
Figure 2.14 Polyphase synthesis of signal
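On a finite block of samples, polyphase analysis and synthesis reduce to de-interleaving and interleaving. The sketch below models the non-causal SPC of Eq. (2.66) with array indexing; the names `spc` and `psc` are hypothetical helpers, and the restriction to block lengths that are multiples of N is an assumption made only to keep the example short.

```python
import numpy as np

def spc(x, N):
    """Polyphase analysis per Eq. (2.66): row i holds x_i(m) = x(mN + i)."""
    x = np.asarray(x)
    if len(x) % N:
        raise ValueError("signal length must be a multiple of N in this sketch")
    return x.reshape(-1, N).T

def psc(components):
    """Polyphase synthesis: interleave the rows back into a single sequence."""
    return np.asarray(components).T.reshape(-1)

x = np.arange(12)
branches = spc(x, 3)
assert np.array_equal(branches[1], x[1::3])   # branch 1 is x(3m + 1)
assert np.array_equal(psc(branches), x)       # analysis followed by synthesis is the identity
```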
2.5.3.3 Polyphase decomposition of decimation and interpolation filters
Polyphase structures are extremely important in multirate systems and filter banks. They
provide opportunities for moving filtering (computations) from high sampling rates to lower
ones and enable hardware sharing amongst different channels in filter banks, leading to both
computationally efficient and low-complexity structures.
Consider the direct structure of the sampling rate decimator in Figure 2.11(a). Since the
impulse response of the LTI anti-aliasing filter can be regarded as a signal (which it is, from a
system point of view), it can be polyphase decomposed according to Eq. (2.50). Hence we
have, in the z-domain,
$$H(z) = \sum_{\lambda=0}^{M-1} z^{-\lambda} H_\lambda^{(p)}(z^M) \tag{2.67}$$
which represents an LTI polyphase network as shown in Figure 2.15(a). Replacing the
anti-aliasing filter in Figure 2.11(a) with the polyphase network results in the equivalent structure
shown in Figure 2.16(a). It follows that the downsampler can be moved across the polyphase
network by applying the noble identities [Cro83], giving rise to the computationally efficient
structure of Figure 2.16(b) in which the LTI sub-filters are at the lower sampling rate.
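The equivalence between the direct decimator and its polyphase form can be checked numerically. The sketch below is an illustration under stated assumptions rather than the thesis's implementation: the filter is taken to be causal and FIR, the SPC is modelled by a delay of λ samples followed by downsampling, and the function names are hypothetical.

```python
import numpy as np

def decimate_direct(x, h, M):
    """Direct form: filter at the input rate, then keep every M-th sample."""
    return np.convolve(x, h)[::M]

def decimate_polyphase(x, h, M):
    """Polyphase form: each sub-filter h(nM + lam) runs at the output rate."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    n_out = -(-(len(x) + len(h) - 1) // M)        # ceil(full convolution length / M)
    y = np.zeros(n_out)
    for lam in range(min(M, len(h))):
        h_lam = h[lam::M]                          # polyphase component of the filter
        # Input delayed by lam samples, then downsampled: u(m) = x(mM - lam).
        u_lam = np.concatenate((np.zeros(lam), x))[::M]
        branch = np.convolve(u_lam, h_lam)         # low-rate filtering
        y[:len(branch)] += branch
    return y

rng = np.random.default_rng(2)
x = rng.standard_normal(50)
h = rng.standard_normal(17)
assert np.allclose(decimate_polyphase(x, h, 4), decimate_direct(x, h, 4))
```

Both forms produce the same output, but the polyphase form never computes the M-1 out of every M filter outputs that the downsampler would discard.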
Figure 2.15 Transposition of LTI polyphase network
Figure 2.16 Polyphase decimation filter
For the polyphase interpolator structure, the concept of transposition of an LTI network can be
used to transform the LTI polyphase network of Figure 2.15(a) into the equivalent transposed
polyphase structure of Figure 2.15(b) (for M=L). Again, replacing the anti-imaging filter in
Figure 2.11(f) with the transposed polyphase filter of Figure 2.15(b) results in Figure 2.17(a).
Applying the noble identities [Cro83] to the structure, the upsamplers are moved across the
polyphase network, leaving the polyphase filters at the lower sampling rate. The resulting
computationally efficient polyphase interpolator structure is shown in Figure 2.17(b).
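The dual equivalence for interpolation can be sketched in the same way. In the illustration below, each sub-filter operates on the low-rate input and the branch outputs are interleaved, which plays the role of the PSC. The function names are hypothetical, and the zero-padding of the filter to a multiple of L is an assumption made only so that all branches have equal length.

```python
import numpy as np

def interpolate_direct(x, h, L):
    """Direct form: zero-stuff by L, then filter at the high rate."""
    v = np.zeros(len(x) * L)
    v[::L] = x
    return np.convolve(v, h)

def interpolate_polyphase(x, h, L):
    """Polyphase form: L low-rate sub-filters followed by interleaving."""
    h = np.asarray(h, dtype=float)
    h = np.concatenate((h, np.zeros(-len(h) % L)))   # pad to a multiple of L
    branches = [np.convolve(x, h[lam::L]) for lam in range(L)]
    # Output sample nL + lam is taken from branch lam at low-rate index n.
    return np.stack(branches, axis=1).reshape(-1)

rng = np.random.default_rng(3)
x = rng.standard_normal(10)
h = rng.standard_normal(7)
direct = interpolate_direct(x, h, 4)
poly = interpolate_polyphase(x, h, 4)

# The two outputs agree on their common length; any remaining samples are zero.
n = min(len(direct), len(poly))
assert np.allclose(direct[:n], poly[:n])
assert np.allclose(direct[n:], 0) and np.allclose(poly[n:], 0)
```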
Figure 2.17 Polyphase interpolation filter
The polyphase decomposition technique is also applicable to rational rate conversion cases. 
For the conversion ratio of L/M, where L and M  are respectively the interpolation and 
decimation factors and are assumed co-prime, Hsiao derived a highly efficient sampling rate 
converter based on the polyphase decomposition of decimation and interpolation filters 
[Hsi87]. Bi determined the minimum delay requirement for the structure [Bi92]. This
structure is illustrated in Figure 2.18 where the branches of the lattice are polyphase
sub-filters (some of them might be cascaded with pure delays).
Figure 2.18 Polyphase-matrix rational sampling rate converter
2.6 Frequency demultiplexing
Conceptually, frequency demultiplexing for FDM signals can be carried out in the reverse order
of the signal forming, that is, bandpass filtering followed by sampling rate reduction. This
process is depicted in Figure 2.19. The channel filter definition depends on the actual
characteristics of each channel. For uniform FDM channels the channel filters are shifted 
versions of a common prototype lowpass filter which makes it possible to share computations 
amongst all channels. Otherwise, the sharing will be very difficult, if not impossible, unless 
multistage demultiplexing approaches are adopted (for example, in the case of octave FDM 
demultiplexing).
Figure 2.19 Frequency demultiplexing
2.7 Summary
Computationally efficient signal processing tends to move bandpass signals to baseband by 
means of frequency translation or multirate signal processing.
A signal (system) has the property of frequency translation invariance if its equivalent 
lowpass is conjugate symmetric.
The first-order (uniform) bandpass sampling theorem shows that sampling a bandpass signal 
can be a pitfall without careful consideration of relations between the sampling frequency, the 
centre frequency, and bandwidth. We have shown that to have maximum tolerance to
frequency uncertainty (or equivalently, sampling frequency variation), $\frac{1}{4}f_s$ (or $-\frac{1}{4}f_s$)
stacking of the carrier, i.e., $f_s = 4f_c/(2n-1)$, n the wedge order, is required.
The frequency translation invariant property allows an efficient RICF complex FIR structure
which saves multiplications by approximately a factor of two compared to the
direct implementation scheme. Another efficient complex filter structure introduced in this
chapter is the DRFU structure, which requires even fewer multiplications than the RICF
structure. The DRFU structure will be used to simplify filter banks.
Three real and two complex (quadrature) sampling schemes are discussed. For frequency 
DEMUX in MCDD, we can use simple baseband or passband real sampling schemes to 
reduce complexity at the analog side.
Homogeneous (odd or even) channel stacking of sampled FDMs can be achieved via bandpass
sampling if the centre and sampling frequencies are carefully chosen. A trivial π/2-frequency
shift can be used for the odd-even, or even-odd stacking conversion.
The concept of multirate signal processing is reviewed. In particular, the principle and 
structures of two types of commutators, SPC and PSC, are introduced. Finally, a generic 
frequency demultiplexing problem is formulated.
Chapter 3 
Multirate Filter Banks and 
Frequency Demultiplexing
To make satellite channels cost competitive with optical cables, the use of small, inexpensive earth stations with reduced antenna size and HPA power will be needed.
This will necessitate the use of high EIRP and G/T multibeam satellites with OBP 
capability. In this chapter, we briefly summarize DEMUX approaches and structures 
applicable to MCDD from the viewpoint of multirate filter banks. Similarities and differences 
between the terrestrial transmultiplexers and the multicarrier demultiplexers are discussed. 
The effect of the channel stacking on DEMUX structure is discussed. To be able to compare 
and derive various DEMUX filter banks, two generic complex-modulated functional 
DEMUX models, the lowpass and the bandpass models, are introduced. By modifying the 
conventional PMDFT filter bank, a pipeline PMDFT structure is proposed. Finally, flexible 
DEMUX architectures are reviewed and studied.
3.1 Transmultiplexing and demultiplexing
The key function of frequency demultiplexing is channel isolation (channelization), which it
shares with transmultiplexing, the direct translation between FDM and TDM formats in
terrestrial systems. Most OBP demultiplexing approaches are therefore a natural evolution
of terrestrial transmultiplexing methods [Kwa90]. However, there are significant
differences between the two. Transmultiplexing (TMUX) refers to the signal format
conversions from TDM to FDM, or vice versa, to provide an interface between analog
transmission paths (in FDM) and digital switching facilities (in TDM) in terrestrial telephony
systems. The primary concerns of transmultiplexing are the stability (under looped
conditions), group delay, and crosstalk. These requirements are quite different from those of
on-board multicarrier demultiplexing, whose main function is channel isolation (discussed
later). Scheuermann and Göckler [Sch81] gave a comprehensive survey of the
conventional transmultiplexing methods and identified four categories of TMUX structures:
bandpass filter bank method, lowpass filter bank method, Weaver-structure method, and
multistage modulation method. The first two are considered block processing approaches in
which channel filtering can be shared amongst all channels. They belong to a class of 
multirate filter bank structure which was later recognized as the polyphase DFT filter banks. 
The Weaver-structure method is a per-channel approach in which multistage 
decimation/interpolation filters are used to reduce computational complexity. The multistage 
modulation method performs the transmultiplexing in stages relaxing the stringent filtering 
requirements which are otherwise demanded by single stage approaches.
A baseband switching satellite requires the demultiplexing and demodulation of the up-link 
carriers before they can be switched to the designated down-link beams. Hence the 
multicarrier demultiplexing usually refers to one direction translation: from FDM to TDM. 
Unlike terrestrial TMUXs, a multicarrier DEMUX performs not only the channelization, but 
also the sampling rate alteration to match the sampling rate required by demodulation 
(integer multiples of the symbol rate). A major difficulty associated with multicarrier
DEMUXs is their high computational and hardware complexity. There exist CCITT standards
for terrestrial multiplex and transmultiplexing equipment. This is not the case for multicarrier
DEMUX because the DEMUX specifications are affected by such factors as the system 
multiple access scheme, frequency plan, and the modulation scheme, etc. that vary from 
system to system. Being used for different applications, TMUX and DEMUX have different 
design optimization criteria. For TMUX design, the objective is to minimize crosstalk and 
group delay whereas in a DEMUX design the system bit-error-rate (BER), or, equivalently, 
the signal-to-distortion noise ratio (SDR) at the DEMUX output is of primary concern, where 
the distortion noise arises mainly from the aliasing products due to sampling rate decimation 
and also from signal quantization and clipping in ADC [Bjo93].
Another difference between the two is that the terrestrial TMUXs normally involve only real 
processing whereas the DEMUXs in MCDDs are basically complex processing devices as the 
multicarrier demodulators often require complex signal format.
3.2 Demultiplexing function of MCDDs
3.2.1 Real and complex FDM
Unlike TMUXs for terrestrial applications, in which input/output signals are real, DEMUXs 
in MCDDs are required to produce complex samples for demodulation with either real or
complex input signals. For uniform K-channel FDMs centred near f=0 with channel spacing
W, the sampling frequencies for a complex and a real signal can be respectively given by
$$f_s = (K + 2G - 2k_0 + 1)W = K'W \tag{3.1a}$$
and
$$f_s = 2(K + 2G - 2k_0 + 1)W = 2K'W \tag{3.1b}$$
where G is the number of guard-bands on the lower and upper ends of the frequency band of
interest and $k_0 = 0$ (even-stacking) or $k_0 = 0.5$ (odd-stacking). The channel spacing is related to
the channel symbol rate R and the roll-off factor α [Pro89] of the channels by
$$W = 2R(1 + \alpha) \tag{3.2}$$
The main advantage of using a complex input signal is the reduced input sampling rate (halved
compared to that of the real signal, as shown in Eq. (3.1)), hence relaxing the stringency of
the channel filters (allowing a wider normalized transition band, thus fewer taps). However, an
additional filtering stage (Hilbert transform) is required to convert the inherently real-valued
sampled FDM signal at I.F. into complex form. Thus the choice affects the overall
complexity of the MCDD. The insertion of such an additional filtering stage is justified if the
increased complexity for real-to-complex conversion is compensated by the reduction in
DEMUX complexity.
3.2.2 Rational sampling rate decimation
For MCDD applications, it would be most convenient for the multicarrier demodulators to
have the input sampling rate (which is the output sampling rate $f_o$ of the preceding DEMUX)
equal to a harmonic of the symbol rate R, for instance, 2R, or 4R for quaternary phase shift
keying (QPSK) modulated channel signals. Assuming that the required input sampling rate
for the demodulator is D×R, D the number of samples per symbol, and referring to Eqs. (3.1)
and (3.2), the DEMUX is generally required to perform a rational sampling rate decimation
M/L, which is determined by
$$\frac{M}{L} = \frac{f_s}{f_o} = \frac{2K'}{D}(1 + \alpha) \tag{3.3a}$$
for complex FDM input, and
$$\frac{M}{L} = \frac{f_s}{f_o} = \frac{4K'}{D}(1 + \alpha) \tag{3.3b}$$
for real FDM input.
A DEMUX filter bank for complex FDM is under-decimated if the rational decimation factor
M/L<K', critically-decimated if M/L=K', or over-decimated if M/L>K'. Similarly, for a
DEMUX filter bank accepting real FDM, the conditions of under-, critically-, and
over-decimation are respectively M/L<2K', M/L=2K', and M/L>2K'. Clearly, according to Eq.
(3.3), a DEMUX is always over-decimated for D=1 or 2 and under-decimated for D>4.
Although over-decimation can cause aliasing to the demultiplexed channel signals, the effect
can be reduced by suppressing components outside the protected channel passband, for
example, by choosing a small roll-off factor α.
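The bookkeeping of Eqs. (3.1) to (3.3) is simple enough to sketch with exact rational arithmetic. The helper below is hypothetical and follows the equations as reconstructed in this section (in particular W = 2R(1+α)); the example parameter values are illustrative and are not taken from the thesis.

```python
from fractions import Fraction

def demux_decimation(K, G, alpha, D, odd_stacking=False, real_input=False):
    """Return (K', M/L, regime) following Eqs. (3.1)-(3.3)."""
    k0 = Fraction(1, 2) if odd_stacking else Fraction(0)
    K_prime = int(K + 2 * G - 2 * k0 + 1)
    slots = 2 * K_prime if real_input else K_prime           # f_s / W
    # f_s / f_o with f_s = slots * W, W = 2R(1 + alpha) and f_o = D * R.
    ratio = Fraction(2 * slots, D) * (1 + Fraction(alpha))
    if ratio < slots:
        regime = "under-decimated"
    elif ratio == slots:
        regime = "critically-decimated"
    else:
        regime = "over-decimated"
    return K_prime, ratio, regime

# Illustrative values: 6 channels, one guard-band per side, even-stacking,
# 50 % roll-off and 2 samples per symbol at the DEMUX output.
K_prime, ratio, regime = demux_decimation(K=6, G=1, alpha=Fraction(1, 2), D=2)
M, L = ratio.numerator, ratio.denominator
assert (K_prime, M, L, regime) == (9, 27, 2, "over-decimated")
```

Because `Fraction` reduces the ratio to lowest terms, the numerator and denominator give co-prime M and L directly.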
3.2.3 Channel stacking requirement
Channel stacking is only relevant to uniform DEMUX/TMUX problems. Consider a complex
modulated filter bank [Cro83, Fli94]. If the input signal (or equivalently the channel filter) is
modulated by complex exponentials uniformly sampled on the unit circle, indexed by
k=0,1,...,K'-1, we call it a uniformly modulated DEMUX filter bank. A frequency multiplex
generally requires complex channel filters for all the frequency slots except for even-stacking
cases, where a real lowpass filter is sufficient for the first frequency slot. When polyphase
DFT structures are considered, even-stacking FDMs allow the use of the ordinary DFT (or fast
Fourier transform (FFT)) instead of the more complicated generalized DFT (GDFT) [Cro83]. We
will see later that, under the assumptions of uniform channel spacing and polyphase DFT
structure, the channel stacking requirements are different for real and complex FDMs.
3.2.3.1 Complex FDMs
Basically, there is no restriction on the channel stacking of complex FDMs since the total 
frequency slots (hence the number of channel filters) is K', which is already the minimum 
size for the DFT. It is however desirable, from an implementation point of view, to have 
even-stacking signals before demultiplexing so that the resulting polyphase sub-filters could 
be all real.
3.2.3.2 Real FDMs
As has been mentioned earlier, another advantage of using even-stacking signals is that the
polyphase filter bank consists of real sub-filters. It, however, cannot be exploited by real
FDMs with even-stacking because the even-stacking real FDM signals shown in Figures 3.1(b)
and 3.1(d) will have only half of their channels centred on the even (or odd) grid consisting of
points 2W apart, on which the K' channel filters are allocated. Consequently, half of the channels
are lost. This is in contrast to the case of demultiplexing odd-stacking real FDMs (Figures
3.1(a) and 3.1(c)), in which the complex modulated channel filters, being allocated 2W apart,
can select all the K' channels and allow a smaller DFT (K'-point).
3.2.3.3 Odd-to-even stacking conversion
With the above observations, we envisage that it is advantageous to have even-stacking
FDMs (except for the real, even-stacking case) so as to be able to use real polyphase
sub-filters and consequently to reduce the complexity of the uniform DEMUX filter bank. When
the sampled FDM is not in the desired even-stacking format, the simple one-sided frequency
translations introduced in Section 2.4.3.3 can be used to convert it into even-stacking.
Consider an odd-stacking complex FDM. If modulated with $e^{j\frac{2\pi}{K'}(i+0.5)n} = W_{K'}^{(i+0.5)n}$, i integer, the
signal will be converted to even-stacking. Preferably, we want the complex exponential to be
trivial, that is, $W_{K'}^{(i+0.5)n} = e^{j\pi n} = (-1)^n$ (π-frequency shift), or $W_{K'}^{(i+0.5)n} = e^{j\frac{\pi}{2}n} = (j)^n$
(π/2-frequency shift), so that no multiplication and addition are required. It is a simple matter to
show that the trivial modulation conditions are when i=I and K'=2I+1 (K' odd) for
$W_{K'}^{(I+0.5)n} = (-1)^n$, or when i=I and K'=4I+2 (K'/2 odd) for $W_{K'}^{(I+0.5)n} = (j)^n$. Similarly, for an
odd-stacking real FDM signal, any frequency translation of $W_{2K'}^{(i+0.5)n}$ will perform the
odd-to-even conversion. However, it becomes trivial only when i=I and K'=2I+1 (K' odd) such that
$W_{2K'}^{(I+0.5)n} = (j)^n$.
In summary, even-stacking signals are preferred over odd-stacking ones for low complexity 
uniform DEMUX filter banks. The conditions for trivial odd-to-even stacking conversions 
are given in Table 3.1.
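The trivial-shift conditions can be checked numerically. The sketch below is illustrative only: it assumes $W_{K'} = e^{j2\pi/K'}$ as in the modulation factors above, and the function name and parameter values are hypothetical.

```python
import numpy as np

def odd_to_even_shift(K_prime, i, n, real_input=False):
    """Modulation sequence W^((i + 0.5) n).

    Assumption: W = exp(j*2*pi/K') for complex FDM, exp(j*2*pi/(2K')) for real FDM.
    """
    base = 2 * K_prime if real_input else K_prime
    return np.exp(2j * np.pi * (i + 0.5) * n / base)

n = np.arange(8)
# Complex FDM, K' = 2I + 1 = 5 with i = I = 2: pi-shift, i.e. (-1)^n.
assert np.allclose(odd_to_even_shift(5, 2, n), (-1.0) ** n)
# Complex FDM, K' = 4I + 2 = 6 with i = I = 1: pi/2-shift, i.e. j^n.
assert np.allclose(odd_to_even_shift(6, 1, n), 1j ** n)
# Real FDM, K' = 2I + 1 = 5 with i = I = 2: pi/2-shift, i.e. j^n.
assert np.allclose(odd_to_even_shift(5, 2, n, real_input=True), 1j ** n)

# An odd-stacked carrier at slot k + 0.5 lands on the integer slot k + i + 1.
K_prime, k, i = 5, 1, 2
n = np.arange(4 * K_prime)
tone = np.exp(2j * np.pi * (k + 0.5) * n / K_prime)
shifted = tone * odd_to_even_shift(K_prime, i, n)
peak_bin = int(np.argmax(np.abs(np.fft.fft(shifted))))
assert peak_bin == 4 * ((k + i + 1) % K_prime)
```

Since the shift sequences take values only in {1, -1, j, -j}, applying them amounts to sign changes and swaps of real and imaginary parts.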
Table 3.1 Odd-to-even conversion with trivial frequency shift
Stacking   real/complex   freq. shift   K'          illustration
even       real           N/A           N/A         Figs. 3.1(b) and 3.1(d)
even       complex        0             any K'      Fig. 3.1(g)
odd        real           π/4           K'/2 odd    Fig. 3.1(c)
odd        real           π/2           K' odd      Fig. 3.1(a)
odd        complex        π/2           K'/2 odd    Fig. 3.1(e)
odd        complex        π             K' odd      Fig. 3.1(f)
Figure 3.1 Channel stacking for real and complex FDM signals: (a) real, odd-stacking, K' odd;
(b) real, even-stacking, K' odd; (c) real, odd-stacking, K'/2 odd; (d) real, even-stacking, K' odd;
(e) complex, odd-stacking, K'/2 odd; (f) complex, odd-stacking, K' odd;
(g) complex, even-stacking, K' even
3.2.4 Demultiplexing function
In its most general form, a DEMUX filter bank (complex or real input) can be described as a 
complex-modulated filter bank. All DEMUX filter bank structures can be derived from two 
complex-modulated functional models to be given below.
3.2.4.1 Lowpass model
In a direct approach, the frequency DEMUX function can be generically represented by an array
of frequency shifts followed by a bank of lowpass fractional decimation filters as shown in
Figure 3.2(a). The former performs the spectral alignment to centre the desired channel at
f=0 whilst the latter suppresses the unwanted channels and decimates the sampling rate
accordingly. To derive the optimum block-processing filter bank structures, we conceptually
place a channel filter on each of the dummy channels (the lower and upper guard-bands) as
illustrated in the figure, where one dummy channel is assumed on each side. Thus the total
number of FDM channels is K', as defined earlier in Eq. (3.1), instead of the number of
information channels K. Consequently, the real lowpass filter operates at a high sampling
rate of LK'W for complex FDM, or 2LK'W for real FDM. A typical characteristic of the
prototype channel filter is illustrated in Figure 3.2(b), in which D=2 (2 samples per symbol).
Thus the k-th demultiplexed channel output of the lowpass filter bank in Figure 3.2 can be
expressed by
$$y_k(m) = \sum_{i=-\infty}^{\infty} x(i)\, W_{K'}^{-(k+k_0)i}\, h(mM - iL) \tag{3.4}$$
where gcd(L,M)=mL-/M, or 1 corresponding to the even-stacking or odd-stacking case.
This DEMUX function model has been frequently used to describe and to derive various 
DEMUX algorithms.
Figure 3.2 Lowpass uniform DEMUX function: (a) lowpass model,
(b) prototype channel filter ($f_o = 2R$, α=0.5)
3.2.4.2 Bandpass model
Sometimes the DEMUX function can be more conveniently described by a complex
modulated filter bank from which more efficient filter bank structures can be derived. The
k-th channel is redrawn in Figure 3.3(a). The frequency shift can be moved across the up-sampler by
changing the phase-shift from $\frac{k+k_0}{K'}2\pi$ to $\frac{k+k_0}{LK'}2\pi$ (Figure 3.3(b)). The frequency shift
can pass through the LTI filter, making the filter complex modulated, and further through the
down-sampler, changing the phase-shift from $\frac{k+k_0}{LK'}2\pi$ to $\frac{(k+k_0)M}{LK'}2\pi$. The resulting
channel filter is shown in Figure 3.3(c).
Figure 3.3 Derivation for bandpass DEMUX model
Hence, the lowpass DEMUX model of Figure 3.2 is equivalent to the bandpass model shown 
in Figure 3.4. Being mathematically equivalent, both function models can lead to identical 
optimal filter bank structures, viz., polyphase DFT filter banks. However, in some 
circumstances, one can be more convenient in deriving efficient structures and more 
comprehensible than the other.
Figure 3.4 Bandpass DEMUX function
The k-th demultiplexed channel output of the bandpass filter bank in Figure 3.4 can be
expressed by
$$y_k(m) = W_{LK'}^{-(k+k_0)Mm} \sum_{i=-\infty}^{\infty} x(i)\, h_k(mM - iL) \tag{3.5}$$
where the bandpass channel filter $h_k(n)$ is the frequency translated version of a prototype
lowpass filter h(n) and is given by
$$h_k(n) = h(n)\, W_{LK'}^{(k+k_0)n} \tag{3.6a}$$
or, in the z-transform
$$H_k(z) = H\!\left(zW_{LK'}^{-(k+k_0)}\right). \tag{3.6b}$$
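The equivalence of the lowpass and bandpass models can be confirmed numerically for a single channel. The sketch below is an illustration under stated assumptions: it uses $W_N = e^{j2\pi/N}$ and the sign conventions of Eqs. (3.4) to (3.6) as reconstructed here, a causal FIR prototype, and hypothetical function names and parameter values.

```python
import numpy as np

def lowpass_model(x, h, k, k0, K_prime, L, M):
    """Shift channel k to baseband, upsample by L, lowpass filter, downsample by M."""
    i = np.arange(len(x))
    shifted = x * np.exp(-2j * np.pi * (k + k0) * i / K_prime)
    v = np.zeros(len(x) * L, dtype=complex)
    v[::L] = shifted
    return np.convolve(v, h)[::M]

def bandpass_model(x, h, k, k0, K_prime, L, M):
    """Upsample by L, filter with the modulated filter h_k, downsample by M, then shift."""
    n = np.arange(len(h))
    h_k = h * np.exp(2j * np.pi * (k + k0) * n / (L * K_prime))
    v = np.zeros(len(x) * L, dtype=complex)
    v[::L] = x
    y = np.convolve(v, h_k)[::M]
    m = np.arange(len(y))
    return y * np.exp(-2j * np.pi * (k + k0) * M * m / (L * K_prime))

rng = np.random.default_rng(4)
x = rng.standard_normal(40) + 1j * rng.standard_normal(40)
h = np.hamming(25)                    # placeholder prototype; any FIR taps will do
params = dict(k=3, k0=0.5, K_prime=8, L=2, M=3)
assert np.allclose(lowpass_model(x, h, **params), bandpass_model(x, h, **params))
```

The check holds for any prototype because the two models differ only in where the complex exponential is applied, not in the filter design.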
As shown in Figure 3.4, the filter banks derived from this bandpass model require additional 
frequency-shifts to move the channel signal back to baseband. Referring to Eq.(3.3), the
required frequency shifts become trivial or could even disappear for some specific values of
a  and D. These values are listed in Table 3.2 and Table 3.3 for even-stacking and odd- 
stacking cases respectively. The simplest case is obviously when a=0.5 which makes the 
offset frequency shift either a trivial 7t/2-shift, Tt-shift, or completely disappear. It should be 
noted that the real FDM in Table 3.2 is originally in odd-stacking, as required by uniform 
DEMUXs. The even-stacking format is obtained using the odd-to-even conversion discussed 
in the previous section, which consequently changes the signal to complex.
Table 3.2 Trivial frequency shift: even-stacking (D=2)
roll-off α                              0.125   0.250   0.375   0.500   0.625   0.750   0.875
complex: $W_{LK'}^{-M(k+k_0)n}$         π/4     π/2     π/4     π       π/4     π/2     π/4
real:                                   π/2     π       π/2     0       π/2     π       π/2
Table 3.3 Trivial frequency shift: odd-stacking (D=2)
roll-off α                              0.125   0.250   0.375   0.500   0.625   0.750   0.875
complex: $W_{LK'}^{-M(k+k_0)n}$         π/8     π/4     π/8     π/2     π/8     π/4     π/8
real: $W_{2LK'}^{-M(2k+k_0)n}$          π/4     π/2     π/4     π       π/4     π/2     π/4
Although the channel filters in the above two demultiplexing models can be recursive (IIR)
[Bel74], linear phase non-recursive (FIR) filters are assumed hereafter for MCDD
applications in order to minimize the group delay distortion in the DEMUX and prevent
excessive intersymbol interference (ISI).
3.3 Demultiplexing approaches for OBP applications
The demultiplexing approaches can be generally categorized into block processing and
non-block processing categories. The former is only relevant to uniform channel spacing cases
and allows the channel filtering to be shared amongst all the channels by the use of the DFT (or
FFT). It is therefore computationally very efficient. The latter is more appropriate to
demultiplexing non-uniform channels, when (complete) sharing of channel filtering is
impossible. Thus it is less computationally efficient. On the other hand, the demultiplexing
methods can also be classified into single stage and multistage approaches depending on
whether the multicarrier demultiplexing function is done in a single step or in several
consecutive steps.
In this section, a brief survey on multicarrier demultiplexing approaches is given. A modified 
PMDFT filter bank structure is introduced. Finally, issues on flexible DEMUX structures are 
addressed.
3.3.1 Per-channel approach
In the per-channel method, an individual channel filter is dedicated to each of the channels.
Hence no sharing of computations is possible. This approach is considered a direct method
and is appropriate for arbitrary non-uniform demultiplexing problems where sharing of
computations is hardly attainable. Though computationally inefficient, the method is the
most versatile since each channel is processed separately and can thus have its bandwidth
and impulse response specified independently. From a fault-tolerance point of view, this
approach degrades gracefully since the failure of any individual channel filter will not
affect the others. Nevertheless, the high complexity of the architecture makes it unattractive
for applications where a large number of channels is involved, for example, mobile
missions where the number of channels could run into thousands. The pass bands of the
bandpass channel filters are non-overlapping and the bandwidths are in general different from
each other. Hence, in general, different rational decimation rates are required for different
channels. More details about the approach can be found in [Del89] and [Ana92].
3.3.2 Block processing approaches
The frequency demultiplexing for OBP applications is required to be highly computationally
efficient so that the payload can accommodate the MCDD equipment within limited space,
mass, and power. Block processing allows the channel filtering to be shared amongst all the
channels (hence it is also called the multichannel method [Yim91a]) to achieve computational
efficiency and to reduce complexity. Basically, the block processing approaches can be
classified into the time-domain based polyphase DFT approaches and the frequency-domain
based fast convolution approaches, depending on the domain in which the convolutions of
channel filtering are realized.
3.3.2.1 Maximally decimated Polyphase DFT filter banks
The bandpass and lowpass TMUX approaches discussed earlier belong to this category. The
concept of the polyphase filter bank was first introduced by Bellanger et al. in an attempt to
design a low-complexity 60-channel TMUX [Bel74]. The idea of polyphase decomposition of
signals and LTI networks in the time domain was later developed into the theory of polyphase
transforms, which forms the foundation of a variety of polyphase DFT filter banks and block
processing algorithms (hence these filter banks and algorithms are considered time-domain
approaches) [Cro83].
Consider a maximally decimated uniform analysis (demultiplex) filter bank as shown in
Figure 3.5, which is a special case of the bandpass filter bank of Figure 3.4 with L=1 and
M=K'. The channel filters are complex modulated versions of a common prototype filter
$H_0(z)$. They are given by
$H_k(z) = H(zW_{K'}^{k+k_0}), \quad k = 0, 1, \ldots, K'-1$ (3.7)
Although $0 \le k_0 < 1$ can be arbitrary, we normally restrict it to $k_0=0$ for even channel
stacking, or $k_0=0.5$ for odd-stacking in practice.
Figure 3.5 Bandpass model for maximally decimated polyphase DFT filter bank
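Using the polyphase representation (Eq.(2.48)) for the channel filter, the k-th sub-signal
$Y_k(z)$ in the filter bank can be expressed as
$Y_k(z) = H_k(z)X(z) = \sum_{i=0}^{K'-1} W_{K'}^{-ki}\, z^{-i} H_i^{(p)}(z^{K'})\, X(z)$ (3.8)
Or, equivalently, in vector and matrix representation,
$\mathbf{Y}(z) = \mathbf{F}^{*}_{K'} \cdot \mathbf{H}^{(p)}(z^{K'}) \cdot X(z)$ (3.9)
where the column vectors
$\mathbf{Y}(z) = [Y_0(z), Y_1(z), \ldots, Y_{K'-1}(z)]^T$,
$\mathbf{H}^{(p)}(z^{K'}) = [H_0^{(p)}(z^{K'}), z^{-1}H_1^{(p)}(z^{K'}), \ldots, z^{-(K'-1)}H_{K'-1}^{(p)}(z^{K'})]^T$,
and $\mathbf{F}^{*}_{K'} = [\varphi^{*}_{ki}]_{K' \times K'} = [W_{K'}^{-ki}]_{K' \times K'}$ is the complex conjugate of the DFT matrix. The polyphase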
components $H_i^{(p)}(z)$ are the polyphase components of $H_0(z)$ as defined in Eq.(2.47) and
Eq.(2.51). These sub-filters become real for even-stacking cases if $H_0(z)=H(z)$ is real.
Figure 3.6 shows the structural interpretation of Eq.(3.9), in which, in every output sampling
period ($K'$ times the input sampling period), the output vector from the polyphase filter bank
undergoes an inverse DFT (apart from the scaling factor $1/K'$).
Figure 3.6 Maximally decimated polyphase DFT filter bank
The computational advantages of this structure lie in that
• a single prototype filter is shared amongst all the channels;
• the polyphase decomposition of channel filters moves the complex modulations of the 
bandpass channel filters into lumped operations of DFT following the polyphase filtering; 
and,
• the filtering and the DFT are performed at the lower sampling rate side further reducing 
the computational load.
A similar filter bank structure can be found in reference [Cor90], in which a K/2-point DFT
and a K/2-point inverse DFT (IDFT), instead of a single K-point DFT, were used.
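The sharing of computation described above can be checked numerically. The following is a minimal sketch, assuming even stacking ($k_0=0$), a prototype length that is a multiple of the number of channels K, and zero input outside the recorded samples; the function names are illustrative and not taken from the thesis. It compares K separately modulated and decimated channel filters with K shared sub-filters followed by an unscaled inverse DFT.

```python
import cmath

def sample(x, n):
    """x(n) with zeros assumed outside the recorded samples."""
    return x[n] if 0 <= n < len(x) else 0.0

def direct_bank(x, h, K):
    """Reference: K modulated band-pass filters h(n)*exp(j*2*pi*k*n/K), each decimated by K."""
    n_out = len(x) // K
    out = []
    for k in range(K):
        hk = [h[n] * cmath.exp(2j * cmath.pi * k * n / K) for n in range(len(h))]
        out.append([sum(hk[n] * sample(x, m * K - n) for n in range(len(h)))
                    for m in range(n_out)])
    return out

def polyphase_dft_bank(x, h, K):
    """Same outputs from K shared sub-filters followed by an unscaled inverse DFT."""
    assert len(h) % K == 0, "prototype length is assumed to be a multiple of K"
    taps_per_branch = len(h) // K
    n_out = len(x) // K
    out = [[0j] * n_out for _ in range(K)]
    for m in range(n_out):
        # Branch i filters the commutated input x(m*K - i) with p_i(r) = h(r*K + i).
        u = [sum(h[r * K + i] * sample(x, (m - r) * K - i) for r in range(taps_per_branch))
             for i in range(K)]
        # K-point inverse DFT without the 1/K scaling, shared by all channels.
        for k in range(K):
            out[k][m] = sum(u[i] * cmath.exp(2j * cmath.pi * k * i / K) for i in range(K))
    return out

def max_abs_diff(a, b):
    """Largest magnitude difference between two lists of channel outputs."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
```

The direct form spends one full-length filter per channel at the input rate, whereas the polyphase form runs the prototype taps once per output sample and leaves the channel separation to the DFT, which is where an FFT would be substituted in practice.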
3.3.2.2 Polyphase Matrix DFT filter banks
The polyphase DFT filter banks discussed in the previous section are maximally decimated
filter banks. They may not be suitable for some MCDD applications where rational sampling
rate conversion is required. To deal with the rational decimation problems associated with
MCDDs, Yim et al. proposed a novel DEMUX structure known as polyphase matrix DFT,
based on a time-domain representation of uniform filter banks [Yim92a].
Consider the lowpass model for the uniform DEMUX function in Figure 3.2. Assuming that the
input is complex FDM, the k-th demultiplexed channel output signal can be expressed by
$y_k(n) = \sum_{i=-\infty}^{\infty} x(i)\, e^{-j2\pi F_k i}\, h(nM - iL)$ (3.10)
where $F_k = f_k/f_s$ is the normalized centre frequency (with respect to the input sampling
frequency $f_s$) of the k-th channel.
Rather than assuming the input sampling frequency $f_s$ to be an integer multiple of the channel
spacing W, the input and output sampling frequencies are assumed to be integer multiples of a
frequency resolution $W_0$, i.e., $f_s = MW_0$ and $f_{out} = LW_0$, and all the carrier frequencies are
also on this frequency resolution grid: $f_k = (k_0 + kJ)W_0$, $k = 0, 1, \ldots, K-1$, where $k_0W_0$
($k_0$ integer) is the centre frequency of the first channel and J, M, and L are positive integers.
The channel spacing is thus $W = JW_0$, and J and M are assumed co-prime.
It can be shown that the polyphase decomposed k-th channel signal is given by
$y_k(lL+p) = \sum_{q=0}^{M-1} \sum_{m=-\infty}^{\infty} x_q(m)\, h_{p,q}(l-m)\, e^{j2\pi (k_0+kJ)q/M} = \sum_{q=0}^{M-1} v_{p,q}(l)\, W_M^{-(k_0+kJ)q}$ (3.11)
where $x_q(m) = x(mM-q)$, $q = 0, 1, \ldots, M-1$, is the q-th branch signal of the M-commutated
input signal (referring to Figure 3.8); $h_{p,q}(n) = h(nLM+pM+qL)$ is one of the decomposed L×M
sub-filters of the prototype channel filter h(n); and the intermediate signals
$v_{p,q}(l) = \sum_{m=-\infty}^{\infty} x_q(m)\, h_{p,q}(l-m)$, $p = 0, 1, \ldots, L-1$, $q = 0, 1, \ldots, M-1$, are the
outputs from the polyphase matrix sub-filters, which are shared by all channels (because the
$v_{p,q}(l)$ are independent of the channel index k). We assume the prototype lowpass filter h(n)
to be of FIR type. The filter length is constrained to be a multiple of LM in order to avoid
(periodically) time-varying sub-filters (a similar constraint is necessary for polyphase
DFT filter banks).
The PMDFT structure corresponding to Eq.(3.11) is shown in Figure 3.7. A detailed proof of
the PMDFT filter bank can be found in Appendix C.
Compared with the maximally decimated polyphase DFT structure, the PMDFT structure has a
computational efficiency similar to that of a polyphase filter bank, but an M-point (instead
of K-point, M>K) DFT is required, giving rise to increased complexity in the DFT processor.
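The polyphase matrix decomposition can be checked against the lowpass model numerically. The sketch below assumes the lowpass model $y_k(n) = \sum_i x(i)\,e^{-j2\pi F_k i}\,h(nM-iL)$ with $F_k = (k_0+kJ)/M$, the definitions $x_q(m) = x(mM-q)$ and $h_{p,q}(n) = h(nLM+pM+qL)$ given in the text, and zero signal outside the stored samples; the function names are hypothetical.

```python
import cmath

def at(seq, n):
    """seq(n) with zeros assumed outside the stored range."""
    return seq[n] if 0 <= n < len(seq) else 0.0

def direct_output(x, h, L, M, c, n):
    """y(n) from the lowpass model; c = k0 + k*J is the carrier index on the W0 grid."""
    return sum(x[i] * cmath.exp(-2j * cmath.pi * c * i / M) * at(h, n * M - i * L)
               for i in range(len(x)))

def branch_outputs(x, h, L, M, l, p):
    """v_{p,q}(l) for q = 0..M-1; these do not depend on the channel index."""
    m_stop = len(x) // M + 2  # covers every m with 0 <= m*M - q < len(x)
    return [sum(at(x, m * M - q) * at(h, (l - m) * L * M + p * M + q * L)
                for m in range(m_stop))
            for q in range(M)]

def channel_output(v, M, c):
    """Combine the shared branch outputs for one channel (one bin of an unscaled M-point inverse DFT)."""
    return sum(v[q] * cmath.exp(2j * cmath.pi * c * q / M) for q in range(M))
```

Because `branch_outputs` is independent of the carrier index, its result can be computed once per output sample and reused for every channel; only `channel_output` is channel specific, and evaluating it for all carriers at once is the M-point DFT referred to above.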
Figure 3.7 PMDFT filter bank structure
3.3.2.3 Modified PMDFT filter banks
Consider the bandpass DEMUX model shown in Figure 3.4. Its k-th channel is shown in
Figure 3.3(c). For simplicity, $k_0=0$ (even-stacking) is assumed. Thus, the intermediate signal
$y_k(l)$ in the z-domain is
$Y_k(z) = H(zW_{LK'}^{k})\, X(z^L)$ (3.12)
We assume that the prototype lowpass filter H(z) is of FIR type and has a length that is an
integer multiple of LMK'. Thus it can be polyphase decomposed into
$H(z) = \sum_{i=0}^{LMK'-1} z^{-i} H_i^{(p)}(z^{LMK'})$ (3.13)
where $H_i^{(p)}(z)$ are the polyphase components of H(z), as defined in Eq.(2.48).
Substituting Eq.(3.13) into Eq.(3.12) yields
$Y_k(z) = \sum_{i=0}^{LMK'-1} z^{-i}\, W_{LK'}^{-ik}\, H_i^{(p)}(z^{LMK'})\, X(z^L)$ (3.14)
Changing the variable i to $i = rLM+pM+q$, $r = 0, 1, \ldots, K'-1$, $p = 0, 1, \ldots, L-1$, and
$q = 0, 1, \ldots, M-1$ gives
$Y_k(z) = \sum_{r=0}^{K'-1} \sum_{p=0}^{L-1} \sum_{q=0}^{M-1} z^{-rLM-pM-q}\, W_{LK'}^{-(rLM+pM+q)k}\, H^{(p)}_{rLM+pM+q}(z^{LMK'})\, X(z^L)$
$\phantom{Y_k(z)} = \sum_{r=0}^{K'-1} \sum_{p=0}^{L-1} z^{-rLM-pM}\, W_{LK'}^{-(rLM+pM)k}\, s_p^{(r)}(z)$ (3.15)
where $s_p^{(r)}(z)$ is defined as
$s_p^{(r)}(z) = \sum_{q=0}^{M-1} z^{-q}\, W_{LK'}^{-qk}\, H_{rpq}(z^{LMK'})\, X(z^L)$ (3.16)
$H_{rpq}(z)$ is the z-transform of the polyphase sub-filter $h_{rpq}(n) = h(nLMK'+rLM+pM+q)$.
Assuming $M > LK'$ (over-decimation, which is the case for MCDD applications), Eq.(3.16)
shows that $s_p^{(r)}(z)$ can be obtained by first passing the interpolated input signal through an
array of polyphase sub-filters (M in total), combining the M outputs into LK' signals (due to
the periodicity of $W_{LK'}$), and then processing the signals with an LK'-point DFT, as shown in
the front part of Figure 3.8(a).
Figure 3.8(a) shows one of the terms of the outer summation in Eq.(3.15). Although only the
first K' outputs from the LK'-point DFTs are useful and the rest can be discarded, we
deliberately extend the frequency shift array to include all outputs from the LK'-point DFTs
in order to be able to share a single DFT (instead of L) in this stage. This can be done by
first inserting a cascaded inverse DFT and DFT pair at the output side of Figure 3.8(a), which
does not change the system at all, and then moving the IDFT backward across the delay-add
network, as is
shown in Figure 3.8(b). This move is justified by the memoryless nature of the DFT/IDFT
(combinational devices) and the linearity of the sub-system. It can be shown that the
DFT, frequency shift array, and IDFT network is equivalent to a simple switch network which
requires no mathematical operations at all (see Appendix D). Hence Figure 3.8(c) results.
Figure 3.8 Interpretation for the inner summation of Eq.(3.15)
We can further move the shared DFT at each stage forward across the outer delay-add
network of Eq.(3.15), leading to a shared LK'-point DFT for all the stages, as shown in
Figure 3.9.
Figure 3.9 Interpretation for Eq.(3.15)
Figure 3.10 Pipeline PMDFT structure
Although the channel filters are shared and the number of DFT processors is minimized to
one in Figure 3.9, the structure is still computationally inefficient because the computations
are performed at the high sampling rate. To further optimize the structure, we can move the
down-samplers in the figure backward to right after the input tapped delay array in Figure 3.9,
forming an M-branch SPC. This is allowed because the network following the input tapped
delay is divisible by $z^{M}$. It will be shown in chapter 4 that the order of the up-sampler and
the SPC can be changed with a minor modification when M and L are relatively prime.
Similarly, the up-samplers can be moved across the polyphase sub-filters, leaving the
filtering to be performed at the low sampling rate. Again, the move is allowed because the
sub-filters are divisible by $z^{L}$. Finally, we get the computationally efficient polyphase-DFT
filter bank structure shown in Figure 3.10(a).
The main structural difference between this modified polyphase DFT and the conventional
PMDFT is that, instead of using a single-stage polyphase filter bank, the polyphase matrix is
broken down into K' such stages in pipeline. The structure of Figure 3.10(a) can be seen as a
semi-systolic array, since the polyphase signals are broadcast globally to all the PEs, and it
is redrawn in Figure 3.10(d). The processing element is defined by Figures 3.10(b) and
3.10(c). Hence the modified structure has improved parallelism in computation and improved
structural modularity, making it more suitable for VLSI implementation, especially when the
number of channels K (hence K') is large. Another improvement is the reduced DFT size (by
a factor of M/(LK')), because over-decimation multicarrier DEMUXs (i.e., M>LK') are often
required for MCDD applications.
3.3.3 Frequency domain block processing
The concept of fast convolution [Opp75, Jac89] in the frequency domain can be directly
applied to demultiplexing problems, leading to computationally efficient fast convolution
block processing architectures which are particularly suitable for demultiplexing frequency
multiplexes with a large number of channels and/or with non-uniform channel
spacing [Cam88, Cam90a, Say92, San92].
The principle of this approach is that the channel filtering in DEMUXs can be performed in
the frequency domain by pairwise multiplying the Fourier-transformed filter coefficients with
the short-time FFT of the signal. The filtered output is obtained by a short-time inverse FFT
(IFFT). Since fast convolution is only valid for circular convolutions, segmentation and
overlapping techniques are necessary when it is applied to linear convolutions [Opp75, Jac89].
Basically, there are two linear fast convolution structures, namely the overlap-add and
overlap-save methods. For a linear convolution of a signal with an N-tap FIR filter through
fast convolution (circular convolution) of length M, the overlap-add method uses
non-overlapping input segments of length M-N. Each M-point IFFT output segment overlaps
with the following segment by N samples. Adding the overlapped segments produces the
desired output of the linear convolution (hence the name overlap-add). In the overlap-save
method, on the other hand, it is the input segments (of length M) that are overlapped by N
samples, whilst the output segments from the IFFT are truncated to be non-overlapping and
then concatenated.
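The overlap-save procedure described above can be sketched as follows. This is a minimal illustration, not an efficient implementation: a naive O(M²) DFT stands in for the FFT, the input is assumed to be zero before its first sample, and the function names are hypothetical. Following the text, consecutive input segments overlap by N samples and the first N outputs of each block are discarded, which is one sample more than the N-1 that circular wrap-around strictly requires.

```python
import cmath

def dft(v, inverse=False):
    """Naive DFT (or inverse DFT with 1/n scaling), used here in place of an FFT."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [o / n for o in out] if inverse else out

def overlap_save(x, h, M):
    """Linear convolution of x with the N-tap filter h using length-M circular convolutions."""
    N = len(h)
    hop = M - N                      # new input samples consumed per block
    assert hop > 0, "block length M must exceed the filter length N"
    H = dft(list(h) + [0.0] * (M - N))
    padded = [0.0] * N + list(x)     # N zeros of history for the first block
    y = []
    start = 0
    while len(y) < len(x):
        block = padded[start:start + M]
        block += [0.0] * (M - len(block))
        Y = dft([a * b for a, b in zip(dft(block), H)], inverse=True)
        y.extend(Y[N:])              # drop the wrapped-around samples, keep M-N outputs
        start += hop
    return y[:len(x)]
```

In a demultiplexer the same forward transform would be shared by all channels, with one frequency-domain multiplication and one (possibly shorter) inverse transform per channel.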
For generality, consider the non-uniform frequency demultiplexing problem with
non-overlapping channel bandwidths. The channel definition filters are illustrated in Figure
3.11(a), in which the spectra of the channel filters are effectively zero in the stop bands and
the frequency resolution is assumed to be $\Delta f = 2\pi/M$. The channel bandwidths are thus
$N_i \Delta f$. For a critically decimated DEMUX with integer decimation, M must be a
common integer multiple of all the $N_i$, i.e., $M = N_i M_i$, where $M_i$ is the decimation factor for
the i-th channel. The integer decimation at the output of each demultiplexed channel makes it
possible for the IFFT size to be reduced by a factor equal to the decimation factor if we
restrict $M-N$ to be an integer multiple of the decimation factors, that is, $M-N = L_i M_i$.
Given the above definitions and conditions, the non-uniform DEMUX structures using the
overlap-add and overlap-save fast convolution approaches are shown in Figures 3.11(b) and
3.11(c) respectively. It appears that the overlap-save fast convolution architecture is simpler
than the overlap-add architecture. The fast convolution approaches are unpopular for uniform
channel spacing applications because no decisive advantages over the polyphase-DFT method
have been reported [Yim91a].
Figure 3.11 Fast convolution non-uniform DEMUX architectures: (a) channel definition filter; (b) overlap-add structure; (c) overlap-save structure
3.3.4 Analysis-synthesis method
In this method, the frequency multiplex is first decomposed into sub-channels (subbands) 
with identical bandwidth (for example, comparable to 64 kb/s) through an analyzer which is
in concept a bank of bandpass decimation filters. Individual channel signals are then 
synthesized by grouping (multiplexing) the subbands via bandpass interpolation filters. Since 
the synthesizers are to reconstruct the channel signals from the subbands, the property of 
perfect reconstruction (PR) is required in theory for the maximally decimated analysis- 
synthesis filter bank [Vai87, Vai90].
The theory of PR is based on the principle of lossless systems. In the context of
Quadrature Mirror Filter (QMF) banks, losslessness means that the analysis and synthesis
matrices are required to be paraunitary, which allows both the analysis and synthesis filters
to be FIR and of equal length, leading to an overall system transfer function that is a pure
delay (hence perfect reconstruction). Since the analysis and synthesis matrices must take the
pseudo-circulant form for complete aliasing cancellation [Vai90], filtering cannot be shared
amongst the analysis and synthesis filters. As a result, the PR M-band QMF filter bank is not
computationally efficient.
Alternatively, the analysis (and the synthesis) filters can be equally spaced complex-modulated
versions of a common prototype lowpass filter. This allows the decomposition of the
analysis matrix into a cascade of a diagonal matrix (polyphase filter bank) and a DFT matrix,
and the decomposition of the synthesis matrix into a DFT-diagonal matrix cascade. It
leads to the optimal structures of polyphase-DFT (or DFT-polyphase for the synthesizer). A
general analysis-synthesis filter bank structure is shown in Figure 3.12. Apparently, the
synthesizers are not necessary for uniform channel spacing. Hence the approach is only
relevant to non-uniform filter banks.
Figure 3.12 Partial reconstruction analysis-synthesis demultiplexer
Although computationally efficient, polyphase-DFT analysis-synthesis filter banks like
the one shown in Figure 3.12 are generally considered unsuitable for perfect reconstruction
unless the PR constraint is relaxed. This is because, for complete aliasing cancellation, the
synthesis filters in the polyphase-DFT analysis-synthesis filter banks are required to be either
recursive (IIR), which may cause stability problems, or FIR with a filter length excessively
longer than that of the analysis filters [Vai90].
Alternatively, an almost perfect reconstruction criterion [FH94] can be adopted to relax the PR
constraint while the computationally efficient polyphase-DFT structures can
still be used. This leads to the so-called modified polyphase-DFT bank, in which the adjacent
subband spectrum aliasing is exactly canceled whilst the non-adjacent spectrum aliasing is
suppressed by the stopband attenuation of the analysis and synthesis filters [Fli93].
Another way to achieve acceptable reconstruction of channel signals from subbands using 
polyphase-DFT structures is to design the analysis and synthesis filters such that the 
equivalent lowpass filter for each channel is the ideal brickwall type with real coefficients 
[Kwa92]. In doing so, the aliasing is effectively suppressed by the ideal filters.
In frequency demultiplexing applications the decimation factors in a non-uniform DEMUX are
often rational, which makes perfect reconstruction of the non-uniform analysis-synthesis
banks even more complicated than in integer decimation cases. However, perfect
reconstruction is still possible if the analysis and synthesis matrices (with different sizes) are
made lossless [Kov93]. This approach is, as pointed out earlier, not computationally
efficient. Liu and Bruton [Liu93] introduced a direct frequency-domain design method for
non-uniform analysis-synthesis banks. This method uses the same set of FIR filters for
the analysis and synthesis filters and, in addition, two banks of lowpass rational sampling
filters for the analyzer and synthesizer respectively. Almost perfect
reconstruction is achieved by a numerical approach.
3.3.5 Multistage approach: tree structures
A tree structured DEMUX is a multistage demultiplexing device which successively divides
the signal spectrum if the number of channels is a composite number. The number of stages
of a tree is determined by the number of factors into which the number of channels is
factorized. That is, if the number of channels K can be represented by
$K = \prod_{i=0}^{m-1} K_i$, then an m-stage tree can be designed for channel separation. There are
$\prod_{j=0}^{i-1} K_j$ identical $K_i$-way DEMUXs at the i-th stage (a single one at the first stage)
and m stages altogether. This channelization scheme is illustrated by Figure 3.13.
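As a small worked example of the stage bookkeeping, the sketch below counts the DEMUX nodes per stage for a given factorization, assuming that every output of one stage feeds exactly one DEMUX of the next stage. The function name is illustrative only.

```python
def demux_counts(factors):
    """Return (number of K_i-way DEMUXs at each stage, total number of channels K)."""
    counts = []
    nodes = 1                 # a single DEMUX at the first stage
    for k in factors:
        counts.append(nodes)
        nodes *= k            # every output of this stage feeds one DEMUX of the next
    return counts, nodes
```

For instance, `demux_counts([2, 2, 2])` gives `([1, 2, 4], 8)`: an 8-channel binary tree needs seven 2-way DEMUXs in total.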
Figure 3.13 Tree structure
The advantage of the tree structure is that the computation rate is significantly reduced due
to the multistage decimation filtering [Cro83] in the structure. Since critically sampled
signals require tight channel filter specifications, the signals input to the tree node DEMUXs
are usually oversampled to relax the filter specifications.
Because the intermediate nodes at different stages are different in channel bandwidth, this 
structure can also be used for some non-uniform demultiplexing problems.
The most commonly used tree type DEMUX is the binary tree structure, which requires the
number of channels to be an integer power of two and an over-sampling factor of two for
each stage except the first one, at which the input is a critically sampled real signal [Yim88,
Goc88, Del88, Eys90, Qi92a, Sec92, Aue93]. This particular tree structure is shown in
Figure 3.14(a). Each node of the binary tree is called a band-splitting filter (BSF), which
equally splits the spectrum into lower and upper halves with a pair of lowpass and highpass
filters followed by a downsampler of factor two. For the following demultiplexing stage to
correctly split the spectrum, a π-frequency shift is necessary for the decimated highpass
output. The filters in a BSF can be real or complex. Generally, as illustrated in Figures
3.14(b) and 3.14(c), a complex BSF allows a large transition bandwidth, hence relaxed filter
specifications with short filter lengths, whereas a real BSF needs long filter lengths to meet a
sharp transition band requirement. The structure of the complex BSF is, however, more
complicated than that of the real BSF. Furthermore, a half-band filter (which has all its
odd-ordered tap coefficients, except the central one, set to zero) can be used as the prototype
filter for both complex and real lowpass and highpass filters in a BSF, further reducing the
computation by a factor of two.
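A real band-splitting node can be illustrated with a short sketch. It assumes a Hamming-windowed ideal half-band lowpass prototype (a design chosen here only for illustration, not the filter used in the thesis), derives the highpass filter by shifting the prototype by π, decimates both paths by two, and applies the π shift to the decimated highpass output. Only steady-state outputs are computed, so start-up transients are excluded.

```python
import math

def half_band_lowpass(num_taps=31):
    """Hamming-windowed ideal lowpass with cut-off pi/2 (num_taps assumed odd)."""
    c = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        d = n - c
        if d == 0:
            ideal = 0.5
        elif d % 2 == 0:
            ideal = 0.0       # half-band property: zero taps at even offsets from the centre
        else:
            ideal = math.sin(math.pi * d / 2) / (math.pi * d)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def band_split(x, h):
    """Split x into decimated lower-half and upper-half band signals."""
    c = (len(h) - 1) // 2
    g = [(-1) ** abs(n - c) * h[n] for n in range(len(h))]   # highpass: prototype shifted by pi
    low, high = [], []
    for m in range(len(h) // 2, len(x) // 2):                # steady-state outputs only
        t = 2 * m                                            # decimation by two
        low.append(sum(h[n] * x[t - n] for n in range(len(h))))
        hp = sum(g[n] * x[t - n] for n in range(len(h)))
        high.append((-1) ** m * hp)                          # pi shift after decimation
    return low, high

def energy(v):
    return sum(abs(s) ** 2 for s in v)
```

Feeding a tone in the lower half of the band should leave almost all of the energy in `low`, and a tone in the upper half should appear almost entirely in `high`.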
Figure 3.14 Binary tree DEMUX
As will be seen in chapter 4, by use of multirate digital signal processing techniques, the
filtering within a complex BSF can be completely shared by the lowpass and highpass paths
and operates at the decimated rate, leading to the optimal BSF structure. A similar result can
also be obtained for the real BSF.
Apart from its simple structure and low computation rate, the modularity of the structure is
another advantage, which allows very efficient VLSI implementations. This feature has been
fully exploited in our ASIC design for an 8-channel tree type DEMUX which will be 
discussed in full detail in chapter 7.
3.4 Flexible multicarrier DEMUX architectures
Future satellite communication systems mandate full traffic allocation flexibility, such as
changeable and/or non-uniform channel bandwidths. However, flexibility and computational
efficiency often turn out to be conflicting requirements. For example,
the per channel approach is perhaps the most flexible DEMUX architecture, since the same
channel filtering structure can be used for all the channels if it is made programmable, but it
is the least efficient in computation as no computations can be shared. The polyphase-DFT
structure, on the other hand, is considered the most computationally efficient among all the
existing demultiplexing structures. It, however, suffers from a lack of flexibility because
uniform channel spacing is a must and the DFT size and the filter length are closely related to
the number of channels. The other demultiplexing methods discussed previously can provide
flexibility to some degree. The per channel approach, though flexible, is only feasible when
the number of channels is small due to the computational inefficiency.
3.4.1 Per channel-polyphase FFT programmable architecture
A compromise between computational efficiency and flexibility can be made by use of
multistage approaches (e.g., tree structures). Guo and Maral [Guo92] proposed a two-stage
flexible architecture which combines the high flexibility of the per channel approach and the
low computation demand of the polyphase approach. A basic assumption of this approach is
that the total FDM bandwidth is fixed whereas the FDM carriers can have different bandwidths.
With the per channel filters as the first stage, the FDM input is first divided into groups, each
of which has uniformly spaced information channels. Then computationally efficient
polyphase-FFT filter banks are used to further separate these groups. This architecture
requires complete programmability of both the per channel filters and the polyphase-FFT
banks. Hence the most likely implementation scheme for this architecture is through the use
of DSP processors. A customized hardware architecture, though possible, would demand too
complicated a control strategy.
A more practical wideband reconfigurable DEMUX architecture can be found in [Fer91],
where the traffic can be switched between the following three cases: a) 800 channels at 64
Kbps; b) a mix of 400 channels at 64 Kbps and 12 channels at 2.048 Mbps; and c) 24
channels at 2.048 Mbps. The reconfigurable DEMUX consists of two reconfigurable
polyphase-FFT filter banks for 64 Kbps and 2.048 Mbps respectively and a simple 2-channel
polyphase-FFT module which acts as the first demultiplexing stage for the mixed traffic case.
For cases a) and c), the first stage is simply bypassed. Obviously, in any case at least one of
the three polyphase-FFT banks is not used. Hence the flexibility is
achieved at the expense of hardware redundancy.
TRW recently proposed a simple multirate MCDD architecture for VSAT applications
[Can94]. The total FDM bandwidth to be separated is 37.4 MHz, which consists of 26 wide
band channels (2.048 Mbps each), each of which can consist of 32 narrow band channels (64
Kbps each) of which only 28 are used. Thus a fully configured system can handle up to
26x28=728 low-rate channels. The architecture has a fixed wide band 32-channel DEMUX
stage (6 of the 32 wide band channels are on the guard bands and hence are not useful)
followed by an array of narrow band 32-channel DEMUXs (26 in total), which can be
programmed either to process a wide band channel into 32 narrow band channels or simply to
be bypassed, leaving the wide band channel intact. In the extreme case where all the channels
are wide band, all 26 narrow band DEMUXs are bypassed. To reduce complexity, especially
under such high redundancy, the polyphase-FFT structure has been used.
3.4.2 Reconfigurable tree structures
For some non-uniform FDM signals, a tree structure can be used for demultiplexing. In what
follows, we shall give, without proof, the necessary and sufficient conditions for the existence
of a tree structure for a non-uniform signal. Let us consider a complex FDM signal with N
unequal channel bandwidths. Assume that the normalized minimum channel bandwidth is
$2\pi/K$ and that all the other channel bandwidths are integer multiples of this spectral
resolution, viz., for the i-th channel, the channel bandwidth is $B_i = 2\pi n_i/K$, $n_i$ integer. For
a maximally decimated DEMUX filter bank, $\sum_{i=0}^{N-1} B_i = 2\pi$, or equally, $\sum_{i=0}^{N-1} n_i = K$. Thus we
can represent the multichannel signal with its N channel bandwidths by $(n_0, n_1, \ldots, n_{N-1})$, or
equally, with its N+1 channel edge frequencies by $(\omega_0, \omega_1, \ldots, \omega_N)$, since the i-th channel
bandwidth can be expressed by $B_i = \omega_{i+1} - \omega_i$ (note that $\omega_0 = 0$ and $\omega_N = 2\pi$). Then the
necessary and sufficient conditions for the existence of a tree structure for $(n_0, n_1, \ldots, n_{N-1})$
are given as follows:
• $n_i \in \{n'_1, n'_2, \ldots, n'_m = 1\}$, where $n'_i = k_i n'_{i+1}$, $i = 0, 1, \ldots, m-1$, $k_i > 1$ are integers, and $n'_0 = K$;
• the lower and upper edges of the i-th channel (i.e., $\omega_i$ and $\omega_{i+1}$) are on the frequency grid
$2\pi n_i/K$, i.e., $\omega_i \bmod (2\pi n_i/K) = 0$.
When the above conditions are satisfied, an m-stage tree can be constructed with uniform
$k_0$-, $k_1$-, ..., and $k_{m-1}$-DEMUXs as the demultiplexing stages successively from the first stage
to the last one. Accordingly, for example, a multichannel signal (2,1,1,2,6) can be separated
by a tree whereas (2,1,2,1,6) cannot because the second condition above is violated.
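The two conditions can be turned into a small checker. The sketch below is an illustration under stated assumptions: bandwidths are given as integers $n_i$ in units of $2\pi/K$, the stage sizes $k_0, k_1, \ldots$ are supplied by the caller, and the second condition is read as requiring each channel's lower edge to be a multiple of that channel's own bandwidth. The function name and argument layout are hypothetical.

```python
def tree_exists(widths, factors):
    """Check the two tree-existence conditions for bandwidths n_i and stage sizes k_i."""
    K = sum(widths)                   # maximally decimated: the n_i add up to K
    allowed = set()
    size = K
    for k in factors:                 # build n'_1, ..., n'_m from n'_0 = K
        if k <= 1 or size % k != 0:
            return False
        size //= k
        allowed.add(size)
    if size != 1:                     # the chain must end at n'_m = 1
        return False
    edge = 0                          # lower edge of the current channel, in grid units
    for n in widths:
        if n not in allowed:          # condition 1: bandwidth is one of the n'_i
            return False
        if edge % n != 0:             # condition 2: edges lie on the channel's own grid
            return False
        edge += n
    return True
```

With stage sizes (2, 3, 2), so that K=12, `tree_exists((2, 1, 1, 2, 6), [2, 3, 2])` is true while `tree_exists((2, 1, 2, 1, 6), [2, 3, 2])` is false, because the third channel of width 2 would start at grid position 3.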
The flexibility of tree structure lies in the fact that it can accommodate a large number of 
multichannel patterns (traffic modes) as long as they satisfy the above two conditions. For
example, a fully configured two-stage tree with $k_0=3$ and $k_1=2$ (three of them) can be
configured to channelize any one of the following eight traffic modes: (1,1,1,1,1,1),
(2,2,1,1), (2,1,1,2), (1,1,2,2), (2,1,1,1,1), (1,1,2,1,1), (1,1,1,1,2), and (2,2,2). If the 3-channel
DEMUX and the 2-channel DEMUX in the above example are chosen as building blocks of
the tree and if they can be reorganized (reconnected) to form a new tree with $k_0=2$ and $k_1=3$
(two of them), then three additional traffic modes, i.e., (3,1,1,1), (1,1,1,3), and (3,3), can also
be demultiplexed.
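The traffic modes of such a two-stage tree can be enumerated mechanically. The sketch below assumes that each first-stage output is either passed on whole or split by its second-stage DEMUX; the function name is illustrative.

```python
from itertools import product

def two_stage_modes(k0, k1):
    """All traffic modes of a tree with one k0-way DEMUX followed by k0 optional k1-way DEMUXs."""
    modes = set()
    for choice in product((False, True), repeat=k0):
        mode = []
        for split in choice:
            # A first-stage output is k1 grid units wide: keep it whole or split it.
            mode.extend([1] * k1 if split else [k1])
        modes.add(tuple(mode))
    return modes
```

`two_stage_modes(3, 2)` contains the eight modes listed above, and the difference `two_stage_modes(2, 3) - two_stage_modes(3, 2)` yields the three additional modes (3,1,1,1), (1,1,1,3), and (3,3).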
The simplest, and perhaps the most frequently encountered, tree structures are binary trees
with an over-sampling factor of 2 [Yim88, Goc88, Del88, Eys90, Qi92a, Sec92, Aue93]. In
the binary tree case, an m-stage tree can separate a non-uniform frequency multiplex with
contiguous bandwidths which are integer-power-of-2 multiples of the spectrum resolution
$2\pi/(2\times 2^m) = \pi/2^m$ [Gra96], where over-sampling by 2 is assumed. Because of the
over-sampling factor of 2, the frequency grid $2\pi n_i/K$ on which the i-th channel is required to
reside for a critically sampled multichannel signal (as discussed previously) has now
changed to $2\pi n_i/(2K) = 2\pi n_i/(2\times 2^m)$.
3.4.3 Single-stage reconfigurable architecture
The main restriction on the traffic mode in the above two types of flexible architectures is the 
requirement that the channel bandwidths are integer multiples of each other. If the channel 
bandwidth ratio is rational, viz., $B_i/B_j = q/p$, $i \neq j$, $q \neq 1$, and $\gcd(p,q)=1$, the flexible DEMUX 
architectures introduced previously are in general not applicable. In this case, frequency 
filtering or an analysis-synthesis filter bank structure can be employed. Furthermore, if the 
structures are made programmable, for example, an FFT with variable size and a reconfigurable 
filtering stage, then one could have complete control over the filter bank specifications 
and accommodate traffic changes.
Since programmability/configurability is demanded at a high level, which requires major 
changes in hardware architecture (for instance, when the FFT/FIR size is changed) and a 
complicated memory management/control strategy, a customized hardware architecture can 
hardly meet the requirements at such a large scale. Hence a DSP-oriented implementation 
seems a promising alternative for this purpose. However, a DSP implementation may 
have difficulty handling traffic with a large (or even moderate) number of channels and with high 
transmission rates.
3.5 Summary
1. Unlike terrestrial TMUXs, the on-board multicarrier DEMUXs perform not only the 
channelization, but also sampling rate alteration for subsequent demodulation. This 
feature demands different design criteria for multicarrier DEMUXs from those of 
TMUXs.
2. Allowing complex FDM input can reduce the DEMUX computational load as a result of 
the reduced input sampling rate. The DEMUX complexity can be also reduced because 
less stringent channel filters can be used.
3. In general, even channel stacking FDMs are advantageous (except for the real, 
even-stacking case) over odd-stacking ones for low complexity uniform DEMUX filter banks 
because they allow the use of the ordinary DFT (FFT) instead of the more complicated GDFT 
and real sub-filters (polyphase filters) instead of complex ones. When the sampled FDM 
signal is not in the desired even-stacking form, trivial one-sided frequency translations, e.g., a $\pi$, 
$\pi/2$, or $\pi/4$ frequency shift, can be used to convert it into an even-stacking one.
4. DEMUX filter bank structures can be derived from two complex-modulated function 
models: the lowpass and the bandpass generic DEMUX filter banks. We have found that 
the latter is more convenient to be used to derive optimal DEMUX filter bank structures 
than the former.
5. In the survey of multicarrier demultiplexing approaches we conclude that
• for uniform and fixed traffic, computationally efficient block processing methods 
such as polyphase FFT and fast convolution approaches should be sought;
• tree type multistage DEMUX approaches are useful when some limited flexibility 
is required and/or when hardware implementation is considered for their highly 
modular and simple structures; and
• per-channel and analysis-synthesis approaches are considered only when the 
flexibility is of primary concern and the number of channels is small.
6. Compared with the conventional PMDFT structure, the modified PMDFT structure has the merits 
of a reduced DFT size, better suitability for VLSI implementation due to good 
modularity, and fast processing due to its semi-systolic and pipelined properties.
7. Flexible DEMUX architectures were reviewed and studied in this chapter. The per 
channel approach, though the most flexible, is only feasible when the number of channels 
is small due to its poor computational efficiency. To accommodate less restricted traffic 
changes, such as allowing a rational bandwidth ratio, single-stage flexible architectures such 
as the reconfigurable/programmable frequency filtering or analysis-synthesis filter bank 
structure can be used. For traffic changes restricted to an integer bandwidth ratio, two-stage 
approaches such as the per channel-polyphase DFT and reconfigurable tree architectures can 
be sought. Among the above three types of flexible architectures, two-stage approaches are 
believed to be the most computationally efficient due to the use of the polyphase DFT structure (a 
tree node is a polyphase DFT), hence they are preferred wherever applicable.
Chapter 4 
Multirate Signal Flow Graph 
Approach to Multirate Network 
Optimization
4.1 Introduction
It has been shown that multicarrier demultiplexing requires an immense amount of computation on board the satellite, e.g., the computation load for a DEMUX of eight 19.2 kHz 
channels would be well above $10^8$ multiplications per second if the per channel 
approach is used [Gar85], hence substantially increasing the demand for payload mass and 
power. The use of digital signal processing techniques can significantly reduce the computation 
load. For example, using polyphase DFT filter banks in a single stage, or in multiple stages, can 
drastically reduce DEMUX complexity as compared with direct (per channel) approaches.
The approaches and procedures to derive optimal DEMUX structures, however, have received less 
attention in the literature, and a systematic design methodology is lacking. Commonly used 
approaches in multirate filter bank design are based on mathematical derivation and 
manipulations, sometimes aided with conventional signal flow graph transforms. The main 
advantage of the mathematical approach lies in its conciseness. It could, however, obscure some 
important structural information which can be vital for obtaining efficient multirate network 
structures [Vai90, Yim91a, Kov93]. In this chapter, a systematic approach to the optimization 
of multirate filter bank structures is introduced. The approach is based on the multirate 
signal flow graph (MSFG) representation and transforms for linear multirate systems [Qi96]. It
has the advantages of presenting clear structural information and is free of tedious 
mathematical manipulations. The proposed MSFG approach is found useful to derive 
optimized structures for multicarrier DEMUXs with minimized complexity and power 
consumption as well as maximized throughput. Though targeted at seeking optimal 
multicarrier DEMUX structures, the MSFG approach is universally applicable to other linear 
multirate networks and filter banks.
As direct applications of the MSFG approach, three design examples of optimal multicarrier 
DEMUX structures for OBP satellite systems are given to show the effectiveness of the 
approach. The first example is the derivation of an optimal 7-channel DEMUX for an 
MF-TDMA system. The resulting structure is a polyphase DFT filter bank which achieves the 
theoretical minimum computation rate. The second example shows an optimal complex BSF 
structure for the binary tree structure. This BSF structure also reaches the theoretical minimum 
computation rate by polyphase decomposing the prototype filter and at the same time exploiting 
its symmetry. The third example is a simple optimal real BSF structure which can 
be used in transmultiplexing applications.
4.2 MSFG
The signal flow graph representation renders explicit structural information, making it suitable for 
hardware structure mapping. Nevertheless, the conventional signal flow graph (SFG) 
representation, which suits linear time-invariant systems [Rob62, Opp75], proves inadequate in 
a multirate environment due to the linear periodically time-varying nature of linear multirate 
systems [Cro83].
We introduce a multirate signal flow graph representation for multirate systems, as a 
complement to conventional design approaches and an extension to the conventional SFG. An 
MSFG provides a more direct and clearer link to the hardware structure and the parallelism of 
filter banks than does the pure mathematical representation. Being extended from the 
conventional SFG, the MSFG preserves most of the properties of the former. For the multirate 
environment, identities and transforms of the MSFG are identified and defined. With the 
introduction of a set of short-hand notations for the MSFG, the flow graph manipulation and 
transformation problems can be considerably simplified and the optimized DEMUX structures 
can be derived with ease.
A signal flow graph is traditionally defined by a set of branches and nodes in which the former 
define the signal operations and the latter define the connection points of these branches in the 
structure [Opp75, Cro83]. Similarly, an MSFG also consists of branches and nodes. But 
branches in MSFG are restricted to be LTI, just as those in the conventional SFG, and all the 
non-linear and time-varying (e.g., modulation, downsampling, upsampling, etc.) operations are 
defined by node functions. The reason that we attribute the non-linear and multirate operations
to node functions is that all the “irregularities” of the MSFG, with respect to the 
conventional SFG, can be handled by identities and transforms associated with 
nonlinear/time-varying nodes. Thus special care is only needed when those 
nonlinear/time-varying nodes are involved. Since most of the non-linear and multirate operations are trivial in 
digital networks, such as down-sampling, up-sampling, sampling & hold, commutation, etc., these 
node functions can be defined as basic operations from an implementation point of view. For 
non-trivial node functions, such as modulation (multiplication by a signal sequence), transforms can 
be applied to convert them into trivial ones.
4.3 Basic multirate operators and MSFG node functions
Conceptually, the fundamental difference between classical DSP and multirate DSP lies in the 
sampling rate alteration which is not allowed in the former. As has been mentioned in chapter 
3, the most fundamental multirate operators are downsampler (DS) and upsampler (US) which 
respectively decrease and increase the sampling rate in multirate DSP. From DS and US, two 
important rate-changing components are defined. They are serial-to-parallel commutator 
(SPC), which decreases the sampling rate by decomposing the incoming signal into a group of 
sub-signals, and parallel-to-serial commutator (PSC), which increases the sampling rate by 
combining a group of signals.
There are some other multirate operators frequently encountered in digital networks, but 
commonly ignored in the DSP literature. A sampling & hold (SH) does not actually change the 
sampling rate of a signal. It retains (holds) the signal value at the sampling point a number of 
times (the SH factor). An upsampling & hold (USH), on the other hand, does increase the 
sampling rate of a signal by repeating (holding) every signal element of the input signal L times, 
where L is the USH factor (see Table 4.1). Just as the upsampler has the downsampler as its 
dual operation, the USH also has a dual operation, which is the integral & dump (IDU). The 
IDU operator is defined by the transposition of the USH. An IDU with decimation factor M 
integrates (accumulates) every M signal elements and dumps the integral to the output, reducing 
the sampling rate by a factor of M. Unlike the SH node, a sampling (SA) node samples the 
incoming sequence without holding the sampled signal elements, as illustrated in Table 4.1, 
where the node value $k_N$ denotes the sampling instances n that satisfy $\langle n-k \rangle_N = 0$.
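The six single-input operators described above can be sketched on finite Python lists as follows. This is a minimal illustration, not the thesis's implementation: the function names are illustrative, and the block alignment of the IDU (blocks starting at n = 0) is an assumption.

```python
def upsample(x, L):
    """US: insert L-1 zeros after every sample (rate increases by L)."""
    y = [0] * (len(x) * L)
    y[::L] = x
    return y


def downsample(x, M):
    """DS: keep every M-th sample, starting at n = 0 (rate decreases by M)."""
    return x[::M]


def upsample_hold(x, L):
    """USH: repeat every sample L times (rate increases by L)."""
    return [v for v in x for _ in range(L)]


def integrate_dump(x, M):
    """IDU: sum each block of M samples (assumed to start at n = 0)."""
    return [sum(x[i:i + M]) for i in range(0, len(x) - M + 1, M)]


def sample_hold(x, N):
    """SH: hold the value taken at n = 0, N, 2N, ... (rate unchanged)."""
    return [x[n - n % N] for n in range(len(x))]


def sampler(x, N, k=0):
    """SA: keep x(n) where (n - k) mod N == 0, zero elsewhere (rate unchanged)."""
    return [v if (n - k) % N == 0 else 0 for n, v in enumerate(x)]


x = [1, 2, 3, 4, 5, 6]
print(upsample(x[:3], 2))       # [1, 0, 2, 0, 3, 0]
print(downsample(x, 2))         # [1, 3, 5]
print(upsample_hold(x[:3], 2))  # [1, 1, 2, 2, 3, 3]
print(integrate_dump(x, 3))     # [6, 15]
print(sample_hold(x, 3))        # [1, 1, 1, 4, 4, 4]
print(sampler(x, 3, k=2))       # [0, 0, 3, 0, 0, 6]
```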
Table 4.1 MSFG nodes
(The block diagram notation, short-hand symbol, and output waveform columns are graphical and are not reproduced.)

MSFG node function               Abbreviation   Node nature
modulator, sequence φ(n)         MOD            multiplicative
ordinary node                    OD             additive
upsampler                        US             additive
downsampler                      DS             additive
sampling & hold                  SH             additive
upsampling & hold                USH            additive
integral & dump                  IDU            additive
sampler (n = iN + k, i integer;  SA             additive
  waveform example N=3, k=2)
serial-to-parallel commutator    SPC            additive
parallel-to-serial commutator    PSC            combinative
Apart from the above eight multirate operators, namely, US, DS, SH, USH, SPC, PSC, IDU, 
and SA, two single-rate nodes are also defined in MSFG. They are the ordinary node (OD) 
which is exactly the same as that in conventional SFG and modulation node (MOD) which 
performs signal modulation (multiplication).
These MSFG nodes can be classified into two types: additive and multiplicative. Signals 
coming into an additive node are summed at the node. Nodes in a conventional SFG are all 
additive. The added feature of the additive node in MSFG is the sampling function and
sampling rate alteration. Thus an essential requirement associated with the additive node is that 
all signals coming into the node must have the same sampling rate. The SPC and PSC can be 
considered as special type of additive node in that SPC node is additive and rate-changing and 
allows fixed number of output branches with different output signals whereas the PSC node is 
“additive” only in a sense that the allowed fixed number of in-coming signals are rate-up- 
converted and “added” (combined) forming the output signal.
The only multiplicative node in MSFG is the modulation node at which all the signals coming 
into the node are multiplied together. In multirate signal processing, modulation nodes have, 
almost without exception, only two inputs: the signal to be modulated and the periodic 
modulating sequence. It will be shown later that a multiplicative node can be replaced by an 
MSFG network where all the nodes are additive.
Having moved all nonlinear and time-varying operations to nodes, the branch transfer functions 
(transmittances) between any two nodes are left linear and time-invariant allowing any linear 
and time-invariant operators moving across the branch freely.
All the MSFG node functions are listed and illustrated in Table 4.1.
4.4 MSFG transformation
A multirate DSP system can be transformed into equivalents with different structures via 
the flow graph transformations given in this section. The most fundamental relationships 
between multirate operators have been given in the form of identities (the noble identities) [Cro83, 
Vai90, Fli94].
The most fundamental and important result in multirate DSP theory is, perhaps, the concept of 
polyphase decomposition of signals and networks. Polyphase filter transform can lead to 
computationally efficient filter bank structures. We extend the polyphase decomposition 
concept to modulation circuits leading to the modulation polyphase decomposition transform 
(MPDT) and its variations. As will be shown later in an example, the use of MPDT, combined 
with the polyphase filter transforms, can simplify a modulated filter bank giving rise to a highly 
efficient DFT filter bank structure.
Since the upsampling and downsampling processes are often realised via PSC and SPC (as in 
polyphase decomposition of filters and modulators) in multirate DSP networks, one will 
inevitably deal with various combinations of commutators and other signal processing 
components in multirate networks. We have identified a set of identities associated with the 
cascades of commutators with filters, modulators, and single up/down-samplers, which are 
found very useful in multirate DSP network transformations.
The mapping of a multirate DSP algorithm into the (hardware) implementation structure can 
thus be done by MSFG transformations. We shall show how these MSFG identities and 
transforms are applied to derive efficient DEMUX structures in the examples to follow.
4.4.1 The noble identities
The noble identities are considered the most fundamental characteristics of multirate systems, 
with which most MSFG identities and transforms can be derived. Some noble identities are 
shown in Figure 4.1.
Figure 4.1 Noble identities
The identities (a) to (d) are obvious because of the linear assumption of MSFG. The noble 
identity (e) shows that the down-sampler (with factor M) can be moved forward across any 
LTI system and it can be moved backward across an LTI system if, and only if, the system is
divisible by $z^M$, to avoid non-realisable fractional delays. Similarly, the noble identity (f)
suggests that an up-sampler (with factor L) can move backward freely across any LTI system 
and move forward across an LTI system only when it is divisible by $z^L$.
4.4.2 Cascade of samplers
The identities of Figures 4.2(a) and (b) show that down-sampling and up-sampling can be
performed in stages if the factors are composite. This is evident from the definitions of 
down-sampling and up-sampling. Identity 4.2(c) shows that for an upsampler-downsampler, or 
downsampler-upsampler, cascade to be commutable the necessary and sufficient condition is 
that the two rate converting factors L and M must be co-prime, i.e., gcd(L,M)=1. This 
property has been addressed in many textbooks, hence no proof is given here. When the 
upsampler and the downsampler have the same factor, say M, the order cannot be changed. If 
upsampling precedes downsampling, the $M-1$ zeros inserted after each signal sample by the 
upsampling process are discarded by the downsampling process, leaving the signal intact, as 
shown in Figure 4.2(d). If a delay of an integer multiple of M (say q) is inserted between the 
up- and down-samplers in Figure 4.2(d), according to the noble identities, the delay can be 
moved across the samplers as shown in Figure 4.2(f). For any inserted delay which is not an 
integer multiple of M, as shown in Figure 4.2(e), the output samples will be constantly zero 
because at the sampling points of the downsampler the signal elements are all zeros. In 
contrast, if the downsampling is performed first, the following upsampling will insert $M-1$ 
zeros after each sample retained by the downsampling, hence the process is equivalent to the 
modulation of the signal with a sampling sequence as shown in Figure 4.2(g). It is clearly a 
sampling (SA) node with offset equal to zero according to the SA node definition in 
Table 4.1. When this sampling process is cascaded with a delay which is an integer multiple of 
M (or, more generally, an LTI filter which is divisible by $z^M$) on either side of the network, the 
delay can move across the downsampler-upsampler cascade freely as shown in Figure 4.2(h). 
This property can be proved by simply applying the noble identities.
[Figure 4.2, panels (a) to (h): graphics not reproduced. Panel (c) is marked “iff gcd(M, L) = 1”; the sampling sequence in panel (g) is $\phi(n) = 1$ for $n = 0, \pm M, \pm 2M, \cdots$ and 0 otherwise; panel (h) shows the delay $z^{-kM}$ on either side of the downsampler-upsampler cascade.]
Figure 4.2 Sampler-cascade identities
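The noble identities (e) and (f) and the sampler-cascade identities can be checked numerically on finite sequences. The sketch below is an illustration under simple assumptions: signals are zero before n = 0, filters are causal FIR, and the helper names are illustrative. Here "divisible by $z^M$" is taken to mean that the filter is a function of $z^M$, i.e. its impulse response is non-zero only at multiples of M.

```python
def upsample(x, L):
    y = [0] * (len(x) * L)
    y[::L] = x
    return y


def downsample(x, M):
    return x[::M]


def delay(x, d):
    """Delay by d samples, keeping the original length."""
    return [0] * d + x[:len(x) - d]


def fir(h, x):
    """Causal FIR filtering, output truncated to len(x)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


x = list(range(1, 25))
h = [1, -2, 3]
M, L = 2, 3

# Noble identity (e): H(z^M) then down-sample == down-sample then H(z).
noble_e = downsample(fir(upsample(h, M), x), M) == fir(h, downsample(x, M))
# Noble identity (f): up-sample then H(z^L) == H(z) then up-sample.
noble_f = fir(upsample(h, L), upsample(x, L)) == upsample(fir(h, x), L)
# Figure 4.2(c): the samplers commute when gcd(L, M) = 1 ...
commute_coprime = downsample(upsample(x, L), M) == upsample(downsample(x, M), L)
# ... but not when the two factors are equal.
commute_equal = downsample(upsample(x, M), M) == upsample(downsample(x, M), M)
# Figure 4.2(d): up-sampling then down-sampling by M leaves the signal intact.
intact = downsample(upsample(x, M), M) == x
# Figure 4.2(e): a delay that is not a multiple of M yields an all-zero output.
all_zero = not any(downsample(delay(upsample(x, M), 1), M))
# Figure 4.2(f): a delay of q*M between the samplers acts as a delay of q.
q = 2
moved = downsample(delay(upsample(x, M), q * M), M) == delay(x, q)

print(noble_e, noble_f, commute_coprime, commute_equal, intact, all_zero, moved)
# True True True False True True True
```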
4.4.3 Commutator decomposition
Serial-to-parallel and parallel-to-serial commutations can be performed in multiple stages if the
commutation factors are composite. That is, for N-fold commutation, if $N = \prod_{i=0}^{m-1} N_i$, an m-stage
commutation tree can be constructed. Figure 4.3 demonstrates how an SPC and a PSC (N=6 for 
both cases) are decomposed into commutator trees. The numbers in the figure are the sample 
indices of signal elements. This property will be frequently used in MSFG simplification.
[Figure 4.3, panels (a) SPC and (b) PSC: graphics not reproduced. Each panel shows a 6-fold commutator between the serial stream 0, 1, 2, 3, ... and six branches carrying the sample indices (0, 6, 12, ...), (1, 7, 13, ...), (2, 8, 14, ...), (3, 9, 15, ...), (4, 10, 16, ...), and (5, 11, 17, ...), together with its equivalent two-stage commutator tree.]
Figure 4.3 Commutation decomposition
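The decomposition for N = 6 can be illustrated by tracking sample indices only. The sketch below ignores the delays of a causal commutator and assumes that branch i of an N-fold SPC receives the indices i, i+N, i+2N, ...; the choice of a 2-fold first stage followed by 3-fold second stages is one possible factorisation.

```python
def spc_indices(indices, N):
    """Split a list of sample indices into N branches (branch i gets every N-th index from i)."""
    return [indices[i::N] for i in range(N)]


n = list(range(18))

# Single-stage 6-fold commutation.
single = spc_indices(n, 6)

# Two-stage tree: a 2-fold SPC followed by a 3-fold SPC on each output.
two_stage = [branch
             for half in spc_indices(n, 2)
             for branch in spc_indices(half, 3)]

print(single[1])                    # [1, 7, 13]
print([b[0] for b in two_stage])    # [0, 2, 4, 1, 3, 5]
print(sorted(two_stage) == single)  # True
```

The tree delivers the same six polyphase branches as the single commutator, but in a permuted order, so the branch connections have to be relabelled accordingly.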
4.4.4 Commutator cascade
Very often, we have to deal with SPC-PSC and PSC-SPC cascades in MSFG manipulation. 
Since SPC and PSC are dual and perform operations complementary to each other, it is
intuitive to foresee that cascading the two would cancel the effects of the two types of
commutations. Therefore it can be expected that the cascades should be equivalent to simple 
networks connecting the input(s) and the output(s) with, perhaps, some pure delays as a 
consequence of the use of causal PSC (referring to chapter 2). It is simple to verify (e.g., 
graphically) that an SPC-PSC cascade with commutation factor of N is equivalent to a pure 
delay of $N-1$ samples (Figure 4.4(a)). Similarly, a PSC-SPC cascade is equivalent to a switching 
network as shown in Figure 4.4(b). In particular, when a unit delay is inserted between the 
PSC and SPC, the network can be transformed into a delayed connection (with a unit delay) 
between corresponding ports of PSC and SPC (Figure 4.4(c)). The simplest case is the direct 
connection between the corresponding PSC and SPC ports, which occurs when a time advance 
of $z^{N-1}$ is sandwiched between the commutators, as shown in Figure 4.4(d). We shall show 
that the identities of Figure 4.4(a) and Figure 4.4(d) are particularly useful in MSFG 
transformation.
[Figure 4.4, panels (a) to (d): graphics not reproduced. In panels (b) to (d) the PSC inputs are labelled $a_0, \cdots, a_{N-1}$ and the SPC outputs $b_0, \cdots, b_{N-1}$.]
Figure 4.4 Commutator cascades
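The cascade identities of Figures 4.4(a), (c), and (d) can be checked numerically. The sketch below rests on an assumed convention for the causal commutators, since the chapter 2 definitions are not reproduced here: branch i of the causal SPC carries $x(mN - i)$, and branch i of the causal PSC is up-sampled by N and delayed by $N-1-i$. The `advance` parameter is an illustrative device for placing a delay or advance in front of the SPC.

```python
def spc(x, N, advance=0):
    """Causal SPC (assumed convention): branch i carries x(m*N - i + advance)."""
    n_out = len(x) // N

    def at(n):
        return x[n] if 0 <= n < len(x) else 0

    return [[at(m * N - i + advance) for m in range(n_out)] for i in range(N)]


def psc(branches):
    """Causal PSC (assumed convention): output index m*N + r carries branch N-1-r."""
    N = len(branches)
    y = []
    for m in range(len(branches[0])):
        for r in range(N):
            y.append(branches[N - 1 - r][m])
    return y


N = 3

# Figure 4.4(a): SPC followed by PSC is a pure delay of N-1 samples.
x = list(range(1, 13))
spc_psc = psc(spc(x, N))
print(spc_psc == [0] * (N - 1) + x[:-(N - 1)])  # True

# Figure 4.4(d): PSC, time advance of N-1, SPC gives direct connections.
a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
direct = spc(psc(a), N, advance=N - 1)
print(direct == a)  # True

# Figure 4.4(c): PSC, unit delay, SPC gives a unit delay on each branch.
delayed = spc(psc(a), N, advance=-1)
print(delayed == [[0] + branch[:-1] for branch in a])  # True
```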
4.4.5 Complex filter identities
In chapter 2, we have shown that a complex filter can be realised using the DRFU structure, 
which uses two identical real lowpass filters as shown in Figure 2.7. We redraw it here in 
MSFG form in Figure 4.5(a). Similarly, if a signal is first modulated by a complex exponential and 
then filtered by a real filter, the process is equivalent to first filtering with an 
inverse-modulated (complex) filter and then modulating with the complex exponential, as shown in 
Figure 4.5(b). The proof of the second identity is similar to that of the first, which has been 
given in chapter 2.
[Figure 4.5, panels (a) and (b): graphics not reproduced. The panels involve the real filter $h(n)$ and complex exponential modulators.]
Figure 4.5 Complex filtering identities
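The second identity can be verified numerically. The sketch below assumes the modulating sequence is $e^{j\lambda n}$ and the inverse-modulated filter is $h(n)e^{-j\lambda n}$; the frequency value, test signal, and filter taps are arbitrary illustrative choices.

```python
import cmath


def fir(h, x):
    """Causal FIR filtering, output truncated to len(x)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


lam = 2 * cmath.pi * 3 / 16                          # arbitrary modulation frequency
x = [complex(n % 5, -(n % 3)) for n in range(32)]    # arbitrary complex test signal
h = [0.25, 0.5, 0.25, -0.1]                          # arbitrary real filter
carrier = [cmath.exp(1j * lam * n) for n in range(len(x))]

# Modulate by exp(j*lam*n), then filter with the real filter h(n).
modulate_then_filter = fir(h, [v * c for v, c in zip(x, carrier)])

# Filter with the inverse-modulated filter h(n)*exp(-j*lam*n), then modulate.
h_inv = [hk * cmath.exp(-1j * lam * k) for k, hk in enumerate(h)]
filter_then_modulate = [c * v for c, v in zip(carrier, fir(h_inv, x))]

max_error = max(abs(p - q)
                for p, q in zip(modulate_then_filter, filter_then_modulate))
print(max_error < 1e-12)  # True
```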
4.4.6 Polyphase decomposition transforms
The significance and the concept of polyphase decomposition for LTI networks and 
decimation/interpolation filters have been addressed in section 2.5.3.3. The transforms shown 
in Figures 2.15, 2.16, and 2.17 are directly translated into MSFG and are drawn in Figure 4.6, 
where the $H_i(z)$ are the z-transforms of the polyphase filters $h_i(m)$, which are defined as
$h_i(m) = h(mN + i), \quad i = 0, 1, \cdots, N-1$ (4.1)
The transforms of Figures 4.6(b) and 4.6(c) are obtained by directly applying the noble 
identities of Figures 4.1(e) and 4.1(f) to Figure 4.6(a).
[Figure 4.6: graphics not reproduced. Panels: (a) polyphase decomposition for LTI filter; (b) polyphase decomposition for decimation filter; (c) polyphase decomposition for interpolation filter.]
Figure 4.6 Polyphase decomposition transforms (PDT)
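The decimation case can be checked against direct filtering followed by down-sampling. The sketch below uses the polyphase filters of Eq. (4.1) and assumes a causal SPC whose branch i carries $x(mN - i)$; the function names are illustrative.

```python
def fir(h, x):
    """Causal FIR filtering, output truncated to len(x)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def polyphase_decimate(h, x, N):
    """Decimate by N with the polyphase filters h_i(m) = h(m*N + i) of Eq. (4.1)."""
    n_out = len(x) // N
    y = [0] * n_out
    for i in range(N):
        h_i = h[i::N]  # i-th polyphase component of the filter
        # Branch i of the (assumed) causal SPC: x(m*N - i), zero before the start.
        x_i = [x[m * N - i] if m * N - i >= 0 else 0 for m in range(n_out)]
        for m, v in enumerate(fir(h_i, x_i)):
            y[m] += v
    return y


x = list(range(1, 13))
h = [1, 2, 3, 4, 5]
N = 3

direct = fir(h, x)[::N]               # filter at the high rate, then down-sample
polyphase = polyphase_decimate(h, x, N)
print(direct == polyphase)  # True
```

The polyphase form runs every sub-filter at the low rate, which is where the saving in computation comes from.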
4.4.7 Modulation polyphase decomposition transforms
Let us consider the modulation of signal $x(n)$ with a sequence $\phi(n)$, i.e., $y(n) = x(n)\phi(n)$. It can 
be represented by its N polyphase components:
$y_i(m) = y(mN + i) = x_i(m)\,\phi_i(m)$ (4.2)
where
$x_i(m) = x(mN + i)$ (4.3)
$\phi_i(m) = \phi(mN + i)$ (4.4)
$i = 0, 1, \cdots, N-1$, m integer, are polyphase components of $x(n)$ and $\phi(n)$ respectively (Eq.
(2.64)). With the SPC and PSC structures of Figures 2.13(b) and 2.14, the above expression
describes a polyphase network shown in Figure 4.7(a). We call the structure the type 1 MPDT. 
Alternatively, the modulation can be expressed with a different set of polyphase components 
by:
$y'_i(m) = y_{-i}(m) = y(mN - i) = x(mN - i)\,\phi(mN - i) = x_{-i}(m)\,\phi_{-i}(m)$ (4.5)
$i = 0, 1, \cdots, N-1$, and m integer. In this case, the SPC will be causal but the PSC will be 
non-causal. The embedded polyphase network is shown in Figure 4.7(b), which is equivalent to 
Figure 4.7(a) and is called the type 2 MPDT. In the above polyphase networks, the 
serial-to-parallel commutator in Figure 4.7(a) and the parallel-to-serial commutator in Figure 4.7(b) are 
non-causal, hence unrealisable. Recall from chapter 2 that a non-causal SPC can be replaced by a 
time advance of $z^{N-1}$ followed by a causal SPC (Figure 2.13(c)). Similarly, a non-causal PSC 
can be substituted by a causal PSC followed by a time advance $z^{N-1}$. Therefore, the type 1 
MPDT and the type 2 MPDT have the equivalent structures shown in Figures 4.7(c) and 
4.7(d) respectively. The corresponding MSFGs of Figures 4.7(c) and 4.7(d) are shown in 
Figures 4.8(a) and 4.8(b) respectively.
[Figure 4.7, panels (a) to (d): graphics not reproduced.]
Figure 4.7 Two types of MPDT structures in conventional SFG 
If the modulating sequence is a complex exponential with period N,
$\phi(n) = e^{-j\frac{2\pi}{N}n} = W_N^n$ (4.6)
then its polyphase components become complex constants $\phi_i(m) = \phi(mN + i) = W_N^i$, which 
leads to a simple MPDT structure in which simple scalar multiplications are used instead of the 
modulations, as shown in Figures 4.9(a) and 4.9(b).
[Figure 4.8, panels (a) and (b): graphics not reproduced.]
Figure 4.8 MSFGs of type 1 and type 2 MPDT
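For the complex exponential of Eq. (4.6), the type 1 decomposition of Eqs. (4.2) to (4.4) can be checked numerically. The sketch below uses the non-causal indexing $x_i(m) = x(mN + i)$ directly on a finite list and ignores the time advance needed for a causal realisation; N and the test signal are arbitrary.

```python
import cmath

N = 4
W = cmath.exp(-2j * cmath.pi / N)                 # W_N of Eq. (4.6)
x = [complex(n + 1, -n) for n in range(12)]

# Direct modulation: y(n) = x(n) * W_N^n.
y_direct = [v * W ** n for n, v in enumerate(x)]

# Type 1 MPDT: split into N polyphase branches x_i(m) = x(m*N + i),
# scale branch i by the constant W_N^i, then interleave the branches again.
branches = [x[i::N] for i in range(N)]
scaled = [[W ** i * v for v in branch] for i, branch in enumerate(branches)]
y_mpdt = [scaled[n % N][n // N] for n in range(len(x))]

max_error = max(abs(p - q) for p, q in zip(y_direct, y_mpdt))
print(max_error < 1e-12)  # True
```

Each branch needs one constant multiplier instead of a time-varying modulator, which is the simplification the text describes.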
[Figure 4.9, panels (a) type 1 MPDT and (b) type 2 MPDT: graphics not reproduced.]
Figure 4.9 MPDT for complex modulation
4.4.8 Modulation identities
Some important equivalent structures associated with modulation nodes, which are frequently 
used in multirate filter bank design, are summarized as identities in Figure 4.10. In this 
figure, the identities shown in Figures 4.10(a) to 4.10(d) hold for an arbitrary modulating 
sequence $\phi(n)$, periodic or non-periodic. The modulating sequences $\phi(\lfloor n/M \rfloor)$ in 4.10(b) and 
$\phi(\lfloor m/L \rfloor)$ in 4.10(d) are up-sampled (with or without hold) versions of $\phi(m)$ in 4.10(b) and 
$\phi(n)$ in 4.10(d) respectively.
Since most often the modulating sequences are complex exponentials, as defined by Eq. (4.6), 
the identities shown in Figures 4.10(a) to 4.10(d) are equivalent to those of 4.10(e) to 4.10(h). 
Note that the modulations in Figures 4.10(e) and 4.10(g) can become trivial if M and N, or L 
and N, are carefully chosen.
We often have to deal with cascades of modulator and delay/advance in MSFG networks. 
Assuming that the modulation sequence extends infinitely in both time directions, it is easy to 
show that the delay can be moved across the modulator if the origin of the modulating 
sequence changes accordingly, as shown in Figures 4.10(i) and 4.10(j).
[Figure 4.10, panels (a) to (j): graphics not reproduced. Recoverable modulating-sequence pairs: (a) $\phi(n)$ and $\phi(mM)$; (b) $\phi(m)$ and $\phi(\lfloor n/M \rfloor)$; (c) $\phi(m)$ and $\phi(nL)$; (d) $\phi(n)$ and $\phi(\lfloor m/L \rfloor)$; (i) $\phi(n)$ and $\phi(n+k)$; (j) $\phi(n)$ and $\phi(n-k)$.]
Figure 4.10 Modulation identities
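Three of these identities can be checked on integer sequences, where equality is exact. The sketch below is an illustration, not the figure itself: the modulating sequence `phi` is an arbitrary non-periodic choice, and the placement of the delay in identity (i) (delay first, then modulation by $\phi(n)$) is an assumption about the figure.

```python
def upsample(x, L):
    y = [0] * (len(x) * L)
    y[::L] = x
    return y


def delay(x, d):
    return [0] * d + x[:len(x) - d]


def phi(n):
    """Arbitrary non-periodic modulating sequence (illustrative choice)."""
    return n * n - 5


def modulate(x, seq):
    """Multiply x(n) by seq(n) sample by sample."""
    return [v * seq(n) for n, v in enumerate(x)]


x = list(range(1, 13))
M, L, k = 3, 2, 2

# (a) modulate by phi(n) then down-sample == down-sample then modulate by phi(m*M).
id_a = modulate(x, phi)[::M] == modulate(x[::M], lambda m: phi(m * M))

# (d) modulate by phi(n) then up-sample == up-sample then modulate by phi(floor(m/L)).
id_d = upsample(modulate(x, phi), L) == modulate(upsample(x, L), lambda m: phi(m // L))

# (i) delay by k then modulate by phi(n) == modulate by phi(n+k) then delay by k.
id_i = modulate(delay(x, k), phi) == delay(modulate(x, lambda n: phi(n + k)), k)

print(id_a, id_d, id_i)  # True True True
```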
4.4.9 Commutator-modulator cascades
When modulators are cascaded with commutators, the MPDTs introduced in section 4.4.7 can be 
used to move polyphase components of modulating sequences into polyphase branches of 
commutators. For a PSC-modulator cascade, by applying the type 1 MPDT to the modulator and 
then the identity of Figure 4.4(d), the modulator is polyphase decomposed and moved into the 
PSC branches as shown in Figure 4.11(a). Similarly, a modulator-SPC cascade can be 
transformed into the equivalent structure shown in Figure 4.11(b) by applying the type 2 MPDT 
and the identity of Figure 4.4(d). If the modulation sequence in the above cascades is periodic 
with period equal to the commutation factor, for example, the complex exponential of 
Eq. (4.6), then the modulators in the commutator branches reduce to scalar 
multipliers as shown in Figures 4.11(c) and 4.11(d).
If branches of a SPC are modulated with the same periodic sequence, say a complex 
exponential of Eq. (4.6), then a common modulator can be used for all the branches and each 
branch is phase shifted by a constant phase as shown in Figure 4.11(e). This identity can be 
proved by applying the identities of Figure 4.10(f) and Figure 4.10(i). Similarly, when branches 
of a PSC are modulated with the same complex exponential, a common modulator can be used 
and each branch has to be phase shifted accordingly as shown in Figure 4.11(f).
[Figure 4.11, panels (a) to (f): graphics not reproduced.]
Figure 4.11 Commutator-SPC and SPC-commutator identities
4.4.10 Commutator-filter cascades
Sometimes, we may find that it is necessary to identify equivalent structures for commutators 
whose branches connect to identical filters as shown in Figure 4.12. In these cases, simple 
noble identities can be used to move the filters in the branches to the other side of the 
commutator resulting in a cascade of the commutator with an interpolated filter.
[Figure 4.12, panels (a) and (b): graphics not reproduced. Each commutator branch contains the same filter $H(z)$.]
Figure 4.12 Commutator-filter Identities
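For the SPC case this amounts to the statement that filtering every branch with $H(z)$ equals filtering the serial input with the interpolated filter $H(z^N)$ before commutation. The sketch below checks it using the non-causal branch indexing $x_i(m) = x(mN + i)$ as a simplifying assumption; the signal and filter values are arbitrary.

```python
def fir(h, x):
    """Causal FIR filtering, output truncated to len(x)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def upsample(x, L):
    y = [0] * (len(x) * L)
    y[::L] = x
    return y


N = 3
x = list(range(1, 19))
h = [2, -1, 1]

# SPC first, then the same filter H(z) in every branch.
filtered_branches = [fir(h, x[i::N]) for i in range(N)]

# Interpolated filter H(z^N) first, then the SPC.
y = fir(upsample(h, N), x)
commutated = [y[i::N] for i in range(N)]

print(commutated == filtered_branches)  # True
```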
4.4.11 Commutator-sampler commutability
In MSFG transformation, we often find ourselves in an awkward position of trying to move 
samplers across commutators, e.g., in upsampler-SPC and PSC-downsampler cascades. We 
will show in this section that the commutation is allowed only when the sampling factor of the 
sampler and the commutation factor are co-prime. Also, the move of the sampler will lead to 
reordering of the commutator branches and time advances to the branch signals.
For an upsampler-SPC cascade (Figure 4.13(a)), if L  and M  are co-prime, the upsampler can 
be moved to the output of the SPC. The time advance introduced to the z-th branch of the SPC 
will be
/. =
m
where I and m are any solution of
mL - IM  = 1
(4.7)
(4.8)
The original output polyphase signals will be reordered as a result of the transform by 
delivering k-th polyphase component y*(m), instead y,(m) at the z-th branch of the transformed 
structure, where k is given by
k = M - I - z)l) m (4.9)
Similarly, when the downsampler moves across the PSC in a PSC-downsampler cascade as 
shown in Figure 4.13(b), the time advance introduced to the i-th branch of the PSC is
i m
m: =
I
(4.10)
where l and m are any solution of
$$mL - lM = -1 \tag{4.11}$$
Again, as shown in Figure 4.13(b), the input polyphase signals are reordered in the transformed 
network with $x_k(n)$, instead of $x_i(n)$, as the input signal to the i-th channel, where k is 
determined by
$$k = (iM)_L \tag{4.12}$$
The notation $(\cdot)_N$ indicates the modulo operation, that is, $(n)_N = n \bmod N$. The proof of the 
identities is lengthy and hence is given separately in Appendix E.
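As a quick numerical illustration of the co-primality condition and of the branch reordering of Eq. (4.12), the sketch below finds one integer solution of mL - lM = ±1 and lists k = (iM)_L for every branch. The function names are hypothetical, and the modular-inverse route is simply one convenient way of solving the equation; it is not taken from Appendix E.

```python
from math import gcd

def solve_ml_minus_lm(L, M, rhs):
    """Return integers (m, l) with m*L - l*M == rhs, for rhs = +1 or -1."""
    if gcd(L, M) != 1:
        raise ValueError("L and M must be co-prime")
    m = (rhs * pow(L, -1, M)) % M   # m*L == rhs (mod M); pow(..., -1, M) needs Python 3.8+
    l = (m * L - rhs) // M          # exact division by construction
    return m, l

def psc_branch_order(L, M):
    """Input index k = (i*M) mod L seen at branch i, as in Eq. (4.12)."""
    return [(i * M) % L for i in range(L)]
```

For L = 5 and M = 3 the branch order is a permutation of 0 to 4. When L and M share a factor the map is no longer one-to-one (L = 4, M = 2 gives 0, 2, 0, 2), which is one way of seeing why the move is only allowed for co-prime factors.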
Figure 4.13 Sampler-commutator identities
4.4.12 Identities associated with composite rate-changing nodes
According to the definition of the USH node, it is equivalent to a PSC that has all its parallel 
inputs connected to a common node, as shown in Figure 4.14(a). The IDU function, as the 
dual of the USH, is equivalent to the transposition of the USH and is shown in Figure 4.14(b). 
Another way of realising the IDU function is to perform an M-tap comb filtering first and then 
downsample the sequence by M, as shown in the figure. Having defined the USH function, the 
SH function can be realised by a simple cascade of a downsampler and a USH, as shown in 
Figure 4.14(c). We have also identified the relationship between the sampling function and the 
downsampler and upsampler. Figure 4.14(d) shows this relation, which includes the identity 
shown in Figure 4.2(g) as a special case (k=0).
Figure 4.14 Identities associated with composite rate-changing nodes
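The composite nodes can be mimicked on short sequences. In the sketch below we read USH as repeating each input sample M times (a PSC whose inputs are all tied together), IDU as summing each block of M samples, and SH as a downsampler followed by a USH. The function names, the zero initial state of the comb filter and the downsampling phase chosen after it are our assumptions.

```python
def downsample(x, M, phase=0):
    """Keep every M-th sample, starting at index `phase`."""
    return x[phase::M]

def ush(x, M):
    """PSC with all M inputs tied together: each sample appears M times."""
    return [v for v in x for _ in range(M)]

def comb(x, M):
    """M-tap comb filter y(n) = x(n) + x(n-1) + ... + x(n-M+1), zero initial state."""
    return [sum(x[max(0, n - M + 1):n + 1]) for n in range(len(x))]

def idu(x, M):
    """Comb filtering followed by downsampling; the phase is chosen (an assumption)
    so that every output is the sum of one complete block of M inputs."""
    return downsample(comb(x, M), M, phase=M - 1)

def sh(x, M):
    """Downsampler followed by a USH."""
    return ush(downsample(x, M), M)
```

For example, `idu([1, 2, 3, 4, 5, 6], 3)` gives the two block sums 6 and 15, and `sh([1, 2, 3, 4, 5, 6], 3)` holds the samples 1 and 4 for three periods each.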
4.5 Examples
To show how these MSFG identities and transforms are used to derive efficient DEMUX filter 
bank structures, three examples are given in this section.
4.5.1 Five-channel DEMUX for an MF-TDMA system
The task of this example is to channelize MF-TDMA traffic [Ana90, Gro94] consisting of five 
equally spaced carriers. Each MF-TDMA channel is shared by multiple users through TDMA 
and has a symbol rate of 1 Mbaud. The roll-off factor is assumed to be 0.5, hence the channel 
spacing is 3 MHz. If two samples per symbol are required by the demodulators, the desired 
output sampling rate of the DEMUX will be 2 MHz.
Though ideally the real input frequency multiplex can be sampled at the Nyquist rate, guard 
bands on both lower and upper edges of the frequency multiplex are necessary due to the 
imperfection of anti-aliasing filtering at the analog stage. To simplify the demultiplexing stage, 
these guard bands are commonly chosen as integer multiples of the channel spacing. In this 
case, the upper and lower guard bands are both set to be one channel spacing bandwidth. As a 
result, the input sampling frequency is 2×(5+2)×3 = 42 MHz. The spectrum of the sampled input 
signal is illustrated in Figure 4.15(a). It is evidently an odd-stacking real FDM signal.
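The rate bookkeeping of this example can be written out in a few lines. The function below is only a restatement of the arithmetic in the text; its name and argument names are hypothetical.

```python
def demux_rates(n_channels, n_guard, spacing_mhz, symbol_rate_mbaud, samples_per_symbol):
    """Return (input rate in MHz, output rate in MHz, overall decimation factor)."""
    # Real input signal: sample at twice the total bandwidth, guard channels included.
    fs_in = 2 * (n_channels + n_guard) * spacing_mhz
    # Rate required by each demodulator.
    fs_out = samples_per_symbol * symbol_rate_mbaud
    return fs_in, fs_out, fs_in / fs_out

fs_in, fs_out, decimation = demux_rates(5, 2, 3.0, 1.0, 2)
print(fs_in, fs_out, decimation)
```

With the figures of this example the input rate is 42 MHz, the output rate is 2 MHz, and the overall decimation factor is 21 = 7 × 3, i.e. the number of channel slots (five carriers plus two guard channels) times a residual factor of 3.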
To be able to use real channel filters in the lowpass DEMUX model of Figure 2.1 and to avoid 
using the GDFT, the sampled input signal needs to be in even channel stacking. The odd-to-even 
channel stacking conversion can be achieved using a trivial π/2-frequency-shift, which requires 
no arithmetic operations. Figure 4.15(b) shows the spectrum after the conversion. The key 
characteristics of the channel filter are illustrated in Figure 4.15(c).
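To see why the π/2-frequency-shift is free of arithmetic, note that the modulating sequence only takes the values 1, j, -1 and -j. A minimal sketch follows, assuming the shift is a multiplication by (j)^n; the opposite direction, (-j)^n, works in the same way. The function name is hypothetical.

```python
def quarter_rate_shift(x):
    """Multiply real samples x(n) by j**n using only routing and sign changes."""
    out = []
    for n, v in enumerate(x):
        r = n % 4
        if r == 0:
            out.append(complex(v, 0.0))    # times 1
        elif r == 1:
            out.append(complex(0.0, v))    # times j: sample goes to the imaginary part
        elif r == 2:
            out.append(complex(-v, 0.0))   # times -1
        else:
            out.append(complex(0.0, -v))   # times -j
    return out
```

No multiplier is involved: each real input sample is routed to either the real or the imaginary output, with or without a sign inversion.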
Figure 4.15 Five-channel MF-TDMA signal and the demultiplexing channel filter: (a) real frequency multiplex with two guard bands; (b) even stacking by π/2-frequency-shift; (c) channel filter characteristics (raised cosine)
Figure 4.16 DEMUX function (lowpass model)
The DEMUX function can be described by the MSFG shown in Figure 4.16. A direct 
implementation of this structure would require five (not counting the guard channels) 
identical lowpass channel filters and a bank of frequency shifts. In multirate filter bank 
theory, a uniform modulated filter bank like this can be efficiently realised with only one 
shared polyphase-decomposed channel filter plus a DFT (hence polyphase-DFT), a significant 
saving in computation over the direct structure.
To derive the polyphase DFT structure, let us consider the k-th channel of the DEMUX in Figure 
4.16 and redraw it in Figure 4.17(a). Obviously, any operations that involve the channel index k, 
or that follow operations involving it, cannot be shared by different channels. Therefore, the 
basic theme of the MSFG simplification is to delink operations (especially the computationally 
demanding ones) from the channel index k and to move them before any operations associated 
with the index k. Another principle of MSFG simplification is to move operations to places 
where the sampling rate is as low as possible to reduce the computation rate. To this end, we 
first convert the real filter h(n) into a trivial complex one, $W_4^{-n} h(n) = (j)^n h(n)$, using the 
complex filtering identity (Figure 4.5(b)) as shown in Figure 4.17(b), and then polyphase 
decompose the complex decimation filter, leading to Figure 4.17(c). Apply the type 2 MPDT of 
Figure 4.9(b) to the first frequency shift and decompose the 1-to-21 SPC into a tree consisting 
of a 1-to-7 SPC and seven 1-to-3 SPCs (referring to Figure 4.3(a)). The 7-to-1 PSC of the 
decomposed frequency shift will cancel with the 1-to-7 SPC of the SPC tree (Figure 4.4(d)), 
resulting in Figure 4.17(d). The complex multipliers in the figure can be moved after the 
sub-filters as shown in Figure 4.17(e). Reorganizing the structure, we get Figure 4.17(f).
The structure of Figure 4.17(f) shows that the polyphase network is independent of the channel 
index k. It can be therefore shared by all the channels. Since the frequency-shift network in the 
figure simply performs the k-th component of a 7-point DFT, the whole DEMUX network can 
be constructed by a polyphase filter bank followed by a 7-point DFT as shown in Figure 4.18.
Considering that the input signal is real and that the complex filter $(j)^n h(n)$ has 
coefficients that are either purely real or purely imaginary, the actual computation load of the 
complex polyphase filter bank is the same as that of the real prototype filter h(n). Furthermore, 
all the computations are carried out at the low sampling rate of 2 MHz, and the 
π/2-frequency-shift bank requires no arithmetic effort. Hence we conclude that the derived 
polyphase-DFT structure is computationally optimum.
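The equivalence between the direct filter bank and the shared polyphase filter followed by a DFT can be checked numerically. The sketch below is a generic illustration, not the exact network of Figure 4.18: it omits the odd-to-even stacking shift and the complex prototype $(j)^n h(n)$, it assumes a mixer of the form exp(-j2πkn/K) for channel k, and it uses NumPy with hypothetical function names. K = 7 channel slots and a decimation factor D = 21 are taken from the example.

```python
import numpy as np

def direct_demux(x, h, K, D):
    """One mixer, one full-rate filter and one decimator per channel."""
    n = np.arange(len(x))
    rows = []
    for k in range(K):
        mixed = x * np.exp(-2j * np.pi * k * n / K)   # shift channel k to baseband
        rows.append(np.convolve(mixed, h)[:len(x)][::D])
    return np.array(rows)

def polyphase_dft_demux(x, h, K, D):
    """Shared D-branch polyphase filter followed by a K-point DFT stage."""
    if D % K != 0:
        raise ValueError("this sketch assumes D is a multiple of K")
    hp = np.concatenate([h, np.zeros(-len(h) % D)])   # pad taps to a multiple of D
    Q = len(hp) // D
    xp = np.concatenate([np.zeros(len(hp)), x])       # zero history before n = 0
    n_out = -(-len(x) // D)                           # ceil(len(x) / D)
    v = np.zeros((K, n_out), dtype=complex)
    q = np.arange(Q)
    for m in range(n_out):
        for p in range(D):
            # branch p: sum over q of h(qD + p) x(mD - qD - p), at the output rate
            u = np.dot(hp[p::D], xp[len(hp) + m * D - p - q * D])
            v[p % K, m] += u                          # branches sharing a DFT input
    return K * np.fft.ifft(v, axis=0)                 # sum over p of v_p exp(+j2*pi*k*p/K)
```

In the polyphase form every prototype tap is used once per output sample and is shared by all channels, whereas the direct form filters each channel separately at the input rate.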
Figure 4.17 Step-by-step simplification for demultiplexing the k-th channel
Figure 4.18 Polyphase DFT DEMUX structure for the 5-channel MF-TDMA traffic
4.5.2 The optimal complex BSF structure for binary tree DEMUX
In this example, we use MSFG transforms to derive a novel complex band-splitting filter (BSF) 
structure which is believed to be optimal in the sense that it has the minimum computation rate 
amongst the known BSF structures. Recall the definition of the BSF of a binary tree in chapter 3 
(Figure 3.14). The lowpass and highpass filters are frequency-translated versions of a common 
real lowpass filter h(n). As will be seen in chapter 7, the lowpass and the highpass filters of a 
complex BSF are defined by Eqs. (7.3a) and (7.3b) respectively. The direct complex BSF 
structure is shown in Figure 4.19(a). Applying the transform of Figure 4.5(a) to the complex 
filters and the identity of Figure 4.10(a) to the figure, the direct structure can be transformed 
into Figure 4.19(b), consisting of a BSF core cell and trivial frequency shifts ($(j)^n$ or $(-j)^n$) at 
both the input and the output of the BSF. These frequency shifts will be cancelled at 
intermediate stages of the tree.
Figure 4.19 Direct complex BSF structure
Now let us derive the optimal complex BSF via MSFG transforms of the core cell. The core 
cell has the MSFG shown in Figure 4.20(a). Using the type 2 MPDT (Figure 4.8(b)) and the 
polyphase decomposition of the decimation filter (Figure 4.6(b)) transforms the direct structure 
of Figure 4.20(a) into that of Figure 4.20(b). Decomposing the 1-to-4 SPCs and 4-to-1 PSCs into 
1-to-2 SPCs and 2-to-1 PSCs (Figure 4.3) in Figure 4.20(b) and applying the PSC-SPC 
cascade identity (Figure 4.4(c)) transforms the network into that shown in Figure 4.20(c), in 
which the lowpass and the highpass branches share common SPCs and π-frequency shifts.
Till now, the computational complexity remains the same as that of the original BSF model: 
four real full-length FIR filters (remember that the signal paths are complex) operating at the 
input (high) sampling rate. To reduce the complexity we need to share the filtering as much as 
possible and to move the filtering to the lower sampling rate side, that is, immediately after the 
front SPCs. For this purpose, we first polyphase decompose the interpolation filters in Figure 
4.20(c) and then move the π-shifts before the 1-to-2 SPCs with the identity of Figure 4.11(e). 
That gives Figure 4.20(d). Reorganising the middle part of the figure by applying the 
superposition theorem for linear networks, and noticing that the combinations of the 
rate-changing operators in the reorganised network can be reduced to pure delays or π-shifts 
according to the identities of Figure 4.4(a) and Figure 4.9, we have Figure 4.20(e).
Basically, the structure in Figure 4.20(e) is already computationally optimal for an arbitrary 
real prototype filter h(n), because the computational complexity of the four real sub-filters 
amounts to that of a single real prototype filter and the computations are carried out on the 
low-rate side. If h(n) is linear phase and, furthermore, of the lowpass half-band type, which is 
usually the case for binary tree structures, then the complexity of Figure 4.20(e) can be further 
reduced. For a half-band lowpass filter with length $N = 4n-1$, where n is the number of distinct 
non-zero coefficients excluding the centre tap (which is 0.5), let us assume that n is even 
(which gives N = 7, 15, 23, ...). Then the polyphase filters satisfy
$$H_{10}(z) = 0 \tag{4.13a}$$
$$H_{11}(z) = 0.5\,z^{-(n/2-1)} \tag{4.13b}$$
and $H_{00}(z)$ and $H_{01}(z)$ are non-symmetric and are related by
$$H_{00}(z) = z^{-(n-1)} H_{01}(z^{-1}) \tag{4.13c}$$
(or, $h_{00}(i) = h_{01}(n-1-i)$, $i = 0, 1, \ldots, n-1$). Hence the signal paths corresponding to $H_{10}(z)$ 
disappear. If we define
$$P(z) = [H_{01}(z) + H_{00}(z)]/2 \tag{4.14a}$$
$$Q(z) = [H_{01}(z) - H_{00}(z)]/2 \tag{4.14b}$$
such that
$$H_{01}(z) = P(z) + Q(z) \tag{4.15a}$$
$$H_{00}(z) = P(z) - Q(z) \tag{4.15b}$$
Then, according to Eq. (4.13c), P(z) and Q(z) are symmetric and anti-symmetric respectively. 
Obviously, the number of multiplications can be halved in the implementation of Eqs. (4.15) 
using, say, a symmetrical transversal FIR structure (Figure 2.5(a)) for P(z) and Q(z), compared 
to direct implementations of $H_{01}(z)$ and $H_{00}(z)$. Thus, the structure of Figure 4.20(e) can be 
transformed into that shown in Figure 4.20(f).
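The symmetry argument can be checked on a toy half-band filter. In the sketch below the tap values are arbitrary (they are not a designed filter), the four sub-filters are taken as the phases h(4q + p) of the prototype, and the assignment of the two mirror-image phases to the roles of the two non-symmetric sub-filters is our assumption; the thesis's own indexing of the sub-filters may differ.

```python
def halfband_taps(side):
    """Build h(0), ..., h(4n-2) from the n distinct non-zero side taps (outermost first)."""
    left = []
    for c in side:
        left += [c, 0.0]
    left.pop()                      # no zero between the innermost side tap and the centre
    return left + [0.5] + left[::-1]

def four_phase(h):
    """Polyphase components h(4q + p), p = 0, 1, 2, 3."""
    return [h[p::4] for p in range(4)]

def split_pq(a, b):
    """P = (a + b)/2 and Q = (a - b)/2, tap by tap."""
    P = [(u + v) / 2 for u, v in zip(a, b)]
    Q = [(u - v) / 2 for u, v in zip(a, b)]
    return P, Q

h = halfband_taps([0.1, -0.2, 0.3, 0.4])   # n = 4, N = 15
phases = four_phase(h)
P, Q = split_pq(phases[0], phases[2])
print(phases[1], phases[3])                # all-zero phase and the delayed centre tap
print(P, Q)                                # symmetric and anti-symmetric taps
```

Because P is symmetric and Q is anti-symmetric, each needs only n/2 distinct multiplications when realised as a symmetric transversal filter, which is the halving referred to above.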
Finally, by moving the complex multiplier $W_8^{-1} = e^{j\pi/4} = \frac{1}{\sqrt{2}}(1+j)$ in Figure 4.20(f) backwards 
and combining the centre tap (0.5) with the above real constant factor, we have the optimal 
BSF structure shown in Figure 4.20(g). The optimal complex BSF is redrawn as a conventional 
block diagram in Figure 4.20(h). This optimal structure has been verified via simulation. The 
π/2- and π-frequency shifts in the figure are trivial, requiring no multiplications. The complex 
multipliers (1+j) and j also require no multiplications: the former actually performs two real 
additions, while the latter simply exchanges the real and imaginary components of the signal 
and negates one of them. Therefore, the total number of real multiplications required by the 
optimal BSF structure (with the symmetry of P(z) and Q(z) exploited) is 
2×[(n/2+n/2)+1] = 2(n+1), assuming that the half-band filter length is N = 4n-1 and n is even. It 
can be shown that the same optimal structure can also be used for BSFs with odd n (i.e., 
N = 11, 19, 27, ...), except that the multiplier j in Figure 4.20(g) should be replaced with -j. 
Hence the required number of real multiplications for odd n is also 2(n+1).
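The two multiplier-free operations and the multiplication count can be spelled out as follows. This is a sketch with hypothetical function names; the signal is carried as separate real and imaginary parts, as it would be in hardware.

```python
def times_one_plus_j(re, im):
    """(re + j*im) * (1 + j), using one subtraction and one addition."""
    return re - im, re + im

def times_j(re, im):
    """(re + j*im) * j: swap the two parts and negate the new real part."""
    return -im, re

def bsf_real_multiplications(n):
    """Count quoted in the text: 2 * [(n/2 + n/2) + 1] = 2 * (n + 1)."""
    return 2 * (n + 1)
```

For a half-band prototype of length N = 7 (n = 2) this gives 6 real multiplications, and for N = 15 (n = 4) it gives 10.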
Figure 4.20 Derivation for the optimal complex BSF
A comparison with other BSF structures is given in chapter 7. These BSF implementation 
schemes include the direct structure shown in Figure 4.19(b), in which a symmetrical 
transversal FIR is assumed; ANT’s HMM cell [ANT85]; BAe’s reduced multiplication 
(RM) cell [Cra90]; Bi’s BSF structure [Bi90]; and our TM-BSF structure [Qi92a]. The 
comparison results are listed in Table 7.2. They clearly show that the optimal structure of 
Figure 4.20(h) is superior to all the other schemes in terms of the number of computations 
(multiplications), the computation rate, and the memory requirement.
4.5.3 The optimal real BSF structure
Similarly, the optimal real BSF structure can also be derived with MSFG transform. The direct 
structure of real BSF is shown in Figure 4.21(a) which is also represented in MSFG as shown 
in Figure 4.21(b). By a series of MSFG transforms we obtain the optimal real BSF structure in 
which the lowpass and the highpass filters completely share a common real filter and the 
filtering is performed at the low sampling side. This optimal structure is shown in Figures 
4.21(c) and (d).
Figure 4.21 The optimal real BSF structure
4.6 Summary
The objective of reducing the computational complexity of a system in a multirate environment 
requires not only that the number of computations be minimized but also that the computation 
rate be minimized. Hence the main theme of multirate system optimization is to share 
computations and move them to the lower sampling frequency side as much as possible. With 
its direct and clear link to hardware architecture, the MSFG representation for multirate 
systems, along with the identities and transforms of MSFG summarized in this chapter, can 
provide a unique and systematic approach to the optimization of multirate networks without 
resorting to tedious and often ad hoc mathematical manipulations.
Many identities and transforms summarized in this chapter, in particular those associated with 
commutators and the MPDTs, have not been seen in the literature. They have been found 
extremely useful and handy for deriving optimal MSFG structures.
Computationally efficient frequency DEMUX structures can be obtained via MSFG transforms, 
as has been shown by the two examples. Ad hoc tricks of mathematical manipulation 
associated with the conventional design method are avoided. Both structures of the examples 
are optimal in the sense that the theoretical minima of computation rate have been achieved. It 
should be noted that the optimal DEMUX structures in the two examples are by no means 
unique. There exist a number of equivalent structures that can also reach the theoretical 
minima of computational complexity; hence they are all considered optimal. This can easily be 
seen in our derivation of the optimal structures. The choice between these optimal structures 
will eventually depend on the implementation scheme (in software or in hardware) and on 
factors such as modularity, regularity, parallelism, etc.
With MSFG’s sampling node functions, it can even describe some digital network functions, 
like sampling, switching, sample and hold, integrate and dump, interleave, etc. Hence it has 
the potential of making the direct mapping from DSP algorithms to digital networks.
Chapter 5 
Complexity and Power 
Consumption Analysis and VLSI 
Architecture Optimization
A systematic approach to mapping computationally efficient multirate DSP algorithms onto efficient VLSI architectures is proposed in this chapter. A simple modeling 
method for multirate systems, by which the requirements for complexity, power 
consumption, and system throughput are represented in the form of objective functions, is 
presented. Efficient VLSI architectures subject to given constraints are obtained via 
optimization techniques [Qi96].
5.1 Introduction
With the development of VLSI technology and the design methodology it is now possible for 
high speed and very sophisticated DSP systems to be quickly integrated into VLSI with 
affordable costs. To reduce mass and power consumption of digital communications and 
signal processing systems in space environment, VLSI implementations of the systems seem 
the most attractive option. They inevitably require efficient mapping from algorithms 
(mathematical models) to VLSI architectures. Much effort has been given to system 
complexity reduction using various methods. To reduce power consumption the choice goes 
for low power dissipation technologies, for example, CMOS technology. Little work has been 
done on how the system architecture will affect the power consumption. Also when
complexity, power, and throughput are considered jointly, their relations, particularly in 
multirate environment, must be established [Mur82].
Usually, the mapping from mathematical models to VLSI architectures requires three steps:
1) Design of a computationally efficient algorithm at the top level. The algorithm performs 
the specified task functions with the least (or less) computations which are defined by the 
basic operations (BOs), such as multiplication, addition, memory operations, etc., or their 
combinations (e.g., inner-product);
2) Mapping the algorithm into a general hardware implementing structure (IS). The IS 
consists of only fundamental building blocks known as basic components (BCs) that 
perform the basic operations. The IS can be either a direct form which is a one-to-one 
mapping from the BOs to the BCs, or one that involves time-sharing (time-multiplexed 
use) of some BCs or functional blocks, and
3) Mapping the implementing structure onto an efficient VLSI architecture that requires low 
complexity, less power, and allows high throughput. This procedure is completed via the 
basic component configuration that determines whether a bit-serial, or a bit-parallel 
architecture should be used for each BC.
Chapter 4 has addressed step 1. For simplicity, a one-to-one mapping from the optimized 
mathematical model to the IS is assumed. Step 3 is the main concern of this chapter. 
Throughout this chapter, the term optimal VLSI architecture is always used under the 
assumptions that the embedded DSP algorithm is computationally efficient and its IS is 
obtained via one-to-one mapping.
There are, however, situations where a computationally efficient algorithm does not necessarily 
lead to a hardware-efficient VLSI architecture. This is because the computational efficiency is 
conventionally evaluated in terms of multiplication rate that leads to optimal DSP structures 
with minimum number of multipliers. Whereas in some circumstances operations like inner- 
product and convolution filtering can be implemented more efficiently (in area, or gate count) 
using special hardware architectures, rather than using simply a combination of multipliers 
and adders (e.g., MAC architecture). Examples can be found in ROM filters, distributed 
arithmetic FIR/IIR filters, and in general, table look-up techniques. If this is the case, that 
particular operation, instead of multiplication or addition, should be defined as a BO, and it 
might be necessary to go back to the first step to optimize the algorithm so that the number of 
these BOs is minimized.
In multicarrier DEMUX designs, once a computationally efficient multirate filter bank (e.g., 
a tree or polyphase filter bank) has been chosen, the next step is to map the filter bank 
structure into an efficient VLSI architecture. The commonly used approach to looking for a 
feasible architecture is to try, on an ad hoc basis, some tentative schemes for some BCs using
either bit-serial or bit-parallel architectures at some stages until the predefined constraints on 
complexity, power, and throughput are met. However, we find that even for a simple 
polyphase matrix DFT filter bank with just four types of BCs and three different sampling 
frequencies, there will be a total of 512 different configuration options. The multirate VLSI 
complexity model presented in this chapter suggests that the search for the optimal 
architecture is an NP-complete (nondeterministic polynomial-time complete) problem; an 
exhaustive search would amount to the order of $2^{I \times J}$ options, where I and J are the numbers 
of BC types and sampling frequencies respectively. It is thus either unlikely that the optimal 
architecture will be obtained with only a few tentative searches, or inefficient to go through 
an exhaustive search.
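The growth of the search space is easy to reproduce. The sketch below enumerates every bit-parallel/bit-serial assignment for a given number of independently configurable (BC type, sampling rate) entries; the function names are hypothetical and the enumeration is only an illustration of the exhaustive search, not a method proposed in the thesis.

```python
from itertools import product

def configurations(n_entries):
    """All assignments of 0 (bit-parallel) or 1 (bit-serial) to n_entries BC slots."""
    return product((0, 1), repeat=n_entries)

def count_configurations(n_types, n_rates):
    """Size of the search space when every BC type occurs at every sampling rate."""
    return 2 ** (n_types * n_rates)

print(sum(1 for _ in configurations(9)))   # nine configurable entries
print(count_configurations(4, 3))          # every one of 4 types present at all 3 rates
```

Note that $2^{I \times J}$ counts the case in which every BC type occurs at every rate (4096 for I = 4, J = 3); fewer populated (type, rate) pairs give fewer options, e.g. nine pairs give 2^9 = 512. Either way the count doubles with each additional entry, which is what makes exhaustive search unattractive.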
The aim of this chapter is to provide a systematic approach to the derivation of optimal VLSI 
architectures and to provide technology independent estimations of complexity, power 
consumption, and the throughput in order to give objective comparisons between alternative 
architectures. We first introduce a simple multirate VLSI complexity model upon which 
technology independent complexity, power consumption, and system throughput are defined.
It will be shown that for a given computationally efficient DSP structure, the optimal VLSI 
architecture subject to certain constraints can be obtained by manipulating the configuration 
schemes of the BCs, which have been carefully designed in both bit-serial and bit-parallel 
architectures and are provided in the design library. That is, by choosing different types (bit- 
serial or -parallel) of BCs that have different demands for complexity, power, and clock 
frequency, the resulting VLSI architecture will have very different complexities and some of 
the configuration schemes may meet the constraints. As a result, the optimization of VLSI 
architecture could be formulated as a multi-objective optimization problem based on the 
multirate VLSI complexity model. To search for feasible solutions, constrained nonlinear 
optimization techniques [Las70, Got73] can be applied.
5.2 Multirate VLSI complexity model
The complexity of a DSP algorithm is commonly assessed by its computational complexity, 
e.g., the number of required multiplications and additions. When mapped into VLSI, the 
hardware (space) complexity in terms of number of gates or silicon area, however, will not 
necessarily reflect the computational complexity. It depends not only on the computational 
efficiency, but also on how BCs are implemented, and on the locality of signal routing.
The main difference between VLSI (hardware), DSP (digital signal processor), and software 
implementations is perhaps the degree of concurrence of signal processing which affects all 
the aspects of VLSI complexities (space, time, and power). Generally speaking, high 
concurrence will result in high space complexity and low time complexity, and vice versa. In 
practice, trade-offs between space and time complexities can be made by having different 
degrees of concurrence at different design hierarchies. That is, we can trade concurrence for
low space complexity (at the expense of higher time complexity). A typical example is 
time-multiplexing (time-sharing) of computations, which reduces the space complexity, but the 
time complexity (in terms of required time steps) for a given operation will be increased 
accordingly. Therefore, an optimized VLSI architecture is often the result of balanced 
concurrence and time complexity. In addition to space and time complexities, power 
complexity (power consumption) is another key factor that affects the VLSI architectures for 
space applications. Since the power dissipated by static CMOS gates is extremely small but 
rises in direct proportion to the switching frequency in transitions [Hur85], the power 
consumption is more closely related to the computation rates of operations than to the 
complexity. Although these facts are commonly recognized, estimations of VLSI 
complexities are often made on an ad hoc basis. This necessitates the study of a systematic 
approach to the estimations. CMOS technology and synchronous sequential logic are assumed 
throughout the chapter.
5.2.1 Complexity and power consumption in CMOS VLSI
5.2.1.1 VLSI complexity
The most appropriate measure for system complexity in VLSI is perhaps the chip (silicon) 
area that is directly related to the yield and the manufacturing costs. To estimate the chip area 
one usually needs to decompose the circuit into blocks whose area can be obtained either by 
resorting to knowledge-based expert systems or by the designer's own knowledge and 
estimation. Reliable estimation of chip area, however, depends not only on the computational 
complexity and the VLSI architecture, but also on signal communications between building 
blocks (cells, macros, functional blocks, etc.), the floor plan, which affects the routing of the 
layout, as well as the VLSI technology.
We choose the gate count as the complexity measure because it is technology independent 
which is preferred for comparative study between alternative VLSI architectures. However, 
the ultimate determination of the VLSI architecture may still be affected by area factors such 
as signal communications, the floor plan, etc. if severe discrepancy between the two metrics 
occurs. In order to make gate count consistent with chip area, one needs to avoid the global 
signal communications, e.g., using area efficient pipeline and systolic architectures.
The estimation of gate count is similar to that of chip area: first, decompose the circuit into 
functional blocks and further into building blocks (BCs), then calculate and sum up the 
number of gates of each block.
5.2.1.2 CMOS VLSI power consumption
For most MOS systems, the average power consumed can be approximated by the total 
switching power plus one half of the DC power that would result if all the MOS transistors 
were on [Mea80]. Because a CMOS transistor virtually requires no static DC current, the
static DC power can be ignored. Therefore for CMOS process only the switching power will 
be considered in the power consumption estimation.
For the same reason, at any time instant, the energy consumed by a CMOS circuit is solely 
determined by those active (in switching) circuit cells (gates, or transistors).
To estimate the power dissipation of CMOS circuits, let us consider the power dissipation of 
a simple synchronous CMOS logic with gate count G and clock (switching) frequency 
$f_{sw} = 1/T_0$ MHz, where $T_0$ is the clock period. The power dissipated per active gate per unit 
switching frequency (mW/gate/MHz) is assumed to be p for a given technology. We define $\alpha$, 
the statistical average percentage of active gates of the circuit, as
$$\alpha = \frac{1}{K} \sum_{n=0}^{K-1} \frac{g(n)}{G} \tag{5.1}$$
where g(n) is the number of active gates at time $nT_0$ and $T = KT_0$ is sufficiently large. 
Obviously, $\alpha(n) = g(n)/G$ is the circuit's instantaneous proportion of active gates at time 
instant $nT_0$. Then the total power consumption of the circuit can be approximated by
$$P = p\,\alpha\,G\,f_{sw} \tag{5.2}$$
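A small worked example of Eq. (5.2) is given below. All numbers are made up for illustration, the function names are hypothetical, and the activity factor is taken as the time average of g(n)/G over the observed clock periods, in line with the description above.

```python
def average_activity(active_gates, G):
    """alpha: mean of g(n)/G over the observed clock periods."""
    return sum(active_gates) / (len(active_gates) * G)

def cmos_power_mw(p, alpha, G, f_sw_mhz):
    """Eq. (5.2): P = p * alpha * G * f_sw, in mW when p is in mW/gate/MHz."""
    return p * alpha * G * f_sw_mhz

# Hypothetical circuit: 1000 gates observed over four clock periods.
alpha = average_activity([100, 300, 200, 200], G=1000)
print(alpha, cmos_power_mw(p=0.005, alpha=alpha, G=1000, f_sw_mhz=42.0))
```

Halving either the activity factor or the switching frequency halves the estimated power, which is why the computation rate rather than the gate count alone drives the power budget.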
5.2.2 Complexity model of multirate VLSI
5.2.2.1 Single-rate systems
Since our main concern is the complexity and power consumption of VLSI, referring to Eq. 
(5.2), a BC can be sufficiently represented by a 3-element tuple $(G, \alpha, f)$ where $G$, $\alpha$, 
and $f$ are respectively the BC's complexity, average percentage of active gates, and sampling 
frequency. Suppose a single-rate system can be decomposed into $I$ types of BCs, each of 
which has component count $N_i$, component complexity $G_i$, and statistical average percentage 
of active gates $\alpha_i$; then the system can be represented by $I$ 4-element tuples 
$(N_i, G_i, \alpha_i, f)$, $0 \le i < I$, where $f$ is the sampling frequency. Alternatively, it can be 
graphically depicted as in Figure 5.1. The system complexity is given by
$$C_{total} = \sum_{i=0}^{I-1} G_i N_i \tag{5.3}$$
Figure 5.1 Single-rate system representation
If the switching frequency of component i is $f_{sw}^{(i)}$, the total power consumption of the 
single-rate system can be estimated by
$$P_{total} = \sum_{i=0}^{I-1} p\,\alpha_i G_i N_i f_{sw}^{(i)} \tag{5.4}$$
The configuration of basic components:
In Eqs. (5.3) and (5.4), $G_i$ and $\alpha_i$ are functions of component i, which basically has two 
architecture options: bit-serial and bit-parallel architectures. We define the configuration 
vector $\Delta$ that determines the configuration of all BCs as
$$\Delta = [\delta_0, \delta_1, \ldots, \delta_{I-1}] \tag{5.5}$$
where $\delta_i$, $0 \le i < I$, is defined by
$$\delta_i = \begin{cases} 0, & \text{if component } i \text{ is in bit-parallel architecture} \\ 1, & \text{if component } i \text{ is in bit-serial architecture} \end{cases} \tag{5.6}$$
Thus the complexity and power consumption of a single-rate VLSI system are functions of 
its configuration vector $\Delta$:
$$G_i = G_i^{(\delta_i)} \quad \text{and} \quad \alpha_i = \alpha_i^{(\delta_i)} \tag{5.7}$$
which gives $G_i^{(0)}$ and $\alpha_i^{(0)}$ for the bit-parallel architecture and $G_i^{(1)}$ and $\alpha_i^{(1)}$ for the 
bit-serial architecture of component i. The switching frequency $f_{sw}^{(i)}$ is usually the clock 
frequency of component i in synchronous logic, which is related to the sampling frequency 
$f_{samp}$ by
$$f_{sw}^{(i)} = L_i^{\delta_i} f_{samp} \tag{5.8}$$
where $L_i$ is the internal word length for component i.
With Eq. (5.7) and Eq. (5.8), Eq. (5.3) and Eq. (5.4) can be rewritten as
C,Ml = E G.<5,>jv- (5.9)
1=0
I-l
(5.10)
1=0
Since = N if samp is the computation rate for the operation i, Eq. (5.10) can be expressed as
P ^ ^ P ^ a f ^ L f R ,  (5.11)
/=0
That is, the total power consumption increases linearly with the increase of computation rate 
of each BC.
If we define the effective complexity as
$$C_e = \sum_{i=0}^{I-1} \alpha_i^{(\delta_i)} \left( \frac{L_i^{\delta_i} f_{samp}}{f_{clock}} \right) G_i^{(\delta_i)} N_i = \sum_{i=0}^{I-1} \alpha_i^{(\delta_i)} \phi_i G_i^{(\delta_i)} N_i \qquad (5.12)$$
where $\phi_i = L_i^{\delta_i} f_{samp} / f_{clock}$ is the switching frequency of the operation $i$ normalized by the system
clock frequency $f_{clock}$, then Eq. (5.10) also takes the form of
$$P_{total} = p f_{clock} C_e \qquad (5.13)$$
where the system clock frequency $f_{clock}$ is usually the maximum switching frequency of the
circuit, that is,
$$f_{clock} = \max_{i \in [0, I)} \{ f_s^{(i)} \} \qquad (5.14)$$
Eq. (5.13) shows that the power consumption is in direct proportion to the effective 
complexity $C_e$, which is the average number of active gates weighted by the normalized 
switching frequencies and is technology independent, as shown in Eq. (5.12). Thus $C_e$ can be 
used as a measure for power consumption.
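The step from Eq. (5.10) to Eq. (5.13) can be written out explicitly. The following is a short derivation sketch that uses only the symbols already defined, with $\phi_i$ denoting the switching frequency of operation $i$ normalized by $f_{clock}$:

```latex
\begin{align*}
P_{total} &= p \sum_{i=0}^{I-1} \alpha_i^{(\delta_i)} G_i^{(\delta_i)} L_i^{\delta_i} N_i f_{samp} \\
          &= p\, f_{clock} \sum_{i=0}^{I-1} \alpha_i^{(\delta_i)} G_i^{(\delta_i)}
             \frac{L_i^{\delta_i} f_{samp}}{f_{clock}} N_i \\
          &= p\, f_{clock} \sum_{i=0}^{I-1} \alpha_i^{(\delta_i)} \phi_i\, G_i^{(\delta_i)} N_i
           = p\, f_{clock}\, C_e .
\end{align*}
```

The technology constant $p$ and the clock frequency are factored out, so that only the dimensionless sum $C_e$ depends on the architecture choices.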
5.2.2.2 Multirate VLSI
Similarly, a multirate system consisting of $I$ types of BCs operating at $J$ different sampling 
rates can be, as far as hardware complexity and power consumption are concerned, 
represented by $I \times J$ 4-element tuples $(N_{ij}, G_{ij}, \alpha_{ij}, f_j)$, $0 \le i < I$, $0 \le j < J$, or in matrix-vector form,
$$S_{mul} \triangleq (\mathbf{N}, \mathbf{G}, \boldsymbol{\alpha}, \mathbf{f}) \qquad (5.15)$$
where the $I \times J$ matrices $\mathbf{N}$, $\mathbf{G}$, and $\boldsymbol{\alpha}$ are formed respectively with elements $N_{ij}$, $G_{ij}$, and $\alpha_{ij}$, 
$0 \le i < I$, $0 \le j < J$, which are respectively the component count, the complexity, and the average 
percentage of active gates of the operation $i$ at the sampling frequency $f_j$. The vector $\mathbf{f}$ 
consists of $f_j$, $0 \le j < J$, which is the $j$-th sampling frequency. The system can be graphically 
represented as in Figure 5.2.
[Figure 5.2: grid with one row per operation (op_0, op_1, op_2, ...) and one column per sampling frequency ($f_0$, $f_1$, $f_2$, ...); each occupied cell $(i, j)$ holds the tuple $(N_{ij}, G_{ij}, \alpha_{ij}, f_j)$]
Figure 5.2 A multirate system representation 
With this multirate VLSI model the complexity and power consumption can be estimated by
$$C_{total} = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} G_{ij} N_{ij} \qquad (5.16)$$
$$P_{total} = p \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} \alpha_{ij} G_{ij} N_{ij} f_s^{(ij)} \qquad (5.17)$$
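5.2.2.3 The configuration matrix $\Delta$
Just as in the single-rate case, the complexity and power consumption of a multirate system 
are functions of the configuration scheme, which is expressed by a configuration matrix $\Delta$ 
instead of the configuration vector of the single-rate case.
The $\Delta$ matrix is defined by
$$\Delta = [\delta_{ij}]_{I \times J} = \begin{bmatrix} \delta_{0,0} & \delta_{0,1} & \cdots & \delta_{0,J-1} \\ \delta_{1,0} & \delta_{1,1} & \cdots & \delta_{1,J-1} \\ \vdots & \vdots & & \vdots \\ \delta_{I-1,0} & \delta_{I-1,1} & \cdots & \delta_{I-1,J-1} \end{bmatrix} \qquad (5.18)$$
where
$$\delta_{ij} = \begin{cases} -1, & \text{if there is no component } i \text{ at } f_j \\ 0, & \text{if component } i \text{ at } f_j \text{ is in bit-parallel} \\ 1, & \text{if component } i \text{ at } f_j \text{ is in bit-serial} \end{cases} \qquad (5.19)$$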
It reflects the architectural information on the multirate VLSI. For a given system 
implementation structure, the ultimate VLSI complexity, power consumption, and 
throughput will be determined mainly by the configuration of the BCs, that is, by which 
architecture (bit-serial or bit-parallel) is taken by each BC at a particular sampling 
frequency. The $\Delta$ matrix therefore plays an important role in VLSI architecture optimization.
5.2.2.4 On the $\Delta$-domain
According to our definition, each of the $I \times J$ elements $\delta_{ij}$ of the matrix $\Delta$ is 1, 0, or -1, 
depending on which implementing architecture the BC takes and also on the system's IS, which 
determines the positions where $\delta_{ij} = -1$. Once the IS is determined, the positions and the 
number of -1s in $\Delta$ are fixed. Suppose that the number of -1s is $r$ ($0 \le r < I \times J$) and the remaining 
$I \times J - r$ elements of $\Delta$ take either 0 or 1; then a one-to-one mapping between an $(I \times J - r)$-bit 
integer and a $\Delta$ matrix can be established. Thus there are a total of $2^{I \times J - r}$ possibilities for $\Delta$, which 
form the entire domain (denoted by $D$) of the configuration matrix $\Delta$.
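The size of the domain can be counted directly from a structure template. The snippet below is a minimal sketch; the function names and the 4x3 template (with 0 used as a placeholder for an entry that may later become 0 or 1) are illustrative assumptions, not part of the original program.

```python
def free_positions(template):
    """Count the entries that are not fixed to -1, i.e. I*J - r."""
    return sum(1 for row in template for entry in row if entry != -1)


def domain_size(template):
    """Number of admissible configuration matrices, 2**(I*J - r)."""
    return 2 ** free_positions(template)


# Hypothetical template: -1 marks "no such component at this rate",
# 0 is only a placeholder for a free (0 or 1) entry.
template = [
    [-1, 0, 0],
    [-1, 0, 0],
    [-1, 0, 0],
    [0, 0, 0],
]

print(free_positions(template), domain_size(template))
```

With three fixed entries out of twelve, nine entries remain free, so the domain holds 512 configuration matrices.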
5.2.3 Complexity and power consumption in multirate VLSI
5.2.3.1 Complexity estimation for multirate systems
It has been shown that $G_{ij}$ and $\alpha_{ij}$ are functions of $\delta_{ij}$. They can be expressed by
$$G_{ij} = G_i^{(\delta_{ij})} \quad \text{and} \quad \alpha_{ij} = \alpha_i^{(\delta_{ij})} \qquad (5.20)$$
According to Eq. (5.19), $G_i^{(1)}$ and $G_i^{(0)}$ are the required number of gates to implement 
component $i$ using bit-serial and bit-parallel architectures respectively, and $\alpha_i^{(1)}$ and $\alpha_i^{(0)}$ are
the average percentage of active gates of component $i$ for bit-serial and bit-parallel 
architectures respectively. For those undefined, we set $G_i^{(-1)} = 0$ and 
$\alpha_i^{(-1)} = 0$. Substituting Eq. (5.20) into Eq. (5.16) we obtain the total complexity,
$$C_{total} = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} G_i^{(\delta_{ij})} N_{ij} \qquad (5.21)$$
5.2.3.2 Power consumption for multirate systems
Assuming that $f_0$ is the system's input sampling frequency and that all sampling frequencies do 
not vary, then they can be expressed with respect to $f_0$,
$$f_j = \lambda_j f_0, \quad j = 0, 1, \ldots, J-1 \qquad (5.22)$$
where $\lambda_j \in R^+$ are the normalized sampling frequencies.
The switching and sampling frequencies are related by
$$f_s^{(ij)} = L_i^{\delta_{ij}} f_j = (L_i^{\delta_{ij}} \lambda_j) f_0 \qquad (5.23)$$
where $L_i$ is the word length of component $i$.
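Referring to Eq. (5.20) and Eq. (5.23), the expression for the total power consumption of 
Eq. (5.17) can be rewritten as
$$P_{total} = p f_0 \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} \alpha_i^{(\delta_{ij})} G_i^{(\delta_{ij})} L_i^{\delta_{ij}} \lambda_j N_{ij} \qquad (5.24)$$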
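Similar to single-rate systems, $P_{total}$ can be expressed in terms of effective complexity with 
the same expression as Eq. (5.13). The effective complexity $C_e$ in a multirate system is, 
however, defined by
$$C_e = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} \alpha_i^{(\delta_{ij})} \phi_{ij} G_i^{(\delta_{ij})} N_{ij}, \qquad \phi_{ij} = \begin{cases} L_i^{\delta_{ij}} f_j / f_{clock}, & \delta_{ij} \in \{0, 1\} \\ 0, & \delta_{ij} = -1 \end{cases} \qquad (5.25)$$
where $\phi_{ij}$ is the normalized switching frequency for the operation $i$ at the sampling frequency 
$f_j$, and the system clock frequency, which is also the maximum switching frequency, can be found 
to be
$$f_{clock} = \max\{ f_s^{(ij)} \} = \max_{i \in [0, I),\, j \in [0, J)} \{ (L_i^{\delta_{ij}} \lambda_j) f_0 \mid \delta_{ij} \in \{0, 1\} \} \qquad (5.26)$$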
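Clearly, the expressions of Eqs. (5.9) and (5.12) are special cases of Eqs. (5.21) and (5.25) for
$\delta_{ij} = \delta_i$, $N_{ij} = N_i$, and $\phi_{ij} = \phi_i$. The estimations for hardware complexity and power
consumption of multirate VLSI can be summarized as follows,
$$C_{total} = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} G_i^{(\delta_{ij})} N_{ij} \qquad (5.27a)$$
$$C_e = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} \alpha_i^{(\delta_{ij})} \phi_{ij} G_i^{(\delta_{ij})} N_{ij} \qquad (5.27b)$$
$$P_{total} = p f_{clock} C_e \qquad (5.27c)$$
$$f_{clock} = \max_{i \in [0, I),\, j \in [0, J)} \{ (L_i^{\delta_{ij}} \lambda_j) f_0 \mid \delta_{ij} \in \{0, 1\} \} \qquad (5.27d)$$
$$\phi_{ij} = \begin{cases} L_i^{\delta_{ij}} \lambda_j f_0 / f_{clock}, & \delta_{ij} \in \{0, 1\} \\ 0, & \delta_{ij} = -1 \end{cases} \qquad (5.27e)$$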
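The summary model of Eq. (5.27) is straightforward to evaluate numerically. The sketch below is not the author's program; it uses the word lengths, activity factors, normalized rates and component counts quoted for the PMDFT example in Section 5.4.3.1 and the multiplier gate-count formulas quoted in Section 5.2.4. The adder, ROM and register gate counts are assumptions (chosen here so that the totals agree with Table 5.4; the actual figures are those of Table 7.3).

```python
# Rows: multiplier, adder, ROM, register (values quoted in Section 5.4.3.1).
WORD_LEN = [11, 10, 8, 8]
ALPHA = {0: [0.40, 0.50, 0.20, 0.70],   # bit-parallel activity factors
         1: [0.50, 0.60, 0.20, 0.70]}   # bit-serial activity factors
LAMBDA = [1.0, 0.125, 0.375]            # lambda_j = f_j / f_0
COUNT = [[0, 72, 4], [0, 48, 52], [0, 72, 2], [16, 16, 72]]

# Gate counts per component. Multiplier: 12L^2+44L-20 (parallel), 70L (serial).
# Adder, ROM and register entries are ASSUMED values, not taken from Table 7.3.
GATES = {0: [12 * 11**2 + 44 * 11 - 20, 200, 48, 48],
         1: [70 * 11, 22, 48, 48]}


def evaluate(delta):
    """Return (C_total, C_e, f_clock / f_0) for one configuration matrix."""
    c_total, weighted, ratio_max = 0, 0.0, 0.0
    for i, row in enumerate(delta):
        for j, d in enumerate(row):
            if d == -1:                          # no component i at rate j
                continue
            ratio = WORD_LEN[i] ** d * LAMBDA[j]  # L_i^delta_ij * lambda_j
            ratio_max = max(ratio_max, ratio)
            c_total += GATES[d][i] * COUNT[i][j]
            weighted += ALPHA[d][i] * GATES[d][i] * ratio * COUNT[i][j]
    # Dividing by the largest ratio normalizes by f_clock, giving C_e.
    return c_total, weighted / ratio_max, ratio_max


delta_224 = [[-1, 1, 0], [-1, 1, 1], [-1, 0, 0], [0, 0, 0]]
print(evaluate(delta_224))
```

Under these assumptions the call returns a gate count of 73848, an effective complexity of about 11796.3 and a normalized clock of 3.75, which are the figures listed for p = 224 in Table 5.4.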
5.2.4 Bit-parallel vs. bit-serial technique
Bit-serial techniques and arithmetic are noted for their low complexity, which makes them very 
attractive for realizing low cost signal processing systems [Lyo81]. For example, referring to 
Table 7.3, an $L \times L$ bit-parallel multiplier may take $12L^2 + 44L - 20$ gates whereas its bit-serial 
counterpart needs only $70L$ gates.
The comparison of power consumption between the two, on the other hand, requires careful 
scrutiny. Let us consider a simple case where only one type of component is involved in a 
single-rate system as described in section 5.2.2.1. Then, according to Eq. (5.10), the power 
consumption for bit-serial architecture and bit-parallel architecture are $P_{serial} = p\,\alpha^{(1)} G^{(1)} L R$ and 
$P_{parallel} = p\,\alpha^{(0)} G^{(0)} R$ respectively. In practice the percentage of active gates in a bit-serial 
architecture is generally higher than that in its bit-parallel counterpart, i.e., $\alpha^{(1)} > \alpha^{(0)}$, and the 
complexity of a bit-serial architecture is usually higher than one $L$-th of that of its bit-parallel 
counterpart, i.e., $L G^{(1)} > G^{(0)}$; thus $P_{serial} > P_{parallel}$ holds in general. In other words, for a given 
system, the VLSI architecture using bit-serial arithmetic will generally consume more power than that 
with bit-parallel arithmetic.
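The argument can be checked for the multiplier with the gate-count formulas quoted above. The sketch below is only an illustration: the word length of 11 bits and the activity factors 0.5 (bit-serial) and 0.4 (bit-parallel) are the values assumed later in Section 5.4.3.1, and the function names are hypothetical.

```python
def parallel_multiplier_gates(word_len):
    """Gate count of an L x L bit-parallel multiplier: 12L^2 + 44L - 20."""
    return 12 * word_len**2 + 44 * word_len - 20


def serial_multiplier_gates(word_len):
    """Gate count of the bit-serial multiplier: 70L."""
    return 70 * word_len


def serial_to_parallel_power_ratio(word_len, alpha_serial, alpha_parallel):
    """P_serial / P_parallel = (alpha1 * G1 * L) / (alpha0 * G0); p and R cancel."""
    serial = alpha_serial * serial_multiplier_gates(word_len) * word_len
    parallel = alpha_parallel * parallel_multiplier_gates(word_len)
    return serial / parallel


L = 11
print(serial_multiplier_gates(L), parallel_multiplier_gates(L))
print(round(serial_to_parallel_power_ratio(L, 0.5, 0.4), 2))
```

For an 11-bit multiplier the bit-serial version needs 770 gates against 1916, yet under these assumptions it dissipates roughly 5.5 times the power at the same computation rate.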
5.2.5 Throughput of multirate DSP systems
The throughput of a multirate system can be defined as the maximum allowed number of 
samples that the system can take up from the input signal(s) per second, which is constrained 
by the maximum allowed switching frequency $F_{max}$ for the given VLSI technology. Thus to fully 
exploit the speed potential of the given technology we need to have
$$f_{clock} = \max\{ f_s^{(ij)} \} = \max_{i \in [0, I),\, j \in [0, J)} \{ (L_i^{\delta_{ij}} \lambda_j) f_0 \mid \delta_{ij} \in \{0, 1\} \} = F_{max} \qquad (5.28)$$
which gives the maximally allowed throughput:
$$f_0 = \frac{F_{max}}{\max_{i \in [0, I),\, j \in [0, J)} \{ L_i^{\delta_{ij}} \lambda_j \mid \delta_{ij} \in \{0, 1\} \}} \qquad (5.29)$$
The throughput maximization problem can therefore be formulated as
$$\min_{\Delta \in D} \left\{ \max_{i \in [0, I),\, j \in [0, J)} \{ L_i^{\delta_{ij}} \lambda_j \mid \delta_{ij} \in \{0, 1\} \} \right\} \qquad (5.30)$$
where the domain $D$ is defined as all the possibilities of $\Delta$.
That is, the throughput maximization is equivalent to the minimization of the maximum 
switching frequency.
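Eq. (5.29) can be illustrated with a few lines of code. The sketch below is a hedged example: the word lengths and normalized rates are those of the PMDFT example in Section 5.4.3.1, while the technology limit of 60 MHz is a purely hypothetical figure chosen for the illustration.

```python
def worst_case_ratio(delta, word_len, lam):
    """max over occupied (i, j) of L_i^delta_ij * lambda_j, as in Eq. (5.30)."""
    return max(
        word_len[i] ** d * lam[j]
        for i, row in enumerate(delta)
        for j, d in enumerate(row)
        if d in (0, 1)
    )


def max_input_rate(delta, word_len, lam, f_max_hz):
    """Largest admissible input sampling frequency f_0, Eq. (5.29)."""
    return f_max_hz / worst_case_ratio(delta, word_len, lam)


WORD_LEN = [11, 10, 8, 8]          # multiplier, adder, ROM, register
LAMBDA = [1.0, 0.125, 0.375]       # normalized sampling frequencies
F_MAX = 60e6                       # hypothetical technology limit in Hz

mixed = [[-1, 1, 0], [-1, 1, 1], [-1, 0, 0], [0, 0, 0]]
all_parallel = [[-1, 0, 0], [-1, 0, 0], [-1, 0, 0], [0, 0, 0]]
print(max_input_rate(mixed, WORD_LEN, LAMBDA, F_MAX))
print(max_input_rate(all_parallel, WORD_LEN, LAMBDA, F_MAX))
```

In this example the mixed configuration is limited by the bit-serial adders at the output rate (ratio 3.75), so the admissible input rate drops from 60 MHz for the all-parallel design to 16 MHz.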
5.3 VLSI architecture for multirate system via optimization
5.3.1 Objective functions of optimal multirate VLSI architectures
The purpose of optimal VLSI architecture is to achieve the best effects with efficient
utilization of limited hardware (silicon) resources. Ideally, we want the VLSI architecture
to demand minimal complexity and power dissipation and, at the same time, allow maximal
throughput. The requirements, however, are often in contradiction. For instance, a low 
complexity architecture is usually accompanied by high power consumption, and high
throughput usually results in a high system clock frequency which in turn will increase the
power consumption. Therefore, an intelligent trade-off has to be made by assigning different 
weights to, or imposing constraints on, these requirements.
In the case of multirate DSP systems these requirements can be treated as, according to Eqs. 
(5.27) and (5.30), a multi-objective optimization problem, i.e.,
$$\min_{\Delta \in D} \left\{ \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} G_i^{(\delta_{ij})} N_{ij} \right\}, \quad \text{for min complexity} \qquad (5.31a)$$
$$\min_{\Delta \in D} \left\{ \frac{\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} \alpha_i^{(\delta_{ij})} G_i^{(\delta_{ij})} L_i^{\delta_{ij}} \lambda_j N_{ij}}{\max_{i \in [0, I),\, j \in [0, J)} \{ L_i^{\delta_{ij}} \lambda_j \mid \delta_{ij} \in \{0, 1\} \}} \right\}, \quad \text{for min power} \qquad (5.31b)$$
$$\min_{\Delta \in D} \left\{ \max_{i \in [0, I),\, j \in [0, J)} \{ L_i^{\delta_{ij}} \lambda_j \mid \delta_{ij} \in \{0, 1\} \} \right\}, \quad \text{for max throughput} \qquad (5.31c)$$
Because the $\lambda_j$'s and the $N_{ij}$'s are fixed for a given system and the $L_i$'s, which are determined 
by the system's finite word length performance and the system dynamic range, are 
independent of architectural considerations, they are known parameters. The $\alpha$'s and $G$'s are 
determined by the configurations of BCs as indicated in Eq. (5.20). Therefore, Eq. (5.31) is 
an unconstrained nonlinear multi-objective programming problem. As it is a function of $\Delta$, it 
can be expressed as
$$OBJ_c(\Delta) = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} G_i^{(\delta_{ij})} N_{ij} \qquad (5.32a)$$
$$OBJ_p(\Delta) = \frac{\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} \alpha_i^{(\delta_{ij})} G_i^{(\delta_{ij})} L_i^{\delta_{ij}} \lambda_j N_{ij}}{\max_{i \in [0, I),\, j \in [0, J)} \{ L_i^{\delta_{ij}} \lambda_j \mid \delta_{ij} \in \{0, 1\} \}} \qquad (5.32b)$$
$$OBJ_f(\Delta) = \max_{i \in [0, I),\, j \in [0, J)} \{ L_i^{\delta_{ij}} \lambda_j \mid \delta_{ij} \in \{0, 1\} \} \qquad (5.32c)$$
Since the $\alpha_i^{(\delta_{ij})}$'s and the $G_i^{(\delta_{ij})}$'s depend solely on the specific design of a BC, they can be
obtained from the pre-designed macros or libraries in the ASIC approach. Therefore the
structural optimization problem becomes the determination of the configuration pattern 
matrix $\Delta$ which is ideally the absolute optimal solution (if it exists) to all the three objective 
functions of Eq. (5.31).
5.3.2 Sub-optimal criteria in minimum distance sense
When the absolute optimum solution for the multi-objective programming problem of Eq.
(5.31) does not exist, which is often the case, sub-optimal criteria and solutions have to be 
adopted [Las70, Got73]. A practical approach to the problem is to minimize the distance 
between the ideal point at which all the objective functions reach their minima and any point 
on the curve defined by the objective functions. The multi-objective programming problem is 
thus reduced to minimizing a single objective function, the evaluating function, which can be 
handled with conventional linear, or nonlinear, programming approaches. The solution is, 
though not globally optimal, the closest to the global optimum.
Since the objective functions in Eq. (5.31) have different units and dynamic ranges, 
normalization is often necessary. A common and the simplest way of normalization is to scale 
the objective functions to the interval [0,1] using linear interpolation. Hence we have the 
following normalized objective functions:
$$d_c(\Delta) = \frac{OBJ_c(\Delta) - MIN_c}{MAX_c - MIN_c}, \quad d_p(\Delta) = \frac{OBJ_p(\Delta) - MIN_p}{MAX_p - MIN_p}, \quad d_f(\Delta) = \frac{OBJ_f(\Delta) - MIN_f}{MAX_f - MIN_f} \qquad (5.33)$$
where $MIN_c$ and $MAX_c$ are the minimum and maximum values of $OBJ_c(\Delta)$ respectively and 
similar notations apply for the other two objective functions. There are a number of ways of 
defining the "distance" with the normalized objective functions. A simple and yet the most 
commonly used one is the Euclidean distance defined by
$$OBJ_{dist}(\Delta) = \sqrt{d_f^2(\Delta) + d_p^2(\Delta) + d_c^2(\Delta)} \qquad (5.34)$$
The minimum distance is thereby given as,
$$\min_{\Delta \in D} \{ OBJ_{dist}(\Delta) \} = \min_{\Delta \in D} \left\{ \sqrt{d_f^2(\Delta) + d_p^2(\Delta) + d_c^2(\Delta)} \right\} \qquad (5.35)$$
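The normalization and the distance criterion can be sketched as follows. This is an illustrative example rather than the original program: the raw complexity and switching-frequency figures and the normalized power values are copied from Table 5.4, and the search is restricted to three rows of that table instead of the full domain.

```python
import math


def normalise(value, lowest, highest):
    """Linear scaling of an objective value to the interval [0, 1]."""
    return (value - lowest) / (highest - lowest)


def distance(d_c, d_p, d_f):
    """Euclidean distance to the ideal point (0, 0, 0), Eq. (5.34)."""
    return math.sqrt(d_c**2 + d_p**2 + d_f**2)


# Extremes read from Table 5.4: gate count 69264..174160, max switch 1.0..8.0.
C_MIN, C_MAX = 69264, 174160
F_MIN, F_MAX = 1.0, 8.0

# p: (raw gate count, normalized power from the table, raw max switch)
rows = {224: (73848, 0.321286, 3.750),
        480: (69264, 0.327479, 4.125),
        176: (83104, 0.411696, 3.000)}

scores = {}
for p, (gates, d_p, max_switch) in rows.items():
    d_c = normalise(gates, C_MIN, C_MAX)
    d_f = normalise(max_switch, F_MIN, F_MAX)
    scores[p] = distance(d_c, d_p, d_f)

best = min(scores, key=scores.get)
print(best, round(scores[best], 4))
```

Among the three candidates the pattern p = 224 lies closest to the ideal point, in agreement with the minimum distance row of Table 5.4.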
5.3.3 Weighted evaluation function approach
When one wishes to impose preferences on the optimization, a weighted evaluating function 
can be defined by Eq. (5.36), where $\gamma_f$, $\gamma_p$, and $\gamma_c$ are respectively the weighting coefficients
for the normalized maximum switching frequency, power, and complexity objective 
functions, showing the relative importance of each objective quantity.
$$OBJ_{wei}(\Delta) = \gamma_c d_c(\Delta) + \gamma_p d_p(\Delta) + \gamma_f d_f(\Delta), \qquad \gamma_c + \gamma_p + \gamma_f = 1 \qquad (5.36)$$
The optimal solution (architecture) can be obtained by minimizing Eq. (5.36).
$$\min_{\Delta \in D} \{ \gamma_c d_c(\Delta) + \gamma_p d_p(\Delta) + \gamma_f d_f(\Delta) \} \qquad (5.37)$$
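A minimal sketch of the weighted criterion is given below. The candidate set is an assumption made for brevity: it holds only six rows of Table 5.4, each as a normalized triple (d_c, d_p, d_f), whereas the actual search runs over the whole domain D. The weights are rescaled so that they sum to one.

```python
# p: (d_c, d_p, d_f), normalized values copied from Table 5.4.
CANDIDATES = {
    224: (0.043700, 0.321286, 0.392857),
    480: (0.000000, 0.327479, 0.446429),
    160: (0.131940, 0.974365, 0.053751),
    32:  (0.918548, 0.259640, 0.035714),
    1:   (1.000000, 0.000000, 1.000000),
    0:   (1.000000, 0.334612, 0.000000),
}


def weighted_choice(candidates, weights):
    """Return the pattern p minimizing gamma_c*d_c + gamma_p*d_p + gamma_f*d_f."""
    total = sum(weights)
    gammas = [w / total for w in weights]      # enforce a unit sum

    def cost(p):
        return sum(g * d for g, d in zip(gammas, candidates[p]))

    return min(candidates, key=cost)


for weights in [(1, 1, 1), (0.40, 0.40, 0.20), (0.45, 0.09, 0.45), (0.14, 0.43, 0.43)]:
    print(weights, weighted_choice(CANDIDATES, weights))
```

On this reduced candidate set the four weight vectors select p = 224, 480, 160 and 32 respectively, the same patterns as the weighted objective rows of Table 5.4.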
5.3.4 Strong preferences: partial optimization
It is also of practical interest in some circumstances that the absolute minimum of one of the 
objective functions in Eq. (5.31) is strongly desirable whilst the other two are less important 
(but can be subject to a preference of one over the other). A typical example can be found in 
some low rate systems where the primary interest is to minimize the system complexity 
whereas the maximum allowed switching frequency and the power consumption are not a
concern. Then the optimization can be done via one of the following routines depending on
the sub-preference of the two less important objectives.
$$\min_{\Delta \in D} \{ d_c(\Delta) \} \xrightarrow{D_c \subset D} \min_{\Delta \in D_c} \{ d_p(\Delta) \} \xrightarrow{D_{cp} \subset D_c} \min_{\Delta \in D_{cp}} \{ d_f(\Delta) \} \longrightarrow \Delta \in D_{cp} \qquad (5.38a)$$
$$\min_{\Delta \in D} \{ d_c(\Delta) \} \xrightarrow{D_c \subset D} \min_{\Delta \in D_c} \{ d_f(\Delta) \} \xrightarrow{D_{cf} \subset D_c} \min_{\Delta \in D_{cf}} \{ d_p(\Delta) \} \longrightarrow \Delta \in D_{cf} \qquad (5.38b)$$
The above notation indicates the order of optimization. That is, do the partial optimization on
the most important objective function $d_c(\Delta)$ in the full set $D$, producing a subset $D_c \subset D$; do the
partial optimization on the less important objective function $d_p(\Delta)$ (or $d_f(\Delta)$) in the domain
$D_c$, producing a subset $D_{cp} \subset D_c$ (or $D_{cf} \subset D_c$); and finally, do the partial optimization on the least
important objective function $d_f(\Delta)$ (or $d_p(\Delta)$) in the domain $D_{cp}$ (or $D_{cf}$), producing the desired 
configuration pattern $\Delta \in D_{cp}$ (or $\Delta \in D_{cf}$). Similarly, for the power and the maximum
switching frequency, we can have two more pairs of such optimization routines like Eq. 
(5.38) to complete all possibilities.
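The ordered (partial) optimization amounts to successive filtering of the candidate set. The sketch below is illustrative only: the five candidates are rows of Table 5.4 given as (d_c, d_p, d_f), the letters C, P and T follow the table's notation, and the tolerance used to detect ties is an assumption.

```python
# p: (d_c, d_p, d_f), normalized values copied from Table 5.4.
CANDIDATES = {
    224: (0.043700, 0.321286, 0.392857),
    480: (0.000000, 0.327479, 0.446429),
    481: (0.000000, 0.152587, 1.000000),
    1:   (1.000000, 0.000000, 1.000000),
    0:   (1.000000, 0.334612, 0.000000),
}
AXIS = {"C": 0, "P": 1, "T": 2}   # complexity, power, max switching frequency


def partial_optimum(candidates, order, tol=1e-9):
    """Keep, objective by objective, only the patterns that attain the minimum."""
    survivors = list(candidates)
    for name in order:
        k = AXIS[name]
        best = min(candidates[p][k] for p in survivors)
        survivors = [p for p in survivors if candidates[p][k] <= best + tol]
    return survivors


for order in ("CPT", "CTP", "PCT", "TCP"):
    print(order, partial_optimum(CANDIDATES, order))
```

Both p = 480 and p = 481 attain the minimum complexity; the second stage then picks 481 when power comes next and 480 when the switching frequency comes next, as in the partial optimization rows of Table 5.4.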
5.3.5 Constrained optimization problems
The optimization methods presented so far are all unconstrained, that is, practical and 
physical constraints on the three objectives either do not exist or are not considered. 
Sometimes, we do not care much about one (or even two) of the objective functions in Eq.
(5.31) as long as it is within a pre-determined constrained region, whilst wishing to minimize 
the other two which are unconstrained; for instance, given an upper bound for the maximum 
switching frequency (the system clock frequency), find the best configuration pattern that 
minimizes the complexity and the power. This is a constrained optimization problem and can 
be treated as a de-dimensioned multi-objective optimization one. Again, we can apply
minimum distance criteria in this case. Thus the optimization task can be one of the following 
cases,
$$\min_{\Delta \in \{D \,\mid\, d_f(\Delta) \le F\}} \left\{ \sqrt{d_c^2(\Delta) + d_p^2(\Delta)} \right\} \qquad (5.39a)$$
$$\min_{\Delta \in \{D \,\mid\, d_p(\Delta) \le P\}} \left\{ \sqrt{d_c^2(\Delta) + d_f^2(\Delta)} \right\} \qquad (5.39b)$$
$$\min_{\Delta \in \{D \,\mid\, d_c(\Delta) \le C\}} \left\{ \sqrt{d_p^2(\Delta) + d_f^2(\Delta)} \right\} \qquad (5.39c)$$
where $F$, $P$, and $C$ are the normalized upper limits for maximum switching frequency 
(corresponding to minimum throughput), power, and complexity respectively.
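The constrained cases of Eq. (5.39) can be sketched as a filter followed by a two-dimensional minimum distance search. The example below is an assumption-laden illustration: the candidate set consists of rows of Table 5.4 as (d_c, d_p, d_f) triples rather than the full domain, and the function name is hypothetical.

```python
import math

# p: (d_c, d_p, d_f), normalized values copied from Table 5.4.
CANDIDATES = {
    224: (0.043700, 0.321286, 0.392857),
    480: (0.000000, 0.327479, 0.446429),
    160: (0.131940, 0.974365, 0.053751),
    32:  (0.918548, 0.259640, 0.035714),
    176: (0.131940, 0.411696, 0.285714),
    96:  (0.830308, 0.048358, 0.392857),
    164: (0.131940, 0.480539, 0.285714),
    225: (0.043700, 0.131266, 1.000000),
    481: (0.000000, 0.152587, 1.000000),
    1:   (1.000000, 0.000000, 1.000000),
    0:   (1.000000, 0.334612, 0.000000),
}
AXIS = {"C": 0, "P": 1, "T": 2}


def constrained_choice(candidates, constrained, limit):
    """Bound one normalized objective, minimize the distance over the other two."""
    k = AXIS[constrained]
    free_axes = [a for a in range(3) if a != k]
    feasible = [p for p, d in candidates.items() if d[k] <= limit]
    return min(feasible,
               key=lambda p: math.hypot(*(candidates[p][a] for a in free_axes)))


print(constrained_choice(CANDIDATES, "C", 0.2))   # complexity bounded
print(constrained_choice(CANDIDATES, "T", 1.0))   # switching frequency bounded
```

With the complexity bounded by 0.2 the sketch selects p = 176, and with the switching frequency bounded by 1.0 it selects p = 225, matching the corresponding constrained rows of Table 5.4.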
5.3.6 The mapping between $p$ and $\Delta$
The need to parameterize the $\Delta$ matrix to an integer lies in the fact that, from the 
programming point of view, it is much easier to handle an integer variable than a binary 
matrix in which each element toggles. We define the following transform pair $V(\cdot)$ and $M(\cdot)$ 
which convert a matrix to a vector and a vector back to a matrix respectively
$$V(\Delta) = V\left( \begin{bmatrix} \delta_{0,0} & \delta_{0,1} & \cdots & \delta_{0,J-1} \\ \delta_{1,0} & \delta_{1,1} & \cdots & \delta_{1,J-1} \\ \vdots & \vdots & & \vdots \\ \delta_{I-1,0} & \delta_{I-1,1} & \cdots & \delta_{I-1,J-1} \end{bmatrix} \right) = [\delta_{I-1,0}, \delta_{I-1,1}, \ldots, \delta_{I-1,J-1}, \ldots, \delta_{0,0}, \delta_{0,1}, \ldots, \delta_{0,J-1}] = [\delta_{ij}]_{(i-,\,j+)}, \quad \delta_{ij} \in \{-1, 0, 1\} \qquad (5.40a)$$
$$M(V(\Delta)) = \Delta \qquad (5.40b)$$
where the subscript $(i-,\,j+)$ means that the index pairs $(i, j)$ are formed by advancing $j$ first,
starting from $(I-1, 0)$ with $i$ in decremental and $j$ in incremental order (i.e., by scanning the 
matrix row by row from the bottom to the top). If we ignore those elements whose values are 
-1 in $V(\Delta)$, then the vector represents an integer, and we subscript it with 'bin' to denote 
such integerization that discards '-1' elements
$$\{V(\Delta)\}_{bin} = \{ [\delta_{ij}]_{(i-,\,j+)};\ \delta_{ij} \in \{-1, 0, 1\} \}_{bin} = \{ [\delta_{ij}]_{(i-,\,j+)};\ \delta_{ij} \in \{0, 1\} \}_{bin} = p_{I \times J - r} \qquad (5.41)$$
where $r$ is the number of '-1's in the $\Delta$ matrix. The reverse mapping is given as follows,
$$\{ p_{I \times J - r}, \Delta_{-1} \}_{vec} = V(\Delta), \qquad M(V(\Delta)) = \Delta \qquad (5.42)$$
That is, the reverse mapping requires two steps: retrieving the vectorized $\Delta$, $V(\Delta)$, from the 
integer $p_{I \times J - r}$ considering the positions of the -1s in the $\Delta$ matrix, and then converting $V(\Delta)$ back
to the matrix form.
In summary, the forward mapping and the backward mapping between the $\Delta$ matrix and the 
configuration pattern $p$ are
$$p_{I \times J - r}(\Delta) = \{V(\Delta)\}_{bin}, \qquad \Delta(p_{I \times J - r}) = M(\{ p_{I \times J - r}, \Delta_{-1} \}_{vec}) \qquad (5.43)$$
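The two mappings can be sketched in a few lines. This is not the author's code; the function names are hypothetical, and the bit order is an assumption inferred from Table 5.4, where p = 1 corresponds to the first scanned element (bottom row, first column) being 1, i.e. the first element of the vector is taken as the least significant bit.

```python
def delta_to_pattern(delta):
    """Forward mapping: scan rows bottom to top, drop -1s, read bits LSB first."""
    bits = [d for row in reversed(delta) for d in row if d != -1]
    return sum(bit << k for k, bit in enumerate(bits))


def pattern_to_delta(p, template):
    """Reverse mapping: refill the non -1 positions of the template from p."""
    delta = [row[:] for row in template]
    k = 0
    for i in range(len(template) - 1, -1, -1):      # bottom row first
        for j, entry in enumerate(template[i]):
            if entry != -1:
                delta[i][j] = (p >> k) & 1
                k += 1
    return delta


# Template of the PMDFT example: -1 where a component does not exist,
# 0 as a placeholder for a free entry.
template = [[-1, 0, 0], [-1, 0, 0], [-1, 0, 0], [0, 0, 0]]

delta = pattern_to_delta(224, template)
print(delta)
print(delta_to_pattern(delta))
```

Under this bit-order assumption, p = 224 maps to the matrix with rows (-1 1 0), (-1 1 1), (-1 0 0), (0 0 0), and the forward mapping returns 224 again.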
5.4 Examples
5.4.1 PMDFT and TM-PMDFT filter banks
PMDFT filter banks can be implemented using fast and flexible DSP processors or 
implemented in VLSI for high speed applications. Implementation limits of the PMDFT filter 
bank can be frequency bounded (i.e., for wideband channels requiring too high a sampling 
frequency), space bounded (for a large number of channels), or bounded by both. A frequency 
bounded system will result in either an impractically high sampling rate or too high a power 
consumption, whereas a space bounded system will cause implementation difficulties due to 
high complexity. To allow constrained optimization of the PMDFT filter bank, we discuss two 
implementation schemes, the direct structure and a time-multiplexed structure. The former 
uses a polyphase filter bank followed by an M-point DFT as shown in Figure 3.7. 
Alternatively, the time-multiplexing (time-sharing) concept can be applied to reduce the 
number of sub-filters of the filter bank. This structure is referred to as the TM-PMDFT structure, 
which uses only one sub-filter (out of $L \times M$) as shown in Figure 5.3.
[Figure 5.3: block diagram with a DEMUX commutator (S/P converter), a MUX commutator (P/S converter), coefficient memory, an inner productor, a sequential systolic DFT, and a control block]
Figure 5.3 Time-multiplexed implementation of PMDFT
Obviously, a TM-PMDFT can substantially reduce the complexity especially when filter 
length and the decimation factor M  are large (it is generally true that a PMDFT needs a long 
filter length because very short transition bands are often required particularly for large M).
5.4.2 PMDFT and TM-PMDFT complexity
We assume that the system consists solely of multipliers, adders, shift registers/RAMs, and
ROMs. To calculate the BC counts of the two structures we make the following assumptions:
a) the coefficient word length is the same as that for the sampled signal;
b) an $n$-tap FIR uses $n$ real multipliers, $n-1$ real adders, $n$ words of ROM to store the 
coefficients, and $n-1$ words of shift registers for the delays;
c) referring to [Bla85], the multiplier and adder counts for fully optimized 
radix-2 FFTs are listed in Table 5.3;
d) the ROM size for twiddle factors for an $n$-point FFT is $n/2 - 2$, and $2n \log_2 n$ words of 
registers are required for pipelining;
e) the complexity of the SS-DFT is based on Chang's systolic DFT model [Cha88] (Figure 
6.10);
f) a $1:n$ SPC commutator would require $2n$ shift registers ($n$ latches, $n$ shift registers), and
g) an $n:1$ PSC commutator would require $n$ shift registers.
The required component counts and the total gate counts for direct implementation of
PMDFT and for TM-PMDFT are listed in Table 5.1 and Table 5.2 respectively. Though the
TM-PMDFT requires fewer multipliers and adders, its computational complexity is exactly the same
as that of the PMDFT. The complexities for some BCs can be found in Table 7.3.
Table 5.1 Operation count for direct implementation of PMDFT
functional block   real mults      real adders        ROMs       registers
FIR                N               N-LM               N          N/L-M
FFT                M_fft(M)*       A_fft(M)*          M/2-2      2M log2(M)
in/commutator      -               -                  -          2M
out/commutator     -               -                  -          LM
total              N+M_fft(M)      N-LM+A_fft(M)      N+M/2-2    LM+2M log2(M)+M+N/L
* See Table 5.3
Table 5.2 Operation count for TM-PMDFT filter banks
functional block   real mults      real adders     ROMs        registers
s/reg. bank        -               -               -           N/L-M
commutator         -               -               -           2M
inner productor    N/(LM)          N/(LM)-1        1           -
coef. ROMs         -               -               N           -
SS-DFT             4(M-1)          8M-4            2(M-1)      6M
total              N/(LM)+4M-4     N/(LM)+8M-5     N+2(M-1)    N/L+7M
Table 5.3 Multiplication and addition counts of optimized radix-2 complex FFTs
transf. size (M)    8     16    32    64     128    256
real multipliers    4     24    88    264    712    1800
real adders         52    152   408   1032   2504   5896
* Complex multiplication using 3 real multiplications and 3 real additions; trivial multiplications (by ±1 or 
±j) not counted; symmetries of trigonometric functions fully used [Bla85]
5.4.3 VLSI architecture optimization
Figures 3.7 and 5.3 show the ISs for the two structures. Since one-to-one mapping is 
assumed, no time-sharing is allowed for BCs. Therefore, the VLSI architectures are 
determined by choosing either a bit-serial or a bit-parallel architecture for each BC in the ISs. 
Applying the multirate VLSI complexity model and optimization techniques introduced 
previously, optimized VLSI architectures can be found for given constraints.
A C program has been written to implement the optimization algorithms presented in this chapter.
The input to the program is a data file containing the known system parameters. The outputs 
of the program are the optimized configuration matrices $\Delta_{opt}$.
5.4.3.1 Optimal PMDFT architectures
There are three sampling frequencies in the PMDFT structure shown in Figure 3.7: the input 
sampling frequency $f_0$, the decimated sampling frequency for the FIR filter bank $f_1 = f_0/M$, and 
the output sampling frequency $f_2 = f_0 L/M$, where $M$ and $L$ are the decimation and interpolation 
factors respectively. In this example, $M = 8$, $L = 3$, and $f_0 = 16$ MHz. Hence $f_1 = 2$ MHz, $f_2 = 6$ MHz, 
and the normalized sampling frequencies are $(\lambda_0, \lambda_1, \lambda_2) = (1.0, 0.125, 0.375)$. The PMDFT 
filter bank can be built with four kinds of BCs: multipliers, adders, ROMs, and shift registers. 
We assume that the word lengths for them are 11, 10, 8, and 8 bits respectively.
The $\alpha$ values for bit-serial architectures are generally larger than those for bit-parallel 
architectures, because most bit-serial circuits are time-shared by all bit slices and thus are more 
active than bit-parallel circuits. Since the $\alpha$'s are between 0 and 1 (typically 0.5 for random logic) 
and are not affected by structural parameters, M, L, Nb, etc., it can be envisaged that the 
inconsistency between the assumed and the actual values of the $\alpha$'s will not have much influence 
on the optimization results (architectures) as long as the relative values for the assumed $\alpha$'s 
are reasonable. We choose the $\alpha$'s for the bit-serial multiplier and adder as 0.5 and 0.6; for the 
bit-parallel multiplier and adder as 0.4 and 0.5; and for the ROM and register in both bit-serial and 
bit-parallel cases as 0.2 and 0.7 respectively. That is, $(\alpha_0^{(0)}, \alpha_1^{(0)}, \alpha_2^{(0)}, \alpha_3^{(0)}) = (0.40, 0.50, 0.20, 0.70)$ 
and $(\alpha_0^{(1)}, \alpha_1^{(1)}, \alpha_2^{(1)}, \alpha_3^{(1)}) = (0.50, 0.60, 0.20, 0.70)$.
For simplicity, we assume that the prototype filter length $N$ is an integer multiple of $L \times M$, 
which is also the subfilter length. Thus a 72-tap FIR is assumed for the prototype filter
($N = 3 \times L \times M = 72$). Referring to Figure 5.2 and Table 5.3, the $\mathbf{N}$ matrix showing the component 
count for operation $i$ at the sampling frequency $f_j$ can be determined and is given as
$$[N_{ij}] = \begin{bmatrix} 0 & 72 & 4 \\ 0 & 48 & 52 \\ 0 & 72 & 2 \\ 16 & 16 & 72 \end{bmatrix}$$
where the first row through the fourth row respectively shows the component counts for 
multiplier, adder, ROM, and register. Since there are no multipliers, adders, and ROMs in the 
first stage (i.e., at $f_0$) in Figure 3.7, the $\Delta$ matrix should take the form of
$$\Delta = \begin{bmatrix} -1 & \circ & \circ \\ -1 & \circ & \circ \\ -1 & \circ & \circ \\ \circ & \circ & \circ \end{bmatrix}$$
where $\circ$ = 0 or 1. Consequently, the configuration pattern $p$ is a 9-bit integer ($0 \le p_9 < 512$). 
The BCs' gate counts against the word length are listed in Table 7.3. Note that the ROM and shift 
register have the same complexity and $\alpha$ value for bit-serial and bit-parallel architectures.
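The component-count matrix can be reproduced from the formulas of Table 5.1. The sketch below is illustrative: the assignment of the functional blocks to the three sampling frequencies (input commutator at f_0, FIR bank at f_1, FFT and output commutator at f_2) is an assumption inferred from the matrix itself, and only two columns of Table 5.3 are included.

```python
import math

# Excerpt of Table 5.3: transform size -> (real multipliers, real adders).
FFT_COUNTS = {8: (4, 52), 16: (24, 152)}


def pmdft_count_matrix(n_taps, interp, decim):
    """Rows: multiplier, adder, ROM, register. Columns: f_0, f_1, f_2."""
    fft_mults, fft_adds = FFT_COUNTS[decim]
    stages = int(math.log2(decim))
    fir = [n_taps, n_taps - interp * decim, n_taps, n_taps // interp - decim]
    fft = [fft_mults, fft_adds, decim // 2 - 2, 2 * decim * stages]
    in_commutator = [0, 0, 0, 2 * decim]
    out_commutator = [0, 0, 0, interp * decim]
    return [[in_commutator[k], fir[k], fft[k] + out_commutator[k]]
            for k in range(4)]


for row in pmdft_count_matrix(72, 3, 8):
    print(row)
```

For N = 72, L = 3 and M = 8 this yields the rows (0, 72, 4), (0, 48, 52), (0, 72, 2) and (16, 16, 72).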
Figures 5.4 (a), (b), and (c) show the complexity, power, and the maximum switching frequency 
respectively.
[Figure 5.4: three plots against the $\Delta$-pattern $p$: (a) Complexity (Gate Count) for PMDFT; (b) Power Consumption (Active Gate Count) for PMDFT; (c) Maximum Switch Frequency for PMDFT]
Figure 5.4 VLSI complexities of a PMDFT filter bank
Some of the optimization results for the PMDFT structure are listed in Table 5.4.
Table 5.4 Optimization results for the PMDFT structure
methods              conditions          complexity           power                 max switch          p     $\Delta$-matrix (rows top to bottom)
minimum distance     -                   0.043700 (73848)     0.321286 (11796.3)    0.392857 (3.750)    224   [-1 1 0; -1 1 1; -1 0 0; 0 0 0]
weighted objective*  (0.33 0.33 0.33)    0.043700 (73848)     0.321286 (11796.3)    0.392857 (3.750)    224   [-1 1 0; -1 1 1; -1 0 0; 0 0 0]
"                    (0.40 0.40 0.20)    0.000000 (69264)     0.327479 (11985.2)    0.446429 (4.125)    480   [-1 1 1; -1 1 1; -1 0 0; 0 0 0]
"                    (0.45 0.09 0.45)    0.131940 (83104)     0.974365 (31717.9)    0.053751 (1.375)    160   [-1 1 0; -1 1 0; -1 0 0; 0 0 0]
"                    (0.14 0.43 0.43)    0.918548 (165616)    0.259640 (9915.8)     0.035714 (1.250)    32    [-1 0 0; -1 1 0; -1 0 0; 0 0 0]
constrained          complex ≤ 0.2       0.131940 (83104)     0.411696 (14554.2)    0.285714 (3.000)    176   [-1 1 0; -1 1 0; -1 0 1; 0 0 0]
"                    complex ≤ 0.9       0.830308 (156360)    0.048358 (3471.7)     0.392857 (3.750)    96    [-1 0 0; -1 1 1; -1 0 0; 0 0 0]
"                    power ≤ 0.5         0.131940 (83104)     0.480539 (16654.2)    0.285714 (3.000)    164   [-1 1 0; -1 1 0; -1 0 0; 0 0 1]
"                    power ≤ 1.0         0.131940 (83104)     0.974365 (31717.9)    0.053751 (1.375)    160   [-1 1 0; -1 1 0; -1 0 0; 0 0 0]
"                    max-swit ≤ 0.3      0.131940 (83104)     0.411696 (14554.2)    0.285714 (3.000)    176   [-1 1 0; -1 1 0; -1 0 1; 0 0 0]
"                    max-swit ≤ 1.0      0.043700 (73848)     0.131266 (5999.9)     1.000000 (8.000)    225   [-1 1 0; -1 1 1; -1 0 0; 1 0 0]
partial optimiz.     C -> P -> T**       0.000000 (69264)     0.152587 (6650.3)     1.000000 (8.000)    481   [-1 1 1; -1 1 1; -1 0 0; 1 0 0]
"                    C -> T -> P         0.000000 (69264)     0.327479 (11985.2)    0.446429 (4.125)    480   [-1 1 1; -1 1 1; -1 0 0; 0 0 0]
"                    P -> C -> T         1.000000 (174160)    0.000000 (1995.8)     1.000000 (8.000)    1     [-1 0 0; -1 0 0; -1 0 0; 1 0 0]
"                    P -> T -> C         1.000000 (174160)    0.000000 (1995.8)     1.000000 (8.000)    1     [-1 0 0; -1 0 0; -1 0 0; 1 0 0]
"                    T -> C -> P         1.000000 (174160)    0.334612 (12202.8)    0.000000 (1.000)    0     [-1 0 0; -1 0 0; -1 0 0; 0 0 0]
"                    T -> P -> C         1.000000 (174160)    0.334612 (12202.8)    0.000000 (1.000)    0     [-1 0 0; -1 0 0; -1 0 0; 0 0 0]
* The weighting vector is of the form $(\gamma_c, \gamma_p, \gamma_f)$.
** The optimization priority is in the order of C(omplexity), P(ower), and T(hroughput).
*** Values in brackets are the ones not normalized.
It can be seen that the minimum distance approach and the weighted objective function 
approach with equal weighting coefficients lead to the same optimum solution. This is 
because the normalization of the objective functions has itself made all the
objective functions equally important. These two approaches are useful for situations where 
the three objectives are considered equally important. When one of them is constrained whilst 
the other two are equally important, the constrained optimization approach can be used. Some 
tentative constraints can be found in Table 5.4. For situations where one of the objectives is 
strictly constrained to its minimum, the partial optimization approach or the constrained 
optimization approach can be used, depending on whether an optimization priority between the 
other two is desired. The weighted objective function approach can be used for all these cases 
with a carefully chosen weighting vector. Thus an interactive optimization 
procedure is essential for any practical application.
According to Eq. (5.42), the resulting configuration patterns are mapped back to the $\Delta$
matrices shown in the last column of the table. The minimum distance solution of $p = 224$
in the above table, as an example, is mapped to
$$\Delta = \begin{bmatrix} -1 & 1 & 0 \\ -1 & 1 & 1 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
which suggests that the multipliers and adders of the sub-FIRs and the adders of the FFT be 
in bit-serial while the rest of the components in the PMDFT be in bit-parallel.
5.4.3.2 Optimal TM-PMDFT architectures
In the TM-PMDFT structure of Figure 5.3, the sampling frequencies are the input sampling 
frequency $f_0$, the decimated sampling frequency $f_1 = f_0/M$ for the shift register bank, and the
interpolated sampling frequency $f_2 = L f_0$ for the rest of the functional blocks. Again we assume 
that $M = 8$, $L = 3$, and $f_0 = 16$ MHz. Hence $f_1 = 2$ MHz, $f_2 = 48$ MHz, and the normalized sampling 
frequencies are $(\lambda_0, \lambda_1, \lambda_2) = (1.0, 0.125, 3.0)$.
As in the PMDFT case, multiplier, adder, ROM, and register are the four BCs and their word 
lengths are also assumed to be 11, 10, 8, and 8 bits respectively. The assumptions for the $\alpha$ values 
are the same as those in PMDFT, i.e., $(\alpha_0^{(0)}, \alpha_1^{(0)}, \alpha_2^{(0)}, \alpha_3^{(0)}) = (0.40, 0.50, 0.20, 0.70)$ and 
$(\alpha_0^{(1)}, \alpha_1^{(1)}, \alpha_2^{(1)}, \alpha_3^{(1)}) = (0.50, 0.60, 0.20, 0.70)$.
The prototype filter length $N$ is also assumed to be 72 as in the PMDFT case. The $\mathbf{N}$ matrix, however,
is much different from that of the PMDFT. From Table 5.2, it can be calculated as
$$[N_{ij}] = \begin{bmatrix} 0 & 0 & 31 \\ 0 & 0 & 30 \\ 0 & 0 & 72 \\ 16 & 16 & 56 \end{bmatrix}$$
Since there are no multipliers, adders, and ROMs in the first two stages (i.e., at $f_0$ and $f_1$) in 
Figure 5.3, the $\Delta$ matrix should take the form of
$$\Delta = \begin{bmatrix} -1 & -1 & \circ \\ -1 & -1 & \circ \\ -1 & -1 & \circ \\ \circ & \circ & \circ \end{bmatrix}$$
where $\circ$ = 0 or 1 and, accordingly, the configuration pattern $p$ is a 6-bit integer ($0 \le p_6 < 64$).
Figures 5.5 (a), (b), and (c) show the complexity, power, and the maximum switching frequency 
respectively. Figure 5.5(d) shows the normalized objective functions against the configuration 
pattern $p$.
[Figure 5.5: four plots against the $\Delta$-pattern $p$: (a) Complexity for TM-PMDFT; (b) Power (active gate count) for TM-PMDFT; (c) Max switch-frequency for TM-PMDFT; (d) Normalized Complexity, Power, and Max Switch-freq for a TM-PMDFT Filter Bank Structure]
Figure 5.5 VLSI complexities of a TM-PMDFT filter bank 
The optimization results are listed in Table 5.5.
Table 5.5 Optimization results for the TM-PMDFT structure
methods              conditions          complexity          power                 max switch         p    $\Delta$-matrix (rows top to bottom)
minimum distance     -                   0.000000 (32210)    0.356525 (12547.2)    1.000000 (33.0)    48   [-1 -1 1; -1 -1 1; -1 -1 0; 0 0 0]
weighted objective   (0.33 0.33 0.33)    0.000000 (32210)    0.356525 (12547.2)    1.000000 (33.0)    48   [-1 -1 1; -1 -1 1; -1 -1 0; 0 0 0]
"                    (0.40 0.40 0.20)    0.000000 (32210)    0.356525 (12547.2)    1.000000 (33.0)    48   [-1 -1 1; -1 -1 1; -1 -1 0; 0 0 0]
"                    (0.50 0.00 0.50)    1.000000 (73076)    0.994114 (29532.8)    0.000000 (3.0)     0    [-1 -1 0; -1 -1 0; -1 -1 0; 0 0 0]
"                    (0.20 0.40 0.40)    1.000000 (73076)    0.318912 (11545.2)    0.166667 (8.0)     1    [-1 -1 0; -1 -1 0; -1 -1 0; 1 0 0]
constrained          complex ≤ 0.2       0.130671 (37550)    0.353249 (12459.9)    1.000000 (33.0)    32   [-1 -1 1; -1 -1 0; -1 -1 0; 0 0 0]
"                    complex ≤ 0.9       0.869329 (67736)    0.000000 (3049.3)     0.900000 (30.0)    16   [-1 -1 0; -1 -1 1; -1 -1 0; 0 0 0]
"                    power ≤ 0.6         0.000000 (32210)    0.356525 (12547.2)    1.000000 (33.0)    48   [-1 -1 1; -1 -1 1; -1 -1 0; 0 0 0]
"                    power ≤ 1.0         1.000000 (73076)    0.994114 (29532.8)    0.000000 (3.0)     0    [-1 -1 0; -1 -1 0; -1 -1 0; 0 0 0]
"                    max-swit ≤ 0.2      1.000000 (73076)    0.318912 (11545.2)    0.166667 (8.0)     1    [-1 -1 0; -1 -1 0; -1 -1 0; 1 0 0]
"                    max-swit ≤ 1.0      0.000000 (32210)    0.356525 (12547.2)    1.000000 (33.0)    48   [-1 -1 1; -1 -1 1; -1 -1 0; 0 0 0]
partial optimiz.     C -> P -> T         0.000000 (32210)    0.356525 (12547.2)    1.000000 (33.0)    48   [-1 -1 1; -1 -1 1; -1 -1 0; 0 0 0]
"                    C -> T -> P         0.000000 (32210)    0.356525 (12547.2)    1.000000 (33.0)    48   [-1 -1 1; -1 -1 1; -1 -1 0; 0 0 0]
"                    P -> C -> T         0.869329 (67736)    0.000000 (3049.3)     0.900000 (30.0)    16   [-1 -1 0; -1 -1 1; -1 -1 0; 0 0 0]
"                    P -> T -> C         0.869329 (67736)    0.000000 (3049.3)     0.900000 (30.0)    16   [-1 -1 0; -1 -1 1; -1 -1 0; 0 0 0]
"                    T -> C -> P         1.000000 (73076)    0.994114 (29532.8)    0.000000 (3.0)     0    [-1 -1 0; -1 -1 0; -1 -1 0; 0 0 0]
"                    T -> P -> C         1.000000 (73076)    0.994114 (29532.8)    0.000000 (3.0)     0    [-1 -1 0; -1 -1 0; -1 -1 0; 0 0 0]
The two examples show that the direct PMDFT has higher complexity but allows higher 
throughput, whereas the TM-PMDFT is more efficient in complexity but poorer in throughput. 
The power consumption of the TM-PMDFT structure can be comparable to 
that of the PMDFT structure in spite of its low complexity. This is due to the high sampling 
rates inside the TM-PMDFT structure (the sampling rates in the FIR and DFT blocks of the 
structure are respectively LM and M times higher than those in the PMDFT structure). Therefore, 
the TM-PMDFT architecture with bit-serial arithmetic is ideal for low-complexity, low-data-rate 
applications, whereas the PMDFT architecture is recommended for high-data-rate 
applications.
5.5 Conclusions
From the above discussion, the following conclusions can be made:
1) The task of mapping multirate DSP systems into VLSI can be guided by assessing the 
complexity, power, and throughput efficiencies of the resulting VLSI architectures;
2) To be objective in comparative studies between alternative architectures, estimations for 
complexity and power consumption are preferably technology independent. To this end, 
the gate count and the effective complexity can be used respectively for the complexity 
and the power measures;
3) The proposed multirate VLSI complexity model is found useful and effective for VLSI 
architecture analysis and optimizations;
4) Given the system parameters and the IS, the search for the optimal VLSI architecture 
subject to predefined constraints on each objective function (complexity, power, or 
throughput) becomes the determination of the configuration matrix A. This can be done 
using optimization techniques. Trade-offs between space, time, and power complexities 
can be made by imposing different constraints on these objectives;
5) The proposed systematic approach to multirate VLSI optimization and complexity/power 
estimation can be used to evaluate the implementation efficiency and to study design 
trade-offs for multirate filter banks or other DSP systems.
Chapter 6 
Bit-Serial Techniques and Systolic Architectures
In this chapter some useful DSP techniques are introduced to optimize the DEMUX VLSI architecture, namely bit-serial techniques and systolic architectures. The objective of the 
former is to reduce complexity, whilst that of the latter is to improve throughput (speed) 
and to optimize the utilization of silicon area. Two bit-serial 
arithmetic units will be introduced: a bit-serial adder/subtractor and a novel full-precision 
bit-serial multiplier structure. Another technique introduced in this chapter is the distributed 
arithmetic approach to inner-product computation. To improve processing speed and to suit 
VLSI implementation, we propose systolic architectures for the DFTs and FIRs, which account 
for most of the computation in a multicarrier DEMUX.
6.1 Bit-serial arithmetic for digital signal processing
Digital adders and multipliers are the two most commonly used arithmetic components in 
digital signal processing. These arithmetic operations, especially multiplication, demand 
substantial processing resources (for example, processing time in software implementation 
and complexity in hardware approach). To reduce the VLSI complexity, bit-serial 
architectures are often preferred over bit-parallel ones as the former use a single and simple 
processing unit with a single data line (wire) for each operand whereas the latter accept and 
process the data in bit-parallel using much more complicated processing units with multiple 
data lines. Besides the advantage of low complexity, bit-serial architectures also have the 
following advantages [Bi90]:
• pin out problems of a chip rarely occur as all signals enter and leave the system by single 
wires. This is important in VLSI design since processing throughput may be limited by 
the input/output bandwidth of the chip.
• all interconnection lines are used with a 100% duty cycle, transporting data at the fastest 
bit-rate that the technology can support (data buses in bit-parallel architectures do not 
consistently make such efficient use of the available information bandwidth).
• simple interconnections between processing elements allow significant reduction of chip 
area and power consumption.
The main disadvantage of bit-serial techniques is the low processing speed (throughput). 
This, however, can be improved by employing pipeline (systolic) architectures at the expense 
of increased processing latency.
6.1.1 Bit-serial adder/subtractor
A bit-serial adder takes both operands in bit-serial format [Rot79]. In frequency DEMUX and 
many DSP systems (e.g., FFT), addition and subtraction operations between two signals are 
frequently encountered. We designed a low-complexity ADD&SUB module that accepts both 
operands in bit-serial (least significant bit (LSB) first), 2’s complement format and 
produces the sum as well as the difference in the same format. Figure 6.1(a) illustrates this 
structure, in which the upper half is a typical bit-serial adder and the lower half is a bit-serial 
subtractor. The operation is controlled by the LSB signal, which marks the start of each 
data word (the least significant bit) with a logic ‘0’. When the LSB signal is true (‘0’), it resets the ‘carry in’ of 
the adder’s full adder and sets the ‘carry in’ of the subtractor’s full adder, providing the initial 
increment of one needed to form the 2’s complement of X2. The complete operation is illustrated in 
Figure 6.1(b).
[Figure 6.1: (a) circuit with an adder (output X1+X2) and a subtractor (output X1-X2), each with Cin/Cout, driven by the clock and LSB signals; (b) timing of clock, LSB, X1, X2, X1+X2, and X1-X2 from the LSB to the MSB (sign bit)]
Figure 6.1 Bit-serial add & subtract module
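The behaviour of the ADD&SUB module described above can be sketched in software. The following is only a bit-level model under stated assumptions, not the thesis circuit: the function names are hypothetical, each full adder is modelled by its Boolean equations, and the subtractor is assumed to invert the X2 bits and start from a carry of one, as the text describes.

```python
def to_bits_lsb_first(value, width):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits_lsb_first(bits):
    """Interpret LSB-first bits as a two's complement integer."""
    raw = sum(b << i for i, b in enumerate(bits))
    return raw - (1 << len(bits)) if bits[-1] else raw

def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def add_sub_serial(x1_bits, x2_bits):
    """Process one word bit by bit, producing X1+X2 and X1-X2 together."""
    carry_add = 0  # adder carry is cleared at the LSB
    carry_sub = 1  # subtractor carry is preset to 1 at the LSB (the "+1" of -X2)
    sum_bits, diff_bits = [], []
    for a, b in zip(x1_bits, x2_bits):
        s, carry_add = full_adder(a, b, carry_add)
        d, carry_sub = full_adder(a, b ^ 1, carry_sub)  # inverted X2 bit
        sum_bits.append(s)
        diff_bits.append(d)
    return sum_bits, diff_bits

s, d = add_sub_serial(to_bits_lsb_first(5, 6), to_bits_lsb_first(-3, 6))
print(from_bits_lsb_first(s), from_bits_lsb_first(d))  # prints: 2 8
```

The word length (6 bits here) is chosen so that neither result overflows; overflow handling is outside the scope of this sketch.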
6.1.2 Serial-parallel multiplier
Digital multipliers are widely used in digital signal processing and generally considered to be 
one of the most area consuming arithmetic components. In applications where the multiplier 
(input data) can be presented in bit serial format and the multiplicand (coefficient) in bit 
parallel (a good example is a FIR), serial-parallel multiplier architectures can be employed to 
reduce complexity [Lyo76, Dan84, Gna85, Ale92].
However, existing serial-parallel multiplication schemes can give 
only half of the product precision, retaining the most significant bits (MSBs) and dropping the 
LSBs, if on-line multiplication is desired. This can lead to serious round-off errors. 
If a full-precision product is required, either some extra time steps are 
needed (which violates the requirement of on-line computation), or a complicated 
output data format (for instance, LSBs in bit-serial and MSBs in bit-parallel, hence more 
input/output (I/O) ports) has to be used [Ma90].
In this section a novel semi-systolic serial-parallel multiplier architecture is proposed to 
satisfy requirements for high speed and high precision on-line multiplications in frequency 
demultiplexing of MCDD.
The main characteristics of the proposed multiplier are that
• it can provide a full-precision product (m+n-1 bits in two’s complement) of an m-bit 
multiplier and an n-bit multiplicand,
• the multiplication period is max(m, n) pipeline steps, which enables on-line computation, 
and
• the (m+n-1)-bit product is produced in quasi-bit-serial format, i.e., using two bit-serial 
output data lines, one for the n least significant bits and the other for the m-1 most significant 
bits.
The proposed multiplier structure has a high multiplication rate (via systolic architecture) 
and, at the same time, a full-precision product. The processing cell has good modularity and 
simple control logic and is well suited to VLSI implementation.
The proposed structure is derived from a pipeline bit-parallel multiplier structure using the 
Baugh-Wooley algorithm for two’s complement multiplication [Bau73, Hat86, Ale92]. 
Figure 6.2 shows the bit-parallel pipeline structure. To develop the bit-serial pipeline 
structure, we collapse the above parallel array vertically forming a new linear array shown in 
Figure 6.3(a) in which the processing element (PE) is modified as shown in Figure 6.3(b). 
This is allowed because the data bits are fed into the parallel array in an order that is exactly 
the same as that of a bit-serial data with the LSB first.
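To illustrate how a full-precision two's complement product can be formed while the multiplier arrives LSB first, here is a minimal behavioural sketch. It is not a model of the PE array in Figures 6.2 and 6.3: it uses a plain shift-and-add recurrence in which the sign bit of the multiplier carries negative weight, and the function name and output format (low bits as a list, high part as an integer) are assumptions made for illustration.

```python
def serial_parallel_multiply(a, b_bits):
    """Multiply a (parallel multiplicand, Python int) by a two's complement
    multiplier supplied as bits, LSB first.

    Returns (low_bits, high_part, product): one low-order product bit is
    emitted per input bit, and the remaining high-order part is left in the
    accumulator after the last (sign) bit.
    """
    m = len(b_bits)
    acc = 0
    low_bits = []
    for i, bit in enumerate(b_bits):
        partial = a * bit
        # The last multiplier bit is the sign bit and has negative weight.
        acc += -partial if i == m - 1 else partial
        low_bits.append(acc & 1)  # emit one product bit per step
        acc >>= 1                 # arithmetic shift keeps the sign
    low = sum(b << i for i, b in enumerate(low_bits))
    return low_bits, acc, low + (acc << m)

# 4-bit example: multiplicand -3, multiplier 5 (bits 1, 0, 1, 0 LSB first)
print(serial_parallel_multiply(-3, [1, 0, 1, 0])[2])  # prints: -15
```

No bit of the product is discarded in this recurrence, which is the property the proposed structure provides in hardware through its two output lines.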
[Figure 6.2: (a) pipeline array; (b) processing elements]
Figure 6.2 A 2’s complement, 4x4-bit, bit-parallel pipeline multiplier
The PE in Figure 6.3(b) basically consists of two identical circuits (the upper and the lower 
halves) with shared control logic. Each half can produce either an inverted or a non-inverted 
partial product which, in the parallel array case, is generated using the two PEs of Figure 
6.2(b), by setting the control signal ‘nc’ to 0 or 1. At any time only one half is allowed to take 
the multiplier (the input data), and the data path to the other is blocked. This is controlled by 
the signal ‘lmc’. When ‘lmc’=0, the lower half is enabled and processes the LSBs of the 
product whilst the upper half is in the MSB propagation state, and vice 
versa when ‘lmc’=1.
In Figure 6.3(a) we illustrate the process of bit-serial product generation via a simple 4-bit 
by 4-bit multiplication. Both the multiplier (b3b2b1b0) and the multiplicand (a3a2a1a0) are 
assumed to be in 2’s complement. From step 0 to step 3, the ‘lmc’ of every PE is 0, hence the 
lower halves of the PEs produce the lower half of the product, p0, p1, p2, and p3 (LSBs), 
whilst the upper halves produce unusable outputs (initial states), which we denote by ‘x’. 
The second data word starts at step 4. The PEs’ upper halves are activated by toggling 
‘lmc’ to 1, enabling them to accept the multiplier bits, which are broadcast to all the PEs. 
At the same time, the lower halves stop accepting the multiplier and go into the MSB 
propagation state. Thus from step 4 to step 7 the upper halves produce the LSBs of 
the second product and the lower halves continue to produce the first product, p4, p5, p6 
(MSBs) and an unused bit x (which might be used to indicate overflow or for sign bit extension). 
Similarly, from step 8 to step 11, which is not shown in the figure, the upper halves produce the 
MSBs of the second product and the lower halves start to produce the third product. 
Therefore, the lower halves produce the first, third, ... products in full precision, whereas the 
second, fourth, ... products are produced by the upper halves of the PEs.
[Figure 6.3: (a) step-by-step illustration; (b) processing element (PE)]
Figure 6.3 Step-by-step illustration of parallel/serial full-precision 
multiplication
The multiplication scheme described above can be implemented with the circuits shown in 
Figure 6.4(a). All the control signals ‘lmc’ and ‘nc’ required by the PEs are derived, using simple 
logic, from an external signal ‘msb’ which indicates the MSB (sign bit) of the multiplier. 
The ‘reset’ signal guarantees that ‘lmc’ toggles to 0 when the first rising edge of 
‘msb’ comes (because the T-type flip-flop toggles at the rising edge of ‘ck’). The 
relationship between these control signals is shown in Figure 6.4(b).
As previously described, the two outputs of the linear pipeline array give consecutive products 
in an alternating manner, which is not convenient for further processing. Furthermore, 
conventional on-line (real-time) bit-serial processing requires only the MSBs of the product 
on a 1-bit data path. Hence it is necessary to rearrange the product format in such a way that the 
MSBs and LSBs are separated and the products are in consecutive order. This can be 
easily achieved by cascading simple combinational logic which either swaps its 
two input signals or simply bypasses them, as shown in Figure 6.4(c). The division between 
the LSBs and MSBs of a product can be adjusted by delaying or advancing ‘lmc’ before it is sent 
to control the SWAP logic. For example, a 1-bit advance of the signal would result in ‘lo’ = x, 
p0, p1, p2; x, p'0, p'1, ... and ‘mo’ = p3, p4, p5, p6; p'3, p'4, p'5, ..., which gives more accurate MSBs of 
the product.
[Figure 6.4: (a) multiplier circuit with its control logic and the input (b) and output (lo, mo) sequences over time; (b) timing of the clock, msb, lsb, lmc, and reset signals from the LSB to the MSB (sign bit); (c) multiplier followed by the SWAP logic (c=0: x'=y, y'=x; c=1: x'=x, y'=y) and the rearranged lo and mo sequences over time]
Figure 6.4 The novel parallel/serial full-precision multiplier
Since the input (multiplier) is broadcast to all cells in Figure 6.4 the pipeline structure is 
therefore semi-systolic [Kun82]. The multiplication scheme has been verified via digital 
simulation using Mentor Graphics CAE tools.
The digital design shows that the cell complexity is about seventy gates (one gate is 
equivalent to a 2-input NAND gate). Thus for multiplication of an n-bit bit-serial datum and 
an m-bit coefficient, the proposed serial-parallel semi-systolic multiplier consisting of m PEs 
will require approximately 70m gates. This multiplication scheme suits 
applications where high speed and high precision (i.e., the full precision of the 
product) multiplications are required, and it can significantly reduce the ASIC complexity.
6.1.3 Bit-serial 2’s complement
In digital signal processing, the computation of two’s complement is common, for example 
the multiplication of a signal by -1, -1/2, -1/4, etc. as in FFT computation. These operations 
require the 2’s complement of the signal. When the signal is bit-serial, a straightforward 
approach to computing the two’s complement is to subtract the signal 
from the constant 0 using the bit-serial subtractor introduced in section 6.1.1. We have 
designed a simpler and faster architecture for the two’s complement, as shown in Figure 6.5. The 
control signal and the computation process are illustrated in Figure 6.5(b).
[Figure 6.5: (a) circuit; (b) timing diagram of the clock, lsb, and data signals from the LSB to the MSB]
Figure 6.5 A simple bit-serial 2’s complement circuit
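The internals of the circuit in Figure 6.5 are not recoverable from the text, so the sketch below uses a well-known rule for negating an LSB-first two's complement stream, which may or may not match the thesis circuit: pass every bit unchanged up to and including the first 1, then invert all later bits. The function and helper names are hypothetical.

```python
def to_bits_lsb_first(value, width):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits_lsb_first(bits):
    """Interpret LSB-first bits as a two's complement integer."""
    raw = sum(b << i for i, b in enumerate(bits))
    return raw - (1 << len(bits)) if bits[-1] else raw

def twos_complement_serial(bits):
    """Negate an LSB-first two's complement word one bit per step.

    Only a single state bit is needed: whether a 1 has already been seen.
    """
    out = []
    seen_one = False
    for b in bits:
        out.append(b ^ 1 if seen_one else b)
        if b:
            seen_one = True
    return out

print(from_bits_lsb_first(twos_complement_serial(to_bits_lsb_first(6, 5))))  # prints: -6
```

Compared with subtracting from the constant 0, this rule needs no full adder, which is consistent with the claim that a dedicated circuit can be simpler than the general subtractor. The most negative value has no positive counterpart in the same word length and maps to itself.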
6.2 Distributed arithmetic techniques
Distributed arithmetic (DA) is so called because the arithmetic operations in signal processing (e.g., 
multiplication, convolution) are distributed in an often unrecognizable fashion without 
explicit use of multipliers. The DA computation mechanism is best exploited by 
multiply-and-sum type inner product computations such as convolution/nonrecursive 
(FIR) filtering and matrix (vector) multiplications [Pel73, Zoh76]. The technique can also 
be applied to the efficient realization of recursive arithmetic operations such as IIR 
filtering [Pel74] and to nonlinear/nonstationary processing (adaptive filtering). 
White gave a comprehensive survey of DA techniques and their applications in a tutorial 
on this subject [Whi89]. As he pointed out, the motivation for using DA is its extreme 
computational efficiency. This is particularly important for custom circuit design: a carefully 
designed DA circuit can reduce the total gate count of a DSP arithmetic unit by 50 to 80 percent. In 
our multicarrier DEMUX ASIC design, DA has been successfully used to implement the 
BSPs of the tree structure, giving rise to a very low complexity implementation of the tree 
structure DEMUX [Qi92b].
6.2.1 DA approach for inner product generation
Distributed arithmetic is basically a bit-serial computational operation that forms an inner 
(dot) product of a pair of vectors in a single direct step. Instead of using multiply & 
accumulate (MAC) for inner product, DA performs table-lookup and accumulation to 
generate the inner product without the use of multipliers. Thus low complexity and high 
computational efficiency can be expected. Another important feature of DA is its high 
computation precision because no multiplier is involved and it is therefore free of 
rounding/truncation errors [Zoh89, Whi89].
To show how DA works for inner product generation, consider a K-term inner product:
$$ y = \sum_{k=1}^{K} A_k x_k \tag{6.1} $$
where the $A_k$ are considered fixed coefficients and the $x_k$ are the input data words. If each $x_k$ 
is an L-bit 2’s-complement binary number scaled (for convenience, not as a necessity) such that 
$|x_k| < 1$, that is, $x_k = -b_{k0} + \sum_{l=1}^{L-1} b_{kl} 2^{-l}$, then the inner product of Eq. (6.1) can be expressed as
$$ y = \sum_{k=1}^{K} A_k \left( -b_{k0} + \sum_{l=1}^{L-1} b_{kl} 2^{-l} \right) = \sum_{l=0}^{L-1} C_l 2^{-l} \tag{6.2} $$
where the $C_l$ are defined as
$$ C_l = \sum_{k=1}^{K} A_k b_{kl}, \quad l = 1, 2, \ldots, L-1; \qquad C_0 = \sum_{k=1}^{K} A_k (-b_{k0}) \tag{6.3} $$
Since $b_{kl}$ takes values of 0 and 1 only and the $A_k$ are known, the $C_l$ can be computed off-line and 
stored in a read-only memory (ROM). Therefore the sum of products of Eq. 
(6.1) is converted to the table-lookup, shifting, and accumulation operations required 
by Eq. (6.2), without the use of multiplication. To store the $C_l$ the required ROM size will be $2^{K+1}$. 
Techniques have been proposed to reduce the ROM size, and the resulting ROM size can be 
$2^{K-1}$ for a K-term inner product [Whi89]. This approach has been used to implement the half-band 
FIRs of the tree structure in our DEMUX design, which will be addressed in chapter 7.
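The table-lookup form of Eq. (6.2) can be checked numerically with a short sketch. The names below are hypothetical, and one simplification is assumed: instead of storing the sign-bit terms $C_0$ separately (which gives the $2^{K+1}$-word ROM mentioned above), a single $2^K$-entry table is used and the looked-up value is subtracted for the sign bit. Exact rational arithmetic is used so that the comparison with the direct inner product is not blurred by floating-point error.

```python
from fractions import Fraction

def build_da_table(coeffs):
    """table[addr] = sum of coeffs[k] over every bit k that is set in addr."""
    K = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << K)]

def da_inner_product(table, words, L):
    """Inner product by table lookup, shift, and accumulate.

    words: K integers in L-bit two's complement, each representing
           x_k = word / 2**(L-1), so that |x_k| <= 1.
    """
    mask = (1 << L) - 1
    acc = Fraction(0)
    for l in range(L):             # l = 0 is the sign bit b_k0
        shift = L - 1 - l          # position of b_kl in the stored integer
        addr = 0
        for k, w in enumerate(words):
            addr |= (((w & mask) >> shift) & 1) << k
        term = Fraction(table[addr], 1 << l)   # looked-up sum scaled by 2**-l
        acc += -term if l == 0 else term       # sign bit has negative weight
    return acc

coeffs = [3, -5, 7]
table = build_da_table(coeffs)
# x = (5/8, -3/8, 2/8): 3*(5/8) - 5*(-3/8) + 7*(2/8) = 44/8
print(da_inner_product(table, [5, -3, 2], 4))  # prints: 11/2
```

No multiplier appears in the accumulation loop; the only multiplications are the off-line ones hidden in building the table.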
6.2.2 Long length FIR using DA techniques
As discussed previously, DA techniques efficiently realize the multiply-and-add type 
inner product operations that dominate most digital signal 
processing systems, as long as the vector size is not too large. From the viewpoint of low 
complexity, the use of DA is justified only if the complexity of the ROMs does not exceed that 
of the multipliers and adders otherwise involved in the operation. With the DA approach, a total 
of $2^{N-1}$ words of memory (ROM) will be required for an N-tap FIR filter. This is 
unacceptable when N is large, for example, in the case of the pulse shaping filter of the MCDD.
A trivial way of dealing with a long FIR using the DA technique is to segment the N-tap 
FIR into a number of shorter sub-FIRs and then apply DA to each of the sub-FIRs. For 
example, if an N-tap FIR can be, in some way (e.g., via polyphase decomposition), segmented 
into K sub-filters, the process can be described by
$$ y_n = \sum_{i=0}^{N-1} h_i x_{n-i} = \sum_{k=0}^{K-1} \sum_{i \in I_k} h_i x_{n-i} = \sum_{k=0}^{K-1} v_k \tag{6.4} $$
with $I_i \cap I_j = \emptyset$ for $i \neq j$ and $\bigcup_k I_k = \{0, 1, \ldots, N-1\}$, 
where $v_k$ is the inner product of the k-th sub-filter. Each $v_k$, if implemented with DA, requires 
a ROM of size $2^{|I_k|-1}$, where $|I_k|$ is the number of members of the integer set $I_k$. The total 
ROM size will be
$$ S_{ROM} = \sum_{k=0}^{K-1} 2^{|I_k|-1} \tag{6.5} $$
which is smaller than the $2^{N-1}$ required by the direct DA approach. If all the sub-filters are 
assumed to be of equal length, that is, $|I_k| = N/K$, then $S_{ROM} = K 2^{N/K-1}$. The saving in ROM of this 
approach over the direct DA approach is a factor of $2^{N-N/K}/K$. This approach is illustrated in Figure 6.6.
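The ROM sizes above are easy to evaluate for concrete numbers. The sketch below assumes N=72 (the prototype filter length used in chapter 5) and, purely as an illustration, K=8 equal-length sub-filters; the function names are hypothetical.

```python
def da_rom_words_direct(n_taps):
    """ROM words for a direct DA implementation of an N-tap FIR: 2**(N-1)."""
    return 2 ** (n_taps - 1)

def da_rom_words_segmented(segment_lengths):
    """Total ROM words when each sub-filter has its own table, as in Eq. (6.5)."""
    return sum(2 ** (length - 1) for length in segment_lengths)

N, K = 72, 8                                  # K = 8 is an illustrative choice
direct = da_rom_words_direct(N)               # 2**71 words
segmented = da_rom_words_segmented([N // K] * K)
print(segmented)                              # prints: 2048
print(direct // segmented == 2 ** (N - N // K) // K)  # prints: True
```

With these numbers each sub-filter has 9 taps and needs a 256-word table, so the segmented design needs 2048 words in total, whereas the direct table would be far beyond any practical memory.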
This property can be exploited in polyphase filter implementation. In a polyphase filter bank, 
the polyphase sub-filters are generally non-symmetrical due to the polyphase decomposition of 
the prototype filter. Hence efficient transversal FIR structures which make use of the 
symmetry property of linear-phase FIRs (Figure 2.5(a)) are not applicable. However, the sub-filters 
are much shorter than the prototype filter and they can be efficiently implemented with DA 
techniques. Thus this approach offers an efficient means of realizing polyphase filter banks.
[Figure 6.6: the input x(n) feeds a shift register array whose segments drive the DA inner product units 0, 1, ..., K-1; their outputs are combined to form y(n)]
Figure 6.6 DA approach to long length FIR implementation
6.3 Systolic Stream DFT structures
6.3.1 Stream DFT
The Discrete Fourier Transform and its fast algorithms (FFTs) are basically suited to vector 
(parallel) data. When the need for short-time FFTs arises (e.g., in fast convolution and, in 
general, in short-time spectral analysis) and when the signals to be transformed are serial 
(stream) signals, as is frequently encountered in the real world, conventional FFT approaches 
seem neither convenient nor efficient. This is because FFTs are generally efficient for parallel 
signals (vector signals): to use parallel FFTs on stream data, serial-to-parallel and parallel-to-serial 
conversions and large data buffers are required. Another reason for 
using the stream DFT is that it has lower hardware complexity than the FFT when the 
transform size is not an integer power of two or when the size is large. Typically, for an 
N-point DFT, the hardware complexity of the former is of order O(N) whereas that of the latter is O(N log2 N). It should be 
noted that the computational complexity of the stream DFT in terms of the number of 
computations per second is higher than that of the FFT because the former is O(N^2) and the latter 
is again O(N log2 N). Since stream DFTs are normally pipelined as a one-dimensional array of 
identical processing elements whose length equals the transform size, the 
PE structure is independent of the transform size, which makes it ideal for 
reconfigurable/reprogrammable DFTs such as those for the flexible DEMUX architectures 
discussed in chapter 3.
6.3.2 Two SS-DFT structures
As we shall see later in this section, there are basically two types of systolic stream DFT 
(SS-DFT) structures. One accepts a parallel stream signal and produces the transformed signal in 
serial. The other takes the stream signal directly and delivers the transformed signal in 
parallel. The former is often referred to as Kung’s model and the latter as Chang’s model 
[Kun82, Cha88]. In the derivation of the two structures, we take a different approach based 
on the MSFG transform introduced in chapter 4.
6.3.2.1 Kung’s model
To derive Kung’s stream DFT let us consider, without loss of generality, a simple 4-point 
stream DFT. A stream DFT (or short-time DFT) can in concept be represented by an 
SPC-DFT-PSC cascade as shown in Figure 6.7(a). The insertion of the time advance of z in the 
figure is necessary for correct input sequence segmentation. The transformed signal X(n) can 
be considered as the superposition of signals generated independently by $x_k$, k=0, 1, 2, and 3, as 
shown in Figure 6.7(a). The signal component $X_k(n)$ generated by $x_k$ can be obtained by the 
network shown in Figure 6.7(b). Inserting the identity of Figure 4.4(d) transforms Figure 
6.7(b) into Figure 6.7(c), which is, according to the definitions of the USH node and the type 
1 MPDT, equivalent to a cascade of an upsampling & hold and a modulator as shown in 
Figure 6.7(d), in which the intermediate signal $\xi_k(n)$ is defined as the output of the USH. 
Consequently, the stream DFT network becomes that of Figure 6.7(e).
Thus the transformed signal X(n) can be expressed by
$$ X(n) = \sum_{k=0}^{3} X_k(n) = \sum_{k=0}^{3} \xi_k(n) W_4^{nk} \tag{6.6} $$
By using Horner’s rule [Knu81], Eq. (6.6) is rewritten as
$$ X(n) = \xi_0(n) + \left( \xi_1(n) + \left( \xi_2(n) + \left( \xi_3(n) + 0 \right) W_4^{n} \right) W_4^{n} \right) W_4^{n} \tag{6.7} $$
The nested structure of Eq. (6.7) maps onto a linear array as shown in Figure 6.7(f).
[Figure 6.7: panels (a) to (f)]
Figure 6.7 Derivation steps for Kung’s model 
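The Horner form of Eq. (6.7) can be checked against the direct DFT sum with a few lines of code. This is only a numerical sketch under stated assumptions: the function names are hypothetical, the kernel is assumed to be $W_N = e^{-j2\pi/N}$, the held samples $\xi_k(n)$ are taken to be the N samples of one block, and the pipeline delays and time offsets of the systolic array are ignored.

```python
import cmath

def horner_dft(block):
    """Evaluate X(n) = xi_0 + (xi_1 + (xi_2 + ...) W^n) W^n for each n.

    Each pass of the inner loop corresponds to one cell of the linear array:
    multiply the incoming partial result by W_N^n and add the held sample.
    """
    N = len(block)
    out = []
    for n in range(N):
        w_n = cmath.exp(-2j * cmath.pi * n / N)  # W_N^n (assumed sign convention)
        acc = 0j                                 # the "+ 0" fed into the last cell
        for xi in reversed(block):
            acc = acc * w_n + xi
        out.append(acc)
    return out

def direct_dft(block):
    """Reference: X(n) = sum_k x_k W_N^(n k)."""
    N = len(block)
    return [sum(x * cmath.exp(-2j * cmath.pi * n * k / N)
                for k, x in enumerate(block)) for n in range(N)]

block = [1.0, 2.0, -1.0, 0.5]
err = max(abs(a - b) for a, b in zip(horner_dft(block), direct_dft(block)))
print(err < 1e-12)  # prints: True
```

Each output needs N multiply-add steps, which reflects the O(N^2) operation count of the stream DFT noted in section 6.3.1.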
[Figure 6.8: (a) 1D systolic array; (b) processing element (PE); (c) block diagram of the PE array]
Figure 6.8 Kung’s SS-DFT
If we insert four delays at the output of Figure 6.7(f) and then propagate them backwards 
along the array, the array can be pipelined and forms a one-dimensional systolic array as 
shown in Figure 6.8(a). Note that the USHs cannot be included in the pipeline stages due to 
the inserted delays. Using the conventional block diagram and SFG, the array can be 
redrawn as in Figure 6.8(c) with the processing elements defined by Figure 6.8(b). This 
structure is identical to that given by Kung [Kun82]. With this pipeline structure, the stream 
data to be transformed need to be segmented, upsampled and held, and then loaded in parallel 
into the array. The transformed data are produced in serial format. The hardware complexity is 
directly proportional to the transform size N, as compared with that of the FFT, which is 
proportional to N log2 N. It can be approximately estimated in terms of the number of gates by
$$ G = N(4G_m + 2G_a + 4G_r + 2G_{rom}) \tag{6.8} $$
where $G_m$, $G_a$, $G_r$, and $G_{rom}$ are the numbers of gates for the multiplier, adder, shift register, and 
memory unit respectively.
6.3.2.2 Chang’s model
Similar to the derivation of Kung’s model, we derive Chang’s model by transforming the 
stream DFT model of Figure 6.7(a). The k-th transformed signal component $X_k(m)$ can be 
obtained by the network shown in Figure 6.9(a). Again, applying the identity of Figure 4.4(d) to 
this network results in Figure 6.9(b), which is equivalent to a modulator followed by an 
integrate-and-dump operator as shown in Figure 6.9(c). As a result, the stream DFT model of 
Figure 6.7(a) is equivalent to that shown in Figure 6.9(d). The intermediate signals $\xi_k(n)$, k=0, 
1, 2, and 3, in Figure 6.9(d) have the following recurrence relation:
$$ \xi_{k+1}(n) = \xi_k(n) W_4^{n+3}, \quad k = 0, 1, 2, \qquad \xi_0(n) = x(n+3) \tag{6.9} $$
The above equations indicate that the transformed signal components can be realized via a 
linear array as shown in Figure 6.9(e). To pipeline the array, delays are inserted and 
distributed across the array. Since the integrate-and-dump can be realized with a comb filter 
followed by a downsampler (Figure 4.14(b)), Figure 6.9(e) is thus transformed into Figure 
6.9(f). Referring to the identity shown in Figure 4.14(d), the structure of Figure 6.9(f) is identical 
to the one-dimensional systolic array shown in Figure 6.10(a), where $X'_k(n)$, k=0, 1, 2, and 3, are 
the transformed signals sampled with different time offsets. The array is redrawn as a block 
diagram in Figure 6.10(c) with the processing elements defined in Figure 6.10(b). This 
structure is referred to as Chang’s stream DFT model [Cha88]. Unlike Kung’s model, with this 
structure the stream data to be transformed are fed directly into the array and the transformed 
data are produced by sampling the output signals from each PE in parallel. The hardware 
complexity is also in direct proportion to the transform size N and can be estimated by
$$ G = N(4G_m + 2G_a + 6G_r + 2G_{rom}) \tag{6.10} $$
in which the integrate-and-dump scheme is used to realize the comb filtering and sampling 
process.
[Figure 6.9: panels (a) to (f)]
Figure 6.9 Derivation steps for Chang’s model
[Figure 6.10: (a) 1D systolic array with comb filters; (b) processing elements; (c) block diagram of the PE array]
Figure 6.10 Chang’s SS-DFT
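The modulate-then-integrate-and-dump view behind Chang’s model can also be checked numerically. The sketch below is a behavioural simplification, not the systolic array of Figure 6.10: the function names are hypothetical, the kernel is assumed to be $W_N = e^{-j2\pi/N}$, and the time offsets of Eq. (6.9) and the pipeline delays are dropped, so that the recurrence becomes $\xi_{k+1}(n) = \xi_k(n) W_N^n$ with $\xi_0(n) = x(n)$ over one block.

```python
import cmath

def modulate_integrate_dump_dft(block):
    """Feed the stream sample by sample; every branch k integrates its own
    modulated copy of the input and is read out (dumped) at the block end."""
    N = len(block)
    acc = [0j] * N                               # one integrator per branch (PE)
    for n, x in enumerate(block):
        w_n = cmath.exp(-2j * cmath.pi * n / N)  # W_N^n (assumed sign convention)
        xi = complex(x)                          # xi_0(n) = x(n)
        for k in range(N):
            acc[k] += xi                         # integrate branch k
            xi *= w_n                            # xi_{k+1}(n) = xi_k(n) * W_N^n
    return acc                                   # dump: X_k for k = 0..N-1

def direct_dft(block):
    """Reference: X_k = sum_n x(n) W_N^(n k)."""
    N = len(block)
    return [sum(x * cmath.exp(-2j * cmath.pi * n * k / N)
                for n, x in enumerate(block)) for k in range(N)]

block = [1.0, 2.0, -1.0, 0.5]
err = max(abs(a - b) for a, b in
          zip(modulate_integrate_dump_dft(block), direct_dft(block)))
print(err < 1e-12)  # prints: True
```

In contrast to the Horner sketch for Kung’s model, the input is consumed serially here and all N outputs become available together at the end of the block, which mirrors the serial-in, parallel-out behaviour described above.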
6.3.3 Comparison between SS-DFT and FFT
The advantages of SS-DFTs include:
a) fast speed for sequential data due to the parallelism introduced by systolic architectures;
b) low complexity compared to the conventional FFT when the transform size is larger than 
32. For AM-, 8, and 16, the gate count of FFT seems less than that of systolic DFT. Based 
on the estimations given by Eqs. (6.8) and (6.10), the gate count comparison between the 
parallel radix-2 FFT and systolic DFT is illustrated in Figure 6.11;
c) because of array’s modularity and locality of connections between cells the systolic arrays 
can be efficiently mapped into VLSI and require less chip area;
d) unlike the conventional FFT that requires the transform size be an integer power of 2, SS- 
DFTs allow arbitrary transform size; to perform an Appoint DFT simply cascade N  PEs, 
which makes them attractive for flexible/reconfigurable architectures;
e) because of the simplicity of PEs and local connections between them, a fault-tolerant 
DFT structure can be constructed with low degree of redundancy and the 
reconfiguration/switch scheme can be simple .
A major drawback of the two SS-DFT models is that serious roundoff errors can occur when 
the transform length is large, because the roundoff errors in both cases are propagated and 
accumulated from the first to the last PE and are proportional to $N^2$ [Cha93]. To alleviate the 
roundoff error by direct computation of the DFT, CORDIC (coordinate rotation digital 
computer) arithmetic can be used, avoiding explicit multiplications by the transform kernel 
value $W_N^{nk}$ [Des74, Jon93].
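The CORDIC idea mentioned above can be illustrated with a minimal sketch. Multiplying a complex sample by a kernel value of unit magnitude is a plane rotation, which CORDIC approximates by a sequence of micro-rotations through the angles arctan(2^-i). The function name `cordic_rotate`, the iteration count, and the restriction to |angle| <= pi/2 are assumptions made for illustration and are not taken from [Des74, Jon93]; floating-point scaling by 2^-i stands in for the hardware shifts.

```python
import math


def cordic_rotate(x, y, angle, iterations=24):
    """Rotate the vector (x, y) by `angle` radians using CORDIC micro-rotations.

    Assumes |angle| <= pi/2 (inside the CORDIC convergence range).
    In hardware each `* shift` is a right shift by i bits; here it is emulated
    in floating point.
    """
    # Accumulated gain of the micro-rotations, removed once at the end.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0   # direction of this micro-rotation
        shift = 2.0 ** (-i)
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)     # arctan values would come from a small ROM
    return x / gain, y / gain


if __name__ == "__main__":
    # Rotate 1 + j0 by -2*pi/8, i.e. multiply by exp(-j*2*pi/8).
    print(cordic_rotate(1.0, 0.0, -2.0 * math.pi / 8))
```

Only additions, shifts and a small table of arctangent constants are needed inside the loop; the final gain correction is a single constant scaling.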
Figure 6.11 Hardware complexity comparison between SS-DFT and FFT (gate count versus transform length; curves for FFT and SS-DFT)
6.3.4 Streamed input and output systolic DFT
It should be observed that Kung’s SS-DFT architecture cannot take stream input samples 
directly, whereas Chang’s SS-DFT does not produce stream output. Therefore, for 
applications where both the input signal and the transformed signal are required in serial format,
Kung’s and Chang’s SS-DFT structures are not directly applicable. In this case, boundary 
cells are necessary to convert the parallel samples into a stream for both structures. Recently, 
Murthy and Swamy proposed a new 1D systolic array that takes stream input and produces 
stream output by modifying Chang’s and Cho’s [Cho90] (a variation of Kung’s model) 
architectures [Mur94]. However, the architecture requires broadcast of data to different PEs 
and hence is semi-systolic, which limits the size of the array.
6.4 Systolic FIR
In this section, we propose a semi-systolic and a pure systolic FIR architecture which, like 
the symmetric transversal structure (Figure 2.5(b)), exploit the symmetry property of 
linear phase FIR filters, giving rise to the same low complexity as that structure. A pure systolic 
halfband FIR architecture is also derived as a special case of the novel pure systolic FIR.
6.4.1 Why systolic FIR?
Conventionally, an FIR filter of length N is implemented using the transversal (direct) structure, 
which requires N multiplications. Taking the coefficient symmetry (as in the linear phase 
case) into consideration, the number of multiplier stages can be halved by using the structure 
given in Figure 2.5(a).
The main disadvantage of this structure is the long propagation delay caused by the adder tree 
for summing the N (or, for the symmetrical case, about N/2) products. Although the adder tree can be 
pipelined to alleviate the problem, this significantly increases the latency of the filter, 
especially when N is large. Another drawback of the transversal structure is the lack of 
structural modularity, a feature preferred for VLSI implementations.
For high speed applications, systolic pipeline arrays for FIR convolutions have been proposed 
[Kun82, Bom92, Ers85, Kwa93]. However, most of them do not utilize the symmetry 
property of linear phase FIRs, and hence are not computationally efficient for linear phase cases. 
Bombardieri proposed a semi-systolic array architecture which makes use of the symmetry 
property, reducing the multiplier stages by a factor of 2 [Bom92]. The architecture, however, 
requires wider systolic paths and more I/O than the one to be introduced in this section, as the 
partial products pass in both directions via two systolic paths. This would increase the
silicon area and could reduce the throughput (for bit-serial architectures).
6.4.2 A new semi-systolic architecture for FIR
In the derivation of this systolic architecture, we adopt the same recursive state variable
approach that Bombardieri used in [Bom92]. Consider a FIR convolution with linear phase. 
The impulse response coefficients have the symmetry property
$h_i = \pm h_{N-1-i}$ (6.11)
where the signs ‘+’ and ‘-’ are for even and odd symmetry respectively. Assuming that the 
filter length N is odd and considering Eq. (6.11), the convolution can be expressed by
$y_n = \sum_{i=0}^{(N-1)/2-1} h_i (x_{n-i} \pm x_{n-N+1+i}) + h_{(N-1)/2}\, x_{n-(N-1)/2} = \sum_{i=0}^{(N-1)/2} h'_i (x_{n-i} \pm x_{n-N+1+i})$ (6.12)
where $h'_i = h_i$ for $i = 0, 1, \ldots, (N-1)/2-1$ and $h'_{(N-1)/2} = \frac{1}{2} h_{(N-1)/2}$. If we define the delay operator $z^{-1}$
such that $z^{-k} x_n = x_{n-k}$, then Eq. (6.12) can be rewritten as
$y_n = \sum_{i=0}^{(N-1)/2} h'_i z^{-i} (x_n \pm x_{n-N+1+2i})$ (6.13)
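Applying Horner's rule to successively factorize the $z^{-1}$ terms gives
$y_n = h'_0 (x_n \pm x_{n-N+1}) + z^{-1}\{ h'_1 (x_n \pm x_{n-N+3}) + z^{-1}\{ h'_2 (x_n \pm x_{n-N+5}) + z^{-1}\{ \cdots + z^{-1}\{ h'_{(N-1)/2-1} (x_n \pm x_{n-2}) + z^{-1} h'_{(N-1)/2} (x_n \pm x_n) \} \cdots \}\}\}$ (6.14)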
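As a small worked instance of Eq. (6.14), assume N = 5 and even symmetry (the ‘+’ sign), so that the modified coefficients are h'_0 = h_0, h'_1 = h_1 and h'_2 = h_2/2. These values are chosen only for illustration. The nested form then collapses back to the direct convolution:

```latex
\begin{aligned}
y_n &= h'_0 (x_n + x_{n-4})
      + z^{-1}\bigl\{ h'_1 (x_n + x_{n-2})
      + z^{-1}\bigl\{ h'_2 (x_n + x_n) \bigr\} \bigr\} \\
    &= h_0 (x_n + x_{n-4}) + h_1 (x_{n-1} + x_{n-3}) + \tfrac{1}{2} h_2 \cdot 2 x_{n-2} \\
    &= h_0 x_n + h_1 x_{n-1} + h_2 x_{n-2} + h_1 x_{n-3} + h_0 x_{n-4}.
\end{aligned}
```

The halved centre coefficient compensates for the centre sample being added to itself in the innermost term.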
Define state variables $p_i$ and $v_i$ as
$p_i = z^{-1}\{ p_{i+1} + h'_i (x_n \pm v_{i+1}) \}$,
$v_i = z^{-2} v_{i+1}$, $i = 0, 1, \ldots, (N-1)/2$,
$p_{(N-1)/2+1} = 0$,
$v_{(N-1)/2+1} = x_n$. (6.15)
Thus Eq. (6.14) can be represented by the recursive state variable equations given in Eq.
(6.15) with the output equation of Eq. (6.16).
$y_n = p_1 + h'_0 (x_n \pm v_1)$
or,
$y_{n-1} = z^{-1}\{ p_1 + h'_0 (x_n \pm v_1) \} = p_0$ (6.16)
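Similarly, with the same procedure as discussed above, the state variable representation for the N 
even case is derived as Eq. (6.17). The output equation is the same as Eq. (6.16).
$p_i = z^{-1}\{ p_{i+1} + h'_i (x_n \pm v_{i+1}) \}$,
$v_i = z^{-2} v_{i+1}$, $i = 0, 1, \ldots, N/2-1$,
$p_{N/2} = 0$,
$v_{N/2} = x_{n-1}$, (6.17)
where $h'_i = h_i$ for all $i$, since there is no centre tap when N is even.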
Eqs. (6.15) and (6.16) (or Eqs. (6.17) and (6.16)) are the recursive state variable 
representation that fully describes a pipeline structure. As a result, the FIR convolution can be 
realized via the one-dimensional (1D) systolic pipeline structure as shown in Figure 6.12.
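The recursion can be checked numerically with a short behavioural sketch. It assumes even symmetry (the ‘+’ sign), models each state variable p_i as one register, and takes the v_i from a tapped delay line of the broadcast input; the function names and the list-based registers are illustrative choices, not the thesis implementation.

```python
def direct_fir(h, x):
    """Reference: plain transversal convolution y_n = sum_i h_i x_{n-i}."""
    return [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(x))]


def symmetric_fir_state_variable(h, x):
    """Evaluate a symmetric FIR (h_i = h_{N-1-i}) with the state recursion
    p_i <- p_{i+1} + h'_i (x_n + v_{i+1}), output y_n = p_1 + h'_0 (x_n + v_1)."""
    N = len(h)
    M = (N + 1) // 2                 # number of PEs: (N+1)/2 for N odd, N/2 for N even
    hp = list(h[:M])
    if N % 2 == 1:
        hp[-1] = 0.5 * h[M - 1]      # halved centre tap for N odd
    p = [0.0] * (M + 1)              # p[M] is the zero boundary condition
    xhist = [0.0] * N                # xhist[k] holds x_{n-k}
    y = []
    for xn in x:
        xhist = [xn] + xhist[:-1]
        new_p = [0.0] * (M + 1)
        for i in range(M):
            v_next = xhist[N - 1 - 2 * i]   # v_{i+1} = x_{n-N+1+2i}
            new_p[i] = p[i + 1] + hp[i] * (xn + v_next)
        y.append(new_p[0])           # combinational output y_n (registered, it is p_0)
        p = new_p                    # clock all PE registers
    return y


if __name__ == "__main__":
    h = [1.0, 2.0, 3.0, 2.0, 1.0]
    x = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(symmetric_fir_state_variable(h, x))
    print(direct_fir(h, x))
```

Each PE performs one multiplication per sample, so the multiplier count is (N+1)/2 for N odd and N/2 for N even, in line with the symmetric transversal structure.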
Figure 6.12 A new systolic architecture for symmetric FIR convolutions:
(a) definition of the processing element (b) proposed systolic 
pipeline array for N odd (c) the systolic pipeline array for N even (d) 
pipelined PE
The proposed symmetric systolic architecture requires (N+1)/2 multipliers and N+1 adders for 
N odd (for N even, N/2 and N respectively), which is about the same complexity as the 
conventional transversal symmetrical structure shown in Figure 2.5(a).
Efficient and fast multiply-accumulate structures can be used to implement the PE shown in 
Figure 6.12(a). To further improve the throughput the multiplication and addition operations 
can be separated by inserting delays between them as shown in Figure 6.12(d). The system 
latency, however, will be increased by two sample periods accordingly.
Unlike Bombardieri’s semi-systolic architecture in which the partial products (which for 
numerical accuracy carry more bits than the coefficients and the input signal) move along two
systolic paths in both directions, the proposed semi-systolic array propagates the partial 
products in one direction and on a single systolic path. It requires less I/O and is thus more 
area efficient for VLSI implementation.
Like Bombardieri’s architecture, the proposed architecture also broadcasts the input signal $x_n$ to
all the PEs. It is therefore also semi-systolic. The modular expandability of a semi-systolic 
array is limited due to the global data communication. For large N, the modular expandability 
of the systolic array can be improved by pipelining several predefined super processing 
elements (SPEs), each consisting of a moderate number of the PEs defined in Figure 6.12, such that 
no engineering problems associated with the global data communication occur, as 
shown in Figure 6.13. The structure shown in Figure 6.13 becomes pure systolic if the 
number of PEs in the SPE is reduced to one. This method cannot be directly applied to 
Bombardieri’s semi-systolic architecture because signals travel along the array in both 
directions and insertion of delays would result in unrealizable architectures.
Figure 6.13 Modular expansion for large N (panels (a) and (b))
6.4.3 A novel pure systolic FIR architecture
Though simple and of low complexity, the semi-systolic FIR structure proposed previously 
and Bombardieri’s semi-systolic architecture have difficulties in VLSI implementations for 
very high speed processing due to the global communication of signals. To increase the 
processing speed of VLSI, such global connections should be avoided. Thus there is a need to 
derive pure systolic architectures for FIR where all the PEs are connected locally.
Consider a linear phase FIR of length N whose z-domain transform is H(z). It can be 
polyphase decomposed into a two-path network according to Figure 4.6(a) and is shown in 
MSFG in Figure 6.14(a). The FIR structure depends on the filter length N. There are four 
cases, namely, N odd, (N-1)/2 odd; N odd, (N-1)/2 even; N even, N/2 even; and N even, N/2 
odd. Since H(z) has the symmetry property of Eq. (6.11) (even symmetry is assumed 
hereafter for simplicity), $H_0(z)$ and $H_1(z)$ are also symmetrical for N odd cases. Hence for 
these two cases, by applying the symmetrical transversal structure of Figure 2.5(a), the FIR
filter can be realized with the structures shown in Figures 6.14(b) and 6.14(c). If N is even, 
$H_0(z)$ and $H_1(z)$ are no longer symmetrical. However, similar to the case of n even (N=4n-1) 
discussed in section 4.5.2, the two polyphase sub-filters are time-inverse of each other, or, in 
the z-domain, are related by
$H_0(z) = z^{-(N/2-1)} H_1(z^{-1})$ (6.18)
With $H_0(z)$ and $H_1(z)$, a symmetrical filter P(z) and an anti-symmetrical filter Q(z) can be 
defined (refer to Eqs. (4.14) and (4.15)) by:
$P(z) = [H_0(z) + H_1(z)]/2$ (6.19a)
$Q(z) = [H_0(z) - H_1(z)]/2$ (6.19b)
such that
$H_0(z) = P(z) + Q(z)$ (6.20a)
$H_1(z) = P(z) - Q(z)$ (6.20b)
Therefore the symmetry property can also be exploited for the N even cases by transforming 
the non-symmetrical polyphase filters into symmetrical (or anti-symmetrical) ones according 
to Eqs. (6.19). Applying the symmetrical transversal structure to the modified sub-filters P(z)
and Q(z), the FIR filters with even N can be realized using the structures shown in Figures 6.14(d)
(for N/2 even) or 6.14(e) (for N/2 odd).
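The N even case can be illustrated with a few lines of code. The sketch below assumes the usual two-path polyphase split (even-indexed coefficients form H0, odd-indexed coefficients form H1); the function name `polyphase_pq` and the example coefficients are hypothetical.

```python
def polyphase_pq(h):
    """Split an even-length symmetric FIR into polyphase parts and form P and Q.

    Returns (h0, h1, p, q) as coefficient lists, where
    p = (h0 + h1) / 2 and q = (h0 - h1) / 2, as in Eqs. (6.19a) and (6.19b).
    """
    if len(h) % 2 != 0:
        raise ValueError("this sketch covers the N even case only")
    h0 = list(h[0::2])   # even-indexed taps
    h1 = list(h[1::2])   # odd-indexed taps
    p = [(a + b) / 2 for a, b in zip(h0, h1)]
    q = [(a - b) / 2 for a, b in zip(h0, h1)]
    return h0, h1, p, q


if __name__ == "__main__":
    h = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]   # symmetric, N = 8
    h0, h1, p, q = polyphase_pq(h)
    print("h0 =", h0, "h1 =", h1)   # neither is symmetric; h0 is h1 reversed
    print("p  =", p)                # symmetric
    print("q  =", q)                # anti-symmetric
```

Because P is symmetric and Q is anti-symmetric, each needs only half as many multipliers as it has taps, which recovers the saving lost when H0 and H1 ceased to be symmetric.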
Figure 6.14 Two-path FIR structures using symmetry property: (a) two-path FIR; (b) N odd, (N-1)/2 odd (N=19); (c) N odd, (N-1)/2 even (N=21); (d) N even, N/2 even (N=20); (e) N even, N/2 odd (N=22)
By interchanging the position of the unit delay and the $H_1(z^2)$ (or $Q(z^2)$ for N even cases), the 
two sub-filters can share a common tapped delay line and the structures for the above four
cases are equivalent to those shown in Figure 6.15, in which $p'_i$ and $q'_i$ are defined as $p_i/2$ and 
$q_i/2$ respectively.
Figure 6.15 A new 1D array for linear phase FIR: (a) N odd, (N-1)/2 odd (N=19); (b) N odd, (N-1)/2 even (N=21); (c) N even, N/2 even (N=20); (d) N even, N/2 odd (N=22)
These structures clearly have low complexity, as the number of multiplications is the same as 
that of the symmetrical transversal structure and the number of delays is N-1 for even N and 
N for odd N. They, however, have better regularity and, as will be seen later, can be made 
pure systolic (no global broadcast of signals) without increasing the latency, which is not 
possible for the semi-systolic structures discussed earlier.
To pipeline the 1D FIR arrays, we apply the rule of “cut-and-insert” to the arrays, which is 
based on the principle that inserting a number of delays (time-advances) to all the in-bound 
signals and the same number of time-advances (delays) to all the out-bound signals of an LTI 
system will not alter the system’s transfer functions. The proof of the “cut-and-insert” rule is 
given in Appendix F. Inserting unit delays to the outgoing signals at the cutting positions 
shown in dashed lines in Figure 6.15 results in the pipelined 1D arrays as shown in Figure 
6.16.
Figure 6.16 Pipelined 1D FIR array: (a) N odd, (N-1)/2 odd (N=19); (b) N odd, (N-1)/2 even (N=21); (c) N even, N/2 even (N=20); (d) N even, N/2 odd (N=22)
Hence, by defining different types of PEs, the above pipelined 1D arrays can be described by 
the novel systolic arrays shown in Figure 6.17.
Figure 6.17 Novel pure systolic arrays for linear phase FIR: (a) PEs for N odd; (b) array for N odd, (N-1)/2 odd; (c) array for N odd, (N-1)/2 even; (d) PE for N even; (e) and (f) arrays for the two N even cases
To further improve the speed performance, unit delays can be inserted between the adder and 
the multiplier cascades in PEs, which will reduce the critical path from $2T_a + T_m$ to $T_m$ at the
expense of two samples of increased latency, where $T_a$ and $T_m$ are the propagation delays 
of the adder and the multiplier respectively.
The proposed pure systolic architecture has the same number of multiplications and latency as 
those of the symmetrical transversal structure. Compared to the proposed semi-systolic 
architecture, the proposed pure systolic architecture is more length-dependent (PEs are 
different for odd N  and even N) and post-PEs are required. Despite these insignificant 
disadvantages, the proposed novel pure systolic FIR architectures are considered superior to 
all the existing systolic/pipeline FIR structures in terms of low complexity, locality, and 
latency.
6.4.4 Pure systolic halfband FIR
As a special case of linear phase FIR, the halfband filter whose length N is determined by 
N=4n-1, where n is the number of distinct non-zero coefficients not including the centre 
tap, can be described by the structure shown in Figure 6.16(a) where N is odd and (N-1)/2 is 
odd. Note that in the halfband filter, the odd numbered coefficients are all zeros except the 
centre tap ($h_{(N-1)/2}$) which is 0.5. Thus the multipliers, except the right-most one, and the 
connecting adders of the lower systolic path for $H_1(z)$ disappear in Figure 6.16(a). To further 
simplify the structure the centre tap can be appended to the upper path for $H_0(z)$ resulting in 
the 1D array as shown in Figure 6.18(a). If we define the processing element as that shown in 
Figure 6.18(b), the halfband FIR structure shown in Figure 6.18(a) is identical to the 1D 
systolic array shown in Figure 6.18(c). The structure has the following advantages:
• It is pure systolic because all the connections between PEs are local;
• There is no increased latency compared to the transversal structures; and
• The computational complexity is minimized as the required number of multiplications and
the number of additions are respectively n and 2n which are the same as those for the
symmetrical transversal structure.
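The operation count quoted above can be made concrete with a behavioural sketch of a halfband filter of length N=4n-1. It is not a model of the systolic array itself: it only shows that n multiplications and 2n additions per output suffice, with the centre tap of 0.5 treated as a shift. The function name and the 7-tap example coefficients are assumptions chosen for illustration.

```python
def halfband_fir(g, x):
    """Halfband FIR of length N = 4n - 1 using n multiplications per output.

    g holds the n distinct non-zero coefficients h_0, h_2, ..., h_{2n-2};
    the centre tap h_{2n-1} is fixed at 0.5 and all other odd taps are zero.
    """
    n = len(g)
    N = 4 * n - 1
    centre = 2 * n - 1
    hist = [0.0] * N                  # hist[k] holds x_{m-k}
    y = []
    for xm in x:
        hist = [xm] + hist[:-1]
        acc = 0.5 * hist[centre]      # centre tap: a 1-bit shift in hardware
        for k in range(n):
            # one pre-addition, one multiplication, one accumulation per pair
            acc += g[k] * (hist[2 * k] + hist[N - 1 - 2 * k])
        y.append(acc)
    return y


if __name__ == "__main__":
    g = [-1.0 / 32.0, 9.0 / 32.0]     # illustrative coefficients, n = 2, N = 7
    impulse = [1.0] + [0.0] * 6
    print(halfband_fir(g, impulse))   # impulse response of the 7-tap filter
```

For n = 2 the loop body runs twice per output, giving two multiplications and four additions, against seven multiplications for a direct transversal evaluation.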
Figure 6.18 Pure systolic architecture for halfband FIR: (a) 1D pipeline of halfband filter, N=4n-1 (n=5); (b) PE definition; (c) pure systolic array
6.5 Summary
In this chapter, we have introduced some DSP techniques and architectures which are useful 
for efficient VLSI implementation for multicarrier DEMUXs. Basically, low complexity is 
the primary concern for the OBP DEMUX. Bit-serial techniques are hence appropriate for 
DEMUX implementations. Three bit-serial arithmetic units have been investigated:
• a simple bit-serial ADD&SUB module;
• a bit-serial 2’s complement unit; and
• a novel full-precision systolic parallel/serial multiplier.
The first two modules have been successfully used in the 8-channel DEMUX design which 
will be presented in the next chapter. A new systolic parallel/serial multiplier is introduced to 
reduce the round-off error and to improve speed performance for full precision 
multiplication. These features make the proposed systolic parallel/serial multiplier 
suitable for high speed, high precision multiplications such as in digital filtering and FFTs.
DA enables inner product operations to be realized very efficiently via table-look-up and 
accumulating in a bit-serial manner. When the dimension of the vector is not large, the DA 
architecture can be highly efficient with very low complexity making it suitable for VLSI 
implementations. As the complexity (dominated by the memory) increases exponentially with 
the vector size, the DA architecture is limited to small vector size in order to maintain the
computational efficiency. For large vector sizes (as for long length FIRs) we have proposed 
proper segmentation of the vector so that the processing can be distributed into several 
smaller DAs in parallel each of which utilizes an efficient DA structure with small memory.
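The table-look-up and bit-serial accumulation summarized here can be sketched as follows. The sketch assumes integer samples in `bits`-bit two's complement form and a look-up table holding every partial sum of the coefficients; the function name and the example data are hypothetical.

```python
def da_inner_product(coeffs, samples, bits):
    """Inner product sum_k coeffs[k] * samples[k] by distributed arithmetic.

    samples are integers representable in `bits`-bit two's complement.
    The look-up table has 2**K entries for a vector of dimension K, which is
    the exponential memory growth that limits DA to small vectors.
    """
    K = len(coeffs)
    lut = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << K)]
    acc = 0
    for b in range(bits):                      # one bit plane per clock cycle
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> b) & 1) << k        # bit b of every sample forms the address
        term = lut[addr] * (1 << b)            # shift-and-accumulate
        acc += -term if b == bits - 1 else term  # the sign bit has negative weight
    return acc


if __name__ == "__main__":
    c = [3, -1, 4, 2]
    s = [5, -3, 7, -8]
    print(da_inner_product(c, s, bits=4), sum(a * b for a, b in zip(c, s)))
```

Splitting a long vector into segments of length K and adding the outputs of the separate tables keeps each memory at 2**K words, which is the segmentation idea described in the paragraph above.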
Systolic stream DFT structures have been proposed for low complexity, high speed, and 
flexible DFT computation. SS-DFT structures are useful in particular for flexible DEMUX 
architectures. There are basically two types of SS-DFT structures: Kung’s model which 
accepts parallel signals and produces the transformed signal in serial and Chang’s model 
which takes the stream signal directly and delivers the transformed signal in parallel. We have 
found that the relationship between the stream DFT and the parallel DFT can be described by 
Figure 6.7(a) and that both SS-DFT structures can be derived from this relationship via 
MSFG transform. Unlike most FFT algorithms, which are very restrictive on the transform 
size, the transform size of the SS-DFT can be arbitrary. The complexity of the SS-DFT is less than 
that of the conventional radix-2 FFT for N larger than 32. The main disadvantage of the SS-DFT 
is the large roundoff error for large N.
Systolic FIR architectures have been proposed for high speed digital processing. Firstly, a new 
low complexity (the symmetry property of linear phase FIR is exploited) semi-systolic FIR 
architecture has been proposed, which, in comparison to Bombardieri’s semi-systolic 
architecture, has the advantages of a narrower systolic path and improved modular 
expandability (at the expense of increased latency). Then, we have derived a novel pure 
systolic FIR architecture which also makes use of the symmetry property of FIR using the 
signal flow graph approach. The proposed architecture significantly improves the speed 
performance compared to semi-systolic architectures by eliminating global signal 
communications without increasing the latency. Finally, a pure systolic halfband FIR 
architecture has been proposed. This halfband FIR architecture is considered optimal in terms 
of low complexity, low latency and high throughput.
Chapter 7 VLSI Architecture for Binary Tree DEMUX
Chapter 7 
VLSI Architecture for Binary 
Tree DEMUX
To meet the low mass, power, and complexity and high reliability requirements of space 
applications, VLSI implementation of the on-board processing payload is a crucial step 
towards any practical deployment of advanced OBP satellites. The work presented in this 
chapter mainly concerns low complexity VLSI architectures and implementations of tree type 
multicarrier demultiplexers. We propose a time-multiplexed tree architecture, the TM2-tree, for 
low complexity implementation of the binary tree. In a TM2-tree, the band-splitting filter is 
realized by cascading a complex input buffer and a complex inner product (IP) processor 
which produces both the lowpass and the highpass samples on a time-sharing basis. To 
further reduce complexity, we apply the DA and bit-serial techniques to realize the IP, 
resulting in a multiplierless implementation of the tree DEMUX. Issues on VLSI 
implementation of the TM2-tree using techniques and architectures introduced in chapter 6 are 
addressed. As a design example, an ASIC design for an 8-channel DEMUX is presented. 
Finally, comparisons in complexity, power consumption and processing speed with other 
implementation schemes are given.
7.1 Binary tree DEMUX
In chapter 3, we briefly introduced the binary tree DEMUX structure which requires the 
number of channels be an integer power of two. The early application of the structure was in 
terrestrial transmultiplexers and sub-band coding for speech signals [Tsu78, Cla78, Est77]. 
The binary tree structure was later proposed for on-board multicarrier demultiplexing 
because of its simple and regular structure [Gar85, ANT85, Yim88, Goc88, Del88, Eys90,
Bi90, Qi92a, Sec92, Auc93]. The computational efficiency of the structure lies in that each 
channel is retrieved by an equivalent multistage decimation filter, hence substantially relaxing 
stringent channel filter specifications [Cro75].
In a binary tree filter bank the channelization is realized by successively splitting the input 
signal into narrower frequency bands at each stage. Generally, for a K-channel uniformly 
spaced FDM signal, where $K=2^m$, the binary tree consists of K-1 BSFs in m stages. To show 
how the binary tree structure works, let us redraw the simple 8-channel binary tree 
demultiplexer of Figure 3.14(a) as Figure 7.1 and investigate the spectral relations at the 
reference points shown in the figure.
Figure 7.1 Binary tree DEMUX (8-channel tree of BSFs with outputs ch0 to ch7)
7.1.1 Real binary tree
The input signal is assumed real, single sideband FDM, as in most transmultiplexing and 
multicarrier demultiplexing applications. For transmultiplexing, real channel signals are 
required at the output, hence real lowpass and highpass filters can be directly used in the BSF. 
Spectra at the reference points marked in Figure 7.1 are depicted in Figure 7.2. The 
characteristics of the BSF filter are also illustrated in the figure. Obviously, sharp transition 
bands are necessary to reduce the aliasing due to the decimation by two.
The lowpass and highpass filters of the BSF are translations of a prototype real lowpass filter
h(n). For the real BSF shown in Figure 3.14(c), the lowpass filter is the prototype filter itself and
the highpass filter is the π-frequency translated version of the prototype (hence also real). 
That is,
$h_L(n) = h(n)$ (7.1a)
$h_H(n) = e^{j\pi n} h(n)$ (7.1b)
The two filters are thus related by
$h_H(n) = (-1)^n h_L(n)$ (7.2)
The real BSF defined by Eq.(7.1) can be realized with the optimal structure shown in Figure 
4.21 in which the two filtering processes are completely shared and the computation load is 
equivalent to that of a single prototype filter.
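One way to see why the two filtering processes can be shared is that, with the highpass coefficients equal to the lowpass ones apart from a sign change on the odd-indexed taps, both outputs are built from the same two partial sums. The sketch below illustrates only this arithmetic sharing; it is an assumption that this corresponds to the structure of Figure 4.21, and the function name is hypothetical.

```python
def real_bsf_sample(h, window):
    """Compute one lowpass and one highpass output of a real BSF.

    h is the real prototype filter; window[i] holds x(n - i).
    With h_L(i) = h(i) and h_H(i) = (-1)**i * h(i), both outputs come from
    the same even-tap and odd-tap partial sums.
    """
    even = sum(h[i] * window[i] for i in range(0, len(h), 2))
    odd = sum(h[i] * window[i] for i in range(1, len(h), 2))
    return even + odd, even - odd     # (lowpass, highpass)


if __name__ == "__main__":
    h = [1.0, 2.0, 3.0, 2.0, 1.0]
    window = [0.5, -1.0, 2.0, 4.0, -3.0]
    print(real_bsf_sample(h, window))
```

Every product is formed once, so the multiplication count equals that of the prototype filter alone; the highpass output costs one extra subtraction.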
7.1.2 Complex binary tree
The multicarrier demultiplexing in MCDD applications usually demands analytic 
(complex) channel signals. When complex lowpass and highpass filters are used in the BSF, 
much wider transition bands of the BSF are allowed. This advantage can be fully exploited if 
halfband filters [Cro83, Had91] are used. In this case, the transition band can be as wide as 
π/2 as shown in Figure 3.14(b). The spectra at the reference points marked in Figure 7.1 are 
shown in Figure 7.3.
It should be observed that for MCDD applications pulse shaping filtering is necessary after 
the last stage of a complex tree and before the demodulators in order to eliminate the spectral 
fold-over (aliasing due to decimation) in the negative frequency band [π, 2π] and to limit the 
bandwidth of the demultiplexed channel signal (refer to the corresponding spectra in Figure
7.3).
Referring to Figure 3.14(b), the lowpass and the highpass filters of a complex BSF are 
obtained by frequency shifting the prototype lowpass filter h(n) to ω=π/4 and ω=3π/4 
respectively, i.e.,
(7.3a)
(7.3b)
where h(n) is assumed causal FIR (h(n)=0 for n<0) with length N (N is odd for halfband 
filter). The constant phase shifts in the above equations are necessary for a causal complex 
filter to preserve the conjugate symmetry property (see section 2.3.2).
The two filters are thus related by
hH(n) = 0 T hL(n) (7.4)
We denote by $y_L(n)$ and $y_H(n)$ the BSF’s lowpass and highpass outputs respectively. For an input 
signal x(n), the input-output relations of the complex BSF can be described by
(7.5a)
(7.5b)
It should be noted that the π-frequency shift (modulation by $(-1)^n$) in the BSF is necessary
because the band of interest at the highpass branch after the decimation is located 
on the high frequency band, which is outside the passbands of the complex lowpass and
highpass filters. It must be moved back to the lower frequency band for the following BSF to 
correctly split the spectrum (see the corresponding spectra in Figure 7.3). This
requirement can be dropped in a real BSF since a real BSF can cover all the band of interest at 
that point (see the corresponding spectrum in Figure 7.2). However, the channel sidebands and 
the channel order will be reversed at the output of the BSF’s highpass branch, giving rise to a 
channel order at the DEMUX output different from that of Figure 7.1.
The main advantage of the binary tree filter bank lies in its simplicity and regularity of structure, 
which is particularly important for efficient VLSI implementations. Even so, its complexity is 
still considerable and it increases linearly with the number of channels because the 
number of nodes (BSFs) is in proportion to the number of channels. As a result, with 
moderate VLSI technology, it is very difficult to have a single chip design for DEMUXs with
more than four channels. Sharing of computations within the DEMUX is thus required and
essential for area efficient designs.
Figure 7.2 Real binary tree spectra and real BSF
Figure 7.3 Complex binary tree spectra and complex BSF
7.2 Time-multiplexed tree
To reduce the implementation complexity of the tree type DEMUX, operations should be 
shared wherever possible. We propose a low complexity implementation for the binary tree 
structure in which the arithmetic operations (filtering) of all the branches in a tree stage are 
performed by a common digital arithmetic processor in a time-multiplexed manner. The time- 
multiplexed use of the arithmetic unit is allowed because
a) referring to Figure 7.4(b), though the sampling rate of a stage is a half of that of the 
previous stage, the information rates of all the stages remain the same as that of the input 
(i.e., the input sampling rate);
b) all the BSFs of the binary tree structure are identical; and,
c) the arithmetic unit is memoryless as a result of the separation of arithmetic operations, 
which are memoryless, from memory operations (buffering), which cannot be shared between 
different signals in the BSF.
If the sampling rates at different stages remain the same as that of the input, it might be 
possible to trade the space complexity (processors and the designated signal lines for the stage 
branches) for the time complexity (sampling rate) using just one single processor and related 
signal lines for each stage as illustrated in Figure 7.4(c). In fact, as will be shown in the next 
section, the trade of space complexity for time complexity is made possible by the separation 
of the BSF’s memoryless arithmetic operations, which perform the MAC type inner product, from the BSF’s 
memory operations (for buffering the samples). The devices that perform these two kinds of 
operations are called the inner product processor and the input buffer respectively. In Figure 7.4(a), the 
input sequences to the BSFs at each tree stage are unique because they are the demultiplexed 
outputs of the previous stage, so a dedicated input buffer has to be used for each of the BSFs. 
Sharing of these input buffers is thus impossible. Also notice that in Figure 7.4(a) the 
aggregated output information rates of all the stages are the same because of the decimation 
by 2 in the BSF. It is therefore possible to use a common IP in each stage, which alternately 
takes signal samples from each input buffer of the stage and produces samples for the 
corresponding channel. We call the architecture which is configured and operates in this 
manner the time-multiplexed tree (TM-tree); it is illustrated in Figure 7.4(d). With this 
architecture, for a K-channel DEMUX, the number of arithmetic processors (complex IPs) is 
reduced from K-1 to $\log_2 K$.
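The processor counts and the constant aggregate rate per stage can be tabulated with a short sketch. Rates are expressed relative to the input sampling rate; the function name and the dictionary layout are illustrative assumptions.

```python
def tree_resources(num_channels):
    """Compare the direct binary tree with the TM-tree for K = 2**m channels.

    Returns the processor counts and, per stage, the number of BSFs, the
    input rate of each BSF and the aggregate rate of the stage (all rates
    relative to the input sampling rate).
    """
    m = num_channels.bit_length() - 1
    if num_channels < 2 or (1 << m) != num_channels:
        raise ValueError("the number of channels must be an integer power of two")
    stages = []
    for s in range(m):
        bsfs = 2 ** s                 # BSFs in stage s+1 of the direct tree
        rate = 1.0 / bsfs             # each one runs at a halved rate per stage
        stages.append({"stage": s + 1, "bsfs": bsfs,
                       "rate_per_bsf": rate, "aggregate_rate": bsfs * rate})
    return {"direct_processors": num_channels - 1,   # K - 1 BSF arithmetic units
            "tm_processors": m,                      # log2(K) shared IPs
            "stages": stages}


if __name__ == "__main__":
    info = tree_resources(8)
    print(info["direct_processors"], info["tm_processors"])
    for row in info["stages"]:
        print(row)
```

Since the aggregate rate of every stage equals the input rate, one processor clocked at the input rate has exactly enough cycles to serve all branches of its stage.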
Figure 7.4 The direct tree vs. the TM-tree (panels (a) to (d); axes: space and time)
Another advantage of the TM-tree is that the throughput (speed) potential of the VLSI 
technology can be fully exploited, as the whole circuit is clocked at the highest required 
sampling rate (the input rate), whereas in the direct tree structure only a very 
small portion of the circuit runs at the highest sampling rate.
7.3 Time-multiplexed BSF
7.3.1 Decomposition of BSF
An LTI FIR filter basically performs a MAC type inner product between a constant vector 
(filter coefficients) and the input signal sample vector. Since its memory operations, which 
generate the signal sample vectors, and its memoryless arithmetic operations, which perform the 
inner product, are always separable, it can be decomposed into a cascade of the input buffer 
and the inner product processor. Similarly for the BSF, if its arithmetic operations can be 
separated from its memory operations (this might not be possible for some BSF schemes), then it 
can be realized with a complex input buffer (two identical real buffers) and a complex inner 
product processor that generates a lowpass and a highpass output sample once a complex 
data vector is received from the input buffer. The first stage in Figure 7.4(d) illustrates a 
decomposed BSF structure. The vector dimension in general equals the filter length N, but it 
can be considerably reduced by taking into account the effects of zero coefficients and the 
filter symmetry property.
7.3.2 The time-multiplexed BSF
The band splitting filter defined by Eqs. (7.5a) and (7.5b) implies concurrent output of the 
lowpass and highpass samples. The decimation by 2 in the BSF makes it possible for the 
lowpass and highpass outputs to be time-multiplexed into a single sequence at the input 
sampling rate. Thus, instead of computing the lowpass and the highpass samples at half 
the input sampling rate using two separate complex IPs, we can compute both 
with just one complex IP operating at the input sampling rate in a time-multiplexed manner, 
which further reduces the complexity of the BSF. We call 
this type of BSF the time-multiplexed BSF (TM-BSF). The BSF shown in Figure 7.4(d) belongs to 
this category as its IP generates interleaved lowpass/highpass samples. Because the concept of 
time-multiplexed use of a device has been applied twice, at different hierarchical levels, in 
Figure 7.4(d) (a TM-tree with TM-BSFs), we will refer to this architecture as the TM2-tree. With the 
time-multiplexed use of the arithmetic processor between the lowpass and highpass processes, 
the complexity of the arithmetic operations of the BSF is approximately halved.
To be able to time-share a common arithmetic processor, the lowpass and the highpass 
filtering processes need to be reorganized such that they perform the same kind of (if not 
identical) computations and such that switching from one process to the other is trivial. 
Now let us consider a simple case where the halfband filter length N=7. By expanding Eqs.
(7.5a) and (7.5b), the lowpass and the highpass processes have the following equivalent 
expression [Bi90]:
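$$
\begin{aligned}
y_l(m) = {} & -\frac{h(0)}{\sqrt{2}}\Big\{\big[x_i'(2m) + x_r'(2m-6)\big] + j\big[x_r'(2m) - x_i'(2m-6)\big]\Big\} \\
& + \frac{h(2)}{\sqrt{2}}\Big\{\big[x_r'(2m-2) + x_i'(2m-4)\big] + j\big[-x_i'(2m-2) + x_r'(2m-4)\big]\Big\} \\
& + \frac{h(3)}{2}\Big\{\big[x_r'(2m-3) + x_i'(2m-3)\big] + j\big[x_r'(2m-3) - x_i'(2m-3)\big]\Big\}
\end{aligned}
\tag{7.6a}
$$
$$
\begin{aligned}
y_h(m) = {} & -\frac{h(0)}{\sqrt{2}}\Big\{\big[-x_r'(2m) - x_i'(2m-6)\big] + j\big[x_i'(2m) - x_r'(2m-6)\big]\Big\} \\
& + \frac{h(2)}{\sqrt{2}}\Big\{\big[-x_i'(2m-2) - x_r'(2m-4)\big] + j\big[-x_r'(2m-2) + x_i'(2m-4)\big]\Big\} \\
& + \frac{h(3)}{2}\Big\{\big[x_r'(2m-3) + x_i'(2m-3)\big] + j\big[x_r'(2m-3) - x_i'(2m-3)\big]\Big\}
\end{aligned}
\tag{7.6b}
$$
where x_r' and x_i' are defined as
$$
x_r'(n) = x_r(n) + x_i(n), \quad \text{and} \quad x_i'(n) = x_r(n) - x_i(n) \tag{7.7}
$$
Define the coefficient vector C as
$$
C = \left[\frac{h(3)}{2},\ \frac{h(2)}{\sqrt{2}},\ -\frac{h(0)}{\sqrt{2}}\right] \tag{7.8}
$$
and the signal sample vectors
$$
\begin{aligned}
X_1 &= \big[x_r'(2m-3),\ x_i'(2m-4),\ x_i'(2m)\big] \\
X_2 &= \big[x_i'(2m-3),\ x_r'(2m-2),\ x_r'(2m-6)\big] \\
X_3 &= \big[x_r'(2m-3),\ -x_i'(2m-2),\ -x_i'(2m-6)\big] \\
X_4 &= \big[-x_i'(2m-3),\ x_r'(2m-4),\ x_r'(2m)\big]
\end{aligned}
\tag{7.9}
$$
Then the two processes can be expressed in terms of inner products of a constant vector and 
signal sample vectors as described by Eqs. (7.10a) and (7.10b).
$$
y_l(m) = C(X_1 + X_2)^T + jC(X_3 + X_4)^T \tag{7.10a}
$$
$$
y_h(m) = C(X_3 - X_4)^T + jC(X_1 - X_2)^T \tag{7.10b}
$$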
Clearly, for concurrent lowpass and highpass outputs, four real inner-products (two for each 
filtering process) must be computed. However, if we keep the real IPs operating at the same 
sampling rate as that of the input, just two of them will be sufficient to produce the four real
samples of Eqs. (7.10) within one input sample period. As a result, the output samples of the
two processes are time-multiplexed into a single complex stream. Figure 7.5 shows a 7-tap 
TM-BSF architecture.
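As a numerical cross-check of the inner-product formulation of Eqs. (7.7) to (7.10), the following Python sketch compares it with direct complex filtering. It is only a sketch under stated assumptions: the lowpass and highpass filters of Eqs. (7.5a) and (7.5b) are taken to be the 7-tap halfband prototype modulated to the centre frequencies π/4 and 3π/4 with the phase referenced to the centre tap, the third element of C is taken with a negative sign, and the coefficient values in `H` are purely illustrative. The function names are hypothetical.

```python
import cmath
import math
import random

# Illustrative 7-tap halfband prototype (symmetric, h[1] = h[5] = 0); NOT the thesis coefficients.
H = [-0.03, 0.0, 0.28, 0.5, 0.28, 0.0, -0.03]


def direct_bsf(x, m, centre):
    """Direct complex filtering with the prototype modulated to `centre` (assumed form of Eq. 7.5)."""
    return sum(H[k] * cmath.exp(1j * (k - 3) * centre) * x[2 * m - k] for k in range(7))


def tm_bsf(x, m):
    """Lowpass and highpass outputs from two real inner products used twice (Eqs. 7.7 to 7.10)."""
    def xr(n):                      # x_r'(n) = x_r(n) + x_i(n), pre-conversion of Eq. (7.7)
        return x[n].real + x[n].imag

    def xi(n):                      # x_i'(n) = x_r(n) - x_i(n)
        return x[n].real - x[n].imag

    # Assumed sign convention for the third element of C (Eq. 7.8).
    C = [H[3] / 2, H[2] / math.sqrt(2), -H[0] / math.sqrt(2)]
    n = 2 * m
    X1 = [xr(n - 3), xi(n - 4), xi(n)]
    X2 = [xi(n - 3), xr(n - 2), xr(n - 6)]
    X3 = [xr(n - 3), -xi(n - 2), -xi(n - 6)]
    X4 = [-xi(n - 3), xr(n - 4), xr(n)]

    def ip(a, b, sign):             # real inner product C (a + sign * b)^T
        return sum(c * (u + sign * v) for c, u, v in zip(C, a, b))

    low = complex(ip(X1, X2, +1), ip(X3, X4, +1))    # Eq. (7.10a)
    high = complex(ip(X3, X4, -1), ip(X1, X2, -1))   # Eq. (7.10b): the two IP outputs are swapped
    return low, high


if __name__ == "__main__":
    random.seed(1)
    x = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(16)]
    low, high = tm_bsf(x, 4)
    print(abs(low - direct_bsf(x, 4, math.pi / 4)), abs(high - direct_bsf(x, 4, 3 * math.pi / 4)))
```

Under these assumptions the two real inner products, used once with vector addition and once with vector subtraction, reproduce both filter outputs; the exchange of the real and imaginary parts in the highpass case corresponds to the output switches of the TM-BSF.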
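[Figure not reproduced. Blocks shown: complex input buffer (two real input buffers), data vector forming, two vector add/sub blocks, and two real inner-product processors producing the real and imaginary outputs under highpass/lowpass control; input x'(n).]
Figure 7.5 The time-multiplexed BSF (N=7)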
The intermediate complex signal x'(n), assumed to be bit-serial, is first buffered and then 
converted into a complex vector sequence (5 bits wide, in accordance with the 7-tap halfband 
FIR in which two of the seven coefficients are zero). The data vector forming mechanism 
regroups the five parallel complex data samples from the input buffer into four real 3-bit wide 
data vectors X1, X2, X3, and X4 according to Eq. (7.9). The vector add/sub blocks perform 
vector addition when ‘lowpass’ is true; otherwise they perform subtraction for the highpass 
samples. Finally, two identical real inner product generators, in which the coefficient vector C 
is stored, produce the imaginary and the real components of the two processes. The 
switches swap the two output sequences from the two IPs as required by Eqs. 
(7.10). This TM-BSF structure has been used in our tree DEMUX ASIC designs.
7.3.3 The DA inner-product
Although the low complexity bit-serial multiplier introduced in chapter 6 can be used to 
realize the IP of Figure 7.5, we have found that using the DA technique for the inner 
product generation substantially reduces the complexity, making it much lower than 
that obtained with bit-serial multipliers. In our tree DEMUX ASIC designs we used the DA architecture 
to compute the four real inner products of the BSF (Eqs. (7.10)). Figure 7.6 shows a real IP 
architecture using DA. It has been used for the real inner product computation in Figure 7.5. 
Because the vector size is 3, the lookup table needs only 2^(3-1)=4 words of ROM using the 
reduced memory DA architecture [Whi89]. The input sample vector to the ROM block is 
bit-serial, in 2’s complement format with LSB first (as required in Eq. (5.2)). The output 
of the block is bit-parallel and is assumed to be 12 bits wide. Instead of a ROM, we use 
optimized combinational logic to further reduce the complexity. The combinational logic 
approach is not suitable for a large ROM, as its complexity tends to exceed that of the ROM, not 
to mention the difficulty of optimizing logic with many inputs. A carry-look-ahead 
parallel adder/subtractor is used to improve the speed. As required by the reduced memory 
DA scheme, the subtraction is carried out when the MSB (sign bit) arrives. Finally, the 
bit-parallel inner product is converted into bit-serial format for consistency with the 
succeeding bit-serial processing units.
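The principle of a DA inner product can be sketched in a few lines of Python. Note the assumptions: the sketch uses the plain lookup table with 2^3 = 8 words rather than the reduced memory 4-word variant of [Whi89], it uses integer coefficients as stand-ins for the quantized filter coefficients, and the function name and example values are hypothetical.

```python
def da_inner_product(coeffs, samples, bits=12):
    """Inner product sum(c * x) computed bit-serially (LSB first) by distributed arithmetic.

    coeffs  : integer coefficients (illustrative stand-ins for the quantized filter taps)
    samples : integers representable in `bits`-bit two's complement
    """
    n = len(coeffs)
    # Lookup table: word `addr` holds the sum of the coefficients whose address bit is 1.
    lut = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1) for addr in range(1 << n)]
    acc = 0
    for j in range(bits):                    # one bit plane per step, LSB first
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> j) & 1) << k      # bit j of every sample forms the table address
        term = lut[addr] * (1 << j)          # weight of bit plane j
        # The MSB of a two's complement word has negative weight, hence the subtraction.
        acc += -term if j == bits - 1 else term
    return acc


if __name__ == "__main__":
    c = [5, -3, 7]
    x = [-2048, 2047, -1]
    print(da_inner_product(c, x), sum(ci * xi for ci, xi in zip(c, x)))
```

In hardware the weighting by powers of two is obtained by shifting the accumulator between bit planes rather than by an explicit multiplication; the explicit weight is used here only to keep the sketch short.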
[Figure not reproduced. Blocks shown: ROM (4 words) or combinational logic, parallel ADD/SUB, 1/2 latch, P/S converter.]
Figure 7.6 The DA real inner-product processor architecture
7.4 The two-path input buffer
As a result of time-shared use of the arithmetic processor and of the requirement for a 
complete deployment of input buffers, the complexity of a TM-tree is dominated by that of 
the memory. Efficient realization of the input buffer is thus particularly important for area- 
efficient VLSI implementation of TM-tree. In this section we propose a low complexity input 
buffer structure for the TM-BSF introduced previously.
According to Eqs. (7.10a) and (7.10b), the lowpass and the highpass filtering of the BSF can 
share a common input buffer because they use the same signal sample vectors. In the TM-BSF 
shown in Figure 7.5, the input buffer should deliver each signal 
sample vector twice so as to facilitate the time-multiplexed use of the complex IP, which 
generates lowpass and highpass samples alternately. A simple way of generating the repeated 
samples is to use linear FIFO (first-in-first-out) shift registers with 2:1 logic multiplexers (switches 
that alternately connect one of the two inputs to the output) placed at appropriate 
positions, as shown in Figure 7.7(a). For an N-tap halfband FIR (N=4n-1, where n is the number of 
distinct non-zero coefficients excluding the centre tap), N words of storage (L D-type shift 
registers for an L-bit word) and 2n+1 MUXs are needed by the FIFO buffer structure.
Alternatively, the two-path input buffer architecture shown in Figure 7.7(b) can be used to 
facilitate the time-multiplexed use of the IP. In a two-path input buffer, the input 
signal samples are split between the two paths: the even-numbered and the odd-numbered 
samples are fed into the lower and the upper paths respectively. The shift registers in the 
lower path and the last one in the upper path alternately shift right and shift circularly so that 
each signal sample is produced twice. The number of word shift registers and the number of 
MUXs required by this structure are 3n and 2n+1 respectively. The memory saving 
compared to the linear shift register buffer is thus (n-1)/(4n-1) (rising from 14% for n=2 towards 25% as 
n increases), which is significant when the memory dominates the hardware complexity (as 
in the TM-tree architecture). The outputs of the buffer are controlled by tri-state transfer 
gates, which enable the buffer to be connected to a data bus. In fact, the commutating switch of 
the IP shown in Figure 7.4(d) is realized by bussing all the input buffers of the stage 
together and allowing only one of them to access the bus at a time while the others are in 
the high impedance state.
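The storage figures quoted above can be tabulated with a short Python sketch. It simply evaluates the word counts N = 4n-1 (linear FIFO) and 3n (two-path buffer) stated in the text; the function name is hypothetical.

```python
from fractions import Fraction


def buffer_words(n):
    """Word counts for an N-tap halfband FIR with N = 4n - 1 (n distinct non-zero side taps)."""
    linear = 4 * n - 1                              # linear FIFO buffer: N words
    two_path = 3 * n                                # two-path buffer: 3n words
    saving = Fraction(linear - two_path, linear)    # (n - 1) / (4n - 1)
    return linear, two_path, saving


if __name__ == "__main__":
    for n in (2, 3, 5, 10, 100):
        linear, two_path, saving = buffer_words(n)
        print(f"n={n:3d}  N={linear:3d}  two-path={two_path:3d}  saving={float(saving):.1%}")
```

For the 7-tap filter used in this chapter (n=2) the saving is one word in seven; it approaches one quarter as n grows.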
The operation of the input buffer is illustrated in Figure 7.7(c). In the ‘G1’ period, the lower path 
performs a linear shift-right, delivering even-numbered samples and storing them in the 
registers. In the meantime, the last shift register of the upper path performs a circular 
shift-right, pumping out an odd-numbered sample stored previously and saving it again in the 
register. In the ‘G0’ period, the shift registers in the lower path perform a circular shift-right, 
reproducing the even-numbered samples stored in the ‘G1’ period, whereas the upper path 
linearly shifts out the odd-numbered sample for the second time and stores a new sample.
Figure 7.7 Input buffers: (a) linear FIFO (b) two-path buffer (c) operation of (b)
7.5 VLSI architecture for TM2-tree
Having chosen the TM-tree architecture with TM-BSF, i.e., the TM2-tree architecture, for tree 
DEMUX implementation, we have the VLSI architecture as shown below in Figure 7.8.
[Figure not reproduced. Blocks shown: pre-conversion blocks, input buffers BUF1, BUF2(L), BUF2(H), BUF3(LL), BUF3(LH), BUF3(HL), BUF3(HH), and inner-product processors DA IP1 to DA IP3 on a complex local data bus, with control input and input x(n). Time-multiplexed output: ..., ch0, ch4, ch2, ch6, ch1, ch5, ch3, ch7, ch0, ...]
Figure 7.8 VLSI architecture for an 8-channel DEMUX based on TM2-tree
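The time-multiplexed output order given with Figure 7.8 (ch0, ch4, ch2, ch6, ch1, ch5, ch3, ch7) coincides with the bit-reversed channel index. The Python sketch below generates that order; that the same rule holds for other numbers of channels is an assumption made here for illustration, and the function name is hypothetical.

```python
def output_order(num_channels):
    """Bit-reversed channel order for a binary tree with num_channels = 2**stages outputs."""
    stages = num_channels.bit_length() - 1
    if num_channels < 2 or (1 << stages) != num_channels:
        raise ValueError("num_channels must be a power of two and at least 2")
    order = []
    for i in range(num_channels):
        rev = 0
        for b in range(stages):
            if (i >> b) & 1:                     # bit b of the time slot index
                rev |= 1 << (stages - 1 - b)     # becomes bit (stages-1-b) of the channel number
        order.append(rev)
    return order


if __name__ == "__main__":
    print(output_order(8))
```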
The pre-conversion blocks in the figure perform the function defined by Eq. (7.7). The 
bit-serial add/subtract module shown in Figure 6.1 is used for this block. The complex input 
buffer consists of two identical two-path input buffers, which have two external control 
signals (G0 and G1 in Figure 7.7) that determine the two operation modes (linear shift and 
circular shift). The two-phase clocks (C0 and C1 in Figure 7.7) are generated internally by 
gating the continuous clock with G0 or G1. The complex DA-IP consists of two identical real 
IPs based on the DA architecture shown in Figure 7.6 and an address generator, i.e., the logic that 
includes the vector forming and the vector add/sub blocks in Figure 7.5. The IPs are 
controlled by the ‘B’ signals to perform either the lowpass or the highpass filtering. The input 
switch of the stage buffers shown in Figure 7.4(d) is realized by broadcasting the signal after 
the pre-conversion to all the input buffers of the stage. At any time instant, only one of the 
input buffers is activated by assigning logic ‘1’ to its ‘G’ control signal. The rest of the input 
buffers must stay in the idle state. The output switch for the buffers is realized by simply 
connecting the buffer outputs to the local data bus. In any sampling period, only the 
activated buffer (‘G’=1) has the right to access the bus, through which its stored data are 
transferred; the rest are ‘suspended’ (in the high impedance state). Figure 7.9 illustrates 
the operation of this TM2-tree architecture.
[Timing diagram not reproduced. Signals shown: BUF1, BUF2(L), BUF2(H), BUF3(LL), BUF3(HL), BUF3(LH), BUF3(HH), their ‘G’ control signals, B3 and IP3, with the latency between stages 1 and 2 and between stages 2 and 3 marked; time scale in data bits and word cycles. Legend: input buffer operation (linear shift-right of the lower path, circular shift-right of the lower path, idle); IP operation (output a lowpass sample, output a highpass sample); n: channel number.]
Figure 7.9 Control signals and the data flow of the TM-tree architecture
Since in the linear shift period (‘G1’=1, with respect to the lower path) the input buffer 
delivers a new data sample vector and then repeats it in the following circular shift period 
(‘G0’=1), the low/high control signal ‘B’ of the stage must have different logic values in the ‘G1’ 
and ‘G0’ periods to enable the IP to perform the two filtering processes successively. In 
this TM2-tree architecture, for each input data vector, the IP performs the lowpass filtering first 
and then the highpass filtering. The lowpass and the highpass filtering are 
enabled by setting ‘B’=0 and ‘B’=1 respectively. To reduce the switching rate of the DA-IP 
(which reduces power), the input buffers of a stage are controlled such that they 
successively take up data samples from the previous stage and then repeat the samples in 
the same order (see the operation of the BUF2s and BUF3s in the figure). This specific 
arrangement of the input buffer control signals (‘G’s) also has the advantage of relaxing 
the stringent timing requirements on the input buffer. As shown in the figure, the 
channelized samples at the output of a stage are delayed relative to those of the previous stage 
because of the processing latency between stages. The actual value of the latency depends on 
the implementation of the related blocks. The shaded areas in the figure highlight the timing 
of the operations of the input buffers and the IP in the corresponding stage. They also indicate a 
complete processing period of that stage, in which each channel of the stage produces one 
demultiplexed sample.
7.6 An 8-channel DEMUX ASIC based on TM2-tree
Based on the above VLSI architecture, we designed an 8-channel TM-tree DEMUX ASIC 
using radiation hard CMOS/SOS (Silicon on Sapphire) technology. Radiation hard technology 
is used because in satellite orbits, especially the geostationary orbit, 
trapped electrons in the Van Allen belt are the dominant source of radiation. Electronics 
on-board the satellite must be able to survive an absolute minimum dosage of 7x1 (f rad (Si). In 
addition to the dosage, degradation of digital devices also arises from ionizing cosmic rays, 
which cause soft errors and latch-up [Kat87]. We chose GPS’s¹ 1.5 micron SOS Sea-of-Gates 
[GEC91] for the DEMUX ASIC implementation. Key features of the SOS Sea-of-Gates 
and the cell library are summarized in Appendix G. GPS’s Sea-of-Gates offers several 
advantages over other conventional ASIC products. These include operation in severe 
environments, suitability for safety critical applications, a rapid prototyping service, and relatively low cost. 
Its disadvantage is the maximum size of array available (20k cells for the MA9000A, as compared 
to bulk arrays, which can offer in excess of 250k cells). However, the advantages offered by 
the process for space applications, especially with regard to radiation performance, often 
outweigh this limitation [Mat90].
Function blocks have been carefully designed to ensure a pipelined architecture and to reduce 
the latency. Control signals for each stage are generated locally, easing the problem of signal 
routing, which would otherwise occupy substantial chip area. The prototype half-band filter 
length is chosen as N=7, which provides suitable performance for a wide range of channel 
spacings; the Eb/No degradation as compared to N=11 is negligible [Bi90, Sec92]. Both the 
data word length and the filter coefficient length are chosen to be 12 bits. This is longer than 
suggested in our earlier study (10 bits for data length and 9 bits for coefficients) [Bi90] 
because a 10-bit shift register is not available in GPS’s Sea-of-Gates library, so we had to use 
an 8-bit and a 4-bit shift register provided in the library to form a 12-bit one. The popular 
‘top-down design and bottom-up implementation’ approach was adopted throughout the 
design. The design hierarchy showing the composition of each function block is given in 
Figure 7.10. The numbers after the block names show how many such blocks are required by the 
block at the next higher level of the hierarchy. The leaves of the hierarchy are the building blocks (macro cells) 
provided by GPS’s Sea-of-Gates library (see Appendix G), except the SWAP and the 
MUX2_1 modules, which are simple combinational logic.
¹ GPS: GEC Plessey Semiconductors, the merger of Marconi Electronic Devices Limited (MEDL) and Plessey 
Semiconductors Limited (PSL).
Figure 7.10 Design hierarchy of the DEMUX ASIC
The circuit diagram at the top hierarchy is shown in Figure 7.11. The circuits have 
successfully passed digital simulations using the Mentor Graphics computer aided design 
(CAD) packages (the Quicksim of the Mentor Graphics IDEA station). The DEMUX function 
was verified by comparing the digital simulation results with those of a target C program.
Figure 7.11 Eight-channel TM-tree DEMUX circuits diagram (top hierarchy)
The simulation shows that the circuits can function properly at 20MHz clock frequency 
without violating timing constraints. That allows the DEMUX to demultiplex eight channels 
up to 64kb/s each. We have also investigated the feasibility of single chip design for a 16- 
channel DEMUX using the same TM2-tree architecture. In this case, both the word length and 
the coefficient length have to be reduced in order for the DEMUX to be able to fit into a 
single ASIC chip, e.g., the MA9200 SOS Sea-of-Gates which has up to 20k usable gates. 
Appendix H gives detailed circuits diagrams for a 16-channel, 6-bit word length design. The 
complexity and estimated power consumption for these designs are listed in Table 7.1.
Table 7.1 Complexity and power consumption* of DEMUX ASICs
No. of     Word      Coeff.    No. of    Memory   Arithmetic   Control   Power
channels   length    length    gates     (IB)     (IP)
8          12 bits   12 bits   12,280    68%      30%          2%        178 mW
16         6 bits    8 bits    14,266    71%      24%          5%        164 mW
16         4 bits    8 bits    11,690    67%      28%          5%        140 mW
* The MA9000A has a power dissipation of 1.25 µW/MHz/active gate (see Appendix G); the power 
dissipation is estimated at a clock frequency of 20 MHz.
In this table, the complexities of the DEMUX ASICs were calculated by summing the number 
of cell units (gates) of the building blocks shown in the design hierarchy diagrams (e.g., Figure 7.11) 
according to the macro cell complexities listed in Appendix G. The power consumption due to 
memory and signal processing was estimated approximately using Eq. (7.15). To determine 
the α values in the equation, we made the conservative assumption that 90% of the arithmetic 
circuits are active all the time and 100% of the input buffer circuits are active once enabled, 
which gives α_IP=0.9 and α_BUF=1.0. The power consumption due to control logic was estimated 
by assuming that all gates of the control logic are active all the time. The overall power 
consumption listed in the table is the sum of the two.
It can be observed that these designs are very efficient in minimizing the control overhead 
and the complexity of the arithmetic operations. Any further significant 
reduction of the chip area will most likely come from optimization of the input buffers, which 
account for most of the complexity of the system. One possibility for reducing the input buffer 
complexity is to use random access memories (RAMs) instead of D-type flip-flops to 
realize the word shift registers [Mur82]. However, the processing speed may decrease and the control 
overhead of the RAM scheme can be considerable. This option therefore requires further investigation.
Although a short word length leads to higher quantization and round-off errors, we suggest 
using short word and coefficient lengths as long as the errors remain at acceptable levels. 
The advantage of using a short word length is two-fold: it significantly reduces the complexity 
(as seen in Table 7.1) and it reduces the clock frequency (or, in other words, increases the 
throughput), because in bit-serial processing systems the system clock frequency is N_b (the word length) times the 
highest sampling frequency.
7.7 Comparisons with other TM-trees
As has been discussed previously, the binary tree structure has basically two kinds of 
implementation structures: the direct (full tree) structure and the time-multiplexed structure. 
The major difference between different implementation schemes lies in the way that the BSF 
is implemented. In this section, we will firstly give a general comparison in complexity 
between the direct tree structure and the TM-tree structure and then compare the TM2-tree 
architecture with other TM-trees that use different BSF structures, namely, the ANT's HMM 
cell, BAe's RM-cell, Bi's TM-BSF, and the optimal BSF structures in terms of complexity, 
power consumption, and system throughput.
7.7.1 Direct tree vs. TM-tree
With the direct structure, one BSF is used for each of the tree nodes, so K-1 BSFs must 
be used for a K-channel (K=2^m) binary tree DEMUX. Since most DSP systems, such as FFTs and 
digital filters, are computation bound (as compared to the I/O bound processing in many 
control systems), their control overheads are negligible (as can be seen in Table 7.1). The 
complexity of the tree structure is therefore approximately equal to that of K-1 BSFs, i.e., 
G_dir=(K-1)G_BSF, where G_BSF is the complexity (number of gates) of the BSF. Because the BSF 
can be decomposed into an input buffer and an inner product processor, its complexity is the sum 
of the two, i.e., G_BSF=G_BUF+G_IP, where G_BUF and G_IP are respectively the complexities of the input 
buffer and the inner product processor of the BSF. Hence,
$$
G_{dir} = (2^m - 1)(G_{BUF} + G_{IP}) \tag{7.11}
$$
For a TM-tree architecture, on the other hand, one IP is used for each stage and the 
number of input buffers required is the same as that in a full tree. Consequently, the 
complexity of a K-channel TM-tree can be estimated by
$$
G_{TM} = (2^m - 1)G_{BUF} + m\,G_{IP} \tag{7.12}
$$
which is (2^m-1-m)G_IP less than that of a full tree. The complexity saving of the TM-tree over 
the direct tree is
$$
\eta = \frac{G_{dir} - G_{TM}}{G_{dir}} = \frac{1}{1 + G_{BUF}/G_{IP}} \cdot \frac{2^m - 1 - m}{2^m - 1} \tag{7.13}
$$
In our 8-channel DEMUX ASIC design, the ratio G_BUF/G_IP is approximately 
2x575/1200=0.96, which gives a saving of η=0.51(2^m-1-m)/(2^m-1)=0.51x4/7=29%. As 
the number of channels K increases, the complexity saving due to the use of the TM-tree structure 
approaches the upper bound of 51%. For example, η=46% for K=64. This trend is shown 
in Figure 7.14, in which the complexities of a full tree and a TM-tree, both using the optimal 
BSFs, are estimated and plotted. The saving is obvious and significant, with η=0.89(2^m-1-m)/(2^m-1).
7.7.2 Other BSF structures
The BSF implementation schemes to be compared are the direct transversal structure 
(Figure 4.19(b)), in which a symmetrical transversal FIR is assumed, ANT’s HMM (hierarchical 
multistage method) cell, BAe’s reduced multiplication (RM) cell, Bi’s time-multiplexed BSF 
structure, and the optimal BSF structure derived in chapter 4.
In ANT’s BSF configuration, the structure has been reorganized such that the computations 
are all carried out after the sampling rate decimation, reducing the computation rate by a 
factor of two [ANT85]. However, the symmetry property of the prototype filter is not 
used in this structure, making its complexity virtually the same as that of the direct structure. 
BAe’s BSF configuration adopts an architecture similar to that of ANT, except that it 
rearranges the input buffer structure (with more delay elements) so that the filter symmetry 
property can be exploited (hence the name ‘reduced multiplication’ (RM) cell). Its computation 
rate is a quarter of that of the direct structure and the number of multiplications is half of that 
of ANT (at the expense of increased memory) [Cra90]. In Bi's BSF architecture [Bi90], the 
complex filtering is rearranged in a way similar to our TM-BSF structure. It is thus also a 
time-multiplexed BSF architecture. The major differences between Bi's and our BSF structure 
lie in the input buffer structure (the former is more complex and uses more shift registers 
than the latter) and in the way that the inner product is realized (the former adopts the 
conventional MAC structure using bit-serial multipliers and an adder tree, whilst the latter 
takes advantage of DA techniques without the use of multipliers). For a given BSF filter 
length N (N=4n-1), the required multiplication rate and the required numbers of multipliers, 
adders, and delay elements for these BSF schemes are listed in Table 7.2.
Table 7.2 Complexity requirement for complex BSF structures
              direct      ANT         BAe          Bi          Qi          optimal
# mults       4n+4        4n+2        2n+2         2n+2        2n+2        2n+2
mult. rate    4(n+1)Fs    (2n+1)Fs    (n+1)Fs      2(n+1)Fs    2(n+1)Fs    (n+1)Fs
# adds        4n-4        4n+6        4n+6         4n+4        4n+4        4n+10
# delays      16n-8       6n-2        2n^2+4n-2    9n+12       6n          6n-2
If we assume that both signal samples and filter coefficients have the same word length of N_b 
bits, the gate counts of the bit-serial multipliers and the bit-serial adder/subtractor can be estimated 
as listed in Table 7.3. They will be used to estimate the complexities of the different 
implementation schemes of the tree structure.
Table 7.3 Gate count of some basic arithmetic operators and delay element
Component                              Gate count
Serial multiplier (half-precision)     36N_b
Serial multiplier (full-precision)     70N_b
Parallel multiplier                    12N_b^2 + 44N_b - 20
Pipelined multiplier                   32N_b^2 - 40N_b + 126
Serial adder/subtractor                22
Parallel add/sub (carry-look-ahead)    20N_b
Delay/buffer (word)                    6N_b
ROM/register (word)                    6N_b
Assuming the use of the full precision bit-serial multiplier and the bit-serial adder introduced 
in chapter 6, the gate counts for these BSF structures are estimated and listed in Table 7.4. 
Note that the required number of adders is doubled due to the use of the full-precision bit- 
serial multiplier.
Table 7.4 Gate count estimation for BSF schemes
Scheme     Input buffer         Inner product                Total
Direct     48(2n-1)N_b          280(n+1)N_b + 176(n-1)       (376n+232)N_b + 176n-176
ANT’s      12(3n-1)N_b          140(2n+1)N_b + 88(2n+3)      (316n+128)N_b + 176n+264
BAe’s      12(n^2+2n-1)N_b      140(n+1)N_b + 88(2n+3)       (12n^2+164n+128)N_b + 176n+264
Bi’s       18(3n+4)N_b          140(n+1)N_b + 176(n+1)       (194n+212)N_b + 176n+176
Qi’s       36nN_b               140(n+1)N_b + 176(n+1)       (176n+140)N_b + 176n+176
Optimal    12(3n-1)N_b          140(n+1)N_b + 88(2n+5)       (176n+128)N_b + 176n+440
Figure 7.12 shows graphically the estimated BSF complexities of the schemes listed in Table 
7.4 against the filter length. Clearly, the TM-BSF structure proposed in this chapter and the 
optimal BSF structure are the least complex ones. Then come BAe’s RM-BSF (for N<15), 
Bi’s TM-BSF (for N<15), ANT’s HMM-BSF, and finally the direct BSF structure. The 
complexity of BAe’s structure is slightly higher than ours when the filter length N<15. It 
increases rapidly with the filter length, however, because its memory 
requirement is proportional to the square of the halfband filter order n, and it exceeds that of 
Bi’s when N>15. Though the gate count of our TM-BSF is the lowest, even slightly 
lower than that of the optimal structure, the computational complexity (multiplication rate) of the 
former is about twice that of the latter because of the time-multiplexed use of the IP. 
That, as will be shown later, makes the TM2-tree consume more power than most other 
schemes do.
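The entries of Table 7.4 follow from Tables 7.2 and 7.3, and the comparison above can be checked with the Python sketch below. It assumes 6N_b gates per delay word, 70N_b gates per full-precision serial multiplier and 22 gates per serial adder with the adder count doubled, as stated in the text; the function name and the dictionary layout are hypothetical. The ‘Qi’ entry is the multiplier-based estimate used in the table, not the DA realization.

```python
def bsf_gate_counts(n, nb=12):
    """Gate-count estimates (buffer, inner product, total) per BSF scheme for N = 4n - 1 taps."""
    # scheme: (number of multipliers, number of adders, number of delay words), from Table 7.2
    table_7_2 = {
        "direct":  (4 * n + 4, 4 * n - 4, 16 * n - 8),
        "ANT":     (4 * n + 2, 4 * n + 6, 6 * n - 2),
        "BAe":     (2 * n + 2, 4 * n + 6, 2 * n * n + 4 * n - 2),
        "Bi":      (2 * n + 2, 4 * n + 4, 9 * n + 12),
        "Qi":      (2 * n + 2, 4 * n + 4, 6 * n),
        "optimal": (2 * n + 2, 4 * n + 10, 6 * n - 2),
    }
    counts = {}
    for scheme, (mults, adds, delays) in table_7_2.items():
        buf = 6 * nb * delays                     # delay/buffer word: 6 N_b gates
        ip = 70 * nb * mults + 2 * 22 * adds      # full-precision multipliers, doubled adders
        counts[scheme] = (buf, ip, buf + ip)
    return counts


if __name__ == "__main__":
    for n in range(2, 8):
        totals = {name: c[2] for name, c in bsf_gate_counts(n).items()}
        print(f"N={4 * n - 1:2d}", totals)
```

For N_b=12 the sketch gives a slightly lower total for the ‘Qi’ entry than for the optimal structure at every n, and BAe’s total overtakes Bi’s between N=15 and N=19.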
[Plot not reproduced. Curves include direct, ANT, and BAe; horizontal axis: filter length N, from 7 to 27.]
Figure 7.12 Gate count estimation for BSF structures
7.7.3 TM-trees using other BSF structures
It is logical to consider the other BSF structures mentioned above in a TM-tree for low 
complexity implementation of the tree. A basic requirement for a BSF to be used in a 
TM-tree is that it can be decomposed into an input buffer and a memoryless inner product. The 
decomposition of the direct transversal, ANT's, and BAe's BSFs is straightforward. It is 
also possible for Bi's TM-BSF to be decomposed in this way, as its complex processing 
arrangement is similar to ours (Eq. (7.6)). For the optimal BSF, however, this kind of 
decomposition can be very complicated, if not impossible. Nevertheless, to provide lower 
bounds for the complexity and power consumption of TM-trees, we still assume that the optimal 
BSF can be decomposed in this way. The ANT's, BAe's, and the optimal BSF 
structures share the same TM-tree structure because their buffering and filtering operations are 
all performed at the lower sampling rate side and they all produce concurrent lowpass and 
highpass samples without time-multiplexed use of the IP. This kind of TM-tree architecture is 
shown in Figure 7.13(a). Though it is not efficient and is of no practical interest, the TM-tree 
using the direct transversal BSF is also given as a reference for the other schemes and is shown in 
Figure 7.13(b). Because Bi's BSF is also a time-multiplexed architecture, its TM-tree 
architecture is also a TM2-tree, and we redraw it in Figure 7.13(c).
[Figure not reproduced. Panels: (a) for ANT's, BAe's, and the optimal BSFs; (b) for the direct transversal BSFs; (c) for Bi's and Qi's TM-BSFs. Each panel shows the input buffers and the inner product processors of the stages.]
Figure 7.13 TM-tree architectures for different types of BSFs
7.7.4 Complexity of TM-trees
We have shown that the use of the TM-tree architecture can significantly reduce the complexity 
compared to the direct tree structure (Eq.(7.13)). In this section we will show that the 
proposed TM2-tree also compares favorably, in terms of complexity, with TM-trees using other 
BSFs. In the complexity comparison between TM-trees, we make the following assumptions:
a) The TM-tree architectures shown in Figure 7.13 are assumed for the corresponding BSF schemes;
b) The full precision bit-serial multiplier and the bit-serial adder architectures are used 
for all the schemes (except Qi’s DA scheme);
c) A 7-tap halfband prototype lowpass filter (n=2) is assumed for the BSFs of all the 
schemes; and
d) For simplicity, both the word length and the filter coefficient length are assumed to be 
12 bits.
Referring to Eq. (7.12) and taking into account the complexities for input buffers and inner 
products given in Table 7.4, the gate counts for TM-trees with different BSF schemes are 
estimated and plotted in Figure 7.14.
[Plot: estimated gate count (logarithmic vertical axis, 10000 to 1000000) versus number of channels K (4 to 128); legend: direct, ANT, BAe, Bi, optimal, Qi-DA (real data), optimal (full tree)]
Figure 7.14 Complexity estimation for TM-tree DEMUXs with 7-tap BSFs
From the figure the following observations can be made:
a) The TM2-tree has the lowest complexity, second only to the optimal one;
b) For small K (number of channels) and short filter lengths, the TM-tree with BAe’s RM-BSF 
is nearly as good as our structure in terms of hardware complexity;
c) For large K, the complexity of the TM-tree with ANT’s HMM BSF tends to become lower 
than that with BAe’s RM-BSF and approaches that of our TM2-tree;
d) The complexity of the TM-tree with Bi’s TM-BSF is considerably higher than those using 
ANT’s, BAe’s, and our BSF structures when K>16, and it could be even higher than that 
of the direct structure for large K;
e) The complexity of the input buffer is more significant than that of the arithmetic unit (IP), 
especially when K is large. This explains observations c) and d).
The lowest plot in the figure shows the complexity of the TM2-tree using the DA-TM-BSF 
architecture. The actual complexities of the input buffer and the IP processor are used for 
this complexity projection. In the other plots of TM-tree complexities, the estimation of the 
input buffer complexity takes into account only the required number of delay elements and 
ignores the complexity of the buffer’s control circuits (such as the MUXs in the FIFO and the 
two-path buffer structures), so the estimated complexity in these plots could be considerably 
lower than the actual values. That explains why the complexity of the TM2-tree with the DA 
architecture increases more sharply with K than the others. However, it should not affect the 
general trend of these structures and the relations between them. The reason for including the 
complexity plot of the TM-tree with DA-TM-BSF in this figure is to show the effectiveness of 
the DA approach in complexity reduction, as compared to bit-serial arithmetic implementations. We
have also included the complexity plot for a full tree using the optimal BSFs to give a 
glimpse of the efficiency of TM-trees compared to a full tree.
7.7.5 Power consumption of TM-trees
In this section the power consumption of different TM-tree structures is analyzed and 
estimated by applying the multirate VLSI modeling technique developed in chapter 5. 
Consider the three types of TM-structures shown in Figure 7.13. Again, bit-serial arithmetic 
is assumed in these architectures. To simplify the problem we assume just two kinds of 
components (as shown in the figure): the input buffer, which has an average percentage of 
active gates $\alpha_{BUF}$ and complexity (gate count) $G_{BUF}$, and the IP, with an average 
percentage of active gates $\alpha_{IP}$ and complexity $G_{IP}$. Then, according to Eqs. (5.28b) 
and (5.28c), the power consumption of the three types of TM-tree architectures is given by 
Eq. (7.14) for TM-trees using ANT’s, BAe’s, and the optimal BSF structures (Figure 7.13(a)), 
and by Eq. (7.15) for TM-trees using the direct transversal and Bi’s BSF structures, as well 
as for our TM2-tree (Figures 7.13(b) and 7.13(c)).
For simplicity, we assume all schemes have the same $\alpha_{BUF}$ and $\alpha_{IP}$, and for a 
conservative estimation, let $\alpha_{BUF}$=1.0 and $\alpha_{IP}$=0.9, as in the ASIC power 
estimations made previously. We further assume that, for all the schemes, both the word and 
the coefficient lengths are 12 bits, the VLSI technology to be used is GPS’s 1.5 micron SOS 
Sea-of-Gates ($p$=1.25 µW/MHz/active gate), and all circuits are clocked at 20 MHz. Then, 
according to the above expressions for the power consumption and referring to Table 7.4, the 
power consumption per tree stage for different TM-trees can be estimated and plotted in 
Figure 7.15.
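[Eqs. (7.14) and (7.15): each starts from a sum over the tree stages i = 0, ..., m-1 whose exact terms are not legible in the source; the closing approximations read as follows.]
$$P \approx \tfrac{1}{2}\, p f_{clock} \left( \alpha_{BUF} G_{BUF} + \alpha_{IP} G_{IP} \right) \qquad (7.14)$$
$$P \approx p f_{clock} \left( \alpha_{BUF} G_{BUF} + \alpha_{IP} G_{IP} \right) \qquad (7.15)$$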
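As a rough numerical illustration of this power model, the following Python sketch evaluates a per-stage estimate of the form p · f_clock · (α_BUF · G_BUF + α_IP · G_IP), scaled by a rate factor. Several things here are assumptions rather than values from the thesis: the function name `stage_power_mw`, the reading of the unit of p as µW/MHz/active gate, the rate factor (1/2 for the non-time-multiplexed BSFs, 1 for the TM-BSFs, as the factor-of-two comparison in the text suggests), and the gate counts in the usage note, which are purely hypothetical placeholders for the Table 7.4 figures.

```python
def stage_power_mw(g_buf, g_ip, rate_factor,
                   p_uw_per_mhz_gate=1.25, f_clock_mhz=20.0,
                   alpha_buf=1.0, alpha_ip=0.9):
    """Estimated power of one tree stage in mW (illustrative model only).

    g_buf, g_ip  : gate counts of the input buffer and the inner product
    rate_factor  : assumed relative operating rate (0.5 or 1.0, see lead-in)
    """
    # Number of gates that are switching on average.
    active_gates = alpha_buf * g_buf + alpha_ip * g_ip
    # uW/MHz/gate * MHz * gates gives uW; divide by 1000 for mW.
    return rate_factor * p_uw_per_mhz_gate * f_clock_mhz * active_gates / 1000.0
```

With hypothetical gate counts of 3000 (buffer) and 2000 (IP), `stage_power_mw(3000, 2000, 1.0)` evaluates to 120 mW and `stage_power_mw(3000, 2000, 0.5)` to half of that, so for equal gate counts the factor of two comes entirely from the operating rate.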
[Plot: estimated power consumption (vertical axis, 50 to 500) versus filter length N (horizontal axis, 7 to 27); legend: direct, ANT, BAe, Bi, optimal]
Figure 7.15 Power consumption estimation for TM-trees
The above power consumption analysis for TM-trees shows that, from the viewpoint of low power 
consumption, TM-trees using non-time-multiplexed BSF structures (except the direct BSF) 
are preferred over the TM2-trees with TM-BSFs. The TM2-tree proposed in this chapter, 
though efficient in reducing complexity, tends to consume more power than TM-trees using 
ANT’s, BAe’s, or the optimal BSFs (e.g., its power consumption is about twice that of the 
TM-tree with BAe’s BSF). The optimal BSF structure, again, presents the best performance in 
power consumption.
In a full binary tree as shown in Figure 7.1, the sampling frequency of a stage is half that 
of the previous stage, and the input buffer and the inner product operate at the same 
sampling frequency. An analysis similar to that for the TM-tree case therefore yields exactly 
the same power consumption expressions as those for the TM-trees. That is, Eqs. (7.14) and 
(7.15) are valid for both TM-trees and full trees. Hence we can conclude that there is no 
obvious advantage in reducing power consumption by using TM-tree structures.
7.7.6 Throughput of TM-trees
We have shown in chapter 5 that for a given VLSI technology the system throughput of an 
ASIC is determined by the maximum switching frequency (clock frequency $f_{clock}$) of the 
circuit, which is directly related to the input signal sampling frequency $f_0$. Maximizing a 
circuit’s throughput is synonymous with minimizing the circuit’s switching frequency. Hence to 
fully exploit the speed potential of VLSI technology and to maximize the throughput of TM-trees 
(or equivalently, to maximize the allowed input sampling frequency, hence the channel 
bandwidth), the sampling frequency of the first stage should be as low as possible. Thus, 
according to Figure 7.13, TM-trees with ANT’s, BAe’s, and the optimal BSF structures
could have throughputs twice as high as that of the TM2-tree if the input FDM signal is 
assumed to be in bit-parallel form.
7.8 Summary
The primary concern for VLSI implementation of a multicarrier DEMUX is how to reduce the 
complexity. By carefully examining the features of binary tree filter banks we have derived a 
low complexity, yet computationally efficient, time-multiplexed tree filter bank architecture, 
the TM2-tree. The condition for a BSF structure to be used in a TM-tree is that it can be 
decomposed into a cascade of a complex input buffer and a memoryless arithmetic unit 
(complex IP). In a TM2-tree, not only is the IP time-shared amongst stage channels, but 
at a lower hierarchy level the IP itself is also realized by time-sharing a common complex 
arithmetic processor for both the lowpass and the highpass filtering processes. Thus the 
TM2-tree is, to be precise, a TM-tree with TM-BSF.
A gate array design for an 8-channel DEMUX based on the TM2-tree architecture has been 
presented. In this ASIC design, the following measures have been taken to implement the 
TM2-tree architecture efficiently:
1. An efficient DA architecture has been used to generate real inner products in the TM-BSF’s 
IP, avoiding the use of multipliers as in an MAC architecture;
2. Optimized combinational logic has been used to replace the ROM in the DA-IP for further 
complexity reduction;
3. A novel two-path buffering scheme has been proposed to reduce input buffer complexity, 
which, compared to the conventional linear FIFO scheme, can save up to 25% of the memory; 
and
4. Distributed control, i.e., generating control signals locally, has been adopted to reduce 
global signal broadcast, hence easing the signal routing problem.
The design shows that with available radiation-hard VLSI technology (e.g., GPS’s SOS 
Sea-of-Gates, a commercial semi-custom product), an 8-channel DEMUX, or even a 16-channel 
DEMUX if a moderate word length (say, 6 bits) is allowed, can be comfortably put into a 
single chip using the TM2-tree architecture. The ASIC design of an 8-channel DEMUX shows that 
the hardware complexity of the TM2-tree is dominated by memory. Further optimization of 
arithmetic operations, though needed, will not bring substantial complexity reduction to this 
architecture. One possible way to further reduce the complexity is to use RAMs to replace the 
D-type flip-flop shift registers used in our ASIC design. This approach needs further 
investigation.
The TM2-tree architecture has been compared with TM-trees with different BSF structures in 
various aspects. The most important advantage of the TM2-tree is its extremely low
complexity as compared to the others. In fact its complexity is substantially lower than those 
of other architectures except that of the TM-tree with the optimal BSF, which it is very close 
to. The power consumption and throughput performance of the TM2-tree architecture, on the 
other hand, are not as good as those of TM-trees with ANT’s or BAe’s BSFs. This is due to 
the fact that IPs in the former operate at a sampling frequency twice as high as that in the 
latter.
The comparison shows that the TM-tree with BAe’s BSF has very low complexity and power 
consumption (close to those of the TM-tree with optimal BSF) and also has good 
performance on throughput if the number of channels is not too large.
Chapter 8
Conclusion and Future Work
In order for the MCDD function to be implemented with affordable complexity and power in a space environment, efficient mapping from the function model to VLSI architecture 
has been investigated. It involves a three-step optimization:
* algorithm design and optimization,
* implementation architecture mapping, and,
* VLSI architecture mapping.
We have proposed the MSFG approach for the optimization of the first step. The approach is 
based on the multirate signal flow graph representation and transforms. Reducing 
computational complexity requires minimizing not only the number of computations but also 
the computation rate in a multirate system. The main theme of optimizing the computational 
architecture is therefore to share computations and move them to a lower sampling frequency 
as much as possible. With a direct and clear link to hardware architecture, the proposed MSFG 
approach provides a unique and systematic approach to multirate system optimization without 
resorting to tedious and ad hoc mathematical manipulations. Consequently, computationally 
efficient OBP frequency DEMUX structures can be derived via the MSFG approach. Many identities 
and transforms summarized in this chapter, in particular those associated with commutators 
and the MPDTs, are not seen in the literature. They have been found extremely useful and handy 
for deriving optimal MSFG structures.
With its sampling node functions, MSFG has the potential to describe some digital 
network functions, such as sampling and switching, which cannot be expressed through other 
representations. Hence it has the potential of making a direct mapping from DSP 
algorithms to digital networks.
In the second step, the mapping from a computationally efficient DSP architecture to an 
implementation architecture can be either a one-to-one or a many-to-one mapping of 
operations (the latter via multiplexed use of components).
The third step is to map the implementation architecture into a VLSI architecture which 
requires low complexity and power consumption. This can be achieved by modeling the 
space, power, and processing speed complexities of VLSI in multirate environment with a 
simple complexity model upon which optimization techniques are applied to search for 
acceptable solutions. The mapping of DEMUX structures into VLSI can be assessed in terms 
of the complexity, power, and throughput of resultant VLSI architectures. Given system 
parameters and the DEMUX structure, the searching for the optimal VLSI architecture 
becomes the determination of the configuration matrix A. Trade-offs between complexity, 
power consumption, and throughput can be made by imposing different constraints on them.
Since the proposed estimation method for VLSI complexity and power consumption is 
technology-independent, this approach is useful for the comparative study of alternative VLSI 
architectures.
A gate array ASIC design for an 8-channel tree DEMUX has been presented in this thesis. 
We have proposed a novel low-complexity binary tree architecture, the TM2-tree, for VLSI 
implementation. In a TM2-tree, not only is the IP time-shared amongst stage channels, but the 
IP itself is also realized by time-sharing a common complex arithmetic processor for both the 
lowpass and the highpass filtering processes at a lower hierarchy level. In the ASIC design, 
the following measures have been taken for efficient implementation of the TM2-tree:
* An efficient DA architecture has been used to generate real inner products in the TM-BSF’s 
IP, avoiding the use of multipliers;
* Optimized combinational logic has been used to replace the ROM in the DA-IP for further 
complexity reduction;
* A novel two-path buffering scheme has been used to reduce input buffer complexity, 
which, compared to the conventional linear FIFO scheme, saves up to 25% of the memory; and
* Distributed control logic has been adopted to reduce global signal broadcast, hence easing 
the signal routing problem.
The design shows that with available radiation-hard VLSI technology, an 8-channel DEMUX, 
or even a 16-channel DEMUX if a moderate word length (e.g., 6 bits) is allowed, could be 
comfortably put into a single chip using the TM2-tree architecture.
In comparison with other implementation schemes, the TM2-tree has very low complexity. Its 
complexity is substantially lower than those of most existing binary tree architectures except 
that of the TM-tree with the optimal BSF. The power consumption and throughput
performances of the TM2-tree architecture, on the other hand, are not as good as those of TM- 
trees with ANT’s or BAe’s BSFs. This is due to the fact that IPs in the former operate at a 
sampling frequency twice as high as that in the latter.
DA is a bit-serial processing approach which, instead of computing a single multiplication or 
addition, computes composite arithmetic operations (e.g., an inner product) in a bit-serial 
manner. When the dimension of the vector is not large, the DA architecture for the inner 
product can be highly efficient with very low complexity, making it suitable for VLSI 
implementations. As the memory size increases exponentially with the vector size, the DA 
architecture is limited to small vector sizes in order to maintain the computational 
efficiency. For large vector sizes (as for long FIRs) we have proposed proper 
segmentation of the vector so that the processing can be distributed over several smaller DAs 
in parallel, each of which utilizes an efficient DA structure with a small memory.
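The DA principle and the segmentation idea can be sketched in a few lines of Python. This is a textbook-style software model under simplifying assumptions (integer coefficients, two's complement inputs that fit in `word_bits`), not the gate-level DA-IP of the thesis; the function names are hypothetical. The table has one entry per bit pattern across the vector, which is why its size grows as 2^n with the vector length n.

```python
def build_lut(coeffs):
    """Table of partial sums: entry `addr` adds the coefficients whose bit is set."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_inner_product(lut, samples, word_bits):
    """Bit-serial inner product: one table look-up and shift-add per input bit."""
    mask = (1 << word_bits) - 1
    words = [x & mask for x in samples]  # two's complement bit patterns
    acc = 0
    for j in range(word_bits):
        # Gather bit j of every input word into a table address.
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> j) & 1) << k
        term = lut[addr] << j
        # The most significant bit carries negative weight in two's complement.
        acc += -term if j == word_bits - 1 else term
    return acc


def segmented_da(coeffs, samples, word_bits, seg):
    """Split a long vector into segments so that each table has only 2**seg entries."""
    total = 0
    for s in range(0, len(coeffs), seg):
        lut = build_lut(coeffs[s:s + seg])
        total += da_inner_product(lut, samples[s:s + seg], word_bits)
    return total
```

For a 7-element vector, a single table would need 128 entries, while segments of 4 and 3 elements need 16 + 8 entries at the cost of one extra addition.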
Systolic stream DFT structures have been proposed for low complexity, high speed, and 
flexible DFT computation. These SS-DFT structures are particularly useful for flexible 
(reconfigurable) DEMUX architectures such as the fast convolution structures. We have 
found that the relationship between the stream DFT and the parallel DFT can be described by 
Figure 6.7(a) and that both Kung’s and Chang’s SS-DFT structures can be derived from this 
relationship via MSFG transform. Unlike most FFT algorithms, which are very restrictive on 
the transform size, the transform size N of the SS-DFT can be arbitrary. The complexity of the 
SS-DFT is less than that of the conventional radix-2 FFT for N larger than 32. The 
disadvantage of the SS-DFT is the large roundoff error for large N.
We have derived a novel pure systolic FIR architecture that makes use of the symmetry 
property of FIR via the signal flow graph approach. Without increasing the latency, the 
proposed architecture has significantly improved the processing speed compared to 
semi-systolic architectures due to the complete elimination of global signal communications. 
As a special case of the pure systolic FIR, a pure systolic halfband FIR architecture has 
been proposed; it appears optimal in terms of complexity, latency and throughput.
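The arithmetic saving that the symmetry property offers can be sketched as follows. The Python sketch shows only the pre-addition of symmetric taps, which roughly halves the number of multiplications; it does not model the systolic array itself, and the function names are hypothetical.

```python
def direct_fir_output(h, window):
    """One output sample of a transversal FIR: len(h) multiplications."""
    return sum(c * x for c, x in zip(h, window))


def folded_fir_output(h, window):
    """Same output for a symmetric h, using about len(h)/2 multiplications.

    Assumes h[k] == h[len(h) - 1 - k] (linear phase); `window` holds the
    len(h) most recent samples.
    """
    n = len(h)
    half = n // 2
    # Add the two samples that share a coefficient before multiplying.
    acc = sum(h[k] * (window[k] + window[n - 1 - k]) for k in range(half))
    if n % 2:
        acc += h[half] * window[half]  # centre tap has no partner
    return acc
```

For a 7-tap halfband filter such as `[-1, 0, 9, 16, 9, 0, -1]`, the folded form needs four multiplications, two of which involve a zero or centre coefficient, which is the kind of structure a halfband architecture can exploit further.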
For a real FDM signal, it is generally advantageous to have the signal in even stacking 
positions so that the resultant uniform DEMUX will have low complexity by using an ordinary 
DFT (FFT) and real polyphase filters, instead of the more complicated GDFT and complex 
sub-filters which would otherwise be needed. Trivial one-sided frequency translations can be 
used for odd-even stacking conversion if a sampled FDM signal is in odd stacking.
From the survey of multicarrier demultiplexing approaches we have concluded that
a) for uniform and fixed traffic, computationally efficient block processing methods such as 
polyphase FFT and fast convolution approaches are preferred;
b) tree type multistage DEMUX approaches are useful when some limited flexibility is 
required or when hardware implementation is considered due to their modular and simple 
structures; and
c) per-channel and analysis-synthesis approaches are considered only when the flexibility is 
of primary concern and the number of channels is small.
We have also reviewed and investigated various flexible DEMUX architectures and have the 
following conclusions:
a) the per channel approach, though the most flexible, is only feasible when the number of 
channels is small due to its poor computational efficiency,
b) to accommodate less restricted traffic change such as allowing rational bandwidth ratio, 
single-stage flexible architectures such as reconfigurable/programmable frequency 
filtering or analysis-synthesis filter bank structure can be used, and
c) for traffic changes restricted to integer bandwidth ratio, two-stage approaches such as per 
channel-polyphase DFT and reconfigurable tree architectures can be sought.
The first-order bandpass sampling theorem shows that sampling a bandpass signal can be a 
pitfall without careful consideration of the relations between the sampling frequency, the 
signal centre frequency, and the bandwidth. The bandpass sampling theorem has been modified to 
accommodate sampling frequency instability and carrier frequency variations. We 
have shown that, for maximum tolerance to carrier frequency uncertainty, the carrier 
frequency should lie on the grid of ±1/4 of the sampling frequency.
Although not necessary, it is convenient to have FDM signals with homogeneous channel 
stacking such that trivial π/2 frequency shifts can fulfill the odd-even stacking conversion.
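Why a π/2 frequency shift is called trivial can be seen from a short Python sketch: multiplying sample n by e^{jπn/2} = j^n only swaps and negates real and imaginary parts, so no multiplier is needed. The function names are hypothetical, and the sketch shows the shift alone, not the complete stacking conversion.

```python
import cmath


def shift_quarter_rate(samples):
    """Shift the spectrum by a quarter of the sampling rate without multiplications."""
    out = []
    for n, z in enumerate(samples):
        re, im = z.real, z.imag
        phase = n % 4
        if phase == 0:
            out.append(complex(re, im))    # times 1
        elif phase == 1:
            out.append(complex(-im, re))   # times j
        elif phase == 2:
            out.append(complex(-re, -im))  # times -1
        else:
            out.append(complex(im, -re))   # times -j
    return out


def shift_reference(samples):
    """Same shift computed with explicit complex exponentials, for comparison."""
    return [z * cmath.exp(1j * cmath.pi / 2 * n) for n, z in enumerate(samples)]
```

The two functions agree to within floating-point rounding; a shift in the opposite direction uses the same four cases in reverse order.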
The major and original results of this study are summarized as follows:
1) the multirate extension of signal flow graph: MSFG and its transforms, which forms a 
systematic optimization approach for multirate networks;
2) a new VLSI complexity modeling method and the associated optimization technique for 
multirate DSP systems;
3) the novel TM2-tree architecture for low-complexity VLSI implementation of complex 
binary tree structures;
4) a new pipeline PMDFT filter bank that has the advantages of a reduced DFT size, better 
suitability for VLSI implementation, and faster processing as compared to the conventional 
PMDFT structure;
5) a new systolic parallel/serial multiplier which has the advantages of low round-off error 
and improved speed performance for full precision multiplication;
6) a novel pure systolic FIR which makes use of the symmetry property of linear phase FIR 
and, as a special case, a pure systolic halfband FIR;
7) an optimal complex BSF used in tree structure, which has the minimum computation rate, 
minimum number of multiplications, and less memory requirement;
8) the concept of homogeneous channel stacking and odd-even or even-odd stacking 
conversion using trivial frequency shifts;
9) a novel RICF complex FIR structure that reduces the number of multiplications by 
approximately a factor of two as compared to the direct transversal complex FIR;
10) the practical minimum sampling rate determination for uniform bandpass sampling and 
the robust bandpass sampling principle.
The achievements listed above by no means complete the study. There are
more issues that need to be looked into and more questions to be answered. In particular, the
following aspects are considered worth investigating:
I. Flexible DEMUX. The motivation for this study lies in the inflexibility of the OBP payload 
in handling other types of traffic and signal format. The work will include studies on:
• Reconfigurable tree nodes: Though the polyphase DFT is inflexible, an M-channel 
DEMUX could be reconfigured to an M/2- or M/3-channel polyphase DFT 
DEMUX without changing filter coefficients by exploiting the advantage of filter 
decimation. For example, a 6-channel polyphase DFT can be reconfigured into a 
2- or a 3-channel DEMUX if a prime factor algorithm (PFA, e.g., the Good 
algorithm) is used for the DFT, which allows simple reconfiguration of the 2-pt 
and 3-pt FFTs. Further study on the performance of the decimated polyphase 
DEMUX is necessary.
• Reconfigurable multistage DEMUX structure: A tree structure with reconfigurable 
nodes will allow various traffic modes. For example, if all the decimation factors 
of the nodes are the same or have integer multiple relations, the allowed frequency 
plans will be restricted to those in which the channel bandwidths also have integer 
multiple relations; if some of the node decimation factors are co-prime (as in the 
case of the reconfigurable 6-channel DEMUX), the flexibility can be considerably 
improved, allowing frequency plans with rational bandwidth ratios; the flexibility 
can be further improved if the tree structure itself is reconfigurable by allowing 
free reconnections between nodes.
• Programmable fast convolution filter banks: This flexible architecture allows 
virtually arbitrary frequency plans including rational, non-uniform channel 
spacing cases. The basic assumption is that the frequency resolution (determined
by the minimum channel bandwidth) is fixed. Hence we will have a fixed 
(normally very large) FFT shared by all the channels. To adapt to traffic changes, 
a set of channel filter coefficients (complex) in frequency domain stored in 
memory are down-loaded before the change. The difficulty of the flexibility of 
this architecture lies mainly in the variable size IDFTs. It could be circumvented 
by using Kung’s SS-IDFT.
• Pipeline PMDFT architecture: Though the structure has the same computational 
complexity as that of the conventional PMDFT, it is more suitable for VLSI 
implementation than the latter. If the pipeline stages are programmable which is 
relatively easy for small PMDFT, the structure would be able to accommodate 
traffic changes with varying number of uniform channels by simply cascading 
different numbers of the programmable pipeline stages.
II. Automatic generation of optimized multirate networks. With the proposed MSFG 
approach, it is possible to build an expert system that can automatically or interactively 
generate optimized (simplified) multirate networks under given criteria (e.g., minimum 
computation rate, multiplications, memory, latency, etc.). To this end, AI (artificial 
intelligence) techniques could be used. Identities and transforms of MSFG would be 
interpreted as ‘rules’. Knowledge and flow graph simplification experiences would be 
incorporated into the system. For given system function and optimization criteria, the 
expert system would generate several optimized multirate networks providing more 
alternatives. If incorporated with multirate VLSI modeling and optimization method 
introduced in chapter 5, it could even be used for automatic mapping from DSP functions 
to VLSI architectures [Sch93].
III. Formulation of time-multiplexed use (time-sharing) of components. In chapter 5, the 
configuration of basic components in a multirate VLSI model is restricted to bit-serial and 
bit-parallel architectures. To be more practical, time-multiplexed use of the components 
should be allowed and taken into account. This will necessitate the study of conditions 
under which the time-multiplexed use of a component is allowed and will require a new 
expression for the basic component configuration.
IV. Continuation of the ASIC design for an 8-channel tree DEMUX. The ASIC design 
presented in chapter 7 has not been completed: the layout design has not been carried out 
because no layout design package was available. What we have shown in chapter 7 is 
the feasibility of a single-chip design of a DEMUX of up to 16 channels based on the TM2-tree 
architecture. A complete ASIC design of the tree DEMUX can be carried out given 
sufficient funding and interest from industry.
APPENDICES
Appendix A Practical minimum sampling frequency for first- 
order bandpass sampling
Although ideally, the minimum sampling rate of f s = 2B (which is on the tips of the 
unshaded wedges in Figure 2.4) can be achieved if the bandpass signal is positioned at integer 
bands, it can, however, hardly be realized due to the infinite precision requirement for the 
sampling frequency and also due to the uncertainty of the passband centre frequency. In fact, 
any imperfection of sampling frequency source or variation of the centre frequency would 
pull the operating point away from the wedge tips causing aliasing.
To derive the expression for minimum sampling rates taking into account the uncertainty of 
both the sampling and carrier frequencies, let us consider the $n_w$-th wedge of the allowed 
region in Figure 2.4. The wedge is confined by the following two intersecting lines,
$$f_s = \frac{2f_c + B}{n_w} \qquad (A.1)$$
$$f_s = \frac{2f_c - B}{n_w - 1} \qquad (A.2)$$
where $f_c > (n_w - 0.5)B$. The uncertainty of sampling and carrier frequencies implies that,
instead of requiring the single operating point $(f_c, f_s)$ to be within the wedge, the neighboring 
area shown as the shaded rectangle in Figure A.1 should be within the wedge for aliasing-free 
sampling under the given relative precision.
The minimum sampling frequency for a bandpass signal centred at $f_c$ can be determined by 
placing the lower right corner of the shaded rectangle on the lower edge of the wedge as 
shown in the figure. To ensure that the rectangle is within the wedge the upper left corner 
should be below, or at most on, the upper edge of the wedge. Hence the minimum sampling 
frequency in the $n_w$-th wedge must satisfy
$$f_s^{(\min)}(n_w) - \Delta_s = \frac{2(f_c + \Delta_c) + B}{n_w} \qquad (A.3)$$
$$f_s^{(\min)}(n_w) + \Delta_s \le \frac{2(f_c - \Delta_c) - B}{n_w - 1} \qquad (A.4)$$
where $\Delta_s$ and $\Delta_c$ are respectively the upper bounds of the deviations of the sampling 
frequency and carrier frequency from their expected values. Here $f_s^{(\min)}(n_w)$ denotes the minimum
sampling rate within the $n_w$-th wedge, which may not be the global minimum (i.e., the
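minimum for given $f_c$). $\Delta_s$ is most often expressed as the relative precision of $f_s$, which is 
defined by
$$\rho_s = \frac{\Delta_s}{f_s} \qquad (A.5)$$
[Diagram: the $n_w$-th wedge with the shaded uncertainty rectangle; axis marks $f_c - \Delta_c$, $f_c$, $f_c + \Delta_c$ (horizontal) and $f_s^{(\min)}$, $2B$ (vertical)]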
Figure A. 1 Minimum sampling rates within the nw-th wedge 
In practice, the requirement for $\Delta_c$ is presented in terms of
a) relative precision for $f_c$. This is appropriate in cases where the carrier frequency is 
affected by Doppler effects, such as in satellite communications, or
b) guard bands on both sides of the centre frequency with respect to the signal bandwidth B. 
Now let us discuss the minimum sampling rate problem for the two cases separately.
Case 1: Carrier frequency uncertainty expressed in relative precision.
In this case (such as where the carrier frequency varies due to Doppler effects) the relative 
precision for $f_c$ can be expressed by
$$\rho_c = \frac{\Delta_c}{f_c} \qquad (A.6)$$
From equations (A.3) to (A.6), we have,
$$f_s^{(\min)}(n_w) = \frac{2(1+\rho_c) f_c + B}{n_w (1-\rho_s)} \qquad (A.7)$$
$$f_s^{(\min)}(n_w)\,(1+\rho_s) \le \frac{2(1-\rho_c) f_c - B}{n_w - 1} \qquad (A.8)$$
Substituting Eq. (A.7) into Eq. (A.8) gives
$$n_w \le \frac{(1+\rho_s)\left(2(1+\rho_c) f_c + B\right)}{4(\rho_c+\rho_s) f_c + 2B} \qquad (A.9)$$
Hence the minimum sampling frequency for the given relative precision of sampling and 
carrier frequencies is achieved when $n_w$ reaches the maximum in the above inequality, that is,
$$f_s^{(\min)} = \frac{2(1+\rho_c) f_c + B}{n_w^{(\max)} (1-\rho_s)} \qquad (A.10)$$
where
$$n_w^{(\max)} = \left\lfloor \frac{(1+\rho_s)\left(2(1+\rho_c) f_c + B\right)}{4(\rho_c+\rho_s) f_c + 2B} \right\rfloor \qquad (A.11)$$
Clearly, the theoretical minimum bandpass sampling rates given by Equations (2) and (3) are 
special cases of Equations (A.10) and (A.11) where $\rho_c = \rho_s = 0$.
Case 2: Carrier frequency uncertainty expressed in guard-bands
If $\Delta_c$ is considered as the bandwidth of the guard-bands on each side of the signal sidebands 
as depicted in Figure 1, which is independent of the carrier frequency, it can be expressed in 
terms of the single-sided bandwidth B of the signal, namely,
$$\Delta_c = \upsilon B, \quad \upsilon \in R^+ \qquad (A.12)$$
Again, from Equations (A.3) to (A.6), we have,
$$f_s^{(\min)}(n_w) = \frac{2 f_c + (2\upsilon + 1) B}{n_w (1-\rho_s)} \qquad (A.13)$$
$$f_s^{(\min)}(n_w)\,(1+\rho_s) \le \frac{2 f_c - (2\upsilon + 1) B}{n_w - 1} \qquad (A.14)$$
The minimum sampling rate under the given sampling rate precision and guard band 
requirement is thus given by
$$f_s^{(\min)} = \frac{2 f_c + (2\upsilon + 1) B}{n_w^{(\max)} (1-\rho_s)} \qquad (A.15)$$
where
$$n_w^{(\max)} = \left\lfloor \frac{(1+\rho_s)\left(2 f_c + (2\upsilon + 1) B\right)}{4\rho_s f_c + 2(2\upsilon + 1) B} \right\rfloor \qquad (A.16)$$
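The two results, Eqs. (A.10) and (A.11) for relative carrier precision and Eqs. (A.15) and (A.16) for guard bands, can be evaluated numerically with the following Python sketch. The function and parameter names are hypothetical, the wedge index is taken as the floor of the bound on $n_w$, and all frequencies are assumed to be in the same unit.

```python
import math


def min_sampling_rate_relative(fc, bandwidth, rho_c, rho_s):
    """Case 1: carrier uncertainty given as relative precision rho_c."""
    upper = 2.0 * (1.0 + rho_c) * fc + bandwidth
    bound = (1.0 + rho_s) * upper / (4.0 * (rho_c + rho_s) * fc + 2.0 * bandwidth)
    n_max = math.floor(bound)  # largest admissible wedge index
    if n_max < 1:
        raise ValueError("no admissible wedge for these tolerances")
    return upper / (n_max * (1.0 - rho_s)), n_max


def min_sampling_rate_guard(fc, bandwidth, nu, rho_s):
    """Case 2: carrier uncertainty given as guard bands of width nu * bandwidth."""
    span = (2.0 * nu + 1.0) * bandwidth
    upper = 2.0 * fc + span
    bound = (1.0 + rho_s) * upper / (4.0 * rho_s * fc + 2.0 * span)
    n_max = math.floor(bound)
    if n_max < 1:
        raise ValueError("no admissible wedge for these tolerances")
    return upper / (n_max * (1.0 - rho_s)), n_max
```

With zero tolerances and a carrier at 2.5 times the bandwidth, both functions return the ideal rate of twice the bandwidth in the third wedge; adding a guard band of a quarter of the bandwidth moves the operating point to the second wedge and raises the rate to 3.25 times the bandwidth.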
Appendix B Homogenous stacking of uniformly sampled FDM 
signals
The uniformly sampled real FDM signal is expressed as (Eq. (2.44))
$$x(n) = \sum_{k=0}^{K-1} x_k(n)\, e^{\,j2\pi n \frac{(2n_w-1)(2k-K+1+2p)}{8p}} \tag{B.1}$$
where p is the ratio of the FDM centre frequency to the channel spacing, i.e., $p = f_1/W$. Since the signal is real, if the channels are centred on the grid of
$$grid(k) = \left\{ 2\pi\,\frac{(2n_w-1)(2k-K+1+2p)}{8p};\ 0 \le k < K \right\} \tag{B.2}$$
in $[0, \pi]$ (or $[\pi, 2\pi]$), then they must be centred on the grid of
$$grid(k) = \left\{ 2\pi - 2\pi\,\frac{(2n_w-1)(2k-K+1+2p)}{8p};\ 0 \le k < K \right\} \tag{B.3}$$
in $[\pi, 2\pi]$ (or $[0, \pi]$). In order for the channel spectra to have consistent stacking in the frequency range $[0, 2\pi]$, p should be chosen such that the following condition is met,
$$\Big(\big(-(2n_w-1)(K-1) + 2(2n_w-1)p\big)_{8p}\Big)_{2(2n_w-1)} = \Big(\big((2n_w-1)(K-1) - 2(2n_w-1)p\big)_{8p}\Big)_{2(2n_w-1)} \tag{B.4}$$
Since the largest wedge order is given by $\lfloor f_1/(KW) + 0.5 \rfloor = \lfloor p/K + 0.5 \rfloor$ (Eq. (2.22)), the constraint on p is
$$n_w \le \lfloor p/K + 0.5 \rfloor, \quad \text{or,} \quad p \ge (n_w - 0.5)K \tag{B.5}$$
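The channel grid and the constraint on p can be checked with exact rational arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, frequencies are expressed in cycles per sample (so 1 corresponds to 2π), and the grid is reduced modulo one sampling period.

```python
from fractions import Fraction


def channel_grid(K, n_w, p):
    """Normalised channel centre frequencies (cycles/sample), reduced to [0, 1)."""
    p = Fraction(p)
    return [(Fraction(2 * n_w - 1) * (2 * k - K + 1 + 2 * p) / (8 * p)) % 1
            for k in range(K)]


def image_grid(K, n_w, p):
    """Mirror-image centre frequencies that a real signal must also occupy."""
    return [(1 - f) % 1 for f in channel_grid(K, n_w, p)]


def satisfies_wedge_constraint(K, n_w, p):
    """True when p >= (n_w - 0.5) * K."""
    return Fraction(p) >= (Fraction(n_w) - Fraction(1, 2)) * K
```

For K = 4, n_w = 1 and p = 2 the channels fall at 1/16, 3/16, 5/16 and 7/16 of the sampling rate, with their images at 15/16, 13/16, 11/16 and 9/16, so the two sets interleave on one uniform grid.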
Appendix C Derivation of polyphase-matrix-DFT filter banks
For a K-channel frequency multiplex x(n) with uniform channel spacing $W = JW_0$ ($J \in \mathbb{Z}$, $W_0$ the frequency resolution), the k-th demultiplexed channel output signal in Figure 3.4 can be expressed by
$$y_k(n) = \sum_{i=-\infty}^{\infty} x(i)\, e^{-j2\pi F_k (i+i_0)}\, h(nM - iL) \tag{C.1}$$
where $F_k = f_k / f_{in}$ is the normalized centre frequency (with respect to the input sampling frequency $f_{in}$) of the k-th channel. We assume that the frequency plan is such that the centre frequencies of individual channels and the input and output sampling frequencies are all located on the $W_0$ grid, as is illustrated in Figure C.1. That is, $f_k = (k_0 + kJ)W_0$, $f_{in} = MW_0$, and $f_{out} = LW_0$, where the integer $k_0$ gives the centre frequency of the first channel in units of $W_0$ and M and L are positive integers.
Figure C.1 K-channel frequency multiplex
Hence Eq.(C.1) can be rewritten as
$$y_k(n) = \sum_{i=-\infty}^{\infty} x(i)\, e^{-j2\pi (k_0+kJ)(i+i_0)/M}\, h(nM - iL) \tag{C.2}$$
For J and M co-prime (i.e., gcd(J, M)=1), a re-numbering notation can be introduced which results in
$$y_{k'}(n) = \sum_{i=-\infty}^{\infty} x(i)\, e^{-j2\pi k'(i+i_0)/M}\, h(nM - iL) \tag{C.3}$$
where $k' = (k_0 + kJ)_M$.
Substituting i=mM-q and n=lL+p in Eq.(C.3) (this particular way of decomposition leads to PMDFT structures with minimum delays), where $q \in [0, M-1]$, $p \in [0, L-1]$, and m and l are integers, we have
$$y_{k'}(lL+p) = \sum_{q=0}^{M-1} \sum_{m=-\infty}^{\infty} x(mM-q)\, h\big((lL+p)M - (mM-q)L\big)\, e^{-j2\pi k'(mM-q+i_0)/M}$$
$$= \sum_{q=0}^{M-1} \sum_{m=-\infty}^{\infty} x_q(m)\, h_{p,q}(l-m)\, e^{\,j2\pi k'(q-i_0)/M} = \sum_{q=0}^{M-1} v_{p,q}(l)\, e^{\,j2\pi k'(q-i_0)/M} \tag{C.4}$$
where $x_q(m) = x(mM-q)$, q=0,1,...,M-1, is the q-th branch signal of the M-commutated input signal (referring to Figure 3.5); $h_{p,q}(l) = h(lLM + pM + qL)$ is one of the decomposed L×M sub-filters of the prototype channel filter h(n); and the intermediate signals $v_{p,q}(l) = \sum_{m} x_q(m)\, h_{p,q}(l-m)$, p=0,1,...,L-1, q=0,1,...,M-1, are the outputs from the polyphase matrix sub-filters which are shared by all channels (due to the fact that the $v_{p,q}(l)$'s are independent of the channel index k). We assume the prototype lowpass filter h(n) to be of FIR type. The filter length is constrained to be a multiple of LM in order to avoid (periodically) time-varying sub-filters.
Eq.(C.4) gives the polyphase decomposed channel signal of the k-th channel. The ultimate channel signal is obtained by interpolating and combining (via a PSC) these decomposed signals. An M-point DFT is necessary for each of the decomposed channel signals. Hence L such M-point DFTs would be required before P/S conversion for the whole DEMUX filter bank if Eq.(C.4) is directly implemented.
Fortunately, the order of P/S conversion and DFT can be interchanged, allowing a shared M-point DFT for all the phases of channel signals. This can be seen by taking the z-transform of Eq.(C.4):
$$Y_{k'}(z) = \sum_{l=-\infty}^{\infty} \sum_{p=0}^{L-1} y_{k'}(lL+p)\, z^{-(lL+p)} = \sum_{p=0}^{L-1} \sum_{l=-\infty}^{\infty} \sum_{q=0}^{M-1} v_{p,q}(l)\, e^{\,j2\pi k'(q-i_0)/M}\, z^{-(lL+p)}$$
Change the order of the second and the third summations giving
$$Y_{k'}(z) = \sum_{p=0}^{L-1} \sum_{q=0}^{M-1} z^{-p}\, V_{p,q}(z^L)\, e^{\,j2\pi k'(q-i_0)/M}$$
where $V_{p,q}(z)$ denotes the z-transform of $v_{p,q}(l)$. Again, change the order of the two summations leading to
$$Y_{k'}(z) = e^{-j2\pi k' i_0/M} \sum_{q=0}^{M-1} \left( \sum_{p=0}^{L-1} z^{-p}\, V_{p,q}(z^L) \right) e^{\,j2\pi k' q/M} = e^{-j2\pi k' i_0/M} \sum_{q=0}^{M-1} U_q(z)\, e^{\,j2\pi k' q/M} \tag{C.5}$$
where $U_q(z) = \sum_{p=0}^{L-1} z^{-p}\, V_{p,q}(z^L)$ is the interpolated (P/S converted) version of the intermediate signals $V_{p,q}(z)$, p=0,1,...,L-1. Thus only a single M-point DFT is required by moving the P/S conversions before DFTs. The resulting PMDFT filter bank structure is shown in Figure 3.5.
It should be noted that Eqs.(C.4) and (C.5) have the same computational efficiency even 
though the former requires L M-point DFT processors, instead of a single one for the latter. 
This is because the sampling rate for the DFT processor of the latter is L times higher than 
that of the former, giving rise to the same computation rate for both structures. However, 
Eq.(C.5) does have a significant improvement over Eq.(C.4) in hardware complexity, making 
it more appropriate for VLSI implementations.
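The equivalence between the direct channel filter and its polyphase-matrix decomposition can be cross-checked numerically. The Python sketch below is an illustration under stated assumptions, not the thesis implementation: the function names are hypothetical, no additional phase offset is applied in the mixing term, the prototype h is a causal FIR filter, and the input is taken as zero outside the supplied samples.

```python
import cmath


def direct_output(x, h, k_prime, L, M, n):
    """Direct form: sum_i x(i) exp(-j 2 pi k' i / M) h(n*M - i*L)."""
    total = 0j
    for i, xi in enumerate(x):
        tap = n * M - i * L
        if 0 <= tap < len(h):
            total += xi * cmath.exp(-2j * cmath.pi * k_prime * i / M) * h[tap]
    return total


def polyphase_output(x, h, k_prime, L, M, n):
    """Same output sample via n = l*L + p and i = m*M - q."""
    l, p = divmod(n, L)
    total = 0j
    for q in range(M):
        # Sub-filter output v_{p,q}(l); it does not depend on the channel index.
        v = 0j
        for m in range((len(x) + q) // M + 1):
            i = m * M - q
            tap = (l - m) * L * M + p * M + q * L
            if 0 <= i < len(x) and 0 <= tap < len(h):
                v += x[i] * h[tap]
        # One DFT-type twiddle per commutator branch q.
        total += v * cmath.exp(2j * cmath.pi * k_prime * q / M)
    return total
```

Both functions should agree to within rounding error for any integer channel index, which reflects the fact that the sub-filter outputs can be computed once and shared by all channels.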
Appendix D Simplification of DFT-Frequency Shift-IDFT 
network
The DFT-frequency shift-IDFT networks shown in Figure 3.11b can be concisely described 
by the following matrix operation:
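$$[S_{ij}] = \begin{bmatrix} W_{LK'}^{0\cdot 0} & W_{LK'}^{0\cdot 1} & \cdots & W_{LK'}^{0(LK'-1)} \\ \vdots & \vdots & & \vdots \\ W_{LK'}^{(LK'-1)0} & W_{LK'}^{(LK'-1)1} & \cdots & W_{LK'}^{(LK'-1)(LK'-1)} \end{bmatrix} \begin{bmatrix} W_{LK'}^{0} & 0 & \cdots & 0 \\ 0 & W_{LK'}^{(LMr+Mp)} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & W_{LK'}^{(LK'-1)(LMr+Mp)} \end{bmatrix} \frac{1}{LK'} \begin{bmatrix} W_{LK'}^{-0\cdot 0} & W_{LK'}^{-0\cdot 1} & \cdots & W_{LK'}^{-0(LK'-1)} \\ \vdots & \vdots & & \vdots \\ W_{LK'}^{-(LK'-1)0} & W_{LK'}^{-(LK'-1)1} & \cdots & W_{LK'}^{-(LK'-1)(LK'-1)} \end{bmatrix}$$
$$= \frac{1}{LK'} \left[ W_{LK'}^{ij} \right] \cdot \mathrm{diag}\left[ W_{LK'}^{i(LMr+Mp)} \right] \cdot \left[ W_{LK'}^{-ij} \right] \tag{D.1}$$
Since $\mathrm{diag}\left[ W_{LK'}^{i(LMr+Mp)} \right] \cdot \left[ W_{LK'}^{-ij} \right] = \left[ W_{LK'}^{i(LMr+Mp-j)} \right]$, the above expression can be reduced to
$$[S_{ij}] = \frac{1}{LK'} \left[ W_{LK'}^{ij} \right] \cdot \left[ W_{LK'}^{i(LMr+Mp-j)} \right] \tag{D.2}$$
The matrix element $S_{ij}$ is therefore,
$$S_{ij} = \frac{1}{LK'} \sum_{t=0}^{LK'-1} W_{LK'}^{it}\, W_{LK'}^{t(LMr+Mp-j)} = \frac{1}{LK'} \sum_{t=0}^{LK'-1} W_{LK'}^{t(LMr+Mp+i-j)} = \begin{cases} 1 & \text{for } (LMr+Mp+i-j)_{LK'} = 0 \\ 0 & \text{otherwise} \end{cases} \tag{D.3}$$
This means that the resulting matrix $[S_{ij}]$ takes one of the following forms depending on the values of r and p (that is, the positions of these networks within the filter bank structure):
$$\begin{bmatrix} 1 & & & 0 \\ & 1 & & \\ & & \ddots & \\ 0 & & & 1 \end{bmatrix}, \quad \begin{bmatrix} 0 & 1 & & \\ & 0 & \ddots & \\ & & \ddots & 1 \\ 1 & & & 0 \end{bmatrix}, \quad \cdots, \quad \begin{bmatrix} 0 & & & 1 \\ 1 & 0 & & \\ & \ddots & \ddots & \\ & & 1 & 0 \end{bmatrix}$$
for $(LMr+Mp)_{LK'} = 0$, $(LMr+Mp)_{LK'} = 1$, ..., $(LMr+Mp)_{LK'} = LK'-1$ respectively, where $(\cdot)_{LK'}$ denotes the modulo-LK' operation.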
The networks described by the above matrices are simple switching networks in which any 
input is directly connected to one of the output ports, hence no arithmetic operations are 
actually involved.
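The reduction of the DFT, frequency shift and IDFT cascade to a pure switching network can be verified numerically. The Python sketch below is illustrative: the function names are hypothetical, N stands for the transform size LK', and `shift` stands for the integer LMr + Mp.

```python
import cmath


def dft_shift_idft(N, shift):
    """S = (1/N) [W^(ij)] diag[W^(i*shift)] [W^(-ij)] with W = exp(-j 2 pi / N)."""
    def W(e):
        return cmath.exp(-2j * cmath.pi * e / N)

    return [[sum(W(i * t) * W(t * shift) * W(-t * j) for t in range(N)) / N
             for j in range(N)]
            for i in range(N)]


def as_permutation(S, tol=1e-9):
    """Return the column index of the single 1 in each row, or raise."""
    perm = []
    for row in S:
        ones = [j for j, v in enumerate(row) if abs(v - 1.0) < tol]
        others = [v for j, v in enumerate(row) if j not in ones]
        if len(ones) != 1 or any(abs(v) > tol for v in others):
            raise ValueError("not a switching (permutation) matrix")
        perm.append(ones[0])
    return perm
```

For N = 6 and a shift of 8 the single unit entry of row i appears in column (i + 8) mod 6, so the network is a cyclic re-ordering of its ports and needs no multipliers or adders.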
Appendix E Upsampler-SPC and PSC-Downsampler identities
In filter bank design, more complicated problems such as commutations between an upsampler and an SPC and between a PSC and a downsampler may occur. In these cases the two cannot be swapped directly, because the delays between the up/down-samplers are non-integer multiples of the sampling factors. In what follows we shall find that under the co-prime condition, as in the simple downsampler-upsampler and upsampler-downsampler cases, the positions of the samplers and commutators can be exchanged; however, the ordering of the polyphase signals must be changed and time advances must be introduced accordingly, as shown in Figures 4.13(a) and 4.13(b), which are redrawn as conventional signal flow graphs in Figures E.1 and E.2 respectively.
Figure E.1 Upsampler-SPC identity
Figure E.2 PSC-downsampler identity
For the upsampler-SPC cascade (Figure E.1), the time advance introduced to the i-th branch of Figure E.1(b) is given by (Eq.(4.7))
$$l_i = \left\lfloor \frac{(M-1-i)\,l}{m} \right\rfloor \tag{E.1}$$
where l and m are the solution of (Eq.(4.8))
$$mL - lM = 1 \tag{E.2}$$
for co-prime L and M (gcd(L, M)=1). The polyphase signals in Figure E.1(b) are the same as those in Figure E.1(a), but with different ordering, i.e., $y'_i(m) = y_k(m)$, i, k = 0, 1, ..., M-1.
The new polyphase signal index k after the transform is determined by (Eq.(4.9))
$$M - 1 - k = \big((M-1-i)L\big)_M, \quad i = 0, 1, \ldots, M-1 \tag{E.3}$$
In the case of PSC-downsampler cascade as in Figure E.2, the time advance at the i-th branch after the transform is (Eq.(4.10))
$$m_i = \left\lfloor \frac{im}{l} \right\rfloor \tag{E.4}$$
where l and m are the solution of (Eq.(4.11))
$$mL - lM = -1 \tag{E.5}$$
for co-prime L and M. The new branch signal index k after the transform is determined by (Eq.(4.12))
$$k = (iM)_L, \quad i = 0, 1, \ldots, L-1 \tag{E.6}$$
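The time advances and branch re-ordering of both identities can be tabulated for given co-prime factors. The Python sketch below is an illustration only: the function names are hypothetical, it assumes L and M are both greater than 1, it picks the smallest positive solution of the linear equation in l and m, and it relies on the three-argument `pow` with a negative exponent (available from Python 3.8) for the modular inverse.

```python
from math import gcd


def upsampler_spc_mapping(L, M):
    """List (i, l_i, k) for the upsampler-SPC identity, using m*L - l*M = 1."""
    if L < 2 or M < 2 or gcd(L, M) != 1:
        raise ValueError("L and M must be co-prime and greater than 1")
    m = pow(L, -1, M)          # smallest positive m with m*L = 1 (mod M)
    l = (m * L - 1) // M       # so that m*L - l*M = 1
    rows = []
    for i in range(M):
        l_i = ((M - 1 - i) * l) // m                # time advance of branch i
        k = M - 1 - ((M - 1 - i) * L - l_i * M)     # original polyphase index
        rows.append((i, l_i, k))
    return rows


def psc_downsampler_mapping(L, M):
    """List (i, m_i, k) for the PSC-downsampler identity, using m*L - l*M = -1."""
    if L < 2 or M < 2 or gcd(L, M) != 1:
        raise ValueError("L and M must be co-prime and greater than 1")
    l = pow(M, -1, L)          # smallest positive l with l*M = 1 (mod L)
    m = (l * M - 1) // L       # so that m*L - l*M = -1
    rows = []
    for i in range(L):
        m_i = (i * m) // l                          # time advance of branch i
        k = i * M - m_i * L                         # original branch index
        rows.append((i, m_i, k))
    return rows
```

In both cases the k column should be a permutation of the branch indices, which is the uniqueness property established in the proofs; for the PSC-downsampler case k also equals (iM) modulo L.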
Proof of the upsampler-SPC identity:
Let us look at the polyphase signal $y'_i(m)$ of the i-th branch, which has a time advance of $z^{l_i}$, of the transformed structure in Figure E.1(b). The input-output relation is redrawn in Figure E.3(a). Swapping the upsampler and the downsampler and then applying the Noble identities [Cro83, Vai90] transforms the structure into that of Figure E.3(b). Suppose that $y'_i(m)$ is identical to one of the polyphase components, say the k-th polyphase signal $y_k(m)$ of Figure E.1(a); it follows that
$$M - 1 - k = (M-1-i)L - l_i M \tag{E.7}$$
Then the structure becomes the one shown in Figure E.3(c).
Figure E.3 The i-th polyphase signal branch of Figure E.1(b)
To prove the equivalence of the two structures in Figure E.1 we need to prove that a) for any i, $0 \le i < M$, the mapping of Eq.(E.7) also falls into the region, i.e., $0 \le k < M$, and that b) the mapping is unique (1-to-1), i.e., for any $i \ne i'$, we have $k \ne k'$.
According to the Euclidean algorithm in integer ring theory, for any integers L and M there exist integers l and m such that their greatest common divisor can be expressed by [Bla85]
$$\gcd(L, M) = mL - lM \tag{E.8}$$
In our case, L and M are relatively prime. Hence Eq.(E.2) holds.
Multiplying both sides of Eq.(E.2) by $(M-1-i)/m$, i = 0, 1, ..., M-1, gives
$$(M-1-i)L - \frac{(M-1-i)\,l}{m}\,M = \frac{M-1-i}{m} \tag{E.9}$$
The rational $(M-1-i)l/m$ can be expressed by an integer and a fractional number:
$$\frac{(M-1-i)\,l}{m} = \left\lfloor \frac{(M-1-i)\,l}{m} \right\rfloor + \frac{\big((M-1-i)\,l\big)_m}{m} \tag{E.10}$$
Substituting Eq.(E.10) into Eq.(E.9) results in
$$(M-1-i)L - \left\lfloor \frac{(M-1-i)\,l}{m} \right\rfloor M = \frac{\big((M-1-i)\,l\big)_m M + (M-1-i)}{m} \tag{E.11}$$
Because the left hand side of Eq.(E.11) is an integer, the quantity $\big((M-1-i)l\big)_m M + (M-1-i)$ must be an integer multiple of m, say $(M-1-k)$ times m, i.e.,
$$\big((M-1-i)\,l\big)_m M + (M-1-i) = m(M-1-k) \tag{E.12}$$
In the above equation, since $\big((M-1-i)l\big)_m \ge 0$ and $(M-1-i) \ge 0$, hence $(M-1-k) \ge 0$, or, $k < M$. Furthermore, because $\big((M-1-i)l\big)_m < m$ and $(M-1-i) < M$, it follows that $m(M-1-k) = \big((M-1-i)l\big)_m M + (M-1-i) < mM$, that is, $k \ge 0$. Hence $0 \le k < M$.
Substituting Eq.(E.12) into Eq.(E.11) gives
$$(M-1-i)L - \left\lfloor \frac{(M-1-i)\,l}{m} \right\rfloor M = M - 1 - k \tag{E.13}$$
By comparing Eq.(E.13) with Eq.(E.7), we have Eq.(E.1).
Next, let us prove the uniqueness of the mapping of Eq.(E.7). That is, we need to prove that the mapping is 1-to-1. Suppose that the mapping were not unique. That is, for some i and i', $0 \le i, i' < M$ and $i \ne i'$, the mapping of Eq.(E.7) gives the same result:
$$\big((M-1-i)\,l\big)_m M + (M-1-i) = (M-1-k)\,m \tag{E.14a}$$
$$\big((M-1-i')\,l\big)_m M + (M-1-i') = (M-1-k')\,m \tag{E.14b}$$
where, by the non-uniqueness assumption, k=k'.
Subtracting Eq.(E.14b) from Eq.(E.14a) gives
$$\big((i'-i)\,l\big)_m M + (i'-i) = 0 \tag{E.15}$$
Because $(i'-i) \in \{-M+1, \ldots, -1, 1, \ldots, M-1\}$, consequently $0 < |i'-i| < M$, and Eq.(E.15) does not hold for any $i \ne i'$. Hence for $i \ne i'$, $k \ne k'$ must be true. In other words, the mapping of Eq.(E.7) must be 1-to-1.
Because $0 \le k < M$, we can perform modulo M on both sides of Eq.(E.13) which results in Eq.(E.3).
Because the structure of Figure E.3(c) delivers one of the polyphase signals of Figure E.l (a) 
and because of the uniqueness of the mapping, the structure of Figure E.l(b) is therefore 
equivalent to that of Figure E.l (a).
Proof of the PSC-downsampler identity:
Similar to the proof of upsampler-SPC identity, we start by examining the i-th branch of the 
transformed structure shown in Figure E.2(b) and redraw it in Figure E.4(a).
Figure E.4 The i-th signal branch of Figure E.2(b)
Swapping the upsampler and the downsampler and applying the Noble identities transforms the structure into Figure E.4(b). Suppose that $x'_i(m)$ is identical to one of the input polyphase signals, say the k-th polyphase signal $x_k(m)$ of Figure E.2(a); we have
$$k = iM - m_i L \tag{E.16}$$
Thus the i-th branch of the transformed structure in Figure E.2(b) can be equivalently represented by the structure shown in Figure E.4(c), which is the k-th branch of Figure E.2(a) for $0 \le k < L$. As in the previous case, to prove the equivalence of Figure E.2(a) and Figure E.2(b) we need to prove that for any i, $0 \le i < L$, the k defined by Eq.(E.16) also falls into the region of $0 \le k < L$ and that the mapping is unique.
Since L and M are co-prime, Eq.(E.5) holds. Multiplying both sides of Eq.(E.5) by $i/l$, i = 0, 1, ..., L-1, gives rise to
$$iM - \frac{im}{l}\,L = \frac{i}{l} \tag{E.17}$$
Again, with $\frac{im}{l} = \left\lfloor \frac{im}{l} \right\rfloor + \frac{(im)_l}{l}$, Eq.(E.17) can be written as
$$iM - \left\lfloor \frac{im}{l} \right\rfloor L = \frac{(im)_l L + i}{l} \tag{E.18}$$
Since the left hand side of the equation is an integer, $(im)_l L + i$ must be an integer multiple of l. Let it be kl, that is,
$$(im)_l L + i = kl \tag{E.19}$$
Obviously, $k \ge 0$ due to $(im)_l \ge 0$ and $i \ge 0$. Furthermore, because $(im)_l < l$ and $i < L$, it follows that $kl = (im)_l L + i < lL$, that is, $k < L$. Hence $0 \le k < L$.
Applying Eq.(E.19) to Eq.(E.18) gives
$$iM - \left\lfloor \frac{im}{l} \right\rfloor L = k \tag{E.20}$$
Comparing Eq.(E.20) with Eq.(E.16) results in Eq.(E.4).
Because $0 \le k < L$, the k can be obtained by performing modulo L on both sides of Eq.(E.20) which gives Eq.(E.6).
The uniqueness of the mapping of Eq.(E.16) can also be proved with the same approach used previously and is omitted here for brevity.
Appendix F The rule of “Cut and Insert”
The “cut-and-insert” rule states that if at any cross section of an LTI system we “cut” the system into two parts such that one of them has no in-bound or out-bound signals other than those being cut, then inserting, at the cutting position, a number of delays (time-advances) to all the in-bound signals and the same number of time-advances (delays) to all the out-bound signals of this part of the system will not change the overall system transfer function. This process is illustrated in Figure F.1, in which the overall system transfer function is H(z)=Y(z)/X(z)=P(z)Q(z) and the sub-system Q(z) satisfies the condition that all its input and output signals are cut.
To prove the rule let us consider the LTI sub-system Q(z) which has M input and N output signals as illustrated by the right side of Figure F.1(a). The transfer function Q(z) can be defined by the following matrix
$$Q(z) = \begin{bmatrix} Q_{0,0}(z) & Q_{0,1}(z) & \cdots & Q_{0,M-1}(z) \\ Q_{1,0}(z) & Q_{1,1}(z) & \cdots & Q_{1,M-1}(z) \\ \vdots & \vdots & & \vdots \\ Q_{N-1,0}(z) & Q_{N-1,1}(z) & \cdots & Q_{N-1,M-1}(z) \end{bmatrix} \tag{F.1}$$
in which the element $Q_{i,j}(z)$ is LTI and is defined as the transfer function between the output $\psi_i(n)$ and the input $\xi_j(n)$. Therefore, the input-output relation of the sub-system Q(z) is given by
$$\Psi(z) = Q(z)\,\Xi(z) \tag{F.2}$$
or, in matrix form,
$$\begin{bmatrix} \Psi_0(z) \\ \Psi_1(z) \\ \vdots \\ \Psi_{N-1}(z) \end{bmatrix} = \begin{bmatrix} Q_{0,0}(z) & Q_{0,1}(z) & \cdots & Q_{0,M-1}(z) \\ Q_{1,0}(z) & Q_{1,1}(z) & \cdots & Q_{1,M-1}(z) \\ \vdots & \vdots & & \vdots \\ Q_{N-1,0}(z) & Q_{N-1,1}(z) & \cdots & Q_{N-1,M-1}(z) \end{bmatrix} \begin{bmatrix} \Xi_0(z) \\ \Xi_1(z) \\ \vdots \\ \Xi_{M-1}(z) \end{bmatrix} \tag{F.3}$$
Note that $z^{k} I_N\, Q(z)\, I_M z^{-k} = Q(z)$, where $I_N$ and $I_M$ are unit matrices of dimension N and M respectively and k is an arbitrary integer; or, in matrix form,
$$\begin{bmatrix} z^{k} & & & \\ & z^{k} & & \\ & & \ddots & \\ & & & z^{k} \end{bmatrix} \begin{bmatrix} Q_{0,0}(z) & \cdots & Q_{0,M-1}(z) \\ \vdots & & \vdots \\ Q_{N-1,0}(z) & \cdots & Q_{N-1,M-1}(z) \end{bmatrix} \begin{bmatrix} z^{-k} & & & \\ & z^{-k} & & \\ & & \ddots & \\ & & & z^{-k} \end{bmatrix} = \begin{bmatrix} Q_{0,0}(z) & \cdots & Q_{0,M-1}(z) \\ \vdots & & \vdots \\ Q_{N-1,0}(z) & \cdots & Q_{N-1,M-1}(z) \end{bmatrix} \tag{F.4}$$
The sub-system Q(z) is therefore equivalent to the system that cascades an array of delays (or 
time-advances for k<0) to the inputs of Q(z) and an array of time advances (or delays for 
k<0) to the outputs of Q(z) as suggested by Eq. (F.4) and the overall system transfer function 
H(z)=P(z)Q(z) remains unchanged. This result is shown in Figure F.l(b) which is the direct 
result of the “cut-and-insert” rule.
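The rule can be illustrated in the time domain with a small multi-input, multi-output FIR sub-system. The Python sketch below is a hypothetical example (the helper names and the test system are not from the thesis): it delays every input by d samples, applies the sub-system, and then advances every output by d samples.

```python
def fir_mimo(Q, inputs, length):
    """Apply an N x M matrix of FIR impulse responses Q[i][j] to M input sequences."""
    outputs = []
    for row in Q:
        y = [0] * length
        for h, x in zip(row, inputs):
            for n in range(length):
                # Convolution sum restricted to the samples that exist.
                y[n] += sum(h[t] * x[n - t]
                            for t in range(len(h)) if 0 <= n - t < len(x))
        outputs.append(y)
    return outputs


def delay(x, d):
    """Delay a sequence by d samples (multiplication by z^-d)."""
    return [0] * d + list(x)


def advance(x, d):
    """Advance a sequence by d samples (multiplication by z^d), dropping the lead-in."""
    return list(x[d:])


def cut_and_insert(Q, inputs, length, d):
    """Delay all inputs by d, filter, then advance all outputs by d."""
    delayed = [delay(x, d) for x in inputs]
    return [advance(y, d) for y in fir_mimo(Q, delayed, length + d)]
```

For any non-negative d the result of `cut_and_insert` should equal that of `fir_mimo` applied directly, because the delays on the in-bound signals are cancelled by the advances on the out-bound signals.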
Figure F.1 An LTI system and its equivalence after insertion of delays and time-advances
Appendix G SOS technology and MA9000A Sea of Gates
Of all the current silicon-based semiconductor technologies, only CMOS SOS (silicon-on- 
sapphire) offers the necessary resistance to hazards of single event upset (SEU), transient and 
total dose radiation. It also provides the advantages of low power consumption and fast 
internal switching speeds, making CMOS SOS the key technology for defense and space 
applications. Independent investigators consider that its SEU immunity makes it the only 
choice for many key space applications.
Key features of 1.5 micron SOS radiation hardness:
Total Dose [Rad(Si)] Digital >10"
Dose Rate Survive [Rad(Si)/s] >10^12
Dose Rate Upset [Rad(Si)/S] >10"
Single Event Upset [Errors/Bit day] <4x10^-11
Neutrons [Neutron/cm^2] 10^15
Latch up not possible
Key Features of MA9000A Sea of Gates:
• Channelless array architecture
• Typical gate delay 1.0 ns (toggle rates of 100 MHz achievable)
• 1.25 µW/MHz power dissipation per active gate
• Extensive CAD design and support system
• Comprehensive library of logic cells and logic function building macros, with RAM & 
ROM
• Double-Level-Metal CMOS/SOS technology
• High SEU immunity, latch-up free
• Radiation hard to 1 MRad(Si)
The building block for the SOS Sea-of-Gates is a six transistor ‘cell-unit’ equivalent in size to 
a 2-input NAND gate. The cell contains four transistors for building logic circuits and two 
transistors which are used in RAM macros. These extra two transistors are placed under the 
logic power routing and so have no detrimental effect upon overall area. Back-to-back cell 
units form the core of the array.
Array Options:
Array type   Cell units   Bonding pads
                          I/O    Power   Total
MA9140       14112        102    8       110
MA9200       20296        120    8       128
Cell Library:

Cell name   Function                                         Cell units

Combinational Gates:
INV         inverter                                         1
INVB        fast inverter                                    1
INVC        super fast inverter                              2
BUF         non-inverting buffer                             1
BUFB        fast non-inverting buffer                        2
BUFC        super fast non-inverting buffer                  3
NAND2       2 input NAND                                     1
NAND2B      fast 2 input NAND                                2
NAND3       3 input NAND                                     2
NAND4       4 input NAND                                     2
AND2        2 input AND                                      2
AND3        3 input AND                                      2
AND4        4 input AND                                      3
NOR2        2 input NOR                                      1
NOR2B       fast 2 input NOR                                 2
NOR3        3 input NOR                                      2
NOR4        4 input NOR                                      3
OR2         2 input OR                                       2
OR3         3 input OR                                       2
OR4         4 input OR                                       3
ANDNOR      2+2 input AND/NOR                                2
ORNAND      2+2 OR/NAND                                      2
EXNOR       exclusive NOR                                    4
EXOR        exclusive OR                                     4
SEL2INV     select 1 of 2 (inverting)                        4
SEL2        select 1 of 2                                    4
SEL4INV     4 bit data selector (inverting)                  8
SEL4        4 bit data selector

Arithmetic:
HAD         half adder                                       4
FAD         full adder                                       8
FLAD        fast look ahead adder                            6
LAH2        2 bit look ahead unit                            12
LAH3        3 bit look ahead unit                            16
LAH4        4 bit look ahead unit                            25

Simple Latches:
NASR        NAND set-reset latch                             3
NOSR        NOR set-reset latch                              3

Clocked Latches:
DL          D-latch (active low)                             4
DLH         D-latch (active high)                            4
SDL         set D-latch                                      4
RDL         reset D-latch                                    4
SRDL        set-reset D-latch                                6

Edge Triggered Latches:
RETS        with set                                         8
SRETS       with set/reset                                   8

Master-Slave Flip-Flops:
DT          D-type                                           6
D2T         dual input D-type                                8
SDT         set D-type                                       4
RDT         reset D-type                                     8
SRDT        set/reset D-type                                 8

Toggle Flip-Flops:
STT         set T-type                                       8
RTF         reset T-type                                     8
SRTT        set/reset T-type                                 8

Synchronous Counter:
SYNC        synchronous counter stage                        8

Registers / Shift Registers:
SHR4        multibit serial register                         30
SHR8        multibit serial register                         54
RSHR4       multibit serial reg. with reset                  30
RSHR8       multibit serial reg. with reset                  54
DREG4       multibit parallel register                       15
DREG8       multibit parallel register                       27
DREGT4      multibit parallel register with tri-state outputs  25
DREGT8      multibit parallel register with tri-state outputs  45

Inverting Tri-State Buffers:
TRIBUFF     tristate buffer (enable high)                    2
TRIBUFFL    tristate buffer (enable low)                     2
TRINV       tristate inv buffer (enable high)                2
TRINVL      tristate inv buffer (enable low)                 2

Input Output and Peripheral:
TTLIP       TTLIN non-inverting
TTLIPN      TTLIN inverting
CMOSIP      CMOSIN non-inverting
CMOSIPN     CMOSIN inverting
CSCHMITT    CMOS Schmitt non-inverting
CSCHMITTN   CMOS Schmitt inverting
BOP         buffered output non-inverting
NOP         buffered output inverting
TRIOUT      tri-state output non-inverting
TRIOUTN     tri-state output inverting
BODN        buffered open drain output pull down
NODN        inverted open drain output pull down
BODP        buffered open drain output pull up
NODP        inverted open drain output pull up
PDOL        pull down 25k ohms approx
PDOH        pull down 50k ohms approx
PUPL        pull up 25k ohms approx
PUPH        pull up 50k ohms approx

Power Supply Pads:
VDD
VSS
Appendix H Circuit diagrams of a 16-channel 6-bit DEMUX
Figure H.1 Design hierarchy and complexity estimation for a 16-channel DEMUX ASIC
Figure H.2 A 16-channel, 6-bit TMP-tree DEMUX
Figure H.3 The real two-path input buffer: IBF6B
Figure H.4 The complex DA-IP: BSF6B_6B
Figure H.5 MX_CV block
Figure H.6 The real DA-IP: ROMF6B_6B
Figure H.8 Parallel-to-serial convertor: PSCONV6B
Figure H.9 SREG6 block
Figure H.10 Parallel carry-look-ahead adder: CLADD6_B
Figure H.11 Look-ahead unit: CLA2
Figure H.12 Look-ahead unit: BCLA3
Figure H.13 DIV2 block
Figure H.14 ROM2_16 block
Figure H.15 Control logic for stage 1: CTL1_6B
Figure H.16 Control logic for stage 2: CTL2_6B
Figure H.17 Control logic for stage 3: CTL3_6B
Figure H.18 Control logic for stage 4: CTL4_6B
References
[Ale92] G.P. Alexiou and N. Kanopoulos, “A new serial/parallel two’s complement multiplier 
for VLSI digital signal processing,” International Journal of Circuit Theory and 
Applications, Vol. 20, pp. 209-214, 1992.
[Ana90] F. Ananasso, et al., “A multirate demodulator and its front-end for a multifrequency 
TDMA,” Space Communications, Vol. 7, No. 4-6, pp. 531-542, 1990.
[Ana92] F. Ananasso, et al., “A multirate digital multicarrier demodulator: design, 
implementation and performance evaluation, ” IEEE Journal on Selected Areas in 
Communications, Vol. SAC-10, No. 5, pp. 1326-1341, 1992.
[ANT85] ANT Final Report, “Study and Development of an On-Board Multicarrier 
Demodulator for Mobile Satellite Communications,” ESTEC Contract 
6497/85/NL/JG(SC).
[Aue93] E. Auer and P. Battenschlag, “Design and performance of a VLSI implemented 16 
channel FDM demultiplexer for onboard processing,” Proc. 3rd European Conference 
on Satellite Communications, pp. 211-215, Manchester, U.K., November 1993.
[Bau73] C.R. Baugh and B.A. Wooley, “A two’s complement parallel array multiplication 
algorithm,” IEEE Trans. Computers, Vol. 22, No. 12, pp. 1045-1047, December 1973.
[Bel74] M.G. Bellanger and J.L. Daguet, “TDM-FDM transmultiplexer: digital polyphase and 
FFT,” IEEE Trans. Communications, Vol. 22, pp. 1199-1205, September 1974.
[Bel84] Maurice Bellanger, "Digital Processing of Signals, ” 2nd Edition, John Wiley & Sons, 
1984
[Bi90] G. Bi and F. Coakley, “The design of transmultiplexers for on-board processing 
satellite using bit-serial processing technique,” Proc. 13th AIAA International 
Communication Satellite Systems Conference, pp. 613-622, Los Angeles, CA, USA, 
March 1990.
[Bi92] G. Bi, F. Coakley, and B.G. Evans, “Rational sampling rate conversion structures with 
minimum delay requirements”, IEE Proceedings-E, Vol. 139, No. 6, pp. 447-485, 
November 1992.
[Bjö93] G. Björnström, “Digital payloads: enhanced performance through signal processing,” 
ESA Journal, Vol. 17, pp. 1-29, 1993.
[Bla85] R.E. Blahut, “Fast Algorithms for Digital Signal Processing,” Addison-Wesley 
Publishing Company, 1985.
[Bom92] J. Bombardieri, "Systolic pipeline architectures for symmetric convolutions", IEEE 
Trans. Signal Processing, Vol. 40, pp. 1253-1258, May 1992.
[Cam88] S.J. Campanella and S. Sayegh, “Flexible on-board demultiplexer/demodulator,” Proc.
12th AIAA International Communication Satellite Systems Conference, pp. 299-303, 
Arlington, VA, USA, March 1988.
[Cam90a] S.J. Campanella, S. Sayegh, and M. Elamin, “A study of on-board multicarrier digital 
demultiplexer for a multi-beam mobile satellite payload,” Proc. 13th AIAA 
International Communication Satellite Systems Conference, pp. 638-648, Los Angeles, 
CA, USA, March 1990.
[Cam90b] S.J. Campanella, J.V. Evans, T. Muratani, and P.Bartholome, “Satellite 
communications systems and technology, circa 2000,” Proceedings of the IEEE, Vol. 
78, No. 7, pp. 1039-1055, July 1990.
[Can94] P. Cangiance, et al. “Multi-channel demultiplexer/demodulator,” Proc. 15th AIAA 
International Communication Satellite Systems Conference, pp. 822-830, San Diego, 
CA, USA, February 1994.
[Cha88] L. W. Chang and M. Y. Lin, “A new systolic array for discrete Fourier transform,”
IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. 36, No. 10, pp. 1665-
1666, Oct. 1988.
[Cha93] L. W. Chang, “Roundoff error problem of the systolic array for DFT,” IEEE Trans. 
Signal Processing, Vol. 41, No. 1, pp. 395-398, January 1993.
[Cho90] N.I. Cho and S.U. Lee, “DCT algorithms for VLSI parallel implementations,” IEEE
Trans. Acoustics, Speech, and Signal Processing, Vol. 38, pp. 121-127, January 1990.
[Cla78] T.A. Claasen and W.F. Mecklenbräuker, “A generalized scheme for an all digital time-
division multiplex to frequency-division multiplex translator,” IEEE Trans. Circuits 
and Systems, Vol. 25, No. 5 pp. 252-259, May 1978.
[Con83] V. Considine, "Digital Complex sampling,” Electronics Letters, Vol. 19, No. 16, pp. 
608-609, August 1983.
[Cor90] I.R. Corden and R.A. Carrasco, “Fast transform based complex transmultiplexer 
algorithm for multiband quadrature digital modulation schemes,” IEE Proceedings, 
Vol. 137, Part I, No. 6, pp. 408-416, December 1990.
[Cra90] A.D. Craig, et al., “Final Report of Study on Digital Beamforming Networks,” ESA 
Contract: 8087/88/NL/JG(SC), British Aerospace (Space System) Ltd., July 1990.
[Cro75] R.E. Crochiere and L.R. Rabiner, “Optimum FIR digital filter implementations for
decimation, interpolation, and narrow-band filtering,” IEEE Trans. Acoustic, Speech 
and Signal Processing, Vol. 23, pp. 444-456, Oct. 1975.
[Cro83] R.E. Crochiere and L.R. Rabiner, “Multi-rate Digital Signal Processing,” Prentice-
Hall, Inc., Englewood Cliffs, 1983.
[Dan84] P.E. Danielsson, “Serial/parallel convolvers,” IEEE Trans. Computers, Vol. 33, No. 7,
pp. 652-667, July 1984.
[Dar70] S. Darlington, "On digital single-sideband modulators,” IEEE Trans. Circuit Theory,
Vol. CT-17, pp. 409-414, August 1970.
[Del88] E. Del Re and R. Fantacci, “Alternatives for on-board digital multicarrier
demodulation,” International Journal of Satellites Communications, Vol. 6, pp. 267-
281, 1988.
[Del89] E. Del Re and R. Fantacci, “Multicarrier demodulator for digital satellite
communication systems,” Proc. IEE, Vol. 136, Part I, No. 3, pp. 201-207, 1989.
[Des74] A.M. Despain, “Fourier transform computer using CORDIC iterations,” IEEE Trans.
Computers, Vol. 23, pp. 993-1001, October 1974.
[Ela86] M.H. El-Amin, B.G. Evans, and L.N. Chung, “An access protocol for onboard
processing business satellite system,” Proc. 7th International Conference on Digital 
Satellite Communications, Munich, May 1986.
[Ell82] D.F. Elliott and K.R. Rao, “Fast transforms: algorithms, analysis, applications,”
Academic Press, Inc., 1982.
[Ers85] O. Ersoy, "Semisystolic array implementation of circular, skew circular and linear
convolutions," IEEE Trans. Computers, Vol. 34, pp. 190-196, February 1985.
[Est77] D. Esteban and C. Galand, “Applications of quadrature mirror filters to split band voice
coding schemes,” Proc. 1977 IEEE International Conference on Acoustic Speech 
Signal Processing, Hartford, Conn., USA, pp. 191-195, May 1977.
[Eva87] B.G. Evans, I.E. Casewell, and A.D. Craig, “An on-board processing satellite payload
for European mobile communications,” International Journal of Satellite
Communications, Vol. 5, pp. 105-122, 1987.
[Eys90] H. Eyssele and H. Göckler, “Simulation of An On-board Hierarchical Multistage
Digital FDM Demultiplexer for Mobile SCPC Satellite Communications,”
International Journal of Satellite Communications, Vol. 8, pp. 79-93, 1990.
[Fel49] C. B. Feldman and W.R. Bennett, "Bandwidth and transmission performance,” Bell 
Syst. Tech. Journal, Vol. 28, pp. 490-595, 1949.
[Fer91] P.J. Fernandes, et al., “A reconfigurable pipelined transmultiplexer architecture,” Proc.
International Conference on Acoustic, Speech and Signal Processing, Vol. 3, pp. 1961- 
1964, Toronto, ON, May 14-17, 1991.
[Fli93] N.J. Fliege, “Computational efficiency of modified DFT polyphase filter banks,” Proc.
of the 27th Annual Asilomar Conference on Signals, Systems and Computers, pp. 
1296-1300, 1993.
[Fli94] N.J. Fliege, “Multirate Digital Signal Processing: Multirate Systems, Filter Banks,
Wavelets,” John Wiley & Sons, 1994.
[Gar85] F.M. Gardner, “On-Board Processing for Mobile-Satellite Communications,” ESTEC
contract no. 5589/84/NLGM, European Space Agency, May 1985.
[Gas78] J.D. Gaskell, “Linear Systems, Fourier Transforms, and Optics,” New York: Wiley,
1978.
[GEC91] SOS Radiation Hard Hi-Rel IC and ASIC Handbook, GEC Plessey Semiconductors, 
January 1991.
[Gna85] R. Gnanasekaran, “A fast serial-parallel binary multiplier,” IEEE Trans. Computers, 
Vol. 34, No. 8, pp. 741-744, August 1985.
[Göc88] H. Göckler, “A modular multistage approach to digital FDM demultiplexing for mobile
SCPC satellite communications,” International Journal of Satellite Communications,
Vol. 6, pp. 283-288, 1988.
[Got73] B.S. Gottfried and J. Weisman, “Introduction to Optimization Theory,” Prentice-Hall
Inc., 1973.
[Gra96] P.M. Grant, “Multirate signal processing,” Electronics & Communication Engineering
Journal, Vol. 8, No. 1, pp. 4-12, 1996.
[Gre77] W.D. Gregg, "Analog and Digital Communications Systems,” New York: Wiley, 
1977.
[Gro94] S. Groppetti and A. Razzini, “Multicarrier demodulator for user oriented MF-TDMA 
systems (on-board processing),” Proc. 15th AIAA International Communication 
Satellite Systems Conference, pp. 807-613, San Diego, CA, February 28-March 3, 
1994.
[Guo92] X.Y. Guo and G. Maral, “A programmable demultiplexer using a polyphase approach 
for regenerative satellites,” Proc. 14th AIAA International Communication Satellite 
Systems Conference, pp. 1227-1233, 1992.
[Had91] R. A. Haddad and T. W. Parsons, “Digital Signal Processing: Theory, Applications
and Hardware,” Computer Science Press, 1991.
[Har90] J.L. Harrold, et al., “On-board switching and processing,” Proceedings of the IEEE,
Vol. 78, No. 7, pp. 1206-1213, July 1990.
[Hat86] M. Hatamian and G.L. Cash, “A 70-MHz 8-bit×8-bit parallel pipelined multiplier in
2.5 μm CMOS,” IEEE Journal of Solid-State Circuits, Vol. 21, No. 4, pp. 505-513,
August 1986.
[Hsi87] C.C. Hsiao, “Polyphase filter for rational sampling rate conversions,” Proc. IEEE Int.
Conf. on Acoustics, Speech, and Signal Processing, pp. 2173-2176, April 1987.
[Hur85] S.L. Hurst, “Custom-Specific Integrated Circuits,” Marcel Dekker, Inc., 1985.
[Jac89] L.B. Jackson, "Digital Filters and Signal Processing,” second edition, Kluwer
Academic Publishers, 1989.
[Jon93] K.J. Jones, “Parallel DFT computation on bit-serial systolic processor arrays,” IEE
Proceedings -E, Vol. 140, No. 1, pp. 10-18, January 1993.
[Kat87] S. Kato, et al., “Onboard digital signal processing technologies for present and future
TDMA and SCPC systems,” IEEE Journal on Selected Areas in Communications, Vol.
SAC-5, No. 4, pp. 685-700, May 1987.
[Knu81] D.E. Knuth, “The Art of Computer Programming,” Vol. 2, Addison-Wesley, 1981.
[Kov93] J. Kovacevic and M. Vetterli, “Perfect reconstruction filter banks with rational
sampling factors,” IEEE Trans. Signal Processing, Vol. 41, No. 6, pp. 2047-2066, 
June 1993.
[Kun82] H.T. Kung, “Why systolic architectures?,” IEEE Computer, Vol. 15, No. 1, pp. 37-46,
January 1982.
[Kwa90] C.C. Kwan, “Digital Signal Processing Techniques for On-Board Processing
Satellites,” Ph.D. thesis, University of Surrey, March 1990.
[Kwa92] D.C.C. Kwan, F. Coakley and B.G. Evans, "An efficient flexible transmultiplexer for 
on-board processing satellites,” Proc. 9th International Conference on Digital Satellite 
Communications, pp. 319-325, Copenhagen, Denmark, May 1992.
[Kwa93] H. K. Kwan, "Systolic realization of delayed two-path linear phase FIR digital filters", 
IEE Proceedings -G, Vol. 140, No. 1, pp. 75-81, February 1993.
[Las70] L.S. Lasdon, "Optimization Theory for Large Systems,” The MacMillan Limited,
1970.
[Liu93] B. Liu and L.T. Bruton, “The design of N-band nonuniform-band maximally
decimated filter banks,” 27th Annual Asilomar Conference on Signals, Systems and 
Computers, pp. 1281-1285, 1993.
[Lyo76] R.F. Lyon, “Two’s complement pipeline multipliers,” IEEE Trans. Communications,
Vol. 24, pp. 418-425, April 1976.
[Lyo81] R.F. Lyon, “A bit-serial VLSI architecture methodology for signal processing,” In
VLSI 81 (Ed. J.P. Gray), pp. 131-140.
[Ma90] G. Ma and F.J. Taylor, “Multiplier policies for digital signal processing,” IEEE ASSP
Magazine, pp. 6-20, Jan. 1990.
[Mar87] T.G. Marshall, Jr., “The polyphase transform and its applications to block-processing
and filter-bank structures,” 20th IEEE International Symp. on Circuits and Systems
(ISCAS 87), Philadelphia, PA, USA, 4-7 May 1987, pp. 1103-1109, 1987.
[Mar93] G. Maral and M. Bousquet, “Satellite Communications Systems,” 2nd edition, John
Wiley & Sons, 1993.
[Mat90] R.P. Mathur and R.S.M. Chapman, “Advanced CMOS-SOS ASIC implementation of a
digital demultiplexer,” 2nd International Workshop on: Digital Signal Processing 
Techniques Applied to Space Communications, Italy, September, 1990.
[Mea80] C. Mead and L. Conway, “Introduction to VLSI Systems,” Addison-Wesley Publishing
Company, 1980.
[Miz93] T. Mizuno and T. Inoue, “On-board direct regeneration for future satellite 
communications,” IEICE Trans. Communications, Vol. E76B, No. 5, pp. 488-496, 
1993.
[Mur82] S. Muroga, “VLSI System Design,” John Wiley & Sons, 1982.
[Mur94] N.R. Murthy and M.N.S. Swamy, “On the real-time computation of DFT and DCT 
through systolic architectures,” IEEE Trans. Signal Processing, Vol. 42, No. 4, pp. 
988-991, April 1994.
[Nar79] M.J. Narasimha and A.M. Peterson, “Design of a 24-channel transmultiplexer,” IEEE
Trans. Acoustics, Speech and Signal Processing, Vol. 27, pp. 752-762, Dec. 1979.
[Nus87] P.P. Nuspl, et al., “On-board processing for communications satellites: systems and
benefits,” Int. Journal of Satellite Communications, Vol. 5, pp. 65-76, 1987.
[Opp75] A.V. Oppenheim and R.W. Schafer, “Digital Signal Processing,” Prentice-Hall, Inc., 
1975
[Pel73] A. Peled and B. Liu, “A new approach to realization of nonrecursive digital filters,”
IEEE Trans. Audio and Electroacoustics, Vol. 21, No. 6, pp. 477-485, December 
1973.
[Pel74] A. Peled and B. Liu, “A new hardware realization of digital filters,” IEEE Trans.
Acoustics, Speech, and Signal Processing, Vol. 22, pp. 456-462, December 1974.
[Pro89] John G. Proakis, “Digital Communications,” 2nd edition, McGraw-Hill Book
Company, 1989.
[Qi92a] R. Qi and F. Coakley, “A Gate Array Design of a Multi-channel Tree Filter Bank 
Demultiplexer,” The Third International Workshop on Digital Signal Processing 
Techniques Applied to Space Communications, ESTEC, Noordwijk, The Netherlands, 
Sept. 1992.
[Qi92b] R. Qi and F. Coakley, “VLSI Implementation of Digital Channeliser Using Distributed 
Arithmetic,” Electronics Letters, Vol. 28, No. 11, pp. 973-974, May 1992.
[Qi96] R. Qi, F. Coakley and B.G. Evans, “Low Complexity and Low Power Consumption
Design for OBP Multicarrier DEMUX VLSI,” accepted by the 16th AIAA 
International Communications Satellite Systems Conference, Washington, DC, USA, 
February 25-29, 1996.
[Rad84] C.M. Rader, "A simple method for sampling in-phase and quadrature components,”
IEEE Trans. Aerospace and Electronic Systems, Vol. 20, No. 6, pp. 821-824, 
November 1984.
[Ric82] D.W. Rice and K.H. Wu, "Quadrature sampling with high dynamic range,” IEEE
Trans. Aerospace and Electronic Systems, Vol. 18, No. 6, pp. 736-739, November 
1982.
[Rob62] L.P.A. Robichaud, et al., "Signal Flow Graphs and Applications,” Prentice-Hall
International., 1962.
[Rot79] C.H. Roth, Jr., “Fundamentals of Logic Design,” 2nd edition, West Publishing
Company, 1979
[San92] P. Santis and P.S. Yeung, “On-board demultiplexing of unrestricted FDMA traffic,”
Proc. 9th International Conference on Digital Satellite Communications, ICDSC-9, pp. 
327-334, May 1992.
[Say92] S. Sayegh, M. Kappes, et al., "An overview of COMSAT work in multicarrier
demultiplexer/demodulators,” Proc. 14th AIAA International Communication Satellite 
Systems Conference, pp. 1234-1244, 1992.
[Sch81] H. Scheuermann and H. Göckler, “A comprehensive survey of digital
transmultiplexing methods,” Proceedings of the IEEE, Vol. 69, No. 11, pp. 1419-1450,
November 1981.
[Sch93] A.F. Schwarz, “Handbook of VLSI Chip Design and Expert Systems,” Academic Press,
1993.
[Sec92] N.P. Secord and Chun Loo, “Simulation study of an on-board satellite group
demodulator based on the multistage transmultiplexer,” Proc. of Globecom’92, pp.
717-721, 1992.
[Tsu78] T. Tsuda, S. Morita, and Y. Fujii, “Digital TDM-FDM translator with multistage
structure,” IEEE Trans. Communications, Vol. 26, No. 5, pp. 734-741, May 1978.
[Vai87] P.P. Vaidyanathan, “Quadrature mirror filter banks, M-band extensions and perfect-
reconstruction techniques,” IEEE ASSP Magazine, July 1987, pp. 4-20.
[Vai90] P.P. Vaidyanathan, "Multirate digital filters, filter banks, polyphase networks, and
applications: a tutorial,” Proceedings of the IEEE, Vol. 78, No. 1, pp. 56-93, January
1990.
[Vau91] R.G. Vaughan, "The Theory of Bandpass Sampling,” IEEE Trans. Signal Processing,
Vol. 39, No. 9, pp. 1973-1984, September 1991.
[Vet87] M. Vetterli, “A theory of multirate filter banks,” IEEE Trans. Acoustics, Speech, and
Signal Processing, Vol. 35, No. 3, pp. 356-372, March 1987.
[Whi89] S. A. White, “Applications of distributed arithmetic to digital signal processing: A
tutorial review,” IEEE ASSP Magazine, pp. 4-19, July 1989.
[Yim88] W. H. Yim, C. C. Kwan, F. Coakley and B. G. Evans, “Multi-carrier Demodulators for
On-Board Processing Satellites,” International Journal of Satellite Communications,
Vol. 6, pp. 243-251, 1988.
[Yim91a] W. H. Yim, “All-Digital Multicarrier Demodulators for On-Board Processing
Satellites in Mobile Communication Systems,” Ph.D. thesis, University of Surrey, June
1991.
[Yim92a] W. H. Yim and F. Coakley, "Polyphase Matrix and Lattice Decomposition for 
Multirate Filters and Filter Banks,” Proc. IEEE International Conference on Acoustic, 
Speech and Signal Processing, ICASSP-92, Vol. 4, pp. 625-627, March, 1992.
[Yim92b] W. H. Yim and F. Coakley, "A unified approach for the design of on-board
channelisers,” Proc. 9th International Conference on Digital Satellite Communications, 
ICDSC-9, pp. 305-310, May 1992.
[Zoh76] S. Zohar, “A realization of the RAM digital filter,” IEEE Trans. Computers, Vol. 25,
No. 10, pp. 1048-1052, October 1976.
[Zoh89] S. Zohar, “A VLSI implementation of a correlator/digital-filter based on distributed
arithmetic,” IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. 37, No. 1,
pp. 156-160, January 1989.
List of Publications
Publications
1. “A Gate Array Design of a Multi-channel Tree Filter Bank Demultiplexer.” the Third 
International Workshop on Digital Signal Processing Techniques Applied to Space 
Communications, ESTEC, Noordwijk, The Netherlands, Sept. 1992.
2. “VLSI Implementation of Digital Channeliser Using Distributed Arithmetic.” Electronics
Letters, May 1992, Vol. 28, No. 11.
3. “A Gate Array Design of An 8-channel Digital Channeliser.” the 4th Bangor 
Communications Symposium, Bangor, UK, May 1992.
4. “Theoretical and Practical Aspects of the Design of Frequency Channelisers for Satellite 
Applications.” National Telecom Symposium, Bydgoszczy, Poland, September 1992.
5. “Low Complexity and Low Power Consumption Design for OBP Multicarrier DEMUX 
VLSI.” the 16th International Communications Satellite Systems Conference, 
Washington, DC, USA, February 25-29, 1996.
6. “A Reconfigurable Multistage Multicarrier DEMUX Architecture.” accepted by the Fifth 
International Workshop on Digital Signal Processing Techniques Applied to Space 
Communications, Barcelona, Spain, September 1996.
7. “A Novel Pure Systolic Linear Phase FIR for High speed Applications.” accepted by the 
7th International Conference on Signal Processing Applications & Technology, Santa 
Clara, CA, USA, October 1996.
8. “Practical Considerations for Bandpass Sampling.” Electronics Letters, to appear.
9. “Optimization of Complex Binary Tree Filter Bank Structure Using Multirate Signal 
Flow Graph Transformation.” accepted by the IEEE International Conference on 
Communication Systems, Singapore, November 1996.
VLSI IMPLEMENTATION OF DIGITAL 
CHANNELISER USING DISTRIBUTED 
ARITHMETIC
R. Qi and F. P. Coakley
Indexing terms: Multicarrier demodulator, Large-scale integration
A VLSI architecture for a digital channeliser based on the
time-multiplexed tree filter bank is described, in which the
maximum sharing of the arithmetic operations at each stage
is achieved. A very efficient implementation of the
band-splitting filter is achieved by using distributed arithmetic,
allowing a single chip design that does not require multipliers
for an 8 channel channeliser.
Introduction: A multicarrier demodulator for FDM-TDM conversion on-board satellite
consists of three sections as shown in Fig. 1. This Letter describes a single chip design for
an 8 channel frequency digital channeliser (DCH) based on the tree filter bank design. A
major problem with the implementation of channelisers is the high computation rate [1, 2].
A tree filter bank, with its high modularity and simplicity, has proved more suitable for an
ASIC design than other designs based on FFT filter banks.
Fig. 1 K-channel multicarrier demodulator
[Figure: K FDM channels, A/D, digital channeliser, demodulators, data]
Time-multiplexed tree filter bank: In a tree filter bank (Fig. 2a) the input signal is
successively split into narrower frequency bands at each stage of the tree. For a K-channel
uniformly spaced FDM signal ($K = 2^m$), the demultiplexing binary tree has $K - 1$ nodes.
Each node is a band-splitting filter (BSF) which comprises a lowpass and a highpass filter.
Note that the aggregated output information rate at each level is the same because of the
decimation by 2 in the BSFs and that for uniformly spaced channels, all BSFs are identical.
It is possible to collapse the BSFs at each stage into a single time-multiplexed BSF working
at a constant rate for all stages, thus requiring only $\log_2 K$ BSFs.
Fig. 2 Conventional and time-multiplexed filter banks
a Tree filter bank
b Time-multiplexed version of tree filter bank
[Figure: a, three-stage tree of band-splitting filters (highpass/lowpass FIR pairs, each
followed by decimation by 2) producing y0(m) to y7(m); b, three stages, each with input
buffers and one complex inner product, producing y(m) for channels 0-7]
Band-splitting filter using distributed arithmetic: The BSF is defined by the following
complex lowpass and highpass processes:
$$Y_L(m) = \sum_{i=0}^{N-1} h_L(i)\,X(2m - i), \qquad Y_H(m) = \sum_{i=0}^{N-1} h_H(i)\,X(2m - i)$$
$$h_L(i) = h(i)\,e^{j(\pi/4)i}, \qquad h_H(i) = h(i)\,e^{j(3\pi/4)i}$$
where h(i) is a prototype N-tap half-band FIR filter (N = 4n + 3, n = 0, 1, ...) and has the
symmetry property in which h(i) = h(N - 1 - i), h(i) = 0 for i odd except for the
centre tap h[(N - 1)/2].
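The band-splitting filter defined above can be sketched in a few lines of Python. This is only an illustrative sketch: the prototype coefficients come from a hypothetical windowed-sinc helper (`halfband_prototype`), not from the Letter, and the modulation angles π/4 and 3π/4 are taken as read from the reconstructed equations. The function names are assumptions introduced for the example.

```python
import cmath
import math


def halfband_prototype(n=1):
    """Toy N = 4n + 3 tap half-band prototype (windowed sinc, illustrative values)."""
    N = 4 * n + 3
    c = (N - 1) // 2  # centre tap index; odd whenever N = 4n + 3
    h = []
    for i in range(N):
        k = i - c
        if k == 0:
            h.append(0.5)            # centre tap
        elif k % 2 == 0:
            h.append(0.0)            # half-band zeros (these fall on odd i)
        else:
            taper = 0.54 + 0.46 * math.cos(math.pi * k / (c + 1))
            h.append(taper * math.sin(math.pi * k / 2) / (math.pi * k))
    return h


def band_split(x, h):
    """Complex lowpass/highpass filtering with decimation by 2 (direct form)."""
    N = len(h)
    h_low = [h[i] * cmath.exp(1j * math.pi / 4 * i) for i in range(N)]
    h_high = [h[i] * cmath.exp(1j * 3 * math.pi / 4 * i) for i in range(N)]
    y_low, y_high = [], []
    for m in range(len(x) // 2):
        acc_low = acc_high = 0j
        for i in range(N):
            if 2 * m - i >= 0:       # samples before the start are taken as zero
                acc_low += h_low[i] * x[2 * m - i]
                acc_high += h_high[i] * x[2 * m - i]
        y_low.append(acc_low)
        y_high.append(acc_high)
    return y_low, y_high


h = halfband_prototype(1)
tone = [cmath.exp(1j * math.pi / 4 * k) for k in range(64)]  # tone at the lowpass centre
y_low, y_high = band_split(tone, h)
```

With a tone placed at the lowpass centre frequency, the lowpass branch carries most of the energy, while the short 7 tap toy prototype leaves a visible residue in the highpass branch; a longer prototype would be needed for useful selectivity.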
An FIR filter can be decomposed into an inner-product generator (IPG) which computes the
multiply-and-summation inner product and an input buffer which is a serial-in parallel-out
shift register array that sequentially delivers data vectors for the IPG, as illustrated by
Fig. 3a. As the inner product involves a number of multiplications and additions, it would
appear to be the most area- and time-consuming part of FIR filters. Distributed arithmetic
(DA) which converts the multiply-and-summation into a table-lookup-and-summation
operation reduces the complexity of the IPG. DA is characterised by its high bit-serial
processing speed, high computational efficiency, and low complexity. Another important
feature of DA is its high precision of computation because no multiplier is involved in DA
and, therefore, is free of rounding/truncation errors [3].
Fig. 3 Example of IPG complexity reduction
a 4 tap filter with 4 term inner product
b 4 tap filter with 2 term inner product
To perform an N-term inner product using bit-serial DA mechanisation, a look-up table of
size $2^{N-1}$ words is needed, which is intolerable when N is large [3]. An N-term inner
product can be converted to an M-term one, as shown in Fig. 3b, where M (< N) is the number
of the nonzero distinct coefficients of the filter. The table size is thus reduced to
$2^{M-1}$ words. The symmetry property of h(i) indicates that the number of nonzero distinct
coefficients of h(i) is n + 2. Similarly the complex lowpass $h_L(i)$ (highpass $h_H(i)$)
filter also has n + 2 nonzero ‘distinct’ (in the sense of absolute value, i.e.
$|h_L(N - 1 - i)| = |h_L(i)|$) coefficients. Thus a 7 tap (n = 1) half-band filter needs only
$2^{3-1} = 4$ (instead of $2^{7-1} = 64$) words of ROM. Fig. 4 shows the IPG structure using
the DA approach. An optimised combinational logic network replaces the ROM table to further
reduce the area. A carry-look-ahead parallel adder/subtracter is used for increased speed.
Fig. 4 Inner-product generator (real) for 7 tap half-band filter
[Figure: data from input buffer, address generator, ROM, 12 bit parallel add/subtract, P/S
convertor]
ELECTRONICS LETTERS 21st May 1992 Vol. 28 No. 11 973
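The table-lookup-and-summation idea can be illustrated with a small integer model. The sketch below makes several assumptions that are not taken from the Letter: it uses the plain $2^M$-word table rather than the halved $2^{M-1}$-word table cited from [3], it reduces the 7-term product to 3 terms by pre-adding the samples that share a coefficient (valid for the real symmetric prototype only), and the coefficient and data values are invented toy numbers. All function names are hypothetical.

```python
def build_table(coeffs):
    """Precompute every partial sum of the coefficients: entry `addr` holds the
    sum of coeffs[k] for each bit k that is set in addr (2**M words)."""
    M = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << M)]


def da_inner_product(table, words, bits):
    """Bit-serial distributed arithmetic for two's complement integer inputs.

    One table lookup per bit plane; the sign-bit plane is subtracted."""
    acc = 0
    for b in range(bits):
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> b) & 1) << k      # gather bit b of every input word
        term = table[addr] << b              # weight the lookup by 2**b
        acc += -term if b == bits - 1 else term
    return acc


def fold_halfband(x7):
    """Pre-add samples that share a coefficient in a 7 tap symmetric half-band
    filter: h(0) = h(6), h(2) = h(4), centre tap h(3); h(1) = h(5) = 0."""
    return [x7[0] + x7[6], x7[2] + x7[4], x7[3]]


# Toy 7 tap symmetric half-band coefficients and 12 bit data (invented values).
h = [-3, 0, 35, 64, 35, 0, -3]
x = [100, -200, 37, -5, 2047, -2048, 11]

table = build_table([h[0], h[2], h[3]])                 # 3 distinct nonzero taps
y = da_inner_product(table, fold_halfband(x), bits=13)  # pre-adds need 1 extra bit
```

The result equals the direct multiply-and-sum over all seven taps, yet the loop contains only lookups, shifts, and additions.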
At each stage of the tree in Fig. 2a an individual input buffer is needed by each BSF as the
input data streams to the branches are different; thus sharing of input buffers is
impossible. The complex IPG, on the other hand, can be shared within a stage. The resulting
multistage time-multiplexed tree filter bank structure is shown in Fig. 2b which forms the
basis of our 8 channel DCH.
A novel input buffer structure is used for the DCH. It outputs each data vector twice by
feeding output data words (in bit serial) back into the word-shift registers of the buffer to
allow the time-multiplexed operation of the BSF. For an N-tap half-band filter (N = 4n + 3),
3n + 3 shift register words are required, in contrast to 4n + 3 required by the commonly
used linear shift register array. The saving is 14-25% depending on the filter length.
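The quoted 14-25% range follows directly from the two word counts; a short worked form, using only the quantities stated above, is:

```latex
\[
  \text{saving}(n) \;=\; \frac{(4n+3) - (3n+3)}{4n+3} \;=\; \frac{n}{4n+3},
  \qquad
  \text{saving}(1) = \frac{1}{7} \approx 14.3\%,
  \qquad
  \lim_{n \to \infty} \text{saving}(n) = \frac{1}{4} = 25\%.
\]
```

The lower figure corresponds to the 7 tap filter (n = 1) and the upper figure is the limit for long filters.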
ASIC design for 8 channel DCH: A time-multiplexed tree filter bank using DA techniques
enables us to put an 8 channel DCH onto a single MEDL SOS sea of gate (SOG) (MA9200 with
20 K cells) chip [4]. The DCH was designed to demultiplex an FDMA/SCPC signal into eight
9.6 kbit/s QPSK channels. Digital design ensures a fully pipelined architecture with short
input-output latency. Control signals of each stage are generated locally easing the
problems of signal routing. Both the data word length and the filter coefficient length are
12 bits which is longer than those suggested earlier (10 bits for data length and 9 bits for
coefficients) [5]. Because a 10 bit shift register is not available with the MEDL SOG
library we had to use 8 bit and 4 bit shift registers provided by the library to form a
12 bit shift register. The input data length could be considerably reduced, possibly down to
4 bits, without significant degradation in SNR [6, 7]. If 8 bit, instead of 12 bit shift
registers were used in the design, approximately [?]0% fewer gates would be needed.
Digital simulation was carried out using the Mentor Graphics CAD tools. For an 8 channel
9.6 kbit/s (QPSK) FDM signal sampled at 134.4 kHz (12 bits), the clock frequency should be
12 × 134.4 = 1612.8 kHz. The simulation used a clock frequency of 20 MHz (12.4 times higher
than needed), indicating a possibility of demultiplexing eight [?]4 kbit/s channels with the
same chip. Further design tradeoffs (e.g. at bit rate and word length) are currently being
investigated to allow higher rate channels or to increase the number of lower rate channels.
Functional verification was performed by comparing the output data of the digital simulation
with those of a target C program.
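The clock-rate arithmetic in this paragraph can be checked with a few lines. The helper name is an assumption made for illustration; the numbers are those quoted in the text, and the bit-serial model (one clock per bit of each sample word) is the only relationship used.

```python
def required_clock_khz(sample_rate_khz, word_bits):
    """Bit-serial processing needs one clock cycle per bit of every sample word."""
    return sample_rate_khz * word_bits


needed_khz = required_clock_khz(134.4, 12)   # 12 x 134.4 kHz
simulated_khz = 20_000.0                     # 20 MHz simulation clock
headroom = simulated_khz / needed_khz        # ratio of simulated to required clock

print(f"required clock: {needed_khz:.1f} kHz, headroom: {headroom:.1f}x")
```

The headroom factor is what suggests that the same chip could serve channels of a proportionally higher bit rate.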
The hardware complexity of this design is shown in Table 1.
Table 1 HARDWARE COMPLEXITY
Blocks                      Input buffers   Complex IPGs   Control and others   Total
Number of cells             8120            3660           500                  12280
Percentage of total cells   68              30             2                    100
It can be seen that the design is very efficient in terms of minimisation of control
overhead and arithmetic operations. Any further significant reduction of the chip area must
come from the optimisation of the input buffers which occupy most of the chip area.
Conclusions: A tree filter bank type DCH can be implemented in a time-multiplexed manner
resulting in a very efficient pipelined structure which, together with the use of DA, enables
an 8 channel DCH to be put onto a single VLSI chip. The DA technique has proved very useful
and efficient in implementing half-band FIR filters of the BSF. As only the complex IPG can
be shared in each stage of the tree, the memory size of the time-multiplexed tree filter bank
exponentially increases with the number of stages of the tree. The ASIC design of an 8
channel DCH has shown that the hardware complexity is dominated by memory. Further
optimisation of arithmetic operations, although needed, will not substantially reduce the
area. However, the area could be reduced considerably by reducing the data word length.
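The growth of memory with the number of stages can be made concrete by counting resources per stage. The sketch below assumes one input buffer per BSF node and one shared complex IPG per stage, as the Letter describes; the function name and the dictionary layout are hypothetical.

```python
def tree_resources(K):
    """Count BSF nodes, per-stage input buffers, and shared IPGs for a
    K-channel binary tree filter bank (K must be a power of two)."""
    stages = K.bit_length() - 1
    if K < 2 or (1 << stages) != K:
        raise ValueError("K must be a power of two, K >= 2")
    # Stage s (counting from 0) splits 2**s bands, so it needs 2**s input buffers.
    buffers_per_stage = [1 << s for s in range(stages)]
    return {
        "stages": stages,
        "bsf_nodes": sum(buffers_per_stage),      # K - 1 nodes in the full tree
        "input_buffers_per_stage": buffers_per_stage,
        "shared_complex_ipgs": stages,            # one time-multiplexed IPG per stage
    }


resources = tree_resources(8)
```

For the 8 channel case this gives three stages, seven input buffers, and three shared complex IPGs, which is in line with the input buffers dominating the cell count in Table 1.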
16th March 1992
R. Qi and F. P. Coakley (Dept. of Electronic and Electrical Engineering, University of
Surrey, Guildford, Surrey GU2 5XH, United Kingdom)
References
1 YIM, W. H., KWAN, C. C., COAKLEY, F. P., and EVANS, B. G.: ‘Multicarrier demodulators for
on-board processing satellites’, Int. J. Satellite Commun., 1988, 6, pp. 243-251
2 GARDNER, F. M.: ‘On-board processing for mobile-satellite communications’. ESA Final
Technical Report, ESTEC Contract No. 5889/84/NL/GM, May 1985, pp. 4-19
3 WHITE, S. A.: ‘Applications of distributed arithmetic to digital signal processing: a
tutorial review’, IEEE ASSP Magazine, July 1989
4 GEC Plessey Semiconductors: ‘SoS radiation hard Hi-Rel IC and ASIC handbook’ (January
1991), pp. (8-3)-(8-9)
5 BI, G., and COAKLEY, F. P.: ‘The design of transmultiplexors for on-board processing
satellites using bit-serial processing technique’. 13th AIAA Int. Communication Satellite
Systems Conf., March 1990, pp. 613-622
6 JESUPRET, T., MOENECLAEY, M., and ASCHEID, G.: ‘Digital demodulator synchronization’.
Final report on ESTEC contract No. 8437, Noordwijk, The Netherlands, 19th February 1991
7 VAUGHAN, R. G., SCOTT, N. L., and WHITE, R.: ‘The theory of bandpass sampling’, IEEE
Trans., 1991, ASSP-39, (9), pp. 1973-1984
Low Complexity and Low Power Consumption Design for OBP Frequency
Demultiplexer VLSI
Ronggang Qi, F. P. Coakley, and Prof. B. G. Evans 
Centre for Satellite Engineering Research, University of Surrey, 
Guildford, Surrey GU2 5XH, U. K.
Abstract
A systematic approach to the optimization of multirate 
filter bank (MFB) structures is introduced. The 
approach is based on the multirate signal flow graph 
(MSFG) representation and transforms. It presents 
structural information clearly and avoids tedious 
mathematical manipulations. 
Another issue discussed in this paper is the mapping of 
a demultiplexer structure onto VLSI architecture with 
low complexity and low power consumption. This is 
achieved by modelling the space, power, and 
processing speed complexities of VLSI in a multirate 
environment and applying optimization techniques to 
search for acceptable solutions. Design examples have 
been given to show the effectiveness of the proposed 
methods.
I. Introduction
Frequency demultiplexers (DEMUX) are one of the key 
components of the multicarrier demultiplexer and 
demodulator (MCDD) for OBP communication 
satellites 1. It has been shown that a variety of 
frequency demultiplexing algorithms fall into the 
category of polyphase-DFT filter banks, which lead to 
very computationally efficient DEMUX structures 
either in single-stage or in multi-stage form 2. However, 
digital multicarrier demultiplexing requires an 
immense amount of computation on board the satellite, 
increasing the demand for payload mass and power. 
The use of digital signal processing (DSP) techniques 
can significantly reduce the computation effort. 
Furthermore, if the MCDDs are implemented in 
application specific integrated circuits (ASIC), the 
mass and power requirements on the payload can be 
further reduced. These two aspects form the main 
topics of this paper.
The engineering design and implementation of 
DEMUXs, however, are less well covered in the literature, 
which lacks a systematic design methodology. This 
paper bridges MFB theory to telecommunication 
applications with emphasis on frequency DEMUX 
VLSI implementation. To this end, an MFB 
optimization approach and a modelling method for 
multirate VLSI complexities are presented to explain 
the mapping from multirate DSP algorithms to VLSI 
architectures.
Copyright © 1996 by the American Institute of Aeronautics and 
Astronautics, Inc. All rights reserved.
DEMUX approaches
Efficient frequency demultiplexing methods can in 
general be classified into multi-stage and single-stage 
(block processing) approaches. In a multi-stage 
DEMUX, the frequency multiplex is successively 
demultiplexed through a number of DEMUX stages, 
relaxing the stringent channel filter requirement for 
sharp transition bands. The most commonly used 
multi-stage DEMUX is the binary tree structure, in 
which simple half-band filters (7-tap or 11-tap FIRs) 
can be used for the successive channelization 1,7.
Block processing methods mainly include polyphase-DFT 
and frequency-sampling filter banks (FSFB) 2,3,4. 
The FSFB, also called the fast convolution filter bank, 
performs time domain linear convolution via the 
frequency sampling approach. This approach is useful 
for some non-uniform demultiplexing problems and 
can be computationally efficient if the channel 
bandwidths are narrow compared with the total FDM 
bandwidth. Polyphase-DFT filter banks are well known 
for their computational efficiency 5. To allow rational 
sampling rate alteration, the polyphase-matrix-DFT 
(PMDFT) filter bank, which is a more general form of 
the polyphase-DFT, can be used 3. The rational rate 
conversion feature of the PMDFT makes it attractive for some 
on-board MCDD applications, such as SCPC and 
MF-TDMA systems 1,6.
From function model to VLSI
With the development of VLSI technology and design 
methodology, now it seems feasible for some very 
complex and high speed systems like MCDD to be 
integrated into VLSI with affordable cost and time 1. 
VLSI implementations require efficient mapping from 
the computation (DSP) models to hardware 
architectures. This mapping can be done in three steps:
• at the top level, the function model needs to be 
optimized such that the number of basic operations 
(e.g., multiplication, addition, memory operations, 
etc.) is minimized;
• map the optimized algorithm into a general 
implementation structure which consists of only 
fundamental building blocks (basic components) by 
either a direct one-to-one mapping from the basic 
operations to the basic components, or a multi-to-one 
mapping in which time-sharing of some basic 
components is allowed; and
American Institute of Aeronautics and Astronautics
• map the implementation structure onto VLSI 
architecture which is efficient in terms of 
complexity, power, and throughput.
To tackle the mapping at the top level, in section II, we 
propose the MSFG approach to the derivation of 
computationally efficient algorithms (structures) for 
multirate systems. The one-to-one mapping is assumed 
in the second step. The third step is addressed 
separately in section III.
II. MSFG approach to filter bank optimization
The multirate filter bank design approach is normally based 
on mathematical manipulations, sometimes aided by 
conventional signal flow graph (SFG) transforms. The 
main advantage of the mathematical approach lies in its 
conciseness. It can, however, obscure some structural 
information that can be vital for obtaining efficient 
DSP structures. The conventional SFG, on the other 
hand, renders structural information explicit and is hence 
more appropriate for hardware structure mapping. In 
a multirate environment, nevertheless, the conventional 
SFG, which suits linear time-invariant (LTI) systems, is 
inadequate since linear multirate systems are linear 
periodically time-varying (LPTV) 5.
As a complement to the conventional design approach 
and an extension to the conventional SFG, we introduce a 
multirate signal flow graph (MSFG) representation for 
multirate systems and use it for MFB designs. 
Compared with purely mathematical representations, 
the MSFG provides a more direct and clearer link to the 
hardware structure and to the parallelism of the filter 
bank. Being extended from the conventional SFG, the 
MSFG preserves most of the properties of the former. 
For the multirate environment, new identities and 
transforms of the MSFG are identified and defined. With 
the introduction of a set of short-hand notations for 
the MSFG, the task of flow graph manipulation and 
transformation can be considerably simplified and 
optimized DEMUX structures can be achieved with 
ease.
MSFG
Signal flow graph has been historically defined as a set 
of branches and nodes in which the former define the 
signal operations and the latter indicate the connection 
points of these branches 5. Similarly, we also define 
MSFG in terms of a set of branches and nodes. 
However, branches in MSFG are confined to be linear 
time-invariant (LTI) and all the non-linear and time- 
varying (e.g., modulation, downsampling, upsampling, 
etc.) operations are defined by node functions. The 
reason for defining the non-linear and multirate 
operations as node functions is that most of them 
(down-sampling, up-sampling, sampling & hold, 
commutation, etc.) are trivial in digital networks and 
can hence be treated as basic operations from the 
implementation point of view. For those non-trivial 
node functions, such as modulation (multiplication by 
a signal), some useful transforms can be applied to 
convert them into trivial ones.
Basic multirate operators
The fundamental difference between classical DSP and 
multirate DSP lies in the sampling rate alteration that 
is not allowed in the former. In multirate DSP, the 
most fundamental multirate operators are downsampler 
(DS) and upsampler (US) which decrease or increase 
the sampling rate. From DS and US, two important 
rate-changing components are defined. They are serial- 
to-parallel commutator (SPC), which decreases the 
sampling rate by decomposing the incoming signal into 
a group of sub-signals (polyphase signals), and 
parallel-to-serial commutator (PSC), which increases 
the sampling rate by combining the polyphase signals. 
There are two other multirate operators which are 
frequently encountered in digital networks, but commonly 
ignored in the DSP literature. They are the sampling & 
hold (SH) and upsampling & hold (USH) operators, 
which can also be defined with DS and US samplers 
(see Table 1 for definition and Table 2 for relations to 
samplers).
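The four rate-changing operators described above can be sketched in a few lines of NumPy. This is only an illustration: the function names are hypothetical, and the commutator ordering (branch i carries samples x[mM + i]) is an assumption, since the defining diagrams of Table 1 are not reproduced here. SH and USH are left out for the same reason.

```python
import numpy as np

def downsample(x, M):
    """DS: keep every M-th sample x[0], x[M], x[2M], ..."""
    return np.asarray(x)[::M]

def upsample(x, L):
    """US: insert L-1 zeros after every input sample."""
    x = np.asarray(x)
    y = np.zeros(len(x) * L, dtype=x.dtype)
    y[::L] = x
    return y

def spc(x, M):
    """SPC: split x into M polyphase sub-signals (assumed order: branch i = x[mM + i])."""
    x = np.asarray(x)
    usable = (len(x) // M) * M            # drop an incomplete final block
    return x[:usable].reshape(-1, M).T    # shape (M, len(x) // M)

def psc(branches):
    """PSC: interleave M polyphase sub-signals back into one serial signal."""
    return np.asarray(branches).T.reshape(-1)
```

With this ordering, `psc(spc(x, M))` returns `x` whenever the length of `x` is a multiple of `M`, and each SPC branch runs at 1/M of the input rate.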
MSFG node functions
In MSFG, six multirate nodes are defined, namely, US, 
DS, SH, USH, SPC, and PSC, which perform the 
corresponding basic multirate operations. Two kinds of 
non-multirate nodes are defined: the ordinary node 
(OD) which is exactly the same as that in conventional 
SFG and the modulation node (MOD) which performs 
signal modulation (multiplication).
Nodes in MSFG can be classified into two types: 
additive and multiplicative. Nodes in a conventional 
SFG are all additive. Signals coming into an additive 
node are summed up at the node. The additional 
feature of the additive node in MSFG is the sampling 
function and sampling rate alteration. That is, the 
sampling functions (e.g., sampling & hold) or 
sampling rate alteration (down-sampling and up- 
sampling) can be performed at additive nodes. Thus an 
essential requirement associated with the additive node 
is that all signals coming into the node must have the 
same sampling rate. The only multiplicative node in 
MSFG is the modulation node. Having moved all 
nonlinear and time-varying operations to nodes, the 
branch transfer functions (transmittances) between any 
two nodes are left linear and time-invariant.
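A minimal data-structure sketch of the node classification above, assuming hypothetical names (`NodeKind`, `Node`, `add_input`) that do not come from the paper. It only encodes two stated rules: the modulation node is the sole multiplicative node, and all signals entering an additive node must share one sampling rate. No rule is assumed for the rates entering a modulation node.

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeKind(Enum):
    OD = "ordinary"
    MOD = "modulation"
    US = "upsampler"
    DS = "downsampler"
    SH = "sampling & hold"
    USH = "upsampling & hold"
    SPC = "serial-to-parallel commutator"
    PSC = "parallel-to-serial commutator"

@dataclass
class Node:
    name: str
    kind: NodeKind
    input_rates: list = field(default_factory=list)

    @property
    def is_additive(self) -> bool:
        # The modulation node is the only multiplicative node in the MSFG.
        return self.kind is not NodeKind.MOD

    def add_input(self, rate_mhz: float) -> None:
        """Attach an incoming branch; additive nodes need one common input rate."""
        if self.is_additive and self.input_rates and rate_mhz != self.input_rates[0]:
            raise ValueError(f"{self.name}: input rates differ at an additive node")
        self.input_rates.append(rate_mhz)
```

A graph builder based on such a check would reject, for example, a summing node fed by one branch at 42 MHz and another at 2 MHz.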
The SPC and PSC can be considered as special types 
of additive node. The SPC node is additive and allows 
rate-changing, but has a fixed number of output branches 
with different output signals (defined by SPC 
functionality), whereas the PSC node is "additive" only 
in the sense that the allowed fixed number of incoming 
signals are rate-up-converted and "added" 
(combined) to form the output signal. All the node 
functions of MSFG are listed in Table 1.
Table 1 MSFG Nodes
MSFG transformation
A multirate DSP system can be transformed into 
equivalent structures via flow graph transformations. 
Fundamental relationships between multirate operators 
have been given in the form of identities (Noble 
identities, etc.) 5,8. The most fundamental and 
important result in multirate DSP theory is, perhaps, 
the concept of the polyphase decomposition transform for 
decimation and interpolation filters (FPDT), which 
allows highly computationally efficient filter bank 
structures 5. We extend the concept of polyphase 
decomposition to signal modulation, leading to the 
modulation polyphase decomposition transform 
(MPDT) and its variations. As will be seen later, the 
use of the MPDT, together with FPDTs, can simplify a 
modulated filter bank, resulting in a highly efficient 
DFT filter bank structure.
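As a small numerical illustration of the kind of identity referred to here, the sketch below checks the standard Noble identity for decimation: filtering with H(z^M) and then downsampling by M gives the same output as downsampling by M and then filtering with H(z). The function names are hypothetical and the check uses plain FIR convolution.

```python
import numpy as np

def expand_taps(h, M):
    """Taps of H(z^M): M-1 zeros inserted between the taps of H(z)."""
    h = np.asarray(h, dtype=float)
    g = np.zeros((len(h) - 1) * M + 1)
    g[::M] = h
    return g

def filter_then_downsample(x, h, M):
    """Filter at the high rate with H(z^M), then keep every M-th output sample."""
    return np.convolve(x, expand_taps(h, M))[::M]

def downsample_then_filter(x, h, M):
    """Keep every M-th input sample, then filter at the low rate with H(z)."""
    return np.convolve(np.asarray(x, dtype=float)[::M], h)
```

The second form does the same job with M times fewer multiplications per output, which is the reason such identities are used to push filters to the low-rate side of a sampler.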
In multirate DSP networks, since the upsampling and 
downsampling processes are very often realized with 
PSC and SPC (as in polyphase decompositions of 
filters and modulators), one will inevitably deal with 
various combinations of commutators and other signal 
processing components in simplifying the networks. 
We have identified a series of identities associated with 
the cascades of commutators with filters, modulators, 
single up/down-samplers, etc., which are found 
useful in multirate DSP network transformations.
Some of the MSFG identities and transform pairs are 
listed, without proof, in Table 2. The mapping of a 
multirate DSP algorithm into the (hardware) 
implementation structure can thus be done by a series 
of MSFG transformations. We shall show how the 
MSFG identities and transforms are applied to derive 
an efficient DEMUX structure in the following 
example.
Example
The frequency DEMUX given in this example is to 
channelize an input multi-frequency (MF) TDMA 
signal consisting of five carriers. Each MF-TDMA 
channel is shared by multiple users through TDMA 
and has a symbol rate of 1 Msymbol/s. 
The roll-off factor is assumed to be 0.5, and the channel 
spacing is 3 MHz. If two samples per symbol are 
required by the demodulators which follow, the desired 
output sampling rate of the DEMUX will be 2 MHz. 
Though ideally the real input frequency multiplex can 
be sampled at the Nyquist rate, guard bands on both 
the lower and upper edges of the frequency multiplex are 
necessary due to the imperfection of anti-aliasing 
filtering at the analogue stage. To simplify the 
demultiplexing stage, these guard bands are normally 
chosen to be integer multiples of the channel spacing. 
In this case, the upper and lower guard bands are both 
set to one channel spacing. Consequently, the input 
sampling frequency is 2×(5+2)×3 = 42 MHz. The 
spectrum of the sampled input signal is illustrated in 
Figure 1(a). It is evidently an odd-stacking real FDM. 
To be able to use real channel filters and to avoid using 
the GDFT (generalized DFT), which is more complex than 
the DFT, the sampled signal needs to be in even channel 
stacking. The odd-to-even channel stacking conversion 
can be done by using a trivial π/2-frequency-shift 
(Figure 1(b)), which requires no arithmetic operations. 
The key characteristics of the channel filter are depicted 
in Figure 1(c).
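The rate arithmetic of this example, and the reason the π/2-frequency-shift is called trivial, can be sketched as follows. The constant and function names are hypothetical, and the direction of the shift (multiplication by j^n rather than (-j)^n) is an assumption. In software the unit factors still appear as multiplications; in hardware they reduce to sign changes and routing between the real and imaginary paths.

```python
import numpy as np

# Values quoted in the example (MHz and Msymbol/s).
N_CARRIERS, N_GUARD_CHANNELS = 5, 2
CHANNEL_SPACING_MHZ = 3
SYMBOL_RATE_MSPS = 1
SAMPLES_PER_SYMBOL = 2

fs_in_mhz = 2 * (N_CARRIERS + N_GUARD_CHANNELS) * CHANNEL_SPACING_MHZ
fs_out_mhz = SAMPLES_PER_SYMBOL * SYMBOL_RATE_MSPS
decimation = fs_in_mhz // fs_out_mhz

def quarter_rate_shift(x):
    """Multiply a real sequence by j**n; the factors cycle through 1, j, -1, -j."""
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(x), dtype=complex)
    y[0::4] = x[0::4]          # factor  1: pass to the real path
    y[1::4] = 1j * x[1::4]     # factor  j: route to the imaginary path
    y[2::4] = -x[2::4]         # factor -1: negate on the real path
    y[3::4] = -1j * x[3::4]    # factor -j: negate on the imaginary path
    return y
```

Here `fs_in_mhz`, `fs_out_mhz`, and `decimation` evaluate to the 42 MHz, 2 MHz, and factor of 21 used in the remainder of the example.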
The DEMUX functionality can be described by the 
MSFG shown in Figure 2. A direct implementation of 
this structure would require five (ignoring the guard 
channels) identical lowpass channel filters and a bank 
of frequency shifts. According to MFB theory, a uniform 
modulated filter bank can be efficiently realised with 
just one shared polyphase decomposed channel filter 
plus a DFT (hence polyphase-DFT), a significant 
saving in computation over the direct structure.
[Table 1 body: columns MSFG node, Functions, Short-hand notation, Output waveform; rows MOD, OD, US, DS, SH, USH, SPC, PSC]
Figure 1 A 5-channel MF-TDMA signal and demultiplexing (Fs = 42 MHz): (a) real frequency multiplex with two guard-bands; (b) even-stacking by π/2-frequency-shift; (c) channel filter characteristics (raised-cosine)
Figure 2 A 5-channel DEMUX lowpass model
To derive the polyphase-DFT structure, let us consider 
the k-th channel of the DEMUX in Figure 2 and redraw it 
in Figure 3(a). Obviously, any operation that involves 
the channel index k, or that follows one that does, cannot 
be shared by different channels. Therefore, the basic theme of the 
MSFG simplification is to delink operations (especially 
the computationally demanding ones) from the channel 
index k and to move them ahead of any operations 
associated with k. Another principle of MSFG 
simplification is to move operations to places where the 
sampling rate is as low as possible to reduce the 
computation rate.
By successive MSFG transforms provided in Table 2, 
Figure 3(a) can be transformed into an equivalent 
structure shown in Fig. 3(f).
Figure 3 Step-by-step simplification of the k-th channel demultiplexing, panels (a) to (f)
The structure of Figure 3(f) shows that the polyphase 
filter bank is independent of the channel index k. 
Therefore, it can be shared by all the channels. Since 
the frequency-shift network in the figure simply
performs the k-th component of a 7-point DFT, the 
whole DEMUX network can be constructed by a 
polyphase filter bank followed by a 7-point DFT, as is 
shown in Figure 4.
Considering that the input signal is real and that the 
complex filter $h'(n) = W_4^{n} h(n) = j^{n} h(n)$ has coefficients that are 
either purely real or purely imaginary, the actual 
computation load of the complex polyphase filter bank 
is the same as that of the real prototype filter h(n). 
Furthermore, all the computations are carried out at the 
low sampling rate of 2 MHz and the π/2-frequency-shift 
bank requires no arithmetic effort. Hence we 
conclude that the derived polyphase-DFT structure is 
computationally optimum.
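The equivalence between the direct structure and the shared polyphase filter plus DFT can be checked numerically. The sketch below is a generic illustration under stated assumptions, not the derivation of Figure 3: it uses K = 7 channels and decimation M = 21 as in the example, assumes the frequency shift is exp(-j2πkn/K), assumes M is a multiple of K, and omits the π/2 even-stacking step. Function names are hypothetical.

```python
import numpy as np

def direct_demux(x, h, K, M):
    """Direct structure: per-channel frequency shift, lowpass filter, decimate by M."""
    n = np.arange(len(x))
    rows = []
    for k in range(K):
        shifted = x * np.exp(-2j * np.pi * k * n / K)   # move channel k to baseband
        filtered = np.convolve(shifted, h)[:len(x)]     # causal FIR channel filter
        rows.append(filtered[::M])                      # keep every M-th output
    return np.array(rows)

def polyphase_dft_demux(x, h, K, M):
    """Shared K-branch polyphase filter followed by a K-point inverse DFT."""
    if M % K:
        raise ValueError("this sketch assumes M is a multiple of K")
    taps = -(-len(h) // K) * K                          # length padded to a multiple of K
    hp = np.concatenate([h, np.zeros(taps - len(h))])
    xp = np.concatenate([np.zeros(taps - 1), x])        # x(n) = 0 for n < 0
    n_out = -(-len(x) // M)
    y = np.empty((K, n_out), dtype=complex)
    for m in range(n_out):
        window = xp[m * M : m * M + taps][::-1]         # window[l] = x(mM - l)
        u = (hp * window).reshape(-1, K).sum(axis=0)    # branch r sums taps h(qK + r)
        y[:, m] = K * np.fft.ifft(u)                    # y_k = sum_r u_r exp(+j2πkr/K)
    return y
```

In the polyphase form the filter is evaluated once per output sample at the low rate and shared by all K channels, whereas the direct form runs K separate filters at the input rate.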
Figure 4 Polyphase-DFT structure of a multicarrier DEMUX
III. Optimization of DEMUX VLSI architecture
With the development of VLSI technology and the 
increasing demand for high speed, low mass, and low 
power consumption digital processing payload, ASIC 
seems the appropriate and promising solution to 
MCDD. This requires efficient mapping from DSP 
algorithms to VLSI architectures. In a strict sense, 
there is no optimum mapping which can 
simultaneously minimize the complexity and power 
consumption and maximize the throughput, as these 
requirements are always in conflict. In this paper we 
propose a systematic mapping approach which treats 
the mapping as a multi-objective optimization problem 
based on a multirate VLSI complexity model. Trade-offs 
between VLSI complexity, power consumption, 
and throughput can be made by choosing bit-serial or 
bit-parallel architectures of basic components at 
different sampling frequencies. Another advantage of 
this approach is that it provides technology 
independent estimations for complexity, power
consumption, and the throughput allowing objective 
comparisons between architectures.
Complexity and power consumption in CMOS VLSI
VLSI complexity  The most appropriate measure 
for VLSI complexity is the chip area (silicon area), 
which is directly related to the yield and the 
manufacturing costs. To estimate it one needs to 
decompose the circuit into blocks whose area can be 
obtained either by resorting to knowledge-based expert 
systems or from the designer's own knowledge. Reliable 
estimation of chip area depends not only on the 
computational complexity and the VLSI architecture, but 
also on area factors such as the communications between 
building blocks (cells, macros, or basic functional 
blocks), the floor plans which affect the layout routing, 
and the technology being used.
Since we aim at a comparative study of 
the efficiency of different architectures, which should 
ideally be technology independent, the gate count is 
chosen as the measure for complexity. However, the 
final decision on the VLSI architecture may still be 
affected by the area factors if a severe discrepancy 
between the two measures occurs. To keep the gate 
count consistent with the chip area, it is good 
practice to avoid, or to minimize, the global 
communications between blocks, for example by using 
pipeline and systolic architectures.
CMOS VLSI power consumption  For MOS 
systems, the average power consumed can be 
approximated by the total switching power plus one 
half of the d.c. power that would result if all the MOS 
transistors were on 9. Because a CMOS transistor 
requires virtually no static d.c. current, the static d.c. 
power of CMOS circuits can be ignored. Therefore, 
only the switching power will be considered for the 
power consumption estimation, which is determined by 
the average number of active circuit cells (gates, or 
transistors). As a result, the total power consumption of 
a CMOS circuit with $G$ gates and clocked at $f_{clock}$ can be 
approximated by $P = p\,\alpha\,G\,f_{clock}$, where $\alpha$ is the statistical 
average percentage of active gates and $p$ is the power 
dissipated per active gate per unit switching frequency 
(mW/gate/MHz).
Complexity and power estimation for multirate DSP
Multirate system complexity model  It is assumed 
that time-redundancy (time-multiplexing) may only 
be considered at higher functional levels and is not 
allowed at the basic component level. A multirate 
system $S_{mdsp}$ consisting of $I$ types of basic components 
with $J$ different sampling frequencies can be, as far as 
hardware complexity and power consumption are 
concerned, represented by $I \times J$ 4-element tuples, 
$(N_{ij}, G_{ij}, \alpha_{ij}, f_j)$, $0 \le i < I$, $0 \le j < J$, in which the first three 
elements are respectively the component count, the 
complexity, and the average percentage of active gates 
of component $i$ at the sampling frequency $f_j$. We can 
graphically illustrate the complexity model of the 
multirate DSP system in Figure 5.
Figure 5 A multirate VLSI system representation [grid of basic components OP_0, OP_1, OP_2, ... against sampling frequencies f_0, f_1, f_2, ..., each cell holding (N_ij, G_ij, α_ij)]
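Alternatively, it can be described by
$$S_{mdsp} \triangleq (\mathbf{N}, \mathbf{G}, \boldsymbol{\alpha}, \mathbf{f}) \qquad (1)$$
where the matrices $\mathbf{N} = [N_{ij}]$, $\mathbf{G} = [G_{ij}]$, and $\boldsymbol{\alpha} = [\alpha_{ij}]$. The 
vector $\mathbf{f}$ has components $f_j$, $0 \le j < J$.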
Complexity, power, and clock frequency 
estimation With the above system representation, the 
expressions for the complexity, power consumption, 
and the clock frequency of the circuit are given below,
$$C_{total} = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} N_{ij}\, G_i^{(\delta_{ij})} \qquad (2a)$$
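$$P_{total} = p\, f_0 \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} \alpha_i^{(\delta_{ij})}\, G_i^{(\delta_{ij})}\, N_{ij}\, \lambda_j\, L_i^{\delta_{ij}} \qquad (2b)$$
$$f_{clock} = f_0 \max_{i,j} \left\{ \lambda_j\, L_i^{\delta_{ij}} \right\} \qquad (2c)$$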
where $\lambda_j = f_j / f_0$ is the normalized sampling frequency of 
$f_j$, $L_i$ is the word length of component $i$, and $\delta_{ij} = 0$, $1$, 
or $-1$ correspond respectively to the cases that the 
component $i$ at $f_j$ is bit-parallel, is bit-serial, or does 
not exist. Accordingly, $G_i^{(1)}$ and $G_i^{(0)}$ are the required 
numbers of gates to implement component $i$ using bit-serial 
and bit-parallel architectures respectively, and 
a similar definition applies to $\alpha_i^{(1)}$ and $\alpha_i^{(0)}$. For $\delta_{ij} = -1$, 
we let $G_i^{(-1)} = 0$ and $\alpha_i^{(-1)} = 0$. Alternatively, if technology 
and throughput independent estimations are desired, 
we can use the normalized power and clock frequency 
estimations, $\bar{P}_{total} = P_{total} / (p f_{clock})$ and $\bar{f}_{clock} = f_{clock} / f_0$.
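The estimation can be sketched as a small function that takes one configuration and returns the three figures of merit. This is a sketch under loudly stated assumptions, not the paper's exact formulas: it assumes that a bit-serial component of word length L_i working at normalized sampling rate λ_j is clocked at L_i·λ_j, that a bit-parallel one is clocked at λ_j, that switching power is summed per component as count × activity × gates × clock, and that the system clock is the largest component clock. The function name and argument layout are hypothetical.

```python
import numpy as np

def estimate(N, G_par, G_ser, a_par, a_ser, L, lam, delta):
    """Return (gate count, normalized power, normalized clock) for one configuration.

    N, delta : I x J arrays (component counts; 0 = bit-parallel, 1 = bit-serial, -1 = absent)
    G_*, a_* : length-I gate counts and activity factors per architecture
    L        : length-I word lengths; lam : length-J normalized sampling rates
    """
    N = np.asarray(N, dtype=float)
    delta = np.asarray(delta)
    col = lambda v: np.asarray(v, dtype=float)[:, None]
    exists, serial = delta >= 0, delta == 1
    G = np.where(serial, col(G_ser), col(G_par))
    a = np.where(serial, col(a_ser), col(a_par))
    # Assumed component clock: word length x sampling rate when bit-serial.
    clk = np.where(serial, col(L), 1.0) * np.asarray(lam, dtype=float)[None, :]
    gates = float(np.sum(np.where(exists, N * G, 0.0)))
    power = float(np.sum(np.where(exists, N * a * G * clk, 0.0)))
    clock = float(np.max(np.where(exists, clk, 0.0)))
    return gates, power, clock
```

Power is returned in units of p·f_0 and the clock in units of f_0, so the result does not depend on the technology constant or on the input sampling rate.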
The configuration matrix Δ  The matrix $\Delta = [\delta_{ij}]_{I \times J}$ 
is called the configuration matrix as it determines the 
choice of bit-serial or bit-parallel architecture of the basic 
components. Hence the VLSI complexities are 
functions of Δ. By definition, each of the $I \times J$ 
components in Δ is 1, 0, or −1. Once the system 
structure is decided, the positions and the number of −1s 
in Δ are fixed. Suppose that the number of −1s is $r$ 
($0 \le r < I \times J$) and the remaining $I \times J - r$ $\delta_{ij}$s in Δ take 
either 0 or 1; then a one-to-one mapping between an 
$(I \times J - r)$-bit integer, $p$, and a Δ matrix can be specified. 
Hence there are a total of $2^{I \times J - r}$ possibilities for Δ.
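One possible realization of this integer-to-matrix mapping is sketched below. The bit ordering is an assumption: the free entries are taken in row-major order with the most significant bit of p first, which is chosen because it reproduces the matrices quoted for the worked examples later in the paper (p = 24 and p = 48). The function name `decode_pattern` is hypothetical.

```python
import numpy as np

def decode_pattern(p, template):
    """Fill the free entries of a configuration template from the bits of p.

    template : I x J array with -1 where a component does not exist and any
               other value (for example 0) where the architecture is free.
    Returns an I x J array with entries -1 (absent), 0 (bit-parallel), 1 (bit-serial).
    """
    template = np.asarray(template)
    free = template != -1
    n_free = int(free.sum())
    if not 0 <= p < 2 ** n_free:
        raise ValueError("pattern does not fit the number of free entries")
    # Most significant bit first, free entries visited in row-major order.
    bits = [(p >> (n_free - 1 - b)) & 1 for b in range(n_free)]
    delta = template.copy()
    delta[free] = bits
    return delta
```

Enumerating `decode_pattern(p, template)` for every p from 0 to 2^(I×J−r) − 1 visits each admissible configuration exactly once, which is enough for an exhaustive search when the number of free entries is small.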
VLSI architecture optimization
Objective functions  Because the $\lambda_j$s and the $N_{ij}$s 
in Eqs. (2) are fixed for a given system and the $L_i$s, 
which are determined by the system's finite word 
length performance and the system dynamic range, are 
independent of architecture considerations, these 
parameters are known. The $\alpha_i^{(\delta)}$s and the $G_i^{(\delta)}$s 
depend solely on the specific designs of the basic 
components (they can be obtained from the pre-designed 
macros or libraries in the ASIC approach). Therefore, the 
objective functions are functions of Δ only. Hence the 
optimal VLSI architecture would be the one (if it 
exists) that minimizes the complexity and the power 
and maximizes the throughput. This can be achieved by 
minimizing the three objective functions of Eqs. (2), 
that is,
$$\min_{\Delta \in D} \{ C_{total}(\Delta) \} \qquad (3a)$$
$$\min_{\Delta \in D} \{ \bar{P}_{total}(\Delta) \} \qquad (3b)$$
$$\min_{\Delta \in D} \{ \bar{f}_{clock}(\Delta) \} \qquad (3c)$$
where $D$ is the value domain of Δ. Eq. (3c) gives the 
maximum throughput by minimizing the normalized 
clock frequency. Since the objective functions in Eqs. 
(3) are independent of technology and the input 
sampling rate, Eqs. (3) produce the optimal VLSI 
architecture.
In most cases, the global minimum of Eqs. (3) does not 
exist as the requirements are often in conflict. For 
instance, low complexity architecture is often 
accompanied by high power consumption. Another 
example is that when high throughput is desired, the 
system clock frequency will be increased which in turn 
will increase the power consumption. Therefore, the 
intelligent circuit design is to cope with the 
requirement trade-offs, for example, by assigning 
different weights to these requirements, or by imposing 
some constraints on them.
Sub-optimal criteria When the absolute optimal 
solution for multi-objective programming problem of 
Eqs. (3) does not exist, sub-optimal criteria and 
solutions have to be adopted. A practical approach to 
tackle these problems is to minimize the distance 
between the ideal point at which all the objective 
functions reach their minima and any point on the 
curve being defined by the objective functions. The 
multi-objective programming problem is thus reduced 
to minimizing a single objective function (the 
evaluating function), which can be easily dealt with by 
conventional linear or nonlinear programming 
approaches. The solution is generally not optimal for 
each individual objective function. It is nevertheless 
the one closest to the optima.
When one wishes to impose preferences on the 
optimization problem, a weighted evaluating function 
can be constructed.
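The minimum-distance criterion, with optional weights, can be sketched as follows. The min-max normalization of each objective to the range [0, 1] is an assumption made for this illustration, as is the Euclidean distance; the function name and the toy numbers in the usage note are hypothetical and are not results from the paper.

```python
import numpy as np

def min_distance_choice(objectives, weights=None):
    """Pick the candidate closest to the ideal point after min-max normalization.

    objectives : array with one row per candidate configuration and one
                 column per objective (all objectives are to be minimized).
    weights    : optional per-objective preferences (larger = more important).
    Returns (index of the chosen candidate, normalized objective table).
    """
    F = np.asarray(objectives, dtype=float)
    lo, hi = F.min(axis=0), F.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)      # avoid division by zero
    Fn = (F - lo) / span                        # ideal point maps to the origin
    w = np.ones(F.shape[1]) if weights is None else np.asarray(weights, dtype=float)
    dist = np.sqrt((w * Fn ** 2).sum(axis=1))   # weighted Euclidean distance
    return int(np.argmin(dist)), Fn
```

For three toy candidates with (complexity, power, clock) of (10, 5, 1), (4, 9, 3), and (5, 6, 1.5), the unweighted rule selects the third one: it is best in no single objective but lies closest to the ideal point. A large weight on the first objective moves the choice to the second candidate.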
It is also of practical interest in some circumstances 
that the absolute minimum of one of the objective 
functions in Eqs. (3) is strongly desired whilst the other 
two are less important, but can be subject to a preference 
of one over the other.
The optimization methods mentioned above are all 
unconstrained. Sometimes, however, we can be more 
tolerant of one (or two) of the objective functions in 
Eqs. (3) as long as it does not exceed a threshold value, 
and wish to minimize the remaining unconstrained 
ones; for instance, given an upper bound for 
the maximum system clock frequency, find the best 
configuration pattern which minimizes the complexity 
and the power. This constrained optimization problem 
can be treated as a de-dimensioned multi-objective 
optimization one.
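A sketch of the constrained variant: candidates that violate the bound are discarded, the bounded objective is dropped from the comparison, and the minimum-distance rule is applied to what remains. The min-max normalization, the function name, and the toy numbers in the usage note are assumptions for illustration only.

```python
import numpy as np

def constrained_choice(objectives, bound_col, bound):
    """Minimum-distance choice among candidates whose bounded objective is within limit.

    objectives : array, one row per candidate, one column per objective.
    bound_col  : column index of the constrained objective (e.g. clock frequency).
    bound      : largest admissible value of that objective.
    Returns the index of the chosen candidate, or None if none is feasible.
    """
    F = np.asarray(objectives, dtype=float)
    feasible = np.flatnonzero(F[:, bound_col] <= bound)
    if feasible.size == 0:
        return None
    rest = np.delete(F[feasible], bound_col, axis=1)   # de-dimensioned problem
    lo, hi = rest.min(axis=0), rest.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    dist = np.sqrt((((rest - lo) / span) ** 2).sum(axis=1))
    return int(feasible[np.argmin(dist)])
```

With toy rows (complexity, power, clock) of (10, 5, 1), (4, 9, 3), (5, 6, 1.5), and (3, 4, 6) and a clock bound of 3, the last row is excluded even though it has the lowest complexity and power, and the third row is selected.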
Examples
5-channel Polyphase-DFT DEMUX Consider 
the DEMUX structure shown in Fig. 4. We assume that 
the system is solely composed of multipliers, adders, 
shift registers/RAMs, and ROMs. The complexities 
(gate count) for these basic components are given in 
Table 3 based on our previous design 1.
Table 3 Complexities (gate count) of the basic components
Component                                  Number of gates
Serial multiplier (full-precision)         70Nb *
Parallel multiplier                        12Nb^2 + 44Nb - 20
Serial adder/subtractor                    22
Parallel adder/sub (carry-look-ahead)      20Nb
Delay/buffer (word)                        6Nb
ROM/register (word)                        6Nb
* Nb is the data word length, and is also assumed to be the coefficient length 
of the component.
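The gate-count expressions of Table 3 can be wrapped in a small lookup for use in design-space searches. The component keys and the function name are hypothetical, and the parallel multiplier entry is read as 12·Nb² + 44·Nb − 20, which assumes the "2" in the scanned table is an exponent.

```python
def gate_count(component, nb):
    """Gate count of one basic component for a word length of nb bits (Table 3)."""
    table = {
        "serial_multiplier": 70 * nb,                        # full-precision
        "parallel_multiplier": 12 * nb ** 2 + 44 * nb - 20,  # assumed Nb squared
        "serial_adder": 22,                                  # independent of nb
        "parallel_adder": 20 * nb,                           # carry-look-ahead
        "delay_word": 6 * nb,
        "rom_word": 6 * nb,
    }
    return table[component]
```

For an 11-bit word this gives 770 gates for the serial multiplier against 1916 for the parallel one, which illustrates why bit-serial arithmetic lowers the gate count at the price of a clock rate multiplied by the word length.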
There are two sampling frequencies in this example, 
the input sampling frequency $f_0 = 42$ MHz and the decimated 
sampling frequency $f_1 = f_0/21 = 2$ MHz. Hence the 
normalized sampling frequencies are $(\lambda_0, \lambda_1) = (1.0, 0.048)$. 
Multiplier, adder, ROM, and register are the four basic 
components and their word lengths are assumed to be 11, 
10, 8, and 8 bits respectively. For the α values, we assume 
$(\alpha_0^{(0)}, \alpha_1^{(0)}, \alpha_2^{(0)}, \alpha_3^{(0)}) = (0.40, 0.50, 0.20, 0.70)$ 
for bit-serial architectures, and 
$(\alpha_0^{(1)}, \alpha_1^{(1)}, \alpha_2^{(1)}, \alpha_3^{(1)}) = (0.50, 0.60, 0.20, 0.70)$ 
for bit-parallel architectures.
If the 7-point DFT is implemented with the Winograd 
algorithm, it requires 8 real multipliers and 36 real 
adders. The prototype filter length $N$ is assumed to be 336 
and the $\mathbf{N}$ matrix is estimated and given below
$$[N_{ij}] = \begin{bmatrix} 0 & 344 \\ 0 & 379 \\ 0 & 343 \\ 42 & 702 \end{bmatrix}$$
The Δ matrix should take the form of
$$\Delta = \begin{bmatrix} -1 & 0 \\ -1 & 0 \\ -1 & 0 \\ 0 & 0 \end{bmatrix}$$
Accordingly, the configuration pattern is a 5-bit 
integer ($p_5 = 0, 1, \ldots, 31$). Figure 7(a) shows the 
normalized complexities against the Δ-pattern $p$. 
Obviously, the global minimum does not exist. 
However, if we relax the constraint on power, a sub-optimal 
solution is found to be $p_5 = 24$, which gives the 
normalized complexities of 0.0, 0.89, and 0.0. That is, 
the complexity and the maximum sampling frequency 
reach their global minima. Translating it to the Δ-matrix 
gives
$$\Delta = \begin{bmatrix} -1 & 1 \\ -1 & 1 \\ -1 & 0 \\ 0 & 0 \end{bmatrix}$$
which means that the multipliers and adders at the 
lower sampling frequency are configured to use bit-serial 
structures and all the shift registers are 
bit-parallel.
Time-multiplexed PMDFT  In this example, we 
use the proposed complexity modelling method to 
optimize a TM-PMDFT structure (Figure 6), which is a 
low complexity implementation of the PMDFT structure 2 
with a time-shared common programmable FIR filter 
for all polyphase filters and a sequential systolic DFT.
Figure 6 Time-multiplexed PMDFT structure [blocks: SPC, inner productor, coefficient memory, PSC, control, sequential systolic DFT]
There are three sampling frequencies in the circuit: 
the input sampling frequency $f_0$, the decimated 
sampling frequency $f_1 = f_0/M$ for the shift register bank, 
and the interpolated sampling frequency $f_2 = L f_0$ for the 
rest of the functional blocks. We assume that $M = 8$, 
$L = 3$, and $f_0 = 16$ MHz. Hence $f_1 = 2$ MHz, $f_2 = 48$ MHz, 
and $(\lambda_0, \lambda_1, \lambda_2) = (1.0, 0.125, 3.0)$. The prototype filter 
length $N$ is assumed to be 72. The $\mathbf{N}$ matrix is estimated and 
given below
~ o  o :
o o  ;o o '
16 16 :
The Δ matrix has the form of
$$\Delta = \begin{bmatrix} -1 & -1 & 0 \\ -1 & -1 & 0 \\ -1 & -1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
Accordingly, the configuration pattern $p$ is a 6-bit 
integer ($p_6 = 0, 1, \ldots, 63$). Figure 7(b) shows the 
normalized objective functions against the Δ-pattern $p$. 
Again, the global minimum does not exist. However, a 
minimum distance sub-optimal solution of $p_6 = 48$ can 
be found, which gives a normalized complexity, 
power, and maximum sampling frequency of 
0.0, 0.36, and 1.0 respectively. In Δ-matrix form, we have
$$\Delta = \begin{bmatrix} -1 & -1 & 1 \\ -1 & -1 & 1 \\ -1 & -1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
which means that the multipliers and adders at $f_2 = 48$ 
MHz are configured to bit-serial architectures and all 
the shift registers are bit-parallel.
C om plexity, Power, an d  Max. S w itch in g  F requency
F "= ! com plex ity  I 1 pow er I i m ax. sw itch  freq .
1 .0   : -^---
0 .4 —
0 .2 -
0.0-j 
0. 20 .10. 30.
A- pattern: p
MSFG has more direct and clearer link to hardware 
structures than other DSP system representations. 
With its sampling node functions, MSFG has the 
potential to describe some digital network functions, 
like sampling, switching, etc., which can not be 
expressed through other representations.
The mapping of DEMUX structures onto VLSI can be 
guided by assessing the complexity, power, and 
throughput efficiencies of the resulting VLSI 
architectures.
Given system parameters and the DEMUX structure, 
the searching for the optimal VLSI architecture 
becomes the determination of the configuration matrix 
A . Optimization techniques can be used to determine 
optimal As. Trade-offs between complexity, power 
consumption, and throughput can be made by imposing 
different constraints on these quantities.
As the estimations for VLSI complexity and power 
consumption can be made technology-independent, the 
proposed mapping approach is useful in evaluating and 
studying design trade-offs for MFB structures and, 
more generally, for multirate DSP systems.
[Figure 7 plots: normalized complexity, power, and maximum switching frequency versus A-pattern p]
(a) A 5-channel polyphase-DFT DEMUX
(b) A TM-PMDFT filter bank
Figure 7 Normalized complexities as functions of configuration pattern
IV. Conclusions
Computationally efficient OBP frequency DEMUX 
structures can be derived via MSFG transforms. 
Ad hoc tricks and tedious mathematical 
manipulations in conventional design methods can be 
avoided.
Table 2 MSFG Identities and Transforms
[Signal-flow-graph diagrams of the pairs are not reproduced; labels and conditions only]
Identity/Transform          Pairs                 Conditions
Noble Identities            NB-1, NB-2
Sampling & Hold             SH-1, SH-2
Modulator-Sampler           MS-1, MS-2,           MS-1: modulators $\phi(n)$ and $\phi(mM)$
                            MS-3, MS-4            MS-3: modulators $\phi(m)$ and $\phi(nL)$
Sampler-Sampler             SS-1, SS-2, SS-3      SS-2: $w_M(n) = 1$ for $n = 0, \pm M, \pm 2M, \ldots$;
                                                  $w_M(n) = 0$ otherwise
                                                  SS-3: $\gcd(L, M) = 1$
Commutator Decomposition    CD-1, CD-2
Commutator-Commutator       CC-1, CC-2
Polyphase Filter            PF-1, PF-2            $H_i(z) = \sum_m h_i(m) z^{-m}$ are polyphase filters.
                                                  PF-1: $h_i(m) = h(mM + i)$
                                                  PF-2: $h_i(m) = h(mL + i)$
Polyphase Modulator         PM-1, PM-2            $\phi_i(m) = \phi(mN + i)$, $i = 0, 1, \ldots, N-1$
Modulated Filter            MF-1, MF-2
Commutator-Modulator        CM-1, CM-2            $\phi_i(m) = \phi(mL + i)$, $i = 0, 1, \ldots, L-1$
Commutator-Sampler          CS-1, CS-2            CS-1: $\gcd(L, M) = 1$, where l and m are the
                                                  solution of $mL - lM = 1$
                                                  CS-2: $\gcd(L, M) = 1$, where l and m are the
                                                  solution of …
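The polyphase filter condition $h_i(m) = h(mM + i)$ can be checked numerically. The sketch below is illustrative only: it assumes a causal FIR filter with zero initial state, and the function names are hypothetical. It compares filtering at the high rate followed by down-sampling by M against the sum of M polyphase branches evaluated at the low rate.

```python
def decimate_direct(x, h, M):
    """Filter x with h at the input rate, then keep every M-th output sample."""
    full = [
        sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))
        for n in range(len(x))
    ]
    return full[::M]


def decimate_polyphase(x, h, M):
    """Compute the same output branch by branch at the low rate."""
    n_out = (len(x) + M - 1) // M
    y = [0.0] * n_out
    for i in range(M):
        h_i = h[i::M]  # polyphase component: h_i(m) = h(mM + i)
        for m in range(n_out):
            for k, coeff in enumerate(h_i):
                idx = (m - k) * M - i  # input sample seen by branch i
                if 0 <= idx < len(x):
                    y[m] += coeff * x[idx]
    return y
```

Both functions evaluate $y(m) = \sum_n h(n)\, x(mM - n)$; the polyphase form only regroups the sum with $n = kM + i$, so each branch needs multiplications at the output rate rather than the input rate.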
References
[1] W. H. Yim, C. C. Kwan, F. P. Coakley and B. G. 
Evans, “Multi-carrier Demodulators for On-Board 
Processing Satellites,” Int. J. of Satellite 
Communications, Vol. 6, pp. 243-251, 1988.
[2] W. H. Yim and F. P. Coakley, “Polyphase Matrix 
and Lattice Decomposition for Multirate Filters 
and Filter Banks,” Proc. IEEE International 
Conference on Acoustics, Speech and Signal 
Processing, ICASSP-92, Vol. 4, pp. 625-627, 
March 1992.
[3] C. C. Hsiao, “Polyphase filter for rational 
sampling rate conversions,” Proc. IEEE Int. 
Conf. on ASSP, pp. 2173-2176, April 1987.
[4] S. J. Campanella and S. Sayegh, “Flexible on-board 
demultiplexer/demodulator,” Proc. 12th 
AIAA International Communication Satellite 
Systems Conference, pp. 299-303, Arlington, VA, 
USA, March 1988.
[5] R. E. Crochiere and L. R. Rabiner, “Multirate Digital 
Signal Processing,” Prentice-Hall, Inc., 
Englewood Cliffs, 1983.
[6] W. H. Yim and F. P. Coakley, “A novel 
multicarrier demultiplexer for MF-TDMA 
systems with minimum multiplication rate,” The 
3rd ESA International Workshop on Digital 
Processing Techniques Applied to Space 
Communications, September 1992.
[7] R. Qi and F.P. Coakley, “A Gate Array Design of 
a Multi-channel Tree Filter Bank Demultiplexer,” 
The Third International Workshop on Digital 
Signal Processing Techniques Applied to Space 
Communications, ESTEC, Noordwijk, The 
Netherlands, Sept. 1992.
[8] P. P. Vaidyanathan, “Multirate digital filters, filter 
banks, polyphase networks, and applications: a 
tutorial,” Proceedings of the IEEE, Vol. 78, No. 1, 
pp. 56-93, January 1990.
[9] C. Mead and L. Conway, “Introduction to VLSI 
Systems,” Addison-Wesley Publishing Company, 
1980.
