Programmable architectures for the automated design of digital FIR filters using evolvable hardware by Hounsell, Benjamin Iain
Programmable Architectures for the 
Automated Design of Digital FIR Filters using 
Evolvable Hardware 
Benjamin Iain Hounsell 
A thesis submitted for the degree of Doctor of Philosophy. 
The University of Edinburgh. 
September 2001 
Abstract 
Continuing increases in both the size and complexity of digital signal processing (DSP) systems 
place a considerable demand on the design engineer to develop hardware architectures capable 
of fulfilling the growing functional requirements expected of modern DSP devices. Automated 
circuit design techniques provide the design engineer with a tool to more effectively generate 
high performance signal processors capable of meeting demanding specifications. 
Evolvable hardware (EHW) is a relatively new approach to automated circuit design which 
utilises advances in reconfigurable hardware technology and the power of modern microprocessors 
to generate circuits based on the principles of natural selection and evolution. This 
thesis investigates the suitability of software-biased and hardware-oriented programmable platforms, 
configured via EHW and tailored for the automated design of high performance DSP 
circuits. Performance criteria such as timing, area and circuit robustness are considered. 
A number of benchmarked DSP circuits were initially considered. It was shown that by using 
larger functional logic macros as building blocks EHW is more successful at generating circuit 
solutions than if only gate primitives are used. In addition, the circuits generated are of comparable 
or better performance than equivalent circuits developed using a standard digital design 
methodology. Results also indicated that for more complex DSP functions to be generated, 
EHW platforms must use larger functional blocks, constrained for a specific application. 
Finite Impulse Response (FIR) filters were identified as the key function of many DSP applications, 
and the multiplication unit was targeted as the performance critical component. A novel 
Programmable Arithmetic Logic Unit (PALU) was therefore developed as a functional building 
block suitable for automated digital filter design using EHW. The PALU replaces coefficient 
multiplication with a series of bit-shifts, additions and subtractions. Two distinct arrays of 
PALUs were developed based on conventional FPGA and PLA reconfigurable hardware architectures. 
Results show that a PLA architecture with 2 levels of hierarchical interconnect and 
column-based fixed tap outputs provides a platform most suited to automated filter design using 
the EHW technique. The PLA was also shown to be robust to faults covering up to 25% of 
the array when configured using EHW. 
Declaration of originality 
I hereby declare that the research recorded in this thesis and the thesis itself was composed and 
originated entirely by myself in the Department of Electronics and Electrical Engineering at 




Firstly, I would like to thank my supervisor Dr Tughrul Arslan for his excellent support and 
guidance during the last 3 years. 
Considerable thanks to Dr Alan Murray, Dr Alister Hamilton, Dr Gerard Alan and Dr Ahmet 
Erdogan for their worthwhile discussions and invaluable feedback throughout. 
Grateful thanks to Applied Materials who kindly provided the financial support for my thesis. 
Finally, many thanks to all those in the Department of Electronics and Electrical Engineering 
at the University of Edinburgh who made the last 3 years both rewarding and enjoyable. 
Contents 
Declaration of originality 
Acknowledgements 
Contents 
List of figures 
List of tables 
Acronyms and Abbreviations 
Nomenclature 
1 Introduction 
1.1 Automated Circuit Design 
1.1.1 Evolvable Hardware 
1.1.2 Evaluating Circuits Through Evolvable Hardware 
1.2 Finite Impulse Response Filters 
1.3 Contribution 
1.4 Thesis Outline 
2 Evolutionary Algorithms for Automated Digital Circuit Design 
2.1 Introduction 
2.2 An Overview of Evolutionary Algorithms 
2.2.1 Evolutionary Programming 
2.2.2 Evolutionary Strategies 
2.2.3 Genetic Programming 
2.2.4 Genetic Algorithms 
2.3 Genetic Algorithms in Evolvable Hardware 
2.3.1 Initialisation 
2.3.2 Selection 
2.3.3 Crossover and Mutation 
2.3.4 Fitness Function 
2.4 Applying Evolvable Hardware to Automated Circuit Design 
2.4.1 Gate Level and Functional Level Circuit Evolution 
2.4.2 Digital Circuit Design using Extrinsic Evaluation 
2.4.3 Digital Circuit Design using Intrinsic Evaluation 
2.4.4 Encoding Digital Circuits Using Evolvable Hardware 
2.5 Summary 
3 Generating DSP Circuits on the Virtual Chip EHW Platform 
3.1 Introduction 
3.2 The Virtual Chip Evolvable Hardware Platform 
3.2.1 Encoding a circuit within the chromosome 
3.2.2 Connecting Cells Within the Chromosome 
3.2.3 The Genetic Operators 
3.2.4 Circuit Evaluation with the Virtual Chip 
3.3 Implementation and Results 
3.3.1 Genetic Algorithm Performance Using Primitive and Functional Component Libraries 
3.3.2 Analysis of Timing and Area Performance 
3.4 Phased Evolution in the Virtual Chip 
3.4.1 Implementation and Results 
3.4.2 Limitations of Virtual Chip EHW Platform 
3.5 Summary 
4 FIR Digital Filtering with Multiplierless Architectures 
4.1 Introduction 
4.2 FIR Filter Theory 
4.2.1 Linear Phase FIR Filters 
4.3 FIR Filter Implementation 
4.3.1 Direct Form FIR Structure 
4.3.2 Transposed Direct Form FIR Structure 
4.4 Reduced Complexity FIR Filter Design 
4.4.1 Canonic Signed-Digit Encoding 
4.4.2 Primitive Operator Filters 
4.4.3 VLSI Implementations 
4.4.4 Design Adaptation and Fault Tolerance 
4.5 Overview of Programmable Platforms 
4.5.1 Performing Multiplication on PLDs using Distributed Arithmetic 
4.5.2 Dedicated Programmable Logic Devices 
4.6 Summary 
5 Developing a Programmable Framework for Filter Design using EHW 
5.1 Introduction 
5.2 Overview of EHW Platform 
5.2.1 Programmable Arithmetic Logic Unit 
5.3 Implementing the Genetic Algorithm 
5.3.1 Analysis of Genetic Algorithm 
5.4 Summary 
6 Reconfigurable platforms for FIR filter implementation using EHW 
6.1 Introduction 
6.2 Benchmark Filter Design 
6.2.1 Experimental Setup 
6.3 Field Programmable Gate Array (FPGA) Topology 
6.3.1 Interconnecting CLBs for an FPGA-based FIR Filter 
6.3.2 Configuring the FPGA-based FIR Filter 
6.3.3 FPGA-based FIR filter Parameters 
6.3.4 Investigation of Genetic Operator Parameters 
6.3.5 Performance Comparison of FPGA Topologies 
6.3.6 Graphical Representation of FPGA-Based FIR Filter 
6.4 Programmable Logic Array (PLA) Topology 
6.4.2 Configuring the PLA-based FIR Filter 
6.4.3 PLA-Based FIR Filter Parameters 
6.4.4 Investigation of Genetic Operator Parameters 
6.4.5 Performance Comparison of PLA Topologies 
6.4.6 Graphical Representation of PLA-Based FIR Filter 
6.5 Comparison of PLA and FPGA-Based Filter Platforms 
6.5.1 Further Investigations 
6.6 Summary 
7 Translating the Col2 PLA Topology into Hardware 
7.1 Introduction 
7.2 Synthesis and Performance Analysis of PLA-Based Filter 
7.2.1 Comparative analysis with RTL 'ideal' model 
7.2.2 Synthesis Details 
7.3 Fault Tolerant Characteristics of PLA-Based EHW Platform 
7.3.1 Introducing Faults into the PLA-Based FIR Filter 
7.3.2 Analysis 
7.3.3 Population Initialisation After Fault Detection 
7.4 Summary 
8 Summary and Conclusions 
8.1 Introduction 
8.2 Summary 
8.3 Conclusions 
8.4 Achievements 
8.5 Future Work 
8.6 Final Comments 
References 
A VHDL Code for DSP Circuits 
A.1 VHDL gate-level description of 2-bit multiplier 
A.2 7-bit pattern recognizer (one's voter) 
A.2.1 3-bit pattern recognizer 
A.3 A behavioural model of a two tone discriminator 
A.4 Schematic of 2x2-bit Parallel Multiplier Evolved by Miller et al. and Associated VHDL Code 
A.5 Schematic of 30-bit Parallel Multiplier Evolved by Miller et al. and Associated VHDL Code 
B Further Details of FPGA and PLA-Based EHW Platforms 
B.1 Postscript Templates of FPGA Interconnect Topologies for Graphical Representation 
B.1.1 Elements of Postscript That Are Common to FPGA Interconnect Templates 
B.1.2 Postscript Template for Alternating Feed-Forward Array (AFFA) FPGA Interconnect Topology 
B.1.3 Postscript Template for Continuous Feed-Forward Array (CFFA) FPGA Interconnect Topology 
B.1.4 Postscript Template for Continuous Feed-Forward Loop Array (CFFLA) FPGA Interconnect Topology 
B.2 Postscript Templates of PLA Interconnect Topologies for Graphical Representation 
B.2.1 Elements of Postscript That Are Common to PLA Interconnect Templates 
B.2.2 Postscript Template for Route 1 PLA Interconnect Topology 
B.2.3 Postscript Template for Route 2 PLA Interconnect Topology 
B.2.4 Postscript Template for Route 3 PLA Interconnect Topology 
B.2.5 Postscript Template for Route 4 PLA Interconnect Topology 
C Synthesis and Simulation Script for Generation of 6x5 PLA Core 
C.1 (title illegible in source) 
C.2 VHDL Leapfrog Testbench for Netlist Simulation 6x5 PLA Core 
D Publications 
D.1 Refereed Journals 
D.2 Refereed Conferences 
D.3 Refereed Workshops 
List of figures 
2.1 Example of 2-bit multiplexor represented using genetic programming tree structure 
2.2 Algorithmic flow of genetic algorithm 
2.3 Single-point crossover of two parent chromosomes, generating two offspring 
2.4 Bit-flipping mutation example in bit-level chromosome encoding 
2.5 Comparison of a standard gate-level encoding with the novel macro-based encoding to describe a Fulladder with additional logic 
3.1 Chromosome structure defining sections for specific circuit description 
3.2 Generic style of macro and other logic elements provided to component library for the evolution of arithmetic circuits 
3.3 Example of macro-based encoding describing a macro element (fulladder) and its connectivity 
3.4 Example of broken element connectivity resulting from crossover 
3.5 Four mutation operators used by the genetic algorithm 
3.6 Graphical representation of the Virtual Chip environment, evolving a 2-bit multiplier within a population size of N 
3.7 Execution flow and coding format of the genetic algorithm and Virtual Chip evaluation environment 
3.8 Output response of 2-frequency discriminator from behavioural HDL model 
3.9 Typical Number of Generations required by Genetic Algorithm to evolve DSP circuit structures using primitive and functional component libraries 
3.10 Circuit diagram of 7-bit pattern recogniser generated by genetic algorithm using functional library 
3.11 Circuit diagram of 7-bit pattern recogniser generated by genetic algorithm using functional library with redundant elements removed 
3.12 Circuit diagram of fully optimised 7-bit pattern recogniser generated by genetic algorithm using functional library 
3.13 Example of Phased Evolution For The Automated Design of a 30-bit Multiplier 
3.14 Example of unsuccessful evolution of 30-bit multiplier using single-step EHW technique 
3.15 Example of synthesised 30-bit multiplier generated using phased evolution technique within the Virtual Chip EHW platform 
3.16 Schematic of sub-circuit relating to functionality of output 2 of 30-bit multiplier 
3.17 Schematic of sub-circuit relating to functionality of output 5 of 30-bit multiplier 
4.1 Filter Specifications for passband ripple (1 ± δ1) and stopband attenuation (δ2) 
4.2 Convolution in frequency domain for (a) desired amplitude response; (b) frequency response of input signal; (c) actual frequency response from FIR filter 
4.3 Impulse response of causal FIR filter shifted M times 
4.4 Direct form FIR filter implementation 
4.5 Folded direct form FIR filter implementation (N even) 
4.6 Multiply accumulate (MAC) operator 
4.7 Transposed direct form FIR filter implementation 
4.8 Folded transposed direct form FIR filter implementation 
4.9 Example Shift-add Approach 
4.10 Basic FPGA interconnect structures and CLB layout 
4.11 Example of a PLA architecture from the Xilinx XC9500 series 
4.12 Distributed arithmetic processor 
4.13 Implementation of an N-tap FIR filter using distributed arithmetic 
5.1 Architectural overview of EHW platform for FIR filter implementation 
5.2 Programmable ALU for Multiplierless FIR Filtering 
5.3 Schematic of EHW platform including units comprising genetic algorithm and programmable platform (FPGA/PLA) 
5.4 Schematic of MEMControl unit for memory read/write control 
5.5 Schematic of Fitness-Unit for calculating quality of PLA/FPGA configurations for a given set of filter coefficients 
5.6 Schematic of Selection Unit implementing two way tournament selection 
5.7 Schematic of Crossover-Unit which implements genetic operators crossover and mutation in order to generate new offspring solutions 
5.8 Overview of waveform produced by genetic algorithm in EHW platform 
6.1 Transfer function for 31-tap low-pass FIR Filter 
6.2 Configurable logic block (CLB) for FPGA including routing to and from PALU 
6.3 Various routing topologies for interconnecting PALUs in FPGA structure 
6.4 Various output topologies for FPGA structure 
6.5 FPGA control of FIR filter input X(n); including position of input control string within FPGA string encoding 
6.6 Example configuration string for 4x4 FPGA-based FIR filter with LSIS, AFFA and EOS 
6.7 Example FPGA configuration of 5-tap primitive operator filter 
6.8 Performance of various FPGA interconnect and coefficient output topologies to autonomously generate a 31-tap low-pass FIR filter. L-shaped input sequence (LSIS) employed 
6.9 Performance of various FPGA interconnect and coefficient output topologies to autonomously generate a 31-tap low-pass FIR filter. Base-line input sequence (BLIS) employed 
6.10 Example FPGA configuration of 31-tap low-pass filter 
6.11 PALU re-use map from FPGA configuration of 31-tap low-pass filter 
6.12 PLA architecture and interconnect overview 
6.13 Various Interconnect Topologies for PLA 
6.14 Layout of configuration string for programming PLA 
6.15 Example PLA configuration of 5-tap primitive operator filter 
6.16 Performance of PLA topologies to autonomously generate a 31-tap low-pass FIR filter 
6.17 Example PLA configuration of 31-tap low-pass filter 
6.18 Performance of Col2 and Row3 PLA topologies to autonomously generate a 20-tap Hilbert transform FIR filter 
7.1 Example of reduced connectivity between PALUs 
7.2 Performance of Col2_reduced and Col2_6x16 PLA topologies to autonomously generate a 31-tap low-pass FIR filter 
7.3 Logic area of PLA core as a result of synthesis for increasing operational speeds 
7.4 Critical delay path through PLA architecture 
7.5 Simulation waveform of 6x5 PLA Core VHDL netlist synthesised at 10 MHz 
7.6 "Stuck-at-Zero" fault topologies covering PLA 
7.7 Analysis of Col2_11x16 PLA architecture with increasing percentages of faulty PALUs 
7.8 Schematic showing configuration of low-pass FIR filter on PLA with 13% faults 
7.9 Fitness performance of filter evolved on PLA based on various methods of generating the initial population of configuration-strings 
A.1 2x2-bit parallel multiplier evolved by Miller et al. 
A.2 30-bit parallel multiplier evolved by Miller et al. 
List of tables 
2.1 Boolean logic look-up table of 2-bit parallel multiplier 
3.1 Primitive and functional logic elements available to genetic algorithm within Virtual Chip EHW platform 
3.2 Comparison of DSP Circuits Generated by Genetic Algorithm Using Different Logic Library Implementations 
3.3 Performance of arithmetic circuits in terms of circuit complexity and operation speed 
3.4 Performance of GA-Based Arithmetic Circuits in Terms of Area and Operation Speed After Optimisation 
3.5 Comparing 30-bit multiplier evolved using Virtual Chip EHW platform with that of functionally equivalent circuits generated with Miller's EHW platform and by using standard digital CAD techniques 
3.6 Average Number of Generations Taken by Phased Evolution to Evolve Sub-circuits For Each Output of 30-bit Multiplier 
3.7 Success of Virtual Chip EHW platform to generate 4-bit multiplier using phased evolution 
4.1 Example of CSD encoded coefficients and their 2's complement equivalent 
4.2 Contents of LUT for K = 4 input data vectors 
6.1 Non-zero coefficients required for response of 31-tap low-pass filter 
6.2 Performance of FPGA connection topologies in generating the 31-tap low-pass filter configured using genetic algorithm with and without crossover 
6.3 Performance of FPGA connection topologies in generating the 31-tap low-pass filter configured using genetic algorithm with variable mutation rates 
6.4 Performance of PLA connection topologies in generating 31-tap low-pass filter configured using genetic algorithm with and without crossover 
6.5 Performance of PLA connection topologies in generating 31-tap low-pass filter configured using genetic algorithm with variable mutation rates and no crossover employed 
Acronyms and Abbreviations 
AAOS Alternating Arrow Output Sequence 
AFFA Alternating Feed-forward Array 
ALU Arithmetic Logic Unit 
AOOS Alternating Orthogonal Output Sequence 
ASIC Application Specific Integrated Circuit 
BLIS Base-line Input Sequence 
BLOS Base-line Output Sequence 
CFFA Continuous Feed-forward Array 
CFFLA Continuous Feed-forward Loop Array 
CSD Canonic Signed Digit 
CLB Configurable Logic Block 
DA Distributed Arithmetic 
DSP Digital Signal Processing 
EA Evolutionary Algorithm 
EHW Evolvable Hardware 
EOS Edged Output Sequence 
EP Evolutionary Programming 
ES Evolutionary Strategies 
FIR Finite Impulse Response 
FPGA Field Programmable Gate Array 
GA Genetic Algorithm 
GP Genetic Programming 
HDL Hardware Description Language 
IC Integrated Circuit 
LSB Least Significant Bit 
LSIS L-Shaped Input Sequence 
LUT Look-up-table 
MAC Multiply Accumulate 
MSB 	Most Significant Bit 
MUX Multiplexer 
PALU Programmable Arithmetic Logic Unit 
PLA 	Programmable Logic Array 
PLD 	Programmable Logic Device 
POF 	Primitive Operator Filter 
RTL 	Register Transfer Language 
VHDL Very High Speed Integrated Circuit Hardware Description Language 
Nomenclature 
AAOStaps Number of CLBs available as potential taps using AAOS 
AOOStaps Number of CLBs available as potential taps using AOOS 
BLOStaps Number of CLBs available as potential taps using BLOS 
Bi Output Bits 
Di Boltzmann probability distribution 
EOStaps Number of CLBs available as potential taps using EOS 
F Bit-wise fitness 
F Distributed arithmetic multiplication function 
H(ω) FIR filter frequency response 
HA(ω) Actual amplitude response of FIR filter 
HD(ω) Desired amplitude response of FIR filter 
Ha(ω) Convolution of HA(ω) with HD(ω) 
H(z) FIR filter transfer function 
I Total number of bits required to determine FPGA input of X(n) 
λ Offspring population 
μ Parent population 
PAL U 7.1 Number of bits required to program 1 PALU 
PM Bit-wise mutation probability 
Q. Coefficient fitness score for evolvable hardware 
Number of control bits required for routing MUX 
se Control bits of left-shift operator for first column of PALUs 
SFPGA Total bit length of configuration string for FPGA-based filter 
Si Search Space using Virtual Chip 
s(n) FIR filter sample response 
SPLA Total bit length of configuration string for PLA-based filter 
W Bit length required to encode 1 PALU 
y(n) FIR filter difference equation 
XK Input string description using distributed arithmetic 
X(n) 	FIR filter input 
Xwidth Number of PALU Columns 




1.1 Automated Circuit Design 
The continued development of more complex electronic devices with increasing integrated 
functionality translates into an increase in the size and complexity of the digital systems employed. 
This places a considerable demand on the design engineer to develop hardware architectures 
capable of fulfilling the growing functional requirements expected from these modern 
electronic devices, such as mobile phones and personal digital assistants (PDAs). Digital 
signal processing (DSP) systems are extensively used in many electronic devices, performing 
tasks from signal filtering to data compression and real-time video streaming. Performance 
constraints such as operational speed, physical area, low power consumption, design portability 
and device reliability, all of which contribute directly to design complexity, are increasingly 
dominant factors when developing very large scale integrated (VLSI) silicon devices. This is 
especially important for highly constrained portable electronic devices such as mobile phones. 
Automated circuit design techniques provide the design engineer with a tool to more effectively 
generate high performance electronic processors capable of meeting these demanding specifications. 
Circuit complexity, physical hardware constraints, and device flexibility and adaptation 
are all aspects of circuit design which can benefit from the design automation paradigm. 
A number of automated design techniques for specific types of DSP circuit have been proposed, 
including the automated design of high speed multiplication-and-accumulation circuits [1] 
and VLSI digital FIR filters [2]. To obtain the required functionality, either an expert system 
or heuristic knowledge of the system under design is often needed, or design parameters and 
algorithms, often specific to the application, must be painstakingly developed. More recent 
developments in circuit design automation have resulted in the emergence of an independent and 
expanding field of research termed evolvable hardware (EHW), which is based on a non-heuristic 
search technique. 
1.1.1 Evolvable Hardware 
Digital circuit design automation using EHW differs from traditional IC design techniques, 
which utilise a top-down or compartmentalised methodology in which complex systems are 
broken down and designed as smaller sub-systems. EHW instead approaches the design problem 
as a flat component hierarchy, generating a black-box of the completed circuit. This is 
achieved by using a number of circuit building blocks ranging from simple gate primitives to 
more complex digital signal processing elements. The size of building block relates to the 
level of design abstraction. When using gate primitives, the level of design abstraction can be 
likened to a bottom-up design approach, as many gates must be instantiated in order to achieve 
the desired circuit functionality. This requires considerable design effort. When more complex 
macro functions are employed, the design problem becomes more like a top-down approach, 
because much of the functionality of the desired circuit can be described using fewer building 
blocks and less design effort. Often in EHW the design abstraction is a combination of the 
two. Designs are generated autonomously via a group of stochastic optimisation techniques 
termed evolutionary algorithms (EAs). Evolutionary algorithms have been shown to be valuable 
in applications such as neural networks [3-5], task scheduling [6], VLSI routing [7,8] and 
networking within telecommunications systems [9]. These are tasks where the computational 
time needed to provide a solution grows exponentially with problem complexity; such problems 
are termed NP-complete. The automated design of a number of digital circuits has also been 
shown to be an NP-complete problem [10]. This has led to the development of a number of signal 
processing applications which utilise EHW design techniques. These include the development of 
an adaptive control for a myoelectric hand [11], the generation of novel multiplier circuits [12], 
and the exploitation of the physical properties of silicon in the design of a highly area efficient 
tone discriminator [13]. The flexibility and wide applicability of EHW for a range of automated 
design applications stems from the non-heuristic evolutionary algorithm it employs. The 
following three examples demonstrate how evolvable hardware can be applied to solve critical 
development issues associated with modern DSP circuit design. 
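Before turning to those examples, the evolutionary search described in this section can be made concrete with a small illustration. The following Python sketch is not the encoding or the genetic algorithm developed in this thesis; it is a minimal toy, under stated assumptions, of how a flat array of gate primitives might be evolved against a truth table. The gate library, the feed-forward cell encoding, the truncation selection with elitism, and every name used (GATES, XNOR_TABLE, evolve and so on) are hypothetical choices made only for illustration.

```python
import random

# Illustrative library of gate primitives (an assumption, not the thesis library).
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
}
GATE_NAMES = sorted(GATES)
N_INPUTS = 2   # primary inputs of the target circuit
N_CELLS = 4    # number of gate cells in each candidate circuit

# Hypothetical target function: 2-input XNOR, given as (inputs, output) rows.
XNOR_TABLE = [((0, 0), 1), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def random_gene(index, rng):
    """Cell `index` may read any primary input or any earlier cell (feed-forward)."""
    n_sources = N_INPUTS + index
    return (rng.choice(GATE_NAMES), rng.randrange(n_sources), rng.randrange(n_sources))


def random_genome(rng):
    return [random_gene(i, rng) for i in range(N_CELLS)]


def evaluate(genome, inputs):
    """Simulate the circuit; the output is taken from the last cell."""
    signals = list(inputs)
    for name, a, b in genome:
        signals.append(GATES[name](signals[a], signals[b]))
    return signals[-1]


def fitness(genome, truth_table):
    """Number of truth-table rows the circuit reproduces correctly."""
    return sum(evaluate(genome, x) == y for x, y in truth_table)


def crossover(p1, p2, rng):
    """Single-point crossover; validity is kept because gene i depends only on i."""
    cut = rng.randrange(1, N_CELLS)
    return p1[:cut] + p2[cut:]


def mutate(genome, rng, rate=0.25):
    """Replace each cell with a fresh random cell with probability `rate`."""
    return [random_gene(i, rng) if rng.random() < rate else g
            for i, g in enumerate(genome)]


def evolve(truth_table, pop_size=20, generations=200, seed=0):
    rng = random.Random(seed)
    pop = [random_genome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, truth_table), reverse=True)
        if fitness(pop[0], truth_table) == len(truth_table):
            break  # a fully correct circuit has been found
        parents = pop[: pop_size // 2]  # truncation selection (assumed here)
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - 1)]
        pop = [pop[0]] + children  # elitism: the best circuit always survives
    return max(pop, key=lambda g: fitness(g, truth_table))


if __name__ == "__main__":
    best = evolve(XNOR_TABLE)
    print(best, fitness(best, XNOR_TABLE), "of", len(XNOR_TABLE), "rows correct")
```

The sketch also hints at why the size of the building block matters: with only gate primitives the search must assemble every function cell by cell, whereas adding larger macro functions to the library would let fewer cells express the same behaviour.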
Example 1: Ongoing advances in silicon manufacture have resulted in the widespread adop-
tion of DSP hardware developed using deep sub-micron technologies. Commercially available 
applications utilising transistor sizes of 0.18 microns are now common place. System-on-chip 
(SoC) design methodologies have been developed to exploit the high transistor density inher-
ent in deep sub-micron devices. SoC therefore facilitates the integration of many smaller data 
processing tasks to form more complex signal processing applications, which often require 
tens of millions of transistors on a single chip. The time required to both design and test the 
complex functionality of SoC applications is regarded as the limiting factor in the product's
time-to-market. Circuits generated through EHW have been shown to reduce circuit area, and 
provide novel DSP architectures beyond those attainable through conventional design tech-
niques [14, 15]. Circuit test is considered an integral part of the design automation procedure. 
However, concerns over the complexity issues associated with rigorously evaluating and testing
the functional correctness of circuits developed using EHW present a number of problems for
the automated design of complex DSP applications [16]. EHW has therefore been cited as a
means of generating adaptive systems where the target function is not clearly understood such 
that test vectors can be added over an extended period whilst the device is in operation. 
Example 2: The integration of communication media, such as telecommunication and real-time
video images in mobile devices, requires systems capable of rapid data manipulation and adaptation
to both the changing requirements of the user, and the changing environment in which
the device is deployed. Programmable, multi-purpose architectures are therefore required to 
provide a number of signal processing tasks on demand. Micro-processors and programmable 
DSP devices have traditionally been employed for such applications. However general purpose 
Programmable Logic Devices (PLDs) are now increasingly favoured as they are low cost and
can be configured to produce user-defined architectures, specific to a signal processing task.
PLDs allow the design engineer to download circuit configurations onto hardware an almost
unlimited number of times. This provides the designer with a means of both implementing a
signal processing device, and testing it, without the need to fabricate a full-custom IC. Applications
using EHW have recently been introduced which exploit the programmability of PLDs
by modifying device functionality in real time. For example, Tufte and Haddow have developed 
a programmable digital signal filter implemented on a PLD and capable of self-configuring for 
new tasks using EHW [17]. 
Example 3: Many DSP devices are deployed under hostile operating conditions, for example
in commercial satellite communications where hardware deteriorates due to damage caused by 
harmful radiation, and in other inhospitable environments where human intervention is diffi-
cult or impossible. These systems must therefore maintain functionality despite factors such 
as severe temperature variation, radiation, and operational wear. However, architectures developed
using conventional fault tolerant design methodologies restrict DSP performance as
they reduce operational speed, and increase physical area [18]. EHW implementations of fault
tolerant applications reduce the physical resources required, either by designing built-in robustness
into the architecture as in [19,20], or providing a single fixed-sized resource capable
of adapting the system should it become damaged [21,22]. In the latter case programmable 
logic devices must be employed. 
1.1.2 Evaluating Circuits Through Evolvable Hardware 
Circuits generated through EHW are evaluated either extrinsically through software simulation, 
or intrinsically, by which a circuit design is transferred directly onto silicon and then evaluated.
Intrinsic evaluation has become feasible due to advances in recent years in PLD technologies, 
and has been applied to a wide range of DSP applications such as adaptive, online data com-
pression [23]. Both intrinsic and extrinsic evaluation approaches have their advantages, each 
suited to different applications, a taxonomy of which is presented in Chapter 2. As a result a 
wide range of programmable platforms have been developed which are specifically tailored for 
automated circuit design using EHW, and are also discussed in Chapter 2. 
1.2 Finite Impulse Response Filters 
Finite impulse response (FIR) filters constitute a key function of most DSP applications and 
are therefore typically embedded alongside other signal processing tasks which collectively 
comprise the overall system. The performance and portability of hardwired FIR filters are 
therefore important to the fast development of application specific DSP devices, such that an 
existing FIR architecture can be ported into a new application with minimum re-design and test 
overhead. 
General purpose programmable DSP chips, such as the TMS320 series from Texas Instruments, 
are generally not suitable for high speed FIR filtering, due to their single multiplier architec-
ture. However, their programmability makes them suited for filter applications where flexibility 
of the filter for a range of applications is the primary design constraint. Programmable logic 
devices such as those from Xilinx [24] are suitable for implementing dedicated, high perform-
ance FIR filters. Filter functionality can also be modified, although this requires implementing 
a new circuit configuration on the PLD which must first be developed by the design engineer. 
Both programmable DSPs and PLD devices contain considerable silicon redundancy as they 
provide additional functionality not required by an FIR filter architecture. Examples of this
are the logarithmic and exponential functions present in most DSP chips. Optimal filter performance
is often only achieved through custom ASIC design. However, this is at the expense of
increased design time and loss of flexibility in modifying the architecture once the device is 
fabricated. Regardless of the platform on which the filter is implemented, the multiplier stage 
is both the most complex and costly component to implement. 
Evolvable hardware has already been applied to the design of FIR filters. Applications have 
primarily focused on the optimisation of filter coefficients, which target the multiplication stage 
of the filter used in coefficient generation. The multiplier is widely recognised to be the primary 
performance constraint when implementing the FIR filter structure in hardware. Multipliers are
costly in terms of area, power, and signal delay. Several design techniques aim to reduce FIR 
filter complexity, and improve filter performance by targeting the multiplier. EHW techniques 
include the utilisation of coefficient encoding schemes designed to minimise the physical area 
of the coefficient multiplication stage [25]. Coefficient optimisation and hardware minimisation 
using EHW have also been investigated using 'multiplierless' filter architectures. This approach 
replaces the multiplier with a series of addition and shift operations reducing filter area at the 
cost of flexibility [2,26]. Both software simulation and intrinsic hardware evaluation have been 
used to determine filter performance, and each optimisation approach has its benefits. Each 
of the EHW platforms referenced has been designed to implement the optimisation technique
employed by the cited author. However, it is unclear which evaluation technique might be best 
suited for the hardware optimisation of a wide range of FIR filter applications, and also which 
programmable platform and filter optimisation approach might be most effective. 
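The multiplierless idea mentioned above can be illustrated with a minimal Python sketch, in which a fixed coefficient is applied to a sample using only shifts, additions and subtractions. The function name, the (sign, shift) term format and the example coefficients are illustrative assumptions and are not taken from any of the cited works.

```python
def shift_add_multiply(x, terms):
    """Multiply integer x by a constant expressed as signed power-of-two terms.

    Each term is (sign, shift) and contributes +/- (x << shift).
    The term list is a hypothetical encoding chosen for this sketch.
    """
    acc = 0
    for sign, shift in terms:
        if sign > 0:
            acc += x << shift   # add x * 2**shift
        else:
            acc -= x << shift   # subtract x * 2**shift
    return acc


# 7 = 8 - 1: one shift and one subtraction replace a full multiplier.
TIMES_7 = [(+1, 3), (-1, 0)]
# 10 = 8 + 2: two shifts and one addition.
TIMES_10 = [(+1, 3), (+1, 1)]
```

Because the decomposition is fixed for each coefficient, changing the coefficient means changing the shift-add structure, which is the loss of flexibility noted above.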
1.3 Contribution 
The work presented in this thesis provides an investigation into a number of novel program-
mable architectures specifically developed to autonomously implement digital FIR filters using 
evolvable hardware. Each programmable architecture is distinguished through a number of 
unique characteristics, which include topologies for programmable interconnect, various input 
and output configurations, and the level of programmable functionality available to the archi-
tecture. 
This has led to the development of a novel programmable architecture capable of implementing 
digital circuits using various levels of component functionality, provided by two distinct digital 
logic libraries. The architecture is therefore capable of examining the relationship between 
the level of component functionality available, and the success of EHW in generating a circuit
solution which operates under realistic physical circuit constraints [27,28]. 
Particular emphasis has been placed on the coefficient multiplication unit of the FIR filter, 
resulting in the development of two novel programmable architectures which replace explicit 
multiplication with a distributed series of bit-shifts, additions and subtractions, which must be 
successfully configured using EHW through an integrated evolutionary algorithm [29-32]. 
1.4 Thesis Outline 
This thesis therefore focuses on the automated design of coefficient multiplication hardware in 
digital FIR filters, designed using evolvable hardware. The objective of the research presented 
in this thesis is identified in the following statement: 
A programmable architecture tailored for evolvable hardware can be developed which is 
highly suited to the autonomous implementation of digital FIR filters. 
This thesis is organised as follows: 
Chapter 2 introduces the concept of evolutionary algorithms and how they are applied to 
evolvable hardware. The concepts of gate-level and functional-level evolution for automated 
circuit design are also introduced. Extrinsic and intrinsic evaluation techniques are examined, 
and example applications are identified for a range of DSP applications, including FIR filter 
design. 
Chapter 3 presents the first of three custom EHW platforms. This first platform is termed the 
'Virtual Chip'. The architecture of the Virtual Chip is discussed and a number of DSP circuits
are developed on the platform using both gate and functional-level evolution. Performance
comparisons between both approaches are made in terms of each circuit's operational speed and
physical area. Further performance comparisons are made with functionally equivalent DSP 
circuits generated using standard design methodologies, and similar DSP circuits developed 
via EHW from other published works. 
Chapter 4 presents an overview of FIR filter theory. A detailed examination of reduced complexity
FIR filter design techniques is also given, and includes an overview of multiplierless
filter architectures. The primitive operator filter (POF) design methodology is then identified 
and its advantages, limitations and applicability for design automation using EHW evaluated. 
Finally an overview of general purpose programmable logic devices is presented, particularly 
focusing on how these devices perform multiplication and implement digital FIR filters. 
Chapter 5 details the development of a Programmable Arithmetic Logic Unit (PALU) inspired 
by the POF approach. The development of a custom built genetic algorithm is also presen-
ted, which together with the PALU forms the backbone of the final two EHW platforms to be 
investigated. 
Chapter 6 presents the final two programmable platforms, one inspired by the general purpose 
FPGA architecture, the other by a standard PLA design. Both are tailored for FIR filter coeffi-
cient multiplication using primitive operator components, and designed to be configured using 
EHW. Various interconnect and coefficient output topologies are examined for each of the two 
platforms so as to determine the most effective programmable architecture for coefficient gen-
eration. An examination of the role of mutation and crossover in automated FIR filter design 
is also presented. A complex and challenging FIR specification is introduced as a benchmark 
and used to determine the performance of both the PLA and FPGA-based platforms based on a
number of criteria. 
Chapter 7 considers the design issues and compromises associated with translating the PLA 
architecture into a synthesisable netlist. Physical issues such as timing, interconnect density and 
drive strength are investigated. The netlisted PLA model is then compared with the original 
PLA architecture and evaluated using the same performance criteria identified in Chapter 6.
Finally, Chapter 7 investigates the ability of the PLA-based EHW platform to self repair in the
context of safety critical applications, or hostile environmental conditions. 
Chapter 8 analyses the conclusions obtained from each of the previous chapters, and critically
evaluates the thesis statement. Further improvements are suggested concerning the design
methodology behind all three EHW platforms, and the scope of the comparative analysis between
platforms is critically appraised.
Chapter 2 
Evolutionary Algorithms for 
Automated Digital Circuit Design 
2.1 Introduction 
Ongoing advances in silicon technology continue to yield faster and more powerful microprocessors
which have enabled design engineers to examine new methods of generating circuit
structures. One such method has become known as evolvable hardware (EHW), which
considers the automated design of electronic systems using both software simulation and pro-
grammable hardware technologies. This chapter presents a synopsis of evolutionary algorithms 
(EAs), and introduces the reader to the basic concept of each class of EA, along with example 
applications. The role of the genetic algorithm as the dominant approach taken in EHW is 
discussed in detail, and its adaptation for automated circuit design presented. The concepts 
of gate-level and functional-level evolution, reflecting the granularity of logic element used in 
EHW to generate more complex circuit structures, are discussed. In addition, a detailed overview
of circuit applications generated using both gate-level and functional level EHW is presented, 
and is segmented into EHW platforms which use either extrinsic or intrinsic approaches to circuit
evaluation. Finally, the effects of component granularity, and the choice of encoding used
to describe a circuit architecture are discussed, and their influence on the success in autonom-
ously generating circuits using EHW presented. 
2.2 An Overview of Evolutionary Algorithms 
Evolutionary algorithms (EAs) are a class of non-heuristic optimisation techniques which utilise
the concept of Darwinian evolution to progressively generate a given solution for a specified
application. A chromosome describes a potential solution which in turn expresses the solution's
phenotype, or functionality. The solution for a given application lies within a search space 
which must be successfully navigated by the EA. A population of competing solutions concur-
rently investigate the search space, and are subject to selection and modification by the EA in 
order to both remove poorer solutions from the search, and generate better solutions than are 
currently in the population. Solutions are modified through genetic operators termed mutation 
and crossover. Mutation acts on an individual by altering it in some predetermined manner, 
such that the resulting solution is potentially improved. Mutation is therefore analogous to a 
copying infidelity. Crossover usually works by combining the characteristics inherent in mul-
tiple solutions, with the aim of generating new offspring solutions better than the original 
solutions which created them. This process of evaluation, selection and modification through 
genetic operators is iterative, where each iteration is termed a generation, and is continued un-
til an acceptable solution is found, or a specified number of generations have been performed. 
The terminologies used within the field of evolutionary algorithms are designed to reflect the
conceptual similarities that exist between natural evolution, EAs, and genetics.
2.2.1 Evolutionary Programming 
Evolutionary programming (EP) was first developed by Lawrence J. Fogel in the early 1960s
as an alternate method of artificial intelligence. Fogel's initial experiments were on the sim-
ulated evolution of finite-state machines, originally proposed by Moore [33], and Mealy [34]. 
Fogel's research on this subject can be found in [35]. 
Genetic algorithms (GAs) traditionally operate on a solution phenotype by using genetic op-
erators to build together smaller sub-solutions or genotypes within the chromosome (as with 
biological genetic models). However, EP differs from the GA approach as it operates entirely 
on the phenotype, manipulating the behavioural characteristics of a solution in order to produce 
new offspring. As a result the quality, or fitness, of a given solution cannot be disseminated 
into smaller sub-solutions as with GAs, but is instead evaluated as a single entity, which must 
provide the solution required. Crossover is therefore not used in EP, which instead relies solely 
on mutation to effect the generation of new solutions. Competing solutions in a population
are selected probabilistically such that poorer solutions within the population have a small but 
non-zero probability of selection. Selected individuals are then modified through one or more 
mutation operators, each with a specified probability distribution. This process is repeated until 
a new population of solutions is generated consisting of offspring solutions, and potentially a 
number of the original "parent" solutions. The ratio of offspring to parent solutions is determ-
ined using one of a number of population selection rules detailed in Section 2.2.2 below. 
Evolutionary programming primarily uses real numbered (floating point, or integer) representations
for encoding a solution in a chromosome. The choice of representation depends greatly
on the application domain, and both are factors which determine the type of mutation operator 
required for the algorithm to generate solutions. EP has been applied to a wide range of tasks 
including automated control systems [36,37], game theory [38,39], the optimisation of neural 
network weights and their structure [3-5], and route planning [40] to name but a few.
2.2.2 Evolutionary Strategies 
The field of evolutionary strategies (ES) was originally developed in 1964 by Peter Bienert, 
Ingo Rechenberg, and Hans-Paul Schwefel, as a means of minimising the total drag of three-
dimensional bodies in turbulent air flow. A number of revisions to the original algorithm have 
occurred since 1964, and include an increase in population size from one parent solution generating
one offspring solution, to μ parents generating λ offspring. The relationship between
μ and λ can be expressed by two population selection rules. The first, (μ, λ), denotes an ES
that produces λ offspring from μ parents, thereby completely replacing the parent population
with λ offspring. In order to maintain the population size λ must be ≥ μ. If λ = μ then the
ES simply resembles a random walk search. Therefore, the relationship between offspring and
parent solution in an ES is defined as 1 ≤ μ ≤ λ < ∞. This approach is often termed generational.
The second rule, termed (μ + λ), denotes an ES that produces λ offspring from μ
parents, and selects the new population from the union of both parent and offspring solutions.
ES utilise both crossover and mutation in order to generate new offspring solutions. Like evol-
utionary programming, ES predominantly use a real numbered encoding to describe a solution. 
However, unlike most non-adaptive approaches using EP, evolutionary strategies vary the individual
mutation distribution of each solution according to a step size σ, based on a number
of individual parameters determined by a predefined "success" rule. For example, Rechenberg's
well known method of determining σ is named the 1/5 success rule [41]. This approach
increases σ by a predefined value if the relative frequency of successful mutations over a specified
number of generations is greater than 1/5. If this is not the case then σ is decreased by
the same predefined amount. Detailed explanations of ES can be found in [42]. 
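The two population selection rules and the 1/5 success rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the objective function (sphere), all parameter names and default values, and the small positive floor on the step size are hypothetical, and recombination is omitted so that only mutation and selection are shown.

```python
import random


def sphere(x):
    """Illustrative objective to minimise (an assumption, not from the text)."""
    return sum(v * v for v in x)


def evolution_strategy(mu=5, lam=20, dims=3, generations=60, plus=False,
                       sigma=1.0, delta=0.1, window=5, seed=0):
    """Minimal (mu, lam) or (mu + lam) ES with an additive 1/5 success rule."""
    if not 1 <= mu <= lam:
        raise ValueError("requires 1 <= mu <= lam")
    rng = random.Random(seed)
    parents = [[rng.uniform(-5.0, 5.0) for _ in range(dims)] for _ in range(mu)]
    successes = trials = 0
    for generation in range(1, generations + 1):
        offspring = []
        for _ in range(lam):
            parent = rng.choice(parents)
            # Gaussian mutation with the current step size sigma.
            child = [v + rng.gauss(0.0, sigma) for v in parent]
            offspring.append(child)
            trials += 1
            successes += sphere(child) < sphere(parent)
        # (mu + lam): select from parents and offspring; (mu, lam): offspring only.
        pool = parents + offspring if plus else offspring
        parents = sorted(pool, key=sphere)[:mu]
        if generation % window == 0:
            # 1/5 success rule: raise sigma if more than 1/5 of mutations
            # improved on their parent, otherwise lower it by the same amount.
            if successes / trials > 1 / 5:
                sigma += delta
            else:
                sigma = max(sigma - delta, 1e-6)  # floor is an assumption
            successes = trials = 0
    return min(parents, key=sphere), sigma
```

Calling evolution_strategy(plus=True) selects from the union of parents and offspring, so the best solution found can never be lost, whereas the default (μ, λ) form discards all parents each generation.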
2.2.3 Genetic Programming 
Genetic programming (GP) is conceptually very similar to GAs, and was first described by 
John R. Koza in 1989 [43]. However, there are a number of fundamental differences in the way 
that GP both encodes and manipulates solutions compared to GAs. Genetic programming uses 
tree-structures to evolve computer programs which describe a solution's functionality. This 
approach stems from the initial use of the LISP programming language for genetic algorithms 
presented by Cramer in 1985 [44]. Since then GP algorithms have been developed using a wide 
range of computing languages such as C and C++. Genetic programming commonly encodes 
each solution using a combination of two node classes, termed terminals and functions. Ter-
minals are either numeric constants or other inputs external to the evolving program. Functions 
perform operations which take either terminal outputs, or outputs from other function nodes in 
order to produce new outputs which form part of the evolving solution. An example of a simple 
genetic program encoding the logic functions of a 2-bit multiplexor is presented in Figure 2.1. 
Node functions include AND, OR and NOT boolean expressions; terminal nodes represent the 
two inputs of the multiplexor, IN0 and IN1, and the control signal, cntrl.
[Tree diagram: output out at the root OR node, two AND nodes, a NOT node, and terminals IN0, IN1 and cntrl.]
Figure 2.1: Example of 2-bit multiplexor represented using genetic programming tree structure.
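A program tree of the kind shown in Figure 2.1 can be sketched in Python as nested tuples of function and terminal nodes. The nested-tuple encoding and the helper names are assumptions made for illustration, and the exact placement of the NOT node is assumed from the description of the right-hand AND branch (NOT, IN1 and cntrl) rather than read from the figure.

```python
from itertools import product

# A program is either a terminal name (str) or a tuple (function, child, ...).
# Assumed structure: out = OR(AND(IN0, cntrl), AND(IN1, NOT(cntrl))).
MUX_TREE = ("OR",
            ("AND", "IN0", "cntrl"),
            ("AND", "IN1", ("NOT", "cntrl")))

# Function nodes operate on single-bit values 0 or 1.
FUNCTIONS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NOT": lambda a: 1 - a,
}


def evaluate(node, inputs):
    """Recursively evaluate a program tree for one assignment of terminals."""
    if isinstance(node, str):          # terminal node: look up its input value
        return inputs[node]
    name, *children = node             # function node: evaluate children first
    args = [evaluate(child, inputs) for child in children]
    return FUNCTIONS[name](*args)


def truth_table(tree):
    """Map every (IN0, IN1, cntrl) combination to the tree's output."""
    table = {}
    for in0, in1, cntrl in product((0, 1), repeat=3):
        table[(in0, in1, cntrl)] = evaluate(
            tree, {"IN0": in0, "IN1": in1, "cntrl": cntrl})
    return table
```

With this assumed structure the output follows IN0 when cntrl is 1 and IN1 when cntrl is 0; a fitness function for the evolving program could simply count the rows of the truth table that match the desired multiplexor behaviour.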
Crossover and mutation operators are applied at selected nodes within the program tree after 
the selection of fit individuals. Again the type of mutation operator employed is application 
specific, but usually results in the changing of a node's functionality within specified parameters.
For example in Figure 2.1 above, the NOT gate node might be replaced by an AND function as 
a result of mutation. Crossover acts on two selected parent programs by selecting nodes within 
each parent which form 'branches' of the program's encoding. In Figure 2.1 if crossover were
to select the right AND node, shown in red, then the associated program branch would contain 
the NOT function and terminals IN1 and cntrl. These branches of program are then used to 
create a new offspring program which combines the selected parts of each parent. From this 
procedure new solutions are generated. Another important difference of genetic programming 
over traditional GAs is that the length of chromosome used to encode each program is not fixed
during evolution. This enables a GP program to grow autonomously in order to accommodate 
variations in problem size and complexity. However, without careful control parameters, the 
length of GP tree-structures can quickly become unwieldy.
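The branch-swapping crossover described above can be sketched as follows, again using nested tuples to represent a program. All function names are hypothetical, and choosing crossover points uniformly over all nodes is an assumption; practical GP systems often bias this choice and limit tree depth to control growth.

```python
import random


def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        # Index 0 holds the function name, so children start at index 1.
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))


def get_subtree(tree, path):
    """Return the branch rooted at the node addressed by path."""
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of tree with the branch at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]


def subtree_crossover(parent_a, parent_b, rng):
    """Graft a randomly chosen branch of parent_b into parent_a."""
    point_a = rng.choice(list(paths(parent_a)))
    point_b = rng.choice(list(paths(parent_b)))
    return replace_subtree(parent_a, point_a, get_subtree(parent_b, point_b))


def size(tree):
    """Number of nodes in the tree, i.e. the length of the encoding."""
    return sum(1 for _ in paths(tree))
```

Because the grafted branch need not be the same size as the branch it replaces, offspring can be larger or smaller than either parent, which is the variable chromosome length noted above.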
Genetic programming has been applied to a wide range of applications including the design of 
analogue electronic circuits [45,46], classification of medical data [47,48], and the automated 
optimisation of chemical structures [49]. A more detailed account of genetic programming can 
be found in [50,51]. 
2.2.4 Genetic Algorithms 
The concept of the genetic algorithm (GA) was first proposed by John Holland in 1975 [52], 
and was the first evolutionary algorithm to encode possible solutions using binary bit-strings. 
The prominent role of crossover to generate new variations was also introduced as an integral 
mechanism of the GA. Further work by Goldberg [53] supported Holland's notion of using
crossover to create useful building blocks, termed schemata, which combine to form a description
of a given solution. Each chromosome generated by a GA describes specific aspects of
the solution it encodes. Mutation was therefore employed as a background operator, ensuring 
new material is randomly inserted into the population to avoid the search stagnating around a
sub-optimal solution. 
The manner in which a GA both encodes and manipulates possible solutions within the search 
space has made it extremely suitable as the primary engine behind evolvable hardware. For 
example each building block (schemata) can be represented as one or more circuit components 
with associated connectivity. This thesis therefore focuses on the use of genetic algorithms 
in evolvable hardware for the automated design of a performance driven FIR filter coefficient 
multiplication circuit. Genetic algorithms and their associated terminologies when applied to 
EHW are described in detail in the following section. 
2.3 Genetic Algorithms in Evolvable Hardware 
The automated design of digital circuits using EHW is not trivial. Each possible circuit solution 
for a given task lies within a search space. The search space is defined by the number of differ-
ent component building-blocks available, the number of logic cells used to generate the circuit, 
and the application for which the circuit is being evolved. Genetic algorithms, when applied 
to EHW, perform a non-heuristic search through a space of possible circuit configurations, in
order to find a solution which corresponds to a desired specification. Each circuit configuration 
is encoded within a chromosome which expresses circuit functionality. The encoded chromo-
some is termed a phenotype as it comprises numerous smaller logic cells or genotypes. Many 
possible circuit solutions are investigated concurrently, comprising a population of competing 
chromosomes. Each chromosome is evaluated and assigned a fitness which is representative of 
the quality of the circuit solution it describes. Selection mechanisms are then used to identify 
the most successful chromosomes within the current population. The logic elements which 
make up the circuits of the selected chromosomes are then modified via two genetic operators: 
mutation and crossover, to produce a new set of offspring solutions potentially better than their 
parents. This iterative process is repeated until an acceptable solution is found, or a specified 
number of iterations has been completed, where each iteration is termed a generation. Fig-
ure 2.2 displays the algorithmic flow of a standard genetic algorithm for evolvable hardware. 
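The iterative process described above can be sketched as a short Python routine. This is a minimal illustration rather than the algorithm developed later in the thesis: the binary tournament selection, single-point crossover, elitism, parameter defaults and all names are assumptions, and the fitness function passed in stands in for the circuit evaluation step.

```python
import random


def tournament(scored, rng):
    """Binary tournament: pick two chromosomes at random, keep the fitter."""
    a, b = rng.sample(scored, 2)
    return a[1] if a[0] >= b[0] else b[1]


def run_ga(fitness, length, pop_size=30, generations=100,
           crossover_rate=0.7, mutation_rate=0.02, target=None, seed=0):
    """Evolve fixed-length bit-strings; returns (best, fitness, generation)."""
    rng = random.Random(seed)
    # Initialise population with random bit-strings.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for generation in range(generations + 1):
        # Evaluate population and assign fitness.
        scored = [(fitness(c), c) for c in population]
        best_fit, best = max(scored, key=lambda pair: pair[0])
        # Stop if the requirement is met or the generation budget is spent.
        if generation == generations or (target is not None and best_fit >= target):
            return best, best_fit, generation
        offspring = [best[:]]  # elitism (an assumption): keep the current best
        while len(offspring) < pop_size:
            p1, p2 = tournament(scored, rng), tournament(scored, rng)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, length)      # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation applied independently to each bit.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            offspring.append(child)
        population = offspring
```

In an EHW setting the fitness argument would decode each chromosome into a circuit configuration and score it against the desired specification, either in simulation or on hardware.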
Figure 2.2: Algorithmic flow of genetic algorithm
[Flowchart: Initialise Population; Evaluate Population and Assign Fitness; Fitness Met Requirement? (Yes/No); Selection; Crossover; Mutation]
Genetic algorithms traditionally utilise a binary representation of fixed length N, which encodes
a solution in a specific chromosome. Each bit in the string is termed a locus. Groups of
loci of length L form schemata, which can be represented as 0, 1, or #, where # is a 'wildcard'
matching either 0 or 1. Schemata therefore define useful vectors in the search space (sub-blocks
of solution), which partly form the final solution. The larger the length of schema L, the greater
its contribution to the final solution. The order of a schema is defined by the number of non-#
loci in the string. For example, the string #10#1#0# is an order four schema, with defining loci
at locations 2, 3, 5 and 7. Strings such as "01001000" and "11011101" contain the defining loci
which describe the above schema. A schema's defining length is the distance between the first
and last positions of non-# loci. In the case of the example given above, the first non-# locus is
at 2, and the last at 7. The defining length of the above schema is therefore 5, despite an actual
schema length L of 8.
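The schema definitions above can be checked with a few lines of Python. The function names are illustrative assumptions; positions are counted from 1, as in the text.

```python
def schema_order(schema):
    """Number of fixed (non-#) loci in the schema."""
    return sum(1 for symbol in schema if symbol != "#")


def defining_length(schema):
    """Distance between the first and last fixed loci (1-based positions)."""
    fixed = [i for i, symbol in enumerate(schema, start=1) if symbol != "#"]
    return fixed[-1] - fixed[0] if fixed else 0


def matches(schema, string):
    """True if the bit-string agrees with the schema at every fixed locus."""
    return len(schema) == len(string) and all(
        s == "#" or s == bit for s, bit in zip(schema, string))
```

For the schema #10#1#0# these functions give an order of 4 and a defining length of 5, and both example strings from the text match it.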
The choice of binary encoding depends greatly on the optimisation problem. If a binary repres-
entation is used to encode numerical parameters requiring optimisation, then empirical studies 
have found that Gray encoding is generally superior to a standard power-of-two binary cod-
ing [54]. For example, if part of a chromosome encodes a value of 7, or "0111", in a 4-bit 
standard binary string, then for the same string to express the value 8 all bits must change to 
encode the string "1000". This requires significant manipulation of the bit-string on behalf of 
both the crossover and mutation operators. However, as Gray coding only requires a one-bit
change per decimal increment, 7 and 8 can be represented as "0100" and "1100" respectively. 
Such a transition can easily be achieved through mutation. 
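The transition between 7 and 8 discussed above can be reproduced with the standard reflected Gray code, sketched here in Python. The function names are illustrative assumptions; the text does not state which Gray code variant is used, although the values quoted agree with the reflected form.

```python
def binary_to_gray(n):
    """Convert a non-negative integer to its reflected Gray code value."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Invert binary_to_gray by folding in successively shifted copies."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")
```

In standard binary, 7 and 8 differ in all four bits, whereas their Gray codes, "0100" and "1100", differ in a single bit, so one mutation is enough to move between them.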
Not all GA encoding schemes use bit-string representations; real-valued representations have
also been used to encode solutions within a chromosome, often when the problem requires 
the optimisation of control parameters which are time or frequency dependent [54]. Gener-
ally, modified genetic operators are required from those used in binary representations in order 
to provide an effective means of navigating the search space. A detailed review of genetic 
algorithms and chromosome representations is beyond the scope of this thesis; however, the
reader is referred to the following for more information [42,53,54]. 
2.3.1 Initialisation
In order to begin the automated design procedure, a population of circuit chromosomes must 
first be initialised; this determines each chromosome's original circuit configuration. A number
of initialisation techniques have been employed in EHW, many of which include heuristic knowledge of
the circuit to be generated [55]. This approach is termed population seeding, and can be effect-
ive in reducing the number of generations needed to determine an acceptable circuit solution. 
One potential drawback of population seeding is that it can cause the search to become fixed
around a sub-optimal, or unsatisfactory solution. This is often caused when a population of 
chromosomes prematurely converges around a single sub-optimal solution (circuit configura-
tion). Seeding the population can therefore tip the delicate balance that exists between exploit-
ation of promising solutions, and exploration of the solution search space. Aggressive selection 
methods, high crossover rates, and population seeding are all methods of biasing the search to-
wards the current fittest solution in the population, thereby exploiting desirable characteristics 
present in the chromosome's circuit encoding, which can then proliferate throughout the population
via crossover and mutation. This can therefore lead to the fixation of a population around
a sub-optimal solution. A population is said to have reached stasis when no improvement in 
the fitness of a solution is experienced over an extensive number of generations. 
Another requirement of population seeding is that specific knowledge of both the problem do-
main, and the platform on which the circuit is to be autonomously configured, is required. There 
are a number of cases in EHW when this information is not available, or might limit the success 
of the search; for example, evolving a parallel multiplier through EHW might be achieved by 
seeding the initial chromosomes in the population with design rules from well known circuit 
configurations such as the Booth multiplier. This might increase the speed with which a solution
is found; however, it could also severely limit the novelty of any new multiplier architectures
that might potentially be developed. In order to maintain the general applicability of EHW to a 
wide range of automated circuit design applications the most common initialisation procedure 
simply generates a random circuit configuration for each chromosome in the population. This 
approach is termed random initialisation, and is an accepted way of eliminating many of the 
undesirable effects associated with population seeding. 
Evolutionary Algorithms for Automated Digital Circuit Design 
2.3.2 Selection 
Selection is the operator used within genetic algorithms for guiding the search towards a de-
sirable solution. The primary objective of the selection operator is to highlight better solutions 
in a population. Selection therefore removes poorer solutions from the population, leaving the 
remaining solutions available for modification through crossover and mutation operators. GAs 
can be classified as either generational or steady-state. A steady-state GA usually produces 
only one or two new solutions or offspring in each generation. Solutions from the current, or 
parent, population are usually deleted, based on some distribution, to make room for the new 
offspring solutions. Generational GAs instead replace the entire parent population with an 
entirely new population of offspring solutions each generation. This approach is denoted (μ, λ), 
where μ represents the parent population, and λ the new offspring population. Evolutionary 
algorithms utilise one of four selection operators. 
Proportionate selection identifies individual solutions based on the proportional fitness of that 
individual with respect to the fitness of all other solutions in the current population. Therefore 
a solution having twice the fitness of another solution is twice as likely to be selected. Multiple 
copies of a fit solution can therefore be selected and used to form the next population. The most 
simple form of proportionate selection is termed roulette-wheel selection, where each solution 
in the population is assigned an area on the wheel which is proportional to its fitness. The 
roulette-wheel is then spun a number of times equal to the population size. An individual is 
then selected based upon where the conceptual 'marker' points. However, the basic 
proportionate selection operator has two important disadvantages. If a population contains a solution 
which is markedly superior to any other solution in the current population then a large area 
of the roulette-wheel will be dominated by this individual. As a result the dominant solution 
will most often be selected each time the wheel is 'spun'. This could lead to a reduction in 
solution diversity, and potentially result in the convergence of the population around a sub-
optimal solution. The second disadvantage occurs when most of the solutions in the population 
have very similar fitness. In this case, each solution possesses a roughly equal proportion of the 
roulette-wheel. This can have the effect of random selection, and the search then loses 
direction. Both these difficulties can be avoided using a scaling scheme such as that outlined by 
Goldberg in [53]. 
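The roulette-wheel procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this thesis: the function names, the `rng` argument and the example fitness values are assumptions introduced here, and no fitness scaling is applied.

```python
import random


def selection_probabilities(fitnesses):
    """Share of the wheel owned by each solution (fitness / total fitness)."""
    total = sum(fitnesses)
    return [fit / total for fit in fitnesses]


def roulette_wheel_select(fitnesses, n_picks, rng=None):
    """Spin the wheel n_picks times and return the selected indices."""
    rng = rng or random.Random()
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("fitness values must sum to a positive number")
    picks = []
    for _ in range(n_picks):
        marker = rng.uniform(0, total)   # where the conceptual marker stops
        running = 0.0
        chosen = len(fitnesses) - 1      # fallback if rounding leaves no match
        for index, fit in enumerate(fitnesses):
            running += fit
            if marker < running:
                chosen = index
                break
        picks.append(chosen)
    return picks
```

With fitness values of 90, 5 and 5, `selection_probabilities` gives the first solution 90% of the wheel, which is the dominance effect noted above; with three equal fitness values each solution owns one third, and selection becomes effectively random.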
Tournament selection does not suffer from the disadvantages highlighted using proportionate 
selection. This is because selection is based on the absolute fitness of each solution in the 
population, as a tournament of t individuals is randomly chosen from the current population, 
and the fittest solution selected. The approach therefore eliminates the negative selection bias 
towards highly fit solutions, whilst ensuring that the fittest individuals are continually identified, 
even when all solutions in the population have similar fitness. Tournament selection is also 
amenable to fast implementation (for example in hardware) as only a few solutions are required 
to be compared at a time without needing to calculate the average fitness of the population as 
is the case with the proportionate selection approach. The most common size of t is two, and is 
termed binary, or two-way tournament selection. 
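A tournament of size t can be sketched as follows. The function name and the choice to draw contestants without replacement are assumptions made for this illustration; the text above does not specify either.

```python
import random


def tournament_select(fitnesses, t=2, rng=None):
    """Pick t distinct solutions at random and return the index of the fittest."""
    rng = rng or random.Random()
    contestants = rng.sample(range(len(fitnesses)), t)
    # Only the t contestants are compared; no population-wide statistic is needed.
    return max(contestants, key=lambda index: fitnesses[index])
```

Because only comparisons between contestants are made, two solutions with fitness 1.00 and 1.01 are still told apart in a binary tournament, whereas they would own almost identical areas of a roulette wheel.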
Ranking selection operates in a similar manner to proportionate selection except that solutions 
are ranked in an ascending or descending order of fitness, depending on whether the problem 
requires maximisation or minimisation. Each solution is then assigned a ranked fitness based 
on its rank within the population. Individual solutions are then selected based on a selection 
probability calculated using the ranked fitness score. 
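One possible mapping from rank to selection probability is sketched below. The linear mapping (probability proportional to rank) and the function name are assumptions chosen for illustration; the text does not state which mapping is intended, and ties are broken arbitrarily by position.

```python
def rank_probabilities(fitnesses, maximise=True):
    """Selection probability proportional to rank (worst = 1, best = n)."""
    n = len(fitnesses)
    # Order the indices from worst to best for the chosen objective.
    order = sorted(range(n), key=lambda index: fitnesses[index],
                   reverse=not maximise)
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    total = n * (n + 1) // 2          # sum of the ranks 1..n
    return [rank / total for rank in ranks]
```

For raw fitness values of 1000, 1 and 2 under maximisation, the ranked probabilities are 1/2, 1/6 and 1/3, so the markedly superior solution no longer dominates selection as it would under the basic proportionate operator.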
Boltzmann selection assigns a modified fitness to each solution based on a Boltzmann 
probability distribution: $D_i = 1/(1 + \exp(F_i/T))$, where T is a parameter analogous to the temperature 
term in the Boltzmann distribution. T is reduced in successive generations. Because a large 
value of T is initially used, almost any solution is equally likely to be selected, but, as the 
generations progress, T becomes small and only good solutions are selected. 
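The effect of the temperature parameter can be illustrated numerically. The sketch below evaluates the expression quoted above, reading the fitness term as F_i for solution i. Taken literally, that expression assigns larger values to smaller F_i, so the sketch treats F_i as a cost to be minimised; that reading, the function name and the clamp on the exponent are assumptions made here.

```python
import math


def boltzmann_values(raw_fitness, temperature):
    """Evaluate D_i = 1 / (1 + exp(F_i / T)) for every solution."""
    values = []
    for fit in raw_fitness:
        exponent = min(fit / temperature, 700.0)   # avoid overflow in math.exp
        values.append(1.0 / (1.0 + math.exp(exponent)))
    return values


hot = boltzmann_values([1.0, 2.0, 3.0], temperature=1000.0)   # early generations
cold = boltzmann_values([1.0, 2.0, 3.0], temperature=0.1)     # late generations
```

At the high temperature all three modified fitness values lie close to 0.5, so selection is nearly uniform; at the low temperature they differ by many orders of magnitude, so only the best solution is likely to be selected.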
Elitism can be applied to any of the selection operators detailed in this subsection. Elitism 
preserves solutions from the parent population and places them, untouched, into the new off-
spring population. This approach is used to maintain a specified number of superior solutions 
which might be lost during a selection. Elitism is therefore commonly used in generational 
(μ, λ) genetic algorithms which normally remove older genetic material from the new 
population. Improper use of Elitism can lead to fast convergence around sub-optimal solutions as elite 
individuals may be present in the population for many generations if no better solutions are 
found, biasing the search towards the elite individual. Retaining more than one elite solution 
in the population at any one time can help maintain diversity, as does limiting the number of 
generations an elite solution may be present in the population (its lifespan). Solution diversity 
using Elitism can further be maintained by using larger population sizes and higher mutation 
rates. 
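A minimal sketch of elitism in a generational GA is given below. The function name, the `n_elite` parameter and the decision to discard surplus offspring from the end of the list are assumptions made for illustration; lifespan limits are not modelled.

```python
def next_generation_with_elitism(parents, fitnesses, offspring, n_elite=1):
    """Copy the n_elite fittest parents, untouched, into the new population."""
    ranked = sorted(range(len(parents)), key=lambda index: fitnesses[index],
                    reverse=True)
    elites = [parents[index] for index in ranked[:n_elite]]
    # Keep the population size constant by dropping surplus offspring.
    return elites + offspring[:len(parents) - n_elite]
```

Setting `n_elite` to two or more corresponds to retaining several elite solutions, which the paragraph above suggests as one way of maintaining diversity.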
2.3.3 Crossover and Mutation 
Crossover and mutation are the operators through which new circuit solutions are generated. 
Crossover swaps material between two circuit configurations at randomly chosen points along 
the chromosome, producing new offspring. Chromosomes of fixed bit-length must therefore 
cross at the same loci on both parents. The greater the number of crossover points, the higher 
the degree of intermixing between the two parent solutions. The resulting offspring encode 
two new circuit solutions, potentially better than the two parent solutions which generated 
them. The probability that two selected solutions will crossover is termed the crossover rate. 
One-point, two-point and uniform crossover are most commonly used. Uniform crossover 
randomly swaps individual bits between the two parents with the same probability. Uniform 
crossover is the most disruptive of the crossover operators, as larger schemata which make up 
each parent are generally not expressed in the offspring solutions. Single point crossover is 
used most dominantly in EHW applications which utilise genetic algorithms, and is illustrated 
in Figure 2.3. 
Figure 2.3: Single-point crossover of two parent chromosomes, generating two offspring (Parent 1 and Parent 2 are cut at one splice point and recombined into Offspring 1 and Offspring 2) 
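Single-point and uniform crossover on fixed-length bit strings can be sketched as follows. Chromosomes are represented as Python strings of '0' and '1' for readability; this representation, the function names and the `swap_prob` parameter are assumptions made for this illustration.

```python
import random


def single_point_crossover(parent1, parent2, point=None, rng=None):
    """Swap the tails of two equal-length parents at the same locus."""
    if len(parent1) != len(parent2):
        raise ValueError("fixed-length chromosomes must have equal length")
    rng = rng or random.Random()
    if point is None:
        point = rng.randint(1, len(parent1) - 1)   # splice inside the string
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2


def uniform_crossover(parent1, parent2, swap_prob=0.5, rng=None):
    """Swap each bit position independently with probability swap_prob."""
    rng = rng or random.Random()
    child1, child2 = [], []
    for bit1, bit2 in zip(parent1, parent2):
        if rng.random() < swap_prob:
            bit1, bit2 = bit2, bit1
        child1.append(bit1)
        child2.append(bit2)
    return "".join(child1), "".join(child2)
```

For example, crossing "11111111" with "00000000" at splice point 3 yields "11100000" and "00011111". Uniform crossover breaks up contiguous runs of bits, which is why it is described above as the most disruptive operator.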
Mutation is used to maintain diversity within the population. It operates directly after crossover 
and is analogous to a copying infidelity as material is transferred from parent to offspring. 
Mutation is invoked with relatively low probability so as not to prove deleterious to the search. For 
this reason mutation is considered by Goldberg and others to be a background operator with 
crossover cultivating useful schemata from both parent solutions, and mutation ensuring 
diversity in the population through operations such as bit-flipping [53]. Bit flipping is the standard 
mutation operator used with bit encoded genetic algorithms, and is depicted in Figure 2.4. Each 
bit in the string has a probability that it will flip to its inverse (e.g. 0 to 1); mutation rates 
are usually set such that on average one mutation occurs per chromosome (bit string). 
Therefore a chromosome of bit length 100 would have a mutation rate of 0.01. Mühlenbein [42] 
expresses this relationship more formally as 
$P_m = 1/N$ 	 (2.1) 
where $P_m$ is the probability of a bit mutation within a chromosome of bit length N. 
Figure 2.4: Bit-flipping mutation example in bit-level chromosome encoding (the chromosome is shown before and after mutation) 
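Bit-flipping mutation with the default rate of one expected mutation per chromosome can be sketched as below. As before, chromosomes are modelled as strings of '0' and '1', and the function names are assumptions made for illustration.

```python
import random


def default_mutation_rate(bit_length):
    """Per-bit rate giving one expected mutation per chromosome (equation 2.1)."""
    return 1.0 / bit_length


def bit_flip_mutation(chromosome, rate=None, rng=None):
    """Flip each bit independently with probability `rate`."""
    rng = rng or random.Random()
    if rate is None:
        rate = default_mutation_rate(len(chromosome))
    mutated = []
    for bit in chromosome:
        if rng.random() < rate:
            bit = "1" if bit == "0" else "0"
        mutated.append(bit)
    return "".join(mutated)
```

For a chromosome of 100 bits, `default_mutation_rate` returns 0.01, matching the example in the text.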
2.3.4 Fitness Function 
Development of the fitness function in EHW is directly related to the circuit application. The 
success of the GA in finding an acceptable circuit solution is therefore most influenced by 
how accurately the fitness function describes a given circuit specification. When using EHW 
for automated circuit design, fitness is often represented as a percentage of circuit 
functionality. Correctness is therefore frequently calculated by summing the total number of correct bits 
produced by the circuit under evaluation and comparing this to the desired output response. 
For example, consider the look-up table for a 2-bit parallel multiplier as presented in Table 2.1. 
Both Ain and Bin are 2-bit input vectors, therefore the total number of 4-bit test vectors required 
to ensure correct functionality is $V_{test} = 2^I$, where I is the total number of bits applied to 
the circuit input. In this case $2^4 = 16$ 4-bit test vectors are needed to fully test the multiplier. A 
total of $2^4 \times 16 = 256$ 4-bit output vectors are therefore possible; from these, the correct circuit 
configuration which produces the desired sixteen 4-bit output vectors must be identified. 
Fitness might then be calculated by comparing the number of correct output bits, produced 
by a given circuit solution, i, with the total number of bits to be matched, as shown in equation 2.2. 
$F_i = O_i / V$ 	 (2.2) 
Ain Bin Multi out 
00 00 0000 
00 01 0000 
00 10 0000 
00 11 0000 
01 00 0000 
01 01 0001 
01 10 0010 
01 11 0011 
10 00 0000 
10 01 0010 
10 10 0100 
10 11 0110 
11 00 0000 
11 01 0011 
11 10 0110 
11 11 1001 
Table 2.1: Boolean logic look-up table of 2-bit parallel multiplier 
Where $O_i$ is the number of correctly matching output bits for the current solution, and V is 
the total number of bits which must be matched. The simple 2-bit multiplier example above 
provides an indication as to both the size and complexity of the search space which must be suc-
cessfully navigated if EHW is to prove an effective tool for automatically designing industrially 
useful DSP circuits. 
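The bit-matching fitness of equation 2.2 can be checked against Table 2.1 with a short sketch. A candidate circuit is modelled here as a Python function of the two input words; that abstraction and the function names are assumptions made for illustration, since an evolved circuit would in practice be simulated from its chromosome.

```python
def multiplier_truth_table(width=2):
    """Desired output for every pair of width-bit inputs (Table 2.1 for width=2)."""
    return {(a, b): a * b
            for a in range(2 ** width)
            for b in range(2 ** width)}


def bit_fitness(candidate, width=2):
    """Fraction of output bits matching the multiplier truth table."""
    out_bits = 2 * width
    mask = (1 << out_bits) - 1
    table = multiplier_truth_table(width)
    total_bits = len(table) * out_bits            # V: bits that must be matched
    correct_bits = 0                              # O_i: bits matched by candidate
    for (a, b), expected in table.items():
        produced = candidate(a, b) & mask
        wrong = bin(produced ^ expected).count("1")
        correct_bits += out_bits - wrong
    return correct_bits / total_bits
```

A correct multiplier scores 1.0 over the 16 test vectors and 64 output bits. A circuit whose outputs are all tied low already scores 50/64 (about 78%), because most bits in Table 2.1 are zero; this illustrates how a bit-counting fitness can reward solutions that are far from functionally correct.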
2.4 Applying Evolvable Hardware to Automated Circuit Design 
Circuits generated via evolvable hardware are evaluated by one of two methods: extrinsic 
evaluation (software simulation), and direct intrinsic evaluation by which a circuit is transferred 
directly into silicon and then evaluated. Intrinsic evaluation has become feasible due to 
advances over the past decade in programmable hardware technologies such as PLDs 
(Programmable Logic Devices) and FPGAs (Field Programmable Gate Arrays), both of which are 
detailed in Chapter 4. 
With the advent of faster and larger FPGAs, resulting from advances in silicon technology, 
and the move towards deep sub-micron technologies, designers are under increasing pressure 
to provide high performance DSP circuits which take advantage of these new platforms. The 
result is circuits which must operate under critical constraints imposed by high density, and 
the domination of interconnect capacitance [56]. 
The successful generation of digital circuits through extrinsic and intrinsic evaluation has demon-
strated the great potential of EHW for automated circuit design. It has also raised a number of 
questions as to how current EHW techniques can be further improved. 
2.4.1 Gate Level and Functional Level Circuit Evolution 
Evolvable hardware generates circuits by manipulating a number of 'building blocks' which the 
EA has at its disposal. When applied to digital circuit design these building blocks generally 
take the form of primitive gates; basic logic elements such as ANDs, NOTs, ORs and XORs. 
EHW architectures which use primitive logic elements are said to perform gate-level evolution. 
One problem with gate-level evolution is that encoding lengths can become unwieldy when 
larger circuits are to be evolved. However, Thompson [13] argues that fine-grained building 
blocks such as these enable EHW to develop novel architectures beyond the scope of conven-
tional design techniques. Despite this, evolving with primitive gates is thought to impede the 
success of EHW frameworks in evolving all but relatively simple DSP circuits [57]. Miller et al. 
have demonstrated the evolution of 2-bit, 3-bit and 4-bit parallel multiplier circuits using 
gate-level EHW [12,58], reflecting some of the most complex DSP circuits currently generated 
with the gate-level EHW approach. However, multiplier complexity greater than 4 bits has 
not been achieved using gate-level evolution. An alternative to gate-level evolution is that of 
functional-level evolution [59, 60]. Here larger logic elements, or macros, are used comprising 
many primitive gates. An example of a functional-level component within a multiplier design 
might be a fulladder, or half adder circuit. This approach is further investigated by Ersin in [61], 
in which a number of macro units are utilised to evolve more complex arithmetic logic units. 
Function-level evolution has been shown to produce DSP circuits which are of a level of func-
tionality sufficient to be of use to industry. These include online data compression for colour 
printers [62] and the adaptive equalisation of digital communication channels [63]. 
2.4.2 Digital Circuit Design using Extrinsic Evaluation 
Evolvable hardware for automated digital design favours software based, or extrinsic 
evaluation, due to the simplicity of its implementation and the ease with which evolved circuits can be 
examined once a solution is found [64]. Using this approach only the final solution is down-
loaded onto a reconfigurable device, or fabricated on a custom IC. The majority of frameworks 
which employ extrinsic evaluation use a technology independent net-list to model the digital 
circuit undergoing evolution. Examples include the use of genetic programming to synthesise 
logic functions using generic Multiplexer trees [64,65]. Drechsler and Günther take this 
approach further by evolving logic functions using Multiplexer circuits which can be mapped onto 
Multiplexer-based FPGAs [66]. 
Other extrinsic platforms are more closely based on programmable logic devices which are 
discussed in detail in Chapter 4. Arslan et al. present a novel FPGA-based architecture designed 
to implement digital filters [67], whilst Miller and Thomson have developed a chromosome 
representation which more closely models the architecture of Xilinx's now obsolete XC6200 
series of FPGAs [68] for evolving novel 2-bit multiplier circuits. In addition, the multiplier 
structures developed by Miller et al. and discussed in section 2.4.1 were implemented using a 
second, more constrained FPGA-based circuit encoding. This produced significant 
improvements over the XC6200 model, enabling the evolution of 3 and 4-bit multiplier circuits. The 
use of FPGA-based architectures for EHW is an important issue which will be investigated in 
more depth in Chapter 6. 
One of the dominant limitations in using extrinsic circuit evaluation, which is common to all 
of the approaches cited above, is that little or no information is processed on how accurately 
a system models the circuit's physical characteristics, such as timing. 
This is because in most cases the additional detail required has not been integrated into the 
software. This inhibits the development of high performance DSP circuits where timing and 
area constraints are of great importance, and therefore must be accounted for in the EHW design 
procedure. 
2.4.3 Digital Circuit Design using Intrinsic Evaluation 
Research on intrinsic circuit evolution has presented a whole new aspect of automated digital 
circuit design as well as fundamental problems resulting from the effects of circuit evaluation 
using the 'unconstrained' intrinsic approach. 
Thompson's development of a two-tone discriminator using gate-level, unconstrained evolu- 
tion [13] has shown that the evaluation of a circuit intrinsically in silicon can be affected by 
the physics of the device upon which evaluation takes place. Circuits evolved using this tech-
nique have been shown to be temperature, silicon, and voltage dependent. It is presently un-
clear why intrinsic evaluation exploits a given architecture's physical characteristics, or how 
these negative effects can be minimised [19,69]. Layzell has since attempted to provide an 
environment suitable for answering these fundamental questions by using a general-purpose 
evolvable motherboard consisting of an array of programmable switches, connected to up to 
6 plug-in daughterboards. Each daughterboard can theoretically perform any number of 
functions, however, Layzell focuses on daughterboards containing arrays of operational amplifiers 
and transistors [70]. Circuits evolved on the motherboard included a simple NOT gate, an 
amplifier, and an oscillator. Again each circuit exhibited dependence on the components on which 
it was evolved, such that if new transistor components were inserted then the system had to 
be re-evolved to regain the desired functionality. Layzell also developed a software model of 
the evolvable motherboard. Results presented in [70] show that extrinsic evaluation provides 
solutions to each of the circuits investigated, and could be used to configure the hardwired 
motherboard with minimal re-evolution. 
One solution to the problems inherent in intrinsic circuit evaluation has been presented through 
the development of a Java-based tool for evolving gate-level circuits on Xilinx's XC4000EX/XL 
series of FPGA devices [71]. The architecture, called GeneticFPGA, uses Java to interface to 
the XC4000EX/XL bitstream, which then generates circuits on the fly. This EHW platform 
avoids the problems associated with unconstrained intrinsic evaluation by driving all inputs 
with flip-flops in a completely synchronous mode. In addition, evolution is constrained such 
that only neighbouring connectivity is possible so as to avoid possible contentious circuit con-
figurations, such as feedback and same pin multiple source short circuits. 
Tufte and Haddow implement the genetic algorithm directly on a Xilinx XC4044XL FPGA 
such that the GA and the evolving circuit design are implemented together on the same device. 
They coin this process Complete Hardware Evolution (CHE) [72]. Early work focused on the 
automated design of Multiplexer circuits, however, more recent work uses the same approach 
to autonomously evolve adaptive filter coefficients within a constrained filter architecture em-
bedded on the FPGA [17]. The CHE principle has also been demonstrated earlier by Kajitani 
et al. in [60] and was used to implement many of the practical applications detailed in [11]. 
Due to the limited commercial availability of analogue programmable devices, only a small 
number of architectures suitable for evolvable hardware have been developed in the academic 
field. Whilst this thesis does not focus on analogue circuit design using EHW, a number of 
programmable devices capable of implementing both analogue and digital circuits using EHW 
are included here for completeness. Examples include the Palmo chip developed by 
Hamilton et al. [73]. The device is constructed in mixed-signal VLSI and can process 
analogue signals specified as pulse streams. Palmo can therefore be used to process mixed-signal 
data effectively on a single programmable device. The Palmo chip has been shown to be a 
useful platform for EHW design techniques, where the evolution of novel filter structures has 
been demonstrated [74]. Stoica et al. have developed a reconfigurable transistor array specifically 
designed for evolutionary oriented circuit design [75]. The device can implement both analogue 
and digital circuits and exhibits robustness to extreme temperature variations when evolved 
on-line. 
2.4.4 Encoding Digital Circuits Using Evolvable Hardware 
Gate-level evolution, using either extrinsic or intrinsic evaluation, presents a number of 
difficulties as circuit complexity increases. These difficulties become manifest in the encoding 
schemes used to represent circuit functionality. Lengthy chromosomes are attributed to the 
manner in which circuits are encoded using the gate-level approach [14]. As circuit complexity 
increases, so too does the number of logic gates required to build it. This increase in gate count 
relates directly to an increase in the chromosome length, as each logic gate must be encoded. 
Because the majority of evolutionary algorithms require populations of chromosomes to find an 
adequate solution, reducing chromosome length can greatly reduce the memory requirements 
of many EHW applications. Limiting the manner in which logic gates can connect, and 
increasing the building block size to accommodate function-level evolution are two methods of 
limiting the length of circuit chromosomes in EHW. Higuchi et al. have demonstrated both 
reduction methods successfully in [59]. As an example consider the two encoding approaches 
used to describe the circuits illustrated in Figure 2.5. Both circuits are functionally identical, 
however, the gate-level encoding would require a seven cell description to represent the circuit, 
while the functional-level encoding would require only three. Although more cell connectivity 
information is required to encode the fulladder cell described using the functional-level ap-
proach, the overall reduction in chromosome length justifies this. Again, it could be argued 
that increased component granularity and reduced freedom of component connectivity might 
reduce the novelty of circuits produced using EHW. Such an argument should however also take 
into account the search space, the memory resources available to the EHW platform, and the 
designer's requirement for circuit novelty against speed of implementation. Chapter 3 will also 
show that functional-level evolution requires fewer generations to attain functionally correct 
DSP circuits than equivalent circuits generated using gate-level evolution. 
Figure 2.5: Comparison of a standard gate-level encoding with the novel macro-based 
encoding to describe a Fulladder with additional logic. 
2.5 Summary 
This chapter has introduced the concepts of evolvable hardware, and presented four derivative 
classes. A detailed overview of genetic algorithms and associated genetic operators tailored for 
EHW applications has been presented. This has highlighted the GA's suitability for automated 
circuit design and demonstrated how genetic operators might be used to develop circuit 
architectures for a given specification, often in the form of a boolean lookup table. The benefits 
and limitations of both gate-level and functional-level approaches to automated circuit design 
have been investigated through literature review. Gate-level evolution has provided a number of 
novel DSP circuits with smaller area than that achieved using conventional design techniques, 
however, a non-linear growth exists between the complexity of the circuit to be evolved and the 
size and complexity of the search space in which an acceptable solution might be found. This 
has limited the effectiveness of gate-level EHW to relatively simple DSP circuits. Functional-level 
evolution has demonstrated the automated design of a number of much more complex DSP 
applications, where the search space has been constrained for the specific DSP task. It has 
however been stated that functional-level building blocks limit the novelty and performance 
of circuits generated through EHW, although the author is unaware of any direct comparisons 
between gate-level and functional-level approaches. 
The merits of both extrinsic and intrinsic circuit evaluation using EHW have also been subject 
to literature review. Applications requiring on-line adaptation must inevitably be implemented 
directly in hardware and therefore suit the intrinsic approach. This technique has the 
disadvantage of restricting the DSP to one technology platform, and has been shown to produce unstable 
circuits which are reliant on the specific physical characteristics of the device upon which they are 
evolved, unless necessary constraints are set in place as part of the circuit encoding. Extrinsic 
evaluation through software simulation provides the design engineer with a means of more ac-
curately assessing the circuit developed using EHW. However, physical circuit characteristics 
such as timing are usually ignored in favour of area minimisation. The next chapter explores the 
development of a more accurate software simulation environment for autonomously developing 
DSP circuits, and provides a direct comparison of gate-level and functional level approaches to 
circuit design for a number of DSP applications. 
Chapter 3 
Generating DSP Circuits on the 
Virtual Chip EHW Platform 
3.1 Introduction 
Because of the need for high performance DSP applications, many modern designs must 
target DSM (deep sub-micron) technologies to achieve demands set by low area and fast data 
throughput. As a result timing and area issues have become a dominant factor in the design 
of performance DSP circuits and therefore should be accounted for by EHW platforms when 
designing such applications. 
The EHW platform presented in this chapter was developed to incorporate performance related 
design criteria concerning the area, timing and correct functionality of DSP circuits. Also, 
the platform was developed to ascertain the merits of both gate-level and functional-level ap-
proaches to the automated design of a high performance multiplier stage for FIR filter coef-
ficient multiplication. A number of other DSP applications are also investigated to provide a 
wider basis for comparison. The work set out in this chapter therefore aims to achieve the 
following: 
• Investigate the use of a custom evolvable hardware platform, termed the Virtual Chip, 
to produce novel, high performance DSP circuits, developed under area and timing con-
straints. 
• Provide a performance comparison of circuits generated using functional-level evolution 
with functionally equivalent circuits developed using gate-level evolution. 
• Provide a basis for analysis by comparing those circuits generated by the Virtual Chip 
with functionally equivalent circuits developed using standard CAD-based design meth-
odologies. 
The results obtained in this chapter intend to establish the effectiveness of gate-level and 
functional-level design approaches to automated circuit design using a genetic algorithm and 
EHW platform common to both approaches as the basis for comparison. The effectiveness of 
the DSP circuits produced using the Virtual Chip will also be compared with equivalent circuits 
generated using other published EHW platforms, and circuits developed using standard digital 
design methodologies. 
3.2 The Virtual Chip Evolvable Hardware Platform 
The Virtual Chip has been designed to provide an automated digital circuit design environment. 
Within this platform a novel genetic algorithm encoding is used to evolve digital circuits. 
Evaluation is performed extrinsically through detailed simulation of circuit functionality and timing 
characteristics. The Virtual Chip platform was designed to be a generic, flexible environment 
for generating a wide range of digital circuits. Although this is not the underlying motivation 
of this thesis, a flexible platform was required so that a range of DSP circuits could be evolved 
in order to determine the effect of component granularity (gate-level vs functional-level evolu-
tion) on the effectiveness of an EHW platform to develop non trivial DSP circuits. This research 
translates as the first stage in the development of the more complex DSP functionality needed 
for FIR filter design. 
3.2.1 Encoding a circuit within the chromosome 
Genetic algorithms for evolvable hardware are used to develop chromosomes which then en-
code the functional description of a given circuit. As with many applications which utilise 
genetic algorithms, the resulting circuit is termed a phenotype as it comprises numerous smal-
ler logic cells or genotypes. The terminologies used are designed to reflect the conceptual 
similarity between genetic algorithms, natural evolution, and genetics. 
The genetic algorithm presented in the Virtual Chip platform uses a permutation-based integer 
encoding of fixed-length. As such a specified number of logic elements are presented to the 
framework. From this, the desired circuit functionality must be generated. Using a fixed-
length encoding is standard practice and is one of the main restrictions within which a genetic 
algorithm operates [53]. 
Specific sections of each chromosome are reserved for describing the inputs and outputs 
required for the desired circuit. Logic elements are referenced by position within the 
chromosome. Figure 3.1 displays the relative location of each encoded section. 
Figure 3.1: Chromosome structure defining sections for specific circuit description (an input 
section holding Input 1 to Input 3, a main circuit description of positional elements, and an 
output section holding Output 1 and Output 2) 
Circuit inputs are encoded in the first section of the chromosome. If a circuit has I inputs, then 
the first I logic elements in the chromosome will describe these inputs. This description 
includes the input pin number in addition to the logic element to which the input pin is connected. 
Outputs are similarly defined at the end of the chromosome, where position relates to the 
identification of an output pin connected to a logic element. Total chromosome length, N, is then 
defined as the number of logic elements summed with the number of circuit inputs. Therefore, 
if a circuit has two outputs, whatever logic elements are at N and N-1 are connected to output 
pin one and output pin two respectively. The encoding ensures that the number of inputs and 
outputs described by a chromosome remains consistent after operations such as crossover. 
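The positional layout described above can be made concrete with a small sketch. It assumes 1-based positions and reads the two output positions as N and N-1, generalised so that output pin k reads position N-(k-1); the function name, the generalisation and the example sizes are assumptions made for illustration only.

```python
def chromosome_layout(n_inputs, n_logic_elements, n_outputs):
    """Return (N, input positions, {output pin: position}) for one chromosome."""
    length = n_inputs + n_logic_elements            # total chromosome length N
    input_positions = list(range(1, n_inputs + 1))  # first I positions hold inputs
    # Output pin 1 reads position N, pin 2 reads N-1, and so on.
    output_positions = {pin: length - (pin - 1)
                        for pin in range(1, n_outputs + 1)}
    return length, input_positions, output_positions
```

For a circuit with four inputs, ten logic elements and two outputs this gives N = 14, inputs at positions 1 to 4, and outputs taken from positions 14 and 13. Because the sections are defined purely by position, crossover between two chromosomes of the same length cannot change the number of inputs or outputs.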
The GA used by the Virtual Chip draws on a range of functional elements, or macro blocks,
along with simple gate primitives, with which to generate various DSP circuit structures. Macro
blocks particularly suited to more complex DSP circuits were chosen, such as a half-adder and
fulladder. Other macro cells include small combinational logic blocks. In addition,
through-connects are provided to increase the flexibility of the circuit encoding. Table 3.1 lists
all of the logic elements available to the GA and indicates whether each component constitutes a
primitive or functional logic element. Figure 3.2 clarifies a number of the component
terminologies presented in Table 3.1, and displays examples of each type of logic element. Two
component libraries are therefore available, one representing gate-level evolution, the other
function-level. A total of 14 logic elements are included in the gate-level library, and 28 logic
elements (primitive logic elements are also included) in the functional library.
Each cell is connected within a flexible chromosome encoding which allows placement of any 
Primitive Logic Elements      Functional/Macro Logic Elements
2-input NAND                  3-input NAND
2-input XOR                   3-input XOR
2-input XNOR                  3-input NOR
2-input OR                    3-input OR
2-input AND                   Common XOR
2-input NOR                   Common AND
BUFFER                        Common NAND
NOT                           Common OR
Pull-high                     Half-adder
Pull-low                      Fulladder
Through-connect               Combinatorial NAND
Through-connect + float       Combinatorial AND
Pull-high + float             Combinatorial OR
Pull-low + float              Combinatorial XOR
Table 3.1: Primitive and functional logic elements available to genetic algorithm within Virtual 
Chip EHW platform. 
cell (functional or primitive) into any position within the string. This provides the EHW plat-
form both the flexibility of standard gate level encodings, and the potential of building more 
complex systems afforded by less flexible functional-level architectures. 
3.2.2 Connecting Cells Within the Chromosome 
Each genotype (logic element) in a circuit is allocated a specific position within the
corresponding chromosome. The type of logic cell at any given position is initially determined
randomly; however, cells can be allocated different positions after initialisation through
manipulation by the genetic operators, detailed in section 2.5.3. Figure 3.3 demonstrates the
chromosome encoding scheme used to describe connectivity of the fulladder cell depicted in
Figure 2.5. It is important to note that a cell's connectivity is not restricted to its nearest
positional neighbour. Rather, cells are free to connect to any cell of higher position within the
chromosome. This form of 'over-the-cell' connectivity provides a much wider range of possible
circuit configurations. Feedback connections are not permitted by default, as their effects are
not desirable for most DSP applications, with the exception of functions such as the Infinite
Impulse Response (IIR) filter. However, the chromosome encoding used does allow for feedback
connectivity should such a feature be required. Feedback is simply achieved by allowing logic
cells to connect to cells at a lower chromosome position than the current cell. Logic cells near
the end of the chromosome
[Figure 3.2 diagram: example cells labelled Common input, Combinatorial ORed output, Pull-low
cell, Pull-low/high with floating input, Through Connect, Through Connect with floating input,
Pull-high cell, and 3-input logic cell.]
Figure 3.2: Generic style of macro and other logic elements provided to component library for 
the evolution of arithmetic circuits. 
[Figure 3.3 diagram: macro-based encoding of a fulladder, with fields labelled: position within
chromosome; ID from library of components; first O/P pin of fulladder; position of cell to which
first O/P pin is connected; I/P pin of connected cell.]
Figure 3.3: Example of macro-based encoding describing a macro element (fulladder) and its 
connectivity. 
may therefore loop back their output connections to earlier cell locations.
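The connectivity rule can be illustrated with a short Python sketch. It is an assumption-laden
illustration rather than the Virtual Chip code: the function name allowed_targets is
hypothetical, and positions are numbered from 0 to N-1 for convenience.

```python
def allowed_targets(position, length, feedback=False):
    """List the chromosome positions that the cell at `position` may drive.

    Without feedback a cell may only connect forward, to any higher position
    (not just its nearest neighbour). With feedback enabled, lower positions
    are also permitted; a cell is never allowed to drive itself.
    """
    forward = list(range(position + 1, length))
    if not feedback:
        return forward
    backward = list(range(0, position))
    return forward + backward


if __name__ == "__main__":
    print(allowed_targets(2, 6))                 # feed-forward only
    print(allowed_targets(2, 6, feedback=True))  # forward cells, then earlier cells
    print(allowed_targets(5, 6))                 # the last cell has no forward target
```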
3.2.3 The Genetic Operators 
The genetic algorithm utilised by the Virtual Chip uses single-point crossover. Single-point
crossover was used as it is the least disruptive means of combining circuit characteristics between
two parent solutions. Because of the complex interactions between the logic elements which form
a circuit, the effects of crossover (and also mutation) can be highly disruptive to the search. This
is often caused by the breaking of connections between logic elements during the recombination
process, and by the high degree of interdependence between logic elements, termed epistasis,
which is needed to achieve a desired circuit functionality. Figure 3.4 illustrates the effects of
recombination through crossover. 
[Figure 3.4 diagram: chromosome before crossover (Parent 1) with splice point; chromosome before
crossover (Parent 2) with splice point; chromosome after crossover (offspring), formed from the
head of Parent 1 and the tail of Parent 2.]
Figure 3.4: Example of broken element connectivity resulting from crossover. 
It should be noted that, unlike the bit-wise crossover described in section 2.2.3, the crossover
operator detailed in this chapter works in a different manner because of the integer-based
encoding utilised by the Virtual Chip platform. Crossover therefore only splices each circuit
encoding at the beginning of a specific section of chromosome (schema), which describes an
individual logic element and its connectivity. It is therefore not permissible to splice a
chromosome midway through the encoding of an individual logic element. For example, if the
crossover location of a parent chromosome were chosen to lie midway through the encoding of a
Fulladder at location x, and the corresponding x location on the second parent related to the
encoding of a 2-input NAND gate, then the resulting offspring circuit might encode an invalid
description of a logic element, as the physical characteristics of the Fulladder and the NAND gate
are very different.
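A minimal Python sketch of element-boundary crossover is given below. It assumes, purely for
illustration, that a chromosome is held as a list with one record per logic element, so that a
splice can only ever fall between elements; the function name and the record layout are
hypothetical and are not taken from the thesis.

```python
import random


def single_point_crossover(parent1, parent2, rng=random):
    """Splice two equal-length chromosomes at a boundary between logic elements.

    Each list item is one complete logic element (type plus connectivity), so
    the offspring can never contain half of one element joined to half of another.
    Returns the offspring and the splice point that was used.
    """
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same number of logic elements")
    if len(parent1) < 2:
        raise ValueError("at least two logic elements are needed to splice")
    point = rng.randrange(1, len(parent1))        # boundary index, never 0 or N
    offspring = parent1[:point] + parent2[point:]  # head of parent 1, tail of parent 2
    return offspring, point


if __name__ == "__main__":
    # (element type, position of the cell its first output drives)
    p1 = [("FULLADDER", 2), ("NAND2", 3), ("XOR2", 3), ("NOT", 3)]
    p2 = [("NOR2", 1), ("AND2", 2), ("OR2", 3), ("BUFFER", 3)]
    child, point = single_point_crossover(p1, p2, random.Random(1))
    print(point, child)
```

Connections inherited from the head may point at elements that no longer exist in the tail,
which is why a repair step is needed after splicing.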
So as to minimise the negative effects of crossover, chromosome 'repair' is used to reconnect
any element connections broken during the operation. This is achieved using a nearest neighbour
connection rule. It can be seen that with the example offspring chromosome depicted in
Figure 3.4, the logic element at position x is no longer able to connect to the new logic element
now at position y. The repair algorithm instead attempts to connect element x to its nearest
neighbour at position x + 1. If this is unsuccessful then subsequent reconnection attempts are
made from x + 2 to N - 1, where N denotes the total number of logic elements present in
the chromosome. In the event that no logic elements are available for connection, the current
element is assigned as floating and further attempts at reconnection are made during later
crossover operations. If feedback connectivity is enabled, the search for reconnection does not
finish at the last chromosome position; instead, the search loops back to the start of the
chromosome and continues until the logic cell immediately before the current element has been
examined. Stopping there prevents direct component feedback, which can damage the circuit. Each
logic element in the resulting offspring chromosome is examined for broken connections after
every crossover event.
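The nearest neighbour repair rule can be sketched as follows. This is an illustrative Python
outline under stated assumptions, not the original C routine: positions run from 0 to N-1, and
the callback can_accept is a hypothetical stand-in for whatever test decides that a candidate
cell is able to take the connection (the text does not spell this test out).

```python
def repair_connection(x, length, can_accept, feedback=False):
    """Find a new target for a broken connection leaving the element at position x.

    Candidates are tried in order x+1, x+2, ..., N-1. With feedback enabled the
    search wraps round to position 0 and stops at x-1, so an element is never
    connected directly back to itself. Returns the chosen position, or None if
    the connection has to be left floating.
    """
    candidates = list(range(x + 1, length))
    if feedback:
        candidates += list(range(0, x))
    for position in candidates:
        if can_accept(position):
            return position
    return None  # floating; a later crossover may give another chance to reconnect


if __name__ == "__main__":
    free_inputs = {0, 4}  # hypothetical: only cells 0 and 4 can take a new connection
    print(repair_connection(2, 6, lambda p: p in free_inputs))
    print(repair_connection(4, 6, lambda p: p in free_inputs))
    print(repair_connection(4, 6, lambda p: p in free_inputs, feedback=True))
```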
Mutation is invoked with relatively low probability so as not to prove deleterious to the al-
gorithm search. There are four circuit-specific mutation operators used within the genetic al-
gorithm presented. Each is depicted in Figure 3.5. 
Each operator was specifically designed to enhance the genetic algorithm by providing it with
the ability to introduce both new logic elements and connections not obtainable using crossover.
Of these operators only cell replacement needs explanation. The result of this mutation is to
replace an existing logic element in the chromosome with one randomly selected from the
component library. This ensures that new solutions can still be obtained through diversification
many generations after population initialisation.
The GA's mutation rate is a derivation of the chromosome bit-length relationship originally
proposed by Mühlenbein [42], and defined in equation (2.1), where the number of bits used to
encode the chromosome, L, governs the mutation rate. Because the genetic algorithm presented
here uses a permutation-based integer encoding, a direct translation of Mühlenbein's
relationship is not possible. Instead the total number of logic elements encoded in the
chromosome, N, is used to represent L. Mutation therefore operates on each logic element with a
probability of 1/N. If an element becomes subject to mutation, one of the four mutation
operators highlighted in Figure 3.5 is applied, each with an equal probability of selection (i.e.
[Figure 3.5 diagram: circuit fragments illustrating each mutation operator; panel label: After
mutation.]
Figure 3.5: Four mutation operators used by the genetic algorithm. 
0.25). 
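The mutation scheme can be sketched in Python as below. The sketch assumes the per-element rate
Pm = 1/N described above; only cell replacement is named in the text, so the other three
operator labels are hypothetical placeholders, as are the function names.

```python
import random

# Only "cell_replacement" is named in the text; the remaining labels are
# placeholders standing in for the other three operators of Figure 3.5.
OPERATORS = ["cell_replacement", "operator_2", "operator_3", "operator_4"]


def mutation_rate(num_elements):
    """Per-element mutation probability, Pm = 1/N."""
    return 1.0 / num_elements


def plan_mutations(num_elements, rng):
    """Decide which elements mutate and which operator each one receives.

    Every element is tested independently against Pm; a selected element is
    given one of the four operators with equal probability (0.25 each).
    """
    pm = mutation_rate(num_elements)
    plan = []
    for position in range(num_elements):
        if rng.random() < pm:
            plan.append((position, rng.choice(OPERATORS)))
    return plan


if __name__ == "__main__":
    print(mutation_rate(50))
    print(plan_mutations(50, random.Random(0)))
```

With Pm = 1/N the expected number of mutated elements per chromosome is one, whatever the
chromosome length.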
The GA in the Virtual Chip uses two-way tournament selection for the reasons described in 
section 2.3.2 of chapter 2. Whilst a number of selection methods could have been investigated, 
an important focus of the research presented in this chapter was to determine the effectiveness 
of both primitive and functional-level evolution in enabling the GA to find successful DSP 
circuit solutions. The selection method employed therefore simply provided a common basis 
for comparison. 
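For reference, two-way tournament selection can be sketched as follows. This is the textbook
form of the operator, written in Python with hypothetical names; drawing the two competitors
without replacement and breaking ties in favour of the first are assumptions, since section 2.3.2
is not reproduced here.

```python
import random


def tournament_select(population, fitness, rng=random):
    """Pick two distinct individuals at random and return the fitter one."""
    a, b = rng.sample(range(len(population)), 2)
    return population[a] if fitness[a] >= fitness[b] else population[b]


if __name__ == "__main__":
    circuits = ["circuit_0", "circuit_1", "circuit_2", "circuit_3"]
    scores = [0.62, 0.91, 0.40, 0.75]
    rng = random.Random(3)
    print([tournament_select(circuits, scores, rng) for _ in range(5)])
```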
Several constraints are imposed during initialisation of the genetic algorithm. Some are
designed to eliminate contentious circuit configurations, while others are a result of the
evolutionary algorithm employed. Initial global parameters are entered by the designer and are as
follows: 
• Number of inputs and outputs required for the desired circuit;
• Definition of input and output vectors upon which evaluation takes place, and which
describe circuit functionality;
• Number of logic elements within a chromosome used to create the circuit;
• The maximum number of possible fan-outs per cell output;
• Definition of global clock speed for timing constraints;
• Population size, defining the number of circuit solutions concurrently evolving within the
search space.
A summary of the parameters applied to the genetic algorithm is as follows:
• Generational genetic algorithm
• Two-way tournament selection
• One-point crossover at a rate of 0.7, with chromosome repair applied
• Mutation using Mühlenbein derivation Pm = 1/L, with application specific operators
• Population size of fifty
So as to optimise cell connectivity within a fixed-length circuit encoding, each output pin on
a logic element is randomly allocated a fan-out ranging between one and a user-defined
maximum. Fan-out describes the number of logic elements with which an individual element may
connect. The connectivity of any specific logic element is not restricted to its nearest
positional neighbour.
Circuit correctness is evaluated using the fitness scheme described in section 2.3.4, and
calculated using the fitness expression presented in equation (2.2). Each circuit is first tested
through simulation in an HDL (Hardware Description Language) environment, described in the
following section. A population of 50 was chosen as it is commonly used in other EHW applications
and represented the maximum number of solutions which could be evaluated without incurring a
prohibitive delay.
3.2.4 Circuit Evaluation with the Virtual Chip 
Hardware description languages (HDLs) are predominantly used by design engineers when
developing high performance circuits. VHDL (VHSIC Hardware Description Language, where VHSIC
stands for Very High Speed Integrated Circuit) is one of two dominant HDLs for describing digital
electronic systems [76]. VHDL is a technology independent environment that describes the
structure of a digital system by describing electronic subsystems (logic elements) and how they
are interconnected. In addition, circuit descriptions can be accurately simulated without the
need for hardware prototyping. VHDL is therefore a powerful tool for both circuit design and
evaluation, and provides an ideal environment for EHW techniques which utilise extrinsic
evaluation. Few EHW platforms have been developed which utilise HDLs for circuit evaluation;
however, one example can be found in [77]. The technique cited, however, does not take into
account the physical characteristics of the circuit undergoing evaluation.
After successful testing a circuit can then be synthesised to provide a technology specific netlist, 
ready for transfer onto silicon. A netlist describes the physical composition of a given circuit, 
and includes details of the type of logic component used and how it is connected. Almost all 
technology vendors provide models for logic elements within component libraries. The Vir-
tual Chip therefore evolves the structure of a circuit directly within the VHDL language. This 
is performed within a specially designed testbench. It is this testbench which instantiates and 
interconnects all the logic elements within a chromosome which encodes a specific circuit solu-
tion. Evaluation is performed by instantiating and simulating all the circuits described within a 
population of chromosomes, as if they were being implemented within a single reconfigurable 
chip. Simulation of each circuit was performed using Cadence's Leapfrog VHDL simulation 
tool. Figure 3.6 illustrates a 2x2-bit multiplier evolving within the Virtual Chip environment. 
As can be seen in Figure 3.6, each 2x2-bit multiplier has 4 inputs and 4 outputs. All inputs
and outputs are synchronised with flip-flops to account for propagation delays and ensure that
all output signals have reached a steady state. It is these flip-flops which, governed by a global
clock, set the timing constraints within which the evolving circuit must operate. A circuit with
incorrect timing will produce output signals offset from those desired and will therefore incur
low fitness.
Each 4-bit output grouping represents an individual circuit evolving within the virtual
environment. Each grouping is tagged according to the circuit's ID within the evolving population.
Every circuit solution is therefore represented as a technology independent VHDL netlist.
Netlists are direct interpretations of the circuit chromosome and define a circuit in terms of its
logic elements and inter-connectivity. Standard CAD tools can then be used to both optimise the
circuit and translate the generic netlist into a technology specific netlist suitable for
implementation in hardware.
[Figure 3.6 diagram: the component library and input vectors feeding the VHDL testbench that
holds the evolving circuits.]
Figure 3.6: Graphical representation of the Virtual Chip environment, evolving a 2-bit
multiplier within a population size of N.
Due to the implicit parallelisation of the Virtual Chip environment, the entire population is 
compiled, and simulated as one entity. This differs from the majority of EHW platforms which 
use extrinsic evaluation and evaluate each individual solution sequentially. As a result, within 
the Virtual Chip environment, an entire population of fifty individuals, evolving fifty 2x2-bit 
parallel multiplier circuits, can be simulated and evaluated in approximately five seconds. In 
contrast, if each circuit were evaluated as an individual entity, it would take approximately two 
minutes to evaluate the same population. These figures were obtained on a standard Sparc Ultra 
10 workstation with 640Mb of memory. 
The Virtual Chip is a fusion of C code and VHDL. The genetic algorithm itself is executed in
C and generates the VHDL required to instantiate each chromosome encoded circuit. After a
circuit has been successfully evolved it is then passed through a CAD tool for optimisation.
Figure 3.7 displays the execution flow and coding format of the Virtual Chip EHW platform.
[Figure 3.7 flow diagram: C routine: initial circuit population created; C routine: generate
circuit structures within VHDL testbench; Virtual Chip: compile testbench and simulate circuit
population; C routine: evaluate simulation and assign fitness to circuit; C routines: (labels not
recoverable); synthesise final circuit solution; end evolution and exit programme.]
Figure 3.7: Execution flow and coding format of the genetic algorithm and Virtual Chip eval-
uation environment. 
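The step in which chromosomes are turned into VHDL can be illustrated with a short sketch. The
thesis performs this step in C; the Python below is only an illustration of the idea, and the
chromosome record layout, the component names (nand2, xor2), the signal names and the
circuit-ID prefix are all hypothetical. Each element becomes one structural component
instantiation using positional port association, and every label and signal is tagged with the
circuit's ID so that a whole population can share one testbench.

```python
def emit_circuit(circuit_id, chromosome):
    """Return VHDL instantiation lines for one chromosome.

    Each chromosome entry is (component name, list of input signals, output signal).
    """
    lines = []
    for position, (component, inputs, output) in enumerate(chromosome):
        # Prefix signals with the circuit ID so circuits in one testbench do not clash.
        signals = [f"c{circuit_id}_{name}" for name in list(inputs) + [output]]
        ports = ", ".join(signals)
        lines.append(f"  c{circuit_id}_u{position}: {component} port map ({ports});")
    return lines


def emit_population(population):
    """Concatenate the instantiations of every circuit in the population."""
    lines = []
    for circuit_id, chromosome in enumerate(population):
        lines.extend(emit_circuit(circuit_id, chromosome))
    return "\n".join(lines)


if __name__ == "__main__":
    circuit = [("nand2", ["in0", "in1"], "n0"), ("xor2", ["n0", "in2"], "out0")]
    print(emit_population([circuit, circuit]))
```

A real testbench would also need signal declarations, component declarations, the input and
output flip-flops and the stimulus process, none of which are shown here.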
3.3 Implementation and Results 
The following section highlights the results obtained when using the Virtual Chip EHW platform
to autonomously generate three types of DSP circuit evolved under timing, area and
functionality constraints. The quality of circuit solution based on the performance of the
genetic algorithm is further investigated through comparison of two component libraries. The
first library uses both the primitive and functional logic elements detailed in section 3.2.1, and
equates to functional-level evolution. The other component library uses only gate primitives,
and reflects gate-level evolution. The following suppositions were used to provide a basis for
analysis:
• The evolution of circuits using larger functional building-blocks directly translates into
an improvement in the number of successful circuit solutions uncovered by a genetic
algorithm;
• The implementation of functional elements reduces the number of generations required
to identify a correct circuit solution;
• A circuit encoded to implement functional elements is shorter in length than an equivalent
circuit encoded to utilise only simple logic gates;
• Circuit solutions generated using larger functional logic elements are not less competitive
in terms of area and timing than those generated exclusively using gate level logic
elements;
• The EHW platform presented provides circuit solutions of equal or better performance
compared with equivalent circuits developed using standard high level CAD-based design
methodologies.
Three DSP circuits were initially examined and consisted of a 2x2-bit unsigned parallel
multiplier, a 7-bit pattern recogniser (one's voter), and a 2-tone frequency discriminator. The
multiplier circuit was chosen as it is the traditional circuit used for multiplication of FIR
filter coefficients, and would provide a valuable indication as to the suitability of the Virtual
Chip EHW platform for such an application. Furthermore, each of the three DSP circuits represents
a benchmark application previously investigated within the field of evolvable hardware research
[64,68,71,78]. They also provide the foundation blocks for larger DSP applications. It should be
noted that the circuits chosen represent some of the most complex arithmetic modules currently
generated using the EHW design paradigm.
The performance of the Virtual Chip in generating each of the three DSP circuits, using either
the functional or gate-level component libraries, is further compared with the same three DSP
circuits developed using standard design methodologies and written by hand in VHDL. Using
a standard HDL methodology, each of the three DSP circuits was developed in two stages.
Firstly, each circuit was described at the behavioural level using VHDL, the source code of
which is presented in Appendices A.1, A.2 and A.3 for the multiplier, one's voter, and 2-tone
frequency discriminator respectively. Each VHDL circuit description was then passed through
Cadence's Build Gates circuit synthesis environment [79] in order to produce a netlist tailored
for a specific silicon technology. The same synthesis tool was used to synthesise each of the
circuit netlists generated by the Virtual Chip platform. The reader is referred to [80] for a
more detailed account of standard digital design and synthesis methodologies using HDLs.
The following terminologies are used to describe each of the three design approaches:
• Primitive library: Library of primitive logic elements used by the Virtual Chip genetic
algorithm for automated circuit design.
• Functional library: Library comprising primitive logic elements and larger macro
components (see section 3.2.1), also used by the Virtual Chip genetic algorithm.
• Behavioural HDL: Conventional design and synthesis flow for digital circuit generation
from a behavioural description written in VHDL.
Each circuit architecture was evolved ten times, and terminated after 10,000 generations if 
a fully correct solution (fitness of 1.0) had not been found. Evolution was halted as soon as a 
correct solution was discovered. All three circuits were constrained by global timing parameters 
to operate no slower than 10 MHz. 
In order to provide a common basis for comparison, all circuits generated using either the
Primitive library, Functional library, or through Behavioural HDL were synthesised using the
same vendor's silicon technology library. A technology library describes the physical
characteristics (such as timing and area) of the logic components associated with a specific
fabrication process. Alcatel Microelectronics' 0.35 µm CMOS MTC45000 technology was therefore
used as the common library platform for comparison, and is the default technology library used
throughout this thesis.
3.3.1 Genetic Algorithm Performance Using Primitive and Functional Component Libraries
Table 3.2 displays the averaged GA performance obtained after 10 runs for each of the circuits
analysed. The results show that the functional library provides the genetic algorithm with a
significantly better success rate when generating correct 2x2-bit multiplier and 7-bit pattern
recogniser circuits. No complete solutions were found when using the primitive library to
generate the pattern recogniser. This result is supported by findings published by Levi et al.
[71], and supports the supposition that logic elements functionally more significant than gate
primitives enable the genetic algorithm to correctly generate more complex circuits.
Both the multiplier and pattern recogniser circuits generated using the functional library
displayed a better average fitness, influenced by higher success rates. In addition, the genetic
algorithm required fewer generations to find a correct solution using the functional library when
compared to the same circuits evolved using the primitive library. Because the primitive library
failed to generate the pattern recogniser, this comparison could only be drawn from the 2x2-bit
multiplier and 2-tone frequency discriminator circuits.
Column five in Table 3.2 provides an indication as to when, on average, the genetic algorithm
finally became 'stuck' on a sub-optimal circuit and was unable to find a better solution. Such
local optima are well known to hinder a GA search, resulting in periods of stasis where no
improvements on a current solution are found. Results indicate that neither component library
provided the genetic algorithm with consistent improvement throughout every run, and in some
cases stasis was reached well before forced termination at 10,000 generations.
Component Library    Success  Average Fitness    Average Number of  Average Generation   Logic Elements
                     Rate     of Final Solution  Generations if     Before Final Stasis  in Chromosome
                                                 Successful
2x2-bit Multiplier
Primitive library    0.3      0.9750             3370.7             4082.4               30
Functional library   0.7      0.9922             2780.0             1187.0               15
7-bit Pattern Recogniser
Primitive library    0.0      0.8930             NA                 4769.6               50
Functional library   0.43     0.9794             2407.5             6726.2               15
2-Frequency Discriminator
Primitive library    0.5      0.89480            9630.0             7812.8               100
Functional library   0.33     0.8672             7816.5             7770.5.2             30
Table 3.2: Comparison of DSP Circuits Generated by Genetic Algorithm Using Different Logic
Library Implementations.
Figure 3.8 shows the output response of the discriminator circuit written in behavioural HDL.
The circuit was designed to respond to a change in input frequency with an integration time of
one complete impulse period. The input impulse frequencies were chosen to be 2.5 MHz and
833 kHz (one quarter and one twelfth of the 10 MHz operating frequency). The fitness of an
evolving circuit was based upon how well the circuit matched the response of the behavioural
model. Thompson "evolved" a two-tone frequency discriminator successfully in [78]. His gate-level
approach permitted circuit feedback. For this reason, feedback was also enabled on
the Virtual Chip.
Figure 3.8: Output response of 2-frequency discriminator from behavioural HDL model. 
Table 3.2 shows that, when using the Functional library, the genetic algorithm performs slightly
better, on average, at finding successful solutions than when the Primitive library is used.
In addition, the number of generations required by the GA to generate a correct
solution is approximately 20% less using the functional library than that using the primitive
approach.
It should be noted that the number of logic elements required to encode each chromosome
using the functional library was at most half that needed to encode chromosomes implementing
the Primitive library. The number of logic elements used for each component library, and for
each type of DSP circuit, was empirically derived so as to obtain optimal performance and thus
a fair comparison of the actual chromosome length required. However, on average, functional
logic elements utilised by the GA are two to three times larger than gate primitives. The number
of logic elements needed to encode chromosomes using the Primitive library is therefore two
to three times larger than the number of logic elements employed using the functional library.
The chromosome lengths shown in Table 3.2 reflect this.
The success of the functional library over the Primitive library can be further justified by
investigating the relative size of the search space produced by each approach. Both component
libraries are subject to the same fitness parameters, and both must correctly match the number 
of output bits corresponding to the lookup table of each DSP circuit. However, the greater func-
tionality of logic elements available to the functional library means that fewer components are 
required to successfully encode each circuit. This translates directly to a decrease in the search 
space with respect to chromosome length, and can be formalised as follows: 
S_i = C^{N}    (3.1)
where S_i is the search space for a given circuit architecture i, C is the number of different
logic elements available to the GA from the component library, and N is the number of logic
elements used to encode the circuit in a chromosome. For example, consider the search space
size associated with the 7-bit pattern recogniser. In this example 50 logic elements were used
to encode the pattern recogniser using the Primitive library. Table 3.1 shows that 14 distinct
logic elements are used in the primitive library. The search space is therefore calculated at
14^{50} ≈ 2 × 10^{57}. Alternatively, the search space for the functional library, with 28 logic
elements and a chromosome of 15 elements, can be calculated at 28^{15} ≈ 5 × 10^{21}. Whilst this
calculation of search space size is crude, it adequately demonstrates
the potentially huge differences in the magnitude of search space resulting from a functional vs
gate-level approach to circuit evolution, and provides evidence that gate-level evolution restricts
the complexity of circuit which can be generated because of the prohibitively large search space
produced.
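The two search space sizes can be checked directly, since Python integers are exact at this
scale. The short sketch below simply evaluates C^N for the library sizes and chromosome lengths
quoted for the 7-bit pattern recogniser; the function name search_space is hypothetical.

```python
import math


def search_space(num_component_types, num_elements):
    """Crude search space size: C choices for each of N chromosome positions."""
    return num_component_types ** num_elements


primitive = search_space(14, 50)   # gate-level library, 50 logic elements
functional = search_space(28, 15)  # functional library, 15 logic elements

if __name__ == "__main__":
    print(f"primitive : about 10^{math.log10(primitive):.1f}")
    print(f"functional: about 10^{math.log10(functional):.1f}")
    print(f"ratio     : about 10^{math.log10(primitive // functional):.1f}")
```

On these figures the gate-level search space is larger than the functional one by more than
thirty-five orders of magnitude.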
Figure 3.9 displays the average performance of the genetic algorithm when evolving all three 
circuit architectures. 
3.3.2 Analysis of Timing and Area Performance
Table 3.3 displays both timing and area statistics of the three arithmetic circuits under
investigation. Each circuit is identified as having been generated using either the primitive
library, functional library, or behavioural HDL implementation. Timing slack is defined to be the
duration for which the slowest output of the circuit remains stable before the next data pulse
arrives. It should be noted that +INF denotes that timing constraints are well within specified
limits. Circuit complexity is measured in equivalent NAND gates and represents the total physical
area of the synthesised circuit using 0.35 µm CMOS MTC45000 technology. The complexity measure
therefore takes account of transistor area and interconnect dimensions.
Results show that on average the primitive library produces circuits of smaller complexity than 
[Figure 3.9 plot: horizontal axis GENERATION, 0 to 9500 in steps of 500; legend: Macro 2-bit
Multi, Macro Discriminator, Macro Pattern Recog, Primitive 2-bit Multi, Primitive Discriminator,
Primitive Pattern Recog.]
Figure 3.9: Typical Number of Generations required by Genetic Algorithm to evolve DSP cir-
cuit structures using primitive and functional component libraries. 
both the behavioural HDL and Functional library. Timing is comparable in all cases. However,
individual solutions generated using the functional library are very similar to the best circuits
generated using the primitive library, particularly for the 2x2-bit multiplier. In all cases, the
best solutions generated by the genetic algorithm are either comparable or better in performance
than those developed through standard behavioural HDL synthesis.
In addition to providing a technology specific circuit netlist, the synthesis procedure also provides 
circuit optimisation by removing redundant logic elements. This is particularly useful for cir-
cuits generated using EHW as many redundant logic elements such as through connects (Fig-
ure 3.2) will be removed. Table 3.4 presents the area and timing performance of the best 
solutions taken from each of the evolved circuit architectures examined. Where possible, both 
primitive and Functional libraries have been presented. 
Results presented in Table 3.4 show marked reductions in area for both the 2x2-bit multiplier
and 7-bit pattern recogniser circuits generated using the genetic algorithm. Comparison with
Table 3.3 demonstrates that both circuits are between 15% and 25% smaller in area than their
behavioural HDL equivalents. In all cases, the timing of circuits generated by the Virtual Chip
Implementation       Circuit Complexity  Average      Best Area in  Corresponding
                     in NAND Gates       Timing (ns)  NAND Gates    Best Timing (ns)
2x2-bit Multiplier
Primitive library    10.99               93.0392      10.32         93.7459
Functional library   18.55               93.1228      10.67         94.0642
Behavioural HDL      12.68               93.5936      NA            NA
7-bit Pattern Recogniser
Functional library   38.58               89.9210      27.33         90.8866
Behavioural HDL      20.00               91.75        NA            NA
2-Frequency Discriminator
Primitive library    32.56               91.5315      6.67          0.93.6528
Functional library   54.59               89.8399      21.67         +INF
Behavioural HDL      75.04               +INF         NA            NA
Table 3.3: Performance of arithmetic circuits in terms of circuit complexity and operation 
speed. 
Implementation              Best Area       Corresponding 
                            (NAND gates)    Best Timing (ns) 
2x2-bit Multiplier 
  Primitive library         8.99            94.0752 
  Functional library        8.99            94.1303 
  Miller et al.             8.66            88.9066 
7-bit Pattern Recogniser 
  Functional library        16.66           92.3691 
Table 3.4: Performance of GA-Based Arithmetic Circuits in Terms of Area and Operation 
Speed After Optimisation. 
is further improved after optimisation. Table 3.4 also draws a comparison with the 2x2-bit 
parallel multiplier evolved by Vasselin, Miller and Fogarty in [81]. The multiplier developed 
was implemented using fewer logic gates than a conventional design; a total of 7 two-input logic 
gates is quoted. The design presented in [81] was then converted into VHDL and synthesised 
using the same parameters identified above. Appendix A.4 displays the multiplier schematic 
and associated VHDL code. It can be seen from Table 3.4 that Miller's multiplier is comparable 
in both timing and area to those evolved by the Virtual Chip. This provides a clear benchmark 
as to the success of developing performance driven DSP circuits using EHW over conventional 
design approaches. 
Figure 3.10 and Figure 3.11 demonstrate the best 7-bit pattern recogniser solution obtained us-
ing the functional library before and after the removal of redundant logic elements. Figure 3.12 
displays the 7-bit pattern recogniser analysed in Table 3.4 after full optimisation. These figures 













Figure 3.11: Circuit diagram of 7-bit pattern recogniser generated by genetic algorithm using 
functional library with redundant elements removed. 
3.4 Phased Evolution in the Virtual Chip 
The sheer size of the search space involved in the automated design of digital circuits can 
often result in the failure of an evolvable hardware framework to find a suitable solution. 
Results from section 3.3.1 have shown that limiting the size of logic components available to 
the genetic algorithm, as with the primitive library, increases both the size of the search space, 
and the number of iterations (generations) required to find an acceptable solution. In some 
cases, as with the 7-bit pattern recogniser, this can prove prohibitive. 
As shown in Table 3.4, Miller et al. have provided valuable research material from using EHW 









Figure 3.12: Circuit diagram of fully optimised 7-bit pattern recogniser generated by genetic 
algorithm using functional library. 
and gate-level evolution to autonomously develop multiplier architectures with fewer logic 
components than standard multiplier designs [81]. Multiplier architectures such as those 
developed by Miller's EHW platform are therefore particularly relevant to high performance SoC 
signal processing applications, such as filter coefficient multiplication. However, the highly 
non-linear growth in search space size and complexity demonstrated by Miller limits the 
effectiveness of EHW in generating multiplier architectures with input vectors greater than 4 bits 
long [12]. 
In addition to the number of logic elements required, and the desired circuit functionality, such 
complexity is well represented in the fitness evaluation of such circuits. In most cases evaluation 
consists of matching the output vectors of the circuit under analysis with the actual output 
vectors required by the desired functionality. This has been the approach adopted in this 
chapter for generating DSP circuits on the Virtual Chip. However, by reducing the number 
of possible output vectors, and thus the required complexity of a circuit, it becomes possible 
to develop circuits of complexity that were previously difficult, or unattainable. Figure 3.13 
visualises this approach for the example of a more complex DSP circuit: a 3x3-bit parallel 
multiplier. If the 3x3-bit multiplier were evolved as one unit the number of correct output bits 
required to correctly describe the entire circuit would be: 
$$ B_1 = 2^{I} \times O \tag{3.2} $$
Where B_1 represents the number of correctly matched output bits required for the current cir- 




[Figure 3.13 diagram labels: Sub-circuits; Evolved Multiplier After Stage Two; Synthesis; Stage Three.] 
Figure 3.13: Example of Phased Evolution For The Automated Design of a 3x3-bit Multiplier. 
However, stage one of the example circuit shown in Figure 3.13 demonstrates that, through 





This represents a marked difference in circuit complexity, and therefore a reduction in the size 
of the search space. 
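The scale of this reduction can be illustrated with a short sketch. It assumes that fitness is evaluated by applying every input combination once and comparing output bits against a truth table, as described above; the function names are hypothetical and are not part of the Virtual Chip platform.

```python
def multiplier_truth_table(bits):
    """Return the target output columns of an unsigned bits x bits multiplier.

    columns[k][v] is output bit k for input vector v, where v packs operand a
    in the high bits and operand b in the low bits.
    """
    n_inputs = 2 * bits
    n_outputs = 2 * bits
    columns = [[] for _ in range(n_outputs)]
    for v in range(2 ** n_inputs):
        a, b = v >> bits, v & ((1 << bits) - 1)
        product = a * b
        for k in range(n_outputs):
            columns[k].append((product >> k) & 1)
    return columns


def fitness(candidate_column, target_column):
    """Fraction of output bits of a candidate that match the target column."""
    matches = sum(c == t for c, t in zip(candidate_column, target_column))
    return matches / len(target_column)


columns = multiplier_truth_table(3)
single_step_bits = sum(len(col) for col in columns)  # all six outputs evolved as one unit
phased_bits = len(columns[0])                        # one output per evolutionary run
print(single_step_bits, phased_bits)                 # 384 64
```

Each phased run therefore has to match 64 output bits rather than 384, which is the sense in which the evaluation target of each sub-circuit is smaller.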
Stage two in Figure 3.13 denotes the removal of redundant logic between the evolved sub-
circuit structures as they are combined to generate the required circuit. A benefit of evolving 
partitioned circuits through phased evolution is to reduce the negative effects of the high degree 
of epistasis, inherent in design-based EHW applications. Epistasis describes the degree of 
interdependency between the elements of the chromosome. It has been shown that a 
very high degree of epistasis, as can be found in high performance digital circuits, begins to 
favour random search over genetic algorithm techniques [82]. It might be assumed that simply 
combining each sub-circuit would result in an overall circuit much larger than that developed 
by either a design engineer, or an alternative EHW platform. Results will show however that 
the high degree of common functionality between each of the sub-circuits generated, results in 
large amounts of cell reuse between circuits and thus extensive minimisation is achieved during 
stage two optimisation. 
Stage three in Figure 3.13 represents circuit synthesis enabling the designer to investigate the 
evolved circuit for different technologies, and confirm timing constraints are adhered to. 
3.4.1 Implementation and Results 
The following section details an example circuit evolved using the Virtual Chip EHW platform 
and phased evolution. The example presented is that of an unsigned, 3x3-bit parallel multiplier 
and was chosen as an incremental progression from the 2x2-bit multiplier developed in section 
2. The 3x3-bit multiplier is also compared with a functionally equivalent design, generated 
using the same standard behavioural level HDL-to-synthesis procedure described in section 
3.3, and with a 3x3-bit multiplier evolved by Miller in [12]. Both the schematic of the 3x3-bit 
multiplier presented in [12], and the corresponding VHDL code are shown in Appendix A.5. 
So as to verify reproducibility, each of the phased outputs (six sub-circuits representing each of 
the six circuit outputs) were evolved ten times. After evolution, specific sub-circuit solutions 
were chosen at random, and combined to form the final completed multiplier. The circuit was 
constrained to run no slower than 10 MHz, and the area of each sub-circuit was restricted by a 
chromosome length of 15 logic elements. A total of 90 logic elements were therefore used to 
encode the multiplier. However, many of these elements will be simple through-connects and 
many will become redundant. 
The completed circuit was then synthesised to remove redundancies and calculate cell area. 
Table 3.5 displays timing and area information about the 3x3-bit multiplier evolved, along with 
the CAD-based and Miller equivalent. To further test the performance of both multiplier 
circuits, each was synthesised to run at 100 MHz. The results are also displayed in Table 3.5. 
The results indicate that, despite a slight increase in circuit complexity of 2 NAND gates, the 
evolved 3x3-bit multiplier operates equally well at 100 MHz as the hand designed, CAD 
based circuit. It should be noted that equivalent performance was obtained at this higher 
frequency, despite being evolved to operate at only 10 MHz. 
The following compares the phased evolution technique with that of the same Virtual Chip 
platform without phased evolution. Through this, the difficulty faced by single-step EHW tech- 
Method of Circuit Generation    Circuit Complexity    Timing Slack at    Timing Slack at 
                                (NAND gates)          10 MHz (ns)        100 MHz (ns) 
Phased Evolution                45.67                 +INF               1.7266 - 1.8151 
Standard CAD synthesis tool     43.67                 +INF               1.7395 - 1.7926 
Miller et al.                   41.36                 87.3771            6.3771 
Table 3.5: Comparing 3x3-bit multiplier evolved using Virtual Chip EHW platform with that of 
functionally equivalent circuits generated with Miller's EHW platform and by using 
standard digital CAD techniques. 
niques when evolving complex digital circuits becomes apparent. Figure 3.14 demonstrates the 
unsuccessful evolution of a 3x3-bit multiplier under the same constraints as previously detailed. 
In this case the total chromosome length was extended to 100 logic elements (both gate 
primitives and functional logic blocks), greater than the total number of elements used for the phased 
approach. Ten attempts were made to evolve a 3x3-bit multiplier in this way. In all cases the 
trend is typical of that shown in Figure 3.14, indicating that many more than 10,000 generations 
would be required to evolve a successful circuit. Although not substantiated, a figure of 30,000 













[Figure 3.14 plot: horizontal axis GENERATION, 0 to 10000.] 
Figure 3.14: Example of unsuccessful evolution of 3x3-bit multiplier using single-step EHW 
technique. 
Synthesis of Miller's evolved 3x3-bit multiplier, shown in Table 3.5, reveals that its circuit 
area is on average 10% smaller than that of multipliers generated through phased evolution. 
However, results presented by Miller et al. in [12] show that 20 million generations were 
required to generate the 3x3-bit multiplier cited using a single-step evolutionary approach. Whilst 
it is difficult to make direct comparisons between different EHW platforms, this result clearly 
demonstrates the difficulty in generating the multiplier circuit. 
In stark contrast to the single-step approach, phased evolution provides the GA with numerous 
problems of smaller complexity, and thus shorter evolution times. Table 3.6 displays the average 
number of generations taken to evolve each successful sub-circuit (a maximum of ten 
sub-circuits per output) for the 3x3-bit multiplier presented. It is clear that if each sub-circuit were 
evolved in serial approximately 20,000 generations would be required to generate the 
multiplier circuit. However, modern networking and efficient processors provide simple methods for 
executing the phased algorithm in parallel. In any case, by using phased evolution the number 
of generations required to evolve a 3x3-bit multiplier in either serial or parallel is considerably 
smaller than a standard one-step procedure, as demonstrated in Figure 3.14. 
Average Number of Generations to Evolve Sub-circuit 
Output 0    Output 1    Output 2    Output 3    Output 4    Output 5 
3838        3930        7734        3508        827         55 
Table 3.6: Average Number of Generations Taken by Phased Evolution to Evolve Sub-circuits 
For Each Output of 3x3-bit Multiplier 
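The serial figure of approximately 20,000 generations follows directly from Table 3.6. The sketch below reproduces that sum and, as an idealised assumption that is not claimed in the text, takes the slowest sub-circuit as a bound for a fully parallel run with one evolutionary process per output and no communication overhead.

```python
# Average generations per sub-circuit, copied from Table 3.6 (Output 0 to Output 5).
avg_generations = [3838, 3930, 7734, 3508, 827, 55]

serial_total = sum(avg_generations)     # sub-circuits evolved one after another
parallel_bound = max(avg_generations)   # idealised: all six sub-circuits evolved at once

print(serial_total)    # 19892
print(parallel_bound)  # 7734
```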
Many of the sub-circuits evolved were compared to the average number of generations taken. 
Table 3.6 therefore reveals a good indicator of the circuit complexity required to produce the de-
sired output. Figure 3.15 displays an example of the best synthesised 3x3-bit multiplier evolved 
through the phased evolution technique. Examination of the schematic demonstrates that the 
most complex logic path results in the output of pin two. Figure 3.16 presents the section of 
digital logic relating to the sub-circuit evolved for output two after logic minimisation. 
Comparison with Figure 3.15 shows that sub-circuit 2 contains the most complex logic required 
to achieve correct functionality. This is confirmed by the number of generations taken on aver-
age to evolve the sub-circuit. 
Figure 3.17 illustrates the simplification of the sub-circuit relating to output five. The result- 






Figure 3.15: Example of synthesised 3x3-bit multiplier generated using phased evolution 
technique within the Virtual Chip EHW platform. 
through-connect elements within the evolving component library (recall that each sub-circuit 
had a fixed length encoding of fifteen logic elements). Although not shown, the original sub-
circuit representation of output five utilised a large number of through-connect and floating 
input elements (shown in Figure 3.2), before the removal of redundancies. 
3.4.2 Limitations of Virtual Chip EHW Platform 
The minimum coefficient word-length for an FIR filter is generally 8 bits. Therefore the 
Virtual Chip platform would be required to generate multiplier architectures considerably larger 
than the 3x3-bit multiplier developed using phased evolution in section 3.4. Miller et al. [12] 
demonstrated the successful generation of a 4x4-bit multiplier using an array style chromosome 
encoding. In order to make further comparisons with Miller's work, and to increase the 
complexity of the DSP circuit, the automated design of a 4x4-bit multiplier was attempted using the 
Virtual Chip platform and phased evolution. Timing and GA parameters remained the same as 
those detailed in section 3.4 and circuits were generated using the functional library. 








Figure 3.16: Schematic of sub-circuit relating to functionality of output 2 of 3x3-bit multiplier. 
Figure 3.17: Schematic of sub-circuit relating to functionality of output 5 of 3x3-bit multiplier. 
Table 3.7 demonstrates the results obtained for the automated design of each circuit 
corresponding to a single output of the 4x4-bit multiplier. It can be seen that only 50% of the multiplier's 
output circuits were evolved correctly. No successful solutions could be found for circuits 
which correctly described outputs 2, 3, 4 and 5. As with the 3x3-bit multiplier discussed in 
the previous section, a number of outputs were noticeably easier to generate using the genetic 
algorithm. Both the average fitness of those output solutions which were successfully 
generated, and the average number of generations required by the GA to produce these solutions, 
again provide strong indications as to the non-uniform distribution of circuit complexity within 
the multiplier search space, defined by the lookup table. 
3.5 Summary 
This chapter has presented an EHW platform, termed the Virtual Chip, for the automated design 
of performance driven DSP circuits. Within the platform a genetic algorithm was used to gen- 
Multiplier    Success    Average     Average Number    Average Generation 
Output        Rate       Fitness     of Generations    Before Final 
                                     if Successful     Stasis 
Output 1      0.10       0.984809    8002              3823 
Output 2      0.00       0.912326 
Output 3      0.00       0.836372 
Output 4      0.00       0.851562 
Output 5      0.00       0.842014 
Output 6      0.50       0.977431    3734.4            3389.8 
Output 7      1.00       1.000000    1585.7 
Output 8      1.00       1.000000    198.7 
Table 3.7: Success of Virtual Chip EHW platform to generate 4-bit multiplier using phased 
evolution. 
erate a number of benchmark circuits based on published research, including multiplier ar-
chitectures which would be required for FIR filter coefficient multiplication. All of the DSP 
architectures were required to operate within specified timing constraints, and were also lim-
ited in area by the number of logic elements chosen to represent each circuit. Results show 
that the genetic algorithm successfully generated more circuit solutions for each of the DSP 
applications using a library consisting of both primitive and larger functional logic elements 
than when only primitive logic elements were provided. The genetic algorithm also required 
fewer generations to find a correct solution when a functional library was used. Considerably 
fewer logic elements are required to describe circuits encoded using the functional library, res-
ulting in chromosome lengths much shorter than equivalent chromosomes encoded using the 
primitive library. Chromosome length translates directly into search space size, and it has been 
shown that a reduction in chromosome length, as a result of using a functional component lib-
rary, translates directly into a search space many orders of magnitude smaller than when only a 
primitive library is used. 
Both the timing and area of circuits produced using the genetic algorithm were analysed. 
Findings indicate that circuit performance, in terms of both timing and area, is largely 
independent of the type of component library used by the genetic algorithm. This can be substantiated 
as the GA produced circuits of comparable timing and area using either functional or primitive 
component libraries. Results show that circuits generated using the Virtual Chip are of compar-
able or better performance in terms of timing and area than those generated using behavioural 
HDL. By removing redundant logic elements, common in the circuit structures evolved, further 
performance increases can be obtained. 
In order to provide a mechanism for generating more complex DSP circuits using the Virtual 
Chip platform, a phased approach to circuit evolution was presented, and involved the 
segmentation of specific circuit outputs into individual circuit structures. A more complex 3x3-bit 
multiplier was evolved using this approach. Analysis revealed that logic element reuse between 
the multiplier's sub-circuits is high, such that after the removal of redundant logic through 
synthesis, surface areas are comparable to a functionally equivalent 3x3-bit multiplier generated 
through standard HDL design techniques. This has been attributed to the high degree of 
common functionality between each sub-circuit. Phased evolution partitions circuit complexity and 
in doing so reduces the search space into smaller landscapes, related to each sub-circuit. This 
segmented approach therefore reduces the associated degree of epistasis inherent in the 
chromosome's circuit encoding, making it possible to evolve complex multiplier circuits more 
effectively than a standard single-step EHW approach. Results also demonstrate the non-uniformity 
in the complexity of the multiplier architecture related to individual output paths. However, 
this approach was not successful in autonomously generating a more complex 4x4-bit parallel 
multiplier circuit. 
The failure of the Virtual Chip EHW platform and phased evolution to generate a 4x4-bit paral-
lel multiplier casts doubts on the success of using fine-grained component libraries to generate 
the more complex multiplication tasks required for digital FIR filtering. Difficulties in evolving 
multiplier circuits larger than 4 bits using the fine-grained gate primitives which constitute 
gate-level evolution are also expressed by Miller et al. in [12,81]. Functional-level circuit 
evolution provides a possibility for generating more complex DSP applications. However, the size 
of logic elements must be considerably larger than those presented in this chapter. The next 
chapter therefore presents an alternative method of generating filter coefficients without ex-
plicitly using a multiplier architecture, and details the type of logic element which would be 
required for implementation using evolvable hardware. 
Chapter 4 
FIR Digital Filtering with 
Multiplierless Architectures 
4.1 Introduction 
When implementing digital signal processing (DSP) applications in hardware, great effort is 
made to ensure the level of performance demanded by the consumer market on such devices. 
Finite impulse response filters (FIRs) constitute the back-bone of most DSP applications and 
are therefore typically embedded alongside other processing cores which comprise the system. 
This is especially true when considering system-on-chip (SoC) applications. As a result 
considerable design resources are poured into innovative realisations of the FIR filter algorithm in 
hardware. The performance and portability of hardwired FIR filters are therefore of great im-
portance. Filter performance issues in this thesis centre around speed of processing, physical 
area, design re-use and device reliability; all of which contribute directly to design complex-
ity. Design re-use is becoming increasingly important to the fast development of application 
specific DSP devices, such that existing architectures can be ported into new applications with 
minimum re-design and test overhead. General purpose DSPs, such as the TMS320 series from 
Texas Instruments, do not provide sufficient throughput to implement high speed FIR filters, 
due to their single multiplier architecture. General purpose FPGAs, such as those from 
Xilinx [24], are suitable for implementing dedicated filter architectures. However, the general 
functionality of FPGAs results in complex configurable logic blocks (CLBs) which require a 
high degree of interconnect. This restricts circuit throughput and increases the physical area of 
the device. 
This chapter presents the basic theory behind FIR filtering and demonstrates a number of ways 
in which filters can be implemented, particularly in hardware. Various multiplierless filter 
design methodologies and a range of hardware architectures and devices on which they can be 
implemented are also presented; in addition to the major building blocks required to generate 
multiplier-free digital filters in hardware. 
4.2 FIR Filter Theory 
Finite impulse response (FIR) theory is well documented and will not be covered compre-
hensively in this thesis. Instead, only that material relevant to the research outlined in chapter 
one will be presented, in order to provide a better understanding and appreciation of the re-
search problem investigated. Detailed coverage of FIR system theory and design can be found 
in [83,84]. 
An FIR filter can be described as a sum of N coefficients, resulting in an (N - 1)th order filter 
given by the difference equation 
$$ y(n) = \sum_{i=0}^{N-1} h(i)\, x(n-i) \tag{4.1} $$
Where h(i) is a weight assigned to a given coefficient, and multiplied with the input sequence 
x(n). It is these weights which describe the behaviour of the filter. By taking the z-transform 
of equation (4.1) an FIR system can be described by the transfer function 
$$ H(z) = \sum_{i=0}^{N-1} h(i)\, z^{-i} \tag{4.2} $$
H(z) therefore describes a filter with both N - 1 poles and zeros. Because all the poles of an 
FIR filter lie at the origin of the z-plane, and hence within the unit circle, an FIR system can be 
described as an all-zero filter which is unconditionally stable. Stability is achieved because an FIR filter 
is a non-recursive system, as is clearly evident from equations (4.1) and (4.2). As a result the 
unit sample response for the FIR system is identical to the coefficient set h(i) such that 




$$ y(n) = \begin{cases} h(n), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{4.3} $$
The frequency response H(ω) of an FIR filter can be calculated from a given set of coefficient 
weights, h(n), as follows: 
$$ H(\omega) = \sum_{n=0}^{N-1} h(n) \exp(-j n \omega t) \tag{4.4} $$
The set of coefficient weights, h(n), is therefore calculated relative to H(ω) by integrating 
equation (4.4) in the frequency domain. 
Each set of filter coefficients is determined by the amount of signal shaping and the frequency 
response of the input signal, both of which are required to achieve the desired output response. 
This is specified through three criteria which constrain the filter as shown in Figure 4.1. The 
maximum gain of the filter, represented in decibels (dBs), is determined by the stopband 
attenuation given as 20 log10(δ_2), where f_2 corresponds to the edge of the stopband. Passband 
ripple is defined in dBs as 20 log10(1 + δ) and governs the amplitude of the resulting output 
response, H_A(f), where f is the passband cut-off frequency. The transition band is calculated 






Figure 4.1: Filter specifications for passband ripple (1 + δ) and stopband attenuation (δ_2). 
In order to obtain a finite impulse response, a window function of M samples is used to 
truncate the infinite time-domain sequence, c_n, into a limited range defined by M. This results in a 
2M + 1 tap filter. Because only a finite number of coefficients are employed, the actual 
amplitude response, H_A(ω), will not match exactly with the desired amplitude response, H_D(ω). 
H_A(ω) is therefore calculated by convolving the desired frequency response with the frequency 
response of the window function, denoted as W(ω), such that 
$$ H_A(\omega) = H_D(\omega) * W(\omega) \tag{4.5} $$








Figure 4.2: Convolution in frequency domain for (a) desired amplitude response; (b) frequency 
response of input signal; (c) actual frequency response from FIR filter. 
A wide range of window functions are available, and each displays characteristics which affect 
the passband, stopband and transition band which together relate to the frequency response of 
the filter depicted in Figure 4.1. It is the job of the filter designer to determine which window 
function best expresses the desired frequency response for a particular signal processing applic-
ation, whilst minimising the number of taps which directly translates into filter complexity. 
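As a minimal sketch of the window method described above, the following truncates an ideal low-pass impulse response to 2M + 1 taps. The Hamming window, the cut-off frequency and the function names are illustrative assumptions only; the text does not prescribe a particular window or specification.

```python
import math


def windowed_lowpass(cutoff, M):
    """Return the 2M + 1 coefficients c_n for n = -M..M.

    cutoff is the normalised cut-off frequency in cycles per sample
    (0 < cutoff < 0.5). The window choice (Hamming) is an assumption.
    """
    coeffs = []
    for n in range(-M, M + 1):
        # Ideal (infinitely long) low-pass impulse response sampled at n.
        if n == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * n) / (math.pi * n)
        # Hamming window centred on n = 0; everything outside -M..M is discarded.
        window = 0.54 + 0.46 * math.cos(math.pi * n / M)
        coeffs.append(ideal * window)
    return coeffs


def amplitude_response(coeffs, M, freq):
    """Evaluate sum_n c_n cos(2*pi*freq*n), the real response of a symmetric c_n."""
    return sum(c * math.cos(2 * math.pi * freq * n)
               for c, n in zip(coeffs, range(-M, M + 1)))


M = 10
c = windowed_lowpass(0.2, M)
print(len(c))                          # 21 taps, i.e. 2M + 1
print(amplitude_response(c, M, 0.0))   # close to 1 in the passband
print(amplitude_response(c, M, 0.45))  # close to 0 in the stopband
```

Increasing M narrows the transition band at the cost of more taps, which is the trade-off between frequency response and filter complexity noted above.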
4.2.1 Linear Phase FIR Filters 
Linear phase characteristics in an FIR filter provide a means of maintaining the delay and phase 
relationship between frequency components of the input pulse applied to the filter, thereby 
minimising signal distortions. For an FIR filter to exhibit linear phase, the coefficient set must 
possess conjugate-even symmetry around its centre weight. There are four ways of achieving 
a linear phase FIR which depend on whether the number of taps, N, is odd or even, and whether 
the symmetry of the impulse response, c_n, is odd or even. Therefore if the impulse response 
displays even symmetry then c_n = c_{-n}, where c_n lies within the finite range -M to M 
defined by the size of the truncation window. However, for an FIR to exhibit linear phase the 
filter must be made physically realisable, such that impulse response samples preceding c_0 are not 
associated with negative time. This case is not permissible, or non-causal, as it implies that the FIR filter 
starts producing an output response before any input stimulus is applied. In order to make 
the filter causal and maintain linear phase, the entire finite impulse response sequence resulting 
from window truncation is delayed by M samples. As a result, the impulse sample relating to 
-M is delayed in cascade by M before coefficient multiplication. Figure 4.3 shows the effect 
of delaying the centre impulse response, originally located at c_0, by M, such that the impulse 
response sequence now lies in the range c_0 to c_{2M}. 
[Figure 4.3 plot: impulse response against n, with axis marks at -M, 0, M and 2M.] 
Figure 4.3: Impulse response of causal FIR filter shifted M times. 
The transfer function of a linear phase FIR filter can therefore be written as 
$$ H(z) = \sum_{n=-M}^{M} c_n\, z^{-(n+M)} \tag{4.6} $$
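The effect of the M-sample delay can be checked numerically. The sketch below uses an arbitrary, illustrative set of even-symmetric coefficients (not taken from the thesis) and evaluates the causal transfer function on the unit circle; once the delay term is removed the response is purely real, so the phase is the linear function -ωM wherever the amplitude is positive.

```python
import cmath

# Illustrative even-symmetric coefficients c_n for n = -2..2 (so M = 2).
c = [0.1, 0.25, 0.3, 0.25, 0.1]
M = (len(c) - 1) // 2

# Causal filter: the same values, indexed from 0 to 2M, i.e. h[k] = c_{k-M}.
h = list(c)


def freq_response(h, omega):
    """Evaluate H(z) = sum_k h[k] z^-k at z = exp(j*omega)."""
    return sum(hk * cmath.exp(-1j * omega * k) for k, hk in enumerate(h))


for omega in (0.2, 0.5, 1.0):
    H = freq_response(h, omega)
    zero_phase = H * cmath.exp(1j * omega * M)  # undo the M-sample delay
    print(omega, cmath.phase(H), zero_phase.imag)
```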
4.3 FIR Filter Implementation 
This chapter focuses on the implementation of filter architectures associated with the direct 
form (DF) and transposed direct form (TDF) FIR systems. Both DF and TDF structures have 
been chosen as they represent the most widely used form of FIR architectures. In all forms of 
FIR system M additions, M delays and M + 1 multiplications are required to implement the 
filter, where M is the FIR filter length (number of taps). 
In an FIR filter each impulse of the input, x (n), is expressed as a finite word-length of N-bits. 
The length of bit encoding used to represent both the filter coefficients and the input signal is 
important as it affects both the design of the filter response and the size and complexity of the 
hardware needed to implement the system. Quantising the input signal generates noise and lim-
its the accuracy of the filter calculations. Within an FIR system word-lengths of 8 to 24-bits are 
usual, and depend on the signal processing application. Few FIR filters are implemented with 
precisions higher than 24 bits, as the hardware resources required by the filter become prohibitive. 
Although of interest, the effects of quantisation noise are not examined in this thesis as they are 
covered in depth in texts such as [83,84]. 
It should be noted that many other FIR structures exist which provide architectures suited to 
particular DSP applications. For example the lattice FIR filter structure is extensively used 
in digital speech processing and in the implementation of adaptive filters. All FIR structures 
however require a multiplication stage in which the input x (n) is weighted by M filter coeffi-
cients. It is the positioning of the multiplier elements within the FIR design flow which forms 
the primary difference between DF and TDF filter architectures. 
4.3.1 Direct Form FIR Structure 
The direct form implementation of equation 4.1 is illustrated in Figure 4.4. It can be seen that 
x(n) is delayed in descending coefficient order from N - 1 before it is multiplied with the 
relevant coefficient and then summed. 
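A behavioural sketch of the direct form structure is given below. It is a software model under the assumption that x(n) = 0 for n < 0, with hypothetical names; it is not a description of the hardware in Figure 4.4.

```python
def fir_direct_form(h, x):
    """Direct form FIR: y(n) = sum over i of h(i) * x(n - i), with x(n) = 0 for n < 0."""
    N = len(h)
    delay_line = [0] * N                  # delay_line[i] holds x(n - i)
    y = []
    for sample in x:
        # Shift the new sample into the chain of unit delays.
        delay_line = [sample] + delay_line[:-1]
        # Weight every delayed sample by its coefficient and sum the products.
        y.append(sum(h[i] * delay_line[i] for i in range(N)))
    return y


# A unit sample at the input reproduces the coefficient set at the output.
print(fir_direct_form([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]
```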
The number of multiplications can be reduced by a factor of two if the FIR exhibits linear phase. 
The direct form FIR system can then be folded to produce the filter architecture displayed in 
Figure 4.5. N additions and delays are still required to implement the design, however, the 
number of multiplications is reduced from N to either N/2 for even symmetry or (N - 1) /2 
for odd symmetry. This translates into a substantial reduction in hardware. 
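The folding can be sketched in software for the case of N even with even coefficient symmetry, h(i) = h(N - 1 - i). The function below is an illustrative model with hypothetical names: the two delayed samples that share a coefficient are added first, so only N/2 multiplications are performed per output.

```python
def fir_folded(h_half, x):
    """Folded direct form for N = 2 * len(h_half) taps with h(i) = h(N - 1 - i)."""
    half = len(h_half)
    N = 2 * half
    delay_line = [0] * N                  # delay_line[i] holds x(n - i)
    y = []
    for sample in x:
        delay_line = [sample] + delay_line[:-1]
        acc = 0
        for i in range(half):
            # Pre-add the pair of samples that share coefficient h(i),
            # then perform a single multiplication for the pair.
            acc += h_half[i] * (delay_line[i] + delay_line[N - 1 - i])
        y.append(acc)
    return y


# The impulse response is the symmetric coefficient set 1, 2, 2, 1.
print(fir_folded([1, 2], [1, 0, 0, 0, 0]))  # [1, 2, 2, 1, 0]
```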
Because of the initial delay unit before each coefficient multiplication, both cases of the DF 
structure are particularly suited to hardware implementation using a single multiplier 
architecture performing a MAC (Multiply ACcumulate) operation. The MAC operation continually 
Figure 4.4: Direct form FIR filter implementation. 
stores the result of each coefficient multiplication on every unit delay, until x(n) has been 
passed through each filter tap. Figure 4.6 illustrates this concept. Using a single MAC in place 
of multiple multiplier units in either folded direct form, or direct form structures greatly 
reduces the area of the filter and imposes no additional delays into the system as each addition of 
the weighted input, W_i(n), is required before data is ready at the filter output. This approach 
is therefore highly suited to low-power, low area DSP applications which do not require high 
speed data processing. Each coefficient is simply multiplexed from its corresponding memory 
location at the relevant time and passed to the multiplier unit. The resulting FIR filter can now 
be implemented using 1 multiplier, 1 storage unit (shift register) and 1 adder. 
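The time-multiplexed use of one multiplier can be modelled as follows. This is a behavioural sketch with hypothetical names, assuming one MAC cycle per tap; it also counts the MAC cycles to make the throughput cost visible.

```python
def fir_single_mac(coeff_memory, x):
    """FIR filter computed with one multiplier, one accumulator and a shift register."""
    N = len(coeff_memory)
    shift_register = [0] * N
    y = []
    mac_cycles = 0
    for sample in x:
        shift_register = [sample] + shift_register[:-1]
        accumulator = 0
        for tap in range(N):
            # One MAC cycle: fetch a coefficient from memory, multiply, accumulate.
            accumulator += coeff_memory[tap] * shift_register[tap]
            mac_cycles += 1
        y.append(accumulator)
    return y, mac_cycles


outputs, cycles = fir_single_mac([1, 2, 3], [1, 1, 1])
print(outputs)  # [1, 3, 6]
print(cycles)   # 9, i.e. N cycles for each of the three input samples
```

In this model every output sample costs N MAC cycles, which is why the single-MAC approach suits applications that do not require high speed data processing.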
4.3.2 Transposed Direct Form FIR Structure 
The transposed direct form FIR structure differs from the direct form in that the input, x (n), is 
fed into all multiplication units simultaneously. A cascade of delays and additions then connects 
to each coefficient multiplier in order to impose the relevant delay. Figure 4.7 and Figure 4.8 
display the basic TDF and folded TDF structures respectively. 
Both figures show that the TDF structure is capable of filtering data a factor of M faster than 
the DF architecture. This is because of the parallelism of the multiplier array, such that the 
weighted calculation of w_0(n) incurs no delay and is fed directly to the filter output. However, 
unlike the direct form FIR, a single MAC unit will incur significant latency if used in place 
of M separate multipliers, as the benefits of multiplier parallelism will be lost. 
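The data flow of the transposed structure can be sketched in software as follows. This is a behavioural model under the usual TDF arrangement (registers placed between the adders); the name `fir_transposed` is hypothetical.

```python
def fir_transposed(h, x):
    """Transposed direct form FIR: x(n) feeds every multiplier at once."""
    n_taps = len(h)
    state = [0] * (n_taps - 1)     # registers sitting between the adders
    y = []
    for sample in x:
        products = [c * sample for c in h]   # all products formed in parallel
        chain = state + [0]                  # value arriving at each adder
        sums = [p + s for p, s in zip(products, chain)]
        y.append(sums[0])                    # w_0(n) goes straight to the output
        state = sums[1:]                     # remaining sums are registered
    return y
```

The output for the current sample depends on only one product and one registered sum, which is why the multiplier array can work in parallel.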
Figure 4.5: Folded direct form FIR filter implementation (N even). 
4.4 Reduced Complexity FIR Filter Design 
Within an FIR system the multiplier is the primary performance constraint when implementing 
either the direct form or transposed direct form structures in hardware. Multipliers are costly 
in terms of area, power and signal delay. Several design techniques aim to reduce FIR filter 
complexity and improve performance by targeting the multiplier unit. 
The subset selection method relies on the design of non-uniformly spaced FIR filters [85] to 
produce filters requiring fewer multiplications and additions [86]. However, this is at the 
expense of increased signal delay. In addition, the desired filter is not guaranteed to be of minimal 
complexity. Kim et al. [87] are able to provide filters of minimal complexity by using mixed 
integer linear programming (MILP); hardware area is reduced accordingly. This approach is 
beneficial for programmable filters with time-varying coefficients, where high sampling 
frequencies are not required. 
Figure 4.6: Multiply accumulate (MAC) operator. 
Figure 4.7: Transposed direct form FIR filter implementation. 
4.4.1 Canonic Signed-Digit Encoding 
Coefficient recoding is an effective means of reducing the circuit complexity and power 
consumption of fixed-coefficient filters, prevalent in high performance, application specific 
architectures. By fixing the coefficient of each tap, dedicated multipliers can be implemented. 
Dedicated multiplication replaces the multiplier unit with a series of additions, subtractions and 
bit-shifts which are specific to the coefficient multiplicand. Using this approach, additions and 
subtractions become the most costly operations, and fixed bit-shifts are effectively resource free. 
It is therefore interesting to note that the number of add operations required to realise a constant 
coefficient multiplication is one less than the number of nonzero bits used to encode the 
coefficient. The canonic signed digit (CSD) code represents coefficients in a manner which 
minimises the number of additions required to perform the multiplication, by reducing the number of 
nonzero bits in the coefficient bit-string when compared to a 2's complement encoding [88,89]. 
In order to achieve this, coefficients represented in CSD are encoded in strings of length W 
such that $C = b_{W-1}, b_{W-2}, \ldots, b_0$. Each $b_i$ then takes on a value in the set 
$\{\bar{1}, 0, 1\}$, where '$\bar{1}$' is used to denote a subtraction operation and '1' an 
addition. The position of the bit in the string denotes its shift value. For example, consider 
the fixed coefficient multiplication of -894 by the filter input x(n), where the coefficient is 
encoded in CSD: 
$$y(n) = 0\bar{1}0010000010 \times x(n)$$ 
The multiplication may then be implemented as follows: 
$$y(n) = -(x(n) \gg 1) + (x(n) \gg 4) + (x(n) \gg 10)$$ 
where $\gg$ denotes a right bit-shift by an integer, i, which relates to the nonzero bit's position 
in the string, and is equivalent to the scaling operation $2^{-i}$. FIR filter performance using CSD 
is therefore governed by the word-length W and the number of nonzero bits in the coefficient 
representation, L. Within a CSD encoded word no two nonzero bits are consecutive in the 
string, hence the term canonic. As a result, the CSD representation of each number is unique 
and contains the minimum possible number of nonzero bits. On average, numbers encoded 
in CSD contain around 33% fewer nonzero bits than an equivalent 2's complement encoding. 
This can be demonstrated by comparing the 4 CSD encoded coefficients shown in Table 4.1 
with their 2's complement equivalents. 
Figure 4.8: Folded transposed direct form FIR filter implementation. 
CSD Encoding                          Decimal Equivalent   2's Complement Encoding 
$\bar{1}01010000010$                        -1406               1101010000010 
$0\bar{1}000\bar{1}0\bar{1}0\bar{1}0\bar{1}$    -1109               1101110101011 
$0\bar{1}0010000010$                        -894                1110010000010 
$10000010\bar{1}000$                        2072                0100000011000 
Table 4.1: Example of CSD encoded coefficients and their 2's complement equivalent. 
Importantly, the value of L directly affects hardware complexity, as an extra addition per tap is 
required each time L is incremented. Samueli [89] has shown that one nonzero CSD digit is 
required for approximately each 20 dB of stopband attenuation. Techniques for optimisation of 
CSD coefficients include localised search, gradient-based and branch-and-bound optimisation 
algorithms [90-92]. However, genetic algorithms have also been employed to find the optimal 
set of powers-of-two based coefficients which can then implement CSD [93,94], in addition to 
GAs which explicitly optimise CSD encoded coefficients [25]. 
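The recoding itself can be sketched in a few lines. The routine below is a standard non-adjacent-form conversion, offered as an illustration rather than as the method used in the cited works. The names `to_csd` and `csd_multiply` are hypothetical, digits are returned LSB first, and integer left shifts are used in place of the fractional right shifts shown above.

```python
def to_csd(value):
    """Return the CSD digits of an integer, LSB first, each in {-1, 0, 1}."""
    digits = []
    while value != 0:
        if value % 2 == 0:
            digits.append(0)
        else:
            # +1 when value is 1 mod 4, -1 when value is 3 mod 4; either way the
            # remainder becomes a multiple of 4, so the next digit is zero.
            d = 2 - (value % 4)
            digits.append(d)
            value -= d
        value //= 2
    return digits


def csd_multiply(x, digits):
    """Multiply x by a CSD-coded constant using shifts, adds and subtracts only."""
    acc = 0
    for shift, d in enumerate(digits):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc


digits = to_csd(-894)
nonzero_positions = [i for i, d in enumerate(digits) if d != 0]
```

For -894 the nonzero digits fall at bit positions 1, 7 and 10 (the last one negative), matching the third row of Table 4.1, whereas the 13-bit 2's complement form contains five nonzero bits.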
4.4.2 Primitive Operator Filters 
Bull et al. introduced the concept of primitive operator filters (POFs), which utilise directed 
graphs to optimise filter coefficients using a combination of add, subtract and shift operations 
in order to generate reduced complexity filter architectures which use a standard 2's complement 
coefficient encoding [2,95]. The POF approach therefore replaces the entire coefficient 
multiplication unit with a highly distributed architecture, tailored for a specific set of coefficients. 
Bull describes the principle behind POF such that it: "exploits the redundancy which exist in the 
direct-form structure. The underlying principle relies on the fact that partial products formed 
in any one coefficient-sample multiplication can be reused to assist in the formation of other 
product terms". 
Four POF algorithms were initially proposed, each utilising one or more operations: add, add/sub, 
add/shift, add/sub/shift. Bull demonstrated that algorithms which utilised shifts produced by 
far the best results, typically a factor of two better than when non-shift based algorithms were 
employed. Filters which utilise fixed shifts are favoured because fixed shifts require no logic and 
are therefore considerably smaller in area than either add or subtract units. An example of a 
multiplierless POF graph with n-bit shift and addition is presented in Figure 4.9. Note that 
the arbitrary coefficients used do not require additional re-coding as with the CSD approach, 
and that an impulse (logic '1') must first be applied to the design to ensure that the 
coefficients on each tap are correct. 
Figure 4.9: Example shift-add approach. 
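The reuse of partial products can be illustrated with a small hand-built graph. The coefficient set {3, 5, 13, 11}, the vertex names and the tuple layout are all invented for this sketch and are not taken from Figure 4.9 or from Bull's algorithms. Applying an impulse (x = 1) reads the coefficient on each tap back out, as described above.

```python
# Each vertex: (name, source_a, shift_a, op, source_b, shift_b)
# value = (source_a << shift_a) op (source_b << shift_b)
GRAPH = [
    ("w3",  "x",   1, "+", "x", 0),   # 3x  = 2x + x
    ("w5",  "x",   2, "+", "x", 0),   # 5x  = 4x + x
    ("w13", "w3",  2, "+", "x", 0),   # 13x = 4*(3x) + x, reuses w3
    ("w11", "w13", 0, "-", "x", 1),   # 11x = 13x - 2x, reuses w13
]


def evaluate(graph, x):
    """Evaluate every vertex of a shift/add/subtract graph for input x."""
    values = {"x": x}
    for name, a, shift_a, op, b, shift_b in graph:
        lhs = values[a] << shift_a
        rhs = values[b] << shift_b
        values[name] = lhs + rhs if op == "+" else lhs - rhs
    return values


impulse_response = evaluate(GRAPH, 1)   # reads back the tap coefficients
```

Four adder or subtractor vertices realise all four products here, because later vertices build on earlier ones instead of starting from x each time.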
Arslan et al. [96] investigated a number of configurable arithmetic macro structures designed 
to perform coefficient multiplication using the POF design technique. These structures were 
designed to be implemented on an FPGA and were shown to offer advantages both in terms 
of speed and physical area. One limitation of the configurable architecture proposed by Arslan 
lies in the single bus routing system used to connect the relevant macro structures; this bus 
produces a bottleneck which restricts communication between macro elements, thereby 
reducing throughput. In addition, the size and complexity of the configurable logic blocks (CLBs) 
required to implement the macro structures severely limited the scalability of the architecture, 
as CLBs did not map efficiently into the corresponding filter coefficients. 
POF is particularly attractive for autonomous filter design using EHW as it requires no initial 
encoding scheme such as CSD, and uses only three simple building blocks. Also, whilst the 
original work on POF utilised a heuristic search algorithm, Bull has demonstrated that the POF 
graph synthesis problem is NP-complete, making it a suitable candidate for optimisation using 
evolutionary algorithms. 
Redmill et al. present a method of obtaining minimal coefficient sets (as with CSD) in addition 
to optimising filter complexity in terms of the number of subtraction and addition operations 
performed [97]. Redmill's approach combines a heuristic directed graph search with a genetic 
algorithm. Wade et al. [26] employ a similar approach to POF by providing a GA with a 
number of basic FIR sections such as delay elements and addition units. The GA then generates 
the desired filter specification using these building blocks. 
Bull et al. have shown that FIR filters designed using POF are smaller in area than those 
designed using the CSD approach [2,97]. An exhaustive investigation of POF and related filter 
design approaches can be found in the Ph.D. thesis written by David Bull [98]. 
4.4.3 VLSI Implementations 
Custom hardware design of the FIR filter algorithm provides greater performance than if 
implemented on a general-purpose DSP architecture [99]. Both ASICs and programmable logic 
devices (PLDs) are used to develop custom FIR filter architectures. The choice of 
implementation depends greatly on the specification of the filter. Selection criteria are dependent on the 
filter's sampling frequency, number of taps, word-length, and the need for programmable 
coefficients. 
It has been shown that the performance of an FIR filter can be improved by replacing the 
multiplier with a series of bit-shifts and additions/subtractions. Physical area and signal delay 
are both reduced if this approach is taken. However, multiplierless architectures rely on the 
accumulation of partial products, which therefore generate a fixed set of coefficients. VLSI 
implementations of multiplierless FIR filters range from circuits capable of sampling frequencies 
from 313 kHz to 120 MHz, and filter orders from 32 to 64 taps [100-103]. These utilise 
power-of-two encodings and architectures tailored to a specific set of fixed coefficients. Coefficient 
programmability provides a means of extending the life of a filter core through design re-use. 
Whilst still tailored to the filter algorithm, these designs can be re-programmed for a range of 
applications, at the expense of increased complexity. Khoo et al. presented a multiplierless 
filter architecture encoded using CSD and capable of implementing 32 programmable taps 
limited to a maximum of two nonzero CSD encoded digits [104, 105]. Woon Jin Oh et al. developed 
a method of reducing the length of shifters for architectures implementing programmable CSD 
coefficients [106]. This approach reduces area at the expense of increased computation to 
generate an appropriate subset within the CSD code for a specific set of coefficients. 
Powell et al. present an investigation of the suitability of several VLSI architectures for 
high-speed, general purpose, programmable coefficient digital filters [107]. It was concluded that 
the direct form filter implementation using powers-of-two coefficients is highly effective for 
implementing hardware filters with 70 or fewer taps. However, a generalised transversal filter 
architecture (GTF) may be a better compromise between hardware efficiency and ease of 
implementation when programmability and scalability are desired. In the GTF, tap weights are 
applied to intermediate nodes which form a cascade of identical sub-filters. The output of each 
sub-filter is then appropriately delayed and summed to produce the desired filter response. 
A number of programmable architectures have also been developed which are specifically 
tailored to implement FIR filters designed using EHW. Miller uses gate-level evolution 
comprising XOR, AND and multiplexer logic functions to generate novel filters that do not use 
explicit coefficients [108]. Instead, filters are evolved using one of two different fitness 
functions. The first is based on computing the sum of the absolute differences between the actual 
filter response and that desired; the other is defined by examining characteristics of the Discrete 
Fourier Transform of the filter output. Whilst still mostly theoretical, this work demonstrates 
future avenues for VLSI implementations of filter applications. Flockton and Sheehan present 
a functional-level approach to the design of analogue filters centred around a generalised building 
block circuit [109] which uses a number of resistors, capacitors and operational amplifiers. The 
architecture is demonstrated through the intrinsic evolution of a linear band-pass filter. This 
approach is noteworthy because of the ease with which multiple building blocks can be concatenated 
to realise more complex filter functions. 
4.4.4 Design Adaptation and Fault Tolerance 
Multiplierless FIR filter architectures have been shown to produce high performance DSP in 
terms of operational speed and area. The reduced complexity FIR filter design methodologies 
discussed above provide innovative solutions conducive to high-performance hardware 
architectures. Whilst a number of programmable architectures have been cited which implement 
these concepts, no platform yet exists in which both the filter design algorithm and the 
programmable architecture interact in real time. Such a platform would provide a means of online 
filter adaptation, resulting in an architecture optimally configured for the current set of 
coefficients. 
Device reliability is perhaps the most costly of all performance issues discussed in this chapter. 
It is costly as it directly impairs operational speed and increases physical area. Fault tolerant 
VLSI systems employ techniques such as check-pointing [110], concurrent error detection 
[111] and redundancy. Karri et al. present a means of rapidly prototyping fault tolerant VLSI 
systems. Two approaches to the fault tolerant design of a 16-point FIR filter are examined. 
Analysis shows that designing reliability through controlled redundancy results in a VLSI design 
with smaller area and faster throughput than the same filter generated using a self-recovering 
architecture [112]. The effectiveness of FIR filters developed using EHW to withstand faults is 
examined in detail in chapter 7. 
4.5 Overview of Programmable Platforms 
Programmable logic devices (PLDs) provide an alternative means of implementing DSP 
algorithms in hardware beyond that of more traditional approaches which use either custom VLSI 
hardware, generic microprocessors, or more specific DSP processors. Since the early 1990s 
PLD technology has branched into two distinct logic structures termed field programmable gate 
arrays (FPGAs) and programmable logic arrays (PLAs). Both architectures comprise an array 
of identical configurable logic blocks (CLBs) which are used to implement a given algorithm 
in hardware. A binary data string is used to configure every CLB in the array, thereby 
programming the PLD with the desired functionality. One of the largest differences between FPGAs 
and PLAs is the interconnect structure used to pass data between CLBs. Interconnect can be 
highly distributed, as with FPGAs, or can be more restricted to rows or columns of CLBs, as is 
the case with many PLA architectures. Figure 4.10(a) illustrates the basic interconnect 
topology of an FPGA. Each CLB is directly connected to each of its adjacent neighbours, allowing 
data to be passed and received in all four directions of the array (north, south, east and west). 
Greater interconnectivity is further achieved by "fast" routing between CLBs which are not directly 
adjacent. This approach can be seen in Figure 4.10(b), which forms a hierarchical routing structure 
by grouping CLBs into 4x4 arrays. A CLB is then able to send or receive data from another 
CLB 4 units away, bypassing the CLBs which lie in between, which in turn frees resources. 
Greater levels of interconnect hierarchy can be achieved by increasing the array size of each 
CLB grouping. 
An example of a programmable logic array structure can be seen in Figure 4.11. The archi-
tecture shown represents that of the XC9500 family of PLAs from Xilinx [24]. It can be seen 
that each identical function block (FB) is arranged and connected in columns, with connectivity 
between FBs provided using an extended interconnect matrix. PLA routing is therefore simpler 
than that required for an FPGA. This reduces the complexity of implementing circuits on PLAs, 
but also limits the flexibility of the devices when compared to the FPGA. 
Figure 4.10: Basic FPGA interconnect structures and CLB layout. (a) Basic nearest neighbour FPGA interconnect structure. (b) Example 4-length FPGA "fast" interconnect structure. 
Both FPGA and PLA architectures implement logic functions through a combination of look-up 
tables (LUTs), which provide synchronous RAM, D-type flip-flops, and basic gate primitives 
(used for glue logic and signal multiplexing). Each of these functions is built into every 
CLB/FB in the array; when suitably programmed and interconnected, they produce the desired 
circuit functionality from the programmable device. The majority of logic functions are 
achieved through the LUTs. Each LUT is capable of implementing any arbitrarily defined 
Boolean function in the form of a logic truth table, whose bit patterns are stored in the individual 
CLB/FB. As these bit patterns are stored in RAM, they can be reloaded or newly written an 
unlimited number of times. Circuit designs implemented on a PLD can therefore be modified 
and corrected by programming new bit patterns into the LUT without actually changing the 
hardware. More complex logic functions can be generated by spanning the truth table across 
a number of LUTs. Each CLB therefore passes combinatorial bit data to the interconnect 
network, which can then be distributed within the array. A CLB can also store combinatorial 
data in D-type flip-flops which can be passed directly to the interconnect network. Multiple 
CLBs can therefore be configured to implement registers for storing binary words of arbitrary 
length. 
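The idea that a LUT is simply a stored truth table can be shown with a toy software model. This is only an analogy for the behaviour described above, not a model of any particular device; `make_lut` and `lut_read` are hypothetical names.

```python
def make_lut(func, n_inputs):
    """Build the configuration bits (truth table) of an n-input Boolean function."""
    table = []
    for address in range(2 ** n_inputs):
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_read(table, *inputs):
    """Look up the output: the input bits form the address into the table."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return table[address]


# The same 3-input "hardware" programmed with two different bit patterns:
sum_lut = make_lut(lambda a, b, c: a ^ b ^ c, 3)                      # full-adder sum
carry_lut = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)  # full-adder carry
```

Changing the function only changes the stored bit pattern; the lookup mechanism in `lut_read` stays the same, which mirrors reprogramming a PLD without altering the hardware.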
Figure 4.11: Example of a PLA architecture from the Xilinx XC9500 series. 
4.5.1 Performing Multiplication on PLDs using Distributed Arithmetic 
The majority of programmable logic devices such as those manufactured by Xilinx and 
Altera do not have dedicated multiplier architectures (however this is now changing). Instead 
multiplication is performed bit-serially across multiple CLB/FBs, an approach that requires 
considerable logic and interconnect resources. This limitation can be overcome by exploiting 
the LUT-based approach to computation inherent in most PLDs, which favours a much more 
efficient technique for data multiplication referred to as distributed arithmetic (DA). DA is most 
commonly used as an efficient method of implementing the weighted sum of products, or dot 
product algorithm, required for applications such as FIR filtering as shown in equation (4.1). 
The DA approach is similar to that of POF in that one factor of each product term remains 
constant. Each product term therefore consists of a single input variable and a constant 
coefficient. Input variables are normally represented as 2's complement binary numbers such that all 
partial product terms are computed simultaneously in the same period that would be required 
to implement a single partial product. Each input string of word length N can therefore be 
described as 
$$X_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \qquad (4.7)$$ 
where $b_{kn}$ is binary data (0 or 1), $b_{k(N-1)}$ is the LSB of the data word, and $b_{k0}$ is the sign 
bit, i.e. the Most Significant Bit (MSB). The result of multiplying data vector, X, of length K, with 
constant coefficient vector A, of length K, can be written as 
$$F = A_1 X_1 + A_2 X_2 + A_3 X_3 + \cdots + A_K X_K \qquad (4.8)$$ 
As an example, the LUT contents for a K = 4 data vector can be seen in Table 4.2. A total 
of $2^K = 16$ possible input configurations must be referenced, with the relevant partial product 
terms stored in the LUT. If each input word $X_k$ is N bits in length, then each partial product 
term output by the LUT must be accumulated each time the next data bit, $X_{kn}$, is passed. The 
DA multiplication function, F, therefore requires N LUT address reads and N accumulates. 
X4  X3  X2  X1   LUT content 
0   0   0   0    0 
0   0   0   1    A1 
0   0   1   0    A2 
0   0   1   1    A2+A1 
0   1   0   0    A3 
0   1   0   1    A3+A1 
0   1   1   0    A3+A2 
0   1   1   1    A3+A2+A1 
1   0   0   0    A4 
1   0   0   1    A4+A1 
1   0   1   0    A4+A2 
1   0   1   1    A4+A2+A1 
1   1   0   0    A4+A3 
1   1   0   1    A4+A3+A1 
1   1   1   0    A4+A3+A2 
1   1   1   1    A4+A3+A2+A1 
Table 4.2: Contents of LUT for K = 4 input data vectors. 
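The table can be generated mechanically for any coefficient set. The sketch below uses arbitrary integer coefficients chosen only for illustration; `build_da_lut` is a hypothetical name.

```python
def build_da_lut(coeffs):
    """Entry at address a holds the sum of coeffs[k] for every set bit k of a.

    coeffs[0] plays the role of A1, so bit 0 of the address corresponds to X1.
    """
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(2 ** len(coeffs))]


A1, A2, A3, A4 = 3, -5, 7, 2            # illustrative values, not from the thesis
lut = build_da_lut([A1, A2, A3, A4])

for address, content in enumerate(lut):
    # The four address bits are printed in the order X4 X3 X2 X1.
    print(format(address, "04b"), content)
```

Each added coefficient doubles the table, which is why memory becomes the limiting resource as the number of inputs grows.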
Figure 4.12 illustrates the basic DA processor required to implement the algorithm shown in 
(4.2). When implemented on an FPGA, the DA processor can be realised by storing all 
possible partial product results within a single LUT, usually spanning multiple CLBs as described 
in [113]. The LUT is addressed bit-serially such that each input variable is converted from an 
N-bit parallel word into a serial data stream which is passed to the LUT. The bit-serial input 
data, $X_{kn}$, references the LUT using the Least Significant Bit (LSB) first. Each partial product 
output from the LUT is then summed with the previous accumulated result, shifted one bit 
to the right and stored. Because all data paths in the DA processor are N bits wide, each right 
shift (equivalent to a divide by two) causes the LSB to be discarded. However, double 
precision can be retained by passing the discarded LSB via $Y_{lower}$ onto an auxiliary shift register. 
This process is repeated until all the sign bits of input vector $X_k$ are passed (simultaneously) 
to the LUT. Once this occurs, Sign control is read to determine the sign of the result present 
on $Y_{upper}$. If it is negative then a subtraction is performed. A minimum of N clock cycles is 
therefore required to process the input data vectors. Therefore, if the data width of each input 
word, N, is less than the number of input vectors, K (N < K), then the DA processor is in fact 















Figure 4.12: Distributed arithmetic processor. 
An FIR filter can therefore be implemented on an FPGA using the DA technique simply by 
increasing the size of the LUT, thereby utilising greater RAM resources. Figure 4.13 illustrates 
the extension of the DA processor to digital FIR filtering. 
Figure 4.13: Implementation of an N-tap FIR filter using distributed arithmetic. 
The input signal x(n) is first loaded in parallel and then converted to a serial stream, which feeds a 
cascade of serial shift registers (SR) that provide the necessary tap delays and order the input 
data for correct bit-serial addressing of the LUT. The MAC function of the scaling accumulator 
then sums each partial product term output by the LUT to achieve the desired filtering function. 
As a result, filter complexity, defined by tap length, is limited by the memory resources available 
to the programmable logic device. 
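A behavioural sketch of a DA-based FIR filter is given below. It departs from the processor described above in one labelled respect: it weights each partial product with an integer left shift instead of right-shifting the accumulator, which gives the same result for integer data without modelling the fixed data-path width. Inputs are assumed to be 2's complement integers that fit in `n_bits`; the name `da_fir` is hypothetical.

```python
def da_fir(h, x, n_bits):
    """FIR filtering by distributed arithmetic (LUT reads plus accumulates only)."""
    n_taps = len(h)
    # LUT of all possible coefficient sums, addressed by one bit from each tap.
    lut = [sum(c for k, c in enumerate(h) if (a >> k) & 1)
           for a in range(2 ** n_taps)]
    taps = [0] * n_taps            # cascade of shift registers, newest sample first
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]
        acc = 0
        for n in range(n_bits):    # LSB first: one LUT read and one accumulate per bit
            address = 0
            for k, word in enumerate(taps):
                address |= ((word >> n) & 1) << k
            if n == n_bits - 1:
                acc -= lut[address] << n   # the sign bit carries negative weight
            else:
                acc += lut[address] << n
        y.append(acc)
    return y
```

No multiplier appears anywhere in the inner loop; the cost is the 2^N-entry table, which is the memory limit noted above.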
A more detailed review of applying distributed arithmetic to DSP can be found in [114]. The 
article, written by Stanley White, also provides a number of techniques designed to increase 
the speed of DA multiplication, for example by partitioning input words into sub-words, which 
requires greater memory but introduces greater parallelism into the multiply-accumulate 
operation. Another technique aims at reducing the size of LUT required by DA. This approach is able 
to reduce memory resources to $\frac{1}{2} 2^{K}$ words by using a modified 2's complement representation 
termed Offset Binary Coding (OBC). OBC instead casts the binary states '0' and '1' as '-1' 
and '1' respectively. The input $X_k$ of word length N can therefore be used to re-write equation 
(4.2) as 
$$-X_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)} \qquad (4.9)$$ 
where $\bar{b}_{kn}$ is the complement of the bit $b_{kn}$. This approach can considerably reduce the 
limitations imposed by the memory resources available to the PLD, and enable the implementation 
of more complex FIR filters. Linear phase FIR filters can be used to further reduce the 
number of LUT addresses by a factor of 2. This is achieved by bit-serially adding the outputs of 
symmetrical tap pairs, as with folded form filter implementations. 
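The halving of the LUT can be seen from a short manipulation, sketched here along the lines of the standard offset-binary argument. The digit symbol $c_{kn}$ is introduced only for this illustration. Writing $X_k$ as half the difference between itself and its negation, and substituting (4.7) and (4.9), gives:

```latex
\begin{align*}
X_k &= \tfrac{1}{2}\left[ X_k - (-X_k) \right] \\
    &= \tfrac{1}{2}\left[ -(b_{k0} - \bar{b}_{k0})
       + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn})\, 2^{-n} - 2^{-(N-1)} \right] \\
    &= \tfrac{1}{2}\left[ \sum_{n=0}^{N-1} c_{kn}\, 2^{-n} - 2^{-(N-1)} \right],
\qquad
c_{kn} =
\begin{cases}
  b_{kn} - \bar{b}_{kn},        & n \neq 0 \\
  -(b_{k0} - \bar{b}_{k0}),     & n = 0
\end{cases}
\end{align*}
```

Every $c_{kn}$ is either -1 or +1, so each LUT entry is a sum of the form $\pm A_1 \pm A_2 \cdots \pm A_K$. Entries at complementary addresses differ only in sign, so storing half of them is sufficient.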
Marcos et al. present a comparison between three classic FIR filter structures: direct form, 
cascade and lattice, each implemented on an ALTERA 10K50 FPGA using DA [115]. Whilst 
this work identified the limitations inherent in each structure, it also showed that the direct 
form implementation was the most scalable, and translated well into the DA processor. An 
in-depth overview of the implementation of transposed form FIR filters on Xilinx's latest range 
of Virtex FPGA devices can be found in [116]. An extensive overview of distributed arithmetic 
processors and programmable logic device architectures is presented in [117]. 
4.5.2 Dedicated Programmable Logic Devices 
A number of recent PLD architectures have been developed which are dedicated to DSP 
applications, in addition to the dedicated programmable FIR filter architectures discussed earlier 
in this chapter. Chen and Rabaey developed a field-programmable multiprocessor IC termed 
PADDI (programmable arithmetic devices for high-speed digital signal processing). The device 
comprises a number of identical arithmetic units connected in a similar way to the PLA 
architecture shown in Figure 4.11 and specifically designed for high speed signal processing 
applications. DSP architectures benchmarked on the PADDI include a low-pass biquadratic filter, a 
3x3 linear convolver for image processing, and an RGB video matrix. The PADDI was shown 
to outperform a commercially available FPGA series (XC3090) produced by Xilinx at the 
time [118]. Rajagopalam and Sutton have recently presented an FPGA architecture dedicated 
to high speed flexible multiplication for demanding DSP applications [119]. Each functional 
block supports multiplication, addition and multiply-accumulate operations generated through 
a modified carry-save adder and carry logic circuitry. Again, this dedicated FPGA 
architecture outperforms modern FPGA devices, such as those currently available from Xilinx and 
Altera, which must implement multiplication through LUTs, and require extensive interconnect 
between many fine-grained CLBs (in terms of CLB functionality) in order to configure the 
desired DSP implementation. 
An example of a coarse-grained field programmable logic device for DSP applications is 
presented in [120]. The platform consists of an Arithmetic Switching Network (ASN), similar in 
layout to the FPGA. However, unlike the general FPGA architecture discussed earlier, the ASN 
comprises an array of adders, subtractors and multipliers. These arithmetic operations were 
chosen so that the device could efficiently perform different classes of linear, non-linear, 
orthogonal and non-orthogonal transforms applicable to algorithms such as the Discrete Cosine 
Transform (DCT), and Fast Fourier Transform (FFT). 
4.6 Summary 
This chapter has presented a basic overview of FIR filter theory, and underlined the relative 
benefits of implementing an FIR system in direct form (DF) and transposed direct form (TDF). 
A number of reduced complexity methodologies have been presented which target the 
multiplier stage of the FIR system, often replacing explicit coefficient multiplication units with a 
distributed series of bit-shifts, additions and subtractions. This approach is embodied by the 
primitive operator design methodology, which builds on logic elements used to produce 
previous coefficients in the current filter in order to generate the next coefficient in the set. The 
disadvantage of this approach is that the FIR system developed is constrained to a specific set of 
filter coefficients, which then binds the filter to an individual application and does not promote 
design reuse. Whilst this chapter has identified a number of high performance programmable 
architectures specifically designed to implement multiplierless filters, as well as presenting more 
general purpose PLDs, these platforms do not integrate design algorithms such as POF which 
would enable the system to reconfigure adaptively. Chapter 5 therefore presents the 
underlying framework for an EHW platform specifically tailored for implementing programmable 
multiplier-free FIR filters in hardware. Evolution is to be performed at a functional level 
considerably higher than that used in the Virtual Chip. Instead, the POF design methodology is 
adopted such that the GA is able to utilise a combination of additions, subtractions and 
bit-shifts. The EHW platform aims to provide the flexibility of an adaptive programmable filter, 
with the performance benefits inherent in a fixed coefficient architecture, resulting in a high 
performance programmable platform dedicated to rapid prototyping of FIR filter algorithms. 
Chapter 5 
Developing a Programmable 
Framework for Filter Design using EHW 
5.1 Introduction 
Chapter 4 has shown that fixed coefficient multiplierless filter architectures are suitable for high 
performance signal processing applications. However, they are not flexible and do not promote 
design re-use. Whilst programmable multiplierless filters have been developed, they have not 
yet been integrated with design algorithms such as the POF directed graph approach which 
would make them adaptive. 
This chapter presents the relevant building blocks identified in Chapter 4 to produce a dedicated 
Programmable Arithmetic Logic Unit (PALU) capable of implementing programmable 
multiplierless FIR filters as part of an embedded array of PALUs. Filters are to be realised within 
two competing programmable platforms, one inspired by the FPGA, the other by the PLA; each 
made up of an array of PALUs. Both programmable platforms, presented in chapter 6, are 
designed to accommodate the performance constraints discussed in the previous chapter, which 
centre around processing speed, physical area, component re-use and device reliability. The 
genetic algorithm developed to autonomously configure the two programmable platforms for a 
given set of filter coefficients is also presented. 
Both the PALU and genetic algorithm form the backbone of an EHW platform which has been 
designed to provide an adaptive, multiplierless hardware architecture, tailored to programmable 
FIR digital filter applications, and autonomously configured using a genetic algorithm. 
5.2 Overview of EHW Platform 
A further two programmable platforms are to be presented which provide a means of autonom- 
ously configuring the coefficient multiplication stage of an FIR filter using evolvable hardware. 
Developing a Programmable Framework for Filter Design using EHW 
Each of the two programmable architectures has been tailored for FIR filter design, program-
mability, and adaptation. Whilst the topology of each programmable platform is different, both 
architectures utilise the same genetic algorithm, and the same programmable arithmetic lo-
gic units (PALUs), designed to implement reduced complexity multiplierless filters. Both the 
PALU and the genetic algorithm have been developed in VHDL at the RTL (Register Transfer
Level), such that global parameters can be characterised and provide a scalable logic
core which can be ported into larger DSP applications. This approach can therefore be termed 
as Complete Hardware Evolution, as defined by Tufte and Haddow in [72]. The designer must 
then decide the data input width, data output width, and the maximum number of taps the plat-
form can support before the EHW platform is fixed in hardware. The latter will determine the 
dimensions of the specific programmable platform. 
The EHW platform developed is common to both the PLA and FPGA-based programmable 
platforms and consists of three processing units as detailed in Figure 5.1. The system controller 
programmes the current filter specification, storing both the tap-length and coefficient variables;
these are passed to the platform by the user during operation. Tap-length is passed directly
to the programmable platform, whilst the coefficient variables are relayed to the genetic al-
gorithm which must then determine how best to configure the programmable logic. The GA 
also verifies that the programmable platform is outputting the desired coefficients. Each tap 
within the programmable platform is therefore output back to the GA unit. Communication 
between processing units is fully synchronous. 
5.2.1 Programmable Arithmetic Logic Unit 
Each programmable platform is constructed from a number of identical programmable arith-
metic logic units (PALUs). Unlike the more macro-based approach taken by Arslan et al. [96],
a more granular structure is proposed. Each PALU is able to implement either a parallel n-bit 
left-shift, addition or subtraction as shown in Figure 5.2, with a bit-width dependent on the 
input data width of X(n). Therefore, as with POF and CSD approaches, explicit coefficient
multiplication is removed. 
A total of five control-bits are required to configure each PALU: 1-bit to determine the operation 
of the adder/subtractor, 1-bit for each of the routing multiplexors, and 2-bits to control the 
programmable shifter, which is capable of left-shifting from 0 to 3-bits. A PALU implementing 
a shift-by-zero acts as a through-connect. Each PALU output then feeds into a synchronous
register to create a pipelined architecture which increases data throughput.
Figure 5.1: Architectural overview of EHW platform for FIR filter implementation.
Figure 5.2: Programmable ALU for Multiplierless FIR Filtering.
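As a rough behavioural illustration of the five control bits described above, the following Python sketch models a single PALU operating on two's complement words. The function name `palu`, the field names (`op`, `sel_in`, `sel_out`, `shift`) and, in particular, the role assigned to each routing multiplexor are assumptions made for illustration only; the actual routing is that of Figure 5.2.

```python
def to_signed(value, width):
    """Wrap an integer into a signed two's complement word of `width` bits."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def palu(a, b, op, sel_in, sel_out, shift, width=16):
    """Hypothetical behavioural model of one PALU (five control bits in total).

    op      : 0 = add, 1 = subtract                  (1 bit)
    sel_in  : which input feeds the shifter          (1 bit, assumed role)
    sel_out : 0 = shifter output, 1 = adder output   (1 bit, assumed role)
    shift   : left-shift distance, 0 to 3            (2 bits)
    """
    if not 0 <= shift <= 3:
        raise ValueError("shift is a 2-bit field (0 to 3)")
    shifted_src, other = (a, b) if sel_in == 0 else (b, a)
    shifted = to_signed(shifted_src << shift, width)
    if sel_out == 0:
        return shifted                    # pure shift; shift 0 is a through-connect
    result = other - shifted if op else other + shifted
    return to_signed(result, width)


# Through-connect: a shift-by-zero passes the selected input unchanged.
print(palu(3, 5, op=0, sel_in=0, sel_out=0, shift=0))   # 3
# 5 + (3 << 2) = 17 and 5 - (3 << 2) = -7
print(palu(3, 5, op=0, sel_in=0, sel_out=1, shift=2))   # 17
print(palu(3, 5, op=1, sel_in=0, sel_out=1, shift=2))   # -7
```

Chaining several such units, each followed by a register, gives the shift-and-add decomposition of a coefficient without an explicit multiplier.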
Chapter 6 investigates a number of programmable logic topologies which utilise the PALUs 
illustrated in Figure 5.2. A genetic algorithm will be used to determine the most suitable topo-
logy of PALUs in which to implement a reconfigurable FIR filter using evolvable hardware. 
5.3 Implementing the Genetic Algorithm 
The genetic algorithm presented in this chapter was chosen to facilitate an investigation of the
main focus of research presented in this thesis, that being to develop the most suitable platform
for implementing high-performance digital FIR filters using EHW. From this, the coeffi-
cient multiplication stage has been identified as the primary unit for design automation. The
genetic algorithm also provides a controlled comparison on the merits of each programmable 
platform as it requires no specific knowledge of either the PLA or FPGA based architectures 
under investigation. As a result the success of each programmable platform relies solely on the 
ease with which the GA is able to navigate the search space and generate the desired set of filter
coefficients. 
Evolutionary algorithms have been used to optimise coefficient sets for multiplierless filter ap-
plications, whilst optimising a number of addition/subtraction and shift resources [26,94,97]. 
The GA presented in this chapter extends this principle to the optimal configuration of custom-
built PALUs to obtain a set of desired filter coefficients within two dedicated programmable
architectures. Each programmable platform is configured using a configuration string of binary 
data. The bit string is then used to determine the functionality of each PALU in the architec-
ture, and the flow of data between communicating PALUs, as constrained by each platform's
interconnect topology. The chromosome encoding therefore requires a binary representation 
like that discussed in Chapter 2. The genetic algorithm therefore differs considerably from the 
numeric-based chromosome encoding developed for the Virtual Chip EHW platform discussed 
in Chapter 3. Each programmable platform is now modified entirely by the genetic algorithm
such that a population of configuration-strings is used to produce an optimal PALU configur-
ation for a given filter specification.
A number of programmable logic devices have been used to implement EAs in hardware
[17, 121-123]. These algorithms are capable of running considerably faster than those imple-
mented on general purpose micro-processors, and are therefore suitable for applications which
require online adaptation as is often required with high performance digital filters. Evolutionary 
algorithms are frequently mapped onto PLDs so that the fitness function can later be modified 
for different optimisation problems. However, faster algorithms can be achieved when imple-
mented in dedicated VLSI hardware. Custom evolutionary algorithms implemented on ASICs 
can be found in [124, 125]. In such cases the fitness algorithm is fixed for a specific applica-
tion. The custom ASIC approach has been implemented in this thesis so that the GA can be
embedded alongside the programmable platform, making it highly suited to single-chip SoC
DSP devices. Figure 5.3 displays a schematic of the generic VHDL EHW platform model used
to implement both the FPGA and PLA programmable architectures, and the embedded genetic
algorithm.
Figure 5.3: Schematic of EHW platform including units comprising genetic algorithm and pro-
grammable platform (FPGA/PLA).
Each unit illustrated in Figure 5.3 is described briefly below. Units highlighted in blue indicate 
VHDL models which may be synthesised into silicon, units in red use real valued numbers for 
calculation, and therefore represent high-level behavioural VHDL descriptions. 
Memory-Unit 1 and 2: These memory arrays store the population of configuration strings 
required to program either the PLA or FPGA architectures. Each generation one memory unit 
is triggered to be read-only, while the other is write-only. These read and write states are 
determined by a logic high on the inputs Read-Enable and Write-Enable respectively, and
are present on both memory arrays. This is to enable both the fitness assessment of the parent
population, and the creation of a new offspring population. As a result the memory arrays
toggle between read-only and write-only modes each evaluation cycle (generation) so that once
an offspring population is created it can be evaluated in the following generation. The address 
of each configuration string in memory is passed via the Address input, whilst the data itself is 
fed bit-serially through Input_String.
MEMControl: This unit governs which Memory-Unit is to be read from or written to in
each generation. The input signal Route_Cntrl determines when this transition occurs, and is
only toggled when each configuration string in the current population has been evaluated and
the resulting offspring strings written. Inputs Address_1 and Address_2 are then multiplexed
between Memory-Unit 1 and Memory-Unit 2, where Address_1 selects configuration strings to
be read into the programmable platform, and Address_2 points at the relevant memory location
to which the next offspring string will be written. String writing to memory is triggered by 
the Write signal, set high by the Crossover Unit once an offspring string has been generated. 
Child_Output then passes the new configuration string into memory. A circuit diagram of the
MEMControl unit is shown in Figure 5.4. 
Figure 5.4: Schematic of MEMControl unit for memory read/write control.
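The alternating read-only/write-only behaviour of the two memory units can be pictured with a small software analogue. The sketch below is an assumption-laden illustration rather than the VHDL design: the class name `PingPongPopulation` and its methods are hypothetical, and whole strings are stored per address instead of being transferred bit-serially.

```python
class PingPongPopulation:
    """Two population memories: one is read (parents), the other written (offspring)."""

    def __init__(self, initial_population):
        size = len(initial_population)
        self.units = [list(initial_population), [None] * size]
        self.route_cntrl = 0          # index of the unit currently in read-only mode

    def read(self, address):
        """Read a parent configuration string (the read-address path)."""
        return self.units[self.route_cntrl][address]

    def write(self, address, child):
        """Write an offspring string into the other unit (the write-address path)."""
        self.units[1 - self.route_cntrl][address] = child

    def toggle(self):
        """Swap the roles once every offspring string has been written."""
        self.route_cntrl = 1 - self.route_cntrl


mem = PingPongPopulation(["0000", "1111"])
mem.write(0, "0011")
mem.write(1, "1100")
print(mem.read(0))    # 0000 (parents are untouched while offspring are written)
mem.toggle()
print(mem.read(0))    # 0011 (the offspring now form the parent population)
```

Keeping parents and offspring in separate memories means a parent can never be overwritten before it has been selected, at the cost of doubling the storage.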
Pop-Control: Determines when a configuration string should be read from memory, and which
unit has requested the bit string. Primarily this unit simply increments the read address in
memory of the next configuration string which is to program the PLA/FPGA. This is determ-
ined by the unit Population-Counter which passes the next address location to Pop-Control via
Pop_Count. Both Selection-Unit and Elitism-Unit request configuration strings via Pop-Control
using signals Xover_Enable and Elite-Enable respectively. The corresponding memory loca-
tion for both units is signalled by Address_location. Only when Select_Cntrl is low (logic '0')
are both Address_location and ActiveEnable recognised. When Select_Cntrl is high (logic '1')
then the EHW platform is in evaluation mode and addressing is achieved through Pop_Count.
Population-Counter: While searching for an acceptable filter solution, the EHW platform has
two modes of operation: evaluation mode, when configuration strings are read from memory
and assigned a fitness score based on how effectively they configure the corresponding pro-
grammable platform, and evolution mode, when good solutions in the current population are
selected to form offspring configuration strings which are written into memory for the next
generation. These two modes are controlled by Select_Cntrl, which is high during evaluation
mode and low during evolution mode. An internal sequential counter is used to increment
Pop_Count so that each configuration string in the current population can be accessed and eval-
uated. During this period Select_Cntrl is held high. Once the population limit is reached,
counting stops, Select_Cntrl toggles low, and the evolution mode begins. Evaluation mode re-
sumes once the next generation of configuration strings is written, indicated by a single pulse
from Enable_EHW, originating from the Crossover-Unit. Once this flag is received Pop_Count
is reset and then continues to increment again.
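The two-mode behaviour described above amounts to a small state machine. The following Python sketch is a simplified model under stated assumptions: one configuration string is evaluated per call to `clock`, and the class and argument names (`PopulationCounter`, `enable_ehw`) are hypothetical stand-ins for the corresponding signals.

```python
class PopulationCounter:
    """Hypothetical step-level model of the evaluation/evolution mode control."""

    def __init__(self, population_size):
        self.population_size = population_size
        self.pop_count = 0
        self.select_cntrl = 1         # 1 = evaluation mode, 0 = evolution mode

    def clock(self, enable_ehw=False):
        """Advance one step; `enable_ehw` models the pulse from the Crossover-Unit."""
        if self.select_cntrl == 1:
            self.pop_count += 1
            if self.pop_count == self.population_size:
                self.select_cntrl = 0         # population limit reached
        elif enable_ehw:
            self.pop_count = 0                # next generation has been written
            self.select_cntrl = 1
        return self.select_cntrl, self.pop_count


counter = PopulationCounter(4)
print([counter.clock()[0] for _ in range(4)])   # [1, 1, 1, 0]
print(counter.clock())                          # (0, 4): waiting in evolution mode
print(counter.clock(enable_ehw=True))           # (1, 0): evaluation resumes
```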
Selection_Reg: Acts as a temporary memory store for the current configuration string program-
ming the PLA/FPGA. The string will remain in the shift register until the performance of the
platform has been evaluated. The output, EHW_String, is enabled by Out_select which ensures
that the register output is delayed by one clock cycle so that memory can be safely accessed and
read through Pop-Control. Out_select is therefore also governed by Select_Cntrl as the memory
store is only required during the evaluation mode.
MEM_Coefficients: This memory unit stores the coefficient set which defines the current filter
specification. Each tap output is multiplexed and passed via Platform_Output to the
Fitness-Unit where it is evaluated against the corresponding desired coefficient, passed to the
Fitness-Unit via Current_Coeff.
Fitness-Unit: The performance of both the PLA and FPGA-based platform is assessed directly
through the Fitness-Unit. This is achieved by determining the quality of each filter coefficient,
presented to the Fitness-Unit, from the PLA/FPGA core via Platform_Output. The fitness of
each coefficient is then calculated by comparing it with the desired coefficient, stored in the
MEM_Coefficients unit. The fitness scores of each coefficient are then summed to provide an
absolute fitness of the current configuration-string, formalised as follows:
$$\Phi = \sum_{c=1}^{T} \begin{cases} f_c / F_c, & \text{if } f_c < F_c \\ F_c / f_c, & \text{otherwise} \end{cases} \qquad (5.1)$$
where $\Phi$ is the final "fitness score", $T$ is the total number of taps, $f_c$ is the PLA/FPGA output of
the current tap and $F_c$ is the desired current coefficient. The success of each PLA/FPGA archi-
tecture is therefore measured on the ability of the GA to successfully modify the configuration-
string over a number of generations, such that a set of coefficients is obtained which most
closely matches those stored in MEM_Coefficients. One benefit of employing this comparative
fitness measure was that it would be simple to implement in VHDL at the RTL level, and would 
translate easily into hardware. Figure 5.5 displays the corresponding circuit diagram. 
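The comparative fitness measure of Equation 5.1 can be sketched in a few lines of Python. This is an illustrative reading of the formula, not the RTL implementation: the function name `fitness_score` is hypothetical, the treatment of a zero operand is an assumption (the ratio is otherwise undefined), and the example uses positive values only, since the handling of signed coefficients is not spelled out at this point in the text.

```python
def fitness_score(outputs, targets):
    """Sum of per-tap ratios; each term equals 1.0 for an exact match."""
    score = 0.0
    for f, F in zip(outputs, targets):
        if f == F:
            score += 1.0              # exact match (also covers f = F = 0)
        elif f == 0 or F == 0:
            score += 0.0              # assumed handling: ratio undefined, no credit
        elif f < F:
            score += f / F
        else:
            score += F / f
    return score


print(fitness_score([96, 461], [96, 461]))   # 2.0 (both taps match)
print(fitness_score([48], [96]))             # 0.5 (output is half the target)
print(fitness_score([192], [96]))            # 0.5 (output is twice the target)
```

For positive values each term lies between 0 and 1, so a configuration that reproduces every coefficient exactly scores T, the number of taps.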
Figure 5.5: Schematic of Fitness-Unit for calculating quality of PLA/FPGA configurations for
a given set of filter coefficients.
SysEnable_Min is used to reset the accumulated fitness score only after all the filter coefficients
have been evaluated for the current string, and a new configuration string is loaded into the
programmable platform.
MEM_Fitness: Is identical to Memory-Units 1 and 2 discussed previously except that real-
valued variables are passed from Fitness-Unit, and as a result it is implemented as a separate
core. MEM_Fitness therefore stores the accumulated fitness score, corresponding to each con-
figuration string, in memory to be passed via Fitness into the Selection-Unit when requested.
Fit-Control: Is similar in functionality to Pop-Control in that it determines read/write access
to memory, in this case MEM_Fitness. When in evaluation mode (Select_Cntrl = '1') memory
address locations are determined by Pop_Count such that the position of a fitness score in
MEM_Fitness translates directly to the position of the associated configuration string. SysEn-
able_Min acts as Write-Enable permitting FitnessOutput to be written into memory. When
in evolution mode Select-Enable and Select_Address provide read access and select the desired
memory location respectively. Both signals originate from Selection-Unit and are invoked when
performance comparisons between configuration strings are made.
Selection-Unit: Two-way tournament selection was chosen as the selection algorithm for the
EHW platform as it is the simplest to implement in hardware, compared to more complex
algorithms such as proportionate selection, and proved successful when used with the Virtual
Chip EHW platform in Chapter 3. A schematic of Selection-Unit can be seen in Figure 5.6.
The Selection-Unit is first activated through Initialise, and thereafter via SelectActive. Initial-
ise is transmitted from Population-Counter once the maximum population count is reached
and Select_Cntrl goes low. Both these Control signals flag the Random Address-Unit and
have a period of two clock cycles. At each flagged clock cycle a randomly generated address
location is passed to MEM_Fitness via Select_Address; Select-Enable is also set high in or-
der to permit the memory read. This process is synchronous, therefore both Select_Address
and Select-Enable activate one cycle after the Control flag is received. This one cycle delay
between Select-Enable and Control is used to flag the remaining clock cycle when both sig-
nals are simultaneously high, so that during this period the first fitness score can be stored in
Reg A and the corresponding address stored in Reg B; this control process is highlighted in red.
When the second fitness score is received the decision unit, highlighted in blue, compares the
two scores and indicates the winner through the signal Decision ('0' if fitness score on Reg A
is the greater, otherwise '1'). The selection unit, highlighted in yellow, then passes the win-
ning configuration string location to SelectPos_Out. Only once the selection has been made is
Select_flag set high so that Xover_Enable can be used to activate the Crossover-Unit.








Figure 5.6: Schematic of Selection-Unit implementing two-way tournament selection.
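In software terms, the two-way tournament reduces to drawing two random addresses, comparing the stored fitness scores and returning the address of the winner. The sketch below is a minimal illustration; the function name `tournament_select` is hypothetical, and the random address generator is modelled with Python's `random.Random` rather than the hardware random number source.

```python
import random


def tournament_select(fitness, rng):
    """Return the address of the fitter of two randomly addressed individuals."""
    addr_a = rng.randrange(len(fitness))     # first random address
    addr_b = rng.randrange(len(fitness))     # second random address
    # Decision is 0 when the first score is the greater, otherwise 1.
    decision = 0 if fitness[addr_a] > fitness[addr_b] else 1
    return addr_a if decision == 0 else addr_b


rng = random.Random(1)
scores = [0.5, 2.9, 1.7, 2.1]                # one fitness score per configuration string
winner = tournament_select(scores, rng)
print(winner, scores[winner])
```

Only a comparator and two registers are needed, which is why this scheme maps more easily onto hardware than proportionate selection, which requires a running sum and a division.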
Elitism-Unit: Stores the address of the fittest solution in the current population, provided by
the input Pop_Count, so that it can be re-introduced into the new offspring population un-
changed. Only one elite individual is maintained each generation. On each clock event when
SysEnable_Min goes high a new fitness score, received via FitnessOutput, is compared with
the highest fitness score currently stored within the Elitism-Unit. The location of the fittest
configuration string is then held in Elite-address until the start of the evolution mode when
the Elite-Enable flag is set high for one clock cycle only, allowing the configuration string to
be read from the current parent population and written, via the Crossover-Unit, into the new
offspring population memory.
Crossover-Unit: The primary function of the Crossover-Unit is to generate new offspring
configuration strings through the genetic operators crossover and mutation. However, the Cros-
sover-Unit can also be used as a means of writing parent configuration strings directly from the
current memory population into the new memory population when crossover and mutation do
not occur, as is done when elitism is employed. Figure 5.7 illustrates the circuit diagram of the
Crossover-Unit.
Figure 5.7: Schematic of Crossover-Unit which implements genetic operators crossover and 
mutation in order to generate new offspring solutions. 
There are therefore two situations when a parent solution may be copied directly into the off-
spring population. The first is due to elitism, and the second is when the probability of crossover 
for a given string is not sufficient. 
The request for an elite string read/write is issued by the Elitism-Unit and passed into the Cros-
sover-Unit via the Elite-Enable flag. This flag is passed to two internal modules: AddressCount
and Decision-Unit. AddressCount serves to control which Memory-Unit acts as the parent
population (read-only), and which stores the offspring population (write-only) in each genera-
tion. An internal counter is used to increment the address location of the current Memory-Unit
in write mode. A count is triggered by either Enable_A, Enable_B or Elite-Enable. Once the
counter reaches the total population size, then enough configuration strings have been written
to memory and the Enable_EHW flag is set high to switch the EHW platform into evaluation
mode. At the same time Route_Cntrl is toggled, setting the Memory-Unit containing the new
offspring configuration strings into read-only mode, and allowing the old memory population
to be overwritten by switching it to write mode.
The Decision-Unit receives a copy of both the original parent string and potentially an asso-
ciated offspring string (if crossover and mutation occurred). With the Elite-Enable set, the
parent configuration string is output directly to Output-String, which then feeds into the current
offspring memory via MEMControl. Write is also set high to enable the memory write.
Each crossover operation, activated by the Xover_Enable flag, is determined randomly via the
Crossover-Unit's internal Random Generation module. A user-defined crossover probability
is used to determine if crossover occurs, where the random number generator is bounded in the
range 0 to 100 (representing a 0 to 100% chance of crossover). If no crossover occurs then the
Enable_A flag is set, acting in the same manner as the Elite-Enable flag discussed earlier. If
the crossover probability is met then Mate-Enable is used to activate the Crossover module,
which performs one-point crossover at a randomly selected locus along the bit string.
Because two parent strings are required to generate offspring, the Crossover-Unit must then
wait for a second configuration string to be passed to it from the Selection-Unit. The re-
quest is made via Next_String and extended by an additional clock cycle to generate the Se-
lectActive signal expected by the Selection-Unit. The Crossover-Unit's wait state is signalled
by Splice-Enable which causes the Random Generation module to bypass probability se-
lection, enabling the next parent string to pass directly to the internal Crossover module. Once
both offspring strings are generated Mutate-Enable is set to activate the Mutation module,
which applies bit-flip mutation to each offspring string with uniform probability determined by
the user. Each offspring is then output in turn via WriteChild and passed to the Decision-Unit.
Enable_B is set high to ensure that the offspring strings and not the parent are written into
memory, and that the address location is incremented.
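The crossover and mutation operators described above can be summarised with a short Python sketch. Several simplifications are assumed: both parents are supplied up front (the hardware decides on crossover before requesting the second parent), the random percentage is drawn from 0 to 99 so that 0% never and 100% always triggers crossover, and all function names are hypothetical.

```python
import random


def one_point_crossover(parent_a, parent_b, locus):
    """Swap the tails of two equal-length bit strings at index `locus`."""
    child_a = parent_a[:locus] + parent_b[locus:]
    child_b = parent_b[:locus] + parent_a[locus:]
    return child_a, child_b


def bit_flip_mutation(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if b == "0" else "0") if rng.random() < rate else b for b in bits
    )


def make_offspring(parent_a, parent_b, crossover_percent, mutation_rate, rng):
    """Produce two strings for the next population from two selected parents."""
    # Assumed comparison: a draw from 0..99 below the user-defined percentage
    # means that crossover takes place.
    if rng.randrange(100) >= crossover_percent:
        return parent_a, parent_b            # copied unchanged, no mutation
    locus = rng.randrange(1, len(parent_a))  # randomly selected crossover point
    children = one_point_crossover(parent_a, parent_b, locus)
    return tuple(bit_flip_mutation(c, mutation_rate, rng) for c in children)


rng = random.Random(7)
print(one_point_crossover("000000", "111111", 2))   # ('001111', '110000')
print(make_offspring("000000", "111111", 60, 1 / 6, rng))
```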
In summary, the genetic algorithm embedded within the EHW platform is parameterised as 
follows: 
• (μ, λ) generational genetic algorithm
• Population size 100
• User-defined crossover and mutation rate
• Two-way tournament selection
• One elite solution maintained each generation
5.3.1 Analysis of Genetic Algorithm 
The completed GA was simulated using Cadence's Leapfrog VHDL simulation environment 
with a crossover rate of 60% and a mutation rate equal to 1/L, where L is the bit length. A pop-
ulation of four arbitrary configuration strings of 15-bits was stored in the GA's Memory-Unit,
and each was assigned an imaginary fitness corresponding to how well the string might have
configured an array of PALUs for a given filter specification. It is clear that considerably longer
bit strings would be required to actually configure an array of PALUs; however, such short
string lengths were chosen as it would be easy to note the effects of crossover and mutation.
Figure 5.8 presents the resulting simulation waveforms relating to the GA's Crossover-Unit.
The Crossover-Unit reflects the most complex aspect of the GA architecture, and adequately 
demonstrates the global operation of the embedded genetic algorithm. 
Figure 5.8: Overview of waveform produced by genetic algorithm in EHW platform.
Figure 5.8 clearly shows the activation of the Crossover-Unit through the signal Xover_Enable
produced by the Selection-Unit, and the writing of Output-String into memory when both
Enable and Write are simultaneously high. The signal Address can also be seen to increment its
write location in memory each time an offspring string (Child-String) is available for writing.
The two initial configuration strings "000000000000000" and "111111110000000" present on
Input_String are shown by Splice_Pos (an internal signal in the Crossover module) to cross over
at bit 14, creating offspring "011111110000000" and "100000000000000" on String-1 and
String-2 respectively. Mutation is then shown to occur with the correct probability on ChildA,
highlighted in red, and ChildB, highlighted in blue, before being passed to Child-String and
written to memory.
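The crossover step in this simulation can be checked by hand. The short Python sketch below reproduces it under the assumption that bit 0 is the right-most character of each string, so that a crossover at bit 14 of a 15-bit string exchanges everything below the most significant bit; the helper name `crossover_at_bit` is hypothetical.

```python
def crossover_at_bit(parent_a, parent_b, bit):
    """One-point crossover: bits from `bit` upwards stay with each parent,
    the lower bits are exchanged (bit 0 is the right-most character)."""
    split = len(parent_a) - bit
    return parent_a[:split] + parent_b[split:], parent_b[:split] + parent_a[split:]


string_1, string_2 = crossover_at_bit("000000000000000", "111111110000000", 14)
print(string_1)   # 011111110000000
print(string_2)   # 100000000000000
```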
Further evidence that the genetic algorithm functions correctly is presented in detail in Chapter 
6, through the successful evolution of digital FIR filters using the EHW platform developed in 
this chapter. 
5.4 Summary 
This chapter has presented the development of a programmable arithmetic logic unit (PALU) 
which constitutes the basic building block for an EHW platform developed to autonomously 
implement FIR coefficient multiplication. The PALU developed is designed to replace explicit 
coefficient multiplication with a distributed series of bit-shifts, additions and subtractions. An 
array of PALUs comprises a programmable platform which is then autonomously configured us-
ing a genetic algorithm with a fitness function designed to reflect a given filter specification. The
GA is also employed to investigate the most suitable programmable platform for implementing
high-performance multiplierless digital filters. Two of the key genetic operators, crossover and
mutation (in addition to population size), can be parameterised in order to optimise the GA for
the filter application.
The basic EHW framework identified in this chapter therefore comprises the GA and the FPGA
or PLA-based programmable platform, both of which are written in VHDL. A detailed overview 
of communication between the GA and the programmable platform has also been presented, 
and the GA has been shown through VHDL simulation to operate correctly. Chapter 6 details 
an investigation into the most suitable programmable platform for digital FIR filter coefficient 
multiplication using EHW. 
Chapter 6 
Reconfigurable platforms for FIR filter 
implementation using EHW 
6.1 Introduction 
This chapter presents two programmable platforms, specifically designed to implement multiplier-
free coefficient multiplication for high performance, digital FIR filter applications. The first 
programmable platform is inspired by a class of logic devices termed field programmable
gate arrays (FPGAs), the second by a similar family of devices termed programmable lo-
gic arrays (PLAs). Both programmable platforms will utilise the PALU detailed in section 5.2.1
of Chapter 5 for the automated design of digital FIR filters using evolvable hardware.
The genetic algorithm developed in chapter 5 is used to examine a number of performance cri-
teria which focus on the following: the success of each EHW platform in generating a specified 
coefficient set, the number of PALU components utilised in each array, the degree of compon-
ent re-use required to produce new coefficient terms, and the ratio of left-shift, addition and 
subtraction operations required to implement the filter. Both the PLA and FPGA-based plat-
forms are examined with a range of filter input, tap output and PALU interconnect topologies
in order to determine the most suitable programmable multiplierless architecture.
Both the FPGA and PLA-based EHW platforms were implemented using a hardware descrip-
tion language (HDL) at the RTL level so as to provide accurate hardware modelling of each 
system. VHDL was chosen as it provided a simple means of creating arrays of PALUs using
the GENERATE statement, a feature which did not exist in Verilog prior to the IEEE
1364-2001 revision. The relevant circuit
layout of each programmable platform is also presented. Finally, the most successful program-
mable platform based on each of the performance criteria discussed is identified and selected 
for translation into a synthesised hardware model. 
6.2 Benchmark Filter Design 
In order to investigate the suitability of each programmable input, output and interconnect to-
pology for automated filter design using EHW, a 31-tap low pass filter was selected to provide 
the benchmark with which both the FPGA and PLA-based architectures will be compared. The 
filter was taken from the industrial design of low-power filter cores for hearing aids, developed 
in joint collaboration with Bernafon Ltd and the University of Edinburgh, detailed in [126]. The
corresponding coefficient set shown in Table 6.1 is highly challenging as it exhibits a large dy-
namic range with coefficient multiplicands scaling the filter input from 2' to 2^14, using word
lengths of only 16-bits. In addition the low-pass filter's gain must be no less than -52 dB.

Coefficient Taps    Dec
W-15, W15           -59
W-13, W13           96
W-11, W11           -220
W-9, W9             461
W-7, W7             -876
W-5, W5             1606
W-3, W3             -3171
W-1, W1             10326
W0                  16384
Table 6.1: Non-zero coefficients required for response of 31-tap low-pass filter

All other coefficients are zero. Therefore 9 distinct taps are required for a folded form implement-
ation, using the approach detailed in section 4.3.2. The corresponding filter response is shown
by the blue line in Figure 6.1.
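To make the folded form concrete, the following Python sketch expands the nine distinct coefficients of Table 6.1 into the full symmetric 31-tap impulse response and checks a few of its properties. The dictionary `HALF` and the helper `impulse_response` are hypothetical names introduced for illustration.

```python
# Non-zero coefficients from Table 6.1, indexed by tap offset from the centre tap.
HALF = {0: 16384, 1: 10326, 3: -3171, 5: 1606, 7: -876,
        9: 461, 11: -220, 13: 96, 15: -59}


def impulse_response(half, order=15):
    """Expand the distinct coefficients into the full symmetric tap list."""
    return [half.get(abs(k), 0) for k in range(-order, order + 1)]


h = impulse_response(HALF)
print(len(h))                        # 31
print(sum(1 for c in h if c != 0))   # 17
print(h == h[::-1])                  # True
print(sum(h))                        # 32710
```

The 17 non-zero taps share only 9 distinct values because of the symmetry, and the coefficient sum of 32710 lies close to 2^15 = 32768.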
The filter's transfer function was achieved by quantising the input impulse, X(n), and coefficient
word lengths to 16-bits. Because a number of negative coefficients are used, a 2's complement
encoding is required. Each programmable platform must therefore be characterised to accom-
modate these specifications. This was achieved during RTL level parameterisation of both the
PLA and FPGA VHDL models.
6.2.1 Experimental Setup 
Tests on both EHW platforms have therefore focused on the automated configuration of the 





Figure 6.1: Transfer function for 31-tap low-pass FIR filter (axes: samples (n); f/fs).
using the genetic algorithm described in section 5.3. Ten randomly generated populations of
configuration-strings were created for each investigation, such that each PLA and FPGA topo-
logy evaluated is initially configured using the same set of configuration strings. This provides
a common basis for comparison between all input, output and interconnect topologies, and
between the FPGA and PLA architectures themselves. It is then the task of the genetic al-
gorithm to manipulate each PLA/FPGA topology and generate the correct set of filter coeffi-
cients detailed in Table 6.1.
Communication between both the FPGA and PLA-based programmable platforms and the ge-
netic algorithm is detailed in section 5.2 and illustrated in Figures 5.1 and 5.3. The entire 
EHW platform, comprising either the FPGA or PLA-based programmable PALU topology, 
the genetic algorithm and FIR filter coefficient parameters, is then simulated in detail using 
Cadence's Leapfrog VHDL simulation environment, where the best configuration string and 
corresponding coefficient fitness is written to file each generation. 
A total of 6700 generations were performed by the GA for each of the 10 investigations, and for 
every programmable topology. The limit on the number of generations reflects the maximum
number of iterations each EHW platform can execute in one second of simulated "real time".
One second was chosen as it was deemed the maximum period acceptable for adapting the filter 
specification, either due to component damage, or to modifications to the filter application. 
6.3 Field Programmable Gate Array (FPGA) Topology 
The FPGA developed in this thesis has been tailored specifically for implementing reduced 
complexity primitive operator filters, by replacing the FIR multiplication unit with a program-
mable series of bit-shifts, additions and subtractions. The PALU illustrated in Figure 5.2 there-
fore reflects the computational aspect of the CLB which is required to implement the coefficient 
multiplication stage of an FIR filter. 
6.3.1 Interconnecting CLBs for an FPGA-based FIR Filter
There are a number of simplifications which can be made to the nearest neighbour connection 
topology highlighted in Figure 4.10(a). Because an FIR filter must be stable, no feedback 
between CLBs can be permitted as this might cause the filter configuration on the FPGA to 
become unstable. As a result each CLB will receive data from the south and east of the array, 
and output data from the north and east. Data travelling westward across the CLB array is 
therefore not permitted. This was also done to constrain the number of possible configurations 
on the FPGA, thereby reducing the search space required to find an acceptable filter solution, 
and lessening the burden on the genetic algorithm. Figure 6.2 illustrates the CLB element which 
incorporates the FPGA-based FIR filter. 
Programmable routing is performed by six 2:1 Multiplexor units, each governed by a single 
control bit. C10 and C9 determine which of the two inputs, HrzIN (east input) or VrtIN 
(south input), are passed to the PALU. C8 and C7 then control whether the output of the 
PALU is fed into the final routing unit, or whether the CLB's original inputs are selected 
instead, in which case the CLB performs a through connect operation. The output routing of 
each CLB is determined by control bits C1 and C0 (bits C6 - C2 are used to configure the 
PALU). HrzOUT and VrtOUT form the output of each CLB, which then connect to HrzIN 
and VrtIN of the next CLB, as determined by one of the interconnect topologies detailed in 
Figure 6.3. 
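The division of the 11 CLB control bits into fields can be illustrated with a small decoder. This is a sketch under stated assumptions: the field names are hypothetical, and it is assumed that C10 is the most significant bit of the control word and C0 the least significant.

```python
from typing import NamedTuple


class ClbControl(NamedTuple):
    input_select: int   # bits C10, C9: HrzIN or VrtIN for each PALU input
    bypass: int         # bits C8, C7: PALU result or original CLB inputs
    palu: int           # bits C6..C2: PALU configuration
    output_route: int   # bits C1, C0: routing onto HrzOUT and VrtOUT


def decode_clb(word: int) -> ClbControl:
    """Split an 11-bit CLB control word into its four fields."""
    if not 0 <= word < (1 << 11):
        raise ValueError("CLB control word must fit in 11 bits")
    return ClbControl(
        input_select=(word >> 9) & 0b11,
        bypass=(word >> 7) & 0b11,
        palu=(word >> 2) & 0b11111,
        output_route=word & 0b11,
    )


print(decode_clb(0b11_10_01010_01))
```

The meaning of each individual bit value (for example which input a '1' selects) is not modelled here, only the field boundaries.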
Figure 6.2: Configurable logic block (CLB) for FPGA including routing to and from PALU. 
The three interconnect topologies, shown in Figure 6.3, were investigated based on nearest 
neighbour connectivity to determine the most suitable interconnect sequence for implementing 
FIR coefficient multiplication on an FPGA-based EHW platform. 
• AFFA: Alternating feed-forward array. The flow of horizontal inputs fed to each CLB 
alternates from east to west. Although westward data flow is permitted in this topology it 
is still constrained such that no CLB feedback is possible. This interconnect topology was 
designed to maximise linkage between PALUs by providing a maximally long critical 
path through the PALU array. 
• CFFA: Continuous feed-forward array. Similar to a systolic array such that each PALU 
is clocked and data flows from the bottom left CLB of the FPGA to the top right. 
• CFFLA: Continuous feed-forward loop array. The connection topology builds on the 
CFFA by permitting connectivity between CLBs on the top row of the FPGA, with the 
CLB of the next adjacent column on the bottom of the FPGA. Again this approach eliminates CLB feedback. 








Figure 6.3: Various routing topologies for interconnecting PALUs in FPGA structure: (a) alternating 
feed-forward array topology; (b) continuous feed-forward array topology; (c) continuous 
feed-forward looping array topology. 
It is also important to determine the optimal placement of filter taps within the FPGA architec-
ture such that each FIR filter can be implemented successfully in hardware and with minimal 
CLB resource. Four programmable output topologies for placing FIR coefficient taps have 
therefore been investigated. 
• EOS: Edged output sequence. All CLBs on the outer edge of the FPGA are potential 
filter taps as shown in Figure 6.4(a). Each filter coefficient can therefore be generated 
by programming the relevant output CLB. The disadvantage of this approach is that a 
number of CLBs within the array must act as through connects to those CLBs on the 
outside edge. The total number of CLBs available as potential tap outputs is therefore 
\[ \mathit{EOS}_{taps} = (\mathit{Ywidth} \times 2) + ((\mathit{Xwidth} \times 2) - 4) \tag{6.1} \]
where Xwidth and Ywidth represent the number of CLB columns and rows respectively. 
• AOOS: Alternating orthogonal output sequence. The topology shown in Figure 6.4(b) 
enables filter taps to be output throughout the CLB array. This approach was intended 
to reduce the need for CLB through connects which might arise in the EOS topology. The 
total number of CLBs available as potential tap outputs is given by 
\[ \mathit{AOOS}_{taps} = (\mathit{Ywidth} / 2) \times \mathit{Xwidth} \tag{6.2} \]
• AAOS: Alternating arrow output sequence shown in Figure 6.4(c) is a derivative of the 
AOOS topology. However AAOS provides better localised connectivity between poten-
tial output CLBs, which more tightly couples the generation of partial products required 
to produce subsequent tap outputs within the coefficient set. The total number of CLBs 
available as potential tap outputs can be calculated as 
\[ \mathit{AAOS}_{taps} = (\mathit{Ywidth} + 1) \times (\mathit{Xwidth} / 2) \tag{6.3} \]
Both AAOS and AOOS topologies provide almost the same number of output CLBs. 
• BLOS: Base-line output sequence. Whilst the topology shown in Figure 6.4(d) is highly 
unrealistic in terms of the high degree of control logic and interconnect that would be required 
to implement it in hardware, it provides the genetic algorithm with a highly flexible 
means of implementing the desired filter response as every CLB in the array is a potential 
filter tap. The number of available CLBs is therefore 
\[ \mathit{BLOS}_{taps} = \mathit{Ywidth} \times \mathit{Xwidth} \tag{6.4} \]
The number of bits required to encode the allocation of a CLB to a given tap is determined by 
the number of CLBs which can potentially output a coefficient tap. For example, if a 4x4 array 
of CLBs utilised the base-line output sequence, BLOS, then 4-bits would be required to encode 
the relevant tap on each output CLB in the range 0 to 15. 
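The tap counts of equations (6.1) to (6.4) and the resulting encoding width can be checked with a short sketch. It reads the second term of equation (6.1) as twice Xwidth (a perimeter count), uses integer division for the halved terms, and rounds the bit count up to a whole number of bits; these choices and the function names are assumptions made for illustration.

```python
def tap_outputs(topology: str, xwidth: int, ywidth: int) -> int:
    """Number of CLBs that can act as tap outputs for a given output topology."""
    if topology == "EOS":
        return (ywidth * 2) + ((xwidth * 2) - 4)
    if topology == "AOOS":
        return (ywidth // 2) * xwidth
    if topology == "AAOS":
        return (ywidth + 1) * (xwidth // 2)
    if topology == "BLOS":
        return ywidth * xwidth
    raise ValueError(f"unknown output topology: {topology}")


def tap_select_bits(candidates: int) -> int:
    """Bits needed to index one of `candidates` CLBs, rounded up."""
    return (candidates - 1).bit_length()


for name in ("EOS", "AOOS", "AAOS", "BLOS"):
    count = tap_outputs(name, 4, 4)
    print(name, count, tap_select_bits(count))
```

For the 4x4 BLOS case this gives 16 candidate CLBs and a 4-bit index, matching the example in the text.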
Figure 6.4: Various output topologies for FPGA structure: (a) edged output sequence topology; 
(b) alternating orthogonal output sequence topology; (c) alternating arrow output sequence 
topology; (d) base-line (all CLBs connected) output sequence topology. 
In addition to CLB interconnect and output topologies, the connectivity and control of the filter 
input, X(n) must also be considered. Two signal input topologies are therefore presented. 
Figure 6.5 illustrates the L-shaped input sequence (LSIS) as it would be used in conjunction with the 
CFFLA interconnect topology. 
Figure 6.5: FPGA control of FIR filter input X(n), including position of input control string 
(1 bit each) within FPGA string encoding. 
Because all of the FPGA-based interconnect topologies presented in this thesis are feed-forward, 
X(n) is connected via Multiplexor control only to CLBs on the far right column of the array, 
and the bottom row. CLB inputs not directly connected to either a neighbouring CLB or X(n) 
are pulled low (given a logic value of zero). The total number of control bits required to 
determine the input connectivity of X(n) is then 
\[ I = \log_2 \mathit{Ywidth} + (\mathit{Xwidth} - 1) \tag{6.5} \]
The second input topology is considered the base-line input sequence (BLIS). This configuration 
connects all horizontal CLB inputs to X(n) via Multiplexor control. All north feeding 
CLB inputs therefore receive an input from either their nearest southerly neighbour, or from the 
filter input response. The total number of input control bits required is therefore 
\[ I = \log_2 \mathit{Ywidth} \times \mathit{Xwidth} \tag{6.6} \]
6.3.2 Configuring the FPGA-based FIR Filter 
Each FPGA-based filter architecture is configured via a binary configuration string. Each bit 
string is compartmentalised into three regions of control, defining the FPGA's output routing, 
input routing and individual CLB configuration as illustrated in Figure 6.6. 
[Figure 6.6 fields, in order: Output Routing; Routing of input X(n); FPGA CLB Control (CLB15 down to CLB0, 11 bits each).] 
Figure 6.6: Example configuration string for 4x4 FPGA-based FIR filter with LSIS, AFFA and 
EOS. 
The total length of configuration bit string can therefore be calculated as follows: 
\[ S_{FPGA} = \log_2 O + I + (11 \times \mathit{Xwidth} \times \mathit{Ywidth}) \tag{6.7} \]
where O is the total number of output bits governed by the output topology expressed by the 
relevant equation from (6.1) to (6.4), I is the total number of control bits used to program the 
filter input X(n) determined by the current input topology defined in equation (6.5) or (6.6), 
and $S_{FPGA}$ is the total resulting bit length required to program the FPGA. 
Figure 6.7 presents an example FPGA configuration of the 5-tap primitive operator filter ori-
ginally illustrated in Figure 4.9. A 4x4 CLB array is interconnected using the AFFA topology, 
with filter taps connected to CLBs using the EOS. The input pulse is held constant at logic '1' 
and connected to the FPGA via the L-shaped input sequence (LSIS) in order to produce the 
desired coefficient set. The bit string required to configure the FPGA-based FIR filter is also 
shown, and has been sectioned into the three regions of control discussed above. A number of 
the control bits used to configure the CLBs shown in Figure 6.7 are in the "don't care" state, 
'x'. This is due to redundancies inherent in the CLB control encoding. For example when a CLB 
acts as a simple through connect, as is the case with the CLB at position 0 in the array, then the 
control bits used to encode the PALU become redundant, as do control bits C10 and C9 which 
govern the inputs to the PALU. 
Figure 6.7: Example FPGA configuration of 5-tap primitive operator filter, including the 
chromosome encoding of each CLB. 
6.3.3 FPGA-based FIR filter Parameters 
An 8x8 array of PALUs was chosen to implement the 31-tap filter presented in section 6.2. 
64 PALU elements were deemed to be sufficient to provide enough partial product terms to 
generate the 9 distinct coefficient taps that were required. The dimensions of the FPGA were 
kept symmetrical so that each of the tap output topologies could optimally utilise the three 
interconnect sequences. The word length of the filter input, X(n), the FPGA coefficient output 
taps, and the PALU processing elements were parameterised to 16-bits and represented in 2's 
complement within the VHDL model. These parameters were set to match the specification 
of the low-pass FIR filter. The FPGA filter input, X(n), is held constant at a value of one (i.e. 
"0000000000000001" in 16-bits) so that each selected tap output can be compared directly with 
the corresponding coefficient in the filter specification. A clock of 50MHz was used to control 
the FPGA-based EHW platform so that approximately 6500 generations were performed within 
the one second evolution window specified. A total of 831 configuration bits are 
required to implement the 8x8 FPGA-based filter when implemented with base-line input and 
base-line output topologies (BLIS and BLOS respectively). This represents a search space of 
2^831 possible bit string configurations which must be successfully manipulated by the GA in 
order to achieve the desired low-pass filter response. 
6.3.4 Investigation of Genetic Operator Parameters 
Both the crossover and mutation rates of the genetic algorithm developed in section 5.3 are 
parameterisable, and are user defined before circuit evolution. A crossover rate of 0.6 (60%) 
is generally recommended [53], and is consistent with the rate set for the Virtual Chip EHW 
platform in Chapter 3. However, the parameterisation of genetic operators depends greatly 
on the problem domain, and the manner in which the chromosome is encoded. These two 
constraints differ significantly from the GA implementation required for the Virtual Chip. As 
a result two crossover rates were initially examined. The first utilised a crossover probability 
of 0.6, the second removed crossover altogether (probability 0.0). This was done in order to 
determine the effectiveness of crossover as a search mechanism for determining an acceptable 
FPGA filter configuration using a binary encoded chromosome. 
All three FPGA interconnect topologies were evaluated when evolving the low-pass filter with 
and without the crossover operator. This was to achieve a thorough indication as to the use-
fulness of the crossover operation. In order to maintain continuity over each evaluation, the 
L-Shaped Input Sequence (LSIS), and Edged Output Sequence (EOS) were arbitrarily selected 
as the I/O topologies. The probability of mutation was set according to Mühlenbein's 1 over bit 
length rule, defined in equation (2.1). 
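The 1 over bit length rule can be illustrated with a minimal bit-flip mutation operator. This is a sketch, not the GA developed in section 5.3: the function name and the `scale` parameter (a multiplier on the base rate 1/N) are hypothetical, and each bit is assumed to be flipped independently.

```python
import random


def mutate(chromosome, scale=1.0, rng=random):
    """Flip each bit independently with probability scale / N, N = bit length."""
    n = len(chromosome)
    p_m = scale / n
    return [bit ^ 1 if rng.random() < p_m else bit for bit in chromosome]


rng = random.Random(0)          # seeded only to make the demonstration repeatable
parent = [0] * 831              # length of the 8x8 BLIS/BLOS configuration string
child = mutate(parent, rng=rng)
print(sum(child), "bits flipped")
```

With a rate of 1/N, one bit is flipped per chromosome on average, whatever the length of the configuration string.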
The fitness of each configuration string is calculated by the EHW platform using equation (5.1). 
Each fitness score therefore lies in the range 0.0 to 9.0. This corresponds to how effectively each 
FPGA topology was mapped to produce the 9 distinct filter coefficients required to implement 
the desired low-pass transfer function shown in Figure 6.1; and represents a functional correct-
ness of 0 and 100% respectively. All other run-time parameters are specified above. Results 
showing the average fitness of each low-pass filter evolved with and without crossover over 10 
evolutionary runs are displayed in Table 6.2. 
Connection    Average Fitness        Average Fitness 
Topology      with crossover         without crossover 
AFFA          8.54595 = 95.1%        8.6358 = 96.0% 
CFFA          8.59490 = 95.5%        8.6124 = 95.7% 
CFFLA         8.64150 = 96.0%        8.6501 = 96.1% 
Table 6.2: Performance of FPGA connection topologies in generating the 31-tap low-pass filter 
configured using genetic algorithm with and without crossover. 
The average fitness of each filter coefficient set evolved on the three FPGA interconnect 
topologies does not seem to be influenced by the presence or absence of crossover. In fact, the 
fitness of the coefficients evolved using the GA without crossover is marginally higher than the 
corresponding coefficient set generated when crossover was employed. This is almost certainly 
due to the high level of epistasis inherent in the POF design problem, such that interactions 
between PALU elements and interconnects (expressed as genes in the chromosome) are non-linear 
and filter fitness cannot be directly attributed to the effects of an individual gene. Maximum 
epistasis then occurs when the fitness contribution of an individual gene depends on the 
values of all other genes in the chromosome, resulting in a highly uncorrelated search space. 
High degrees of epistasis therefore inhibit the effectiveness of crossover as a search mechanism 
through chromosome recombination of two parent solutions. As the problem presented to the 
GA becomes increasingly difficult (epistatic), crossover becomes less effective and the search 
favours progress through mutation [82,127,128]. Crossover is therefore no longer applied to 
filter coefficients evolved using the FPGA-based EHW platform. 
In order to ascertain the influence of varying the mutation rate, Pm, equation (2.1) was modified 
to produce two new probabilities of bit-flip mutation 2 and 3 times greater than that originally 
used: Pm = 2/N and Pm = 3/N respectively. The same evolutionary runs as above were 
performed, the results of which are presented in Table 6.3. Crossover was not employed. 
Connection    Average Fitness at     Average Fitness at 
Topology      Pm = 2/N               Pm = 3/N 
AFFA          8.6538 = 96.0%         8.4132 = 93.4% 
CFFA          8.5949 = 95.5%         8.2569 = 91.7% 
CFFLA         8.7543 = 97.3%         8.6880 = 96.5% 
Table 6.3: Performance of FPGA connection topologies in generating the 31-tap low-pass filter 
configured using genetic algorithm with variable mutation rates. 
The results show very little difference between the average fitness of the filter coefficient sets 
generated by the GA using the increased mutation probabilities. Only the CFFLA interconnect 
topology produced a noticeably better solution (97.3%) when Pm = 2/N. As a result the 
original mutation rate Pm = 1/N will be maintained, as this conforms with accepted GA 
practices. 
6.3.5 Performance Comparison of FPGA Topologies 
The performance of each FPGA input, output and interconnect topology was assessed based 
on four criteria: The fitness of the filter coefficient set, the number of PALUs used, the de-
gree of PALU re-use (generation of partial product terms), and the total number of shift, add 
and subtract operations required to implement the specified low-pass filter coefficient set. All 
three interconnect topologies and all four FPGA tap output sequences were initially investigated 
using the LSIS. The results for each performance criterion, averaged over ten evolutionary 
investigations, are shown in Figure 6.8. 
Coefficient fitness: Analysis of the transfer functions produced by the evolved FPGA coefficient 
sets reveals that a fitness score of > 98.5% is required to produce an acceptable low-pass 
characteristic, with a gain no less than -52 dB. An example of the minimum filter performance 
criteria produced at 98.5% is provided by the green transfer function illustrated in Figure 6.1. 
This highlights the small margin for error which can be accepted in order to successfully generate 
the highly constrained filter specification outlined in Table 6.1. Fitness variations of between 
0.5 and 1% are therefore critical and highly challenging for EHW. The GA was able to produce 
an acceptable coefficient set on each output topology except AOOS. In fact the AOOS output 
topology produced the worst set of coefficients regardless of the interconnect topology. The 
CFFLA interconnect topology provided the GA with the greatest PALU connectivity and as a 
result produced on average the fittest coefficient sets across each of the output topologies (with 
the exception of AOOS). It is therefore not surprising that CFFA interconnect, which exhibits 
the least PALU connectivity as defined by shortest critical path through the FPGA, produced 
on average the poorest results. 
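The relationship between the raw fitness score and the percentages quoted can be made explicit with a small helper. The sketch assumes only what is stated in the text, namely that fitness lies between 0.0 and 9.0 and that a score above 98.5% is acceptable; the function names are hypothetical.

```python
MAX_FITNESS = 9.0  # one point for each of the 9 distinct filter coefficients


def fitness_percent(score: float) -> float:
    """Express a raw fitness score as a percentage of the maximum."""
    return 100.0 * score / MAX_FITNESS


def is_acceptable(score: float, threshold_percent: float = 98.5) -> bool:
    """True when the score exceeds the acceptance threshold."""
    return fitness_percent(score) > threshold_percent


print(round(fitness_percent(8.6358), 1))
print(is_acceptable(8.7543), is_acceptable(8.9))
```

On this scale the 98.5% threshold corresponds to a raw score of 8.865, so the gap between a typical evolved result and an acceptable one is only a fraction of a single coefficient.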
On average, the best coefficient sets produced by the GA were attained using a CFFLA-BLOS 
interconnect and tap output combination. However, the fittest coefficient set was generated 
using the CFFLA-EOS FPGA topology with a 99.1% coefficient match. No FPGA topology 
was able to provide the GA with a means of producing a coefficient set that matched exactly 
with the original. However, the very nature of FIR filter design allows for some compromise, 
for example due to implementation factors such as coefficient quantisation error. The EHW 
approach to FIR filter implementation therefore introduces some error; however, it can be seen 
that this is often minimal, and that acceptable solutions are still provided. 
[Figure 6.8 panels: (a) Success of FIR filter based on fitness criteria; (d) Operations performed 
by PALU to implement FIR filter. Horizontal axis: Topology.] 
Figure 6.8: Performance of various FPGA interconnect and coefficient output topologies to 
autonomously generate a 31-tap low-pass FIR filter. L-shaped input sequence 
(LSIS) employed. 
PALU utilisation: The average number of PALUs utilised within the FPGA remains relatively 
constant at about 78% of the total FPGA area, regardless of interconnect or output topology. 
PALU coverage therefore seems to have little influence over coefficient fitness. However, a 
link can be established with the AOOS output topology, which implements the fewest PALUs 
when averaged across all three interconnect topologies, and also exhibits the poorest filter coef-
ficients. 
The AFFA interconnect topology promotes the greatest PALU utilisation regardless of the output 
topology. This can be explained by the manner in which PALUs are connected within the 
AFFA. Both the AFFA and CFFLA interconnect topologies produce the same critical delay 
path in the FPGA which lies between the bottom left PALU and top right. This delay is then 
equal to the number of PALUs in the current architecture (for the 8x8 array employed, this is 
64). However, many more considerably shorter paths can be achieved between these two points 
using the CFFLA. This is because all PALUs are routed in the same direction. However, the 
AFFA is almost forced to implement more PALUs by virtue of its routing topology and close 
PALU linkage. The AFFA's more constrained routing might be reduced by providing interconnectivity 
between PALUs on the top and bottom rows, as with the CFFLA. However, this would 
result in an architecture which exhibited feedback, a property which is not desired in linear FIR 
filtering. 
PALU re-use: Again there appears to be little correlation between the degree of PALU re-use 
and coefficient fitness on any of the FPGA topologies. However, a relationship can be 
established between PALU re-use and utilisation on FPGAs which utilise AFFA interconnect. 
Both PALU re-use and utilisation are at their highest when the AFFA is employed, and this is the 
case for each output topology examined. This is again due to the high degree of linkage between 
PALUs connected using the topology. The lowest degree of PALU re-use is exhibited by the 
CFFA interconnect sequence, regardless of the FPGA tap output topology. This is because, 
of all three FPGA interconnect topologies examined, the CFFA provides the lowest linkage 
between neighbouring PALUs, and also exhibits the shortest critical path ((Xwidth - 1) + 
Ywidth PALUs). 
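The critical path lengths quoted for the three interconnect topologies can be compared with a short sketch. The function name is hypothetical; the two expressions are those given in the text, namely the full PALU count for AFFA and CFFLA and (Xwidth - 1) + Ywidth for CFFA.

```python
def critical_path(topology: str, xwidth: int, ywidth: int) -> int:
    """Length of the longest PALU chain, measured in PALUs."""
    if topology in ("AFFA", "CFFLA"):
        return xwidth * ywidth            # path visits every PALU in the array
    if topology == "CFFA":
        return (xwidth - 1) + ywidth      # one pass along a row and up a column
    raise ValueError(f"unknown interconnect topology: {topology}")


for name in ("AFFA", "CFFA", "CFFLA"):
    print(name, critical_path(name, 8, 8))
```

For the 8x8 array this gives 64 PALUs for AFFA and CFFLA against 15 for CFFA, which illustrates how much less linkage the CFFA offers.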
PALU operations: In all of the FPGA topologies examined, the number of shift operations 
selected by the GA is approximately 40% higher than the number of additions. This is particularly 
desirable as shifts consume almost no power and have low area, and are the primary 
means of partial product generation using multiplierless design techniques such as CSD and 
POF. Approximately twice as many PALUs implement addition than subtraction. Again this 
is encouraging for an EHW platform based around the POF design because Bull has shown 
subtraction to play a relatively minor role [2], which is reflected by the genetic algorithm's 
configuration of PALUs within the FPGA-based filter. 
Figure 6.9 presents the following set of results obtained after the GA was again used to examine 
three interconnect topologies and both FPGA output sequences, this time with the BLIS input 
topology in place of the LSIS. 
Coefficient fitness: Substituting the Base-Line Input Sequence (BLIS) for the L-Shaped Input 
Sequence (LSIS) produces a number of differences in the quality of filter coefficients produced 
by the GA on each of the FPGA topologies. The most noticeable difference is 
the success of the AOOS output topology using BLIS when compared to LSIS. The AOOS now 
provides the GA with the only topology capable of producing an acceptable filter coefficient set 
(> 98.5%) within the specified evolution time. This was achieved using CFFLA interconnect. 
It can also be seen that all FPGA topologies which utilise AOOS produce coefficients with 
significantly better fitness (as much as 4% between CFFLA-AOOS and CFFLA-BLOS) than 
that attainable using other output sequences. Regardless of input or output topology the CFFA 
again produces the poorest coefficient sets. 
The generally poor performance of FPGA topologies configured using BLIS can be attributed 
to the increased complexity of the configuration string and a resulting increase in the epistatic 
representation of the chromosome. A further 59 bits are required to encode BLIS compared to 
LSIS, each relating to any given PALU in the array (see equations 5.9 and 5.10). This further 
complicates the search space and can disrupt routing between horizontally connected PALUs, 
due to the manner in which the BLIS is connected. 
PALU utilisation: As with the LSIS input topology, the AFFA interconnect sequence continues 
to promote the highest PALU utilisation when configured by the GA. On average fewer PALUs 
are used to implement coefficient sets when interconnects are configured with base-line input 
and output topologies. This is most likely due to the freedom of input and tap positioning 
afforded by these approaches, thereby lessening the need for extensive PALU connectivity and 
re-use required by the other input/output topologies. 
[Figure 6.9 panels include: (d) Operations performed by PALU to implement FIR filter. 
Horizontal axis: Topology.] 
Figure 6.9: Performance of various FPGA interconnect and coefficient output topologies to 
autonomously generate a 31-tap low-pass FIR filter. Base-line input sequence 
(BLIS) employed. 
PALU re-use: On average FPGA topologies utilising BLIS re-use 50% fewer PALUs than 
equivalent FPGA topologies implementing LSIS. This is again due to a reduction in dependency 
(linkage) between neighbouring PALUs, resulting from the universal availability of the filter 
input, X(n). It is unclear as to why the AFFA-AAOS FPGA topology promotes considerably 
higher PALU re-use than the other output and interconnect topologies using BLIS (in fact re-use 
is also 50% greater than with the equivalent FPGA topology using LSIS). However, the 
fact that AFFA interconnect again promotes such high levels of PALU re-use further supports 
the observation that extensive PALU linkage (inherent in AFFA interconnect) translates to high 
PALU re-use. 
PALU operations: Similar trends between shifting, addition and subtraction can be seen 
between output and interconnect topologies using BLIS as with those using LSIS. 
6.3.6 Graphical Representation of FPGA-Based FIR Filter 
In order to represent the results obtained in section 6.3.5 above, graphical software was developed 
to visualise the FPGA topologies implemented, and present the FPGA configurations 
programmed by the genetic algorithm. To achieve this, three PostScript templates were generated 
reflecting each of the three interconnect topologies investigated. Horizontal connections 
are coloured red, vertical connections green, and those connected to X(n) are coloured blue. 
Each of these PostScript templates can be found in Appendix B. 
The functionality of each input, output and interconnect topology is coded in C, and the evolved 
configuration bit string (originally created and then saved by the VHDL model of the FPGA-based 
EHW platform) is used as input to the C program. The selected configuration string is then 
used to generate a graphical PostScript representation of the FPGA, displaying PALU operations 
and programmable interconnect. The C program generates PostScript by appending additional 
information about the FPGA's configuration to the relevant PostScript interconnect template. 
Each PALU operation is then scripted as a yellow coloured box containing the relevant symbol: 
+, -, or S1 to S4, relating to shift-left 1 bit to shift-left 4 bits respectively. Each PALU 
also displays four coloured regions relating to its two inputs and two outputs (i.e. top region = 
horizontal output, bottom region = horizontal input). A PALU which displays a yellow output 
region is therefore outputting the result of its arithmetic operation. PALUs with output regions 
the same colour as either the horizontal or vertical interconnect are acting as through 
connects. The horizontal and vertical numerical output of each PALU is also shown, and coefficient 
tap outputs are marked according to the tap's position in the coefficient set. In addition, 
the corresponding output PALUs are highlighted. PALUs which serve no purpose are marked 
out in grey. Figure 6.10 displays the graphical representation of the most successful low-pass 
filter coefficient set evolved by the GA, which was configured on the LSIS-CFFLA-EOS FPGA 
topology. 
[Figure 6.10 annotation: GENERATION: 6600 | TOTAL CELL USAGE: 59 = 92.2% | Shifters: 31 | 
Adders: 14 | Subtractors: 14 | Cell re-use: 354] 
Figure 6.10: Example FPGA configuration of 31-tap low-pass filter. 
The graphical software described was further developed to gain additional information about 
the contribution of each PALU in generating the desired coefficient set. The degree of PALU 
re-use was therefore mapped onto the relevant FPGA interconnect template and represented as a 
normalised RGB (Red, Green, Blue) colour scale, with each component in the range 0 to 255, 
where 255 represents the maximum of a given colour. A recursive tree search was used to trace 
the connectivity of each PALU acting as a tap output, noting the PALUs used as the trace 
progresses. It is 
therefore possible for multiple tree searches to follow the same path, thereby using the same 
PALUs. As a result the most extensively re-used PALU for any given FPGA configuration is 
denoted in RGB as maximum red (255, 0, 0) and the lowest re-use in maximum blue (0, 0, 255). 
PALUs that were not used are again marked out in grey. Figure 6.11 displays the corresponding 
PALU re-use map. 




Figure 6.11: PALU re-use map from FPGA configuration of 31-tap low-pass filter. 
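The recursive trace and colour mapping described above can be sketched as follows. This is not the C program used in the thesis: the data structure (a mapping from each PALU to the PALUs feeding it), the linear blend between blue and red, and the particular grey value are all assumptions made for illustration.

```python
GREY = (128, 128, 128)  # assumed colour value for unused PALUs


def count_reuse(inputs, tap_outputs):
    """Trace back from every tap output, counting visits to each PALU.

    `inputs` maps a PALU id to the ids of the PALUs feeding it (an empty list
    if it is driven only by X(n)). The array is feed-forward, so the recursion
    always terminates.
    """
    counts = {palu: 0 for palu in inputs}

    def trace(palu):
        counts[palu] += 1
        for source in inputs[palu]:
            trace(source)

    for tap in tap_outputs:
        trace(tap)
    return counts


def reuse_colour(count, lowest, highest):
    """Map a visit count onto a blue (lowest) to red (highest) RGB scale."""
    if count == 0:
        return GREY
    if highest == lowest:
        return (255, 0, 0)
    t = (count - lowest) / (highest - lowest)
    return (round(255 * t), 0, round(255 * (1 - t)))


# Hypothetical five-PALU example: PALU 0 feeds 1 and 2, which both feed 3.
inputs = {0: [], 1: [0], 2: [0], 3: [1, 2], 4: []}
counts = count_reuse(inputs, tap_outputs=[3, 1])
used = [c for c in counts.values() if c > 0]
colours = {p: reuse_colour(c, min(used), max(used)) for p, c in counts.items()}
print(counts)
print(colours)
```

Because separate traces may pass through the same PALU, a PALU that contributes to several taps accumulates a higher count and is drawn closer to red.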
It is interesting to note that the majority of PALU activity lies along the left side of the array, 
yet the majority of filter coefficients are output on the right of the array. This demonstrates 
that the partial product terms, necessary for coefficient generation, most frequently stem from 
terms generated by PALUs in the first few columns of the CFFLA interconnect topology. This 
matches well with the systolic nature of CFFLA, and the fact that the filter input is only available
to PALUs along the far right column and bottom row of the array.
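The re-use colouring described above can be sketched in a few lines of Python. This is only an illustration, not the graphical C program used in the thesis: the `fan_in` dictionary, the PALU names and the linear blend between blue and red are all assumptions, as is the particular shade of grey. The text only fixes the two end points of the scale (maximum red for the most re-used PALU, maximum blue for the least) and the use of a recursive trace from each tap output.

```python
from collections import Counter

GREY = (128, 128, 128)  # assumed shade for unused PALUs; the text only says "grey"


def count_reuse(fan_in, tap_outputs):
    """Count how often each PALU is visited when every tap output is traced
    recursively back through the PALUs that feed it (feed-forward nets only)."""
    visits = Counter()

    def trace(palu):
        visits[palu] += 1
        for source in fan_in.get(palu, ()):
            trace(source)

    for tap in tap_outputs:
        trace(tap)
    return visits


def reuse_colour(count, lowest, highest):
    """Map a visit count to RGB: lowest re-use is pure blue, highest pure red.
    Linear blending in between is an assumption of this sketch."""
    if count == 0:
        return GREY
    if highest == lowest:
        return (255, 0, 0)
    red = round(255 * (count - lowest) / (highest - lowest))
    return (red, 0, 255 - red)


# Hypothetical five-PALU network: C feeds A and B, which feed taps T0 and T1.
fan_in = {"T0": ["A"], "T1": ["A", "B"], "A": ["C"], "B": ["C"], "C": []}
visits = count_reuse(fan_in, ["T0", "T1"])
lowest, highest = min(visits.values()), max(visits.values())
colours = {palu: reuse_colour(n, lowest, highest) for palu, n in visits.items()}
```

In this toy network PALU C is reached by three separate traces and is therefore drawn in maximum red, while the tap PALUs, each visited once, are drawn in maximum blue.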
obsolete. As a result the PLA configuration bit string requires no compartmentalization other 
than the identification of the left-shift-only PALUs present in the first column of the array. Each 
PALU and associated interconnect is therefore described as a binary word, W, in the order of 
PALU referencing shown in Figure 6.12; where PALU 1 denotes the LSB of the binary con-
figuration string. The bit length of a binary word encoding a given PALU is thus described 
as 
	
$$W = PALU_{cntrl} + 2R \qquad (6.13)$$
where $PALU_{cntrl}$ is the number of control bits required to program each PALU (5), and R
is the bit length of the control required to encode each P : 1 multiplexer in the programmable
interconnect array. The total length of each binary configuration string required to encode a
given PLA topology can then be described as
$$S_{PLA} = (X_{width} \times Y_{width} \times W) + (Y_{width} \times S) + PLA_{taps} \qquad (6.14)$$
Where Xwidth is the number of PALU columns, Ywidth is the number of PALU rows, W is 
the bit length of control required for each PALU and its interconnect, S is the number of bits 
used to determine the left-shift of PALUs in the first column, and PLA taps is dependent on the 
output topology employed. Figure 6.14 describes the layout of the configuration string required 
to program the PLA. The region encoding the control of PALUs in the first column is shown in 
grey.
[Figure: configuration string fields. PALU (Ywidth x Xwidth) ... PALU (Ywidth+2), PALU (Ywidth+1): Bin and Ain routing fields (2R bits) and PALUcntrl (5 bits) each; PALU (Ywidth) ... PALU 1: S bits each.]
Figure 6.14: Layout of configuration string for programming PLA.
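As a rough check on equations (6.13) and (6.14), the string length can be computed directly. The sketch below simply evaluates the two expressions as printed; the function names and the example dimensions are hypothetical, and the number of tap bits is passed in as a parameter because it depends on the output topology.

```python
PALU_CTRL_BITS = 5  # control bits per PALU, as stated in the text


def word_bits(r_bits, palu_ctrl_bits=PALU_CTRL_BITS):
    """Equation (6.13): W = PALU_cntrl + 2R (one R-bit select per PALU input)."""
    return palu_ctrl_bits + 2 * r_bits


def pla_string_bits(x_width, y_width, r_bits, s_bits, tap_bits):
    """Equation (6.14) evaluated as printed:
    S_PLA = (Xwidth * Ywidth * W) + (Ywidth * S) + PLA_taps."""
    w = word_bits(r_bits)
    return x_width * y_width * w + y_width * s_bits + tap_bits


# Hypothetical small array: 4 columns, 3 rows, 3-bit multiplexer selects,
# 4-bit first-column shift fields and no programmable tap bits.
example_bits = pla_string_bits(x_width=4, y_width=3, r_bits=3, s_bits=4, tap_bits=0)
```

For these illustrative values W is 11 bits, so the twelve PALU words contribute 132 bits and the three shift fields a further 12.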
Figure 6.15 illustrates an example PLA configuration for implementing the primitive operator 
filter shown in Figure 4.9 using Route 1 interconnect and Output 2 tap routing. PALUs and
interconnects which are not utilised are shaded in blue. As with the FPGA architecture, the
input pulse, X(n), is held constant at logic '1'. 
It can already be seen that a great deal of redundancy is inherent in the PLA architecture, such 
6.4 Programmable Logic Array (PLA) Topology 
Like the FPGA-based architecture presented in section 6.3, the PLA described in this thesis has 
been developed specifically to implement reduced complexity multiplier-free coefficient mul-
tiplication for digital FIR filtering. The PALU element illustrated in Figure 5.2 again provides 
the backbone for the signal processing of X (n) using the POF approach. Importantly, the 
PLA architecture is particularly suited to implementing the sum of products equation of (4.1) 
(required for FIR filtering) because of its inherent 2D array structure [129]. 
6.4.1 Interconnecting PALUs for a PLA-based FIR Filter
PALUs are arranged in columns, each column connecting to an array of interconnect logic,
which in turn connects to the next column of PALUs. Every PALU in one column is thereby
connected to every PALU in the next column via the interconnect array. Whilst costly in terms 
of physical area, this approach was initially taken to determine the most suitable ideal connec-










[Figure: PLA block diagram. Labels: Din, serial-in parallel-out shift register, programmable interconnect array, PALU, coefficient outputs.]
Figure 6.12: PLA architecture and interconnect overview.
Each interconnect array comprises a number of P : 1 Multiplexers which route the P outputs of 
the previous PALU column, where P denotes the number of PALUs per column. Both inputs 
of a given PALU are therefore connected to a separate P : 1 routing Multiplexer to provide 
maximum connectivity. A total of 2 * P, P : 1 Multiplexers are required to construct each 
programmable interconnect array. The number of PALUs in a column, Ywidth, and the number 
of columns in the array, Xwidth, are determined in VHDL by the user during parameterisation
of the PLA-based EHW platform. In addition to PALU routing, the interconnect array provides
each PALU with a direct connection to the filter input, X(n), and to logic '0'. This approach,
originally implemented on FPGA topologies using BLIS, has been shown to reduce the number
of PALUs utilised and re-used, thereby freeing up programmable interconnect resources. The
number of control bits required to route each MUX is therefore given by 
	
$$R = \log_2(P + 2) \qquad (6.8)$$
where the addition of 2 to P represents the inclusion of routing for X(n) and logic '0'.
PALUs are identified in numerical order, from the bottom left corner of the array to the top right
corner. Due to the column-based 2D topology of the general PLA architecture, all PALUs in the
first, left-most, column are connected directly to X(n). As a result, the first column of PALUs is
left-shift only, as addition of X(n) to itself could only yield a result twice that of X(n) (i.e. a
left-shift by 1), and subtraction would simply produce logic '0'. Therefore, in order to provide
additional functionality, all PALUs in the first column are capable of shifting between 0 and L - 2
bits, where L denotes the bit-width of X(n). The number of control bits, S, required to determine
the shift factor is then given by
$$S = \log_2(L - 2) \qquad (6.9)$$
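Equations (6.8) and (6.9) can be evaluated with a short Python sketch. Rounding the logarithm up to a whole number of bits is an assumption made here (the equations are printed without a ceiling), and the function names are hypothetical. The last two lines illustrate why a first-column PALU, whose two inputs both carry X(n), gains nothing from addition or subtraction beyond what a shifter already provides.

```python
def bits_for(choices):
    """Whole bits needed to select one of `choices` options, i.e. ceil(log2)."""
    return max(choices - 1, 0).bit_length()


def mux_select_bits(p):
    """Equation (6.8): P PALU outputs plus X(n) and logic '0' per multiplexer."""
    return bits_for(p + 2)


def shift_select_bits(l):
    """Equation (6.9): first-column shift control for an L-bit input."""
    return bits_for(l - 2)


def muxes_per_interconnect_array(p):
    """Two P:1 multiplexers (one per PALU input) for each of the P PALUs."""
    return 2 * p


# First-column behaviour for any input sample x: both operands are X(n).
x = 37
assert x + x == x << 1  # addition is just a left-shift by one
assert x - x == 0       # subtraction collapses to logic '0'
```

With 9 PALUs per column and a 16-bit input, the values used later in this chapter, this reading gives 4 select bits per multiplexer, 4 shift-control bits per first-column PALU and 18 multiplexers per interconnect array.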
It is important to determine the most suitable method of connecting PALUs to achieve high-
performance FIR filtering using the PLA approach. As a result, four interconnect topologies
were investigated and reflect various degrees of interconnectivity between columns of PALU 
as illustrated in Figure 6.13. These interconnect sequences vary from those investigated on 
the FPGA in that they go beyond nearest neighbour connectivity. Hierarchical interconnect 
has instead been investigated so as to maintain the column based form of the PLA architec-
ture shown in Figures 4.11 and 6.12, whilst reducing the degree of interdependence (linkage) 
between PALUs, which has been shown to reduce the performance of filter coefficients gener-
ated on the FPGA. Additionally, routing sequences such as AFFA would simply turn the PLA 
into an FPGA-based topology. Each interconnect is categorised as follows: 
• Route 1: Simplest interconnect sequence, requiring minimal routing. PALUs are only
connected to the next adjacent interconnect array (Figure ??). No feedback is therefore
permissible. Routing to PALUs in non-adjacent columns is possible only by PALUs in
intermediate columns performing a shift-by-zero. The number of control bits required
to route each P : 1 MUX as part of the programmable interconnect array is given in
equation (6.8).
• Route 2: 2-level interconnect; provides greater connectivity between PALUs in non-
adjacent columns through additional routing between alternate interconnect arrays as
shown in Figure ??. The number of control bits required to configure the routing multi-
plexers is therefore twice that of Route 1 and is given by
$$R = \log_2 2(P + 2) \qquad (6.10)$$
• Route 3: Utilises 4-level interconnect along with both 2-level and adjacent array con-
nectivity, illustrated in Figure ??. This approach provides extensive connectivity between
columns of PALU, intended to reduce linkage between adjacent PALU elements. However,
the number of control bits required to configure each P : 1 MUX remains at that given
in equation (6.10).
• Route 4: Comprises routes 1, 2 and 3 and additionally incorporates routing between
neighbouring PALUs in the same column (Figure ??). This routing topology requires
the most interconnect control for each routing multiplexer. The number of control bits
can be represented as
$$R = \log_2 2(P + 3) \qquad (6.11)$$
[Figure panels: (a) Routing1: Localised connect. (b) Routing2: Localised and 2-level connect. (d) Routing4: Localised, 2 and 4-level with column connect.]
Figure 6.13: Various Interconnect Topologies for PLA
Optimal placement of filter taps within the PLA architecture is also important for generating an
efficient FIR filter structure capable of meeting the demanding performance criteria required of
many modern DSP applications. For this reason three output topologies for the placement of
filter taps have been investigated and are identified as follows:
• Output 1 employs row-based tap placement as shown in Figure ??. This topology is
suitable for a direct form filter implementation in that coefficients can be summed and
stored sequentially after each product term is generated in the correct order. To achieve
this, tap outputs must be ordered such that tap N (in the coefficient set 1 to N) is produced
on the final, far-right, column in the PLA, and earlier coefficients in the set are present on
columns progressively further from the final column. Note that taps are only connected
to the bottom row. However, since all PALUs are ideally connected, the choice of row is
irrelevant. A total of C clock cycles is required to process the filter input, X(n), where
C denotes the number of PALU columns and is the critical path. C therefore grows as
the number of taps required increases. For any given filter specification only N output
PALUs are therefore required.
• Output 2 employs column-based tap placement as shown in Figure ??. This topology is
most suited to a transposed direct form filter implementation as all filter coefficients are
present on the same clock edge. A total of C clock cycles is again required to process
X(n); however, C is no longer directly dependent on tap length, and could be consid-
erably smaller than that required to implement Output 1 for complex filters with large
coefficient sets. So as to utilise all PALUs, taps are output on the final column of the
PLA such that coefficient 1 is output on the bottom row, with later coefficients output on
PALUs at progressively higher positions in the column. As with Output 1, for any given
filter specification of N taps, only N PALUs are required.
• Output 3 is classed as the base-line tap topology. Each PALU is capable of represent-
ing a filter coefficient. Whilst such a topology is highly unrealistic in terms of added 
control logic and interconnect, it provides the genetic algorithm with a highly flexible 
means of implementing the desired filter coefficient set and further reduces interdepend-
ency between PALUs. The total number of PALUs available as potential tap outputs is 
therefore given by 
$$PLA_{taps} = X_{width} \times Y_{width} \qquad (6.12)$$
where a total of $\log_2 PLA_{taps}$ bits are required to encode the desired filter tap at a given
PALU location.
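The cost of the Output 3 topology in configuration bits can be estimated with a small Python sketch of equation (6.12). Two assumptions are made here and are not taken from the text: the logarithm is rounded up to a whole number of bits, and every tap is given its own PALU address. The function names are hypothetical.

```python
def bits_for(choices):
    """Whole bits needed to select one of `choices` options, i.e. ceil(log2)."""
    return max(choices - 1, 0).bit_length()


def output3_tap_sites(x_width, y_width):
    """Equation (6.12): every PALU in the array is a candidate tap output."""
    return x_width * y_width


def output3_tap_select_bits(x_width, y_width, n_taps):
    """Assumed encoding: one PALU address of ceil(log2(PLA_taps)) bits per tap."""
    return n_taps * bits_for(output3_tap_sites(x_width, y_width))


# An 11-column, 9-row array with 9 taps, the dimensions used later in the chapter.
sites = output3_tap_sites(11, 9)
tap_bits = output3_tap_select_bits(11, 9, n_taps=9)
```

Under these assumptions the 99 candidate PALUs need a 7-bit address, so nine independently placed taps would add 63 bits to the configuration string.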
6.4.2 Configuring the PLA-based FIR Filter 
Unlike the FPGA-based filter outputs presented earlier, tap outputs within the PLA are at fixed 
predetermined locations. This is because of both the nature of the PLA structure, and the hier- 
archical ideal connection topologies discussed above, which make programmable tap outputs 
[Figure: PLA array diagram showing columns of PALUs linked by interconnect arrays.]
Figure 6.15: Example PLA configuration of 5-tap primitive operator filter.
that the topology is naturally suitable to fault tolerant design. This aspect will be examined 
more thoroughly in Chapter 7. Also note that a number of PALUs simply act as through
connects (shift-by-zero) to adjoining PALU elements.
6.4.3 PLA-Based FIR Filter Parameters
Each PLA topology was investigated using 11 columns of PALU and 9 rows. These dimensions
were chosen so that the 9 distinct taps of the low-pass filter could be configured using all
three output topologies. Eleven columns of PALU were used (instead of 9) so that the first
three columns could generate sufficient partial products to produce the desired filter coefficients
when Output 2 was employed. It is therefore apparent that more PALUs are to be utilised
using the PLA architecture than were implemented using the FPGA. However, larger PALU
dimensions are required to ensure that the array size remains constant for each of the output
and interconnect topologies investigated, and to provide sufficient PALUs to make use of the
hierarchical interconnect employed. Results from section 6.3.5 show that on average only 78%
of PALUs in the FPGA-based filter were utilised. It is therefore prudent to assume that if the
PLA is to produce competitive PALU utilisation, then the same number of PALUs should be
used; this would require a lower PALU utilisation in the PLA of around 50%.
All other GA and filter parameters are the same as those detailed in section 6.3.3. Therefore for
an 11 x 9 PLA-based filter implementing Route 4 interconnect and Output 3 tap sequence, 1449
bits are required to encode the configuration string. This translates to a maximum search space
of 2^1449 possible bit string configurations, which must then be successfully manipulated by the
genetic algorithm.
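To give a sense of scale for this search space, the following Python lines count the decimal digits of 2^1449. The variable names are hypothetical; the only figure taken from the text is the 1449-bit string length.

```python
config_bits = 1449                 # configuration string length quoted in the text
search_space = 2 ** config_bits    # number of distinct bit string configurations
decimal_digits = len(str(search_space))
```

The result is a 437-digit number, which makes clear that exhaustive enumeration is out of the question and that the genetic algorithm can only ever sample a vanishingly small fraction of the possible configurations.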
6.4.4 Investigation of Genetic Operator Parameters 
The configuration bit string used to program the PLA-based filter is compartmentalised
in a different way to the bit string used to configure the FPGA. In addition, the archi-
tectural differences between the two programmable platforms are significant enough to produce
search spaces with very different fitness landscapes. These two factors potentially affect the
suitability of the original crossover and mutation parameters used by the GA to manipulate the
FPGA's configuration. As a result the same investigations detailed in section 6.3.4 will again be
performed, this time to determine the effects of crossover and mutation when using the GA to
configure the PLA for coefficient generation. Crossover probabilities of 0 and 60% are therefore
examined, in addition to mutation rates of 1/N, 2/N and 3/N. Three of the four interconnect
topologies were implemented for each crossover and mutation parameter investigated. Output
1 was arbitrarily selected as the fixed output topology for each investigation. Tables 6.4 and 6.5
display the average fitness of the coefficient sets generated for each PLA interconnect sequence







Route 1    8.5387 = 94.9%    8.6018 = 95.6%
Route 2    8.3010 = 98.1%    8.7930 = 97.7%
Route 3    8.8757 = 98.6%    8.8824 = 98.7%
Table 6.4: Performance of PLA connection topologies in generating 31-tap low-pass filter
configured using genetic algorithm with and without crossover
ences in architecture and bit string encoding, both the FPGA and PLA-based filters perform 
equally well without crossover as they do when it is employed. This is for the same reasons of 
epistasis discussed in section 6.3.4. Whilst the mutation rate of 2/N produced coefficients of
slightly better fitness than a mutation rate of 1/N (between 0.3 and 1.1% depending on the PLA
interconnect topology), it was decided to keep the original parameters used for the FPGA by
both maintaining the original mutation rate at Pm = 1/N and removing crossover. This would
also provide more accurate future comparisons between the two programmable platforms.
Connection Topology    Average Fitness at Pm = 2/N    Average Fitness at Pm = 3/N
Route 1                8.6450 = 96.1%                 8.5770 = 95.3%
Route 2                8.8560 = 98.4%                 8.8213 = 98.0%
Route 3                8.9192 = 99.1%                 8.8712 = 98.6%
Table 6.5: Performance of PLA connection topologies in generating 31-tap low-pass filter
configured using genetic algorithm with variable mutation rates and no crossover employed.
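The two genetic operators examined in this section can be sketched as follows. This is a minimal illustration rather than the GA used in the thesis: it assumes that N in the mutation rates 1/N, 2/N and 3/N is the length of the configuration bit string, and it uses one-point crossover, which is an assumption since the crossover scheme is not restated here. All function names are hypothetical.

```python
import random


def mutate(bits, k, rng):
    """Flip each bit independently with probability k/N, where N = len(bits).
    On average k bits change per string (k = 1, 2 or 3 in the experiments)."""
    rate = k / len(bits)
    return [bit ^ 1 if rng.random() < rate else bit for bit in bits]


def one_point_crossover(parent_a, parent_b, pc, rng):
    """With probability pc, swap the tails of two equal-length parents at a
    random cut point; otherwise return unchanged copies (pc = 0 disables it)."""
    if rng.random() >= pc:
        return parent_a[:], parent_b[:]
    cut = rng.randrange(1, len(parent_a))
    return (parent_a[:cut] + parent_b[cut:],
            parent_b[:cut] + parent_a[cut:])


rng = random.Random(0)              # fixed seed so the sketch is repeatable
zeros, ones = [0] * 16, [1] * 16
child_a, child_b = one_point_crossover(zeros, ones, pc=0.6, rng=rng)
mutant = mutate(child_a, k=1, rng=rng)
```

Setting `pc=0` reproduces the mutation-only configuration that was retained for the remaining PLA experiments.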
6.4.5 Performance Comparison of PLA Topologies 
The performance of each PLA routing and output topology was based on the same four per-
formance criteria used to investigate the FPGA outlined at the beginning of section 6.3.3. For
completeness the criteria are listed here again: the fitness of the filter, the number of
PALUs used, the degree of PALU re-use (generation of partial products), and the total number
of shift, add and subtract operations required to implement the specified set of coefficients. The
results for each criterion, averaged over the ten investigations for each PLA topology, are shown
in Figure 6.16.
Coefficient Fitness: Both Output topologies 1 and 3 produce coefficients of greater fitness the
higher the degree of routing available between PALUs. This is not the case when Output 2 is
used; in fact, higher interconnectivity reduces the ability of the GA to find suitable PLA con-
figurations when column-based tap outputs are employed. However, a PLA combination of
Output 2 and Route 2 provides the genetic algorithm with a programmable architecture which
consistently produces highly fit filter coefficients, and was the only topology to produce a set of
filter coefficients which exactly matched those presented in Table 6.1. Remember that a fitness
of > 98.5% was required to produce a transfer function with acceptable low-pass characterist-
ics. At least one in ten evolutionary runs resulted in an acceptable filter response for each of 
the PLA output and interconnect topologies examined; this is a considerable improvement over 
the FPGA platform. In addition, the Output 2, Route 2 PLA topology was the only program-










[Figure panels: (a) Success of FIR filter based on fitness criteria. (c) PALU re-use exploited by FIR filter. (d) Operations performed by PALU to implement FIR filter.]
Figure 6.16: Performance of PLA topologies to autonomously generate a 31-tap low-pass FIR
filter
coefficients. 
PALU Utilisation: On average, the greater the number of PALUs used within a PLA topology
the better the fitness of the filter coefficients produced. This is particularly true for PLA topolo-
gies employing Output 2, where the degree of PALU utilisation closely mirrors coefficient fitness.
Approximately 57% of PALUs were required to implement the desired coefficient set. In real
terms this equates to an average increase of 6 PALUs per implementation of the low-pass filter
coefficient set, when compared to the average number of PALUs utilised on the FPGA. This
result demonstrates that although the array dimensions of the PLA are larger than those of the
FPGA, both platforms utilise roughly the same number of PALUs. The PLA therefore remains com-
petitive with the FPGA in terms of PALU utilisation; however, the quality of coefficient set
generated on the PLA is markedly higher than that produced by the FPGA.
PALU Re-use: Route 1 (local routing) promotes the highest degree of PALU re-use, independ-
ent of the output topology. This is again due to the high level of dependence between PALUs 
inherent in the Route 1 interconnect topology, a relationship which was also found with the 
FPGA architecture. However, unlike the FPGA architecture the critical path through the PLA 
is not influenced by interconnect topology because of its 2D column-based structure, and there-
fore cannot be used as a direct measure of linkage between PALUs. Also, the ideal connection 
topology between columns of PALU means that the available connectivity between neighbouring
PALUs is considerably higher than that found on the FPGA. As a result PALU re-use on
the PLA platform is three times lower than that on the FPGA. 
Route 1 also produces a poorer quality of filter coefficients than when other routing topologies 
are used. The genetic algorithm therefore demonstrates that Route 1 connectivity is the least 
effective means of generating the specified set of coefficients. This stems directly from the
topology's flat interconnect hierarchy, which does not provide direct routing between non-adjacent
columns of PALU.
PALU Operations: Almost identical trends in the use of PALU operations can be seen between 
both the PLA and FPGA architectures. In all PLA topologies the number of shift operations 
utilised is approximately twice that of addition or subtraction; and the number of PALU addi-
tions is consistently greater than the number of subtractions. Again, these characteristics are 
particularly desirable for efficient partial product generation using multiplierless design tech-
niques such as CSD and POF. Remember that the genetic algorithm has no a priori knowledge
of POF or either of the programmable platform architectures. The GA has therefore provided
a strong indication as to the natural suitability of both programmable platforms for POF-based
FIR filter coefficient generation.
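For readers unfamiliar with multiplierless coefficient multiplication, the Python sketch below shows how a single integer coefficient can be realised with shifts, additions and subtractions using a canonic signed digit (CSD) recoding. It is offered only as an illustration of the shift, add and subtract vocabulary available to a PALU; it is not the POF construction evolved by the genetic algorithm, and the function names are hypothetical.

```python
def csd_digits(coefficient):
    """Signed-digit recoding of a non-negative integer, least significant
    digit first, using digits -1, 0, +1 with no two adjacent non-zero digits."""
    if coefficient < 0:
        raise ValueError("this sketch handles non-negative coefficients only")
    digits = []
    n = coefficient
    while n:
        if n & 1:
            digit = 2 - (n & 3)  # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits


def shift_add_multiply(x, coefficient):
    """Multiply x by the coefficient using only shifts, adds and subtracts."""
    result = 0
    for shift, digit in enumerate(csd_digits(coefficient)):
        if digit == 1:
            result += x << shift
        elif digit == -1:
            result -= x << shift
    return result


# 7 = 8 - 1, so 7 * x needs one shift and one subtraction instead of two additions.
seven = csd_digits(7)
product = shift_add_multiply(5, 7)
```

Because each non-zero digit costs one adder or subtractor, recodings with fewer non-zero digits translate directly into fewer arithmetic PALUs, which is consistent with the observation that shifts dominate the operations selected by the GA.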
6.4.6 Graphical Representation of PLA-Based FIR Filter
The graphical C program detailed in section 6.3.6 was modified to accommodate the various 
PLA topologies. Four postscript templates were developed to reflect each of the four intercon-
nect topologies investigated; these are shown in Appendix B. The hierarchical interconnect of 
each PLA topology is colour encoded. Local, or nearest neighbour connectivity, is denoted 
in yellow, level-2 interconnect in green, level-4 in red, and column based connectivity in light 
blue (the same colours as the PALU). Connections programmed to logic '0' are shown in pink, 
and those connected to the filter input, X (n) are displayed in dark blue. Coefficient tap outputs 
are labelled and also highlighted. As with the FPGA graphical program, PALU operations are 
denoted by their relevant signs. Figure 6.17 displays the PLA configuration of the best coeffi-
cient set generated by the GA. As mentioned above it was implemented on a PLA with Route 
2 interconnect and Output 2 tap placement. 
6.5 Comparison of PLA and FPGA-Based Filter Platforms 
The genetic algorithm has provided a means of independently appraising the suitability of both 
the FPGA and PLA-based filter platforms for implementing the coefficient set used to describe 
the benchmark 31-tap low-pass FIR filter presented in this chapter. Whilst comparisons have 
already been made in section 6.4.5, this section identifies those critical comparisons which 
remain. 
The results presented show that the average fitness of coefficients produced by the GA on
the PLA-based FIR filter (98.6%) is significantly higher than that of those produced on
the FPGA architecture (94.6%). It is therefore apparent that the PLA-based filter consistently
produces coefficient sets which fulfil the desired low-pass specification (> 98.5%), and as
such is more suited to FIR filter coefficient generation than the FPGA-based approach. This can
be further substantiated by recalling that a deviation in the accumulated fitness of a coefficient
set by more than 0.5 to 1.0% can have significant impact on the performance of the transfer
function produced. An average drop of 4% in the fitness of coefficients produced on the FPGA
is therefore a significant indication as to the superiority of the PLA-based architecture.
[Figure: PLA configuration diagram. Annotation: GENERATION 6661 | TOTAL CELL USAGE 69 = 69.7% | Shifters 23 | Adders 26 | Subtractors 20 | Cell re-use 166 = 70.638298]
Figure 6.17: Example PLA configuration of 31-tap low-pass filter
Results also indicate that in both programmable platforms a strong relationship exists between 
the degree of linkage, or freedom of connectivity, provided by interconnect topologies between 
PALUs, and the amount of PALU re-use within each array. It has been shown that the higher the 
degree of linkage (dependency) between neighbouring PALUs the greater the degree of re-use. 
This relationship can be gauged by the critical path that is created by each interconnect
sequence, and the availability of shorter routes which might be taken along the critical path. 
Routing topologies which display the highest degree of linkage and least freedom of intercon-
nect along the critical path are the AFFA and Route 1 interconnect sequences for the FPGA and 
PLA respectively. 
The following summarises the PLA and FPGA topologies best and least suited to autonomously 
implementing the benchmark FIR filter coefficient set using EHW. 
• The best average coefficient fitness produced on the FPGA was achieved using the
LSIS-CFFLA-BLOS topology at 97.3%
• The worst average coefficient fitness produced on the FPGA was achieved using the
BLIS-AFFA-AOOS topology at 89.9%
• The best average coefficient fitness produced on the PLA was achieved using Route 2
and Output 2 at 99.8%. For the purposes of further investigation this PLA topology will
now be denoted Col2.
• The worst average coefficient fitness produced on the PLA was achieved using Route 1
and Output  at 95.6%
The second most effective programmable topology was again achieved using the PLA, in this
case with Route 3 interconnect and the row-based tap output sequence, Output 1. For the
purpose of further investigation this PLA topology will now be termed Row3.
6.5.1 Further Investigations 
In order to further validate the results obtained through configuration of the low-pass filter, the
two most effective programmable topologies, Col2 and Row3, were analysed. A second filter
specification was chosen to be the 20-tap Hilbert transformer, designed using the Remez ex-
change algorithm developed by McClellan et al. and benchmarked in [130]. This filter was
chosen as it required 10 taps to implement in folded form and is therefore of similar length
to the 9 distinct taps required to implement the 31-tap low-pass filter benchmarked previously.
However, the Hilbert transformer has a different coefficient distribution which will test the gen-
eral suitability of both PLA architectures, which have very different output topologies. Whereas
the low-pass filter response requires a set of coefficients whose magnitudes increase with tap
length, the Hilbert transformer coefficient distribution varies in magnitude along the length of
the filter. Finally the Hilbert transformer is also represented using a 16-bit 2's complement
encoding, which again matches the PLA specification required for the low-pass filter.
Col2 was implemented using 11 columns and 10 rows. Row3 implemented the Hilbert trans-
form using 13 columns and 9 rows. Both topologies therefore have a comparable number of
PALUs. The same experimental setup was used as for the low-pass filter investigation. Results














[Figure panels, each comparing Col2 and Row3: (a) Success of FIR filter based on fitness criteria. (c) PALU re-use exploited by FIR filter. (d) Operations performed by PALU to implement FIR filter.]
Figure 6.18: Performance of Col2 and Row3 PLA topologies to autonomously generate a 20-tap
Hilbert transform FIR filter.
Col2 enables the genetic algorithm to produce coefficients for the Hilbert transform with con-
siderably better fitness than those generated using Row3. Comparison between coefficient fit-
ness and PALU usage (Figure 6.18(a) and Figure 6.18(b)) further demonstrates the relationship
between PALU usage and coefficient fitness: Col2 utilises approximately 15% more
PALUs than Row3, achieving greater filter performance.
Reproduction of the desired filter response inside each PLA has been of primary importance.
Filter performance on the PLA topology with column-based tap placement (Output 2) is not
dependent on coefficient distribution. Row-based tap placement (Output 1) performs better
on filters whose coefficients are distributed in ascending order. This observation outlines a
restriction in the PLA architecture as partial products for each coefficient are generated column-
by-column. Each PALU therefore relies on terms generated by previous columns to produce
the relevant product. Output 1 is therefore detrimental to the effectiveness of the basic PLA
architecture for non-ordered coefficient filters. This is because the magnitude of partial products
generated in each proceeding column may not necessarily increase. This can be overcome using
Output 2 or, as has been shown in section 6.4.5, by increasing the degree of available interconnect
between PALUs.
6.6 Summary 
This chapter has presented the development, and evaluation of two programmable platforms 
tailored for implementing FIR filter coefficient multiplication using EHW. Coefficient sets are 
implemented on either an FPGA or PLA-based programmable architecture, both of which re-
place explicit coefficient multiplication with a distributed series of bit-shifts, additions and 
subtractions. Each programmable platform employs an embedded genetic algorithm designed 
to autonomously configure the PLD for a given filter specification. The genetic algorithm 
was used to investigate the most suitable programmable architecture for implementing high-
performance multiplierless digital filters, and provided parameterisation of the key genetic op-
erators: crossover and mutation. Initial tests however have shown the limitations of crossover 
as an effective means of generating a specified coefficient set on either programmable platform. 
A 31-tap low-pass FIR filter was benchmarked to enable comparisons between the perform-
ance of the PLA and FPGA architectures. Each architecture was implemented using a number 
of filter input, tap output and PALU interconnect topologies. The performance of each topo-
logy was evaluated based on the coefficient fitness, area utilisation, and PALU re-usability of 
the configurations generated by the genetic algorithm. Coefficient fitness is the most import-
ant measure of FPGA/PLA performance. Results demonstrate that the PLA-based architecture 
considerably outperformed the FPGA in terms of the quality of the coefficient sets produced. 
Investigations show that Col2 produced filter coefficients of higher fitness than other topolo-
gies, when autonomously configured using the genetic algorithm. On average the PLA pro-
duces coefficient sets with a fitness score 4% higher than the FPGA. The PLA's dominance over
the FPGA is attributed to the higher degree of flexibility afforded by the PLA interconnect to-
pologies, which utilise a hierarchical connectivity, and the fact that the critical path of the PLA
is markedly shorter than that of the FPGA. Both these factors have been shown to affect the degree
of interdependence (linkage) between neighbouring PALUs. Greater flexibility of interconnect
and short critical paths therefore reduce PALU linkage and increase FPGA/PLA performance.
Col2 was also shown to be the most flexible PLA architecture for implementing filters with a
non-uniform coefficient distribution, significantly out-performing the next best programmable
topology (Row3).
Whilst the Col2 PLA architecture has been shown to be the most effective in generating FIR
filter coefficients using EHW, its current implementation using an ideal interconnect between
columns of PALU is unrealistic and would require prohibitive routing in VLSI as the number of
PALU columns and rows increases to match filter complexity. Chapter 7 therefore investigates
the translation of the Col2 architecture into a synthesisable VHDL netlist for implementation
in silicon. Physical constraints will be examined, such as timing along the critical path and
the degree of interconnect required to implement a functionally acceptable Col2 architecture
without ideal interconnect.
Chapter 7 
Translating the Co12 PLA Topology 
into Hardware 
7.1 Introduction 
All of the programmable platforms investigated so far have used software simulation to
evaluate the performance of the filter coefficient sets configured on them by the genetic
algorithm. The performance evaluation of both the PLA and FPGA-based FIR filters therefore
constitutes extrinsic evolution in EHW terms, as discussed in Chapter 2. In order for faster
and more realistic generation of filter coefficients to occur, the programmable platform on
which the filter is to be implemented must be realised in physical hardware. Intrinsic
real-time evaluation of the PLA architecture configured to implement a given coefficient set
is then possible.
Analysis of both the FPGA and PLA-based EHW platforms investigated in Chapter 6 has
shown that a PLA architecture capable of column-based tap placement with localised and
2-level PALU interconnect, Co12, is the most effective programmable topology for implementing
an FIR coefficient multiplication unit using EHW. However, its ideal interconnect topology
makes it unsuitable for implementation in hardware, and therefore of little use to real-world
SoC signal processing applications. In addition, FIR filters are crucial for robust data
communication and manipulation. DSP devices are frequently employed in environments where
issues such as high processing speed, low physical area, and device reliability are highly
critical, such as in space applications. For many such applications DSPs must maintain
functionality over prolonged periods in harsh environments. Built-in reliability of FIR filter
devices is therefore required. The performance and sustained reliability of hardwired FIR
filters is therefore of great importance.
This Chapter presents the translation of the Co12 PLA architecture from an RTL-level
behavioural VHDL model into a physically realisable, technology specific netlist using Synopsys
Design Analyzer synthesis software and Alcatel's 0.35 µm MTC45000 technology library. The
architectural limitations of the Co12 PLA are identified in order to develop a physically
realistic PLA structure. The synthesised PLA netlist is then compared with the original Co12
PLA architecture using the GA. Finally the real-world performance of the netlisted PLA is
further investigated by examining the ability of both the GA and the PLA to adapt to an
increasingly high number of faults randomly introduced onto PALUs in the PLA architecture.
7.2 Synthesis and Performance Analysis of PLA-Based Filter 
Translating Co12 into a synthesisable IP core requires modifications to be made to the original
array of ideal interconnects between PALUs, detailed in section 6.4.1. The ideal connectivity
model is not suitable for hardware implementation as it incurs a large area for both routing
and control logic, due to excessive interconnect. This would result in long delays between
PALUs, and high capacitive loads from excessive fanout on interconnect pins. In order to
minimise these problems the interconnect array was modified such that each PALU input could
route to one of only three PALUs from the previous column. So as to maximise connectivity
between columns, no two inputs were routed to the same set of three PALUs. The reduced
connectivity architecture centres around the position of the current PALU in the column. Ain
can be connected to the PALU at the same location in the previous column, or to the PALU
directly above or below it. Bin can be connected to the second, third or fourth PALU directly
above that location in the previous column. Figure 7.1 displays an example of the reduced
connectivity model.
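The routing rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the VHDL used in this work: the function name reduced_sources, the bottom-to-top row numbering and the wrap-around at the column edges are all assumptions introduced for the example.

```python
from typing import Dict, List


def reduced_sources(row: int, height: int) -> Dict[str, List[int]]:
    """Candidate source rows in the previous column for one PALU.

    Assumptions (not stated in the text): rows are numbered 0..height-1
    from bottom to top, so "above" means a larger index, and offsets that
    run past the column edge wrap around.
    """
    a_in = [(row + d) % height for d in (-1, 0, 1)]  # below, same row, above
    b_in = [(row + d) % height for d in (2, 3, 4)]   # 2nd, 3rd and 4th above
    return {"Ain": a_in, "Bin": b_in}


if __name__ == "__main__":
    # Each input always selects from three PALUs, whatever the column height.
    for height in (6, 16, 64):
        sources = reduced_sources(row=2, height=height)
        print(height, sources, [len(v) for v in sources.values()])
```

Under these assumptions each input multiplexer has three candidate sources regardless of column height, which is the property that keeps the control logic constant as rows are added.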
The amount of control logic for the interconnect array is therefore reduced, along with inter-
connect area, signal delay, and drive-strength. Also, this complexity does not increase with 
column height (as would be the case with the ideal interconnect). This ensures that the PLA 
architecture remains scalable. 
7.2.1 Comparative analysis with RTL 'ideal' model 
In order to determine if the reduced connectivity PLA architecture performed as well as the
ideal interconnect, Co12 was modified and simulated using the same benchmark low-pass filter
detailed in section 6.2 of Chapter 6, and the same experimental setup as that used for the
original Co12 PLA presented in sections 6.2.1 and 6.4.3. The modified Co12 PLA is named
Co12_reduced for clarity and was again written in VHDL at the RTL level. A second
investigation of Co12_reduced was also performed to determine the effects of changing PLA
dimensions. The number of columns was reduced from 11 to 6, and the number of rows was instead
extended to 16. In this case the bottom 9 PALUs in the final column were connected to output
taps. These dimensions were designed to roughly maintain the number of PALUs within the PLA,
whilst reducing the latency of the circuit. The same experimental setup was again employed,
and the PLA was identified as Co12_6x16. Figure 7.2 displays averaged data of the results
obtained.

Figure 7.1: Example of reduced connectivity between PALUs
The average fitness of coefficients generated using Co12_reduced is almost identical to those
produced using the original Co12 topology. However Co12_reduced used approximately 15%
more PALUs than Co12, and re-used on average 81%; around 20% more than the original Co12
PLA. This supports evidence presented in Figure 6.16 which links reduced connectivity with
greater PALU re-use and in some cases poorer filter performance. The fact that more PALUs
are required to implement filters of high quality suggests that Co12_reduced still provides
the genetic algorithm with a means of counter-acting the negative effects of reduced
connectivity between PALUs. Altering dimensions of the PLA in Co12_6x16 maintained the quality
of the filter coefficients, but greatly reduced both the number of PALUs used and the degree of
re-use when compared to Co12_reduced. In fact usage and re-use are shown to be comparable to or
lower than those produced using the original Co12 PLA. The results highlight the flexibility of
the PLA to adapt to varying dimensions and maintain filter performance. Ratios between shift,
addition and subtraction operations remain similar to those originally identified in Chapter 6
throughout.

Figure 7.2: Performance of Co12_reduced and Co12_6x16 PLA topologies to autonomously
generate a 31-tap low-pass FIR filter. Panels include (c) PALU re-use exploited by FIR filter
and (d) Operations performed by PALU to implement FIR filter.
7.2.2 Synthesis Details 
Due to the success of the Co12_6x16 topology, which was written in VHDL at the RTL level,
a PLA core with 6 columns and 5 rows was synthesised using the Alcatel MTC45000 library.
Five rows were chosen to maintain a compact PLA core that could readily be synthesised.
Multiple cores are simply connected during initial parameterisation of the PLA-based FIR filter
platform. Therefore a PLA can be sized according to the maximum number of taps required for
a specific range of applications. A total of C + 1 clock cycles are required to multiply input
data with the desired coefficient set, where C is the number of PALU columns (here C = 6). As
a result, the throughput of the PLA is not affected by filter length. This is a considerable
advantage over single multiplier MAC filter architectures. Top-down synthesis was performed
using Synopsys Design Analyzer. This was shown to produce better timing and area results than
synthesis using a bottom-up approach. Appendix C.1 displays the synthesis script used. The
scalability of the PLA core was examined by synthesising it at six operational clock
frequencies: 10 MHz, 25 MHz, 50 MHz, 80 MHz, 90 MHz and 100 MHz; all PLA data-widths were set
at 16 bits. These frequencies were chosen to reflect typical timing constraints required on
high speed SoC bus architectures, which range from 60 to 100 MHz. Results can be seen in
Figure 7.3. Area remains relatively constant from 10 to 50 MHz. Between 50 and 100 MHz the area
of the synthesised PLA core increases approximately linearly. For technologies smaller than
0.35 µm, faster throughput could be achieved.
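The latency argument can be made concrete with a small calculation. The Python sketch below is illustrative only: the function names are hypothetical, and the single-multiplier MAC model (one tap processed per clock cycle) is an assumption made for comparison rather than a measured figure.

```python
def pla_cycles(columns: int) -> int:
    """Clock cycles for the PLA to form all tap products: C + 1."""
    return columns + 1


def mac_cycles(taps: int) -> int:
    """Assumed single-multiplier MAC model: one tap per clock cycle."""
    return taps


def latency_seconds(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count into seconds at a given clock frequency."""
    return cycles / clock_hz


if __name__ == "__main__":
    clock_hz = 100e6  # fastest synthesis target considered in the text
    for taps in (5, 31, 63):
        pla = pla_cycles(6)     # independent of the number of taps
        mac = mac_cycles(taps)  # grows with the number of taps
        print(taps, pla, mac,
              latency_seconds(pla, clock_hz),
              latency_seconds(mac, clock_hz))
```

With C = 6 the PLA figure stays at 7 cycles for every filter length, whereas the assumed MAC figure grows in step with the number of taps.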
Figure 7.3: Logic area of PLA core as a result of synthesis for increasing operational speeds.
The critical timing path was found to lie between the BusA/BusB input of any given set of
interconnect logic, and the output register of the PALU which is associated with this
interconnect. This is shown in more detail in Figure 7.4. The largest delay is incurred through
the adder/subtracter unit of the PALU. One way to reduce this would be to customise the
adder/subtracter block for a specific silicon technology. This would limit the general
portability of the PLA core, but further increase its performance.
Figure 7.4: Critical delay path through PLA architecture.
Figure 7.5 displays the Leapfrog VHDL simulation of the 6x5 PLA Core netlist synthesised to
operate at 10 MHz and then back annotated into the simulation testbench presented in Appendix
C.2.
The PLA Core presented has been programmed to multiply X(n) by the coefficient set {1, 7,
16, 21, 33}, representing taps 1 to 5 respectively, defined as Output_Port(i) in the waveform
of Figure 7.5. The coefficients were configured on the PLA Core using the bit-string displayed
in Memory_Contents. The waveform signal Pla_Data_Stream is simply the binary data held
in Memory_Contents as it is fed bit-serially into the PLA's serial-to-parallel shift register
as shown in Figure 6.12 of Chapter 6. Ten input stimuli (16-bit words) representing the filter
input, X(n), were applied to the PLA Core via PLASignalinput. Each of the ten input vectors
in the set {1, 25, 49, 385, 553, 271, 1, 449, 107} is fed in turn to the PLA Core, which then
takes 7 clock cycles to perform the distributed coefficient multiplication before the result is
present on Output_Port(). The 7 cycle latency between PLASignalinput and Output_Port()
can clearly be seen from the red markers in Figure 7.5, which indicate when the PLA Core has
finished processing the current coefficient multiplication. For example, the third input word,
49, when multiplied by the 5-tap coefficient set can be seen to display the correct
corresponding tap outputs 49, 343, 784, 1029 and 1617.
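The tap outputs quoted above can be checked with a short shift-and-add sketch in Python. The decompositions used here are one possible choice written by hand for illustration; they are an assumption and are not the configuration held in Memory_Contents or evolved by the GA.

```python
from typing import Callable, Dict, List

# Hypothetical shift/add/subtract decompositions of the five coefficients.
SHIFT_ADD: Dict[int, Callable[[int], int]] = {
    1: lambda x: x,                         # shift by zero
    7: lambda x: (x << 3) - x,              # 8x - x
    16: lambda x: x << 4,                   # 16x
    21: lambda x: (x << 4) + (x << 2) + x,  # 16x + 4x + x
    33: lambda x: (x << 5) + x,             # 32x + x
}


def tap_outputs(x: int) -> List[int]:
    """Products of x with each coefficient, using no multiplier."""
    return [operation(x) for operation in SHIFT_ADD.values()]


if __name__ == "__main__":
    print(tap_outputs(49))  # [49, 343, 784, 1029, 1617]
```

Each product is formed from shifts, additions and subtractions only, which is the style of operation a PALU array provides.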
Figure 7.5: Simulation waveform of 6x5 PLA Core VHDL netlist synthesised at 10 MHz.
7.3 Fault Tolerant Characteristics of PLA-Based EHW Platform 
Figure 7.2(b) shows that on average around 42% of the Co12_6x16 PLA topology remains
redundant after the low-pass filter is implemented. This provides the GA with sufficient PALU
resources to reconfigure the PLA if sections of the architecture become damaged. The
PLA-based FIR filter platform therefore exhibits fault tolerance through controlled redundancy.
Karri has already shown this fault-tolerant method to be efficient in [112].
Fault tolerant systems are widely used in space applications such as commercial satellite
communication, where hardware deteriorates due to damage caused by cosmic rays, and in other
inhospitable environments where human intervention is difficult or impossible. Systems must
therefore maintain functionality despite factors such as severe temperature variation,
radiation and operational wear. Conventional fault tolerant VLSI systems employ techniques such
as check-pointing [110], concurrent error detection [111] and redundancy. Their purpose is to
maintain system operation, or prevent further successive faults by minimising the damage
sustained [18]. However, fault tolerant systems are costly as they reduce operational speed and
increase physical area.
Alternative approaches to the design of fault tolerant systems have recently been proposed 
using evolvable hardware [21, 131-133]. Such approaches provide novel techniques for fault 
recovery and prevention without the need of additional redundancy, fault detection or diagnosis. 
Instead a genetic algorithm is used to monitor system performance and reconfigure aspects 
of a circuit to counter-act any deleterious faults. Fault tolerant systems which employ EHW 
must therefore be able to adapt on-line when required. Hardwired GAs are capable of running 
considerably faster than those implemented in software on general purpose micro-computers, 
and are therefore suitable for applications which require online adaptation. As a result, GAs 
are frequently mapped onto programmable logic devices (PLDs) so that the fitness function
can later be modified for different design criteria [17, 121-123]. Custom EAs have also been 
implemented on ASICs [124, 125]. In such cases the fitness algorithm is then set for a specific 
application. 
For EHW to provide a competitive solution to conventional fault tolerant design, EHW re-
sources must be smaller than those required by conventional fault tolerant architectures. The 
benefits of using a single fixed EHW resource become more apparent as circuit size or
complexity increases. Conversely, hardware requirements for conventional fault tolerant designs
will continue to grow.
7.3.1 Introducing Faults into the PLA-Based FIR Filter
The Co12_6x16 PLA architecture (RTL description) was subjected to four increasingly large
numbers of faulty PALUs. These faults covered 0%, 5%, 13% and 25% of the PLA architecture.
Each individual fault was obtained by pulling both inputs of a given PALU to zero and setting
it to shift-by-zero, effectively simulating a "stuck-at-zero" fault. This was achieved by
"freezing" the sections of the configuration string which related to the selected faulty
PALUs. For each percentage of faults the low-pass filter coefficients were evolved ten times
on the PLA using the genetic algorithm. Again, ten randomly generated populations of
configuration-strings were created for each of the ten filter coefficient sets evolved. The
dimensions of Co12_6x16 were maintained to provide limited redundancy for when faults were
introduced into the PLA.
Faults were placed at random for each level of coverage. The same faults were then maintained
over the ten times the low pass filter coefficients were evolved so as to obtain an average.
Figure 7.6 displays the topology of PALU faults for each level of fault coverage.
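The fault injection procedure can be illustrated with the following Python sketch. It is a simplified stand-in for the VHDL model: the four-field gene layout, the field ranges and the function names are hypothetical, and only the idea of freezing the genes of faulty PALUs so that the GA cannot alter them is taken from the description above.

```python
import random
from typing import List, Set, Tuple

COLUMNS, ROWS = 6, 16               # Co12_6x16 dimensions
Gene = Tuple[int, int, int, int]    # hypothetical: (a_select, b_select, operation, shift)
STUCK_AT_ZERO: Gene = (0, 0, 0, 0)  # hypothetical code for zeroed inputs, shift-by-zero


def choose_faults(coverage: float, rng: random.Random) -> Set[int]:
    """Pick PALU indices at random so that `coverage` of the array is faulty."""
    n_faulty = round(coverage * COLUMNS * ROWS)
    return set(rng.sample(range(COLUMNS * ROWS), n_faulty))


def random_gene(rng: random.Random) -> Gene:
    """Random settings for one healthy PALU (field ranges are assumptions)."""
    return (rng.randrange(3), rng.randrange(3), rng.randrange(3), rng.randrange(16))


def random_config(faults: Set[int], rng: random.Random) -> List[Gene]:
    """Random configuration in which faulty PALUs are frozen at stuck-at-zero."""
    return [STUCK_AT_ZERO if i in faults else random_gene(rng)
            for i in range(COLUMNS * ROWS)]


def mutate(config: List[Gene], faults: Set[int], rate: float,
           rng: random.Random) -> List[Gene]:
    """Mutate healthy PALUs only; frozen genes are copied unchanged."""
    return [gene if (i in faults or rng.random() >= rate) else random_gene(rng)
            for i, gene in enumerate(config)]


if __name__ == "__main__":
    rng = random.Random(0)
    faults = choose_faults(0.25, rng)
    child = mutate(random_config(faults, rng), faults, rate=0.05, rng=rng)
    print(len(faults), all(child[i] == STUCK_AT_ZERO for i in faults))
```

Because the same fault set is passed to every call, the faults stay fixed across repeated evolutionary runs, mirroring the way the same faults were maintained over the ten runs.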
7.3.2 Analysis 
The ability of the fault tolerant hardware platform to adapt to or sustain increasing faults was 
investigated through the same four criteria identified in Chapter 6: the fitness of the filter
evolved, the number of PALUs used, the degree of PALU re-use (generation of partial products), 
and the total number of shift, add and subtract operations required to implement the desired 
coefficients. Results are shown in Figure 7.7. 
Figure 7.7 reveals that at PALU faults from 0 to 13%, the genetic algorithm is in each case
able to evolve a filter with a maximal fitness of 99.9%. When the PLA is 25% faulty a 0.5%
decrease in the fittest filter solution is incurred. The average fitness of filter
coefficients evolved remains above 98.5% until 25% of the PLA experiences faults. Recall from
Chapter 6 that a fitness > 98.5% was required to produce a transfer function with acceptable
low-pass characteristics and a gain no less than -52 dB. An example of a typically acceptable
response for the low-pass filter is shown by the green transfer function in Figure 6.1.
Variation of the least fit filters evolved by the GA is more marked as the percentage of
faults in the PLA increases.
The average number of PALUs used to implement the low pass filter reduces as the percentage
of faults found in the PLA increases. However, only a 10% reduction in PALU usage is
experienced on the PLA despite a 25% decrease in the number of functional PALUs available. This
suggests that the PLA provides the GA with a means of counter-acting the deleterious effects
of PALUs which are "stuck at zero". Comparisons between figures 7.7(a) and 7.7(b) reveal
that as fewer PALU resources are made available to the GA through faults, a reduction in filter
performance is experienced. The number of PALUs required to implement the filter therefore
relates directly to the fitness of the evolved solution.
Accounting for variations of faults at 5 and 13% of the PLA area, the average number of PALUs 
reused within the PLA (to generate partial product terms) remains relatively constant, with an 
overall reduction of less than 5%. Again this supports the notion of a robust PLA architecture
capable of providing an adaptable environment for fault tolerant design using EHW. 
Regardless of the degree of faulty PALUs, the number of shifter operations selected by the GA
is double that of either additions or subtractions. As expressed in sections 6.3.5 and 6.4.5 of
Chapter 6, this is encouraging as programmable shifts consume considerably less power than
either addition or subtraction operations, and are considered to be the primary means of
product generation when using multiplierless filter techniques such as POF design.

Figure 7.6: "Stuck-at-Zero" fault topologies covering PLA. Panels: (a) 5% PALU faults;
(b) 13% PALU faults; (c) 25% PALU faults.

Figure 7.7: Analysis of Co12_6x16 PLA architecture with increasing percentages of faulty
PALUs. Panels: (a) Fitness of FIR filter coefficients (average, minimum and maximum fitness);
(b) Number of PALUs utilised to generate desired FIR filter coefficients based on increased
percentage of faults in PLA; (c) Amount of PALU re-use based on increased percentage of faults
in PLA; (d) Type of PALU operation based on increased percentage of faults in PLA. Horizontal
axis in each panel: PALU fault coverage.
Of course the placement of faulty PALUs will greatly affect the ability of the GA to evolve
high quality filters. Faults close to, or directly on, PALUs which are connected to coefficient
taps will have a more detrimental effect than those distributed in the centre of the PLA. This
accounts for the poorer average fitness performance of filters generated on the PLA with a
faulty area of 5% compared to those generated by the GA on a PLA with 13% faults. Given the
added number of redundant rows present in the PLA architecture it would be possible to adapt
the platform such that taps could be moved to other PALUs on the final column which are neither
faulty themselves nor located near faults. An example PLA configuration of the 31-tap low pass
filter evolved with 99.9% correctness and 13% of its PALUs faulty is illustrated in Figure 7.8.
Faulty PALUs are shown as dark green anomalies.
7.3.3 Population Initialisation After Fault Detection 
In each of the fault scenarios investigated in section 7.3.1, populations of configuration-strings 
were randomly generated. However, two other approaches to population initialisation are avail-
able. In this Chapter they are termed Population seeding, and Population recall. Both of these 
approaches were investigated to see if fault recovery times after detection could be decreased 
when compared to random population initialisation. The method of population seeding applied 
in this Chapter involved taking the fittest solution stored from the previous evolutionary run 
(and currently in operation on the PLA), and placing it into a population of 99 randomly gen-
erated configuration-strings. Population recall simply involves re-introducing the most recent 
population of configuration-strings evolved, and using this as the initial start point. 
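The three ways of building the initial population can be summarised in the Python sketch below. It is a minimal illustration and not the VHDL GA used in this work: configuration-strings are modelled as lists of bits, STRING_BITS is a hypothetical length, and the population size of 100 assumes that the seed is added to the 99 random strings rather than replacing one of them.

```python
import random
from typing import List

POP_SIZE = 100    # assumed: one seed plus 99 randomly generated strings
STRING_BITS = 64  # hypothetical configuration-string length

Config = List[int]


def random_string(rng: random.Random) -> Config:
    """One randomly generated configuration-string."""
    return [rng.randint(0, 1) for _ in range(STRING_BITS)]


def random_init(rng: random.Random) -> List[Config]:
    """Random initialisation: every configuration-string is generated afresh."""
    return [random_string(rng) for _ in range(POP_SIZE)]


def seeded_init(best: Config, rng: random.Random) -> List[Config]:
    """Population seeding: the stored fittest solution plus random strings."""
    return [list(best)] + [random_string(rng) for _ in range(POP_SIZE - 1)]


def recall_init(last_population: List[Config]) -> List[Config]:
    """Population recall: re-introduce the most recently evolved population."""
    return [list(config) for config in last_population]


if __name__ == "__main__":
    rng = random.Random(0)
    previous = random_init(rng)  # stands in for an earlier evolved population
    best = previous[0]           # stands in for the stored fittest solution
    print(len(seeded_init(best, rng)), len(recall_init(previous)))
```

Only the starting population differs between the three approaches; the evolutionary loop that follows is unchanged.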
The effectiveness of both approaches was examined using the same Co12_6x16 PLA topology
with 13% of its PALUs "stuck-at-zero", as shown in Figure 7.6(b). So as to obtain an averaged
performance the same low-pass filter was evolved ten times using the population seeding
approach. In each case the PLA configuration shown in Figure 7.8 was used as the seed. All
other configuration-strings were randomly generated. A randomly selected final population,
evolved for the low-pass filter with no faults in the PLA, was used as the initial population
for the recall approach. As no configuration-string needed to be randomly generated this
scenario was run only once. The performance of each initialisation approach was taken to be the
number of generations required by the GA to produce a filter response with a fitness > 98.5%.
Figure 7.9 displays the results obtained. The averaged evolution of filter fitness using random
population initialisation is also shown for comparison.

Figure 7.8: Example PLA configuration of the 31-tap low-pass filter evolved with 13% of its
PALUs faulty. Annotations: generation 6661; total cell usage 53 (55.2%); shifters 27;
adders 18; subtractors 8; cell re-use 67.
Figure 7.9: Fitness performance of filter evolved on PLA based on various methods of
generating the initial population of configuration-strings. Curves: population recall,
randomised population, seeded population. Horizontal axis: generation (0 to 1800).
Both population seeding and recall enable the GA to adapt to the faulty PLA architecture
and produce coefficient sets with target fitness considerably faster than when a population of
configuration-strings is randomly generated. Population seeding produced filters of fitness >
98.5% after 272 generations and population recall required 513 generations, whilst random
initialisation took an average of 1521 generations to reach target fitness. This translates to
an approximately 6-fold (1521/272 ≈ 5.6) and 3-fold (1521/513 ≈ 3.0) increase in the speed of
fault recovery over that of random initialisation for population seeding and recall
respectively.
7.4 Summary 
The Co12 PLA architecture, identified as the most effective for the autonomous implementation
of FIR filter coefficients, was translated into a synthesisable, technology dependent VHDL
netlist. The ideal interconnect between columns of PALUs was replaced with a more restricted
interconnect reflecting realistic PALU fan-out. The resulting PLA architecture, termed
Co12_reduced, was then compared with Co12 using the benchmark low-pass filter, with results
showing comparable fitness in the coefficient sets produced by the GA.
Six columns and five rows of PALUs were selected as the base dimensions to generate a
synthesised VHDL core of the Co12_reduced PLA architecture. Operational speeds of 10 to 100 MHz
were presented after synthesis, and reflect typical SoC bus frequencies. A signal latency of 7
clock cycles, independent of filter length, is experienced on the core. This is a considerable 
advantage over single multiply accumulate DSP architectures which have processing times dir-
ectly proportional to filter tap length. More complex FIR filters can be constructed by simply 
adding a number of 6x5 PLA cores during initial VHDL parameterisation of the filter specific-
ation. 
The ability of the platform to adapt to increasing numbers of faults was investigated through 
the evolution of a 31-tap low-pass FIR filter. Four increasingly large numbers of faulty PALUs 
were introduced to the PLA from 0 to 25% of the total architecture. Results show that the 
functionality of filters evolved on the PLA was maintained despite the increasing number of 
faults present. This was attributed to redundant PALUs (inherent in the PLA) exploited through 
the use of EHW. Two additional methods of population initialisation were examined to see 
if fault recovery times after detection could be decreased compared with random population 
initialisation. It was shown that seeding a population of random configuration-strings with the 
best configuration currently obtained produced filters of acceptable fitness 6 times faster than 
with purely randomised population initialisation. 
Chapter 8 
Summary and Conclusions 
8.1 Introduction 
The focus of this thesis has been to investigate if a programmable platform tailored for evolvable 
hardware (EHW) can be developed which is highly suited to the autonomous implementation of 
digital FIR filters. Three novel programmable platforms have been developed for this purpose 
each with a distinctly different architectural design to accommodate FIR coefficient multiplic-
ation. 
This chapter is organised as follows. Section 8.2 presents a summary of the material
presented in each chapter of this thesis. Section 8.3 then provides conclusions drawn from the
data collected, supporting or rejecting the thesis statement given above. Section 8.4
highlights what has been achieved as a result of the research carried out, and Section 8.5
outlines future work which would add to the knowledge already gained from research undertaken
in this thesis, and discusses what might be done to improve the performance of the EHW
platforms. Finally, Section 8.6 presents a number of final comments regarding the thesis, and
possible applications of the research presented therein.
8.2 Summary 
The underlying theme of this thesis has been to investigate a number of programmable
architectures for the automated design of digital FIR filters using the principle of EHW. Each
programmable architecture is therefore configured using a genetic algorithm (GA) which is 
derived from a class of non-heuristic search and optimisation techniques termed evolutionary 
algorithms, which are inspired by the process of biological evolution. The GA must there-
fore successfully search for and manipulate the encoding used to configure each programmable 
architecture in order to generate the desired filter coefficient set in hardware. 
Chapter 2 introduced the concepts behind evolutionary algorithms and identified four distinct
classes, one of which was the GA. The suitability of the GA for automated digital circuit design
using EHW was identified due to the manner in which possible circuit solutions are encoded
and manipulated. The mechanisms behind the algorithm's operation were also presented and
focused on the application of autonomous circuit design.
Chapter 2 also demonstrated the benefits and limitations behind both gate-level and
functional-level approaches to EHW circuit design through literature review. The differences
between software-based circuit simulation (extrinsic evaluation), and hardware-based, or
intrinsic, circuit evaluation were also evaluated. It was shown that extrinsic evaluation
techniques lack models which describe the physical characteristics of the circuit evolved,
whilst intrinsic evaluation often resulted in the exploitation of anomalous physical
characteristics particular to the device on which the circuit was evolved.
The first of three EHW platforms termed the Virtual Chip was presented in Chapter 3. Circuits 
were generated in the Virtual Chip using a GA which described each circuit as a VHDL model. 
Each model was then evaluated extrinsically using simulation tools designed to take account of 
physical circuit characteristics such as timing and area. The Virtual Chip was initially used to 
compare the effectiveness of two circuit component libraries, one reflecting gate-level
evolution, the other function-level, by using a GA to autonomously design three types of DSP
circuit: an MxN multiplier, a 7-bit one's voter and a 2-tone frequency discriminator, all of
which had been previously benchmarked in the EHW community.
The concept of phased evolution is also introduced in Chapter 3 as an approach to generate 
more complex multiplier circuits for FIR coefficient multiplication. Phased evolution partitions 
circuit complexity into relevant circuit outputs and in doing so reduces the search space into 
smaller landscapes which relate to each sub-circuit. A 3x3 bit parallel multiplier was generated 
using phased evolution that could not be generated in the same number of generations using the 
conventional Virtual Chip.
Chapter 4 presented the basic concepts behind FIR filter theory and demonstrated how filters
can be implemented in hardware using the direct-form and transposed-direct-form. The
multiplier was identified as the most costly component in filter implementation and a number
of design methodologies were discussed which either reduced the role of the multiplier,
or replaced it with a series of bit-shifts, additions and subtractions. In particular, the
primitive operator filter (POF) methodology was identified as a technique suitable for
adaptation to EHW using functional-level evolution. A wide range of fixed-function and
programmable VLSI architectures dedicated to implementing FIR filters, with and without
explicit multiplication, were also presented. In addition, a class of general purpose
programmable logic devices (PLDs) were introduced and methods for performing coefficient
multiplication on these architectures investigated.
From the information presented in Chapter 4, Chapter 5 detailed the development of a PALU 
dedicated to FIR coefficient multiplication using the POF approach. The PALU was designed 
to be implemented in two distinct array structures, reflecting two different classes of PLD. The 
GA used to configure each PALU was also presented. Both the GA and PALU were written in 
VHDL and formed the backbone of the final two programmable EHW platforms. 
Chapter 6 presented the development and evaluation of the final two programmable platforms, 
dedicated to FIR coefficient multiplication. The first was based on a standard FPGA archi-
tecture, the second on a conventional PLA. Both PLDs were investigated using a number of 
filter input, tap output and PALU interconnect topologies. A 31-tap low-pass filter was used to 
provide a benchmark for comparison between each programmable platform and topology vari-
ation. Each PLA and FPGA topology was analysed based on a number of performance criteria 
such as the quality of filter response produced by the coefficient set, and the number of PALUs 
utilised in any given array. 
The most effective PLA-based filter architecture identified in Chapter 6 was translated into a
technology-dependent netlist in Chapter 7. Physical constraints were examined and modifications
to the original architecture were made where necessary. The ability of the platform to adapt to
increasing levels of faulty PALUs was also investigated. Results show that the quality of coefficients
evolved on the netlisted PLA was maintained. Three approaches to population seeding
were compared to see which most reduced fault recovery time after detection.
8.3 Conclusions 
This thesis proposed that a programmable platform tailored for evolvable hardware can be de-
veloped which is highly suited to the autonomous implementation of digital FIR filters. From 
information presented in Chapter 2 it can be concluded that a genetic algorithm provides suit-
able search mechanisms for developing digital circuits using the EHW approach. Gate-level 
evolution has been shown to produce digital circuits smaller in area than those developed using 
conventional design techniques. However, gate-level evolution fails on more complex circuits
as the search space which must be successfully navigated to generate them grows non-linearly 
with circuit complexity. It was shown that functional-level evolution can be used to constrain 
the search space by using larger circuit building blocks, providing the genetic algorithm with 
an easier means of finding solutions for more complex digital circuits. 
From the results presented for the Virtual Chip EHW platform developed in Chapter 3, it can be
concluded that functional-level evolution considerably outperforms gate-level by enabling the 
genetic algorithm to successfully generate more solutions for each of the DSP circuits invest-
igated. In addition, it can be concluded that for all circuits investigated on the Virtual Chip the 
genetic algorithm required less time to generate circuit solutions using a functional-level com-
ponent library than when the gate-level component library was employed. Because fewer logic 
elements are required to encode circuit descriptions using the functional library, this results in 
a search space several orders of magnitude smaller than that produced by the longer circuit 
encodings required when using the gate-level library. The timing and area characteristics of the 
circuits generated by both the functional and gate-level component libraries were comparable. 
However, in both cases the GA produced circuits which were equal to or better in performance
than functionally equivalent circuits generated using standard digital design techniques. It
can therefore be concluded that, for the DSP circuits evaluated, functional-level evolution does 
not result in the generation of circuits with lower performance in terms of physical area than 
those produced using simple gate primitives. 
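The effect of shorter circuit encodings on search space size can be made concrete with a rough calculation. The sketch below is only illustrative: it assumes each logic element independently selects one function from a library and one source for each input, and every number in it (element counts, library sizes, number of selectable sources) is hypothetical rather than taken from the Virtual Chip experiments.

```python
import math


def log10_search_space(num_elements, library_size, inputs_per_element, num_sources):
    """log10 of the number of distinct configurations of an array.

    Assumes every element independently chooses one function and one
    source for each of its inputs.
    """
    choices_per_element = library_size * num_sources ** inputs_per_element
    return num_elements * math.log10(choices_per_element)


# Hypothetical encodings of the same circuit (illustrative numbers only).
gate_level = log10_search_space(num_elements=40, library_size=6,
                                inputs_per_element=2, num_sources=20)
functional_level = log10_search_space(num_elements=10, library_size=8,
                                      inputs_per_element=2, num_sources=12)

# Difference in orders of magnitude between the two search spaces.
orders_of_magnitude_saved = gate_level - functional_level
print(f"gate-level: 10^{gate_level:.1f}, functional-level: 10^{functional_level:.1f}")
```

Under these assumptions the number of elements appears in the exponent, so needing fewer elements shrinks the space far more than a slightly larger component library enlarges it.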
Phased evolution, also presented in Chapter 3, partitions circuit complexity and in doing so 
reduces the search space into smaller landscapes, related to each sub-circuit. It can therefore be 
concluded that this segmented approach reduces the associated degree of epistasis inherent in 
the chromosome's circuit encoding. This makes it possible to evolve complex multiplier circuits
more effectively than simply evolving the circuit as a single entity. Results also demonstrate 
the non-uniformity in the complexity of the multiplier architecture related to individual output 
paths. It can be concluded that multiplier circuits generated using phased evolution are of equi-
valent performance in terms of area and timing to multipliers generated using standard design 
techniques, and to multipliers produced by other published EHW design techniques. However, the failure of the
Virtual Chip to generate a 4x4 bit parallel multiplier when using phased evolution indicates that
larger logic components of greater functionality are required to generate a multiplication unit 
of sufficient complexity to implement FIR coefficient multiplication. In addition, the success of 
other published EHW approaches indicates that a more constrained programmable architecture 
is required to further reduce the multiplier search space. 
Chapter 6 presented the development of two programmable platforms inspired by two different 
classes of PLD. Each platform was specifically designed to implement FIR coefficient multi-
plication using a distributed, multiplierless architecture based around primitive operator filters 
(POF). Initial results support the conclusion that crossover was not effective in enabling the 
genetic algorithm to configure either programmable platform to implement the specified coefficient
set. This is due to the high level of epistasis inherent in the POF design problem: interactions
between PALUs and programmable interconnects are non-linear, so filter fitness cannot be
directly attributed to the effects of an individual PALU.
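When crossover contributes little, a search driven by selection and mutation alone is a natural alternative. The following Python sketch shows such a loop under stated assumptions: a binary configuration string, tournament selection, bit-flip mutation and single-individual elitism. It is not the VHDL GA used in this thesis, and the names mutate and evolve_mutation_only, as well as all parameter values, are hypothetical.

```python
import random


def mutate(genome, rng, rate):
    """Flip each bit independently with probability rate."""
    return [1 - bit if rng.random() < rate else bit for bit in genome]


def evolve_mutation_only(fitness, genome_len, pop_size=20, generations=100,
                         rate=0.05, seed=0):
    """Evolve a bit string using selection and mutation, with no crossover."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_population = [best]  # elitism: the best string survives unchanged
        while len(next_population) < pop_size:
            a, b = rng.sample(population, 2)          # binary tournament
            parent = a if fitness(a) >= fitness(b) else b
            next_population.append(mutate(parent, rng, rate))
        population = next_population
        best = max(population, key=fitness)
    return best


# Toy usage: maximise the number of set bits in a 16-bit string.
best_string = evolve_mutation_only(sum, genome_len=16, pop_size=10,
                                   generations=30, seed=1)
```

The toy fitness used here has no epistasis at all; it only demonstrates the control flow, not the difficulty of the POF configuration problem.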
From a number of PLA and FPGA-based filter topologies examined, it can be concluded that 
the PLA-based filter architecture considerably outperformed the FPGA in terms of the quality 
of coefficient sets produced. This is supported by results in Chapter 6 which show that on
average the PLA-based architectures produced coefficients with a fitness score 4% higher than 
those produced on the FPGA. It can further be concluded that the performance of the PLA is 
attributed to the higher degree of flexibility afforded by the PLA interconnect topologies which 
utilise hierarchical connectivity, and that the critical path of the PLA is in every case shorter than 
the FPGA. These two factors have been shown to reduce linkage between PALUs, indicated 
by reductions in PALU re-use, which improves performance. It has therefore been shown 
that a PLA architecture implementing column-based tap placement with a nearest neighbour 
and 2-level PALU interconnect hierarchy is the most effective programmable platform for the
autonomous implementation of FIR filter coefficient multiplication using EHW. 
From translating the most effective PLA architecture identified into a synthesisable, physically
realisable component netlist with reduced PALU interconnect, it can be concluded that no
reduction in performance was experienced when compared to the original PLA model. The creation
of a 6x16 PLA revealed that changing the dimensions of the PLA significantly reduces
the latency of the filter to 7 clock cycles, irrespective of tap length, and incurs no reduction
in the quality of filter coefficients produced. In fact, around 20% fewer PALUs were required
to implement the 31-tap benchmark filter when the 6x16 PLA was employed. Similar ratios
were observed for PALU re-use, supporting the conclusion that lower PALU linkage
results in better PLD performance.
The fault tolerance of the 6x16 PLA was investigated in Chapter 7. Results show that the
quality of filter coefficients evolved on the PLA can be maintained despite a 25% increase in
the number of faulty PALUs present on the array. It can therefore be concluded that the 6x16
PLA provides sufficient PALU redundancy to enable the GA to overcome the inclusion of a
significant number of faults. Three approaches to population seeding were compared: random
initialisation, population seeding and population recall. Each approach was examined to see
which most reduced fault recovery time after faults were detected on the PLA. Results show that
population seeding reproduced coefficients of acceptable fitness 6 times faster than
random initialisation and 3 times faster than population recall. It can therefore be
concluded that population seeding provides the most effective means of adapting a population
of configuration strings to overcome faults on the 6x16 PLA for a given set of filter coefficients.
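The three ways of restarting the GA after a fault can be contrasted in a short Python sketch. The interpretations used here are assumptions made for illustration only: seeding is taken to mean starting from the best pre-fault configuration string plus lightly mutated copies of it, and recall is taken to mean reloading the entire population stored before the fault. The function names, the binary encoding and the mutation rate are all hypothetical.

```python
import random


def random_initialisation(rng, pop_size, genome_len):
    """Start from scratch: every configuration bit is drawn at random."""
    return [[rng.randint(0, 1) for _ in range(genome_len)]
            for _ in range(pop_size)]


def population_seeding(rng, pop_size, best_known, rate=0.05):
    """Assumed meaning: keep the best pre-fault string and fill the rest
    of the population with lightly mutated copies of it."""
    population = [list(best_known)]
    while len(population) < pop_size:
        population.append([1 - bit if rng.random() < rate else bit
                           for bit in best_known])
    return population


def population_recall(stored_population):
    """Assumed meaning: reload the whole population saved before the fault."""
    return [list(genome) for genome in stored_population]


rng = random.Random(0)
best_before_fault = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical configuration
saved = random_initialisation(rng, pop_size=6, genome_len=8)

from_random = random_initialisation(rng, pop_size=6, genome_len=8)
from_seed = population_seeding(rng, pop_size=6, best_known=best_before_fault)
from_recall = population_recall(saved)
```

Whichever strategy is used, the resulting population would then be re-evaluated on the faulty array, since a configuration that was fit before the fault may route through a PALU that no longer works.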
8.4 Achievements 
The research presented in this thesis has required the development of a number of software and 
hardware-based models and programs. For completeness these are highlighted below: 
• The development of a novel genetic algorithm written in C for the Virtual Chip EHW
platform.
• The creation of the Virtual Chip EHW environment used to autonomously generate
VHDL descriptions of evolving circuit solutions.
• The development of a Programmable Arithmetic Logic Unit (PALU) written and parameterised
in VHDL and inspired by the concept of the primitive operator filter (POF)
approach to implementing FIR filters.
• The development of an FPGA and PLA-inspired array of PALUs dedicated to program-
mable FIR filter coefficient multiplication. The programmable arrays were written in 
VHDL and designed to express a number of interconnects and filter input and output 
topologies which could then be configured using a GA. 
• Generation of a parametrisable genetic algorithm, written in VHDL and embedded along-
side the FPGA and PLA-based programmable platforms. 
• Development of visualisation software written in C and designed to model both the FPGA 
and PLA-based architectures and graphically depict the filter configurations produced. 
This was achieved using PostScript format.
8.5 Future Work 
This thesis has endeavoured to provide a rigorous investigation of the focus of research outlined 
in section 8.1. However, a number of additional potentially interesting areas remain which 
might further add to the knowledge already gained from the research presented. 
The suitability of each coefficient string used to configure the two PLD-based programmable 
platforms is currently calculated by the fitness of the coefficient set produced. Filter performance
might be further improved by including a Pareto-based fitness measure incorporating all
four performance criteria: PALU utilisation, PALU re-use, the ratio of additions, subtractions
and bit-shifts, and coefficient fitness. Whilst this would not directly lead to a more efficient
comparison between EHW platforms, it might provide more optimised filter implementations.
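A Pareto-based measure ranks candidate configurations by dominance rather than by a single score. The Python sketch below shows the basic dominance test and the extraction of a non-dominated front. It assumes every criterion has been expressed as a cost to be minimised, and the example tuples (coefficient error, PALUs used, PALU re-uses) are hypothetical values, not results from this thesis.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimised)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c)
                       for other in candidates if other is not c)]


# Hypothetical objective vectors: (coefficient error, PALUs used, PALU re-uses)
candidates = [
    (0.1, 40, 5),
    (0.1, 50, 5),   # same error as the first, but uses more PALUs
    (0.2, 30, 4),   # worse error, but cheaper in PALUs and re-use
    (0.3, 45, 6),   # worse than the first on every criterion
]
front = pareto_front(candidates)
```

A GA could then prefer configurations on the front, leaving the final trade-off between coefficient quality and PALU resources to the designer.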
Another method of improving the effectiveness with which the GA can configure each programmable
platform would be to add heuristic knowledge of both the programmable architecture
and the POF design approach into the GA search function. This might, for example, utilise the
directed graph/GA hybrid approach implemented by Redmill et al. in [97].
An investigation as to the suitability of other search techniques such as simulated annealing 
for autonomously implementing filter coefficients on each of the three programmable platforms
might also be performed. This would provide a useful evaluation of the effectiveness of EHW 
for the automated digital filter design problem presented in this thesis. 
Finally, it is possible to extend the Virtual Chip EHW platform to enable POF-based coefficient
multiplication. This would then provide a means of accurately appraising the performance of 
the Virtual Chip when compared to the PLA and FPGA-based EHW platforms. 
8.6 Final Comments 
In summary, this thesis has investigated whether a programmable platform tailored for evolvable 
hardware can be developed which is highly suited to the autonomous implementation of digital 
FIR filters. It can be concluded that the 6xN PLA architecture developed in Chapter 7
provides the most suitable platform for evolving coefficient taps in terms of the quality of filter
coefficients produced, the PALU resources utilised, the overall latency of the filter implemented,
and the architecture's resilience to faults through controlled redundancy.
High performance digital filters are in great demand throughout the communication industry 
and other sectors that require data control and manipulation. Industrial requirements include 
fast operational speed, low physical area, device portability, and reliability/robustness. The 
PLA-based EHW platform developed satisfies many of these conditions, and could be embed-
ded into a number of SoC devices that might benefit from online adaptive data manipulation. 
References 
[1] Z. Yajiang, H. Lingyi, Q. Yulin, X. Xia, and H. Xiaoling, "A high speed multiplication-and-accumulation design methodology for submicron and deep submicron DSP solutions," in 5th IEEE Int. Conf. on Solid-State and Integrated Circuit Technology, pp. 502-504, 1998.
[2] D. R. Bull and D. H. Horrocks, "Primitive operator digital filters," in IEE Proc.-G, pp. 401-412, 1991.
[3] V. M. Porto, "Evolutionary methods for training neural networks for underwater pattern classification," in 24th Ann. Asilomar Conf. on Signals, Systems and Computers, vol. 2, pp. 1015-1019, 1989.
[4] D. B. Fogel, L. J. Fogel, and V. M. Porto, "Evolving neural networks," Biol. Cybernet., vol. 63, pp. 487-93, 1990.
[5] P. Angeline, G. Saunders, and J. Pollack, "Complete induction of recurrent neural networks," in Proc. 3rd Ann. Conf. on Evolutionary Programming, pp. 1-8, 1994.
[6] K. P. Dahal, G. M. Burt, J. R. McDonald, and A. Moyes, "A case study of scheduling storage tanks using a hybrid genetic algorithm," IEEE Transactions on Evolutionary Computation, vol. 5, pp. 283-294, June 2001.
[7] R. Chandrasekharam, S. Subhramanian, and S. Chaudhury, "Genetic algorithm for node partitioning problem and applications in VLSI design," in IEE Proceedings Computers and Digital Techniques, vol. 140, pp. 255-260, Sept 1993.
[8] B. M. Goni and T. Arslan, "An evolutionary 3D over-the-cell router," in 12th Annual IEEE International ASIC/SOC Conference, (Washington, DC.), pp. 206-209, Sept 15-18 1999.
[9] J. Arabas and S. Kozdrowski, "Applying an evolutionary algorithm to telecommunication network design," IEEE Trans. on Evolutionary Computation, vol. 5, pp. 309-322, Aug 2001.
[10] R. Drechsler, Evolutionary Algorithms for VLSI CAD. Boston MA: Kluwer, ISBN 0-7923-8168-8, 1998.
[11] T. Higuchi, M. Iwata, D. Keymeulen, H. Sakanashi, M. Murakawa, I. Kajitani, E. Takahashi, and K. Toda, "Real-world applications of analog and digital evolvable hardware," IEEE Transactions on Evolutionary Computation, vol. 3, pp. 220-235, September 1999.
[12] V. K. Vassilev, D. Job, and J. F. Miller, "Towards the automatic design of more efficient digital circuits," in Proceedings of the Second NASA/DOD Workshop on Evolvable Hardware, pp. 151-160, 2000.
[13] A. Thompson, "An evolved circuit, intrinsic in silicon entwined with physics," in Evolvable Systems: From Biology to Hardware (ICES 96), pp. 390-405, 1996.
[14] X. Yao and T. Higuchi, "Promises and challenges of evolvable hardware," in Evolvable Systems: From Biology to Hardware (ICES 96), pp. 55-80, 1996.
[15] M. Sipper and D. Mange, "Guest editorial from biology to hardware and back," IEEE Transactions on Evolutionary Computation, 1999.
[16] K. Imamura, J. A. Foster, and A. W. Krings, "The test vector problem and limitations to evolving digital circuits," in Proceedings of the Second NASA/DOD Workshop on Evolvable Hardware, pp. 75-79, 2000.
[17] G. Tufte and P. C. Haddow, "Evolving and adaptive filter," in Proceedings of the Second NASA/DOD Workshop on Evolvable Hardware, pp. 143-150, 2000.
[18] S. Levi and A. K. Agrawala, Fault tolerant system design. McGraw-Hill, 1994.
[19] A. Thompson and P. Layzell, "Analysis of unconventional evolved electronics," Communications of the ACM, vol. 42, pp. 71-79, Apr. 1999.
[20] A. M. Tyrrell, G. Hollingworth, and S. L. Smith, "Evolutionary strategies and intrinsic fault tolerance," in Proceedings of the Third NASA/DOD Workshop on Evolvable Hardware, pp. 98-108, July 2001.
[21] T.-S. Park, C.-H. Lee, and D.-J. Chung, "Intrinsic evolution for synthesis of fault recoverable circuit," IEICE Trans. Fundamentals, pp. 2488-2497, Dec. 2000.
[22] A. Stoica, D. Keymeulen, and R. Zebulum, "Evolvable hardware solutions for extreme temperature electronics," in Proceedings of the Third NASA/DOD Workshop on Evolvable Hardware, pp. 93-97, July 2001.
[23] M. Salami, H. Sakanashi, M. Tanaka, M. Iwata, T. Kurita, and T. Higuchi, "On-line compression of high precision printer images by evolvable hardware," in Proceedings of Data Compression Conference, DCC '98, pp. 219-228, 1998.
[24] www.xilinx.com, "Xilinx data book 2000."
[25] A. T. G. Fuller and B. Nowrouzian, "A novel technique for optimization over the canonical signed-digit number space using genetic algorithms," in IEEE Int. Symp. Circuits and Systems, ISCAS, vol. 2, pp. 745-748, 2001.
[26] G. Wade, A. Roberts, and G. Williams, "Multiplier-less FIR filter design using a genetic algorithm," IEE Proc. Vision, Image and Signal Processing, vol. 141, pp. 175-180, June 1994.
[27] B. I. Hounsell and T. Arslan, "A novel evolvable hardware framework for the evolution of high performance digital circuits," in Proceedings of GECCO 2000, vol. 1, (Las Vegas USA), pp. 525-532, July 2000.
[28] B. I. Hounsell and T. Arslan, "A novel genetic algorithm for the automated design of performance driven digital circuits," in Proceedings of IEEE Congress on Evolutionary Computation (CEC), vol. 1, (La Jolla USA), pp. 601-608, July 2000.
[29] B. I. Hounsell and T. Arslan, "Evolutionary design and adaptation of digital filters within an embedded fault tolerant hardware platform," in Proceedings of 3rd NASA/DOD IEEE Workshop on Evolvable Hardware, vol. 1, (Los Angeles USA), pp. 127-135, July 2001.
[30] B. I. Hounsell and T. Arslan, "A programmable multiplierless digital filter array for embedded SoC application," IEE Electronics Letters, vol. 37, pp. 737-737, June 2001.
[31] B. I. Hounsell and T. Arslan, "An embedded programmable core for the implementation of high performance digital filters," in Proceedings of 14th Annual IEEE International ASIC/SoC Conference, (Washington USA), pp. 12-14, Sept 2001.
[32] B. I. Hounsell and T. Arslan, "An embedded programmable logic array for online adaptation of multiplierless FIR filters," Submitted to IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2001.
[33] E. F. Moore, "Gedanken-experiments on sequential machines: automata studies," in Annals of Mathematical Studies, vol. 34, pp. 129-153, Princeton, NJ: Princeton University Press, 1957.
[34] G. H. Mealy, "A method of synthesizing sequential circuits," Bell Syst. Tech. J., vol. 34, pp. 1054-79, 1955.
[35] L. J. Fogel, A. J. Owens, and M. J. Walsh, Artificial intelligence through simulated evolution. New York: Wiley, 1966.
[36] D. B. Fogel and J. L. Fogel, "Optimal routing of multiple autonomous underwater vehicles through evolutionary programming," in IEEE Proc. Symp. on Autonomous Underwater Vehicle Technology, pp. 44-47, 1990.
[37] W. C. Page, J. M. McDonnell, and B. Anderson, "An evolutionary programming approach to multi-dimensional path planning," in Proc. 1st Ann. Conf. on Evolutionary Programming, pp. 63-70, 1992.
[38] P. G. Harrald and D. B. Fogel, "Evolving continuous behaviours in the iterated prisoner's dilemma," BioSystems, vol. 37, pp. 135-145, 1996.
[39] D. B. Fogel, "The evolution of intelligent decision making in gaming," Cybernet. Syst., vol. 22, pp. 223-236, 1991.
[40] D. B. Fogel, "Applying evolutionary programming to selected travelling salesman problems," Cybernet. Syst., vol. 24, pp. 27-36, 1993.
[41] I. Rechenberg, "Evolutionsstrategien," in Simulationsmethoden in der Medizin und Biologie (B. Schneider and U. Ranft, eds.), pp. 83-114, Berlin: Springer, 1978.
[42] T. Back, Evolutionary Algorithms in Theory and Practice. Evolutionary Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, 1996.
[43] J. R. Koza, "Hierarchical genetic algorithms operating on populations of computer pro-




[44] N. L. Cramer, "A representation of the adaptive generation of simple sequential programs," in Proc. 1st Int. Conf. on Genetic Algorithms (J. J. Grefenstette, ed.), Hillsdale, NJ: Erlbaum, July 1985.
[45] J. R. Koza, F. H. Bennett III, D. Andre, M. A. Keane, and F. Dunlap, "Automated synthesis of analog electrical circuits by means of genetic programming," IEEE Trans. on Evolutionary Computation, vol. 9, pp. 109-128, July 1997.
[46] D. Andre, F. H. Bennett III, J. Koza, and M. A. Keane, "On the theory of designing circuits using genetic programming and a minimum of domain knowledge," in IEEE World Congress on Computational Intelligence, pp. 130-135, 1998.
[47] M. Brameier and W. Banzhaf, "A comparison of linear genetic programming and neural networks in medical data mining," IEEE Trans. on Evolutionary Computation, vol. 5, pp. 17-26, February 2001.
[48] P. S. Negan, M. L. Wong, K. S. Leung, and J. C. Y. Cheug, "Using grammar based genetic programming for data mining of medical knowledge," in Proc. 3rd Annu. Conf. on Genetic Programming, 1998.
[49] R. B. Nachbar, "Molecular evolution: Automated manipulation of hierarchical chemical topology and its application to average molecular structures," Genetic Programming and Evolvable Machines, vol. 1, pp. 57-94, April 2000.
[50] J. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
[51] K. E. Kinnear Jr., ed., Advances in Genetic Programming. Cambridge MA: MIT Press, 1994.
[52] J. Holland, Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.
[53] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, 1989.
[54] T. Back, D. B. Fogel, and Z. Michalewicz, eds., Evolutionary Computation 1: Basic Algorithms and Operators. Institute of Physics Publishing, ISBN 0750306645, 2000.
[55] P. Thompson, "Circuit evolution and visualisation," in Evolvable Systems: From Biology to Hardware (ICES 2000), pp. 229-240, 2000.
[56] M. Pedram, "Power minimisation in IC design: Principles and applications," ACM Transactions on Design Automation of Electronic Systems, vol. 1, no. 1, pp. 3-56, January 1996.
[57] V. K. Vassilev and J. F. Miller, "Scalability problems of digital circuit evolution," in Proceedings of the Second NASA/DOD Workshop on Evolvable Hardware, pp. 55-64, 2000.
[58] V. K. Vassilev, J. F. Miller, and T. C. Fogarty, "On the nature of two-bit multiplier landscapes," in Proceedings of the First NASA/DOD Workshop on Evolvable Hardware, pp. 36-45, 1999.
[59] M. Murakawa, "Hardware evolution at functional level," in Proc. of the International Conference on Parallel Problem Solving from Nature (PPSN'96), pp. 62-71, 1996.
[60] I. Kajitani, T. Hoshino, D. Nishikawa, H. Yokoi, S. Nakaya, T. Yamauchi, T. Inuo, N. Kajihara, M. Iwata, D. Keymeulen, and T. Higuchi, "A gate-level EHW chip: Implementing GA operations and reconfigurable hardware on a single LSI," in Evolvable Systems: From Biology to Hardware (ICES 98), pp. 1-12, 1998.
[61] E. Ozdemir, Evolutionary methods for the design of digital circuits and systems. PhD thesis, The University of Wales, Cardiff, 1999.
[62] M. Tanaka, H. Sakanashi, M. Salami, M. Iwata, T. Kurita, and T. Higuchi, "Data compression for digital color electrophotographic printer with evolvable hardware," in Evolvable Systems: From Biology to Hardware (ICES 98), 1998.
[63] M. Murakawa, S. Yoshizawa, and T. Higuchi, "Adaptive equalization of digital communication channels using evolvable hardware," in Evolvable Systems: From Biology to Hardware (ICES 96), pp. 379-389, October 1996. Functional level approach to EHW using tailored GA/hardware chip.
[64] R. S. Zebulum, M. A. Pacheco, and M. Vellasco, "Evolvable systems in hardware design taxonomy, survey and applications," in Evolvable Systems: From Biology to Hardware (ICES 96), pp. 344-358, 1996.
[65] A. Hernandez-Aguirre, C. A. Coello, and B. P. Buckles, "A genetic programming approach to logic function synthesis by means of multiplexers," in Proceedings of the First NASA/DoD Workshop on Evolvable Hardware, pp. 46-53, 1999.
[66] R. Drechsler and W. Günther, "Evolutionary synthesis of multiplexor circuits under hardware constraints," in Proceedings of Genetic and Evolutionary Computation Conference (GECCO-2000), pp. 513-518, 2000.
[67] T. Arslan, D. H. Horrocks, and E. Ozdemir, "Structural cell-based VLSI circuit design using a genetic algorithm," in IEEE International Symposium on Circuits and Systems, (Atlanta, USA), pp. 308-311, 1996.
[68] J. F. Miller and P. Thompson, "Aspects of digital evolution: Geometry and learning," in Evolvable Systems: From Biology to Hardware (ICES 98), pp. 25-35, 1998.
[69] A. Thompson, "On the automatic design of robust electronics through artificial evolution," in Evolvable Systems: From Biology to Hardware (ICES 98), pp. 13-24, 1998.
[70] P. Layzell, "Reducing hardware evolution's dependency on FPGAs," in Proceedings of the Seventh International Conference on Microelectronics for Neural, Fuzzy and Bio-Inspired Systems, MicroNeuro '99, pp. 171-178, 1999.
[71] D. Levi and S. A. Guccione, "GeneticFPGA: Evolving stable circuits on mainstream FPGAs," in Proceedings of the First NASA/DOD Workshop on Evolvable Hardware (A. Stoica, D. Keymeulen, and J. Lohn, eds.), pp. 12-17, IEEE Computer Society Press, Los Alamitos, July 1999.
[72] G. Tufte and P. C. Haddow, "Prototyping a GA pipeline for complete hardware evolution," in Proceedings of the First NASA/DoD Workshop on Evolvable Hardware, pp. 18-25, 1999.
[73] A. Hamilton, K. Papathanasiou, M. Tamplin, and T. Brandtner, "Palmo: Field programmable analogue and mixed-signal VLSI for evolvable hardware," in Evolvable Systems: From Biology to Hardware (ICES 98), pp. 335-344, 1998.
[74] A. Hamilton, P. Thompson, and M. Tamplin, "Experiments in evolvable filter design using pulse based programmable analogue VLSI models," in Evolvable Systems: From Biology to Hardware (ICES 2000), pp. 61-71, 2000.
[75] A. Stoica, R. Zebulum, D. Keymeulen, R. Tawel, T. Daud, and A. Thakoor, "Reconfigurable VLSI architectures for evolvable hardware: From experimental field programmable transistor arrays to evolution-oriented chips," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 9, pp. 227-232, February 2001.
[76] P. J. Ashenden, The Designer's Guide to VHDL. Morgan Kaufmann Publishers, Inc, 1995.
[77] T. Hikage, H. Hemmi, and K. Shimohara, "Hardware evolution system introducing dominant and recessive heredity," in Evolvable Systems: From Biology to Hardware (ICES 96), pp. 423-436, Springer, October 1996.
[78] A. Thompson, P. Layzell, and R. S. Zebulum, "Explorations in design space: Unconventional electronics design through artificial evolution," IEEE Transactions on Evolutionary Computation, vol. 3, pp. 267-196, September 1999.
[79] Cadence Design Systems, Inc., BuildGates User Guide, release 2.3 ed., March 1999.
[80] J. Bergeron, Writing Testbenches: Functional Verification of HDL Models. Kluwer Academic Publishers, 2000.
[81] V. K. Vassilev, J. F. Miller, and T. C. Fogarty, "Digital circuit evolution and fitness landscapes," in Proceedings of the 1999 Congress on Evolutionary Computation, CEC 99, vol. 2, pp. 1299-1306, 1999.
[82] Y. Davidor, "Epistasis variance - suitability of a representation to genetic algorithms," tech. rep., The Weizmann Institute of Science, Dept of Applied Mathematics and Computer Science, December 1989.
[83] J. G. Proakis and D. G. Manolakis, Digital Signal Processing, ch. 7, pp. 470-485. Macmillan Publishing Company, NY, 2 ed., 1992.
[84] B. Mulgrew, P. Grant, and J. Thompson, Digital Signal Processing Concepts and Applications. MacMillan Press LTD, 1999.
[85] N. M. Mitrou, "Results on nonrecursive digital filters with nonequidistant taps," IEEE Trans. Acoust., Speech, Signal Processing, vol. 33, pp. 1621-1624, Dec. 1985.
[86] R. J. Hartnett, "Design of efficient parallel hybrid FIR filters using dynamic programming and subset selection methods," in Proc. 1990 Int. Conf. Acoust., Speech, Signal Processing, pp. 1337-1340, 1990.
[87] J. T. Kim, W. J. Oh, and Y. H. Lee, "Design of nonuniformly spaced linear-phase FIR filters using mixed integer linear programming," IEEE Trans. Signal Processing, vol. 44, pp. 123-126, Jan. 1996.
[88] A. Avizienis, "Signed-digit number representation for fast parallel arithmetic," IRE Trans. Electron. Comput., pp. 389-400, 1961.
[89] H. Samueli, "An improved search algorithm for the design of multiplierless FIR filters with powers-of-two coefficients," IEEE Trans. Circuits Syst., vol. 36, pp. 1044-1047, July 1989.
[90] X. Xu and B. Nowrouzian, "Local search algorithm for the design of multiplierless digital filters with CSD multiplier coefficients," in IEEE Canadian Conference on Electrical and Computer Engineering, vol. 2, pp. 811-816, May 1999.
[91] Y. C. Lim and S. R. Parker, "FIR filter design over a discrete powers-of-two coefficient space," IEEE Trans. Acoust., Speech, Signal Processing, vol. 31, pp. 583-591, June 1983.
[92] Q. Zhao and Y. Tadokoro, "A simple design of FIR filters with powers-of-two coefficients," IEEE Trans. Circuits Syst., vol. 35, pp. 566-570, May 1988.
[93] P. Gentili, F. Piazza, and A. Uncini, "Efficient genetic algorithm design for power-of-two FIR filters," in Int. Conf. Acoustics, Speech, and Signal Processing, ICASSP-95, vol. 2, pp. 1268-1271, 1995.
[94] S. Sriranganathan, D. R. Bull, and D. W. Redmill, "Design of 2-D multiplierless FIR filters using genetic algorithms," in First Int. Conf. on Genetic Algorithms in Engineering Systems: Innovations and Applications, GALESIA, pp. 282-286, 1995.
[95] G. Wacey and D. R. Bull, "Architectural synthesis of digital filters for ASIC implementation," in Digital and Analogue Filter and Filtering Systems, IEE Colloquium on, pp. 6/1-6/5, 1991.
[96] T. Arslan, H. I. Eskikurt, and D. H. Horrocks, "Configurable structures for a primitive operator digital filter FPGA," in IEEE Workshop on Signal Processing Systems, SIPS 97 - Design and Implementation, pp. 532-540, 1997.
[97] D. W. Redmill, D. R. Bull, and E. Dagless, "Genetic synthesis of reduced complexity filters and filter banks using primitive operator directed graphs," IEE Proc.-Circuits Devices Syst., vol. 147, pp. 303-310, Oct. 2000.
[98] D. R. Bull, Reduced complexity. PhD thesis, School of Electronic and Systems Engineering, University of Wales, 1989.
[99] R. A. Hawley, B. C. Wong, T. ji Lin, J. Laskowski, and H. Samueli, "Design techniques for silicon compiler implementations of high-speed FIR digital filters," IEEE Journal of Solid-State Circuits, vol. 31, pp. 656-667, May 1996.
[100] J. B. Evans, "An efficient FIR filter architecture," in Proc. Int. Symposium Acoust., Speech, Signal Processing, vol. 1, pp. 627-630, May 1993.
[101] I. E. Ungan and M. Askar, "A gate array chip for high frequency DSP applications," in Proc. Int. Conf. 7th Mediterranean Electrotechnical, pp. 549-552, 1994.
[102] S. Nooshabadi, J. A. Montiel-Nelson, and G. S. Visweswarain, "Micropipeline architecture for multiplier-less FIR filters," in Proc. Int. Conf. 10th Conference on VLSI Design, pp. 451-456, Jan. 1997.
[103] S. Yoon and M. H. Sunwoo, "An efficient multiplierless FIR filter chip with variable-length taps," in IEEE Workshop on Signal Processing Systems, SIPS 97 - Design and Implementation, pp. 412-420, 1997.
[104] K.-Y. Khoo, A. Kwentus, and J. A. N. Wilson, "An efficient 175MHz programmable FIR digital filter," in IEEE Int. Symp. on Circuits and Systems, ISCAS '93, pp. 72-75, 1993.
[105] K.-Y. Khoo, A. Kwentus, and J. A. N. Wilson, "A programmable FIR digital filter using CSD coefficients," IEEE Journal of Solid-State Circuits, vol. 31, pp. 869-874, June 1996.
[106] W. J. Oh and Y. H. Lee, "Implementation of programmable multiplierless FIR filters with powers-of-two coefficients," IEEE Trans. Circuits Syst., vol. 42, pp. 553-556, Aug. 1995.
[107] S. R. Powell and P. M. Chau, "Reduced complexity programmable FIR filters," in IEEE Int. Symp. on Circuits and Systems, ISCAS '92, Proceedings, pp. 561-564, 1992.
[108] J. F. Miller, "On the filtering properties of evolved gate arrays," in Proceedings of the First NASA/DOD Workshop on Evolvable Hardware, pp. 2-11, 1999.
[109] S. J. Flockton and K. Sheehan, "Behaviour of a building block for intrinsic evolution of analogue signal shaping and filtering circuits," in Proceedings of the Second NASA/DOD Workshop on Evolvable Hardware, pp. 117-123, 2000.
[110] Y. Tamir and M. Tremblay, "High performance fault-tolerant VLSI systems using micro rollback," IEEE Trans. Computers, vol. 39, pp. 548-554, Apr. 1990.
[111] J. H. Patel and L. Y. Fung, "Concurrent error detection in ALU's by recomputing with shifted operands," IEEE Trans. Computers, vol. 31, pp. 589-595, July 1982.
[112] R. Karri, K. Hogstedt, and A. Orailoglu, "Rapid prototyping of fault tolerant VLSI systems," in Int. Symp. on High-Level Synthesis, 7th Proc., pp. 126-131, 1994.
[113] L. Mintzer, "Digital filtering in FPGAs," in Twenty-Eighth Asilomar Conf. on Signals, Systems and Computers, vol. 2, pp. 1373-1377, 1994.
[114] S. A. White, "Applications of distributed arithmetic to digital signal processing: A tutorial review," IEEE ASSP Magazine, vol. 6, no. 3, pp. 4-19, 1989.
[115] M. Martinez-Peiro, J. Valls, T. Sansaloni, A. P. Pascual, and E. I. Boemo, "A comparison between lattice, cascade and direct form FIR filter structures by using a FPGA bit-serial distributed arithmetic implementation," in 6th IEEE Proc. Int. Conf. Electronics, Circuits and Systems (ICECS'99), vol. 1, pp. 241-244, 1991.
[116] V. Pasham, A. Miller, and K. Chapman, "Application notes from Virtex and Virtex-II series: Transposed form FIR filters," tech. rep., Xilinx, January 10th 2001.
160 
References 
A. Amira, A custom coprocessor for matrix algorithms. PhD thesis, University of Be!-
fast, 2001. 
D. C. Chen and J. M. Rabaey, "A reconfigurable multiprocessor ic for rapid prototyping 
of algorithmic-specific high-speed dsp data paths," IEEE journal of Solid-State Circuits, 
vol. 27, pp.  1895-1904, Dec 1992. 
K. Rajagopalan and P. Sutton, "A flexible multiplication unit for an fpga logic block," in 
Int. Symp on Circuits and Systems (ISCAS) 2001, vol.4, pp.  546-549,2001. 
N. Venkateswaran, A. K. Murugavel, and G. Chandramouli, "Field programmable dsp 
transform arrays," in IEEE Workshop on Signal Processing Systems (SIPS), pp.  152-161, 
1998. 
R. Porter, k. McCabe, and N. Bergmann, "An applications approach to evolvable hard-
ware," in In Proceedings of the First NASA/DOD Workshop on Evolvable Hardware, 
pp. 170-174, 1999. 
M. Sipper, M. Goeke, D. Mange, A. Stauffer, E. Sanchez, and M. Tomassini, "The firefly 
machine: online evoiware," in IEEE mt Conf on Evolutionary Computation, pp.  181-
186, 1997. 
Y.-H. Choi and D. J. Chung, "Vlsi procsesor of parallel genetic algorithm," in Proceed-
ings of the Second IEEEAsia Pacific Conference onASICs. AP-A SIC 2000, pp.  143-146, 
2000. 
T. Higuchi, M. Masahiro, M. Iwata, I. Kajitani, W. Liu, and M. Salami, "Evolvable 
hardware at functional level," in IEEE International Conference on Evolutionary Com-
putation, pp.  187-192, 1997. 
S. Wakabayashi, T. Koide, N. Toshine, M. Goto, Y. Nakayama, and K. Hayya, "An lsi 
implementation of an adaptive genetic algorithm with on-the-fly crossover operator se-
lection," in Proceedings of the ASP-DAC '99. Asia and South PacificD esign Automation 
Conference, vol. 1, pp.  37-40, 1999. 
E. Zwyssig, "Low power digital filter design for hearing aid applications," Master's 
thesis, The University of Edinburgh UK, 2000. 
C. R. Reeves, "Predictive measures for problem difficulty," in Congress on Evolutionary 
Computation, CEC, vol. 1, pp.  736-743,1999. 
P. Merz and B. Freisleben, "On the effectiveness of evolutionary search in high-
dimensional nk-landscapes," in IEEE mt. Conf on Computational Intelligence. Evol-
utionary Computation Proceeding, pp. 741-745, 1998. 
H. Tsutsui, K. Hiwada, T. Izumi, T. Onoye, and Y. Nakamura, "A design of lut-array -
based pid and a synthesis approach based on sum of generalized complex terms expres-




J. McClellan, T. W. Parks, and L. R. Rabiner, "A computer program for designing 
optimum fir linear phase digital filters," Transactions on Audio and Electroacoustics, 
pp. 506-526, Dec. 1973. 
A. Thompson, "Evolving fault tolerant systems," inFirstlnt. Con! on GeneticAlgorithms 
in Engineering Systems: Innovations and applications. GALESIA, pp. 524-529, 1995. 
A. Stoica, D. Keymeulen, V. Duong, and C. Salazar-Lazaro, "Automatic synthesis and 
fault-tolerant experiments on an evolvable hardware platform," in IEEE Aerospace Con-
ference Proceedings, vol. 5, pp.  465-471, 2000. 
J. D. Lohn, G. L. Haith, S. P. Colombano, and D. Stassinopoulos, "Towards evolving cir-
cuits for autonomous space applications," in IEEE Aerospace Conference Proceedings, 
vol. 5, pp.  473-486,2000. 
162 
Appendix A 
VHDL Code for DSP Circuits 
A.1 VHDL gate-level description of 2-bit multiplier
LIBRARY ieee; USE ieee.std_logic_1164.ALL;
USE WORK.library_cells.ALL;
ENTITY multi_2bit IS
  PORT(SIGNAL in0, in1, in2, in3 : IN std_ulogic;
       SIGNAL out0, out1, out2, out3 : OUT std_ulogic);
END multi_2bit;
ARCHITECTURE struc OF multi_2bit IS
  SIGNAL inter0, inter1, inter2, inter3 : std_ulogic;
  SIGNAL zero : std_ulogic := '0';
BEGIN
  cell_0 : AND_2input
    PORT MAP (in0, in2, out0);
  cell_1 : AND_2input
    PORT MAP (in1, in2, inter0);
  cell_2 : AND_2input
    PORT MAP (in1, in3, inter2);
  cell_3 : AND_2input
    PORT MAP (in0, in3, inter1);
  cell_4 : FULLADDER
    PORT MAP (inter1, zero, inter0, out1, inter3);
  cell_5 : FULLADDER
    PORT MAP (zero, inter3, inter2, out2, out3);
END struc;
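The structural netlist above can be checked against ordinary integer multiplication with a small software model. The following Python sketch is illustrative only and is not part of the original design flow. It assumes that the first two ports are in0 and in1, that the operands are A = (in1, in0) and B = (in3, in2), and that the FULLADDER cell from library_cells has the port order (a, b, carry_in, sum, carry_out); the library itself is not listed in this appendix, so that port order is an assumption.

```python
from itertools import product

def and2(a, b):
    """Two-input AND gate on single bits (0 or 1)."""
    return a & b

def full_adder(a, b, cin):
    """One-bit full adder; returns (sum, carry_out).
    Port order is an assumption, library_cells is not listed."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def multi_2bit(in0, in1, in2, in3):
    """Mirror of the structural netlist, cell by cell."""
    zero = 0
    out0 = and2(in0, in2)                             # cell_0
    inter0 = and2(in1, in2)                           # cell_1
    inter2 = and2(in1, in3)                           # cell_2
    inter1 = and2(in0, in3)                           # cell_3
    out1, inter3 = full_adder(inter1, zero, inter0)   # cell_4
    out2, out3 = full_adder(zero, inter3, inter2)     # cell_5
    return out0, out1, out2, out3

def check_all():
    """Compare the netlist model with a * b for all 16 input patterns."""
    for in0, in1, in2, in3 in product((0, 1), repeat=4):
        a = in0 | (in1 << 1)
        b = in2 | (in3 << 1)
        o0, o1, o2, o3 = multi_2bit(in0, in1, in2, in3)
        p = o0 | (o1 << 1) | (o2 << 2) | (o3 << 3)
        if p != a * b:
            return False
    return True

if __name__ == "__main__":
    print("netlist matches a*b:", check_all())
```

Under these assumptions the model agrees with the product for every input pattern, which is the behaviour expected of the benchmark circuit.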
A.2 7-bit pattern recognizer (one's voter)
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE WORK.library_cells.ALL;
ENTITY recog_7bit IS
  PORT(SIGNAL in0, in1, in2, in3, in4, in5, in6 : IN std_ulogic;
       SIGNAL out0 : OUT std_ulogic);
END recog_7bit;
ARCHITECTURE struc OF recog_7bit IS
  SIGNAL inter1, inter2 : std_ulogic;
BEGIN
  cell_0: recog_3bit
    PORT MAP (in0, in1, in2, inter1);
  cell_1: recog_3bit
    PORT MAP (in3, in4, in5, inter2);
  cell_2: recog_3bit
    PORT MAP (inter1, inter2, in6, out0);
END struc;
A.2.1 3-bit pattern recognizer
LIBRARY ieee; USE ieee.std_logic_1164.ALL;
ENTITY recog_3bit IS
  PORT(SIGNAL in0, in1, in2 : IN std_ulogic;
       SIGNAL out0 : OUT std_ulogic);
END recog_3bit;
ARCHITECTURE struc OF recog_3bit IS
BEGIN
  out0 <= ((in0 AND in1) OR (in0 AND in2) OR (in1 AND in2));
END struc;
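The two listings above compose three 3-bit recognizers into the 7-bit circuit. A minimal Python sketch of that composition is given below for illustration; it is not part of the original design flow, and the function names simply reuse the VHDL entity names. It assumes the 7-bit entity ports are in0 to in6, wired as in the A.2 port maps.

```python
def recog_3bit(in0, in1, in2):
    """3-bit recognizer: 1 when at least two of the three inputs are 1."""
    return (in0 & in1) | (in0 & in2) | (in1 & in2)

def recog_7bit(bits):
    """7-bit recognizer built from three 3-bit cells, as in A.2.
    bits is a sequence (in0, ..., in6) of 0/1 values."""
    in0, in1, in2, in3, in4, in5, in6 = bits
    inter1 = recog_3bit(in0, in1, in2)      # cell_0
    inter2 = recog_3bit(in3, in4, in5)      # cell_1
    return recog_3bit(inter1, inter2, in6)  # cell_2

if __name__ == "__main__":
    for pattern in [(0,) * 7, (1,) * 7, (1, 1, 0, 1, 1, 0, 0), (1, 1, 1, 1, 0, 0, 0)]:
        print(pattern, "->", recog_7bit(pattern))
```

Note that a tree of 3-input voters, as transcribed here, is not the same function as a flat 7-input majority: the pattern (1, 1, 1, 1, 0, 0, 0) contains four ones yet the model returns 0, because only one of the two first-level cells fires and in6 is 0.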
A.3 A behavioural model of a two-tone discriminator
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;
ENTITY bhv_tonne IS
PORT(SIGNAL freq_in : IN std_logic;
SIGNAL clock : IN std_logic;
SIGNAL decision : OUT std_logic);
END bhv_tonne;
ARCHITECTURE bhv OF bhv_tonne IS
BEGIN
MainBody: PROCESS(clock, freq_in)
VARIABLE pos_count : integer RANGE 0 TO 255 := 0;




IF rising_edge(clock)
THEN
pos_count := pos_count + 1;
END IF;










THEN
CASE pos_count IS
WHEN 4 =>
count_decision := '0';
WHEN 12 =>
count_decision := '1';
WHEN OTHERS =>
count_decision := count_decision;
END CASE;
pos_count := 0;
END IF;
decision <= count_decision;
END PROCESS MainBody;
END bhv;
A.4 Schematic of 2x2-bit Parallel Multiplier Evolved by Miller et al.
and Associated VHDL Code
Figure A.1: 2x2-bit parallel multiplier evolved by Miller et al.
-- Description of Miller 'Novel' 2x2 parallel multiplier
-- Taken from "Digital Circuit Evolution and Fitness
-- Landscapes", Proceedings of the 1999 Congress on
-- Evolutionary Computation, CEC 99
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;














PORT(SIGNAL Input : IN std_logic_vector(3 DOWNTO 0);
SIGNAL Output : OUT std_logic_vector(3 DOWNTO 0));
END Miller2x2mult;
ARCHITECTURE rtl OF Miller2x2mult IS
BEGIN
-- purpose: describes 2x2-bit multiplier at gate-level
multiplier: process (Input)
variable node1, node2, node3 : std_logic;
begin
node1 := (not Input(0)) or (not Input(3));
node2 := Input(1) and Input(2);
node3 := node1 and (not node2);
Output(0) <= node3;
Output(1) <= (not node3) and (Input(0) and Input(2));
Output(2) <= (not node1) xor node2;
Output(3) <= Input(1) and Input(3);
end process multiplier;
END rtl;
A.5 Schematic of 3x3-bit Parallel Multiplier Evolved by Miller et al.
and Associated VHDL Code
Figure A.2: 3x3-bit parallel multiplier evolved by Miller et al.
-- Description of Miller 'Novel' 3x3 parallel multiplier
-- Taken from "Towards the Automatic Design of More
-- Efficient Digital Circuits", In Proceedings of the
-- Second NASA/DOD Workshop on Evolvable Hardware, 2000
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;
ENTITY Miller3x3mult IS
PORT(SIGNAL Ain, Bin : IN std_logic_vector(2 DOWNTO 0);
SIGNAL Pout : OUT std_logic_vector(5 DOWNTO 0));
END Miller3x3mult;
ARCHITECTURE rtl OF Miller3x3mult IS
BEGIN
-- purpose: describes 3x3-bit multiplier at gate-level
multiplier: process (Ain, Bin)
variable node1, node2, node3, node4 : std_logic;
variable node5, node6, node7, node8, node9 : std_logic;
begin
node1 := Ain(0) and Bin(0);
Pout(0) <= node1;
Pout(1) <= (Ain(0) and Bin(1)) xor (Ain(1) and Bin(0));
node2 := Ain(0) and Bin(2);
node3 := Ain(1) and Bin(1);
node4 := node3 and (not node1);
node5 := (Ain(2) and Bin(0)) xor node4;
Pout(2) <= node2 xor node5;
node6 := Ain(2) and Bin(2);
node7 := (Ain(1) and Bin(2)) xor (Ain(2) and Bin(1));
node8 := ((node2 xor node4) and node5) xor node3;
Pout(3) <= node7 xor node8;
node9 := node1 and node8;
Pout(4) <= node9 xor ((not node4) and node6);
Pout(5) <= (node3 xor node9) and node6;




Further Details of FPGA and 
PLA-Based EHW Platforms 
B.1 Postscript Templates of FPGA Interconnect Topologies for Graph-
ical Representation 
B.1.1 Elements of Postscript That Are Common to FPGA Interconnect Tem-
plates 
.5 .5 scale 
/box_size 65 def
/x_size 8 def
/y_size 8 def 





box_size 0 rlineto
0 box_size 2 div rlineto
yshift y_size ne 
gsave 
1 0 0 setrgbcolor 




0 box_size 2 div rlineto




/Courrier findfont % Get the basic font
15 scalefont 	% Scale the font to 15 points 







box_size 0 rlineto 
0 box_size rlineto 







%DRAW HIGHLIGHTED BOX 
box_size 0 rlineto % right
0 box_size rlineto % up
box_size neg 0 rlineto % left
0 box_size neg rlineto % down
% SET COLOUR OF BACKGROUND BOX TO GREY 
gsave 
.75 .75 .75 setrgbcolor 
gsave 









%DRAW HIGHLIGHTED BOX 
box_size 0 rlineto 	% right 
0 box_size rlineto % up
box_size neg 0 rlineto % left




box_size 2 div box_size rmoveto % moveto start (top of box)
gsave
0 1 0 setrgbcolor % Set left Mux input green
0 6 rlineto % up
box_size neg 0 rlineto % left
0 box_size 4 div rlineto % up
grestore
box_size neg box_size 4 div 6 add rmoveto
box_size 3 div 0 rlineto % right (mux symbol bottom right)
box_size 3 div neg 20 rlineto % up-left
box_size 6 div neg 0 rlineto % left
gsave
0 box_size 4 div rlineto % up
box_size 0 rlineto % right
box_size 6 div 0 rlineto % right
0 6 rlineto % up
stroke
grestore
box_size 6 div neg 0 rlineto % left
box_size 3 div neg 20 neg rlineto % down-left (mux symbol bottom right)
box_size 3 div 0 rlineto % right
gsave
0 0 1 setrgbcolor % Set left Mux input blue
0 box_size 4 div neg rlineto % down
box_size neg 0 rlineto % left
0 box_size 2 mul neg rlineto % down
yshift 1 eq
xshift 1 eq
10.9 12.5 lineto % down










box_size 2 div 0 rmoveto
0 box_size 2 div neg rlineto % down
box_size 6 div neg 0 rlineto % left
box_size 3 div neg 20 neg rlineto % down-left (mux symbol bottom right)
box_size 3 div 0 rlineto % right
gsave
0 0 1 setrgbcolor % Set left Mux input blue
0 box_size neg rlineto % down
box_size 2 mul neg 0 rlineto % left
stroke
grestore
box_size 3 div 0 rlineto % right
gsave
0 1 0 setrgbcolor % Set right Mux input green
0 box_size 3 div neg rlineto % down
stroke
grestore
box_size 3 div 0 rlineto % right
box_size 3 div neg 20 rlineto % up-left




/Courrier findfont % Get the basic font
18 scalefont % Scale the font to 18 points
setfont % Make it the current font
newpath
moveto
box_size 3 div box_size 3 div add box_size 2 div box_size 3
div add 20 add neg rmoveto
15 neg 0 rmoveto
30 0 rlineto
gsave
0 1 0 setrgbcolor % Set VDD text colour to green






/Courrier findfont % Get the basic font
18 scalefont % Scale the font to 18 points
setfont % Make it the current font
moveto
gsave
1 0 0 setrgbcolor % Set pen colour red
0 box_size 2 div rmoveto % Move up half of box
box_size box_size 3 div add neg 0 rlineto % left
0 box_size 3 div neg rlineto % down
gsave
box_size 4 div neg 0 rmoveto % left
box_size 2 div 0 rlineto % right
stroke
grestore






/Courrier findfont % Get the basic font
18 scalefont % Scale the font to 18 points
setfont % Make it the current font
moveto
gsave
1 0 0 setrgbcolor % Set pen colour red
box_size box_size 2 div rmoveto % Move up half of box
box_size 2 div 0 rlineto % right
0 box_size 3 div neg rlineto % down
gsave
box_size 4 div neg 0 rmoveto % left
box_size 2 div 0 rlineto % right
stroke
grestore







box_size 3 div 0 rlineto % right
0 box_size 8 div rlineto % up
box_size 3 div neg 0 rlineto % left




box_size 8 div 0 rlineto % right
0 box_size 3 div rlineto % up
box_size 8 div neg 0 rlineto % left
0 box_size 3 div neg rlineto % down
} def
/addition_block {
rmoveto
% DRAW BOX
box_size 2 div 0 rlineto % right
0 box_size 2 div rlineto % up
box_size 2 div neg 0 rlineto % left
0 box_size 2 div neg rlineto % down
% SET COLOUR OF ADDITION BLOCK TO YELLOW
gsave






% DRAW ADDITION SYMBOL 
gsave 
box_size 4 div box_size 8 div rmoveto
0 box_size 4 div rlineto % up
box_size 8 div neg box_size 8 div neg rmoveto





/subtraction_block {
rmoveto
% DRAW BOX
box_size 2 div 0 rlineto % right
0 box_size 2 div rlineto % up
box_size 2 div neg 0 rlineto % left
0 box_size 2 div neg rlineto % down
% SET COLOUR OF SUBTRACTION BLOCK TO YELLOW
gsave






% DRAW SUBTRACT SYMBOL
gsave
box_size 8 div box_size 4 div rmoveto





/Shifter_block {
/shift_value exch def
% DRAW BOX
box_size 2 div 0 rlineto % right
0 box_size 2 div rlineto % up
box_size 2 div neg 0 rlineto % left
0 box_size 2 div neg rlineto % down
% SET COLOUR OF SUBTRACTION BLOCK TO YELLOW
gsave






% DRAW VALUE OF LEFT SHIFT
gsave
/Courrier findfont % Get the basic font
19 scalefont % Scale the font to 19 points
setfont % Make it the current font
5 12 rmoveto
(S) show
1 0 rmoveto





B.1.2 Postscript Template for Alternating Feed-Forward Array (AFFA) FPGA
Interconnect Topology
% START PROGRAM
1 1 x_size
/xshift exch def
1 1 y_size
/yshift exch def
% DRAW BOTTOM MUX AND BOTTOM WRAP CONNECT
xshift 1 eq
yshift grid_spacing mul xshift grid_spacing mul bottom_mux
yshift grid_spacing mul xshift grid_spacing mul VDD
} if
% DRAW THE BOX
yshift grid_spacing mul xshift grid_spacing mul box
yshift 1 eq
stroke
%gsave 
%1 1 1 setrgbcolor 







% DRAW LEFT MUX 
yshift 1 eq
{ %if
xshift x_size eq not
{ %if
yshift grid_spacing mul xshift grid_spacing mul left_mux
}if
% ALTERNATE GROUND CONNECTION
xshift 2 mod 0 ne
% Draw ground connection to PALU
yshift grid_spacing mul xshift grid_spacing mul ground
}if
{ %else
yshift grid_spacing mul xshift grid_spacing mul moveto
box_size 2 div box_size rmoveto
x_size xshift eq not
gsave
0 1 0 setrgbcolor





% DRAW FAR RIGHT GROUND 
yshift y_size eq 
xshift 2 mod 0 eq 
% Draw ground connection to PALU






B.1.3 Postscript Template for Continuous Feed-Forward Array (CFFA) FPGA
Interconnect Topology
% START PROGRAM
1 1 x_size
/xshift exch def
1 1 y_size
/yshift exch def
% DRAW BOTTOM MUX AND BOTTOM WRAP CONNECT
xshift 1 eq
yshift grid_spacing mul xshift grid_spacing mul bottom_mux
yshift grid_spacing mul xshift grid_spacing mul VDD
} if
% DRAW THE BOX
yshift grid_spacing mul xshift grid_spacing mul box
yshift 1 eq
stroke
%gsave 
%1 1 1 setrgbcolor 







% DRAW LEFT MUX 
yshift 1 eq 
{ %if 
xshift x_size eq not 
{ %if 
yshift grid_spacing mul xshift grid_spacing mul left_mux
}if
% Draw ground connection to PALU
yshift grid_spacing mul xshift grid_spacing mul ground
{ %else
yshift grid_spacing mul xshift grid_spacing mul moveto
box_size 2 div box_size rmoveto
x_size xshift eq not
gsave
0 1 0 setrgbcolor







B.1.4 Postscript Template for Continuous Feed-Forward Loop Array (CLFFA) 
FPGA Interconnect Topology 
/wrap { 
/Courrier findfont 	 % Get the basic font 
18 scalefont % Scale the font to 18 points
setfont % Make it the current font
newpath
moveto
xshift x_size eq
{ %if
box_size 2 div box_size 2 mul box_size 3 div add rmoveto
15 neg 0 rmoveto
30 0 rlineto
0 box_size 3 div neg rlineto % down
30 neg 0 rlineto
closepath
gsave








5 16 neg rmoveto
yshift (xxxx) cvs show
{ %else
box_size 3 div box_size 3 div add box_size 2 div box_size 3
div add 20 add neg rmoveto
15 neg 0 rmoveto
30 0 rlineto
yshift 1 eq
{ %if
gsave
0 1 0 setrgbcolor % Set VDD text colour to green




0 box_size 3 div neg rlineto % down
30 neg 0 rlineto 
closepath 
gsave 






5 16 neg rmoveto 





% START PROGRAM
1 1 x_size
/xshift exch def
1 1 y_size
/yshift exch def
% DRAW BOTTOM MUX AND BOTTOM WRAP CONNECT
xshift 1 eq
yshift grid_spacing mul xshift grid_spacing mul bottom_mux
yshift grid_spacing mul xshift grid_spacing mul wrap
}if
% DRAW THE BOX
yshift grid_spacing mul xshift grid_spacing mul box
yshift 1 eq
stroke
%gsave 
%1 1 1 setrgbcolor 







% DRAW LEFT MUX 
yshift 1 eq 
{ %if 
xshift x_size eq not 
{ %if 
yshift grid_spacing mul xshift grid_spacing mul left_mux
{ %else
yshift grid_spacing mul xshift grid_spacing mul moveto
box_size 2 div box_size rmoveto
gsave
0 1 0 setrgbcolor




% Draw ground connection to PALU 
yshift grid_spacing mul xshift grid_spacing mul ground
{ %else
yshift grid_spacing mul xshift grid_spacing mul moveto
box_size 2 div box_size rmoveto
y_size x_size mul yshift xshift mul eq not
gsave
0 1 0 setrgbcolor





% DRAW TOP WRAP CONNECTS 
xshift x_size eq 
yshift y_size eq not 





B.2 Postscript Templates of PLA Interconnect Topologies for Graph-
ical Representation
B.2.1 Elements of Postscript That Are Common to PLA Interconnect Templates
%%Orientation: Landscape
/box_size 65 def
/grid_spacing_vrt box_size 2 mul def
/grid_spacing_hrz box_size 3 mul def
yscale xscale scale
90 rotate
0 grid_spacing_vrt x_size 2 add mul neg translate
/colourA {
.7 .7 1 setrgbcolor
} def
/colourB {
1 .4 .4 setrgbcolor
} def
/colourC {
1 .8 0 setrgbcolor
} def
/colourD {
.2 .7 .8 setrgbcolor
} def
/colourE {
.5 .8 0 setrgbcolor
} def
/PALU {
/Courier findfont % Get the basic font
25 scalefont % Scale the font to 25 points
setfont % Make it the current font
newpath
moveto
box_size 0 rlineto % right
0 box_size 2 div rlineto % up
yshift y_size ne
gsave
box_size 2 div 0 rlineto % right
% DISPLAY OUTPUT NUMBER
box_size 3 div neg 5 rmoveto




}if
0 box_size 2 div rlineto % up
box_size neg 0 rlineto % left
0 box_size 4 div neg rlineto % down
gsave
box_size 2 div neg 0 rlineto % left
stroke
grestore
0 box_size 2 div neg rlineto % down
gsave
box_size 2 div neg 0 rlineto % left
stroke
grestore
0 box_size 4 div neg rlineto % down
closepath
gsave
% SET COLOUR OF ROUTING BLOCK FOR CONNECTION CLARITY 




yshift X2_colour eq
colourC
/X2_colour X2_colour 4 add store
}if
yshift X4_colour eq
colourD
/X4_colour X4_colour 6 add store
}if










/Courier findfont % Get the basic font
25 scalefont % Scale the font to 25 points
setfont % Make it the current font
newpath
moveto
box_size 0 rlineto % right
0 box_size 2 div rlineto % up
gsave
box_size 2 div 0 rlineto % right
% DISPLAY OUTPUT NUMBER
box_size 3 div neg 5 rmoveto
xshift 1 sub (xxxx) cvs show
stroke
grestore
0 box_size 2 div rlineto % up
box_size neg 0 rlineto % left
0 box_size 2 div neg rlineto % down
gsave
box_size 2 div neg 0 rlineto % left
stroke
grestore
0 box_size 2 div neg rlineto % down
closepath
gsave









/input_bus {
/Courier findfont % Get the basic font
20 scalefont % Scale the font to 20 points
setfont % Make it the current font
gsave
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul moveto
box_size 2 div neg box_size 2 div rmoveto
0 grid_spacing_vrt x_size 1 sub mul 2 div rlineto
gsave
100 neg 0 rlineto
% SHOW IMPULSE STRING 









box_size 0 rmoveto % move to bottom right of PALU at
box_size 2 div box_size 2 div neg rmoveto % 1/2 box_size distance
box_size 0 rlineto % right
0 box_size x_size mul rlineto % up
0 grid_spacing_vrt 2 div x_size mul rlineto % up
box_size neg 0 rlineto % left
0 box_size x_size mul neg rlineto % down





box_size 1.5 div 0 rlineto % right
0 box_size 2 div rlineto % up
box_size 1.5 div neg 0 rlineto % left




% DRAW BOX 
box_size 2 div 0 rlineto % right
0 box_size 2 div rlineto % up
box_size 2 div neg 0 rlineto % left
0 box_size 2 div neg rlineto % down
% SET COLOUR OF ADDITION BLOCK TO WHITE 
gsave 






% DRAW ADDITION SYMBOL 
gsave 
box_size 4 div box_size 8 div rmoveto
0 box_size 4 div rlineto % up
box_size 8 div neg box_size 8 div neg rmoveto







% DRAW BOX 
box_size 2 div 0 rlineto % right
0 box_size 2 div rlineto % up
box_size 2 div neg 0 rlineto % left
0 box_size 2 div neg rlineto % down
% SET COLOUR OF SUBTRACTION BLOCK TO WHITE 
gsave 







% DRAW SUBTRACT SYMBOL
gsave
box_size 8 div box_size 4 div rmoveto







% DRAW BOX 
box_size 2 div box_size 4 div add 0 rlineto % right
0 box_size 2 div rlineto % up
box_size 2 div box_size 4 div add neg 0 rlineto % left
0 box_size 2 div neg rlineto % down
% SET COLOUR OF SUBTRACTION BLOCK TO WHITE 
gsave 








box_size 4 div 0 rmoveto
box_size 2 mul box_size 2 div neg rmoveto
% SET TWO ROUTE LINES SO CONNECTIONS ARE MORE VISIBLE
yshift 2 mod 0 eq
0 box_size 2 div neg rlineto % down
grid_spacing_hrz 2 mul box_size 2 div sub 0 rlineto % right
0 box_size 2 div rlineto % up
0 box_size neg rlineto % down
grid_spacing_hrz 2 mul box_size 2 div sub 0 rlineto % right




/X4_fast_route {
box_size 4 div 0 rmoveto
box_size 2 mul box_size 2 div neg rmoveto
% SET TWO ROUTE LINES SO CONNECTIONS ARE MORE VISIBLE
yshift 4 mod 0 eq
0 box_size 1.5 mul neg rlineto % down
grid_spacing_hrz 4 mul box_size 2 div sub 0 rlineto % right
0 box_size 1.5 mul rlineto % up
0 box_size neg rlineto % down
grid_spacing_hrz 4 mul box_size 2 div sub 0 rlineto % right






%DRAW HIGHLIGHTED BOX 
box_size 0 rlineto % right
0 box_size rlineto % up
box_size neg 0 rlineto % left
0 box_size neg rlineto % down
closepath 
7 setlinewidth 





%DRAW HIGHLIGHTED BOX 
box_size 0 rlineto % right
0 box_size rlineto % up
box_size neg 0 rlineto % left
0 box_size neg rlineto % down
% SET COLOUR OF BACKGROUND BOX TO GREY
gsave








B.2.2 Postscript Template for Route 1 PLA Interconnect Topology 
% GENERATE ROUTE 1 PLA TEMPLATE
1 1 x_size
/X2_colour 3 def
/X4_colour 2 def
/xshift exch def
1 1 y_size
/yshift exch def
% DRAW SHIFTER
yshift 1 eq
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul Shifter
% DRAW INPUT BUS
xshift 1 eq
input_bus
}if
% DRAW PALU
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul PALU
}ifelse
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul moveto
% DRAW INTERCONNECT BOX 
xshift 1 eq 
yshift y_size ne 









B.2.3 Postscript Template for Route 2 PLA Interconnect Topology
% GENERATE ROUTE 2 PLA TEMPLATE
1 1 x_size
/X2_colour 1 def
/xshift exch def
1 1 y_size
/yshift exch def
% DRAW SHIFTER
yshift 1 eq
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul Shifter
% DRAW INPUT BUS
xshift 1 eq
input_bus
}if
% DRAW PALU
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul PALU
}ifelse
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul moveto
% DRAW INTERCONNECT BOX
xshift 1 eq
yshift y_size ne





% DRAW 2X FAST INTERCONNECT 
yshift y_size 2 sub lt
gsave 








B.2.4 Postscript Template for Route 3 PLA Interconnect Topology
% GENERATE ROUTE 3 PLA TEMPLATE
1 1 x_size
/X2_colour 3 def
/X4_colour 2 def
/xshift exch def
1 1 y_size
/yshift exch def
% DRAW SHIFTER
yshift 1 eq
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul Shifter
% DRAW INPUT BUS
xshift 1 eq
input_bus
}if
% DRAW PALU
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul PALU
}ifelse
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul moveto
% DRAW INTERCONNECT BOX
xshift 1 eq
yshift y_size ne




% DRAW 2X FAST INTERCONNECT ON ODD 'YSHIFTS' ONLY 
yshift 2 mod 1 eq 






% DRAW 4X FAST INTERCONNECT ON EVEN 'YSHIFTS' ONLY 
yshift 2 mod 1 ne 











B.2.5 Postscript Template for Route 4 PLA Interconnect Topology
% GENERATE ROUTE 4 PLA TEMPLATE
1 1 x_size
/X2_colour 3 def
/X4_colour 2 def
/xshift exch def
1 1 y_size
/yshift exch def
% DRAW SHIFTER
yshift 1 eq
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul Shifter
% DRAW INPUT BUS
xshift 1 eq
input_bus
}if
% DRAW PALU
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul PALU
}ifelse
yshift grid_spacing_hrz mul xshift grid_spacing_vrt mul moveto
% DRAW INTERCONNECT BOX 
xshift 1 eq 
yshift y_size ne 




% DRAW 2X FAST INTERCONNECT ON ODD 'YSHIFTS' ONLY 
yshift 2 mod 1 eq 






% DRAW 4X FAST INTERCONNECT ON EVEN 'YSHIFTS' ONLY 
yshift 2 mod 1 ne 













Synthesis and Simulation Script for 
Generation of 6x5 PLA Core 
C.1 Top-Down Synthesis script for 6x5 PLA Core
LOG_FILE = "/tmp/EHW_platform/synopsis/log_files/TD_100MHz_6X5_PLA_C2_limited_V3.log"
MASU_FILE = "TD_100MHz_6X5_PLA_C2"
NETDIR = "/tmp/EHW_platform/synopsis/reports/"
PLOTS = "/tmp/EHW_platform/synopsis/plots/"
/* Parameters for Clock, Reset, In- and Outputs: */
CLKNAME = "clock"
RESNAME = "GlobalReset"
CLKPERIOD = 10.0
CLKHP = CLKPERIOD / 2
CLKSKEW = 1
CLKTRANS = 0.5
CLKDELAY = 1
INPDELAY = 1
OUTPDELAY = 1
/* Parameters for PLA Elaboration: */
BUSWIDTH = 16 /* I/O Bus width of EHW environment */
XWIDTH = 6 /* Number of PALUs in X Axis */
YWIDTH = 5 /* Number of PALUs in Y Axis */
CONNECT_CNTRL = 3 /* Bit width of control for Interconnect_Mux */
PALU_CNTRL = 5 /* Bit width of control for each PALU */
CIRCUITOUTPUTS = 5 /* Number of Circuit Outputs required */
MAX_CONNECT = 3 /* Number of PALUs connected to Interconnect_Mux */
CNTRL_OFFSET = 20 /* Number of bits required to control MaxShifters */
MASU = "PLA_C2_limited_V3"
VER = "limited_V3"
MASU_VER = MASU_FILE + VER
analyze -f vhdl -lib WORK VHDSRC + pack_local.vhd
analyze -f vhdl -lib WORK VHDSRC + MUX_2_FFA_cells.vhd
analyze -f vhdl -lib WORK VHDSRC + addsub_cla.vhd
analyze -f vhdl -lib WORK VHDSRC + NbitMux_2in.vhd
analyze -f vhdl -lib WORK VHDSRC + Npos_leftshift.vhd
analyze -f vhdl -lib WORK VHDSRC + Maxshift_limited16.vhd
analyze -f vhdl -lib WORK VHDSRC + POF_ALU.vhd
analyze -f vhdl -lib WORK VHDSRC + Nbit_shiftEnable.vhd
analyze -f vhdl -lib WORK VHDSRC + Nbit_ShiftReg.vhd
analyze -f vhdl -lib WORK VHDSRC + Interconnect_Mux_limited.vhd
analyze -f vhdl -lib WORK VHDSRC + Interconnect_Mux2_limited.vhd
analyze -f vhdl -lib WORK VHDSRC + Nbit_ser_to_par.vhd
analyze -f vhdl -lib WORK VHDSRC + PLA_C2_limited_V3.vhd
include SCRIPT + "PLA_parameters" + VER + ".scr"
sh date
elaborate MASU -param "BusWidth=" + BUSWIDTH + ",Xwidth=" + XWIDTH + \
",Ywidth=" + YWIDTH + ",Connect_Cntrl=" + CONNECT_CNTRL + ",PALU_Cntrl=" + \
PALU_CNTRL + ",CircuitOutputs=" + CIRCUITOUTPUTS + ",Max_Connect=" + \
MAX_CONNECT > NETDIR + MASU_VER + "elaboration.rpt"
current_design = MASU
set_operating_conditions -min BCIND -max WCIND -lib MTC45000.db:MTC45000
set_wire_load_model -name 36000to42000 -lib \
MTC45000_WL_WORST.db:MTC45000_WL_WORST -max
set_wire_load_model -name 36000to42000 -lib \
MTC45000_WL_TYP.db:MTC45000_WL_TYP -min
create_clock CLKNAME -period CLKPERIOD -waveform {0 CLKHP}
set_clock_uncertainty CLKSKEW CLKNAME
set_clock_transition CLKTRANS CLKNAME
set_dont_touch_network {CLKNAME RESNAME}
set_drive 0 {CLKNAME RESNAME}
set_input_delay INPDELAY -add_delay -clock CLKNAME all_inputs() >> LOG_FILE
set_output_delay OUTPDELAY -add_delay -clock CLKNAME all_outputs() >> LOG_FILE
set_fix_hold CLKNAME
uniquify
current_design MASU
compile -map_effort medium >> LOG_FILE
change_names -h -rules NET >> LOG_FILE
report_constraint -all_violators > NETDIR + MASU_VER + "_violations.rpt"
report_area >> NETDIR + MASU_VER + ".rpt"
report_timing -delay max >> NETDIR + MASU_VER + ".rpt"
write -f vhdl -h -output NETDIR + MASU_VER + ".vhd"
write -f verilog -h -output NETDIR + MASU_VER + ".
write -f db -h -output NETDIR + MASU_VER + ".db"
remove_design -all >> LOG_FILE
exit
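The elaboration parameters in the script above fix the size of the serial configuration string that the C.2 testbench loads into the core. The Python sketch below reproduces that arithmetic for illustration only. It uses the string_length and Connect_Cntrl expressions from the C.2 testbench, and it assumes that the log_2 function of the local package rounds down (the package is not listed here) and that CLKPERIOD is expressed in nanoseconds.

```python
def floor_log2(n):
    """Integer floor(log2(n)) for n >= 1.
    Assumption: the thesis' log_2 helper rounds down."""
    return n.bit_length() - 1

def connect_cntrl_width(max_connect):
    """Bit width of the routing mux control: log_2((2*Max_Connect)-1)+1."""
    return floor_log2(2 * max_connect - 1) + 1

def config_string_length(xwidth, ywidth, palu_cntrl, connect_cntrl, cntrl_offset):
    """Length in bits of the serial configuration string:
    ((Xwidth-1) * Ywidth * (PALU_Cntrl + 2*Connect_Cntrl)) + Cntrl_Offset."""
    return (xwidth - 1) * ywidth * (palu_cntrl + 2 * connect_cntrl) + cntrl_offset

def clock_mhz(period_ns):
    """Clock frequency in MHz for a period given in ns (unit is an assumption)."""
    return 1000.0 / period_ns

if __name__ == "__main__":
    cc = connect_cntrl_width(3)                      # MAX_CONNECT = 3
    bits = config_string_length(6, 5, 5, cc, 20)     # XWIDTH, YWIDTH, PALU_CNTRL, CNTRL_OFFSET
    print("Connect_Cntrl width:", cc)
    print("configuration string length:", bits)
    print("clock frequency (MHz):", clock_mhz(10.0))
```

With the values in the script this gives a control width of 3 bits, matching CONNECT_CNTRL, a configuration string of 295 bits, and a 100 MHz clock, consistent with the TD_100MHz file name.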
C.2 VHDL Leapfrog Testbench for Netlist Simulation 6x5 PLA 
Core 
-- PLA architecture for development of FIR Filter algorithms 
LIBRARY ieee; 
library mtc_lib; 
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;
USE WORK.local.ALL;
use mtc_lib.MTC45000_Vcomponents.all;
use work.CONV_PACK_PLA_C2_limited_V3.all;
ENTITY PLA_C2_limited_V3_testbench IS
END PLA_C2_limited_V3_testbench;
ARCHITECTURE functional OF PLA_C2_limited_V3_testbench IS
-- Constants are user defined and determine dimensions of PLA architecture
CONSTANT BusWidth : integer := 16; -- I/O Bus width of EHW environment
CONSTANT Xwidth : integer := 6; -- Number of EHW CLBs in X Axis
CONSTANT Ywidth : integer := 5; -- Number of EHW CLBs in Y Axis
CONSTANT CircuitOutputs : integer := 5; -- Number of circuit outputs required
CONSTANT PALU_Cntrl : integer := 5; -- Bit width of control for
-- each Programmable ALU
CONSTANT Max_Connect : integer := 3; -- Number of PALUs
-- connected to Routing logic
constant Connect_Cntrl : integer := log_2((2*Max_Connect)-1)+1; -- Bit width of
-- control for
-- routing MUXs





constant string_length : integer := (((Xwidth-1) * Ywidth *
(PALU_Cntrl+(2*Connect_Cntrl))) +
Cntrl_Offset);
constant initial_delay : integer := 10; -- Number of clock cycles before
-- loading configuration data
type Output_pins is array (1 to CircuitOutputs) of std_logic_vector(BusWidth-1 downto 0);
-- Global Inputs
SIGNAL clock : std_logic;
SIGNAL GlobalReset : std_logic;
-- Inputs to PLA Architecture
SIGNAL PLA_SignalInput : type1d_0;
SIGNAL PLA_data_stream : std_logic;
SIGNAL Data_enable : std_logic := '0';
SIGNAL Load_PLA : std_logic := '0';
-- Outputs from PLA Architecture
SIGNAL PLA_Output_Bus : type1d_1;
SIGNAL Output_Port : Output_pins;
-- Inputs to Nbit_par_to_ser (temporary memory unit)
SIGNAL load_memory : std_logic := '1';






port(EnvironmentInput : in type1d_0;
ChromosomeString, clock, GlobalReset, load_PLA, Enable_data : in std_logic;
PLA_Output_Bus : out type1d_1);
end component;
COMPONENT Nbit_par_to_ser
GENERIC(Xwidth : integer;
Ywidth : integer;
Cntrl_offset : integer;
PALU_Cntrl : integer;
Connect_Cntrl : integer);
PORT(SIGNAL clock : IN std_logic;
SIGNAL load_enable : IN std_logic;
SIGNAL Par_input : IN std_logic_vector(string_length - 1 DOWNTO 0);
SIGNAL Ser_output : OUT std_logic);
end COMPONENT;
BEGIN 
-- Create individual output buses from PLA_Output_Bus 
OutputBus: FOR i in 1 to CircuitOutputs GENERATE
Output_Port(i) <= (PLA_Output_Bus((i*BusWidth)-1) &
PLA_Output_Bus((i*BusWidth)-2) &
PLA_Output_Bus((i*BusWidth)-3) &
PLA_Output_Bus((i*BusWidth)-4) &
PLA_Output_Bus((i*BusWidth)-5) &
PLA_Output_Bus((i*BusWidth)-6) &
PLA_Output_Bus((i*BusWidth)-7) &
PLA_Output_Bus((i*BusWidth)-8) &
PLA_Output_Bus((i*BusWidth)-9) &
PLA_Output_Bus((i*BusWidth)-10) &
PLA_Output_Bus((i*BusWidth)-11) &
PLA_Output_Bus((i*BusWidth)-12) &
PLA_Output_Bus((i*BusWidth)-13) &
PLA_Output_Bus((i*BusWidth)-14) &
PLA_Output_Bus((i*BusWidth)-15) &
PLA_Output_Bus((i*BusWidth)-16));
end generate OutputBus;
-- Instantiate PLA_C2_limited_V3
PLA_Unit: PLA_C2_limited_V3
PORT MAP(PLA_SignalInput,






-- Instantiate temporary memory unit Nbit_par_to_ser 
REM_register: Nbit_par_to_ser 









-- Counter for loading of PLA configuration string
Counter: process(clock, GlobalReset) 
type input_strings is array (1 to 10) of typeld_0; 
variable count : integer := 0;
variable ip_count : integer := 0;
variable input_data : input_strings := ( 1 => "0000000000000001", -- 1
2 => "0000000000011001", -- 25
3 => "0000000000110001", -- 49
4 => "0000000110000001", -- 385
5 => "0000001000101001", -- 553
6 => "0000000100001111", -- 271
7 => "0000000001010101", -- 85
8 => "0000000000000001", -- 1
9 => "0000000111000001", -- 449
10 => "0000000001101011"); -- 107
begin 
if GlobalReset = '1' then 
Load_PLA <= '0';
load_memory <= '1';
Data_enable <= '0';
count := 0;
ip_count := 1;
elsif clock'EVENT and clock = '1' then
count := count + 1;
if count < (string_length + initial_delay) then
if count = initial_delay then
Data_enable <=
load_memory <=
PLA_SignalInput <= "0000000000000001"; -- after 0 ns;
end if; 
else 
Load_PLA <= '0'; 
if (count = ((ip_count * initial_delay) + string_length)) then
ip_count := ip_count + 1; 
if(ip_count < 12) then 





end process counter; 
-- Set Input Vectors for testbench analyses of PLA Architecture
GenerateClock: PROCESS 
BEGIN 
clock <= '1' AFTER 0 ns;
FOR i IN 0 TO 500 LOOP 
clock <= '0' AFTER 100 ns, '1' AFTER 200 ns; 
WAIT FOR 200 ns; 
END LOOP; 
WAIT; 




AFTER 0 ns,
AFTER 247 ns; 
WAIT; 





D.1 Refereed Journals
B. I. Hounsell, T. Arslan, A programmable multiplierless digital filter array for embedded
SoC applications, in IEE Electronics Letters, Vol. 37(12), pp 735-737, June
2001.
B. I. Hounsell, T. Arslan, An embedded programmable logic array for online adaptation
of multiplierless FIR filters, Submitted to IEEE Transactions on Very Large Scale
Integration (VLSI) Systems.
D.2 Refereed Conferences
B. I. Hounsell, T. Arslan, An embedded programmable core for the implementation
of high performance digital filters, Proceedings of 14th Annual IEEE International
ASIC/SoC Conference, Sept. 12-14, 2001, Washington, USA.
B. I. Hounsell, T. Arslan, A novel genetic algorithm for the automated design of performance
driven digital circuits, Proceedings of IEEE Congress on Evolutionary Computation
(CEC), Vol. 1, pp 601-608, July 16-19, 2000, La Jolla, USA.
B. I. Hounsell, T. Arslan, A novel evolvable hardware framework for the evolution of
high performance digital circuits, Proceedings of GECCO 2000, Vol. 1, pp 525-532,
July 8-12, 2000, Las Vegas, USA.
D.3 Refereed Workshops
B. I. Hounsell, T. Arslan, Evolutionary design and adaptation of digital filters within
an embedded fault tolerant hardware platform, Proceedings of 3rd NASA/DoD IEEE
Workshop on Evolvable Hardware, Vol. 1, pp 127-135, July 12-14, 2001, Los Angeles,
USA.
