Analogue VLSI neural networks for phoneme recognition. by Gatt, Edward.
ProQuest Number: 10130511
Published by ProQuest LLC (2017). Copyright of the Dissertation is held by the Author.
Analogue VLSI Neural Networks for 
Phoneme Recognition
Edward Gatt
Submitted for the Degree of 
Doctor of Philosophy 
from the 
University of Surrey
Centre for Vision, Speech and Signal Processing 
School of Electronics and Physical Sciences
University of Surrey 
Guildford, Surrey GU2 7XH, UK
October 2004
© E. Gatt 2004
Summary
This thesis presents the implementation of three VLSI neural network systems: a
chip for implementing Self-Organising Maps, a Radial Basis Function chip and a
Back-Propagation Learning chip.
The first chip was implemented using mixed mode technology, while the other
two chips used analogue technology. The chips have been designed and applied
successfully to the task of phoneme recognition.
Cascadability was the most important feature included in the design of the chips, as
the main intention was to allow as much flexibility as possible in order to test the
functionality of the different topologies and architectures. In addition, components
were re-used whenever possible to reduce the risk of design errors, the design time
and the silicon area.
The main goal of the study was to design a VLSI neural network that exploits
the strong points of different algorithms. The desirable characteristics were low
training times, as obtained with radial basis function network learning, the high
recognition rates normally associated with the back-propagation algorithm, and the
benefits of competitive learning. To achieve this, the three chips were combined to
implement a time-delay radial basis function neural network. The recognition rates
obtained compare well with those obtained for the time-delay neural network on the
same phoneme set, while the training efficiency is very close to that of the radial
basis function network.
The chips that were fabricated consume very little power, with synapses and
neurons requiring only tens and hundreds of μW to operate. Most of the individual
building blocks can be operated with supply rails of +1V and -1V, but in order to allow
the implementation of complex neural network architectures, supply rails of +2V and
-2V were adopted when network topologies were created.
Acknowledgements
I would like to express my gratitude to my supervisor Dr. Edward Chilton,
and co-supervisor Prof. Joseph Micallef for their constant guidance and support
throughout the course of this research work.
I would also like to thank the University of Malta for offering me the
opportunity and funding necessary to carry out this research work.
Finally, I would also like to thank my wife for the constant support and
encouragement she has shown during the course of my studies. I would like to dedicate
this thesis to my child Gillian-Anne, who was born on the 13th October 2002.
Publications
Parts of this thesis have been published as follows:
1. E. Gatt, J. Micallef, and E. Chilton, “An Analog VLSI Time-Delay Neural
Network Implementation for Phoneme Recognition”, Proceedings of the 6th IEEE
International Workshop on Cellular Neural Networks and their Applications,
pp. 315-320, 2000.
2. E. Gatt, J. Micallef, P. Micallef, and E. Chilton, “Phoneme Classification in
Hardware Implemented Neural Networks”, Proceedings of the 8th IEEE
International Conference on Electronics, Circuits and Systems, pp. 481-484,
2001.
3. E. Gatt, J. Micallef, and E. Chilton, “Hardware Radial Basis Functions Neural
Networks for Phoneme Recognition”, Proceedings of the 8th IEEE International
Conference on Electronics, Circuits and Systems, pp. 627-630, 2001.
4. E. Gatt, J. Micallef, and E. Chilton, “Analogue radial basis function networks for
phoneme recognition”, Proceedings of the 9th IEEE International Conference on
Electronics, Circuits and Systems, pp. 583-586, 2002.
Contents
Chapter 1 - Introduction
1.0 Neural Network Architectures
1.1 Analogue versus Digital Technology
1.2 Low Voltage Low Power Applications
1.3 Neural Networks for Speech Recognition
1.4 Scope and Outline of Thesis
Chapter 2 - Speech Processing
2.0 Introduction
2.1 Speech Production
2.2 Speech Perception
2.3 Signal Processing, Analysis Methods for Speech Recognition - Feature Extraction
2.4 Spectral Analysis Models
3.10.1 The LPC Processor for Speech
3.10.2 Mel Cepstral Coefficients
2.5 Vector Quantisation
2.6 Summary
Chapter 3 - Pattern Recognition
3.0 Introduction
3.1 Pattern Matching
3.2 Hidden Markov Models
3.3 Artificial Neural Networks
3.3.1.1 Learning Process
3.3.1.2 Supervised Learning
3.3.1.3 Reinforcement Learning
3.3.1.4 Unsupervised Learning
3.3.1.5 Supervised versus Unsupervised Learning
3.4 Back-Propagation Learning
3.5 Time-Delay Neural Networks (TDNNs)
3.6 Self-Organising Networks
3.7 Radial Basis Function Networks
3.8 Biological Neurons and their Artificial Models
3.9 Automatic Speech Recognition with Neural Networks
3.10 Hardware Implementations of Neural Networks
3.10.1 Pulse Coded Implementations
3.10.2 Digital Implementations
3.10.3 Analogue Implementations
3.10.4 Weight Storage
3.10.5 Multipliers
3.10.6 Activation Functions
3.11 Summary
Chapter 4 - Software Simulation of Neural Networks for Phoneme Recognition
4.0 Introduction
4.1 Speech Recognition
4.2 Simulation Results for the Different Coding Techniques
4.3 Simulation Results for the Different Neural Network Algorithms
4.4 Results for a Time-Delay Radial Basis Function Neural Network
4.5 Results for a Time-Delay Neural Network (TDNN) Implementation Back-Propagation Learning
4.6 Results for Different Speech Systems
4.7 Conclusions
4.8 Summary
Chapter 5 - VLSI Implementation of Self-Organising Maps
5.0 Introduction
5.1 Mapping the Algorithm on VLSI
5.2 The Hamming Network
5.2.1 The Winner-Take-All Circuit
5.3 The Learning Algorithm
5.4 Winner-Take-All Circuit Simulation Results
5.4.1 Large-Scale Neural Network Environment
5.5 Chip Architecture
5.6 Experimental Results
5.7 Testing Hardware Self-Organising Maps for Phoneme Recognition
5.8 Conclusions
5.9 Summary
Chapter 6 - VLSI Implementation of Radial Basis Functions
6.0 Introduction
6.1 Hardware Considerations
6.2 Synapse Circuit Modification and Optimisation
6.3 Programmability
6.4 Simulation Results
6.5 Circuit Evaluation
6.6 Summary
Chapter 7 - Hardware Implementation of An Analogue Neural Network with On-Chip Back-Propagation Learning
7.0 Introduction
7.1 Neural Network Chip Architecture
7.2 Electronic Circuits
7.3 Weight Storage
7.4 Operation Amplifier Design
7.5 Chip Measurements
7.5.1 Synapse Operation
7.5.2 Neuron Operation
7.5.3 Error-Signal Generator Operation
7.6 Circuit Evaluation
7.7 Summary
Chapter 8 - System Testing
8.0 Introduction
8.1 Test Setup
8.2 Results for Different Speech Systems
8.3 Time-Delay Radial Basis Function (TD-RBF)
8.4 Use of Second Priority Phoneme to Improve the Recognition Rate
8.5 Summary
Chapter 9 - Conclusions And Further Work
9.0 VLSI Implementations
9.1 Further Work
9.1.1 Hidden Markov Model/Neural Network Systems
9.1.2 Circuit Level Implementation
Bibliography
Appendices
Appendix A Self-Organising Map Chip Layout
Appendix B Radial Basis Function Chip Layout
    Chip Layout Detail
Appendix C Time-Delay Neural Network Chip Layout
    Amplifier Layout
    Multiplier Layout
    Derivative Generator
    Error Signal Generator Unit
    Neuron Unit Layout
    Synapse Unit Layout
Appendix D PCB Top View
    PCB Bottom View
List of Figures
Figure 2-1 Human Vocal Tract
Figure 2-2 The source-filter model of speech production
Figure 2-3 Diagram of the Ear
Figure 2-4 Bank-of-Filters Analysis Model
Figure 2-5 Block diagram of the LPC Processor for Speech Recognition
Figure 2-6 Mel Cepstral Coefficients Processor
Figure 3-1 Complete Hidden Markov Model of a simple grammar
Figure 3-2 Multilayer Perceptron
Figure 3-3 Block Diagram of Supervised Learning
Figure 3-4 Block Diagram of the Adaptive Heuristic Critic - Reinforcement Learning
Figure 3-5 A time-delay neural network computational element
Figure 3-6 Information Flow in the Nervous System
Figure 3-7 Neuron Schematic Diagram
Figure 3-8 Synapse Diagram
Figure 3-9 McCulloch-Pitts Model Neuron
Figure 3-10 Floating Gate Technology
Figure 4-1 Speech Recognition Model
Figure 4-2 Recognition Rates for a 12-150-48 MLP and SOM setup
Figure 4-3 Recognition Rates for different vector sizes
Figure 4-4 Recognition Rates for a different number of Hidden Nodes
Figure 4-5 Recognition Rates and Training Efficiency for the different ANN Architectures
Figure 4-6 Adopting Multiple Recognisers
Figure 4-7 Speaker Dependent Phoneme Recognition Rates
Figure 4-8 Speaker Independent Phoneme Recognition Rates, Speakers of the Same Sex and Dialect
Figure 4-9 Speaker Independent Phoneme Recognition Rates, Speakers of Opposite Sex from the same Dialect Region
Figure 4-10 Speaker Independent Phoneme Recognition Rates, Speakers of Opposite Sex from Different Dialect Regions
Figure 5-1 Hamming Network
Figure 5-2 Synapse Cell in the Hamming Unit
Figure 5-3 Winner-take-all Circuit
Figure 5-4 Dynamic Steering Winner-Take-All Circuit
Figure 5-5 SPICE Simulated Results for 2-input WTA (Vin(2) = 1V)
Figure 5-6 Transient behaviour of the winning output with a capacitive load of 0.2 pF
Figure 5-7 Results for 1000 input WTA
Figure 5-8 Simulation Results for 1000 input WTA - Comparing Single and Cascade Mode - DC Level of the Winning Output
Figure 5-9 Results for 1000 input WTA - Comparing Single and Cascade Mode - Response time of the winning output - Capacitive Load of 1.0 pF
Figure 5-10 Comparison of the Fixed and Distributed Bias Currents for a Different Number of Competitive Cells
Figure 5-11 (a) Output Voltages for Single-Stage WTA (b) Output Voltages for Cascaded WTA (c) Output Currents for Single-Stage without Dynamic Steering (d) Output Current for Single-Stage with Dynamic Steering
Figure 5-12 Self-Organising Maps Chip Building Block
Figure 5-13 Measurement results of the dc transfer voltage characteristics for 200-input hardware WTA
Figure 5-14 The output waveform on cell 20 when a 0.1 V peak-to-peak voltage, centred at 2.50V, is applied as input to the cell
Figure 5-15 The output levels and the response time against the number of cells with the second largest input voltage
Figure 5-16 The effects of the process-induced variations of the competition threshold across several chips
Figure 6-1 Basic Gaussian synapse cell with single-ended input/weight values
Figure 6-2 Simulation Results for the Synapse Characteristics - Maximum magnitude being 8 μA, mean 0, standard deviation 0.302
Figure 6-3 Modified Gaussian synapse Cell with differential input/weight values
Figure 6-4 Simulation Results for the Synapse Characteristics with Differential Inputs - Maximum Magnitude 8 μA, Mean 0 and Standard Deviation 0.55
Figure 6-5 Simulations for the Programmability of the Modified Gaussian Cell (a) Reference currents set at 16 μA, 8 μA and 4 μA (b) Weight Value Changed at -0.5V, 0V and 0.5V, reference current set at 8 μA (c) W/L ratios changed to 1, 2, 4 with the setting at 1 producing the largest standard deviation while the setting at 4 gave the lowest standard deviation
Figure 6-6 A Network with Four Gaussian Synapse Cells
Figure 6-7 Speed Response for Gaussian Network of Figure 6-6
Figure 6-8 Simulation results for dc characteristics of the four-input neuron. Synapse 1 has a mean of -1.6, standard deviation of 0.55, maximum current of 8 μA, synapse 2 has a mean of 0.2, standard deviation of 0.38, maximum current of 8 μA, synapse 3 has a mean of 1.3, standard deviation of 0.38, maximum current of 8 μA, synapse 4 has a mean of 2.5, standard deviation of 0.55, maximum current of 8 μA
Figure 7-1 Neuron Unit Schematic
Figure 7-2 Synapse Unit Schematic
Figure 7-3 Error-signal Generator Unit Schematic
Figure 7-4 Output Unit
Figure 7-5 Chip Layout
Figure 7-6 Amplifier Schematic
Figure 7-7 Amplifier dc sweep results
Figure 7-8 Multiplier Schematic
Figure 7-9 Multiplier dc sweep simulation results
Figure 7-10 Bias Generator Schematic
Figure 7-11 Derivative Generator Schematic and dc sweep results
Figure 7-12 Derivative Generator Schematic dc sweep results
Figure 7-13 Weight Processing Unit Schematic
Figure 7-14 Simulation Waveforms for Weight Processing Unit
Figure 7-15 Simulation Results for Weight Initialisation
Figure 7-16 Weight Decay Plot
Figure 7-17 Operational Amplifier Schematic
Figure 7-18 DC Response for the Operational Amplifier
Figure 7-19 AC Response for the Operational Amplifier
Figure 7-20 Forward mode synapse transfer characteristics
Figure 7-21 Reverse mode synapse transfer characteristics
Figure 7-22 Percentage errors between an ideal sigmoid curve and one neuron output for various input voltages
Figure 7-23 Derivative Generator Error Transfer Function Discrepancies
Figure 8-1 Test setup for the system testing
Figure 8-2 Phoneme Recognition using Self-Organising Maps Implemented with 3 Self-Organising Map Chips
Figure 8-3 Phoneme Recognition using 3 Layer Multi-Layer Perceptrons
Figure 8-4 Phoneme Recognition using a Radial Basis Function Neural Network
Figure 8-5 Speaker Dependent Phoneme Recognition Rates
Figure 8-6 Speaker Independent Phoneme Recognition Rates, Speakers of the Same Sex and Dialect
Figure 8-7 Speaker Independent Phoneme Recognition Rates, Speakers of Opposite Sex from the same Dialect Region
Figure 8-8 Speaker Independent Phoneme Recognition Rates, Speakers of Opposite Sex from Different Dialect Regions
Figure 8-9 Comparing the TD-RBF with other learning algorithms
Acronyms
ADC Analogue to digital converter
ANN Artificial Neural Network
BiCMOS Bipolar-Complementary metal oxide semiconductor
CMOS Complementary metal oxide semiconductor
CPU Central Processing Unit
DAC Digital-to-Analogue Converter
DCT Discrete Cosine Transform
DSP Digital Signal Processor
EEPROM Electrically Erasable Programmable Read Only Memory
FSO Frequency-Sensitive Self-Organisation
HMM Hidden Markov Model
LD Log domain
LPC Linear Predictive Coding
MLP Multi-layer Perceptron
MOS Metal oxide semiconductor
NN Neural Network
PARCOR Partial correlation coefficients or reflection coefficients
RAM Random Access Memory
RBF Radial basis function
RISC Reduced Instruction Set Computer
S/N Signal to noise
SOM Self-Organising Maps
SRAM Static Random Access Memory
TDNN Time-Delay Neural Network
TD-RBF Time-Delay Radial Basis Function
VQ Vector Quantisation
VLSI Very large scale integration
Vt Threshold voltage
WTA Winner take all
XNOR Exclusive NOR Gate
Chapter 1. Introduction
Chapter 1
INTRODUCTION
1.0 Neural Network Architectures
Neural networks are a very promising computational technology due to their
capability to model and solve complex problems that are hardly approachable with
traditional numerical methods [1]. The main attraction of neural network
paradigms is their learning capability, by which they can solve problems whose
formalisation is not known or whose solution is available only for some instances,
which are used as training examples [2]. This has made it possible for neural networks
to be widely adopted in a variety of industrial, scientific and commercial applications,
which range from pattern recognition to optimization and resource scheduling [2].
When embedded in microelectronic hardware, neural networks exhibit a high
degree of fault tolerance to system damage and a high data throughput rate due to
parallel data processing [3]. Neural network chips are currently being developed
in order to provide low cost modules, which can be adopted into both existing
and newly developed systems. These modules will improve system performance in
pattern recognition, noise filtering, cluster detection, process control and adaptive
control [4]. There are many different types of neural network paradigms, each of which
has different strengths particular to its application. The abilities of different networks
can be related to their structure, dynamics and learning methods. By far the most popular
neural network methods include multi-layer perceptrons, self-organising maps, adaptive
resonance theory and cellular neural networks [5].
The fast growing field of neural network research attempts to learn from nature’s
success and to mimic some of nature’s tricks in order to accomplish intelligent
information processing tasks that are not easily performed by conventional methods [6].
Neural networks are composed of massively parallel architectures that process large
quantities of information in analogue values and solve a variety of ill-defined and/or
computation-intensive signal processing tasks [2].
Many efficient network-based computing architectures and efficient algorithms
have emerged over the past decade, following the revival of interest in the
mid-1980s. These neural network paradigms are particularly suitable for ill-defined
mapping or pattern recognition tasks from noisy data, control of multi-parameter processes
and prediction of complex phenomena. Development of advanced microelectronic and/or
photonic circuits and systems for massively parallel processing paradigms has been the
main focus of active research by many scientists and engineers [7].
The usefulness of a particular design technology for implementing artificial neural
networks depends on the extent to which it can meet the functional, connectivity and
precision requirements of the specific network type and learning rule. Researchers have
been actively pursuing the construction of compact hardware suitable for large-scale,
parallel implementations.
1.1 Analogue versus Digital Technology
Neural network VLSI chips are classified according to whether their circuit
architecture is analogue or digital. Digital neural VLSI chips have been fabricated using
well-established digital circuit technology and several successes have been achieved in
implementing feedforward networks [8]. However, the digital approach is not suitable for
recurrent neural network models with analogue dynamics. Another important
characteristic of analogue neural networks is the high processing speed that they can
achieve.
Analogue technology also provides the benefit of designs with reduced power
dissipation. Whilst digital systems are essentially discrete-level and discrete-time
(sampled data) systems, analogue systems are essentially continuous-level but can still be
classified into continuous-time and discrete-time systems.
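The three signal classes mentioned above can be illustrated with a short sketch. It is only an illustration: the sine test signal, the function names and the uniform quantiser are assumptions chosen for the example and are not taken from the chips described in this thesis.

```python
import math


def continuous_signal(t):
    """Continuous-level, continuous-time signal: defined for any real t."""
    return math.sin(2 * math.pi * t)


def sample(signal, period, n_samples):
    """Discrete-time, continuous-level: the signal is only observed at
    integer multiples of the sampling period (a sampled-data analogue system)."""
    return [signal(k * period) for k in range(n_samples)]


def quantise(values, n_bits, full_scale=1.0):
    """Discrete-level: map each sample to the nearest of 2**n_bits uniformly
    spaced levels in [-full_scale, +full_scale], as a digital system would."""
    levels = 2 ** n_bits
    step = 2 * full_scale / (levels - 1)
    return [round((v + full_scale) / step) * step - full_scale for v in values]


# Sampled-data analogue view: discrete time, continuous amplitude.
samples = sample(continuous_signal, period=0.125, n_samples=8)

# Digital view: discrete time and discrete amplitude (3-bit words here).
digital = quantise(samples, n_bits=3)
```

In this sketch `samples` corresponds to a discrete-time analogue representation, while `digital` corresponds to the discrete-level, discrete-time representation used by a digital system.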
When choosing between digital and analogue application specific designs, one has 
to take into account the aspects of power and area costs in each case. Research has shown 
that power and area cost of an analogue system is considerably lower than that of a digital 
system for low precision requirements, and vice-versa for high precision requirement [9]. 
Analogue computation at low precision is cheap because the circuitry is usually a direct 
mapping of the problem to be solved and therefore there is little area and power overhead. 
However, for high precision systems, maintaining low noise and offset is 
expensive. The issue of low power dissipation is obviously very critical for portable 
battery-powered applications. Compared to digital VLSI techniques, a full-custom layout 
is required in analogue designs in order to achieve good performance.
Chapter 1. Introduction
This thesis presents a number of analogue VLSI chips and one mixed-mode chip 
(digital and analogue), which can implement a number of learning algorithms and which 
can be applied to the task of phoneme recognition. In this case, analogue implementation 
was adopted for the following advantages:
• Parallelism: In the pursuit of faster data processing, two approaches are possible:
one is to use faster systems with high clock frequencies, while the other is to 
use parallel data processing elements. In the former case, the main problem is 
that difficulties are encountered in interfacing to and communicating with these 
fast systems [10]. On the other hand, neural networks are inherently parallel and 
are therefore easily mapped on to parallel hardware. Another advantage is the 
fact that the analogue neural network computing primitives (multiplication and 
addition) are much smaller than their digital counterparts, thus allowing 
massively parallel neural systems to be efficiently implemented using analogue 
VLSI, giving a potential for very fast data processing [11-13].
• Asynchronousness: Many neural networks are asynchronous in nature and this
asynchronousness can be efficiently exploited. In fact, asynchronous systems 
have a number of distinct advantages over systems governed by a clock 
[10, 11, 14, 15]. The main advantage is that asynchronous systems can run at the 
maximum speed of the present hardware, while synchronous systems must be 
designed to run at a conservative clock frequency to ensure functionality in the 
worst case situation. Another problem with synchronous systems is the fact that 
with an increasing system clock frequency, communication between 
components becomes a problem as it is very difficult to distribute the system 
clock (without skew) over a large area. In synchronous systems, all components 
change states simultaneously at the clock edge and this puts very high demands 
on the tolerable power supply peak currents, capacitive decoupling, etc. The 
high peak currents also introduce a lot of noise. Asynchronous systems, by 
contrast, average their power draw over time. Another point worth mentioning is the fact that real world 
interfacing is basically asynchronous and analogue asynchronous systems are 
well suited to implementing such interfaces.
• Fault Tolerance: A complete loss of a connection can definitely affect the
generalisation ability of any neural network architecture. In order to correct this
problem, redundant hardware is introduced. This can be done at a very low cost 
with analogue technology since the different components occupy very small 
silicon area [16].
• Low Power Applications: The use of sub-threshold operation MOSFETs offers
the possibility of extremely low power systems. Though digital systems can 
function in sub-threshold, analogue systems carry more information per wire 
and use fewer transistors per operation, and thus inherently use less power [12].
• Real World Interfacing: As well as being asynchronous, real world interfaces
are often required to be analogue. Analogue neural networks obviously 
eliminate the need for analogue-to-digital and digital-to-analogue converters, 
which is an attractive feature. This is of great importance when data is applied in 
massive parallelism as the use of hundreds or thousands of high speed analogue-to-digital 
converters would seldom be justified and for low power operation, it is 
not a good idea to waste power in changing from analogue to digital values or 
vice-versa [17].
• Regularity: The regularity of artificial neural networks makes them well suited
for massively parallel implementations and the design effort can be put into 
designing a few efficient components, which are used repeatedly and 
interconnected in a regular way.
1.2 Low Voltage Low Power Applications
Most CMOS building block circuits designed in this work have been optimised to 
operate at a low supply voltage of ± 1 V. The use of a low supply voltage is not merely 
dictated by considerations of low power dissipation, but is a necessity as a result of 
maximum voltage limits imposed by today’s CMOS technologies [18]. In fact, low power 
and low voltage are often conflicting specifications. The digital market has pushed 
technology into higher component densities, resulting in reduced line-widths and hence 
lower gate-channel breakdown voltages. The reduced ratio of maximum supply voltage to 
threshold voltage is not critical for digital designs, but has made analogue design more 
challenging.
Some of the implications of low voltage operation are clear if one considers just 
the noise issue. In fact, in order to maintain a specific signal-to-noise (S/N) ratio at low 
supply voltages both flicker and thermal noise contributions have to be reduced. Flicker
noise can be reduced by increasing the area of the metal-oxide semiconductor (MOS) 
devices while thermal noise can be reduced by either increasing the dc current (and hence 
power dissipation) or increasing the width/length (W/L) ratio. The minimum L value is 
defined by technology constraints and also by the need to minimise errors due to 
mismatches and channel length modulation effects. Hence an increase in W/L ratio 
effectively also means an increase in the area cost together with associated parasitics. In 
order to maintain the same bandwidth, with an increased value of parasitics, the current 
consumption has to be further increased [19].
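As a rough illustration of the trade-off described above, the usual first-order MOS noise expressions can be written as follows. These are textbook approximations for a square-law device in strong inversion, not expressions taken from this work; the symbols $\gamma$ (excess-noise factor), $K_f$ (flicker coefficient), $C_{ox}$ (gate-oxide capacitance per unit area), $\mu$ (carrier mobility) and $I_D$ (drain bias current) are assumptions introduced here for illustration.

```latex
% Input-referred thermal noise falls as g_m rises, i.e. with more current or a larger W/L
\overline{v_{n,\mathrm{th}}^{2}} \approx \frac{4kT\gamma}{g_m}\,\Delta f,
\qquad
g_m \approx \sqrt{2\,\mu C_{ox}\,\frac{W}{L}\,I_D}

% Input-referred flicker noise falls as the gate area W L rises
\overline{v_{n,1/f}^{2}} \approx \frac{K_f}{C_{ox}\,W L}\,\frac{\Delta f}{f}
```

Under these assumptions, halving the thermal noise power requires roughly a fourfold increase in either $I_D$ or W/L, which is why holding the S/N ratio at a reduced supply voltage tends to cost power, area, or both.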
Log-domain (LD) current-mode circuits offer a substantial advantage in this case 
since they require only a low gate overdrive voltage and a reduced $V_{DS,sat}$, and they need no high gain 
amplifiers. Furthermore, voltage swings are compressed relative to the current 
signals.
In most of the circuits, a compromise had to be sought as low voltage and power 
operation normally affected the operational speed of the network.
1.3 Neural Networks for Speech Recognition
Neural networks have recently been successfully applied to speech recognition 
systems. Although very promising work has been done, no fully integrated speaker 
independent large vocabulary speech understanding system based on neural networks 
exists. The main features that make neural networks attractive for speech recognition 
include the following [20-22] :
• Massive parallelism, which provides advantages in terms of speed, regularity and 
fault tolerance.
• Learning, since a number of training algorithms exist to identify unknown 
models.
• Stochastic modelling, uncertainty, variability and fuzziness allow probabilities to 
be encoded as patterns of activity across the network's elements rather than as single scalar 
values.
• Nonlinear modelling, that is, they allow the implementation of nonlinear 
classifiers and mapping functions and can also represent multi-modal distributions.
• Discovery of “hidden” knowledge, since neural networks are capable of 
abstracting and generalising complex problems.
• Uniformity, since computations are performed by simple underlying computing 
elements and the interaction between them.
• Speed: learning versus recognition. By virtue of massively parallel computations, 
neural networks run very efficiently.
Neural networks have been used at three important levels of speech 
understanding:
• The phonemic level
• The word level
• The language level
Much of the earliest work in connectionist speech recognition has been focused on 
phoneme recognition [23]. The choice is largely motivated by the fact that phoneme 
recognition is a tractable sub-problem when compared to large-vocabulary systems. 
Phoneme classification networks can be sub-divided into two groups:
1. those that require precise temporal alignment of input tokens for accurate 
recognition performance, i.e. temporally static classifiers
2. those that do not, i.e. temporally dynamic classifiers
Although useful recognition results have been achieved using neural networks for 
phonemic patterns, the question remains whether this technology can be effectively used 
for word recognition as well. An early set of experiments simply extended the 
classification capabilities of these networks by applying an entire word’s coefficient 
matrix to the inputs of static full-word networks with output units for each of the words to 
be classified. Good results were achieved, but time alignment and word end point 
detection are problems that limit this approach [20]. Similarly limiting is the fact that 
only small vocabularies can be handled in this fashion, because the network size and 
training time become prohibitively large with vocabulary size. To overcome the former 
limitations, networks that model time alignment and/or shift invariance internally have 
been developed for small vocabulary recognition. For large vocabulary recognition, sub-word 
units such as phonemes or syllables must be employed. A number of novel 
techniques are emerging that attempt integration of neural network sub-word models into
words and sentences. A majority of these use hybrid techniques that seek to combine the 
strengths of neural networks and hidden Markov models.
Beyond recognition of words, neural network models have also been applied to 
language models and natural language processing. The attempts are driven by the desire 
to develop more robust and perhaps cognitively plausible models for language. Indeed 
some recent work has suggested productive uses for the development of spoken language 
systems [24, 25].
1.4 Scope and Outline of Thesis
The scope of this thesis is the design of analogue CMOS neural network chips, 
which can be applied to the task of phoneme recognition. Phoneme recognition has been 
adopted in order to maintain the neural network size as small as possible. The system can 
then be integrated into a word recognition system where the phonemes can then be 
integrated into whole words and sentences, while allowing large vocabulary recognition. 
The analogue hardware building blocks are implemented in a standard CMOS process, 
operating at a low supply voltage of ± 1 V with low power dissipation.
Chapter 2 of the thesis gives an overview of the speech recognition system. This 
chapter presents the speech coding algorithms that have been adopted together with the 
different neural algorithms that can be applied to the task of phoneme recognition. 
Software simulations for the different coding techniques and learning algorithms are 
presented in chapter 3. These simulations are necessary in order to evaluate the 
complexities of the different neural network algorithms, together with an estimate of the 
recognition rate for each algorithm. In this case, the training time for each algorithm has 
been tested in order to evaluate the performance of the networks in terms of their training 
efficiency. Different speech coding techniques are investigated in order to decide on 
which coding scheme provides the best recognition rates and which lends itself best to 
hardware implementation. Another important issue is the minimum number of 
coefficients that need to be considered in order to obtain adequate recognition rates. This 
will minimise the network architecture that is required. This is an important issue since 
the chip area should be maintained as small as possible.
Chapter 4 discusses the design, simulation and testing of a mixed mode digital- 
analogue CMOS circuit which implements competitive learning. First the principle of 
operation is reviewed. The chapter then describes the design of the various building
blocks, namely the Hamming network and winner-take-all circuit. The design is 
optimised for low power and low supply voltage operation. Simulation results for each 
building block and the whole network are then discussed.
Chapter 5 presents the design, simulation and testing of an analogue CMOS chip, 
which can implement radial basis function neural networks. This chip is designed 
entirely using MOSFETs operating in the sub-threshold region. The chip is capable of 
low power operation. The implementations of Gaussian synapse cells are first discussed. 
The chapter then describes how the different values within the Gaussian equation can be 
programmed within a hardware environment. Simulation results are presented for each 
building block. Measurement results for the fabricated chip show that the chip exhibits 
the required performance of the software simulated algorithm.
The design, simulation and testing of an analogue low voltage multi-layer 
perceptron chip implementing back-propagation learning is presented in chapter 6. This 
chip is mainly implemented using a number of building blocks which are required by the 
back-propagation algorithm, namely multipliers, amplifiers and derivative generators. An 
overview of the required building blocks is first given. Contrastive back-propagation is 
presented as a method for reducing offset errors, intrinsic within analogue devices. The 
design and test results of the multi-layer perceptron network are discussed. Testing 
results shown for the implemented chip follow closely the simulated results attained using 
software programs.
Chapter 7 describes the test setup used for testing a time-delay radial basis 
function neural network. The system topology is presented together with the results 
obtained when the system is tested with speech data extracted from the TIMIT database. 
Test results show that a recognition rate of around 70% can be obtained for the task of 
phoneme classification.
General conclusions drawn from the previous chapters, together with a summary 
of the original contributions carried out at each stage are presented in chapter 8. Future 
work, including possible enhancements on the work presented in this thesis, is discussed 
and possible extensions to this work in order to implement a complete hardware word 
recognition system are presented.
Chapter 2. Speech Processing
Chapter 2
Speech Processing
2.0 Introduction
The speech generation process starts when the speaker formulates a message that 
needs to be delivered to the listener. The next step in the process is the conversion of the 
message to a language code, that is, to a set of phoneme sequences corresponding to the 
sounds that make up the words, along with prosody markers denoting duration of sounds, 
loudness and pitch accent associated with the sounds. Once the language code is chosen, 
the speaker must activate a series of neuromuscular commands to cause the vocal cords to 
vibrate when appropriate and to shape the vocal tract such that the proper sequence of 
speech sounds is created [26]. These commands must simultaneously control all aspects 
of articulatory motion including control of the lips, jaw, tongue and velum as shown in 
Figure 2-1.
Once the speech signal is generated and propagated to the listener, the recognition 
process begins. The acoustic signal is first processed along the basilar membrane in the 
inner ear. This provides a running spectrum analysis of the incoming signal. A neural 
process converts the spectral signal at the output of the basilar membrane into activity 
signals on the auditory nerve. The process of conversion of the neural activity into a 
language code occurs within the brain and the message is finally understood.
[Figure 2-1: diagram of the vocal tract; labelled parts: alveolar ridge, teeth, tongue (blade, back), vocal folds, nasal cavity, hard palate, soft palate, pharynx, epiglottis, larynx, oesophagus, trachea]
Figure 2-1 Human Vocal Tract [26]
2.1 Speech Production
Speech sounds can be classified into two categories, voiced and unvoiced [27]. 
Voiced sounds are produced by expelling air from the lungs through the vocal cords. 
These cords vibrate and the vibrations produce periodic pulses of sound with a spectrum 
consisting of a sequence of harmonics, each of which is a multiple of a fundamental 
frequency. The pitch of the sound is governed by the fundamental frequency and the rest 
of the vocal tract spectrally shapes the sound waves. The vocal tract is effectively an acoustic 
enclosure with several resonant cavities whose size and shape can be changed by 
movements of the jaw, tongue, lips, teeth and the velum.
Unvoiced sounds, on the other hand, are produced by relaxing the vocal cords 
and expelling air through a constriction in the upper part of the vocal tract. This 
process causes turbulence in the airflow.
Speech can therefore be modelled as the convolution of an excitation source, 
which is either voiced or unvoiced, with the vocal tract. The recognition process relies on 
separating the vocal tract response, which identifies the sound, from the excitation, which 
contains information on voicing and pitch.
The linear model of speech production is a special case, where the vocal tract 
filter is assumed to be a linear pole-zero filter, with two possible input sources, a periodic 
pulse-train for voiced speech and random Gaussian noise for unvoiced sounds. Thus, the 
speech waveform at the output of the model is considered to be the convolution of one of 
the excitation functions with the linear filter response. The spectral characteristics of the 
speech and therefore the identity of the sound, are hence determined by the parameters of 
the linear filter [28]. Figure 2-2 shows this model.
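The linear model described above can be sketched in a few lines of Python. This is a minimal illustration rather than the model used in this work: the function names, the pitch period of 80 samples and the single-resonance filter coefficients are all hypothetical choices made for the example.

```python
import random

def excitation(n_samples, voiced, pitch_period=80, seed=0):
    """Periodic impulse train for voiced speech, Gaussian noise for unvoiced speech."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def pole_zero_filter(x, b, a):
    """Linear pole-zero filter in direct form, assuming a[0] == 1:
    y[n] = sum_k b[k] x[n-k] - sum_{k>=1} a[k] y[n-k]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y

# Hypothetical "vocal tract" with one resonance (pole radius 0.9, so it is stable).
b = [1.0]
a = [1.0, -1.6, 0.81]

voiced_speech = pole_zero_filter(excitation(400, voiced=True), b, a)
unvoiced_speech = pole_zero_filter(excitation(400, voiced=False), b, a)
```

Both outputs share the same filter, so they share the same spectral envelope; only the excitation differs, which is the separation that the recognition front end tries to exploit.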
2.2 Speech Perception
The human ear, as shown in Figure 2-3, has three distinct regions: the outer ear, 
the middle ear and the inner ear. The outer ear consists of the pinna (the ear surface 
surrounding the canal into which sound is funnelled) and the external canal. Sound waves 
reaching the ear are guided through the outer ear to the middle ear, which consists of the 
tympanic membrane, upon which sound waves impinge, causing its movement, and a 
mechanical transducer composed of the malleus, the incus and the stapes. The latter 
converts the acoustical sound wave to mechanical vibrations along the inner ear.
[Figure 2-2: white noise and impulses at f0 excite the vocal tract filter, which outputs speech]
Figure 2-2 The source-filter model of speech production. f0 is the pitch [27].
The inner ear consists of the cochlea, which is a fluid-filled chamber partitioned 
by the basilar membrane, and the cochlear or auditory nerve. The mechanical vibrations 
impinging on the oval window at the entrance to the cochlea create standing waves of the 
fluid inside the cochlea. These waves cause the basilar membrane, within the cochlea, to 
vibrate at frequencies commensurate with the acoustic wave frequencies and at a place 
along the basilar membrane that is associated with those frequencies. The basilar 
membrane is characterised by a set of frequency responses at different points along the 
membrane. The membrane acts as a broad band-pass filter with each location having an 
almost constant Q. The critical bandwidth is slightly smaller than 500 Hz but it increases 
approximately logarithmically for frequencies above 1 kHz. This spacing has led to a 
mapping between the acoustic frequency and the perceptual frequency called the mel 
scale [29] and is defined by the equation:
$$M = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (2.1)$$
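Equation (2.1) translates directly into code. The short sketch below evaluates the mapping and its algebraic inverse; the function names are chosen here for illustration and f is assumed to be in hertz.

```python
import math

def hz_to_mel(f_hz):
    """Mel value for an acoustic frequency in Hz, following equation (2.1)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of equation (2.1): acoustic frequency in Hz for a mel value."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps on the mel axis give increasingly wide steps in Hz.
edges_hz = [mel_to_hz(m) for m in (0.0, 500.0, 1000.0, 1500.0, 2000.0)]
```

With these constants, 1 kHz maps to approximately 1000 mel, which is the usual anchor point of the scale.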
The organ of Corti, which lies on the basilar membrane, contains thousands of 
hair cells, which lie in rows along the length of the cochlea. The vibrations of the basilar 
membrane cause the cells to bend, resulting in neural firings along the nerve fibres 
connected to the cells.
[Figure 2-3: diagram of the ear; legible labels include outer ear, middle ear and eardrum]
Figure 2-3 Diagram of the Ear [26]
The mechanisms, which convert the auditory nerve firings into a linguistic 
message, are still not completely understood.
2.3 Signal Processing, Analysis Methods for Speech Recognition — 
Feature Extraction
Perhaps the greatest common denominator of all recognition systems is the signal- 
processing front end, which converts the speech waveform to some type of parametric 
representation. At this stage, the input speech signal samples are transformed into a 
domain where information of importance for recognition is preserved and redundant 
information discarded. This signal processing stage is usually quite computationally 
intensive.
A wide range of possibilities exists for parametrically representing the speech 
signal. These include short time energy, zero crossing rates, level crossing rates and other 
related parameters.
Probably the most important parametric representation of speech is the short time 
spectral envelope [30]. Spectral analysis methods are therefore generally considered as
the core of the signal-processing front end in a speech-recognition system. The two 
dominant methods of spectral analysis are the filter-bank spectrum analysis model and the 
linear predictive coding (LPC) spectral analysis model [31]. Together with these, vector 
quantisation [32], a procedure for encoding a continuous spectral representation by a 
“typical” spectral shape in a finite codebook of spectral shapes, in order to reduce the 
information rate even further, is normally considered.
2.4 Spectral Analysis Models
The overall structure of the bank-of-filters model is shown in Figure 2-4. The speech 
signal s(n) is passed through a bank of Q bandpass filters whose coverage spans the 
frequency range of interest in the signal. The individual filters generally overlap in 
frequency as shown in Figure 2-4. The output of the $i^{th}$ bandpass filter is the short-time 
spectral representation of the signal s(n) at time n, as seen through the $i^{th}$ bandpass filter 
with centre frequency $\omega_i$. It can readily be seen that in the bank-of-filters model each 
bandpass filter processes the speech signal independently to produce the spectral 
representation $X_n$.
[Figure 2-4: the speech signal s(n) is applied to a bank of bandpass filters, each producing an output $X_n$]
Figure 2-4 Bank-of-Filters Analysis Model [27].
2.4.1 The LPC Processor for Speech
Figure 2-5 shows a block diagram of the LPC processor [33]. The basic steps are 
the following:
• Pre-Emphasis - The digital signal is put through a low-order digital system 
to spectrally flatten the signal and make it less susceptible to finite precision effects 
later in the signal processing. The system adopted is either fixed or slowly adaptive.
• Frame Blocking - In this phase, the pre-emphasised signal is blocked into 
frames of N samples, with adjacent frames being separated by M samples.
• Windowing - The next step is to window each individual frame so as to 
minimise the signal discontinuities at the beginning and at the end of each frame. The 
typical window used for the autocorrelation method of LPC is the Hamming window.
• Autocorrelation Analysis - Each frame of the windowed signal is next 
autocorrelated to give
$$r(m) = \sum_{n=0}^{N-1-m} \tilde{x}(n)\,\tilde{x}(n+m), \qquad m = 0, 1, 2, \ldots, p \qquad (2.2)$$
where $\tilde{x}(n)$ denotes the windowed frame and the highest autocorrelation index, p, is the order of the LPC analysis. 
Typical values of p range from 8 to 16. A side benefit of the autocorrelation analysis is 
that the zeroth autocorrelation, r(0), is the energy of the given frame.
[Figure 2-5: block diagram with the stages pre-emphasis, frame blocking, windowing, autocorrelation analysis, LPC analysis, LPC parameter conversion, parameter weighting and temporal derivative]
Figure 2-5 Block diagram of the LPC Processor for Speech Recognition [31]
• LPC Analysis - This phase converts each frame of p+1 autocorrelations into 
an “LPC parameter set”, in which the set might be the LPC coefficients, the reflection 
(or PARCOR) coefficients, the log area ratio coefficients, the cepstral coefficients or 
any desired transformation of the above. The formal method used is known as 
Durbin’s method.
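$$E^{(0)} = r(0)$$
$$k_i = \frac{r(i) - \sum_{j=1}^{i-1} a_j^{(i-1)}\, r(|i-j|)}{E^{(i-1)}}, \qquad 1 \le i \le p \qquad (2.3)$$
$$a_i^{(i)} = k_i \qquad (2.4)$$
$$a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \qquad 1 \le j \le i-1 \qquad (2.5)$$
$$E^{(i)} = \left(1 - k_i^2\right) E^{(i-1)} \qquad (2.6)$$
The equations are solved recursively to give the LPC coefficients, given by $a_m = a_m^{(p)}$, 
the PARCOR coefficients [34], given by $k_m$, and the log area ratios, denoted by $g_m$, where
$$g_m = \log\left(\frac{1 - k_m}{1 + k_m}\right) \qquad (2.7)$$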
• LPC Parameter Conversion to Cepstral Coefficients - A very important 
LPC parameter set, which can be derived directly from the LPC coefficient set, is the 
LPC cepstral coefficients, denoted by c [35]. The recursion used is:
$$c_0 = \ln \sigma^2 \qquad (2.8)$$
$$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad 1 \le m \le p \qquad (2.9)$$
$$c_m = \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad m > p \qquad (2.10)$$
where $\sigma^2$ is the gain term. These coefficients, which are the coefficients of the 
Fourier transform representation of the log magnitude spectrum, are more robust than 
the LPC coefficients, the PARCOR coefficients or the log area ratio coefficients.
• Parameter Weighting - Because of the sensitivity of the low-order cepstral 
coefficients to the overall spectral slope and the sensitivity of the high-order cepstral 
coefficients to noise, it has become a standard technique to weight the cepstral 
coefficients by a tapered window so as to minimise these sensitivities.
• Temporal Cepstral Derivative - An improved representation of the speech 
spectrum provided by cepstral coefficients can be obtained by extending the analysis to 
include information about the temporal cepstral derivative (both first and second 
orders have been found to improve the performance of recognition).
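The main steps of the LPC processor listed above can be sketched as follows. This is an illustrative pure-Python sketch, not the implementation used in this work: the function names, the first-order pre-emphasis filter with coefficient 0.95, the test signal and the frame sizes are assumptions made for the example. The recursions follow the standard Durbin and LPC-to-cepstrum forms, with $a_j$ taken as zero for j > p.

```python
import math
import random

def pre_emphasis(s, alpha=0.95):
    """Fixed first-order pre-emphasis: y[n] = s[n] - alpha * s[n-1] (alpha is assumed)."""
    return [s[0]] + [s[n] - alpha * s[n - 1] for n in range(1, len(s))]

def frame_blocks(x, N, M):
    """Frames of N samples, adjacent frames separated by M samples."""
    return [x[i:i + N] for i in range(0, len(x) - N + 1, M)]

def hamming(frame):
    """Apply a Hamming window to one frame (requires len(frame) > 1)."""
    N = len(frame)
    return [frame[n] * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
            for n in range(N)]

def autocorrelation(x, p):
    """r(m) for m = 0..p; r(0) is the frame energy."""
    N = len(x)
    return [sum(x[n] * x[n + m] for n in range(N - m)) for m in range(p + 1)]

def durbin(r, p):
    """Durbin recursion. Returns (a_1..a_p, k_1..k_p, final prediction error E)."""
    a = [0.0] * (p + 1)
    k = [0.0] * (p + 1)
    E = r[0]
    for i in range(1, p + 1):
        k[i] = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / E
        new_a = a[:]
        new_a[i] = k[i]
        for j in range(1, i):
            new_a[j] = a[j] - k[i] * a[i - j]
        a = new_a
        E *= 1.0 - k[i] ** 2
    return a[1:], k[1:], E

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n_ceps from LPC coefficients a_1..a_p."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for kk in range(1, m):
            if m - kk <= p:  # a_j is zero for j > p
                acc += (kk / m) * c[kk] * a[m - kk - 1]
        c[m] = acc
    return c[1:]

# Usage on a synthetic signal (a sinusoid plus noise, purely for illustration).
rng = random.Random(0)
signal = [math.sin(0.3 * n) + 0.5 * rng.gauss(0.0, 1.0) for n in range(240)]
frames = frame_blocks(pre_emphasis(signal), N=80, M=40)
r = autocorrelation(hamming(frames[0]), p=8)
a, k, E = durbin(r, p=8)
c = lpc_to_cepstrum(a, n_ceps=12)
```

Parameter weighting and the temporal derivative are omitted to keep the sketch short; both operate on the cepstral vector c produced in the last step.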
2.4.2 Mel Cepstral Coefficients
Mel cepstral coefficients are a commonly used set of cepstral coefficients. The 
process to obtain them normally involves the following steps [36] as shown in Figure 2-6: 
• the samples are first assembled and windowed into overlapping frames, with the 
overlapping period normally chosen in the range 10-20 ms, during which the speech is 
assumed to be quasi-stationary.
• a Fast Fourier transform is used to calculate the power spectrum.
• the power spectrum is then organised into frequency bands according to a 
series of Mel-scale filters. These filters are spaced linearly to 1 kHz and then 
logarithmically up to the maximum frequency, reflecting the sensitivity of the human 
ear to changes in frequency. In this way, the power spectrum can be represented by 
about 20 Mel-scale filter outputs, considerably reducing the data rate.
• finally, a discrete cosine transform is performed on the logged Mel filter 
outputs. This is given as:
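$$C(k) = \sum_{i=1}^{N} l(i) \cos\left[\frac{\pi k}{N}\left(i - \frac{1}{2}\right)\right] \quad \text{for } k \in [0, M] \qquad (2.11)$$
where C(k) is the $k^{th}$ DCT output and l(i) is the $i^{th}$ of the N log filter-bank outputs.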
The DCT output is known as the cepstrum and acts as a data reduction process, 
since the power spectrum envelope varies slowly over the frequency range and M is 
usually much less than N. Also, the DCT outputs are relatively uncorrelated so that each 
output value can be assumed to be independent of every other value. This is ideal for 
most types of pattern matching as the vector operations for training and recognition can 
be greatly simplified.
Each feature vector now contains a subset of the cepstral coefficients. In addition, 
the time derivative of the cepstral coefficients computed over successive, non-overlapping 
frames is also included.
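The final two steps, the cosine transform of the log filter-bank outputs and the time derivative across frames, can be sketched as below. The sketch assumes the common unnormalised DCT-II form, $C(k) = \sum_{i=1}^{N} l(i)\cos[\pi k (i - 1/2)/N]$, and a plain first difference between successive frames; both choices, and the function names, are assumptions made for illustration rather than details confirmed by this work.

```python
import math

def dct_of_log_filterbank(log_outputs, M):
    """C(k) for k = 0..M from N log filter-bank outputs (unnormalised DCT-II).
    The list index i is zero-based, so (i + 0.5) plays the role of (i - 1/2)."""
    N = len(log_outputs)
    return [sum(log_outputs[i] * math.cos(math.pi * k * (i + 0.5) / N)
                for i in range(N))
            for k in range(M + 1)]

def first_difference(cepstra):
    """Simplest time derivative: difference between successive cepstral frames."""
    return [[cur - prev for prev, cur in zip(cepstra[t - 1], cepstra[t])]
            for t in range(1, len(cepstra))]

# A flat log spectrum puts everything into C(0): the envelope has no variation.
flat = dct_of_log_filterbank([2.0] * 20, M=12)

# A slowly varying envelope is captured mostly by the low-order coefficients.
sloped = dct_of_log_filterbank([0.1 * i for i in range(20)], M=12)
deltas = first_difference([flat, sloped])
```

Keeping only the first M + 1 outputs, with M much smaller than N, is what gives the data reduction described above.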
[Figure 2-6: block diagram; legible labels include speech signal, pre-emphasis, window, DFT, Mel filter banks, energy, derivatives and MFCC]
Figure 2-6 Mel Cepstral Coefficients Processor [36]
2.5 Vector Quantisation
The results of either filter-bank or LPC analysis are a series of vectors 
characteristic of the time-varying spectral characteristics of the speech signal. It can be 
shown that the information rate of the vector representation relative to the raw speech waveform 
has been significantly reduced by spectral analysis. The information rate in this case can 
even be reduced by a factor of 10 [37]. However, based on the concept of ultimately 
needing only a single spectral representation of each basic speech unit, it may be possible 
to further reduce the raw spectral representation of the speech to those drawn from a 
small, finite number of “unique” spectral vectors, each corresponding to one of the basic 
speech units or phonemes.
This ideal representation is impractical, because there is so much variability in the 
spectral properties of each of the basic speech units. However, the concept of building a
codebook of “distinct” analysis vectors, albeit with significantly more code words than 
the basic set of phonemes, remains an attractive idea and is the basis behind a set of 
techniques commonly called vector quantisation (VQ) methods. Normally, this will 
further reduce the spectral representation by another factor of 16. Thus, vector 
quantisation has the following potentials:
• reduced storage for spectral analysis information
• reduced computation for determining similarity of spectral analysis vectors 
• discrete representation of speech sounds
On the other hand, the disadvantages include:
• an inherent spectral distortion in representing the actual analysis vector 
• the storage required for the codebook vectors is often nontrivial.
To build a VQ codebook and implement a VQ analysis procedure, the following 
are required:
• a large set of spectral analysis vectors which form the training set. These are 
used to create the “optimal” set of codebook vectors for representing the 
spectral variability observed in the training set.
• a measure of similarity or distance between a pair of spectral analysis 
vectors so as to be able to cluster the training set vectors as well as to associate 
or classify arbitrary spectral vectors into unique codebook entries.
• a centroid computation procedure: on the basis of the partitioning that 
classifies the L training set vectors into M clusters, the centroid of each cluster 
is then calculated to obtain the M codebook vectors.
• a classification procedure for arbitrary speech spectral analysis vectors that 
chooses the codebook vector closest to the input vector and uses the codebook 
index as the resulting spectral representation.
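The four ingredients listed above map onto a short clustering loop. The sketch below uses squared Euclidean distance and a simple iterative scheme of the k-means type; the distance measure, the naive initialisation from the first M training vectors, the iteration count and the function names are all assumptions made for illustration, not the procedure adopted in this work.

```python
def sq_dist(u, v):
    """Distance measure between two spectral vectors (squared Euclidean, assumed)."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def quantise(vec, codebook):
    """Classification: index of the codebook vector closest to vec."""
    return min(range(len(codebook)), key=lambda m: sq_dist(vec, codebook[m]))

def centroid(vectors):
    """Centroid computation: component-wise mean of a cluster."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def train_codebook(training, M, n_iter=20):
    """Partition the L training vectors into M clusters and return M centroids."""
    codebook = [list(v) for v in training[:M]]  # naive initialisation (assumption)
    for _ in range(n_iter):
        clusters = [[] for _ in range(M)]
        for v in training:
            clusters[quantise(v, codebook)].append(v)
        # An empty cluster keeps its previous codebook vector.
        codebook = [centroid(c) if c else codebook[m]
                    for m, c in enumerate(clusters)]
    return codebook

# Toy training set with two obvious clusters.
training = [[0, 0], [0, 1], [10, 10], [10, 11]]
codebook = train_codebook(training, M=2)
index = quantise([9, 9], codebook)
```

Once trained, each incoming analysis vector is replaced by a single codebook index, which is where the reduction in storage and computation comes from.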
2.6 Summary
This chapter reviews the different coding techniques adopted for speech. The 
signal-processing front end, which converts the speech waveform to some type of 
parametric representation is perhaps the greatest common denominator of all recognition 
systems. At this stage, the input speech signal samples are transformed into a domain 
where information of importance for recognition is preserved and redundant information
discarded. This chapter presents the computations required during this signal processing 
stage, which represents the first stage of a speech recognition system.
Chapter 3. Pattern Recognition
Chapter 3
Pattern Recognition
3.0 Introduction
The speech parameters are often obtained using some type of spectral analysis that 
concentrates on the short-time characteristics of the speech signal. Time spectral 
measurements are performed sequentially over time, producing a sequence of spectral 
feature vectors. Using the basic pattern recognition approach, the input is compared with 
each class reference pattern and a measure of similarity between the unknown pattern and 
each reference pattern is calculated.
3.1 Pattern Matching
For recognition, some form of model of the entities to be recognised is needed. 
The role of a model is to represent the common patterns of similar speech sounds while 
also allowing for variability.
The first section below describes the application of the statistical models, known 
as the Hidden Markov models (HMMs), for pattern matching.
3.2 Hidden Markov models
An acoustic pattern is represented by a sequence of feature vectors and different 
acoustic realisations of the same word give rise to different vector sequences. Therefore a 
good model must deal with the two following types of variability presented by the vector 
sequences:
• the temporal variability - i.e. the different vector sequences corresponding
to a sound will generally differ in duration.
• the acoustic variability - i.e. the feature vectors corresponding to a given
phoneme sound will vary within a spoken word and more so across
different speakers.
With HMMs, processing is stochastic and it can accommodate both types of 
variation [38, 39]. An HMM consists of a number of states as shown in Figure 3-1. 
Transitions between model states have an associated probability and the transitions are 
usually chosen to reflect the temporal progression of speech sounds.
In practice, the spectral properties of sound are too variable to allow a unique 
mapping of frames to states. Instead, every frame is mapped to all states with an 
associated probability. As each frame cannot be uniquely matched to one state, the model 
output probability is the sum of all state-sequence probabilities. The calculation of the 
state-sequence probability must also incorporate the state-output probability for each state 
in the sequence. The Markov model in this case is ‘hidden’ because the true state 
sequence traversed through the model is unknown.
The number of possible state sequences is, in practice, unfeasibly large and
recursive algorithms are used to efficiently traverse all possible state sequences.
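The recursion referred to above can be illustrated with a minimal sketch of the forward algorithm. It assumes a small discrete-output HMM (the models discussed in this chapter use continuous feature vectors), and the names `forward_probability`, `brute_force_probability`, `INIT`, `TRANS` and `EMIT` are hypothetical. The brute-force version is included only to show that the recursion gives the same sum over all state sequences.

```python
from itertools import product

def forward_probability(init, trans, emit, observations):
    """Model output probability: sum over all state sequences, computed recursively."""
    n_states = len(init)
    # alpha[j]: probability of the frames seen so far, ending in state j
    alpha = [init[j] * emit[j][observations[0]] for j in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs]
            for j in range(n_states)
        ]
    return sum(alpha)

def brute_force_probability(init, trans, emit, observations):
    """Same quantity, obtained by enumerating every possible state sequence."""
    n_states = len(init)
    total = 0.0
    for path in product(range(n_states), repeat=len(observations)):
        p = init[path[0]] * emit[path[0]][observations[0]]
        for t in range(1, len(observations)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][observations[t]]
        total += p
    return total

# Hypothetical two-state model with two discrete output symbols (assumed values)
INIT = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]
OBS = [0, 1, 1, 0]
```

The enumeration visits every state sequence, so its cost grows exponentially with the number of frames, whereas the recursion grows only linearly with it.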
The most difficult aspect of pattern recognition, generally, lies in training models 
for pattern matching. There are two forms: -
• supervised - where the pre-categorised examples of the item to be 
recognised are presented to the training process.
• unsupervised - where the training process derives categories for recognition 
from the training data.
Figure 3-1 Complete Hidden Markov Model of a simple grammar [39]
Training of HMMs is normally carried out in a supervised mode and a number of 
training sets are required to model the statistics of a signal [38]. In this case, speech data 
must be annotated to be of use for supervised training. Annotation requires specifying
where each speech sound starts and ends and labelling the sound according to how it is to 
be classified.
The Baum Welch algorithm is a powerful method used to determine model 
parameters such as state transition probabilities, means and variances [39]. The algorithm 
is iterative and requires an initial estimate of the model parameters.
Model initialisation starts with an arbitrary assignment of feature vectors to states 
for each training example. The vectors assigned to a given state are pooled for all 
training examples. When all the vectors from all the examples have been assigned, the 
vectors in each pool are clustered. In this way, the mean of the vectors within a cluster 
becomes the initial estimate of a Gaussian mode.
The next phase of initialisation uses this estimate of the parameters, but this time 
the assignment of vectors to states is not arbitrary. The Viterbi algorithm is used to 
decide the most likely state sequence through the model for a training example and hence
to which pool each vector is to be assigned [39]. After the vectors from all the training 
examples are assigned, the state pools are again clustered. The process is iterated on the 
training data until a suitable convergence criterion is met.
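A minimal sketch of the Viterbi search used in this initialisation phase is given below. It again assumes a discrete-output model for brevity, and the function name `viterbi` and the example values are hypothetical. The returned path gives, for each frame, the state (and hence the pool) to which that frame would be assigned.

```python
def viterbi(init, trans, emit, observations):
    """Return (best_path, probability) of the single most likely state sequence."""
    n_states = len(init)
    score = [init[j] * emit[j][observations[0]] for j in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor state for arriving in state j at this frame
            best_i = max(range(n_states), key=lambda i: score[i] * trans[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] * trans[best_i][j] * emit[j][obs])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda j: score[j])
    best_prob = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_prob

# Hypothetical left-to-right model: state 0 may stay or move on, state 1 is final
LR_INIT = [1.0, 0.0]
LR_TRANS = [[0.5, 0.5], [0.0, 1.0]]
LR_EMIT = [[0.9, 0.1], [0.1, 0.9]]
```

For the observation sequence `[0, 0, 1, 1]` the sketch assigns the first two frames to state 0 and the last two to state 1.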
The Baum Welch algorithm takes an initial model $\lambda$ and estimates a new model
$\lambda'$ using all the training examples. During this process, the model output probability
$P(V|\lambda)$ is calculated for the vectors V of each training example. The algorithm guarantees
that:
$$P(V|\lambda') \geq P(V|\lambda) \qquad (3.1)$$
i.e. the model output probability for the training data set is at least as good as for
the initial model. This algorithm is repeated using $\lambda'$ as the initial model for the
next iteration.
The principle behind Baum Welch training is fairly straightforward. The mean 
value of a variable x can be estimated by:
$$\bar{x} = \sum_{\forall x} x\,P(x) \qquad (3.2)$$
where P(x) is the probability of x for all values of x. The current estimate of the
model is used to calculate $P_j(V_t)$, the probability of being in state j for feature vector $V_t$,
given an entire training example.
For each iteration over the training data, the probabilities will be different from 
the last iteration for a given training example and the probabilities will more accurately 
reflect the ‘best’ assignment of a feature vector to a state.
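As an illustration of how the weighted-mean principle carries over to model re-estimation, the mean of state j can be updated as a weighted average of the feature vectors. This is the standard form of the Baum Welch mean update and is given here as a sketch, writing $P_j(V_t)$ for the probability of being in state j for feature vector $V_t$ and $\hat{\mu}_j$ (a symbol introduced only for this illustration) for the re-estimated mean:

$$\hat{\mu}_j = \frac{\sum_{t} P_j(V_t)\, V_t}{\sum_{t} P_j(V_t)}$$

The denominator normalises the weights so that they sum to one, which makes the expression a direct counterpart of the mean estimate above with the state occupation probabilities playing the role of P(x).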
Although HMMs provide a powerful mechanism for simultaneously modelling 
temporal and spectral variations of an acoustic signal, some assumptions made in the 
application of HMMs to speech signals seriously affect the robustness and recognition 
accuracy.
The most dubious is the independence assumption [38]. This assumes that 
successive frame vectors matched to a model state are independently and identically 
distributed and makes no allowance for the influence of preceding vectors. HMM 
variants such as the segmental HMM and inter-frame dependent HMM have been
postulated to overcome this limitation.
However, the assumption that vectors mapping to a state are identically distributed
is also a gross simplifying approximation to real speech signals, as this models speech as a
series of discontinuous jumps through a sequence of sounds. In reality, the position of the
vocal tract for one speech sound greatly affects how the following sound evolves.
Another weakness of HMMs is the fact that sound durations are unrealistically 
modelled [39, 40]. This may initially have been seen as a strong point in favour of HMMs,
as they can accommodate patterns that evolve through time, allowing arbitrary
duration signals to be matched to a model. However, the inaccurate modelling of the
HMMs does not penalise implausibly long or short model-state durations and this affects
the recognition performance. Thus, models incorporating durational models are being 
investigated [41].
3.3 Artificial Neural Networks
An alternative to statistical models, such as HMMs, is that of artificial neural 
networks (ANNs) [42]. They are patterned on the organisation of the neurons in the 
brain. The input to an ANN is typically a speech feature vector and each output may 
represent a recognition outcome. Figure 3-2 shows an example of an ANN known as the 
multilayer perceptron.
Figure 3-2 Multilayer Perceptron
The inputs are the components of a feature vector and each component connects to 
a node in the input layer. Each input node then connects to all the nodes in a hidden 
layer. Each output node represents a phoneme sound in this case and is fed from all 
hidden nodes. The highest output node is usually taken to indicate the recognised sound.
ANNs distinguish different inputs by finding partitions between vectors for 
different patterns. The number of nodes in the hidden layer limits the number of 
boundaries the perceptron can find. Adding more hidden layers allows more complicated 
boundary shapes to be accommodated.
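The forward pass described above can be sketched as follows. This is a minimal illustration assuming sigmoid nodes and a single hidden layer. The function names and any weight values used with it are hypothetical and are not taken from the designs presented in this thesis.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def layer(inputs, weights, biases):
    """Each node forms a weighted sum of all its inputs and applies the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def recognise(feature_vector, hidden_w, hidden_b, output_w, output_b):
    """Return (index of the highest output node, all output values)."""
    hidden = layer(feature_vector, hidden_w, hidden_b)
    outputs = layer(hidden, output_w, output_b)
    # The highest output node is taken to indicate the recognised sound
    winner = max(range(len(outputs)), key=lambda k: outputs[k])
    return winner, outputs
```

Each row of `hidden_w` and `output_w` holds the weights of one node, so the number of rows in `hidden_w` sets the number of hidden nodes.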
A major drawback of ANNs is that a fixed size of input vector is required, making 
it difficult to deal with time-varying patterns [43]. Several ANN architectures 
specifically deal with this problem. The time-delay neural network, described in
Section 3.5, is one such architecture: it presents delayed copies of the inputs to each unit.
A growing trend today is to combine HMMs with ANNs in a hybrid system, since 
this allows the temporal modelling capability of HMMs and the discriminative nature of 
ANNs to be exploited [44].
3.3.1 Learning Process
Among the many interesting properties of a neural network is the ability to learn 
from its environment and to improve its performance through learning. This process of
learning is done through an iterative process of adjustments applied to the synaptic
weights and thresholds.
The aim of the error-correction learning algorithm is to start from an arbitrary point on the
error surface (determined by the initial values assigned to the synaptic weights) and then
move forward towards a global minimum [45]. The first objective is always attained, but
the second one may not always be attained, because the algorithm can become trapped at
a local minimum of the error surface and may never be able to reach a global minimum
[46].
Hebbian Learning can be summarised into two statements:
• if two neurons on either side of a synapse are activated simultaneously, then
the strength of that synapse is selectively increased.
• if two neurons on either side of a synapse are activated asynchronously, then
the strength of that synapse is selectively weakened or eliminated.
This type of synapse is known as a Hebbian synapse. Therefore, Hebb proposed a
synapse that uses a time dependent, highly local, and strongly interactive mechanism to
increase synaptic efficiency as a function of the correlation between the pre-synaptic and
post-synaptic activities.
Thus Hebbian learning enforces positively correlated activity and weakens
uncorrelated activity.
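One common way of formalising the two statements is the product rule, in which the weight change is proportional to the product of the pre-synaptic and post-synaptic activities. The sketch below assumes bipolar activities (+1 active, -1 inactive). The function name `hebbian_update` and the default learning rate are hypothetical.

```python
def hebbian_update(weight, pre, post, rate=0.1):
    """Product-rule Hebbian update: the change follows the pre/post correlation."""
    # Same sign (simultaneous activity) strengthens the synapse,
    # opposite sign (asynchronous activity) weakens it.
    return weight + rate * pre * post
```

With this assumption, `hebbian_update(0.5, 1, 1)` raises the weight and `hebbian_update(0.5, 1, -1)` lowers it by the same amount.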
Although the Hebbian postulate has aroused interest among neurophysiologists, 
Hetherington and Shapiro [46] have demonstrated that:
1. an anti-Hebbian rule is needed to decrease the saturation of cell assembly 
activity.
2. a synaptic modification rule that decreases synaptic weights when post- 
synaptic activity occurs in the absence of pre-synaptic activity is necessary,
but not sufficient, for stable assemblies.
3. dendritic trees must be partitioned into independent regions of activation.
In competitive learning, the output neurons of a neural network compete among
themselves to be the one that is active. This contrasts with Hebbian learning, where
several output neurons may be active simultaneously. It is this feature that makes 
competitive learning highly suited to discover those statistically salient features that may 
be used to classify a set of input patterns.
There are three basic elements to the competitive learning rule proposed by 
Rumelhart and Zipser [47]:
1. a set of neurons that are all the same except for some randomly distributed 
synaptic weights and which therefore respond differently to a given set of 
patterns.
2. a limit imposed on the strength of each neuron.
3. a mechanism that permits the neurons to compete for the right to respond
to a given subset of inputs, such that only one output neuron, or only one 
neuron per group is active at a time. The neuron that wins the competition 
is called a winner-take-all neuron.
The Boltzmann learning rule is a stochastic learning algorithm derived from
information-theoretic and thermodynamic considerations [48].
In a Boltzmann machine, neurons constitute a recurrent structure and they operate 
in a binary manner - “on” denoted by +1 and “off” denoted by -1. The machine is
characterised by an energy function E, the value of which is determined by the particular
states occupied by the individual neurons of the machine, as shown by
$$E = -\frac{1}{2} \sum_{i} \sum_{j,\, j \neq i} w_{ji}\, s_j\, s_i \qquad (3.3)$$
where $s_i$ is the state of neuron i and $w_{ji}$ is the synaptic weight connecting neuron i
to neuron j. The machine operates by choosing a neuron j at random at some step of the
learning process and flipping the state of that neuron from state $s_j$ to $-s_j$ at some temperature
T with probability:
$$W(s_j \rightarrow -s_j) = \frac{1}{1 + \exp(-\Delta E_j / T)} \qquad (3.4)$$
where $\Delta E_j$ is the energy change resulting from the flip. T is a pseudo-random
temperature and if the process is repeated the system will reach thermal equilibrium. 
There are two types of neurons - visible and hidden. The visible neurons provide an 
interface between the network and the environment in which it operates, whereas the 
hidden neurons always operate freely. There are two modes of operation to be 
considered:
• clamped condition, in which the visible neurons are all clamped onto
specific states determined by the environment.
• free-running condition, in which all neurons are allowed to operate freely.
A distinctive feature of Boltzmann learning is that it uses only locally available
observations under two operating conditions: clamped and free-running.
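The energy function and the flip probability of the Boltzmann machine can be evaluated with the short sketch below. It assumes the usual form of the energy, minus one half of the sum of weight times state products over all pairs of distinct neurons, and a logistic flip probability in the energy change divided by the temperature. The sign convention chosen for the energy change is an assumption and must match the way that change is defined. The function names are hypothetical.

```python
import math

def energy(weights, states):
    """E = -1/2 * sum over i != j of w[j][i] * s[j] * s[i], with states of +1 or -1."""
    n = len(states)
    return -0.5 * sum(
        weights[j][i] * states[j] * states[i]
        for i in range(n) for j in range(n) if i != j
    )

def flip_probability(delta_e, temperature):
    """Logistic function of the energy change; the sign convention is assumed."""
    return 1.0 / (1.0 + math.exp(-delta_e / temperature))
```

At a very high temperature the probability tends to 0.5 whatever the energy change, so flips become essentially random. At a low temperature the decision becomes almost deterministic.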
3.3.2 Supervised Learning
An essential ingredient of supervised learning is the availability of an external 
teacher. The teacher has knowledge of the environment, which is unknown to the neural
network, as shown in Figure 3-3. The teacher is capable of providing the neural network
with a desired or target response for a given training vector. This desired response is the
optimum action to be performed by the neural network and the network parameters are
adjusted under the influence of the training vector and the error signal, which is given by
the difference between the actual response and the target vector. The adjustment is
carried out iteratively with the aim that eventually the neural network emulates the
teacher. When this condition is achieved, the neural network can do without the teacher
and it can now deal with the environment completely by itself.
Examples of supervised learning include error-correction learning methods, such 
as the least-mean-square algorithm, developed by Widrow and Hoff [49] and
backpropagation learning, developed by Werbos [50]. Supervised learning can be
performed both in an off-line and on-line manner.
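As an illustration of error-correction learning with a teacher, the sketch below applies the least-mean-square rule to a single linear neuron in an on-line manner. The function name `lms_train`, the learning rate and the number of epochs are hypothetical choices made only for this example.

```python
def lms_train(samples, rate=0.1, epochs=200):
    """Least-mean-square rule for one linear neuron: w += rate * error * x."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for inputs, target in samples:
            actual = sum(w * x for w, x in zip(weights, inputs))
            error = target - actual  # desired response minus actual response
            # On-line update: weights are adjusted after every training vector
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
    return weights

# Hypothetical teacher: target = 2 * x1 - 1 * x2
SAMPLES = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
```

After training on `SAMPLES` the weights approach 2 and -1, at which point the neuron reproduces the teacher's responses and the error signal vanishes.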
Figure 3-3 Block Diagram of Supervised Learning
The disadvantage of supervised learning, regardless of whether it is performed 
off-line or on-line, is the fact that without a teacher, a neural network cannot learn new 
strategies for particular situations that are not covered by a set of examples used to train 
the network. This limitation may be overcome by the use of reinforcement learning.
3.3.3 Reinforcement Learning
Reinforcement learning is the on-line learning of an input-output mapping through 
a process of trial and error designed to maximise a scalar performance index called the 
reinforcement signal. The whole idea behind this type of learning was summarised by 
Sutton et al. [51] and is shown in Figure 3-4:
If an action taken by a learning system is followed by a satisfactory
state of affairs, then the tendency of the system to produce that particular
action is strengthened or reinforced. Otherwise, the tendency of the system to
produce that action is weakened.
Figure 3-4 Block Diagram of the Adaptive Heuristic Critic - Reinforcement Learning
3.3.4 Unsupervised Learning
In unsupervised learning, there is no external teacher to oversee the learning
process. In this case, the network tunes itself to the statistical regularities of the input data
and then develops the ability to form internal representations for encoding features of the
input, thereby creating new classes automatically.
A competitive learning rule can be used for unsupervised learning and the network
can include two layers, namely, an input layer and a competitive layer.
3.3.5 Supervised versus Unsupervised Learning
Among the algorithms used for supervised learning, the back-propagation 
algorithm has emerged as the most widely used and successful algorithm for the design of
multi-layer feed-forward networks [52]. There are two distinct phases of operation: the
forward and backward phases. During the forward phase, the input signals propagate
through the network layer by layer, eventually producing some response at the output of
the network. The actual response is compared with a desired response and error signals
are back-propagated through the network. During this backward phase, the free parameters of the
network are adjusted to minimise the sum of squared errors. This algorithm has been
applied successfully to a number of applications, but its application and that of other
supervised learning systems has been limited due to its poor scaling behaviour. In fact, as
the size of the network increases, the network becomes more computationally intensive 
and the time to train the network grows exponentially and the learning process becomes 
unacceptably slow [50].
One possible solution to the scaling problem is to use unsupervised learning, since
when the self-organising process is applied in a sequential manner, one layer at a time, it
becomes feasible to train deep networks in a time that is linear in the number of layers.
Hence, a hybrid use of both supervised and unsupervised learning procedures has
been adopted especially for large network sizes [53].
3.4 Back-Propagation Learning
Back-propagation learning has been applied successfully in a number of signal
processing applications with both recurrent and feedforward networks [109], [111]. In
conventional back-propagation learning [6], the weights are updated using the following
equations:
$$\Delta w_{ij} = \varepsilon\, \delta_i\, o_j \qquad (3.5)$$
$$\delta_i = \zeta_i\, f'(u_i) \qquad (3.6)$$
The weights in the output layer adopt Equation (3.7)
$$\zeta_i = t_i - o_i \qquad (3.7)$$
while the weights in the hidden layer adopt Equation (3.8)
$$\zeta_i = \sum_k w_{ki}\, \delta_k \qquad (3.8)$$
where $u_i$ and $o_i$ are the input and output of a neuron i, $w_{ij}$ is the synaptic weight
from neuron j to neuron i, $f$ is the activation function, $\varepsilon$ is the learning rate and $t_i$ is
the target. Since $\zeta_i$ is fed to the output unit and back-propagates towards the input layer,
$\zeta_i$ is the back-propagation error signal.
Offset errors arising from the real learning circuits, especially multipliers,
correspond to biases, which have fatal effects on the learning performance [113]. It has
been shown that offset errors above 1% normalised by the full-scale of weights degrade
the learning success rate by 50% in small problems, such as learning of the exclusive-OR
problem. The offset-error effects arise not only in back-propagation learning, but also in
other error-correction learning where the difference between targets and network outputs
must be minimised.
These effects are minimised by using contrastive back-propagation, which splits
the learning stage into 2 phases. Instead of using the difference between the target and
output as the error signal, the target and output values are adopted as the error signals in two
different phases. The weight changes for both phases are calculated and the net weight
change is obtained by subtraction. In the first phase, the back-propagation error
signal $\zeta_i$ is expressed by Equation (3.9),
$$\zeta_i = t_i \qquad (3.9)$$
while in the second phase, it is replaced by Equation (3.10).
$$\zeta_i = o_i \qquad (3.10)$$
The weight change for the first phase and second phase are calculated according 
to Equation (3.5), (3.6) and (3.7). The net weight change is obtained by the subtraction 
of the two weight values. The same probabilistic steepest descent gradient sequence as 
the conventional back-propagation learning is obtained only when the weights are 
updated by the total amount of modification in a set of phases. However, as long as the 
learning rate is small enough, probabilistic steepest descent is performed even when 
weights are sequentially updated in each phase. In addition, it is not necessary to hold
input signals constant during two consecutive learning phases. It is sufficient for 
successful contrastive back-propagation learning that the input patterns be presented with 
the same probability in both learning phases. This learning procedure requires no extra 
memory to accumulate weight changes in each phase for hardware implementation [114].
This situation is the same as in the deterministic Boltzmann machine. Therefore,
contrastive learning is a back-propagation version of conventional learning.
In contrastive back-propagation learning, the offset errors mentioned above are
largely cancelled out, since the subtraction operation between the two learning phases tends to
cancel most offset errors in analogue circuit components. In fact, it can be shown that
contrastive learning can handle systems with large offset errors.
The difference between contrastive and conventional back-propagation learning
networks is very small as far as circuit configuration is concerned, but the effect is
extremely large in terms of cancelling out the offset errors.
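The cancellation can be illustrated numerically for a single output-layer weight. The sketch below assumes a sigmoid activation and models the circuit imperfection as one additive offset at the multiplier output that is identical in both phases. This is a simplifying assumption made for illustration, not a model of the circuits presented in this thesis, and all function names are hypothetical.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def weight_change(error_signal, u, o_in, rate, offset):
    """Weight change for one weight, with an additive multiplier offset (assumed)."""
    o = sigmoid(u)
    delta = error_signal * o * (1.0 - o)   # error signal times f'(u) for a sigmoid
    return rate * (delta * o_in + offset)  # offset models the circuit error

def conventional_update(target, u, o_in, rate, offset):
    # Error signal is the difference between target and output
    return weight_change(target - sigmoid(u), u, o_in, rate, offset)

def contrastive_update(target, u, o_in, rate, offset):
    phase1 = weight_change(target, u, o_in, rate, offset)      # error signal = target
    phase2 = weight_change(sigmoid(u), u, o_in, rate, offset)  # error signal = output
    return phase1 - phase2  # the common offset term drops out here
```

Under this assumption the contrastive update equals the offset-free conventional update, whereas the conventional update is shifted by the learning rate times the offset.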
3.5 Time-Delay Neural Networks
Time-delay neural networks (TDNNs) have been particularly useful for phoneme
recognition systems. To be useful in tackling such complex recognition tasks, the neural
network must have the following properties. First, it must have multiple layers and sufficient
interconnections between the units of the different layers. Secondly, the network must
have the capability of representing relationships between events in time. These events
would normally be spectral coefficients. Thirdly, the actual features or abstractions
learned by the network should be invariant under translation in time. Fourth, the learning 
procedure should not require precise temporal alignment of the labels that can be learned. 
Finally, the number of weights in the network should be sufficiently small compared to 
the amount of training data so that the network is forced to encode the training data by 
extracting regularity.
The basic unit used in many neural networks computes the weighted sum of its 
inputs and passes this sum through a non-linear function. In TDNNs, the basic unit is
modified by introducing delays. Figure 3-5 shows a time-delay neural network 
computational element, where U represents the input neuron, D represents a delay, W is 
the weight value and F is the chosen activation function.
Figure 3-5 A time-delay neural network computational element
Using this kind of architecture, the TDNN unit has the ability to compare the 
current input to the past history of events. The sigmoid function is the preferred
non-linear function that is normally adopted.
For phoneme recognition, a four-layer net is normally adopted. One such 
architecture was proposed by Waibel et al. [54]. 16 normalised mel scale cepstral
coefficients were used as input to the network. The input layer was then fully
interconnected to a layer of 8 time-delay hidden units, in this case over 3 frames with
time delay 0, 1, and 2. This architecture was adopted for recognising utterances of “B”, 
“D” and “G”.
In the second hidden layer, each of 3 TDNN units looks at a 5 frame window of 
activity levels in the hidden layer. The idea of choosing a larger frame window in this 
layer rests on the intuition that higher level units should learn to make decisions over a
wider range in time based on more local abstractions at lower levels.
Finally, the output is obtained by integrating the evidence from each of the 3 units 
in hidden layer 2 over time and connecting it to its pertinent output unit.
Back-propagation learning was adopted to train the network.
To achieve the desired learning behaviour, we need to ensure that the network is 
exposed to sequences of patterns and that it is encouraged to learn about the most
powerful cues and sequences of cues among them. Each collection of TDNN units 
described above is duplicated for each one frame shift in time. In this way, the whole 
history of activities is available at once. Since the shifted copies of the TDNN are mere 
duplicates, the weights of the corresponding connections in the time shifted copies must 
be constrained to be the same. Of course, this applies to all connections and all time shifts 
and in this way, the network is forced to discover useful acoustic features in the input, 
regardless of when in time they actually occurred. This is an important property, as it
makes the network independent of error-prone pre-processing algorithms that otherwise
would be needed for time alignment and/or segmentation.
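The effect of constraining the time-shifted copies to share the same weights can be sketched as follows. A single TDNN unit sees the current frame and a number of delayed frames, and the same unit is applied at every frame shift. The function names are hypothetical and the sigmoid is assumed as the activation function.

```python
import math

def tdnn_unit(history, weights, bias=0.0):
    """history[d][i] is input i delayed by d frames; weights has the same shape."""
    total = bias
    for delayed_inputs, delay_weights in zip(history, weights):
        total += sum(w * u for w, u in zip(delay_weights, delayed_inputs))
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation

def tdnn_over_time(frames, weights, bias=0.0):
    """Apply the same unit, with the same weights, at every one-frame shift."""
    n_delays = len(weights)
    outputs = []
    for t in range(n_delays - 1, len(frames)):
        history = [frames[t - d] for d in range(n_delays)]
        outputs.append(tdnn_unit(history, weights, bias))
    return outputs
```

Because the weights are shared, shifting an input event by one frame shifts the sequence of unit outputs by one frame without changing its values, which is the translation invariance described above.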
The learning procedure is computationally intensive and to improve the 
performance, operations are vectorised. Also, the learning time can be improved by using 
a staged learning strategy. In this case, we start optimising the network based on 3 
prototypical training tokens to provide rapid convergence. Once the convergence is
complete, the network is presented with approximately twice the number of tokens and 
learning continues until there is full convergence. The process is repeated until the 
network has been presented with all the training tokens.
One point to note is that the structure of TDNNs is very simple and is ideal for 
VLSI implementations.
3.6 Self-Organising Networks
In self-organising networks, four properties are required [55]:
• The weights in the neurons should be representative of a class of patterns.
• Input patterns are presented to all of the neurons and each neuron
produces an output. The value of the output of each neuron is used
as a measure of the match between the input pattern and stored
pattern in the neuron.
• A competitive learning strategy selects the neuron with the largest response.
• A method of reinforcing the largest response is required.
Self-organising maps are good examples of Kohonen networks. Their aim is to 
produce a network where the weights represent the co-ordinates of some kind of
topological system and the individual elements in the network are arranged in an ordered
way.
Initially, each weight is set to some random number. A pattern vector is then 
presented to the system. The vector is simultaneously compared to all the elements in the 
network and the one with the lowest Euclidean distance is selected. The Euclidean
distance is given by:
$$d_j = \sqrt{\sum_{i=1}^{n} (x_i - w_{ij})^2} \qquad (3.11)$$
where x refers to the input vector and w refers to the weight vectors. The element
with the lowest Euclidean distance is denoted by c. The elements neighbouring c are also
defined as being those elements which lie within a distance $N_c$ from c. Having identified
the element c, the centre of the neighbourhood, and the elements that are included in the
neighbourhood, the weights of those elements are adjusted using Kohonen learning.
The weights of all other elements are left alone [56]. The formula for weight adjustment
using Kohonen learning is given by:
$$\Delta w_{ij} = k\,(x_i - w_{ij})\, y_j \qquad (3.12)$$
where $y_j = 1$ if $\mathrm{net}_j > \mathrm{net}_i$ for all i, $i \neq j$
$y_j = 0$ otherwise
k is a constant. In the extreme case, k=1, the weights in a particular neuron will
be adjusted so that they are identical to the inputs. Alternatively, with k<1, the weights
will change in a way that makes them more like the input patterns, but not identical. After 
training, the weights should be representative of the distribution of the input patterns.
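One Kohonen learning step can be sketched as below. For brevity the elements are assumed to lie on a one-dimensional chain, so that the neighbourhood of the winning element c is simply the set of elements whose index is within `radius` of c. A two-dimensional map would use grid distances instead. The function names are hypothetical.

```python
import math

def euclidean(x, w):
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))

def som_step(weights, x, k, radius):
    """One Kohonen update on a 1-D chain of elements; returns the winner index c."""
    # The element with the lowest Euclidean distance to the input wins
    c = min(range(len(weights)), key=lambda j: euclidean(x, weights[j]))
    for j, w in enumerate(weights):
        if abs(j - c) <= radius:  # element j lies in the neighbourhood of c
            weights[j] = [wi + k * (xi - wi) for xi, wi in zip(x, w)]
    return c
```

With k = 1 the updated weights become identical to the input, and with k < 1 they move only part of the way towards it, as described above. Elements outside the neighbourhood are left unchanged.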
The decisions about the size of the neighbourhood $N_c$ and the value for k are very important. First of
all, both must decrease with time and there are several ways of doing this. They could
decrease linearly with time. However, it has been pointed out that there are two distinct
phases - an initial ordering phase, in which the elements find their correct topological
order and a final convergence phase in which the accuracy of the weights improves [57].
In the initial phase, k decreases linearly from 0.9 to 0.01 approximately, while $N_c$
decreases linearly from half the diameter of the network to one spacing. During the final
stage, k may decrease from 0.01 to 0, while the neighbourhood size $N_c$ stays at one spacing, with the final stage
taking up to 100 times longer than the initial stage.
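The two-phase schedule can be written down as a small helper. The sketch below assumes linear interpolation in both phases and a final phase exactly 100 times longer than the ordering phase, which is the upper figure quoted above. The function names and the parameter `final_factor` are hypothetical.

```python
def linear(start, end, step, n_steps):
    """Linear interpolation from start (at step 0) to end (at step n_steps)."""
    return start + (end - start) * step / n_steps

def schedule(step, ordering_steps, network_diameter, final_factor=100):
    """Return (k, neighbourhood radius) for a given training step."""
    if step < ordering_steps:
        # Initial ordering phase
        k = linear(0.9, 0.01, step, ordering_steps)
        radius = linear(network_diameter / 2.0, 1.0, step, ordering_steps)
    else:
        # Final convergence phase: k decays to 0, radius stays at one spacing
        final_steps = final_factor * ordering_steps
        elapsed = min(step - ordering_steps, final_steps)
        k = linear(0.01, 0.0, elapsed, final_steps)
        radius = 1.0
    return k, radius
```

For a network ten elements across and an ordering phase of 100 steps, the schedule starts at k = 0.9 with a radius of 5 and reaches k = 0.01 with a radius of 1 at step 100.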
Several researchers have applied the clustering technique of self-organising maps to
the problem of speech recognition [58, 59]. Mel-scale cepstrum coefficients have been
used as input to the networks. The idea behind such implementations is that each
phoneme has its subspace and using a self-organising map a 2-dimensional projection of
these clusters is realised. This projection forms a phonetic map, where each phoneme has
its own cluster. During recognition, a path is formed over this map representing the
spoken words. Such systems have been adopted for Dutch, Finnish and Japanese
languages. To obtain high recognition rates, post processing, namely Hidden Markov
modelling and vocabulary matching, has been adopted.
3.7 Radial Basis Function Networks
The basic architecture for a radial basis function (RBF) network is a three-layer network.
The input layer is simply a fan-out layer and does no processing. The hidden layer
performs a non-linear mapping from the input space into a usually higher-dimensional
space in which the patterns become linearly separable [60]. The final layer performs a
weighted sum with a linear output. However, for pattern classification systems the
sigmoid function is more appropriate so that the output neurons could output 0 or 1 values.
The unique feature of the RBF network is the process performed in the hidden 
layer. The idea is that the patterns in the input space form clusters. If the centres of the
clusters are known, then the distance from the cluster centre can be measured.
Furthermore, the distance measure is made non-linear, so that if a pattern is in an area that
is close to a cluster centre then it gives a value close to 1. Beyond this area, the value
drops dramatically. The notion is that this area is radially symmetrical around the cluster
centre, so that the non-linear function becomes known as the radial basis function. The most
commonly used function is given by:
$$\phi(r) = \exp\left(-\frac{r^2}{2\sigma^2}\right) \qquad (3.13)$$
where r is the distance from the cluster centre. The equation represents a Gaussian
curve. The distance measured from the cluster centre is usually the Euclidean distance.
For each neuron in the hidden layer, the weights represent the co-ordinates of the centre
of the cluster [61]. Therefore, when that neuron receives an input pattern, X, the distance 
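is found using the equation
$$r_j = \sqrt{\sum_{i=1}^{n} (x_i - w_{ij})^2} \qquad (3.14)$$
So the output of neuron j in the hidden layer is:
$$\phi_j = \exp\left(-\frac{\sum_{i=1}^{n} (x_i - w_{ij})^2}{2\sigma^2}\right) \qquad (3.15)$$
The variable $\sigma$ defines the width of the curve and can be determined empirically.
When the distance from the centre of the Gaussian reaches $\sigma$, the output drops from 1 to
approximately 0.6 [62].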
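The figure of 0.6 can be checked directly. Assuming the Gaussian form exp(-r²/(2σ²)) for the hidden-neuron response, the output at a distance of exactly one width from the centre is exp(-1/2), which is approximately 0.607. The function name `rbf_output` in the sketch below is hypothetical.

```python
import math

def rbf_output(x, centre, sigma):
    """Gaussian response of one hidden neuron to the input pattern x."""
    # Euclidean distance between the input and the stored cluster centre
    r = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, centre)))
    return math.exp(-r ** 2 / (2.0 * sigma ** 2))
```

An input that coincides with the centre gives an output of exactly 1, and the output falls towards 0 as the input moves further away.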
The weights of the neurons of the hidden layer are found using a clustering
algorithm such as the k-means algorithm or Kohonen learning. Training is unsupervised,
but the number of clusters that is expected is known in advance [63]. The algorithms then
find the best fit to these clusters. Input patterns are presented to all the cluster centres one
at a time and the cluster centres are adjusted after each pattern. The cluster centre that is
nearest to the input data wins and is shifted slightly towards the new data.
Having established the cluster centres, the next step is to determine the radius of
the Gaussian curves. This is done using the P-nearest neighbour algorithm. A number P is
chosen and for each centre, the P nearest centres are found. The root mean squared
distance between the current cluster centre and its P nearest neighbours is calculated and
this is the value chosen for $\sigma$. So, if the current cluster centre is $c_j$, the value is:
$$\sigma_j = \sqrt{\frac{1}{P} \sum_{i=1}^{P} \lVert c_j - c_i \rVert^2}$$
where the $c_i$ are the P nearest neighbouring centres.
A typical value for P is 2, in which case $\sigma$ is set to be the average distance from
the two nearest neighbouring cluster centres [64].
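The P-nearest neighbour calculation can be sketched in a few lines. The sketch takes the width of centre j to be the root mean squared Euclidean distance to its P nearest neighbouring centres, as described above. The function name `p_nearest_sigma` and the example centres are hypothetical.

```python
import math

def p_nearest_sigma(centres, j, p=2):
    """RMS distance from centre j to its p nearest neighbouring centres."""
    # Squared Euclidean distances from centre j to every other centre, sorted
    squared = sorted(
        sum((a - b) ** 2 for a, b in zip(centres[j], other))
        for i, other in enumerate(centres) if i != j
    )
    return math.sqrt(sum(squared[:p]) / p)

# Hypothetical cluster centres in a two-dimensional input space
CENTRES = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0], [10.0, 10.0]]
```

For centre 0 of `CENTRES` the two nearest neighbours lie at distances 1 and 5, giving a width of the square root of 13, roughly 3.6.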
The weights of the output layer can be trained using a standard gradient descent
algorithm [65].
Radial-basis function neural networks, together with multi-layer perceptrons, have
been successfully used as probability estimators in Hidden Markov Model continuous
speech recognition systems and isolated word recognition systems. Simulation results
have shown that both the RBF networks and MLP systems obtained comparable
recognition rates on networks taking single frame perceptual linear prediction coefficients
as inputs [24, 66]. In these simulations, the RBFs were determined by a k-means
clustering process and training of the output weights was performed using
back-propagation.
Experiments have also been performed in which the coefficients of a tied mixture 
density HMM were used to initialise an RBF network, which is then trained using the 
scheme mentioned above. However, this additional training did not result in an improved 
performance [66].
3.8 Biological Neurons and their Artificial Models
A human brain consists of approximately $10^{11}$ computing elements called 
neurons. They communicate through a connection network of axons and synapses having 
a density of approximately $10^{4}$ synapses per neuron. The hypothesis regarding the modelling of the 
natural nervous system is that the neurons communicate with each other by means of 
electrical impulses [67]. The brain can be considered to be a densely connected electrical 
switching network conditioned largely by the biochemical processes. The vast neural 
network has an elaborate structure with very complex interconnections. The input to the 
network is provided by sensory receptors. Receptors deliver stimuli both from within the 
body, as well as from sense organs when the stimuli originate in the external world. The 
stimuli are in the form of electrical impulses that convey the information into the network 
of neurons. As a result of information processing in the central nervous system, the 
effectors are controlled and give human responses in the form of diverse actions [67]. 
This is shown in Figure 3-6.
[Block diagram: Receptors (Sensory Organs), Central Nervous System, Effectors (Motor Organs), with Internal Feedback and External Feedback paths]
Figure 3-6 Information Flow in the Nervous System
The elementary nerve cell, called a neuron, is the fundamental building block of 
the biological neural network. Its schematic diagram is shown in Figure 3-7 and shows 
three major regions: the cell body or soma, the axon and the dendrites.
[Schematic of a biological neuron, labelled: Dendrite, Nucleus, Axon hillock, Axon, Terminal buttons]
Figure 3-7 Neuron Schematic Diagram [67]
Dendrites form a dendritic tree, which is a very fine bush of thin fibres around the 
neuron’s body. They receive information from neurons through axons -  long fibres that 
serve as transmission lines. Each branch of the axon splits into a fine arborisation and
each branch of it terminates in a small end bulb almost touching the dendrites of 
neighbouring neurons. The axon-dendrite contact organ is called a synapse. The synapse 
is where the neuron introduces its signal to the neighbouring neuron. The signals reaching 
a synapse and received by dendrites are electrical impulses. Figure 3-8 shows the 
synapse. The receiving neuron will then either generate an impulse to its axon or produce 
no response [68].
The neuron is able to respond to the total of its inputs aggregated within a short 
time interval and the response is generated if the total potential of its membrane reaches a 
certain limit.
The first formal definition of a synthetic neuron model based on the highly 
simplified considerations of the biological model described above was formulated by 
McCulloch and Pitts [69]. The model of the neuron is shown in Figure 3-9.
Figure 3-8 Synapse Diagram [68]
The inputs are denoted by x, while the neuron output is denoted by o. The weights w denote the 
synaptic strengths, while T is the neuron’s threshold value, which needs to be exceeded 
by the weighted sum of the signals for the neuron to fire.
[Diagram: inputs x1, x2, x3 weighted by w1, w2, w3 and fed to a threshold unit]
Figure 3-9 McCulloch-Pitts Model Neuron
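The behaviour of the model neuron can be illustrated with a short Python sketch. It follows the description given above, in which the neuron fires only when the weighted sum exceeds the threshold T; the function name and the choice of weights and threshold for a two-input AND gate are illustrative assumptions.

```python
def mcculloch_pitts(x, w, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold T, else 0."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if weighted_sum > threshold else 0

# Two-input AND gate: both inputs must be active for the sum to exceed T = 1.5.
and_outputs = [mcculloch_pitts([a, b], [1, 1], 1.5) for a in (0, 1) for b in (0, 1)]
print(and_outputs)  # [0, 0, 0, 1]
```

Lowering the threshold to 0.5 with the same weights would turn the unit into an OR gate, which shows how the weights and threshold together fix the function computed.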
3.9 Automatic Speech Recognition with Neural Networks
A number of publications have proposed neural networks as building blocks for 
speech recognisers [20, 21, 24, 58, 59, 66]. Their function is to provide a statistical model 
capable of associating a vector of speech features with the probability that the vector 
could represent any of a given number of phonemes. However, today’s knowledge of the 
speech recognition process is still very limited and fully connectionist models are not 
normally used. In fact, researchers are today combining the best features of neural 
modelling with other statistical approaches such as Hidden Markov Models.
The first problem for any automatic speech recogniser is finding an appropriate 
representation of the speech signal, that is, feature extraction, which has been extensively 
covered in Chapter 2. Many speech recognition systems use some kind of variation of the 
Fourier coefficients, which provide information about the shape of the vocal tract. The two 
most popular alternatives are cepstral coefficients and linear predictive coding.
In speech recognition, researchers postulate that the vocal tract shapes can be 
quantised in a discrete set of states roughly associated with the phonemes which compose 
speech [20, 21]. But when speech is recorded the exact transitions in the vocal tract 
cannot be observed and only the produced sound can be measured at some predefined 
time intervals. These are the emissions, and the states of the system are the quantised 
configurations of the vocal tract. From the measurements we want to infer the sequence 
of states of the vocal tract, i.e. the sequence of utterances which gave rise to the recorded 
sounds. As mentioned in section 3.2, this problem can easily be overcome with a Hidden 
Markov Model, where a recursive algorithm is used to compute the most probable
sequence of state transitions which could have produced the recorded sequence of output 
values [24, 58].
Neural networks are normally adopted as classifier networks to compute the 
probability that any given set of phonemes could correspond to a given spectrum [20, 21]. 
The speech signal is divided into windows of approximately 10 ms length and feature 
extraction is carried out to compute an N-dimensional vector of coefficients. The 
networks are then trained to associate the coefficients with the probability of each 
phoneme occurring in a speech segment. In many recent experiments, the networks are 
normally fed with coefficients representing between 10 and 15 windows of speech and the 
network will end up with a number of output nodes, each providing the probability of 
utterance of a given phoneme. The network needs to be trained with labelled speech data 
and once trained the network can be used to compute the emission probabilities of the 
different phonemes [21].
Once a set of emission probabilities has been computed for several time frames it 
is necessary to compute the most probable path of transitions of the vocal tract and 
emissions. This can be done by dynamic programming methods such as dynamic time 
warping.
When combined HMM/NN systems are proposed, the role of the neural network 
has been identified as that of providing the probabilities of emission, while the Markov 
model is to reduce the number of alternatives that need to be considered [24].
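The division of labour in such a combined system can be sketched with a small Viterbi decoder, in which the per-frame emission scores stand in for the outputs of a trained network. This is a minimal sketch under stated assumptions: the two-state model, the transition probabilities and the emission values are invented for illustration, and practical systems add further steps that are not shown here.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most probable state path given per-frame log emission scores.

    log_emit:  (T, S) log-probability of each state (phoneme) at each frame,
               assumed here to come from the neural network outputs.
    log_trans: (S, S) log transition probabilities, row = from, column = to.
    log_init:  (S,)   log initial state probabilities.
    """
    n_frames, n_states = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        candidates = score[:, None] + log_trans      # indexed (from, to)
        back[t] = np.argmax(candidates, axis=0)      # best predecessor of each state
        score = candidates[back[t], np.arange(n_states)] + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

emit = np.log(np.array([[0.9, 0.1], [0.4, 0.6], [0.1, 0.9]]))
trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
init = np.log(np.array([0.5, 0.5]))
path = viterbi(emit, trans, init)
print(path)  # [0, 1, 1]
```

The transition matrix plays the role attributed to the Markov model above: paths with unlikely transitions are penalised, so fewer alternatives remain competitive.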
3.10 Hardware Implementations of Neural Networks
There are various criteria based upon which one can categorise neural hardware. 
For instance, one such criterion could be the type of signals on which the networks 
operate. Correspondingly, the networks can be described as pulse, digital, analogue or 
mixed-mode implementations. If the flow of information is considered, then the 
architectures can be characterised as feedforward or feedback type. Another important 
consideration is the training process -  whether training is implemented on-chip or off- 
chip. Together with these, the processing functions implemented in the synaptic and 
neural units are important. These include linear multiplicative, Gaussian, probabilistic or 
competitive learning synapses. Table 3-1 shows the different criteria characterising 
hardware neural networks.
Characteristics                  Hardware Settings
Type of Electrical Signal        Digital, Pulse, Analogue, Mixed Mode
Flow of Information              Feed-Forward, Feedback
Training Method                  On-chip, Off-chip
Type of Synaptic Unit            Linear, Distance Measure, Gaussian, Quadratic
Implementation Technology        CMOS, EEPROM, BiCMOS
Physical Implementation Medium   Chemical, Optical, Electrical, Opto-Electronics
Storage of Synaptic Weights      Digital, Analogue Dynamic, Analogue Non-Volatile
Generality of Usage              General-Purpose, Specific Class of Applications,
                                 Dedicated to a Specific Application
Performance vs. Flexibility      Network of Optical Neurons, Network of Electronic
                                 Neurons, Dedicated Multi-Neuron Processors,
                                 General Purpose Parallel Machines, Generic Neural
                                 Processors, Optimised RISC Processors, Serial
                                 Conventional Computers
Table 3-1 Taxonomy of Neural Hardware
3.10.1 Pulse Coded Implementations
In a biological system, the output of neurons consists of pulses, whose frequency 
reflects the output activity level of a neuron. Recognition of this fact has inspired a class 
of pulse-signal neural networks, which are designed to work with pulses of variable 
frequency, rather than with discrete or continuous levels. This approach also introduces 
the possibility of using a digital CMOS process for asynchronous analogue circuits. A 
wafer-scale neural network using this approach with gate-array representation was 
presented by Hitachi [70].
3.10.2 Digital Implementations
Table 3-2 illustrates the relationship between various digital implementation 
techniques used for neural networks. The first and most widely used is the simulation of
neural networks on a conventional serial computer. It is the most general-purpose 
implementation and can equally well simulate different kinds of networks. However, it 
has the lowest speed performance. Here, parallelism is non-existent and regularity and 
repeated structures of either regular or training data have a minimum impact. Of course, 
the Reduced-Instruction-Set Computer (RISC) processor provides an improvement in the 
performance. A related option is to use a specialised RISC CPU, whose internal structure 
has been optimised to take full advantage of the regularity and locality of data in a neural 
algorithm. The maximum possible performance of a RISC structure can be obtained by 
taking advantage of deep pipelining and the availability of data in fast register files.
In the class of general parallel machines [71-76], several particular developments 
can be identified, namely:
• Special machines designed for matrix operations 
• Supercomputers
Parallel array machines are more specific solutions with less flexibility, but can be 
flexible enough to be called general solutions. For the implementation of each specific 
neural algorithm, the training and data-processing procedures should be written in a 
single assignment language and then mapped onto an existing machine having a specific 
number of processors. In these implementations, parallelism at a synapse-, neuron- or 
training-set-level can be employed. While the concurrent use of parallelism at all levels 
theoretically brings the highest performance, it is not practical in terms of the hardware 
requirement. Therefore, mixed parallelism at the neuron- and training-set-levels is 
normally employed.
The last group of digital hardware implementations to be considered are those 
having innovative network architecture developed specifically for solving 
neuro-computation problems and whose structures are not very similar to conventional 
computers. Some of these chips have been designed for a specific family of neural 
algorithms.
Types of Computers                          Available Technology
Conventional Serial Computers               Ordinary Microprocessors
                                            Ordinary RISC Processors
                                            Optimised RISC Processors
                                            Dedicated Co-Processors
General Purpose Parallel Machines           Multiprocessors
                                            Systolic Arrays
                                            Message-Driven Machines
Parallel Machines Developed Explicitly      Dedicated Digital Neuro-Chips
for Neural Networks
Table 3-2 Digital Implementations of Neural Networks
3.10.3 Analogue Implementations
As noted by Vittoz [77], manipulation of tasks related to perception is expected to 
become the area in which analogue signal processing will remain superior to an all-digital 
approach. As can be seen in biological neural systems, what is required for perception is a 
massively parallel collective processing of large numbers of low-accuracy signals, which 
vary continuously in time and in amplitude. Such massive processing of low-precision 
signals can be handled ideally by VLSI analogue hardware.
Dedicated analogue-hardware implementations normally operate at the high end 
of the speed performance spectrum and with a few exceptions, at the low-end of 
flexibility. Generally speaking, they are examples of the fact that the more one tunes the 
hardware and on-chip resources for a specific task, the more one will gain in performance 
for that specific task.
A good example of a general-purpose analogue hardware implementation is the 
analogue neural network arithmetic and logic unit (ANNA), designed at AT&T Bell 
Laboratories [78]. Another example is the ETANN neural network chip implemented by 
INTEL, which incorporates 64 neurons and 10,240 synapses [79].
Simple fixed resistive weighting can be achieved using an op amp with strong 
negative feedback for each of the neurons and with a resistive input at each synapse
location. Bell Laboratories have designed one such implementation having 256 neurons 
and 100,000 synapses [80]. The difficulties with this implementation are the shape of 
their characteristics, excessive power consumption and the large area occupied by the 
resistors and op amps. Also, weights are encapsulated in the resistances of discrete fixed 
resistors, which are difficult to adjust or control. An improvement to the system was 
attained using voltage-controlled resistors, implemented with MOS transistors operating in the triode 
region. Some other useful techniques were originally developed to implement linear 
MOS resistances in continuous-time filters [81]. A four-MOS-transistor block has been 
adopted together with an op amp, in a scheme in which nonlinear effects of transistors 
compensate for each other and lead to the effect of a linear resistor over a wide range of 
operation. This architecture has been used to design four-quadrant continuous-time 
multipliers, which have also found some application in the implementation of basic neural 
elements.
Tsividis and Anastassiou [82] have explored the possibility of employing 
capacitance-based techniques in a switched-capacitor neural design. The sensing scheme 
they proposed necessitates two to four transmission gates per synapse in order to 
implement both excitatory and inhibitory connections in the same synaptic matrix. Also, 
some programmable switched-capacitor neural networks have been reported to solve 
some nonlinear programming problems [83].
Current-mode signal processing techniques have been used to implement current-mode 
artificial neural networks [84]. The neural modules involve transconductors, 
integrators, summers and multipliers in a manner that is tolerant to device deviations. 
Current-mode signal processing offers several advantages when used in neural circuits. 
One of the most apparent advantages is that the summing of many signals is most readily 
accomplished when these signals are currents. Other advantages are increased frequency 
of operation due to the use of low-impedance internal nodes and the increased dynamic 
range of signals allowed when MOS transistors can be operated over a wide range, from 
weak inversion to strong inversion.
Carver Mead [3] has evolved a great many analogue-neural CMOS design 
methods, circuits and devices based on sub-threshold or weak inversion MOSFET 
operation. In an MOS transistor, the amount of current flowing from source to drain is 
controlled by an electric field from an applied voltage at the transistor's gate. The electric
field attracts charge carriers from the substrate. A higher voltage and stronger electric 
field translates to more current flowing through the transistor.
In a semiconductor, there are two modes by which current can flow: diffusion and 
drift. Diffusion is the natural flow of particles from higher to lower concentration, and 
drift is the flow of particles subject to an applied force. In the weak inversion region, the 
inversion layer charge, $Q_i$, is effectively zero, so that $|Q_i| \ll |Q_D|$, where $Q_D$ is the 
depletion layer charge. The drain-to-source current still flows at the surface underneath 
the gate oxide, but it now practically flows in the depletion layer. The MOS structure now 
effectively exhibits the behaviour of a source-substrate-drain bipolar transistor with the 
channel corresponding to the base. The drain current is dominated by diffusion instead of 
drift, and is given by:
$$I_D = -qAD_n \frac{dn}{dy} = qAD_n \frac{n(0) - n(L)}{L} \qquad (3.17)$$
where A is the channel cross-section of the current flow, $D_n$ is the diffusion 
constant, n(0) and n(L) are the electron densities in the channel at the source and the 
drain, respectively, and L is the channel length.
$$n(0) = n_i\, e^{(\phi_s - \phi_F)/V_t} \qquad (3.18)$$
$$n(L) = n_i\, e^{(\phi_s - \phi_F - V_{DS})/V_t} \qquad (3.19)$$
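where $n_i$ is the intrinsic carrier concentration, $\phi_F$ is the bulk potential and $\phi_s$ is the 
surface potential at the source, which is approximately $V_{GS} - V_T$. Substituting:
$$I_D = \frac{qAD_n n_i}{L}\, e^{-\phi_F/V_t} \left(1 - e^{-V_{DS}/V_t}\right) e^{\phi_s/V_t} \qquad (3.20)$$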
This corresponds to the weak inversion current equation (3.21), where $V_t = kT/q$. 
Taking into account the process parameter $n = 1 + C_D/C_{ox}$, where $C_D$ is the depletion 
layer capacitance (effectively the bulk-to-channel capacitance), leads to equation (3.22), 
which introduces a process constant.
[Equations (3.21) and (3.22) are illegible in the source.]
For $V_{BS} = 0$ and $V_{DS} > 3V_t$, the drain current simplifies to:
$$I_D = \frac{W}{L}\, I_0\, e^{V_{GS}/(nV_t)} \qquad (3.23)$$
where $I_0$ is the process constant and $n \approx 1.5$.
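The exponential dependence of the weak inversion current on the gate voltage can be checked numerically with the sketch below. It uses the commonly quoted form of the weak inversion equation, which includes a factor (1 - exp(-V_DS/V_t)) that tends to one for large drain voltages; this form, the value of the process constant and all names are assumptions made for illustration and are not taken from the derivation above.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C

def weak_inversion_current(v_gs, v_ds, w_over_l=1.0, i_0=1e-15, n=1.5, temp=300.0):
    """Drain current (A) in weak inversion with the source tied to the bulk."""
    v_t = K_BOLTZMANN * temp / Q_ELECTRON      # thermal voltage, about 25.9 mV at 300 K
    saturation = 1.0 - math.exp(-v_ds / v_t)   # approaches 1 once v_ds exceeds a few v_t
    return w_over_l * i_0 * math.exp(v_gs / (n * v_t)) * saturation

v_t = K_BOLTZMANN * 300.0 / Q_ELECTRON
# Raising the gate voltage by n * v_t multiplies the current by e.
ratio = weak_inversion_current(0.3 + 1.5 * v_t, 0.5) / weak_inversion_current(0.3, 0.5)
# At v_ds = 3 * v_t the drain-voltage factor is already within 5% of its limit.
factor_at_3vt = 1.0 - math.exp(-3.0)
print(ratio, factor_at_3vt)
```

The second quantity indicates why a drain-source voltage of about three thermal voltages is taken as the point beyond which the drain voltage can be neglected.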
3.10.4 Weight Storage
Weights can be programmed and stored digitally in registers. Once the weight 
values have been set digitally, some sort of digital-to-analogue converter is required at 
each interconnection.
The analogue connection strength can also be stored as a charge packet on a 
capacitor. Several research groups have used this concept, which has the potential for 
variable weight values with a relatively high resolution and small cell size. Both MOS 
and inter-poly capacitors have been reported. However, this technique requires refreshing, 
since the charge on the capacitor will gradually leak away. Refreshing analogue values 
requires considerable overhead that may offset much of the advantages gained in smaller 
interconnection and cell sizes.
Most refreshing schemes rely on quantising the weights to recharge the weight 
capacitors. One of the more obvious approaches to do so is to use a digital RAM backup 
memory, from which the capacitors are periodically refreshed via a digital-to-analogue converter. 
Serial access to the weight capacitors and words in the RAM would be required. 
However, serial access to the uploading of weights is a severe limitation for a system with 
hardware learning. It is also possible to employ a quantise-regenerate refreshing scheme, 
where the voltage on the weight capacitor is periodically compared to a discrete number 
of reference voltages and the capacitor voltage is regenerated to the closest reference. 
Unfortunately, this system is not well suited for learning systems because of the required 
high resolution of the weights. An altogether different approach to refreshing relies on the
presence of a learning scheme that is refreshed by re-learning. During an idle phase of the 
neural network, it is trained using an epoch of the original training data, thus restoring the 
weights. The obvious disadvantages with this approach are that the network cannot run 
continuously and that the whole training set needs to be stored in the system.
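The quantise-regenerate scheme mentioned above can be illustrated with a small sketch. The number of reference levels, their spacing and the amounts of leakage are hypothetical values chosen only to show the principle.

```python
def regenerate(voltage, levels):
    """Snap a drooped capacitor voltage back to the closest reference voltage."""
    return min(levels, key=lambda ref: abs(ref - voltage))

# Hypothetical set-up: 8 reference levels from 0 V to 3.5 V in 0.5 V steps.
levels = [0.5 * k for k in range(8)]
stored = 2.0                  # weight voltage written to the capacitor
drooped = stored - 0.12       # assumed leakage since the last refresh
restored = regenerate(drooped, levels)
lost = regenerate(stored - 0.30, levels)  # droop beyond half a step corrupts the weight
print(restored, lost)  # 2.0 1.5
```

The sketch also makes the limitation visible: only as many distinct weight values exist as there are reference levels, which is why the scheme is poorly suited to learning systems that need finely resolved weights.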
The above problems can be solved using floating-gate transistors [85], which 
provide an adjustable, non-volatile analogue memory. These are shown in Figure 3-10.
[Cross-section of a floating-gate transistor, labelled: control gate, floating gate, drain, source, bulk]
Figure 3-10 Floating Gate Technology [85]
3.10.5 Multipliers
Unlike analogue memories, analogue multipliers are very easily implemented in 
CMOS -  provided the inherent offset and non-linearities are acceptable. Several 
architectures have been proposed [86-91], but the main considerations that need to be 
taken into account for the adoption in the synapses include the following:
1. their size
2. their current output
3. their voltage input impedance
Another important consideration is whether a four-quadrant multiplier is needed. 
This mainly depends on the type of activation function that is chosen, 
for example, the sigmoid hyperbolic tangent function yields a bipolar value, thereby 
requiring four-quadrant operation. It is also important to consider whether the system 
requires a linear multiplier. For gradient descent the derivative must be known and
therefore the multiplier must have a computable transfer function derivative. In this case a 
linear multiplier would fulfill this need.
The dynamic range of the multiplier normally puts constraints on the network that 
can be mapped on a given topology. The offset error is also relevant in this case since 
when the outputs from many synapse multipliers are connected together the offset effects 
will accumulate, easily giving a resulting offset that is greater than the maximum output 
of a single synapse. The Gilbert multiplier [86], the MOS resistive circuit [92] and the 
multiplying DAC [93] are the circuits which have been popularly adopted for analogue 
implementation.
3.10.6 Activation Functions
The last thing that needs to be considered before implementing the analogue 
neural network is the threshold function. If a binary valued neuron transfer function is 
sufficient for the application at hand, the neuron circuit can be very simple. If, on the 
other hand, a continuous valued non-linearity is sought, the circuit is somewhat more 
complex. The exact shape of the neuron transfer function is usually irrelevant. What is 
more important is its qualitative shape, for example, that it is monotonic and saturates 
for numerically large inputs [15]. This is a very attractive feature, which means that 
transfer functions easily implemented in the technology can be chosen.
3.11 Summary
This chapter reviews pattern recognition systems. Two pattern matching systems 
have been outlined. These include Hidden Markov Models and Artificial Neural 
Networks. Artificial Neural Networks have been covered in detail, including the aspects 
of supervised and unsupervised learning, together with the different architectures, which 
could be used for the task of phoneme recognition. The algorithms adopted for 
multi-layer perceptron neural networks, self-organising maps, radial basis functions and 
time-delay neural networks have been analysed. The different VLSI techniques adopted for 
neural network implementations have been reviewed. Finally, various analogue 
implementations have been analysed in detail.
Chapter 4
Software Simulation of Neural Networks for 
Phoneme Recognition
4.0 Introduction
Current best speech recognisers use Hidden Markov Model (HMM) techniques for 
the statistical modelling of data [40]. However, neural network approaches to speech 
recognition have recently received great interest, as they are easier to implement in 
hardware and seem more suitable for multistage recognisers combining co-articulation 
modelling [21].
Before implementing the neural network system in hardware, a number of 
simulations have been carried out in order to evaluate the performances of the different 
neural network architectures for the phoneme recognition task. Different speech coding 
techniques were adopted in order to decide on the coding scheme which provides the best 
recognition rates and which lends itself best to hardware implementation. Another 
important decision that needs to be taken is the minimum number of coefficients that need 
to be considered in order to obtain adequate recognition rates. This will minimise the 
network architecture that is required. This is an important issue since the chip area should 
be maintained as small as possible.
4.1 Speech Recognition
As can be seen from Figure 4-1, the first stage of the speech recognition system is 
the coding stage. The main coding techniques investigated here included LPC, 
PARCOR, Log-Area Ratios, Cepstral, Mel-Scale Cepstral Coefficients and Energy Terms 
together with their differences and acceleration coefficients. C programs were written in 
order to obtain the given coefficients from the speech samples.
[Block diagram: Speech, Analysis System (Filter Bank, LPC, DFT), Pattern Training, Templates or Models, Pattern Classifier (Local Distance Measure, Dynamic Time Warping), Decision Logic]
Figure 4-1 Speech Recognition Model [41]
Before the simulations could begin, a small speech database had to be compiled. 
Phoneme utterances were extracted from recorded speech from the TIMIT corpus. The 
texts and speakers in TIMIT have been subdivided into suggested training and test sets. 
The TIMIT corpus splits up the testing and training directories using the following 
criteria:
• 27% of the corpus should be used for testing purposes, leaving the
  remaining 73% for training.
• No speaker should appear in both the training and testing portions.
• All the dialect regions should be represented in both subsets, with
  at least 1 male and 1 female speaker from each dialect.
• The amount of overlap of text material in the two subsets should be
  minimized; if possible no texts should be identical.
• All the phonemes should be covered in the test material, preferably
  each phoneme should occur multiple times in different contexts.
Table 4-1 lists the speakers in the core test set for each dialect. The male and 
female columns list the directory names for the different available speakers for each 
dialect. This set is the minimum recommended set for test purposes.
Dialect   Male          Female   Texts/Speaker   Total Texts
1         DAB0, WBT0    ELC0     8               24
2         TAS1, WEW0    PAS0     8               24
3         JMP0, LNT0    PKT0     8               24
4         LLL0, TLS0    JLM0     8               24
5         BPM0, KLT0    NLP0     8               24
6         CMJ0, JDH0    MGD0     8               24
7         GRT0, NJM0    DHC0     8               24
8         JLN0, PAM0    MLD0     8               24
Total     16            8                        192
Table 4-1 Speakers in the Core Test Set
The phonemes adopted in the TIMIT database are given in Table 4-2.
Category                Symbol
Stops:                  b, d, g, p, t, k, dx, q, bcl
Affricates:             jh, ch
Fricatives:             s, sh, z, zh, f, th, v, dh
Nasals:                 m, n, ng, em, en, eng, nx
Semivowels and Glides:  l, r, w, y, hh, hv, el
Vowels:                 iy, ih, eh, ey, ae, aa, aw, ay, ah, ao,
                        oy, ow, uh, ux, er, ax, ix, axr, ax-h
Table 4-2 TIMIT phonemes
Training samples, for the neural networks, were obtained from utterances saved in 
the training directories, while testing samples were extracted from wave files stored in the 
test directory. These samples were extracted using a program written in C. The program 
reads a TIMIT wave file (*.wav) and encodes it according to the coding technique that 
has been selected. The program also opens the TIMIT phone file (*.phn) and it assigns a 
phoneme number to each frame obtained during the coding sequence. In this way, each 
frame is linked to a corresponding phoneme and this is necessary for supervised learning.
Only forty eight phonemes have been used since a number of phonemes present in the 
TIMIT database did not provide enough samples for training purposes.
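The frame labelling step can be sketched as follows. TIMIT phone files list one segment per line as a start sample, an end sample and a phone symbol; the rule of labelling a frame by the segment that contains its centre, the frame geometry of 400 samples stepped by 160 samples, and the function name are assumptions made for this illustration and do not reproduce the C program used in this work.

```python
def label_frames(phn_lines, n_frames, win=400, step=160):
    """Assign to each frame the phone whose segment contains the frame centre."""
    segments = []
    for line in phn_lines:
        start, end, phone = line.split()
        segments.append((int(start), int(end), phone))
    labels = []
    for i in range(n_frames):
        centre = i * step + win // 2
        # Fall back to the last segment if the centre lies past the final boundary.
        phone = next((p for s, e, p in segments if s <= centre < e), segments[-1][2])
        labels.append(phone)
    return labels

phn = ["0 2000 h#", "2000 3500 sh", "3500 5000 iy"]
labels = label_frames(phn, n_frames=25)
print(labels[0], labels[12], labels[21])  # h# sh iy
```

Mapping each symbol to an integer index then gives the phoneme number that serves as the training target.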
The network uses all the wave files in the training directory for training, while the 
samples in the testing directory are adopted for evaluating the performance of the 
network. The files containing the speech samples are passed on to coding programs.
4.2 Simulation Results for the Different Coding Techniques
Initially, a twelve-coefficient vector was adopted for each speech frame. The 
frame window size was set to 25 ms, and pre-emphasis with a coefficient of 0.97 and a 
Hamming window were applied. Frames were started at 10 ms intervals, thus providing 
a 15 ms overlap of speech between consecutive frames.
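The framing parameters quoted above can be made concrete with a short sketch. It assumes the 16 kHz sampling rate of the TIMIT recordings, so that a 25 ms window holds 400 samples and a 10 ms step holds 160 samples; the function name and the random test signal are hypothetical, and the coding programs written for this work were in C.

```python
import numpy as np

def frame_signal(samples, rate=16000, win_ms=25, step_ms=10, preemph=0.97):
    """Pre-emphasise, cut into overlapping frames and apply a Hamming window."""
    # First-order pre-emphasis: y[n] = x[n] - 0.97 * x[n-1].
    emphasised = np.append(samples[0], samples[1:] - preemph * samples[:-1])
    win = rate * win_ms // 1000      # 400 samples at 16 kHz
    step = rate * step_ms // 1000    # 160 samples at 16 kHz
    n_frames = 1 + (len(emphasised) - win) // step
    starts = step * np.arange(n_frames)
    frames = np.stack([emphasised[s:s + win] for s in starts])
    return frames * np.hamming(win)

frames = frame_signal(np.random.default_rng(0).standard_normal(16000))
print(frames.shape)  # (98, 400)
```

One second of speech therefore yields 98 complete frames, each sharing 240 samples (15 ms) with its successor.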
C programs and MATLAB scripts were written in order to implement 
backpropagation learning for multi-layer perceptrons (MLP) and Kohonen learning for 
self-organising maps (SOM). The coefficients files were first applied as input to a 
12-150-48 MLP implementing backpropagation learning and to the SOM program. 
Figure 4-2 shows the recognition-rate results obtained for the different coding techniques 
investigated.
[Bar chart: recognition rate (48% to 64%) against coding technique (LPC, Cepstral, Mel-Scale Cepstral, PARCOR, Log Area) for the MLP and the SOM]
Figure 4-2 Recognition Rates for a 12-150-48 MLP and SOM setup
As can be seen from the simulation results for the complete training and testing 
set, the Mel-Scale Cepstral coefficients resulted in the highest recognition rates for both
the MLP and SOM. Another set of simulation runs was then carried out in order to 
evaluate whether the inclusion of difference and acceleration coefficients improves the 
recognition rates and in order to determine the ideal vector size that gives a good 
compromise between the recognition rates achieved and the complexity of the neural 
network that will be required. The first tests carried out were targeted at evaluating the 
performance of the MLP and SOM for different coding vector sizes. The basic Mel-Scale 
Cepstral coefficients were adopted for these simulations. The number of neurons in the 
hidden and output layers of the MLP was kept constant for all the tests, while the vector 
size was varied from four to sixteen. The number of neurons in the hidden layer was 
fixed at 150, while the number of neurons in the output layer was fixed at 48. Figure 4-3 
shows the phoneme recognition rates obtained when the networks were trained and tested 
on the whole speech database set.
[Line plot: recognition rate (20% to 70%) against vector size (4 to 16) for the MLP and the SOM]
Figure 4-3 Recognition Rates for different vector sizes
It can easily be seen that if the vector size exceeds twelve, then there is no 
significant gain in terms of recognition rates for both the MLP and the SOM. Therefore, a 
vector size of twelve was adopted. Simulations were then carried out in order to identify 
the benefits in terms of recognition rates for using difference and acceleration
coefficients. Table 4-3 shows the results attained for the whole training and testing set 
with Mel-Scale Coefficients on an MLP.
Code Adopted                                         Vector Size   Network Size   Recognition Rate
Mel-Scale + Difference Coefficients                  24            24-300-48      65%
Mel-Scale + Acceleration Coefficients                24            24-300-48      64%
Mel-Scale + Difference + Acceleration Coefficients   36            36-300-48      69%
Table 4-3 Recognition Rates obtained for various Difference and Acceleration Coefficients
The initial simulations indicated that Mel-Scale Cepstral coefficients together with 
the difference and acceleration terms provided the best recognition rates and they were 
adopted to encode the speech input samples. A vector size of thirty-six coefficients per 
frame - twelve Mel-Scale Coefficients, twelve difference and twelve acceleration 
coefficients - was adopted. The effect of including an Energy Term and the Difference in 
Energy was also investigated and Table 4-4 shows the results for Mel-Scale Cepstral 
Coding. As can be seen, a 1% improvement can be attained if an Energy Term is added to 
the Coding Coefficients.
Code Adopted                                         Vector Size   Network Size   Recognition Rate
Mel-Scale + Energy Term                              25            25-300-48      66%
Mel-Scale + Energy Term + Difference in Energy Term  26            26-300-48      66%
Table 4-4 Recognition Rates for Evaluation of the Energy Term
4.3 Simulation Results for the Different Neural Network Algorithms
Simulations were carried out in order to evaluate the performance of the different 
network topologies on the phoneme recognition task. A three-layer multi-layer perceptron 
was first investigated. An input layer of thirty-six neurons and an output layer of 
forty-eight neurons were adopted, while the number of neurons in the hidden layer was varied. 
This was done in order to establish the minimum size of the network that gives a reasonable 
compromise between the network complexity and the recognition rates. Figure 4-4 shows 
the results obtained when the network was trained and tested with the whole speech 
database.
The simulations indicated that a network size of 36-150-48 gives the best 
compromise in terms of recognition rates and network complexity. Other simulations 
were carried out on four layer multi-layer perceptron networks. However, the results did 
not yield higher recognition rates than the three layer networks. As a result, the 36-150-48 
three-layer network was adopted.
[Bar chart: recognition rate (54% to 70%) for 100, 150, 200, 250, 300 and 350 hidden nodes]
Figure 4-4 Recognition Rates obtained for different number of Hidden Nodes
Further work was carried out to evaluate the performance of the different learning 
algorithms. C programs were written in order to simulate backpropagation learning in 
multi-layer perceptrons, time-delay neural networks and Elman recurrent networks, 
Kohonen learning in self-organising maps and radial-basis functions. The criteria used to
evaluate the performances of the algorithms and topologies included the phoneme 
recognition rates, the architecture topology complexity and the training time.
Time-delay neural networks, with a three-frame window, provided the highest 
phoneme recognition rates, radial basis functions provided the fastest training times, 
while self-organising maps resulted in the smallest networks. Figure 4-5 shows the 
software results for different learning algorithms. The training efficiency is a measure of 
the time saved during the learning process. A maximum allowable time for training is set 
and the time taken for the algorithm to settle is subtracted from the maximum allowable time. 
The time saved is then expressed as a percentage of the maximum allowable time. The maximum 
allowable time for training was determined after observing all the training algorithms.
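The training efficiency measure described above can be written out as a short calculation. The following is a minimal sketch only; the function name, the argument names and the example times are illustrative assumptions and are not taken from the simulation programs.

```python
def training_efficiency(settle_time, max_time):
    """Time saved during training, as a percentage of the maximum allowable time.

    settle_time : time taken for the algorithm to settle
    max_time    : maximum allowable training time (same units)
    """
    if max_time <= 0:
        raise ValueError("max_time must be positive")
    if not 0 <= settle_time <= max_time:
        raise ValueError("settle_time must lie between 0 and max_time")
    time_saved = max_time - settle_time
    return 100.0 * time_saved / max_time


# Hypothetical example: an algorithm that settles in 10 time units
# out of an allowance of 100 time units.
print(training_efficiency(10.0, 100.0))
```

An algorithm that needs the whole allowance scores 0%, and a faster algorithm scores closer to 100%.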
The simulations on the whole TIMIT training and testing set gave a maximum 
recognition rate of around 73%, which is comparable with work done by other 
researchers for neural networks of similar complexity. The following results were 
quoted in the literature: 72.7% for a 600-neuron hidden layer TDNN 
[94], 69.5% for a 300-neuron hidden layer TDNN [95], 68.9% for a 200-neuron hidden 
layer TDNN [96] and 69.1% for a HMM model [97], [98].
4.4 Results for a Time-Delay Radial Basis Function Neural Network
As a result of the simulations carried out, it was decided that a time-delay radial 
basis function neural network would be evaluated. The time-delay radial basis function 
neural network was adopted in order to exploit the benefits of the different learning 
algorithms, namely the high recognition rates obtained by the time-delay aspect and the 
backpropagation learning and the faster learning times obtained by the radial basis 
functions.
RBF networks have a static Gaussian function as the nonlinearity for the hidden 
layer processing elements. The Gaussian function responds only to a small region of the 
input space where the Gaussian is centred. The key to a successful implementation of 
these networks is to find suitable centres for the Gaussian functions. This can be done 
with supervised learning, but an unsupervised approach usually produces better results. 
For this reason, a hybrid supervised-unsupervised topology was adopted.
[Bar chart: Recognition Rates and Training Efficiency (0% to 90%) for the MLP, TDNN, RBF, SOM and Elman networks]
Figure 4-5 Recognition Rates and Training Efficiency obtained simulating the
different ANN Architectures
The algorithm starts with the training of an unsupervised layer. Its function is to 
derive the Gaussian centres and the widths from the input data. These centres are encoded 
within the weights of the unsupervised layer using competitive learning. During the 
unsupervised learning, the widths of the Gaussians are computed based on the centres of 
their neighbors. The output of this layer is derived from the input data weighted by a 
Gaussian mixture.
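The unsupervised stage described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the simulation code: the winner is taken as the centre closest in Euclidean distance, the width of each Gaussian is taken as the distance to its nearest neighbouring centre, and the learning rate, epoch count and all function names are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def train_centres(samples, n_centres, epochs=20, rate=0.1, seed=0):
    """Competitive learning: only the winning centre moves towards each sample."""
    rng = random.Random(seed)
    centres = [list(s) for s in rng.sample(samples, n_centres)]
    for _ in range(epochs):
        for x in samples:
            win = min(range(n_centres), key=lambda k: sq_dist(x, centres[k]))
            centres[win] = [c + rate * (xi - c) for c, xi in zip(centres[win], x)]
    return centres


def neighbour_widths(centres):
    """Width of each Gaussian from its nearest neighbouring centre (an assumption)."""
    widths = []
    for i, c in enumerate(centres):
        nearest = min(sq_dist(c, o) for j, o in enumerate(centres) if j != i)
        widths.append(max(math.sqrt(nearest), 1e-9))  # guard against zero width
    return widths


def hidden_outputs(x, centres, widths):
    """Gaussian response of every hidden unit to the input vector x."""
    return [math.exp(-sq_dist(x, c) / (2.0 * w * w)) for c, w in zip(centres, widths)]


# Toy two-dimensional data, for illustration only.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
centres = train_centres(data, 2)
widths = neighbour_widths(centres)
print(hidden_outputs((0.0, 0.0), centres, widths))
```

Each hidden output lies between 0 and 1 and equals 1 only when the input coincides with the unit's centre, which is the local response that the supervised output layer then combines.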
Once the unsupervised layer has completed its training, the multilayer perceptron 
topology is used for the classification of the weighted input, that is to train the weights 
connecting the hidden neurons to the output layer. The back propagation algorithm was 
adopted in this case.
The advantage of the radial basis function network is that it finds the input to 
output map using local approximators. Usually the supervised segment is simply a linear 
combination of the approximators. Since linear combiners have few weights, these 
networks train extremely fast and require fewer training samples.
The network was trained and tested using 4000 phoneme utterances encoded using 
a 13 input speech vector per frame. The vector included 12 Mel-Scale Cepstral 
Coefficients and an Energy Term. The network was trained on the TIMIT training set and 
its performance evaluated on the TIMIT testing set. A three-frame time window was 
adopted, resulting in a network size of 13 input nodes, 39 hidden neuron nodes and 48
output nodes. The individual phoneme recognition results that were obtained are 
presented in Table 4-5.
Category    Phoneme   Recognition Rate     Category    Phoneme   Recognition Rate
Vowel       a         85.0%                Glide       el        45.3%
Vowel       axr       77.4%                Glide       hh        38.7%
Vowel       er        81.0%                Glide       y         84.0%
Vowel       u         82.7%                Glide       w         80.6%
Vowel       ow        54.8%                Glide       r         100.0%
Vowel       oy        61.5%                Glide       l         57.8%
Vowel       ao        55.0%                Fricative   zh        74.3%
Vowel       ah        48.4%                Fricative   dh        18.3%
Vowel       ay        67.5%                Fricative   v         77.8%
Vowel       aa        58.9%                Fricative   th        62.5%
Vowel       ey        77.3%                Fricative   f         92.3%
Vowel       i         80.2%                Fricative   z         88.6%
Vowel       ix        75.5%                Fricative   sh        84.3%
Vowel       aw        76.2%                Fricative   s         97.5%
Vowel       iy        81.4%                Affricate   ch        78.9%
Vowel       ih        77.8%                Affricate   jh        77.6%
Vowel       ux        67.8%                Stop        p         27.8%
Vowel       ax        66.6%                Stop        g         22.3%
Nasal       ng        17.9%                Stop        d         29.1%
Nasal       n         56.9%                Stop        b         25.0%
Nasal       m         40.0%                Stop        q         20.8%
Nasal       en        15.4%                Stop        t         31.8%
Nasal       em        18.3%
Nasal       dx        34.6%
Nasal       nx        68.2%
Total       -         61.05%
Table 4-5 Individual Phoneme Recognition Rates obtained simulating the 
Time-Delay RBF Network
In this case, the recognition rate obtained was just 1% less than that of the time-delay 
neural network, which obtained the highest recognition rate for the same training and 
testing set, while the training efficiency was just 2% less than that of the radial basis 
function network, which provided the most efficient training time. A different training 
strategy was also adopted. A time-delay RBF network was first used to classify the 
phonemes into the categories shown in Table 4-2. Depending on the category which 
registers the highest priority, the input vectors are passed on to a network specifically 
trained to recognise the phonemes belonging to the given category. The system is shown in 
Figure 4-6.
[Block diagram: separate feature extractors for stops, affricates, fricatives, nasals, semi-vowels and glides, and vowels, producing the individual phoneme probability. Phoneme groups listed in the figure:
Stops: b, d, g, p, t, k, dx, q, bcl
Affricates: jh, ch
Fricatives: s, sh, z, zh, f, th, v, dh
Nasals: m, n, ng, em, en, eng, nx
Semi-vowels and glides: l, r, w, y, hh, hv, el
Vowels: iy, ih, eh, ey, ae, aa, aw, ay, ah, ao, oy, ow, uh, ux, er, ax, ix, axr, ax-h]
Figure 4-6 Adopting Multiple Recognisers
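The two-stage strategy of Figure 4-6 amounts to picking the best-scoring category and then handing the same input to the recogniser trained for that category. The sketch below illustrates only this control flow; the toy scoring functions stand in for trained networks, and every name and score in it is a hypothetical placeholder.

```python
def best_key(scores):
    """Return the label with the largest score."""
    return max(scores, key=scores.get)


def recognise(frame, categoriser, recognisers):
    """Stage 1 selects a category, stage 2 selects a phoneme within it."""
    category = best_key(categoriser(frame))
    phoneme = best_key(recognisers[category](frame))
    return category, phoneme


# Toy stand-ins for trained networks (illustrative scores only).
def toy_categoriser(frame):
    return {"vowel": frame[0], "stop": 1.0 - frame[0]}


toy_recognisers = {
    "vowel": lambda frame: {"iy": frame[1], "aa": 1.0 - frame[1]},
    "stop": lambda frame: {"b": frame[1], "d": 1.0 - frame[1]},
}

print(recognise([0.9, 0.2], toy_categoriser, toy_recognisers))
```

A wrong decision in the first stage cannot be corrected by the second, so the overall rate is bounded by the categorisation accuracy.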
This system will necessitate a larger network size, since it would require seven 
different neural networks -  one designed to categorise the phoneme and another six to 
identify the phoneme. Again, in this case, the network was trained using a 13 input speech 
vector per frame, including 12 Mel-Scale Cepstral Coefficients and an Energy Term. 
Very high recognition rates were attained for the categorisation phase, but only a very 
slight improvement was achieved for the recognition of the individual phonemes. The 
overall phoneme recognition rate achieved was 61.72%, indicating that the main problem 
in the recognition process is to differentiate between phonemes belonging to the same 
category. Table 4-6 shows the results for the categorisation phase. Table 4-7 gives a 
confusion matrix, which provides the probabilities that a particular phoneme has been 
classified to an inappropriate category.
Category                   Percentage of Correct Classification
Stops                      44.5%
Affricates                 38.2%
Fricatives                 85.9%
Nasals                     81.3%
Semi-Vowels and Glides     48.2%
Vowels                     98.1%
Total                      82.8%
Table 4-6 Classification Results obtained simulating the Time-Delay RBF Network
Correct Category     % error   % error     % error     % error   % error              % error
                     Stop      Affricate   Fricative   Nasal     Semi-Vowel / Glide   Vowel
Stop                 X         9%          13%         67%       7%                   4%
Affricate            3%        X           88%         4%        2%                   3%
Fricative            4%        81%         X           6%        5%                   4%
Nasal                24%       13%         8%          X         22%                  33%
Semi-Vowel / Glide   7%        12%         7%          10%       X                    64%
Vowel                8%        3%          3%          15%       71%                  X
Table 4-7 Confusion Matrix for Incorrect Categorisation obtained simulating 
the Time-Delay RBF Network
4.5 Results for a Time-Delay Neural Network (TDNN) 
Implementing Back-Propagation Learning
A time-delay neural network implementing back-propagation learning could have 
more than one hidden layer. Simulations were carried out in order to evaluate whether an 
extra layer would provide any improvement in terms of recognition. The three-layer 
neural network had an architecture of 14-96-48 neurons. Each input unit is connected to 
each hidden unit by three different links having time delays of 0, 1 and 2, while each 
hidden unit is connected to each output unit by 5 different links having time delays of 0, 
1, 2, 3 and 4. The input vector per frame includes 12 Mel-Scale coefficients and an 
Energy Term. The architecture of the four-layer TDNN included 13 input neurons, 12 
neurons in the first hidden layer, 12 neurons in the second hidden layer and 48 output 
neurons. The same number of links as the three-layer network was adopted, that is 3 links 
for the input neurons and 5 links for the hidden neurons. Both systems were trained and 
tested with the whole database set and the recognition rates achieved are given in Table 4-8.
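The delayed links described above mean that each unit sees a short window of consecutive frames rather than a single frame. A minimal sketch of one such layer follows; the sigmoid activation, the bias term, the weight layout and the function name are assumptions made for illustration and are not taken from the simulation programs.

```python
import math


def time_delay_layer(frames, weights, biases):
    """Apply one time-delay layer to a sequence of frames.

    frames  : list of input vectors, one per time step
    weights : weights[u][d][i] links input i, delayed by d frames, to unit u
    biases  : one bias per unit (assumed)
    Returns one output vector per time step at which all delays are available.
    """
    n_delays = len(weights[0])
    outputs = []
    for t in range(n_delays - 1, len(frames)):
        row = []
        for u, unit in enumerate(weights):
            total = biases[u]
            for d, taps in enumerate(unit):
                # Delay d reads the frame that arrived d steps earlier.
                total += sum(w * x for w, x in zip(taps, frames[t - d]))
            row.append(1.0 / (1.0 + math.exp(-total)))  # sigmoid (assumed)
        outputs.append(row)
    return outputs


# One unit, two inputs, delays 0, 1 and 2; only the undelayed first input is weighted.
w = [[[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]]
print(time_delay_layer([[0.0, 0.0], [0.0, 0.0], [10.0, 0.0]], w, [0.0]))
```

With a 13-coefficient frame and delays of 0, 1 and 2, each hidden unit in this layout would carry 13 x 3 = 39 input weights.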
Architecture of TDNN     Recognition Rates
Three-layer              60.45%
Four-layer               61.02%
Table 4-8 Comparing Recognition Rates for Three or Four-Layer Neural Networks
The improvement in the four-layer neural network is quite minimal, even though 
it requires more computations. Therefore, three-layer networks were adopted for 
hardware evaluations.
4.6 Results for Different Speech Systems
Several other tests were carried out in order to evaluate the performance of neural 
network architectures for speaker-dependent systems, speaker-independent systems 
(speakers of the same sex and dialect), speaker-independent sex-independent systems 
(same dialect) and speaker-independent sex-independent systems using speakers from 
different dialect regions. Figure 4-7 to Figure 4-10 present the results.
Although the simulation results were obtained with a reduced training set, a 
number of deductions can be made. Phoneme recognition rates are normally high for 
single user systems, with the TDNN attaining a high 77%. When the system was trained 
with one speaker and tested on another speaker of the same sex, phoneme recognition 
rates are also quite high, clearly identifying that neural networks are capable of extracting 
the essential data that differentiates one phoneme from another. Both the TDNN and RBF 
systems produced a good recognition rate of around 72%. With the networks trained on 
more than one speaker, the recognition rates obtained were quite similar to the ones 
obtained in the previous simulations with the TDNN achieving a 72% recognition rate. 
The results obtained for Elman neural networks were quite low. When the system was 
trained and tested with speakers of opposite sex coming from different dialect regions, the 
recognition rates fell only slightly implying that the differences in sex and dialect regions 
have minimal impact on the neural networks.
[Bar chart: Speaker Dependent - Phoneme Recognition Rates (62% to 78%) for the MLP, TDNN, RBF, SOM and Elman networks]
• Training Data was obtained from a single female speaker. Training Directory: \Train\Dr4\FlhdO
• Test Data from the same speaker was adopted. Testing Directory: \Test\Dr4\Flhd
• 12 Mel-Scale Coefficients + 1 Energy Term were adopted as the Coding Option
Figure 4-7 Speaker Dependent Phoneme Recognition Rates
[Bar chart: Speaker Independent - Phoneme Recognition Rates (58% to 72%) for the MLP, TDNN, RBF, SOM and Elman networks]
• Training Data was obtained from a single female speaker. Training Directory: \Train\Dr4\FlhdO
• Test Data from a different female speaker was adopted. Testing Directory: \Test\Dr4\FssbO
• 12 Mel-Scale Coefficients + 1 Energy Term were adopted as the Coding Option
Figure 4-8 Speaker Independent Phoneme Recognition Rates 
Speakers of the Same Sex and Dialect
[Bar chart: Speaker Independent - Phoneme Recognition Rates (0% to 80%) for the MLP, TDNN, RBF, SOM and Elman networks]
• Training Data was obtained from two female speakers. Training Directory: \Train\Dr4\FlhdO, \Train\Dr4\FssbO
• Test Data from two different male speakers was adopted. Testing Directory: \Test\Dr4\MarwO, \Test\Dr4\MbmaO
• 12 Mel-Scale Coefficients + 1 Energy Term were adopted as the Coding Option
Figure 4-9 Speaker Independent Phoneme Recognition Rates 
Speakers of Opposite Sex from the same Dialect Region
[Bar chart: Speaker Independent - Phoneme Recognition Rates (52% to 70%) for the MLP, TDNN, RBF, SOM and Elman networks]
• Training Data was obtained from two female speakers. Training Directory: \Train\Dr4\FlhdO, \Train\Dr4\FssbO
• Test Data from two different male speakers from a different dialect region was adopted. Testing Directory: \Test\Dr7\MarwO, \Test\Dr7\MbthO
• 12 Mel-Scale Coefficients + 1 Energy Term were adopted as the Coding Option
Figure 4-10 Speaker Independent Phoneme Recognition Rates 
Speakers of Opposite Sex from Different Dialect Regions
4.7 Conclusions
As shown in Table 4-9, the simulations carried out have indicated that the TDNN 
provided the highest recognition rates. However, RBF networks provided 
lower training times. Consequently, it was decided that a time-delay radial basis function 
neural network would be the preferred implementation since it exploits the benefits of the 
different learning algorithms, namely the high recognition rates obtained by the 
time-delay aspect and the backpropagation learning and the faster learning times obtained by 
the radial basis functions. Table 4-9 summarises the simulation results.
Use of Whole Training / Testing Set             MLP    SOM    RBF    TDNN
Recognition Rates                               69%    64%    70%    73%
Training Efficiency                             51%    76%    90%    47%
4000 Sample Size for Training / Testing Set
Single Speaker Results                          74%    68%    75%    77%
Speaker Same Sex and Dialect                    71%    68%    72%    72%
Speaker Opposite Sex, Same Dialect              68%    66%    71%    72%
Speaker Different Dialect, Sex                  68%    66%    68%    69%
Table 4-9 Summary of Simulation Results
4.8 Summary
Before proceeding to the hardware implementation, it was necessary to evaluate 
the performance of the different neural network algorithms on different speech systems. 
Various software simulation results have been presented in order to establish which 
coding technique is best adopted for neural networks, which algorithm presents the best 
recognition rates and performance, and also to determine the size and topology of the 
neural networks that provide adequate recognition rates. Section 4.4 presents the results 
of a time-delay RBF system, which was adopted to exploit the capabilities of three 
different learning algorithms, namely, the time-delay neural network, self-organising 
maps and radial basis functions. The final section presents the performance of the neural 
networks for different speaker-dependent and speaker-independent systems.
Chapter 5
VLSI Implementation of Self-Organising Maps
5.0 Introduction
In this chapter, the implementation of a self-organising map chip will be 
presented. The chip will eventually be adopted in order to train the weights of the hidden 
layer of the radial basis function. The chapter will focus on the choice of network models 
and topologies that have been adopted for implementing the algorithm. Also the 
hardware topologies and sub-circuits are discussed - with the future implementation in 
mind. The main problem tackled at the design stage of analogue neural networks is the 
need to size the transistors in such a way that they allow correct operation even within 
a large-scale environment.
This chapter will first outline the theory behind the standard circuits that have 
been adopted and will then provide simulation results, which have been used to provide 
circuit optimisation.
5.1 Mapping the Algorithm on VLSI
The architecture of a small, low power, application specific system must be 
tailored to the application. The building block components for such systems are thus the 
atomic parts of the neural network and form a cell library to the CMOS process.
5.2 The Hamming Network
One of the important examples of competitive learning is the Hamming Network 
[4], [96]. It is used for recognizing the binary patterns with the minimum error. The 
Hamming Network consists of two main blocks:
• The Hamming distance measure block [97], which computes the 
Hamming distance between the applied input pattern and the stored 
exemplar pattern represented by the weight values.
• The MAXNET [98], which selects the pattern with the best 
matching score.
E. Sánchez-Sinencio et al. [99-100] have developed analog implementations of the 
Hamming Network. Figure 5-1 shows a block diagram of the Hamming Network. It 
consists of the matching score computation network and the winner-take-all (WTA) 
circuit. Each column of the matching score computation network provides the computed 
matching score $S_j$, which is represented by the summing current between the applied input 
pattern and the stored pattern. The matching score between the input pattern vector and 
the stored pattern represented by the weights is defined as,
$S_j = N - HD(x, w_j)$ for $j = 1, \ldots, N$ (5.1)
where the Hamming Distance, $HD(x, w_j) = \sum_{i=1}^{M} |x_i - w_{ji}|$ (5.2)
and $x_i$ and $w_{ji}$ are the elements of the input pattern vector $x$ and weight pattern $w_j$, 
respectively. For a binary pattern, (5.1) can be simplified to,
$S_j = \sum_{i=1}^{M} \overline{x_i \oplus w_{ji}}$ (5.3)
The WTA circuit selects the best matching score and provides the corresponding 
outputs, which are either analog values or digital values. The Hamming distance between 
the binary input patterns and the contents of the static-random-access-memory (SRAM) is 
computed by a current-mode XNOR logic gate. The idea was proposed by J. Choi and 
B. Sheu [101]. Figure 5-2 shows the schematic diagram of one synapse cell [101]. The 
current-mode XNOR gate can be implemented using five MOS transistors. Of these, the 
transistors acting as switches allow current to flow in the XNOR gate only when the 
elements $x_i$ and $w_{ji}$ are equal, while a further transistor provides the reference 
current, $I_{ref}$, for the synapse cell. Thus, the analog value of the current $I_{ji}$ of the 
$(j,i)$-th synapse cell can be represented as,
$I_{ji} = I_{ref} \, \overline{x_i \oplus w_{ji}}$ (5.4)
Thus the matching score, $S_j$, for each stored pattern can be expressed in terms of 
the summing current of the $I_{ji}$'s as follows,
$S_j = \sum_{i=1}^{M} I_{ji}$ (5.5)
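The matching score computation and the winner selection can be mimicked in software. The sketch below counts matching bits with an XNOR-style comparison and picks the stored pattern with the largest score. It assumes the constant in (5.1) is the pattern length, so that the score equals the number of matching bits; the function names and bit patterns are illustrative only.

```python
def hamming_distance(x, w):
    """Number of positions at which two binary patterns differ."""
    return sum(abs(xi - wi) for xi, wi in zip(x, w))


def matching_score(x, w):
    """Number of matching bits, i.e. the sum of XNOR(x_i, w_i) over the pattern."""
    return sum(1 for xi, wi in zip(x, w) if xi == wi)


def winner(x, stored):
    """Index of the stored pattern with the best matching score, plus all scores."""
    scores = [matching_score(x, w) for w in stored]
    best = max(range(len(stored)), key=scores.__getitem__)
    return best, scores


pattern = [1, 0, 1, 1]
stored_patterns = [[1, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0]]
print(winner(pattern, stored_patterns))
```

In the analogue circuit each matching bit contributes one unit of reference current, so the summed column current plays the role of `matching_score` here.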
The SRAM stores the digital weight value reliably as long as the power is 
supplied. The switch controls of the SRAM input portion are used for loading the weight 
value into the local memory site. The transistor sizes were kept to a minimal size and the 
circuit was operated at very low voltage in order to maintain small currents for low power 
consumption.
[Block diagram: an array of XNOR/SRAM synapse cells with Column Address and Column Decoder, Input Data and Weight Vector inputs, and a Winner-Take-All Circuit producing the Outputs of the Competitive Network]
Figure 5-1 Hamming Network [96]
[Circuit schematic: XNOR gate and SRAM cell]
Figure 5-2 Synapse Cell in the Hamming Unit [101]
5.2.1 The Winner-Take-All Circuit
The output neuron of a self-organising map functions as a current summer and a 
current-to-voltage converter to produce an analogue voltage proportional to the current 
produced by the Hamming network circuitry. The winner-take-all circuit will then search 
for the largest analogue voltage from the output neurons and produce a sufficiently large 
output voltage level for the winning unit against the others. In order to design the 
winner-take-all circuit, a number of considerations need to be taken into account, namely,
• High accuracy
• High operation speed
• Compactness
WTA circuits have been built with transistors biased in the sub-threshold region 
[102-103]. The approach is suitable for biologically inspired artificial neural systems 
where billions of transistors are expected to be integrated on a single substrate. However, 
it has some significant limitations due to low operation speed, small dynamic range, poor 
noise immunity and severe fabrication-induced device variations. Switched capacitor 
versions are also available but they require a clock.
A schematic diagram of the WTA circuit as proposed by J. Choi and B. Sheu 
[101] is shown in Figure 5-3. All active transistors operate in the saturation region. Each 
WTA cell consists of two branches. The first branch with transistors M ,, and 
Mj converts an input voltage into the current cell as.
/ y  - V „ , f  for j = 1 , ,  N (5.6)
where $\beta_1$ and $V_{T1}$ are the transconductance parameter and the effective threshold 
voltage of transistor $M_1$, respectively.
These currents are compared and redistributed along the common signal line. 
In the second branch, the current in each cell is converted into the output 
voltage as,
yU) ^  J_out g 4^
-1 ss (5.7)
where X. is the channel-length modulation parameter and m is the current gain
betweentiansistors and M^.
[Circuit schematic of one WTA cell: input Vin(j), output Vout(j), bias voltage VBB2 and supply VSS]
Figure 5-3 Winner-take-all Circuit [101]
Since the source terminals are at the same potential for all the cells, the current 
flowing through each cell is related to the square of the input voltage. Thus, the strongest 
input can secure the largest amount of current out of the total bias current. The largest 
current is converted and amplified to produce the largest voltage as the output of the 
winning node. If the input voltage differences are sufficiently big, the winner output is 
saturated at the positive supply value, while the other outputs are saturated at the negative 
voltage supply value. Through the use of a common voltage node, the total bias current is 
provided by the transistor in every cell. As the number of inputs increases, the WTA 
circuit can be extended linearly by simply abutting the common signal node through the 
cells.
Consider P different input voltage groups. The $i$-th group has $n_i$ elements and its 
input value is $V_i$. The current flowing through each cell in the $i$-th group is
A- C^M ~ K ,)“ (5-8)
The bias current from transistor in each cell is
“  ^ss ~ Ki, f  [l + K  ~ ^ss )] (5 .9)
The total current is distributed as,
$\sum_{i=1}^{P} n_i I_i = N \cdot I_B$ (5.10)
where $N = \sum_{i=1}^{P} n_i$ (5.11)
Here $I_i$ is the current flowing through each cell in the $i$-th group, $I_B$ is the bias 
current in each cell and N is the total number of competitive cells. From the above 
equations, the voltage of the common signal line is determined and the output voltage of 
each cell is uniquely decided according to the corresponding input value. When the input 
voltage to one group is the largest, the number of cells in that group should be 1. The 
current flowing in this cell should be larger than the current flowing through a single cell 
in any other group in order to ensure proper winner-take-all function. This largest current 
is designed to make the output node saturated to the positive power supply.
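The way a fixed total bias current is shared between the cells can be illustrated numerically. The sketch below is a simplified model and not the SPICE model used for the design: it assumes an ideal square-law device with no channel-length modulation, a single lumped gain factor `k`, and hypothetical parameter values, and it finds the common-node voltage by bisection so that the cell currents add up to the total bias current.

```python
def share_currents(v_in, k, v_t, i_bias, iters=200):
    """Solve for the common-node voltage and the resulting cell currents.

    v_in   : input voltage of each cell
    k      : lumped square-law gain factor (assumed identical for all cells)
    v_t    : threshold voltage (assumed identical for all cells)
    i_bias : bias current contributed by each cell
    """
    total = len(v_in) * i_bias

    def cell_currents(v_c):
        currents = []
        for v in v_in:
            overdrive = v - v_c - v_t
            currents.append(k * overdrive * overdrive if overdrive > 0.0 else 0.0)
        return currents

    # The summed current falls as the common node rises, so bisection applies.
    lo = min(v_in) - v_t - (total / k) ** 0.5 - 1.0   # sum exceeds total here
    hi = max(v_in) - v_t                              # sum is zero here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(cell_currents(mid)) > total:
            lo = mid
        else:
            hi = mid
    v_c = 0.5 * (lo + hi)
    return v_c, cell_currents(v_c)


# Hypothetical values: three cells with inputs of 1.0 V, 0.9 V and 0.5 V.
v_common, currents = share_currents([1.0, 0.9, 0.5], k=1e-4, v_t=0.7, i_bias=1e-5)
print(v_common, currents)
```

In this toy case the cell with the largest input takes the largest share of the total current, and the weakest cell is cut off entirely.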
Performance improvement due to cascading of identical single stages is 
significant. Cascading single stages makes the entire voltage gain of the cell drastically 
increase so that the transition region between the winner and losers is greatly narrowed. 
The operation speed is improved since a large load capacitance can be driven efficiently 
by the stronger output signal. For a large-scale network, the output levels of the winner 
and losers cannot be fully binary - i.e. they take up the rail-to-rail voltages. However, the 
winning output is still larger than the rest. The problem in this case is due to the loss of 
the available current in the winner cell. The corresponding response time becomes quite 
long because a smaller amount of charging current is provided for transistor M3 in the 
winner cell. By using the cascading configuration, the output voltage level of the winner 
can be kept saturated at the positive supply and the operation speed 
can be greatly increased.
A circuit, based on a dynamic current steering principle, proposed by J. Choi and 
B. Sheu [101], is shown in Figure 5-4. This circuit is able to ensure that only the winning 
output will have the logic-high value by dynamically adjusting the current levels. The 
input to the current steering circuit is the output of the single-stage competition cell. As 
the number of competition cells increases, the network can still be easily extended. The 
drain terminals of the corresponding transistor in each cell are tied together to provide 
the necessary bias current. When the number of outputs with the logic-1 value is more 
than 1, currents can flow in the steering transistors in each cell so that the current in each 
competition cell is decreased by the same amount. This steering current is strongly 
dependent on the number of logic-1 outputs. The operation continues until only one output 
has a high voltage value and the currents of the other cells are below the threshold, so that 
the corresponding output voltages are at the logic-0 value. When the current steering 
technique is first applied, a large steering current flows because several outputs start at 
the logic-1 value. This operation reduces the current levels of all outputs. Once all cell 
currents corresponding to the secondary inputs are below the threshold current, the 
steering current decreases so that only the winning output rises up to the positive supply 
value.
[Circuit schematic: Single Stage followed by the Current Steering Circuit, with Output to Next Stage]
Figure 5-4 Dynamic Steering Winner-Take-All Circuit [101]
5.3 The Learning Algorithm
One major challenge of using self-organising networks is that some neural units 
may be under-utilised. In order to overcome this problem, a frequency-sensitive 
self-organisation (FSO) method is adopted. The algorithm modifies Grossberg's 
variable-threshold competitive learning by applying a winning frequency and its associated 
upper threshold value to the centroid learning rule. It systematically distributes the code 
vectors in the vector space to approximate the unknown probability density function of 
the random training vectors. Code vectors quantise the vector space and converge to 
cluster centroids. This method produces near-optimal results.
The synapse weight vector is stored as a code vector. In the one-iteration FSO 
scheme, the training data must pass once in constructing the codebook. It is a fast and 
powerful scheme for adaptive vector quantisation owing to its low computing 
requirement and parallel computing structure. The one-iteration FSO scheme for adaptive 
vector quantisation is described as follows:
1. Initialise the code vectors and the winning frequency for each 
distortion-computing neuron:
$W_i(0) = R(i)$
$F_i(0) = 1$
for $i = 1, \ldots, N$
$R(i)$ is a random vector-number generator function. N is the number of code 
vectors and $W_i(0) = [W_{i1}(0), W_{i2}(0), \ldots, W_{iM}(0)]$. M is the number of vector 
components. However, it should be noted that instead of using random vectors, the first 
N input vectors may be used as the initial code vectors.
2. Compute the distortion between the input vector and all code 
vectors.
$D_i(t) = d(X(t), W_i(t)) = \sum_{j=1}^{M} (X_j(t) - W_{ij}(t))^2$ (5.13)
where t is the training time index.
3. Select the distortion-computing neuron with the smallest distortion 
and set its output to high.
$O_i(t) = 1$ if $D_i(t) < D_j(t)$, $1 \le i, j \le N$, $i \ne j$ (5.14)
$O_i(t) = 0$ otherwise
4. Update the code vectors with a frequency-sensitive training rule 
and the associated winning frequency:
$W_i(t+1) = W_i(t) + S(t) O_i(t) [X(t) - W_i(t)]$ (5.15)
$S(t) = 1 / F_i(t)$ if $1 \le F_i(t) \le F_{th}$
or $S(t) = 0$ otherwise (5.16)
and $F_i(t+1) = F_i(t) + O_i(t)$ (5.17)
where $S(t)$ is the frequency-sensitive learning rate and $F_{th}$ is the upper threshold 
frequency. It should be noted that only the winning code vector is updated. The training 
rule moves the winning code vector towards the training vector by a fractional amount 
which decreases as the winning count increases. If $F_i(t)$ is larger than the upper-threshold 
frequency $F_{th}$, then $S(t)$ is set to 0 and no further training will be performed on the given 
neural unit.
5. Steps (2) through (4) are repeated for all training vectors.
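The five steps above can be sketched as a single pass over the training data. This is a minimal illustration rather than the simulation or hardware implementation: it uses the first N input vectors as the initial code vectors (the alternative mentioned in step 1), takes the learning rate as the reciprocal of the winning frequency, breaks ties by lowest index, and all names and data are hypothetical.

```python
def fso_one_iteration(training, n_codes, f_threshold):
    """One-iteration frequency-sensitive self-organisation.

    training    : list of training vectors
    n_codes     : number of code vectors N
    f_threshold : upper-threshold winning frequency
    Returns the code vectors and the winning frequency of each unit.
    """
    # Step 1: initial code vectors and winning frequencies.
    codes = [list(v) for v in training[:n_codes]]
    freq = [1] * n_codes

    for x in training:
        # Step 2: squared-error distortion to every code vector.
        dist = [sum((xj - wj) ** 2 for xj, wj in zip(x, w)) for w in codes]
        # Step 3: the unit with the smallest distortion wins.
        win = min(range(n_codes), key=dist.__getitem__)
        # Step 4: frequency-sensitive update of the winner only.
        rate = 1.0 / freq[win] if 1 <= freq[win] <= f_threshold else 0.0
        codes[win] = [wj + rate * (xj - wj) for xj, wj in zip(x, codes[win])]
        freq[win] += 1
    # Step 5 is the loop itself: every training vector is presented once.
    return codes, freq


# Toy one-dimensional data forming two clusters.
vectors = [[0.0], [10.0], [2.0], [12.0]]
print(fso_one_iteration(vectors, n_codes=2, f_threshold=10))
```

Because the rate is the reciprocal of the winning count, each code vector settles towards a running average of the vectors it has won, and a unit stops learning once its count exceeds the threshold.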
Use of the upper-tlrreshold frequency can avoid code vector under-utilisation 
during the training process for an inadequately chosen initial codebook. The selection of 
the upper-threshold frequency is heuristic and depends on the source data statistics and 
training sequence. A value of three times larger than the average training frequency was 
found adequate.
The performance of the one-iteration FSO method can be incrementally improved 
by using additional iterations to adjust the code vectors into better cluster centroids. The 
codebook obtained fr om the previous iteration is used as the initial values for the current 
iteration. After the first iteration, the upper-threshold frequency is not needed because a 
good initial codebook is available. This method is called the multiple-iteration FSO 
method.
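A possible software sketch of the multiple-iteration variant follows. It is a hypothetical model only: the helper names, the restart of the winning counts at each pass, and the interpretation of the "average training frequency" as the number of training vectors divided by the number of code vectors are assumptions made for illustration.

```python
def fso_pass(vectors, codebook, f_th=None):
    """One training pass; f_th=None disables the upper-threshold test."""
    counts = [1] * len(codebook)  # counts restart at each pass (assumption)
    for x in vectors:
        dist = [sum((a - b) ** 2 for a, b in zip(x, w)) for w in codebook]
        win = dist.index(min(dist))
        if f_th is None or counts[win] <= f_th:
            rate = 1.0 / counts[win]
            codebook[win] = [b + rate * (a - b)
                             for a, b in zip(x, codebook[win])]
        counts[win] += 1
    return codebook


def multi_iteration_fso(vectors, n_codes, iterations=3):
    # Threshold of three times the average training frequency (assumed reading).
    f_th = 3.0 * len(vectors) / n_codes
    codebook = [list(v) for v in vectors[:n_codes]]
    for it in range(iterations):
        # Only the first pass uses the threshold; later passes start from
        # the codebook produced by the previous pass.
        codebook = fso_pass(vectors, codebook, f_th if it == 0 else None)
    return codebook


if __name__ == "__main__":
    print(multi_iteration_fso([[0.0], [10.0], [1.0], [9.0]], 2, iterations=2))
```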
5.4 Winner-Take-All Circuit Simulation Results
The winner-take-all circuit is the most sensitive block within the chip and hence 
various simulations have been carried out.
Initially, several simulation runs were carried out, for the circuit shown in 
Figure 5-3, in order to evaluate the correct transistor sizes that maintain low power and 
low voltage operation, while still allowing correct operation within a large network 
environment. The winner-take-all circuit is tested with a supply rail of -2 V to 2 V. 
One input is set to 1 V and the other input is increased linearly from -1 V to 1.5 V. To 
obtain fully binary output values for the winning and losing nodes, the required input 
voltage difference is found to be at least 100 mV. The waveform for the output voltage of 
the 2-input WTA circuit is shown in Figure 5-5. The simulated transient response is 
shown in Figure 5-6. The response time of the single stage is 60 ns with a capacitive load 
of 0.2 pF at each cell. Table 5-1 shows the lengths and widths of the transistors used.
Transistor     M1     M2, M4   M3, M5
W/L [µm/µm]    8/2    8/4      16/4
Table 5-1 Transistor sizes of the 2-input WTA cell
[Plot omitted: Simulation Results for 2-input WTA circuit; horizontal axis Vin(1)]
Figure 5-5 SPICE Simulated Results for 2-input WTA (Vin(2) = 1 V)
[Plot omitted: horizontal axis Time (nsec)]
Figure 5-6 Transient behaviour of the winning output with a 
capacitive load of 0.2 pF
5.4.1 Large-Scale Neural Network Environment
An important design consideration for the winner-take-all circuit is its capability 
of operating in large-scale network environments, and the transistor sizes were chosen 
with this in mind. The main consideration in evaluating the transistor sizes is 
that the winner-take-all circuit must be capable of discriminating between voltages which are 
very close to each other, since when the self-organising map size is extended, the voltage 
levels fed to the winner-take-all circuit are very likely to be close to each other. The 
circuit should be able to handle large systems with up to 400 inputs, while operating on a 
supply rail of ±2 V.
To clearly illustrate the behaviour of the winner-take-all circuit in a large-scale 
network, three groups of input voltages are considered: the winning voltage, $V_W$, the 
second largest voltage, $V_M$, and the smallest voltage, $V_L$. The number of cells in each group 
is 1, M and N-M-1 respectively. The current flowing through each cell can be expressed 
as:
\[ I_W = \frac{K}{2} (V_W - V_{CM} - V_{th})^2 \qquad (5.18) \]
\[ I_M = \frac{K}{2} (V_M - V_{CM} - V_{th})^2 \qquad (5.19) \]
\[ I_L = \frac{K}{2} (V_L - V_{CM} - V_{th})^2 \qquad (5.20) \]
The total current is obtained as:
\[ I_{total} = I_W + M \cdot I_M + (N - M - 1) \cdot I_L = N \cdot I_b \qquad (5.21) \]
Figure 5-7 shows the results obtained for a 1000-input WTA circuit. The voltage 
levels were set at 2.525 V for $V_W$, 2.5 V for $V_M$, and 2.475 V for $V_L$, with a supply rail of 0 V
to 5 V. As the number of second-largest inputs, M, increases, a larger portion of the total
bias current flows through this group and the current flowing through the winner cell 
decreases. This results in a reduction of the output DC level and increases the rise time of 
the winning output.
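The current-sharing argument above can be checked numerically with a small sketch. It assumes each cell follows a square-law current of the form (K/2)(V - V_CM - V_th)^2 and that the common-node voltage V_CM settles where the cell currents add up to N times the per-cell bias. The values of `k`, `v_th` and `i_b` are hypothetical placeholders, not figures from the thesis; only the three input levels are taken from the text.

```python
def cell_current(v_in, v_cm, k=50e-6, v_th=0.8):
    """Square-law cell current; zero when the device is off (assumed model)."""
    v_ov = v_in - v_cm - v_th
    return 0.5 * k * v_ov * v_ov if v_ov > 0 else 0.0


def winner_current(n, m, v_w=2.525, v_m=2.5, v_l=2.475, i_b=12.2e-6):
    """Solve I_W + M*I_M + (N-M-1)*I_L = N*i_b for V_CM by bisection."""
    def total(v_cm):
        return (cell_current(v_w, v_cm)
                + m * cell_current(v_m, v_cm)
                + (n - m - 1) * cell_current(v_l, v_cm))

    lo, hi = -10.0, v_w  # total(lo) is far too large, total(hi) is zero
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if total(mid) > n * i_b:
            lo = mid  # too much current: the common node must rise
        else:
            hi = mid
    return cell_current(v_w, 0.5 * (lo + hi))


if __name__ == "__main__":
    for m in (1, 250, 500, 750, 999):
        print(m, winner_current(1000, m))
```

Under these assumptions the winner's current falls monotonically as M grows, which matches the trend described for Figure 5-7.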
[Plot omitted: Results for 1000 input WTA; horizontal axis M; curves 1st Winner, 2nd Winner]
Figure 5-7 Results for 1000 input WTA
Simulations were also carried out on cascaded versions. The improved 
performance of the cascade mode can be seen in Figure 5-8 and Figure 5-9.
[Plot omitted: Results for 1000 input WTA; curves 1st Winner Single, 1st Winner Cascade, 
2nd Winner Single, 2nd Winner Cascade]
Figure 5-8 Simulation Results for 1000 input WTA - Comparing Single and Cascade 
Mode - DC Level of the Winning Output
[Plot omitted: Results for 1000 input WTA; horizontal axis M; curves Single-Stage, Cascade-Stage]
Figure 5-9 Results for 1000 input WTA - Comparing Single and Cascade Mode 
Response time of the winning output - Capacitive Load of 1.0 pF
In the WTA circuit shown in Figure 5-3, the total bias current is provided in a 
distributed manner by a transistor in each cell, instead of by a fixed 
tail current source. Thus, the total bias current is proportional to the number of 
competitive circuit cells. This approach makes the circuit response time almost independent 
of the number of cells. In the fixed bias current case, the response time of the winning 
output increases rapidly as the number of cells increases. On the other hand, in the case of 
distributed biasing, the response time is almost constant because the available 
charging current is adjusted proportionally. Figure 5-10 shows the results of simulated 
tests for fixed and distributed biasing.
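The contrast between the two biasing schemes can be illustrated with a deliberately crude first-order model, in which the response time is taken as t = C * dV / I and every cell receives an equal share of the total bias. The equal-share assumption, the function name and the default values (load, voltage swing, per-cell current) are all hypothetical and are not meant to reproduce the simulated curves.

```python
def response_time(n_cells, c_load=1.0e-12, v_swing=4.0,
                  i_cell=12.2e-6, fixed_total=None):
    """First-order charging time t = C * dV / I for one output (sketch)."""
    # Distributed biasing: every cell adds its own bias current.
    # Fixed biasing: one tail source is shared by all cells.
    total = fixed_total if fixed_total is not None else n_cells * i_cell
    i_charge = total / n_cells  # share of the bias available to one cell
    return c_load * v_swing / i_charge


if __name__ == "__main__":
    fixed = 250 * 12.2e-6
    for n in (250, 500, 1000):
        print(n, response_time(n), response_time(n, fixed_total=fixed))
```

In this model the distributed case gives the same time for any number of cells, while the fixed case grows in proportion to the number of cells.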
If the input voltages to some cells are very large compared to those of 
the rest of the cells, then these cells consume most of the total biasing 
current. In such a case, more than one output voltage might saturate at the positive power 
supply value. This results from the fact that the currents of these cells are large enough to 
saturate their outputs even though they are still smaller than that of the winning cell. Here, the 
critical current is used to describe the value which the current flowing in a cell 
should exceed in order that its output can be interpreted as a logic 1 value. This value is 
determined from the circuit parameters.
[Plot omitted: Fixed vs Distributed Biasing; horizontal axis M; curves Distributed, 
Fixed 12.2 µA x 250, Fixed 12.2 µA x 500]
Figure 5-10 Comparison of the Fixed and Distributed Bias Currents for a 
Different Number of Competitive Cells
The circuit of Figure 5-4 was finally adopted and the transistor sizes are given in 
Table 5-2.
Transistor     M1     M2, M4   M3, M5   M6     M7     M8     M9     M10
W/L [µm/µm]    8/2    8/4      16/4     8/2    4/2    6/4    24/4   12/4
Table 5-2 Transistor sizes of the WTA cell shown in Figure 5-4
The current-steering circuit was initially tested for a 10-input winner-take-all cell. 
The input voltages for each cell were maintained at the levels given in Table 5-3. 
Figure 5-11 shows the output voltage and current simulation results comparing the 
current-steering single and cascaded WTA.
Input Vi[l] Vi[2] Vi[3] Vi[4] Vi[5] Vi[6] Vi[7] Vi[8] Vi[9] Vi[10]
[V] 1.89 1.43 2.25 3.03 2.35 2.94 2.41 2.50 2.82 1.96
Table 5-3 Input Voltages for a simulation on a 10-input WTA
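As a quick sanity check on Table 5-3, an ideal winner-take-all function can be applied to the tabulated inputs. The snippet below is only an idealised software comparison (the name `rank_inputs` is hypothetical); it shows that cells 4, 6 and 9 hold the three largest inputs, which are the cells plotted in Figure 5-11.

```python
# Input voltages from Table 5-3, indexed by cell number.
INPUTS = {1: 1.89, 2: 1.43, 3: 2.25, 4: 3.03, 5: 2.35,
          6: 2.94, 7: 2.41, 8: 2.50, 9: 2.82, 10: 1.96}


def rank_inputs(inputs):
    """Return cell numbers ordered from largest to smallest input voltage."""
    return sorted(inputs, key=inputs.get, reverse=True)


if __name__ == "__main__":
    ranked = rank_inputs(INPUTS)
    margin = INPUTS[ranked[0]] - INPUTS[ranked[1]]
    print("winner:", ranked[0], "top three:", ranked[:3])
    print("margin over runner-up (V):", round(margin, 3))
```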
[Plots omitted: (a) Output Voltage, Single Stage: curves Vo(4), Vo(6), Vo(9) against time (ns); 
(b) Output Voltages, Cascading Stages: curves Vo(4), Vo(6), Vo(9); 
(c) Cell Currents, Single Stage, Without Steering: curves I(4), I(6), I(9) against time (ns); 
(d) Single Stage with Current Steering, Current for Node 4, against time (ns)]
Figure 5-11 (a) Output Voltages for Single-Stage WTA (b) Output Voltages 
for Cascaded WTA (c) Output Currents for Single-Stage without Dynamic Steering 
(d) Output Current for Single-Stage with Dynamic Steering
The transistor sizes were chosen such that the circuit is 
capable of correct operation with network sizes containing matrix connections up to 
300x300. Several simulations were carried out on 300-input winner-take-all circuits and 
the given sizes provided correct operation, while still maintaining a small silicon area and 
low power consumption.
5.5 Chip Architecture
The chip architecture based on the FSO learning method is shown in Figure 5-12. 
The chip uses analogue circuitry to perform the massively parallel neural computation 
and digital circuitry to process and store the digital weight information. The chip, 
implemented using standard CMOS 0.6 µm technology, includes input neurons, 
programmable synapses, current-summing neurons, winner-take-all cells and decoders. 
The chip area is 3.3 mm x 3.4 mm. The synapse matrix is composed of 32 x 32 synapse 
cells. The output neuron array is composed of 32 summing neurons, which perform 
parallel summation of the distortions between the input vectors and code vectors. The 
winner-take-all block consists of 32 competitive circuit cells, which perform a parallel 
comparison among the 32 distortion values and choose a single winner.
The digital blocks were programmed using Verilog and synthesised using Synergy 
for Cadence version 97A. The analogue building blocks, namely the current XNOR 
block and the Winner-Take-All circuit, were implemented using analogue full custom 
design. 8 bits were adopted for coding the weight and input vectors as they provided 
enough accuracy for the learning algorithm. The actual layout of the chip shows 
that the digital block occupies a silicon area around 6 times larger than the 
analogue block. This demonstrates that computational blocks implemented using 
analogue technology occupy less silicon area, besides consuming less power.
[Block diagram omitted: Address and Input Vector inputs, Input Buffers, Address Decoder, 
Synapse Matrix, Input Neurons, Current Summing Neurons, Winner-Take-All Cells]
Figure 5-12 Self-Organising Maps Chip Building Block
5.6 Experimental Results
Six chips were used to construct a 192-input WTA circuit. In this case, the large 
WTA circuit can easily be implemented by connecting the common signal node of the 
different chips together. Figure 5-13 shows the measurement results of the dc transfer 
voltage characteristics. The input voltage of cell 20 was increased linearly from 2.50 V to 
2.55 V as shown. Cell 50 was set at 2.52 V, while all other inputs were set at 2.50 V. When 
the input voltage exceeded the voltage of cell 50, the corresponding output voltage values 
flipped to make cell 20 the new winner. All other outputs were saturated towards the negative 
supply voltage.
[Plot omitted: Demonstrating change of winning node; horizontal axis input voltage Vin(20) (V); 
curves Vo(20), Vo(50)]
Figure 5-13 Measurement results of the dc transfer voltage characteristics for 200-input 
hardware WTA.
Figure 5-14 shows the output waveform on cell 20 when a 0.1 V peak-to-peak 
voltage, centred at 2.50 V, was applied as input to the cell. The load capacitance was 
7 pF, while the measured rise time and fall time were 182 ns and 350 ns respectively.
In a separate experiment, the input voltage of cell 80 was set to 2.58 V to make it 
the winning cell. The second largest input voltage and the input voltage to other cells 
were set to 2.50 V and 2.42 V, respectively. The output levels and the response time 
against the number of cells with the second largest input voltage are shown in 
Figure 5-15. The output levels of the winner for all cases are above the 90% value of the 
full operation range. The operation speed can be significantly increased when the circuit 
is integrated with other blocks in a VLSI chip because the internal load capacitance could 
be much less than the 7 pF of the measurement setup.
[Plots omitted: Vin - V(20) and Vo(20) against time (us)]
Figure 5-14 The output waveform on cell 20 when a 0.1 V peak-to-peak voltage, 
centred at 2.50 V, is applied as input to the cell
[Plots omitted: output level and response time against Number of Nodes; curves Calculated, Measured]
Figure 5-15 The output levels and the response time against the number of 
cells with the second largest input voltage.
Figure 5-16 shows the effects of the process-induced variations of the competition 
threshold across several chips. It was typically measured to be less than 15 mV.
Simulations were carried out in order to test the effects of temperature variation 
on the performance of the circuitry. The only perceived problem with excessive operating 
temperature (typically higher than 100°C) is that the current increases and there is a 
higher possibility that a number of output nodes will saturate. This will affect the 
operation of the winner-take-all circuit. However, the situation may be remedied by 
increasing the power supply range. Otherwise, the recognition rates of the self-organising 
maps are unaffected.
[Plot omitted: Effects of Process Variations; horizontal axis Input Voltage (V); 
curves Cell (2), Cell (50), Cell (100), Cell (150)]
Figure 5-16 The effects of the process-induced variations of the competition 
threshold across several chips.
5.7 Testing Hardware Self-Organising Maps for Phoneme 
Recognition
Two such chips were adopted to implement a phoneme recognition system. The 
system was tested for recognising 48 different phonemes. Speech utterances extracted 
from forty different speakers (twenty male, twenty female) of different dialect regions 
were coded using Mel-Scale Coefficients. In this case, 8 Mel-Scale Coefficients per frame 
were extracted and supplied as input to the chips. A recognition rate of 61.7% was 
attained. This result was slightly lower than the value of 63% obtained using software 
self-organising maps, as shown in Figure 3-5. The system used a supply rail of ± 2V and 
each chip consumes around 10 mW.
5.8 Conclusions
The implemented chip uses mixed-mode CMOS 0.6 µm technology. Analogue 
circuitry performs the massively parallel neural computation and digital circuitry is 
used to process and store the digital weight information. The chip includes input 
neurons, programmable synapses, current-summing neurons, winner-take-all cells and 
decoders. The chip area is 3.3 mm x 3.4 mm. The synapse matrix is composed of 32 x 32
synapse cells. The output neuron array is composed of 32 summing neurons, which 
perform parallel summation of the distortions between the input vectors and code 
vectors. The winner-take-all block consists of 32 competitive circuit cells, which perform 
a parallel comparison among the 32 distortion values and choose a single winner. 
Table 5-4 summarises the properties of the self-organising map chip.
Characteristics                                Properties
Chip Technology                                Mixed-Mode Analogue/Digital CMOS 0.6 µm technology
Chip Area                                      3.3 mm x 3.4 mm
Number of Synapses                             32 x 32
Number of Output Neurons                       32
Minimum Power Supply Rail                      ±2 V
Power Consumption                              10 mW
Phoneme Recognition Rates                      61.7%
(4000 samples, 8 Mel Scale Coefficients)
Table 5-4 Summary of the properties of the self-organising map chip.
5.9 Summary
This chapter reviews the structure of self-organising maps. The tasks of the 
individual building blocks are defined and the hardware implementation of each of the 
individual blocks is treated in detail. Simulation results are presented, including 
comparisons between the different implementations. The chip performance followed very 
closely the software simulation results that were obtained and presented. The structure of 
the eventual learning chip is shown and the results of the hardware implementation are 
presented for the task of phoneme recognition. The layout plots of the chip and the 
internal building blocks are presented in Appendix A.
Chapter 6. VLSI Implementation of Radial Basis Functions
Chapter 6
VLSI Implementation of Radial Basis Functions
6.0 Introduction
Radial Basis Function (RBF) neural networks present several attractive 
properties. They are suitable for many classification tasks and provide easy and effective 
learning algorithms. They adopt the simple idea of constructing complex decision regions 
by superimposing simple kernel functions [103]. The centres and widths of the kernel 
functions are the main parameters to be estimated during the learning phase [103]. The 
typical choice of kernel function is the Gaussian function, which gives the highest output 
when the input is near to its centre and monotonically decreases as the distance from the 
centre increases. The Gaussian function contrasts with the traditional sigmoid 
function in that it responds significantly only to local regions of the space of the input values. 
As a result, training in neural networks based on Gaussian functions is more efficient 
than in networks adopting the sigmoidal function and in fact, up to two or three orders of 
magnitude speed-up in training have been reported in pattern recognition applications that 
have adopted the RBF kernel functions [104-106].
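The locality property can be made concrete with a short numerical comparison. The sketch below uses the textbook forms of a unit-width Gaussian kernel and a logistic sigmoid; the function names and parameter defaults are illustrative choices rather than anything specified in the thesis.

```python
import math


def gaussian(x, centre=0.0, width=1.0):
    """Gaussian kernel: peaks at the centre and decays with distance."""
    return math.exp(-((x - centre) ** 2) / (2.0 * width ** 2))


def sigmoid(x):
    """Logistic sigmoid: saturates instead of decaying."""
    return 1.0 / (1.0 + math.exp(-x))


if __name__ == "__main__":
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"x={x:+.1f}  gaussian={gaussian(x):.6f}  sigmoid={sigmoid(x):.6f}")
```

Far from its centre the Gaussian unit contributes almost nothing, whereas the sigmoid unit stays active over half of the input space, so a training sample affects only a few Gaussian units.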
RBF neural networks consist of one hidden layer of kernel nodes and one output 
layer of linear decision nodes. The fact that RBF neural networks have only one hidden 
layer is a benefit as the interconnection scheme can easily be implemented for VLSI 
circuits unlike the greater interconnection requirements of complex multi-layer 
perceptrons.
This chapter will first outline the theory behind the standard circuits that have 
been adopted and will then provide eventual modification to the circuitry together with 
simulation results, which have been used to provide circuit optimisation. Finally, the chip 
results are presented.
6.1 Hardware Considerations
As stated in Chapter 3, the network chosen here consists of three layers:
1. An input layer where an element of the input vector is subtracted from 
each stored weight. Each result is then squared and then summed together 
in the row.
2. A hidden layer which performs an exponential operation on the summed 
results for each row of the previous layer. This creates a Gaussian transfer 
function with respect to the input vector.
3. An output layer which multiplies the output of each row’s hidden layer by 
a stored weight. The results of these multiplications plus a global constant 
are then summed along the column to produce an output.
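The three layers listed above map directly onto a short forward-pass routine. The following is a software sketch under the usual Gaussian RBF formulation; the function name, the argument layout and the explicit width parameter per hidden unit are assumptions for illustration, not a description of the chip's signal path.

```python
import math


def rbf_forward(x, centres, widths, out_weights, bias):
    """Forward pass of a three-layer RBF network (illustrative sketch)."""
    hidden = []
    for centre, width in zip(centres, widths):
        # Layer 1: subtract each stored weight, square, and sum along the row.
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
        # Layer 2: exponential of the summed result gives a Gaussian response.
        hidden.append(math.exp(-d2 / (2.0 * width ** 2)))
    # Layer 3: weighted sum of hidden outputs plus a global constant per column.
    return [bias + sum(w * h for w, h in zip(column, hidden))
            for column in out_weights]


if __name__ == "__main__":
    out = rbf_forward([0.0, 0.0],
                      centres=[[0.0, 0.0], [3.0, 4.0]],
                      widths=[1.0, 1.0],
                      out_weights=[[2.0, 0.0]],
                      bias=0.5)
    print(out)
```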
Analogue circuit elements operating in the sub-threshold region can provide all of 
the above operations. In fact, in the sub-threshold region, the drain current of an MOS 
transistor has an exponential dependence on the gate bias so that the exponential 
nonlinearity can be easily achieved. Sub-threshold VLSI design is well suited for 
implementation of large-scale biologically-inspired neural networks due to low power 
consumption.
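For reference, a commonly used first-order expression for the sub-threshold (weak inversion) drain current of a saturated MOS transistor is given below. It is the standard textbook form rather than an equation taken from this thesis, and the symbols $I_0$ (a process-dependent scale current), $n$ (the slope factor) and $U_T$ (the thermal voltage) are introduced here only for illustration.

```latex
I_D \approx I_0 \, \frac{W}{L} \, \exp\!\left( \frac{V_{GS}}{n\, U_T} \right),
\qquad U_T = \frac{kT}{q} \approx 26\ \text{mV at room temperature}
```

Because $U_T$ is proportional to absolute temperature, any circuit relying on this exponential is inherently temperature dependent.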
Within a radial basis function neural network, the input neurons are densely 
connected to output neurons through the synapse matrix. Input neurons can belong to the 
input layer or a hidden layer of the complete neural network, while the output neurons can 
belong to another hidden layer or the output layer of the complete network.
Each synapse between an input neuron and an output neuron can perform the 
Gaussian function with the mean being the weight value. Therefore, if the weight value is 
changed, the connection strength of an input neuron $X_i$ to an output neuron $Y_j$ is 
increased or decreased [106]. The value of $\sigma$ determines the standard deviation of the 
Gaussian function.
Figure 6-1 shows the circuit schematic of a basic synapse cell with single-ended 
input data. This circuit was proposed by B. Sheu and J. Choi [107]. The Gaussian 
function synapse cell consists of the MOS differential pair and several arithmetic 
computational units operating in the current-mode configuration. Transistors with 
non-minimum channel lengths are used to minimise channel-length modulation effects. The 
input voltage is applied at the gate terminal of one transistor in the differential pair and
the weight value is applied at the gate terminal of the other transistor. The two currents in 
the differential pair can be expressed as:
\[ I_1 = \frac{I_{ss}}{2} + \frac{\beta}{4} (V_{in} - V_w) \sqrt{\frac{4 I_{ss}}{\beta} - (V_{in} - V_w)^2} \qquad (6.4) \]
and
\[ I_2 = \frac{I_{ss}}{2} - \frac{\beta}{4} (V_{in} - V_w) \sqrt{\frac{4 I_{ss}}{\beta} - (V_{in} - V_w)^2} \qquad (6.5) \]
with the differential input voltage in a finite region of
\[ |V_{in} - V_w| \le \sqrt{\frac{2 I_{ss}}{\beta}} \qquad (6.6) \]
Here $I_{ss}$ is the tail current of the differential pair and $\beta = \mu C_{ox} \frac{W}{L}$ is the 
transconductance value of the two differential-pair transistors.
The output current of this synapse cell can be determined by:
\[ I_{out} = A (I_{ss} - I_5 - I_6) \qquad (6.7) \]
where A is a drain current ratio between two transistors of the cell.
If $V_{in} - V_w > 0$, then $I_1 > I_{ss}/2$ and $I_2 < I_{ss}/2$. In this case, $I_5 = I_1 - B I_2$ and $I_6 = 0$, where
B is likewise a drain current ratio between transistors of the cell. On the other hand, if $V_w - V_{in} > 0$, 
then $I_1 < I_{ss}/2$ and $I_2 > I_{ss}/2$. In this case, $I_6 = I_2 - B I_1$ and $I_5 = 0$. When the input voltage 
is comparable to the synapse weight, the transistors carrying $I_5$ and $I_6$ nearly turn off
and the output current is mainly contributed by a single transistor. Current gain values A and 
B can be chosen to better approximate the ideal Gaussian curve. Their typical values are 
quite close to 1.
[Circuit schematic omitted]
Figure 6-1 Basic Gaussian synapse cell with single-ended input/weight values [107]
Transistor       M1, M2   M3     M4-M9   M10    M11    M12
Size (W/L) µm    8/4      10/4   8/8     5/4    7/8    4/4
Table 6-1 Transistor Sizes for Figure 6-1
6.2 Synapse Circuit Modification and Optimisation
The circuit simulation result for a weight value of 0 is shown in Figure 6-2. The 
transistor sizes adopted for the circuit were carefully chosen such that the current out of 
the circuit would very closely approximate the Gaussian function at very low current 
operation. Low current operation is a necessity in a large-scale environment as outputs 
will tend to saturate when the chips are operated at low voltage. As can be seen from 
Figure 6-2, the simulated output currents closely approximate the ideal Gaussian
function curve (dotted line) within the operational range -1.5 V to 1.5 V, with the best 
match obtained for values of A and B at 1.
A modified synapse cell with differential input/weight values was preferred. The 
circuit schematic is shown in Figure 6-3. Figure 6-4 shows the simulated results, which 
demonstrate an improved approximation to the Gaussian curve. This is achieved because 
the circuit uses symmetric handling of the positive and negative signals.
[Plot omitted: Varying the Values of A,B; horizontal axis Input Voltage (V); 
curves A,B=1, A,B=1.5, A,B=2 and Ideal]
Figure 6-2 Simulation Results for the Synapse Characteristics - Maximum 
magnitude being 8 µA, mean 0, standard deviation 0.302
Transistor       M1-2, M7-8   M3, M9   M4, M10   M5-6, M11-12   M13    M14    M15
Size (W/L) µm    8/4          25/4     19/8      16/8           25/4   16/8   29/4
Table 6-2 Transistor Sizes for Figure 6-3
[Circuit schematic omitted]
Figure 6-3 Modified Gaussian synapse Cell with differential input/weight values.
Figure 6-4 shows that the curve obtained for the modified synapse cell 
approximates the Gaussian function with an accuracy better than 98% over the input 
range -2V to 2V for values of A and B at 1. However, device mismatch induced by 
fabrication will cause some degradation to this accuracy. In fact, the usable output signal 
range of the modified synapse cell is almost double that of the single-ended input version. 
The area of the modified cell is 146 µm x 99 µm.
6.3 Programmability
A great advantage of neural networks results from their ability to adapt to a 
changing environment. Three values can be programmed in a Gaussian function: the peak 
magnitude, the weight value and the standard deviation. These parameters 
are extremely important since they affect the training efficiency of the network. The 
weights are corrected using any appropriate learning algorithm, namely either competitive 
learning or back-propagation.
The peak value can be changed by varying the tail current, while the weight 
values can be programmed using a learning algorithm. The tail current can easily be 
adjusted by changing the values of the resistances which are connected to the positive and 
negative supplies in order to limit the value of the tail current. These resistances are connected 
externally to the chip, allowing great flexibility. The standard deviation, for a fixed value 
of the tail current, can be varied by changing the W/L ratio of the input transistors of the differential 
pairs. In order to allow better programmability within the chip itself, a small library of 
transistor differential pairs with different W/L ratios was implemented on chip and these 
transistors could be chosen for operation in the circuit through the use of MOS switches, 
which are controlled by data stored in a local flip-flop. After carefully evaluating the 
software runs for Gaussian training in radial-basis function networks, the W/L ratios 
chosen were 4/4, 8/4, 12/4 and 16/4. These transistor sizes were capable of providing the 
typical final standard deviation values that were adopted during software training.
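The effect of the selectable W/L ratios on the width of the bell can be estimated with a rough scaling rule. The sketch assumes a square-law differential pair, whose linear input range scales as the square root of (tail current / transconductance parameter), and therefore as 1/sqrt(W/L) at fixed tail current. Both this scaling and the helper name are assumptions for illustration; the fabricated cell need not follow it exactly.

```python
import math

# W/L ratios of the on-chip differential-pair library quoted in the text.
LIBRARY = {"4/4": 4 / 4, "8/4": 8 / 4, "12/4": 12 / 4, "16/4": 16 / 4}


def relative_width(w_over_l, reference=1.0):
    """Width of the bell relative to a pair with W/L = reference (assumed rule)."""
    return math.sqrt(reference / w_over_l)


if __name__ == "__main__":
    for name, ratio in LIBRARY.items():
        print(name, round(relative_width(ratio), 3))
```

Under this rule the 4/4 pair gives the widest response and the 16/4 pair about half that width.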
In this way, the implemented circuitry allows great flexibility in order to achieve 
the main goal of radial-basis function networks, that is, the ability to obtain fast training 
times. This was achievable using very simple components and little hardware overhead 
on chip.
[Plot omitted: output current against Voltage (V)]
Figure 6-4 Simulation Results for the Synapse Characteristics with Differential 
Inputs - Maximum Magnitude 8 µA, Mean 0 and Standard Deviation 0.55.
6.4 Simulation Results
The simulated results on the programmability of the modified Gaussian synapse 
cell are shown in Figure 6-5 (a), (b) and (c) for different values of the peak magnitude, 
mean, and standard deviation respectively.
[Plots omitted: (a) Different Peak Values; (b) Varying Mean; (c) Different Standard Deviation; 
horizontal axis Differential Input Voltage (V) in each panel]
Figure 6-5 Simulations for the Programmability of the Modified Gaussian Cell 
(a) Reference currents set at 16 µA, 8 µA and 4 µA (b) Weight value changed to 
-0.5 V, 0 V and 0.5 V with reference current set at 8 µA (c) W/L ratios changed to 1, 
2, 4 with the setting at 1 producing the largest standard deviation while the 
setting at 4 gave the lowest standard deviation
Figure 6-6 shows an example configuration for demonstrating the performance of 
the Gaussian function network. Input neurons consist of unity gain amplifiers as data 
buffers. The same input voltage value is applied to four input neurons. A linear resistor in 
the output neuron converts the summed current into the output voltage. A lower bound 
value of this feedback resistor is determined by the smallest output voltage which 
can be distinguished from the noise. When the number of synapses increases, the summed 
current may drastically increase because all currents are unipolar. Thus, a proper upper 
bound value of the feedback resistor should be determined from the network size. 
Since the non-inverting input of the output neuron is virtually grounded, the contribution 
from one synapse cell current is independent of the output resistance of the synapse cell. 
Figure 6-7 shows the speed response of the Gaussian network. Figure 6-8 shows the 
simulation results for dc characteristics of the four-input neuron. Synapse 1 has a mean of 
-1.6, standard deviation of 0.55, maximum current of 8 µA, synapse 2 has a mean of 0.2, 
standard deviation of 0.38, maximum current of 8 µA, synapse 3 has a mean of 1.3, 
standard deviation of 0.38, maximum current of 8 µA and synapse 4 has a mean of 2.5, 
standard deviation of 0.55, maximum current of 8 µA.
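The four-synapse example can be mimicked in software by summing four ideal Gaussian currents and converting the sum with a feedback resistor. This is a sketch under stated assumptions: ideal Gaussian synapses, a peak current read as 8 µA (the unit is garbled in the extracted text), an inverting current-to-voltage stage, and a hypothetical feedback resistance passed in by the caller.

```python
import math

# (mean in V, standard deviation in V, peak current in A) for synapses 1 to 4.
SYNAPSES = [(-1.6, 0.55, 8e-6), (0.2, 0.38, 8e-6),
            (1.3, 0.38, 8e-6), (2.5, 0.55, 8e-6)]


def synapse_current(v_in, mean, sigma, peak):
    """Ideal Gaussian synapse current for a given input voltage."""
    return peak * math.exp(-((v_in - mean) ** 2) / (2.0 * sigma ** 2))


def neuron_output(v_in, r_feedback):
    """Summed current and output voltage of an inverting I-to-V stage (assumed)."""
    i_total = sum(synapse_current(v_in, m, s, p) for m, s, p in SYNAPSES)
    return i_total, -r_feedback * i_total


if __name__ == "__main__":
    for v in (-2.0, -1.6, 0.2, 0.75, 1.3, 2.0):
        i_total, v_out = neuron_output(v, r_feedback=10e3)  # 10 kOhm is hypothetical
        print(f"Vin={v:+.2f} V  I={i_total * 1e6:.3f} uA  Vout={v_out * 1e3:.2f} mV")
```

Because every synapse current has the same sign, the total can only grow as synapses are added, which is the reason the text bounds the feedback resistor from above.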
[Network diagram omitted: input Vin, Input Neurons, Synapses W1 to W4, Output Neuron, output Vout]
Figure 6-6 A Network with Four Gaussian Synapse Cells
[Plot omitted: Vin (V) and (-Vout) (V) against time (us)]
Figure 6-7 Speed Response for Gaussian Network of Figure 6-6
[Plot omitted: Four Synapse Measurements; horizontal axis Input Voltage (V); 
curves Synapse 1, Synapse 2, Synapse 3, Synapse 4, Total]
Figure 6-8 Simulation results for dc characteristics of the four-input neuron. 
Synapse 1 has a mean of -1.6, standard deviation of 0.55, maximum current of 8 µA,
synapse 2 has a mean of 0.2, standard deviation of 0.38, maximum current of 8 µA,
synapse 3 has a mean of 1.3, standard deviation of 0.38, maximum current of 8 µA,
synapse 4 has a mean of 2.5, standard deviation of 0.55, maximum current of 8 µA.
Rigorous simulation runs were carried out in order to observe the operation of 
both the synapse and neuron circuits in a large-scale neural network environment. It was 
necessary that the synapse currents should be maintained at very low values so that each 
neuron would be capable of accepting connections from several hundreds of synapses. In 
this case, the resistor within the feedback loop of the neuron was also kept small so 
that it would not occupy a large silicon area and would not saturate the neuron output.
Three hundred input neurons were connected to 300 synapses, which in turn were 
connected to a neuron. The same input voltage value is applied to all input neurons. The 
input voltage chosen was the highest positive input voltage that can be taken 
up by the synapse cell. This was done on purpose as it was necessary to test the operation 
of the radial basis function chip in the worst case conditions, especially since the 
summed current may drastically increase because all currents are unipolar. Thus, a proper 
upper bound value of the feedback resistor should be determined from the network
size. Since the non-inverting input of the output neuron is virtually grounded, the 
contribution from one synapse cell current is independent of the output resistance of the 
synapse cell. The transistor sizes quoted in Table 6-2 in fact allowed 
correct operation for the given network size. The value of the resistor was fixed at 30 Ω, 
thus occupying a very small silicon area and consuming very little power.
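A rough upper bound on the neuron output for this configuration can be worked out by hand. The estimate below assumes that every one of the 300 synapses delivers the 8 µA peak current used in the earlier synapse simulations, and that the feedback resistor value reads as 30 ohms; both are assumptions, and the true worst-case current in the reported simulation may be lower.

```latex
I_{\max} = 300 \times 8\ \mu\text{A} = 2.4\ \text{mA},
\qquad
|V_{out}|_{\max} = I_{\max} R_f = 2.4\ \text{mA} \times 30\ \Omega = 72\ \text{mV}
```

Under these assumptions the output stays far inside a ±2 V supply range, which is consistent with the statement that the neuron output does not saturate.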
Simulations were carried out in order to test the effects of temperature variation 
on the performance of the circuitry. The problem with the main circuitry is that the output 
current of the Gaussian circuit is temperature dependent. This means that if 
learning occurs at one circuit temperature while system testing occurs at another, 
the circuit will not provide the appropriate response. A 7% decrease 
in recognition rates resulted when the temperature increased from room temperature 
at 25°C to 100°C.
6.5 Circuit Evaluation
A chip incorporating 256 neuron units and 256 modified Gaussian synapses has 
been implemented using 0.35 µm CMOS technology. The chip size is 4.3 mm by 
4.3 mm. The chip operates at ±2 V supply, the synapse unit consumes a maximum power 
of 100 µW, while the neuron unit consumes 20 µW. Tests have been 
carried out in order to evaluate the performance of the chip for the task of phoneme 
recognition. It is found that the hardware-implemented chip provides recognition rates of 
54.3% and 67.5% for a 16-hidden neuron and a 64-hidden neuron system, respectively. 
Table 6-3 summarises the properties of the radial-basis function chip. The layout of the 
chip and building blocks are presented in Appendix B.
Characteristics                                  Properties
Chip Technology                                  Analogue CMOS 0.35 µm technology
Chip Area                                        4.3 mm x 4.3 mm
Number of Synapses                               256
Number of Output Neurons                         256
Minimum Power Supply Rail                        ±2 V
Power Consumption: Neuron                        20 µW
                   Synapse                       100 µW
Phoneme Recognition Rates                        60.7%
(4000 samples, 8 Mel Scale Coefficients,
64 hidden neurons)
Table 6-3 Summary of the properties of the radial basis function chip.
6.6 Summary
This chapter reviews the theory behind radial basis functions. It then proceeds to
present the individual building blocks that could be adopted to implement the various
components that are required for the implementation of neurons and synapses for radial
basis functions. The hardware implementation of each of the individual blocks is treated in
detail. The main advantage of the implemented chip is that the Gaussian function is fully
programmable and this enables faster training times, while maintaining good recognition
rates. Simulation results are presented for the individual building blocks and connected
systems forming a network. These results are closely similar to the results obtained when
the various chip blocks are tested separately. The structure of the eventual learning chip is
shown and some results of the hardware implementation for the task of phoneme
recognition are presented.
Chapter 7
Hardware Implementation of An Analogue Neural Network with On-Chip Back-Propagation Learning
7.0 Introduction
VLSI neural networks can be realised as either analogue or digital, with digital
neural networks successfully used for feedforward neural networks, while analogue
neural networks have been found more suitable for recurrent neural networks [108-111].
The main advantage of analogue systems over digital ones is the high processing speed 
that they can attain. However, analogue networks suffer from certain constraints and the 
following considerations are usually taken into account when designing neural networks 
[112]:
1. calculations including weight modification must be carried out in an analogue
manner to obtain smoothness in calculation
2. neural networks should have fully parallel on-chip learning capability for
continuous time operation
3. neural networks should operate successfully in an analogue environment, which
always exhibits circuit non-linearities and offset errors
However, no practical analogue system satisfies all of the above conditions. For
neural networks, offset errors are the critical errors, which need to be eliminated. On the
other hand, neural network environments can handle circuit non-linearities more easily.
This chapter proposes a back-propagation learning procedure that can cancel out the 
effects of offset errors in analogue neural network integrated chips. It also outlines the 
analogue neural network architecture adopted and presents the evaluation results of the 
fabricated chip.
In contrastive back-propagation learning, the offset errors mentioned above are
cancelled out, since the subtraction operation between the two learning phases tends to
cancel most offset errors in analogue circuit components. In fact, it can be shown that
contrastive learning can handle systems with large offset errors.
The difference between contrastive and conventional back-propagation learning
networks is very small as far as circuit configuration is concerned, but the effect is
extremely large in terms of cancelling out the offset errors [113].
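The cancellation argument above can be illustrated with a small numerical sketch. It is only a behavioural model under stated assumptions, not the circuit used on the chip: the offset is assumed to be a single constant added to the error signal, the second phase is assumed to flip both the sign of the error and the polarity of the weight update, and the names `phase_update`, `conventional_update` and `contrastive_update` are hypothetical.

```python
def phase_update(error, offset, polarity, rate=0.1):
    """Weight change produced in one learning phase.

    `offset` stands for a lumped circuit offset added to the error signal
    (an assumption of this sketch); `polarity` is +1 in the first phase
    and -1 in the reversed phase.
    """
    return polarity * rate * (error + offset)


def conventional_update(error, offset, rate=0.1):
    """Single-phase update: the offset passes straight into the weight."""
    return phase_update(error, offset, +1, rate)


def contrastive_update(error, offset, rate=0.1):
    """Two-phase update with error sign and update polarity both reversed."""
    first = phase_update(error, offset, +1, rate)
    second = phase_update(-error, offset, -1, rate)
    return first + second  # equals 2 * rate * error, independent of offset


if __name__ == "__main__":
    for offset in (0.0, 0.3, -0.5):
        print(offset, conventional_update(0.2, offset), contrastive_update(0.2, offset))
```

In this model the conventional update changes with the offset, whereas the contrastive update depends only on the error, which is the behaviour the text attributes to the subtraction between the two learning phases.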
7.1 Neural Network Chip Architecture
As proposed by T. Morie and Y. Amemiya, a back-propagation neural network is
mainly composed of 3 building blocks: the neuron unit, the synapse unit and the error
signal generator unit [114]. These building blocks have been adopted for this chip
implementation, but the circuit implementations have been tuned for large-scale neural
network architectures, operating at low voltage and consuming low power.
In order to implement back-propagation learning, the neuron unit performs two
functions. It transforms an input current to a voltage output according to a sigmoid
function with a variable gain, and it produces a constant output level defined by the signal
supplied to an input terminal J, as shown in Figure 7-1. The first function relates to the
operation of the neuron in either the hidden layer or output stage, while the second
function relates to a neuron operating in the input stage.
The synapse unit has two paths, one for the feedforward pass (o to wo) and the
other for back-propagating the error (d to wd). As can be seen from Figure 7-2, which
illustrates the schematic diagram for the synapse unit, it consists of 3 multipliers and a
weight processing unit for the storage and update of the weights.
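A purely functional view of the synapse unit may help to fix the roles of the two signal paths. The sketch below is an assumption-laden illustration rather than a description of the fabricated circuit: the class name `SynapseModel` and the `rate` parameter are hypothetical, and the use of the third multiplier to form the product of o and d for the weight update is assumed from the standard back-propagation delta rule, since the text only states that three multipliers are present.

```python
class SynapseModel:
    """Behavioural model of one synapse: two signal paths plus a weight update."""

    def __init__(self, weight=0.0, rate=0.01):
        self.weight = weight  # value held by the weight processing unit
        self.rate = rate      # hypothetical learning-rate scaling

    def forward(self, o):
        """Feedforward path: neuron output o is weighted to give wo."""
        return self.weight * o

    def backward(self, d):
        """Back-propagation path: error signal d is weighted to give wd."""
        return self.weight * d

    def update(self, o, d):
        """Assumed role of the third multiplier: o * d drives the weight change."""
        self.weight += self.rate * o * d
        return self.weight


if __name__ == "__main__":
    s = SynapseModel(weight=0.5, rate=0.1)
    print(s.forward(0.4), s.backward(-1.0), s.update(1.0, 0.5))
```

The same stored weight is used in both directions, which is why the forward and reverse transfer characteristics of a single synapse can be compared directly.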
Figure 7-1 Neuron Unit Schematic [114]
Figure 7-2 Synapse Unit Schematic [114]
The error signal generator unit, shown in Figure 7-3, provides an error signal by
multiplying the input error by the differential coefficients of the neuron activation levels.
Resistors are used to convert currents to voltages and a variable gain is used to improve
learning.
Figure 7-3 Error-signal Generator Unit Schematic [114]
Each unit back-propagation chip implements a one-layer network of 8 neurons,
64 synapses, 8 error-signal generators and 8 output units (shown in Figure 7-4). A block
diagram of the chip is shown in Figure 7-5. In the feedforward calculation, signals applied
to terminals J and U are transformed according to the sigmoid functions, weighted in the
synapse blocks and then output from terminals WO. The J terminals are used for
operation in the input layer neurons or bias ones in hidden layer neurons and the U
terminals are used in the hidden neurons. In the back-propagation learning calculation,
error signals applied to terminals E are multiplied by the differential coefficients of the
neuron activation levels in the error signal generator blocks. The error signals are then
used to modify the synaptic weights in the synapse blocks and at the same time, they are
weighted and output from terminals WD.
A three-layer (input, hidden, output) back-propagation network can be 
constructed by connecting unit chips in series, while connecting unit chips in parallel can
increase the number of neuron units in each layer. The output units sum up the currents
from the synapse units in the hidden layer and output a voltage signal. They also produce
an error signal. Since the chip adopted contrastive back-propagation learning, the target
signal and output of the output unit are switched with each other and passed on as error
signals.
Figure 7-4 Output Unit
Figure 7-5 Chip Layout
7.2 Electronic Circuits
The building blocks mentioned in the above section are made up of the following
circuits: amplifiers, multipliers, differentiators, operational amplifiers, bias generators and
weight processing units. Before presenting the analogue circuitry adopted for 
implementing the chip, it is important to note that special design considerations were 
taken into account in order to tailor the circuit for operation within a neural network 
structure intended to be adopted for phoneme recognition. In this case, the main demand 
on the analogue building blocks is that of being capable of correct operation within 
massively parallel neural network systems.
For massively parallel, application specific systems, the level of integration of the 
building block components is preferably very high, but unfortunately this puts constraints 
on the architecture and minimising these constraints was one of the objects of the VLSI 
design.
The transistor size adopted for the amplifier, multiplier, differentiator and bias
generator units was L = 3 µm and W = 10 µm. These transistor sizes were chosen
following results obtained from different simulations. In fact, the objective of the
simulations was to identify the minimum transistor size capable of handling currents
from up to 200 building blocks. Consequently, the devices could be adopted for the
implementation of typical neural network architectures suitable for phoneme recognition.
The amplifier is a simple differential pair and is used as an activation function 
generator and a buffer with differential outputs and is shown in Figure 7-6. The objective 
of the differential amplifier is to amplify the difference between the two different 
potentials. The terminal sets the current flowing through . M, and are current
sources operating as active loads.
Multipliers are the most important components in the neural network as they are
the most commonly used block and determine the degree of integration and power
consumption. The desired key multiplier characteristics are the following:
• they should be small in size
• they should have current outputs
• they should have voltage inputs
A four-quadrant multiplier is also a necessity as it satisfies the requirements of the
learning hardware. Of great importance in our case is the dynamic range of the multiplier,
as this puts a constraint on the networks that can be mapped on a given topology. The
multiplier output offset error is also very important: when the outputs from many synapse
multipliers are connected together, the output offsets accumulate, resulting in offsets
greater than the maximum output of a single synapse. While in principle this offset can be
cancelled within the neuron by adjusting a bias, this would necessitate extra hardware,
increasing the complexity of the chip, and is thus undesirable. The Gilbert multiplier has
been adopted because of its small size and low power dissipation.
Figure 7-6 Amplifier Schematic
Figure 7-7 Amplifier dc sweep results
The operation of the Gilbert multiplier is based on the square law approximation
for the saturated MOSFET. In fact, the output current is given by:
$$Z^{(+)} - Z^{(-)} \approx \sqrt{2 K_u K_l}\; V_W V_K \qquad (7.1)$$
where Z represents the output current, $V_W$ and $V_K$ are the input voltages applied at
terminals W and K, and $K_u$ and $K_l$ are the transconductance parameters for the upper
two and lower differential pairs respectively.
Figure 7-8 shows the schematic diagram for the multiplier, while Figure 7-9
shows the dc analysis when the input voltages X and K are swept from -1 V to 1 V. The
classic Gilbert transconductance multiplier is named after Barrie Gilbert, who designed the
circuit in 1968 with bipolar transistors [115]. The circuit combines diode-connected
transistors, current mirrors, summing junctions, and differential pairs to multiply two
differential signals. Output currents of both element circuits are transformed to voltage
signals by using a pair of load resistors terminated to ground. The bias voltages determine
the output currents. The simulation plots show that the multiplier output is only linear for
half the input voltage scale. However, during contrastive back-propagation learning, the
effects of the non-linearities introduced by the multipliers are cancelled out.
Besides providing multiplication for the linear part of the curve, the
Gilbert multiplier provides a very good approximation of the sigmoid function. In fact,
simulation results have shown that the output of the multiplier approximates an ideal
sigmoid function with an accuracy of 96% with the chosen transistor sizes.
The bias voltage values are adjusted so that no current flows in the load resistors
at zero input. Thus, inputs and outputs have the same zero level, which enables direct
connection among the element circuits. The bias generator is required to equalise the total
current flowing in the PMOS loads to the currents flowing in the NMOS current source in
the multiplier. Figure 7-10 shows the schematic diagram for the bias generator circuit.
Since the multiplier consists of four-stage stacked transistors, the same stacked circuit, 
comprising transistors M ,, Mg,Mg,M^,M,Q,M^,and M,^is adopted. The bias voltage 
Fg, is generated through feedback control to equalise the output of the stacked circuit.
Figure 7-8 Multiplier Schematic [115]
Figure 7-9 Multiplier dc sweep simulation results: (a) d.c. simulation results; (b) comparison of the ideal sigmoid and the multiplier output, both plotted against Input Voltage (V)
The derivative generator is based on the fact that the derivative of the sigmoid (or
hyperbolic tangent) function can be obtained by shifting the square of the sigmoid
function, and this is possible using the multiplier circuit without one of the loads, and an
op amp. Moreover, since the multiplier has saturation characteristics, it is not necessary
to input the signal transformed with the sigmoid function. The shift is done so that the
output is 0 when the input is large and can be performed in the multiplier without one of
the PMOS active loads. Equation (7.2) gives the relationship between the sigmoid function
and its derivative.
$$f'(x) = 1 - f^2(x), \qquad f(x) = \tanh(x) \qquad (7.2)$$
Figure 7-11 shows the schematic diagram for the derivative generator circuit,
while Figure 7-12 shows the simulation results for a dc sweep on the input. The obtained
curves approximate the ideal derivative of the sigmoid function with an accuracy of 95%
with the given transistor sizes.
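The shifted-square relationship on which the derivative generator relies can be checked numerically. The following sketch assumes an ideal hyperbolic tangent activation with a hypothetical `gain` parameter; it models the mathematics only, not the multiplier circuit, and the function names are illustrative.

```python
import math


def sigmoid(x, gain=1.0):
    """Ideal bipolar sigmoid, modelled here as tanh(gain * x)."""
    return math.tanh(gain * x)


def derivative_from_square(x, gain=1.0):
    """Derivative obtained by squaring the sigmoid and shifting the result."""
    s = sigmoid(x, gain)
    return gain * (1.0 - s * s)


def numerical_derivative(x, gain=1.0, h=1e-6):
    """Central-difference estimate used as an independent reference."""
    return (sigmoid(x + h, gain) - sigmoid(x - h, gain)) / (2.0 * h)


if __name__ == "__main__":
    for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(x, derivative_from_square(x), numerical_derivative(x))
```

For large inputs the squared sigmoid approaches 1, so the shifted value approaches 0, in line with the statement that the output is 0 when the input is large.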
Figure 7-10 Bias Generator Schematic [114]
Figure 7-11 Derivative Generator Schematic [114]
Figure 7-12 Derivative Generator Schematic dc sweep results: (a) d.c. sweep results for derivative generator circuit; (b) comparison of the ideal curve with the obtained curve, both plotted against Input Voltage (V)
7.3 Weight Storage
The weight-processing unit is used to hold the weight data, output it and modify it 
in proportion to an error signal input [116]. It is desirable that:
1. the weights be held and modified in a fully-analogue mode in order to 
avoid quantisation effects
2. the memory device be small
3. the memory be non-volatile in practical use
A capacitor-type analogue memory is used as a memory unit due to its small size. 
However, it is volatile and requires refresh.
Within the weight-processing unit, an input signal is converted to constant voltage 
pulses with widths proportional to the error signal [114]. This is done by comparing the
input error signal with a triangular waveform, which has a typical frequency of 100 kHz
to 1 MHz. This frequency controls the weight updating speed. Weight modification can be
stopped by fixing the comparing waveform to either the positive or negative supply rail.
Weight reversal for contrastive back-propagation learning is carried out by switching the
polarity reverse (PL) control signal. The setup is shown in Figure 7-13. The output of the
voltage-to-pulse converter charges a capacitor to a voltage representing a weight.
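The voltage-to-pulse conversion described above can be sketched numerically. This is a simplified model under stated assumptions: a single ideal comparator, a symmetric triangular reference of hypothetical amplitude 1 V at 100 kHz, and a net charging fraction defined as the time spent high minus the time spent low over one period. The function names are illustrative and do not correspond to blocks on the chip.

```python
def triangle(t, period, amplitude):
    """Symmetric triangular wave swinging between -amplitude and +amplitude."""
    phase = (t / period) % 1.0
    return amplitude * (4.0 * abs(phase - 0.5) - 1.0)


def net_pulse_fraction(error, amplitude=1.0, period=1e-5, steps=10000):
    """Fraction of one period spent charging minus fraction spent discharging.

    The comparator output is taken as high whenever the error exceeds the
    triangular reference; the result is approximately error / amplitude.
    """
    up = 0
    for k in range(steps):
        t = period * k / steps
        if error > triangle(t, period, amplitude):
            up += 1
    down = steps - up
    return (up - down) / steps


if __name__ == "__main__":
    for e in (-0.5, 0.0, 0.25, 0.5):
        print(e, net_pulse_fraction(e))
```

Within the range of the triangular wave, the net pulse fraction grows linearly with the error, which is the property that lets a fixed-amplitude pulse train carry an analogue weight change. Holding the reference at a supply rail, as the text notes, freezes the comparator output and stops the modification.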
The learning rate can be set arbitrarily by adjusting resistances R± and IJ±. In
order to attain a high resolution in the weight update, the resistances R± are usually made
large, typically larger than 1 MΩ. The high resistance values can be obtained by a
MOSFET operating in the sub-threshold region. Typical voltages for IJ+ and IJ- are
-0.28 V and -0.54 V, respectively. Both transistors are operated in the sub-threshold
region.
The differential input of the weight processing unit includes offset errors arising
in the learning circuits. In addition, there may be offset errors and/or hysteresis in the two
comparators. These errors modify the output pulse width in the voltage-to-pulse
converter, but the modification in the first learning phase (PL=0) and that in the second
learning phase (PL=1) are the same as shown. Therefore, both are cancelled out by
contrastive back-propagation learning as long as errors in the logic circuits are negligible
and the increment and decrement of weight modification are symmetric.
Figure 7-13 Weight Processing Unit Schematic [116]
Since weights are not known a priori, it is best to start learning with weights near
zero. This is done using the negative feedback loop within the weight-processing circuit,
as shown in Figure 7-2. It should be noted that weights are not completely reset to zero
due to offsets in a real circuit, which guarantees that learning proceeds. Although the
deviation of the initial weights from zero due to offset is constant, the initial weights are
slightly different at every reset operation due to noise. The magnitude of the deviation is
effectively changed by changing the gain of the activation function. Figure 7-15
illustrates the weight reset operation.
Another aspect that must be taken into account is the weight memory leakage. In
order to minimise memory leakage, the time constant is made as large as possible,
typically 10^4 s. This is possible by using large capacitances. A 5 pF capacitor was chosen,
as simulation results provided a time constant of 10^4 s, while the capacitor can still be
adequately fitted into the chip layout. This value is fairly large but would still necessitate
retraining. Figure 7-16 shows the weight decay at room temperature.
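The consequence of the quoted time constant can be estimated with a short calculation. The sketch assumes that the stored weight decays as a single-pole exponential with the stated time constant of 10^4 s and capacitance of 5 pF; the actual leakage mechanism on the chip may be less ideal, and the function names are hypothetical.

```python
import math

TAU_S = 1.0e4   # time constant quoted in the text, in seconds
C_F = 5.0e-12   # storage capacitance quoted in the text, in farads


def retained_fraction(t_s, tau_s=TAU_S):
    """Fraction of the stored weight voltage left after t_s seconds."""
    return math.exp(-t_s / tau_s)


def implied_leakage_resistance(tau_s=TAU_S, c_f=C_F):
    """Equivalent leakage resistance R = tau / C under the RC assumption."""
    return tau_s / c_f


def time_to_lose(fraction_lost, tau_s=TAU_S):
    """Time in seconds after which the given fraction of the weight is lost."""
    return -tau_s * math.log(1.0 - fraction_lost)


if __name__ == "__main__":
    print(retained_fraction(3600.0))     # fraction left after one hour
    print(implied_leakage_resistance())  # equivalent resistance in ohms
    print(time_to_lose(0.01))            # seconds until 1% of the weight is lost
```

Under these assumptions a weight loses 1% of its value in roughly 100 s, which is consistent with the remark that retraining or refresh is still needed despite the large time constant.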
Figure 7-14 Simulation Waveforms for Weight Processing Unit (panels: CMP IN+ and IN-, XOR (1), XOR (2), OR, AND, and the weight processing unit adjusting to a 1 V error signal, each plotted against time in µs)
Figure 7-15 Simulation Results for Weight Initialisation (input to and output from the weight processing unit)
Figure 7-16 Weight Decay Plot (plotted against time in hours)
7.4 Operational Amplifier Design
The weight processing unit and the derivative generator require the use of an
operational amplifier. The primary use of operational amplifiers is to provide sufficient
gain to define and implement analogue signal processing functions through the use of
negative feedback. A CMOS low voltage two-stage op amp was adopted, since a basic
operational amplifier is enough to perform the tasks of a comparator, subtractor
and integrator. The schematic is shown in Figure 7-17.
The performance of the operational amplifier is dependent on the geometry of the 
devices. This creates an additional complexity in calculating the values, but will give 
more freedom to meet the specifications of the design [117, 118]. This means that more
constraints can be imposed to ensure that all MOS devices operate in saturation over wide 
process variations. The W/L widths are calculated such that operates in saturation. 
All other devices either operate in saturation by their connection or by external potentials 
applied to their inputs. For matching and symmetry,
w (7.3)
Forcing , will give the following relationship
U
(7.4)
IV. PVThen since 1^=1^, and , the following equation can be derived:
Z-3
L„ L,
However, since 7^  = 0.57  ^ and ^  =
(7.5)
the condition for to remain in saturation becomes:
py. T
L, -sJu. (7.6)
Solving for the above equations, the following values were obtained.
          M1   M2   M3   M4   M5   M6   M7   M8
W (µm)    40   40   10   10   45   10   10   10
L (µm)    10   10   22   22   10   10   50   10
Table 7-1 W/L for Operational Amplifier
Figure 7-17 Operational Amplifier Schematic
A compensating capacitor and a resistor were introduced to improve the
phase margin of the operational amplifier. The capacitor value was assumed to be 5 pF and
the resistor was calculated at 13.33 kΩ. Figure 7-18 shows the dc response of the
operational amplifier, while Figure 7-19 shows the ac response of the amplifier.
Figure 7-18 DC Response for the Operational Amplifier (plotted against Vin)
Figure 7-19 AC Response for the Operational Amplifier: magnitude plot and phase plot against Frequency (Hz)
7.5 Chip Measurements
7.5.1 Synapse Operation
The measured forward mode synapse transfer characteristics for a single synapse
are shown in Figure 7-20, while the reverse mode synapse transfer characteristics for the
same synapse are given in Figure 7-21. The non-linearity is less than 3.5%. The output
offset currents are minimal, typically around 10 nA. Due to the low current flowing within
the synapses, each neuron can handle hundreds of synapse connections and thus the chip
sets can be scaled accordingly to build up the large network sizes required for phoneme
recognition. This was made possible by carefully choosing the sizes of the transistors and
resistors.
The weight resolution was calculated to be 12 bits, which is sufficient for the
required application. More importantly, offset errors were minimised within
the weight processing unit as these would have severe effects on the stored weight value.
Figure 7-20 Forward mode synapse transfer characteristics
Figure 7-21 Reverse mode synapse transfer characteristics
7.5.2 Neuron Operation
The most important aspect in the design of the neuron is that it is capable of
handling the output currents from several synapse cells. Figure 7-22 shows the
percentage errors between an ideal sigmoid curve and one neuron output for various input
voltages.
7.5.3 Error-Signal Generator Operation
Simulations have been carried out on the derivative generator unit in order
to compare the performance of the circuit with an ideal curve
representing the derivative of the sigmoid. Figure 7-23 shows the percentage discrepancies
between the two curves. The percentage error is minimal and the offset error, through
careful transistor sizing, is reduced to a very small voltage of 5 mV, which negligibly affects
the operation of the unit. Another important characteristic for the error-signal generator
unit is the number of synapse connections it can handle. In this case, the synapses will
connect to a multiplier and multipliers were specially designed for massive connectivity.
The error-signal generator unit can easily handle connections from more than 200
synapses.
Figure 7-22 Percentage errors between an ideal sigmoid curve and one neuron output for various input voltages
Figure 7-23 Derivative Generator Error Transfer Function Discrepancies
7.6 Circuit Evaluation
Using Gilbert multipliers and load resistances of 10-12 kΩ results in very low
power consumption. In fact, limiting the input voltage range to ±1 V and using 2 V for
VDD and -2 V for VSS, the maximum power consumption for the multiplier was found to
be approximately 0.2 mW, giving a synapse power consumption of less than 1 mW. On 
the other hand the neuron unit was found to consume about 0.5 mW and the error signal 
generator about 1 mW. The given voltage range makes the chip compatible with the 
other chip sets.
Considering the whole network, including parasitic capacitances and diodes, the
forward propagation time was found to be approximately 5 µs and a time of 8 µs was
found to be an appropriate learning time for each sample. Each synapse has a propagation
delay less than 2.5 µs allowing 12 million connections per second for each synapse and a
100x100 synapse matrix can produce around 4 billion connections per second. The
system produces an accuracy of a 10 bit digital system with the given component designs.
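The matrix throughput figure can be related to the propagation delay with a simple calculation. The sketch assumes that the quoted delay bound is 2.5 µs (the reading under which the figure of around 4 billion connections per second for a 100x100 matrix follows) and that every synapse completes one connection per delay interval; the function names are illustrative. Because the delay is stated only as an upper bound, the rates computed here are lower bounds.

```python
def connections_per_second(delay_s):
    """Connection rate of one synapse that completes one operation per delay."""
    return 1.0 / delay_s


def matrix_throughput(rows, cols, delay_s):
    """Aggregate rate of a rows x cols synapse matrix operating in parallel."""
    return rows * cols * connections_per_second(delay_s)


if __name__ == "__main__":
    delay = 2.5e-6  # assumed upper bound on the synapse propagation delay, in seconds
    print(connections_per_second(delay))       # per-synapse lower bound
    print(matrix_throughput(100, 100, delay))  # 100x100 matrix lower bound
```

The parallel organisation is what produces the large aggregate figure: the rate scales with the number of synapses, not with the speed of any single one.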
Speech recognition tests for a 12-150-48 multi-layer perceptron neural network
implemented using the above chips and with a 4000 phoneme training sample showed
that the recognition rate obtained was 60.2%, which is around 2% less than that
obtained in software simulation. The discrepancy between the recognition rates is
due to the various analogue circuit imperfections, mainly non-linearities, as simulation
results have shown that with the given transistor sizes, offset errors were
small and easily cancelled out by contrastive back-propagation learning. Simulations
have also been carried out to evaluate the benefits of contrastive back-propagation
learning as opposed to conventional back-propagation within the analogue circuit
environment. Results have shown that analogue non-linearities severely affect the
operation of conventional back-propagation learning, as the recognition rate fell to
around 38.3% for the same network and sample size, while it stood at 60.2% for
contrastive back-propagation learning.
Simulations were carried out in order to test the effects of temperature variation
on the performance of the circuitry. The problem with the main circuits is that the
output currents are temperature dependent. If learning occurs at one circuit
temperature while system testing occurs at another, the circuit will not have the correct
weights to provide the appropriate responses: a 9% decrease in recognition rate resulted
when the temperature increased from room temperature at 25°C to 100°C. Table 7-2
summarises the characteristics of the back-propagation chip.
Characteristics                                  Properties
Chip Technology                                  Analogue CMOS 0.35 µm technology
Chip Area                                        3.5 mm x 3.4 mm
Number of Synapses                               64
Number of Output Neurons                         8
Minimum Power Supply Rail                        ±2 V
Power Consumption: Neuron                        0.5 mW
                   Synapse                       1 mW
Phoneme Recognition Rates                        60.2%
(4000 samples, 8 Mel Scale Coefficients,
150 hidden neurons)
Table 7-2 Summary of the properties of the back-propagation chip.
7.7 Summary
This chapter reviews contrastive back-propagation learning, which reduces the
effect of offset errors in analogue hardware environments. It then proceeds to present the
individual building blocks, namely the synapse, the neuron and the error signal generator.
The hardware implementation of each of the individual blocks is treated in detail.
Simulation results are presented for the individual building blocks. The chip, which
implements a one-layer network of 8 neurons, 64 synapses, 8 error-signal generators and
8 output units, is implemented using CMOS 0.35 µm technology and occupies an area of
3.5 mm x 3.4 mm. The structure of the eventual learning chip and the characteristics of the
chips are outlined. Hardware tests on the implemented chip show that the performance of
the implemented building blocks closely followed the results obtained during software
simulation. The layout of the chip and building blocks are presented in Appendix C.
Chapter 8
System Testing
8.0 Introduction
Having designed and manufactured three integrated circuits, one implementing
competitive learning, another back-propagation learning and a radial basis function chip,
it is important to compare the performance of the chips with the results obtained from the
software simulations. All three chips have been designed to operate at a supply range
between +2 V and -2 V and they can be cascaded to implement large neural network
architectures. During the design phase of each chip, special care was taken in order to
allow integration between the chips to provide complex learning algorithms.
8.1 Test Setup
A test bed setup was prepared in order to test the different neural network
architectures. A PC/C32 DSP board, made by Loughborough Sound Images, equipped
with the AM/D16SA Burr-Brown ADC/DAC daughter module was used to supply speech
data to the different neural network architectures that have been tested. A computer
program, written in C, generates the Mel-Scale coefficients for a given speech input,
taken from the TIMIT database. These coefficients are passed on to the DSP board in
order to generate the required test signals. The inputs to the neural network chips are
analogue voltages, which represent the magnitude of each individual coefficient and these
are presented to each of the input neurons. A block diagram of the test setup is shown in
Figure 8-1.
8.2 Results for Different Speech Systems
Several other tests were carried out in order to evaluate the performance of the neural
network architectures for speaker-dependent systems, speaker-independent systems
(speakers of the same sex and dialect), speaker-independent sex-independent systems
(same dialect) and speaker-independent sex-independent systems using speakers from
different dialect regions.
[Block diagram: Speech Waveforms (TIMIT Database) -> Loughborough Sound Images PC/C32 DSP -> Neural Network -> Phoneme Probabilities]
Figure 8-1: Test setup for the system testing
The self-organising map was implemented using 3 self-organising map chips as
shown in Figure 8-2. The input buffers, weight buffers and synapse matrices from the
different chips were connected together in order to allow a 96-bit system. 8 bits were used
to code each of the 12 Mel-Scale Coefficients. 48 winner-take-all cells were adopted in
order to provide an output for each phoneme. Mel-Scale coefficients have been adopted
mainly for two reasons. The main reason is that at the software simulation level they
provided the best results. The second is that their conversion from digital numbers to
analogue values, spanning from -1.5 V to 1.5 V, produced good variability, and the
neural networks have been able to tap into this variability and thus extract the required
hidden knowledge. The details of the training and testing data for each run are attached to
the graphs showing the results [119].
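
As a rough illustration of the coding described above, the following sketch quantises 12 coefficients to 8 bits each (96 bits in total) and maps each code onto the -1.5 V to 1.5 V range. The linear mapping, the normalisation of the coefficients to [0, 1] and the function names are assumptions made for illustration only; the exact conversion used on the DSP board is not specified here.

```python
V_MIN, V_MAX = -1.5, 1.5   # analogue range quoted in the text
BITS = 8                   # bits per Mel-scale coefficient
LEVELS = 2 ** BITS - 1     # 255 steps between the two ends of the range


def to_code(value):
    """Quantise a coefficient, assumed normalised to [0, 1], to an 8-bit code."""
    clipped = min(max(value, 0.0), 1.0)
    return round(clipped * LEVELS)


def code_to_voltage(code):
    """Map an 8-bit code linearly onto V_MIN..V_MAX (assumed mapping)."""
    return V_MIN + (V_MAX - V_MIN) * code / LEVELS


coefficients = [i / 11 for i in range(12)]      # 12 hypothetical coefficients
codes = [to_code(c) for c in coefficients]
voltages = [code_to_voltage(k) for k in codes]
total_bits = BITS * len(codes)                  # 12 coefficients x 8 bits
```

With 12 coefficients at 8 bits each, `total_bits` matches the 96-bit width obtained by connecting the three chips.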
A 12-150-48 three layer multi-layer perceptron (MLP) was implemented by 
connecting 9 back-propagation chips as shown in Figure 8-3. The first chip implements 
the input layer, five chips are connected to implement the hidden layer and two other 
chips implement the output layer [119-120].
[Diagram: Input Layer, Hidden Layer, Outputs and Targets]
Figure 8-3 Phoneme Recognition using 3 Layer Multi-Layer Perceptrons
The Time-Delay Neural Network can be implemented similarly to the multi-layer
perceptron network [119], but this time the number of synapses in the input to hidden and
hidden to output layers is increased to accommodate time-frame windows. The same
network size as the MLP network is assumed but each input unit is connected to each
hidden unit by three different links having time delays of 0, 1 and 2. Also, each hidden
unit is connected to each output unit by 5 different links having time delays of 0, 1, 2, 3
and 4.
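
The growth in synapse count can be sketched with a short calculation. The figures below assume a fully connected 12-150-48 network and ignore any bias or threshold synapses, which is a simplification made for illustration; the function names are hypothetical.

```python
def mlp_synapses(n_in, n_hidden, n_out):
    """Synapses in a fully connected three-layer network (biases ignored)."""
    return n_in * n_hidden + n_hidden * n_out


def tdnn_synapses(n_in, n_hidden, n_out, in_delays, hidden_delays):
    """Each connection is replicated once per time delay."""
    return (n_in * n_hidden * len(in_delays)
            + n_hidden * n_out * len(hidden_delays))


mlp = mlp_synapses(12, 150, 48)
tdnn = tdnn_synapses(12, 150, 48, (0, 1, 2), (0, 1, 2, 3, 4))
print(mlp, tdnn)  # 9000 41400
```

Under these assumptions the time-delay version needs 4.6 times as many synapses as the plain MLP of the same size.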
A 12-12-48 Radial Basis Function network is implemented [121-122]. The system
requires combining the three implemented chips together. This was possible because all
three chips were designed to allow cascadability and to operate within the same
voltage rail supply. This was a difficult task as the design requirements for one algorithm
are typically different from another's and compromises had to be reached in order to allow
integration, since the main scope of the project had been to design circuitry which
provides the main benefits of the neural network algorithms that have been considered. This
is why, in the design stage of the different chips, special emphasis was given to the
effects of non-linearities, and transistor sizes were chosen such that the achieved
simulation outputs closely matched the actual mathematical functions. Non-linearities and
a poor approximation of the circuit outputs to their mathematical counterparts
would have had dramatic effects in this case, as the errors would have accumulated from one
layer to another and the recognition rates would have fallen considerably.
One radial basis function chip is required to implement the input layer. The
weights are obtained through competitive learning on the self-organising map chips. In
this case, three self-organising map chips are required and the digital weights, stored
within the self-organising map chips, are converted to analogue values using the DSP
board. The outputs of the radial basis function chip are passed on to two back-propagation
chips, which produce the output.
Figure 8-4 shows the architecture of the network. Thus the weights in the
Gaussian layer are obtained via unsupervised learning, while supervised back-propagation
learning produces the required phoneme target output.
The phoneme recognition rates obtained for the hardware networks are slightly
lower than the theoretical values obtained using software implementations. For
single-user systems, the hardware TDNN attained the highest recognition rate at 74%. When the
system was trained with one speaker and tested on another speaker of the same sex,
phoneme recognition rates were also quite high, clearly indicating that neural networks are
capable of extracting the essential data that differentiates one phoneme from another. The
TDNN, MLP and RBF systems produced approximately equal recognition rates of around
69%, which is roughly 3% lower than the software simulated results. With the networks
trained on more than one speaker, the recognition rates obtained are quite similar to the
ones obtained in the previous simulations, with the RBF and TDNN achieving the highest
recognition rates at 67%, which is 4% and 5% lower than the software results. When the
system is trained and tested with speakers of opposite sex coming from different dialect
regions, the recognition rates fall only slightly, implying that the differences in sex and dialect
region have minimal impact on the neural networks. In the latter case, the RBF provides
the highest recognition rates. The deviation of the hardware values from the simulation
results is mainly attributed to the various non-linearities within the analogue circuitry,
which could not be eliminated, together with parasitic capacitances included within the
layouts of the chips.
[Block diagram: Input Vector, three sets of SOM Chips, D/A, Radial Basis Function Chip, two Back-Propagation Chips, Output, Target]
Figure 8-4 Phoneme Recognition using a Radial Basis Function Neural Network
[Bar chart: Speaker Dependent Recognition Rates. Hardware and software recognition rates for the MLP, TDNN, RBF and SOM networks.]
Training Data was obtained from a single female speaker. Training Directory: \Train\Dr4\FlhdO
Test Data from the same speaker was adopted. Testing Directory: \Test\Dr4\Flhd
12 Mel-Scale Coefficients
Figure 8-5 Speaker Dependent Phoneme Recognition Rates
[Bar chart: Speaker Independent Recognition Rates. Hardware and software recognition rates for the MLP, TDNN, RBF and SOM networks.]
Training Data was obtained from a single female speaker. Training Directory: \Train\Dr4\FlhdO
Test Data from a different female speaker was adopted. Testing Directory: \Test\Dr4\FssbO
12 Mel-Scale Coefficients
Figure 8-6 Speaker Independent Phoneme Recognition Rates
Speakers of the Same Sex and Dialect
[Bar chart: Speaker Independent Recognition Rates. Hardware and software recognition rates for the MLP, TDNN, RBF and SOM networks.]
• Training Data was obtained from two female speakers.
• Training Directory: \Train\Dr4\FlhdO and \Train\Dr4\FssbO
• Test Data from two different male speakers was adopted.
• Testing Directory: \Test\Dr4\MarwO and \Test\Dr4\MbmaO
• 12 Mel-Scale Coefficients were adopted as the Coding Option
Figure 8-7 Speaker Independent Phoneme Recognition Rates
Speakers of Opposite Sex from the same Dialect Region
[Bar chart: Speaker Independent Recognition Rates. Hardware and software recognition rates for the MLP, TDNN, RBF and SOM networks.]
Training Data was obtained from two female speakers. Training Directory: \Train\Dr4\FlhdO and \Train\Dr4\FssbO
Test Data from two different male speakers from a different dialect region was adopted. Testing Directory: \Test\Dr?\MarwO and \Test\Dr7\MbthO
12 Mel-Scale Coefficients
Figure 8-8 Speaker Independent Phoneme Recognition Rates
Speakers of Opposite Sex from Different Dialect Regions
8.3 Time-Delay Radial Basis Function (TD-RBF)
In order to exploit the strong points of the different algorithms, mainly the low
training times for radial basis function network learning, the high recognition rates for the
back-propagation algorithm and the benefit of competitive learning as an unsupervised
learning algorithm, it was decided to combine the three chips together in order to
implement a time-delay radial basis function neural network. The block diagram of the
TD-RBF is very similar to that of the radial basis function network shown in Figure 8-4.
The only difference is that the network is larger, as it receives inputs spaced at different time
intervals. The system was tackled once the radial basis function network had produced
encouraging results. Simulations were carried out before implementing the system in
order to evaluate the combined performance of the chips in a large-scale environment,
especially because a time-delay network is larger, since the system needs to handle a
larger amount of input data, which is time-shifted. In order to cope with such systems,
it was important that the effects of non-linearities on the system be negligible and that
the computational analogue elements very closely approximate their mathematical
counterparts. In fact, simulation
results confirmed that the transistor sizes and the circuitry adopted produced recognition
rates just 1.5% lower than their software equivalents.
Consequently, a printed circuit board was designed using OrCAD Version 9 and
manufactured in order to evaluate the performance of the time-delay radial basis function
network. The schematic and the layouts of the printed circuit board are attached
in Appendix D. The time-delay neural network model incorporates the concept of
time-delays in order to process temporal context, and simulation results of the algorithm have
shown that the algorithm can be successfully applied to speech recognition tasks. The
results obtained from the system developed here for phoneme classification and
recognition are presented.
The Time-Delay Radial Basis Function (TD-RBF) neural network is a two-layer,
hybrid learning network, with supervised learning adopted from the hidden to the output
units and unsupervised learning from the input to the hidden units. Back-propagation
learning is adopted as the supervised learning algorithm, while competitive learning is
adopted for training the weights of the synapses connecting the input to the hidden
neurons. The time-delay version combines data from a fixed time 'window' into a single
vector as input.
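
The windowing step can be illustrated as follows. This is a minimal sketch that assumes a window of three consecutive frames and 13 features per frame (12 Mel-scale coefficients plus an energy term, as quoted later in this section); the function name and the example data are hypothetical.

```python
def window_vectors(frames, width=3):
    """Concatenate each run of `width` consecutive frames into one input vector."""
    return [
        [x for frame in frames[i:i + width] for x in frame]
        for i in range(len(frames) - width + 1)
    ]


# Five hypothetical frames; frame t holds the value t in all 13 positions.
frames = [[float(t)] * 13 for t in range(5)]
vectors = window_vectors(frames)
```

Five frames yield three overlapping windows, each a single vector of 3 x 13 = 39 values.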
Similarly to RBF networks, the Time-Delay RBF has a static Gaussian function as
the nonlinearity for the hidden layer processing elements. The Gaussian function responds
only to a small region of the input space where the Gaussian is centred. The key to a
successful implementation of these networks is to find suitable centres for the Gaussian
functions. This can be done with supervised learning, but an unsupervised approach
usually produces better results. For this reason, a hybrid supervised-unsupervised
topology is adopted.
The algorithm starts with the training of an unsupervised layer. Its function is to
derive the Gaussian centres and the widths from the input data. These centres are encoded
within the weights of the unsupervised layer using competitive learning. During the
unsupervised learning, the widths of the Gaussians are computed based on the centres of
their neighbours. The output of this layer is derived from the input data weighted by a
Gaussian mixture.
Once the unsupervised layer has completed its training, the supervised segment
then sets the centres of the Gaussian functions (based on the weights of the unsupervised
layer) and determines the width (standard deviation) of each Gaussian. The multilayer
perceptron topology is used for the classification of the weighted input.
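
The unsupervised stage described above can be sketched in software as follows. This is an illustrative approximation, not the circuit behaviour: the learning rate, the rule that sets each width to the distance to the nearest other centre, and all function names are assumptions.

```python
import math


def nearest(centres, x):
    """Index of the centre closest to x (the winner-take-all decision)."""
    return min(range(len(centres)), key=lambda j: math.dist(centres[j], x))


def competitive_step(centres, x, lr=0.1):
    """Move only the winning centre a fraction lr of the way towards x."""
    j = nearest(centres, x)
    centres[j] = [c + lr * (xi - c) for c, xi in zip(centres[j], x)]
    return j


def widths_from_neighbours(centres):
    """Assumed heuristic: width = distance to the nearest other centre."""
    return [
        min(math.dist(c, o) for k, o in enumerate(centres) if k != j)
        for j, c in enumerate(centres)
    ]


def gaussian_layer(centres, widths, x):
    """Hidden-layer outputs: one Gaussian response per centre."""
    return [
        math.exp(-math.dist(c, x) ** 2 / (2 * s ** 2))
        for c, s in zip(centres, widths)
    ]


centres = [[0.0, 0.0], [1.0, 1.0]]
samples = [[0.1, 0.0], [0.9, 1.0], [0.0, 0.1], [1.0, 0.9]]
for sample in samples * 20:
    competitive_step(centres, sample)
widths = widths_from_neighbours(centres)
hidden = gaussian_layer(centres, widths, [0.0, 0.0])
```

An input near the first cluster produces a strong response from the first Gaussian and a weaker one from the second, which is the local behaviour the supervised output layer then combines.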
The advantage of the radial basis function network is that it finds the input to 
output map using local approximators. Usually the supervised segment is simply a linear 
combination of the approximators. Since linear combiners have few weights, these 
networks train extremely fast and require fewer training samples.
The designed hardware includes the self-organising chip to carry out competitive
learning, the radial basis function chip to compute the Gaussian function and the
back-propagation chip to implement supervised learning.
The Time-delay neural network uses a three-frame window. A network size of
13-39-48 was implemented. The input vector consists of 12 mel-scale coefficients and an
energy term.
Four thousand phoneme samples were used for training the network, while
another four thousand samples were adopted for evaluating the performance of the
network. The four thousand samples were extracted from different speakers from
different dialect regions. The samples used for training the network were extracted from
the train directory, while the testing samples were taken from the test directory.
Table 8-1 shows the recognition rates attained for each individual phoneme, while
Figure 8-9 compares the recognition rates and training efficiency obtained for the
TD-RBF with those obtained when a multi-layer perceptron, a time-delay neural
network and a self-organising map architecture were adopted for the same training and test
sets. From the table it can be noted that the recognition rate for every phoneme
was lower than the software simulated result. For some of the phonemes, the difference
was quite small, such as for phonemes oy, i, el, sh, ix and ax, but it was quite high for
other phonemes, such as u, hh, r, dh, t, dx and zh. These discrepancies are the result of the
various offset errors present in the analogue circuitry.
Phoneme   Software Recognition Rates   Hardware Recognition Rates   Difference
a         85%                          82%                          3.00%
axr       77.40%                       75%                          2.40%
er        81%                          80.30%                       0.70%
u         82.70%                       77.40%                       5.30%
ow        54.80%                       53.66%                       1.14%
oy        61.50%                       61.33%                       0.17%
ao        55%                          54%                          1.00%
ah        48.40%                       46.10%                       2.30%
ay        67.50%                       66.23%                       1.27%
aa        58.90%                       56.44%                       2.46%
ey        77.30%                       74.15%                       3.15%
i         80.20%                       80.10%                       0.10%
el        45.30%                       45.22%                       0.08%
hh        38.70%                       31.56%                       7.14%
y         84%                          81.98%                       2.02%
w         80.60%                       78.62%                       1.98%
r         100%                         93.33%                       6.67%
l         57.80%                       56.55%                       1.25%
en        15.40%                       12.40%                       3.00%
em        18.30%                       14.56%                       3.74%
ng        17.90%                       17.23%                       0.67%
n         56.90%                       55.44%                       1.46%
m         40%                          38.22%                       1.78%
dh        18.30%                       12.45%                       5.85%
v         77.80%                       77.34%                       0.46%
th        62.50%                       60.34%                       2.16%
f         92.30%                       85.34%                       6.96%
z         88.60%                       87.17%                       1.43%
sh        83.30%                       82.45%                       0.85%
s         97.50%                       92.45%                       5.05%
ch        78.90%                       71.35%                       7.55%
jh        77.60%                       77.14%                       0.46%
k         90%                          89.24%                       0.76%
t         31.80%                       22.56%                       9.24%
p         27.80%                       26.56%                       1.24%
g         22.30%                       18.35%                       3.95%
d         29.10%                       28.56%                       0.54%
b         25%                          23.30%                       1.70%
dx        34.60%                       26.70%                       7.90%
q         20.80%                       16.43%                       4.37%
ix        75.50%                       75.43%                       0.07%
zh        74.30%                       67.81%                       6.49%
nx        68.20%                       66.28%                       1.92%
aw        76.20%                       74.93%                       1.27%
iy        81.40%                       78.70%                       2.70%
ih        77.80%                       73.39%                       4.41%
ux        67.80%                       65.60%                       2.20%
ax        66.60%                       66.45%                       0.15%
Total     61.05%                       58.29%                       2.76%
Table 8-1 Recognition Rates for Time-Delay Radial Basis Function Network
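
The Difference column is simply the software rate minus the hardware rate. The short sketch below recomputes it for a few rows copied from Table 8-1, together with the overall rates (61.05% in software and 58.29% in hardware, as quoted in the conclusions); the variable names are illustrative.

```python
# phoneme: (software %, hardware %), values copied from Table 8-1
rows = {
    "oy": (61.50, 61.33),
    "hh": (38.70, 31.56),
    "t": (31.80, 22.56),
    "ix": (75.50, 75.43),
    "Total": (61.05, 58.29),   # overall rates as quoted in the conclusions
}

# Difference in percentage points, rounded to two decimals as in the table
diff = {p: round(sw - hw, 2) for p, (sw, hw) in rows.items()}

# Phoneme with the largest hardware shortfall among the sampled rows
worst = max((p for p in diff if p != "Total"), key=diff.get)
```

Among these rows the largest gap is for phoneme t, while the overall gap works out to 2.76 percentage points.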
[Bar chart: Comparing the Different Algorithms for 4000 Sample Testing and Training. Recognition Rates and Training Efficiency for the MLP, TDNN, RBF, SOM and TD-RBF algorithms.]
Figure 8-9 Comparing the TD-RBF with other learning algorithms
8.4 Use of Second Priority Phoneme to Improve the Recognition Rate
An improvement of around 14% is attained if the phoneme obtaining the second
highest priority is also considered together with the phoneme obtaining the highest
priority. This system would present two likely phonemes to the computer system and
simple language processing rules could then be used to choose the correct phoneme. This
procedure would be necessary in order to extend the phoneme recognition system to a
word level recognition system. In most cases, the two phonemes with the highest
priorities would normally belong to the same phoneme category, implying that the main
difficulty in the recognition process is to differentiate between phonemes which belong to
the same category.
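
Counting a classification as correct when the true phoneme is among the two highest-priority outputs is what is commonly called top-2 accuracy. A minimal sketch is given below; the scores and labels are made-up illustrative values, not measured results.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Hypothetical network outputs for four samples and three phoneme classes
scores = [
    [0.70, 0.20, 0.10],
    [0.40, 0.50, 0.10],
    [0.10, 0.30, 0.60],
    [0.45, 0.35, 0.20],
]
labels = [0, 0, 2, 1]

top1 = top_k_accuracy(scores, labels, 1)
top2 = top_k_accuracy(scores, labels, 2)
```

In this toy example two samples are missed by the highest-priority output alone but recovered once the second-priority output is also considered.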
8.5 Summary
Various neural networks have been implemented using the manufactured chips
and their application to the task of phoneme recognition is presented. Recognition results
have been compared to software simulations in order to evaluate the performance of the
hardware chips. It has been shown that the various offset errors and non-linearities have
had minimal effects on the recognition rates. Section 8.3 presents the results of a
time-delay RBF system, which has been adopted to exploit the capabilities of three different
learning algorithms, namely, the time-delay neural network, self-organising maps and
radial basis functions. Recognition rates for the different algorithms in the context of
different speaker-dependent and speaker-independent systems are also presented and
compared with software simulations.
Chapter 9
Conclusions and Future Work
9.0 VLSI Implementations
In this thesis, the implementation of three VLSI neural network systems and their 
configuration for phoneme recognition systems has been presented:
1. A chip for implementing Self-Organising Maps
2. A Radial Basis Function chip
3. A Back-Propagation Learning chip
The first chip was implemented using mixed mode technology, while the other
two chips used analogue technology. The chips have been fabricated and they have been
applied successfully to the task of phoneme recognition. Analogue technology has been
adopted mainly in order to attain the high processing speeds that analogue neural
networks can achieve. Analogue technology also provides the benefit of designs with
reduced power dissipation, low voltage operation and lower area cost.
Cascadability was a very important feature considered in the design of the chips,
as the main intention was to allow as much flexibility as possible in order to test the
functionality of the different topologies and architectures. Only essential hardware was
included on the chips in order to reduce the chip area. Each chip configuration was tested
on its own and successful operation was registered for all the chip sets. Also, a design
strategy to re-use components whenever possible was employed to reduce the possibility
of design errors, design time and silicon area.
The main goal of the study was to design a VLSI neural network which exploits
the strong points of the different algorithms that have been analysed. The following
characteristics were desirable: low training times, as obtained for radial
basis function network learning, the high recognition rates normally associated with the
back-propagation algorithm and the benefit of competitive learning. In order to achieve the
above, it was decided to combine the three chips together in order to implement a
time-delay radial basis function neural network.
The time-delay neural network model incorporates the concept of time-delays in
order to process temporal context, and simulation results of the algorithm have shown that
the algorithm can be successfully applied to speech recognition tasks. The results
obtained from the system developed for phoneme classification and recognition were
presented.
The Time-Delay Radial Basis neural network is a two-layer, hybrid learning
network, with supervised learning adopted from the hidden to the output units and
unsupervised learning from the input to the hidden units. Back-propagation learning was
adopted as the supervised learning algorithm. Self-Organising maps implementing
Kohonen learning were adopted for obtaining the weights of the synapses connecting the
input to the hidden neurons. The time-delay version combines data from a fixed time
'window' into a single vector as input.
The Time-Delay RBF uses a static Gaussian function as the nonlinearity for the
hidden layer processing elements. This function is available through the radial basis
function chip. The Gaussian function responds only to a small region of the input space
where the Gaussian is centred. This locality helps quicken the training process. The key
to a successful implementation of these networks is to find suitable centres for the
Gaussian functions. This can be done with supervised learning, but an unsupervised
approach usually produces better results. This is why Kohonen learning, implemented on
the Self-Organising Map chip, was adopted in this case.
The algorithm starts with the training of an unsupervised layer. Its function is to 
derive the Gaussian centres and the widths from the input data. These centres are encoded 
within the weights of the unsupervised layer using competitive learning. During the 
unsupervised learning, the widths of the Gaussians are computed based on the centres of 
their neighbours. The output of this layer is derived from the input data weighted by a 
Gaussian mixture.
Back-propagation learning is then adopted to compute the weights from the
hidden layer to the output layer. The back-propagation chip is adopted to perform this
task.
The advantage of the radial basis function network is that it finds the input to
output map using local approximators. Usually the supervised segment is simply a linear
combination of the approximators. Since linear combiners have few weights, these
networks train extremely fast and require fewer training samples.
The proposed Time-delay neural network uses a three-frame window. A network
size of 13-39-48 was implemented and trained with an input vector consisting of 12
mel-scale coefficients and an energy term. The recognition rate achieved was 58.29%, which
is only 2.76% lower than the 61.05% obtained in software simulation. This result
compares well with the recognition rates obtained for the time-delay neural network for
the same phoneme set, while its training efficiency is very close to that of the radial basis
function network. Thus, the time-delay radial basis function neural network truly exploits
the benefits of different algorithms, even at hardware level. This means that the work
done to minimise analogue offset errors and non-linearities paid dividends, as they had a
minimal effect on the recognition rates. The chips that were fabricated consume very little
power, with synapses and neurons requiring only tens and hundreds of pW to operate.
Most of the individual building blocks can be operated with a supply rail of +1V and -1V,
but in order to allow the implementation of complex neural network architectures, a
supply rail of +2V and -2V was adopted when network topologies were created.
The time-delay radial basis function neural network training is highly efficient due
to its high level of parallelism. With the software simulations running on Pentium IV
machines, parallelism is non-existent, resulting in much slower performance when
compared to their analogue hardware counterparts. Thus, the implemented system, which
has obtained recognition rates just 2% lower than the software implementation, can
achieve a much faster training time, due to its analogue nature and level of
parallelism, while consuming very little power.
9.1 Further Work
9.1.1 Hidden Markov Model/Neural Network Systems
HMM/NN hybrid approaches have recently been successfully applied to speech
recognition. In these cases, the neural network is used to estimate the phoneme class
probabilities and these probabilities are used as observation probabilities in a phone-based
HMM. Morgan and Bourlard introduced the HMM/MLP setup [23] and they
demonstrated both theoretically and practically that multi-layer perceptrons can be
successfully used in a hidden Markov model system for estimation of the state-dependent
observation probabilities. It was also shown that the connectionist estimates can be used
in combination with standard Gaussian mixture observation probability estimates to
further increase performance in state-of-the-art speech recognition systems.
Also, experience with HMM technology [38] has shown that using
context-dependent phonetic models significantly improves recognition accuracy. This is so
because acoustic correlates of coarticulatory effects are explicitly modelled, producing
sharper and less overlapping probability density locations for different phoneme classes.
Context-dependent HMMs use different probability distributions for every phoneme in
every different relevant context.
Within the hybrid approach, context-dependent phonetic modelling is primarily
done by the neural network. The system is based on a factorization of the
context-dependent observation probabilities, a network architecture that shares the
input-to-hidden layer among the context-dependent nets to reduce the number of parameters,
multiple states per phoneme with different context dependence for each state and a
training procedure that "smoothes" networks with different degrees of context
dependence to achieve robustness in probability estimates.
An alternative approach to introducing information about the context, which is
particularly amenable in the hybrid HMM/NN method [43], is to use an extended input
window that, in addition to the current frame, includes a number of previous and
following acoustic frames. The contextual acoustic information associated with the
adjacent phonemes helps remove ambiguity in otherwise confusable acoustic patterns. In
this case, the hybrid HMM/NN [19] can be a simple context-independent system. All the
burden of the modelling of the contextual information is left to the neural network.
Alternatively, both approaches to context modelling may be used jointly to increase the
discriminability of phoneme classes.
Typically, an HMM/NN hybrid uses (scaled) observation probability estimates
computed with the neural network [19]. These probabilities are combined with or used
instead of the standard Gaussian mixture state-dependent observation probability
densities. The topology of the HMM system is kept unchanged.
The neural network is typically supervised, with a target distribution defined as 1
for the index corresponding to the phoneme label and 0 for other phonemes. The network
will then output the approximate posterior probabilities P(q | x_t), that is the probability of
having q as phoneme for an acoustic vector x_t at time t.
In this case, the network architecture can be increased to consist of an input layer
spanning up to 27 frames of cepstra, delta cepstra, energy and delta energy features. The
hidden layer will include between 1000 and 3000 units and an output layer having one
unit per phoneme output. During recognition, the posterior class probabilities are
converted to phoneme class conditioned observation probabilities by using Bayes' rule.
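
The conversion by Bayes' rule amounts to dividing each posterior by the prior of its class, which gives the observation likelihood up to a factor that is the same for every class (the "scaled" likelihood used in hybrid systems). A minimal sketch, with made-up posteriors and priors and a hypothetical function name:

```python
def scaled_likelihoods(posteriors, priors):
    """Bayes' rule: p(x | q) is proportional to P(q | x) / P(q)."""
    return [post / prior for post, prior in zip(posteriors, priors)]


# Hypothetical network outputs and class priors for three phoneme classes
posteriors = [0.6, 0.3, 0.1]
priors = [0.5, 0.25, 0.25]

scaled = scaled_likelihoods(posteriors, priors)
```

Here the first two classes end up with the same scaled likelihood even though their posteriors differ, because the first class is twice as frequent a priori. In practice the priors would be estimated from the relative class frequencies in the training data.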
Since a large vocabulary speech recognition system is generally critically
dependent on the linguistic knowledge embedded in the input speech, the incorporation of
the knowledge of the language, in the form of a "language" model, is essential. The goal
of the statistical language model is to provide an estimate of the probability of a word
sequence for the given recognition task and its inclusion in a speech recognition system
can greatly improve the overall recognition gain. Alternative language models may also
include formal grammar and provide more realistic models for natural language input to
machines than artificial N-grams or words. However, they are somewhat more difficult to
integrate with acoustic decoding.
9.1.2 Circuit Level Implementation
The neural network chip set which has been presented can in principle
implement many neural network topologies for various applications. However, the
implementation of huge, massively parallel systems is beyond the scalability of the
chips. The required dynamic range of the synapses and the ability of the neurons to sink
current put a bound on the neuron fan-in. Also, the impact of offset errors and electrical
parasitics degrades the speed of such an architecture. The chips' input/output bottlenecks
and inter-chip routing would also be problematic for very large neural networks. Learning
would prove very hard in these circumstances, using the limited precision technology of
analogue VLSI.
For such applications requiring huge networks, an alternative network topology
must be found. From an analogue VLSI point of view, some kind of neuron clustering
would be advantageous [123]. Such a system would consist of sparsely interconnected
modules of densely connected neurons. The individual clusters would solve different,
reasonably complex sub-problems with the global problem being solved in a
divide-and-conquer manner.
Another problem with the proposed analogue VLSI implementations is the
memory system. The main purpose of analogue memory is for storing the weights.
Much of the design of the analogue memory assumes a medium term memory system,
which requires refreshing or retraining over a period of time. Thus, a percentage of the
training time is used up for retraining. These problems could be eliminated through the
use of a long term memory system, such as the floating gate approach.
Floating gate electrically programmable non-volatile memories (EEPROMs) form
a mature technology with widespread digital applications [76]. However, their use in
analogue applications has been proposed only recently, mainly for trimming. It has been
restricted for two main reasons:
• The accurate adjustment of a given voltage has to be done iteratively by a
succession of high voltage pulses and measurements
• A drift is observed shortly after each change
However, a single step programming technique with negligible subsequent voltage
drift has been introduced. Such a system would be non-volatile and would retain the weight
value for a long period of time without requiring any retraining.
Bibliography
[1] J. Dayhoff, “Neural Network Architectures: Au Introduction”, Van Nostrand 
Reinliold, New York, 1990.
[2] S. Haykin, “Neural Networks: - A Comprehensive Foundation”, Prentice Hall, 
New York, 1999.
[3] A.J. Montolvo, R.S. Paulos, “Towards a general-purpose analog VLSI neural 
network with on-chip Learning”, IEEE Transactions Neural Net\‘Vorlcs, 8, 
pp. 413-423, 1997.
[4] J. M. Zurada, “Introduction to AiTificial Neural Systems”, West Publishers: New 
York, 1992.
[5] K. Nakayama, A.Hirano, M. Fusakawa, “A selective learning algorithm for Non- 
Linear Synapses in Multi-Layer Neural Networks” , International Proceedings 
Joint Conference Neural Networks, 3, pp. 1704-1709, 2001.
[6] D.E. Rumelhart, J. L. Mc.Clelland, Eds. , “Parallel Distributed Processing: 
Explorations in Micro structures of Cognition”.
[7] D. Hammerstrom, “Neural Networks at work”, IEEE Spectrum, 30, pp. 26-32,
1993.
[8] R. Sarpeshkar, “Analog versus digital: extrapolating from electronics to
neurobiology”, Neural Computation, 10, pp. 1601-1638, 1998.
[9] K. Roy, S. Prasad, “Low-Power CMOS VLSI Circuit Design”, John Wiley and
Sons, New York, 2000.
[10] S. Ridella, S. Rovetta, R. Zunino, “K-winner machines for Pattern Classification”,
IEEE Transactions Neural Networks, 12, pp. 371-385, 2001.
[11] M. Ismail, T. Fiez, “Analog VLSI Signal and Information Processing”, McGraw 
Hill, New York, 1994.
[12] S. Rovetta, R. Zunino, “Efficient Training of Neural Gas Vector Quantisers with
Analog Circuit Implementation”, IEEE Transactions Circuits and Systems II, 46,
pp. 688-698, 1999.
[13] T. Nozawa, M. Konda, M. Fujibayashi, M. Imai, K. Kotani, S. Sugawa, T. Ohmi,
“A parallel vector-quantisation processor eliminating redundant calculations for
real-time motion picture compression”, IEEE Journal Solid State Circuits, 35,
pp. 1744-1751, 2000.
[14] C. Amerijck, M. Verleysen, P. Thissen, DJ. Legal, “Image Compression by
self-organising Kohonen Map”, IEEE Transactions Neural Networks, 9, pp. 503-507,
1998.
[15] S. Ridella, S. Rovetta, R. Zunino, “IAVQ: Interval-Arithmetics vector quantisation
for image compression”, IEEE Transactions Circuits and Systems II, 47,
pp. 1378-1390, 2000.
[16] A. Gopalan, H. Titus, “A New Wide Range Euclidean Distance Circuit for Neural
Network Hardware Implementations”, IEEE Transactions Neural Networks, 14,
pp. 1176-1186, 2003.
[17] M. Milev, M. Hristov, “Analog Implementation of ANN with Inherent Quadratic
Nonlinearity of the Synapses”, IEEE Transactions Neural Networks, 14,
pp. 1187-1200, 2003.
[18] Y. Taur and T. J. Watson, “The incredible shrinking transistor”, IEEE Spectrum,
36, pp. 25-29, 1999.
[19] E. Sanchez-Sinencio, A. Andreou, “Low-Voltage/Low-Power Integrated Circuits
and Systems”, I.E.E.E. Press, New York, 1999.
[20] D. Jurafsky, J. Martin, “Speech and Language Processing: An Introduction to 
Natural Language Processing, Speech Recognition and Computational 
Linguistics”, Prentice Hall, 1999.
[21] K. Lagus, M. Kurimo, “Language Model Adaptation in Speech Recognition using
Document Maps”, Proceedings of the IEEE Workshop on Neural Networks for
Signal Processing (NNSP'02), pp. 627-636, 2002.
[22] M. Kurimo, K. Lagus, “Retrieving a User Language Model from an Unsupervised
Document Map”, Machine Learning Meets the User Interface, NIPS 2003.
[23] M. Hanes, S. Ahalt, A. Krishnamurthy, “Acoustic-to-Phonetic Mapping using
Recurrent Neural Networks”, IEEE Neural Networks, 5, pp. 659-662, 1994.
[24] S.E. Johnson, P. Jourlin, G.L. Moore, K. Sparck Jones, P.C. Woodland, “The
Cambridge University spoken document retrieval system”, Proceedings ICASSP,
pp. 49-52, 1999.
[25] J. P. Lazzaro, J. Wawrzynek, A. Kramer, “Systems Technologies for Silicon
Auditory Models”, IEEE Micro, pp. 7-15, 1994.
[26] J.P. Campbell, “Speaker Recognition: A Tutorial”, Proceedings ICASSP, pp. 
1437-1462, 1997.
[27] W.B. Kleijn, K.K. Paliwal, “Speech Coding and Synthesis”, Elsevier, Amsterdam, 
1995.
[28] J. Schroeder, J.P. Campbell, “Digital Signal Processing”, Special Issue:
NIST 1999 Speaker Recognition Workshop, 10, 2000.
[29] J. Tchorz, B. Kollmeier, “A Model of Auditory Perception as Front End for
Automatic Speech Recognition”, Journal Acoustical Society of America, 106,
October 1999.
[30] R. Comerford, J. Makhoul, R. Schwartz, “The Voice of the Computer”, IEEE 
Spectrum, pp. 39-47, 1997.
[31] R. De Mori et al., “Spoken Dialogues with Computers”, Academic Press, 1998.
[32] S. H. Park, H. J. Moon, and N. M. Nasrabadi, “Subband image coding using
block-zero tree coding and vector quantization”, Proceedings ICASSP,
pp. 2054-2057, 1996.
[33] P. Kroon, W.B. Kleijn, “Linear-prediction based analysis-by-synthesis coding”,
Speech Coding and Synthesis, Kleijn, W. B. and Paliwal, K. K. (Ed.), Elsevier
Science Publishers, pp. 79-119, 1995.
[34] T. Quatieri, “Discrete-Time Speech Signal Processing Principles and Practice”, 
Prentice Hall, New Jersey, 2001.
[35] L. Rabiner, B.H. Juang, “Fundamentals of Speech Recognition”, Prentice Hall, 
New Jersey, 1993.
[36] B. Gold, N. Morgan, “Speech and Audio Signal Processing”, John Wiley and 
Sons Inc., New York, 2000.
[37] H.K. Kim, H.S. Lee, “Spectral peak-weighted filtering of cepstral coefficients for
speech recognition”, IEICE Trans. Information and Systems, vol. E83-D, no. 7,
pp. 1540-1549, 2000.
[38] S. Young, “The HTK Book”, Entropic, 1997.
[39] T.J. Hubbard et al., “Recognition and ab initio structure predictions using Hidden
Markov models and [beta]-strand pair potentials”, Proteins, 1998.
[40] Q. Huo, C. Chan, C.H. Lee, “Bayesian adaptive learning of the parameters of
Hidden Markov Models for speech recognition”, IEEE Transactions Audio and
Speech Processing, 3, pp. 334-345, 1995.
[41] C. Becchetti, L. Prina Ricotti, “Speech Recognition Theory and C++
Implementation”, John Wiley & Sons Inc., New York, 2002.
[42] P.D. Picton, “Neural and neuro-fuzzy control systems”, Neural Network Analysis,
Architectures and Applications, Institute of Physics Publishing, Bristol, 1997.
[43] M. Brown, C. Harris, “Neurofuzzy Adaptive Modelling and Control”, Prentice
Hall, 1994.
[44] H. Schwenk, “Using Boosting to Improve a Hybrid HMM/Neural Network 
Speech Recognizer”, pp. 1009-1012, 1999.
[45] P. Baldi, “Gradient Descent Learning Algorithm Overview: A General Dynamical 
Systems Perspective”, IEEE Transactions for Neural Networks, 6, pp. 182-195, 
1995.
[46] P.A. Hetherington, M.L. Shapiro, “Simulating Hebb cell assemblies: the necessity
for partitioned dendritic trees and a post-not-pre LTD rule”, Network:
Computations in Neural Systems, 4, pp. 135-153, 1993.
[47] D. Roobaert, M. M. Van Hulle, “A natural object recognition system using
self-organizing translation-invariant maps”, SNN Symposium on Neural Networks,
pp. 151-154, 1995.
[48] R.O. Duda, P.E. Hart, D.G. Stork, “Pattern classification”, Wiley, 2001.
[49] E.S. Sinencio and R.W. Newcomb (eds.), IEEE Transactions on Neural
Networks: Special Issue on Neural Network Hardware, vol. 4, 1993.
[50] J. Schürmann, “Pattern classification: A unified view of statistical and neural
approaches”, Wiley, 1996.
[51] L. C. Baird, A. W. Moore, “Gradient descent for general reinforcement learning”, 
Advances in Neural Information Processing Systems, 11, The MIT Press, 1999.
[52] D.W. Patterson, “Artificial Neural Networks Theory and Applications”, Prentice 
Hall, New York, 1995.
[53] J.M. Zurada, “Introduction to Artificial Neural Systems”, PWS Publishing 
Company, 1992.
[54] A. Waibel, T. Hanazawa, G.E. Hinton, K. Shikano, K.J. Lang, “Phoneme
Recognition using time-delay neural networks”, IEEE Transactions for Acoustics, 
Speech, Signal Processing, 31 1888-1898, 1989.
[55] B. Fritzke, “Growing Cell Structures - A Self-Organizing Network for Supervised
and Unsupervised Learning”, Transactions Neural Networks, 1, pp. 1441-1460,
1994.
[56] T. Kohonen, “Self-Organizing Maps”, Springer Verlag, Berlin, 1995.
[57] D. Merkl, “Content-Based Software Classification by Self-Organization”,
Proceedings of the IEEE International Conference on Neural Networks,
pp. 1086-1091, 1995.
[58] M. Kurimo, “Using Self-Organising Maps and Learning Vector Quantisation for 
Mixture Density Hidden Markov Models”, Ph.D. Thesis, Helsinki University of 
Technology, 1997.
[59] P. Somervuo, “Self-Organising Maps for Signal and Symbol Sequences”, Ph.D.
Thesis, Helsinki University of Technology, 2000.
[60] B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, V. Vapnik, “Comparing
support vector machines with Gaussian kernels to radial basis function classifier”,
IEEE Signal Processing, 45, pp. 2758-2765, 1997.
[61] C. Scholkopf, J.C. Burges, A.J. Smola, “Advances in Kernel Methods”,
MIT Press, 1999.
[62] J. Park, I.W. Sandberg, “Universal Approximation Using Radial Basis Function”,
Neural Computation, 3, pp. 246-257, 1991.
[63] M.T. Musavi, W. Ahmed, K.H. Chan, K.B. Faris, D.M. Hummels, “On the
Training of Radial Basis Function Classifiers”, Neural Networks, 5, pp. 595-603,
1992.
[64] C. Bishop, “Improving the Generalization Properties of Radial Basis Function 
Neural Networks”, Neural Computation, 3, pp. 579-588, 1991.
[65] X. Sun, “On the Solvability of Radial Basis Function Interpolation”, 
Approximation Theory, 2, pp. 643-646, 1989.
[66] Q. Huo, C.H. Lee, “On line Adaptive learning of the Continuous density hidden
markov model based on approximate recursive Bayes estimate”, IEEE
Transactions on Speech and Audio Processing, 5, pp. 161-172, 1997.
[67] P. Ienne, “Digital connectionist Hardware: Current Problems and Future
Challenges”, Biological and Artificial Computation: From Neuroscience to
Technology, pp. 688-713, Springer-Verlag, 1997.
[68] N. Mehrtash, D. Jung, H.H. Hellmich, T. Schoenauer, V.T. Lu, H. Klar, “Synaptic
Plasticity in Spiking Neural Networks: A System Approach”, IEEE Transactions
Neural Networks, 14, pp. 980-992, 2003.
[69] B.D. Roberts, C.C. Bell, “Spike Timing Dependent Synaptic Plasticity in 
Biological Systems”, Biol Cybern., 87, pp. 392-403, 2002.
[70] A. R. Omondi, “Neurocomputers: A dead end?”, International Journal Neural
Systems, 10, pp. 475-482, 2000.
[71] D. Anguita, S. Ridella, S. Rovetta, “Circuital implementation of support vector
machines”, Electronic Letters, 34, pp. 1596-1597, 1998.
[72] S. Komori, Y. Arima, Y. Kondo, H. Tsubota, K. Tanaka, K. Kyuma, “A 3.2
GFLOPS neural network accelerator”, IEICE Transactions Electronics, E80-C,
pp. 859-867, 1997.
[73] S. Sato, K. Nemoto, S. Akimoto, M. Kinjo, K. Nakajima, “Implementation of a
Neurochip Using Stochastic Logic”, IEEE Transactions Neural Networks, 14,
pp. 1122-1127, 2003.
[74] C. Lehmann, M. Viredaz, F. Blayo, “A generic systolic array building block for
neural networks with on-chip learning”, IEEE Transactions for Neural Networks,
3, pp. 400-407, 1993.
[75] A. Nakada, M. Konda, T. Morimoto, T. Yonezawa, T. Shibata, T. Ohmi,
“Fully-Parallel VLSI Implementation of Vector Quantization Processor using
Neuron-MOS Technology”, IEICE Transactions Electronics, E82-C, pp. 1730-1739,
1999.
[76] M. Kinjo, S. Sato, K. Nakajima, “Hardware implementation of a DBM network
with nonmonotonic neurons”, IEICE Transactions Information Systems, E85-D,
pp. 558-567, 2002.
[77] E.A. Vittoz, “Analog VLSI signal processing: why, where and how?”, Analog
Integrated Circuits and Signal Processing, 6, pp. 27-44, Kluwer Academic
Publishers, 1994.
[78] R. Sarpeshkar, J. Kramer, G. Indiveri, C. Koch, “Analog VLSI architectures for
motion processing: From fundamental limits to system applications”, Proceedings
IEEE, 84, pp. 969-987, 1996.
[79] S. Kameda, T. Yagi, “An Analog VLSI Chip Emulating Sustained and Transient
Response Channels of the Vertebrate Retina”, IEEE Transactions Neural
Networks, 14, pp. 1405-1412, 2003.
[80] B. Linares-Barranco, T. Serrano-Gotarredona, R. Serrano-Gotarredona, “On
compact low-power calibration mini-DAC’s for neural massive arrays”, IEEE
Transactions Neural Networks, 14, pp. 1337-1355, 2003.
[81] K. Hirota, M. Sugeno, “Industrial Applications of Fuzzy Technology in the
World”, World Scientific, 1995.
[82] A. Rodriguez-Vazquez, R. Navas, M. Delgado-Restituto, F. Vidal-Verdu, “A
modular programmable CMOS analog fuzzy controller chip”, IEEE Transactions
Circuits Systems II, 46, pp. 251-265, 1999.
[83] I. Baturone, S. Sanchez-Solano, J.L. Huertas, “Towards the IC implementation of
adaptive fuzzy systems”, IEICE Transactions Fundamentals, E-81-A,
pp. 1877-1885, 1998.
[84] C. Toumazou, F.J. Lidgey, D.G. Haigh, “Analogue IC Design: the Current Mode
Approach”, Peter Peregrinus Ltd., England, 1990.
[85] L. X. Wang, “A Course in Fuzzy System and Control”, Prentice Hall, 1997.
[86] J. Kowalski, T. Kacprzak, K. Slot, “VLSI implementation of analog image
median filter with average filter option based on cellular neural network
architecture”, Proceedings XXI National Conference Circuit Theory and
Electronic Networks, 2, pp. 643-648, 1998.
[87] C.F. Neugebauer, A. Yariv, “A Parallel Analog CCD/CMOS Signal Processor”, 
Proceedings Neural Information Processing Systems Conference, pp. 748-755, 
1992.
[88] P.W. Hollis, J.J. Paulos, “Artificial Neural Networks Using CMOS Analog
Multipliers”, IEEE Journal of Solid-State Circuits, 25, pp. 849-855, 1990.
[89] A. Diaz-Sanchez, J. Ramirez-Angulo, A. Lopez, E. Sanchez-Sinencio, “A parallel
analog median filter”, Proceedings IEEE International Conference Electronic
Circuits and Systems, 1, pp. 381-384, 1998.
[90] R. Woodburn, H. M. Reekie, A.F. Murray, “Pulse-Stream Circuits for On-Chip
Learning in Analogue VLSI Neural Networks”, Proceedings IEEE International
Symposium on Circuits and Systems, pp. 103-106, 1994.
[91] N. Saxena, J.J. Clark, “A Four-Quadrant CMOS Analog Multiplier for Analog
Neural Networks”, IEEE Journal of Solid-State Circuits, 29, pp. 746-749, 1994.
[92] G. Fikos, S. Vlassis, S. Siskos, “High-speed, accurate analogue CMOS rank 
filter”, Electronic Letters, 36, pp. 593-594, 2000.
[93] J. Kowalski, “0.8 µm CMOS Implementation of Weighted-Order Statistic Image
Filter Based on Cellular Neural Network Architecture”, IEEE Transactions
Neural Networks, 14, pp. 1366-1376, 2003.
[94] N. Strom, “Sparse Connection and Pruning in Large Dynamic Artificial Neural
Networks”, Proceedings Eurospeech, pp. 2807-2810, 1997.
[95] N. Strom, “Development of a Recurrent Time-Delay Neural Net Speech
Recognition System”, STL-QPSR, 2-3, pp. 1-44, 1992.
[96] N. Strom, “Continuous speech recognition in the WAXHOLM dialogue system”,
STL-QPSR, 4, pp. 67-95, 1996.
[97] S. Young, J. Jansen, J. Odell, D. Ollason, P. Woodland, “HTK - Hidden Markov
Toolkit”, Entropic Cambridge Research Laboratory, 1995.
[98] L. Lamel, J.L. Gauvain, “High performance speaker independent phone
recognition using CDHMM”, Proceedings Eurospeech, pp. 121-124, 1993.
[99] M.E. Robinson, H. Yoneda, E. Sanchez-Sinencio, “A modular CMOS design of a 
Hamming network”, IEEE Transactions Neural Networks, 3, pp. 444-456, 1992.
[100] Y. He, U. Cilingirogulu, E. Sanchez-Sinencio, “A high density and low power
charge-based Hamming Network”, IEEE Transactions Very Large Scale
Integration Systems, 1, pp. 56-63, 1993.
[101] J. Choi, B. J. Sheu, “A High-Precision VLSI Winner-Take-All Circuit for
Self-Organizing Neural Networks”, IEEE Journal of Solid-State Circuits, 28,
pp. 576-584, 1993.
[102] A.G. Andreou, K.A. Boahen, P.O. Pouliquen, A. Pavasovic, R.E. Jenkins,
K. Strohbehn, “Current-mode Subthreshold MOS Circuits for analog VLSI
systems”, IEEE Transactions on Neural Network, 2, pp. 205-213.
[103] C. Mead, M. Ismail, “Analog VLSI Implementation of Neural Systems”, Kluwer 
Academic Publisher, Boston, 1989.
[104] S.S. Watkins, P.M. Chau, R. Tawel, “A radial basis function neurocomputer
implemented with analog VLSI circuits”, Proceedings IEEE/INNS International
Joint Conference for Neural Networks, pp. 607-612, 1992.
[105] R. P. Lippmann, “A Critical Overview of a Neural Network pattern classifier”,
Proceeding IEEE Neural Networks for Signal Processing Workshop, pp. 266-275,
1991.
[106] A.L. Dajani, M. Kamel, M.I. Elmasry, “Single layer potential function neural
network for unsupervised learning”, Proceedings IEEE/INNS International Joint
Conference for Neural Networks, pp. 273-278, 1990.
[107] B.J. Sheu, J. Choi, “Neural Information Processing and VLSI”, Kluwer Academic
Publishers, Massachusetts, 1995.
[108] P. Burrascano, “A norm selection criterion for the generalised Delta rule”, IEEE
Transactions on Neural Networks, 2, pp. 125-130, 1991.
[109] S. Lee, R. Kil, “Multilayer feedforward potential function network”, Proceedings 
IEEE/INNS International Joint Conference for Neural Networks, pp. 161-172, 
1998.
[110] T. Poggio, F. Girosi, “Networks for approximation and learning”, Proceedings of
IEEE, 78, pp. 1481-1497, 1990.
[111] F.J. Pineda, “Recurrent backpropagation and the dynamical approach to adaptive
neural computation”, Neural Computation, pp. 161-172, 1989.
[112] R.J. Williams, D. Zipser, “A learning algorithm for continually running fully
recurrent neural network”, Neural Computation, pp. 270-280, 1989.
[113] K. Doya, S. Yoshizawa, “Adaptive neural oscillator using continuous-time
back-propagation learning”, Neural Networks, pp. 375-385, 1989.
[114] T. Morie, Y. Amemiya, “An All-Analog Expandable Neural Network LSI with
On-Chip Backpropagation Learning”, IEEE Journal of Solid-State Circuits, 29,
pp. 1086-1093, 1994.
[115] T. Morie, O. Fujita, Y. Amemiya, “Analog VLSI implementation of adaptive
algorithms by an extended Hebbian synapse circuit”, IEICE Transactions on
Electronics, 75-C, pp. 303-311, 1992.
[116] P. Baldi, F. Pineda, “Contrastive learning and neural oscillations”, Neural
Computation, pp. 526-545, 1991.
[117] P. E. Allen, D.R. Holberg, “CMOS Analog Circuit Design”, Oxford University 
Press, New York, 1987.
[118] R.L. Geiger, P.E. Allen, N.R. Stradler, “VLSI Design Techniques for Analog and 
Digital Circuits”, McGraw Hill International Editions, New York, 1990.
[119] E. Gatt, J. Micallef, P. Micallef, and E. Chilton, “Phoneme Classification in
Hardware Implemented Neural Networks”, Proceedings of the 8th IEEE
International Conference on Electronics, Circuits and Systems, pp. 481-484,
2001.
[120] E. Gatt, J. Micallef, and E. Chilton, “An Analog VLSI Time-Delay Neural 
Network Implementation for Phoneme Recognition”, Proceedings of the 6th IEEE 
International Workshop on Cellular Neural Networks and their Applications, 
pp. 315-320, 2000.
[121] E. Gatt, J. Micallef, and E. Chilton, “Hardware Radial Basis Functions Neural
Networks for Phoneme Recognition”, Proceedings of the 8th IEEE International
Conference on Electronics, Circuits and Systems, pp. 627-630, 2001.
[122] E. Gatt, J. Micallef, and E. Chilton, “Analogue radial basis function networks for
phoneme recognition”, Proceedings of the 9th IEEE International Conference on
Electronics, Circuits and Systems, pp. 583-586, 2002.
[123] J.C. Houk, “Learning in Modular Networks”, Proceedings 7th Yale Workshop on
Adaptive and Learning Systems, pp. 80-84, 1992.
Appendix A
Self-Organising Map 
Chip Layout
Appendix B
Appendix C
Time-Delay Neural Network Chip Layout
[Figure: Time-Delay Neural Network chip layout]
[Figure: Amplifier Layout]
[Figure: Multiplier Layout]
[Figure: Derivative Generator Layout]
[Figure: Error Signal Generator Layout]
[Figure: Synapse Unit Layout]
Appendix D
PCB Top View
[Figure: PCB top view]
DRILL CHART
SYM          DIAM       TOL   QTY   NOTE
(illegible)  0.711 mm         61
X            0.864 mm         192
(illegible)  3.175 mm         2
TOTAL                         255
PCB Bottom View
[Figure: PCB bottom view]
DRILL CHART
SYM          DIAM       TOL   QTY          NOTE
(illegible)  0.711 mm         61
X            0.864 mm         192
(illegible)  3.175 mm         (illegible)
TOTAL                         (illegible)
