Hardware Learning in Analogue VLSI Neural Networks by Lehmann, Torsten
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
Hardware Learning in Analogue VLSI Neural Networks
Lehmann, Torsten; Bruun, Erik
Publication date: 1995
Document version: Publisher's PDF, also known as Version of record
Citation (APA):
Lehmann, T., & Bruun, E. (1995). Hardware Learning in Analogue VLSI Neural Networks. Kgs. Lyngby,
Denmark: Technical University of Denmark (DTU).
ELECTRONICS INSTITUTE

Hardware Learning in Analogue VLSI Neural Networks

A thesis by Torsten Lehmann

In partial fulfillment of the requirements for the degree of Doctor of Philosophy

September
Technical University of Denmark
Lyngby, Denmark
Typeset using TeX (plain format; PhDMac by TL).

Copyright © Torsten Lehmann. All rights reserved.
Abstract
English
In this thesis we are concerned with the hardware implementation of learning algorithms for analogue VLSI artificial neural networks. Artificial neural networks (ANNs) are often successfully applied to problems for which no algorithmic solution exists but which can be described by examples. ANNs are fault tolerant and parallel by nature; analogue VLSI is a technology that can efficiently exploit these properties, providing high performance systems. Analogue VLSI implementations of recall mode ANNs are maturing, but the equally important problem of implementing programming (or learning) hardware for these is still in its infancy.

We shall present the analogue VLSI implementation of two supervised gradient descent learning algorithms for ANNs: the error back-propagation learning algorithm (BPL) for layered feed forward ANNs and the real-time recurrent learning algorithm (RTRL) for general recurrent networks. Both algorithms teach a cascadable analogue VLSI chip set for ANNs, which we shall also describe. This chip set uses simple capacitive weight storage with a digital RAM backup memory. The BPL algorithm is implemented on-chip, based on a novel bidirectional principle, resulting in a very modest hardware increase compared to the recall mode system. The RTRL algorithm is implemented as add-on hardware to the recall mode system, using a compromise between computational speed and hardware consumption. The implementations of several algorithmic variations are also considered (e.g. weight decay and momentum). Results from measurements on the fabricated chips are presented, as well as measurements on a recall mode system.

We display the novel category of gradient descent like algorithms, non-linear gradient descent, which are better suited for hardware implementations than ordinary gradient descent, both in terms of accuracy and hardware consumption. Further, we argue that ANN ensembles should be used to improve the performance of analogue neural systems. Also included are novel considerations on analogue computing accuracy, offset compensation, derivative computation, analogue memories, network topologies, process parameter dependency canceling and learning in systems with RAM backup, among other things.

We conclude that, though the technology is promising for implementing learning algorithms, much research is still needed, both at an algorithmic level and at an implementation level.
Danish
In this thesis we shall be concerned with hardware implementations of learning algorithms for analogue VLSI artificial neural networks. Artificial neural networks (ANNs) are often successfully applied to problems for which no solution algorithm exists but which can be described by means of examples. ANNs are by nature fault tolerant and parallel; analogue VLSI is a technology that can efficiently exploit these properties for the implementation of high performance systems. Analogue VLSI implementations of fixed-programmed ANNs are maturing, but the equally important problem of implementing programming (or learning) hardware for these is still in its infancy.

We shall here present analogue VLSI implementations of two supervised gradient descent learning algorithms for ANNs: the error back-propagation learning algorithm (BPL) for layered networks without feedback, and the real-time recurrent learning algorithm (RTRL) for general recurrent networks. Both algorithms train a cascadable analogue VLSI chip set for ANNs, which we shall also describe. This chip set uses a simple capacitive weight storage with a digital RAM memory for weight refresh. The BPL algorithm is based on a novel bidirectional principle and is implemented internally on the ANN chip set using very little extra hardware. The RTRL algorithm is implemented with external hardware, as a compromise between computational speed and hardware consumption. Implementations of different variants of the algorithms are also considered (e.g. weight decay and momentum). Results from measurements on the fabricated chips will be presented, as well as measurements on a system based on the ANN chip set.

We present the new category of gradient descent like algorithms, non-linear gradient descent, which are better suited for hardware implementations than ordinary gradient descent, both with respect to precision and hardware consumption. Furthermore, we point out that ANN ensembles should be used to improve the performance of analogue neural systems. The text also presents new considerations regarding, among other things, the precision of analogue computing units, offset compensation, derivative computation, analogue memories, network topologies, cancellation of process parameter dependency and learning in systems with RAM refresh memory.

We conclude that, although the technology is promising for implementing learning algorithms, much research is still needed, both at an algorithmic level and at an implementation level.
Preface
The present thesis is a partial fulfillment of the requirements for the degree of Doctor of Philosophy (licentiatgraden, Philosophiae Doctor, Ph.D.). The work was carried out at the Electronics Institute, the Technical University of Denmark, and was funded by a scholarship from the Technical University of Denmark. Professor Erik Bruun of the Electronics Institute was supervisor.

I have tried to make this thesis a coherent presentation of hardware learning in analogue VLSI neural networks, though by no means an exhaustive one. This has made it necessary to include work that is not entirely my own, and I will state it clearly whenever this is the case. In particular, my fellow Ph.D. student John Lansner was responsible for large parts of the work in implementation of neural networks. Thomas Kaulberg was responsible for the op-amps (and current conveyors) used on the chips. A Master's student of mine, Jesper Schultz, did much of the work on the sparse input synapse chip architecture. Finally, John Hertz, Benny Lautrup and Anders Krogh were responsible for the development of non-linear back-propagation.

I have tried to refer to other authors whenever possible; what is not referenced (apart from well known matters) is for the most part my own work. Some people will undoubtedly find parts of the text provocative (e.g. when I argue that one should refrain from using floating gate memories for synapse strengths); please do not take offense. It is by no means a mark of disrespect for other people's work, merely personal views (as well as a deliberate attempt to provoke, which I think is sound for development in any field of research).

Regarding the layout of the thesis: (disgracefully, some would say) I have the habit of using parentheses quite often. Note that they often act like subordinate clauses, their contents being important. Figure and equation numbers, which are globally enumerated, carry chapter labels as superscripts for ease of location. In most places, references to other people's work (such as Sánchez-Sinencio and Lau) are incomplete in the sense that only a few of the relevant references are displayed; references that cover material related to the text are usually preceded by "cf." or "see also". Italics are used for emphasis and for concepts that are found in the index. I use "we" as the personal pronoun throughout the thesis, as I think this eases reading.

The thesis is organized as follows:

In the introduction, the field of analogue VLSI neural networks, including hardware learning, is briefly introduced. Motivations for the research are given and the objective of the thesis is defined.

In the implementation of the neural network chapter, the neural network architecture that will form the basis of the rest of this thesis is presented. The results of the chip-in-the-loop training also serve as a standard of reference for on-chip learning implementations.

In the preliminary conceptions on hardware learning chapter, the choices of learning algorithms for implementation are considered, and I elaborate on general considerations on hardware learning.

In the implementation of on-chip back-propagation chapter, the first hardware learning system is described. It is based on a simple but elegant idea and was meant to be just a minor part of the thesis (my initial work was that of implementing RTRL). As it turned out, however, there was a lot of hard work involved in verifying this simple idea (work worth at least … years). The work has borne fruit, though, in several paper invitations.

The second hardware learning system is described in the implementation of RTRL hardware chapter. The system architecture of this work was developed during my studies for the Master's degree. As for the first system, there was a lot of hard work involved in the implementation and testing of the experimental system, work that is not complete at the time of writing.

In the thoughts on future analogue VLSI neural networks chapter, I have collected some odds and ends of the field which did not fit into the other chapters. During my Ph.D. study a lot of ideas have come to my mind on network architectures, sub-circuits, learning algorithms, etc. Also, I have formed my own personal opinion on several matters. These thoughts are not all mature; however, I find it important to propagate this information to the scientific community so that other scientists can benefit from the ideas.

Finally, in the last chapter, conclusions, the conclusions are drawn. In this section I have tried to emphasize what my own contributions to science are (denoted by "we" or "our").

The appendices hold material that is of interest mostly to the meticulous reader.

The enclosures hold material that is of interest only to the reader who wants to carry on my work, and they serve as documentation for my work.

Since the thesis is organized in a project oriented way, many considerations are placed in the bulk of the text where they are used (for readability) rather than being organized in a logical manner. I hope the index will prove adequate for locating such considerations.

Lyngby
September 
Torsten Lehmann
Acknowledgements
I should like to thank the analogue integrated electronics group at the Electronics Institute for valuable discussions during this work, in particular Gudmundur Bogason, Erik Bruun, Thomas Kaulberg, John Lansner and Peter Shah. Thanks to the members of CONNECT for discussions on neural networks, especially Lars Kai Hansen and Anders Krogh. Also thanks to Ole Hansen of the Mikroelektronik Centeret, who was always ready with an answer and a helping hand. Thanks to Mogens Yndal Petersen of the Electronics Institute for the layout and soldering of numerous PCBs. Thanks to the DTU Eurochip staff, who endured many questions and pushed deadlines during chip manufacturing.

Thanks to Peter A. Toft, John A. Lansner, Lars Kai Hansen, Thomas Kaulberg and Gudmundur Bogason for valuable criticism of the thesis.

Finally, thanks are due to the Danish Technical Research Council, the Danish Natural Science Council and Analog Devices, Denmark for financial support.
Contents
Abstract
Preface
Acknowledgements
Contents
Abbreviations
Symbols
List of figures

Chapter 1 Introduction
  Implementing ANNs in analogue hardware
  Implementing learning algorithms in analogue hardware

Chapter 2 Implementation of the neural network
  The artificial neural network model
    The neurons
    The network
  Mapping the algorithm on VLSI
    Architecture
    Signalling
    Memories
    Multipliers
    Activation functions
  Chip design
    The neuron chip
    The synapse chip
    Sparse input synapse chip
  Chip measurements
    The neuron chip
    The synapse chips
    Chip compound
  System design
  System measurements
  Further work
    Process parameter dependency canceling
    Temperature compensation
    Other improvements
  Summary

Chapter 3 Preliminary conceptions on hardware learning
  Hardware consumption
  Choice of learning algorithms
    Gradient descent learning
    Error back-propagation
    Real-time recurrent learning
  Hardware considerations

Chapter 4 Implementation of on-chip back-propagation
  The back-propagation algorithm
    Basics
    Variations
  Mapping the algorithm on VLSI
    Hardware efficient approach
  Chip design
    The synapse chip
    The neuron chip
  Chip measurements
    The synapse chip
    The neuron chip
    Improving the derivative computation
  System design
    ASIC interconnection
    Weight updating hardware
  Non-linear back-propagation
    Derivation of the algorithm
    Hardware implementation
  Further work
    Chopper stabilizing
    Including algorithmic improvements
    Other improvements
  Summary

Chapter 5 Implementation of RTRL hardware
  The RTRL algorithm
    Basics
    Variations
  Mapping the algorithm on VLSI
    System simulations
  Chip design
    The width N data path module signal slice
    Auto offset compensation
  Chip measurements
  System design
    ASIC interconnection
    The width … data path module
    The interface
    Algorithm variations
  Non-linear RTRL
    Derivation of the algorithm
    Hardware implementation
  Further work
    Continuous time RTRL system
    Other improvements
  Summary

Chapter 6 Thoughts on future analogue VLSI neural networks
  Gradient descent learning
  Neuron clustering
  Self refreshing system
    Neural network ensembles
  Hard/soft hybrid synapses

Chapter 7 Conclusions

Bibliography
Index

Appendix A Definitions

Appendix B Artificial neural networks
  The ANN model
  Applications and motivations
  Teaching ANNs
    Gradient descent algorithms
  Performance evaluation

Appendix C Integrated circuit issues
  MOS transistors
  Bipolar transistors
  Analogue computing accuracy
  Integrated circuit layout

Appendix D System design aspects
  The scalable ANN chip set
    The synapse chips
    Chip set improvements
  The on-chip back-propagation chip set
    The back-propagation synapse chips
    The back-propagation neuron chips
    The scaled back-propagation synapse chips
    Back-propagation chip set improvements
  The RTRL chip
    RTRL chip improvements
  The RTRL/back-propagation system

Appendix E Building block components
  The op-amp and the CCII+
  The transconductor
  MOS resistive circuit

Enclosure I Published papers
Enclosure II Chip photomicrographs
Enclosure III Test PCB schematics
Enclosure IV RTRL/back-propagation system interface
Abbreviations
AC Alternating Current
A/D Analogue/Digital
ADC Analogue to Digital Converter
ANN Artificial Neural Network
ASIC Application Specific Integrated Circuit
BiCMOS Bipolar Complementary MOS integrated technology
BJT Bipolar Junction Transistor
BPL Back-Propagation Learning
CCII Current Conveyor of second generation
CCO Current Controlled Oscillator
CMOS Complementary MOS integrated technology
CPS Connections Per Second
CUPS Connection Updates Per Second
D/A Digital/Analogue
DAC Digital to Analogue Converter
DC Direct Current
DNA DesoxyriboNucleic Acid
DSP Digital Signal Processor
EEPROM Electrically Erasable Programmable ROM
FLOPS FLoating-point Operations Per Second
FSM Finite State Machine
GCPS Giga CPS
GCUPS Giga CUPS
IPM Inner Product Multiplier
IS Input Scale
ISA Industry Standard Architecture
LBM Lateral Bipolar Mode
LSB Least Significant Bit
MCPS Mega CPS
MCUPS Mega CUPS
MDAC Multiplying DAC
MFLOPS Mega FLOPS
MLP Multi Layer Perceptron
MOS Metal Oxide Semiconductor
MOSFET MOS Field Effect Transistor
MOST MOS Transistor (MOSFET)
MPC Multi Project Chip
MPW Multi Project Wafer
MRC MOS Resistive Cell
mRNA Messenger RNA
MSB Most Significant Bit
MVM Matrix-Vector Multiplier
NARV Normalized Average Relative Variance
NLBP Non-Linear Back-Propagation
NLRTRL Non-Linear RTRL
NLSM Non-linear Synapse Multiplier
OBD Optimal Brain Damage
OR Output Range
PC Personal Computer
PC/AT IBM PC/AT type compatible computer
PCB Printed Circuit Board
PFM Pulse Frequency Modulation
PLA Programmable Logic Array
PSRR Power Supply Rejection Ratio
PWM Pulse Width Modulation
RANN Recurrent Artificial Neural Network
RAM Random Access Memory
RGC Regulated Gain Cascode
RNA RiboNucleic Acid
ROM Read-Only Memory
RTRL Real-Time Recurrent Learning
SAR Successive Approximation Register
TTL Transistor-Transistor Logic
UV Ultra Violet
VLSI Very Large Scale Integration
WSI Wafer Scale Integration
Symbols
A list of the less commonly used symbols appearing in the thesis is given here. For standard symbols refer to the literature (e.g. Geiger et al., Hertz et al. or Sánchez-Sinencio and Lau).

…  As index: the index runs over all possible values
…  Vector multiplication by coordinates (element-wise product)
$*$  Convolution operator
$b$  As superscript: $\xi^{b}$ is bit number $b$ of $\xi$
$B$  Number of bits; $B_{\xi}$ is a $B$ bit discretization (resolution) of $\xi$. Also bandwidth
$B_{M\xi}$  Memory necessary to store $\xi$
$C_{ox}$  MOS gate oxide capacitance per unit area
$d_k$  Neuron target value
$D_{\xi}$  Non-linearity of $\xi$
$D_{A\xi}$  Accuracy of $\xi$
$E_{NARV}$  Normalized average relative variance error measure
$g_k(s_k)$  Neuron activation function; also as vector of scalar functions, $\mathbf{g}(\mathbf{s})$
$i$  ANN weight (neuron) index, $i \in U$
$I$  Set of ANN input indices ($m$)
$I_{sm}$  Maximal signal current
$j$  ANN input/output index, $j \in I \cup U$
$J$  Cost function (instantaneous), e.g. quadratic ($J_Q$) or entropic ($J_E$)
$J_{tot}$  Total cost function
$k$  ANN neuron index, $k \in U$
$k'$  $p^{k}_{ij}$ index ($k$); RTRL chip access variable
$K'$  MOST process transconductance parameter ($\mu C_{ox}$)
$L$  Number of layers in ANN. Also MOST channel length
$LSB_B$  Sometimes used (non unit-less) for bit absolute measures
$l$  ANN layer number (often indexing as superscript), running from the input layer to the output layer $L$
$\Delta l_{ofs}$  Geometric device size offset (e.g. on $L$)
$m$  ANN input index, $m \in I$
$M$  Number of inputs in ANN
$M_k$  Number of inputs to neuron $k$
$N$  Number of neurons in ANN
$N_A$  Number of letters in input alphabet $A$
$N_{epc}$  Number of epochs the ANN is trained
$O(\cdot)$  In most places implicitly order of $N$, $O(f(N))$
$N_E$  Number of networks in ANN ensemble
$N_l$  Number of neurons in layer $l$ of ANN
ofs  As index: offset error
$p^{k}_{ij}$  Neuron derivative variables for RTRL
$p^{k}_{N\,ij}$  Neuron derivative variables for non-linear RTRL
res  As index: resolution
$R_D$  Dynamic range
$s_k$, $s^{l}_{k}$  Neuron net input; also as vector $\mathbf{s}$
$t$  Time. Often unit-less
$t_{\xi\,pd}$  Propagation delay when calculating $\xi$
$T$  Set of neuron indices ($k$) for which a target exists
$T_{cyc}$  Learning cycle time
$T_{epc}$  Training epoch (the time $t$ lies within $T_{epc}$ when training)
$T_k$  Indicates whether $k \in T$
$U$  Set of neuron indices ($k$)
$V_{sm}$  Maximal signal voltage
$V_t$  Thermal voltage, $V_t = kT/q$
$V_T$  MOSFET threshold voltage
$w_{ij}$, $w^{l}_{ij}$  Connection strength from input/neuron $j$ (layer $l-1$) to neuron $i$ (layer $l$); also as matrix $\mathbf{w}$
$\mathbf{w}$ (matrix variant)  $\mathbf{w}$ without the columns corresponding to the inputs
$\mathbf{w}$ (vector variant)  Connection strengths arranged as vector
$\Delta w_{ij}$  Connection strength change
$\Delta w_{min}$  Connection strength change threshold
$W$  MOST channel width
$x_m$  Input to ANN; also as vector $\mathbf{x}$
$y_k$, $y^{l}_{k}$  Neuron activation; also as vector $\mathbf{y}$
$z_j$, $z^{l}_{j}$  Neuron input (ANN input or neuron activation); also as vector $\mathbf{z}$
$\alpha_{FC}$  BJT forward emitter-collector current gain
$\alpha_{mtm}$  Learning momentum parameter
$\alpha_{N}$  NLBP domain parameter
$\beta$  MOSFET transconductance parameter, $\beta = (W/L)\,\mu C_{ox}$
$\beta_t$  Activation function steepness, the derivative $\partial g(s)/\partial s$ at a given net input $s$
…$_F$  Derivative (or Fahlman) perturbation
$\delta^{l}_{k}$  Weight strength error for back-propagation; also as vector $\boldsymbol{\delta}^{l}$
$\delta^{l}_{N\,k}$  NLBP weight strength error
…$_{\xi}$  Droop rate of sampled signal (e.g. weight drift in mV/s)
$\varepsilon_k$, $\varepsilon^{l}_{k}$  Neuron activation error; also as vector $\boldsymbol{\varepsilon}$
$\varepsilon_{dec}$  Weight decay parameter
$\zeta$  General quantity (random variable) or free running index
$\eta$  ANN learning rate. Also neuron specific, $\eta_k$
$\Theta_k$  Neuron threshold
$\Lambda$  (Weight) restoration efficiency
$\mu$  Carrier surface mobility
$\xi$  General quantity (random variable) or free running index
$\sigma_I$  Current mismatch standard deviation (compared to reference device)
…$^{k}_{ij}$  Neuron net input derivative variables for RTRL
$\phi_{\xi}$  Clock phase $\xi$

Generally the rules below are obeyed though deviations do appear Not all dier
ent kinds of signals are distinguished by the rules the context in which a symbol
appear must supply this lack of information The lack of a consistent usable
standard necessitates this unfortunate denition
For electrical signals the case of the symbol indicates whether it is a DC signal
or an AC signal For the rst case bias voltages quiescent currents etc	 we use
upper case letters eg I
bias
 for the second small signal quantities instantaneous
values etc	 we use lower case letters eg i
signal

The font of the subscript indicates if the subscript refer to another symbol italics
eg v
y
	 or if the subscript is descriptive roman eg v
out
	
The case of a descriptive index usually indicates whether this is an compound
abbreviation upper case eg v
GS
	 or not lower case eg v
ofs
	
List of figures
Figure: Expandable neural network
Figure: Expandable recurrent neural network
Figure: Reconfigurable neural network
Figure: Typical electronic synapse
Figure: Capacitive storage
Figure: Floating gate MOSFET
Figure: MOS Gilbert multiplier
Figure: MOS resistive circuit multiplier
Figure: MRC resistive equivalent
Figure: Multiplying DAC synapse
Figure: Simple nonlinear synapse multiplier
Figure: Weight-output characteristic of NLSM
Figure: Pulse frequency neuron
Figure: Distributed neuron
Figure: Hyperbolic tangent neuron
Figure: Inner product multiplier
Figure: Synapse schematic
Figure: Current conveyor differencer
Figure: Nucleotide sequence
Figure: Sparse input synapse chip column
Figure: Measured neuron transfer function
Figure: Measured synapse characteristics
Figure: Measured synapse-neuron transfer characteristics
Figure: Measured synapse-neuron step response
Figure: Two layer test perceptron
Figure: Test perceptron system architecture
Figure: Sunspot prediction
Figure: Sunspot learning error
Figure: Sunspot prediction error
Figure: Non-unity ec current gain canceling
Figure: General process parameter canceling circuit
Figure: Schematic backpropagation synapse
Figure: Schematic backpropagation neuron
Figure: MRC operated in forward mode
Figure: MRC operated in reverse mode
Figure: Backpropagation system
Figure: Second generation synapse chip
Figure: Second generation hyperbolic tangent neuron
Figure: Backpropagation neuron
Figure: Forward mode synapse characteristics
Figure: Reverse mode synapse characteristics
Figure: Forward mode weight offsets
Figure: Reverse mode weight offsets
Figure: Forward mode neuron characteristics
Figure: Computed neuron derivative
Figure: Different neuron transfer functions
Figure: Different neuron nonlinearities
Figure: Different parabola transfer functions
Figure: Different parabola nonlinearities
Figure: Neuron sampler droop rate
Figure: Differential quotient derivative approximation
Figure: Backpropagation ANN architecture
Figure: Digital weight updating hardware principle
Figure: NLBP training error
Figure: Continuous time nonlinear backpropagation neuron
Figure: Discrete time nonlinear backpropagation neuron
Figure: Neuron activation block schematic
Figure: Simulated neuron transfer function
Figure: Chopper stabilized weight updating
Figure: The discrete time RANN system
Figure: The discrete time RTRL system
Figure: Order N signal slice
Figure: Current auto zeroing principle
Figure: Double resolution D/A conversion
Figure: SAR bit slice
Figure: Weight change IPM element characteristics
Figure: Tanh derivative computing block characteristics
Figure: Edge triggered sampler sampling
Figure: Auto zeroing simulation
Figure: RTRL ANN basic architecture
Figure: Nonlinear RTRL system
Figure: Neuron clustering
Figure: Self refreshing ANN system
Figure B: General neural network model
Figure B: Layered feedforward neural network
Figure C: N-channel MOS transistor symbols
Figure C: N-channel MOS transistor
Figure C: Short channel snapback
Figure C: NPN bipolar transistor symbol
Figure C: NPN bipolar transistor
Figure C: Lateral bipolar mode MOSFET symbol
Figure C: Lateral bipolar mode MOSFET
Figure C: Current subtraction by synapse
Figure C: Current subtraction by row
Figure C: Layout of matched transistors
Figure D: Digital level shifter
Figure D: Synapse layout
Figure D: Table of ANN chip set characteristics
Figure D: Table of row/column element control
Figure D: Backpropagation synapse column/row element
Figure D: Forward mode BPL synapse row element
Figure D: Forward mode BPL synapse column element
Figure D: Route mode BPL synapse row element
Figure D: Route mode BPL synapse column element
Figure D: Reverse mode BPL synapse row element
Figure D: Reverse mode BPL synapse column element
Figure D: Backpropagation neuron schematic
Figure D: Backpropagation weight update schematic
Figure D: Table of backpropagation chip set characteristics
Figure D: Table of scaled BPL synapse chip characteristics
Figure D: Scaled synapse chip characteristics
Figure D: SAR start signal gating
Figure D: Transmission gate symbol
Figure D: RTRL signal slice schematic
Figure D: RTRL weight change schematic
Figure D: Clock generator
Figure D: Table of RTRL chip characteristics
Figure E: The operational amplifier
Figure E: Regulated gain cascode
Figure E: RGC current mirror
Figure E: The current conveyor
Figure E: Op-amp frequency response
Figure E: The transconductor
Figure E: Typical MRC layout
Chapter 1
Introduction
This thesis describes the analogue VLSI implementation of two supervised learning algorithms for artificial neural networks. One is the error backpropagation learning algorithm for a layered feedforward network, and the other is the real-time recurrent learning algorithm for a general recurrent network. Both operate in discrete time on a cascadable analogue VLSI neural network that has a digital random access backup memory for the weights. Also included in this thesis is the implementation of a cascadable analogue VLSI neural network, as well as some general conceptions on hardware learning and thoughts on future analogue VLSI neural networks.
During the last decade or so, the field of artificial neural networks (ANNs, see appendix B) has matured. Artificial neural networks are no longer magic devices but powerful tools (when used in the right manner) for classification problems and similar tasks for which no algorithmic solution is known. With the ANN foundation growing increasingly solid, both theoretically and on an application level, hardware implementations (primarily analogue and digital VLSI, see appendix C) for high performance systems have begun to emerge. VLSI implementations of recall-mode ANNs are maturing, and the field is ready for the step towards fully adaptive VLSI ANNs (i.e. including learning). This is the objective of the present thesis: to study analogue VLSI implementations of computational neural networks, with the emphasis on learning hardware implementations. In this introduction, motivations for using analogue VLSI to implement ANNs and learning algorithms are described, and the objective of the present work is defined.
1.1 Implementing ANNs in analogue hardware
Why use analogue hardware? Artificial neural networks can easily be simulated on standard digital von Neumann computers. These general purpose computers are rapidly moving into every imaginable part of our society. They are the subject of very intense research worldwide, and the competition among manufacturers is very hard. Their computational performance is growing exponentially over time. How can we possibly hope to compete with them? We cannot. There are niches, though, where analogue integrated neural networks have the advantage. In this section we will examine these.

 Parallelism In the ever present pursuit for faster data processing two ap
proaches are possible One is to use faster systems eg higher clock frequen
cies	 During the recent years this procedure has been exploited to the utmost
limit interfacing to and communication with these very fast systems is ever
more dicult Seitz 	 The other way to speed up data processing is to
use parallel data processing elements This is no trivial task though as many
problems can not be parallelized Almasi and Gottlieb  Leighton 	
Neural networks are inherently parallel and are therefore easily mapped on
parallel hardware Further though analogue data processing elements are in
herently slower than their digital equivalents the analogue versions of the neu
ral network computing primitives multiplication and addition can be much
smaller than their digital equivalents Thus massively parallel neural systems
are eciently implemented using analogue VLSI giving a potential for very
fast data processing eg Murray et al  Ismail and Fiez 
 Graf and
Jackel 	
This claim requires a couple of remarks As the exact mapping on parallel
hardware depends heavily on the network topology which is application de
pendent	 the massively parallel neural systems should be application specic
rather than general purpose Lehmann  Jackel 
 Mead 	 Fur
ther an analogue neural network should not be thought of as an accelerator
for a von Neumann computer as using a serial computer to supply the data for
a massively parallel neural network would in most cases severely limit the
performance Also note that for very high precision computations digital cir
cuits are required however it is generally believed that the precision oered
by analogue components is sucient in many neural systems though cf the
following chapters Edwards and Murray  Hollis et al  Tarassenko
et al 	
Massively parallel analogue neural networks have been reported by Arima
et al  Castro et al  and Masa et al  among others

 Asynchronousness Many neural networks are asynchronous in nature This
asynchronousness can be eciently exploited Asynchronous or self timed	
systems have a number of distinct advantages over systems governed by a
clock Seitz  Sutherland  Ramacher and Schildberg  Murray
et al 	
Chapter  Introduction Page 
 Synchronous systems must be designed to run at a conservative clock
frequency to ensure functionality in worst case situations asynchronous
systems run at the maximum speed of the present hardware Further
more if a certain component is the bottleneck of the system this can
be replaced by a faster component with immediate improvement of the
overall performance
 When increasing a system clock frequency communication between com
ponents becomes a problem as it is very dicult to distribute the system
clock without skew	 over a large area Asynchronous communication is
usually the solution
 In synchronous systems all components change states and thus draw
current from the power supply	 simultaneously at the clock edges This
puts very heavy demands on the tolerable power supply peak currents
capacitive decoupling etc The heavy peak currents also introduce a lot
of noise Asynchronous systems are power averaging
 Real world interfacing is basically an asynchronous task using asyn
chronous systems for this is the natural approach and eliminates for in
stance problems associated with metastability Seitz  Gabara et al
	
In spite of the very attractive features of asynchronous systems few conven
tional digital data processing systems have been successfully implemented
because of the hardware overhead needed for handshaking Spars et al 	
Pure analogue systems have no need for handshaking and are thus well suited
to implement asynchronous systems Though in systems with feedback the
stability must be considered	
Asynchronously operated analogue neural networks have been reported
by Alspector et al  Hollis and Paulos  and Mead  among others

 Fault tolerance A well trained ANN is insensitive to small weight changes as

J 
w
kj
 
 at the equilibrium where J is the error cost and w
kj
is a weight
cf 
B
		 However it is not insensitive to the complete loss of a connection
due to a short circuit a RAM fault radiation etc	 as the network must have
as simple an architecture as possible to ensure good generalization ability To
ensure fault tolerance even down to the hardware level it is necessary to
introduce redundant hardware This is in favour of hardware implementations
of neural networks in particular analogue hardware as the cost of an extra
synapse is relative low In this context a new emerging technology wafer
scale integration WSI	 deserves mentioning as fault tolerant systems are
crucial to the applicability of WSI This technology is well tailored to the
implementation of massively parallel neural networks Yasunaga et al 	

• Low power applications. The use of subthreshold operated MOSFETs offers the possibility of extremely low power systems. Though digital systems can function in subthreshold, analogue systems carry more information per wire and use fewer transistors per operation, and thus inherently use less power (cf. e.g. Ismail and Fiez).
  Low power analogue neural networks have been reported by Leong and Jabri and by Mead, among others.

 Realworld interfacing As well as being asynchronous realworld interfaces
are often required to be analogue Analogue neural networks obviously elim
inate the need for A D and D A converters which is an attractive feature
However this becomes of paramount importance when the data is applied in
massive parallelism The use of hundreds or thousands of high speed A D
converters would seldom be justied For low power systems with realworld
interfaces it is also of great importance that no power is wasted in the pro
cesses of A D and D A converting
Analogue neural networks with realworld interfaces have been reported
by Leong and Jabri  Masa et al  and Mead  among others

 Regularity The regularity of articial neural networks makes them well suited
for massively parallel implementations The design eort can be put in design
ing a few ecient components which are used repeatedly and interconnected
in a regular way
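The fault tolerance item in the list above can be illustrated numerically. The following is a minimal sketch that is not taken from the thesis: a single linear neuron is trained by plain gradient descent on a made-up data set, and the cost J is then compared for a small weight perturbation and for the complete loss of one connection. The data, the learning rate and all function names are assumptions chosen for illustration.

```python
# Toy illustration (assumed data, not from the thesis): a trained linear
# neuron tolerates small weight changes but not the loss of a connection.

def net_output(w, z):
    """Linear neuron: y = sum_j w[j] * z[j]."""
    return sum(wj * zj for wj, zj in zip(w, z))

def cost(w, data):
    """Mean squared error J over all (input, target) pairs."""
    return sum((t - net_output(w, z)) ** 2 for z, t in data) / len(data)

def gradient(w, data):
    """dJ/dw[j] for the mean squared error cost."""
    g = [0.0] * len(w)
    for z, t in data:
        err = net_output(w, z) - t
        for j, zj in enumerate(z):
            g[j] += 2.0 * err * zj / len(data)
    return g

# Hypothetical training set generated by t = 1.0*z0 - 2.0*z1.
data = [((1.0, 0.0), 1.0), ((0.0, 1.0), -2.0),
        ((1.0, 1.0), -1.0), ((1.0, -1.0), 3.0)]

w = [0.0, 0.0]
for _ in range(2000):                      # plain gradient descent
    w = [wj - 0.05 * gj for wj, gj in zip(w, gradient(w, data))]

j_trained = cost(w, data)
j_perturbed = cost([wj + 0.01 for wj in w], data)   # small weight change
j_lost = cost([w[0], 0.0], data)                    # connection 1 lost

print(j_trained, j_perturbed, j_lost)
```

In this toy setting the perturbed cost stays several orders of magnitude below the cost after losing a connection, which is the asymmetry the fault tolerance argument relies on.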
To conclude: at least two niches for analogue integrated neural systems exist (both possibly asynchronous or with redundant hardware):
• Massively parallel, application specific systems having a parallel real-world interface.
• Small, low power, application specific systems with a real-world interface.
In this work we will be predominantly interested in the massively parallel analogue neural networks rather than the low power ones.
Today several applications using analogue integrated neural networks have
been reported For instance high energy particledetector trackreconstruction de
vices Masa et al 	 implantable heart cardioverters debrillators Leong and
Jabri 	 and silicon cochleas retinas and motion sensors Mead  Park
et al  Cao et al 	
Though most systems reported embrace the principles above general purpose
analogue neural systems have been reported Mueller et al  Van der Spiegel
et al  Satyanarayana et al 
	 Systems that do embrace the above prin
ciples can be found in Bibyk and Ismail   Bruun et al  Corso et al
 Eberhardt et al  Hollis and Paulos  Kub et al  Lansner and
Lehmann 
  LinaresBarranco et al  Mead  Moon et al 
Murray et al   Neugebauer and Yariv 
 and Ramacher and Ruckert

1.2 Implementing learning algorithms in analogue hardware
Because of the nonideal characteristics analogue integrated neural networks are
most often taught using chipintheloop training Castro et al  Eberhardt
et al 	  that is rather than down loading predetermined weights for a given
application each individual chip or system	 is trained by i	 applying an input
pattern to the chip ii	 compute the network error on the basis of the target values
and the actual chip outputs and iii	 adjust the weights on the chip according to the
learning algorithm such that the network error decreases This can accommodate
for oset errors nonlinearities etc
There is a wealth of different training approaches which can quite easily be programmed on a host computer for chip-in-the-loop training. The question now arising is: why should we sacrifice the flexibility of chip-in-the-loop training using a host computer in favour of implementing learning algorithms in analogue hardware? The reasons are similar to the reasons for implementing ANNs in analogue hardware in the first place.
Performing similar operations on all synapses/neurons of a regular system composed of synapses and neurons, many learning algorithms have the same properties as neural networks when implemented in analogue hardware:

 Parallelism Learning is computationally a very heavy task Typically of the
order ON
 
	 or ON

	 in a system with N neurons Compare to the recall
mode task which is of the order ON

	 assuming the system has ON

	
synapses	 In terms of speed it is therefore of even greater importance to uti
lize inherent parallelism for the learning algorithm than for the neural network
itself Fortunately many learning algorithms can be parallelized to a great ex
tent The arguments for using analogue hardware to utilize the parallelism are
the same as above
Massively parallel implementations of learning algorithms have been re
ported by Arima et al  among others

• Adaptability. Adaptive neural systems are continuously taught while being used. At least two situations exist where the learning algorithm must be embedded in the system hardware: (i) in large adaptive systems, where it is crucial to utilize the inherent parallelism of the learning algorithm; (ii) in small adaptive low-power systems, where a host computer is not available. In these two application areas it is likely that the neural network is analogue, and the arguments that advocated the use of analogue hardware for the neural network hold for the implementation of the learning algorithm too.

 Asynchronousness Many learning algorithms can be formulated in continu
ous time which enables asynchronous analogue hardware implementations of
learning algorithms An asynchronous analogue neural network with asyn
chronous analogue interfaces is most naturally taught by such a learning
algorithm

 Fault tolerance Some learning algorithms can be formulated to operate on
local data in such a way that the learning algorithm can be embedded in the
Chapter  Introduction Page 
extended	 synapse and neuron hardware In a fault tolerant analogue neural
network the inclusion of such a learning algorithms does not sacrice the fault
tolerance for the system as a whole

 Low power applications As is the case for the analogue neural networks
analogue implementations of learning algorithms can use MOSFETs operated
in subthreshold for extremely lowpower applications

• Data conversion. The learning algorithm needs access to inputs, outputs and intermediate variables of the neural network. If the neural network is analogue, the use of analogue hardware for the learning algorithm eliminates the need for A/D and D/A converters (cf. above).

 Regularity As is the case for the neural networks many learning algorithms
are regular in structure thus the design of the equivalent hardware is inex
pensive
Combining these properties with the conclusion in section 1.1, we conclude that at least two niches for analogue hardware implementations of learning algorithms exist (both possibly asynchronous or with redundant hardware):
• Massively parallel, possibly adaptive, application specific systems having a parallel real-world interface.
• Small, adaptive, low power, application specific systems with a real-world interface.
As for the neural networks, it is believed in certain circles that the limited precision of analogue hardware is sufficient for the implementation of (certain) learning algorithms because of the feedback present. Some even argue that limiting effects such as noise (Murray and Edwards) can improve learning ability. Certain offset errors, however, can be completely destructive for the learning scheme (Montalvo et al.; Lehmann), and it is not yet generally accepted whether or not this prohibits analogue implementations of learning algorithms, though a few systems have been reported (e.g. Alspector et al.; Shima et al.; Valle et al.). Thus, research on implementing analogue learning hardware still has a wealth of unexplored areas. It is to this that we will commit the following chapters.
Architectures for analogue hardware learning can be found in Alspector et al., Arima et al., Card, Caviglia et al., Hollis et al., Jabri and Flower, Lehmann, Linares-Barranco et al., Macq et al., Matsumoto and Koga, Montalvo et al., Murray, Reyneri and Filippi, Schneider and Card, Shima et al., Tarassenko and Tombs, Valle et al., Wang, and Woodburn et al.

Chapter 2
Implementation of the neural network
Rather than implementing a learning system all at once, we have chosen a sequential approach: first implementing an acting neural system, and second implementing learning hardware for this system. In this chapter the implementation of the artificial neural network that is the core of the systems in this thesis will be presented. The chapter includes reflections on the choice of network models and topologies suitable for an analogue VLSI implementation. Also, the choices of hardware topologies and essential subcircuits (memories, multipliers and thresholding hardware) are discussed, with the future implementation of hardware learning in mind. Several examples from the literature are given. After the presentation of the design of, and measurements on, our cascadable ANN solution, and after the presentation of system level measurements, reflections on future work are given: the inclusion of process parameter dependency canceling and temperature compensation. A summary concludes the chapter.
2.1 The artificial neural network model
As the very first thing, we must decide on a model for our artificial neural network. There are three properties that this model must possess; it must be:
• General purpose.
• Simple.
• Suitable for the technology.
It was argued in chapter 1 that analogue ANNs have to be application specific. With no particular application in mind at this point, we shall deviate slightly from this principle without actually violating it: the object is to design a set of general purpose building block components or modules (Eberhardt et al.; see also Mueller et al.). Application specific systems can then be composed of a number of these.
Analogue computational hardware is typically limited to a relative precision of about & (e.g. O'Leary; see appendix C). For this reason, and to limit the hardware cost, it is preferable to use a simple ANN model. Some researchers try to model the biological mechanisms of neural networks very closely (Grillner et al.; see also MacGregor) or use other complicated network models. This should not be necessary for computational neural networks, as many of their properties are owing to the structure and nonlinearity of the system.
It is of paramount importance that our ANN model is compatible with the restrictions imposed by the analogue VLSI technology (e.g. Mead); otherwise the advantages of using the technology in the first place would be lost. It must be absolutely clear that we thus cannot justify the implementation of an arbitrary model (just as we cannot justify the use of analogue integrated ANNs for an arbitrary application, as argued in the previous chapter): the model must be easy to map on the hardware in terms of both topology and computation primitives.
Our first objective will be to implement an acting system, which can be refined later if necessary.
 The neurons
Using stochastic neurons gives the possibility of implementing very powerful networks such as Boltzmann machines (Hertz et al.). The activation $y_k$ of the stochastic neuron $k$ (typically of value  or ) is probabilistically determined:

$$\Pr(y_k = \ ) = 1 - \Pr(y_k = \ ) = g_k(s_k),$$

where $s_k$ is the neuron $k$ net input and $g_k(\cdot)$ is the activation function (cf. appendix B). Stochastic systems can very efficiently explore the state space of the system's free parameters during a learning process. They are somewhat slow, however, as the outputs must be averaged over time to find the probability distributions of the outputs, and the stochastic processes are not very well suited for analogue signal processing (though see Alspector et al. for a VLSI implementation that comes around these problems using different kinds of annealing processes).
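The time-averaging cost mentioned above can be illustrated with a minimal sketch. It assumes a logistic sigmoid for $g_k$ and output values 0 and 1; both are assumptions made for illustration, as are all function names.

```python
import math
import random

def g(s):
    """Logistic sigmoid, an assumed choice for the activation function g_k."""
    return 1.0 / (1.0 + math.exp(-s))

def stochastic_neuron(s, rng):
    """Return 1 with probability g(s) and 0 otherwise (output values assumed)."""
    return 1 if rng.random() < g(s) else 0

def averaged_output(s, n_samples, rng):
    """Estimate g(s) by averaging n_samples successive stochastic outputs."""
    return sum(stochastic_neuron(s, rng) for _ in range(n_samples)) / n_samples

rng = random.Random(0)              # fixed seed, for repeatability only
estimate = averaged_output(0.5, 10000, rng)
print(estimate, g(0.5))
```

Many samples are needed before the average approaches $g(s)$, which is the speed penalty the text refers to.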
A very general deterministic network model uses higher order neurons Wul

 Giles et al 	
y
k
" g
k
s
k
	
" g
k


X
j
w
kj
z
j
!
X
j
 
j

w
 
kj
 
j

z
j
 
z
j

!
X
j
 
j

j

w
  
kj
 
j

j

z
j
 
z
j

z
j

! 	 	 	

A



	
where the w
kj
s w
 
kj
 
j

s w
  
kj
 
j

j

s       are the connection strengths and the z
j
 
s are
the neuron k inputsy cf appendix B	 The highest number of z
j
 
factors gives the
order of the neuron Though higher order networks can be very ecient compared
to conventional ie rst order	 networks they map poorly on VLSI because of
their high structural dimensionality a Dth order network has a D!dimensional
structure	 Thus using rst order deterministic neurons is preferable from a VLSI
implementation point of view This is also theoretically the most wellstudied one
which is also signicant
Finally, there is the choice of the neuron transfer function. The simplest possible choice would be setting $g_k(\cdot) = \mathrm{sign}(\cdot)$ (a hard limiter), which is used in Hopfield networks and which is well suited for an analogue VLSI implementation (Hollis and Paulos; Sanchez-Sinencio and Lau). This, however, would sacrifice the generality of the system (obviously, continuous valued outputs would be impossible; also, many learning algorithms rely on a smooth transition from low to high neuron output). The choice, therefore, is to set $g_k(\cdot) = g(\cdot)$, where $g(\cdot)$ is a sigmoid-like function, which is a sufficiently general solution (networks using this kind of neurons can approximate any limited function; Lapedes and Farber).
† $z_{j_i}$ can be either a network input, $x_{j_i}$, or the output activation from another neuron, $y_{j_i}$ (see sections  and ).

 The network
Ideally, we should put no constraints on the network topology. However, as we shall see in the following, sparse interconnections between neurons will be difficult to implement in general. Thus the choice is to use fully interconnected (groups of) neurons, which is the most general topology.
Using an unconstrained topology imposes another problem if feedback is present: instability. An unknown number of neurons in a feedback loop would cause an unknown phase shift at high frequencies and might lead to oscillations. The problem is further complicated by the fact that the signs and magnitudes of the gains (weights) in the system change during learning. A solution is to place the feedback as shown in figure B and to ensure a single dominating pole in the loop (refer to Hollis and Paulos, Linares-Barranco et al., Graf and Jackel, and Mueller et al., e.g., for such systems). In non-relaxation systems†, the network time constants should, in some way, match the time constants of the input data and of possible learning hardware. In this case it is easier to use a discrete time feedback (i.e. a sampler) in a general system, though the asynchronousness is sacrificed. Even using discrete time systems, there is still a wealth of problems to which we can apply analogue ANNs, and most of the systems in this thesis will be designed to work in discrete time.
 Mapping the algorithm on VLSI
Before presenting our analogue integrated ANN solution, we shall have a look at different aspects of such implementations. More specifically, we shall discuss different architectures, signalling methods, memories, multiplication and thresholding circuits, with the future implementation of learning algorithms in mind.
 Architecture
The architecture of a small, low power, application specific system (cf. chapter ) must be tailored to the application. The building block components for such systems are thus the atomic parts of neural networks (synapses and neurons) and would have the form of, say, a cell library to a CMOS process. Though the discussions in the present thesis are meant for massively parallel systems, many of the considerations apply equally well for small, low power systems. Only the circuits should be replaced by low power ones. (We shall use strong inversion circuits rather than subthreshold ones, as the former are inherently faster than the latter.)
For massively parallel, application specific systems, the level of integration of the building block components is preferably very high (this reduces design time). Unfortunately, this puts constraints on the architecture, and minimizing these constraints is one of the objects of VLSI neural network design.
† Systems where the network is not allowed to settle to a steady state before the next input pattern is applied (Williams and Zipser, and others); these are used in time sequence data processing.

Reformulating the neuron equation above for a vector of first order neurons (cf. appendix B), corresponding to, say, a layer, we have

$$\mathbf{y} = \mathbf{g}(\mathbf{s}), \quad \text{where } \mathbf{s} = \mathbf{w}\,\mathbf{z},$$

where $\mathbf{g}(\cdot)$ is a vector of sigmoid-like functions. Assuming we have a parallel operated matrix-vector multiplier (MVM) that gives as output the multiplication of its input vector and a stored matrix, the number of rows in the multiplier is increased simply by adding another multiplier with the same input vector. The number of columns is increased by adding the output vector to that of another multiplier. As dimensions are easily added to a vector of functions, the implementation of an ANN that is fully interconnected (between layers, if layered) and which can be scaled to an arbitrary size is feasible using two building block components (Eberhardt et al.; Lansner and Lehmann; Shima et al.). This is shown in figure  (it is assumed that adding the outputs from several multipliers is done simply by connecting their outputs together, cf. next section). This cascadability is most important. We shall use the terms synapse chip (the multiplier) and neuron chip (the squashing functions) for the two modules. Further, we shall refer to the rows/columns of $\mathbf{w}$ as rows of synapses and columns of synapses.
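The cascading rules for the matrix-vector multiplier can be sketched in a few lines. This is a behavioural illustration only; the function names and the small example matrix are assumptions.

```python
def mvm(w, z):
    """Matrix-vector multiplier: one output (row sum) per row of synapses."""
    return [sum(wkj * zj for wkj, zj in zip(row, z)) for row in w]

def add_rows(w_a, w_b, z):
    """More rows: two multipliers share the same input vector."""
    return mvm(w_a, z) + mvm(w_b, z)          # list concatenation

def add_columns(w_a, w_b, z_a, z_b):
    """More columns: the outputs of two multipliers are summed (wired together)."""
    return [sa + sb for sa, sb in zip(mvm(w_a, z_a), mvm(w_b, z_b))]

w_full = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
z = [1, 0, -1, 2]

# split w_full into two 2x2 "synapse chips" placed side by side
left = [row[:2] for row in w_full]
right = [row[2:] for row in w_full]
s_tiled = add_columns(left, right, z[:2], z[2:])
s_stacked = add_rows([w_full[0]], [w_full[1]], z)
print(s_tiled, s_stacked, mvm(w_full, z))
```

Both tilings reproduce the output of the single large multiplier, which is the cascadability property relied on in the text.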
For a recurrent network, an elegant approach is to place synapses on the neuron chip, as illustrated in figure  (here $\mathbf{y} = \mathbf{g}(\mathbf{w}\mathbf{y})$) (Duong et al.). This makes module interconnection easier.
[Figure: blocks labelled Neuron and Synapse (w); signals z, s and y]
Figure  Expandable neural network. This topology can implement systems of arbitrary size, fully connected between the layers.
One could expect routing problems in systems with rigorously interconnected units. The distributed and regular placement of the synapses in the above systems, however, practically eliminates this problem (massive inter-chip communication is still inconvenient, though). For sparse, random connectivity, routing would consume considerably more area per synapse.

[Figure: synapses (w) and neurons in one array; signals s and y]
Figure  Expandable recurrent neural network. This topology can also implement systems of arbitrary size. The neurons must not be larger than the synapses in order not to waste area.

Obviously, any first order ANN topology can be mapped on one of the above systems by setting some of the connection strengths equal to zero (feedback and extra layers can be added in figure ). If the system is known to have sparse connectivity, though, it would be preferable not to waste hardware for all the null-connections. This could be accomplished by folding the synapse matrix in a way similar to the folding of sparse PLAs, if the structure of the network is known in advance (Bruun et al.). Often it is not, however, and certainly not when implementing general neural architectures.
Solving a problem with unknown properties, one would typically arrive at the sparse architecture by pruning (i.e. removing unnecessary connections) a fully connected network, e.g. using optimal brain damage (OBD) (Le Cun et al.; see also Larsen). Thus, preferably, a reconfigurable neural network should be able to emulate a fully connected one during the pre-pruning phase. As, depending on how it is used, OBD can remove as few as  % to  % of the connections, and as simple synapses can be very small, care should be taken that interconnections and routing switches do not take up more area than that left free by the reduced number of synapses. Another way to avoid wasting hardware in a pruned network would be to use the null-connections of a fully connected architecture to introduce redundancy in the system.
We believe that the fully connected building block topology of figure  is, though simple, a very capable one. We shall use this in the present work.
A number of systems with reconfigurable network topologies have been proposed in the literature (Mueller et al.; Satyanarayana et al.; Graf and Henderson; and others). Though it can be questioned if a random connected neural network can be mapped efficiently on these systems, they do provide a general problem-solving environment. Also, the reconfigurability can be used to alter the ANN topology during training and to map out defective blocks.
[Figure: blocks labelled Routing switches and Neuron-synapses]
Figure  Reconfigurable neural network. The philosophy of this kind of topology is to implement a general neural computer.
A particularly interesting reconfigurable ANN is found in Satyanarayana et al. (see figure  and , p. ). In this implementation, the lumped synapses and neurons above are replaced with distributed neuron-synapses. The neuron squashing circuit is distributed among the connected synapses and can be connected in parallel with other neuron-synapses, ensuring that the routing switch area is kept reasonably low, as indicated in figure .
 Signalling
The domains in which the various signals are carried are closely related to the needs of the matrix-multiplier above, or the needs of a synapse:
• The output from a neuron (or a network input) must easily be distributed to a column of synapses.
• The outputs from a row of synapses must easily be accumulated.
Distributing a signal is most easily done using a voltage, as this can be detected using high impedance sensors in parallel (i.e. MOS gates). In the current domain, the addition of analogue signals is simply done by connecting the input wires to the output wire. Thus, using synapses with voltage inputs and current outputs satisfies the above requirements, which is fortunate as multipliers typically have voltage inputs and current outputs. This is illustrated in figure . A variation of the current output scheme is to use charge packages, which can be accumulated on an integrator.
[Figure: multiplier with stored weight w_kj, input voltage v_zj and output current i on the summing wire s_k]
Figure  Typical electronic synapse. The multiplier has voltage inputs and current output to ensure the cascadability of synapses.
Analogue signals carried in the voltage/current domains are sensitive to noise, for instance coupled via the power supply or capacitive/inductive parasitics. In a pulse stream neural network, the noise sensitivity of the neuron outputs is efficiently reduced by moving the information from the voltage domain to the time domain, for instance using pulse frequency modulation (PFM) or pulse width modulation (PWM) (Murray et al.). A digital voltage signal can be easily distributed and regenerated, and the temporal information is insensitive to most noise sources. The noise sensitivity of the synapse outputs is not so easily reduced because of the requirement for easy accumulation. The synapse outputs would thus typically be charge packages (the connection strengths multiplied by the stream of input pulses). To get the full advantage of the noise insensitive neuron outputs, it is therefore important that the synapse-to-neuron connections are kept at a minimum (local) area. That is, only neuron outputs should be used for inter-chip communication.
The disadvantage of pulsed neural networks is a reduction in speed. Given a bandwidth $B$ of our system, we can process $2B$ data points (Tugal and Tugal) in a pure analogue system†, whereas only $B/\ $ in a PFM neural network with a dynamic range of  dB.
The above are the most commonly used signalling methods in integrated neural network contexts, though other methods exist (Neugebauer and Yariv; Murray et al.; Mead; Webb; Mortara and Vittoz). We shall use continuous valued signalling in the voltage and current domains in this work, as this is inherently the fastest signalling method compatible with a simple synapse architecture.
As seen in figure , the typical electronic synapse consists of two components: a multiplier and a connection strength memory cell. As the number of synapses in an ANN (mostly) scales as $O(N^2)$, where $N$ is the number of neurons, reducing the synapse area has been one of the major objects of integrated neural network research. Thus a discussion of memory cells and multipliers is the subject of the following two sections.

† That is the fundamental Nyquist upper limit, assuming we use sinc ($\mathrm{sinc}(x) \stackrel{\mathrm{def}}{=} \sin(x)/x$) pulses in a linear system. A more realistic measure would be, for instance, $B$ data points (per second): assuming the system has a single dominating pole at the frequency $f_{3\,\mathrm{dB}}$, the output corresponding to a step input would settle to  bit accuracy within the time $1/f_{3\,\mathrm{dB}} = 1/B$ (Lehmann), i.e. we could process $B$ data points.
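The single-pole settling argument in the footnote can be checked numerically. The sketch below assumes a first order step response with error $e^{-t/\tau}$, $\tau = 1/(2\pi f_{3\,\mathrm{dB}})$; the 8 bit accuracy and the 1 MHz pole are assumed example values, not figures taken from the text.

```python
import math

def settling_time(bits, f_3db):
    """Time for a single-pole step response to settle to within 2**-bits."""
    tau = 1.0 / (2.0 * math.pi * f_3db)      # time constant of the pole
    return bits * math.log(2.0) * tau        # solve exp(-t/tau) = 2**-bits

def data_rate(bits, f_3db):
    """Data points per second if each point must settle to `bits` accuracy."""
    return 1.0 / settling_time(bits, f_3db)

t8 = settling_time(8, 1.0e6)                 # 8 bit and 1 MHz are assumed values
print(t8, data_rate(8, 1.0e6))
```

Under these assumptions the settling time is slightly below $1/f_{3\,\mathrm{dB}}$, so a rate of about $B$ data points per second is the right order of magnitude.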
 Memories
Storing analogue signals is by no means simple; no true, efficient analogue electronic memory exists today. Thus the storage of the synaptic strengths is a major concern in analogue ANN research; the solutions found in the literature are compromises of one kind or another, most of which can be put in one of the following categories:
• Capacitive storage
• Storage using special process facilities
• Digital storage
Capacitive storage. The simplest method for storing an analogue signal is to put a charge on a capacitor and to read this using the very high impedance gate terminal of a MOSFET (Tsividis and Satyanarayana). The drawback of this method is that the leakage current (primarily) through the sampling switch (or some other weight changing device) eventually exhausts the weight. Several approaches to reduce the leakage current are possible. One is to use a differential scheme as shown in figure , which cancels the predominant source-bulk reverse biased junction current. (This scheme also cancels the offset error due to charge injection.) Alternatively, using a low offset voltage buffer, one can ensure a 0 V voltage drop across the source-bulk diode, efficiently eliminating this leakage (see Shah; Vittoz et al.; Horio and Nakamura). Whatever method is employed, though, weight decay cannot be totally eliminated, and some kind of refresh is necessary.
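[Figure: storage capacitors with sampling switches (S) controlled by a write signal, storing the weight voltage v_w at the multiplier input]
Figure  Capacitive storage. Several approaches to refreshing the charge and reducing leakage (as the differential scheme shown) are possible.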
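How often a capacitively stored weight must be refreshed follows from the droop rate $dV/dt = I_{leak}/C$. The sketch below works this out; the leakage current, capacitance, voltage range and resolution are illustrative assumptions, not values from the text.

```python
def droop_rate(i_leak, c_store):
    """Voltage droop dV/dt = I_leak / C, in V/s."""
    return i_leak / c_store

def refresh_interval(i_leak, c_store, v_range, bits):
    """Longest time between refreshes if the droop must stay below 1 LSB."""
    lsb = v_range / 2 ** bits
    return lsb / droop_rate(i_leak, c_store)

# assumed, illustrative values: 1 fA leakage, 1 pF, 2 V range, 8 bit
t_refresh = refresh_interval(1e-15, 1e-12, 2.0, 8)
print(t_refresh)
```

The required refresh rate grows in proportion to the leakage current and to the number of weight levels, which is why both leakage reduction and a refresh scheme are needed.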
Most refreshing schemes rely on quantizing the weights and using these discrete valued weights to recharge the weight capacitors. One of the more obvious approaches to do so is to use a digital RAM backup memory. The weights are stored digitally in the RAM and the capacitors are periodically refreshed via a D/A converter (Lansner and Lehmann; Eberhardt et al.; Jackson et al.). One would typically have serial access to the weight capacitors as well as to the words in the RAM; thus a count of $O(1)$ D/A converters would be needed†. (As digital RAM is very cheap, this is an accountable solution for large systems!) The serial access to the updating of weights is a severe limitation for a system with hardware learning. It will necessarily be an order $O(N^2)$ slower than a system with parallel weight access. However, the much discussed issue of too coarse weight discretizing during learning (see e.g. Hollis et al.; Tarassenko et al.; and the following chapters) is less pronounced using this architecture than most others: first, as only $O(1)$ D/A and A/D converters are needed, it is accountable to use high precision converters; second, the RAM backup can have words of arbitrary widths (in number of bits), and thus very small weight changes can be accumulated (cf. the following chapters).
It is also possible to employ a quantize-regenerate refreshing scheme. The voltage on the weight capacitor is periodically compared to a discrete number of reference voltages (e.g. in the form of a staircase ramp) and the capacitor voltage is regenerated to the closest reference (Vittoz et al.; Bjork and Mattisson; Horio and Nakamura). For very high precision weights (compared to the weight droop rate), it is necessary to place the regeneration circuit at the synapse sites to allow parallel weight refresh. For lower precision weights, a column, say, can share this circuit, but a voltage buffer must be placed at the synapse site to drive the capacitance on the wire connecting the column to the regeneration circuit. In either case, the quantize-regenerate technique is more area demanding than simple capacitive storage with a RAM backup. For systems with on-chip learning, the quantize-regenerate refreshing scheme is not particularly well suited because of the required high resolution of the weights, unless the learning scheme is so fast that several weight updates can be accumulated between successive weight refreshes (compare to the following analogue adjustment).
† As before, for a system with $N$ neurons.

An altogether different approach to refreshing relies on the presence of a learning scheme: refresh by relearning. During an idle phase of the neural network, it is trained using an epoch, say, of the original training data, thus restoring the weights (cf. e.g. Valle et al.; Woodburn et al.; see also Arima et al.). The obvious disadvantages with this approach are (i) that the network cannot run continuously and (ii) that the whole training set needs to be stored in the system (though see chapter ). If the training scheme employed is an unsupervised learning scheme, refresh by relearning can be employed in a more elegant way: learning can be applied on each input pattern, eliminating the need for an idle phase and the storing of the training data (Schneider and Card). (Such weight refresh is also applicable if learning with a critic is employed; a reinforcement signal can usually be extracted from the environment of an acting system at a minimum cost (Alstrøm).) The trouble with this approach is that the network will tend to forget the classification of scarcely occurring input patterns. Whether this is acceptable, or indeed an advantage, is strongly dependent on the application.
In most situations it is necessary to be able to read the contents of the weight matrix (for backup purposes, for example, or for transferring the network state to another network; retraining would be necessary). It is possible to do so without direct weight access if the outputs from the matrix-vector multiplier are accessible. Applying $z_{j'} = \delta_{j'j}$† as inputs to the matrix-vector multiplier yields as outputs $s_i = w_{ij} + w_{ij,\mathrm{ofs}}$, where $w_{ij,\mathrm{ofs}}$ is the MVM output offset error (first order approximation), which would have to be canceled‡.
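The readout procedure can be sketched as follows. The MVM model used here, with one constant output offset per row that is measured with an all-zero input and subtracted, is a simplifying assumption; on real hardware the offset may also depend on the input, in which case this cancellation is only approximate.

```python
def mvm(w, z, offset):
    """Assumed MVM model: s_i = sum_j w[i][j]*z[j] + offset[i]."""
    return [sum(wij * zj for wij, zj in zip(row, z)) + ofs
            for row, ofs in zip(w, offset)]

def read_weights(mvm_fn, n_inputs):
    """Read column j by applying a unit input on input line j only."""
    zero = mvm_fn([0.0] * n_inputs)                 # output offsets alone
    columns = []
    for j in range(n_inputs):
        z = [1.0 if jp == j else 0.0 for jp in range(n_inputs)]
        s = mvm_fn(z)
        columns.append([si - oi for si, oi in zip(s, zero)])
    return [list(row) for row in zip(*columns)]     # columns back to rows

w_true = [[0.5, -0.25], [1.0, 0.75]]
offset = [0.125, -0.5]
w_read = read_weights(lambda z: mvm(w_true, z, offset), 2)
print(w_read)
```

One input pattern per column of synapses is needed, plus one extra pattern for the offset measurement.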
Special process facility storage. Non-volatile analogue memories can overcome the leakage problems of capacitive storage. The most popular of these is floating gate storage, where a charge is trapped on the completely insulated (floating) gate of a MOSFET, thus programming the threshold voltage (cf. Sze), see figure  (Horio and Nakamura; Vittoz et al.; compare to figure C). (The MOSFET would be the input transistor(s) of the multiplier in figure .) There exist numerous ways of trapping the charge on the floating gate; some are compatible with standard CMOS processes, others require special process steps (as those in EEPROM processes, for example). The programming is usually carried out by (i) applying a high voltage across the gate oxide, thereby forcing a tunneling current to charge the gate (Carley; Lee et al.), or (ii) exposing the gate to UV light, thereby inducing a parallel conductance caused by the generation of hole/electron pairs in the oxide (Benson and Kerns; Abusland and Lande).
[Figure: cross section with labels floating gate, control gate, source, drain, bulk, n+, p- and FOX]
Figure  Floating gate MOSFET. Schematic drawing of physical floating gate MOST. The control-gate/floating-gate overlap need not, as in the figure, be on top of the channel.
A completely different approach is amorphous silicon storage (Reeder et al.). Here the resistance of a vanadium/amorphous silicon/chromium sandwich can be programmed by applying high-voltage pulses, much like the way floating gate MOSFETs are programmed electrically.
† Kronecker's delta: $\delta_{j'j} = 1$ for $j' = j$ and $\delta_{j'j} = 0$ for $j' \neq j$.
‡ Note that unless the offset error is very small, such a readout would not be accurate enough for a learning rule as ( ).
Though these, often quite small, special process facility memories are the only true analogue memories existing today, several matters weigh against their use in analogue VLSI neural networks.
The writing on these analogue memories usually wears the devices. A typical floating gate device can endure in the order of  full scale changes, for example. This is sufficient for programmable recall-mode systems (as in Castro et al.), but for adaptive systems it is not. Though weight changes can be accumulated on a short term memory to reduce the number of device programmings, such systems are in general taught by example (on-line learning rather than batch learning) and continuously over time. Thus the number of weight changes scales as $O(N^2 t)$ (at the least), quickly exhausting the endurable number of device programmings.
The driving force of VLSI processes is digital electronics (primarily RAM, microprocessors, etc.). Thus a state of the art process will be tuned to digital requirements (e.g. a  V,  µm, single poly, triple metal, n-well CMOS process without precision components). To get access to state of the art processes, the analogue circuit designer must submit to the potentials offered by digital processes. Further, some argue that special analogue devices (high resistive polysilicon, floating capacitors, precision components, even BiCMOS processes) will eventually cease to be available to the average designer because of the future role of most analogue circuits: interfaces to digital signal processors (DSPs) on mixed analogue/digital integrated circuits (NEAR; see also Tsividis). For this reason, one should have a very good reason to use special process steps for analogue circuits, especially in VLSI analogue circuits, as these would be exceedingly expensive. In this context it is important to note that in most systems, minimizing the synapse cost and power dissipation, rather than the synapse area, is the objective!
Even nonvolatile analogue memories compatible with standard CMOS pro
cesses should be used with caution They rely on undocumented features of the
process which i	 must be characterized experimentally by the designer ii	 would
probably be subjects to immense process variations and iii	 could possibly be
changed without notice by the vendor One very important characteristic for in
stance is the memory life time which one should have a fairly good idea of before
considering a production even oating gate devices do degrade though on a time
scale of years rather than seconds as for capacitive storage	 Further the need for
high voltages or UV light for programming is inconvenient See alsoMurray et al
	
Digital storage. The problems with weight degrading, weight wearing and special processing steps can be overcome if one is willing to refrain from using analogue storage.
Unless very high resolution synapse strengths are needed, digital memories consume more area than simple analogue memories. The size of a digital memory scales as $O(\log w_{res})$, whereas analogue memories scale as $O(w_{res})$, where $w_{res}$ is the resolution of the synapse strength limited by noise (cf. e.g. Geiger et al.). For typical ANN systems, which require a weight resolution of  to  bit, the analogue solution is usually the smallest by far.
The most severe problem with embedding digital circuitry in an analogue system is the need for data converters; the area of such (monotonous) converters typically scales as $O(w_{res}^2)$ (cf. Pelgrom et al.; Lakshmikumar et al.; and Geiger et al.). Using digital synapse strength storage requires a digital to analogue converter (DAC) at each synapse site, and though using a multiplying DAC eliminates the need for the synapse multiplier, this will be the most area consuming part of the synapse.
In spite of the area penalties of digital weight storage, there are several application areas where such storage is very useful, especially in small systems that cannot tolerate the need for support hardware (as RAM). Flower and Jabri use digital synapse strength storage in an implantable heart cardioverter-defibrillator, for instance.
For analogue systems that include hardware learning, there is another obstruction connected with digital storage: the need for analogue to digital converters (ADCs) to write the synapse strengths. Using a parallel weight updating in such a system is area inefficient (though see Hollis et al.; Shima et al.) because of the necessary high weight resolution during learning (typically  to  bit) (Hollis et al.; Lehmann; Tarassenko et al.; Hohfeld and Fahlman; Brunak and Hansen). Inspired by the fact that the weight changes during learning are usually much smaller than the necessary resolution of the recall-mode network (say  to  bit), it has been proposed to use an analogue adjustment (an analogue bit) to the digital memory, which is active during learning, see figure  (Lehmann et al.; Lansner; Bruun et al.). The weight changes determined by the on-chip learning algorithm are accumulated on the analogue memory. When the equivalent of  LSB has been accumulated, the digital word is decreased or increased and the analogue adjustment is reset. This operation basically requires two one bit ADCs and a digital adder, which could be shared by a column of synapses.
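The analogue adjustment scheme can be modelled behaviourally as a digital word with an analogue accumulator. In the sketch below the carry threshold is taken to be 1 LSB and the word is 8 bit wide; both are assumptions, as are the class and method names.

```python
class HybridWeight:
    """Digital weight word with an analogue adjustment (in units of LSB)."""

    def __init__(self, word, bits=8):
        self.word = word                  # digital part, 0 .. 2**bits - 1
        self.adjust = 0.0                 # analogue part accumulated in learning
        self.max_word = 2 ** bits - 1

    def update(self, delta_lsb):
        """Accumulate a small weight change; carry into the word at 1 LSB."""
        self.adjust += delta_lsb
        if self.adjust >= 1.0:            # first one-bit comparison
            self.word = min(self.word + 1, self.max_word)
            self.adjust = 0.0             # reset the analogue adjustment
        elif self.adjust <= -1.0:         # second one-bit comparison
            self.word = max(self.word - 1, 0)
            self.adjust = 0.0

    def value(self):
        return self.word + self.adjust

w = HybridWeight(word=10)
for _ in range(9):
    w.update(0.25)                        # nine changes of a quarter LSB each
print(w.word, w.adjust, w.value())
```

Weight changes far smaller than 1 LSB are thus not lost, although the digital word itself has only recall-mode resolution.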
Of other synapse memories, read-only memories should be mentioned (Masa et al.; see also Mead and Ismail; Mead), which for instance could be programmed by transistor sizing. Read-only memories, though, are not interesting in the context of hardware learning.
In this work we have chosen the simple capacitive storage method with RAM
backup not because it is particularly well suited for learning it is not  because
this is a simple reliable concept which allow the most dense synapse packaging
The learning scheme will have to submit to this storage method

 Multipliers
Unlike analogue memories, analogue multipliers are very easily implemented in, say, CMOS, provided the inherent offset and nonlinearity are acceptable. Many different multiplier architectures have been proposed in the literature (see Kub et al.; Neugebauer and Yariv; Bibyk and Ismail; Hollis et al.; Massengill; Woodburn et al.; Saxena and Clark; Sanchez-Sinencio and Lau), and we shall restrict our examination to only a few.
The desired synapse multiplier key characteristics are the following (cf. above):
• Small.
• Current output.
• Voltage inputs. One of these should have a very high input impedance (i.e. a MOS gate) so that the capacitance on this node can be used for the synapse strength storage.
Prior to implementing a synapse multiplier, there are two questions we need to answer. First: do we need a four quadrant multiplier? The synapse strengths in a general neural network need to be bipolar; thus we need two quadrant synapse multiplication†. Often the neuron activation function (e.g. $\tanh(\cdot)$) yields as output a bipolar value, indicating that we would need a four quadrant synapse multiplication. However, doing a simple linear transformation on the activation function from bipolar units (superscript b), $y^{b}_k \in [-1, 1]$, to unipolar units (superscript u), $y^{u}_k \in [0, 1]$:

$$g^{u}_k(\cdot) = \tfrac{1}{2}\, g^{b}_k(\cdot) + \tfrac{1}{2}, \qquad w^{u}_{kj} = 2\, w^{b}_{kj}, \qquad \Theta^{u}_k = \Theta^{b}_k + \sum_j w^{b}_{kj},$$

where $\Theta_k$ is the neuron threshold (cf. appendix B), we see that there is, in a mathematical sense, no need for a bipolar activation function. (As would be expected, because pulse stream networks, as our brain, are operational.) When it comes to learning, though, it is advantageous to use bipolar neuron outputs. Learning algorithms typically have a factor $\delta_k y_j$ in the weight updating rule for $w_{kj}$, where $\delta_k$ is the neuron $k$ error. Thus, if unipolar neurons are used ($g_k(\cdot) = \tfrac{1}{2}\tanh(\cdot) + \tfrac{1}{2}$, say), the weight change is negligible when $y_j$ is close to the lower extreme value (0), regardless of the neuron error. For this reason, ANNs using bipolar neurons tend to learn faster (Stornetta and Huberman; Haykin; see also Le Cun et al.). It should be noted that the transformation above can be applied to learning algorithms as well as to networks. This would yield slightly more complicated weight updating rules, and different rules for $w_{kj}$ and $\Theta_k$, though, which is undesirable. A quite different motivation to use a four quadrant synapse multiplier is that this will be needed for the implementation of the learning hardware (cf. the following chapters), and it reduces design time and error probability to reuse building blocks.

† Actually, by adding a constant prior to the multiplication and doing a subtraction afterwards ($s = wz = (w + w_0)z - w_0 z$), only one quadrant multiplication is strictly necessary (Johnson et al.; see also Woodburn et al.). However, this method is bound to introduce additional offset errors, which turn out to be a major problem in analogue VLSI neural networks (cf. the following text).
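Both points made above, the equivalence of the bipolar and unipolar formulations and the stalled weight update for unipolar inputs, can be checked numerically. The sketch assumes the net input convention $s_k = \sum_j w_{kj} y_j - \Theta_k$; the numerical values and function names are illustrative.

```python
def net_input(w, theta, y):
    """s_k = sum_j w_kj*y_j - theta_k (threshold sign convention assumed)."""
    return sum(wj * yj for wj, yj in zip(w, y)) - theta

def to_unipolar(w_b, theta_b):
    """Map bipolar weights and threshold to their unipolar equivalents."""
    return [2.0 * w for w in w_b], theta_b + sum(w_b)

w_b, theta_b = [0.5, -1.0, 0.25], 0.125
y_b = [1.0, -0.5, 0.0]                       # bipolar activations in [-1, 1]
y_u = [0.5 * y + 0.5 for y in y_b]           # the same states in unipolar units
w_u, theta_u = to_unipolar(w_b, theta_b)

s_b = net_input(w_b, theta_b, y_b)
s_u = net_input(w_u, theta_u, y_u)           # identical net input expected

def weight_change(eta, delta_k, y_j):
    """Generic update factor eta * delta_k * y_j."""
    return eta * delta_k * y_j

dw_bipolar = weight_change(0.1, 0.5, -1.0)   # input at its lower extreme
dw_unipolar = weight_change(0.1, 0.5, 0.0)   # same state in unipolar units
print(s_b, s_u, dw_bipolar, dw_unipolar)
```

The two net inputs coincide, while the unipolar weight change vanishes for an input at its lower extreme even though the neuron error is non-zero.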
The second question is do we need a linear multiplier Doing gradient descent
learning one needs to know the 
s
k

w
ij
derivative where s
k
is the sum of the
synapse outputs To fulll this need the multiplier should be linear to be in
compliance with our simple ANN model	 or at least have a computable transfer
function derivative However this is a requirement somewhat more strict than
needed The inherent fault tolerance of ANNs relaxes this requirement and for
ANN chips taught using chipintheloop training or hardware training inaccuracies
can be eliminated to a great extent by the learning algorithm Eberhardt et al 
Lehmann  
Montalvo et al  Castro et al  Valle et al  Card

 Leong and Jabri  Woodburn et al 
 among others see also section
	 What is of greater importance than the multiplier linearity is its dynamic
range usually restricted to about 
 dB which puts constraints on the networks
that can be mapped on a given topology In this connection the multiplier output
o
set error is very important When connecting the outputs from many synapse
multipliers the output osets will accumulate easily giving a resulting oset that
is greater than the maximum output of a single synapse if the multiplier has a
systematic oset this is very probable	 While in principle this resulting neuron
input oset error can be canceled by adjusting the bias the dynamic range of the
bias synapse is thus easily exceeded if steps to prevent this is not taken eg oset
canceling	 These chip specic oset errors and other process variation related
inaccuracies are the reasons why analogue ANNs in general need to be taught
using chipintheloop training Also in analogue systems with onchip learning
even small multiplier osets can cause severe problems cf the following chapters
Gilbert multiplier. A very popular four quadrant multiplier is the Gilbert multiplier shown in figure … (Kub et al.; Schneider and Card). Assuming a square law approximation for the saturated MOSFETs†, one can show that the output current is given by

$$i_{wz} = i_{wz+} - i_{wz-} \approx \sqrt{\tfrac{1}{2}\beta_w \beta_z}\; v_w v_z$$

when $v_w$ is small, in the sense that the differential output currents of the upper differential pairs are linear in $v_w$‡. $\beta_w$ and $\beta_z$ are the transconductance parameters for the upper two and the lower differential pairs, respectively.
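As a numerical check of the product approximation, the following sketch evaluates a Gilbert cell under the square law $i_D = \frac{\beta}{2}(v_{GS} - V_T)^2$ and compares it with $\sqrt{\beta_w\beta_z/2}\,v_w v_z$. The decomposition into one lower and two cross-coupled upper differential pairs, the function names and the component values are assumptions made for illustration; body effect and channel length modulation are ignored.

```python
import math

def pair_diff_current(v, beta, i_tail):
    # Differential output current of a saturated MOS differential pair,
    # assuming the square law i_D = (beta / 2) * (v_GS - V_T)**2.
    # Valid while both transistors conduct: |v| <= sqrt(2 * i_tail / beta).
    return v * math.sqrt(beta * i_tail) * math.sqrt(1.0 - beta * v * v / (4.0 * i_tail))

def gilbert_exact(v_w, v_z, beta_w, beta_z, i_b):
    # The lower pair steers the bias current between the two upper pairs.
    di_z = pair_diff_current(v_z, beta_z, i_b)
    i1 = 0.5 * (i_b + di_z)
    i2 = 0.5 * (i_b - di_z)
    # Cross-coupled upper pairs: their differential currents subtract.
    return pair_diff_current(v_w, beta_w, i1) - pair_diff_current(v_w, beta_w, i2)

def gilbert_approx(v_w, v_z, beta_w, beta_z):
    # Small-signal product form: sqrt(beta_w * beta_z / 2) * v_w * v_z
    return math.sqrt(0.5 * beta_w * beta_z) * v_w * v_z

beta = 50e-6   # A/V^2, illustrative
i_b = 20e-6    # A, illustrative
for v in (0.05, 0.1, 0.2):
    exact = gilbert_exact(v, v, beta, beta, i_b)
    approx = gilbert_approx(v, v, beta, beta)
    print(v, exact, approx)
```

For small inputs the two values agree closely; the deviation grows as the inputs approach the limit where one transistor of a pair turns off.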
When used as a synapse multiplier, a row of synapses can share the current mirror that is needed to take the $i_{wz+} - i_{wz-}$ difference (we call this circuit the current differencer), thereby saving a current mirror per synapse; though for a given accuracy of the difference, the total transistor area devoted to this task cannot be decreased (cf. appendix C). Note that if local current differencing is used and the multiplier output is not directly compatible with the neuron input, utmost care should be taken in the design of the required signal converter, so as not to lose the good accuracy in this component.

† $i_D = \frac{\beta}{2}(v_{GS} - V_T)^2$.

‡ More precisely, $v_w \ll \sqrt{\frac{I_B}{\beta_w}}\,\sqrt{1 - \sqrt{\frac{\beta_z v_z^2}{I_B} - \frac{\beta_z^2 v_z^4}{4 I_B^2}}}$.

Figure …: MOS Gilbert multiplier. All transistors work in saturation. A wide range version is possible by adding a number of current mirrors.
For a design with a restricted supply voltage, a wide range version of the multiplier is easily implemented by the addition of a number of current mirrors. A folded version of the multiplier is also a possibility.
The MOS resistive circuit. Another very linear multiplier is the MOS resistive circuit (MRC) shown in figure … (Czarnul; Khachab and Ismail; Tsividis et al.). When ensuring virtual short-circuit of the output terminals, the differential output current is given by

$$i_{wz} = i_{wz+} - i_{wz-} = \beta v_w v_z = \frac{1}{r_{MRC}}\, v_z.$$
The transistors operate in the triode region. Though requiring four matched transistors, the circuit has several nice properties. It cancels out most nonlinearities of the MOS transistors, making the difference current very linear in both $v_w$ and $v_z$. The difference current is independent of the threshold voltage, making it insensitive to bulk effect and substrate noise. Further, the circuit is quite fast, as the high frequency effects of parasitic capacitances also tend to cancel out. The disadvantage of the circuit is that the triode mode operation requires a somewhat large power supply voltage. For differential mode signals on the $v_z$ terminals, the circuit acts as two controlled resistors (cf. figure …) with resistances $r_{MRC}$ as defined above, an observation that is often very convenient when analyzing circuits with MRCs.

The need for a virtual short-circuit prevents local current subtraction when this circuit is used as a synapse multiplier; thus the circuit will usually exhibit relatively large output offset errors. However, consisting of only four transistors, the synapse density can be very high when using this multiplier. Actually, even higher synapse density is possible if single ended signalling is employed. Using one of the $v_z$ terminals as a constant reference, and forcing the output potentials to be at that same potential, two of the transistors will have zero drain-source voltage and can thus be eliminated, as they do not conduct any current (Flower and Jabri). The problem with this approach is that very low impedance summing nodes are necessary at the outputs, capable of sinking current from many synapses.

Figure …: MOS resistive circuit multiplier. All transistors work in triode mode; the output potentials must be equal. The triode mode operation restricts the dynamic range.

Figure …: MRC resistive equivalent. This convenient equivalent circuit is valid for differential mode signals only.
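The cancellation properties of the MRC can be checked with a simple model. The sketch below assumes the first-order triode equation $i = \beta[(v_G - V_T)(v_D - v_S) - \frac{1}{2}(v_D^2 - v_S^2)]$ and a particular wiring of the four matched transistors; both, as well as all names and values, are assumptions for illustration, since the schematic itself is not reproduced here.

```python
def triode_current(beta, v_g, v_d, v_s, v_t):
    # Simple triode-region model (assumed):
    # i = beta * ((v_g - v_t) * (v_d - v_s) - (v_d**2 - v_s**2) / 2)
    return beta * ((v_g - v_t) * (v_d - v_s) - 0.5 * (v_d ** 2 - v_s ** 2))

def mrc_difference_current(beta, v_w, v_z, v_out=0.0, v_t=0.8, v_gcm=4.0):
    # Assumed wiring of the four matched transistors:
    #   M1: input z+ -> out+, gate w+     M2: input z+ -> out-, gate w-
    #   M3: input z- -> out+, gate w-     M4: input z- -> out-, gate w+
    # Both outputs are held at the same potential v_out (virtual short-circuit).
    z_p, z_m = v_out + 0.5 * v_z, v_out - 0.5 * v_z
    w_p, w_m = v_gcm + 0.5 * v_w, v_gcm - 0.5 * v_w
    i_plus = (triode_current(beta, w_p, z_p, v_out, v_t)
              + triode_current(beta, w_m, z_m, v_out, v_t))
    i_minus = (triode_current(beta, w_m, z_p, v_out, v_t)
               + triode_current(beta, w_p, z_m, v_out, v_t))
    return i_plus - i_minus

beta = 20e-6  # A/V^2, illustrative
print(mrc_difference_current(beta, v_w=0.5, v_z=0.2))
print(beta * 0.5 * 0.2)
```

In this model the quadratic terms and the threshold voltage drop out of the difference current, leaving only the product $\beta v_w v_z$, in line with the properties listed in the text.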
Multiplying DAC. When using a digital synapse memory in an analogue system, a multiplying DAC (MDAC) must be used as the synapse multiplier. The disadvantage of this approach is its excessive area consumption. However, at the expense of reduced accuracy (cf. above, appendix C), the scaling and summing circuit of the DACs can be shared by (say) a row $k$ of synapses. This way only $B$ identical voltage controlled current sources are needed at the synapse sites for a $B$ bit resolution, see figure … (Bruun et al.; Dietrich; see also Van der Spiegel et al.). The transconductor and the diode coupled transistor (common to a column $j$) ensure a current proportional to the input voltage $v_{z_j}$ for all the synapse current sources that have the corresponding weight bit $w^{b}_{kj}$ set. The currents on the $B$ output lines are scaled and summed by the current adder (common to a row), producing the resulting output current. The accuracy of this solution is primarily determined by the diode coupled transistor, which has to match all the synapse current sources of a column. The resolution is determined by the local current source matching and the accuracy of the scaling current adder. At a very modest area increase the circuit can handle bipolar inputs (Dietrich; Lehmann et al.; Lansner). Such a … × … synapse chip, designed at our institute, has been fabricated in a standard … µm CMOS process, giving an accuracy of about … %, proving the applicability of the scheme (Dietrich; Lehmann et al.).
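The sharing of the scaling and summing circuit can be sketched behaviourally. The example below assumes unsigned weights, non-negative inputs and plain binary scaling ($2^{-1}, 2^{-2}, \ldots$) in the row adder; these choices, the function names and the data are illustrative assumptions rather than a description of the fabricated chip.

```python
def mdac_row_output(weights_bits, inputs, unit_current=1.0):
    # weights_bits[j][b] is weight bit b (0 = most significant) of synapse j.
    # inputs[j] is the non-negative column input, already converted to a
    # current scale by the transconductor shared by column j.
    n_bits = len(weights_bits[0])
    # Each synapse only switches identical current sources onto B shared lines.
    line_currents = [0.0] * n_bits
    for bits, z in zip(weights_bits, inputs):
        for b, bit in enumerate(bits):
            if bit:
                line_currents[b] += unit_current * z
    # The binary scaling and the summation are done once per row.
    return sum(i_line * 2.0 ** -(b + 1) for b, i_line in enumerate(line_currents))

def weight_value(bits):
    # Fractional value of an unsigned B-bit weight, MSB first.
    return sum(bit * 2.0 ** -(b + 1) for b, bit in enumerate(bits))

w_bits = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
z = [0.5, 1.0, 0.25]
print(mdac_row_output(w_bits, z))
print(sum(weight_value(b) * zj for b, zj in zip(w_bits, z)))
```

Both expressions give the same inner product, which shows that scaling once per row is equivalent to scaling inside every synapse, as long as the shared adder and the current sources match.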
When doing onchip learning in a system with digital synapse weights the
weight resolution can be temporary increased during learning by the addition of
an analogue adjustment as noted in section  To get an accurate weight
Chapter 	 Implementation of the neural network Page 
algorithm
Learning
Analogue
adjustment
k
j
SSV
SSV
b
kj
i
b0
kj
b1
kjw w w
1/2
-1
1/2 -1
B-1
B
mg
s
vz
Figure 


 Multiplying DAC synapse Simplied schematic for positive in
puts The switch transistors are controlled by a local weight register bits w
b 
kj

not shown The value of the optional analogue adjustment is controlled by
an onchip learning algorithm
multiplication during learning an analogue multiplier should be added to the mul
tiplying DAC as indicated in the gure Improved learning would result by the
omission of this multiplier however as the network errors would be calculated on
the actual resulting network rather than on an intermediate network with higher
weight resolution Lansner 	 See also section 
If the input (neuron activation value or the network input) is binary, the multiplier architecture can be very simple (Murray et al.; Woodburn et al.; Graf and Jackel; Johnson et al.). Basically, a weight controlled current source is either switched to the multiplier output or not, depending on the state of the input, requiring as few as three (or even one) transistors for unipolar neuron activation. See also section ….
Using a synapse multiplier which is nonlinear in v
w
NLSM 	 eg  sinhv
w
		
can reduce the problem of limited dynamic range Van der Spiegel et al  Hol
lis et al  Valle et al  Kwan and Tang 	 As it is the magnitude and
sign	 rather than the actual value of the weight that is of importance to the ANN
performance the inevitably reduced resolution for large weights is of less concern
Using MOS transistors operated above threshold achieving an exponential char
acteristic is not easy A squarelaw nonlinearity however is easily achieved as
indicated in gure 

 This particular circuit also has the advantage that an oset
free zero weight can be ensured ignoring subthreshold currents	 by proper bias
ing see gure 

 Also the at plateau around zero output current will tend to
Chapter 	 Implementation of the neural network Page 
trap small weights at eciently zero value  the multiplier could be said to be
selfpruning  which might improve the ANN generalization ability
VDD
SSV
iwz
Bias
Bias
v
v
w
z
Figure 

 Simple nonlinear synapse
multiplier The four leftmost tran
sistors act as level shifters that en
sure only one of the saturated middle
weight transistors are turned on
The multiplication is performed by
the rightmost switch transistor
0V 2V 4V 6V 8V 10V
vw
-I(vo)
200uA
100uA
0A
-100uA
-200uA
nlbm1 -- non-linear multplier with one input binary
Date/Time run: 07/25/94 11:47:56 Temperature: 27.0
Figure 

 Weightoutput character
istic of NLSM A simple double a
symmetric square law characteristic
Notice the null range which ensure ze
ro output o
set error
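The shape of such a characteristic can be sketched with a piecewise model. The gains `k_pos` and `k_neg`, the null range limits `v_low` and `v_high` and the function name are made-up values chosen only to resemble the plotted range; they are not extracted from the simulation.

```python
def nlsm_current(v_w, z, k_pos=10e-6, k_neg=12e-6, v_low=4.0, v_high=6.0):
    # Illustrative double square-law weight characteristic with a null range.
    # All parameter values are assumptions, not taken from the thesis.
    if not z:                      # binary input: switch transistor open
        return 0.0
    if v_w > v_high:               # positive-weight transistor conducts
        return k_pos * (v_w - v_high) ** 2
    if v_w < v_low:                # negative-weight transistor conducts
        return -k_neg * (v_low - v_w) ** 2
    return 0.0                     # null range: offset-free zero weight

for v_w in (0.0, 3.0, 4.5, 5.5, 7.0, 10.0):
    print(v_w, nlsm_current(v_w, z=1))
```

Weight voltages that drift within the null range leave the output at exactly zero, which is the self-pruning behaviour mentioned in the text.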
In this work we have chosen to use the MOS resistive circuit multiplier. There are several reasons for this: (i) The MRC is a small, fast multiplier, which is important to the applicability of massively parallel analogue ANNs. (ii) As the learning algorithm, which we are to add at a later stage, is not determined in advance, we should not reject the possibility of implementing a gradient descent algorithm. The extent to which an unknown synapse nonlinearity can be cancelled by the learning scheme is problem and algorithm dependent; a reasonably linear multiplier is the safe choice. (iii) The MRC is a very versatile component. One can, for instance, implement a voltage in, voltage out multiplier/divider ($v_{out} = v_{in1} v_{in2} / v_{in3}$) using two MRCs and an op-amp, which is independent of process variations (cf. the following chapters). This will be needed for the learning scheme, and we can thus reuse our multiplier cell, which reduces the possibility of design errors. In this connection it should be mentioned that there are several other possible choices for process variation insensitive voltage in, voltage out multipliers (e.g. Wang; see also Sakurai and Ismail; Coban and Allen; Botha).
Activation functions

The last thing that needs to be considered before implementing the analogue neural network is the threshold function. If a binary valued neuron transfer function is sufficient for the application at hand, the neuron circuit can be very simple (Bibyk and Ismail; Hollis and Paulos). If, on the other hand, a continuous valued nonlinearity is sought (as in our case), the circuit is often somewhat more complex. This complexity is usually not a major concern, though, as there are only $N$ neurons compared to the order $O(N^2)$ synapses. The exact shape of the neuron transfer function is usually irrelevant (cf. e.g. Sanchez-Sinencio and Lau). What is more important is its qualitative shape, e.g. that it is monotonic and saturates for numerically large inputs. This is a very attractive feature, which means that transfer functions easily implemented in the technology can be chosen.
Pulse frequency neuron. The neuron schematic is of course strongly dependent on the network signalling domains. For pulse frequency neural networks, for instance, a neuron must be a nonlinear current controlled oscillator (CCO). A sample implementation of such a neuron can be seen in figure … (Murray et al.).

Figure …: Pulse frequency neuron. The output voltage alternates between digital high and low at a frequency determined in a nonlinear way by the input current.

The pulse frequency will be in the range $f_{act} \in \left[0,\; \frac{I_{dep}}{C_{int}(V_{ref+} - V_{ref-})}\right]$. This particular circuit does not have a particularly low input impedance, which must be compatible with the synapse multipliers.
Distributed neuron. Using continuous valued current in, voltage out signalling for the neurons, the neuron nonlinearity can be achieved simply by applying a nonlinear load to the summed synapse outputs; the same wire would then be used as both neuron input and output (we assume that the loads at the neuron output have a very high impedance). This approach makes it very easy to distribute the neuron hardware on the synapses, rather than using conventional lumped neurons, see figure … (Satyanarayana et al.). The distributed approach has the obvious advantage that the current range in the distributed elements is that of a single synapse, thus making the system truly scalable to an arbitrary size. The function implemented by the distributed neuron-synapse approach is

$$y_k = g_k\!\left(\frac{1}{M_k}\sum_j w_{kj} z_j\right),$$

where $M_k$ is the number of inputs to the resulting neuron $k$. Some argue that the typical weight magnitudes of a neuron are often proportional to $1/M_k$, though (see section …), in which case this very factor can actually improve the effective dynamic range of the synapses by a factor $M_k$ (Satyanarayana et al.; see also Eberhardt et al.). Note that the distributed neuron elements must be kept small, as their number scales as $O(N^2)$.

Figure …: Distributed neuron. The input current and output voltage are carried on the same wire. The transfer function for this particular neuron does not saturate in itself (the synapse would contribute to the transfer characteristic). This distributed approach makes the network truly scalable.
Hyperbolic tangent neuron. If we are to add learning hardware to an acting neural network, we will (most probably) not have access to the neuron net inputs (the $s_k$'s). For the implementation of a gradient descent algorithm, $\partial y_k / \partial s_k$ needs to be computed. Thus the choice of a hyperbolic tangent transfer function (the transfer function of a bipolar differential pair) is well tailored to this situation; we have

$$\frac{\partial}{\partial s_k}\tanh(s_k) = 1 - \tanh^2(s_k).$$

Unfortunately, the differential pair has voltage input and current output rather than the other way around, which makes the use of input and output transresistances necessary. A hyperbolic tangent neuron implementation can be seen in figure … (p. …).
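That the derivative is available from the neuron output alone can be verified numerically. The following sketch (function and variable names are illustrative) computes $1 - y_k^2$ from the output and compares it with a central difference taken on the net input, which the learning hardware would not normally have access to.

```python
import math

def tanh_derivative_from_output(y_k):
    # d tanh(s)/ds expressed through the output alone: 1 - y**2
    return 1.0 - y_k * y_k

s_k = 0.7
y_k = math.tanh(s_k)          # only this value is assumed to be observable
h = 1e-6
numeric = (math.tanh(s_k + h) - math.tanh(s_k - h)) / (2.0 * h)
print(tanh_derivative_from_output(y_k), numeric)
```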
Bipolar transistors are not available in standard CMOS processes. However, well MOS transistors can be operated in lateral bipolar mode (LBM MOSFET), which turns on a reasonably good (though somewhat slow) bipolar transistor, see appendix C. Though it is against the philosophy of section … to use non-documented devices, the offence is not too severe in this case: using a fairly simple regulating circuit, the primary parameter of the LBM MOSFET, the emitter-collector current gain, can be measured and adapted to. See section ….

We shall discuss the necessity of computing the transfer function derivative in later chapters, as well as problems related to this calculation. Further, we shall examine other neuron circuits. For now, not constraining the future implementation of the learning hardware motivates our choice of transfer function, which is this hyperbolic tangent one, implemented using LBM MOSFETs.

Many other neuron architectures can be found in the literature, to which we refer the interested reader (e.g. Mueller et al.; Schneider and Card; Sanchez-Sinencio and Lau; Mead).
Chip design

In this section we will describe the chip set developed at our institute. Considerations on the choices of the central components were given in the previous section. A description of the chip set was published in Lansner and Lehmann (see also Lehmann; Schultz). The cascadable chip set consists of a neuron chip and a synapse chip having the topologies shown in figure ….

The chip set was designed primarily to test the ANN functionality, and thus as little hardware as possible was included on the chips, to reduce the possibility of malfunctioning chips. This design methodology has proven successful: all the designed chips worked after first processing (though errors occurred). The price for this reduced error probability is basically that the chips need a large number of biases (voltages and currents), which severely complicates their use. As we unfortunately did not have sufficient time to design a complete, volume manufacturable set of chips, problems related to self-biasing, temperature compensation, etc. are not experimentally covered in this thesis, though they are very important to VLSI design.

The integrated CMOS process we shall use is a standard analogue … V, … µm double poly, double metal, n-well CMOS process. In order to put as few constraints as possible on the analogue building blocks, we have chosen a rather large … V power supply. This was convenient, as the different components on the first chip set were developed concurrently by three different designers (myself, John A. Lansner and Thomas Kaulberg). In a future implementation the building blocks should be redesigned for a standard … V (or … V) digital process.

A design strategy that has been employed is to reuse components whenever possible, in order to reduce the possibility of design errors and the design time. This is true for on-chip micro components (such as the op-amp) as well as macro components (such as the matrix-vector multiplier), as we shall see in the following chapters.

A few preliminary general system aspect considerations are needed before the actual chip designs. These can be found in appendix D.
The neuron chip

The neuron chip design was done by John A. Lansner (see Lansner). The schematic of a neuron on the neuron chip is shown in figure …. The core of the circuit is the bipolar differential pair, implemented using two LBM MOSFETs. The differential output current of this pair is converted to a single ended voltage by the output range (OR) MRC and op-amp. (For the op-amp schematic, see appendix E.) This is again buffered, so the neuron can drive the relatively low impedance input of the synapse chip (cf. below). At the input of the neuron, the input scale (IS) MRC and op-amp is likewise placed, acting as the input transresistance that converts the input current to the voltage needed to drive the differential pair. The resulting transfer function is the following:

$$v_{y_k} = \frac{1}{\beta_{OR} V_{OR}}\,\alpha_{FC} I_B \tanh\!\left(\frac{i_{s_k}}{\beta_{IS} V_{IS} V_t}\right) = R_{OR}\,\alpha_{FC} I_B \tanh\!\left(\frac{R_{IS}\, i_{s_k}}{V_t}\right),$$
where $I_B$ is the bias current and $\alpha_{FC}$ is the emitter-collector current gain. Because of the undesired vertical collector of the LBM MOSFET, connected to the substrate, we have $\alpha_{FC} = i_C / i_E \approx$ … (very approximately). The output voltage $v_{y_k}$ is referred to $V_{ref} =$ … V, to be compatible with the synapse inputs.

The input impedance is nonlinear and strongly dependent on (though always smaller than) the resulting transresistance $R_{IS}$. One should not be too offended by this non-ideal load of the current source. When the tanh is saturated, even severe nonlinearities in the transresistance (or in the synapse output stage, caused by the nonlinear, finite load) are of no consequence, as an input current ($i_{s_k}$) error would be indistinguishable at the neuron output. The differential pair saturates for a differential input voltage of a few $V_t$ ($\tanh(\ldots V_t / V_t) = \ldots$), which is thus the voltage shift from the input reference (… V) that must be tolerated by the synapse chip (regardless of $R_{IS}$). This is easily accomplished.
Though the input transresistance is adjustable, it has a dynamic range of only … dB to … dB. In other words, the neuron transfer function steepness is adjustable within this range (or the effective maximum synapse weight, $|w|_{max}$, if one prefers to think of the transfer function with a fixed steepness). Clearly, in a system with an arbitrarily large number of synapses connected to each neuron, this dynamic range does not allow a neuron to be saturated only if all synapses are exciting it, as would be the case if the steepness scaled as $1/M_k$, where $M_k$ is the number of connected synapses. The dynamic range of our neuron steepness is sufficient to cancel process variations, but not much more. In many classification problems it is often seen that the neurons in a trained network (both output and hidden ones) are most often saturated (see e.g. Brunak and Lautrup; Krogh et al.; see also Williams and Zipser), that is, they act more or less as hard limiters; clearly this is not possible if the steepness scales as $1/M_k$. If one could ensure that the transresistance was adjustable down to … Ω, both extremes could be embraced (Eberhardt et al.; see also Mueller et al.), but using a lumped neuron approach this would prove most difficult: the lumped neuron would have to be able to sink an arbitrarily large current. Further, even such a system would not be able to handle the situation where all synapses but one must agree to make a decision, which in turn can be overruled by the last synapse (a situation which is not uncommon in the brain (Rumelhart et al.)). The unfortunate conclusion is that, in order to be absolutely general, the dynamic range of a synapse must scale as $M_k$, which is incompatible with analogue VLSI. This once again stresses the importance of making the dynamic range of the synapse as large as possible.

Figure …: Hyperbolic tangent neuron. Basically a BJT differential pair (using parasitic components) embedded in transresistances.
In order to make reasonably simple neurons and learning algorithms, we have chosen the fixed neuron steepness approach. The steepness has been selected to be compatible with typical weight magnitudes found in reported systems and in our own simulations. It should be noted that, if the nonlinearity introduced by the nonlinear neuron input impedance is acceptable, the neuron steepness can be reduced simply by adding a parallel external resistance at the input. It could be one of the objectives of further research to enhance the neuron steepness dynamic range (the steepness should be governed by the learning algorithm).
The synapse chip

The synapse chip consists of a number of inner product multipliers (IPM) that multiply the input vector ($v_{z_\bullet}$)† with a row of the stored matrix ($V_{w_{k\bullet}}$). Such an inner product multiplier is shown in figure … (see e.g. Bibyk and Ismail). The difference of the summed synapse MRC outputs is taken by the op-amp with MRC feedback, which also ensures the required virtual short-circuit of the synapse outputs. The resulting voltage is transformed by the transconductance ($g_{mk}$) to the output current ($i_{s_k}$). (For the op-amp and the transconductance schematics, see appendix E.) The resulting transfer function is the following:

$$i_{s_k} = \frac{g_{mk}}{(W_0/L_0)\,V_C}\sum_j \frac{W_j}{L_j}\, V_{w_{kj}}\, v_{z_j},$$
where the $W/L$ are the MRC width/length ratios and $V_C$ controls the total transconductance. This control voltage is used to adjust the effective maximum synapse weight, though the dynamic range allows only for small adjustments (such as compensating process variations), as was the case for the neuron steepness adjustments.

The schematic of a single synapse is shown in figure …. The synapse strength is stored in a differential manner on capacitors at each synapse site; this way offset due to charge injection (Shieh et al.; Wegmann et al.) is cancelled, as well as the (differential) charge leakage due to the reverse biased drain-bulk diodes of the sampling switches, provided that the components match. To ensure random synapse access, the sampling switches are controlled by a NAND gate rather than directly by the row/column select signals provided by the row and column decoders (as in Lee et al. or Kub et al.). For minimum geometry transistors for the gate, the area overhead is acceptable. See also appendix D.

† $\bullet$ meaning that the index runs over all possible values: $v_{z_\bullet} \stackrel{\mathrm{def}}{=} (v_{z_1}, v_{z_2}, \ldots, v_{z_{M_k}})^T$.

Figure …: Inner product multiplier. One MRC per input vector dimension is required (upper part). The MRC feedback op-amp ensures virtual short-circuit at the MRC outputs; the transconductance converts the op-amp output voltage to a current for cascadability.
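The inner product multiplier can be summarised behaviourally as follows. The sketch reads the transfer function as $i_{s_k} = g_{mk} \sum_j (W_j/L_j) V_{w_{kj}} v_{z_j} / ((W_0/L_0) V_C)$ and assumes identical synapse MRCs; the names `wl_synapse` and `wl_feedback` and all numeric values are illustrative assumptions.

```python
def ipm_output_current(v_w_row, v_z, g_m, v_c, wl_synapse=1.0, wl_feedback=1.0):
    # Inner product multiplier, behavioural model:
    #   i_s = g_m / (wl_feedback * v_c) * sum_j wl_synapse * V_w[j] * v_z[j]
    # All synapse MRCs are assumed to share one width/length ratio.
    acc = sum(wl_synapse * vw * vz for vw, vz in zip(v_w_row, v_z))
    return g_m * acc / (wl_feedback * v_c)

v_w_row = [0.2, -0.1, 0.4]    # differential weight voltages (V), illustrative
v_z = [0.5, 0.5, -0.25]       # differential input voltages (V), illustrative
print(ipm_output_current(v_w_row, v_z, g_m=10e-6, v_c=1.0))
print(ipm_output_current(v_w_row, v_z, g_m=10e-6, v_c=2.0))
```

Doubling the control voltage halves the output current, which corresponds to the adjustment of the effective maximum synapse weight described above. Only ratios of width/length ratios enter, so in this idealised model a common process dependent transconductance factor cancels.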
The second generation synapse chip. Using an op-amp with MRC feedback as the synapse differencer, followed by a transconductance to obtain the desired output current, is somewhat indirect. A simpler and more accurate approach is to use a current conveyor (see appendix E) to take the difference while ensuring the virtual short-circuit, as shown in figure …. See also chapter …. The $i_{s_k+}$ input is led directly to the output, and the $i_{s_k-}$ input is negated by the current conveyor and added to the output. The y-x voltage follower ensures the virtual short-circuit. To avoid DC common mode currents in the synapses, the output potential of the current conveyor should be close to $V_{ref}$, which requires a slightly changed neuron schematic (cf. chapter …). This solution does not allow the effective maximum synapse weight to be tuned as above. Process variations must thus be cancelled by scaling the weights, which reduces the dynamic range of the synapse weights slightly. The adjustments are carried out automatically during learning, though, without any weight scaling learning mechanism.

Figure …: Synapse schematic. In addition to the MRC multiplier, weight storage (differential, capacitive) and access (switch transistors and NAND gate) circuits are placed at the synapse sites.

Figure …: Current conveyor differencer. This circuit gives as output $i_{s_k} = i_{s_k+} - i_{s_k-}$ while buffering the y-node voltage to the x-node. The input voltage is thus determined by the output load.
It is also possible to use a currentmode opamp with resistive feedback as a
dierencer This way the eective maximum synapse weight can be adjusted while
preserving the good accuracy of a simple dierencer
The use of one of these currentmode operational devices to our currentmode
signal processing is the superior choice when considering both speed and accuracy
of the circuit Bruun  Bruun et al 	
Sparse input synapse chip

Often the inputs to a neural network are taken from a discrete input alphabet consisting of a number, $N_A$, of symbols (or letters), $a_1, a_2, \ldots, a_{N_A}$. To avoid false distance relations among the different letters, unary coding of the letters is usually employed; that is, $N_A$ network input lines are assigned to every logical input (or letter input) $X$:

$$\begin{array}{c|cccc}
X & x_1 & x_2 & \cdots & x_{N_A} \\ \hline
a_1 & 1 & 0 & \cdots & 0 \\
a_2 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{N_A} & 0 & 0 & \cdots & 1
\end{array}$$

We notice that the network inputs are used sparsely. One can use $-1$ instead of $0$ as the inactive value, as discussed in section …. In this particular case, however, the choice of $0$ can actually improve learning: only the weights related to the present input letter in each letter input will be modified by a typical algorithm (cf. above).
Sample applications. As examples of applications where unary input coding is used, we shall mention prediction of splice sites and word hyphenation.

In the human genome project, the aim of which is to map all human genes, the prediction of splice sites in pre-mRNA molecules (a copy of part of the information in a DNA molecule) is an important task. Much of the DNA information is junk that does not code for proteins, and much of this junk DNA is scattered about in DNA sequences that do code for proteins. Prior to the protein synthesis, the body cells cut these junk sequences out of the pre-mRNA molecules. The junk/code boundaries are the splice sites, see figure … (Stryer; Brunak et al.). These splice sites are difficult to predict and are dependent on the context of the nucleotide sequence. Applying neural networks to this problem has proven to be very successful (Brunak et al.). The input alphabet for this ANN application has $N_A = 4$ letters: A, T (or U), G and C, corresponding to the four nucleotide bases of DNA (or RNA). Brunak et al. used two layer perceptrons with the order of … letter inputs (i.e. … network inputs) and … hidden and one output sigmoid neurons for this classification task.

Another application that uses a discrete input alphabet is word hyphenation. Ignoring context dependent hyphenation (such as de-sert vs. des-ert), a good solution can be found using an ANN with an input window of a few letters. Brunak and Lautrup used a two layer perceptron with eight letter inputs and … hidden and one output sigmoid neurons to hyphenate Danish words. The input alphabet of this classification task has $N_A =$ … letters, corresponding to the … letters of the Danish alphabet (a, b, …, z, æ, ø, å) plus a null letter.

Figure …: Nucleotide sequence. A splice site is shown for a sample (schematic) pre-mRNA molecule. RNA is composed of a sequence of the bases adenosine, cytidine, guanosine and uridine.
The sparse unary coding of the letter inputs  say applying  zero inputs for
each  input  clearly exploits the input bandwidth of a standard synapse chip
poorly For systems with say hundreds of letter inputs even a small reduction
in the required input bandwidth can reduce the system cost considerably This is
the motivation to develop a special input layer synapse chip for networks with a
discrete input alphabet  a sparse input synapse chip
The basic idea of this synapse chip is simply to use binary coding of the
input alphabet and decode this to unary coding on-chip. Standard digital CMOS
design techniques can be used for the decoding, or alternatively a demultiplexor
(controlled by the letter input) can be used to redirect a current, as shown in the
figure of the sparse input synapse chip column. The figure shows the current
demultiplexor for a two-bit letter input and one of the four synapse columns
connected to the demultiplexor. As discussed earlier, the synapse multiplier can
be very simple when the z input is binary. One can also use the demultiplexed
current to derive the bias currents for synapse transconductors, as shown in the
figure: only one of the four columns of synapse transconductances will have bias
currents.
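The decoding principle can be illustrated with a small behavioural sketch. It is not a model of the actual circuit: the function names, the weight values and the unit bias current are assumptions made for illustration only.

```python
def bits_needed(alphabet_size):
    """Number of binary input lines needed to address alphabet_size letters."""
    return max(1, (alphabet_size - 1).bit_length())


def encode_letter(index, alphabet_size):
    """Binary code word (most significant bit first) sent to the chip."""
    n_bits = bits_needed(alphabet_size)
    return [(index >> b) & 1 for b in reversed(range(n_bits))]


def decode_to_unary(bits, alphabet_size):
    """On-chip decoding: enable exactly one of alphabet_size synapse columns."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    if index >= alphabet_size:
        raise ValueError("code word does not correspond to a letter")
    return [1 if column == index else 0 for column in range(alphabet_size)]


def synapse_row_current(bits, weights, bias_current=1.0):
    """Summed output of one synapse row: only the enabled column contributes."""
    enables = decode_to_unary(bits, len(weights))
    return sum(bias_current * e * w for e, w in zip(enables, weights))
```

With a two-bit letter input, `encode_letter(2, 4)` gives the code word `[1, 0]`, which enables the third of four columns; the row output is then simply the weight stored in that column, scaled by the bias current.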
Figure: Sparse input synapse chip column. A binary coded letter input X
corresponds to, say, four unary coded ANN inputs. The bias current $I_B$ is
demultiplexed (left) to one of the four synapse columns (one shown to the right)
corresponding to the letter input. Turning the synapse transconductance bias
currents off or on acts as multiplication by 0 or 1.
To implement a non-application-specific sparse input synapse chip it is necessary
that the demultiplexor is reconfigurable; that is, one must be able to change
the number of outputs and control lines to fit the application. The design of
such a reconfigurable sparse input synapse chip was done by Jesper S. Schultz (see
Schultz). As the required input bandwidth to drive a given number of synapse
columns scales as $O(\log(N_A)/N_A)$, one cannot exploit both input bandwidth and
chip area fully for all $N_A$ (though the input signals can be time multiplexed
when $N_A$ is low to improve the exploitation of the chip area, and it is
usually not of paramount importance to fully exploit the input bandwidth when
$N_A$ is high). It is therefore necessary to select an ideal $N_A$ where the number
of synapse rows is tuned to the input bandwidth. Also, when the number of letters
in the input alphabet is not a power of two, the binary input coding prohibits a
full exploitation of both input bandwidth and chip area. Thus even a
reconfigurable sparse input synapse chip is non-application-specific rather than
general purpose.
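The scaling argument can be made concrete with a few lines of arithmetic. The sketch below only counts binary input lines and code words; the helper names are hypothetical and nothing here reflects the actual chip dimensions.

```python
def input_lines(alphabet_size):
    """Binary input lines needed for one letter input."""
    return max(1, (alphabet_size - 1).bit_length())


def lines_per_column(alphabet_size):
    """Input lines per driven synapse column, i.e. log2(N_A) / N_A behaviour."""
    return input_lines(alphabet_size) / alphabet_size


def code_utilisation(alphabet_size):
    """Fraction of the available binary code words that map to a real letter."""
    return alphabet_size / 2 ** input_lines(alphabet_size)


for n_a in (2, 4, 30, 32):
    print(n_a, input_lines(n_a), lines_per_column(n_a), code_utilisation(n_a))
```

For an alphabet size that is a power of two every code word is used, whereas an alphabet of 30 letters leaves two of the 32 code words (and the corresponding columns) unused.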
Chip measurements
A neuron chip and a synapse chip have been fabricated in a standard CMOS
process. In this section we shall present measurements on these chips (first
published in Lansner and Lehmann). A table of the most important chip
characteristics can be found in appendix D.
 The neuron chip
The measurements on the neuron chip were done by John A. Lansner (see Lansner).
The measured neuron transfer characteristics for different values of the input
scale voltage $V_{IS}$ can be seen in the figure of the measured neuron transfer
function. Both the maximum transfer function nonlinearity $D_g$ (compared to an
ideal tanh, and relative to the output range) and the nonlinearity of the
derivative $D_{dg}$ are low.
Figure: Measured neuron transfer function ("Neuron function": output voltage
v_out / V versus input current i_s,1 / µA). Characteristics for different input
scale control voltages $V_{IS}$ (0.25 V to 2 V). The dotted lines are ideal tanh
curves. The input offset has been canceled.
This low nonlinearity proves the applicability of the LBM MOST dierential
pair and the possibility of accurately computing the derivative on the basis of the
neuron output see also section 	
 The synapse chips
The measured synapse transfer characteristics for a single synapse can be seen
in the figure of the measured synapse characteristics. The characteristics showed
good linearity (quantified by $D_{wz}$), with the exception of the case with
negative $V_{w,kj}$ values and positive $v_{z,j}$ values, where $D_{wz}$ is larger.
This is because it was necessary to lower $V_{SS}$ to ensure a reasonable output
current swing (due to a layout error). This nonlinearity is by no means
prohibitive for the application of the synapse chip: chip-in-the-loop training
or a future on-chip learning mechanism can easily compensate for it (cf. e.g.
Castro et al.; Valle et al.).
Figure: Measured synapse characteristics ("Synapse Characteristic": output
current i_s,1 / µA versus input voltage v_y,1 / V). Characteristics for different
stored weight voltages $V_w$ (V_w,11 from -0.94 V to 0.94 V). The output offset
has been canceled.
The weight matrix resolution $V_{w,res}$ was measured over the range of matrix
voltages. Smaller changes are possible but lie below the noise floor at the
synapse chip output. For a recall-mode system the resolution is sufficient for a
wide range of applications.
Second generation synapse chip. For measurements on a synapse chip with
current conveyor based synapse differencing, refer to the corresponding chapter.
The output offset currents on the synapse chip and the input offset current on
the neuron chip are quite large, approaching in magnitude the maximum synapse
output current. The reason could be (in addition to component mismatch) that
the op-amps have low gains, which together with op-amp offset voltages in the mV
range would give the measured current offsets. This, however, is not necessarily
a major problem (provided that the network is trained and used with the same
chips), as the offset currents just displace the neuron biases. Likewise, the
matrix offset voltages (which are relatively small) could be used as small random
initial weights when the network is trained. It should be noted that the offset
errors are (mostly) non-systematic.
Even with the large current offsets, the chip set characteristics are compatible
with many ANN applications (though it is not as general as an ideal simulated
network, of course). The primary limitation of the network is the limited dynamic
range of the synapses. To enlarge the application area, ensemble methods can
be employed.
 Chip compound
Interconnecting a synapse chip and a neuron chip, the combined transfer
characteristics can be measured. These are shown in the figure of the measured
synapse-neuron transfer characteristics for different values of the synapse
strength, verifying the compatibility of the synapse and neuron chips. The step
response of the synapse-neuron combination is shown in the step response figure.
The delay through one layer of an ANN based on our chips, $t_{lpd}$, can be
measured on this curve for a given output accuracy; it lies in the microsecond
range, corresponding to a throughput in the MCPS range per synapse chip.
As the synapse chip propagation delay should be largely independent of the number
of synapses, a full-size synapse chip would be expected to reach the GCPS
range.†
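The relation between layer delay and throughput is simple arithmetic: every synapse performs one connection per layer evaluation. The sketch below uses purely hypothetical chip sizes and delays, chosen only to show the scaling; they are not the measured values.

```python
def connections_per_second(rows, columns, layer_delay_s):
    """Throughput of a synapse chip: all rows * columns synapses operate in
    parallel and deliver one result per layer propagation delay."""
    return rows * columns / layer_delay_s


# Hypothetical numbers: a small test chip and a larger chip with the same delay.
small_chip = connections_per_second(4, 4, 2e-6)
large_chip = connections_per_second(100, 100, 2e-6)
print(small_chip, large_chip, large_chip / small_chip)
```

Because the delay is assumed independent of the number of synapses, the throughput grows in direct proportion to the synapse count.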
Figure: Measured synapse-neuron transfer characteristics ("Synapse & Neuron
Function": output voltage v_out / V versus input voltage v_y,1 / V).
Characteristics of combined chips (ANN layer) for different stored weight
voltages $V_w$ (V_w,11 from -0.94 V to 0.94 V).
Figure: Measured synapse-neuron step response ("Synapse & Neuron Step Response":
voltages v_y,1 and v_out / V versus time t / µs). Combined chips. The dashed line
is the input signal. The effect of five cascaded op-amps can be seen in the
evidently high order transfer function.
† For comparison: a typical contemporary workstation (HP Direct) would deliver
a number of MFLOPS that is almost two orders of magnitude below the computational
power of a single chip.
 System design
The system design was done by John A. Lansner (see Lansner). To verify
the functionality of the chip set, a two-layer test perceptron based on it was
implemented at our institute. Using five synapse chips and two neuron chips, the
architecture shown in the test perceptron figure was implemented. (A layout error
caused one of the neurons on the neuron chip to be disconnected; also, one of the
inputs on the synapse chip was malfunctioning, hence this architecture.) The two
linear output neurons were implemented by simple resistors.
Figure: Two-layer test perceptron. This simple architecture is capable of solving
a large range of non-trivial tasks.
A standard PC interface was added to test and teach the system (cf. the figure of
the test perceptron system architecture). The synapse strengths are stored in a
RAM and are periodically refreshed via a DAC. Both output and hidden neurons are
accessible from the PC via ADCs. The inputs are driven by DACs.
Figure: Test perceptron system architecture (blocks: DAC, DAC, ADC, counter, RAM,
PC). To test the ANN system it is embedded in a digital system. In a real-world
application only the weight backup would be digital.

 System measurements
The system measurements were done by John A. Lansner (see Lansner).
For a realistic performance evaluation, a well-known real-world data set was
applied to the hardware system, namely the sunspot time series (Weigend et al.).
This semi-periodic time series is the yearly average of dark blotches on the sun
(see the sunspot prediction figure; the data is normalized to be within the range
0 to 1). Using a tapped delay line to feed the ANN the sunspot activity of the
latest M years, the ANN must predict the activity for the following year. Note
that the data set complexity approximately matches the network architecture, which
is essential for obtaining a good generalization ability. (Only one ANN output is
used.)
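The tapped delay line turns the time series into input/target pairs. A minimal sketch of this preprocessing is given below; the function names and the min-max normalization are assumptions for illustration, not a description of the software actually used.

```python
def normalise(series):
    """Scale a series linearly so that its values lie between 0 and 1."""
    lo, hi = min(series), max(series)
    return [(value - lo) / (hi - lo) for value in series]


def tapped_delay_line(series, taps):
    """Build (inputs, targets): the latest `taps` values predict the next one."""
    inputs, targets = [], []
    for t in range(taps, len(series)):
        inputs.append(series[t - taps:t])
        targets.append(series[t])
    return inputs, targets


# Toy data standing in for the yearly sunspot averages.
inputs, targets = tapped_delay_line([1, 2, 3, 4, 5], taps=2)
```

Here `taps` plays the role of M: each input pattern holds the activity of the latest M years and the target is the activity of the following year.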
Figure: Sunspot prediction ("Sunspot Prediction Time Series": activity versus
year, 1700 to 2000, with the training set marked). Classic regression problem.
The actual sunspot time series (solid) and the sunspot activity as predicted by
the hardware ANN (dotted).
The ANN was trained using a standard chip-in-the-loop back-propagation
algorithm. In the calculation of the neuron $k$ transfer function derivative, the
tanh characteristic was exploited: $g'_{k,\mathrm{calc}}$ is computed from the
square of the neuron activation using two constants $\alpha_k$ and $\beta_k$,
where $g_k$ is the actual hardware ANN neuron activation. To ensure that
$g'_{k,\mathrm{calc}}$ is positive (otherwise learning cannot take place, cf. the
following chapters; Lehmann), the neuron activations $g_k$ are scaled such that
$\beta_k g_k$ stays within the range that keeps the calculated derivative positive
(to compensate for the neuron output offset and scale errors). With a suitable
choice of the constants this closely resembles the procedure that we shall apply
using learning hardware. Prior to learning, the neuron output ranges are measured
to determine the optimal $\alpha_k$'s or $\beta_k$'s.
The performance of the hardware ANN was compared to that of an ideal
software ANN with identical architecture but without the non-idealities of the
analogue hardware (i.e. coarse weight discretization, offset errors,
nonlinearities etc.). The normalized average relative variance (NARV) of the
error on the training set as the training progresses (see appendix B) can be seen
in the sunspot learning error figure. It is noticed that the performance of the
hardware ANN is somewhat noisy and slightly worse than that of the software ANN,
as would be expected because of the limited accuracy of the analogue hardware.
(For the output accuracy of the hardware ANN, see Lansner.) The NARV of the error
on two test sets can be seen in the sunspot prediction error figure (the curves
with the high NARV are for the atypical test set; the sunspot count in this
period does not resemble the rest of the data set very closely). The error will
always be lower for the training set than for the test set, and the latter
exhibits a minimum beyond which further training leads to overfitting. Notice
that the hardware ANN test error is noisy and higher (compared to the training
error) than the software test error. This is caused by the limited accuracy of
analogue VLSI.
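The test error minimum suggests stopping the training where the test error is lowest. The sketch below shows this selection together with one common way of normalizing a squared error by the target variance. The normalization is an assumption and is not necessarily the NARV definition of appendix B; all names are hypothetical.

```python
def normalised_error(targets, outputs):
    """Squared error divided by the squared deviation of the targets from their
    mean (assumed normalization): 1.0 means no better than predicting the mean."""
    mean = sum(targets) / len(targets)
    error = sum((t - o) ** 2 for t, o in zip(targets, outputs))
    spread = sum((t - mean) ** 2 for t in targets)
    return error / spread


def best_epoch(test_errors):
    """Index of the epoch with the lowest test error; training beyond this point
    only improves the fit to the training set."""
    return min(range(len(test_errors)), key=test_errors.__getitem__)
```

For a noisy error curve, as observed for the hardware ANN, the selected epoch depends on the noise, so in practice one would probably smooth the curve before picking the minimum.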
Figure: Sunspot learning error ("Sunspot Learning Errors": normalized error versus
epoch, 0 to 4000). NARV as function of learning epoch for the hardware ANN
(dotted) and an ideal software ANN (solid). The error asymptotically approaches a
minimum.
Figure: Sunspot prediction error ("Sunspot Test Errors (1921-1955, 1956-1979)":
normalized error versus epoch, 0 to 4000). NARV as function of learning epoch for
the hardware ANN (dotted) and an ideal software ANN (solid) for two different test
sets. Notice the NARV minima: further training leads to overfitting.
The successful system evaluation indicates that implementing hardware ANNs
using the proposed architecture is indeed feasible (as claimed above, and as other
authors have also noted for different analogue hardware ANNs). The next step is
to implement learning hardware for this ANN system.
	 Further work
Related to our hardware implementation of the artificial neural network, there are
several issues that need to be considered before volume production would be
possible. We shall present some of these in this section.
 Process parameter dependency canceling
As noted earlier, we must eliminate the undocumented process parameters
when using parasitic effects of our semiconductor process, such as the MOSFET
operated in the lateral bipolar mode. In the case of the hyperbolic tangent
neuron, the undocumented process parameter in question is the forward
emitter-collector current gain $\alpha_{FC}$ of the LBM MOST differential pair.
In the figure of the current gain canceling circuit, a simple regulating circuit
is seen. A single LBM MOST emulates a differential pair that is driven to
saturation, such that the tanh term equals one. The (non-substrate) collector
current in this LBM MOST is thus the effective bias current
$I_{Beff} = I_B \alpha_{FC}$. Subtracting a reference current $I_{Bref}$ from this
and dumping the difference into a high impedance node gives a voltage $V_{AB}$
with which to drive the bias transistor. $V_{AB}$ is distributed to all the neuron
LBM MOST differential pairs, which will thus have effective bias currents of
approximately $I_{Bref}$. A loop gain $A_L = g_{mB} \alpha_{FC} / g_{dsI}$ is
easily implemented using simple MOS transistors, as indicated in the figure. This
corresponds to a relative effective bias current error of
$(I_{Bref} - I_{Beff}) / I_{Bref} = 1/(1 + A_L)$. (As $V_{AB}$ is meant to be
distributed to a large number of neurons, i.e. bias transistors and differential
pairs ($\alpha_{FC}$'s) distributed over a large area are required to match, there
is no point in increasing the gain for better accuracy.) For a sufficiently high
number of attached neurons, the $C_{gs}$'s of the bias transistors will act as
compensation capacitance ($C_C$).
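The effect of the regulating loop can be checked with a linearized steady-state model. The sketch below assumes that the loop settles where $I_{Beff} = A_L (I_{Bref} - I_{Beff})$ with $A_L = g_{mB} \alpha_{FC} / g_{dsI}$; the component values are hypothetical and the model ignores all dynamics and large-signal effects.

```python
def loop_gain(g_mB, alpha_FC, g_dsI):
    """Loop gain A_L = g_mB * alpha_FC / g_dsI of the regulating loop."""
    return g_mB * alpha_FC / g_dsI


def effective_bias(I_Bref, A_L):
    """Steady state of I_Beff = A_L * (I_Bref - I_Beff), solved for I_Beff."""
    return I_Bref * A_L / (1.0 + A_L)


def relative_error(I_Bref, A_L):
    """Relative deviation of the effective bias current from the reference."""
    return (I_Bref - effective_bias(I_Bref, A_L)) / I_Bref


# Two hypothetical values of the unknown current gain give nearly the same bias.
bias_low = effective_bias(1.0, loop_gain(2e-4, 0.5, 1e-6))
bias_high = effective_bias(1.0, loop_gain(2e-4, 0.9, 1e-6))
```

Even though the assumed current gain nearly doubles between the two cases, the effective bias current changes by well under one percent, which is the point of the regulation.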
	
Using a simple regulating loop to cancel unknown (global) process parameters
is a principle with general applicability. The principle is illustrated in the
figure of the general process parameter canceling circuit. The references can be
on-chip internal references (for instance temperature compensated, which would
cancel the temperature dependency of the unknown block). An off-chip reference
would be used if a signal (input or output) needs to be within some absolute
range (e.g. critical inter-chip signals in multi-chip systems).
The hyperbolic tangent neuron output range transresistance is an evident
component on which to apply current factor dependency canceling. If we wish to
calculate the neuron derivative off-chip as $1 - g^2$, we must ensure that the
neuron output is within a well-defined absolute range; the output reference would
thus correspond to the extreme of this range. The input reference must be the LBM
MOST differential pair output current that corresponds to this extreme neuron
output, i.e. $I_{Beff}$.


Figure: Non-unity e-c current gain canceling. Circuit for LBM MOST differential
pair. The saturated BJT differential pair output current (i.e. the effective tail
current) is compared to a reference current, amplified via the high impedance
($V_{AB}$) node and fed back to the tail current source. A single BJT (or a BJT
pair with one transistor turned off) can be used as reference BJT pair.
Figure: General process parameter canceling circuit. Principal schematic of the
method applied to cancel non-unity e-c current gain. The unknown transfer function
blocks must be matched.
 Temperature compensation
The excessive use of transconductances (outside feedback loops) in analogue
neural networks (e.g. synapse multipliers) makes temperature drift a major concern
(this was actually a problem in the experimental ANN system solving the sunspot
regression problem). The temperature dependence of the MRC, for instance, is
primarily determined by the mobility $\mu$, which follows a power law in the
absolute temperature $T$ with an exponent that depends on the substrate doping
(Sze). Over the operating temperature range this corresponds to a considerable
synapse strength drift. In most real-world applications such a large temperature
drift is unacceptable (unless the temperature is known to be constant, as in
implanted devices) and temperature compensation must be employed.
We can regard the temperature as a time-varying undocumented process parameter.
Thus temperature compensation can be implemented as above, provided that a
temperature-independent reference is available (such as a bandgap reference).
Of primary interest in relation to temperature drift is the effective neuron
activation slope (or, equivalently, the effective maximum synapse weight),
assuming the output range is well defined. As noted earlier, the influence of
process variations on the effective maximum synapse weight can be canceled by the
learning algorithm (assuming constant temperature, that is). However, in the
important special case of classification, the neurons are usually saturated after
the learning phase. In this case it is the relative (rather than the absolute)
synapse strengths that describe the network, and if we can assume a constant
system-wide temperature, such a system will function independently of the
temperature.† For regression problems, on the other hand, where analogue outputs
are required, temperature compensation is unavoidable.‡
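The size of such a drift is easily estimated for any quantity that follows a power law in the absolute temperature. In the sketch below the temperature range and the exponents are hypothetical parameters chosen for illustration: a negative exponent stands in for a mobility-like dependence and an exponent of one for a quantity proportional to $T$.

```python
def relative_drift(t_low_k, t_high_k, exponent):
    """Relative change of a quantity proportional to T**exponent when the
    absolute temperature rises from t_low_k to t_high_k (both in kelvin)."""
    return (t_high_k / t_low_k) ** exponent - 1.0


# Hypothetical operating range of 300 K to 330 K.
mobility_like = relative_drift(300.0, 330.0, -1.5)
proportional_to_t = relative_drift(300.0, 330.0, 1.0)
print(mobility_like, proportional_to_t)
```

A dependence that is merely proportional to the temperature drifts by ten percent over this assumed range, and a steeper power law drifts correspondingly more.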
 Other improvements
In addition to the important process parameter and temperature variation
compensation, several issues are subject to improvement in the developed
cascadable ANN chip set. These can be found in appendix D.

† In our system the temperature dependency of the synapse multipliers cancels
out with that of the neuron input scale transresistances. The resulting
temperature dependency of the effective neuron activation slope will be that of
$V_t = kT/q$, which is proportional to the absolute temperature.
‡ This is not absolutely true: using linear output neurons and assuming the
hidden neurons are saturated (a situation often found in regression networks,
see e.g. Krogh et al.; Svarer et al.; Pedersen and Hansen), temperature
dependencies could be canceled.

Summary
In this chapter we designed a cascadable analogue VLSI artificial neural network.
Several network models and topologies from the literature were presented.
Deterministic first order neurons using continuous valued currents and voltages
for signalling were chosen, and a cascadable architecture placing neurons on one
chip and synapses on another was selected for generality.
Essential building block components (memories, multipliers and thresholders)
were reviewed next. No good analogue memories are presently available. For
adaptive systems, simple capacitive storage seems the best choice, though weight
refresh is a problem. We have chosen to use a digital RAM weight backup memory,
even though this puts severe restrictions on the efficiency of the learning
scheme. Another possibility would be to use digital storage in combination with
an analogue adjustment. Using four-quadrant multipliers is not strictly necessary,
though probably advantageous; we use the very compact MRC. It was noted that in
order to be absolutely general, the dynamic range of the synapse multiplier in a
scalable system would have to be infinite; also, the output offset error was
important. We propose (though we shall not employ this) to use a highly nonlinear
multiplier to improve the dynamic range and thereby the relative output offset
error. As the neuron activation function we chose a hyperbolic tangent function
so as not to restrict the implementations of learning hardware (the derivative is
easily computed).
The design of our ANN chip set was presented. Further, a sparse input
synapse chip was proposed; this chip architecture exploits the limited input
bandwidth efficiently for problems with a discrete input alphabet. Measurements
on a fabricated chip set were presented, indicating a possible throughput in the
GCPS range per chip for a full scale chip set. Offset errors (primarily synapse
output and neuron input) were large, though tolerable. Training results from a
test perceptron implemented using our test chips and trained via a PC were
presented; the hardware network learning error was slightly worse than that of an
ideal software net.
Finally, the importance of process parameter dependency canceling in real
systems was stressed (including temperature compensation). A sample circuit
for eliminating the unknown forward emitter current gain of LBM MOSTs was
presented.
Chapter 
Preliminary conceptions on
hardware learning
The basic analogue artificial neural network architecture now being defined and
tested, we shall turn our attention to the implementation of analogue learning
hardware for this ANN. Some general conceptions on hardware learning will be
presented in this chapter. First we shall consider the tolerable amount of
hardware spent on the learning implementation. Secondly, the choices of learning
algorithms that we will implement are discussed. Finally, we give some general
considerations on the implementation of ANN learning algorithms using analogue
VLSI in relation to the limited precision of this technology.
Chapter  Preliminary conceptions on hardware learning Page 
  Hardware consumption
In an earlier chapter we argued that at least two niches for analogue hardware
implementations of learning algorithms exist:
• Massively parallel, possibly adaptive, application-specific systems having a
  parallel real-world interface.
• Small, adaptive, low-power, application-specific systems with a real-world
  interface.
Now, for both niches the tolerable amount of hardware put into the learning
algorithm is application dependent. At one extreme, when it is crucial to exploit
inherent parallelism, when speed rather than cost is important and when the
learning scheme is used extensively, there is no upper bound on the amount of
learning hardware (other than that it must be realistic to implement). A massively
parallel implementation of the learning algorithm is the choice. However, a vast
amount of learning hardware can severely limit the applicability of the learning
scheme for certain applications: the learning hardware lies idle when the system
is used in recall mode, and it reduces the number of integrated synapses for a
given silicon area. So, at the other extreme, when cost (or power consumption†)
rather than speed is important and when the learning scheme is employed only
occasionally, the amount of learning hardware must be kept as small as possible.
In this work we shall implement learning algorithms belonging to both
categories: a hardware-efficient implementation of back-propagation and a parallel
(though not fully parallel) implementation of real-time recurrent learning.
† For a given learning algorithm, the amount of energy used by the learning
hardware to process a single input/output pattern is independent of the
parallelism of the implementation (ideally; in reality a serial implementation
will most probably consume more energy than a parallel implementation). Thus, if
power consumption is the concern and if the learning algorithm is employed only
occasionally, a fully parallel implementation with a power-down circuit could be a
solution.
 Choice of learning algorithms
The choice of learning algorithm is highly dependent on the application at hand.
Different learning algorithms operate on different network architectures using
different optimization procedures for different goals. In addition, new learning
algorithms arise continuously, and it has thus been argued that VLSI learning
hardware should be adaptable to changing learning algorithms (cf. e.g. Ramacher).
However, as for the reconfigurability of the analogue ANNs, a high degree of
programmability of analogue learning hardware does not comply with the technology:
much hardware would lie idle or would be used to configure the algorithm
rather than participate in the computations. Neither for massively parallel nor
for low-power implementations is this acceptable; it would compromise the
advantages of the analogue technology (algorithmic variations can be included when
possible and appropriate, though; cf. the inclusion of both entropic and quadratic
cost functions elsewhere in this thesis). It is not possible (as it is, in
principle, for the ANN itself) to implement a general purpose analogue ANN
learning machine. Thus, ideally, the learning algorithm should be implemented
specifically for the application at hand.
An application-specific learning algorithm does not comply with our general
purpose building block ANN chip set, though; it would sacrifice the generality of
the ANN learning chip set (for the particular application this would not matter,
of course). If we do not have a specific application in mind, we must then choose
our learning algorithm carefully in order that it can be applied to a large range
of applications (Murray). The learning algorithm must possess the same
properties as the neural network model; desirably it should be:
• General purpose.
• Simple.
• Suitable for the technology.
The simplicity is very important, partly because a complex learning model requires
much hardware, but primarily because of the limited precision of the technology
(other things being equal, the more hardware participating in the calculations,
the larger the accumulated errors; the finer points of a complex learning scheme
would quickly be insignificant compared to the errors). Needless to say, the
learning algorithm must also map in a simple way on silicon: it should rely on
local communication and must not consume too much area (memory etc.).
Choosing a learning algorithm means choosing an application area. Artificial
neural networks can be applied to a wide range of applications (see e.g.
Sanchez-Sinencio and Lau), for instance:
• Classification or pattern recognition
• Regression
• Example described problems
• Function approximation
• Associative memories
• Feature mapping
• Optimization
• Control
• Data compression
In this work we shall predominantly be interested in applications classified as
classification or regression problems. These important classes of problems are
often successfully solved using artificial neural networks, for instance the
prediction of splice sites in human pre-mRNA molecules (Brunak et al.), pig
carcase grading (Thodberg), implantable heart defibrillators (Jabri et al.) and
high energy particle-detector track-reconstruction devices (Masa et al.). For
such problems supervised learning is usually employed. Choosing such a learning
algorithm for our system will make it applicable to a broad range of applications
(though not general purpose); we shall thus commit the following text to the
implementation of supervised learning algorithms. (The implementation of
unsupervised learning is also interesting; however, that is not our story.)
 Gradient descent learning
A very simple approach to supervised learning when the ANN has differentiable
neuron activation functions is to use gradient descent (cf. appendix B). Defining
a cost function $J$ which measures the cost of the network error, one adjusts the
free parameters of the ANN (i.e. the synapse strengths, neuron thresholds and
slopes etc.) such that $J$ decreases most rapidly, i.e. down the gradient:
$$\Delta w_{kj}(t) = -\eta \, \frac{\partial J(t)}{\partial w_{kj}(t)},$$
where the learning rate $\eta$ is a small positive constant. Real gradient descent
requires the cost function to be a function of all training patterns,
$J_{tot} = \sum_{ptns} J_{ptn}$ (batch learning). Often, though, the weights are
changed on the basis of the instantaneous cost function evaluated for one pattern
only (on-line learning). For small learning rates the methods are equivalent
(compare to the Gauss-Seidel method of solving linear equations numerically;
Press et al.).
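The difference between batch and on-line learning is only in when the weight is updated. The sketch below illustrates this for a single weight and a quadratic cost per pattern; the toy model $y = w x$, the patterns and the learning rate are assumptions chosen for illustration.

```python
def gradient(w, pattern):
    """dJ/dw for the per-pattern cost J = 0.5 * (w * x - target)**2."""
    x, target = pattern
    return (w * x - target) * x


def batch_epoch(w, patterns, eta):
    """Batch learning: one update using the gradient of the total cost."""
    total_gradient = sum(gradient(w, p) for p in patterns)
    return w - eta * total_gradient


def online_epoch(w, patterns, eta):
    """On-line learning: update after every pattern (instantaneous cost)."""
    for p in patterns:
        w = w - eta * gradient(w, p)
    return w


patterns = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by w = 2
w_batch = w_online = 0.0
for _ in range(200):
    w_batch = batch_epoch(w_batch, patterns, eta=0.01)
    w_online = online_epoch(w_online, patterns, eta=0.01)
```

With the small learning rate used here both variants settle at the same weight; the on-line version is the one that maps most directly on hardware, as no gradient needs to be accumulated over the whole training set.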
Gradient descent is not a very good optimization technique (cf. Hertz et al.,
among others). Most notably it
• Has a tendency to get stuck in local minima of the cost function.
• Converges slowly.
For improved convergence time, algorithms such as the conjugate gradient method or
quasi-Newton methods can be employed (see also Press et al.). These methods have
the advantage that they use the first order cost function derivatives (as gradient
descent does), which can be computed quite efficiently (cf. below), for computing
the weight changes. Using the second order derivatives (the full Hessian matrix or
approximations to it) can likewise improve learning time significantly (Hertz et
al.; Buntine and Weigend; Pedersen and Hansen). Other methods, such as simulated
annealing, are also quite interesting. Simulated annealing searches the weight
space with occasional uphill moves, and hence does not get stuck in a local
minimum as often as gradient based methods.
The expense of these algorithms compared to simple gradient descent is increased
computational cost.
In spite of its poor performance, gradient descent has attracted the most vigorous
interest in the neural network community. It has been thoroughly analysed and
tested in connection with artificial neural networks, and a wealth of improvements
to plain gradient descent has emerged. Further, it is very popular among
application people; quite impressive results have been obtained by the method
(e.g. Brunak et al.; Masa et al.; see also Williams and Zipser; Wulff). Gradient
descent is central in the present state of neural network art. Finally, the method
is quite simple and maps topologically very nicely on VLSI, which suggests that
an analogue hardware implementation would be possible.
This is the motivation for using gradient descent, which we shall do: we should
have a fair chance of doing an implementation in our limited precision technology
(though cf. the following chapters), and users (application people) would know
what to expect from the implementation; they would buy our solution.
It should be noted that a supervised learning system (e.g. using gradient
descent) can be extended in a straightforward manner to implement learning with a
critic, which can be used for prediction and control.
Other authors have also looked at the implementation of learning in analogue
hardware. Many of these use gradient descent based algorithms, though not
everyone: Alspector et al., for instance, use a simulated annealing scheme
(Boltzmann machine), Card uses Hebbian learning (unsupervised learning) and Macq
et al. use Kohonen feature mapping.
 Error backpropagation
The usual network architecture applied to classification and regression problems is the multi-layer perceptron (MLP). The error backpropagation learning algorithm (Rumelhart et al.; Hertz et al., chapter ) is a formulation of gradient descent which maps on feed-forward ANN architectures (e.g. the multi-layer perceptrons). From an analogue VLSI point of view it has a number of drawbacks, though (in addition to the drawbacks of gradient descent learning), for example:

• The neuron derivative needs to be computed.

• It is very sensitive to offsets on various signals (most notably the weight changes).

• The learning rate must be rather small for convergence.

It should be noted that these problems are not particular to backpropagation; they apply to many other algorithms as well (see also section ). The problems shall be addressed in the following chapters. From an analogue VLSI point of view, backpropagation also has a number of advantages (which also apply to many other algorithms), for example:

• Fully parallelizable.

• Uses local signalling.

• Relatively simple.

The implementation of backpropagation in analogue VLSI has been considered by several authors (e.g. Valle et al.; Wang; Cho et al.; Lehmann). Other authors have chosen to derive new algorithms with the implementation in analogue VLSI in mind. These alternatives to backpropagation include weight perturbation and virtual targets.
Weight perturbation. By some considered the standard algorithm for analogue VLSI (as opposed to the standard algorithm, backpropagation, for simulated networks), weight perturbation (Jabri and Flower) is a difference quotient approximation to gradient descent learning. Weight perturbation is inspired by the fact that backpropagation (i) usually requires three times the synapse hardware of a recall mode system (though cf. chapter ) and (ii) requires the computation of the neuron activation function derivatives. Given an (instantaneous) cost function $J(w, t)$, weight perturbation prescribes the weight changes
$$\Delta w_{kj}(t) = -\eta\, \frac{J\bigl(w_{kj}(t) + \Delta w_{\mathrm{pert}},\, t\bigr) - J\bigl(w_{kj}(t),\, t\bigr)}{\Delta w_{\mathrm{pert}}},$$
where $\Delta w_{\mathrm{pert}}$ is the weight perturbation constant (possibly weight dependent).
This very direct way of approximating the cost function derivative has the additional advantage that any kind of nonlinearity, offset error, etc. in the recall mode network is transparent to the algorithm: its impact on the cost function derivative will be included by the algorithm (in contrast, implementations of backpropagation usually require reasonably linear synapses, as neuron derivatives are used to compute the cost function derivative). Assuming that the weights are externally accessible, very little hardware and no extra signal routing is required for a hardware implementation of weight perturbation. The drawback of the algorithm
is that it is computationally expensive ($O(N^4)$ per training pattern or time step) and not fully parallelizable (serial weight update). The latter problem has been addressed by Flower and Jabri in summed weight neuron perturbation: a speed-up of $O(N)$ can be achieved at the expense of an $O(N)$ storage requirement.
Matsumoto and Koga use an algorithm very similar to weight perturbation. Oscillating all the weights concurrently at different frequencies, the weight derivatives, and thus the weight changes, can be computed simultaneously. (This procedure introduces new problems related to accurate band-pass filtering, intermodulation and system bandwidth requirements, though.)
It should be noted that weight perturbation can be applied to any network topology, not just MLPs.
Virtual targets. While not exactly a gradient descent learning algorithm, the virtual targets MLP learning algorithm (Murray) uses gradient descent for each layer locally. The weight change rule is the same as for backpropagation (assuming a quadratic cost function):
$$\Delta w^l_{kj}(t) = \eta\, g'\bigl(s^l_k(t)\bigr)\, \varepsilon^l_k(t)\, z^l_j(t).$$
The neuron error $\varepsilon^l_j(t)$ is computed differently from backpropagation. In virtual targets, all neurons are assigned target values for all training patterns. The targets for the hidden neurons are initialized to random values and are developed during training according to
$$\Delta d^l_k(t) = \eta_{\mathrm{tgt}} \sum_j w^{l+1}_{jk}\, \varepsilon^{l+1}_j(t),$$
where $\eta_{\mathrm{tgt}}$ is a target learning rate. When a pattern is successfully learned, the algorithm must cease to react on this pattern (otherwise the hidden neuron activations would drift, causing the pattern to be unlearned); if the classification of the pattern is forgotten during further training, reaction on the pattern is resumed.
Explicitly using target values for the hidden neurons can improve the learning speed, and the algorithm seems to possess an ability to jump out of local minima. The disadvantage of the algorithm is the requirement of hidden neuron access and target storage; furthermore, the scheme is more complicated than, for instance, backpropagation or weight perturbation.
Being central in the present state of neural network art the implementation of
the gradient descent backpropagation learning algorithm in analogue VLSI is an
important issue of integrated ANN research A problem which we shall address in
the following chapter  with emphasize on the issues of hardware cost derivative
computation and weight updating schemes
 Real-time recurrent learning
Though very popular for pattern recognition etc., feed-forward ANN architectures have their limitations. A more general set of architectures are recurrent networks (recurrent artificial neural networks, RANNs). If no constraints are put on the connections (as, for instance, the symmetric connections found in many architectures), recurrent network architectures are potentially very powerful. They have the ability, for example, to deal with:

• Temporal information.

• Storing of data.

• Attractor dynamics.

Using supervised learning (read: gradient descent) to train recurrent neural networks, these can be taught to recognize sequential structures (e.g. Reber grammars; Smith and Zipser), to imitate finite automatons (e.g. Turing machines; Williams and Zipser) or to simulate strange attractors (e.g. Mackey-Glass series; Wulff). Recurrent neural networks can also be used instead of, say, tapped delay line perceptrons (e.g. Pedersen and Hansen). If a temporal pattern recognition task is dependent on a few unknown, temporally widely distributed input samples, a recurrent network can solve the task using much fewer connections, which can be very important, especially if only a relatively small training data set is available. (This is actually the application driven part of our motivation for implementing a recurrent network learning scheme: the prediction of splice sites in pre-mRNA molecules mentioned in section  can be solved using a  neuron RTRL network (Brunak and Hansen).) The trouble with recurrent networks is that they are usually very hard to train.
To make available to the users of our ANN system the potent possibilities of recurrent neural networks, we shall, in addition to the implementation of backpropagation learning, investigate the implementation of a learning algorithm for our ANN architecture connected in a recurrent way. Several examples of gradient descent like algorithms for recurrent networks exist in the literature (see e.g. Hertz et al.). So as not to compromise the generality of our cascadable ANN architecture (more than absolutely necessary), we must choose an algorithm with the most general applicability. Real-time recurrent learning (RTRL) (Williams and Zipser; chapter ) is a good choice. This algorithm has a number of advantages:

• General ANN architecture. The RTRL algorithm is formulated for a completely general ANN architecture: a fully interconnected network. The network will organize itself to reflect the structure of the application during learning. If any structure of the problem to be solved is known a priori (which should then be reflected in the network architecture), the algorithm can just as well teach a constrained architecture. For the generality of our ANN learning chip set this is very important: it is possible to implement an ANN learning chip set applicable to any network topology.

• Application invariant. When the network topology and size are determined, the learning scheme is determined, independent of the application. (The storage requirement for other RANN learning algorithms, such as backpropagation through time or time-dependent recurrent backpropagation, is often proportional to the maximum sequence length (the memory of the system) that needs to be processed.) An application independent learning hardware architecture is important to our general system.

• Real-time training. Or in-the-flight training. Unlike many other algorithms, RTRL does not use a training phase and a recall phase; RTRL functions in-the-flight, training the system during use. This is essential to adaptive systems, but also very important to analogue implementations in general: storing the training patterns for batch learning is hostile to an analogue implementation (the storage must (most probably) be in digital RAM) and is not in compliance with the real-world interface requirement of analogue learning hardware. Of course, if training patterns are not held in store for teaching, the environment in which the system resides must be able to generate representative learning sequences when the system is to be taught.†

• Hardware compatible. The algorithm is parallelizable (fully or partially), and it turns out that the architecture maps very nicely on hardware. Furthermore, it is computationally a fairly simple algorithm, which is suitable for analogue hardware.

• Powerful. A range of impressive problems have been solved using RTRL (the above cited examples are solved using RTRL).
It also has a number of disadvantages:

• Computation requirements. RTRL requires an order $O(N^4)$ computation primitives for each training example, or (as the required number of examples scales as $O(N^2)$ at the least) an order $O(N^6)$ computation primitives to train the network. This is the major drawback of RTRL; RTRL requires parallel computing even for relatively small systems.

• Memory requirements. RTRL requires memory of an order $O(N^3)$. Even for a (semi-)parallel implementation this will limit the network size.

• Trainability. Recurrent networks are harder to train than feed-forward networks (and should thus be used only when that type does not suffice). Some argue (e.g. Tsoi et al.) that the completely general, fully connected architecture is too hard to train, and that one should select less general recurrent architectures. Note, however, that one must always use a priori knowledge of any kind in the problem at hand; thus the general architecture should be selected when nothing is known of the solution to the problem.
In chapter  we shall investigate the implementation of the RTRL algorithm, with emphasis on the issues of hardware cost, derivative computation and weight updating schemes, as for the backpropagation implementation.

† Though usually meant to be used off-line, backpropagation employed by example can be used in a similar in-the-flight (on-line) manner; thus it is not of paramount importance to hold the training patterns in store when using (MLP) backpropagation.
 Hardware considerations
Implementing articial neural networks using limited precision technologies as ana
logue VLSI is by now considered a fairly straight forward matter The inherent
adaptability of the ANN systems can accommodate for nonidealities For learn
ing algorithms this is not so Several authors have noted that learning algorithms
presently available to the analogue designer are typically very sensitive to certain
kinds of nonidealities displayed by for instance analogue VLSI In this section we
shall have a look at the most important ones The discussion is based on work pri
marily concerning gradient descent like algorithms some of the issues are specic
to these kind of algorithms as the derivative computation	 while other probably
are generally applicable as the weight change oset	
The allowable nonideality magnitudes are dependent on each other as well as
being application topology and size dependent The observations below are only
qualitative
Weight discretization. Most ANN implementations use connection strengths with a discrete number of values, either as a consequence of a weight refreshing scheme or because of digital (non-volatile) weights used in analogue/digital hybrids. A discrete number of weight values limits the problem space solvable using the network, of course; in other words, this will degrade performance (Xie and Jabri; Lehmann).
Much more important, however, is the fact that the smallest weight change is restricted to one LSB: weight values computed by gradient descent learning algorithms, for instance, are constituted of many small weight changes. Therefore a much higher weight resolution is required during learning than in recall mode. To meet this demand (assuming the ANN has weight resolutions tailored to the recall mode), the learning hardware can have access to a high precision version of the synapse strengths on the network (Hollis et al.; Asanovic and Morgan; see also section ). Another procedure is to use probabilistic rounding: for computed weight changes $|\Delta w|$ smaller than 1 LSB, a 1 LSB weight change is carried out with a probability $|\Delta w| / \mathrm{LSB}$ (Hohfeld and Fahlman). See also section 
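Probabilistic rounding can be sketched in a few lines. The function below is an illustration only: the handling of changes larger than one LSB (whole steps applied directly, the remainder rounded probabilistically) is an assumption added here, and the names and the LSB value are hypothetical.

```python
import random


def probabilistic_round(w, dw, lsb, rng):
    """Apply the weight change dw to a weight stored in multiples of lsb.

    Whole LSB steps are applied directly; the remaining fraction of an LSB
    is applied as one extra LSB step with probability |remainder| / lsb,
    so the expected applied change equals dw.
    """
    steps = int(dw / lsb)                  # whole LSBs, truncated towards zero
    remainder = dw - steps * lsb           # same sign as dw, magnitude below lsb
    if rng.random() < abs(remainder) / lsb:
        steps += 1 if remainder > 0 else -1
    return w + steps * lsb


rng = random.Random(0)
lsb = 1.0 / 128
dw = 0.1 * lsb                             # every change is below one LSB
w_trunc, w_prob = 0.0, 0.0
for _ in range(10000):
    w_trunc += int(dw / lsb) * lsb         # plain truncation: weight never moves
    w_prob = probabilistic_round(w_prob, dw, lsb, rng)
ideal = 10000 * dw                         # what unlimited resolution would give
```

With plain truncation the accumulated change is lost completely, whereas the probabilistically rounded weight follows the ideal value on average while remaining a multiple of the LSB.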
Dynamic range. When trained using a gradient descent like algorithm, for instance (e.g. on a pattern recognition problem), the synapse strength magnitudes tend to grow with time. This can easily exhaust the limited dynamic range of the synapses (see also section ). Using logarithmic coded synapse strengths (Hollis et al.) can increase the effective dynamic range. Also, weight decay can be employed, or the neuron gain can be increased during learning (Hollis et al.), to postpone weight exhaustion.
Derivative computation. Gradient descent like algorithms often need the neuron derivatives to compute the cost function gradient. This is a major concern in many analogue implementations. Several authors have reported that learning can take place even for very approximate neuron derivative calculations (e.g. Valle et al.). (The learning trajectory will not follow the gradient in this case, of course.) One property the calculated neuron derivative must possess, though: it must have the right sign. This could typically be a problem for saturated sigmoid neurons, for which the actual derivative is close to zero: if the calculated derivative is negative, a gradient descent weight updating rule would result in an uphill cost function climb, possibly irrecoverably bringing the neuron deeper into saturation (Lehmann; Krogh et al.; see also Woodburn et al.). A small positive offset deliberately introduced to the neuron derivative calculation circuit can prevent this hazard (Lehmann; Shima et al.). This would also enable saturated neurons to be taught (still using a gradient descent like algorithm), which is otherwise prevented by the zero derivatives.
Oset errors Possibly the most problematic issues of analogue VLSI imple
mentations of ANN learning algorithms are the ever present o
set errors While
oset errors on some signals for instance the neuron net inputs	 are insignicant
they can completely prevent learning when present on other signals Most sensitive
to oset errors are
Weight change offsets: If the weight change offset error, $\Delta w_{\mathrm{ofs}}$, is comparable (in some sense) with a typical weight change $\Delta w$, the weights would develop over time as $w_{kj}(t) = w_{kj}(0) + t\, \Delta w_{\mathrm{ofs}}$ rather than governed by the learning algorithm; for sufficiently large $\Delta w_{\mathrm{ofs}}$ and $t$, learning is impossible (Lehmann; Montalvo et al.; Withagen).
Neuron error offsets: Offsets on the errors of output neurons just displace the target values (which can be serious enough for analogue outputs). Offsets on the errors of hidden neurons can be more severe, though: such offsets cause learning to take place on the hidden neurons even after the output error is zero; i.e. the solution to the training problem is not a stable state of the system (Lehmann; see also Murray).
Cost function offsets: For weight perturbation, where the learning is controlled by the computation of the cost function, offset errors on this quantity will, as above, cause the solution to the training problem to be an unstable state of the network, which degrades learning (Montalvo et al.).
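The influence of a weight change offset can be illustrated with a deliberately simplified sketch. A single linear neuron with one weight is assumed here (not a case treated in the text); for this neuron a constant offset does not make the weight run away, but it leaves a residual error of $-\Delta w_{\mathrm{ofs}} / (\eta z)$, which shows why the offset must be small compared to typical weight changes. All names and values are hypothetical.

```python
def train(eta, dw_ofs, steps=2000, z=1.0, d=0.5):
    """Gradient descent on one linear neuron with a weight change offset.

    Returns the residual neuron error d - w*z after training.
    """
    w = 0.0
    for _ in range(steps):
        eps = d - w * z                    # neuron error of a linear neuron
        w += eta * eps * z + dw_ofs        # ideal weight change plus offset
    return d - w * z


# The update stops where eta*eps*z + dw_ofs = 0, i.e. eps = -dw_ofs/(eta*z).
res_small_eta = train(eta=0.01, dw_ofs=1e-3)   # settles near -0.1
res_large_eta = train(eta=0.1, dw_ofs=1e-3)    # settles near -0.01
```

Increasing the learning rate by a factor of ten reduces the residual error by the same factor in this toy case.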
Learning rate. For simulated networks the learning rate is usually chosen quite small for good learning; smaller than is typically compatible with analogue VLSI. In the presence of weight discretization and weight updating offsets, the learning rate must be so large that these effects are "small" compared to typical weight changes (Tarassenko et al.). Also, if a learning scheme is used to refresh a purely capacitive synapse storage, the typical weight change must be large compared to the memory droop rate (see also Hansen and Salamon; Lehmann and Hansen).
Noise. Analogue systems are noisy. While noise is beyond doubt a nuisance in many analogue signal processing systems, it can actually be an advantage in ANN learning systems. (The limited resolution, i.e. signal to noise ratio, of analogue systems is not comparable to the limited resolution, in number of signal processing bits, of digital systems.) The presence of noise can improve learning (make the system jump out of local minima because of occasional random movements), improve generalization ability (the network is forced to locate the underlying structure of the training data in the presence of noise) and improve fault tolerance (the information tends to spread more evenly among the synaptic connections) (Edwards and Murray; Jim et al.; see also Hertz et al. and Qian and Sejnowski).
Chapter
Implementation of on-chip backpropagation
The inclusion of backpropagation learning on our ANN chip set, using a small amount of additional hardware, is the objective of this chapter. The learning algorithm is first described, after which it is shown how it can be mapped on our ANN architecture, and our hardware efficient solution is presented. The design of an experimental VLSI chip set (a synapse and a neuron chip) and measurements on this are presented next. We also present the design of a complete backpropagation system, including the learning hardware not present on the chips themselves. Problems in relation to derivative computation and learning in a system using digital weight backup are discussed. The novel non-linear backpropagation learning algorithm is displayed, and we show that the algorithm has several nice properties in relation to an analogue hardware implementation; hardware for a very low-cost implementation is presented. Reflections on future work are then given: a chopper stabilization technique for elimination of offset errors is proposed, and the inclusion of algorithmic variations in our system is outlined. A summary concludes the chapter.
  The backpropagation algorithm
The error backpropagation learning BPL	 algorithm is a supervised gradient de
scent algorithm cf appendix B	 In this section we describe the basic algorithm
and display modications typically applied to it
Chapter 	 Implementation of onchip backpropagation Page 
 Basics
The error backpropagation learning algorithm for a layered feedforward neural
network multilayer perceptron MLP cf appendix B gure 
B
	 can be de
scribed as follows Hertz et al  Rumelhart et al 	 Given an input vector
xt	 at time t we can write the neuron k activation in layer l 
B
	 as
y
l
k
t	 " gs
l
k
t		 " g

X
j
w
l
kj
t	z
l
j
t	

 
 
	
where we assume g
l
k
 	  g 	 and where
z
l
j
t	 "


x
j
t	  for l " 
y
l
j
t	  for   l  L
 
The neuron biases are implicitly given as the connection strengths from constant inputs $z^l_0 = 1$. Given a set of target values $d_k$ for the neurons in the output layer $L$, we define the neuron errors (when using a quadratic cost function; cf. appendix B) as
$$\varepsilon^l_k(t) = \begin{cases} d_k(t) - y^l_k(t), & \text{for } l = L \\ \sum_j w^{l+1}_{jk}(t)\, \delta^{l+1}_j(t), & \text{for } 1 \le l < L, \end{cases}$$
where the weight errors (or deltas) are defined as
$$\delta^l_j(t) = g'\bigl(s^l_j(t)\bigr)\, \varepsilon^l_j(t).$$
Using a discrete time, on-line learning updating scheme, the connection strengths should then be changed according to the weight updating rule
$$w^l_{kj}(t+1) = w^l_{kj}(t) + \Delta w^l_{kj}(t) = w^l_{kj}(t) + \eta\, \delta^l_k(t)\, z^l_j(t),$$
where $\eta$ is the learning rate.
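The recall pass, the backward error pass and the on-line weight update can be put together in a short sketch. It assumes $g = \tanh$, uses 0-based layer indices, stores the bias as the weight from a constant input 1 in column 0, and all names, weights and parameter values are hypothetical; it is meant as an illustration of the equations, not of the hardware.

```python
import math


def g(s):
    return math.tanh(s)


def g_prime(s):
    return 1.0 - math.tanh(s) ** 2


def forward(weights, x):
    """Recall mode. Returns layer inputs z, net inputs s and the output y."""
    zs, ss = [], []
    z = [1.0] + list(x)                    # constant input 1 carries the bias
    for w in weights:
        s = [sum(wkj * zj for wkj, zj in zip(row, z)) for row in w]
        zs.append(z)
        ss.append(s)
        z = [1.0] + [g(sk) for sk in s]    # inputs of the next layer
    return zs, ss, z[1:]


def backprop_step(weights, x, d, eta):
    """One on-line backpropagation update; returns the cost before the update."""
    zs, ss, y = forward(weights, x)
    eps = [dk - yk for dk, yk in zip(d, y)]            # output neuron errors
    for l in reversed(range(len(weights))):
        delta = [g_prime(sk) * ek for sk, ek in zip(ss[l], eps)]
        # Neuron errors of the layer below, computed with the weights that
        # have not been updated yet (column 0 is the bias and is skipped).
        eps = [sum(weights[l][k][j] * delta[k] for k in range(len(delta)))
               for j in range(1, len(zs[l]))]
        for k, row in enumerate(weights[l]):           # weight updating rule
            for j in range(len(row)):
                row[j] += eta * delta[k] * zs[l][j]
    return 0.5 * sum((dk - yk) ** 2 for dk, yk in zip(d, y))


# Two inputs, two hidden neurons, one output neuron.
weights = [[[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3]],
           [[0.05, -0.3, 0.2]]]
costs = [backprop_step(weights, [1.0, -1.0], [0.5], eta=0.2) for _ in range(100)]
```

Training on a single pattern, as done here, only demonstrates that the cost decreases; it says nothing about generalization.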
 Variations
Though still often serving as the reference for new MLP learning algorithms, the efficiency of the basic backpropagation algorithm has been questioned by several authors: the algorithm is slow, and the network often gets stuck in local minima (e.g. Fahlman; Hertz et al.; Haykin). For this reason a wealth of backpropagation-like algorithms (or improvements of the algorithm) has emerged. Many of these improvements do not alter the basic topology of the algorithm and are thus easy to incorporate in the VLSI architectures that we shall describe shortly. However, the cost of the incorporation is very dependent on the exact implementation (i.e. whether digital/analogue weights are used, whether parallel/serial weight update is used, etc.). The most common modifications of the algorithm (which can also be applied to many other learning algorithms) include (Hertz et al.; Haykin; Plaut et al.; Krogh and Hertz; Solla et al.; Fahlman; and others):

 Weight decay Modifying the weight updating rule 
 
	 as
w
l
kj
t! 	 "

w
l
kj
t	 ! #w
l
kj
t	
	
  
dec
	  

 
	
where 
  
dec
  is the weight decay parameter discourages large weight
magnitudes and eliminates small ie unnecessary	 weights as OBD	 This
improves generalization ability From an analogue VLSI point of view dis
couraging large weights is also advantageous as the limited dynamic weight
range is less likely to be exceeded

 Momentum Modifying the weight change implicitly dened by 
 
	 as
#w
l
kj
t	 " 
mtm
#w
l
kj
t  	 ! 
l
k
t	z
l
j
t	  
 
	
where 
  
mtm
  is the momentum parameter averages random weight
changes and magnies consistent ones This reduces oscillations often found
during learning The disadvantage of using momentum is the need for addi
tional memory especially severe for a VLSI implementation	

 Cost function Using the standard	 quadratic cost function cf appendix
B 
B
		 leads to the weight errors 
l
j
in 
 
	 Using a typical sigmoidlike
activation function these weight errors will be close to 
 when the neuron
net input s
l
j
is numerically large thus even for large neuron errors 
l
j
 no
weight change will take place This problem can be eliminated by the use
of the entropic cost function 
B
	 or the Fahlmann perturbation The latter
resulting in the following heuristically derived	 weight error

l
j
t	 "

g
 
s
l
j
t		 ! 
F
	

l
j
t	 
where 
F
 
 is the derivative perturbation For the entropic cost function

L
j
t	  
L
j
t		 Theoretically the weight errors should only be modied for
the output layer l " L	 However in an analogue VLSI implementation using
the derivative perturbations on all weight errors can ensure g
 
s
l
j
t		 ! 
F
 

rather than g
 
s
l
j
t		



 which would probably otherwise be calculated by the
nonideal hardware and which is destructive for the learning process
In adaptive systems it is especially important that learning can take place
even though the neurons are saturated ie very condent in their decision
to re or not	 as the system functionality changes over time In this case
the entropic cost function or the Fahlmann perturbation are superior to the
quadratic cost function
The incorporation of these alternative cost functions in a VLSI implemen
tation is straight forward

• Dynamic learning rate. The learning rate is a most important parameter: if it is too large, gradient descent leads to oscillations; if it is too small, gradient descent converges very slowly. Adapting the learning rate to the increase in the cost function since the last set of weight changes, $\Delta J(t) = J(t) - J(t-1)$, can reduce this problem:
$$\eta(t+1) = \begin{cases} \eta(t) + a, & \text{for } \Delta J(t), \Delta J(t-1), \ldots, \Delta J(t-T) < 0 \\ (1 - b)\,\eta(t), & \text{for } \Delta J(t) > 0 \\ \eta(t), & \text{otherwise,} \end{cases}$$
where $0 < a, b$ and $T$ are constants. Though not impossible to implement in analogue VLSI, a dynamic learning rate scheme is somewhat complex (and requires memory, $O(1)$; in addition, the cost function must be simple in order to be computable). Also, in a hardware implementation one must ensure that the learning rate does not decrease below a critical value $\eta_{\mathrm{crit}}$, where the weight changes are insignificant compared to weight discretization or weight change offset errors, which would switch off learning.

• Eta finder. To avoid the complexity of a dynamic learning rate, one can choose an optimal learning rate. Reyneri and Filippi propose, for y
k

y
max
 y
max
 z
j
 z
max
 z
max
	

l
k
"

y

max
z

max
	

t
M
l
k

where 
l
k
" 
l
is the layer l learning rate M
l
k
"M
l
is the eective	 number
of inputs fanin	 to layer l and 	
t
is the neuron transfer function steepness
Others prefer Hertz et al  	

l
k
" 

q
M
l
k

which can be combined with a dynamic learning rate rule. For a non-reconfigurable network, the learning rates can be computed in advance for the current network architecture. For a reconfigurable network, additional hardware at each reconfigurable block is needed if the learning rate computation is to be automated.

• Batch learning. Doing real gradient descent, the weights are updated only after each epoch:
$$w^l_{kj}\bigl((1+n)\,T_{\mathrm{epc}}\bigr) = w^l_{kj}(n T_{\mathrm{epc}}) + \sum_{t=0}^{T_{\mathrm{epc}}-1} \Delta w^l_{kj}(n T_{\mathrm{epc}} + t),$$
where $T_{\mathrm{epc}}$ is the epoch length (cf. appendix B). Usually on-line learning is considered the faster method;† however, using batch learning it is possible to do the weight updates in larger chunks, which makes this scheme potentially less sensitive to offset and weight discretization (which is very important for an analogue VLSI implementation) (Valle et al.). The disadvantage is that additional memory is needed ($O(N^2)$).

† Note, however, that on-line learning never converges (for constant learning rate); the weights will stir about the optimal solution (White; Battiti and Tecchiolli).
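Three of the modifications listed above (weight decay, momentum and the Fahlman perturbation) only change the per-weight arithmetic, which the following sketch makes explicit. The parameter names `eps_dec`, `alpha_mtm` and `eps_f` and all numerical values are placeholders chosen for this illustration; setting them to zero recovers the basic weight updating rule.

```python
import math


def weight_error(g_prime_s, eps, eps_f=0.0):
    """Weight error (delta) with an optional derivative perturbation eps_f."""
    return (g_prime_s + eps_f) * eps


def update_weight(w, dw_prev, delta_k, z_j, eta, alpha_mtm=0.0, eps_dec=0.0):
    """One weight update with optional momentum and weight decay.

    Returns the new weight and the weight change, which must be stored
    for the next update when momentum is used.
    """
    dw = alpha_mtm * dw_prev + eta * delta_k * z_j
    return (w + dw) * (1.0 - eps_dec), dw


# A saturated tanh neuron: the true derivative is almost zero, so the plain
# weight error vanishes even though the neuron error is large.
s, eps = 10.0, -1.9
gp = 1.0 - math.tanh(s) ** 2
plain = weight_error(gp, eps)
perturbed = weight_error(gp, eps, eps_f=0.1)

# Momentum magnifies consistent weight changes: 0.05, then 0.9*0.05 + 0.05.
w1, dw1 = update_weight(1.0, 0.0, 0.5, 1.0, eta=0.1, alpha_mtm=0.9)
w2, dw2 = update_weight(w1, dw1, 0.5, 1.0, eta=0.1, alpha_mtm=0.9)

# Weight decay alone shrinks a weight that receives no other change.
w_dec, _ = update_weight(1.0, 0.0, 0.0, 1.0, eta=0.1, eps_dec=0.01)
```

The extra state `dw_prev` is the additional per-weight memory that makes momentum costly in a fully parallel implementation.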
 Mapping the algorithm on VLSI
The MLP recall mode equation can be written $y^l = g(s^l)$, $s^l = w^l z^l$, as noted in section  (see Lehmann and Bruun; Widrow and Lehr). Likewise, we can write the neuron error equation as $\varepsilon^{l-1} = (w^l)^T \delta^l$. We notice two important properties of these equations: (i) the matrix used to calculate the neuron error is the transpose of the one used to calculate the neuron net input; (ii) the signal flow is reversed. For an on-chip implementation of the backpropagation algorithm this means that we can calculate the neuron errors (and let the signals propagate from layer $l$ to layer $l-1$) using a matrix-vector multiplier topologically identical to our recall mode synapse chip, but with the positions of the inputs and the outputs exchanged. In other words: we can use an expanded version of the synapses. The neuron chip must in turn be able to calculate $\delta^l$ as given by the weight error definition above.
Figure: Schematic backpropagation synapse. Two additional current output multipliers are basically needed compared to a recall-mode synapse.
Figure: Schematic backpropagation neuron. An additional neuron derivative computing block and multiplier are needed compared to a recall-mode neuron.
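The transposition property used for the mapping can be checked numerically with a small sketch: the same weight matrix is read along its rows in recall mode and along its columns when the neuron errors are propagated backwards. The function names and the example matrix are hypothetical and serve only as an illustration.

```python
def matvec(w, z):
    """Recall mode: s_k = sum_j w[k][j] * z[j] (one sum per matrix row)."""
    return [sum(wkj * zj for wkj, zj in zip(row, z)) for row in w]


def matvec_transposed(w, delta):
    """Reverse mode: eps_j = sum_k w[k][j] * delta[k] (one sum per column)."""
    return [sum(w[k][j] * delta[k] for k in range(len(w)))
            for j in range(len(w[0]))]


w = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]                      # two neurons, three inputs
s = matvec(w, [1.0, 0.0, -1.0])            # net inputs of the two neurons
eps = matvec_transposed(w, [1.0, -1.0])    # errors of the three inputs
```

In hardware terms, the reverse mode corresponds to exchanging the roles of the input and output lines of the synapse matrix.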
Block diagrams of the expanded synapse and neuron can be seen in the two figures above, respectively. On the synapse is also included the hardware for calculating the weight change $\Delta w^l_{kj}$ according to the weight updating rule. The expanded synapse has voltage inputs and current outputs, just as the original synapse. Mapping the algorithm on silicon like this gives an order $O(N^2)$ improvement in speed compared to a serial approach, as all ($O(N^2)$) weights can be updated simultaneously. Several backpropagation silicon systems with the above or similar architectures have been reported lately (Valle et al.; Wang; Cho et al.). By enabling the neuron to route back the $y^l_k$ signal on the $\delta^l_k$ wire, the architecture can also realize Hebbian learning† or a backpropagation/Hebbian hybrid algorithm (Cho et al.; Shima et al.).
† Plain Hebbian learning uses the weight changes $\Delta w^l_{kj} = \eta\, y^l_k z^l_j$ (Hertz et al.).

Instead of placing the weight updating hardware at the synapse sites, it can be placed at the neuron sites, which reduces the amount of weight updating hardware by an order $O(N)$. In this case only weights from one neuron in layer $l-1$ to the neurons in layer $l$ can be updated simultaneously; thus this procedure will give an order $O(N)$ improvement in speed compared to a serial approach. It should be noted that, according to the weight updating rule, $z^l_j(t)$ is needed when calculating the $\Delta w^l_{kj}(t)$'s and must thus be routed to the sites of the weight updating hardware; if it is the $w^l_{kj}(t+1)$'s that are calculated, the $w^l_{kj}(t)$'s are also needed. The efficiency of such a scheme is highly dependent on the chosen weight storage method. Using simple capacitive storage without backup memory, the weight change would typically be applied by a transconductance amplifier (see Card; Wang; Linares-Barranco et al.; Woodburn et al.), which corresponds to the amplifier of figure . To avoid weight degradation/destruction by charge redistribution on bus lines, this amplifier (or some kind of shielding circuit) would have to be present in any implementation, thus reducing the amount of saved hardware. As digital memory usually can be read non-destructively, there are no similar penalties when using capacitive storage with a digital backup memory or digital storage (the objective of the learning algorithm in these cases is to modify the digital memory). Saving weight updating hardware is particularly important when the weight modifications are in the digital domain, as an A/D converter is needed at each modification site. Placing A/D converters with more than one bit precision at each synapse would be unrealistic; the area consumption would be too large (Shima et al.). In systems using a digital RAM backup memory, the weight access would (most probably) be serial; thus the weight updating hardware could, without further performance loss (in orders of $N$), be placed off both synapse and neuron chips in a single (i.e. $O(1)$) separate module. In this case the cost of the weight update A/D converter is insignificant, but the weight updating would be serial (an order $O(1)$ speed improvement).
Using an O(1) weight updating module has the advantage of low cost
implementations of certain algorithmic improvements (the ones related to the
weights rather than the neurons), as no additional synapse hardware is needed
for a more complex updating scheme. Implementing momentum, for instance,
basically requires a memory and an adder (or a leaky integrator for continuous
time) at each synapse site when using a fully parallel (O(N^2)) weight updating
scheme. Using an O(1) weight updating module, only one adder would be required.
The weight change matrix could be placed in a standard digital RAM, which would
cost far less than additional synapse hardware. If digital weight backup memory
(or indeed digital weight memory) was used in the first place, no additional
data converters would be needed. The O(1) weight updating scheme
implementations also have potentially better accuracy than fully parallel ones
(see below).
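
The serial scheme with momentum can be sketched in software as follows. This is a minimal illustration only, not the thesis hardware: the function name, the learning rate `eta`, the momentum factor `alpha` and the common momentum form (new change = gradient step plus `alpha` times the previous change) are assumptions made for the example. The two nested lists stand in for the weight RAM and the weight change RAM, and the loop body corresponds to the single shared adder.

```python
def serial_momentum_update(weights, weight_changes, delta, z, eta=0.1, alpha=0.9):
    """Update one layer serially, one weight per step (hypothetical sketch).

    weights, weight_changes: lists of rows (k indexes rows, j columns);
    both stand in for digital RAM contents and are modified in place.
    delta: weight errors delta_k; z: layer inputs z_j.
    """
    for k in range(len(weights)):
        for j in range(len(weights[k])):
            # Gradient step plus momentum term read from the change RAM.
            change = eta * delta[k] * z[j] + alpha * weight_changes[k][j]
            weight_changes[k][j] = change   # write back to the change RAM
            weights[k][j] += change         # write back to the weight RAM
    return weights, weight_changes


# Usage: one neuron with two inputs, updated twice with the same signals.
w = [[0.0, 0.0]]
dw = [[0.0, 0.0]]
serial_momentum_update(w, dw, delta=[1.0], z=[1.0, 2.0], eta=0.5, alpha=0.5)
serial_momentum_update(w, dw, delta=[1.0], z=[1.0, 2.0], eta=0.5, alpha=0.5)
```

Only the inner statement needs arithmetic hardware; the per-synapse state lives entirely in memory, which is the point made above about the cost of momentum.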
 Hardware efficient approach
The architecture in figure   has two major drawbacks: (i) For a given silicon
area, the number of synapses is reduced compared to the number of synapses in a
recall mode system, as three multipliers are used instead of one. Also, in
recall mode most of the synapse hardware lies idle, which is of course
undesirable. (ii) The number of wires between the synapse and neuron chips is
doubled compared to a recall mode system. Both disadvantages can severely
restrict the applicability of the adaptive neural network if its physical size
is of importance. Fortunately, it is possible to overcome these disadvantages.
Figure  : MRC operated in forward mode. This is the standard mode of operation,
as shown in chapter  .

Figure  : MRC operated in reverse mode. Because of the circuit symmetry, no
extra synapse hardware is required for alternating forward/reverse mode
operation.
Studying the synapse multiplier used in section   (which is repeated in figure
  for convenience), we notice that it is perfectly symmetric. Thus, if we apply
a differential voltage $v_\delta$ at the former output nodes and ensure a
virtual short circuit between the former input nodes, the difference current
$i_w$ injected into the former input nodes would be

$$ i_w = i_{w+} - i_{w-} \propto V_w v_\delta , \qquad i_w = \frac{1}{r_{MRC}} v_\delta $$

which is illustrated in figure  . This implies that we can enable our original
matrix-vector multiplier (cf. section  ) to perform multiplication with the
transposed matrix, i.e. to calculate the neuron errors as $(w^l)^T$ times the
weight errors, simply by exchanging the output current conveyors and the input
buffers. See figure  . When several modified matrix-vector multipliers are
cascaded, it is still valid that multiplication with the transposed matrix is
performed when inputs and outputs are exchanged.
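
The exchange of inputs and outputs can be illustrated numerically. The sketch below is only a software analogy, with made-up matrices and hypothetical function names: `forward` multiplies by a weight matrix, `reverse` uses the same stored matrix but sums along the other index, which amounts to multiplication with the transpose. Cascading two stages in reverse order then gives the transpose of the cascaded forward product.

```python
def forward(w, z):
    """Forward mode: s_k = sum_j w[k][j] * z[j]."""
    return [sum(w_kj * z_j for w_kj, z_j in zip(row, z)) for row in w]


def reverse(w, delta):
    """Reverse mode: out_j = sum_k w[k][j] * delta[k], i.e. w^T times delta."""
    n_cols = len(w[0])
    return [sum(w[k][j] * delta[k] for k in range(len(w))) for j in range(n_cols)]


# Two cascaded stages (illustrative values only).
w1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 outputs, 2 inputs
w2 = [[1.0, 0.0, -1.0]]                     # 1 output, 3 inputs

s = forward(w2, forward(w1, [1.0, -1.0]))   # signal flows w1 then w2
e = reverse(w1, reverse(w2, [2.0]))         # ports exchanged: w2 then w1
```

No second copy of the weights is needed for `reverse`, which mirrors the observation that the same synapse multipliers serve both directions.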
If the weight updating hardware is placed at the neuron sites, we can thus
implement the backpropagation learning algorithm without any extra hardware at
the synapse sites, i.e. with a hardware cost of a mere order O(N). Routing
$z^l_j(t)$ to the neuron module is possible using the s wires and a few (O(N))
switches. Thus no significant extra signal routing is needed either†. (It
should be noted that time multiplexing the z and s wires in this way obviously
requires a discrete time system. If a digital weight backup memory is used, the
learning algorithm is required to run in discrete time anyway; otherwise this
might restrict the applicability of the scheme.) The neurons that correspond to
this modified synapse chip would look quite like the one in figure  ; only the
output would have to be sampled (cf. section  ).
The implementation of such a low hardware cost analogue neural network with
on-chip backpropagation can be found in the next section. The principal
operation of the backpropagation system is illustrated in figure  . During
normal operation all chips are in forward mode, and the response to an input
pattern is propagated to the output after a certain delay. When a synapse
weight $w^l_{kj}$ in a layer l is to be updated, all chips in the previous
layers operate in forward mode and produce the $z^l_i$s. All the chips in the
layers following layer l operate in reverse mode and produce the corresponding
errors for the neurons k of layer l. The synapse chips in layer l operate in
route mode and route $z^l_j$ to the inputs of the neuron chips in layer l. These
in turn operate in learn mode and calculate $w^l_{kj}(t+1)$. It will be noticed
that the newly updated weights are used when backpropagating the errors in
reverse mode. This does not exactly comply with the learning algorithm, though
for small learning rates the difference should be indistinguishable. Actually
one would expect faster learning than when using synchronous weight update;
this is equivalent to the Gauss-Seidel method of solving linear equations
numerically (Press et al.).
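
The mode assignment described above can be written down as a small lookup. This is a hypothetical sketch for illustration: the function name, the layer numbering from 1 and the returned dictionary layout are assumptions; only the four mode names and their assignment to layers come from the text.

```python
def chip_modes(num_layers, update_layer):
    """Return {layer: (synapse_chip_mode, neuron_chip_mode)} while the
    weights of `update_layer` are being updated (layers numbered from 1)."""
    if not 1 <= update_layer <= num_layers:
        raise ValueError("update_layer out of range")
    modes = {}
    for layer in range(1, num_layers + 1):
        if layer < update_layer:
            modes[layer] = ("forward", "forward")   # produce the z signals
        elif layer == update_layer:
            modes[layer] = ("route", "learn")       # route z, compute new weights
        else:
            modes[layer] = ("reverse", "reverse")   # backpropagate the errors
    return modes


# Usage: three layer network, middle layer being updated.
middle = chip_modes(3, 2)
```

During normal recall no layer is selected for updating and every chip simply stays in forward mode.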
Figure  : Backpropagation system. The different operation modes of the two
backpropagation chip sets are shown for a three layer network when the middle
layer is updated.
y Ie only an order O	 control lines etc This is assuming a O	 weight
updating module if a ON	 weight updating module is used this should be placed
at the synapse chip one segment per row	 rather than on the neuron chip to avoid
interchip routing of the #w
l
j
s or w
l
j
s
Exploiting the bidirectional properties of the MRC synapse multiplier is also
possible when placing the weight updating hardware at the synapse sites for
maximum speed improvement. According to the weight updating rule ( ), neither
$s^l_k$ nor $\varepsilon^l_j$ is needed when updating the $w^l_{kj}$s. Thus the
$z^l_j$s and the $\delta^l_k$s can be distributed simultaneously on the synapse
chip while ignoring the synapse multipliers; transconductance multipliers
placed at the synapse sites will then have access to the appropriate signals
for calculating the $\Delta w^l$s simultaneously for one layer. In other words,
at the expense of using discrete time, it is possible to eliminate the extra
inter-chip connections and part of the synapse learning components compared to
a straightforward fully parallel implementation of the backpropagation
algorithm on top of a VLSI MLP.

As we shall see in section  , it is possible also to reduce the additional
neuron hardware.
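
As a software analogy of the fully parallel variant: once the layer inputs and the weight errors are distributed along the columns and rows, every synapse site can form its own weight change from the two signals passing by. The sketch below assumes the usual backpropagation form of the rule (weight change = learning rate times weight error times input); the function name and the learning rate `eta` are illustrative assumptions, not taken from the thesis.

```python
def parallel_weight_changes(delta, z, eta=0.1):
    """Weight change at every synapse site (k, j) of one layer.

    delta: weight errors, one per row k; z: layer inputs, one per column j.
    Each matrix entry depends only on its own row and column signal, so in
    hardware all entries could be formed at the same time.
    """
    return [[eta * d_k * z_j for z_j in z] for d_k in delta]


# Usage: two neurons, two inputs (illustrative values).
changes = parallel_weight_changes([1.0, -2.0], [0.5, 1.0], eta=0.5)
```

Neither the forward nor the reverse multiplier output appears in the expression, which is why the synapse multipliers can be ignored while the changes are formed.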
An additional advantage of the bidirectional usage of the MRC is process
variation insensitivity: the very same transistors are used for the synapse
multiplication in recall mode as well as in backpropagation mode. Assuming the
input reference
voltage is close to the output voltages, M  in figures   and   will never
conduct any current. In forward mode the differential output current is
i_D  - i_D , and in reverse mode it is i_D  - i_D . Thus, for matching the
forward and reverse currents, it is necessary only to match M  and M . As M
ideally does not conduct current,
it could be removed; this, however, requires a very low neuron input impedance
(cf. Flower and Jabri).
 Chip design
The basic idea of the ANN chip set with on-chip backpropagation developed at our
institute is the bidirectional properties of the MRC. In this section we shall
describe this chip set. A description of the chip set was published in Lehmann
( ).

As noted in chapter  , it is our intention to add learning hardware to a
functioning recall mode ANN, i.e. the one described in that chapter. Thus we
shall use capacitive storage with a digital RAM backup memory, which prevents us
from using a parallel weight updating scheme. As the additional hardware cost
for the serial weight updating scheme is very small, we can replace the original
ANN chip set with one containing on-chip backpropagation; the learning algorithm
can be disregarded by the user if so wished. As in the original ANN chip set, we
shall use a hyperbolic tangent neuron activation function.

As was the case for the implementation of the original ANN, the backpropagation
chip set was designed to test the functionality of the hardware efficient
approach. Also, as much layout as possible from the first design was reused (or
corrected/improved and reused). Design details can be found in appendix D.
 The synapse chip
The computing elements of the backpropagation synapse chip in forward mode can
be seen in figure  ; this is identical to the computing elements of the second
generation synapse chip. Writing on the synapse storage capacitors is done in
the same way as on the first generation synapse chip: precharged row and column
selectors and NAND gates at the synapse sites determine on which synapse the
globally distributed weight voltage is to be written. The synapse schematic is a
little different from the original one: no explicit storage capacitors are
used; the synapse multiplier gate-channel capacitances act as memory.
Figure  : Second generation synapse chip. For reverse mode (backpropagation)
operation, reverse the signal flow and exchange the buffers and current
conveyors.
If the current conveyors are implemented as supply current sensed op-amps
(Toumazou et al.), these op-amps can be used as voltage buffers in reverse mode.
This way the components needed for each row/column on the chip are two op-amps,
two current mirrors and   switch transistors (plus row/column decoders, synapses
etc.), or basically an increase of one op-amp per row and column for the
backpropagation implementation. For reasonably sized switches, the voltage drop
across these when dozens of synapses source current through them is
non-negligible (say   mV). In order to ensure proper input voltage buffering and
virtual synapse output short circuits, it is necessary to put the switches
inside high gain loops or to ensure zero or matched switch transistor currents.
In the latter case the transistors need to be matched also†. The switch
transistor placement for the row and column elements can be found in appendix D
(notice that the two elements are identical; only the control signals are
permuted).
As no neuron errors are to be computed for the input layer, the input layer
synapse chips need not be able to run in reverse mode. For this reason the
hardware efficient architecture is compatible with the sparse input synapse chip
mentioned in section  . Only the sparse input synapse chip functionality must be
extended to include routing of inputs to outputs.
 The neuron chip
As stated in section  , the output voltage of the second generation synapse chip
should be close to the reference voltage $V_{ref}$ to avoid DC common mode
currents in the synapses. As the neuron output is referred to this voltage, it
is necessary to separate the bipolar pair and the output range MRC with current
mirrors to ensure sufficient emitter-collector voltage of the bipolar pair.
Otherwise the principal neuron schematic is unaltered. The schematic of a second
generation hyperbolic tangent neuron is shown in figure  . The extra current
mirrors will inevitably cause an increased neuron output offset.
Figure  : Second generation hyperbolic tangent neuron. Simplified schematic.
The resistors are implemented as MRCs, as in the first generation neuron.
Extending the second generation neuron chip for on-chip backpropagation, we must
enable the neurons to compute the weight errors $\delta^l_k$. Though a serial
(O(1)) weight updating scheme was selected, we have chosen to place some of the
weight
† Unfortunately this was overlooked at design time, causing the synapse weight
offsets to be larger than necessary and different in forward and reverse modes.
updating hardware on the neuron chip. Strictly speaking this is unnecessary for
all but one of the neuron chips in a system. However, it is convenient to place
the $\delta^l_k \cdot z^l_j$ multiplier near the physical location of the
$\delta^l_k$ and $z^l_j$ signals. Also, if the backup RAM is organized in
parallel accessible banks (as the p RAM in figure  ), this semi-parallelism can
be exploited. A block diagram of the backpropagation neuron is seen in figure  .
For a more detailed circuit schematic refer to appendix D. It is crucial for the
functionality of the learning algorithm that the offset errors on the weight
change signals are very small (cf. section   and the following). For this reason
hardware for offset compensating the weight change signal is indispensable. As
the current system uses serial weight updating hardware, and as the new weights
ultimately have to be written in the digital backup memory, there is no strong
motivation to keep the weight change calculating hardware analogue†. Indeed,
using digital hardware in most of the O(1) part of the system, we can reduce the
problem with weight change offset (cf. section  ).
Figure  : Backpropagation neuron. Block diagram. The positions of the switches
in forward, reverse and learn mode are indicated. The elements below the dashed
line are the weight updating hardware, which is common for all neurons.
As we use a hyperbolic tangent neuron activation, the derivative of this is
calculated as $g'_{calc} = 1 - g^2$. For this operation a two-dimensional inner
product multiplier based on MRCs is used (topologically identical to the IPM of
figure  ,
† Recall that motivations for using analogue hardware for ANNs in the first
place included the size advantage of massively parallel systems and the power
advantage of (especially) small systems. For O(1)-scaling hardware neither of
these is very important. Thus we should not maintain a purely analogue system at
any cost. Of course data converters are needed when digital circuitry is
included; if the expense of these is too large (e.g. in small low power
systems), a fully analogue system is the solution.

p.  , without the output transconductor). This is a very versatile component,
which is the core of most computing circuitry in this thesis. It has the power
to do multiplication, addition, subtraction and division. Using three identical
MRCs, the output is (independently of process parameters) given by

$$ v_{out} = v_{s_k} = \frac{(v_{z_1+} - v_{z_1-})(v_{w_{k1}+} - v_{w_{k1}-}) + (v_{z_2+} - v_{z_2-})(v_{w_{k2}+} - v_{w_{k2}-})}{v_{C+} - v_{C-}} $$

where $v_{z_1}$, $v_{w_{k1}}$, $v_{z_2}$, $v_{w_{k2}}$ and $v_C$ are
(differential) inputs (cf. figure  ). This is easily configured to compute the
desired function (level shifters will have to be inserted at the $v_{w_k}$ and
$v_C$ inputs to ensure the transistors operate in the triode region).
The weight errors $\delta^l_k$ are computed using a one-dimensional MRC IPM. As
the inputs to the synapse chip are buffered, the neuron chip does not have to
have a very low output impedance (neither in forward nor reverse mode). Thus the
switches that redirect the signal flow in the various operating modes can be
simple MOS transistors (a reasonably sized transistor (cf. appendix D) in the
technology used can drive a   pF load to   bit accuracy in   ns; for larger
capacitive loads feedback can be used to lower the switch impedance, cf. the
synapse chip row/column elements). When operating in reverse or learn mode, the
neuron net input $s^l_k$ is unavailable. Thus the neuron activation has to be
sampled in forward mode in order to provide data for the calculation of the
weight error.
The synapse chip provides the neuron error $\varepsilon^l_k$ as a current. As
the $g'(s^l_k)\,\varepsilon^l_k$ multiplier needs voltage inputs, we need a
transimpedance at the neuron error input. This is implemented using an MRC (plus
op-amp), as is the neuron input scale transimpedance (see figure  ). The neuron
errors ( ) are computed differently for the output layer than for the preceding
layers. We can accommodate this by adding a second MRC to the neuron error
transimpedance (i.e. transforming this into a one-dimensional IPM) and
activating this MRC at the output layer only (see figure D ). When acting as an
IPM, the circuit can take the difference of an applied input voltage and the
on-chip neuron activation. In other words, at the output layer one shall now
provide a target value as a voltage rather than a neuron error as a current.
 Chip measurements
A   neuron backpropagation neuron chip and an   synapse backpropagation synapse
chip have been fabricated in a standard   µm CMOS process. In this section we
shall present measurements on these chips (first published in Lehmann and Bruun
and in Lehmann). A table of the most important chip characteristics can be found
in appendix D.
 The synapse chip
The measured forward mode synapse transfer characteristics for a single synapse
can be seen in figure  ; the reverse mode characteristics for the same synapse
can be seen in figure  . The non-linearity is less than   %, or approximately
equal to the first generation synapse non-linearity, as would be expected. The
chip output offset current (in both forward and reverse mode) can be quite
large: of a magnitude comparable to the maximal synapse output current.
Responsible for this output offset is the CCII+ row current differencer
(presumably the x-z current buffer). According to appendix C, we should expect a
large output offset when using current differencing by rows (rather than by
synapse). As the output current conveyor is designed to sink current from a
large ( ) number of synapses (and as it is this that gives the output offset),
the chip can be scaled (i.e. synapses added) without significant increase in
output offset current. Comparing this second generation synapse chip output
offset with that of the first generation chip, we notice that it has been
reduced by a factor   because of the simpler current differencing circuit.
Still, with the reduced synapse maximum output current (compared to the first
generation chip), it is necessary to dedicate a synapse per row for offset
canceling (which would be an extra bias synapse) in order not to exhaust the
neuron bias synapse. For operation in reverse mode, the offset errors are
probably larger than tolerable by the backpropagation algorithm without offset
canceling being employed (cf. section  , section  ).
Figure  : Forward mode synapse characteristics. Measured transfer functions for
a single backpropagation synapse at different synapse strengths.

Figure  : Reverse mode synapse characteristics. Measurements on the same
synapse in reverse mode. The output offset errors have been almost canceled.
As it was for the first synapse chip, the weight resolution is at least   bit,
which should be sufficient for a range of applications. The effective weight
offset, however, is somewhat higher and strongly correlated with the
differencers to which the synapses are connected (cf. figures   and  ). The
offsets are caused by the inability of the synapse row/column elements to ensure
a virtual short circuit at the synapse output (presumably caused by mismatch of
the reconfiguring switch transistors). If the systematic offset is canceled, the
offset is below that of the first generation synapse chip. This non-ideality of
the row/column differencers causes the synapses to have different offsets in
forward mode and reverse mode. Though undesirable, the magnitude of the offset
is most probably tolerable by the learning scheme (the offset error corresponds
to a   bit recall mode weight accuracy).
Figure  : Forward mode weight offsets. Sample chip. Notice the correlation
among synapses along a row, which is emphasized by the contour plot (dashed
lines). Axes: SC row, SC column, weight offset / LSB_8.

Figure  : Reverse mode weight offsets. Sample chip. Notice the correlation
among synapses along a column (dashed lines: contour plot). Axes: SC row, SC
column, weight offset / LSB_8.
 The neuron chip
The measured forward mode neuron transfer characteristics can be seen in figures
  and  ; the increased output offset compared to the first generation neuron
chip is noticed. The non-linearity is less than   % (cf. figure  ), and the
offset errors are of little importance to the recall mode performance (as for
the first generation chip set). The increased output offset compared to the
first generation chip is caused by the additional current mirrors (cf. above).
Figure   shows the electronically computed neuron derivative as calculated by
the neuron chip as $dy/ds = 1 - y^2$ (cf. figure  ). The asymmetry is caused by
the output offset of the neuron transfer function and the input offset of the
derivative computing $1 - y^2$ block. Note, however, that the neuron input
offset does not affect the accuracy of the computed derivative, as would have
been the case if the derivative was computed on the basis of the neuron input.
The non-linearity of the derivative computing block is less than   % (cf. figure
 ). The total non-linearity of the computed derivative is less than   %. The
offset errors related to the derivative computation are quite large, however,
causing the computed derivative to be negative in particularly poor specimens of
the neuron, which is destructive for the learning process. By including a
derivative perturbation in the weight error calculation (as mentioned above), we
hope that the offsets can be tolerated by the learning scheme. This has yet to
be experimentally proven, of course. Note that several authors have employed
very coarse derivative approximations and still observed learning progress (e.g.
Valle et al.; see also Shima et al.; Cho et al.). The derivative perturbation is
easily inserted by substituting the 1 by a worst case $y^2_{max}$ when
calculating the derivative as $y^2_{max} - y^2$. This procedure is the same as
(one of several) used on the first generation chip set; the neuron output
variation is larger for the second generation chip, though, which will most
likely degrade the performance.
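
The effect of the derivative perturbation can be sketched numerically. The model below is an illustrative assumption, not a description of the chip: the function name is hypothetical, the neuron output `y` is normalised to a nominal range of -1 to 1, and the additive `offset` is a crude stand-in for the offset errors of the derivative computing block.

```python
def derivative_from_output(y, y_max=1.0, offset=0.0):
    """Derivative estimate computed from the neuron output y.

    With y_max = 1 this is the plain 1 - y^2 of a tanh neuron; a worst case
    y_max > 1 acts as the derivative perturbation. `offset` models an
    additive circuit error (hypothetical).
    """
    return y_max ** 2 - y ** 2 + offset


# A saturated neuron with a negative offset gives a negative derivative...
plain = derivative_from_output(1.0, offset=-0.1)
# ...which a sufficiently large worst case y_max keeps positive.
perturbed = derivative_from_output(1.0, y_max=1.2, offset=-0.1)
```

The price of the perturbation is that the computed derivative no longer vanishes for saturated neurons, so the value chosen for the worst case output is a trade-off.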
Figure  : Forward mode neuron characteristics. Measured transfer
characteristics for different input scale voltages V_IS.

Figure  : Computed neuron derivative. Measured derivative as computed by the
chip. The asymmetry is caused by the neuron output offset.

Figure  : Different neuron transfer functions. Four measured (staircase) and
fitted (smooth) tanh functions (almost indistinguishable). Axes: neuron net
input i_s / µA, neuron output v_y / V.

Figure  : Different neuron non-linearities. Measured/fitted tanh relative
difference. Axes: neuron net input i_s / µA, non-linearity D_y / %.
The measurements on the chip set indicate that, with the inclusion of offset
canceling on certain signals, the chip set should be able to function as the
core of a discrete time analogue backpropagation neural network; the chip set
functions as predicted. Variations on transconductances, gains, offset errors
(hopefully) etc. should be canceled during the learning process.

Figure  : Different parabola transfer functions. Four measured (staircase) and
fitted (smooth) derivative computing parabolas. Axes: neuron output v_y / V,
derivative v_{1-y^2} / V.

Figure  : Different parabola non-linearities. Measured/fitted parabola relative
difference. Axes: neuron output v_y / V, non-linearity D_{1-y^2} / %.
Figure  : Neuron sampler droop rate. Measured droop rate for two different
starting points (  V). Notice the almost linear decay caused by mismatched diode
reverse currents. The sign and magnitude of the droop rate will vary from neuron
to neuron. The synapse strength droop rate has a similar appearance. Axes: time
t (0 s to 400 s), neuron output v_y (-1 V to 1 V).
The synapse and neuron chip propagation delays give a layer propagation delay of
$t_{lpd}$   s, which gives a recall mode speed of   MCPS per synapse chip in a
layer (or   GCPS if a full-size     synapse chip is used). Note that the
reconfiguring switches at the neuron output do not degrade the performance
(compared to the first generation chip) at the current capacitive load. The
neuron chip takes $t_{xwpd}$   s to calculate a new weight, giving a learning
speed of approximately   MCUPS for the backpropagation system. Given the droop
rate of the neuron activation sampler (cf. figure  ), this will limit the system
size to about     connections for an   bit accuracy of the neuron activations.
(Applications using     connections are known, cf. chapter  ; Brunak et al.)
Should this be a problem, digital refresh of the sampled neuron activation can
be employed (the neuron sampler actually has an extra input for this very
purpose)†.
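
The relation between droop rate, weight update time and system size can be made explicit with a back-of-the-envelope sketch. Everything here is an assumption for illustration: the function and parameter names are hypothetical, the criterion that a sampled activation may droop by at most one least significant bit while all weights are updated serially is a plausible reading rather than the stated procedure, and the numbers in the usage line are not the measured chip values.

```python
def max_connections(droop_rate, full_scale, bits, t_update):
    """Rough upper bound on the number of serially updated connections.

    droop_rate: sampler droop in V/s (sign ignored)
    full_scale: activation voltage range in V
    bits:       required accuracy of the held activation
    t_update:   time to calculate one new weight in s
    """
    lsb = full_scale / (2 ** bits)        # tolerated voltage error
    hold_time = lsb / abs(droop_rate)     # time until the error reaches one LSB
    return int(hold_time / t_update)      # weights that fit in the hold time


# Illustrative numbers only.
limit = max_connections(droop_rate=0.5, full_scale=2.0, bits=2, t_update=0.25)
```

A digital refresh of the sampled activation restarts the hold time, which removes this bound at the cost of extra conversions.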
 Improving the derivative computation
Apart from the synapse chip output offsets and weight change offsets, which can
be canceled by an auto offset canceling scheme, the most concerning problem of
the chip set is the calculation of the neuron derivative. This has also been the
concern of other authors (Jabri and Flower; Hollis et al. and others); some
authors take advantage of the fact that the derivative need not be computed very
accurately for learning to proceed and use very coarse derivative approximations
(Valle et al.; Woodburn et al.). To ensure a non-negative result of the
computation, we could use a derivative perturbation as mentioned above (also
employed by Shima et al.). Alternatively (or in addition), we could clamp the
output to zero whenever this was negative. Using a two-dimensional CCII+ based
IPM (as the second generation synapse chip IPM) to calculate $1 - y^2$, the
output would be a current. Mirroring this current using a simple current mirror,
we would ensure that the output was never negative (some additional hardware
would be needed to ensure a reasonable input impedance, speed etc.). It is also
possible (though not necessarily without a considerable amount of additional
hardware) to employ auto offset canceling techniques to cancel the offsets.
Other more elegant approaches to come around the oset related problems
of the derivative calculation are also possible for the backpropagation chip set
It was stated in section  that we would most probably not have access to
the neuron net inputs when calculating the neuron derivatives for the learning
algorithm Integrating the learning algorithm onchip  rather than as an addon
module  we actually do have access to the neuron net input This implies that we
can choose our neuron transfer function more freely  as long as the derivative is
computable Bogason  propose to approximate the derivative by a di
erential
quotient

i
y
v
s
	

v
s

i
y
v
s
!#V 	 i
y
v
s
#V 	
#V

where #V is a small voltage A block diagram for such a circuit based on a
general di
erential voltage in di
erential current out neuron is shown in gure

 
 The output current is the neuron output when the switches are open and
#V " 
	 and the derivative approximation when the switches are closed
The neuron could for instance be a MOST dierential pair In this case
the magnitude of a possible negative output when calculating the derivative is
determined by the matching of the two tail current sources A &mismatch of these
is realistic a small derivative perturbation could be added to ensure the derivative
is calculated strictly larger than 
 of course	 Using a MOST dierential pair
to implement the neuron transfer function is also an advantage in terms of speed
compared to the hyperbolic tangent neuron The accuracy of the dierence quotient
† In hardware efficient systems using parallel weight update (O(N) or O(N^2)),
the neuron activation deterioration would be less problematic, of course.
Figure  : Differential quotient derivative approximation. Two arbitrary
differential voltage in, differential current out transfer function blocks can
be used for approximating the derivative (switches closed) of the transfer
function (when the switches are open).
approach can be brought within   % using a reasonably large $\Delta V$. Instead
of using two neuron transfer blocks to calculate the derivative, a single block
combined with a switched capacitor (much like that we shall see in section  )
can be used. This reduces the transistor count.
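
The difference quotient can be checked numerically against a known derivative. The sketch below is a software illustration under stated assumptions: a tanh function stands in for the neuron transfer block (the text suggests a MOST differential pair instead), the function names are hypothetical, and the step `delta_v` is chosen arbitrarily.

```python
import math


def neuron(v_s):
    """Stand-in neuron transfer function (assumption: tanh)."""
    return math.tanh(v_s)


def difference_quotient(f, v_s, delta_v):
    """Symmetric difference quotient of f around v_s with step delta_v."""
    return (f(v_s + delta_v) - f(v_s - delta_v)) / (2.0 * delta_v)


def exact_derivative(v_s):
    """Analytical derivative of tanh, for comparison."""
    return 1.0 - math.tanh(v_s) ** 2


# Compare approximation and exact derivative at a few net inputs.
errors = [abs(difference_quotient(neuron, v, 0.1) - exact_derivative(v))
          for v in (-2.0, -0.5, 0.0, 0.5, 2.0)]
```

For an ideal monotonically increasing transfer function the quotient cannot become negative, which is the property that makes this approach attractive; in a circuit the remaining risk comes from mismatch between the two evaluations.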
Another approximating circuit for calculating the derivative, which inherently
has positive output and which uses even fewer transistors, has also been
reported lately (Annema).
The apparent difficulties in computing the activation function derivatives
needed by gradient descent have inspired some authors to do without the
derivative information, for instance by substituting a constant† for the
derivative or by using completely different optimization techniques. See e.g.
Battiti and Tecchiolli; also Krogh et al. We shall not elaborate on such
solutions, as one of the objectives of this work is to approximate standard
algorithms with known properties.
† Assuming a monotonic activation function. (Knowledge of the sign of the
derivative should be sufficient for convergence using a decreasing learning
rate; compare to stochastic approximation, Gelb et al.)
 System design
Most of the hardware for an ANN with on-chip backpropagation is included on the
backpropagation chip set. For a complete system, though, some additional
hardware is needed. This is:

• A digital weight backup memory.
• Most of the order O(1) scaling hardware, i.e. D/A and A/D converters for
  accessing the backup memory and some of the weight updating hardware.

Also including:

• A finite automaton to control the system (weight refresh, applying inputs,
  controlling the learning scheme etc.).
• An environment in which to place the ANN.

In this section we shall describe such a complete system.
For ease of test we embed the ANN in a digital PC interface, and we use the PC
as the master finite automaton (or finite state machine, FSM). This solution
does not allow us to test the speed of the system: the PC (AT/ISA) bus, together
with the necessary A/D and D/A converters, is the bottleneck of the system in
terms of speed. Neither the recall mode nor the learning mode speed performance
can be tested. However, there is no reason that the system should not run at the
speed indicated by the measurements on the individual chips. The circuit level
system performance, on the other hand, is most easily tested using a high degree
of programmability; thus we use this artificial PC environment. (Also, the
design time was rather important.) For a real application the environment would
be in the electrical analogue domain, and the finite automaton, which is quite
simple, would be a digital ASIC, possibly including the D/A and A/D converters.
The system designed is actually an RTRL/backpropagation learning ANN hybrid. The
RTRL system (cf. chapter  ) shall be based on the second generation
(backpropagation) ANN chip set. Thus backpropagation can be included in this
system almost for free, eliminating the cost of the PC interface for a separate
backpropagation system. As a consequence, the backpropagation ANN architecture
is determined by the RTRL ANN architecture and not designed for a specific
application. For test this is acceptable. The complete system schematic can be
found in enclosure III (see also appendix D).
Scaled backpropagation synapse chip. In backpropagation mode the system realizes a two layer (( perceptron. As this architecture would require  synapse chips, it was necessary to have a scaled backpropagation synapse chip ( ) fabricated. Unfortunately, on that particular MPC run the process parameters were outside the specified ranges, the n-channel process transconductance parameter, for instance, being as low as $K'_N$ =  µA/V$^2$ (compared to a nominal value of $K'_N$ =  µA/V$^2$). Presumably the threshold voltages were numerically large, which would explain the reduced dynamic range of components tested in previous MPC runs. The impacts on the chip are primarily reduced input range and large systematic offset errors (see appendix D ). Using a raised reference voltage and external current offset compensation circuits, it is our hope that we can make the system work in spite of the poor quality chips.
 ASIC interconnection
Using the scaled   synapse chips, the synapse chip count is reduced to three, interconnected (when the system operates in backpropagation mode) as shown in figure   (the synapse chips are drawn as having the architecture of figure  for convenience). Four input lines and four output lines in each matrix of synapses are driven by DC voltages, reserving four rows and four columns in each layer for neuron thresholds (forward mode only) and offset compensation, as indicated in the figure. (Strictly speaking, reverse mode offset compensation is unnecessary for the input layer.) Prior to learning, the matrix-vector multiplier outputs are measured (in both forward and reverse mode) and the offset compensation synapse strengths are adjusted to minimize the offsets (i.e. on the $s^l_k$s and the $\epsilon^l_k$s).
Figure : Backpropagation ANN architecture. ANN chip interconnections when the system operates in backpropagation mode. Blocks SC and NC are synapse and neuron chips, respectively. Input/output lines of the SC blocks are accessible at both top and bottom (left and right in this figure). The drawing shows three SC16 blocks with their offset/threshold synapses and bias lines.
As the synapse strength backup memory we use a  bit RAM. Several authors have addressed the problem of weight discretization in ANNs (Hollis et al. , Tarassenko , Lehmann , Lansner , Brunak and Hansen ; see section ). A  bit resolution for a learning system seems to be in compliance with most of the reported simulations. Subsequent to learning, the necessary weight resolution is lower, because it is usually the relative weight magnitudes rather than the exact weight values that determine the system behavior; during learning, however, small weight changes need to be accumulated. When refreshing the synapse strengths on the synapse chip from the RAM, we use a  bit DAC. The actual ANN weight discretization is thus reduced to  bit; however, we can still accumulate weight changes as small as when using a  bit discretization. Actually, the ANN performance when using this scheme will be superior to that of a system where the weight discretization is reduced from  bit to  bit after learning: we train on the actual final network rather than on an intermediate network with artificially high weight resolution (see also Lansner ).
 Weight updating hardware
Oset values and weight resolutions typically found in analogue VLSI systems
restrict the learning rate to a range somewhat higher than what is often employed
in software simulations cf sections  and  Tarassenko and Tombs 
Krogh et al  Hansen and Salamon  Hertz et al   Weigend et al
 see alsoWilliams and Zipser 	 As we are using an O	 weight updating
scheme and as we use high precision weight backup memorywe can reduce the
inuence of weight change osets and weight change discretization and therefore
reduce the minimum learning rate by adding the weight change to the old weight
in the digital domain Using a  bity ADC to convert the analogue weight change
signal to digital form and adding this padded with zeroes at the  MSBs	 to the
 bit weight we scale the eective weight change oset by a factor 
 
 This is
illustrated in gure 
 
 The actual schematic is slightly more complex as bipolar
weights and weight changes needs to be handled and as overow in the digital
adder must be prevented Also a multiplexor is included to allow test of the
onchip analogue weight updating hardware cf enclosure III	
Figure : Digital weight updating hardware. Principle. The digital part uses higher precision than offered by the data converters, thereby reducing the effective learning rate. The drawing shows the weight backup RAM, the ADC converting $\Delta w_{kj}(t)$ (zero padded at the MSBs), the digital adder forming $w_{kj}(t+1)$, and the DAC refreshing $w_{kj}(t)$.
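The principle of the digital update can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word lengths (16 bit store, 8 bit ADC, 12 bit DAC) and the function names `update_weight` and `refresh_code` are hypothetical and are not the values or names of the actual system.

```python
# Illustrative sketch only: word lengths and names are assumed, not taken from the system.
B_STORE = 16   # assumed weight backup RAM width (bits, two's complement)
B_ADC = 8      # assumed weight change ADC width (bits, two's complement)
B_DAC = 12     # assumed refresh DAC width (bits)

W_MAX = 2 ** (B_STORE - 1) - 1
W_MIN = -2 ** (B_STORE - 1)


def update_weight(w_stored: int, dw_adc: int) -> int:
    """Add the ADC code to the stored weight at the LSB end, saturating on overflow.

    Padding the ADC word at its MSBs means that one ADC LSB equals one LSB
    of the high precision store.
    """
    assert -2 ** (B_ADC - 1) <= dw_adc < 2 ** (B_ADC - 1)
    return max(W_MIN, min(W_MAX, w_stored + dw_adc))


def refresh_code(w_stored: int) -> int:
    """Code sent to the DAC: only the B_DAC most significant bits of the stored weight."""
    # >> on Python ints is an arithmetic shift, so the sign is preserved.
    return w_stored >> (B_STORE - B_DAC)


if __name__ == "__main__":
    w = 0
    for _ in range(20):              # twenty small updates of one ADC LSB each
        w = update_weight(w, 1)
    print(w, refresh_code(w))        # stored weight and the coarser DAC code
    print(update_weight(W_MAX, 5))   # saturates instead of wrapping around
```

With these assumed widths, one ADC LSB corresponds to 2^-4 of a DAC LSB, so small weight changes are accumulated in the store long before they become visible at the refreshed synapse.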
Studying the backpropagation neuron block diagram in gure 
 
 we see that
the #w
kj
signal is not directly accessible applying the w
kj
t	 and 
dec
signals on
the neuron chips as zeros gives the desired weight change at the output though
inevitable with a larger oset than that of the internal #w
kj
	 As the synapse chip
outputs the #w
kj
output is oset compensated prior to learning by applying zero
inputs in learn mode and adjusting the w
kj
t	 signal such that the output is also
zero	
In addition to the small hardware cost, the advantage of using a serial weight updating scheme is this possibility to employ an advanced, accurate (i.e. hardware hungry) weight updating scheme without an extreme hardware cost‡. Learning algorithms that are inherently using serial weight updating (e.g. weight perturbation) should take advantage of this, for instance by using the weight updating scheme above. (The above updating scheme requires a weight storage with digital weight backup, of course; for large systems this is also a good choice when using a serial weight updating scheme.)
† If the weight change offset is large, we should use a coarser discretization; preferably the weight change offset should be below  LSB.
‡ All other things being equal, adding additional hardware in the analogue domain for inclusion of a more advanced updating scheme will increase weight updating errors. As reducing these is a primary concern in a learning scheme implementation, this advocates using a simple weight updating scheme in the analogue domain. Thus it could be argued that presently the only realistic way to implement advanced updating schemes is to use an order O( ) updating scheme in the digital domain.
The backpropagation system is, at the time of writing, under construction. Thus, unfortunately, we cannot present any system level experiments. Hopefully such will be available in the near future.
 Nonlinear backpropagation
As we have now seen, one of the major concerns when implementing gradient descent like learning algorithms in hardware is the computation of the neuron derivatives. Many different approaches to approximate the derivative have been proposed in the literature: difference quotients (locally or globally computed) or other approximating approaches, perturbations for reducing offset related errors, as well as implementations largely ignoring the derivative.
These implementation related difficulties recently motivated the development of a new gradient descent like algorithm, nonlinear backpropagation (NLBP), in which the derivative computation is avoided (Hertz et al. ). In this section we shall display the algorithm and show how to incorporate it in an existing backpropagation architecture.
 Derivation of the algorithm
The derivation of nonlinear backpropagation in the framework of recurrent backpropagation can be found in Hertz et al. . In the feed forward case we recall the weight updating rule ( ), which defines the weight change
$$\Delta w^l_{kj}(t) = \eta\,\delta^l_k(t)\,z^l_j(t) = \eta\,g'(s^l_k(t))\,\epsilon^l_k(t)\,z^l_j(t) = \alpha_N\,\frac{\eta}{\alpha_N}\,g'(s^l_k(t))\,\epsilon^l_k(t)\,z^l_j(t),$$
where we call $\alpha_N$ the NLBP domain parameter. Now the basic idea in nonlinear backpropagation is to interpret the above equation as a first order Taylor expansion of the equation
$$\Delta w^l_{kj}(t) \approx \alpha_N\left[g\!\left(s^l_k(t) + \frac{\eta}{\alpha_N}\,\epsilon^l_k(t)\right) - g(s^l_k(t))\right] z^l_j(t),$$
which is valid for small $\frac{\eta}{\alpha_N}\,\epsilon^l_k(t)$. Redefining the weight error definition ( ) to
$$\delta^l_{Nk}(t) = \frac{\alpha_N}{\eta}\left[g\!\left(s^l_k(t) + \frac{\eta}{\alpha_N}\,\epsilon^l_k(t)\right) - g(s^l_k(t))\right],$$
where the $\delta^l_{Nk}(t)$s are the NLBP weight errors, the NLBP weight change equation has the same form as the original backpropagation equation:
$$\Delta w^l_{kj}(t) = \eta\,\delta^l_{Nk}(t)\,z^l_j(t).$$
When the NLBP domain parameter $\alpha_N$ is large, the Taylor approximation is good but requires high precision to compute. When $\alpha_N$ is small, the algorithm is numerically stable but is taken far from gradient descent behaviour. We think of $\alpha_N$ as being in the range $\eta \le \alpha_N < \infty$. In the numerically most stable limit, which is most interesting for a VLSI implementation because of the limited precision of this technology, ( ) takes the simpler form†
$$\delta^l_{Nk}(t) = g(s^l_k(t) + \epsilon^l_k(t)) - g(s^l_k(t)) \quad \text{for } \alpha_N = \eta.$$
As for ordinary backpropagation, we can choose $\delta^L_{Nk}(t) = \epsilon^L_k(t)$ for the output layer if we wish to use the entropic cost function.
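The relation between the NLBP weight error and the ordinary backpropagation weight error can be checked numerically. The following Python sketch assumes a hyperbolic tangent activation function and uses the illustrative names `delta_bp` and `delta_nlbp`; the symbols `eta` (learning rate), `alpha_n` (NLBP domain parameter) and `eps` (neuron error) and all numerical values are assumptions chosen for the example.

```python
import math


def g(s: float) -> float:
    """Neuron activation function; tanh is assumed here for illustration."""
    return math.tanh(s)


def g_prime(s: float) -> float:
    """Derivative of tanh."""
    return 1.0 - math.tanh(s) ** 2


def delta_bp(s: float, eps: float) -> float:
    """Ordinary backpropagation weight error: g'(s) * eps."""
    return g_prime(s) * eps


def delta_nlbp(s: float, eps: float, eta: float, alpha_n: float) -> float:
    """NLBP weight error: (alpha_N / eta) * [g(s + (eta / alpha_N) * eps) - g(s)]."""
    step = (eta / alpha_n) * eps
    return (alpha_n / eta) * (g(s + step) - g(s))


if __name__ == "__main__":
    s, eps, eta = 0.3, 0.5, 0.05
    for alpha_n in (eta, 10 * eta, 1000 * eta):
        # A large alpha_N approaches the derivative based value;
        # alpha_N = eta gives the plain difference g(s + eps) - g(s).
        print(alpha_n, delta_nlbp(s, eps, eta, alpha_n), delta_bp(s, eps))
```

For a large domain parameter the two weight errors agree closely, while for `alpha_n == eta` the NLBP weight error reduces to a single difference of two activation function values, which requires neither a derivative nor a multiplication.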
Figure : NLBP training error. Training error (squared error, 0 to 1.5) as function of learning epoch (0 to 50 training epochs) for normal backpropagation (solid line) and the nonlinear version with $\alpha_N = \eta$ =  (dashed line) and $\alpha_N$ =  (dot-dashed line).
† Note that the errors are propagated through the same nonlinear units as the neuron activations are in recall mode, rather than through the linearized units used in ordinary backpropagation. Hence the name nonlinear backpropagation.
In figure  the training errors using the NETtalk data set (Sejnowski and Rosenberg ) for normal backpropagation and NLBP are compared (from Hertz et al. ). The performance of NLBP is very similar to that of ordinary backpropagation in these simulations (note that the usual algorithm variations, such as weight decay, momentum, etc., are applicable to NLBP). In a hardware implementation it would be expected to be superior to ordinary backpropagation: being based on addition and subtraction (in addition to the neuron nonlinearity) rather than on differentiation and multiplication, the calculation of the NLBP weight error is much simpler and thus bound to be more accurate. Also favouring a hardware implementation is the indication that NLBP seems to be superior to ordinary backpropagation for large learning rates (cf. section ).
 Hardware implementation
As the only dierence between ordinary backpropagation and nonlinear back
propagation is the way the weight strength errors are computed  and as these
are computed locally  NLBP maps topologically on hardware in exactly the same
way as ordinary backpropagation only the neuron implementation dier In this
section we shall show the core of two neuron implementations for an ANN with
onchip NLBP learning rst reported in Hertz et al 	
Continuous time NLBP neuron. Taking the BJT differential pair of our original neuron as a starting point for implementing NLBP with a hyperbolic tangent neuron activation function leads to the schematic in figure  . For simplicity the differential pairs are shown implemented with npn BJTs; in an actual CMOS implementation lateral bipolar mode p-channel MOSTs would be used, as in the previous designs, of course. (As the actual neuron activation function is unimportant for NLBP, MOST differential pairs could be used instead, as in Bogason , which would favour speed. The use of LBM MOST differential pairs probably favours accuracy; Vittoz , Salama et al. .) One will notice that the circuit requires application of the negated neuron error $-\epsilon_k$. Thus the synapse chip would have to compute this in reverse mode rather than $\epsilon_k$ (requiring a simple modification).
It is interesting to notice that the circuit structure is identical to the one used by Bogason  (cf. figure  ) to compute the neuron activation function derivative. Substituting $v_{\epsilon k}$ by a small constant voltage $\Delta V$, the $i_{\delta k}$ output will approximate $\Delta V \cdot \partial i_{yk} / \partial v_{sk}$. One can interpret NLBP as a way to exploit this implicit multiplication of the difference ( ), which eliminates the $g'(s_k) \cdot \epsilon_k$ multiplier and hence a source of errors. Further, $v_{\epsilon k}$ is not a small voltage (as opposed to $\Delta V$), which makes inherent inaccuracies less significant (relatively). Consequently, using the circuit for NLBP gives better accuracy than when using it for ordinary backpropagation.
The accuracy of the circuit (i.e. on the weight error calculation) is determined by the matching of the two differential pairs and their tail currents. This can be in the order of  % of the output current magnitude (see e.g. O'Leary , though see also Salama et al. ), which is better by far than the accuracy of our present chips and will probably enhance the performance significantly. Still, though, linear transresistances are needed at the inputs and outputs for compatibility with the synapse chip. This will degrade performance.
Figure : Continuous time nonlinear backpropagation neuron. Using a MOST differential pair is also possible, though the activation function will be different. The precision is determined by the matching of the two differential pairs.
The circuit as presented functions in continuous time and can substitute the schematic backpropagation neuron of figure  in a system that uses fully parallel weight updating.
Discrete time NLBP neuron. As the actual shape of the neuron activation function is irrelevant to the nonlinear backpropagation implementation†, there is no need to base the implementation on differential pairs. A far better approach is to use circuits that inherently have the current inputs and voltage outputs which are needed. Also, as the same function is used to calculate the neuron activations and the weight errors, it would be preferable to use the same hardware for these calculations, as this eliminates the need for matched components‡. This is possible if the system is not required to function in continuous time (though the output would have to be sampled, which introduces errors).
† Indeed, we could implement a bump function (or radial basis function), which is popular among certain authors (see e.g. Hertz et al. , Sanchez-Sinencio and Lau ), or any other function, as long as the implementation is time invariant (and possibly also reproducible).
‡ Also, if a very complex activation function is used, the extra hardware consumption might prohibit the implementation of two activation blocks per neuron.
Shown in figure  is the simplified schematic of such a discrete time neuron, which reuses the activation function block and which has current inputs/voltage outputs. During the $\phi_1$ clock phase, $v_{yk}$ is available at the output and is sampled at the capacitor. During the $\phi_2$ clock phase, $v_{\delta k}$ is available at the output. It is the switched capacitor that computes the difference in ( ) and which determines the accuracy of the circuit. Note that the output buffer needs to be linear, but its offset error is canceled by the switched capacitor. Also, the neuron transfer function block (the six MOSTs) can be an arbitrary current in/voltage out circuit; static errors (such as input current/output voltage offsets) in this block are irrelevant. Using design techniques to reduce charge injection and redistribution (Robert and Deval , Gatti et al. , Signell and Mossberg , Kerth et al. ), the accuracy can be brought within  % of the output voltage range.
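Why static errors drop out when the same activation block is used in both clock phases can be illustrated with a small behavioural model. The Python sketch below is an idealised assumption, not a circuit simulation: the activation block is modelled as a tanh in normalised units with a constant input offset `i_ofs` and output offset `v_ofs`, the buffer adds a constant `v_buf_ofs`, and the switched capacitor is taken to form an exact difference of the two sampled outputs. All names and values are hypothetical.

```python
import math


def activation_block(i_in: float, i_ofs: float = 0.0, v_ofs: float = 0.0) -> float:
    """Assumed current in, voltage out block with static input and output offsets."""
    return math.tanh(i_in + i_ofs) + v_ofs


def discrete_time_delta(s: float, eps: float, i_ofs: float = 0.0,
                        v_ofs: float = 0.0, v_buf_ofs: float = 0.0) -> float:
    """Difference formed by an ideal switched capacitor over the two clock phases."""
    # Phase 1: the activation for input s is sampled on the capacitor.
    v_phase1 = activation_block(s, i_ofs, v_ofs) + v_buf_ofs
    # Phase 2: the same block is driven with s + eps.
    v_phase2 = activation_block(s + eps, i_ofs, v_ofs) + v_buf_ofs
    return v_phase2 - v_phase1


if __name__ == "__main__":
    ideal = discrete_time_delta(0.2, 0.4)
    with_offsets = discrete_time_delta(0.2, 0.4, v_ofs=0.3, v_buf_ofs=-0.1)
    print(ideal, with_offsets)   # output and buffer offsets cancel in the difference
```

In this model a static input offset does not cancel, but it merely shifts the effective activation function, which is used consistently for both the neuron activation and the weight error.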
Figure : Discrete time nonlinear backpropagation neuron. Simplified schematic. Time invariant inaccuracies will not affect the performance of this circuit. The precision is determined by the switched capacitor.
This discrete time NLBP neuron can directly replace the computing elements of our original backpropagation neuron in figure  . Thus, assuming that the voltage buffer would be needed in a recall mode version of the neuron, the NLBP hardware overhead is potentially extremely small, consisting of only a switched capacitor at the neuron sites (in addition to the very modest hardware increase of our original hardware efficient backpropagation synapse chip, and the order O( ) weight updating hardware and finite automaton algorithm controller).
Operating the output transistors of the transfer function block in the triode mode, the circuit output voltage will exhibit a reasonably smooth transition from $V_{\max}$ to $V_{\min}$ when the input current is increased, giving an S-shaped transfer function. The circuit, however, has a very poor power supply rejection ratio (PSRR). A more realistic circuit is shown in figure  . The insertion of current mirrors in the signal paths gives a much better PSRR and the possibility of a lower input impedance. To avoid drawing current from the output range references (which would compromise their rigidity), simple amplifiers buffer these. (NLBP does not require the neuron output range to be very well defined; thus the large input offsets ($V_T$) of the amplifiers need not match very well.) The neuron steepness is controlled by the input stage bias current $I_B$. Transfer function simulations for different bias currents can be seen in figure  .
Figure : Neuron activation block schematic. Improved PSRR. The outputs from a standard source follower current input stage drive the transfer function shaping output transistors via current mirrors. The output level reference buffers can be thought of as degenerated regulated cascodes.
Figure : Simulated neuron transfer function. Transfer function at  V power supply for several bias currents. The plot ("I in, V out squashing neuron, improved PSRR") shows the output voltage (-1.0 V to 1.0 V) versus the input current (-3.0 µA to 3.0 µA).
	 Further work
Obviously system level experiments need to be carried out in order to evaluate
the applicability of our proposed chip set especially with respect to derivative
computation and weight change osets	 This work is presently being carried out
Chapter 	 Implementation of onchip backpropagation Page 
at our institute Also the implementation of the proposed discrete time NLBP
neuron chip which could be pin compatible with our present backpropagation
neuron chip and thus substitute this in the backpropagation system	 is an evident
future design task
As for the recall mode neural network design of chapter  several design
issues need consideration prior to a volume production The considerations on
process parameter dependency canceling and temperature compensationmentioned
in section  also apply for the backpropagation chip set In addition the imple
mentation of the high accuracy calculations or rather low oset ones	 needed by
the learning scheme requires further investigation In this section we shall discuss
a few approaches to reduce the inuence of oset on critical signals and to the
inclusion of algorithmic variations
 Chopper stabilizing
One of the more critical signals in an ANN learning scheme, with respect to offset, is the weight change. A straightforward solution (mentioned in section ) to this problem is to measure the offset during an auto zeroing phase and subsequently subtract this offset. This solution has two major drawbacks: it requires (i) memory and (ii) an offset free comparator (though see chapter ). Especially for systems with a fully parallel weight updating scheme these drawbacks are quite severe. Another way to eliminate the weight change offset is to apply a chopper stabilizing technique known from operational amplifier offset cancellation (Hsieh et al. , Allen and Holberg , Coln ). The polarity of the inputs and outputs of a differential op-amp is synchronously and periodically (at $f_{\mathrm{chp}}$) reversed, which moves the offset error (and low frequency noise) to the odd harmonics of the chopping frequency $f_{\mathrm{chp}}$. In a similar way we can periodically permute the input/output polarities of the $\delta_k \times z_j$ multiplier used to compute the weight changes. This is illustrated in figure  for the case where the weight updating multipliers are placed at the synapse sites for parallel weight updating†.
Assume that the weight updating multiplier, which should be a differential input/differential output multiplier (e.g. the Gilbert multiplier of figure  ), computes
$$\Delta w_{kj\,\mathrm{mul}} = (\delta_k + \delta_{kj\,\mathrm{ofs}})(z_j + z_{kj\,\mathrm{ofs}}) + \eta_{kj\,\mathrm{ofs}}.$$
Inserting four switch transistors at each input and at the output, which reverse the polarity when the corresponding control signal $\phi$ is high, and doing four successive weight updates using
$$(\phi_\delta, \phi_z, \phi_\eta) \in \{(0,0,0),\ (0,1,1),\ (1,0,1),\ (1,1,0)\}$$
gives the following resulting weight change:
$$\Delta w_{kj} = \sum_{(\phi_\delta,\phi_z,\phi_\eta)} (-1)^{\phi_\eta}\left[\left((-1)^{\phi_\delta}\delta_k + \delta_{kj\,\mathrm{ofs}}\right)\left((-1)^{\phi_z} z_j + z_{kj\,\mathrm{ofs}}\right) + \eta_{kj\,\mathrm{ofs}}\right] = 4\,\delta_k z_j.$$
† Pulse width modulation (by gating the control signals of the contacts connected to the storage capacitor) could be used to adjust the learning rate; this is not shown in the figure. A multiplier bias current could also be used to control the learning rate.
Figure : Chopper stabilized weight updating. Principal schematic. For parallel weight updating as indicated, only four extra minimum switches are needed at the synapse sites.
We call this multidimensional chopper stabilization. Note that the output switches are placed in such a way that, to a first order approximation, offset errors related to the (switched) differencing current mirror are also canceled. Placing the switch transistors as indicated on the figure, we see that only four minimum switch transistors (in addition to the weight updating multiplier) are necessary at each synapse site. The other switch transistors can be common to a row or column of synapses. This has the additional advantage that possible offsets in input buffers (as indicated on the figure) will also be canceled. (If we were to add chopper stabilizing of the weight change signal in our present backpropagation system, only one stabilizer, or actually one per neuron chip, would be required, of course.)
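The offset cancellation can be verified with a simple numerical model. The following Python sketch assumes an idealised multiplier with constant input and output offsets and ideal polarity switches; the function names and the offset values are hypothetical. The output polarity is chosen as the exclusive-or of the two input polarities, so that the wanted product keeps a constant sign over the four phases.

```python
from itertools import product


def multiplier(delta: float, z: float, d_ofs: float, z_ofs: float, out_ofs: float) -> float:
    """Assumed multiplier model with constant input offsets and output offset."""
    return (delta + d_ofs) * (z + z_ofs) + out_ofs


def chopped_update(delta: float, z: float, d_ofs: float, z_ofs: float, out_ofs: float) -> float:
    """Sum of four weight updates, one for each polarity combination of the inputs."""
    total = 0.0
    for phi_d, phi_z in product((0, 1), repeat=2):
        phi_out = phi_d ^ phi_z   # output polarity restores a constant sign of delta * z
        sign_d, sign_z, sign_out = (-1) ** phi_d, (-1) ** phi_z, (-1) ** phi_out
        total += sign_out * multiplier(sign_d * delta, sign_z * z, d_ofs, z_ofs, out_ofs)
    return total


if __name__ == "__main__":
    delta, z = 0.3, -0.7
    offsets = dict(d_ofs=0.05, z_ofs=0.02, out_ofs=0.1)
    print(chopped_update(delta, z, **offsets))   # close to 4 * delta * z
    print(4 * multiplier(delta, z, **offsets))   # four unchopped updates keep the offsets
```

In this model every offset term is multiplied by a sign sequence that sums to zero over the four phases, so only the product term survives.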
For exact oset cancellation the data frequencies should be lower than half the
smallest of the chopper frequencies In a discrete time system however one would
probably apply a new data set for each of the four 
 
triples oset cancellation
would still be expected
Chapter 	 Implementation of onchip backpropagation Page 
The chopper stabilizing technique can be used at other signals as well Most
probably at the backpropagated error signals Chopper stabilizing this signal on
our present backpropagation synapse chip is unfortunately somewhat dicult
because of the current conveyor based dierencing technique Basing the synapse
dierencer on a current mode operational amplier instead the stabilizing might
be possible using a few more switch transistors in the row column elements
If signals used concurrently with the stabilized ones above are to be chopper
stabilized say one would stabilize the 
k
s computed at the neuron chip	 more
chopper frequencies need to be introduced One must ensure that all permutations
of chopper phases 
 
 giving a constant resulting sign of the weight change signal
are present in a complete cycle
 Including algorithmic improvements
As mentioned in section , an advantage of using a serial weight updating scheme is that advanced procedures can inexpensively be employed. Thus improvements of the system by including relevant algorithm variations displayed in section  (or other variations) are important to consider.
Weight change threshold. In addition to offset compensation and chopper stabilizing, the problem of offsets on the weight changes can be solved by introducing a weight change threshold $\Delta w_{\min}$, as proposed by Montalvo et al. . The influence of a weight change offset is most severe when the ideal weight change is close to zero; in this case it is the offset rather than the desired weight change that determines the actual applied weight change. The effect is that the weight space state will always drift away from a cost function minimum (a solution to the problem at hand), where the ideal weight changes are zero. The consequence is to introduce a weight change threshold below which weight changes are ignored, i.e. substituting $\Delta w^l_{kj}$ in ( ) by
$$\Delta w^l_{kj}(t) = \begin{cases} 0, & \text{for } |\eta\,\delta^l_k(t)\,z^l_j(t)| < \Delta w_{\min} \\ \eta\,\delta^l_k(t)\,z^l_j(t), & \text{otherwise.} \end{cases}$$
This is quite easily incorporated in our digital weight updating scheme without introducing errors.
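In the digital domain the threshold amounts to a single comparison on the converted weight change. The Python sketch below is a minimal illustration with hypothetical names; weight changes are represented as integer ADC codes, and the drift model (a constant offset measured while the ideal weight change is zero) is an assumption made for the example.

```python
def thresholded_change(dw: int, dw_min: int) -> int:
    """Return 0 when |dw| is below the threshold, otherwise dw unchanged."""
    return 0 if abs(dw) < dw_min else dw


def drift_at_minimum(offset: int, dw_min: int, steps: int) -> int:
    """Accumulated weight drift when only a constant offset is measured.

    Models a weight sitting at a cost function minimum, where the ideal
    weight change is zero and the measured change equals the offset.
    """
    w = 0
    for _ in range(steps):
        w += thresholded_change(offset, dw_min)
    return w


if __name__ == "__main__":
    print(drift_at_minimum(offset=1, dw_min=0, steps=100))   # no threshold: the weight drifts
    print(drift_at_minimum(offset=1, dw_min=2, steps=100))   # threshold above the offset: no drift
```

The threshold must exceed the offset magnitude to stop the drift, at the price of also ignoring genuine weight changes of the same size.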
Momentum. One of the more popular improvements of backpropagation is momentum. In the analogue domain (assuming parallel weight updating), momentum is included by adding a leaky integrator at the output of each $\delta_k \times z_j$ multiplier. However, using a momentum parameter $\alpha_{\mathrm{mtm}}$ (cf. ( )), typically in the order of  , offset errors (associated with the multiplier and the integrator) are increased by a factor $1/(1 - \alpha_{\mathrm{mtm}})$. When the weight changes are small, this is a severe problem which might very well prohibit the inclusion of momentum in pure analogue systems. In the digital weight updating scheme used in our backpropagation system (figure  ), however, the effective weight change offset was reduced compared to the analogue approach. Thus in this system we could hope that the increased effective offset error would be acceptable, especially if a weight change threshold is included.
Choosing the momentum parameter $\alpha_{\mathrm{mtm}} = \frac{1}{2}$ enables a very easy implementation of momentum in our backpropagation system. The only hardware needed is a simple digital adder and a digital RAM ($B_w$ =  bit wide and the same number of words as the weight backup RAM). Using a $B_w$ bit discretization of the weight change signal, the memory in the resulting weight change signal is only $B_w$ training samples. This is smaller than desired for most applications. Using $\alpha_{\mathrm{mtm}}$ =  gives a  times longer memory; however, multiplying by  using digital hardware is very inconvenient. Another way to lengthen the weight change memory is to apply the $\alpha_{\mathrm{mtm}}$ factor only every $A_{\mathrm{mtm}}$th sample (we call this degenerated momentum), i.e.
$$\Delta w^l_{kj}(t) = \begin{cases} \alpha_{\mathrm{mtm}}\,\Delta w^l_{kj}(t-1) + \eta\,\delta^l_k(t)\,z^l_j(t), & \text{for } t = 0, A_{\mathrm{mtm}}, 2A_{\mathrm{mtm}}, \ldots \\ \Delta w^l_{kj}(t-1) + \eta\,\delta^l_k(t)\,z^l_j(t), & \text{otherwise.} \end{cases}$$
This increases the memory by a factor $A_{\mathrm{mtm}}$. The resulting weight change corresponding to the training data applied at time $t$ is approximately $\frac{A_{\mathrm{mtm}}}{1 - \alpha_{\mathrm{mtm}}}\,\eta\,\delta^l_k(t)\,z^l_j(t)$. The hardware implementation of such a scheme is more complicated than the simple choice of $\alpha_{\mathrm{mtm}} = \frac{1}{2}$: it requires the addition of overflow control on the adder and the insertion of arithmetic shift left hardware for the selective multiplication.
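The effect of applying the momentum factor only on every $A_{\mathrm{mtm}}$th sample can be examined with an impulse response experiment. The Python sketch below is an illustration under assumptions: the weight change is held as a non-negative integer code, the momentum factor is taken as $2^{-\mathrm{shift}}$ and modelled as an integer right shift, and the function name `degenerated_momentum` and all values are hypothetical.

```python
def degenerated_momentum(inputs, a_mtm: int, shift: int = 1):
    """Running weight change for integer input samples (eta * delta * z as codes).

    The factor alpha = 2**-shift is applied only when t is a multiple of a_mtm.
    Note: a right shift floors toward minus infinity, so negative codes would
    settle at -1 rather than 0; the sketch is meant for non-negative inputs.
    """
    dw = 0
    history = []
    for t, x in enumerate(inputs):
        if t % a_mtm == 0:
            dw = (dw >> shift) + x   # alpha_mtm * dw(t-1) + new contribution
        else:
            dw = dw + x              # plain accumulation in between
        history.append(dw)
    return history


if __name__ == "__main__":
    impulse = [1024] + [0] * 59
    for a_mtm in (1, 4):
        h = degenerated_momentum(impulse, a_mtm)
        memory = sum(1 for v in h if v)   # samples during which the impulse is remembered
        print(a_mtm, memory, sum(h))      # total applied change is near A/(1 - alpha) * 1024
```

For the assumed 11 bit impulse, the memory grows from 11 samples to 44 samples when `a_mtm` goes from 1 to 4, and the total applied weight change grows by the same factor.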
Weight decay. Implementing weight decay in a system with simple capacitive storage (and a parallel weight updating scheme) is in principle just a matter of making the storage capacitor leaky, i.e. placing a resistor from the capacitor to a zero weight reference voltage. The weight decay, however, must be small compared to typical weight changes in order not to prohibit learning. Krogh and Hertz , for example, use a very small weight decay parameter/learning rate ratio of $\epsilon_{\mathrm{dec}}/\eta$ =  , which would probably be insignificant compared to typical weight change offsets. (If one could accept a weight decay that was large compared to the weight change offsets, the influence of these could probably be reduced, in addition to the other advantages of weight decay. Whether this is possible, experiments would have to show.)
In our backpropagation system with the order O	 digital weight updating
hardware the situation is dierent Though using 
dec
 "  	 

 
at a learning
rate  " 
  gives 
dec
" 

which is negligible when using a weight discretiza
tion of  bit	 the weight backup memory precision could easily be enhanced to
 bit say	 to accommodate such small weight changes If the weight change o
set is larger than one LSB of the weight change ADC a weight change threshold
would be needed	 The inclusion of weight decay in our system requires only a
digital adder
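The dependence of a small weight decay on the weight memory precision is a matter of simple arithmetic, sketched below in Python. The decay parameter (1e-5) and the word lengths (12 and 24 bit) are assumed values for illustration only, and the function name `decay_in_lsbs` is hypothetical.

```python
def decay_in_lsbs(bits: int, eps_dec: float, weight_fraction: float = 1.0) -> int:
    """Weight decay step, in LSBs, for a weight stored with the given word length.

    weight_fraction is the weight magnitude relative to full scale.
    """
    full_scale = 2 ** (bits - 1) - 1
    w_code = round(weight_fraction * full_scale)
    return round(eps_dec * w_code)


if __name__ == "__main__":
    eps_dec = 1e-5   # assumed weight decay parameter
    for bits in (12, 24):
        # At low precision even a full scale weight decays by less than one LSB.
        print(bits, decay_in_lsbs(bits, eps_dec))
```

Under these assumptions the decay step vanishes at the lower precision and becomes a usable number of LSBs at the higher one, which is the reason for enhancing the backup memory precision.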
 Other improvements
As was the case for the recall mode ANN chip set, several improvements of the backpropagation chip set are possible. A list of the more obvious ones can be found in appendix D .

 Summary
In this chapter we designed a variation of our cascadable ANN chip set including on-chip error backpropagation learning. The basic learning algorithm was displayed, and the applicability of common algorithmic variations for the implementation in analogue VLSI was discussed. It was shown that a fully parallel implementation would give an order O(N ) improvement in speed compared to a serial solution. It was also shown that, exploiting the symmetry of the MRC, it was possible to implement backpropagation with no extra hardware at the synapse sites and no extra inter-chip connections (at the cost of an order O(N) in speed); this is our solution. Using digital RAM weight backup, weight access restrictions reduce the learning speed to O( ) compared to a serial solution.
The design of our backpropagation chip set was displayed. An improved CCII+ based current differencer is used on the synapse chip, and the neuron chip is approximately twice as complex as the first generation recall-mode one. MRCs are used extensively for the extra computing circuitry on the neuron chip. Measurements on the chip set were displayed, indicating a  MCUPS learning speed. The measurements suggest that a range of offset errors (especially on the weight change signal) would have to be canceled. Also, the neuron derivative computation seems to be problematic when affected by certain offset errors. Several improvements to this circuit were proposed.
A complete backpropagation system design based on our chip set was displayed. As the weights ultimately have to be placed in a digital RAM, most of the weight change hardware not present on the neuron chips is implemented using discrete digital hardware. We elaborated on the virtues of such a solution, e.g. reduced minimum effective learning rate and weight change offset, and easy, reliable implementation of weight decay and momentum. The system is presently under construction; thus no experimental results were presented.
The novel nonlinear backpropagation learning algorithm was displayed. Not needing the neuron derivative, the neuron circuitry for this algorithm is superior to that of the original algorithm (the exact neuron transfer function is irrelevant; we can focus the design effort on the electrical characteristics of the neuron). Two possible backpropagation neurons were proposed: one for continuous time nonlinear backpropagation and one compatible with our discrete time system. Using the latter solution, virtually no extra hardware is needed for the learning algorithm compared to the recall-mode system.
A chopper stabilization technique for reducing offset errors was proposed. A sample implementation for reducing offset errors in weight change signals computed by local (synapse multiplier) circuitry was given.
Finally, the inclusion of weight change thresholds, momentum and weight decay in our backpropagation system was outlined.
Chapter 
Implementation of RTRL
hardware
The implementation of addon realtime recurrent learning hardware for our ANN
chip set is the objective of this chapter The learning algorithm is rst briey de
scribed after which it is shown how it can be mapped on hardware compatible with
our ANN architecture using a realistic amount of hardware Results from simula
tions modeling analogue VLSI nonidealities in the architecture are displayed The
design of an experimental VLSI chip implementing most of the learning hardware is
next presented including an oset canceling scheme for the critical weight change
signal Chip measurements are also presented The design of a complete RTRL
system is done and it is shown how we can apply algorithmic variations using the
system We derive a nonlinear version of the RTRL algorithm and we argue that
this algorithm has the same virtues as nonlinear backpropagation Reections
on future work are then given A continuous time RTRL system is considered A
summary concludes the chapter
  The RTRL algorithm
The realtime recurrent learning RTRL	 algorithm is a supervised gradient de
scent algorithm cf appendix B	 for general recurrent articial neural network
architectures In this section we shall describe the basic algorithm and display
modications typically applied to it
Chapter  Implementation of RTRL hardware Page 
 Basics
The RTRL algorithm for an artificial neural network with discrete time feedback
(a discrete time recurrent artificial neural network, RANN; cf. appendix B and the
RANN block diagram below) can be described as follows (Williams and Zipser). Given
an input vector $x(t)$ at time $t$, we can write the neuron $k$ activation (cf. appendix B) as

$$y_k(t) = g_k(s_k(t)) = g_k\Bigl(\sum_{j \in I \cup U} w_{kj}(t)\, z_j(t)\Bigr),$$

where

$$z_j(t) = \begin{cases} x_j(t), & \text{for } j \in I \\ y_j(t-1), & \text{for } j \in U. \end{cases}$$

$I$ is the set of input indices and $U$ is the set of neuron indices. The neuron biases
are implicitly given as the connection strengths from a constant input $z_0 = 1$.
Note that, because of the discrete time feedback, the dependency of the $z_j$s on time
differs slightly from the definition used earlier in the thesis. Doing usual gradient descent
on-line learning, we use a weight updating rule of the form

$$w_{ij}(t+1) = w_{ij}(t) - \eta\,\frac{\partial J(t)}{\partial w_{ij}(t)} \stackrel{\mathrm{often}}{=} w_{ij}(t) - \eta \sum_{k \in U} \frac{\partial J_k(t)}{\partial y_k(t)}\,\frac{\partial y_k(t)}{\partial w_{ij}(t)},$$

where J t	 is the instantaneous cost function cf appendix B	 Now the idea
of RTRL is that the neuron activation derivatives can be shown to equal to

y
k
t	

w
ij
t	
" p
k
ij
t	 
where the neuron derivative variables p
k
ij
are computed asy
p
k
ij

	 " 
 
p
k
ij
t	 " g
 
k
s
k
t		
k
ij
t	  

	

k
ij
t	 "
X
lU
w
kl
t	p
l
ij
t  	 ! 
ik
z
j
t	   

	
We have assumed that teaching starts at time t "  and we have introduced the
neuron net input derivative variables 
k
ij
 Using the quadratic cost function the
resulting weight change is equal to
#w
ij
t	 " 
X
kU

k
t	p
k
ij
t	   

	
y As before 
ik
denotes Kronecker's delta
The neuron error $\varepsilon_k(t)$ is defined as

$$\varepsilon_k(t) = \begin{cases} d_k(t) - y_k(t), & \text{for } k \in T(t) \\ 0, & \text{otherwise,} \end{cases}$$

where $d_k(t)$ is the neuron $k$ target value at time $t$ and $T(t)$ is the set of neurons for
which targets exist at time $t$ (the target indices). Using the entropic cost function
and hyperbolic tangent activation functions, it can be shown that the weight change
is equal to

$$\Delta w_{ij}(t) = \eta \sum_{k \in U} \varepsilon_k(t)\,\pi^k_{ij}(t)$$

(here $\pi^k_{ij}$ is the neuron net input derivative variable).
A new $\{x, d\}$ pair is applied at time $t+1$. We call the completion of the
computations for a given $t$ a learning cycle (or time step).
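
The learning cycle described above can be sketched in software as follows. This is only an illustrative sketch under stated assumptions: tanh neurons and the quadratic cost function are assumed, the function name `rtrl_step`, the nested-list data layout and the ordering of the weight columns (inputs first, then neurons) are hypothetical choices that do not come from the thesis, and a bias can be obtained by letting one input be the constant 1.

```python
import math

def rtrl_step(w, p, y_prev, x, d, eta):
    """One RTRL learning cycle (quadratic cost, tanh neurons).

    w[k][j]    : weights; columns 0..M-1 are inputs, M..M+N-1 are neurons
    p[k][i][j] : derivative variables dy_k/dw_ij from the previous cycle
    y_prev     : neuron outputs y(t-1); x: inputs x(t)
    d          : targets d_k(t), or None where neuron k has no target
    Returns (y, p_new, w_new).
    """
    n, m = len(w), len(x)
    z = list(x) + list(y_prev)                      # z_j(t)
    s = [sum(w[k][j] * z[j] for j in range(m + n)) for k in range(n)]
    y = [math.tanh(sk) for sk in s]
    eps = [0.0 if d[k] is None else d[k] - y[k] for k in range(n)]

    p_new = [[[0.0] * (m + n) for _ in range(n)] for _ in range(n)]
    w_new = [row[:] for row in w]
    for i in range(n):
        for j in range(m + n):
            for k in range(n):
                # net input derivative: recurrent sum plus Kronecker delta term
                net = sum(w[k][m + l] * p[l][i][j] for l in range(n))
                if i == k:
                    net += z[j]
                p_new[k][i][j] = (1.0 - y[k] ** 2) * net   # tanh'(s) = 1 - y^2
            w_new[i][j] += eta * sum(eps[k] * p_new[k][i][j] for k in range(n))
    return y, p_new, w_new
```

With the learning rate set to zero the weights stay fixed, and the derivative variables can then be checked against a finite difference of the network output, which is a convenient sanity test for any implementation of the recursion.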
 Variations
As the backpropagation algorithm RTRL can be varied in numerous ways Most
of these variations does not alter the topology of the algorithm and can easily be
applied to a VLSI implementation In addition to the use of dierent cost functions
as shown above variations include Williams and Zipser   Smith and
Zipser  Catfolis  Hertz et al  Brunak and Hansen  and others	

 Teacher forcing When the network is to be taught such that the dynamic
behaviour is altered in a qualitative manner teacher forcing can be employed
for instance when the network is taught to oscillate	 Here the target values
when they exist	 rather than the network outputs are fed back
z
TF
j
t	 "



x
j
t	  for j  I
d
j
t  	  for j  T t  	
y
j
t 	  for j  U n T t 	
 
For correct derivative calculation in this case the sum in 

	 should be taken
for l  U n T t  	 The incorporation of teacher forcing in a VLSI RTRL
implementation is straight forward

 Relaxation and Pipelining. The RTRL network architecture (cf. appendix B)
implies that a delay of a number of time steps will be present from the time an
input is applied until the corresponding output is seen. If, for instance, the network
is to implement an xor function, it must organize itself as a two-layer (at
the least) perceptron, i.e. it can only compute the xor of two inputs applied at an
earlier time step. When applying target values to the network, one is implicitly
constraining the network architecture (by choosing the input-output delay in terms
of time steps). The delay must be just long enough to enable a sufficiently complex
network organization (without unnecessary delay insertions, which degrade learning).
At least two procedures are possible when introducing the input-output delay.
One is to relax the network for a number of time steps when an input has been
applied, i.e. set $x(nT_{PD}+1) = x(nT_{PD}+2) = \cdots = x((n+1)T_{PD})$ and apply
the corresponding target only at time $(n+1)T_{PD}$ ($n \in \mathbb{N}_0$, and $T_{PD} \in \mathbb{N}$ is
the desired propagation delay). The other is to exploit the pipelined nature
of the network architecture ie use a new input xt	 at each time step and
apply the corresponding target at time t! T
PD
  Clearly the latest method
gives the highest throughput. It might be harder to train, though, as delays
for synchronizing may have to be inserted by the learning algorithm. The
choice of relaxation/pipelining only affects the algorithm control mechanism
of a VLSI implementation. This is also true for learning by subsequence.

 Learning by subsequence. Sometimes it is interesting to apply different
independent sequences to the network rather than regarding the inputs as a
continuous stream of data. To avoid false correlations between different
sequences, the neuron derivative variables $p^k_{ij}$ (and the neuron states) are reset
between each sequence (e.g. at $t = nT_{seq}$ if all sequences have the length $T_{seq}$).
Related to this is the application of a priori temporal knowledge: if the output
is known to be dependent on the latest $T_{mem}$ input vectors at most, the $p^k_{ij}$s
can be reset at $t = nT_{mem}$ to enforce this limited memory. (Note that in both
cases a tapped delay line feed forward ANN can solve the task. However, in
some cases, especially when $T_{seq}/T_{mem}$ is large, a recurrent net needs much
fewer processing elements.)

 Random initial state. As an alternative to setting $p^k_{ij}(0) = 0$ (or
$p^k_{ij}(nT_{seq}) = 0$), the initial neuron derivative variables can be set to small
random numbers,

$$p^k_{ij}(0) = n^k_{ij}(0),$$

where $n^k_{ij}(t)$ are uncorrelated noise sources (e.g. white Gaussian). This has
a tendency to speed up the learning of small sequences.

 Momentum weight decay etc The standard learning algorithm variations
mentioned in section  are also applicable to RTRL Also the notes on
applicability for hardware realizations do apply
It should be noted that a continuous time formulation of the algorithm is also
possible cf section 	
 Mapping the algorithm on VLSI
The topological mapping of the RTRL algorithm on analogue VLSI was rst pub
lished in Lehmann  
 As mentioned in chapter  it is our aim to implement
learning algorithms for the analogue ANN topology described in chapter  By
adding a sampleandhold circuit as feedback in a one layer system based on our
recall mode chip set we arrive at the discrete time recurrent network topology for
which RTRL was developed This network architecture is shown in gure 



Figure: The discrete time RANN system. Block diagram. The system
is composed of a collection of synapse and neuron chips forming a one-layer
ANN and a sample-and-hold circuit as feedback.
The calculations for a training example needed by the RTRL algorithm can
be performed fully in parallel. It is, however, unrealistic to construct a system that
grows as $O(N^4)$ when $N$ is large. Say we can have



multipliers on a chip
For N " 

  a network that could t on a few synapse and neuron chips 
we need more than 



 multiplier chips) Also a fully parallel weight update is
incompatible with our ANN system as we have serial access to the weight backup
RAM). Now, studying the basic equations above, we notice that

$$p_{ij}(t) = \bigl(\mathbf{1} - y(t) \odot y(t)\bigr) \odot \pi_{ij}(t),$$

$$\pi_{ij}(t) = \tilde{w}(t)\,p_{ij}(t-1) + \delta_i\, z_j(t),$$

where $\tilde{w}$ is the synapse weight matrix $w$ without the columns corresponding to
the inputs, $\mathbf{1} = (1, 1, \ldots, 1)^T$ and $\odot$ denotes vector multiplication by coordinates.
In the first of these equations we have assumed $g_k(\cdot) \equiv \tanh(\cdot)$. The weight change
equations can be written as

$$\Delta w_{ij}(t) = \eta\,\varepsilon(t) \cdot p_{ij}(t),$$

$$\Delta w_{ij}(t) = \eta\,\varepsilon(t) \cdot \pi_{ij}(t)$$

for the quadratic and entropic cost function respectively. Implementing the above
equations in parallel divides the $O(N^4)$ operations between the space domain and
the time domain as $O(N^2)$ to $O(N^2)$. This division has several advantages:

 The area of the computing parts of the learning hardware does not grow faster
with $N$ than does the ANN.†

 Most of the calculations (the order determining $\tilde{w}\,p_{ij}$) can be performed by a
matrix-vector multiplier (almost) identical to the synapse matrix-vector
multiplier of the ANN.

 Most of the additional hardware can be implemented in cascadable signal
slices.

 The system is cascadable, i.e. it can (in principle) be expanded to arbitrary size.

 No signal paths need more lines than $O(N)$.

 Weight updating is serial, which gives the advantages mentioned earlier in the thesis.

† Still, the area of the derivative variable memory grows as $O(N^3)$, of course.

The disadvantage of the $O(N^2)$/$O(N^2)$ space/time division is, of course, that the
system will be an order $O(N^2)$ slower than a fully parallel implementation.
A block diagram of the proposed RTRL system can be seen in the figure below.
The two matrix-vector multipliers, the synapse weight memory and the derivative
variables memory can easily be identified. The adders ($+$) and multipliers ($\times$) are
working by coordinates, the select block (SEL) chooses the outputs that have target
values, the multiplexor-demultiplexor pair computes $\delta_i z_j$, and ($\cdot$) is a vector inner
product multiplier. The dash-dotted signal path is to be used for the entropic cost
function. Controlled by a digital finite automaton, the system operation is as
follows. At the end of a learning cycle, $y(t-1)$ is sampled. Then $x(t)$ and $d(t)$ are
applied, and $y(t)$, $g'(s(t))$, etc. are computed asynchronously. Now, for each $\{i, j\}$,
$p_{ij}(t-1)$ is read from the RAM, after which $p_{ij}(t)$ and $w_{ij}(t+1)$ are computed
and stored in the respective RAMs.
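
The system operation described above (recall in parallel, then one serial pass over all index pairs with a width N vector operation per pass) can be mimicked in software as sketched below. This is a behavioural sketch under assumptions, not a model of the circuits: tanh neurons and the quadratic cost are assumed, and `learning_cycle`, `matvec` and the dictionary standing in for the derivative variable RAM are hypothetical names.

```python
import math

def matvec(a, v):
    """Matrix-vector product; stands in for an analogue matrix-vector multiplier."""
    return [sum(r * c for r, c in zip(row, v)) for row in a]

def learning_cycle(w, ram, y_prev, x, d, eta):
    """Serial-over-{i,j}, parallel-over-k RTRL cycle (vector form).

    ram[(i, j)] holds the vector p_ij(t-1) and is overwritten with p_ij(t).
    Returns (y, w_new, steps), where steps counts the serial passes.
    """
    n, m = len(w), len(x)
    z = list(x) + list(y_prev)
    y = [math.tanh(s) for s in matvec(w, z)]        # recall phase
    gprime = [1.0 - yk * yk for yk in y]            # 1 - y^2, by coordinates
    eps = [0.0 if dk is None else dk - yk for dk, yk in zip(d, y)]
    w_u = [row[m:] for row in w]                    # w without the input columns
    w_new = [row[:] for row in w]
    steps = 0
    for i in range(n):
        for j in range(m + n):
            p_old = ram[(i, j)]                     # read p_ij(t-1) from RAM
            net = matvec(w_u, p_old)                # width-N matrix-vector product
            net[i] += z[j]                          # add the Kronecker delta term
            p_new = [g * q for g, q in zip(gprime, net)]
            ram[(i, j)] = p_new                     # store p_ij(t)
            w_new[i][j] += eta * sum(e * q for e, q in zip(eps, p_new))
            steps += 1
    return y, w_new, steps
```

The returned step count equals the number of weights, which is the time-domain share of the work; each pass uses one matrix-vector product, which is the space-domain share.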
Figure: The discrete time RTRL system. Block diagram. The lines carry
analogue signal vectors (of width 1, $N$ or $N+M$ as indicated by their thickness)
or digital signals.
Comments on the topology. As indicated on the figure, the derivative
variables are meant to be placed in a digital RAM. Digital RAM has been selected
as it is physically small, cheap and reliable. As the storage requirement grows as
$O(N^3)$, this is the size limiting factor of the system. (For small systems it might
be feasible to use analogue storage.) Large RAMs have serial word access. Thus,
to achieve the required $O(N)$ parallel signal for the $p_{ij}$s, it will be necessary to
multiplex the RAM access. This is the speed limiting factor of the system (more
precisely, it will most probably be the ADCs connected to the multiplexors that
limit the speed; this would be a reason to use analogue memory in small systems).
To follow the learning algorithm strictly, we should update the analogue weight
storage on the matrix-vector multipliers only after a full learning cycle has been
completed. For very large systems the on-chip weight storage might be degraded
too much during the time of a learning cycle, though. To get around this problem,
two RAM banks would have to be used: one for refresh and one for the new
weights. If the weight changes are small, however, there is no reason to believe
that a periodical weight refresh (using the weights in the partially updated weight
backup memory	 should prohibit learning cf section 	
All elements in the RTRL system block diagram except the multiplexor, the digital
weight updating hardware, the RAMs and the matrix-vector multipliers operate by
coordinates on vectors of width $N$. These elements can thus be placed on a cascadable
width $N$ data path module. The inner product multiplier would be distributed among each
signal slice of the module and must have current output to ensure cascadability.
Thus a system with the architecture in the figure will be comprised of the following
components:

 A number of synapse chips doing the $w\,z$ multiplication.

 A number of neuron chips applying the tanh non-linearities. The two sets of
chips act as the one layer core neural network.

 An $(N+M)$-way multiplexor (the width $N+M$ data path module).

 Two digital RAMs with corresponding A/D and D/A converters. The data
converters can be off-chip or on-chip components.

 A width 1 data path module for the weight updating hardware. On this module
the digital finite automaton for controlling the learning scheme would also be
placed.

 A number of width $N$ data path modules (with a total of $N$ signal slices)
performing the rest of the calculations.

If we are to add the learning hardware to an existing ANN, we must control
the ANN neuron output range, as we compute the neuron derivative as $1 - y^2$ (cf.
previous chapters). Also, we are implicitly requiring the two matrix-vector multipliers
(and the width $N$ signal slices) to match (inter chip matching!). It turns out,
though, that this matching is not very important (cf. below).
It should be noted that the sparse input synapse chip mentioned earlier
can be used to process the network inputs ($x$), if the $z_j$ multiplexor is also
made capable of handling binary coded inputs.
 System simulations
In Lehmann, simulations were done on the influence of various non-idealities
in the system. Non-linearities, offset errors and quantizations on selected
signals were investigated. The restrictions found are in compliance with what other
authors have found for other learning algorithms (e.g. backpropagation and weight
perturbation	 Montalvo et al  Hollis et al  see also Tarassenko et al
Withagen  and others	 See also section  The qualitative conclusions
of these simulations were that

 The neuron output must not be larger than 1. Because of the way the neuron
(tanh) derivative is computed, this is a very strict requirement. (If $|y_k| > 1$ the
computed derivative can have the wrong sign.) Non-linearities of the transfer
function can be tolerated to some extent. (In the simulations, non-linearities
in the range 
&


D
y



& were acceptabley for the quadratic cost
function for the entropic the acceptable range is generally smaller	 However
the exact tolerable range will depend on the problem on the network size on
the number of training cycles on the learning rate and on other nonidealities
Thus the qualitative conclusion is more interesting than the actual gures	

 The synapse strength discretization must be suciently ne In the simula
tions at least  bit	

 The weight change offset must be very small. (In the simulations at most

  	 


	

 The neuron error offsets must be small. This applies to the non-targeted
neurons. For the output neurons, a neuron error offset will merely displace the
network output by the offset. (In the simulations the error offset had to be at
most  	 


	
In addition to these constraints the simulations showed that

 The neuron derivative variable discretization can be rather coarse In the
simulations  bit though even three levels  
 	 seems to be adequate in
some situations	

 The nonlinearities in general can be large Up to at least 
& in the simu
lations	

 The offset errors in general can be large. (For instance up to about
& for
the derivative variables	
These very soft requirements on the computing accuracy indicate that, for
instance, inter chip matching is not very important. Offset cancellation on various
signals, on the other hand, will be necessary.
† The lower bound was determined by the target values; obviously these must
be within the neuron range.
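
The strict requirement on the neuron output range can be illustrated with a few lines of code. The sketch assumes that the derivative is formed from the neuron output as one minus the output squared, as stated in the text; the gain error parameter and both function names are hypothetical and serve only to push the output beyond 1.

```python
import math

def derivative_from_output(y):
    """Neuron derivative computed from the output, as 1 - y^2."""
    return 1.0 - y * y

def neuron_output(s, gain=1.0):
    """tanh neuron whose output range is scaled by a (hypothetical) gain error."""
    return gain * math.tanh(s)

# With an ideal output range the computed derivative stays positive; with an
# output range 5% too large it changes sign for strongly driven neurons.
ideal = derivative_from_output(neuron_output(3.0, gain=1.0))
too_large = derivative_from_output(neuron_output(3.0, gain=1.05))
```

A derivative of the wrong sign reverses the direction of the corresponding gradient contribution, which is why an overshoot of the output range is more harmful than a moderate non-linearity of the transfer function.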


 Chip design
The recall mode ANN for which we shall design the RTRL hardware is based on the
backpropagation chip set of the previous chapter. We will disregard the backpropagation
operation modes for this implementation. An important design strategy for the RTRL
system, as for the other systems in the thesis, is to reuse hardware. The $\tilde{w}\,p_{ij}$
multiplier is, of course, implemented using synapse chips, but also at transistor level,
for the width $N$ data path module, we shall reuse hardware. Most of the layout on
this module actually originates from the backpropagation chip set. Like the other
chips, the RTRL hardware chip was designed to validate the system topology, i.e.
as little hardware as possible was put on silicon.
In this section we shall present the design of the width $N$ data path module.
Design details can be found in appendix D. The rest of the RTRL system will
be implemented using discrete components and will be presented later in this chapter.
 The width N data path module signal slice
Compiling the learning components of the RTRL system block diagram that operate
in data paths of width $N$, we arrive at the signal slice block diagram shown below
for a single signal slice. The width
$N$ data path RTRL module (the RTRL chip) basically consists of a number of
these signal slices. For a more detailed circuit schematic, refer to appendix D. As
on the other chips designed so far, the MRC is used extensively for the analogue
computing components.
To avoid oscillations (or race-around) when the neuron activations are sampled,
the $y_k$ sample-and-hold circuit must be edge triggered. The sampler is implemented
using two successive simple track-and-hold samplers, the first using the
inverted clock signal of the second.
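
The behaviour of such a two-stage sampler can be sketched as a simple discrete-event model. This is an idealised behavioural sketch, not the circuit: the class names are hypothetical, and it is assumed that the first stage tracks while the clock is low and the second while it is high.

```python
class TrackAndHold:
    """Ideal track-and-hold: follows the input while tracking, else holds."""

    def __init__(self):
        self.out = 0.0

    def step(self, vin, track):
        if track:
            self.out = vin
        return self.out


class EdgeTriggeredSampler:
    """Two cascaded track-and-holds driven by opposite clock phases."""

    def __init__(self):
        self.first = TrackAndHold()
        self.second = TrackAndHold()

    def step(self, vin, clk):
        held = self.first.step(vin, track=not clk)   # tracks while clk is low
        return self.second.step(held, track=clk)     # tracks while clk is high
```

Because the two stages are never transparent at the same time, the output only changes at the rising clock edge, so a fed-back neuron output cannot race around the loop within one clock phase.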
The $1 - y_k^2$ block, calculating the neuron derivative, is identical to the one
on the backpropagation neuron chip. It is implemented using a two dimensional
MRC based IPM. Likewise, the neuron error subtractor is implemented using a one
dimensional IPM, as on the backpropagation neuron chip (for the output layer).
A two way multiplexor controlled by $T_k(t)$ is used to select whether neuron $k$
has a target at time $t$. If it has not, a zero is given as the neuron error signal. This
implementation ensures a negligible error offset (originating from this circuit) on
the neurons without target values, which according to the simulations is essential
for learning. (The input offset on the succeeding inner product multiplier will,
of course, cause an offset; this is unavoidable. One must ensure that the IPM
(dimension number $k$) input with inherently lowest offset is used for the error
signal.) Controlled by a multiplexor, the other (dimension number $k$) input to the
IPM can be either $p^k_{ij}$ or the corresponding net input derivative variable,
depending on which cost function one chooses.
The inner product multiplier that calculates the weight change is distributed
among the signal slices in exactly the same way that the IPMs on the
backpropagation synapse chip are distributed among the matrix columns. One MRC
multiplier is placed in each signal slice, and the differential current outputs from
these are combined by a single CCII+, giving the chip as a whole the desired $\Delta w$
current output.

Figure: Order N signal slice. Block diagram of signal slice in width $N$ data
path module. The "= k" switches are parts of distributed demultiplexors.
Chip IO nodes are indicated by the bonding pads. The elements below the
dashed line are $p^k_{ij}$ access circuitry.

The $z_j$ demultiplexor is also distributed among the signal slices. The output
of the preceding (off-chip) multiplexor will be a voltage (which is also necessary
for distributing the $z_j$ signal to several width $N$ data path chips). On-chip, $z_j$ is
transformed to a current, which can easily be demultiplexed (as in the signal slice
figure). The output $k$ from the demultiplexor is added to the (current) output $k$ of the
$\tilde{w}\,p_{ij}$ matrix-vector multiplier, the resulting current being converted to a voltage. The
transresistance used for this is equivalent to the input transresistance of the
backpropagation neuron chip (it is an MRC plus an op-amp); the input voltage must
likewise be close to $V_{ref}$ to avoid DC common mode currents in the $\tilde{w}\,p_{ij}$
matrix-vector multiplier. The transresistance value is chosen such that the maximum
effective synapse strengths are in the range
 


jw
ij
j
max
 
 The output level
is large to ensure good accuracy at small inputs nominally V for z
j
" 	  at
the expense of a quickly saturated output
Finally the g
 
k
	 
k
ij
multiplier is a one dimensional IPM This gives a total
of six operational ampliers for the computing hardware in a signal slice
Also included in the width N data path chip is a demultiplexor sampleand
Chapter  Implementation of RTRL hardware Page 

hold circuits and a multiplexor for accessing the derivative variable RAM These
components are drawn below the dashed line in the gure The multiplexor and
demultiplexor are distributed among the signal slices as the z
j
demultiplexor The
sampleandhold circuits one per signal slice	 are simple trackandhold implemen
tations Placing the derivative variable access circuitry on the width N data path
chips as thus indicated means that one D A converter one A D converter and one
derivative variable RAM bank must be present per width N data path chip This is
convenient but the system speed will be low if many signal slices are implemented
on a single chip cf Lehmann  
	 The chip is supplied with two input
channels for the sampling of the p
k
ij
t  	s one channel meant for reading the
RAM and another meant for initialization of the p
k
ij
t  	s to be connected to
zero or a noise generator dependent on which variation of the algorithm one uses	
 Auto oset compensation
Repeatedly noted in this text is the necessity of a low weight change offset, i.e.
the weight change output has to be offset compensated. The weight change offset
compensation circuitry can be placed on the width 1 data path module. However, as
the offset standard deviation will be proportional to the square root of the number
of cascaded width $N$ data path modules (assuming uncorrelated individual module
offsets), the dynamic range of such an offset compensation circuit must be very large
(in principle infinite for an arbitrary ANN size). For this reason the compensation
circuitry should be placed on the width $N$ data path module instead (or perhaps
in addition).†
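
The square-root growth of the summed offset can be checked numerically with a small Monte Carlo experiment. This is a sketch under the stated assumption of uncorrelated, zero-mean module offsets; the Gaussian distribution, the unit standard deviation and the function name are hypothetical choices for the illustration.

```python
import random
import statistics

def summed_offset_std(n_modules, sigma=1.0, trials=20000, seed=1):
    """Estimate the standard deviation of the sum of n_modules offsets."""
    rng = random.Random(seed)
    sums = [
        sum(rng.gauss(0.0, sigma) for _ in range(n_modules))
        for _ in range(trials)
    ]
    return statistics.pstdev(sums)

# Quadrupling the number of cascaded modules should roughly double the offset.
ratio = summed_offset_std(16) / summed_offset_std(4)
```

A compensation circuit placed after the summation therefore needs a range that grows with the system size, whereas a circuit on each module only has to cover the offset of that module.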
The principal schematic of the width $N$ data path module offset compensation
circuit is shown in the figure below. During an auto-offset phase, the $\Delta w$ signal is
disconnected from the output pin and led into a current comparator instead, while the
inputs to the $\Delta w$ computing IPM are brought into a state resulting in an ideal zero
$\Delta w$ current. Now the offset compensation current, controlled by the successive
approximation register (SAR), is adjusted to give a zero comparator input current.
The DA converter To avoid the problems with charge injection and weight
drift of analogue storage we have chosen to store the measured oset in a digital
way as indicated on the gure A D A converter is therefore needed to subtract
the measured oset The summed weight change currents of the width N data path
chips must be converted to a digital signal by the width  data path module Now
the eective weight change oset must clearly be less than say


LSB
w
of this converter if no measures against offset errors are taken on the width 1 data path
module.

† Placing the compensation circuitry on the width $N$ data path module is in
compliance with the observations of analogue computing accuracy of appendix C:
quadrupling the number of connected $\Delta w$ current outputs doubles the resulting
expected offset error. To bring the offset error below a certain value, the offset
canceling circuit precision must then be doubled, which requires quadrupling the
area.

Figure: Current auto zeroing principle. During auto-zeroing an ideally
zero output current $i^{chip}_{\Delta w_{ij}}$ is directed to the current comparator instead of the
chip output pin. The offset is stored in the SAR and subtracted ($i_{cps}$) from
the output. (Schematic blocks: weight change computing IPM with "zero" state input,
offset compensation DAC, comparator referenced to $V_{ref}$, SAR with set bit input,
auto zero/learn output switch.)

Assuming the input range of the converter corresponds to the maximum
synapse weight, the allowable offset (relative to a unit output current) is
#w
ofs

jw
ij
j
max

	 LSB
w
 
If the oset is known to be less than two MRC maximal output currentsy and
if we are using a B

w
" bit data converter a maximum synapse weight of
jw
ij
j
max
" 
 and a learning rate of  "  we need a 
 bit current oset can
celing D A converter While not impossible to implement in a standard CMOS
process Salama et al 
	 a 
 bit DAC is quite area consuming. By sacrificing
monotonicity, fortunately, we can instead cascade two (or more) lower precision
DACs.
Figure: Double resolution D/A conversion. Deliberately introducing
non-monotonicity in weighted sums of D/A outputs increases the overall resolution.
(Plot of summed output versus NL DAC input for a single and a double converter.)

† Apart from a slightly scaled MRC unit cell, the weight change computing
IPM is identical to a row on the backpropagation synapse chip; thus the offset
measurements on these chips give a good estimate of what to expect on the RTRL
chip. Hence this value.

Assume we have two B

and B

bit	 D A converters  and 	 with ideal
output ranges 
  LSB

	
max
 and 
  LSB

		
max
 and maximum rel
ative dierential and integral	 nonlinearities D

and D

 Taking the weighted
sum  of the outputs  and 	 in the following way
 " !
LSB

!D

	
max

max
	  

	
 will have a resolution of

res
  LSB

!D

	 LSB

!D

	
max
 
For D

 LSB

 D

 LSB

this corresponds to B

! B

  bits	 This
is illustrated in gure 

where a sample  is shown for B

" B

"  and
D

" D

"


LSB

 In practice the  resolution will be somewhat coarser
because one cannot make the accurate scaling needed in the weighted sum; one must make
sure that the output of the second converter is scaled with a factor at least as large as the ideal one.
One is with relative ease convinced that the successive approximation register
on gure 

works with this resulting nonlinear D A converter Such analogue
oset canceling using a deliberately nonlinear DAC has also been proposed by
Kaulberg and Bogason 
On the width N data path chip we have used two eight bit standard cell
DACs for the current oset canceling D A converter The voltage outputs of these
control two mutually scaled	 MRCs connected to the input of the weight change
computing IPM A resolution of about  bit could be expected
The current comparator. For high precision offset canceling it is of paramount
importance that the current comparator of the auto zeroing circuit is offset free. This is
accomplished by using a very high input impedance voltage comparator, as indicated
in the figure (see also Domínguez-Castro et al.). During offset canceling,
the chip output current $i^{chip}_{\Delta w_{ij}}$ (the offset encumbered zero output current
$i^{calc}_{\Delta w_{ij}}$ minus the current value of the offset canceling current $i_{cps}$) will be integrated on
the input capacitance, $C_{cmp}$, of the comparator (or actually the node capacitance:
especially for small comparator input capacitances the current source output
capacitances will be significant; this is actually the case for our RTRL chip). Given
enough time, the voltage on the capacitor will eventually exceed any comparator
offset error and saturate the comparator output at the desired value. Given a
comparator gain $A_{cmp}$, input offset $V_{ofs}$, minimum high output $V_{high}$ and a
maximum comparison time $T_{cmp}$, the input current resolution is

$$i_{cmp,res} = C_{cmp}\,\frac{V_{ofs} + V_{high}/A_{cmp}}{T_{cmp}} \approx C_{cmp}\,\frac{V_{ofs}}{T_{cmp}}.$$

For C
cmp
" 

 fF V
ofs
" 
mV and T
cmp
" s we have i
cmp res
  nA To
reduce the comparison time for a given input resolution, source followers can be
placed at the comparator inputs (though the offset error will increase, the input
capacitance can be reduced much).
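
The resolution expression for the integrating current comparator is easy to evaluate numerically. The component values below are purely hypothetical round numbers chosen for illustration; they are not the values of the RTRL chip, and the function name is an assumption.

```python
def comparator_current_resolution(c_cmp, v_ofs, v_high, a_cmp, t_cmp):
    """Smallest input current that is resolved within the comparison time.

    The current must charge the input node past the comparator offset and
    far enough for the output to reach a valid high level.
    """
    return c_cmp * (v_ofs + v_high / a_cmp) / t_cmp

# Hypothetical example: 100 fF node, 10 mV offset, 5 V output swing,
# gain 10^4 and 1 microsecond comparison time.
i_res = comparator_current_resolution(100e-15, 10e-3, 5.0, 1e4, 1e-6)
i_approx = 100e-15 * 10e-3 / 1e-6      # offset-dominated approximation
```

For a reasonably high comparator gain the offset term dominates, so halving the node capacitance or doubling the comparison time halves the smallest resolvable current.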
During normal operation the weight change output voltage will be close to the
reference voltage $V_{ref}$ (cf. above). As the weight change offset would be expected
to be dependent on the output voltage, we use a comparator reference level of $V_{ref}$
(which ensures that the weight change output voltage is close to $V_{ref}$ during offset
canceling). Prior to each comparison the comparator input voltage is reset to $V_{ref}$
to ensure fast comparison. By adding a capacitor and one or more switches, the
comparator offset error can be reduced by standard auto offset canceling techniques
(Geiger et al.), thus improving the comparison time.
The SAR. The successive approximation register can be implemented as bit
slices in a straightforward manner with CMOS multiplexors and dynamic delay
elements. A bit slice of such a SAR is shown in the figure below. For the sake of clarity,
the switches are shown as single transistor switches, though CMOS switches are
actually used. The circuit needs a two phase non-overlapping clock and a start
conversion signal sc which must be high during the 

phase prior to conversion
for the generation of the clock signals and start signal refer to appendix D	
The current comparator must be reset at the 

clock phase and active during the


phase The SAR consists basically of a shift register that shifts a  from MSB
to LSB during conversion and a memory register The bitundertest will apply a
 to the DAC while the current comparator output set bit sb	 is stored in the
register The other bit slices apply the stored bit to the DAC
Figure: SAR bit slice. Single transistor switches are shown for simplicity.
Bit slices are cascaded by connecting SR in to SR out of the next more
significant slice. (Schematic labels: shift register and memory register, each with
clock phases φ1, φ2 and start signal sc; SR in, SR out; sb, the comparator output;
output to DAC; MSB: "1", other: "0".)
 Chip measurements
A  signal slices width N data path module RTRL chip was fabricated in the
m CMOS process. Unfortunately, the process parameters of this particular
batch (the same MPC run as the one the scaled backpropagation synapse chip was
a part of) were outside the specified ranges. This has a severe influence on the
chip performance, especially signal ranges and speed. It is our hope that we can
implement a working system in spite of the poor chip performance (we must raise
the reference voltage to accommodate the reduced input voltage range of, in
particular, the current conveyor); thus we shall present some measurement results
in this section. Note that the chip functions from an architectural point of view,
indicating correct on-chip block interconnection and possibly applicability. A table
of chip characteristics can be found in appendix D.
Nonlinearities and oset errors are comparable to the ones found on the back
propagation chip set when the reduced signal range is taken into consideration A
typical multiplier characteristic from the weight change computing IPM	 can be
seen in gure 

 If the input is large v
y



 V the nonlinearity is D



&
for full scale range inputs D



& Other nonlinearities as the derivative
computing   y

block which characteristics are shown in gure 

compare
to gure 
 
	 are typically in the order of D


& a magnitude that should not
cause any trouble for the learning process In both gures the neuron activation
input v
y
has been varied for dierent values of the multiplexed neuroninput input
v
z
 A considerable oset on the last input is observed jV
zofs
j




mV Whether
this is acceptable will have to be experimentally investigated; it is probably at the
upper limit for an acceptable offset. The large offset is caused by the output offset
of the current conveyor used to transform the $v_z$ input to a current. This current
conveyor is the same as the one used for the multidimensional IPMs and is thus
designed to source a much larger current than necessary for the $v_z$ transresistance;
it should be redesigned.
The other possibly problematic offset error is the output offset error of the
$1 - y^2$ block that computes the neuron derivative. This was also the case for the
backpropagation neuron chip; the offset error is of the same magnitude as for that
chip and must be dealt with in the same way (cf. section ). Other nonidealities
on the chip are acceptable for the learning process.
The input and output of an edge triggered neuron activation sampler are shown in
gure 

which illustrates its applicability to prohibit race around in the feedback
neural network The sample time is in the order of 

 ns the output settling time
seen in the gure is approximately s The charge injection is acceptable
On none of the chips tested did the weight change output auto offset compensation
scheme work. Note that according to SPICE simulations of the circuit, as shown
in gure 

the auto zeroing circuit topology is valid. But whether the circuit
malfunction is caused by a layout error or by the out-of-specified-range process
parameters we have not been able to determine (no layout errors have been found,
though), because of the lack of internal test points. Measurements seem to indicate
that the current comparator does not work, though.
Figure: Weight change IPM element characteristics. Measured
weight change current $i_{\Delta w}$ as function of neuron activation input $v_y$ for different
network inputs $v_z$ (IPM single element characteristics). Notice the reduced
input range and the $v_z$ offset.

Figure: Tanh derivative computing block characteristics. Measured
neuron derivative output as function of neuron activation input $v_y$ for different
network inputs $v_z$ (parabola block characteristics; the plot includes an expanded
middle section). Notice again the $v_z$ offset.

Figure: Edge triggered sampler sampling. Measured half scale range input $v_y(t)$
and sampled output $v_y(t-1)$ as functions of time. The effect of using two cascaded samplers is clearly seen.
Charge injection occurs both at the sample time and half way through the hold
period (at both clock edges).
 System design
For the completion of the RTRL ANN system we need some hardware in addition
to the synapse and neuron chips and the width N data path RTRL chips. As
mentioned in section  we design a RTRL/backpropagation hybrid system (to
save hardware and design time). Thus much of this hardware is basically already
present in our system:
Figure: Auto zeroing simulation (SPICE, using simple DAC and opamp
models). Top: input and output currents. Bottom: current comparator
input voltage. Notice how the output current increasingly accurately
approximates the input current (two trials). The input voltage indicates whether
the output current is too large or too small. (Plotted traces: I(Iin) and I(Vsense)
on axis 1, 0 A to 1.5 uA; V1(Cda) on axis 2, 0 V to 2 V; time axis 0 s to 14 us.)
 The digital weight backup memory
 The digital serial weight updating hardware including A D and D A convert
ers for interfacing the width  data path module when including	
 The nite automaton for system control
 The ANN environment
To complete the RTRL system we also need:

 The multiplexor (the width $N + M$ data path module).

 A derivative variable RAM, including A/D and D/A converters, for each RTRL
chip in the system.
In this section we shall describe the complete system. For a complete system
schematic, refer to enclosure III (see also appendix D).
 ASIC interconnection
The application to which we want to apply the RTRL system is the prediction of
splice sites in pre-mRNA molecules (cf. section ). Using a recurrent network,
we will need at least ten neurons (of which one is an output neuron) and three-letter
inputs (corresponding to one amino acid code) taken from a four-letter alphabet
(Brunak and Hansen). Using unary coding this gives twelve inputs (plus
biases)†. Mapping this topology on our chip set requires two synapse chips and
three neuron chips for the ANN. Assigning two inputs for offset compensation (the
expected total MVM output oset is
p
 	
I
wz
where 
I
wz
is the single chip output
oset standard deviation	 and two inputs for the thresholdsz this gives a RTRL
system topology of  inputs and  neurons
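The square root rule for the expected total output offset can be checked numerically. The sketch below assumes that the single chip offsets are independent, zero mean and Gaussian with a common standard deviation; these assumptions, and all names used, are illustrative and not taken from the chip measurements.

```python
import math
import random


def total_offset_std(sigma_chip, n_chips):
    """Standard deviation of the sum of n independent, equally distributed offsets."""
    return math.sqrt(n_chips) * sigma_chip


def simulated_total_offset_std(sigma_chip, n_chips, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity (Gaussian offsets assumed)."""
    rng = random.Random(seed)
    totals = [sum(rng.gauss(0.0, sigma_chip) for _ in range(n_chips))
              for _ in range(trials)]
    mean = sum(totals) / trials
    variance = sum((x - mean) ** 2 for x in totals) / (trials - 1)
    return math.sqrt(variance)


print(total_offset_std(1.0, 4), simulated_total_offset_std(1.0, 4))
```

Because the offsets add in quadrature rather than linearly, the number of synapses that must be committed to offset compensation grows more slowly than the number of interconnected chips.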
The custom chip interconnections when the system operates in RTRL mode
can be seen in gure 


assuming the neuron and RTRL chips have the same
number of neurons signal slices	 The z
j
t	 multiplexor MUX	 is implemented us
ing a standard cascadable analogue multiplexor The derivative variables RAMs
p RAM	 are implemented as  bit wide static RAMs accessed via  bit data convert
ers The A D converters are fast ash converters as they are the bottle neck in
the system in terms of speed	 For clarity the target values dt		 and target in
dicators T t		 are not explicitly shown These signals are among various control
signals supplied by the environment serial weight updating hardware block
† This particular application does not fit the description "massively parallel,
possibly adaptive, application specific system having a real-world interface" for
analogue VLSI learning systems of chapter  This is a toy network however the
application would benet from an additional 
 neurons or so If enhanced per
formance using network ensembles is employed cf section 	 several hundreds
of neurons could be exploited  such a system would be massively parallel The
inputoutput	 on the other hand would still consist of only 	 bit and can thus
easily be supplied by (say) a standard AT bus harddisc (which is how the input
data is available). A real-world interface is unnecessary.
z Connecting  synapse chips together the expected total output oset is
p
	
I
wz
where 
I
wz
is the single chip output oset standard deviation thus we must commit
 synapses per row to oset compensation cf section 	 We use two threshold
synapses per row because it is often seen that the neuron thresholds have a larger
magnitude than typical synapse strengths
 The system thus has four extra inputs which must be tied to zero. This
illustrates a typical problem when using standard building block components to
implement an application specific system: if the application does not exactly match
the topology of the building block components, hardware is wasted. A solution to
this problem is to make available a range of synapse and neuron chips say  
  and   synapses and   and  neurons	
Figure: RTRL ANN basic architecture. ANN and RTRL chip interconnections
when the system operates in RTRL mode. The blocks RTRLC are RTRL
chips. Input/output lines of the SC blocks are accessible at both top and
bottom (left and right in this figure). (Diagram labels: SC16 synapse chips with
bias inputs, offset synapses, offset/threshold synapses, unused synapses and
mirrored synapse strengths; NC neuron chips; RTRLC chips, each with a p RAM; MUX;
control; environment and serial weight updating hardware; signals x(t), y(t),
y(t-1), z(t), z_j(t), p(t), p_ij(t-1), d(t) and T(t).)
 The width  data path module
Apart from the weight strength backup memory and the system environment the
components not explicitly shown in gure 


are part of the width  data path
module of the RTRL system implementation. Though basically identical to the
backpropagation learning hardware, some minor modifications are necessary.
All synapses in the system are given a unique address $\{i, j, l\}$ reflecting the
backpropagation topology $w^l_{ij}$. Also the $w\,p_{ij}$ MVM for the RTRL algorithm
is given an address in this space. However, except for the offset compensation
synapses, the weights on this chip must be the same as those on the ANN synapse chip with
the $y(t-1)$ inputs (cf. the figure). For this reason the weight backup RAM uses
a filtered version of $\{i, j, l\}$ as address bus, which mirrors the relevant part of the
ANN MVM on the $w\,p_{ij}$ MVM. Doing, for instance, a weight refresh (which
is governed by the PC), one would simply run through all the $\{i, j, l\}$s and all
synapse sites would get the correct weight.
The high precision digital weight updating hardware used in backpropagation
mode (cf. figure ) is also used in RTRL mode, with the addition of a
transresistance at the joined RTRL chip current weight change output for compatibility.
(The comments in section  on the virtues of this architecture apply for the RTRL
mode also, of course.)
On the RTRL chip the digital i signals for addressing the distributed demul
tiplexor and the digital k
 
signals used to access the derivative variables RAM were
erroneously combined to save pins on the chip. The error has been fixed by the
addition of a digital fi k
 
g to i k
 
converter see enclosure III	 The error is still
unfortunate though the lengthy analogue calculation of the weight change signal
can not take place concurrently with the derivative variable RAM updating this
degrades performance
The whole system is controlled via a large number of digital handles For
instance the sample the neuron activation signal 
Sy
 the initiate auto zeroing
signal sc or the current synapse signal fi j lg above The synapse and neuron
chips also needs signals as the forward reverse and learn signals In all some 

control signals are needed including the T
k
t	s Many of these can be combined if
the mutual timing requirements are known for the prototype chip set they were not
at the system design time however Letting the system controlling nite automaton
control the internal timing is a safe and fast designed	 choice As noted in section
 only the low frequency circuit level system performance ie not the speed	 can
be tested with this system	 The system controlling nite automaton must supply
all these 
 control signals. These are placed in registers accessible to a host PC,
which, as said, is used as this automaton.
 The interface
The host PC is interfaced via a standard  bit I O channel at the PC AT ISA	
bus Eggebrecht 	 In addition to the 
 odd digital handles that controls the
RTRL backpropagation system the host PC also must provide various analogue
control signals that are likely to be changed by the user For instance learning
rate and neuron activation steepness control signals Most of these are placed in
 bit serial DACs signals for oset compensation are driven by  bit DACs
To monitor how learning progresses, the PC has access to the weight backup
RAM and the derivative variables RAMs (which requires a substantial amount of
extra hardware). Also, all neuron outputs, as well as the output layer reverse mode
synapse chip currents and the w p

ij
MVM current outputs are available to the PC
via  bit ADCs. (These are used to offset compensate the synapse chips, hence the
high precision.) Having all neuron outputs available also includes the possibility
of using the PC to refresh sampled neuron activations in case the on-chip analogue
samplers should unexpectedly prove inadequate.
As well as acting as the master finite automaton, the PC provides the
environment in which the ANN is placed. It supplies the input signals and target values
via  bit DACs and reads the network outputs via  bit ADCs For our applica
tion we only need binary inputs and targets	 and a  bit sampling of the neuron
activation would most probably be sufficient for the gradient descent algorithm.
As this is a test system, however, we will not prevent ourselves from using analogue
input/output data, as such data sets can give additional information on the system
performance. For this reason high precision data converters are used.
 Algorithm variations
Most of the RTRL algorithm variations listed in section  affect only the
algorithm control mechanism (i.e. the finite automaton) and are thus easily incorporated
in our system:

 Teacher forcing. Ignoring the neuron activation samplers at the RTRL chips
and configuring the ANN MVM inputs as in backpropagation mode (cf. figure ),
the PC can supply the $z^{TF}_j$ ANN MVM inputs used for teacher forcing.
The initialization channel of the derivative variable RAM access circuit is
connected to zero in our system. Thus, when reading the $p^k_{ij}(t-1)$s prior to a
weight calculation, the initialization channels rather than the RAM channels
should be used when sampling $p^k_{ij}(t-1)$s for which $k \in T(t-1)$, to ensure
that the sum in ( ) is taken over $l \in U \setminus T(t-1)$. (This is assuming target
neurons on all neuron chips. If this is not the case, one can explicitly write
zeros in the derivative RAMs instead.) This approach is somewhat inelegant;
designing the system to include teacher forcing would require $N$ two-way
analogue multiplexors.

 Relaxation and pipelining. The system is designed to use pipelining. Using the
relaxation scheme is just a matter of updating the neuron activations and neuron
derivative variables a couple ($T_{PD}$) of times before updating the weights.
Note that in general, whenever the neuron activation target set is known to
be empty, one should omit the weight updating step, partly because of speed
but primarily to avoid unnecessary influence of weight updating offsets.

 Learning by subsequence. Resetting the neuron activation variables between
each subsequence is trivial: one just uses the initialization channels rather than
the RAM channels when sampling the $p^k_{ij}(t-1)$s.

 Random initial states. Though our system is not designed for this, random initial
derivative variables are obtained by letting the PC write random numbers
in the derivative RAMs (instead of using the initialization channel for
initialization). Including the option in the system design involves the placement
of digital pseudo random generators that can override the derivative variable
RAM outputs.

 Momentum weight decay etc The observations on the implementation of
weight decay momentum etc in chapter  also apply for the system in RTRL
mode Note however that though the system is not designed for it it is pos
sible to read the weight changes #w
ij
t	 from the PC which actually enables
weight decay and momentum among other things in the present system in a
Chapter  Implementation of RTRL hardware Page 
somewhat laborious way though
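Two of the control-level variations above can be sketched in software: selecting the zero-valued initialization channel for target neurons under teacher forcing, and omitting the weight update when the target set is empty. This is only a sketch of the control logic; the dictionary model of the derivative RAM, the function names, and the form of the weight change (learning rate times the sum of neuron error times derivative variable, as in standard RTRL with a quadratic cost function) are assumptions made for illustration.

```python
def read_derivative(p_ram, k, target_set, teacher_forcing):
    """Select the source of p^k_ij(t-1) for one fixed synapse {i, j}.

    With teacher forcing, target neurons use the initialization channel,
    which is tied to zero; all other neurons use the RAM channel.
    """
    if teacher_forcing and k in target_set:
        return 0.0
    return p_ram[k]


def weight_change(errors, p_values, learning_rate, target_set):
    """Weight change for one synapse, or None when the update is omitted."""
    if not target_set:
        return None  # no targets: skip the weight updating step entirely
    # Neuron errors are only defined for target neurons.
    return learning_rate * sum(errors[k] * p_values[k] for k in target_set)


p_ram = {0: 0.3, 1: -0.2}
print(read_derivative(p_ram, 0, {0}, teacher_forcing=True))
print(weight_change({0: 0.5}, {0: 0.3}, 0.1, {0}))
```

Returning None rather than zero for an empty target set mirrors the point made in the text: the update step is skipped, so no weight updating offset is accumulated.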
The realtime recurrent learning system is at the time of writing under construc
tion Thus unfortunately we can not present any system level experiments Hope
fully such will be available in the near future
 Nonlinear RTRL
As for the backpropagation system and other gradient descent learning systems	
the hardware implementation of the 
gs	
s neuron activation derivative com
putation is a primary error source in the RTRL system Inspired by nonlinear
backpropagation we shall now derive and show how to implement an analogous
version of the RTRL algorithm in which the derivative computation is avoided
nonlinear realtime recurrent learning NLRTRL	 Actually this nonlinear prin
ciple for approximating a derivative is applicable to any gradient descent like learn
ing algorithm that uses derivatives RTRL virtual targets backpropagation and
variations etc	
 Derivation of the algorithm
Taking the derivative variable denition 

	 as our point of departure
p
k
ij
t	 " g
 
k
s
k
t		
k
ij
t	 
we interpret this as a first order Taylor expansion of the equation
p
k
Nij
t	 " g
k

s
k
t	 ! 
k
ij
t	
	
 g
k
s
k
t		  

	
which denes the NLRTRL neuron derivative variables As for NLBP we could
scale 
k
ij
t	 in this equation by a factor 
N
for a more accurate theoretic Tailor
expansion when the NLRTRL domain parameter 
N
is large However we are
interested in numeric accuracy rather than theoretic accuracy which is why we
choose the domain parameter small 
N
"  as we did in the NLBP case The
NLRTRL algorithm and variations of it	 now simply arrives by substituting p
k
Nij
for p
k
ij
in the equations in section 
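The relation between the two derivative variable definitions can be illustrated numerically. In the sketch below, `net_deriv` is a hypothetical name for the neuron net input derivative variable, and a tanh activation function is assumed; the sketch only compares the product form (derivative times net input derivative variable) with the difference form used by NLRTRL.

```python
import math


def rtrl_derivative_variable(s, net_deriv):
    """RTRL form: g'(s) * net_deriv, with g = tanh (assumed activation)."""
    return (1.0 - math.tanh(s) ** 2) * net_deriv


def nlrtrl_derivative_variable(s, net_deriv):
    """NLRTRL form: g(s + net_deriv) - g(s); no derivative is computed."""
    return math.tanh(s + net_deriv) - math.tanh(s)


# The two forms agree to first order for small net_deriv and differ for large.
for net_deriv in (1e-3, 1e-1, 1.0):
    p = rtrl_derivative_variable(0.3, net_deriv)
    p_n = nlrtrl_derivative_variable(0.3, net_deriv)
    print(f"net_deriv={net_deriv:g}  RTRL={p:.6f}  NLRTRL={p_n:.6f}")
```

Unlike the product form, the difference form is bounded by the output range of the activation function, whatever the size of its argument.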
Simulations using the NLRTRL algorithm have not yet been performed, and
an experimental verification of the algorithm functionality must be done before
an implementation, of course. However, as both RTRL and backpropagation are
gradient descent algorithms  the RTRL p
k
ij
s and 
k
ij
s plays a role very similar
to the 
l
kj
s and 
l
kj
s respectively of backpropagation  and as NLRTRL and
NLBP are derived in completely analogous ways, we should expect the simulated
performance of NLRTRL to be very similar to that of RTRL. The performance of
a NLRTRL hardware implementation is, of course, expected to be superior to that
of RTRL, as the derivative computation is omitted and as a signal slice multiplier
is saved.
 Hardware implementation
When mapping the NLRTRL algorithm on hardware, it turns out to be advantageous to
use a time offset weight updating scheme
w
ij
t! 	 " w
ij
t	 ! #w
ij
t  	 
rather than 

	y Now assuming we use the discrete time NLBP neurons of
gure 
 
and 
 
possibly without the extra error input i

k
	 and assuming we
are using the quadratic cost function the discrete time system of gure 

takes
the form shown in gure 


Figure: Nonlinear RTRL system. Block diagram. This topology
considerably reduces the order N data path hardware compared to the original
RTRL system and is most probably more accurate. The system uses delayed
targets.
† This is actually the weight updating scheme originally proposed by Williams
and Zipser.
The system is operated as follows: At the end of a learning cycle, $y(t-1)$ is
sampled. Then $x(t)$ and $d(t)$ are applied and $y(t)$ is computed asynchronously.
Now, unlike the original system, the neuron chip samples $y(t)$. For each $\{i, j\}$,
p

ij
t 	 is read from the RAM after which the w p

ij
multiplier and the demulti
plexor adds 

ij
t	 to the ANN MVM output forcing the neuron chip to calculate
p

ij
t	 This as well as the concurrently computed w
ij
t ! 	 is stored in the
respective RAMs
A few notes on the architecture As the demultiplexor and both MVMs have
current outputs adding 

ij
t	 to st	 is trivial However doing this 

ij
t	 is not
explicitly present in the system and we must thus refrain from using the entropic
cost function As the neuron activations are sampled by the neurons for the p
k
ij
computation the neuron activation sampler on the NLRTRL chip need not be
edge trigged ie one opamp can be saved in all three opamps or almost 
&
of the computing hardware	 are saved per NLRTRL signal slice not counting the
derivative variable RAM access circuit	
	 Further work
As for the backpropagation system we need to carry out system level experiments
in order to evaluate the applicability of our proposed chip set These experiments
could suitably include simulated ie using the PC to perform the digital xed
point computations	 weight change threshold momentum weight decay etc to
verify the applicability of these algorithm variations these simulations can also be
performed using the system in backpropagation mode	 These experiments are
presently carried out at our institute Other obvious future tasks include a reman
ufacturing and possibly redesign	 of the weight change oset canceling circuit as
well as simulation on and implementation of the proposed NLRTRL algorithm
Considerations on process parameter dependency canceling and temperature
compensation are necessary before a volume production cf section 	 just as
was the case for the other chips designed so far Also the considerations on the
implementation of high accuracy calculations mentioned in section  as chopper
stabilizing and weight change threshold	 apply to the RTRL chip
In addition to these tasks further work on VLSI implementations of the RTRL
or NLRTRL	 algorithm could include a continuous time system version
 Continuous time RTRL system
One of the very nice features of analogue signal processing is the inherently
asynchronous functionality. In our systems so far we have ignored this and dealt with
sampled time systems. The backpropagation algorithm can fairly easily be
implemented in an asynchronous way because of the feed forward architecture. This is
not so easy for the RTRL architecture, though it would be very attractive to do so.
In this section we shall roughly outline how a continuous time (or asynchronous)
version of the (NL)RTRL algorithm could be implemented in analogue hardware.
Now instead of using sampleandhold circuits as feedback in the ANN cf
gure 


 see also gure 
B
	 we use continuous time low pass lters having
transfer functions H
y
k
s	 Note that the lowpass lter can only hinder parasitic
oscillations caused by nonideal electrical components. As the signs and magnitudes
of the connection strengths in the synapse matrix are unknown, these can cause
oscillation. It is the learning scheme that must adjust the weights to prevent
oscillation, or indeed to cause it. This applies also to the discrete time system,
for that matter. Using a continuous time feedback, the neuron input definition


	 becomes
$$
z_j(t) = \begin{cases}
x_j(t), & \text{for } j \in I \\
h_{y_j}(t) * y_j(t), & \text{for } j \in U
\end{cases}
$$
where h
y
j
t	 is the impulse response of the H
y
k
s	 ltery For input signal fre
quencies much lower than the lter cuto frequency f
y
use for instance H
y
k
s	 "
  sf
y
	

	 the network will work as a relaxation network though a re
laxed state can be an oscillation	 For input frequencies approaching the cuto
frequency the network will function somewhat like a pipelined network we can
ascribe the lter a delay of say lnf
y
	z
Choosing for example the NLRTRL neuron derivative variable denition
p
k
Nij
t	 " g
k

s
k
s	 ! 
k
ij
t	
	
 g
k
s
k
t		 
we now propose to compute the neuron net input derivative variables as

k
ij
t	 "
X
lU
w
kl
t	
h
h
p
l
ij
t	  p
l
Nij
t	
i
! 
ik
z
j
t	 
where we as for the ANN itself have substituted lowpass lters H
p
l
ij
s	 for the
delay elements To ensure stability the lowpass lters are needed also in this
equation Now if we select
H
p
k
ij
s	  H
y
k
s	 
it still holds that 
y
k
t	
w
ij
t	 " p
k
ij
t	 or 
y
k
t	
w
ij
t	  p
k
Nij
t		 as for the
original algorithm
y  denotes convolution ht	  yt	
def
"
R
t

h 	yt   	 d 
‡ This is of course vastly simplified. See e.g. Gabel and Roberts for details
on linear systems.
Finally we must formulate the weight updating rule in continuous time Quite
trivially 

	 generalizes to an integration
w
ij
t	 " w
ij

	 
Z
t


J  	

w
ij
 	
d
"











w
ij

	 ! 
Z
t

X
kU

k
 	p
k
Nij
 	 d Quadratic
w
ij

	 ! 
Z
t

X
kU

k
 	
k
ij
 	 d Entropic

where we have expanded the equation for the cases of the quadratic and the entropic
cost functions
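For the quadratic cost function case, the continuous time weight update can be approximated numerically by a forward Euler integration. This is only a software sketch under stated assumptions: the learning rate name `eta`, the callables supplying the neuron errors and the NLRTRL derivative variables, and the choice of Euler integration are all illustrative and say nothing about how an analogue integrator would be built.

```python
def integrate_weight(w0, eta, error_fn, p_fn, neurons, t_end, dt):
    """Forward Euler approximation of
    w(t) = w(0) + eta * integral of sum_k error_k(tau) * p_k(tau) d tau.

    error_fn(k, t) and p_fn(k, t) are assumed callables returning the neuron
    error and the derivative variable of neuron k at time t.
    """
    w = w0
    steps = int(round(t_end / dt))
    for n in range(steps):
        t = n * dt
        w += eta * sum(error_fn(k, t) * p_fn(k, t) for k in neurons) * dt
    return w


# Constant error and derivative variable: the weight grows linearly in time.
w = integrate_weight(0.0, 0.1, lambda k, t: 1.0, lambda k, t: 2.0,
                     neurons=[0], t_end=1.0, dt=0.01)
print(w)
```

In hardware, every synapse would need its own integrator for this update, which is the root of the area cost discussed next.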
Thus implementing the NLRTRL algorithm in continuous time has one unsaid disadvantage: it
is necessary to do a fully parallel (i.e. O(N^ ) area) implementation, thus
limiting this approach to fairly small systems. Needless to say, continuous time
NLRTRL is incompatible with our present ANN architecture, which uses serial,
discrete time weight access. Further research is needed before a continuous time
recurrent learning system can be implemented; among other things, the simpler
discrete time version of the algorithm should be proven operational. For continuous
time recurrent learning systems, see also Ramacher and Schildberg.
 Other improvements
Essentially composed of components also found on the backpropagation chip set,
several issues of the RTRL chip are subject to improvement. A list of these can
be found in appendix D.

 Summary
In this chapter addon hardware for applying realtime recurrent learning to our
cascadable ANN chip set was designed The basic learning algorithm was dis
played and the applicability of common algorithmic variations for the implementa
tion in analogue VLSI was discussed It was shown that doing an ON

	 ON

	
space time division of the ON
 
	 computational primitives per time step was a
good choice with respect to scalability hardware cost ANN system compatibility
and implementation ease A system level implementation in which most of the
computations the order ON

	 part	 were performed by a synapse multiplier was
presented Results from simulations modeling analogue VLSI nonidealities in this
architecture were found in compliance with like simulations by others The sys
tem is generally tolerant to nonidealities with the exception of the weight change
osets hidden neuron error osets and neuron derivative computation
Chapter  Implementation of RTRL hardware Page 
The design of an RTRL chip for computing the O(N) part of the
computational primitives was displayed. This chip was implemented using almost
exclusively components from the backpropagation chip set. A weight change offset
compensation circuit based on a DAC resolution enhancement technique was
included on the chip. Unfortunately, this chip was malmanufactured, which resulted
in a very poor computation speed and a reduced signal range (possibly also the
malfunction of the offset canceling circuit, which did only function in simulations).
A complete RTRL system design based on our various chips was displayed
A  inputs  neurons test system for premRNA splice sites prediction was
chosen The learning hardware not implemented on the ASICs  basically the
O		 weight updating hardware  is the same as used in the backpropagation
system the possibility of using mostly digital weight updating hardware is adding
to the virtues of the proposed silicon mapping	 For ease of test the system
is controlled by a PC Various algorithmic variations can be simulated via this
interface The system is presently under construction thus no experimental results
were presented
A nonlinear version of realtime recurrent learning was proposed A system
level implementation of this algorithm was shown and we saw that the implemen
tation was superior to the RTRL implementation in terms of both hardware cost
and computational accuracy We argued that the applicability of this algorithm
would be similar to that of nonlinear backpropagation
Finally we proposed a continuous time version of the nonlinear realtime
recurrent learning algorithm for analogue VLSI
Chapter 
Thoughts on future analogue
VLSI neural networks
In this chapter some odds and ends on future analogue VLSI neural network design
(which did not fit into the other chapters) are collected. Firstly, some thoughts
on gradient descent learning using analogue VLSI are presented. Secondly, we
propose an ANN architecture for massively parallel systems that maps well on
hardware. Thirdly, we point out that using analogue VLSI neural network
ensembles must be a future trend, and we propose a weight refreshing scheme based on
such ensembles. Finally, we reflect on combining read-only and plastic synapses
in analogue VLSI computational neural networks.
Chapter  Thoughts on future analogue VLSI neural networks Page 

  Gradient descent learning
In this work we have investigated the possibility of implementing supervised
learning (with a teacher), or more precisely gradient descent learning, in analogue
VLSI neural networks. Being wiser from our experiments carried out so far, we can
ask the question: does gradient descent learning in analogue VLSI have a future?
Perhaps. Though unsupervised learning or learning with a critic is possibly better
suited for the technology (because of the lack of a good analogue memory), the
need for massively parallel pattern recognition engines is evident (e.g. Hertz et al.;
Sanchez-Sinencio and Lau; Ramacher and Rückert), which strongly
encourages the development of efficient learning-with-a-teacher algorithms.
Even should our RTRL/backpropagation system prove able to solve classification/regression tasks (which is indeed our expectation), this small scale system does not prove the applicability of the learning schemes for massively parallel implementations. And though we believe it demonstrates some noteworthy points, it is surely not the ideal solution. From an analogue VLSI point of view, present learning-with-a-teacher schemes have some serious drawbacks, most notably in relation to offset errors (and, for very large systems, probably also in relation to the dynamic range of the synapse strengths). Eliminating offset errors (by offset cancelling or chopper stabilizing), as proposed in this thesis, is a solution probably ensuring the applicability of gradient descent learning. However, it is not a good solution: it requires precision circuitry and is thus not neural in nature. A neural way to deal with weight change offsets could be the use of a weight change threshold, perhaps induced via a highly nonlinear weight change multiplier. Other (or additional) procedures to deal with weight change offsets could be to somehow (i) increase the learning loop gain (corresponding to a very large learning rate), or (ii) AC-couple the learning hardware to remove the dependency on DC offsets.
At any rate it is our profound belief that the ultimate implementation of
hardware learning with a teacher requires research in learning algorithms as well
as research in the electronic implementations Such research must of course focus
on the limitations of analogue VLSI and the resulting learning algorithms should
resemble popular simulated algorithms in order to attract application people The
elegant solutions found in nonlinear backpropagation or weight perturbation
to come around the problem of computing neuron activation derivatives are good
examples of how to bend the algorithm to meet the technology requirements The
human brain is obviously capable of doing reliable learning using an inaccurate
technologyy Perhaps we should seek further inspiration from neurobiology
y The computational part of the brain being totally embedded in sensors and
actuators can not possible use learning with a teacher as we know it This does
not mean that such learning algorithms are somehow inferior we must always use
all possible information when solving a problem ie we must use learning with a
teacher when we can	
 Neuron clustering
Our present scalable ANN architecture, which in principle can implement any ANN topology, is well suited for many present applications. For the implementation of huge massively parallel systems, however, the scalable principle will not hold. The implementation of neurons with, say, millions of synaptic inputs is beyond the capabilities of the technology. The required dynamic range of the synapses and the ability of the neurons to sink current put a bound somewhere on the neuron fan-in, to say nothing of the impact of offset errors and of electrical parasitics degrading the speed of such an architecture. The chip input/output bottleneck and inter-chip routing would also be problematic. One could imagine that learning in such a system using conventional algorithms would be difficult (Krogh et al.; Benedict; Haykin; Houk), even using an ideal learning machine; using the limited precision technology of analogue VLSI, learning would (almost surely) prove very hard: increased fan-in requires increased precision (cf. section  , section  , section  ).
For applications using huge networks (this could be in robotics, for instance), an alternative network topology must be found. From an analogue VLSI point of view some kind of neuron clustering would be advantageous: we propose an architecture consisting of sparsely interconnected modules of densely connected neurons (see the neuron clustering figure). The individual clusters would solve different, reasonably complex subproblems (possibly the same problems as other clusters, to ensure fault tolerance at a module level; see also section  ), i.e. the global problem will be solved in a divide-and-conquer manner. See also Jacobs et al.; Haykin; Houk. Note that the input data structure of some problems (e.g. that of visual systems) does need large fan-in input layer neurons (see e.g. Sanchez-Sinencio and Lau; Masa et al.), so we would still need large fan-in (cascadable) modules dedicated to such peripheral tasks. For high level data processing, however, the need for large fan-in is not so evident (compare to models of the human brain: Rumelhart et al.; Miles and Rogers; Joublin et al.; Mountcastle); this neuron cluster ANN topology might be applicable for general high level computational ANNs.
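To make the fan-in argument concrete, the following sketch builds a connectivity mask for sparsely interconnected clusters of densely connected neurons and measures the resulting fan-in. The function names, the random choice of inter-cluster links and the example sizes are illustrative assumptions, not a topology taken from this work.

```python
import random


def clustered_mask(n_clusters, cluster_size, inter_links, seed=0):
    """Return an N x N 0/1 mask; mask[post][pre] == 1 means a synapse exists.

    Neurons inside a cluster are fully connected (self-connections included).
    Every ordered pair of distinct clusters gets `inter_links` randomly
    placed synapses (duplicates may coincide, so fewer can result).
    """
    rng = random.Random(seed)
    n = n_clusters * cluster_size
    mask = [[0] * n for _ in range(n)]
    # Dense connections inside each cluster.
    for c in range(n_clusters):
        lo = c * cluster_size
        for post in range(lo, lo + cluster_size):
            for pre in range(lo, lo + cluster_size):
                mask[post][pre] = 1
    # Sparse connections between clusters.
    for a in range(n_clusters):
        for b in range(n_clusters):
            if a == b:
                continue
            for _ in range(inter_links):
                post = a * cluster_size + rng.randrange(cluster_size)
                pre = b * cluster_size + rng.randrange(cluster_size)
                mask[post][pre] = 1
    return mask


def max_fan_in(mask):
    """Largest number of synaptic inputs seen by any neuron."""
    return max(sum(row) for row in mask)


mask = clustered_mask(n_clusters=4, cluster_size=8, inter_links=2)
synapses = sum(sum(row) for row in mask)
```

In this construction the fan-in of any neuron is bounded by `cluster_size + (n_clusters - 1) * inter_links`, so within a cluster it does not grow with the total number of neurons, in contrast to a fully connected network where it equals the network size.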
The cluster size and topology will obviously have a tremendous impact on system functionality: implicit constraints are put on the problem architecture. Thus, for an efficient system we need to do extensive research in the areas of cluster size and topology and cluster interconnection architectures. Also, the problem of teaching such a system needs to be addressed. Questions such as "do we need more than one type (size, topology) of clusters?", "must the cluster modules be reconfigurable?" and "should the modules be cascadable?" would be asked. The learning scheme should probably involve both supervised and unsupervised learning. Now, it would be most convenient if a single cluster of neurons would fit on a chip. In this case the problem with the chip input/output bottleneck would be reduced, and inter-module communication could efficiently take place using robust neuron activation signals (e.g. pulse frequency modulation). On-chip communication could also be optimized (with respect to speed, power consumption, area, etc.) when the local ANN topology is known a priori and no off-chip communication of intermediate signals (such as synapse outputs) takes place. Using the CMOS VLSI technologies of today, the integration of about   neurons and   synapses (including learning hardware) would be realistic.
Figure: Neuron clustering (diagram with inputs x1, x2 and output y1). From a VLSI point of view this is a very attractive network architecture. The different blocks could be individual chips. Possibly some kind of restricted cascadability/reconfigurability would be needed.
The high level data processing part of the human brain (the cerebral cortex) is organized in a number of vertical layers. Communication within the cerebral cortex is mostly between such layers and is distinctively local in nature: there are only few long-distance connections. Further, the cerebral cortex seems to be organized in (more or less) disjoint patches (Rumelhart et al.; see also Miles and Rogers). This has inspired some authors to propose a modular model of the cerebral cortex, arranging the neurons in disjoint, densely connected columns that are mutually sparsely interconnected (Mountcastle; see also Joublin et al.), very like our proposed cluster architecture for analogue VLSI implementations. We believe that this topology would be a good starting point for investigations on ANNs with clustered neurons. Others have also proposed such neuron clustering architectures, inspired by the hardware friendly sparse connectivity (Joublin et al.).
It should be noted that many reported chip architectures actually resemble the proposed neuron clustering architecture, in the sense that a densely interconnected (fixed, or reconfigurable to some extent) ANN architecture is integrated on-chip in a non-expandable way, using neuron activations for inter-chip communication (e.g. Graf and Henderson; Valle et al.; Castro et al.; Hamilton et al.; Serrano et al.). However, to the best of our knowledge, an exhaustive investigation of chip architectures vs. system generality has yet to be carried out (though cf. Johansen and Foss on the problem of modelling complex systems with ensembles of simple models).
 Self refreshing system
As mentioned in section  , one of the major concerns of analogue VLSI ANN research is the issue of synapse strength storage, especially in connection with on-chip learning. The only true long-term analogue memories (such as floating gate devices) are tedious to program, probably expensive, and not well suited for adaptive systems. The area penalty of high resolution on-chip digital storage, or of some kind of quantize-regenerate scheme, compromises the advantages of analogue VLSI. Finally, for systems where the use of external components is unacceptable, or where parallel weight updating is necessary, using capacitive storage with a RAM backup (as in our work) is inapplicable.
Now (still referring to section  ), for certain applications using an unsupervised learning algorithm or learning with a critic, we can in an elegant way apply refresh by relearning on systems using simple capacitive storage, using pure analogue signal processing circuitry. The question that now arises is this: is it possible, without storing training patterns, to apply a similar refresh-by-relearning scheme in application areas using learning with a teacher (such as pattern recognition and related tasks)? Using the self-repair properties of neural network ensembles, this is the case for certain applications.
 Neural network ensembles
A neural network ensemble (Hansen and Salamon; Hansen et al.) is a collection of neural networks (often topologically identical) trained using different initial states to solve the same problem. (The training algorithm applied could for instance be backpropagation.) Applying the ensemble to a given problem, the solution given by this is a consensus decision based on the individual network outputs. This could be the majority decision for binary outputs, or a weighted sum decision (such as stacked regression, LeBlanc and Tibshirani) for analogue outputs. Now, provided that

 1. the individual networks perform reasonably well, and

 2. the errors of the different networks are to some degree independent,

the consensus decision will be superior to that of the individual networks. More specifically, for a classification task where the probability of making a misclassification is $p$ for each individual network, and provided the network errors are independent, the error probability for an ensemble of $N_E$ networks is

$$p_E = \sum_{k > N_E/2}^{N_E} \binom{N_E}{k} \, p^k (1-p)^{N_E-k}.$$

For $p < 1/2$ we have $p_E \to 0$ for $N_E \to \infty$. Using ensembles of $N_E =$   networks, for instance, Hansen et al. have reported a  %- % improvement of the individual network performance on a handwritten digit recognition problem.
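The ensemble error probability can be evaluated directly from the binomial sum. The sketch below assumes, as the formula does, that the individual network errors are independent and that the majority decision is wrong when more than half of the networks are wrong; the function name and the example values of $p$ and $N_E$ are illustrative only.

```python
from math import comb


def ensemble_error(p, n_e):
    """Probability that more than half of n_e independent networks err.

    p   : misclassification probability of each individual network
    n_e : number of networks in the ensemble
    """
    return sum(comb(n_e, k) * p**k * (1 - p)**(n_e - k)
               for k in range(n_e // 2 + 1, n_e + 1))


# Individual error rate below 1/2: the ensemble error shrinks as n_e grows.
errors_good = [ensemble_error(0.3, n) for n in (1, 3, 5, 51)]
# Individual error rate above 1/2: the ensemble makes things worse.
error_bad = ensemble_error(0.7, 51)
```

The second case is a reminder that the first proviso above (reasonably well performing individual networks) is essential: a majority vote amplifies whatever tendency the individual networks have.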
While improving performance is always important (sometimes even at a very high cost), the properties of neural network ensembles are particularly important in relation to analogue VLSI ANN implementations:

 1. Fault tolerance. While the potential fault tolerance of neural networks is repeatedly emphasized in the literature, networks trained using standard approaches (such as backpropagation) do not exhibit a very high degree of fault tolerance (Serbedzija and Kock; Woodburn et al.; Neti et al.). Though insensitive to small weight perturbations (recall that the gradient descent solution is given by $\partial J / \partial w_{kj} = 0$), the network is not insensitive to the complete loss of a synapse, as the network architecture is kept as simple as possible to ensure good generalization abilities. Using neural network ensembles provides a simple way to introduce fault tolerance. From an analogue VLSI point of view this fault tolerance is important (because analogue signal processing requires better components than digital signal processing and is thus more sensitive to processing errors). For a fault tolerant system the consensus decider must also be implemented using redundant hardware.

 2. Enhanced performance. Implementing ANNs in the limited accuracy technology of analogue VLSI, the performance of our ANN solutions is bound to be inferior to that of high precision simulated networks (see e.g. section  ; Tarassenko et al.; Lansner; Ramacher and Rückert). Whether this is acceptable or not is application dependent; if it is not, neural network ensembles provide a simple way to enhance the analogue ANN performance. It regularly happens that an ANN trained using gradient descent gets stuck in a local minimum of the cost function (i.e. the network does not solve the task at hand after learning). For recall mode systems this is not a fatal incident: one can just repeat the training phase with other initial weights. For an adaptive system trained on-line this is not so: there will be only one chance to learn the task. The enhanced performance offered by ANN ensembles might well prove crucial in such systems.

The implementation of a neural network ensemble is very simple, requiring only the design of the consensus decider (assuming we have a collection of working neural networks). As mentioned in the previous chapters, such simplicity is important to analogue VLSI design. Further, duplicating a whole ANN, say, $N_E =$   times for the implementation of an ensemble is computationally a very expensive procedure. Thus, even for moderate size networks, parallel computations might be necessary. For a dedicated hardware implementation, ensemble methods are hardware hungry; thus the potentially small sizes of analogue computing elements make analogue VLSI an ideal technology for neural network ensembles. And vice versa.
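The two kinds of consensus decider mentioned earlier (majority decision for binary outputs, weighted sum for analogue outputs) are simple enough to state in a few lines. This is a behavioural sketch only: the function names, the +1/-1 output coding, the tie rule and the example weights are assumptions, and nothing here models the redundant hardware a fault tolerant decider would need.

```python
def majority_decision(votes):
    """Majority vote over binary network outputs coded as +1 / -1.

    A tie is resolved to +1, an arbitrary choice made for this sketch.
    """
    return 1 if sum(votes) >= 0 else -1


def weighted_sum_decision(outputs, weights):
    """Weighted sum of analogue network outputs.

    The weights are assumed to be given (e.g. found by a stacked
    regression procedure); they are not computed here.
    """
    if len(outputs) != len(weights):
        raise ValueError("need one weight per network output")
    return sum(w * y for w, y in zip(weights, outputs))


binary_consensus = majority_decision([+1, +1, -1])
analogue_consensus = weighted_sum_decision([0.2, 0.4, 0.9], [0.5, 0.25, 0.25])
```

Both functions reduce to a summation followed by, at most, a threshold, which is the kind of operation that maps naturally onto current summing analogue hardware.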
The consensus decision of a neural network ensemble, being superior in performance to the individual networks, provides a way to do self-repair in a system haunted by degrading weights (Hansen and Salamon). Using the consensus decision as target values for all the networks, we simply apply a supervised on-line learning algorithm to make weight updates after each presentation of input patterns while the system is running in recall mode (we call this a consensus trainer). Under certain conditions this scheme can keep weight deterioration in check; this can be exploited for weight refresh in analogue VLSI neural systems using simple capacitive storage and on-chip learning with a teacher (see the figure of the self refreshing ANN system).
Figure: Self refreshing ANN system (block diagram labels: Inputs, Neural Network Ensemble, Consensus Decider, Output, Targets; synapse detail with signals CH, iOUT, vIN, vERROR). The self-repair properties of a neural network ensemble are used to retain the weights stored on the pure capacitive analogue synapse storage. Using occasional external target values (read: adaptive systems) will prolong the time to memory exhaustion.
Now assume a weight perturbation gives a proportional network error proba
bility change whether caused by the learning algorithm or the weight deterioration
Chapter  Thoughts on future analogue VLSI neural networks Page 
mechanism	y and assume we can model the discrete time weight change as
#w
kj
t	 " 

J
C
t	

w
kj
t	
 z 
learning scheme
!

 ! nt	
	
T
cyc
 z 
weight deterioration

where  is a constant worst case	 weight droop rate and nt	 is a noise term and
where T
cyc
is the time needed to do a full weight matrix update learning cycle	 and
J
C
is a cost function computed using the consensus decision as target values the
consensus cost function	 Note that the T
cyc
product is equivalent to a weight
change oset	 In this case it turns out that when the weight restoration eciency
%  T
cyc
	 is larger than a critical value %
crit
 the system performance will
remain largely unchanged over a period of time that depends on the noise level For
%  %
crit
there is an abrupt transition to a regime where the system performance
degrades rapidly Hansen and Salamon 	
Though an ANN ensemble using a consensus trainer can sustain a weight deterioration which would rapidly destroy a non-retrained ensemble (to say nothing of a single network), the lifetime is finite: patterns once misclassified by the consensus decider are never relearned, as all the networks are trained to imitate the misclassification, and the probability of making new misclassifications is nonzero, partly because of noise (Lehmann and Hansen). As for other refresh-by-retraining schemes, the system would tend to forget the classification of scarcely occurring input data. If it is possible occasionally to apply an external teacher, which would be the case in adaptive systems, the lifetime might be profoundly improved. Actually, in such systems the inherent forgetfulness of the systems might be an advantage, if old memories are considered irrelevant.
Intuitively the critical restoration eciency would be expected to increase as
the system size increases thus limiting the network size to which we can apply
a consensus trainer for weight refresh in an analogue system Further for very
large systems the extra hardware used for the ensemble might be a hindrance for
implementation for small systems on the other hand the reduction in hardware
cost for a more complex refreshing scheme might easily accommodate the extra
hardware for consensus refresh  especially if the improved performance of the
ensemble is taken into consideration
In all We propose to use a consensus trainer to do refresh by relearning
in small adaptive analogue ANN systems with onchip learning with a teacher
that use simple capacitive weight storage and function in an ever changing hostile
environment The applicability of the scheme is a subject for ongoing research
y Because of the pronounced nonlinearity of ANNs this is a highly inaccurate
though conservative  approximation Assume we use gradient descent learning
with small learning rate and a quadratic cost function and approximate the neuron
activation functions by gs	 " s for jsj   and gs	 " signs	 otherwise For
saturated neurons a small weight change does not alter the error probability For
nonsaturated neurons the output error change is linear in the weight change
Chapter 	 Thoughts on future analogue VLSI neural networks Page 
presently carried out at our institute Lehmann and Hansen 	 Of course
memory destruction at power loss will for most articial neural systems be un
acceptable We will need means to read the volatile synapse strengths for backup
purposes and for replication Or alternatively we could use a combination of
readonly and dynamic synapse memories
 Hardsoft hybrid synapses
Even if we could, in a convenient manner, read the synapse strengths of an ANN using simple capacitive storage, the memory restoration after a power loss, or the synapse strength downloading for volume manufacturing, might well prove tedious. In both cases retraining would be necessary, and an in-system backup memory for in-site memory restoration is not necessarily compatible with an analogue ANN system. If only short term adaptations are needed, a solution could be the use of hard/soft hybrid synapses consisting of a preprogrammed non-volatile (possibly read-only) part and a perturbation stored in a volatile capacitive memory. The idea is to (i) use a predetermined template (for instance obtained via simulations, as in Masa et al.) to generate the non-volatile (hard) parts of the synapse strengths during manufacturing (implemented for instance as scaled transistors or floating gate devices), and to (ii) use an on-chip learning algorithm for adapting the system via the volatile (soft) part of the synapse strengths. The hard memory part would reflect the behavioural model of the system, and the soft memory part would reflect the current working conditions.
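The division of a synapse strength into a hard and a soft part can be summarised in a small behavioural model. The class below is a sketch under assumptions: the class and method names, the multiplicative leak model for the capacitive part and the numerical values are all hypothetical and are not taken from any circuit in this thesis.

```python
class HybridSynapse:
    """Synapse strength = non-volatile 'hard' part + volatile 'soft' part."""

    def __init__(self, hard, leak=0.01):
        self.hard = hard   # template value fixed at manufacturing time
        self.soft = 0.0    # perturbation held on a (leaky) capacitor
        self.leak = leak   # assumed fraction of the soft part lost per step

    @property
    def weight(self):
        """Effective synapse strength seen by the network."""
        return self.hard + self.soft

    def learn(self, dw):
        """On-chip learning only ever changes the soft part."""
        self.soft += dw

    def tick(self):
        """One time step of capacitor leakage on the soft part."""
        self.soft *= 1.0 - self.leak

    def power_cycle(self):
        """A power loss erases the soft part; the hard part survives."""
        self.soft = 0.0


synapse = HybridSynapse(hard=0.5)
synapse.learn(0.2)                  # adapt to the current working conditions
adapted_weight = synapse.weight
synapse.power_cycle()               # the behavioural template is retained
restored_weight = synapse.weight
```

After a power loss the system falls back to its preprogrammed behaviour and only the short term adaptation has to be relearned.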
Consider for instance a robot walking on sand, snow, earth, pavement, mud, pebbles... The basic locomotive behaviour could be preprogrammed (place one leg in front of the other, keep the balance, etc.), while temporary adaptations to the current ground cover would be determined using real-time learning. In such applications adaptive analogue systems would be tremendously powerful.
Chapter 
Conclusions
In this thesis the implementation of three analogue VLSI neural systems was presented: (i) a cascadable chip set for recall mode neural networks, (ii) a variation of this chip set including on-chip error backpropagation learning in a hardware efficient way, and (iii) add-on hardware for doing real-time recurrent learning on the cascadable chip set, using a realistic amount of hardware and time. The recall mode system was tested experimentally both at a chip (electrical) level and at a system level. The two learning systems were tested at a chip level; the learning systems are, at the time of writing†, under construction.
During our recall mode chip development we reviewed different chip and network architectures as well as different basic building block components, such as memories, multipliers and thresholding functions. We settled on a two chip cascadable system (a synapse chip and a neuron chip, using analogue voltage/current signalling) capable of, in principle, implementing any ANN topology using first order deterministic neurons. Also, we chose to use simple capacitive synapse strength storage with a digital RAM backup, a MOS resistive circuit based synapse multiplier, and a hyperbolic tangent neuron activation function based on parasitic bipolar transistors. In addition to the basic synapse chip, we showed how a special sparse input synapse chip could efficiently exploit the chip input bandwidth when unary coded network inputs were used.
A   synapses synapse chip and a  neurons neuron chip were fabricated in a standard   µm CMOS process. Our measurements on these chips showed a   bit synapse resolution, nonlinearities below  % (on most quantities below  %), and offset errors below  % (on most quantities below  %): magnitudes compatible with a large range of applications. A system using full size (   ) synapse chips was estimated to do   GCPS per synapse chip. Measurements on a  - -  experimental perceptron (solving the sunspot prediction problem) based on the chip have shown a learning error slightly worse than that of an ideal simulated network.
† September
For classication and regression tasks multi layer perceptrons trained by er
ror backpropagation are often employed A fully parallel VLSI implementation
of this algorithm gives a ON

	 improvement in speed compared to a serial im
plementationy usually at the cost of  times the synapse hardware and twice the
inter module wiring compared to a recall mode system If the physical size of the
system is important or if the learning scheme is employed only occasionally the ad
ditional hardware can severely restrict the system applicability We showed how to
implement backpropagation without any extra synapse hardware or inter module
wiring using the MOS resistive synapse multiplier in a novel conguration which
exploits its bidirectional properties  at the cost of discrete time and at most a
ON	 improvement in speed compared to a serial approach
A   synapses backpropagation synapse chip and a  neurons backpropagation neuron chip were fabricated in a standard   µm CMOS process. Our measurements on these chips showed a   bit synapse resolution, nonlinearities below  % (on most quantities below  %), and offset errors below  % (on most quantities below  %): magnitudes compatible with a range of applications if offset cancelling is applied to critical signals. A system using full size (   ) synapse chips was estimated to do   GCPS per synapse chip and   MCUPS (serial weight update, as the digital weight backup RAM puts this restriction on the system). We showed how to implement a backpropagation system based on the chip set. In addition to the chip set, this was basically a finite automaton controlling the system and the weight updating hardware, implemented using digital components.
The very powerful recurrent neural networks can solve a larger class of problems than perceptrons. Real-time recurrent learning can train a completely general network architecture and has several nice properties with respect to a VLSI implementation. It does, however, require a massive order $O(N^4)$ computational primitives per training example. We showed how dividing the computational primitives between the space and the time domain (as O(N ) to O(N )) was a good choice with respect to scalability, hardware cost, speed, implementation ease, and compatibility with our working recall mode neural system. We showed how we could perform learning in our recall mode system by adding a synapse multiplier and a scalable RTRL chip consisting of an order $O(N)$ signal slices, plus weight updating hardware (of order $O(1)$) and a finite automaton controlling the system.
A   signal slices RTRL chip was fabricated in a standard   µm CMOS process; unfortunately this chip was malmanufactured, resulting in reduced signal range and speed. Our measurements showed a topological functionality, nonlinearities below  %, and offset errors below  % (on most quantities below  %): magnitudes possibly compatible with the algorithm if offset cancelling is applied to critical signals. Our implementation of a   neurons,   inputs system was displayed.
† In a system with N neurons.
  
Using a digital RAM as backup for the synapse strengths obviously restricts the learning scheme efficiency: only serial weight change is possible, and D/A and A/D converters are needed. However, we showed that implementing (most of) the $O(1)$ weight updating hardware using digital electronics exhibits a range of virtues. The possibility of implementing high accuracy circuits using digital components enables the use of advanced weight updating schemes, e.g. momentum, which is not likely to function in a simple analogue system as weight change offsets are amplified. Also, it turns out that the scheme can reduce the minimum effective learning rate and the weight change offset, which are of major concern in analogue implementations of learning algorithms. Other algorithmic variations, such as weight decay and weight change threshold, are also readily and accurately implemented in the digital domain. Note that since the RAM/weight updating hardware scales as $O(1)$, the hardware cost is not very important. Finally, placing the synapse strengths in digital RAM is convenient for backup purposes, which is important for real applications.
We noted that though applications exist that can tolerate parameter variances
induced by temperature drift analogue neural networks must in general be temper
ature compensated Some implementations as the backpropagation system above	
also need process parameter variation canceling in various subcircuits a scheme
for this was outlined
In relation to the limited accuracy of analogue computing systems we noted
that two problems of teaching articial neural networks using gradient descent
based algorithms were especially severe Oset errors primarily on the weight
change signal	 and neuron derivative calculation
We presented several procedures to reduce oset errors One based on a DAC
resolution enhancement technique and another utilizing chopper stabilizing
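The principle behind chopper stabilizing can be shown with an idealised numerical model. This is a textbook-style sketch, not the circuit of this thesis: the multiplier is assumed to have a purely additive output offset, and the function names and numbers are hypothetical. The input sign is flipped in alternate phases and the two results are subtracted, so the product doubles while the offset cancels.

```python
def offset_multiplier(a, b, offset=0.05):
    """Idealised analogue multiplier with an additive output offset."""
    return a * b + offset


def chopped_product(a, b, multiplier=offset_multiplier):
    """Two-phase chopped measurement of a * b.

    Phase 1 measures  a * b + offset, phase 2 measures -a * b + offset.
    Half their difference is a * b, independent of the offset.
    """
    plus = multiplier(a, b)
    minus = multiplier(-a, b)
    return 0.5 * (plus - minus)


raw = offset_multiplier(0.3, 0.4)   # contains the offset
clean = chopped_product(0.3, 0.4)   # offset removed
```

The cancellation is exact only for an offset that is constant over both phases and independent of the inputs; offsets that change with the input sign, and the cost of the extra phase in time, are ignored in this model.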
To ensure the correct sign of the computed neuron derivative (which is of profound importance for convergence), we introduced a deliberate offset in the computing hardware. We also suggested that clipping the computed quantity was a possibility. However, attacking the problem from an algorithmic starting point was more attractive by far: we proposed to use nonlinear gradient descent for analogue VLSI.
The novel nonlinear backpropagation learning algorithm was displayed This
algorithm has two important properties for a hardware implementation i	 no
activation function derivatives need to be computed and ii	 the backpropagation
of errors is through the same nonlinear network as the forward propagation We
showed that this implies that an electronic implementation can model the algorithm
muchmore accurately than is possible for ordinary backpropagation design eorts
can be put into the electrical properties of the system components We proposed
Chapter  Conclusions Page 
hardware implementations of both a continuous time version and a discrete time
version of the algorithm combining the latter with our hardware ecient back
propagation synapse chip we saw that the implementation of nonlinear back
propagation was possible using virtually no extra hardware compared with the
recallmode system Simulations using the nonlinear backpropagation for learning
the NETtalk problem have shown a performance very similar to ordinary back
propagation
We derived a nonlinear version of the realtime recurrent learning algorithm
and argued that this compared to ordinary realtime recurrent learning	 would
have properties very similar to nonlinear backpropagation both at an application
level performance	 and with respect to hardware implementations Our system
level implementation of the algorithm showed that half the hardware on the RTRL
chip could be saved We also proposed a continuous time version of the non
linear realtime recurrent learning for exploitation of the asynchronous properties
of analogue VLSI
We saw the implementation of analogue neural network ensembles as a future trend of the field. Using ensembles introduces the much glorified (but seldom implemented) fault tolerance in analogue neural systems. More importantly, though, using ensembles enhances performance. Because of the limited accuracy of the technology, analogue VLSI neural systems are bound to be inferior in performance compared to ideal simulated networks. For adaptive systems this is particularly severe: as training data is not necessarily reproducible, faulty learning is intolerable. Neural network ensembles are expensive in terms of hardware; thus analogue VLSI is an ideal technology for neural network ensembles. And vice versa.
While the cascadable solution of our various chip sets functions for a limited number of cascaded devices, we argued that the generalization to a truly arbitrary size for huge neural networks is not in compliance with the analogue technology: such systems require infinite dynamic range of, for instance, the synapse strengths. (Using highly nonlinear synapse multipliers the cascadability can be improved, though it will still be limited.) We also proposed to use a network topology with clusters of neurons (sparsely interconnected to other clusters) to get around the cascadability problem, though research in cluster topology vs. system generality must be carried out.
In addition to the problem with limited accuracy of analogue VLSI, the issue of weight storage in such systems is a major concern: no good analogue electronic memory is presently available.
To eliminate the need for RAM backup in systems using simple capacitive storage, we proposed to use the self-repair properties of neural network ensembles to do auto refresh in systems with an on-chip supervised learning algorithm. This can efficiently prolong the time-to-exhaustion of the weights. In adaptive systems this might prove sufficient, especially if the synapse strengths are a combination of read-only (behavioural) and plastic (adaptive) memories.
If the use of digital synapse memory is acceptable, one can in a similar manner combine this with an analogue adjustment, enabling an analogue learning algorithm to train a system with coarsely discretized synapse strengths.
The bottom line: though implementations of analogue neural network learning systems have begun to emerge in the literature (including the present thesis), extensive research in this field is still needed before the field is mature. Research in learning algorithms is needed both at an algorithmic and at an implementation level. Problems that need to be addressed include insensitivity to weight change offset, analogue memories, and enhancement of system performance/reliability. It is our belief that such research will prove fruitful for the future VLSI implementations of supervised (and other) learning algorithms.
Bibliography
The references of this work are logically placed in five categories:
 On-chip learning
 Analogue neural networks
 Artificial neural networks
 Integrated circuits
 Miscellaneous references
A few of the references are not cited in the thesis but have been included for
a more thorough coverage of the field. Most probably, many authors have
unjustly been excluded from this list; I most sincerely apologise for this.
Following are the references, listed in alphabetical order:
 The Connectionist Mailing List  restricted Internet mailing list
ConnectionistsRequest+cscmuedu
 HP Direct Hewlett Packard no   Series 

 Workstations
 NEAR Workshop on European Analog Research September  post-conference
workshop at ESSCIRC, Copenhagen
 On Cytological Screening using Perceptrons December  talk at th
Neural Information Processing Systems Conference Denver
 Aanen Abusland and Tor S Lande Local Generation and Storage of Reference
Voltages in CMOS Technology in Proc th European Conference on
Circuit Theory and Design pp ( 
 P Y Alla G Dreyfus J D Gascuel A Johannet L Personnaz J Roman
andM Weinfeld Silicon Integration of Learning Algorithms and Other Auto
Adaptive Properties in Digital Feedback Neural Networks in VLSI Design of
Neural Networks Ulrich Ramacher and Ulrich R-uckert Eds Norwell Kluwer
Academic Publishers  pp 
(
 Phillip E Allen and Douglas R Holberg CMOS Analog Circuit Design Fort
Worth Holt Rinehart and Winston Inc 
 George S Almasi and Allan Gottlieb Highly Parallel Computing Redwood
City Benjamin Cummings Publishing Company Inc 
 Joshua Alspector Anthony Jayakumar and Stephan Luna Experimental
Evaluation of Learning in a Neural Microsystem in Proc 	th Neural
Information Processing Systems Conference pp ( 

 Preben Alstrm On VLSI Implementaion of Reinforcement Learning
private communication The Niels Bohr Institute 
 Per Andersson Portable CMOS design rules for the Swedish Universities
Lund Wallin . Dalholm Boktryckeri AB 

 A J Annema Hardware realisation of a neuron transfer function and its
derivative Electronics Letters vol 
 no  pp ( 
 AnneJohan Annema Analysis Modelling and Implementation of Analog In
tegrated Neural Networks PhD thesis University of Twente The Nether
lands 
 A J Annema K Hoen and H Wallinga Precision Requirements for Single
layer Feedforward Neural Networks in Proc 	th International Conference
on Microelectronics for Neural Networks and Fuzzy Systems Turin pp (
 
 AnneJohan Annema Klaas Hoen and Hans Wallinga Learning Behaviour
and Temporary Minima of Twolayer Neural Networks Neural Networks
vol  no  pp ! 
 Yutaka Arima Mitsuhiro Murasaki Tsuyoshi Yamada Atsushi Maeda and
Hirofumi Shinohara A Refreshable Analog VLSI Neural Network Chip with
 Neurons and K Synapses IEEE Journal of Solid-State Circuits vol
SC no  pp ( 
 Krste Asanovic and Nelson Morgan Experimental Determination of Precision
Requirements for Backpropagation Training of Artificial Neural Networks
in Proc nd International Conference on Microelectronics for Neural
Networks pp ( 
 Les E Atlas and Yoshitake Suzuki Digital Systems for Articial Neural
Networks IEEE Circuits and Devices Magazine vol  no  pp 
(

 Roberto Battiti and Giampietro Tecchiolli Learning with first second and
no derivatives a case study in High Energy Physics Neurocomputing vol
 pp 
 

 Randall D Beer Hillel J Chiel and Leon S Sterling A Biological Perspective
on Autonomous Agent Design Robotics and Autonomous Systems vol  pp
( 

 Kiet hA Benedict Learning in the Multilayer Perceptron Journal of
Physics A Math Gen vol  pp (
 
 Ronald G Benson and Douglas A Kerns UVActivated Conductances Allow
For Multiple Time Scale Learning IEEE Transaction on Neural Networks
vol  no  pp (
 May 
 Steven Bibyk and Mohammed Ismail Issues in Analog VLSI and MOS Tech
niques for Neural Computing inAnalog VLSI Implementation of Neural Sys
tems Carver Mead and Mohammed Ismail Eds Norwell Kluwer Academic
Publishers  pp 
(
 Steven Bibyk and Mohammed Ismail Neural Network Building Blocks for
Analog MOS VLSI in Analogue IC design the currentmode approach C
Toumazou F J Lidgey and D G Haigh Eds IEE Circuits and Systems 	
Series London Peter Peregrinus Ltd  pp (
 Christian Björk and Sven Mattisson Multivalued memory in standard CMOS
for weight storing in Neural Networks in Proc th European Conference
on Circuit Theory and Design vol  pp ( 
 Gudmundur Bogason Generation of a Neuron Transfer Function and its
Derivative Electronics Letters vol  no  pp ( 
 E J Borowski and J M Borwein Dictionary of Mathematics Glasgow
Collins Reference 
 T Botha CMOS Analogue CurrentSteeringMultiplier Electronics Letters
vol  no  pp ( 
 Sren Brunak Jacob Engelbrecht and Steen Knudsen Prediction of Human
mRNA Donor and Acceptor Sites from the DNA Sequence Journal of
Molecular Biology vol 
 pp ( 

 Sren Brunak and Benny Lautrup Linjedeling med et neuralt netvrk
Skrifter for anvendt og matematisk lingvistik vol  pp !
 
 Sren Brunak and Hans Hansen On Predicting Splice Sites with RTRL
private communication Technical University of Denmark (
 Erik Bruun John A Lansner and Torsten Lehmann Analog VLSI Architec
tures for Computational Neural Networks in Proc th NORCHIP Semi
nar pp ( 
 Erik Bruun Bandwith Limitations in Current Mode and Voltage Mode Inte
grated Feedback Ampliers EI preprint Technical University of Denmark

 Erik Bruun Analogue Signal Processing Collected Papers ( Elec
tronics Institute Technical University of Denmark Lyngby 
 Erik Bruun Gudmundur Bogason Thomas Kaulberg John Lansner and Peter
Shah On Analogue VLSI private communication Technical University of
Denmark (
 Wray L Buntine and Andreas S Weigend Computing Second Derivatives in
Feedforward Networks a Review IEEE Transactions on Neural Networks
vol NN pp ! 
 Graham Cairns and Lionel Tarassenko Learning with Analogue VLSI
MLPs in Proc 	th International Conference on Microelectronics for Neural
Networks and Fuzzy Systems Turin pp ( 
 Yong Cao Sven Mattisson and Christian Bj-ork SeeHear System A New
Implementation in Proc th European Solid State Circuits Conference
pp (
 
 Howard Card Relaxation Networks Recent Examples of Analog Circuits
from the U S and Canada in Proc rd International Conference on
Microelectronics for Neural Networks pp ( 

 Howard Card Analog Circuits for Relaxation Networks International
Journal of Neural Systems vol  no  pp ( 
 L Richard Carley Trimming Analog Circuits Using FloatingGate Analog
MOS Memory IEEE Journal of SolidState Circuits vol SC no  pp
( 
 Hernan A Castro Simon M Tam and Mark A Holler Implementation and
Performance on an Analog Nonvolatile Neural Network Analog Integrated
Circuits and Signal Processing vol  pp ( 
 Thierry Catfolis A Method for Improving the Real-Time Recurrent Learning
Algorithm Neural Networks vol  no  pp 
( 
 Daniele D Caviglia Maurizio Valle and Giacomo M Bisio Eects of Weight
Discretization on the Back Propagation Learning Method Algorithm Design
and Hardware Realization in Proc IEEE International Joint Conference on
Neural Networks pp II(II 

 TsinYuan Chang ChengChi Wang and JainBean Hsu Two Schemes for
Detecting CMOS Analog Faults IEEE Journal of SolidState Circuits vol
SC no  pp ( 
 Jungwook Cho Yoon Kyung Choi and Soo-Young Lee Modular Analog
Neurochip Set with On-chip Learning by Error Backpropagation and/or Hebbian
Rules in Proc International Conference on Artificial Neural Networks
 Sorrento vol  pp ( 
 Leon O Chua and Lin Yang Cellular Neural Networks Applications IEEE
Transactions on Circuits and Systems vol CAS no 
 pp (


 Leon O Chua and Lin Yang Cellular Neural Networks Theory IEEE
Transactions on Circuits and Systems vol CAS no 
 pp (

 A L Coban and Pe E Allen Lowvoltage Fourquadrant analogue CMOS
Multiplier Electronics Letters vol 
 no  pp 

 

 M H Cohen and A C Andreou MOS Circuit for Nonlinear Hebbian
Learning Electronics Letters vol  no  pp ( 
 Dean R Collins and P Andrew Penz Considerations for Neural Network
Hardware Implementations in Proc IEEE International Symposium on
Circuits and Systems pp ( 
 Michael C W Coln Chopper Stabilization of MOS Operational Ampliers
Using FeedForward Techniques IEEE Journal of SolidState Circuits vol
SC no  pp ( 
 D Del Corso F Gregoretti and L M Reyneri An Articial Neural Sys
tem Using Coherent Pulse Witdh and Edge Modulation in Proc rd Inter
national Conference on Microelectronics for Neural Networks pp 
(

 P J Crawley and G W Roberts HighSwing MOS Current Mirror with
Arbitrarily High Output Resistance Electronics Letters vol  no  pp
( 
 Yann Le Cun John S Denker and Sara A Solla Optimal Brain Damage
in Proc Neural Information Processing Systems Conference  San Mateo
pp (
 

 Yann Le Cun Ido Kanter and Sara A Solla Second Order Properties of Error
Surfaces Learning Time and Generalization in Proc Neural Information
Processing Systems Conference  Denver pp ( 
 Zdzislaw Czarnul Novel MOS Resistive Circuit for Synthesis of Fully Integrated
Continuous-Time Filters IEEE Transactions on Circuits and Systems
vol CAS no  pp ( 
 M van Daalen J Zhao and J ShaweTaylor Real Time Output Deriva
tives for On Chip Learning using Digital Stochastic Bit Stream Neurons
Electronics Letters vol 
 no  pp ( 
 Casper Dietrich Analog VLSI  kontruktion af matrixvektor multiplikator
med digitalt lagrede vgte MSc thesis Elektronisk Institut Danmarks
Tekniske Hjskole Lyngby 

 B K Dolenko and H C Card Neural Learning in Analogue Hardware
Eects of Component Variation from Fabrication and from Noise Electronics
Letters vol  no  pp ( 
 R Dom/0nguezCastro A Rodr/0guezV/azquez F Medeiro and J L Huertas
High Resolution CMOS Current Comparators in Proc th European
Solid State Circuits Conference pp ( 
 Kenji Doya and Shuji Yoshizawa Adaptive Neural Oscillator Using
Continuous-Time Back-Propagation Learning Neural Networks vol  pp (

 T Duong S P Eberhardt M Tran T Duad and A P Thakoor Learn
ing and Optimization with Cascaded VLSI Neural Network Buildingblock
Chips in Proc IEEE International Joint Conference on Neural Networks
pp I(I June 
 Scott T Dupuie and Mohammed Ismail High Frequency CMOS Transcon
ductors in Analogue IC design the currentmode approach C Toumazou
F J Lidgey . D G Haigh Eds IEE Circuits and Systems 	 Series Lon
don Peter Peregrinus Ltd 
 pp (
 Silvio Eberhardt Tuan Duong and Anil Thakoor Design of Parallel Hard
ware Neural Network Systems from Custom Analog VLSI 1Building Block'
Chips in Proc IEEE International Joint Conference on Neural Networks
pp II(II
 
 Silvio Eberhardt Alex Moonpenn and Anil Thakoor Considerations for
Hardware Implementations of Neural Networks in Proc nd Asilomar
Conference on Signals Systems and Computers pp ( 
 Peter J Edwards and Alan F Murray Analogue Synaptic Noise  Implica
tions and Learning Improvements International Journal of Neural Systems
vol  no  pp ( 
 Lewis C Eggebrecht Interfacing to the IBM Personal Computer nd ed
Indianapolis Sams 

 S E Fahlman FastLearning Variations on Backpropagation An Empirical
Study in Proc Connectionist Models Summer School  Pittsburgh D
Touretzky G Hinton and T Sejnowski Eds Morgan Kaufmann pp (


 Nabil H Farhat Optoelectronic Neural Networks and Learning Machines
IEEE Circuits and Devices Magazine vol  no  pp ( 
 Barry Flower and Marwan Jabri The Implementation of Single and Dual
Transistor VLSI Analogue Synapses in Proc rd International Conference
on Microelectronics for Neural Networks pp (
 
 Barry Flower and Marwan Jabri Summed Weight Neuron Perturbation An
ON	 Improvement over Weight Perturbation in Proc Neural Information
Processing Systems Conference   San Mateo pp ! 
 Thaddeus J Gabara Gregory J Cyr and Charles E Stroud Metastability
of CMOS Master Slave Flip-Flops IEEE Transactions on Circuits and
Systems Pt II vol CAS no 
 pp (
 
 Robert A Gabel and Richard A Roberts Signals and Linear Systems rd
ed New York John Wiley and Sons Inc 
 Umberto Gatti Franco Maloberti and Valentino Liberali Full Stacked Layout
of Analogue Cells in Proc IEEE International Symposium on Circuits and
Systems pp ( 
 U Gatti F Maloberti and G Palmisano An Accurate CMOS Sampleand
Hold Circuit IEEE Journal of SolidState Circuits vol  no  pp 
(
 
 Randall L Geiger Phillip E Allen and Noel R Strader VLSI Design Tech
niques for Analog and Digital Circuits Singapore McGrawHill Publishing
Company 

 Arthur Gelb Joseph F Kasper Jr Raymond A Nash Jr Charles F Price
Arthur A Sutherland Jr and the Analythic Science Corporation Applied
Optimal Estimation Cambridge MIT Press 
 C L Giles D Chen C B Miller H H Chen G Z Sun and Y C Lee
Grammatical Inference Using Second-Order Recurrent Neural Networks
in Proc IEEE International Joint Conference on Neural Networks pp !


 Shelly D D Goggin Karl E Gustafson and Kristina M Johnson Connec
tionist Nonlinear OverRelaxation in Proc IEEE International Joint Con
ference on Neural Networks pp III(III 

 Malcolm S Gordon Animal Physiology Principals and adaptions nd ed
New York Macmillan Publishing Co Inc  pp (
 Hans P Graf and Lawrence D Jackel Analog Electronic Neural Network
Circuits IEEE Circuits and Devices Magazine vol  no  pp (

 H P Graf Analog Electronic Neural Networks in Proc th European
Solid State Circuits Conference pp (
 
 Hans Peter Graf and Don Henderson A Reconfigurable CMOS Neural
Network in Artificial Neural Networks Edgar Sánchez-Sinencio and Clifford
Lau Eds New York IEEE Press  pp 
(
 David Grant John Taylor and Paul Houselander Design Implementation
and Evaluation of a HighSpeed Integrated Hamming Neural Classier IEEE
Journal of SolidState Circuits vol SC no  pp ( 
 Sten Grillner Peter Wallén Lennart Brodin and Anders Lansner Neuronal
Network Generating Locomotor Behavior in Lamprey Annual Reviews on
Neuroscience vol  pp ( 
 Heng Guo and Saul B Gelfand Analysis of Gradient Descent Learning
Algorithms for Multilayer Feedforward Neural Networks IEEE Transactions
on Circuits and Systems vol CAS no  pp ( 
 Alister Hamilton Stephen Churcher Peter J Edwards Geoffrey B Jackson
Alan F Murray and H Martin Reekie Pulse Stream VLSI Circuits and
Systems The Epsilon Neural Network Chipset International Journal of
Neural Systems vol  no  pp (
 
 Lars Kai Hansen Christian Liisberg and Peter Salamon Ensemble Methods
for Handwritten Digit Recognition in Proc The  IEEE Workshop on
Neural Networks for Signal Processing pp  

 Lars Kai Hansen and Peter Salamon Neural Network Ensembles IEEE
Transactions on Pattern Analysis and Machine Intelligence vol PAMI
no 
 pp (

 

 Lars Kai Hansen and Peter Salamon Selfrepair in Neural Network Ensem
bles AMSE Conference on Neural Networks San Diego 
 Ole Hansen On VLSI Devices private communication Technical University
of Denmark (
 Simon Haykin Neural Networks A Comprehensive Foundation New York
Macmillan College Publishing Company Inc 
 John Hertz Anders Krogh Benny Lautrup and Torsten Lehmann Non
linear Backpropagation Doing Backpropagation without Derivatives of the
Activation Function Neuroprose preprint Niels Bohr Institute Copenhagen

 John Hertz Anders Krogh and Richard G Palmer Introduction to the
Theory of Neural Computation Redwood City Addison-Wesley Publishing
Company 
 Marcus Höhfeld and Scott E Fahlman Probabilistic Rounding in Neural
Network Learning with Limited Precision in Proc nd International Conference
on Microelectronics for Neural Networks pp ( 
 Hollis Harper and Paulos The Eects of Precision Constraints in a Back
propagation Learning Network Neural Computation vol  no  pp (
 

 Paul W Hollis and John J Paulos Articial Neural Networks Using MOS
Analog Multipliers IEEE Journal of SolidState Circuits vol SC no 
pp ( 

 Paul W Hollis John J Paulos and Christopher J D'Costa An Optimized
Learning Algorithm for VLSI Implementation in Proc nd International
Conference on Microelectronics for Neural Networks pp ( 


 Paul W Hollis and John J Paulos A Neural Network Learning Algorithm
Tailored for VLSI Implementation IEEE Transactions on Neural Networks
vol NN no  pp ( 

 Yoshihiko Horio and Shogo Nakamura Analog Memories for VLSI Neurocomputing
in Artificial Neural Networks Edgar Sánchez-Sinencio and Clifford
Lau Eds New York IEEE Press  pp (

 J C Houk Learning in Modular Networks in Proc th Yale Workshop on
Adaptive and Learning Systems pp 
( 

 KeunRong Hsieh and WenTsuen Chen A Neural Network Model which
Combines Unsupervised and Supervised Learning IEEE Transactions on
Neural Networks vol NN no  pp (
 

 KouChiang Hsieh Paul R Gray Daniel Senderowicz and David G Messer
Bibliography Page 
schmitt A LowNoise ChopperStabilized Dierential SwitchedCapacitor
Filtering Technique IEEE Journal of SolidState Circuits vol SC no
 pp 
( 

 John F Hurdle Erik L Brunvand and L-uli Josephson Asynchronous
VLSI Design for Neural System Implementation in Proc rd International
Workshop on VLSI for Neural Netwoks and Articial Intelligence pp  

 Mohammed Ismail and Terri Fiez Analog VLSI Signal and Information
Processing Electrical and Computer Engineering Series New York McGraw
Hill 

 Marwan Jabri and Barry Flower Weight Perturbation An Optimal Architecture
and Learning Technique for Analog VLSI Feedforward and Recurrent
Multilayer Networks IEEE Transactions on Neural Networks vol NN no
 pp ( 

 M Jabri S Pickard P Leong Z Chi B Flower and Y Xie ANN Based
Classication for Heart Debrillators inProc Neural Information Processing
Systems Conference  Denver pp ( 

 Marwan A Jabri Practical Performance and Credit Assignment Eciency
of Analog Multilayer Perceptron Perturbation Based Training Algorithms
System Engineering and Design Automation Laboratory Sydney University
Electrical Engineering SEDAL tech rep  

 Lawrence D Jackel Practical Issues for Electronic NeuralNet Hardware
tutorial notes at the 'th Neural Information Processing Systems Conference

 Georey Jackson Alister Hamilton and Alan Murray Pulse Stream VLSI
Neural Systems Into Robotics in Proc IEEE International Symposium on
Circuits and Systems London vol  pp ( 
 Robert A Jacobs Michael I Jordan and Andrew G Barto Task Decomposi
tion through Competition in a Modular Connectionist Architecture TheWhat
and Where Vision Tasks Cognitive Science vol  pp (
 
 Kam Jim C Lee Giles and Bill G Horne Synaptic Noise in Dynamically
driven Recurrent Neural Networks Convergence and Generalization Insti
tute for Advanced Computer Studies University of Maryland UMIACSTR
 and CSTR 
 Tor A Johansen and Bjarne A Foss Constructing NARMAX Models using
ARMAX Models International Journal of Control vol  no  pp (
 
 D E Johnson J S Marsland and W Eccleston Neural Network Implemen
tation using a Single MOST per Synapse to appear in IEEE Transactions
on Neural Networks 
 F Joublin M Lemesle S Wacquant and R Debrie Proposed Hardware
Implementation of Massively Parallel Cortical Automation Networks Electronics
Letters vol  no  pp ( 
 Yaron Kanshai and Yair Be'ery Back Propagation and Distributed Data
Architectures in Proc rd International Conference on Microelectronics
for Neural Networks pp (
 
 Thomas Kaulberg and Gudmundur Bogason An Angle Detector Based on
Magnetic Sensing in Proc IEEE International Symposium on Circuits and
Systems London vol  pp ( 
 Brian W Kernighan and Dennis M Ritchie The C Programming Language
nd ed Englewood Clis Prentice Hall 

 Douglas A Kerns Experiments in Very LargeScale Analog Computation
PhD thesis California Institute of Technology Pasadena Caifornia 
 Donald A Kerth Navdeep S Sooch and Eric A Swanson A bit MHz
TwoStep Flash ADC IEEE Journal of SolidState Circuits vol SC no
 pp 
( 
 Edwin van Keulen Sel Colak Heini Withagen and Hans Hegt Neural
Network Hardware Performance Criteria private communication Eindhoven
University of Technology 
 Nabil I Khachab and Mohammed Ismail A Nonlinear CMOS Analog Cell
for VLSI Signal and Information Processing IEEE Journal of SolidState
Circuits vol SC no  pp ( 
 Anders Krogh and John A Hertz A Simple Weight Decay Can Improve
Generalization in Proc Neural Information Processing Systems Conference
 Denver pp 
( 
 Anders Krogh Lars Kai Hansen and Jan Larsen On Neural Networks
private communication Technical University of Denmark (
 Francis J Kub Keith K Moon Ingham A Mack and Francis M Long
Programmable Analog VectorMatrix Multipliers IEEE Journal of Solid
State Circuits vol SC no  pp 
( 

 Hon Keung Kwan Systolic Architectures for Hopfield Network BAM and
Multi-Layer Feed-Forward Networks in Proc IEEE International Symposium
on Circuits and Systems pp 
( 
 H K Kwan and C Z Tang Designing Multilayer Feedforward Neural
Networks Using Simplied Sigmoid Activation Functions and OnePowers
ofTwo Weights Electronics Letters vol  no  pp ( 
 Kadaba R Lakshmikumar Robert A Hadaway and Miles A Copeland
Characterization and Modeling of Mismatch in MOS Transistors for Precision
Analog Design IEEE Journal of SolidState Circuits vol SC no  pp

(
 

 John A Lansner and Torsten Lehmann A Neuron and a Synapse Chip
for Articial Neural Networks in Proc th European Solid State Circuits
Conference pp ( 
 John A Lansner and Torsten Lehmann An Analog CMOS Chip Set for
Neural Networks with Arbitrary Topologies IEEE Transactions on Neural
Networks vol  no  pp ( May 
 John A Lansner An Experimental Hardware Neural Network using a Cas
cadable Analog Chip Set to appear in International Journal of Electronics
Technical University of Denmark 
 John A Lansner Analogue VLSI Implementation of Articial Neural Net
works PhD thesis Electronics Institute Technical University of Denmark
Lyngby 
 Alan Lapedes and Robert Farber How Neural Nets Work in Proc Neural
Information Processing Systems Conference  D Z Anderson Ed
New York American Institute of Physics pp ( 
 Jan Larsen Design of Neural Network Filters PhD thesis Electronics
Institute Technical University of Denmark Lyngby 
 Michael LeBlanc and Robert Tibshirani Combining Estimates in Regression
and Classication preprint University of Toronto 
 BangW Lee Bing J Sheu and Han Yang Analog FloatingGate Synapses for
GeneralPurpose VLSI Neural Computation IEEE Transactions on Circuits
and Systems vol CAS no  pp ( 
 HaeSeung Lee David A Hodges and Paul R Gray A SelfCalibrating 
Bit CMOS A D Converter IEEE Journal of SolidState Circuits vol SC
no  pp ( 
 Torsten Lehmann A Hardware Implementation of the RealTime Recurrent
Learning Algorithm in Proc th European Conference on Circuit Theory
and Design vol  pp (
 

 Torsten Lehmann Neurale Netvrk i VLSI Teknologi MSc thesis Elek
tronisk Institut Danmarks Tekniske Hjskole Lyngby 
 Torsten Lehmann A Cascadable Chip Set for ANN's with Onchip Back
propagation in Proc rd International Conference on Microelectronics for
Neural Networks pp ( 
 Torsten Lehmann A Hardware Ecient Cascadable Chip Set for ANN's with
Onchip Backpropagation International Journal of Neural Systems vol 
no  pp ( 
 Torsten Lehmann and Erik Bruun Analogue VLSI Implementation of Backpropagation
Learning in Artificial Neural Networks in Proc th European
Conference on Circuit Theory and Design pp ( 
 Torsten Lehmann Implementation Issues for Backpropagation Learning
in Analog VLSI Neural Networks in preparation Technical University of
Denmark 
 Torsten Lehmann and Lars Kai Hansen Analogue VLSI Neural Network
Ensemble Issues in preparation Technical University of Denmark 
 Torsten Lehmann Erik Bruun and Casper Dietrich Analogue Digital Hy
brid VLSI Synapses for Recall and Learning Mode Neural Networks in
Proc th NORCHIP seminar Gothenburg Sweeden pp ( 
 Torsten Lehmann Erik Bruun and Casper Dietrich Mixed Analogue Digital
Bibliography Page 
Matrix Vector Multiplier for Neural Network Synapses in preparation
Technical University of Denmark 
 F Thomson Leighton Introduction to Parallel Algorithms and Architechrures
Arrays Trees Hypercubes San Mateo Morgan Kaufmann Publishers 
 Phillip H W Leong and Marwan A Jabri Kakadu  A Low Power Ana
logue Neural Network in Proc rd International Conference on Microelec
tronics for Neural Networks pp 
( 

 Phillip H W Leong and Marwan A Jabri Kakadu  A Low Power Ana
logue Neural Network Classier International Journal of Neural Systems
vol  no  pp ( 
 Bernab/e LinaresBarranco Edgar S/anchezSinencio Angel Rodr
/
iguezV/az
quez and Jos/e L Huertas A CMOS Analog Adaptive BAM with OnChip
Learning and Weight Refreshing IEEE Transaction on Neural Networks
vol  no  pp ( May 
 Richard P Lippmann An Introduction to Computing with Neural Nets
IEEE ASSP Magazine vol  no  pp ( 
 Ronald J MacGregor Neural and Brain Modeling San Diego Academic
Press Inc 
 Damien Macq Michel Verleysen Paul Jespers and Jean-Didier Legat Analog
Implementation of a Kohonen Map with On-chip Learning IEEE Transactions
on Neural Networks vol  no  pp ( May 
 Kurosh Madani Ghislain de Tremiolles and Ion Berechet Temperature
Eects Modelling and Compensation Analysis in Analogue Implementation of
Stochastic Articial Neural Networks in Proc 	th International Conference
on Microelectronics for Neural Networks and Fuzzy Systems Turin pp 
(
 
 Jim Mann Richard Lippmann Bob Berger and Jack Rael A SelfOrganiz
ing Neural Net Chip in Proc IEEE Custom Integrated Circuits Conference
pp 
(
 
 P Masa K Hoen and H Wallinga 
 Million Patterns Per Second Analog
CMOS Neural Network Pattern Classier in Proc th European Confer
ence on Circuit Theory and Design pp (
 
 L W Massengill A Dynamic CMOS Multiplier for Analog Neural Network
Cells in Proc IEEE Custom Integrated Circuits Conference pp (
 

 Ofer Matan Christopher J C Burges Yann Le Cun and John S Denker
Multidigit Recognition Using a Space Displacement Neural Network in
Proc 	th Neural Information Processing Systems Conference pp (


 Takao Matsumoto and Masafumi Koga A HighSpeed Learning Method for
Analog Neural Networks in Proc IEEE International Joint Conference on
Neural Networks pp II(II 

 M J McNutt S LeMarquis and J L Dunkley Systematic Capacitance
Matching Errors and Corrective Layout Procedures IEEE Journal of Solid-State
Circuits vol SC no  pp ( 
 Carver Mead Analog VLSI  Neural Systems Reading AddisonWesley
Publishing Company 
 Carver Mead and Mohammed Ismail Analog VLSI Implementation of Neural
Systems Norwell Kluwer Academic Publishers 
 Christopher Michael and Mohammed Ismail Statistical Modeling of Device
Mismatch for Analog MOS Integrated Circuits IEEE Journal of SolidState
Circuits vol SC no  pp ( 
 JeanYves Michel HighPerformance Analog Cells in MixedSignal VLSI
Problems and Practical Solutions Analog Integrated Circuits and Signal
Processing vol  pp ( 
 Coe F Miles and C David Rogers The Microcircuit Associative Memory A
Biologically Motivated Memory Architecture IEEE Transactions on Neural
Networks vol NN no  pp ( 
 Antonio J Montalvo Ronald S Gyurcsik and John J Paulos Building
Blocks for a TemperatureCompensated Analog VLSI Neural Network with
OnChip Learning in Proc IEEE International Symposium on Circuits and
Systems London vol  pp ( 
 Antonio J Montalvo Paul W Hollis and John J Paulos OnChip Learning
in the Analog Domain with Limited Precision Circuits in Proc International
Symposium on Circuits and Systems pp I(I
 
 Keith K Moon Francis J Kub and Ingham A Mack Random Address
X Programmable Analog VectorMatrix Multiplier for Articial Neural
Netwoks in Proc IEEE Custom Integrated Circuits Conference pp (
 


 Takashi Morie and Yoshihito Amemiya An AllAnalog Expandable Neural
Network LSI with OnChip Backpropagation Learning IEEE Journal of
SolidState Circuits vol SC no  pp 
(
 
 Alessandro Mortara and Eric A Vittoz A Communication Architecture
Tailored for Analog VLSI Artificial Neural Networks Intrinsic Performance
and Limitations IEEE Transactions on Neural Networks vol NN no 
pp ( 
 V B Mountcastle An Organizing Principle for Cerebral Function The Unit
Module and the Distributed System in The Mindful Brain G M Edelman
and V B Mountcastle Eds Cambidge MIT Press  pp (

 Paul Mueller Jan van der Spiegel David Blackman Timothy Chiu Thomas
Clare Christopher Donham Tzu Pu Hsieh and Marc Loinaz Design and
Fabrication of VLSI Components for a General Purpose Analog Neural
Computer in Analog VLSI Implementation of Neural Systems Carver Mead
and Mohammed Ismail Eds Norwell Kluwer Academic Publishers 
pp (
 Alan F Murray Multilayer Perceptron Learning Optimized for OnChip
Implementation A NoiseRobust System Neural Computation vol  no 
pp ( 
 Alan F Murray Dante Del Corso and Lionel Tarassenko PulseStreamVLSI
Neural Networks Mixing Analog and Digital Techniques IEEE Transactions
on Neural Networks vol  no  pp (
 March 
 A F Murray L Tarassenko H M Reekie A Hamilton M Brownlow
S Churcher and D J Baxter Pulsed Silicon Neural Networks  Following
the Biological Leader in VLSI Design of Neural Networks Ulrich Ramacher
and Ulrich Rückert Eds Norwell Kluwer Academic Publishers  pp

(
 Alan F Murray and Peter J Edwards Analogue Synaptic Noise  A
Hardware Nuisance or an Aid to Learning in Proc rd International
Conference on Microelectronics for Neural Networks pp ( 
 Alan F Murray and Peter J Edwards Enhanced MLP Performance and
Fault Tolerance Resulting from Synaptic Weight Noise During Training
IEEE Transactions on Neural Networks vol NN no  pp (
 
 O Nerrand P RousselRagot D Urbani L Personnaz and G Dryfus
Training Recurrent Neural Networks Why and How An Illustration in
Dynamical Process Modeling IEEE Transactions on Neural Networks vol
NN no  pp ( 

 Charles F Neugebauer and Amnon Yariv A Parallel Analog CCD CMOS
Signal Processor in Proc Neural Information Processing Systems Conference
 pp ( 
 Chalapathy Neti Michael H Schneider and Eric D Young Maximally Fault
Tolerant Neural Networks IEEE Transactions on Neural Networks vol NN
 no  pp ( 
 Paul O'Leary Practical Aspects of Mixed Analogue and Digital Design in
Analogue Digital ASICs  Circuit Techniques Design Tools and Applica
tions R S Soin F Maloberti and J Franca Eds IEE Circuits and Systems
	 Series London Peter Peregrinus Ltd  pp (
 O Osowski New Approach to Selection of Initial Values of Weights in Neural
Function Approximation Electronics Letters vol  no  pp (

 G Palm K Goser U R-uckert and A Ultsch Knowledge Processing in
Neural Architecture in Proc rd International Workshop on VLSI for
Neural Netwoks and Articial Intelligence pp  
 Joshua C Park Christopher Abel and Mohamed Ismail Design of a Silicon
Cochlea Using MOS Switchedcurrent Techniques in Proc th European
Conference on Circuit Theory and Design pp ( 
 Morten With Pedersen and Lars Kai Hansen Recurrent Networks Second
Order Properties and Pruning EI preprint 
 Marcel J M Pelgrom Aad C J Duinmaijer and Anton P G Welbers
Matching Properties of MOS Transistors IEEE Journal of SolidState
Circuits vol SC no  pp (
 
 D Plaut S Nowlan and G Hinton Experiments on Learning by Back
propagation Department of Computer Science Carnegie Mellon University
Pittsburgh tech Rep CMUCS 
 William H Press Brian P Flannery Saul A Teukolsky and William T Vet
terling Numerical Recipes in C Cambridge Cambridge University Press


 Ning Qian and Terrence J Sejnowski Predicting the Secondary Structure
of Globular Proteins Using Neural Network Models Journal of Molecular
Biology vol  no 
 pp ( 
 Jack I Rael Electronic Implementation of Neuromorphic Systems in Proc
IEEE Custom Integrated Circuits Conference pp 
(
 
 Ulrich Ramacher Guide Lines to VLSI Design of Neural Nets in VLSI
Design of Neural Networks Ulrich Ramacher and Ulrich R-uckert Eds
Norwell Kluwer Academic Publishers  pp (
 Ulrich Ramacher and Ulrich Rückert VLSI Design of Neural Networks
Norwell Kluwer Academic Publishers 
 Ulrich Ramacher and Peter Schildberg Recent Developments in Neurodynamics
and their Impact on the Design of Neurochips International Journal
of Neural Systems vol  no  pp 
( 
 A A Reeder I P Thomas C Smith J Wittgree D Godfrey J Hajto
A Owen A J Snell A F Murray M Rose and P G LeComber Applica
tion of Analogue Amorphous Silicon Memory Devices to Resistive Synapses
for Neural Networks in Proc nd International Conference on Microelec
tronics for Neural Networks Munich pp (
 
 Leonardo M Reyneri and Enrica Filippi An Analysis on the Performance of
Silicon Implementations of Backpropagation Algorithms for Artificial Neural
Networks IEEE Transactions on Computers vol C
 no  pp 
(
 
 L M Reyneri M Chiaberge and D del Corso Using Coherent Pulse Width
and Edge Modulations in Artificial Neural Systems International Journal of
Neural Systems vol  no  pp 
( 
 Jacques Robert and Philippe Deval A Second-Order High-Resolution Incremental
A/D Converter with Offset and Charge Injection Compensation
IEEE Journal of SolidState Circuits vol SC no  pp ( 
 D E Rumelhart G E Hinton and R J Williams Learning Internal
Representations by Error Propagation in Parallel Distributed Processing
Explorations in the Microstructure of Cognition vol  D E Rumelhart
J L McClelland and the PDP Research Group Eds Cambridge MIT Press
 chap 


 D E Rumelhart J L McClelland and the PDP Research Group Parallel
Distributed Processing Explorations in the Microstructure of Cognition
Cambridge MIT Press 

 Eduard Säckinger and Linda Fornera On the Placement of Critical Devices
in Analog Integrated Circuits IEEE Journal of SolidState Circuits vol
SC no  pp 
(
 


 Eduard Säckinger and Walter Guggenbühl A High-Swing High-Impedance
MOS Cascode Circuit IEEE Journal of SolidState Circuits vol SC no
 pp ( 


 Shigeo Sakaue Toshiyuki Koda Hiroshi Yamamoto Susumu Maruno and
Yasuharu Shimeki Reduction of Required Precision Bits for Back-Propagation
Applied to Pattern Recognition IEEE Transactions on Neural Networks
vol NN no  pp 
( 

 S Sakurai and M Ismail High Frequency Wide Range CMOS Analogue
Multiplier Electronics Letters vol  no  pp ( 

 C Andre T Salama David G Nairn and Henry W Singor Current
Mode A D and D A Converters in Analogue IC design the currentmode
approach C Toumazou F J Lidgey . D G Haigh Eds IEE Circuits and
Systems 	 Series London Peter Peregrinus Ltd 
 pp (

 Edgar S/anchezSinencio and Cliord Lau Articial Neural Networks New
York IEEE Press 

 Srinagesh Satyanarayana Yannis P Tsividis and Hans Peter Graf A Reconfigurable
VLSI Neural Network IEEE Journal of Solid-State Circuits vol
SC no  pp ( 

 Navin Saxena and James J Clark A Four-quadrant CMOS Analog Multiplier
for Analog Neural Networks IEEE Journal of SolidState Circuits vol SC
 no  pp ( 

 Jürgen Schmidhuber An O(n^3) Learning Algorithm for Fully Recurrent Networks
Institut für Informatik Technische Universität München München


 Christian Schneider and Howard Card Analog CMOS Synaptic Learning
Circuits Adapted from Invertebrate Biology IEEE Transactions on Circuits
and Systems vol CAS no  pp 
( 
 Jesper S Schultz Neurale Netvrk i VLSI Teknologi  med sparse digitale
inputs MSc thesis Elektronisk Institut Danmarks Tekniske Hjskole
Lyngby 
 Evert Seevinck Analog Interface Circuits for VLSI in Analogue IC design
the currentmode approach C Toumazou F J Lidgey and D G Haigh Eds
IEE Circuits and Systems 	 Series London Peter Peregrinus Ltd 

pp (
 Charles L Seitz System Timing in Introduction to VLSI Systems Carver
Mead and Lynn Conway Eds Reading Addison-Wesley Publishing Company
 pp (
 Terrance J Sejnowski and Charles R Rosenberg Parallel Networks that
Learn to Pronounce English Text Complex Systems vol  pp (

 Nikola B Šerbedžija and Gerd Kock Fault-Tolerant Neuro-Computing in
Proc International Conference on Artificial Neural Networks  Sorrento
vol  pp 
(
 
 T Serrano B LinaresBarranco and J L Huertas A CMOS VLSI Analog
CurrentMode HighSpeed ART Chip in Proc IEEE International Confer
ence on Neural Networks Orlando vol  pp  
 Peter Shah A Short Term Analogue Memory in Proc th European Solid
State Circuits Conference pp (
 September 
 Samir Shah and Francesco Palmieri MEKA  A Fast Local Algorithm for
Training Feedforward Neural Networks in Proc IEEE International Joint
Conference on Neural Networks pp III(III 

 JeHurn Shieh Mahesh Patil and Bing J Sheu Measurement and Analysis
of Charge Injection in MOS Analog Switches IEEE Journal of Solid-State
Circuits vol SC no  pp ( 

 Takeshi Shima Tomohisa Kimura Yukio Kamatani Tetsuro Itakura Yasuhiko
Fujita and Tetsuya Iida Neuro Chips with On-chip Backpropagation
and/or Hebbian Learning IEEE Journal of Solid-State Circuits vol SC
no  pp ( 
 Masakazu Shoji Elimination of ProcessDependent Clock Skew in CMOS
VLSI IEEE Journal of SolidState Circuits vol SC no  pp (


 Svante Signell and Kare Mossberg Offset-Compensation of Two-Phase
Switched-Capacitor Filters IEEE Journal of Solid-State Circuits vol SC
no  pp ( 
 Roy Ludvig Sigvartsen Yngvar Berg and Tor Sverre Lande An Analog
Neural Network with Onchip Backpropagation Learning in Proc th
NORCHIP seminar Gothenburg Sweden pp ( 
 Patrick K Simpson Foundations of Neural Networks in Articial Neural
Networks Edgar S/anchezSinencio and Cliord Lau Eds New York IEEE
Press  pp (
 Anthony W Smith and David Zipser Learning Sequential Structure with the
Realtime Recurrent Learning Algorithm International Journal of Neural
Systems vol  no  pp ( 
 S A Solla E Levin and M Fleisher Accelerated Learning in Layered Neural
Networks Complex Systems vol  pp ( 
 Jens Sparsø Christian D Nielsen Lars S Nielsen and Jørgen Staunstrup
Design of Self-timed Multipliers A Comparison in Proc IFIP TCWG
 Working Conference on Asynchronous Design Methodologies Manchester
pp ( 
 Jan Van der Spiegel Paul Mueller David Blackman Peter Chance Christopher
Donham Ralph Etienne-Cummings and Peter Kinget An Analogue
Neural Computer with Modular Architecture for Real-Time Dynamic Computations
IEEE Journal of Solid-State Circuits vol SC no  pp (


 Balsha R Stanisic Nishath K Verghese Rob A Rutenbar L Richard Carley
and David J Allstot Addressing Substrate Coupling in MixedMode IC's
Simulation and Power Distribution Synthesis IEEE Journal of SolidState
Circuits vol SC no  pp ( 

 W Scott Stornetta and B A Huberman An Improved Threelayer Back
Propagation Algorithm in Proc IEEE International Conference on Neural
Networks vol  pp ( 
 Lubert Stryer Biochemistry rd ed New York W H Freeman and
Company 
 Sun Moll Berger and Alders Breakdown Mechanism in Short Channel
MOS Transistors in Proc IEEE Technical Digest International Electron
Device Meeting Washington DC pp  
 Ivan E Sutherland Micropipelines Communications of the ACM vol 
no  pp 
( June 
 C Svarer L K Hansen and J Larsen On Design and Evaluation of
TappedDelay Neural Network Architectures in Proc IEEE International
Conference on Neural Networks vol  pp ( 
 S M Sze Semiconductor Devices Physics and Technology New York John
Wiley . Sons 
 Janos Sztipanovits Dynamic Backpropagation Algorithm for Neural Network
Controlled Resonator-Bank Architecture IEEE Transactions on Circuits
and Systems II vol CAS no  pp (
 
 Lionel Tarassenko and Jon Tombs Onchip Learning with Analogue VLSI
Neural Networks in Proc rd International Conference on Microelectronics
for Neural Networks pp ( 
 Lionel Tarassenko Jon Tombs and Graham Cairns Onchip Learning with
Analogue VLSI Neural Networks International Journal of Neural Systems
vol  no  pp ( 
 Hans Henrik Thodberg The Neural Information Processing System used
for pig carcase grading in Danish Slaughterhouses Danish Meat Research
Institute preprint No  E Roskilde 

 Axel Thomsen and Martin A Brooke A Floating Gate CMOS Signal
Conditioning Circuit for Nonlinearity Correction Analog Integrated Circuits
and Signal Processing vol  pp ( 
 Cris Toumazou and John Lidgey Universal Current-Mode Analogue Amplifiers
in Analogue IC Design The Current-Mode Approach C Toumazou
F J Lidgey and D G Haigh Eds IEE Circuits and Systems 	 Series
London Peter Peregrinus Ltd 
 pp (

 C Toumazou F J Lidgey and D G Haigh Analogue IC Design The
CurrentMode Approach IEE Circuits and Systems 	 Series London Peter
Peregrinus Ltd 

 C Toumazou F J Lidgey and C A Makris Extending Voltagemode Op
Amps to Currentmode Performance IEE Proceedings Pt G vol  no
 pp (
 

 C Toumazou A Payne and J Lidgey CurrentFeedback Versus Voltage
Feedback Ampliers History Insight and Relationships in Proc IEEE
International Symposium on Circuits and Systems pp 
(
 
 Yannis Tsividis Mihai Banu and John Khoury ContinuousTime MOSFET
C Filters in VLSI IEEE Transactions on Circuits and Systems vol CAS
no  pp (
 
 Y Tsividis and S Satyanarayana Analogue Circuits for VariableSynapse
Electronic Neural Networks Electronics Letters vol  no  pp (
 
 Yannis P Tsividis R.D in Analog Circuits Possibilities and Needed
Support in Proc th European Solid State Circuits Conference pp (

 A C Tsoi C N W Tan and S Lawrence Financial Time Series Forcasting
Application of Articial Neural Network Techniques preprint Department of
Electrical Engeneering and Computer Engineering University of Queensland
St Lucia Australia 
 Paul W Tuinenga SPICE A Guide to Circuit Simulation  Analysis Using
PSpice Englewood Clis Prentice Hall 

 Dogan A Tugal and Osman Tugal Data Transmission Analysis Design
Applications New York McGrawHill 
 Maurizio Valle Daniele D Caviglia and Giacomo M Bisio An Experimental
Analog VLSI Neural Chip with OnChip BackPropagation Learning in
Proc th European SolidState Circuits Conference pp 
(
 
 S R Vemuru Layout Comparison of MOSFETs with Large W L Ratios
Electronics Letters vol  no  pp ( 
 Eric A Vittoz MOS Transistors Operated in the Lateral Bipolar Mode
and Their Application in CMOS Technology IEEE Journal of SolidState
Circuits vol SC no  pp ( 
 E Vittoz H Oguey M A Maher O Nys E Dukstra and M Chevroulet
Analog Storage of Adjustable Synaptic Weights in VLSI Design of Neural
Networks Ulrich Ramacher and Ulrich R-uckert Eds Norwell Kluwer
Academic Publishers  pp (
 FongJim Wang and Gabor C Temes A Fast OsetFree SampleandHold
Circuit IEEE Journal of SolidState Circuits vol SC no  pp 
(
 
 Yiwen Wang A Modular Analog CMOS LSI for Feedforward Neural Networks
with On-Chip BEP Learning in Proc  IEEE International Symposium
on Circuits and Systems vol  pp ( 
 Zhenhua Wang A CMOS Four-Quadrant Analog Multiplier with Single-Ended
Voltage Output and Improved Temperature Performance IEEE Journal
of Solid-State Circuits vol SC no  pp (
 
 Timothy L H Watkin Albrecht Rau and Michael Biehl The Statistical
Mechanics of Learning a Rule Reviews of Modern Physics vol  pp ! 
 R B Webb Optoelectronic Implementation of Neural Networks Interna
tional Journal of Neural Systems vol  no  pp ( 

 George Wegmann Eric A Vittoz and Fouad Rahali Charge Injection in
Analog MOS Switches IEEE Journal of SolidState Circuits vol SC no
 pp 
(
 
 S Andreas Weigend Bernardo A Huberman and David E Rumelhart
Predicting the Future a Connectionist Approach International Journal
of Neural Systems vol  no  pp (
 

 Neil H E Weste and Kamran Eshraghian Principles of CMOS VLSI Design
A Systems Perspective Reading AddisonWesley Publishing Company 
 ChinLong Wey Benlu Jiang and Gregory M Wierzba BuildIn SelfTest
BIST	 Design of LargeScale Analog Circuit Networks in Proc IEEE
International Symposium on Circuits and Systems pp ( 
 Halbert White Learning in Artificial Neural Networks A Statistical Perspective
Neural Computation vol  pp ( 
 Bernard Widrow and Michael A Lehr 
 Years of Adaptive Neural Networks
Perceptrons Madaline and Backpropagation IEEE Proceedings vol
 no  pp ( 

 Remco J Wiegerink Evert Seevinck and Wim de Jager Oset Cancelling
Circuit IEEE Journal of SolidState Circuits vol SC no  pp (

 Ronald J Williams and David Zipser A Learning Algorithm for Continually
Running Fully Recurrent Neural Networks Neural Computation vol  pp

(
 
 Ronald J Williams and David Zipser Experimental Analysis of the Real
time Recurrent Learning Algorithm Connection Science vol  no  pp
( 
 Heini Withagen Implementing Backpropagation with Analog Hardware
in Proc International Conference on Neural Networks Orlando pp  

 Robin Woodburn H Martin Reekie and Alan F Murray Pulsestream
Circuits for Onchip Learning in Analogue VLSI Neural Networks in Proc
IEEE International Symposium on Circuits and Systems London vol  pp

(
 
 Niels Holger Wulff Learning Dynamics with Recurrent Networks MSc
thesis NORDITA Nordisk Institut for Teoretisk Fysik København 
 Yun Xie and Marwan Jabri Analysis of the Effects of Quantization in
Multilayer Neural Networks using a Statistical Model IEEE Transactions
on Neural Networks vol NN no  pp ( 
 Moritoshi Yasunaga Noboru Masuda Masayoshi Yagyu Mitsuo Asai Minoru
Yamada and Akira Masaki Design Fabrication and Evaluation of a 
inch Wafer Scale Neural Network LSI Composed of  Digital Neurons
in Artificial Neural Networks Edgar Sánchez-Sinencio and Clifford Lau Eds
New York IEEE Press  pp (
 ChongGun Yu and Randall L Geiger An Automatic Oset Compensation
Scheme with Pingpong Control for CMOS Operational Ampliers IEEE
Journal of SolidState Circuits vol SC no  pp 
(
 
Index
A priori knowledge                           
Abbreviations                                     xii
Absolute temperature                     
Abstract                                                 iii
AC signal                                           xvi
ACcouple learning hardware     

Accuracy                                             
Acknowledgements                           vii
Activation function                    
Activation functions                         
Adaptability                                           
Adaptive system                             
Adaptive systems                        

Adaptive                                             
ADC                                                       
Algorithm variations                     
Amorphous silicon storage             
Amount of learning hardware       
Analogue adjustment                
Analogue computing accuracy   
Analogue neural networks           
Analogue VLSI ANN
ensembles                                           
Analogue VLSI ANN properties     
Analogue VLSI learning ANN
properties                                             
ANN applications                             
ANN model map easily on
hardware                                                 
ANN must fit technology
ANN                                                     
Application invariant                       
Application specic                             
Applications and motivations     
Architecture                                         

Articial neural network               
Articial neural networks    
ASIC interconnection             
 
Associative memories                     
Asynchronousness                            
Asynchronous                                   
Attractor dynamics                           
Auto oset compensation             

Auto zeroing simulation               

Backpropagation ANN
architecture                                         
Backpropagation chip computing
elements                                               
Backpropagation chip set
improvements                                   
Backpropagation learning             
Backpropagation mode                   
Backpropagation neuron chip E 
Backpropagation neuron
schematic                                           
Backpropagation neuron               
Backpropagation synapse
chip                                                   E 

Backpropagation synapse
column row element                       

Backpropagation system
hardware                                               
Backpropagation system               
Backpropagation weight update
schematic                                           
Backpropagation                               

Basics                                              
Batch learning                            
Bibliography                                     
Binary coding of inputs                   
Bipolar transistors                         
Bit absolute measure                     
Bit relative measure                       
BJT                                                     
Boltzmann constant
Boltzmann machines                           
Boundary eects                             
BPL                                                       
Building block components 
 


 
Bulk threshold parameter             
Capacitive storage                             
Cascadability                                       
Cascadable                                           
CCII!                                                 

CCO                                                       
Cerebral cortex                                 
Channel length modulation
constant                                             
Channel length modulation
parameter                                           
Channel length                                 
Channel width                                 
Chip compound                                 
Chip design                         

  
Chip designs                                     
Chip measurements         
  

Chip photomicrographs             E 
Chip set improvements                 
Chipintheloop training                 
Choice of learning algorithms       
Chopper stabilized weight
updating                                               
Chopper stabilizing                           
Clamped derivative output           
Classication                                     
Clock generator                               

CMOS process used                         
Collector current                             
Columns of synapses                       
Comments on the topology           
Common centroid layout             
Computation requirements             
Computed neuron derivative         
Computing accuracy                     
Conclusions                                       
Conjugate gradient method           
Connection strength                       
Connection strengths                         
Connection updates per second 
Connections per second                 
Consensus cost function               
Consensus decision                         
Consensus trainer                           
Contents                                             viii
Continuous time NLBP neuron   
Continuous time nonlinear back
propagation neuron                           
Continuous time RTRL system 
Control systems                               
Convolution                                       
Cost function osets                         
Cost function                             
 

CPS                                                     
Critic                                                   

CUPS                                                   
Current auto zeroing principle   

Current controlled oscillator         
Current conveyor dierencer         
Current conveyor                             

Current dierencer                           
Current input inherently                 
Current levels                                   
Current mismatch                           
Current subtraction by row         
Current subtraction by synapse 
Cuto MOST                                 
DAC resolution enhancement     

DAC                                                       
Dansk                                                     iv
Data compression                           
Data conversion                                   
DC signal                                           xvi
Denitions                                         
Degenerated momentum                 
Derivation of the algorithm  

Derivative computation avoided   

Derivative computation                   
Derivative perturbation          

Design strategy                                 
Deterministic neurons                       
Device orientation                           
Dierencer                                           
Dierent neuron nonlinearities   
Dierent neuron transfer
functions                                               
Dierent parabola non
linearities                                             
Dierent parabola transfer
functions                                               
Dierential quotient derivative
approximation                              
Digital level shifter                         
Digital PC interface                         
Digital storage                                   
Digital weight updating hardware
principle                                               
Digital weight updating
hardware                                               
Discrete input alphabet                   
Discrete time feedback                     

Discrete time NLBP neuron         
Discrete time nonlinear back
propagation neuron                           
Distributed neuron                    
Distributed neuronsynapse           
Distributed neuronsynapses         
Domains of signalling                       
Double resolution D A
conversion                                         

Drain current variance                 
Drain current                                   
Driving force of VLSI                     
Droop rate                                           
DSP                                                       
Dynamic learning rate                     

Dynamic range               
 
Ec current gain cancelling           
Edge triggered sampler sampling

Edge triggered sampler


Eective bias current                       
Eective Connection Primitives Per
Second                                                 
Eective maximum synapse
weight                                                   
Electronic synapse                             
Electronically computed neuron
derivative                                             
Elementary charge                         
Eliminate need for matched
components                                         
Emitter area                                     
Energy used by learning
hardware                                               
English                                                   iii
Enhanced performance                 
Entropic cost function                   
Epoch length                                     
Epoch                                                 
Error backpropagation                   

Error measure                                   

Eta finder
Example described problems     
Exchanging inputs and outputs   
Expandable neural network           
Expandable recurrent neural
network                                                 
Exploit implicit multiplication     
Fahlman perturbation                   
Fault tolerance                         
Fault tolerant                                   
Feedback                                                 
Finite automaton                               
Finite state machine                         
Fires                                                     
Firing rate                                         
First generation synapse chip   E 
Floating gate MOSFET                   
Floating gate storage                       
Focus on electrical properties       
Folding synapse matrix                   
Font                                                       xvi
Forward Early voltage                   
Forward emittercollector current
gain                                                       
Forward mode neuron
characteristics                                     
Forward mode synapse
characteristics                                     
Forward mode BPL synapse column
element                                               
Forward mode BPL synapse row
element                                               
Forward mode neuron transfer
characteristics                                     
Forward mode synapse transfer
characteristics                                     
Forward mode weight osets         
Forward mode                                     
Four quadrant multiplier               

FSM                                                       
Fully interconnected                         
Further work                         
Gate oxide capacitance                 
General ANN architecture             
General high level computational
ANNs                                                   
General neural network model   
General process parameter canceling
circuit                                                     
General purpose analogue neural
network                                              
Generalization ability                     
Gilbert multiplier                               
Global process variations             

Gradient descent algorithms       

Gradient descent learning         

Gradient descent learning             
Gradient descent                     
 
Handles                                               
Hard limiter                                           
Hard soft hybrid synapses           
Hardware compatible                       
Hardware considerations                 
Hardware consumption                   
Hardware ecient approach         
Hardware ecient learning           
Hardware implementation      
Hardware on chip                             
Hebbian learning                               
Hessian                                                 
High accuracy calculations             
Higher order neurons                         
Hopeld networks                               
Huge massively parallel
systems                                               
Human genome project
Hyperbolic tangent neuron      

Hyperbolic tangent transfer
function                                                 
Implementation of onchip back
propagation                                         
Implementation of RTRL
hardware                                               
Implementation of the neural
network                                                   
Implementing ANNs in analogue
hardware                                                 
Implementing learning algorithms in
analogue hardware                               
Improving the derivative
computation                                         
Including algorithmic
improvements                                     
Index                                                   
Inner product multiplier                 
Inner product multipliers               
Input bandwidth                               
Input indices                                       
Input vector                                         
Instant cost function                     
Integral nonlinearity                     
Integrated circuit issues               
Integrated circuit layout               

Integrated circuits                  
Internal logic level                           
Intheight training                         
Introduction                                           
IPM                                                       
Kronecker's delta                               
Large learning rates                         
Lateral bipolar mode MOSFET
symbol                                                 
Lateral bipolar mode MOSFET 
Lateral bipolar mode MOSFET   
Lateral bipolar mode                     
Layer parallel backpropagation
hardware                                               
Layer synchronous weight
update                                                   
Layered feedforward neural
network                                               
Layout of matched transistors   
Layout                                                     v
LBM MOSFET                               
Learn mode                                         
Learning by epoch                         
Learning by example                     
Learning by subsequence               
Learning cycle                                     
Learning hardware                           
Learning loop gain                         

Learning rate                       
Learning speed                                   
Learning                                             

Least signicant bits                     
Letter input                                         
Letters                                                   
Linear MOST operation               
Linear multiplier                               
List of figures                                   xvii
Local process variations               

Low cost algorithmic
improvements                                     
Low power applications                
LSB_B
Majority decision                           
Mapping the algorithm on VLSI 

 
Massively parallel learning             
Matrixvector multiplier                 

MDAC                                                   
Measured neuron transfer
function                                                 
Measured synapse
characteristics
Measured synapseneuron step
response                                               
Measured synapseneuron transfer
characteristics                                     
Memories                                             
Memory requirements                     
Metastability                                         
Miscellaneous references               
Mismatch                                           
MLP                                              

Modules                                                   
Momentum inclusion                       
Momentum parameter                     

Momentum                                   
 
MOS Gilbert multiplier                   
MOS resistive circuit multiplier   
MOS resistive circuit                     

MOS transistor symbols               
MOS transistors                             
MOSFET                                           
MOST                                                 
Motivation for using gradient
descent                                                   

MRC operated in forward mode 
MRC operated in reverse mode   
MRC resistive equivalent               
MRC                                                     
Multi layer perceptron                     

Multidimensional chopper
stabilization                                         
Multilayer perceptron                 
Multiplier based on MRCs             
Multipliers                                           

Multiplying DAC synapse             
Multiplying DAC                               
MVM                                                     

NARV                                                 
Nchannel MOS transistor
symbols                                               
Nchannel MOS transistor           
Negligible neuron error oset     


Net input                                           
NETtalk                                               
Network input                                       
Network topology                                 
Neural network ensemble             
Neural network ensembles           
Neuron activation block
schematic                                             
Neuron activation                        
Neuron bias                                       
Neuron chip                                         
Neuron clustering                  
Neuron derivative variable
discretization                                       
Neuron derivative variables           
Neuron derivatives                           
Neuron error osets                    
Neuron error                                       
Neuron errors                                     
Neuron sampler droop rate           
Neuron indices                                   
Neuron k inputs                                   
Neuron k net input                             
Neuron net input derivative
variables                                               
Neuron output                                   
Neuron threshold                             
Neuron transfer characteristics     
Neuron transfer function
steepness                                               
Neurons                                               
Niches for analogue VLSI ANNs   
Niches for analogue VLSI learning
hardware                                                 
NLBP domain parameter               
NLBP hardware overhead             
NLBP simulations                             
NLBP training error                         
NLBP weight change                       
NLBP weight errors                         
NLBP                                                     

NLRTRL neuron derivative
variables                                             
NLRTRL                                           
NLSM                                                   
No extra routing                               
No extra synapse hardware           
Noise                                              
Non unity ec current gain
canceling                                               
Nonlinear backpropagation         

Nonlinear DAC                               

Nonlinear principle                       
Nonlinear realtime recurrent
learning                                               
Nonlinear RTRL system             
Nonlinear RTRL                           
Nonlinear synapse multiplier       
Nonlinearities                                   
Nonlinearity                                     
Nonrelaxation systems                   

Nonvolatile analogue memories   
Normalized average relative
variance                                               
NPN bipolar transistor symbol 
NPN bipolar transistor                 
Nucleotide sequence                         
OBD                                                       
Objective                                                 
Oset compensation               
 
Oset currents                                   
Oset error                                  
Oset errors                                  
Oset                                                   
Onchip learning                             
Online learning convergence       
Online learning                        
Opamp frequency response       

Operational amplier                     

Optimal brain damage                     
Order N signal slice                       

Oscillating weights                           
Other improvements         

Our                                                         vi

Output conductance                       
Output error                                     

Output layer                                       
Parallelism                                        
Parallel                                               
Pattern recognition                         
PC interface                                         
Perceptron                                         
Performance evaluation                 
PFM                                                       
Pipelining                                             
Powerful                                               
Preface                                                     v
Preliminary conceptions on
hardware learning                             
Principal BPL system operation 
Probabilistic rounding                     
Process gradients                             

Process parameter dependency
canceling                                               
Process transconductance
parameter                                           
Process variation insensitivity     
Propagation delay                             
Pruning                                                 
PSRR                                                     
Published papers                             E 
Pulse frequency modulation         
Pulse frequency neuron                   
Pulse stream neural network         
Pulse width modulation                 
PWM                                                     
Quadratic cost function               
Quantizeregenerate                         
Quantizing the weights                   
QuasiNewton                                     
Quiescent drain current               
Racearound                                     


Radial basis function                       
RAM backup memory                     
Random initial state                         
Random synapse access                 
RANN                                            
Read synapse matrix                       
Realtime recurrent learning
chip                                                   E 
Realtime recurrent learning  

Realtime training                             
Realworld data set                           

Realworld interfacing                       
Recall mode equation                       
Recall mode speed                           
Recongurable network
topologies                                             
Recongurable neural network     
Recurrent articial neural net
works                                                     
Recurrent networks                           
Reduce the minimum learning
rate                                                         
Refresh by relearning                     
Refreshing schemes                           
Regression                                         
Regularity                                          
Regular                                               
Regulated gain cascode                 

Regulated gain cascodes               

Relaxation                                           
Relearning                                           
Research in learning
algorithms                                         

Resolution                                    
Restoration eciency                     
Reuse activation function circuit 
Reverse mode synapse
characteristics                                     
Reverse mode BPL synapse column
element                                               
Reverse mode BPL synapse row
element                                               
Reverse mode synapse transfer
characteristics                                     
Reverse mode weight osets         
Reverse mode                                     
RGC current mirror                       

RGC                                                     

Rise fall time                                   

Route mode BPL synapse column
element                                               
Route mode BPL synapse row
element                                               
Route mode                                         
Routing                                                 
Rows of synapses                               
RTRL ANN basic architecture   

RTRL chip improvements           

RTRL chip                                         


RTRL signal slice schematic       


RTRL system hardware               

RTRL system topology                 

RTRL weight change schematic 

RTRL backpropagation hybrid 
RTRL backpropagation system
interface                                         E 

RTRL                                              
Sample applications                         
SAR bit slice                                     

SAR start signal gating                 
SAR                                                     

Saturation mode BJT
operation                                             
Saturation MOST operation       
Saving weight updating
hardware                                               
Scaled backpropagation synapse
chip                                              E 
Scaled synapse chip
characteristics                                   
Schematic backpropagation
neuron                                                   
Schematic backpropagation
synapse                                                 
Second generation hyperbolic
tangent neuron                                   
Second generation synapse chip 

Self refreshing ANN system       
Self refreshing system                   
Self timed                                               
Selfpruning                                         
Selfrepair                                           
Semi parallelism                                 
Serial weight updating scheme     
Serial weight updating                     
Short channel snapback               
Short term adaptations                 
Sigmoid function                             
Signal slices                                         
Signalling                                             
Simple ANN model                             
Simple models                                
Simple nonlinear synapse
multiplier                                             
Simple weight error calculation   
Simple weight updating scheme   

Simulated annealing                         
Simulatedi neuron transfer
function                                                 
Simulations of nonidealities         
Single ended signalling                 
Size limiting                                         
Snapback                                           
Space time domain compromise 
Sparse input synapse chip
column                                                   
Sparse input synapse chip      
 
Sparse inputs                                     
Special process facility storage     
Speed improvement                    
Speed limiting                                     
Splice sites                                           
Squashing function steepness     
Squashing function                         
Step response                                     
Stochastic approximation               
Stochastic neurons                               
Storing analogue signals                 
Storing of data                                   
Storing the training patterns         
Strong inversion circuits                 

Strong inversion surface
potential                                             
Subthreshold MOST operation   
Subthreshold slope                         
Successive approximation
register                                       
 

Summary                               

Summed weight neuron
perturbation                                         
Sunspot learning error                     
Sunspot prediction error                 
Sunspot prediction                           

Sunspot time series                           

Supervised learning                 
 
Supply current sensing                 

Surface mobility                               
Symbols                                               xiv
Symmetric synapse multiplier       
Synapse chip                                       
Synapse cost                                       
Synapse layout                                 
Synapse schematic                             
Synapse strength backup
memory                                                 
Synapse strength discretization   
Synapse transfer characteristics   
Synapse                                               
System design aspects                   
System design                   
  
System designs                                 
System measurements                     

System simulations                           
Table of ANN chip set
characteristics                                   
Table of Backpropagation chip set
characteristics                                   
Table of row column element
control                                                 
Table of RTRL chip
characteristics                                   

Table of scaled BPL synapse chip
characteristics                                   
Tanh derivative computing block
characteristics                                   

Tapped delay line ANN                   

Target indices                                     
Target set empty                             
Target value                                     
Target values                                       
Teacher forcing                                   
Teacher                                               

Teaching ANNs                               

Teaching                                             

Technology driven model            
Temperature compensation           
Temperature gradients                 
Temporal information                     
Test PCB schematics                   E 
Test perceptron system
architecture                                         
Test perceptron                                 
Test set                                               
The ANN model                             
The articial neural network
model                                                       
The backpropagation algorithm 
The backpropagation neuron
chips                                                     
The backpropagation synapse
chips                                                     
The current comparator               

The current conveyor                     

The D A converter                         

The discrete time RANN system 
The discrete time RTRL system 
The transconductor                         

The interface                                     
The MOS resistive circuit             
The network                                           
The neuron chip              
The neurons                                           
The onchip backpropagation chip
set                                                         
The opamp and the CCII!       

The operational amplier             

The RTRL algorithm                       
The RTRL chip                               
The RTRL backpropagation
system                                                 

The SAR                                           

The scalable ANN chip set         
The scaled backpropagation
synapse chips                                   
The second generation synapse
chip                                                         
The synapse chip                   
The synapse chips                    
The transconductor                         

The width  data path module   

The width N data path module
signal slice                                         


Thermal voltage                      
Thesis                                                     v
Thoughts on future analogue VLSI
neural networks                               
Threshold voltage                           
Time multiplexing                             
Time series analysis                       
Time step                                             
Total cost function                         
Trainability                                         
Training data                                   
Training set                                       
Transconductance parameter       
Transconductance                           
Transconductor                                 

Transfer function                        
Transistor parameter                     
Transmission gate symbol           
Transport saturation current
density                                                 
Triode MOST operation               
TTL level                                           
Two layer test perceptron             
Two phase nonoverlapping
clock                                                     
Typical electronic synapse             
Typical MRC layout                     

Unary coding of inputs                   
Unit size devices                             

Unsupervised learning                   

Variations                                      
Very large scale integration         
Virtual targets                                   
VLSI neural networks                     

VLSI                                                   
Voltage levels                                   
Voltage reference level                   
Wafer scale integration                     
Weight change IPM element
characteristics                                   

Weight change oset                       
Weight change osets                       
Weight change signal memory     
Weight change threshold                 
Weight change                                   

Weight decay inclusion                   
Weight decay                               
 
Weight discretization                       
Weight errors                                     
Weight matrix resolution               
Weight perturbation                         
Weight updating hardware
placement                                             
Weight updating hardware      
Weight updating rule                       
Weighted sum decision                 
Weight                                                 
Weightoutput characteristic of
NLSM                                                   
We                                                           vi
Width  data path module             
Width N data path modules         
Width N+M data path module
Word hyphenation                             
Zero bias threshold voltage         
ELECTRONICS
INSTITUTE
Hardware Learning in
Analogue VLSI
Neural Networks
A thesis by
Torsten Lehmann
Appendices
and
Enclosures
September 
Appendix A
Denitions
In this appendix denitions are given of concepts that are not otherwise well
dened
Accuracy
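The accuracy of a quantity $\xi$ is defined as its maximal deviation from ideality:
$$D_A\,\xi \stackrel{\text{def}}{=} \frac{\max_x \left|\xi(x) - \xi_{\text{ideal}}(x)\right|}{\xi_{\max} - \xi_{\min}} \qquad \text{(A.1)}$$
The normalization can also be with respect to the ideal range, $\xi^{\text{ideal}}_{\max} - \xi^{\text{ideal}}_{\min}$.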
Bit relative measure
The bit relative measure of a quantity $\Delta\xi$ is defined as the number of least significant bits, $\mathrm{LSB}_B$, of $\Delta\xi$, given a $B$ bit discretization of the quantity $\xi$ to which $\Delta\xi$ is related:
$$\frac{\Delta\xi / (\xi_{\max} - \xi_{\min})}{\mathrm{LSB}_B} = \frac{\Delta\xi}{\xi_{\max} - \xi_{\min}}\, 2^{B} \qquad \text{(A.2)}$$
(Or, $\mathrm{LSB}_B \stackrel{\text{def}}{=} 2^{-B}$.) Sometimes we use the $\mathrm{LSB}_B$ measure somewhat imprecisely, in a non-unitless way; in this case $\mathrm{LSB}_B = 2^{-B}(\xi_{\max} - \xi_{\min})$. We call this the bit absolute measure.
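The two bit measures can be illustrated with a small Python sketch. The function names and the example range are assumptions made for illustration only; the arithmetic follows the definitions above (a deviation normalized by the range and multiplied by 2^B, and the range multiplied by 2^-B).

```python
def lsb_relative(delta, q_min, q_max, bits):
    """Express a deviation `delta` as a number of LSB_B, for a `bits`-bit
    discretization of a quantity spanning [q_min, q_max]."""
    return delta / (q_max - q_min) * 2 ** bits


def lsb_absolute(q_min, q_max, bits):
    """Size of one LSB_B in the units of the quantity (bit absolute measure)."""
    return (q_max - q_min) * 2.0 ** (-bits)


if __name__ == "__main__":
    # Hypothetical example: a quantity spanning -1 to 1, discretized to 8 bits.
    print(lsb_relative(0.01, -1.0, 1.0, 8))  # deviation of 0.01, in LSB_8
    print(lsb_absolute(-1.0, 1.0, 8))        # one LSB_8 in the quantity's units
```

Dividing a deviation by the bit absolute measure gives the same number as the bit relative measure, which is a convenient consistency check.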
Appendix A Denitions Page 
Connections per second
The standard speed measure for neural networks is the Connections Per Second measure (CPS), which counts the number of synaptic connections (multiply-adds, that is) the network does per second. (Thus the work involved in computing the activation function is ignored.) This measure has been questioned by several individuals, and other measures have been proposed. The Effective Connection Primitives Per Second (Keulen et al.), for instance, seems a good candidate for the future standard.
Connection updates per second
The standard speed measure for teaching neural networks is the Connection Updates Per Second measure (CUPS), which counts the number of updates on the synaptic connection strengths the learning algorithm does per second.
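As a rough illustration of the CPS and CUPS measures, the Python sketch below counts the connections of a fully connected layered network and divides by the time taken. The function names, the layer sizes and the timings are hypothetical; the sketch also assumes that every connection is evaluated once per recall and updated once per learning step.

```python
def connections(layer_sizes):
    """Number of synaptic connections (multiply-adds per recall) in a fully
    connected layered network; thresholds are not counted."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def cps(layer_sizes, recall_time_s):
    """Connections Per Second: connections evaluated per second of recall."""
    return connections(layer_sizes) / recall_time_s


def cups(layer_sizes, update_time_s):
    """Connection Updates Per Second: weight updates per second of learning."""
    return connections(layer_sizes) / update_time_s


if __name__ == "__main__":
    layers = [4, 8, 2]                 # hypothetical network: 4*8 + 8*2 connections
    print(connections(layers))
    print(cps(layers, 1e-6))           # assumed 1 microsecond per recall
    print(cups(layers, 1e-5))          # assumed 10 microseconds per weight update pass
```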
Non-linearity
The non-linearity (or integral non-linearity) of a quantity $\xi$ is defined as its maximal deviation from ideality when the offset error has been canceled:
$$D\,\xi \stackrel{\text{def}}{=} \frac{\max_x \left|\xi(x) - \xi_{\text{ofs}} - \xi_{\text{ideal}}(x)\right|}{\xi_{\max} - \xi_{\min}} \qquad \text{(A.3)}$$
The normalization can also be with respect to the ideal range, $\xi^{\text{ideal}}_{\max} - \xi^{\text{ideal}}_{\min}$.
Oset error
The o
set error of a quantity  is dened as its deviation from ideality at ideal
zero value

ofs
def
" x
ideal
" 
		   
A
	
It is presumed that nonlinearities and oset errors that are related to x has been
canceled to make this denition welldened If this is not possible the oset
error must be dened for a specic x preferably a nonbiased one like x " 
	
Resolution
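The resolution of a quantity $\xi$ is defined as the smallest change of this quantity that can be distinguished at some appropriate output $f(\xi)$:
$$\xi_{\text{res}} \stackrel{\text{def}}{=} \min_{\left|f(\xi + \Delta\xi) - f(\xi)\right| \ge \varepsilon} \Delta\xi \qquad \text{(A.5)}$$
where $\varepsilon$ is the smallest distinguishable output change.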
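The accuracy, offset error and non-linearity measures defined in this appendix can be estimated from a sampled transfer curve, as in the following Python sketch. The function names and the sample data are assumptions made for illustration. The sketch also assumes that the offset is read at the sample whose ideal value is closest to zero, and that the normalization uses the measured range rather than the ideal one.

```python
def offset_error(measured, ideal):
    """Deviation from ideality at the sample whose ideal value is closest to zero."""
    i = min(range(len(ideal)), key=lambda n: abs(ideal[n]))
    return measured[i] - ideal[i]


def accuracy(measured, ideal):
    """Maximal deviation from ideality, normalized by the measured range."""
    span = max(measured) - min(measured)
    return max(abs(m - q) for m, q in zip(measured, ideal)) / span


def nonlinearity(measured, ideal):
    """Maximal deviation from ideality after the offset error is subtracted,
    normalized by the measured range."""
    ofs = offset_error(measured, ideal)
    span = max(measured) - min(measured)
    return max(abs(m - ofs - q) for m, q in zip(measured, ideal)) / span


if __name__ == "__main__":
    # Hypothetical data: an ideal curve and a copy shifted by a pure offset of 0.1.
    ideal = [-1.0, -0.5, 0.0, 0.5, 1.0]
    measured = [q + 0.1 for q in ideal]
    print(offset_error(measured, ideal))   # close to 0.1
    print(accuracy(measured, ideal))       # close to 0.05 (0.1 over a range of 2)
    print(nonlinearity(measured, ideal))   # close to 0: a pure offset is not a non-linearity
```

The example shows why the two measures are kept apart: a curve with a pure offset has poor accuracy but essentially zero non-linearity.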
Appendix B
Articial neural networks
This appendix briey describes the concepts of neural networks We present the
most popular models and display typical application areas and motivations for
using articial neural networks We briey touch the concepts of learning with
emphasis on gradient descent A performance evaluation measure is also given
An articial neural network or ANN is a type of computer with a topology
inspired by the human brain it consists of a large number of simple calculating
units or neurons which are interconnected in massive parallelism In a typical bio
logical neuron each connection or synapse has an associated connection strength
and the neuron integrates the thus weighted outputs from other neurons over time
If this integral reach a certain threshold the output of the neuron is pulsed high
the neuron res The neuron ring rate will be in the range zero to the inverse
of the pulse time Very simplied the result is this the ring rate of a neuron
is a nonlinear function of a weighted sum of its inputs ring rates Gordon 
Rumelhart et al 

 MacGregor 	
Appendix B Articial neural networks Page 
B  The ANN model
In the standard model of an artificial neural network it is this firing rate, or neuron
activation, relation that is modeled (Hertz et al.; Rumelhart et al.; Haykin).
A neuron $k$ calculates as its output $y_k$ an (often non-linear) function $g_k$ of
the weighted sum of its inputs $z_j$:
$$y_k = g_k(s_k), \quad \text{where} \quad s_k = \sum_j w_{kj} z_j - \Theta_k . \qquad \text{(B.1)}$$
Here $s_k$ is called the net input, $\Theta_k$ is called the neuron threshold value (it is
also referred to as the neuron bias), $g_k(\cdot)$ is called the transfer function, squashing
function or activation function, and $w_{kj}$ is the connection strength or weight†. $\Theta_k$ is
often neglected, as it can be modeled as the connection strength from an input with the
constant value of $-1$.
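The remark that the threshold can be absorbed into the weights can be checked with a few lines of code. This is a minimal sketch under stated assumptions: the function names are hypothetical, the squashing function defaults to tanh, and the sign convention is that the net input is the weighted sum minus the threshold.

```python
import numpy as np

def neuron_output(w, z, theta, g=np.tanh):
    """Neuron output with an explicit threshold: y = g(sum_j w_j z_j - theta)."""
    return g(np.dot(w, z) - theta)

def neuron_output_augmented(w, z, theta, g=np.tanh):
    """Same neuron, with the threshold modeled as one extra weight."""
    w_aug = np.append(w, theta)   # the threshold becomes an ordinary weight
    z_aug = np.append(z, -1.0)    # driven by an input with constant value -1
    return g(np.dot(w_aug, z_aug))
```

Both functions return the same value for any weights, inputs and threshold, which is why a learning circuit only needs to handle weights.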
Figure B.1: General neural network model. The feedback can be either a
continuous time or a discrete time one. The arrows represent synaptic connections.
Interconnecting these artificial neurons gives the artificial neural network. A
general model of such a fully interconnected network can be seen in figure B.1.
A neuron input can be either an output from another neuron or an input $x_m$ to
the network. Letting the $M$ inputs have indices $m \in I$ and the $N$ neurons have
indices $k \in U$, we have
$$x = \begin{pmatrix} x_{m_1} \\ x_{m_2} \\ \vdots \\ x_{m_M} \end{pmatrix}, \quad
y = \begin{pmatrix} y_{k_1} \\ y_{k_2} \\ \vdots \\ y_{k_N} \end{pmatrix}, \quad
z = \begin{pmatrix} z_{j_1} \\ z_{j_2} \\ \vdots \\ z_{j_{M+N}} \end{pmatrix},$$
where
$$z_j = \begin{cases} x_j, & \text{for } j \in I \\ y_j, & \text{for } j \in U . \end{cases} \qquad \text{(B.2)}$$
† This notation is based on a paper by Williams and Zipser, though somewhat
modified to be consistent with notation used in Hertz et al.
Figure B.2: Layered feed-forward neural network, also called a perceptron.
This is a special, very popular version of the general neural network, suitable
for a large range of classification, regression, etc. tasks.
Often a network is constructed as a layered feedforward network or a percep
tron also multilayer perceptron MLP	 see gure 
B
 In this case the synapses
and neurons usually bear a layer index l in addition to the ones above Thus
y
l
k
" g
l
k
s
l
k
	 s
l
k
"
X
j
w
l
kj
z
l
j
 
B
	
where
z
l
j
"

x
j
 for l " 
y
l
j
 for   l  L
 
B
	
and L is the number of layers
In this thesis we shall be concerned with both types of networks
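The layered recursion can be written as a short loop. This is a minimal sketch, assuming one weight matrix per layer with element [k, j] holding the connection strength from input j to neuron k, no thresholds, and the same squashing function in every layer; the function name is hypothetical.

```python
import numpy as np

def mlp_forward(x, weights, g=np.tanh):
    """Forward pass of a layered feed-forward network.

    weights[l][k, j] is the connection strength from input j to neuron k
    in layer l + 1. Returns the list of layer outputs y^1 ... y^L.
    """
    z = np.asarray(x, dtype=float)   # layer 1 sees the network inputs
    outputs = []
    for W in weights:
        s = W @ z                    # net inputs of this layer
        y = g(s)                     # neuron outputs of this layer
        outputs.append(y)
        z = y                        # the next layer sees these outputs
    return outputs
```

The last element of the returned list is the network output; the intermediate outputs are kept because gradient descent learning needs them.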
Appendix B Articial neural networks Page 
B Applications and motivations
Inspired by the human brain, artificial neural networks would be expected to be
good at solving problems that the human brain solves efficiently. Indeed this is so:
example-described problems (as opposed to problems with an algorithmic
solution) are where ANNs have the advantage over traditional methods. Problems
that typically fall in this category are
• Associative memories (that are the Bohr atoms of neural networks), closely
related to
• Classification and
• Regression.
These are all applications where one typically has a large set of data describing the
problem but no (obvious) algorithm for the solution. Recognition of handwritten
characters (e.g. Matan et al.) is a good example. Perhaps less obvious, the
example-described problems are also found in areas such as
• Time series analysis,
• Control systems and
• Data compression.
Today ANNs are applied to a wide range of applications, often performing an
order of magnitude better than previous solutions. One can mention the prediction
of splice sites in human pre-mRNA (Brunak et al.), pig carcase grading in
Danish slaughterhouses (Thodberg) and cytological screening (NIPS).
In addition to the superior performance in certain application areas, neural
networks offer several nice properties that further motivate their use:
• Fault tolerant. The distributed data processing of ANNs makes it very easy
to include the necessary redundancy to implement a fault (error, noise, etc.)
tolerant system.
• Parallel. The ANN equations can be totally parallelized (which is also true for
many associated learning algorithms), implying that ANNs can be very fast.
• Regular. Most ANNs are composed of few different elements that are
interconnected in a regular way. This regularity makes a hardware (e.g. VLSI)
implementation cheap.
• Adaptive. Programmed by way of examples, ANNs are easily adapted to new
working conditions. This is a very powerful property, hardly challenged by
any conventional method.
• Asynchronous. Most neural networks can (or do) function in an asynchronous
way (including the human brain). This is advantageous when implementing
electrical circuits because problems with spikes on supply currents and worst
case timing design are eliminated.
Notice that most of the above properties are in favour of hardware implementations
of ANNs.
After the revival of artificial neural network research in the early 1980s,
ANNs got the reputation of being a magic tool that would give impressive results
when applied to anything. This is obviously not true, and ANNs were labeled
frivolous in certain circles. Today this label is unjustified: neural network theory
is advancing every day and is well founded; ANN limitations are known (see Hertz
et al.; Sanchez-Sinencio and Lau; Haykin).
To cite John Denker: "neural networks are the second best way of doing just
about anything". The best way is always that of applying an algorithm, if
such can be found. As the literature shows, this is often not possible, especially if
an adaptive solution is sought.
B Teaching ANNs
The process of determining the free parameters of an ANN such that it solves a
given task is called learning, or teaching, of the ANN. Learning algorithms are
usually classified as one of the following (Hertz et al.):
• Supervised learning using a teacher, e.g. back-propagation or real-time recurrent
learning. These algorithms are used when target values can be defined
for the ANN outputs.
• Supervised learning using a critic, typically derived from an algorithm using a
teacher. These algorithms are used when it is possible only to tell if the ANN
output is good or not. The algorithms are less efficient than the ones using a
teacher.
• Unsupervised learning, e.g. a Hebb rule. These algorithms are used when
nothing is known about the desired ANN output (the network is supposed to
find structures in the input data). All the learning algorithms rely on the ability
to find structure in the input data; thus the unsupervised learning algorithms
can also be used to aid one of the other types of algorithms.
Many applications use supervised learning algorithms with a teacher; such
learning algorithms are of primary interest in this text.
B Gradient descent algorithms
Supervised learning algorithms are very often based on gradient descent: some
kind of error measure is defined for the network, and the object of the algorithm is
to minimize a cost function defined on this error measure. This is done by
successively finding the gradient of the cost function with respect to the free parameters
of the system and changing the parameters a fraction in the opposite direction,
corresponding to the steepest downhill climb in the cost function landscape (Borowski
and Borwein). More precisely (see Lehmann; Williams and Zipser;
and Hertz et al.):
For each network output $y_k(t)$ at time $t$, we define the output error
$$\varepsilon_k(t) = \begin{cases} d_k(t) - y_k(t), & \text{for } k \in T(t) \\ 0, & \text{for } k \in U \setminus T(t) , \end{cases} \qquad \text{(B.5)}$$
Appendix B Articial neural networks Page 
where T t	 is the set of outputs that has a desired value or target value d
k
t	
at time t Combined with the corresponding inputs these are the training data
dened on t

 t  t

which is called an epoch also T
epc
" t

 t

is the epoch
length	 The time is often completely arbitrary but it is convenient to use	 The
total cost function is the instant cost function accumulated over time
J
tot
"
X
t
 
tt

J t	
usually
"
X
t
 
tt

X
kU
J
k
t	 
or using continuous time
J
tot
"
Z
t

tt
 
J t	 dt  
For the instant cost function a popular choice is the quadratic cost function
J
Q
t	
def
"


X
kU


k
t	 "


X
kT t
d
k
t	  y
k
t		

  
B
	
Letting $\mathbf{w}$ denote the free parameters of the network, we must change these
according to $\Delta \mathbf{w} = -\eta \nabla_{\mathbf{w}} J_{\mathrm{tot}}$, where $\eta$ is a small positive number called the
learning rate. Expressed in coordinates this is
$$\Delta w_{ij} = -\eta \frac{\partial J_{\mathrm{tot}}}{\partial w_{ij}} . \qquad \text{(B.7)}$$
After this change the total cost function is calculated once again, etc., until
equilibrium is reached. This is hopefully the minimum of the cost function (actually,
what is reached is most likely a local minimum, and this is one of the major objects
of the ongoing neural network theory research).
Rather than changing the weights after each epoch (learning by epoch or batch
learning), the weights are often changed continuously (learning by example or
on-line learning) according to $\Delta \mathbf{w}(t) = -\eta \nabla_{\mathbf{w}} J(t)$, or
$$\Delta w_{ij}(t) = -\eta \frac{\partial J(t)}{\partial w_{ij}(t)} . \qquad \text{(B.8)}$$
If $\eta$ is small this will (apart from a constant factor) sum up to (B.7) approximately.
This resembles the Gauss-Seidel method of solving linear equations numerically
(Press et al.; Goggin et al.) in the sense that iterative changes in the
unknown vector $\mathbf{w}$ are applied as quickly as possible.
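The relation between learning by epoch and learning by example can be illustrated on a single linear neuron with the quadratic cost function. This is a minimal sketch under stated assumptions: the neuron has no squashing function, the function names are hypothetical, and the data are arbitrary.

```python
import numpy as np

def grad_J(w, x, d):
    """Gradient of the instant quadratic cost 0.5 * (d - w.x)**2 w.r.t. w."""
    return -(d - w @ x) * x

def total_cost(w, X, D):
    """Instant quadratic cost accumulated over one epoch."""
    return 0.5 * np.sum((D - X @ w) ** 2)

def batch_step(w, X, D, eta):
    """Learning by epoch: one weight change using the accumulated gradient."""
    g = sum(grad_J(w, x, d) for x, d in zip(X, D))
    return w - eta * g

def online_epoch(w, X, D, eta):
    """Learning by example: the weights change after every training example."""
    w = w.copy()
    for x, d in zip(X, D):
        w = w - eta * grad_J(w, x, d)
    return w
```

For a small learning rate the two procedures give almost the same weights after one epoch; the difference is of second order in the learning rate because the on-line gradients are evaluated at slightly different weights.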
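When using a gradient descent algorithm, the neuron squashing function must
be differentiable, as
$$\frac{\partial J_k}{\partial w_{ij}} = \frac{\partial J_k}{\partial y_k} \frac{\partial y_k}{\partial w_{ij}} = \frac{\partial J_k}{\partial y_k} \frac{\partial y_k}{\partial s_k} \frac{\partial s_k}{\partial w_{ij}} . \qquad \text{(B.9)}$$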
Appendix B Articial neural networks Page 
Often used is a function as the hyperbolic tangent g
k
s
k
	  tanh	
t
s
k
	 or the
sigmoid function g
k
s
k
	  !e

t
s
k
	 where 	
t
often
"  is the neuron squashing
function steepness at s
k
" 

Using the quadratic cost function above imposes a problem: when an output
$y_k$ is close to $\pm 1$, $\partial y_k / \partial s_k$, and thus $\partial J / \partial w_{ij}$, will be close to zero regardless of
the value of $d_k$.
To get around this problem one can use a cost function that diverges when
$d_k$ and $y_k$ approach different extreme values, for instance the entropic cost
function, which measures the relative entropy of $d_k$ and $y_k$ (Hertz et al.):
$$J_E(t) \stackrel{\mathrm{def}}{=} \sum_{k \in T(t)} \left[ \tfrac{1}{2} \bigl( 1 + d_k(t) \bigr) \ln \frac{1 + d_k(t)}{1 + y_k(t)} + \tfrac{1}{2} \bigl( 1 - d_k(t) \bigr) \ln \frac{1 - d_k(t)}{1 - y_k(t)} \right] . \qquad \text{(B.10)}$$
Alternatively one can use a Fahlman perturbation, which substitutes
$\partial y_k / \partial s_k + \epsilon_F$ for $\partial y_k / \partial s_k$ in (B.9), where $\epsilon_F$ is a small positive number which
we also call the derivative perturbation.
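The vanishing gradient at a saturated output, and the two remedies, can be compared numerically. This is a minimal sketch assuming a single tanh neuron with unit steepness and outputs in (-1, 1); the function names and the perturbation value 0.1 used in the check are illustrative assumptions.

```python
import numpy as np

def entropic_cost(d, s):
    """Entropic cost for one tanh neuron; valid for |d| < 1 in this sketch."""
    y = np.tanh(s)
    return (0.5 * (1 + d) * np.log((1 + d) / (1 + y))
            + 0.5 * (1 - d) * np.log((1 - d) / (1 - y)))

def dJ_ds_quadratic(d, s, eps_F=0.0):
    """dJ/ds for the quadratic cost; eps_F > 0 adds a Fahlman perturbation
    to the squashing function derivative (1 - y**2)."""
    y = np.tanh(s)
    return -(d - y) * ((1.0 - y ** 2) + eps_F)

def dJ_ds_entropic(d, s):
    """dJ/ds for the entropic cost: the factor (1 - y**2) cancels."""
    y = np.tanh(s)
    return -(d - y)
```

With the output saturated at the wrong extreme (for example s = 5 and d = -1), the quadratic cost gives an almost zero gradient, whereas both the perturbed derivative and the entropic cost keep the gradient clearly non-zero.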
Gradient descent type learning algorithms exist for both recurrent and feed-forward
networks, and many perturbations to the true gradient descent have been proposed
(to improve the algorithms in terms of learning speed, generalization ability, etc.).
Many other types of learning algorithms also exist, but we are primarily interested
in the gradient descent types in this work See also chapter 
B Performance evaluation
When applying an artificial neural network (using teacher supervised learning) to
a problem, this is usually defined as a set of input-output examples. To evaluate the
performance of the network, the data set is split in a training set and a test set.
Using the time-notation above, for instance:
training set: $\{ x_m(t), d_k(t) \}, \quad t_0 < t \le t_1$
test set: $\{ x_m(t), d_k(t) \}, \quad t_1 < t \le t_2$.
To avoid fitting the noise in the input data, it is of paramount importance that the
system is over-determined (compare to curve fitting, Press et al.), i.e. that
there are more input-output examples than degrees of freedom in the system;
as a rule of thumb Hertz et al  Krogh et al 	 ( times more
A good error measure, based on the average relative variance (Weigend et al.),
can be found in Svarer et al.: the normalized average relative variance,
NARV (for the continuous time version, replace the summations by integrations):
$$E_{\mathrm{NARV}}(t_1, t_2) \stackrel{\mathrm{def}}{=} \frac{1}{\mathrm{Var}_t \{ d_k(t) \} \, (t_2 - t_1)} \sum_{t = t_1 + 1}^{t_2} \bigl( d_k(t) - y_k(t) \bigr)^2 , \qquad \text{(B.11)}$$
where the variance of $d_k(t)$ is taken over the complete data set. For the data set
above this is equivalent to (when using dimensionless time)
$$E_{\mathrm{NARV}}(t_1, t_2) = \frac{\dfrac{1}{t_2 - t_1} \displaystyle\sum_{t = t_1 + 1}^{t_2} \bigl( d_k(t) - y_k(t) \bigr)^2}{\dfrac{1}{t_2 - t_0} \displaystyle\sum_{t = t_0 + 1}^{t_2} \bigl( d_k(t) - \mathrm{mean}_{(t_0, t_2]} \{ d_k(\cdot) \} \bigr)^2} .$$
A normalized average relative variance of 1 corresponds to the output being
identical to the mean of the target values.
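The error measure can be computed directly from sampled targets and outputs. This is a minimal sketch for a single output, assuming dimensionless time and assuming that the variance is the population variance of the targets over the complete data set; the function name and argument names are hypothetical.

```python
import numpy as np

def narv(d_eval, y_eval, d_all):
    """Normalized average relative variance for one network output.

    d_eval, y_eval: targets and outputs on the interval being evaluated.
    d_all: targets of the complete data set (training and test).
    """
    d_eval = np.asarray(d_eval, dtype=float)
    y_eval = np.asarray(y_eval, dtype=float)
    mean_sq_error = np.mean((d_eval - y_eval) ** 2)
    # np.var uses the population variance (ddof=0) by default
    return mean_sq_error / np.var(np.asarray(d_all, dtype=float))
```

A network that always outputs the mean of the targets scores 1 when evaluated on the complete data set, and a perfect network scores 0.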
Monitoring the average relative variance as the learning progresses, one will
typically see the training error asymptotically decreasing and the test error (always
larger than the training error) reaching a minimum where the network starts to fit
the noise in the input data see section  Qian and Sejnowski 
 Watkin
et al.). If the test error is close to the training error, we say that the network
has a good generalization ability (more precisely, the generalization ability is the
probability that the network gives the correct answer to an arbitrary input, Hertz
et al.).
Appendix C
Integrated circuit issues
In this appendix various issues of integrated circuits (also, somewhat inaccurately,
termed VLSI (very large scale integration) circuits) are displayed. Most will
probably be known to the reader, but will serve to define various symbols used in the
thesis. We display the standard models for MOS and bipolar transistors, and the
issue of accuracy in analogue computing hardware is touched on, in relation to both
component sizing and layout.
C  MOS transistors
In this thesis we use the Shichman-Hodges model for the MOSFETs (Metal Oxide
Semiconductor Field Effect Transistors, also MOST) in strong inversion (see
Geiger et al.). (The model is suited only for hand calculations; for more
accurate models refer to the literature, especially for the subthreshold-saturation
region.) The model:
The drain current $i_D$ (cf. figure C.1) for an N-channel MOST is given by
$$i_D = \begin{cases}
0, & \text{cutoff: } v_{GS} \le V_T \\
\frac{\beta}{2} (v_{GS} - V_T)^2 (1 + \lambda v_{DS}), & \text{saturation: } 0 \le v_{GS} - V_T \le v_{DS} \\
\beta \bigl( v_{GS} - V_T - \tfrac{1}{2} v_{DS} \bigr) v_{DS} (1 + \lambda v_{DS}), & \text{triode: } 0 \le v_{DS} \le v_{GS} - V_T
\end{cases} \qquad \text{(C.1)}$$
where $\beta$ is the transconductance parameter, $V_T$ is the threshold voltage, $\lambda$ is the
channel length modulation parameter, $W$ is the channel width and $L$ is the channel
length (cf. figure C.2). In reality the transistor is never completely cutoff; near the
threshold voltage the device enters the subthreshold region, where the drain current
is approximately given by
$$i_D = \frac{W}{L} I_{D,\mathrm{st}} \, e^{(v_{GS} - V_T) / (n V_t)}, \quad \text{subthreshold: } v_{GS} \le V_T - n V_t , \qquad \text{(C.2)}$$
where $V_t = kT/q$ is the thermal voltage†, $n$ is the subthreshold slope and $I_{D,\mathrm{st}}$
is a process related parameter (which is dependent on $v_{DS}$ among other quantities).
Figure C.1: N-channel MOS transistor symbols. Mostly the bulk terminal
is connected to the supply voltage and we omit it in the schematic, as shown
to the right. The P-channel MOST symbol has the arrows pointing in the
opposite directions.
Figure C.2: N-channel MOS transistor. Schematic drawing of a physical
(substrate) MOST. The primary design parameters, width $W$ and length $L$,
are shown. Notice the component simplicity.
The parameters in the two strong inversion equations (saturation and triode) are
$$\beta = K' \frac{W}{L} = \mu C_{ox} \frac{W}{L}, \qquad
V_T = V_{T0} + \gamma \Bigl( \sqrt{|\Phi_F| - v_{BS}} - \sqrt{|\Phi_F|} \Bigr), \qquad
\lambda = \frac{k_\lambda}{L} , \qquad \text{(C.3)}$$
where $K'$ is the process transconductance parameter, $C_{ox}$ is the gate oxide
capacitance per unit area, $\mu$ is the surface mobility, $V_{T0}$ is the zero bias threshold voltage,
$\gamma$ is the bulk threshold parameter, $|\Phi_F|$ is the strong inversion surface potential,
and we call $k_\lambda$ the channel length modulation constant.
The transconductance, $g_m$, and the output conductance, $g_{ds}$, in saturation are
often conveniently expressed by the quiescent drain current $I_D$:
$$g_m = \frac{\partial i_D}{\partial v_{GS}} \approx \sqrt{2 \beta I_D}, \qquad
g_{ds} = \frac{\partial i_D}{\partial v_{DS}} \approx \lambda I_D .$$
† Where $k$ is the Boltzmann constant, $T$ is the absolute temperature and $q$ is the
elementary charge.
The P-channel MOST equations are defined in a similar way.
In some MOS processes a phenomenon called snap-back occurs in short channel
devices (Sun et al.; Hansen): for high drain-source voltages, the strong
electrical field near the drain junction will cause a device breakdown and inject
a current into the bulk of the device. This current turns on the drain-bulk-source
parasitic bipolar transistor (cf. next section) and causes the drain-source current
to increase enormously, see figure C.3. The phenomenon wears the device but is
not destructive (as latch-up typically is). The critical drain-source voltage at which
snap-back occurs increases with increased channel length.
Figure C.3: Short channel snap-back (plot of $i_D$ versus $v_{DS}$ with the regions
"Normal operation" and "Snap back"). Snap-back occurs if the parasitic
drain-bulk-source bipolar (NPN for an N-channel MOST) transistor is
accidentally turned on.
Three dierent MOS transistor symbols are shown in gure 
C
 In this work
we usually use the simple middle one with the identiable source terminal We
assume bulk is connected to the supply voltage For transistors with no obvious
source terminal as switches	 and for digitally operated transistors we use the
right symbol The left symbol is used mainly in circuits where the bulk terminal
connection is of particular importance
Appendix C Integrated circuit issues Page 
i C Collector
Emitter -
+
+
-
v
v
BE
CE
Base
Figure 
C
 NPN bipolar tran
sistor symbol The PNP BJT symbol
has the arrow pointing in the opposite
direction
AE
n+
p+
p-
n+
n+
n-
p-
p+
base collector
emitter
FOX
Figure 

C
 NPN bipolar transistor
Schematic drawing of simple vertical
physical NPN BJT The primary de
sign parameter the emitter area A
E

is shown Even for this simple BJT
the minimum layout area is larger
than that of a MOST
C Bipolar transistors
In this thesis we use the simplied	 EbersMoll model for the BJT s Bipolar
Junction Transistors	 see Geiger et al 	
The collector current i
C
cf gure 
C
	 for a NPN BJT in forward saturation
mode is given by
i
C
" 
F
i
E
" 
F
A
E
J
S
e
v
BE
V
t
 	 ! v
CE
V
AF
	  
C
	
where 
F
is the forward emittercollector current gain A
E
is the emitter area J
S
is the transport saturation current density V
t
" kTq is the thermal voltage and
V
AF
is the forward Early voltage The PNP BJT equations are dened in a similar
way
Usually bipolar transistors are not meant to be available in MOS processes
However parasitic devices are always present and in a typical well CMOS process
one type of the transistors can be operated in the lateral bipolar mode LBM
MOSFET 	 which gives access to a bipolar device Vittoz 	 This is illustrated
in gure 
C
for an Nwell process The baseemitter or bulksource	 diode is
biased in the forward direction which turns on the bipolar device The BJT
has two collectors one connected to the substrate conducting a waste current
i
S
" 
FS
i
E
	 and one connected to the drain of the MOS device conducting
the desired collector current i
C
" 
FC
i
E
	 To get a reasonably high eciency of
this device it is important that i	 the area under the emitter diusion is as small
as possible compared to the sidewall emitter diusion area towards the collector
diusion and ii	 the emittercollector base width is as small as possible compared
to the emittersubstrate base width Thus proper layout of the LBM MOSFET
is a minimum size emitter junction encircled by a minimum length MOS gate cf
gure 
C
 For good bipolar operation the MOSFET gate area must be biased
in strong accumulation to turn o the MOST operation A LBM MOST device
symbol is shown in gure 
C

Appendix C Integrated circuit issues Page 
vCE
αFC iE
vBE
+
+
-
-
Base
Gate
Emitter
Substrate
Collector
Figure 
C
 Lateral bipolar mode
MOSFET symbol Connecting the
gate to V
DD
we can replace the LBM
MOST by a BJT with 
FC
 
right
p-
base gate
emitter
collector
n-
p+p+
n+
substrate
FOX
Figure 
C
 Lateral bipolar mode
MOSFET Schematic drawing of
physical Nwell Pchannel LBM
MOSFET The e
ective emitter area
is the emitter junction sidewall area
towards the collector junction
As a current is deliberately injected into the substrate, care should be taken
to efficiently guard the LBM MOST device to reduce the risk of latch-up (cf. figure
C.10).
C Analogue computing accuracy
The computing accuracy (especially in relation to offset) of analogue signal
processing elements is often limited by mismatch of the analogue components. To
illustrate the influence of mismatch, we shall do a case study of subtraction using
simple MOS current mirrors.
The outputs from analogue synapses are often in the form of differential
currents. The subtractions of these can be performed per synapse (figure C.8) or per
row of connected synapses (figure C.9), the latter giving poorer accuracy unless a
transistor area equal to that of the former is applied.
Figure C.8: Current subtraction by synapse. For simplicity all $M$ pairs of
current sources lead the same current. All drain voltages are imagined large
and identical for optimal matching.
Figure C.9: Current subtraction by row. Using a single current mirror, this
has to be $M$ times as wide to be able to sink the increased current. This
very wide current mirror would be implemented as many small ones in parallel.
Now given a transistor parameter P  the variance of this with respect to
some identical reference device	 can be modeled as cf Pelgrom et al 
Lakshmikumar et al  Michael and Ismail  see also Ismail and Fiez

	


P
 A

P
G
P
! S

P
D

P

where A
P
and S
P
are process dependent constants G
P
is a function of the device
geometry and D
P
is a function of the device layout For many MOS transistor
parameters G
P
" WLy D
P
would be a probably highly nonlinear	 function
of device distance D device orientation device context wafer center distance
and other layout specic quantities Usually we set D
P
" D for simplicity this
assumes that careful symmetric layout is used see below	 Assuming a simple
quadratic law for a saturated MOSFET the relative drain current variance is rst
order approximation	


i
D
i
D




W
W

!


L
L

!


K

K
 

! 


V
T
v
GS
 V
T
	


where i
D
 V
T
 K
 
W and L are the average drain current threshold voltage process
transconductance parameter channel width and channel length respectively and
the 

 
s are parameter variances Qualitatively this equates assumingG
P
" WL
and D
P
" D	


i
D
i
D



 !


v
GS
 V
T
	




WL
!D


 
C
	
where the 

s are constants
Returning to our example, assume a current mismatch $\sqrt{2}\, \sigma_{i_0}$ when mirroring
a current $i_0$ using a current mirror with transistor dimensions $W_0$ and $L_0$ and
current standard deviation $\sigma_{i_0}$. The sum of $M$ such currents (figure C.8) yields a
mismatch in the accumulated current, $i_{out}$, of $\sigma_{i_\Sigma} = \sqrt{2M}\, \sigma_{i_0}$ for independent
error sources.
y This holds quite accurately forK
 
and V
T
 For the transistor width and length
however it would be reasonable to assume G
W
"  and G
L
" W 
Were we to use a single current mirror (figure C.9), the $W/L$ ratio would
have to be scaled to accommodate the increased current: $W_\Sigma / L_\Sigma = M W_0 / L_0$
(for an unchanged set of terminal voltages, which is assumed chosen optimally
for matching). Ignoring the device distance for the moment, (C.5) reduces to
$\sigma_{i_D}^2 / i_D^2 \propto 1 / (WL)$. Thus, for an unchanged accuracy ($\sigma_{i_\Sigma} = \sqrt{2M}\, \sigma_{i_0}$), we would
need $W_\Sigma L_\Sigma = M W_0 L_0$, or an unchanged total transistor area (and $W_\Sigma = M W_0$,
$L_\Sigma = L_0$). Ignoring the device distance is not a good approximation.
However, for reasonably large devices, efficient layout techniques (inter-digited,
common centroid, etc.) can be employed, in which case the result holds.
Because of the tremendous inter-wafer process variations, the accuracy of a
single MOS transistor is not usually interesting. Rather, the matching of two or
more devices is important (as above). In this connection the physical device
placement (the layout) is very important: global process variations and
process gradients must be taken into consideration, especially if the devices to be
matched occupy a very large area, as these would be exposed to larger absolute
variations than devices concentrated on a small area. In our example this means
that differencing by synapse possibly gives better accuracy (for a given area) than
differencing by row, as the first solution requires only matching locally, while the
other requires matching of all the transistors in the large current mirror; though
process gradients having a linear influence on the drain current can be canceled
out.
C Integrated circuit layout
The physical layout of integrated circuit components strongly influences the
matching properties. Problems that need to be considered during the layout include (see
Sze; O'Leary; Michael and Ismail; Sackinger and Fornera;
McNutt et al.):
• Local process variations. Small random variations on all parameters are
inevitable. One can only reduce their influence by device scaling, as demonstrated
in the previous section.
• Global process variations. Most importantly causing a constant device size
offset $\Delta l_{ofs}$, e.g. caused by over/under etching or over/under exposure of
photoresist. Usually this problem is dealt with by using only unit size devices,
which are replicated to implement, say, wider transistors (cf. previous section).
Only rational device ratios are possible in this case. This procedure also has
the advantage that all unit devices can be placed in identical surroundings,
reducing errors due to boundary effects. Also, channel width modulation would
be the same for parallel coupled unit MOS transistors.

• Process gradients. In addition to random process variations, parameters are
subject to low spatial frequency systematic variations which can be quite
large. E.g. oxide thickness can vary uniformly over the wafer, or device features
can be scaled differently at the wafer center and wafer edge. To minimize
such variations on matched devices, one must obviously place these as close
together as possible. Also, large devices should be interweaved to ensure the
same average parameters on all devices. Process gradients that cause the
drain current (or capacitance for capacitors, etc.) to vary linearly with the
device position can be canceled by employment of common centroid layout,
where the devices are placed in such a way that the centers of mass (the
centroids) of the distributed devices are common (cf. figure C.10).

• Device orientation. As process gradients will be different in different
directions, it is important that matched devices are placed symmetrically with
respect to the gradients to ensure alike influence on the devices (in practice the
gradient orientation is unknown, and the devices are placed symmetrically with
respect to the vertical and horizontal axes; this is also symmetrical with
respect to process gradients that vary linearly in space). As the temperature has
a strong influence on most electronic device parameters, temperature
gradients are as important as process gradients. Matched devices must be placed
symmetrically with respect to known heat sources (e.g. output drivers).
• Boundary effects. A most prominent influence on device mismatch is due to
inaccuracies on the boundary of the devices. Therefore it is important to
minimize these inaccuracies by ensuring identical boundary conditions for all
devices.
• Noise. Especially in mixed analogue/digital circuits, the noise coupling can
severely degrade performance. In addition to the standard practice of separating
power supplies and analogue/digital circuit blocks, and of using ground wire
shielding when signals have to be mixed, guard bars must be placed around
critical devices to reduce noise coupled via the substrate (which is usually
common for all devices).
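The cancellation of a linear process gradient by common centroid layout, mentioned in the list above, can be illustrated in one dimension. This is a minimal sketch assuming two matched devices A and B, each built from two unit devices on a line with unit pitch, and a parameter that varies linearly with position; the function name and the numbers are illustrative assumptions.

```python
def mean_parameter(positions, gradient, p0=1.0):
    """Average parameter of a device composed of unit devices at the given
    positions, under a linear gradient p(x) = p0 + gradient * x."""
    return sum(p0 + gradient * x for x in positions) / len(positions)

# Side-by-side placement A A B B: the centroids differ (0.5 versus 2.5).
side_by_side = (mean_parameter([0, 1], 0.01), mean_parameter([2, 3], 0.01))

# Common centroid placement A B B A: both centroids are at 1.5.
common_centroid = (mean_parameter([0, 3], 0.01), mean_parameter([1, 2], 0.01))
```

In the side-by-side case the two devices see different average parameters, whereas in the common centroid case the averages are equal for any gradient strength.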
An example of matched transistor layout is given in figure C.10.
Figure C.10: Layout of matched transistors. These transistors are operated in
the lateral bipolar mode, which explains their large distance (necessary well-well
spacing) and the heavy guarding (that prevents latch-up). The common
centroid layout, symmetrical device orientation and (almost) identical boundary
conditions are noticed.
Appendix D
System design aspects
In this appendix various aspects of the chip/system designs, too detailed to put in
the main body of the thesis, are displayed. In section D.1 some general design
considerations are given. Also the synapse chip design is discussed, and a table of
measurements on the first generation chip set is given, as well as proposed chip set
improvements. Discussions in this chapter apply to most of the fabricated chips. In
section D.2 a complete schematic and descriptions of the most important parts of
the back-propagation neuron and synapse chip are given, and likewise a table of
measurements on the chip set as well as proposed chip set improvements. In section D.3 a
schematic, descriptions, a table of measurements and proposed chip improvements
of the RTRL chip are given.
Appendix D System design aspects Page 
D  The scalable ANN chip set
A few general system aspect considerations are needed for the design of the scalable
ANN chip set. These also apply to the other chips:

• Voltage levels. We shall use n-channel MOS resistive circuits rigorously. As
these must be biased in the triode region, this determines the voltage levels
we can use. For good dynamic range, both inputs to the MRC must be chosen
as large as possible. Choosing the gate voltages, their difference v
w
in gure


	 in the range
v
MRCG
 V V 
and the source drain voltages their dierence output voltage or v
z
in the
gure	 in the range
v
MRCS
 v
MRCD
 VV 
gives a reasonably safe margin in our process V
T
  V for v
BS
" V	
Further both gate and source voltages are kept well within the power supplies
ensuring that both can be driven easily by a non railtorail opamp

• Current levels. We are not primarily interested in low-power circuits. Thus,
as for the voltage range, we select as high a current level as we can justify
to assure high dynamic range. Further, an increased current range allows us
to reduce node impedance levels, which improves the speed. The maximum
current is determined by the current sinking capabilities of the synapse row
differencer (cf. later). For
(

 synapses connected to this the full scale
dierential output current of the MRC given the voltages above	 must be
limited to about
i
MRC
 A A  
For the rst synapse chip this was set approximately one order of magnitude
higher	

• Single ended signalling. Though the components we use are mostly working on
differential signals, we have chosen single ended signalling between the chips.
The reason for this is to reduce the pin count which is  

 for a 

 


synapse chip with single ended signalling The cost is a  bit reduction in
resolution and increased noise sensitivity We choose a voltage reference level
at V
ref
" V to be compatible with the chosen MRC voltage ranges For
signals applied to the gates of the MRC we chose the reference level to V
wref
"
V
For easy digital chip control all digital inputs are TTL compatible We use
the TTL level low  
 V high   
V	 to internal logic level low V high
!V	 converter in gure 
D
 the inset displays the levelshifter symbol The
levelshifter reference V
Dref
 VV must be capable of sinking current It
can be generated ochip by a zener diode In addition to levelshifting the circuit
act as a driver for internal capacitive loads
Appendix D System design aspects Page 
Figure D: Digital level shifter. TTL level to internal logic level converter.
The resistor-diode gate protection circuit found on all high-impedance chip
inputs is also shown. The inset displays a level-shifter symbol.
D The synapse chips
The first-generation synapse dimensions were determined to be compatible with
the op-amp dimensions (the chip layout of the op-amp and the synapse matrix was
done simultaneously by different individuals). As measurements have shown, the
weight storage capacitance can easily be reduced. Also, the errors associated with
the synapse chips were predominantly determined by the current differencers. Thus,
the synapse size can easily be reduced compared to the manufactured synapses.
The layout of a reduced-size synapse is shown in figure D. One will notice that
non-minimum-length transistors are used for the nand gate. The reason is that
otherwise snap-back would occur in the CMOS process used. To avoid snap-back
(and to reduce power consumption), the logical high voltage level is usually reduced
with respect to V_DD. Unfortunately, this is impossible in the present circuit, as the
p-channel switches must have a v_GS − V_T ≥  V to reduce subthreshold leakage
currents sufficiently, or (worst case) v_G ≥ V_wmax, where V_wmax is the maximum
allowable gate voltage at the computing transistors. The guard bar around the
computing transistors is supposed to reduce noise coupled from the digital circuitry
nearby (the highly interweaved placement of analogue and digital circuits cannot
be avoided, unfortunately). Likewise, several shielding power lines that protect
the analogue signals from the row and column select signals can be seen in the
figure. Even using this shielding, noise is coupled to the analogue outputs: current
spikes in the order of hundreds of nA can be detected.
In enclosure II, a photomicrograph of the synapse chip can be seen. Note
that the row and column decoders for writing to the synapse weight matrix
are placed to the left and top of the synapse matrix. Digital circuitry is placed to
the left and top of these again, to reduce interference with the sensitive analogue
components. Analogue components are kept to the right and bottom of the matrix.
The row and column decoders are precharged and gates, each having a line-driving
inverter at the output capable of driving   synapses. The column decoder
is considerably faster than the row decoder, ensuring that the selected sampling
switches will close (sample) at the falling edge of the column signal. The analogue
weight refresh signal (that is sampled) is then distributed along each synapse row
to reduce coupling from the sampling column signal.

Figure D: Synapse layout (reduced area). To the left, inside the guard bar,
are the four computing synapse transistors. The two minimum-size sampling
transistors are placed in the middle, and the nand gate to the right.
Some of the most important characteristics of the first-generation neuron and
synapse chips are shown in figure D. The characteristics were measured using
standard methods and equipment (oscilloscopes, signal generators, etc.) and
custom PCBs for applying various bias/test signals. The synapse chip test PCB
includes connection strength backup RAM and a parallel port PC interface for
accessing this. The full schematic of this PCB can be found in enclosure III.
Property                         Value               Bits     Notes
Neuron size                      A_neu =   μm
Neuron nonlinearity              D_g    %            LSB
Neuron derivative nonlinearity   D_dg    %           LSB
Neuron input offset              |I_sIofs|    A      LSB
Neuron output offset             |V_gofs|    mV      LSB
Neuron propagation delay†        t_gpd    s          LSB      C_L   pF
                                 t_gpd    s          LSB      C_in^synchip
LPNP e-c current gain            _FC
Synapse size                     A_syn =   μm                 Reducible
Matrix offset                    |V_wofs|    mV      LSB
Matrix resolution                V_wres    mV        LSB
Synapse nonlinearity             D_wz    %           LSB
                                 D_wz    %           LSB      Estimated
Synapse output offset            |I_sOofs|    A      LSB
Synapse input offset             |V_zofs|    mV      LSB      R_L   kΩ
Synapse propagation delay†       t_spd    s          LSB      C_L =  pF
                                 t_spd    s          LSB      C_out^synchip
Matrix write time‡               t_wwr    ns         LSB
Matrix (weight) drift            |w|    mV/s         LSB/s    C_w =  pF
Weight range                     |w_kj|_max                   for y_k = tanh(s_k)
Layer propagation delay†         t_lpd    s          LSB      C_L   pF
                                 t_lpd    s          LSB      C_L   pF
† Time from input change to output has settled within  LSB.
‡ Necessary length of write pulse that ensures the output will settle within  LSB.

Figure D: Table of ANN chip set characteristics. The column Bits is
the equivalent in least significant bits of the property value, given an  bit (or
otherwise indicated) resolution. Note that the estimated synapse nonlinearity
is in compliance with the measurements on the back-propagation chip set.
D Chip set improvements
In addition to the important process parameter/temperature variance compensation,
the following issues are subjects for improvement of the developed cascadable
ANN chip set:

• Reducing the synapse area. The synapse area can be reduced to about
   μm ×  μm (cf. above). Though such a reduction will increase the synapse
  output offset current, the synapse size must be small to allow massively
  parallel computations. If another CMOS process is used, the synapse size can be
  further reduced.

• Reducing the power supply voltage. Redesigning the circuit for compatibility
  with a  V (or even  V) process is not a trivial matter. The voltage range
  on the MRC, for instance, must be reduced to about  V in a  V process.

• Neuron output level shifting. Referring the inter-chip voltages (as the neuron
  output) to  V was the cause of numerous interface problems. Level shifters
  should be introduced to allow a  V reference.

• Improving the op-amps: reducing area, reducing offset, increasing gain and
  increasing the output voltage range. (Objectives of any op-amp, probably.)
  However, the regulated gain cascodes in the op-amp did not have the expected
  performance, and the output voltage range must be (almost) rail-to-rail for a
  future implementation in a digital CMOS process. Further, the layout can be
  improved.

• Improving the current conveyor (as the op-amp), and also improving the
  accuracy of the x-z current mirroring.

• Implementing on-chip bias circuits. Most bias circuits need not be very
  accurate and can be generated on-chip. A few external references (such as the
  neuron output range) are necessary, though.

• Implementing an on-chip TTL level-shifter reference, and reducing the power
  consumption of this circuit.

• Introducing a current op-amp based synapse differencer for automatic process
  parameter/temperature dependency cancelling.

• Placing synapse column drivers on the synapse chip instead of on the neuron
  chip. This improves the cascadability of the chip set.

• Reducing the neuron size. The neuron size is predominantly determined by
  the op-amp. Reducing the size of this will reduce the neuron size.

• Exploring other neuron topologies. If we do not need to calculate the neuron
  derivative as a function of the neuron output (the derivative calculation
  circuit could be placed on the neuron chip), other neuron circuit topologies are
  possible. This is discussed in section  .

• Introducing offset cancelling. For very large synapse chips, cancelling of the
  output current offset may be inevitable. Different offset cancelling techniques
  must be considered. This is discussed in chapter  .

• Implementing a full-size chip set. Though the layer propagation time is not
  very dependent on the synapse chip size, it should be verified that a  GCPS
  per chip system is indeed feasible using our  μm technology. Also, new
  problems might arise when scaling the system; these should be explored.
D The onchip backpropagation chip set
The onchip backpropagation chip set is designed reusing as much layout of the
scalable recall mode chip set as possible
The backpropagation synapse chip layout closely resembles that of the second
generation recall mode synapse chip The synapse layout is identical and the
row column decoders for weight access have also been reused The opamp and the
current conveyor for synapse output dierencing are also largely unchanged though
switch transistors have been added for correct row column element operation
The backpropagation neuron chip on the other hand is drawn almost from
scratch only the opamp layout is reused
D The back	propagation synapse chips
The column row element of the backpropagation synapse chip which is used to
take the accumulated synapse output dierence and to drive vectors of synapses
is shown in gure 

D
 The route signal is distributed to all rows and columns
and is used to route the previous layer neuron activation when the chip operates
in route mode The six control signals A through F	 are operated dierently for
the row and the column elements as shown in gure 
D

Ctrl       Driving signal
signal     Row elements     Column elements
A          reverse          reverse
B          reverse          reverse
C          forward          reverse
D          forward          reverse
E          route            route
F          row = k          column = j

Figure D: Table of row/column element control. The three control signals
forward, reverse and route correspond to the equivalent operational modes.
The row and column select signals needed for the route mode of the chip are
taken directly from the corresponding synapse matrix access (or write) signals.
This means that the connection strength at location {k, j} is lost when input j is
routed to output k. After the learning hardware has computed new weights for a
layer l, the connection strengths in that layer will be lost and must be rewritten
before the computation of new weights in the preceding layer is resumed.
The principal schematics of the row and column elements in the different
operational modes are shown in figures D through D.
In enclosure II, a photomicrograph of the chip can be seen. The synapse
matrix and the groups of row ( ) and column ( ) elements are easily identifiable.
A layout error (a V_DD-V_SS short circuit in some non-essential test pad cells)
necessitated microsurgery on the chip to isolate the faulty pad cells. Fortunately,
this was possible without damaging other parts of the chips for a reasonably large
number of devices.

Figure D: Back-propagation synapse column/row element. The output
push-pull source follower of the left op-amp is explicitly drawn to show the
dual functionality of this op-amp (voltage follower or current conveyor). The
connection of a single synapse is also shown. Transistor W/L ratios are given
for the synapse (7.2/137) and a sample switch (40/2.4).
Figure D: Forward mode BPL synapse row element. The CCII acts as
current differencer.
Figure D: Forward mode BPL synapse column element. The CCII acts
as voltage buffer.
Figure D: Route mode BPL synapse row element. The route signal is
applied to row row.
Figure D: Route mode BPL synapse column element. Column col
drives the route signal.
Figure D: Reverse mode BPL synapse row element. The CCII acts as
voltage buffer.
Figure D: Reverse mode BPL synapse column element. The CCII acts
as current differencer.
D The back	propagation neuron chips
The schematic of a backpropagation neuron can be seen in gure 
D
 Note the
excessive use of the MRC see also the following schematics	 several of these are
preceded by level shifters on the gate inputs for a sample layout cf appendix E	
Various intermediate signal names are indicated in the gure The dierent blocks
of the neuron are identied as follows

 TransRFwd is the neuron forward mode transresistance R
IS
	 which is con
trolled by the external voltage V
CG
also V
IR
	 for adjusting the neuron slope

 Tanh is the hyperbolic tangent neuron The external voltage V
CT
also V
OR
	
controls the neuron output range

 SamplerI is the neuron activation sampler which is controlled by the holdsq
signal The dierential mode sampling scheme reduces the eect of charge in
Appendix D System design aspects Page 
jection and leakage currents The sampler has an additional global external	
input neuact which can be used to refresh the sampled neuron activation
This input is gated by the refresh signal and a neuron k select signal
neusel
k
 the latter being generated by a precharged column selector like the
one used on the synapse chip

 TransRRew is the reverse mode transresistance neuron error computer The
module is active when the chip is not in forward mode If lastlayer " 
it computes d
k
 y
y
 Otherwise the input current is converted to a voltage
controlled by V
CGR
 The transistor dimensions are chosen such that setting
V
CG
" V
CGR
is equivalent to a neuron input scaling of  at room tempera
ture

 SqrSqr computes the neuron derivative For an unscaled result set the external
control voltage V
Cg

V
wref
" V The maximal neuron output voltage V
ymax
must also be applied externally

 MulD computes the weight strength error For an unscaled value the external
control voltage V
C
 V
wref
" V

 actbus and deltabus are used to distribute the neuron activation and the
weight strength error respectively to the weight updating hardware
The schematic of the backpropagation neuron chip weight updating hardware is
seen in gure 
D
 It consist of two modules

 MulD scales the neuron activation with the learning rate controlled by the
external signals V

and V
C


 MulD multiply the scaled neuron activation and the weight strength error To
this value the old weight externally supplied by V
old
w
kj
 V
unit
must be set to
unity ie V	 and an oset compensation term externally supplied via the
V
xofs
and V
yofs
signals	 This sum is divided by the external voltage V

dec

giving an optional weight decay V

dec

" V gives no decay	

 OT OT and OT determines the output type This can be a raw one a
level shifted one or a level shifted and buered one The output is gated
by the learn control signal and the chip select signal cs
In enclosure II a photomicrograph of the chip can be seen The neurons are
the four long low horizontal strips
Some of the most important characteristics of the back-propagation neuron and
synapse chips are shown in figure D. Again, the characteristics were measured
using standard methods and equipment and custom PCBs for applying various
bias/test signals. The test PCBs are similar to the ones used for the recall mode
ANN chips, though the bidirectional operation requires a somewhat more flexible
test bed, of course. The full schematic of the back-propagation synapse and neuron
chip test PCBs can be found in enclosure III.

Figure D: Back-propagation neuron schematic. Notice the excessive use of
MRCs. W/L ratios are indicated for a representative sample of transistors.
Figure D: Back-propagation weight update schematic. One instance per
neuron chip. Three output types are possible for test purposes: raw,
level shifted, and buffered level shifted. The reference output is needed if
a level shifted output is used.
          Property                       Value                Bits     Notes
Synapse   nonlinearity                   D_wz    %            LSB
          transconduc. variation†        ΔG_syn    %          LSB
          chip input offset              |V_zofs|    mV       LSB
          chip output offset             |I_sOofs|    A       LSB‡
          weight offset                  |V_wofs|    mV       LSB
          weight resolution              V_wres    mV         LSB
Neuron    (tanh) nonlinearity            D_g    %             LSB
          input transres. variation†     ΔR_IS    %           LSB
          output transres. variation†    ΔR_OR    %           LSB
          input offset                   |I_sIofs|    A       LSB
          output offset                  |V_yOofs|    mV      LSB
          parabola nonlinearity          D_y    %             LSB
          parabola gain variation        ΔA_PG    %           LSB
          parabola input offset          |V_yIofs|    mV      LSB
          parabola output offset         |V_ofs|    mV        LSB
          derivative nonlinearity        D_g'    %            LSB
          derivative input offset        |V_g'Iofs|    mV     LSB
          derivative output offset       |V_g'Oofs|    mV     LSB      C_L   pF
Syn. chip propagation delay              t_zspd    s          LSB      R_L   kΩ
Syn. chip propagation delay              t_zspd    s          LSB      R_L   kΩ
Neu. squashing prop. delay               t_sypd    s          LSB      C_L   pF
Neu. chip weight calc. time              t_zwpd    s          LSB      C_L   pF
Synapse weight drift                     |w|    mV/s          LSB/s    C_H   pF
Neuron activation drift                  |y|    mV/s          LSB/s    C_H   pF
† The variations are for a single chip.
‡ Equivalent LSB of a single synapse.
The delays are for the signals to settle within the given precision.

Figure D: Table of back-propagation chip set characteristics. Apart from
the reduced synaptic output current level and the increased neuron output
offset, the chip performances are similar to the recall mode chip set.
D The scaled back	propagation synapse chips
The scaled backpropagation synapse chip was implemented simply by adding rows
and columns to the backpropagation synapse chip In enclosure II a photomicro
graph of the chip can be seen It is seen that now for this matrix size   	
the major part of area is clearly taken up by synapses The control signals for
the scaled backpropagation synapse chip being identical to the ones of the rst
backpropagation synapse chip the test PCB was constructed as a piggyback
PCB that would t into the synapse chip socket of the original backpropagation
synapse chip test PCB The piggyback PCB splits the address space of the original
chip in two one part is mapped on a relocatable part of the scaled chip address
space while the other is replicated in the remaining address space In this way it
is possible for any synapse strength to be set independently of the other synapse
strengths using the limited address space of the original backpropagation synapse
chip test PCB The full schematic of the scaled backpropagation synapse chip test
piggyback PCB can be found in enclosure III
The scaled synapse chip being mal-manufactured, measured chip properties
(for instance propagation delays, offset errors and nonlinearities) will not
characterize the chip well. Measurements done on the original back-propagation synapse
chip will probably, for the most part, give a better performance estimate of properly
manufactured chips. As we shall try to use the chip in our RTRL/back-propagation
system in spite of the poor performance, a few measurement results are given in
figure D. Notice the very large systematic offset errors. The reference voltage was
raised to V_ref =  V to accommodate the reduced input range of the current
conveyors. A typical synapse transfer characteristic is shown in figure D.
Property Value Bits stoch syst
Matrix oset jV
wofs
! mVj


mV LSB

LSB

Synapse nonlinearity D
wz


& LSB

Chip output oset jI
sOofs
  Aj


 A LSB

LSB

Synapse input oset jV
zofs
 
mVj


mV LSB

LSB

Synapse input range v
z
! V
ref
  V 
 
V
Synapse output range ji
wz
j


 A jv
z
j jV
w
j  V
Figure 


D
 Table of scaled BPL synapse chip characteristics Malfabricated
chip notice the large systematic o
sets See also the characteristics of the
nonscaled chip
Appendix D	 System design aspects Page 
2 
  Aμ
-
2 
  Aμ
-1V 1Vvz
vz
= 0.94VVw
-0.94V
i s
Figure 

D
 Scaled synapse
chip characteristics Measure
ments on a single synapse the
output o
set has been can
celed Notice the restricted in
put range and the fairly low
transconductance presumably
caused by a low process trans
conductance parameter and
raised threshold voltages
D Back	propagation chip set improvements
Being composed primarily of components found on the recall mode ANN chip
set most of the improvements mentioned in section D apply also to the back
propagation chip set eg reducing the power supply improving the opamps and
current conveyors and implementing onchip bias circuits and temperature com
pensation	 A few additional issues are subjects for improvement of the developed
cascadable backpropagation learning ANN chip set

 Redirection switch matching The weight osets are primarily determined by
mismatch in the synapse chip redirection switches These should be matched

 Auto oset compensation Many oset errors being destructive to learning
processes auto oset compensation circuits should be included on the chips
This could be chopper stabilizing circuits or circuits like the one mentioned in
section  Several signals are subjects for oset compensation
 The synapse chip forward mode output current
 The synapse chip reverse mode output current
 The neuron chip weight change output voltage
 The neuron chip neuron activation output voltage range
 The neuron chip weight change error voltage
Except for the weight change output the wire count of these signals grows
as ON	 thus a fairly simple cheap	 oset compensation scheme must be
employed

 Improved neuron derivative computation Several choices for improvement is
possible perhaps the simplest would be to clipping negative outputs from this
circuit

 Scaled synapse chip remanufacturing

 Nonlinear backpropagation extensions The synapse chip is compatible with
nonlinear backpropagation The neuron chip can in a simple way be ex
panded to include nonlinear backpropagation
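
The suggestion of clipping negative derivative outputs can be illustrated with a small numerical sketch. For a hyperbolic tangent neuron with output range ±V_ymax, the derivative is proportional to 1 − (v_y/V_ymax)², which approaches zero near saturation, so even a small negative error drives the computed value below zero and would reverse the sign of the back-propagated error. The additive `offset_error` term and all names below are assumptions made for illustration; they are not taken from the chip design.

```python
def neuron_derivative(v_y, v_ymax, offset_error=0.0, clip=True):
    """Normalized tanh derivative computed from the neuron output.

    offset_error is a hypothetical additive circuit error.
    """
    # Parabola 1 - (y / y_max)^2, disturbed by the assumed offset error.
    raw = 1.0 - (v_y / v_ymax) ** 2 + offset_error
    if clip:
        # Clip negative results to zero, as suggested in the text.
        return max(raw, 0.0)
    return raw


if __name__ == "__main__":
    # Near saturation, a small negative offset makes the raw value negative.
    print(neuron_derivative(1.0, 1.0, offset_error=-0.05, clip=False))
    print(neuron_derivative(1.0, 1.0, offset_error=-0.05, clip=True))
```

Clipping only affects the region where the ideal derivative is already close to zero, so the cost in accuracy is small compared with an update in the wrong direction.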
D The RTRL chip
The width N data path RTRL chip consists almost entirely of layout taken from
the backpropagation chip set The chip development has thus consisted mainly of
a rearrangement and rerouting	 of building blocks plus the layout of a few digital
components
The schematic of a width N data path RTRL signal slice can be seen in gure


D
 Various intermediate signal names are indicated in the gure The dierent
blocks of the neuron are identied as follows

 ETSampler is the edge trigged neuron activation sampler The input signal
v
y
k
t	 is sampled at the falling edge of the 
Sy
clock signal

 Diff computes the neuron error signal v

k
t	 " v
d
k
t	v
y
k
t	 where v
d
k
t	 is
the externally applied target value The output is gated via the target select
input T
k
t	

 SqrSqr computes the neuron derivative v
g

k
t	 assuming a hyperbolic tan
gent activation function	 The maximal neuron output voltage V
ymax
must be
supplied externally

 TransR computes the net input derivative variable v

k
ij
t	 It is composed of
a transresistor transresistance controlled by the externally applied voltage
V
R
	 and the signal slice k part of the distributed i
z
j
demultiplexor the tree
transistors to the left controlled by bits i

and i

plus chip select i
CS
for
cascading	 of the i variable	 The external current input i
wp
k
t	 is added to
the demultiplexed i
z
j
signal

 MulD computes the neuron derivative variable v
p
k
ij
t	 This output is routed
to the chip	 global v
p
k

ij
t	 chip output when selected by the k
 
bits k
 

 k
 

and chip select k
 
CS
	 neuron derivative variable access input This input also
controls the

 Sampler is the signal slice k v
p
k
ij
t  	 sampler The input is taken from
the chip	 global v
p
k

ij
t  	 input and is sample at the falling edge of the

Sp
clock input when selected by k
 
using a precharged address decoder as
shown	 Actually two v
p
k

ij
t  	 input channels are provided one is meant
for initializing the p
k
ij
s and the other is meant for normal operation

 VMulElm is the signal slice k part of the distributed inner product multiplier
computing the weight changes The outputs are connected to the shared cur
rent conveyor The inputs are the v
g

k
t	 and either v
p
k
ij
t	 or v

k
ij
t	 for using
the quadratic and entropic cost function respectively	 controlled by the QE
and QE inputs
The schematic of the order N signal path RTRL chip weight updating hardware
is seen in figure D. It consists basically of the CCII+ current conveyor that
takes the difference of the distributed MRC inner product/weight change multiplier
(the CCII+X and CCII+YZ lines). Connected to the input of the CCII+ are four
MRCs used for offset compensation. The first is used for external compensation:
a (differential mode) current proportional to the V_ofs input is added to the weight
change current. The three other MRCs are used for the automatic offset
compensation circuit. One is used to ensure a positive output current offset, and the other
two are connected to the two 8 bit offset compensating voltage output DACs (vendor
standard cell components). The V_AZMM and V_AZML inputs are used to control
the transconductance of the MRCs connected to the most significant and least
significant DAC, respectively (the MRCs are scaled such that V_AZMM   V_AZML
should be chosen).
When the azero (auto-zeroing) signal is high, the CCII+ output is directed to
the current comparator rather than the chip weight change current output i_Δwij^chip.
Also shown in the figure is the v_zj transconductor. The transconductance is
controlled by the V_Gz input. In enclosure II, a photomicrograph of the chip can be
seen. Note that the D/A converters (lower left) and the SAR (lower right) take up
a considerable amount of area. The signal slices are the four long, low rows (top).
A start conversion (sc) signal is needed for the successive approximation
register for initiating an auto offset cancellation phase. This is generated by the
circuit in figure D: when the azero signal goes high, the sc signal will be high in
the following Φ  clock phase. The CMOS transmission gate symbol is defined in
figure D. It must be driven by a complementary signal on the gates (in the sc
generating circuit, only non-inverting input signals are shown for simplicity).
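
The auto-zeroing loop described above (a successive approximation register driving a most significant and a least significant DAC, with a current comparator closing the loop) can be sketched behaviourally as follows. The sketch assumes an ideal comparator, a strictly positive offset (which is what the dedicated MRC is said to ensure), and a 256:1 weighting between the two 8 bit DACs so that they act as one 16 bit word; the weighting and all function names are illustrative assumptions, not details confirmed by the text.

```python
def sar_autozero(offset_current, full_scale, bits=16):
    """Return the SAR code whose compensation current best cancels the offset.

    offset_current: positive offset to cancel (same unit as full_scale).
    full_scale:     compensation current corresponding to code 2**bits.
    """
    lsb = full_scale / (1 << bits)
    code = 0
    # Successive approximation: try each bit from most to least significant.
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        # Ideal comparator: keep the bit if the residual stays non-negative.
        if offset_current - trial * lsb >= 0:
            code = trial
    return code


def split_code(code, bits_per_dac=8):
    """Split a SAR code into (most significant, least significant) DAC words."""
    mask = (1 << bits_per_dac) - 1
    return (code >> bits_per_dac) & mask, code & mask


if __name__ == "__main__":
    code = sar_autozero(0.3, 1.0)
    print(code, split_code(code))
```

After the conversion, the residual offset is non-negative and smaller than one LSB of the combined word, provided the offset lies within the compensation range; larger offsets simply saturate the code.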
Figure D: SAR start signal gating. Generation of a one-clock-period start
signal for the successive approximation register. The two successive inverters can
be removed; they are present for design ease.
Figure D: Transmission gate symbol. Commonly used symbol for a CMOS
transmission gate, controlled by complementary signals (top and bottom).
Sometimes only one of the control gates is explicitly shown driven.
Standard CMOS digital logic usually requires (as our circuit) a two-phase
non-overlapping clock, usually (as in our case) available in both true and complementary
forms. A circuit that generates the four clock phases (Φ1 and Φ2, each in true and
complementary form) from a single input clock signal Φ is shown in figure D. It is
based on a cross-coupled nor gate pair for generating the two non-overlapping clock
phases (see e.g. Geiger et al.) and on two n, n + 1 inverter chain pairs for the
generation of skew-free inverted clock signals (Shoji). Additional capacitive loads
can be placed at the outputs of the nor gates (when IncICD =  ) to increase the
inter-clock delay (the time period when both Φ1 =   and Φ2 =  ).
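
The behaviour of a cross-coupled nor gate pair as a non-overlapping clock generator can be checked with a small discrete-time model. The sketch below is a behavioural illustration only: each nor gate is given one time step of delay, `feedback_delay` stands in for the extra delay of the feedback path (for example, added capacitive load), the phases are assumed active high, and all names are hypothetical. It is not a model of the actual circuit or its inverter chains.

```python
def nonoverlapping_clocks(clk, feedback_delay=1):
    """Simulate two cross-coupled NOR gates driven by clk and its complement."""
    phi1, phi2 = [0], [0]
    for t in range(1, len(clk)):
        c = clk[t - 1]                            # one step of gate delay
        fb = max(0, t - 1 - feedback_delay)       # delayed cross-coupling
        phi1.append(int(not (c or phi2[fb])))         # NOR(clk, phi2 delayed)
        phi2.append(int(not ((not c) or phi1[fb])))   # NOR(not clk, phi1 delayed)
    return phi1, phi2


def square_wave(periods, half_period):
    """Input clock: half_period samples low, then half_period samples high."""
    return ([0] * half_period + [1] * half_period) * periods


def dead_time(phi1, phi2):
    """Number of time steps in which both phases are low."""
    return sum(1 for a, b in zip(phi1, phi2) if a == 0 and b == 0)


if __name__ == "__main__":
    clk = square_wave(4, 10)
    for delay in (1, 3):
        p1, p2 = nonoverlapping_clocks(clk, delay)
        print(delay, dead_time(p1, p2))
```

In this model the two phases are never high at the same time, and the interval in which both are low grows with the feedback delay, which mirrors the role of the optional capacitive loads in lengthening the inter-clock delay.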


Q/
E
Q/
E
k’ C
S
k’ 1
k’ 0
Φ
Sp
i z
j
V R
σ
i C
Si 0 i 1
Φ
Sy
Vy
m
ax
Ta
rg
et
in
pu
t p
ad
Ta
rg
et
 se
lc
et
in
pu
t p
ad
N
eu
ro
n 
ac
tiv
at
io
n
in
pu
t p
ad
o
u
tp
ut
 p
ad
N
eu
ro
n 
ac
tiv
at
.
N
et
 in
pu
t d
er
.
in
pu
t p
ad
N
eu
ro
n 
de
riv
at
iv
e
o
u
tp
ut
 p
ad
t(  ) kT
ij
t(  )
k
σ
v
ε k
t(  )
v
k
t
y
(  -
1)
v
k
t(  )
y
v
k
t(  )
d
v
p i
j
t(  )
k
v
pij (  -1)tkv
pij (  -1)tk’v
pij tk’(  )v
bo
tto
m
to
p
top
bo
tto
m
to
p
top
E
T
Sa
mp
le
r
bo
tto
mto
p
Sa
mp
le
r
top
V r
ef
V r
ef
V r
ef
V w
re
fl
V w
m
ax
l
M
u
l1
D
V r
ef
V r
ef
V
M
u
lE
lm
CC
II+
X
CC
II+
Y
Z
V r
ef
V r
ef
V r
ef
V r
ef
V r
ef
Vw
m
ax
l
Vw
re
fl
Sq
rS
qr
V r
ef
V r
ef
V r
ef
re
fl
V w
V w
m
ax
l
D
if
f
V r
ef
V r
ef
T
r
a
n
s
R
SSV
V D
D SSV
SSV
V D
D SSV
SSV
V D
D SSV
V D
D
V D
D
SSV
SSV
V D
D
V D
D
SSV
SSV
V D
D SSV
V D
D SSV
V D
D SSV
V D
D SSV
V D
D SSV
V D
D SSV
v 
  
  
t
g’ k
(  )
~
30
   
A
μ
V D
D
V D
D
SSV
SSV
i
t(  ) k
(w
p)
2p
F
2.
4/
2.
4
7.
2/
17
.6
7.
2/
41
.6
4.
8/
91
.6
7.
2/
41
.6
7.
2/
41
.6
7.
2/
41
.6
4.
8/
91
.6
5.
6/
4.
8
40
/2
.4
2.4/6.4
Figure 

D
 RTRL signal slice schematic The width N data path signal
slice is composed mostly of components also found on the backpropagation
chip set Locally generated references V
wre
and V
wmaxl
 are used out of
convenience
Figure D: RTRL weight change schematic. One instance per RTRL chip. Most of the auto zeroing circuit is shown only as a block schematic. The network input signal transconductor is also shown in the figure. (Blocks labelled in the schematic: auto zero signal, 16 b SAR, MS 8 b DAC, LS 8 b DAC, CCII+, and input pads for offset compensation, max signal, and the AZ high and low references.)

Figure D: Clock generator. Two-phase non-overlapping skew-free clock generating circuit with large capacitive load driving capability. The inter clock delay can be increased by adding capacitance at the nor gate outputs. One clock signal is a delayed version of the input. (Signals labelled in the schematic: Φ, Φ1, Φ2, IncICD.)

For skew-free clock inversion one must ensure that the sum of rise times $t_R$
(or $t_{pdLH}$) in the inverting and the noninverting inverter chains are equal for both falling and rising input signals. Likewise for the sum of fall times $t_F$ (or $t_{pdHL}$). Assume valid the simple CMOS inverter rise/fall time relations (see e.g. Weste and Eshraghian):
$$ t_R \propto \frac{C_L}{(V_{DD} - V_{SS})\, K'_P\, W_P / L_P}, \qquad t_F \propto \frac{C_L}{(V_{DD} - V_{SS})\, K'_N\, W_N / L_N}, $$
where $C_L$ is the inverter load capacitance, and assume
$$ C_{L(k)} \propto C_{ox} \left( W_{N(k+1)} L_{N(k+1)} + W_{P(k+1)} L_{P(k+1)} \right), $$
where the indices $(k)$ and $(k+1)$ refer to inverter number counted from the left. In this case, using the relative transistor widths indicated in figure D (equal lengths) gives skew-free inversion at the output of the third inverter in the upper chain compared to the second inverter in the lower chain. If the final load is unknown this is the best we can do: the output inverters are designed to drive a large capacitive load with equal rise and fall times for typical process parameters.
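The matching condition above can be checked numerically. The following is a minimal sketch, not the design procedure used for the chip: it assumes equal transistor lengths and that both transistors of an inverter scale with one relative width, so that each stage delay is proportional to the width of the next stage divided by its own width. The function name `edge_sums`, the width assignment (1, 2, 8 for the three inverter chain; 1, 4 for the two inverter chain) and the final load width of 16 are assumptions chosen to be compatible with the width annotations of the clock generator figure; they are not confirmed by the text. Rise and fall contributions are summed separately, so the unequal $K'_P$ and $K'_N$ factors drop out of the comparison.

```python
def edge_sums(widths, load_width, input_rising):
    """Return (rise_sum, fall_sum) of normalised stage delays for an inverter chain.

    widths       -- relative transistor widths of the inverters, left to right
    load_width   -- relative width of the gate loading the last inverter (assumed)
    input_rising -- True if the chain input makes a rising transition
    """
    stages = list(widths) + [load_width]
    rise = fall = 0.0
    rising = input_rising
    for k in range(len(widths)):
        # t is proportional to C_L / W, and C_L to the next stage's gate width
        delay = stages[k + 1] / stages[k]
        if rising:
            fall += delay   # a rising input gives a falling output
        else:
            rise += delay   # a falling input gives a rising output
        rising = not rising  # each inverter flips the edge direction
    return rise, fall


# Hypothetical width assignment: inverting chain (3 stages), noninverting chain (2 stages)
upper, lower, load = [1, 2, 8], [1, 4], 16
for edge in (True, False):
    print(edge, edge_sums(upper, load, edge), edge_sums(lower, load, edge))
```

With these assumed widths the rise sums and the fall sums agree between the two chains for both input edges, which is the skew-free condition stated above.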
The RTRL chip test PCB schematic is shown in enclosure III. Chip measurements were done in the standard way; a table of chip characteristics is found in figure D. Since the chip (like the scaled backpropagation synapse chip) was malmanufactured, chip properties (such as propagation delays, offset errors and signal ranges) will not be characteristic of a properly manufactured chip. As we shall try to utilize the chip anyway, we have included a selection of characteristics. The reference voltage $V_{ref}$ was changed to accommodate the reduced input voltage range of the current conveyor (as it was for the scaled synapse chip).
Property                       Symbol               Value   Bits    Notes
Error input range              $v_d - v_y$          V               †
Sampler input range ‡          $v + V_{ref}$        V
Neuron input range             $v_z + V_{ref}$      V
Neuron input offset            $|V_{zofs}|$         mV      LSB     at $|w_{ij}|_{max}$
Weight change offset           $|I_{wofs}|$         A       LSB
Error input offset             $|V_{ofs}|$          mV              Targeted
Net input derivative offset    $|I_{wpofs}|$        nA      LSB
Sampler output offset          $|V_{ofs}|$          mV      LSB     at a fixed $v_{in}$
                               $|V_{err}|$          mV      LSB     over a range of $|v_{in}|$
Derivative output offset       $|V_{ptofs}|$        mV      LSB
Parabola input offset          $|V_{g'Iofs}|$       mV      LSB
Parabola output offset         $|V_{g'Oofs}|$       mV      LSB
IPM element nonlinearity       $D$                  %       LSB
Multiplier nonlinearity        $D_{mul}$            %       LSB
Parabola nonlinearity          $D_y$                %       LSB
Propagation delays             $t_{pd}$             s
Sampler decay rate                                  mV/s    LSB/s

† For a restricted range of $v_d$ at the given $V_{ref}$, other values of $v_d - v_y$ are possible, though the difference will be nonlinear.
‡ Both the edge triggered ($v_{yt}$) and the transparent ($v_{pt}$) sampler.
Further notes: relative to single multiplier element; typical order of magnitude.

Figure D: Table of RTRL chip characteristics. Malfabricated chip, primarily resulting in a reduced dynamic range. The large $V_{zofs}$ is caused by a design flaw. See also the backpropagation chip set characteristics.
D RTRL chip improvements
Being composed primarily of components found on the backpropagation chip set, most of the improvements mentioned in the previous sections apply also to the RTRL chip. A few additional issues are subjects for improvement of the developed real-time recurrent learning chip:
• Making auto offset compensation work. Because the weight change offset is of paramount importance, one of the primary tasks of future research is to implement working auto offset compensation hardware. A chip remanufacturing might solve the problem, but one must make certain that this is probable before doing so.
• Reduce $v_z$ input offset. This can be done in a straightforward manner, just redesigning the output current mirrors of the current conveyor to the correct current range.
• Improve neuron derivative computation. Just as was the case for the backpropagation neuron chip (see above), though recall that only the neuron activation (and not the neuron net input) will be available for the calculation in this case.
• Chip remanufacturing.
• Nonlinear RTRL extensions. Equivalently to the backpropagation neuron chip, the width N data path RTRL module can in a simple way be expanded to include nonlinear real-time recurrent learning.
D The RTRLbackpropagation system
A complete schematic without decoupling capacitors	 of the RTRL backpropa
gation system can be found in enclosure III It is currently under construction thus
we can not present any measurements based on it Note that the external synapse
chip output oset compensation have yet to be included on the board The board
basically consists of a large number of D A and A D converters and digital latches
interfaced to a standard PC AT ISA	 bus for controlling the custom ASICs in
our learning system Various control signals used on the board  the lowlevel
programmers interface  are described in enclosure IV
Page 

Appendix E
Building block components
In this appendix various building block components used on the chips are briefly described. The regulated gain cascode based operational amplifier, current conveyor and transconductor were designed by Thomas Kaulberg. Also the layout of typical MOS resistive circuits is shown.
E  The opamp and the CCII
The operational amplier and the current conveyor used on the chips was designed
by Thomas Kaulberg The opamp schematic is shown in gure 

E
 It is a
two stage cascode amplier the gain originating from the input stage dierential
current being dumped into the very high impedance node where the compensation
capacitor is connected The bulks of the pchannel transistors of the output push
pull source follower are connected to the source terminals to lower the minimal
output voltage The the pchannel transistor placed in series with the pchannel
current mirror is present for avoiding snapback
For high gain regulated gain cascodes RGC 	 Sackinger and Guggenbuhl

	 have been used for the current mirrors A ptype RGC is shown in gure


E
 The drain of main transistor the transistor connected to the gate terminal	
is kept at a constant potential using a cascode transistor and a simple inverting
amplier the two left most transistors	 An ntype RGC current mirror is seen in
gure 

E

Doing supply current sensing on the output transistors of the opamp and
connecting the opamp as a voltage follower we have a CCII type current con
Appendix E Building block components Page 

voutvin
CC
IB
+
-
SSV
VDD
Figure 

E
 The operational amplier Regulated gain cascode opamp with
pushpull source follower output stage Only a single very high impedance
node contributes to the gain
i out
vin
Bias
‘‘source’’
‘‘drain’’
‘‘gate’’
VDD
SSV
VDD
Figure 

E
 Regulated gain cascode
Ptype RGC The local feedback pro
vides a very high output impedance at
the drain terminal
i outi in
Bias
i in i out
SSV
VDD
SSV
Figure 

E
 RGC current mirror Accurate high output impedance Ntype
mirror composed of two regulated gain cascodes One must ensure by tran
sistor sizing that the transistors are saturated in the relevant input current
range
Appendix E Building block components Page 

veyor Toumazou et al   	 This is shown in gure 
E
the CCII!
symbol is given in gure 

	 The CCII voltage current relations are


i
Y
v
X
i
Z


"



 
 

 
 


  





v
Y
i
X
v
Z


 
Figure E: The current conveyor. CCII+ implemented by current supply sensing the output transistors of the unity gain coupled opamp. Removing the feedback we have a very versatile four terminal operational component.
If we omit the feedback, the resulting four terminal device is a very versatile component; this is used on the backpropagation synapse chip. With the RGC bias currents and differential pair tail currents used, the typical open loop gain is as shown in figure E. Other typical characteristics (at a resistive load, in the kΩ range, in parallel with a capacitive load, in the pF range) are:
• Input voltage offset $V_{Yofs}$ (mV)
• Output current offset $I_{Zofs}$
• Voltage follower slew rate SR
• Current output range $|i_Z|_{max}$
Figure E: Opamp frequency response. Measured open loop gain vs. frequency for sample CCII opamp (positive input to voltage output). Axes: CCII+ gain/dB (-20 to 100) versus frequency/kHz (1 to 10000, logarithmic).
E The transconductor
The transconductor used on the first generation synapse chip was designed by Thomas Kaulberg. The principal schematic is shown in figure E. The resistor is implemented as an n-well resistor.
Figure E: The transconductor. Basically a CCII with an N-well resistor connected to the x-terminal. The CCII is based on single ended supply current sensing of a two-stage unity-gain coupled opamp. (Labels in the schematic: $I_B$, $I_{B2}$, $V_{ref}$, $G_{TC}$, $v_{in}$, $i_{out}$.)
E MOS resistive circuit
The MOS resistive circuit is used extensively in this work, often with level shifters at the gate inputs. A typical MRC layout is shown in figure E (taken from the backpropagation neuron chip). Also shown is a pair of gate input level shifters (p-channel source followers with bulk connected to source for precise buffering) driven by RGC current sources. All MRC gate input level shifters occur in pairs to ensure matching of the voltage by which the level shifters raise the input voltages.
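The reason for pairing the level shifters can be illustrated numerically. The sketch assumes the idealised MRC relation in which the differential output current is proportional to the product of the gate voltage difference and the input voltage difference; this relation, the function names, the factor `beta` and all voltages are assumptions made for illustration, not values from the chips. Under that model a shift common to both gates cancels, while a mismatch between the two shifts appears directly as a gain error.

```python
def mrc_diff_current(beta, v_g1, v_g2, v_a, v_b):
    """Idealised MRC: differential current = beta * (v_g1 - v_g2) * (v_a - v_b)."""
    return beta * (v_g1 - v_g2) * (v_a - v_b)


def mrc_with_shifters(beta, v_c1, v_c2, v_a, v_b, shift1, shift2):
    """Same MRC with each gate driven through a level shifter."""
    return mrc_diff_current(beta, v_c1 + shift1, v_c2 + shift2, v_a, v_b)


# Hypothetical operating point
beta, v_c1, v_c2, v_a, v_b = 1e-5, 1.0, 0.5, 0.2, 0.0
ideal = mrc_diff_current(beta, v_c1, v_c2, v_a, v_b)
matched = mrc_with_shifters(beta, v_c1, v_c2, v_a, v_b, 1.1, 1.1)
mismatched = mrc_with_shifters(beta, v_c1, v_c2, v_a, v_b, 1.1, 1.0)
print(ideal, matched, mismatched)
```

In this model the mismatched case deviates from the ideal current by `beta * (shift1 - shift2) * (v_a - v_b)`.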
Figure E: Typical MRC layout. The layout of three MRCs connected to an opamp (right). Two MRCs are preceded by level shifters (left). The P-channel MOST source followers and the corresponding regulated cascode current sources can be identified. The layout is taken from a backpropagation
neuron
