VLSI neural network with digital weights and analog multipliers by Koosh, Vincent F. & Goodman, Rodney
VLSI NEURAL NETWORK WITH DIGITAL WEIGHTS AND ANALOG MULTIPLIERS 
Vincent F. Koosh, Rodney Goodman 
California Institute of Technology 
Pasadena, CA 91125 
darkd@micro.caltech.edu 
http://www.micro.caltech.edu 
ABSTRACT 
A VLSI feedforward neural network is presented that makes use 
of digital weights and analog multipliers. The network is trained 
in a chip-in-loop fashion with a host computer implementing the 
training algorithm. The chip uses a serial digital weight bus im- 
plemented by a long shift register to input the weights. The inputs 
and outputs of the network are provided directly at pins on the 
chip. The training algorithm used is a parallel weight perturbation 
technique[1]. Training results are shown for a 2 input, 1 output 
network trained with an AND function, and for a 2 input, 2 hidden 
unit, 1 output network trained with an XOR function. 
1. INTRODUCTION 
Training an analog neural network directly on a VLSI chip pro- 
vides additional benefits over using a computer for the initial train- 
ing and then downloading the weights. The analog hardware is 
prone to have offsets and device mismatches. By training with the 
chip in the loop, the neural network will also learn these offsets 
and adjust the weights appropriately to account for them. A VLSI 
neural network can be applied in many situations requiring fast, 
low power operation such as handwriting recognition for PDAs or 
pattern detection for implantable medical devices[2]. 
There are several issues that must be addressed to implement 
an analog VLSI neural network chip. First, an appropriate algo- 
rithm suitable for VLSI implementation must be found. Tradi- 
tional error backpropagation approaches for neural network train- 
ing require too many bits of floating point precision to implement 
efficiently in an analog VLSI chip. Techniques that are more suitable 
involve stochastic weight perturbation[1],[3],[4],[5],[6],[7], 
where a weight is perturbed in a random direction, its effect on 
the error is determined, and the perturbation is kept if the error was 
reduced; otherwise, the old weight is restored. In this approach, 
the network observes the gradient rather than actually computing 
it. 
Serial weight perturbation[3] involves perturbing each weight 
sequentially. This requires a number of iterations that is directly 
proportional to the number of weights. A significant speed-up can 
be obtained if all weights are perturbed randomly in parallel, the 
effect on the error is measured, and all perturbations are kept if the 
error is reduced. Both the parallel and serial methods can potentially 
benefit from annealing the perturbation. Initially, large 
perturbations are applied to move the weights quickly towards a 
minimum. Then, the perturbation sizes are occasionally decreased 
to achieve finer selection of the weights and a smaller error. In 
general, however, optimized gradient descent techniques converge 
more rapidly than the perturbative techniques. 
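The annealing idea described above can be illustrated with a minimal Python sketch. The step-decay form, the initial size of 8 LSBs, the halving factor, and the interval of 100 iterations are all illustrative assumptions rather than values from this work, and `perturbation_size` is a hypothetical helper name.

```python
def perturbation_size(iteration, initial=8, decay_every=100, factor=0.5, minimum=1):
    """Return the maximum perturbation magnitude (in weight LSBs) at an iteration.

    Step-decay annealing: the size is multiplied by `factor` once every
    `decay_every` iterations and never drops below `minimum`.
    All default values are illustrative assumptions.
    """
    steps = iteration // decay_every
    size = initial * (factor ** steps)
    return max(minimum, int(round(size)))


if __name__ == "__main__":
    # Print the schedule at a few sample iterations.
    for it in (0, 99, 100, 250, 400, 1000):
        print(it, perturbation_size(it))
```

Large early perturbations give coarse, fast movement; the floor of one LSB keeps the search alive once the schedule has decayed.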
Next, the issue of how to appropriately store the weights on- 
chip in a non-volatile manner must be addressed. If the weights 
are simply stored as charge on a capacitor, they will ultimately 
decay due to parasitic conductance paths. One method would be 
to use an analog memory cell [8],[9]. This would allow directly 
storing the analog voltage value. However, this technique requires 
using large voltages to obtain tunneling and/or injection through 
the gate oxide and is still being investigated. Another approach 
would be to use traditional digital storage with EEPROMs. This 
would then require having A/D/A (one A/D and one D/A) convert- 
ers for the weights. A single A/D/A converter would only allow a 
serial weight perturbation scheme that would be slow. A parallel 
scheme, which would perturb all weights at once, would require 
one A/D/A per weight. This would be faster, but would require 
more area. One alternative would remove the A/D requirement by 
replacing it with a digital counter to adjust the weight values. This 
would then require one digital counter and one D/A per weight. 
2. CIRCUITS 
2.1. Synapse 
A small synapse with one D/A per weight can be achieved by first 
making a binary weighted current source (Figure 1) and then feeding 
the binary weighted currents into diode connected transistors 
to encode them as voltages. We then feed these voltages to transistors 
on the synapse to convert them back to currents. Thus, we 
achieve many D/As with only one binary weighted array of transistors. 
It is clear that the linearity of the D/A will be poor because 
of matching errors between the current source array and synapses 
which may be located on opposite sides of the chip. This is not 
a concern because the network will be able to learn around these 
offsets. 
Figure 1: Current Source Circuit 
The synapse[6],[2] is shown in Figure 2. The synapse performs 
the weighting of the inputs by multiplying the input voltages 
by a weight stored in a digital word denoted by b0 through 
b5. The sign bit, b5, changes the direction of current to achieve 
the appropriate sign. 
Figure 2: Synapse Circuit 
0-7803-6685-9/01/$10.00 ©2001 IEEE 
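As a minimal sketch of how the digital word maps to a signed weight and a bias current, the Python below decodes b0 through b5 in sign-magnitude form. The polarity of the sign bit (b5 = 1 taken as positive) is an assumption, since only its role is stated; the 100 nA LSB current is the value reported for the test chip; `decode_weight` and `bias_current` are hypothetical helper names.

```python
I0_AMPS = 100e-9  # LSB current in amperes (value reported for the test chip)


def decode_weight(bits):
    """Convert a 6-bit word [b0, ..., b5] into a signed integer weight.

    b0..b4 are magnitude bits (b0 = LSB) and b5 is the sign bit.
    ASSUMPTION: b5 = 1 means positive; the actual polarity is not stated.
    """
    if len(bits) != 6 or any(b not in (0, 1) for b in bits):
        raise ValueError("expected six bits, each 0 or 1")
    magnitude = sum(b << i for i, b in enumerate(bits[:5]))
    return magnitude if bits[5] == 1 else -magnitude


def bias_current(bits, i0=I0_AMPS):
    """Magnitude of the synapse bias current, |W| * I0, in amperes."""
    return abs(decode_weight(bits)) * i0
```

With five magnitude bits the largest magnitude is 2^5 - 1 = 31, which gives the -31 to +31 weight range.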
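In the subthreshold region of operation, the transistor equation 
is given by[10]: 
$$I_d = I_{d0}\, e^{\kappa V_{gs}/U_t}$$
and the output of the synapse is given by[10],[2]: 
$$\Delta I_{out} = I_{out+} - I_{out-} = \pm W I_0 \tanh\left(\frac{\kappa (V_{in+} - V_{in-})}{2 U_t}\right)$$
where W is the weight of the synapse encoded by the digital 
word and $I_0$ is the least significant bit (LSB) current, with the sign selected by b5. 
Thus, in the subthreshold linear region, the output is approximately 
given by: 
$$\Delta I_{out} \approx \pm\frac{\kappa W I_0}{2 U_t}\,(V_{in+} - V_{in-})$$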
In the above threshold regime, the transistor equation in saturation 
is approximately given by: 
$$I_D \approx K (V_{gs} - V_t)^2$$
The synapse output is no longer described by a simple tanh 
function, but is nevertheless still sigmoidal with a wider "linear" 
range. 
In the above threshold linear region, the output is approximately 
given by: 
$$\Delta I_{out} \approx \pm\sqrt{2 K W I_0}\,(V_{in+} - V_{in-})$$
It is clear that above threshold, the synapse is not doing a pure 
weighting of the input voltage. However, since the weights are 
learned on chip, they will be adjusted to the necessary 
values. Furthermore, it is possible that some synapses will operate 
below threshold while others operate above, depending on the choice 
of LSB current. Again, on-chip learning will be able to set the 
weights to account for these different modes of operation. 
Figure 3: Neuron Circuit 
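The subthreshold behaviour of the synapse can be sketched numerically. The Python below assumes an ideal differential pair whose tail current is W times the LSB current, so that the differential output follows W I_0 tanh(kappa dV / (2 U_t)). The slope factor kappa = 0.7 is an assumed, process-dependent value, the thermal voltage is taken near room temperature, and the function names are hypothetical. Mismatch, offsets, and above-threshold operation are not modelled.

```python
import math

KAPPA = 0.7      # ASSUMED subthreshold slope factor (process dependent)
U_T = 0.0257     # thermal voltage kT/q near room temperature, in volts
I0 = 100e-9      # LSB current in amperes (value reported for the test chip)


def synapse_subthreshold(weight, dv_in, i0=I0, kappa=KAPPA, u_t=U_T):
    """Differential output current (A) of an ideal subthreshold synapse.

    `weight` is the signed integer weight and `dv_in` the differential
    input voltage in volts.
    """
    return weight * i0 * math.tanh(kappa * dv_in / (2.0 * u_t))


def synapse_small_signal(weight, dv_in, i0=I0, kappa=KAPPA, u_t=U_T):
    """Linearised version, valid for |dv_in| much smaller than 2*U_T/kappa."""
    return weight * i0 * kappa * dv_in / (2.0 * u_t)
```

For small inputs the two functions agree closely, and for large inputs the output saturates at the bias current, which is the sigmoidal squashing used by the next layer.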
2.2. Neuron 
The synapse circuit outputs a differential current that will be summed 
in the neuron circuit shown in Figure 3. The neuron circuit per- 
forms the summation from all of the input synapses. The neuron 
circuit then converts the currents back into a differential voltage 
feeding into the next layer of synapses. Since the outputs of the 
synapse will all have a common mode component, it is important 
for the neuron to have common mode cancellation[2]. Since one 
side of the differential current inputs may have a larger share of 
the common mode current, it is important to distribute this com- 
mon mode to keep both differential currents within a reasonable 
operating range. 
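$$\Delta I = I_{in+} - I_{in-}, \qquad I_{cm} = \frac{I_{in+} + I_{in-}}{2}$$
$$I_{in+d} = I_{in+} - \frac{I_{cm}}{2} = \frac{\Delta I}{2} + \frac{I_{cm}}{2}$$
$$I_{in-d} = I_{in-} - \frac{I_{cm}}{2} = -\frac{\Delta I}{2} + \frac{I_{cm}}{2}$$
If $\Delta I$ is of equal size or larger than $I_{cm}$, the transistor 
with $I_{in-d}$ may begin to cut off and the above equations would not 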
exactly hold; however, the current cutoff is graceful and should 
not normally affect performance. With the common mode signal 
properly equalized, the differential currents are then mirrored into 
the current to voltage transformation stage. This stage effectively 
takes the differential input currents and uses a transistor in the tri- 
ode region to provide a differential output. This stage will usually 
be operating above threshold, because the $V_{offset}$ and $V_{cm}$ controls 
are used to ensure that the output voltages are approximately 
mid-rail. This is done by simply adding additional current to the 
diode connected transistor stack. Having the outputs mid-rail is 
important for proper biasing of the next stage of synapses. The 
above threshold transistor equation in the triode region is given 
by $I_d = 2K(V_{gs} - V_t - \frac{V_{ds}}{2})V_{ds} \approx 2K(V_{gs} - V_t)V_{ds}$ for small 
enough $V_{ds}$. If $K_1$ denotes the prefactor with $W/L$ of the cascode 
transistor and $K_2$ denotes the same for the transistor with 
gate $V_{out}$, the voltage output of the neuron will then be given by: 
$$I_{in} = K_1 (V_{gain} - V_t - V_1)^2 \approx K_2 \left(2 (V_{out} - V_t) V_1\right)$$
$$V_1 = V_{gain} - V_t - \sqrt{\frac{I_{in}}{K_1}}$$
$$\text{if } \frac{W_1}{L_1} = \frac{W_2}{L_2} \text{ then } K_1 = K_2 = K,$$
$$V_{out} = \frac{I_{in}}{2K\left(V_{gain} - V_t - \sqrt{I_{in}/K}\right)} + V_t$$
Thus, it is clear that $V_{gain}$ can be used to adjust the effective 
gain of the stage. 
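The two neuron operations can be sketched numerically in Python. This is a rough illustration under stated assumptions: the common-mode current is taken to be the average of the two input currents, the cascode and output transistors are assumed matched (K_1 = K_2 = K), and the values K = 50 µA/V² and V_t = 0.8 V are placeholders, not measured parameters. The extra current injected by the offset control is not modelled, and the function names are hypothetical.

```python
import math


def cancel_common_mode(i_plus, i_minus):
    """Subtract half of the common-mode current from each input current.

    The differential component is preserved and the common mode is halved.
    ASSUMPTION: the common mode is the average of the two inputs.
    """
    i_cm = 0.5 * (i_plus + i_minus)
    return i_plus - 0.5 * i_cm, i_minus - 0.5 * i_cm


def neuron_vout(i_in, v_gain, k=50e-6, v_t=0.8):
    """Output voltage of the current-to-voltage stage for one branch.

    Solves I_in = K (V_gain - V_t - V_1)^2 for V_1, then
    I_in = 2 K (V_out - V_t) V_1 for V_out, assuming K_1 = K_2 = K.
    k (A/V^2) and v_t (V) are ASSUMED illustrative values.
    """
    v1 = v_gain - v_t - math.sqrt(i_in / k)
    if v1 <= 0:
        raise ValueError("V_gain too low for this input current")
    return i_in / (2.0 * k * v1) + v_t
```

In this model, raising `v_gain` increases V_1 and so reduces the output swing for a given input current, which is the gain adjustment described in the text.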
Using these two circuit building blocks it is possible to con- 
struct a multilayer feed-forward neural network. Note that the non- 
linear squashing function is actually performed in the next layer of 
synapse circuits rather than in the neuron as in a traditional neural 
network. However, this is equivalent as long as the inputs to the 
first layer are kept within the linear range of the synapses. The 
biases for each neuron are simply implemented as synapses tied 
to fixed bias voltages. Also, depending on the type of network 
outputs desired, additional circuitry may be needed for the final 
squashing function. 
3. TRAINING ALGORITHM 
The neural network is trained by using the parallel perturbative 
weight update rule[1]. The perturbative technique requires generating 
random weight increments to adjust the weights during each 
iteration. These random perturbations are then applied to all of the 
weights in parallel. In batch mode, all input training patterns are 
applied and the error is accumulated. This error is then checked 
to see if it was higher or lower than in the unperturbed iteration. If 
the error is lower, the perturbations are kept, otherwise they are 
discarded. This process repeats until a sufficiently low error is 
achieved. The following is an outline of the algorithm: 
Figure 4: Training of a 2:1 network with AND function 
Initialize Weights; 
Get Error; 
while (Error > Error Goal) 
    Perturb Weights; 
    Get New Error; 
    if (New Error < Error) 
        Weights = New Weights; 
        Error = New Error; 
    else 
        Restore Old Weights; 
    end 
end 
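The outline above can be made concrete with a small software stand-in for the chip. The sketch below is an illustration under stated assumptions, not the authors' host-side code. The network is modelled as a single ideal tanh unit with inputs coded as -1 and +1, the gain of 0.1, the perturbation range of ±2 LSBs, the error goal, and the iteration cap are all hypothetical choices, and no circuit offsets are simulated. Only the weight range of -31 to +31 is taken from the 6-bit weights.

```python
import math
import random

# AND function with inputs and targets coded as -1/+1 (an assumed encoding).
PATTERNS = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
W_MAX = 31  # 6-bit sign-magnitude weights span -31..+31


def forward(weights, x, gain=0.1):
    """Software stand-in for the chip: one tanh unit with two inputs and a bias."""
    w1, w2, wb = weights
    return math.tanh(gain * (w1 * x[0] + w2 * x[1] + wb))


def get_error(weights):
    """Batch error: summed absolute output error over all training patterns."""
    return sum(abs(t - forward(weights, x)) for x, t in PATTERNS)


def train(error_goal=0.5, max_iters=5000, max_step=2, seed=0):
    """Parallel weight perturbation; returns final weights and error history."""
    rng = random.Random(seed)
    weights = [rng.randint(-2, 2) for _ in range(3)]  # small random start
    error = get_error(weights)
    history = [error]
    for _ in range(max_iters):  # iteration cap added so the loop always ends
        if error <= error_goal:
            break
        # Perturb every weight at once, clipping to the representable range.
        trial = [max(-W_MAX, min(W_MAX, w + rng.randint(-max_step, max_step)))
                 for w in weights]
        new_error = get_error(trial)
        if new_error < error:  # keep the perturbation
            weights, error = trial, new_error
        # Otherwise the old weights are retained, i.e. restored.
        history.append(error)
    return weights, history
```

Because a perturbation is kept only when the batch error falls, the recorded error can never increase from one iteration to the next.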
4. TEST RESULTS 
A chip implementing the above circuits was fabricated in a 1.2 µm 
CMOS process. All synapse and neuron transistors were 3.6 µm/3.6 µm. 
An LSB current of 100 nA was chosen for the current source. The 
above neural network circuits were trained with some simple digital 
functions such as 2 input AND and 2 input XOR. The results 
of some training runs are shown in Figures 4-5. As can be seen 
from the figures, the network weights slowly converge to a correct 
solution. Since the training was done on digital functions, a dif- 
ferential to single ended converter was placed on the output of the 
final neuron. This was simply a 5 transistor transconductance am- 
plifier. The error voltages were calculated as a total sum voltage 
error over all input patterns at the output of the transconductance 
amplifier. Since Vdd was 5V, the output would only easily move to 
within about 0.5V from Vdd because the transconductance amplifier 
had low gain. Thus, when the error gets to around 2V it means 
that all of the outputs are within about 0.5V from their respective 
rail and functionally correct. A double inverter buffer can be 
placed at the final output to obtain good digital signals. At the 
beginning of each of the training runs, the error voltage starts around 
or over 10V, indicating that at least 2 of the input patterns give an 
incorrect output. 
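A rough worked example of this error bookkeeping follows. It assumes four input patterns, 5V rails, a correct output that sits about 0.5V from its target rail, and, purely for illustration, an incorrect output that sits about 0.5V from the opposite rail and so contributes about 4.5V of error.

$$E_{\text{all correct}} \approx 4 \times 0.5\,\text{V} = 2\,\text{V}$$
$$E_{\text{one wrong}} \approx 3 \times 0.5\,\text{V} + 1 \times 4.5\,\text{V} = 6\,\text{V}$$
$$E_{\text{two wrong}} \approx 2 \times 0.5\,\text{V} + 2 \times 4.5\,\text{V} = 10\,\text{V}$$

Under these assumptions a starting error of about 10V or more is consistent with at least two incorrect outputs, and an error near 2V with all four outputs being correct.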
Figure 5: Training of a 2:2:1 network with XOR function starting 
with small random weights 
Figure 4 shows the results from a 2 input, 1 output network 
learning an AND function. This network has only 2 synapses and 
1 bias for a total of 3 weights. The weight values can go from -31 
to +31 because of the 6 bit D/A converters used on the synapses. 
Figure 5 shows the results of training a 2 input, 2 hidden unit, 1 
output network with the XOR function. The weights are initialized 
as small random numbers. The weights slowly diverge and the 
error monotonically decreases until the function is learned. As 
with gradient techniques, occasional training runs resulted in the 
network getting stuck in a local minimum and the error would not 
go all the way down. 
Figure 6 shows the same network trained with XOR, but the 
initial weights are chosen as mathematically correct weights for 
ideal synapses and neurons. Although the ideal weights should, 
in theory, start off with correct outputs, the offsets and mismatches 
of the circuit cause the outputs to be incorrect. However, since 
the weights start near where they should be, the error goes down 
rapidly to the correct solution. This is an example of how a more 
complicated network could be trained on computer first to obtain 
good initial weights and then the training could be completed with 
the chip in the loop. 
5. CONCLUSIONS 
A VLSI implementation of a neural network has been demon- 
strated. Digital weights are used to provide stable weight storage. 
Analog multipliers are used because full digital multipliers would 
occupy considerable space for large networks. Although the func- 
tions learned were digital, the network is able to accept analog 
inputs and provide analog outputs for learning other functions. A 
parallel perturbation technique was used to train the network suc- 
cessfully on the 2-input AND and XOR functions. 
Figure 6: Training of a 2:2:1 network with XOR function starting 
with “ideal” weights (panel title: Weight and Error vs Iteration for training XOR network) 
6. REFERENCES 
[1] J. Alspector, R. Meir, B. Yuhas, A. Jayakumar, and D. Lippe, “A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks”, Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufman Publishers, vol. 5, pp. 836-844, 1993. 
[2] R. Coggins, M. Jabri, B. Flower, and S. Pickard, “A Hybrid Analog and Digital VLSI Neural Network for Intracardiac Morphology Classification”, IEEE J. of Solid-State Circuits, vol. 30, no. 5, pp. 542-550, May 1995. 
[3] M. Jabri, B. Flower, “Weight Perturbation: An Optimal Architecture and Learning Technique for Analog VLSI Feedforward and Recurrent Multilayer Networks”, IEEE Tran. on Neural Networks, vol. 3, no. 1, pp. 154-157, Jan. 1992. 
[4] B. Flower, M. Jabri, “Summed Weight Neuron Perturbation: An O(N) Improvement over Weight Perturbation”, Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufman Publishers, vol. 5, pp. 212-219, 1993. 
[5] G. Cauwenberghs, “A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization”, Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufman Publishers, vol. 5, pp. 244-251, 1993. 
[6] P. W. Hollis, J. J. Paulos, “A Neural Network Learning Algorithm Tailored for VLSI Implementation”, IEEE Tran. on Neural Networks, vol. 5, no. 5, pp. 784-791, Sept. 1994. 
[7] G. Cauwenberghs, “Analog VLSI Stochastic Perturbative Learning Architectures”, Analog Integrated Circuits and Signal Processing, 13, 195-209, 1997. 
[8] C. Diorio, P. Hasler, B. A. Minch, C. A. Mead, “A Single-Transistor Silicon Synapse”, IEEE Tran. on Electron Devices, vol. 43, no. 11, pp. 1972-1980, Nov. 1996. 
[9] C. Diorio, P. Hasler, B. A. Minch, C. A. Mead, “A Complementary Pair of Four-Terminal Silicon Synapses”, Analog Integrated Circuits and Signal Processing, vol. 13, no. 1-2, pp. 153-166, May-June 1997. 
[10] C. Mead, Analog VLSI and Neural Systems, New York: Addison-Wesley, 1989. 
