Analog VLSI implementation of kernel-based classifiers by Verleysen, Michel et al.
Analog VLSI implementation of kernel-based classifiers 
M. Verleysen, Ph. Thissen 
Université Catholique de Louvain, Laboratoire de Microélectronique,
3 pl. du Levant, B-1348 Louvain-la-Neuve (Belgium) 
J. Madrenas 
Universitat Politècnica de Catalunya, Departament d'Enginyeria Electrònica,
Gran Capità s/n, mòdul C4, E-08071 Barcelona (Spain)
Abstract 
Kernel-based classifiers are neural networks (Radial Basis
Functions) in which the probability densities of each class of
data are first estimated and then used to approximate the Bayes
boundaries between classes. Such an algorithm, however, involves
a large number of operations, and its parallelism makes it an
ideal candidate for a dedicated VLSI implementation. We present
in this paper the architecture of a dedicated processor for
kernel-based classifiers, and the implementation of the original
cells.
1 Introduction 
Many different types of neural networks are used 
in classification tasks. Classification in this context 
means first to estimate in a high-dimensional space 
regions attributed to the different classes, according 
to given input/output pairs, and then to choose the 
best candidate (more probable) between the different 
classes when a new input point is presented to the 
network. 
It is known that the Bayes law can be used directly
to choose the most probable class for each point in the 
space, once the probability densities of each class and 
the a priori probabilities of the classes are known. Es- 
timation of probability densities can be done through 
sums of kernels (Gaussian for example). In this pa- 
per, we present the implementation of a dedicated 
processor for a Bayes classifier based on estimation of 
probability densities. After a short description of the 
algorithm, we first present the global architecture of 
the system and of the processor, and then the imple- 
mentation of the specific cells (kernels, analog memory 
points and distance computations). 
2 Algorithms for classification tasks 
Many neural-like algorithms may be used for classi- 
fication tasks. As stated above, we will concentrate on 
kernel-based estimators of probability densities, used 
to build Bayes classifiers. The proposed implementation,
however, includes the circuitry for LVQ-like algorithms, as
will become clear with the description of the processor.
Kernel-based classifiers (KBC) work in two phases. 
First, the probability density of the data distribution 
inside each class is estimated; then, the Bayes law is 
used to determine the boundaries between classes in
the data space, and in this way to classify new input
vectors.
To estimate the probability density of data belonging to a
particular class [1], the principle is to sum kernels centered
on the data from the learning set available in this class:
$$\hat{p}(u \mid \omega_c) = \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{1}{h^d}\, \Phi\!\left(\frac{u - x_i}{h}\right) \qquad (1)$$
where $\{x_i, 1 \le i \le N_c\}$ denotes the samples available in
class $\omega_c$; we suppose that there are $C$ classes denoted
$\omega_c$, $1 \le c \le C$. The scalar parameter $h$ is called the
width factor of the kernel. The kernel $\Phi$ is said to be radial
if it is only a function of the norm of its argument. Several
types of kernels $\Phi$ may be used, the most classical one being
a Gaussian function:
$$\Phi(u) = \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{1}{2}\,\|u\|^2\right) \qquad (2)$$
where $d$ is the dimension of $u$ and $x_i$. The convergence of
such an estimator is proved in [2]. Let us mention that the
width factor $h$ may be made different for each kernel
$\Phi((u - x_i)/h)$; in this case the kernels are referred to
as variable rather than fixed. However, the prohibitive
number of operations involved in the case of variable
kernels makes them rather inefficient in terms of software
or hardware implementations; furthermore, tests have shown
that the gain in performance between estimators based on
fixed and variable kernels is limited, especially in the case
of finite databases. For these reasons, only fixed-kernel
estimators will be considered in the following.
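The fixed-kernel estimator described above can be illustrated with a minimal Python sketch. It assumes the usual Parzen form, in which each class density is the average of Gaussian kernels of common width h centered on the learning samples; the function names and the one-dimensional sample set are hypothetical and serve only as an illustration.

```python
import math

def gaussian_kernel(v):
    """Radial Gaussian kernel in d = len(v) dimensions (integrates to one)."""
    d = len(v)
    sq_norm = sum(c * c for c in v)
    return math.exp(-0.5 * sq_norm) / (2.0 * math.pi) ** (d / 2.0)

def parzen_density(u, samples, h):
    """Fixed-kernel density estimate at point u from the samples of one class."""
    d = len(u)
    total = 0.0
    for x in samples:
        # Argument of the kernel: (u - x_i) / h, coordinate by coordinate.
        scaled = [(uj - xj) / h for uj, xj in zip(u, x)]
        total += gaussian_kernel(scaled)
    return total / (len(samples) * h ** d)

# Hypothetical one-dimensional class with three learning samples.
samples = [[0.0], [1.0], [2.0]]
print(parzen_density([1.0], samples, h=0.5))
```

Every evaluation requires one kernel computation per stored sample, which is the cost that motivates both the reduction of the sample set and the parallel hardware discussed later.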
Let us finally mention that, even though the probability
density estimators are shown to be asymptotically unbiased,
the number of computations is reduced in practical
applications by reducing the number of samples through some
kind of vector quantization, for example an LVQ procedure.
In order for the LVQ procedure to give an appropriate
distribution of the centroids in each class, their number will
be set proportional to the a priori probabilities of the
respective classes.
Once the probability densities are estimated in each
class, the Bayes criterion may be used to classify any
new vector $z$; class $\omega_c$ ($1 \le c \le C$) will be
attributed to vector $z$ if
$$P(\omega_c)\,\hat{p}(z \mid \omega_c) \ge P(\omega_i)\,\hat{p}(z \mid \omega_i) \quad \text{for all } i \ne c \qquad (3)$$
where $\hat{p}(z \mid \omega_i)$ is the estimated probability density
of class $\omega_i$ at $z$ and $P(\omega_i)$ is the a priori
probability of class $\omega_i$. Such a classifier is interesting
mainly because it approximates the Bayes limits between classes,
i.e. the boundaries leading to a minimum number of
misclassifications in the case of overlapping distributions;
most other classification systems do not have this property.
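As an illustration of the decision rule, the following Python sketch selects the class with the largest product of a priori probability and estimated density. The function name and the numerical values are hypothetical; the densities are assumed to have been estimated beforehand.

```python
def bayes_classify(densities, priors):
    """Return the index c that maximizes P(w_c) * p_hat(z | w_c)."""
    scores = [prior * density for prior, density in zip(priors, densities)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical density estimates at one point z for two classes.
densities = [0.30, 0.20]
print(bayes_classify(densities, priors=[0.5, 0.5]))
print(bayes_classify(densities, priors=[0.3, 0.7]))
```

With equal priors the class of largest density wins; with unequal priors the weighting can reverse the decision, which is why the a priori probabilities appear explicitly in the criterion.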
3 A mixed architecture for neural network classifier
The main weak point of the algorithm described 
above resides in the number of operations involved 
in the computation of a class. If there are N ker- 
nels, N distances must be evaluated, passed through 
non-linear functions, and summed, before deciding the 
most probable class, i.e. the largest estimate of the 
in-class probability densities (for equal a priori class 
probabilities, which will be assumed in the following).
The fact that all distances and all kernels may
be evaluated simultaneously makes this algorithm an
ideal candidate for an analog parallel implementation.
The architecture presented here is made of two 
parts. First, an analog processor implements all oper- 
ations that can be found in the kernel-based algorithm 
described above; we will see later how this chip can 
also be used for LVQ-like algorithms. This processor 
is algorithm-independent, provided that the algorithm 
only uses the resources included in the chip: values of 
weights can be downloaded or adapted on the chip, 
non-linear functions may be used or by-passed, intermediate
results may be obtained (for LVQ-like algorithms), etc.
Secondly, all operations are sequenced using an external
digital architecture, which is connected
to the analog processor through the input/output 
lines and the control ones. This part is algorithm- 
dependent, since different algorithms will need differ- 
ent sequences of operations, and the use of different 
output lines of the processor. The control part will be
a digital finite-state machine designed according to the
algorithm; it can be realized by a specialized or
general-purpose digital chip, by discrete components on a
printed-circuit board, by FPGAs, etc.
Combining the analog processor and the digital control
part leads to an efficient architecture for classification
tasks; the next part of this paper details the
analog processor and the different cells involved.
4 Analog processor 
4.1 Block description of the circuit 
To describe in detail the analog processor and the
different operations which can be realized, let us
examine the functional description of figure 1.
The core of the system is built around P identical 
cells, each of them being composed of memory points 
to store the coordinates of the centroid and its class, 
together with a distance calculator to compute the 
distance between this centroid and an input vector. 
Briefly, the system works as follows. A set of $P$
centroids $p_k$, $1 \le k \le P$, is stored in the processor;
in the case of the KBC algorithm, the coordinates of a
centroid correspond to the center of a kernel function
$\Phi$. Then, when an input vector is presented
to the circuit for classification, the distances between
this input vector and each of the centroids are computed
in parallel; this is the purpose of the $P$ distance
computation cells mentioned above.
The $P$ computed distances are then used in two
ways. On the one hand, they are compared to find the
smallest one, in order to select the centroid closest
to the input vector; this is used in LVQ-like algorithms
to select the winning centroid. On the other hand, the
distances serve as inputs to $P$ Gaussian-like kernel
functions, used in KBC algorithms, as mentioned in
equation 2.
Figure 1: Functional description of the analog processor (block labels: input vectors, analog memory points, refreshment system, winner-take-all, nearest centre, class index, probability densities)
In the case of the LVQ algorithm, the selection
of the winning centroid $p_a$ completes the recognition
phase of the algorithm. In the case of the KBC algorithm,
the $P$ kernel outputs are summed class by class,
according to equation 1, in order to estimate the
probability densities of each class. The parameters of the
kernels, namely their widths and shapes, may be adjusted
by external commands; this will be detailed in
section 4.2.4. According to the Bayes law (equation 3),
classification of the input pattern is then realized by
selecting the largest probability density among
the different classes.
A supplementary factor $P(\omega_i)$ is found in equation
3; it corresponds to the a priori probabilities of the
classes. As in the LVQ algorithm [3], these a priori
probabilities are estimated by the relative number of
points in each class, a condition that is met in our
circuit since we have one kernel per input point in the
distribution.
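The data flow just described can be summarized by a small functional model in Python. This is only a behavioral sketch under stated assumptions: the kernel is taken as a half-Gaussian of the Manhattan distance, and all names and numerical values are hypothetical. The class sums are deliberately left unnormalized, so that a class holding more kernels receives a proportionally larger weight, which plays the role of the a priori probability.

```python
import math

def manhattan(x, p):
    """Manhattan distance, as computed by one distance cell."""
    return sum(abs(a - b) for a, b in zip(x, p))

def half_gaussian(distance, width):
    """Assumed kernel shape: half of a Gaussian on a non-negative distance."""
    return math.exp(-0.5 * (distance / width) ** 2)

def analog_processor(x, centroids, labels, n_classes, width):
    """Functional model: returns (LVQ class, KBC class, per-class kernel sums)."""
    # All P distances are evaluated in parallel on the chip.
    distances = [manhattan(x, p) for p in centroids]
    # Winner-take-all: closest centroid (LVQ-like recognition).
    winner = min(range(len(distances)), key=distances.__getitem__)
    # Kernel outputs summed class by class (KBC recognition).
    class_sums = [0.0] * n_classes
    for dist, label in zip(distances, labels):
        class_sums[label] += half_gaussian(dist, width)
    kbc_class = max(range(n_classes), key=class_sums.__getitem__)
    return labels[winner], kbc_class, class_sums

# Hypothetical example: two centroids in class 0, one in class 1.
centroids = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
labels = [0, 0, 1]
lvq_class, kbc_class, sums = analog_processor([1.8, 1.8], centroids, labels, 2, width=2.0)
print(lvq_class, kbc_class, sums)
```

In this example the nearest centroid belongs to class 1 while the summed kernels favor class 0, which shows that the two recognition modes sharing the same distance cells can give different answers.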
In the following sections we describe the analog cells
used to realize the analog processor. The sizes of all
transistors have been chosen for cells designed in the
MIETEC 2.4 µm technology, and for a circuit with
P = … (the number of kernels), d = 16 (the dimension
of the data space) and a precision in the memory
points equal to 8 bits.
4.2 Description of the analog cells 
4.2.1 Analog memory points 
Analog memory points have been used to store the 
locations of the centroids for silicon area reasons, and 
to avoid non-necessary analog/digital conversions in 
the chip. 
The principle of our analog memory point is to store
a current by means of a capacitor (figure 2). When the
switch transistor is on, the drain and gate of memory
transistor $T_m$ are connected together, and its gate voltage
adjusts to let the input current $I_{mem}$ flow through the
transistor. When the switch transistor is turned off, the
capacitor holds the gate voltage of $T_m$, keeping
the same current $I_{mem}$ flowing through the transistor.
To compensate for the effects of leakage currents in the
blocked junction of the switch transistor, a refreshment
system sequentially reads all analog values stored on the
chip and refreshes them. The principle is the following.
We consider that the charge injection (when switching off
the switch transistor) and the leakage current in the
blocked junction together make the voltage across the
storage capacitor, between $V_{dd}$ and the gate of $T_m$,
decrease by less than one LSB in a refreshment period $T$;
this LSB is measured over the whole dynamic range of stored
voltages. In figure 2, the leakage current and the charge
injection have the same sign: the blocked junction injects
positive charges from $V_{dd}$ onto the capacitor, as does
the switching of the switch transistor. We therefore know
the sign of the slope of the capacitor voltage. If the
analog value in a memory point is now read at regular
intervals $T$ and converted into the smallest digital value
greater than the analog one, the memory point may be
refreshed to its initial level, keeping the stored value
fixed up to a precision of one LSB. All memory points of
the circuit may be refreshed by the same system, an
analog-to-digital converter followed by a digital-to-analog
one, provided that the period $T$ between two refreshments
of the same memory point is small enough to ensure a
decay of the capacitor voltage of less than one LSB; a
detailed description of this system may be found in [4].
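The refreshment principle can be checked with a short behavioral simulation in Python. This is a sketch under simplifying assumptions: the stored value is expressed in LSB units, the drift per period is constant and always in the same direction, and the conversion is modeled with a ceiling function (a value lying exactly on a level is left unchanged). The function names and drift values are hypothetical.

```python
import math

def refresh(value_lsb):
    """A/D conversion rounding up to the next level, then D/A write-back."""
    return float(math.ceil(value_lsb))

def simulate(level, droop_per_period, periods):
    """Return the final stored value and the worst error seen before a refresh."""
    value = float(level)
    worst_error = 0.0
    for _ in range(periods):
        # Leakage and charge injection act in the same, known direction.
        value -= droop_per_period
        worst_error = max(worst_error, level - value)
        value = refresh(value)
    return value, worst_error

print(simulate(100, 0.4, 1000))  # drift below one LSB per period
print(simulate(100, 1.3, 10))    # drift above one LSB per period
```

As long as the drift stays below one LSB per period, the value returns to its initial level at every refresh; once the drift exceeds one LSB, a level is lost at each period, which is why the refresh period must be chosen short enough.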
Figure 2: Regulated cascode analog memory point 
Another problem is the dependency of the current
in transistor $T_m$ on its drain voltage; the cell
implemented on the chip is thus a regulated cascode one
[5] (use of $T_m$ together with cascode transistors). In
figure 2, transistor $T_m$ operates in its linear region to
reduce its transconductance $g_m$ as well as the current
variation due to charge injection on the storage capacitor.
In order to keep the drain voltage of $T_m$ as fixed as
possible, one has to increase the gain of the loop formed
by the two cascode transistors; they are thus both kept in
saturation, and one of them operates in weak inversion
(through a very small $I_{ref}$ current) to maximize its gain
(transconductance over output conductance).
The storage capacitance must be around 1 pF to reach
an 8-bit accuracy in the stored current; to obtain this
value, a supplementary capacitor realized between the
two polysilicon layers of the MIETEC 2.4 µm technology
is added in parallel to the gate capacitor of $T_m$.
The maximum current memorized in the cell has been
set to 128 µA, one LSB corresponding to 500 nA.
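The LSB value quoted above follows directly from the full-scale current and the 8-bit resolution. The short worked step below makes the arithmetic explicit; the symbols $I_{max}$ and $I_{LSB}$ are introduced here only for illustration.

```latex
I_{LSB} = \frac{I_{max}}{2^{8}} = \frac{128\ \mu\mathrm{A}}{256} = 0.5\ \mu\mathrm{A} = 500\ \mathrm{nA}
```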
4.2.2 Synapse and input circuitry 
The circuit of figure 3 is repeated $P \times d$ times on the
chip, and connected to the $P \times d$ analog memory points
described in section 4.2.1. The purpose of the circuit
in figure 3 is twofold. First, it is used as the input of
the corresponding memory point when a current must
be stored. In this mode, an external input voltage
generates a current $I_{in}$ in figure 3, which is the
current $I_{mem}$ in figure 2; the write transistor is then
switched on, and the current is memorized in the cell.
In the second operation mode, we suppose that a current
$I_{mem}$ is memorized in the cell of figure 2, and that
it has to be subtracted from current $I_{in}$; we will see
in the next section how this difference may be used to
compute the distance between an input vector $x_i$ and
a centroid $p_j$. In this mode, however, the difference
between currents $I_{mem}$ and $I_{in}$ may be allowed to flow
out of these cells; this will also be detailed in section
4.2.3. The principle of the cascode cell in figure 3 is
similar to that of the memory point; the sizes
of the transistors are given in the figure.
Figure 3: Regulated cascode input circuitry (signal labels: Vdd, Iref, Vin)
4.2.3 Distance computation 
One of the main operations that must be realized
on-chip is the distance computation between a
d-dimensional input vector and P d-dimensional centroids.
The Manhattan distance has been used here for
simplicity, since we know that the choice of
distance measure does not a priori influence the
performance of a classifier [6].
Figure 4: Computation of the Manhattan distance between the input vector and one centroid
To compute the Manhattan distance between an
input vector $x$ and a centroid $p_j$, $1 \le j \le P$, three
operations must be realized: subtraction between $x$ and
$p_j$ coordinate by coordinate, absolute value, and sum
of these results over all coordinates. The subtraction
has already been addressed in the previous section;
when a memory point is in read mode, the difference
between the memorized currents $I_{mem,i}$ and the input
currents $I_{in,i}$ in figure 4 ($1 \le i \le d$) is allowed to flow
out from the cells (the corresponding switch transistors are
turned on). This current may be either positive or negative;
depending on its sign, it is directed to one of the two
summation current lines in figure 4. The difference
between these two sums must finally be computed by
a set of current mirrors in order to complete the
implementation of the distance computation. The voltages
on lines $I^+$ and $I^-$ are kept fixed through simple
operational amplifiers.
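The way the two summation lines produce an absolute value can be illustrated by the following Python sketch. It is a behavioral model only: currents are plain numbers, the sign convention (memorized minus input current) is an assumption, and the function name and values are hypothetical.

```python
def manhattan_by_current_lines(i_mem, i_in):
    """Manhattan distance obtained by steering signed differences onto two lines."""
    i_plus = 0.0   # line collecting the positive differences
    i_minus = 0.0  # line collecting the negative differences (sign kept)
    for mem, inp in zip(i_mem, i_in):
        diff = mem - inp
        if diff >= 0.0:
            i_plus += diff
        else:
            i_minus += diff
    # Subtracting the two line currents (current mirrors on the chip)
    # yields the sum of the absolute differences.
    return i_plus - i_minus

# Hypothetical currents in microamperes for a 3-dimensional example.
i_mem = [20.0, 40.0, 30.0]
i_in = [10.0, 50.0, 30.0]
print(manhattan_by_current_lines(i_mem, i_in))
```

No explicit absolute-value circuit is needed per coordinate: sorting the differences by sign and subtracting the two sums once is enough.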
4.2.4 Kernel functions 
Recent developments in the theory of KBC algorithms 
[7] have shown that the quality of probability density 
estimations can be greatly improved by adjusting two 
kinds of parameters in the Gaussian kernels. The first 
one is classically its width factor, but a second one, 
which must be adjusted depending on the dimension 
of the data space, determines the tail curvature of the 
Gaussian function, i.e. the rate at which the kernel 
function drops off. We show in this section two ways 
of implementing kernel functions. 
Figure 5 shows the differential pair used to realize
the first type of Gaussian-like kernel. Let us first
mention that the exact kernel shape is not critical for the
approximation of probability densities as long as two
such parameters can be adjusted; moreover, only half
of the Gaussian function has to be realized, since its
argument is always positive (distances). We thus use
the non-linear characteristics of a differential pair to
evaluate the Gaussian-like functions.
Figure 5: Kernel Gaussian-like function 
In figure 5, the input voltage $V_{in}$ is generated by
passing the argument of the kernel function, namely
the difference between currents $I^+$ and $I^-$ in figure
4, through a transistor in its linear region. The width
of the kernel is determined by voltage $V_{ref}$, while its
curvature is adjusted by modifying voltage $V_C$, which
acts on the conductance of transistor T3.
Figure 6 shows two simulations of the Gaussian-like
kernel function: one for a single value of $V_C$ with
$V_{ref}$ ranging from 0.5 to 2.5 V, and one for $V_{ref}$
fixed at 1.5 V with a sweep of $V_C$. The chip is currently
under test and measurements are not yet available.
Figure 6: Simulation of the Gaussian-like kernel function: influence of $V_{ref}$ and of $V_C$
Another implementation of a Gaussian kernel, which
follows the equation of a Gaussian kernel (2) more
closely, is also proposed here. The principle of the
circuit is illustrated in figure 7. The Gaussian function
is constructed in two steps. First, a MOS transistor
in saturation, T5, performs the square function of the
input voltage; secondly, a negative exponential circuit
realizes the Gaussian function.
To understand how the circuit of figure 7 works,
let us first suppose that transistors T3 and T4 act
as resistors, identified by $R_1$ and $R_2$. The reference
current $I_R$ is set small enough to put transistors T1 and
T2 in weak inversion. If we denote by $V_{D1}$, $V_{D2}$,
$V_{S1}$, $V_{S2}$ and $V_G$ respectively the drain voltages
of T1 and T2, their source voltages, and their common gate
voltage, we have, if $V_{D1}$ and $V_{D2}$ are much greater
than $U_T$:
$$I_R = I_{D0}\, e^{V_G/(n U_T)}\, e^{-V_{S1}/U_T} \qquad (4)$$
$$I_O = I_{D0}\, e^{V_G/(n U_T)}\, e^{-V_{S2}/U_T} \qquad (5)$$
where $I_{D0}$ is the current prefactor common to both
transistors, $U_T = kT/q$ and $n$ is the subthreshold slope.
Dividing both expressions and substituting
$V_{S1} = R_1 I_R$ and $V_{S2} = R_2 (I_O + I_I)$ leads to
$$I_O = I_R\, e^{(R_1 I_R - R_2 I_O - R_2 I_I)/U_T} \qquad (6)$$
Finally, assuming that $I_I \gg I_R$, and then $I_O \ll I_R$,
which can easily be guaranteed since $I_R$ is small as
stated before, we have
$$I_O = I_R\, e^{-R_2 I_I / U_T} \qquad (7)$$
The output current $I_O$ decreases exponentially with
current $I_I$, which was the expected behavior; the value
of $R_2$ determines the exponential constant.
Figure 7: Kernel Gaussian function (W/L values listed in the figure: 240/8, 240/8, 90/6, 90/6, 60/8)
The value of $R_2$ is then determined by transistor
T4, working in strong inversion and in its linear
region; assuming that the drain voltage of T4 is small,
its output conductance is controlled by $V_C$ and given
by:
$$g_{ds4} = \frac{1}{R_2} = \beta_N (V_C - V_{TN}) \qquad (8)$$
where $g_{ds4}$ is the drain-source conductance of transistor T4,
$V_{TN}$ the threshold voltage of N-type transistors and
$\beta_N$ their conductance parameter. Transistor T5 is
saturated and in strong inversion too, and current $I_I$ is
then given by:
$$I_I = \frac{\beta_P}{2\lambda}\,(V_{in} - |V_{TP}|)^2 \qquad (9)$$
where $V_{TP}$ is the threshold voltage of P-type transistors,
$\beta_P$ their conductance parameter, and $\lambda$ takes the
substrate effect into account. Combining 7, 8 and 9
leads to:
$$I_O = I_R \exp\!\left(-\frac{\beta_P\,(V_{in} - |V_{TP}|)^2}{2 \lambda U_T \beta_N (V_C - V_{TN})}\right) \qquad (10)$$
which has the desired Gaussian form; $V_C$ controls the
width parameter of the Gaussian function, and the constant
$V_{TP}$ can be compensated by an offset voltage shifting
the center of the function.
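The two-step construction (squaring transistor followed by a negative exponential) can be explored numerically with the Python sketch below. It assumes the simplified models used in the derivation: a square-law current for the saturated transistor, a linear-region transistor acting as a resistance set by the control voltage, and an exponential dependence of the output current on the voltage drop across that resistance. All parameter values and names are hypothetical and chosen only to give a readable curve; they are not taken from the circuit.

```python
import math

U_T = 0.0259  # thermal voltage kT/q near 300 K, in volts

def squarer_current(v_in, beta_p, v_tp_abs, lam):
    """Square-law current of the saturated transistor (zero below threshold here)."""
    overdrive = max(v_in - v_tp_abs, 0.0)
    return beta_p / (2.0 * lam) * overdrive ** 2

def linear_resistance(v_c, beta_n, v_tn):
    """Equivalent resistance of a linear-region transistor with small drain voltage."""
    return 1.0 / (beta_n * (v_c - v_tn))

def kernel_current(v_in, v_c, i_r, beta_p, beta_n, v_tp_abs, v_tn, lam):
    """Output current I_R * exp(-R2 * I_I / U_T), Gaussian in the input voltage."""
    i_i = squarer_current(v_in, beta_p, v_tp_abs, lam)
    r2 = linear_resistance(v_c, beta_n, v_tn)
    return i_r * math.exp(-r2 * i_i / U_T)

# Hypothetical parameter values, for illustration only.
PARAMS = dict(i_r=1e-9, beta_p=1e-6, beta_n=1e-4, v_tp_abs=1.0, v_tn=1.0, lam=1.3)
for v_c in (2.0, 3.0, 4.0):
    curve = [kernel_current(v, v_c, **PARAMS) for v in (1.0, 2.0, 3.0, 4.0, 5.0)]
    print(v_c, ["%.3e" % i for i in curve])
```

In this model a larger control voltage lowers the resistance and therefore widens the bell-shaped curve, while the threshold of the squaring transistor only shifts its center.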
Figure 8 shows the measured output of the circuit
for $V_C$ varying between 1 and 4 V (with a step of 1 V),
and for an input voltage between 0 and 5 V. The cell
has been realized in UCL SOI (Silicon-On-Insulator)
3 µm technology.
5 Conclusion 
We described in this paper the analog implementation
of a kernel-based classifier, based on estimation
of in-class probability densities. The global architecture
of the system was detailed, together with the
implementation of some specific blocks, including
measurements on the Gaussian kernels realized in
Silicon-On-Insulator technology. Such an analog processor may
be used in any classification system when speed and
portability are necessary, together with the high
performance of Bayes classifiers. Remaining work
concerns the implementation of a large system (only
individual cells have been realized and tested up to now) and
the programming of a digital controller to sequence
the operations in the processor.
Figure 8: Chip measurements of the kernel Gaussian function
Acknowledgments 
Part of this work has been funded by the ESPRIT-BRA
project 6891, ELENA-Nerves II, supported by
the Commission of the European Communities (DG
XIII). Michel Verleysen is a Senior Research Assistant
of the Belgian National Fund for Scientific Research
(FNRS). Philippe Thissen is working towards the
Ph.D. degree in microelectronics under an IRSIA
(Institut pour l'Encouragement de la Recherche
Scientifique dans l'Industrie et l'Agriculture) fellowship.
References 
[1] E. Parzen, "On the estimation of a probability density
function and the mode," Ann. Math. Stat.,
vol. 27, pp. 1065-1076, 1962.
[2] T. Cacoullos, "Estimation of a multivariate density,"
Annals of Inst. Stat. Math., vol. 18, pp. 178-189, 1966.
[3] T. Kohonen, Self-Organization and Associative
Memory. Berlin: Springer-Verlag, 1989. 3rd Edition.
[4] D. Macq, J. Legat, and P. Jespers, "Analog storage
of adjustable synaptic weights," in Proceedings
of the SPIE conference on Applications of Artificial
Neural Networks, Orlando (USA), pp. 712-718, April 1992.
[5] C. Toumazou, J. Hughes, and D. Pattullo, "Regulated
cascode switched-current memory cell," Electronics
Letters, vol. 26, pp. 303-305, March 1990.
[6] M. Verleysen, P. Thissen, J. Voz, and J. Madrenas,
"An analog processor architecture for a neural network
classifier," IEEE Micro, vol. 14, pp. 16-28, June 1994.
[7] P. Comon, J. Voz, and M. Verleysen, "Estimation
of performance bounds in supervised classification,"
in ESANN94-European Symposium on Artificial
Neural Networks (M. Verleysen, ed.), (Brussels,
Belgium), pp. 37-42, D facto publications, April 1994.
