An FPGA Implementation of Kak's Instantaneously-Trained, Fast-Classification Neural Networks by Zhu, J. & Sutton, P. R.
Abstract
Motivated by a biologically plausible short-memory
sketchpad, Kak’s Fast Classification (FC) neural networks
are instantaneously trained by using a prescriptive training
scheme. Both weights and the topology for an FC network
are specified with only two presentations of the training
samples. Compared with iterative learning algorithms such
as Backpropagation (which may require many thousands of
presentations of the training data), the training of FC net-
works is extremely fast and learning convergence is always
guaranteed. Thus FC networks are suitable for applications
where real-time classification and adaptive filtering are
needed. In this paper we show that FC networks are “hard-
ware friendly” for implementation on FPGAs. Their unique
prescriptive learning scheme can be integrated with the
hardware design of the FC network through parameteriza-
tion and compile-time constant folding.
1. Introduction
There exist certain classes of real-time classification / adaptive control systems within which learning is a critical task which must be guaranteed to finish on time. Reconnaissance robots and satellite sensory systems looking for interesting objects in unfamiliar environments are examples of such real-time systems. In these types of systems, one cannot fully anticipate the range of objects the classification system may encounter. The classification system must learn to classify these objects in real time and keep learning as it continues to explore. Neural networks have been shown to be powerful classification tools. However, neural networks which are based on iterative learning algorithms, such as multilayer perceptrons, radial basis functions and support vector machines, can suffer from a training bottleneck [1]: learning may not converge, or may take too long to be useful in real-time applications even with hardware acceleration. For this reason, iterative learning neural networks are not suitable for real-time systems where learning is a critical task.
Kak’s Fast Classification (FC) networks [2] overcome
the learning bottleneck by employing instantaneous learn-
ing. The model of FC networks is motivated by a biologi-
cally plausible sketchpad mechanism for short-term
memory in which learning occurs instantaneously. The
learning in FC networks does not suffer from the learning
bottleneck and is always guaranteed to converge. Both the
weights and the topology of an FC network are determined
by simple inspection of the training examples. Only two
presentations of training samples are required to train an FC
network, which is extremely efficient compared with itera-
tive learning algorithms, such as backpropagation, where
thousands of presentations of training samples are required.
In this paper, we show that Kak’s FC networks with their
prescriptive learning scheme are well suited for implemen-
tation on FPGA based reconfigurable hardware platforms
by exploiting fine grained parallelism. We show that the
prescriptive learning algorithm can be integrated into hard-
ware design for the FC networks through parameterization
and compile-time constant folding. 
The remainder of this paper is organised as follows. Sec-
tion 2 describes the algorithm framework for the FC net-
works. Operations in the training and execution phases of
FC networks are formally presented. Section 3 presents the
hardware design for FC networks. The overall system archi-
tecture is outlined first, followed by implementations for
network components: hidden neurons, hidden layer rule-
bases and the output neurons. Section 4 discusses strategies
for integrating prescriptive learning with the design of FC
networks and Section 5 draws some conclusions.
Authors' affiliation: Jihan Zhu and Peter Sutton, School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane QLD 4072 Australia. {jihan, p.sutton@itee.uq.edu.au}

2. Algorithmic Framework for FC Networks
The FC networks have a three-layer feed-forward architecture which consists of a layer of inputs, a layer of distance-based hidden neurons, a fuzzy rule base and an output layer, as illustrated in Fig. 1.
Figure 1: An Illustration of an FC Network
Input data presented to an FC network is an $n$-element continuous-valued vector $x = (x_1, x_2, \ldots, x_n)$, where $n$ is the length of the input and is determined by the problem specification. Each hidden neuron $i$ $(i = 1, 2, \ldots, m)$ stores an exemplar training sample faithfully as its weight vector $w_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,n})$. A hidden neuron $i$ first computes the distance $d_i$ between the input vector $x$ and its weight vector, as shown in Fig. 2 (a).
Figure 2: (a) Computation in a hidden neuron 
(b) Activation function 
The distance $d_i$, a scalar, is then passed to the activation function $F_i$ to produce an output $h_i$ for the hidden neuron, as illustrated in Fig. 2 (b). The operation of the hidden unit activation function $F_i$ can be specified mathematically as:
$$h_i = \begin{cases} 0, & d_i \le r_i \\ d_i, & d_i > r_i \end{cases} \tag{1}$$
where $r_i$ is defined as the radius of generalization for hidden neuron $i$. The effect of using this activation function is to make any test vector $x$ within a certain distance of a stored exemplar training sample $w$ indistinguishable from the training sample, and hence the test vector will be classified in the same output class as the training sample $w$. The generalization radius $r$ hence “fuzzifies” the input space around a training sample $w$; any test vector that lies within this fuzzy region will be classified in the same class as the training example.
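The behaviour of a single hidden neuron described by equation (1) can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design: the function names `city_block` and `hidden_neuron` are hypothetical, and the city-block metric is assumed here for the distance.

```python
from typing import Sequence


def city_block(x: Sequence[float], w: Sequence[float]) -> float:
    """Distance between input x and stored exemplar w (city-block metric assumed)."""
    return sum(abs(a - b) for a, b in zip(x, w))


def hidden_neuron(x: Sequence[float], w: Sequence[float], r: float) -> float:
    """Equation (1): output 0 inside the radius of generalization, else the distance."""
    d = city_block(x, w)
    return 0.0 if d <= r else d
```

A neuron whose output is 0 has "fired": the test vector lies inside its generalization region.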
All outputs from the hidden layer form a distance vector $h = (h_1, h_2, \ldots, h_m)$ which represents the similarity between the test vector $x$ and each of the training samples stored in the hidden layer of the FC network. The distance vector is then presented to the fuzzy rule base, which maps the distance vector $h$ into a membership grade vector $\mu = (\mu_1, \mu_2, \ldots, \mu_m)$. The vector $\mu$ represents the degree of membership that the test vector $x$ has to each of the output classes. The properties of $\mu$ are described below:
$$(0 < \mu_i < 1) \in \mu, \qquad \sum_{i=1}^{m} \mu_i = 1. \tag{2}$$
The output neuron then computes a dot product between the output weight vector $v = (v_1, v_2, \ldots, v_m)$ and the fuzzy membership vector $\mu$ to aggregate all fuzzy contributions from hidden neurons and produce the final network output $y$ for the test vector $x$. The fuzzy aggregation is illustrated in Fig. 3 and is described mathematically as:
$$y = \sum_{i=1}^{m} \mu_i v_i, \tag{3}$$
where the output weight $v_i$ is assigned to be the corresponding target output of the exemplar vector stored in hidden neuron $i$.
Figure 3: Fuzzy generalization and aggregation 
in the output layer.
2.1. Prescriptive Learning in FC Networks
The prescriptive learning scheme for training FC networks is very simple and requires only two presentations of the training samples. The first presentation of training samples is used to prescribe the topology and weights, while the second presentation is used to determine the radius of generalization $r_i$ for each hidden neuron. These two processes are described briefly below.
2.1.1. Prescribing the Topology and Weights
An FC network faithfully represents all training samples in hardware by allocating one neuron for each training sample. Given a training sample set, the prescriptive algorithm directly determines the topology and weights of an FC network. The required number of input neurons is the length of the input vector. The number of hidden neurons equals the number of samples in the training set (i.e. each hidden neuron represents one training sample). The required number of output neurons equals the desired number of outputs (only one neuron is shown in Fig. 1 for simplicity; multi-way classification can always be partitioned into networks with one output neuron).
With the topology of the network specified, assigning weights for the network is done by simply inspecting the training samples. The first presentation of the training samples determines the input and output weights. Let $T$ be the training set, which contains $m$ training samples indexed by $i$; let $(t_{i,j}, o_{i,k})$ be the input-target pair of the $i$th exemplar training sample to be stored at a hidden neuron. Then the input weight for the hidden neuron is assigned to be $w_{i,j} = t_{i,j}$. Similarly, the corresponding output weight is assigned to be $v_{i,k} = o_{i,k}$. Here $j$ and $k$ range over the elements of the input vector and the target vector, respectively.
2.1.2. Determining the radius of generalization
The second presentation of the training samples is needed to determine the radius of generalization $r_i$ for each of the hidden neurons $i$. Distances are calculated between the exemplar training sample $t_i$ represented by the current hidden neuron and all other training samples. The smallest distance $d_{min}$ is from the training sample $t_i$ to its nearest neighbour. The radius of generalization $r_i$ for the $i$th hidden neuron is then set to $d_{min}/2$. This ensures that the generalization regions of the hidden neurons never overlap.
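The two presentations of the training set can be summarised in a short Python sketch. It is a software illustration under stated assumptions rather than the authors' implementation: `train_fc` and `city_block` are hypothetical names, the city-block metric is assumed, and at least two training samples are required.

```python
def city_block(a, b):
    """City-block distance between two equal-length vectors (assumed metric)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def train_fc(samples, targets, dist=city_block):
    """Prescribe weights, output weights and radii from the training set."""
    # First presentation: copy each sample and target into the network.
    weights = [list(t) for t in samples]   # w_i = t_i
    outputs = list(targets)                # v_i = o_i
    # Second presentation: radius is half the distance to the nearest other sample.
    radii = []
    for i, t_i in enumerate(samples):
        d_min = min(dist(t_i, t_j) for j, t_j in enumerate(samples) if j != i)
        radii.append(d_min / 2)
    return weights, outputs, radii
```

Because each radius is half the distance to the nearest other sample, two generalization regions can touch but never overlap.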
2.1.3. Generalization with Fuzzy Rule Base
When a test vector falls in the generalization region of a hidden neuron, the fuzzy rule base merely acts as a gating function. That is, if $h_j = 0$, the fuzzy membership grades are assigned according to the following rule:
$$\mu_i = \begin{cases} 1, & i = j \\ 0, & i \ne j \end{cases} \qquad i, j \in (1, 2, \ldots, m). \tag{4}$$
In this way the test vector is classified as belonging to the same output class as the exemplar training sample.
The purpose of employing a fuzzy rule base in the FC network is to provide a generalization method that allows a test vector to have fuzzy membership grades in the output classes of its $k$ nearest neighbours when the test vector does not fall into the generalization region of any training sample. This is needed because, as mentioned above, the generalization regions of hidden neurons do not overlap. The fuzzy rule base provides a way to interpolate the final output for a test vector based on its distances from the $k$ nearest training examples. Although a variety of fuzzy membership functions can be used to map the distance vector $h$ to the membership vector $\mu$, the simplest is the triangular membership function. For example, when $k = 2$, and $d_1$, $d_2$ are the distances between the test vector and its two nearest neighbours, the triangular membership function would be expressed as:
$$\mu_1 = \frac{1/d_1}{1/d_1 + 1/d_2}, \qquad \mu_2 = \frac{1/d_2}{1/d_1 + 1/d_2}. \tag{5}$$
Fig. 4 gives an illustration of a triangular fuzzy membership function for the two nearest neighbour case. Other membership functions, for example the quadratic function $S$, are possible. However, Kak’s experiments show that the performance of an FC network is not significantly affected by the choice of the fuzzy membership function.
Figure 4: Fuzzy membership function for 2 
nearest neighbour case.
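As a hedged illustration of equations (3) to (5), the following Python sketch combines the gating rule with the inverse-distance membership grades. The function names are hypothetical, the generalisation of equation (5) from $k = 2$ to arbitrary $k$ is an assumption, and the selection of nearest neighbours is done here with a plain software sort.

```python
def inverse_distance_memberships(distances):
    """Equation (5), extended to k distances; assumes every distance is > 0."""
    inv = [1.0 / d for d in distances]
    total = sum(inv)
    return [g / total for g in inv]


def fc_output(h, v, k=2):
    """h: hidden-layer outputs (0 means the neuron fired); v: output weights."""
    for h_j, v_j in zip(h, v):
        if h_j == 0:                       # gating rule, equation (4)
            return v_j
    nearest = sorted(range(len(h)), key=lambda i: h[i])[:k]
    mu = inverse_distance_memberships([h[i] for i in nearest])
    return sum(m * v[i] for m, i in zip(mu, nearest))   # equation (3)
```

For two neighbours at distances 1 and 3 the grades are 0.75 and 0.25, so the closer exemplar dominates the interpolated output.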
3. Implementation
Like its biological counterpart, an FC network is limited by its capacity to store training samples. By the definition of prescriptive learning, all training samples must be faithfully represented in the network; that is, one hidden neuron is needed for each training sample in the training set. Hence the size of the hidden layer is of $O(m)$, where $m$ is the total number of training samples. When $m$ is large, the hardware realization of an FC network is also large. To overcome this problem, the implementation of a hidden neuron in an FC network must be space efficient. Various strategies are used to minimise the resources used to implement the hidden neurons. Strategies are also used to simplify the implementation of the fuzzy rule base, as the fuzzy rule base sorts the distance vector $h$ to select the $k$ nearest neighbours of a test vector. The search space is related to the number of hidden neurons. Specifically, an $m$-element fully parallel bitonic sorter is $O(\ln^2(m))$ deep. These strategies are described below in detail.
3.1. An Overview of System Architecture
The implementation is targeted to a Celoxica RC2000
board with a Xilinx XC2V6000 Virtex-II chip. The design
and simulation for the hardware implementation are carried
out by using the JHDL hardware description language. As
described later, the design of an FC network is parameter-
ised and integrated with the prescriptive learning scheme to
produce design cores for final synthesis on hardware. 
3.2. Hidden Neuron Circuit
The hidden neuron circuit is a critical part of the implementation. As each of the training samples in the training data set is required to be represented by one hidden neuron, the resources used by the hidden layer in an FC network are considerable when the training set is large. Strategies are used to make the hidden neuron circuit implementation as resource efficient as possible.
3.2.1. Implementation of Distance Function
As explained in Section 2, hidden neurons in an FC network are distance based, and they compute the distance between the input vector $x$ and every training sample stored in the hidden layer as the input weights. The distance calculated by the $i$th hidden neuron can be expressed as:
$$d_i = \|x - w_i\|^p = \sum_{j=1}^{n} |x_j - w_{i,j}|^p \tag{6}$$
where $n$ is the length of the input vector $x$. In Kak’s FC network proposal Euclidean distance is preferred because it is rotationally invariant and it minimizes the within-class classification variance. Euclidean distance is a special case of the general distance metric, i.e. $p = 2$, and
$$d_i = \|x - w_i\|^2 = \sum_{j=1}^{n} |x_j - w_{i,j}|^2. \tag{7}$$
However, Euclidean distance is very expensive to implement on FPGAs as it requires a “squaring a number” operation. If Euclidean distance were used, a hidden layer of $m$ neurons, each with an $n$ element weight vector, would require $O(mn)$ “squaring” operations. Fortunately, experiments conducted by Kak have shown that the FC network performance is robust and is not seriously affected by the choice of distance metric. Two alternative distance metrics were investigated previously by [3] as replacements for the Euclidean distance:
• when $p = 1$, the city-block distance:
$$d_i = \|x - w_i\| = \sum_{j=1}^{n} |x_j - w_{i,j}|, \text{ and} \tag{8}$$
• when $p = \infty$, the box-distance:
$$d_i = \|x - w_i\|_\infty = \max_j \left( |x_j - w_{i,j}| \right). \tag{9}$$
In the K-means image classification context, Estlick’s
experiments [3] show that these two alternative distance
metrics gave acceptable performance and are more amena-
ble to FPGA hardware implementation because the costly
“squaring” operation is avoided. 
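To make the difference between the three metrics concrete, the sketch below evaluates equations (7), (8) and (9) on the same pair of vectors. It is illustrative only; the function names are hypothetical and plain Python integers stand in for the fixed-width hardware values.

```python
def squared_euclidean(x, w):
    """Equation (7): needs one multiplication (squaring) per element."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


def city_block(x, w):
    """Equation (8): subtractions, absolute values and additions only."""
    return sum(abs(a - b) for a, b in zip(x, w))


def box_distance(x, w):
    """Equation (9): subtractions, absolute values and comparisons only."""
    return max(abs(a - b) for a, b in zip(x, w))
```

For $x = (1, 5, 2)$ and $w = (4, 1, 2)$ the element-wise differences are 3, 4 and 0, giving 25, 7 and 4 respectively; only the first metric requires a multiplier.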
The city-block distance, which requires $n$ subtractions, $n$ absolute values and $n-1$ comparisons, is selected to implement the required distance computation in a hidden neuron. The distance circuit for a four input case is shown schematically in Fig. 5 (a).
Figure 5: (a) A four input distance circuit for a 
hidden neuron. (b) Two possible half-adders 
after folding of constant input weights into 
subtractors
Since the input weight $w$ becomes known after the weights have been prescribed by training, significant real estate savings can be made by dynamically folding the now constant input weight $w$ into the subtraction circuit. Hence, each of the subtractors is implemented by using a series of half-adders which take either of the configurations illustrated in Fig. 5 (b). The circuits for absolute values and comparators in the comparator-tree are realised by using standard $n$ bit adders and subtractors respectively.
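One way to see why a constant operand reduces each full adder to a half-adder-like cell is to model the subtraction bit by bit. The sketch below is an assumption-laden software model, not the circuit of Fig. 5 (b): it computes $x - w$ as $x + \bar{w} + 1$, and it assumes that the constant bit $b_i$ selecting between the two cell types is bit $i$ of the complemented weight. The name `const_subtract` is hypothetical.

```python
def const_subtract(x: int, w: int, n_bits: int) -> int:
    """Compute (x - w) mod 2**n_bits with the constant w folded into the cells."""
    b = ~w & ((1 << n_bits) - 1)   # constant addend: bitwise complement of w
    carry = 1                      # the "+1" of two's complement negation
    result = 0
    for i in range(n_bits):
        x_i = (x >> i) & 1
        if (b >> i) & 1 == 0:
            s = x_i ^ carry            # full adder with b_i = 0: XOR / AND
            carry = x_i & carry
        else:
            s = 1 - (x_i ^ carry)      # full adder with b_i = 1: XNOR / OR
            carry = x_i | carry
        result |= s << i
    return result
```

Each cell has only two variable inputs (`x_i` and `carry`), which is where the area saving over a general full adder comes from.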
3.2.2. Implementation of Activation Function
As the radius of generalization $r_i$ also becomes a constant after training, the activation function is implemented as an $n$ bit constant comparator $KC$, as shown in Fig. 6. The constant comparator is implemented by folding the constant $r_i$ into a subtractor circuit as the subtrahend, in the same fashion as described above. The sign bit of the comparator is registered first and is then connected to the MSB of bus $h_i$. This bit is set if $d_i < r_i$; otherwise it is low. This signal is used in the final stage of the output layer circuit to select the 1NN output, as discussed in the next subsection.
To simplify the control of the overall circuit, in addition to storing the radius of generalization constant, it was decided to also store the output weights $v_i$ with the activation function circuit in a register bank. The $b$ bits of the output weight $v_i$ form the next $b$ bits of bus $h_i$. Storing and transmitting the output weights with the distance values avoids the need to track the indices of the $k$ nearest neighbour hidden neurons and to load their corresponding output weights in the output layer stage. An alternative to this approach is to store and transmit the index encoding for each hidden neuron in the $h_i$ bus. However, the index would occupy $\log(m)$ bits in the bus. When there are many hidden neurons (e.g. $m > 256$), the indices for hidden neurons will require more than 8 bits to represent, which is the width used to store and transmit the output weights $v_i$.
The distance $d_i$ is also registered and then connected to the lower $n$ bits of bus $h_i$. The comparator sign bit, the output weights $v_i$ and the distance $d_i$ are transmitted together through a $2b+1$ bit wide bus $h_i$, as illustrated in Fig. 6.
Figure 6: An eight bit example of the activation 
circuit.
A collection of buses $h = (h_1, h_2, \ldots, h_m)$ is connected to the fuzzy rule base for further processing.
3.2.3. Parameters of a Hidden Neuron
The parameters of a hidden neuron are listed below.
Inputs to a hidden neuron:
• Input $x$: a vector of length $n$; each element has $b$ bits;
• Input weight $w_i$: a constant vector of length $n$; each element has $b$ bits;
• Radius of generalization $r_i$: a constant value which has $b$ bits;
• Output weight $v_i$: a constant value which has $b$ bits.
Output from a hidden neuron:
• bus $h_i$ has $2b+1$ bits with:
bits [0:b-1] - distance $d_i$ between $x$ and $w_i$;
bits [b:2b-1] - output weight $v_i$;
MSB - set if $d_i < r_i$.
A hidden neuron is pipelined with $b$ bit register banks after the constant subtractors, after the absolute value units and after each level of the comparator-tree. The depth of the pipeline is dependent on the size of the input layer. When there are $n$ input neurons the depth of the pipeline is $3 + \log(n)$.
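The bus layout listed above can be modelled as simple bit packing. The following sketch is an illustration only: `pack_bus` and `unpack_bus` are hypothetical helper names, $b = 8$ is taken from the eight bit example of Fig. 6, and values are treated as unsigned.

```python
def pack_bus(d: int, v: int, fired: bool, b: int = 8) -> int:
    """Pack distance, output weight and fired flag into a (2b+1)-bit word."""
    mask = (1 << b) - 1
    return (int(fired) << (2 * b)) | ((v & mask) << b) | (d & mask)


def unpack_bus(h: int, b: int = 8):
    """Return (distance, output weight, fired flag) from a packed bus word."""
    mask = (1 << b) - 1
    return h & mask, (h >> b) & mask, bool((h >> (2 * b)) & 1)
```

Because the distance occupies the low-order bits, a sorter that compares only bits [0:b-1] can move the whole word, so the output weight and the fired flag travel with their distance.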
3.3. Implementation of the Fuzzy Rule Base
The primary function of the fuzzy rule base is to map the
distances between a test vector and each of the training sam-
ples (stored in hidden neurons) into fuzzy membership
grades. The fuzzy rule base accomplishes this function in
two ways, 1NN and kNN, which are described below:
• 1NN: If and only if a test vector lies within the radius of generalization of a hidden neuron $i$ (which implies $d_i \le r_i$ and $h_i^{MSB} = 1$), the hidden neuron $i$ fires. In an FC network only one hidden neuron can fire at any given time because the generalization regions of hidden neurons do not overlap. In this case, the fuzzy rule base functions as a 1NN. The fuzzy rule base acts as a gating function which assigns the fuzzy membership $\mu_i$ to 1 and assigns the rest $\mu_j\ (j \ne i)$ to 0. The test vector will be classified as belonging to the same output class as the training sample stored at hidden neuron $i$.
• kNN: Further generalization is achieved by the fuzzy rule base if a test vector does not lie within any hidden neuron’s radius of generalization, i.e. none of the $h_i^{MSB} = 1$. In this case the fuzzy rule base first selects the $k$ nearest neighbours of the test vector $x$ from the distance vector and maps the distances between the test vector and each of its $k$ nearest neighbours into a set of fuzzy membership grades $\mu_i,\ i = 1, 2, \ldots, k < m$. Through this process of “fuzzification”, the decision of the FC network is further generalized.
Notice that with either the 1NN or kNN rule, the firing hidden neuron’s index or the indices of the hidden neurons corresponding to the $k$ nearest neighbours are tracked and used to select their corresponding output weights. We have decided to simplify the fuzzy rule base because a faithful implementation of the above fuzzy rule base involves realizing complex controls and results in variable neuron firing rates. Recall from the above that the 1NN rule fires instantaneously and selects its output immediately, while the kNN rule needs to wait for the sorting circuit to select the $k$ nearest neighbours and then wait for the fuzzy computation of the output to finish. Specifically, the 1NN rule is now integrated into kNN, and the 1NN rule will not take effect until the final stage of the output layer. That is, regardless of whether a hidden neuron has fired, the kNN rule always applies. The appropriate 1NN or kNN output can always be selected at the last stage of the output layer by checking the most significant bit of bus $h_i$. If it is set, then a hidden neuron has fired and the 1NN output should be selected, as illustrated in Fig. 11 in the next subsection. In a sense 1NN is a special case of kNN: 1NN’s output is the smallest value among the $k$ nearest neighbour distances. Two benefits are achieved with this approach:
• no complex control circuitry is needed for output selection; and
• the output neuron firing rate is now constant.
3.3.1. Implementation of kNN circuit
The kNN circuit performs two tasks:
• Task 1: given a test vector $x$, the kNN circuit selects the test vector’s $k$ nearest neighbours based on the distance vector $h$ from the hidden layer.
• Task 2: it maps the distances between $x$ and its $k$ nearest neighbours into fuzzy membership grades $\mu$.
The selection of the $k$ nearest neighbours is implemented using a bitonic selection network [4]. A bitonic selection network is used because its sorting components are very simple and the basic operations of the sorting algorithm are simple and highly parallel. Different stages in the sorting network are amenable to pipelining. Fig. 7 shows a schematic illustration of an eight-input bitonic sorting network using the notation and style used by Kumar et al. [5]. A bitonic sorting network takes an unsorted series of distances (transmitted by the lower $n$ bits of bus $h_i$) and sorts them into a series with monotonically increasing order. Because we are only interested in selecting the $k$ nearest neighbours of a test vector, and we know that $k \ll m/2$, the bottom-left corner of the last stage of the bitonic sort network (shown in gray in Fig. 9) is not used.
Figure 7: An eight input bitonic sorting network.
Given a bitonic sequence, a bitonic merging network, as shown in Fig. 9, is optimal and can be pipelined in $\log m$ stages. Each of the pipeline stages contains a total of $m/2$ two-input comparator-and-swap units labelled as either +BM[2] or -BM[2], as shown in Fig. 8. The two-input comparator-and-swap units +BM[2] and -BM[2] sort the two elements into increasing and decreasing order respectively, and can be expressed mathematically as follows:
+BM[2]: $x = \min(l, r), \quad y = \max(l, r)$ (10)
-BM[2]: $x = \max(l, r), \quad y = \min(l, r)$ (11)
As the FC network selects the $k$ nearest neighbours, we desire the $k$ smallest distances to be sorted in monotonically increasing order, hence +BM[2] units are used.
Although bitonic sorting networks are optimal for sorting bitonic sequences, overheads are incurred to convert an unsorted sequence into a bitonic sequence. For example, in the eight input bitonic sorting network shown in Fig. 7, the first two stages of the network convert an unsorted sequence into a bitonic sequence of length 8 so that the last stage of the eight-input bitonic sorting network can merge it into a sorted sequence. The first two stages of the bitonic sequence converter unit are shown explicitly in Fig. 10.
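The structure of the sorting network can be illustrated with the textbook recursive formulation of bitonic sort. This Python sketch is a sequential software model, not the pipelined hardware: the function names are hypothetical, the input length is assumed to be a power of two, and the full sequence is sorted before the first $k$ entries are taken (the hardware omits the unused part of the last stage).

```python
def compare_swap(seq, i, j, ascending):
    """+BM[2] when ascending is True, -BM[2] when it is False."""
    if (seq[i] > seq[j]) == ascending:
        seq[i], seq[j] = seq[j], seq[i]


def bitonic_merge(seq, lo, n, ascending):
    """Merge a bitonic run of length n starting at lo into sorted order."""
    if n > 1:
        half = n // 2
        for i in range(lo, lo + half):
            compare_swap(seq, i, i + half, ascending)
        bitonic_merge(seq, lo, half, ascending)
        bitonic_merge(seq, lo + half, half, ascending)


def bitonic_sort(seq, lo=0, n=None, ascending=True):
    """Sort in place: build a bitonic sequence from two halves, then merge it."""
    if n is None:
        n = len(seq)
    if n > 1:
        half = n // 2
        bitonic_sort(seq, lo, half, True)           # ascending half
        bitonic_sort(seq, lo + half, half, False)   # descending half
        bitonic_merge(seq, lo, n, ascending)


def k_nearest(distances, k):
    """Return the k smallest distances in increasing order."""
    seq = list(distances)
    bitonic_sort(seq)
    return seq[:k]
```

All compare-and-swap operations inside one loop of `bitonic_merge` touch disjoint pairs, which is what allows a hardware stage to execute them in parallel.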
Figure 8: Two-input comparator-and-swap units: +BM[2] sorts two elements in increasing order, -BM[2] sorts two elements in decreasing order. An eight-bit example is given.
Figure 9: An eight bit example of bitonic sorting 
/ selection network. The left half is unused, only 
first four outputs are selected.
Figure 10: Bitonic converter network which 
merges an unsorted sequence into a bitonic 
sequence.
In our implementation, the unsorted sequences of distances are transmitted by $h$ buses from each neuron in the hidden layer. The lowest $n$ bits in each $h$ bus represent the distance value. The distance values form an unsorted sequence. This unsorted sequence of distance values can be seen as a concatenation of bitonic sequences of size two. They are first converted by $\log n - 1$ stages of bitonic merging networks. Starting with pairs of +BM[2] and -BM[2] units, then pairs of +BM[4] and -BM[4] units, and so on, each stage produces sequences which are twice as long as its inputs until a bitonic sequence of length $n$ is reached. This bitonic sequence of length $n$ can then be merged by the $n$ input bitonic sorting network and transformed into a sorted sequence.
The final task of the fuzzy rule base is to map the selected $k$ nearest neighbour distances into fuzzy membership grades. The triangular fuzzy membership function used in Tang and Kak’s original work for the two nearest neighbour case is shown in Fig. 4. Straightforward implementations of the triangular fuzzy membership functions would require a large number of division operations and would be very expensive to realize on FPGAs. Fortunately, their experiments also found that the FC network is robust to the choice of fuzzy membership function. We have decided to use a linear approximation of the triangular fuzzy membership functions. In particular, depending on its ranking, each of the selected $k$ nearest neighbours is assigned a linear weight which approximates the triangular fuzzy membership function. For example, when $k = 4$, the fuzzy membership grades assigned to the four nearest neighbours are 24/32, 4/32, 2/32 and 2/32 respectively. The weights sum to 1 and do not increase as the rank increases. Although other sets of weights are possible, the above weights are easy to implement in FPGAs as the weights are equivalent to a series of right shift operations, as seen in the output neuron circuit (see Fig. 11).
3.3.2. Parameters of the Fuzzy Rule Base
The parameters of the fuzzy rule base are listed below.
Inputs to the Fuzzy Rule Base:
• a set of $m$ buses $h = (h_1, h_2, \ldots, h_m)$. Each bus is $2b+1$ bits wide with:
bits [0:b-1] - distance $d_i$ between $x$ and $w_i$;
bits [b:2b-1] - output weight $v_i$;
MSB - set if $d_i < r_i$;
• $m$, a constant which is the number of hidden neurons;
• $k$ $(k \ll m)$, a constant which defines the number of nearest neighbours.
Output from the Fuzzy Rule Base:
• a selected set of $k$ nearest neighbour buses $\mu = (\mu_1, \mu_2, \ldots, \mu_k)$ with the distance values $\mu[0:b-1]$ sorted in monotonically increasing order. The width and content of bus $\mu$ are the same as those of bus $h$.
The fuzzy rule base circuit is pipelined with $2b+1$ bit register banks after each stage of the bitonic sorting network. Hence the depth of the pipeline is the depth of the bitonic sorting network, which depends on $m$, the number of hidden neurons: $(\log^2 m + \log m)/2$.
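The two pipeline depth formulas quoted so far can be evaluated with a small helper. This is a sketch under the assumption that the logarithms are base 2 and that $m$ and $n$ are powers of two; the function names are hypothetical.

```python
def log2_exact(value: int) -> int:
    """Base-2 logarithm of a power of two (assumption: value is 2**s, s >= 0)."""
    return value.bit_length() - 1


def hidden_neuron_depth(n: int) -> int:
    """Pipeline depth of a hidden neuron with n inputs: 3 + log(n)."""
    return 3 + log2_exact(n)


def bitonic_depth(m: int) -> int:
    """Pipeline depth of an m-input bitonic sorter: (log^2 m + log m) / 2."""
    s = log2_exact(m)
    return (s * s + s) // 2
```

Under these assumptions an eight-input sorter is 6 stages deep and a 256-input sorter is 36 stages deep, so the latency grows slowly even as the hidden layer grows.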
3.4. Output Neuron Circuit
Buses belonging to the sorted $k$ nearest neighbours are connected to the output neuron circuit in increasing order according to their ranking in the previous stage. The fuzzy membership grades are hardware encoded with respect to their ranks, and their outputs $v_i$ (the middle $n$ bits of the bus) are weighted by these fuzzy grades as illustrated in Fig. 11. The contributions of all $k$ nearest neighbours are then summed by an adder.
Figure 11: Output neuron circuit.
The final output is selected to be either 1NN or kNN de-
pending on the MSB of the smallest neighbour of a test vec-
tor. If MSB is set, it means a hidden neuron has fired
previously in the hidden layer and hence the corresponding
output class of that hidden neuron is selected to be the final
output. If MSB is not set, the kNN output is valid, and the
final output is the kNN fuzzification output.
In our design, we have decided to fix $k = 4$ for convenience. Pipeline register banks of width $b$ are inserted after each of the parallel right-shift stages, and at each leaf of the addition tree. The pipeline depth in this case ($k = 4$) is 6. The final output is to be interpreted as a 16-bit signed fixed-point two’s complement number with 1 sign bit, 7 integer bits and 8 fraction bits.
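The shift-based weighting and the final 1NN/kNN selection can be modelled in software as follows. This sketch rests on several assumptions: the grades 24/32, 4/32, 2/32 and 2/32 are realised as a multiply-by-3 followed by a right shift of 2, a right shift of 3, and right shifts of 4; the 8-bit output weights are treated as non-negative and widened with 8 fraction bits; and the function names are hypothetical.

```python
def knn_output(v_sorted):
    """Weighted sum of the 4 nearest output weights (nearest first), 8 fraction bits."""
    v1, v2, v3, v4 = (v << 8 for v in v_sorted)   # append 8 zero fraction bits
    return ((3 * v1) >> 2) + (v2 >> 3) + (v3 >> 4) + (v4 >> 4)


def output_neuron(v_sorted, nearest_fired):
    """Select the 1NN output if the nearest neuron fired, else the kNN output."""
    if nearest_fired:
        return v_sorted[0] << 8
    return knn_output(v_sorted)
```

When all four neighbours share the same output weight the weighted sum reproduces that weight exactly, which is a quick check that the four grades sum to 1.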
4. Integrating Prescriptive Learning with the 
Design of FC Networks
The training for an FPGA based FC network is carried
out offline because the training of an FC network is not
computationally intensive. Offline training allows for compile-time folding of constants and hardwiring of constants into the circuit functionality. A JHDL package is written to fully parameterize the design for an FC network. The folding of the input weights $w$ and the radius of generalization $r$ into constant subtractors / comparators, and the hardwiring of a set of fuzzy membership grades into constant multiplications and divisions in the output neuron circuit, are carried out by the JHDL package at compile time. Given a set of training data and the required precision specified by the user, the JHDL package automatically defines the topology and weights for the FC network and designs the whole circuit for the appropriate FC network. Specifically, the JHDL package contains a main class and four circuit
generator classes for instantiating the hidden neuron circuit,
the activation function circuit, the bitonic sorting network
circuit and the output neuron circuit. The main class is re-
sponsible for instantiating the required circuit modules and
connecting them together according to the problem specifi-
cation, the training sample size and the number of those
training samples to be stored. Each of the circuit generator
classes is responsible for designing the corresponding sub-
modules according to specific information such as the sizes,
the constants and required stages of pipelines. As the final
output, the main class generates an EDIF configuration for
the fully instantiated circuit object as the solution for the re-
quired FC network.
5. Conclusions
We have shown that computational characteristics of FC
networks are highly parallel, simple and modular. These
characteristics are well suited for FPGA implementation
exploiting fine-grained parallelism. We have also shown
that the prescriptive learning scheme of FC networks can be
integrated with the design of the FC network so that the im-
plementation of an FC network can be fully parameterized.
Strategies to reduce the resource cost by using compile-time
constant folding techniques are also discussed. Given its
fast training speed, and the ease of mapping into FPGA ar-
chitectures, FC networks are a better alternative to other
neural network models which are based on iterative learn-
ing.
References
[1] Chakrabarti, S., S. Roy, and M.V. Soundalgekar, “Fast and Accurate Text Classification via Multiple Linear Discriminant Projections”, in Proceedings of International Conference on Very Large Data Bases. 2001, Morgan Kaufmann: Hong Kong. p. 658-669.
[2] Tang, K.W. and S. Kak, “Fast Classification Networks for Signal Processing”, Circuits Systems Signal Processing, 2002. 21(2): p. 207-224.
[3] Estlick, M., et al., “Algorithmic transformations in the implementation of K-means clustering on reconfigurable hardware”, in Proceedings of the Ninth ACM International Symposium on Field-Programmable Gate Arrays. 2001, ACM SIGDA: California. p. 103-110.
[4] Batcher, K.E., “Sorting Networks and Their Applications”, in Proceedings of American Federation of Information Processing Societies 1968 Spring Joint Computer Conference. 1968, Thomson Book Company, Washington D.C.: Atlantic City, NJ, USA. p. 307-314.
[5] Kumar, V., et al., “Bitonic Sorting”, in Introduction to Parallel Computing - Design and Analysis of Algorithms. 1994, The Benjamin/Cummings Publishing Company, Inc.: California. p. 214-224.
