A real-time clustering microchip neural engine by Serrano-Gotarredona, Teresa & Linares-Barranco, Bernabé
November 22, 1995 4:31 pm 1
A Real-Time Clustering Microchip Neural Engine
Teresa Serrano-Gotarredona and Bernabé Linares-Barranco
Centro Nacional de Microelectrónica (CNM), Dept. of Analog Design, Ed. CICA, Av. Reina Mercedes s/n,
41012 Sevilla, SPAIN, Phone: 34-5-4239923, Fax: 34-5-4624506,
E-mail: bernabe@cnm.us.es
Abstract
This paper presents an analog current-mode VLSI implementation of an unsupervised clustering
algorithm. The clustering algorithm is based on the popular ART1 algorithm [1], but has been modified
resulting in a more VLSI-friendly algorithm [2], [3] that allows a more efficient hardware
implementation with simple circuit operators, little memory requirements, modular chip assembly
capability, and higher speed figures. The chip described in this paper implements a network that can
cluster input patterns of 100 binary pixels into up to 18 different categories. Modular expansibility of the
system is directly possible by assembling an N×M array of chips without any extra interfacing circuitry,
so that the maximum number of clusters is 18×M and the maximum number of bits of the input pattern
is N×100. Pattern classification and learning is performed in 1.8 µs, which is an equivalent computing
power of $4.4\times10^{9}$ connections per second plus connection-updates per second. The chip has been
fabricated in a standard low cost 1.6 µm double-metal single-poly CMOS process, has a die area of 1 cm$^2$,
and is mounted in a 120-pin PGA package. Although internally the chip is analog in nature, it interfaces
to the outside world through digital signals, and thus has a true asynchronous digital behavior.
Experimental chip test results are available, obtained through digital chip test equipment. Fault
tolerance at the system level operation is demonstrated through the experimental testing of faulty chips.
  I. Introduction
Two types of neural hardware engineers can be distinguished. The first designs “general purpose”
hardware accelerators or systems that speed up neural algorithms running on conventional computers
[4]-[12]. This kind of hardware allows considerable flexibility in the topology and operations of the neural
systems. In this way algorithm researchers have a powerful tool to further develop neural algorithms and
industry engineers have some attractive chips that significantly speed up their neural commercial products.
The second type of hardware engineers are those who design a real-time system for a specific application.
They must select the best-suited algorithm and map it into hardware. This achieves a close-to-optimum
efficient hardware for a limited range of applications. The work described in this paper falls into this second
category of hardware engineering. The specific application is real-time clustering of binary input patterns.
A clustering device is a device able to build categories from a collection of patterns. A real-time
clustering device has to be able to do this at the speed of arrival of the patterns. There are some clustering
algorithms [13]-[18] that need to be trained off-line to build the categories. For a real-time clustering device,
however, it would be desirable to use an algorithm that can be trained on-line: if a new pattern arrives the
algorithm updates its internal knowledge (instead of erasing all the accumulated knowledge and retraining with
the old and new collection of patterns).
For the second type of neural hardware engineers, the issue of efficiently implementing in hardware a real
size neural network is not a trivial task. Many neural network algorithms are available in the literature which
have been developed, studied, and optimized for applications through computer and/or software based
systems. Consequently, when designing a hardware realization, engineers face many problems like excessive
Submitted to IEEE Trans. on VLSI Systems on December 9, 1994. Accepted on November 9, 1995.
interconnectivity, high resolution of weights, high precision of operations, complicated operator requirements
(e.g., integrals and derivatives), high number of neurons required for a real-world application, etc. Many
times some of these requirements can be relaxed, the topology modified, or the operations simplified, with no
significant deterioration of global operation of the neural system but with a considerable boost in the hardware
performance. Modifying neural algorithms to make them more VLSI-friendly and produce more efficient
hardware should be a common practice among neural hardware engineers of the second type [19]-[22]. After
selecting an appropriate neural algorithm the next step consists of studying how far the algorithm can be
simplified without performance degradation. The simplifications have to be hardware-oriented, so that the
final combination of “theoretical algorithm” + “hardware circuit technique” results in a high performance real
time system. The success of the hardware system depends on the selection of the algorithm, the selection of a
powerful circuit design technique, and how the algorithm is modified to efficiently “marry” the circuit
technique resulting in an optimum performance final system.
In the case of our application, real-time clustering of binary patterns, we chose the ART1 algorithm mainly
due to its attractive hardware-oriented properties, as well as the theoretical computational properties that will
be highlighted below. We also chose to slightly modify the mathematical ART1 algorithm to obtain more
efficient hardware. This modification (described in the next Section) allows the use of simpler operations
while preserving all the computational properties of the original ART1 architecture [2], [3]. As an extra
bonus, the hardware circuit introduces a significant speed improvement as it automatically parallelizes the
sequential ART search process [1] inherent in the mathematical neural algorithm.
The advantageous features of the ART1 algorithm are described next, as well as different possible
mathematical levels of description:
A. Computational Properties of the ART1 Algorithm:
From a purely algorithmic point of view, the ART1 architecture is capable of learning, in an unsupervised
way, recognition codes in response to arbitrary orderings of arbitrarily many and complex binary input
patterns. This architecture has a collection of interesting computational properties [1]:
• Self-Scaling: The self-scaling property discovers critical features in a context-sensitive way.
• Vigilance or Variable Coarseness: There is a vigilance parameter ($0 < \rho \le 1$) that allows tuning the
coarseness of the categories to learn.
• Subset and Superset Direct Access: The system is able to classify a new input pattern as belonging to either
a subset or a superset category, depending on global similarity criteria. No restrictions on input
orthogonality or linear predictability are needed.
• Stable Category Learning: In response to an arbitrary list (finite or infinite) of binary input patterns,
learning is assured to self-stabilize within a finite number of learning trials.
• Biasing the Network to form New Categories: There is a parameter that can bias the tendency of the system
to code unfamiliar patterns into new categories, independent of the vigilance parameter.
• On-Line Learning: The ART1 algorithm learns as it performs, as opposed to other algorithms, where first
the algorithm must be trained and second, it can be used in an application. The ART1 algorithm can
incorporate new knowledge as it is being used. This property makes ART1 an excellent candidate for
real-time clustering.
• Capturing Rare Events: ART1 is able to identify and build clusters of events that appear with a very low
frequency. Even if an event corresponding to a clearly distinct cluster appears only once, ART1 is able to
detect it while building and preserving the corresponding cluster or category.
B. Hardware-Oriented Attractive Properties of the ART1 Algorithm
In performance comparison of hardware implementations, a common figure of merit is the number of
interconnections per second. More refined figures have recently been proposed that include resolution and
precision [23]. However, these figures would be reasonably fair criteria for the first type of hardware
engineering mentioned above, the general-purpose one. In order to compare hardware systems of the second
type, the specific-application neural hardware, some global figure must be used that evaluates the overall
system performance. Usually this figure will be application dependent. In our case, since we are concerned
with a real-time clustering application of binary input patterns, an appropriate figure of merit might be
$$\text{ppc/s} = \frac{\text{number of patterns processed}}{\text{seconds}} \times \text{pixels} \times \text{categories} \qquad (1)$$
where,
• number of patterns processed/second is the speed at which patterns are classified and learned (including
the number of learning trials required). This speed generally depends on the patterns themselves, and on
the knowledge already stored in the system. Therefore, this speed can be given as an average or as the
slowest case measured.
• pixels is the maximum number of pixels of the input patterns.
• categories is the maximum number of categories the system is able to form.
As we will see later in the Section on experimental results, the chip described in this paper is able to cluster up
to 18 different categories of binary patterns with 100 pixels, while classifying and learning each pattern in less
than 1.8 µs. Since ART1 learns on-line, one iteration of input pattern presentations provides the system with
sufficient knowledge to perform properly$^1$. This results in a ppc/s of
$$\text{ppc/s} = \frac{n\ \text{patterns}}{1\ \text{iteration} \times n\ \text{patterns} \times 1.8\,\mu\text{s}} \times 100\ \text{pixels} \times 18\ \text{categories} = 1.0\times10^{9}\ \text{ppc/s} \qquad (2)$$
If we would like to obtain the same performance using Backpropagation based hardware, and assuming the
network would learn with 10,000 iterations of pattern presentations, this means that a speed of 180 ps would
be needed for each pattern classification and corresponding weights update. Assuming this task could be
performed with a Backpropagation network with 100 input neurons, 5 hidden-layer neurons, and 5 output
neurons$^2$ (which means a total of $100 \times 5 + 5 \times 5 = 525$ interconnections), and that the speed of feedforward
classification is the same as for feedback learning, hardware able to perform
$$\frac{2 \times 525\ \text{connections}}{180\,\text{ps}} = 5.83\times10^{12}\ \text{connections/s plus connection-updates/s} \qquad (3)$$
would be needed. For the chip described in this paper, since it is based on the powerful ART1 algorithm, the
above performance can be achieved with a hardware of only $4.4\times10^{9}$ connections/s plus
connection-updates/s, as discussed in Section IV.B.3.
1.  The input patterns set can be iterated several times to stabilize the internal weights, but this is not necessary for the
system to start working.
2.  Optimistically, a backpropagation net with 5 output nodes might be able to code up to $2^5$ categories.
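The arithmetic behind eqs. (1)-(3) can be retraced with a few lines of Python. This is only a minimal sketch: the function name `ppc_per_second` and its `iterations` argument are illustrative choices, not part of the paper, and the numerical inputs are the ones quoted in the text above.

```python
def ppc_per_second(seconds_per_pattern, pixels, categories, iterations=1):
    """Figure of merit of eq. (1): (patterns processed / second) x pixels x categories.

    `iterations` is the number of passes over the pattern set needed for
    learning; it divides the effective pattern rate (hypothetical helper).
    """
    patterns_per_second = 1.0 / (seconds_per_pattern * iterations)
    return patterns_per_second * pixels * categories


# Chip figures quoted in the text: 1.8 us per pattern, 100 pixels, 18 categories.
chip_ppcs = ppc_per_second(1.8e-6, pixels=100, categories=18)

# Backpropagation comparison under the assumptions stated in the text.
bp_iterations = 10_000
bp_time_per_pattern = 1.8e-6 / bp_iterations      # time budget per presentation
bp_connections = 100 * 5 + 5 * 5                  # 100-5-5 network
bp_rate = 2 * bp_connections / bp_time_per_pattern  # forward pass plus update

print(f"chip figure of merit : {chip_ppcs:.2e} ppc/s")
print(f"time per presentation: {bp_time_per_pattern * 1e12:.0f} ps")
print(f"required BP hardware : {bp_rate:.2e} connections/s plus updates/s")
```

The factor 2 in `bp_rate` reflects the assumption, made in the text, that the feedforward pass and the weight update take the same time.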
Note that the Backpropagation algorithm is not appropriate for clustering applications, and comparing it
against ART1 is slightly unfair. There are other algorithms available in the literature that have been developed
especially for clustering applications [13]-[18]. However, they usually do not provide all the computational
properties mentioned in subsection A, especially the “On-Line Learning” property which is crucial
for real-time clustering, or they present serious difficulties when mapped into hardware.
Another hardware attractive feature that an ART1 based implementation offers with respect to others, is
that the interconnection weights do not have to be analog, as shown in the next Section. Most of the neural
algorithms reported in the literature require a real-valued set of weights defined within a certain interval.
These weights can be discretized in a number of digital steps, but the granularity required for proper operation
of the system is usually very fine (around 16-bits for the Back-Propagation algorithm [24]). Even worse, in
some cases the granularity requirements become more severe as the size of the system increases. For example,
in a BAM system [25] of $N \times M$ neurons, storage capacity has been heuristically estimated to be around
$n_p = (N \times M)^{1/4}$ [26], where $n_p$ is the average maximum number of patterns that can be stored. The
resolution required by the interconnection weights in this case is at least $n_p + 1$. In the chip described in this
paper, since it is based on the ART1 algorithm and requires only binary-valued weights, the resolution of the
weights is affected neither by the size nor by the storage capacity of the system. This, together with the fact that
analog weights are not needed, is one of the most hardware attractive features of the ART1 algorithm.
Another consideration to take into account during the design of a hardware system is how it scales up with
size and performance. We have already mentioned that some neural systems need to increase their weight
resolution as they scale up. Another feature is how their size and interconnectivity scale up with pattern size or
storage capacity. For an ART1 based system, the number of neurons N in the bottom layer is the number of
pixels of the patterns, the number of neurons M in the top layer is the maximum number of categories, and
$N \times M$ is the number of synapses. This system scales up linearly with storage capacity (M) and input pixels
(N). For a BAM system, for example, the size scales quadratically with the storage capacity and the number of
pixels.
Section V will present other scaling considerations, more directly related to the hardware technique
selected. In the case of an analog hardware, random and systematic errors due to fabrication process variations
will appear. A neural network can usually cope very well with random errors, even if the size of the system
increases. However, systematic errors may accumulate as the system increases and may render the complete
network useless as it scales up. The chosen circuit technique must be either insensitive to the accumulation of
systematic errors, or allow for some kind of calibration technique to overcome them.
C. Description Levels of the ART1 Algorithm:
In the original ART1 paper [1] the architecture is mathematically described by sets of Short Term Memory
(STM) and of Long Term Memory (LTM) time domain nonlinear differential equations. A valid assumption,
also made by Carpenter and Grossberg, is to make the STM differential equations settle instantaneously to
their corresponding steady state, and consider only the dynamics of the LTM differential equations. In this
case, the STM differential equations must be substituted by nonlinear algebraic equations that describe the
corresponding steady state of the system. Furthermore, Carpenter and Grossberg also introduced the fast
learning mode of the ART1 architecture, in which the LTM differential equations are also substituted by their
corresponding steady-state nonlinear algebraic equations. Thus the ART1 architecture originally modelled as
a dynamically evolving collection of neurons and synapses governed by time-domain nonlinear differential
equations, can be behaviorally modelled as the sequential application of nonlinear algebraic equations: an
input pattern is given, the corresponding STM steady state is computed through the STM algebraic equations,
and the system weights are updated using the corresponding LTM algebraic equations.
At this point three different levels of ART1 implementations (in either software or hardware) can be
distinguished:
Type-1: Full Model Implementation: Both STM and LTM time-domain differential equations are realized.
This implementation is the most expensive (both in software and in hardware), and requires a large
amount of computational power.
Type-2: STM Steady-State Implementation: Only the LTM time-domain differential equations are
implemented. The STM behavior is governed by nonlinear algebraic equations. This implementation
requires fewer resources than the previous one. However, proper sequencing of STM events must be
introduced artificially, which is architecturally implicit in the Type-1 implementation.
Type-3: Fast Learning Implementation: This implementation is computationally the least expensive. In this
case, STM and LTM events must be artificially sequenced.
Regarding hardware implementations of the ART1 architecture, several attempts have been reported in the
literature. Ho et al. suggested a Type-1 implementation [27]. Tsay and Newcomb proposed a CMOS circuit
technique that would realize a partial Type-2 implementation [28]; Wunsch et al. [29] have built optical-based
Type-3 implementations; this paper presents a CMOS VLSI Type-3 circuit.
The next Section explains how we slightly modified the ART1 algorithm to make it more VLSI-friendly.
Section III describes the circuit implementation of the so-modified ART1 algorithm using analog
current-mode circuit design techniques. Experimental results of an actual prototype chip are given in Section
IV. Section V highlights some potential improvements that would help to make this chip an industry-ready
commercial chip, and finally, some conclusions are made in Section VI.
  II. A VLSI-friendly ART1 Algorithm
Let us start by describing the Type-3 model of the original ART1 architecture. The ART1 topology is shown
in Fig. 1, and consists of two layers: layer F1 is the input layer and has N nodes (one for each binary “pixel”
of the input pattern), and layer F2 is the category layer. Each node in the F2 layer represents a “cluster” or
“category”. In this layer only one node will become active after presentation of an input pattern
$I \equiv (I_1, I_2, \ldots, I_N)$. The F2 layer category that will become active is that which most closely represents the
input pattern I. If no preexisting category is satisfactory for a given input pattern, a new category will be
formed. Each F1 node $x_i$ is connected to all F2 nodes $y_j$ through bottom-up connections of weights$^3$ $z_{ij}^{bu}$, so
that the input received by each F2 node $y_j$ is given by
$$T_j = \sum_{i=1}^{N} z_{ij}^{bu} I_i \qquad (4)$$
 Fig. 1: Simplified block diagram of the architecture of a Type-3 ART1 system
Layer F2 acts as a Winner-Take-All network, so that all nodes $y_j$ remain inactive, except that which receives
the largest bottom-up input $T_j$,
$$y_j = \begin{cases} 1 & \text{if } T_j = \max_k \{T_k\} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
Once an F2 winning node arises, a top-down pattern is activated through the top-down weights$^4$ $z_{ji}^{td}$. Let us
call this top-down pattern $X \equiv (X_1, X_2, \ldots, X_N)$. The resulting vector X is given by the equation,
$$X_i = I_i \sum_{j} z_{ji}^{td} y_j \qquad (6)$$
Since only one $y_j$ is active, let us call this winning F2 node $y_J$, so that $y_j = 0$ if $j \neq J$ and $y_J = 1$. In this case we
can state
$$X_i = I_i z_{Ji}^{td} \quad \text{or} \quad X = I \cap z_J^{td} \qquad (7)$$
where $z_J^{td} \equiv (z_{J1}^{td}, z_{J2}^{td}, \ldots, z_{JN}^{td})$. This top-down template will be compared with the original input pattern I
according to a predetermined vigilance criterion, tuned by the vigilance parameter $0 < \rho \le 1$, so that two
alternatives may occur:
a) If$^5$ $\rho |I| \le |I \cap z_J^{td}|$ the active category J is accepted and the system weights will be updated to
incorporate this new knowledge.
b) If $\rho |I| > |I \cap z_J^{td}|$ the active category J is not valid for the actual value of the vigilance parameter ρ.
In this case $y_J$ will be deactivated (reset) making $T_J = 0$, so that another $y_j$ node will become active through
the Winner-Take-All action of the F2 layer.
3.  Bottom-up weights $z_{ij}^{bu}$ may take any real value in the interval [0,K], where $K = L / (L - 1 + N)$, and $L > 1$ [1].
4.  In the Fast Learning (Type-3) model top-down weights $z_{ji}^{td}$ may take only the values ‘0’ or ‘1’.
5.  The notation $|a|$ represents the cardinality of vector a, i.e., $|a| = \sum_{i=1}^{N} a_i$.
Learning takes place when an active F2 node is accepted by the vigilance criterion. The weights will be
updated according to the following algebraic equations,
$$\begin{aligned} z_{iJ}^{bu}(new) &= \frac{L}{L - 1 + |z_J^{td}(old) \cap I|} X_i = \frac{L}{L - 1 + |z_J^{td}(old) \cap I|} I_i z_{Ji}^{td}(old) \\ z_{Ji}^{td}(new) &= X_i = I_i z_{Ji}^{td}(old) \end{aligned} \qquad (8)$$
or using vector notation
$$\begin{aligned} z_J^{bu}(new) &= \frac{L \, (I \cap z_J^{td}(old))}{L - 1 + |I \cap z_J^{td}(old)|} \\ z_J^{td}(new) &= I \cap z_J^{td}(old) \end{aligned} \qquad (9)$$
where parameter L has to be larger than ‘1’ [1]. Note that only the weights of the connections touching the F2
winning node $y_J$ are updated. Therefore, operation of the Type-3 (or Fast Learning) implementation of the
ART1 architecture is described by the algorithm depicted in Fig. 2(a).
From a hardware implementation point of view, one of the first issues that comes into consideration is that
there are two templates of weights to be built. The set of bottom-up weights $z_{ij}^{bu}$, each of which must store a
real value belonging to the interval [0,K], and the set of top-down weights $z_{ji}^{td}$, each of which stores either the
value ‘0’ or ‘1’. The physical implementation of the bottom-up template memory presents the first hardware
difficulty, because their weights need either an analog or a digital memory with sufficient bits per weight so
that the digital discretization does not affect the system performance. However, looking at eqs. (8) it can be
seen that the bottom-up set $\{z_{ij}^{bu}\}$ and the top-down set $\{z_{ji}^{td}\}$ contain the same information: each of these
sets can be fully computed by knowing the other set. It can be seen that the bottom-up set $z_{ij}^{bu}$ is a normalized
version of the top-down set $z_{ji}^{td}$. Therefore, from a hardware implementation point of view it would be
desirable to physically implement only a binary valued set (one bit per weight) and let the hardware do the
normalization of the bottom-up weights during the computation of $\{T_j\}$. This way, the two sets $\{z_{ij}^{bu}\}$ and
$\{z_{ji}^{td}\}$ can be substituted by a single binary valued set $\{z_{ij}\}$, and eq. (4) modified to take into account the
normalization effect of the original bottom-up weights$^6$,
$$T_j = \frac{L \, T_{Aj}}{L - 1 + T_{Bj}} = \frac{L \sum_{i=1}^{N} z_{ij} I_i}{L - 1 + \sum_{i=1}^{N} z_{ij}} = \frac{L \, |z_j \cap I|}{L - 1 + |z_j|} \qquad (10)$$
6.  Note that we are using the notation $z_j$ to represent the vector $(z_{1j}, z_{2j}, \ldots, z_{Nj})$.
Considering this minor “implementation” modification the algorithm of Fig. 2(a) would be transformed into
that depicted in Fig. 2(b). The system level performance of the algorithms described by Fig. 2(a) and Fig. 2(b)
is identical. There is no difference in the behavior between the two diagrams, and the one in Fig. 2(b) offers
more attractive features from a hardware (as well as software) implementation point of view.
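The equivalence between the two descriptions can be checked numerically. The sketch below is an illustration only (the helper names are hypothetical): it derives the bottom-up weights from a binary template through the fast-learning normalization $z^{bu} = L z / (L - 1 + |z|)$, evaluates eq. (4) with them, and compares the result with eq. (10) computed directly from the binary template. Exact rational arithmetic is used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction


def bottom_up_from_template(z, L):
    """Normalized bottom-up weights implied by a binary template z."""
    norm = L - 1 + sum(z)
    return [Fraction(L * zi) / norm for zi in z]


def t_two_templates(I, z, L):
    """Eq. (4): weighted sum over stored real-valued bottom-up weights."""
    z_bu = bottom_up_from_template(z, L)
    return sum(w * i for w, i in zip(z_bu, I))


def t_single_template(I, z, L):
    """Eq. (10): T_j = L*T_A / (L - 1 + T_B), using only the binary template."""
    t_a = sum(zi * ii for zi, ii in zip(z, I))  # |z_j ∩ I|
    t_b = sum(z)                                # |z_j|
    return Fraction(L * t_a) / (L - 1 + t_b)


I = [1, 0, 1, 1, 0, 1]   # example input pattern (arbitrary)
z = [1, 1, 1, 0, 0, 1]   # example binary template (arbitrary)
L = 2                    # any L > 1

assert t_two_templates(I, z, L) == t_single_template(I, z, L)
print(t_single_template(I, z, L))  # 6/5
```

The only operation that remains costly in the single-template form is the division, which motivates the next step of the simplification.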
However, in Fig. 2(b) an extra division operation, $T_j = (L T_{Aj}) / (L - 1 + T_{Bj})$, needs to be performed for
each node in the F2 layer. This is an expensive hardware operation and would probably constitute a
performance bottleneck in the overall system for both analog and digital circuit implementations. If possible,
it would be very desirable to avoid this division operation. In [2] and [3] we show that this division operation
can be substituted by a subtraction operation, while preserving all the computational properties of the
original ART1 algorithm. For some sequence of patterns a different behavior can be observed with respect to
the original ART1, but the overall clustering behavior is still equivalent. Mathematically, the input to the F2
layer is now,
$$T_j = L_A T_{Aj} - L_B T_{Bj} + L_M \qquad (11)$$
where $L_A$ and $L_B$ are positive parameters that play the role of the original L (and L−1) parameter. The
condition $L_A > L_B$ must be imposed for proper system operation [2], [3]. $L_M > 0$ is a constant parameter
needed to assure that $T_j \ge 0$, for all possible values of $T_{Aj}$ and $T_{Bj}$.
 Fig. 2: Type-3 implementation algorithms of the ART1 architecture: (a) original ART1, (b)
ART1 with a single binary valued weights template, (c) modified VLSI-friendly ART1
[Flowcharts. All three panels follow the same sequence: initialize weights; read input pattern $I = (I_1, I_2, \ldots, I_N)$;
compute $T_j$; Winner-Take-All ($y_J = 1$ if $T_J = \max_j \{T_j\}$, $y_j = 0$ if $j \neq J$); test $\rho |I| > |I \cap z_J|$: if YES,
set $T_J = 0$ and repeat the Winner-Take-All; if NO, update weights. Panel (a): initial weights $z_{ji}^{td} = 1$ and
$z_{ij}^{bu} = L / (L - 1 + N)$, $T_j$ from eq. (4), update by eq. (9). Panel (b): initial weights $z_{ji} = 1$, $T_j$ from eq. (10),
update $z_J(new) = I \cap z_J(old)$. Panel (c): as panel (b), with $T_j$ from eq. (11).]
Replacing a division operation with a subtraction is a very important hardware simplification with
significant performance improvement potential. Fig. 2(c) shows the final VLSI-friendly Type-3 ART1
algorithm, which has been mapped into hardware, as described in the next Section.
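As an illustration of the algorithm of Fig. 2(c), the following Python sketch presents binary patterns one at a time to a small network. It is a software model under stated assumptions, not the chip's implementation: the function name, the default values of `LA` and `LB`, the choice `LM = LB * N` (the smallest offset that keeps every $T_j \ge 0$), and the tie-breaking rule (lowest index wins) are all choices made here for the example. A reset node is removed from the competition, which plays the role of setting $T_J = 0$.

```python
def present_pattern(I, z, rho, LA=2.0, LB=1.0, LM=None):
    """Present one binary pattern and learn it.

    I   : list of 0/1 pixels
    z   : list of binary templates, one per F2 node, updated in place
    rho : vigilance parameter, 0 < rho <= 1
    Returns the index J of the accepted category, or None if all nodes are reset.
    """
    n = len(I)
    if LM is None:
        LM = LB * n  # assumption: smallest offset that keeps every T_j >= 0

    def overlap(zj):
        return sum(a & b for a, b in zip(I, zj))  # |I ∩ z_j|

    # Subtraction-based choice function: T_j = LA*T_Aj - LB*T_Bj + LM
    T = [LA * overlap(zj) - LB * sum(zj) + LM for zj in z]

    candidates = list(range(len(z)))  # nodes that have not been reset
    while candidates:
        J = max(candidates, key=lambda j: T[j])  # WTA, lowest index wins ties
        if rho * sum(I) > overlap(z[J]):
            candidates.remove(J)  # vigilance test fails: reset J, search again
            continue
        z[J] = [a & b for a, b in zip(I, z[J])]  # z_J(new) = I ∩ z_J(old)
        return J
    return None


templates = [[1] * 6 for _ in range(3)]  # all weights initialized to 1
patterns = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
]
winners = [present_pattern(p, templates, rho=0.7) for p in patterns]
print(winners)  # [0, 1, 0, 2]
```

In this run the third pattern shrinks the first template to two pixels, so the fourth pattern no longer passes the vigilance test on that node and is coded into a new category.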
  III. Circuit Description
The operations in Fig. 2(c) that need to be implemented are the following:
• Generation of the terms $T_j$. Since $z_{ij}$ and $I_i$ are binary valued (0 or 1), “binary multiplication” and
addition/subtraction operations are required.
• Winner-Take-All (WTA) operation to select the maximum $T_j$ term.
• Comparison of the term $\rho |I|$ with $|I \cap z_J|$.
• Deselection of the term $T_J$ if $\rho |I| > |I \cap z_J|$.
• Update of weights.
The first three operations require a certain amount of precision, while the last two operations do not.
We intended to obtain a precision between 1 and 2% (equivalent to 6-bits) for our circuit, while handling input
patterns of up to 100 binary pixels. Fig. 3 shows a possible hardware block diagram that would physically
implement the algorithm of Fig. 2(c). The circuit consists of an 18×100 array of synapses $S_{1,1}, S_{1,2}, \ldots, S_{18,100}$,
a 1×100 array of controlled current sources $C_1, C_2, \ldots, C_{100}$, two 1×18 arrays of unity-gain current mirrors
$CMA_1, \ldots, CMA_{18}$ and $CMB_1, \ldots, CMB_{18}$, a 1×18 array of current comparators $CC_1, \ldots, CC_{18}$, an 18-input
WTA circuit, two 18-output unity-gain current mirrors CMM and CMC, and an adjustable-gain ($0 < \rho \le 1$)
current mirror. Registers $R_1, \ldots, R_{18}$ and the NOR gate are optional, and their function is explained later.
 Fig. 3: Hardware Block Diagram for the modified VLSI-friendly ART1 algorithm
Each synapse receives two input signals $y_j$ and $I_i$, has two global control signals RESET and LEARN,
stores the value of $z_{ij}$, and generates two output currents:
• the first goes to the input of current mirror CMAj and is $L_A z_{ij} I_i - L_B z_{ij}$.
• the second goes to the input of current mirror CMBj and is $L_A z_{ij} I_i$.
All synapses in the same row j ($S_{j,1}, S_{j,2}, \ldots, S_{j,100}$) share the two nodes ($N_j$ and $N_j'$) into which the currents
they generate are injected. Therefore, the input of current mirror CMAj receives the current
$$T_j = L_A \sum_{i=1}^{100} z_{ij} I_i - L_B \sum_{i=1}^{100} z_{ij} + L_M = L_A |I \cap z_j| - L_B |z_j| + L_M \qquad (12)$$
while the input of current mirror CMBj receives the current
$$L_A \sum_{i=1}^{100} z_{ij} I_i = L_A |I \cap z_j| \qquad (13)$$
Current $L_M$, which is replicated 18 times by current mirror CMM has an arbitrary value as long as it assures
that the terms $T_j$ are positive.
Each element of the array of controlled current sources $C_i$ has one input signal $I_i$ and generates the
current $L_A I_i$. All elements $C_i$ share their output node, so that the total current they generate is $L_A |I|$. This
current reaches the input of the adjustable gain $\rho$ current mirror, and is later replicated 18 times by current
mirror CMC.
Each of the 18 current comparators CCj receives the current $L_A |I \cap z_j| - L_A \rho |I|$ and compares it against
zero. If this current is positive, the output of the current comparator falls, but if the current is negative the
output rises. Each current comparator CCj output controls input $c_j$ of the WTA. If $c_j$ is high the current sunk
by the WTA input $i_j$ (which is $T_j$) will not compete for the winning node. On the contrary, if $c_j$ is low, input
current $T_j$ will enter the WTA competition. The outputs of the WTA $y_j$ are all high, except for that which
receives the largest $c_j T_j$: such output, denominated $y_J$, will fall.
Now we can describe the operation of the circuit in Fig. 3. All synaptic memory values $z_{ij}$ are initially set
to ‘1’ by the RESET signal. Once the input vector I is activated, the 18 rows of synapses generate the currents
$L_A |I \cap z_j| - L_B |z_j|$ and $L_A |I \cap z_j|$, and the row of controlled current sources $C_1, \ldots, C_{100}$ generates the
current $L_A |I|$. Each current comparator CCj will prevent current $T_j = L_A |I \cap z_j| - L_B |z_j| + L_M$ from
competing in the WTA if $\rho |I| > |I \cap z_j|$. Therefore, the effective WTA inputs are $\{c_j T_j\}$, from which the
WTA chooses the maximum, making the corresponding output $y_J$ fall. Once $y_J$ falls, and assuming the
synaptic control signal LEARN is low, all $z_{iJ}$ values will change from ‘1’ to ‘$I_i$’.
Note that initially (when all $z_{ij} = 1$),
$$c_j T_j = L_A |I| - L_B N + L_M, \quad (N = 100) \quad \forall j \qquad (14)$$
This means that the winner will be chosen among 18 equal competing inputs, basing the election on
mismatches due to random process parameter variations of the transistors. Even after some categories are
learned, there will be a number of uncommitted rows ($z_{1j} = \ldots = z_{100,j} = 1$) that generate the same
competing current of eq. (14). The operation of a WTA circuit in which there are more than 1 equal and
winning inputs becomes more difficult and in the best case, renders slower operation. To avoid these problems
18 D-registers, $R_1, \ldots, R_{18}$, might be added. Initially these registers are set to ‘1’ so that the WTA inputs
 are high. Inputs  have the same effect as inputs : if  is high  does not
compete for the winner, but if  is low  enters the WTA competition. Therefore, initially only
competes for the winner. As soon as  rises once, the input of register R1 (which is ‘0’) is transmitted to its
output making . Now both  and  will compete for the winner. As soon as  wins once, the
input of register R2 is transmitted to its output making . Now , , and  will compete, and
so on. If all available F2 nodes ( ) have won once, the “FULL” signal rises, advising that all F2
nodes are storing a category. The WTA control signal “ER” enables operation of the registers.
A. Synaptic Circuit and Controlled Current Sources:
The details of a synapse $S_{ij}$ are shown in Fig. 4(a). It consists of three current sources (two of value
$L_A$ and one of value $L_B$), a two-inverter loop (acting as a flip-flop), and nine MOS transistors working as
switches. As can be seen in Fig. 4(a), each synapse generates the currents $L_A z_{ij} I_i - L_B z_{ij}$ and $L_A z_{ij} I_i$. The
RESET control signal sets $z_{ij}$ to ‘1’. Learning is performed by making $z_{ij}$ change from ‘1’ to ‘0’ whenever
$LEARN = 0$, $y_j = 0$, and $I_i = 0$.
Fig. 4(b) shows the details of each controlled current source $C_i$. If $I_i = 0$ no current is generated, while if
$I_i = 1$, the current $L_A$ is provided.
B. Winner-Take-All (WTA) Circuit:
Fig. 5 shows the details of the WTA circuit. It is based on Lazzaro’s WTA [30], which consists of the
array of transistors MA and MB, and the current source $I_{BIAS}$. Transistor MC has been added to introduce a
cascode effect and increase the gain of each cell. Transistors MX, MY, and MZ transform the output current
into a voltage, which is then inverted to generate $y_j$. Transistor MT disables the cell if $c_j$ is high, so that the
input current $T_j$ will not compete for the winner. Transistors MS and ME have the same effect as transistor
MT: if signals ER and $s_j$ are high, $T_j$ will not compete.
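The two synapse currents and the winner selection can be sketched behaviorally as follows. This is a minimal model, not the circuit: the function names are hypothetical, the WTA is idealized as an argmax over the cells that are not disabled, the 4-bit patterns are illustrative, and the current values are those quoted for Fig. 12 ($L_A = 3.2\mu A$, $L_B = 3.0\mu A$, $L_M = 400\mu A$).

```python
def synapse_currents(z_ij, I_i, LA, LB):
    """Two output currents of one synapse for binary z_ij and I_i (amperes)."""
    to_T = LA * z_ij * I_i - LB * z_ij   # contributes to T_j
    to_V = LA * z_ij * I_i               # contributes to the vigilance term
    return to_T, to_V

def category_current(z_j, I, LA, LB, LM):
    """T_j = sum_i (LA*z_ij*I_i - LB*z_ij) + LM."""
    return LM + sum(synapse_currents(z, x, LA, LB)[0] for z, x in zip(z_j, I))

def ideal_wta(T, enabled):
    """Index of the largest enabled current; None if no cell is enabled."""
    candidates = [j for j, on in enumerate(enabled) if on]
    return max(candidates, key=lambda j: T[j]) if candidates else None

LA, LB, LM = 3.2e-6, 3.0e-6, 400e-6
I = [1, 1, 0, 0]
z = [[1, 1, 1, 1],    # uncommitted node: all weights at '1'
     [1, 1, 0, 0],    # stored pattern equal to I
     [0, 0, 1, 1]]    # stored pattern disjoint from I
T = [category_current(zj, I, LA, LB, LM) for zj in z]
winner = ideal_wta(T, enabled=[True, True, True])
```

The node whose stored pattern equals the input receives the largest current because it pays the smallest $L_B |z_j|$ penalty for the same overlap.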
 Fig. 4: (a) Details of Synapse Circuit $S_{ij}$, (b) Details of Controlled Current Source Circuit $C_i$
 Fig. 5: Circuit Schematic of Winner-Take-All (WTA) Circuit
C. Current Comparators:
The circuit used for the current comparators is shown in Fig. 6(a). Such a comparator forces an input
voltage approximately equal to the inverter's trip voltage, has extremely high resolution (less than 1pA), and
can be extremely fast (on the order of 10-20ns for inputs around 10µA) [31].
D. Current Mirrors:
Current mirrors $CMA_1, \ldots, CMA_{18}$, $CMB_1, \ldots, CMB_{18}$, $CMM$, $CMC$, and the ρ-gain mirror have
been laid out using common centroid layout techniques to minimize matching errors and keep the 6-bit
precision of the overall system. For current mirrors $CMA_1, \ldots, CMA_{18}$ and $CMB_1, \ldots, CMB_{18}$ a special
topology has been used, shown in Fig. 6(b) [32]. This topology forces a constant voltage $V_D$ at its input node,
thus producing a virtual ground at the output nodes of all synapses, which reduces channel length modulation
distortion and improves matching between the currents generated by all synapses. In addition, the topology of
Fig. 6(b) presents a very wide current range with small matching errors [32].
The adjustable gain ρ current mirror also uses this topology, as shown in Fig. 6(c). Transistor M0 has a
geometry factor ($W/L$) 10 times larger than transistors $M_1, \ldots, M_{10}$. Transistors $MR_1, \ldots, MR_{10}$ act as
switches (controlled by signals $r_1, \ldots, r_{10}$), so that the gain of the current mirror can be adjusted from
$\rho = 0.0$ to $\rho = 1.0$ in steps of 0.1, while maintaining $r_0 = 0$. By making $r_0$ higher than 0 Volts, ρ can be
fine tuned.
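The coarse gain setting can be illustrated with a short sketch. It assumes that each of the ten output branches contributes exactly one tenth of the input device's geometry factor when its switch conducts; the function names are hypothetical, and neither the polarity of the $r_k$ signals nor the fine tuning through $r_0$ is modeled.

```python
def rho_from_switches(branch_on):
    """Mirror gain for ten unit output branches against an input device 10x wider."""
    if len(branch_on) != 10:
        raise ValueError("expected the state of the 10 branches M1..M10")
    return sum(bool(b) for b in branch_on) / 10

def switches_for_rho(rho):
    """Number of branches to enable for a requested rho (multiple of 0.1 in [0, 1])."""
    n = round(rho * 10)
    if not 0 <= n <= 10 or abs(n / 10 - rho) > 1e-9:
        raise ValueError("rho must be a multiple of 0.1 between 0.0 and 1.0")
    return n

rho = rho_from_switches([1] * 7 + [0] * 3)   # seven conducting branches
```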
 Fig. 6: (a) Circuit Schematic of Current Comparator. (b) Circuit Schematic of Active-Input
Regulated-Cascode Current Mirror. (c) Circuit Schematic for Adjustable Gain ρ Current Mirror
 Fig. 7: Cascade of Current Mirrors for low Mismatching
E. Synaptic Current Sources:
The current sources $L_A$ and $L_B$ inside each synapse $S_{ij}$ and controlled current sources $C_i$ have to match
within approximately 1% to keep the system 6-bit precision. There is a total of
$100 \times 18 \times 2 + 100 = 3700$ $L_A$ current sources and $100 \times 18 = 1800$ $L_B$ current sources spread over a die area of 1cm², which have to
match within 1%. For such distances, number of current sources, and reasonable current values, a spread of
10% in the currents would be an optimistic estimate. However, a single current mirror, with a reduced number
of outputs (like 10), a reasonable transistor size (like $40\mu m \times 40\mu m$), a moderate current (around 10µA), and
using common centroid layout techniques can be expected to have a mismatch error standard deviation $\sigma_q$ of
less than 1% [33]. By cascading several of these current mirrors in a tree-like fashion as is shown in Fig. 7 (for
current sources $L_B$), a high number of current sources (copied from a single common reference) can be
generated with a mismatch equal to
$$\sigma_{Total} = \sigma_1 + \sigma_2 + \ldots + \sigma_q \qquad (15)$$
Each current mirror stage introduces an error $\sigma_k$. This error can be reduced by increasing the transistor areas
of the current mirrors. Since the last stage q has a higher number of current mirrors, it is important to keep
their area low. For previous stages the transistors can be made larger to contribute a smaller $\sigma_k$, because
they are fewer in number and will not contribute significantly to the total transistor area. For current sources $L_A$,
a circuit similar to that shown in Fig. 7 is used. Current $L_B$ in Fig. 7 (and similarly current $L_A$) is injected
externally into the chip so that parameter $\alpha = L_A/L_B$ can be controlled.
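The source counts and the tree of Fig. 7 can be checked with a few lines of arithmetic. This is a sketch under assumptions: the stage fan-outs of 10, 10, 6, and 3 are one reading of the labels in Fig. 7, the per-stage errors passed to `total_mismatch` are hypothetical, and eq. (15) is applied as printed (a linear sum).

```python
N_INPUTS, N_CATEGORIES = 100, 18

n_la = N_INPUTS * N_CATEGORIES * 2 + N_INPUTS   # two L_A per synapse plus one per C_i
n_lb = N_INPUTS * N_CATEGORIES                  # one L_B per synapse

def tree_outputs(fanouts):
    """Number of copies available after each stage of a mirror tree."""
    counts, n = [], 1
    for f in fanouts:
        n *= f
        counts.append(n)
    return counts

def total_mismatch(stage_sigmas):
    """Eq. (15) as printed: the stage errors are added linearly."""
    return sum(stage_sigmas)

lb_tree = tree_outputs([10, 10, 6, 3])          # assumed fan-outs, see lead-in
example_sigma = total_mismatch([0.1, 0.2, 0.3, 0.4])   # hypothetical values in %
```

With these fan-outs the last stage delivers the 1800 copies of $L_B$ that the synaptic array needs.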
F. Weights Read Out:
The switches $sw_1$ to $sw_{100}$ of Fig. 7 were added to enable reading out the internally learned synaptic
weights $z_{ij}$, and to test the progress of the learning algorithm. These switches are all ON during normal
operation of the system. However, for weights read-out, all except one will be OFF. The switch that is ON is
selected by a decoder inside the chip, so that only column i of the synaptic array of Fig. 3 injects the current
$z_{ij} L_B$ to nodes $N_j$. All nodes $N_j$ can be isolated from current mirrors $CMA_j$, and connected to output pads to
sense the currents $z_{ij} L_B$, thus measuring the values of $z_{ij}$.
G. Modular System Expansibility:
The circuit of Fig. 3 can be expanded both horizontally, increasing the number of input pattern bits from 100
to $100 \times N$, and vertically, increasing the number of possible categories from 18 to $18 \times M$. Fig. 8 shows
schematically the interconnectivity between chips in the case of a $2 \times 2$ array.
 Fig. 8: Interchip Connectivity for Modular System Expansion
Vertical expansion of the system is possible by making several chips share the input vector terminals
$I_1, \ldots, I_{100}$, and node $V_{COMMON}$ of the WTA (see Fig. 5). Thus, the only requirement is that $V_{COMMON}$ be
externally accessible. Horizontal expansion is directly possible by making all chips in the same row share
their $N_j$, $N'_j$, and $N''$ nodes, and isolating all except one of them from the current mirrors
$CMA_1, \ldots, CMA_{18}$, $CMB_1, \ldots, CMB_{18}$, and the adjustable gain ρ-mirror. Also, all synapse inputs $y_j$ must
be shared.
Both vertical and horizontal expansion degrade the system performance. Vertical expansion causes
degradation because the WTA becomes distributed among several chips. For the WTA of Fig. 5, all MA and
MB transistors must match well, which is very unlikely if they are in different chips. A solution for this
problem is to use a WTA topology based on current processing and replication, insensitive to inter-chip
transistor mismatches [34], [35].
Horizontal expansion degrades the performance because current levels have to be changed:
• Either currents $L_A$ and $L_B$ are maintained the same, which makes the current mirrors $CMA_j$, $CMB_j$,
$CMM$, $1{:}\rho$, $CMC$, the current comparators $CC_j$, and the WTA handle higher currents. This may
cause malfunctioning due to possible saturation in some of the blocks.
• Or currents $L_A$ and $L_B$ are scaled down so that the current mirrors $CMA_j$, $CMB_j$, $CMM$, $1{:}\rho$, $CMC$,
the current comparators $CC_j$, and the WTA handle the same current level. However, this produces an
increase in mismatch between the current sources $L_A$ and $L_B$.
  IV. Experimental Results
A prototype chip implementing the real-time clustering engine described in the previous Section has been
fabricated in a standard single-poly double-metal 1.6µm CMOS digital process (Eurochip ES2). The die area
is 1cm² and it has been mounted in a 120-pin PGA package. This chip implements an ART1 system with 100
nodes in the F1 layer and 18 nodes in the F2 layer. Most of the pins are intended for test and characterization
purposes. All the subcircuits in the chip can be isolated from the rest and conveniently characterized. The F1
input vector I, which has 100 components, has to be loaded serially through one of the pins into a shift
register. The time delay measurements reported in this paper do not include the time for loading the shift
register.
The experimental measurements provided in this Section have been divided into four parts. The first
describes DC characterization results of the elements that contribute critically to the overall system precision.
These elements are the WTA circuit and the synaptic current sources. The second describes time delay
measurements that contribute to the global throughput time of the system. The third presents system level
experimental behaviors obtained with digital test equipment (HP82000). Finally, the fourth focuses on yield
and fault tolerance characterizations.
A. System Precision Characterizations:
The ART1 chip was intended to achieve an equivalent 6-bit (~1.5% error) precision. The part of the
system that is responsible for the overall precision is formed by the components that perform analog
computations. These components are (see Fig. 3) all current sources $L_A$ and $L_B$, all current mirrors $CMA_j$,
$CMB_j$, $CMM$, $CMC$, and the ρ-mirror, the current comparators $CC_j$, and the WTA circuit. The most critical
of these components (in precision) is the WTA circuit. Current sources and current mirrors can be made to
have mismatch errors below 0.2% [33], [36]-[38], at the expense of increasing transistor area and current,
decreasing distances between matched devices, and using common centroid layout techniques [39]. This is
feasible for current mirrors $CMA_j$, $CMB_j$, $CMM$, $CMC$, and the ρ-mirror, which appear in small numbers.
However, the area and current level are limited for the synaptic current sources $L_A$ and $L_B$, since there are
many of them. Therefore, the WTA and current sources $L_A$ and $L_B$ are the elements that limit the precision of the
overall system, and their characterization results will be described next.
A.1 : WTA Precision Measurements:
$L_A$ and $L_B$ will have current values of 10µA or less. The maximum current a WTA input branch can
receive is (see eq. (12)),
$$T_j\big|_{max} = \left[ L_M + \sum_{i=1}^{100} z_{ij} (L_A I_i - L_B) \right]_{max} = L_M + 100 (L_A - L_B) \qquad (16)$$
which corresponds to the case where all $z_{ij}$ and $I_i$ values are equal to ‘1’ (remember that $L_A > L_B > 0$). In our
circuit the WTA was designed to handle input currents of up to 1.5mA for each input branch. In order to
measure the precision of the WTA, all input currents except two were set to zero. Of these two inputs one was
set to $100\mu A$ and the other was swept between $98\mu A$ and $102\mu A$. This will cause their corresponding output
voltages $y_j$ to indicate an interchange of winners. The transitions do not occur exactly at $100\mu A$. Moreover,
the transitions change with the input branches. The standard deviation of these transitions was measured as
σ=0.86µA (or 0.86%). Table 1 shows the standard deviation (in %) measured when the constant current is set
to 10µA, 100µA, and 1mA.
Table 1. Precision of the WTA
$T_j$            10µA     100µA    1mA
$\sigma(T_j)$    1.73%    0.86%    0.99%
A.2 : Synaptic Current Sources Precision Measurements:
The second critical precision error source of the system is the mismatch between synaptic current sources.
In our chip each of the 3700 $L_A$ current sources and each of the 1800 $L_B$ current sources could be isolated and
independently characterized. Fig. 9 shows the measured mismatch error (in %) for 18 arbitrary $L_A$ current
sources when sweeping $L_A$ between 0.1µA and 10µA. As can be seen in Fig. 9, for currents higher than 5µA
the standard deviation of the mismatch error is close to 1%. The same result is obtained for the $L_B$ current
sources.
 Fig. 9: Measured Mismatch error (in %) between 18 arbitrary $L_A$ current sources
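The closed form of eq. (16) can be cross-checked by brute force on a short pattern length. The sketch below assumes binary $z_{ij}$ and $I_i$ and $L_A > L_B > 0$; the function names are hypothetical, and the numeric values (in microamperes) are the ones quoted for Fig. 12, used here only as an example operating point.

```python
from itertools import product

def t_j(z, I, LA, LB, LM):
    """T_j = LM + sum_i z_ij * (LA*I_i - LB), as in eq. (16)."""
    return LM + sum(zi * (LA * xi - LB) for zi, xi in zip(z, I))

def t_max_bruteforce(n, LA, LB, LM):
    """Largest T_j over all binary z_j and I of length n (small n only)."""
    bits = list(product((0, 1), repeat=n))
    return max(t_j(z, I, LA, LB, LM) for z in bits for I in bits)

def t_max_closed_form(n, LA, LB, LM):
    """Closed form of eq. (16), valid for LA > LB > 0."""
    return LM + n * (LA - LB)

# Units: microamperes.
LA, LB, LM = 3.2, 3.0, 400.0
chip_max = t_max_closed_form(100, LA, LB, LM)
```

At this operating point `chip_max` is about 420µA, well inside the 1.5mA range the WTA input branches were designed for.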
B. Throughput Time Measurements:
For a real-time clustering device the throughput time is defined as the time needed for each input pattern
to be processed. During this time the input pattern has to be classified into one of the pre-existing categories or
assigned to a new one, and the pre-existing knowledge of the system has to be updated to incorporate the new
information the input pattern carries. From a circuit point of view, this translates into the measurement of two
delay times:
• The time needed by the WTA to select the maximum among all {$c_j T_j$}.
• The time needed by the synaptic cells to change $z_{ij}$ from its old value to $y_j I_i z_{ij}$.
B.1 : WTA Delay Measurements:
The delay introduced by the WTA depends on the current level present in the competing input branches.
This current level will depend on the values chosen for $L_A$, $L_B$, and $L_M$, as well as on the input pattern $I$ and
all internal weights $z_j$. To keep the presentation simple, delay times will be given as a function of $T_j$ values
directly. Table 2 shows the measured delay times when $T_1$ changes from $T_1^a$ to $T_1^b$, and $T_2$ to $T_{18}$ have the
values given in the table. $t_{d1}$ is the time needed by category $y_1$ to win when $T_1$ switches from $T_1^a$ to $T_1^b$, and
$t_{d2}$ is the time needed by category $y_2$ to win when $T_1$ decreases from $T_1^b$ to $T_1^a$. As can be seen, this delay
is always below $1.2\mu s$.
Table 2. Delay times of the WTA
$T_1^a$    $T_1^b$    $T_2$      $T_3, \ldots, T_{18}$    $t_{d1}$    $t_{d2}$
0µA        200µA      100µA      0                        550ns       570ns
0µA        1mA        500µA      0                        210ns       460ns
100µA      150µA      125µA      100µA                    660ns       470ns
400µA      600µA      500µA      400µA                    440ns       400ns
500µA      1.50mA     1.00mA     500µA                    230ns       320ns
90µA       110µA      100µA      0                        1.12µs      1.11µs
490µA      510µA      500µA      0                        1.19µs      1.06µs
990µA      1.01mA     1.00mA     0                        380ns       920ns
For the cases when the vigilance criterion is not directly satisfied and hence comparators $CC_j$ cut some of
the $T_j$ currents, an additional delay is observed. This extra delay has been measured to be less than $400ns$ for
the worst cases. Therefore, the time needed until the WTA selects the maximum among all {$c_j T_j$} is less than
$1.2\mu s + 0.4\mu s = 1.6\mu s$.
B.2 : Learning Time:
After a delay of $1.6\mu s$ (so that the WTA can settle), the learn signal $LEARN$ (see Fig. 3) is enabled
during a time $t_{LEARN}$. To measure the minimum $t_{LEARN}$ time required, this time was set to a specific value
during a training/learning trial, and it was checked that the weights had been updated properly. By
progressively decreasing $t_{LEARN}$ until some of the weights did not update correctly, it was found that the
minimum $t_{LEARN}$ time for proper operation was 190ns. By setting $t_{LEARN}$ to 200ns and allowing the WTA a
delay of $1.6\mu s$, the total throughput time of the ART1 chip is established as $1.8\mu s$.
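The throughput budget is a plain sum of the three worst-case figures given above. The sketch below restates it; the constant and function names are hypothetical, and the resulting pattern rate excludes the time for loading the input shift register, which the reported measurements also exclude.

```python
WTA_WORST_CASE_NS = 1200      # upper bound on the delays in Table 2
VIGILANCE_EXTRA_NS = 400      # extra delay when comparators cut some T_j
T_LEARN_NS = 200              # chosen above the measured 190 ns minimum

def throughput_time_ns(wta_ns, extra_ns, learn_ns):
    """Total time per pattern: WTA settling plus the LEARN pulse."""
    return wta_ns + extra_ns + learn_ns

total_ns = throughput_time_ns(WTA_WORST_CASE_NS, VIGILANCE_EXTRA_NS, T_LEARN_NS)
patterns_per_second = 1e9 / total_ns
```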
B.3 : Comparison with Digital Neural Processors:
A digital chip with a feedforward speed of $a$ connections per second, a learning speed of $b$ connection
updates per second, and a WTA section with a delay of $c$ seconds must satisfy the following equation to
achieve a throughput time of $1.8\mu s$ when emulating the ART1 algorithm of Fig. 2(c):
$$\frac{3700}{a} + \frac{100}{b} + c = 1.8\mu s \qquad (17)$$
Note that there are 100 synapse weights $z_{ij}$ to update for each pattern presentation, and 3700 feed-forward
connections: 1800 connections to generate all $T_j = L_A |I \cap z_j| - L_B |z_j| + L_M$, 1800 connections to generate
$L_A |I \cap z_j|$, and 100 connections to generate $L_A |I|$.
Assuming $c = 100ns$ and $a = b$, eq. (17) results in a processing speed of $a = b = 2.2 \times 10^9$
connections/s and connection-updates/s. A digital neural processor would require such figures of merit to
equal the processing time of the analog ART1 chip presented in this paper. Therefore, this “approximate
reasoning” makes us conclude that our chip has an equivalent computing power of $a + b = 4.4 \times 10^9$
connections/s plus connection-updates/s.
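Eq. (17) can be solved directly for the case $a = b$. The sketch below does so with the figures given in the text (3700 connections, 100 updates, $c = 100ns$, 1.8µs throughput); the function name is hypothetical.

```python
def required_rate(throughput_s, wta_delay_s, n_connections=3700, n_updates=100):
    """Solve n_connections/a + n_updates/b + c = throughput for a = b."""
    budget = throughput_s - wta_delay_s
    if budget <= 0:
        raise ValueError("WTA delay leaves no time for the connections")
    return (n_connections + n_updates) / budget

a = b = required_rate(1.8e-6, 100e-9)
```

The exact solution is slightly above $2.2 \times 10^9$; the quoted $4.4 \times 10^9$ corresponds to doubling the rounded value.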
C. System Level Performance:
Although the internal processing of the chip is analog in nature, its input ($I_i$) and output ($y_j$) are binary
valued. Therefore, the system level behavior of the chip can be tested using conventional digital test
equipment. In our case we used the HP82000 IC Evaluation System.
An arbitrary set of 100-bit input patterns {$I^k$} was chosen, shown in Fig. 10. A typical clustering
sequence is shown in Fig. 11, for $\rho = 0.7$ and $\alpha = L_A/L_B = 1.05$. The first column indicates the input
pattern $I^k$ that is fed to the F1 layer. The other 18 squares ($10 \times 10$ pixels) in each row represent each of the
internal $z_j$ vectors after learning is finished. The vertical bars to the right of some $z_j$ squares indicate that
these categories won the WTA competition while satisfying the vigilance criterion. Therefore, such categories
correspond to $z_J$, and these are the only ones that are updated for that input pattern $I^k$ presentation. The figure
shows only two iterations of input pattern presentations, because no change in weights was observed after
these. The last row of weights $z_j$ indicates the resulting categorization of the input patterns. The numbers
below each category indicate the input patterns that have been clustered into this category. In the following
figures we will show only this last row of learned patterns together with the pattern numbers that have been
clustered into each category.
 Fig. 10: Set of Input Patterns
 Fig. 11: Clustering Sequence for ρ=0.7 and α=LA/LB=1.05
Fig. 12 shows the categorizations that result when tuning the vigilance parameter ρ to different values
while the currents were set to $L_A = 3.2\mu A$, $L_B = 3.0\mu A$, and $L_M = 400\mu A$ ($\alpha = L_A/L_B = 1.07$). Note
that below some categories there is no number. This is a known ART1 behavior: during the clustering process
some categories might be created that will not represent any of the training patterns. In Fig. 13 the vigilance
parameter is maintained constant at $\rho = 0$, while $\alpha$ changes from 1.07 to 50. For a more detailed explanation
on how and why the clustering behavior depends on ρ and α see references [2] and [3], or other ART1
theoretical papers [1], [40].
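The clustering procedure exercised in these tests can be sketched in software. This is a minimal model, not the authors' code: it assumes the choice function $T_j = L_A |I \cap z_j| - L_B |z_j| + L_M$, the vigilance test $\rho |I| \le |I \cap z_j|$ applied to every node before the competition, fast learning $z_J \leftarrow I \cap z_J$, and one fresh node enabled at a time. Function names and the 6-bit patterns are hypothetical.

```python
def art1_present(I, z, n_committed, rho, LA, LB, LM):
    """Present one binary pattern I; update z in place; return (winner, n_committed)."""
    size_I = sum(I)
    n_enabled = min(n_committed + 1, len(z))     # committed nodes plus one fresh node
    best, best_T = None, None
    for j in range(n_enabled):
        overlap = sum(a & b for a, b in zip(I, z[j]))
        if rho * size_I > overlap:               # vigilance fails: node is cut
            continue
        T = LA * overlap - LB * sum(z[j]) + LM
        if best is None or T > best_T:
            best, best_T = j, T
    if best is None:
        return None, n_committed                 # no node available
    z[best] = [a & b for a, b in zip(I, z[best])]   # fast learning
    return best, max(n_committed, best + 1)

def art1_cluster(patterns, n_nodes, rho, LA, LB, LM, max_epochs=10):
    """Repeat presentations until the weights stop changing."""
    n_bits = len(patterns[0])
    z = [[1] * n_bits for _ in range(n_nodes)]   # all weights start at '1'
    n_committed, labels = 0, []
    for _ in range(max_epochs):
        before = [row[:] for row in z]
        labels = []
        for I in patterns:
            winner, n_committed = art1_present(I, z, n_committed, rho, LA, LB, LM)
            labels.append(winner)
        if z == before:
            break
    return labels, z

patterns = [[1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0],
            [0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 1]]
labels, z = art1_cluster(patterns, n_nodes=4, rho=0.7, LA=3.2, LB=3.0, LM=400.0)
```

In this toy run a vigilance of 0.7 gives every pattern its own category, while a vigilance of 0 merges all four patterns into the first node.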
D. Yield and Fault Tolerance:
A total of 30 chips (numbered 1 through 30 in Table 3 and Fig. 14) were fabricated. For each chip every
subcircuit was independently tested and its proper operation verified; 14 different faults were identified. Table
3 indicates the faults detected for each of the 30 chips. The faults have been denoted from F1 to F14, and are
separated into two groups:
• Catastrophic Faults (digital sense) are those clearly caused by a short or open circuit failure. These
faults are F1, ..., F8. This kind of fault would produce a failure in a digital circuit.
Table 3. Fault Characterizations of the 30 ART1 Chip Samples. Dark Shades: Sample with
Catastrophic Fault; Light Shade: Sample with no Catastrophic Fault but with non
Catastrophic Fault; no Shade: Sample with no Fault.
Catastrophic Faults (digital sense) Non-Catastrophic Faults (digital sense)
chip # F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14
1 x x
2 x x x x x x
3 x x x x
4 x x x x
5 x x x
6 x x x x x
7 x x
8 x
9 x x x x
10 x x x x
11 x
12 x x
13 x x x x x
14 x x
15 x x x x
16 x x
17 x x x
18 x x x
19 x
20 x x x x x x
21 x x
22 x x
23 x x
24 x x
25
26 x
27 x x x x
28
29 x
30 x x x
• Non-Catastrophic Faults (digital sense) are those that produce a large deviation from the nominal
behavior, too large to be explained by random process parameter variations. These faults are F9, ..., F14.
This kind of fault would probably not produce a catastrophic failure in a digital circuit, but would be responsible
for significant delay time degradation.
 Fig. 12: Categorization of the input patterns for LA=3.2µA, LB=3.0µA,
LM=400µA, and different values of ρ
 Fig. 13: Categorization of the input patterns for ρ=0 and different values
of α
Table 4 describes the subcircuits where the faults of Table 3 were found. Note that the most frequent faults are
F3/F9 and F4/F10, which are failures in some current sources $L_A$ or $L_B$, and these current sources occupy a
significant percentage of the total die area. Fault F1 is a fault in the shift register that loads the input vector $I^k$.
Fault F2 is a fault in the WTA circuit. Therefore, chips with an F1 or F2 fault could not be tested for system
level operation. Faults F3 and F9 are faults detected in the same subcircuits of the chip, with F3 being
catastrophic and F9 non-catastrophic. The same is valid for F4 and F10, F5 and F11, and so on until F8 and
F14.
Note that only 2 of the 30 chips (6.7%) are completely fault-free. According to the simplified expression
for the yield performance as a function of die area $\Omega$ and process defect density $\rho_D$ [41],
$$yield = 100 e^{-\rho_D \Omega} \qquad (18)$$
this requires a process defect density⁷ of $\rho_D = 3.2 cm^{-2}$. On the other hand, ignoring the non-catastrophic
faults yields 9 out of 30 chips (30%). According to eq. (18) such a yield would be predicted if the process
defect density is $\rho'_D = 1.4 cm^{-2}$.
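Both defect densities follow from inverting eq. (18). The sketch below reproduces the arithmetic, assuming a square die 1cm on a side with a 400µm pad ring removed on each edge (the effective area given in footnote 7); the function and variable names are hypothetical.

```python
import math

def defect_density(yield_fraction, die_area_cm2):
    """Invert eq. (18), yield = exp(-rho_D * Omega), for rho_D (defects per cm^2)."""
    return -math.log(yield_fraction) / die_area_cm2

side_cm = 1.0 - 2 * 0.04            # 1 cm die minus a 400 um pad ring on each side
omega = side_cm ** 2                # effective area, about 0.85 cm^2

rho_all_faults = defect_density(2 / 30, omega)     # only fault-free chips count
rho_catastrophic = defect_density(9 / 30, omega)   # non-catastrophic faults ignored
```

The two results round to the 3.2 and 1.4 defects per cm² quoted in the text.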
Even though the yield is quite low, many of the faulty samples were still operative. This is due to the fault
tolerant nature of the neural algorithms in general [42]-[45], and the ART1 algorithm in particular. Looking at
Table 3 we can see that there are 16 chips that have an operative shift register and WTA circuit. We performed
system level operation tests on these chips to verify if they would be able to form clusters of the input data,
and verified that 12 of these 16 chips were able to do so. Moreover, 6 (among which were the two completely
fault-free chips) behaved identically. The resulting clustering behavior of these 12 chips is depicted in
Fig. 14 for $\rho = 0.5$ and $\alpha = 1.07$.
  V. Further Enhancements
The chip described in this paper is the first prototype designed by the authors for real-time clustering. As
such, the design focused on testability and full characterization possibilities, instead of maximizing speed and
yield, for example.
7.  The effective die area is $\Omega = (0.92cm)^2$ to account for a $400\mu m$ wide pad ring.
Table 4. Description of Faults
F1        non-operative shift register for loading $I^k$
F2        non-operative WTA circuit
F3/F9     fault in a current source $L_A$
F4/F10    fault in a current source $L_B$
F5/F11    fault in vigilance parameter ρ current mirror
F6/F12    fault in current mirror $CMM$
F7/F13    fault in current mirrors $CMA_j$ or $CMB_j$
F8/F14    fault in current mirror $CMC$
 Fig. 14: Categorization of the input patterns performed by operative samples (rows: chips 9, 10, 22, 25, 26, 28 in one shared row; chips 8, 14, 18, 19, 24, 30 in individual rows)
To make this chip an industry-ready prototype, several trivial modifications should be introduced:
• First, substitute the serially loaded shift register that holds the input pattern $I^k$ with some kind of parallel
loading mechanism (using either electrical or optical data acquisition techniques).
• Use some simple yield enhancement technique. Looking at Table 3 and Table 4 we can see that most of the
failures are due to faults in the synaptic current sources. A simple yield enhancement technique would be
to add a number of spare columns of synapses, some of which would substitute faulty columns of synapses.
• Add a handshaking mechanism that would allow the chip to communicate with the outside circuitry. Thus,
when the WTA produces a fast response (which, by the way, is most of the time), the outside circuitry need
not wait for the worst case WTA delay.
Other, less trivial, enhancements that should be addressed relate to the high area and current consumption of
the synaptic current sources $L_A$ and $L_B$. One possibility would be to use UV-activated floating-gate-calibrated
[46]-[49] current sources, instead of the tree-like structure of Fig. 7. In principle, it should be possible to use
one single calibrated MOS transistor per synaptic current source. This transistor, which can be close to
minimum size, does not have to drive a large current either. Calibration errors of 0.2% have been reported for
currents of 200nA [49]. Using a scheme like this significantly reduces the current and silicon area
consumption per synapse, allowing a much higher number of synapses per chip and thus boosting the
performance of the chip significantly.
Other considerations relate to the question of how this chip would scale up with size. What would be the
practical limitations? Usually a strong limitation when scaling up analog neural hardware is how systematic
offsets accumulate. A common circuit technique for analog neural VLSI is the use of transconductors [22],
[50]. Connecting many of them in parallel results in addition of their systematic offset components. If the size
of the system is sufficiently large, this total offset can drive the system out of working range⁸. For our circuit
the accumulation of systematic offsets of the synaptic current sources is not a problem. Note that the total
currents $T_j$ (which certainly include a common systematic offset) will compete in a WTA circuit, and the
maximum among all {$T_j$} is the same regardless of the presence or not of a common offset component.
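The offset argument amounts to the fact that adding the same constant to every competing current does not change which one is largest. A minimal illustration follows, with an idealized WTA and hypothetical current values in microamperes.

```python
def winner(currents):
    """Index of the largest current (ideal WTA)."""
    return max(range(len(currents)), key=lambda j: currents[j])

T = [394.4, 400.4, 394.0]             # illustrative T_j values
offset = 7.5                          # hypothetical common systematic offset
shifted = [t + offset for t in T]

same_winner = winner(T) == winner(shifted)
```

Only offsets that differ from branch to branch, that is, mismatch, can alter the outcome of the competition.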
A real scaling limitation for the circuit technique used in our chip is the following. The smallest current
per synaptic current source is limited by the precision we want to achieve (even when using UV-activated
floating-gate calibration techniques). Therefore, the maximum number of synapses that can be put into the
same chip will be limited by the maximum power dissipation allowed by the package for a given precision.
This implies a trade-off between precision and size.
Another problem that might arise when the number of nodes in the F2 layer (maximum number of
categories) becomes significantly large, is that the WTA circuit might not be able to detect the maximum
among a large number of close-to-maximum inputs. At that point, one might reconsider if it is necessary to
have an F2 layer that provides one (and only one) winner, instead of an F2 layer that provides a “bubble” of
winners [51], [52].
A different way of system growth is to assemble different ART1 subsystems to perform supervised
clustering tasks [53], or to combine ART cells hierarchically for higher level knowledge processing [54], [55].
8.  In this case a global offset calibration technique can be used to overcome this problem.
  VI. Conclusions
This paper presented the algorithm, circuit implementation, and experimental test results of an analog
(digital-compatible) current-mode VLSI clustering engine. The algorithm is mainly based on the popular
ART1 architecture, although slight modifications have been introduced to produce more efficient hardware.
The presented prototype chip realizes an architecture with 100 F1 nodes and 18 F2 nodes, and is thus able to
cluster 100-bit input patterns into up to 18 different categories. Modular expansibility is possible by directly
assembling a matrix array of chips, without any extra interfacing circuitry. It has been shown that a digital
neurocomputer able to provide the same throughput speed should have a processing speed of $2.2 \times 10^9$
connections per second and connection updates per second. Extensive chip characterization results have been
given including system precision measurements, system speed measurements, system level clustering
behavior, fault characterizations, and system level clustering behavior of faulty chip samples.
Finally, some improvement possibilities have been highlighted to enhance the efficiency and performance
of the overall system for industrial production of commercial prototypes.
  VII. References
[1] G. A. Carpenter and S. Grossberg, “A Massively Parallel Architecture for a Self-Organizing Neural Pattern
Recognition Machine,” Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1987.
[2] T. Serrano-Gotarredona and B. Linares-Barranco, “A VLSI-friendly ‘Fast-Learning’ ART1 Algorithm,”
Proceedings of the 1995 World Congress on Neural Networks, Washington DC, vol. I, pp. 27-30, 1995.
[3] T. Serrano-Gotarredona and B. Linares-Barranco, “A Modified ART1 Algorithm more suitable for VLSI
Implementations,” Neural Networks, accepted for publication.
[4] M. Griffin, G. Tahara, K. Knorpp, and W. Riley, “An 11-million Transistor Neural Network Execution Engine,”
Proc. of the 1991 IEEE Int. Conf. on Solid-State Circuits, 1991, pp. 180-181.
[5] M. Yasunaga, N. Masuda, M. Yagyu, M. Asai, K. Shibata, M. Ooyama, M. Yamada, T. Sakaguchi, and M.
Hashimoto, “A Self-Learning Neural Network Composed of 1152 Digital Neurons in Wafer-Scale LSIs,” Proc. of
the 1991 Int. Joint Conf. on Neural Networks, Seattle, Washington, July 1991, pp. 1844-1849.
[6] N. Mauduit, M. Duranton, J. Gobert, and J. A. Sirat, “Lneuro 1.0: A Piece of Hardware LEGO for building Neural
Network Systems,” IEEE Trans. on Neural Networks, vol. 3, No. 3, May 1992, pp. 414-422.
[7] A. J. De Groot and S. R. Parker, “Systolic Implementation of Neural Networks,” SPIE High Speed Computing II,
vol. 1058, pp. 182-190, Los Angeles, California, 1989.
[8] S. Jones, K. Sammut, Ch. Nielsen, and J. Staunstrup, “Toroidal Neural Network: Architecture and Processor
Granularity Issues,” in VLSI Design of Neural Networks, U. Ramacher and U. Rueckert (Eds.), pp. 229-254,
Kluwer Academic Publishers, Dordrecht, Netherlands, 1991.
[9] R. W. Means and L. Lisenbee, “Floating-point SIMD Neurocomputer Array Processor,” paper distributed by HNC
Inc., 1993.
[10] U. Ramacher, J. Beichter, W. Raab, J. Anlauf, N. Bruels, U. Hachmann, and M. Wesseling, “Design of a 1st
Generation Neurocomputer,” in VLSI Design of Neural Networks, U. Ramacher and U. Rueckert (Eds.), pp.
271-310, Kluwer Academic Publishers, Dordrecht, Netherlands, 1991.
[11] M. A. Viredaz, C. Lehmann, F. Blayo, and P. Ienne, “MANTRA: A Multi-Model Neural-Network Computer,”
Proc. of the 3rd Int. Workshop on VLSI for Neural Networks and Artificial Intelligence, Oxford, September 1992.
[12] J. H. Chung, H. Yoon, and S. R. Maeng, “A Systolic Array Exploiting the Inherent Parallelisms of Artificial Neural
Networks,” Microprocessing and Microprogramming, vol. 33, pp. 145-159, 1992.
[13] T. Kohonen, Self-Organization and Associative Memory, 3rd ed., Berlin, Germany: Springer-Verlag, 1989.
[14] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum, 1981.
[15] R. Duda and P. Hart, Pattern Classification and Scene Analysis, New York: Wiley, 1973.
[16] J. Hartigan, Clustering Algorithms, New York: Wiley, 1975.
[17] R. Dubes and A. Jain, Algorithms that Cluster Data, Englewood Cliffs, NJ: Prentice Hall, 1988.
[18] Y. H. Pao, Adaptive Recognition and Neural Networks, Reading, MA: Addison Wesley, 1989.
[19] A. Rodríguez-Vázquez, S. Espejo, R. Domínguez-Castro, J. L. Huertas, and E. Sánchez-Sinencio, “Current-Mode
Techniques for the Implementation of Continuous- and Discrete-Time Cellular Neural Networks,” IEEE Trans. on
Circuits and Systems-II: Analog and Digital Signal Processing, vol. 40, No. 3, March 1993, pp. 132-146.
[20] A. Rodríguez-Vázquez and M. Delgado-Restituto, “Generation of Chaotic Signals using Current-Mode
Techniques,” Journal of Intelligent and Fuzzy Systems, vol. 2, No. 1, pp. 15-37, 1994.
[21] H. Oh and F. M. A. Salam, “Analog CMOS Implementation of Neural Network for Adaptive Signal Processing,”
Proc. of the 1994 IEEE Int. Symp. on Circuits and Systems (ISCAS’94), London, 1994, pp. 503-506.
[22] B. Linares-Barranco, E. Sánchez-Sinencio, A. Rodríguez-Vázquez, and J. L. Huertas, “A CMOS Analog Adaptive
BAM with On-Chip Learning and Weight Refreshing,” IEEE Trans. on Neural Networks, vol. 4, No. 3, May 1993,
pp. 445-455.
[23] E. Keulen, S. Colak, H. Withagen, and H. Hegt, “Neural Network Hardware Performance Criteria,” Proc. of the
1994 Int. Conf. on Neural Networks (ICNN’94), Orlando, 1994, pp. 1885-1888.
[24] M. Riedmiller, “Advanced Supervised Learning in Multi-layer Perceptrons - From Backpropagation to Adaptive
Learning Algorithms,” Int. Journal of Computer Standards and Interfaces (Special Issue on Neural Networks), (5),
1994.
[25] B. Kosko, “Adaptive Bidirectional Associative Memories,” Applied Optics, vol. 26, pp. 4947-4960, December,
1987.
[26] Y. F. Wang, J. B. Cruz, and J. H. Mulligan, “On Multiple Training for Bidirectional Associative Memory,” IEEE
Trans. on Neural Networks, vol. 1, September 1990, pp. 275-276.
[27] C. S. Ho, J. J. Liou, M. Georgiopoulos, G. L. Heileman, and C. Christodoulou, “Analogue Circuit Design and
Implementation of an Adaptive Resonance Theory (ART) Neural Network Architecture,” Int. Journal on
Electronics, vol. 76, No. 2, pp. 271-291, 1994.
[28] S. W. Tsay and R. W. Newcomb, “VLSI Implementation of ART1 Memories,” IEEE Transactions on Neural
Networks, vol. 2, No. 2, pp. 214-221, March 1991.
[29] D. C. Wunsch II, T. P. Caudell, C. D. Capps, R. J. Marks II, and R. A. Falk, “An Optoelectronic Implementation of
the Adaptive Resonance Neural Network,” IEEE Transactions on Neural Networks, vol. 4, No. 4, pp. 673-684, July
1993.
[30] J. Lazzaro, R. Ryckebusch, M. A. Mahowald, and C. Mead, “Winner-Take-All Networks of O(n) Complexity,” in
Advances in Neural Information Processing Systems, vol. 1, D. S. Touretzky (Ed.), Los Altos, CA: Morgan
Kaufmann, 1989, pp. 703-711.
[31] A. Rodríguez-Vázquez, R. Domínguez-Castro, F. Medeiro, and M. Delgado-Restituto, “High Resolution CMOS
Current Comparators: Design and Applications to Current-Mode Function Generation,” Analog Integrated Circuits
and Signal Processing, Kluwer Academic Publishers, 7, pp. 149-165, 1995.
[32] T. Serrano and B. Linares-Barranco, “The Active-Input Regulated-Cascode Current Mirror,” IEEE Trans. on
Circuits and Systems-I: Fundamental Theory and Applications, vol. 41, No. 6, June 1994, pp. 464-467.
[33] M. J. M. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, “Matching Properties of MOS Transistors,” IEEE
Journal of Solid-State Circuits, vol. 24, October 1989, pp. 1433-1440.
[34] T. Serrano and B. Linares-Barranco, “A Modular Current-Mode High-Precision Winner-Take-All Circuit,” Proc.
of the 1994 Int. Symposium on Circuits and Systems, London, 1994, vol. 5, pp. 557-560.
[35] T. Serrano and B. Linares-Barranco, “A Modular Current-Mode High-Precision Winner-Take-All Circuit,” IEEE
Trans. on Circuits and Systems II, vol. 42, No. 2, pp. 132-134, February 1995.
[36] J. B. Shyu, G. C. Temes, and F. Krummenacher, “Random Error Effects in Matched MOS Capacitors and Current
Sources,” IEEE Journal of Solid-State Circuits, vol. SC-19, No. 6, pp. 948-955, December 1984.
[37] K. R. Lakshmikumar, R. A. Hadaway, and M. A. Copeland, “Characterization and Modeling of Mismatch in MOS
Transistors for Precision Analog Design,” IEEE Journal of Solid-State Circuits, vol. SC-21, No. 6, pp. 1057-1066,
December 1986.
[38] C. Michael and M. Ismail, “Statistical Modeling of Device Mismatch for Analog MOS Integrated Circuits,” IEEE
Journal of Solid-State Circuits, vol. 27, No. 2, pp. 154-166, February 1992.
[39] P. E. Allen and D. R. Holberg, CMOS Analog Circuit Design, Holt Rinehart and Winston Inc., New York, 1987.
[40] B. Moore, “ART1 and Pattern Clustering,” in Proceedings of the 1988 Connectionist Summer School, D. S.
Touretzky, G. Hinton, and T. Sejnowski (Eds.), pp. 174-185, San Mateo, CA, 1989. Morgan Kaufmann.
[41] N. R. Strader and J. C. Harden, “Architectural Yield Optimization,” in Wafer Scale Integration, E. E. Swartzlander,
Jr. (Ed.), pp. 57-118, Kluwer Academic Publishers, Boston, 1989.
[42] L. C. Chu, “Fault-tolerant Model of Neural Computing,” IEEE Int. Conf. on Computer Design: VLSI in Computers
and Processors, pp. 122-125, 1991.
[43] C. Neti, M. H. Schneider, and E. D. Young, “Maximally Fault Tolerant Neural Networks,” IEEE Trans. on Neural
Networks, vol. 3, No. 1, pp. 14-23, January 1992.
[44] J. H. Kim, C. Lursinsap, and S. Park, “Fault-Tolerant Artificial Neural Networks,” Int. Joint Conf. on Neural
Networks (IJCNN’91), Seattle, vol. 2, p. 951, 1991.
[45] T. Petsche, “Trellis Codes, Receptive Fields, and Fault Tolerant, Self-Repairing Neural Networks,” in Machine
Learning: From Theory to Applications, Cooperative Research at Siemens and MIT, S. J. Hanson, W. Remmele,
and R. L. Rivest (Eds.), Springer-Verlag, Berlin, Germany, pp. 241-268, 1993.
[46] D. A. Kerns, Experiments in Very Large-Scale Analog Computations, PhD Thesis, California Institute of
Technology, Pasadena, California, 1993.
[47] G. Cauwenberghs, C. F. Neugebauer, and A. Yariv, “Analysis and Verification of an Analog VLSI Incremental
Outer-Product Learning System,” IEEE Trans. on Neural Networks, vol. 3, No. 3, pp. 488-497, May 1992.
[48] K. Yang and A. G. Andreou, “The Multiple Input Floating Gate MOS Differential Amplifier: An Analog
Computational Building Block,” Proc. of the 1994 Int. Symp. on Circuits and Systems (ISCAS’94), London, pp.
37-40, 1994.
[49] H. Miwa, K. Yang, P. O. Pouliquen, N. Kumar, and A. G. Andreou, “Storage Enhancement Techniques for Digital
Memory Based, Analog Computation Engines,” Proc. of the 1994 Int. Symp. on Circuits and Systems (ISCAS’94),
London, pp. 45-48, 1994.
[50] B. W. Lee and B. J. Sheu, Hardware Annealing in Analog VLSI Neurocomputing, Boston: Kluwer Academic
Publishers, 1991.
[51] S. Grossberg, “Nonlinear Neural Networks: Principles, Mechanisms, and Architectures,” Neural Networks, vol. 1,
pp. 17-61, 1988.
[52] T. Kohonen, Self-Organization and Associative Memory, Springer Verlag, New York, 1984.
[53] G. A. Carpenter, S. Grossberg, and J. H. Reynolds, “ARTMAP: Supervised Real-Time Learning and Classification
of Nonstationary Data by a Self-Organizing Neural Network,” Neural Networks, vol. 4, pp. 565-588, 1991.
[54] S. Grossberg, Neural Networks and Natural Intelligence, MIT Press, 1989.
[55] G. A. Carpenter and S. Grossberg, Pattern Recognition by Self-Organizing Neural Networks, MIT Press, 1991.
