This paper presents a modification to the original ART 1 algorithm [Carpenter, 1987a] that is conceptually similar, can be implemented in hardware with less sophisticated building blocks, and maintains the computational capabilities of the originally proposed algorithm. This modified ART 1 algorithm (which we will call here ART 1m) is the result of hardware-motivated simplifications investigated during the design of an actual ART 1 chip [Serrano, 1994, 1996]. The purpose of this paper is simply to justify theoretically that the modified algorithm preserves the computational properties of the original one and to study the difference in behavior between the two approaches.
I. Introduction
In 1987 Carpenter and Grossberg published the ART 1 algorithm in a brilliant and well-founded paper [Carpenter, 1987a] , the first of a series of Adaptive Resonance Theory (ART) architectures. ART 1 is an architecture capable of learning (in an unsupervised way) recognition codes in response to arbitrary orderings of arbitrarily many and complex binary input patterns. The ART 2 [Carpenter, 1987b] and Fuzzy-ART [Carpenter, 1991a] architectures do the same but for analog input patterns. ART 3 [Carpenter, 1990] introduces a search process for ART architectures that can robustly cope with sequences of asynchronous analog input patterns in real time. ARTMAP [Carpenter, 1991b] and Fuzzy-ARTMAP [Carpenter, 1992] can be taught to learn (in a supervised way) predetermined categories of binary and analog input patterns, respectively. This paper focuses only on the ART 1 architecture. This architecture has a collection of interesting computational properties:
• Self-Scaling: The self-scaling property discovers critical features in a context-sensitive way. For example, if two binary input patterns have M bits set to '1', and all except for m of them are at the same location, these two different input patterns can be classified into the same category if m/M is sufficiently small, or as two different categories if m/M is not so small.
• Vigilance or Variable Coarseness: There is a vigilance parameter (ρ) that adjusts the coarseness of the categories that will be formed. If the vigilance parameter is set close to '1', more attention is dedicated to distinguishing very similar input patterns, which are classified and learned as belonging to different categories. If, however, the vigilance parameter is close to '0', there must be a significant difference between two input patterns for the system to separate them into two different categories (see the sketch after this list).
(The vigilance parameter satisfies 0 ≤ ρ < 1.) Submitted to Neural Networks on November 24, 1994; accepted on November 13, 1995.
• Subset and Superset Direct Access: Suppose the system has learned two different categories such that the binary pixel image representing one is a subset of the image representing the other. Under these circumstances, the system can classify a new input pattern as belonging to either the subset or the superset category, depending on global similarity criteria. No restrictions on input orthogonality or linear predictability are needed.
• Stable Category Learning: In response to an arbitrary list of binary input patterns, all interconnection weights subject to learning approach limits after a finite number of learning trials. Learning is guaranteed to stabilize, and it does so after a small number of training-pattern presentations.
• Biasing the Network to form New Categories: When a new pattern arrives, a competition starts between stored patterns to capture it. One of the competing categories is the empty or uncommitted category. There exists a parameter that can bias the tendency of the uncommitted category to initially capture a new pattern, before the vigilance parameter plays any role.
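As a quick illustration of the self-scaling and vigilance properties (referred to in the vigilance item above), the following Python sketch applies the fast-learning match test |I ∩ z| / |I| ≥ ρ, developed in Section II, to pairs of patterns of different sizes. The pattern sizes and the value ρ = 0.8 are arbitrary choices for the example, not values taken from the paper.

```python
import numpy as np

def matches(I, z, rho):
    """Vigilance test of the fast-learning ART 1 model:
    accept template z for input I when |I AND z| / |I| >= rho."""
    I = np.asarray(I, dtype=bool)
    z = np.asarray(z, dtype=bool)
    return (I & z).sum() / I.sum() >= rho

# Two patterns with only 4 active pixels that differ in 1 position,
# and two patterns with 20 active pixels that also differ in 1 position.
small_a = np.array([1, 1, 1, 1] + [0] * 16)
small_b = np.array([1, 1, 1, 0] + [0] * 15 + [1])
large_a = np.ones(20, dtype=int)
large_b = np.concatenate([np.ones(19, dtype=int), [0]])

rho = 0.8
print(matches(small_a, small_b, rho))  # False: 3/4 < 0.8, the mismatch is "critical"
print(matches(large_a, large_b, rho))  # True: 19/20 >= 0.8, the mismatch is ignored
```

The same one-pixel mismatch is treated as significant for the small patterns and as noise for the large ones, which is the self-scaling behavior described above.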
In the original ART 1 paper [Carpenter, 1987a] , the architecture is mathematically described as sets of Short Term Memory (STM) and Long Term Memory (LTM) time domain nonlinear differential equations. The STM differential equations describe the evolution of and interactions between processing units or neurons of the system, while the LTM differential equations describe how the interconnection weights change in time as a function of the state of the system. The time constants associated with the LTM differential equations are much slower than those associated with the STM differential equations. A valid assumption, also presented by Carpenter and Grossberg [Carpenter, 1987a] , is to make the STM differential equations settle instantaneously to their corresponding steady state and consider only the dynamics of the LTM differential equations. In this case, the STM differential equations must be substituted by nonlinear algebraic equations that describe the corresponding steady state of the system. Furthermore, Carpenter and Grossberg also introduced the fast learning mode of the ART 1 architecture, in which the LTM differential equations are also substituted by their corresponding steady-state nonlinear algebraic equations. Thus, the ART 1 architecture, originally modelled as a dynamically evolving collection of neurons and synapses governed by time-domain differential equations, can be behaviorally modelled as the sequential application of nonlinear algebraic equations: an input pattern is given, the corresponding STM steady state is computed through the STM algebraic equations, and the system weights are updated using the corresponding LTM algebraic equations.
At this point three different levels of ART 1 implementations (in either software or hardware) can be distinguished:
Type-1
Full Model Implementation: Both STM and LTM time-domain differential equations are realized. This implementation is the most expensive and requires a large amount of computational power.
Type-2
STM Steady-State Implementation: The STM differential equations are replaced by their nonlinear algebraic steady-state equations, while the LTM time-domain differential equations are preserved, so that the weights change gradually in time.
Type-3
Fast Learning Implementation: Both the STM and the LTM differential equations are replaced by their steady-state nonlinear algebraic equations, so that the system operation reduces to the sequential application of algebraic equations.
Wunsch et al. [Wunsch, 1993] have built an optical-based Type-3 implementation; elsewhere [Serrano, 1994, 1996] we present a CMOS VLSI Type-3 circuit.
This paper presents a modification to the original ART 1 algorithm [Carpenter, 1987a] (which we will call from now on ART 1m, for "ART 1 modified") that is conceptually similar, can be implemented in hardware with less sophisticated building blocks, and maintains the same computational capabilities as the originally proposed algorithm. This modification was motivated by a Type-3 hardware implementation and was investigated during the design process of an actual ART 1 Type-3 chip [Serrano, 1994, 1996]. However, such modifications can be extended to Type-2 and Type-1 implementations as well, as shown at the end of this paper.
The paper is organized as follows: Section II develops the ART 1 m architecture starting from the original ART 1 Type-3 (or Fast Learning) description and driven by hardware implementation considerations. Section III shows that all computational properties present in the original ART 1 architecture are preserved in the modified version. Section IV studies the differences in behavior between the two descriptions and provides simulation results, and Section V indicates how to extend the ART 1 m Type-3 description to Type-2 and Type-1 models.
II. From the Original ART 1 Algorithm to the Modified One
Let us start by describing the Type-3 model of the original ART 1 architecture. The ART 1 topology is shown in Fig. 1 and consists of two layers: layer F1 is the input layer and has M nodes (one for each binary "pixel" of the input pattern), and layer F2 is the category layer and has N nodes. Let us call x_i (i = 1, ..., M) the nodes in layer F1, and y_j (j = 1, ..., N) the nodes in layer F2. In the original ART 1 paper specific notations were used to distinguish between internal state, output, and node name for F1 and F2 nodes. In this paper, since we are concerned exclusively with Type-3 descriptions, we will use a single notation to refer to the internal state, output, and node name of F1 nodes (x_i) and F2 nodes (y_j). Each node in the F2 layer represents a "cluster" or "category". In this layer, only one node will become active after presentation of an input pattern I. The F2 layer category that becomes active is the one that most closely represents the input pattern I. If no preexisting category is satisfactory for a given input pattern, a new category will be formed. Each F1 node x_i is connected to all F2 nodes y_j through bottom-up connections of weight z_ij^bu (see footnote 1), so that the input received by each F2 node y_j is given by

T_j = Σ_{i=1..M} z_ij^bu I_i .    (1)
Layer F2 acts as a Winner-Take-All network (see footnote 2), so that all nodes y_j remain inactive except the one that receives the maximum input T_j.
Once an F2 winning node arises, a top-down pattern is activated through the top-down weights z_ji^td (see footnote 3). Let us call this top-down pattern z_j^td = (z_j1^td, ..., z_jM^td). The resulting F1 activity vector X = (x_1, ..., x_M) is given by the equation

x_i = I_i Σ_{j=1..N} y_j z_ji^td .
Since only one y_j is active, let us call this winning F2 node y_J, so that y_j = 0 if j ≠ J and y_J = 1. In this case we can write

X = I ∩ z_J ,
where z_J = (z_J1^td, z_J2^td, ..., z_JM^td). This top-down template will be compared with the original input pattern I according to a predetermined vigilance criterion, tuned by a vigilance parameter ρ, so that two alternatives may occur:
1. Bottom-up weights z_ij^bu may take any real value in the interval [0, K], where K is a constant whose value is given in [Carpenter, 1987a].
2. In principle, layer F2 is not restricted to act as a Winner-Take-All network. Contrast enhancement is another possible choice [Carpenter, 1987a]. 3. In the Fast Learning (Type-3) model, top-down weights z_ji^td may take only the values '0' or '1'.
a) If |X| / |I| ≥ ρ, the active category J is accepted, and the system weights will be updated to incorporate this new knowledge (see footnote 4).
b) If |X| / |I| < ρ, the active category J is not valid for the given value of the vigilance parameter ρ. In this case y_J will be deactivated (reset) by making T_J = 0, so that another y_j node will become active through the Winner-Take-All action of the F2 layer.
Learning takes place when an active F2 node is accepted by the vigilance criterion. The weights will be updated according to the following algebraic equations,

z_Ji^td(new) = I_i z_Ji^td(old) ,    z_iJ^bu(new) = L I_i z_Ji^td(old) / ( L − 1 + |I ∩ z_J^td(old)| ) ,
or, using vector notation,

z_J^td(new) = I ∩ z_J^td(old) ,    z_J^bu(new) = L ( I ∩ z_J^td(old) ) / ( L − 1 + |I ∩ z_J^td(old)| ) ,
where L > 1 is a constant parameter. Note that only the weights of the connections incident to the winning F2 node y_J are updated. Therefore, the operation of the Type-3 (or Fast Learning) implementation of the ART 1 architecture is described by the algorithm depicted in Fig. 2(a).
From a hardware implementation point of view, one of the first issues that comes into consideration is that there are two templates of weights to be built. The set of bottom-up weights z ij bu , each of which must store a real value belonging to the interval [0,K] , and the set of top-down weights z ji td , each of which stores either the value '0' or '1'. The physical implementation of the bottom-up template memory presents the first hardware difficulty because the weights need either an analog or a digital memory with sufficient bits per weight so that the digital discretization does not affect the system performance. However, it can be seen from eqs. (6) that the bottom-up set {z ij bu } and the top-down set {z ji td } contain the same information: each of these sets can be fully computed by knowing the other set. The bottom-up set {z ij bu } is a normalized version of the top-down set {z ji td }. Therefore, from a hardware implementation point of view, it would be desirable to implement physically only a binary valued set (one bit per weight) and introduce the normalization of the bottom-up weights during the computation of {T j }. This way, the two sets {z ij bu } and {z ji td } can be substituted by a single binary valued set {z ij }, and eq.
(1) modified to take into account the normalization effect of the original bottom-up weights (see footnote 5),

T_j = L |I ∩ z_j| / ( L − 1 + |z_j| ) .    (7)
4. The notation |a| represents the cardinality of vector a, i.e., the number of its nonzero components: |a| = Σ_i a_i.
5. This type of modification is employed in the Fuzzy-ART model [Carpenter, 1991a], which operates with analog patterns instead of binary ones. Making Fuzzy-ART work with binary patterns results in ART 1 behavior, but using only one set of weights, similar to the system described in this paper.
Considering this minor "implementation" modification, the algorithm of Fig. 2 (a) would be transformed into that depicted in Fig. 2(b) . The system level performance of the algorithms described by Fig. 2(a) and (b) is identical.
There is no difference in the behavior between the two diagrams, and the one in Fig. 2 (b) offers more attractive features from a hardware (as well as software) implementation point of view. For the remainder of this paper we will consider the original ART 1 Type-3 architecture as described by the algorithm of Fig. 2(b) .
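To make the Fig. 2(b) description concrete, here is a minimal Python sketch of one presentation cycle per pattern: the choice function of eq. (7), the Winner-Take-All selection, the vigilance test, and the fast-learning update. The function name, the parameter values, and the single-pass training loop are illustrative choices, not taken from the paper; in practice the training list is presented repeatedly until the templates stop changing.

```python
import numpy as np

def art1_fast_learning(patterns, M, N, rho=0.7, L=2.0):
    """Minimal sketch of the Fig. 2(b) Type-3 (fast-learning) ART 1 cycle,
    using a single binary template per F2 node."""
    z = np.ones((N, M), dtype=bool)            # templates start at all '1' (uncommitted)
    for I in patterns:
        I = np.asarray(I, dtype=bool)
        # Choice values of eq. (7): T_j = L|I AND z_j| / (L - 1 + |z_j|)
        T = L * (I & z).sum(axis=1) / (L - 1.0 + z.sum(axis=1))
        while True:
            J = int(np.argmax(T))              # Winner-Take-All in the F2 layer
            if not np.isfinite(T[J]):
                break                          # every node was reset: pattern not coded
            X = I & z[J]                       # top-down template intersected with I
            if X.sum() >= rho * I.sum():       # vigilance test |X|/|I| >= rho
                z[J] = X                       # fast learning: z_J <- I AND z_J
                break
            T[J] = -np.inf                     # reset node J and continue the search
    return z
```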
However, in Fig. 2(b), an extra division operation, |I ∩ z_j| / (L − 1 + |z_j|), needs to be performed for each node in the F2 layer. This is an expensive hardware operation and would probably constitute a performance bottleneck in the overall system for both analog and digital circuit implementations. If possible, it would be very desirable to avoid this division operation. The main idea of this paper is precisely to substitute this division operation by another, less expensive one, and, although this results in a system with a slightly different behavior, we will show that it preserves all the computational properties of the original ART 1 algorithm.

[Fig. 2 flow diagrams: Initialize weights; Read input pattern; Winner-Take-All; vigilance test (YES / NO); Update weights.]

Fig. 3(a) shows the curves that represent the division operation of eq. (7). A first simplification could be to substitute these curves by a piece-wise linear approximation, as shown in Fig. 3(b). Such an approximation still presents some hardware difficulties and could also limit the performance of the overall system. A more drastic simplification is to substitute the original operation by the operation represented by the set of curves of Fig. 3(c) (see footnote 6),

T_j = L_A |I ∩ z_j| − L_B |z_j| + γ ,    (8)
where L_A and L_B are positive parameters that play the role of the original L and (L − 1) parameters. As we will see in the next Section, the condition L_A > L_B must be imposed for proper system operation. γ is a constant parameter needed (see footnote 7) to ensure that T_j ≥ 0 for all possible values of |I ∩ z_j| and |z_j|.
Replacing a division operation with a subtraction one is a very important hardware simplification with significant performance improvement potential. Fig. 2 (c) shows the final Type-3 ART 1 m algorithm, the object of this paper. In the next sections, we will try to show that the price paid for this drastic simplification, although it yields a system with slightly different input-output behavior, is insignificant since all the computational properties of the original ART 1 architecture are preserved.
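As a sketch of how small the change is in software, the following two functions compute the choice values for the Fig. 2(b) and Fig. 2(c) descriptions; only the choice computation differs. The form used for eq. (8), as well as the parameter values L_A = 2, L_B = 1 and γ = L_B·M, are the reconstructions and assumptions stated above, not values prescribed by the paper.

```python
import numpy as np

def choice_original(I, z, L=2.0):
    """Eq. (7): T_j = L * |I AND z_j| / (L - 1 + |z_j|), for all F2 nodes at once."""
    return L * (I & z).sum(axis=1) / (L - 1.0 + z.sum(axis=1))

def choice_art1m(I, z, LA=2.0, LB=1.0, gamma=None):
    """Assumed form of eq. (8): T_j = LA * |I AND z_j| - LB * |z_j| + gamma,
    with LA > LB and gamma large enough to keep T_j >= 0 (here gamma = LB * M)."""
    if gamma is None:
        gamma = LB * z.shape[1]                # assumption, not a value from the paper
    return LA * (I & z).sum(axis=1) - LB * z.sum(axis=1) + gamma

# The rest of the Fig. 2(b) loop (Winner-Take-All, vigilance test, weight update)
# is unchanged: swapping choice_original for choice_art1m yields the ART 1m algorithm.
```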
It is worth mentioning here that substituting a division operation by a subtraction one means a significant performance boost from a hardware implementation point of view. Physically implementing division operators in hardware significantly constrains the whole system design and imposes limitations on the overall system performance.

6. During the writing of this paper, similar T_j functions (also called distances or choice functions) have been proposed by other authors for Fuzzy-ART. Since ART 1 can be considered a particular case of Fuzzy-ART when the input patterns are binary, Fuzzy-ART choice functions can also be used for ART 1. In the Appendix we show how these other choice functions also yield ART 1 architectures that preserve all the original computational properties. However, the choice function proposed in this paper is computationally less expensive and is easier to implement in hardware.

7. In reality, parameter γ has been introduced for hardware reasons [Serrano, 1994, 1996]. In a software ART 1m implementation, parameter γ can be ignored.
In the case of digital hardware, a division circuit can be built using either sequential techniques or large, higher-speed special-purpose circuits [Cavanagh, 1985]. Sequential techniques use simpler hardware but are slower, while a dedicated circuit is very large compared to the former and requires much more power consumption. As an example, for a sequential-type division circuit, in order to realize the division

|I ∩ z_j| / ( L − 1 + |z_j| ) ,    (9)
q addition/subtraction operations would be needed, where q is the number of bits needed for the result of the division. If, for example, the number M of nodes in the F1 layer is such that numerator and denominator in eq. (9) can be represented by 10-bit words (i.e., M is on the order of 1000), and if, for a given input I, we want to differentiate between the terms of two nodes whose respective templates differ in one bit, the F2 layer (WTA) would need to resolve the difference between the corresponding division results.
The worst case occurs when the templates are as large as possible (|z_j| close to M) and both are contained in I. In this case the difference to be resolved is approximately (L − 1) / M².
A reasonable minimum value for L is 1.01. Therefore, if L = 1.01 and M is on the order of 1000, the smallest difference to be resolved is on the order of (L − 1)/M² ≈ 10⁻⁸. On the other hand, it is easy to see that the result of the division is close to but less than one. Consequently, for each division a dynamic range of about 10⁸ is needed (eq. (12)). Such a dynamic range requires a q = 27 bit representation. Thus, for each division operation we need to realize 27 10-bit additions/subtractions. Furthermore, the WTA in the F2 layer would need to choose the maximum among N 27-bit words. On the other hand, if the ART 1m algorithm is used, instead of the N × 27 10-bit additions/subtractions, we need only to realize N 11-bit subtractions, and the WTA has to choose the maximum among N 11-bit words.
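The following short Python check reproduces the dynamic-range estimate above. The values L = 1.01 and M = 1000, and the ART 1m parameters L_A = 2, L_B = 1, γ = L_B·M, are illustrative assumptions consistent with the discussion, not values fixed by the paper.

```python
import math

L, M = 1.01, 1000

# Division-based choice (eq. (9)): the result is close to but below 1, and two
# templates differing in one bit give results that differ by about (L - 1)/M**2.
delta_min = (L - 1) / ((L - 1 + M) * (L - 2 + M))
q_division = math.ceil(math.log2(1.0 / delta_min))
print(q_division)              # -> 27 bits needed to resolve the WTA decision

# Subtraction-based choice (eq. (8)), assuming small integer parameters
# LA = 2, LB = 1 and gamma = LB * M (assumed values, not from the paper):
LA, LB = 2, 1
gamma = LB * M
T_max = (LA - LB) * M + gamma  # largest possible choice value
q_art1m = math.ceil(math.log2(T_max + 1))
print(q_art1m)                 # -> 11 bits per choice value
```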
In the case of analog hardware, there are ways to implement the division operation with compact dedicated circuits [Bult, 1987], [Sánchez-Sinencio, 1989], [Gilbert, 1990], [Sheingold, 1976], but they usually suffer from low signal-to-noise ratios, limited signal range, or noticeable distortion, or require bipolar devices, which are available only in more expensive VLSI technologies. In any case, the performance of the overall ART system would be limited by the lower performance of the division operators. If the division operators are eliminated, the performance of the system is limited by other operators which, for the same VLSI technology, render considerably better performance figures. Furthermore, in the case of analog current-mode signal processing [Serrano, 1996], the addition and subtraction of currents do not need any physical components. Consequently, by eliminating the need for signal division, the circuitry is dramatically simplified and its performance drastically improved.
III. On the Computational Equivalence of the Original and the Modified Models
Throughout the original ART 1 paper [Carpenter, 1987a] , Carpenter and Grossberg provide rigorous demonstrations of the computational properties of the ART 1 architecture. Some of these properties are concerned with Type-1 and Type-2 operations of the architecture, but most refer to the Type-3 model operation.
From a functional point of view, i.e., when looking at the ART 1 system as a black box regardless of the details of its internal operations, the system level computational properties of ART 1 are fully contained in its
Fast-Learning or Type-3 model. The theorems and demonstrations given by Carpenter and Grossberg [Carpenter, 1987a] cover the properties listed in the Introduction (Self-Scaling, Vigilance or Variable Coarseness, Subset and Superset Direct Access, Stable Category Learning, and Biasing the Network to form New Categories), as well as the properties consequent of the theorems in the original ART 1 paper [Carpenter, 1987a]. In the remainder of this Section we will show that these properties remain in the ART 1m architecture.
Let us define a few concepts before demonstrating that the original computational properties are preserved. 
c) Superset Template: an input pattern I is said to be a Superset Template of a learned category if 
A. Direct Access to Subset and Superset Patterns
Suppose that a learning process has produced a set of categories in the F 2 layer. Each category y j is characterized by the set of weights that connect node y j in the F 2 layer to all nodes in the F 1 layer, i.e.,
z_j = (z_j1, z_j2, ..., z_jM). Suppose that two of these categories, y_(1) and y_(2), with templates z^(1) and z^(2), are such that z^(1) ⊂ z^(2) (z^(1) is a subset template of z^(2)). Now consider two input patterns I^(1) and I^(2) such that

I^(1) = z^(1)   and   I^(2) = z^(2) .
The Direct Access to Subset and Superset property assures that input I^(1) will have Direct Access to category y_(1) and that input I^(2) will have Direct Access to category y_(2).
Proof:
If pattern I^(1) is given as the input pattern we will have

T_(1) = L_A |z^(1)| − L_B |z^(1)| + γ = (L_A − L_B) |z^(1)| + γ ,    T_(2) = L_A |z^(1)| − L_B |z^(2)| + γ .
Since |z^(1)| < |z^(2)|, it follows that T_(1) > T_(2) (remember L_B > 0). If pattern I^(2) is presented at the input layer of the network, it would be

T_(1) = (L_A − L_B) |z^(1)| + γ ,    T_(2) = (L_A − L_B) |z^(2)| + γ .
In order to guarantee that T_(2) > T_(1), the condition

L_A > L_B    (17)

must be satisfied.
B. Direct Access by perfectly learned patterns (Theorem 1 of original ART 1):
This theorem, when adapted to a Type-3 implementation, would state the following:
An input pattern I has direct access to a node y_J which has perfectly learned it, i.e., whose template satisfies z_J = I.
Proof:
In the case of the ART 1m algorithm, in order to prove that I has direct access to y_J we need to show that: (i) y_J is the first F2 node to be chosen, (ii) y_J is accepted by the vigilance criterion, and (iii) y_J remains active as learning occurs (see footnote 8).
To prove property (i), we must establish that, at the start of each trial, T_J > T_j for all j ≠ J. Since I = z_J, we have T_J = (L_A − L_B) |I| + γ, so we need to prove

(L_A − L_B) |I| > L_A |I ∩ z_j| − L_B |z_j|    for all j ≠ J .    (18)
Suppose first that |z_j| ≥ |I|. Since |I ∩ z_j| ≤ |I| is always true, eq. (18) is satisfied, because L_A |I ∩ z_j| − L_B |z_j| ≤ L_A |I| − L_B |I| = (L_A − L_B) |I|, with equality only when z_j = I = z_J.
Suppose now that |z_j| < |I|. Then, since L_A > L_B, it follows that (L_A − L_B) |z_j| < (L_A − L_B) |I|. Finally, since |I ∩ z_j| ≤ |z_j| is always true, it follows that L_A |I ∩ z_j| − L_B |z_j| ≤ (L_A − L_B) |z_j| < (L_A − L_B) |I|, and eq. (18) is again satisfied.
Property (ii) is directly satisfied because |I ∩ z_J| / |I| = |I| / |I| = 1 ≥ ρ.
Finally, property (iii) also holds because, after node y_J is selected as the winning category, its weight template will remain unchanged (because I ∩ z_J = z_J), and consequently the inputs to the F2 layer will remain unchanged.
C. Stable Choices in STM (Theorem 2 of original ART 1):
Whenever an input pattern I is presented for the first time to the ART 1 system, a set of {T j } values is formed that compete in the Winner-Take-All F 2 layer. The winner may be reset by the vigilance subsystem, and a new winner appears that may also be reset, and so on until a final winner is accepted. During this search process, the T j values that led to earlier winners are set to zero. Let us call O j the values of T j at the beginning of the search process, i.e., before any of them is set to zero by the vigilance subsystem. Theorem 2 of the original ART 1 architecture states:
8. In the original ART 1 paper it is also shown that read-out of the top-down template does not deactivate node y_J as the winning node. This is because there the proof was developed for a Type-1 implementation, where activation of an F2 node results in a change of the T_j terms through the influence of the top-down connections.
Suppose that a node y_J is chosen for STM storage instead of another node y_j because O_J > O_j. Then read-out of the top-down template preserves the inequality T_J > T_j and thus confirms the choice of y_J by the bottom-up filter.
This theorem only makes sense for a Type-1 implementation, because there, as a node in the F2 layer activates, the initial values of T_j (immediately after presenting an input pattern I) may be altered through the top-down "feed-back" connections. In a Type-3 description (see Fig. 2) the initial T_j terms remain unchanged, independently of what happens in the F2 layer. Therefore, this theorem is implicitly satisfied.
D. Initial Filter Values determine Search Order (Theorem 3 of original ART 1):
Theorem 3 of the original ART 1 architecture states that (page 92 of [Carpenter, 1987a] ):
The Order Function (the initial values O_j) determines the order of search no matter how many times F2 is reset during a trial.
The proof is the same for the ART 1 and the ART 1m (both Type-3) implementations (see footnote 9). If the first winner y_j1 is reset by the vigilance subsystem, the values of T_j for the remaining nodes will not change. Therefore, the new order sequence is O_j2 > O_j3 > ..., and the original second-largest value O_j2 will be selected as the winner. If T_j2 is now set to zero, y_j3 is the next winner, and so on.
This Theorem, although trivial in a Type-3 implementation, has more importance in a Type-1 description where the process of selecting and shutting down a winner alters all values T j [Carpenter, 1987a] .
E. Learning on a Single Trial (Theorem 4 of original ART 1):
This theorem (page 93 of [Carpenter, 1987a]) states the following, assuming a Type-3 implementation is being considered (see footnote 10):
Suppose that a winning node y_J is accepted by the vigilance subsystem.
Then the LTM traces change in such a way that T_J increases and all other T_j remain constant, thereby confirming the choice of y_J. In addition, the set X = I ∩ z_J remains constant during learning, so that learning does not trigger reset of y_J by the vigilance subsystem.
Proof:
In this case, if y_J is the winning category accepted by the vigilance subsystem, from eq. (8) we obtain, before learning takes place,

T_J = L_A |I ∩ z_J| − L_B |z_J| + γ ,
9. However, note that the resulting ordering {j 1 , j 2 , j 3 , ...} may differ for the original and the modified architecture. 10. A more sophisticated demonstration for this theorem is provided in the original ART 1 paper [Carpenter, 1987a] . This is because the demonstration is performed for a Type-1 description of ART 1.
and the new T_J value, after the update z_J(new) = I ∩ z_J(old), is given by

T_J(new) = (L_A − L_B) |I ∩ z_J| + γ ≥ T_J .
Therefore, learning confirms the choice of y_J, and by eq. (23) the set X = I ∩ z_J remains constant, so that learning does not trigger reset of y_J by the vigilance subsystem.
F. Stable Category Learning (Theorem 5 of original ART 1):
Suppose an arbitrary list (finite or infinite) of binary input patterns is presented to an ART 1m system. Each template z_j is updated every time category y_j is selected by the Winner-Take-All F2 layer and accepted by the vigilance subsystem. Sometimes template z_j may be changed, and other times it may remain unchanged. Let us call the times at which z_j suffers a change t_j(1) < t_j(2) < ... < t_j(n_j). Since the vector (or template) z_j of a committed node has M components (of which, at most, M−1 are set to '1'), and since by eq. (23) every change removes at least one element from z_j, the number of changes n_j is at most M−1.
Since template z_j will remain unchanged after time t_j(n_j), it is concluded that the complete LTM memory will suffer no change after time t_learn = max_j { t_j(n_j) }.
If there is a finite number of nodes in the F2 layer, t_learn has a finite value, and thus learning is completed after a finite number of time steps. This is true for both the ART 1 and the ART 1m architectures. Therefore, the following theorem (page 95 of [Carpenter, 1987a]) is valid for the two algorithms:
In response to an arbitrary list of binary input patterns, all LTM traces approach limits after a finite number of learning trials. Each template z_j remains constant except for at most M−1 times at which it progressively loses elements, leading to the Subset Recoding Property:
z_j(t2) ⊆ z_j(t1)   for all t2 ≥ t1 .
The LTM traces such that decrease to zero. The LTM traces such that remain always at '1'. The LTM traces such that but stay at '1' for times but will change to and stay at '0' for times .
G. Direct Access after Learning Stabilizes (Theorem 6 of original ART 1):
Assuming F 2 has a finite number of nodes, the present theorem (page 98 of [Carpenter, 1987a] ) states the following:
After recognition learning has stabilized in response to an arbitrary list of binary input patterns, each input pattern I either has direct access to the node which possesses the largest subset template with respect to I, or cannot be coded by any node of F2. In the latter case, F2 contains no uncommitted nodes.
Proof:
Since learning has already stabilized, I can be coded only by a node y_j whose template z_j is a subset template with respect to I. Otherwise, once y_j becomes active, the set z_j would contract to I ∩ z_j, thereby contradicting the hypothesis that learning has already stabilized. Thus, if I activates any node other than one with a subset template, that node must be reset by the vigilance subsystem. For the remainder of the proof, let y_J be the first F2 node activated by I. We need to show that if z_J is a subset template, then it is the subset template with the largest O_j; and if it is not a subset template, then all subset templates activated on that trial will be reset by the vigilance subsystem:
If y_J and y_j are nodes with subset templates with respect to I, then

O_J = (L_A − L_B) |z_J| + γ   and   O_j = (L_A − L_B) |z_j| + γ .
Since (L_A − L_B) |z| + γ is an increasing function of |z| (remember L_A > L_B), O_J ≥ O_j implies |z_J| ≥ |z_j|,
and, therefore, |z_J| / |I| ≥ |z_j| / |I| .
Therefore, if y_J is reset (|z_J| / |I| < ρ), all other nodes with subset templates will be reset as well (|z_j| / |I| < ρ).
Now suppose that y_J, the first activated node, does not have a subset template with respect to I (I ∩ z_J ≠ z_J), but another node y_j with a subset template is activated in the course of search. We need to show that |z_j| / |I| < ρ, so that y_j is reset. We know that O_J ≥ O_j, that is,

L_A |I ∩ z_J| − L_B |z_J| + γ ≥ (L_A − L_B) |z_j| + γ ,
which, since |z_J| ≥ |I ∩ z_J| and L_A > L_B, implies that |I ∩ z_J| ≥ |z_j|. Since y_J cannot be chosen, it must be reset by the vigilance subsystem, which means that |I ∩ z_J| / |I| < ρ. Therefore,

|z_j| / |I| ≤ |I ∩ z_J| / |I| < ρ ,

and y_j is reset as well.
H. Search Order (Theorem 7 of original ART 1):
The conditions expressed in the original Theorem 7 must be changed to adapt this theorem to the ART 1 m architecture. The modified theorem states the following:
and that input pattern satisfies . 
Now, note that
From eqs. (35), (39) and (41), it follows that .
On the other hand, .
Therefore, . 
Therefore, if all subset templates are searched and if no other learned template exists, an uncommitted node will be activated and code the input pattern.
11. We assume that y_j is not an uncommitted node (i.e., z_j is not the all-'1' initial template).
If all subset templates have been searched and there are learned superset templates but no mixed templates, the node with the smallest superset template will be activated (and not an uncommitted node) and will code I, since for any superset template |z_j| < M and therefore

L_A |I| − L_B |z_j| + γ > L_A |I| − L_B M + γ .    (47)

If there is more than one superset template, the one with the smallest |z_j| will be activated. Since |I ∩ z_j| / |I| = 1 ≥ ρ, there is no reset, and I will be coded.
This completes the proof of the modified Theorem 7 for the ART 1 m architecture.
I. Biasing the Network towards Uncommitted Nodes:
In the original ART 1 architecture, choosing L large increases the network's tendency to choose uncommitted nodes in response to unfamiliar input patterns I. In the ART 1m architecture, the same effect is observed when choosing the ratio L_A / L_B large. This can be understood through the following reasoning.
When an input pattern I is presented, an uncommitted node is chosen before a coded node y_j if

L_A |I| − L_B M + γ > L_A |I ∩ z_j| − L_B |z_j| + γ .
This inequality is equivalent to

( L_A / L_B ) ( |I| − |I ∩ z_j| ) > M − |z_j| .    (54)
As the ratio L_A / L_B increases, it is more likely that eq. (54) is satisfied, and hence uncommitted nodes are chosen before coded nodes, regardless of the vigilance parameter value ρ.
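A small numeric check of this bias, using the reconstructed inequality of eq. (54); the pattern and template sizes below are made up for the example.

```python
def uncommitted_wins(size_I, overlap, size_z, M, alpha):
    """Eq. (54): an uncommitted node beats a coded node y_j when
    alpha * (|I| - |I AND z_j|) > M - |z_j|, with alpha = LA / LB."""
    return alpha * (size_I - overlap) > M - size_z

# A 25-pixel input with 10 active pixels, sharing 8 of them with a 12-pixel template.
M, size_I, overlap, size_z = 25, 10, 8, 12
print(uncommitted_wins(size_I, overlap, size_z, M, alpha=2))   # False: the coded node wins
print(uncommitted_wins(size_I, overlap, size_z, M, alpha=10))  # True: the uncommitted node wins
```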
J. Remarks:
Even though this Section has shown that the computational properties of the original ART 1 system are preserved in the ART 1 m system, the response of both systems to an arbitrary list of training patterns will not be exactly the same. The main underlying reason for this difference is that the initial ordering (55) is not always exactly the same for both architectures. The next Section will study the differences between the two ART 1 systems.
IV. On the Functional Differences between Original and Modified Model
As stated previously, the difference in behavior between the ART 1 and ART 1m models is caused by the different orderings of the terms of eq. (55). Assuming that both models, at a certain time, have identical weight templates {z_j}, and the same input pattern I is given, eq. (55) has the following two formulations: one in which the ordered terms are the O_j of eq. (7), and one in which they are the O_j of eq. (8),
where the resulting ordering of category indices might be different for one formulation than for the other. The ordering resulting for the original ART 1 description is modulated by parameter L. For example, if L is very large compared to all |z_j| terms, then the ordering depends exclusively on the values of |I ∩ z_j|,

|I ∩ z_j1| ≥ |I ∩ z_j2| ≥ ...    (57)
If L is very close to 1, then the ordering depends on the ratios |I ∩ z_j| / |z_j|.
Likewise, for the ART 1m description, the ordering is modulated by the single parameter α = L_A / L_B. If α is extremely large, the situation in eq. (57) results. However, for α very close to 1, the ordering depends on the differences |I ∩ z_j| − |z_j|.
Obviously, the behavior of the two ART 1 descriptions will be identical for large values of L and α. However, moderate values of L and α are desired in practical ART 1 applications. On the other hand, it can be expected that the behavior will also tend to be similar for very high values of ρ: if ρ is very close to 1, each training pattern will form an independent category, whereas different training patterns will cluster into a shared category for smaller values of ρ. Therefore, a very similar behavior between ART 1 and ART 1m is expected for high values of ρ, while more differences in behavior may be apparent for smaller values of ρ.
In order to compare the behavior of the two algorithms, we have performed exhaustive simulations using randomly generated training pattern sets (see footnote 12). As an illustration of a typical case where the two algorithms produce different learned templates, Fig. 4 shows the evolution of the memory templates, for both the ART 1 and the ART 1m algorithms, using a randomly generated training set of 10 patterns with 25 pixels each. Weight templates for the original ART 1 and for ART 1m are given separate names in Fig. 4, and the vigilance parameter was set to a fixed value for each algorithm. In Fig. 4, boxed category templates are those that met the vigilance criterion and had the maximum T_j value. If the box is drawn with a continuous line, the corresponding template suffered modifications due to learning. If the box is drawn with a dashed line, learning did not alter the corresponding template. Both algorithms stabilized their weights in 2 training trials. Looking at the learned templates we can see that input patterns 4 and 5 clustered into the same category for both algorithms. This also occurred for patterns 6 and 8, and for patterns 3, 9 and 10. However, patterns 1, 2, and 7 did not cluster in the same way in the two cases. In the original ART 1 algorithm patterns 1 and 7 clustered into one category, while pattern 2 remained independent in another. In the ART 1m algorithm patterns 1 and 2 clustered together into one category, while pattern 7 remained independent in another.
To measure a distance between the two sets of learned templates, let us use the Hamming distance between two binary patterns a and b,

d(a, b) = Σ_{i=1..M} | a_i − b_i | .
12. For all simulations in this paper, randomly generated training pattern sets were obtained with a 50% probability for each pixel to be either '1' or '0'.
We can use this metric to define the distance between two sets of patterns {a_j} and {b_j} as the minimum, over all one-to-one pairings of their elements, of the total Hamming distance between paired patterns.
For this purpose, the optimal ordering of indexes must be found. In the case of Fig. 4 (where both algorithms produced the same number of learned categories), the distance D between the two learned pattern sets is obtained by adding the Hamming distances between optimally paired templates.
In general, we can define the distance between two pattern sets {a_j} and {b_j} (j = 1, ..., N_c) as

D = min_σ Σ_{j=1..N_c} d( a_j , b_σ(j) ) ,

where the minimum is taken over all permutations σ of the indexes.
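A minimal sketch of this set distance in Python, using a brute-force search over index permutations (adequate for the small numbers of categories involved); for sets with different numbers of categories the smaller set would first be padded with all-'1' uncommitted templates, as described below. The template values in the example are made up.

```python
import numpy as np
from itertools import permutations

def set_distance(A, B):
    """Smallest total Hamming distance over all one-to-one pairings of the
    templates in A with the templates in B (brute force over permutations)."""
    A, B = np.asarray(A, dtype=int), np.asarray(B, dtype=int)
    assert A.shape == B.shape      # pad the smaller set with all-'1' rows beforehand
    n = A.shape[0]
    return min(sum(int(np.abs(A[j] - B[p[j]]).sum()) for j in range(n))
               for p in permutations(range(n)))

# Two sets of three 5-pixel templates (illustrative values only):
A = [[1, 1, 0, 0, 0], [0, 0, 1, 1, 0], [1, 0, 0, 0, 1]]
B = [[0, 0, 1, 1, 0], [1, 1, 0, 0, 1], [1, 0, 0, 0, 1]]
print(set_distance(A, B))          # -> 1: the best pairing differs in a single pixel
```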
In the case of Fig. 4, both algorithms produced the same number of learned categories. This does not always occur. For the case where a different number of categories results, we measured the distance between the two learned sets by adding as many uncommitted F2 nodes to the set with fewer categories as necessary to equal the number of categories. An uncommitted category has all its pixels set to '1'. Thus, having a different number of committed nodes drastically increases the resulting distance, and is consequently a strong penalty.
We have repeated the simulation of Fig. 4 for many randomly generated training sets and different parameter values; the results are summarized in Figs. 5 and 6. It seems natural to expect that, for a given value of ρ and a given value of the original ART 1 parameter L, there is an optimal value of the ART 1m parameter α that minimizes the difference in behavior between the two algorithms. To find this relation between L and α for each ρ, we computed (for a given ρ and L) the value of α that minimizes the average distance between the learned pattern sets generated by the two algorithms. The results of these computations are shown in Fig. 7 (see footnote 13). Fig. 7(a) shows a family of curves (one for each value of ρ) giving the optimal value of α as a function of L. Fig. 7(b) shows the resulting minimum average distance between learned sets for the same family of curves. As shown in Fig. 7(a), the optimal fit between parameters α and L depends only very slightly on the value of ρ.
As can be concluded from Fig. 5, Fig. 6, Fig. 7 , and the discussion in this Section, the behavior of the two algorithms is qualitatively the same although some slight quantitative differences can be observed. ART 1 m parameter α has a wider tuning range than original ART 1 parameter L. On the other hand, ART 1 m needs a slightly higher number of learning trials than the original ART 1. Also, there is an optimal adjustment between parameters α and L that minimizes the difference in behavior between the two algorithms, and this adjustment appears approximately independent of ρ.
13. Note that high values of ρ and L were omitted in this analysis, since in these cases the behavior of the two algorithms tends to be similar, regardless of the fit between parameters L and α. 
V. Extending the ART 1m Algorithm to Type-2 and Type-1 Descriptions

A. A Type-2 ART 1m Implementation
The change in weights must be smooth in a Type-2 description. Every time an input pattern I is presented and an F 2 category node is selected for LTM storage, only a partial change in LTM traces is allowed. In this case, it is obvious that we can no longer use a binary valued weight template.
As seen in Section II, Fig. 2(c) shows the flow diagram of a Type-3 implementation of the ART 1 m algorithm.
Extending this diagram to a Type-2 description is straightforward. The only box that needs to be changed is the one corresponding to the update of weights. Instead of using the algebraic formula z_J(new) = I ∩ z_J(old), we have to use a time-domain differential equation that leads to the same steady state. The following set of differential equations fulfills this requirement,

dz_ji/dt = K f(y_j) ( − z_ji + x_i ) ,    (65)
where K is a positive constant, f(·) a sigmoidal function, and x_i an STM variable given by

x_i = I_i z_Ji ,

with y_J the winning F2 node.
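A minimal sketch of a Type-2 weight update under the reconstructed form of eqs. (65), integrating the equation for the winning node with Euler steps and taking f(y_J) = 1; the constants, step size, and number of steps are illustrative only, not values from the paper.

```python
import numpy as np

def type2_update(z_J, I, K=1.0, dt=0.01, steps=100):
    """Euler integration of the (reconstructed) LTM equation for the winning node:
    dz_Ji/dt = K * (-z_Ji + I_i * z_Ji). Few steps give partial (Type-2) learning;
    many steps approach the Type-3 steady state z_J <- I AND z_J."""
    z = np.asarray(z_J, dtype=float).copy()
    I = np.asarray(I, dtype=float)
    for _ in range(steps):
        z += dt * K * (-z + I * z)
    return z

z_J = np.array([1.0, 1.0, 1.0, 1.0])
I   = np.array([1.0, 0.0, 1.0, 0.0])
print(type2_update(z_J, I, steps=20))    # partial learning: weights where I_i = 0 only decay a little
print(type2_update(z_J, I, steps=2000))  # approaches the fast-learning result [1, 0, 1, 0]
```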
If τ_0 is the time required for the LTM eqs. (65) to settle to their steady state, the update of weights (i.e., the simulation of eqs. (65)) would be allowed only for a time interval τ ≤ τ_0 for each input pattern I presentation.
As τ approaches τ_0, the application of eqs. (65) produces the same weight changes as the fast-learning (Type-3) update, while shorter intervals τ produce only a partial change of the LTM traces at each presentation.
where,
The parameters and sigmoidal functions involved are those of the original ART 1 STM equations; the F2 feedback functions are responsible for the resulting Winner-Take-All action of the F2 layer. These STM equations are identical to those of the original ART 1 algorithm [Carpenter, 1987a], except that we use one weight template instead of two. However, the main difference lies in the way the T_j terms are computed. In this case T_j will be given by the following equation,
where the constant involved is positive. Using eqs. (67)-(69) together with an STM Reset System will assure that, if the STM time constants are very small compared to the LTM ones, the behavior of the Type-2 description of Fig. 8 is recovered.
where I is the input pattern and |X| is the number of active F1 nodes. The nonspecific reset wave shuts off active F2 nodes until the input pattern I shuts off.
VI. Conclusions
This paper has presented, analyzed, and studied a modification to the original ART 1 algorithm. Such a modification has drastic consequences from a hardware implementation point of view, in the sense that it greatly simplifies the hardware requirements and components of the overall system and offers a significant potential for increased performance. Although the modification produces some changes in the original behavior of the system, we have shown that all the computational properties of the original ART 1 algorithm are preserved. We have also performed exhaustive simulations to highlight the differences in behavior introduced by the modified system. Finally, we have sketched how to extend conceptually such a modified system to a non-Fast-Learning description, although this would lead to the loss of important hardware advantages.
We have used this ART 1m model to implement a high-performance, analog current-mode, real-time clustering chip in a standard low-cost 1.5 µm CMOS process [Serrano, 1994, 1996]. Although we have used a specific circuit design technique (analog current mode), the ART 1 model described in this paper can be used with other circuit techniques. The only functions needed are binary storage, sums and/or subtractions, comparisons, and a Winner-Take-All action. The advantages of the ART 1m model can be exploited using any hardware technique. We hope that the modifications introduced in this paper can be used by other neural hardware engineers regardless of the circuit design technique they choose to use.
