Artificial neural networks (ANN) are being used as one of the prime computing tools for an increasing number of applications. This is due in part to the ANN's ability to adapt to changes via learning. The dynamic nature of many applications, as well as the computational and storage requirements of current learning algorithms, creates a need for high performance neuro-architectures with learning capabilities. In this paper we identify a set of computational, communication and storage requirements for learning in ANNs. These requirements are representative of a wide variety of algorithms for different learning approaches. We propose a novel neuro-emulator that provides the computational ability for the stated requirements. While meeting all the identified requirements, the new architecture maintains high performance during learning. To show the capabilities of the proposed machine we present four diverse learning algorithms and step through the execution of each using the proposed architecture. We include an evaluation of the machine performance as well as a comparison with other architectures. It is shown that with a modest amount of hardware the proposed architecture yields an extremely high number of connections per second. © 1999 Elsevier Science B.V. All rights reserved.
Introduction
The artificial neural network (ANN) paradigm has been used as a computing alternative for a number of applications to which other paradigms may not be well suited. These applications include speech processing [1], adaptive and self-learning control systems [2, 3], sequence verification through inferencing [4], visual mapping and localization [5], classification of parts for cellular manufacturing [6], location of visual features [7], path optimization [8, 9], and load forecasting resulting in economic power distribution [10].
In the ANN paradigm the input to a neuron is the weighted sum of an input vector. The input vector consists of the outputs of the neurons connected to the present neuron. The output of the neuron is the result of applying an activation function, usually non-linear, to the input value. Consequently, there must be a set of mechanisms that compute the product of the input vector and the weight matrix, the sum of the products, and the activation function. The outputs can then be used either as the solution or as the inputs for the next network iteration. In the learning process these outputs are used to compute an output error vector.
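The neuron computation described above can be sketched in a few lines. This is a minimal illustration rather than anything specific to the proposed hardware; the function names `neuron_output` and `sigmoid`, and the choice of the logistic sigmoid as the activation, are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, one common non-linear activation (assumed here)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, activation=sigmoid):
    """Weighted sum of the input vector followed by an activation function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    net = sum(w * x for w, x in zip(weights, inputs))  # sum of products
    return activation(net)

# Three connected neurons feed the present neuron.
out = neuron_output([1.0, 0.5, -1.0], [0.2, 0.4, 0.1])
```

The three mechanisms named in the text (products, summation, activation) appear as the generator expression, the `sum` call and the `activation` call respectively.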
The purpose of a learning algorithm is to increase the correctness of the network's output as well as to provide network adaptiveness. This is accomplished to a large extent by changing the value of the weights in the network. The amount and method by which the weights are modified varies greatly from one algorithm to another. A number of learning algorithms have been developed to suit the applications and network characteristics [11–14]. These learning algorithms present different computation and storage requirements. We have identified a set of computation and storage requirements that are common to most learning algorithms. A list of these requirements follows.
• Competitive learning capabilities. Competitive approaches are used to determine the neuron with the largest (or smallest) output (commonly referred to as winner-takes-all, WTA). This requires the ability to compare the outputs and determine the winner.
• Support for multi-layered networks. Multi-layered networks are sparsely connected; i.e. not every node is connected to every other node. This places demands on both recall and learning. Assuming the neurons are fully connected between layers, there must be a separate weight matrix for each layer to implement recall. During learning in multi-layered networks the error at the output layer can be calculated directly. The error for neurons in a hidden layer i is derived from the errors in layer i + 1. This requires both calculation of the hidden layer neuron error and storage for intermediate results.
• Support for gradient descent. Gradient descent is the process of traversing a surface in the steepest direction. In neural networks the surface is usually an error surface, and the direction is such that the error is minimized. This type of learning requires the ability to produce the derivative of the error function at the neuron output value.
• Weight modification and storage. Learning algorithms usually increase the correctness of the output by changing the value of the weights. To embed learning in an architecture there must be a means of modifying the weights and storing the modified values.
• Calculation of dot product. Some algorithms are concerned with matches in the direction of the input and weight vectors rather than matches of magnitude. For these algorithms the distance between two vectors in N-dimensional space can be determined by means of the dot product of the two vectors.
• Method of normalizing weight vectors. The result of the dot product of two vectors is a function of their magnitudes and the angle between them. The dot products of the weight and input vectors of all nodes can be directly compared if the weights are normalized [15]. This requires the ability to normalize the weight vectors.
• Generation of multiple functions. Learning algorithms employ a number of functions in the process of calculating new weight values, as well as in determining the winning neuron. These functions include inverse, exponential, square root, threshold, sigmoid, and the derivatives of some of the activation functions. The required functions must be available within acceptable error limits as well as time and cost constraints.
• Storage for intermediate results. Learning algorithms often consist of multiple stages or values. To increase performance each value or stage can be evaluated for all neurons before beginning the next logical step. This requires that the intermediate results be stored until all calculations are completed and the final result can be determined.
• Calculation of neighborhood size. Some algorithms have been shown to converge faster if the weights are updated as a neighborhood around the winner rather than updating the winning neuron alone. The calculations for the neighborhood size and the determination of the nodes within the neighborhood are typically different from those required by the recall operation.
• Calculation of linear distance between node locations. There are algorithms which use the linear distance as the criterion for determining the winning neuron. The linear distance between nodes is also an intermediate step in some neighborhood calculations.
• Vigilance testing. Clustering algorithms require a criterion for creating a new class. Some algorithms test the degree of similarity of the winning neuron to the input vector for this purpose. This test often requires many calculations and an evaluation of the result.
• Ability to disable and manipulate individual nodes. The vigilance test utilized by some algorithms is an iterative process. If the vigilance test fails, another round of competition must occur without the previous winner(s) taking part. It may become necessary to progressively disable nodes during each iteration.
• Ability to store weights to symmetric locations in matrix. Algorithms exist that result in symmetric weight matrices. The ability to store the unique weights to all their symmetric locations reduces the number of calculations required when implementing these algorithms.
As the complexity and problem size of new applications grow, there is an increasing need for architectural support for learning algorithms. Artificial neural networks are increasingly being used in applications with inherently dynamic environments. This in turn implies that the network weight values need to be updated frequently. The number of calculations required by learning algorithms is usually higher than that required for the recall operation. As a consequence the total number of calculations and the amount of storage required greatly increase. Embedding the capabilities required by learning algorithms in the architecture may not require much additional hardware. Further, it appears that learning algorithms, in particular those we have studied, contain parallelism and will exploit the proposed machine organization. If the learning algorithms are implemented in a host computer, the communication overhead will potentially become a bottleneck. This is due to the high frequency of weight updates and the massive amount of data required to implement the learning algorithms. Embedding learning capabilities in the neuro-emulator greatly improves the neuro-emulator's performance. The communication with the host computer is reduced to the data required at initialization of the network, returning the solution, and the swapping of input values, which can possibly be overlapped with other computations.
This paper is organized as follows. First we present a description of the Sequential PIpelined Neuro-emulator with Learning capabilities (SPIN-L) organization. We show how four learning algorithms can be mapped to the machine organization. These examples are used to illustrate the architecture's functionality; they represent a small sample of the algorithms the emulator is capable of implementing. We then present a performance study of four learning algorithms and a comparison of neuro-emulators that have been suggested in the literature [16–24]. Finally we draw a set of conclusions from the study presented here.
SPIN-L description
A novel emulator, called SPIN-L, provides architectural support to fulfill the computational, communication, and storage requirements that are commonly found in a number of ANN learning algorithms. The proposed SPIN-L emulator has a similar structure to a neuro-emulator for the ANN recall process (SPIN) [25], which does not provide hardware support for learning. In the design of SPIN-L we have incorporated a number of novel features to support the learning requirements. In this section we provide a description of SPIN-L with particular emphasis on the novel features.
Machine organization
The SPIN-L organization, shown in Fig. 1, is comprised of three main sections: the parallel section, the reduction section, and the multi-function generator. This overall organization is similar to SPIN [25], where these sections are identified as processing elements (containing registers and multipliers), adder tree (a set of bit serial adders arranged in a tree structure), and function generator (based on a lookup table).
In SPIN-L, the parallel processing cells (PPC) contain four units: a memory unit, an update unit, a register file and a multiplier unit. The reduction section is a binary tree structure (called the processing and communicating tree, PCT) whose processing cells are capable of performing a variety of operations including arithmetic, logic and bi-directional communication. The multi-function generator (M-FG) is capable of evaluating the sigmoid, inverse, inverse squared, exponential, cosine, sine and other non-linear functions. The multi-function generator section also contains storage for intermediate results (IS), as well as hardware to perform single valued operations. The PN-Bus connects both the root of the PCT and the output of the M-FG back to the last PPC. The parallel and reduction sections operate in a bit serial fashion while the multi-function generator is a word parallel implementation. The overall machine design contains N PPCs, the PCT with $\log_2 N$ stages, the multi-function generator and the feedback bus.
The machine is designed to exploit parallelism inherent in the target algorithm and to take advantage of pipelined execution. To calculate a sum of products, a common requirement in ANN algorithms, the PPCs compute N products in parallel. The products are generated bit serially, and the bit-streams, as they exit the PPC multipliers, are passed to the PCT which performs the summation. The sum can then be passed to the input of the M-FG for evaluation of a function or passed back to the last PPC using the PN-Bus. The individual hardware sections are independent of each other, allowing them to operate on different sets of data simultaneously. As the first sum of products is being passed back to the last PPC, the second set of products is being summed, and the third is being calculated. This parallel pipelined operation results in a high performance computing engine, while the bit serial implementation provides flexibility of word size, low cost and potentially higher clock rates.
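The division of labour between the parallel and reduction sections can be illustrated with a word-level sketch: the products are formed independently, then summed pairwise stage by stage as a binary adder tree would. The function names are hypothetical, and the sketch ignores the bit-serial timing and pipelining of the actual hardware.

```python
def tree_sum(values):
    """Sum values pairwise, stage by stage, as a binary adder tree would.

    Returns (total, stages). When the number of values N is a power of
    two, stages equals log2(N).
    """
    level = list(values)
    stages = 0
    while len(level) > 1:
        if len(level) % 2:          # pad odd levels with a neutral element
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        stages += 1
    return (level[0] if level else 0), stages

def sum_of_products(inputs, weights):
    """Products formed independently (parallel section), then tree-reduced."""
    products = [w * x for w, x in zip(weights, inputs)]
    return tree_sum(products)

total, stages = sum_of_products([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
```

With eight inputs the reduction finishes in three stages, which is the logarithmic depth the tree organization relies on.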
Detailed description of major components
The PPC organization, shown in Fig. 2, consists of four units as described above. The memory unit (MEM) is comprised of standard memory and three shift registers. The shift registers provide the bit serial interface between the standard memory and the other hardware units. There is also a buffer register which allows for continued pipelined operation during data transfers to memory. The register file (RF) consists of four shift registers connected to one input and two outputs. The two outputs of the register file can be supplied by any combination of the four shift register values, including one register providing both outputs. The shift registers can receive data from any one of four possible sources: the memory unit, the update unit, the data bus or the previous PPC's register file. The multiplication unit (MLT) is a bit-serial 16-bit two's complement multiplier. It is preceded by a multiplexer which allows its inputs to be selected from a variety of sources. The multiplexer also allows any source to supply both inputs, allowing for efficient calculation of squared terms. The update unit (UPD) contains the ability to add, subtract and accumulate. It is possible to bypass both the update and the multiplier cells with no delay.
The PCT is a binary tree with $\log_2 N$ stages. Each node of the tree is called a processing communicating cell (PCC) and is capable of performing a variety of operations. Because the PCCs operate on bit-streams of data, the hardware is compact in size, allowing for greater functionality at a reasonable cost. The data paths are shown in Fig. 1.
The PPCs are able to exploit the inherent parallelism of learning algorithms. The reduction section is able to perform a summation of N values in $O(\log_2 N)$ cycles. The Multi-Function Generator provides an accurate approximation to non-linear functions within the constraints of time and cost, allowing an increased number of functions to be supported. The sections of hardware, as well as the stages or units within the sections, can operate in a pipelined fashion. These traits, along with the flexibility of the multi-function generator, provide a strong core for the execution of learning algorithms on chip. The learning requirements stated in the introduction have been addressed in the design of SPIN-L. Below we present the major elements of the new architecture and the learning requirements which they fulfill.
• The register files (RFs). The register files have been enhanced to provide part of the support required by multi-layered networks as well as intermediate result storage (Fig. 3). The RFs contain four registers for the storage of the input vector and intermediate results. These registers are connected to two multiplexers which provide the outputs of the RF. Any combination of registers can be provided to the outputs, including one register providing both outputs. One output is also connected to the next RF to implement inter-RF communications.
• The connections between the weight store and the multipliers. The outputs of the weight store and the RF are connected to multiplexers which supply the inputs to the multiplier (Fig. 4). These multiplexers also control the routing of values to the inputs of a subtraction unit prior to the multipliers. The bit serial design allows for a high degree of connectivity with a small number of actual data paths. The multiplier is able to receive its inputs from a variety of sources, including one source providing both inputs. This assists in the calculations required for learning in multi-layered networks and neighborhood size determination.
• The Update Unit. The adders in the Update Units facilitate the weight modification during the learning and normalizing processes (Fig. 5). The source of the inputs for these adders can be the output of the multiplier, the RF, or the weight store. The sum is utilized for incremental modifications of the weight vectors. The carry is used to perform the weight updating for algorithms that require the logical AND function; this is a bitwise operation during which the carry is not fed back to the adder.
• The processing communicating tree (PCT). The PCT is a multi-faceted hardware design supporting many computational, communicative and miscellaneous requirements. It provides the necessary arithmetic operations, and also has the capability to perform the logical operations required for competitive learning. The tree also provides communication from the sequential end to all the leaves. During competitive learning it maintains the path from the leaf of the winner to the root of the tree. This is used to disable individual nodes as well as for communication purposes.
• The storage at the sequential end. The added memory at the sequential end of the architecture will be used to store the learning rate, neighborhood width parameter, and data required for neighborhood calculations, as well as intermediate results. The intermediate results will be stored here to allow for improved performance through continued pipelined operation.
• The multipliers in the M-FG section. Many learning algorithms are composed of iterative operations, such as sums of products. In their final stages they also often require operations on singular values. These multipliers will be used in this manner during the neighborhood calculations and multi-layered network learning.
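The remark that the adder carry yields a logical AND when it is not fed back can be checked with a small bit-level model. The sketch below assumes a conventional one-bit full adder processing LSB-first streams; the function names and the `feed_back_carry` flag are illustrative and are not taken from the SPIN-L design.

```python
def bit_serial_add(a_bits, b_bits, feed_back_carry=True):
    """Pass two LSB-first bit streams through a one-bit full adder.

    Returns (sum_bits, carry_bits). With feed_back_carry=False the carry-in
    is held at 0, so each carry output is simply a AND b for that bit pair.
    """
    carry_in = 0
    sum_bits, carry_bits = [], []
    for a, b in zip(a_bits, b_bits):
        s = a ^ b ^ carry_in
        c = (a & b) | (carry_in & (a ^ b))
        sum_bits.append(s)
        carry_bits.append(c)
        carry_in = c if feed_back_carry else 0
    return sum_bits, carry_bits

def to_bits(value, width):
    """Unsigned value as a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))

a, b = 0b1100, 0b1010
added, _ = bit_serial_add(to_bits(a, 8), to_bits(b, 8))
_, anded = bit_serial_add(to_bits(a, 8), to_bits(b, 8), feed_back_carry=False)
```

Here `from_bits(added)` recovers the arithmetic sum, while `from_bits(anded)` equals `a & b`, so the same adder cell can serve both incremental and AND-based weight updates.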
SPIN-L recall operation
Most of the learning algorithms today use the network output for recall to determine the amount of adjustment required. As a result the recall and learning operations will be interleaved with learning occurring if the recall operation's output is determined to be incorrect. It is therefore appropriate to examine the recall process before advancing to the learning operation. Fig. 6 presents the architecture fully pipelined during recall. We present here an overview of a fully connected network with N neurons during a recall operation.
1. The initial weight vectors are pre-loaded into memory such that the elements of each vector are distributed across the PPC elements. The input vector is stored in the RFs, distributed in the same manner as the weight vectors.
2. SPIN-L begins recall with the multiplication of the input vector, which is stored in the RFs, and the weight matrix. This occurs one weight vector at a time using the multipliers. The input vector elements are multiplied by their corresponding weight vector elements.
3. As the products are produced they are passed to the PCT for the summation process.
4. The PCT output is used as the input to the function generator, which implements the activation function.
5. This value is then passed back on the PN-Bus to the register file. As the outputs are produced and passed back on the PN-Bus, they are iteratively passed up to the next RF.
These calculations are repeated for all the weight vectors in the described pipelined fashion. As neurons are emulated the updated values are passed back to registers in the register file. The multiple registers in the RFs allow this to occur without overwriting the input vector elements. At the end of the update period, all N neurons have been emulated and the updated inputs are in the RF registers, available for the next update period.
The SPIN-L architecture allows the multiplication of the current weight vector elements to be overlapped with the summation of the products of the previous weight vector elements. Likewise the summation can occur simultaneously with the operation of the function generator. Communications occur on a separate bus, allowing them to proceed simultaneously with the operation of the other sections. The outputs of the register file can supply any combination of the register values, including one register supplying both outputs. The bit-serial design allows the units of the PPCs to be highly interconnected using a small number of actual data paths.
The update unit is capable of altering the weight vector elements by incrementing or logically ANDing. The PCT is capable of calculating a summation of N terms in $O(\log_2 N)$ cycles. The storage and multipliers in the M-FG section provide for increased performance by allowing continued pipelined operation. It should be pointed out that SPIN-L and SPIN machines are fully compatible when they perform recall (or feedforward) neurocomputing operations.
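Functionally, one update period of the recall operation can be sketched as the loop below. It is a sequential, word-level model under the assumption of a fully connected network with a `tanh` activation (the activation choice and all names are illustrative); the overlap between multiplication, summation and function generation that the hardware achieves is not modelled.

```python
import math

def recall_cycle(weight_matrix, inputs, activation=math.tanh):
    """Emulate all N neurons once and return the new input vector.

    New outputs are collected separately so the current inputs are not
    overwritten, mirroring the role of the extra RF registers.
    """
    new_outputs = []
    for weight_vector in weight_matrix:           # one weight vector at a time
        products = [w * x for w, x in zip(weight_vector, inputs)]  # PPCs
        net = sum(products)                       # PCT summation
        new_outputs.append(activation(net))       # function generator
    return new_outputs

weights = [[0.0, 1.0], [1.0, 0.0]]
state = [0.5, -0.5]
next_state = recall_cycle(weights, state)
```

Keeping `new_outputs` apart from `inputs` reflects the requirement that every neuron in an update period sees the same input vector.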
In this section we have presented an overview of the proposed machine and its functionality during recall. We also presented the components that will implement the additional computation and storage requirements imposed by learning algorithms. To show the performance of the proposed machine we now present an evaluation approach for neuro-emulators that focuses on the architectural design, i.e. it is implementation and technology independent.
Performance evaluation approach
There have been a number of approaches proposed to measure the performance of a neuro-emulator. Most of them, however, have been technology-oriented; i.e. the assumptions and/or implementations are based on a given technology. We propose an alternative performance measurement based on an architectural perspective where the delays in the critical path are taken into account. The approach for the evaluation is introduced by means of an example.
Evaluation method
In order to evaluate the neuro-emulator performance, we have identified the computational requirements of the neural network paradigm as well as hardware assumptions that are common to all the neuro-emulators. The governing equation for the emulation of a fully connected artificial neural network with N neurons is:
When evaluating the performance of a network of neurons governed by equations in the form of Eq. (1), we have defined an update cycle as the time required to compute and communicate all the new neuron values. Any evaluation of different architectures must be based on a set of realistic assumptions [26–28]. It is imperative that the evaluation technique itself does not give any organization an unfair advantage. The evaluation must be carried out at a level where the elements of the organizations have an impact on the performance results but implementation issues do not influence the results. The assumptions that will be used to define this level of evaluation are:
• Realization of Eq. (1). All neuro-emulators use Eq. (1) for the purpose of emulating an N-neuron fully connected network. This equation is the most general equation for neural network applications.
• A non-linear activation function. The activation function is considered to be a non-linear function [13, 14].
• L-bit precision. All architectures will use L bits of precision for both the weight and neuron values.
• L most significant bits required by activation function. The 2L bits produced by the multiplication will be truncated to the most significant L bits after the sum of products operation.
• No bit-serial activation function. The activation function requires that the L bits all be present before beginning its operation. A number of implementations use a lookup table approach that requires all the input bits to be present [29].
• All inputs must be available to start another cycle. Due to the recursive nature of Eq. (1), the emulation of neuron i for time t + 1 cannot begin until all N output values for time t have been calculated.
• Similar hardware components have the same delay and cost. Hardware of a given type has the same delay and cost regardless of which architecture it is in, i.e. an adder has the same delay and cost in all architectures.
• Bit serial operations. A large number of neurons are usually emulated; this in turn imposes a need for a large amount of hardware. Without losing the generality of this study, we have assumed bit serial implementation as a consideration for practical implementation.
The focus of this evaluation approach is on establishing the architectural impact on performance. Emulators need to emulate the entire network before another iteration can start; this is due to the recursive nature of Eq. (1). We have identified the delay incurred to emulate the entire network once. In this delay calculation, all the conditions such as setup, computation and storage must be taken into account. The delay is incurred along the critical path of the architecture. By determining this critical path from a study of the architecture, the delay along this path during the emulation of the network can be calculated.
To avoid technology and implementation issues we express all delays in terms of functional units rather than in terms of time (Fig. 7). If the delay were expressed in terms of time, two implementations of the same architecture could show different performance. The delays are represented by the symbol d, with a subscript indicating the operation associated with the delay. The subscripts used in this study are: A (adder), M (multiplier), AF (activation function), and B (communication). Other delays will be discussed in the organizations for which they are relevant.
Virtual emulation
Virtual emulation is required when the number of neurons (m) of the ANN being emulated is larger than the machine's physical neuron capacity (N). For this virtual emulation mode of operation, N of the m inputs are multiplied by their corresponding weights. These products are added together and this partial sum is stored in an accumulator. The subsequent partial sums are added to the accumulator until all m inputs have been processed. This total sum will then be passed to the activation function resulting in the output value for that neuron. Only N of the m neurons have been emulated, so the process continues iteratively, in the manner described above, with the hardware emulating the next N neurons until all m neurons have been emulated:
In order to show how the proposed architecture scales as the number of neurons increases, we have plotted four different machine sizes. Fig. 8 shows the recall delays for 32-, 64-, 128-, and 256-neuron machines.
When the problem size is smaller than the machine, SPIN-L provides a delay that is dependent on the problem size (in a linear fashion). When the problem size is larger than the physical machine, the size of the machine has a small impact on the delay. At the point when the machine is used in virtual mode, the delay remains within the same order of magnitude.
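The partial-sum accumulation used in virtual emulation can be sketched as follows. The sketch covers a single neuron with m inputs processed N at a time; the names `virtual_neuron` and `n_physical` are hypothetical, `tanh` is an assumed activation, and the batching of neurons into groups of N is left out for brevity.

```python
import math

def virtual_neuron(weights, inputs, n_physical, activation=math.tanh):
    """Emulate one neuron with m inputs on hardware handling n_physical at a time.

    Returns (output, passes), where passes is the number of partial sums
    that were accumulated before the activation function was applied.
    """
    accumulator = 0.0
    passes = 0
    for start in range(0, len(inputs), n_physical):
        chunk_w = weights[start:start + n_physical]
        chunk_x = inputs[start:start + n_physical]
        accumulator += sum(w * x for w, x in zip(chunk_w, chunk_x))  # partial sum
        passes += 1
    return activation(accumulator), passes

# m = 5 inputs on a machine with N = 2 physical multipliers.
output, passes = virtual_neuron([1, 1, 1, 1, 1], [1, 2, 3, 4, 5], 2,
                                activation=lambda z: z)
```

With m = 5 and N = 2 the neuron needs three passes, i.e. the ceiling of m/N, before its total sum reaches the activation function.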
ANN learning algorithm mapping
SPIN-L has been designed as a general purpose neuro-emulator. Thus, in this context the architecture is flexible enough to accommodate a large number of learning algorithms. Without losing generality, we have mapped four learning algorithms onto the proposed neuro-emulator to show the flexibility and potential uses of the structure. The four learning algorithms included in this study are: Self-Organizing Map, Adaptive Resonance Theory 1, Backpropagation, and the Hopfield algorithm. We have chosen these four algorithms because they are not only widely used but also present a diverse set of computing requirements. The mapping of the Self-Organizing Map is presented in this section; the mappings of the other three algorithms are contained in the Appendix.
To show the strength of the machine we present the mappings of the four algorithms onto the machine with the operations fully pipelined. The mappings consist of the required operations preceded by a number and a label. The label naming conventions used in the mappings are presented in Table 1.
In the mappings the number indicates operations that occur simultaneously during pipelined operation, and the label corresponds to a particular section of hardware. Lower case labels indicate units within a main section that are operating in a pipelined fashion. The labels that are bold are the major contributors to the execution delay; units within sections appear in italics to indicate the same. The mapping and SPIN-L resource utilization for each algorithm are explained below.
Self-organizing map
The self-organizing map (SOM) algorithm [11] is an example of an unsupervised learning algorithm. The SOM is a clustering algorithm which classifies input patterns based on their similarities. The neurons are placed in an ordered topology which defines the map. Each weight vector represents a point in the feature space described by the map topology. The location of the winning neuron in the map topology indicates which cluster the input pattern belongs to. The neurons in the map take on a spatial ordering, with respect to the clusters, that is preserved as learning progresses [11]. This clustering approach is accomplished by an algorithmic weight updating process. The weights of neurons that are physically near the winning neuron, as described by the map topology, are updated. The amount of change of each weight vector depends on its proximity to the winning neuron as dictated by the neighborhood function. A more detailed description of the algorithm is presented in the literature [11].
SOM algorithm brief description
The ordering of the weights in the map topology (one-, two- or n-dimensional) is an input variable. The location of each weight vector in the pre-determined topology will be downloaded from the host at the beginning of execution. The locations will be specified in n-tuples with each element corresponding to a dimension of the map topology. These arrays will be stored at the sequential end of the architecture. Stored with the locations are tags indicating the weight vector they correspond to. The SOM algorithm uses the angle between the weight and input vectors to determine the closest match. If the lengths of the weight vectors are normalized, a direct comparison of the dot products can be utilized to determine the weight vector closest to the input vector using this criterion [15]. To utilize this method the weights will be normalized according to the following formula:
The normalized weights are then used in the calculation of the dot product. Competitive learning using the dot products selects the winning node. During learning the weight vectors are then updated according to the following equation:
where $F(x_i, x^*)$ is the neighborhood function, $x_i$ is the node being updated and $x^*$ is the winning node. We assume the neighborhood function is Gaussian of the form:
where $r$ represents the weight locations in the map, $\sigma(t)$ is the time-variant width parameter and $\eta(t)$ is the time-variant learning rate. The form of the neighborhood function shows that as the distance from the winner increases the weight update amount will decrease.
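The SOM steps described in this section (normalization, dot-product competition, Gaussian neighborhood, weight update) can be put together in a short sketch. Several details are assumptions rather than the paper's exact equations: Euclidean normalization, a neighborhood of the form eta * exp(-d^2 / sigma^2) where d is the distance between map locations, and the conventional update w + F * (x - w). All function and parameter names are illustrative.

```python
import math

def normalize(w):
    """Scale a weight vector to unit Euclidean length (w must be non-zero)."""
    inv_norm = 1.0 / math.sqrt(sum(v * v for v in w))   # inverse square root
    return [v * inv_norm for v in w]

def neighborhood(r_i, r_win, sigma, eta):
    """Gaussian neighborhood scaled by the learning rate (assumed form)."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(r_win, r_i))
    return eta * math.exp(-dist_sq / sigma ** 2)

def som_step(weights, locations, x, sigma, eta):
    """One learning step: normalize, pick the winner by dot product, update."""
    weights = [normalize(w) for w in weights]
    dots = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    winner = max(range(len(weights)), key=dots.__getitem__)
    updated = []
    for w, r in zip(weights, locations):
        f = neighborhood(r, locations[winner], sigma, eta)
        updated.append([wi + f * (xi - wi) for wi, xi in zip(w, x)])
    return winner, updated

winner, new_weights = som_step(
    weights=[[3.0, 4.0], [1.0, 0.0]],
    locations=[(0,), (1,)],          # one-dimensional map topology
    x=[1.0, 0.0], sigma=1.0, eta=0.5)
```

In this example the second node wins, and the first node, one map position away, receives a smaller update because the neighborhood value decays with distance from the winner.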
SOM algorithm mapping onto SPIN-L
In this section, we present a mapping of the SOM algorithm onto the SPIN-L organization. For this mapping we have manipulated the equations of the SOM algorithm to fully take advantage of SPIN-L's features. These calculations are scheduled to maximize utilization and minimize execution time. The algorithm is implemented as follows: calculate the normalizing terms for the weight vectors, normalize the original weight vectors, calculate the dot product, perform the WTA function, calculate the neighborhood size, and update the weight vectors. Each step is performed for all values before moving on to the next operation, i.e. the normalizing term is calculated for all weight vectors before any weight vector elements are normalized. The mapping, following the conventions described in the introduction of this section, is shown below where idle hardware is not included.
Normalization:
1. PPC Each element of the weight vector is passed to both inputs of its respective multiplier.
   PCT The bit-streams of the products, as they are calculated, are passed to the PCT, which performs the summation.
   MFG The sum of squared terms is used as the input to the M-FG, which evaluates the inverse square root function.
   IST The normalizing terms are stored in the intermediate storage after the M-FG.
2. PPC
   mlt The weight vector elements are used as one input to the multiplier. The normalizing term is broadcast back through the PCT, and is used as the other input.
   upd The bit-streams of the products, as they are calculated, are passed through the update unit and stored in the memory units.
   PCT The normalizing term is broadcast from the intermediate storage through the PCT to all leaves.
Calculation of the Dot Products:
3. PPC The elements of the input vector are multiplied by the corresponding elements of the weight vector. The inputs are passed from the RFs and the weights from the memory units.
   PCT The bit-streams of the products, as they are calculated, are passed to the PCT, which performs the summation.
   PNB The output of the neuron is passed back on the PN-Bus to the last RF. The outputs, as the next output is passed back, are shifted up through the RFs using the inter-RF data path.
Winner Takes All (WTA) Function:
4. PPC The dot products are passed around the multipliers to the PCT using the by-pass path with no delay.
   PCT The dot products enter the PCT MSB first. The WTA operation is performed by comparing the input bits. While equal the bit is passed on through the PCT. If unequal the PCC sets the path to the larger input, effectively blocking the smaller.
Calculation of Neighborhood Size:
5. PNB The location of the winning node in the map topology is passed to the RFs, each coordinate in a separate contiguous RF.
6. PNB The location of each node is passed to the RFs, as was done with the winner, in a pipelined manner.
7. PPC
   sub The coordinates of both will be passed to the subtraction unit, where the non-winner will be subtracted from the winner.
   mlt The bit-streams of the differences, as they are calculated, are passed to both inputs of the multipliers.
   PCT The bit-streams of the squared terms, as they are calculated, are passed to the PCT, which performs the summation operation.
   MFG This scaled value is used as the input to the M-FG, which evaluates the exponential function.
   IS The square of the linear distance, between the winner and non-winner being evaluated, is multiplied by the inverse of the squared neighborhood width parameter.
   IS The result of the exponential function evaluation is multiplied by the learning rate and stored in the intermediate storage after the M-FG.
Weight Update:
8. PPC
   sub The elements of the weight vector are subtracted from the input vector elements using the subtraction unit before the multipliers.
   mlt The bit-stream of the difference, as it is calculated, is passed to one input of the multipliers. The other input is the neighborhood function value, which is broadcast back through the PCT.
   upd The bit-streams of the products, as they are calculated, are passed to the update unit. Here they are added to the original weight values.
   PCT The neighborhood function value is broadcast from the intermediate storage back through the PCT to the multipliers.
In this subsection we have shown how the SOM algorithm can be mapped to the SPIN-L architecture. The manipulation of the scheduling of operations allows the machine to maintain a highly pipelined implementation of the SOM algorithm resulting in good performance. This is made possible by fully utilizing all aspects of the hardware present. The parallel portion of the machine performs the required operations a vector at a time rather than performing element by element calculations. Performance for the algorithm implemented as described is reported in Section 5.
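The MSB-first WTA comparison used in the mapping can be illustrated with a small functional model. The sketch below is an assumption-laden illustration, not a description of the actual hardware: it treats the values as unsigned fixed-width integers, models each comparison node as a function, and evaluates the tree level by level rather than in a pipelined fashion. The names `to_bits`, `compare_node` and `wta` are hypothetical.

```python
def to_bits(value, width):
    """MSB-first bit list of an unsigned integer (unsigned values assumed)."""
    return [(value >> (width - 1 - k)) & 1 for k in range(width)]

def compare_node(a_bits, b_bits):
    """Model one comparison node; returns (output bits, selected side)."""
    selected = None  # stays None while both streams are still equal
    out = []
    for a, b in zip(a_bits, b_bits):
        if selected is None and a != b:
            # First differing bit: lock the path to the larger input.
            selected = 0 if a > b else 1
        out.append(a if selected in (None, 0) else b)
    # On a complete tie the left input is reported as the winner.
    return out, (0 if selected is None else selected)

def wta(values, width):
    """Reduce all values through a binary tree; returns (index, value)."""
    streams = [(to_bits(v, width), i) for i, v in enumerate(values)]
    while len(streams) > 1:
        nxt = []
        for k in range(0, len(streams) - 1, 2):
            (a, ia), (b, ib) = streams[k], streams[k + 1]
            out, side = compare_node(a, b)
            nxt.append((out, ia if side == 0 else ib))
        if len(streams) % 2:
            nxt.append(streams[-1])  # odd stream passes through unchanged
        streams = nxt
    bits, idx = streams[0]
    return idx, int("".join(map(str, bits)), 2)
```

Because the larger operand is decided at the first differing bit, each node can forward bits as they arrive, which is what allows the comparison to overlap with the bit-serial data flow.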
Evaluation
The evaluation of the proposed architecture is based on functional delays as described in Section 3. To compare this architecture with other neuro-emulators that have been suggested in the literature we consider the use of a benchmark commonly used for neural network evaluation. We will present first an evaluation of SPIN-L based on resource delay in the critical path. This is shown by deriving the delay equation for the ART-1 learning algorithm. The delay equations for the three other learning algorithms are then presented. This is followed by an evaluation based on million connection updates per second (MCUPS) using the NETtalk routine as a benchmark.
Delay estimation
The implementation of the ART-1 algorithm using SPIN-L is highly pipelined. This results in the overlapping of many operations. The algorithm suffers the delay of only one of these operations, as will be shown in the derivation of the delay equation. The resource delays will be represented by the symbol d and a subscript indicating the resource. The multiplier delay will have an M subscript, the PCC an A, the function generator an AF and the PN bus a B; the delay incurred in storing a value will be denoted by the subscript STORE. We will assume for the purposes of clarity that the number of nodes present in hardware equals the number required by the algorithm, which is equal to N.
We have presented the ART-1 algorithm to show the strength of SPIN-L's pipelined processing and to address the issues of delay prediction in this style of operation (Table 2). In calculating the delay we were only concerned with getting values to the multipliers or the first stage of the adder tree. This is not to say that the processing stopped there, but rather that the next operation at the multiplier would overlap any further operations on those values, thereby absorbing the delay.
We now present in Table 3 the delay equations for the learning algorithms studied in this paper; for Backpropagation we assume a two-layer network.
Virtual neuron emulation
In processing ANNs, as in other computing paradigms, it is often necessary to provide a mechanism to deal with larger problems than those that can be directly handled by the actual hardware. This results in greater flexibility at a lower cost without restricting the complexity of problems that can be addressed with a machine of set size. The proposed machine, and its predecessor, have been shown to be capable of virtual emulation [35, 36]. Below we provide a description of how the SOM algorithm could be implemented in virtual mode. The steps outlined here have been explained in detail in Section 4.1.2.
Table 3. Delay equations for learning algorithms (columns: Algorithm, Delay equation).
The virtual emulation delay equations for the learning algorithms that have been studied in this paper are shown below. In the equations that follow m is the number of neurons required to be emulated and N is the number of neurons the hardware is capable of emulating directly. We now present the delay equation for the self-organizing map during virtual emulation. In Eq. (6) n is the map dimension size:
Eq. (7) shows the delay equation for the Adaptive Resonance Theory as it could be implemented under virtual emulation conditions using the SPIN-L architecture.
The Backpropagation algorithm is used in the performance comparison that follows. In this comparison we report performance values for machine sizes that require the algorithm to be executed in virtual mode. For this reason we have expanded its delay equation for the sake of clarity and reproducibility of our results. The delay has been split into a delay for recall, calculation of the error for all layers, and the weight update process. In Eqs. (8a)–(8c), $V_w$ and $V_{w-1}$ represent the number of neurons in layers $M$ and $M-1$, respectively:
The SPIN-L architecture can still take advantage of the symmetry found in the weight matrix of the Hopfield network. In Eq. (9), which indicates the delay of the Hopfield algorithm in virtual emulation mode, n is the number of bits in the input patterns:
In this section we have presented the delay equations for SPIN-L's virtual emulation performance. The description of the SOM algorithm shows that the architecture continues to fully utilize the hardware present under the rigorous demands of virtual emulation.
Comparison
In order to compare the proposed SPIN-L architecture to other machines we have chosen the NETtalk network, which is a widely used benchmark. The NETtalk network is a three-layer feed-forward network. The configuration we use for comparison purposes consists of 203 input neurons, 60 hidden neurons and 26 output neurons. The neurons in the hidden and output layers each have an additional weight to implement the variable threshold. This configuration results in 12,180 connections between the input and hidden layers, 1560 connections between the hidden and output layers and 86 threshold connections. Overall there are 13,826 variable connections.
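The connection counts quoted above follow directly from the layer sizes. The short Python check below reproduces the arithmetic; the function name `nettalk_connections` is introduced here purely for illustration.

```python
def nettalk_connections(n_in=203, n_hidden=60, n_out=26):
    """Count the variable connections of a fully connected 3-layer network."""
    in_hidden = n_in * n_hidden        # input-to-hidden weights
    hidden_out = n_hidden * n_out      # hidden-to-output weights
    thresholds = n_hidden + n_out      # one threshold weight per non-input neuron
    total = in_hidden + hidden_out + thresholds
    return in_hidden, hidden_out, thresholds, total
```

Calling `nettalk_connections()` with the default sizes returns `(12180, 1560, 86, 13826)`, matching the figures in the text.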
It has been suggested that all the resource delays can be expressed in terms of an adder delay. The multiplier delay is $v \log_2 x$ adder delays, the function generator delay is 2L adder delays, and the bus and store delays are equivalent to one adder delay each. Using 10 ns as the adder delay, a typical value for current technology, we report the performance in terms of Millions of Connection Updates Per Second (MCUPS).
The available resources do not match the number of neurons for the SPIN-L machine sizes 64 and 128 reported. Thus, the architecture operates in a virtual mode for these machine sizes. Eqs. (8a)±(8c) are used to determine the delay under NETtalk network constraints. The delay value is then used to calculate the number of connection updates that could be performed per second running this benchmark. The performance of SPIN-L and other reported neuro-emulators are shown in Table 4 .
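The conversion from a delay value to MCUPS can be sketched as follows. This is an illustrative calculation only: it assumes the delay is the time, expressed in adder delays, needed to update all connections once, and the delay value used in the example is a hypothetical placeholder rather than a result of Eqs. (8a)–(8c).

```python
ADDER_DELAY_NS = 10.0  # adder delay assumed in the text

def mcups(connections, delay_in_adder_delays, adder_delay_ns=ADDER_DELAY_NS):
    """Millions of connection updates per second for one full update pass."""
    seconds = delay_in_adder_delays * adder_delay_ns * 1e-9
    return connections / seconds / 1e6

# Hypothetical example: 13,826 connections updated in 10,000 adder delays.
example = mcups(13826, 10000)  # about 138 MCUPS
```

Under this model, halving the delay per update pass doubles the reported MCUPS, so the benchmark rewards architectures that keep the critical path short during learning.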
From Table 4 the following observations can be made. A SPIN-L 256 bit-serial architecture provides the highest MCUPS performance. For the NETtalk benchmark this SPIN-L machine size outperforms all the other neuro-emulators, even those with a much larger number of processors. A SPIN-L 64 bit-serial machine performs extremely well using a modest amount of hardware: it provides better performance than much larger systems such as the Cellular Array, CM-2 and CM-1. Machines such as CNAPS and SANDY/6 have a similar number of physical neurons; however, they have parallel word implementations. This in turn imposes a need for much more hardware; CNAPS and SANDY/6 may utilize over three times the hardware required for SPIN-L. Using distributed weight vectors requires an increase in memory by a factor of N, as well as increased communications to maintain coherency. SPIN-L scales well: increasing the size of the machine provides performance gains. In this example we did not consider machine sizes above 256, since the machine becomes underutilized. SPIN-L can accommodate this mode of operation easily; however, for larger machine sizes the performance will be of the same order as that for the SPIN-L 256.
Concluding remarks
In this paper we have identified the major computational and communication requirements for a large set of ANN learning algorithms. These algorithms span many styles of learning: supervised and unsupervised, multi-layered, spatial, etc. This set of requirements has been used in the design of a novel neuro-emulator, called SPIN-L. This neuro-emulator operates in a pipelined fashion, which, together with SPIN-L's parallel, reduction, and multi-function generator structure, provides high performance. We have reported the performance of SPIN-L on four different learning algorithms that represent a wide range of requirements as well as learning approaches. These algorithms are: Self-Organizing Map (which requires competitive learning, and the ability to determine node locations within the map), Adaptive Resonance Theory (requiring vigilance testing, and the ability to iteratively disable nodes in competitive learning), Backpropagation (which utilizes the transpose of the weight matrix during the update process, a separate weight matrix for each layer, and the ability to implement a sparsely connected network), and the Hopfield algorithm (the performance of which can be improved with the ability to store weight values to all symmetric locations, and which requires massive communications due to its fully connected nature). We have compared SPIN-L with other machines executing NETtalk (203 inputs, 60 hidden, and 26 output neurons). In this comparison SPIN-L with 64 neurons (directly implemented in hardware) performs extremely well, achieving 131 MCUPS. Its performance is similar to that of other machines with a much larger number of processing elements. SPIN-L accommodates a large number of complex functions in the most cost-effective way by means of a single multi-function generator. This in turn makes SPIN-L the neuro-architecture with the least amount of hardware among the machines with similar capabilities.
Appendix A. Adaptive resonance theory 1
There have been multiple forms of the adaptive resonance theory (ART) algorithm first proposed by Carpenter and Grossberg [12,30–32]. As an example of this style of learning we have studied the binary (ART-1) form for this paper. The ART-1 algorithm is an unsupervised clustering algorithm similar to the SOM. The ART-1 algorithm searches for more than just the weight vector closest to the input vector. It uses a vigilance test to ensure that the chosen weight vector is similar enough to the input vector that it can be considered the correct classifier. To implement this algorithm we have assumed the F2 layer performs the WTA function. A more complete description of the algorithm is given by Carpenter and Grossberg [12].
A.1. ART-1 algorithm description
Like the Self-Organizing Map, ART-1 can use the dot product to determine the weight vector closest to the input vector. The weights used in this dot product will be the bottom up weights f , which are the top down weights TD normalized. The weights will be normalized using the following equation:
These values are used to form the dot products with the input vector. The dot products then undergo competitive learning to determine the weight vector closest to the input vector. The winning vector will be considered to be close enough to the input vector if it satisfies the vigilance criterion. The vigilance criterion has been stated as [12]:
where $V_i$ is the vigilance vector described as:
We have manipulated the vigilance equation to allow the calculations required to take full advantage of the SPIN-L organization. The actual vigilance equation that will be implemented using the SPIN-L machine will be:
When a weight vector passes the vigilance test it will be considered the correct classifier. It will then be updated to reflect the inclusion of the present input in the cluster it represents. If the winner fails the vigilance test it is not considered the correct classifier and competitive learning will occur again with the previous winner(s) disabled.
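The search loop described above (competition, vigilance test, disabling of failed winners) can be sketched as follows. This is a hedged illustration using the common textbook form of ART-1, not the exact equations of this paper: the bottom-up normalization `TD / (beta + |TD|)`, the vigilance test `TD . x >= rho * |x|`, the fast-learning update `TD AND x`, and the names `art1_present`, `rho` and `beta` are all assumptions.

```python
def art1_present(td_weights, x, rho, beta=0.5):
    """Present one binary input; return the resonating node index or None.

    td_weights is a list of binary top-down weight vectors and is updated
    in place when a node passes the vigilance test.
    """
    norm_x = sum(x)
    disabled = set()
    while len(disabled) < len(td_weights):
        best, best_score = None, -1.0
        for j, td in enumerate(td_weights):
            if j in disabled:
                continue
            overlap = sum(t * xi for t, xi in zip(td, x))  # TD . x
            score = overlap / (beta + sum(td))             # assumed normalization
            if score > best_score:
                best, best_score = j, score
        overlap = sum(t * xi for t, xi in zip(td_weights[best], x))
        # Vigilance test rearranged so that no division is required.
        if overlap >= rho * norm_x:
            td_weights[best] = [t & xi for t, xi in zip(td_weights[best], x)]
            return best
        disabled.add(best)  # failed winner is excluded from the next round
    return None
```

Writing the vigilance test as a comparison against a scaled input norm avoids a division, which is in the spirit of the manipulated vigilance equation mentioned in the text, although the exact form used on SPIN-L may differ.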
A.2. ART-1 algorithm mapping onto SPIN-L
To implement this algorithm efficiently we exploit the pipeline ability of the organization, thereby reducing the overall delay by overlapping operations continually. The equations for this algorithm have been broken into steps and scheduled to take advantage of the machine's abilities. The h will be downloaded by the host initially. The implementation proceeds as follows: calculate the dot product of the top-down weights and the input vector, calculate the normalizing term, calculate the scaled norm of the input vector, perform the WTA function, apply the vigilance test, and update the weights. Each step will be performed on all values before proceeding to the next step. The mapping, following the conventions described in the introduction of this section, of the ART algorithm is as follows.
In this subsection we have shown how the ART-1 algorithm can be mapped to the SPIN-L architecture. It is clear from this description that the mapping fully utilizes the organization of the machine. The advantages of the parallel pipelined nature of the machine are also apparent from the description. Performance for the algorithm implemented as described is reported in Section 4.
Appendix B. Backpropagation
Backpropagation is one of the most widely used learning algorithms for multi-layered neural networks [33]. Its dependence on intermediate values from other layers presents demands for storage as well as computational hardware. Backpropagation is a gradient descent algorithm, and as such adjusts the weights to minimize an error function. Its strength is the ability to address the problem of assigning an error to the neurons in hidden layers. This results in two types of error: an error for neurons in the output layer (L), and an error for neurons in the hidden layers (k). A description of the algorithm in greater detail has been presented [33].
B.1. BP algorithm description
Assuming a sigmoid activation function of the form
$\dfrac{1}{1 + e^{-s/a}}$, (B.1)
the error at an output layer neuron is defined as
where i is the desired output value for neuron i, and i is the actual output value of the same neuron. The error for a neuron in a hidden layer is then de®ned in terms of the errors of the neurons in the layer above it as:
where n is the number of neurons in layer $k+1$, and k is the current layer with the constraint that $k < L$. The weights are adjusted in the direction of greatest descent along the error surface, defined by the Backpropagation algorithm to be
where $\eta(t)$ is the time-varying learning rate. A multi-layered network of perceptrons consisting of three layers is capable of handling any mapping between inputs and outputs, where the complexity of the mapping is limited by the number of nodes in the network [34]. This makes Backpropagation an extremely powerful algorithm, while the sparse connections and the values for multiple layers make it a demanding algorithm to implement.
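The error and update computations described above can be sketched using the standard textbook form of Backpropagation with a unit-steepness sigmoid. The exact equations of this appendix are not reproduced here, so the formulas in the sketch, the weight indexing convention, and the function names are assumptions made for illustration.

```python
import math

def sigmoid(s):
    """Logistic activation (unit steepness is assumed)."""
    return 1.0 / (1.0 + math.exp(-s))

def output_deltas(desired, actual):
    """Output layer error: (desired - actual) times the sigmoid derivative."""
    return [(d - o) * o * (1.0 - o) for d, o in zip(desired, actual)]

def hidden_deltas(hidden_out, next_deltas, next_weights):
    """Hidden layer error from the errors of the layer above.

    next_weights[j][i] is assumed to connect hidden neuron i to neuron j
    in layer k+1, so the sum below uses the transpose of that matrix.
    """
    deltas = []
    for i, o in enumerate(hidden_out):
        back = sum(dj * next_weights[j][i] for j, dj in enumerate(next_deltas))
        deltas.append(o * (1.0 - o) * back)
    return deltas

def update_weights(weights, deltas, inputs, eta):
    """Gradient descent step: w[j][i] += eta * delta[j] * input[i]."""
    return [[w + eta * d * x for w, x in zip(row, inputs)]
            for row, d in zip(weights, deltas)]
```

The hidden layer error needs the weights of the layer above read column by column, which is why the update process is said to rely on the transpose of the weight matrix.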
B.2. BP algorithm mapping onto SPIN-L
We will now show that even under the demands of a multi-layered network it is possible for the proposed emulator to maintain a high utilization of available hardware resources. This will be shown by examining the execution of the algorithm at the hardware resource level. The execution of the Backpropagation algorithm on the SPIN-L architecture has been previously discussed [28] , so we will present here an overview of this process. The equations for the Backpropagation algorithm have not been altered. The calculations for the error equations have been distributed across the machine organization to extract a higher degree of parallelism from the overall algorithm. Without losing generality and for presentation clarity we present a multi layer network with one hidden layer. The mapping, following the conventions described in the introduction of this section, occurs as follows.
Recall Process:
1. PPC The input vector elements are multiplied by the weight vector elements using the N parallel multipliers. The inputs are passed from the RF units and the weights are passed from the memory units.
   PCT The bit-streams of the products, as they are calculated, are passed to the PCT, which performs the summation.
   MFG The sum-of-products is used as the input to the M-FG, which evaluates the sigmoid function.
   PNB The output of the neuron is passed back to p x using the PN-Bus.
   b The partial hidden layer errors are passed to the sequential end of the machine from the first RF using the forward bus; they are shifted up to the next RF using the inter-RF bus until arriving at the first RF.
   PCT The scaled output layer errors are broadcast back through the PCT to all RFs.
   IS mlt The partial hidden layer errors are multiplied by the corresponding outputs.
   sub The outputs are subtracted from 1; equivalent to the derivative of the sigmoid activation function.
By storing output values at the sequential end of the architecture this process can be extended to any number of layers. As was stated earlier, three layers can create an arbitrary decision surface with the complexity limited only by the number of neurons in the network. We have not limited the architecture to this configuration and thus provide flexibility while maintaining the ability to realize the full potential of the algorithm at a reasonable cost. Performance of the algorithm implemented as described here is reported in Section 4.
Appendix C. Hopfield learning algorithm
The Hopfield learning algorithm is a type of associative memory. The Hopfield model is an interesting implementation due to its highly interconnected structure. Each node of the Hopfield network is connected to every other node. This requires massive communication abilities of any architecture which is used to realize this model.
C.1. Hopfield algorithm description
The patterns are stored in the network via learning and are then available for recall. The weights are determined using the following equation:
where $x_i$ is the ith bit of the uth input pattern [14]. This algorithm for implementing the Hopfield model requires that the inputs be ±1. The resulting weight matrix will be square, K × K, with K being the number of bits in each input pattern. This matrix will be symmetrical with respect to the diagonal $i = j$. The actual number of calculations resulting in unique values will be equal to:
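The way symmetry reduces the work of building the weight matrix can be sketched as follows. This is an illustration under stated assumptions: the weights use the usual sum of pattern products scaled by the number of neurons, the diagonal entries are computed rather than forced to zero, and the function name `hopfield_weights` is hypothetical. The counter reports how many entries this particular sketch computes, which depends on the diagonal assumption and is not a substitute for the paper's own expression.

```python
def hopfield_weights(patterns):
    """Build a symmetric weight matrix from +1/-1 patterns.

    Returns (weight matrix, number of entries actually computed).
    """
    k = len(patterns[0])
    w = [[0.0] * k for _ in range(k)]
    computed = 0
    for i in range(k):
        for j in range(i, k):  # upper triangle, diagonal included (assumed)
            value = sum(p[i] * p[j] for p in patterns) / k
            w[i][j] = value
            w[j][i] = value    # mirror into the symmetric position
            computed += 1
    return w, computed
```

Storing each computed value at both (i, j) and (j, i) mirrors the tag matching as X, Y and Y, X described for the hardware, so only about half of the K × K products need to be evaluated.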
C.2. Hopfield algorithm mapping onto SPIN-L
The Hopfield algorithm is comprised of storing the patterns, by calculating the weight values, and recall. We will now show how these two functions, as they have been described above, can be implemented using the proposed architecture. The mapping is presented using the conventions described in the introduction of this section.
Calculation of weights:
1. PPC The N multiplications are carried out in parallel.
2. PCT The products will then be passed to the PCT where the summation will be carried out.
Scaling of the weights:
3. MFG The weights are scaled using the number of neurons in the network. This multiplication occurs at the end of the PCT.
4. PNB This value, representing one of the weights, is passed back to the register file using the PN-Bus.
Storing of weights:
5. PPC The hardware allows the tag to be matched as X, Y and Y, X. This results in the weights being stored in the symmetrical positions of the weight matrix.
Recall:
6. PPC
7. mlt The input pattern is multiplied by the weight vectors.
8. rf The outputs, as the next output is passed back, are shifted up through the RFs using the inter-RF data path.
9. PCT The bit-streams of the products, as they are produced, are passed to the PCT, which performs the summation operation.
10. MFG The M-FG is idle (idle hardware will not be shown further).
11. PNB The output of the neuron is passed back on the PN-Bus to the last RF.
Winner Take All:
12. PPC The outputs calculated in step 3 are passed around the multipliers using the bypass path.
13. PCT The outputs enter the PCT MSB first. The WTA operation is performed by comparing the input bits. While equal the bit is passed on through the PCT. If unequal the PCC sets the path to the larger input, effectively blocking the smaller.
14. PNB The winner is passed back on the PN-Bus as the network output.
The recall performance of the Hopfield model has been shown to degrade if an attempt is made to store more than N patterns, where N is the number of weights [14]. Thus we have assumed that there will be N patterns and they will be available at the beginning of execution. The Hopfield algorithm has been shown to be efficiently implemented in spite of its demanding communication requirements. This is achieved using a single bus by evaluating one neuron output at a time, removing any bus contention, with the operations highly pipelined. The performance of the Hopfield algorithm is reported in Section 4.
Vassiliadis, A neural processor with learning capabilities, Technical Report, Department of Electrical Engineering, Delft University, Delft, The Netherlands, 1997.
