Abstract-In this paper, we present a digital system called SP²INN for simulating very large-scale spiking neural networks (VLSNNs) comprising, e.g., 1 000 000 neurons with several million connections in total. SP²INN makes it possible to simulate VLSNNs with features such as synaptic short-term plasticity, long-term plasticity, and configurable connections. For such VLSNNs, computing the connectivity, including the synapses, is the main challenge besides computing the neuron model. We describe the configurable neuron model of SP²INN before we focus on the computation of the connectivity. Within SP²INN, connectivity parameters are stored in an external memory, while the actual connections are computed online based on defined connectivity rules. The communication between the SP²INN processor and the external memory represents a bottle-neck for the system performance. We show that this problem is handled efficiently by introducing a tag scheme and a target-oriented addressing method. The SP²INN processor is described in a high-level hardware description language. We present its implementation in a 0.35-μm CMOS technology, and also discuss the advantages and drawbacks of implementing it on a field programmable gate array.
Synaptic Plasticity in Spiking Neural Networks
Utilization of the neurotransmitter after the first incoming pulse in a spike train.
VHDL Very high-speed integrated circuit hardware description language.
VLSNN Very large-scale spiking neural network.
Corresponds to the PSP.
Presynaptic activity of the target neurons.
Index Terms-Digital

I. INTRODUCTION
A NEUROCOMPUTER is an essential tool for the investigation and evolution of biologically inspired artificial neural networks and their applications. For more than a decade, universities, research institutes, and enterprises have been engaged in the development of ASICs for the simulation of neural networks. However, modern computers compete with a neurocomputer in many respects, e.g., performance, flexibility, and price.
1045-9227/03$17.00 © 2003 IEEE
General purpose computers continue to evolve. They become increasingly faster through various internal techniques, such as higher clock frequencies, pipelining, a larger number of processor units, and improved cache strategies in the processor. On the system level, performance has been enhanced by means of high-bandwidth dynamic memories, such as DDR-RAM and RDRAM, which eases the bottle-neck between processor and external memory. In fact, these computers allow the simulation of small- or medium-size networks ( neurons) in an acceptable time. The software implementation of the NN on such platforms (personal computer, workstation) is surely an economical solution, which is used mostly for proving algorithms or for processing data (e.g., an image) in a static scene. Some of these programs, e.g., GENESIS and NEURON, are developed professionally.
In nontechnical applications, such as the testing and evaluation of models, real-time processing is not strictly necessary. In neuroscience, the demand for simulating complex neuron models in a large network increases continuously. In spite of the increase in performance of modern computers, users in these fields are still forced to simplify their hypothetical complex models further. Hence, they still work with quite small networks, e.g., some thousands of neurons and some ten thousand synaptic connections [1]. Thus, the development of specific high-performance hardware for the emulation of VLSNNs, which also supports learning, will remain a relevant task in the future.
The motivation for developing such hardware is a neurocomputer that is not as extensive as a parallel computer (e.g., the Connection Machine) but is nevertheless capable of simulating a great number of neurons in real time. Such hardware is supposed to be programmable for a great variety of applications (e.g., pattern recognition, background separation, simulation of some functionality in the nervous system, or robotics). For this specific hardware there exist two basic approaches: either the structure of the network is mapped directly onto analog and/or digital hardware, or there are the so-called neurocomputers, which are not tailored to a specific network model but offer a great measure of flexibility.
When the goal of the hardware development is the implementation of a specific model for a specific application (e.g., [2]), an analog circuit is usually applied. The authors in [3] implemented a limited number of spiking neurons in one analog chip and a small silicon retina in another. With several of these chips and a microcontroller, they want to build a larger multichip system allowing exploration of spike-based processing models in a neural network. A similar cascading approach was pursued earlier with a digital method [4]. For the exploration of large networks consisting of 16 K spiking neurons with 16 synapses each, 1 K chips were needed. The hardware requirement of such a concept is too high. Despite this problem, the analog design approach will no doubt continue to be refined and used for testing neural models in the future.
A high performance neurochip allows the real-time computation of larger networks. Due to continuous performance increase of general purpose computers the expensive design of digital ASICs is only reasonable under the following conditions.
• The goal of a digital hardware development for simulating the hypotheses of the biologically inspired neural networks cannot be only a chip, but rather has to offer a system solution.
• Such a platform should allow the evaluation of hypotheses about the functionality of at least a brain area, for example the visual area.
• The system is supposed to be designed architecturally in such a way that an extension of the system to other functionalities, e.g., the auditory area, is possible without great change to the outline of the system.
• Sufficient flexibility in computing a configurable neuron model for various network topologies must be guaranteed. The description of the network topology should be possible both graphically and textually. Without such flexibility and the corresponding comfort, such platforms will not be adopted by potential users.
• The system architecture must be adapted continuously to new findings in neuroscience. Such a system must be conceived as an open-ended project in order to establish a technology basis.
• Building paradigms that can be applied in technical areas is a main task alongside the design of the system. Thereby a bridge can be built between system development and application.
In Section II, the neuron model and the modeling of synaptic plasticity are presented. The various hardware concepts for data reduction and processing are discussed in Section III, where an event-driven technique for reducing computation time is presented. The addressing scheme in a network has an impact on the number of required parameters and on the computation time. We discuss these means in Section IV. Subsequently, the main features of the SP²INN concept are described in detail, including the architecture of the processor. The performance of a conventional workstation and the estimated performance of the SP²INN are reported in Section VI. In the last two sections, a summary and references are given.
II. NEURON MODEL
The spiking neuron model can be classified as the third generation of neuron models after the perceptron and the sigmoidal neuron [5]. Fig. 1 shows the neuron model. This model consists of four main compartments. The synapse model is responsible for the mathematical representation of the behavior of the applied synapse in the network, which can be modeled statically or dynamically. The variables X (1 or 0) represent the activity of the presynaptic neurons. In the static case and without "learning" of the connection, the efficacy is a positive scalar value, which simply represents the strength of the associated synapse and can be modified according to the spike-timing dependent Hebbian learning rule. In this case, the postsynaptic potential (PSP) is equal to this efficacy. In the case of dynamic synapses, the efficacy is a set of parameters for modeling a PSP. In the dendrite model, the integration of the PSPs takes place. The simple program in the soma model describes how the different dendritic potentials (DPs) contribute to form the membrane potential (MP). In the axon model, the resulting MP is compared to the current spiking threshold of the neuron. If the MP exceeds the threshold, then the neuron sends a spike. This pulse reaches the target neurons after a time delay. After the corresponding time delay, the activity state of these postsynaptic neurons is assessed, i.e., their MP is computed. In this compartment there is a loop for modeling the refractory time of the neuron. If a neuron fires, the channels are blocked for a specific time, the so-called absolute refractory period. After this time, in which the hyperpolarization of the action potential runs off, the blockage of the channels slowly relaxes, so that the neuron is able to fire again, but only in response to a large input potential. The time from the end of the absolute refractory period up to the complete opening of the channels is called the relative refractory period. For the modeling of this process, the value of the spiking threshold in the absolute refractory time is set to infinity. In the relative refractory period, a dynamic threshold, which has a maximum value and decays with a time constant, is added onto the static threshold value. By means of a parameter, the model can be programmed as an integrate-and-fire neuron. In this case, all DPs and the new incoming PSPs are shunted to ground during the absolute refractory time after a spike is generated. Otherwise, the potentials are not shunted to ground; in this case, all DPs are decayed with their related time constants. The neuron model is based on the model of French and Stein [6], and related models [5], [7]-[9], [21] are incorporated. In the following, the main features of the neuron model are summarized.
• Only the like neuron model, e.g., the leaky-integrate-and-fire neuron model can be simulated.
• In-and output of the neuron are pulse-coded.
• Static and/or dynamic synapses are programmable.
• Relative and absolute refractory times are supported.
• Each neuron possesses a minimum of one and a maximum of eight dendritic segments, including one leaky integrator. In order to avoid higher memory requirements for storing the parameters of the network, it is assumed that all synapses attached to a dendritic segment have the same time constant for decaying their PSPs. Therefore, each dendritic segment possesses one time constant for decaying its DP. The DP is the sum of the decayed dendritic potential and the new incoming PSPs of the active synapses. The PSPs can be the weights of the afferent synapses (in the case of static synapses) or are computed through the synaptic plasticity procedure. Equation (1) expresses this sum over the synapses attached to a dendritic segment in the current time slice TS; each PSP (excitatory or inhibitory) is gated by the corresponding presynaptic activity of the target neuron. Equation (2) gives the decay of a DP, measured from the last time slice in which all DPs of the neuron were decayed.
• The value of the MP in a time slice is a function of the DPs (3), evaluated for the target neuron with a given index. The function can be composed of the arithmetic operations ADD, SUB, and MULT.
• The static and dynamic spiking thresholds are supported. The dynamic threshold is calculated piecewise according to (4), as a function of the current time slice and the last fire time of the target neuron.
• if then neuron fires! update and after ; end;
• Axonal delays: spikes are transferred over a synapse in the neocortex relatively slowly. In vivo, each synapse has its specific time delay, which ranges from 1 ms to 300 ms or more [10]. These time delays are very important for the dynamics of the network and its stability [11]. The realization of specific synaptic time delays means a higher memory requirement and an increase of the network's computation time. To solve this problem, delays are simplified in the SP²INN: it is assumed that each neuron sends signals to a maximum of two arbitrary populations of neurons (cortical subareas). Time delays are assigned to these populations. This means that all synapses of a neuron signaling to the neurons of a population have the same time delay, which corresponds to a constant value between 1 and 32 TS. We will extend the maximum number of target populations available for a neuron to eight subareas for the second version of the SP²INN, but still with two different time delays. Note that each neuron in the SP²INN can receive PSPs from a maximum of eight different receptive fields, consisting of up to neurons each.
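The per-neuron computation summarized in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SP²INN data path: the exponential decay law, the left-to-right evaluation of the soma program, the handling of the refractory boundary, and all function and parameter names (`update_dp`, `theta_dyn_max`, `t_abs`, and so on) are hypothetical.

```python
import math

def decay(value, tau, dt):
    """Decay `value` over `dt` time slices (exponential law is an assumption)."""
    return value * math.exp(-dt / tau)

def update_dp(dp, tau, ts, ts_last, psps, activity):
    """Decay the stored DP since ts_last, then add the PSPs of active synapses.

    `activity` holds the X values (1 or 0) of the synapses on this segment.
    """
    decayed = decay(dp, tau, ts - ts_last)
    return decayed + sum(p * x for p, x in zip(psps, activity))

def membrane_potential(dps, ops):
    """Combine the DPs into the MP with ADD/SUB/MULT, applied left to right."""
    mp = dps[0]
    for dp, op in zip(dps[1:], ops):
        if op == "ADD":
            mp += dp
        elif op == "SUB":
            mp -= dp
        elif op == "MULT":
            mp *= dp
        else:
            raise ValueError(f"unknown soma operation: {op}")
    return mp

def threshold(ts, ft, theta_static, theta_dyn_max, tau_dyn, t_abs):
    """Static threshold plus a decaying dynamic part.

    The threshold is infinite during the absolute refractory time (assumed
    to last `t_abs` time slices after the last fire time `ft`).
    """
    elapsed = ts - ft
    if elapsed < t_abs:
        return math.inf
    return theta_static + decay(theta_dyn_max, tau_dyn, elapsed - t_abs)

def fires(mp, theta):
    """The neuron sends a spike if the MP exceeds the current threshold."""
    return mp > theta
```

A neuron with three dendritic segments and the soma program `["ADD", "MULT"]` would thus compute `(DP0 + DP1) * DP2` as its MP before the comparison with the threshold.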
A. Synaptic Behavior Modeling
While in a conventional neural network the information is distributed only in the weight matrix, in an SNN employing synaptic plasticity, spike timing primarily takes over this task. Synaptic plasticity is an important property for processing rapidly changing temporal stimuli. This property cannot be reproduced by more neurons or connections with various efficacies in a network.
A lot of approaches were presented for modeling the short-term dynamic of synapses [12] - [14] . These models are based on a phenomenological description of the behavior of the synapses and reflect more or less the internal processes of the synapses. Such models differ in the degree of detail. This is, however, very important for the implementation of the model in hardware. A model with a lot of parameters requires large memory allocation to store parameters and results in a high computation time.
The model described in [13] is implemented in the SP²INN. This model describes both the depression and the facilitation behavior of synapses mainly with two parameters. The parameter A corresponds to the absolute synaptic efficacy, and the other parameter corresponds to the utilization of the neurotransmitter after the first incoming pulse in a spike train. In addition, two time constants describe the recovery from synaptic depression and from synaptic facilitation.
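To make the role of these four quantities concrete, the following Python sketch iterates a commonly used phenomenological recurrence for depressing and facilitating synapses. Whether it agrees term for term with the model of [13] is an assumption, and the names `U`, `tau_rec`, and `tau_facil` are chosen here for illustration only.

```python
import math

def dynamic_psps(spike_times, A, U, tau_rec, tau_facil):
    """Return one PSP amplitude per presynaptic spike.

    A         absolute synaptic efficacy
    U         utilization after the first pulse of a spike train (assumed name)
    tau_rec   recovery time constant from depression (assumed name)
    tau_facil recovery time constant from facilitation (assumed name)
    """
    R, u = 1.0, U          # available resources and running utilization
    psps = []
    previous = None
    for t in spike_times:
        if previous is not None:
            dt = t - previous
            facil_decay = math.exp(-dt / tau_facil)
            rec_decay = math.exp(-dt / tau_rec)
            u_next = u * facil_decay + U * (1.0 - u * facil_decay)
            R = R * (1.0 - u_next) * rec_decay + 1.0 - rec_decay
            u = u_next
        psps.append(A * R * u)
        previous = t
    return psps
```

With a long `tau_rec` and a short `tau_facil` the second PSP of a pair is smaller than the first (depression); with a short `tau_rec`, a long `tau_facil`, and a small `U` it is larger (facilitation).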
The long-term plasticity, i.e., long-term potentiation (LTP) and long-term depression (LTD), is realized by the Hebbian learning rule [15]-[17], which is applied to the adaptation of the parameter A. The relative timing of the pre- and postsynaptic spikes determines the sign (LTP or LTD) and the amount of the synaptic modification (see Fig. 2). A postsynaptic spike that follows a presynaptic spike generates LTP, whereas the reverse ordering produces LTD. Synaptic modification occurs only if the pairing of the presynaptic spike with the postsynaptic spike falls within a window (learning window) of roughly 50 ms [18]. The maximum modification of the synaptic strength occurs when the time interval between the pre- and postsynaptic spikes amounts to just a few milliseconds. The long-lasting effect of the plasticity decreases as the time interval increases. The synaptic strength is not affected when the time interval lies outside the learning window.
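The timing dependence described above can be illustrated with a small Python function. It is a sketch only: the linear fall-off, the symmetric window of 50 ms on each side, the zero change at exactly coincident spikes, and the names `window` and `max_change` are assumptions made for this example.

```python
def stdp_change(dt, window=50.0, max_change=0.1):
    """Relative change of the synaptic strength for one spike pairing.

    dt = t_post - t_pre in ms. Positive dt (post follows pre) gives LTP,
    negative dt gives LTD, and pairings outside the window give no change.
    """
    if dt == 0 or abs(dt) >= window:
        return 0.0
    # Largest change for intervals of a few ms, shrinking toward the window edge.
    magnitude = max_change * (1.0 - abs(dt) / window)
    return magnitude if dt > 0 else -magnitude
```

For instance, `stdp_change(2.0)` is a larger potentiation than `stdp_change(20.0)`, and `stdp_change(-5.0)` is a depression.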
The parameter SCR in Fig. 2 describes the percentage change of the parameter A depending on the time difference between the last fire time (FT) of the postsynaptic neuron and the spike arrival time (SAT) from the presynaptic neuron. The parameter HebbTime describes the size of the learning window. The LTD is independent of the synaptic strength, whereas the LTP depends on it. Strong synapses (large A) are altered only by a small portion compared to weak ones [16], as shown by the dotted lines in Fig. 2. That is, the learning curve in Fig. 2 for long-term potentiation is displaced in the direction of the arrow the larger the parameter A is (dotted lines). This mechanism prevents the fast saturation of the synaptic strength. It also causes a fast adaptation of A and supports the stabilization of the network activity. In the SP²INN, the computation of the long-term plasticity is carried out before the computation of the short-term plasticity. At this time, it is not yet clear whether the postsynaptic neuron fires or not. Therefore, we only consider the two latest presynaptic pulses and the last spike of the postsynaptic neuron. That means we compute the difference between the time of the current incoming pulse and FT (5) as well as the difference between FT and SAT (6). The time of the current incoming pulse corresponds to the current TS, while SAT corresponds to the arrival time of the last spike of the presynaptic neuron. Whereas the first difference is always positive, the second can be negative. A forward adaptation and a reverse adaptation are computed correspondingly for the change of the parameter A. The second term is only valid when the second difference is positive, since if it is negative the corresponding contribution would already have been computed at a former time. Fig. 2 represents the two scenarios mentioned, in which the second difference is positive and negative, respectively. In the SP²INN, the adaptation curve in Fig. 2 is linearized. The modified A is then given by (7), together with (8) and (9).
III. STRATEGY OF DATA PROCESSING IN THE SP²INN
A large biologically inspired pulse-coupled NN has some hundred thousand or even up to a few million neurons. In spite of their sparse connectivity, the number of connections amounts to some millions. As soon as such networks are used for real tasks, such as object and pattern recognition in real-world scenes, the computation time required for updating the states in the network exceeds a critical limit. At least the following three points are responsible for this problem:
1) number of neurons and connections to be processed;
2) addressing scheme of the connections;
3) topology of the connections.
Depending on the given task, one needs a specific number of neurons and connections. The network parameters are usually stored in the external memory. Because the clock frequency of ASICs increases fast while that of external memory devices rises only slowly, it is undesirable to transfer the data via the chip interface. Because of the simplicity of the operations in neural networks, the average "lifetime" of the data on the chip is very short. Usually, a great amount of data is read per unit of time from the external memory into the chip (processing unit) and written back into the memory after a short stay. Various strategies can be pursued for relaxing this "bottle-neck" problem. One possibility would be the distribution of the network onto many standard processors working in parallel. In such a multiprocessor system, the parameter sets can be kept either in the external memory, at high expense, or on-chip. In the latter case, the chosen microprocessor (e.g., a DSP) should have a relatively large memory [19]. But even in such a system, the communication between the corresponding sections of the network needs to be handled via the interfaces of the responsible processors. In addition, in the case of dynamic stimuli, one has to face a high expenditure for data management. Only a minimum of the computational potential of such processors is exploited in this case. Therefore, the computation of the algorithm is typically limited by data communication (it is I/O-bound) and not by data processing.
Another strategy pursues the reduction of the needed parameters, e.g., the number of neurons and connections, by replacing many simple model neurons with a few neurons that have more complex but more powerful models. In a multiprocessor system working in parallel, the capability of the system increases with the number of processors. In conventional neural networks, only the soma of the neuron is modeled as a processing unit, and the synapses are modeled as passive wires that only transmit the signals. In a modern network, the synapses are modeled more realistically: they not only transmit the signals but also process them. In such a system, the number of processors grows not only with the number of neurons but also with the number of synapses. Such a model needs more operation steps than simple models for updating the neuron states. Because of their higher capability, complex neurons replace a larger number of simple neurons, whereby the amount of data to be processed is reduced. Also, the "lifetime" of the data on the chip is prolonged. This strategy was pursued in the SP²INN by modeling dynamic synapses. The previous discussion about relaxing the "bottle-neck" problem was a matter of reducing the amount of required data. The processing sequence of the parameter sets can also contribute to the solution of this problem. This strategy is illustrated in Section III-A.
The connections in the network and their topology can be generated either statically, point to point, during the initialization at the beginning of the simulation, or dynamically, on-line during the simulation, by means of limited parameter sets. In the second case, not the entire topographical information is stored in the external connection memory, but only a small subset of it. With a hierarchical distribution of the network and regular receptive fields, the needed parameters and the computation time can be reduced. This solution is preferred in the SP²INN.
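As an illustration of generating connections on-line from a small parameter set, the sketch below derives the source neurons of a regular receptive field from the position of a target neuron instead of reading a stored point-to-point list. The rectangular field shape, the row-major neuron numbering, and every name in the code are assumptions made for this example; the actual connectivity rules are not specified in this section.

```python
def receptive_field(target_xy, src_width, src_height, rf_width, rf_height):
    """Return source neuron numbers of a rectangular field centred on the target.

    Neurons of the source layer are assumed to be numbered row by row,
    and field positions outside the layer are simply clipped.
    """
    tx, ty = target_xy
    x0 = tx - rf_width // 2
    y0 = ty - rf_height // 2
    sources = []
    for y in range(y0, y0 + rf_height):
        for x in range(x0, x0 + rf_width):
            if 0 <= x < src_width and 0 <= y < src_height:
                sources.append(y * src_width + x)
    return sources
```

Only the field size and the layer size need to be stored per projection here, rather than one address per connection.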
A. Processing Period of the Neurons
During the simulation of neural networks with a digital platform, the simulation time is divided into small sections. Every such period is designated as a time slice and usually lasts 1 ms, in analogy to the duration of an action potential of a biological neuron. Sometimes a time slice amounts to 0.1 ms for the simulation of fast spiking areas. Often a platform is referred to as real-time capable only when it is able to update the network state within a time slice. The definition of "real-time capability" in terms of the duration of an action potential is not precise; rather, it should be associated with the stimuli and the network complexity. In some works [20]-[22], an arbitrary percentage of less than 1% for the network activity is stated as a condition for "real-time capability." Both a small network with a high stimuli frequency (SF) and a large network with a low SF can possibly be simulated in real time. Here, the real-time capability of the simulator means that the network achieves a "built-up" state before the next stimulus. The time required for attaining this state can amount to one stimuli period. The "built-up" state depends on the respective task and can mean, for example, the separation of the objects from the background or their synchronization. Usually, such tasks are carried out in different accuracy stages, from coarse to fine. The stimuli period therefore has to be divided reasonably into several time slices. The time slices are not equidistant and can be longer after the arrival of a new stimulus than in the "built-up" state [23], because the amount of data to be processed is very much higher in the coarse stage than in the fine stage. The "real-time capability" of a system is thus associated with the amount of data to be processed.
Without reduction of these data in a network consisting of some million neurons and some 10 000 000 synapses for the processing of a real scene, the "real-time capability" of the relevant simulator would not be guaranteed.
A considerable reduction of the data to be processed in a network can be achieved by storing the numbers of all active neurons in a list (for the updating of the neuron states). Then only the synapses of these neurons are updated and supplied as current PSPs to their receivers. The other synapses in the network are ignored. This is valid only if the remaining potentials of the ignored synapses are accounted for in another way, e.g., by grouping a part of the synapses. As a simplification, it is then assumed that all synapses in a group have the same time constant for decaying their potentials and the same polarity (either excitatory or inhibitory). With this assumption, it is sufficient to define a DP corresponding to the sum of all PSPs in a group. The computation time saved thereby depends on the stimuli, the network topology, and the parameters. After the updating of the DPs in the network, i.e., after decaying the old values of the DPs and possibly accumulating the newly arrived PSPs, these are summed up for calculating the current MPs (see (1)-(3)).
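The event-list idea can be sketched as follows. The dictionary-based storage and the names `fan_out` and `dps` are purely illustrative assumptions; the point is only that work is done for the synapses of neurons that actually fired, and that PSPs are accumulated into one DP per group of synapses.

```python
def process_time_slice(active_sources, fan_out, dps):
    """Deliver PSPs only from the neurons in the event list.

    active_sources  numbers of the neurons that fired (the event list)
    fan_out         source number -> list of (target, segment, psp) entries
    dps             (target, segment) -> accumulated dendritic potential
    Returns the set of target neurons whose state must be recomputed.
    """
    touched = set()
    for src in active_sources:
        for target, segment, psp in fan_out.get(src, []):
            key = (target, segment)
            dps[key] = dps.get(key, 0.0) + psp
            touched.add(target)
    return touched
```

Synapses of silent neurons are never visited, so the cost of a time slice scales with the network activity rather than with the total number of synapses.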
A further reduction of the data can be achieved by neglecting DPs with small values. These DPs can either be specially marked in the memory [24] or not be stored at all [22]. In the first case, neither the memory requirement nor the "bottle-neck" problem between processor and external memory is reduced. With this approach, the time required for calculating the MPs could be reduced, since the DPs with a very small value are ignored. However, this gain is rather marginal in a data path with a pipeline structure for calculating the MPs. In the second case, both the memory requirement and the "bottle-neck" problem are reduced, since only the DPs to be processed are stored in the external memory. However, the control expenditure is higher than with the first approach. When PSPs occur at dendrites whose potentials are not stored in the memory, existing entries must be displaced in order to create "gaps" for the new DPs. In this case, the gain of time is diminished again. With fast-changing stimuli, the advantages of both procedures can be reduced further.
Both of the above-mentioned procedures have in common that all nonnegligible DPs are decayed in every time slice in order to compute the activity state of the neurons. A third procedure, which is implemented in the SP²INN, is introduced here. With this procedure, the activity computation is carried out only for the neurons that receive a new spike (spike event: SE). A neuron can fire without new incoming pulses if its MP is very high in the relative refractory period: the two curves, MP and spiking threshold, could meet sometime later. During the activity computation of a neuron, it is therefore calculated in advance whether this neuron would be able to fire at a later time without any new SE. If that is the case, then this neuron is tagged accordingly. This procedure is designated as burst prediction (BP). The predicted time, which can be the next firing time, is the time at which the MP is equal to the current value of the spiking threshold. Otherwise, the neuron remains in idle (standby) status. The decaying of the DPs is computed retroactively. With that, the I/O-bound problem is bypassed and the computation time decreases considerably. The predicted time becomes invalid as soon as a new spike arrives before it. However, this procedure is computed for the neuron model in which the potentials are not shunted to ground; in the case of the integrate-and-fire neuron, all DPs are shunted to ground after generating a spike.
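The two ingredients of this procedure, retroactive decay and burst prediction, can be sketched in Python as below. The sketch assumes exponential decay, a single time constant for the MP, a forward search over a bounded number of time slices (`horizon`), and hypothetical names throughout; it is not the method used in the hardware.

```python
import math

def lazy_dp(dp, tau, ts, ts_last):
    """Retroactive decay: bring a stored DP up to date only when it is needed."""
    return dp * math.exp(-(ts - ts_last) / tau)

def predict_burst(mp, tau_mp, theta_static, theta_dyn, tau_dyn, horizon=32):
    """Predict when a neuron could fire without any new spike event.

    Returns the number of time slices after which the decaying MP first
    reaches the decaying threshold, or None if that does not happen
    within `horizon` time slices (the neuron then stays idle).
    """
    for k in range(1, horizon + 1):
        mp_k = mp * math.exp(-k / tau_mp)
        theta_k = theta_static + theta_dyn * math.exp(-k / tau_dyn)
        if mp_k >= theta_k:
            return k
    return None
```

A neuron for which `predict_burst` returns a value would be tagged for that future time slice; if a new spike arrives earlier, the prediction is discarded and recomputed.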
IV. ADDRESSING METHODS FOR THE NEURAL CONNECTIONS
The inter-neuron connectivity of a central neuron can be classified in two groups, i.e., the connections through which a group of neurons signals to the particular neuron (target-oriented) and those from this neuron to other neurons (source-oriented). In the SP²INN, the target-oriented connections are assigned to the dendritic segments and the source-oriented ones originate from the axon. In both cases, particularly for medium-size and large networks, all parameters of the neurons, e.g., DPs and synaptic efficacies, are usually stored in off-chip memory. As mentioned, in a source-oriented addressing method the connections are considered as outputs of the neurons (divergence of the projection). All synaptic parameters and the related dendrite addresses of the receiver neurons are stored in a connection memory (CM), in a specific block associated with the sender neuron. These blocks are addressed by the event list, i.e., the numbers of the active sender neurons, via a PM. This addressing scheme was applied in some designs [20]-[22]. During the computation of the membrane potential of the target neurons (TNs), the information about the activity state of the source neurons (SNs) is not available, since in the first step all PSPs of the active source neurons (ASNs) are sent to the TNs, i.e., to the corresponding dendritic segments, and only in the second step are the MPs of the TNs computed. Thus, implementing the Hebbian learning algorithm is computationally expensive. In some approaches [25], learning is carried out for a part of the synapses of the network in every time slice, and in others [26] once every N time slices, where N is some hundreds of TS. In these cases, it is unclear according to which criteria a part of the synapses must remain "illiterate." Both procedures have in common that the learning process is based on the relation of the firing rates of the SN and the TN, and not on the relation between the spike arrival time and the time at which the spike of the target neuron is released.
This mechanism prevents, however, a fast adaptation of the synaptic strength at fast-changing stimuli.
In a target-oriented addressing method, the connections are considered as the inputs of the neurons (convergence of the projection). In this procedure, too, the CM is divided into many blocks, not necessarily of the same size. The number of blocks corresponds to the number of neurons of the network to be processed. A block stores the parameters of all incoming synapses of a neuron. The start addresses of the blocks are stored in the PM in such a way that their order corresponds to the neuron numbers. The DPs of each neuron are stored in the same block as its synaptic parameters. For each target neuron to be processed, its DPs and the parameters of the active synapses are read into the processor, processed, and written back. In other words, two accesses (read and write back) to the external CM are necessary per receiving dendritic segment.
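The block layout described above can be mimicked with two plain Python lists. This is a simplified sketch: the CM is modeled as a flat list of values, the PM as a list of start addresses in neuron order, a whole block is read and written back at once (rather than per dendritic segment), and the function names are hypothetical.

```python
def read_block(pm, cm, neuron):
    """Return the CM block of `neuron`; the PM holds start addresses in neuron order."""
    start = pm[neuron]
    # A block ends where the next neuron's block starts (or at the end of the CM).
    end = pm[neuron + 1] if neuron + 1 < len(pm) else len(cm)
    return cm[start:end]

def process_target(pm, cm, neuron, update):
    """Read a target neuron's block, process it on-chip, and write it back.

    `update` stands for the on-chip computation and is assumed to return a
    block of the same length.
    """
    start = pm[neuron]
    block = read_block(pm, cm, neuron)
    new_block = update(block)
    cm[start:start + len(new_block)] = new_block
```

Because blocks may differ in size, only the PM needs one entry per neuron, while the CM holds exactly as many entries as there are stored parameters.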
Since the information about the ASNs is present during the computation of the MP of a target neuron, the parameters of the active synapses can be adapted immediately by the Hebbian learning rule. The advantage of this spike time-based Hebbian learning rule over the one based on the firing rate is that it is very well suited to the adaptation of the synaptic strength at fast-changing stimuli. This procedure can be implemented in a network with target-oriented addressing without additional time or hardware expenditure.
A further advantage of this is that the "lifetime" of the data is extended in the chip in comparison to other procedures. The reason for this is that different operations, such as the decaying of the DPs, the generation and the integration of the PSPs, the computation of the MP as well as the adaptation of the synaptic strength for a single TN are carried out consecutively. It is not needed to transfer a large number of data exceeding the chip limit, whereby the "bottle-neck" problem is eased.
A. Tag Scheme for the Target-Oriented Addressing
As previously mentioned, the "connection vein" of the network pulses with the activity of the source neurons. This holds for both the source-oriented and the target-oriented procedure. In the latter, if a neuron fires, its number is converted into a bit and entered in a memory, the SNT. The row and column addresses of this memory are extracted from the neuron number by a simple method, the NTC. Likewise, the numbers of all TNs receiving a spike from an ASN are converted into bits and entered in a further memory, the TNT. Every bit in the SNT memory and the TNT memory is assigned to one neuron.
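A one-bit-per-neuron tag memory of this kind can be sketched as follows. The paper does not spell out the conversion method here, so the division and modulo mapping, the row width `ROW_BITS`, and all function names are assumptions chosen for illustration. The reverse conversion from a tag position back to a neuron number is included because the tags have to be read out again later.

```python
ROW_BITS = 32  # hypothetical row width of the tag memory


def neuron_to_tag(neuron):
    """Neuron number -> (row, column) of its tag bit (assumed mapping)."""
    return divmod(neuron, ROW_BITS)


def tag_to_neuron(row, col):
    """Reverse conversion: (row, column) of a tag bit -> neuron number."""
    return row * ROW_BITS + col


def set_tag(mem, neuron):
    """Tag a neuron with "1"; tagging the same neuron twice changes nothing."""
    row, col = neuron_to_tag(neuron)
    mem[row] |= 1 << col


def read_tags(mem):
    """Yield the numbers of all tagged neurons in ascending order."""
    for row, word in enumerate(mem):
        col = 0
        while word:
            if word & 1:
                yield tag_to_neuron(row, col)
            word >>= 1
            col += 1
```

A target neuron that receives spikes from several active source neurons is still represented by a single bit, so it is processed only once per time slice.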
We assume that in a given time slice all ASNs and their TNs are tagged with "1" in the SNT memory and the TNT memory, respectively (Fig. 3). In the following time slice the tags are read out from the TNT memory and converted into neuron numbers. From the row and column address of the TNT memory at which a "1" was found, the number of the TN to be processed is extracted by a simple method, the TNC. This number addresses a line in the PM whose content is the start address of a block in the CM. This block includes all neuron-specific parameters, such as FT, DPs, all input synaptic parameters, and so on. The synaptic parameters in the block are stored in a specific order that matches the assignment of the corresponding SNs in the RFs. Storing the respective source neuron number in the block is therefore unnecessary. Using the TN number and the network parameters, the specific fields in the SNT memory are read out.
The addresses of the active synapses are extracted from the SNT. These are reference addresses which are added to the block start address for building the absolute address of a line in the block in which the parameters of the synapses are stored.
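The address computation can be illustrated with a short sketch. It assumes that each synapse occupies one line of the block and that a fixed number of header lines (for FT, DPs, and similar parameters) precedes the synaptic parameters; `header_rows` and the function name are hypothetical. Since the synapses are stored in the order of the source neurons in the RF, the position of a tagged source neuron within the RF directly gives the relative address.

```python
def active_synapse_addresses(snt_tags, rf_sources, block_start, header_rows=0):
    """Return the absolute CM addresses of the active synapses of one TN.

    snt_tags    -- set of neuron numbers currently tagged in the SNT
    rf_sources  -- source neuron numbers of the RF, in storage order
    block_start -- start address of the block of the TN, taken from the PM
    header_rows -- hypothetical number of lines before the first synapse
    """
    return [
        block_start + header_rows + rel  # absolute = block start + relative
        for rel, src in enumerate(rf_sources)
        if src in snt_tags               # only tagged (active) sources count
    ]
```

For example, with an RF of sources `[2, 3, 5, 7]`, tags on neurons 3 and 7, a block start of 100, and four header lines, the active synapses are found at addresses 105 and 107.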
Like the source-oriented method, this addressing method starts from the ASNs. With the source-oriented method, all TNs and their connections are addressed in a forward fashion. With the method discussed here, the addressing occurs both forward and backward. First, the projective fields (all TNs) are tagged in the TNT memory in a forward fashion. Then the active synapses are addressed backward from the respective neurons of these fields into the RFs, which are tagged in the SNT memory. The TNT and SNT memories thus take over the role of the event list memory of a source-oriented addressing scheme.
However, with the source-oriented addressing forward as well as backward addressing are necessary, if the adaptation of the connections is carried out by the Hebbian learning rule depending on the spike-timing.
If a neuron fires, however, the generated spike arrives at its TNs after a delay, the so-called axon delay. In this case the number of this ASN and information about its TNs are entered in the delayed event list (DEL) memory. After the delay has elapsed, the numbers of the ASN and its TNs are converted and tagged in the SNT and TNT memories, respectively. These tags are read out from these memories in the following TS and, as already mentioned, used for addressing the active synaptic parameters. The DEL memory has nothing in common with the EL memory of the source-oriented addressing. It is divided into 32 equal-sized blocks, which bounds the range of delays, in TS, that can be represented. The procedure in the case of BP is similar to that for the delays, with the difference that the neuron number is specially marked when it is registered in the DEL memory. After the corresponding time has passed, these marked neurons are read out again as TNs, and their numbers are converted into tags in the TNT memory. Together with the other neurons, the MPs of these neurons are computed once more in order to determine whether the predicted fire time has changed in the meantime due to new spike events.
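One way to picture a delay memory with 32 equal-sized blocks is a ring buffer in which each block collects the neurons that become due in one particular time slice. This organization, the class name, and the delay range of 1 to 31 time slices accepted by the sketch are assumptions; the text only states the number of blocks.

```python
N_BLOCKS = 32  # number of DEL blocks stated in the text


class DelayedEventList:
    """Ring-buffer sketch of a delayed event list (assumed organization)."""

    def __init__(self):
        self.blocks = [[] for _ in range(N_BLOCKS)]
        self.now = 0  # index of the block belonging to the current time slice

    def schedule(self, neuron, delay):
        """Enter `neuron` so that it becomes due `delay` time slices from now."""
        if not 0 < delay < N_BLOCKS:
            raise ValueError("delay outside the range covered by this sketch")
        self.blocks[(self.now + delay) % N_BLOCKS].append(neuron)

    def advance(self):
        """Step to the next time slice and return the neurons that are now due."""
        self.now = (self.now + 1) % N_BLOCKS
        due, self.blocks[self.now] = self.blocks[self.now], []
        return due
```

The neurons returned by `advance()` would then be tagged in the SNT and TNT memories, as described for spikes without delay.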
Despite this expenditure, the target-oriented addressing was favored in SP²INN because of the advantages mentioned. With the event-driven and target-oriented concept, the computation time of the network is strongly reduced.
V. PROCESSOR
At the beginning of the project there were many unanswered questions regarding the stability of a VLSNN employing dynamic synapses, the complexity required for a deterministic task in comparison with a conventional SNN, and the ability to perform object segmentation and synchronization. In order to answer these questions and to carry out the required changes at an early stage of the design process, we developed a software simulator called SpinnSoftSim, or triple S (SSS). At this stage of its development SpinnSoftSim is a stand-alone simulator, but we also intend to extend it into an interface between the user and the accelerator [27]. By analogy with Fig. 1, the processor consists of four units, each comprising several modules, and a system controller that manages the data transfer between these units. The processor (Fig. 4) computes the activity of all target neurons that received at least one spike from a source neuron one time slice earlier. Because of the strong dependency between the integrate unit and the fire unit, and to allow hardware sharing, we implemented both in one larger unit, the IFU. The NTC/TNC module interacts with all units except the synapse unit (SU). Its architecture therefore features multiple interface controllers. It does not possess any time-critical data path, but it contains many complex FSMs for carrying out protocols. The data paths and FSMs of the topology module (TP) exhibit some characteristics suited to algorithmic tasks but are still not time-critical. An IFU holds several architecturally complex and time-critical data paths and FSMs.
The data paths of the SU are also time-critical but architecturally not complex. In order to keep the communication of the processor with the off-chip memory low, there are two on-chip cache modules. Each module consists of small memories for storing the parameters of the active synapses, a controller, and other components. We will discuss the architecture of these data paths more closely. Fig. 5 describes the functionality of the different units of SP²INN and their interaction. All operations listed in one box of the flow chart are executed in parallel by the modules whose names are emphasized in the box. After system initialization, the system controller (SC) writes stimuli to the CM, while the NTC/TNC module tags the corresponding positions in the TNT memory with a "1." When stimulus loading is completed, the SC increments TS. This triggers the NTC/TNC module to convert a tag ("1") in the TNT memory into a neuron number representing an active target neuron (ATN). The SC checks whether the ATN belongs to a new population. If so, the SC reads the population-specific parameter set from the CM and distributes it among the units. In the next step, the SC sends the ATN to the TP and reads the start address of the neuron parameter block from the PM, which is located in the CM. The tags stored in the SNT support the TP in calculating the addresses of the active synapses (Syn.adr) in the CM, where all data are deposited. Therefore, the TP immediately begins to calculate the start and end addresses of a block in the SNT memory by means of the ATN number and a part of the network parameters, i.e., information about the population of an RF of the ATN. These addresses are sent to the NTC/TNC module. This module reads the corresponding SNT lines and sends them to the TP. The TP cuts out the specific part of these lines as defined by the RF form in the network parameters.
Based on the position of the tags ("1") in this region, the TP calculates the relative addresses of the active synapses (Syn.adr) of the ATN in the CM and sends them to the SC. While the TP calculates the Syn.adr, the SC reads the DPs from the CM and forwards them to the IFU, which decays them directly. Thereafter, the SC reads the parameters of the active synapses from the CM and sends them to the SU. The SU then computes the PSPs and accumulates them into a sum as long as the dendrite number remains the same. This sum and the corresponding dendrite number are then sent to the IFU to be added to the related DP. After all active synapses have been completed, the data paths of the SU are finished with the ATN, while the IFU starts to compute the MP with the user-defined simple program and, eventually, the BP. For this reason, the SU has one data path but two cache modules for storing the synaptic parameters of two ATNs, and there are two IFUs working in parallel. Subsequently, the SC writes the updated parameters (synaptic and dendritic) back to the CM. However, supplying a free cache module or IFU with data takes priority over writing the updated data back to the CM. If an ATN fires, it is labeled as an ASN. The NTC/TNC then writes the information about the neuron and its projective fields (PFs) to the appropriate block of the DEL memory, selected according to the axon delay, via the bus ANd (delayed active neuron). In the case of nondelayed spiking, the TP calculates the PFs of the firing neuron (FN) with the user-defined form, while the NTC/TNC writes a tag of the FN into the SNT and those of the PFs into the TNT using the NTC algorithm. SP²INN supports four forms of RFs (or PFs): square, hexagon, rhombus, and ellipse (0, 45, 90, 135). The described process continues until all tags in the TNT (ATNs) have been processed.
The SU computes the PSPs and modifies the synaptic parameter A by applying the Hebbian learning rule mentioned above. These tasks are accomplished mainly by two data paths in the synaptic plasticity module (SPM). It is beyond the scope of this paper to present the architecture of all blocks; however, Fig. 6 gives an example of a block's architecture. It presents the RTL architecture of a data path for Hebbian learning. For clarity, some additional logic circuits for checking the results of the components, e.g., for zero, negative values, or overflow, as well as for truncating values exceeding the appropriate range, were omitted in Fig. 6. When the top-level controller of the SU makes the SPM available to a cache module, the controller of the cache module starts computing the PSP by driving the enable signal load-par high for one clock period (CP). SCR and the inverse of the parameter HebbTime are population-specific parameters, and the FT is a neuron-specific one. They are available before the process begins. First, sub1 calculates a difference. In the next CP, the synaptic parameters, i.e., SAT and A, are read from the on-chip memory of the cache module and fed into the data path. In this period, sub1 produces a second difference while mult1 computes a product. The product of this multiplication is valid only for results smaller than one. Finally, a further term is computed by forming a two's complement. It is saved in a register by the delayed load-par signal and remains unchanged for all synapses of this neuron. In the same CP, mult2 produces a product involving SCR. Its eight least significant bits are cut off, and the result is always smaller than one. With LTPsym, a symmetric or asymmetric LTP (dotted lines in Fig. 2) can be selected. In the next pipeline stage, mult4 and mult3 each produce a further product. Then add calculates a sum. In the last stage, sub2 computes the modified A. This value is fed into the data path for computing the PSP. All multipliers except mult1 are two-stage pipelined, Booth-recoded Wallace-tree multipliers [28]. The adder and the subtractors are of Brent-Kung architecture [29]. These components perform unsigned arithmetic and are instantiated from the DesignWare library of Design Compiler (Synopsys).
Each unit of the processor is divided into controller and data path blocks, which are described in VHDL at RTL. The entire code consists of over 50 000 lines. Because the neural model was continuously changing, this kind of functional partitioning is very reasonable: the expenditure of a possible redesign due to changes in a functional module remains limited. Each unit is synthesized separately by Design Compiler and mapped onto Alcatel's standard cell technology, a 0.35-µm CMOS technology. The design complexity amounts to almost 250 K gate equivalents. All units and the entire chip are functionally simulated (pre- and postsynthesis) using ModelSim. The chip is targeted for an operating frequency of 100 MHz; however, we synthesized all units at 125 MHz using path-based time budgeting. With this 25% margin we did not encounter any timing problems under worst-case conditions.
The architecture of SP²INN is very similar to that of a RISC. As mentioned in Section II, the user-defined simple program describes how the different DPs contribute to the MP. Its simple instruction set consists of seven operation codes (OpCodes) of two bits each for carrying out the arithmetic operations ADD, SUB, and MULT on eight DPs. The IFU reads the OpCodes sequentially and generates the control signals for the data path. The address part of the instructions, i.e., the dendrite number, comes from the SU. In effect, the IFU together with the SU plays the role of the CPU in SP²INN, and the TP and NTC/TNC modules carry out peripheral tasks. Although a single instruction is executed per CP, multiple data paths perform this instruction in parallel. Thus, SP²INN is a single-address computer with an SIMD architecture [30]. Like multithreaded processors, SP²INN combines control-flow and data-flow ideas and utilizes the advantages of both paradigms. Table I lists the main parameters and variables used in SP²INN. With the exception of the variables A, U0, PSP, DP, SAT, and FT, which are modified in general, the parameters remain unchanged within a simulation session. The synaptic parameters including SAT occupy four bytes for the dynamic synapses and two bytes for the static ones. The accuracy of the DPs is 16 bits; however, the accumulator of the arithmetic unit in the IFU is extended to 20 bits in order to avoid overflow. The three time constants in Table I are specific to a dendritic segment (compartment), i.e., they are the same for all synapses assigned to a compartment. All e-functions are linearized, i.e., each exponential is replaced by its first Taylor polynomial, in which the division is replaced by a multiplication with a stored inverse parameter. However, the errors introduced by truncation and rounding in the SP²INN system are not systematic errors, since their amount depends on several variables and parameters such as SAT, FT, the active inhibitory and excitatory synapses, and their time constants.
Therefore, the error can be partially compensated. The remaining variable error can be regarded as a random noise source. It should be mentioned that the neuron model does not yet provide an explicit noise generator. A small on-chip memory used as a lookup table (LUT), with a capacity of, e.g., 2 × HebbTime bytes for the Hebbian learning, offers an alternative solution. The respective variable then forms the address of the memory. The advantage of this solution is that various functions for the decay of the DPs or for the learning algorithm can be programmed freely [31], [32].
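The two options discussed here, linearization and a lookup table, can be compared with a small floating-point sketch. The function names, the clamp at zero, and the table contents are assumptions for illustration; the hardware works with fixed-point values whose exact formats are not modeled here.

```python
import math


def decay_linear(x, inv_tau):
    """First-order Taylor approximation of exp(-x / tau).

    The division by tau is replaced by a multiplication with the stored
    inverse `inv_tau`. Clamping at zero is an assumption of this sketch.
    """
    return max(0.0, 1.0 - x * inv_tau)


def build_lut(size, inv_tau):
    """Lookup table holding the exact exponential; the index is the argument x."""
    return [math.exp(-i * inv_tau) for i in range(size)]


if __name__ == "__main__":
    inv_tau = 0.05
    lut = build_lut(32, inv_tau)
    for x in (1, 5, 10, 20):
        print(x, decay_linear(x, inv_tau), lut[x])
```

The linear form is accurate only for arguments that are small compared with the time constant, whereas a table can hold any function at the cost of on-chip memory.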
One processor can simulate a network consisting of at most 1024 populations, 512 K neurons, 4 M dendritic segments (eight per neuron), and 50 M synapses (225 per dendritic segment and 1 K per neuron). Owing to the high complexity of the entire system (hardware and software) and the benefit of higher flexibility, we are pursuing a prototype board consisting of FPGAs. Besides functional testing of the entire system, this board will allow us to evaluate further models or new features, e.g., the local cortical circuits discussed recently [Cerebral Cortex January 2003, Volume 13 Number 1], before implementing them in the processor.
VI. PERFORMANCE ESTIMATION
Many parameters influence the system performance, including the network activity, the stimuli, the network topology, and the running task. The number of connection updates per second (CUPS) is a suitable measure for comparing the performance of different systems, since it is largely independent of the stimuli and the task. Nevertheless, the applied neural model influences it. Owing to the event-driven and target-oriented concept of SP²INN, the activity state of only a part of the network is computed per time slice. The neurons that receive at least one spike from their ASNs are denoted here as TNs and the related synapses as active synapses, i.e., the parameter CON in (10). In a regular network topology, e.g., RFs in the retina or in the LGN, it is most likely that one TN receives more than one spike from its ASNs simultaneously (see Fig. 7). Therefore, the number of computed target neurons is lower than the number of active synapses. This relation is captured by the connections overlap probability (COP), defined in (10), where CON denotes the active synapses. The greater the COP, the higher the performance of a system applying the target-oriented addressing scheme. In order to estimate the performance, we ran a test bench on a SUN platform. This test network consists of one layer with 256 K model neurons, all with the same properties. Each neuron has three dendritic segments: a feeding segment, one for receiving the excitatory synapses, and one for receiving the inhibitory ones. The feeding dendritic segment is stimulated by the corresponding pixel of a stimulus, a binary matrix of 512 × 512 pixels, via a single excitatory synapse. Each neuron receives 80 excitatory and 80 inhibitory lateral synapses from its 80 neighboring neurons. All synapses show depression behavior [10]. The task of the network is the synchronization of neuron groups corresponding to a specific object in the stimulus.
The objects of the stimulus are 32 rectangles of 16 × 16 pixels (gray level 255), which are evenly distributed in the stimulus. Additionally, there is some random noise. Table II shows the performance of the above-mentioned machines.
The entire SP²INN system has not been completed yet. However, we have tried to estimate the performance of the system accurately. The data transfer via the interface between SP²INN and the CM is still the "bottle-neck" of the network simulation. All data of the system, besides the tag information of the TNT and SNT, are stored in the CM.
This includes the pointer memory and the population-specific, dendrite-specific, and synapse-specific parameters. Therefore, only some of the accesses to the CM are used for the synaptic parameters, while the others can be labeled as "overhead." Only the parameters of the active synapses are transferred into SP²INN, updated, and written back to the CM. The number of active synapses per second (CUPS) can be taken from Table II, and the number of active synapses per TS follows from it. Some of these active synapses are the synapses from the 32 objects (16 × 16 pixels) in the stimulus and the additional noise. The postsynaptic potentials of these synapses are registered as feeding dendrite potentials in the first time slice, so they are not counted among the actual active synapses. When a neuron is active in this network, it signals to 80 neighboring neurons via excitatory as well as inhibitory synapses. From this, the number of target neurons to be computed is obtained.
The population-specific parameters require eight memory rows (32 bytes), and 11 clock periods are needed to read these rows into the chip. The number of populations amounts to 512 (16 × 32 neurons each). We assume that 4.36% of the populations are active per time slice (like the neuron activity in Table II), i.e., 22 populations. Reading their parameters requires 242 clock periods (22 × 11). For each target neuron, four clock periods are required for accessing the pointer memory and 12 clock periods (CPs) for the dendritic parameters. For reading into the chip and writing out of it, each active synapse requires eight clock periods (in the worst case). Further, we assume that 104 clock periods are required for refreshing the memory per time slice. The total number of clock periods per time slice is the sum of these contributions, and the average period of a time slice follows from it, assuming an SDRAM with a clock frequency of 100 MHz.
The number of updated synapses per second (CUPS) in SP²INN follows from the number of active synapses per time slice and this average period.
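The estimation steps can be collected in a small calculation sketch. The per-item costs (11 CPs per active population, 4 + 12 CPs per target neuron, 8 CPs per active synapse, 104 CPs for refresh) and the 100 MHz clock are taken from the text. The function names are assumptions, and the counts of target neurons and active synapses must be supplied by the reader; the values in the demonstration at the end are purely illustrative and are not the figures of Table II.

```python
def clock_periods_per_slice(n_target_neurons, n_active_synapses,
                            n_active_populations=22):
    """Estimate the clock periods (CPs) of CM traffic in one time slice."""
    population = n_active_populations * 11   # population-specific parameters
    neurons = n_target_neurons * (4 + 12)    # pointer memory + dendritic parameters
    synapses = n_active_synapses * 8         # read and write back, worst case
    refresh = 104                            # memory refresh per time slice
    return population + neurons + synapses + refresh


def slice_period_seconds(cps, f_clk=100e6):
    """Average duration of one time slice at memory clock frequency f_clk."""
    return cps / f_clk


def cups(n_active_synapses, cps, f_clk=100e6):
    """Connection updates per second for the given per-slice load."""
    return n_active_synapses / slice_period_seconds(cps, f_clk)


if __name__ == "__main__":
    # Illustrative input values only.
    cps = clock_periods_per_slice(n_target_neurons=1000, n_active_synapses=5000)
    print(cps, slice_period_seconds(cps), cups(5000, cps))
```

With no target neurons and no active synapses, the fixed part of the budget is 242 + 104 = 346 CPs per time slice.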
If we had implemented long-term plasticity in this test bench, the performance would have differed greatly from that in Table II. The number of pipeline stages for computing STP is greater than the number for long-term plasticity. Therefore, the calculation of the synaptic strength is completed in time before it is used in the STP module. This means that computing this important feature does not require additional time in SP²INN, while a general-purpose computer is forced to execute additional computational steps.
VII. CONCLUSION
We presented a novel digital accelerator system for the simulation of VLSNNs. The concept is a processor-based approach. SP²INN provides configurable leaky integrate-and-fire neuron models, spike-timing-dependent Hebbian learning, and STP. In SP²INN, connections are computed online, whereas synaptic parameters are stored in an external memory. We introduced a new tag scheme, which offers very efficient access to these parameters. We discussed an event-driven concept and a target-oriented addressing method. By implementing these concepts in SP²INN, the I/O-bound problem is bypassed and the computation time is decreased. A comparison of SP²INN with an Ultra Sparc workstation (500 MHz) shows a performance increase of more than 35 times.
The processor is described in VHDL, synthesized, and mapped onto Alcatel's standard cell technology, a 0.35-µm CMOS technology. The design complexity amounts to almost 250 K gate equivalents. The chip is functionally simulated (pre- and postsynthesis) using ModelSim. SP²INN is targeted for an operating frequency of 100 MHz.
Considering the performance increase of general-purpose computers, it will be very hard for digital designers to compete against such platforms. Designers will be obliged to provide a complete system solution for simulating VLSNNs, and such a system has to guarantee support for various models. Reconfigurable hardware consisting of modern FPGAs (e.g., [33]) certainly provides such flexibility and a system-on-chip realization; however, it is insufficient to support only simple neuron models. The performance of an FPGA-based system is lower than that of an ASIC-based one. In spite of this disadvantage, state-of-the-art FPGAs can replace digital ASIC solutions for simulating neural networks. Easy-to-use software for interacting with the accelerator is also crucial. We developed a software simulator called SpinnSoftSim, which at this stage is a stand-alone simulator, but we also intend to extend it into an interface between the user and the accelerator. Without such flexibility, the developed platforms will not be accepted by potential users.
