Parallelism and distribution have been considered the key features of neural processing. The term parallel distributed processing is even used as a synonym for artificial neural networks. Nevertheless, the actual implementations are still in search of the appropriate model to "naturally represent" neural computing. And the final judgement is always given in performance figures, keeping the parallelization issue high on the neurosimulation agenda. Two approaches have yielded the best results: parallel simulations on general-purpose computers, and specially developed neurohardware. Programming neural networks on parallel machines requires high-level techniques reflecting both inherent features of neuromodels and characteristics of the underlying computers. On the other hand, emulation of the neuroparadigm requires that the functioning of neural operations be mimicked directly by the hardware. Both approaches are presented, and their advantages and shortcomings are outlined.
Introduction
An artificial neural network is composed of many non-linear computational elements (nodes) which are massively linked by weighted connections. Neuroprocessing is simple: a node receives the input, performs the weighted summation, applies an activation function (usually of a sigmoid type) to the weighted sum and outputs its result to other nodes. Most often, a neurosimulation session consists of two phases: learning and recall. In the learning phase, a set of training examples is presented to the network, and weights are adjusted in accordance with input values, their correlation, and the output values (i.e. a network learns a problem). In the recall phase, a new input set is presented, and a network eventually produces appropriate results. Given a large number of weighted summations and weight updates that have to be performed and a large number of training examples, the learning phase is the most compute-bound part of neuroprocessing.
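The processing performed by a single node can be sketched in a few lines. This is a minimal illustration only: the function names are hypothetical, and the logistic function is assumed as the sigmoid-type activation.

```python
import math


def sigmoid(x):
    """Logistic activation, one common sigmoid-type function (assumed here)."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs, weights):
    """Weighted summation of the inputs followed by the activation function."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum)


# A node with two inputs whose weighted contributions cancel out.
print(node_output([1.0, 2.0], [0.5, -0.25]))
```

In the recall phase only this computation is repeated for every node; the learning phase additionally adjusts the values held in `weights`.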
The connectionist paradigm is most suitable for problems that are difficult or impossible to specify by mathematical means. Unlike traditional computational methods, the basic principle here is to learn a problem by examples, rather than solve it by algorithms. This makes neural networks very attractive for applications in domains like character, speech and image recognition; signal analysis; various prediction systems; control theory and robotics; and similar areas where explicit knowledge is not available.
For a long time, however, artificial neural networks were more a theoretical than a practical field. It was only with the boom in simulation methods and neurohardware that the practical use of artificial neural networks started to grow. In the rapidly changing neurosimulation area, new products soon become obsolete, and each subsequent hardware generation improves performance by an order of magnitude.
To keep track of the latest developments, but to avoid simply listing them all, this article concentrates on techniques and strategies, rather than giving an overview of existing simulators. Furthermore, it focuses on parallelization techniques, since parallelism lies at the very heart of neuroprocessing. Two approaches are explored: parallel simulation on general-purpose computers, and simulation/emulation on neurohardware. Different parallelization methods are discussed, and the most popular techniques are explained (see Figure 1). While the software approach is concerned with finding an optimal programming model for neural processing, the hardware approach tries to imitate the neuroparadigm using the best of silicon technology.
Parallel techniques for simulation on general-purpose computers are characterized by engineering efforts to develop an efficient system, taking into account the specific features of connectionist models and the architectural characteristics of available hardware. As in any parallel application design, one has to consider processing demands, communication demands, parallelization strategies, and restrictions imposed by the underlying hardware.
Processing demands of neurosimulations are considerable, especially in the learning phase, where weights have to be updated. A neuroprogram is usually structured in such a way that input or activation values, grouped in vectors, are multiplied by weights arranged in matrices reflecting the neurons' interconnections. The (non-)linear activation function is then applied to the resultant values, giving the neurons' outputs (which are further propagated or compared with one another). In some models there is a need to determine a maximum among the output values. Such processing requires efficient add and multiply operations, calculation of the activation function (often of a sigmoid type), and comparison operations.
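The structure described above, a weight matrix multiplied by an activation vector, followed by the activation function and an optional search for the maximum, can be sketched as follows. The names `layer_forward` and `winner` are illustrative, and the logistic function is again an assumption.

```python
import math


def sigmoid(x):
    """Logistic activation (assumed sigmoid-type function)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(weight_matrix, activations):
    """Matrix-by-vector product: row i holds the incoming weights of neuron i."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, activations)))
        for row in weight_matrix
    ]


def winner(outputs):
    """Index of the neuron with the maximum output (comparison operations only)."""
    return max(range(len(outputs)), key=outputs.__getitem__)


outputs = layer_forward([[0.0, 0.0], [2.0, 0.0]], [1.0, 1.0])
print(outputs, winner(outputs))
```

The sketch makes the operation mix visible: multiply and add inside the product, one activation evaluation per neuron, and comparisons for the maximum.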
Communication demands are very high, owing to the massive interconnection of neurons. Efficient communication has to be provided for fast value propagation between processing elements. The most frequently used communication schemata are broadcast, circulation, and general routing. Broadcast seems to reflect neural processing naturally, but there are architectures that provide faster data exchange using circulation or general routing. A communication strategy has a great impact on overall performance, and has to be considered together with hardware characteristics and parallelization techniques for each implementation separately.
Parallelization of neurosimulations can be achieved in a number of ways. The amount of parallelism achieved depends on the granularity of problem decomposition. Nordstrom and Svensson [1] propose several structuring approaches, yielding the following levels of parallelism:
training-session parallelism (simultaneous execution of different sessions), training-example parallelism (simultaneous learning on different training examples), layer parallelism (concurrent execution of layers within a network), node parallelism (parallel execution of nodes for a single input), and weight parallelism (simultaneous weighted summation within a node). Each refines the preceding one in a number of possible parallel activities. Although this categorization was made with feed-forward networks with backpropagation learning in mind, it can be applied (to a certain extent) to many other models as well.
Parallel computer architectures as possible hosts for neurosimulations are characterized by a large number of processors organized in different topologies. Depending on the presence/absence of central control, parallel computers may be divided into two broad categories: data-parallel and control-parallel. Data-parallel architectures simultaneously process large data sets using centralized (typically SIMD) or regular (e.g. pipelined) control flow. Control-parallel architectures perform processing in a decentralized manner, allowing different programs to be executed on different processors (typically MIMD). The two categories require quite different styles of programming. The choice of target architecture with its corresponding communication schema, together with the choice of parallelization technique, can have a profound effect on performance.
Performance measurements often play a key role in deciding about the applicability of a neuroimplementation. Owing to the specific nature of neural processing, standard efficiency measures do not guarantee good neuroperformance. To reflect the needs of neural processing, two units of measurement are most commonly accepted: CPS (connections per second), which measures the speed of the recall phase, and CUPS (connection updates per second), which measures the speed of the learning phase.
Data-Parallel Techniques
Data-parallel neurosimulations exploit all the mentioned parallelization techniques. However, three structuring groups are typical: (1) coarse structuring, providing training-example and node-per-layer parallelization; (2) fine structuring, providing single-node and weight parallelization; and (3) pipelining, providing layer (and sometimes also node) parallelization.
Coarse Structuring
Zhang et al. have used node and training-example parallelism to implement backpropagation networks on the Connection Machine. Each processor is used to store a node from each of the layers, so that a "slice" of nodes lies on a single processor. The number of processors needed to store a network is equal to the number of nodes in the largest layer of the network. The weights are stored in a memory structure shared by a group of 32 processors (reflecting the CM-specific architecture, which allows a 32-bit number to be stored across 32 processors and to be accessed by each processor as if it were stored in its own memory). With 64K processors, the CM is a perfect candidate for training-example parallelism. The authors have called this replication, as they replicate networks to make full use of the machine (e.g. if n is the number of nodes in the largest layer and m is the number of replications, then n × m ≤ 64K).
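The replication bound and the two performance units reduce to simple arithmetic. The sketch below assumes the usual reading of CPS as connections evaluated per second and CUPS as connection (weight) updates per second; the function names and the sample numbers are illustrative and are not taken from the cited implementation.

```python
def replications(largest_layer, processors=64 * 1024):
    """Largest m such that largest_layer * m does not exceed the processor count."""
    return processors // largest_layer


def connections_per_second(connections, patterns, seconds):
    """CPS: connections evaluated per second while recalling `patterns` inputs."""
    return connections * patterns / seconds


def connection_updates_per_second(connections, patterns, seconds):
    """CUPS: weight updates per second while learning `patterns` examples."""
    return connections * patterns / seconds


# Hypothetical network whose largest layer has 128 nodes.
print(replications(128))
print(connections_per_second(1000, 50, 0.5))
```

The two rate functions share one formula; they differ only in the phase that is timed, which is why CUPS figures are consistently lower than CPS figures for the same machine.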
The implementation was tested on NETtalk. Peak performance achieved in a training phase was 38M CUPS and a forward pass performed 180M CPS.
The MasPar implementation [4] of a backpropagation network is similar to Zhang's implementation. The network is placed on an array of processors, so that each processor contains a vertical set of neurons (one from each layer), making the total number of processors equal to the number of neurons in the largest layer. Each processor stores the weights of corresponding neurons in its local memory. In the forward phase, the weighted sums are evaluated, with intermediate results rotated from right to left (using MasPar's local interconnection schema). Once the input values have been evaluated, sigmoid activation functions are applied, and the same procedure is repeated for the next layer. In the backward phase, a similar procedure is performed, with errors propagated from the output layer down to the input layer.
This implementation also exploits training-example parallelism. Multiple copies of the same network are placed on available processors. It is particularly convenient to exploit the two-dimensional connection schema of the MasPar computer, with one instance of the network being placed along one dimension, and network copies being duplicated along another. Learning is performed within each copy of a network, and weight changes are accumulated. After a number of training examples have been processed, weights are synchronously updated.
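The accumulate-then-update scheme of training-example parallelism can be illustrated with a deliberately small model. The sketch below is an assumption-laden simplification: it uses a single linear unit trained with the delta rule instead of a full backpropagation network, it runs the copies one after another rather than on separate processors, and all names are hypothetical.

```python
def weight_changes(weights, examples, rate):
    """Accumulate delta-rule weight changes for one network copy without applying them."""
    delta = [0.0] * len(weights)
    for inputs, target in examples:
        output = sum(w * x for w, x in zip(weights, inputs))
        error = target - output
        for i, x in enumerate(inputs):
            delta[i] += rate * error * x
    return delta


def replicated_epoch(weights, examples, copies, rate=0.1):
    """Split the training set over `copies` replicas, then update all weights at once."""
    subsets = [examples[k::copies] for k in range(copies)]
    # Each call below would run on its own group of processors.
    per_copy = [weight_changes(weights, subset, rate) for subset in subsets]
    total = [sum(d[i] for d in per_copy) for i in range(len(weights))]
    return [w + d for w, d in zip(weights, total)]


examples = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0), ([2.0, 1.0], 1.0)]
print(replicated_epoch([0.0, 0.0], examples, copies=2))
```

Because every copy starts from the same weights and the changes are only summed at the synchronization point, the result matches a single copy processing the whole set, up to floating-point rounding.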
Maximal performance obtained on MasPar, measured on NETtalk, is 176M CPS in the recall phase, and 42M CUPS in the learning phase.
Fine Structuring
Rosenberg and Blelloch [2] have used node and weight parallelism to implement backpropagation networks on the Connection Machine. Processors were organized in a one-dimensional array with one processor being allocated to a node and two processors to each connection: one for the output side and one for the input side of a connection. A forward pass was implemented by spreading the activation values over the processors that hold weights for outgoing connections. The connection-processors multiply the values by their respective weights and forward the products to corresponding nodes. Products are incrementally added at the destinations. After the sums have been accumulated, a sigmoid operation is performed, and new activation values are prepared. The backpropagation phase was done in a similar way.
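A sequential sketch of a connection-oriented forward pass may help to picture this decomposition. Each tuple below plays the role of one connection processor; the data layout, the names and the logistic activation are assumptions made for illustration, not details of the cited implementation.

```python
import math


def sigmoid(x):
    """Logistic activation (assumed sigmoid-type function)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward_by_connection(connections, activations, n_dest):
    """Forward pass organized per connection instead of per node.

    connections: list of (source_node, destination_node, weight) tuples.
    """
    # Step 1: every connection multiplies its source activation by its weight.
    products = [(dst, w * activations[src]) for src, dst, w in connections]
    # Step 2: products are incrementally added at their destination nodes.
    sums = [0.0] * n_dest
    for dst, product in products:
        sums[dst] += product
    # Step 3: the node processors apply the activation function.
    return [sigmoid(s) for s in sums]


connections = [(0, 0, 1.0), (1, 0, -1.0), (0, 1, 2.0)]
print(forward_by_connection(connections, [1.0, 1.0], n_dest=2))
```

Step 1 is fully parallel over connections, while step 2 is where the communication cost of this fine-grained mapping arises.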
The NETtalk implementation achieved a maximum speed of 13M CUPS.
Pipelining
Pipelined structuring is often exploited on systolic arrays: specific hardware architectures designed to map high-level computation directly onto hardware. Numerous simple processors are arranged in one- or multi-dimensional arrays, performing simple operations in a pipelined fashion. Circular communication ensures that data arrive at regular time intervals from (possibly) different directions. Systolic computing can also be implemented on a general-purpose parallel machine by enforcing pipelining with rhythmical processing.
Pomerleau et al. [5] have used training and layer parallelism to implement a backpropagation network on a Warp computer with processors organized in a systolic ring. In the forward phase, the activation values are shifted circularly along the ring and multiplied by the corresponding weights. Each processor accumulates the partial weighted sum. When the sum has been evaluated, the activation function is performed. Backward processing is similar, but instead of activation values, accumulated errors are shifted circularly. Training parallelism is achieved by partitioning the training-pattern set and executing different training subsets on different processors independently. The weights are updated periodically by broadcasting the accumulated weight changes.
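The circular shifting of activation values can be simulated sequentially. The sketch assumes one node per processor and a square weight matrix, which is a simplification chosen for brevity; the function name is hypothetical and the activation function is omitted so that only the ring-based weighted summation is shown.

```python
def ring_matvec(weights, activations):
    """Weighted sums computed by circulating activations around a ring.

    Processor p holds row p of `weights` and, initially, activation p.
    """
    n = len(activations)
    held = list(activations)   # held[p]: value currently at processor p
    index = list(range(n))     # index[p]: which input that value belongs to
    sums = [0.0] * n
    for _ in range(n):
        # All processors multiply and accumulate in the same time step.
        for p in range(n):
            sums[p] += weights[p][index[p]] * held[p]
        # Circular shift: every processor passes its value to its neighbour.
        held = held[-1:] + held[:-1]
        index = index[-1:] + index[:-1]
    return sums


print(ring_matvec([[1, 0, 2], [0, 1, 0], [1, 1, 1]], [1, 2, 3]))
```

After n steps every processor has seen every activation exactly once, using only nearest-neighbour communication and no broadcast.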
Performance measurements for the NETtalk application on a Warp computer showed a speed of 17M CUPS.
Chung et al. [6] have applied classical systolic algorithms for matrix-by-vector multiplication when simulating backpropagation networks. They exploited layer and node parallelization by partitioning the neurons of each layer into groups and by partitioning the operation of each neuron into the sum of the product operations and the non-linear function. The execution of forward and backward phases was also done in parallel by pipelining multiple input sets.
This implementation uses a finer pipelined structuring than Pomerleau's and achieves finer parallelization without communication performance losses. The NETtalk implementation was done on a 2-D systolic array with 13K processing elements, achieving a maximum speed of 248M CUPS.
Control-Parallel Techniques
Common characteristics of control-parallel programming techniques are decentralized control and decentralized data distribution. A parallel program is explicitly divided into several different tasks which are placed on different processors. Each processor executes a different program, synchronisation is explicit, and the communication schema is usually general routing (i.e. parallel components communicate by message passing). The most popular control-parallel host for neural simulations is the transputer system. However, the problem with transputer-based implementations is the mapping from a large number of neurons onto a relatively small number of transputers. Some approaches provide support for virtual neurons; others divide computation into subtasks and place them on different machines.
In [7], the authors describe a backpropagation implementation on the T8000 system, consisting of seven transputers, divided into one master and six slaves. The master machine controls computation and maintains global error and state information, while the slaves execute allocated packages (subtasks). A multilayer feedforward network is "vertically divided", so that each slave contains a fragment of nodes from each layer. Computation is synchronized by the master so that the layers are computed in sequence. This implementation uses a technique similar to "coarse structuring", but the execution flow is closer to the SPMD (single program, multiple data) model.
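One possible way to compute such a vertical division is sketched below. The even-split policy, the function names and the layer sizes are assumptions made for illustration; the cited implementation may allocate its packages differently.

```python
def split_layer(size, parts):
    """Split node indices 0..size-1 into `parts` contiguous, nearly equal ranges."""
    base, extra = divmod(size, parts)
    ranges = []
    start = 0
    for p in range(parts):
        count = base + (1 if p < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges


def vertical_division(layer_sizes, slaves):
    """For every slave, the fragment of nodes it holds from each layer."""
    per_layer = [split_layer(size, slaves) for size in layer_sizes]
    return [
        [per_layer[layer][p] for layer in range(len(layer_sizes))]
        for p in range(slaves)
    ]


# Hypothetical two-layer network distributed over six slaves.
plan = vertical_division([12, 8], slaves=6)
for slave, fragments in enumerate(plan):
    print(slave, [list(r) for r in fragments])
```

Since every slave holds nodes from every layer, all slaves stay busy while the master steps through the layers in sequence.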
The authors gave their peak performance as 58.2K CUPS for a smaller network, and 207K CUPS for a larger network (improved performance for a larger network is due to better processor utilization).
Significance of Parallel Techniques
Neural simulations on general-purpose parallel computers have played an important role in determining which parallelization techniques are most appropriate and which hardware characteristics can help to obtain improved efficiency. The lesson learned is that performance figures are not as important as they may appear at first sight. Sound programming concepts can always be successfully reused, while efficient low-level tricks remain at the margin together with outdated hardware. Table 1 lists the presented parallel techniques. Designers most frequently used training, layer and node parallelism to implement connectionist models. Though performance varied, it can be generally concluded that data-parallel implementations significantly outperformed their control-parallel counterparts.
The advantages gained by coarse structuring may seem rather surprising. But due to the simplicity of neurons and complexity of processors, one-to-one mapping has always resulted in poor processor utilization and high communication overhead. Packing more neurons onto a single processor proved to be a much better strategy. The best architectural support was obtained on systolic arrays with circular communication. By rhythmical processing, even a very fine pipelined structuring resulted in improved performance.
The results given are from the late '80s and early '90s. The performance figures are not impressive today, but the techniques used in these implementations are still relevant.
Neurohardware
The size and efficiency requirements of growing neuroapplications cannot be successfully met by general-purpose parallel computers. A new generation of dedicated computing devices incorporating embedded neuroalgorithms is needed. The ideal neurocomputer should consist of simple processors with potential for massive interconnections. It should mimic neural processing and increase the execution speed while preserving acceptable accuracy. Well-known architectural techniques such as instruction caching, pipelining, superscalar technology and bit-serial processing should also be applied.
In the last ten years numerous neurocomputers have been designed. They can be roughly divided into two categories: (1) general-purpose, and (2) special-purpose architectures. General-purpose architectures are based on "generic" neurofeatures, since they support a wider range of connectionist models. Special-purpose approaches aim at the construction of a specific neurodevice that emulates a concrete neural network model.
General-Purpose Neuroarchitecture
A general-purpose neurocomputer offers both hardware and software support for the efficient and flexible execution of different neural network models. The main emphasis is on programmability, i.e. a general-purpose neurocomputer should be (re-)usable, flexible and scalable.
General-purpose neurohardware can be further subdivided into processor arrays and co-processors. Processor arrays are complex VLSI architectures organized in a data-parallel manner. Co-processors are simple boards that can be added to personal computers or workstations, converting these machines into neurocomputers.
Processor Arrays
A general-purpose neurocomputer is usually a cellular array with a large number of processing units connected in a regular topology, optimized for neural networks. A typical processing unit of a neurocomputer has local memory (needed for storing the local weights and state information) and a bus interface providing high interconnection capability. The whole system is connected with a parallel broadcast bus, and usually has a central control processor and a host computer. The programming techniques used for such a system are similar to the data-parallel techniques described above.
The experience gained with general-purpose parallel computers has shown that data-parallel architectures are most convenient for neural processing. Three approaches dominate: (1) systolic arrays, (2) processor arrays of the SIMD type (with processing synchronized per instruction), and (3) processor arrays of the SPMD type (with somewhat looser synchronization, per group of instructions, or per program).
The SYNAPSE System, produced by Siemens [8], is one of the most popular general-purpose neurocomputers. It is based on a two-dimensional systolic architecture, designed to accelerate matrix operations and maximum finding. Its basic components are pipelined MA16 chips, each with its own off-chip weight memory. Each chip has a throughput of 500M CPS. The standard configuration consists of eight MA16 chips, two MC68040 CISC processors for control purposes, and 128 Mbytes of DRAM. The system comes complete with a software package for easier neuroprogramming. The full system, connected to a workstation, performs 5.12G CPS and 33M CUPS.
The CNAPS System (Coprocessing Node Architecture for Parallel Systems), developed by Adaptive Solutions [9], is a typical SIMD architecture comprising up to 64 processors per chip, connected into a one-dimensional array structure. A common bus provides fast broadcast communication and connects processors to the common microcoded instruction sequencer. To support on-chip learning, each chip can hold 128K 16-bit weights. The parallel decomposition uses coarse structuring (node-per-layer parallelization). With low-precision arithmetic, the CNAPS processor can perform 1.6G CPS in the recall phase, while the learning speed goes up to 300M CUPS. The complete CNAPS system may have 512 nodes connected to a host workstation and includes a software development tool. It offers a maximum performance of 5.7G CPS and 1.46G CUPS (tested on a backpropagation network).
The RAP System, developed at ICSI Berkeley [10], is a ring array of DSP (Digital Signal Processor) chips specialized for fast dot-product arithmetic. Each DSP has a local memory (256 Kbytes of static RAM and 4 Mbytes of dynamic RAM) and a ring interface. Four DSPs can be packed on a board, with a maximum of ten boards. Each board has a VME bus interface for connection with a host workstation. The processing is performed in an SPMD manner, thus avoiding the lock-step synchronization typical of SIMD architectures. The parallel decomposition uses a coarse structuring technique, mapping several neurons onto a single DSP (node-per-layer parallelization). A single board can perform 57M CPS and 13.2M CUPS (the maximum speed for a full system is estimated at 574M CPS and 106M CUPS).
Co-Processors
Co-processors or neuroaccelerators are the most popular hardware upgrades for neuroapplications. They are special boards that can be connected either to PCs or workstations. They are used to accelerate the special operations needed for neuroprocessing, basically providing floating-point processors for vector-matrix arithmetic and faster memory access. Usually, suppliers offer a software package (i.e. a library of useful modules) for easier writing of neuroapplications. The overall neuroimprovement makes systems with co-processors several thousand times faster than standard workstations.
The SAIC SIGMA-1 neurocomputer is a PC with a DELTA floating-point processor board and the software packages ANSim (a neural net library) and ANSpec (an object-oriented language). The co-processor can hold 3M virtual processing elements and connections, performing 11M CPS and 2M CUPS.
The Balboa 869 co-processor board is a general-purpose board that can be plugged into both PCs and SUN workstations. Its central processor is an Intel i860 chip, specialized to enhance the neurosoftware package ExploreNet. The maximum speed for a backpropagation network is 25M CPS in the recall phase, and 9M CUPS in the learning phase.
The Lneuro (LEP neuromimetic circuit) [11] is a general-purpose building-block processor produced by Philips. A number of Lneuro chips can be connected to a host transputer, combining the coarse-grained (MIMD-like) parallelism of the host with the fine-grained (SIMD-like) parallelism of the VLSI chips. The reported performance is 19M CPS and 4.2M CUPS.
Special-Purpose Neuroarchitecture
Despite advances, the performance of general-purpose neuroarchitectures is often insufficient. To make further improvements, custom hardware for neural models has to be developed. Instead of general-purpose components, highly specialized chips produced with silicon (digital and analog) technology should be used. In the last few years, optical technology has also emerged as another successful candidate for dedicated neural processing.
The common goal of neurochip designers is to pack as many processing elements as possible on a single silicon chip, thus providing faster connectivity and improving execution time. Optimal performance would be achieved if all neurons were placed on a single chip. However, current CMOS VLSI technology allows only a few hundred neurons to be packed on a single chip. There are two possible strategies for overcoming this problem: either the complete network is integrated on a single chip, or functional blocks (emulating only the compute-bound part of a neuroalgorithm) are integrated on a single chip, which is then added to a host processor performing the rest of the computation.
Digital Technology
Digital technology has produced the most mature neurochips, providing high precision, flexibility and reliability at relatively low cost. Furthermore, owing to mass production, many powerful tools for custom design are available.
There are numerous programmes for digital neurochip design. All major microchip companies and research centers worldwide have announced performance results for their neuroproducts. One pioneering example is the series of neurocomputers dedicated to image processing, developed within the WISARD (Wilkie, Stonham, Aleksander Recognition Device) program at Imperial College, London. One of the fastest neurochips is the WSI chip made by Hitachi, with 576 digital neurons and 36K weights integrated onto a 5-inch silicon wafer, using 0.8 µm CMOS technology. The system of 8 WSI boards performs 2.3G CUPS, measured on a backpropagation network.
Analog Technology
The processing simplicity of a neuroelement, which needs only addition and multiplication, makes the analog approach a promising candidate for the construction of a neurochip. Furthermore, analog electronics offers benefits in terms of packing density, high-speed parallel processing on a chip, and low power consumption. Even the major weakness of the analog chip, its inaccuracy, does not appear to be an obstacle within "inherently fault-tolerant" neurocomputing. Nevertheless, problems such as analog storage of weights, susceptibility to temperature changes and vulnerability to interference make analog chips less attractive (practice also shows that realistic neuroapplications often require accurate calculations). Taking these drawbacks into account, the optimal solution would appear to be a combination of both analog and digital techniques.
An example of an analog neurochip is ETANN (Electrically Trainable Analog Neural Network), made by Intel. This chip can hold 64 neurons with 10K weights and can perform 2G CPS in the recall phase. The price for the speed in the recall mode is the very slow off-line training. The complete system comes together with a PC host.
Hybrid Technology
Hybrid technology combines digital and analog techniques, exploiting the advantages of each approach. The optimal combination applies digital techniques to perform accurate and flexible training in the learning phase, and then uses the potential density of analog chips to obtain finer parallelization on a smaller area in the recall phase.
A typical example of the hybrid approach is the AT&T programme for the development of associative memory chips. It has come up with three products: (1) an analog/digital programmable connection matrix with 54 fully interconnected neurons on a chip; (2) an analog connection matrix with 1104 connections (46 inputs × 24 outputs) which can be used for learning with adjustable weights; (3) a digital pipelined best-match chip with 50 neurons which calculates the Hamming distance of the input vector from the feature vector. Another successful hybrid implementation is the "Boltzmann Machine", produced by Mitsubishi. The chip has 336 one-bit neurons with 28K five-bit connections. The reported learning speed is 28G CUPS, achieved by digital circuits, and the maximum recall speed is up to 1T CPS.
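The best-match operation mentioned above reduces to computing Hamming distances and selecting the minimum. The following sequential sketch uses hypothetical function names and made-up bit vectors; a hardware chip would evaluate all distances in parallel rather than in a loop.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def best_match(input_vector, feature_vectors):
    """Index and distance of the stored feature vector closest to the input."""
    distances = [hamming(input_vector, f) for f in feature_vectors]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


features = [[0, 0, 0, 0], [1, 0, 1, 0], [1, 1, 1, 1]]
print(best_match([1, 0, 1, 1], features))
```

When several stored vectors are equally close, this sketch returns the first one; how a real chip resolves such ties is not stated in the text.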
Optical Technology
Optical technology introduces photons as basic information carriers. They are much faster than electrons, with greater potential bandwidth and fewer interference problems. Furthermore, the processing of light beams offers massive parallelism. These features put optical computing first among possible candidates for the neurocomputer of the future. So far, two general lines of research can be observed here: one is the development of special-purpose associative memory systems; the other is the development of general-purpose high-speed optical processing elements.
Optical techniques are ideally suited to the realization of dense networks of weighted interconnections. Spatial optics offers three-dimensional interconnection networks with enormous bandwidth and very low power consumption. A "classical" example of an optical neurocomputer is the Caltech Holographic Associative Memory.
Performance figures should be taken as being relative, since the presented systems used different precision, and were not tested with a common benchmark.
A combination of the highly parallel connectivity of optics and the flexibility of silicon technology is used in constructing the optical neural chip [12]. Its authors have designed two devices: an analog optical chip with on-chip weight storage (for 128 fully connected neurons) and on-chip learning; and the artificial retina chip used to concurrently sense and process images. These chips are used for a wide range of high-speed pre-processing operations for image compression and character recognition.
Significance of Neurohardware
In Table 2, neurohardware is illustrated by three examples from each of the mentioned categories (processor arrays, neuroaccelerators and special-purpose neurocomputers). The table gives product name, underlying hardware technology, software/parallel techniques, chip capacity in terms of the number of neurons/connections packed per chip, and performance figures.
General-purpose neurohardware appears to offer an optimal solution, achieving both efficiency and flexibility at an acceptable price. Especially popular are neuroaccelerators that combine accelerated speed with a user-friendly PC environment. This class of neurohardware is now mature and widely available.
Special-purpose neurohardware demonstrates the best performance (for recent achievements see IEEE Micro, June 1994). A combined use of digital and analog technologies yields neurochips that perform more than a Giga CUPS in the learning phase, and more than a Tera CPS in the recall phase. And these are not the final figures. Paradoxically though, the immaturity of artificial neural network research, where existing models are often modified, and new ones rapidly developed, constitutes one of the major difficulties in the development of dedicated neurohardware.
Discussion
The chronology of parallel neural simulations shows that initial implementations were carried out on general-purpose parallel machines, then on general-purpose neurocomputers, and finally on specially developed neurohardware. The performance figures clearly show how each generation of neuroimplementations has brought significant improvements in performance.
Simulations on general-purpose parallel computers were mostly done in the late eighties. This approach is characterized by the search for appropriate parallelization techniques and optimal hardware architectures. Since the neuroparadigm is most often interpreted by matrix-vector operations, the best results are obtained on data-parallel machines. Even control-parallel architectures are programmed in a data-parallel style. However, there still remain the problems of the flexible restructuring of a neuroimplementation, the mapping of the neural network topology to the hardware topology, and efficient modelling of sparsely connected neural networks.
With a versatile architecture that emulates elementary operations of neural paradigms, general-purpose neurocomputers offer an almost optimal solution for neurosimulations. Because of their low price and wide availability, acceleration boards are certainly the most popular neuroupgrades. For more advanced neuroprojects, highly parallel processor arrays offer even better performance. However, these architectures often lack flexibility and scalability.
The best performance is achieved with special-purpose neurocomputers that implement a particular neural model directly in silicon. Here, state-of-the-art digital and analog technologies are competing with one another, the former being more flexible and accurate, and the latter faster and inherently closer to neural paradigms. It seems likely that in the years to come silicon technology will continue to dominate, with more and more neurons being packed on a chip and with performance being measured in Tera CPS. In the long term, optical technology will make its mark with its massively parallel and highly efficient processing on a micro scale. The neurocomputer of the future may consist of a number of modular components ranging from "conventional" hardware to highly specialized silicon, optical and molecular devices.
Neural computation means organizing processing into a number of processing elements that are massively interconnected and exchange signals. Processing within elements usually involves adding weighted input values, applying a (non-)linear function to the input sum, and forwarding the result to other elements. Since the basic principle of neurocomputation is learning by examples, such processing has to be repeated again and again, with weights being changed until a network learns the problem.
An artificial neural network can be implemented either as a simulation programmed on a general-purpose computer, or as an emulation realized on special-purpose hardware. Although sequential simulations are widespread and offer comfortable software environments for the development and analysis of neural networks, the computational needs of realistic applications exceed the capabilities of sequential computers.
Parallelization was therefore imperative in the effort to cope with the high computational and communication demands of neuroapplications. As matrix-vector operations are at the core of many neuroalgorithms, processing is often organized in such a way as to ensure their efficient implementation. The first implementations were carried out on general-purpose parallel machines. When they came close to the performance limits of standard supercomputers, the focus shifted to architectural improvements. One approach was to build general-purpose programmable neurohardware; another was to construct special-purpose neurohardware that emulates a particular neuromodel. This article discusses techniques and means for parallelizing neurosimulations, both at a high programming level and at a low hardware-emulation level.
